Origify
Overview
- Plagiarism Detection Software for Enhancing Academic Integrity in State Education
The project aims to develop a software solution to detect plagiarism in exam papers, ensuring authenticity, transparency, and quality in the educational system. By identifying and addressing instances of academic dishonesty, the software will contribute to the overall improvement of the state education system, fostering a culture of integrity, openness, and excellence.
Motives
- Authenticity, transparency, and quality
- Personal experience with the education system
Objective
- Fighting corruption in the state educational system
Architecture
Data Processing
Document Text Preprocessing:
- Tokenization: Breaking the text into individual tokens (words, punctuation, etc.) using nltk.word_tokenize.
- Stop Word Removal: Removing commonly occurring words that carry little meaning (e.g., "the", "a", "is") using NLTK's stopwords corpus.
- Stemming or Lemmatization: Reducing words to their base or root form using techniques like the PorterStemmer from NLTK.
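The preprocessing steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `preprocess`, its injectable `tokenize` and `stop_words` parameters, and the choice of the "english" stop word list are all assumptions made for the example. PorterStemmer is designed for English, so exam papers in other languages would likely need a different stemmer.

```python
from typing import Callable, Iterable, List, Optional

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize


def preprocess(
    text: str,
    tokenize: Optional[Callable[[str], List[str]]] = None,
    stop_words: Optional[Iterable[str]] = None,
) -> List[str]:
    """Tokenize, drop stop words and non-alphanumeric tokens, then stem.

    `tokenize` and `stop_words` are hypothetical hooks for this sketch.
    The defaults rely on NLTK data packages (tokenizer models and the
    "stopwords" corpus) that must be installed once via nltk.download().
    """
    tokenize = tokenize or word_tokenize
    if stop_words is None:
        # Assumption: English stop words; swap in the exam language as needed.
        stop_set = set(stopwords.words("english"))
    else:
        stop_set = set(stop_words)

    stemmer = PorterStemmer()
    tokens = tokenize(text.lower())
    # Keep alphanumeric tokens that are not stop words, reduced to their stem.
    return [
        stemmer.stem(tok)
        for tok in tokens
        if tok.isalnum() and tok not in stop_set
    ]


# Usage with a simple whitespace tokenizer and a tiny stop word set,
# so the example runs without downloading any NLTK data.
sample = preprocess(
    "The students are answers exams !",
    tokenize=str.split,
    stop_words={"the", "are"},
)
print(sample)
```

Passing the tokenizer and stop word list as arguments keeps the function easy to test and makes it simple to switch languages later.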
Document Vectorization:
- TF-IDF (Term Frequency-Inverse Document Frequency) Vectorization: Converting the preprocessed text into numerical vectors representing the importance of each word in the document and the corpus using sklearn.feature_extraction.text.TfidfVectorizer.
- Term Frequency (in its common form, the share of a document's terms that are a given term):
$$\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$
Where: $f_{t,d}$ is the number of times term $t$ occurs in document $d$.
- Inverse Document Frequency:
$$\mathrm{IDF}(t, D) = \log\frac{N}{|\{d \in D : t \in d\}|}$$
Where: $N$ is the total number of documents in the corpus. $|\{d \in D : t \in d\}|$ is the number of documents containing term $t$ in the corpus.
- TF-IDF is calculated as the product of TF and IDF:
$$\mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)$$
Where: $t$ is the term. $d$ is the document. $D$ is the corpus (collection of documents).
This TF-IDF value represents the importance of term $t$ in document $d$ relative to the entire corpus $D$. Higher values indicate that the term is more important or relevant to the document.
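As a rough sketch of the vectorization step, the snippet below fits a TfidfVectorizer on a tiny made-up corpus. The corpus and the helper name `build_tfidf` are hypothetical. Note that scikit-learn's defaults differ slightly from the textbook formula: it uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and L2-normalizes each document vector.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for preprocessed exam texts.
corpus = [
    "solve the quadratic equation",
    "solve the linear equation",
    "describe the water cycle",
]


def build_tfidf(documents):
    """Fit a TF-IDF vectorizer and return it with the document-term matrix."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(documents)  # sparse, one row per document
    return vectorizer, matrix


vectorizer, matrix = build_tfidf(corpus)

# One row per document, one column per distinct term in the corpus.
print(matrix.shape)

# Terms that appear in every document get the lowest IDF weight.
idf_the = vectorizer.idf_[vectorizer.vocabulary_["the"]]
idf_quadratic = vectorizer.idf_[vectorizer.vocabulary_["quadratic"]]
print(idf_the, idf_quadratic)
```

The fitted vectorizer has to be kept around (or saved) so that new query documents are projected into the same vocabulary as the corpus.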
Similarity Calculation:
- Cosine Similarity: Calculating the cosine similarity between the TF-IDF vectors of the query document and each document in the corpus using sklearn.metrics.pairwise.cosine_similarity. This measures the cosine of the angle between the vectors, providing a numerical score representing their similarity.
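The comparison step might look something like the sketch below, which scores a query document against every corpus document and ranks the results. The function name `rank_by_similarity` and the sample texts are assumptions for illustration. A real pipeline would reuse an already fitted vectorizer rather than refitting on every call.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_by_similarity(query, corpus):
    """Return (corpus_index, score) pairs sorted from most to least similar."""
    vectorizer = TfidfVectorizer()
    corpus_vectors = vectorizer.fit_transform(corpus)
    # The query must be transformed with the vocabulary learned from the corpus.
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, corpus_vectors)[0]
    pairs = [(index, float(score)) for index, score in enumerate(scores)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical corpus and query for demonstration only.
corpus = [
    "solve the quadratic equation",
    "solve the linear equation",
    "describe the water cycle",
]
ranked = rank_by_similarity("solve the quadratic equation", corpus)
for index, score in ranked:
    print(index, round(score, 3))
```

Because TF-IDF weights are non-negative, the scores fall between 0 (no shared terms) and 1 (identical term distribution), so a threshold on the top score is one simple way to flag a likely match.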
TODOs
- Logo
- UI & UX design
- Read up on TF-IDF
- Save & Load Model
- Convert to Text
- Converting image to text (OCR)
- Converting PDF to text
- Data scraping
- Ency
- Degrees
- ~ CEM
- BEM
- BAC
- Dzexams (this was easy)
- Set up a sub-process to remove duplicates between the two data sources (Ency & Dzexams)
- Plain-Text Extraction
- Extract and Save Texts
- Stream Text Extraction to Preprocessing & Vectorization
- Preprocess data
- Save preprocessed text
- Vectorize data using TF-IDF
- Set up database
- Set up FastAPI /compare endpoint authentication
- Set up a process for adding exam data manually: /upload (maybe) endpoint
- + checking whether the uploaded data (exams) is already available
- Set up an HTTP endpoint to run the data mining process (admin)
- Wait-listing system
- Payment Gateway Integration
- Stripe
- CIB/EDAHABIA
Features
- User chooses the subject of the submitted paper to reduce resource usage
- Sharing option: a signed, official, interactive link/doc/badge and image, like Spotify
- Teacher scoring system (the user types in the exam provider's information)
Developer Tech-Stack
- Frontend
- React Native
- Logsnag - Events Tracking
- Sentry - Performance Monitoring & Error Tracking
- https://react-native-document-scanner.js.org/ -- Camera Vision
- https://lucide.dev/ - icons & fonts
- Backend
- Python
- HTTPX
- selectolax
- playwright
- FastAPI
- sklearn
- nltk
- Betterstack
Data Sources
- dzexams.com
- ency-education.com
Design Prototypes
Wireframe
Mobile User-flow
Marketing
- Social Media - [[Origify Beta Short Form Ad - 1]]
- TikTok
- Facebook Pages/Groups (education related)
- Public Posters
- DMs: direct-message students & teachers
- Emails: email students & teachers (mimic Arc browser copywriting)
Monetization
- Limited requests per week; buy credits to use more (credits system)
Papers
- A paper that documents this project's problem, walkthrough, motives, and techniques (Completely inspired paper)
- How I arrived at the conclusion that the Tamazight language is in the process of dying, based on statistics on the availability of educational exam papers online (TODO: replace dying word)