Origify

Source

Overview

The project aims to develop software that detects plagiarism in exam papers, supporting authenticity, transparency, and quality in the educational system. By identifying and addressing instances of academic dishonesty, the software is intended to help improve the state education system and foster a culture of integrity, openness, and excellence.

Motives

Objective

Architecture

Data Processing

  1. Document Text Preprocessing:

    • Tokenization: Breaking the text into individual tokens (words, punctuation, etc.) using nltk.word_tokenize.
    • Stop Word Removal: Removing commonly occurring words that carry little meaning (e.g., "the", "a", "is") using NLTK's stopwords corpus.
    • Stemming or Lemmatization: Reducing words to their base or root form using techniques like the PorterStemmer from NLTK.
  2. Document Vectorization:

    • TF-IDF (Term Frequency-Inverse Document Frequency) Vectorization: Converting the preprocessed text into numerical vectors representing the importance of each word in the document and the corpus using sklearn.feature_extraction.text.TfidfVectorizer.
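    1. Term Frequency: $\mathrm{TF}(t, d) = \dfrac{f_{t,d}}{\sum_{t'} f_{t',d}}$, where $f_{t,d}$ is the number of times term $t$ occurs in document $d$.
    2. Inverse Document Frequency: $\mathrm{IDF}(t, D) = \log \dfrac{N}{n_t}$ Where:
      • $N$ is the total number of documents in the corpus.
      • $n_t$ is the number of documents in the corpus that contain term $t$.
    3. TF-IDF is calculated as the product of TF and IDF: $\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)$ Where:
      • $t$ is the term.
      • $d$ is the document.
      • $D$ is the corpus (collection of documents).

    This TF-IDF value represents the importance of term $t$ in document $d$ relative to the entire corpus $D$. Higher values indicate that the term is more important or relevant to the document.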

  3. Similarity Calculation:

    • Cosine Similarity: Calculating the cosine similarity between the TF-IDF vectors of the query document and each document in the corpus using sklearn.metrics.pairwise.cosine_similarity. This measures the cosine of the angle between the vectors, providing a numerical score representing their similarity.
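
The three steps above can be sketched end to end in plain Python. This is a minimal illustration rather than the project's actual code: it assumes a regex tokenizer and a tiny hard-coded stop-word set as stand-ins for nltk.word_tokenize and NLTK's stopwords corpus, skips stemming, and uses the textbook TF and log(N / n_t) IDF definitions. The names preprocess, tfidf_vectors, and the sample docs are hypothetical. scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalization by default, so its weights will differ from the ones computed here.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption: stand-in for NLTK's stopwords corpus).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "in", "to"}


def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def term_frequency(tokens):
    """TF: occurrences of each term divided by the document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}


def inverse_document_frequency(corpus_tokens):
    """IDF: log(N / n_t), where n_t is the number of documents containing the term."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter(term for tokens in corpus_tokens for term in set(tokens))
    return {term: math.log(n_docs / n_t) for term, n_t in doc_freq.items()}


def tfidf_vectors(corpus_tokens):
    """Build one sparse TF-IDF vector (a dict of term -> weight) per document."""
    idf = inverse_document_frequency(corpus_tokens)
    return [
        {term: tf * idf[term] for term, tf in term_frequency(tokens).items()}
        for tokens in corpus_tokens
    ]


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample corpus: documents 0 and 1 reuse the same content words.
docs = [
    "The cell is the basic unit of life.",
    "The basic unit of life is the cell.",
    "Photosynthesis converts light into chemical energy.",
]
vectors = tfidf_vectors([preprocess(d) for d in docs])
scores = [cosine_similarity(vectors[0], v) for v in vectors[1:]]
```

Here scores holds the similarity of the first document to each of the others. A reordered copy scores close to 1, while a document sharing no terms scores 0. Note that with the unsmoothed IDF, a term appearing in every document gets weight 0, which is one reason library implementations smooth it.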

TODOs

Features

Developer Tech-Stack

Data Sources

  • dzexams.com
  • ency-education.com

Design Prototypes

Wireframe

Mobile User-flow

Marketing

Monetization

Papers