Cross-Language Plagiarism Detection Tool: Guidelines And Project Description

Guidelines for Progress Report and Final Presentation

This application takes a text file as input from the user. The file may be written in English or Hindi, and users provide it as a Unicode file. To achieve this, we create a “Hindi representation” of the sentence in English. The application then searches the internet for similar files and returns the results that are relevant to the uploaded text file.
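As a minimal sketch of the input step, the language of an uploaded file can be guessed from the Unicode script of its letters. The function names, the UTF-8 encoding, and the 80% threshold below are illustrative assumptions, not part of the project specification.

```python
def read_unicode_file(path, encoding="utf-8"):
    """Read the uploaded text file; UTF-8 is an assumed encoding."""
    with open(path, encoding=encoding) as handle:
        return handle.read()


def detect_language(text, threshold=0.8):
    """Guess 'hindi', 'english', 'mixed' or 'unknown' from letter scripts."""
    devanagari = 0
    latin = 0
    for ch in text:
        if "\u0900" <= ch <= "\u097F":        # Devanagari Unicode block
            devanagari += 1
        elif ch.isascii() and ch.isalpha():   # basic Latin letters
            latin += 1
    total = devanagari + latin
    if total == 0:
        return "unknown"
    share = devanagari / total
    if share >= threshold:
        return "hindi"
    if share <= 1 - threshold:
        return "english"
    return "mixed"
```

A file that is classified as "hindi" would be routed to the translation step, while an "english" file could skip it.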

We went through many articles on the internet related to the development of a cross-language plagiarism detection tool. An article [1] on Stack Overflow suggested that we could develop this application in Python using the NLTK and Gensim libraries, by building an LDA (Latent Dirichlet Allocation) or LSA (Latent Semantic Analysis) model of the document. We can then use the Google Search API to search for the resulting words. NLTK [2] is the Natural Language Toolkit for natural language processing. The toolkit supports libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
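The idea of reducing a document to a few characteristic words before searching can be sketched with plain TF-IDF weighting. This is a simpler stand-in for the LDA or LSA models mentioned above, not an implementation of them. The function names and the small reference collection are hypothetical, and the tokenizer only handles English text (that is, text after translation).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(document, collection, k=5):
    """Return the k terms of `document` with the highest TF-IDF score."""
    doc_tokens = tokenize(document)
    if not doc_tokens:
        return []
    term_freq = Counter(doc_tokens)
    all_docs = list(collection) + [document]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for text in all_docs:
        doc_freq.update(set(tokenize(text)))
    scores = {
        term: (count / len(doc_tokens)) * math.log(len(all_docs) / doc_freq[term])
        for term, count in term_freq.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]
```

The returned terms could be joined with spaces to form the search query. Words that occur in every document of the collection score zero, so common words drop out without an explicit stop list.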

In [5], Chow et al. discuss semantic plagiarism. Semantic plagiarism occurs when a sentence is reconstructed or some of its terms are replaced with synonyms. Both cross-language and semantic plagiarism are hard to detect because the fingerprints of the plagiarised text and the source text differ. The plagiarism detection tools that are currently available are not capable of detecting such cases.
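The fingerprint problem can be illustrated with a small sketch. It assumes a common fingerprinting scheme, hashed word trigrams compared by Jaccard overlap; the source does not say which scheme Chow et al. have in mind, so the scheme and the function names are assumptions.

```python
import hashlib
import re


def fingerprints(text, n=3):
    """Return the set of hashed word n-grams of the text."""
    words = re.findall(r"[a-z]+", text.lower())
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {hashlib.md5(gram.encode("utf-8")).hexdigest() for gram in grams}


def jaccard(a, b):
    """Share of fingerprints common to both sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


original = "the quick brown fox jumps over the lazy dog"
paraphrase = "the fast brown fox leaps over the idle dog"

exact_score = jaccard(fingerprints(original), fingerprints(original))
synonym_score = jaccard(fingerprints(original), fingerprints(paraphrase))
```

Replacing three words with synonyms changes every trigram in this example, so the paraphrase shares almost no fingerprints with the original even though the meaning is unchanged.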

Chow et al. [5] propose a new approach for detecting both cross-language and semantic plagiarism. The query document is first shortened using a fuzzy swarm-based summarisation approach, so that the summary gives the most important keywords in the document. The summarised input documents are translated into English using the Google Translate Application Programming Interface (API); the words are then stemmed and the stop words are removed. The tokenized documents are sent to the Google AJAX Search API to search for similar documents across the World Wide Web. The Stanford Parser and WordNet are used to determine the semantic similarity between the suspected documents and the source documents. The Stanford Parser assigns each term in a sentence to its corresponding role, such as noun, verb, or adjective. Each sentence is then represented in predicate form, and similarity is measured on those predicates using information from the WordNet taxonomy. The testing dataset is built from two sets of input documents, which are produced using different plagiarism techniques.
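The effect of using synonym information in the comparison can be sketched in a few lines. The sketch below is not the predicate-based method of Chow et al. and does not call the Stanford Parser or WordNet; it uses a tiny hand-made synonym table (a hypothetical stand-in for WordNet) and compares plain word sets.

```python
import re

# Toy synonym table mapping each word to one canonical form.
# Hypothetical stand-in for a WordNet lookup; not real WordNet data.
TOY_SYNONYMS = {
    "fast": "quick",
    "rapid": "quick",
    "leaps": "jumps",
    "idle": "lazy",
}


def normalise(text):
    """Return the set of words with synonyms mapped to a canonical form."""
    words = re.findall(r"[a-z]+", text.lower())
    return {TOY_SYNONYMS.get(word, word) for word in words}


def semantic_overlap(a, b):
    """Jaccard overlap of the synonym-normalised word sets."""
    set_a, set_b = normalise(a), normalise(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0
```

With this normalisation, "the fast brown fox leaps over the idle dog" and "the quick brown fox jumps over the lazy dog" map to the same word set, which is the kind of match a purely lexical comparison misses.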

Bird et al. [3] cover the scope of using the NLTK toolkit for natural language processing. We are considering a methodology in which a Token class represents a unit of text such as a word, a sentence, or a piece of a document. Kuhn et al. [4] describe the application of semantic classification trees to natural language understanding. Their paper covers speech understanding, semantic classification, machine learning, natural language, and decision-tree-based capabilities for a translator application.

These papers discuss speech classification, machine-learning-based training of artificial neural networks, decision trees, tokenization, and several other methods.

In [6], Ferrero et al. discuss different state-of-the-art methods for detecting plagiarism. The methods used in their experiments include Cross-Language Character N-Gram (CL-CnG), Cross-Language Conceptual Thesaurus-based Similarity (CL-CTS), Cross-Language Alignment-based Similarity Analysis (CL-ASA), Cross-Language Explicit Semantic Analysis (CL-ESA), and Translation + Monolingual Analysis (T+MA). According to the authors, each method behaves consistently across different language pairs. There is a strong correlation not only across languages but also across the text units that were considered. If a method is efficient on a particular language pair, it will be similarly efficient on another language pair, as long as enough lexical resources are available for these languages. There was also a strong correlation across types of text when the authors investigated the behaviour of the methods on different types of text for a particular language pair. They found that a method could be optimized on one collection of texts and applied efficiently to another collection. Finally, they concluded that methods behave differently in clustering matched and mismatched units, even if they seem similar in performance.
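Of the methods listed, CL-CnG is the simplest to sketch: both texts are reduced to character n-gram counts and compared by cosine similarity. The version below is a minimal illustration with assumed preprocessing (lowercasing, removing diacritics and punctuation, n = 3). It only keeps Latin letters and digits, so it suits language pairs that share the Latin script; an English-Hindi pair would presumably need transliteration first.

```python
import math
import re
import unicodedata
from collections import Counter


def char_ngrams(text, n=3):
    """Count the character n-grams of the text after light normalisation."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop combining marks so that accented letters match their base letter.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9]+", " ", stripped).strip()
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[gram] * b[gram] for gram in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


english = char_ngrams("plagiarism detection")
spanish = char_ngrams("detección de plagio")
unrelated = char_ngrams("xyz qqq")
```

The English and Spanish phrases share trigrams such as "pla", "lag" and "det", so their similarity is above zero, while the unrelated string shares none.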

Project Description

The project activities are shown below (Barrón-Cedeño, Gupta and Rosso, 2013).

Developing a cross-language plagiarism detection tool

   User management

   Document management

   Translation of input documents

      Translate the plagiarized Hindi documents into English

      Improve the effectiveness of the detection process

      Use Google Translate API

   Removing Stop Words

      Before passing the translated documents for comparison through the Internet

      Remove the stop words in the translated text

   Stemming Words

      Remove the affixes

      Generate root word

      Pattern matching

      Text Stemmer and Porter Stemmer

      Use of Porter Stemming algorithm

      Removing the commoner morphological and inflexional endings from words in English

   Identifying Similar Documents

      Collection of documents located around the World Wide Web

      Enable translation of small, characteristic fragments

      Query documents or texts are inserted

      Use of Google AJAX Search API

   Comparison of Similar Pattern

      Detect plagiarism

      Represent the sentence uniquely.

   Summary of the Result

      Gathering the result

      Plagiarism detection is displayed

      Highlight the similarities between the two files.
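The stop-word removal and stemming activities listed above can be sketched as follows. The stop-word list is a short illustrative sample, and the stemmer is a crude suffix stripper, not the Porter algorithm; a real build would more likely call an existing Porter implementation. All names are hypothetical.

```python
import re

# Short illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {
    "a", "an", "the", "is", "are", "was", "were", "of", "in", "on",
    "to", "and", "or", "for", "with", "by", "that", "this", "it",
}

# Suffixes are tried in this order, longest first.
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")


def stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, and stem the remaining words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(word) for word in words if word not in STOP_WORDS]
```

For example, `preprocess("The students were copying the translated documents")` reduces the sentence to four stems that can be passed on to the search step.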

Resource Name | Max. Units | Std. Rate | Accrue At | Base Calendar
--- | --- | --- | --- | ---
Project Manager | | | |
System Analyst | | | |
Technical Writer | | | |
Code Designer | | | |

The overall project activities are shown below (Chauhan, Arora and Singhal, 2017).

Task Name | Duration | Start | Finish | Resource Names
--- | --- | --- | --- | ---
Developing a cross-language plagiarism detection tool | 60 days | Wed 9/12/18 | Tue 12/4/18 |
   User management | 1 day | Wed 9/12/18 | Wed 9/12/18 | Designer, Developer
   Document management | 2 days | Thu 9/13/18 | Fri 9/14/18 | Designer, Project Manager, Technical Writer
   Translation of input documents | 8 days | Mon 9/17/18 | Wed 9/26/18 |
      Translate the plagiarized Hindi documents into English | 2 days | Mon 9/17/18 | Tue 9/18/18 | Code Designer, Developer, System Analyst
      Improve the effectiveness of the detection process | 3 days | Wed 9/19/18 | Fri 9/21/18 |
      Use Google Translate API | 3 days | Mon 9/24/18 | Wed 9/26/18 | Code Designer, Designer
   Removing Stop Words | 5 days | Thu 9/27/18 | Wed 10/3/18 |
      Before passing the translated documents for comparison through the Internet | 2 days | Thu 9/27/18 | Fri 9/28/18 | Designer, System Analyst
      Remove the stop words in the translated text | 3 days | Mon 10/1/18 | Wed 10/3/18 | Developer, Code Designer
   Stemming Words | 15 days | Thu 10/4/18 | Wed 10/24/18 |
      Remove the affixes | 3 days | Thu 10/4/18 | Mon 10/8/18 |
      Generate root word | 4 days | Tue 10/9/18 | Fri 10/12/18 | System Analyst
      Pattern matching | 2 days | Mon 10/15/18 | Tue 10/16/18 |
      Text Stemmer and Porter Stemmer | 2 days | Wed 10/17/18 | Thu 10/18/18 |
      Use of Porter Stemming algorithm | 2 days | Fri 10/19/18 | Mon 10/22/18 |
      Removing the commoner morphological and inflexional endings from words in English | 2 days | Tue 10/23/18 | Wed 10/24/18 | Developer, System Analyst
   Identifying Similar Documents | 10 days | Thu 10/25/18 | Wed 11/7/18 |
      Collection of documents located around the World Wide Web | 2 days | Thu 10/25/18 | Fri 10/26/18 | System Analyst
      Enables small and characteristic fragments translation | 3 days | Thu 10/25/18 | Mon 10/29/18 |
      Query documents or texts are inserted | 3 days | Tue 10/30/18 | Thu 11/1/18 | System Analyst, Technical Writer
      Use of Google AJAX Search API | 4 days | Fri 11/2/18 | Wed 11/7/18 | Code Designer, Developer
   Comparison of Similar Pattern | 10 days | Thu 11/8/18 | Wed 11/21/18 |
      Detect plagiarism | 4 days | Tue 11/13/18 | Fri 11/16/18 | Code Designer, Project Manager, System Analyst
      Represent the sentence uniquely. | 3 days | Mon 11/19/18 | Wed 11/21/18 | System Analyst, Technical Writer
   Summary of the Result | 9 days | Thu 11/22/18 | Tue 12/4/18 |
      Gathering the result | 2 days | Thu 11/22/18 | Fri 11/23/18 | Project Manager, System Analyst
      Plagiarism detection is displayed | 3 days | Mon 11/26/18 | Wed 11/28/18 | Designer, Developer, Project Manager
      Highlight the similarities between the two files. | 4 days | Thu 11/29/18 | Tue 12/4/18 | Code Designer, Developer
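The durations in the schedule appear to count working days from the start date to the finish date, inclusive. The sketch below checks that reading under the assumption of a Monday-to-Friday calendar with no holidays; the function name is hypothetical.

```python
from datetime import date, timedelta


def working_days(start, finish):
    """Count Monday-to-Friday days from start to finish, inclusive."""
    days = 0
    current = start
    while current <= finish:
        if current.weekday() < 5:   # 0 = Monday ... 4 = Friday
            days += 1
        current += timedelta(days=1)
    return days


whole_project = working_days(date(2018, 9, 12), date(2018, 12, 4))
stemming_words = working_days(date(2018, 10, 4), date(2018, 10, 24))
summary_of_result = working_days(date(2018, 11, 22), date(2018, 12, 4))
```

Under that assumption the three values come out as 60, 15 and 9, which matches the durations given for the whole project, "Stemming Words" and "Summary of the Result".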

Project charter is shown below.

Resource cost status is shown below (Ehsan and Shakery, 2016).

Project activities cost is shown below.

Task Name | Fixed Cost | Actual Cost | Remaining Cost | Baseline Cost | Cost Variance
--- | --- | --- | --- | --- | ---
Developing a cross-language plagiarism detection tool | | | | |

MS Project file is attached here.

Plagiarism is becoming a difficult issue for the scholarly community. Detecting plagiarism at different levels is an important problem. The complexity increases when we try to detect plagiarism in source code, which may be written in the same language or may have been converted into a different language (Franco-Salvador et al., 2016). This kind of plagiarism is found not only in academic work but also in industries that deal with software design. The main difficulty with source code plagiarism is that different programming languages may have different syntactic structures.

Based on the language homogeneity or heterogeneity of the texts being compared, plagiarism detection can be classified as monolingual or cross-lingual. The cross-language plagiarism detection process is similar to the external plagiarism detection task, with a few modifications in the heuristic retrieval and detailed analysis stages (Gelbukh, 2009). The cross-language heuristic retrieval stage aims to retrieve the collection of candidate source documents from the data set. Translating the input document from the query language into the source language may be required in this stage. The cross-language detailed analysis stage measures the cross-language similarity between sections of the suspicious document and sections of the candidate documents retrieved in the previous stage (Kashkur, Parshutin and Borisov, 2010).
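The two stages can be sketched as follows, assuming the suspicious document has already been translated into the language of the sources. Vocabulary overlap is used as the retrieval heuristic and `difflib.SequenceMatcher` as the detailed comparison; both are illustrative choices, and the function names and sample documents are hypothetical.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    """Lowercase the text and return its runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def retrieve_candidates(suspicious, sources, k=2):
    """Heuristic retrieval: rank sources by vocabulary overlap, keep top k."""
    query = set(tokens(suspicious))
    scored = []
    for name, text in sources.items():
        vocab = set(tokens(text))
        union = query | vocab
        score = len(query & vocab) / len(union) if union else 0.0
        scored.append((score, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:k] if score > 0]


def matching_passages(suspicious, source, min_words=4):
    """Detailed analysis: word runs of at least min_words found in both."""
    a, b = tokens(suspicious), tokens(source)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        " ".join(a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_words
    ]


sources = {
    "doc1": "Plagiarism detection compares a suspicious document "
            "with source documents on the web",
    "doc2": "Bananas are rich in potassium and taste sweet",
}
suspicious = ("Our tool compares a suspicious document "
              "with source documents found online")
candidates = retrieve_candidates(suspicious, sources)
passages = matching_passages(suspicious, sources["doc1"])
```

The passages returned by the second stage are also the spans that a results page could highlight in both files.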

Language used: JavaScript.

The software design for the cross-language plagiarism detection tool is illustrated below (Kasprowicz and Wada, 2014).

If you type any text, it changes from English to Hindi (Potthast et al., 2010).


Once the project is complete, we hope to be able to find plagiarism related to any article on the web, given an input file to our application.


[1] ‘How to develop a plagiarism detector?’, extracted on 11 August 2018.

[2] NLTK 3.3 documentation for the Natural Language Toolkit, extracted on 11 August 2018.

[3] Steven Bird, Edward Loper. NLTK: The Natural Language Toolkit.

[4] Roland Kuhn, Renato De Mori. The Application of Semantic Classification Trees to Natural Language Understanding.

[5] Chow Kok Kent, Naomie Salim. Web Based Cross Language Semantic Plagiarism Detection, 03 January 2012.

[6] Jeremy Ferrero, Laurent Besacier, Didier Schwab, Frederic Agnes. Deep Investigation of Cross-Language Plagiarism Detection Methods.

Barrón-Cedeño, A., Gupta, P. and Rosso, P. (2013). Methods for cross-language plagiarism detection. Knowledge-Based Systems, 50, pp.211-217.

Chauhan, S., Arora, A. and Singhal, Y. (2017). Plagiarism Detection of C Program using Assembly Language. International Journal of Computer Applications, 158(3), pp.17-22.

Ehsan, N. and Shakery, A. (2016). Candidate document retrieval for cross-lingual plagiarism detection using two-level proximity information. Information Processing & Management, 52(6), pp.1004-1017.

Franco-Salvador, M., Gupta, P., Rosso, P. and Banchs, R. (2016). Cross-language plagiarism detection over continuous-space- and knowledge graph-based representations of language. Knowledge-Based Systems, 111, pp.87-99.

Gelbukh, A. (2009). Computational Linguistics and Intelligent Text Processing. Heidelberg: Springer.

Kashkur, M., Parshutin, S. and Borisov, A. (2010). Research into Plagiarism Cases and Plagiarism Detection Methods. Scientific Journal of Riga Technical University. Computer Sciences, 42(1).

Kasprowicz, D. and Wada, H. (2014). Methods for automated detection of plagiarism in integrated-circuit layouts. Microelectronics Journal, 45(9), pp.1212-1219.

Lee, Y. (2012). Plagiarism Detection among Source Codes using Adaptive Methods. KSII Transactions on Internet and Information Systems.

Methods for Intrinsic Plagiarism Detection. (2017). Informatics and Applications.

Potthast, M., Barrón-Cedeño, A., Stein, B. and Rosso, P. (2010). Cross-language plagiarism detection. Language Resources and Evaluation, 45(1), pp.45-62.
