Cross-Language Plagiarism Detection Tool: Guidelines And Project Description
Guidelines for Progress Report and Final Presentation
This application takes a text file as input from the user. The file may contain text in English or Hindi and is provided as a Unicode file. The application then searches the internet for similar files and returns the results that are relevant to the uploaded text file. To achieve this, we create a “Hindi representation” of the sentence in English.
We reviewed many online articles related to the development of a cross-language plagiarism detection tool. An article [1] on Stack Overflow suggested that we could develop this application in Python using the NLTK and Gensim libraries, by building an LDA or LSA model of the document. We can then use the Google Search API to search for the resulting terms. NLTK [2] is the Natural Language Toolkit for natural language processing. It provides libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning, among others.
In [5], Chow et al. discuss semantic plagiarism, in which a sentence is restructured or some terms are replaced by their synonyms. Both cross-language and semantic plagiarism are hard to detect because the fingerprints of the plagiarised text and the source text differ. The plagiarism detection tools currently available are not capable of detecting such cases.
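To illustrate why synonym substitution defeats fingerprint comparison, here is a minimal sketch. It assumes a fingerprint is the set of hashed word 3-grams of a text, which is one common scheme and not necessarily the one used in [5]. The function names and sample sentences are hypothetical.

```python
import hashlib


def fingerprint(text, n=3):
    """Return the set of hashed word n-grams (shingles) of a text."""
    words = text.lower().split()
    shingles = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {hashlib.sha256(s.encode("utf-8")).hexdigest()[:8] for s in shingles}


def jaccard(a, b):
    """Jaccard overlap of two fingerprint sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


# Assumed sample sentences: the second replaces three words with synonyms.
original = "the student copied the large report from the website"
reworded = "the pupil copied the big report from the site"

same = jaccard(fingerprint(original), fingerprint(original))
changed = jaccard(fingerprint(original), fingerprint(reworded))
```

Here only one of the 13 distinct 3-grams survives the rewording, so the overlap drops from 1.0 to about 0.08 even though the meaning is unchanged.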
Chow et al. [5] propose a new approach for detecting both cross-language and semantic plagiarism. The query document is first shortened using a fuzzy swarm-based summarisation approach; the summary gives the most important keywords in the document. The summarised input documents are translated into English using the Google Translate Application Programming Interface (API) before the words are stemmed and the stop words are removed. The tokenized documents are sent to the Google AJAX Search API to search for similar documents throughout the World Wide Web. The Stanford Parser and WordNet are used to determine the semantic similarity between the suspected documents and the source documents. The Stanford Parser assigns each term in a sentence its corresponding role, such as noun, verb or adjective. Each sentence is then represented in predicate form, and similarity is measured on those predicates using information from the WordNet taxonomy. The testing dataset is built from two sets of input documents produced using different plagiarism techniques.
Bird et al. [3] cover the scope of using the NLTK toolkit for natural language processing. We are considering a methodology in which a Token class represents a unit of text such as a word, a sentence or a piece of a document. Kuhn et al. [4] describe the application of semantic classification trees to natural language understanding. The paper covers speech understanding, semantic classification, machine learning, natural language and decision-tree-based capabilities for a translator application.
These papers discuss speech classification, machine-learning-based training of artificial neural networks, decision trees, tokenization and several other methods.
In [6], Ferrero et al. discuss different state-of-the-art methods for detecting plagiarism. The methods used in their experiments include Cross-Language Character N-Gram (CL-CnG), Cross-Language Conceptual Thesaurus-based Similarity (CL-CTS), Cross-Language Alignment-based Similarity Analysis (CL-ASA), Cross-Language Explicit Semantic Analysis (CL-ESA) and Translation + Monolingual Analysis (T+MA). According to the authors, each method behaves consistently across different language pairs. There is a strong correlation not only across languages but also across the text units considered. If a method is efficient on a particular language pair, it will be similarly efficient on another language pair as long as enough lexical resources are available for these languages. A strong correlation across types of text was also found when the behaviour of the methods was investigated on different types of text for a particular language pair. A method could be optimized on one collection of text and applied efficiently to another collection. Finally, it was concluded that methods behave differently in clustering matched and mismatched units, even if they seem similar in performance.
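As an illustration of the simplest of these methods, CL-CnG, the following sketch compares texts by the cosine similarity of their character 3-gram counts. It is a simplified version under stated assumptions, not the exact configuration evaluated in [6]. The function names and sample sentences are hypothetical, and an English/Spanish pair is used because the method relies on character sequences shared between the two languages.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams after lowercasing and replacing punctuation with spaces."""
    kept = "".join(c if c.isalnum() else " " for c in text.lower())
    normalized = " ".join(kept.split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(a, b):
    """Cosine similarity of two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Assumed sample texts: a translation pair and an unrelated sentence.
english = "automatic plagiarism detection in academic documents"
spanish = "detección automática de plagio en documentos académicos"
unrelated = "the weather was cold and rainy yesterday"

related_score = cosine(char_ngrams(english), char_ngrams(spanish))
unrelated_score = cosine(char_ngrams(english), char_ngrams(unrelated))
```

The translation pair shares n-grams such as "pla", "doc" and "det", so `related_score` comes out higher than `unrelated_score`. Whether this carries over to English and Hindi, which use different scripts, is not addressed by this sketch.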
Project Description
The project activities are shown below (Barrón-Cedeño, Gupta and Rosso, 2013).
Developing a cross-language plagiarism detection tool
  User management
  Document management
  Translation of input documents
    Translate the plagiarized Hindi documents into English
    Improve the effectiveness of the detection process
    Use Google Translate API
  Removing Stop Words
    Before passing the translated documents for comparison through the Internet
    Remove the stop words in the translated text
  Stemming Words
    Remove the affixes
    Generate root word
    Pattern matching
    Text Stemmer and Porter Stemmer
    Use of Porter Stemming algorithm
    Removing the commoner morphological and inflexional endings from words in English
  Identifying Similar Documents
    Collection of documents that are located around the World Wide Web
    Enables small and characteristic fragments translation
    Query documents or texts are inserted
    Use of Google AJAX Search API
  Comparison of Similar Pattern
    Detect plagiarism
    Represent the sentence uniquely.
  Summary of the Result
    Gathering the result
    Plagiarism detection is displayed
    Highlight the similarities between the two files.
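The stop-word removal and stemming activities listed above can be sketched as follows. This is a minimal illustration and not the project's implementation: the stop-word list is a tiny assumed sample, and `strip_suffix` is a crude suffix stripper standing in for the full Porter stemming algorithm. All names are hypothetical.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real list is much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "in", "on",
              "and", "or", "to", "for", "with", "by", "from"}

# Suffixes are tried in this order; a crude stand-in for the Porter rules.
SUFFIXES = ("ing", "ies", "ed", "es", "ly", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def strip_suffix(word):
    """Remove the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def preprocess(text):
    """Tokenize, remove stop words, then reduce each token to a rough stem."""
    return [strip_suffix(t) for t in remove_stop_words(tokenize(text))]


stems = preprocess("The students were copying reports from the websites")
# stems == ['student', 'copy', 'report', 'websit']
```

In practice a library stemmer would replace `strip_suffix`; the sketch only shows where each activity sits in the pipeline.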
Resource Name | Type | Initials | Max. Units | Std. Rate | Accrue At | Base Calendar
Project Manager | Work | P | 100% | $1,000.00/hr | Prorated | Standard
System Analyst | Work | S | 100% | $1,000.00/hr | Prorated | Standard
Developer | Work | D | 100% | $1,000.00/hr | Prorated | Standard
Designer | Work | D | 100% | $1,000.00/hr | Prorated | Standard
Technical Writer | Work | T | 100% | $1,000.00/hr | Prorated | Standard
Code Designer | Work | C | 100% | $1,000.00/hr | Prorated | Standard
The overall project activities are shown below (Chauhan, Arora and Singhal, 2017).
Task Name | Duration | Start | Finish | Predecessors | Resource Names
Developing a cross-language plagiarism detection tool | 60 days | Wed 9/12/18 | Tue 12/4/18 | |
User management | 1 day | Wed 9/12/18 | Wed 9/12/18 | | Designer, Developer
Document management | 2 days | Thu 9/13/18 | Fri 9/14/18 | 2 | Designer, Project Manager, Technical Writer
Translation of input documents | 8 days | Mon 9/17/18 | Wed 9/26/18 | 3 |
Translate the plagiarized Hindi documents into English | 2 days | Mon 9/17/18 | Tue 9/18/18 | | Code Designer, Developer, System Analyst
Improve the effectiveness of the detection process | 3 days | Wed 9/19/18 | Fri 9/21/18 | 5 | Developer
Use Google Translate API | 3 days | Mon 9/24/18 | Wed 9/26/18 | 6 | Code Designer, Designer
Removing Stop Words | 5 days | Thu 9/27/18 | Wed 10/3/18 | 4 |
Before passing the translated documents for comparison through the Internet | 2 days | Thu 9/27/18 | Fri 9/28/18 | | Designer, System Analyst
Remove the stop words in the translated text | 3 days | Mon 10/1/18 | Wed 10/3/18 | 9 | Developer, Code Designer
Stemming Words | 15 days | Thu 10/4/18 | Wed 10/24/18 | 8 |
Remove the affixes | 3 days | Thu 10/4/18 | Mon 10/8/18 | | Designer
Generate root word | 4 days | Tue 10/9/18 | Fri 10/12/18 | 12 | System Analyst
Pattern matching | 2 days | Mon 10/15/18 | Tue 10/16/18 | 13 | Designer
Text Stemmer and Porter Stemmer | 2 days | Wed 10/17/18 | Thu 10/18/18 | 14 | Developer
Use of Porter Stemming algorithm | 2 days | Fri 10/19/18 | Mon 10/22/18 | 15 | Developer
Removing the commoner morphological and inflexional endings from words in English | 2 days | Tue 10/23/18 | Wed 10/24/18 | 16 | Developer, System Analyst
Identifying Similar Documents | 10 days | Thu 10/25/18 | Wed 11/7/18 | 11 |
Collection of documents that are located around the World Wide Web | 2 days | Thu 10/25/18 | Fri 10/26/18 | | System Analyst
Enables small and characteristic fragments translation | 3 days | Thu 10/25/18 | Mon 10/29/18 | | Developer
Query documents or texts are inserted | 3 days | Tue 10/30/18 | Thu 11/1/18 | 20 | System Analyst, Technical Writer
Use of Google AJAX Search API | 4 days | Fri 11/2/18 | Wed 11/7/18 | 21 | Code Designer, Developer
Comparison of Similar Pattern | 10 days | Thu 11/8/18 | Wed 11/21/18 | 18 |
Detect plagiarism | 4 days | Tue 11/13/18 | Fri 11/16/18 | 24 | Code Designer, Project Manager, System Analyst
Represent the sentence uniquely. | 3 days | Mon 11/19/18 | Wed 11/21/18 | 25 | System Analyst, Technical Writer
Summary of the Result | 9 days | Thu 11/22/18 | Tue 12/4/18 | 23 |
Gathering the result | 2 days | Thu 11/22/18 | Fri 11/23/18 | | Project Manager, System Analyst
Plagiarism detection is displayed | 3 days | Mon 11/26/18 | Wed 11/28/18 | 28 | Designer, Developer, Project Manager
Highlight the similarities between the two files. | 4 days | Thu 11/29/18 | Tue 12/4/18 | 29 | Code Designer, Developer
Project charter is shown below.
Resource cost status is shown below (Ehsan and Shakery, 2016).
Project Activities Cost is shown below.
Name | Fixed Cost | Actual Cost | Remaining Cost | Cost | Baseline Cost | Cost Variance
Developing a cross-language plagiarism detection tool | $0.00 | $0.00 | $912,000.00 | $912,000.00 | $0.00 | $912,000.00
MS Project file is attached here.
Plagiarism is becoming a serious problem for the scholarly community. Detecting plagiarism at different levels is an important issue. The complexity of the problem increases when detecting plagiarism in source code, which may be written in the same language or may have been converted into a different language (Franco-Salvador et al., 2016). This kind of plagiarism is found not only in academic work but also in industries dealing with software design. The main difficulty with source-code plagiarism is that different programming languages may have different syntax.
Based on the language homogeneity or heterogeneity of the texts being compared, plagiarism detection can be classified as monolingual or cross-lingual. The cross-language plagiarism detection process is similar to the external plagiarism detection task, with a few modifications in the heuristic retrieval and detailed analysis stages (Gelbukh, 2009). The cross-language heuristic retrieval stage aims to retrieve the collection of candidate source documents from the data set. Translating the input document from the query language to the source language may be required in this stage. The cross-language detailed analysis stage measures the cross-language similarity between sections of the suspicious document and sections of the candidate documents retrieved in the previous stage (Kashkur, Parshutin and Borisov, 2010).
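The two stages can be sketched as follows. This is a simplified, monolingual illustration that assumes the suspicious document has already been translated into the language of the sources. Retrieval is approximated by vocabulary overlap and detailed analysis by shared word runs found with Python's standard `difflib.SequenceMatcher`. The function names, threshold and sample corpus are hypothetical.

```python
from difflib import SequenceMatcher


def keyword_overlap(query_tokens, doc_tokens):
    """Fraction of distinct query tokens that also occur in the document."""
    query = set(query_tokens)
    return len(query & set(doc_tokens)) / len(query) if query else 0.0


def retrieve_candidates(suspicious, corpus, threshold=0.3):
    """Heuristic retrieval: keep corpus documents sharing enough vocabulary."""
    s_tokens = suspicious.lower().split()
    return [name for name, text in corpus.items()
            if keyword_overlap(s_tokens, text.lower().split()) >= threshold]


def matching_passages(suspicious, source, min_words=3):
    """Detailed analysis: word runs of at least min_words found in both texts."""
    a, b = suspicious.lower().split(), source.lower().split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [" ".join(a[m.a:m.a + m.size])
            for m in matcher.get_matching_blocks() if m.size >= min_words]


# Assumed sample data for illustration only.
corpus = {
    "doc_a": "plagiarism detection tools compare a suspicious text against known sources",
    "doc_b": "the recipe needs two cups of flour and a pinch of salt",
}
suspicious = "our tools compare a suspicious text against many web pages"

candidates = retrieve_candidates(suspicious, corpus)
passages = matching_passages(suspicious, corpus["doc_a"])
# candidates == ['doc_a']
# passages == ['tools compare a suspicious text against']
```

The shared passages returned by the second stage are also the spans a user interface could highlight when displaying the result.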
Language used: JavaScript.
The software design for the cross-language plagiarism detection tool is illustrated below (Kasprowicz and Wada, 2014).
If you type any text, it is converted from English to Hindi (Potthast et al., 2010).
Conclusion
Once the project is accomplished, we hope to be able to find plagiarism related to any article on the web, given an input file to our application.
References
[1] ‘How to develop a plagiarism detector?’ stackoverflow.com/questions/1193408, accessed 11 August 2018.
[2] NLTK 3.3 documentation, Natural Language Toolkit, nltk.org, accessed 11 August 2018.
[3] Steven Bird, Edward Loper. NLTK: The Natural Language Toolkit.
[4] Roland Kuhn, Renato De Mori. The Application of Semantic Classification Trees to Natural Language Understanding.
[5] Chow Kok Kent, Naomie Salim. Web Based Cross Language Semantic Plagiarism Detection, 3 January 2012.
[6] Jeremy Ferrero, Laurent Besacier, Didier Schwab, Frederic Agnes. Deep Investigation of Cross-Language Plagiarism Detection Methods.
Barrón-Cedeño, A., Gupta, P. and Rosso, P. (2013). Methods for cross-language plagiarism detection. Knowledge-Based Systems, 50, pp.211-217.
Chauhan, S., Arora, A. and Singhal, Y. (2017). Plagiarism Detection of C Program using Assembly Language. International Journal of Computer Applications, 158(3), pp.17-22.
Ehsan, N. and Shakery, A. (2016). Candidate document retrieval for cross-lingual plagiarism detection using two-level proximity information. Information Processing & Management, 52(6), pp.1004-1017.
Franco-Salvador, M., Gupta, P., Rosso, P. and Banchs, R. (2016). Cross-language plagiarism detection over continuous-space- and knowledge graph-based representations of language. Knowledge-Based Systems, 111, pp.87-99.
Gelbukh, A. (2009). Computational Linguistics and Intelligent Text Processing. Heidelberg: Springer.
Kashkur, M., Parshutin, S. and Borisov, A. (2010). Research into Plagiarism Cases and Plagiarism Detection Methods. Scientific Journal of Riga Technical University. Computer Sciences, 42(1).
Kasprowicz, D. and Wada, H. (2014). Methods for automated detection of plagiarism in integrated-circuit layouts. Microelectronics Journal, 45(9), pp.1212-1219.
Lee, Y. (2012). Plagiarism Detection among Source Codes using Adaptive Methods. KSII Transactions on Internet and Information Systems.
METHODS FOR INTRINSIC PLAGIARISM DETECTION. (2017). Informatics and Applications.
Potthast, M., Barrón-Cedeño, A., Stein, B. and Rosso, P. (2010). Cross-language plagiarism detection. Language Resources and Evaluation, 45(1), pp.45-62.