Cross-Language Plagiarism Detection Tool: Guidelines And Project Description

Guidelines for Progress Report and Final Presentation

This application takes a text file as input from the user. The file may contain English or Hindi text and can be supplied as a Unicode file. To handle both languages, we create a “Hindi representation” of the English sentence. The application then searches the internet for similar files and presents the results that are relevant to the uploaded text file.

We reviewed many online articles related to the development of a cross-language plagiarism detection tool. A Stack Overflow discussion [1] suggested developing the application in Python with the NLTK and Gensim libraries, by building an LDA (Latent Dirichlet Allocation) or LSA (Latent Semantic Analysis) model of the document. The Google Search API can then be used to search for the resulting words. NLTK [2] is the Natural Language Toolkit for natural language processing. It provides libraries for classification, tokenization, stemming, tagging, parsing, semantic reasoning and more.

In [5], Chow et al. discuss semantic plagiarism, in which a sentence is restructured or some terms are replaced with their synonyms. Both cross-language and semantic plagiarism are hard to detect because the fingerprints of the plagiarized text differ from those of the source. Existing plagiarism detection tools are not capable of detecting such cases.
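To illustrate why fingerprints fail on reworded text, here is a minimal sketch that assumes a fingerprint is a hashed word trigram. The function names, the trigram size and the sample sentences are illustrative assumptions and are not taken from [5].

```python
import hashlib

def fingerprints(text, n=3):
    """Return the set of hashed word n-grams (shingles) of a text."""
    words = text.lower().split()
    shingles = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {hashlib.sha1(s.encode("utf-8")).hexdigest()[:8] for s in shingles}

def jaccard(a, b):
    """Share of fingerprints the two sets have in common (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

original = "the quick brown fox jumps over the lazy dog"
copied = "the quick brown fox jumps over the lazy dog"
reworded = "the fast brown fox leaps over the idle dog"  # synonyms swapped in

exact_score = jaccard(fingerprints(original), fingerprints(copied))
reworded_score = jaccard(fingerprints(original), fingerprints(reworded))
print(exact_score, reworded_score)
```

In this example every trigram of the reworded sentence contains at least one substituted word, so the two fingerprint sets share nothing even though the meaning is unchanged.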

Chow et al. [5] propose a new approach for detecting both cross-language and semantic plagiarism. The query document is first shortened using a fuzzy swarm-based summarisation approach, so that the summary gives the most important keywords in the document. The summaries are translated into English using the Google Translate Application Programming Interface (API), after which the words are stemmed and the stop words are removed. The tokenized documents are sent to the Google AJAX Search API to search for similar documents across the World Wide Web. The Stanford Parser and WordNet are used to determine the semantic similarity between the suspected documents and the source documents. The Stanford Parser assigns each term in a sentence its corresponding role, such as noun, verb or adjective. Each sentence is then represented in predicate form, and similarity is measured on those predicates using information from the WordNet taxonomy. The testing dataset is built from two sets of input documents produced with different plagiarism techniques.

Bird et al. [3] cover the scope of using the NLTK toolkit for natural language processing. We are considering a methodology in which a Token class represents a unit of text such as a word, a sentence or a piece of a document. Kuhn et al. [4] describe the application of semantic classification trees to natural language understanding. Their paper covers speech understanding, semantic classification, machine learning, natural language and decision-tree-based capabilities for a translator application.

These papers discuss speech classification, machine-learning-based training of artificial neural networks, decision trees, tokenization and several other methods.

In [6], Ferrero et al. discuss different state-of-the-art methods for detecting plagiarism. The methods used in the experiments include Cross-Language Character N-Gram (CL-CnG), Cross-Language Conceptual Thesaurus-based Similarity (CL-CTS), Cross-Language Alignment-based Similarity Analysis (CL-ASA), Cross-Language Explicit Semantic Analysis (CL-ESA) and Translation + Monolingual Analysis (T+MA). According to the authors, each method behaves consistently across different language pairs. There is a strong correlation not only across languages but also across the text units considered. If a method is efficient on a particular language pair, it will be similarly efficient on another language pair as long as enough lexical resources are available for these languages. A strong correlation was also found across types of text when the behaviour of the methods was investigated on a particular language pair, so a method can be optimized on one collection of texts and applied efficiently to another. Finally, the authors concluded that methods behave differently in clustering matched and mismatched units, even when they appear similar in performance.
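Of the methods listed, CL-CnG is the simplest to sketch: each text is reduced to counts of character n-grams, and two texts are compared by the cosine of their count vectors. The code below is a minimal illustration under assumed settings (trigrams, lowercasing, punctuation removed), not the exact configuration evaluated in [6]. Because it depends on shared character sequences, this kind of measure is generally expected to work best for language pairs written in the same script, so it may be of limited use for English and Hindi without transliteration.

```python
import math
import re
from collections import Counter

def char_ngrams(text, n=3):
    """Count character n-grams after lowercasing and dropping punctuation."""
    cleaned = " ".join(re.findall(r"\w+", text.lower()))
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

english = "The international conference on information retrieval"
related = "La conferencia internacional sobre recuperacion de informacion"
unrelated = "El perro duerme bajo un arbol grande"

related_score = cosine(char_ngrams(english), char_ngrams(related))
unrelated_score = cosine(char_ngrams(english), char_ngrams(unrelated))
print(round(related_score, 3), round(unrelated_score, 3))
```

The Spanish sentence on the same topic shares many trigrams with the English one (for example "con", "int", "inf"), so it scores higher than the unrelated sentence.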

Project Description

The project activities are shown below (Barrón-Cedeño, Gupta and Rosso, 2013).

Developing a cross-language plagiarism detection tool

    User management

    Document management

    Translation of input documents
        Translate the plagiarized Hindi documents into English
        Improve the effectiveness of the detection process
        Use Google Translate API

    Removing Stop Words
        Before passing the translated documents for comparison through the Internet
        Remove the stop words in the translated text

    Stemming Words
        Remove the affixes
        Generate root word
        Pattern matching
        Text Stemmer and Porter Stemmer
        Use of Porter Stemming algorithm
        Removing the commoner morphological and inflexional endings from words in English

    Identifying Similar Documents
        Collection of documents located around the World Wide Web
        Enables small and characteristic fragments translation
        Query documents or texts are inserted
        Use of Google AJAX Search API

    Comparison of Similar Pattern
        Detect plagiarism
        Represent the sentence uniquely.

    Summary of the Result
        Gathering the result
        Plagiarism detection is displayed
        Highlight the similarities between the two files.
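The stop-word removal, stemming, comparison and highlighting activities above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the stop-word list is a tiny hand-picked sample, and `strip_suffix` is a naive suffix stripper standing in for the Porter stemmer, not the Porter algorithm itself. All function names are hypothetical.

```python
import re

# Tiny illustrative stop-word list; a real tool would use a much fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "in", "on", "and", "to", "by"}

# Naive suffix list: a simplified stand-in for the Porter stemmer, NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def strip_suffix(token):
    """Remove the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def preprocess(text):
    """Tokenize, remove stop words, then reduce each token to a rough root."""
    return [strip_suffix(t) for t in remove_stop_words(tokenize(text))]

def shared_terms(suspicious, source):
    """Roots that occur in both texts, in alphabetical order."""
    return sorted(set(preprocess(suspicious)) & set(preprocess(source)))

def highlight(text, source):
    """Wrap in brackets each word of `text` whose root also occurs in `source`."""
    shared = set(preprocess(source))

    def mark(match):
        word = match.group(0)
        return f"[{word}]" if strip_suffix(word.lower()) in shared else word

    return re.sub(r"[A-Za-z]+", mark, text)

suspicious = "The students copied the reports"
source = "A student is copying a report"
print(shared_terms(suspicious, source))  # ['report', 'student']
print(highlight(suspicious, source))     # The [students] copied the [reports]
```

The naive stripper reduces "copied" to "copi" and "copying" to "copy", so that pair is missed. Cases like this are the reason the plan calls for the Porter stemming algorithm rather than plain suffix removal.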

Resource Name	Type	Initials	Max. Units	Std. Rate	Accrue At	Base Calendar
Project Manager	Work	P	100%	$1,000.00/hr	Prorated	Standard
System Analyst	Work	S	100%	$1,000.00/hr	Prorated	Standard
Developer	Work	D	100%	$1,000.00/hr	Prorated	Standard
Designer	Work	D	100%	$1,000.00/hr	Prorated	Standard
Technical Writer	Work	T	100%	$1,000.00/hr	Prorated	Standard
Code Designer	Work	C	100%	$1,000.00/hr	Prorated	Standard

The overall project activities are shown below (Chauhan, Arora and Singhal, 2017).

Task Name	Duration	Start	Finish	Predecessors	Resource Names
Developing a cross-language plagiarism detection tool	60 days	Wed 9/12/18	Tue 12/4/18
User management	1 day	Wed 9/12/18	Wed 9/12/18		Designer, Developer
Document management	2 days	Thu 9/13/18	Fri 9/14/18	2	Designer, Project Manager, Technical Writer
Translation of input documents	8 days	Mon 9/17/18	Wed 9/26/18	3
Translate the plagiarized Hindi documents into English	2 days	Mon 9/17/18	Tue 9/18/18		Code Designer, Developer, System Analyst
Improve the effectiveness of the detection process	3 days	Wed 9/19/18	Fri 9/21/18	5	Developer
Use Google Translate API	3 days	Mon 9/24/18	Wed 9/26/18	6	Code Designer, Designer
Removing Stop Words	5 days	Thu 9/27/18	Wed 10/3/18	4
Before passing the translated documents for comparison through the Internet	2 days	Thu 9/27/18	Fri 9/28/18		Designer, System Analyst
Remove the stop words in the translated text	3 days	Mon 10/1/18	Wed 10/3/18	9	Developer, Code Designer
Stemming Words	15 days	Thu 10/4/18	Wed 10/24/18	8
Remove the affixes	3 days	Thu 10/4/18	Mon 10/8/18		Designer
Generate root word	4 days	Tue 10/9/18	Fri 10/12/18	12	System Analyst
Pattern matching	2 days	Mon 10/15/18	Tue 10/16/18	13	Designer
Text Stemmer and Porter Stemmer	2 days	Wed 10/17/18	Thu 10/18/18	14	Developer
Use of Porter Stemming algorithm	2 days	Fri 10/19/18	Mon 10/22/18	15	Developer
Removing the commoner morphological and inflexional endings from words in English	2 days	Tue 10/23/18	Wed 10/24/18	16	Developer, System Analyst
Identifying Similar Documents	10 days	Thu 10/25/18	Wed 11/7/18	11
Collection of documents located around the World Wide Web	2 days	Thu 10/25/18	Fri 10/26/18		System Analyst
Enables small and characteristic fragments translation	3 days	Thu 10/25/18	Mon 10/29/18		Developer
Query documents or texts are inserted	3 days	Tue 10/30/18	Thu 11/1/18	20	System Analyst, Technical Writer
Use of Google AJAX Search API	4 days	Fri 11/2/18	Wed 11/7/18	21	Code Designer, Developer
Comparison of Similar Pattern	10 days	Thu 11/8/18	Wed 11/21/18	18
Detect plagiarism	4 days	Tue 11/13/18	Fri 11/16/18	24	Code Designer, Project Manager, System Analyst
Represent the sentence uniquely.	3 days	Mon 11/19/18	Wed 11/21/18	25	System Analyst, Technical Writer
Summary of the Result	9 days	Thu 11/22/18	Tue 12/4/18	23
Gathering the result	2 days	Thu 11/22/18	Fri 11/23/18		Project Manager, System Analyst
Plagiarism detection is displayed	3 days	Mon 11/26/18	Wed 11/28/18	28	Designer, Developer, Project Manager
Highlight the similarities between the two files.	4 days	Thu 11/29/18	Tue 12/4/18	29	Code Designer, Developer

Project charter is shown below.

The resource cost status is shown below (Ehsan and Shakery, 2016).

Project Activities Cost is shown below.

Name	Fixed Cost	Actual Cost	Remaining Cost	Cost	Baseline Cost	Cost Variance
Developing a cross-language plagiarism detection tool	$0.00	$0.00	$912,000.00	$912,000.00	$0.00	$912,000.00

MS Project file is attached here.

Plagiarism is becoming a difficult problem for the academic community, and detecting it at different levels is an important task. The complexity of the problem increases when plagiarism must be detected in source code, which may be written in the same language or may have been converted into a different language (Franco-Salvador et al., 2016). This kind of plagiarism is found not only in academic work but also in the software industry. The main difficulty with source code plagiarism is that different programming languages may have different syntax.

Depending on whether the texts being compared are in the same language or in different languages, plagiarism detection can be classified as monolingual or cross-lingual. The cross-language plagiarism detection process is similar to the external plagiarism detection task, with some modifications to the heuristic retrieval and detailed analysis stages (Gelbukh, 2009). The cross-language heuristic retrieval stage aims to retrieve a collection of candidate source documents from the data set. Translating the input document from the query language to the source language may be required in this stage. The cross-language detailed analysis stage measures the cross-language similarity between sections of the suspicious document and sections of the candidate documents retrieved in the previous stage (Kashkur, Parshutin and Borisov, 2010).
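The two stages can be sketched as follows. This is a rough illustration under strong assumptions: a tiny hypothetical Hindi-to-English glossary stands in for a real translation service, and word-set overlap stands in for a real similarity measure. The glossary, the function names, the sample documents and the threshold are all invented for the example.

```python
import re

# Hypothetical word-by-word glossary standing in for a real translation service.
GLOSSARY = {"बिल्ली": "cat", "दूध": "milk", "पीती": "drinks",
            "है": "is", "कुत्ता": "dog", "सोता": "sleeps"}

def split_sentences(text):
    """Split on the full stop or the Devanagari danda."""
    return [s.strip() for s in re.split(r"[.।]", text) if s.strip()]

def translate(text, glossary):
    """Translate word by word; words missing from the glossary are kept as they are."""
    sentences = (" ".join(glossary.get(w, w) for w in s.split())
                 for s in split_sentences(text))
    return ". ".join(sentences)

def overlap(a, b):
    """Jaccard overlap between the word sets of two strings."""
    wa = {w.strip(".,").lower() for w in a.split()}
    wb = {w.strip(".,").lower() for w in b.split()}
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def retrieve_candidates(query, sources, k=2):
    """Heuristic retrieval: keep the k source documents closest to the whole query."""
    ranked = sorted(sources, key=lambda name: overlap(query, sources[name]), reverse=True)
    return ranked[:k]

def detailed_analysis(query, candidates, sources, threshold=0.5):
    """Detailed analysis: compare every query sentence with every candidate sentence."""
    matches = []
    for q in split_sentences(query):
        for name in candidates:
            for s in split_sentences(sources[name]):
                score = overlap(q, s)
                if score >= threshold:
                    matches.append((q, name, s, round(score, 2)))
    return matches

sources = {
    "doc_a": "The cat drinks milk. The sun is hot.",
    "doc_b": "Trains run on rails. Ships sail at sea.",
    "doc_c": "The dog sleeps. Birds fly.",
}
suspicious_hindi = "बिल्ली दूध पीती है। कुत्ता सोता है।"

query = translate(suspicious_hindi, GLOSSARY)
candidates = retrieve_candidates(query, sources)
matches = detailed_analysis(query, candidates, sources)
print(candidates)
for match in matches:
    print(match)
```

The retrieval stage narrows three source documents down to two candidates, and only those candidates are then compared sentence by sentence, which keeps the more expensive detailed comparison small.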

Language used: JavaScript.

The software design for the cross-language plagiarism detection tool is illustrated below (Kasprowicz and Wada, 2014).

When the user types any text, it is converted from English to Hindi (Potthast et al., 2010).

Conclusion

Once the project is complete, we hope that our application will be able to find plagiarism of any article on the web, given an input file.

References

[1] ‘How to develop a plagiarism detector?’ stackoverflow.com/questions/1193408, retrieved on 11 August 2018.

[2] NLTK 3.3 documentation for the Natural Language Toolkit, retrieved from nltk.org on 11 August 2018.

[3] Steven Bird, Edward Loper. NLTK: The Natural Language Toolkit.

[4] Roland Kuhn, Renato De Mori. The Application of Semantic Classification Trees to Natural Language Understanding.

[5] Chow Kok Kent, Naomie Salim. Web Based Cross Language Semantic Plagiarism Detection, 03 January 2012.

[6] Jeremy Ferrero, Laurent Besacier, Didier Schwab, Frederic Agnes. Deep Investigation of Cross-Language Plagiarism Detection Methods.

Barrón-Cedeño, A., Gupta, P. and Rosso, P. (2013). Methods for cross-language plagiarism detection. Knowledge-Based Systems, 50, pp.211-217.

Chauhan, S., Arora, A. and Singhal, Y. (2017). Plagiarism Detection of C Program using Assembly Language. International Journal of Computer Applications, 158(3), pp.17-22.

Ehsan, N. and Shakery, A. (2016). Candidate document retrieval for cross-lingual plagiarism detection using two-level proximity information. Information Processing & Management, 52(6), pp.1004-1017.

Franco-Salvador, M., Gupta, P., Rosso, P. and Banchs, R. (2016). Cross-language plagiarism detection over continuous-space- and knowledge graph-based representations of language. Knowledge-Based Systems, 111, pp.87-99.

Gelbukh, A. (2009). Computational Linguistics and Intelligent Text Processing. Heidelberg: Springer.

Kashkur, M., Parshutin, S. and Borisov, A. (2010). Research into Plagiarism Cases and Plagiarism Detection Methods. Scientific Journal of Riga Technical University. Computer Sciences, 42(1).

Kasprowicz, D. and Wada, H. (2014). Methods for automated detection of plagiarism in integrated-circuit layouts. Microelectronics Journal, 45(9), pp.1212-1219.

Lee, Y. (2012). Plagiarism Detection among Source Codes using Adaptive Methods. KSII Transactions on Internet and Information Systems.

Methods for intrinsic plagiarism detection. (2017). Informatics and Applications.

Potthast, M., Barrón-Cedeño, A., Stein, B. and Rosso, P. (2010). Cross-language plagiarism detection. Language Resources and Evaluation, 45(1), pp.45-62.
