A METHOD AND APPARATUS FOR SIMILARITY DETECTION FOR DOCUMENTS BASED ON CONTENTS INCLUDING TEXTS TABLES FLOWCHARTS AND EQUATIONS

Bibliographic details
Main authors:	TIWARI, MURLI DHAR, SIDDHARTH, TRIPATHI, RAMESH CHANDRA
Format:	Patent
Language:	eng
Subjects:	CALCULATING; COMPUTING; COUNTING | ELECTRIC DIGITAL DATA PROCESSING | PHYSICS

Description
Summary:	Information overload on the internet and the advent of software tools that make e-content easy to manipulate have made it an arduous task to detect plagiarism, an offence under copyright law. The present invention relates to a computer-implemented method and apparatus that obtains textual content from documents in computer-readable form, covering all popular file formats, and detects and separately stores the textual parts, tables, equations and flowcharts/block diagrams of the documents concerned. The method comprises the following steps: extracting text from the document in its computer-readable format; removing special symbols from the textual part, so that only the words and numerical digits to be checked for similarity remain; removing connectors and stop words, so that only words significant for similarity detection are left; removing suffixes to normalize words to their root form (stemming); hashing the root words to prepare them for comparison; and searching for the similarity of every phrase in the query document against every phrase in the repository of suspect documents, using software tools that manipulate the textual content of both the query and the repository.
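
The text-processing steps named in the summary (symbol removal, stop-word removal, stemming, hashing, phrase comparison) can be illustrated with a minimal Python sketch. It is not the patented implementation: the stop-word list, the naive suffix stripper (not a Porter stemmer), the use of SHA-256, the definition of a "phrase" as n consecutive root words, and the function names tokenize, stem, root_words, phrase_hashes and similarity are all assumptions made for illustration. Tables, equations and flowcharts are not handled.

```python
import hashlib
import re

# Assumed, deliberately tiny stop-word list; the abstract does not specify one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

# Assumed suffix list for a naive stemmer (a stand-in, not the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def root_words(text):
    """Tokenize, drop stop words, and reduce the remaining words to roots."""
    return [stem(w) for w in tokenize(text) if w not in STOP_WORDS]


def phrase_hashes(text, n=3):
    """Hash every run of n consecutive root words (an assumed 'phrase')."""
    roots = root_words(text)
    phrases = (" ".join(roots[i:i + n]) for i in range(len(roots) - n + 1))
    return {hashlib.sha256(p.encode("utf-8")).hexdigest() for p in phrases}


def similarity(query, suspect, n=3):
    """Fraction of the query's phrase hashes that also occur in the suspect."""
    query_hashes = phrase_hashes(query, n)
    if not query_hashes:
        return 0.0
    return len(query_hashes & phrase_hashes(suspect, n)) / len(query_hashes)
```

Calling similarity(query_text, suspect_text) for each document in a repository gives a score between 0 and 1 per pair. Comparing sets of hashes rather than raw strings keeps each phrase lookup constant-time on average, which matters when every query phrase is checked against every repository phrase.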