Image Plagiarism Detection Pipeline for Vast Databases

Bibliographic details
Main authors: Kaprielova, Mariam; Grabovoy, Andrey; Varlamova, Ksenia; Potyashin, Ivan; Chekhovich, Yury; Kildyakov, Aleksandr
Format: Conference proceeding
Language: English
Description
Summary: Reuse detection in academic works is a relevant problem. Automatic systems already exist to detect many kinds of violations of academic ethics in the text of a work, such as translated reuse, paraphrasing, machine generation, and many others. However, much less attention is paid to the problem of image reuse. At the same time, the current level of image-processing tools makes it easy to falsify the results of scientific research or to violate the principles of academic ethics in other ways. To address this problem, it is necessary to develop an image reuse detection system that achieves high performance on large document collections. This paper presents an approach designed to search for image reuse in large collections of sources. The pipeline involves three steps: conversion of images into a vector representation, candidate search, and similarity estimation between the query image and each of the candidates obtained at the previous step. The article presents the results of experiments estimating the quality and latency of the developed system. We obtained Recall@1 = 98% for a collection of images created without automatic drawing systems, 59% for images of handwritten essays, and a latency of about 0.32 seconds per query for a collection of 59 million objects. The results show that the proposed system can be scaled up and used for industrial tasks that require quick verification of hundreds of thousands of images against a large number of potential sources of reuse.
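The three-step pipeline and the Recall@k metric named in the summary can be illustrated with a minimal Python sketch. The summary does not say which embedding model, search index, or similarity measure the authors use, so everything here is an assumption made for illustration: the function names (embed, candidate_search, similarity, find_reuse, recall_at_k) are hypothetical, the embedding is a toy average-pooling descriptor, the candidate search is brute-force cosine similarity standing in for whatever index a 59-million-object collection would need, and the final comparison is a plain pixel correlation.

```python
import numpy as np

def embed(image: np.ndarray, grid: int = 8) -> np.ndarray:
    """Step 1 (toy assumption): average-pool a grayscale image to grid x grid,
    center it, and L2-normalize so a dot product equals cosine similarity."""
    h, w = image.shape
    h2, w2 = h - h % grid, w - w % grid  # crop so both sides divide evenly
    blocks = image[:h2, :w2].reshape(grid, h2 // grid, grid, w2 // grid)
    vec = blocks.mean(axis=(1, 3)).ravel().astype(np.float64)
    vec -= vec.mean()
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def candidate_search(query_vec: np.ndarray, index: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Step 2: brute-force cosine search over all stored vectors.
    A large collection would use an approximate index instead (assumption)."""
    scores = index @ query_vec
    return np.argsort(-scores)[: min(top_k, len(scores))]

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Step 3 (assumption): normalized correlation of raw pixels, a finer
    and costlier check applied only to the shortlisted candidates."""
    if a.shape != b.shape:
        return 0.0
    a = a.astype(np.float64).ravel()  # astype returns a copy, inputs stay intact
    b = b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def find_reuse(query, collection, index, top_k=5, threshold=0.9):
    """Run all three steps; return (source_id, score) pairs above the threshold,
    best match first."""
    candidates = candidate_search(embed(query), index, top_k)
    scored = [(int(i), similarity(query, collection[i])) for i in candidates]
    scored.sort(key=lambda pair: -pair[1])
    return [(i, s) for i, s in scored if s >= threshold]

def recall_at_k(ranked_ids, true_ids, k=1):
    """Fraction of queries whose true source appears among the first k results."""
    hits = sum(1 for ranked, true in zip(ranked_ids, true_ids) if true in ranked[:k])
    return hits / len(true_ids)

# Synthetic demo: 100 random "images", one of which is reused with slight noise.
rng = np.random.default_rng(0)
collection = [rng.random((32, 32)) for _ in range(100)]
index = np.stack([embed(img) for img in collection])
query = collection[42] + rng.normal(0.0, 0.01, (32, 32))
matches = find_reuse(query, collection, index)
```

The split into a cheap vector search followed by a costlier pairwise check is what keeps per-query latency low in a design like this: only the top_k shortlisted candidates are compared in detail, however large the collection is.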
ISSN: 2305-7254, 2343-0737
DOI: 10.23919/FRUCT61870.2024.10516388