Near duplicate detection using MapReduce
Main authors: | , , , |
---|---|
Format: | Conference proceedings |
Language: | English |
Summary: | Near-duplicate detection is a widespread problem in massive real-world text datasets. This paper proposes a vector-based algorithm for detecting near duplicates in MapReduce. Given a text set and a similarity threshold, the algorithm returns the pairs whose similarity is no less than the threshold. Experimental results on real datasets show that the algorithm is effective. |
---|---|
DOI: | 10.1109/ICCSNT.2012.6525930 |