DarkDiff: Explainable web page similarity of TOR onion sites
Main authors: | , , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Order full text |
Summary: | In large-scale data analysis, near-duplicates are often a problem. For
example, with two near-duplicate phishing emails, a difference in the
salutation (Mr versus Ms) is not essential, but whether it is bank A or B is
important. The state of the art in near-duplicate detection is a black-box
approach (MinHash), so one only knows that emails are near-duplicates, but not
why. We present DarkDiff, which efficiently detects near-duplicates and also
reports why two items are near-duplicates. We developed DarkDiff
to detect near-duplicates of homepages on the Darkweb. DarkDiff works well on
those pages because they resemble the clear web of the past. |
---|---|
DOI: | 10.48550/arxiv.2308.12134 |
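
The contrast drawn in the summary, between a similarity score alone and a score with a reason, can be illustrated with a minimal Python sketch. This is not DarkDiff's algorithm, which the record does not describe. The function names, the word-level tokenization, the 64-hash signature, and the two sample emails are all assumptions made for illustration. The first half estimates similarity with a simple MinHash signature; the second half lists the tokens that differ, which is the kind of information a MinHash score alone does not give.

```python
import hashlib


def tokens(text):
    """Lowercased word set (hypothetical tokenization, chosen for brevity)."""
    return set(text.lower().split())


def minhash_signature(items, num_hashes=64):
    """Simple MinHash: for each seeded hash, keep the minimum value over the set.

    Assumes `items` is non-empty.
    """
    signature = []
    for seed in range(num_hashes):
        salt = str(seed).encode()  # a different salt gives a different hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                    "big",
                )
                for item in items
            )
        )
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions; approximates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def token_difference(set_a, set_b):
    """Tokens found only in one of the two sets: a basic 'why' for the difference."""
    return sorted(set_a - set_b), sorted(set_b - set_a)


# Illustrative emails modeled on the summary's example (made-up text).
email_a = tokens("Dear Mr Smith your account at Bank A has been locked")
email_b = tokens("Dear Ms Smith your account at Bank B has been locked")

# Black-box view: a single score, with no indication of what differs.
score = estimated_similarity(minhash_signature(email_a), minhash_signature(email_b))
print(f"estimated similarity: {score:.2f}")

# Explainable view: the differing tokens themselves.
only_a, only_b = token_difference(email_a, email_b)
print("only in A:", only_a)  # ['a', 'mr']
print("only in B:", only_b)  # ['b', 'ms']
```

In this sketch the token difference shows both the salutation change (mr/ms) and the bank change (a/b). Deciding which of the two matters is left to the reader or to a later filtering step.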