Near-duplicate document detection for web crawling

A system generates a hash value for a fetched document and compares the hash value with a set of stored hash values to identify ones of the stored hash values with a sequence of bit positions, less than all of the bit positions, that match a corresponding sequence of bit positions of the hash value....

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Hauptverfasser: JAIN ARVIND, MANKU GURMEET SINGH
Format: Patent
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
container_end_page
container_issue
container_start_page
container_title
container_volume
creator JAIN ARVIND
MANKU GURMEET SINGH
description A system generates a hash value for a fetched document and compares the hash value with a set of stored hash values to identify ones of the stored hash values with a sequence of bit positions, less than all of the bit positions, that match a corresponding sequence of bit positions of the hash value. The system also determines whether any of the identified hash values are substantially similar to the hash value and identify the fetched document as a near-duplicate of another document when one of the identified hash values is substantially similar to the hash value.
format Patent
fullrecord <record><control><sourceid>epo_EVB</sourceid><recordid>TN_cdi_epo_espacenet_US8140505B1</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>US8140505B1</sourcerecordid><originalsourceid>FETCH-epo_espacenet_US8140505B13</originalsourceid><addsrcrecordid>eNrjZDDyS00s0k0pLcjJTE4sSVVIyU8uzU3NK1FISS1JTS7JzM9TSMsvUihPTVJILkosz8nMS-dhYE1LzClO5YXS3AwKbq4hzh66qQX58anFBYnJqXmpJfGhwRaGJgamBqZOhsZEKAEAgkMr3w</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>patent</recordtype></control><display><type>patent</type><title>Near-duplicate document detection for web crawling</title><source>esp@cenet</source><creator>JAIN ARVIND ; MANKU GURMEET SINGH</creator><creatorcontrib>JAIN ARVIND ; MANKU GURMEET SINGH</creatorcontrib><description>A system generates a hash value for a fetched document and compares the hash value with a set of stored hash values to identify ones of the stored hash values with a sequence of bit positions, less than all of the bit positions, that match a corresponding sequence of bit positions of the hash value. The system also determines whether any of the identified hash values are substantially similar to the hash value and identify the fetched document as a near-duplicate of another document when one of the identified hash values is substantially similar to the hash value.</description><language>eng</language><subject>CALCULATING ; COMPUTING ; COUNTING ; ELECTRIC DIGITAL DATA PROCESSING ; PHYSICS</subject><creationdate>2012</creationdate><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://worldwide.espacenet.com/publicationDetails/biblio?FT=D&amp;date=20120320&amp;DB=EPODOC&amp;CC=US&amp;NR=8140505B1$$EHTML$$P50$$Gepo$$Hfree_for_read</linktohtml><link.rule.ids>230,308,776,881,25543,76293</link.rule.ids><linktorsrc>$$Uhttps://worldwide.espacenet.com/publicationDetails/biblio?FT=D&amp;date=20120320&amp;DB=EPODOC&amp;CC=US&amp;NR=8140505B1$$EView_record_in_European_Patent_Office$$FView_record_in_$$GEuropean_Patent_Office$$Hfree_for_read</linktorsrc></links><search><creatorcontrib>JAIN ARVIND</creatorcontrib><creatorcontrib>MANKU GURMEET SINGH</creatorcontrib><title>Near-duplicate document detection for web crawling</title><description>A system generates a hash value for a fetched document and compares the hash value with a set of stored hash values to identify ones of the stored hash values with a sequence of bit positions, less than all of the bit positions, that match a corresponding sequence of bit positions of the hash value. The system also determines whether any of the identified hash values are substantially similar to the hash value and identify the fetched document as a near-duplicate of another document when one of the identified hash values is substantially similar to the hash value.</description><subject>CALCULATING</subject><subject>COMPUTING</subject><subject>COUNTING</subject><subject>ELECTRIC DIGITAL DATA PROCESSING</subject><subject>PHYSICS</subject><fulltext>true</fulltext><rsrctype>patent</rsrctype><creationdate>2012</creationdate><recordtype>patent</recordtype><sourceid>EVB</sourceid><recordid>eNrjZDDyS00s0k0pLcjJTE4sSVVIyU8uzU3NK1FISS1JTS7JzM9TSMsvUihPTVJILkosz8nMS-dhYE1LzClO5YXS3AwKbq4hzh66qQX58anFBYnJqXmpJfGhwRaGJgamBqZOhsZEKAEAgkMr3w</recordid><startdate>20120320</startdate><enddate>20120320</enddate><creator>JAIN ARVIND</creator><creator>MANKU GURMEET SINGH</creator><scope>EVB</scope></search><sort><creationdate>20120320</creationdate><title>Near-duplicate document detection for web crawling</title><author>JAIN ARVIND ; MANKU GURMEET SINGH</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-epo_espacenet_US8140505B13</frbrgroupid><rsrctype>patents</rsrctype><prefilter>patents</prefilter><language>eng</language><creationdate>2012</creationdate><topic>CALCULATING</topic><topic>COMPUTING</topic><topic>COUNTING</topic><topic>ELECTRIC DIGITAL DATA PROCESSING</topic><topic>PHYSICS</topic><toplevel>online_resources</toplevel><creatorcontrib>JAIN ARVIND</creatorcontrib><creatorcontrib>MANKU GURMEET SINGH</creatorcontrib><collection>esp@cenet</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>JAIN ARVIND</au><au>MANKU GURMEET SINGH</au><format>patent</format><genre>patent</genre><ristype>GEN</ristype><title>Near-duplicate document detection for web crawling</title><date>2012-03-20</date><risdate>2012</risdate><abstract>A system generates a hash value for a fetched document and compares the hash value with a set of stored hash values to identify ones of the stored hash values with a sequence of bit positions, less than all of the bit positions, that match a corresponding sequence of bit positions of the hash value. The system also determines whether any of the identified hash values are substantially similar to the hash value and identify the fetched document as a near-duplicate of another document when one of the identified hash values is substantially similar to the hash value.</abstract><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier
ispartof
issn
language eng
recordid cdi_epo_espacenet_US8140505B1
source esp@cenet
subjects CALCULATING
COMPUTING
COUNTING
ELECTRIC DIGITAL DATA PROCESSING
PHYSICS
title Near-duplicate document detection for web crawling
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-26T17%3A43%3A40IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-epo_EVB&rft_val_fmt=info:ofi/fmt:kev:mtx:patent&rft.genre=patent&rft.au=JAIN%20ARVIND&rft.date=2012-03-20&rft_id=info:doi/&rft_dat=%3Cepo_EVB%3EUS8140505B1%3C/epo_EVB%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true