Near-duplicate document detection for web crawling

A system generates a hash value for a fetched document and compares the hash value with a set of stored hash values to identify ones of the stored hash values with a sequence of bit positions, less than all of the bit positions, that match a corresponding sequence of bit positions of the hash value....

Bibliographic details
Main authors:	JAIN ARVIND; MANKU GURMEET SINGH
Format:	Patent
Language:	eng
Subjects:	CALCULATING; COMPUTING; COUNTING; ELECTRIC DIGITAL DATA PROCESSING; PHYSICS

creator	JAIN ARVIND ; MANKU GURMEET SINGH
description	A system generates a hash value for a fetched document and compares the hash value with a set of stored hash values to identify ones of the stored hash values with a sequence of bit positions, less than all of the bit positions, that match a corresponding sequence of bit positions of the hash value. The system also determines whether any of the identified hash values are substantially similar to the hash value and identifies the fetched document as a near-duplicate of another document when one of the identified hash values is substantially similar to the hash value.
format	Patent
fullrecord	<record><control><sourceid>epo_EVB</sourceid><recordid>TN_cdi_epo_espacenet_US8140505B1</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>US8140505B1</sourcerecordid><originalsourceid>FETCH-epo_espacenet_US8140505B13</originalsourceid><addsrcrecordid>eNrjZDDyS00s0k0pLcjJTE4sSVVIyU8uzU3NK1FISS1JTS7JzM9TSMsvUihPTVJILkosz8nMS-dhYE1LzClO5YXS3AwKbq4hzh66qQX58anFBYnJqXmpJfGhwRaGJgamBqZOhsZEKAEAgkMr3w</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>patent</recordtype></control><display><type>patent</type><title>Near-duplicate document detection for web crawling</title><source>esp@cenet</source><creator>JAIN ARVIND ; MANKU GURMEET SINGH</creator><creatorcontrib>JAIN ARVIND ; MANKU GURMEET SINGH</creatorcontrib><description>A system generates a hash value for a fetched document and compares the hash value with a set of stored hash values to identify ones of the stored hash values with a sequence of bit positions, less than all of the bit positions, that match a corresponding sequence of bit positions of the hash value. 
The system also determines whether any of the identified hash values are substantially similar to the hash value and identify the fetched document as a near-duplicate of another document when one of the identified hash values is substantially similar to the hash value.</description><language>eng</language><subject>CALCULATING ; COMPUTING ; COUNTING ; ELECTRIC DIGITAL DATA PROCESSING ; PHYSICS</subject><creationdate>2012</creationdate><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20120320&DB=EPODOC&CC=US&NR=8140505B1$$EHTML$$P50$$Gepo$$Hfree_for_read</linktohtml><link.rule.ids>230,308,776,881,25543,76293</link.rule.ids><linktorsrc>$$Uhttps://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20120320&DB=EPODOC&CC=US&NR=8140505B1$$EView_record_in_European_Patent_Office$$FView_record_in_$$GEuropean_Patent_Office$$Hfree_for_read</linktorsrc></links><search><creatorcontrib>JAIN ARVIND</creatorcontrib><creatorcontrib>MANKU GURMEET SINGH</creatorcontrib><title>Near-duplicate document detection for web crawling</title><description>A system generates a hash value for a fetched document and compares the hash value with a set of stored hash values to identify ones of the stored hash values with a sequence of bit positions, less than all of the bit positions, that match a corresponding sequence of bit positions of the hash value. 
The system also determines whether any of the identified hash values are substantially similar to the hash value and identify the fetched document as a near-duplicate of another document when one of the identified hash values is substantially similar to the hash value.</description><subject>CALCULATING</subject><subject>COMPUTING</subject><subject>COUNTING</subject><subject>ELECTRIC DIGITAL DATA PROCESSING</subject><subject>PHYSICS</subject><fulltext>true</fulltext><rsrctype>patent</rsrctype><creationdate>2012</creationdate><recordtype>patent</recordtype><sourceid>EVB</sourceid><recordid>eNrjZDDyS00s0k0pLcjJTE4sSVVIyU8uzU3NK1FISS1JTS7JzM9TSMsvUihPTVJILkosz8nMS-dhYE1LzClO5YXS3AwKbq4hzh66qQX58anFBYnJqXmpJfGhwRaGJgamBqZOhsZEKAEAgkMr3w</recordid><startdate>20120320</startdate><enddate>20120320</enddate><creator>JAIN ARVIND</creator><creator>MANKU GURMEET SINGH</creator><scope>EVB</scope></search><sort><creationdate>20120320</creationdate><title>Near-duplicate document detection for web crawling</title><author>JAIN ARVIND ; MANKU GURMEET SINGH</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-epo_espacenet_US8140505B13</frbrgroupid><rsrctype>patents</rsrctype><prefilter>patents</prefilter><language>eng</language><creationdate>2012</creationdate><topic>CALCULATING</topic><topic>COMPUTING</topic><topic>COUNTING</topic><topic>ELECTRIC DIGITAL DATA PROCESSING</topic><topic>PHYSICS</topic><toplevel>online_resources</toplevel><creatorcontrib>JAIN ARVIND</creatorcontrib><creatorcontrib>MANKU GURMEET SINGH</creatorcontrib><collection>esp@cenet</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>JAIN ARVIND</au><au>MANKU GURMEET SINGH</au><format>patent</format><genre>patent</genre><ristype>GEN</ristype><title>Near-duplicate document detection for web crawling</title><date>2012-03-20</date><risdate>2012</risdate><abstract>A system generates a hash value for a fetched 
document and compares the hash value with a set of stored hash values to identify ones of the stored hash values with a sequence of bit positions, less than all of the bit positions, that match a corresponding sequence of bit positions of the hash value. The system also determines whether any of the identified hash values are substantially similar to the hash value and identify the fetched document as a near-duplicate of another document when one of the identified hash values is substantially similar to the hash value.</abstract><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
language	eng
recordid	cdi_epo_espacenet_US8140505B1
source	esp@cenet
subjects	CALCULATING ; COMPUTING ; COUNTING ; ELECTRIC DIGITAL DATA PROCESSING ; PHYSICS
title	Near-duplicate document detection for web crawling
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-26T17%3A43%3A40IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-epo_EVB&rft_val_fmt=info:ofi/fmt:kev:mtx:patent&rft.genre=patent&rft.au=JAIN%20ARVIND&rft.date=2012-03-20&rft_id=info:doi/&rft_dat=%3Cepo_EVB%3EUS8140505B1%3C/epo_EVB%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
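
The two-stage lookup summarized in the description field can be illustrated with a small Python sketch. This is not the patented implementation. The record gives no details beyond the abstract, so the following are all assumptions made for illustration: a 64-bit simhash-style fingerprint over whitespace tokens, "substantially similar" meaning a Hamming distance of at most 3 bits, and a partial match meaning agreement on any one of four contiguous 16-bit blocks. The names fingerprint, NearDuplicateIndex, BITS, and MAX_DIFF are hypothetical.

```python
import hashlib
from collections import defaultdict

BITS = 64                      # fingerprint width (assumed, not from the record)
MAX_DIFF = 3                   # assumed threshold for "substantially similar"
BLOCKS = MAX_DIFF + 1          # pigeonhole: <= 3 differing bits leave one of 4 blocks equal
BLOCK_BITS = BITS // BLOCKS    # 16 bit positions per block


def fingerprint(text: str) -> int:
    """Simhash-style fingerprint: each token votes +1/-1 on every bit position."""
    counts = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if counts[i] > 0)


def blocks(fp: int):
    """Yield (block_index, block_value) for each contiguous run of bit positions."""
    mask = (1 << BLOCK_BITS) - 1
    for b in range(BLOCKS):
        yield b, (fp >> (b * BLOCK_BITS)) & mask


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class NearDuplicateIndex:
    """Stores fingerprints keyed by every block so partial matches are cheap to find."""

    def __init__(self) -> None:
        self._tables = defaultdict(set)  # (block_index, block_value) -> fingerprints

    def add(self, fp: int) -> None:
        for key in blocks(fp):
            self._tables[key].add(fp)

    def candidates(self, fp: int) -> set:
        """Stage 1: stored hashes that match fp on some subset of bit positions."""
        found = set()
        for key in blocks(fp):
            found |= self._tables.get(key, set())
        return found

    def near_duplicates(self, fp: int) -> set:
        """Stage 2: keep only candidates within MAX_DIFF differing bits of fp."""
        return {c for c in self.candidates(fp) if hamming(c, fp) <= MAX_DIFF}
```

In a crawler, one would call fingerprint on each fetched page, treat the page as a near-duplicate if near_duplicates returns a non-empty set, and otherwise add the fingerprint to the index. Splitting the hash into MAX_DIFF + 1 blocks is one simple way to guarantee that every stored hash within the threshold shows up as a candidate; other partitioning schemes would serve the same purpose.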