Web page repetitive structure and URL feature based Deep Web data extraction

Noise interference in web pages and the demand for multiple sample pages are the key issues of Deep Web data extraction. In this paper, we propose a novel web page repetitive structure and URL feature based approach for Deep Web data extraction. It employs continuous repetitive tag region and simila...

Bibliographic details
Main authors: Xingyi Li, Yanyan Kong, Huaji Shi
Format: Conference proceeding
Language: eng
container_end_page 364
container_issue
container_start_page 361
container_title
container_volume 1
creator Xingyi Li
Yanyan Kong
Huaji Shi
description Noise interference in web pages and the demand for multiple sample pages are the key issues of Deep Web data extraction. In this paper, we propose a novel web page repetitive structure and URL feature based approach for Deep Web data extraction. It employs continuous repetitive tag region and similar URL to partition the sample page into blocks, locate the data region and extract specific URL template, which is further exploited to quickly identify the data region and the boundary of data records in similar pages. Experimental results show that our approach is highly effective for Deep Web data extraction.
doi_str_mv 10.1109/ICCSNA.2010.5588744
format Conference Proceeding
fullrecord <record><control><sourceid>ieee_6IE</sourceid><recordid>TN_cdi_ieee_primary_5588744</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>5588744</ieee_id><sourcerecordid>5588744</sourcerecordid><originalsourceid>FETCH-LOGICAL-i90t-358a32a6771a20ed9dda73fc7320a4121a8e89a2cc3d4603e4668b6eda7d5fe53</originalsourceid><addsrcrecordid>eNo1T9tKAzEUjIig1v2CvuQHtuae7GNZb4VFQSs-lrPJWYloXZJU9O9dtc7LMMPMwBAy52zBOWvOV237cLtcCDYZWjtnlTogVWMdV0Ipq6w1h-T0X2hxTKqcX9gEpQXT-oR0T9jTEZ6RJhyxxBI_kOaSdr7sElLYBvp439EB4Vf3kDHQC8SR_hQDFKD4WRL4Et-3Z-RogNeM1Z5nZH11uW5v6u7uetUuuzo2rNRSO5ACjLUcBMPQhABWDt5KwUBxwcGha0B4L4MyTKIyxvUGp1TQA2o5I_O_2YiImzHFN0hfm_1_-Q3iLk64</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Web page repetitive structure and URL feature based Deep Web data extraction</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Xingyi Li ; Yanyan Kong ; Huaji Shi</creator><creatorcontrib>Xingyi Li ; Yanyan Kong ; Huaji Shi</creatorcontrib><description>Noise interference in web pages and the demand for multiple sample pages are the key issues of Deep Web data extraction. In this paper, we propose a novel web page repetitive structure and URL feature based approach for Deep Web data extraction. It employs continuous repetitive tag region and similar URL to partition the sample page into blocks, locate the data region and extract specific URL template, which is further exploited to quickly identify the data region and the boundary of data records in similar pages. 
Experimental results show that our approach is highly effective for Deep Web data extraction.</description><identifier>ISBN: 1424474752</identifier><identifier>ISBN: 9781424474752</identifier><identifier>EISBN: 9781424474776</identifier><identifier>EISBN: 1424474779</identifier><identifier>EISBN: 1424474787</identifier><identifier>EISBN: 9781424474783</identifier><identifier>DOI: 10.1109/ICCSNA.2010.5588744</identifier><language>eng</language><publisher>IEEE</publisher><subject>Accuracy ; data extraction ; Data mining ; Deep Web ; Educational institutions ; Feature extraction ; similar URL ; web page repetitive structure</subject><ispartof>2010 Second International Conference on Communication Systems, Networks and Applications, 2010, Vol.1, p.361-364</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/5588744$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,780,784,789,790,2058,27925,54920</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/5588744$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Xingyi Li</creatorcontrib><creatorcontrib>Yanyan Kong</creatorcontrib><creatorcontrib>Huaji Shi</creatorcontrib><title>Web page repetitive structure and URL feature based Deep Web data extraction</title><title>2010 Second International Conference on Communication Systems, Networks and Applications</title><addtitle>ICCSNA</addtitle><description>Noise interference in web pages and the demand for multiple sample pages are the key issues of Deep Web data extraction. In this paper, we propose a novel web page repetitive structure and URL feature based approach for Deep Web data extraction. 
It employs continuous repetitive tag region and similar URL to partition the sample page into blocks, locate the data region and extract specific URL template, which is further exploited to quickly identify the data region and the boundary of data records in similar pages. Experimental results show that our approach is highly effective for Deep Web data extraction.</description><subject>Accuracy</subject><subject>data extraction</subject><subject>Data mining</subject><subject>Deep Web</subject><subject>Educational institutions</subject><subject>Feature extraction</subject><subject>similar URL</subject><subject>web page repetitive structure</subject><isbn>1424474752</isbn><isbn>9781424474752</isbn><isbn>9781424474776</isbn><isbn>1424474779</isbn><isbn>1424474787</isbn><isbn>9781424474783</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2010</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>RIE</sourceid><recordid>eNo1T9tKAzEUjIig1v2CvuQHtuae7GNZb4VFQSs-lrPJWYloXZJU9O9dtc7LMMPMwBAy52zBOWvOV237cLtcCDYZWjtnlTogVWMdV0Ipq6w1h-T0X2hxTKqcX9gEpQXT-oR0T9jTEZ6RJhyxxBI_kOaSdr7sElLYBvp439EB4Vf3kDHQC8SR_hQDFKD4WRL4Et-3Z-RogNeM1Z5nZH11uW5v6u7uetUuuzo2rNRSO5ACjLUcBMPQhABWDt5KwUBxwcGha0B4L4MyTKIyxvUGp1TQA2o5I_O_2YiImzHFN0hfm_1_-Q3iLk64</recordid><startdate>201006</startdate><enddate>201006</enddate><creator>Xingyi Li</creator><creator>Yanyan Kong</creator><creator>Huaji Shi</creator><general>IEEE</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope></search><sort><creationdate>201006</creationdate><title>Web page repetitive structure and URL feature based Deep Web data extraction</title><author>Xingyi Li ; Yanyan Kong ; Huaji 
Shi</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i90t-358a32a6771a20ed9dda73fc7320a4121a8e89a2cc3d4603e4668b6eda7d5fe53</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2010</creationdate><topic>Accuracy</topic><topic>data extraction</topic><topic>Data mining</topic><topic>Deep Web</topic><topic>Educational institutions</topic><topic>Feature extraction</topic><topic>similar URL</topic><topic>web page repetitive structure</topic><toplevel>online_resources</toplevel><creatorcontrib>Xingyi Li</creatorcontrib><creatorcontrib>Yanyan Kong</creatorcontrib><creatorcontrib>Huaji Shi</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Xingyi Li</au><au>Yanyan Kong</au><au>Huaji Shi</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Web page repetitive structure and URL feature based Deep Web data extraction</atitle><btitle>2010 Second International Conference on Communication Systems, Networks and Applications</btitle><stitle>ICCSNA</stitle><date>2010-06</date><risdate>2010</risdate><volume>1</volume><spage>361</spage><epage>364</epage><pages>361-364</pages><isbn>1424474752</isbn><isbn>9781424474752</isbn><eisbn>9781424474776</eisbn><eisbn>1424474779</eisbn><eisbn>1424474787</eisbn><eisbn>9781424474783</eisbn><abstract>Noise interference in web pages and the demand for multiple sample pages are the key issues of Deep Web data extraction. 
In this paper, we propose a novel web page repetitive structure and URL feature based approach for Deep Web data extraction. It employs continuous repetitive tag region and similar URL to partition the sample page into blocks, locate the data region and extract specific URL template, which is further exploited to quickly identify the data region and the boundary of data records in similar pages. Experimental results show that our approach is highly effective for Deep Web data extraction.</abstract><pub>IEEE</pub><doi>10.1109/ICCSNA.2010.5588744</doi><tpages>4</tpages></addata></record>
fulltext fulltext_linktorsrc
identifier ISBN: 1424474752
ispartof 2010 Second International Conference on Communication Systems, Networks and Applications, 2010, Vol.1, p.361-364
issn
language eng
recordid cdi_ieee_primary_5588744
source IEEE Electronic Library (IEL) Conference Proceedings
subjects Accuracy
data extraction
Data mining
Deep Web
Educational institutions
Feature extraction
similar URL
web page repetitive structure
title Web page repetitive structure and URL feature based Deep Web data extraction
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T18%3A19%3A00IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Web%20page%20repetitive%20structure%20and%20URL%20feature%20based%20Deep%20Web%20data%20extraction&rft.btitle=2010%20Second%20International%20Conference%20on%20Communication%20Systems,%20Networks%20and%20Applications&rft.au=Xingyi%20Li&rft.date=2010-06&rft.volume=1&rft.spage=361&rft.epage=364&rft.pages=361-364&rft.isbn=1424474752&rft.isbn_list=9781424474752&rft_id=info:doi/10.1109/ICCSNA.2010.5588744&rft_dat=%3Cieee_6IE%3E5588744%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=9781424474776&rft.eisbn_list=1424474779&rft.eisbn_list=1424474787&rft.eisbn_list=9781424474783&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=5588744&rfr_iscdi=true
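
The description field above says the approach uses similar URLs to derive a URL template that helps locate the data region and record boundaries. The record does not give the algorithm itself, so the following is only a minimal sketch of that general idea and not the authors' method. It assumes that links to data records share a path and query-parameter names and differ only in numeric path parts or parameter values. The names `LinkCollector`, `url_template`, `dominant_template`, and `SAMPLE`, as well as the `{n}` and `{v}` placeholders, are hypothetical and chosen for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href values of <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def url_template(url):
    """Reduce a URL to a coarse template (an assumption of this sketch):
    digit runs in the path become {n}, every query value becomes {v}."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "{n}", parts.path)
    keys = sorted(k for k, _ in parse_qsl(parts.query, keep_blank_values=True))
    query = "&".join(f"{k}={{v}}" for k in keys)
    return path + ("?" + query if query else "")


def dominant_template(html):
    """Return the most frequent URL template on the page and the links matching it."""
    collector = LinkCollector()
    collector.feed(html)
    counts = Counter(url_template(h) for h in collector.hrefs)
    if not counts:
        return None, []
    template, _ = counts.most_common(1)[0]
    records = [h for h in collector.hrefs if url_template(h) == template]
    return template, records


# Hypothetical sample page: two navigation links and three record links.
SAMPLE = """
<div id="nav"><a href="/home">Home</a><a href="/about">About</a></div>
<ul>
<li><a href="/item?id=101">A</a></li>
<li><a href="/item?id=102">B</a></li>
<li><a href="/item?id=103">C</a></li>
</ul>
"""

if __name__ == "__main__":
    print(dominant_template(SAMPLE))
```

In this sketch the navigation links each form their own template, while the three record links collapse into one, so the most frequent template singles out the candidate data records. A template obtained from one sample page could then be matched against links on similar pages, which is in the spirit of the single-sample-page goal stated in the description.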