Bootstrapping semantic annotation for content-rich HTML documents

An enormous amount of semantic data is still being encoded in HTML documents. Identifying and annotating the semantic concepts implicit in such documents makes them directly amenable to Semantic Web processing. In this paper, we describe a highly automated technique for annotating HTML documents, espec...

Bibliographic Details
Main Authors:	Mukherjee, S. ; Ramakrishnan, I.V. ; Singh, A.
Format:	Conference Proceeding
Language:	eng
Subjects:	Computer science ; HTML ; Labeling ; Next generation networking ; Ontologies ; Pricing ; Resource description framework ; Semantic Web ; Vehicles ; XML

container_end_page	593
container_issue
container_start_page	583
container_title
container_volume
creator	Mukherjee, S. ; Ramakrishnan, I.V. ; Singh, A.
description	An enormous amount of semantic data is still being encoded in HTML documents. Identifying and annotating the semantic concepts implicit in such documents makes them directly amenable to Semantic Web processing. In this paper, we describe a highly automated technique for annotating HTML documents, especially template-based content-rich documents, containing many different semantic concepts per document. Starting with a (small) seed of hand-labeled instances of semantic concepts in a set of HTML documents, we bootstrap an annotation process that automatically identifies unlabeled concept instances present in other documents. The bootstrapping technique exploits the observation that semantically related items in content-rich documents exhibit consistency in presentation style and spatial locality to learn a statistical model for accurately identifying different semantic concepts in HTML documents drawn from a variety of Web sources. We also present experimental results on the effectiveness of the technique.
doi_str_mv	10.1109/ICDE.2005.28
format	Conference Proceeding
fullrecord	<record><control><sourceid>ieee_6IE</sourceid><recordid>TN_cdi_ieee_primary_1410176</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>1410176</ieee_id><sourcerecordid>1410176</sourcerecordid><originalsourceid>FETCH-LOGICAL-i213t-58f6f0fb778c2c0c31940bd38db31e4014912239a70169f2fd89f2487630c3ef3</originalsourceid><addsrcrecordid>eNotjD1PwzAURS0-JELpxsbiP5Dwnp049lhCoZWCWIrEVjmODUbEjhIz8O8bCe5wr3R0dAm5RSgQQd3vm8dtwQCqgskzkjFeVzkw8X5OrqEWqmJMVvKCZAiC54JLdkXW8_wFS1SJWEFGNg8xpjlNehx9-KCzHXRI3lAdQkw6-RioixM1MSQbUj5580l3h5eW9tH8DAuab8il09-zXf_virw9bQ_NLm9fn_fNps09Q57ySjrhwHV1LQ0zYDiqErqey77jaEvAUiFjXOkaUCjHXC-XLmUt-CJbx1fk7u_XW2uP4-QHPf0esUTAxTkBTClKUA</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Bootstrapping semantic annotation for content-rich HTML documents</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Mukherjee, S. ; Ramakrishnan, I.V. ; Singh, A.</creator><creatorcontrib>Mukherjee, S. ; Ramakrishnan, I.V. ; Singh, A.</creatorcontrib><description>Enormous amount of semantic data is still being encoded in HTML documents. Identifying and annotating the semantic concepts implicit in such documents makes them directly amenable for semantic Web processing. In this paper we describe a highly automated technique for annotating HTML documents, especially template-based content-rich documents, containing many different semantic concepts per document. Starting with a (small) seed of hand-labeled instances of semantic concepts in a set of HTML documents we bootstrap an annotation process that automatically identifies unlabeled concept instances present in other documents. 
The bootstrapping technique exploits the observation that semantically related items in content-rich documents exhibit consistency in presentation style and spatial locality to learn a statistical model for accurately identifying different semantic concepts in HTML documents drawn from a variety of Web sources. We also present experimental results on the effectiveness of the technique.</description><identifier>ISSN: 1063-6382</identifier><identifier>ISBN: 0769522858</identifier><identifier>ISBN: 9780769522852</identifier><identifier>EISSN: 2375-026X</identifier><identifier>DOI: 10.1109/ICDE.2005.28</identifier><language>eng</language><publisher>IEEE</publisher><subject>Computer science ; HTML ; Labeling ; Next generation networking ; Ontologies ; Pricing ; Resource description framework ; Semantic Web ; Vehicles ; XML</subject><ispartof>21st International Conference on Data Engineering (ICDE'05), 2005, p.583-593</ispartof><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/1410176$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,780,784,789,790,2058,4050,4051,27925,54920</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/1410176$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Mukherjee, S.</creatorcontrib><creatorcontrib>Ramakrishnan, I.V.</creatorcontrib><creatorcontrib>Singh, A.</creatorcontrib><title>Bootstrapping semantic annotation for content-rich HTML documents</title><title>21st International Conference on Data Engineering (ICDE'05)</title><addtitle>ICDE</addtitle><description>Enormous amount of semantic data is still being encoded in HTML documents. 
Identifying and annotating the semantic concepts implicit in such documents makes them directly amenable for semantic Web processing. In this paper we describe a highly automated technique for annotating HTML documents, especially template-based content-rich documents, containing many different semantic concepts per document. Starting with a (small) seed of hand-labeled instances of semantic concepts in a set of HTML documents we bootstrap an annotation process that automatically identifies unlabeled concept instances present in other documents. The bootstrapping technique exploits the observation that semantically related items in content-rich documents exhibit consistency in presentation style and spatial locality to learn a statistical model for accurately identifying different semantic concepts in HTML documents drawn from a variety of Web sources. We also present experimental results on the effectiveness of the technique.</description><subject>Computer science</subject><subject>HTML</subject><subject>Labeling</subject><subject>Next generation networking</subject><subject>Ontologies</subject><subject>Pricing</subject><subject>Resource description framework</subject><subject>Semantic Web</subject><subject>Vehicles</subject><subject>XML</subject><issn>1063-6382</issn><issn>2375-026X</issn><isbn>0769522858</isbn><isbn>9780769522852</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2005</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>RIE</sourceid><recordid>eNotjD1PwzAURS0-JELpxsbiP5Dwnp049lhCoZWCWIrEVjmODUbEjhIz8O8bCe5wr3R0dAm5RSgQQd3vm8dtwQCqgskzkjFeVzkw8X5OrqEWqmJMVvKCZAiC54JLdkXW8_wFS1SJWEFGNg8xpjlNehx9-KCzHXRI3lAdQkw6-RioixM1MSQbUj5580l3h5eW9tH8DAuab8il09-zXf_virw9bQ_NLm9fn_fNps09Q57ySjrhwHV1LQ0zYDiqErqey77jaEvAUiFjXOkaUCjHXC-XLmUt-CJbx1fk7u_XW2uP4-QHPf0esUTAxTkBTClKUA</recordid><startdate>2005</startdate><enddate>2005</enddate><creator>Mukherjee, 
S.</creator><creator>Ramakrishnan, I.V.</creator><creator>Singh, A.</creator><general>IEEE</general><scope>6IE</scope><scope>6IH</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIO</scope></search><sort><creationdate>2005</creationdate><title>Bootstrapping semantic annotation for content-rich HTML documents</title><author>Mukherjee, S. ; Ramakrishnan, I.V. ; Singh, A.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i213t-58f6f0fb778c2c0c31940bd38db31e4014912239a70169f2fd89f2487630c3ef3</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2005</creationdate><topic>Computer science</topic><topic>HTML</topic><topic>Labeling</topic><topic>Next generation networking</topic><topic>Ontologies</topic><topic>Pricing</topic><topic>Resource description framework</topic><topic>Semantic Web</topic><topic>Vehicles</topic><topic>XML</topic><toplevel>online_resources</toplevel><creatorcontrib>Mukherjee, S.</creatorcontrib><creatorcontrib>Ramakrishnan, I.V.</creatorcontrib><creatorcontrib>Singh, A.</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan (POP) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP) 1998-present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Mukherjee, S.</au><au>Ramakrishnan, I.V.</au><au>Singh, A.</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Bootstrapping semantic annotation for content-rich HTML documents</atitle><btitle>21st International Conference on Data Engineering 
(ICDE'05)</btitle><stitle>ICDE</stitle><date>2005</date><risdate>2005</risdate><spage>583</spage><epage>593</epage><pages>583-593</pages><issn>1063-6382</issn><eissn>2375-026X</eissn><isbn>0769522858</isbn><isbn>9780769522852</isbn><abstract>Enormous amount of semantic data is still being encoded in HTML documents. Identifying and annotating the semantic concepts implicit in such documents makes them directly amenable for semantic Web processing. In this paper we describe a highly automated technique for annotating HTML documents, especially template-based content-rich documents, containing many different semantic concepts per document. Starting with a (small) seed of hand-labeled instances of semantic concepts in a set of HTML documents we bootstrap an annotation process that automatically identifies unlabeled concept instances present in other documents. The bootstrapping technique exploits the observation that semantically related items in content-rich documents exhibit consistency in presentation style and spatial locality to learn a statistical model for accurately identifying different semantic concepts in HTML documents drawn from a variety of Web sources. We also present experimental results on the effectiveness of the technique.</abstract><pub>IEEE</pub><doi>10.1109/ICDE.2005.28</doi><tpages>11</tpages><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISSN: 1063-6382
ispartof	21st International Conference on Data Engineering (ICDE'05), 2005, p.583-593
issn	1063-6382 2375-026X
language	eng
recordid	cdi_ieee_primary_1410176
source	IEEE Electronic Library (IEL) Conference Proceedings
subjects	Computer science ; HTML ; Labeling ; Next generation networking ; Ontologies ; Pricing ; Resource description framework ; Semantic Web ; Vehicles ; XML
title	Bootstrapping semantic annotation for content-rich HTML documents
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-29T05%3A02%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Bootstrapping%20semantic%20annotation%20for%20content-rich%20HTML%20documents&rft.btitle=21st%20International%20Conference%20on%20Data%20Engineering%20(ICDE'05)&rft.au=Mukherjee,%20S.&rft.date=2005&rft.spage=583&rft.epage=593&rft.pages=583-593&rft.issn=1063-6382&rft.eissn=2375-026X&rft.isbn=0769522858&rft.isbn_list=9780769522852&rft_id=info:doi/10.1109/ICDE.2005.28&rft_dat=%3Cieee_6IE%3E1410176%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=1410176&rfr_iscdi=true