Block-Level Linkes Based Content Extraction

We present block-level links based content extraction (BLCE), a method for extracting content from web pages using the link attributes of blocks, namely the number of links and the length of the link text (anchor text). We describe how to divide a web page into blocks and how to merge the sim...

Bibliographic details
Main authors: Shixing Shen, Hui Zhang
Format: Conference proceeding
Language: eng
container_end_page 333
container_issue
container_start_page 330
container_title
container_volume
creator Shixing Shen
Hui Zhang
description We present block-level links based content extraction (BLCE), a method for extracting content from web pages using the link attributes of blocks, namely the number of links and the length of the link text (anchor text). We describe how to divide a web page into blocks and how to merge similar blocks into one, and then compute the number of links and the total length of anchor text. We find that extracting content using only the raw number of links and length of anchor text is not effective, because both grow in proportion to the length of the page. Link density addresses this problem, so we use the content links ratios and the content anchor text ratios to describe the link attributes of the blocks. BLCE performs better than other methods, especially on newer web pages built with DIV and CSS, where traditional algorithms do not work well.
doi_str_mv 10.1109/PAAP.2011.49
format Conference Proceeding
fullrecord <record><control><sourceid>ieee_6IE</sourceid><recordid>TN_cdi_ieee_primary_6128527</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>6128527</ieee_id><sourcerecordid>6128527</sourcerecordid><originalsourceid>FETCH-ieee_primary_61285273</originalsourceid><addsrcrecordid>eNp9yjELglAUQOELJaTl1tbiHtq9PvU9RxWjwcGhXcRuYJqGT6L-fQTNTWf4DsCW0CPC-FAmSen5SOQF8QIsCkIpSaGiJZg-RcoVKAIDrO8SizCUYgW21jdEFKRiFaEJ-7Qfm84t-Mm9U7RDx9pJa80XJxuHmYfZyV_zVDdzOw4bMK51r9n-dQ27Y37OTm7LzNVjau_19K4i8lXoS_FfP8pHM2Q</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Block-Level Linkes Based Content Extraction</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Shixing Shen ; Hui Zhang</creator><creatorcontrib>Shixing Shen ; Hui Zhang</creatorcontrib><description>We present block-level links based content extraction (BLCE)-a method to extract content from the web pages by using the link attributes of blocks, which contains the number of links and the length of link text (anchor text).We describe how to divide one web page into blocks and how to merge the similar blocks into one, then compute the number of links and the total length of anchor text. We find that extracting content only with the number of links and length of anchor text is not effective because the number of links and length of link text are proportional to the length of page. Density of links is a good method to solve this. So we use the content links ratios and the content anchor text ratios to describe the link attribute of the blocks. 
BLCE performs better than other methods especially in the new web pages with DIV and CSS where traditional algorithm can't work well.</description><identifier>ISSN: 2168-3034</identifier><identifier>ISBN: 1457718081</identifier><identifier>ISBN: 9781457718083</identifier><identifier>DOI: 10.1109/PAAP.2011.49</identifier><identifier>LCCN: 2011935573</identifier><language>eng</language><publisher>IEEE</publisher><subject>block-level links ; Cascading style sheets ; content extraction ; Data mining ; HTML ; Internet ; merge block ; Navigation ; Probability distribution ; Web pages</subject><ispartof>2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming, 2011, p.330-333</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/6128527$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,776,780,785,786,2051,27904,54898</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/6128527$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Shixing Shen</creatorcontrib><creatorcontrib>Hui Zhang</creatorcontrib><title>Block-Level Linkes Based Content Extraction</title><title>2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming</title><addtitle>paap</addtitle><description>We present block-level links based content extraction (BLCE)-a method to extract content from the web pages by using the link attributes of blocks, which contains the number of links and the length of link text (anchor text).We describe how to divide one web page into blocks and how to merge the similar blocks into one, then compute the number of links and the total length of anchor text. 
We find that extracting content only with the number of links and length of anchor text is not effective because the number of links and length of link text are proportional to the length of page. Density of links is a good method to solve this. So we use the content links ratios and the content anchor text ratios to describe the link attribute of the blocks. BLCE performs better than other methods especially in the new web pages with DIV and CSS where traditional algorithm can't work well.</description><subject>block-level links</subject><subject>Cascading style sheets</subject><subject>content extraction</subject><subject>Data mining</subject><subject>HTML</subject><subject>Internet</subject><subject>merge block</subject><subject>Navigation</subject><subject>Probability distribution</subject><subject>Web pages</subject><issn>2168-3034</issn><isbn>1457718081</isbn><isbn>9781457718083</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2011</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>RIE</sourceid><recordid>eNp9yjELglAUQOELJaTl1tbiHtq9PvU9RxWjwcGhXcRuYJqGT6L-fQTNTWf4DsCW0CPC-FAmSen5SOQF8QIsCkIpSaGiJZg-RcoVKAIDrO8SizCUYgW21jdEFKRiFaEJ-7Qfm84t-Mm9U7RDx9pJa80XJxuHmYfZyV_zVDdzOw4bMK51r9n-dQ27Y37OTm7LzNVjau_19K4i8lXoS_FfP8pHM2Q</recordid><startdate>201112</startdate><enddate>201112</enddate><creator>Shixing Shen</creator><creator>Hui Zhang</creator><general>IEEE</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope></search><sort><creationdate>201112</creationdate><title>Block-Level Linkes Based Content Extraction</title><author>Shixing Shen ; Hui Zhang</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-ieee_primary_61285273</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2011</creationdate><topic>block-level 
links</topic><topic>Cascading style sheets</topic><topic>content extraction</topic><topic>Data mining</topic><topic>HTML</topic><topic>Internet</topic><topic>merge block</topic><topic>Navigation</topic><topic>Probability distribution</topic><topic>Web pages</topic><toplevel>online_resources</toplevel><creatorcontrib>Shixing Shen</creatorcontrib><creatorcontrib>Hui Zhang</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Shixing Shen</au><au>Hui Zhang</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Block-Level Linkes Based Content Extraction</atitle><btitle>2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming</btitle><stitle>paap</stitle><date>2011-12</date><risdate>2011</risdate><spage>330</spage><epage>333</epage><pages>330-333</pages><issn>2168-3034</issn><isbn>1457718081</isbn><isbn>9781457718083</isbn><abstract>We present block-level links based content extraction (BLCE)-a method to extract content from the web pages by using the link attributes of blocks, which contains the number of links and the length of link text (anchor text).We describe how to divide one web page into blocks and how to merge the similar blocks into one, then compute the number of links and the total length of anchor text. We find that extracting content only with the number of links and length of anchor text is not effective because the number of links and length of link text are proportional to the length of page. 
Density of links is a good method to solve this. So we use the content links ratios and the content anchor text ratios to describe the link attribute of the blocks. BLCE performs better than other methods especially in the new web pages with DIV and CSS where traditional algorithm can't work well.</abstract><pub>IEEE</pub><doi>10.1109/PAAP.2011.49</doi></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 2168-3034
ispartof 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming, 2011, p.330-333
issn 2168-3034
language eng
recordid cdi_ieee_primary_6128527
source IEEE Electronic Library (IEL) Conference Proceedings
subjects block-level links
Cascading style sheets
content extraction
Data mining
HTML
Internet
merge block
Navigation
Probability distribution
Web pages
title Block-Level Linkes Based Content Extraction
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-25T15%3A01%3A23IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Block-Level%20Linkes%20Based%20Content%20Extraction&rft.btitle=2011%20Fourth%20International%20Symposium%20on%20Parallel%20Architectures,%20Algorithms%20and%20Programming&rft.au=Shixing%20Shen&rft.date=2011-12&rft.spage=330&rft.epage=333&rft.pages=330-333&rft.issn=2168-3034&rft.isbn=1457718081&rft.isbn_list=9781457718083&rft_id=info:doi/10.1109/PAAP.2011.49&rft_dat=%3Cieee_6IE%3E6128527%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=6128527&rfr_iscdi=true
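
The abstract describes scoring blocks by link density rather than by raw link counts. The following is a minimal Python sketch of that idea, not the authors' implementation. Several details are assumptions, because the abstract does not define them: a block is taken to be an outermost <div>, the links ratio is the number of links divided by the block's text length, the anchor text ratio is the anchor text length divided by the block's text length, and the names BlockLinkStats, link_ratios, is_content and the 0.5 threshold are hypothetical. The block-merging step mentioned in the abstract is omitted.

```python
from html.parser import HTMLParser


class BlockLinkStats(HTMLParser):
    """Collect link statistics per block.

    Assumption: a block is an outermost <div>; the paper's own
    block-division and block-merging rules are not reproduced here.
    """

    def __init__(self):
        super().__init__()
        self.blocks = []      # statistics of finished blocks
        self._div_depth = 0   # nesting depth of <div> tags
        self._a_depth = 0     # nesting depth of <a> tags inside a block
        self._cur = None      # statistics of the block being read

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self._div_depth == 0:
                self._cur = {"links": 0, "anchor_len": 0, "text_len": 0}
            self._div_depth += 1
        elif tag == "a" and self._cur is not None:
            self._a_depth += 1
            self._cur["links"] += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth > 0:
            self._div_depth -= 1
            if self._div_depth == 0:
                # Outermost <div> closed: the block is complete.
                self.blocks.append(self._cur)
                self._cur = None
                self._a_depth = 0
        elif tag == "a" and self._a_depth > 0:
            self._a_depth -= 1

    def handle_data(self, data):
        if self._cur is None:
            return  # text outside any block is ignored
        n = len(data.strip())
        self._cur["text_len"] += n
        if self._a_depth > 0:
            self._cur["anchor_len"] += n


def extract_stats(html):
    """Return one statistics dict per outermost <div> in the page."""
    parser = BlockLinkStats()
    parser.feed(html)
    parser.close()
    return parser.blocks


def link_ratios(block):
    """Return (links ratio, anchor text ratio) under the assumed definitions."""
    text_len = max(block["text_len"], 1)  # avoid division by zero
    return block["links"] / text_len, block["anchor_len"] / text_len


def is_content(block, max_anchor_ratio=0.5):
    """Hypothetical rule: a block whose text is mostly anchor text is not content."""
    return link_ratios(block)[1] < max_anchor_ratio


SAMPLE = (
    '<div><a href="/">Home</a><a href="/news">News</a></div>'
    '<div>Link density separates body text from menus. '
    '<a href="/more">More</a></div>'
)

if __name__ == "__main__":
    for block in extract_stats(SAMPLE):
        print(block, link_ratios(block), is_content(block))
```

In the sample, the first block consists only of anchor text, so its anchor text ratio is 1.0 and the hypothetical rule rejects it; the second block is mostly plain text and is kept. Because both ratios are normalized by block length, a long article with many links is not penalized in the way a raw link count would penalize it.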