Block-Level Links Based Content Extraction
We present block-level links based content extraction (BLCE), a method for extracting content from web pages using the link attributes of blocks, namely the number of links and the length of the link text (anchor text). We describe how to divide one web page into blocks and how to merge the sim...
Saved in:
Main authors: | Shixing Shen, Hui Zhang |
---|---|
Format: | Conference proceedings |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | 333 |
---|---|
container_issue | |
container_start_page | 330 |
container_title | |
container_volume | |
creator | Shixing Shen ; Hui Zhang |
description | We present block-level links based content extraction (BLCE), a method for extracting content from web pages using the link attributes of blocks, namely the number of links and the length of the link text (anchor text). We describe how to divide one web page into blocks and how to merge similar blocks into one, and then compute the number of links and the total length of anchor text for each block. We find that extracting content using only the raw number of links and the raw length of anchor text is not effective, because both quantities grow in proportion to the length of the page. Link density addresses this, so we use content link ratios and content anchor text ratios to describe the link attributes of the blocks. BLCE performs better than other methods, especially on newer web pages built with DIV and CSS, where traditional algorithms do not work well. |
doi_str_mv | 10.1109/PAAP.2011.49 |
format | Conference Proceeding |
fullrecord | <record><control><sourceid>ieee_6IE</sourceid><recordid>TN_cdi_ieee_primary_6128527</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>6128527</ieee_id><sourcerecordid>6128527</sourcerecordid><originalsourceid>FETCH-ieee_primary_61285273</originalsourceid><addsrcrecordid>eNp9yjELglAUQOELJaTl1tbiHtq9PvU9RxWjwcGhXcRuYJqGT6L-fQTNTWf4DsCW0CPC-FAmSen5SOQF8QIsCkIpSaGiJZg-RcoVKAIDrO8SizCUYgW21jdEFKRiFaEJ-7Qfm84t-Mm9U7RDx9pJa80XJxuHmYfZyV_zVDdzOw4bMK51r9n-dQ27Y37OTm7LzNVjau_19K4i8lXoS_FfP8pHM2Q</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Block-Level Linkes Based Content Extraction</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Shixing Shen ; Hui Zhang</creator><creatorcontrib>Shixing Shen ; Hui Zhang</creatorcontrib><description>We present block-level links based content extraction (BLCE)-a method to extract content from the web pages by using the link attributes of blocks, which contains the number of links and the length of link text (anchor text).We describe how to divide one web page into blocks and how to merge the similar blocks into one, then compute the number of links and the total length of anchor text. We find that extracting content only with the number of links and length of anchor text is not effective because the number of links and length of link text are proportional to the length of page. Density of links is a good method to solve this. So we use the content links ratios and the content anchor text ratios to describe the link attribute of the blocks. 
BLCE performs better than other methods especially in the new web pages with DIV and CSS where traditional algorithm can't work well.</description><identifier>ISSN: 2168-3034</identifier><identifier>ISBN: 1457718081</identifier><identifier>ISBN: 9781457718083</identifier><identifier>DOI: 10.1109/PAAP.2011.49</identifier><identifier>LCCN: 2011935573</identifier><language>eng</language><publisher>IEEE</publisher><subject>block-level links ; Cascading style sheets ; content extraction ; Data mining ; HTML ; Internet ; merge block ; Navigation ; Probability distribution ; Web pages</subject><ispartof>2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming, 2011, p.330-333</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/6128527$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,776,780,785,786,2051,27904,54898</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/6128527$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Shixing Shen</creatorcontrib><creatorcontrib>Hui Zhang</creatorcontrib><title>Block-Level Linkes Based Content Extraction</title><title>2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming</title><addtitle>paap</addtitle><description>We present block-level links based content extraction (BLCE)-a method to extract content from the web pages by using the link attributes of blocks, which contains the number of links and the length of link text (anchor text).We describe how to divide one web page into blocks and how to merge the similar blocks into one, then compute the number of links and the total length of anchor text. 
We find that extracting content only with the number of links and length of anchor text is not effective because the number of links and length of link text are proportional to the length of page. Density of links is a good method to solve this. So we use the content links ratios and the content anchor text ratios to describe the link attribute of the blocks. BLCE performs better than other methods especially in the new web pages with DIV and CSS where traditional algorithm can't work well.</description><subject>block-level links</subject><subject>Cascading style sheets</subject><subject>content extraction</subject><subject>Data mining</subject><subject>HTML</subject><subject>Internet</subject><subject>merge block</subject><subject>Navigation</subject><subject>Probability distribution</subject><subject>Web pages</subject><issn>2168-3034</issn><isbn>1457718081</isbn><isbn>9781457718083</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2011</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>RIE</sourceid><recordid>eNp9yjELglAUQOELJaTl1tbiHtq9PvU9RxWjwcGhXcRuYJqGT6L-fQTNTWf4DsCW0CPC-FAmSen5SOQF8QIsCkIpSaGiJZg-RcoVKAIDrO8SizCUYgW21jdEFKRiFaEJ-7Qfm84t-Mm9U7RDx9pJa80XJxuHmYfZyV_zVDdzOw4bMK51r9n-dQ27Y37OTm7LzNVjau_19K4i8lXoS_FfP8pHM2Q</recordid><startdate>201112</startdate><enddate>201112</enddate><creator>Shixing Shen</creator><creator>Hui Zhang</creator><general>IEEE</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope></search><sort><creationdate>201112</creationdate><title>Block-Level Linkes Based Content Extraction</title><author>Shixing Shen ; Hui Zhang</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-ieee_primary_61285273</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2011</creationdate><topic>block-level 
links</topic><topic>Cascading style sheets</topic><topic>content extraction</topic><topic>Data mining</topic><topic>HTML</topic><topic>Internet</topic><topic>merge block</topic><topic>Navigation</topic><topic>Probability distribution</topic><topic>Web pages</topic><toplevel>online_resources</toplevel><creatorcontrib>Shixing Shen</creatorcontrib><creatorcontrib>Hui Zhang</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Shixing Shen</au><au>Hui Zhang</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Block-Level Linkes Based Content Extraction</atitle><btitle>2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming</btitle><stitle>paap</stitle><date>2011-12</date><risdate>2011</risdate><spage>330</spage><epage>333</epage><pages>330-333</pages><issn>2168-3034</issn><isbn>1457718081</isbn><isbn>9781457718083</isbn><abstract>We present block-level links based content extraction (BLCE)-a method to extract content from the web pages by using the link attributes of blocks, which contains the number of links and the length of link text (anchor text).We describe how to divide one web page into blocks and how to merge the similar blocks into one, then compute the number of links and the total length of anchor text. We find that extracting content only with the number of links and length of anchor text is not effective because the number of links and length of link text are proportional to the length of page. 
Density of links is a good method to solve this. So we use the content links ratios and the content anchor text ratios to describe the link attribute of the blocks. BLCE performs better than other methods especially in the new web pages with DIV and CSS where traditional algorithm can't work well.</abstract><pub>IEEE</pub><doi>10.1109/PAAP.2011.49</doi></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 2168-3034 |
ispartof | 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming, 2011, p.330-333 |
issn | 2168-3034 |
language | eng |
recordid | cdi_ieee_primary_6128527 |
source | IEEE Electronic Library (IEL) Conference Proceedings |
subjects | block-level links ; Cascading style sheets ; content extraction ; Data mining ; HTML ; Internet ; merge block ; Navigation ; Probability distribution ; Web pages |
title | Block-Level Links Based Content Extraction |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-25T15%3A01%3A23IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Block-Level%20Linkes%20Based%20Content%20Extraction&rft.btitle=2011%20Fourth%20International%20Symposium%20on%20Parallel%20Architectures,%20Algorithms%20and%20Programming&rft.au=Shixing%20Shen&rft.date=2011-12&rft.spage=330&rft.epage=333&rft.pages=330-333&rft.issn=2168-3034&rft.isbn=1457718081&rft.isbn_list=9781457718083&rft_id=info:doi/10.1109/PAAP.2011.49&rft_dat=%3Cieee_6IE%3E6128527%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=6128527&rfr_iscdi=true |
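
The abstract in the description field argues for describing each block by link ratios rather than raw link counts. The record does not give the paper's formulas, so the following is only a minimal Python sketch of that general idea under stated assumptions: a block is taken to be a top-level `<div>`, the anchor text ratio is assumed to be anchor-text characters divided by all text characters in the block, and the `max_anchor_ratio` threshold and all function names are hypothetical. The paper's block merging step is not reproduced.

```python
from html.parser import HTMLParser


class BlockLinkStats(HTMLParser):
    """Collect text length, link count, and anchor-text length per block.

    Assumption: a "block" is a top-level <div>. The paper's own block
    division and merging rules are not available here.
    """

    def __init__(self):
        super().__init__()
        self.blocks = []     # one stats dict per finished block
        self._depth = 0      # current <div> nesting depth
        self._in_anchor = 0  # current <a> nesting depth inside the block
        self._cur = None     # stats for the block being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self._depth == 0:
                self._cur = {"text_len": 0, "links": 0, "anchor_len": 0}
                self._in_anchor = 0
            self._depth += 1
        elif tag == "a" and self._cur is not None:
            self._cur["links"] += 1
            self._in_anchor += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append(self._cur)
                self._cur = None
        elif tag == "a" and self._in_anchor > 0:
            self._in_anchor -= 1

    def handle_data(self, data):
        if self._cur is None:
            return
        n = len(data.strip())
        self._cur["text_len"] += n
        if self._in_anchor:
            self._cur["anchor_len"] += n


def anchor_text_ratio(block):
    """Share of a block's text that is anchor text (assumed definition).

    Blocks without any text are treated as pure navigation (ratio 1.0).
    """
    if block["text_len"] == 0:
        return 1.0
    return block["anchor_len"] / block["text_len"]


def content_blocks(html, max_anchor_ratio=0.5):
    """Keep blocks whose anchor text ratio is at most the hypothetical threshold."""
    parser = BlockLinkStats()
    parser.feed(html)
    parser.close()
    return [b for b in parser.blocks if anchor_text_ratio(b) <= max_anchor_ratio]


# Toy page: a navigation block followed by an article block.
page = (
    '<div><a href="/">Home</a><a href="/news">News</a></div>'
    '<div>A long paragraph of article text with one <a href="/x">link</a>.</div>'
)
kept = content_blocks(page)
print(len(kept), [round(anchor_text_ratio(b), 3) for b in kept])
```

Because the ratio normalizes by the block's own text length, a long article block with a few links scores low while a short navigation block made entirely of links scores high, which is the reason the abstract gives for preferring ratios over raw counts.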