Efficient Sentiment-Aware Web Crawling Methods for Constructing Sentiment Dictionary

Bibliographic details
Published in:	IEEE Access 2021, Vol. 9, pp. 161208-161223
Main authors:	On, Byung-Won; Jo, Jun-Young; Shin, Hyunkwang; Gim, Jangwon; Choi, Gyu Sang; Jung, Soo-Mok
Format:	Article
Language:	English
Subjects:	Crawlers; Dictionaries; Feature extraction; Hash join; Sentiment analysis; Sentiment lexicon; Uniform resource locators; Vocabulary; Web crawling; Web pages; Websites
Online access:	Full text

Description
Summary:	In traditional web crawling, every crawled web page is first stored in a database. As a result, this approach can store unnecessary web pages, and it requires additional running time to construct a sentiment dictionary for a particular domain, because sentiment words must be identified by scanning all web pages in the database. To address these problems, we first define the sentiment-aware web crawling problem and then propose two hash-based methods to implement it. One is based on a hash join and the other on a bucket-sorted hash join. In particular, we propose a novel bucket-sorted hash join for efficient sentiment-aware web crawling. Our experimental results show that the proposed web crawling method using the bucket-sorted hash join outperforms existing web crawling methods by significantly reducing both running time and storage space. With the proposed method, the sentiment-aware task takes 0.016 seconds per web page, and database space is reduced by 59% compared to existing web crawling methods.
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2021.3129187
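
The summary describes filtering pages during the crawl with a hash join instead of storing everything and scanning it later. The record does not give the authors' algorithm, so the following is only a minimal sketch of a generic hash join used as such a filter, not the paper's method, and it does not attempt the bucket-sorted variant. The function names, the seed word list, the polarity labels, and the example.com URLs are all assumptions made for illustration.

```python
import string


def build_hash_table(seed_words):
    """Build phase: hash the (small) seed sentiment word list.

    seed_words is a hypothetical list of (word, polarity) pairs.
    """
    return {word.lower(): polarity for word, polarity in seed_words}


def probe_page(table, page_text):
    """Probe phase: look up every token of one page in the hash table."""
    matches = []
    for token in page_text.lower().split():
        token = token.strip(string.punctuation)
        if token in table:
            matches.append((token, table[token]))
    return matches


def crawl_filter(pages, seed_words):
    """Keep only pages that contain at least one seed sentiment word.

    pages is an iterable of (url, text) pairs, standing in for the
    output of a crawler. Pages without a match are never stored.
    """
    table = build_hash_table(seed_words)
    stored = {}
    for url, text in pages:
        matches = probe_page(table, text)
        if matches:
            stored[url] = matches
    return stored
```

As a usage example, calling `crawl_filter` with the seed list `[("good", "positive"), ("terrible", "negative")]` and two pages, one reading "The battery life is good." and one reading "Shipping takes three days.", would store only the first page. Each token lookup is an average constant-time dictionary probe, which is the property that makes a hash join attractive for a per-page check at crawl time.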