AnnoFin–A hybrid algorithm to annotate financial text

•AnnoFin helps a user classify financial text data into ten categories.•When trained with 30% of the data, AnnoFin has an accuracy of 73.56%.•The accuracy increases by about 2% when the training data is increased by 10%. In this work, we study the problem of annotating a large volume of financial text by l...

Bibliographic details
Published in: Expert systems with applications 2017-12, Vol.88, p.270-275
Main authors: Swarup Das, Ananda ; Mehta, Sameep ; Subramaniam, L.V.
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 275
container_issue
container_start_page 270
container_title Expert systems with applications
container_volume 88
creator Swarup Das, Ananda
Mehta, Sameep
Subramaniam, L.V.
description •AnnoFin helps a user classify financial text data into ten categories.•When trained with 30% of the data, AnnoFin has an accuracy of 73.56%.•The accuracy increases by about 2% when the training data is increased by 10%. In this work, we study the problem of annotating a large volume of financial text by learning from a small set of human-annotated training data. The training data is prepared by randomly selecting sentences from the large corpus of financial text. Conventionally, a bootstrapping algorithm is used to annotate a large volume of unlabeled data by learning from a small set of annotated data. However, that small set of annotated data has to be carefully chosen as seed data. Our approach departs from conventional bootstrapping in that we let users select the seed data at random. We show that our proposed algorithm has an accuracy of 73.56% in classifying the financial texts into ten categories (“Accounting”, “Cost”, “Employee”, “Financing”, “Sales”, “Investments”, “Operations”, “Profit”, “Regulations” and “Irrelevant”) even when the training data is just 30% of the total data set. Additionally, the accuracy improves by an average of approximately 2% for each 10% increase in the training data, and the accuracy of our system is 77.91% when the training data is about 50% of the total data set. Because a dictionary of hand-chosen keywords prepared by domain experts is often used for financial text extraction, we assumed that the different classes are almost linearly separable by hyperplanes and therefore used a Linear Support Vector Machine along with a modified version of the Label Propagation Algorithm, which exploits the notion of neighborhood (in Euclidean space) for classification. We believe that our proposed techniques will help Early Warning Systems used in banks, where large volumes of unstructured text need to be processed for better insights about a company.
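The abstract mentions label propagation that exploits neighborhood in Euclidean space. The record does not describe the authors' modified algorithm or its coupling with the Linear Support Vector Machine, so the following is only a minimal sketch of the general idea, under stated assumptions: sentences are already represented as numeric feature vectors, and an unlabeled point adopts the majority label of the labeled points within a fixed radius. The function name `propagate_labels`, the `radius` and `max_rounds` parameters, and the toy vectors are all hypothetical.

```python
import math
from collections import Counter


def propagate_labels(points, seed_labels, radius=1.5, max_rounds=10):
    """Spread seed labels to nearby unlabeled points (illustrative only).

    points: list of equal-length numeric tuples (assumed sentence features).
    seed_labels: dict mapping a point index to its human-annotated category.
    Returns a dict of index -> category; unreachable points stay unlabeled.
    """
    labels = dict(seed_labels)
    for _ in range(max_rounds):
        updates = {}
        for i, p in enumerate(points):
            if i in labels:
                continue
            # Vote among labeled points inside the Euclidean neighborhood.
            votes = Counter(
                labels[j] for j in labels if math.dist(p, points[j]) <= radius
            )
            if votes:
                updates[i] = votes.most_common(1)[0][0]
        if not updates:
            break  # nothing new was reachable in this round
        labels.update(updates)
    return labels


# Toy data: two seeds, three reachable points, one isolated point.
toy_points = [(0, 0), (1, 0), (2, 0), (10, 10), (11, 10), (50, 50)]
toy_seeds = {0: "Cost", 3: "Sales"}
print(propagate_labels(toy_points, toy_seeds))
```

In this toy run the point at index 2 is only labeled in the second round, once its neighbor at index 1 has received a label, which is the propagation effect; the isolated point at index 5 is left unlabeled rather than forced into a category.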
doi_str_mv 10.1016/j.eswa.2017.07.016
format Article
publisher New York: Elsevier Ltd
rights 2017 Elsevier Ltd
tpages 6
fulltext fulltext
identifier ISSN: 0957-4174
ispartof Expert systems with applications, 2017-12, Vol.88, p.270-275
issn 0957-4174
1873-6793
language eng
recordid cdi_proquest_journals_1956485974
source ScienceDirect Journals (5 years ago - present)
subjects Accounting
Algorithms
Clustering
Early warning systems
Euclidean geometry
Financial sentences
Hyperplanes
Label propagation algorithm
Machine learning
Sentences
SVM
Text classification
Texts
Training
title AnnoFin–A hybrid algorithm to annotate financial text
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-08T06%3A11%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=AnnoFin%E2%80%93A%20hybrid%20algorithm%20to%20annotate%20financial%20text&rft.jtitle=Expert%20systems%20with%20applications&rft.au=Swarup%20Das,%20Ananda&rft.date=2017-12-01&rft.volume=88&rft.spage=270&rft.epage=275&rft.pages=270-275&rft.issn=0957-4174&rft.eissn=1873-6793&rft_id=info:doi/10.1016/j.eswa.2017.07.016&rft_dat=%3Cproquest_cross%3E1956485974%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1956485974&rft_id=info:pmid/&rft_els_id=S0957417417304852&rfr_iscdi=true