Tweet Classification Toward Twitter-Based Disease Surveillance: New Data, Methods, and Evaluations

The amount of medical and clinical-related information on the Web is increasing. Among the different types of information available, social media-based data obtained directly from people are particularly valuable and are attracting significant attention. To encourage medical natural language process...

Bibliographic details
Published in:	Journal of medical Internet research 2019-02, Vol.21 (2), p.e12783
Main authors:	Wakamiya, Shoko; Morita, Mizuki; Kano, Yoshinobu; Ohkuma, Tomoko; Aramaki, Eiji
Format:	Article
Language:	eng
Subjects:	Analysis; Computational linguistics; Data Mining - statistics & numerical data; Databases, Factual - trends; Humans; Influenza; Internet; Language processing; Machine Learning; Medical research; Medicine, Experimental; Methods; Natural language interfaces; Natural Language Processing; Original Paper; Population Surveillance; Social media; Social Media - trends; Web sites
Online access:	Full text

container_end_page
container_issue	2
container_start_page	e12783
container_title	Journal of medical Internet research
container_volume	21
creator	Wakamiya, Shoko; Morita, Mizuki; Kano, Yoshinobu; Ohkuma, Tomoko; Aramaki, Eiji
description	The amount of medical and clinical-related information on the Web is increasing. Among the different types of information available, social media-based data obtained directly from people are particularly valuable and are attracting significant attention. To encourage medical natural language processing (NLP) research exploiting social media data, the 13th NII Testbeds and Community for Information access Research (NTCIR-13) Medical natural language processing for Web document (MedWeb) provides pseudo-Twitter messages in a cross-language and multi-label corpus, covering 3 languages (Japanese, English, and Chinese) and annotated with 8 symptom labels (such as cold, fever, and flu). Then, participants classify each tweet into 1 of the 2 categories: those containing a patient's symptom and those that do not. This study aimed to present the results of groups participating in a Japanese subtask, English subtask, and Chinese subtask along with discussions, to clarify the issues that need to be resolved in the field of medical NLP. In summary, 8 groups (19 systems) participated in the Japanese subtask, 4 groups (12 systems) participated in the English subtask, and 2 groups (6 systems) participated in the Chinese subtask. In total, 2 baseline systems were constructed for each subtask. The performance of the participant and baseline systems was assessed using the exact match accuracy, F-measure based on precision and recall, and Hamming loss. The best system achieved 0.880 in exact match accuracy, 0.920 in F-measure, and 0.019 in Hamming loss. The averages of exact match accuracy, F-measure, and Hamming loss for the Japanese subtask were 0.720, 0.820, and 0.051; those for the English subtask were 0.770, 0.850, and 0.037; and those for the Chinese subtask were 0.810, 0.880, and 0.032, respectively. This paper presented and discussed the performance of systems participating in the NTCIR-13 MedWeb task. As the MedWeb task settings can be formalized as the factualization of text, the achievement of this task could be directly applied to practical clinical applications.
doi_str_mv	10.2196/12783
format	Article
fullrecord	<record><control><sourceid>gale_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_6401666</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A769356349</galeid><sourcerecordid>A769356349</sourcerecordid><originalsourceid>FETCH-LOGICAL-c530t-9a2233bffe828d64bad5b989cb0b0358c80bed87582edaa85abfd705ec6b9c7d3</originalsourceid><addsrcrecordid>eNptkl1rFDEUhoMottb-BQmIYKFTk8l8JF4I7bZqodaLrtfhJDmzjcxO6iSzq__ebLeWLkgucjh5zgMnvIQccnZSctV84GUrxTOyzyshCylb_vxJvUdexfiTsZJVir8ke4K1sq5Yu0_MfI2Y6KyHGH3nLSQfBjoPaxgdna99SjgWZxDR0XMfMRf0ZhpX6PseBosf6TWu6TkkOKbfMN0GF48pDI5erKCf7mXxNXnRQR_x8OE-ID8-X8xnX4ur718uZ6dXha0FS4WCshTCdB3KUrqmMuBqo6SyhhkmamklM-hkW8sSHYCswXSuZTXaxijbOnFAPm29d5NZorM4pBF6fTf6JYx_dACvd18Gf6sXYaWbivGmabLg_YNgDL8mjEkvfbS42RTDFHXJZcUrVTcio2-36AJ61H7oQjbaDa5P20aJzFQqUyf_ofJxuPQ2DNj53N8ZONoZyEzC32kBU4z68uZ6l323Ze0YYhyxe9yUM72JhL6PRObePP2WR-pfBsRfOwmwAg</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2184149563</pqid></control><display><type>article</type><title>Tweet Classification Toward Twitter-Based Disease Surveillance: New Data, Methods, and Evaluations</title><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>PubMed Central Open Access</source><source>EZB-FREE-00999 freely available EZB journals</source><source>PubMed Central</source><creator>Wakamiya, Shoko ; Morita, Mizuki ; Kano, Yoshinobu ; Ohkuma, Tomoko ; Aramaki, Eiji</creator><creatorcontrib>Wakamiya, Shoko ; Morita, Mizuki ; Kano, Yoshinobu ; Ohkuma, Tomoko ; Aramaki, Eiji</creatorcontrib><description>The amount of medical and clinical-related information on the Web is increasing. Among the different types of information available, social media-based data obtained directly from people are particularly valuable and are attracting significant attention. 
To encourage medical natural language processing (NLP) research exploiting social media data, the 13th NII Testbeds and Community for Information access Research (NTCIR-13) Medical natural language processing for Web document (MedWeb) provides pseudo-Twitter messages in a cross-language and multi-label corpus, covering 3 languages (Japanese, English, and Chinese) and annotated with 8 symptom labels (such as cold, fever, and flu). Then, participants classify each tweet into 1 of the 2 categories: those containing a patient's symptom and those that do not. This study aimed to present the results of groups participating in a Japanese subtask, English subtask, and Chinese subtask along with discussions, to clarify the issues that need to be resolved in the field of medical NLP. In summary, 8 groups (19 systems) participated in the Japanese subtask, 4 groups (12 systems) participated in the English subtask, and 2 groups (6 systems) participated in the Chinese subtask. In total, 2 baseline systems were constructed for each subtask. The performance of the participant and baseline systems was assessed using the exact match accuracy, F-measure based on precision and recall, and Hamming loss. The best system achieved exactly 0.880 match accuracy, 0.920 F-measure, and 0.019 Hamming loss. The averages of match accuracy, F-measure, and Hamming loss for the Japanese subtask were 0.720, 0.820, and 0.051; those for the English subtask were 0.770, 0.850, and 0.037; and those for the Chinese subtask were 0.810, 0.880, and 0.032, respectively. This paper presented and discussed the performance of systems participating in the NTCIR-13 MedWeb task. 
As the MedWeb task settings can be formalized as the factualization of text, the achievement of this task could be directly applied to practical clinical applications.</description><identifier>ISSN: 1438-8871</identifier><identifier>ISSN: 1439-4456</identifier><identifier>EISSN: 1438-8871</identifier><identifier>DOI: 10.2196/12783</identifier><identifier>PMID: 30785407</identifier><language>eng</language><publisher>Canada: Journal of Medical Internet Research</publisher><subject>Analysis ; Computational linguistics ; Data Mining - statistics & numerical data ; Databases, Factual - trends ; Humans ; Influenza ; Internet ; Language processing ; Machine Learning ; Medical research ; Medicine, Experimental ; Methods ; Natural language interfaces ; Natural Language Processing ; Original Paper ; Population Surveillance ; Social media ; Social Media - trends ; Web sites</subject><ispartof>Journal of medical Internet research, 2019-02, Vol.21 (2), p.e12783</ispartof><rights>Shoko Wakamiya, Mizuki Morita, Yoshinobu Kano, Tomoko Ohkuma, Eiji Aramaki. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 20.02.2019.</rights><rights>COPYRIGHT 2019 Journal of Medical Internet Research</rights><rights>Shoko Wakamiya, Mizuki Morita, Yoshinobu Kano, Tomoko Ohkuma, Eiji Aramaki. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 20.02.2019. 
2019</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c530t-9a2233bffe828d64bad5b989cb0b0358c80bed87582edaa85abfd705ec6b9c7d3</citedby><cites>FETCH-LOGICAL-c530t-9a2233bffe828d64bad5b989cb0b0358c80bed87582edaa85abfd705ec6b9c7d3</cites><orcidid>0000-0002-9371-1340 ; 0000-0001-7864-842X ; 0000-0001-8592-5499 ; 0000-0003-0201-3609 ; 0000-0002-5078-4814</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>230,314,727,780,784,864,885,27924,27925</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/30785407$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Wakamiya, Shoko</creatorcontrib><creatorcontrib>Morita, Mizuki</creatorcontrib><creatorcontrib>Kano, Yoshinobu</creatorcontrib><creatorcontrib>Ohkuma, Tomoko</creatorcontrib><creatorcontrib>Aramaki, Eiji</creatorcontrib><title>Tweet Classification Toward Twitter-Based Disease Surveillance: New Data, Methods, and Evaluations</title><title>Journal of medical Internet research</title><addtitle>J Med Internet Res</addtitle><description>The amount of medical and clinical-related information on the Web is increasing. Among the different types of information available, social media-based data obtained directly from people are particularly valuable and are attracting significant attention. To encourage medical natural language processing (NLP) research exploiting social media data, the 13th NII Testbeds and Community for Information access Research (NTCIR-13) Medical natural language processing for Web document (MedWeb) provides pseudo-Twitter messages in a cross-language and multi-label corpus, covering 3 languages (Japanese, English, and Chinese) and annotated with 8 symptom labels (such as cold, fever, and flu). 
Then, participants classify each tweet into 1 of the 2 categories: those containing a patient's symptom and those that do not. This study aimed to present the results of groups participating in a Japanese subtask, English subtask, and Chinese subtask along with discussions, to clarify the issues that need to be resolved in the field of medical NLP. In summary, 8 groups (19 systems) participated in the Japanese subtask, 4 groups (12 systems) participated in the English subtask, and 2 groups (6 systems) participated in the Chinese subtask. In total, 2 baseline systems were constructed for each subtask. The performance of the participant and baseline systems was assessed using the exact match accuracy, F-measure based on precision and recall, and Hamming loss. The best system achieved exactly 0.880 match accuracy, 0.920 F-measure, and 0.019 Hamming loss. The averages of match accuracy, F-measure, and Hamming loss for the Japanese subtask were 0.720, 0.820, and 0.051; those for the English subtask were 0.770, 0.850, and 0.037; and those for the Chinese subtask were 0.810, 0.880, and 0.032, respectively. This paper presented and discussed the performance of systems participating in the NTCIR-13 MedWeb task. 
As the MedWeb task settings can be formalized as the factualization of text, the achievement of this task could be directly applied to practical clinical applications.</description><subject>Analysis</subject><subject>Computational linguistics</subject><subject>Data Mining - statistics & numerical data</subject><subject>Databases, Factual - trends</subject><subject>Humans</subject><subject>Influenza</subject><subject>Internet</subject><subject>Language processing</subject><subject>Machine Learning</subject><subject>Medical research</subject><subject>Medicine, Experimental</subject><subject>Methods</subject><subject>Natural language interfaces</subject><subject>Natural Language Processing</subject><subject>Original Paper</subject><subject>Population Surveillance</subject><subject>Social media</subject><subject>Social Media - trends</subject><subject>Web sites</subject><issn>1438-8871</issn><issn>1439-4456</issn><issn>1438-8871</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2019</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><recordid>eNptkl1rFDEUhoMottb-BQmIYKFTk8l8JF4I7bZqodaLrtfhJDmzjcxO6iSzq__ebLeWLkgucjh5zgMnvIQccnZSctV84GUrxTOyzyshCylb_vxJvUdexfiTsZJVir8ke4K1sq5Yu0_MfI2Y6KyHGH3nLSQfBjoPaxgdna99SjgWZxDR0XMfMRf0ZhpX6PseBosf6TWu6TkkOKbfMN0GF48pDI5erKCf7mXxNXnRQR_x8OE-ID8-X8xnX4ur718uZ6dXha0FS4WCshTCdB3KUrqmMuBqo6SyhhkmamklM-hkW8sSHYCswXSuZTXaxijbOnFAPm29d5NZorM4pBF6fTf6JYx_dACvd18Gf6sXYaWbivGmabLg_YNgDL8mjEkvfbS42RTDFHXJZcUrVTcio2-36AJ61H7oQjbaDa5P20aJzFQqUyf_ofJxuPQ2DNj53N8ZONoZyEzC32kBU4z68uZ6l323Ze0YYhyxe9yUM72JhL6PRObePP2WR-pfBsRfOwmwAg</recordid><startdate>20190220</startdate><enddate>20190220</enddate><creator>Wakamiya, Shoko</creator><creator>Morita, Mizuki</creator><creator>Kano, Yoshinobu</creator><creator>Ohkuma, Tomoko</creator><creator>Aramaki, Eiji</creator><general>Journal of Medical Internet Research</general><general>JMIR 
Publications</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>ISN</scope><scope>7X8</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0002-9371-1340</orcidid><orcidid>https://orcid.org/0000-0001-7864-842X</orcidid><orcidid>https://orcid.org/0000-0001-8592-5499</orcidid><orcidid>https://orcid.org/0000-0003-0201-3609</orcidid><orcidid>https://orcid.org/0000-0002-5078-4814</orcidid></search><sort><creationdate>20190220</creationdate><title>Tweet Classification Toward Twitter-Based Disease Surveillance: New Data, Methods, and Evaluations</title><author>Wakamiya, Shoko ; Morita, Mizuki ; Kano, Yoshinobu ; Ohkuma, Tomoko ; Aramaki, Eiji</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c530t-9a2233bffe828d64bad5b989cb0b0358c80bed87582edaa85abfd705ec6b9c7d3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2019</creationdate><topic>Analysis</topic><topic>Computational linguistics</topic><topic>Data Mining - statistics & numerical data</topic><topic>Databases, Factual - trends</topic><topic>Humans</topic><topic>Influenza</topic><topic>Internet</topic><topic>Language processing</topic><topic>Machine Learning</topic><topic>Medical research</topic><topic>Medicine, Experimental</topic><topic>Methods</topic><topic>Natural language interfaces</topic><topic>Natural Language Processing</topic><topic>Original Paper</topic><topic>Population Surveillance</topic><topic>Social media</topic><topic>Social Media - trends</topic><topic>Web sites</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Wakamiya, Shoko</creatorcontrib><creatorcontrib>Morita, Mizuki</creatorcontrib><creatorcontrib>Kano, Yoshinobu</creatorcontrib><creatorcontrib>Ohkuma, Tomoko</creatorcontrib><creatorcontrib>Aramaki, 
Eiji</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Canada</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Journal of medical Internet research</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Wakamiya, Shoko</au><au>Morita, Mizuki</au><au>Kano, Yoshinobu</au><au>Ohkuma, Tomoko</au><au>Aramaki, Eiji</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Tweet Classification Toward Twitter-Based Disease Surveillance: New Data, Methods, and Evaluations</atitle><jtitle>Journal of medical Internet research</jtitle><addtitle>J Med Internet Res</addtitle><date>2019-02-20</date><risdate>2019</risdate><volume>21</volume><issue>2</issue><spage>e12783</spage><pages>e12783-</pages><issn>1438-8871</issn><issn>1439-4456</issn><eissn>1438-8871</eissn><abstract>The amount of medical and clinical-related information on the Web is increasing. Among the different types of information available, social media-based data obtained directly from people are particularly valuable and are attracting significant attention. To encourage medical natural language processing (NLP) research exploiting social media data, the 13th NII Testbeds and Community for Information access Research (NTCIR-13) Medical natural language processing for Web document (MedWeb) provides pseudo-Twitter messages in a cross-language and multi-label corpus, covering 3 languages (Japanese, English, and Chinese) and annotated with 8 symptom labels (such as cold, fever, and flu). Then, participants classify each tweet into 1 of the 2 categories: those containing a patient's symptom and those that do not. 
This study aimed to present the results of groups participating in a Japanese subtask, English subtask, and Chinese subtask along with discussions, to clarify the issues that need to be resolved in the field of medical NLP. In summary, 8 groups (19 systems) participated in the Japanese subtask, 4 groups (12 systems) participated in the English subtask, and 2 groups (6 systems) participated in the Chinese subtask. In total, 2 baseline systems were constructed for each subtask. The performance of the participant and baseline systems was assessed using the exact match accuracy, F-measure based on precision and recall, and Hamming loss. The best system achieved exactly 0.880 match accuracy, 0.920 F-measure, and 0.019 Hamming loss. The averages of match accuracy, F-measure, and Hamming loss for the Japanese subtask were 0.720, 0.820, and 0.051; those for the English subtask were 0.770, 0.850, and 0.037; and those for the Chinese subtask were 0.810, 0.880, and 0.032, respectively. This paper presented and discussed the performance of systems participating in the NTCIR-13 MedWeb task. As the MedWeb task settings can be formalized as the factualization of text, the achievement of this task could be directly applied to practical clinical applications.</abstract><cop>Canada</cop><pub>Journal of Medical Internet Research</pub><pmid>30785407</pmid><doi>10.2196/12783</doi><orcidid>https://orcid.org/0000-0002-9371-1340</orcidid><orcidid>https://orcid.org/0000-0001-7864-842X</orcidid><orcidid>https://orcid.org/0000-0001-8592-5499</orcidid><orcidid>https://orcid.org/0000-0003-0201-3609</orcidid><orcidid>https://orcid.org/0000-0002-5078-4814</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 1438-8871
ispartof	Journal of medical Internet research, 2019-02, Vol.21 (2), p.e12783
issn	1438-8871 1439-4456 1438-8871
language	eng
recordid	cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_6401666
source	MEDLINE; DOAJ Directory of Open Access Journals; PubMed Central Open Access; EZB-FREE-00999 freely available EZB journals; PubMed Central
subjects	Analysis; Computational linguistics; Data Mining - statistics & numerical data; Databases, Factual - trends; Humans; Influenza; Internet; Language processing; Machine Learning; Medical research; Medicine, Experimental; Methods; Natural language interfaces; Natural Language Processing; Original Paper; Population Surveillance; Social media; Social Media - trends; Web sites
title	Tweet Classification Toward Twitter-Based Disease Surveillance: New Data, Methods, and Evaluations
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-26T12%3A02%3A27IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Tweet%20Classification%20Toward%20Twitter-Based%20Disease%20Surveillance:%20New%20Data,%20Methods,%20and%20Evaluations&rft.jtitle=Journal%20of%20medical%20Internet%20research&rft.au=Wakamiya,%20Shoko&rft.date=2019-02-20&rft.volume=21&rft.issue=2&rft.spage=e12783&rft.pages=e12783-&rft.issn=1438-8871&rft.eissn=1438-8871&rft_id=info:doi/10.2196/12783&rft_dat=%3Cgale_pubme%3EA769356349%3C/gale_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2184149563&rft_id=info:pmid/30785407&rft_galeid=A769356349&rfr_iscdi=true
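
The description field above names three evaluation measures for the multi-label task: exact match accuracy, F-measure based on precision and recall, and Hamming loss. The sketch below shows one way these could be computed for 8-label binary vectors. It is an illustration only, not the task organizers' evaluation script: the function names, the toy data, and the choice of micro-averaging for the F-measure are assumptions, since the record does not say how the F-measure was averaged.

```python
from typing import Sequence

Labels = Sequence[Sequence[int]]  # one row of 0/1 flags per tweet


def exact_match_accuracy(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of tweets whose whole label vector is predicted correctly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of individual label slots that are predicted wrongly."""
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    total = sum(len(t) for t in y_true)
    return wrong / total


def micro_f_measure(y_true: Labels, y_pred: Labels) -> float:
    """F-measure from precision and recall pooled over all label slots.

    Micro-averaging is an assumption made for this sketch.
    """
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        for a, b in zip(t, p):
            if a == 1 and b == 1:
                tp += 1
            elif a == 0 and b == 1:
                fp += 1
            elif a == 1 and b == 0:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 3 tweets, 8 symptom labels each.
y_true = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
y_pred = [
    [1, 0, 0, 0, 0, 0, 0, 0],  # fully correct
    [0, 1, 0, 0, 0, 0, 0, 0],  # misses one label
    [0, 0, 0, 1, 0, 0, 0, 0],  # one spurious label
]

print(exact_match_accuracy(y_true, y_pred))
print(hamming_loss(y_true, y_pred))
print(micro_f_measure(y_true, y_pred))
```

Exact match accuracy is the strictest of the three because a single wrong label fails the whole tweet, whereas Hamming loss counts each label slot separately; this may explain why the reported Hamming losses are small even where exact match accuracy is well below 1.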