Method and dataset entity mining in scientific literature: A CNN + BiLSTM model with self-attention

Bibliographic details
Published in: Knowledge-based systems, 2022-01, Vol. 235, p. 107621, Article 107621
Main authors: Hou, Linlin; Zhang, Ji; Wu, Ou; Yu, Ting; Wang, Zhen; Li, Zhao; Gao, Jianliang; Ye, Yingchun; Yao, Rujing
Format: Article
Language: English
Online access: Full text
container_start_page 107621
container_title Knowledge-based systems
container_volume 235
creator Hou, Linlin
Zhang, Ji
Wu, Ou
Yu, Ting
Wang, Zhen
Li, Zhao
Gao, Jianliang
Ye, Yingchun
Yao, Rujing
description Traditional literature analysis focuses mainly on literature metadata such as topics, authors, keywords, and references, and rarely attends to the main content of papers. However, in many research domains such as science, computing, and engineering, the methods and datasets involved in published papers carry important information and are quite useful for domain analysis and recommendation. Method and dataset entities take various forms, which makes them more difficult to recognize than common entities. In this paper, we propose a novel Method and Dataset Entity Recognition model (MDER), which effectively extracts method and dataset entities from the main textual content of scientific papers. The model is the first to combine rule embedding, a parallel structure of a Convolutional Neural Network (CNN) and a two-layer Bi-directional Long Short-Term Memory (BiLSTM), and a self-attention mechanism. We evaluate the proposed model on datasets constructed from published papers in different research areas of computer science. The model performs well in multiple areas and shows a good capacity for cross-area learning and recognition. An ablation study indicates that the different modules collectively contribute to the overall entity recognition performance. Data augmentation positively contributes to model training, making the model much more robust. We finally apply the proposed model to PAKDD papers published from 2009 to 2019 to mine insightful results over a long time span (PAKDD is the abbreviation of the Pacific-Asia Conference on Knowledge Discovery and Data Mining). Our source code and datasets are available at https://github.com/houlinlinvictoria/MDER.
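
The description above names the main architectural ingredients: parallel convolutions for local n-gram features, a two-layer BiLSTM for longer-range context, and self-attention, with a CRF decoder implied by the subject term "CNN+BiLSTM-Attention-CRF structure". Below is a minimal, hypothetical PyTorch sketch of a tagger assembled from those ingredients. It is not the authors' released MDER implementation (see the GitHub link above); every class name, dimension, and hyperparameter is an illustrative assumption, and the rule-embedding and CRF components are omitted, so the final layer only emits per-token tag scores that a CRF layer would normally decode.

# Hypothetical sketch, not the authors' MDER code; names and sizes are assumptions.
import torch
import torch.nn as nn


class CnnBiLstmAttentionTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, cnn_channels=128,
                 kernel_sizes=(2, 3, 4), lstm_hidden=128, num_heads=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Parallel CNNs with different kernel widths capture local n-gram features.
        self.convs = nn.ModuleList([
            nn.Conv1d(emb_dim, cnn_channels, k, padding=k // 2)
            for k in kernel_sizes
        ])
        cnn_out = cnn_channels * len(kernel_sizes)
        # Two-layer bidirectional LSTM models longer-range token dependencies.
        self.bilstm = nn.LSTM(emb_dim + cnn_out, lstm_hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        # Self-attention lets every token attend to the whole sentence.
        self.self_attn = nn.MultiheadAttention(2 * lstm_hidden, num_heads,
                                               batch_first=True)
        # Per-token emission scores; a CRF layer would normally decode these.
        self.classifier = nn.Linear(2 * lstm_hidden, num_tags)

    def forward(self, token_ids):                       # (batch, seq_len)
        seq_len = token_ids.size(1)
        x = self.embedding(token_ids)                   # (batch, seq_len, emb_dim)
        conv_in = x.transpose(1, 2)                     # (batch, emb_dim, seq_len)
        # Truncate each conv output to seq_len so the branches can be concatenated.
        conv_feats = torch.cat(
            [torch.relu(conv(conv_in))[..., :seq_len] for conv in self.convs],
            dim=1).transpose(1, 2)                      # (batch, seq_len, cnn_out)
        h, _ = self.bilstm(torch.cat([x, conv_feats], dim=-1))
        attn_out, _ = self.self_attn(h, h, h)
        return self.classifier(attn_out)                # (batch, seq_len, num_tags)


if __name__ == "__main__":
    model = CnnBiLstmAttentionTagger(vocab_size=5000, num_tags=7)
    tokens = torch.randint(1, 5000, (2, 20))            # two dummy sentences of 20 tokens
    print(model(tokens).shape)                          # torch.Size([2, 20, 7])

In the full design described in the abstract, rule embeddings would be concatenated with the word embeddings at the input and a CRF layer would decode the emission scores into a consistent tag sequence; both are left out here to keep the sketch short.
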
doi_str_mv 10.1016/j.knosys.2021.107621
format Article
publisher Amsterdam: Elsevier B.V
fulltext fulltext
identifier ISSN: 0950-7051
ispartof Knowledge-based systems, 2022-01, Vol.235, p.107621, Article 107621
issn 0950-7051
1872-7409
language eng
recordid cdi_proquest_journals_2621868410
source ScienceDirect Journals (5 years ago - present)
subjects Ablation
Artificial neural networks
CNN+BiLSTM-Attention-CRF structure
Datasets
Domains
Literature analysis
Methods and datasets mining
Named entity recognition
Recommender systems
Scientific papers
title Method and dataset entity mining in scientific literature: A CNN + BiLSTM model with self-attention
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T00%3A26%3A57IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Method%20and%20dataset%20entity%20mining%20in%20scientific%20literature:%20A%20CNN%20+%20BiLSTM%20model%20with%20self-attention&rft.jtitle=Knowledge-based%20systems&rft.au=Hou,%20Linlin&rft.date=2022-01-10&rft.volume=235&rft.spage=107621&rft.pages=107621-&rft.artnum=107621&rft.issn=0950-7051&rft.eissn=1872-7409&rft_id=info:doi/10.1016/j.knosys.2021.107621&rft_dat=%3Cproquest_cross%3E2621868410%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2621868410&rft_id=info:pmid/&rft_els_id=S0950705121008832&rfr_iscdi=true