Accelerated training of bootstrap aggregation-based deep information extraction systems from cancer pathology reports

Bibliographic Details
Published in: Journal of Biomedical Informatics, 2020-09, Vol. 110
Main Authors: Yoon, Hong-Jun; Klasky, Hilda B.; Gounley, John P.; Alawad, Mohammed; Gao, Shang; Durbin, Eric B.; Wu, Xiao-Cheng; Stroup, Antoinette; Doherty, Jennifer; Coyle, Linda; Penberthy, Lynne; Christian, J. Blair; Tourassi, Georgia
Format: Article
Language: English
Online Access: Full text
Description:
Objective: In machine learning, classification task performance generally increases when bootstrap aggregation (bagging) is applied. However, bagging deep neural networks requires tremendous amounts of computational resources and training time. The research question we aimed to answer is whether we could achieve higher task performance scores and accelerate training by dividing a problem into sub-problems.
Materials and Methods: The data used in this study consist of free text from electronic cancer pathology reports. We applied bagging and partitioned data training using Multi-Task Convolutional Neural Network (MT-CNN) and Multi-Task Hierarchical Convolutional Attention Network (MT-HCAN) classifiers. We split a big problem into 20 sub-problems, resampled the training cases 2,000 times, and trained a deep learning model for each bootstrap sample and each sub-problem, thus generating up to 40,000 models. We performed the training of many models concurrently in a high-performance computing environment at Oak Ridge National Laboratory (ORNL).
Results: We demonstrated that aggregation of the models improves task performance compared with the single-model approach, consistent with other research studies, and that the two proposed partitioned bagging methods achieved higher classification accuracy scores on four tasks. Notably, the improvements were significant for the extraction of cancer histology data, a task with more than 500 class labels; these results show that data partitioning may alleviate the complexity of the task. However, the methods did not achieve superior scores for the site and subsite classification tasks. Because data partitioning was based on the primary cancer site, accuracy depended on how the partitions were determined, which needs further investigation and improvement.
Conclusion: The results of this research demonstrate that (1) the data partitioning and bagging strategy achieved higher performance scores, and (2) training was accelerated by leveraging the high-performance Summit supercomputer at ORNL.
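For illustration only, below is a minimal sketch of the partitioned bagging workflow summarized in the description: cases are split into sub-problems by a partition key (in the paper, the primary cancer site), each sub-problem is bootstrap-resampled, one classifier is trained per bootstrap sample, and predictions are aggregated by majority vote. This is not the authors' implementation; scikit-learn's LogisticRegression stands in for the MT-CNN/MT-HCAN deep classifiers, and the function names, parameter values, and toy data are assumptions made for the sketch.

# Minimal sketch of partitioned bootstrap aggregation (bagging).
# A linear classifier stands in for the MT-CNN / MT-HCAN models; all names,
# sizes, and data below are illustrative assumptions, not the paper's setup.
from collections import Counter, defaultdict

import numpy as np
from sklearn.linear_model import LogisticRegression


def train_partitioned_bagging(X, y, partitions, n_bootstrap=10, seed=0):
    """Train one model per (sub-problem, bootstrap sample) pair."""
    rng = np.random.default_rng(seed)
    ensembles = defaultdict(list)
    for part in np.unique(partitions):
        idx = np.where(partitions == part)[0]  # cases belonging to this sub-problem
        for _ in range(n_bootstrap):
            # Bootstrap resample: draw len(idx) cases with replacement.
            sample = rng.choice(idx, size=len(idx), replace=True)
            model = LogisticRegression(max_iter=1000).fit(X[sample], y[sample])
            ensembles[part].append(model)
    return ensembles


def predict_partitioned_bagging(ensembles, X, partitions):
    """Route each case to its partition's ensemble and aggregate by majority vote."""
    predictions = []
    for x, part in zip(X, partitions):
        votes = [m.predict(x.reshape(1, -1))[0] for m in ensembles[part]]
        predictions.append(Counter(votes).most_common(1)[0][0])
    return np.array(predictions)


if __name__ == "__main__":
    # Toy stand-in data: 200 cases, 5 features, 4 class labels, 2 partitions
    # (e.g., grouping pathology reports by primary cancer site).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 5))
    y = rng.integers(0, 4, size=200)
    partitions = rng.integers(0, 2, size=200)
    ensembles = train_partitioned_bagging(X, y, partitions, n_bootstrap=5)
    print(predict_partitioned_bagging(ensembles, X[:5], partitions[:5]))

Because each (sub-problem, bootstrap sample) pair is an independent training job, the up to 40,000 models described in the abstract can be trained concurrently on an HPC system such as Summit; the sketch above runs them sequentially for simplicity.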
Publisher: Elsevier (United States)
Corporate Contributors: Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Argonne National Lab. (ANL), Argonne, IL (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States), Oak Ridge Leadership Computing Facility (OLCF)
Full Text: https://www.osti.gov/servlets/purl/1665991
ISSN: 1532-0464
EISSN: 1532-0480
Source: Elsevier ScienceDirect Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals
Subjects:
60 APPLIED LIFE SCIENCES
Bootstrap aggregation
Convolutional neural networks
Data partitioning
Deep learning
Hierarchical self-attention networks
High-performance computing
Natural language processing