Crash data augmentation using variational autoencoder

• In this paper, we present a data augmentation technique to reproduce crash data.
• Variational Autoencoder (VAE) was used to generate millions of crash samples from only a limited number of training data.
• The generated data was compared to real data from different statistical standpoints and similari...

Bibliographic details
Published in:	Accident analysis and prevention, 2021-03, Vol.151, p.105950-105950, Article 105950
Main authors:	Islam, Zubayer; Abdel-Aty, Mohamed; Cai, Qing; Yuan, Jinghui
Format:	Article
Language:	English
Subjects:	Crash prediction; Data augmentation; Variational autoencoder
Online access:	Full text

container_end_page	105950
container_issue
container_start_page	105950
container_title	Accident analysis and prevention
container_volume	151
creator	Islam, Zubayer; Abdel-Aty, Mohamed; Cai, Qing; Yuan, Jinghui
description	• In this paper, we present a data augmentation technique to reproduce crash data.
• A variational autoencoder (VAE) was used to generate millions of crash samples from only a limited number of training data.
• The generated data was compared with real data from different statistical standpoints, and similarity was reported.
• It was also compared with minority oversampling techniques such as SMOTE and ADASYN. The results were also compared with the GAN framework for generating data.
• Crash prediction models based on Logistic Regression, Support Vector Machine and Artificial Neural Network were used to compare the generated data from the different models.
• Overall, VAE showed excellent results compared to the other data augmentation methods.
In this paper, we present a data augmentation technique to reproduce crash data. Datasets comprising crash and non-crash events are extremely imbalanced. For instance, the dataset used in this paper contains only 625 crash events against over 6.5 million non-crash events. Learning algorithms therefore tend to perform poorly on such datasets. We used a variational autoencoder (VAE) to encode all the events into a latent space. After training, the model could successfully separate crash and non-crash events. To generate data, we sampled from the region of the latent space containing crash data. The generated data was compared with the real data from different statistical aspects. The t-test, Levene's test and the Kolmogorov-Smirnov test showed that the generated data was statistically similar to the real data. It was also compared with minority oversampling techniques such as SMOTE and ADASYN, as well as with the GAN framework for generating data. Crash prediction models based on Logistic Regression (LR), Support Vector Machine (SVM) and Artificial Neural Network (ANN) were used to compare the generated data from the different oversampling techniques. Overall, VAE showed excellent results compared to the other data augmentation methods. Specificity is improved by 8% and 4% for VAE-LR and VAE-SVM, respectively, when compared to SMOTE, while sensitivity is improved by 6% and 5% when compared to ADASYN. Moreover, VAE-generated data also helps to overcome the overfitting problem in SMOTE and ADASYN, since there is flexibility in choosing the decision boundary.
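The imbalance figures and the sensitivity and specificity results quoted in the description can be made concrete with a small sketch. It is not taken from the paper: the function name rates_from_counts, the rounding of "over 6.5 million" down to exactly 6,500,000, and the trivial classifier that labels every event as non-crash are assumptions chosen purely for illustration.

```python
# Illustrative sketch only; names and the rounded non-crash count are assumptions.

def rates_from_counts(tp, fn, tn, fp):
    """Compute accuracy, sensitivity and specificity from confusion-matrix counts.

    tp / fn: crash events predicted as crash / as non-crash.
    tn / fp: non-crash events predicted as non-crash / as crash.
    """
    positives = tp + fn
    negatives = tn + fp
    total = positives + negatives
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / total if total else nan,
        # Sensitivity (recall): share of real crashes that were detected.
        "sensitivity": tp / positives if positives else nan,
        # Specificity: share of real non-crashes that were not flagged.
        "specificity": tn / negatives if negatives else nan,
    }


N_CRASH = 625            # crash events reported in the abstract
N_NON_CRASH = 6_500_000  # "over 6.5 million" in the abstract, rounded down (assumption)

# Number of non-crash events per crash event.
imbalance_ratio = N_NON_CRASH / N_CRASH

# A trivial classifier that labels every event as non-crash.
trivial = rates_from_counts(tp=0, fn=N_CRASH, tn=N_NON_CRASH, fp=0)

if __name__ == "__main__":
    print(f"non-crash events per crash: {imbalance_ratio:.0f}")
    for name, value in trivial.items():
        print(f"{name}: {value:.6f}")
```

Under these assumptions the trivial classifier scores an accuracy above 99.99% while detecting no crashes at all, which suggests why sensitivity and specificity are the more informative measures on data this imbalanced.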
doi_str_mv	10.1016/j.aap.2020.105950
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_2473741368</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S000145752031770X</els_id><sourcerecordid>2473741368</sourcerecordid><sourcetype>Aggregation Database</sourcetype><recordtype>article</recordtype><pqid>2473741368</pqid></control><display><type>article</type><identifier>EISSN: 1879-2057</identifier><identifier>PMID: 33370603</identifier><publisher>England: Elsevier Ltd</publisher><rights>2020 Elsevier Ltd</rights><rights>Copyright © 2020 Elsevier Ltd. All rights reserved.</rights><lds50>peer_reviewed</lds50></display><search><creationdate>2021</creationdate><general>Elsevier Ltd</general></search><facets><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection></facets><addata><format>journal</format><genre>article</genre><ristype>JOUR</ristype><addtitle>Accid Anal Prev</addtitle><date>2021-03-01</date><cop>England</cop><pub>Elsevier Ltd</pub><pmid>33370603</pmid><tpages>1</tpages></addata></record>
fulltext	fulltext
identifier	ISSN: 0001-4575
ispartof	Accident analysis and prevention, 2021-03, Vol.151, p.105950-105950, Article 105950
issn	0001-4575 1879-2057
language	eng
recordid	cdi_proquest_miscellaneous_2473741368
source	Access via ScienceDirect (Elsevier)
subjects	Crash prediction; Data augmentation; Variational autoencoder
title	Crash data augmentation using variational autoencoder
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-18T07%3A10%3A24IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Crash%20data%20augmentation%20using%20variational%20autoencoder&rft.jtitle=Accident%20analysis%20and%20prevention&rft.au=Islam,%20Zubayer&rft.date=2021-03-01&rft.volume=151&rft.spage=105950&rft.epage=105950&rft.pages=105950-105950&rft.artnum=105950&rft.issn=0001-4575&rft.eissn=1879-2057&rft_id=info:doi/10.1016/j.aap.2020.105950&rft_dat=%3Cproquest_cross%3E2473741368%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2473741368&rft_id=info:pmid/33370603&rft_els_id=S000145752031770X&rfr_iscdi=true