Apache Spark based kernelized fuzzy clustering framework for single nucleotide polymorphism sequence analysis

•Kernelized fuzzy clustering algorithms based on Apache Spark are proposed for clustering huge Single Nucleotide Polymorphism (SNP) sequences. •SNP preprocessing is used for feature extraction. •The complexity of the kernelized algorithms is linear. This paper introduces a kernel based fuzzy cl...

Detailed description

Bibliographic details
Published in:	Computational biology and chemistry, 2021-06, Vol.92, p.107454-107454, Article 107454
Main authors:	Jha, Preeti ; Tiwari, Aruna ; Bharill, Neha ; Ratnaparkhe, Milind ; Mounika, Mukkamalla ; Nagendra, Neha
Format:	Article
Language:	eng
Subjects:	Apache Spark ; High-dimensional ; Kernelized fuzzy clustering ; Non-linear ; SNP sequences
Online access:	Full text

container_end_page	107454
container_issue
container_start_page	107454
container_title	Computational biology and chemistry
container_volume	92
creator	Jha, Preeti ; Tiwari, Aruna ; Bharill, Neha ; Ratnaparkhe, Milind ; Mounika, Mukkamalla ; Nagendra, Neha
description	•Kernelized fuzzy clustering algorithms based on Apache Spark are proposed for clustering huge Single Nucleotide Polymorphism (SNP) sequences. •SNP preprocessing is used for feature extraction. •The complexity of the kernelized algorithms is linear. This paper introduces a kernel-based fuzzy clustering approach for non-linearly separable problems. It applies a Radial Basis Function (RBF) kernel, which maps the input data space non-linearly into a high-dimensional feature space. Discovering clusters in high-dimensional genomics data is extremely challenging for bioinformatics researchers performing genome analysis. To support bioinformatics investigations, specifically genomic clustering, we propose high-dimensional kernelized fuzzy clustering algorithms based on the Apache Spark framework for clustering Single Nucleotide Polymorphism (SNP) sequences. The paper proposes Kernelized Scalable Random Sampling with Iterative Optimization Fuzzy c-Means (KSRSIO-FCM), which internally uses another proposed algorithm, Kernelized Scalable Literal Fuzzy c-Means (KSLFCM). Both approaches are fully adapted to the Apache Spark cluster framework through a localized sub-clustering Resilient Distributed Dataset (RDD) method. We also propose a preprocessing approach that generates numeric feature vectors for huge SNP sequences; it is made scalable by executing it on an Apache Spark cluster and is applied to real-world SNP datasets taken from open internet repositories for two plant species, soybean and rice. A comparison of the proposed scalable kernelized fuzzy clustering results with similar works shows a significant improvement in time and space complexity, Silhouette index, and Davies-Bouldin index.
Exhaustive experiments on various SNP datasets show the effectiveness of the proposed KSRSIO-FCM in comparison with the proposed KSLFCM and other scalable clustering algorithms, i.e., SRSIO-FCM and SLFCM.
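The description mentions fuzzy c-means clustering with an RBF kernel but gives no formulas. As a rough illustration only, the sketch below implements a commonly used textbook form of kernelized fuzzy c-means, with cluster centers kept in the input space, in plain single-machine Python. It is not the paper's KSLFCM or KSRSIO-FCM and does not use Apache Spark. The function names (`rbf_kernel`, `kernel_fcm`), the parameters (`init_centers`, `m`, `sigma`, `max_iter`, `tol`), and the update rules are assumptions made for this example.

```python
import math


def rbf_kernel(x, v, sigma):
    """Gaussian RBF kernel K(x, v) = exp(-||x - v||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_fcm(points, init_centers, m=2.0, sigma=1.0, max_iter=100, tol=1e-6):
    """Illustrative kernelized fuzzy c-means (assumed textbook form).

    Returns (centers, memberships); memberships[k][i] is the degree to
    which points[k] belongs to cluster i, computed in the last iteration.
    """
    centers = [list(c) for c in init_centers]
    n_clusters = len(centers)
    dim = len(points[0])
    eps = 1e-12  # avoids division by zero when a point coincides with a center
    memberships = []
    for _ in range(max_iter):
        # Kernel values; 1 - K(x, v) acts as the kernel-induced distance.
        kvals = [[rbf_kernel(x, v, sigma) for v in centers] for x in points]

        # Membership update: u_ki is proportional to (1 - K)^(-1 / (m - 1)).
        memberships = []
        for row in kvals:
            weights = [max(1.0 - k, eps) ** (-1.0 / (m - 1.0)) for k in row]
            total = sum(weights)
            memberships.append([w / total for w in weights])

        # Center update: mean of the points weighted by u^m * K(x, v).
        new_centers = []
        for i in range(n_clusters):
            coeffs = [(memberships[k][i] ** m) * kvals[k][i]
                      for k in range(len(points))]
            denom = sum(coeffs)
            new_centers.append([
                sum(c * x[d] for c, x in zip(coeffs, points)) / denom
                for d in range(dim)
            ])

        # Stop once no center moved by more than tol.
        shift = max(
            math.sqrt(sum((a - b) ** 2 for a, b in zip(old, new)))
            for old, new in zip(centers, new_centers)
        )
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships


if __name__ == "__main__":
    # Toy data: two well-separated groups of 2-D points (made up for the demo).
    data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
            [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
    centers, u = kernel_fcm(data, init_centers=[[0.0, 0.0], [5.0, 5.0]])
    for point, row in zip(data, u):
        print(point, [round(val, 3) for val in row])
```

Each pass over the data costs time proportional to the number of points times the number of clusters, which is in line with the linear complexity stated in the highlights. A distributed variant would presumably compute the per-point terms on partitions and combine the sums, but that part is not sketched here.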
doi_str_mv	10.1016/j.compbiolchem.2021.107454
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_2499391516</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S1476927121000219</els_id><sourcerecordid>2499391516</sourcerecordid><originalsourceid>FETCH-LOGICAL-c380t-cf9077165718448c07d1aa1102a7a2f598991a210e3df643d8ee29931a31eb643</originalsourceid><addsrcrecordid>eNqNkEtvGyEQgFGVqHbS_oUK5ZSLHYZ90pvlPKVIOSSVekOYnW1wYNnAbiv71wfLrtVjTzMavplhPkIugM2BQXm1nmvv-pXxVr-im3PGIT1UeZF_IlPIq3ImeP3z5JhXMCFnMa4Z4xljxWcyybKyzktRTIlb9CpNoc-9Cm90pSI29A1Dh9ZsU9qO2-2GajvGAYPpftE2KId_fGJbH2hMJYu0G7VFP5gGae_txvnQv5roaMT3ETuNVHXKbqKJX8hpq2zEr4d4Tn7c3rws72ePT3cPy8XjTGc1G2a6FayqoCwqqPO81qxqQCkAxlWleFuIWghQHBhmTVvmWVMjciEyUBngKhXOyeV-bh98-kIcpDNRo7WqQz9GyfNECyigTOj3PaqDjzFgK_tgnAobCUzudMu1_Fe33OmWe92p-dthz7hy2Bxb__pNwPUewHTtb4NBRm12ShoTUA-y8eZ_9nwARWiZ8w</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2499391516</pqid></control><display><type>article</type><title>Apache Spark based kernelized fuzzy clustering framework for single nucleotide polymorphism sequence analysis</title><source>Access via ScienceDirect (Elsevier)</source><creator>Jha, Preeti ; Tiwari, Aruna ; Bharill, Neha ; Ratnaparkhe, Milind ; Mounika, Mukkamalla ; Nagendra, Neha</creator><creatorcontrib>Jha, Preeti ; Tiwari, Aruna ; Bharill, Neha ; Ratnaparkhe, Milind ; Mounika, Mukkamalla ; Nagendra, Neha</creatorcontrib><description>[Display omitted] •The kernelized fuzzy clustering algorithms based on Apache Spark for the clustering of huge Single Nucleotide Polymorphism (SNP).•The SNP preprocessing used for feature extraction.•The complexity of the kernelized algorithms is linear. This paper introduces a kernel based fuzzy clustering approach to deal with the non-linear separable problems by applying kernel Radial Basis Functions (RBF) which maps the input data space non-linearly into a high-dimensional feature space. 
Discovering clusters in the high-dimensional genomics data is extremely challenging for the bioinformatics researchers for genome analysis. To support the investigations in bioinformatics, explicitly on genomic clustering, we proposed high-dimensional kernelized fuzzy clustering algorithms based on Apache Spark framework for clustering of Single Nucleotide Polymorphism (SNP) sequences. The paper proposes the Kernelized Scalable Random Sampling with Iterative Optimization Fuzzy c-Means (KSRSIO-FCM) which inherently uses another proposed Kernelized Scalable Literal Fuzzy c-Means (KSLFCM) clustering algorithm. Both the approaches completely adapt the Apache Spark cluster framework by localized sub-clustering Resilient Distributed Dataset (RDD) method. Additionally, we are also proposing a preprocessing approach for generating numeric feature vectors for huge SNP sequences and making it a scalable preprocessing approach by executing it on an Apache Spark cluster, which is applied to real-world SNP datasets taken from open-internet repositories of two different plant species, i.e., soybean and rice. The comparison of the proposed scalable kernelized fuzzy clustering results with similar works shows the significant improvement of the proposed algorithm in terms of time and space complexity, Silhouette index, and Davies-Bouldin index. 
Exhaustive experiments are performed on various SNP datasets to show the effectiveness of proposed KSRSIO-FCM in comparison with proposed KSLFCM and other scalable clustering algorithms, i.e., SRSIO-FCM, and SLFCM.</description><identifier>ISSN: 1476-9271</identifier><identifier>EISSN: 1476-928X</identifier><identifier>DOI: 10.1016/j.compbiolchem.2021.107454</identifier><identifier>PMID: 33684695</identifier><language>eng</language><publisher>England: Elsevier Ltd</publisher><subject>Apache Spark ; High-dimensional ; Kernelized fuzzy clustering ; Non-linear ; SNP sequences</subject><ispartof>Computational biology and chemistry, 2021-06, Vol.92, p.107454-107454, Article 107454</ispartof><rights>2021 Elsevier Ltd</rights><rights>Copyright © 2021 Elsevier Ltd. All rights reserved.</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c380t-cf9077165718448c07d1aa1102a7a2f598991a210e3df643d8ee29931a31eb643</citedby><cites>FETCH-LOGICAL-c380t-cf9077165718448c07d1aa1102a7a2f598991a210e3df643d8ee29931a31eb643</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://dx.doi.org/10.1016/j.compbiolchem.2021.107454$$EHTML$$P50$$Gelsevier$$H</linktohtml><link.rule.ids>314,780,784,3550,27924,27925,45995</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/33684695$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Jha, Preeti</creatorcontrib><creatorcontrib>Tiwari, Aruna</creatorcontrib><creatorcontrib>Bharill, Neha</creatorcontrib><creatorcontrib>Ratnaparkhe, Milind</creatorcontrib><creatorcontrib>Mounika, Mukkamalla</creatorcontrib><creatorcontrib>Nagendra, Neha</creatorcontrib><title>Apache Spark based kernelized fuzzy clustering framework for single nucleotide polymorphism sequence analysis</title><title>Computational 
biology and chemistry</title><addtitle>Comput Biol Chem</addtitle><description>[Display omitted] •The kernelized fuzzy clustering algorithms based on Apache Spark for the clustering of huge Single Nucleotide Polymorphism (SNP).•The SNP preprocessing used for feature extraction.•The complexity of the kernelized algorithms is linear. This paper introduces a kernel based fuzzy clustering approach to deal with the non-linear separable problems by applying kernel Radial Basis Functions (RBF) which maps the input data space non-linearly into a high-dimensional feature space. Discovering clusters in the high-dimensional genomics data is extremely challenging for the bioinformatics researchers for genome analysis. To support the investigations in bioinformatics, explicitly on genomic clustering, we proposed high-dimensional kernelized fuzzy clustering algorithms based on Apache Spark framework for clustering of Single Nucleotide Polymorphism (SNP) sequences. The paper proposes the Kernelized Scalable Random Sampling with Iterative Optimization Fuzzy c-Means (KSRSIO-FCM) which inherently uses another proposed Kernelized Scalable Literal Fuzzy c-Means (KSLFCM) clustering algorithm. Both the approaches completely adapt the Apache Spark cluster framework by localized sub-clustering Resilient Distributed Dataset (RDD) method. Additionally, we are also proposing a preprocessing approach for generating numeric feature vectors for huge SNP sequences and making it a scalable preprocessing approach by executing it on an Apache Spark cluster, which is applied to real-world SNP datasets taken from open-internet repositories of two different plant species, i.e., soybean and rice. The comparison of the proposed scalable kernelized fuzzy clustering results with similar works shows the significant improvement of the proposed algorithm in terms of time and space complexity, Silhouette index, and Davies-Bouldin index. 
Exhaustive experiments are performed on various SNP datasets to show the effectiveness of proposed KSRSIO-FCM in comparison with proposed KSLFCM and other scalable clustering algorithms, i.e., SRSIO-FCM, and SLFCM.</description><subject>Apache Spark</subject><subject>High-dimensional</subject><subject>Kernelized fuzzy clustering</subject><subject>Non-linear</subject><subject>SNP sequences</subject><issn>1476-9271</issn><issn>1476-928X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><recordid>eNqNkEtvGyEQgFGVqHbS_oUK5ZSLHYZ90pvlPKVIOSSVekOYnW1wYNnAbiv71wfLrtVjTzMavplhPkIugM2BQXm1nmvv-pXxVr-im3PGIT1UeZF_IlPIq3ImeP3z5JhXMCFnMa4Z4xljxWcyybKyzktRTIlb9CpNoc-9Cm90pSI29A1Dh9ZsU9qO2-2GajvGAYPpftE2KId_fGJbH2hMJYu0G7VFP5gGae_txvnQv5roaMT3ETuNVHXKbqKJX8hpq2zEr4d4Tn7c3rws72ePT3cPy8XjTGc1G2a6FayqoCwqqPO81qxqQCkAxlWleFuIWghQHBhmTVvmWVMjciEyUBngKhXOyeV-bh98-kIcpDNRo7WqQz9GyfNECyigTOj3PaqDjzFgK_tgnAobCUzudMu1_Fe33OmWe92p-dthz7hy2Bxb__pNwPUewHTtb4NBRm12ShoTUA-y8eZ_9nwARWiZ8w</recordid><startdate>20210601</startdate><enddate>20210601</enddate><creator>Jha, Preeti</creator><creator>Tiwari, Aruna</creator><creator>Bharill, Neha</creator><creator>Ratnaparkhe, Milind</creator><creator>Mounika, Mukkamalla</creator><creator>Nagendra, Neha</creator><general>Elsevier Ltd</general><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope></search><sort><creationdate>20210601</creationdate><title>Apache Spark based kernelized fuzzy clustering framework for single nucleotide polymorphism sequence analysis</title><author>Jha, Preeti ; Tiwari, Aruna ; Bharill, Neha ; Ratnaparkhe, Milind ; Mounika, Mukkamalla ; Nagendra, 
Neha</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c380t-cf9077165718448c07d1aa1102a7a2f598991a210e3df643d8ee29931a31eb643</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Apache Spark</topic><topic>High-dimensional</topic><topic>Kernelized fuzzy clustering</topic><topic>Non-linear</topic><topic>SNP sequences</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Jha, Preeti</creatorcontrib><creatorcontrib>Tiwari, Aruna</creatorcontrib><creatorcontrib>Bharill, Neha</creatorcontrib><creatorcontrib>Ratnaparkhe, Milind</creatorcontrib><creatorcontrib>Mounika, Mukkamalla</creatorcontrib><creatorcontrib>Nagendra, Neha</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><jtitle>Computational biology and chemistry</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Jha, Preeti</au><au>Tiwari, Aruna</au><au>Bharill, Neha</au><au>Ratnaparkhe, Milind</au><au>Mounika, Mukkamalla</au><au>Nagendra, Neha</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Apache Spark based kernelized fuzzy clustering framework for single nucleotide polymorphism sequence analysis</atitle><jtitle>Computational biology and chemistry</jtitle><addtitle>Comput Biol Chem</addtitle><date>2021-06-01</date><risdate>2021</risdate><volume>92</volume><spage>107454</spage><epage>107454</epage><pages>107454-107454</pages><artnum>107454</artnum><issn>1476-9271</issn><eissn>1476-928X</eissn><abstract>[Display omitted] •The kernelized fuzzy clustering algorithms based on Apache Spark for the clustering of huge Single Nucleotide Polymorphism (SNP).•The SNP preprocessing used for feature extraction.•The complexity of the kernelized algorithms is linear. 
This paper introduces a kernel based fuzzy clustering approach to deal with the non-linear separable problems by applying kernel Radial Basis Functions (RBF) which maps the input data space non-linearly into a high-dimensional feature space. Discovering clusters in the high-dimensional genomics data is extremely challenging for the bioinformatics researchers for genome analysis. To support the investigations in bioinformatics, explicitly on genomic clustering, we proposed high-dimensional kernelized fuzzy clustering algorithms based on Apache Spark framework for clustering of Single Nucleotide Polymorphism (SNP) sequences. The paper proposes the Kernelized Scalable Random Sampling with Iterative Optimization Fuzzy c-Means (KSRSIO-FCM) which inherently uses another proposed Kernelized Scalable Literal Fuzzy c-Means (KSLFCM) clustering algorithm. Both the approaches completely adapt the Apache Spark cluster framework by localized sub-clustering Resilient Distributed Dataset (RDD) method. Additionally, we are also proposing a preprocessing approach for generating numeric feature vectors for huge SNP sequences and making it a scalable preprocessing approach by executing it on an Apache Spark cluster, which is applied to real-world SNP datasets taken from open-internet repositories of two different plant species, i.e., soybean and rice. The comparison of the proposed scalable kernelized fuzzy clustering results with similar works shows the significant improvement of the proposed algorithm in terms of time and space complexity, Silhouette index, and Davies-Bouldin index. Exhaustive experiments are performed on various SNP datasets to show the effectiveness of proposed KSRSIO-FCM in comparison with proposed KSLFCM and other scalable clustering algorithms, i.e., SRSIO-FCM, and SLFCM.</abstract><cop>England</cop><pub>Elsevier Ltd</pub><pmid>33684695</pmid><doi>10.1016/j.compbiolchem.2021.107454</doi><tpages>1</tpages></addata></record>
fulltext	fulltext
identifier	ISSN: 1476-9271
ispartof	Computational biology and chemistry, 2021-06, Vol.92, p.107454-107454, Article 107454
issn	1476-9271 1476-928X
language	eng
recordid	cdi_proquest_miscellaneous_2499391516
source	Access via ScienceDirect (Elsevier)
subjects	Apache Spark ; High-dimensional ; Kernelized fuzzy clustering ; Non-linear ; SNP sequences
title	Apache Spark based kernelized fuzzy clustering framework for single nucleotide polymorphism sequence analysis
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T21%3A48%3A30IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Apache%20Spark%20based%20kernelized%20fuzzy%20clustering%20framework%20for%20single%20nucleotide%20polymorphism%20sequence%20analysis&rft.jtitle=Computational%20biology%20and%20chemistry&rft.au=Jha,%20Preeti&rft.date=2021-06-01&rft.volume=92&rft.spage=107454&rft.epage=107454&rft.pages=107454-107454&rft.artnum=107454&rft.issn=1476-9271&rft.eissn=1476-928X&rft_id=info:doi/10.1016/j.compbiolchem.2021.107454&rft_dat=%3Cproquest_cross%3E2499391516%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2499391516&rft_id=info:pmid/33684695&rft_els_id=S1476927121000219&rfr_iscdi=true