Detecting Near Duplicate Dataset with Machine Learning

This paper introduces the concept of a near-duplicate dataset: a quasi-duplicate version of a dataset that has undergone an unknown number of row and column insertions and deletions (modifications to both schema and instances). The concept is relevant to data exploration, data integration and...

Detailed description

Bibliographic details
Published in:	International journal of computer information systems and industrial management applications 2022, Vol.14, p.374-385
Main authors:	Chevallier, Marc; Rogovschi, Nicoleta; Boufarès, Faouzi; Grozavu, Nistor; Clairmont, Charly
Format:	Article
Language:	eng
Subjects:	Computer Science; Machine Learning
Online access:	Full text

container_end_page	385
container_issue
container_start_page	374
container_title	International journal of computer information systems and industrial management applications
container_volume	14
creator	Chevallier, Marc; Rogovschi, Nicoleta; Boufarès, Faouzi; Grozavu, Nistor; Clairmont, Charly
description	This paper introduces the concept of a near-duplicate dataset: a quasi-duplicate version of a dataset that has undergone an unknown number of row and column insertions and deletions (modifications to both schema and instances). The concept is relevant to data exploration, data integration and data quality. To formalise these insertions and deletions, two parameters are introduced. Our technique for detecting quasi-duplicate datasets is based on feature extraction and machine learning. The method is original in that it relies not on classical column-to-column comparisons but on comparing metadata vectors that summarise the datasets. To train the algorithms, we introduce a method for artificially generating training data. We perform several experiments to evaluate the best parameters for creating training data and the performance of several classifiers. In the cases studied, these experiments yield an accuracy above 95%.
format	Article
publisher	Machine Intelligence Research Labs (MIR Labs)
rights	Distributed under a Creative Commons Attribution 4.0 International License
eissn	2150-7988
orcid	0000-0002-7983-6147; 0000-0001-7502-8022
sourcetype	Open Access Repository
collection	Hyper Article en Ligne (HAL) (Open Access)
backlink	https://hal.science/hal-03722301 (View record in HAL)
fulltext	fulltext
identifier	ISSN: 2150-7988
ispartof	International journal of computer information systems and industrial management applications, 2022, Vol.14, p.374-385
issn	2150-7988 2150-7988
language	eng
recordid	cdi_hal_primary_oai_HAL_hal_03722301v1
source	EZB-FREE-00999 freely available EZB journals
subjects	Computer Science; Machine Learning
title	Detecting Near Duplicate Dataset with Machine Learning
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-30T08%3A50%3A05IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-hal&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Detecting%20Near%20Duplicate%20Dataset%20with%20Machine%20Learning&rft.jtitle=International%20journal%20of%20computer%20information%20systems%20and%20industrial%20management%20applications&rft.au=Chevallier,%20Marc&rft.date=2022&rft.volume=14&rft.spage=374&rft.epage=385&rft.pages=374-385&rft.issn=2150-7988&rft.eissn=2150-7988&rft_id=info:doi/&rft_dat=%3Chal%3Eoai_HAL_hal_03722301v1%3C/hal%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
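
The description above outlines a pipeline: generate artificial near duplicates, summarise each dataset as a metadata vector, and train a classifier on pairs of vectors. The following Python sketch illustrates that idea only in broad strokes and is not the authors' implementation. Everything in it is an assumption made for illustration: the names `make_near_duplicate`, `metadata_vector`, `pair_features` and `training_pairs`, the two rates `row_rate` and `col_rate` (hypothetical stand-ins for the two parameters the record mentions but does not define), the five summary statistics, and the restriction to deletions on numeric columns.

```python
import random
import statistics


def make_near_duplicate(table, row_rate, col_rate, rng):
    """Copy `table` (dict: column name -> list of numbers), dropping each
    row with probability `row_rate` and each column with probability
    `col_rate`. Deletions only; insertions are left out of this sketch."""
    n_rows = len(next(iter(table.values())))
    keep_rows = [i for i in range(n_rows) if rng.random() >= row_rate]
    keep_cols = [c for c in table if rng.random() >= col_rate]
    # Guarantee a non-empty result so the statistics below are defined.
    if not keep_rows:
        keep_rows = [0]
    if not keep_cols:
        keep_cols = [next(iter(table))]
    return {c: [table[c][i] for i in keep_rows] for c in keep_cols}


def metadata_vector(table):
    """Summarise a table as a small fixed-length vector (illustrative choice
    of statistics, not the feature set used in the paper)."""
    columns = list(table.values())
    n_rows = len(columns[0])
    means = [statistics.fmean(col) for col in columns]
    spreads = [statistics.pstdev(col) for col in columns]
    distinct = [len(set(col)) / n_rows for col in columns]
    return [
        float(n_rows),
        float(len(columns)),
        statistics.fmean(means),
        statistics.fmean(spreads),
        statistics.fmean(distinct),
    ]


def pair_features(a, b):
    """Compare two tables through their metadata vectors only."""
    return [abs(x - y) for x, y in zip(metadata_vector(a), metadata_vector(b))]


def random_table(n_rows, n_cols, rng):
    """Synthetic numeric table used as a stand-in for real data."""
    return {
        f"c{j}": [rng.gauss(j, 1.0 + j) for _ in range(n_rows)]
        for j in range(n_cols)
    }


def training_pairs(n_pairs, rng, row_rate=0.1, col_rate=0.2):
    """Build labelled examples: 1 = near duplicate, 0 = unrelated table."""
    examples = []
    for _ in range(n_pairs):
        base = random_table(200, 6, rng)
        near = make_near_duplicate(base, row_rate, col_rate, rng)
        other = random_table(200, 6, rng)
        examples.append((pair_features(base, near), 1))
        examples.append((pair_features(base, other), 0))
    return examples


rng = random.Random(0)
data = training_pairs(50, rng)
```

The resulting list of `(features, label)` pairs could then be passed to any off-the-shelf classifier. Which classifiers and parameter values the authors evaluated is not stated in this record.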