Detecting Near Duplicate Dataset with Machine Learning

This paper introduces the concept of near duplicate dataset, a quasi-duplicate version of a dataset. This version has undergone an unknown number of row and column insertions and deletions (modifications on schema and instance). This concept is interesting for data exploration, data integration and...

Bibliographic details
Published in: International journal of computer information systems and industrial management applications 2022, Vol.14, p.374-385
Main authors: Chevallier, Marc; Rogovschi, Nicoleta; Boufarès, Faouzi; Grozavu, Nistor; Clairmont, Charly
Format: Article
Language: eng
Subjects: Computer Science; Machine Learning
Online access: Full text
container_end_page 385
container_issue
container_start_page 374
container_title International journal of computer information systems and industrial management applications
container_volume 14
creator Chevallier, Marc
Rogovschi, Nicoleta
Boufarès, Faouzi
Grozavu, Nistor
Clairmont, Charly
description This paper introduces the concept of near duplicate dataset, a quasi-duplicate version of a dataset. This version has undergone an unknown number of row and column insertions and deletions (modifications on schema and instance). This concept is interesting for data exploration, data integration and data quality. To formalise these insertions and deletions, two parameters are introduced. Our technique for detecting these quasi-duplicate datasets is based on features extraction and machine learning. This method is original because it does not rely on classical techniques of comparisons between columns but on the comparison of metadata vectors summarising the datasets. In order to train these algorithms, we introduce a method to artificially generate training data. We perform several experiments to evaluate the best parameters to use when creating training data and the performance of several classifiers. In the studied cases, these experiments lead us to an accuracy rate higher than 95%.
format Article
fullrecord <record><control><sourceid>hal</sourceid><recordid>TN_cdi_hal_primary_oai_HAL_hal_03722301v1</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>oai_HAL_hal_03722301v1</sourcerecordid><originalsourceid>FETCH-hal_primary_oai_HAL_hal_03722301v13</originalsourceid><addsrcrecordid>eNpjYuA0MjQ10DW3tLBgQWJzMPAWF2cZAIGFobmhuRkng5lLaklqcklmXrqCX2pikYJLaUFOZnJiSaqCS2JJYnFqiUJ5ZkmGgm9ickZmXqqCD1BNHlAxDwNrWmJOcSovlOZm0HRzDXH20M1IzIkvKMrMTSyqjM9PzIz3cPSJB4kZGJsbGRkbGJYZGpOiFgCmKzqZ</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Detecting Near Duplicate Dataset with Machine Learning</title><source>EZB-FREE-00999 freely available EZB journals</source><creator>Chevallier, Marc ; Rogovschi, Nicoleta ; Boufarès, Faouzi ; Grozavu, Nistor ; Clairmont, Charly</creator><creatorcontrib>Chevallier, Marc ; Rogovschi, Nicoleta ; Boufarès, Faouzi ; Grozavu, Nistor ; Clairmont, Charly</creatorcontrib><description>This paper introduces the concept of near duplicate dataset, a quasi-duplicate version of a dataset. This version has undergone an unknown number of row and column insertions and deletions (modifications on schema and instance). This concept is interesting for data exploration, data integration and data quality. To formalise these insertions and deletions, two parameters are introduced. Our technique for detecting these quasi-duplicate datasets is based on features extraction and machine learning. This method is original because it does not rely on classical techniques of comparisons between columns but on the comparison of metadata vectors summarising the datasets. In order to train these algorithms, we introduce a method to artificially generate training data. We perform several experiments to evaluate the best parameters to use when creating training data and the performance of several classifiers. 
In the studied cases, these experiments lead us to an accuracy rate higher than 95%.</description><identifier>ISSN: 2150-7988</identifier><identifier>EISSN: 2150-7988</identifier><language>eng</language><publisher>Machine Intelligence Research Labs (MIR Labs)</publisher><subject>Computer Science ; Machine Learning</subject><ispartof>International journal of computer information systems and industrial management applications, 2022, Vol.14, p.374-385</ispartof><rights>Distributed under a Creative Commons Attribution 4.0 International License</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><orcidid>0000-0002-7983-6147 ; 0000-0001-7502-8022 ; 0000-0001-7502-8022 ; 0000-0002-7983-6147</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>230,314,780,784,885,4024</link.rule.ids><backlink>$$Uhttps://hal.science/hal-03722301$$DView record in HAL$$Hfree_for_read</backlink></links><search><creatorcontrib>Chevallier, Marc</creatorcontrib><creatorcontrib>Rogovschi, Nicoleta</creatorcontrib><creatorcontrib>Boufarès, Faouzi</creatorcontrib><creatorcontrib>Grozavu, Nistor</creatorcontrib><creatorcontrib>Clairmont, Charly</creatorcontrib><title>Detecting Near Duplicate Dataset with Machine Learning</title><title>International journal of computer information systems and industrial management applications</title><description>This paper introduces the concept of near duplicate dataset, a quasi-duplicate version of a dataset. This version has undergone an unknown number of row and column insertions and deletions (modifications on schema and instance). This concept is interesting for data exploration, data integration and data quality. To formalise these insertions and deletions, two parameters are introduced. 
Our technique for detecting these quasi-duplicate datasets is based on features extraction and machine learning. This method is original because it does not rely on classical techniques of comparisons between columns but on the comparison of metadata vectors summarising the datasets. In order to train these algorithms, we introduce a method to artificially generate training data. We perform several experiments to evaluate the best parameters to use when creating training data and the performance of several classifiers. In the studied cases, these experiments lead us to an accuracy rate higher than 95%.</description><subject>Computer Science</subject><subject>Machine Learning</subject><issn>2150-7988</issn><issn>2150-7988</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><recordid>eNpjYuA0MjQ10DW3tLBgQWJzMPAWF2cZAIGFobmhuRkng5lLaklqcklmXrqCX2pikYJLaUFOZnJiSaqCS2JJYnFqiUJ5ZkmGgm9ickZmXqqCD1BNHlAxDwNrWmJOcSovlOZm0HRzDXH20M1IzIkvKMrMTSyqjM9PzIz3cPSJB4kZGJsbGRkbGJYZGpOiFgCmKzqZ</recordid><startdate>2022</startdate><enddate>2022</enddate><creator>Chevallier, Marc</creator><creator>Rogovschi, Nicoleta</creator><creator>Boufarès, Faouzi</creator><creator>Grozavu, Nistor</creator><creator>Clairmont, Charly</creator><general>Machine Intelligence Research Labs (MIR Labs)</general><scope>1XC</scope><scope>VOOES</scope><orcidid>https://orcid.org/0000-0002-7983-6147</orcidid><orcidid>https://orcid.org/0000-0001-7502-8022</orcidid><orcidid>https://orcid.org/0000-0001-7502-8022</orcidid><orcidid>https://orcid.org/0000-0002-7983-6147</orcidid></search><sort><creationdate>2022</creationdate><title>Detecting Near Duplicate Dataset with Machine Learning</title><author>Chevallier, Marc ; Rogovschi, Nicoleta ; Boufarès, Faouzi ; Grozavu, Nistor ; Clairmont, 
Charly</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-hal_primary_oai_HAL_hal_03722301v13</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Computer Science</topic><topic>Machine Learning</topic><toplevel>online_resources</toplevel><creatorcontrib>Chevallier, Marc</creatorcontrib><creatorcontrib>Rogovschi, Nicoleta</creatorcontrib><creatorcontrib>Boufarès, Faouzi</creatorcontrib><creatorcontrib>Grozavu, Nistor</creatorcontrib><creatorcontrib>Clairmont, Charly</creatorcontrib><collection>Hyper Article en Ligne (HAL)</collection><collection>Hyper Article en Ligne (HAL) (Open Access)</collection><jtitle>International journal of computer information systems and industrial management applications</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Chevallier, Marc</au><au>Rogovschi, Nicoleta</au><au>Boufarès, Faouzi</au><au>Grozavu, Nistor</au><au>Clairmont, Charly</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Detecting Near Duplicate Dataset with Machine Learning</atitle><jtitle>International journal of computer information systems and industrial management applications</jtitle><date>2022</date><risdate>2022</risdate><volume>14</volume><spage>374</spage><epage>385</epage><pages>374-385</pages><issn>2150-7988</issn><eissn>2150-7988</eissn><abstract>This paper introduces the concept of near duplicate dataset, a quasi-duplicate version of a dataset. This version has undergone an unknown number of row and column insertions and deletions (modifications on schema and instance). This concept is interesting for data exploration, data integration and data quality. To formalise these insertions and deletions, two parameters are introduced. Our technique for detecting these quasi-duplicate datasets is based on features extraction and machine learning. 
This method is original because it does not rely on classical techniques of comparisons between columns but on the comparison of metadata vectors summarising the datasets. In order to train these algorithms, we introduce a method to artificially generate training data. We perform several experiments to evaluate the best parameters to use when creating training data and the performance of several classifiers. In the studied cases, these experiments lead us to an accuracy rate higher than 95%.</abstract><pub>Machine Intelligence Research Labs (MIR Labs)</pub><orcidid>https://orcid.org/0000-0002-7983-6147</orcidid><orcidid>https://orcid.org/0000-0001-7502-8022</orcidid><orcidid>https://orcid.org/0000-0001-7502-8022</orcidid><orcidid>https://orcid.org/0000-0002-7983-6147</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2150-7988
ispartof International journal of computer information systems and industrial management applications, 2022, Vol.14, p.374-385
issn 2150-7988
2150-7988
language eng
recordid cdi_hal_primary_oai_HAL_hal_03722301v1
source EZB-FREE-00999 freely available EZB journals
subjects Computer Science
Machine Learning
title Detecting Near Duplicate Dataset with Machine Learning
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-30T08%3A50%3A05IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-hal&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Detecting%20Near%20Duplicate%20Dataset%20with%20Machine%20Learning&rft.jtitle=International%20journal%20of%20computer%20information%20systems%20and%20industrial%20management%20applications&rft.au=Chevallier,%20Marc&rft.date=2022&rft.volume=14&rft.spage=374&rft.epage=385&rft.pages=374-385&rft.issn=2150-7988&rft.eissn=2150-7988&rft_id=info:doi/&rft_dat=%3Chal%3Eoai_HAL_hal_03722301v1%3C/hal%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true