Detecting Near Duplicate Dataset with Machine Learning
This paper introduces the concept of a near-duplicate dataset, a quasi-duplicate version of a dataset. This version has undergone an unknown number of row and column insertions and deletions (modifications to schema and instance). This concept is of interest for data exploration, data integration and...
Published in: | International journal of computer information systems and industrial management applications 2022, Vol.14, p.374-385 |
---|---|
Main authors: | Chevallier, Marc ; Rogovschi, Nicoleta ; Boufarès, Faouzi ; Grozavu, Nistor ; Clairmont, Charly |
Format: | Article |
Language: | eng |
Subjects: | Computer Science ; Machine Learning |
Online access: | Full text |
container_end_page | 385 |
---|---|
container_issue | |
container_start_page | 374 |
container_title | International journal of computer information systems and industrial management applications |
container_volume | 14 |
creator | Chevallier, Marc ; Rogovschi, Nicoleta ; Boufarès, Faouzi ; Grozavu, Nistor ; Clairmont, Charly |
description | This paper introduces the concept of a near-duplicate dataset, a quasi-duplicate version of a dataset. This version has undergone an unknown number of row and column insertions and deletions (modifications to schema and instance). This concept is of interest for data exploration, data integration and data quality. To formalise these insertions and deletions, two parameters are introduced. Our technique for detecting these quasi-duplicate datasets is based on feature extraction and machine learning. The method is original because it relies not on classical column-to-column comparison techniques but on comparing metadata vectors that summarise the datasets. To train these algorithms, we introduce a method for artificially generating training data. We perform several experiments to evaluate the best parameters to use when creating training data and to compare the performance of several classifiers. In the studied cases, these experiments yield an accuracy above 95%. |
format | Article |
fullrecord | <record><control><sourceid>hal</sourceid><recordid>TN_cdi_hal_primary_oai_HAL_hal_03722301v1</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>oai_HAL_hal_03722301v1</sourcerecordid><originalsourceid>FETCH-hal_primary_oai_HAL_hal_03722301v13</originalsourceid><addsrcrecordid>eNpjYuA0MjQ10DW3tLBgQWJzMPAWF2cZAIGFobmhuRkng5lLaklqcklmXrqCX2pikYJLaUFOZnJiSaqCS2JJYnFqiUJ5ZkmGgm9ickZmXqqCD1BNHlAxDwNrWmJOcSovlOZm0HRzDXH20M1IzIkvKMrMTSyqjM9PzIz3cPSJB4kZGJsbGRkbGJYZGpOiFgCmKzqZ</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Detecting Near Duplicate Dataset with Machine Learning</title><source>EZB-FREE-00999 freely available EZB journals</source><creator>Chevallier, Marc ; Rogovschi, Nicoleta ; Boufarès, Faouzi ; Grozavu, Nistor ; Clairmont, Charly</creator><creatorcontrib>Chevallier, Marc ; Rogovschi, Nicoleta ; Boufarès, Faouzi ; Grozavu, Nistor ; Clairmont, Charly</creatorcontrib><description>This paper introduces the concept of near duplicate dataset, a quasi-duplicate version of a dataset. This version has undergone an unknown number of row and column insertions and deletions (modifications on schema and instance). This concept is interesting for data exploration, data integration and data quality. To formalise these insertions and deletions, two parameters are introduced. Our technique for detecting these quasi-duplicate datasets is based on features extraction and machine learning. This method is original because it does not rely on classical techniques of comparisons between columns but on the comparison of metadata vectors summarising the datasets. In order to train these algorithms, we introduce a method to artificially generate training data. We perform several experiments to evaluate the best parameters to use when creating training data and the performance of several classifiers. 
In the studied cases, these experiments lead us to an accuracy rate higher than 95%.</description><identifier>ISSN: 2150-7988</identifier><identifier>EISSN: 2150-7988</identifier><language>eng</language><publisher>Machine Intelligence Research Labs (MIR Labs)</publisher><subject>Computer Science ; Machine Learning</subject><ispartof>International journal of computer information systems and industrial management applications, 2022, Vol.14, p.374-385</ispartof><rights>Distributed under a Creative Commons Attribution 4.0 International License</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><orcidid>0000-0002-7983-6147 ; 0000-0001-7502-8022 ; 0000-0001-7502-8022 ; 0000-0002-7983-6147</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>230,314,780,784,885,4024</link.rule.ids><backlink>$$Uhttps://hal.science/hal-03722301$$DView record in HAL$$Hfree_for_read</backlink></links><search><creatorcontrib>Chevallier, Marc</creatorcontrib><creatorcontrib>Rogovschi, Nicoleta</creatorcontrib><creatorcontrib>Boufarès, Faouzi</creatorcontrib><creatorcontrib>Grozavu, Nistor</creatorcontrib><creatorcontrib>Clairmont, Charly</creatorcontrib><title>Detecting Near Duplicate Dataset with Machine Learning</title><title>International journal of computer information systems and industrial management applications</title><description>This paper introduces the concept of near duplicate dataset, a quasi-duplicate version of a dataset. This version has undergone an unknown number of row and column insertions and deletions (modifications on schema and instance). This concept is interesting for data exploration, data integration and data quality. To formalise these insertions and deletions, two parameters are introduced. 
Our technique for detecting these quasi-duplicate datasets is based on features extraction and machine learning. This method is original because it does not rely on classical techniques of comparisons between columns but on the comparison of metadata vectors summarising the datasets. In order to train these algorithms, we introduce a method to artificially generate training data. We perform several experiments to evaluate the best parameters to use when creating training data and the performance of several classifiers. In the studied cases, these experiments lead us to an accuracy rate higher than 95%.</description><subject>Computer Science</subject><subject>Machine Learning</subject><issn>2150-7988</issn><issn>2150-7988</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><recordid>eNpjYuA0MjQ10DW3tLBgQWJzMPAWF2cZAIGFobmhuRkng5lLaklqcklmXrqCX2pikYJLaUFOZnJiSaqCS2JJYnFqiUJ5ZkmGgm9ickZmXqqCD1BNHlAxDwNrWmJOcSovlOZm0HRzDXH20M1IzIkvKMrMTSyqjM9PzIz3cPSJB4kZGJsbGRkbGJYZGpOiFgCmKzqZ</recordid><startdate>2022</startdate><enddate>2022</enddate><creator>Chevallier, Marc</creator><creator>Rogovschi, Nicoleta</creator><creator>Boufarès, Faouzi</creator><creator>Grozavu, Nistor</creator><creator>Clairmont, Charly</creator><general>Machine Intelligence Research Labs (MIR Labs)</general><scope>1XC</scope><scope>VOOES</scope><orcidid>https://orcid.org/0000-0002-7983-6147</orcidid><orcidid>https://orcid.org/0000-0001-7502-8022</orcidid><orcidid>https://orcid.org/0000-0001-7502-8022</orcidid><orcidid>https://orcid.org/0000-0002-7983-6147</orcidid></search><sort><creationdate>2022</creationdate><title>Detecting Near Duplicate Dataset with Machine Learning</title><author>Chevallier, Marc ; Rogovschi, Nicoleta ; Boufarès, Faouzi ; Grozavu, Nistor ; Clairmont, 
Charly</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-hal_primary_oai_HAL_hal_03722301v13</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Computer Science</topic><topic>Machine Learning</topic><toplevel>online_resources</toplevel><creatorcontrib>Chevallier, Marc</creatorcontrib><creatorcontrib>Rogovschi, Nicoleta</creatorcontrib><creatorcontrib>Boufarès, Faouzi</creatorcontrib><creatorcontrib>Grozavu, Nistor</creatorcontrib><creatorcontrib>Clairmont, Charly</creatorcontrib><collection>Hyper Article en Ligne (HAL)</collection><collection>Hyper Article en Ligne (HAL) (Open Access)</collection><jtitle>International journal of computer information systems and industrial management applications</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Chevallier, Marc</au><au>Rogovschi, Nicoleta</au><au>Boufarès, Faouzi</au><au>Grozavu, Nistor</au><au>Clairmont, Charly</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Detecting Near Duplicate Dataset with Machine Learning</atitle><jtitle>International journal of computer information systems and industrial management applications</jtitle><date>2022</date><risdate>2022</risdate><volume>14</volume><spage>374</spage><epage>385</epage><pages>374-385</pages><issn>2150-7988</issn><eissn>2150-7988</eissn><abstract>This paper introduces the concept of near duplicate dataset, a quasi-duplicate version of a dataset. This version has undergone an unknown number of row and column insertions and deletions (modifications on schema and instance). This concept is interesting for data exploration, data integration and data quality. To formalise these insertions and deletions, two parameters are introduced. Our technique for detecting these quasi-duplicate datasets is based on features extraction and machine learning. 
This method is original because it does not rely on classical techniques of comparisons between columns but on the comparison of metadata vectors summarising the datasets. In order to train these algorithms, we introduce a method to artificially generate training data. We perform several experiments to evaluate the best parameters to use when creating training data and the performance of several classifiers. In the studied cases, these experiments lead us to an accuracy rate higher than 95%.</abstract><pub>Machine Intelligence Research Labs (MIR Labs)</pub><orcidid>https://orcid.org/0000-0002-7983-6147</orcidid><orcidid>https://orcid.org/0000-0001-7502-8022</orcidid><orcidid>https://orcid.org/0000-0001-7502-8022</orcidid><orcidid>https://orcid.org/0000-0002-7983-6147</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 2150-7988 |
ispartof | International journal of computer information systems and industrial management applications, 2022, Vol.14, p.374-385 |
issn | 2150-7988 2150-7988 |
language | eng |
recordid | cdi_hal_primary_oai_HAL_hal_03722301v1 |
source | EZB-FREE-00999 freely available EZB journals |
subjects | Computer Science ; Machine Learning |
title | Detecting Near Duplicate Dataset with Machine Learning |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-30T08%3A50%3A05IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-hal&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Detecting%20Near%20Duplicate%20Dataset%20with%20Machine%20Learning&rft.jtitle=International%20journal%20of%20computer%20information%20systems%20and%20industrial%20management%20applications&rft.au=Chevallier,%20Marc&rft.date=2022&rft.volume=14&rft.spage=374&rft.epage=385&rft.pages=374-385&rft.issn=2150-7988&rft.eissn=2150-7988&rft_id=info:doi/&rft_dat=%3Chal%3Eoai_HAL_hal_03722301v1%3C/hal%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |
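
The description field above outlines the approach only in general terms: near duplicates are produced by row and column insertions and deletions, and datasets are compared through summarising metadata vectors rather than column by column. A minimal sketch of that idea follows. It is not the authors' implementation: the function names, the two rates `row_drop` and `col_drop`, and the six summary features are assumptions made for illustration, and only deletions (not insertions) are simulated.

```python
import random
import statistics


def make_near_duplicate(rows, row_drop, col_drop, rng):
    """Return a copy of a numeric table with random rows/columns deleted.

    `rows` is a list of equal-length lists. `row_drop` and `col_drop` are
    hypothetical deletion rates in [0, 1); the record does not name or
    define the paper's two parameters.
    """
    n_cols = len(rows[0])
    keep_cols = [c for c in range(n_cols) if rng.random() >= col_drop]
    if not keep_cols:  # always keep at least one column
        keep_cols = [rng.randrange(n_cols)]
    kept_rows = [r for r in rows if rng.random() >= row_drop]
    if not kept_rows:  # always keep at least one row
        kept_rows = [rng.choice(rows)]
    return [[r[c] for c in keep_cols] for r in kept_rows]


def metadata_vector(rows):
    """Summarise a numeric table as a fixed-length vector.

    The six features are illustrative assumptions, chosen so that tables
    with different shapes still map to vectors of the same length.
    """
    n_cols = len(rows[0])
    col_means = [statistics.fmean(r[c] for r in rows) for c in range(n_cols)]
    col_stdevs = [statistics.pstdev([r[c] for r in rows]) for c in range(n_cols)]
    return [
        float(len(rows)),              # number of rows
        float(n_cols),                 # number of columns
        statistics.fmean(col_means),   # mean of the column means
        statistics.fmean(col_stdevs),  # mean of the column spreads
        min(min(r) for r in rows),     # global minimum
        max(max(r) for r in rows),     # global maximum
    ]


def pair_features(a, b):
    """Feature vector for a dataset pair: absolute metadata differences."""
    return [abs(x - y) for x, y in zip(metadata_vector(a), metadata_vector(b))]


# Build one artificial positive pair and one negative pair.
rng = random.Random(0)
original = [[rng.gauss(10 * c, 1.0) for c in range(5)] for _ in range(200)]
near_dup = make_near_duplicate(original, row_drop=0.1, col_drop=0.2, rng=rng)
unrelated = [[rng.uniform(-1000, 1000) for _ in range(5)] for _ in range(200)]

positive = pair_features(original, near_dup)   # would be labelled 1
negative = pair_features(original, unrelated)  # would be labelled 0
```

Pairs built this way, labelled 1 for artificial near duplicates and 0 for unrelated tables, could then be passed to any off-the-shelf classifier. The record does not say which classifiers the paper evaluates.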