DREW : Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking

Identifying the origin of data is crucial for data provenance, with applications including data ownership protection, media forensics, and detecting AI-generated content. A standard approach involves embedding-based retrieval techniques that match query data with entries in a reference dataset. Howe...

Detailed description

Bibliographic details
Published in: arXiv.org 2024-06
Main authors: Saberi, Mehrdad; Sadasivan, Vinu Sankar; Zarei, Arman; Mahdavifar, Hessam; Feizi, Soheil
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Saberi, Mehrdad
Sadasivan, Vinu Sankar
Zarei, Arman
Mahdavifar, Hessam
Feizi, Soheil
description Identifying the origin of data is crucial for data provenance, with applications including data ownership protection, media forensics, and detecting AI-generated content. A standard approach involves embedding-based retrieval techniques that match query data with entries in a reference dataset. However, this method is not robust against benign and malicious edits. To address this, we propose Data Retrieval with Error-corrected codes and Watermarking (DREW). DREW randomly clusters the reference dataset, injects unique error-controlled watermark keys into each cluster, and uses these keys at query time to identify the appropriate cluster for a given sample. After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches. The integration of error control codes (ECC) ensures reliable cluster assignments, enabling the method to perform retrieval on the entire dataset in case the ECC algorithm cannot detect the correct cluster with high confidence. This makes DREW maintain baseline performance, while also providing opportunities for performance improvements due to the increased likelihood of correctly matching queries to their origin when performing retrieval on a smaller subset of the dataset. Depending on the watermark technique used, DREW can provide substantial improvements in retrieval accuracy (up to 40% for some datasets and modification types) across multiple datasets and state-of-the-art embedding models (e.g., DinoV2, CLIP), making our method a promising solution for secure and reliable source identification. The code is available at https://github.com/mehrdadsaberi/DREW
format Article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3065128655</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3065128655</sourcerecordid><originalsourceid>FETCH-proquest_journals_30651286553</originalsourceid><addsrcrecordid>eNqNjsEKgkAUAJcgSMp_eNBZ0N006apGhw4lgkdZc5PM9tXb1ejv89AHdJrDzGFmzOFCBF684XzBXGM63_d5tOVhKBx2TvOshB0U-JbUGMixHoyFVFoJJ8JRaakvCuoPHNWoSLY33UJGhOQlqC1h36sGSmkVPSTdJ7ti86vsjXJ_XLL1PiuSg_ckfA3K2KrDgfSkKuFHYcDjaDr5r_oCKEk-ug</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3065128655</pqid></control><display><type>article</type><title>DREW : Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking</title><source>Free E- Journals</source><creator>Saberi, Mehrdad ; Sadasivan, Vinu Sankar ; Zarei, Arman ; Mahdavifar, Hessam ; Feizi, Soheil</creator><creatorcontrib>Saberi, Mehrdad ; Sadasivan, Vinu Sankar ; Zarei, Arman ; Mahdavifar, Hessam ; Feizi, Soheil</creatorcontrib><description>Identifying the origin of data is crucial for data provenance, with applications including data ownership protection, media forensics, and detecting AI-generated content. A standard approach involves embedding-based retrieval techniques that match query data with entries in a reference dataset. However, this method is not robust against benign and malicious edits. To address this, we propose Data Retrieval with Error-corrected codes and Watermarking (DREW). DREW randomly clusters the reference dataset, injects unique error-controlled watermark keys into each cluster, and uses these keys at query time to identify the appropriate cluster for a given sample. After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches. 
The integration of error control codes (ECC) ensures reliable cluster assignments, enabling the method to perform retrieval on the entire dataset in case the ECC algorithm cannot detect the correct cluster with high confidence. This makes DREW maintain baseline performance, while also providing opportunities for performance improvements due to the increased likelihood of correctly matching queries to their origin when performing retrieval on a smaller subset of the dataset. Depending on the watermark technique used, DREW can provide substantial improvements in retrieval accuracy (up to 40\% for some datasets and modification types) across multiple datasets and state-of-the-art embedding models (e.g., DinoV2, CLIP), making our method a promising solution for secure and reliable source identification. The code is available at https://github.com/mehrdadsaberi/DREW</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Algorithms ; Clusters ; Data retrieval ; Datasets ; Embedding ; Error correction ; Robustness (mathematics) ; Watermarking</subject><ispartof>arXiv.org, 2024-06</ispartof><rights>2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Saberi, Mehrdad</creatorcontrib><creatorcontrib>Sadasivan, Vinu Sankar</creatorcontrib><creatorcontrib>Zarei, Arman</creatorcontrib><creatorcontrib>Mahdavifar, Hessam</creatorcontrib><creatorcontrib>Feizi, Soheil</creatorcontrib><title>DREW : Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking</title><title>arXiv.org</title><description>Identifying the origin of data is crucial for data provenance, with applications including data ownership protection, media forensics, and detecting AI-generated content. A standard approach involves embedding-based retrieval techniques that match query data with entries in a reference dataset. However, this method is not robust against benign and malicious edits. To address this, we propose Data Retrieval with Error-corrected codes and Watermarking (DREW). DREW randomly clusters the reference dataset, injects unique error-controlled watermark keys into each cluster, and uses these keys at query time to identify the appropriate cluster for a given sample. After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches. The integration of error control codes (ECC) ensures reliable cluster assignments, enabling the method to perform retrieval on the entire dataset in case the ECC algorithm cannot detect the correct cluster with high confidence. 
This makes DREW maintain baseline performance, while also providing opportunities for performance improvements due to the increased likelihood of correctly matching queries to their origin when performing retrieval on a smaller subset of the dataset. Depending on the watermark technique used, DREW can provide substantial improvements in retrieval accuracy (up to 40\% for some datasets and modification types) across multiple datasets and state-of-the-art embedding models (e.g., DinoV2, CLIP), making our method a promising solution for secure and reliable source identification. The code is available at https://github.com/mehrdadsaberi/DREW</description><subject>Algorithms</subject><subject>Clusters</subject><subject>Data retrieval</subject><subject>Datasets</subject><subject>Embedding</subject><subject>Error correction</subject><subject>Robustness (mathematics)</subject><subject>Watermarking</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNjsEKgkAUAJcgSMp_eNBZ0N006apGhw4lgkdZc5PM9tXb1ejv89AHdJrDzGFmzOFCBF684XzBXGM63_d5tOVhKBx2TvOshB0U-JbUGMixHoyFVFoJJ8JRaakvCuoPHNWoSLY33UJGhOQlqC1h36sGSmkVPSTdJ7ti86vsjXJ_XLL1PiuSg_ckfA3K2KrDgfSkKuFHYcDjaDr5r_oCKEk-ug</recordid><startdate>20240620</startdate><enddate>20240620</enddate><creator>Saberi, Mehrdad</creator><creator>Sadasivan, Vinu Sankar</creator><creator>Zarei, Arman</creator><creator>Mahdavifar, Hessam</creator><creator>Feizi, Soheil</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20240620</creationdate><title>DREW : Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking</title><author>Saberi, Mehrdad ; Sadasivan, Vinu Sankar ; Zarei, Arman ; Mahdavifar, Hessam ; Feizi, Soheil</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_30651286553</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Algorithms</topic><topic>Clusters</topic><topic>Data retrieval</topic><topic>Datasets</topic><topic>Embedding</topic><topic>Error correction</topic><topic>Robustness (mathematics)</topic><topic>Watermarking</topic><toplevel>online_resources</toplevel><creatorcontrib>Saberi, Mehrdad</creatorcontrib><creatorcontrib>Sadasivan, Vinu Sankar</creatorcontrib><creatorcontrib>Zarei, Arman</creatorcontrib><creatorcontrib>Mahdavifar, Hessam</creatorcontrib><creatorcontrib>Feizi, Soheil</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering 
Collection</collection><collection>Engineering Database</collection><collection>Access via ProQuest (Open Access)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Saberi, Mehrdad</au><au>Sadasivan, Vinu Sankar</au><au>Zarei, Arman</au><au>Mahdavifar, Hessam</au><au>Feizi, Soheil</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>DREW : Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking</atitle><jtitle>arXiv.org</jtitle><date>2024-06-20</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Identifying the origin of data is crucial for data provenance, with applications including data ownership protection, media forensics, and detecting AI-generated content. A standard approach involves embedding-based retrieval techniques that match query data with entries in a reference dataset. However, this method is not robust against benign and malicious edits. To address this, we propose Data Retrieval with Error-corrected codes and Watermarking (DREW). DREW randomly clusters the reference dataset, injects unique error-controlled watermark keys into each cluster, and uses these keys at query time to identify the appropriate cluster for a given sample. After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches. The integration of error control codes (ECC) ensures reliable cluster assignments, enabling the method to perform retrieval on the entire dataset in case the ECC algorithm cannot detect the correct cluster with high confidence. 
This makes DREW maintain baseline performance, while also providing opportunities for performance improvements due to the increased likelihood of correctly matching queries to their origin when performing retrieval on a smaller subset of the dataset. Depending on the watermark technique used, DREW can provide substantial improvements in retrieval accuracy (up to 40\% for some datasets and modification types) across multiple datasets and state-of-the-art embedding models (e.g., DinoV2, CLIP), making our method a promising solution for secure and reliable source identification. The code is available at https://github.com/mehrdadsaberi/DREW</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-06
issn 2331-8422
language eng
recordid cdi_proquest_journals_3065128655
source Free E- Journals
subjects Algorithms
Clusters
Data retrieval
Datasets
Embedding
Error correction
Robustness (mathematics)
Watermarking
title DREW : Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-02T09%3A11%3A18IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=DREW%20:%20Towards%20Robust%20Data%20Provenance%20by%20Leveraging%20Error-Controlled%20Watermarking&rft.jtitle=arXiv.org&rft.au=Saberi,%20Mehrdad&rft.date=2024-06-20&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3065128655%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3065128655&rft_id=info:pmid/&rfr_iscdi=true
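
Illustrative sketch: the description field above outlines a retrieval flow (random clustering, an error-controlled key per cluster, cluster lookup at query time, and a fallback to the whole dataset when decoding is not confident). The Python below is a minimal sketch of that flow under loud assumptions and is not the authors' implementation, which is at the GitHub URL given in the description. A simple repetition code stands in for the ECC, the watermark embedding and extraction steps are not modeled (the extracted bits are passed in directly), embeddings are assumed to be nonzero vectors, and every name (REPEAT, ecc_encode, ecc_decode, build_index, retrieve) is hypothetical.

```python
import math
import random

REPEAT = 5  # repetition factor of the toy ECC (an assumption, not from the record)


def ecc_encode(cluster_id, n_bits):
    """Write cluster_id as n_bits bits (LSB first), each repeated REPEAT times."""
    bits = [(cluster_id >> i) & 1 for i in range(n_bits)]
    return [b for b in bits for _ in range(REPEAT)]


def ecc_decode(codeword, n_bits, min_margin=3):
    """Majority-vote decode; return None if any bit's vote is too close to call."""
    cluster_id = 0
    for i in range(n_bits):
        block = codeword[i * REPEAT:(i + 1) * REPEAT]
        ones = sum(block)
        margin = abs(2 * ones - REPEAT)  # |number of ones - number of zeros|
        if margin < min_margin:
            return None  # low confidence: caller should search everything
        if 2 * ones > REPEAT:
            cluster_id |= 1 << i
    return cluster_id


def cosine(a, b):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_index(embeddings, n_bits, seed=0):
    """Randomly assign each reference item to one of 2**n_bits clusters."""
    rng = random.Random(seed)
    return [rng.randrange(2 ** n_bits) for _ in embeddings]


def retrieve(query_emb, extracted_bits, embeddings, assignment, n_bits):
    """Return (index of best match, True if the search was limited to a cluster)."""
    cluster = ecc_decode(extracted_bits, n_bits)
    candidates = [i for i, c in enumerate(assignment) if c == cluster]
    used_cluster = bool(candidates)
    if not used_cluster:
        # Decoding failed or pointed at an empty cluster: search the full dataset.
        candidates = range(len(embeddings))
    best = max(candidates, key=lambda i: cosine(query_emb, embeddings[i]))
    return best, used_cluster
```

As a usage illustration, take reference embeddings [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0] with cluster assignment [0, 1, 2, 1], and a heavily edited query [1.0, 0.2, 0.0] that originated from the last item. A search over all items picks the first item, whereas a query whose extracted bits decode to cluster 1 is matched only against items 1 and 3 and recovers the last item. This mirrors, in toy form, the claim that searching a smaller subset can raise the chance of a correct match.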