Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies

Prior studies in privacy policies frame the question answering (QA) task as identifying the most relevant text segment or a list of sentences from a policy document given a user query. Existing labeled datasets are heavily imbalanced (only a few relevant segments), limiting the QA performance in thi...

Detailed description

Bibliographic details
Main authors: Parvez, Md Rizwan, Chi, Jianfeng, Ahmad, Wasi Uddin, Tian, Yuan, Chang, Kai-Wei
Format: Article
Language: eng
Subjects:
Online access: Order full text
creator Parvez, Md Rizwan
Chi, Jianfeng
Ahmad, Wasi Uddin
Tian, Yuan
Chang, Kai-Wei
description Prior studies in privacy policies frame the question answering (QA) task as identifying the most relevant text segment or a list of sentences from a policy document given a user query. Existing labeled datasets are heavily imbalanced (only a few relevant segments), limiting the QA performance in this domain. In this paper, we develop a data augmentation framework based on ensembling retriever models that captures the relevant text segments from unlabeled policy documents and expands the positive examples in the training set. In addition, to improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise reduction filter models. Using our augmented data on the PrivacyQA benchmark, we elevate the existing baseline by a large margin (10% F1) and achieve a new state-of-the-art F1 score of 50%. Our ablation studies provide further insights into the effectiveness of our approach.
doi_str_mv 10.48550/arxiv.2204.08952
format Article
fullrecord <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2204_08952</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2204_08952</sourcerecordid><originalsourceid>FETCH-LOGICAL-a672-e9e0d0c5bf13adc88be1214a06ffd65076677d410ab09208b9ffcdedb4da69913</originalsourceid><addsrcrecordid>eNotz71OwzAUhmEvDKjlApjwDSQcO4kTj1EpP1KlFtQ9OrGPW0upg5w00LsHUqZP7_JJD2P3AtK8Kgp4xPjtp1RKyFOodCFv2faDxuhpwo6vwxGDIcufcERenw8nCiOOvg_c9ZG_n2mYow7DF0UfDvw3dtFPaC5813feeBqW7MZhN9Dd_y7Y_nm9X70mm-3L26reJKhKmZAmsGCK1okMramqloQUOYJyzqoCSqXK0uYCsAUtoWq1c8aSbXOLSmuRLdjD9XYWNZ_RnzBemj9ZM8uyHz-oShs</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies</title><source>arXiv.org</source><creator>Parvez, Md Rizwan ; Chi, Jianfeng ; Ahmad, Wasi Uddin ; Tian, Yuan ; Chang, Kai-Wei</creator><creatorcontrib>Parvez, Md Rizwan ; Chi, Jianfeng ; Ahmad, Wasi Uddin ; Tian, Yuan ; Chang, Kai-Wei</creatorcontrib><description>Prior studies in privacy policies frame the question answering (QA) task as identifying the most relevant text segment or a list of sentences from a policy document given a user query. Existing labeled datasets are heavily imbalanced (only a few relevant segments), limiting the QA performance in this domain. In this paper, we develop a data augmentation framework based on ensembling retriever models that captures the relevant text segments from unlabeled policy documents and expand the positive examples in the training set. In addition, to improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise reduction filter models. 
Using our augmented data on the PrivacyQA benchmark, we elevate the existing baseline by a large margin (10\% F1) and achieve a new state-of-the-art F1 score of 50\%. Our ablation studies provide further insights into the effectiveness of our approach.</description><identifier>DOI: 10.48550/arxiv.2204.08952</identifier><language>eng</language><subject>Computer Science - Computation and Language</subject><creationdate>2022-04</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,881</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2204.08952$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2204.08952$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Parvez, Md Rizwan</creatorcontrib><creatorcontrib>Chi, Jianfeng</creatorcontrib><creatorcontrib>Ahmad, Wasi Uddin</creatorcontrib><creatorcontrib>Tian, Yuan</creatorcontrib><creatorcontrib>Chang, Kai-Wei</creatorcontrib><title>Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies</title><description>Prior studies in privacy policies frame the question answering (QA) task as identifying the most relevant text segment or a list of sentences from a policy document given a user query. Existing labeled datasets are heavily imbalanced (only a few relevant segments), limiting the QA performance in this domain. In this paper, we develop a data augmentation framework based on ensembling retriever models that captures the relevant text segments from unlabeled policy documents and expand the positive examples in the training set. 
In addition, to improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise reduction filter models. Using our augmented data on the PrivacyQA benchmark, we elevate the existing baseline by a large margin (10\% F1) and achieve a new state-of-the-art F1 score of 50\%. Our ablation studies provide further insights into the effectiveness of our approach.</description><subject>Computer Science - Computation and Language</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotz71OwzAUhmEvDKjlApjwDSQcO4kTj1EpP1KlFtQ9OrGPW0upg5w00LsHUqZP7_JJD2P3AtK8Kgp4xPjtp1RKyFOodCFv2faDxuhpwo6vwxGDIcufcERenw8nCiOOvg_c9ZG_n2mYow7DF0UfDvw3dtFPaC5813feeBqW7MZhN9Dd_y7Y_nm9X70mm-3L26reJKhKmZAmsGCK1okMramqloQUOYJyzqoCSqXK0uYCsAUtoWq1c8aSbXOLSmuRLdjD9XYWNZ_RnzBemj9ZM8uyHz-oShs</recordid><startdate>20220419</startdate><enddate>20220419</enddate><creator>Parvez, Md Rizwan</creator><creator>Chi, Jianfeng</creator><creator>Ahmad, Wasi Uddin</creator><creator>Tian, Yuan</creator><creator>Chang, Kai-Wei</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20220419</creationdate><title>Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies</title><author>Parvez, Md Rizwan ; Chi, Jianfeng ; Ahmad, Wasi Uddin ; Tian, Yuan ; Chang, Kai-Wei</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a672-e9e0d0c5bf13adc88be1214a06ffd65076677d410ab09208b9ffcdedb4da69913</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Computer Science - Computation and Language</topic><toplevel>online_resources</toplevel><creatorcontrib>Parvez, Md Rizwan</creatorcontrib><creatorcontrib>Chi, Jianfeng</creatorcontrib><creatorcontrib>Ahmad, Wasi Uddin</creatorcontrib><creatorcontrib>Tian, 
Yuan</creatorcontrib><creatorcontrib>Chang, Kai-Wei</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Parvez, Md Rizwan</au><au>Chi, Jianfeng</au><au>Ahmad, Wasi Uddin</au><au>Tian, Yuan</au><au>Chang, Kai-Wei</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies</atitle><date>2022-04-19</date><risdate>2022</risdate><abstract>Prior studies in privacy policies frame the question answering (QA) task as identifying the most relevant text segment or a list of sentences from a policy document given a user query. Existing labeled datasets are heavily imbalanced (only a few relevant segments), limiting the QA performance in this domain. In this paper, we develop a data augmentation framework based on ensembling retriever models that captures the relevant text segments from unlabeled policy documents and expand the positive examples in the training set. In addition, to improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise reduction filter models. Using our augmented data on the PrivacyQA benchmark, we elevate the existing baseline by a large margin (10\% F1) and achieve a new state-of-the-art F1 score of 50\%. Our ablation studies provide further insights into the effectiveness of our approach.</abstract><doi>10.48550/arxiv.2204.08952</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2204.08952
language eng
recordid cdi_arxiv_primary_2204_08952
source arXiv.org
subjects Computer Science - Computation and Language
title Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T22%3A28%3A20IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Retrieval%20Enhanced%20Data%20Augmentation%20for%20Question%20Answering%20on%20Privacy%20Policies&rft.au=Parvez,%20Md%20Rizwan&rft.date=2022-04-19&rft_id=info:doi/10.48550/arxiv.2204.08952&rft_dat=%3Carxiv_GOX%3E2204_08952%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true