Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies

Prior studies in privacy policies frame the question answering (QA) task as identifying the most relevant text segment or a list of sentences from a policy document given a user query. Existing labeled datasets are heavily imbalanced (only a few relevant segments), limiting the QA performance in thi...

Bibliographic details
Main authors:	Parvez, Md Rizwan; Chi, Jianfeng; Ahmad, Wasi Uddin; Tian, Yuan; Chang, Kai-Wei
Format:	Article
Language:	eng
Subjects:	Computer Science - Computation and Language
Online access:	Order full text

creator	Parvez, Md Rizwan; Chi, Jianfeng; Ahmad, Wasi Uddin; Tian, Yuan; Chang, Kai-Wei
description	Prior studies in privacy policies frame the question answering (QA) task as identifying the most relevant text segment or a list of sentences from a policy document given a user query. Existing labeled datasets are heavily imbalanced (only a few relevant segments), limiting the QA performance in this domain. In this paper, we develop a data augmentation framework based on ensembling retriever models that captures the relevant text segments from unlabeled policy documents and expands the positive examples in the training set. In addition, to improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise reduction filter models. Using our augmented data on the PrivacyQA benchmark, we elevate the existing baseline by a large margin (10% F1) and achieve a new state-of-the-art F1 score of 50%. Our ablation studies provide further insights into the effectiveness of our approach.
doi_str_mv	10.48550/arxiv.2204.08952
format	Article
date	2022-04-19
rights	http://creativecommons.org/licenses/by/4.0
oa	free_for_read
sourcetype	Open Access Repository
linktorsrc	https://arxiv.org/abs/2204.08952
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2204.08952
language	eng
recordid	cdi_arxiv_primary_2204_08952
source	arXiv.org
subjects	Computer Science - Computation and Language
title	Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T22%3A28%3A20IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Retrieval%20Enhanced%20Data%20Augmentation%20for%20Question%20Answering%20on%20Privacy%20Policies&rft.au=Parvez,%20Md%20Rizwan&rft.date=2022-04-19&rft_id=info:doi/10.48550/arxiv.2204.08952&rft_dat=%3Carxiv_GOX%3E2204_08952%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true