Vietnamese Legal Information Retrieval in Question-Answering System
In the modern era of rapidly increasing data volumes, accurately retrieving and recommending relevant documents has become crucial in enhancing the reliability of Question Answering (QA) systems. Recently, Retrieval Augmented Generation (RAG) has gained significant recognition for enhancing the capa...
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Ba, Thiem Nguyen ; The, Vinh Doan ; Quang, Tung Pham ; Van, Toan Tran |
description | In the modern era of rapidly increasing data volumes, accurately retrieving
and recommending relevant documents has become crucial in enhancing the
reliability of Question Answering (QA) systems. Recently, Retrieval Augmented
Generation (RAG) has gained significant recognition for enhancing the
capabilities of large language models (LLMs) by mitigating hallucination issues
in QA systems, which is particularly beneficial in the legal domain. Various
methods, such as semantic search using dense vector embeddings or a combination
of multiple techniques to improve results before feeding them to LLMs, have
been proposed. However, these methods often fall short when applied to the
Vietnamese language because of several challenges, namely inefficient Vietnamese
data processing, which leads to excessive token length, and overly simplistic ensemble
techniques, which lead to instability and limited improvement. Moreover, a
critical and often overlooked issue is the ordering of the final relevant documents
that LLMs use as references to ensure the accuracy of their answers.
In this report, we introduce three main modifications that
address these challenges. First, we explore various practical approaches to
data processing to overcome the limitations of the embedding model.
Additionally, we enhance Reciprocal Rank Fusion by normalizing order to combine
results from keyword and vector searches effectively. We also meticulously
re-rank the source pieces of information used by LLMs with Active Retrieval to
improve user experience when refining the information generated. In our
opinion, this technique can also be considered a new re-ranking method that
might be used in place of the traditional cross-encoder. Finally, we integrate
these techniques into a comprehensive QA system, significantly improving its
performance and reliability. |
doi_str_mv | 10.48550/arxiv.2409.13699 |
format | Article |
fullrecord | <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2409_13699</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2409_13699</sourcerecordid><originalsourceid>FETCH-arxiv_primary_2409_136993</originalsourceid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Vietnamese Legal Information Retrieval in Question-Answering System</title><source>arXiv.org</source><creator>Ba, Thiem Nguyen ; The, Vinh Doan ; Quang, Tung Pham ; Van, Toan Tran</creator><identifier>DOI: 10.48550/arxiv.2409.13699</identifier><language>eng</language><subject>Computer Science - Information Retrieval</subject><creationdate>2024-09</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa></display><links><linktorsrc>$$Uhttps://arxiv.org/abs/2409.13699$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2409.13699$$DView paper in arXiv$$Hfree_for_read</backlink></links><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Ba, Thiem Nguyen</au><au>The, Vinh Doan</au><au>Quang, Tung Pham</au><au>Van, Toan Tran</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Vietnamese Legal Information Retrieval in Question-Answering System</atitle><date>2024-09-04</date><risdate>2024</risdate><doi>10.48550/arxiv.2409.13699</doi><oa>free_for_read</oa></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2409.13699 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2409_13699 |
source | arXiv.org |
subjects | Computer Science - Information Retrieval |
title | Vietnamese Legal Information Retrieval in Question-Answering System |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-29T13%3A54%3A04IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Vietnamese%20Legal%20Information%20Retrieval%20in%20Question-Answering%20System&rft.au=Ba,%20Thiem%20Nguyen&rft.date=2024-09-04&rft_id=info:doi/10.48550/arxiv.2409.13699&rft_dat=%3Carxiv_GOX%3E2409_13699%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |
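
The description above mentions combining keyword and vector search results with Reciprocal Rank Fusion (RRF). As a rough illustration only, the sketch below shows standard RRF, not the authors' order-normalized variant, whose details are not given in this record. The function name, the document IDs, and the constant `k=60` (a commonly used default) are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with standard RRF.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. The value k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a keyword search and a vector search.
keyword_hits = ["art_12", "art_7", "art_3"]
vector_hits = ["art_7", "art_9", "art_12"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever still keeps a small score. RRF uses only rank positions, so the keyword and vector scores never need to be put on a common scale.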