Vietnamese Legal Information Retrieval in Question-Answering System

In the modern era of rapidly increasing data volumes, accurately retrieving and recommending relevant documents has become crucial in enhancing the reliability of Question Answering (QA) systems. Recently, Retrieval Augmented Generation (RAG) has gained significant recognition for enhancing the capa...

Bibliographic details
Main authors:	Ba, Thiem Nguyen; The, Vinh Doan; Quang, Tung Pham; Van, Toan Tran
Format:	Article
Language:	eng
Subjects:	Computer Science - Information Retrieval

creator	Ba, Thiem Nguyen ; The, Vinh Doan ; Quang, Tung Pham ; Van, Toan Tran
description	In the modern era of rapidly increasing data volumes, accurately retrieving and recommending relevant documents has become crucial in enhancing the reliability of Question Answering (QA) systems. Recently, Retrieval Augmented Generation (RAG) has gained significant recognition for enhancing the capabilities of large language models (LLMs) by mitigating hallucination issues in QA systems, which is particularly beneficial in the legal domain. Various methods, such as semantic search using dense vector embeddings or a combination of multiple techniques to improve results before feeding them to LLMs, have been proposed. However, these methods often fall short when applied to the Vietnamese language due to several challenges, namely inefficient Vietnamese data processing leading to excessive token length or overly simplistic ensemble techniques that lead to instability and limited improvement. Moreover, a critical issue often overlooked is the ordering of final relevant documents which are used as reference to ensure the accuracy of the answers provided by LLMs. In this report, we introduce our three main modifications taken to address these challenges. First, we explore various practical approaches to data processing to overcome the limitations of the embedding model. Additionally, we enhance Reciprocal Rank Fusion by normalizing order to combine results from keyword and vector searches effectively. We also meticulously re-rank the source pieces of information used by LLMs with Active Retrieval to improve user experience when refining the information generated. In our opinion, this technique can also be considered as a new re-ranking method that might be used in place of the traditional cross encoder. Finally, we integrate these techniques into a comprehensive QA system, significantly improving its performance and reliability
doi_str_mv	10.48550/arxiv.2409.13699
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2409_13699</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2409_13699</sourcerecordid><originalsourceid>FETCH-arxiv_primary_2409_136993</originalsourceid><addsrcrecordid>eNpjYJA0NNAzsTA1NdBPLKrILNMzMjGw1DM0NrO05GRwDstMLclLzE0tTlXwSU1PzFHwzEvLL8pNLMnMz1MISi0pykwtA4pm5ikElqYWg0R1HfOKy1OLMvPSFYIri0tSc3kYWNMSc4pTeaE0N4O8m2uIs4cu2Lr4gqLM3MSiyniQtfFga40JqwAA7TA4cQ</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Vietnamese Legal Information Retrieval in Question-Answering System</title><source>arXiv.org</source><creator>Ba, Thiem Nguyen ; The, Vinh Doan ; Quang, Tung Pham ; Van, Toan Tran</creator><creatorcontrib>Ba, Thiem Nguyen ; The, Vinh Doan ; Quang, Tung Pham ; Van, Toan Tran</creatorcontrib><description>In the modern era of rapidly increasing data volumes, accurately retrieving and recommending relevant documents has become crucial in enhancing the reliability of Question Answering (QA) systems. Recently, Retrieval Augmented Generation (RAG) has gained significant recognition for enhancing the capabilities of large language models (LLMs) by mitigating hallucination issues in QA systems, which is particularly beneficial in the legal domain. Various methods, such as semantic search using dense vector embeddings or a combination of multiple techniques to improve results before feeding them to LLMs, have been proposed. However, these methods often fall short when applied to the Vietnamese language due to several challenges, namely inefficient Vietnamese data processing leading to excessive token length or overly simplistic ensemble techniques that lead to instability and limited improvement. 
Moreover, a critical issue often overlooked is the ordering of final relevant documents which are used as reference to ensure the accuracy of the answers provided by LLMs. In this report, we introduce our three main modifications taken to address these challenges. First, we explore various practical approaches to data processing to overcome the limitations of the embedding model. Additionally, we enhance Reciprocal Rank Fusion by normalizing order to combine results from keyword and vector searches effectively. We also meticulously re-rank the source pieces of information used by LLMs with Active Retrieval to improve user experience when refining the information generated. In our opinion, this technique can also be considered as a new re-ranking method that might be used in place of the traditional cross encoder. Finally, we integrate these techniques into a comprehensive QA system, significantly improving its performance and reliability</description><identifier>DOI: 10.48550/arxiv.2409.13699</identifier><language>eng</language><subject>Computer Science - Information Retrieval</subject><creationdate>2024-09</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2409.13699$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2409.13699$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Ba, Thiem Nguyen</creatorcontrib><creatorcontrib>The, Vinh Doan</creatorcontrib><creatorcontrib>Quang, Tung Pham</creatorcontrib><creatorcontrib>Van, Toan Tran</creatorcontrib><title>Vietnamese Legal Information Retrieval in 
Question-Answering System</title><description>In the modern era of rapidly increasing data volumes, accurately retrieving and recommending relevant documents has become crucial in enhancing the reliability of Question Answering (QA) systems. Recently, Retrieval Augmented Generation (RAG) has gained significant recognition for enhancing the capabilities of large language models (LLMs) by mitigating hallucination issues in QA systems, which is particularly beneficial in the legal domain. Various methods, such as semantic search using dense vector embeddings or a combination of multiple techniques to improve results before feeding them to LLMs, have been proposed. However, these methods often fall short when applied to the Vietnamese language due to several challenges, namely inefficient Vietnamese data processing leading to excessive token length or overly simplistic ensemble techniques that lead to instability and limited improvement. Moreover, a critical issue often overlooked is the ordering of final relevant documents which are used as reference to ensure the accuracy of the answers provided by LLMs. In this report, we introduce our three main modifications taken to address these challenges. First, we explore various practical approaches to data processing to overcome the limitations of the embedding model. Additionally, we enhance Reciprocal Rank Fusion by normalizing order to combine results from keyword and vector searches effectively. We also meticulously re-rank the source pieces of information used by LLMs with Active Retrieval to improve user experience when refining the information generated. In our opinion, this technique can also be considered as a new re-ranking method that might be used in place of the traditional cross encoder. 
Finally, we integrate these techniques into a comprehensive QA system, significantly improving its performance and reliability</description><subject>Computer Science - Information Retrieval</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNpjYJA0NNAzsTA1NdBPLKrILNMzMjGw1DM0NrO05GRwDstMLclLzE0tTlXwSU1PzFHwzEvLL8pNLMnMz1MISi0pykwtA4pm5ikElqYWg0R1HfOKy1OLMvPSFYIri0tSc3kYWNMSc4pTeaE0N4O8m2uIs4cu2Lr4gqLM3MSiyniQtfFga40JqwAA7TA4cQ</recordid><startdate>20240904</startdate><enddate>20240904</enddate><creator>Ba, Thiem Nguyen</creator><creator>The, Vinh Doan</creator><creator>Quang, Tung Pham</creator><creator>Van, Toan Tran</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20240904</creationdate><title>Vietnamese Legal Information Retrieval in Question-Answering System</title><author>Ba, Thiem Nguyen ; The, Vinh Doan ; Quang, Tung Pham ; Van, Toan Tran</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-arxiv_primary_2409_136993</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Information Retrieval</topic><toplevel>online_resources</toplevel><creatorcontrib>Ba, Thiem Nguyen</creatorcontrib><creatorcontrib>The, Vinh Doan</creatorcontrib><creatorcontrib>Quang, Tung Pham</creatorcontrib><creatorcontrib>Van, Toan Tran</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Ba, Thiem Nguyen</au><au>The, Vinh Doan</au><au>Quang, Tung Pham</au><au>Van, Toan Tran</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Vietnamese Legal Information Retrieval in Question-Answering 
System</atitle><date>2024-09-04</date><risdate>2024</risdate><abstract>In the modern era of rapidly increasing data volumes, accurately retrieving and recommending relevant documents has become crucial in enhancing the reliability of Question Answering (QA) systems. Recently, Retrieval Augmented Generation (RAG) has gained significant recognition for enhancing the capabilities of large language models (LLMs) by mitigating hallucination issues in QA systems, which is particularly beneficial in the legal domain. Various methods, such as semantic search using dense vector embeddings or a combination of multiple techniques to improve results before feeding them to LLMs, have been proposed. However, these methods often fall short when applied to the Vietnamese language due to several challenges, namely inefficient Vietnamese data processing leading to excessive token length or overly simplistic ensemble techniques that lead to instability and limited improvement. Moreover, a critical issue often overlooked is the ordering of final relevant documents which are used as reference to ensure the accuracy of the answers provided by LLMs. In this report, we introduce our three main modifications taken to address these challenges. First, we explore various practical approaches to data processing to overcome the limitations of the embedding model. Additionally, we enhance Reciprocal Rank Fusion by normalizing order to combine results from keyword and vector searches effectively. We also meticulously re-rank the source pieces of information used by LLMs with Active Retrieval to improve user experience when refining the information generated. In our opinion, this technique can also be considered as a new re-ranking method that might be used in place of the traditional cross encoder. 
Finally, we integrate these techniques into a comprehensive QA system, significantly improving its performance and reliability</abstract><doi>10.48550/arxiv.2409.13699</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2409.13699
language	eng
recordid	cdi_arxiv_primary_2409_13699
source	arXiv.org
subjects	Computer Science - Information Retrieval
title	Vietnamese Legal Information Retrieval in Question-Answering System
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-29T13%3A54%3A04IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Vietnamese%20Legal%20Information%20Retrieval%20in%20Question-Answering%20System&rft.au=Ba,%20Thiem%20Nguyen&rft.date=2024-09-04&rft_id=info:doi/10.48550/arxiv.2409.13699&rft_dat=%3Carxiv_GOX%3E2409_13699%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true