OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution Queries

State-of-the-art algorithms for Approximate Nearest Neighbor Search (ANNS) such as DiskANN, FAISS-IVF, and HNSW build data-dependent indices that offer substantially better accuracy and search efficiency than data-agnostic indices by overfitting to the index data distribution. When the query data is...

Bibliographic details
Main authors: Jaiswal, Shikhar; Krishnaswamy, Ravishankar; Garg, Ankit; Simhadri, Harsha Vardhan; Agrawal, Sheshansh
Format: Article
Language: English
Subjects: Computer Science - Information Retrieval; Computer Science - Learning
Online access: Order full text
container_end_page
container_issue
container_start_page
container_title
container_volume
creator Jaiswal, Shikhar
Krishnaswamy, Ravishankar
Garg, Ankit
Simhadri, Harsha Vardhan
Agrawal, Sheshansh
description State-of-the-art algorithms for Approximate Nearest Neighbor Search (ANNS) such as DiskANN, FAISS-IVF, and HNSW build data-dependent indices that offer substantially better accuracy and search efficiency than data-agnostic indices by overfitting to the index data distribution. When the query data is drawn from a different distribution - e.g., when the index represents image embeddings and the queries represent textual embeddings - such algorithms lose much of this performance advantage. On a variety of datasets, for a fixed recall target, latency is worse by an order of magnitude or more for Out-Of-Distribution (OOD) queries than for In-Distribution (ID) queries. The question we address in this work is whether ANNS algorithms can be made efficient for OOD queries if the index construction is given access to a small sample set of these queries. We answer positively by presenting OOD-DiskANN, which uses a small sample (1% of the index set size) of OOD queries and provides up to 40% improvement in mean query latency over SoTA algorithms of a similar memory footprint. OOD-DiskANN is scalable and has the efficiency of graph-based ANNS indices. Some of our contributions can improve query efficiency for ID queries as well.
doi_str_mv 10.48550/arxiv.2211.12850
format Article
fullrecord <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2211_12850</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2211_12850</sourcerecordid><originalsourceid>FETCH-LOGICAL-a670-f4bea383184dfd700c76bef76cd2b3ccd4de7307f5764a49a22d76514b223b8c3</originalsourceid><addsrcrecordid>eNotz0tOwzAYBGBvWKDCAVjhCzj4bcOuaktBqhKhdh_9fgmLkFROguD20JbVbGZG-hC6Y7SSVin6AOU7f1WcM1YxbhW9RnXTrMk6jx_Lun7Cm5Syz7GfMPQB7z104LqItwWO7_ivscdpKLiZJzKk02oq2c1THnr8NseS43iDrhJ0Y7z9zwU6PG8Oqxeya7avq-WOgDaUJOkiCCuYlSEFQ6k32sVktA_cCe-DDNEIapIyWoJ8BM6D0YpJx7lw1osFur_cnkHtseRPKD_tCdaeYeIXVxRHUw</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution Queries</title><source>arXiv.org</source><creator>Jaiswal, Shikhar ; Krishnaswamy, Ravishankar ; Garg, Ankit ; Simhadri, Harsha Vardhan ; Agrawal, Sheshansh</creator><creatorcontrib>Jaiswal, Shikhar ; Krishnaswamy, Ravishankar ; Garg, Ankit ; Simhadri, Harsha Vardhan ; Agrawal, Sheshansh</creatorcontrib><description>State-of-the-art algorithms for Approximate Nearest Neighbor Search (ANNS) such as DiskANN, FAISS-IVF, and HNSW build data dependent indices that offer substantially better accuracy and search efficiency over data-agnostic indices by overfitting to the index data distribution. When the query data is drawn from a different distribution - e.g., when index represents image embeddings and query represents textual embeddings - such algorithms lose much of this performance advantage. On a variety of datasets, for a fixed recall target, latency is worse by an order of magnitude or more for Out-Of-Distribution (OOD) queries as compared to In-Distribution (ID) queries. 
The question we address in this work is whether ANNS algorithms can be made efficient for OOD queries if the index construction is given access to a small sample set of these queries. We answer positively by presenting OOD-DiskANN, which uses a sparing sample (1% of index set size) of OOD queries, and provides up to 40% improvement in mean query latency over SoTA algorithms of a similar memory footprint. OOD-DiskANN is scalable and has the efficiency of graph-based ANNS indices. Some of our contributions can improve query efficiency for ID queries as well.</description><identifier>DOI: 10.48550/arxiv.2211.12850</identifier><language>eng</language><subject>Computer Science - Information Retrieval ; Computer Science - Learning</subject><creationdate>2022-10</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,777,882</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2211.12850$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2211.12850$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Jaiswal, Shikhar</creatorcontrib><creatorcontrib>Krishnaswamy, Ravishankar</creatorcontrib><creatorcontrib>Garg, Ankit</creatorcontrib><creatorcontrib>Simhadri, Harsha Vardhan</creatorcontrib><creatorcontrib>Agrawal, Sheshansh</creatorcontrib><title>OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution Queries</title><description>State-of-the-art algorithms for Approximate Nearest Neighbor Search (ANNS) such as DiskANN, FAISS-IVF, and HNSW build data dependent indices that offer substantially better accuracy and search efficiency over data-agnostic indices by 
overfitting to the index data distribution. When the query data is drawn from a different distribution - e.g., when index represents image embeddings and query represents textual embeddings - such algorithms lose much of this performance advantage. On a variety of datasets, for a fixed recall target, latency is worse by an order of magnitude or more for Out-Of-Distribution (OOD) queries as compared to In-Distribution (ID) queries. The question we address in this work is whether ANNS algorithms can be made efficient for OOD queries if the index construction is given access to a small sample set of these queries. We answer positively by presenting OOD-DiskANN, which uses a sparing sample (1% of index set size) of OOD queries, and provides up to 40% improvement in mean query latency over SoTA algorithms of a similar memory footprint. OOD-DiskANN is scalable and has the efficiency of graph-based ANNS indices. Some of our contributions can improve query efficiency for ID queries as well.</description><subject>Computer Science - Information Retrieval</subject><subject>Computer Science - Learning</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotz0tOwzAYBGBvWKDCAVjhCzj4bcOuaktBqhKhdh_9fgmLkFROguD20JbVbGZG-hC6Y7SSVin6AOU7f1WcM1YxbhW9RnXTrMk6jx_Lun7Cm5Syz7GfMPQB7z104LqItwWO7_ivscdpKLiZJzKk02oq2c1THnr8NseS43iDrhJ0Y7z9zwU6PG8Oqxeya7avq-WOgDaUJOkiCCuYlSEFQ6k32sVktA_cCe-DDNEIapIyWoJ8BM6D0YpJx7lw1osFur_cnkHtseRPKD_tCdaeYeIXVxRHUw</recordid><startdate>20221022</startdate><enddate>20221022</enddate><creator>Jaiswal, Shikhar</creator><creator>Krishnaswamy, Ravishankar</creator><creator>Garg, Ankit</creator><creator>Simhadri, Harsha Vardhan</creator><creator>Agrawal, Sheshansh</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20221022</creationdate><title>OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution 
Queries</title><author>Jaiswal, Shikhar ; Krishnaswamy, Ravishankar ; Garg, Ankit ; Simhadri, Harsha Vardhan ; Agrawal, Sheshansh</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a670-f4bea383184dfd700c76bef76cd2b3ccd4de7307f5764a49a22d76514b223b8c3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Computer Science - Information Retrieval</topic><topic>Computer Science - Learning</topic><toplevel>online_resources</toplevel><creatorcontrib>Jaiswal, Shikhar</creatorcontrib><creatorcontrib>Krishnaswamy, Ravishankar</creatorcontrib><creatorcontrib>Garg, Ankit</creatorcontrib><creatorcontrib>Simhadri, Harsha Vardhan</creatorcontrib><creatorcontrib>Agrawal, Sheshansh</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Jaiswal, Shikhar</au><au>Krishnaswamy, Ravishankar</au><au>Garg, Ankit</au><au>Simhadri, Harsha Vardhan</au><au>Agrawal, Sheshansh</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution Queries</atitle><date>2022-10-22</date><risdate>2022</risdate><abstract>State-of-the-art algorithms for Approximate Nearest Neighbor Search (ANNS) such as DiskANN, FAISS-IVF, and HNSW build data dependent indices that offer substantially better accuracy and search efficiency over data-agnostic indices by overfitting to the index data distribution. When the query data is drawn from a different distribution - e.g., when index represents image embeddings and query represents textual embeddings - such algorithms lose much of this performance advantage. 
On a variety of datasets, for a fixed recall target, latency is worse by an order of magnitude or more for Out-Of-Distribution (OOD) queries as compared to In-Distribution (ID) queries. The question we address in this work is whether ANNS algorithms can be made efficient for OOD queries if the index construction is given access to a small sample set of these queries. We answer positively by presenting OOD-DiskANN, which uses a sparing sample (1% of index set size) of OOD queries, and provides up to 40% improvement in mean query latency over SoTA algorithms of a similar memory footprint. OOD-DiskANN is scalable and has the efficiency of graph-based ANNS indices. Some of our contributions can improve query efficiency for ID queries as well.</abstract><doi>10.48550/arxiv.2211.12850</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2211.12850
ispartof
issn
language eng
recordid cdi_arxiv_primary_2211_12850
source arXiv.org
subjects Computer Science - Information Retrieval
Computer Science - Learning
title OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution Queries
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-21T03%3A17%3A28IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=OOD-DiskANN:%20Efficient%20and%20Scalable%20Graph%20ANNS%20for%20Out-of-Distribution%20Queries&rft.au=Jaiswal,%20Shikhar&rft.date=2022-10-22&rft_id=info:doi/10.48550/arxiv.2211.12850&rft_dat=%3Carxiv_GOX%3E2211_12850%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true