A Lightweight Indexing Approach for Efficient Batch Similarity Processing with MapReduce


Bibliographic Details
Published in:	SN Computer Science, 2020, Vol. 1 (1), p. 1, Article 1
Main Authors:	Phan, Trong Nhan; Dang, Tran Khanh
Format:	Article
Language:	eng
Subjects:	Algorithms; Batch processing; Big Data; Computer Imaging; Computer Science; Computer Systems Organization and Communication Networks; Data Structures and Information Theory; Datasets; Employment; Future Data and Security Engineering; Information Systems and Communication Service; Lightweight; Massive data points; Original Research; Pattern Recognition and Graphics; Queries; Query processing; Redundancy; Searching; Similarity; Software Engineering/Programming and Operating Systems; Vision
Online Access:	Full text

Description
Abstract:	Similarity search is a principal operation in many fields of study. However, it is expensive for several reasons, chiefly redundancy and heavy data load. Many approaches concentrate on speeding up similarity search, especially on massive datasets, so that it can be employed in a wide range of recent applications. In this paper, we study an efficient way to perform either single or batch similarity processing with MapReduce while minimizing redundant data by building lightweight indexes from the data and query sources. More specifically, we propose a general query processing scheme that not only handles a single query but also deals with sets of queries in an incremental manner. In addition, we build the indexes in an ordered fashion, the so-called sorted inverted indexes, so that we can apply a quick pruning strategy that discards unrelated objects. Moreover, we embed metadata inside the indexes to reduce inessential duplicates. Finally, we evaluate our proposed solution through empirical experiments on real datasets. The results verify the efficiency of our method for similarity search with query batches, especially when both query sets and datasets are large.
ISSN:	2662-995X, 2661-8907
DOI:	10.1007/s42979-019-0007-y
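
Illustrative sketch: the abstract mentions sorted inverted indexes, metadata embedded in the index, pruning of unrelated objects, and batches of queries. The Python below is a minimal single-machine illustration of those general ideas and is not the authors' algorithm. Everything specific in it is an assumption made for the example: the names map_phase, reduce_phase, and batch_search, the use of Jaccard similarity over word sets, the choice of the object's size as the embedded metadata, and the size-based pruning rule (a standard length filter for Jaccard thresholds). The record itself does not state which similarity measure, metadata, or pruning rule the paper uses.

```python
from collections import defaultdict


def map_phase(docs):
    """Emit (term, (doc_id, size)) pairs, as a map step might.

    `size` (number of distinct terms) is the metadata carried in each
    posting, so similarity can later be computed from the index alone.
    """
    for doc_id, text in docs.items():
        terms = set(text.lower().split())
        for term in terms:
            yield term, (doc_id, len(terms))


def reduce_phase(pairs):
    """Group postings by term and sort each list by (size, doc_id)."""
    index = defaultdict(list)
    for term, posting in pairs:
        index[term].append(posting)
    return {
        term: sorted(postings, key=lambda p: (p[1], p[0]))
        for term, postings in index.items()
    }


def batch_search(index, queries, threshold):
    """Answer a batch of queries with Jaccard similarity >= threshold.

    Assumes 0 < threshold <= 1. Objects sharing no term with a query are
    never visited; objects whose size falls outside
    [threshold * |q|, |q| / threshold] cannot reach the threshold and
    are skipped.
    """
    results = {}
    for q_id, text in queries.items():
        q_terms = set(text.lower().split())
        lo = threshold * len(q_terms)
        hi = len(q_terms) / threshold
        overlap = defaultdict(int)
        sizes = {}
        for term in q_terms:
            for doc_id, size in index.get(term, []):
                if size > hi:
                    break  # list is sorted by size: the rest are too large
                if size < lo:
                    continue  # too small to reach the threshold
                overlap[doc_id] += 1
                sizes[doc_id] = size
        hits = []
        for doc_id, shared in overlap.items():
            jaccard = shared / (len(q_terms) + sizes[doc_id] - shared)
            if jaccard >= threshold:
                hits.append((doc_id, jaccard))
        # Highest similarity first, ties broken by doc_id.
        results[q_id] = sorted(hits, key=lambda h: (-h[1], h[0]))
    return results


# Hypothetical toy data, for illustration only.
docs = {
    "d1": "big data similarity search",
    "d2": "similarity search with mapreduce",
    "d3": "cooking pasta at home",
    "d4": "search",
}
queries = {"q1": "similarity search", "q2": "pasta cooking"}

index = reduce_phase(map_phase(docs))
results = batch_search(index, queries, threshold=0.5)
```

Sorting each posting list by size is what makes the early `break` safe: once one posting exceeds the upper size bound, all later postings in that list do too. In an actual MapReduce job the map and reduce functions would run as distributed tasks rather than as local generators.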