Efficient fuzzy search enabled hash map

Hash maps are data structures widely used in modern programming languages like Java for their simplicity and efficiency. When fuzzy string search is needed (like in natural language processing) finding an approximate key match in a regular Java HashMap is a trivial task. It usually requires the brut...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
1. Verfasser: Topac, V
Format: Tagungsbericht
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Hash maps are data structures widely used in modern programming languages like Java for their simplicity and efficiency. When fuzzy string search is needed (like in natural language processing) finding an approximate key match in a regular Java HashMap is a trivial task. It usually requires the brute force method of iterating trough the set of keys and use of string metrics methods. Although this approach works it is time consuming and loses the hashing advantage of the hash map. Another option is to use a different data structure like TreeMap, which is faster, but also have limitations on fuzzy string search. This article presents FuzzyHashMap, an extension to the regular Java HashMap data structure allowing highly efficient fuzzy string key search. Based on object oriented principles this extended hash map uses a custom key that enables different types of pre-hashing functions and different types of dynamic programming algorithms for approximate string matching. Customizable algorithms and settings bring flexibility to this new data structure, making it adaptable to each specific use case. Fuzzy string search performance comparison between FuzzyHashMap and the regular HashMap are presented for both accuracy and time consumption. Results show very good performance for FuzzyHashMap compared to the regular HashMap. Some real use cases for the extended hash map are listed. All the work described in this paper is released as open source, making it easy for the community to use and extend the capabilities of the current implementation.
DOI:10.1109/SOFA.2010.5565628