Video instance search via spatial fusion of visual words and object proposals

Bibliographic details
Published in:	International journal of multimedia information retrieval 2019-09, Vol.8 (3), p.181-192
Main authors:	Nguyen, Vinh-Tiep; Le, Duy Dinh; Tran, Minh-Triet; Nguyen, Tam V.; Ngo, Thanh Duc; Satoh, Shin’ichi; Duong, Duc Anh
Format:	Article
Language:	English
Subjects:	Computer Science; Data Mining and Knowledge Discovery; Database Management; Datasets; Deep learning; Image Processing and Computer Vision; Information Storage and Retrieval; Information Systems Applications (incl. Internet); Methods; Multimedia Information Systems; Neural networks; Proposals; Queries; Query expansion; Regular Paper; Searching; Sensors; Similarity; Surveillance; Visual discrimination; Weighting functions
Online access:	Full text

Description
Abstract:	Most popular systems for object instance search are based on the bag-of-visual-words model. The inherent weaknesses of this standard model, such as quantization error, unstructured representation, and the burstiness phenomenon, have been solved to some extent. However, the model still has a serious problem when searching for small objects in a database with cluttered backgrounds. In many situations, even irrelevant objects that share the same texture or shape as the query object receive higher scores than relevant ones. To overcome this problem, we propose a novel fusion method to significantly boost the accuracy of instance search systems. Firstly, we use a state-of-the-art object detector with denser features to find object bounding boxes and similarity scores. Secondly, to exploit the spatial relationship of each visual word with an object proposal (a detected area that might contain the query object), we define three categories of visual word pairs: discriminative, weak relevant, and context inferred. Finally, we propose a new re-ranking scheme with three weighting functions, one for each category of visual word pairs, to compute the final similarity score between a query topic and a video shot. To illustrate the efficiency of the proposed method, we conduct experiments on datasets that cover a wide variety of query object types. Experimental results on the TRECVID Instance Search datasets (INS2013 and INS2014) show the superiority of the proposed method over state-of-the-art approaches.
ISSN:	2192-6611 2192-662X
DOI:	10.1007/s13735-019-00172-z
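
The re-ranking idea in the abstract (matched visual words weighted by their spatial relationship to an object proposal) can be illustrated with a small Python sketch. The abstract does not define the three categories or their weighting functions, so everything below is an assumption made for illustration only: the geometric rule in `categorize`, the constant weights, the `margin` parameter, the `MatchedWord` type, and the way the detector score is fused in are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class MatchedWord:
    """A visual word in a video shot that matched a query word (hypothetical type)."""
    x: float
    y: float
    score: float  # assumed per-match bag-of-visual-words contribution


def inside(box: Box, x: float, y: float) -> bool:
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1


def expand(box: Box, margin: float) -> Box:
    """Grow a box by `margin` times its width and height on each side."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    return (x0 - margin * w, y0 - margin * h, x1 + margin * w, y1 + margin * h)


def categorize(word: MatchedWord, proposal: Box, margin: float = 0.5) -> str:
    """ASSUMED rule: inside the proposal -> discriminative,
    near the proposal -> context inferred, elsewhere -> weak relevant."""
    if inside(proposal, word.x, word.y):
        return "discriminative"
    if inside(expand(proposal, margin), word.x, word.y):
        return "context_inferred"
    return "weak_relevant"


# ASSUMED constant weights; the paper uses weighting functions not given in the abstract.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "discriminative": 1.0,
    "context_inferred": 0.5,
    "weak_relevant": 0.1,
}


def rerank_score(
    words: List[MatchedWord],
    proposal: Box,
    detector_score: float,
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of match scores, scaled by the detector score (assumed fusion)."""
    w = weights or DEFAULT_WEIGHTS
    bovw = sum(w[categorize(m, proposal)] * m.score for m in words)
    return bovw * (1.0 + detector_score)
```

With a proposal box of `(0, 0, 10, 10)`, a match at `(5, 5)` falls inside the box, a match at `(12, 5)` falls in the expanded margin, and a match at `(100, 100)` falls outside both, so each receives a different weight before the scores are summed. Under these assumed rules, matches on cluttered background far from any proposal contribute little to the final shot score.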