Reducing the Dimensional Dependence for Rank-Based Similarity Search

Jeevan Arjun Shelke; Viresh Shivshankar Humnabadkar; Tushar Tinajirao Kale; Girish Gundopant Kulkarni; P. S. Kulkarni

Reducing the Dimensional Dependence for Rank-Based Similarity Search

Jeevan Arjun Shelke, Viresh Shivshankar Humnabadkar, Tushar Tinajirao Kale, Girish Gundopant Kulkarni, P. S. Kulkarni

Abstract

In this paper a data structure for k-NN search, the rank cover tree (RCT) is implemented. The pruning tests for RCT rely on the comparison of similarity values and not on the other properties of the underlying space, such as the triangle inequality. Objects are selected according to their ranks with respect to the query object, allowing much tighter control on the overall execution costs. Theoretical analysis shows that with very high probability, the RCT returns a correct query result in time that depends very competitively on a measure of the intrinsic dimensionality of the data set. The experimental results for the RCT show that non-metric pruning strategies for similarity search can be practical even when the representational dimension of the data is extremely high. They also show that the RCT is capable of meeting or exceeding the level of performance of state-of-the-art methods that make use of metric pruning or other selection tests involving numerical constraints on distance values.

Cite this Article Jeevan Arjun Shelke, Viresh Shivshankar Humnabadkar, Tushar Tinajirao Kale et al. Reducing the Dimensional Dependence for Rank-Based Similarity Search. Journal of Advanced Database Management & Systems. 2016; 3(1): 9–17p.

Keywords

Nearest neighbor search, intrinsic dimensionality, rank-based search

Full Text:

PDF

References

Guttman A. R-Trees: A Dynamic Index Structure for Spatial Searching. In Proc Annu Meeting. 1984; 47–57p.

Ciaccia P, Patella M, Zezula P. M-Tree: An Efficient Access Method for Similarity Search in Metric Spaces. In Proc 23rd Int Conf Very Large Data Bases. 1997; 426–435p.

Goyal N, Lifshits Y, Sch€utze H. Disorder Inequality: A Combinatorial Approach to Nearest Neighbor Search. In Proc Intern Conf Web Search Web Data Mining. 2008; 25–32p.

Guha S, Rastogi R, Shim K. ROCK: A Robust Clustering Algorithm for Categorical Attributes. Inf Syst. 2000; 25(5): 345–366p.

Ert€oz L, Steinbach M, Kumar V. Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data. In Proc 3rd SIAM Int Conf Data Mining. 2003; 1p.

Andoni A, Indyk P. E2LSH 0.1: User Manual. 2005. [Online]. Available: www.mit.edu/andoni/LSH/. 2005.

de Vries T, Chawla S, Houle ME. Finding Local Anomalies in Very High Dimensional Space. In Proc IEEE Int Conf Data Mining. 2010; 128–137p.

Han J, Kamber M. Data Mining: Concepts and Techniques. San Francisco, CA, USA: Morgan Kaufmann; 2006.

Chan HT-H, Gupta A, Maggs BM, et al. On Hierarchical Routing in Doubling Metrics. In Proc 15th Annu ACMSIAM Symp Discrete Algorithm. 2005; 762–771p.

Bozkaya T, Ozsoyoglu M. Indexing Large Metric Spaces for Similarity Search Queries. ACM Trans Database Syst. 1999; 24(3): 361–404p.

Beygelzimer A, Kakade S, Langford J. Cover Trees for Nearest Neighbor. In Proc 23rd Int Conf Mach Learn. 2006; 97–104p.

Breunig MM, Kriegel H-P, Ng RT, et al. LOF: Identifying Density-Based Local Outliers. SIGMOD Rec. 2000; 29(2): 93–104p.

Ch_avez E, Navarro G, Baeza-Yates R, et al. Searching in Metric Spaces. ACM Comput Surv. 2001; 33(3): 273–321p.

Beyer KS, Goldstein J, Ramakrishnan R, et al. When is Nearest Neighbor Meaningful? In Proc 7th Int Conf Database Theory. 1999; 217–235p.

Asuncion A, Newman DJ. UCI Machine Learning Repository. 2007. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html

Chandola V, Banerjee A, Kumar V. Anomaly Detection: A Survey. ACM Comput Surv. 2009; 41(3): 1–58p.

Gupta A, Krauthgamer R, Lee JR. Bounded Geometries, Fractals, and Low-Distortion Embeddings. In Proc 44th Annu IEEE Symp Foundations Comput Sci. 2003; 534p.

Karger DR, Ruhl M. Finding Nearest Neighbors in Growth Restricted Metrics. In Proc 34th ACM Symp Theory Comput. 2002; 741–750p.

Geusebroek JM, Burghouts GJ, Smeulders AWM. The Amsterdam Library of Object Images. Int J Comput Vis. 2005; 61(1): 103–112p.

Houle ME, Sakuma J. Fast Approximate Similarity Search in Extremely High-Dimensional Data Sets. In Proc 21st Intern Conf Data Eng. 2005; 619–630p.

Cover T, Hart P. Nearest Neighbor Pattern Classification. IEEE Trans Inf Theory. Jan 1967; IT-13(1): 21–27p.

Houle ME, Nett M. Rank Cover Trees for Nearest Neighbor Search. In Proc Int Conf Similarity Search Appl. 2013; 16–29p.

Goodman JE, O’Rourke J, editors. Handbook of Discrete and Computational Geometry. Cleveland, OH, USA: CRC Press; 1997.

Indyk P, Motwani R. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proc 30th ACM Symp Theory Comput. 1998; 604–613p.

Houle ME. The Relevant Set Correlation Model for Data Clustering. Statist Anal Data Mining. 2008; 1(3): 157–176p.

Houle Michael E, Michael Nett. Rank-Based Similarity Search: Reducing the Dimensional Dependence. IEEE Trans Pattern Anal Machine Intell. Jan 2015; 37(1).

Abraham I, Malkhi D, Dobzinski O. LAND: Stretch (1 + Epsilon) Locality-Aware Networks for DHTs. In Proc. 15th Annu ACM-SIAM Symp Discrete Algorithm. 2004; 550–559p.

Ankerst M, Breunig MM, Kriegel H-P, et al. OPTICS: Ordering Points to Identify the Clustering Structure. In Proc ACM SIGMOD Int Conf Manag Data. 1999; 49–60p.

Brin S. Near Neighbor Search in Large Metric Spaces. In Proc 21th Int Conf Very Large Data Bases. 1995; 574–584p.

Ester M, Kriegel H-P, Sander J, et al. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proc 2nd Int Conf Knowl Discov Data Mining. 1996; 226–231p.

Gionis A, Indyk P, Motwani R. Similarity Search in High Dimensions via Hashing. In Proc 25th Int Conf Very Large Data Bases. 1999; 518–529p.

Guha S, Rastogi R, Shim K. CURE: An Efficient Clustering Algorithm for Large Databases. Inf Syst. 2001; 26(1): 35–58p.

Katayama N, Satoh S. The SR-Tree: An Index Structure for High-Dimensional Nearest Neighbor Queries. In Proc ACM SIGMOD Int Conf Manag Data. May 1997; 369–380p.

Krauthgamer R, Lee JR. The Black-Box Complexity of Nearest Neighbor Search. In Proc 31st Int Colloquium Automata, Lang Programm. 2004; 858–869p.

Krauthgamer R, Lee JR. Navigating Nets: Simple Algorithms for Proximity Search. In Proc 15th Annu ACM-SIAM Symp Discrete Algorithms. 2004; 798–807p.

Le Cun Y, Bottou L, Bengio Y, et al. Gradient-Based Learning Applied to Document Recognition. Proc IEEE. Nov 1998; 86(11): 2278–2324p.

Refbacks

There are currently no refbacks.

This site has been shifted to https://stmcomputers.stmjournals.com/

Username
Password
Remember me