Partitioning the data space before applying hashing using clustering algorithms

Authors

  • Sergey A. Subbotin National University “Zaporizhzhia Polytechnic”, 64, Zhukovskogo Str. Zaporizhzhia, 69011, Ukraine
  • Fedir А. Shmalko National University “Zaporizhzhia Polytechnic”, 64, Zhukovskogo Str. Zaporizhzhia, 69011, Ukraine

DOI:

https://doi.org/10.15276/hait.8.2025.2

Keywords:

adaptive encoding tree, bidirectional encoder representations from transformers clusterization, dimensionality reduction, approximate nearest neighbor, multimodal data, root node

Abstract

This research presents a locality-sensitive hashing framework that enhances approximate nearest neighbor search efficiency by integrating adaptive encoding trees and BERT-based clusterization. The proposed method optimizes data space partitioning before applying hashing, improving retrieval accuracy while reducing computational complexity. First, multimodal data, such as images and textual descriptions, are transformed into a unified semantic space using pre-trained bidirectional encoder representations from transformers (BERT) embeddings. This ensures cross-modal consistency and facilitates high-dimensional similarity comparisons. Second, dimensionality reduction techniques such as Uniform Manifold Approximation and Projection or t-distributed stochastic neighbor embedding are applied to mitigate the curse of dimensionality while preserving key relationships between data points. Third, an adaptive encoding tree for locality-sensitive hashing is constructed, dynamically segmenting the data space based on its statistical distribution, thereby enabling efficient hierarchical clustering. Each data point is converted into a symbolic representation, allowing fast retrieval using structured hashing. Fourth, locality-sensitive hashing is applied to the encoded dataset, leveraging p-stable distributions to maintain high search precision while reducing index size. The combination of encoding trees and locality-sensitive hashing enables efficient candidate selection while minimizing search overhead. Experimental evaluations on the CarDD dataset, which includes car damage images and annotations, demonstrate that the proposed method outperforms state-of-the-art approximate nearest neighbor techniques in both indexing efficiency and retrieval accuracy. The results highlight its adaptability to large-scale, high-dimensional, and multimodal datasets, making it suitable for diagnostic models and real-time retrieval tasks.
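
The fourth step, hashing with p-stable distributions, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard Gaussian (2-stable) hash family h(v) = floor((a·v + b) / w), uses a single hash table, and leaves out the embedding, dimensionality reduction, and encoding tree steps. The function names and the parameters k, w, and seed are hypothetical and chosen only for illustration.

```python
import math
import random


def make_hash(dim, w, rng):
    """Draw one hash function h(v) = floor((a.v + b) / w).

    The entries of a are standard normal (a 2-stable distribution),
    and b is uniform in [0, w). The bucket width w is an assumed parameter.
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)

    def h(v):
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / w)

    return h


def build_index(points, k=4, w=4.0, seed=0):
    """Hash every point with k functions and group indices by the resulting key."""
    rng = random.Random(seed)
    dim = len(points[0])
    hashes = [make_hash(dim, w, rng) for _ in range(k)]
    table = {}
    for idx, p in enumerate(points):
        key = tuple(h(p) for h in hashes)
        table.setdefault(key, []).append(idx)
    return hashes, table


def query(q, points, hashes, table):
    """Return the index of the closest point in q's bucket, or None if the bucket is empty."""
    key = tuple(h(q) for h in hashes)
    candidates = table.get(key, [])
    return min(candidates, key=lambda i: math.dist(q, points[i]), default=None)
```

Only the points that share a bucket with the query are compared exactly, which is where the reduction in search overhead comes from. Practical systems typically use several independent tables to raise recall; a single table is kept here for brevity.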

Author Biographies

Sergey A. Subbotin, National University “Zaporizhzhia Polytechnic”, 64, Zhukovskogo Str. Zaporizhzhia, 69011, Ukraine

Doctor of Engineering Science, Professor, Head of Department of Software Tools

Scopus Author ID: 7006531104

Fedir А. Shmalko, National University “Zaporizhzhia Polytechnic”, 64, Zhukovskogo Str. Zaporizhzhia, 69011, Ukraine

PhD student, Department of Software Tools

Published

2025-04-04

How to Cite

Subbotin, S. A., & Shmalko, F. А. (2025). Partitioning the data space before applying hashing using clustering algorithms. Herald of Advanced Information Technology, 8(1), 28–42. https://doi.org/10.15276/hait.8.2025.2