From classification to taxonomy: Automated structuring of vehicle repair names in multilingual corpora
Abstract
This study introduces and validates a hybrid, five-stage Natural Language Processing pipeline that transforms unstructured, bilingual repair-order text into a fully navigable, hierarchical action taxonomy, bridging the gap between flat keyword classification and business-grade knowledge organization. To address the limitations of both traditional and modern Natural Language Processing methods on technical, noisy, domain-specific datasets, the proposed methodology integrates five stages: advanced lemmatization, manual creation of a core dictionary, semantic filtering, transformer-based classification, and embedding-driven clustering. By combining Ukrainian lemmatization, dynamic semantic filtering, multilingual sentence embeddings, and density clustering, the pipeline systematically overcomes the noise, code-switching, and “long-tail” rarity that typify real-world automotive datasets. Tested on a corpus of over 4.3 million service records, the approach achieves over 92% cluster coherence with minimal manual annotation. The resulting taxonomy offers four immediate industrial benefits: enterprise-wide repair analytics and benchmarking across branches and brands; intent-aware chatbots capable of precise service triage and automated quotation; inventory and workforce optimization through fine-grained job statistics; and a practical blueprint for industry-level standardization of repair nomenclature and data exchange. In sum, the work demonstrates that combining minimal expert input with modern embedding techniques and density clustering can automate taxonomy induction at industrial scale, setting a new benchmark for digital transformation initiatives that depend on accurate structuring of noisy technical language.
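The final stage named in the abstract, density clustering over text embeddings, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: character trigram counts stand in for the multilingual sentence embeddings, a plain DBSCAN stands in for whichever density algorithm the pipeline uses, and the function names, the toy repair names, and the `eps` and `min_samples` values are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Stand-in for a sentence embedding: counts of character n-grams."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def dbscan(vectors, eps, min_samples):
    """Plain DBSCAN; returns one label per vector, with -1 marking noise."""
    n = len(vectors)
    # Neighborhood of each point (includes the point itself).
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors[i] if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


# Toy repair names invented for this sketch (hypothetical data).
names = [
    "заміна масла",
    "заміна масла двигуна",
    "замена масла",
    "ремонт гальм",
    "ремонт гальмівної системи",
    "діагностика",
]
labels = dbscan([char_ngrams(name) for name in names], eps=0.5, min_samples=2)
for name, label in zip(names, labels):
    print(label, name)
```

With these toy inputs, the three oil-change variants fall into one cluster, the two brake-repair names into another, and the isolated name is labeled as noise. That noise label is the property that makes density methods attractive for long-tail data: rare job names are set aside instead of being forced into the nearest cluster.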