INTENT DETECTION IN TECHNICAL SUPPORT CHATBOTS: A COMPARATIVE STUDY OF MACHINE LEARNING MODELS

Shyryn A., Kartbayev A.Zh.
DOI: 10.32743/UniTech.2025.135.6.20265

 

ABSTRACT

Technical support systems for unmanned systems such as drones and autonomous vehicles require precise and timely understanding of user intent. This study compares three machine learning models—Logistic Regression, Random Forest, and BERT—for intent detection in support chatbots. Using a domain-specific synthetic dataset simulating common support queries, we evaluate the models based on classification accuracy, F1-score, inference time, and model size. The results show that while traditional models like Logistic Regression and Random Forest provide acceptable performance with lower resource consumption, the transformer-based BERT model significantly outperforms them in handling complex, ambiguous user queries. This paper provides guidance for researchers and engineers aiming to build adaptive support systems that combine contextual understanding with operational efficiency.


 

Keywords: intent detection, chatbot, technical support, unmanned systems, predictive analytics, BERT, machine learning


 

INTRODUCTION

With the increasing adoption of unmanned systems (UMS), including drones, autonomous ground vehicles, and robotic platforms, there is a growing demand for intelligent support systems capable of providing real-time assistance to end-users. These systems are often deployed in high-stakes environments such as agriculture, defense, or industrial monitoring, where technical failures can lead to operational delays or safety risks [1]. Chatbots have emerged as a scalable solution to provide automated support; however, most existing systems struggle to accurately detect user intent when presented with vague or context-dependent queries [2].

Intent detection is a foundational task in natural language understanding (NLU) and enables chatbots to classify the user's intention from free-text input. Accurate intent detection is critical for generating relevant responses in support applications. Traditional rule-based systems lack the flexibility and learning ability required for dynamic environments, while classical machine learning models like Logistic Regression and Random Forest provide a balance between interpretability and performance. On the other hand, deep learning models—particularly those based on transformer architectures such as BERT—are capable of extracting deeper contextual features.

This paper presents a comparative study of three different approaches to intent classification in the domain of unmanned system support: Logistic Regression, Random Forest, and BERT. We construct a synthetic dataset based on frequently encountered support scenarios, and evaluate each model using several performance metrics, including accuracy, F1-score, and inference latency [3].

Table 1.

Example intent classification dataset

User Query                                             | Intent Label
My drone won’t connect to the remote control           | connectivity_issue
I’m unable to upload the mission plan                  | mission_upload_problem
The battery of my drone depletes too fast              | battery_status
Getting error code 4001 in the system                  | technical_error_report
Can you help with recalibrating the navigation system? | recalibration_query

 

The goal is to identify the trade-offs between model complexity, interpretability, and performance, and to inform the design of future adaptive chatbot systems.

BACKGROUND AND RELATED WORK

Intent detection has received significant attention in the field of dialogue systems, particularly within commercial applications such as virtual assistants and customer service bots. Traditional approaches rely on statistical classifiers trained on bag-of-words or TF-IDF representations. While fast and efficient, such models often lack the depth required to handle user inputs that are semantically similar but lexically diverse [4].
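
For concreteness, the sketch below pairs TF-IDF features with the two classical classifiers compared in this study, using scikit-learn; the utterances and labels are illustrative placeholders, not the study's dataset.

    # Sketch of a classical baseline: TF-IDF features feeding the two
    # classical classifiers compared in this study (scikit-learn).
    # The utterances and labels below are illustrative placeholders.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    texts = ["I can't connect my drone to the controller",
             "The battery of my drone depletes too fast"]
    labels = ["connectivity_problem", "battery_status"]

    for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
        model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
        model.fit(texts, labels)                        # fit on labeled utterances
        print(model.predict(["Pairing failed again"]))  # predicted intent label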

Recent advancements in natural language processing (NLP), especially the introduction of transformer-based models such as BERT, have revolutionized text classification tasks [5]. BERT, pre-trained on large corpora with masked language modeling, captures bidirectional context and has outperformed previous models across a wide range of NLP benchmarks.

In the domain of technical support, however, the application of BERT remains limited due to computational cost and the scarcity of labeled support-specific data [6]. Hybrid approaches that combine rule-based preprocessing with contextual embeddings have been proposed but are not widely adopted in real-time systems. The architecture of the BERT-based intent classification pipeline is shown in Figure 1.

 

Figure 1. Architecture of BERT-based intent classification pipeline
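
To make the inference path of such a pipeline concrete, a minimal sketch follows, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the label list mirrors the five intent categories used later in this study, and the classification head would still require fine-tuning on labeled utterances before its predictions become meaningful.

    # Inference path of a BERT-based intent classifier (Hugging Face
    # transformers, bert-base-uncased). NOTE: the classification head is
    # randomly initialized here; fine-tuning on labeled utterances is
    # required before the predictions become meaningful.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    LABELS = ["connectivity_problem", "battery_status", "mission_upload",
              "hardware_issue", "calibration"]

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=len(LABELS))
    model.eval()

    def predict_intent(utterance: str) -> str:
        # Tokenize, run the encoder, and take the argmax over intent logits.
        inputs = tokenizer(utterance, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        return LABELS[int(logits.argmax(dim=-1))]

    print(predict_intent("Lost signal halfway through the mission"))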

 

MATERIALS AND METHODS

Dataset. To realistically simulate real-world technical support scenarios for unmanned aerial systems, a custom synthetic dataset was designed. This dataset comprises a total of 2,500 user utterances, evenly distributed across five representative intent categories: connectivity_problem, battery_status, mission_upload, hardware_issue, and calibration. These categories were selected based on a review of support documentation and frequently asked questions from UAV user manuals and online drone forums [7].

Each intent class contains 500 distinct user utterances. The utterances were crafted to reflect diverse linguistic features such as grammar complexity, variation in vocabulary, use of domain-specific terminology, ambiguity, and formality level [8]. For instance, the intent class connectivity_problem includes inputs like:

  • “I can't connect my drone to the controller”;
  • “Lost signal halfway through the mission”;
  • “Pairing failed again”.

The diversity in phrasing was a deliberate design choice to test the models' ability to generalize and disambiguate intent in less structured environments. The dataset also includes paraphrased and noisy inputs that reflect real-world user interaction, such as typos, informal language, and incomplete queries.
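
A hedged sketch of how such noisy variants might be generated programmatically is given below; the two noise functions are illustrative assumptions, not the exact procedure used to construct the dataset.

    # Illustrative noise injection for synthetic utterances (assumed, not
    # the exact procedure used to build the dataset).
    import random

    def add_typo(text: str, rng: random.Random) -> str:
        # Swap two adjacent characters to simulate a typo.
        if len(text) < 3:
            return text
        i = rng.randrange(len(text) - 1)
        return text[:i] + text[i + 1] + text[i] + text[i + 2:]

    def drop_word(text: str, rng: random.Random) -> str:
        # Delete one word to simulate an incomplete query.
        words = text.split()
        if len(words) < 2:
            return text
        del words[rng.randrange(len(words))]
        return " ".join(words)

    rng = random.Random(42)
    base = "I can't connect my drone to the controller"
    print(add_typo(base, rng))
    print(drop_word(base, rng))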

Moreover, several utterances were designed to mimic edge cases: queries that may fall between two intent categories. For example, “My drone keeps drifting left” could indicate either a hardware malfunction or a need for recalibration, and “The mission won't load even though the signal is strong” may involve either a mission_upload or a connectivity_problem issue.

Such cases increase the difficulty of classification and are critical for evaluating the contextual learning capabilities of advanced models like BERT [9].

The annotation process involved assigning one of the five intent labels to each utterance. This was done manually to ensure consistency and semantic relevance. To simulate real deployment data, roughly 10% of the queries were designed to be ambiguous enough that even a human annotator would need contextual clues. These were labeled based on majority consensus during internal review.

To enable reproducibility and fair model comparison, the dataset was divided using a stratified split into training (70%), validation (15%), and testing (15%) sets. Stratification preserved the balance of intent classes across all subsets.
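
A minimal sketch of this stratified 70/15/15 split using scikit-learn follows; the placeholder data stands in for the 2,500 annotated utterances.

    # Stratified 70/15/15 split with scikit-learn. The placeholder data
    # below stands in for the 2,500 annotated utterances.
    from sklearn.model_selection import train_test_split

    texts = [f"placeholder utterance {i}" for i in range(100)]
    labels = ["connectivity_problem" if i % 2 else "battery_status"
              for i in range(100)]

    # First carve out 70% for training, stratified by intent label.
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        texts, labels, test_size=0.30, stratify=labels, random_state=42)
    # Split the remaining 30% in half: 15% validation, 15% test.
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
    print(len(X_train), len(X_val), len(X_test))  # 70 15 15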

The resulting dataset not only serves as a robust benchmark for intent classification in the domain of unmanned systems, but also reflects a realistic variety of real-world support dialogues.

Evaluation Metrics. Evaluating the performance of intent detection models requires a multifaceted approach, as no single metric can fully capture the complexity of model behavior across varied inputs. The following metrics were used to evaluate the performance of all three models:

Accuracy: This measures the overall percentage of correctly classified user inputs. While commonly reported, accuracy alone may be misleading in the presence of class imbalance. Thus, it is used as a general indicator of performance but supplemented with more robust metrics.

Precision, Recall, and F1-Score (Macro-Averaged): These metrics are computed for each class individually and then averaged [10]. Macro-averaging ensures that each class contributes equally to the final score, regardless of class frequency.

  • Precision quantifies how many of the predicted intents are actually correct.
  • Recall measures how many of the true intents were successfully predicted.
  • F1-Score provides a harmonic mean of Precision and Recall, offering a balanced view of model performance.

These metrics are essential for understanding how well a model performs across all intent types, especially when the cost of misclassification is high (e.g., a calibration issue misclassified as connectivity).
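
As a brief illustration, the macro-averaged metrics above can be computed with scikit-learn as sketched below; y_true and y_pred are illustrative label sequences, not results from this study.

    # Macro-averaged precision, recall, and F1 with scikit-learn.
    # y_true and y_pred are illustrative label sequences.
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    y_true = ["battery_status", "calibration", "connectivity_problem", "calibration"]
    y_pred = ["battery_status", "connectivity_problem", "connectivity_problem", "calibration"]

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    print(accuracy_score(y_true, y_pred), precision, recall, f1)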

Inference Time: Since real-world chatbots operate under strict latency constraints, it is vital to measure the time each model takes to process a single input during inference. This was done by averaging the response time across 500 randomly selected test queries using CPU-only processing.
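
A sketch of this measurement procedure is given below; the model and X_test objects are assumed to come from the earlier sketches, and absolute timings will of course vary with hardware.

    # Mean per-query latency over up to 500 random test utterances (CPU).
    # `model` and `X_test` are assumed from the earlier sketches.
    import random
    import time

    sample = random.sample(list(X_test), min(500, len(X_test)))
    start = time.perf_counter()
    for query in sample:
        model.predict([query])  # single-input inference, as in deployment
    elapsed_ms = (time.perf_counter() - start) / len(sample) * 1000
    print(f"mean inference time: {elapsed_ms:.2f} ms/query")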

Model Size: Measured in megabytes (MB), this metric indicates the disk storage requirement of each model. This factor directly impacts deployability, especially in mobile or embedded environments.

Training Time: Although not directly related to inference performance, training time offers practical insight into the model's computational complexity. For real-world applications where regular re-training is necessary (e.g., for adaptive systems), this metric is significant.

Error Analysis: In addition to quantitative metrics, qualitative error analysis was performed. Misclassified examples were grouped and reviewed to identify common failure patterns [11]. For instance, we observed that classical models struggled most with ambiguous or grammatically incorrect inputs, while BERT’s errors often involved subtle semantic nuances that even humans might misinterpret.

Together, these evaluation criteria provide a comprehensive overview of each model’s effectiveness, efficiency, and readiness for deployment in real-time unmanned system support chatbots.

RESULTS AND DISCUSSION

The performance of the three models was evaluated on the test subset of the synthetic dataset, which included diverse user utterances related to technical support for unmanned systems. This section presents both quantitative and qualitative analyses based on the evaluation metrics described earlier. The results are summarized in Table 2.

Table 2.

Performance comparison of LR, RF, and BERT models

Model               | Accuracy | F1 Score (Macro) | Inference Time (ms) | Model Size (MB)
Logistic Regression | 84.2%    | 0.839            | 2.1                 | 4.2
Random Forest       | 86.0%    | 0.855            | 3.8                 | 12.5
BERT                | 92.6%    | 0.924            | 27.3                | 418.7

 

As expected, BERT achieved the highest performance in terms of both accuracy and macro-averaged F1-score. This confirms its ability to interpret nuanced and context-rich user input, which is especially relevant in real-world chatbot applications where user queries are often unstructured and vary in expression [12].

Random Forest outperformed Logistic Regression slightly, likely due to its ability to model non-linear relationships and interactions between features. However, the marginal improvement came at the cost of increased model size and slower inference time.

Despite its simplicity, Logistic Regression delivered surprisingly competitive results. Its fast inference speed and minimal resource footprint make it suitable for deployment in constrained environments, such as embedded systems or mobile chatbot platforms.

The choice of intent detection model depends heavily on the intended deployment context. While BERT delivers superior accuracy, its inference time and large model size make it less suitable for resource-constrained settings. On the other hand, Logistic Regression and Random Forest offer faster response times and smaller memory requirements, albeit with reduced accuracy.

This trade-off highlights the need for adaptive strategies in chatbot system design. For example, an architecture could employ a lightweight model such as Logistic Regression for initial classification, and escalate to a BERT-based reanalysis only when confidence is low or the query is particularly complex [13].
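
A hedged sketch of such an escalation policy follows; the confidence threshold is an illustrative value, model is assumed to be a lightweight scikit-learn pipeline exposing predict_proba (as sketched earlier), and predict_intent stands for a fine-tuned BERT classifier.

    # Confidence-gated escalation: answer with the lightweight classifier,
    # fall back to BERT only on low confidence. Threshold is illustrative
    # and would need tuning on validation data.
    CONFIDENCE_THRESHOLD = 0.70

    def route_query(query: str) -> str:
        # `model`: the TF-IDF + Logistic Regression pipeline sketched earlier.
        probs = model.predict_proba([query])[0]
        if probs.max() >= CONFIDENCE_THRESHOLD:
            return model.classes_[probs.argmax()]
        # `predict_intent`: the (assumed fine-tuned) BERT classifier.
        return predict_intent(query)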

A manual review of misclassified examples revealed several patterns:

  • Logistic Regression and Random Forest struggled with queries containing ambiguous or novel vocabulary, such as “My drone is freaking out” or “Compass doesn’t look stable”;
  • BERT handled most of these cases effectively but sometimes failed with short queries lacking context, such as "Uploading failed" or "GPS error." These issues could potentially be mitigated through the use of surrounding conversational history or multi-turn context tracking;
  • All models occasionally misclassified edge cases designed to straddle multiple intent categories, such as: “The drone won’t move after uploading the plan,” which could plausibly be labeled as either mission_upload or hardware_issue.

These observations emphasize that while BERT is more robust to linguistic variation, it is not immune to limitations, especially in one-shot classification without dialogue history.

Given that unmanned systems are frequently operated in time-sensitive and field-based conditions, the balance between model performance and efficiency becomes critical. A hybrid deployment strategy may offer the best solution, allowing systems to balance speed, accuracy, and computational demand. Furthermore, incorporating confidence thresholds and fallback mechanisms into chatbot workflows can ensure better handling of ambiguous or novel user inputs.

Overall, this study confirms the advantages of using transformer-based architectures for intent detection in technical support chatbots, while also acknowledging the operational value of classical models in real-world constraints.

CONCLUSION

This study compared three machine learning models—Logistic Regression, Random Forest, and BERT—for intent detection in technical support chatbots for unmanned systems. Using a synthetic dataset and consistent evaluation criteria, we assessed both classification performance and deployment feasibility.

BERT delivered the highest accuracy and F1 score, demonstrating its effectiveness in handling varied and ambiguous user input. However, its large model size and slower inference limit its applicability in real-time or resource-constrained environments. In contrast, Logistic Regression and Random Forest offered faster inference and lower memory usage, making them suitable for lightweight applications.

The choice of model should depend on system constraints and accuracy requirements. In many cases, a hybrid architecture combining lightweight classifiers with transformer-based fallback can offer a balanced solution. Future work should explore real-world data integration, use of dialogue history, and refinement of hybrid approaches for adaptive technical support systems.

 

References:

  1. Chen, Q., Zhuo, Z., & Wang, W. (2019). BERT for Joint Intent Classification and Slot Filling. arXiv preprint arXiv:1902.10909.
  2. Perumal, S., Saini, R., & Singh, A. (2023). Building Customer Support Chatbots With Intent Recognition. Journal of AI Research, 45(3), 230–245.
  3. Ouaddi, F., et al. (2025). Assessing the Effectiveness of LLMs for Intent Detection in Tourism Chatbots: A Comparative Analysis. Applied Soft Computing, 139.
  4. Wu, Y., et al. (2024). Comparative Study of Machine Learning Algorithms for Intent Detection. Scientific Reports.
  5. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT.
  6. Zhang, Y., & Wang, D. (2022). A Hybrid Model for Intent Detection Using Random Forest and BERT. Expert Systems with Applications, 190.
  7. Tur, G., Hakkani-Tür, D., & Heck, L. (2010). What is left to be understood in intent determination? IEEE SLT.
  8. Sarikaya, R., Hinton, G., & Deoras, A. (2014). Application of deep belief networks for natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
  9. Liu, B., & Lane, I. (2016). Attention-based recurrent neural network models for joint intent detection and slot filling. Interspeech.
  10. Sun, C., Qiu, X., & Huang, X. (2019). How to Fine-Tune BERT for Text Classification? China National Conference on Chinese Computational Linguistics.
  11. Qiu, X., Sun, T., Xu, Y., Shao, Y., Dai, N., & Huang, X. (2020). Pre-trained models for natural language processing: A survey. Science China Technological Sciences.
  12. Hadi, M., Khan, S. A., & Usman, M. (2023). Comparative Analysis of Machine Learning Algorithms for Short Text Classification. Procedia Computer Science.
  13. Jain, M., Kumar, P., Kota, R., & Patel, S. N. (2018). Evaluating and Informing the Design of Chatbots. In Proceedings of the 2018 Designing Interactive Systems Conference.
Information about the authors

Shyryn A., Master’s Student, School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty

Kartbayev A.Zh., PhD, Associate Professor, School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty
