Master’s degree student, School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty
DETECTION OF STRUCTURED QUERY LANGUAGE INJECTION USING MACHINE LEARNING ALGORITHMS AND FRAMEWORKS
ABSTRACT
Structured Query Language (SQL) injection remains one of the highest-risk web-application vulnerabilities, enabling attackers to read, modify or delete backend data via malicious queries. To overcome the limits of signature-based defenses, this study proposes a supervised machine-learning architecture for SQL-injection detection. Ten algorithms were benchmarked on two public Kaggle datasets. The best model on Dataset 1—Random Forest—reached 99.53 % accuracy and 0.994 F1-score (precision 0.999, recall 0.988). On an unseen Dataset 2 the leading model shifted to LightGBM, achieving 99.54 % accuracy and 0.995 F1-score (precision 0.996, recall 0.994). These results demonstrate that tree-based ensembles reliably generalise across different traffic patterns and can be integrated into web-application firewalls to detect evolving SQL-injection attacks with minimal false alerts.
Keywords: SQL Injection, machine learning, classification, prediction, information security.
Introduction
Our reliance on technology is growing at an unprecedented pace, and information technology infrastructure is expanding rapidly. This progress has brought both opportunities and concerns and has changed the way people work and live. A large share of the applications people use daily are web-based and, for convenience, freely available online. That openness is also a security risk, because such applications are potentially vulnerable to unauthorized and uncontrolled access [1, p. 1].
As technology keeps evolving, databases remain one of the core components driving this transformation. Relational databases are organized, manipulated, and managed using a query language known as SQL (Structured Query Language). It allows an application or user to work with a database by inserting new data, removing old data, and altering data that has already been stored [2, p. 2].
The OWASP (Open Web Application Security Project) Top 10 [3] provides a framework for understanding web application vulnerabilities and highlights risks such as SQL injection, which falls under the category listed in Table 1 as “SQL exploitation” (officially named “Injection”). Injection vulnerabilities, including SQL injection, are ranked third in the 2021 list, and 94% of the applications in the OWASP data were tested for some form of injection.
Table 1.
OWASP Top 10 vulnerabilities for 2021
| Rank | Vulnerability category |
|---|---|
| 1 | Control of a compromised access |
| 2 | Failures of encryption |
| 3 | SQL exploitation |
| 4 | Poor architecture |
| 5 | Misconfiguration of protection |
| 6 | Outdated, insecure components |
| 7 | Failures of login and verification |
| 8 | Failures of data and software trustworthiness |
| 9 | Failures of monitoring, logging of threats |
| 10 | Forgery of server-side demands |
Despite its extensive use and importance, SQL has security vulnerabilities that attackers can exploit. The most prominent is SQL injection, a technique whereby attackers insert, or “inject”, malicious SQL code into input fields or query parameters. For database-dependent architectures, detecting and mitigating SQL injection attacks is essential, since successful attacks can have severe consequences [4, p. 1].
Figure 1. Web application SQL injection
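The mechanism can be illustrated with a minimal, self-contained sketch. It assumes a hypothetical `users` table in an in-memory SQLite database and two hypothetical login helpers, `login_vulnerable` and `login_parameterized`; none of these names come from the study, and the example is only meant to show why concatenating input into SQL text is dangerous.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES ('admin', 's3cret')")
    return conn


def login_vulnerable(conn, name, password):
    # UNSAFE on purpose: user input is concatenated into the SQL text,
    # so quotes inside the input can change the structure of the query.
    query = (
        "SELECT COUNT(*) FROM users "
        f"WHERE name = '{name}' AND password = '{password}'"
    )
    return conn.execute(query).fetchone()[0] > 0


def login_parameterized(conn, name, password):
    # Placeholders keep the input as data; it is never parsed as SQL.
    query = "SELECT COUNT(*) FROM users WHERE name = ? AND password = ?"
    return conn.execute(query, (name, password)).fetchone()[0] > 0


conn = build_demo_db()
payload = "' OR '1'='1"  # turns the WHERE clause into a condition that is always true

bypassed = login_vulnerable(conn, "admin", payload)
blocked = not login_parameterized(conn, "admin", payload)
```

With the injected payload, the concatenated query ends in `OR '1'='1'` and matches every row, so the vulnerable helper accepts the login, whereas the parameterized helper compares the payload literally against the stored password and rejects it.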
SQL injection can be classified into three types of attacks:
• In-band attacks
• Out-of-band attacks
• Inferential attacks
Figure 2. Types of SQL injection
In-band SQL Injection: In-band injection is the easiest form of SQLi (Structured Query Language injection) to exploit. The attacker uses the same communication channel both to send the malicious SQL query and to receive the targeted data [5, p. 2]. In-band SQL injection has two common methods:
1) Error-based In-band SQL Injection: This approach relies on the database server’s error messages to collect details about the structure of the database. In some cases, an attacker can enumerate the whole database using error-based SQL injection alone. Although error messages are useful while a web application is being developed, on live sites they should be disabled or written to a log file with restricted access.
2) Union-based In-band SQL Injection: This approach uses the UNION SQL operator to combine the output of multiple SELECT statements into a single result, which is returned as part of the HTTP (Hypertext Transfer Protocol) response.
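A union-based attack can be sketched in a few lines, again under stated assumptions: the `products` and `users` tables, the `search_products` helper and the payload are hypothetical and exist only for this illustration, and SQLite stands in for whatever database a real application would use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price TEXT)")
conn.execute("CREATE TABLE users (login TEXT, password TEXT)")
conn.execute("INSERT INTO products VALUES ('pen', '1.50')")
conn.execute("INSERT INTO users VALUES ('admin', 's3cret')")


def search_products(conn, term):
    # UNSAFE on purpose: the search term is pasted into the query text.
    query = f"SELECT name, price FROM products WHERE name = '{term}'"
    return conn.execute(query).fetchall()


# The payload closes the string literal, appends a second SELECT with the
# same number of columns, and comments out the trailing quote with "--".
payload = "x' UNION SELECT login, password FROM users --"

normal_rows = search_products(conn, "pen")
leaked_rows = search_products(conn, payload)
```

Here `leaked_rows` contains the row from the `users` table rather than any product, because the response channel that normally carries search results now carries the output of the injected SELECT.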
Out-of-band SQL Injection: In an out-of-band attack, the attacker does not receive a response from the targeted application on the same communication channel. Instead, the attacker forces the web application to send data to a remote endpoint under the attacker's control. Out-of-band attacks are feasible only if the database server supports commands that trigger DNS (Domain Name System) or HTTP requests, which is common in popular SQL servers.
Inferential (Blind) SQL Injection: An inferential SQL injection attack, often referred to as blind SQL injection, occurs when the attacker cannot directly see the result of the queries because no data is returned from the web application to the attacker. Instead, the attacker enumerates the database by observing how the application behaves. Blind SQL injection comes in two varieties:
1) Boolean-based Blind SQL Injection: The attacker sends queries that evaluate to TRUE or FALSE and draws conclusions from the difference in the application's responses [6, p. 20].
2) Time-based Blind SQL Injection: The attacker sends an SQL query to the database, forcing it to wait a specific amount of time (usually measured in seconds) before responding; the presence or absence of the delay reveals whether the injected condition was true [6].
The primary aim of this research is to explore the effectiveness of various machine learning algorithms and frameworks in detecting SQL injection attacks. With the growing dependence on web-based applications and the rising threat of injection vulnerabilities, particularly SQL injection, it becomes crucial to enhance detection methods. This study investigates whether the results from previous detection efforts can be consistently reproduced, whether the inclusion of new machine learning models and tools can lead to improved performance, and whether the developed solutions can be effectively applied to different datasets. By addressing these core questions, the research seeks to contribute to more reliable, adaptable, and scalable solutions for securing database-driven web applications.
Materials and methods.
Machine-learning methods increasingly underpin SQL-injection (SQLi) defenses because they learn complex request patterns and can adapt as attackers change tactics.
Muhammad & Ghafory [7, p. 1] showed that common WEKA classifiers, e.g. multilayer perceptron and logistic regression trained with 10-fold cross-validation, successfully distinguished benign from malicious access-log entries.
On Kaggle's SQLi corpus, Anu et al. [8, p. 4] identified K-Nearest Neighbours (97 % accuracy) as the best classifier after rigorous cleaning, normalisation and metric testing.
Ibrohim & Suryani [9, p. 1] raised performance to 93.98 % using TF-IDF feature extraction, class balancing and an SVM/Naïve Bayes voting ensemble.
Focusing on IoT traffic, Sharma & Babbar [10, p. 1] assessed the resilience of ML to varying injection vectors: Naïve Bayes led their experiments (97.73 % accuracy), followed by logistic regression, random forest and SVM.
Collectively, these works demonstrate that, with adequate preprocessing and model fine-tuning, supervised ML can automate SQLi detection on a variety of datasets, achieve high precision and recall, and generalize to new attack variants.
Recent advances in data availability, GPUs and other specialist hardware make large-scale machine learning possible today. Because these algorithms are adaptive and flexible, they are now used for tasks such as SQL-injection detection; this work therefore employs an ML approach. The research questions are:
1) Can earlier detection results be reproduced?
2) Do additional algorithms and frameworks boost performance?
3) Are the findings transferable to a different dataset?
Datasets: Dataset 1 [11] holds 30 919 SQL queries (19 537 benign, 11 382 malicious); Dataset 2 [12] contains 24 707 queries (13 134 benign, 11 573 malicious). Figures 3 and 4 show the distribution of labels in Dataset 1 and Dataset 2, respectively.
Figure 3. SQL Injection Dataset 1
Figure 4. SQL Injection Dataset 2
Table 2 presents a sample of non-harmful SQL queries along with their corresponding labels, marked as 0. These queries represent typical benign operations in a database, such as selecting data without malicious intent.
Table 2.
Records of Benign SQL Queries and Their Label from the Datasets
| Query | Label |
|---|---|
| SELECT Country FROM Customers | 0 |
| SELECT COUNT (DISTINCT Country) FROM Customers | 0 |
| FROM (SELECT * FROM hour) | 0 |
Table 3 displays records of harmful SQL queries labeled as 1. These queries include patterns frequently used in injection attacks, such as attempting to manipulate logic or injecting malicious strings to bypass authentication mechanisms.
Table 3.
Records of Malicious SQL Queries and Their Label from the Datasets
| Query | Label |
|---|---|
| or 1 = 1 or '' = ' | 1 |
| admin' or '1' = '1' | 1 |
| -6681") or 5251=1162,1 | 1 |
Preprocessing stage: In this research, various machine learning models were applied to detect SQL injection attacks. The datasets were preprocessed by removing null and duplicate values, after which 30 907 queries remained in the first dataset.
To work with textual data such as SQL queries, the raw text must be converted into numerical representations suitable for machine learning models. This was done with the TF-IDF (Term Frequency-Inverse Document Frequency) method, which emphasizes distinctive and important terms while de-emphasizing common, less informative ones, improving the model’s ability to concentrate on significant features. TF-IDF helps to capture patterns and structures that are specific to SQL injections, such as logical operators and keywords. For all algorithms and frameworks, the following preprocessing steps were applied:
1) TF-IDF was used to convert the SQL queries into feature vectors, limiting the features to the top 5000 for computational efficiency.
2) The dataset was split into training (70% of data) and testing (30% of data) sets.
3) Model performance was assessed using the following metrics: accuracy, precision, recall and F1-score.
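Steps 1 and 2 can be sketched in plain Python. This is an illustrative approximation, not the pipeline used in the study: the tokenizer, the helpers `tfidf_vectors` and `split_70_30`, the unsmoothed weighting tf × (ln(N/df) + 1) and the shuffle seed are all assumptions, and library implementations typically differ in smoothing and normalisation. The sample queries are taken from Tables 2 and 3 (written with straight quotes).

```python
import math
import random
import re
from collections import Counter


def tokenize(query):
    # Words, numbers, and single punctuation marks such as ' = ( ) are tokens.
    return re.findall(r"[a-z_]+|\d+|[^\sa-z\d_]", query.lower())


def tfidf_vectors(queries, max_features=5000):
    docs = [Counter(tokenize(q)) for q in queries]
    # Document frequency: number of queries in which each term occurs.
    df = Counter(term for doc in docs for term in doc)
    # Vocabulary capped at the most frequent terms (step 1 in the list above).
    totals = Counter()
    for doc in docs:
        totals.update(doc)
    vocab = [term for term, _ in totals.most_common(max_features)]
    n_docs = len(docs)
    idf = {term: math.log(n_docs / df[term]) + 1.0 for term in vocab}
    vectors = []
    for doc in docs:
        length = sum(doc.values()) or 1
        vectors.append([doc[term] / length * idf[term] for term in vocab])
    return vocab, vectors


def split_70_30(items, seed=42):
    # Shuffle indices reproducibly, then cut at 70 % (step 2 in the list above).
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = round(len(items) * 0.7)
    return [items[i] for i in indices[:cut]], [items[i] for i in indices[cut:]]


queries = [
    "SELECT Country FROM Customers",
    "SELECT COUNT (DISTINCT Country) FROM Customers",
    "FROM (SELECT * FROM hour)",
    "or 1 = 1 or '' = '",
    "admin' or '1' = '1'",
    '-6681") or 5251=1162,1',
]
labels = [0, 0, 0, 1, 1, 1]

vocab, vectors = tfidf_vectors(queries)
train, test = split_70_30(list(zip(vectors, labels)))
```

For brevity the sketch computes the weights on all six queries before splitting; in a real experiment the vocabulary and IDF values would normally be fitted on the training portion only, so that no information from the test set leaks into the features.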
Machine Learning algorithms, frameworks and applied configurations:
1) Logistic Regression is a linear machine learning model for binary classification and a widely used technique in the machine learning domain [13, p. 3]. It estimates the probability that a data point belongs to a specific category using the logistic function, which maps real-valued numbers into the range (0, 1). It works by fitting a linear equation to the input features that represents the log-odds of the target class, and training adjusts the feature weights to minimize the binary cross-entropy loss. The decision boundary is the set of points where the predicted probability equals 0.5.
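The relationship between the linear score, the logistic function and the loss can be made concrete with a short sketch. The helper names (`sigmoid`, `predict_proba`, `binary_cross_entropy`) and the toy weights are illustrative assumptions; the code shows the standard definitions, not the training procedure used in the study.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    # The linear combination z is the log-odds of the positive class.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# Toy example: a score of 0 lies exactly on the decision boundary.
on_boundary = predict_proba([2.0, -1.0], 0.0, [1.0, 2.0])

# Confident, correct predictions give a small loss.
loss = binary_cross_entropy([1, 0], [0.9, 0.1])
```

A sample is labelled malicious when its predicted probability exceeds 0.5, which is the same as its log-odds being positive.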
Figure 5. Flow chart of the proposed methodology
2) Decision Tree (random_state = 42) splits features recursively to maximize a purity measure, creating an interpretable hierarchy of rules [14, p. 2].
3) Random Forest extends this by training 100 such trees on bootstrapped samples with random feature subsets and aggregating their votes, with training parallelized across all CPU cores (n_estimators = 100, n_jobs = -1, random_state = 42), which yields a more stable ensemble [15, p. 3].
4) Gradient Boosting adds 100 shallow trees in sequence, each fitted to the residuals of the previous stage (n_estimators = 100, random_state = 42), so that the loss is reduced stage by stage [16, p. 2].
5) Support Vector Machine with a linear kernel (C = 1.0) seeks the maximum-margin boundary between the classes; with other kernels it can be lifted to non-linear decision surfaces [17, p. 4].
6) Naïve Bayes (random_state = 42) is a probabilistic model that assumes conditional independence of the features given the class, providing a simple baseline for text-like data [18, p. 1].
7) K-Nearest Neighbours (k = 5) is instance-based: it classifies a sample by majority vote among its five nearest neighbours in Euclidean space and requires no explicit training [19, p. 3].
8) XGBoost, one of the modern gradient-boosting toolkits, sequentially grows 100 depth-6 trees (learning_rate = 0.1, eval_metric = logloss, use_label_encoder = False, random_state = 42) with regularized, sparsity-aware splits [20, p. 2].
9) LightGBM uses histogram binning and leaf-wise tree growth (boosting_type = gbdt, num_leaves = 31, n_estimators = 100, learning_rate = 0.1, max_depth = -1, random_state = 42, verbose = -1) for memory-efficient large-scale training [21, p. 3].
10) CatBoost avoids target leakage through ordered boosting and handles categorical features natively (iterations = 100, depth = 6, learning_rate = 0.1, loss_function = Logloss, eval_metric = Accuracy, random_seed = 42, verbose = 0).
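As one concrete instance of the classifiers described above, the K-Nearest Neighbours rule (k = 5, Euclidean distance, majority vote) fits in a few lines. The two-dimensional points and the helper names `euclidean` and `knn_predict` are assumptions made for illustration; in the study the inputs would be TF-IDF vectors with up to 5000 dimensions.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, sample, k=5):
    # No training step: the labelled points themselves are the model.
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: euclidean(pair[0], sample),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy data: label 0 clusters near the origin, label 1 near (1, 1).
train_x = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2),
           (0.9, 1.0), (1.0, 0.9), (0.8, 0.8), (1.0, 1.0)]
train_y = [0, 0, 0, 1, 1, 1, 1]

near_origin = knn_predict(train_x, train_y, (0.1, 0.1))
near_corner = knn_predict(train_x, train_y, (0.9, 0.9))
```

Because every prediction scans all stored samples, the cost of inference grows with the size of the training set, which is one practical difference from the tree-based models.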
Results and discussion.
On both public SQL-injection datasets, every classifier exceeded 0.95 accuracy except K-Nearest Neighbours on Dataset 1 (0.91), confirming that simple TF-IDF feature engineering already separates benign and malicious queries well (Table 4).
Dataset 1. The ensemble models dominated: Random Forest led with 99.53 % accuracy and an F1-score of 0.994 (precision 0.999, recall 0.988), followed by LightGBM (99.36 %, 0.991), a single Decision Tree (99.20 %, 0.989) and XGBoost (99.17 %, 0.989). The linear and probabilistic baselines trailed: Logistic Regression lost recall (0.877) despite very high precision, and KNN had the lowest recall of all (0.762), which may reflect the class imbalance of this dataset.
Dataset 2. When confronted with the different traffic patterns of Dataset 2, the ranking changed to some extent. LightGBM now led on accuracy, precision and F1 (99.54 % / 0.995), with CatBoost and XGBoost close behind (F1 0.995 and 0.994). Random Forest's accuracy dropped to 98.93 %, which may indicate some over-fitting to Dataset 1, and Decision Tree dropped slightly as well (99.20 % to 98.92 %). All the other models performed better than on Dataset 1; the large gain for Logistic Regression (95.18 % to 99.15 %) suggests that Dataset 2 is closer to linearly separable.
The results agree with previous research in finding tree-based ensembles to be the most stable detectors of SQL-injection payloads. The gradient-boosting frameworks (LightGBM, XGBoost and CatBoost) paired precision ≥ 0.99 with recall ≥ 0.98 on both datasets, and Random Forest did so on Dataset 1; this combination is required in security scenarios, where a missed attack can compromise the data. Their ability to model high-order interactions between features appears to let them detect obfuscated injection strings that defeat linear separators. LightGBM, with its leaf-wise growth, ranked first or second on both corpora, indicating high generalisation capacity; its small inference size also makes it a candidate for integration into web-application firewalls. By contrast, KNN's susceptibility to sparse or noisy features makes it inadvisable, and the low recall of Logistic Regression on Dataset 1 argues against relying on linear boundaries alone.
Overall, the study shows that a well-tuned ensemble can detect emerging SQL-injection attacks with an error rate of at most 1 % on two heterogeneous datasets, meeting operational demands without an excessive number of false alarms.
Table 4.
Results obtained on Dataset 1 and Dataset 2 (each cell: Dataset 1 / Dataset 2)
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Logistic Regression | 0.9518 / 0.9915 | 0.9928 / 0.9941 | 0.8767 / 0.9875 | 0.9311 / 0.9908 |
| Gradient Boosting | 0.9866 / 0.9870 | 0.9949 / 0.9841 | 0.9689 / 0.9881 | 0.9818 / 0.9861 |
| Random Forest | 0.9953 / 0.9893 | 0.9994 / 0.9886 | 0.9878 / 0.9881 | 0.9936 / 0.9883 |
| Naive Bayes | 0.9619 / 0.9816 | 0.9792 / 0.9910 | 0.9170 / 0.9688 | 0.9471 / 0.9798 |
| Support Vector Machine | 0.9739 / 0.9860 | 0.9911 / 0.9804 | 0.9382 / 0.9894 | 0.9639 / 0.9849 |
| LightGBM | 0.9936 / 0.9954 | 0.9971 / 0.9962 | 0.9858 / 0.9939 | 0.9914 / 0.9951 |
| K-Nearest Neighbours | 0.9106 / 0.9771 | 0.9970 / 0.9675 | 0.7618 / 0.9833 | 0.8636 / 0.9753 |
| XGBoost | 0.9917 / 0.9943 | 0.9956 / 0.9951 | 0.9820 / 0.9925 | 0.9888 / 0.9938 |
| Decision Tree | 0.9920 / 0.9892 | 0.9933 / 0.9864 | 0.9852 / 0.9904 | 0.9892 / 0.9884 |
| CatBoost | 0.9917 / 0.9951 | 0.9973 / 0.9947 | 0.9803 / 0.9947 | 0.9887 / 0.9947 |
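The F1-scores in Table 4 can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The sketch below recomputes it for four rows copied from the table; the helper names `f1_score` and `metrics_from_counts` are illustrative, and the confusion-matrix counts in the last line are invented purely to show how the four metrics are derived.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp, fp, fn, tn):
    """Derive the four reported metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# (precision, recall, reported F1) copied from Table 4.
table4_rows = {
    "Random Forest, Dataset 1": (0.9994, 0.9878, 0.9936),
    "Logistic Regression, Dataset 1": (0.9928, 0.8767, 0.9311),
    "K-Nearest Neighbours, Dataset 1": (0.9970, 0.7618, 0.8636),
    "LightGBM, Dataset 2": (0.9962, 0.9939, 0.9951),
}

# Largest difference between recomputed and reported F1 over these rows.
max_gap = max(abs(f1_score(p, r) - f1) for p, r, f1 in table4_rows.values())

# Hypothetical counts, only to illustrate the definitions.
example = metrics_from_counts(tp=8, fp=2, fn=2, tn=8)
```

For these rows the recomputed values agree with the reported F1-scores to within about 0.0001, which is consistent with rounding of the four-decimal inputs. The K-Nearest Neighbours row also shows how a low recall pulls F1 down even when precision is close to 1.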
Conclusion
SQL injection remains a critical security concern in modern applications and a threat to data integrity. Such attacks continue to be a serious problem because they are simple, effective and widely exploitable, so the development of detection systems is vital for preventing them. Experiments were conducted on two distinct datasets, testing ten machine learning algorithms and frameworks on each. Through these experiments, the primary research questions were addressed. The contributions of this research are the exploration of a broad range of machine learning algorithms and frameworks, each with its own configuration, and the validation of their performance on the new, unseen data of the second dataset. High performance was achieved not only in accuracy but also in precision, recall and F1-score. In conclusion, this research contributes insights into the detection of SQL injection attacks that have both practical and theoretical significance for enhancing application security.
References:
1. Alarfaj F. K., Khan N. A. Enhancing the performance of SQL injection attack detection through probabilistic neural networks // Applied Sciences. 2023. Vol. 13, No. 7. P. 4365.
2. Fu H., Guo C., Jiang C., Ping Y., Lv X. SDSIOT: An SQL injection attack detection and stage identification method based on outbound traffic // Electronics. 2023. Vol. 12, No. 11. P. 2472.
3. OWASP Foundation. OWASP Top 10: 2021. OWASP Foundation, 2021. 36 p.
4. Tasdemir K. et al. An investigation of machine learning algorithms for high-bandwidth SQL injection detection utilising BlueField-3 DPU technology // Proc. IEEE Int. System-on-Chip Conf. (SOCC). Santa Clara (CA, USA): IEEE, 2023. P. 1–6. DOI: 10.1109/SOCC58585.2023.10256777.
5. Issakhani M., Huang M., Tayebi M. A., Lashkari A. H. An evolutionary algorithm for adversarial SQL injection attack generation // Proc. IEEE Int. Conf. on Intelligence and Security Informatics (ISI). Charlotte (NC, USA): IEEE, 2023. P. 1–6. DOI: 10.1109/ISI58743.2023.10297141.
6. Loor C. A., Morocho K., Hallo M. Using data mining techniques for the detection of SQL injection attacks on database systems // Revista Politécnica. 2023. Vol. 51. P. 19–28.
7. Muhammad T., Ghafory H. SQL injection attack detection using machine learning algorithm // Mesopotamian Journal of Cybersecurity. 2022. P. 5–17.
8. Anu P. et al. Mitigation of SQL injection attacks through machine learning classifier // Proc. 2nd Int. Conf. on Sustainable Computing and Smart Systems (ICSCSS). Coimbatore (India): IEEE, 2024. P. 1–6. DOI: 10.1109/ICSCSS60660.2024.10625626.
9. Ibrohim M. M., Suryani V. Classification of SQL injection attacks using ensemble learning SVM and Naïve Bayes // Proc. Int. Conf. on Data Science and Its Applications (ICoDSA). Bandung (Indonesia): IEEE, 2023. P. 230–236. DOI: 10.1109/ICoDSA58501.2023.10277436.
10. Sharma A., Babbar H. Machine learning solutions for evolving injection attack landscape // Proc. 2nd Int. Conf. on Future Technologies (INCOFT). Belagavi (India): IEEE, 2023. P. 1–6. DOI: 10.1109/INCOFT60753.2023.10425456.
11. Ahmed A. S. S., Shachi M. SQL injection dataset. San Francisco: Kaggle, 2020. URL: https://www.kaggle.com/datasets/sajid576/sql-injection-dataset.
12. Rayten. SQL injection dataset. San Francisco: Kaggle, 2020. URL: https://www.kaggle.com/datasets/rayten/sql-injection-dataset.
13. Setiyaji A., Ramli K., Hidayatulloh Z. Y., Budhi Dharmawan G. S. A technique utilizing machine learning and convolutional neural networks for the identification of SQL injection attacks // Proc. 4th Int. Conf. on Science & Information Technology in Smart Administration (ICSINTESA). Balikpapan (Indonesia): IEEE, 2024. P. 1–6. DOI: 10.1109/ICSINTESA62455.2024.10748116.
14. Papageorgiou E., Stylios C., Groumpos P. A combined fuzzy cognitive map and decision trees model for medical decision making // Proc. 28th Annu. Int. Conf. IEEE Engineering in Medicine & Biology Society (EMBS). New York (NY, USA): IEEE, 2006. P. 6117–6120. DOI: 10.1109/IEMBS.2006.260354.
15. Martins W., Bagesteiro L. B., Weber T. O., Balbinot A. FPGA-based implementation of random forest classifier for sEMG signal classification // Proc. 46th Annu. Int. Conf. IEEE Engineering in Medicine & Biology Society (EMBC). Orlando (FL, USA): IEEE, 2024. P. 1–4. DOI: 10.1109/EMBC53108.2024.10781521.
16. Matoušek J., Tihelka D. Using extreme gradient boosting to detect glottal closure instants in speech signal // ICASSP 2019 – IEEE Int. Conf. on Acoustics, Speech & Signal Processing. Brighton (UK): IEEE, 2019. P. 6515–6519. DOI: 10.1109/ICASSP.2019.8683889.
17. Yan W.-Y., He Q. Multi-class fuzzy support vector machine based on dismissing margin // Proc. Int. Conf. on Machine Learning and Cybernetics (ICMLC). Baoding (China): IEEE, 2009. P. 1139–1144. DOI: 10.1109/ICMLC.2009.5212368.
18. Vijay V., Verma P. Variants of naïve Bayes algorithm for hate speech detection in text documents // Proc. Int. Conf. on Artificial Intelligence and Smart Communication (AISC). Greater Noida (India): IEEE, 2023. P. 18–21. DOI: 10.1109/AISC56616.2023.10085511.
19. Hacham S. A. K., Uçan O. N. Detection of malicious SQL injections using SVM and KNN algorithms // Proc. 7th Int. Symp. on Innovative Approaches to Smart Technology (ISAS). Istanbul (Türkiye): IEEE, 2023. P. 1–6. DOI: 10.1109/ISAS60782.2023.10391560.
20. Roy P., Kumar R., Rani P. SQL injection attack detection by machine learning classifier // Proc. Int. Conf. on Applied Artificial Intelligence and Computing (ICAAIC). Salem (India): IEEE, 2022. P. 1–6. DOI: 10.1109/ICAAIC53929.2022.9792964.
21. Lin Z., Zhu S. DFS-Enhanced LightGBM: An extended LightGBM model applied to ICU heart failure mortality prediction // Proc. Int. Conf. on High Performance Big Data and Intelligent Systems (HDIS). Macau (China): IEEE, 2023. P. 113–117. DOI: 10.1109/HDIS60872.2023.10499633.