FORECASTING CONSUMER BEHAVIOR USING MACHINE LEARNING MODELS

Kairova B.

To cite: Kairova B. FORECASTING CONSUMER BEHAVIOR USING MACHINE LEARNING MODELS // Universum: Technical Sciences: electronic scientific journal. 2026. 3(144). URL: https://7universum.com/ru/tech/archive/item/22247 (accessed: 28.03.2026).

DOI: 10.32743/UniTech.2026.144.3.22247

 

ABSTRACT

The research compares five machine learning algorithms (XGBoost, LSTM, Random Forest, SVR, and KNN) for predicting consumer demand, with the overarching goal of inventory optimization. The dataset comprises over 230 million records from an international online retail platform, spanning October to December 2019, with validation on January 2020 data. Features include product information, event types (view, cart, purchase), pricing, and user behavior across sessions. Exploratory data analysis (EDA) was performed to identify trends, seasonal effects, and conversion behavior. After applying a logarithmic transformation to account for the skewed demand distribution, the LSTM model achieved the best results: R² = 0.998, MAPE = 0.66%, and SMAPE = 0.61%. XGBoost followed closely with R² = 0.987. These results demonstrate the value of machine learning, particularly deep learning and gradient boosting approaches, in effectively forecasting product demand.


 

Keywords: Machine Learning, Demand Forecasting, Consumer Behavior, Inventory Optimization, E-commerce, Predictive Analytics, LSTM, XGBoost.


 

I. Introduction

The exponential growth of e-commerce and the widespread adoption of digital technologies have transformed how businesses interact with consumers. With every click, scroll, and purchase, vast amounts of behavioral data are generated, offering unprecedented opportunities to understand consumer behavior [1]. Machine learning (ML) has emerged as a key enabler for predictive analytics [12], [18]. This study aims to build and compare demand forecasting models, identify key behavioral drivers, and provide business recommendations for inventory and marketing optimization. The study evaluates multiple approaches:

1. Single Machine Learning Models: Models like Logistic Regression, Naive Bayes, and KNN are widely used for their simplicity and fast training speed, but often underperform compared to complex models on high-dimensional data [10], [2], [5], [8].

2. Ensemble Methods: Algorithms like Random Forest and XGBoost have proven highly effective with large e-commerce datasets, offering high accuracy despite requiring more computational resources [19], [20], [13], [4], [3].

3. Hybrid and Stacked Models: Combining multiple algorithms compensates for individual limitations, often yielding the best results for time-series features [6], [14], [13], [17], [11].

4. Clickstream-Based Models: Analyzing user navigation, clicks, and session time enables the prediction of purchase intent before a transaction occurs [7], [1], [9].

II. Methodology

A. Dataset Acquisition

The dataset includes over 230 million records from October-December 2019. The validation set is from January 2020. The dataset consists of the following fields:

  • event_time: Timestamp of the user action
  • event_type: Type of event: view, cart, or purchase
  • product_id: Unique identifier of the product
  • category_id: Identifier for the product category
  • brand: Brand or manufacturer of the product
  • price: Price of the product at the time of the event
  • user_id: Unique identifier for the user
  • user_session: Identifier for the user session

Total: 233,460,662 records, of which 3,656,843 are purchases.
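A minimal pandas sketch of this schema (the sample rows and in-memory CSV are invented for illustration; the real dataset is read from the platform's export files):

```python
import io
import pandas as pd

# Hypothetical sample mimicking the field list above.
csv = io.StringIO(
    "event_time,event_type,product_id,category_id,brand,price,user_id,user_session\n"
    "2019-10-01 00:00:01,view,1001,2001,acme,49.99,501,s1\n"
    "2019-10-01 00:00:05,cart,1001,2001,acme,49.99,501,s1\n"
    "2019-10-01 00:01:10,purchase,1001,2001,acme,49.99,501,s1\n"
)
events = pd.read_csv(csv, parse_dates=["event_time"])

# Purchases are the subset of rows used as the demand signal.
purchases = events[events["event_type"] == "purchase"]
print(len(events), len(purchases))  # 3 1
```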

B. Data Analysis

EDA was conducted to analyze purchasing trends, weekday vs. weekend behavior, peak hours, conversion rates, and repeat buying behavior, similar to [15]. Notable findings included activity peaks on Sunday mornings and December spikes driven by the holidays.
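The conversion rates in such an analysis reduce to ratios of event counts per event type; a toy sketch (the counts are invented, sized only to mirror the funnel shape reported in Section III):

```python
import pandas as pd

# Invented toy event log: 500 views, 10 cart additions, 4 purchases.
events = pd.DataFrame({
    "event_type": ["view"] * 500 + ["cart"] * 10 + ["purchase"] * 4,
})
counts = events["event_type"].value_counts()
view_to_purchase = counts["purchase"] / counts["view"]  # share of views ending in purchase
cart_to_purchase = counts["purchase"] / counts["cart"]  # cart-to-purchase conversion
print(f"{view_to_purchase:.1%} {cart_to_purchase:.1%}")  # 0.8% 40.0%
```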

C. Demand Prediction Algorithms

We compare machine learning algorithms for demand forecasting, described below using their mathematical formulations and structural logic.

Algorithm 1: Support Vector Regressor

Input: training dataset D = {(x_i, y_i)}, i = 1..n; kernel function K; regularization parameter C; tube width ε

Output: trained Support Vector Regressor model

Construct the kernel matrix K_ij = K(x_i, x_j) for all pairs (i, j);

Solve the following optimization problem:

    minimize (1/2)·||w||² + C·Σ_{i=1..n} (ξ_i + ξ_i*)

    subject to y_i − ⟨w, φ(x_i)⟩ − b ≤ ε + ξ_i,
               ⟨w, φ(x_i)⟩ + b − y_i ≤ ε + ξ_i*,
               ξ_i, ξ_i* ≥ 0 for all i;

Calculate the weight vector w and bias term b;

for each support vector x_i, accumulate its contribution (α_i − α_i*)·K(x_i, x);

return the model f(x) = Σ_i (α_i − α_i*)·K(x_i, x) + b.
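The distinctive ingredient of SVR is the ε-insensitive loss behind the constraints above: errors inside the ε-tube cost nothing. A small NumPy illustration (toy values invented):

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, eps=0.5):
    """SVR's loss: absolute errors within the eps-tube are ignored."""
    return np.maximum(np.abs(y_true - y_pred) - eps, 0.0)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.2, 2.0, 4.0])
print(epsilon_insensitive_loss(y_true, y_pred))  # [0.  0.  0.5]
```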

Algorithm 2: Random Forest

Input: training dataset D; number of trees T; number of candidate split features m

Output: Random Forest model F

Initialize an empty list forest to store the trees;

for t = 1 to T do

    Create a bootstrap sample D_t from D (sampling rows with replacement);

    Train a decision tree h_t on D_t, considering m randomly chosen features (sampled without replacement) at each split;

    Add h_t to forest;

end

return F(x) = (1/T)·Σ_{t=1..T} h_t(x), the average of the trees' predictions.
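The two sampling steps inside the loop can be sketched in NumPy (toy sizes; the tree-training step itself is elided):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, T, m = 8, 5, 3, 2  # toy sizes for illustration

X = rng.normal(size=(n_samples, n_features))

for t in range(T):
    boot_idx = rng.integers(0, n_samples, size=n_samples)     # bootstrap: rows WITH replacement
    feat_idx = rng.choice(n_features, size=m, replace=False)  # m features WITHOUT replacement
    X_t = X[boot_idx][:, feat_idx]
    # ... train decision tree h_t on X_t here ...
    print(X_t.shape)  # (8, 2) each round
```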

Algorithm 3: XGBoost

Input: dataset D; number of boosting rounds T; learning rate η

Output: ensemble of regression trees f_1, …, f_T

Objective function at round t:

    L(t) = Σ_{i=1..n} l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t)

Where l is a differentiable loss, ŷ_i^(t−1) is the prediction after round t−1, and the regularization term is

    Ω(f) = γ·J + (λ/2)·Σ_{j=1..J} w_j²,

with J the number of leaves and w_j the leaf weights. Each round fits f_t to the first- and second-order gradients g_i = ∂l/∂ŷ and h_i = ∂²l/∂ŷ², then updates ŷ_i^(t) = ŷ_i^(t−1) + η·f_t(x_i);

return the final ensemble ŷ(x) = Σ_{t=1..T} η·f_t(x).
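The additive, gradient-fitting recursion can be made concrete with squared loss and a fixed depth-1 stump standing in for XGBoost's regularized trees (a simplified sketch of the boosting idea, not the full algorithm; data invented):

```python
import numpy as np

# For squared loss, the negative gradient is simply the residual y - pred,
# so each round fits the residuals and adds the fit scaled by eta.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 7.5, 8.0])
eta, rounds = 0.3, 200
thr = np.median(x)                 # fixed split point of the toy stump
mask = x <= thr
pred = np.zeros_like(y)
for _ in range(rounds):
    r = y - pred                                            # residuals = negative gradient
    stump = np.where(mask, r[mask].mean(), r[~mask].mean()) # depth-1 fit to residuals
    pred += eta * stump                                     # shrunken additive update
print(np.round(pred, 2).tolist())  # converges to the per-leaf means [2.25, 2.25, 7.75, 7.75]
```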

Algorithm 4: LSTM (Long Short-Term Memory)

Input: sequence data x_1, …, x_T; number of hidden units H; learning rate

Output: trained LSTM model with hidden state h_t and memory cell c_t

Cell updates at each step t:

    f_t = σ(W_f·[h_{t−1}, x_t] + b_f)        (forget gate)
    i_t = σ(W_i·[h_{t−1}, x_t] + b_i)        (input gate)
    c̃_t = tanh(W_c·[h_{t−1}, x_t] + b_c)    (candidate memory)
    c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t         (memory cell)
    o_t = σ(W_o·[h_{t−1}, x_t] + b_o)        (output gate)
    h_t = o_t ⊙ tanh(c_t)                    (hidden state)

Train using Backpropagation Through Time (BPTT).
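The cell updates translate directly into NumPy; the sketch below runs one randomly initialized cell over a toy sequence (weights are invented, and training via BPTT is omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell update following the gate equations above."""
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate
    i = sigmoid(W["i"] @ z + b["i"])          # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate memory
    c = f * c_prev + i * c_tilde              # memory cell update
    o = sigmoid(W["o"] @ z + b["o"])          # output gate
    h = o * np.tanh(c)                        # hidden state
    return h, c

rng = np.random.default_rng(0)
H, D = 4, 3                                   # hidden units, input size
W = {k: rng.normal(scale=0.1, size=(H, H + D)) for k in "fico"}
b = {k: np.zeros(H) for k in "fico"}
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):           # unroll over a length-5 sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```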

Algorithm 5: K-Nearest Neighbors (KNN)

Input: training dataset D = {(x_i, y_i)}; number of neighbors k; distance metric d

Output: predicted value for a given input x

For a new input x: find its k nearest neighbors N_k(x) under the distance d; compute the prediction as the average of their targets:

    ŷ(x) = (1/k)·Σ_{i ∈ N_k(x)} y_i;

return ŷ(x).
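The prediction rule transcribes directly into NumPy (toy data invented; Euclidean distance assumed):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Average the targets of the k nearest training points."""
    d = np.linalg.norm(X_train - x, axis=1)   # distances to all training points
    nearest = np.argsort(d)[:k]               # indices of the k closest
    return y_train[nearest].mean()

X = np.array([[0.0], [1.0], [2.0], [10.0]])
y = np.array([1.0, 2.0, 3.0, 50.0])
print(knn_predict(X, y, np.array([1.1]), k=3))  # 2.0 (the distant outlier is ignored)
```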

D. Evaluation Metrics

The following metrics were used to evaluate the performance of machine learning regression models:

1. Root Mean Squared Error (RMSE)

    RMSE = √((1/n)·Σ_{i=1..n} (y_i − ŷ_i)²)

RMSE measures the average magnitude of the error, emphasizing larger errors due to the squaring.

2. Mean Absolute Error (MAE)

    MAE = (1/n)·Σ_{i=1..n} |y_i − ŷ_i|

MAE provides the average absolute difference between predicted and actual values.

3. Mean Absolute Percentage Error (MAPE)

    MAPE = (100%/n)·Σ_{i=1..n} |y_i − ŷ_i| / |y_i|

MAPE expresses the error as a percentage. This metric can be unstable when y_i is close to zero.

4. Symmetric Mean Absolute Percentage Error (SMAPE)

    SMAPE = (100%/n)·Σ_{i=1..n} |y_i − ŷ_i| / ((|y_i| + |ŷ_i|)/2)

SMAPE improves on MAPE by providing symmetry and handling small values more robustly.

5. Coefficient of Determination (R²)

    R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²

R² indicates how well the predictions approximate the actual values, with 1 being perfect prediction and 0 meaning the model performs no better than predicting the mean.

A logarithmic transformation was applied to improve the demand distribution and reduce the effect of outliers.
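The five metrics fit in a few lines of NumPy (toy values invented; the last two lines illustrate the MAPE instability near zero noted above):

```python
import numpy as np

def rmse(y, p):  return np.sqrt(np.mean((y - p) ** 2))
def mae(y, p):   return np.mean(np.abs(y - p))
def mape(y, p):  return 100 * np.mean(np.abs(y - p) / np.abs(y))
def smape(y, p): return 100 * np.mean(np.abs(y - p) / ((np.abs(y) + np.abs(p)) / 2))
def r2(y, p):    return 1 - np.sum((y - p) ** 2) / np.sum((y - np.mean(y)) ** 2)

y = np.array([10.0, 20.0, 30.0])
p = np.array([12.0, 18.0, 33.0])
print(round(float(mae(y, p)), 2), round(float(rmse(y, p)), 2))  # 2.33 2.38

# MAPE blows up when an actual value is near zero; SMAPE is bounded:
print(round(float(mape(np.array([0.1]), np.array([1.1]))), 1))   # 1000.0
print(round(float(smape(np.array([0.1]), np.array([1.1]))), 1))  # 166.7
```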

III. Results

A. Exploratory Data Analysis

Data analysis showed distinct trends indicating increased consumer activity on weekends, with an average peak around 9-10 AM UTC. The most popular time remains the first half of Sunday. On average, only 0.8-0.9% of views result in a purchase. Products placed in the cart have the highest conversion rate (35-45%), demonstrating that the cart strongly influences the final purchase decision.

B. Models Performance

Both the training and test sets show the same pattern: a long right tail. Most products have a weekly demand of up to 1,000 units, but there are outliers, the so-called "star" products, for which the model tends to perform poorly.

As can be seen, a large Root Mean Squared Error (RMSE) indicates that the model makes larger errors on high-demand values, even though the Mean Absolute Error (MAE) remains relatively low, around 5 units, against an average product demand of 15 units.

Table 1.

Sample of prediction errors for selected products

Index | True Demand | Predicted Demand | Absolute Error | Squared Error
------|-------------|------------------|----------------|--------------
1156  | 9,194       | 2,059            | 7,135          | 50,903,187
1847  | 6,852       | 1,380            | 5,472          | 29,946,849
1788  | 7,344       | 2,105            | 5,239          | 27,444,760
1846  | 6,541       | 1,380            | 5,161          | 26,639,755
1787  | 6,406       | 2,105            | 4,301          | 18,496,663
1157  | 9,778       | 5,480            | 4,298          | 18,472,405
1845  | 5,448       | 1,380            | 4,068          | 16,551,646
1155  | 7,813       | 3,803            | 4,010          | 16,077,075
1786  | 5,823       | 1,835            | 3,988          | 15,900,369
1848  | 5,096       | 1,380            | 3,716          | 13,811,417

 

In the sample table, the index refers to the product number. The absolute error is the absolute difference between the predicted and actual values; the squared error is that difference squared, which penalizes large errors more heavily.

 

Figure 1. Predicted and actual values

 

Table 2.

Model performance comparison on validation set

Model         | RMSE     | MAE  | MAPE   | SMAPE  | R²
--------------|----------|------|--------|--------|------
XGBoost       | 8415.95  | 6.41 | 52.18% | 35.84% | 0.618
LSTM          | 7986.97  | 2.66 | 11.90% | 10.71% | 0.638
Random Forest | 12501.61 | 8.38 | 82.70% | 49.34% | 0.433
SVR           | 18686.32 | 7.83 | 55.75% | 45.79% | 0.152
KNN           | 748.65   | 4.27 | 11.44% | 23.54% | 0.966

 

As can be observed, the K-Nearest Neighbors algorithm performs best on the raw data. At the same time, the percentage-based metrics MAPE and SMAPE take extremely large values across the models. This is due to the nature of these metrics: when the actual values are small, percentage-based errors become disproportionately large.

The MAPE formula is as follows:

    MAPE = (100%/n)·Σ_{t=1..n} |A_t − F_t| / |A_t|,

where A_t is the actual value and F_t is the forecasted value.

We then apply a logarithmic transformation to the input data. This is motivated by the fact that the distribution has a long right tail, and the logarithm compresses large values, resulting in better generalization of the model (i.e., improved adaptability to unseen data). Outlier analysis also showed that variance typically grows with the mean: products grouped into demand intervals such as 0-10, 11-100, and 101+ show increasing deviation, with the highest deviation in the last group.
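The transformation itself is one line; using log1p/expm1 (one plausible choice, assumed here) keeps zero-demand rows valid and makes the mapping invertible so predictions can be reported back in units (demand values invented):

```python
import numpy as np

demand = np.array([0, 3, 15, 1000, 9194], dtype=float)  # long right tail
log_demand = np.log1p(demand)          # log(1 + y): defined at zero, compresses the tail
print(np.round(log_demand, 2).tolist())  # [0.0, 1.39, 2.77, 6.91, 9.13]
restored = np.expm1(log_demand)        # invert after predicting in log space
print(bool(np.allclose(restored, demand)))  # True
```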

 

Figure 2. Predicted and actual values (logarithmic transformation)

 

Table 3.

Model performance after logarithmic transformation

Model         | RMSE   | MAE    | MAPE   | SMAPE  | R²
--------------|--------|--------|--------|--------|------
XGBoost       | 0.0126 | 0.048  | 3.80%  | 3.49%  | 0.987
LSTM          | 0.0024 | 0.0096 | 0.66%  | 0.61%  | 0.998
Random Forest | 0.0703 | 0.196  | 18.43% | 16.22% | 0.928
SVR           | 0.3720 | 0.466  | 36.72% | 34.41% | 0.619
KNN           | 0.2669 | 0.400  | 36.25% | 31.55% | 0.727

 

On average, the logarithmic transformation led to an improvement of approximately 90.986% across all models. LSTM holds the top position across most metrics, XGBoost also demonstrated a significant improvement, and the unrealistically large MAPE and SMAPE values observed for KNN were corrected.

IV. Conclusion

As a result of the study, a large-scale assessment of consumer behavior and demand for goods was carried out using machine learning algorithms. Among the tested models, LSTM and XGBoost demonstrated the highest accuracy, especially after the logarithmic transformation of features, which significantly improved accuracy metrics. There was also a strong relationship between time, product category, and probability of purchase, which confirms the importance of taking into account seasonality and user activity.

Conversion analysis and product clustering allowed us to identify product groups with high potential for further promotion. The results obtained can be effectively used by companies to plan purchases more accurately, develop marketing strategies, and improve the overall efficiency of logistics operations. Further research may be aimed at including additional factors such as promotions, reviews, and customer demographics.

 

References:

  1. E. Kuric, A. Puskas, P. Demcak, and D. Mensatorisova, "Effect of Low-Level Interaction Data in Repeat Purchase Prediction Task," International Journal of Human-Computer Interaction, 2023.
  2. A. Sharma et al., "Machine Learning Approach: Consumer Buying Behavior Analysis," in IEEE Pune Section International Conference (PuneCon), 2022.
  3. J. Ning, K. F. Li, and T. Avant, "A Cost-Sensitive Ensemble Model for e-Commerce Customer Behavior Prediction with Weighted SVM," in Complex, Intelligent, and Software Intensive Systems, Springer, 2023.
  4. M. Alojail and S. Bhatia, "A Novel Technique for Behavioral Analytics Using Ensemble Learning Algorithms in E-commerce," IEEE Access, 2020.
  5. N. I. A. Rusli, F. A. Zulkifle, and I. S. Ramli, "A Comparative Study of Machine Learning Classification Models on Customer Behavior Data," in Soft Computing in Data Science, Springer, 2023.
  6. Z. Liu and X. Ma, "Predictive Analysis of User Purchase Behavior Based on Machine Learning," International Journal of Smart Business and Technology, 2019.
  7. Y. Al-Tayeb, "Predicting Consumer Behavior in Online Shopping Using Clickstream Data and Machine Learning Algorithms," Master Thesis, Tilburg University, 2024.
  8. V. Parihar and S. Yadav, "Comparative Analysis of Different Machine Learning Algorithms to Predict Online Shoppers Behaviour," International Journal of Advanced Networking and Applications, 2022.
  9. S. Garg et al., "An Extensive Review and Comparison of Different Machine Learning Algorithms for Customer Behaviour Pattern Analysis," in IEEE UPCON, 2023.
  10. S. Subramanian et al., "Performance Analysis of Different Machine Learning in Customer Prediction," in IEEE ICOEI, 2022.
  11. X. Zhai et al., "Prediction Model of User Purchase Behavior Based on Machine Learning," in IEEE International Conference on Mechatronics and Automation, 2020.
  12. K. Anshu, S. K. Singh, and R. Kumari, "A Machine Learning Model for Effective Consumer Behaviour Prediction," in IEEE ISCON, 2021.
  13. W. Hu and Y. Shi, "Prediction of Online Consumers' Buying Behavior Based on LSTM-RF Model," in IEEE Big Data Conference, 2020.
  14. H. Valecha et al., "Prediction of Consumer Behaviour Using Random Forest Algorithm," in IEEE UPCON, 2018.
  15. V. Shrirame et al., "Consumer Behavior Analytics Using Machine Learning Algorithms," IEEE Conference, 2020.
  16. X. Wang, Y. Xiangbin, and M. Yangchun, "Research on User Consumption Behavior Prediction Based on Improved XGBoost Algorithm," in IEEE Big Data, 2018.
  17. J. Si, "E-Commerce User Purchase Prediction Based on Improved Machine Learning Algorithms," Independent publication, China, 2023.
  18. S. Bailkar et al., "Smart Inventory Optimization using Machine Learning Algorithms," in IEEE IDCIoT, 2024.
  19. K. Maheswari and P. P. A. Priya, "Predicting Customer Behavior in Online Shopping Using SVM Classifier," in IEEE Conference on Intelligent Techniques, 2017.
  20. X. Wang et al., "Integrated Machine Learning Concept with XGBoost and Random Forest Framework for Predicting Purchase Behaviour by Online Customers in e-Commerce Social Networks," in IEEE FiCloud, 2023.
Information about the author

Kairova B., Master's student, School of Information Technology and Engineering, Kazakh-British Technical University, Almaty, Kazakhstan
