FORECASTING CONSUMER BEHAVIOR USING MACHINE LEARNING MODELS

Kairova B.

To cite: Kairova B. FORECASTING CONSUMER BEHAVIOR USING MACHINE LEARNING MODELS // Universum: Technical Sciences: electronic scientific journal. 2026. 3(144). URL: https://7universum.com/ru/tech/archive/item/22247 (accessed: 28.03.2026).

DOI: 10.32743/UniTech.2026.144.3.22247

 

ABSTRACT

The research compares five machine learning algorithms (XGBoost, LSTM, Random Forest, SVR, and KNN) for predicting consumer demand, with the overarching goal of inventory optimization. The dataset comprises over 230 million records from an international online retail platform, spanning October to December 2019, with validation on January 2020 data. Features include product information, event types (view, cart, purchase), pricing, and user behavior across sessions. Exploratory data analysis (EDA) was performed to identify trends, seasonal effects, and conversion behavior. After applying a logarithmic transformation to account for the skewed demand distribution, the LSTM model achieved the best results: R² = 0.998, MAPE = 0.66%, and SMAPE = 0.61%. XGBoost followed closely with R² = 0.987. These results demonstrate the value of machine learning, particularly deep learning and gradient boosting approaches, in effectively forecasting product demand.


 

Keywords: Machine Learning, Demand Forecasting, Consumer Behavior, Inventory Optimization, E-commerce, Predictive Analytics, LSTM, XGBoost.


 

I. Introduction

The exponential growth of e-commerce and the widespread adoption of digital technologies have transformed how businesses interact with consumers. With every click, scroll, and purchase, vast amounts of behavioral data are generated, offering unprecedented opportunities to understand consumer behavior [1]. Machine learning (ML) has emerged as a key enabler for predictive analytics [12], [18]. This study aims to build and compare demand forecasting models, identify key behavioral drivers, and provide business recommendations for inventory and marketing optimization. The study evaluates multiple approaches:

1. Single Machine Learning Models: Models like Logistic Regression, Naive Bayes, and KNN are widely used for their simplicity and fast training speed, but often underperform compared to complex models on high-dimensional data [10], [2], [5], [8].

2. Ensemble Methods: Algorithms like Random Forest and XGBoost have proven highly effective with large e-commerce datasets, offering high accuracy despite requiring more computational resources [19], [20], [13], [4], [3].

3. Hybrid and Stacked Models: Combining multiple algorithms compensates for individual limitations, often yielding the best results for time-series features [6], [14], [13], [17], [11].

4. Clickstream-Based Models: Analyzing user navigation, clicks, and session time enables the prediction of purchase intent before a transaction occurs [7], [1], [9].

II. Methodology

A. Dataset Acquisition

The dataset includes over 230 million records from October-December 2019. The validation set is from January 2020. The dataset consists of the following fields:

  • event_time: Timestamp of the user action
  • event_type: Type of event: view, cart, or purchase
  • product_id: Unique identifier of the product
  • category_id: Identifier for the product category
  • brand: Brand or manufacturer of the product
  • price: Price of the product at the time of the event
  • user_id: Unique identifier for the user
  • user_session: Identifier for the user session

Total: 233,460,662 records, of which 3,656,843 are purchases.
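A minimal pandas sketch of this schema (the sample rows and in-memory CSV are invented for illustration; the real dataset is read from the platform's export files):

```python
import io
import pandas as pd

# Hypothetical sample mimicking the field list above.
csv = io.StringIO(
    "event_time,event_type,product_id,category_id,brand,price,user_id,user_session\n"
    "2019-10-01 00:00:01,view,1001,2001,acme,49.99,501,s1\n"
    "2019-10-01 00:00:05,cart,1001,2001,acme,49.99,501,s1\n"
    "2019-10-01 00:01:10,purchase,1001,2001,acme,49.99,501,s1\n"
)
events = pd.read_csv(csv, parse_dates=["event_time"])

# Purchases are the subset of rows used as the demand signal.
purchases = events[events["event_type"] == "purchase"]
print(len(events), len(purchases))  # 3 1
```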

B. Data Analysis

EDA was conducted to analyze purchasing trends, weekday vs. weekend behavior, peak hours, conversion rates, and repeat buying behavior, similar to [15]. Notable findings included activity peaks on Sunday mornings and December spikes driven by the holidays.
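The conversion rates in such an analysis reduce to ratios of event counts per event type; a toy sketch (the counts are invented, sized only to mirror the funnel shape reported in Section III):

```python
import pandas as pd

# Invented toy event log: 500 views, 10 cart additions, 4 purchases.
events = pd.DataFrame({
    "event_type": ["view"] * 500 + ["cart"] * 10 + ["purchase"] * 4,
})
counts = events["event_type"].value_counts()
view_to_purchase = counts["purchase"] / counts["view"]  # share of views ending in purchase
cart_to_purchase = counts["purchase"] / counts["cart"]  # cart-to-purchase conversion
print(f"{view_to_purchase:.1%} {cart_to_purchase:.1%}")  # 0.8% 40.0%
```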

C. Demand Prediction Algorithms

We compare machine learning algorithms for demand forecasting, described below using their mathematical formulations and structural logic.

Algorithm 1: Support Vector Regressor

Input: training dataset D = {(x_i, y_i)}, i = 1..n; kernel function K; regularization parameter C; tube width ε

Output: trained Support Vector Regressor model

Construct the kernel matrix K_ij = K(x_i, x_j) for all pairs (i, j);

Solve the following optimization problem:

    minimize (1/2)·||w||² + C·Σ_{i=1..n} (ξ_i + ξ_i*)

    subject to y_i − ⟨w, φ(x_i)⟩ − b ≤ ε + ξ_i,
               ⟨w, φ(x_i)⟩ + b − y_i ≤ ε + ξ_i*,
               ξ_i, ξ_i* ≥ 0 for all i;

Calculate the weight vector w and bias term b;

for each support vector x_i, accumulate its contribution (α_i − α_i*)·K(x_i, x);

return the model f(x) = Σ_i (α_i − α_i*)·K(x_i, x) + b.
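The distinctive ingredient of SVR is the ε-insensitive loss behind the constraints above: errors inside the ε-tube cost nothing. A small NumPy illustration (toy values invented):

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, eps=0.5):
    """SVR's loss: absolute errors within the eps-tube are ignored."""
    return np.maximum(np.abs(y_true - y_pred) - eps, 0.0)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.2, 2.0, 4.0])
print(epsilon_insensitive_loss(y_true, y_pred))  # [0.  0.  0.5]
```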

Algorithm 2: Random Forest

Input: training dataset D; number of trees T; number of candidate split features m

Output: Random Forest model F

Initialize an empty list forest to store the trees;

for t = 1 to T do

    Create a bootstrap sample D_t from D (sampling rows with replacement);

    Train a decision tree h_t on D_t, considering m randomly chosen features (sampled without replacement) at each split;

    Add h_t to forest;

end

return F(x) = (1/T)·Σ_{t=1..T} h_t(x), the average of the trees' predictions.
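The two sampling steps inside the loop can be sketched in NumPy (toy sizes; the tree-training step itself is elided):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, T, m = 8, 5, 3, 2  # toy sizes for illustration

X = rng.normal(size=(n_samples, n_features))

for t in range(T):
    boot_idx = rng.integers(0, n_samples, size=n_samples)     # bootstrap: rows WITH replacement
    feat_idx = rng.choice(n_features, size=m, replace=False)  # m features WITHOUT replacement
    X_t = X[boot_idx][:, feat_idx]
    # ... train decision tree h_t on X_t here ...
    print(X_t.shape)  # (8, 2) each round
```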

Algorithm 3: XGBoost

Input: dataset D; number of boosting rounds T; learning rate η

Output: ensemble of regression trees f_1, …, f_T

Objective function at round t:

    L(t) = Σ_{i=1..n} l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t)

Where l is a differentiable loss, ŷ_i^(t−1) is the prediction after round t−1, and the regularization term is

    Ω(f) = γ·J + (λ/2)·Σ_{j=1..J} w_j²,

with J the number of leaves and w_j the leaf weights. Each round fits f_t to the first- and second-order gradients g_i = ∂l/∂ŷ and h_i = ∂²l/∂ŷ², then updates ŷ_i^(t) = ŷ_i^(t−1) + η·f_t(x_i);

return the final ensemble ŷ(x) = Σ_{t=1..T} η·f_t(x).
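The additive, gradient-fitting recursion can be made concrete with squared loss and a fixed depth-1 stump standing in for XGBoost's regularized trees (a simplified sketch of the boosting idea, not the full algorithm; data invented):

```python
import numpy as np

# For squared loss, the negative gradient is simply the residual y - pred,
# so each round fits the residuals and adds the fit scaled by eta.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 7.5, 8.0])
eta, rounds = 0.3, 200
thr = np.median(x)                 # fixed split point of the toy stump
mask = x <= thr
pred = np.zeros_like(y)
for _ in range(rounds):
    r = y - pred                                            # residuals = negative gradient
    stump = np.where(mask, r[mask].mean(), r[~mask].mean()) # depth-1 fit to residuals
    pred += eta * stump                                     # shrunken additive update
print(np.round(pred, 2).tolist())  # converges to the per-leaf means [2.25, 2.25, 7.75, 7.75]
```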

Algorithm 4: LSTM (Long Short-Term Memory)

Input: sequence data x_1, …, x_T; number of hidden units H; learning rate

Output: trained LSTM model with hidden state h_t and memory cell c_t

Cell updates at each step t:

    f_t = σ(W_f·[h_{t−1}, x_t] + b_f)        (forget gate)
    i_t = σ(W_i·[h_{t−1}, x_t] + b_i)        (input gate)
    c̃_t = tanh(W_c·[h_{t−1}, x_t] + b_c)    (candidate memory)
    c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t         (memory cell)
    o_t = σ(W_o·[h_{t−1}, x_t] + b_o)        (output gate)
    h_t = o_t ⊙ tanh(c_t)                    (hidden state)

Train using Backpropagation Through Time (BPTT).
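The cell updates translate directly into NumPy; the sketch below runs one randomly initialized cell over a toy sequence (weights are invented, and training via BPTT is omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell update following the gate equations above."""
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate
    i = sigmoid(W["i"] @ z + b["i"])          # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate memory
    c = f * c_prev + i * c_tilde              # memory cell update
    o = sigmoid(W["o"] @ z + b["o"])          # output gate
    h = o * np.tanh(c)                        # hidden state
    return h, c

rng = np.random.default_rng(0)
H, D = 4, 3                                   # hidden units, input size
W = {k: rng.normal(scale=0.1, size=(H, H + D)) for k in "fico"}
b = {k: np.zeros(H) for k in "fico"}
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):           # unroll over a length-5 sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```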

Algorithm 5: K-Nearest Neighbors (KNN)

Input: training dataset D = {(x_i, y_i)}; number of neighbors k; distance metric d

Output: predicted value for a given input x

For a new input x: find its k nearest neighbors N_k(x) under the distance d; compute the prediction as the average of their targets:

    ŷ(x) = (1/k)·Σ_{i ∈ N_k(x)} y_i;

return ŷ(x).
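The prediction rule transcribes directly into NumPy (toy data invented; Euclidean distance assumed):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Average the targets of the k nearest training points."""
    d = np.linalg.norm(X_train - x, axis=1)   # distances to all training points
    nearest = np.argsort(d)[:k]               # indices of the k closest
    return y_train[nearest].mean()

X = np.array([[0.0], [1.0], [2.0], [10.0]])
y = np.array([1.0, 2.0, 3.0, 50.0])
print(knn_predict(X, y, np.array([1.1]), k=3))  # 2.0 (the distant outlier is ignored)
```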

D. Evaluation Metrics

The following metrics were used to evaluate the performance of machine learning regression models:

1. Root Mean Squared Error (RMSE)

    RMSE = √((1/n)·Σ_{i=1..n} (y_i − ŷ_i)²)

RMSE measures the average magnitude of the error, emphasizing larger errors due to the squaring.

2. Mean Absolute Error (MAE)

    MAE = (1/n)·Σ_{i=1..n} |y_i − ŷ_i|

MAE provides the average absolute difference between predicted and actual values.

3. Mean Absolute Percentage Error (MAPE)

    MAPE = (100%/n)·Σ_{i=1..n} |y_i − ŷ_i| / |y_i|

MAPE expresses the error as a percentage. This metric can be unstable when y_i is close to zero.

4. Symmetric Mean Absolute Percentage Error (SMAPE)

    SMAPE = (100%/n)·Σ_{i=1..n} |y_i − ŷ_i| / ((|y_i| + |ŷ_i|)/2)

SMAPE improves on MAPE by providing symmetry and handling small values more robustly.

5. Coefficient of Determination (R²)

    R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²

R² indicates how well the predictions approximate the actual values, with 1 being perfect prediction and 0 meaning the model performs no better than predicting the mean.

A logarithmic transformation was applied to improve the demand distribution and reduce the effect of outliers.
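The five metrics fit in a few lines of NumPy (toy values invented; the last two lines illustrate the MAPE instability near zero noted above):

```python
import numpy as np

def rmse(y, p):  return np.sqrt(np.mean((y - p) ** 2))
def mae(y, p):   return np.mean(np.abs(y - p))
def mape(y, p):  return 100 * np.mean(np.abs(y - p) / np.abs(y))
def smape(y, p): return 100 * np.mean(np.abs(y - p) / ((np.abs(y) + np.abs(p)) / 2))
def r2(y, p):    return 1 - np.sum((y - p) ** 2) / np.sum((y - np.mean(y)) ** 2)

y = np.array([10.0, 20.0, 30.0])
p = np.array([12.0, 18.0, 33.0])
print(round(float(mae(y, p)), 2), round(float(rmse(y, p)), 2))  # 2.33 2.38

# MAPE blows up when an actual value is near zero; SMAPE is bounded:
print(round(float(mape(np.array([0.1]), np.array([1.1]))), 1))   # 1000.0
print(round(float(smape(np.array([0.1]), np.array([1.1]))), 1))  # 166.7
```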

III. Results

A. Exploratory Data Analysis

Data analysis showed distinct trends indicating increased consumer activity on weekends, with an average peak around 9-10 AM UTC. The most popular time remains the first half of Sunday. On average, only 0.8-0.9% of views result in a purchase. Products placed in the cart have the highest conversion rate (35-45%), demonstrating that the cart strongly influences the final purchase decision.

B. Models Performance

Both the training and test sets show the same pattern: a long right tail. Most products have a weekly demand of up to 1,000 units, but there are outliers, the so-called "star" products, for which the model tends to perform poorly.

As can be seen, a large Root Mean Squared Error (RMSE) indicates that the model makes larger errors on high-demand values, even though the Mean Absolute Error (MAE) remains relatively low, around 5 units, against an average product demand of 15 units.

Table 1.

Sample of prediction errors for selected products

Index | True Demand | Predicted Demand | Absolute Error | Squared Error
------|-------------|------------------|----------------|--------------
1156  | 9,194       | 2,059            | 7,135          | 50,903,187
1847  | 6,852       | 1,380            | 5,472          | 29,946,849
1788  | 7,344       | 2,105            | 5,239          | 27,444,760
1846  | 6,541       | 1,380            | 5,161          | 26,639,755
1787  | 6,406       | 2,105            | 4,301          | 18,496,663
1157  | 9,778       | 5,480            | 4,298          | 18,472,405
1845  | 5,448       | 1,380            | 4,068          | 16,551,646
1155  | 7,813       | 3,803            | 4,010          | 16,077,075
1786  | 5,823       | 1,835            | 3,988          | 15,900,369
1848  | 5,096       | 1,380            | 3,716          | 13,811,417

 

In the sample table, the index refers to the product number. The absolute error is the absolute difference between the predicted and actual values; the squared error is that difference squared, which penalizes large errors more heavily.

 

Figure 1. Predicted and actual values

 

Table 2.

Model performance comparison on validation set

Model         | RMSE     | MAE  | MAPE   | SMAPE  | R²
--------------|----------|------|--------|--------|------
XGBoost       | 8415.95  | 6.41 | 52.18% | 35.84% | 0.618
LSTM          | 7986.97  | 2.66 | 11.90% | 10.71% | 0.638
Random Forest | 12501.61 | 8.38 | 82.70% | 49.34% | 0.433
SVR           | 18686.32 | 7.83 | 55.75% | 45.79% | 0.152
KNN           | 748.65   | 4.27 | 11.44% | 23.54% | 0.966

 

As can be observed, the K-Nearest Neighbors algorithm performs best on the raw data. At the same time, the percentage-based metrics MAPE and SMAPE take extremely large values across the models. This is due to the nature of these metrics: when the actual values are small, percentage-based errors become disproportionately large.

The MAPE formula is as follows:

    MAPE = (100%/n)·Σ_{t=1..n} |A_t − F_t| / |A_t|,

where A_t is the actual value and F_t is the forecasted value.

We then apply a logarithmic transformation to the input data. This is motivated by the fact that the distribution has a long right tail, and the logarithm compresses large values, resulting in better generalization of the model (i.e., improved adaptability to unseen data). Outlier analysis also showed that variance typically grows with the mean: products grouped into demand intervals such as 0-10, 11-100, and 101+ show increasing deviation, with the highest deviation in the last group.
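The transformation itself is one line; using log1p/expm1 (one plausible choice, assumed here) keeps zero-demand rows valid and makes the mapping invertible so predictions can be reported back in units (demand values invented):

```python
import numpy as np

demand = np.array([0, 3, 15, 1000, 9194], dtype=float)  # long right tail
log_demand = np.log1p(demand)          # log(1 + y): defined at zero, compresses the tail
print(np.round(log_demand, 2).tolist())  # [0.0, 1.39, 2.77, 6.91, 9.13]
restored = np.expm1(log_demand)        # invert after predicting in log space
print(bool(np.allclose(restored, demand)))  # True
```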

 

Figure 2. Predicted and actual values (logarithmic transformation)

 

Table 3.

Model performance after logarithmic transformation

Model         | RMSE   | MAE    | MAPE   | SMAPE  | R²
--------------|--------|--------|--------|--------|------
XGBoost       | 0.0126 | 0.048  | 3.80%  | 3.49%  | 0.987
LSTM          | 0.0024 | 0.0096 | 0.66%  | 0.61%  | 0.998
Random Forest | 0.0703 | 0.196  | 18.43% | 16.22% | 0.928
SVR           | 0.3720 | 0.466  | 36.72% | 34.41% | 0.619
KNN           | 0.2669 | 0.400  | 36.25% | 31.55% | 0.727

 

On average, the logarithmic transformation led to an improvement of approximately 90.986% across all models. LSTM holds the top position across most metrics, XGBoost also demonstrated a significant improvement, and the unrealistically large MAPE and SMAPE values observed for KNN were corrected.

IV. Conclusion

As a result of the study, a large-scale assessment of consumer behavior and demand for goods was carried out using machine learning algorithms. Among the tested models, LSTM and XGBoost demonstrated the highest accuracy, especially after the logarithmic transformation of features, which significantly improved accuracy metrics. There was also a strong relationship between time, product category, and probability of purchase, which confirms the importance of taking into account seasonality and user activity.

Conversion analysis and product clustering allowed us to identify product groups with high potential for further promotion. The results obtained can be effectively used by companies to plan purchases more accurately, develop marketing strategies, and improve the overall efficiency of logistics operations. Further research may be aimed at including additional factors such as promotions, reviews, and customer demographics.

 

References:

  1. E. Kuric, A. Puskas, P. Demcak, and D. Mensatorisova, "Effect of Low-Level Interaction Data in Repeat Purchase Prediction Task," International Journal of Human-Computer Interaction, 2023.
  2. A. Sharma et al., "Machine Learning Approach: Consumer Buying Behavior Analysis," in IEEE Pune Section International Conference (PuneCon), 2022.
  3. J. Ning, K. F. Li, and T. Avant, "A Cost-Sensitive Ensemble Model for e-Commerce Customer Behavior Prediction with Weighted SVM," in Complex, Intelligent, and Software Intensive Systems, Springer, 2023.
  4. M. Alojail and S. Bhatia, "A Novel Technique for Behavioral Analytics Using Ensemble Learning Algorithms in E-commerce," IEEE Access, 2020.
  5. N. I. A. Rusli, F. A. Zulkifle, and I. S. Ramli, "A Comparative Study of Machine Learning Classification Models on Customer Behavior Data," in Soft Computing in Data Science, Springer, 2023.
  6. Z. Liu and X. Ma, "Predictive Analysis of User Purchase Behavior Based on Machine Learning," International Journal of Smart Business and Technology, 2019.
  7. Y. Al-Tayeb, "Predicting Consumer Behavior in Online Shopping Using Clickstream Data and Machine Learning Algorithms," Master Thesis, Tilburg University, 2024.
  8. V. Parihar and S. Yadav, "Comparative Analysis of Different Machine Learning Algorithms to Predict Online Shoppers Behaviour," International Journal of Advanced Networking and Applications, 2022.
  9. S. Garg et al., "An Extensive Review and Comparison of Different Machine Learning Algorithms for Customer Behaviour Pattern Analysis," in IEEE UPCON, 2023.
  10. S. Subramanian et al., "Performance Analysis of Different Machine Learning in Customer Prediction," in IEEE ICOEI, 2022.
  11. X. Zhai et al., "Prediction Model of User Purchase Behavior Based on Machine Learning," in IEEE International Conference on Mechatronics and Automation, 2020.
  12. K. Anshu, S. K. Singh, and R. Kumari, "A Machine Learning Model for Effective Consumer Behaviour Prediction," in IEEE ISCON, 2021.
  13. W. Hu and Y. Shi, "Prediction of Online Consumers' Buying Behavior Based on LSTM-RF Model," in IEEE Big Data Conference, 2020.
  14. H. Valecha et al., "Prediction of Consumer Behaviour Using Random Forest Algorithm," in IEEE UPCON, 2018.
  15. V. Shrirame et al., "Consumer Behavior Analytics Using Machine Learning Algorithms," IEEE Conference, 2020.
  16. X. Wang, Y. Xiangbin, and M. Yangchun, "Research on User Consumption Behavior Prediction Based on Improved XGBoost Algorithm," in IEEE Big Data, 2018.
  17. J. Si, "E-Commerce User Purchase Prediction Based on Improved Machine Learning Algorithms," Independent publication, China, 2023.
  18. S. Bailkar et al., "Smart Inventory Optimization using Machine Learning Algorithms," in IEEE IDCIoT, 2024.
  19. K. Maheswari and P. P. A. Priya, "Predicting Customer Behavior in Online Shopping Using SVM Classifier," in IEEE Conference on Intelligent Techniques, 2017.
  20. X. Wang et al., "Integrated Machine Learning Concept with XGBoost and Random Forest Framework for Predicting Purchase Behaviour by Online Customers in e-Commerce Social Networks," in IEEE FiCloud, 2023.
Information about the author

Kairova B., Master's student, School of Information Technology and Engineering, Kazakh-British Technical University, Almaty, Kazakhstan
