BUILDING A MODEL FOR PREDICTING THE BEST PLAYER OF THE FOOTBALL MATCH

Shapagat A.S., Kabdrakhova S.
Cite as:
Shapagat A.S., Kabdrakhova S. BUILDING A MODEL FOR PREDICTING THE BEST PLAYER OF THE FOOTBALL MATCH // Universum: технические науки: electronic scientific journal. 2025. 5(134). URL: https://7universum.com/ru/tech/archive/item/20141 (accessed: 05.12.2025).
DOI: 10.32743/UniTech.2025.134.5.20141

 

ABSTRACT

The increasing role of data analytics in football has opened new possibilities for evaluating player performance beyond subjective judgment. Traditionally, identifying the best player in a match—commonly referred to as the “Man of the Match”—has relied on opinions from commentators, fans, or pundits. However, the availability of detailed match statistics now allows for more objective, data-driven approaches to performance assessment. This research explores the development of a predictive model aimed at identifying the best player in a football match using publicly available match data. Data was collected from multiple matches via WhoScored.com and FBRef.com, incorporating a wide range of individual player metrics, including passing, shooting, defensive actions, and overall match involvement. The research applies regression techniques to estimate match ratings from which the top player can be inferred. The results show that machine learning models can effectively identify high-performing players using in-game statistics alone. In addition to making accurate predictions, the study highlights which features most influence the outcome, offering valuable insight into the components of individual excellence on the pitch. This work contributes to the broader field of sports analytics by providing a framework for objective player evaluation that could complement traditional methods.


 

Keywords: football, machine learning, regression, sport, the best player, match statistics.


 

Introduction

In the modern game of football, performance analysis has become an essential part of how teams, fans, and analysts understand the sport. With the growing availability of detailed match statistics, it is now possible to evaluate individual player contributions in a more objective and data-driven manner. However, deciding who was the "best player" in a match often remains a subjective call, typically based on opinion, media narratives, or fan preferences. Platforms like WhoScored and SofaScore attempt to bring some consistency to these evaluations by providing numerical ratings based on in-game actions, yet their methodologies are largely proprietary and not open to public scrutiny.

This research aims to explore whether a custom-built model, based on public data and transparent logic, can effectively assess player performance and predict who the best player in a match is. The approach involves building a rating system that reflects individual contributions across a variety of football actions, such as passing, shooting, defending, and ball progression. By evaluating how closely the model's predictions align with actual post-match ratings, the project seeks both to validate its predictive ability and to shed light on the key indicators of exceptional individual displays.

The core goal is not just to build a black-box model that makes accurate predictions, but to understand the underlying mechanics of great performances. Through feature analysis and model interpretation, the project also aims to highlight which actions matter most when it comes to being the best player on the pitch. Ultimately, this work contributes to the ongoing development of transparent and interpretable player evaluation tools within the field of sports analytics.

Literature review

Assessing individual player performance in football has traditionally been a subjective process, often relying on visual observations and expert opinion. With the growing influence of sports analytics, there has been a shift towards data-driven evaluations, which aim to provide objective, replicable assessments of player impact during a match. The rise of detailed event data (tracking every pass, shot, tackle, and dribble) has made it possible to quantify a player's contribution with increasing precision [1].

Several platforms such as WhoScored, SofaScore, and Instat have emerged as key players in this space. These platforms generate post-match ratings for each player, usually on a scale of 1 to 10, based on match statistics. However, the algorithms behind these ratings are proprietary, limiting transparency and academic evaluation [2]. This has led researchers to explore ways to build open, interpretable models for player evaluation using publicly accessible data.

Most existing rating systems use a weighted average of in-game events, where each event (e.g., a goal, successful dribble, or interception) contributes differently to a player's overall score. For example, Decroos, Bransen, Van Haaren, and Davis proposed the VAEP (Valuing Actions by Estimating Probabilities) framework, which assigns value to every player action based on how much it changes the likelihood of scoring or conceding. This approach is context-aware and goes beyond simple event counting [3].

Other studies, such as that of Schulte, Zhao, and Gholami, have developed models that attempt to estimate performance through possession sequences and their outcomes, rather than just raw metrics. These approaches emphasize the importance of context, such as field position, match state, and opposition strength, elements often overlooked in commercial rating systems [4].

The use of machine learning in sports analytics has expanded rapidly. Researchers have applied algorithms ranging from logistic regression and decision trees to more advanced methods like Random Forests, Gradient Boosting Machines, and Neural Networks to classify and predict outcomes based on match data [5]. In the context of individual player ratings, machine learning can help identify complex, nonlinear patterns in data that traditional models may miss.

Saikia and Bhattacharyya (2021) applied supervised learning techniques to predict the best player in a football match using a set of handcrafted features [6]. Similarly, Leung et al. (2020) used regression models to estimate player performance based on in-game statistics, comparing their predictions to expert ratings. Their findings highlight the potential of combining statistical modeling with machine learning to develop more robust performance evaluation tools [7].

A key challenge in this domain is the subjectivity of the ground truth. Since player ratings and MVP (Most Valuable Player) awards are often based on human judgment, they can be biased toward goal scorers or players in more visible positions. Defender and goalkeeper contributions, for example, are less likely to be reflected accurately in traditional ratings [8].

Moreover, position-specific expectations must be considered when evaluating players. A forward and a center-back may both perform well but contribute in completely different ways. Several models address this by either training separate models per position or incorporating position as a feature during modeling [9].

Recent research has explored the idea of constructing custom rating models, where researchers assign their own weights to performance metrics and validate them against external benchmarks such as WhoScored or expert judgments. This approach allows greater flexibility and interpretability. Studies like Pappalardo et al. (2019) have released open datasets (e.g., Soccer Player Performance Dataset) and proposed interpretable scoring functions [10].

Dataset

Data collection

The dataset used in this research was collected by web scraping. Our aim was to collect the statistics of each individual player in every football match considered. For scraping we used the popular Python libraries BeautifulSoup and Selenium. Table 1 shows the final form of the dataset. To obtain it, we first chose two large, popular football websites, WhoScored and FBref. We then scraped data from both websites with Python parsing tools and concatenated them appropriately. Figure 1 and Figure 2 show the tables that were scraped. For matches, we chose 40 UEFA Champions League matches from the quarter-finals, the round of 16, and the knockout phase. After cleaning and manipulation, the final version of the dataset has 1117 rows and 29 columns.
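The table-extraction step can be illustrated with a minimal sketch. It does not reproduce the authors' scraper: instead of BeautifulSoup and Selenium it uses the standard-library `html.parser` so that it runs without extra packages, and the HTML fragment, the `TableParser` class, and the `parse_table` helper are hypothetical names invented for this example.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment standing in for one scraped player-stats table.
SAMPLE_HTML = """
<table id="stats">
  <tr><th>Player</th><th>Min</th><th>Gls</th></tr>
  <tr><td>Player A</td><td>90</td><td>1</td></tr>
  <tr><td>Player B</td><td>75</td><td>0</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return the table as a list of dicts keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = parse_table(SAMPLE_HTML)
print(records)
```

All cell values come back as strings, so numeric columns such as `Min` and `Gls` would still need to be converted before modeling.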

 

Figure 1. Screenshot of web page from FBRef 

 

Figure 2. Screenshot of WhoScored web page

 

Table 1.

Dataset

| No. | Column name | Meaning |
|---|---|---|
| 1 | Player | Player’s full name |
| 2 | Pos | Position |
| 3 | Min | Minutes played |
| 4 | Gls | Goals scored |
| 5 | Ast | Assists given |
| 6 | Sh | Total shots |
| 7 | SoT | Shots on target |
| 8 | CrdY | Yellow cards earned |
| 9 | CrdR | Red cards earned |
| 10 | Blocks | Blocked shots |
| 11 | xG | Expected goals |
| 12 | xAG | Expected assists |
| 13 | SCA | Shot creating actions |
| 14 | GCA | Goal creating actions |
| 15 | Cmp | Passes completed |
| 16 | Cmp% | Pass completion % |
| 17 | Tkl | Number of tackles |
| 18 | Match | Name of two clubs |
| 19 | Rating | Rating (Target value) |
| 20 | Fls | Fouls committed |
| 21 | Fld | Fouls drawn |
| 22 | Off | Offsides |
| 23 | Crs | Crosses |
| 24 | OG | Own goals |
| 25 | Recov | Recovery |
| 26 | Int | Interceptions |
| 27 | TklW | Tackles won |
| 28 | Won | Aerials won |
| 29 | Lost | Aerials lost |

 

Data preprocessing

After parsing the data from the websites, we had two datasets. The first dataset, from the WhoScored website, consists of four sub-tables with 40 columns in total (Table 2). From it we need only the “Player” and “Match” columns and the main target feature, “Rating”.

The second dataset, from the FBref website, consists of six sub-tables with 158 columns in total (Table 3). Here we also selected the main features that most affect the assessment of player performance, and obtained one large dataset by concatenating the sub-tables.

The two datasets were preprocessed separately by removing null values, missing values, and duplicates, and by renaming columns to convenient names to avoid problems later on. Once the first dataset was clean, we sorted it by “Player” and assigned each player a “Player_ID”. The same process was applied to the second dataset, so that the same player has an identical “Player_ID” in both, and the two datasets were then joined by matching on the “Player_ID” column.
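The ID assignment and join described above can be sketched in plain Python. The records, the club names, and the helper `assign_player_ids` are made up for illustration, and each player appears only once in this toy data; the authors' actual implementation is not given in the paper.

```python
def assign_player_ids(names):
    """Map each distinct player name to an integer ID, in sorted-name order."""
    return {name: i for i, name in enumerate(sorted(set(names)), start=1)}


# Made-up rows standing in for the cleaned WhoScored and FBref datasets.
whoscored = [
    {"Player": "Player B", "Match": "Club X - Club Y", "Rating": 6.9},
    {"Player": "Player A", "Match": "Club X - Club Y", "Rating": 8.1},
]
fbref = [
    {"Player": "Player A", "Gls": 1, "Ast": 0},
    {"Player": "Player B", "Gls": 0, "Ast": 1},
    {"Player": "Player C", "Gls": 0, "Ast": 0},  # no rating available
]

# The same name receives the same ID in both datasets.
ids = assign_player_ids([row["Player"] for row in whoscored + fbref])

# Index the ratings by ID, then keep only FBref rows that have a rating.
ratings_by_id = {ids[row["Player"]]: row for row in whoscored}
merged = []
for row in fbref:
    pid = ids[row["Player"]]
    if pid in ratings_by_id:
        merged.append({"Player_ID": pid, **row, **ratings_by_id[pid]})

for row in merged:
    print(row)
```

Rows without a counterpart in the other dataset (here "Player C") are dropped, which corresponds to an inner join.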

In this research we predicted player ratings for the whole dataset, which includes all positions, and also separately for the defender, midfielder, and forward groups. To do so, we determined each player’s position group from the “Pos” column, using the information in Figure 4, which gave Table 4.
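A minimal sketch of the grouping step follows. The code-to-group mapping in `GROUPS` is an assumption made for this example (the paper derives its grouping from Figure 4 and does not list the codes), and `position_group` is a hypothetical helper. The last lines recompute the percentages of Table 4 from its row counts.

```python
from collections import Counter

# Assumed mapping from position codes to groups; illustrative only.
GROUPS = {
    "Defender": {"CB", "LB", "RB", "WB", "DF"},
    "Midfielder": {"CM", "DM", "AM", "LM", "RM", "MF"},
    "Forward": {"FW", "LW", "RW"},
}


def position_group(pos):
    """Return the group of the first listed code, e.g. 'CB,DM' -> 'Defender'."""
    primary = pos.split(",")[0].strip().upper()
    for group, codes in GROUPS.items():
        if primary in codes:
            return group
    return None  # codes outside the three groups, e.g. goalkeepers


sample = ["CB", "LW,FW", "DM", "RB", "GK"]
counts = Counter(g for g in map(position_group, sample) if g)
print(counts)

# Row counts reported in Table 4, and the percentages they imply.
table4 = {"Defender": 409, "Midfielder": 362, "Forward": 346}
total = sum(table4.values())
shares = {g: round(100 * n / total, 1) for g, n in table4.items()}
print(total, shares)
```

The three group counts sum to the 1117 rows of the final dataset.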

Table 2.

Number of columns on the WhoScored website

| WhoScored player stats | Number of columns |
|---|---|
| Summary | 9 |
| Offensive | 11 |
| Defensive | 8 |
| Passing | 12 |

 

Table 3.

Number of columns on the FBref website

| FBref player stats | Number of columns |
|---|---|
| Summary | 32 |
| Passing | 29 |
| Pass Types | 22 |
| Defensive actions | 23 |
| Possession | 29 |
| Miscellaneous Stats | 23 |

 

Table 4.

Number of players by position

| Position | Number of rows | Percentage |
|---|---|---|
| Defender | 409 | 36.6 % |
| Midfielder | 362 | 32.4 % |
| Forward | 346 | 31.0 % |

 

Figure 4. Football players’ positions on the pitch

Data analysis

The features do not contribute uniformly to determining the most valuable player of the match; some have a much stronger effect than others. We therefore constructed a correlation matrix of the selected numerical variables, shown in Figure 5. The strongest positive correlations are marked in dark red. The strongest negative correlations, which indicate an inverse relationship, are marked in dark blue. Weak correlations, close to no correlation at all, are shown in white.

 

Figure 5. Correlation heatmap of features

 

Figure 6 gives the sorted feature correlations with respect to the “Rating” target. From the figure we can conclude that the ten attributes most highly correlated with the rating are: Goals, Shots on Target, Shot Creating Actions, Shots, Expected Assisted Goals, Expected Goals, Goal Creating Actions, Assists, Recovery, and Completed Passes. The negatively correlated attributes are Yellow Cards and Red Cards.
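The correlation analysis can be sketched as follows, assuming the Pearson coefficient (the paper does not name the coefficient used). The numbers in `data` are made up for illustration and are not taken from the dataset; `pearson` is a helper written for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up values for five players; column names follow Table 1.
data = {
    "Gls": [0, 1, 0, 2, 1],
    "CrdY": [1, 0, 1, 0, 0],
    "Rating": [6.2, 7.5, 6.0, 8.8, 7.1],
}

# Correlation of every feature with the target, then sorted high to low.
corr_with_rating = {
    col: pearson(values, data["Rating"])
    for col, values in data.items()
    if col != "Rating"
}
ranked = sorted(corr_with_rating, key=corr_with_rating.get, reverse=True)

for col in ranked:
    print(f"{col}: {corr_with_rating[col]:+.3f}")
```

In this toy data goals correlate positively with the rating and yellow cards negatively, which is the same direction as the relationships reported above.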

 

Figure 6. Ordered feature importance to Rating

 

Methodology

To predict rating of players, we applied five machine learning algorithms:

  • Linear regression
  • Ridge regression
  • Random forest
  • XGBoost
  • MLP Regressor (Multi-Layer Perceptron)
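The two linear models in this list can be sketched with the closed-form ridge solution, assuming NumPy is available. Setting `alpha=0` gives ordinary linear regression. The data below is synthetic, and the function names, the value of `alpha`, and the coefficients are assumptions made for this example; the paper does not state which implementation or hyperparameters were used. Random forest, XGBoost, and the MLP are omitted because they need dedicated libraries.

```python
import numpy as np


def fit_ridge(X, y, alpha=0.0):
    """Closed-form ridge fit; alpha=0 is ordinary least squares.

    Features and target are centered so the intercept is not penalized.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(A, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def predict(X, coef, intercept):
    """Predicted ratings for the rows of X."""
    return np.asarray(X, dtype=float) @ coef + intercept


# Synthetic data: three standardized features and a rating-like target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([0.8, 0.3, -0.5])
y = 6.5 + X @ true_coef + rng.normal(scale=0.1, size=200)

ols_coef, ols_intercept = fit_ridge(X, y, alpha=0.0)
ridge_coef, ridge_intercept = fit_ridge(X, y, alpha=10.0)

print("OLS coefficients:  ", np.round(ols_coef, 3))
print("Ridge coefficients:", np.round(ridge_coef, 3))
```

The ridge penalty shrinks the coefficient vector toward zero, which is why the two linear models can give slightly different results on the same data.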

Results

After preprocessing, the data consists of 1117 samples and 29 attributes. We split the data into training and testing sets using the conventional ratio of 80% for training and 20% for testing. As mentioned above, several machine learning models were implemented to predict the best-performing player in a football match from their statistical attributes: Linear Regression, Ridge Regression, Random Forest Regressor, XGBoost Regressor, and MLP Regressor. Performance was evaluated using standard regression metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), and the R-squared (R²) score.

Among the models, Linear Regression and Ridge Regression produced the most reliable and interpretable results. Both consistently achieved lower MAE and MSE values than the more complex models, while maintaining competitive R² scores. In contrast, although models like Random Forest and XGBoost are able to capture more complex relationships in the data, they did not outperform the simpler linear models in this specific use case. The MLP Regressor, which relies on neural networks, performed poorly, most likely because of the small size of the dataset used in this study.
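The split and the three metrics can be written out in a few lines of plain Python. This is a sketch under assumptions: the shuffling seed, the helper names, and the example ratings are invented here, and the paper does not say how its split was drawn.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# An 80/20 split of 1117 row indices.
train, test = train_test_split(list(range(1117)))
print(len(train), len(test))

# Made-up ratings: a reasonable prediction and a poor one.
y_true = [6.0, 7.0, 8.0, 7.0]
y_good = [6.5, 7.0, 7.5, 7.0]
y_bad = [8.0, 7.0, 6.0, 7.0]

print(mae(y_true, y_good), mse(y_true, y_good), r2(y_true, y_good))
print(r2(y_true, y_bad))
```

R² becomes negative whenever a model predicts worse than always guessing the mean rating, which is how a negative score such as the one reported for the MLP Regressor on the defenders' data can arise.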

From the results we can conclude that the models perform best when only the forwards' data is considered. This can be explained by the fact that the attributes most correlated with the target are attacking features, which are characteristic of forwards. The next most correlated attributes mostly belong to midfielders, which is consistent with the order of the results. Recall that we took the target value from the WhoScored website and the other attributes from the FBref website. Moreover, the exact rating formulas of most player performance assessment organizations are not publicly available, which makes them black-box models.

Table 5.

Results of regression algorithms

| Full data | R² | MSE | MAE |
|---|---|---|---|
| Linear regression | 0.6840497 | 0.14576138 | 0.27145739 |
| Ridge regression | 0.6804758 | 0.1474101 | 0.2738157 |
| Random forest | 0.5721101 | 0.1974039 | 0.3168496 |
| XGBoost | 0.5533109 | 0.2060768 | 0.3271210 |
| MLP Regressor | 0.5339859 | 0.2149922 | 0.3564033 |

| Forwards' data | R² | MSE | MAE |
|---|---|---|---|
| Linear regression | 0.8703524 | 0.11153840 | 0.24683704 |
| Ridge regression | 0.87056729 | 0.1113536 | 0.24720806 |
| Random forest | 0.7250496 | 0.2365453 | 0.3244414 |
| XGBoost | 0.7051395 | 0.2536744 | 0.33842954 |
| MLP Regressor | 0.5426326 | 0.393482 | 0.5093210 |

| Midfielders' data | R² | MSE | MAE |
|---|---|---|---|
| Linear regression | 0.71179388 | 0.15081783 | 0.27662865 |
| Ridge regression | 0.68301521 | 0.16587766 | 0.291955638 |
| Random forest | 0.61827927 | 0.19975388 | 0.325752054 |
| XGBoost | 0.59979034 | 0.20942911 | 0.32208351 |
| MLP Regressor | 0.504951113 | 0.25905834 | 0.39163139 |

| Defenders' data | R² | MSE | MAE |
|---|---|---|---|
| Linear regression | 0.4105287724 | 0.213295061 | 0.332002004 |
| Ridge regression | 0.424518235 | 0.208233095 | 0.32739487 |
| Random forest | 0.3619636293 | 0.230867938 | 0.359649999 |
| XGBoost | 0.161367690 | 0.303451842 | 0.422016183 |
| MLP Regressor | -0.090933120 | 0.394744707 | 0.454965997 |

 

Conclusion and future work

The results of this study suggest that straightforward, linear models like Linear Regression and Ridge Regression can be effective for predicting top-performing football players, especially when using structured match data. Despite the increasing popularity of advanced machine learning techniques, simpler models still offer strong predictive power, particularly when paired with well-selected features and clean data. The findings highlight the importance of balancing model complexity with data quality and problem scope. In this case, regularized linear models were not only easier to interpret but also delivered competitive accuracy compared to more sophisticated methods. This supports their use as a reliable baseline for player performance prediction tasks in football analytics.

Future work could explore incorporating more contextual match information (such as match difficulty, opposition strength, or tactical roles) and experimenting with ranking-based approaches or ensemble methods to further refine the model’s ability to identify the best player in each match. There is also room to refine the input features. Future work could explore feature engineering techniques such as interaction terms, rolling averages (form), and more detailed spatial data (e.g., heatmaps or zones of influence), if available.
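As a starting point for such match-level evaluation, the step from predicted ratings to a best player per match can be sketched as picking the highest-rated player in each match and checking how often that pick agrees with the actual ratings. The rows, the "Predicted" column, and the helper `best_player_per_match` are hypothetical and only illustrate the idea; this evaluation is not reported in the paper.

```python
def best_player_per_match(rows, key):
    """Return {match: player} for the player with the highest row[key]."""
    best = {}
    for row in rows:
        match = row["Match"]
        if match not in best or row[key] > best[match][key]:
            best[match] = row
    return {match: row["Player"] for match, row in best.items()}


# Made-up actual and predicted ratings for three matches.
rows = [
    {"Match": "Club X - Club Y", "Player": "Player A", "Rating": 8.1, "Predicted": 7.9},
    {"Match": "Club X - Club Y", "Player": "Player B", "Rating": 6.9, "Predicted": 7.0},
    {"Match": "Club W - Club Z", "Player": "Player C", "Rating": 7.4, "Predicted": 7.1},
    {"Match": "Club W - Club Z", "Player": "Player D", "Rating": 7.8, "Predicted": 7.3},
    {"Match": "Club U - Club V", "Player": "Player E", "Rating": 7.0, "Predicted": 7.2},
    {"Match": "Club U - Club V", "Player": "Player F", "Rating": 7.5, "Predicted": 7.1},
]

actual = best_player_per_match(rows, "Rating")
predicted = best_player_per_match(rows, "Predicted")

# Share of matches where the predicted best player is the actual one.
hits = sum(actual[m] == predicted[m] for m in actual)
top1_accuracy = hits / len(actual)

print(actual)
print(predicted)
print(f"top-1 accuracy: {top1_accuracy:.2f}")
```

A regression model with low MAE can still pick the wrong player when the top ratings in a match are close together, which is one reason a ranking-oriented measure like this is worth reporting alongside MAE, MSE, and R².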

From a technical perspective, hyperparameter tuning, cross-validation strategies, and ensemble learning (combining predictions from multiple models) are areas that could yield performance gains. Additionally, if more labeled data becomes available, especially across a wide variety of leagues and match types, more complex models like neural networks could be revisited with better results.

 

References:

  1. Memmert, D., Raabe, D., Schwab, S., & Rein, R. (2017). Data analytics in football: Positional data collection, modelling and analysis. European Journal of Sport Science.
  2. Bialkowski, A., Lucey, P., Carr, P., Yue, Y., & Matthews, I. (2016). Large-scale analysis of soccer matches using spatiotemporal tracking data. 2016 IEEE International Conference on Data Mining (ICDM).
  3. Decroos, T., Bransen, L., Van Haaren, J., & Davis, J. (2019). Actions Speak Louder Than Goals: Valuing Player Actions in Soccer. KDD '19.
  4. Schulte, O., Zhao, Z., & Gholami, S. (2017). Representing and reasoning about game play in soccer. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.
  5. Bunker, R., & Thabtah, F. (2019). A machine learning framework for sport result prediction. Applied Computing and Informatics.
  6. Saikia, H., & Bhattacharyya, D. (2021). Predicting the best player in a football match using supervised learning techniques. International Journal of Computer Applications, 183(12), 17–23.
  7. Leung, K., et al. (2020). Predicting Soccer Player Ratings with Supervised Learning Techniques. International Journal of Computer Science and Information Security.
  8. Bransen, L., & Van Haaren, J. (2018). Measuring soccer players’ on-the-ball contributions from passes during games. In Proceedings of the KDD Workshop on Sports Analytics.
  9. Liu, H., Hopkins, W., Gómez, M. A., & Molinuevo, S. J. (2020). Inter-operator reliability of live football match statistics from OPTA Sportsdata. International Journal of Performance Analysis in Sport.
  10. Pappalardo, L., Cintia, P., Ferragina, P., Massucco, E., Pedreschi, D., & Giannotti, F. (2019). A public data set of spatio-temporal match events in soccer competitions. Scientific Data, 6(1), 236.

Information about the authors

Student, Department of Information Technologies, Kazakh-British Technical University, Kazakhstan, Almaty

Professor of Physical and Mathematical Sciences, Kazakh National University named after Al-Farabi, Kazakhstan, Almaty
