Classifying profitable soccer players


In this project I set out to build a classification model that predicts which soccer players are likely to show significant market value growth in the short to mid-term future, and to identify the key characteristics those players share. I relied on data from a European soccer player market valuation database and website used by soccer clubs around the world.

Data Collection and Cleaning

To collect the relevant data, I developed a scraping program using Beautiful Soup in Python. After scraping all of the necessary data, I used Python to remove any null values and standardize all data types. With the data collected and cleaned, I was ready to perform preliminary data exploration to find any key trends or themes in the dataset.
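The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the field names (`name`, `age`, `market_value`) and the `€12.50m` / `€800k` value format are assumptions made for the example, and the rows are made up.

```python
# Hypothetical raw rows as a scraper might return them: every value is a
# string or None. Field names and the value format are assumptions.
raw_rows = [
    {"name": "Player A", "age": "18", "market_value": "€12.50m"},
    {"name": "Player B", "age": None, "market_value": "€3.00m"},
    {"name": "Player C", "age": "27", "market_value": "€800k"},
    {"name": "Player D", "age": "31", "market_value": ""},
]


def parse_value(text):
    """Convert a string such as '€12.50m' or '€800k' to euros as a float."""
    text = text.strip().lstrip("€").lower()
    if text.endswith("m"):
        return float(text[:-1]) * 1_000_000
    if text.endswith("k"):
        return float(text[:-1]) * 1_000
    return float(text)


def clean(rows):
    """Drop rows with missing fields and cast the rest to numeric types."""
    cleaned = []
    for row in rows:
        # Treat None and empty strings as null values.
        if any(value is None or value == "" for value in row.values()):
            continue
        cleaned.append({
            "name": row["name"],
            "age": int(row["age"]),
            "market_value": parse_value(row["market_value"]),
        })
    return cleaned


players = clean(raw_rows)
```

Here `players` keeps only the two complete rows, with `age` stored as an integer and `market_value` as a float in euros.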

Exploratory Data Analysis

Size = Change in market value, X = age, Y = Market Value

From an initial review of the dataset, I found that there was most likely a strong relationship between a player's age and expected market value growth, with players around 18 years of age experiencing the most significant growth in market value on a nominal basis. In addition to age, prior market value appears to have a significant impact on expected market value growth. The more highly a player is valued, the more likely their market value is to increase substantially, as long as they are under 28 years of age.
Side by side plots with linear regression

I also noticed that there may be a significant relationship between age, position, and change in market value: the data exploration suggests that market value peaks at different points in a player's career depending on their position. For instance, the figures below show that central midfielders tend to experience a rapid increase in market value around age 18, while center backs experience a rapid increase in their late 20s. Although a position feature would most likely strengthen the findings of the classification model, I decided to omit it because of the numerical imbalance between positions and the small size of the dataset.

Change in market value by age for central midfielders

Change in market value by age for center-backs

Generating and Selecting a Classification Model

With a good idea of how the variables might be correlated, I was ready to get started on the classification model. In my model, I decided to use an increase of 10 million euros since January 1, 2021 as the cutoff for a player being deemed "profitable". After splitting my data into training and testing sets, I tested a few models to see which one would yield the most accurate predictions.
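As a rough sketch of the labeling and splitting step, assuming each record holds an age, a prior market value, and a current market value: the records below are made up, the 25% test fraction is an assumption, and treating an increase of exactly 10 million euros as profitable is also an assumption, since the post does not say how the boundary case is handled.

```python
import random

PROFIT_THRESHOLD = 10_000_000  # euros, the cutoff stated in the post


def label_profitable(prior_value, current_value):
    """Return 1 if market value rose by at least the threshold, else 0."""
    # Using >= for the boundary case is an assumption.
    return 1 if current_value - prior_value >= PROFIT_THRESHOLD else 0


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (age, prior value, current value) records, for illustration only.
records = [
    (18, 5_000_000, 20_000_000),
    (22, 30_000_000, 45_000_000),
    (29, 15_000_000, 12_000_000),
    (25, 8_000_000, 10_000_000),
]

# Keep the two model features (age, prior value) plus the class label.
labeled = [(age, prior, label_profitable(prior, current))
           for age, prior, current in records]
train, test = train_test_split(labeled)
```

Fixing the seed keeps the split reproducible, which makes it easier to compare several models on the same held-out players.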

For this dataset, I decided to dig deeper into the k-nearest neighbors (KNN) model and settled on K = 10.

KNN model with K = 10, X = age, Y = prior market value. Red signifying profitable players.
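To make the idea concrete, here is a minimal pure-Python sketch of a KNN classifier over the two features used here, age and prior market value. It is not the code behind the figure. The toy data is made up, and the min-max scaling step is an assumption: the post does not say whether the features were scaled, but without it, distances would be dominated by market value, which is measured in millions.

```python
import math
from collections import Counter


def min_max_scale(points):
    """Scale each feature to [0, 1]; return the scaled points and the scaler."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    # Fall back to 1.0 so a constant column does not cause division by zero.
    spans = [(max(c) - min(c)) or 1.0 for c in columns]

    def scale(point):
        return tuple((x - lo) / span for x, lo, span in zip(point, lows, spans))

    return [scale(p) for p in points], scale


def knn_predict(train_points, train_labels, query, k=10):
    """Predict the majority label among the k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    # On a tied vote, the label seen first among the nearest neighbors wins.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: (age, prior market value in euros) and labels,
# where 1 means "profitable".
points = [(18, 20e6), (19, 25e6), (20, 30e6), (30, 5e6), (31, 4e6), (32, 3e6)]
labels = [1, 1, 1, 0, 0, 0]

scaled, scale = min_max_scale(points)
# k=3 only because the toy set is tiny; the post settled on K = 10.
prediction = knn_predict(scaled, labels, scale((19, 22e6)), k=3)
```

The query player is scaled with the same minimums and ranges as the training data, so distances are measured in the same units for both.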


From a quick data exploration and validation with the KNN model, we can see that there may be a relationship between age, prior market value, and short/mid-term market value returns. As the KNN model above shows, younger players with higher initial market values are more likely to be categorized as "profitable", with a market value increase of over 10 million euros.

As this is just a preliminary analysis, I plan on improving the model by expanding my dataset and increasing the number of factors. To expand my dataset, I would need to find a new source of data, as the data source I used does not include players with significant market value decreases. Including them would most likely make the model more accurate, as it would take into consideration players who are overvalued and may experience a market value decrease.

In addition to expanding the number of observations, there are clearly other factors that may impact short/mid-term market value changes for soccer players. In the near future I plan on incorporating data from KPMG's Football Benchmark to include game statistics like expected goals/assists per 90.

