How To Win The English Premier League: A Final Tutorial¶
Jaehong Kim, Abdullah Alkhaloufalabdalhmid, Andrew Parker
8/20/23
Note: All Images used are Copyright Free Stock Images
Github Repository¶
The Github repositories can be found here:
Jaehong Kim - https://github.com/J-Gim/J-Gim.github.io
Andrew Parker - https://github.com/aparkro/aparkro.github.io/blob/main/FinalTutorial.ipynb
Table of Contents:¶
Links to each step of the data analysis process:
Introduction¶
The English Premier League (EPL), one of the largest leagues in the world, draws millions of fans each year. These fans come to see 20 qualifying clubs compete against each other in a bid to win the Premier League Trophy and the millions of dollars in reward money. However, behind the flashing lights and breathtaking penalties, data analytics plays an important role.
Data analytics gives Premier League clubs strategic insights into their own operations. By understanding the strengths and weaknesses in a club's operations, one can target these characteristics and make the necessary changes to build a stronger club. When millions of dollars are on the line, it is easy to see why data science is so crucial!
On-pitch analytics includes things like optimizing team performance, monitoring player health, and tactical player deployment. These changes can come about by measuring data such as a team's goal-scoring rate, shots on target, players' heart rates, and statistics related to a player's aerobic endurance.
Off the pitch, data analysis helps clubs manage their financial affairs. Sports teams are ultimately businesses at heart, and clubs in the EPL are no different! Statistical data allows clubs to make important financial and managerial decisions such as player transfers, merchandise marketing, and stadium location/licensing.
For broadcasters and advertisers, understanding fan engagement and other viewership trends can allow for more efficient content strategies and marketing campaigns.
These decisions all play a role in ensuring that a club gets maximum value from its investments and remains profitable. Going underwater financially could mean not only relegation to a lower league but also the complete liquidation of the club! To fans and owners alike this is an absolute nightmare, yet without sufficient statistical forecasting it could easily become a reality.
Lastly, it is important to note that the Premier League operates by a point system. The winner of the Premier League for each season is decided by the team that accumulates the most points by the end of the season. Points are awarded to a team as follows:
==> 3 points for a win
==> 1 point for a draw
==> 0 points for a loss
Each team plays 38 matches per season, and the 3 teams with the fewest points (the bottom 3 of the table) are relegated to the Championship (and replaced by 3 promoted Championship teams).
Thus, we can see that while remaining in the Premier League can be lucrative, it is not easy. One must keep a club financially stable and efficient enough to stay above the bottom 3 rankings of the EPL. A club must ultimately understand its strengths and weaknesses in order to improve. To achieve this, proper data analytics (moving through the stages of data collection, processing, hypothesis testing, and interpretation) is crucial.
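The scoring rules above can be sketched as a small helper function; the example record (19 wins, 11 draws, 8 losses) matches Arsenal's 2010/11 season shown later in this tutorial and yields 68 points.

```python
# Points formula used by the Premier League: 3 per win, 1 per draw, 0 per loss
def season_points(wins, draws, losses):
    """Return the total points for a full 38-match season record."""
    assert wins + draws + losses == 38, "a full EPL season is 38 matches"
    return 3 * wins + 1 * draws + 0 * losses

print(season_points(19, 11, 8))  # Arsenal 2010/11: 19W/11D/8L -> 68 points
```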
Further information about the Premier League can be found below:
https://www.premierleague.com/stats ==> The Official Premier League Website with Statistical Data and other information
https://fbref.com/en/comps/9/Premier-League-Stats ==> fbref which holds Premier League Statistics focusing on player stats
https://footystats.org/england/premier-league ==> FootyStats provides prepackaged csv data for different seasons with a wide array of data (provided you have a subscription 😀).
Purpose Of This Tutorial¶
For our purposes, we are interested in on-pitch data. More specifically,
What on-pitch factors appear to have an impact on a team's chance of winning a Premier League season (i.e. accumulating the most points in a given season)?
On-Pitch Factors that will be considered in this tutorial include:
==> Goals Scored
==> Goals Conceded
==> Goal Difference (Scored - Conceded)
==> Total shots
==> Total shots on target
==> Percent Shots Scored (goals / total shots)
==> Fouls Committed
==> Yellow Cards
==> Red Cards
==> Corners
Off-Pitch factors that were not looked at but could be used for a more comprehensive analysis include:
==> Annual Fan Attendance
==> Team Value
==> Stadium Revenue
This tutorial will follow the typical 'Data Science Lifecycle' as listed in the table of contents:
Glossary¶
This glossary will define all column labels in case of ambiguity/confusion:
Team --> Name of club for a given row
Season_End_Year --> The ending year of a season (i.e. for the 2010-2011 season 'Season_End_Year' would be 2011)
Rk --> The Rank/position of the team in the league based on its accumulated points
MP --> The number of matches the team has played during the season.
W --> The number of games the team has won.
D --> The number of games that ended in a tie or draw.
L --> The number of games the team has lost.
GF --> The total number of goals the team has scored.
GA --> The total number of goals the team has conceded or let in.
GD --> Calculated as GF - GA. Indicates the difference between the number of goals scored and the number of goals conceded.
Pts --> The accumulated points for a given team in a given season (a team gets 3 points for a win, 1 point for a draw, and no points for a loss.)
M#GoalsScored --> The number of goals a team has scored in Match # of a given Season
M#GoalsConceded --> The number of goals a team has conceded in Match # of a given Season
M#Shots --> The number of attempts on goal (regardless of whether it resulted in a goal or not) for a given team in Match # of a given Season
M#ShotsOnTarget --> The number of shots that entered the goal or would have entered the goal (if they had not been blocked) for a given team in Match # of a given Season
CornerKicks --> The number of corner kicks for a given team and season
Fouls --> The number of fouls in a season for a given team (includes red and yellow cards as well as minor offences)
YellowCard --> The number of yellow cards in a season for a given team (awarded for "unsportsmanlike" tackles)
RedCard --> The number of red cards in a season for a given team (awarded for "dangerous offences" and assault)
TotalShots --> Total shots taken in the direction of the goal (but not necessarily on track to enter it) for a given team and season
TotalShotsOnTarget --> Total shots taken that were on track to be a goal (includes shots that were either blocked or went in) for a given team and season
PercentShotsScored --> Percent of shots taken that resulted in a goal for a given team and season.
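As a quick illustration of the derived glossary columns GD and PercentShotsScored, here is the arithmetic for Arsenal's 2010/11 numbers that appear later in the tutorial (GF = 72, GA = 43, TotalShots = 595):

```python
# Derived glossary columns, computed from Arsenal's 2010/11 season totals
GF, GA, total_shots = 72, 43, 595
GD = GF - GA                             # goal difference: scored minus conceded
percent_shots_scored = GF / total_shots  # fraction of shots that became goals
print(GD, round(percent_shots_scored, 6))  # 29 0.121008
```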
Imports¶
#defines the import packages that will be used in this tutorial
from google.colab import drive # imports the drive module from Google Colab, allowing us to access and mount Google Drive files within the Colab environment
import matplotlib.pyplot as plt # imports the pyplot module from matplotlib in order to plot various charts
import numpy as np # Imports numpy, a library used for numerical operations and working with arrays.
import pandas as pd # Imports pandas, a powerful data manipulation and analysis library.
import statsmodels.formula.api # provides functions to create statistical models and conduct hypothesis tests using formulas.
import seaborn as sns # Imports seaborn, a statistical data visualization library based on matplotlib. It provides a higher-level interface and attractive visualizations.
import statsmodels.api as sm # Imports statsmodels package for statistical model functions
Data Collection¶
In the context of our study on the English Premier League, data collection involves sourcing and accumulating the desired match statistics and league standings from various seasons. This data will later be used to determine the on-pitch factors that are statistically significant in either increasing or decreasing a team's seasonal points in the EPL.
Specifically, datasets detailing goals, shots, fouls and related metrics for each matchday from the 2010/11 season up to the 2019/20 season have been gathered.
Primary Sources of Data are:
https://www.premierleague.com/
https://data.world/evangower/premier-league-standings-1992-2022 ==> A single csv file detailing Basic Data
https://www.kaggle.com/datasets/taranguyen/english-premier-league-data-for-10-seasons ==> 10 seasonal csv files and one larger csv file detailing Shots, Fouls, Cards, and Corners Data
Further Unused Information about the Premier League for Additional Off-Pitch Statistics:
https://www.kaggle.com/datasets/arbabqaisar/pl-tableattendancekit-sponsorship-data ==> Fan Attendance Data
https://www.transfermarkt.com/premier-league/marktwerteverein/wettbewerb/GB1/plus/?stichtag=2011-05-15 ==> Team Valuation Data
# customizes the display settings for the pandas library.
pd.set_option('display.width', 3500)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
# mounts Google Drive to the /content/drive directory of the Colab environment, allowing us to access content at that location.
drive.mount('/content/drive')
#sets the filepaths for the various tables that we will be using
# The main dataset is premier-league-tables.csv
# there are additional datasets for each season from 2010/11 to 2019/20, detailing goals and shots for each matchday.
# the last dataset adds fouls, red card, yellow card, corners
orig = '/content/drive/MyDrive/premier-league-tables.csv' # basic data
# goals and shots for each matchday.
epl1011 = '/content/drive/MyDrive/epldat10seasons/epl1011matchday-goals-shots.csv'
epl1112 = '/content/drive/MyDrive/epldat10seasons/epl1112matchday-goals-shots.csv'
epl1213 = '/content/drive/MyDrive/epldat10seasons/epl1213matchday-goals-shots.csv'
epl1314 = '/content/drive/MyDrive/epldat10seasons/epl1314matchday-goals-shots.csv'
epl1415 = '/content/drive/MyDrive/epldat10seasons/epl1415matchday-goals-shots.csv'
epl1516 = '/content/drive/MyDrive/epldat10seasons/epl1516matchday-goals-shots.csv'
epl1617 = '/content/drive/MyDrive/epldat10seasons/epl1617matchday-goals-shots.csv'
epl1718 = '/content/drive/MyDrive/epldat10seasons/epl1718matchday-goals-shots.csv'
epl1819 = '/content/drive/MyDrive/epldat10seasons/epl1819matchday-goals-shots.csv'
epl1920 = '/content/drive/MyDrive/epldat10seasons/epl1920matchday-goals-shots.csv'
epl10season = '/content/drive/MyDrive/epldat10seasons/epl-allseasons-matchstats.csv' # fouls, red card, yellow card, corners
# The main dataset of basic statistics (orig) is loaded into the pandas df DataFrame.
df = pd.read_csv(orig, sep=',')
# The datasets of match data for each season are loaded into separate DataFrames (df11, df12, ..., df20).
df11 = pd.read_csv(epl1011, sep=',')
df12 = pd.read_csv(epl1112, sep=',')
df13 = pd.read_csv(epl1213, sep=',')
df14 = pd.read_csv(epl1314, sep=',')
df15 = pd.read_csv(epl1415, sep=',')
df16 = pd.read_csv(epl1516, sep=',')
df17 = pd.read_csv(epl1617, sep=',')
df18 = pd.read_csv(epl1718, sep=',')
df19 = pd.read_csv(epl1819, sep=',')
df20 = pd.read_csv(epl1920, sep=',')
# creates a dataframe for the fouls,cards, and corners data
temp_df = pd.read_csv(epl10season, sep=',')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Explanation of Data Collection Code¶
The code above first sets the desired viewing settings for the code outputs.
Next, we use the mount command to access our desired csv files ('premier-league-tables.csv' and 'epl1011matchday-goals-shots.csv' etc.) which are stored in our Google drive at '/content/drive/MyDrive/'.
We then establish the filepaths for our desired csv files.
Next we use the pandas command read_csv() to form dataframes from our desired data from each csv file using a comma ',' as the delimiter.
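As a side note, the ten per-season file paths follow a regular naming pattern, so they could also be built programmatically instead of being typed out individually. The sketch below only constructs the paths (the `base` directory is the one used above); actually reading the files would still require the mounted Drive.

```python
# Build the ten matchday file paths from the season end years 2011-2020.
base = '/content/drive/MyDrive/epldat10seasons'
season_paths = {
    end_year: f'{base}/epl{end_year - 2001}{end_year - 2000}matchday-goals-shots.csv'
    for end_year in range(2011, 2021)
}
# e.g. season_paths[2011] ends with 'epl1011matchday-goals-shots.csv'
# The DataFrames could then be loaded in one pass (requires the mounted Drive):
# dfs = {year: pd.read_csv(path) for year, path in season_paths.items()}
```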
Data Processing¶
Now that our desired data has been collected, the process of data curation begins. Since the data spans multiple sets and consists of numerous metrics, it is important to make sure that there are no gaps/issues between datasets.
We can ensure dataset integrity by verifying completeness (i.e. checking for missing values) and accuracy (i.e. confirming that the datasets line up with one another and contain verifiable data). This involves checking for inconsistencies, missing values, and potential anomalies within each season's dataset.
By successfully curating our data, we can guarantee that our subsequent analyses are grounded in reliable and comprehensive data.
# The main dataset df is filtered to retain only the rows corresponding to seasons of 2010-2011 to 2019-2020.
years_to_keep = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]
filtered_df = df[df['Season_End_Year'].isin(years_to_keep)]
# The 'Notes' column is removed as it is not needed
filtered_df = filtered_df.drop(columns=['Notes'])
# View and double check the filtered data
print(filtered_df.head())
print("")
     Season_End_Year             Team  Rk  MP   W   D   L  GF  GA  GD  Pts
366             2011          Arsenal   4  38  19  11   8  72  43   29   68
367             2011      Aston Villa   9  38  12  12  14  48  59  -11   48
368             2011  Birmingham City  18  38   8  15  15  37  58  -21   39
369             2011        Blackburn  15  38  11  10  17  46  59  -13   43
370             2011        Blackpool  19  38  10   9  19  55  78  -23   39
Explanation of Code Above¶
We first filter the basic dataset "df", which holds general statistics such as wins, losses, and overall points for every team and season.
The dataset is filtered to the seasons ending 2011-2020 and stored as "filtered_df".
We then print the head of "filtered_df" to double-check its output.
# For each individual season dataset (df11, df12, etc.), we add a new column indicating the season end year.
df11['Season_End_Year'] = 2011
df12['Season_End_Year'] = 2012
df13['Season_End_Year'] = 2013
df14['Season_End_Year'] = 2014
df15['Season_End_Year'] = 2015
df16['Season_End_Year'] = 2016
df17['Season_End_Year'] = 2017
df18['Season_End_Year'] = 2018
df19['Season_End_Year'] = 2019
df20['Season_End_Year'] = 2020
# Combine all season datasets into a single DataFrame 'combined_df'
# This will make it easier to deal with seasonal match data regarding goals and shots for each team
frames = [df11, df12, df13, df14, df15, df16, df17, df18, df19, df20]
combined_df = pd.concat(frames, ignore_index=True)
# The filtered_df and combined_df have variations of the same column/value types which makes it hard to analyze them later in case of merging or grouping
# For example, combined_df refers to teams as 'Clubs' and filtered_df refers to teams as 'Team'
# In addition, combined_df refers to certain teams with different names such as using 'Birmingham' instead of 'Birmingham City' as used in filtered_df
# Thus, we must standardize the columns and values for each table by replacing alternative/short names with full/shared names to maintain consistency.
combined_df = combined_df.rename(columns={"Club": "Team"})
combined_df['Team'] = combined_df['Team'].replace('Birmingham', 'Birmingham City')
combined_df['Team'] = combined_df['Team'].replace('Man City', 'Manchester City')
combined_df['Team'] = combined_df['Team'].replace('Man Utd', 'Manchester Utd')
combined_df['Team'] = combined_df['Team'].replace('Newcastle', 'Newcastle Utd')
combined_df['Team'] = combined_df['Team'].replace('Stoke', 'Stoke City')
combined_df['Team'] = combined_df['Team'].replace('Wigan', 'Wigan Athletic')
combined_df['Team'] = combined_df['Team'].replace('Norwich', 'Norwich City')
combined_df['Team'] = combined_df['Team'].replace('Swansea', 'Swansea City')
combined_df['Team'] = combined_df['Team'].replace('Cardiff', 'Cardiff City')
combined_df['Team'] = combined_df['Team'].replace('Hull', 'Hull City')
combined_df['Team'] = combined_df['Team'].replace('Leicester', 'Leicester City')
# Rearrange the columns to move the 'Season_End_Year' column right after the 'Team' column.
# This will make it easier to find/read as knowing the corresponding Season of each row's data is important to our later analyses
cols = combined_df.columns.tolist()
cols = [cols[0]] + [cols[-1]] + cols[1:-1]
combined_df = combined_df[cols]
# View and double check the modified 'combined_df'
#print(combined_df.head())
# Print Selected columns of 'combined_df' to avoid glitching when outputting on github
selected_columns = combined_df.iloc[:, :10]
print(selected_columns.head())
              Team  Season_End_Year  M1GoalsScored  M2GoalsScored  M3GoalsScored  M4GoalsScored  M5GoalsScored  M6GoalsScored  M7GoalsScored  M8GoalsScored
0          Arsenal             2011              1              6              2              4              1              2              0              2
1      Aston Villa             2011              3              0              1              1              1              2              1              0
2  Birmingham City             2011              2              2              2              0              1              0              0              1
3        Blackburn             2011              1              1              1              1              1              2              0              0
4        Blackpool             2011              4              0              2              2              0              1              2              2
Explanation of Code Above¶
We first manually add the "Season_End_Year" for each of the 10 separate dataframes ("df11" to "df20") that detail goals and shots for each matchday.
Each of these 10 dataframes is then combined to form an all-encompassing dataframe called "combined_df" that holds goals and shots for each matchday for season end years of 2011-2020.
The column and value names of combined_df are then renamed to ensure that they match filtered_df.
The columns are then reordered for ease of access in combined_df
Finally, a slice of combined_df is printed in order to verify output. Keep in mind that the rest of the columns output is hidden in order to fit within the margins. The columns not shown include M#GoalsConceded, M#Shots, and M#ShotsOnTarget.
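As a side note, the long chain of `.replace` calls above can be collapsed into a single dictionary-based replacement, since `Series.replace` accepts a mapping. Here is a small sketch with a toy DataFrame (the `name_map` entries are the same substitutions used above):

```python
import pandas as pd

# One mapping instead of eleven separate .replace calls
name_map = {
    'Birmingham': 'Birmingham City', 'Man City': 'Manchester City',
    'Man Utd': 'Manchester Utd', 'Newcastle': 'Newcastle Utd',
    'Stoke': 'Stoke City', 'Wigan': 'Wigan Athletic',
    'Norwich': 'Norwich City', 'Swansea': 'Swansea City',
    'Cardiff': 'Cardiff City', 'Hull': 'Hull City',
    'Leicester': 'Leicester City',
}
demo = pd.DataFrame({'Team': ['Man City', 'Arsenal', 'Hull']})
demo['Team'] = demo['Team'].replace(name_map)
print(demo['Team'].tolist())  # ['Manchester City', 'Arsenal', 'Hull City']
```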
# create the last temp_df by calculating the total corner kicks, fouls, yellow cards, and red cards for each team and season group
drop_columns = list(range(1, 3)) + list(range(5, 11)) + list(range(15, 19))
temp_df.drop(temp_df.columns[drop_columns], inplace=True, axis=1)
temp_rows = []
current_season_end_year = None
# Iterate through each row in the df DataFrame
for index, row in temp_df.iterrows():
# Extract season end year
season_end_year = int(row['Season'][-2:])
# Set values to add into the array to create data frame
home_team = row['HomeTeam']
away_team = row['AwayTeam']
home_corner = row['HomeCorners']
away_corner = row['AwayCorners']
home_fouls = row['HomeFouls']
away_fouls = row['AwayFouls']
home_yello = row['HomeYellowCards']
away_yello = row['AwayYellowCards']
home_red = row['HomeRedCards']
away_red = row['AwayRedCards']
# Add values to array to create dataframe
temp_rows.append({'Season_End_Year': season_end_year, 'Team': home_team, 'CornerKicks': home_corner, 'Fouls': home_fouls, 'YellowCard': home_yello, 'RedCard': home_red})
temp_rows.append({'Season_End_Year': season_end_year, 'Team': away_team, 'CornerKicks': away_corner, 'Fouls': away_fouls, 'YellowCard': away_yello, 'RedCard': away_red})
# Build the dataframe from the collected rows
blank_df = pd.DataFrame(temp_rows)
# Group the values by season and team and sum the values
stats_df = blank_df.groupby(['Season_End_Year', 'Team'], as_index=False).sum()
stats_df['Season_End_Year'] = '20' + stats_df['Season_End_Year'].astype(str)
stats_df['Season_End_Year'] = stats_df['Season_End_Year'].astype(int)
# Like combined_df, the raw match data uses short team names (e.g. 'Birmingham' instead of 'Birmingham City' as used in filtered_df)
# Thus, we standardize the team names in stats_df by replacing the short names with the full names used in filtered_df to maintain consistency.
stats_df['Team'] = stats_df['Team'].replace('Birmingham', 'Birmingham City')
stats_df['Team'] = stats_df['Team'].replace('Man City', 'Manchester City')
stats_df['Team'] = stats_df['Team'].replace('Man Utd', 'Manchester Utd')
stats_df['Team'] = stats_df['Team'].replace('Newcastle', 'Newcastle Utd')
stats_df['Team'] = stats_df['Team'].replace('Stoke', 'Stoke City')
stats_df['Team'] = stats_df['Team'].replace('Wigan', 'Wigan Athletic')
stats_df['Team'] = stats_df['Team'].replace('Norwich', 'Norwich City')
stats_df['Team'] = stats_df['Team'].replace('Swansea', 'Swansea City')
stats_df['Team'] = stats_df['Team'].replace('Cardiff', 'Cardiff City')
stats_df['Team'] = stats_df['Team'].replace('Hull', 'Hull City')
stats_df['Team'] = stats_df['Team'].replace('Leicester', 'Leicester City')
print(stats_df.head())
   Season_End_Year             Team  CornerKicks  Fouls  YellowCard  RedCard
0             2011          Arsenal          252    432          68        6
1             2011      Aston Villa          235    437          71        2
2             2011  Birmingham City          152    399          57        3
3             2011        Blackburn          175    455          65        4
4             2011        Blackpool          186    403          47        2
Explanation of Code Above¶
The last dataframe stats_df is created in order to store total corner kicks, fouls, yellow cards, and red cards data from temp_df.
The code first calculates total corner kicks, total fouls, total yellow cards, and total red cards by summing each variable for every match detailed in temp_df.
The calculated values for total corner kicks, total fouls, total yellow cards, and total red cards are stored in stats_df.
The team names in stats_df are then standardized to ensure that they match filtered_df.
We then print the "stats_df" dataset to double-check its output.
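As an aside, the `iterrows()` loop above can also be expressed without explicit iteration: split the match table into home and away halves, rename each half to a shared schema, and concatenate before grouping. A minimal sketch with a tiny made-up match table (the column names follow the epl-allseasons file used above; the numbers are illustrative):

```python
import pandas as pd

# Toy match table with the same home/away column layout as temp_df
matches = pd.DataFrame({
    'Season': ['2010-11', '2010-11'],
    'HomeTeam': ['Arsenal', 'Chelsea'],
    'AwayTeam': ['Chelsea', 'Arsenal'],
    'HomeCorners': [7, 4],
    'AwayCorners': [3, 5],
})
# Rename each half to a shared schema, then stack them into one long table
home = matches[['Season', 'HomeTeam', 'HomeCorners']].rename(
    columns={'HomeTeam': 'Team', 'HomeCorners': 'CornerKicks'})
away = matches[['Season', 'AwayTeam', 'AwayCorners']].rename(
    columns={'AwayTeam': 'Team', 'AwayCorners': 'CornerKicks'})
long_df = pd.concat([home, away], ignore_index=True)
# Sum per season and team, exactly like the groupby above
totals = long_df.groupby(['Season', 'Team'], as_index=False)['CornerKicks'].sum()
print(totals)  # Arsenal: 7 + 5 = 12, Chelsea: 4 + 3 = 7
```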
# Merge combined_df and filtered_df to show the combined statistics for all Teams from Seasons 2010-2011 to 2019-2020
complete_data = pd.merge(combined_df, filtered_df, on=['Season_End_Year', 'Team'])
#move the columns Rk, MP, W, D, L, GF, GA, GD, and Pts to start after 'Season_End_Year'
cols = complete_data.columns.tolist()
move_cols = ['Rk', 'MP', 'W', 'D', 'L', 'GF', 'GA', 'GD', 'Pts']
new_order = cols[:2] + move_cols + cols[2:-9]
complete_data = complete_data[new_order]
# Merge combined_df and stats_df to show the combined statistics for all Teams from Seasons 2010-2011 to 2019-2020
complete_data = pd.merge(complete_data, stats_df, on=['Season_End_Year', 'Team'])
#move the columns CornerKicks Fouls YellowCard RedCard to start after 'Season_End_Year'
cols = complete_data.columns.tolist()
move_cols = ['CornerKicks', 'Fouls', 'YellowCard', 'RedCard']
new_order = cols[:11] + move_cols + cols[11:-4]
complete_data = complete_data[new_order]
# calculate columns to add: Total shots, Total shots on target, Percent Shots Scored (goals / total shots)
complete_data['TotalShots'] = combined_df.loc[:, 'M1Shots':'M38Shots'].sum(axis=1)
complete_data['TotalShotsOnTarget'] = combined_df.loc[:, 'M1ShotsOnTarget':'M38ShotsOnTarget'].sum(axis=1)
complete_data['PercentShotsScored'] = complete_data['GF'] / complete_data['TotalShots']
# shift the columns again
cols = complete_data.columns.tolist()
move_cols = ['TotalShots', 'TotalShotsOnTarget', 'PercentShotsScored']
new_order = cols[:15] + move_cols + cols[15:-3]
complete_data = complete_data[new_order]
#drop unnecessary columns
complete_data.drop(complete_data.iloc[:, 18:], inplace= True, axis=1)
# View and double check the encompassing 'merged_data'
print(complete_data.head())
# Checks for missing data
print("Total Missing values found in complete_data:")
print(complete_data.isnull().sum().sum())
              Team  Season_End_Year  Rk  MP   W   D   L  GF  GA  GD  Pts  CornerKicks  Fouls  YellowCard  RedCard  TotalShots  TotalShotsOnTarget  PercentShotsScored
0          Arsenal             2011   4  38  19  11   8  72  43   29   68          252    432          68        6         595                 342            0.121008
1      Aston Villa             2011   9  38  12  12  14  48  59  -11   48          235    437          71        2         436                 241            0.110092
2  Birmingham City             2011  18  38   8  15  15  37  58  -21   39          152    399          57        3         327                 155            0.113150
3        Blackburn             2011  15  38  11  10  17  46  59  -13   43          175    455          65        4         385                 191            0.119481
4        Blackpool             2011  19  38  10   9  19  55  78  -23   39          186    403          47        2         446                 235            0.123318

Total Missing values found in complete_data:
0
Explanation of Data Processing Code¶
To Recap: Our previous three code blocks gave us the following dataframes:
- filtered_df which stores basic team info for seasons 2011-2020
- combined_df which stores goals and shots for each matchday in seasons 2011-2020
- stats_df which stores total corner kicks, total fouls, total yellow cards, and total red cards for seasons 2011-2020
We sum the per-match shot columns of combined_df to derive total shots and total shots on target for seasons 2011-2020.
We then merge combined_df and filtered_df into a single table called 'complete_data' on the columns 'Season_End_Year' and 'Team' (which we standardized earlier). We also rearrange some of the columns of 'complete_data' so they are easier to read.
We then merge 'complete_data' with stats_df on the columns 'Season_End_Year' and 'Team' and set it to 'complete_data'.
Thus, we now have 'complete_data' which holds data from filtered_df (storing basic team info), combined_df (storing total goals/shots), and stats_df (storing total corners/fouls/cards) for each Team and Season_End_Year combination from 2011-2020
Thankfully, none of these datasets had any missing data which we checked by running print(complete_data.isnull().sum().sum())
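One extra safeguard worth knowing: `pd.merge` can validate the merge itself. Passing `validate='one_to_one'` asserts that each (Season_End_Year, Team) key is unique on both sides, and `indicator=True` flags rows that failed to match. A small sketch with made-up numbers:

```python
import pandas as pd

# Two toy tables keyed on (Season_End_Year, Team); the values are illustrative
left = pd.DataFrame({'Season_End_Year': [2011, 2011],
                     'Team': ['Arsenal', 'Chelsea'],
                     'Pts': [68, 71]})
right = pd.DataFrame({'Season_End_Year': [2011, 2011],
                      'Team': ['Arsenal', 'Chelsea'],
                      'Fouls': [432, 400]})
merged = pd.merge(left, right, on=['Season_End_Year', 'Team'],
                  how='outer', indicator=True, validate='one_to_one')
# If the datasets line up, every row comes from both tables
assert (merged['_merge'] == 'both').all()
print(merged.drop(columns='_merge'))
```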
Exploratory Analysis & Data Visualization¶
In order to assess Premier League team performance, we perform exploratory analysis, plotting the relationships between provided and derived metrics such as goals, PercentShotsScored (goals / total shots), and fouls against 'Pts' and against each other.
By visualizing these variable relationships, we gain insight into the factors that play a role in helping a team gain or lose points.
Line plots can track the performance trajectory of teams across seasons, while scatter plots uncover relationships between performance metrics, such as the correlation between shot accuracy and points. Annotations and color coding allow for insight into specific variables such as team-wise breakdowns.
Together, these steps help form a cohesive narrative of our data. EDA ultimately transforms raw data into discernible patterns and stories that help us better understand the player and game dynamics of the Premier League.
### Points Across Years
# Keep only rows with valid seasons: drop missing data labeled -1 and restrict to season end years 2011-2020
filtered_df = filtered_df[(filtered_df['Season_End_Year'] != -1) & (filtered_df['Season_End_Year'].between(2011, 2020))]
# Total Points vs Season of each Team
plt.figure(figsize=(9, 6))
# Create a plot with lines for each team
for teamName, team_data in filtered_df.groupby('Team'):
plt.plot(team_data['Season_End_Year'], team_data['Pts'], label=teamName)
# Set plot labels and title
plt.xlabel('Season End Year')
plt.ylabel('Total Points Accumulated')
plt.title('Total Points vs. Season of each Team 2010/2011 to 2019/2020')
plt.legend(loc='upper left', bbox_to_anchor=(1, 1), ncol = 2)
plt.grid(True)
plt.tight_layout()
plt.show()
Analysis of 'Total Points vs Season of each Team'¶
The line plot of points accumulated per season for each Premier League team reveals several insights. Teams that consistently perform well maintain high point totals across multiple years, suggesting sustained dominance, while other teams show far more variability and often hover near the bottom of the table. Notably, the graph highlights an exceptional single-season performance in which Manchester City achieved one of the highest point totals on record, and Manchester City more broadly stands out as a model of consistency, remaining competitive at the top of the league year after year. This analysis showcases the importance of sustained performance for maintaining a strong position in the Premier League.
Explanation of 'Total Points vs Season of each Team' Plot Significance¶
The output of our code is a line chart where:
==> The x-axis represents the season end years from 2011 to 2020.
==> The y-axis represents the total points accumulated (where more points indicate more wins)
==> There are multiple lines where each line represents an English Premier League Team's performance in terms of points earned across the ten seasons.
Looking at the output, we can see large variance in different teams' point trajectories over time. Liverpool (top gray line), for example, shows some fluctuation but an overall upward trend, starting with 58 pts in the 2010-2011 season and ending with 99 pts in the 2019-2020 season. Other teams fluctuate heavily around a roughly flat trend, indicating little real growth.
These trends can help us hypothesize about the factors behind a specific team's win rate. For example, Liverpool's trendline shows peaks and troughs: it starts around average (2012), grows steadily to a spike (2014), drops sharply (2016), then climbs again to a higher peak (2020). We can look for potential causes of these swings, such as a change of manager during troughs or larger investment in the squad during peaks.
Lineplots of Season vs. Factors¶
### Lineplots of Season vs. Factors
# Columns to plot
columns_to_plot = ['GF', 'GA', 'GD', 'CornerKicks', 'Fouls', 'YellowCard', 'RedCard', 'TotalShots', 'TotalShotsOnTarget', 'PercentShotsScored']
# Determine number of rows required to plot all columns (3 columns in each row)
num_rows = (len(columns_to_plot) + 2) // 3
# Create a single figure and multiple subplots
fig, axes = plt.subplots(num_rows, 3, figsize=(18, 6 * num_rows))
# Flatten the axes object to iterate easily
axes = axes.ravel()
# Using enumeration loop to plot each column
for idx, (column, ax) in enumerate(zip(columns_to_plot, axes), 1):
# Plotting using seaborn for easy grouped line plots
sns.lineplot(data=complete_data, x="Season_End_Year", y=column, hue="Team", ax=ax)
ax.set_title(f"{idx}. {column} by Season")
ax.set_ylabel(column)
ax.set_xlabel("Season_End_Year")
ax.legend().set_visible(False) # Hide legend for clarity
# Hide any remaining unused subplots (if any)
for ax in axes[len(columns_to_plot):]:
ax.axis('off')
# Adjust layout
plt.tight_layout()
plt.grid(True)
plt.show()
Explanation of the 'Lineplots of Season vs. Factors' Above¶
The plots above visualize the season-level data in 'complete_data'. Rather than following one general trend, they capture the dynamic playing environment of the English Premier League: each team's goals scored, goals conceded, fouls, and shots vary from season to season.
However, there are some notable tendencies: teams that score more goals during a particular season also tend to take a higher number of shots in that season.
Interestingly, some teams that scored more goals during specific seasons also committed fewer fouls in those seasons than teams that scored less.
Lastly, while the total number of shots has remained in a similar range over the seasons, total shots on target dropped sharply between 2012 and 2016. Trends like this require more analysis to deduce their root cause.
Individual Subplot Explanations of the 'Lineplots of Season vs. Factors' Above¶
Goals Scored by Each Team Across Years: This line chart illustrates the goals scored by each team for each season over the span of a decade.
Notably, some teams have maintained consistent high levels of goal scoring across different seasons (top pink line). This could be due to a strong attacking strategy or a consistent pool of talented strikers that have enabled these teams to consistently find the back of the net.
Other teams appear to have a lot more variance in their goal-scoring ability.
Goals Conceded by Each Team Across Years: Likewise, while there is a lot of variance, certain teams appear to be more consistent in the number of goals they concede than others.
This could be due to the fact that bigger teams have shown a degree of defensive consistency. This might be attributed to their strong defensive tactics, solid backline, or goalkeeping prowess.
Smaller teams, on the other hand, might experience more variability due to changes in player composition and tactical approach.
Goal Difference (Scored - Conceded) Across Years: The fluctuating goal difference trend signifies the dynamic nature of the league. However, the consistency observed among top teams like Manchester City and Liverpool suggests attacking prowess and a well-structured defense.
Corner Kicks for Each Season: This chart highlights corner kicks for each team over multiple seasons.
Overall there is a lot of variation, and while some teams maintain a consistently higher number of corner kicks, it is hard to tell whether this reflects a specific characteristic or strategy of the team, or is simply due to chance.
Fouls Per Season for Each Team: The substantial fluctuation in fouls committed per season could suggest evolving playing styles and referee tendencies.
Looking at the multiple lines, there appears to be a common spike in fouls around 2015 and 2018. This could be the result of teams adopting more aggressive strategies, leading to a higher foul rate.
Yellow and Red Cards Across Seasons: While there is a lot of variation, the spike in yellow and red cards from 2014 to 2016 might be due to stricter refereeing or an increased prevalence of aggressive playstyles.
Total Shots by Each Team for Each Season: Again, there is a lot of variation, but the spikes in 2014 suggest a focus on shot-taking. This tendency appears to reverse, however, with shot totals concentrating at lower levels in the later seasons.
Total Shots on Target Across Years: The significant drop in shots on target observed in 2014, followed by a sustained decrease, hints at a shift in teams' approach to shooting accuracy. This could be attributed to changes in tactical emphasis, focusing on shot quality over quantity. The observed trend might indicate a more cautious approach to shooting to ensure better chances of scoring.
Percentage of Shots Scored Across Years: While there is a lot of fluctuation, there appears to be a significant and consistent concentration of the shot percentage around 0.10 to 0.12 from 2011-2020. This makes sense, as consistent shot taking is important in the league; it also hints that many shots are taken but comparatively few go in.
In conclusion, interesting trends and patterns in the Premier League over the analyzed period have been discovered. These trends offer insights into evolving tactical strategies, player behavior, and changes in the overall dynamics of the league. It's important to consider these observations in the context of broader football trends and potential external factors that might influence team performance and gameplay.
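The PercentShotsScored values discussed above are presumably derived as goals scored divided by total shots; the column name and the observed 0.10-0.12 range are consistent with that ratio. A minimal sketch of how such a column could be built (the DataFrame here is synthetic, not the tutorial's `complete_data`):

```python
import pandas as pd

# Synthetic stand-in for a few team-seasons (not the real EPL data)
df = pd.DataFrame({
    "Team": ["A", "B", "C"],
    "GF": [60, 45, 30],
    "TotalShots": [500, 450, 300],
})

# Conversion rate: goals scored per shot taken
df["PercentShotsScored"] = df["GF"] / df["TotalShots"]
print(df)
```

Values in the 0.10-0.12 band, as seen here, mean roughly one goal for every eight to ten shots taken.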
Lineplots of Season_End_Year vs. Mean Factors¶
### Lineplots of Season vs. Mean Factors
# Columns to plot
columns_to_plot = ['GF', 'GA', 'GD', 'CornerKicks', 'Fouls', 'YellowCard', 'RedCard', 'TotalShots', 'TotalShotsOnTarget', 'PercentShotsScored']
# Group data by Season_End_Year and calculate the mean of the numeric columns
grouped_data = complete_data.groupby('Season_End_Year').mean(numeric_only=True).reset_index()
# Determine number of rows required to plot all columns (3 columns in each row)
num_rows = (len(columns_to_plot) + 2) // 3
# Create a single figure and multiple subplots
fig, axes = plt.subplots(num_rows, 3, figsize=(18, 6 * num_rows))
# Flatten the axes object to iterate easily
axes = axes.ravel()
# Using enumeration loop to plot each column
for idx, (column, ax) in enumerate(zip(columns_to_plot, axes), 1):
    # Plotting using seaborn for easy grouped line plots
    sns.lineplot(data=grouped_data, x="Season_End_Year", y=column, ax=ax)
    ax.set_title(f"{idx}. Mean {column} by Season")
    ax.set_ylabel(column)
    ax.set_xlabel("Season_End_Year")
# Hide any remaining unused subplots (if any)
for ax in axes[len(columns_to_plot):]:
    ax.axis('off')
# Adjust layout
plt.tight_layout()
plt.show()
MEAN SEASON FACTORS VS YEARS ANALYSIS¶
Mean Goals Scored Over Time: The fluctuating pattern of mean goals scored, with a significant drop between 2014 and 2016, followed by a subsequent increase between 2016 and 2020, reflects the dynamic nature of goal-scoring trends. Changes in playing styles, tactics, and team compositions might have contributed to these variations.
Mean Goals Conceded Over Time: Mean goals conceded exhibits the same fluctuating pattern as mean goals scored. This is expected: every goal scored by one team is conceded by another, so the league-wide means move together.
Mean Goal Difference (Scored - Conceded) Over Time: Mean goal difference sits at exactly zero in every season, because the league-wide total of goals scored always equals the total of goals conceded.
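This zero-sum property can be checked directly: every goal scored by one club is conceded by another, so summing GD over all clubs in a season must give zero. A quick illustration on a made-up four-team season (not the real EPL data):

```python
import pandas as pd

# Toy season: goals for (GF) and against (GA) for four teams.
# League-wide, total GF must equal total GA, since each goal
# appears once as "for" and once as "against".
season = pd.DataFrame({
    "Team": ["A", "B", "C", "D"],
    "GF": [10, 8, 5, 3],
    "GA": [4, 6, 7, 9],
})
season["GD"] = season["GF"] - season["GA"]
print(season["GD"].sum())  # league-wide goal difference sums to 0
```

Since the sum is always zero, the mean GD across teams is zero in every season, which is why that subplot is a flat line at 0.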
Mean Corner Kicks Against Over Time: The gradual drop in mean corner kicks against over the years suggests a trend of fewer corner kicks occurring as time progresses. This could be due to tactical shifts and strategies focused on minimizing situations that lead to conceding corner kicks.
Mean Fouls Per Season Over Time: The fluctuating pattern with peaks and drops in mean fouls indicates varying levels of physicality and aggression in different seasons. The jump from 2012 to 2018 could suggest tactical adjustments or rule changes that influenced players' behaviors.
Mean Yellow and Red Cards Over Time: The resemblance of mean yellow cards to the mean fouls pattern highlights a potential correlation between fouls committed and disciplinary actions taken by referees. The gradual drop in mean red cards across all years could indicate improved player conduct.
Mean Total Shots Over Time: The decreasing trend in mean total shots suggests a strategic shift in teams' approach to shooting. The emphasis might be moving away from sheer volume and more toward quality, aiming for more accurate shots rather than a higher number of attempts.
Mean Total Shots on Target Over Time: The major drop in mean total shots on target during 2014, followed by a consistent trend, underscores a change in shooting accuracy. Teams seem to have adjusted their tactics, focusing on precision rather than simply attempting to place shots on goal.
Mean Percentage of Shots Scored Over Time: The significant drop between 2014 and 2016 in mean percentage of shots scored, followed by a recovery, might reflect changes in finishing efficiency during that period. The subsequent return to normal suggests a stabilization in teams' ability to convert shots into goals.
These trends could be influenced by factors such as tactical innovations, rule changes, player development, and coaching strategies. Keep in mind that while these trends can help draw conclusions, a comprehensive understanding might require additional contextual information and in-depth investigation.
Scatterplots Pts vs. Factors¶
### Scatterplots Pts vs. Factors
# List of columns to plot
columns_to_plot = ['GF', 'GA', 'GD', 'CornerKicks', 'Fouls', 'YellowCard', 'RedCard', 'TotalShots', 'TotalShotsOnTarget', 'PercentShotsScored']
# Determine the number of rows required to plot all columns (3 columns in each row)
num_rows = (len(columns_to_plot) + 2) // 3
# Create a single figure and multiple subplots
fig, axes = plt.subplots(num_rows, 3, figsize=(18, 6 * num_rows))
# Flatten the axes object to iterate easily
axes = axes.ravel()
# Iterate over each column in columns_to_plot to create scatter plots against 'Pts'
for idx, (column, ax) in enumerate(zip(columns_to_plot, axes), 1):
    # Plotting
    ax.scatter(complete_data[column], complete_data['Pts'], label=column, alpha=0.6)
    # Regression line computation using statsmodels
    X = sm.add_constant(complete_data[column])
    model = sm.OLS(complete_data['Pts'], X).fit()
    ax.plot(complete_data[column], model.predict(X), color='red')  # plot regression line
    ax.text(0.1, 0.9, f'y = {model.params[column]:.2f}x + {model.params["const"]:.2f}', transform=ax.transAxes, color="red")  # display equation
    ax.text(0.1, 0.8, f'p-value: {model.pvalues[column]:.5f}', transform=ax.transAxes, color="blue")  # display p-value
    ax.set_title(f"{idx}. {column} vs Points by Team and Season")
    ax.set_xlabel(f"{column}")
    ax.set_ylabel("Points")
    ax.legend()
    ax.grid(True)
# Hide any remaining unused subplots (if any)
for ax in axes[len(columns_to_plot):]:
    ax.axis('off')
plt.tight_layout()
plt.show()
Season Factors vs. Points Scatterplot Analysis¶
Goals Scored vs Points: The positive correlation observed between goals scored and points indicates that teams which score more goals tend to accumulate more points, reflecting a connection between offensive prowess and overall success. This strong association suggests that goal-scoring ability is a significant factor contributing to a team's performance in the league.
Goals Conceded vs Points: The negative correlation between goals conceded and points highlights that teams which concede fewer goals generally attain higher point totals. This underscores the importance of a strong defense in achieving positive outcomes in the league. Teams with solid defensive records are often better positioned to secure points and higher league standings.
Goal Difference vs Points: The positive slope or correlation between goal difference and points reinforces the idea that teams with a greater positive goal difference, indicating a strong attack and solid defense, tend to achieve higher point totals.
Corner Kicks vs Points: The correlation between corner kicks and points is less clear than that between goals scored and points. While more corner kicks do appear to contribute to more points, the connection is not as pronounced as with goals scored.
Fouls vs Points: The general trend of higher fouls being associated with lower points aligns with the notion that disciplined play and avoiding excessive fouls can contribute to better results. However, the presence of outliers might indicate that some teams achieve success despite committing more fouls, possibly due to other strengths or tactical considerations.
Yellow and Red Cards vs Points: Similar to fouls, the correlation between yellow and red cards and points also suggests that teams with better disciplinary records tend to accumulate more points. This might reflect a more composed and controlled approach to the game, contributing to a team's overall performance.
Total Shots vs Points: The direct correlation observed between total shots and points reinforces the idea that teams with a higher volume of shots tend to secure more points. This could indicate a more offensive-minded strategy, where creating a larger number of scoring opportunities is linked to achieving better results.
Total Shots on Target vs Points: The correlation between total shots on target and points appears to reaffirm the importance of shot accuracy. Teams that consistently place shots on target might be more likely to convert those chances into goals, resulting in higher point totals and better league standings.
Percentage of Shots Scored vs Points: The connection between shot efficiency, as represented by the percentage of shots scored, and points emphasizes the strategic value of converting chances into goals. Teams with a higher percentage of shots scored are more effective in capitalizing on their opportunities, leading to better point accumulation.
These trends can help in understanding the tactical priorities and strategies that contribute to achieving favorable outcomes over the analyzed seasons. Remember that while correlations are observed, they don't always imply causation, and multiple factors could be at play in determining a team's performance.
Here is a link to help understand the scatter plots and correlation: https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/introduction-to-scatterplots/a/scatterplots-and-correlation-review
Scatterplots of Mean Pts For Every Team vs. Mean Factors For Every Team from Seasons 2011-2020¶
### Scatterplots of Mean Pts For Every Team vs. Mean Factors For Every Team from Seasons 2011-2020
# Group by 'Team' and compute the means of the numeric columns
grouped_means = complete_data.groupby(['Team']).mean(numeric_only=True).reset_index()
# List of columns to plot
columns_to_plot = ['GF', 'GA', 'GD', 'CornerKicks', 'Fouls', 'YellowCard', 'RedCard', 'TotalShots', 'TotalShotsOnTarget', 'PercentShotsScored']
# Determine the number of rows required to plot all columns (3 columns in each row)
num_rows = (len(columns_to_plot) + 2) // 3
# Create a single figure and multiple subplots
fig, axes = plt.subplots(num_rows, 3, figsize=(18, 6 * num_rows))
# Flatten the axes object to iterate easily
axes = axes.ravel()
# Iterate over each column in columns_to_plot to create scatter plots against 'Pts'
for idx, (column, ax) in enumerate(zip(columns_to_plot, axes), 1):
    # Plotting
    ax.scatter(grouped_means[column], grouped_means['Pts'], label=column, alpha=0.6)
    # Label each point with its team name
    for i, team in enumerate(grouped_means['Team']):
        ax.annotate(team, (grouped_means[column][i], grouped_means['Pts'][i]), fontsize=8, alpha=0.6, ha='center')
    # Regression line computation using statsmodels
    X = sm.add_constant(grouped_means[column])
    model = sm.OLS(grouped_means['Pts'], X).fit()
    ax.plot(grouped_means[column], model.predict(X), color='red')  # plot regression line
    ax.text(0.1, 0.9, f'y = {model.params[column]:.2f}x + {model.params["const"]:.2f}', transform=ax.transAxes, color="red")  # display equation
    ax.text(0.1, 0.8, f'p-value: {model.pvalues[column]:.5f}', transform=ax.transAxes, color="blue")  # display p-value
    ax.set_title(f"{idx}. Mean {column} vs Mean Points by Team From 2011-2020")
    ax.set_xlabel(f"{column}")
    ax.set_ylabel("Points")
    ax.legend()
    ax.grid(True)
# Hide any remaining unused subplots (if any)
for ax in axes[len(columns_to_plot):]:
    ax.axis('off')
plt.tight_layout()
plt.show()
Analysis Of Mean Pts For Every Team vs. Mean Factors For Every Team from Seasons 2011-2020¶
Mean Goals Scored vs Mean Points: The correlation between mean goals scored and mean points reaffirms the previously observed trend that teams with higher goal-scoring abilities tend to accumulate more points. The positioning of Manchester City at the top right further emphasizes their consistent success and strong offensive performance over the 10-year span.
Mean Goals Conceded vs Mean Points: The negative correlation between mean goals conceded and mean points aligns with the understanding that teams with better defensive records tend to achieve higher point totals. Manchester City's presence at the top left end underscores their overall defensive stability and thus strong point accumulation.
Mean Goal Difference vs Mean Points: The direct correlation between mean goal difference and mean points confirms the trend that teams with better goal differences, reflecting a balanced offensive and defensive performance, tend to be more successful. This relationship supports the notion that a well-rounded team has a higher chance of achieving success.
Mean Corner Kicks vs Mean Points: The correlation between mean corner kicks and mean points suggests that teams generating more corner kicks tend to secure more points. Manchester City's strong performance in both categories again highlights their proactive approach to attacking set pieces, contributing to their success in point accumulation.
Mean Fouls vs Mean Points: The correlation line here is much more tempered and may not be as significant as for the previous factors. Still, the slightly negative trend of teams with fewer mean fouls achieving better mean points could suggest that disciplined play is linked to higher success.
However, Manchester City's outlier position could indicate that while their style of play may involve more fouls, their overall success remains unaffected due to other factors.
Mean Yellow Cards vs Mean Points: The slight negative slope observed between mean yellow cards and mean points appears to suggest that teams with fewer yellow cards tend to achieve higher points. However, this is a weak correlation, and the presence of outliers, including successful teams with more yellow cards, indicates that other factors might mitigate the impact of disciplinary actions.
Mean Red Cards vs Mean Points: Similar to yellow cards, there is a very slight trend of fewer mean red cards correlating with higher mean points. While a less aggressive approach might be beneficial, there are plenty of cases where successful teams have received more red cards.
Mean Total Shots vs Mean Points: The direct correlation between mean total shots and mean points reinforces the notion that teams frequently generating scoring opportunities tend to accumulate more points. This emphasizes the importance of attacking intent in achieving favorable results.
Mean Total Shots on Target vs Mean Points: The trend of higher mean total shots on target being associated with better mean points aligns with the idea that shot accuracy is crucial for converting opportunities into goals. Outliers like Bolton indicate that other factors can impact point accumulation.
Mean Percentage of Shots Scored vs Mean Points: The general correlation between mean percentage of shots scored and mean points underscores the tactical significance of shot efficiency. While there's a trend that better shot conversion leads to higher point totals, exceptions like Reading highlight the complexity of factors affecting team success.
These trends contribute to a better understanding of the strategic priorities that lead to strong point outcomes in the Premier League. Keep in mind that while these correlations are present, other contextual factors can also play a role in team performance. Nonetheless, these subplots hint at the specific factors that either increase or decrease a team's 'Pts' at the end of a season.
FOR MORE INFORMATION ABOUT PLAYING STYLES AND TECHNIQUES:¶
https://www.statsperform.com/resource/stats-playing-styles-introduction/
Model: Analysis, Hypothesis Testing, & ML¶
Although the plots above with regression lines show good linear relationships between the individual features and points, we wanted to determine whether the data is suitable for multivariate linear regression by looking at the correlations between our variables. The matrix below showed higher-than-expected correlations between some variables, but the results were understandable. For example, Fouls, YellowCard, and RedCard are correlated with each other but not with variables such as GF, GA, or TotalShots. Looking closely, Fouls and YellowCard have a strong correlation, which is predictable: the more fouls a team commits, the more likely it is to receive yellow cards. We can also observe that GF is positively correlated with CornerKicks, reflecting the increased chance of set-piece goals. Given the correlation matrix, many variables are correlated with one another; nevertheless, we test a multivariate linear regression model below, because variables such as GF, GD, and TotalShots have practical significance in determining a club's performance in the league.
Here is the link to help you interpret the heatmap of correlation matrix: https://www.statology.org/how-to-read-a-correlation-matrix/
# Compute the correlation matrix over the numeric columns
corr_matrix = complete_data.corr(numeric_only=True)
# Set the figure size
plt.figure(figsize=(15, 10))
# Draw the heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5, fmt=".2f")
# Rotate y-axis labels for better readability
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()
Explanation Of the Code Above¶
Looking at our correlation graph, we can see that:
==> Factors like Goals Scored (+0.91), GD (+0.97), Corner Kicks (+0.71), TotalShots (+0.78), TotalShotsOnTarget (0.54), and PercentShotsScored (+0.73) had positive correlations with Pts.
==> Factors like Goals Conceded (-0.84), Fouls (-0.26), and YellowCard (-0.22) had negative correlations with Pts.
This makes sense, as more scoring attempts lead to more goals and more wins, and teams are awarded 3 points per win (and 1 point per draw). In addition, higher levels of fouls and penalties can hand the opposing team chances to score, increasing a team's chance of losing the game (and thus earning fewer points).
That being said, Fouls and YellowCard had weaker correlations with Pts than preferred (closer to zero than -0.5).
One more interesting thing to note is that Red Cards had a very slight negative correlation that was close to negligible. Further analysis will be needed to determine why this is.
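Individual entries of the matrix above can be reproduced for any pair of columns with pandas' `Series.corr`, which computes the Pearson coefficient. A self-contained example on toy data (not the tutorial's `complete_data`), using perfectly linear columns so the signs are obvious:

```python
import pandas as pd

# Toy data: points rise with goals scored and fall with fouls
toy = pd.DataFrame({
    "Pts":   [30, 45, 60, 75, 90],
    "GF":    [35, 50, 65, 80, 95],   # perfectly linear with Pts
    "Fouls": [90, 80, 70, 60, 50],   # perfectly anti-linear with Pts
})

print(toy["Pts"].corr(toy["GF"]))     # 1.0
print(toy["Pts"].corr(toy["Fouls"]))  # -1.0
```

Real data never reaches these extremes; the heatmap's +0.91 for GF and -0.26 for Fouls fall between these bounds, which is what the color scale encodes.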
OLS Regression With Single Independent Variables¶
Overall, we can see from the output that certain factors are statistically significant in affecting a team's seasonal points. Factors like Goals Scored (GF), Goals Conceded (GA), CornerKicks, Fouls, YellowCard, TotalShots, TotalShotsOnTarget, and PercentShotsScored all had p-values below the default significance level of 0.05. This implies that they are statistically significant in their effect as independent variables on the dependent variable, Team Points (Pts).
In contrast, the RedCard factor had a p-value of 0.124, which is above the default significance level of 0.05. This implies that RedCard is NOT statistically significant in its effect as an independent variable on Team Points (Pts). This is a bit curious and could reflect a tradeoff between behaviors that benefited the team (but resulted in a red card) and the player being expelled from the game because of the red card.
The largest R-squared value among the individual factors was 0.830, for Goals Scored (GF). This implies that, of the factors we tested, Goals Scored explains the largest share of the variation in a team's total points in a given season.
Here is a link to help readers understand how to interpret OLS linear regression analysis (for both multivariate and single variables): https://www.geeksforgeeks.org/interpreting-the-results-of-linear-regression-using-ols-summary/.
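The nine single-variable fits below could also be generated in one loop that collects each factor's slope, p-value, and R-squared into a single table. A minimal sketch on synthetic data (the `demo` DataFrame and its two factors are stand-ins, not the tutorial's `complete_data`):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: points built mostly from goals, plus noise
rng = np.random.default_rng(0)
n = 50
demo = pd.DataFrame({
    "GF": rng.integers(30, 100, n),
    "GA": rng.integers(30, 90, n),
})
demo["Pts"] = 1.0 * demo["GF"] - 1.0 * demo["GA"] + rng.normal(0, 3, n)

# Fit one single-variable OLS model per factor and tabulate the results
rows = []
for factor in ["GF", "GA"]:
    fit = smf.ols(formula=f"Pts ~ {factor}", data=demo).fit()
    rows.append({
        "factor": factor,
        "slope": fit.params[factor],
        "p_value": fit.pvalues[factor],
        "r_squared": fit.rsquared,
    })
summary = pd.DataFrame(rows)
print(summary)
```

This yields the same numbers as the repeated `.summary()` cells below, but in a form that is easy to sort by R-squared or p-value.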
# Evaluate linear regression model using statsmodels OLS
stats = statsmodels.formula.api.ols(formula="Pts ~ GF", data=complete_data).fit()
# Print the summary
stats.summary()
Dep. Variable: | Pts | R-squared: | 0.830 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.829 |
Method: | Least Squares | F-statistic: | 965.8 |
Date: | Mon, 21 Aug 2023 | Prob (F-statistic): | 4.38e-78 |
Time: | 03:32:08 | Log-Likelihood: | -677.90 |
No. Observations: | 200 | AIC: | 1360. |
Df Residuals: | 198 | BIC: | 1366. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 2.4506 | 1.684 | 1.455 | 0.147 | -0.870 | 5.771 |
GF | 0.9560 | 0.031 | 31.077 | 0.000 | 0.895 | 1.017 |
Omnibus: | 0.203 | Durbin-Watson: | 2.018 |
---|---|---|---|
Prob(Omnibus): | 0.903 | Jarque-Bera (JB): | 0.355 |
Skew: | -0.026 | Prob(JB): | 0.837 |
Kurtosis: | 2.800 | Cond. No. | 181. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Evaluate linear regression model using statsmodels OLS
stats1 = statsmodels.formula.api.ols(formula="Pts ~ GA", data=complete_data).fit()
# Print the summary
stats1.summary()
Dep. Variable: | Pts | R-squared: | 0.707 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.705 |
Method: | Least Squares | F-statistic: | 476.8 |
Date: | Mon, 21 Aug 2023 | Prob (F-statistic): | 1.28e-54 |
Time: | 03:32:12 | Log-Likelihood: | -732.40 |
No. Observations: | 200 | AIC: | 1469. |
Df Residuals: | 198 | BIC: | 1475. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 111.9906 | 2.813 | 39.806 | 0.000 | 106.442 | 117.539 |
GA | -1.1439 | 0.052 | -21.837 | 0.000 | -1.247 | -1.041 |
Omnibus: | 2.202 | Durbin-Watson: | 1.744 |
---|---|---|---|
Prob(Omnibus): | 0.333 | Jarque-Bera (JB): | 2.197 |
Skew: | 0.251 | Prob(JB): | 0.333 |
Kurtosis: | 2.896 | Cond. No. | 226. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Evaluate linear regression model using statsmodels OLS
stats2 = statsmodels.formula.api.ols(formula="Pts ~ CornerKicks", data=complete_data).fit()
# Print the summary
stats2.summary()
Dep. Variable: | Pts | R-squared: | 0.510 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.508 |
Method: | Least Squares | F-statistic: | 206.2 |
Date: | Mon, 21 Aug 2023 | Prob (F-statistic): | 1.63e-32 |
Time: | 03:32:14 | Log-Likelihood: | -783.66 |
No. Observations: | 200 | AIC: | 1571. |
Df Residuals: | 198 | BIC: | 1578. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -18.5494 | 5.010 | -3.702 | 0.000 | -28.430 | -8.669 |
CornerKicks | 0.3468 | 0.024 | 14.360 | 0.000 | 0.299 | 0.394 |
Omnibus: | 2.552 | Durbin-Watson: | 1.954 |
---|---|---|---|
Prob(Omnibus): | 0.279 | Jarque-Bera (JB): | 2.151 |
Skew: | 0.222 | Prob(JB): | 0.341 |
Kurtosis: | 3.246 | Cond. No. | 1.20e+03 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.2e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
# Evaluate linear regression model using statsmodels OLS
stats3 = statsmodels.formula.api.ols(formula="Pts ~ Fouls", data=complete_data).fit()
# Print the summary
stats3.summary()
Dep. Variable: | Pts | R-squared: | 0.070 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.066 |
Method: | Least Squares | F-statistic: | 14.99 |
Date: | Mon, 21 Aug 2023 | Prob (F-statistic): | 0.000147 |
Time: | 03:32:16 | Log-Likelihood: | -847.72 |
No. Observations: | 200 | AIC: | 1699. |
Df Residuals: | 198 | BIC: | 1706. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 99.7441 | 12.307 | 8.105 | 0.000 | 75.475 | 124.013 |
Fouls | -0.1162 | 0.030 | -3.872 | 0.000 | -0.175 | -0.057 |
Omnibus: | 7.850 | Durbin-Watson: | 1.621 |
---|---|---|---|
Prob(Omnibus): | 0.020 | Jarque-Bera (JB): | 7.224 |
Skew: | 0.403 | Prob(JB): | 0.0270 |
Kurtosis: | 2.533 | Cond. No. | 4.23e+03 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.23e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
# Evaluate linear regression model using statsmodels OLS
stats4 = statsmodels.formula.api.ols(formula="Pts ~ YellowCard", data=complete_data).fit()
# Print the summary
stats4.summary()
Dep. Variable: | Pts | R-squared: | 0.046 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.041 |
Method: | Least Squares | F-statistic: | 9.496 |
Date: | Mon, 21 Aug 2023 | Prob (F-statistic): | 0.00235 |
Time: | 03:32:19 | Log-Likelihood: | -850.34 |
No. Observations: | 200 | AIC: | 1705. |
Df Residuals: | 198 | BIC: | 1711. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 75.1675 | 7.512 | 10.006 | 0.000 | 60.353 | 89.982 |
YellowCard | -0.3689 | 0.120 | -3.081 | 0.002 | -0.605 | -0.133 |
Omnibus: | 11.220 | Durbin-Watson: | 1.616 |
---|---|---|---|
Prob(Omnibus): | 0.004 | Jarque-Bera (JB): | 12.011 |
Skew: | 0.581 | Prob(JB): | 0.00247 |
Kurtosis: | 2.702 | Cond. No. | 390. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Evaluate linear regression model using statsmodels OLS
stats5 = statsmodels.formula.api.ols(formula="Pts ~ RedCard", data=complete_data).fit()
# Print the summary
stats5.summary()
Dep. Variable: | Pts | R-squared: | 0.012 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.007 |
Method: | Least Squares | F-statistic: | 2.388 |
Date: | Mon, 21 Aug 2023 | Prob (F-statistic): | 0.124 |
Time: | 03:32:22 | Log-Likelihood: | -853.82 |
No. Observations: | 200 | AIC: | 1712. |
Df Residuals: | 198 | BIC: | 1718. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 55.2518 | 2.260 | 24.445 | 0.000 | 50.795 | 59.709 |
RedCard | -1.0981 | 0.711 | -1.545 | 0.124 | -2.499 | 0.303 |
Omnibus: | 11.792 | Durbin-Watson: | 1.638 |
---|---|---|---|
Prob(Omnibus): | 0.003 | Jarque-Bera (JB): | 12.784 |
Skew: | 0.605 | Prob(JB): | 0.00168 |
Kurtosis: | 2.734 | Cond. No. | 6.27 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Evaluate linear regression model using statsmodels OLS
stats6 = statsmodels.formula.api.ols(formula="Pts ~ TotalShots", data=complete_data).fit()
# Print the summary
stats6.summary()
Dep. Variable: | Pts | R-squared: | 0.611 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.609 |
Method: | Least Squares | F-statistic: | 311.6 |
Date: | Mon, 21 Aug 2023 | Prob (F-statistic): | 1.64e-42 |
Time: | 03:32:24 | Log-Likelihood: | -760.49 |
No. Observations: | 200 | AIC: | 1525. |
Df Residuals: | 198 | BIC: | 1532. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -22.5517 | 4.311 | -5.231 | 0.000 | -31.053 | -14.050 |
TotalShots | 0.1549 | 0.009 | 17.652 | 0.000 | 0.138 | 0.172 |
Omnibus: | 6.095 | Durbin-Watson: | 1.957 |
---|---|---|---|
Prob(Omnibus): | 0.047 | Jarque-Bera (JB): | 5.805 |
Skew: | 0.356 | Prob(JB): | 0.0549 |
Kurtosis: | 3.434 | Cond. No. | 2.75e+03 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.75e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
# Evaluate linear regression model using statsmodels OLS
stats7 = statsmodels.formula.api.ols(formula="Pts ~ TotalShotsOnTarget", data=complete_data).fit()
# Print the summary
stats7.summary()
Dep. Variable: | Pts | R-squared: | 0.288 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.284 |
Method: | Least Squares | F-statistic: | 80.10 |
Date: | Mon, 21 Aug 2023 | Prob (F-statistic): | 2.59e-16 |
Time: | 03:32:26 | Log-Likelihood: | -821.05 |
No. Observations: | 200 | AIC: | 1646. |
Df Residuals: | 198 | BIC: | 1653. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 25.0790 | 3.218 | 7.794 | 0.000 | 18.734 | 31.424 |
TotalShotsOnTarget | 0.1396 | 0.016 | 8.950 | 0.000 | 0.109 | 0.170 |
Omnibus: | 7.301 | Durbin-Watson: | 1.351 |
---|---|---|---|
Prob(Omnibus): | 0.026 | Jarque-Bera (JB): | 7.252 |
Skew: | 0.464 | Prob(JB): | 0.0266 |
Kurtosis: | 3.095 | Cond. No. | 636. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Evaluate linear regression model using statsmodels OLS
stats8 = statsmodels.formula.api.ols(formula="Pts ~ PercentShotsScored", data=complete_data).fit()
# Print the summary
stats8.summary()
Dep. Variable: | Pts | R-squared: | 0.538 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.536 |
Method: | Least Squares | F-statistic: | 230.5 |
Date: | Mon, 21 Aug 2023 | Prob (F-statistic): | 4.93e-35 |
Time: | 03:32:29 | Log-Likelihood: | -777.82 |
No. Observations: | 200 | AIC: | 1560. |
Df Residuals: | 198 | BIC: | 1566. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -15.3248 | 4.534 | -3.380 | 0.001 | -24.266 | -6.383 |
PercentShotsScored | 635.0815 | 41.831 | 15.182 | 0.000 | 552.590 | 717.573 |
Omnibus: | 0.071 | Durbin-Watson: | 1.711 |
---|---|---|---|
Prob(Omnibus): | 0.965 | Jarque-Bera (JB): | 0.202 |
Skew: | 0.005 | Prob(JB): | 0.904 |
Kurtosis: | 2.845 | Cond. No. | 50.3 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression With Multiple Independent Variables and Prediction Comparison¶
We also wanted to observe how the variables above would jointly contribute to accumulating points for teams in the Premier League. Looking at the linear regression output below, variables such as GA, TotalShots, and PercentShotsScored are statistically significant in this model, and they behave as expected: goals conceded lower a team's chance of winning a game, while taking more shots and converting them more efficiently increase the chance of scoring. Their coefficients support these observations. The negative coefficient on GA indicates that each goal conceded lowers points, while the positive coefficients on the other two indicate a positive relationship with points. However, some variables, such as GF and TotalShotsOnTarget, had higher p-values than expected, which could mean either that the variable is simply not significant or that there is correlation between variables that we needed to check. Despite the unexpected p-values, the R-squared of 0.941 indicates that 94.1% of the variation in points can be explained by the independent variables in this model. This is a very large value and implies a strong regression model.
In addition, 0.941 is greater than the largest individual-factor R-squared value, which was 0.830 for Goals Scored. This further suggests that a more comprehensive model including multiple factors leads to a better fit.
# Evaluate linear regression model using statsmodels OLS
stats9 = statsmodels.formula.api.ols(formula="Pts ~ GF + GA + Fouls + YellowCard + RedCard + TotalShots + TotalShotsOnTarget + PercentShotsScored", data=complete_data).fit()
# Print the summary
stats9.summary()
Dep. Variable: | Pts | R-squared: | 0.941 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.938 |
Method: | Least Squares | F-statistic: | 380.0 |
Date: | Mon, 21 Aug 2023 | Prob (F-statistic): | 6.35e-113 |
Time: | 03:32:33 | Log-Likelihood: | -572.18 |
No. Observations: | 200 | AIC: | 1162. |
Df Residuals: | 191 | BIC: | 1192. |
Df Model: | 8 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 30.8839 | 10.755 | 2.872 | 0.005 | 9.670 | 52.098 |
GF | 0.2234 | 0.177 | 1.260 | 0.209 | -0.126 | 0.573 |
GA | -0.6010 | 0.033 | -18.279 | 0.000 | -0.666 | -0.536 |
Fouls | -0.0172 | 0.010 | -1.744 | 0.083 | -0.037 | 0.002 |
YellowCard | 0.0271 | 0.038 | 0.721 | 0.472 | -0.047 | 0.101 |
RedCard | 0.0585 | 0.183 | 0.320 | 0.750 | -0.302 | 0.419 |
TotalShots | 0.0479 | 0.021 | 2.304 | 0.022 | 0.007 | 0.089 |
TotalShotsOnTarget | -0.0075 | 0.006 | -1.193 | 0.234 | -0.020 | 0.005 |
PercentShotsScored | 231.3256 | 89.116 | 2.596 | 0.010 | 55.548 | 407.103 |
Omnibus: | 3.034 | Durbin-Watson: | 2.248 |
---|---|---|---|
Prob(Omnibus): | 0.219 | Jarque-Bera (JB): | 2.778 |
Skew: | 0.178 | Prob(JB): | 0.249 |
Kurtosis: | 3.454 | Cond. No. | 1.97e+05 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.97e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
The table below compares the actual points each team earned with the points predicted by the multivariate OLS model. Although some variables are more correlated with one another than expected, we kept them because of their practical significance (GF, for example). As it turns out, the multivariate model's predictions were very close to the actual data, which was interesting to observe.
# Copy complete data from above
compare_data = complete_data.copy()
# Drop unnecessary column
compare_data = compare_data.drop(['Pts'], axis=1)
# Get the prediction data based on the multivariate linear regression model
prediction = stats9.predict(compare_data)
# Copy complete data for comparison of actual data vs prediction data
result = complete_data.copy()
# Drop unnecessary columns and add prediction data into comparison table
result = result.drop(['Rk', 'MP', 'W', 'D', 'L', 'GF', 'GA', 'GD', 'CornerKicks', 'Fouls', 'YellowCard', 'RedCard', 'TotalShots', 'TotalShotsOnTarget', 'PercentShotsScored'], axis=1)
result['Predicted_Pts'] = prediction
result = result.rename(columns={'Pts':'Actual_Pts'})
# Print the comparison table
print(result.head())
              Team  Season_End_Year  Actual_Pts  Predicted_Pts
0          Arsenal             2011          68      69.802765
1      Aston Villa             2011          48      45.210850
2  Birmingham City             2011          39      39.822635
3        Blackburn             2011          43      44.515867
4        Blackpool             2011          39      38.873528
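To put a number on how close these predictions are, we could compute standard error metrics. A quick sketch using just the five rows shown above (a full evaluation would use all 200 rows of `result`):

```python
import numpy as np

# Actual vs predicted points for the first five rows of the comparison table
actual = np.array([68, 48, 39, 43, 39])
predicted = np.array([69.802765, 45.210850, 39.822635, 44.515867, 38.873528])

residuals = actual - predicted
mae = np.mean(np.abs(residuals))          # mean absolute error
rmse = np.sqrt(np.mean(residuals ** 2))   # root mean squared error
print(f"MAE: {mae:.2f} points, RMSE: {rmse:.2f} points")
```

For these rows the model is off by only a point or two per team, consistent with the strong fit the R-squared suggested.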
In addition, we plotted the actual points against the predicted points to visualize their linear relationship. The plot below shows a strong linear relationship between the predicted and actual points, forming a well-defined, positively sloped ellipse of points.
# Set up the figure and axis
plt.figure(figsize=(10, 6))
# Scatter plot
sns.scatterplot(x='Actual_Pts', y='Predicted_Pts', data=result, color='blue')
# Add a line of perfect prediction
max_pts = max(result['Actual_Pts'].max(), result['Predicted_Pts'].max())
min_pts = min(result['Actual_Pts'].min(), result['Predicted_Pts'].min())
plt.plot([min_pts, max_pts], [min_pts, max_pts], color='red', linestyle='--')
# Add title and labels
plt.title('Actual Points vs Predicted Points')
plt.xlabel('Actual Points')
plt.ylabel('Predicted Points')
plt.grid(True)
# Display the plot
plt.tight_layout()
plt.show()
Explanation of Code Above¶
Again, looking at the plot above, we can see that the multivariate model appears to be a very good fit. If the model were perfect, the predicted vs. actual points would fall exactly on a straight line. The plotted points lie very close to this ideal line, which suggests a good fit and thus a good model.
Interpretation: Insight & Policy Decision¶
Overall, we can see that several factors play a significant role in either increasing or decreasing a team's points at the end of a season. Based on our OLS regression results, run with the formula "Pts ~ factor" for each given factor, we drew the following conclusions:
==> Goals Scored (GF), Goals Conceded (GA), CornerKicks, Fouls, YellowCard, TotalShots, TotalShotsOnTarget, and PercentShotsScored all have p-values below 0.05, implying that they have a statistically significant impact on team points (Pts). Goals Scored (GF) especially appeared to impact team points because it had the highest R-squared value, 0.830.
==> The RedCard factor had a p-value of 0.124, which is greater than 0.05, implying that it is NOT statistically significant in its effect on team points (Pts).
Looking at our correlation graph, we can see that:
==> Factors such as Goals Scored (+0.91), GD (+0.97), Corner Kicks (+0.71), TotalShots (+0.78), TotalShotsOnTarget (+0.54), and PercentShotsScored (+0.73) had positive correlations with Pts.
==> Factors such as Goals Conceded (-0.84), Fouls (-0.26), and YellowCard (-0.22) had negative correlations with Pts.
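Correlations like the ones above come straight from the data's correlation matrix. A minimal sketch on synthetic stand-in data (the column names mirror ours, the numbers are made up; the real analysis would call this on `complete_data`):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: Pts rises with GF, while Fouls is unrelated here
rng = np.random.default_rng(7)
gf = rng.normal(55, 15, 200)
df = pd.DataFrame({
    'GF': gf,
    'Pts': 1.1 * gf + rng.normal(0, 8, 200),
    'Fouls': rng.normal(400, 30, 200),
})

# Pairwise Pearson correlations of every factor with Pts
corr_with_pts = df.corr()['Pts'].drop('Pts')
print(corr_with_pts.sort_values(ascending=False))
```

Sorting the resulting series, as we did for our correlation graph, immediately ranks the factors by the strength of their linear association with points.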
Looking forward, we can use this information to isolate strategies that clubs can focus on to increase their chance of winning (based on the points they can earn). Focusing on strong striking ability increases a club's capacity to take many shots on target, which results in more goals scored and thus more points. Likewise, introducing measures to discourage unnecessary bad behavior can reduce the chance of conceding a foul or receiving a yellow card, which appears to result in more points.
Furthermore, future analyses could use additional statistical techniques to deepen our understanding of how to maximize points in the EPL. One such improvement would be to explore specific strategies and the correlations between perceived negative and positive variables. For example, on the surface, aggressive play could be seen as a positive because it could increase the chance of scoring goals, which would increase points. However, aggressive play could also increase the chance of fouls, which would decrease points. If we found an aggressive strategy that increased shots on target while barely increasing fouls, that would be ideal.
This could be achieved with techniques such as finding additional variables that are not correlated with one another (to lower the p-values of the multivariate linear regression model) or by analyzing more intricate statistics for individual clubs.
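One simple version of that idea is to drop a regressor that is nearly redundant with another and confirm the fit barely changes. A sketch on synthetic stand-in data (column names mirror ours, numbers are made up), using the same statsmodels formula API as the cells above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data with two deliberately collinear shot measures
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({'TotalShots': rng.normal(500, 50, n)})
df['TotalShotsOnTarget'] = 0.35 * df['TotalShots'] + rng.normal(0, 3, n)
df['Pts'] = 0.15 * df['TotalShots'] + rng.normal(0, 5, n)

# Full model vs a reduced model without the redundant regressor
full = smf.ols("Pts ~ TotalShots + TotalShotsOnTarget", data=df).fit()
reduced = smf.ols("Pts ~ TotalShots", data=df).fit()

# If R-squared barely moves, the dropped variable added little beyond its twin
print(full.rsquared, reduced.rsquared)
```

When the reduced model explains essentially the same variance, the simpler model is preferable: its coefficients are easier to interpret and their p-values are no longer inflated by collinearity.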
In conclusion, we have determined which factors statistically help or hurt a team's ability to earn points in the EPL and have suggested measures to increase or lessen these factors.
While remaining in the Premier League can be lucrative, it is not easy. A club must stay efficient enough to remain above the bottom three spots in the EPL table. To achieve this, proper data analytics, carried out through the stages of data collection, processing, hypothesis testing, and interpretation, are crucial.
https://drive.google.com/file/d/15qGW2Zypikj4ehth09xCViibzTWZXTHV/view?usp=drive_link