This is a case study taken from YouTube. The idea was to implement a case study of a Data Science project from scratch to completion. The final code is available on GitHub.
Python, Pandas, Scipy, Requests, BeautifulSoup.
Web Scraping: This is an automated technique that allows data to be extracted from websites. It is a process by which a program, known as a "web scraper", extracts specific information from the HTML code of a web page, which is then stored and analyzed.
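The extraction step described above can be sketched as follows. This is only an illustrative sketch: the HTML fragment, its class names (match, home, score, away) and the extract_matches helper are all hypothetical and do not reflect the structure of any real page. In a real run the HTML would typically come from something like requests.get(url).text.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page; the tags and
# class names are invented for illustration only.
HTML = """
<div class="match">
  <span class="home">Team A</span>
  <span class="score">2-1</span>
  <span class="away">Team B</span>
</div>
<div class="match">
  <span class="home">Team C</span>
  <span class="score">0-0</span>
  <span class="away">Team D</span>
</div>
"""


def extract_matches(html: str) -> list:
    """Return one dict per match block found in the HTML."""
    # "html.parser" is the parser bundled with Python's standard library.
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for block in soup.find_all("div", class_="match"):
        matches.append({
            "home": block.find("span", class_="home").get_text(strip=True),
            "score": block.find("span", class_="score").get_text(strip=True),
            "away": block.find("span", class_="away").get_text(strip=True),
        })
    return matches


matches = extract_matches(HTML)
for match in matches:
    print(match)
```

Each extracted record is a plain dictionary, so a list of them can be passed directly to a Pandas DataFrame for storage and analysis.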
Poisson Distribution: This is a discrete distribution that describes the number of events that occur in a fixed time interval or region of opportunity.
Conditions for its application:
The number of events can be counted.
The occurrence of events is independent.
The rate at which events occur is constant.
Two events cannot occur at exactly the same instant in time.
Formula: The probability that an event occurs exactly k times in a given time interval is given by the following equation:
P(X = k) = (λ^k · e^(−λ)) / k!
Where:
k: Number of occurrences (≥ 0).
X: Discrete random variable (≥ 0).
λ (Lambda): Average expected number of occurrences in the interval (> 0).
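As a small worked example, the probability described above can be evaluated directly with Python's standard library. The function name poisson_pmf and the value λ = 1.5 are assumptions chosen for illustration; scipy.stats.poisson.pmf(k, mu) should return the same values.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) = lam**k * exp(-lam) / k! for a Poisson variable with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


lam = 1.5  # hypothetical average number of events per interval
probabilities = [poisson_pmf(k, lam) for k in range(6)]

for k, p in enumerate(probabilities):
    print(f"P(X = {k}) = {p:.4f}")
```

Summing the probabilities over all possible values of k gives 1, which is a quick sanity check on any implementation.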
Use Python and its libraries: Pandas, to load and manage data through DataFrames; SciPy, to simulate the Poisson distribution; Requests, to query a page and perform web scraping; and BeautifulSoup, to parse HTML with a specific parser.
Recall concepts previously learned about the Poisson distribution and its applicability.
Be able to program an entire Data Science project from scratch, starting with web scraping a page to collect data, then cleaning and preparing it, and finally building the model and generating a final prediction with that model.
Problem: A prediction is needed for the 2022 World Cup champion.
To obtain the necessary data, it was decided to take it from the Wikipedia pages corresponding to that World Cup.
The choice of Wikipedia as the data source is not arbitrary. Building the prediction model requires historical data on matches played by the national teams participating in the World Cup, and Wikipedia uses the same page format for each World Cup, which facilitates the web scraping process.
As part of the model, a parameter known as Team Strength will be used. It measures the strength of a soccer team (in this case, a national team) based on the goals it scores, and its weakness based on the goals it concedes. In practice, it is calculated for each team from the average number of goals scored, on the one hand, and the average number of goals conceded, on the other. These values are used in a Poisson simulation, which yields a probability value for each team. This value is then compared against the value obtained for the opposing team, and the higher value determines the winning team.
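One possible way to turn Team Strength into a match prediction is sketched below. The team names, the averages, and the rule for combining them (a team's expected goals taken as its own scoring average times the opponent's conceding average) are assumptions made for illustration and are not necessarily what model_creation.py does. The sketch also treats the two teams' goal counts as independent.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals when lam goals are expected."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


# Hypothetical team strengths: (avg goals scored, avg goals conceded) per match.
team_strength = {
    "Team A": (1.8, 0.9),
    "Team B": (1.2, 1.4),
}


def match_probabilities(team_a: str, team_b: str, max_goals: int = 10):
    """Return (P(team_a wins), P(draw), P(team_b wins))."""
    scored_a, conceded_a = team_strength[team_a]
    scored_b, conceded_b = team_strength[team_b]
    # Assumed combination rule: own attack average times opponent's defence average.
    lam_a = scored_a * conceded_b
    lam_b = scored_b * conceded_a

    p_a = p_draw = p_b = 0.0
    # Sum the joint probability of every scoreline up to max_goals per team.
    for goals_a in range(max_goals + 1):
        for goals_b in range(max_goals + 1):
            p = poisson_pmf(goals_a, lam_a) * poisson_pmf(goals_b, lam_b)
            if goals_a > goals_b:
                p_a += p
            elif goals_a == goals_b:
                p_draw += p
            else:
                p_b += p
    return p_a, p_draw, p_b


p_a, p_draw, p_b = match_probabilities("Team A", "Team B")
winner = "Team A" if p_a > p_b else "Team B"
print(f"Team A: {p_a:.3f}, draw: {p_draw:.3f}, Team B: {p_b:.3f} -> {winner}")
```

Scorelines above max_goals are ignored, so the three probabilities add up to slightly less than 1. With expected goals in the range typical of soccer matches, the truncated part is negligible.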
The applicability of the Poisson Distribution in this case rests on the goal, which is an event that can occur during the 90 minutes of a soccer match.
Lambda (λ): the expected number of events per time interval, which in this case is the average number of goals in 90 minutes.
k: the number of goals that a team could score in a match.
Let's also look at its conditions:
The number of events can be counted: Clearly, the number of goals in a match can be counted, and it takes integer (discrete) values, not continuous ones such as 1.5 goals. Therefore, this condition is met.
The occurrence of events is independent: The occurrence of one goal should not affect the probability of another goal. This is debatable, because a goal can motivate the team that conceded it to push harder for an equalizer. For practical purposes, it is assumed that a goal does not affect the probability of another goal being scored, so this condition is treated as met.
The rate at which events occur is constant: We assume that the probability of a goal occurring in one 90-minute match is the same as the probability of it occurring in any other 90-minute match, so this condition is treated as met.
Two events cannot occur at exactly the same instant in time: In the same match, a goal cannot be scored at the same instant in time as another goal. Therefore, this condition is met.
The application was directly worked on in four stages using four Python files.
The first stage begins by accessing the 2022 World Cup website and obtaining the fixtures for the upcoming matches. (NOTE: In this case, since the 2022 World Cup has already been played, it's not the fixtures that were obtained, but rather the matches and their results. This data had to be modified later in the final phase to continue with the case as if the prediction were being made before the World Cup was played.) This is all found in the main.py file of the application code.
In the second stage, web scraping is performed to extract all the data; that is, a historical record of the matches from all World Cups is created. This is done in the data_extraction.py file.
The third stage "cleans" the data by removing empty spaces, NaN values, and duplicate rows. Incorrect data is also corrected. This can be seen in the data_cleaning.py file.
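In the fourth stage, the cleaned data is used to build and run the predictive model. This is done in the model_creation.py file.
There should be a fifth stage, which would be deploying the application, but for practical purposes, the application will only be maintained in a GitHub repository.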
Below is an image detailing the previous steps.
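The cleaning stage described above can be sketched with Pandas as follows. The column names, the sample rows and the clean_matches helper are hypothetical, and splitting the score into two integer columns is an added assumption about what the model needs; the actual data_cleaning.py may work differently.

```python
import pandas as pd

# Hypothetical raw data with stray spaces, a duplicate row and a missing score.
raw = pd.DataFrame({
    "home": [" Team A", "Team B ", "Team B ", "Team C"],
    "score": ["2-1", "0-0", "0-0", None],
    "away": ["Team D ", " Team E", " Team E", "Team F"],
})


def clean_matches(df: pd.DataFrame) -> pd.DataFrame:
    """Strip spaces, drop rows without a score, drop duplicates, split the score."""
    df = df.copy()
    for col in ("home", "away"):
        df[col] = df[col].str.strip()
    df = df.dropna(subset=["score"])
    df = df.drop_duplicates().reset_index(drop=True)
    # Assumed format "home_goals-away_goals", e.g. "2-1".
    goals = df["score"].str.split("-", expand=True).astype(int)
    df = df.assign(home_goals=goals[0], away_goals=goals[1])
    return df.drop(columns="score")


clean = clean_matches(raw)
print(clean)
```

Stripping the team names before removing duplicates matters here, because rows that differ only in stray spaces would otherwise not be recognized as duplicates of already clean rows.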
The algorithm is based on four components (Python files) as explained in the previous section. Each of them, except the last, generates new output files, which are explained below:
main.py: Obtains the dict_table file, which is a Python dictionary containing the World Cup fixtures.
data_extraction.py: Obtains two files containing all the matches from all World Cups. One is fifa_worldcup_history_data.csv, and the other is fifa_worldcup_fixture.csv, with the fixture matches already formatted.
data_cleaning.py: Reads the files mentioned in the previous step and generates two new files with the data already "cleaned" and ready to be processed by the model. These are: clean_fifa_worldcup_history_data.csv and clean_fifa_worldcup_fixture.csv.
model_creation.py: This Python file does not generate any output files. It uses the files mentioned above to build and run the predictive model. The result is a screen display of which team is predicted to win the final match of the 2022 World Cup and, with it, the tournament.
All the code and the entire application are available in this GitHub repository.