Shih-Chieh's final project

by Shih-Chieh Dai

08 Dec 2022

Introduction

As a basketball fan, I have always been fascinated by the statistics and data that are generated by the game. The National Basketball Association (NBA) is a particularly rich source of data, with players generating a wide range of statistics on a daily basis.

I am excited about doing a data analysis project on NBA data for several reasons. First and foremost, it allows me to combine my passion for basketball with my interest in data analysis. I enjoy the process of working with data. The project also gave me the opportunity to become familiar with Python's data science packages, such as Pandas, Matplotlib, SciPy, and NumPy.

Another reason I am interested in doing a data analysis project on NBA data is that it allows me to learn more about the game itself. I am currently playing an NBA fantasy game. In the game, each player acts as a general manager who can draft, trade, and claim NBA players. The goal is to accumulate stats from your roster. There are several categories, such as points, rebounds, and assists. By analyzing the data, I can gain a better understanding of how different players and teams perform. This knowledge can help me become a more informed fan, and I can use the project to shape my fantasy strategy.

In conclusion, I am interested in doing a data analysis project on NBA data because it would allow me to combine my passion for basketball with my interest in data analysis, learn more about the game itself, and develop valuable skills and insights. I believe that such a project would be challenging, rewarding, and a lot of fun.

Data Science Packages

In this project, I used several data science packages in Python. These packages are widely used in the data science community, and the project gave me the chance to become familiar with them. I introduce the packages I used in this section.

Pandas: Pandas is a widely used tool for analyzing data. It provides two powerful data structures, the DataFrame and the Series. With Pandas, the user can easily process and analyze tabular and columnar data. In this project, I used Pandas to process my CSV file.

import pandas as pd

df = pd.read_csv('nba_stats_2022-2023.csv')
print(df.head(10))
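
As a small illustration of the two data structures, the sketch below builds a tiny DataFrame by hand and pulls one column out as a Series. The player names, team codes, and numbers are made up for the example and are not taken from the project's CSV file.

```python
import pandas as pd

# Made-up sample rows for illustration only; the real project reads a CSV file.
sample = pd.DataFrame({
    "FULL NAME": ["Player A", "Player B", "Player C"],
    "TEAM": ["Bos", "Bos", "Den"],
    "PPG": [30.0, 20.0, 25.0],
})

# Selecting one column of a DataFrame returns a Series.
ppg = sample["PPG"]
print(type(sample).__name__)
print(type(ppg).__name__)

# A Series supports vectorized statistics such as the mean.
print(ppg.mean())
```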

NumPy: NumPy is a Python package that provides data types and functions for working with arrays and matrices of numerical data. It includes tools for creating and manipulating arrays, performing mathematical operations on them, and working with arrays of different shapes and sizes.

In my project, I used NumPy to select the numerical columns of the NBA data so that I could calculate the z-score.

import numpy as np

# Select numerical column to calculate z-score
numeric_cols = temp_df.select_dtypes(include=[np.number]).columns
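
To show what this selection does, here is a minimal, self-contained sketch on a made-up table. The column names "GP" and the sample values are assumptions for the example; only the `select_dtypes` call mirrors the project code.

```python
import numpy as np
import pandas as pd

# Made-up sample data: two text columns and two numerical columns.
temp_df = pd.DataFrame({
    "FULL NAME": ["Player A", "Player B"],
    "TEAM": ["Bos", "Den"],
    "PPG": [30.0, 20.0],
    "GP": [10, 12],
})

# np.number matches every integer and floating-point dtype,
# so the text columns are left out of the result.
numeric_cols = temp_df.select_dtypes(include=[np.number]).columns
print(list(numeric_cols))
```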

SciPy: SciPy is a Python library for scientific computing. Many of its algorithms and statistical tools are implemented in compiled languages such as C and wrapped as a Python package. Python is a high-level programming language that is easy to read and write, but it is slow compared to C or C++.

Source: https://github.com/niklas-heer/speed-comparison

In my project, I used the z-score for the data analysis. The z-score is a useful tool for comparing values in a dataset to the mean of the dataset, and for determining how unusual or typical a given value is. For example, a z-score of 0 indicates that a value is exactly the same as the mean of the dataset, while a z-score of 2 indicates that a value is two standard deviations above the mean. The z-score is a widely used method for analyzing NBA data. By calculating the z-score for each player’s stats, I can determine how far above or below the league average each player is in various categories, such as points per game, rebounds per game, and assists per game. This helps me and the user identify players who are exceptional or below average in certain areas.

from scipy.stats import zscore

def get_z_score_result(action):
  
  if action in action_map:
    # get values
    temp_df = df_new
    
    # Calculate z_score
    # Select numerical column to calculate z-score
    numeric_cols = temp_df.select_dtypes(include=[np.number]).columns
    result = temp_df[numeric_cols].apply(zscore)
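
As a sanity check on the definition, the sketch below computes the z-score by hand as (value - mean) / standard deviation and compares it with `scipy.stats.zscore`. The three values are made up for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Made-up points-per-game values for illustration only.
ppg = pd.Series([10.0, 20.0, 30.0])

# z = (x - mean) / standard deviation.
# scipy.stats.zscore defaults to the population standard deviation (ddof=0),
# while pandas' Series.std defaults to ddof=1, so ddof=0 is passed explicitly.
manual = (ppg - ppg.mean()) / ppg.std(ddof=0)
from_scipy = zscore(ppg.to_numpy())

# The two results agree up to floating-point rounding.
print(np.allclose(manual, from_scipy))
```

The middle value equals the mean, so its z-score is 0; the other two sit the same distance below and above the mean.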

Matplotlib: Matplotlib is a Python library for plotting. You can use Matplotlib to create plots and figures, such as bar charts, histograms, and scatter plots. It is widely used in the data science community.

In my project, I used matplotlib for plotting the bar chart for the selected criteria and team.

import matplotlib.pyplot as plt

def plot_the_data_by_action(action, team):
  if action in action_map:
    # Filter the data by the team
    df_plot = df_new.loc[df_new["TEAM"] == team]
    
    # Plot the bar chart
    ax = df_plot.plot.bar(x="FULL NAME", y=action_map[action], rot=0)
    print(df_plot["FULL NAME"])
    
    # Rotate the X-label and change the font size to 6
    plt.xticks(rotation=45, fontsize=6)
    plt.show()

Here is an example of a bar chart created with Matplotlib in my project.

This figure was generated on trinket. It seems that trinket constrains the size of the figure, so the x-axis labels were cut off.
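
One possible way to give long tick labels more room in a regular Matplotlib setup is to set the figure size explicitly and call `tight_layout`. The sketch below uses made-up data, and I have not verified whether it changes anything inside trinket.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample data for illustration only.
df_plot = pd.DataFrame({
    "FULL NAME": ["Player A", "Player B", "Player C"],
    "PPG": [30.0, 20.0, 25.0],
})

# A wider figure leaves more horizontal space for each label.
fig, ax = plt.subplots(figsize=(8, 4))
df_plot.plot.bar(x="FULL NAME", y="PPG", ax=ax, rot=45)
ax.tick_params(axis="x", labelsize=6)

# tight_layout adjusts the margins so rotated labels fit inside the figure.
fig.tight_layout()
fig.savefig("ppg_by_player.png")
```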

Other: I also used one tool this semester that is not a data science package: HackMD. HackMD is a web-based editor for writing markdown documents. While the user edits the document, the rendered result is shown in real time on the right-hand side. I used HackMD to write my reflections for the whole semester.

Here is a screenshot of the interface.

As you can see, I can write my markdown on the left-hand side, and the result will show on the right-hand side.

Progress

The project went through four main stages.

  1. Project initial plan (11/03~11/10). In this period, I proposed the project idea. I decided to do a data analysis project on NBA data. I also set some goals for the project. The milestones at that time were:
    • Find the external data
    • Data preprocessing
    • Data Analysis Method
  2. Interface of my project (11/10~11/17). This week, I found the data for the project. The data came from https://www.nbastuffer.com/, which provides up-to-date data in .xlsx format. I manually converted the data to CSV. In addition, I implemented the interface of my program.

    The milestones at that time were:

    • Find Dataset
    • Plan the interface (i.e. menu)
    • Data Preprocessing
    • Analysis Methods (Z-score)
    • Data Visualization

    Here is the trinket for this version of the project.

  3. Project Update (11/17~12/01). To be honest, I did not make much progress during this period because of the Thanksgiving holiday. I spent most of my time exploring and learning the data science packages in Python. This was when I decided to use Pandas to process my CSV file. I ran into a NaN issue when I read the file.
     0   NaN  ...                                              105.9                                                                                                                
     1   NaN  ...                                              103.8                                                                                                                
     2   NaN  ...                                              101.3                                                                                                                
     3   NaN  ...                                              110.9                                                                                                                
     4   NaN  ...                                              103.7                                                                                                                
     5   NaN  ...                                              105.0          
    

    When I read the file, one column showed NaN. I then went back to the data and found that not all columns were necessary for my project. The milestones at that time were:

    • Find Dataset
    • Plan the interface (i.e. menu)
    • Data Preprocessing
    • Analysis Methods (Z-score)
    • Data Visualization
    • Add help function (New milestone)

    One thing I want to mention is that I added a new milestone, “Add help function”, because I found that a help function shows the user how to use my program.

    Here is my trinket code.

  4. Final Result (12/01~12/08). In the final week of the project, I made big progress. Thankfully, I completed all the milestones I set. First, I moved all the functions for processing, analyzing, and plotting the data into a separate file, data_preprocess.py, and left only the interface code in main.py.

    Second, I updated the menu.

     menu_dict = {
         "1": "Show Stats",
         "2": "Performance Leader",
         "3": "Run Analysis",
         "4": "Data Visualization",
         "5": "Help me!",
         "6": "Exit"
         }
    

    I added “Help me!” to the menu.

    Third, I fixed the categories I wanted to use in the project. The user selects a category by number, and I created a dictionary to map the number to the category.

     action_map = {
       "1": "PPG",
       "2": "RPG",
       "3": "APG",
       "4": "eFG%",
       "5": "FT%",
       "6": "3P%",
       "7": "BPG",
       "8": "SPG",
       "9": "MPG",
       "10": "ORTG",
       "11": "DRTG",
       "12": "TOPG"
        
     }
    

    To help the user understand what the abbreviation means, I created a dictionary to store the hint of the criteria.

     action_hints = {
       "PPG": "Average Point per game.",
       "RPG": "Average Rebound per game.",
       "APG": "Average Assist per game.",
       "eFG%": "Effective Shooting Percentage With eFG%, three-point shots made are worth 50% more than two-point shots made. eFG% Formula=(FGM+ (0.5 x 3PM))/FGA",
       "FT%": "Free Throw Percentage",
       "3P%": "3-Point Field Goal Percentage",
       "BPG": "Average Block per game.",
       "SPG": "Average Steal per game.",
       "MPG": "Average Minute per game.",
       "ORTG": "Offensive Rating: Individual offensive rating is the number of points produced by a player per 100 total individual possessions.",
       "DRTG": "Defensive Rating: Individual defensive rating estimates how many points the player allowed per 100 possessions he individually faced while staying on the court.",
       "TOPG": "Average Turn Over per game."
     }
    

    I put my final program at the end of this reflection.

    Final milestones:

    • Find Dataset
    • Plan the interface (i.e. menu)
    • Data Preprocessing
    • Analysis Methods (Z-score)
    • Add help function (New milestone)
    • Data Visualization
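
To illustrate how the menu number, the category dictionary, and the hint dictionary from the final week fit together, here is a minimal sketch. It uses shortened copies of the two dictionaries, and `describe_choice` is a hypothetical helper written for this example, not a function from my program.

```python
# Shortened copies of the two dictionaries, enough for the example.
action_map = {"1": "PPG", "2": "RPG", "3": "APG"}
action_hints = {
    "PPG": "Average Point per game.",
    "RPG": "Average Rebound per game.",
    "APG": "Average Assist per game.",
}

def describe_choice(choice):
    """Return 'ABBR: hint' for a menu number, or None if it is not on the menu."""
    stat = action_map.get(choice)
    if stat is None:
        return None
    return f"{stat}: {action_hints[stat]}"

print(describe_choice("2"))
print(describe_choice("99"))
```

Keeping the numbers and the hints in separate dictionaries means a new category only needs one entry in each, and an unknown menu number can be handled in one place.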

Roadblocks

I ran into several roadblocks while working on this project, and each one was more challenging than the last. However, I believe that solving them shows that my programming skills improved.

For example, the first roadblock was that I saw NaN in the file I read.

0   NaN  ...                                              105.9                                                                                                                
1   NaN  ...                                              103.8                                                                                                                
2   NaN  ...                                              101.3                                                                                                                
3   NaN  ...                                              110.9                                                                                                                
4   NaN  ...                                              103.7                                                                                                                
5   NaN  ...                                              105.0          

However, when I looked back at the data, I found that I could select only the columns I needed. After I selected those columns, the issue was solved.
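
One possible way to select only the needed columns is the `usecols` parameter of `pd.read_csv`. The sketch below is an illustration under assumptions: the CSV text and the "RANK" column are made up, and it is not necessarily how my program does the selection.

```python
import io
import pandas as pd

# Made-up CSV text with an empty first column, standing in for the real file.
csv_text = """RANK,FULL NAME,TEAM,PPG
,Player A,Bos,30.0
,Player B,Den,20.0
"""

# Reading everything keeps the empty column, which shows up as NaN.
df_all = pd.read_csv(io.StringIO(csv_text))
print(df_all["RANK"].isna().all())

# Reading only the needed columns leaves the empty column out.
needed = ["FULL NAME", "TEAM", "PPG"]
df_needed = pd.read_csv(io.StringIO(csv_text), usecols=needed)
print(list(df_needed.columns))
```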

The other roadblock was this error:

numpy.AxisError: axis 0 is out of bounds for array of dimension 0 pandas zscore

This error occurred when I tried to implement the function for calculating the z-score. When I saw this error message, I had no idea how to solve it. The only solution that came to mind was to Google it!

After searching on Google, I found that some columns were not numerical, and that I could use NumPy to select only the numerical columns. I had a sense of achievement when I solved the issue.

import numpy as np

# Select numerical column to calculate z-score
numeric_cols = temp_df.select_dtypes(include=[np.number]).columns

In addition, from my debugging experience throughout the semester, I found a pattern for debugging: when I see an error and have no idea what it means, I can search for it on Google.

Limitations

I found some limitations in this program. First, the user cannot search for a player by typing only the first name or the last name. The search only works with the full name. If I had more time for the project, I would split each player’s full name into first and last name to support this.
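
A minimal sketch of how such a search could look, assuming a "FULL NAME" column like the one used in the plotting function. The player names are made up, and `search_by_name` is a hypothetical helper that is not part of my program.

```python
import pandas as pd

# Made-up sample data for illustration only.
sample = pd.DataFrame({
    "FULL NAME": ["Alex Smith", "Jordan Lee", "Sam Jordan"],
    "PPG": [30.0, 20.0, 25.0],
})

def search_by_name(df, query):
    """Return rows where the query equals one word of FULL NAME, ignoring case."""
    # Split each full name into lowercase words (first name, last name, ...).
    words = df["FULL NAME"].str.lower().str.split()
    mask = words.apply(lambda parts: query.lower() in parts)
    return df[mask]

print(search_by_name(sample, "jordan")["FULL NAME"].tolist())
```

Matching on whole words means the same query finds a player whether it is their first name or their last name.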

Second, the user cannot check the stats by position. The data from https://www.nbastuffer.com/ does contain position information. However, some players play multiple positions, so it would take more effort to let the user check the stats by position.
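
One way this could be approached is to split the position field and check whether the requested position is among the parts. The sketch below assumes a hypothetical "POS" column where multiple positions are joined with a hyphen (for example "G-F"); the actual column name and format in the dataset may differ, and `filter_by_position` is a made-up helper.

```python
import pandas as pd

# Made-up sample data; the "POS" column name and the "G-F" format are assumptions.
sample = pd.DataFrame({
    "FULL NAME": ["Player A", "Player B", "Player C"],
    "POS": ["G", "G-F", "C"],
})

def filter_by_position(df, position):
    """Return rows whose POS field lists the given position."""
    # "G-F" becomes ["G", "F"], so a player can match more than one position.
    positions = df["POS"].str.split("-")
    mask = positions.apply(lambda parts: position in parts)
    return df[mask]

print(filter_by_position(sample, "F")["FULL NAME"].tolist())
```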

Conclusion

When I first set my milestones for the project, I was not sure if I could complete the data visualization function. Data visualization looked very challenging. However, after a month of effort, I completed all the goals I set for the project. I had a lot of fun during the coding process. Some parts of the project were challenging for me, and in the beginning I did not know how to implement the functions I wanted. However, I learned that I could check the documentation for the library I was using. It looks challenging, but it is actually not that hard: read the documentation, and you will learn how to use the library.

This semester, we did a lot of turtle programming. We also learned how to use GitHub and wrote markdown documents for our coding work. I believe this process taught me how to use GitHub and how to document my code. I think these skills are important in the industry, so I am grateful that I could learn them in this course.

Overall, the exercises and the project were part of the process of learning how to program. They took me a lot of time. However, I believe such a process is essential for learning to program, and it is challenging, rewarding, and a lot of fun.

My Trinket Code

I am a second year MSIS student focusing on data science. Find Shih-Chieh Dai on Twitter, Github, and on the web.