Nathan's Final Project

by nathanstern93

08 Dec 2022

For my final project, I knew I wanted to do something with data analytics because that is what I am most interested in. I’ve never been the most creative person, so creating a game did not appeal to me very much. I have always loved sports, and I have started doing more with sports analytics during my time in graduate school. I’ve done a fair amount of sports analytics work in R, but have never created anything truly interactive. I chose the NBA because basketball is my favorite sport to watch and analyze. With the NBA, analytics and statistics can be a bit murky because there is so much noise in a game. This is in contrast to a sport like baseball, which is essentially a series of discrete events by individual players with almost no interference from their teammates. As a result, baseball has always been ahead of the curve in accepting analytics as part of its culture. In basketball, it’s very difficult to boil down events on the court to descriptive statistics because so much of what players do depends on who they are playing against and who they are playing with. To me, however, this complication makes basketball analytics even more exciting. It forces the NBA analytics field to be more innovative and creative than analytics in some other sports. For example, how does one define a player’s ability to make their teammates better? How do you define a player’s ability to elevate their entire team’s defensive efficiency? This also makes it even more difficult to compare players, and since each team has only five players on the court, individual players have an outsize impact on their team’s ability to win compared to almost any other sport. As a result of these two factors, debates over who is better are more contentious, interesting, and important in basketball than in any other sport.
Every basketball fan has their own top 10 lists: top 10 players ever, top 10 players right now, top 10 scorers ever, etc. So I wanted to make an interactive program that could compare NBA players in some way and show who leads in what.

When I set out on this project, my scope was a bit more ambitious than it ended up being, and I had to adjust it based on time and coding constraints. My first thought was to make a list of the best players ever that the user could sort by individual career statistics. I ended up not going with this for several reasons. First, I could not find a dataset that included only the best players ever. So I decided to look for a larger dataset that I could then subset into a list of the best players ever. The only datasets that I could find for this were on Kaggle, and they covered the complete history of the NBA. These sets were massive because each row represented one player in one season. I decided to try one that seemed comprehensive and had about 30,000 rows. My idea was to manually choose the rows of only the best players ever. There is no way to pick the best players with a numerical cutoff, because some players had shorter careers but are considered all-time greats even though their career statistics may not stack up against the other top players. So the only way to really do this right was to load a list of the top 75 players ever into Python and tell the program to keep only rows whose “name” column matched a name from the list. This would have given me all the individual seasons of the best players ever, so I would also have needed a group-by operation to collapse and sum each player’s seasons into one row of career totals. However, I ran into another roadblock here because I could not figure out how to load only the rows of a CSV that met this condition. The only code I could find involved first loading the entire CSV into a dataframe and then subsetting that dataframe, and Trinket is not powerful enough to handle a dataset that large.
I could easily have done this in Microsoft Excel with a macro, but I really wanted to do this project entirely on Trinket using Python code.
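One possible way around this roadblock is to read the CSV one row at a time and keep only the rows for players on the list, summing as you go, so the full file never has to sit in memory. The following is a minimal sketch of that idea and is not code from the project. The column names (`name`, `pts`, `ast`), the function name `career_totals`, and the sample rows are all made up for illustration; a real Kaggle file will have its own headers.

```python
import csv
import io
from collections import defaultdict

def career_totals(csv_file, top_players, stat_columns, name_column="name"):
    """Stream a season-by-season CSV and sum stats for the chosen players only."""
    wanted = set(top_players)
    totals = defaultdict(lambda: dict.fromkeys(stat_columns, 0.0))
    for row in csv.DictReader(csv_file):  # yields one row (a dict) at a time
        player = row[name_column]
        if player not in wanted:
            continue  # skip every player who is not on the list
        for col in stat_columns:
            # Treat blank cells as zero; real data may need more careful handling
            totals[player][col] += float(row[col] or 0)
    return dict(totals)

# Tiny made-up sample standing in for the real file (hypothetical column names)
sample = io.StringIO(
    "name,season,pts,ast\n"
    "Player A,2001,1500,400\n"
    "Player B,2001,900,100\n"
    "Player A,2002,1600,450\n"
)
totals = career_totals(sample, ["Player A"], ["pts", "ast"])
print(totals)
```

With a real file, the `io.StringIO` sample would be replaced by a file opened with `open(path, newline="")`. Whether this fits within Trinket's limits for a 30,000-row file is untested here.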

Since the top 75 players dataset was not feasible for this project on Trinket, I decided to look at all the NBA datasets on Kaggle. I eventually found one that contained every player’s individual statistics from only the most recent NBA season (2021-22). This seemed like a pretty reasonable concession to my original idea. Instead of looking at all-time leaders, we could just look at statistical leaders from the last season. I successfully loaded the entire dataset into Trinket, so this seemed like a go.

I decided to use the CSV menu activity we did in class as the basis for my program because it was a close enough model. There were a number of different directions I could go with this, and I tried several of them. At first, I tried building a menu where the user could sort the entire dataset by whichever column they wanted. So if they wanted to see the league leaders in points, they could sort the entire dataset by points in descending order. Ultimately, I just couldn’t get this to the point where I felt comfortable with the output. It felt unwieldy and potentially inaccurate. Once again, I decided to narrow the scope. I felt I would be much more comfortable presenting a clean product, even if it meant making it simpler.
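For reference, sorting a list of row dictionaries by one column can be sketched as below. This is an illustration under assumptions, not the project's code: the column names `Player` and `PTS`, the helper name `top_n`, and the sample values are made up. One thing the sketch makes explicit is the conversion to `float`. Values read from a CSV arrive as strings, and sorting strings compares them character by character, which is one common way a sorted list ends up looking wrong.

```python
def top_n(rows, stat, n=10):
    """Return the n rows with the highest numeric value in the given column."""
    # float() matters: as strings, "9.8" would sort above "30.3"
    return sorted(rows, key=lambda row: float(row[stat]), reverse=True)[:n]

# Made-up rows in the shape csv.DictReader would produce (all values are strings)
rows = [
    {"Player": "Player A", "PTS": "27.1"},
    {"Player": "Player B", "PTS": "30.3"},
    {"Player": "Player C", "PTS": "9.8"},
]

leaders = top_n(rows, "PTS", n=2)
for rank, row in enumerate(leaders, start=1):
    print(f"{rank}. {row['Player']}: {row['PTS']}")
```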

The idea that seemed the most doable to me was to present a menu where the user can simply see which player led the league in whatever statistic they choose. This is still useful, even if it isn’t quite as useful as seeing an entire list or a top 10 of leaders. It also gave me the advantage of being able to output a sentence that the user could easily understand. So first, I loaded the file and created a menu with an option for each of the statistics that I felt were relevant. I then had to get the list of dictionaries and read the keys, or column headers, from the first row. Later, I added code to remove players who played fewer than 30 games from the data. I realized that leaving them in biased the statistics, because some players appeared in only a couple of games but put up a lot of points or minutes, etc. in those games. Removing them limits the dataset to relatively high-usage players. I then had to rename the keys, or column headers, that included integers or special characters like %, because those would have broken my code later on. Next, I wrote the code for each choice that the user could select. This code retrieves the row with the maximum value of the selected statistic and then uses that row to build the output string, which says “[Player name] led the NBA with [amount + parameter].” Ultimately, this output looked really clean and easy to understand. Finally, I put all of this in a while True loop so that the user has the option of returning to the menu or exiting after making a selection.
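The steps described above (rename awkward headers, drop low-game players, find the maximum row, build the sentence, loop on a menu) can be sketched roughly as follows. This is a simplified illustration and not the program linked below. The headers `Player`, `G`, `PTS`, and `3P%`, every function name, the renaming scheme, and the sample rows are assumptions made for the example.

```python
import csv
import io

def clean_key(key):
    """Rename headers that contain % or start with a digit (hypothetical scheme)."""
    key = key.replace("%", "_pct")
    if key[:1].isdigit():  # e.g. "3P_pct" becomes "n3P_pct"
        key = "n" + key
    return key

def load_rows(csv_file, min_games=30):
    """Read rows as dictionaries, rename keys, and drop players below min_games."""
    rows = []
    for raw in csv.DictReader(csv_file):
        row = {clean_key(k): v for k, v in raw.items()}
        if int(row["G"]) >= min_games:
            rows.append(row)
    return rows

def leader_sentence(rows, stat, label):
    """Find the row with the largest value for stat and describe it in a sentence."""
    best = max(rows, key=lambda row: float(row[stat]))
    return f"{best['Player']} led the NBA with {best[stat]} {label}."

def run_menu(rows, menu, ask=input, show=print):
    """Show the menu repeatedly until the user enters q."""
    while True:
        for number, (stat, label) in menu.items():
            show(f"{number}. {label}")
        show("q. quit")
        choice = ask("Choose a statistic: ").strip()
        if choice == "q":
            break
        if choice in menu:
            stat, label = menu[choice]
            show(leader_sentence(rows, stat, label))
        else:
            show("Please pick one of the listed options.")

# Made-up sample data; Player B is filtered out for playing only 2 games
sample = io.StringIO(
    "Player,G,PTS,3P%\n"
    "Player A,68,30.6,0.371\n"
    "Player B,2,41.0,0.500\n"
    "Player C,74,27.1,0.404\n"
)
rows = load_rows(sample)
menu = {"1": ("PTS", "points per game"), "2": ("n3P_pct", "three-point percentage")}
print(leader_sentence(rows, "PTS", "points per game"))
# run_menu(rows, menu)  # uncomment to try the menu interactively
```

Passing `ask` and `show` as parameters is a design choice for the sketch only: it lets the loop be exercised without a keyboard, while the defaults keep the usual `input` and `print` behavior.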

Ultimately, I think I could have done more with this if I had had more time and had been working in a terminal or another environment that could handle more code. In any case, I’m happy with the clean output of my program, and I think it’s very easy to use.

https://trinket.io/python3/3371ea3e62
