Datasets : NBA Project

DataSciR > Datasets

2. Datasets

2.1 NBA Stats Datasets

To get the needed data about players, games, seasons and all relevant meta-data, we extracted statistical data from basketball-reference.com - a website which provides historical basketball statistics of basketball players and teams from various US American and European leagues including the NBA.

We use the following datasets:

Game data : GameId, Date, Gametime and the final score
Player data : PlayerId, Name of each player and the twitter account name
Player season stats : Player stats accumulated for each season
Player game stats : Performance data for each unique player/game combination

The player game stats dataset also includes the BPM metric we use as the main variable. Since the metric considers the overall performance of a player within a game (including offensive and defensive effort) we decided to use it as our main game performance indicator for the correlation analysis. BPM uses a players box score information, position, and the teams overall performance to estimate the players contribution in points above league average per 100 possessions played.

We picked those players, who continuously played in the regular seasons 2016/17 - 2018/19 (the last two seasons before Covid-19). We didn’t consider the playoffs here, as many players didn’t get into the playoffs with their teams but still played a full regular season and therefore provide enough interesting play-data for our analysis. Furthermore, we only wanted those players in our dataset, who stayed at their respective team for the whole observation time. The idea behind this was to eliminate team switches as possible factors that influence the players performance. Additionally we wanted only those players who had on-court time in at least 80% of the games during regular season. On this dataset we applied a cutoff value to get only those players, whose performance is relatively unstable in comparison to their colleagues (standard deviation for BPM was higher or equal to 8). The last parameter we wanted to include into our selection is a minimum Twitter follower count of 1.000 to ensure social media activity.

2.2 Twitter Dataset

With the given data we were able to extract the tweets using the Twitter API. To only extract tweets that can be assumed to be relevant for a specific game day, we delimited the time range of tweets to be considered for the extraction to the time between 24 hours and 45 minutes before a game (to be on the safe side, we first extracted tweets in a range of 48 hours before a game and boiled it down to 24 hours in an extra step). With the first limit we wanted to avoid that tweets, related to another match get considered. The 45 minute delimiter was set according to the assumption, that it is unlikely for players to check their twitter just 45 minutes before a game. Alongside with the raw text, the date of creation, the retweet count, the reply count, the like count and the quote count were added to the dataset. The tweets were then processed further to extract the text and emojis to run a sentiment analysis.