Introduction

The purpose of this project was to strengthen my statistical programming and modeling skills by analyzing NBA scheduling data. The raw data covered every NBA game played from the 2014-15 through 2023-24 seasons, including the game date, the home and away teams, the winner, and the latitude/longitude of the home team’s stadium. A second raw data sheet contained performance data for every game, such as field goals made/missed, rebounds, fouls, turnovers, points scored, and other basic basketball counting stats. Through this project, I sharpened my R programming and modeling skills, learned how to use HTML files and the knitr package to generate clean R output, and improved my understanding of GitHub repositories. Unfortunately, I am not able to share the raw data due to copyright and legal reasons.

The first 4 questions I came up with allowed me to practice using the tidyverse package in R to sort through the data and calculate concrete statistics based on the variables and conditions of the question.

In question 5, I analyzed and modeled two trends in NBA scheduling: the average number of consecutive home/away games played by teams each season, and the average number of back-to-back games a team plays each season.

In question 6, I wrote a function and designed a plotting tool that generates an interactive plot using the plotly package to visualize a specific team’s schedule density for any given season. Separate raw data of Denver’s 2024-25 schedule was used as an example, but the plot can visualize any team’s schedule by simply changing the 3-letter team abbreviation in the code. For example, you could visualize the Knicks’ schedule by changing “DEN” to “NYK”. Hovering over a plotted point shows the game date, opponent, home/away, and whether the game is on the second night of a back-to-back.

In question 7, I created a logistic regression model using variables such as kilometers traveled east and west from the previous game (possible jetlag), density variables such as whether the game is a back-to-back or falls within a 4-in-6 stretch, and opponent win percentage (strength). The model estimates the total number of games won or lost by each team due to schedule-related factors from the 2019-20 through 2023-24 seasons.

Analyses and explanations of the results for each question are written under the specific question.

Setup and Data

Question 1

How many times are the Thunder scheduled to play 4 games in 6 nights in the provided 80-game draft of the 2024-25 season schedule? (In other words, how many games are the 4th game played over the past 6 nights?)

## [1] 26

ANSWER 1:

26 4-in-6 stretches in OKC’s draft schedule.
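
As a minimal sketch of one way to count these games (not necessarily the code behind the result above), a game can be flagged when it and the three games before it fall within six calendar days. The helper name `count_four_in_six` and the sample dates are hypothetical.

```r
# Count games that are the 4th game played within a 6-night window.
# `dates` is a Date vector of one team's games (hypothetical helper).
count_four_in_six <- function(dates) {
  dates <- sort(dates)
  n <- length(dates)
  if (n < 4) return(0L)
  # Days between each game and the game three games earlier
  span <- as.integer(dates[4:n] - dates[1:(n - 3)])
  # Six nights inclusive means the gap is at most 5 days
  sum(span <= 5)
}

# Made-up dates for illustration
games <- as.Date(c("2024-11-01", "2024-11-02", "2024-11-04",
                   "2024-11-06", "2024-11-09", "2024-11-10"))
count_four_in_six(games)
```

Only the fourth game here (Nov 6) closes a 4-in-6 window; the later games are spread too far apart.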

Question 2

From 2014-15 to 2023-24, what is the average number of 4-in-6 stretches for a team in a season? (Each team/season is adjusted to per-82 games)

## [1] 25.10331

ANSWER 2:

25.1 4-in-6 stretches on average.
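
The per-82 adjustment can be sketched as follows, assuming a summary table with one row per team-season. The column names and values are made up for illustration and are not the project data.

```r
# Hypothetical team-season summary: raw 4-in-6 counts and games played
team_seasons <- data.frame(
  team        = c("AAA", "AAA", "BBB", "BBB"),
  season      = c(2019, 2020, 2019, 2020),
  four_in_six = c(24, 20, 27, 22),
  games       = c(82, 72, 82, 72)
)

# Scale each count to an 82-game season, then average across team-seasons
team_seasons$per82 <- team_seasons$four_in_six * 82 / team_seasons$games
avg_per82 <- mean(team_seasons$per82)
avg_per82
```

Scaling before averaging keeps shortened seasons (such as a 72-game season) from pulling the average down.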

Question 3

Which of the 30 NBA teams has had the highest average number of 4-in-6 stretches between 2014-15 and 2023-24? Which team has had the lowest average? Adjust each team/season to per-82 games. Is the difference between most and least from Q3 surprising, or do you expect that size difference is likely to be the result of chance?

## Lowest: NYK , 22.19
## Highest: CHA , 28.11

ANSWER 3:

  • Most 4-in-6 stretches on average: CHA (28.11)
  • Fewest 4-in-6 stretches on average: NYK (22.19)

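The difference is somewhat surprising. A two-sample t-test comparing the two teams’ season values produced a p-value of 0.0478, meaning that if there were no true difference in 4-in-6 stretches between CHA and NYK, we would expect to see a difference at least this large about 4.8% of the time by random chance. That is statistically significant at the 5% level, but only narrowly, and because CHA and NYK were picked as the two extremes out of 30 teams, some of the gap is expected from chance alone.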
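
A two-sample comparison of this kind can be sketched with `t.test()`. The values below are made up, and this uses R's default Welch test, which is an assumption; the test reported above may have used different settings.

```r
# Illustrative per-82 counts for two teams over ten seasons (made-up values)
team_a <- c(29, 27, 30, 26, 28, 31, 27, 29, 28, 26)
team_b <- c(23, 25, 21, 24, 22, 20, 23, 22, 24, 21)

# Welch two-sample t-test (the default when two vectors are supplied)
result <- t.test(team_a, team_b)
result$p.value
```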

Question 4

What was BKN’s defensive eFG% in the 2023-24 season? What was their defensive eFG% that season in situations where their opponent was on the second night of back-to-back?

All opponents:

total_fgm  total_fga  total_fg3m  eFG_pct
     3410       7255        1066    0.543

Opponents on the second night of a back-to-back:

total_fgm  total_fga  total_fg3m  eFG_pct
      650       1418         217    0.535

ANSWER 4:

  • BKN Defensive eFG%: 54.3%
  • When opponent on a B2B: 53.5%

Based on these findings, it appears that a team’s shooting performance against the Nets during the 2023-24 season didn’t really change regardless of whether or not they were playing on the second night of a back-to-back.
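
For reference, effective field goal percentage is (FGM + 0.5 × 3PM) / FGA. A short sketch reproducing the two values from the totals in the tables above (the helper name `efg_pct` is my own):

```r
# Effective field goal percentage: a made three counts as 1.5 field goals
efg_pct <- function(fgm, fg3m, fga) (fgm + 0.5 * fg3m) / fga

efg_all <- round(efg_pct(3410, 1066, 7255), 3)  # all opponents
efg_b2b <- round(efg_pct(650, 217, 1418), 3)    # opponents on a B2B
c(all = efg_all, b2b = efg_b2b)
```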

Question 5

Please identify at least 2 trends in scheduling over time. In other words, how are the more recent schedules different from the schedules of the past?

Trend 1: Measuring the Average Length of Home and Away Stretches in Each Season

PLOT EXPLANATION: The plot shows the average length of home and away stretches in each season. The home and away averages are about the same in every season. From 2014 to 2019, the average stretch was about 2 games. In 2020, it rose to about 2.5 games, likely due to COVID scheduling. Since 2021, the average stretch has been about 2.25 games and has increased slightly each year.
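
One way to measure stretch length is run-length encoding. This is a minimal sketch with a made-up home/away sequence, not the code used for the plot.

```r
# Home/away flags for one team's games in date order (made-up sequence)
location <- c("H", "H", "A", "A", "A", "H", "A", "A")

# rle() collapses consecutive repeats into run lengths
runs <- rle(location)

# Mean run length for away (A) and home (H) stretches
avg_stretch <- tapply(runs$lengths, runs$values, mean)
avg_stretch
```

Here the runs are H×2, A×3, H×1, A×2, so the away average is 2.5 and the home average is 1.5.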

Trend 2: Investigating How Many Back-to-backs Happen per Season Over Time

PLOT EXPLANATION: The plot shows the decrease in the number of back-to-back games a team typically plays since the 2014 season. The overall decreasing trend (with the exception of the unusual COVID year) could reflect the league attempting to limit player injuries and fatigue.
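
Counting back-to-backs comes down to checking for one-day gaps between consecutive games. A minimal sketch with made-up dates:

```r
# One team's game dates in order (made-up values)
dates <- as.Date(c("2024-10-24", "2024-10-25", "2024-10-27",
                   "2024-10-29", "2024-10-30", "2024-11-02"))

# A game is the second night of a back-to-back when it is played
# exactly one day after the previous game; the first game never is
is_b2b <- c(FALSE, as.integer(diff(dates)) == 1)
n_b2b <- sum(is_b2b)
n_b2b
```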

Question 6

Please design a plotting tool to help visualize a team’s schedule for a season. The plot should cover the whole season and should help the viewer contextualize and understand a team’s schedule, potentially highlighting periods of excessive travel, dense blocks of games, or other schedule anomalies.

The tool is applied to DEN’s 80-game 2024-25 schedule.

Plot Usage:

This plot can be used to show a team’s front office what their schedule density looks like for an upcoming season without the need to deeply analyze the data on their own. It can help the front office and coaching staff plan ahead on things such as playing time, preparation, and training routines based on schedule density in different parts of the season. This will allow the team to rest players effectively to maximize team performance and reduce the chance of injury and fatigue.
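
A rough sketch of how such a plot could be assembled is below. The column names, the "games in the last 7 days" density measure, and the hover layout are assumptions for illustration, not the actual tool. The plotly call is guarded so the data preparation still runs if the package is not installed.

```r
# Made-up schedule rows; column names are assumptions for this sketch
sched <- data.frame(
  date     = as.Date(c("2024-10-24", "2024-10-26", "2024-10-27", "2024-10-29")),
  opponent = c("OKC", "LAC", "TOR", "BKN"),
  home     = c(TRUE, TRUE, FALSE, FALSE)
)

# Second night of a back-to-back: played one day after the previous game
sched$b2b <- c(FALSE, as.integer(diff(sched$date)) == 1)

# Density measure: games played in the 7 days ending on each game date
sched$games_last7 <- sapply(seq_along(sched$date), function(i) {
  sum(sched$date <= sched$date[i] & sched$date > sched$date[i] - 7)
})

# Hover text: date, opponent with home/away marker, and B2B status
sched$hover <- paste0(
  format(sched$date), "<br>",
  ifelse(sched$home, "vs ", "@ "), sched$opponent, "<br>",
  ifelse(sched$b2b, "2nd night of B2B", "Not a B2B")
)

if (requireNamespace("plotly", quietly = TRUE)) {
  plotly::plot_ly(sched, x = ~date, y = ~games_last7,
                  type = "scatter", mode = "lines+markers",
                  text = ~hover, hoverinfo = "text")
}
```

Filtering the schedule to a different 3-letter team abbreviation before these steps would produce the same plot for another team.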

Question 7

Please estimate how many more/fewer regular season wins each team has had due to schedule-related factors from 2019-20 through 2023-24. The final answer for each team is the total number of games won/lost from 2019-20 through 2023-24, not a season average.

## 
## Call:
## glm(formula = win ~ home + opp_winpct + east_km + west_km + rest_days + 
##     b2b + four_in_six, family = binomial, data = sched_loc)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.410e+00  6.328e-02 -38.079  < 2e-16 ***
## home         5.900e-01  2.812e-02  20.984  < 2e-16 ***
## opp_winpct   4.413e+00  9.998e-02  44.146  < 2e-16 ***
## east_km      1.018e-05  1.946e-05   0.523    0.601    
## west_km     -6.611e-06  1.936e-05  -0.342    0.733    
## rest_days   -2.700e-04  6.970e-04  -0.387    0.698    
## b2b         -1.976e-01  3.683e-02  -5.364 8.12e-08 ***
## four_in_six -6.396e-02  3.630e-02  -1.762    0.078 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 33213  on 23957  degrees of freedom
## Residual deviance: 30536  on 23950  degrees of freedom
## AIC: 30552
## 
## Number of Fisher Scoring iterations: 4
team schedule_wins
PHX 19.18341
LAL 19.09084
DEN 19.04653
MEM 18.92139
HOU 18.86539
DAL 18.85763
OKC 18.79604
BOS 18.77284
UTA 18.75423
NOP 18.73602
BKN 18.71351
LAC 18.66042
MIL 18.64544
MIA 18.60822
PHI 18.58958
TOR 18.55851
GSW 18.51565
NYK 18.51152
CLE 18.48231
IND 18.46192
SAC 18.46034
POR 18.45848
WAS 18.43538
SAS 18.41728
MIN 18.40147
DET 18.32976
ORL 18.30956
ATL 18.11807
CHI 17.99587
CHA 17.92517
##      variable  odds_ratio prob_baseline   prob_new    prob_shift
## 1 (Intercept)  0.08985274           0.5 0.08244485 -4.175552e-01
## 2        home  1.80403435           0.5 0.64337099  1.433710e-01
## 3  opp_winpct 82.55776159           0.5 0.98803223  4.880322e-01
## 4     east_km  1.00001018           0.5 0.50000254  2.544395e-06
## 5     west_km  0.99999339           0.5 0.49999835 -1.652676e-06
## 6   rest_days  0.99973000           0.5 0.49993249 -6.750957e-05
## 7         b2b  0.82070827           0.5 0.45076319 -4.923681e-02
## 8 four_in_six  0.93804085           0.5 0.48401501 -1.598499e-02
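
The odds_ratio, prob_new, and prob_shift columns can be reproduced from the coefficients. This is a sketch of the arithmetic under the 50/50 baseline, using rounded coefficients copied from the model summary:

```r
# Coefficients (log-odds) copied from the model summary
beta <- c(home = 0.5900, b2b = -0.1976, four_in_six = -0.06396)

odds_ratio <- exp(beta)
# Starting from even odds (p = 0.5, odds = 1), the new odds equal the
# odds ratio, so the new probability is OR / (1 + OR) = plogis(beta)
prob_new   <- odds_ratio / (1 + odds_ratio)
prob_shift <- prob_new - 0.5
round(cbind(odds_ratio, prob_new, prob_shift), 4)

# Shift from a 0.1 increase in opp_winpct (coefficient 4.413)
shift_opp_01 <- plogis(4.413 * 0.1) - 0.5
shift_opp_01
```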

Explanation of Regression Output and Model Diagnostics:

This is a logistic regression model that compares a team’s predicted probability of winning each game as scheduled to the “neutral” probability of winning that game. The win probability is given by

$$P(\text{win}) = \frac{1}{1 + e^{-\eta}}$$

$$\eta = -2.41 + 0.59\,\text{home} + 4.41\,\text{opp\_winpct} + 0.00001\,\text{east\_km} - 0.0000066\,\text{west\_km} - 0.00027\,\text{rest\_days} - 0.198\,\text{b2b} - 0.064\,\text{four\_in\_six}$$

Note: opp_winpct is the league’s win % against the opponent, meaning how often the rest of the league had beaten your opponent before you played them. A lower % means it’s a good team, so you’d expect a lower probability of winning.

In the prob_effects output, prob_baseline assumes an even 50/50 matchup, and prob_new is the probability of winning after adding one unit of that factor while holding all the other factors the same. For example, looking at the home factor and holding everything else the same (no regard to distance traveled, b2b’s, etc.), win probability increases by about 14 percentage points when playing at home (0.64 - 0.50). home, b2b, and four_in_six are binary variables; opp_winpct, east_km, west_km, and rest_days are continuous.

For opp_winpct, a one-unit increase covers its entire range (for example, from an opponent the league never beats to one the league always beats) and moves the win probability from 0.50 to about 0.99. This is obviously not a realistic change. Scaling the increase to 0.1 rather than 1, the probability of beating your opponent goes up by about 11 percentage points when the league’s winning % against your opponent increases by 0.1 (.500 to .600).

Then, for every team, every game from the 2019-20 season onward is plugged into the model equation to estimate the probability of winning that game. These probabilities are summed for each of the 30 teams, and the total is called the “actual prediction.” We then repeat this process by plugging in 0 for every schedule variable to create the most neutral (no advantage or disadvantage) game possible and calculate the probability of winning each game. These probabilities are summed for every team and called the “neutral prediction.” Finally, we calculate actual minus neutral for every team, and the difference is the number of games won or lost due to schedule-related factors.

NOTE: Zero is not plugged in for opp_winpct in the neutral probability calculation because the one factor you can’t control is who the team plays.
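
The actual-minus-neutral calculation can be sketched as below. The data are simulated and the model uses fewer variables than the real one, so this only illustrates the mechanics: fit the model, predict each game as scheduled, predict again with the schedule factors zeroed out, and sum the differences by team.

```r
set.seed(1)
n <- 2000

# Simulated games; column names mirror the model but the data are made up
games <- data.frame(
  team       = sample(c("AAA", "BBB"), n, replace = TRUE),
  home       = rbinom(n, 1, 0.5),
  opp_winpct = runif(n, 0.3, 0.7),
  b2b        = rbinom(n, 1, 0.2)
)
games$win <- rbinom(n, 1, plogis(-2.4 + 0.6 * games$home +
                                   4.4 * games$opp_winpct - 0.2 * games$b2b))

fit <- glm(win ~ home + opp_winpct + b2b, family = binomial, data = games)

# Actual prediction: win probability for each game as scheduled
games$p_actual <- predict(fit, newdata = games, type = "response")

# Neutral prediction: zero out schedule factors, keep opponent strength
neutral <- transform(games, home = 0, b2b = 0)
games$p_neutral <- predict(fit, newdata = neutral, type = "response")

# Sum the per-game differences by team
schedule_wins <- tapply(games$p_actual - games$p_neutral, games$team, sum)
schedule_wins
```

Because the neutral version sets home to 0, any team that plays a normal share of home games will tend to show a positive total under this approach.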

Reasoning for Including Each Variable in Model:

Home was chosen to see how much playing at home affected a team’s chances of winning. Opponent win % was used to measure the strength of the opponent in any given game and the effect it had on a team’s chances of winning. East and west kilometers traveled from the previous game were selected to see if distance traveled or a timezone change had an effect on a team’s chances of winning. East and west were kept as separate variables because a team “loses time” when traveling from a western to an eastern timezone and “gains time” when traveling the other way. Rest days, back-to-backs, and 4-in-6’s were all density variables used to measure how having more or less time between games affects a team’s chances of winning.
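
As an illustration of how east and west travel distances could be derived from arena latitude/longitude, here is a sketch using the haversine great-circle formula. Assigning the whole distance to east or west based on the change in longitude is an assumption and may differ from the approach used in the model; the coordinates are approximate.

```r
# Great-circle distance in km between two points (haversine formula)
haversine_km <- function(lat1, lon1, lat2, lon2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(sqrt(a))
}

# Previous game in Denver, next game in New York (approximate coordinates)
prev_lon <- -104.99
next_lon <- -73.99
d <- haversine_km(39.75, prev_lon, 40.75, next_lon)

# Longitude increased, so the trip was eastward
moved_east <- next_lon > prev_lon
east_km <- if (moved_east) d else 0
west_km <- if (moved_east) 0 else d
c(east_km = east_km, west_km = west_km)
```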

Analysis of Results:

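A range of only about 1.3 wins between the highest team (PHX, 19.18) and the lowest team (CHA, 17.93) suggests that the scheduling committee does a pretty good job of making schedules as fair as possible for every team. Every team shows a gain, which likely reflects home-court advantage, since the neutral baseline sets home to 0 and every team plays about half of its games at home.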

Thank you and Conclusion:

Thank you for taking the time to look at my analysis and findings. This activity greatly improved my overall coding ability, data analytics and visualization skills, and my ability to work with HTML files in R. I hope you found these findings informative and insightful.