This notebook queries the UW Madison database and runs a hypothesis test for each day of the week, comparing the GPAs of sections that meet on that day against the GPAs of sections that do not.
For each day:
$H_0$ = There is no difference between this day's GPAs and the other days' GPAs
$H_A$ = There is a difference between this day's GPAs and the other days' GPAs
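Written in terms of population means, and assuming the quantity being compared is the mean section GPA (which is what the two-sample t-test used later in this notebook compares), the hypotheses for a given day $d$ can be sketched as follows. The symbols $\mu_d$ and $\mu_{\neg d}$ are notation introduced here for illustration: the mean GPA of sections that meet on day $d$ and of sections that do not.

$$
H_0:\ \mu_d = \mu_{\neg d}
\qquad \text{vs.} \qquad
H_A:\ \mu_d \neq \mu_{\neg d}
$$

The alternative is two-sided, which matches the default behavior of `stats.ttest_ind`.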
from sqlalchemy import create_engine
from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from statsmodels.stats.multitest import multipletests
engine = create_engine('postgresql:///uw_madison')
query = """SELECT * FROM days_gpas;"""
all_days_df = pd.read_sql(query, engine)
all_days_df.describe()
all_days_df.head()
# Drop sections whose GPA is exactly 4.0
all_days_df = all_days_df.loc[all_days_df['section_gpa'] != 4.0]
all_days_df.describe()
days = ['mon', 'tues', 'wed', 'thurs', 'fri', 'sat', 'sun']
gpas = []
p_vals = {}
one_d_p_vals = []
for day in days:
    # Split section GPAs by whether the section meets on this day
    day_df = all_days_df.loc[all_days_df[day] == 'true', 'section_gpa']
    not_day_df = all_days_df.loc[all_days_df[day] != 'true', 'section_gpa']
    # Draw 70 sections from each group without replacement
    day_choice = np.random.choice(day_df, size=70, replace=False)
    not_day_choice = np.random.choice(not_day_df, size=70, replace=False)
    gpas.append({"day": day_choice, "not_day": not_day_choice})
    # Welch's t-test (unequal variances); run once and reuse the result
    result = stats.ttest_ind(day_choice, not_day_choice, equal_var=False)
    p_vals[day] = result
    one_d_p_vals.append(result[1])
full_days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
fig, ax = plt.subplots(nrows=7, ncols=1, figsize=(15,20))
for row, day_name in enumerate(full_days):
    day_sample = gpas[row]["day"]
    other_sample = gpas[row]["not_day"]
    ax[row].hist(day_sample, alpha=0.5, bins=25, range=(2.5, 4.0), label=f'{day_name} GPAs')
    ax[row].axvline(day_sample.mean(), color='#1f77b4', alpha=1, linestyle='dashed', label=f'{day_name} Mean')
    ax[row].hist(other_sample, color='#ff7f0e', alpha=0.5, bins=25, range=(2.5, 4.0), label=f'GPAs from Days Other than {day_name}')
    ax[row].axvline(other_sample.mean(), color='#ff7f0e', alpha=1, linestyle='dashed', label=f'Days Other than {day_name} Mean')
    ax[row].legend()
    ax[row].set_xlabel('GPA')
fig.tight_layout()
p_vals
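Passing `equal_var=False` to `stats.ttest_ind` selects Welch's t-test, which does not assume the two groups share a variance. As a rough illustration of what that statistic measures, here is a minimal sketch in plain Python that computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom. The function name `welch_t` and the sample values are made up for this example and are not taken from the database.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Sample variances (n - 1 denominator), each divided by its sample size
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    t_stat = mean_diff / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


# Hypothetical GPA samples, for illustration only
day_sample = [3.1, 3.4, 3.3, 3.6, 3.2]
other_sample = [3.5, 3.7, 3.6, 3.8, 3.9]
t_stat, dof = welch_t(day_sample, other_sample)
print(f"t = {t_stat:.3f}, dof = {dof:.2f}")
```

The p-value that SciPy reports is the two-sided tail probability of this t statistic under a t distribution with `dof` degrees of freedom; that step is omitted here because it needs the t distribution's CDF.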
We can see from the plots and the t statistics of the random samples that, for the weekdays, the GPAs do not differ by a statistically significant amount, so we fail to reject the null hypothesis for each weekday.
However, the classes that meet on weekends have GPAs that are statistically distinct from those of the other days. For Saturday and Sunday, we reject the null hypothesis.
multipletests(one_d_p_vals)
After correcting for multiple tests, we fail to reject the null hypothesis for Monday through Thursday and reject it for Friday through Sunday. Note that `multipletests` defaults to the Holm-Šidák method (`method='hs'`) rather than the Bonferroni correction.
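For reference, the Bonferroni correction itself is simple to state: with $m$ tests, each p-value is multiplied by $m$ (capped at 1) and then compared against the original significance level. The sketch below shows this in plain Python. The function `bonferroni` and the list `example_p_vals` are names introduced here, and the p-values are invented for illustration rather than taken from this notebook's results.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (reject flags, adjusted p-values) under the Bonferroni correction."""
    m = len(p_values)
    # Multiply each p-value by the number of tests, capping at 1
    adjusted = [min(1.0, p * m) for p in p_values]
    reject = [p_adj <= alpha for p_adj in adjusted]
    return reject, adjusted


# Hypothetical p-values for seven tests, for illustration only
example_p_vals = [0.40, 0.25, 0.60, 0.03, 0.006, 0.0001, 0.00002]
reject, adjusted = bonferroni(example_p_vals)
for p_raw, p_adj, flag in zip(example_p_vals, adjusted, reject):
    print(f"raw={p_raw:<8} adjusted={p_adj:.5f} reject={flag}")
```

In this made-up example the fourth test (raw p-value 0.03) would be rejected at the 0.05 level on its own but is not rejected once the correction is applied. In statsmodels, the Bonferroni correction can be requested explicitly with `multipletests(one_d_p_vals, method='bonferroni')`.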
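Because the classes are sampled at random and no seed is fixed, these results can change from run to run. A borderline day such as Friday can therefore land on either side of the significance threshold.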
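One common way to make a random draw repeatable is to fix the generator's seed. The sketch below illustrates the idea with Python's standard `random` module; the population, the helper `draw_sample`, and the seed value are all hypothetical. In the notebook's NumPy code, the analogous step would be calling `np.random.seed(...)` before the sampling loop or drawing from `np.random.default_rng(seed)`.

```python
import random

# Hypothetical population of section GPAs, for illustration only
population = [round(2.5 + 0.01 * i, 2) for i in range(150)]


def draw_sample(values, size, seed):
    """Draw `size` values without replacement using a dedicated, seeded generator."""
    rng = random.Random(seed)
    return rng.sample(values, size)


first = draw_sample(population, 70, seed=42)
second = draw_sample(population, 70, seed=42)
print(first == second)  # same seed, same sample
```

Fixing the seed makes a single run reproducible, but it does not remove the sampling variability itself: a different seed can still lead to a different conclusion for a day whose p-value sits near the threshold.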