Comparing GPAs of Days of the Week at UW Madison since 2006

This notebook queries the UW Madison database with SQL and, for each day of the week, runs a hypothesis test comparing the GPAs of sections that meet on that day against the GPAs of sections that do not.

For each day:

$H_0$ = There is no difference between this day's GPAs and the other days' GPAs

$H_A$ = There is a difference between this day's GPAs and the other days' GPAs
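
Assuming the comparison is between mean section GPAs (the quantity the t-test used later in this notebook assesses), the hypotheses can be written more compactly. The symbols $\mu_{\text{day}}$ and $\mu_{\text{other}}$ are notation introduced here for illustration and do not appear in the notebook's code:

$$H_0: \mu_{\text{day}} = \mu_{\text{other}} \qquad H_A: \mu_{\text{day}} \neq \mu_{\text{other}}$$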


Import necessary Python libraries and modules

In [1]:
from sqlalchemy import create_engine
from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from statsmodels.stats.multitest import multipletests

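Query the database into a single pandas dataframe, then filter out any sections whose GPA is exactly 4.0. In this data the filter removes nothing: the maximum GPA is already 3.998418, and the row count stays at 175962 before and after.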

In [2]:
engine = create_engine('postgresql:///uw_madison')
In [3]:
query = """SELECT * FROM days_gpas;"""
In [4]:
all_days_df = pd.read_sql(query, engine)
all_days_df.describe()
Out[4]:
section_number section_gpa
count 175962.000000 175962.000000
mean 5.318677 3.414629
std 11.613988 0.363838
min 1.000000 0.000000
25% 1.000000 3.154255
50% 1.000000 3.447619
75% 4.000000 3.716216
max 234.000000 3.998418
In [5]:
all_days_df.head()
Out[5]:
uuid name mon tues wed thurs fri sat sun subject_name section_number section_gpa
0 00015734-b612-3152-bf5f-7f6855e1c0c0 Fundamentals of Analytical Sci false false true false false false false Chemistry 1 3.117647
1 00015734-b612-3152-bf5f-7f6855e1c0c0 Fundamentals of Analytical Sci false true false true false false false Chemistry 1 3.117647
2 0002389b-0bda-3f47-b5e7-e9d8973cb2e9 Corporate&Special Info Serv true false false false false false false Library and Information Studies 1 3.818182
3 00028b06-6e42-3a3e-b484-69fd61baf978 Spectrochemical Measuremts true false true false true false false Chemistry 1 3.425926
4 0006b3f8-3403-35d5-bda2-9489e3c84434 Consultatn&Suprvsn-Soc Work false true false false false false false Social Work 1 3.730769
In [6]:
all_days_df = all_days_df.loc[all_days_df['section_gpa'] != 4.0]
all_days_df.describe()
Out[6]:
section_number section_gpa
count 175962.000000 175962.000000
mean 5.318677 3.414629
std 11.613988 0.363838
min 1.000000 0.000000
25% 1.000000 3.154255
50% 1.000000 3.447619
75% 4.000000 3.716216
max 234.000000 3.998418

For each day, the overall dataframe is split into the GPAs of sections that meet on that day and the GPAs of sections that do not (e.g. a Monday series and a series for everything besides Monday). A random sample of 70 GPAs is then drawn from each group without replacement, and the plots and hypothesis tests are run on those samples.

In [7]:
days = ['mon', 'tues', 'wed', 'thurs', 'fri', 'sat', 'sun']
In [17]:
gpas = []
p_vals = {}
one_d_p_vals = []

for day in days:
    # The day columns hold the strings 'true'/'false', so compare against 'true'
    is_day = all_days_df[day] == 'true'

    # Draw 70 section GPAs without replacement from each group
    day_df = all_days_df.loc[is_day, 'section_gpa']
    day_choice = np.random.choice(day_df, size=70, replace=False)

    not_day_df = all_days_df.loc[~is_day, 'section_gpa']
    not_day_choice = np.random.choice(not_day_df, size=70, replace=False)
    gpas.append({"day": day_choice, "not_day": not_day_choice})

    # Welch's t-test (unequal variances); run it once and reuse the result
    result = stats.ttest_ind(day_choice, not_day_choice, equal_var=False)
    p_vals[day] = result
    one_d_p_vals.append(result[1])
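
The sampling above is unseeded, so every run draws different sections. A minimal sketch of a reproducible variant follows, assuming NumPy's `default_rng` generator is acceptable in place of `np.random.choice`. The helper name `draw_sample`, the seed value, and the stand-in `population` array are all hypothetical and not part of the notebook.

```python
import numpy as np


def draw_sample(values, size=70, seed=0):
    """Draw `size` values without replacement using a seeded generator."""
    rng = np.random.default_rng(seed)  # same seed -> same draw
    return rng.choice(np.asarray(values), size=size, replace=False)


# Stand-in for a section_gpa column (made-up values, not the UW Madison data)
population = np.linspace(2.0, 4.0, 500)

first = draw_sample(population, seed=42)
second = draw_sample(population, seed=42)
print(np.array_equal(first, second))
```

In the loop itself, one option would be to create a single generator before the loop and call its `choice` method for both groups, so that the whole run is reproducible from one seed.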

Generate plots comparing each day against all other days

In [9]:
full_days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
In [10]:
fig, ax = plt.subplots(nrows=7, ncols=1, figsize=(15, 20))

# One subplot per day: overlay the day's sample on the other-days sample
for axis, sample, name in zip(ax, gpas, full_days):
    axis.hist(sample["day"], alpha=0.5, bins=25, range=(2.5, 4.0), label=f'{name} GPAs')
    axis.axvline(sample["day"].mean(), color='#1f77b4', alpha=1, linestyle='dashed', label=f'{name} Mean')
    axis.hist(sample["not_day"], color='#ff7f0e', alpha=0.5, bins=25, range=(2.5, 4.0), label=f'GPAs from Days Other than {name}')
    axis.axvline(sample["not_day"].mean(), color='#ff7f0e', alpha=1, linestyle='dashed', label=f'Days Other than {name} Mean')
    axis.legend()
    axis.set_xlabel('GPA')

fig.tight_layout()

Display the t statistic and p-value for each day against all other days

In [11]:
p_vals
Out[11]:
{'mon': Ttest_indResult(statistic=0.3861586905726627, pvalue=0.6999807413618497),
 'tues': Ttest_indResult(statistic=1.158900732637778, pvalue=0.24868242363225904),
 'wed': Ttest_indResult(statistic=0.1702067405316272, pvalue=0.865101860131468),
 'thurs': Ttest_indResult(statistic=-0.7744049316301642, pvalue=0.4400190061996502),
 'fri': Ttest_indResult(statistic=-2.8518213298163033, pvalue=0.0050304608883447715),
 'sat': Ttest_indResult(statistic=4.211019878637128, pvalue=4.8332157044897395e-05),
 'sun': Ttest_indResult(statistic=5.624870922850782, pvalue=1.1384256658117605e-07)}

The plots and t statistics for the random samples show that, for Monday through Thursday, the GPAs do not differ by a statistically significant amount (every p-value is above 0.24), so we fail to reject the null hypothesis for those days. Friday is the exception among the weekdays: its p-value of about 0.005 is below the conventional 0.05 threshold, so we reject the null hypothesis for Friday in this sample.

The classes that meet on weekends also have GPAs that are statistically distinct from the other days. For Saturday and Sunday, we reject the null hypothesis.
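
For reference, `equal_var=False` asks `stats.ttest_ind` for Welch's t-test. The sketch below computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom by hand with the standard library. The function name `welch_t` and the two short GPA lists are made up for illustration and are not taken from the database.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each mean (sample variance uses n - 1)
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical GPAs, for illustration only
day = [3.1, 3.4, 3.6, 3.2, 3.5]
other = [3.3, 3.7, 3.8, 3.6, 3.9]

t_stat, dof = welch_t(day, other)
print(t_stat, dof)
```

A negative statistic means the first sample has the lower mean, which matches the sign convention in the output above, where the day's sample is passed first.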

Correct for multiple comparisons (Holm-Sidak, the multipletests default)

In [18]:
multipletests(one_d_p_vals)
Out[18]:
(array([False, False, False, False,  True,  True,  True]),
 array([9.92773724e-01, 8.64354734e-01, 8.64354734e-01, 8.64354734e-01,
        3.30240674e-03, 2.19425913e-06, 5.30000743e-06]),
 0.007300831979014655,
 0.0071428571428571435)

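With its default arguments, multipletests applies the Holm-Sidak correction rather than Bonferroni (a Bonferroni correction requires method='bonferroni'). After this correction for multiple tests, we fail to reject the null hypothesis for Monday through Thursday, and reject the null hypothesis for Friday through Sunday.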
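
To make the difference between the two corrections concrete, here is a minimal hand-rolled sketch of Bonferroni and Holm-Sidak adjusted p-values, applied to the raw p-values printed in Out[11]. The function names are hypothetical and the code follows the textbook definitions; it is not a copy of the statsmodels implementation, and the 0.05 threshold is an assumed significance level.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm_sidak(p_values):
    """Step-down Sidak adjustment, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is adjusted as if m - k + 1 tests remain
        candidate = 1.0 - (1.0 - p_values[i]) ** (m - rank)
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


# Raw p-values copied from Out[11], Monday through Sunday
raw = [0.6999807413618497, 0.24868242363225904, 0.865101860131468,
       0.4400190061996502, 0.0050304608883447715,
       4.8332157044897395e-05, 1.1384256658117605e-07]
alpha = 0.05  # assumed significance level

bonf = bonferroni(raw)
hs = holm_sidak(raw)
bonf_reject = [p <= alpha for p in bonf]
hs_reject = [p <= alpha for p in hs]
print(bonf_reject)
print(hs_reject)
```

Holm-Sidak is never more conservative than Bonferroni, so its adjusted p-values are at most as large as the Bonferroni ones for the same input.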

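Because the classes are sampled at random without a fixed seed, the results can change from run to run. The execution counts show this: the sampling cell was last run as In [17], after Out[11] was displayed, so Out[11] and Out[18] come from different random draws.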
