This notebook queries the UW Madison database with SQL and runs a hypothesis test comparing the GPAs of STEM and non-STEM classes. A class counts as STEM if its subject has books listed in the Teaching and Learning Research Guide for STEM and SBE libraries: https://researchguides.library.wisc.edu/teaching_learning
$H_0$: There is no difference between the mean GPAs of STEM and non-STEM classes
$H_A$: There is a difference between the mean GPAs of STEM and non-STEM classes
from sqlalchemy import create_engine
from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
engine = create_engine('postgresql:///uw_madison')
stem_query = """SELECT * FROM stem_classes;"""
stem_df = pd.read_sql(stem_query, engine)
stem_df.describe()
non_stem_query = """SELECT * FROM non_stem_classes;"""
non_stem_df = pd.read_sql(non_stem_query, engine)
non_stem_df.describe()
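# Keep only the section_gpa column, dropping sections whose GPA is exactly 4.0
stem_df = stem_df.loc[stem_df['section_gpa'] != 4.0, 'section_gpa']
stem_df.describe()
# Apply the same filter to the non-STEM sections
non_stem_df = non_stem_df.loc[non_stem_df['section_gpa'] != 4.0, 'section_gpa']
non_stem_df.describe()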
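# Draw 2,000 sections from each group without replacement
# (no seed is set, so the samples differ between runs)
stem_choice = np.random.choice(stem_df, size=2000, replace=False)
non_stem_choice = np.random.choice(non_stem_df, size=2000, replace=False)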
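Because `np.random.choice` is called without a seed, each run draws a different sample and the downstream statistics shift slightly. A minimal sketch of a reproducible draw, assuming NumPy's `Generator` API is available; the helper name `draw_sample`, the seed value, and the synthetic GPA array are hypothetical stand-ins, not the notebook's data:

```python
import numpy as np

def draw_sample(values, size, seed):
    """Return `size` values drawn without replacement using a seeded generator."""
    rng = np.random.default_rng(seed)
    return rng.choice(np.asarray(values), size=size, replace=False)

# Synthetic stand-in for a section_gpa column (hypothetical, not the real data).
fake_gpas = np.linspace(1.0, 3.99, 5000)

first = draw_sample(fake_gpas, size=2000, seed=42)
second = draw_sample(fake_gpas, size=2000, seed=42)

# The same seed yields the same sample, so a rerun reproduces the same test result.
print(np.array_equal(first, second))
```

Fixing the seed trades a fresh random draw for repeatability, which is usually the better choice when a notebook's conclusion is meant to be rechecked.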
fig, ax = plt.subplots()
ax.hist(stem_choice, density=True, color='#1f77b4', alpha=0.5, bins=50, range=(1.0, 4.0), label='STEM Classes')
ax.axvline(stem_choice.mean(), color='#1f77b4', alpha=1, linestyle='dashed', label='STEM Class Mean')
ax.hist(non_stem_choice, density=True, color='#ff7f0e', alpha=0.5, bins=50, range=(1.0, 4.0), label='Non-STEM Classes')
ax.axvline(non_stem_choice.mean(), color='#ff7f0e', alpha=1, linestyle='dashed', label='Non-STEM Class Mean')
ax.legend()
ax.set_xlabel('GPA');
# Welch's t-test: equal_var=False means the two groups are not assumed to share a variance
stats.ttest_ind(stem_choice, non_stem_choice, equal_var=False)
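With `equal_var=False`, `stats.ttest_ind` performs Welch's t-test, which does not assume that the two groups share a variance. The sketch below computes the Welch statistic and the Welch-Satterthwaite degrees of freedom by hand with NumPy to show what goes into that number. The function name `welch_t` and the small input lists are illustrative assumptions, not values from the database:

```python
import numpy as np

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom) for two samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Variance of each sample mean, based on the unbiased sample variance.
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    # Difference in means scaled by its estimated standard error.
    t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Approximate degrees of freedom when variances are unequal.
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t_stat, dof

# Made-up GPA values, for illustration only.
group_a = [3.1, 3.4, 2.9, 3.6, 3.2]
group_b = [3.5, 3.7, 3.3, 3.8, 3.6]

t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof)
```

The statistic returned here should agree with the first value from `stats.ttest_ind(group_a, group_b, equal_var=False)`; the p-value then comes from a t distribution with `dof` degrees of freedom.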
The histograms suggest that the GPA distributions of STEM and non-STEM classes differ, and the t-statistic and p-value from Welch's t-test show that the difference in mean GPA is statistically significant.
As such, we reject the null hypothesis.