Comparing STEM and non-STEM GPAs at UW Madison since 2006¶

This notebook performs a SQL query on the UW Madison database and does a hypothesis test comparing STEM and non-STEM classes. STEM classes are defined by which subjects have their books in the Teaching and Learning Research Guide for STEM and SBE libraries: https://researchguides.library.wisc.edu/teaching_learning

$H_0$ = There is no difference between GPAs of STEM and non-STEM classes

$H_A$ = There is a difference between the GPAs of STEM and non-STEM classes

Import necessary Python libraries and modules¶

from sqlalchemy import create_engine
from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

Query the database and create dataframes of STEM and non-STEM classes¶

engine = create_engine('postgresql:///uw_madison')

stem_query = """SELECT * FROM stem_classes;"""

stem_df = pd.read_sql(stem_query, engine)
stem_df.describe()

non_stem_query = """SELECT * FROM non_stem_classes;"""

non_stem_df = pd.read_sql(non_stem_query, engine)
non_stem_df.describe()

Remove the classes where the average GPA is 4.0¶

stem_df = stem_df.loc[stem_df['section_gpa'] != 4.0]['section_gpa']
stem_df.describe()

count    29263.000000
mean         3.385967
std          0.379954
min          0.000000
25%          3.117647
50%          3.416667
75%          3.700000
max          3.996622
Name: section_gpa, dtype: float64

non_stem_df = non_stem_df.loc[non_stem_df['section_gpa'] != 4.0]['section_gpa']
non_stem_df.describe()

count    57333.000000
mean         3.517342
std          0.325315
min          0.000000
25%          3.300000
50%          3.562500
75%          3.789474
max          3.998418
Name: section_gpa, dtype: float64

Choose 2,000 random classes from each population to do hypothesis testing on¶

stem_choice = np.random.choice(stem_df, size=2000, replace=False)
non_stem_choice = np.random.choice(non_stem_df, size=2000, replace=False)

Generate histogram plots for GPAs¶

fig, ax = plt.subplots()
ax.hist(stem_choice, density=True, color='#1f77b4', alpha=0.5, bins=50, range=(1.0, 4.0), label='STEM Classes')
ax.axvline(stem_choice.mean(), color='#1f77b4', alpha=1, linestyle='dashed',label='STEM Class Mean')
ax.hist(non_stem_choice, density=True, color='#ff7f0e', alpha=0.5, bins=50, range=(1.0, 4.0), label='Non-STEM Classes')
ax.axvline(non_stem_choice.mean(), color='#ff7f0e', alpha=1, linestyle='dashed', label='Non-STEM Class Mean')
ax.legend()
ax.set_xlabel('GPA');

Perform statistical testing for the hypothesis¶

stats.ttest_ind(stem_choice, non_stem_choice, equal_var=False)

Ttest_indResult(statistic=-11.595118593138709, pvalue=1.377867556258258e-30)

The T-statistic and shapes of the histograms show that the GPAs for STEM and non-STEM classes are different and the p-value shows that the differences are significant.

As such, we reject the null hypothesis.

	section_number	section_gpa
count	29263.000000	29263.000000
mean	3.876260	3.385967
std	10.121531	0.379954
min	1.000000	0.000000
25%	1.000000	3.117647
50%	1.000000	3.416667
75%	2.000000	3.700000
max	296.000000	3.996622

	section_number	section_gpa
count	57333.000000	57333.000000
mean	5.135350	3.517342
std	12.715356	0.325315
min	1.000000	0.000000
25%	1.000000	3.300000
50%	1.000000	3.562500
75%	3.000000	3.789474
max	271.000000	3.998418