Aaron Washington Chen, Ph. D.

Data Scientist · Chemical Engineer · Photographer
Email awc33@cornell.edu for Data and Engineering
Email aaron@composedandfocusedstudios.com for Photography

Aaron Washington Chen is a data scientist currently contributing to Analytics Vidhya! Outside of data science, he has a Ph. D. in Chemical Engineering; loves traveling, food, and pets; and runs a small photography business called Composed and Focused Studios. You can find him at various places via the icons below.

Experience

Niantic Labs

Spread Augmented Reality (AR) development and encourage healthy gameplay through inspiring outdoor exploration, exercise, and meaningful social interaction by applying data engineering and data science skills

Lead analyses and reporting on the financial impact of gameplay features inside Pokemon Go, resulting in the redesign of content and project setting for the future

Write extract, transform, and load (ETL) jobs using Airflow and SQL to simplify and accelerate analyses and decision making within primary product and overall organization

Consult as a data scientist across multiple games and the AR platform; assist with OKR setting across game and platform; and act as subject matter liaison between game, product, and platform teams

Define and demonstrate projects leading to development and adoption of an MLOps tech stack (Kubeflow, Ray, and MLflow) on a cross-functional organizational level

Design experiments to investigate the impact of, and reduce excess costs caused by, targeted players

Set and monitor targets for deployment of features inside Pokemon Go

Identify affected users and metrics from app issues and assist with make-good events

Centers for Medicare and Medicaid Services

Improve health & healthcare experiences of >100 million beneficiaries served by the Centers by helping to modernize the Centers’ technology & using data science skills, techniques, & tools to advise leadership

Generated topic models from ~400k unstructured nursing home reports w/ Python & Natural Language Processing (NLP) while providing commentary on & assisting w/ modernizing legacy technology infrastructure

Wrangled CDC & Operation Warp Speed data (millions of time series records) to create QuickSight visualizations for leadership & direct assistance to 15k skilled nursing facilities

Analyzed technology contracts valued at $5 million to recommend improvements to tech stack, data handling, & organizational role responsibilities

Collaborated on impact analyses to determine value of group’s $500 million contracts

Contributed to specifications development of future contract work, data & technology infrastructure, & analytics stack w/ an estimated budget of $1 billion

Developed onboarding and hiring of over 50 data scientists across the Federal government

Projects

Coding Livestream

Working through data science projects live!

I'm currently doing a weekly coding livestream! Here, I just work through data science projects and problems and teach as I'm going!

The project will change with different datasets, and I'll update this portfolio website when a project is "finished".

The first project involves using a simple Pokemon dataset from Kaggle and wrangling it to make something useful: for me, that is a type matchup predictor that I can use in Pokemon Go!

You can find me at Twitch and YouTube.

Check out the Pokemon project here!

MeaLeon

Using Machine Learning and Natural Language Processing to Find New Foods

MeaLeon uses machine learning and natural language processing to suggest new dishes! It takes in a user's dish and the cuisine it's from, looks in a static database, and returns 5 dishes with similar ingredients but different cuisines.
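The lookup described above can be sketched in a few lines. This is only an illustration, not MeaLeon's actual code: the similarity measure (Jaccard overlap of ingredient sets), the tiny in-memory `RECIPES` list, and the `suggest_dishes` helper are all assumptions made up for this example.

```python
# Hypothetical illustration only. MeaLeon's real database and similarity
# measure are not described here, so Jaccard overlap is an assumed stand-in.

def jaccard(a, b):
    """Share of ingredients two dishes have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Made-up static "database" of (dish, cuisine, ingredients).
RECIPES = [
    ("keema roti", "Indian", {"ground beef", "onion", "garam masala", "flatbread"}),
    ("bulgogi wrap", "Korean", {"beef", "garlic", "soy sauce", "lettuce"}),
    ("beef burrito", "Tex-Mex", {"ground beef", "onion", "cumin", "tortilla"}),
    ("kofta pita", "Lebanese", {"ground beef", "onion", "cumin", "pita"}),
]

def suggest_dishes(ingredients, cuisine, recipes=RECIPES, top_n=5):
    """Return up to top_n dish names from other cuisines, most similar first."""
    candidates = [
        (jaccard(ingredients, recipe_ingredients), name)
        for name, recipe_cuisine, recipe_ingredients in recipes
        if recipe_cuisine != cuisine  # only suggest dishes from a different cuisine
    ]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _score, name in candidates[:top_n]]

tacos = {"ground beef", "onion", "cumin", "tortilla"}
suggestions = suggest_dishes(tacos, "Tex-Mex")
print(suggestions)
```

With this toy data, the Tex-Mex burrito is filtered out because it shares the input cuisine, and the remaining dishes are ranked by how many ingredients they share with the tacos.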

The inspiration for MeaLeon has several combined origin stories. I once was making ground beef tacos and realized too late that I had run out of cumin. So, I grabbed the closest thing in my pantry that was mostly cumin: garam masala. Suddenly, Tex-Mex tacos turned into something more like roti. This reminded me of a friend who hated sauerkraut and kimchi, and she quipped "every culture has a disgusting fermented cabbage dish." Lastly, I can get bored of my go-to dishes and was seeking an easier way to create new dishes using mostly the same ingredients. These experiences and thoughts made me wish for something like MeaLeon to exist.

This solo project combines quite a few different technologies into one full-stack data science web app. The MVP is deployed on Heroku via a Flask app and uses primarily Python 3, a RESTful API, HTML, and CSS. Improvements will incorporate web scraping, AWS services, MongoDB, classifier algorithms, and likely neural networks.

Next steps would be to increase the number of recipes in the database by scraping recipe sites from different cuisines, add more cuisines as classifier targets, and perform more advanced machine learning on unclassified recipes to add further recipes to the database.

Check out MeaLeon here!

Ad Hoc Analyses

These one-off analyses were done as short, independent projects.

Predicting Prices in King County, WA

Using one year of sales data to predict prices

This was an initial look into real estate sales data from King County, Washington, and the goal was to use the given one year of data to make a price predictor with linear regression.

My group (pair) worked to make a model that not only increased the R-squared value but also reduced the root mean squared error (RMSE). Our time limit was 4 days.

We ended up with a model with an R-squared score of ~0.8 and an RMSE of ~$130k. These results were obtained after reducing extra features for use in simple linear regression and after using a more sophisticated LassoCV model. Our next steps would be to obtain more data to improve the model, as we found that the LassoCV regression was still underfitting the data, even though we were synthesizing hybrid data from the source data. True to an iterative approach to data science, we will apply our learnings to future work, and this description and file will be updated should an improvement be devised.
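For readers unfamiliar with the two metrics, here is a minimal sketch of how R-squared and RMSE are computed. The prices below are invented for illustration and are not from the King County data; the helper names `r_squared` and `rmse` are likewise just for this example.

```python
import math

def r_squared(actual, predicted):
    """Fraction of variance in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the prices."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up sale prices and model predictions (each prediction is off by $50k).
actual = [300_000, 450_000, 600_000, 750_000]
predicted = [350_000, 400_000, 650_000, 700_000]

print(f"R-squared: {r_squared(actual, predicted):.3f}")
print(f"RMSE: ${rmse(actual, predicted):,.0f}")
```

A better model pushes R-squared toward 1 and RMSE toward 0, which is why the two are tracked together.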

View Jupyter Notebook
View GitHub Repo

How are Soccer Wins Affected by Rain?

Combining SQL with weather API calls

This was an exploration into a SQL database containing results from one year of German soccer matches. It required performing extract, transform, and load (ETL) on the SQL database and combining queries with API calls to see if rain had an effect on these matches.

This was a solo project, and I worked to make visualizations of the result (win/loss/tie) percentages for each team in this German league, then added a second set of result percentages for match days with rain. The time limit was 1 business day.

I decided to use stacked vertical bar plots for these results. While these kinds of plots are contentious visualizations, I felt that they made sense here because each team played the same number of matches, so it would be very easy to see the result percentages for each of the teams in the league.
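The result percentages behind plots like these could be tallied roughly as follows. This is a sketch under assumptions: the match rows are made up, and the `result_percentages` helper and its `rain_only` flag are hypothetical names, not taken from the project notebook.

```python
from collections import Counter, defaultdict

# Made-up rows: (team, result, rained_on_match_day)
matches = [
    ("Team A", "win", False),
    ("Team A", "loss", True),
    ("Team A", "win", True),
    ("Team A", "tie", False),
    ("Team B", "loss", False),
    ("Team B", "win", True),
]

def result_percentages(rows, rain_only=False):
    """Per-team percentage of each result, optionally for rainy match days only."""
    counts = defaultdict(Counter)
    for team, result, rained in rows:
        if rain_only and not rained:
            continue  # skip dry match days when only rainy ones are wanted
        counts[team][result] += 1
    return {
        team: {result: 100 * n / sum(tally.values()) for result, n in tally.items()}
        for team, tally in counts.items()
    }

overall = result_percentages(matches)
rainy = result_percentages(matches, rain_only=True)
print(overall)
print(rainy)
```

Each team's percentages sum to 100, which is what makes a stacked bar per team easy to compare across the league.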

Next steps would be to make the weather data more granular. The MVP only checked for rain on the match day in Berlin; however, the API could actually check for hourly weather data. In addition, not every team in this league plays in Berlin. To make these improvements to the model, it will be necessary to consult with soccer experts who know when and where each of the teams played.

View Jupyter Notebook
View GitHub Repo

GPAs at UW Madison

Using SQL Queries to Conduct Hypothesis Tests

This was an exploration into a SQL database containing GPAs from over 10 years of grades at UW Madison to perform hypothesis testing.

This was a pair project and we were able to ask and answer 6 questions in the time limit of 4 days.

We wanted to see how GPAs were affected by the class meeting day of the week (Do GPAs change day to day or weekend to weekday?), the class meeting time of day (Do GPAs change when the class meets in the morning vs the afternoon?), the subject material (Do STEM classes have different GPAs from non-STEM ones?), possible inflation (Is there a difference between GPAs in 2006/2007 and 2016/2017?), yearly inconsistency (Are there any years where the GPAs are different from the others?), and instructor inconsistency using one department (In the math department, are there teachers that grade differently from others?).
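The specific tests used are not stated here. As one possible illustration of a two-group comparison such as morning vs afternoon classes, the sketch below computes Welch's t-statistic and its approximate degrees of freedom. The GPA values are invented, and the `welch_t` helper is a hypothetical name for this example.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of each sample mean (sample variance / n).
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df

# Made-up average GPAs for a handful of class sections.
morning = [3.1, 3.4, 3.0, 3.3, 3.2]
afternoon = [3.5, 3.6, 3.4, 3.7, 3.3]

t_stat, dof = welch_t(morning, afternoon)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof:.1f}")
```

The statistic would then be compared against a t-distribution with those degrees of freedom to decide whether to reject the null hypothesis that the two groups share the same mean GPA.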

Next steps would be to check for and remove possible data leakage and cross-validation issues. This dataset is pretty open-ended and leaves many questions to ask and attempt to answer.

View Jupyter Notebook
View GitHub Repo

US Real Estate Investments

Conducting Time Series Forecasting to Predict Returns 3 Years into the Future

This was an exploration into a large CSV from Zillow containing national real estate sales data over 100 years.

This was initially a pair project but had to be adapted to a solo one. The goal was to find the five zipcodes that were best to purchase property in. The time limit was 4 days.

This project was very open-ended and was a good challenge. The dataset of over 15,000 zipcodes was too large to investigate each record effectively and required creative thinking to filter the data down into something that could be analyzed.

Using other resources (economics data and forecasts, Zillow trend data, and news reports), I created my own metrics for "best investment" and potential housing values. I chose zipcodes with high recent search trend activity, consistent pricing trends since the 2008 recession, large housing supply, and lower negative mortgage percentage.

To be clear, my actual recommendation for a hypothetical client with this money to invest was to not spend it on real estate due to strong signs for another recession. However, if the client insisted on real estate, these were the options I presented.

Next steps would be to compare projections from the model to more recent pricing data and economic reports. In addition, it would be interesting to compare the results from this Facebook Prophet model with individualized ARIMA models.

View Summary Notebook
View Presentation Slides
View GitHub Repo

Education

Data Science Fellow

Flatiron School Intensive

Doctor of Philosophy from the University of Massachusetts-Amherst in Chemical Engineering

Colloid and Interface Science | Fluid Mechanics | Materials Science

Thesis: “Particle and Protein Behavior upon Graphene as Compared with Other Common Surfaces”

Explored dynamic particle-surface and molecule-surface interactions to develop inexpensive and easily sourced micropatterned surfaces (characterized with Atomic Force Microscopy, X-Ray Photoelectron Spectroscopy, and Contact Angle Goniometry) for usage in microfluidic devices capable of dynamically and specifically sorting rolling particles based on mechanical and chemical properties

Analyzed adsorption and relaxation kinetics of protein (fibrinogen) on surfaces with both Total Internal and Near-Brewster Angle Reflectometry to find biomimetic effects in devices and determine in vivo device compatibility

Established a benchmark for additional graphene studies as a biomaterial by analyzing protein- and silica particle-surface colloidal interactions upon graphene, polycarbonate, and glass

Projects required application of fluid mechanics, chemistry, chemical engineering, biochemistry, and biology; computer programming in Interactive Data Language, R, MATLAB, Mathematica, and Excel Visual Basic Macros; and internal collaboration with groups at UMass Amherst and external industrial and academic contacts

Mentored four undergraduate students from four different schools with project direction and connected them with additional educational or research resources

Provided a variety of teaching assistance (grading, office hours, solution set creation and explanation, and recitation instruction) covering introductory chemical engineering, thermodynamics, and process design

Check out my thesis here!

Bachelor of Science from Cornell University in Chemical Engineering

Minor in Biomedical Engineering | Modeled Biological Networks

Member of the Student Chapter of the American Institute of Chemical Engineers (Vice President, Class Representative, Historian)

Member of the Student Chapter of Engineers for a Sustainable World (Historian)

Collaborated to structure a model of ~500 interactions to study apoptosis via unfolded protein and stress response (2nd author)

Check out the paper here!

Interests

In addition to being analytically minded, I enjoy photography! Pursuing photography has led me to amazing travels, fun hikes, and a healthier lifestyle. How has it led to a healthier lifestyle? Well, what's the point of having so much expensive camera gear if you can't carry it with you?