Aaron Washington Chen, Ph. D.


Aaron Washington Chen is a data scientist currently contributing to Analytics Vidhya! Outside of data science, he has a Ph. D. in Chemical Engineering; loves traveling, food, and pets; and runs a small photography business called Composed and Focused Studios. You can find him at various places with the icons below.


Experience

Niantic Labs

Spread Augmented Reality (AR) development and encourage healthy gameplay through inspiring outdoor exploration, exercise, and meaningful social interaction by applying data engineering and data science skills
  • Lead analyses and reporting of financial impact of gameplay features inside Pokemon Go, leading to the redesign of content and project setting for the future
  • Write extract, transform, and load (ETL) jobs using Airflow and SQL to simplify and accelerate analyses and decision making within primary product and overall organization
  • Consult as data scientist across multiple games and AR platform; assist with OKR setting across game and platform; and act as subject matter liaison between game, product, and platform teams
  • Define and demonstrate projects leading to development and adoption of an MLOps tech stack (Kubeflow, Ray, and MLflow) at a cross-functional organizational level
  • Design experiments to investigate the impact of, and reduce excess costs caused by, targeted players
  • Set and monitor targets for deployment of features inside Pokemon Go
  • Identify affected users and metrics from app issues and assist with make-good events
Centers for Medicare and Medicaid Services

    Improve health & healthcare experiences of >100 million beneficiaries served by the Centers by helping to modernize the Centers' technology & using data science skills, techniques, & tools to advise leadership
  • Generated topic models from ~400k unstructured nursing home reports w/ Python & Natural Language Processing (NLP) while providing commentary on & assisting w/modernizing legacy technology infrastructure
  • Wrangled CDC & Operation Warp Speed data (millions of time series records) to create QuickSight visualizations for leadership & direct assistance to 15k skilled nursing facilities
  • Analyzed technology contracts valued at $5 million to recommend improvements to tech stack, data handling, & organizational role responsibilities
  • Collaborated on impact analyses to determine value of group’s $500 million contracts
  • Contributed to specifications development of future contract work, data & technology infrastructure, & analytics stack w/an estimated budget of $1 billion
  • Developed onboarding and hiring for over 50 data scientists across the Federal government
Projects

    Coding Livestream

    Working through data science projects live!

    I'm currently doing a weekly coding livestream! Here, I just work through data science projects and problems and teach as I'm going!

    The project will change with different datasets, and I'll update this portfolio website when a project is "finished".

    The first project involves using a simple Pokemon dataset from Kaggle and wrangling it to make something useful: For me, a type matchup predictor that I can use in Pokemon Go!

    You can find me at Twitch and YouTube.

    MeaLeon

    Using Machine Learning and Natural Language Processing to Find New Foods

    MeaLeon uses machine learning and natural language processing to suggest new dishes! It takes in a user's dish and the cuisine it's from, looks in a static database, and returns 5 dishes with similar ingredients but different cuisines.
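    The matching idea can be sketched roughly as follows. This is an illustration only, not MeaLeon's actual code: the recipe records, the field names, and the use of Jaccard similarity on ingredient sets are all assumptions made for the example.

```python
# Illustrative sketch only: the recipe data, field names, and the choice of
# Jaccard similarity are assumptions, not MeaLeon's actual implementation.


def jaccard(a: set, b: set) -> float:
    """Share of ingredients two dishes have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest(query_ingredients, query_cuisine, recipes, k=5):
    """Return up to k recipes from other cuisines, most similar first."""
    query = set(query_ingredients)
    # Only keep dishes from cuisines other than the user's own.
    candidates = [r for r in recipes if r["cuisine"] != query_cuisine]
    ranked = sorted(
        candidates,
        key=lambda r: jaccard(query, set(r["ingredients"])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical static "database" of recipes.
RECIPES = [
    {"name": "Keema roti", "cuisine": "Indian",
     "ingredients": ["ground beef", "onion", "garam masala", "flatbread"]},
    {"name": "Beef tacos", "cuisine": "Tex-Mex",
     "ingredients": ["ground beef", "onion", "cumin", "tortilla"]},
    {"name": "Bulgogi wrap", "cuisine": "Korean",
     "ingredients": ["beef", "onion", "soy sauce", "lettuce"]},
]

matches = suggest(
    ["ground beef", "onion", "cumin", "tortilla"], "Tex-Mex", RECIPES
)
print([m["name"] for m in matches])
```

    Filtering by cuisine before ranking is what keeps the suggestions "similar ingredients, different cuisine"; a real version would likely normalize ingredient names first.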

    The inspiration for MeaLeon has several combined origin stories. I once was making ground beef tacos and realized too late that I had run out of cumin. So, I grabbed the closest thing in my pantry that was mostly cumin: garam masala. Suddenly, Tex-Mex tacos turned into something more like roti. This reminded me of a friend who hated sauerkraut and kimchi, and she quipped "every culture has a disgusting fermented cabbage dish." Lastly, I can get bored of my go-to dishes and was seeking an easier way to create new dishes using mostly the same ingredients. These experiences and thoughts made me wish for something like MeaLeon to exist.

    This solo project combines quite a few different technologies into one full-stack data science web app. The MVP is deployed on Heroku via a Flask app and uses primarily Python 3, a RESTful API, HTML, and CSS. Improvements will incorporate web scraping, AWS services, MongoDB, classifier algorithms, and likely neural networks.

    Next steps would be to increase the number of recipes in the database by scraping recipe sites from different cuisines, add more cuisines as classifier targets, and perform more advanced machine learning on unclassified recipes to add further recipes to the database.

    Ad Hoc Analyses

    These one-off analyses were done as short, independent projects.

    Predicting Prices in King County, WA

    Using one year of sales data to predict prices

    This was an initial look into real estate sales data from King County, Washington, and the goal was to use the given one year of data to make a price predictor with linear regression.

    My group (pair) worked to make a model that not only increased the R-squared value, but also reduced the root mean squared error (RMSE). Our time limit was 4 days.

    We ended up getting a model with an R-squared score of ~0.8 and an RMSE of ~$130k. These results were obtained after reducing extra features for use in simple linear regression and after using a more sophisticated LassoCV model. Our next steps would be to obtain more data to improve the model, as we found out that the LassoCV regression was still underfitting the data, even though we were synthesizing hybrid data from the source data. True to an iterative approach to data science, we will apply our learnings to future work, and this description and file will be updated should an improvement be devised.
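    For readers unfamiliar with the two metrics, here is a minimal sketch of how R-squared and RMSE are computed from a model's predictions. The prices are made-up numbers for illustration, not data or results from this project.

```python
import math


def r_squared(actual, predicted):
    """Fraction of variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target (dollars)."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical sale prices and model predictions (not project data).
actual = [300_000, 450_000, 600_000, 750_000]
predicted = [350_000, 400_000, 650_000, 700_000]

print(f"R-squared: {r_squared(actual, predicted):.3f}")
print(f"RMSE: ${rmse(actual, predicted):,.0f}")
```

    A better model pushes R-squared toward 1 and RMSE toward 0, which is why the two were tracked together.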

    How are Soccer Wins Affected by Rain?

    Combining SQL with weather API calls

    This was an exploration into a SQL database containing results from one year of German soccer matches. It required performing extract, transform, and load (ETL) on the SQL database and combining queries with API calls to see if rain had an effect on these matches.

    This was a solo project, and I worked to make visualizations of result (win/loss/tie) percentages for each team in this German league, then added a second set of result percentages for matches played on rainy days. The time limit was 1 business day.

    I decided to use stacked vertical bar plots for these results. While these kinds of plots are contentious visualizations, I felt that they were sensible here because each team played the same number of matches and thus it would be very easy to see the result percentages for each of the teams in the league.

    Next steps would be to make the weather data more granular. The MVP only checked for rain on the match day in Berlin, even though the API could provide hourly weather data. In addition, not every team in this league plays in Berlin. In order to make these improvements to the model, it will be necessary to consult with soccer experts who know when and where each of the teams played.
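    The MVP's core step, splitting each team's results by whether it rained on match day, can be sketched as below. The table schema, team names, dates, and the `RAINED_ON` lookup (a stand-in for the weather API responses) are all hypothetical and chosen only to make the example self-contained.

```python
import sqlite3

# Hypothetical schema and rows; the real database layout is not shown here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matches (match_date TEXT, home_team TEXT, away_team TEXT,"
    " home_goals INTEGER, away_goals INTEGER)"
)
conn.executemany(
    "INSERT INTO matches VALUES (?, ?, ?, ?, ?)",
    [
        ("2011-08-05", "Team A", "Team B", 2, 1),
        ("2011-08-12", "Team B", "Team A", 0, 0),
        ("2011-08-19", "Team A", "Team C", 0, 3),
        ("2011-08-26", "Team C", "Team A", 1, 2),
    ],
)

# Stand-in for weather API results: True if it rained on that date.
RAINED_ON = {"2011-08-05": True, "2011-08-12": False,
             "2011-08-19": True, "2011-08-26": False}


def result_percentages(conn, team, rain_only=False):
    """Win/loss/tie percentages for a team, optionally for rainy days only."""
    rows = conn.execute(
        "SELECT match_date, home_team, home_goals, away_goals FROM matches"
        " WHERE home_team = ? OR away_team = ?",
        (team, team),
    ).fetchall()
    counts = {"win": 0, "loss": 0, "tie": 0}
    for match_date, home_team, home_goals, away_goals in rows:
        if rain_only and not RAINED_ON.get(match_date, False):
            continue
        # Orient the score from the point of view of `team`.
        if home_team == team:
            own, other = home_goals, away_goals
        else:
            own, other = away_goals, home_goals
        outcome = "win" if own > other else "loss" if own < other else "tie"
        counts[outcome] += 1
    total = sum(counts.values())
    if total == 0:
        return {k: 0.0 for k in counts}
    return {k: 100 * v / total for k, v in counts.items()}


print(result_percentages(conn, "Team A"))
print(result_percentages(conn, "Team A", rain_only=True))
```

    Keying the weather lookup on date alone mirrors the MVP's simplification; a per-city, per-hour lookup would need the match location and kickoff time as well.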

    GPAs at UW Madison

    Using SQL Queries to Conduct Hypothesis Tests

    This was an exploration into a SQL database containing GPAs from over 10 years of grades at UW Madison to perform hypothesis testing.

    This was a pair project and we were able to ask and answer 6 questions in the time limit of 4 days.

    We wanted to see how GPAs were affected by the class meeting day of the week (Do GPAs change day to day or weekend to weekday?), the class meeting time of day (Do GPAs change when the class meets in the morning vs. the afternoon?), the subject material (Do STEM classes have different GPAs from non-STEM ones?), possible inflation (Is there a difference between GPAs in 2006/2007 and 2016/2017?), yearly inconsistency (Are there any years where the GPAs are different from the others?), and instructor inconsistency using one department (In the math department, are there teachers that grade differently from others?).
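    As an illustration of what one of these two-group comparisons might look like, here is a minimal permutation test on the difference in mean GPA. The GPA values are invented, and the permutation test is an assumed stand-in: the specific tests used in the project are not described here.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided p-value for the difference in means between two groups."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the labels and recompute the difference in means.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical section GPAs (not from the UW Madison database).
morning_gpas = [3.1, 3.2, 3.0, 3.3, 3.1, 3.2]
afternoon_gpas = [3.6, 3.7, 3.5, 3.8, 3.6, 3.7]

p_value = permutation_test(morning_gpas, afternoon_gpas)
print(f"p-value: {p_value:.4f}")
```

    A small p-value would suggest the two groups differ by more than label shuffling alone would produce. With six questions asked of the same data, a multiple-comparison correction would also be worth considering.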

    Next steps would be to check for and remove possible data leakage and cross-validation issues. This dataset is pretty open-ended and offers many more questions to ask and attempt to answer.

    US Real Estate Investments

    Conducting Time Series Forecasting to Predict Returns 3 Years into the Future

    This was an exploration into a large CSV from Zillow containing national real estate sales data over 100 years.

    This was initially a pair project but had to be adapted to a solo one. The goal was to find the five zipcodes that were best to purchase property in. The time limit was 4 days.

    This project was very open-ended and was a good challenge. The dataset of over 15,000 zipcodes was too large to investigate each record effectively and required creative thinking to filter the data down into something that could be analyzed.

    Using other resources (economics data and forecasts, Zillow trend data, and news reports), I created my own metrics for "best investment" and potential housing values. I chose zipcodes with high recent search trend activity, consistent pricing trends since the 2008 recession, large housing supply, and lower negative mortgage percentage.

    To be clear, my actual recommendation for a hypothetical client with this money to invest was to not spend it on real estate due to strong signs for another recession. However, if the client insisted on real estate, these were the options I presented.

    Next steps would be to compare projections from the model to more recent pricing data and economic reports. In addition, it would be interesting to compare the results from this Facebook Prophet model with individualized ARIMA models.

    Education

    Data Science Fellow

    Flatiron School Intensive

    Doctor of Philosophy from the University of Massachusetts-Amherst in Chemical Engineering

    Colloid and Interface Science | Fluid Mechanics | Materials Science

    Thesis: “Particle and Protein Behavior upon Graphene as Compared with Other Common Surfaces”

    Explored dynamic particle-surface and molecule-surface interactions to develop inexpensive and easily sourced micropatterned surfaces (characterized with Atomic Force Microscopy, X-Ray Photoelectron Spectroscopy, and Contact Angle Goniometry) for usage in microfluidic devices capable of dynamically and specifically sorting rolling particles based on mechanical and chemical properties

    Analyzed adsorption and relaxation kinetics of protein (fibrinogen) on surfaces with both Total Internal and Near-Brewster Angle Reflectometry to find biomimetic effects in devices and determine in vivo device compatibility

    Established a benchmark for additional graphene studies as a biomaterial by analyzing protein- and silica particle-surface colloidal interactions upon graphene, polycarbonate, and glass

    Projects required application of fluid mechanics, chemistry, chemical engineering, biochemistry, and biology; computer programming in Interactive Data Language, R, Matlab, Mathematica, and Excel Visual Basic Macros; and internal collaboration with groups at UMass Amherst and external industrial and academic contacts

    Mentored four undergraduate students from four different schools with project direction and connected them with additional educational or research resources

    Provided a variety of teaching assistance (grading, office hours, solution set creation and explanation, and recitation instruction) covering introductory chemical engineering, thermodynamics, and process design

    Bachelor of Science from Cornell University in Chemical Engineering

    Minor in Biomedical Engineering | Modeled Biological Networks

    Member of the Student Chapter of the American Institute of Chemical Engineers (Vice President, Class Representative, Historian)

    Member of the Student Chapter of Engineers for a Sustainable World (Historian)

    Collaborated to structure a model of ~500 interactions to study apoptosis via unfolded protein and stress response (2nd author)

    Interests

    In addition to being analytically minded, I enjoy photography! Pursuing photography has led me to amazing travels, fun hikes, and a healthier lifestyle. How has it led to a healthier lifestyle? Well, what's the point of having so much expensive camera gear if you can't carry it with you?