Aaron Washington Chen is a data scientist currently contributing to Analytics Vidhya! Outside of data science, he has a Ph.D. in Chemical Engineering; loves traveling, food, and pets; and runs a small photography business called Composed and Focused Studios. You can find him at various places with the icons below.
I'm currently doing a weekly coding livestream! Here, I just work through data science projects and problems and teach as I'm going!
The project will change with different datasets, and I'll update this portfolio website when a project is "finished".
The first project involves using a simple Pokemon dataset from Kaggle and wrangling it to make something useful: for me, a type matchup predictor that I can use in Pokemon Go!
MeaLeon uses machine learning and natural language processing to suggest new dishes! It takes in a user's dish and the cuisine it's from, looks in a static database, and returns 5 dishes with similar ingredients but different cuisines.
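A minimal sketch of the idea, not MeaLeon's actual code: it assumes each recipe is reduced to a set of ingredients and that "similar" means cosine similarity between those sets. The tiny `RECIPES` list, the function names, and the choice of cosine similarity are all assumptions made for illustration.

```python
from math import sqrt

# Hypothetical toy database: (dish name, cuisine, set of ingredients).
RECIPES = [
    ("beef tacos", "Tex-Mex", {"ground beef", "cumin", "onion", "tortilla", "chili"}),
    ("chili con carne", "Tex-Mex", {"ground beef", "cumin", "onion", "beans", "chili"}),
    ("keema roti", "Indian", {"ground beef", "garam masala", "onion", "roti", "chili"}),
    ("bolognese", "Italian", {"ground beef", "onion", "tomato", "pasta", "garlic"}),
    ("kofta", "Middle Eastern", {"ground lamb", "cumin", "onion", "parsley"}),
    ("mapo tofu", "Sichuan", {"ground pork", "tofu", "chili", "garlic"}),
    ("bibimbap", "Korean", {"ground beef", "rice", "egg", "kimchi"}),
]


def cosine_similarity(a, b):
    """Cosine similarity between two ingredient sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def suggest(ingredients, cuisine, recipes=RECIPES, k=5):
    """Return up to k dishes from other cuisines, most similar first."""
    scored = [
        (cosine_similarity(ingredients, recipe_ingredients), name, other_cuisine)
        for name, other_cuisine, recipe_ingredients in recipes
        if other_cuisine != cuisine  # only suggest dishes from a different cuisine
    ]
    scored.sort(key=lambda item: -item[0])  # highest similarity first
    return [(name, other, round(score, 3)) for score, name, other in scored[:k]]


tacos = {"ground beef", "cumin", "onion", "tortilla", "chili"}
suggestions = suggest(tacos, "Tex-Mex")
for name, other, score in suggestions:
    print(f"{name} ({other}): {score}")
```

With this toy data, the closest match to the tacos is the keema roti, which shares the ground beef, onion, and chili.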
The inspiration for MeaLeon has several combined origin stories. I was once making ground beef tacos and realized too late that I had run out of cumin. So, I grabbed the closest thing in my pantry that was mostly cumin: garam masala. Suddenly, Tex-Mex tacos turned into something more like roti. This reminded me of a friend who hated sauerkraut and kimchi, and she quipped "every culture has a disgusting fermented cabbage dish." Lastly, I can get bored of my go-to dishes and was seeking an easier way to create new dishes using mostly the same ingredients. These experiences and thoughts made me wish for something like MeaLeon to exist.
This solo project combines quite a few different technologies into one full stack data science web app. The MVP is deployed on Heroku as a Flask app and primarily uses Python 3, a RESTful API, HTML, and CSS. Improvements will incorporate web scraping, AWS services, MongoDB, classifier algorithms, and likely neural networks.
Next steps would be to increase the number of recipes in the database by scraping recipe sites from different cuisines, add more cuisines as classifier targets, and perform more advanced machine learning on unclassified recipes to add even more recipes to the database.
This was an initial look into real estate sales data from King County, Washington, and the goal was to use the given one year of data to make a price predictor with linear regression.
My group (a pair) worked to make a model that not only increased the R-squared value, but also reduced the root mean squared error (RMSE). Our time limit was 4 days.
We ended up getting a model with a score of ~0.8 and an RMSE of ~$130k. These results were obtained after reducing extra features for use in simple linear regression and after using a more sophisticated LassoCV model. Our next steps would be to obtain more data to improve the model, as we found out that the LassoCV regression was still underfitting the data, even though we were synthesizing hybrid data from the source data. True to an iterative approach to data science, we will apply our learnings to future work, and this description and file will be updated should an improvement be devised.
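For reference, here is a small sketch of how the two metrics are computed. A better model has a higher R-squared (closer to 1) and a lower RMSE. The sale prices below are made up for illustration and are not from the King County data.

```python
from math import sqrt


def r_squared(y_true, y_pred):
    """Fraction of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (dollars here)."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up sale prices (assumption, for illustration only).
actual = [300_000, 450_000, 520_000, 610_000]
predicted = [320_000, 430_000, 540_000, 590_000]

print(f"R-squared: {r_squared(actual, predicted):.3f}")
print(f"RMSE: ${rmse(actual, predicted):,.0f}")
```

Because RMSE is in dollars, it is the easier of the two to explain: it is roughly the typical size of a prediction miss.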
This was an exploration into a SQL database containing results from one year of German soccer matches. It required performing extract, transform, and load (ETL) on the SQL database and combining queries with API calls to see if rain had an effect on these matches.
This was a solo project, and I worked to make visualizations of the result (win/loss/tie) percentages for each team in this German league, then added a second set of result percentages for match days with rain. The time limit was 1 business day.
I decided to use stacked vertical bar plots for these results. While these kinds of plots are contentious visualizations, I felt that they were sensible here because each team played the same number of matches, so it would be very easy to compare the result percentages across the teams in the league.
Next steps would be to make the weather data more granular. The MVP only checked for rain on the match day in Berlin, but the API could actually check for hourly weather data. In addition, not every team in this league plays in Berlin. In order to make these improvements to the model, it will be necessary to consult with soccer experts who know when and where each of the teams played.
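The numbers behind each stacked bar are just per-team result percentages, computed once over all matches and once over rainy match days only. A rough sketch of that step, using a made-up `MATCHES` list rather than the project's SQL database or weather API:

```python
from collections import Counter, defaultdict

# Hypothetical records: (team, result, rained_on_match_day).
MATCHES = [
    ("Team A", "win", False),
    ("Team A", "win", True),
    ("Team A", "loss", True),
    ("Team A", "tie", False),
    ("Team B", "loss", False),
    ("Team B", "loss", True),
    ("Team B", "win", True),
    ("Team B", "tie", False),
]

RESULTS = ("win", "loss", "tie")


def result_percentages(matches, rain_only=False):
    """Per-team percentage of wins, losses, and ties; optionally rainy days only."""
    counts = defaultdict(Counter)
    for team, result, rained in matches:
        if rain_only and not rained:
            continue  # skip dry match days when only rainy ones are requested
        counts[team][result] += 1
    return {
        team: {r: 100 * tally[r] / sum(tally.values()) for r in RESULTS}
        for team, tally in counts.items()
    }


overall = result_percentages(MATCHES)
rainy = result_percentages(MATCHES, rain_only=True)
print(overall["Team A"])
print(rainy["Team A"])
```

Each inner dictionary sums to 100, which is what makes a stacked bar per team a natural fit.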
This was an exploration into a SQL database containing GPAs from over 10 years of grades at UW Madison to perform hypothesis testing.
This was a pair project and we were able to ask and answer 6 questions in the time limit of 4 days.
We wanted to see how GPAs were affected by the class meeting day of the week (Do GPAs change day to day or weekend to weekday?), the class meeting time of day (Do GPAs change when the class meets in the morning vs. the afternoon?), the subject material (Do STEM classes have different GPAs from non-STEM ones?), possible inflation (Is there a difference between GPAs in 2006/2007 and 2016/2017?), yearly inconsistency (Are there any years where the GPAs are different from the others?), and instructor inconsistency using one department (In the math department, are there teachers that grade differently from others?).
Next steps would be to check for and remove possible data leakage and cross-validation issues. This dataset is pretty open-ended and offers many more questions to ask and attempt to answer.
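As an illustration of how one of these questions can be tested, here is a sketch of a two-sided permutation test on the difference in mean GPA between two groups. This is an assumed technique chosen because it needs only the standard library, not necessarily the test we used, and the GPA values are made up.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up section GPAs (assumption, for illustration only).
stem = [2.8, 2.9, 3.0, 3.1, 2.7, 2.95]
non_stem = [3.5, 3.6, 3.4, 3.7, 3.55, 3.45]

p_value = permutation_test(stem, non_stem)
print(f"p-value: {p_value:.4f}")
```

A small p-value means a difference this large would rarely show up if the group labels did not matter, which is the evidence needed to reject the "no difference" hypothesis.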
This was an exploration into a large CSV from Zillow containing national real estate sales data over 100 years.
This was initially a pair project but had to be adapted to a solo one. The goal was to find the five zipcodes that were best to purchase property in. The time limit was 4 days.
This project was very open-ended and was a good challenge. The dataset of over 15,000 zipcodes was too large to investigate each record effectively, so it required creative thinking to filter the data down into something that could be analyzed.
Using other resources (economics data and forecasts, Zillow trend data, and news reports), I created my own metrics for "best investment" and potential housing values. I chose zipcodes with high recent search trend activity, consistent pricing trends since the 2008 recession, large housing supply, and lower negative mortgage percentage.
To be clear, my actual recommendation for a hypothetical client with this money to invest was to not spend it on real estate due to strong signs for another recession. However, if the client insisted on real estate, these were the options I presented.
Next steps would be to compare projections from the model to more recent pricing data and economic reports. In addition, it would be interesting to compare the results from this Facebook Prophet model with individualized ARIMA models.
Thesis: “Particle and Protein Behavior upon Graphene as Compared with Other Common Surfaces”
Explored dynamic particle-surface and molecule-surface interactions to develop inexpensive and easily sourced micropatterned surfaces (characterized with Atomic Force Microscopy, X-Ray Photoelectron Spectroscopy, and Contact Angle Goniometry) for usage in microfluidic devices capable of dynamically and specifically sorting rolling particles based on mechanical and chemical properties
Analyzed adsorption and relaxation kinetics of protein (fibrinogen) on surfaces with both Total Internal and Near-Brewster Angle Reflectometry to find biomimetic effects in devices and determine in vivo device compatibility
Established a benchmark for additional graphene studies as a biomaterial by analyzing protein- and silica particle-surface colloidal interactions upon graphene, polycarbonate, and glass
Projects required application of fluid mechanics, chemistry, chemical engineering, biochemistry, and biology; computer programming in Interactive Data Language, R, Matlab, Mathematica, and Excel Visual Basic Macros; and internal collaboration with groups at UMass Amherst and external industrial and academic contacts
Mentored four undergraduate students from four different schools with project direction and connected them with additional educational or research resources
Provided a variety of different teaching assistance (grading, office hours, solution set creation and explanation, and recitation instruction) covering introductory chemical engineering, thermodynamics, and process design
Member of the Student Chapter of the American Institute of Chemical Engineers (Vice President, Class Representative, Historian)
Member of the Student Chapter of Engineers for a Sustainable World (Historian)
Collaborated to structure a model of ~500 interactions to study apoptosis via unfolded protein and stress response (2nd author)
In addition to being analytically minded, I enjoy photography! Pursuing photography has led me to amazing travels, fun hikes, and a healthier lifestyle. How has it led to a healthier lifestyle? Well, what's the point of having so much expensive camera gear if you can't carry it with you?