Data Projects
Natural Disaster NLP Modelling
Twitter has become an important communication channel in times of emergency. This project builds three machine learning models (Bag of Words, Vectorization, and GloVe) that each predict which tweets are about real disasters and which ones are not. The dataset contains 10,000 tweets that were hand-classified as true or false.
https://github.com/bkuropa/Natural_Disaster
Data: https://www.kaggle.com/competitions/nlp-getting-started/rules
A random forest model was used with word vectorization, along with keywords associated with texts from true disaster scenarios.
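As a rough illustration of pairing a random forest with word vectorization, here is a minimal sketch. It assumes scikit-learn is installed and uses TF-IDF as the vectorizer. The example tweets, labels, and parameter values are hypothetical and are not taken from the Kaggle dataset or the linked repository.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = real disaster, 0 = not a disaster.
tweets = [
    "Forest fire near the town, residents told to evacuate",
    "Flood waters rising fast, roads closed across the county",
    "Earthquake damage reported downtown, several injured",
    "This new album is fire, listening on repeat",
    "My inbox is flooded with birthday wishes",
    "That movie was a total disaster, do not bother",
]
labels = [1, 1, 1, 0, 0, 0]

# Vectorize the text, then feed the resulting features to a random forest.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(tweets, labels)

# Predict on unseen (also hypothetical) tweets.
predictions = model.predict([
    "Wildfire spreading near the highway, evacuate now",
    "I love this sunny day",
])
print(predictions)
```

With only six training tweets the predictions are not meaningful; the sketch only shows how the pieces fit together.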
New York City Citibike Program Visualization
The following is a visual analysis of the popularity of the Citibike rental program running in New York City. One can gain a strong understanding of the key starting and ending locations for cyclists in New York, as well as the highest-traffic times for biking, repairs, trip lengths and even gender differences.
https://public.tableau.com/app/profile/bryan.k5567/viz/Citi_bikes_2/Story1
Data: https://ride.citibikenyc.com/system-data
Software: Pandas, Tableau Desktop, Jupyter Notebook, Excel.
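Before the data reaches Tableau, a summary like the busiest start stations and peak riding hours can be prepared in Pandas. The following is a minimal sketch under stated assumptions: the column names `started_at` and `start_station_name` and the sample rows are hypothetical, and the real Citibike export may use different names.

```python
import pandas as pd

# Hypothetical sample of trip records; real data would come from the
# Citibike system-data files instead.
trips = pd.DataFrame({
    "started_at": pd.to_datetime([
        "2019-06-01 08:05",
        "2019-06-01 08:40",
        "2019-06-01 17:15",
        "2019-06-02 17:30",
        "2019-06-02 17:45",
    ]),
    "start_station_name": [
        "Station A", "Station A", "Station B", "Station A", "Station C",
    ],
})

# Most popular starting locations.
top_starts = trips["start_station_name"].value_counts().head(2)

# Number of trips per hour of day, to find the highest-traffic times.
trips_by_hour = trips["started_at"].dt.hour.value_counts().sort_index()

print(top_starts)
print(trips_by_hour)
```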
Neural Network Charity Analysis
The company Alphabet Soup has provided initial data to be processed by a basic neural network, with the aim of better understanding what drives a successful loan based on information such as application type, amount of funding asked/provided, industry affiliation, or organization type. The model ends in a true/false classifier (a sigmoid output here) that predicts whether an application will pass.
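To show how a sigmoid output turns a network's score into a pass/fail decision, here is a minimal pure-Python sketch of a forward pass through one hidden layer. The weights, the ReLU hidden activation, the 0.5 threshold, and the function names are all assumptions for illustration, not the project's trained model.

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z: float) -> float:
    """Hidden-layer activation (an assumption; other choices are possible)."""
    return max(0.0, z)

def forward(features, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer followed by a single sigmoid output unit."""
    hidden = [
        relu(sum(w * x for w, x in zip(weights, features)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    logit = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    return sigmoid(logit)

def is_successful(probability: float, threshold: float = 0.5) -> bool:
    """Convert the sigmoid probability into a true/false prediction."""
    return probability >= threshold

# Hypothetical weights and a two-feature application, for illustration only.
probability = forward(
    features=[1.0, 0.0],
    hidden_weights=[[1.0, 0.0], [0.0, 1.0]],
    hidden_biases=[0.0, 0.0],
    output_weights=[2.0, -1.0],
    output_bias=0.0,
)
print(probability, is_successful(probability))
```

In practice the weights would be learned during training rather than written by hand.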
COVID Molecular Property Prediction (coming soon)
COVID has drastically changed the world for all of us. It has separated families, made many very sick, and made working from home a necessity. The sooner scientists discover molecules that display antiviral activity against the virus, the more effective we will be at treating patients with COVID. This includes, for example, over-the-counter medications that people can take.
Using machine learning, molecular predictors can be built from past successes. In this case - since SARS-CoV-2 is new to us - one can train on E. coli inhibition data as well as successes against SARS-CoV, the coronavirus behind the 2003 SARS outbreak.
Analysis of US Healthcare Fraud (coming soon)
The impact of Medicare fraud is widespread and involves medical-based organizations including hospitals, drug manufacturers and insurance providers, as well as small practice medical professionals. Pulse Insights took on this challenge, successfully developing a data analytics tool predicting Medicare fraud before it happens.
The GDELT Project Analysis (https://www.gdeltproject.org/)
Supported by Google Jigsaw, the GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.
Elasticsearch & Kibana Dashboard
Trump supporters tweeting in 2018
Sentiment analysis in R:
Compiling the tweets of thousands of Trump fans, one can observe the words appearing most frequently. The larger a word appears in the image, the more frequently it is used.
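The word counts behind a frequency image like this can be sketched in a few lines. The original analysis was done in R; the version below is a Python illustration only, and the sample tweets, stop-word list, and function name are hypothetical.

```python
import re
from collections import Counter

def word_frequencies(tweets, stop_words=frozenset()):
    """Count how often each word appears across all tweets."""
    counts = Counter()
    for tweet in tweets:
        # Lowercase the text and keep only alphabetic tokens.
        for word in re.findall(r"[a-z']+", tweet.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts

# Hypothetical sample tweets and stop words, for illustration only.
sample_tweets = [
    "Great rally tonight!",
    "What a great crowd at the rally",
    "Great night",
]
stop_words = frozenset({"a", "at", "the", "what"})

counts = word_frequencies(sample_tweets, stop_words)

# The most frequent words would be drawn largest in the image.
print(counts.most_common(2))
```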