Name: Nhan Le
Class: BDA 594
This repository showcases two applied data analytics projects completed for SDSU’s BDA 594: Big Data Analytics course. Each project focuses on real-world datasets and demonstrates skills in data cleaning, exploratory data analysis, visualization, natural language processing, and social network analysis. Together, they highlight a well-rounded set of capabilities relevant to data analyst and data science roles.
📊 1. R Data Analysis Project: Public Health & Text Mining
This project explores public health trends in San Diego County using the dataset Leading_Causes_of_Death_in_SD_2011_2016.csv. Key components include:
- Importing, cleaning, and exploring a multi-year mortality dataset
- Computing summary statistics and identifying patterns across regions and time
- Creating clear visualizations with ggplot2 to communicate insights
- Performing text mining using custom corpora, including:
  - Class definitions of “Big Data”
  - A historical text (England Opium Monopoly.txt)
- Generating two sets of word clouds using tm, wordcloud, and RColorBrewer
- Building reproducible R scripts covering data wrangling, visualization, and NLP
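A word cloud is driven by term-frequency counts. The project computes these in R with tm. The block below is only a minimal Python sketch of the same idea, not the project's code. The stop-word list, the `term_frequencies` helper, and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not the list used in the project).
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "that", "it", "too", "for"}


def term_frequencies(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep alphabetic tokens, drop stop words, and count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# Hypothetical sample text standing in for a class definition of "Big Data".
sample = "Big data is data that is too large for traditional tools. Big data needs new tools."
freqs = term_frequencies(sample)

# The most frequent terms are the ones a word cloud would draw largest.
for term, count in freqs.most_common(3):
    print(term, count)
```

In a word cloud, each term's font size scales with its count, so the cleaning steps (lowercasing, stop-word removal) largely determine which words dominate the picture.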
This project demonstrates proficiency in R programming, EDA, statistical reasoning, and natural language processing.
🕸 2. Social Network Analysis with Gephi
This project constructs and analyzes a social network based on Twitter conversations about vaccine-exemption topics. The workflow includes:
- Cleaning raw tweet data using OpenRefine, including Clojure-based text extraction
- Extracting user mentions and retweets to build a usable EdgeList
- Final data preparation and validation in Excel
- Importing the network into Gephi and generating interactive network layouts
- Applying layout algorithms such as Fruchterman-Reingold and ForceAtlas
- Computing key network metrics: in-degree, out-degree, modularity, density, and network diameter
- Producing final visualizations for both In-Degree and Out-Degree networks
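The edge-list and metric steps above were done with OpenRefine, Excel, and Gephi. For readers who want to see the logic, here is a minimal Python sketch under stated assumptions. The sample tweets, the `@username` pattern, and the helper names are hypothetical. Each distinct (source, target) pair is counted once, which may differ from how a given Gephi import handles repeated edges.

```python
import re
from collections import Counter

# Hypothetical sample tweets as (author, text) pairs; not data from the project.
tweets = [
    ("alice", "RT @bob: exemptions are rising"),
    ("carol", "@alice @bob have you seen this?"),
    ("bob", "no mentions here"),
]

# Assumed mention pattern: "@" followed by word characters.
MENTION = re.compile(r"@(\w+)")


def build_edge_list(tweets):
    """Return directed (source, target) edges: author -> each user they mention or retweet."""
    edges = []
    for author, text in tweets:
        for target in MENTION.findall(text):
            edges.append((author.lower(), target.lower()))
    return edges


def degree_counts(edges):
    """In-degree and out-degree over distinct edges (repeated pairs are counted once)."""
    unique = set(edges)
    in_degree = Counter(target for _, target in unique)
    out_degree = Counter(source for source, _ in unique)
    return in_degree, out_degree


def density(edges):
    """Directed density: distinct non-loop edges divided by n * (n - 1)."""
    nodes = {node for edge in edges for node in edge}
    n = len(nodes)
    if n < 2:
        return 0.0
    unique = {(s, t) for s, t in edges if s != t}
    return len(unique) / (n * (n - 1))


edges = build_edge_list(tweets)
in_degree, out_degree = degree_counts(edges)
print(edges)
print(dict(in_degree), dict(out_degree), density(edges))
```

In this reading, a high in-degree marks an account that many others mention or retweet, while a high out-degree marks an account that mentions many others. This is why the two networks are visualized separately.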
This project highlights skills in social network analysis, graph theory, data cleaning, and visualization using OpenRefine and Gephi.