Berkeley_ML_Spark

This repository contains the assignments for the Berkeley courses CS100.1x and CS190.1x, which introduce big data processing and machine learning with Apache Spark.

CS100.1x Introduction to Big Data with Apache Spark

  • Lab 1: Learn the Spark data model, transformations, and actions, and write a word-count program that counts the words in all of Shakespeare's plays.
  • Lab 2: Use Spark to explore a NASA Apache web server log.
  • Lab 3: Perform text analysis and entity resolution on Google and Amazon product listings using Spark.
  • Lab 4: Use Spark's machine learning library (MLlib) to perform collaborative filtering on a movie dataset.
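The word-count task in Lab 1 follows a tokenize, count, and rank pattern. The following is a minimal sketch of that pattern in plain Python, not the lab's Spark solution: the function names `tokenize`, `word_count`, and `top_words` are hypothetical, and the tokenization rule (lowercase, keep letters, digits, and apostrophes) is an assumption. In Spark, the same steps would typically be spread across transformations and actions on a distributed dataset.

```python
import re
from collections import Counter


def tokenize(line):
    """Lowercase a line and split it into words, dropping punctuation.

    Assumption: a word is a run of letters, digits, or apostrophes.
    """
    return re.findall(r"[a-z0-9']+", line.lower())


def word_count(lines):
    """Count how often each word occurs across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


def top_words(lines, n):
    """Return the n most frequent (word, count) pairs.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(word_count(lines).items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


if __name__ == "__main__":
    sample = ["To be, or not to be", "that is the question"]
    print(top_words(sample, 3))
```

Breaking ties alphabetically is a design choice made here so that repeated runs give the same ranking; it is not something the lab prescribes.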

CS190.1x Scalable Machine Learning

  • Lab 1: Gain hands-on experience using Python's scientific computing library to manipulate matrices and vectors, and learn about lambda functions.
  • Lab 2: Exactly the same as Lab 1 of CS100.1x.
  • Lab 3: Millionsong Regression Pipeline. Develop an end-to-end linear regression pipeline to predict the release year of a song given a set of audio features. You will implement a gradient descent solver for linear regression, use Spark's machine learning library (MLlib) to train additional models, tune models via grid search, improve accuracy using quadratic features, and visualize various intermediate results to build intuition.
  • Lab 4: Click-through Rate Prediction Pipeline. Construct a logistic regression pipeline to predict click-through rate using data from a recent Kaggle competition. You will extract numerical features from the raw categorical data using one-hot encoding, reduce the dimensionality of these features via hashing, train logistic regression models using MLlib, tune hyperparameters via grid search, and interpret probabilistic predictions via an ROC plot.
  • Lab 5: Neuroimaging Analysis via PCA. Identify patterns of brain activity in larval zebrafish. You will work with time-varying images (generated using a technique called light-sheet microscopy) that capture a zebrafish's neural activity as it is presented with a moving visual pattern. After implementing distributed PCA from scratch and gaining intuition by working with synthetic data, you will use PCA to identify distinct patterns across the zebrafish brain that are induced by different types of stimuli.
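The feature-extraction steps described for Lab 4 (one-hot encoding of categorical data, then hashing to reduce dimensionality) can be illustrated with a small sketch. This is not the lab's code: the function names `build_feature_index`, `one_hot_encode`, and `hash_features` are hypothetical, the representation of a row as a list of `(feature_id, value)` pairs is an assumption, and MD5 is used only as a convenient deterministic hash.

```python
import hashlib
from collections import defaultdict


def build_feature_index(dataset):
    """Assign one one-hot slot to every distinct (feature_id, value) pair."""
    distinct = sorted({feature for row in dataset for feature in row})
    return {feature: index for index, feature in enumerate(distinct)}


def one_hot_encode(raw_features, feature_index):
    """Return the sorted indices of the active one-hot slots for one row.

    Pairs that never appeared when the index was built are skipped.
    """
    return sorted(feature_index[f] for f in raw_features if f in feature_index)


def hash_features(raw_features, num_buckets):
    """Hash (feature_id, value) pairs into a sparse {bucket: count} dict.

    The vector length is fixed at num_buckets, so no index has to be
    built or stored; distinct pairs may collide in the same bucket.
    """
    buckets = defaultdict(float)
    for feature_id, value in raw_features:
        key = f"{feature_id}:{value}".encode("utf-8")
        # MD5 is an assumption chosen for determinism, not for security.
        bucket = int(hashlib.md5(key).hexdigest(), 16) % num_buckets
        buckets[bucket] += 1.0
    return dict(buckets)


if __name__ == "__main__":
    data = [
        [(0, "mouse"), (1, "black")],
        [(0, "cat"), (1, "tabby")],
        [(0, "mouse"), (1, "tabby")],
    ]
    index = build_feature_index(data)
    print(one_hot_encode(data[0], index))
    print(hash_features(data[0], num_buckets=4))
```

The trade-off is that one-hot encoding needs a dictionary as large as the number of distinct category values, while hashing bounds the dimension up front at the cost of occasional collisions.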

Thanks

Thanks for the fantastic courses. It was my first time using Spark, and it opened my eyes to processing huge datasets.
