Berkeley_ML_Spark

Here are the assignments for the Berkeley courses CS100.1x and CS190.1x, which introduce machine learning using Spark.

CS100.1x Introduction to Big Data with Apache Spark

Lab 1: Introduce the Spark data model, transformations, and actions, and write a word-counting program to count the words in all of Shakespeare's plays.
Lab 2: Use Spark to explore a NASA Apache web server log.
Lab 3: Perform text analysis and entity resolution on Google and Amazon product listings using Spark.
Lab 4: Use Spark's machine learning library (MLlib) to perform collaborative filtering on a movie dataset.
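
As a rough illustration of the word-counting pattern from Lab 1, here is a minimal sketch in plain Python (no Spark installation needed). It is not the lab solution: the `tokenize` and `word_count` helpers and the two sample lines are made up for this example, and the comments only indicate which RDD operation each step loosely corresponds to.

```python
import re
from collections import defaultdict


def tokenize(line):
    """Lowercase a line and split it into words (hypothetical helper)."""
    return re.findall(r"[a-z0-9']+", line.lower())


def word_count(lines):
    """Count words using the same three steps a Spark job would use."""
    # Roughly rdd.flatMap(tokenize): one stream of words from all lines.
    words = (word for line in lines for word in tokenize(line))
    # Roughly rdd.map(lambda w: (w, 1)): pair each word with a count of 1.
    pairs = ((word, 1) for word in words)
    # Roughly rdd.reduceByKey(lambda a, b: a + b): sum the counts per word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Sample input invented for illustration, not the Shakespeare dataset.
lines = ["To be, or not to be", "that is the question"]
counts = word_count(lines)

# Most frequent words first, ties broken alphabetically.
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:2]
print(top)  # [('be', 2), ('to', 2)]
```

In Spark, the transformations are lazy and nothing runs until an action is called; the plain-Python version above only mirrors the shape of the computation.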

CS190.1x Scalable Machine Learning

Lab 1: Gain hands-on experience using Python's scientific computing library to manipulate matrices and vectors, and learn about lambda functions.
Lab 2: Exactly the same as Lab 1 of CS100.1x.
Lab 3: Millionsong Regression Pipeline. Develop an end-to-end linear regression pipeline to predict the release year of a song given a set of audio features. You will implement a gradient descent solver for linear regression, use Spark's machine learning library (MLlib) to train additional models, tune models via grid search, improve accuracy using quadratic features, and visualize various intermediate results to build intuition.
Lab 4: Click-through Rate Prediction Pipeline. Construct a logistic regression pipeline to predict click-through rate using data from a recent Kaggle competition. You will extract numerical features from the raw categorical data using one-hot encoding, reduce the dimensionality of these features via hashing, train logistic regression models using MLlib, tune hyperparameters via grid search, and interpret probabilistic predictions via an ROC plot.
Lab 5: Neuroimaging Analysis via PCA. Identify patterns of brain activity in larval zebrafish. You will work with time-varying images (generated using a technique called light-sheet microscopy) that capture a zebrafish's neural activity as it is presented with a moving visual pattern. After implementing distributed PCA from scratch and gaining intuition by working with synthetic data, you will use PCA to identify distinct patterns across the zebrafish brain that are induced by different types of stimuli.
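
To give a feel for the gradient descent solver mentioned in Lab 3, here is a small sketch in plain Python. It is an illustration under simplifying assumptions, not the lab's implementation: it uses Python lists instead of RDDs, a fixed step size `alpha`, and a tiny synthetic dataset generated from y = 1 + 2x. The function name `gradient_descent` and its parameters are chosen for this example only.

```python
def gradient_descent(points, num_iters=500, alpha=0.5):
    """Fit linear regression weights by batch gradient descent.

    points: list of (features, label) pairs, where features is a list of floats.
    Minimizes the mean squared error with a fixed step size (an assumption).
    """
    n = len(points)
    d = len(points[0][0])
    w = [0.0] * d
    for _ in range(num_iters):
        grad = [0.0] * d
        for x, y in points:
            # Prediction error for this point: (w . x) - y
            err = sum(wj * xj for wj, xj in zip(w, x)) - y
            # In a distributed setting, these per-point terms would be
            # computed in a map step and summed in a reduce step.
            for j in range(d):
                grad[j] += err * x[j]
        # Step against the averaged gradient.
        w = [wj - alpha * gj / n for wj, gj in zip(w, grad)]
    return w


# Synthetic data: the first feature is a constant 1.0 acting as the intercept.
points = [([1.0, x], 1.0 + 2.0 * x) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
w = gradient_descent(points)
print([round(wj, 4) for wj in w])  # [1.0, 2.0]
```

On real data the step size usually needs tuning (or a decay schedule), and features are typically scaled first; the fixed `alpha` here only works because the example is small and well conditioned.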

Thanks

Thanks for the fantastic courses. This was my first time using Spark, and it opened my eyes to how huge datasets can be processed.
