Overview
This series aims to provide a comprehensive view of designing and building an analytics/AI data pipeline for Stack Overflow data using the AWS stack, culminating in a dashboard built in Einstein Analytics.
Pipelines are the heart of analytics and ML, and they are often the hardest part of an analytics or ML problem. With a well-designed pipeline, half the battle is already won.
- Introduction to Stack Overflow and Business Requirements.
- Technical Design Architecture For an Analytics Pipeline.
- Data Ingestion using Kinesis Firehose and boto3.
- ETL and Data Processing Using Apache Spark on AWS EMR.
- Data Storage in Redshift.
- Einstein Analytics Data Prep & Dashboards.
Motivation
End-to-end analytics solutions are always a challenge. It's easy to build a dashboard from a CSV file, but building a live dashboard by streaming data and transforming it into the relevant form is much harder, with many technical considerations to take into account.
Goal
My goal was to map out the thought process involved in creating a full-fledged analytical solution using AWS and Einstein Analytics.
Tools Used
- S3 as a data lake.
- Kinesis to stream Stack Overflow data.
- Apache Spark to process the data.
- Redshift as a data warehouse to store the transformed data.
- Einstein Analytics for Data Visualization.
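To make the ingestion step concrete, here is a minimal sketch of pushing Stack Overflow records into a Kinesis Firehose delivery stream with boto3. The stream name and record fields are illustrative assumptions, not values from this series.

```python
import json

def make_firehose_record(question):
    """Serialize one Stack Overflow question into the {'Data': ...}
    shape that Firehose expects. Newline-delimited JSON keeps the
    S3 output easy for Spark to read later."""
    return {"Data": json.dumps(question) + "\n"}

def stream_questions(questions, stream_name="stackoverflow-firehose"):
    """Send a batch of question dicts to a Firehose delivery stream.
    Assumes AWS credentials are configured; the stream name is a
    placeholder."""
    import boto3  # imported here so the pure helper above works offline
    client = boto3.client("firehose")
    records = [make_firehose_record(q) for q in questions]
    # put_record_batch accepts up to 500 records per call
    return client.put_record_batch(
        DeliveryStreamName=stream_name, Records=records
    )
```

In a real ingestion script, `questions` would come from polling the Stack Exchange API and the batch size would be capped at Firehose's 500-record limit.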
Technical Design Architecture
- Kinesis Firehose is chosen to stream the data from the Stack Exchange API and deliver it to an S3 bucket folder.
- Spark batch-processes the streamed files from S3 on a daily basis and writes the transformed data to Redshift. This runs as a script scheduled on EC2 once every day.
- Einstein Analytics uses its native S3 connector to sync the data and display it in dashboards. Dashboards are refreshed every day with the previous day's data.
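The daily Spark batch step above could look roughly like the sketch below. The S3 date layout, transform logic, table name, and JDBC connection details are assumptions for illustration; a real job on EMR would use the cluster's configured Redshift connector and credentials.

```python
from datetime import date, timedelta

def daily_input_path(bucket, day):
    """Firehose partitions its S3 output by YYYY/MM/DD (layout assumed),
    so yesterday's folder is the input for today's batch run."""
    return f"s3://{bucket}/{day.year}/{day.month:02d}/{day.day:02d}/"

def run_daily_batch(bucket="so-data-lake"):
    """Read yesterday's raw JSON from S3, transform it, and write the
    result to Redshift. Hypothetical transform and connection details;
    requires pyspark on the EMR cluster."""
    from pyspark.sql import SparkSession  # deferred so the path helper works offline
    spark = SparkSession.builder.appName("so-daily-etl").getOrCreate()
    yesterday = date.today() - timedelta(days=1)
    df = spark.read.json(daily_input_path(bucket, yesterday))
    # Example cleanup: drop duplicate questions and rows missing a title
    cleaned = df.dropDuplicates(["question_id"]).na.drop(subset=["title"])
    # Write via JDBC; URL, table, and credentials are placeholders.
    (cleaned.write.format("jdbc")
        .option("url", "jdbc:redshift://cluster:5439/dev")
        .option("dbtable", "public.questions")
        .mode("append")
        .save())
```

Scheduling this script with a daily cron entry on the EC2 instance completes the batch leg of the pipeline.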
Blog
The full blog post can be read here: