Sanitary Sewer Overflows

Acknowledgement

Data was gathered from the City of Bloomington government data portal.

Requirements

The following are the requirements to run the notebooks:


  • A Databricks Community Edition account, which is free
  • Cluster runtime 5.4, with Spark 2.4 to 3.0
  • Import the waste-water-analysis.dbc file into Databricks; it will generate all the required tools

Overview

Sanitary Sewer Overflows (SSO) are releases of untreated sewage into the environment. The City of Bloomington Utilities Department records and maintains data for all SSO events that occur within Bloomington's wastewater collection and treatment system. Additionally, each event is reported to the Indiana Department of Environmental Management.

The Excel worksheet labeled "Sanitary Sewer Overflow Master" contains data recorded after each SSO event from 1996 forward, including overflow dates, locations, estimated flow, and any additional data available about the individual event (e.g. precipitation, blockage, power outage, snow melt).

Objectives

  • Understanding data quality checks
  • Generating ideas on how we can achieve the main goal(s)
  • Verifying the given metadata
  • Simulating the main project
  • Learning PySpark
  • Data cleaning and data quality checks
  • Setting up a Delta Lake architecture (see the sketch after this list)
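
The Delta Lake objective could start with persisting the raw load as a bronze table. A minimal sketch (hypothetical paths, assuming the sewer_df DataFrame loaded later in this README):

# Hypothetical sketch: write the raw SSO data as a bronze Delta table,
# the first layer of a bronze/silver/gold Delta Lake architecture.
(sewer_df.write
         .format('delta')
         .mode('overwrite')
         .save('/delta/sso/bronze'))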

Data Dictionary

Column      Type       Label  Description
Manhole     text
Start_Date  timestamp  n/a
End_Date    text
Location    text
Event       text       n/a
Rain        text
Gallons     text
Lat         numeric
Long        numeric

Tasks

  1. Verify features (columns).
  2. Verify data types.
  3. Understand the meaning of missing data.
  4. Verify data entries.
  5. Explode/split compound features into simpler/atomic features.
  6. What assumptions can you draw from the data, and what do you understand?
  7. Create new aggregate features from your assumptions, validated using domain knowledge.
  8. Visualize your assumptions or relationships that might exist.
  9. So what might be the main problem behind the problem, and how can this data help better the situation?
  10. Apply the thoughts from task 9 and validate them using domain knowledge or with stakeholders.
  11. Supposing the ideas are valid and we have a stream of data, implement a Spark Structured Streaming ETL (see the sketch after this list).
  12. Build a Databricks dashboard using the streamed data / static data.
  13. Can the dashboard answer business questions?
  14. If the answer to task 13 is "no" or "not sure", what is irrelevant and what can be improved?
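
Tasks 11 and 12 revolve around Spark Structured Streaming. A minimal ETL sketch (hypothetical paths; it assumes the batch sewer_df loaded in the next section, whose schema is reused for the stream):

# read newly arriving SSO CSV files as a stream, reusing the batch schema
stream_df = (spark.readStream
                  .schema(sewer_df.schema)
                  .option('header', 'true')
                  .csv('/FileStore/tables/sso_stream/'))

# append the stream to a Delta table that a Databricks dashboard can query
(stream_df.writeStream
          .format('delta')
          .outputMode('append')
          .option('checkpointLocation', '/delta/sso/_checkpoints')
          .start('/delta/sso/events'))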

Data quality checks overview Report
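
The cells below assume the SSO export has already been loaded into a DataFrame named sewer_df. A minimal sketch of how that could be done (the file path and read options are assumptions, not the repo's actual code):

# hypothetical load of the SSO master worksheet exported as CSV into DBFS
sewer_df = (spark.read
                 .option('header', 'true')
                 .option('inferSchema', 'true')
                 .csv('/FileStore/tables/sanitary_sewer_overflow_master.csv'))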

# review the first 5 rows
sewer_df.show(5)

Output

+-------+-------------------+-------------------+--------------------+-------------+----+-------+-----------+------------+
|Manhole|         Start_Date|           End_Date|            Location|        Event|Rain|Gallons|        Lat|        Long|
+-------+-------------------+-------------------+--------------------+-------------+----+-------+-----------+------------+
|   3430|1996-01-17 00:00:00|1996-01-17 00:00:00|        Gifford Road|Precipitation|null|   9000|39.15461147|-86.58559815|
|   1004|1996-01-17 00:00:00|1996-01-17 00:00:00|        Micro Motors|Precipitation|null| 378000|39.15424046|-86.53475475|
|   3607|1996-01-17 00:00:00|1996-01-17 00:00:00|  Sherwood Oaks Park|Precipitation|null|   6000|39.12983645|-86.51441395|
|   3138|1996-01-17 00:00:00|1996-01-17 00:00:00|Tapp Road Lift St...|Precipitation|null|  16000| 39.1366197|-86.56178514|
|   1004|1996-01-23 00:00:00|1996-01-23 00:00:00|        Micro Motors|Precipitation|null|  90000|39.15424046|-86.53475475|
+-------+-------------------+-------------------+--------------------+-------------+----+-------+-----------+------------+
only showing top 5 rows

# let's verify the schema against the data dictionary & what we saw in the previous cell
sewer_df.printSchema()

Output

root
 |-- Manhole: string (nullable = true)
 |-- Start_Date: timestamp (nullable = true)
 |-- End_Date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Event: string (nullable = true)
 |-- Rain: string (nullable = true)
 |-- Gallons: string (nullable = true)
 |-- Lat: double (nullable = true)
 |-- Long: double (nullable = true)

# let's cast Manhole --> int, End_Date --> timestamp, and Gallons --> int
sewer_df = (sewer_df.withColumn('End_Date', sewer_df.End_Date.cast('timestamp'))
                    .withColumn('Gallons', sewer_df.Gallons.cast('int'))
                    .withColumn('Manhole', sewer_df.Manhole.cast('int'))
                    )
# print changes
sewer_df.printSchema()

Output

root
 |-- Manhole: integer (nullable = true)
 |-- Start_Date: timestamp (nullable = true)
 |-- End_Date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Event: string (nullable = true)
 |-- Rain: string (nullable = true)
 |-- Gallons: integer (nullable = true)
 |-- Lat: double (nullable = true)
 |-- Long: double (nullable = true)
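
Casting a string that cannot be parsed yields null rather than an error in Spark, so it is worth confirming the casts did not silently introduce nulls. A quick check (a sketch, not from the original notebook):

from pyspark.sql import functions as F

# count rows where a cast produced null in any of the three recast columns
sewer_df.filter(F.col('Manhole').isNull() |
                F.col('End_Date').isNull() |
                F.col('Gallons').isNull()).count()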

Missing value report


It is important to discuss the missing data with the stakeholders or the data team
to understand the information conveyed by the missing data.

  • Let's take, for example, the Rain feature with 608 missing entries. What does that mean?

There was no rain on that particular day, right?
So this means we can't drop the Rain feature; instead we fill it with a reasonable value,
e.g. 'No Rain' (see the count sketch below).

  • Location, Long and Lat, on the other hand, might need the stakeholders/data team to clarify
    whether there is some link/direction info between the pipes that can help us infer the location, before making any assumption.

  • 'Event' is mostly Precipitation, which is rain, snow, sleet, or hail: any kind of weather condition where something is falling from the sky.

This feature would make sense as null/missing when there is also no rain, but not always.
Let's mark NaN as clear sky unless validated otherwise.
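
A per-column missing-value count like the 608 figure quoted above can be produced along these lines (a sketch, assuming sewer_df as loaded earlier):

from pyspark.sql import functions as F

# one null count per column, returned as a single-row DataFrame
sewer_df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c)
                 for c in sewer_df.columns]).show()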

Let's investigate a bit more into these categorical features.

sewer_df.select('Location','Event', 'Rain').distinct().show()

output

+--------------------+--------------------+----+
|            Location|               Event|Rain|
+--------------------+--------------------+----+
|Winston-Thomas Ol...|       Leaking joint|null|
|       5900 S Rogers|                null|null|
|Industrial lift s...|Power outage at l...|null|
|  Sherwood Oaks Park|                null|null|
|Grimes Lane - Mic...|       Precipitation|1.58|
|        Gifford Road|       Precipitation|1.23|
|College Mall - St...|       Precipitation|null|
|SW of cul-de-sac ...|Sewer main broken...|null|
|Walnut Creek Lift...|       Precipitation|1.79|
|        Micro Motors|                null|2.50|
|  Blucher Poole WWTP|                null|null|
|         Dunn meadow|Blockage in sewer...|null|
|   Indiana Warehouse|       Precipitation|null|
|Grimes Lane - Mic...|       Precipitation|2.95|
|1500 S Rogers - I...|Snow melt / preci...|1.20|
|        Gifford Road|       Precipitation|0.43|
|Brookdale & Woodburn|Snow melt / preci...|4.38|
|2600 Block of N W...|Blockage in sewer...|null|
|College Mall - Bl...|       Precipitation|1.90|
|  Tower Lift Station|         Power surge|null|
+--------------------+--------------------+----+
only showing top 20 rows

Missing values Assumptions Review


  • So Rain is a measure of rain in volume/litres, not what we suspected. We must extend the data dictionary.

Now we can fill nulls in Rain with a zero, meaning there was no rain.
Wait! This is not a hackathon, Allie; we might need to pull weather data from Google Earth to validate the assumptions.

  • 'Event' has 48 unique values, most being Precipitation, but also Blockage, Sewer main blockage, power outage and more.

  • Location & Event seem to be compound features;
    we need to do some deep-dive mining and see how we can explode them, if possible.

  • Cast Rain to float, which in PySpark is DoubleType (or just FloatType, following Java data types); see the sketch after this list.

More drilling and wrangling is required here.
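
A sketch of the wrangling steps proposed above (the zero-fill is an assumption that still needs validating against external weather data, and the split is a hypothetical first pass):

from pyspark.sql import functions as F

# cast Rain from string to double, then fill nulls with 0.0 (= "no rain")
sewer_df = (sewer_df.withColumn('Rain', F.col('Rain').cast('double'))
                    .fillna({'Rain': 0.0}))

# inspect the 48 distinct Event values and their frequencies
sewer_df.groupBy('Event').count().orderBy(F.desc('count')).show(48, truncate=False)

# hypothetical split of a compound Event such as "Snow melt / precipitation"
sewer_df = sewer_df.withColumn('Event_parts', F.split(F.col('Event'), ' / '))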

Visualizations report

This section will be added after the main private project ends.


WORK IN PROGRESS

  • Due to the kick-off of the main PoC project, this project had to be paused.
  • I look forward to completing all the tasks to enhance my skills.

About

Practice project preparing for the private Explore AI Proof of Concept project, using Databricks & Spark.
