- What is big data? Describe some of the parameters.
- Name and describe the two main hardware strategies for working with big data.
- What is a problem introduced by implementing the hardware strategy used by Hadoop and other typical big data systems?
- What is a cluster?
- In what two ways can we describe a node?
- What is the software architecture pattern used by tools like Hadoop?
- At a high level, name the three main components of Hadoop, and briefly describe the role each one plays.
- What type of storage system is used in Hadoop? Describe how the system works, including the default size of the units of data that make up this system.
- What problem does the default data unit size try to solve in the Hadoop ecosystem? Why is this a problem?
- What components make up the standard storage system in Hadoop?
- What problem is encountered with the default setup of the Hadoop storage system and what is a solution?
- Which component changes when implementing the solution above and how does it change?
- What problem is introduced by implementing the above solution?
- How do we solve this problem in Hadoop and in other components of the Hadoop ecosystem?
- What is WORM?
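The block-storage and replication questions above can be made concrete with a small sketch. This is pure Python and purely illustrative: the block size and node names here are toy values I chose for demonstration, while real HDFS defaults to 128 MB blocks and a replication factor of 3.

```python
# Illustrative sketch of HDFS-style block storage: a file is split into
# fixed-size blocks, and each block is replicated on several nodes.
# Toy sizes only; real HDFS defaults are 128 MB blocks, replication 3.

BLOCK_SIZE = 4          # bytes per block (toy value)
REPLICATION = 3         # copies kept of each block
NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a byte string into fixed-size blocks (the last may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello big data!")
placement = place_replicas(blocks)
print(len(blocks), placement[0])  # 4 blocks; block 0 on node1..node3
```

Losing any single node still leaves two copies of every block, which is the failure-tolerance idea the replication questions are probing.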
- Explain the MapReduce programming paradigm, including a definition of map, reduce, and shuffle.
- In MapReduce, as implemented by Hadoop, how are calculations persisted?
- Why is it best to match the number of mappers to the number of blocks?
- Why is it better to "push" the code/calculations to the data than to "pull" the data to the calculations?
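The map/shuffle/reduce questions can be grounded with the classic word-count example. This is a pure-Python sketch of the paradigm, not Hadoop's actual API: map emits (key, value) pairs, shuffle groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, so each reducer sees one key's values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values (here, by summing the 1s)."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "data is data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

In real Hadoop each mapper would run against one block where it is stored (code pushed to the data), and the intermediate pairs would be written to disk before the shuffle.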
- In one sentence, describe Apache Spark.
- Why is a tool like Spark preferred over something like the Hadoop implementation of MapReduce?
- Name and describe the components of Spark and their functions.
- Name and describe the fundamental data structure in Spark.
- What are the two categories of operations in Spark? How do they differ?
- What is lazy evaluation?
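Lazy evaluation can be illustrated with plain Python generators, which behave analogously to Spark transformations and actions: defining the pipeline does no work, and only requesting a result drives the whole chain.

```python
log = []

def numbers():
    """A source that records each value it actually produces."""
    for i in range(5):
        log.append(f"produce {i}")
        yield i

# Building the pipeline performs no work yet -- like a Spark transformation.
pipeline = (x * x for x in numbers() if x % 2 == 0)
assert log == []  # nothing has executed so far

# Requesting the result -- like a Spark action -- executes the whole chain.
result = list(pipeline)
print(result, len(log))  # [0, 4, 16] 5
```

The payoff in Spark is the same as here: because nothing runs until an action, the engine can see the whole chain of transformations at once and optimize or fuse them before executing.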
- Similar to MapReduce, calculations in Spark are split in a particular way. Describe this split and the ideal data size for it.
- Describe the difference between working at the RDD level and the DataFrame level.
- How do DataFrames relate to RDDs?
- What is the name of the mechanism that makes DataFrames the preferred way for most users to interact with data in Spark?
- Why should the schema always be defined for data in production Spark jobs?
- Briefly describe the medallion architecture for data.
- What is the CAP theorem? Describe each part and name the part that must always be considered when working in a distributed system.
- Describe OLTP vs OLAP.
- Describe the two main categories of file formats and what each is good for.
- Three file formats were discussed during the class; where does each fit into the two main file format categories?
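The row-oriented vs. column-oriented distinction behind the file-format questions can be sketched in plain Python: the same records laid out row-wise (cheap to write or read a whole record, as in transactional formats) versus column-wise (cheap to scan one field across many records, as in analytic formats such as Parquet or ORC). The records here are made-up sample data.

```python
records = [
    {"id": 1, "name": "ada", "score": 90},
    {"id": 2, "name": "bob", "score": 75},
    {"id": 3, "name": "eve", "score": 88},
]

# Row-oriented layout: each record stored together.
row_store = [(r["id"], r["name"], r["score"]) for r in records]

# Column-oriented layout: each field stored together.
column_store = {
    "id": [r["id"] for r in records],
    "name": [r["name"] for r in records],
    "score": [r["score"] for r in records],
}

# An analytic query ("average score") reads one contiguous column in the
# columnar layout, but must visit every row in the row layout.
avg = sum(column_store["score"]) / len(column_store["score"])
print(round(avg, 2))
```

Both layouts hold identical data; what differs is which access pattern is cheap, which is why the OLTP/OLAP question above pairs naturally with this one.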
- What does DAG stand for and how does it work?
- Which component in Hive is responsible for managing the metadata of the tables?
- Which component in Hive is responsible for the data and calculations?
- What is the difference between managed and external tables in Hive?
- What does it mean to partition a table in a tool like Hive?
- What is the messaging paradigm used by tools like Kafka? Hint: the answer can be one word.
- Name and describe the three main components found in Kafka.
- What is the difference between a source and sink in a streaming processing system?
- What is the difference between stream and batch processing?
- Name and describe the three types of processing semantics.
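The delivery-semantics question can be illustrated with a toy consumer. A channel that retries may deliver a message twice (at-least-once); a naive consumer then processes duplicates, while tracking message IDs gives effectively exactly-once processing. This is an illustrative sketch only; real systems use offsets, transactions, or idempotent writes for this.

```python
# A flaky channel redelivers one message (at-least-once delivery).
delivered = ["m1", "m2", "m2", "m3"]  # m2 arrives twice

# Naive consumer: every delivery is processed, so duplicates count twice.
at_least_once = list(delivered)

# Deduplicating consumer: remember seen IDs and skip repeats,
# giving effectively exactly-once processing on top of retries.
seen = set()
exactly_once = []
for msg in delivered:
    if msg not in seen:
        seen.add(msg)
        exactly_once.append(msg)

print(at_least_once, exactly_once)
```

At-most-once is the remaining case: no retries at all, so a lost message is simply never processed.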
- What is the difference between event time and processing time?
- What role does the trigger play in Spark Structured streaming?
- Name and describe the difference between the two types of windows discussed during the lectures.
- What role does the watermark play in Spark Structured Streaming? Hint: It may be helpful to define the watermark by the problem it is trying to solve.
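The window and watermark questions can be made concrete with a toy event-time aggregator: events are bucketed into tumbling windows by their event time, and a watermark (the maximum event time seen so far minus an allowed lateness) decides when an event is too late to be admitted. This is a pure-Python sketch of the idea, not Spark's API, and the window length and lateness values are arbitrary.

```python
from collections import defaultdict

WINDOW = 10        # tumbling window length, in event-time seconds
LATENESS = 5       # allowed lateness before the watermark drops an event

windows = defaultdict(int)   # window start time -> event count
max_event_time = 0
dropped = []

def ingest(event_time: int):
    """Bucket an event by event time unless it falls behind the watermark."""
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - LATENESS
    if event_time < watermark:
        dropped.append(event_time)   # too late: state for it can be discarded
        return
    windows[(event_time // WINDOW) * WINDOW] += 1

# Events arrive out of order: event time differs from processing time.
for t in [1, 12, 14, 3, 25, 8]:
    ingest(t)

print(dict(windows), dropped)  # {0: 1, 10: 2, 20: 1} [3, 8]
```

The watermark is what lets the system finalize old windows and bound its state: without it, every window would have to stay open forever in case a straggler arrived.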