- What is big data? Describe some of the parameters.
- Name and describe the two main hardware strategies for working with big data.
- What is a problem introduced by implementing the hardware strategy used by Hadoop and other typical big data systems?
- What is a cluster?
- In what two ways can we describe a node?
- What is the software architecture pattern used by tools like Hadoop?
- At a high level, name the three main components of Hadoop, and briefly describe the role each one plays.
- What type of storage system is used in Hadoop? Describe how the system works, including the default size of the units of data that make up this system.
- What problem does the default data unit size try to solve in the Hadoop ecosystem? Why is this a problem?
- What components make up the standard storage system in Hadoop?
- What problem is encountered with the default setup of the Hadoop storage system and what is a solution?
- Which component changes when implementing the solution above and how does it change?
- What problem is introduced by implementing the above solution?
- How do we solve this problem in Hadoop and in other components of the Hadoop ecosystem?
- What is WORM?
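The block-storage and replication questions above can be made concrete with a small sketch. This is pure Python and purely illustrative: the block size and node names here are toy values I chose for demonstration, while real HDFS defaults to 128 MB blocks and a replication factor of 3.

```python
# Illustrative sketch of HDFS-style block storage: a file is split into
# fixed-size blocks, and each block is replicated on several nodes.
# Toy sizes only; real HDFS defaults are 128 MB blocks, replication 3.

BLOCK_SIZE = 4          # bytes per block (toy value)
REPLICATION = 3         # copies kept of each block
NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a byte string into fixed-size blocks (the last may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello big data!")
placement = place_replicas(blocks)
print(len(blocks), placement[0])  # 4 blocks; block 0 on node1..node3
```

Losing any single node still leaves two copies of every block, which is the failure-tolerance idea the replication questions are probing.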
- Explain the MapReduce programming paradigm, including a definition of map, reduce, and shuffle.
- In MapReduce, as implemented by Hadoop, how are calculations persisted?
- Why is it best to match the number of mappers to the number of blocks?
- Why is it better to "push" the code/calculations to the data than to "pull" the data to the calculations?
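The map/shuffle/reduce questions can be grounded with the classic word-count example. This is a pure-Python sketch of the paradigm, not Hadoop's actual API: map emits (key, value) pairs, shuffle groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, so each reducer sees one key's values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values (here, by summing the 1s)."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "data is data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

In real Hadoop each mapper would run against one block where it is stored (code pushed to the data), and the intermediate pairs would be written to disk before the shuffle.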
- In one sentence, describe Apache Spark.
- Why is a tool like Spark preferred over something like the Hadoop implementation of MapReduce?
- Name and describe the components of Spark and their functions.
- Name and describe the fundamental data structure in Spark.
- What are the two categories of operations in Spark? How do they differ?
- What is lazy evaluation?
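Lazy evaluation can be illustrated with plain Python generators, which behave analogously to Spark transformations and actions: defining the pipeline does no work, and only requesting a result drives the whole chain.

```python
log = []

def numbers():
    """A source that records each value it actually produces."""
    for i in range(5):
        log.append(f"produce {i}")
        yield i

# Building the pipeline performs no work yet -- like a Spark transformation.
pipeline = (x * x for x in numbers() if x % 2 == 0)
assert log == []  # nothing has executed so far

# Requesting the result -- like a Spark action -- executes the whole chain.
result = list(pipeline)
print(result, len(log))  # [0, 4, 16] 5
```

The payoff in Spark is the same as here: because nothing runs until an action, the engine can see the whole chain of transformations at once and optimize or fuse them before executing.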
- Similar to MapReduce, calculations in Spark are split in a particular way. Describe this split and the ideal data size for it.
- Describe the difference between working at the RDD level and the DataFrame level.
- How do DataFrames relate to RDDs?
- What is the name of the mechanism that makes DataFrames the preferred way for most users to interact with data in Spark?
- Why should the schema always be defined for data in production Spark jobs?
- Briefly describe the medallion architecture for data.
- What is the CAP theorem? Describe each part and name the part that must always be considered when working in a distributed system.
- Describe OLTP vs OLAP.
- Describe the two main categories of file formats and what each is good for.
- Three file formats were discussed during the class; where does each fit into the two main file format categories?
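The row-oriented vs. column-oriented distinction behind the file-format questions can be sketched in plain Python: the same records laid out row-wise (cheap to write or read a whole record, as in transactional formats) versus column-wise (cheap to scan one field across many records, as in analytic formats such as Parquet or ORC). The records here are made-up sample data.

```python
records = [
    {"id": 1, "name": "ada", "score": 90},
    {"id": 2, "name": "bob", "score": 75},
    {"id": 3, "name": "eve", "score": 88},
]

# Row-oriented layout: each record stored together.
row_store = [(r["id"], r["name"], r["score"]) for r in records]

# Column-oriented layout: each field stored together.
column_store = {
    "id": [r["id"] for r in records],
    "name": [r["name"] for r in records],
    "score": [r["score"] for r in records],
}

# An analytic query ("average score") reads one contiguous column in the
# columnar layout, but must visit every row in the row layout.
avg = sum(column_store["score"]) / len(column_store["score"])
print(round(avg, 2))
```

Both layouts hold identical data; what differs is which access pattern is cheap, which is why the OLTP/OLAP question above pairs naturally with this one.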
- What does DAG stand for and how does it work?
- Which component in Hive is responsible for managing the metadata of the tables?
- Which component in Hive is responsible for the data and calculations?
- What is the difference between managed and external tables in Hive?
- What does it mean to partition a table in a tool like Hive?
- What is the messaging paradigm used by tools like Kafka? Hint: the answer can be one word.
- Name and describe the three main components found in Kafka.
- What is the difference between a source and sink in a streaming processing system?
- What is the difference between stream and batch processing?
- Name and describe the three types of processing semantics.
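The delivery-semantics question can be illustrated with a toy consumer. A channel that retries may deliver a message twice (at-least-once); a naive consumer then processes duplicates, while tracking message IDs gives effectively exactly-once processing. This is an illustrative sketch only; real systems use offsets, transactions, or idempotent writes for this.

```python
# A flaky channel redelivers one message (at-least-once delivery).
delivered = ["m1", "m2", "m2", "m3"]  # m2 arrives twice

# Naive consumer: every delivery is processed, so duplicates count twice.
at_least_once = list(delivered)

# Deduplicating consumer: remember seen IDs and skip repeats,
# giving effectively exactly-once processing on top of retries.
seen = set()
exactly_once = []
for msg in delivered:
    if msg not in seen:
        seen.add(msg)
        exactly_once.append(msg)

print(at_least_once, exactly_once)
```

At-most-once is the remaining case: no retries at all, so a lost message is simply never processed.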
- What is the difference between event time and processing time?
- What role does the trigger play in Spark Structured streaming?
- Name and describe the difference between the two types of windows discussed during the lectures.
- What role does the watermark play in Spark Structured Streaming? Hint: It may be helpful to define the watermark by the problem it is trying to solve.
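The window and watermark questions can be made concrete with a toy event-time aggregator: events are bucketed into tumbling windows by their event time, and a watermark (the maximum event time seen so far minus an allowed lateness) decides when an event is too late to be admitted. This is a pure-Python sketch of the idea, not Spark's API, and the window length and lateness values are arbitrary.

```python
from collections import defaultdict

WINDOW = 10        # tumbling window length, in event-time seconds
LATENESS = 5       # allowed lateness before the watermark drops an event

windows = defaultdict(int)   # window start time -> event count
max_event_time = 0
dropped = []

def ingest(event_time: int):
    """Bucket an event by event time unless it falls behind the watermark."""
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - LATENESS
    if event_time < watermark:
        dropped.append(event_time)   # too late: state for it can be discarded
        return
    windows[(event_time // WINDOW) * WINDOW] += 1

# Events arrive out of order: event time differs from processing time.
for t in [1, 12, 14, 3, 25, 8]:
    ingest(t)

print(dict(windows), dropped)  # {0: 1, 10: 2, 20: 1} [3, 8]
```

The watermark is what lets the system finalize old windows and bound its state: without it, every window would have to stay open forever in case a straggler arrived.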