Big Data

Introduction

In today's enterprise world, managing and analyzing vast amounts of data is crucial for gaining competitive insights and driving informed decision-making. This course will equip you with a fundamental understanding of big data concepts and the Hadoop framework, including HDFS, MapReduce, and YARN, tailored to enterprise applications. By the end of this course, you will possess the foundational skills needed to implement and leverage big data technologies in an enterprise environment, enhancing your organization's data processing capabilities.

Educational goals

  • Discover how to manage the rapid growth of data within a company.
  • Explore the different components of a Big Data cluster and how they interact.
  • Understand Big Data paradigms.
  • Understand the advantages of Open Source solutions.
  • Develop a Big Data project from scratch.

Prerequisites

SQL and Python programming, a good understanding of the Linux shell and Git.

Recommended previous courses include DevOps and Git.

Modules

Module 1 (3h) - Big Data introduction

  • Information Systems
  • Distributed systems
  • Horizontal vs vertical scaling
  • Data structure
  • History of data
  • The 3 Vs
  • Who needs Big Data?
  • Big Data clusters
  • The Hadoop Ecosystem
  • Data skills and profiles

Module 2 (3h) - Hadoop core: HDFS and YARN

  • Hadoop ecosystem introduction
  • Hadoop ecosystem projects
  • Hadoop core components
  • HDFS: presentation
  • HDFS: Master / Slave architecture
  • HDFS: File storage
  • HDFS: Data replication example
  • HDFS: Client interactions
  • HDFS: Important properties
  • HDFS: Single Master mode vs High Availability
  • YARN: presentation
  • YARN: Architecture
  • YARN: Applications
  • YARN: Application lifecycle
  • YARN: Job scheduler and resource management

Module 3 (3h) - Distributed processing and the MapReduce framework

  • HDFS + YARN architecture
  • MapReduce: a framework
  • MapReduce: Application steps
  • MapReduce: Word count example
  • MapReduce: Distribution on a cluster
  • MapReduce: Important properties
  • MapReduce vs other frameworks
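
As a preview of the word count example covered in Module 3, the sketch below simulates the three usual MapReduce steps (map, shuffle, reduce) in plain Python on a single machine. It is an illustration only and does not use Hadoop; the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are hypothetical and chosen for this example, not taken from the course material.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

On a real cluster the map and reduce steps would run in parallel on different nodes and the shuffle would move data over the network; here everything runs in one process so that the data flow is easy to follow.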