Lecture

Scheduling: Under the Hood

Related lectures (32)

Scheduling Decisions: Data Locality and Multitenancy

Explores data locality in scheduling decisions for multi-tenant platforms and discusses Hadoop's architecture, execution engine optimizations, and fault tolerance strategies.

Hadoop: Execution Models

Explores Hadoop's execution models, fault tolerance, data locality, and scheduling, highlighting the limitations of MapReduce and alternative distributed processing frameworks.

General-Purpose Distributed Execution System

Explores the design of a general-purpose distributed execution system, covering challenges, specialized frameworks, decentralized control logic, and high-performance shuffle.

Advanced Spark Optimization Techniques: Managing Big Data

Discusses advanced Spark optimization techniques for managing big data efficiently, focusing on parallelization, shuffle operations, and memory management.

Distributed Information Systems: Overview and Models

Covers distributed information systems: key tasks, methods, projects, evaluation, and exam support.

Data Wrangling with Hive: Managing Big Data Efficiently

Covers data wrangling techniques using Apache Hive for efficient big data management.

Big Data Challenges: Distributed Computing with Spark

Explores big data challenges, distributed computing with Spark, RDDs, hardware requirements, MapReduce, transformations, and Spark DataFrames.

Operating System Roles: Referee and Resource Manager

Covers the operating system's role as a referee in managing resources and ensuring security through fault isolation, resource sharing, and communication.

Introduction to Spark Runtime Architecture

Covers the Spark runtime architecture, including RDDs, transformations, actions, and caching for performance optimization.

Spark Data Frames

Covers Spark DataFrames, which are distributed collections of data organized into named columns, and the benefits of using them over RDDs.

Introduction to Applied Data Analysis

Introduces the Applied Data Analysis course at EPFL, covering a broad range of data analysis topics and emphasizing continuous learning in data science.

Introduction to Spark runtime architecture

Introduces Apache Spark, covering its key features, history, RDDs, architecture, and distributed computing framework.

Coordination and Scheduling

Explores coordination and scheduling in operating systems, covering lost wakeup problems, scheduling algorithms, and coordination primitives like sleep and wakeup.

Machine learning: Physics and Data

Delves into the intersection of physics and data in machine learning models, covering topics like atomic cluster expansion force fields and unsupervised learning.

Scaling up: Spark and Big Data

Explores the challenges of big data processing and introduces Spark as a solution.

Transport Equation: Numerical Analysis

Covers optimization, control problems, and neural networks in the context of the transport equation.

Data, big data, clouds and IoT

Explores data representation, databases, cloud computing, and challenges in the cloud environment.

Data Cleaning Challenges: Optimizing Error Detection

Addresses challenges in data cleaning for analysis, proposing optimizations to reduce processing time.

Data Science Essentials

Covers the essentials of data science, including data handling, visualization, and analysis, emphasizing practical skills and active engagement.

Execution Models for Distributed Computing - 2nd generation

Explores the second generation of execution models for distributed computing, focusing on Spark and Resilient Distributed Datasets (RDDs).