Lecture

Scaling up: Spark and Big Data

Related lectures (31)

Big Data Challenges: Scaling to Massive Data

Explores challenges of handling massive data in the era of big data, discussing solutions like MapReduce and Spark.

Big Data Challenges: Distributed Computing with Spark

Explores big data challenges, distributed computing with Spark, RDDs, hardware requirements, MapReduce, transformations, and Spark DataFrames.

Big Data Best Practices and Guidelines

Covers best practices and guidelines for big data, including data lakes, architecture, challenges, and technologies like Hadoop and Hive.

Linear Regression and Logistic Regression

Covers linear and logistic regression for regression and classification tasks, focusing on loss functions and model training.

Advanced Spark Optimization Techniques: Managing Big Data

Discusses advanced Spark optimization techniques for managing big data efficiently, focusing on parallelization, shuffle operations, and memory management.

Logistic Regression: Vegetation Prediction

Explores logistic regression for predicting vegetation proportions in the Amazon region through remote sensing data analysis.

Data Wrangling with Hive: Managing Big Data Efficiently

Covers data wrangling techniques using Apache Hive for efficient big data management.

Introduction to Applied Data Analysis

Introduces the Applied Data Analysis course at EPFL, covering a broad range of data analysis topics and emphasizing continuous learning in data science.

Hadoop Ecosystem: Architectural Choices & MapReduce Programming

Explores the Hadoop ecosystem's architecture and MapReduce programming model, emphasizing strengths and limitations.

General Introduction to Big Data

Covers data science tools, Hadoop, Spark, data lake ecosystems, CAP theorem, batch vs. stream processing, HDFS, Hive, Parquet, ORC, and MapReduce architecture.

Introduction to Spark Runtime Architecture

Covers the Spark runtime architecture, including RDDs, transformations, actions, and caching for performance optimization.

Supervised Learning Fundamentals

Introduces the fundamentals of supervised learning, including loss functions and probability distributions.

Big Data Ecosystems: Technologies and Challenges

Covers the fundamentals of big data ecosystems, focusing on technologies, challenges, and practical exercises with Hadoop's HDFS.

Logistic Regression: Fundamentals and Applications

Explores logistic regression fundamentals, including cost functions, regularization, and classification boundaries, with practical examples using scikit-learn.

Scaling to Massive Data: Spark Fundamentals

Covers the fundamentals of scaling to massive data using Spark, focusing on RDDs, transformations, actions, Spark architecture, and Spark's machine learning toolkit.

Big Data: Best Practices and Guidelines

Covers best practices and guidelines for big data, including data lakes, typical architecture, challenges, and technologies used to address them.

Integrating Scalable Data Storage and Map Reduce Processing with Hadoop

Covers the integration of scalable data storage and map reduce processing using Hadoop, including HDFS, Hive, Parquet, ORC, Spark, and HBase.

Introduction to Spark Runtime Architecture

Introduces Apache Spark, covering its architecture, RDDs, transformations, actions, fault tolerance, deployment options, and practical exercises in Jupyter notebooks.

Logistic Regression: Cost Functions & Optimization

Explores logistic regression, cost functions, gradient descent, and probability modeling using the logistic sigmoid function.

Multiclass Classification

Covers the concept of multiclass classification and the challenges of linearly separating data with multiple classes.