Lecture

Spark DataFrames: Basics and Optimization

Related lectures (32)

Data Wrangling with Hive: Managing Big Data Efficiently

Covers data wrangling techniques using Apache Hive for efficient big data management.

Covers data manipulation and exploration using Python with a focus on visualization techniques.

Introduces the basics of data science, covering decision trees, machine learning advancements, and deep reinforcement learning.

Data Wrangling with Hadoop

Covers data wrangling techniques using Hadoop, focusing on row versus column-oriented databases, popular storage formats, and HBase-Hive integration.

Spark Data Frames

Covers Spark Data Frames, distributed collections of data organized into named columns, and the benefits of using them over RDDs.

Big Data Best Practices and Guidelines

Covers best practices and guidelines for big data, including data lakes, architecture, challenges, and technologies like Hadoop and Hive.

General Introduction to Data Science

Offers a comprehensive introduction to Data Science, covering Python, Numpy, Pandas, Matplotlib, and Scikit-learn, with a focus on practical exercises and collaborative work.

Advanced Spark Optimization Techniques: Managing Big Data

Discusses advanced Spark optimization techniques for managing big data efficiently, focusing on parallelization, shuffle operations, and memory management.

General Introduction to Big Data

Covers data science tools, Hadoop, Spark, data lake ecosystems, CAP theorem, batch vs. stream processing, HDFS, Hive, Parquet, ORC, and MapReduce architecture.

Data Wrangling with Hadoop: Storage Formats and Hive

Explores data wrangling with Hadoop, emphasizing storage formats and Hive for big data processing.

Decision Tree Classification

Covers decision tree classification using KNIME Analytics Platform for data preprocessing and model creation.

Data Cleaning Challenges: Optimizing Error Detection

Addresses challenges in data cleaning for analysis, proposing optimizations to reduce processing time.

Advanced Pandas Functions

Focuses on advanced pandas functions for data manipulation, exploration, and visualization with Python, emphasizing the importance of understanding and preparing data.

Data Wrangling with Hadoop: Advanced Techniques

Covers advanced data wrangling techniques using Hadoop, focusing on Hive and HBase integration.

Hadoop Ecosystem: Architectural Choices & MapReduce Programming

Explores the Hadoop ecosystem's architecture and MapReduce programming model, emphasizing strengths and limitations.

Advanced Spark Optimization

Delves into advanced Spark optimization techniques, emphasizing data partitioning, shuffle operations, and memory management.

Data Wrangling Techniques: HBase and Hive Integration

Covers data wrangling techniques using HBase and Hive, focusing on integration and practical applications.

Introduction to Data Stream Processing

Covers the fundamentals of data stream processing, including tools like Apache Storm and Kafka, key concepts like event time and window operations, and the challenges of stream processing.

Data Science for Engineers: Part 2

Explores data manipulation, exploration, and visualization in data science projects using Python.

Introduction to Spark Runtime Architecture

Introduces Apache Spark, covering its architecture, RDDs, transformations, actions, fault tolerance, deployment options, and practical exercises in Jupyter notebooks.