Data Wrangling with Hadoop: Storage Formats and Hive

Copyright © 2026 EPFL, all rights reserved

Description

This lecture covers data wrangling with Hadoop, focusing on the columnar file formats ORC and Parquet and on the HBase data store. It also introduces Hive and explains its role as a data warehouse for running relational queries over large datasets.

Instructors (3)

Pamela Isabel Delgado Borda

I am a PhD student in the School of Computer and Communication Sciences at EPFL. I am part of the Operating Systems Laboratory, and my advisor is Prof. Willy Zwaenepoel. I received my Bachelor's degree in Systems Engineering from Universidad Catolica Boliviana, Bolivia, in 2008 and my Master's degree in Computer Science, with a specialization in Foundations of Software, from EPFL in 2012.

Olivier Verscheure

Official source

https://mediaspace.epfl.ch/media/0_5odw0x3h

In course

COM-490: Large-scale data science for real-world data

This hands-on course teaches the tools & methods used by data scientists, from researching solutions to scaling up prototypes to Spark clusters. It exposes the students to the entire data science pipe

Related lectures (32)

Data Wrangling with Hadoop

Covers data wrangling techniques using Hadoop, focusing on row versus column-oriented databases, popular storage formats, and HBase-Hive integration.

Data Wrangling with Hive: Managing Big Data Efficiently

Covers data wrangling techniques using Apache Hive for efficient big data management.

General Introduction to Big Data

Covers data science tools, Hadoop, Spark, data lake ecosystems, CAP theorem, batch vs. stream processing, HDFS, Hive, Parquet, ORC, and MapReduce architecture.

Data Science Visualization with Pandas

Covers data manipulation and exploration using Python with a focus on visualization techniques.

Introduction to Data Science

Introduces the basics of data science, covering decision trees, machine learning advancements, and deep reinforcement learning.

Ontological neighbourhood

Computer engineering

Databases: Relational databases