Course

COM-490: Large-scale data science for real-world data

Lectures in this course (41)

Covers collaborative data science tools, big data concepts, Spark, and data stream processing, with tips for the final project.

Data Stream Processing: Apache Kafka and Spark

Covers data stream processing with Apache Kafka and Spark, including event time vs processing time, stream processing operations, and stream-stream joins.

Introduction to Spark Runtime Architecture

Introduces Apache Spark, covering its architecture, RDDs, transformations, actions, fault tolerance, deployment options, and practical exercises in Jupyter notebooks.

Collaborative Data Science: Tools and Techniques

Introduces collaborative data science tools like Git and Docker, emphasizing teamwork and practical exercises for effective learning.

Introduction: What do we mean by Data Science?

Introduces the team, provides a crash course on Python, and explores the journey into Data Science and the importance of refining data.

Big Data: Best Practices and Guidelines

Covers best practices and guidelines for big data, including data lakes, typical architecture, challenges, and technologies used to address them.

General Introduction to Data Science

Offers a comprehensive introduction to Data Science, covering Python, NumPy, Pandas, Matplotlib, and scikit-learn, with a focus on practical exercises and collaborative work.

Introduction to Data Stream Processing

Introduces data stream processing, covering batch vs stream processing, real-time insights, applications, challenges, and tools like Apache Kafka and Spark Streaming.

Spark DataFrames: Basics and Optimization

Covers the basics of Spark DataFrames, their advantages, performance comparison with RDDs, and practical demos.

Big Data Ecosystems: Technologies and Challenges

Covers the fundamentals of big data ecosystems, focusing on technologies, challenges, and practical exercises with Hadoop's HDFS.

Data formats and data wrangling with Hadoop

Explores Apache Hive for data warehousing, data formats, and partitioning, with practical exercises in querying and connecting to Hive.

Data Wrangling Techniques: HBase and Hive Integration

Covers data wrangling techniques using HBase and Hive, focusing on integration and practical applications.

Advanced Spark Optimizations and Partitioning

Covers advanced Spark optimizations, memory management, shuffle operations, and data partitioning strategies to improve big data processing efficiency.

Collaborative Data Science: Tools and Git Workflow

Explores tools like Git and Docker for collaborative data science projects.

Advanced Spark Optimization

Delves into advanced Spark optimization techniques, emphasizing data partitioning, shuffle operations, and memory management.

Spark DataFrames

Covers Spark DataFrames (distributed collections of data organized into named columns) and the benefits of using them over RDDs.

Integrating Scalable Data Storage and MapReduce Processing with Hadoop

Covers the integration of scalable data storage and MapReduce processing using Hadoop, including HDFS, Hive, Parquet, ORC, Spark, and HBase.

Advanced Spark Optimizations and Partitioning

Delves into advanced Spark optimizations, partitioning, data skew, persistence, MLlib, and best practices.

Data Wrangling with Hive: Managing Big Data Efficiently

Covers data wrangling techniques using Apache Hive for efficient big data management.

Big Data Best Practices and Guidelines

Covers best practices and guidelines for big data, including data lakes, architecture, challenges, and technologies like Hadoop and Hive.