Module information

CS4225/CS5425 Big Data Systems for Data Science at the National University of Singapore (NUS) is a module that teaches us how to use big data systems to solve real-world problems.

For AY22/23 Sem 1, the module is taught by Prof. Bryan Hooi Kuen-Yew.

The module is offered in both Semester 1 and Semester 2.

Schedule

Venue: i3-Auditorium

Lecture time: Friday 1830-2030

Tutorial time: Friday 2030-2130

Tutorials are not held every week, only in the weeks that have one scheduled.

Module breakdown

  • Assignment 1 on Hadoop 25%
  • Assignment 2 on Spark 25%
  • Midterms 20%
  • Finals 30%

Both Midterms and Finals are open book.

The Midterms are held physically in the Multi-Purpose Hall (MPH), and the Finals are held both online and in person.

Prerequisites

  • CS2102 or IT2002

Module Details

The module involves:

  1. Network Resources
  2. How to use Hadoop and Spark
  3. MapReduce
  4. NoSQL & SQL
  5. ACID & BASE (mainly BASE)
  6. Large Graph processing
  7. Stream Processing

These are the topics covered:

  1. MapReduce
  2. Relational Databases
  3. Data Mining
  4. NoSQL
  5. Apache Spark
  6. Large Graph Processing
  7. Stream Processing

Lectures

The lectures by Prof Bryan were very interesting and engaging. I enjoyed his lecture style and learnt a lot from the lectures.

He goes through the concepts slowly and explains them with an example which makes it very easy to understand.

The lectures were held both online and in person. I found it easier to understand the concepts when they were held in person.

The concepts were also very useful in real-life applications.

Tutorials

The tutorials for the module were 1 hour long and were not conducted every week. There were a total of 4 tutorials throughout the entire duration of the module.

The topics were mainly on:

No.  Topic
1    Hadoop
2    NoSQL & Spark
3    Graph Processing
4    Stream Processing

Some of the tutorials were conducted right after the content was discussed in the lecture.

As a result, there might not be enough time to attempt a tutorial between finishing the topic in class and the tutorial session.

Looking at the tutorials beforehand helped me understand the topics better.

Assignment 1: Hadoop

The assignment is about using Hadoop to find the top K words, by number of occurrences, in a large text file.
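As a rough illustration of the idea (not the assignment's Java starter code), a top-K word count can be sketched as a map step, a shuffle, and a reduce step in plain Python. The function names, the sample text, and the reading of "top K" as the K most frequent words are all assumptions made for this sketch.

```python
import heapq
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, values):
    """Reduce step: sum the 1s emitted for a single word."""
    return word, sum(values)


def top_k_words(lines, k):
    """Run map -> shuffle -> reduce, then keep the k highest counts."""
    pairs = (pair for line in lines for pair in map_words(line))
    counts = [reduce_counts(w, vs) for w, vs in shuffle(pairs).items()]
    # Rank by higher count first; ties are broken alphabetically.
    return heapq.nsmallest(k, counts, key=lambda wc: (-wc[1], wc[0]))


if __name__ == "__main__":
    # Hypothetical sample input, standing in for the large text file.
    sample = ["big data big systems", "data science with big data"]
    print(top_k_words(sample, 2))
```

On a real cluster the shuffle is done by Hadoop itself; it is written out here only to make the data flow visible.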

This assignment is done in Java. A starter script was provided by the assignment TAs. They also helpfully included a guide on how to set up the environment to start working on the assignment.

The instructions were very clear and easy to follow.

I recommend using the Hadoop cluster in SoC and completing the assignment using VSCode Remote SSH. This was easier to set up compared to the other methods outlined in the handbook.

Some digging within the Hadoop documentation was required to understand how to use the Hadoop API.

Some googling was required to see how to pipe the data from one MapReduce job to another.
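Conceptually, chaining means that the output records of the first job become the input records of the second. The sketch below simulates that in plain Python rather than with the Hadoop API; `run_job` and the mapper and reducer names are hypothetical helpers invented for illustration. In Hadoop itself, this is commonly done by pointing the second job's input path at the first job's output directory.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Simulate one MapReduce job: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # sorted keys mimic the framework's sort phase
        output.extend(reducer(key, groups[key]))
    return output


# Job 1: count how often each word occurs.
def count_mapper(line):
    for word in line.split():
        yield word, 1


def count_reducer(word, ones):
    yield word, sum(ones)


# Job 2: send every (word, count) pair to one key so that a single
# reducer sees all of them and can pick the k largest.
def topk_mapper(word_count):
    yield "all", word_count


def make_topk_reducer(k):
    def topk_reducer(_key, word_counts):
        ranked = sorted(word_counts, key=lambda wc: (-wc[1], wc[0]))
        yield from ranked[:k]
    return topk_reducer


def top_k_pipeline(lines, k):
    job1_output = run_job(lines, count_mapper, count_reducer)
    # The first job's output is fed in directly as the second job's input.
    return run_job(job1_output, topk_mapper, make_topk_reducer(k))


if __name__ == "__main__":
    print(top_k_pipeline(["a b a", "c a b"], 2))
```

Routing everything to a single key keeps the second job simple, at the cost of funnelling all counts through one reducer; that is a design choice of this sketch, not necessarily what the assignment expects.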

Assignment 2: Apache Spark

This assignment is relatively easy and requires some knowledge of the pandas or Spark library in Python.

The assignment can be completed using the syntax of either one.

We were given 2 log files together with 2 notes, which we had to use to answer the questions.

It felt like a forensics assignment of sorts.

Ratings

Overall, the module was very interesting, and its content was useful during system design interviews.

Workload 4/10

Lectures: 3/10 (Light)

  • Mainly listening and jotting down notes.
  • They were relatively easy to digest.

Tutorials: 3/10 (Light)

  • Mainly listening and jotting down notes as well.
  • Re-doing the tutorials was good for revision.

Assignments: 5/10 (Moderate)

  • Students need to do some research on their own to complete the assignments.

Organization 8/10

The module was very well organized, and the tutorials were on topic. The assignments were relatively easy to understand and complete.

There were no hiccups during the module.

Learning 9/10

Many of the topics were very interesting and useful. I learnt more about distributed systems and how different pipelines can be used to organize data.

Enjoyment 9/10

The class was very relaxing, and I enjoyed the lectures, tutorials and assignments as a whole. It felt more like learning for the sake of the content than learning in order to score in exams.

Usefulness 9/10

The content was very useful for Systems Design interview questions, especially the content on NoSQL, SQL and distributed systems.

The contents of the module can be applied to build scalable systems for data science.

Overall 8.75/10

Overall, I found this module very enjoyable. I would definitely recommend it to anyone who is specializing in Database Systems or who wants to learn more about databases and distributed systems.

Expected Grade: A/A-

The bell curve for this module was quite high.

Actual Grade: B+

  1. NUSMods
  2. Hadoop API
  3. Data Intensive Text Processing With Map Reduce
  4. My CS4225 Notes