HW 5: Map Reduce (Rust)
MapReduce is a programming model for scalable and highly parallelized data processing. It abstracts away the complexities of developing a fault-tolerant distributed system by exposing a simple API where the user specifies two functions:
- map: produces a set of key/value pairs from the input data
- reduce: combines values corresponding to the same key
With these functions, tasks can be automatically parallelized and executed on a cluster. This paradigm is particularly powerful since it allows programmers with little background in distributed systems to write parallelizable code for a wide variety of real world tasks.
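As a rough illustration of the two functions above, here is a single-process word count sketch. The function names and signatures (`map`, `reduce`, `run`) are assumptions made for this example and are not the API used by the assignment's starter code; a real system would run the map and reduce calls on separate workers.

```rust
use std::collections::BTreeMap;

// Hypothetical map function: emits a (word, 1) pair for every word in the input.
fn map(input: &str) -> Vec<(String, u64)> {
    input
        .split_whitespace()
        .map(|word| (word.to_lowercase(), 1))
        .collect()
}

// Hypothetical reduce function: combines all values emitted for one key.
fn reduce(_key: &str, values: &[u64]) -> u64 {
    values.iter().sum()
}

// Runs map over every input, groups the values by key, then reduces each group.
// Everything happens sequentially here; the grouping step stands in for the
// "shuffle" that a distributed implementation would perform between phases.
fn run(inputs: &[&str]) -> BTreeMap<String, u64> {
    let mut groups: BTreeMap<String, Vec<u64>> = BTreeMap::new();
    for input in inputs {
        for (key, value) in map(input) {
            groups.entry(key).or_default().push(value);
        }
    }
    groups
        .into_iter()
        .map(|(key, values)| {
            let total = reduce(&key, &values);
            (key, total)
        })
        .collect()
}

fn main() {
    let counts = run(&["the quick fox", "the lazy dog the end"]);
    for (word, n) in &counts {
        println!("{word}: {n}");
    }
}
```

Because each `map` call depends only on its own input and each `reduce` call depends only on the values for one key, both phases can be split across machines without changing the result.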
In this assignment, which is loosely based on a lab from MIT, you will be implementing your own fault-tolerant MapReduce system in Rust. Specifically, you will be implementing a coordinator process that distributes tasks to worker processes that have already been implemented for you. You will also handle worker failure by implementing heartbeats and task redistribution. The system design you will be building is similar to that outlined in the MapReduce paper.
In this assignment, we use the term “coordinator” rather than “master”.
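To make the heartbeat idea mentioned above more concrete, the following is a minimal sketch of how a coordinator might track which workers are still alive. The names `HeartbeatTracker`, `record`, and `reap`, the `WorkerId` type, and the 5-second timeout are all hypothetical choices for this example; the starter code and the assignment's required design may differ.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical worker identifier type for this sketch.
type WorkerId = u32;

// Remembers when each worker was last heard from.
struct HeartbeatTracker {
    timeout: Duration,
    last_seen: HashMap<WorkerId, Instant>,
}

impl HeartbeatTracker {
    fn new(timeout: Duration) -> Self {
        Self { timeout, last_seen: HashMap::new() }
    }

    // Called whenever a heartbeat arrives from `worker`.
    fn record(&mut self, worker: WorkerId, now: Instant) {
        self.last_seen.insert(worker, now);
    }

    // Removes and returns (in ascending id order) every worker whose last
    // heartbeat is older than the timeout. The caller would then hand those
    // workers' tasks to someone else.
    fn reap(&mut self, now: Instant) -> Vec<WorkerId> {
        let timeout = self.timeout;
        let mut dead: Vec<WorkerId> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.duration_since(**seen) > timeout)
            .map(|(id, _)| *id)
            .collect();
        dead.sort_unstable();
        for id in &dead {
            self.last_seen.remove(id);
        }
        dead
    }
}

fn main() {
    let mut tracker = HeartbeatTracker::new(Duration::from_secs(5));
    let start = Instant::now();
    tracker.record(1, start);
    tracker.record(2, start + Duration::from_secs(4));
    // Six seconds in, worker 1 has been silent for longer than the timeout.
    let dead = tracker.reap(start + Duration::from_secs(6));
    println!("presumed failed: {dead:?}");
}
```

Passing `now` in as a parameter rather than calling `Instant::now()` inside the tracker keeps the logic easy to test without sleeping.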
Components
This assignment is split up into three parts with distinct deadlines to help you space out the workload. Deadlines can be found on Ed.
gRPC lab
To help you get a basic idea of how the coordinator will communicate with workers, we have put together a lab that walks you through how gRPC works. gRPC is a modern open source Remote Procedure Call (RPC) framework. This lab is worth 5% of your grade on the entire assignment. The code you write for the lab will be in a separate directory (lab-grpc-rs) from the rest of the HW Map Reduce code (hw-map-reduce-rs). To trigger the autograder for the lab portion, you must push the code in the lab-grpc-rs directory. There are no extensions for the lab portion of this assignment.
Checkpoint
You will be expected to complete the first part of the assignment (up to and including the Tasks section) by an earlier deadline. This checkpoint is worth 5% of your grade on the entire assignment. Note that for the checkpoint, you have to manually trigger the autograder build. If you miss the checkpoint, you may get up to a 95% on the assignment if you pass all the tests by the final deadline, assuming that you have completed the lab in time. If you additionally do not turn in the lab in time, you may get up to a 90% on the assignment. There are no extensions for the checkpoint portion of this assignment.
Final
The final component consists of the rest of the tasks (through Fault tolerance).
Getting started
It is strongly recommended that you complete this assignment locally rather than on your VM. Compilation will be much faster, and you will not need to worry about starting up your VM and connecting to it.
If you have not yet set up a local repository, run the following commands directly on your computer (feel free to modify the first command if you would like to set up the repo in a different location):
git clone git@github.com:Berkeley-CS162/studentXXX.git ~/code/personal # CHANGE THIS!
cd ~/code/personal
git remote rename origin personal
git pull personal main
git remote add staff git@github.com:Berkeley-CS162/student0.git
git pull staff main
cd hw-map-reduce-rs
With your repo set up, pull the staff starter code:
cd ~/code/personal
git pull staff main
cd hw-map-reduce-rs
If you made changes to your repo on a different (virtual) machine, make sure to run git pull personal main as well.
If you are doing the lab locally, you will need to install CMake by following the directions here. Make sure you install a relatively new version of CMake (at least 3.20.0) to ensure that you don’t run into any issues.
To check if CMake is installed correctly, run
cmake --version
and check if it outputs the version you expect.
If Rust is not installed, install it according to the directions here.