Fault tolerance
In this part, you will complete your MapReduce system by implementing fault tolerance. Specifically, you will update your coordinator to handle worker crashes and failures. Your system does not need to tolerate failures of the coordinator. You also do not need to implement Byzantine fault tolerance (that is, you can assume that workers do not behave maliciously).
You should handle worker crashes by detecting when a worker has failed, and reassigning relevant tasks to other workers. Do not reassign tasks if a worker is still alive, but is executing slowly. (You might do this for a real system, but for this assignment, we’ll keep it simple and only expect tasks to be reassigned if the worker running them fails.)
Worker crashes
Your coordinator should be able to determine whether a worker has died. This should be implemented by checking whether the worker has sent a heartbeat in the last TASK_TIMEOUT_SECS seconds. When choosing a task to assign, you should consider failed tasks as available for (re)assignment.
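The heartbeat check described above can be sketched as follows. This is a minimal illustration in Python, not the assignment's actual code: the `HeartbeatTracker` class, its method names, and the timeout value of 5 are all assumptions made for the example. Only the name TASK_TIMEOUT_SECS comes from the text.

```python
import time

# Assumed value for illustration; use the constant your starter code defines.
TASK_TIMEOUT_SECS = 5


class HeartbeatTracker:
    """Hypothetical helper that records the latest heartbeat per worker."""

    def __init__(self, timeout_secs=TASK_TIMEOUT_SECS, clock=time.monotonic):
        self.timeout_secs = timeout_secs
        self.clock = clock  # injectable so tests can control time
        self.last_seen = {}  # worker_id -> time of the latest heartbeat

    def record_heartbeat(self, worker_id):
        self.last_seen[worker_id] = self.clock()

    def is_alive(self, worker_id):
        last = self.last_seen.get(worker_id)
        if last is None:
            return False  # never heard from this worker
        return self.clock() - last <= self.timeout_secs

    def dead_workers(self):
        # Workers that registered a heartbeat once but have since gone silent.
        return sorted(w for w in self.last_seen if not self.is_alive(w))
```

A monotonic clock is used here so that wall-clock adjustments cannot make a live worker appear dead. The coordinator could call `is_alive` lazily whenever it picks a task, which avoids needing a background timer.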
Tasks should be eligible for (re)assignment if:
- The task is a reduce task, is incomplete, and was assigned to a worker that crashed.
- The task is a map task, and was assigned to a worker that crashed.
Note that upon a worker crash, map tasks assigned to that worker must be re-executed, even if they were marked complete. This is because map task outputs are buffered in memory, and that memory is no longer accessible if a worker crashes.
Completed reduce tasks should not be reassigned, since their output is stored on disk.
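One way to express the eligibility rules above is a single predicate. The sketch below is in Python and uses hypothetical names (`Task`, `eligible_for_assignment`, and an `is_alive` callback); your coordinator's data structures will likely differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    # Hypothetical task record for illustration.
    is_reduce: bool
    complete: bool = False
    worker_id: Optional[int] = None  # None means never assigned


def eligible_for_assignment(task, is_alive):
    """Return True if the task may be handed to a worker.

    `is_alive` is a callable worker_id -> bool (an assumed liveness check).
    """
    if task.worker_id is None:
        return True  # fresh task, never assigned
    if is_alive(task.worker_id):
        return False  # never take work from a live worker, even a slow one
    if task.is_reduce:
        return not task.complete  # completed reduce output is already on disk
    return True  # map output lived in the dead worker's memory
```

Note that a completed map task on a dead worker is treated the same as an incomplete one, which matches the rule that its in-memory output is lost.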
Reduce task failures
If a reduce worker cannot reach another worker to fetch map task results, the reduce worker should not crash. Instead, it should receive a new assignment. If the worker it tried to reach is determined to be truly dead, that worker's map tasks should be reassigned.
In this case, the worker will issue a FailTask RPC to alert the coordinator that the worker is no longer working on the failed task. Implement this RPC so that the coordinator reassigns the task as necessary.
The protocol buffers have already been provided for you in proto/coordinator.proto:
message FailTaskRequest {
    uint32 worker_id = 1;
    uint32 job_id = 2;
    uint32 task = 3;
    bool reduce = 4;
    bool retry = 5;
    string error = 6;
}
In this scenario, the worker will set retry = true because the task should work once the appropriate map tasks are re-executed. The coordinator should note this and assign the task again. You can choose whether or not it waits before re-assigning the task.
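The retry path might look roughly like the sketch below. It is written in Python for brevity. The `FailTaskRequest` dataclass mirrors the fields of the protobuf message, while `Job`, its `assigned` and `pending` fields, and `handle_retryable_failure` are hypothetical names invented for this example. This version re-queues the task immediately, although waiting first is equally acceptable.

```python
from dataclasses import dataclass, field


@dataclass
class FailTaskRequest:
    # Mirrors the fields of the protobuf message of the same name.
    worker_id: int
    job_id: int
    task: int
    reduce: bool
    retry: bool
    error: str = ""


@dataclass
class Job:
    # Hypothetical bookkeeping: (task number, is_reduce) -> assigned worker_id.
    assigned: dict = field(default_factory=dict)
    # Tasks waiting to be handed out again.
    pending: list = field(default_factory=list)


def handle_retryable_failure(job, req):
    """Re-queue a task reported with retry = true; return True if re-queued."""
    key = (req.task, req.reduce)
    # Ignore stale reports from a worker that no longer owns the task.
    if req.retry and job.assigned.get(key) == req.worker_id:
        del job.assigned[key]
        job.pending.append(key)
        return True
    return False
```

Checking that the reporting worker still owns the task is a design choice in this sketch: it guards against a late FailTask report undoing a reassignment that has already happened.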
Job failures
If an error that cannot be fixed occurs, the job should fail. That is, no more tasks for the job should be assigned, and polling the job’s status with the PollJob RPC should give failed = true. It does not matter if you set the PollJobReply’s done field to true or false.
Examples of errors that should cause a job to fail immediately include:
- Being unable to find or open an input file
- Being unable to write to an output file
- Receiving an error from an application map or reduce function
In the case of these errors, the worker will issue a FailTask RPC with retry = false and an error message in the error field.
You must ensure that error messages returned by map or reduce functions are reported to clients via the errors field of the PollJobReply message. It is OK to add additional context to the error message, but you must preserve the precise message reported by the worker.
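The unrecoverable path can be sketched as below, again in Python with hypothetical names (`JobState`, `record_fatal_failure`, `next_task`, `poll_job`). The sketch assumes the errors field holds a list of strings and represents the PollJobReply as a plain dictionary. Check the provided proto file for the real field types.

```python
from dataclasses import dataclass, field


@dataclass
class JobState:
    # Hypothetical per-job record kept by the coordinator.
    failed: bool = False
    done: bool = False
    errors: list = field(default_factory=list)
    pending_tasks: list = field(default_factory=list)


def record_fatal_failure(job, error):
    """Handle a FailTask report with retry = false."""
    job.failed = True
    job.errors.append(error)  # keep the worker's message verbatim
    job.pending_tasks.clear()  # nothing further should be scheduled


def next_task(job):
    """Return the next task number, or None if the job has failed or is idle."""
    if job.failed or not job.pending_tasks:
        return None
    return job.pending_tasks.pop(0)


def poll_job(job):
    """Build a reply shaped like PollJobReply (a dict in this sketch)."""
    return {"done": job.done, "failed": job.failed, "errors": list(job.errors)}
```

Storing the worker's message unmodified and adding any extra context separately is one simple way to satisfy the requirement that the precise message be preserved.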
Debugging
Fault tolerance is difficult to debug due to its dependence on timing. For some tips, take a look at the Testing and debugging section.
Autograder
After completing this, you should be passing all the autograder tests.