Fault tolerance

Worker crashes
Reduce task failures
Job failures
Debugging
Autograder

In this part, you will complete your MapReduce system by implementing fault tolerance. Specifically, you will update your coordinator to handle worker crashes and failure. Your system does not need to tolerate failures of the coordinator. You also do not need to implement Byzantine fault-tolerance (that is, you can assume that workers do not behave maliciously).

You should handle worker crashes by detecting when a worker has failed, and reassigning relevant tasks to other workers. Do not reassign tasks if a worker is still alive, but is executing slowly. (You might do this for a real system, but for this assignment, we’ll keep it simple and only expect tasks to be reassigned if the worker running them fails.)

Worker crashes

Your coordinator should be able to determine whether a worker has died. This should be implemented by checking whether the worker has sent a heartbeat in the last TASK_TIMEOUT_SECS seconds. When choosing a task to assign, you should consider failed tasks as available for (re)assignment.

Tasks should be eligible for (re)assignment if:

The task is a reduce task, is incomplete, and was assigned to a worker that crashed.
The task is a map task, and was assigned to a worker that crashed.

Note that upon a worker crash, map tasks assigned to that worker must be re-executed, even if they were marked complete. This is because map task outputs are buffered in memory, and that memory is no longer accessible if a worker crashes.

Completed reduce tasks should not be reassigned, since their output is stored on disk.

Reduce task failures

In the case where a reduce worker tries and fails to reach a worker for map task results, the worker should not crash. Instead, it should receive a new assignment, and if the worker it tried to reach is determined to be truly dead, its map tasks should be reassigned.

In this case, the worker will issue a FailTask RPC to alert the coordinator that the worker is no longer working on the failed task. Implement this RPC so that the coordinator reassigns the task as necessary.

The protocol buffers have already been provided for you in proto/coordinator.proto:

message FailTaskRequest {
  uint32 worker_id = 1;
  uint32 job_id = 2;
  uint32 task = 3;
  bool reduce = 4;
  bool retry = 5;
  string error = 6;
}

In this scenario, the worker will set retry = true because the task should work once the appropriate map tasks are re-executed. The coordinator should note this and assign the task again. You can choose whether or not it waits before re-assigning the task.

Job failures

If an error that cannot be fixed occurs, the job should fail. That is, no more tasks for the job should be assigned, and polling the job’s status with the PollJob RPC should give failed = true. It does not matter if you set the PollJobReply’s done field to true or false.

Examples of errors that should cause a job to fail immediately include:

Being unable to find or open an input file
Being unable to write to an output file
Receiving an error from an application map or reduce function

In the case of these errors, the worker will issue a FailTask RPC with retry = false and an error message in the error field.

You must ensure that error messages returned by map or reduce functions are reported to clients via the errors field of the PollJobReply message. It is OK to add additional context to the error message, but you must preserve the precise message reported by the worker.

Debugging

Fault tolerance is difficult debug due to its dependence on timing. For some tips, take a look at the Testing and debugging section.

Autograder

After completing this, you should be passing all the autograder tests.