Q: What are some uses of Raft besides GFS master replication?

A: You could (and will) build a fault-tolerant key/value database using Raft.

You could make the MapReduce master fault-tolerant with Raft.

You could build a fault-tolerant locking service.
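For concreteness, here is a minimal sketch (in Go, in the spirit of the labs) of how a key/value service might sit on top of Raft. The Raft interface, ApplyMsg type, and all of the field and method names here are illustrative assumptions, not the labs' or the paper's exact API:

```go
// A minimal sketch of a fault-tolerant key/value service layered on Raft.
// The Raft interface, ApplyMsg struct, and all names below are assumptions
// for illustration only.
package kvraft

import "sync"

// ApplyMsg is what Raft delivers once a log entry has been committed.
type ApplyMsg struct {
	Command interface{}
}

// Raft is the small slice of the Raft API this service depends on.
type Raft interface {
	// Start appends a command to the log if this peer is currently leader.
	Start(command interface{}) (index int, term int, isLeader bool)
}

// Op is the command type replicated through the Raft log.
type Op struct {
	Kind  string // "Put" or "Get"
	Key   string
	Value string
}

type KVServer struct {
	mu      sync.Mutex
	rf      Raft
	applyCh chan ApplyMsg     // Raft delivers committed entries here
	store   map[string]string // the replicated state machine
}

// Put hands the operation to Raft; it only takes effect after the entry
// commits and arrives on applyCh. A real server would block until then.
func (kv *KVServer) Put(key, value string) bool {
	_, _, isLeader := kv.rf.Start(Op{Kind: "Put", Key: key, Value: value})
	return isLeader
}

// applyLoop applies committed operations, in log order, to the local table.
func (kv *KVServer) applyLoop() {
	for msg := range kv.applyCh {
		op := msg.Command.(Op)
		kv.mu.Lock()
		if op.Kind == "Put" {
			kv.store[op.Key] = op.Value
		}
		kv.mu.Unlock()
	}
}
```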

Q: In Section 8, why does a new leader need to commit a no-op entry at the start of its term?

A: The problem situation is shown in Figure 8. If S1 becomes leader after (b), it cannot know whether its last log entry (the entry from term 2) is committed. That entry will turn out not to be committed if S1 immediately fails and S5 becomes the next leader; in that case S5 will force all peers (including S1) to adopt logs identical to S5's log, which does not include the term-2 entry.

But suppose S1 manages to commit a new entry during its current term (term 4). If S5 sees that new entry, S5 will erase the term-3 entry from its log and accept the term-2 entry (along with S1's new entry) in its place. If S5 does not see the new entry, S5 cannot become the next leader if S1 fails, because it will fail the Election Restriction. Either way, once S1 has committed a new entry for its term, it can correctly conclude that every preceding entry in its log is committed.
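This is exactly Raft's commitment rule from Section 5.4.2: a leader only counts replicas toward commitment for entries from its own current term, and older entries become committed indirectly. Here is a rough sketch of that rule; the Entry type, the matchIndex layout (one slot per server, leader included), and the function name are simplifications for illustration:

```go
// A sketch of the Section 5.4.2 commitment rule: never directly commit an
// entry from an earlier term; once a current-term entry is on a majority,
// it and everything before it are committed.
type Entry struct {
	Term    int
	Command interface{}
}

// advanceCommitIndex returns the new commit index, assuming matchIndex has
// one slot per server (the leader's own slot covers its whole log).
func advanceCommitIndex(log []Entry, matchIndex []int, currentTerm, commitIndex int) int {
	for n := len(log); n > commitIndex; n-- {
		// Skip entries from earlier terms; they are never committed directly.
		if log[n-1].Term != currentTerm {
			continue
		}
		count := 0
		for _, m := range matchIndex {
			if m >= n {
				count++
			}
		}
		// A current-term entry on a majority commits itself and all
		// preceding entries.
		if count > len(matchIndex)/2 {
			return n
		}
	}
	return commitIndex
}
```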

The no-op text at the end of Section 8 is talking about an optimization in which the leader executes and answers read-only commands (e.g. get("k1")) without committing those commands in the log. For example, for get("k1"), the leader just looks up "k1" in its key/value table and sends the result back to the client. If the leader has just started, it may have at the end of its log a put("k1", "v99"). Should the leader send "v99" back to the client, or the value in the leader's key/value table? At first, the leader doesn't know whether that v99 log entry is committed (and must be returned to the client) or not committed (and must not be sent back). So (if you are using this optimization) a new Raft leader first tries to commit a no-op to the log; if the commit succeeds (i.e. the leader doesn't crash), then the leader knows everything before that point is committed.
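Here is one way that check might look in code; the type and field names (ReadOnlyLeader, commitNoOp, and so on) are hypothetical, just to make the order of operations concrete:

```go
// A sketch of the read-only optimization: a new leader refuses to answer
// reads from its key/value table until it has committed at least one entry
// (here, a no-op) in its own term. All names are illustrative.
type ReadOnlyLeader struct {
	table               map[string]string
	committedInThisTerm bool
	commitNoOp          func() bool // blocks until the no-op commits or leadership is lost
}

// Get answers a read-only request without appending it to the log.
func (l *ReadOnlyLeader) Get(key string) (string, bool) {
	if !l.committedInThisTerm {
		// Until this succeeds, the leader cannot tell whether trailing
		// entries from earlier terms (e.g. put("k1","v99")) are committed,
		// so it cannot know whether its table reflects the committed log.
		if !l.commitNoOp() {
			return "", false // lost leadership; the client should retry elsewhere
		}
		l.committedInThisTerm = true
	}
	return l.table[key], true
}
```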

Q: How does using the heartbeat mechanism to provide leases for read-only operations work, and why does this require timing assumptions for safety (e.g. bounded clock skew)?

A: I don't know exactly what the authors had in mind. Perhaps every AppendEntries RPC the leader sends out says or implies that no other leader is allowed to be elected for the next 100 milliseconds. If the leader gets positive responses from a majority, then the leader can serve read-only requests for the next 100 milliseconds without further communication with the followers.

This requires the servers to have the same definition of what 100 milliseconds means, i.e. they must have clocks that tick at close to the same rate.
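Here is a sketch of how such a lease might be implemented; the 100 ms figure comes from the guess above, and all names (leaseDuration, onHeartbeatRound, and so on) are hypothetical, since the paper leaves the details open:

```go
// A sketch of heartbeat-based leases. Correctness depends on all servers'
// clocks ticking at close to the same rate relative to leaseDuration.
package lease

import (
	"sync"
	"time"
)

const leaseDuration = 100 * time.Millisecond

type Leader struct {
	mu         sync.Mutex
	leaseUntil time.Time // the leader may serve reads locally until this time
}

// onHeartbeatRound is called after a round of AppendEntries: sentAt is when
// the round was sent, acks counts positive replies (including the leader
// itself), and n is the cluster size.
func (l *Leader) onHeartbeatRound(sentAt time.Time, acks, n int) {
	if acks > n/2 {
		l.mu.Lock()
		// Measure the lease from when the heartbeats were *sent*, so the
		// estimate is conservative even if replies were delayed.
		l.leaseUntil = sentAt.Add(leaseDuration)
		l.mu.Unlock()
	}
}

// canServeReadLocally reports whether the leader may answer a read without
// another round of communication with the followers.
func (l *Leader) canServeReadLocally() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	return time.Now().Before(l.leaseUntil)
}
```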

Q: What exactly do the C_old and C_new variables in Section 6 (and Figure 11) represent? Are they the leader in each configuration?

A: They are the set of servers in the old/new configuration. The paper doesn't provide details. I believe it's the identities (network names or addresses) of the servers.
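For concreteness, a configuration might be represented as nothing more than a set of server addresses; this is a guess, since the paper doesn't give details:

```go
// One plausible representation of a configuration: just the identities
// (network addresses) of its servers. The field name is illustrative.
type Configuration struct {
	Servers []string // e.g. []string{"s1:7000", "s2:7000", "s3:7000"}
}
```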

Q: When transitioning from cluster C_old to cluster C_new, how can we create a hybrid cluster C_{old,new}? I don't really understand what that means. Isn't it following either the network configuration of C_old or of C_new? What if the two networks disagreed on a connection?

A: During the period of joint consensus (while C_{old,new} is active), the leader is required to get a majority from both the servers in C_old and the servers in C_new.

There can't really be disagreement, because after C_{old,new} is committed into the logs of both C_old and C_new (i.e. after the period of joint consensus has started), any new leader in either C_old or C_new is guaranteed to see the log entry for C_{old,new}.
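The quorum rule during joint consensus can be written down directly; this sketch (with illustrative names) shows the "majority of both" check the leader applies to elections and commitment while C_{old,new} is in effect:

```go
// A sketch of the joint-consensus quorum rule: while C_{old,new} is active,
// an entry (or an election) only succeeds with a majority from C_old AND a
// majority from C_new. acked maps server identity to whether it agreed.
func jointMajority(cOld, cNew []string, acked map[string]bool) bool {
	return majority(cOld, acked) && majority(cNew, acked)
}

func majority(servers []string, acked map[string]bool) bool {
	count := 0
	for _, s := range servers {
		if acked[s] {
			count++
		}
	}
	return count > len(servers)/2
}
```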

Q: I'm confused about Figure 11 in the paper. I'm unsure about how exactly the transition from 'C_old' to 'C_old,new' to 'C_new' goes. Why is there the issue of the cluster leader not being a part of the new configuration, where the leader steps down once it has committed the 'C_new' log entry? (The second issue mentioned in Section 6)

A: Suppose C_old={S1,S2,S3} and C_new={S4,S5,S6}, and that S1 is the leader at the start of the configuration change. At the end of the configuration change, after S1 has committed C_new, S1 should not be participating any more, since S1 isn't in C_new. One of S4, S5, or S6 should take over as leader.
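That step-down decision is just a membership check once C_new has been committed; a sketch with illustrative names:

```go
// shouldStepDown reports whether a leader with identity me must stop acting
// as leader once C_new has committed, i.e. whether it is absent from the
// new configuration (S1 in the example above).
func shouldStepDown(me string, cNew []string) bool {
	for _, s := range cNew {
		if s == me {
			return false // still a member of the new configuration
		}
	}
	return true // committed C_new but not in it: revert to follower
}
```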

Q: About cluster configuration: During the configuration change time, if we have to stop receiving requests from the clients, then what's the point of having this automated configuration step? Doesn't it suffice to just 1) stop receiving requests 2) change the configurations 3) restart the system and continue?

A: The challenge here is ensuring that the system is correct even if there are failures during this process, and even if not all servers get the "stop receiving requests" and "change the configuration" commands at the same time. Any scheme has to cope with the possibility of a mix of servers that have and have not seen or completed the configuration change -- this is true even of a non-automated system. The paper's protocol is one way to solve this problem.

Q: The last two paragraphs of section 6 discuss removed servers interfering with the cluster by trying to get elected even though they've been removed from the configuration. Wouldn't a simpler solution be to require servers to be shut down when they leave the configuration? It seems that leaving the cluster implies that a server can't send or receive RPCs to the rest of the cluster anymore, but the paper doesn't assume that. Why not? Why can't you assume that the servers will shut down right away?