Q: What's the main take-away from this paper?

A: ZooKeeper's main academic contribution lies in the detailed design of a storage system specialized for fault-tolerant, high-performance configuration management: watches, sessions, the choice of consistency guarantees, and the specific semantics of the operations. It builds on existing work such as Chubby and Paxos.

A lot of what's interesting for us in 6.824 is the idea that one can obtain fault tolerance by keeping the critical state in fault-tolerant storage (ZooKeeper), and running the computation in non-fault-tolerant servers. For example, a MapReduce master might keep state about jobs, task status, workers, location of intermediate output, &c, in ZooKeeper. If the master fails, a new computer can be selected to run the MapReduce master software, and load its state from ZooKeeper. This provides fault-tolerance for the master without the complexity of state-machine replication (e.g. without having to write the MapReduce master using a Raft library). Maybe you can think of this as providing fault-tolerance by making the state alone fault-tolerant, whereas use of Raft makes the entire computation fault-tolerant. This general pattern isn't new -- for example it's how database-backed web sites work -- but ZooKeeper is a good fit if your main concern is managing fault-tolerant services.
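As a concrete (and hypothetical) sketch of this pattern, here is roughly what a master that keeps its critical state in ZooKeeper might look like, using ZooKeeper's Java client. The znode path and the idea of storing the state as a single blob are invented for illustration:

  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.KeeperException;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  // Hypothetical master that keeps its critical state in ZooKeeper
  // rather than replicating its own computation. The path and the
  // single-blob encoding are invented for illustration.
  public class MasterState {
      private static final String PATH = "/mr-master-state";
      private final ZooKeeper zk;

      public MasterState(ZooKeeper zk) { this.zk = zk; }

      // The running master calls this whenever its critical state changes.
      public void save(byte[] state) throws KeeperException, InterruptedException {
          try {
              zk.setData(PATH, state, -1);              // -1: accept any version
          } catch (KeeperException.NoNodeException e) {
              zk.create(PATH, state, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                        CreateMode.PERSISTENT);         // first save creates the znode
          }
      }

      // A freshly chosen replacement master calls this at startup.
      public byte[] recover() throws KeeperException, InterruptedException {
          return zk.getData(PATH, false, null);         // load the last saved state
      }
  }

Choosing the replacement master could itself be done with ZooKeeper, e.g. with an ephemeral znode whose disappearance (when the old master's session expires) signals that a new master is needed.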

Q: Why are only update requests A-linearizable? Why not reads as well?

A: The authors want high total read throughput, so they want ZooKeeper replicas to be able to satisfy client reads without involving the leader. A given replica may not know about a committed write (if it's not in the majority that the leader waited for), or may know about a write but not yet know if it is committed. Thus a replica's state may lag behind the leader and other replicas, and serving reads from replicas can yield data that doesn't reflect recent writes -- that is, reads can return stale results.
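When a client can't tolerate a stale result for a particular read, the paper's remedy is sync: the client issues a sync followed by the read, and the FIFO client order guarantee ensures the read observes every write that preceded the sync. A minimal sketch with ZooKeeper's Java client (the path is illustrative, and error handling is omitted):

  import org.apache.zookeeper.KeeperException;
  import org.apache.zookeeper.ZooKeeper;

  // A read that must observe preceding committed writes: issue sync,
  // then read. FIFO client order queues the read after the sync, so
  // there's no need to wait for the sync's completion callback.
  static byte[] freshRead(ZooKeeper zk, String path)
          throws KeeperException, InterruptedException {
      zk.sync(path, (rc, p, ctx) -> {}, null); // async; completion ignored here
      return zk.getData(path, false, null);    // reflects writes before the sync
  }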

Q: How does linearizability differ from serializability?

A: The usual definition of serializability is much like linearizability, but without the requirement that operations respect real-time ordering.

Have a look at this explanation: http://www.bailis.org/blog/linearizability-versus-serializability/

Section 2.3 of the ZooKeeper paper uses "serializable" to indicate that the system behaves as if writes (from all clients combined) were executed one by one in some order. The "FIFO client order" property means that reads occur at specific points in the order of writes, and that a given client's successive reads never move backwards in that order. One thing that's going on here is that the guarantees for writes and reads are different.

Q: What is pipelining?

A: There are two things going on here. First, the ZooKeeper leader (really the leader's Zab layer) batches together multiple client operations in order to send them efficiently over the network, and in order to write them efficiently to disk. For both network and disk, it's often far more efficient to send a batch of N small items all at once than it is to send or write them one at a time. This kind of batching is only effective if the leader sees many client requests at the same time, so it depends on there being lots of active clients.

The second aspect of pipelining is that ZooKeeper makes it easy for each client to keep many write requests outstanding at a time, by supporting asynchronous operations. From the client's point of view, it can send lots of write requests without having to wait for the responses (which arrive later, as notifications after the writes commit). From the leader's point of view, that client behavior gives the leader lots of requests to accumulate into big efficient batches.
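To make that concrete, here is a sketch of a client keeping many writes in flight using the Java client's asynchronous API; the paths, data, and the use of a latch to detect completion are all invented for illustration:

  import java.util.concurrent.CountDownLatch;
  import org.apache.zookeeper.ZooKeeper;

  // Client-side pipelining: issue n writes back-to-back without waiting
  // for replies, then wait once for all the completions. While the
  // requests are in flight, the leader can accumulate them into batches.
  static void pipelinedWrites(ZooKeeper zk, int n) throws InterruptedException {
      CountDownLatch done = new CountDownLatch(n);
      for (int i = 0; i < n; i++) {
          zk.setData("/task-" + i, ("status-" + i).getBytes(), -1,
                     (rc, path, ctx, stat) -> done.countDown(), // completion callback
                     null);
      }
      done.await(); // all n replies have arrived
  }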

A worry with pipelining is that operations that are in flight might be re-ordered, which would cause the problem the authors talk about in Section 2.3. If the leader has many write operations in flight from a client, followed by a write to ready, you don't want those operations to be re-ordered, because then other clients might observe ready before the preceding writes have been applied. To ensure that this cannot happen, ZooKeeper guarantees FIFO order for client operations; that is, each client's operations are applied in the order in which that client issued them.
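In code, the Section 2.3 scenario might look like the following sketch: the writer hides ready, streams out the config updates asynchronously, and re-creates ready last, relying on the FIFO guarantee. The multi-znode config layout follows the paper's example; error handling is omitted:

  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.KeeperException;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  // Delete ready, issue the config writes asynchronously, then create
  // ready again. FIFO client order means no client can observe the new
  // ready without also observing all of the preceding config writes.
  static void publishConfig(ZooKeeper zk, byte[][] chunks)
          throws KeeperException, InterruptedException {
      zk.delete("/ready", -1);                           // hide config while updating
      for (int i = 0; i < chunks.length; i++) {
          zk.setData("/config-" + i, chunks[i], -1,
                     (rc, path, ctx, stat) -> {}, null); // async, applied in order
      }
      zk.create("/ready", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.PERSISTENT);                  // queued after the writes
  }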

Q: What does wait-free mean?

A: The precise definition: A wait-free implementation of a concurrent data object is one that guarantees that any process can complete any operation in a finite number of steps, regardless of the execution speeds of the other processes. This definition was introduced in the following paper by Herlihy: https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf

ZooKeeper is wait-free because it processes one client's requests without needing to wait for other clients to take action. This is partially a consequence of the API: despite being designed to support client/client coordination and synchronization, no ZooKeeper API call is defined in a way that would require one client to wait for another. In contrast, a system that supported a lock acquire operation that waited for the current lock holder to release the lock would not be wait-free.

Ultimately, however, ZooKeeper clients often need to wait for each other, and ZooKeeper does provide a waiting mechanism -- watches. The main effect of wait-freedom on the API is that watches are factored out from other operations. The combination of atomic test-and-set updates (e.g. file creation, and writes conditioned on a version number) with watches allows clients to synthesize more complex blocking abstractions (e.g. the locks and barriers of Section 2.4).
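For example, Section 2.4's simple lock combines exactly these pieces. A sketch with the Java client (simplified, and, as the paper notes for this recipe, subject to the herd effect):

  import java.util.concurrent.CountDownLatch;
  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.KeeperException;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  // Section 2.4's "Simple Locks" recipe: try to create an ephemeral
  // znode; if it already exists, watch it and retry once the holder
  // releases it (or crashes and its session expires).
  static void acquireLock(ZooKeeper zk, String path)
          throws KeeperException, InterruptedException {
      while (true) {
          try {
              zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                        CreateMode.EPHEMERAL);    // atomic test-and-set
              return;                             // we hold the lock
          } catch (KeeperException.NodeExistsException e) {
              CountDownLatch gone = new CountDownLatch(1);
              // Set a watch on the lock znode; it fires on deletion.
              if (zk.exists(path, event -> gone.countDown()) == null)
                  continue;                       // holder already vanished; retry now
              gone.await();                       // wait for release, then retry
          }
      }
  }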

Q: How does the leader know the order in which a client wants a bunch of asynchronous updates to be performed?

A: The paper doesn't say. The answer is likely to involve the client numbering its asynchronous requests, and the leader tracking, for each client (really, each session), the number it should expect next. The leader has to keep per-session state anyway (for client session timeouts), so it might be little extra work to also track a per-session request sequence number. This information would have to be preserved when a leader fails and another server takes over, so the client sequence numbers are likely passed along in the replicated log entries.
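Since the paper doesn't describe this mechanism, the following is purely illustrative of the speculated bookkeeping -- none of the names or structure come from the paper or from ZooKeeper's source:

  import java.util.HashMap;
  import java.util.Map;

  // Invented sketch of per-session request ordering at the leader:
  // accept a request only if it carries the sequence number this
  // session is expected to send next.
  class SessionOrdering {
      private final Map<Long, Long> nextSeq = new HashMap<>(); // session id -> expected seq

      synchronized boolean accept(long sessionId, long seq) {
          long expected = nextSeq.getOrDefault(sessionId, 0L);
          if (seq != expected) return false;  // out of order: hold or reject
          nextSeq.put(sessionId, expected + 1);
          return true;                        // in order: safe to apply
      }
  }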

Q: What does a client do if it doesn't get a reply for a request? Does it re-send, in case the network lost a request or reply, or the leader crashed before committing? How does ZooKeeper avoid re-sends leading to duplicate executions?