Raft & Paxos

Consensus 202211232125

Situations where consensus is important:

  • Leader election
    • In a database with single leader replication, all nodes need to agree on the leader.
  • Atomic commits 202211232128:
    • If you have a transaction spanning several nodes or partitions, the transaction might fail on some nodes but succeed on others.
    • You need to get all nodes to agree that either something went wrong (and we need to roll back the transaction) or nothing went wrong, in which case the transaction needs to commit.

Remember that total order broadcast 202211232155 requires messages to be delivered exactly once, in the same order, to all nodes. If you think about it, total order broadcast is equivalent to performing several rounds of consensus. In each round, nodes propose the message they want to send next and then decide on the next message to be delivered in total order.

However, if consensus algorithms are like total order broadcast, and total order broadcast is like single leader replication (which has a leader, and the leader applies the writes to all the followers in the same order), and single leader replication needs to have a leader… it seems like to elect a leader (have consensus), you need a leader. There is a bit of chicken and egg here.

  • Consensus protocols all use some form of leader, but they don’t guarantee that the leader is unique. Instead, they have several epochs and ensure that the leader is unique within each epoch.
  • Every time the current leader is thought dead, a vote is started among the nodes to elect a new leader. If there is a conflict between leaders in two epoch numbers, the leader with the higher epoch number prevails.
  • Before a leader can decide anything, it must collect votes from a quorum of nodes and see if they respond in favor of the proposal.
  • Therefore, there are two rounds of voting - one to decide on a leader and the other to vote on the leader’s proposal (to make sure there is no leader with a higher epoch value). If a proposal passes, then because of the pigeonhole principle, at least one node must have voted for both the leader’s current proposal and the leader’s initial election. That must mean there is no newly elected leader, and the leader can proceed with their chosen value.

Differences between this and two-phase commit 202211232128:

  1. In 2PC, the coordinator is not elected.
  2. Fault-tolerant consensus algorithms require votes from a majority of nodes and 2PC requires a yes” vote from every participant.

Zookeeper 202211232246 uses a consensus algorithm like this under the hood.

Drawbacks of consensus 202211232125 algorithms:

  1. The process by which nodes vote on proposals before they’re decided is a kind of synchronous replication. Databases are often configured to use asynchronous replication, in which some committed data can be lost on failover. Many people choose to accept this risk for the sake of better performance.
  2. Consensus systems need a strict majority to operate. That means you need at least five nodes to tolerate two failures, for example.
  3. Consensus algorithms assume a fixed set of nodes that participate in voting, which means that you can’t just add or remove nodes in the cluster.
  4. Consensus algorithms rely on timeouts to detect failed nodes. In environments with variable network delays, this could lead to disruptive false positive timeouts.

Created from: Algorithms to know before system design interviews 202211231052


uid: 202211231100 tags: #distributed-systems #software-engineering #algorithms


Date
February 22, 2023