
Raft Leader Election

Diego Ongaro & John Ousterhout, 2014 — "In Search of an Understandable Consensus Algorithm"

Raft was designed specifically to be understandable. Consensus algorithms such as Paxos are notoriously difficult to implement correctly; Raft instead decomposes the problem into largely independent pieces: leader election, log replication, and safety.

In our project, Raft handles leader election only — not log replication (that's handled by TO-Multicast for writes). The leader is responsible for serving read operations to ensure linearisable reads.


Why We Need a Leader

Without a leader:

  1. Client A uploads "report.pdf v2" to Node 0
  2. TO-Multicast broadcasts it to all nodes (Node 0, 1, and 2 will each deliver v2 eventually)
  3. Client B reads "report.pdf" from Node 1
  4. BUT Node 1 might not have delivered the message yet (TO-Multicast delivery is asynchronous)

Client B briefly sees stale data — the old version of the file.

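By routing all reads through the leader, every read is served by the same node, so clients see one consistent sequence of states instead of whichever replica they happen to contact. Note that the election described here does not compare node state, so this relies on the leader having delivered every acknowledged write before it serves a read.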

The Three Roles

Follower

The default state. A follower:

  • Listens for heartbeats from the leader
  • Responds to vote requests from candidates
  • If no heartbeat received within electionTimeout (150–300ms), becomes a candidate
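
The timeout rule in the last bullet can be sketched as a small deadline tracker. This is a minimal sketch under stated assumptions, not the project's code: ElectionDeadline and its methods are hypothetical names, and the current time is passed in explicitly so the logic stays easy to test.

```java
// Hypothetical helper (not from the project): decides when a follower
// should stop waiting for heartbeats and become a candidate.
class ElectionDeadline {
    private final long timeoutMs;  // e.g. a value in the 150-300 ms range
    private long lastHeartbeatMs;  // when the last valid heartbeat arrived

    ElectionDeadline(long timeoutMs, long nowMs) {
        this.timeoutMs = timeoutMs;
        this.lastHeartbeatMs = nowMs; // the timer starts when the node starts
    }

    // Call whenever a heartbeat from a valid leader arrives.
    synchronized void onHeartbeat(long nowMs) {
        lastHeartbeatMs = nowMs;
    }

    // True once a full timeout has passed without a heartbeat.
    synchronized boolean hasExpired(long nowMs) {
        return nowMs - lastHeartbeatMs >= timeoutMs;
    }
}
```

A background task could call hasExpired(System.currentTimeMillis()) every few milliseconds and start an election when it returns true.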

Candidate

A node that wants to become leader. A candidate:

  • Increments the term counter
  • Votes for itself
  • Sends requestVote() to every other node
  • If it receives votes from a majority (≥ 2 out of 3), becomes leader
  • If it receives a heartbeat from a valid leader, steps back to follower
  • If the election times out without a winner, starts a new election (incremented term)
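
The vote-counting part of these steps might look like the sketch below. It is an illustration only: VoteRequester is a hypothetical stand-in for the remote interface, the requests are sent one after another rather than in parallel, and stepping down on a heartbeat or a higher term is left out.

```java
import java.util.List;

class ElectionRound {
    // Hypothetical stand-in for the remote vote request call.
    interface VoteRequester {
        boolean requestVote(int candidateId, int term) throws Exception;
    }

    // Returns true if the candidate collects a strict majority of the whole cluster.
    static boolean runElection(int candidateId, int term, List<VoteRequester> otherNodes) {
        int clusterSize = otherNodes.size() + 1; // the other nodes plus the candidate
        int votes = 1;                           // the candidate votes for itself
        for (VoteRequester node : otherNodes) {
            try {
                if (node.requestVote(candidateId, term)) {
                    votes++;
                }
            } catch (Exception e) {
                // Unreachable node: treated as a vote not granted
            }
        }
        return votes > clusterSize / 2; // 2 of 3, 3 of 5, ...
    }
}
```

In a three-node cluster the candidate's own vote plus one granted vote is enough, which is why an election can still succeed while one node is down.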

Leader

The node in charge. After winning the election, the leader:

  • Sends heartbeats every 75ms to all followers to prevent new elections
  • Serves all read operations (DOWNLOAD, SEARCH, LIST)
  • Does not take over writes: those still go through TO-Multicast (not leader-only)

The Term

Every election happens in a term — a monotonically increasing integer. This is Raft's "logical clock."

Term 0: No leader yet (cluster starting)
Term 1: Node 2 wins first election → becomes leader
Term 2: Node 2 crashes → Node 0 wins election → becomes leader
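(Node 0 stays leader for the rest of term 2; the term only advances when a node starts a new election, so a healthy leader is never "re-elected".)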

The term prevents a stale leader from disrupting the cluster:

  • If a node receives a message with a term lower than its current term, it rejects it
  • If a node receives a message with a term higher than its current term, it updates its term and becomes a follower

Vote Logic

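@Override
public synchronized boolean receiveVoteRequest(int candidateId, int term) throws RemoteException {
    // synchronized: the term check and the votedFor update must happen as one step,
    // otherwise two concurrent requests could both be granted in the same term
    // (any other code that touches votedFor needs the same lock)

    // Reject votes from past terms
    if (term < currentTerm.get()) {
        return false;
    }

    // If candidate is from a higher term, update our term and reset our vote
    if (term > currentTerm.get()) {
        currentTerm.set(term);
        role.set(RaftRole.FOLLOWER);
        votedFor = -1; // haven't voted in this term yet
    }

    // Grant the vote if we haven't voted in this term, or if the same candidate is retrying
    if (votedFor == -1 || votedFor == candidateId) {
        votedFor = candidateId;
        return true;
    }

    // Already voted for someone else this term
    return false;
}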

Heartbeat Mechanism

The leader sends a heartbeat (receiveHeartbeat(leaderId, term)) to every follower every 75ms. Each heartbeat resets the follower's election timer.

// Leader's heartbeat loop
sendHeartbeats = Executors.newSingleThreadScheduledExecutor();
sendHeartbeats.scheduleAtFixedRate(() -> {
    if (role.get() != RaftRole.LEADER) return; // only the leader sends heartbeats
    for (ReplicaNodeInterface peer : peers) {
        try {
            // Blocking call: a slow peer delays the heartbeats to the peers after it
            peer.receiveHeartbeat(nodeId, currentTerm.get());
        } catch (RemoteException e) {
            // Peer is unreachable; skip it and retry on the next tick
        }
    }
}, 0, HEARTBEAT_INTERVAL_MS, TimeUnit.MILLISECONDS);
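
On the receiving side, a follower has to apply the term rules before it resets its timer. The following is a rough sketch of such a handler, not the project's implementation: FollowerState, its fields, the -1 sentinel for "no known leader", and the extra nowMs parameter are all assumptions made for illustration.

```java
// Hypothetical follower-side state (names are illustrative, not the project's).
class FollowerState {
    private int currentTerm = 0;
    private int votedFor = -1;        // -1 means "no vote cast in this term"
    private int currentLeader = -1;   // -1 means "no known leader" (assumption)
    private long lastHeartbeatMs = 0; // used to restart the election timeout

    // Returns true if the heartbeat was accepted, false if it came from a stale leader.
    synchronized boolean receiveHeartbeat(int leaderId, int term, long nowMs) {
        if (term < currentTerm) {
            return false; // sender is behind us: ignore it, do not reset the timer
        }
        if (term > currentTerm) {
            currentTerm = term; // adopt the newer term
            votedFor = -1;      // a new term means a fresh vote
        }
        currentLeader = leaderId; // remember where to forward reads
        lastHeartbeatMs = nowMs;  // restart the election timeout
        return true;
    }

    synchronized int currentTerm() { return currentTerm; }
    synchronized int currentLeader() { return currentLeader; }
    synchronized long lastHeartbeatMs() { return lastHeartbeatMs; }
}
```

A heartbeat with an older term leaves the timer untouched, so a stale leader cannot keep delaying a new election.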

Why Randomized Timeouts?

If every node had the same election timeout, they would all become candidates simultaneously, splitting the vote 1-1-1 — no one gets a majority.

// Each node picks a random timeout at startup
this.electionTimeoutMs = 150 + ThreadLocalRandom.current().nextInt(150);
// Range: 150ms – 299ms

With randomized timeouts, one node will almost always time out first, become a candidate, win the election quickly, and start sending heartbeats that prevent the others from becoming candidates.
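
The Raft paper goes one step further and has a node draw a fresh random timeout each time its election timer restarts, so two nodes that happened to pick close values do not collide again on every election. A possible sketch of that refinement follows; ElectionTimeouts is a hypothetical helper, and the constants mirror the 150ms base and 150ms spread used above.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: hands out a new randomized timeout on every call.
class ElectionTimeouts {
    static final int MIN_MS = 150;    // lower bound, as in the snippet above
    static final int SPREAD_MS = 150; // width of the random window

    // Returns a value in [150, 300), i.e. 150 ms to 299 ms.
    static int next() {
        return MIN_MS + ThreadLocalRandom.current().nextInt(SPREAD_MS);
    }
}
```

A node would call ElectionTimeouts.next() whenever it restarts its election timer, rather than storing a single value at startup.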

What Happens If the Leader Crashes?

  1. Heartbeats stop arriving
  2. Each follower's election timer (150–300ms) keeps running because nothing resets it
  3. One follower (the one with the shortest randomized timeout) becomes a candidate
  4. The candidate wins the election (gets 2 out of 3 votes)
  5. The new leader starts sending heartbeats
  6. Clients are redirected to the new leader

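Total downtime: typically around 200ms (one election timeout of 150–300ms plus one round of vote requests), as long as the first election does not end in a split vote.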
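
The downtime figure can be bounded with some quick arithmetic. The sketch below is a back-of-the-envelope estimate, not a measurement: FailoverEstimate is a hypothetical helper, the 5ms round-trip time is an assumed value, heartbeat delivery is treated as instantaneous, and the first election is assumed to succeed.

```java
// Hypothetical helper: rough failover bounds for a single successful election.
class FailoverEstimate {
    // Best case: the leader dies just before its next heartbeat was due, so the
    // fastest follower has already used up one heartbeat interval of its timeout.
    static long bestCaseMs(long minTimeoutMs, long heartbeatIntervalMs, long rttMs) {
        return minTimeoutMs - heartbeatIntervalMs + rttMs;
    }

    // Worst case: the leader dies right after a heartbeat and the first follower
    // to time out drew the longest possible timeout.
    static long worstCaseMs(long maxTimeoutMs, long rttMs) {
        return maxTimeoutMs + rttMs;
    }

    public static void main(String[] args) {
        long rttMs = 5; // assumed round trip for the vote requests
        System.out.println("best  case: " + bestCaseMs(150, 75, rttMs) + " ms");  // 150 - 75 + 5 = 80
        System.out.println("worst case: " + worstCaseMs(300, rttMs) + " ms");     // 300 + 5 = 305
    }
}
```

Under these assumptions the outage falls somewhere between roughly 80ms and 305ms, and a split vote would add at least one more election timeout on top.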

Integration with the File System

// In ReplicaNode handleClientOperation():
if (op.isWrite()) {
    // Writes go through TO-Multicast (not leader-dependent)
    return broadcastAndWait(op);
} else {
    // Reads must go through the leader for linearisability
    if (nodeId != currentLeader) {
        // Not the leader: forward the read to the leader.
        // Assumes currentLeader is a valid index into peers; while an election
        // is in progress there may be no known leader to forward to.
        ReplicaNodeInterface leader = peers.get(currentLeader);
        return leader.handleClientOperation(op, sessionToken);
    }
    // This node is the leader: serve the read locally
    return executeLocally(op);
}

Next: Mutual TLS

Now you understand how the cluster stays coordinated. The next page covers the encryption and authentication layer. → Mutual TLS