Raft Leader Election

Diego Ongaro & John Ousterhout, 2014 — "In Search of an Understandable Consensus Algorithm"

Raft was designed specifically to be understandable. Most consensus algorithms (like Paxos) are notoriously difficult to implement correctly. Raft breaks the problem into largely independent pieces: leader election, log replication, and safety.

In our project, Raft handles leader election only — not log replication (that's handled by TO-Multicast for writes). The leader is responsible for serving read operations to ensure linearisable reads.

Why We Need a Leader

Without a leader:

Client A uploads "report.pdf v2" to Node 0
TO-Multicast broadcasts the write to all nodes (Node 0, 1, 2 will all eventually hold v2)
Client B reads "report.pdf" from Node 1
BUT Node 1 might not have delivered the message yet (TO-Multicast delivery is asynchronous)

Client B briefly sees stale data: the old version of the file.

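By routing all reads through the leader, every read is served by a single node, so clients cannot see different versions depending on which replica they happen to contact. Note that the election itself does not compare how up-to-date the nodes are, so the freshness of reads rests on the leader having delivered every write that has been acknowledged to a client.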

The Three Roles

Follower

The default state. A follower:

Listens for heartbeats from the leader
Responds to vote requests from candidates
If no heartbeat is received within electionTimeout (a per-node random value between 150 and 300ms), becomes a candidate
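The follower's timeout check can be sketched as plain bookkeeping over timestamps. This is a minimal illustration, not the project's code: the class FollowerTimeout and its method names are hypothetical, and the clock value is passed in explicitly so the logic stays easy to test.

```java
// Hypothetical follower-side bookkeeping; names are illustrative only.
final class FollowerTimeout {
    private final long electionTimeoutMs;
    private long lastHeartbeatMs;

    FollowerTimeout(long electionTimeoutMs, long nowMs) {
        this.electionTimeoutMs = electionTimeoutMs;
        this.lastHeartbeatMs = nowMs;   // the timer starts when the node starts
    }

    // Record a heartbeat from the current leader; this restarts the timeout.
    void onHeartbeat(long nowMs) {
        lastHeartbeatMs = nowMs;
    }

    // True once a full election timeout has passed without a heartbeat.
    boolean shouldStartElection(long nowMs) {
        return nowMs - lastHeartbeatMs >= electionTimeoutMs;
    }
}
```

A real node would call shouldStartElection from a periodic task, or replace the check with a scheduled timer that is cancelled and rescheduled on each heartbeat.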

Candidate

A node that wants to become leader. A candidate:

Increments the term counter
Votes for itself
Sends requestVote() to every other node
If it receives votes from a majority (≥ 2 out of 3), becomes leader
If it receives a heartbeat from a valid leader, steps back to follower
If the election times out without a winner, starts a new election (incremented term)
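The vote-counting part of these steps might look like the sketch below. It assumes a hypothetical VoteRequester interface standing in for the remote requestVote() call, and a peers list that excludes the candidate itself; neither is taken from the project.

```java
import java.util.List;

// Hypothetical stand-in for the remote requestVote() call on one peer.
interface VoteRequester {
    boolean requestVote(int candidateId, int term) throws Exception;
}

final class ElectionRound {
    // Returns true if the candidate collects a strict majority of the cluster.
    static boolean run(int candidateId, int term, List<VoteRequester> peers) {
        int clusterSize = peers.size() + 1;   // the peers plus this node
        int votes = 1;                        // the candidate votes for itself
        for (VoteRequester peer : peers) {
            try {
                if (peer.requestVote(candidateId, term)) {
                    votes++;
                }
            } catch (Exception e) {
                // An unreachable peer simply contributes no vote
            }
        }
        return votes > clusterSize / 2;       // 2 of 3, 3 of 5, ...
    }
}
```

Incrementing the term, and stepping down when a valid heartbeat arrives mid-election, are left to the caller in this sketch.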

Leader

The node in charge. After winning the election, the leader:

Sends heartbeats every 75ms to all followers to prevent new elections
Serves all read operations (DOWNLOAD, SEARCH, LIST)
Takes no special part in writes, which still go through TO-Multicast (not leader-only)

The Term

Every election happens in a term — a monotonically increasing integer. This is Raft's "logical clock."

Term 0: No leader yet (cluster starting)
Term 1: Node 2 wins first election → becomes leader
Term 2: Node 2 crashes → Node 0 wins election → becomes leader
Term 3: begins only when another election is triggered (for example, Node 0 becomes unreachable); a healthy leader keeps its term

The term prevents a stale leader from disrupting the cluster:

If a node receives a message with a term lower than its current term, it rejects it
If a node receives a message with a term higher than its current term, it updates its term and becomes a follower
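The two rules above, plus the equal-term case, reduce to a three-way comparison. A small sketch follows; the TermCheck enum and TermRules class are hypothetical names used only for illustration.

```java
// Hypothetical outcome of comparing an incoming message's term with our own.
enum TermCheck { REJECT, ACCEPT, STEP_DOWN }

final class TermRules {
    static TermCheck classify(int messageTerm, int currentTerm) {
        if (messageTerm < currentTerm) {
            return TermCheck.REJECT;      // stale sender: ignore the message
        }
        if (messageTerm > currentTerm) {
            return TermCheck.STEP_DOWN;   // adopt the newer term, become follower
        }
        return TermCheck.ACCEPT;          // same term: process normally
    }
}
```

For example, a leader from Term 1 that reconnects while the cluster is in Term 2 has its heartbeats classified as REJECT, and the first Term 2 message it receives makes it STEP_DOWN.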

Vote Logic

@Override
public synchronized boolean receiveVoteRequest(int candidateId, int term) throws RemoteException {
    // synchronized: two concurrent requests must not both observe votedFor == -1.
    // Any other code that writes votedFor should hold the same lock.

    // Reject votes from past terms
    if (term < currentTerm.get()) {
        return false;
    }

    // If candidate is from a higher term, update our term and reset our vote
    if (term > currentTerm.get()) {
        currentTerm.set(term);
        role.set(RaftRole.FOLLOWER);
        votedFor = -1;   // haven't voted in this term yet
    }

    // Grant the vote if we haven't voted yet in this term, or if the same
    // candidate is retrying (a repeated request must get the same answer)
    if (votedFor == -1 || votedFor == candidateId) {
        votedFor = candidateId;
        return true;
    }

    // Already voted for someone else this term
    return false;
}

Heartbeat Mechanism

The leader sends a heartbeat (receiveHeartbeat(leaderId, term)) to every follower every 75ms. Each heartbeat resets the follower's election timer.

// Leader's heartbeat loop
sendHeartbeats = Executors.newSingleThreadScheduledExecutor();
sendHeartbeats.scheduleAtFixedRate(() -> {
    if (role.get() != RaftRole.LEADER) return;
    for (ReplicaNodeInterface peer : peers) {
        try {
            peer.receiveHeartbeat(nodeId, currentTerm.get());
        } catch (RemoteException e) {
            // Peer is down — skip
        }
    }
}, 0, HEARTBEAT_INTERVAL_MS, TimeUnit.MILLISECONDS);
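The receiving side is not shown here. A minimal sketch of what a follower could do on each heartbeat, assuming a boolean return value that tells the caller whether to reset its election timer (the HeartbeatReceiver class, its fields, and that return type are all assumptions for illustration):

```java
// Hypothetical follower-side state; field names are illustrative only.
final class HeartbeatReceiver {
    int currentTerm = 0;
    int currentLeader = -1;   // -1 means no leader known
    int votedFor = -1;        // -1 means no vote cast in this term
    boolean follower = true;

    // Returns true if the heartbeat was accepted, in which case the caller
    // should reset its election timer; false if it came from a stale leader.
    synchronized boolean receiveHeartbeat(int leaderId, int term) {
        if (term < currentTerm) {
            return false;             // stale leader: ignore
        }
        if (term > currentTerm) {
            currentTerm = term;       // adopt the newer term
            votedFor = -1;            // no vote cast in the new term yet
        }
        follower = true;              // a candidate steps back on a valid heartbeat
        currentLeader = leaderId;     // remember where to forward reads
        return true;
    }
}
```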

Why Randomized Timeouts?

If every node had the same election timeout, they would all become candidates simultaneously, splitting the vote 1-1-1 — no one gets a majority.

// Each node picks a random timeout at startup
this.electionTimeoutMs = 150 + ThreadLocalRandom.current().nextInt(150);
// Range: 150ms – 299ms

With randomized timeouts, one node will almost always time out first, become a candidate, win the election quickly, and start sending heartbeats that prevent the others from becoming candidates.
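The snippet above picks the timeout once at startup. The Raft paper instead has a node draw a fresh random timeout each time it starts an election, so two nodes that collide once are unlikely to collide again. A sketch of that variant, with the hypothetical helper ElectionTimeouts:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: draws a new timeout for every election timer reset.
final class ElectionTimeouts {
    static final int MIN_MS = 150;
    static final int SPREAD_MS = 150;

    // Returns a value in the range 150ms to 299ms inclusive.
    static int next() {
        return MIN_MS + ThreadLocalRandom.current().nextInt(SPREAD_MS);
    }
}
```

Calling ElectionTimeouts.next() on every timer reset, rather than storing one value per node, keeps the range unchanged while avoiding repeated split votes between the same pair of nodes.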

What Happens If the Leader Crashes?

Heartbeats stop arriving
Each follower's election timer (150–300ms) keeps running with nothing to reset it
The follower with the shortest randomized timeout fires first and becomes a candidate
The candidate votes for itself and receives the other surviving node's vote (2 out of 3)
The new leader starts sending heartbeats
Followers forward reads to the new leader

Total read downtime: roughly one election timeout plus one round of vote requests, so on the order of 150–300ms.

Integration with the File System

// In ReplicaNode handleClientOperation():
if (op.isWrite()) {
    // Writes go through TO-Multicast (not leader-dependent)
    return broadcastAndWait(op);
} else {
    // Reads must go through the leader for linearisability
    if (currentLeader < 0) {
        // Assumes currentLeader is -1 while an election is in progress:
        // fail fast so the client can retry once a leader is known
        throw new IllegalStateException("No leader elected yet; retry shortly");
    }
    if (nodeId != currentLeader) {
        // I'm not the leader: forward the read to the leader
        ReplicaNodeInterface leader = peers.get(currentLeader);
        return leader.handleClientOperation(op, sessionToken);
    }
    // I am the leader: serve the read locally
    return executeLocally(op);
}

Next: Mutual TLS

Now you understand how the cluster stays coordinated. The next page covers the encryption and authentication layer. → Mutual TLS
