Stake-Weighted Quality of Service Under a Microscope

Given that Solana has relatively low transaction costs, stake-weighted quality of service (SWQoS) is its answer to preventing spam bots from causing congestion and degrading overall performance.

And while a lot of good literature has been published about how SWQoS makes the Solana network faster and more efficient, none of it really touches on how the client implements this or what behaviors to expect when sending transactions to a leader.

This Helius article does come close with a comprehensive overview of the TPU, but it still leaves me scratching my head. If SWQoS is ultimately about packet prioritization, how does the Agave client choose the network packets that get to graduate from the Fetch Stage?

The following deep dive will be an exploration of the Agave client networking code with the goal of figuring out how SWQoS works, and while we’re at it, perhaps we can discover a couple of interesting kernels of information along the way.

Note: Throughout this post, I will be referencing code from the following commit: b0b556a

A UDP Bait-and-switch

Every good story has a beginning, and ours starts where every other network packet’s story begins: the TPU.

Most articles I’ve seen about the TPU describe it as a series of stages: Fetch Stage, SigVerify Stage, Banking Stage, Broadcast Stage. The Fetch Stage is what’s responsible for ingesting packets from peers and is where we’ll focus most of our attention in this post.

And as we look through the TPU’s new_with_client function, we can see an instantiation of the Fetch Stage:

// core/src/tpu.rs: L172

let (packet_sender, packet_receiver) = unbounded();
let (vote_packet_sender, vote_packet_receiver) = unbounded();
let (forwarded_packet_sender, forwarded_packet_receiver) = unbounded();
let fetch_stage = FetchStage::new_with_sender(
    transactions_sockets,
    tpu_forwards_sockets,
    tpu_vote_sockets,
    exit.clone(),
    &packet_sender, // <-- sender
    &vote_packet_sender,
    &forwarded_packet_sender,
    forwarded_packet_receiver,
    poh_recorder,
    Some(tpu_coalesce),
    Some(bank_forks.read().unwrap().get_vote_only_mode_signal()),
    tpu_enable_udp,
);

Ok great, so the Fetch Stage gets the packet_sender and uses that to pass packets to the SigVerify Stage?

// core/src/tpu.rs: L286

SigVerifier::Local(SigVerifyStage::new(
    packet_receiver, // <-- receiver
    verifier,
    "solSigVerTpu",
    "tpu-verifier",
))

Nice! Looks like that’s the case. The Fetch Stage listens for packets to arrive on some socket(s), processes them, and then passes them off to the SigVerify Stage. A perfect start to our TPU pipeline.

Well… not exactly. There’s this little CLI flag called tpu_enable_udp, and if we take a look at its definition, we see that it’s actually deprecated, defaulting to false:

// execute.rs: L266

let tpu_enable_udp = if matches.is_present("tpu_enable_udp") {
    warn!("Submission of TPU transactions via UDP is deprecated.");
    true
} else {
    DEFAULT_TPU_ENABLE_UDP
};

If we follow the FetchStage::new_with_sender definition, we can see that its UDP socket receiver implementation gets skipped without that flag set to true.

So, is the Fetch Stage a lie? Well, yes and no. The Fetch Stage lives on in the form of three QUIC connection servers: one for handling normal user transactions, one for handling votes, and one for handling transactions forwarded from other nodes.

// core/src/tpu.rs: L224 

// Streamer for user transactions
let SpawnServerResult {
    endpoints: _,
    thread: tpu_quic_t,
    key_updater,
} = spawn_server(
    "solQuicTpu",
    "quic_streamer_tpu",
    transactions_quic_sockets,
    keypair,
    packet_sender,
    exit.clone(),
    staked_nodes.clone(),
    tpu_quic_server_config,
)
.unwrap();

We still have the same packet_sender as before, but interestingly we now have staked_nodes, a mapping that contains the stake weight of each individual node, along with the total stake of the entire network.

Armed with this knowledge, we can proceed deeper into this streamer implementation, and figure out how it decides which packets are lucky enough to be passed to the SigVerify Stage.

Getting Ready to Process Connections

We find ourselves in nonblocking/quic.rs::run_server where client connections are processed. Before the main connection loop, there are a few critical structs that are shared with each connection as they arrive.

stream_load_ema tracks stream load using an exponential moving average to impose bandwidth restrictions on QUIC connections. It can differentiate between staked and unstaked connections, allocating a fixed bandwidth for unstaked connections and a pro-rata bandwidth for staked connections.

Later on, I’ll go deeper into how exactly this stream load is tracked and what it means for each connection.

An unstaked_connection_table and staked_connection_table are initialized to track information about each active connection. Each table is made up of a mapping of ConnectionTableKey -> Vec<ConnectionEntry>.

A connection key is either an IP address or a pubkey associated with the connection (which is used to determine stake), and it points to a vector of connection entries, each of which holds a connection type, the last time the connection was updated, and the number of active streams associated with it.
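To make that shape concrete, here is a rough sketch of what these types might look like. The field and type names are illustrative rather than Agave’s exact definitions; the real table is an indexed map (the indexmap crate) so whole entries can be removed by index, as we’ll see later in prune_oldest and prune_random.

use std::net::IpAddr;

use indexmap::IndexMap;

// Illustrative only -- the real Agave structs carry more fields.
type Pubkey = [u8; 32];

enum ConnectionPeerType {
    Unstaked,
    Staked(u64), // stake amount in lamports
}

#[derive(PartialEq, Eq, Hash)]
enum ConnectionTableKey {
    Ip(IpAddr),
    Pubkey(Pubkey),
}

struct ConnectionEntry {
    peer_type: ConnectionPeerType, // staked or unstaked
    last_update: u64,              // last time the connection saw activity
    active_streams: usize,         // number of streams currently open
}

struct ConnectionTable {
    // One peer (key) can hold several connections (entries).
    table: IndexMap<ConnectionTableKey, Vec<ConnectionEntry>>,
    // Total number of ConnectionEntry values across all peers.
    total_size: usize,
}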

One cool implementation detail is that ConnectionEntry implements Drop so when an entry is removed from a vector, it can automatically close the connection:

// streamer/src/nonblocking/quic.rs: L1372

impl Drop for ConnectionEntry {
    fn drop(&mut self) {
        if let Some(conn) = self.connection.take() {
            conn.close(
                CONNECTION_CLOSE_CODE_DROPPED_ENTRY.into(),
                CONNECTION_CLOSE_REASON_DROPPED_ENTRY,
            );
        }
        self.cancel.cancel();
    }
}

With the global setup out of the way, a loop takes each connection as they come in and runs setup logic on them.

To Be or Not To Be… Staked

We move on to nonblocking/quic.rs::setup_connection where Agave decides whether the connection is considered staked or not. As we’ll see, the client code unveils an interesting result:

There’s no guarantee that sending a transaction from a staked validator will actually get you a staked connection.

setup_connection kicks off by checking how much stake the current connection has with get_connection_stake. If a stake amount is detected, then the following calculation is performed:

// streamer/src/nonblocking/quic.rs: L745

let min_stake_ratio =
    1_f64 / (max_streams_per_ms * STREAM_THROTTLING_INTERVAL_MS) as f64;
let stake_ratio = stake as f64 / total_stake as f64;
let peer_type = if stake_ratio < min_stake_ratio {
    // If it is a staked connection with ultra low stake ratio, treat it as unstaked.
    ConnectionPeerType::Unstaked
} else {
    ConnectionPeerType::Staked(stake)
};

Interesting! If the peer’s stake is less than ~0.002% of the entire network stake, then the connection is considered to be unstaked. At the time of writing, 0.002% of the total native stake is ~7500 SOL, or just about 1.5 million USD.
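To sanity-check that number, here is the same calculation with the defaults listed at the end of this post plugged in. The total-stake figure is only illustrative, chosen to line up with the ~7,500 SOL quoted above.

fn main() {
    // Defaults from the end of the post.
    let max_streams_per_ms: u64 = 500;
    let stream_throttling_interval_ms: u64 = 100;

    // 1 / (500 * 100) = 0.00002, i.e. 0.002% of total network stake.
    let min_stake_ratio =
        1_f64 / (max_streams_per_ms * stream_throttling_interval_ms) as f64;

    // Illustrative total native stake of ~375M SOL.
    let total_stake_sol = 375_000_000_f64;

    println!("min stake ratio: {min_stake_ratio}");                       // 0.00002
    println!("stake needed: ~{} SOL", total_stake_sol * min_stake_ratio); // ~7500 SOL
}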

With our connection type determined, we now handle the connection differently based on whether it is staked or unstaked. Let’s first look at unstaked connections.

Unstaked Connection Setup

Before continuing with the unstaked connection, Agave checks that there is space for this new connection. If the table has already reached max capacity, the oldest connections are pruned until it is back down to 90% of capacity, dropping roughly the oldest 10% of active connections in the unstaked connection table.

// streamer/src/nonblocking/quic.rs: L423

fn prune_unstaked_connection_table(
    unstaked_connection_table: &mut ConnectionTable,
    max_unstaked_connections: usize,
    stats: Arc<StreamerStats>,
) {
    if unstaked_connection_table.total_size >= max_unstaked_connections {
        const PRUNE_TABLE_TO_PERCENTAGE: u8 = 90;
        let max_percentage_full = Percentage::from(PRUNE_TABLE_TO_PERCENTAGE);

        let max_connections = max_percentage_full.apply_to(max_unstaked_connections);
        let num_pruned = unstaked_connection_table.prune_oldest(max_connections);
        stats.num_evictions.fetch_add(num_pruned, Ordering::Relaxed);
    }
}

If you recall my aside about Drop being implemented for ConnectionEntry, we get to see it in action when the oldest connections are pruned. Calling self.table.swap_remove_index(index) frees the reference from the connection_table, and closes the connection when the memory is fully dropped.

// streamer/src/nonblocking/quic.rs: L1414
fn prune_oldest(&mut self, max_size: usize) -> usize {
    let mut num_pruned = 0;
    let key = |(_, connections): &(_, &Vec<_>)| {
        connections.iter().map(ConnectionEntry::last_update).min()
    };
    while self.total_size.saturating_sub(num_pruned) > max_size {
        match self.table.values().enumerate().min_by_key(key) {
            None => break,
            Some((index, connections)) => {
                num_pruned += connections.len();
                self.table.swap_remove_index(index); // <-- bye bye connection
            }
        }
    }
    self.total_size = self.total_size.saturating_sub(num_pruned);
    num_pruned
}

After old connections are pruned, we move on to calling handle_and_cache_new_connection. But here we will pause and look at how staked connections are handled. Eventually, our staked connection will lead us back to this same function so we will handle that later.

Staked Connection Setup

The simple path for the staked connection is that there is space in the connection_table for it. In this case, we simply move on to the next phase of connection handling.

But a bit of a wrinkle gets introduced when there is no space for the incoming connection.

The connection table invokes prune_random to choose a peer to drop. By default it samples up to two peers at random and disconnects all connections belonging to one of them:

// streamer/src/nonblocking/quic.rs: L1436

fn prune_random(&mut self, sample_size: usize, threshold_stake: u64) -> usize {
    let num_pruned = std::iter::once(self.table.len())
        .filter(|&size| size > 0)
        .flat_map(|size| {
            let mut rng = thread_rng();
            repeat_with(move || rng.gen_range(0..size))
        })
        .map(|index| {
            let connection = self.table[index].first();
            let stake = connection.map(|connection: &ConnectionEntry| connection.stake());
            (index, stake)
        })
        .take(sample_size)
        .min_by_key(|&(_, stake)| stake)
        .filter(|&(_, stake)| stake < Some(threshold_stake))
        .and_then(|(index, _)| self.table.swap_remove_index(index))
        .map(|(_, connections)| connections.len())
        .unwrap_or_default();
    self.total_size = self.total_size.saturating_sub(num_pruned);
    num_pruned
}

Following the code, we see that random indices into the table are drawn, and two entries are sampled with take(sample_size). The sampled peer with the smaller of the two stakes is unlucky and gets all of its connections removed from the table.

With the low-stake connections ousted, this leaves room for the new connection to be added.

But not so fast! What happens when the two randomly chosen peers do not have a stake that is lower than the incoming connection?

The incoming connection is demoted and is treated as an unstaked connection!

Check out the comment in the code:

// streamer/src/nonblocking/quic.rs: L792

// If we couldn't prune a connection in the staked connection table, let's
// put this connection in the unstaked connection table. If needed, prune a
// connection from the unstaked connection table.

The connections that are most at risk of being downgraded to unstaked connections come from low-staked peers. If there aren’t a lot of peers in the network with a lower stake, then the connection is more likely to be considered non-priority.
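Putting the staked path together, the placement decision roughly boils down to the following. This is a simplified sketch of the flow described above, not the actual Agave code, and the names are made up for illustration.

#[derive(Debug, PartialEq)]
enum Placement {
    StakedTable,
    UnstakedTable,
}

/// `evicted` is the number of connections a prune_random-style eviction
/// removed from the full staked table (0 when both sampled peers had more
/// stake than the incoming connection).
fn place_staked_connection(staked_table_full: bool, evicted: usize) -> Placement {
    if !staked_table_full || evicted > 0 {
        // Either there was room, or a lower-staked peer was kicked out.
        Placement::StakedTable
    } else {
        // No room could be made: the connection is demoted to the
        // unstaked table and treated as unstaked.
        Placement::UnstakedTable
    }
}

fn main() {
    assert_eq!(place_staked_connection(false, 0), Placement::StakedTable);
    assert_eq!(place_staked_connection(true, 2), Placement::StakedTable);
    assert_eq!(place_staked_connection(true, 0), Placement::UnstakedTable);
}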

Assigning Stream Bandwidth

With connection prioritization behind us, nonblocking/quic.rs::handle_connection assigns bandwidth to a QUIC connection, which determines how many streams it is allowed to send over the stream_throttling_interval_ms period. Currently, the throttling period defaults to 100ms.

With the connection now created, an infinite loop calls connection.accept_uni(), which accepts the next incoming unidirectional stream from the peer.

When processing a new stream, we take the number of streams already read in the current throttle interval for this connection and compare it to available_load_capacity_in_throttling_duration, the function that calculates how much throughput the connection is allowed.


Before we continue, it becomes worthwhile to discuss the StakedStreamLoadEMA struct that is in charge of assigning stream limits to connections.

We’ll start with a portion of StakedStreamLoadEMA::new:

// streamer/src/nonblocking/stream_throttle.rs: L42

let allow_unstaked_streams = max_unstaked_connections > 0;
let max_staked_load_in_ema_window = if allow_unstaked_streams {
    (max_streams_per_ms
        - Percentage::from(MAX_UNSTAKED_STREAMS_PERCENT).apply_to(max_streams_per_ms))
        * EMA_WINDOW_MS
} else {
    max_streams_per_ms * EMA_WINDOW_MS
};

let max_unstaked_load_in_throttling_window = if allow_unstaked_streams {
    Percentage::from(MAX_UNSTAKED_STREAMS_PERCENT)
        .apply_to(max_streams_per_ms * STREAM_THROTTLING_INTERVAL_MS)
        .saturating_div(max_unstaked_connections as u64)
} else {
    0
};

On instantiation, it calculates the max staked load in the EMA (Exponential Moving Average) window to be 80% of the max streams per millisecond, multiplied by EMA_WINDOW_MS. This just means that there exists an EMA window, and 80% of the streams in that window are assignable to staked connections.

max_unstaked_load_in_throttling_window takes 20% of the max streams allowed per millisecond and multiplies it by STREAM_THROTTLING_INTERVAL_MS. Finally, the value is divided by the maximum number of allowed unstaked connections. This yields the total number of streams a single unstaked connection is allowed to have during the course of the throttling window, which is slightly different from the EMA window. The EMA window is used to calculate the current load, while the throttling window is used to cap a connection’s streams until the window resets.
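Plugging in the defaults from the end of the post makes these two budgets concrete. This is just the arithmetic above written out, not Agave code.

fn main() {
    // Defaults from the end of the post.
    let max_streams_per_ms: u64 = 500;
    let ema_window_ms: u64 = 50;
    let stream_throttling_interval_ms: u64 = 100;
    let max_unstaked_connections: u64 = 500;
    let max_unstaked_streams_percent: u64 = 20;

    // 80% of the per-ms budget, over the EMA window:
    // (500 - 100) * 50 = 20,000 streams available to staked connections.
    let max_staked_load_in_ema_window = (max_streams_per_ms
        - max_streams_per_ms * max_unstaked_streams_percent / 100)
        * ema_window_ms;

    // 20% of the per-ms budget over the 100ms throttling window, split evenly
    // across all allowed unstaked connections:
    // (500 * 100 * 20%) / 500 = 20 streams per unstaked connection per window.
    let max_unstaked_load_in_throttling_window = (max_streams_per_ms
        * stream_throttling_interval_ms
        * max_unstaked_streams_percent
        / 100)
        / max_unstaked_connections;

    println!("max staked load in EMA window: {max_staked_load_in_ema_window}");
    println!("streams per unstaked connection: {max_unstaked_load_in_throttling_window}");
}

Twenty streams per 100ms window works out to at most about 200 streams per second for any single unstaked connection.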

When a new connection comes in, we can use these two values to calculate available_load_capacity_in_throttling_duration which tells us how many streams this particular connection is allowed to have in the current throttling window. This is critical because it is a system-wide calculation, and other connections affect how many streams can be assigned to the current connection.

Let’s take a look at a portion of this function:

// streamer/src/nonblocking/stream_throttle.rs: L148

match peer_type {
    ConnectionPeerType::Unstaked => self.max_unstaked_load_in_throttling_window,
    ConnectionPeerType::Staked(stake) => {
        // If the current load is low, cap it to 25% of max_load.
        let current_load = u128::from(cmp::max(
            self.current_load_ema.load(Ordering::Relaxed),
            self.max_staked_load_in_ema_window / 4,
        ));

        // Formula is (max_load ^ 2 / current_load) * (stake / total_stake)
        let capacity_in_ema_window = (u128::from(self.max_staked_load_in_ema_window)
            * u128::from(self.max_staked_load_in_ema_window)
            * u128::from(stake))
            / (current_load * u128::from(total_stake));

        let calculated_capacity = capacity_in_ema_window
            * u128::from(STREAM_THROTTLING_INTERVAL_MS)
            / u128::from(EMA_WINDOW_MS);
        let calculated_capacity = u64::try_from(calculated_capacity).unwrap_or_else(|_| { /* other code */ });

        // 1 is added to `max_unstaked_load_in_throttling_window` to guarantee that staked
        // clients get at least 1 more number of streams than unstaked connections.
        cmp::max(
            calculated_capacity,
            self.max_unstaked_load_in_throttling_window
                .saturating_add(1),
        )
    }
}

For unstaked connections, the value is easy. We just use max_unstaked_load_in_throttling_window which we calculated earlier.

For staked connections, we first figure out what the current load of the system is and use it in the formula max_load^2 / current_load.

This formula creates a dynamic allocation system where total capacity remains constant (when summed across all connections) but individual allocations vary inversely with load. For example, when load is low (5,000), connections get 4x more capacity than when load is at max (20,000), as shown in the tests.

As a safety measure, the load used in the formula is floored at 25% of the max load because of the formula’s inverse nature. If the system were under very low load, the calculated capacity would hand a huge portion of the available streams to that single peer, and they could effectively DoS the rest of the network out of getting their transactions processed.
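To make the formula concrete, here is a worked example with the defaults and a hypothetical peer holding 1% of total stake while the network is quiet, so the load sits at the 25% floor.

fn main() {
    // From the defaults: 80% * 500 streams/ms * 50ms EMA window.
    let max_staked_load_in_ema_window: u128 = 20_000;
    let stream_throttling_interval_ms: u128 = 100;
    let ema_window_ms: u128 = 50;

    // Hypothetical peer with 1% of the total network stake.
    let stake: u128 = 1;
    let total_stake: u128 = 100;

    // The system is quiet, so the load is floored at 25% of the max.
    let current_load = max_staked_load_in_ema_window / 4; // 5,000

    // (max_load^2 / current_load) * (stake / total_stake)
    // = (20,000^2 / 5,000) * 1% = 80,000 * 1% = 800 streams in the EMA window.
    let capacity_in_ema_window = max_staked_load_in_ema_window
        * max_staked_load_in_ema_window
        * stake
        / (current_load * total_stake);

    // Scale from the 50ms EMA window to the 100ms throttling window: 1,600 streams.
    let capacity = capacity_in_ema_window * stream_throttling_interval_ms / ema_window_ms;

    println!("streams allowed this throttling window: {capacity}");
}

Compare that to the flat 20 streams an unstaked connection gets: a peer with 1% of stake receives roughly 80x the per-window budget when the network is quiet.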


Once the stream is acquired and is within the allotted load for the connection, chunks from the stream are loaded in.

After all chunks are loaded, handle_chunks is called, which sends the accumulated chunks via the packet_sender, and the packets are on their way to the SigVerify Stage.

One interesting consequence of this implementation is that streams in a connection are processed sequentially in order. Once all chunks have been read, they are decoded into a single transaction.

We can interpret this as one stream being equivalent to sending one transaction over the wire. Sending a second transaction would require a new stream.
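From the client’s perspective, that means each transaction gets its own unidirectional QUIC stream. As a rough sketch (assuming a quinn 0.11-style API and the anyhow crate, with an already-established connection and serialization elided), sending two transactions looks something like this:

use quinn::Connection;

// Sketch only: each transaction is written to its own unidirectional stream.
async fn send_two_transactions(
    conn: &Connection,
    tx1: &[u8],
    tx2: &[u8],
) -> anyhow::Result<()> {
    // First transaction: open a stream, write the serialized bytes, finish.
    let mut stream = conn.open_uni().await?;
    stream.write_all(tx1).await?;
    stream.finish()?;

    // Second transaction: a brand-new stream is required, and it counts
    // against the same per-connection throttling budget.
    let mut stream = conn.open_uni().await?;
    stream.write_all(tx2).await?;
    stream.finish()?;

    Ok(())
}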

Wrap up

Stake-weighted quality of service certainly has a bit of a winding implementation, with lots of small edge cases and surprising results at times.

Hopefully this post was enlightening and perhaps you learned a thing or two about how the feature is implemented.

Some Helpful Defaults

While digging through the code, I kept referring back to some default values that were scattered across the codebase. I’ll leave them here in case they are useful while reading this post:

  • max_connections_per_peer: 8
  • max_staked_connections: 2000
  • max_unstaked_connections: 500
  • max_streams_per_ms: 500 (500,000 transactions per second)
  • ema_window_ms: 50
  • stream_throttling_interval_ms: 100