Netcode Architectures Part 3: Snapshot Interpolation

Explore netcode architecture based on interpolating world state snapshots

By Jay Mattis | Wednesday, June 07, 2023

In the third part of this series, we’ll be taking a look at what’s commonly referred to as snapshot interpolation. This technique was popularized by Quake and—especially given Quake’s extensive lineage—has proliferated widely:

A family tree representing engines and games derived from Quake
Sources: Wikipedia / Wikimedia

Given its origins, it’s probably not surprising that this model is often ideal for shooters. In contrast to the lockstep and rollback models discussed in previous articles, this architecture is the first in which the state of the world is disjoint in time. In other words, objects are split into two different time streams. To understand what this means exactly, let’s take a look at how this architecture works.

Architecture

Snapshot interpolation is a client-server architecture whereby clients send their inputs to the server, the server advances the simulation using these inputs, and then the server sends back a snapshot of the resulting game state to the clients. To keep the game responsive, clients predict a subset of objects by applying local inputs immediately. This selective prediction is what makes the client’s view a combination of the world at two different states in time:

Interpolated objects: Presented as they were at a specific moment in the past
Predicted objects: Presented as they are expected to be at a specific moment in the future

The video below illustrates what this means for the player experience as compared to local multiplayer:

It’s useful to think of these two different sets of objects, the predicted and the interpolated, as traveling forwards in time along two different time streams that are offset from each other: one ahead of the server and one behind, respectively. The states of objects are always aligned in time with respect to one another but only within the same time stream.

For example, if your character is predicted and an incoming projectile is interpolated, you may find it difficult to dodge at the last moment. This is because, although your inputs are taking effect immediately and you are predicting where your character will be on the server if you dodge, your timing is based on where you see the projectile. Because it’s interpolated, the projectile is displayed where it was on the server at some point in the past rather than where it will be on the server when your dodge is applied. If implemented this way, it is likely that the projectile will have already hit you on the server by the time your request to dodge is received. On the other hand, if the incoming projectile is predicted on the client, it will be aligned in time with your character movement and you should be able to dodge it at the last second effectively.

While this concept may seem strange and unintuitive, it has significant benefits:

Unlike the lockstep architecture, the game responds to player input instantly. Even rollback typically requires some amount of input delay to avoid predicting remote players too far ahead in time. No input delay is required when using snapshot interpolation because only local players are predicted.
Clients require significantly less CPU time compared to full rollback architectures. Because the vast majority of objects are interpolated, the client does not need to execute any game logic associated with them at all.
The client is interpolating objects between known states received from the server, so objects only travel through states in which they’ve actually been. None of the prediction or extrapolation warping artifacts seen in other architectures are typically present when using snapshot interpolation.
Since snapshots contain the state of all objects at a single moment in time, objects are aligned in time with one another, and interactions between them play out in sync.

The last two points are critical guarantees for providing accurate lag compensation as we’ll discuss later on.

Under the Hood

To gain a better understanding of how this is accomplished, let’s take a look at what this process might look like for a single client connected to the server:

You can see that players sample their input, send it to the server (usually along with several past inputs to protect against packet loss), and apply it to predicted objects (e.g. the player character) all on the same frame. Predicted objects respond instantly to player input without waiting for a response from the server. However, the server is the authority so any discrepancies between what was predicted and what actually transpired must be reconciled when the result eventually arrives. So, how is this accomplished?

Prediction, Reconciliation, and Applying Inputs

To predictively apply local inputs, the states of all predicted objects are first loaded from the most recently received snapshot. Each snapshot contains all relevant objects at a single moment in time, so the states of all predicted objects will be the result of the server applying all inputs up until that time. These objects are then predicted forward in time by (re)applying each subsequent local input in chronological order.

You may notice that this process is very similar to the way rollback architectures reconcile new remote inputs. While true, there are two key differences:

Only predicted objects are affected, not the entire game state.
Object states are loaded from a server snapshot rather than from previously saved state. Note that this removes the strict bitwise determinism requirement.
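
To make the replay step concrete, here is a minimal Python sketch of client-side prediction and reconciliation. Everything in it is illustrative rather than taken from a particular engine: the Input and PlayerState types, the one-dimensional apply_input movement rule, the SPEED and TICK constants, and the assumption that each snapshot reports the sequence number of the last input the server applied are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

SPEED = 5.0        # movement speed in units per second (assumed)
TICK = 1.0 / 60.0  # duration represented by one input (assumed fixed tick)


@dataclass
class Input:
    sequence: int  # client-assigned, strictly increasing
    move_x: float  # hypothetical movement axis in [-1, 1]


@dataclass
class PlayerState:
    x: float = 0.0


def apply_input(state: PlayerState, inp: Input) -> PlayerState:
    """Movement logic shared by the client (prediction) and the server."""
    return PlayerState(x=state.x + inp.move_x * SPEED * TICK)


class PredictedPlayer:
    def __init__(self) -> None:
        self.state = PlayerState()
        self.pending: List[Input] = []  # inputs the server has not confirmed yet

    def local_input(self, inp: Input) -> None:
        """Apply a freshly sampled input immediately and remember it."""
        self.pending.append(inp)
        self.state = apply_input(self.state, inp)

    def on_snapshot(self, server_state: PlayerState, last_applied_sequence: int) -> None:
        """Reconcile: start from the server's state, then replay newer inputs."""
        # Inputs the server has already applied are baked into server_state.
        self.pending = [i for i in self.pending if i.sequence > last_applied_sequence]
        state = server_state
        for inp in self.pending:  # chronological order
            state = apply_input(state, inp)
        self.state = state
```

Note that on_snapshot never compares the old prediction with the server’s result. It simply discards the prediction and rebuilds it from the authoritative state, which is why this approach does not depend on bitwise determinism.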

Fixed vs. Variable Tick Rates

The video above illustrates input sampling and prediction as a fixed tick rate simulation, i.e., each input represents what the player was pressing for the same amount of time as every other. It is demonstrated this way for clarity and for parity with earlier articles but it is by no means a requirement.

While some games such as Overwatch work this way, others such as Quake and its derivatives (Apex Legends, Call of Duty, and Counter-Strike, for example) allow inputs to represent a variable amount of time. This is accomplished by attaching a timestamp to each input as it is generated. When an input is applied, the time elapsed since the previously applied input is calculated, and the simulation is then advanced using the new input for that duration.

When choosing between a fixed and a variable tick rate, there are some trade-offs to consider:

A variable tick rate allows clients to sample input at a rate that is synchronized with their rendering rate. This means players with high refresh rate monitors can sample input more frequently and experience reduced input latency. Conversely, when using a fixed tick rate, there can be aliasing between the input sampling rate and the client’s rendering rate which requires extra care to render smoothly and can introduce additional input latency.
Executing game logic with variable time deltas may introduce unexpected differences in behavior, gameplay, and performance depending on a player’s input rate. For example, a player jumping while sampling input at 30 Hz may take a different arc through the air than someone who is sampling input at 120 Hz due to differences in integration precision. Furthermore, a player sampling input at 300 Hz may negatively impact performance on both client and server due to the rate at which inputs need to be applied compared to a player sampling at 60 Hz. For this reason, sampling rates often need to be clamped to both a minimum and maximum value which may negate a lot of the benefits described above.
Snapshots are generated on the server at a fixed rate. If players move at variable rates, aliasing can occur between their rate of movement and the rate of snapshot generation. This causes uneven motion when remote players are interpolated on clients. By contrast, when using a fixed tick rate (at some multiple of the snapshot generation rate), each snapshot will incorporate the same duration of movement for all players.
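
The integration-precision point above is easy to reproduce. The following Python sketch simulates the same jump with different input rates using semi-implicit Euler steps. The GRAVITY and JUMP_SPEED values and the choice of integrator are assumptions made for illustration; real games may integrate differently, but any scheme whose error depends on the step size will show a similar effect.

```python
GRAVITY = -9.8    # m/s^2 (assumed)
JUMP_SPEED = 5.0  # initial upward velocity in m/s (assumed)


def jump_apex(input_rate_hz: float) -> float:
    """Simulate one jump, one step per input, and return the peak height."""
    dt = 1.0 / input_rate_hz  # each input represents this much time
    height = 0.0
    velocity = JUMP_SPEED
    apex = 0.0
    while True:
        # Semi-implicit Euler: update velocity first, then position.
        velocity += GRAVITY * dt
        height += velocity * dt
        if height <= 0.0:  # landed
            return apex
        apex = max(apex, height)
```

With these numbers, the analytic peak is JUMP_SPEED squared divided by twice the magnitude of gravity, or about 1.28 m. Calling jump_apex(30) falls several centimeters short of jump_apex(120), and both fall short of the analytic value, so two players pressing the same button do not reach the same height.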

Interpolation

What about the vast majority of objects which aren’t predicted? To render them smoothly, the client keeps a buffer of snapshots received from the server and displays these objects as an interpolation between them. Because interpolation requires at least two data points, clients must wait until they’ve received two snapshots before they can begin interpolating.

In an ideal world, servers would send snapshots at a constant rate and packets would be delivered equally spaced in time. The client could then interpolate from the first snapshot to the second over that fixed interval. A third snapshot would then arrive just in time to interpolate from the second to the third, and so on.

In practice, packets are subject to jitter and loss, and the time spent waiting for the next snapshot will vary. If the client interpolates to the most recent snapshot and the next isn’t yet available, there are generally two options while waiting for it to arrive:

Stop updating the state of the affected objects
Extrapolate the state of affected objects using previous snapshots

Since both degrade the player experience, it’s important to carefully consider how far back in time to interpolate objects. If the interpolation time is too close to the most recent snapshot received, the next snapshot may not arrive by the time it is needed, as described above. On the other hand, if the chosen interpolation time is too far in the past, there will be even more latency between the player’s predicted actions and the rest of the world. The ideal interpolation time ultimately depends on current network conditions.
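
As a rough illustration, the snapshot buffer and the interpolation step might look like the Python sketch below. The SnapshotBuffer class, the use of a single float as the object’s state, and the idea that the caller computes render_time as its estimate of server time minus a chosen interpolation delay are all assumptions. When the buffer runs dry, this sketch takes the first option listed above and holds the last known state rather than extrapolating.

```python
from bisect import bisect_right
from typing import List, Optional


class SnapshotBuffer:
    """Buffers one interpolated value per snapshot, keyed by server time."""

    def __init__(self) -> None:
        self.times: List[float] = []
        self.positions: List[float] = []

    def add(self, server_time: float, position: float) -> None:
        # Insert in time order so late or reordered packets are still usable.
        index = bisect_right(self.times, server_time)
        self.times.insert(index, server_time)
        self.positions.insert(index, position)

    def sample(self, render_time: float) -> Optional[float]:
        """Return the state to display at render_time, or None if not ready."""
        if len(self.times) < 2:
            return None  # interpolation needs at least two snapshots
        if render_time <= self.times[0]:
            return self.positions[0]
        if render_time >= self.times[-1]:
            # The next snapshot has not arrived: stop updating the object.
            return self.positions[-1]
        # Find the pair of snapshots that brackets render_time.
        index = bisect_right(self.times, render_time)
        t0, t1 = self.times[index - 1], self.times[index]
        alpha = (render_time - t0) / (t1 - t0)
        p0, p1 = self.positions[index - 1], self.positions[index]
        return p0 + (p1 - p0) * alpha
```

A larger interpolation delay makes the hold branch less likely to be hit under jitter, at the cost of showing the world further in the past.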

Prerequisites

As described above, strict bitwise determinism and fixed tick rates are not requirements of this model, but it does introduce a couple of others:

Servers

Snapshot interpolation requires a server. Although the client and server can be the same physical machine (a configuration commonly referred to as a listen server), one machine is always the designated authority. If the server crashes or loses network connectivity during gameplay, it is non-trivial to restore the session and continue playing.

Game State Serialization

The server must be able to build and send clients snapshots that include the entire state of the world at a given time. This requires that network-relevant gameplay state is specifically tagged as such and that systems to serialize that data are implemented.

Depending on the scale of the game world and the minimum connection quality to be supported, serializing and transmitting the game state naively could easily saturate the available bandwidth. In order to transmit snapshots in a bandwidth-efficient manner, a number of techniques are typically employed. These include:

Bit-packing: Although memory is typically accessed a byte at a time or more, when encoding network packets every bit matters. Bit-packing is a technique where data is written bits at a time, using as few as possible to encode the necessary ranges. For example, if a player’s health is stored as a 32-bit integer but only takes on values between 0 and 100, you could write it using only 7 bits for a range of [0, 127] instead of the full 32.
Quantization: Large values, especially floating-point values, can often be encoded using reduced precision without much quality loss. For example, consider a game with a world scale of 1 unit = 1cm and object positions stored as 32-bit floats. For objects that are just being interpolated on clients and have no game logic running against them, 1cm is probably more than sufficient precision for displaying them in the world. If the world is less than 1km along each axis (100,000cm), positions could be rounded to the nearest centimeter and encoded using 17 bits per axis for a range of [0, 131071]. That’s 51 bits per position instead of 96, almost half the size of the original data.
Delta encoding: By having clients acknowledge snapshots as they receive them, servers can then encode only the data that has changed since the most recently acknowledged snapshot. This provides a significant reduction in bandwidth and is absolutely critical to a bandwidth-efficient implementation of this architecture.
General compression: Entropy encoding methods such as Huffman or arithmetic coding, general compression algorithms such as Lempel-Ziv, or even compression middleware like Oodle Network can often be used to squeeze out some additional gains. Note that most of these techniques require pre-training a model using samples of uncompressed network traffic. This can present a challenge during development as network traffic changes. If you don’t retrain regularly the compression ratio may go down and, in some cases, even result in payloads that are larger than the original uncompressed data.
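
The bit-packing and quantization examples above can be combined into a short Python sketch. The BitWriter and BitReader classes and the encode_player and decode_player helpers are hypothetical names written for this illustration. A production implementation would typically pack into a fixed-size word buffer rather than a Python integer, but the bit budget is the same: 7 bits of health plus 3 x 17 bits of position.

```python
from typing import Tuple

HEALTH_BITS = 7     # enough for [0, 127]
POSITION_BITS = 17  # enough for [0, 131071] centimeters
POSITION_MAX = (1 << POSITION_BITS) - 1


class BitWriter:
    def __init__(self) -> None:
        self.value = 0      # accumulated bits, first write in the lowest bits
        self.bit_count = 0

    def write(self, value: int, bits: int) -> None:
        if not 0 <= value < (1 << bits):
            raise ValueError("value does not fit in the requested bit width")
        self.value |= value << self.bit_count
        self.bit_count += bits

    def to_bytes(self) -> bytes:
        return self.value.to_bytes((self.bit_count + 7) // 8, "little")


class BitReader:
    def __init__(self, data: bytes) -> None:
        self.value = int.from_bytes(data, "little")

    def read(self, bits: int) -> int:
        result = self.value & ((1 << bits) - 1)
        self.value >>= bits
        return result


def quantize_cm(position_cm: float) -> int:
    """Round to the nearest centimeter and clamp to the encodable range."""
    return max(0, min(POSITION_MAX, round(position_cm)))


def encode_player(health: int, position_cm: Tuple[float, float, float]) -> bytes:
    writer = BitWriter()
    writer.write(health, HEALTH_BITS)
    for axis in position_cm:
        writer.write(quantize_cm(axis), POSITION_BITS)
    return writer.to_bytes()


def decode_player(data: bytes) -> Tuple[int, Tuple[float, float, float]]:
    reader = BitReader(data)
    health = reader.read(HEALTH_BITS)
    x = float(reader.read(POSITION_BITS))
    y = float(reader.read(POSITION_BITS))
    z = float(reader.read(POSITION_BITS))
    return health, (x, y, z)
```

The encoded player occupies 58 bits, which rounds up to 8 bytes, compared with 16 bytes for a 32-bit health value plus three 32-bit floats. The decoded position differs from the original by at most half a centimeter per axis.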

If you’re looking for additional details on the above techniques, definitely check out Glenn Fiedler’s article on Snapshot Compression. It does an excellent job diving into some of the particulars, which are outside the scope of this article.

Challenges and Limitations

The strengths and weaknesses of this architecture are very different from the others we’ve discussed so far. It can be a great alternative for genres and game designs that don’t fit those other models but, like every architecture, it’s not without its limitations.

Dual Time Streams

The primary challenge when working with snapshot interpolation is designing around the dual time streams introduced by separating objects into two groups—those that are interpolated and those that are predicted. In general, this split limits the interactions that can occur between groups.

For example, if jumping on opponents’ heads is a critical part of a game’s design, snapshot interpolation will be challenging because remote players are interpolated while the local player is predicted. As a result, players will be jumping to the location where their opponents were rather than where they are expected to be. Assuming players are constantly in motion, a player may see themselves perfectly aligned on an opponent’s head while the server and everyone else will see the opponent has already moved away from that location. Even worse, say a successful stomp is supposed to flatten an opponent; attackers can’t see this outcome at the time of impact because interpolated objects cannot be modified by prediction. In this case, the server would need to mark the opponent as flattened and only once the client has interpolated to the snapshot containing this state would they see the result. That additional delay would be proportional to the player’s latency to the server and would probably be a poor user experience for a game designed around this mechanic.

This type of inconsistency comes into play in many scenarios where players interact with moving objects, e.g., melee attacks, player-to-player collisions, and moving platforms. This issue is particularly relevant for hitscan weapons, especially given the snapshot model’s popularity among shooters. If nothing is done to compensate for this gap between predicted and interpolated objects, players would need to lead their targets for the server to register a hit when their shot is processed. The shooter is aiming where players were at some point in the past and their shot won’t be processed by the server until some point in the future. In fact, players had to lead their targets in Quake III for exactly this reason.

Backwards Reconciliation

These days, developers employ a technique called backwards reconciliation. This works by temporarily rewinding the state of objects on the server when processing a player’s input to match the interpolation that client was performing at the time. Doing so turns the perceived negative of dual time streams into a massive benefit: servers can accurately reconstruct what a client was seeing when they performed a particular action. In the case of hitscan weapons, the server can determine precisely who or what the player had in their crosshairs when they fired their weapon. This prevents many common exploits where clients are modified to lie to the server about where they were, who they shot, their rate of fire, etc., while guaranteeing that a shot is honored by the server if the player pulled the trigger with someone in their crosshairs.

While backwards reconciliation enables accurate server-authoritative hit detection, care should be taken to place limits on how far back state can be rewound. At extreme latencies, backwards reconciliation can cause players to feel like they are being shot after running behind cover or when they are already out of range from their attacker.
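
A simplified Python sketch of the rewind, including a cap on how far back the server is willing to go, is shown below. The PositionHistory class, the process_hitscan function, the one-dimensional hit test, and the MAX_REWIND_SECONDS value are all hypothetical. The sketch also assumes the server already knows client_view_time, the server time the shooter was interpolating at when they fired; how that value is obtained or validated is outside its scope.

```python
from typing import List, Tuple

MAX_REWIND_SECONDS = 0.25  # assumed cap on how far back state may be rewound


class PositionHistory:
    """Server-side record of where a target was at recent points in time."""

    def __init__(self) -> None:
        self.samples: List[Tuple[float, float]] = []  # (server_time, x), oldest first

    def record(self, server_time: float, x: float) -> None:
        self.samples.append((server_time, x))

    def at(self, time: float) -> float:
        """Reconstruct the position at `time` the same way a client interpolates."""
        if not self.samples:
            raise ValueError("no history recorded")
        if time <= self.samples[0][0]:
            return self.samples[0][1]
        for (t0, x0), (t1, x1) in zip(self.samples, self.samples[1:]):
            if t0 <= time <= t1:
                if t1 == t0:
                    return x1
                return x0 + (x1 - x0) * (time - t0) / (t1 - t0)
        return self.samples[-1][1]  # newer than anything recorded


def process_hitscan(target: PositionHistory, server_now: float,
                    client_view_time: float, aim_x: float,
                    hit_radius: float) -> bool:
    """Test a shot against the target as the shooter saw it, within limits."""
    # Never rewind further than the cap, and never into the future.
    rewind_time = max(client_view_time, server_now - MAX_REWIND_SECONDS)
    rewind_time = min(rewind_time, server_now)
    rewound_x = target.at(rewind_time)
    return abs(aim_x - rewound_x) <= hit_radius
```

If a target moves at 10 units per second and the shooter’s view is 0.1 seconds behind the server, the target is a full unit away from where the shooter sees it. Testing against the rewound position registers the hit, while a shot that claims a view time older than the cap is evaluated against the oldest state the server allows.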

For more details on this technique, see our Performing Lag Compensation in Unreal Engine 5 article.

Scale

Another challenge when working with snapshot interpolation is managing the scale of a game. Small to medium scale games are easily achieved but, as the scale of a game world grows larger, there are a number of factors that limit how well this model scales.

Snapshot Size

Servers have to send the entire state of the world in each snapshot. Even with the techniques described above, denser worlds ultimately mean larger snapshots. Typically, bandwidth isn’t the limiting factor as the average available bandwidth has increased dramatically in recent years. Instead, the bottleneck tends to be the MTU (Maximum Transmission Unit), or the maximum size of a packet you can transmit from server to client. If snapshots exceed this limit, they can be fragmented and sent as a series of smaller packets, but that means that if a single fragment is lost then the entire snapshot fails to arrive. Because every fragment must arrive, the probability of receiving a complete snapshot falls off exponentially as the number of fragments grows, so a limit is quickly reached where even a little bit of packet loss will prevent most new snapshots from arriving.
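
The effect of fragmentation on delivery can be estimated with a little arithmetic. The Python sketch below assumes that each fragment is lost independently with the same probability, which real networks only approximate since losses often come in bursts. The function names and the 1200-byte payload size used in the example are illustrative choices, not recommendations.

```python
def fragments_needed(snapshot_bytes: int, payload_bytes: int) -> int:
    """Number of packets required to carry a snapshot of the given size."""
    return max(1, (snapshot_bytes + payload_bytes - 1) // payload_bytes)


def delivery_probability(packet_loss: float, fragments: int) -> float:
    """Chance that every fragment arrives, assuming independent losses."""
    return (1.0 - packet_loss) ** fragments
```

With 5% packet loss, a snapshot that fits in a single packet arrives 95% of the time. A 4800-byte snapshot split into four 1200-byte fragments arrives about 81% of the time, and one that needs 16 fragments arrives only about 44% of the time.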

Server Performance

Although delta encoding is critical to efficient bandwidth usage, especially in larger games, it can cause packet generation to become expensive in terms of server CPU usage. Servers must generate a unique payload per client for a couple of reasons:

Snapshots are encoded as deltas from the most recent snapshot the client has acknowledged.
Different sets of objects or fields are typically sent to each client, depending on whether they are network-relevant, predicted, etc.

This creates a workload that tends to grow with the square of the number of players. For example, if there are 4 players each controlling a character then the server must encode 16 object deltas (4 characters for each of the 4 players) whereas if there are 8 players the server must encode 64 object deltas. In other words, when doubling the player count, expect the time spent generating packets to take roughly four times as long. With small player and object counts, the time spent generating packets is negligible but, as player and object counts grow, packet generation can begin to dwarf the time spent advancing the game simulation. There are a number of opportunities for caching to speed this process up but they are outside the scope of this article.
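
To see where the per-client cost comes from, consider this simplified Python sketch of delta encoding against each client’s last acknowledged snapshot. The State layout (object ID to field name to value), the function names, and the use of tick numbers as snapshot identifiers are assumptions made for the example. It ignores object removal and per-client relevance filtering, both of which a real implementation would need to handle.

```python
from typing import Dict

State = Dict[int, Dict[str, int]]  # object id -> field name -> value


def encode_delta(baseline: State, current: State) -> State:
    """Keep only the fields that differ from the baseline snapshot."""
    delta: State = {}
    for object_id, fields in current.items():
        old_fields = baseline.get(object_id, {})
        changed = {name: value for name, value in fields.items()
                   if old_fields.get(name) != value}
        if changed:
            delta[object_id] = changed
    return delta


def apply_delta(baseline: State, delta: State) -> State:
    """Client side: rebuild the full snapshot from a baseline plus a delta."""
    result = {object_id: dict(fields) for object_id, fields in baseline.items()}
    for object_id, changed in delta.items():
        result.setdefault(object_id, {}).update(changed)
    return result


def build_payloads(history: Dict[int, State], acked_tick: Dict[str, int],
                   current_tick: int) -> Dict[str, State]:
    """Encode the current snapshot once per client, each against its own baseline."""
    current = history[current_tick]
    return {
        # An unknown baseline (nothing acknowledged yet) yields a full snapshot.
        client: encode_delta(history.get(tick, {}), current)
        for client, tick in acked_tick.items()
    }
```

Every call to build_payloads walks every object once per client, so when each additional player also adds an object to the world, the number of comparisons grows with the square of the player count.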

Conclusion

Snapshot interpolation is a powerful architecture that is an excellent fit for many small to medium scale games. It is an especially popular choice for shooters where backwards reconciliation can be employed to provide accurate server-authoritative hit detection. In fact, snapshots power many of today’s top FPS titles such as Apex Legends, Call of Duty, and Counter-Strike.

Due to its use of dual time streams, however, it is not a great fit for games that require direct and responsive interaction with other objects in the world such as fighting and sports games. For example, Rocket League’s predecessor, Supersonic Acrobatic Rocket-Powered Battle Cars, was networked using snapshot interpolation but, due to the challenges of predicted cars interacting with an interpolated ball, Psyonix decided to move to a rollback architecture for Rocket League and the player experience was much improved. You can learn more about this in Jared Cone’s excellent GDC talk, It IS Rocket Science! The Physics of Rocket League Detailed.

Next Steps

Be sure to check out the additional reading below if you want to learn more. I also highly recommend checking out the Quake III source code. Although it doesn’t include an implementation of backwards reconciliation, it is extremely useful as a clean and readable (and debuggable!) reference.

If you found yourself wondering whether larger games could be supported by only transmitting partial state updates, stay tuned for part four of this series where we’ll dive into the Tribes networking model, an architecture that’s also similar to those used within Unreal Engine and the Halo series. It’s a potential option for supporting massive worlds while remaining more responsive than a lockstep architecture would allow. We’ll answer the burning question: is it worth what you have to give up?

Jay Mattis is a founder of High Horse Entertainment and author of SnapNet.
SnapNet delivers AAA netcode for real-time multiplayer games.

Additional Reading

Snapshot Interpolation
Glenn Fiedler

Overwatch Gameplay Architecture and Netcode
Timothy Ford

Source Multiplayer Networking
Valve Developer Community