
Latency measurements

Ever since moving back from Windows, I’ve been paranoid about latency in games on Linux. Slight changes to the environment or settings can, all of a sudden, make the mouse feel very floaty. There have been many community discussions on this topic, and it’s good to know I’m not alone in this.

To investigate this, I used a small microcontroller acting as a USB mouse, paired with a light sensor, to measure click-to-photon latency. Flashed with Open Source LDAT, it can capture hundreds of samples and record them to a CSV file, unattended.

Measurements were done on two computers: a desktop and a laptop. Both have an Ada-generation RTX card and a Zen 4 CPU. I used virtually the same NixOS config on both, and an up-to-date Windows 11 build. The same display was used too - an LG C1 at 120 Hz over HDMI. I have Radeon GPUs lying around, and plan on testing gamescope on them in particular, but that will have to wait until the next batch of captures. App settings were chosen so as not to bottleneck on hardware; my goal was to easily hit 120 FPS on a 120 Hz output and test for any queueing effects in the software stack.

On Linux, I used KDE Plasma 6.6.4 (Wayland session), Proton-GE 10-33, MangoHud 0.8.2 for late FPS limiting, and Nvidia driver 595.58.03. On Windows, I used either the Nvidia control panel or RTSS for FPS limiting, interchangeably.
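Late limiting here refers to MangoHud’s built-in frame limiter. As a minimal sketch of that setup via Steam launch options (the 116 FPS cap is an illustrative value, sitting just below the 120 Hz refresh):

MANGOHUD_CONFIG=fps_limit=116,fps_limit_method=late mangohud %command%

As I understand it, the late method sleeps as late as possible within the frame, trading some frame-time smoothness for lower input latency.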

Despite the automated nature of the tool, it still ends up being a lot of work. Controlling all the variables is a major pain, and I often discovered new things partway through the testing which invalidated prior measurements: LG webOS toggling Black Frame Insertion when a different computer is connected to the same port, or a game not applying V-Sync changes eagerly from its settings menu.

Synthetic tests

As a quick validation run, and to test display settings quickly and easily, I built my own latency testing tool. It’s just a black square that instantly turns white as soon as you click on it, perfect for the tool to measure. I added a configurable delay to simulate input processing. The test was performed on a clean Chromium profile with nothing but out-of-the-box defaults.

How to read the charts

Each chart isolates one parameter, whose values are on the Y axis. Boxes with the same color share the same held-equal context; those parameters and values are listed in the legend, or shown by hovering over the plot. Each variant is represented by an IQR boxplot, with min-max whiskers and a vertical line marking the median.

This looks exactly as expected: the medians and minimums are shifted by roughly the amount of added delay. Except, why is my desktop slower than the laptop? They run the same software versions from a reused NixOS config, on very similar hardware too. I expected them to match, or for the desktop to have a slight edge, if anything. To minimize the differences further, I created a brand new user account on the desktop and ran the test again:

There it was: something about my desktop profile was introducing at least 3 ms of latency! From here, I tried a bunch of things: diffing my existing profile against a clean one with plasma-manager, removing all virtual desktops, disabling all KWin effects and any display scaling. It was only when I randomly started closing all my apps that I found the culprit - the Zed editor. Apparently, this GPU-accelerated text editor can add latency to all my other apps while itself idling in the background. Eventually, I found a good proxy metric to detect this issue:

$ kwincg="/sys/fs/cgroup$(systemctl --user show -p ControlGroup --value plasma-kwin_wayland.service)"
$ sudo bpftrace -e '
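// count DRM_IOCTL_MODE_ATOMIC (0xc03864bc) ioctls issued from the kwin cgroup;
// the interval probe below prints and resets the counter once per second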
tracepoint:syscalls:sys_enter_ioctl
/cgroup == cgroupid("'$kwincg'") && args->cmd == 0xc03864bc/
{
  @mode_atomic = count();
}

interval:s:1
{
  print(@mode_atomic);
  clear(@mode_atomic);
}
'

When there’s at least one open Zed window on the current virtual desktop, kwin makes two DRM_IOCTL_MODE_ATOMIC calls for every display refresh. I reproduced this penalty with VRR enabled as well. Minimizing Zed’s windows is as good as closing it entirely. Interestingly, the rate of ioctl calls grows proportionally to input rate, such as the polling rate of a moving mouse. I tried triggering the same effect with a real mouse, but couldn’t obtain a clean capture that way. It seems both peripherals and running apps can impact the latency of the desktop as a whole, including unrelated apps. I think it may be a systemic issue, so I’ll submit a bug report later.

Thankfully, this does not affect fullscreen games. I only identified the issue after measuring everything else, so it was a relief that the finding didn’t invalidate my existing in-game measurements.

LG display settings

Next up, I tested various settings on the TV.

Setting the input mode to PC (which locks out a bunch of picture settings) had no impact, while Black Frame Insertion seemed to add exactly one frame’s worth of delay. This one really hurts, because I love using it. It seems their implementation adds extra buffering, even though it could be done with a lagless rolling scan.

HDR had a tiny, but measurable effect.

Game tests

I hoped to find a game that supports all three major graphics APIs so I could compare between them. There is a handful of those, typically based on Unreal Engine, but they tend to advertise all but one of the APIs as experimental. I ended up testing three games, each with a reproducible setup for capturing measurements. Comparing across games is pointless, since they all have different animation timings. Instead, I’m going to focus on the different tunables for each API.

Doom Eternal (Vulkan)

This one was easy to set up - just load a dark level (so, any of them) with infinite ammo and observe the precision rifle’s muzzle flash on some dark wall. The game uses Vulkan natively, so it doesn’t need a translation layer on Linux. I couldn’t get it to run directly on Wayland, though, despite this exact issue having been fixed last year.

If we don’t cap FPS below the refresh rate, the game starts buffering frames when V-Sync is enabled. That latency can be recovered by disabling V-Sync, as seen in the next chart, and since the game runs through XWayland, we don’t risk tearing.

VRR by itself isn’t a significant factor:

Neither are Nvidia’s Windows-exclusive settings that I tested:

Borderlands 3 (DX11, DX12)

I modded my save game to remove the magazine clip attachment on a weapon. This makes it not drain any ammo - ideal for looping 500 muzzle flash measurements back to back. The game has an annoying bug where changing V-Sync in DX11 mode doesn’t apply immediately; you need to change the resolution or other settings before it takes effect. Many measurements were invalidated by this bug…

Windows had consistently lower latency, sometimes significantly so when V-Sync was used:

Skipping XWayland can claw back the latency in these cases:
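Skipping XWayland means running the game through Wine’s native Wayland driver. As a sketch of how that’s commonly toggled on recent Proton builds, assuming the PROTON_ENABLE_WAYLAND switch in the Steam launch options:

PROTON_ENABLE_WAYLAND=1 %command%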

DX12 is consistently slower across both operating systems. There might be some Unreal Engine hack that improves it, like the OneFrameThreadLag CVar, but I did not test any.
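For reference, Unreal Engine CVars like that are usually forced through the game’s Engine.ini; a hypothetical sketch, untested here and with no guarantee Borderlands 3 honors it:

[SystemSettings]
r.OneFrameThreadLag=0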

I tried various platform- and API-specific switches:

The only one that had an impact was VKD3D’s latency variable, VKD3D_SWAPCHAIN_LATENCY_FRAMES, but even then, DX12 was still significantly lagging behind DX11:

Capping FPS below the refresh rate, so frames never queue up at the V-Sync mark, makes the biggest difference.

Hades 2 (DX12)

This game had a longer wind-up in its animation, but it was consistent. The measurements support the same conclusions as the prior tests. The following settings help, all else being equal, though they are not necessarily cumulative (a combined sketch follows the list):

  1. Capping below refresh rate
  2. Using wine_wayland
  3. Setting VKD3D_SWAPCHAIN_LATENCY_FRAMES=1
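Put together as Steam launch options, a minimal sketch (again with an illustrative 116 FPS cap):

MANGOHUD_CONFIG=fps_limit=116,fps_limit_method=late PROTON_ENABLE_WAYLAND=1 VKD3D_SWAPCHAIN_LATENCY_FRAMES=1 mangohud %command%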

Summary and recommendations

My takeaway is to prioritize wine_wayland, use late FPS limiting, set VKD3D_SWAPCHAIN_LATENCY_FRAMES=1 in DX12 games, and use VRR if the game can’t hit a stable target frame rate or has bad frame pacing. I really hoped to push the V-Sync, non-VRR results lower, but I don’t see how to get there. After all, there’s still some buffering or delay on Windows as well.

Gaming over the network

I also tested USB over IP and Sunshine+Moonlight between two Linux hosts. I hit this 7.0 regression with Sunshine. USB over IP is basically as fast as the network.
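For context, a minimal sketch of that kind of USB/IP forwarding using the standard Linux usbip tools (the bus ID 1-2 and the address 192.168.1.10 are placeholders):

$ # on the machine the mouse is plugged into: export the device
$ sudo modprobe usbip-host && sudo usbipd -D
$ sudo usbip bind -b 1-2
$ # on the gaming machine: attach it over the network
$ sudo modprobe vhci-hcd
$ sudo usbip attach -r 192.168.1.10 -b 1-2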

Future work

There are a few things I’m looking out for and hoping to test for impact:

These will be very useful for mangochill. The only limiter capable of that today is gamescope, when running on the DRM backend. That flow doesn’t work with Nvidia GPUs.

I’m also interested in checking out the dxvk forks which focus on latency: