How to Identify and Resolve a QUIC Congestion Control Bug Stemming from a Linux Kernel Optimization
This guide walks you through diagnosing and fixing a subtle bug in QUIC's CUBIC congestion control that causes the congestion window (cwnd) to remain stuck at its minimum after a congestion collapse. The bug originated from a Linux kernel optimization designed to align CUBIC with RFC 9438's app-limited exclusion rule—a perfectly valid fix for TCP that, when ported to Cloudflare's quiche (a QUIC implementation), triggered unexpected behavior. By the end of this guide, you'll know how to reproduce, analyze, and patch similar issues in your own implementations.
What You Need
- Familiarity with CUBIC congestion control: Understand its core logic—growing cwnd on no loss, shrinking on loss, and the cubic function for growth.
- Access to quiche source code (or your own QUIC stack): The bug exists in the ported CUBIC module, typically `quiche/src/congestion/cubic.rs` or similar.
- Integration test environment: A test that simulates heavy early packet loss (e.g., 30-50% loss rate) to trigger congestion collapse.
- Logging and monitoring tools: To track cwnd values over time, especially after recovery attempts.
- Basic knowledge of Linux kernel CUBIC implementation: Especially the app-limited exclusion logic added in recent kernels.
Step-by-Step Instructions
-
Understand CUBIC's Behavior During Congestion Collapse
CUBIC, defined in RFC 9438, manages cwnd using a cubic function of the time since the last congestion event. In normal operation, it grows cwnd while no loss is detected and multiplies it by a reduction factor (0.7 in RFC 9438) on each loss event. In rare cases, such as a severe congestion collapse early in a connection, repeated reductions can drive cwnd down to its minimum (e.g., 2 packets). The algorithm must then recover by probing for available bandwidth. In TCP, the app-limited exclusion stops CUBIC from growing cwnd while the sender is application-limited (i.e., sending less than cwnd allows because the application has no data ready), since ACKs received in that state say nothing about spare capacity. The Linux kernel integrated this exclusion as a fix, but when ported to QUIC, it introduced a flaw.
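The window arithmetic behind this step can be sketched in a few lines. The constants C = 0.4 and beta = 0.7 and the formula W_cubic(t) = C(t - K)^3 + W_max come from RFC 9438; the function names and the use of floating-point segments are assumptions made for illustration and are not taken from quiche or the Linux kernel.

```rust
/// Constants from RFC 9438 (window sizes in segments, time in seconds).
const C: f64 = 0.4;
const BETA_CUBIC: f64 = 0.7;

/// Multiplicative decrease on a congestion event, floored at `min_cwnd`.
fn on_congestion_event(cwnd: f64, min_cwnd: f64) -> f64 {
    (cwnd * BETA_CUBIC).max(min_cwnd)
}

/// K: the time the cubic curve needs to climb from `cwnd_epoch`
/// (the window at the start of the epoch) back up to `w_max`.
fn cubic_k(w_max: f64, cwnd_epoch: f64) -> f64 {
    ((w_max - cwnd_epoch) / C).cbrt()
}

/// W_cubic(t) = C * (t - K)^3 + W_max
fn w_cubic(t: f64, w_max: f64, k: f64) -> f64 {
    C * (t - k).powi(3) + w_max
}
```

With W_max = 100 segments, one reduction gives a starting window of 70, K is the cube root of 30 / 0.4 (about 4.22 seconds), W_cubic(0) is 70, and W_cubic(K) is back at 100. Once the floor of 2 is reached, further loss events leave the window at 2, which is the state the rest of this guide is concerned with.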

Source: blog.cloudflare.com
-
Identify the Symptom: Persistent Test Failures After Loss
Set up an integration test that applies heavy loss (e.g., 40% packet loss) in the first few RTTs of a QUIC connection. For the original bug, such a test failed about 61% of the time. The failure indicator: after the loss event, the cwnd stays at its minimum (say 2 segments) and never recovers, causing throughput to stall indefinitely. Log cwnd values at each ACK and after any loss detection. If the cwnd remains flat at the minimum for many RTTs despite successful transmissions, you've hit the bug.
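One way to turn "flat at the minimum for many RTTs" into a test assertion is sketched below. It assumes you already collect one cwnd sample per RTT into a slice; the function names, the sample format, and the threshold are all hypothetical and should be adapted to your own logging.

```rust
/// Number of consecutive samples at the end of the trace that sit at or
/// below `min_cwnd`. Samples are assumed to be one per RTT, oldest first.
fn trailing_rtts_at_min(cwnd_per_rtt: &[usize], min_cwnd: usize) -> usize {
    cwnd_per_rtt
        .iter()
        .rev()
        .take_while(|&&c| c <= min_cwnd)
        .count()
}

/// Flags a trace whose tail has stayed at the minimum for at least
/// `threshold_rtts` round trips.
fn looks_stuck(cwnd_per_rtt: &[usize], min_cwnd: usize, threshold_rtts: usize) -> bool {
    trailing_rtts_at_min(cwnd_per_rtt, min_cwnd) >= threshold_rtts
}
```

Only the tail of the trace is examined, so a connection that dips to the minimum and then recovers is not flagged. Choose the threshold generously enough that a slow but genuine recovery is not mistaken for the bug.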
-
Trace the Root Cause: App-Limited Exclusion
When a sender is app-limited (it has less data to send than cwnd allows), CUBIC should not grow cwnd, because the window is not what is limiting the flow. The Linux kernel added a check for this: if the sender is app-limited, skip the congestion window increase. This works correctly for TCP because the app-limited state is reliably detected via the socket's send buffer. In QUIC, however, the app-limited detection logic differs. In quiche, the flag indicating app-limited was set incorrectly during the recovery phase after a collapse. Specifically, after a loss event, the code marked the connection as app-limited, and the exclusion logic that later checked this flag kept skipping cwnd growth even after the sender had data to send again. The cwnd stayed at its minimum because the update path treated the sender as not needing a larger window.
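The failure mode described here can be reproduced with a toy model. This is a minimal sketch, not quiche's code: the struct, its fields, and the one-packet-per-ACK growth step (a stand-in for the cubic update) are all assumptions chosen to keep the example short.

```rust
const MIN_CWND: usize = 2;

/// Toy congestion controller with a sticky app-limited flag.
/// All names here are hypothetical.
struct ToyCc {
    cwnd: usize,       // congestion window, in packets
    app_limited: bool, // set on loss and never cleared in this model
}

impl ToyCc {
    fn on_loss(&mut self) {
        // Multiplicative decrease by 0.7, floored at the minimum window.
        self.cwnd = (self.cwnd * 7 / 10).max(MIN_CWND);
        // The mistake: recovery marks the connection as app-limited.
        self.app_limited = true;
    }

    fn on_ack(&mut self) {
        if self.app_limited {
            return; // exclusion: no growth while flagged as app-limited
        }
        self.cwnd += 1; // stand-in for the cubic growth step
    }
}
```

Starting from a window of 10, a burst of losses drives this model to 2 packets, and no number of later ACKs moves it, because nothing ever clears the flag.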
-
Locate the Offending Code in Your QUIC Implementation
Search for where the app-limited flag is set and where CUBIC applies the exclusion. In quiche, the bug resided in the packet processing logic: after a loss, the code set a variable `app_limited` to true, but never reset it when the sender became un-limited. This flag was then used in the CUBIC module to skip cwnd updates. Look for logic like `if app_limited { return; }` within the CUBIC congestion window update path. Confirm that the flag remains true after the sender resumes full data transmission. That's the root cause.
-
Apply the One-Line Fix: Break the Cycle
The elegant fix, as discovered by the Cloudflare team, is to reset the `app_limited` flag when the sender actually transmits data. Near the point where a packet is sent and the connection transitions from app-limited to active, add `app_limited = false;`. This ensures that once data is flowing again, the congestion control logic can resume normal operation. In quiche, this was inserted in the function that acknowledges outgoing packets. Test this fix by re-running the same heavy-loss integration test; the pass rate should jump to near 100%.
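The effect of the reset can be illustrated with a small self-contained model. It is a sketch under stated assumptions: the type, the `on_app_idle` and `on_packet_sent` hooks, and the one-packet-per-ACK growth step are hypothetical and do not mirror quiche's actual functions.

```rust
const MIN_CWND_PKTS: usize = 2;

/// Toy controller with the reset applied. Names are hypothetical.
struct FixedCc {
    cwnd: usize,       // congestion window, in packets
    app_limited: bool, // true while growth should be skipped
}

impl FixedCc {
    fn on_loss(&mut self) {
        self.cwnd = (self.cwnd * 7 / 10).max(MIN_CWND_PKTS);
        self.app_limited = true; // flag still set on loss, as in the bug
    }

    /// The application has nothing to send: a genuine app-limited period.
    fn on_app_idle(&mut self) {
        self.app_limited = true;
    }

    /// The one-line fix: sending data ends the app-limited state.
    fn on_packet_sent(&mut self) {
        self.app_limited = false;
    }

    fn on_ack(&mut self) {
        if self.app_limited {
            return; // exclusion still applies while the flag is set
        }
        self.cwnd += 1; // stand-in for the cubic growth step
    }
}
```

In this model, ten send and ACK rounds after a collapse raise the window from 2 to 12, while an ACK that arrives during a genuinely idle period still leaves it unchanged. Both properties are worth asserting in a unit test next to the real fix.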
-
Verify Recovery and Regression Test
After applying the fix, monitor cwnd traces. You should see cwnd start at the minimum, then gradually increase as the cubic function takes over—typically growing slowly at first, then more rapidly. Run a full suite of congestion control tests, including normal steady-state, low-loss, and high-loss scenarios. Also test edge cases like zero-window probing and idle periods. Confirm that the fix doesn't break other aspects of CUBIC behavior, especially the app-limited exclusion for genuine app-limited periods.
Tips for Robust Congestion Control Testing
- Always test congestion collapse recovery: Most test suites cover steady-state growth, but the most critical bugs lurk in recovery phases. Include scenarios with >30% loss in the first few RTTs.
- Instrument cwnd logging: Log cwnd changes per ACK or per RTT. Use tools like `ss -i` for TCP, or custom QUIC event hooks.
- Beware of ported kernel optimizations: Kernel fixes for TCP may not directly map to QUIC due to differences in how app-limited and flow control states are tracked. Always verify the assumptions.
- Use continuous integration with random loss patterns: Non-deterministic tests can surface intermittent bugs. The 61% failure rate in this case was a red flag.
- Document internal flags: When adding state like `app_limited`, ensure it's cleared at every appropriate transition. A single missing reset can cause a deadlock.
- Review RFC 9438 §4.2-12: Understand the exact conditions for app-limited exclusion. Implement them precisely, but with awareness of the underlying transport's semantics.
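Random loss patterns are most useful in CI when a failing run can be replayed. The sketch below derives the drop decisions from a seed using a small xorshift generator, so logging the seed is enough to reproduce a failure. The type and method names are hypothetical, and the generator is chosen only to avoid external dependencies; any seedable PRNG would serve.

```rust
/// Reproducible packet-loss pattern driven by a seeded xorshift64 PRNG.
struct LossPattern {
    state: u64,
    loss_per_mille: u64, // e.g. 400 means roughly 40% of packets dropped
}

impl LossPattern {
    fn new(seed: u64, loss_per_mille: u64) -> Self {
        // xorshift must not start from an all-zero state.
        Self { state: seed.max(1), loss_per_mille }
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Returns true if the next packet should be dropped.
    fn drop_next(&mut self) -> bool {
        self.next_u64() % 1000 < self.loss_per_mille
    }
}
```

A test harness can pick a fresh seed per run, print it on failure, and feed `drop_next()` into whatever hook its network simulator uses to discard packets.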