
CXL KV Offload for Multi-Turn Inference

Benchmarking XCENA MX1 with LMCache + vLLM

구현회 (Hyunhoi Koo) · Research Engineer, Lablup
2026.05.28


What Is CXL Memory?

"DRAM on a PCIe card"

Byte-addressable memory you mmap from userspace and DMA from a GPU, with capacity beyond CPU DIMM channels.


Agentic Workloads Strain the Memory Hierarchy

Agent characteristics

  • Long contexts (100K+ tokens).
  • Same prefix replayed every turn.
  • Many sessions in flight in parallel.
  • KV cache ~16 GiB per context.

KV cache offload options

  • DRAM: local, fast, capped by DIMM channels.
  • NFS: shared across nodes, network latency per fetch.
  • CXL: local DMA, capacity via PCIe slots.

VRAM alone cannot hold the working KV set. Without an offload tier, every turn pays full re-prefill.


Architecture

  • LMCache intercepts vLLM's KV cache lifecycle.
  • After prefill: async store to CXL.
  • On hit: DMA from CXL to GPU VRAM.
  • Maru backend: L1 mapped onto CXL devdax via mmap.
  • One device backs multiple replicas.

[Diagram: vLLM + FlashInfer → LMCache (store / retrieve KV) → Maru backend (L1 = mmap'd devdax) → XCENA MX1 CXL DRAM]
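The store-after-prefill and retrieve-on-hit flow can be illustrated with a toy prefix-keyed chunk store. This is a minimal sketch, not LMCache's or Maru's actual API: `ToyKVStore`, `chunk_keys`, and the 4-token chunk size are hypothetical names and values chosen so the example is easy to trace. The idea it shows is that each chunk key hashes the entire prefix up to that chunk, so a later turn that replays the same prefix finds the same keys.

```python
import hashlib
from typing import Dict, List

CHUNK_TOKENS = 4  # hypothetical chunk size, kept tiny for readability


def chunk_keys(token_ids: List[int], chunk: int = CHUNK_TOKENS) -> List[str]:
    """Return one key per full chunk; each key covers the whole prefix so far."""
    keys = []
    running = hashlib.sha256()
    for start in range(0, len(token_ids) - chunk + 1, chunk):
        running.update(repr(token_ids[start:start + chunk]).encode())
        keys.append(running.copy().hexdigest())
    return keys


class ToyKVStore:
    """Stand-in for an offload tier: maps prefix keys to opaque KV bytes."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}

    def store(self, token_ids: List[int], kv_chunks: List[bytes]) -> None:
        # Called after prefill: save one KV blob per full chunk.
        for key, kv in zip(chunk_keys(token_ids), kv_chunks):
            self.blocks.setdefault(key, kv)

    def longest_hit(self, token_ids: List[int]) -> int:
        """Count leading tokens whose KV is already stored."""
        hit = 0
        for key in chunk_keys(token_ids):
            if key not in self.blocks:
                break
            hit += CHUNK_TOKENS
        return hit


store = ToyKVStore()
turn0 = list(range(10))               # first turn: 10 tokens, 2 full chunks
store.store(turn0, [b"kv0", b"kv1"])  # placeholder KV payloads
turn1 = turn0 + [100, 101, 102]       # next turn replays the prefix
hit = store.longest_hit(turn1)
print(f"{hit} of {len(turn1)} tokens can skip prefill")
```

Only the tokens past the hit need fresh prefill, which is why the warm turns in the benchmarks are so much cheaper than turn 0.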


Experiment Setup

Cache

XCENA MX1, 466 GiB CXL DRAM on /dev/dax9.0. LMCache L1 = 256 GiB per replica (Maru backend).

Stack

Qwen3-30B-A3B BF16 MoE on 2x RTX Pro 6000 (96 GB each). vLLM 0.20.1 + LMCache.

Workload

10 contexts x 5 turns. ~174K tokens per context (BFD-packed Python). MAX_MODEL_LEN=200K.
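The "~16 GiB per context" figure can be sanity-checked from the context length. The sketch below assumes attention-shape values for Qwen3-30B-A3B (48 layers, 4 KV heads, head dimension 128) that are not stated in this deck; verify them against the model's `config.json` before relying on the result.

```python
# Assumed model shape (not from the deck; check the model's config.json).
NUM_LAYERS = 48
NUM_KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_ELEM = 2   # BF16
TOKENS = 174_000     # "~174K tokens per context"


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, elem_bytes: int) -> int:
    # The factor 2 covers the separate K and V tensors stored in every layer.
    return 2 * layers * kv_heads * head_dim * elem_bytes


per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, BYTES_PER_ELEM)
per_context_gib = per_token * TOKENS / 2**30
print(f"{per_token / 1024:.0f} KiB per token, {per_context_gib:.1f} GiB per context")
```

Under these assumptions the estimate lands just under 16 GiB, so ten contexts need roughly 160 GiB of KV, well beyond a single 96 GB GPU.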


Benchmark 1: Single Replica, Sequential

[Figure: Benchmark 1 per-turn TTFT, OFF vs ON]

3.7x wall clock

4.2x avg TTFT

~15x warm turns


Benchmark 2: Dual Replica, 10-Way Concurrent

[Figure: Per-turn speedup, B1 and B2]

3.8x wall clock  ·  ~17x per-request steady state  ·  holds under concurrency stress


CXL Compared

[Figure: DRAM vs CXL per-turn TTFT]

vs DRAM

  • Same code path on both tiers (Maru backend).
  • Turn 0 cold prefill: ~45 s on both.
  • Turn 1 first retrieval: 3.5 s CXL, 3.6 s DRAM.
  • Steady-state (turn 2-4): ~1.0 s on both.

vs NFS

  • Cluster-wide capacity.
  • Per-fetch network + storage round-trip.

Backend.AI, vLLM, and CXL Memory


Questions?

Lablup x XCENA joint benchmark. CXL DRAM on the XCENA MX1 card serves as LMCache's L1 backing tier for vLLM, targeting long-context, multi-turn agentic inference.

- Quick anchor; the previous talk covered the hardware in depth.
- For this deck: CXL memory means mmap-able DRAM behind PCIe, DMA-accessible from a GPU.
- The "DRAM on a PCIe card" framing is a simplification; CXL is more than that (cache-coherent fabric, etc.) but this captures what matters for KV offload.

- Agentic workloads replay long prefixes across many parallel sessions.
- KV per context (~16 GiB) times active sessions exceeds GPU VRAM.
- All three offload tiers (DRAM, NFS, CXL) are valid candidates with different tradeoffs.
- This deck focuses on the CXL path; the prior deck covered NFS-based offload.

- LMCache: KV cache management layer that plugs into vLLM.
  - Intercepts the prefill / decode lifecycle to capture and restore KV blocks.
  - Multi-tier storage backends: host DRAM, CXL, local disk, remote (Mooncake, NIXL, etc.).
  - Originally from the CacheBlend/LMCache project; this deck uses a Lablup fork that adds the Maru backend.
- Maru backend: LMCache's path for direct CXL devdax integration.
  - L1 is mmap'd directly onto /dev/dax9.0; reads and writes go through the standard memory path.
  - No plugin or RPC indirection. The alternative `dax` mode (used in B6) goes through a plugin layer.
  - Multiple replicas mmap the same device into non-overlapping regions.
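The "non-overlapping regions" point can be illustrated with plain `mmap`. This is a minimal sketch that uses a temporary file as a stand-in for the devdax device; `map_region` and the 2 MiB region size are hypothetical, and this is not Maru's actual implementation. A real devdax mapping carries device-specific alignment requirements that the sketch does not model.

```python
import mmap
import os
import tempfile

REGION_BYTES = 2 * 1024 * 1024  # hypothetical per-replica region size


def map_region(path: str, replica_index: int, region_bytes: int = REGION_BYTES) -> mmap.mmap:
    """Map one replica's slice of a shared backing file; slices never overlap."""
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(
            fd,
            region_bytes,
            access=mmap.ACCESS_WRITE,
            offset=replica_index * region_bytes,
        )
    finally:
        os.close(fd)  # the mapping keeps its own handle on the file


# Temporary file standing in for the shared device.
with tempfile.NamedTemporaryFile(delete=False) as backing:
    backing.truncate(2 * REGION_BYTES)
    path = backing.name

r0 = map_region(path, 0)
r1 = map_region(path, 1)
r0[:4] = b"kv-0"  # each replica writes at offset 0 of its own region
r1[:4] = b"kv-1"
r0.flush()
r1.flush()

with open(path, "rb") as check:
    data = check.read()
first = data[:4]
second = data[REGION_BYTES:REGION_BYTES + 4]

r0.close()
r1.close()
os.remove(path)
print(first, second)
```

Because each replica writes through an ordinary memory mapping, no RPC or plugin call sits on the data path, which is the property the Maru notes emphasize.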

- Plain CXL DRAM via devdax, no managed pool.
- MoE model fits in 96 GB at BF16 with room for KV.
- 174K-token contexts make each prefill expensive enough that the offload trade-off is visible.
- Two GPUs, no NVLink between them.

- One GPU, one request at a time, no batching.
- OFF: every turn does full prefill, TTFT flat at ~40 s.
- ON: turn 0 cold (~41 s, ~1.3 s offload penalty), turn 1 retrieves from CXL (~3.5 s), turn 2+ ~1 s.
- Warm turns drop to ~1 s because vLLM's block allocator retains some KV in VRAM.
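The 4.2x average-TTFT figure for Benchmark 1 follows from the approximate per-turn numbers in these notes. The sketch below plugs in those rounded values (they are the notes' approximations, not raw measurements) to show that the cold turn dominates the ON average.

```python
# Approximate per-turn TTFT in seconds, taken from the notes above.
off_ttft = [40.0, 40.0, 40.0, 40.0, 40.0]  # OFF: full prefill every turn
on_ttft = [41.0, 3.5, 1.0, 1.0, 1.0]       # ON: cold, first retrieval, then warm

avg_off = sum(off_ttft) / len(off_ttft)
avg_on = sum(on_ttft) / len(on_ttft)
speedup = avg_off / avg_on

print(f"avg TTFT OFF {avg_off:.1f} s, ON {avg_on:.1f} s, speedup {speedup:.1f}x")
```

Turn 0 alone accounts for 41 of the 47.5 seconds in the ON total, so sessions with more turns would see a larger average speedup.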

- 2 GPUs, 10 in-flight requests per replica, both replicas share one CXL device.
- Steady-state per-request ~17x faster with CXL KV offload.
- Concurrency increases cache thrashing; offload still ~3.8x wall-clock.
- OFF queues requests on top of full re-prefill; CXL offload sidesteps both.
- Device far from saturated: ~8% of raw 49 GiB/s bandwidth used. Scales further.

- B6 runs the same workload with host DRAM as LMCache's L1 in place of CXL.
- Steady state (turns 2-4) lands at the same floor on both tiers; GPU PCIe DMA sets it.
- Turn 1 difference (3.5 s vs 1.15 s) tracks to Maru's first-retrieval software path, not the memory tier.
- Hardware-only test: CXL sustains 9.6 GiB/s, comparable to DRAM.
- Prior deck covered NFS (VAST) KV offload. Cluster-wide capacity, multi-second per-fetch latency.
- Each tier serves a different use case; CXL trades cluster capacity for locality.

- Backend.AI runs vLLM in production: managed deployment, autoscaling, routing.
- Exploring CXL-aware scheduling with XCENA: per-replica CXL allocation, co-locate with CXL-equipped nodes.
- CXL capacity as a first-class scheduling resource.
- Potential architecture: two-tier KV cache.
  - L1: CXL (instead of DRAM).
  - L2: VAST.
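The two-tier idea can be sketched as a small local tier in front of a larger shared one. This is an illustration only: `TwoTierKV` is a hypothetical class, and the write-through and LRU-eviction policies are assumptions made for the example, not decisions stated in the deck. Plain dictionaries stand in for both CXL (L1) and VAST (L2).

```python
from collections import OrderedDict
from typing import Dict, Optional


class TwoTierKV:
    """Toy two-tier cache: small fast L1 backed by a larger shared L2."""

    def __init__(self, l1_capacity: int) -> None:
        self.l1: "OrderedDict[str, bytes]" = OrderedDict()  # stands in for CXL
        self.l2: Dict[str, bytes] = {}                      # stands in for shared storage
        self.l1_capacity = l1_capacity

    def put(self, key: str, value: bytes) -> None:
        self.l2[key] = value      # assumed write-through: L2 always holds a copy
        self._admit(key, value)

    def get(self, key: str) -> Optional[bytes]:
        if key in self.l1:
            self.l1.move_to_end(key)  # refresh recency on an L1 hit
            return self.l1[key]
        value = self.l2.get(key)
        if value is not None:
            self._admit(key, value)   # promote an L2 hit into L1
        return value

    def _admit(self, key: str, value: bytes) -> None:
        self.l1[key] = value
        self.l1.move_to_end(key)
        while len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)  # evict the least recently used entry


cache = TwoTierKV(l1_capacity=2)
for name in ("a", "b", "c"):
    cache.put(name, name.upper().encode())
value = cache.get("a")  # "a" was evicted from L1, so this is served from L2
print(value, list(cache.l1))
```

In this arrangement a miss in the local tier costs one slower fetch instead of a full re-prefill, while repeated hits stay local.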