CXL KV Offload for Multi-Turn Inference

Lablup x XCENA joint benchmark. CXL DRAM on the XCENA MX1 card used as LMCache's L1 backing tier in front of vLLM, for long-context multi-turn agentic inference.

- Quick anchor; the previous talk covered the hardware in depth.
- For this deck: CXL memory means mmap-able DRAM behind PCIe, DMA-accessible from a GPU.
- The "DRAM on a PCIe card" framing is a simplification; CXL is more than that (cache-coherent fabric, etc.) but this captures what matters for KV offload.

- Agentic workloads replay long prefixes across many parallel sessions.
- KV per context (~16 GiB) times active sessions exceeds GPU VRAM.
- All three offload tiers (DRAM, NFS, CXL) are valid candidates with different tradeoffs.
- This deck focuses on the CXL path; the prior deck covered NFS-based offload.
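A back-of-envelope sketch of the sizing argument above. The model shape (layers, KV heads, head dimension) and the free-VRAM figure are hypothetical, picked only so the result lands near the ~16 GiB per context quoted in this deck; they are not taken from the benchmark configuration.

```python
# Hypothetical model shape, chosen only to land near the ~16 GiB figure.
NUM_LAYERS = 48
NUM_KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_ELEM = 2          # BF16
CONTEXT_TOKENS = 174_000    # 174K-token contexts, as in the workload
GIB = 1024 ** 3


def kv_bytes_per_token(layers, kv_heads, head_dim, elem_bytes):
    """One K and one V vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * elem_bytes


def kv_bytes_per_context(tokens):
    per_token = kv_bytes_per_token(
        NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, BYTES_PER_ELEM
    )
    return tokens * per_token


def sessions_that_fit(free_bytes, tokens):
    """How many full contexts fit in the given amount of free memory."""
    return free_bytes // kv_bytes_per_context(tokens)


if __name__ == "__main__":
    per_context = kv_bytes_per_context(CONTEXT_TOKENS)
    print(f"KV per context: {per_context / GIB:.2f} GiB")
    # 32 GiB of VRAM left for KV after weights is an assumption.
    print("Sessions in 32 GiB:", sessions_that_fit(32 * GIB, CONTEXT_TOKENS))
```

Under these assumptions only a couple of full contexts stay resident at once, which is the gap an offload tier is meant to cover.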

- LMCache: KV cache management layer that plugs into vLLM.
- Intercepts the prefill / decode lifecycle to capture and restore KV blocks.
- Multi-tier storage backends: host DRAM, CXL, local disk, remote (Mooncake, NIXL, etc).
- Originally from CacheBlend/LMCache project; this deck uses a Lablup fork that adds the Maru backend.
- Maru backend: LMCache's path for direct CXL devdax integration.
- L1 is mmap'd directly onto /dev/dax9.0; reads and writes go through the standard memory path.
- No plugin or RPC indirection. The alternative `dax` mode (used in B6) goes through a plugin layer.
- Multiple replicas mmap the same device into non-overlapping regions.
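A minimal sketch of the "non-overlapping regions" idea, not the Maru backend's actual code. The helper names `replica_region` and `map_region`, the even split across replicas, and the 2 MiB alignment are all assumptions for illustration. The demo maps a temporary file so it runs without a devdax device; on real hardware the path would be the device node. Unix only, since it uses `MAP_SHARED` and `PROT_*`.

```python
import mmap
import os
import tempfile

ALIGN = 2 * 1024 * 1024  # assumed devdax mapping alignment (2 MiB)


def replica_region(device_size, num_replicas, replica_idx, align=ALIGN):
    """Return (offset, length) of one replica's slice, both multiples of align."""
    if not 0 <= replica_idx < num_replicas:
        raise ValueError("replica_idx out of range")
    per_replica = (device_size // num_replicas) // align * align
    if per_replica == 0:
        raise ValueError("device too small for this many replicas")
    return replica_idx * per_replica, per_replica


def map_region(path, offset, length):
    """Map [offset, offset + length) of path as shared read/write memory."""
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(
            fd,
            length,
            flags=mmap.MAP_SHARED,
            prot=mmap.PROT_READ | mmap.PROT_WRITE,
            offset=offset,
        )
    finally:
        os.close(fd)  # mmap keeps its own duplicate of the descriptor


if __name__ == "__main__":
    # A sparse temp file stands in for the CXL device in this demo.
    with tempfile.NamedTemporaryFile() as f:
        f.truncate(8 * ALIGN)
        for idx in range(2):
            offset, length = replica_region(8 * ALIGN, 2, idx)
            region = map_region(f.name, offset, length)
            region[:8] = f"replica{idx}".encode()
            print(idx, offset, length, bytes(region[:8]))
            region.close()
```

Because each replica derives its slice from its own index, no coordination is needed at map time; the trade-off is that capacity is statically partitioned.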

- Plain CXL DRAM via devdax, no managed pool.
- MoE model fits in 96 GB at BF16 with room for KV.
- 174K-token contexts make each prefill expensive enough that the offload trade-off is visible.
- Two GPUs, no NVLink between them.

- One GPU, one request at a time, no batching.
- OFF: every turn does full prefill, TTFT flat at ~40 s.
- ON: turn 0 cold (~41 s, ~1.3 s offload penalty), turn 1 retrieves from CXL (~3.5 s), turn 2+ ~1 s.
- Warm turns drop to ~1 s because vLLM's block allocator retains some KV in VRAM.
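A small worked example of what the per-turn numbers above imply over a whole conversation. It uses the rounded TTFT figures from the bullets; the five-turn conversation length is an assumption, and the totals are arithmetic on those rounded values rather than measured results.

```python
TTFT_OFF_S = 40.0               # OFF: every turn re-prefills
TTFT_ON_S = {0: 41.0, 1: 3.5}   # ON: cold turn, then first CXL retrieval
TTFT_ON_WARM_S = 1.0            # ON: turn 2 and later


def ttft_on(turn):
    """Approximate TTFT in seconds for a given turn with offload enabled."""
    return TTFT_ON_S.get(turn, TTFT_ON_WARM_S)


def cumulative_ttft(num_turns):
    """Return (off_total, on_total) TTFT in seconds over num_turns turns."""
    off_total = TTFT_OFF_S * num_turns
    on_total = sum(ttft_on(t) for t in range(num_turns))
    return off_total, on_total


if __name__ == "__main__":
    for turns in (1, 2, 5):
        off_total, on_total = cumulative_ttft(turns)
        print(f"{turns} turn(s): OFF {off_total:.1f} s, ON {on_total:.1f} s, "
              f"ratio {off_total / on_total:.2f}x")
```

With a single turn the offload path is slightly slower because of the cold-turn penalty; the second turn already more than repays it.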

- 2 GPUs, 10 in-flight requests per replica, both replicas share one CXL device.
- Steady-state per-request ~17x faster with CXL KV offload.
- Concurrency increases cache thrashing; offload is still ~3.8x faster in wall-clock time.
- OFF queues requests on top of full re-prefill; CXL offload sidesteps both.
- Device far from saturated: ~8% of the raw 49 GiB/s bandwidth is used, leaving headroom to scale further.

- B6 runs the same workload with host DRAM as LMCache's L1 in place of CXL.
- Steady state (turns 2-4) lands at the same floor on both tiers; GPU PCIe DMA sets it.
- Turn 1 difference (3.5 s vs 1.15 s) traces to Maru's first-retrieval software path, not the memory tier.
- Hardware-only test: CXL sustains 9.6 GiB/s, comparable to DRAM.
- Prior deck covered NFS (VAST) KV offload. Cluster-wide capacity, multi-second per-fetch latency.
- Each tier serves a different use case; CXL trades cluster capacity for locality.

- Backend.AI runs vLLM in production: managed deployment, autoscaling, routing.
- Exploring CXL-aware scheduling with XCENA: per-replica CXL allocation, co-locate with CXL-equipped nodes.
- CXL capacity as a first-class scheduling resource.
- Potential architecture: two-tier KV cache.
  - L1: CXL (instead of DRAM).
  - L2: VAST.
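A toy sketch of how a two-tier lookup like the one proposed above could behave. Everything here is hypothetical: the class `TwoTierKVCache`, the LRU eviction, and the write-through policy are illustration choices, not LMCache's or Backend.AI's design. In-memory dictionaries stand in for the CXL and VAST tiers.

```python
from collections import OrderedDict


class TwoTierKVCache:
    """Small fast L1 (stand-in for CXL) in front of a large L2 (stand-in for VAST)."""

    def __init__(self, l1_capacity_blocks):
        self.l1_capacity = l1_capacity_blocks
        self.l1 = OrderedDict()  # LRU order: oldest entry first
        self.l2 = {}

    def _install_l1(self, key, block):
        self.l1[key] = block
        self.l1.move_to_end(key)  # mark as most recently used
        while len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)  # evict least recently used

    def put(self, key, block):
        # Assumed write-through: L2 always holds a copy, so L1 eviction is free.
        self.l2[key] = block
        self._install_l1(key, block)

    def get(self, key):
        """Return (block, tier), where tier is "L1", "L2", or "miss"."""
        if key in self.l1:
            self.l1.move_to_end(key)
            return self.l1[key], "L1"
        if key in self.l2:
            block = self.l2[key]
            self._install_l1(key, block)  # promote on L2 hit
            return block, "L2"
        return None, "miss"


if __name__ == "__main__":
    cache = TwoTierKVCache(l1_capacity_blocks=2)
    for name in ("a", "b", "c"):
        cache.put(name, name.upper())
    print(cache.get("a"))  # evicted from L1 earlier, served from L2
    print(cache.get("a"))  # promoted, now served from L1
```

Write-through keeps eviction cheap at the cost of writing every block to the slow tier; a write-back variant would trade that for more complex eviction.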

CXL KV Offload for Multi-Turn Inference

Benchmarking XCENA MX1 with LMCache + vLLM

What Is CXL Memory?

Agentic Workloads Strain the Memory Hierarchy

Agent characteristics

KV cache offload options

Architecture

Experiment Setup

Cache

Stack

Workload

Benchmark 1: Single Replica, Sequential

Benchmark 2: Dual Replica, 10-Way Concurrent

CXL Compared

vs DRAM

vs NFS

Backend.AI, vLLM, and CXL Memory

Questions?