XCENA MX1, 466 GiB CXL DRAM on /dev/dax9.0. LMCache L1 = 256 GiB per replica (Maru backend).
Qwen3-30B-A3B BF16 MoE on 2x RTX Pro 6000 (96 GB each). vLLM 0.20.1 + LMCache.
10 contexts x 5 turns. ~174K tokens per context (BFD-packed Python). MAX_MODEL_LEN=200K.
3.7x wall clock
4.2x avg TTFT
~15x warm turns
3.8x wall clock · ~17x per-request steady state · holds under concurrency stress
Lablup x XCENA joint benchmark. CXL DRAM on the XCENA MX1 card used as LMCache's L1 backing tier in front of vLLM, for long-context multi-turn agentic inference.
- Quick anchor; the previous talk covered the hardware in depth.
- For this deck: CXL memory means mmap-able DRAM behind PCIe, DMA-accessible from a GPU.
- The "DRAM on a PCIe card" framing is a simplification; CXL is more than that (cache-coherent fabric, etc.), but this captures what matters for KV offload.
- Agentic workloads replay long prefixes across many parallel sessions.
- KV per context (~16 GiB) times active sessions exceeds GPU VRAM.
- All three offload tiers (DRAM, NFS, CXL) are valid candidates with different tradeoffs.
- This deck focuses on the CXL path; the prior deck covered NFS-based offload.
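The ~16 GiB figure can be sanity-checked with a back-of-envelope calculation. The sketch below assumes the attention shape from the published Qwen3-30B-A3B config (48 layers, 4 KV heads, head dimension 128); those three numbers are not stated in this deck and should be treated as assumptions. The function names are illustrative only.

```python
# Assumed attention shape for Qwen3-30B-A3B (from the public model config,
# not from this deck). Adjust if the deployed checkpoint differs.
NUM_LAYERS = 48
NUM_KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # BF16

GIB = 2**30


def kv_bytes_per_token() -> int:
    """KV cache bytes stored per token across all layers."""
    # The factor of 2 covers the key tensor and the value tensor.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


def kv_gib(tokens: int) -> float:
    """KV cache size in GiB for a context of `tokens` tokens."""
    return tokens * kv_bytes_per_token() / GIB


if __name__ == "__main__":
    per_context = kv_gib(174_000)
    print(f"per context: {per_context:.1f} GiB")
    print(f"10 contexts: {10 * per_context:.1f} GiB")
```

Under these assumptions one 174K-token context lands just under 16 GiB, so ten live contexts need far more than a single 96 GB card can hold alongside the weights.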
- LMCache: KV cache management layer that plugs into vLLM.
- Intercepts the prefill / decode lifecycle to capture and restore KV blocks.
- Multi-tier storage backends: host DRAM, CXL, local disk, remote (Mooncake, NIXL, etc.).
- Originally from the CacheBlend/LMCache project; this deck uses a Lablup fork that adds the Maru backend.
- Maru backend: LMCache's path for direct CXL devdax integration.
- L1 is mmap'd directly onto /dev/dax9.0; reads and writes go through the standard memory path.
- No plugin or RPC indirection. The alternative `dax` mode (used in B6) goes through a plugin layer.
- Multiple replicas mmap the same device into non-overlapping regions.
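To make the "non-overlapping regions" idea concrete, here is a minimal sketch of two replicas mapping disjoint slices of one backing object. It is not Maru's code: a temporary file stands in for /dev/dax9.0, the region size is tiny, and `region_for` / `map_region` are hypothetical helpers. A real devdax device may also impose its own, larger mapping alignment.

```python
import mmap
import os
import tempfile

# mmap offsets must be a multiple of this value on the current platform.
GRANULE = mmap.ALLOCATIONGRANULARITY


def region_for(replica: int, region_bytes: int) -> tuple[int, int]:
    """Return (offset, length) of the slice owned by `replica`."""
    if region_bytes % GRANULE:
        raise ValueError("region size must be a multiple of the mapping granularity")
    return replica * region_bytes, region_bytes


def map_region(path: str, replica: int, region_bytes: int) -> mmap.mmap:
    """Map only this replica's slice, shared and writable."""
    offset, length = region_for(replica, region_bytes)
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(fd, length, access=mmap.ACCESS_WRITE, offset=offset)
    finally:
        os.close(fd)  # the mapping remains valid after the descriptor is closed


def demo() -> bytes:
    """Two 'replicas' write into their own slices of one backing file."""
    region = 4 * GRANULE
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.truncate(2 * region)  # stand-in for the device's fixed size
        path = f.name
    try:
        a = map_region(path, 0, region)
        b = map_region(path, 1, region)
        a[:4] = b"kv-0"
        b[:4] = b"kv-1"
        a.flush()
        b.flush()
        a.close()
        b.close()
        with open(path, "rb") as f:
            data = f.read()
        return data[:4] + data[region:region + 4]
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

Because each replica only ever maps its own offset range, the replicas need no locking against each other for plain reads and writes.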
- Plain CXL DRAM via devdax, no managed pool.
- MoE model fits in 96 GB at BF16 with room for KV.
- 174K-token contexts make each prefill expensive enough that the offload trade-off is visible.
- Two GPUs, no NVLink between them.
- One GPU, one request at a time, no batching.
- OFF: every turn does full prefill, TTFT flat at ~40 s.
- ON: turn 0 cold (~41 s, ~1.3 s offload penalty), turn 1 retrieves from CXL (~3.5 s), turn 2+ ~1 s.
- Warm turns drop to ~1 s because vLLM's block allocator retains some KV in VRAM.
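The 4.2x average-TTFT headline follows directly from the per-turn numbers above. A short check, using the rounded values from these notes (so the result is approximate, and the variable names are illustrative):

```python
# Per-turn TTFT in seconds for one 5-turn context, rounded as quoted in the notes.
OFF_TTFT_S = [40.0, 40.0, 40.0, 40.0, 40.0]  # full prefill on every turn
ON_TTFT_S = [41.0, 3.5, 1.0, 1.0, 1.0]       # cold, first CXL retrieval, then warm


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


def avg_ttft_speedup() -> float:
    """Ratio of mean TTFT without offload to mean TTFT with offload."""
    return mean(OFF_TTFT_S) / mean(ON_TTFT_S)


if __name__ == "__main__":
    print(f"mean TTFT OFF: {mean(OFF_TTFT_S):.1f} s")
    print(f"mean TTFT ON:  {mean(ON_TTFT_S):.1f} s")
    print(f"speedup:       {avg_ttft_speedup():.1f}x")
```

The cold turn dominates the ON average (41 s of the 47.5 s total), which is why the average speedup is much smaller than the warm-turn speedup.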
- 2 GPUs, 10 in-flight requests per replica, both replicas share one CXL device.
- Steady-state per-request ~17x faster with CXL KV offload.
- Concurrency increases cache thrashing; offload is still ~3.8x faster in wall-clock time.
- OFF queues requests on top of full re-prefill; CXL offload sidesteps both.
- Device far from saturated: ~8% of the raw 49 GiB/s bandwidth is used, so there is headroom to scale further.
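For a sense of scale, the utilisation figure converts to absolute bandwidth as follows. This is simple arithmetic on the two numbers quoted above; the headroom factor is a raw-bandwidth ratio only and says nothing about other bottlenecks (PCIe, software path) that may appear first.

```python
RAW_BANDWIDTH_GIB_S = 49.0  # raw device bandwidth quoted in the notes
UTILISATION = 0.08          # approximate fraction used in the two-replica run


def used_bandwidth_gib_s(raw: float = RAW_BANDWIDTH_GIB_S,
                         utilisation: float = UTILISATION) -> float:
    """Bandwidth actually consumed, in GiB/s."""
    return raw * utilisation


def headroom_factor(utilisation: float = UTILISATION) -> float:
    """How many times the observed load would fit in the raw bandwidth."""
    return 1.0 / utilisation


if __name__ == "__main__":
    print(f"used:     {used_bandwidth_gib_s():.2f} GiB/s")
    print(f"headroom: {headroom_factor():.1f}x")
```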
- B6 runs the same workload with host DRAM as LMCache's L1 in place of CXL.
- Steady state (turns 2-4) lands at the same floor on both tiers; GPU PCIe DMA sets it.
- Turn 1 difference (3.5 s vs 1.15 s) traces to Maru's first-retrieval software path, not the memory tier.
- Hardware-only test: CXL sustains 9.6 GiB/s, comparable to DRAM.
- Prior deck covered NFS (VAST) KV offload: cluster-wide capacity, multi-second per-fetch latency.
- Each tier serves a different use case; CXL trades cluster capacity for locality.
- Backend.AI runs vLLM in production: managed deployment, autoscaling, routing.
- Exploring CXL-aware scheduling with XCENA: per-replica CXL allocation, co-locate with CXL-equipped nodes.
- CXL capacity as a first-class scheduling resource.
- Potential architecture: two-tier KV cache.
  - L1: CXL (instead of DRAM).
  - L2: VAST.
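One way to picture the two-tier idea is a small, fast tier in front of a large, slow one. The toy model below is purely illustrative and is not LMCache or Backend.AI code: the class name, the write-through policy, the LRU eviction, and the promote-on-L2-hit behaviour are all assumptions chosen to keep the sketch short. Plain dictionaries stand in for CXL (L1) and VAST (L2).

```python
from collections import OrderedDict
from typing import Optional


class TwoTierKVCache:
    """Toy model: bounded L1 (standing in for CXL) over unbounded L2 (standing in for VAST)."""

    def __init__(self, l1_capacity: int) -> None:
        self.l1_capacity = l1_capacity          # max number of entries held in L1
        self.l1: "OrderedDict[str, bytes]" = OrderedDict()
        self.l2: dict[str, bytes] = {}

    def put(self, key: str, blob: bytes) -> None:
        # Assumed policy: write-through, so L2 always holds a copy.
        self.l2[key] = blob
        self._admit(key, blob)

    def _admit(self, key: str, blob: bytes) -> None:
        self.l1[key] = blob
        self.l1.move_to_end(key)
        while len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)          # evict least recently used; L2 copy remains

    def get(self, key: str) -> tuple[Optional[bytes], str]:
        """Return (blob, tier) where tier is 'L1', 'L2', or 'miss'."""
        if key in self.l1:
            self.l1.move_to_end(key)
            return self.l1[key], "L1"
        if key in self.l2:
            blob = self.l2[key]
            self._admit(key, blob)               # promote so the next read is local
            return blob, "L2"
        return None, "miss"


if __name__ == "__main__":
    cache = TwoTierKVCache(l1_capacity=1)
    cache.put("ctx-a", b"A")
    cache.put("ctx-b", b"B")                     # evicts ctx-a from L1
    print(cache.get("ctx-b"))
    print(cache.get("ctx-a"))
```

In this model a context evicted from the local tier is still recoverable from the cluster-wide tier at higher latency, which is the trade-off the L1/L2 split is meant to capture.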