Hangrui Cao. ← Back to writing
2026 · 05 · 28 LLM Inference Distributed Systems ~ 15 min read

NCCL vs NVSHMEM: two answers to the same question

If you've wired up a multi-GPU training run, you've used NCCL. If you've read any recent MoE-inference paper worth its citation count, you've seen NVSHMEM. They sit next to each other on every NVIDIA system and answer the same question — how do these GPUs talk to each other? — at different abstraction levels. This is a short attempt to untangle the two and to argue why the second one matters more than it used to.

Most people who work on distributed deep learning learn one of these libraries first and never quite see the other. NCCL [1] is the default in PyTorch DDP, Megatron, DeepSpeed, JAX — anywhere you've ever called torch.distributed.all_reduce. NVSHMEM [2] is the library people reach for when NCCL stops being a good fit, which historically has meant "almost never" but in 2026 increasingly means "sparse all-to-all in Mixture-of-Experts inference."

The simplest way to keep them straight is this:

NCCL gives you collectives. NVSHMEM gives you memory.

Everything else — algorithms, performance characteristics, the integration story — follows from that distinction.


01 NCCL: collectives as a primitive

NCCL's mental model is bulk synchronous parallel. You assemble a buffer on every participating GPU, you call a collective, and after a synchronization point every GPU has the result the collective promised — sum, gather, broadcast, etc. The primitives are the things you'd find in MPI:

From the calling code, every NCCL operation is one line:

// gradient AllReduce — what DDP issues under the hood
ncclAllReduce(grad_send, grad_recv, n,
              ncclFloat, ncclSum,
              comm, stream);
cudaStreamSynchronize(stream);  // barrier

Underneath, NCCL launches a CUDA kernel that runs ring or tree-based algorithms chosen by message size and topology. On a modern node it negotiates NVLink for intra-node bandwidth and InfiniBand or NVL Switch fabrics for cross-node hops. The library is twelve years old and excruciatingly well-tuned: if your communication pattern is regular and your messages are large enough to amortize the ~5–10 µs kernel-launch overhead, NCCL is almost certainly within a few percent of what's achievable.

That ergonomic floor is exactly the problem when your communication pattern isn't regular.


02 NVSHMEM: memory as a primitive

NVSHMEM gives every participating GPU — a PE ("processing element," the SHMEM term) — a view of a global symmetric heap. You allocate from this heap with nvshmem_malloc and the resulting pointer is valid on every PE at the same offset. From any PE, you can directly put data into a remote PE's heap, or get from it, with one-sided RDMA-style operations:

// device-side — yes, called from inside a CUDA kernel
__global__ void my_kernel(float *symmetric_buf, int peer) {
    // ... do some compute ...

    nvshmem_float_put_nbi(symmetric_buf + offset,
                          src_data, n,
                          peer);

    // ... more compute, overlapped with the put ...

    nvshmem_quiet();  // wait for completion
}

Three things in that snippet are the whole story:

On the API surface NVSHMEM also offers get (one-sided read), atomics (nvshmem_int_atomic_add et al.), and signal-based completion (nvshmem_signal_op + nvshmem_wait_until) — which together let you build producer–consumer patterns inside a single kernel. There are collective wrappers (nvshmem_collective_*), but they're not why anyone uses the library; they exist mostly so NVSHMEM can claim feature parity with OpenSHMEM.


03 The trade-off, concretely

NCCLNVSHMEM
Programming model Bulk synchronous · collectives One-sided · PGAS (symmetric heap)
Call site Host code; each op launches a kernel Inside CUDA kernels, alongside compute
Per-op overhead ~5–10 µs kernel launch + sync Single-digit µs end-to-end on NVLink
Granularity Whole-buffer collectives Word- to message-level puts/gets
Synchronization Implicit per collective Explicit — fences, signals, atomics
Algorithm choice Library picks ring / tree / etc. You build the algorithm yourself
Compute/comm overlap Stream-level (separate kernels) Intra-kernel — true fusion
Integration PyTorch, Megatron, JAX, DeepSpeed, … Custom kernels; growing in frameworks
Best at Regular, dense, large messages Irregular, fine-grained, latency-bound

Two rows in that table do the most work: call site and granularity. NCCL is host-side and buffer-grained. NVSHMEM is device-side and word-grained. If your problem fits the first column — regular, dense, predictable — NCCL is a more ergonomic and equally fast choice. If it lives in the second column, NCCL forces you to round your communication up to a coarser shape (typically dense all-to-all with worst-case bandwidth) where the cost can be many times what you actually need.


04 Where each one wins today

Training

NCCL dominates and probably always will. Data-parallel gradient AllReduce is NCCL's home court — large, dense, regular, run thousands of times per training step with a predictable pattern. Tensor-parallel AllGather / ReduceScatter is the same story. The collectives map 1:1 to the operations, the messages are big enough to hide launch overhead, and the framework integration is mature.

Inference — dense models

NCCL still wins. Tensor-parallel attention and MLP collectives are regular and large; vLLM, TensorRT-LLM, and SGLang all use NCCL by default for these paths. Sequence-parallel variants (FlashAttention-3's ring attention, DeepSpeed Ulysses) start to want finer-grained patterns, and you'll see NVSHMEM show up in the research kernels — but in production, NCCL is still the path of least resistance.

Inference — MoE

This is where the story flips. In MoE, every token has to be routed to its top-k experts, and the assignment is decided at runtime by a learned router. The resulting communication is fundamentally sparse and irregular: the actual bytes any rank sends to any other rank depend on which tokens picked which experts in this batch.

NCCL's ncclAllToAll has to assume worst-case bandwidth — every rank sending to every rank — even when most of the rank-to-rank traffic is empty. A custom NVSHMEM kernel can put each token directly into the right expert's queue with one-sided puts and signal completion, paying only for the data that actually moves. DeepEP [3] — the DeepSeek MoE all-to-all kernel released in early 2025 — is the highest-profile example, and the gap over a plain NCCL-based implementation is not a few percent.

Fused communication–compute

Anywhere you want to overlap communication and compute inside a single kernel — fused all-to-all + matmul, attention + ring rotation, KV-cache migration during a decode step — NVSHMEM is the only practical option. NCCL operates at kernel granularity, so you can overlap at the stream level but not within a kernel; for workloads where the inner loop wants to issue a small message every few hundred cycles, NVSHMEM is fundamentally the right abstraction.


05 How DeepEP actually does it

DeepEP [3] is worth a closer look because it's the cleanest public example of what an NVSHMEM-native all-to-all looks like in production. The library ships two kernel variants that share a programming model but make very different choices about latency versus throughput — and the gap between them is a useful illustration of how the abstraction level we've been talking about actually pays off.

Two kernels, one symmetric heap, zero CPU on the critical path.

The two-kernel split

The first kernel is the high-throughput dispatch / combine pair used in training and prefill. It assumes the batch is large enough that bandwidth, not latency, is the bottleneck, and it leans on both NVLink (intra-node) and RDMA (inter-node) to move FP8-quantized tokens.

The second is a low-latency decode kernel. During autoregressive decoding each step only routes a tiny number of tokens (one per sequence in the batch), so the bandwidth-optimal algorithm of the first kernel collapses under launch and setup overhead. The low-latency kernel goes pure RDMA via NVSHMEM IBGDA, skips the NVLink-first staging, and is built to minimize per-token wall time.

Both kernels speak the same NVSHMEM symmetric-heap dialect. They differ in topology choices, FP8 vs BF16 paths, and how they overlap with the surrounding kernel — not in the underlying communication primitive.

Dispatch, sketched

Strip out the buffer management and the dispatch phase looks like this:

__global__ void dispatch(const Token *tokens, const int *routing,
                         Token *symmetric_expert_buf,
                         int *symmetric_counter) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_tokens) return;

    int expert = routing[tid];          // router output
    int dst_pe = expert / experts_per_pe;
    int slot   = nvshmem_int_atomic_fetch_add(
                     symmetric_counter, 1, dst_pe);

    // one-sided put: this token into the expert's symmetric buf
    nvshmem_putmem_nbi(symmetric_expert_buf + slot * Tsize,
                       tokens + tid * Tsize, Tsize, dst_pe);

    // signal completion — destination spins on this counter
    nvshmemx_signal_op(&symmetric_done_signal[dst_pe],
                       1, NVSHMEM_SIGNAL_ADD, dst_pe);
}

Three things in that sketch are doing the real work:

IBGDA: removing the CPU from the path

The most under-discussed performance trick in DeepEP is its use of NVSHMEM IBGDA — InfiniBand GPUDirect Async. With IBGDA, the CUDA kernel itself posts work-queue entries directly to the InfiniBand HCA's doorbell. There is no host thread translating an NVSHMEM call into a verbs ibv_post_send, no syscall, no CPU on the critical path at all. The kernel writes 64 bytes to a memory-mapped doorbell and the NIC takes over.

For the decode kernel this matters enormously: every microsecond of CPU-mediated setup overhead is paid per token per step, and there are thousands of decode steps in a typical generation. With IBGDA, the RDMA send latency is bounded by the NIC and the fabric, not by the host's scheduler or by waking up a proxy thread.

Asymmetric topology: NVLink first, then RDMA

Modern training clusters are not flat — within a node you have NVLink at hundreds of GB/s; across nodes you have InfiniBand at tens of GB/s. The high-throughput dispatch kernel exploits this asymmetry explicitly:

This avoids the worst-case pattern where every GPU on node A talks directly to every GPU on node B over RDMA — a pattern that consumes in HCA queue resources for a constant amount of cross-fabric data. With NCCL's collectives, you'd typically rely on the library to figure this out for you. With NVSHMEM, you express it directly in the kernel and pay only for the topology you have.

FP8 in, BF16 out

DeepSeek-V3 already trains and infers in FP8 for the GEMMs, and DeepEP's dispatch path carries this through: tokens are quantized to FP8 before they leave the source GPU, halving the bytes on the wire compared to a BF16 dispatch. The combine path returns in BF16 because the expert outputs have already been dequantized on the destination — there's no accuracy reason to re-quantize for the trip home, and BF16 keeps the gradient-friendly numerics intact for training.

This is the kind of design choice that's natural to make when you're writing the communication kernel yourself. With NCCL you'd be locked into whatever dtype the collective operates on; with NVSHMEM, the wire format is whatever you decide to nvshmem_putmem.

What this buys, concretely

DeepSeek reports the high-throughput dispatch hitting roughly NVLink line rate on H800 (~150 GB/s) and ~50 GB/s on RDMA for the inter-node leg, with the low-latency kernel completing a decode-step all-to-all in the low hundreds of microseconds. The absolute numbers depend on cluster, NIC count, and measurement methodology — but the qualitative story matters more than the exact figures: a custom NVSHMEM kernel that knows the topology, the sparsity pattern, and the dtype contract beats a generic dense collective by enough that the engineering cost is obviously justified.

None of these tricks are conceptually new. Each one — one-sided RDMA, atomic counters, IBGDA, topology-aware staging — exists in the HPC literature going back a decade or more. What's new is the cost/benefit calculation: at H100/H800 cluster prices, with MoE in production at hundreds of millions of tokens per day, spending a quarter of an engineer-year on a custom kernel that saves 30% on routing cost pays for itself in a week.


06 Why the NVSHMEM moment is now

For most of the 2010s, NVSHMEM lived in HPC. Climate codes, particle simulations, a few graph-analytics papers. The deep learning crowd cared about dense AllReduce, NCCL did dense AllReduce well, and that was the end of the conversation.

Three things changed in 2024–2025:

None of these dethrone NCCL. Training still wants collectives. Dense inference still wants collectives. But the share of the LLM serving stack where collectives are the wrong abstraction has gone from "a research curiosity" to "a production line-item" in two years.


07 Picking the right tool

The decision usually collapses to two questions:

  1. Is your communication a standard collective on regular data? AllReduce gradients, AllGather a tensor, broadcast a parameter. Use NCCL. You'll get something close to peak and you'll write one line of code.
  2. Do you want to issue communication from inside a kernel, or is the traffic sparse / irregular? Now you need NVSHMEM. The code is harder, the synchronization is your problem, and the payoff is the ability to do things collectives literally cannot do.

A reasonable production stack uses both. NCCL for training and dense-inference TP. NVSHMEM in custom kernels for MoE routing, KV-cache movement, and any fused-comm-compute paths where the inner loop demands fine-grained traffic. They're not competitors — they're answers to different abstraction levels of the same question.

The interesting open question is what happens to the abstraction layer above both of them. PyTorch's torch.distributed currently exposes the collective view; nothing in mainstream Python frameworks gives you a clean handle on one-sided device-side communication. Until that changes, NVSHMEM lives mostly inside hand-written CUDA — which is fine for kernel authors and a real friction for everyone else. Watch this space; the people writing the next generation of inference engines clearly are.

References

  1. NVIDIA, NCCL: NVIDIA Collective Communications Library. developer.nvidia.com/nccl
  2. NVIDIA, NVSHMEM Documentation. docs.nvidia.com/nvshmem
  3. DeepSeek-AI, DeepEP: An efficient expert-parallel communication library, 2025. github.com/deepseek-ai/DeepEP
  4. OpenSHMEM specification — the standard NVSHMEM implements. openshmem.org
Hangrui Cao · 2026.05.28 · More writing · Contact