2026 · 05 · 28 LLM Inference Distributed Systems ~ 15 min read

NCCL vs NVSHMEM: two answers to the same question

If you've wired up a multi-GPU training run, you've used NCCL. If you've read any recent MoE-inference paper worth its citation count, you've seen NVSHMEM. They sit next to each other on every NVIDIA system and answer the same question — how do these GPUs talk to each other? — at different abstraction levels. This is a short attempt to untangle the two and to argue why the second one matters more than it used to.

Most people who work on distributed deep learning learn one of these libraries first and never quite see the other. NCCL [1] is the default in PyTorch DDP, Megatron, DeepSpeed, JAX — anywhere you've ever called torch.distributed.all_reduce. NVSHMEM [2] is the library people reach for when NCCL stops being a good fit, which historically has meant "almost never" but in 2026 increasingly means "sparse all-to-all in Mixture-of-Experts inference."

The simplest way to keep them straight is this:

NCCL gives you collectives. NVSHMEM gives you memory.

Everything else — algorithms, performance characteristics, the integration story — follows from that distinction.

01 NCCL: collectives as a primitive

NCCL's mental model is bulk synchronous parallel. You assemble a buffer on every participating GPU, you call a collective, and after a synchronization point every GPU has the result the collective promised — sum, gather, broadcast, etc. The primitives are the things you'd find in MPI:

ncclAllReduce — every rank ends up with the elementwise sum (or max, etc.) of all ranks' buffers. The bread and butter of data-parallel gradient sync.
ncclAllGather / ncclReduceScatter — the matched pair that powers ZeRO and tensor parallel.
ncclBroadcast, ncclSend / ncclRecv — point-to-point primitives, used less often but available.

From the calling code, every NCCL operation is one line:

// gradient AllReduce — what DDP issues under the hood
ncclAllReduce(grad_send, grad_recv, n,
              ncclFloat, ncclSum,
              comm, stream);
cudaStreamSynchronize(stream);  // barrier

Underneath, NCCL launches a CUDA kernel that runs ring or tree-based algorithms chosen by message size and topology. On a modern node it negotiates NVLink for intra-node bandwidth and InfiniBand or NVL Switch fabrics for cross-node hops. The library is twelve years old and excruciatingly well-tuned: if your communication pattern is regular and your messages are large enough to amortize the ~5–10 µs kernel-launch overhead, NCCL is almost certainly within a few percent of what's achievable.

That ergonomic floor is exactly the problem when your communication pattern isn't regular.

02 NVSHMEM: memory as a primitive

NVSHMEM gives every participating GPU — a PE ("processing element," the SHMEM term) — a view of a global symmetric heap. You allocate from this heap with nvshmem_malloc and the resulting pointer is valid on every PE at the same offset. From any PE, you can directly put data into a remote PE's heap, or get from it, with one-sided RDMA-style operations:

// device-side — yes, called from inside a CUDA kernel
__global__ void my_kernel(float *symmetric_buf, int peer) {
    // ... do some compute ...

    nvshmem_float_put_nbi(symmetric_buf + offset,
                          src_data, n,
                          peer);

    // ... more compute, overlapped with the put ...

    nvshmem_quiet();  // wait for completion
}

Three things in that snippet are the whole story:

It runs in a CUDA kernel. Communication issued from device code, alongside compute, on the same SMs. No host involvement, no separate stream, no kernel launch per message.
It's one-sided. The remote PE doesn't have to call a matching recv. There's no rendezvous, no two-sided handshake — just a write into a memory region the remote PE has agreed to expose.
It's non-blocking (_nbi). The put is fire-and-forget; you fence later with nvshmem_quiet or signal-based completion.

On the API surface NVSHMEM also offers get (one-sided read), atomics (nvshmem_int_atomic_add et al.), and signal-based completion (nvshmem_signal_op + nvshmem_wait_until) — which together let you build producer–consumer patterns inside a single kernel. There are collective wrappers (nvshmem_collective_*), but they're not why anyone uses the library; they exist mostly so NVSHMEM can claim feature parity with OpenSHMEM.

03 The trade-off, concretely

	NCCL	NVSHMEM
Programming model	Bulk synchronous · collectives	One-sided · PGAS (symmetric heap)
Call site	Host code; each op launches a kernel	Inside CUDA kernels, alongside compute
Per-op overhead	~5–10 µs kernel launch + sync	Single-digit µs end-to-end on NVLink
Granularity	Whole-buffer collectives	Word- to message-level puts/gets
Synchronization	Implicit per collective	Explicit — fences, signals, atomics
Algorithm choice	Library picks ring / tree / etc.	You build the algorithm yourself
Compute/comm overlap	Stream-level (separate kernels)	Intra-kernel — true fusion
Integration	PyTorch, Megatron, JAX, DeepSpeed, …	Custom kernels; growing in frameworks
Best at	Regular, dense, large messages	Irregular, fine-grained, latency-bound

Two rows in that table do the most work: call site and granularity. NCCL is host-side and buffer-grained. NVSHMEM is device-side and word-grained. If your problem fits the first column — regular, dense, predictable — NCCL is a more ergonomic and equally fast choice. If it lives in the second column, NCCL forces you to round your communication up to a coarser shape (typically dense all-to-all with worst-case bandwidth) where the cost can be many times what you actually need.

04 Where each one wins today

Training

NCCL dominates and probably always will. Data-parallel gradient AllReduce is NCCL's home court — large, dense, regular, run thousands of times per training step with a predictable pattern. Tensor-parallel AllGather / ReduceScatter is the same story. The collectives map 1:1 to the operations, the messages are big enough to hide launch overhead, and the framework integration is mature.

Inference — dense models

NCCL still wins. Tensor-parallel attention and MLP collectives are regular and large; vLLM, TensorRT-LLM, and SGLang all use NCCL by default for these paths. Sequence-parallel variants (FlashAttention-3's ring attention, DeepSpeed Ulysses) start to want finer-grained patterns, and you'll see NVSHMEM show up in the research kernels — but in production, NCCL is still the path of least resistance.

Inference — MoE

This is where the story flips. In MoE, every token has to be routed to its top-k experts, and the assignment is decided at runtime by a learned router. The resulting communication is fundamentally sparse and irregular: the actual bytes any rank sends to any other rank depend on which tokens picked which experts in this batch.

NCCL's ncclAllToAll has to assume worst-case bandwidth — every rank sending to every rank — even when most of the rank-to-rank traffic is empty. A custom NVSHMEM kernel can put each token directly into the right expert's queue with one-sided puts and signal completion, paying only for the data that actually moves. DeepEP [3] — the DeepSeek MoE all-to-all kernel released in early 2025 — is the highest-profile example, and the gap over a plain NCCL-based implementation is not a few percent.

Fused communication–compute

Anywhere you want to overlap communication and compute inside a single kernel — fused all-to-all + matmul, attention + ring rotation, KV-cache migration during a decode step — NVSHMEM is the only practical option. NCCL operates at kernel granularity, so you can overlap at the stream level but not within a kernel; for workloads where the inner loop wants to issue a small message every few hundred cycles, NVSHMEM is fundamentally the right abstraction.

05 How DeepEP actually does it

DeepEP [3] is worth a closer look because it's the cleanest public example of what an NVSHMEM-native all-to-all looks like in production. The library ships two kernel variants that share a programming model but make very different choices about latency versus throughput — and the gap between them is a useful illustration of how the abstraction level we've been talking about actually pays off.

Two kernels, one symmetric heap, zero CPU on the critical path.

The two-kernel split

The first kernel is the high-throughput dispatch / combine pair used in training and prefill. It assumes the batch is large enough that bandwidth, not latency, is the bottleneck, and it leans on both NVLink (intra-node) and RDMA (inter-node) to move FP8-quantized tokens.

The second is a low-latency decode kernel. During autoregressive decoding each step only routes a tiny number of tokens (one per sequence in the batch), so the bandwidth-optimal algorithm of the first kernel collapses under launch and setup overhead. The low-latency kernel goes pure RDMA via NVSHMEM IBGDA, skips the NVLink-first staging, and is built to minimize per-token wall time.

Both kernels speak the same NVSHMEM symmetric-heap dialect. They differ in topology choices, FP8 vs BF16 paths, and how they overlap with the surrounding kernel — not in the underlying communication primitive.

Dispatch, sketched

Strip out the buffer management and the dispatch phase looks like this:

__global__ void dispatch(const Token *tokens, const int *routing,
                         Token *symmetric_expert_buf,
                         int *symmetric_counter) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_tokens) return;

    int expert = routing[tid];          // router output
    int dst_pe = expert / experts_per_pe;
    int slot   = nvshmem_int_atomic_fetch_add(
                     symmetric_counter, 1, dst_pe);

    // one-sided put: this token into the expert's symmetric buf
    nvshmem_putmem_nbi(symmetric_expert_buf + slot * Tsize,
                       tokens + tid * Tsize, Tsize, dst_pe);

    // signal completion — destination spins on this counter
    nvshmemx_signal_op(&symmetric_done_signal[dst_pe],
                       1, NVSHMEM_SIGNAL_ADD, dst_pe);
}

Three things in that sketch are doing the real work:

Per-token routing decisions, per-token puts. A naive ncclAllToAll would have to round this up to "every rank sends a worst-case-sized buffer to every other rank." DeepEP only moves the bytes that the router actually selected.
Remote atomic counter for placement. nvshmem_int_atomic_fetch_add on the destination's counter gives this token a unique slot in the expert's buffer — no global synchronization, no two-sided handshake. The atomic happens on the receiver's GPU.
Signal-based completion. Instead of an all-ranks barrier, the receiver kernel spins on nvshmem_signal_wait_until for the counter to reach its expected token count. The barrier collapses from "everyone synchronize" to "wait until I have what I'm owed."

IBGDA: removing the CPU from the path

The most under-discussed performance trick in DeepEP is its use of NVSHMEM IBGDA — InfiniBand GPUDirect Async. With IBGDA, the CUDA kernel itself posts work-queue entries directly to the InfiniBand HCA's doorbell. There is no host thread translating an NVSHMEM call into a verbs ibv_post_send, no syscall, no CPU on the critical path at all. The kernel writes 64 bytes to a memory-mapped doorbell and the NIC takes over.

For the decode kernel this matters enormously: every microsecond of CPU-mediated setup overhead is paid per token per step, and there are thousands of decode steps in a typical generation. With IBGDA, the RDMA send latency is bounded by the NIC and the fabric, not by the host's scheduler or by waking up a proxy thread.

Asymmetric topology: NVLink first, then RDMA

Modern training clusters are not flat — within a node you have NVLink at hundreds of GB/s; across nodes you have InfiniBand at tens of GB/s. The high-throughput dispatch kernel exploits this asymmetry explicitly:

Tokens destined for an expert on a remote node are first sent via RDMA to one designated "landing" GPU on the destination node.
That landing GPU then redistributes intra-node via NVLink to the actual expert-holding GPU on its node.
The combine path runs in reverse: NVLink reduce within node, then RDMA back to the source node.

This avoids the worst-case pattern where every GPU on node A talks directly to every GPU on node B over RDMA — a pattern that consumes N² in HCA queue resources for a constant amount of cross-fabric data. With NCCL's collectives, you'd typically rely on the library to figure this out for you. With NVSHMEM, you express it directly in the kernel and pay only for the topology you have.

FP8 in, BF16 out

DeepSeek-V3 already trains and infers in FP8 for the GEMMs, and DeepEP's dispatch path carries this through: tokens are quantized to FP8 before they leave the source GPU, halving the bytes on the wire compared to a BF16 dispatch. The combine path returns in BF16 because the expert outputs have already been dequantized on the destination — there's no accuracy reason to re-quantize for the trip home, and BF16 keeps the gradient-friendly numerics intact for training.

This is the kind of design choice that's natural to make when you're writing the communication kernel yourself. With NCCL you'd be locked into whatever dtype the collective operates on; with NVSHMEM, the wire format is whatever you decide to nvshmem_putmem.

What this buys, concretely

DeepSeek reports the high-throughput dispatch hitting roughly NVLink line rate on H800 (~150 GB/s) and ~50 GB/s on RDMA for the inter-node leg, with the low-latency kernel completing a decode-step all-to-all in the low hundreds of microseconds. The absolute numbers depend on cluster, NIC count, and measurement methodology — but the qualitative story matters more than the exact figures: a custom NVSHMEM kernel that knows the topology, the sparsity pattern, and the dtype contract beats a generic dense collective by enough that the engineering cost is obviously justified.

None of these tricks are conceptually new. Each one — one-sided RDMA, atomic counters, IBGDA, topology-aware staging — exists in the HPC literature going back a decade or more. What's new is the cost/benefit calculation: at H100/H800 cluster prices, with MoE in production at hundreds of millions of tokens per day, spending a quarter of an engineer-year on a custom kernel that saves 30% on routing cost pays for itself in a week.

06 Why the NVSHMEM moment is now

For most of the 2010s, NVSHMEM lived in HPC. Climate codes, particle simulations, a few graph-analytics papers. The deep learning crowd cared about dense AllReduce, NCCL did dense AllReduce well, and that was the end of the conversation.

Three things changed in 2024–2025:

MoE became a production architecture. Mixtral, DeepSeek-V3, GPT-4o-mini's rumored MoE backbone — production inference is suddenly full of sparse, irregular, expert-routed communication patterns that NCCL was not built for.
NVLink Switch and NVL72 lifted the intra-rack bandwidth ceiling. When every GPU in a rack can talk to every other at 1.8 TB/s, the bottleneck is no longer raw bandwidth but per-message overhead — and a kernel-launch per collective is suddenly a real cost rather than rounding error.
Disaggregated serving became real. Splitting prefill and decode across different GPU pools (Splitwise, DistServe, DeepSeek's KV-cache router) requires moving KV-cache pages between machines mid-request — exactly the kind of fine-grained, asynchronous traffic that one-sided puts handle gracefully and collectives don't.

None of these dethrone NCCL. Training still wants collectives. Dense inference still wants collectives. But the share of the LLM serving stack where collectives are the wrong abstraction has gone from "a research curiosity" to "a production line-item" in two years.

07 Picking the right tool

The decision usually collapses to two questions:

Is your communication a standard collective on regular data? AllReduce gradients, AllGather a tensor, broadcast a parameter. Use NCCL. You'll get something close to peak and you'll write one line of code.
Do you want to issue communication from inside a kernel, or is the traffic sparse / irregular? Now you need NVSHMEM. The code is harder, the synchronization is your problem, and the payoff is the ability to do things collectives literally cannot do.

A reasonable production stack uses both. NCCL for training and dense-inference TP. NVSHMEM in custom kernels for MoE routing, KV-cache movement, and any fused-comm-compute paths where the inner loop demands fine-grained traffic. They're not competitors — they're answers to different abstraction levels of the same question.

The interesting open question is what happens to the abstraction layer above both of them. PyTorch's torch.distributed currently exposes the collective view; nothing in mainstream Python frameworks gives you a clean handle on one-sided device-side communication. Until that changes, NVSHMEM lives mostly inside hand-written CUDA — which is fine for kernel authors and a real friction for everyone else. Watch this space; the people writing the next generation of inference engines clearly are.

References

NVIDIA, NCCL: NVIDIA Collective Communications Library. developer.nvidia.com/nccl
NVIDIA, NVSHMEM Documentation. docs.nvidia.com/nvshmem
DeepSeek-AI, DeepEP: An efficient expert-parallel communication library, 2025. github.com/deepseek-ai/DeepEP
OpenSHMEM specification — the standard NVSHMEM implements. openshmem.org

Hangrui Cao · 2026.05.28 · More writing · Contact