Large Language Models (LLMs) are evolving at a breathtaking pace. We’ve moved from monolithic models to architectures like Mixture-of-Experts (MoE), which scale efficiently to trillions of parameters, and disaggregated inference, where different stages of model execution—prefill and decode—run on distinct, specialized clusters. These new designs are elegant, but they expose a critical weakness in today’s machine learning infrastructure: communication.

For years, distributed training and inference have relied on collective communication libraries—such as NVIDIA’s NCCL or PyTorch Distributed—which excel at synchronized operations like AllReduce and Broadcast. These are ideal for conventional data and tensor parallel workloads where every GPU acts in lockstep. But with emerging workloads like MoE routing and disaggregated inference, this paradigm starts to break down. These tasks are sparse, dynamic, and non-uniform, requiring flexible point-to-point communication instead of rigid, collective synchronization.

Unfortunately, the high-performance tools designed for point-to-point communication—built on Remote Direct Memory Access (RDMA)—are often locked to specific hardware. A solution that shines on NVIDIA’s ConnectX-7 network interface controller (NIC) might degrade or even fail entirely on AWS’s Elastic Fabric Adapter (EFA). This vendor lock-in limits portability and performance across cloud environments.

Researchers at Perplexity AI address this challenge with TransferEngine, a portable RDMA communication library that offers a uniform, high-speed interface across heterogeneous hardware. In this deep dive, we’ll explore how TransferEngine works, the performance breakthroughs it achieves on both NVIDIA and AWS hardware, and how it powers three production-ready systems:

  1. Disaggregated Inference: Rapid KvCache transfers for elastically scaled clusters.
  2. Reinforcement Learning (RL): Trillion-parameter model weight updates in under 1.3 seconds.
  3. Mixture-of-Experts (MoE): State-of-the-art decode latency with ConnectX-7 and the first viable MoE implementation on AWS EFA.

The Communication Divide: Collectives vs. Point-to-Point

RDMA in a Nutshell

At the heart of modern high-performance computing clusters lies Remote Direct Memory Access (RDMA). RDMA lets one server’s NIC read or write another server’s memory directly, bypassing the kernel and keeping the CPU off the data path. This kernel bypass delivers microsecond-scale latency and up to 400 Gbps of bandwidth, making it indispensable for scaling LLM systems.

RDMA supports two major operation types:

  • Two-sided (SEND/RECV): A coordinated handshake where the receiver posts a RECV buffer before the sender issues a SEND.
  • One-sided (WRITE/READ): Direct access without remote involvement—analogous to having a secure key to a friend’s house. A WRITEIMM extends this by delivering a small 32-bit “immediate” value used for signaling completion.

The Hardware Fragmentation Problem

The key problem arises because not all RDMA implementations behave alike:

  1. NVIDIA ConnectX: Uses Reliable Connection (RC) transport—order-preserving and connection-oriented.
  2. AWS EFA: Introduces Scalable Reliable Datagram (SRD)—connectionless and reliable but unordered.

Many RDMA libraries assume RC’s strict in-order delivery. When deployed on AWS EFA, which offers unordered delivery, these systems break or suffer severe performance losses. This incompatibility has led to fragmented solutions like NVSHMEM and DeepEP that are fast but hardware-restricted.

A comparison of RDMA transport types (RC, UC, UD, SRD) and the common ground TransferEngine builds on, covering reliability, ordering behavior, and supported operations.

Figure: RDMA transport comparison diagram showing reliable but unordered delivery as the shared foundation across NICs.

The authors’ key insight: both ConnectX and EFA support reliable delivery even if unordered. By building on this common ground, TransferEngine achieves high-performance communication across both ecosystems.


TransferEngine: A Portable RDMA Abstraction

Abstraction Without Compromise

TransferEngine provides a unified interface atop two very different software stacks—libibverbs (ConnectX) and libfabric (EFA). It exposes a clean API for flexible point-to-point data exchange, providing portable SEND/RECV for RPC-style messages and high-bandwidth one-sided WRITEs for bulk transfers.

The real innovation lies in how it handles completion notifications in an unordered network world. When large workloads involve hundreds of concurrent transfers, conventional order-dependent tracking fails. TransferEngine solves this with a new primitive: IMMCOUNTER.


The IMMCOUNTER: Completion Without Ordering

Here’s the idea:

  1. Every RDMA WRITEIMM carries an immediate 32-bit value.
  2. On the receiving side, each completed WRITEIMM drained from the NIC’s completion queue atomically increments the IMMCOUNTER tied to that value.
  3. The application just waits for the counter to reach its target (e.g., 100) to confirm all transfers completed.

This mechanism elegantly sidesteps ordering dependencies. It guarantees completeness regardless of arrival sequence, enabling consistent performance across unordered (EFA) and ordered (RC) transports.
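
A minimal, self-contained sketch of the counting logic is shown below. It illustrates the idea rather than TransferEngine’s actual data structures; the worker-thread and completion-queue plumbing is elided.

use std::collections::HashMap;
use std::sync::atomic::{AtomicU32, Ordering};

// Illustrative sketch of the IMMCOUNTER idea: completions are counted per
// immediate value, so arrival order never matters.
struct ImmCounters {
    counters: HashMap<u32, AtomicU32>,
}

impl ImmCounters {
    // Called by the worker thread for every completed WRITEIMM it drains
    // from the NIC's completion queue, in whatever order packets arrived.
    fn on_completion(&mut self, imm: u32) {
        self.counters
            .entry(imm)
            .or_insert_with(|| AtomicU32::new(0))
            .fetch_add(1, Ordering::Release);
    }

    // The application only asks: "have `target` transfers tagged with this
    // immediate value landed yet?" Ordering among them is irrelevant.
    fn reached(&self, imm: u32, target: u32) -> bool {
        self.counters
            .get(&imm)
            .map_or(false, |c| c.load(Ordering::Acquire) >= target)
    }
}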

Architecture diagram of TransferEngine managing GPUs across NUMA domains. Each GPU connects to multiple NICs with IMMCOUNTER-based callbacks for transfer completion.

Figure 1. Overview of TransferEngine architecture across NUMA nodes.

TransferEngine creates one worker thread per GPU, pinned to a CPU core in the GPU’s NUMA domain. Each worker drives all RDMA NICs attached to that GPU, which is essential for EFA configurations that must aggregate multiple NICs to reach full bandwidth. The engine manages these details transparently, sustaining close to 400 Gbps even on cloud hardware.
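
Conceptually, the threading model resembles the sketch below. The Nic placeholder type, the polling loop, and the core pinning are stand-ins for the real, platform-specific machinery.

use std::thread;

struct Nic; // placeholder for a handle to one RDMA NIC

// One worker per GPU: each thread owns every NIC attached to its GPU and
// polls their work and completion queues in a tight loop.
fn spawn_workers(nics_per_gpu: Vec<Vec<Nic>>) -> Vec<thread::JoinHandle<()>> {
    nics_per_gpu
        .into_iter()
        .enumerate()
        .map(|(gpu_idx, nics)| {
            thread::spawn(move || {
                // A real engine would pin this thread to a CPU core in the
                // same NUMA domain as GPU `gpu_idx` before entering the loop.
                let _ = gpu_idx;
                for _nic in &nics {
                    // loop: post queued transfers, drain the completion
                    // queue, bump IMMCOUNTERs, fire user callbacks.
                }
            })
        })
        .collect()
}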


A Look at the API

Here’s a simplified view of the Rust-like API described in the paper:

// Register memory (CPU or GPU) for RDMA transfers.
fn reg_mr(ptr, len, device) -> (MrHandle, MrDesc);

// Two-sided, RPC-style messaging.
fn submit_send(addr: NetAddr, msg: &[u8], cb: fn());
fn submit_recvs(len: u64, cnt: u64, cb: fn(&[u8]));

// One-sided writes for data movement.
fn submit_single_write(len: u64, imm: Option<u32>, ...);
fn submit_paged_writes(page_len: u64, imm: Option<u32>, ...);

// Completion notification without ordering.
fn expect_imm_count(imm: u32, count: u32, cb: fn());

// Group operations for MoE workloads.
fn submit_scatter(h: PeerGroupHandle, ...);
fn submit_barrier(h: PeerGroupHandle, ...);

// Synchronize GPU progress with CPU transfers.
fn alloc_uvm_watcher(cb: fn(u64, u64) -> ()) -> NonNull<u64>;

A particularly powerful concept is the UVM Watcher. It links CPU-side RDMA logic with GPU kernel progress. When a GPU kernel finishes preparing data, it writes to a Unified Virtual Memory address monitored by a low-latency CPU thread via GDRCopy. The change triggers the corresponding RDMA operation—seamlessly aligning GPU compute and network transfer.
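
A rough usage sketch, assuming the signatures above: the layer bookkeeping, the callback body, and the commented-out write parameters are illustrative rather than taken from the paper.

// Allocate a watched 64-bit location in unified memory. The callback runs on
// the CPU worker with the previous and new values whenever the value changes.
let progress = alloc_uvm_watcher(|old_layers, new_layers| {
    // Each increment means another layer's data is ready on the GPU, so the
    // corresponding RDMA writes for layers old..new can be issued here.
    for _layer in old_layers..new_layers {
        // submit_paged_writes(page_len, Some(imm), /* pages, dest, ... */);
    }
});

// A GPU kernel bumps *progress (e.g. to layer_idx + 1) as each layer finishes;
// the worker thread observes the change via GDRCopy and fires the callback.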


Real-World Deployments of TransferEngine

TransferEngine isn’t just theoretical—it’s already powering production systems across various workloads.


1. Disaggregated Inference: KvCache Transfer Between Clusters

During inference, the prefill phase (processing the input context) and the decode phase (generating tokens) are separated onto dedicated clusters for efficiency. The bottleneck lies in moving the large Key-Value (KV) caches between them.

Diagram showing KV cache transfer between a prefiller node and a decoder node using RDMA over InfiniBand.

Figure 3. KV cache transfer between prefiller and decoder in disaggregated inference.

Workflow:

  1. The scheduler assigns a prefiller and decoder node.
  2. The decoder preallocates memory and sends its RDMA memory region descriptors (MrDesc) to the prefiller.
  3. As each layer’s KV cache finishes on the GPU, a CUDA kernel updates the UVM Watcher.
  4. TransferEngine detects the update and triggers a paged RDMA WRITE to the decoder.
  5. The decoder monitors completion via expect_imm_count and begins generating tokens as soon as the cache is complete.
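
The sketch below shows how the two sides might drive this flow with the API above. The descriptor exchange, the immediate value, the serialize helper, and the Device argument are illustrative stand-ins, not the paper’s exact code.

// Decoder: register the preallocated KV region and ship its descriptor to
// the prefiller over the two-sided SEND/RECV path.
let (_kv_mr, kv_desc) = reg_mr(kv_ptr, kv_len, Device::Gpu(0));
submit_send(prefiller_addr, &serialize(&kv_desc), || {});

// Decoder: begin generating only once every layer's pages have landed.
expect_imm_count(REQUEST_IMM, num_layers, || start_decode());

// Prefiller: a UVM watcher fires as each layer's KV cache is produced and
// issues a paged write straight into the decoder's registered buffer.
let _watcher = alloc_uvm_watcher(|old, new| {
    for _layer in old..new {
        // submit_paged_writes(KV_PAGE_LEN, Some(REQUEST_IMM),
        //                     /* this layer's pages -> kv_desc offsets */);
    }
});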

This approach allows independent scaling of prefill and decode clusters without synchronized reinitialization—something traditional collective-based systems cannot achieve.


2. Reinforcement Learning: Trillion-Parameter Model Updates in Seconds

In reinforcement learning fine-tuning, fresh weights from the training GPUs must be pushed to the inference GPUs as quickly as possible. Traditional frameworks funnel the data through a single “Rank 0” GPU, bottlenecking the entire update on one node’s NIC bandwidth.

Comparison of weight transfer architectures. (a) A Rank0-based collective topology with bottlenecks. (b) Point-to-point RDMA-based topology using TransferEngine for direct GPU-to-GPU transfer.

Figure 4. Rank0-based collective vs direct point-to-point transfer.

With TransferEngine, every training GPU writes directly to its target inference GPUs using one-sided RDMA operations, distributing traffic evenly across all NICs.

The team layers this with a multi-stage pipeline to overlap computation and transfer:

Pipeline diagram showing overlapping stages of weight updates: H2D memcpy, tensor preparation, RDMA transfer, and synchronization barrier.

Figure 5. Pipelined weight transfer execution.
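
A simplified sketch of one pipelined update follows. The stage names mirror Figure 5; the bucket and peer bookkeeping (weight_buckets, target_peers, the immediate values) are assumptions made for illustration.

// Each training GPU streams its shards bucket by bucket, overlapping stages
// across iterations rather than running them back to back.
for bucket in weight_buckets {
    // Stage 1: stage the next bucket of weights (host-to-device memcpy).
    stage_h2d_copy(&bucket);
    // Stage 2: prepare tensors (dtype casts, layout) for the inference engine.
    prepare_tensors(&bucket);
    // Stage 3: one-sided writes straight to every target inference GPU,
    // spreading traffic over all NICs instead of funneling through rank 0.
    for _peer in bucket.target_peers() {
        // submit_single_write(bucket.len, Some(bucket.imm), /* peer, desc, ... */);
    }
    // Stage 4: barrier so the inference side only swaps in complete weights.
    // submit_barrier(peer_group, /* ... */);
}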

The outcomes are striking: 1.3-second weight updates for models of up to a trillion parameters (DeepSeek-V3, Qwen3, Kimi-K2), more than a 100× speedup over previous RL infrastructure.


3. Mixture-of-Experts: Portable, Low-Latency Dispatch and Combine

MoE architectures route tokens dynamically across multiple “expert” GPUs. The communication overhead of dispatching and combining tokens is extremely latency-sensitive.

Diagram showing coordination between GPU kernels, a CPU proxy thread, and RDMA NICs during MoE dispatch/combine stages.

Figure 6. GPU–CPU–NIC coordination for MoE dispatch/combine.

TransferEngine powers proxy-based kernels for dispatch and combine, supporting both ConnectX and AWS EFA—a first for portable expert routing. The proxy thread uses GDRCopy to poll GPU progress and issue RDMA transfers through the IMMCOUNTER interface.

To mitigate latency, the system uses a two-phase dispatch: speculative transfers to small private buffers while routing metadata is exchanged, followed by bulk scatter into contiguous buffers.

Schematic showing tokens first sent to small private buffers, then scattered efficiently to contiguous buffers in the dispatch step.

Figure 7. Two-phase dispatch into private and contiguous buffers.
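
Roughly, the proxy’s dispatch loop looks like the sketch below. The polling helpers, buffer sizes, and immediate values are illustrative assumptions, not the paper’s kernel code.

// CPU proxy loop for the dispatch phase of one MoE layer.
loop {
    // Poll GPU progress through the GDRCopy-mapped flag / UVM watcher value.
    let ready = poll_gpu_progress();

    // Phase 1: speculatively push newly ready tokens into small per-source
    // private buffers on the destination ranks while routing metadata is
    // still being exchanged.
    for chunk in ready.chunks(PRIVATE_BUFFER_TOKENS) {
        // submit_single_write(chunk_bytes(chunk), Some(DISPATCH_IMM), /* dst, ... */);
    }

    // Phase 2: once routes are known, scatter tokens into each expert's
    // contiguous receive buffer and hand off to compute when the counter
    // reaches its target.
    if routing_metadata_ready() {
        // submit_scatter(peer_group, /* routed offsets, ... */);
        // expect_imm_count(DISPATCH_IMM, expected_writes, || launch_expert_kernels());
        break;
    }
}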

Results show state-of-the-art decode latency on ConnectX and viable MoE deployment on EFA—breaking long-standing hardware restrictions.


Performance Evaluation

Throughput Benchmarks

TransferEngine achieves near-peak hardware performance across both ConnectX and EFA.

Graphs comparing relative bandwidth for single-write and paged-write operations between TransferEngine (Ours) and NIXL on EFA and ConnectX-7.

Figure 8. Point-to-point bandwidth comparison.

For typical workloads (KV page size 64 KiB and MoE transfer size 256 KiB), TransferEngine saturates the network on both NIC types. At small message sizes, performance differences reflect hardware configuration, not library design.

Table summarizing absolute throughput (Gbps) and ops/s for both NICs across message sizes.

Table 2. EFA vs ConnectX-7 performance comparison.


MoE Decode Latency

The decode phase reveals how efficiently the system handles tightly coupled inter-node dispatch/combine operations.

Bar chart comparing MoE decode latency across several methods and expert-parallel settings (EP8–EP64). Shows TransferEngine outperforming DeepEP and enabling EFA viability.

Figure 11. MoE decode latency across ConnectX and EFA deployments.

Highlights:

  • On ConnectX-7, TransferEngine surpasses DeepEP’s specialized GPU-initiated design.
  • On AWS EFA, it delivers the first practical low-latency MoE routing, only ~30% slower than ConnectX.

This shows that a portable, host-proxy design can match or exceed specialized GPU-initiated RDMA solutions.


MoE Prefill Performance

Large batch prefill workloads emphasize bandwidth utilization.

Bar chart comparing MoE prefill latency for DeepEP and TransferEngine implementations, showing competitive results even without sender-side accumulation.

Figure 12. Prefill latency comparison.

Here, DeepEP gains an edge with sender-side accumulation, but TransferEngine remains competitive without hardware-specific optimizations—illustrating robustness across differing NIC behaviors.


Conclusion: Portable RDMA for the LLM Era

As LLM architectures grow more dynamic, the limitations of collective-only communication become clear. Future systems require flexible, high-speed point-to-point data movement—independent of vendor hardware.

TransferEngine delivers precisely that. By using reliable but unordered delivery as the shared foundation between ConnectX and EFA, and introducing the novel IMMCOUNTER primitive for portable completion signaling, it removes cloud hardware barriers while maintaining top-tier performance.

Across disaggregated inference, reinforcement learning, and Mixture-of-Experts workloads, TransferEngine enables production-scale systems that combine flexibility and speed, sustaining close to the full 400 Gbps of NIC bandwidth and state-of-the-art latency across hardware platforms.

The takeaway is simple: portable point-to-point communication isn’t just possible, it’s essential. TransferEngine makes it practical today, pointing toward cloud-native LLM infrastructure that is free from vendor lock-in and ready to scale.