Close Menu
    Facebook X (Twitter) Instagram
    • Privacy Policy
    • Terms Of Service
    • Social Media Disclaimer
    • DMCA Compliance
    • Anti-Spam Policy
    Facebook X (Twitter) Instagram
    Block AI Report
    • Home
    • Crypto News
      • Bitcoin
      • Ethereum
      • Altcoins
      • Blockchain
      • DeFi
    • AI News
    • Stock News
    • Learn
      • AI for Beginners
      • AI Tips
      • Make Money with AI
    • Reviews
    • Tools
      • Best AI Tools
      • Crypto Market Cap List
      • Stock Market Overview
      • Market Heatmap
    • Contact
    Block AI Report
    Home»AI News»Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication
    Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication
    AI News

    Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication

    May 29, 20268 Mins Read
    Share
    Facebook Twitter LinkedIn Pinterest Email
    changelly


    GPU communication overhead is a measurable bottleneck in production AI workloads. According to data cited by the mKernel project, communication can consume 43.6% of the forward pass and 32% of end-to-end training time. Across popular Mixture-of-Experts (MoE) models, inter-device communication can account for up to 47% of total execution time. Researchers from UC Berkeley’s UCCL project have released mKernel, a library of persistent CUDA kernels that fuse intra-node NVLink communication, inter-node RDMA, and compute into a single kernel.

    The Problem: Host-Driven Communication

    The standard model for multi-GPU communication is host-driven: the CPU runs the control path and calls into a library like NCCL or NVSHMEM. The library issues the collective operation — an AllReduce, an AllGather, etc. — across GPUs. Compute and communication run on separate CUDA streams and overlap at kernel boundaries.

    The research team identifies two problems with this approach:

    (1) CPUs are not scaling with GPU compute. A GB300 NVL72 rack integrates 72 Blackwell Ultra GPUs and 36 Grace CPUs, delivering 720 PFLOP/s FP8/FP6, 1.44 EFLOP/s FP4 Tensor Core performance, and 130 TB/s of all-to-all intra-rack NVLink bandwidth. At those speeds, microsecond-scale host orchestration overhead — a cudaLaunchKernel call, a CPU-side “all writes done” check, an inter-stream event — shows up directly as pipeline bubbles.

    Customgpt

    (2) Host-driven systems overlap compute and communication at coarse kernel boundaries. Finer-grained overlap at the tile or chunk level is not possible from the host side.

    The alternative is GPU-driven communication: the GPU itself triggers transfers, with communication fused into the same kernel as the compute. Most existing fused kernel libraries operate within a single node, or a single GPU. mKernel targets the multi-node case.

    What mKernel Does

    mKernel is a library of persistent CUDA kernels. Each kernel fuses intra-node NVLink communication, inter-node RDMA, and dense compute into a single kernel.

    Multi-GPU + multi-node, in one kernel: Both intra-node NVLink and inter-node RDMA live inside the same persistent kernel.

    Fine-grained intra-kernel overlap: Compute and communication overlap at tile/chunk granularity, covering both intra-node and inter-node GPU communication.

    Persistent kernel with SM specialization: CTAs self-assign roles: compute, intra-comm, inter-send, inter-reduce. The number of SMs dedicated to each role is tunable per shape.

    GPU-driven networking built on libibverbs: mKernel uses GPU-initiated RDMA writes without depending on NCCL or NVSHMEM. The communication backend is written from scratch to maximize performance and support heterogeneous networking devices.

    The Five Fused Kernels

    KernelWhat it fusesDescriptionAllGather + GEMMAllGather → GEMMEach rank holds a shard of A. While ranks gather peers’ shards over NVLink/RDMA, the local GEMM consumes tiles as soon as they arrive.GEMM + AllReduceGEMM → AllReduceComputes C = A @ B and reduces partial outputs across all ranks in one launch. Output tiles are pushed into the reduction tree the instant they’re produced.MoE Dispatch + GEMMAll-to-All dispatch → grouped GEMMRoutes MoE tokens to their expert ranks (intra-node NVLink + inter-node all-to-all) and runs the per-expert grouped GEMM in the same kernel. Tokens are processed as soon as they land — no staging buffer round-trip.Ring AttentionRing KV exchange → FlashAttentionSequence-parallel attention across ranks. Each step rotates a KV chunk around the ring while the local FlashAttention consumes the previously-received chunk. Compute and the ring send/recv run concurrently inside a single persistent kernel.GEMM + ReduceScatterGEMM → ReduceScatterComputes C = A @ B and reduce-scatters the output. Each output tile is reduced and forwarded to its owning rank as soon as it is produced.

    Evaluation Setup

    The research team evaluated mKernel on two 2-node × 8-H200 clusters that differ only in their inter-node fabric:

    TestbedNodes × GPUsIntra-nodeInter-node transportNICAWS EFA2 × 8 H200NVLinkAWS EFA / SRD16 × 200 Gb/s EFA per nodeConnectX-72 × 8 H200NVLinkInfiniBand8 × 400 Gb/s NVIDIA ConnectX-7 per node

    mKernel was benchmarked against NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. The team notes that further benchmarking at larger scale is still in progress.

    Backends and Requirements

    mKernel supports two networking backends:

    BackendMacroTransportWhere it runsCX7-DINTERNODE_BACKEND_IBVERBSlibibverbs RCConnectX-7 / InfiniBand / RoCEEFA-DINTERNODE_BACKEND_EFAlibibverbs + efadv (SRD)AWS p5/p5e (H200, EFA)

    Both backends share the same host-side API and the same on-GPU kernel. Only the proxy/session implementation differs (session.h for CX7, session_efa.h for EFA). Requirements: NVIDIA Hopper GPUs (default build targets sm_90a), CUDA 12.9, Python with PyTorch. The CX7 backend requires libibverbs development headers and libraries. The EFA backend requires AWS EFA installation with libfabric, libibverbs, efadv, and EFA headers under EFA_HOME=/opt/amazon/efa by default.

    Marktechpost’s Visual Explainer

    01 / 07 — Overview

    What is mKernel?

    mKernel is an open-source library of persistent CUDA kernels from UC Berkeley’s UCCL project. It fuses intra-node NVLink communication, inter-node RDMA, and dense compute into a single kernel.

    Most existing fused kernel libraries operate within a single node or a single GPU. mKernel is designed from the start to span node boundaries.

    43.6%

    of forward pass consumed by communication in production

    47%

    of total execution time in popular MoE models

    32%

    of end-to-end training time consumed by communication

    02 / 07 — The Problem

    Why Host-Driven Communication Falls Short

    The standard model is host-driven: the CPU calls NCCL or NVSHMEM, which issues collective operations across GPUs. The UCCL team identifies two problems.

    ⚡

    CPUs are not scaling with GPUs. A GB300 NVL72 rack delivers 720 PFLOP/s FP8/FP6 and 1.44 EFLOP/s FP4. At those speeds, microsecond-scale overhead from cudaLaunchKernel, CPU-side sync checks, and inter-stream events shows up directly as pipeline bubbles.

    🔲

    Overlap is too coarse. Host-driven systems overlap compute and communication only at kernel boundaries. Finer-grained overlap at the tile or chunk level is not possible from the host side.

    🔀

    The answer: GPU-driven communication. The GPU itself triggers fine-grained transfers, fused into the same kernel as the compute.

    03 / 07 — Design

    Four Core Design Properties

    🖧

    Multi-GPU + multi-node, in one kernel. Intra-node NVLink and inter-node RDMA both live inside the same persistent kernel.

    🔬

    Fine-grained intra-kernel overlap. Compute and communication overlap at tile/chunk granularity, covering both intra-node and inter-node communication.

    ⚙️

    Persistent kernel with SM specialization. CTAs self-assign roles: compute, intra-comm, inter-send, inter-reduce. SM split is tunable per shape.

    📡

    GPU-driven networking via libibverbs. Uses GPU-initiated RDMA writes. No NCCL or NVSHMEM dependency. Communication backend is written from scratch.

    04 / 07 — Kernels

    The Five Fused Kernels

    AllGather + GEMM

    AllGather —> GEMM

    Each rank holds a shard of A. The local GEMM consumes tiles over NVLink/RDMA as they arrive — matmul starts before the collective finishes.

    GEMM + AllReduce

    GEMM —> AllReduce

    Computes C = A @ B and reduces partial outputs across all ranks in one launch. Output tiles enter the reduction tree the instant they are produced.

    MoE Dispatch + GEMM

    All-to-All dispatch —> grouped GEMM

    Routes MoE tokens to expert ranks via NVLink + inter-node all-to-all, then runs per-expert grouped GEMM in the same kernel. No staging buffer round-trip.

    Ring Attention

    Ring KV exchange —> FlashAttention

    Sequence-parallel attention across ranks. Each step rotates a KV chunk around the ring while the local FlashAttention consumes the previously-received chunk.

    GEMM + ReduceScatter

    GEMM —> ReduceScatter

    Computes C = A @ B and reduce-scatters the output. Each tile is reduced and forwarded to its owning rank as soon as it is produced.

    05 / 07 — Evaluation

    Evaluation Setup

    Tested on two 2-node × 8-H200 clusters differing only in inter-node fabric.

    TestbedNodes × GPUsInter-nodeNIC

    AWS EFA2 × 8 H200AWS EFA / SRD16 × 200 Gb/s EFA per node
    ConnectX-72 × 8 H200InfiniBand8 × 400 Gb/s CX7 per node

    Both testbeds use NVLink intra-node. Benchmarked against: NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. Larger-scale benchmarking is still in progress.

    06 / 07 — Backends & Requirements

    Backends & Requirements

    BackendTransportWhere it runs

    CX7libibverbs RCConnectX-7 / InfiniBand / RoCE
    EFAlibibverbs + efadv (SRD)AWS p5/p5e (H200, EFA)

    📋

    Requirements: NVIDIA Hopper GPUs (default sm_90a), CUDA 12.9, Python with PyTorch. CX7 needs libibverbs headers. EFA needs libfabric, libibverbs, efadv under EFA_HOME=/opt/amazon/efa.

    📝

    License & Attribution: MIT licensed. MMA/compute code adapted from ThunderKittens (HazyResearch).

    07 / 07 — Roadmap & Key Takeaways

    Roadmap & Key Takeaways

    ✅

    Fused GPU-driven multi-node kernels (AG+GEMM, GEMM+AR, MoE Dispatch+GEMM, Ring Attention, GEMM+RS)

    ✅

    ConnectX-7 and AWS EFA backends

    🚧

    Full heterogeneous accelerator/NIC support with topology-aware discovery, placement, routing

    🚧

    Inter-node megakernels: collapsing several fused steps into a single megakernel spanning a transformer layer

    🚧

    Blackwell GPU support

    Fuses NVLink, inter-node RDMA, and compute into a single persistent CUDA kernel

    Five kernels: AllGather+GEMM, GEMM+AllReduce, MoE Dispatch+GEMM, Ring Attention, GEMM+ReduceScatter

    GPU-initiated RDMA via libibverbs — no NCCL or NVSHMEM dependency

    Requires Hopper GPUs (sm_90a) and ConnectX-7 or AWS EFA networking

    Key Takeaways

    • mKernel fuses intra-node NVLink, inter-node RDMA, and compute into a single persistent CUDA kernel.
    • Communication overhead accounts for up to 47% of execution time in MoE models per cited production data.
    • Five kernels are included: AllGather+GEMM, GEMM+AllReduce, MoE Dispatch+GEMM, Ring Attention, and GEMM+ReduceScatter.
    • GPU-initiated RDMA is implemented directly via libibverbs — no NCCL or NVSHMEM dependency.
    • Currently requires Hopper GPUs (sm_90a) and ConnectX-7 or AWS EFA networking; Blackwell support is on the roadmap.

    Check out the Repo and Technical Details. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

    Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us



    Source link

    aistudios
    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    Crypto Expert
    • Website

    Related Posts

    Media Advisory: MIT to establish regional quantum hub | MIT News

    May 28, 2026

    Merck and Mastercard are seeing real agentic AI results. Both say the plumbing came first.

    May 27, 2026

    Autonomous AI systems test governance in physical environments

    May 26, 2026

    Best Authentication Platforms for AI Agents and MCP Servers in 2026

    May 25, 2026
    Add A Comment

    Comments are closed.

    quillbot
    Latest Posts

    Ethereum Price Analysis Points to Short-Term Downside Risk

    May 29, 2026

    TSX Today: What to Watch for in Stocks on Friday, May 29

    May 29, 2026

    11-Year Dormant Wallets Torch $8.3M in BTC

    May 29, 2026

    Trezor Launches USDC, USDT Yield in Trezor Suite Through Morpho

    May 29, 2026

    Media Advisory: MIT to establish regional quantum hub | MIT News

    May 28, 2026
    notion
    LEGAL INFORMATION
    • Privacy Policy
    • Terms Of Service
    • Social Media Disclaimer
    • DMCA Compliance
    • Anti-Spam Policy
    Top Insights

    Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication

    May 29, 2026

    Coinbase Opens Global Crypto Derivatives to US Traders

    May 29, 2026
    aistudios
    Facebook X (Twitter) Instagram Pinterest
    © 2026 BlockAIReport.com - All rights reserved.

    Type above and press Enter to search. Press Esc to cancel.