Nvidia delivers first Vera Rubin AI GPU samples to customers - 88-core Vera CPU paired with Rubin GPUs with 288 GB of HBM4 memory apiece

• On track for 2H 2026.

RCCLX: Innovating GPU communications on AMD platforms

• We are open-sourcing the initial version of RCCLX - an enhanced version of RCCL that we developed and tested on Meta’s internal workloads. • RCCLX is fully integrated with Torchc

Engineering Blogs · February 24, 2026 (updated February 25, 2026) · 1 min · 204 words
tiny-gpu-compiler: An educational MLIR-based compiler targeting open-source GPU hardware

• I built an open-source compiler that uses MLIR to compile a C-like GPU kernel language dow

Language Internals · February 24, 2026 (updated February 25, 2026) · 2 min · 420 words
BiScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS

• Prefill/decode disaggregation improves latency-throughput tradeoff for large language model serving. • Energy consumption remains high; autoscaling is too coarse-grained for rapi

GPU-Resident Gaussian Process Regression Leveraging Asynchronous Tasks with HPX

• GPRat library extended to a fully GPU-resident Gaussian Process prediction pipeline. • Combines HPX task‑based parallelism with an intuitive Python API for seamless integration.

The Landscape of GPU-Centric Communication

• GPUs dominate HPC/ML workloads, yet inter‑GPU communication remains a scalability bottleneck. • Traditional CPU‑centric communication is being challenged by GPU‑centric models th

ucTrace: A Multi-Layer Profiling Tool for UCX-driven Communication

• ucTrace delivers fine‑grained UCX communication traces, filling gaps left by existing MPI profilers. • It maps UCX operations back to originating MPI calls, linking host‑to‑devic

GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations

• Computer Science > Distributed, Parallel, and Cluster Computing. Submitted on 19 Feb 2026.

AI craze leaves only one Nvidia RTX 50-series GPU at MSRP - RTX 5060 Ti 8GB makes the final stand, as even the RTX 5050 falls

Intel Hiring More Linux Developers - Including For GPU Drivers / Linux Gaming Stack

• As some good news out of Intel today on the Linux/open-source side following last year’s layof

OS & Internals · February 20, 2026 (updated February 24, 2026) · 2 min · 281 words
The great Bench GPU retest begins - how we're testing for our GPU Hierarchy in 2026, and why upscaling and framegen are still out

• It’s time to test. • Here’s how the sausage is m

Accelerating Data Processing with NVIDIA Multi-Instance GPU and NUMA Node Localization

• NVIDIA flagship data center GPUs in the NVIDIA Ampere, NVIDIA Hopper, and NVIDIA Blackwell families all feature non-uniform memory access (NUMA) behaviors, but expose a single me

DigitalOcean Gradient™ AI GPU Droplets Optimized for Inference: Increasing Throughput at Lower the Cost

• By Jason Peng and Hemasumanth Rasineni • Production-grade LLM inference demands more than just access to GPUs; it requires deep optimization across the entire serving stack, from q

Expanding our Agentic Inference Cloud: Introducing GPU Droplets Powered by AMD Instinct™ MI350X GPUs

• By Waverly Swinton • Published: February 19, 2026 • As our Agentic Infer

Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai

• As AI workloads scale, achieving high throughput, efficient resource usage, and predictable latency becomes essential. • NVIDIA Run:ai addresses these challenges through intellig

Topping the GPU MODE Kernel Leaderboard with NVIDIA cuda.compute

• The leaderboard scores how fast users’ custom GPU kernels solve a set of standard problems like vector addition,

Bruteforcing Accidental Antenna Designs

• Antenna design often seen as black art, but brute-force GPU approach explored. • Janne, novice, used VNA and GPU-based FDTD to simulate and optimize antennas. • Leveraged LLMs to

Warnings in GPU to NVVM pipeline

• Hi, I’m currently in the process of trying to understand the conversion from the GPU dialect to LLVM via the NVVM dialect and GPU code-generation

Language Internals · February 17, 2026 (updated February 24, 2026) · 2 min · 334 words
Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization

Build with Kimi K2.5 Multimodal VLM Using NVIDIA GPU-Accelerated Endpoints

• Kimi K2.5 is a multimodal vision‑language model trained with Megatron‑LM. • It contains 1 trillion parameters, 384 experts, a single dense layer, and 3.2% activation per token. •

Advancing GPU Programming with the CUDA Tile IR Backend for OpenAI Triton

• NVIDIA CUDA Tile is a GPU-based programming model that targets portability for NVIDIA Tensor Cores, unlocking peak GPU performance. • One of the great things about CUDA Tile is t

Ensuring Balanced GPU Allocation in Kubernetes Clusters with Time-Based Fairshare

• NVIDIA Run:ai v2.24 introduces time-based fairshare scheduling for Kubernetes GPU clusters. • Scheduler tracks historical GPU usage, adjusting queue scores to balance long-term r

AWS Weekly Roundup: Amazon EC2 G7e instances, Amazon Corretto updates, and more (January 26, 2026)

• Amazon EC2 G7e instances GA, NVIDIA RTX PRO 6000 Blackwell GPUs, 2.3× better inference than G6e. • G7e offers up to 8 GPUs, 768GB total GPU memory, supports FP8 precision, ideal

Announcing Amazon EC2 G7e instances accelerated by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs

• Amazon EC2 G7e instances launched, powered by NVIDIA RTX PRO 6000 Blackwell GPUs. • Deliver up to 2.3× inference performance over G6e, ideal for generative AI and graphics worklo

Reddit's Home Feed on GPU: Unlock ML Growth and Efficiency

• Author: Cedric Blondeau • TL;DR: We migrated Reddit’s Home Feed Ranker from CPU to GPU to unlock scalability, efficiency, and enable further growth with new architectures like Trans

Engineering Blogs · November 10, 2025 (updated February 24, 2026) · 2 min · 231 words
Hack Week 2025: How these engineers liquid-cooled a GPU server

• Hack Week 2025 at Dropbox centered on the theme ‘Keep It Simple,’ offering opportunities for innovation, experiment

Engineering Blogs · August 27, 2025 (updated February 25, 2026) · 2 min · 296 words
Arm Unveils 2024 Compute Platform: 3nm, Cortex-X925, Cortex-A725, Immortalis-G925

• Arm launches 2024 Client Compute Subsystem (CSS) featuring 3nm process and new Cortex cores. • Cortex-X925 delivers highest single‑thread performance for demanding workloads. • C

Arm Launches Next-Gen Flagship Cortex-X925

• Arm unveils Cortex‑X925, 5th‑gen flagship core, boosting performance and power efficiency. • Core part of 2024 Client Compute Subsystems, paired with DSU‑120 and Immortalis‑G925

Leveraging Spark 3 and NVIDIA's GPUs to Reduce Cloud Cost by up to 70% for Big Data Pipelines

• PayPal runs hundreds of thousands of Spark jobs hourly, processing petabytes of data. • Upgrading to Spark 3 and adopting NVIDIA RAPIDS GPUs cuts cloud costs up to 70%. • GPUs of

Engineering Blogs · February 21, 2024 (updated February 24, 2026) · 1 min · 178 words