Inference

DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference

• Computer Science > Distributed, Parallel, and Cluster Computing [Submitted on 25 Feb 2026]

Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling

• Computer Science > Machine Learning [Submitted on 6 Mar 2025 (v1), last revised 24 Feb 2026 (this version, v4)]

Transform live video for mobile audiences with AWS Elemental Inference

• AWS News Blog | Today, we’re announcing AWS Elemental Inference, a fully managed AI service that automatica

A flaw in using pretrained protein language models in protein-protein interaction inference models

• Abstract With the growing pervasiveness of pretrained protein language models (pLMs), pLM-based methods are increasingly being put forward for the protein-protein interaction (PP

CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models

• Computer Science > Machine Learning [Submitted on 4 Feb 2026]

Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs

• Computer Science > Distributed, Parallel, and Cluster Computing [Submitted on 19 Feb 2026]

Ontology-Guided Neuro-Symbolic Inference: Grounding Language Models with Mathematical Domain Knowledge

• Computer Science > Artificial Intelligence [Submitted on 19 Feb 2026]

Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution

• Computer Science > Machine Learning [Submitted on 24 Oct 2025 (v1), last revised 18 Feb 2026 (this version, v2)]

Privacy-Aware Split Inference with Speculative Decoding for Large Language Models over Wide-Area Networks

• Computer Science > Cryptography and Security [Submitted on 18 Feb 2026]

[RFC] TensaLang: A tensor-first language for LLM inference, lowering through MLIR to CPU/CUDA

• Hello, I’ve been working on a project called TensaLang and it’s finally at a point worth sharing.
• It’s a small language + compiler + runtime for writing LLM forward passes dire

DigitalOcean Gradient™ AI GPU Droplets Optimized for Inference: Increasing Throughput at Lower the Cost

• By Jason Peng and Hemasumanth Rasineni Production-grade LLM inference demands more than just access to GPUs; it requires deep optimization across the entire serving stack, from q

Expanding our Agentic Inference Cloud: Introducing GPU Droplets Powered by AMD Instinct™ MI350X GPUs

• By Waverly Swinton. Published: February 19, 2026. As our Agentic Infer

Multi-agent cooperation through in-context co-player inference

• Computer Science > Artificial Intelligence [Submitted on 18 Feb 2026]

Microsoft Rolls Out Next Inference Accelerator to Boost AI in Azure

• The company devised the new Maia 200 inference accelerator to improve cost and performance for AI inference processing in Azure Cloud Services.

How NVIDIA Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI's Sovereign Models

• As global AI adoption accelerates, developers face a growing challenge: delivering large language model (LLM) performance that meets real-world latency and cost requirements.
• R

AST-PAC: AST-guided Membership Inference for Code

• Computer Science > Artificial Intelligence [Submitted on 30 Jan 2026] Abstract: Code Large Lang

Announcing Amazon SageMaker Inference for custom Amazon Nova models

• AWS News Blog | Since we launched Amazon Nova customization in Amazon SageMaker AI at AWS NY Summit 2025, cust

How low-bit inference enables efficient AI

• In just the past few years, large machine learning models have made incredible strides.
• Today’s models are not only remarkably capable but also achieve impressive results acros

The Container paradox: Why the Inference Cloud Demands a 'Decoupled' Database

• By Kang Xie, Nicole Ghalwash, and Zach Peirce. Published: February 10, 2026. Kubernetes has won

Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization

Automating Inference Optimizations with NVIDIA TensorRT LLM AutoDeploy

• NVIDIA TensorRT LLM enables developers to build high-performance inference engines for large language models (LLMs), but deploying a new architecture traditionally requires signi

Now Available: Anthropic Claude Opus 4.6 on DigitalOcean's Agentic Inference Cloud

• By DigitalOcean. Updated: February 6, 2026. Claude Opus 4.6 is now available on the Digi

How we cut Vertex AI latency by 35% with GKE Inference Gateway

3 Ways NVFP4 Accelerates AI Training and Inference

• The latest AI models continue to grow in size and complexity, demanding increasing amounts of compute performance for

LLM Inference Benchmarking - Measure What Matters

• By Piyush Srivastava, Karnik Modi, Stephen Varela, and Rithish Ramesh Production-grade LLM inference is a complex systems challenge, requiring deep co-designs - from hardware pri

Maia 200: The AI accelerator built for inference

• Maia 200: 3nm TSMC accelerator with native FP8/FP4 tensor cores, 216GB HBM3e, 272MB SRAM.
• Outperforms Amazon Trainium (3× FP4) and Google TPU v7 (FP8), delivering top‑tier infe

Building the Inference Cloud, and What Comes Next

• By Paddy Srinivasan, CEO, DigitalOcean. Published: January 7, 2026. 2025 was a defining year for DigitalOcean, not only be

Token-count-based Batching: Faster, Cheaper Embedding Inference for Queries

• Embedding model inference often struggles with efficiency when serving large volumes of short requests