Privacy-Aware Split Inference with Speculative Decoding for Large Language Models over Wide-Area Networks

Computer Science > Cryptography and Security. Submitted on 18 Feb 2026.

Abstract: We present a practical system for privacy-aware large language model (LLM) inference that splits a transformer between a trusted local GPU and an untrusted cloud GPU, communicating only intermediate activations over the network. Our system addresses the unique challenges of autoregressive LLM decoding over high-latency wide-area networks (WANs), contributing: (1) an asymmetric layer split where embedding and unembedding layers remain local, ensuring raw tokens never leave the trusted device; (2) the first application of lookahead decoding to split inference over WANs, amortizing network round-trip latency across multiple tokens per iteration; (3) an empirical inversion attack evaluation showing that split depth provides a tunable privacy-performance tradeoff: an attacker can recover ~59% of tokens at a 2-layer split but only ~35% at an 8-layer split, with minimal throughput impact; (4) ablation experiments showing that n-gram speculation accepts 1.2-1.3 tokens per decoding step on average (peak of 7 observed on code), with acceptance rates consistent across model scales; (5) formal verification that lookahead decoding produces token-identical output to sequential decoding under greedy argmax, with zero quality degradation; and (6) scaling validation on Mi
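Contributions (2), (4), and (5) in the abstract rest on one mechanism: draft several tokens cheaply, then have the model verify them in a single step and keep only the prefix that matches its own greedy choice. The following is a minimal sketch of that idea, not the paper's implementation. It swaps full lookahead decoding for a simpler n-gram lookup over the already generated text, and every name in it (`toy_next_token`, `propose_draft`, `decode_speculative`, `draft_len`) is hypothetical. The toy model is a deterministic function standing in for greedy argmax over real logits.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def toy_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for greedy argmax over a real model's logits."""
    return (context[-1] + 1) % 4  # cycles 0, 1, 2, 3, 0, ... so n-grams repeat


def decode_sequential(prompt: List[int], n_new: int, next_token: NextToken) -> List[int]:
    """Baseline: one token per step (one network round trip per token)."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out[len(prompt):]


def propose_draft(history: List[int], draft_len: int) -> List[int]:
    """Draft the tokens that followed the most recent earlier occurrence
    of the last token. Returns an empty draft if there is none."""
    last = history[-1]
    for i in range(len(history) - 2, -1, -1):
        if history[i] == last:
            return history[i + 1 : i + 1 + draft_len]
    return []


def decode_speculative(
    prompt: List[int], n_new: int, next_token: NextToken, draft_len: int = 4
) -> Tuple[List[int], int]:
    """Greedy verification of n-gram drafts. Returns (new tokens, steps)."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_new:
        steps += 1  # assumption: one verification step = one round trip
        draft = propose_draft(out, draft_len)
        ctx = list(out)
        accepted: List[int] = []
        # A real system would score all draft positions in one batched pass;
        # here the toy model is simply called once per position.
        for guess in draft:
            token = next_token(ctx)  # the model's own greedy choice
            accepted.append(token)
            ctx.append(token)
            if token != guess:
                break  # later draft positions were conditioned on a wrong token
        else:
            accepted.append(next_token(ctx))  # draft fully matched (or empty)
        remaining = n_new - (len(out) - len(prompt))
        out.extend(accepted[:remaining])
    return out[len(prompt):], steps
```

Calling `decode_speculative([0], 12, toy_next_token)` returns the same tokens as `decode_sequential([0], 12, toy_next_token)` while using fewer steps, because every accepted token is the model's own greedy choice and a mismatched draft token is never kept. If each step costs one WAN round trip (an assumption of this sketch), the 1.2-1.3 accepted tokens per step reported in the abstract would correspond to roughly 0.77 to 0.83 round trips per token.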

Sources:

https://arxiv.org/abs/2602.16760