Computer Science > Computation and Language [Submitted on 31 Jan 2026]
Title: EPSVec: Efficient and Private Synthetic Data Generation via Dataset Vectors
Abstract: High-quality data is essential for modern machine learning, yet many valuable corpora are sensitive and cannot be freely shared. Synthetic data offers a practical substitute for downstream development, and large language models (LLMs) have emerged as powerful engines for generating it. However, existing private text generation methods are severely inefficient: they are data-intensive, computationally slow, and often require large private corpora or batch sizes to achieve usable quality. We introduce EPSVec, a lightweight, differentially private alternative that steers LLM generation using dataset vectors: directions in activation space that capture the distributional gap between private data and public priors. EPSVec extracts and sanitizes steering vectors just once and then performs standard decoding. This decouples the privacy budget from generation, enabling arbitrarily many synthetic samples without additional privacy cost and yielding strong fidelity even in low-data regimes.
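The abstract does not say how the steering vectors are extracted or sanitized. As a rough illustration of the general idea only, the sketch below assumes a common recipe: take the per-example difference between a private activation and a public mean activation, clip each difference to a fixed L2 norm, sum, add Gaussian noise scaled to that clipping norm, and average. The function names, the clipping step, the noise calibration, and the treatment of the dataset size as public are all assumptions made for this example, not details taken from the paper.

```python
import math
import random


def clip_to_norm(vec, max_norm):
    """Scale vec down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    scale = max_norm / norm
    return [x * scale for x in vec]


def private_dataset_vector(private_acts, public_mean, clip_norm,
                           noise_multiplier, rng):
    """Hypothetical sanitized steering vector (not the paper's procedure).

    Sums the clipped differences (private activation - public mean), adds
    Gaussian noise with std = noise_multiplier * clip_norm to each
    coordinate, then divides by the number of examples. The dataset size
    is assumed to be public here.
    """
    n = len(private_acts)
    dim = len(public_mean)
    total = [0.0] * dim
    for act in private_acts:
        diff = [a - p for a, p in zip(act, public_mean)]
        clipped = clip_to_norm(diff, clip_norm)
        for i in range(dim):
            total[i] += clipped[i]
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [x / n for x in noisy]


def steer(hidden, vector, strength):
    """Shift a hidden state along the steering vector (assumed additive)."""
    return [h + strength * v for h, v in zip(hidden, vector)]


# Toy usage with made-up 2-dimensional "activations".
rng = random.Random(0)
acts = [[1.0, 0.0], [3.0, 0.0]]
vec = private_dataset_vector(acts, [0.0, 0.0], clip_norm=1.0,
                             noise_multiplier=1.0, rng=rng)
steered = steer([0.5, 0.5], vec, strength=2.0)
```

Under this setup the private data is touched only inside `private_dataset_vector`. Reusing the resulting vector for any number of generations is post-processing, which is why a one-time extraction can decouple the privacy budget from the number of synthetic samples.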
Sources:
- https://arxiv.org/abs/2602.21218 (Latest source article published: 2026-02-26 05:00 UTC)