Context engineering case studies: Etsy-specific question answering

• This post investigates the benefits and limitations of prompt engineering in two instances of AI-assisted onboarding relying on large language model (LLM) technology.
• Of particular interest is how truthful (and therefore reliable) LLM-generated answers turn out to be in the context of Etsy-specific question answering.
• Among other insights, we find that asking the LLM to identify specific source snippets is a good way to flag potential hallucinations.
• Over the past few years, pre-trained large-scale/foundation language models such as OpenAI’s o-series [1] and Google’s Gemini family [2] have revolutionized the field of natural language processing (NLP).
• Trained on vast amounts of text, images, code, audio, and videos, such models encapsulate a great deal of world knowledge, which can be called upon to perform a wide range of downstream tasks, such as sentiment analysis, language translation, and natural language inference, among many others.
• The canonical way to improve the task performance of a pre-trained general language model, when it needs specific knowledge beyond its original training, is called fine-tuning [3].
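The source-snippet idea from the highlights above can be sketched in a few lines. This is a minimal illustration and not the implementation used in the post: the prompt wording, the `ANSWER:`/`SOURCE:` reply format, the function names, and the sample policy text are all assumptions invented for the example, and the model call itself is left out so that the check can be run on any reply string.

```python
import re

# Hypothetical policy text, invented for illustration (not Etsy's actual policy).
SAMPLE_POLICY = (
    "Meals are reimbursed up to $50 per day. "
    "Flights must be booked in economy class."
)


def build_prompt(policy_text: str, question: str) -> str:
    """Assemble a prompt that asks for an answer plus a verbatim supporting quote."""
    return (
        "Answer the question using only the policy text below.\n"
        "Reply in exactly this format:\n"
        "ANSWER: <your answer>\n"
        "SOURCE: <a verbatim quote from the policy that supports the answer>\n\n"
        f"POLICY:\n{policy_text}\n\n"
        f"QUESTION: {question}\n"
    )


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_response(policy_text: str, response: str) -> dict:
    """Parse a reply and flag it when the quoted snippet is absent from the policy."""
    answer = re.search(r"ANSWER:\s*(.*?)\s*(?=SOURCE:|\Z)", response, re.S)
    source = re.search(r"SOURCE:\s*(.*)", response, re.S)
    # Strip surrounding quotation marks the model may have added around the quote.
    snippet = _normalize(source.group(1)).strip("\"'“”") if source else ""
    grounded = bool(snippet) and snippet in _normalize(policy_text)
    return {
        "answer": answer.group(1).strip() if answer else "",
        "snippet": snippet,
        "flagged": not grounded,  # True means: treat as a potential hallucination
    }


if __name__ == "__main__":
    # Two hand-written replies standing in for model output.
    grounded_reply = (
        'ANSWER: Yes, up to $50 per day.\n'
        'SOURCE: "Meals are reimbursed up to $50 per day."'
    )
    invented_reply = (
        "ANSWER: Yes, business class is allowed.\n"
        "SOURCE: Business class is permitted for flights over 6 hours."
    )
    print(check_response(SAMPLE_POLICY, grounded_reply)["flagged"])
    print(check_response(SAMPLE_POLICY, invented_reply)["flagged"])
```

A reply whose quoted snippet cannot be found in the source document is flagged for review. The check is deliberately conservative: a matching quote does not prove the answer is correct, it only shows the model pointed at real text, which is why the highlights describe the technique as a way to flag potential hallucinations and not as a guarantee of truthfulness.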

Article Summaries:

Etsy tested prompt‑engineering techniques to improve the reliability of large‑language‑model (LLM) answers for employee onboarding. The pilot focused on Travel & Entertainment (T&E) policy questions, a domain with clear rules, and relied on prompting rather than costly fine‑tuning on internal documents. Researchers found that instructing the LLM to cite specific source snippets is an effective way to flag potential hallucinations, which makes answers easier to trust. The study compares traditional fine‑tuning with prompt‑based tuning, concluding that carefully crafted prompts can yield reliable, policy‑accurate responses for new hires without altering model weights. The findings suggest a scalable, low‑cost approach for AI‑assisted onboarding at Etsy.
Etsy tested prompt engineering to power an AI‑assisted onboarding assistant that answers questions about its Travel & Entertainment (T&E) policy. Rather than fine‑tune a large language model (LLM) on internal documents, the team explored whether carefully crafted prompts could elicit truthful, policy‑accurate responses. The pilot focused on a narrow, rule‑based domain where employees frequently seek travel guidance. Results showed that asking the LLM to cite specific source snippets effectively flagged hallucinations and improved answer reliability. The study suggests that prompt‑based tuning can be a cost‑effective alternative to fine‑tuning for delivering trustworthy policy information to new hires.
Etsy’s latest study evaluates prompt engineering as a cost‑effective alternative to fine‑tuning for AI‑assisted onboarding. Focusing on the Travel & Entertainment (T&E) policy domain, the team tested whether carefully crafted prompts could elicit truthful, policy‑accurate answers from large language models (LLMs) without retraining the model. Results show that asking the LLM to cite specific source snippets effectively flags hallucinations, improving answer reliability. While fine‑tuning remains robust, the pilot demonstrates that prompt‑based tuning can produce sufficiently accurate responses for internal policy queries, offering a scalable solution for onboarding support.
Etsy tested large‑language‑model (LLM) prompt engineering to support new‑hire onboarding in the Travel & Entertainment (T&E) policy domain. Rather than fine‑tuning a model on internal documents, the team explored whether carefully crafted prompts could elicit truthful, policy‑accurate answers. The pilot asked the model to identify the source snippets behind its responses, a technique that helped flag potential hallucinations. Results suggested that explicit instructions in prompts reduce errors and provide a cost‑effective alternative to fine‑tuning, while still delivering reliable answers for employees’ T&E questions. The study highlights prompt engineering as a viable strategy for deploying LLMs in corporate knowledge‑base applications.
Etsy tested prompt‑engineering techniques to power an AI assistant that answers new‑hire questions about its Travel & Entertainment (T&E) policy. Rather than fine‑tuning a large language model (LLM) on internal documents, the company explored whether carefully crafted prompts could elicit accurate, policy‑compliant responses. In a pilot, the team added simple instructions that asked the LLM to identify source snippets for each answer. This approach helped flag potential hallucinations and improved answer reliability. The study suggests that prompt‑based tuning can be a cost‑effective alternative to fine‑tuning for domain‑specific question answering in corporate onboarding.

Sources: