Data-Driven Simulators Outperform Prompt-Based Alternatives, Google Research Apparel Study Shows

Google Research has exposed persistent flaws in large language model-based user simulators with ConvApparel, a dataset of more than 4,000 human-AI conversations in the apparel shopping domain. The work pairs the dataset with a three-pillar evaluation framework to measure and narrow the realism gap in AI testing, and finds that data-driven simulators outperform prompt-based approaches across key behavioural metrics.

Long Story, Cut Short
  • ConvApparel comprises more than 4,000 human-AI conversations capturing user behaviour across helpful and unhelpful agent conditions in apparel shopping.
  • A three-pillar framework covering population-level alignment, human-likeness scoring, and counterfactual validation exposes persistent realism gaps in LLM-based user simulators.
  • Data-driven simulators using in-context learning and supervised fine-tuning outperform prompt-based systems but still fall short of genuine human behavioural realism.
When AI systems are trained to simulate human users, the gap between synthetic behaviour and genuine human complexity remains stubbornly difficult to close.

Persistent flaws in large language model-based user simulators have been identified and quantified by Google Research, which has introduced ConvApparel, a dataset of more than 4,000 human-AI conversations in the apparel shopping domain. The dataset is designed to establish a behavioural baseline for conversational recommender systems, and an accompanying three-pillar evaluation framework quantifies the realism gap between simulated and genuine human behaviour while supporting the training of more robust conversational agents. Data-driven simulators outperform prompt-based systems, but none fully closes the gap.

  • The dataset spans nearly 15,000 conversational turns, collected via a dual-agent protocol routing participants to either a helpful or an intentionally unhelpful AI recommender.
  • Three evaluation pillars, covering population-level statistical alignment, human-likeness scoring, and counterfactual validation, together assess how closely simulated user behaviour matches genuine human interaction.
  • Supervised fine-tuning and in-context learning simulators consistently outperformed prompt-based baselines on statistical alignment but were still identified as synthetic by a trained discriminator in nearly all cases.
  • The research is presented in the paper 'ConvApparel: Measuring and Bridging the Realism Gap in User Simulators', authored by researchers at Google Research.

INSIDE THE RESEARCH: ConvApparel is a dataset of more than 4,000 human-AI multi-turn conversations, built to establish a baseline for human behaviour in conversational recommender systems. Developed by Google Research scientists Ofer Meshi and Sally Goldman, the dataset was collected using a dual-agent protocol in the apparel shopping domain and is paired with an evaluation framework designed to assess the fidelity of LLM-based user simulators through three validation pillars.

  • Participants were randomly routed to one of two AI recommenders: a helpful agent designed for efficient, search-capable assistance, or an intentionally unhelpful agent that misinterpreted keywords and used degraded search retrieval.
  • Fine-grained annotations captured participants' internal states, including satisfaction, frustration, and purchase likelihood, at every conversational turn, providing ground-truth data on first-person user experience (a minimal schema sketch follows this list).
  • Three simulator types were evaluated: a prompt-based simulator relying on high-level behavioural instructions, an in-context learning simulator using retrieval-augmented generation to supply semantically similar human conversation examples at each turn, and a supervised fine-tuning simulator trained directly on ConvApparel transcripts using Gemini 2.5 Flash (an illustrative retrieval sketch also follows this list).
  • Each simulator generated 600 conversations, 300 with the helpful agent and 300 with the unhelpful agent, enabling direct comparison against the human baseline across both conditions.
  • Ethical integrity was maintained through full participant transparency, informed consent, and compensation above the living wage in each participant's country of employment.
  • The research was co-authored by Krisztian Balog, Avi Caciularu, Guy Tennenholtz, Jihwan Jeong, Amir Globerson, and Craig Boutilier.
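For readers who want a concrete picture of what a turn-level record might look like, the sketch below defines a minimal annotation schema in Python. The field names and rating scales are illustrative assumptions, not the released ConvApparel schema.

```python
# Minimal, illustrative schema for one annotated conversational turn.
# Field names and 1-5 scales (satisfaction, frustration, purchase_likelihood)
# are assumptions for illustration; the released dataset schema may differ.
from dataclasses import dataclass
from typing import Literal


@dataclass
class AnnotatedTurn:
    conversation_id: str
    turn_index: int
    speaker: Literal["user", "agent"]
    text: str
    # First-person internal-state annotations, captured at every turn.
    satisfaction: int          # e.g. 1 (very unsatisfied) .. 5 (very satisfied)
    frustration: int           # e.g. 1 (none) .. 5 (extreme)
    purchase_likelihood: int   # e.g. 1 (very unlikely) .. 5 (very likely)


@dataclass
class Conversation:
    conversation_id: str
    agent_condition: Literal["helpful", "unhelpful"]
    turns: list[AnnotatedTurn]
```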
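The in-context learning simulator's retrieval step can likewise be sketched in a few lines: embed the current dialogue context, pull the most similar human exchanges from the dataset, and assemble a few-shot prompt for the simulator LLM. The embedding step, similarity metric, and prompt wording below are assumptions; the paper's exact pipeline may differ.

```python
# Hedged sketch of retrieval-augmented example selection for the ICL simulator.
# Embeddings are assumed to be computed elsewhere (any sentence encoder works).
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T


def build_icl_prompt(context: str,
                     context_emb: np.ndarray,
                     human_examples: list[str],
                     example_embs: np.ndarray,
                     k: int = 3) -> str:
    """Select the k human exchanges most similar to the current dialogue
    context and assemble a few-shot prompt for the user-simulator LLM."""
    scores = cosine_sim(context_emb[None, :], example_embs)[0]
    top_k = np.argsort(scores)[::-1][:k]
    shots = "\n\n".join(f"Example human exchange:\n{human_examples[i]}" for i in top_k)
    return (
        "You are simulating a human apparel shopper. Respond as the user.\n\n"
        f"{shots}\n\n"
        f"Current conversation:\n{context}\nUser:"
    )
```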

THE FINDINGS: Three evaluation pillars applied to the ConvApparel dataset reveal a persistent realism gap across all simulator types, with data-driven approaches outperforming prompt-based systems on statistical alignment while still falling short of genuine human behavioural fidelity. A trained discriminator confidently identified nearly all simulated conversations as synthetic, regardless of the simulator type used.

  • Population-level statistical tests show that in-context learning and supervised fine-tuning simulators closely mirror human behavioural distributions in verbosity and recommendation acceptance rates, outperforming the prompt-based baseline across both measures (a sketch of such distribution tests follows this list).
  • The human-likeness scoring discriminator identified subtle but consistent synthetic artefacts in all simulator outputs, including flawless grammar and overly predictable turn-taking patterns, which distinguished them from genuine human conversations (a minimal discriminator sketch also follows this list).
  • Counterfactual validation revealed that the prompt-based simulator failed to adapt when exposed to the unhelpful agent, remaining unnaturally polite and patient in conditions where human users displayed frustration and rejection.
  • In-context learning and supervised fine-tuning simulators demonstrated strong out-of-distribution generalisation, realistically shifting behaviour toward higher frustration and increased rejection rates when interacting with the unhelpful agent despite having no prior exposure to it.
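As a rough illustration of the first pillar, the sketch below compares simulated and human distributions of per-turn verbosity and recommendation acceptance. The specific tests are an assumption, not the paper's stated methodology; a two-sample Kolmogorov-Smirnov test and a chi-squared test are one plausible choice.

```python
# Hedged sketch of population-level alignment checks between human and
# simulated behaviour. Test choices are illustrative assumptions.
from scipy import stats


def verbosity_alignment(human_word_counts, sim_word_counts):
    """Two-sample Kolmogorov-Smirnov test on per-turn word counts."""
    return stats.ks_2samp(human_word_counts, sim_word_counts)


def acceptance_alignment(human_accepted, human_total, sim_accepted, sim_total):
    """Chi-squared test on recommendation acceptance counts."""
    table = [[human_accepted, human_total - human_accepted],
             [sim_accepted, sim_total - sim_accepted]]
    return stats.chi2_contingency(table)
```

A non-significant p-value on either test suggests the simulator's distribution is statistically indistinguishable from the human baseline on that metric; a significant one flags a population-level deviation.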
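The second pillar can be approximated with an ordinary text classifier trained to separate human from simulated transcripts. The TF-IDF-plus-logistic-regression discriminator below is a simple stand-in, not the paper's model, but it conveys how detectability is measured: test accuracy well above chance means the simulator still leaves synthetic fingerprints.

```python
# Hedged sketch of a human-vs-simulated discriminator; a simple stand-in for
# the trained discriminator described in the study.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def train_discriminator(human_transcripts, simulated_transcripts):
    texts = human_transcripts + simulated_transcripts
    labels = [0] * len(human_transcripts) + [1] * len(simulated_transcripts)
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=0, stratify=labels)
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    # Accuracy well above 0.5 indicates the simulator remains detectable.
    return clf, clf.score(X_test, y_test)
```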

THE BIGGER PICTURE: The ConvApparel findings point to a fundamental risk in current conversational AI development: agents trained exclusively against unrealistic simulators may be optimised for synthetic behaviour patterns that do not reflect genuine human interaction, undermining real-world performance. Left unaddressed, the gap between synthetic and genuine human behaviour carries direct consequences for how next-generation conversational agents are built, tested, and ultimately deployed.

  • Prompt-based simulators exhibit systematic behavioural deviations, including excessive patience, encyclopaedic domain knowledge, and inconsistent personas, that diverge significantly from the range of human responses captured in the ConvApparel dataset.
  • A simulator that overfits to its training data cannot reliably test new or experimental conversational agent policies, limiting its utility precisely when it is needed most.
  • Relying on simulators that cannot adapt to novel agent behaviour risks producing conversational agents that perform well in testing but fail to meet the expectations of real users in deployment.
  • While data-driven simulators demonstrate superior adaptability, future work will focus on using high-fidelity simulators to train conversational recommender agents from scratch and measuring the resulting real-world performance against human benchmarks.
Date posted: 1 May 2026 | Last modified: 1 May 2026