Google Research has identified and quantified persistent flaws in large language model-based user simulators with the introduction of ConvApparel, a dataset of more than 4,000 human-AI conversations in the apparel shopping domain. The dataset is designed to establish a behavioural baseline for conversational recommender systems, and an accompanying three-pillar evaluation framework quantifies the realism gap between simulated and genuine human behaviour while supporting the training of more robust conversational agents; data-driven simulators outperform prompt-based systems, but none fully closes the gap.
- The dataset spans nearly 15,000 conversational turns, collected via a dual-agent protocol routing participants to either a helpful or an intentionally unhelpful AI recommender.
- Three evaluation pillars, covering population-level statistical alignment, human-likeness scoring, and counterfactual validation, together assess how closely simulated user behaviour matches genuine human interaction.
- Supervised fine-tuning and in-context learning simulators consistently outperformed prompt-based baselines on statistical alignment but were still identified as synthetic by a trained discriminator in nearly all cases.
- The research is presented in the paper 'ConvApparel: Measuring and Bridging the Realism Gap in User Simulators', authored by researchers at Google Research.
INSIDE THE RESEARCH: ConvApparel is a dataset of more than 4,000 human-AI multi-turn conversations, built to establish a baseline for human behaviour in conversational recommender systems. Developed by Google Research scientists Ofer Meshi and Sally Goldman, the dataset was collected using a dual-agent protocol in the apparel shopping domain and is paired with an evaluation framework designed to assess the fidelity of LLM-based user simulators through three validation pillars.
- Participants were randomly routed to one of two AI recommenders: a helpful agent designed for efficient, search-capable assistance, or an intentionally unhelpful agent that misinterpreted keywords and used degraded search retrieval.
- Fine-grained annotations captured participants' internal states, including satisfaction, frustration, and purchase likelihood, at every conversational turn, providing ground-truth data on first-person user experience (a schematic per-turn record is sketched after this list).
- Three simulator types were evaluated: a prompt-based simulator relying on high-level behavioural instructions, an in-context learning simulator using retrieval-augmented generation to supply semantically similar human conversation examples at each turn (a minimal sketch of this retrieval loop also follows the list), and a supervised fine-tuning simulator trained directly on ConvApparel transcripts using Gemini 2.5 Flash.
- Each simulator generated 600 conversations, 300 with the helpful agent and 300 with the unhelpful agent, enabling direct comparison against the human baseline across both conditions.
- Ethical integrity was maintained through full participant transparency, informed consent, and compensation above the living wage in each participant's country of employment.
- The research was co-authored by Krisztian Balog, Avi Caciularu, Guy Tennenholtz, Jihwan Jeong, Amir Globerson, and Craig Boutilier.
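To make the annotation design above concrete, here is a hypothetical sketch of what a per-turn record for a ConvApparel-style dataset might look like. The field names, scales, and types are assumptions for illustration, not the released schema.

```python
# Hypothetical per-turn annotation record; field names and scales are assumed.
from dataclasses import dataclass

@dataclass
class TurnAnnotation:
    turn_index: int                # position of the turn within the conversation
    user_utterance: str            # what the participant wrote at this turn
    agent_utterance: str           # the recommender's reply
    satisfaction: int              # self-reported satisfaction, e.g. on a 1-5 scale
    frustration: int               # self-reported frustration, e.g. on a 1-5 scale
    purchase_likelihood: float     # self-reported likelihood of purchase, e.g. 0.0-1.0
    accepted_recommendation: bool  # whether the user accepted the item offered

@dataclass
class Conversation:
    conversation_id: str
    agent_condition: str           # "helpful" or "unhelpful"
    turns: list[TurnAnnotation]
```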
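The in-context learning simulator described above conditions each simulated user turn on retrieved examples of real human behaviour. The following is a minimal sketch of that retrieval loop under stated assumptions: `embed`, `generate`, and `HUMAN_SNIPPETS` are placeholders, not the paper's implementation or any specific API.

```python
# Sketch of a retrieval-augmented (in-context learning) user simulator:
# at each turn, fetch the most similar human snippets and prepend them to the prompt.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder text-embedding call; swap in a real embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(128)

def generate(prompt: str) -> str:
    """Placeholder LLM call; swap in a real chat-completion API."""
    return "Do you have this jacket in a darker colour?"

HUMAN_SNIPPETS = [
    "User: That's not what I asked for. I wanted a waterproof coat.",
    "User: Great, the second option looks perfect, I'll take it.",
    "User: I'm getting frustrated, none of these match my size.",
]
SNIPPET_VECS = np.stack([embed(s) for s in HUMAN_SNIPPETS])

def simulate_user_turn(dialogue_history: str, k: int = 2) -> str:
    # Retrieve the k human snippets most similar to the current dialogue state.
    query = embed(dialogue_history)
    sims = SNIPPET_VECS @ query / (
        np.linalg.norm(SNIPPET_VECS, axis=1) * np.linalg.norm(query) + 1e-9
    )
    examples = [HUMAN_SNIPPETS[i] for i in np.argsort(sims)[-k:]]
    prompt = (
        "You are simulating a human shopper. Respond like the examples.\n"
        "Examples of real human turns:\n" + "\n".join(examples) +
        "\n\nDialogue so far:\n" + dialogue_history + "\nUser:"
    )
    return generate(prompt)
```

The supervised fine-tuning simulator differs in that the human transcripts are baked into the model's weights rather than supplied in the prompt at inference time.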
THE FINDINGS: Three evaluation pillars applied to the ConvApparel dataset reveal a persistent realism gap across all simulator types, with data-driven approaches outperforming prompt-based systems on statistical alignment while still falling short of genuine human behavioural fidelity. A trained discriminator confidently identified nearly all simulated conversations as synthetic, regardless of the simulator type used.
- Population-level statistical tests show that in-context learning and supervised fine-tuning simulators closely mirror human behavioural distributions in verbosity and recommendation acceptance rates, outperforming the prompt-based baseline across both measures (an illustrative alignment test is sketched after this list).
- The human-likeness scoring discriminator identified subtle but consistent synthetic artefacts in all simulator outputs, including flawless grammar and overly predictable turn-taking patterns, which distinguished them from genuine human conversations (a discriminator sketch also follows the list).
- Counterfactual validation revealed that the prompt-based simulator failed to adapt when exposed to the unhelpful agent, remaining unnaturally polite and patient in conditions where human users displayed frustration and rejection.
- In-context learning and supervised fine-tuning simulators demonstrated strong out-of-distribution generalisation, realistically shifting behaviour toward higher frustration and increased rejection rates when interacting with the unhelpful agent despite having no prior exposure to it.
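As an illustration of the population-level alignment pillar, the sketch below compares per-turn verbosity and recommendation acceptance rates between human and simulated runs using standard two-sample tests. The exact statistics in the paper may differ; this is only a plausible instantiation.

```python
# Illustrative population-level alignment check: distributional test on verbosity
# and a two-proportion z-test on recommendation acceptance rates.
import numpy as np
from scipy import stats

def alignment_report(human_verbosity, sim_verbosity, human_accepted, sim_accepted):
    # Distributional match on words-per-user-turn: two-sample Kolmogorov-Smirnov test.
    ks_stat, ks_p = stats.ks_2samp(human_verbosity, sim_verbosity)

    # Match on recommendation acceptance rate: two-proportion z-test.
    p1, p2 = np.mean(human_accepted), np.mean(sim_accepted)
    n1, n2 = len(human_accepted), len(sim_accepted)
    pooled = (np.sum(human_accepted) + np.sum(sim_accepted)) / (n1 + n2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    z_p = 2 * stats.norm.sf(abs(z))

    return {
        "verbosity_ks_stat": ks_stat, "verbosity_p": ks_p,
        "acceptance_human": p1, "acceptance_sim": p2, "acceptance_p": z_p,
    }
```

High p-values here would indicate that the simulator's population-level behaviour is statistically indistinguishable from the human baseline on these two measures.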
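The human-likeness pillar relies on a discriminator trained to separate human transcripts from simulated ones. A simple bag-of-words logistic regression stands in below for whatever model the paper actually trains; it is a rough sketch, not the authors' method.

```python
# Rough sketch of a human-likeness discriminator: a binary classifier over
# conversation transcripts (1 = human, 0 = simulated).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_discriminator(human_transcripts, simulated_transcripts):
    texts = list(human_transcripts) + list(simulated_transcripts)
    labels = [1] * len(human_transcripts) + [0] * len(simulated_transcripts)
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(texts, labels)
    return clf
```

A held-out simulated conversation scored confidently near 0 is flagged as synthetic; scores hovering around 0.5 would mean the discriminator cannot tell the two apart, which is the outcome a fully realistic simulator would produce.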
THE BIGGER PICTURE: The ConvApparel findings point to a fundamental risk in current conversational AI development: agents trained exclusively against unrealistic simulators may be optimised for synthetic behaviour patterns that do not reflect genuine human interaction, undermining real-world performance. Left unaddressed, the gap between synthetic and genuine human behaviour carries direct consequences for how next-generation conversational agents are built, tested, and ultimately deployed.
- Prompt-based simulators exhibit systematic behavioural deviations, including excessive patience, encyclopaedic domain knowledge, and inconsistent personas, all of which fall outside the range of human responses captured in the ConvApparel dataset.
- A simulator that overfits to its training data cannot reliably test new or experimental conversational agent policies, limiting its utility precisely when it is needed most.
- Relying on simulators that cannot adapt to novel agent behaviour risks producing conversational agents that perform well in testing but fail to meet the expectations of real users in deployment.
- While data-driven simulators demonstrate superior adaptability, future work is directed at using high-fidelity simulators to train conversational recommender agents from scratch and measuring the resulting real-world performance against human benchmarks.