BESPOKE is built by 30 diverse annotators over three weeks, yielding 2,870 user‑history sessions that capture conversations and web searches from everyday routines.
Provides 150 user‑annotated queries, each paired with a gold information need that explicitly specifies the query's personalized requirements.
Provides response–judgement pairs with human‑annotated scores and explanatory feedback, making explicit why each response is satisfactory or unsatisfactory.
Includes an evaluation framework that aligns closely with human judgment, assesses both factuality and personalization, and delivers scores and diagnostic feedback to supervise personalized system development.
To collect sufficient user histories and detailed feedback, we employ a long-term, deeply engaged human annotation process. Over three weeks, annotators freely engage in diverse activities such as information seeking and chatting, accumulating their own chat and web-search histories. They then issue queries grounded in the information needs arising from these histories, and provide preference scores and feedback on sampled responses generated for those queries by search-augmented LLMs.
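To make the collected data concrete, the sketch below shows one plausible shape for a BESPOKE record, assuming a simple Python representation; every class and field name here is illustrative, not the benchmark's released schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of a BESPOKE-style record. All names below are
# illustrative assumptions, not the benchmark's released schema.

@dataclass
class SessionTurn:
    kind: str        # "chat" or "web_search"
    content: str     # utterance or issued search query
    timestamp: str   # when the activity occurred during the 3-week period

@dataclass
class JudgedResponse:
    response: str    # output sampled from a search-augmented LLM
    score: int       # human preference score for the response
    feedback: str    # explanation of why it is (un)satisfactory

@dataclass
class BespokeExample:
    history: list[SessionTurn]        # accumulated chat and web-search sessions
    query: str                        # query grounded in the user's history
    information_need: str             # gold, explicitly personalized requirement
    judgements: list[JudgedResponse]  # response-judgement pairs with feedback
```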
BESPOKE is a challenging benchmark and includes a human-aligned evaluator that provides diagnostic, actionable feedback.
Challenging benchmark
Experimental results show current models struggle with personalization.
Human-aligned evaluator
Our evaluator shows strong agreement with human judgments.
Diagnostic, actionable feedback
Provides specific comments that pinpoint strengths and areas for improvement, not just scalar scores.
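As a rough illustration of how such output could supervise system development, the snippet below sketches the kind of interface this implies; `evaluate` is a toy stand-in with hypothetical names, not the actual BESPOKE evaluator, which is human-aligned rather than a string-matching heuristic.

```python
from dataclasses import dataclass

@dataclass
class Judgement:
    factuality: float        # grounding of the response in evidence
    personalization: float   # fit to the user's gold information need
    feedback: str            # diagnostic comments, not just a scalar score

def evaluate(response: str, information_need: str) -> Judgement:
    """Toy stand-in for the human-aligned evaluator: it only illustrates
    the input/output contract (scores plus actionable feedback)."""
    covered = information_need.lower() in response.lower()
    return Judgement(
        factuality=1.0,  # placeholder; the real evaluator also checks facts
        personalization=1.0 if covered else 0.0,
        feedback="Addresses the stated need." if covered
                 else "Ignores the user's personalized requirement.",
    )

# Feedback-driven iteration: revise the system when scores fall short.
j = evaluate("Budget itineraries for solo travel.",
             "prefers solo travel on a tight budget")
if j.personalization < 0.5:
    print("Revise using feedback:", j.feedback)
```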