BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback

Yonsei University

Recent search-augmented LLMs have moved beyond generic outputs by leveraging users' prior histories as user context to personalize their responses. Despite this advancement, systematic evaluation of personalization in these systems remains largely underexplored.
Therefore, we introduce BESPOKE, a realistic benchmark specifically designed for evaluating personalization in search-augmented LLMs.

Why BESPOKE?

Diverse Human Annotators

BESPOKE is built from the histories of 30 diverse annotators collected over three weeks, yielding 2,870 user-history sessions that capture conversations and web searches from everyday routines.

Gold Personalized Information Needs

Provides 150 user‑annotated queries with corresponding gold information needs that explicitly specify the personalized requirements for each query.

Human-Annotated Judgments

Provides response-judgment pairs with human-annotated scores and explanatory feedback, explicitly clarifying why responses are satisfactory or unsatisfactory.

Diagnostic Evaluation Framework

An evaluation framework that aligns closely with human judgment and assesses both factuality and personalization, delivering scores and diagnostic feedback to guide personalized system development.
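
As a rough sketch (the field names below are illustrative, not BESPOKE's released schema), the evaluator's output for a single response can be pictured as a pair of rubric scores plus a free-text diagnostic comment:

```python
from dataclasses import dataclass

@dataclass
class DiagnosticJudgment:
    """Illustrative shape of one evaluation result (hypothetical field names)."""
    factuality_score: int        # rubric score for factual correctness
    personalization_score: int   # rubric score for fit to the user's information need
    feedback: str                # diagnostic comment explaining the scores

# The feedback pinpoints what to fix, not just how good the response is.
judgment = DiagnosticJudgment(
    factuality_score=4,
    personalization_score=2,
    feedback=(
        "Claims are well supported by the retrieved pages, but the answer "
        "ignores the beginner-level background evident in the user's history."
    ),
)
```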

Constructing BESPOKE

BESPOKE Framework Overview

To collect sufficient user histories and detailed feedback, we employ a long-term, deeply engaged human annotation process. Over three weeks, annotators freely engage in diverse activities such as information seeking and chatting, accumulating their own chat and web-search histories. They then issue queries grounded in the information needs arising from these histories, and provide preference scores and feedback on responses sampled from search-augmented LLMs for those queries.
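
To make the resulting data concrete, a single benchmark instance can be sketched roughly as follows (a minimal illustration with hypothetical field names, not the released file format):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class JudgedResponse:
    """A sampled response with a human preference score and explanatory feedback."""
    response: str
    score: int
    feedback: str

@dataclass
class BespokeExample:
    """One instance: user history, a query, its gold information need,
    and human judgments on candidate responses (hypothetical field names)."""
    history_sessions: List[str]   # chat / web-search sessions accumulated over three weeks
    query: str                    # query grounded in the user's history
    gold_information_need: str    # annotator-written personalized requirement for the query
    judged_responses: List[JudgedResponse] = field(default_factory=list)
```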

🚀 Key Features of BESPOKE

BESPOKE is a challenging benchmark and includes a human-aligned evaluator that provides diagnostic, actionable feedback.

Challenging benchmark

Experimental results show that current search-augmented LLMs still struggle to personalize their responses.

Human-aligned evaluator

Our evaluator shows strong agreement with human judgments.

Diagnostic, actionable feedback

Provides specific comments that pinpoint strengths and areas for improvement, not just scalar scores.