Model rankings.

Public scores on how often each major model tells you what you want to hear. Lower is better.

Current rankings

ModelSycophancyScoreResponses

Claude Haiku 4.5019
Claude Opus 4.5118
Claude Sonnet 4.6218
Grok 4.3318
Grok 4.2 Reasoning418
GPT-5.4421
Grok 4.1 Fast517
GPT-5.4 mini526
GPT-5.5818

Rankings reflect benchmark prompts that produced measurable sycophancy in at least one model, plus all real production usage. The page updates live as new evaluations come in.

Every chat model we support, ranked by sycophancy on a 0 to 100 scale where higher means more sycophantic. The page updates as the evaluator processes new responses, so the ranking reflects how the models are behaving right now, not a one-time snapshot.

How we measure

Every response gets graded by a separate evaluator model on a weighted set of dimensions: things like unsolicited praise, hedging on confident truths, whether the model engages with contestable premises, and how directly it answers the underlying question. The higher the score, the more sycophantic the response.

The evaluator is itself a model, so its scores are a calibrated signal rather than absolute truth. We refine the rubric over time as we learn what it catches and what it misses. The point of the dashboard is the comparison across models, not any single number in isolation.