Model rankings
Public scores on how often each major model tells you what you want to hear. Lower is better.
Current rankings
- Claude Haiku 4.5: 0.19
- Claude Opus 4.5: 1.18
- Claude Sonnet 4.6: 2.18
- Grok 4.3: 3.18
- Grok 4.2 Reasoning: 4.18
- GPT-5.4: 4.21
- Grok 4.1 Fast: 5.17
- GPT-5.4 mini: 5.26
- GPT-5.5: 8.18
Every chat model we support, ranked by sycophancy on a 0 to 100 scale where higher means more sycophantic. Scores draw on benchmark prompts that produced measurable sycophancy in at least one model, plus all real production usage. The page updates live as the evaluator processes new responses, so the ranking reflects how the models are behaving right now, not a one-time snapshot.
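The live, lower-is-better ranking can be sketched as a running aggregate: fold each newly graded response into a per-model average, then sort ascending. This is an illustrative sketch with hypothetical names, not the dashboard's actual implementation:

```python
from collections import defaultdict


class Leaderboard:
    """Running-average sycophancy scores per model; lower is better."""

    def __init__(self) -> None:
        self.totals = defaultdict(float)  # sum of scores seen per model
        self.counts = defaultdict(int)    # number of graded responses per model

    def record(self, model: str, score: float) -> None:
        """Fold one graded response (0-100 score) into the running average."""
        self.totals[model] += score
        self.counts[model] += 1

    def rankings(self) -> list[tuple[str, float]]:
        """All models sorted ascending by average score: least sycophantic first."""
        averages = {m: self.totals[m] / self.counts[m] for m in self.counts}
        return sorted(averages.items(), key=lambda kv: kv[1])
```

Because each `record` call touches only one model's running sum, the ranking stays current without re-grading old responses.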
How we measure
Every response gets graded by a separate evaluator model on a weighted set of dimensions: unsolicited praise, hedging on confident truths, whether the model pushes back on contestable premises, and how directly it answers the underlying question. The higher the score, the more sycophantic the response.
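The weighted grading step can be sketched as a weighted sum over per-dimension grades. The dimension names and weights below are hypothetical placeholders, not the rubric's actual values:

```python
# Hypothetical rubric weights (must sum to 1.0 so the overall score stays 0-100).
WEIGHTS = {
    "unsolicited_praise": 0.3,
    "hedging_on_confident_truths": 0.25,
    "accepts_contestable_premise": 0.25,
    "indirect_answer": 0.2,
}


def sycophancy_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension grades (each 0-100) into one 0-100 overall score."""
    return sum(WEIGHTS[d] * dimension_scores.get(d, 0.0) for d in WEIGHTS)
```

Keeping the weights normalized means the combined score lives on the same 0 to 100 scale as each individual dimension.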
The evaluator is itself a model, so its scores are a calibrated signal rather than absolute truth. We refine the rubric over time as we learn what it catches and what it misses. The point of the dashboard is the comparison across models, not any single number in isolation.
