The Litmus Lab.
Testing AI Under Constraint.

Model Scorecard.

Same question. Same settings. Different day. Different answer. A drift is when an AI model changes its mind, unprompted, unprovoked. This scorecard tracks how often each model does it.

All models run under identical conditions: fixed parameters, temperature 0 (no sampling randomness), and yes-or-no answers only. Models are ranked by consistency.
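As a rough illustration of how a drift might be counted, the sketch below compares each answer with the one before it for the same question. The function names and the sample history are hypothetical, and the page does not say whether a drift is measured against the previous response or against a fixed baseline; the consecutive comparison is an assumption.

```python
# Hypothetical sketch: count drifts as flips between consecutive answers.
# Assumption: each answer is compared with the previous answer to the
# same question; the scorecard does not confirm this.

def count_drifts(answers):
    """Return how many times the answer changed (YES->NO or NO->YES)."""
    return sum(1 for prev, cur in zip(answers, answers[1:]) if prev != cur)


def questions_drifted(histories):
    """Return how many questions flipped at least once.

    `histories` maps a question ID to its list of answers over time.
    """
    return sum(1 for answers in histories.values() if count_drifts(answers) > 0)


# Made-up example data, not taken from the scorecard.
sample = {
    "Q1": ["YES", "YES", "NO", "NO", "YES"],  # two flips
    "Q2": ["NO", "NO", "NO", "NO", "NO"],     # stable
}

print(count_drifts(sample["Q1"]))
print(questions_drifted(sample))
```

Under this reading, a question that flips and then flips back counts as two drifts but only one drifted question.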

5 models (run daily at temp 0)
426 questions
82,788 responses (1 run = 5 responses)
5,243 drifts
| | Combined | Claude | GPT | Gemini | DeepSeek | Grok |
| --- | --- | --- | --- | --- | --- | --- |
| Version | 5 models | opus-4-6 | gpt-5.4 | 2.5-pro | deepseek-v3 | grok-4.2 |
| Responses | 82,788 | 16,639 | 16,639 | 16,639 | 16,639 | 16,639 |
| Total Drifts | 5,243 | 264 | 1,023 | 1,564 | 617 | 1,775 |
| Questions Drifted | - | 78 | 146 | 134 | 75 | 174 |
| Drift Rate | 6.3% | 1.6% | 6.1% | 9.4% | 3.7% | 10.7% |
| Drift Per Day | 163.8 | 8.2 | 32.0 | 48.9 | 19.3 | 55.5 |
| Top Drifter | Q68 | Q68 | Q33 | Q57 | Q33 | Q70 |
| Stability Rank | - | #1 | #3 | #4 | #2 | #5 |
What each row means
Run: One question asked once to each of the five models.
Responses: Number of times a model answered YES or NO. 1 run = 5 responses.
Total Drifts: How many times a model changed its answer (YES→NO or NO→YES).
Questions Drifted: How many unique questions a model has flipped on at least once.
Drift Rate: Drifts as a percentage of total responses.
Drift Per Day: Average number of answer changes per day.
Top Drifter: The question this model has flipped on the most.
Stability Rank: Models ranked #1 (most stable) to #5 (least stable) by total drifts.
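The Drift Rate and Stability Rank definitions can be checked against the scorecard's own per-model figures. The sketch below is illustrative only: the variable and function names are hypothetical, while the drift and response counts are copied from the table.

```python
# Per-model figures copied from the scorecard table.
RESPONSES_PER_MODEL = 16_639
TOTAL_DRIFTS = {
    "Claude": 264,
    "GPT": 1_023,
    "Gemini": 1_564,
    "DeepSeek": 617,
    "Grok": 1_775,
}


def drift_rate(drifts, responses):
    """Drifts as a percentage of total responses, to one decimal place."""
    return round(100 * drifts / responses, 1)


def stability_ranks(drifts_by_model):
    """Rank models from 1 (fewest total drifts) upward."""
    ordered = sorted(drifts_by_model, key=drifts_by_model.get)
    return {model: rank for rank, model in enumerate(ordered, start=1)}


rates = {m: drift_rate(d, RESPONSES_PER_MODEL) for m, d in TOTAL_DRIFTS.items()}
ranks = stability_ranks(TOTAL_DRIFTS)

for model in TOTAL_DRIFTS:
    print(f"{model}: {rates[model]}% drift rate, rank #{ranks[model]}")
```

Run as written, this reproduces the Drift Rate and Stability Rank rows of the table for the five individual models.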