The Litmus Lab.
Testing AI Under Constraint.

Model Scorecard.

Same question. Same settings. Different day. Different answer. A drift is when an AI model changes its mind, unprompted, unprovoked. This scorecard tracks how often each model does it.

All models run under identical conditions: fixed parameters, no randomness, yes-or-no only. Ranked by consistency.
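A minimal sketch of how a drift could be counted, assuming each question's answers are stored as a day-ordered list of "YES"/"NO" strings and that every change between consecutive answers counts as one drift. The function name `count_drifts` and the sample history are hypothetical, not taken from the Lab's own tooling.

```python
from typing import Sequence


def count_drifts(answers: Sequence[str]) -> int:
    """Count YES<->NO flips between consecutive answers to one question."""
    # Pair each answer with the next one and count the pairs that differ.
    return sum(1 for prev, curr in zip(answers, answers[1:]) if prev != curr)


# Hypothetical five-day history for one model on one question.
history = ["YES", "YES", "NO", "NO", "YES"]
print(count_drifts(history))  # 2 (YES->NO on day 3, NO->YES on day 5)
```

Under this assumption, a model that never changes its answer scores 0 on a question no matter how many days it is asked.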

5 Models (daily at temp 0)
426 Questions
89,178 Responses (1 run = 5 responses)
5,564 Drifts
| | Combined | Claude | GPT | Gemini | DeepSeek | Grok |
|---|---|---|---|---|---|---|
| Version | 5 models | opus-4-6 | gpt-5.4 | 2.5-pro | deepseek-v3 | grok-4.2 |
| Responses | 89,178 | 17,917 | 17,917 | 17,917 | 17,917 | 17,917 |
| Total Drifts | 5,564 | 275 | 1,079 | 1,610 | 739 | 1,861 |
| Questions Drifted | | 80 | 149 | 134 | 148 | 177 |
| Drift Rate | 6.2% | 1.5% | 6.0% | 9.0% | 4.1% | 10.4% |
| Drift Per Day | 168.6 | 8.3 | 32.7 | 48.8 | 22.4 | 56.4 |
| Top Drifter | Q68 | Q68 | Q33 | Q57 | Q33 | Q70 |
| Stability Rank | | #1 | #3 | #4 | #2 | #5 |
What each row means
Run
One question asked once to each of the five models, producing five responses
Responses
Number of times a model answered YES or NO. 1 run = 5 responses.
Total Drifts
How many times a model changed its answer (YES→NO or NO→YES)
Questions Drifted
How many unique questions a model has flipped on at least once
Drift Rate
Drifts as a percentage of total responses
Drift Per Day
Average number of answer changes per day
Top Drifter
The question this model has flipped on the most
Stability Rank
Models ranked #1 (most stable) to #5 (least stable) by total drifts
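The definitions above can be checked against the scorecard's own numbers. This is a rough sketch, not the Lab's actual pipeline: the per-model figures are copied from the table, while the function and variable names are hypothetical. The number of days tracked is not stated on the scorecard, so `drift_per_day` takes it as a parameter instead of guessing.

```python
# Per-model figures copied from the scorecard table.
RESPONSES_PER_MODEL = 17_917
TOTAL_DRIFTS = {
    "Claude": 275,
    "GPT": 1_079,
    "Gemini": 1_610,
    "DeepSeek": 739,
    "Grok": 1_861,
}


def drift_rate(drifts: int, responses: int) -> float:
    """Drifts as a percentage of total responses."""
    return 100 * drifts / responses


def drift_per_day(drifts: int, days: int) -> float:
    """Average answer changes per day; `days` must be supplied by the caller."""
    return drifts / days


# Drift Rate row, rounded to one decimal place as on the scorecard.
rates = {
    model: round(drift_rate(drifts, RESPONSES_PER_MODEL), 1)
    for model, drifts in TOTAL_DRIFTS.items()
}

# Stability Rank row: fewest total drifts first.
ranking = sorted(TOTAL_DRIFTS, key=TOTAL_DRIFTS.get)

for position, model in enumerate(ranking, start=1):
    print(f"#{position} {model}: {TOTAL_DRIFTS[model]} drifts, {rates[model]}%")
```

Because every model has the same number of responses here, ranking by total drifts and ranking by drift rate give the same order.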