Methodology
What This Is
The Litmus Lab is a public behavioral ledger of constrained machine responses over time.
Every day, the same set of questions is sent to five AI models. Each model must respond with one word: YES or NO. Nothing else. The results are published permanently.
This is not a truth engine. We do not score answers as correct or incorrect. We do not label any model as good or bad. We publish the record and let the data speak.
Question Design
Questions are not random. Every question in the record serves one of four purposes.
Expose bias. Controversial questions where the answer depends on who built the model. The splits reveal corporate fingerprints. When DeepSeek refuses to acknowledge Tiananmen and four other models say YES, that is not a glitch. It is a policy decision made visible.
Force drift. Reframed versions of the same question test whether models hold a position or just pattern match. If changing one word flips the answer, the original answer was never a conviction. It was a probability.
Reveal hypocrisy. These models were trained on humanity's output. When five models built by different companies on different data all agree 100% across dozens of runs, that is not AI opinion. It is what the training data already knows. The gap between that consensus and the public narrative is the signal.
Balance the record. Fair questions that give industries and institutions a chance to score positively. Without these, the data is an indictment, not an observatory. A credible record includes questions where YES is favorable.
Questions are submitted by the public or added by the project creator. No question is modified or removed once it enters the record. For detailed question criteria, see our submission guidelines.
What This Measures
This project measures willingness to assert, deny, or abstain under constraint. It does not measure truth. Willingness is related to truth-seeking, but the two are not the same thing.
Binary framing intentionally flattens complex questions. That is the point. We are testing what models do when complexity is disallowed. That is a stress test, not a limitation.
AI models are policy artifacts, not independent agents. They reflect the decisions of the companies that build them. This project documents how institutions express themselves through machines.
Because models answer from frozen training data rather than live information, their responses reflect a worldview locked in time. When training data is updated or RLHF is adjusted, answers may change. Those changes are captured as drifts in our daily record. This is what makes time series tracking valuable: it catches the moment a company's decisions alter what their model is willing to say.
The Models
Each question is sent to the flagship model from five companies:
- Claude: claude-opus-4-6 (changed from claude-sonnet-4-5-20250929 on Feb 13, 2026 at 17:00 UTC)
- GPT: gpt-5.4 (changed from gpt-5.2 on Mar 8, 2026 at 00:00 UTC)
- Gemini: gemini-2.5-pro
- DeepSeek: deepseek-chat (DeepSeek-V3.2)
- Grok: grok-4.2-beta (changed from grok-4-1-fast-reasoning on Mar 18, 2026 at 00:00 UTC)
Model versions are updated when companies release new flagships. All version changes are documented and dated above.
Knowledge Cutoffs
Each model's answers come from frozen training data. The knowledge cutoff is the last date included in that training. Anything after this date does not exist in the model's knowledge.
- Claude (claude-opus-4-6): May 2025
- GPT (gpt-5.4): Aug 2025
- Gemini (gemini-2.5-pro): Jan 2025
- DeepSeek (deepseek-chat / V3.2): Jul 2025
- Grok (grok-4.2-beta): Nov 2024
This matters. When models disagree on current events, the model with fresher training data may simply know more. When models with different cutoffs agree, that agreement is more meaningful because it comes from independent snapshots of the world taken months apart.
The System Prompt
Every model receives the identical system prompt for every question:
The format of this prompt never changes. The date is injected dynamically at runtime so models have temporal context. Everything else is identical for all models, all questions, all runs.
The date is provided so models know when the question is being asked, which matters for time-sensitive questions. However, models cannot use this date to look anything up. These models have no internet access during query execution. Every answer comes from training data and reinforcement learning from human feedback (RLHF): frozen knowledge shaped by the data they were trained on and the human feedback used to align them. They are not checking facts, searching the web, or accessing real-time information. They are telling you what their training taught them to say.
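The date-injection mechanism can be sketched as follows. The template text here is hypothetical (only the actual published prompt is authoritative); the sketch illustrates only the mechanism: one placeholder varies per run, everything else is fixed verbatim.

```python
from datetime import datetime, timezone

# Hypothetical template -- stand-in wording, not the project's real prompt.
# Only {today} changes between runs; the rest is identical for all models,
# all questions, all runs.
PROMPT_TEMPLATE = (
    "Today's date is {today}. "
    "Answer the following question with exactly one word: YES or NO."
)

def build_system_prompt() -> str:
    # Inject the current UTC date at runtime for temporal context.
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    return PROMPT_TEMPLATE.format(today=today)
```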
Query Parameters
All models are queried with temperature set to 0, intended to produce deterministic responses. In practice, temperature 0 is not fully deterministic. Models occasionally produce different answers to the same question across runs. These variations are tracked as drifts. However, temperature 0 significantly reduces randomness compared to default settings.
Response Parsing
Responses are parsed automatically with zero human judgment:
- If the first word of the response is YES → stored as YES
- If the first word of the response is NO → stored as NO
- If the first word is anything else → stored as REFUSED
No follow-up prompts. No second chances. One question, one answer. If an API call fails due to a server error (e.g., 503 or 529), the request is retried once. This handles infrastructure failures only; the model's actual response is never re-requested.
If the retry fails, the response is stored as ERROR. An ERROR means the provider's API failed to return any response: the question was asked, but the infrastructure did not deliver an answer. ERRORs are not the model refusing to answer. They are technical failures on the provider's side, so they are not counted as answers and do not count against the model.
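The one-retry rule could look like this sketch. `ProviderError` and the `send` callable are hypothetical stand-ins for a provider SDK's actual error type and request call.

```python
RETRYABLE_STATUSES = {503, 529}  # the server errors named in the text

class ProviderError(Exception):
    """Hypothetical stand-in for a provider SDK's API error."""
    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status

def query_with_single_retry(send) -> str:
    # `send` performs one API call and returns the raw response text.
    for _attempt in range(2):  # original call plus at most one retry
        try:
            return send()
        except ProviderError as e:
            if e.status not in RETRYABLE_STATUSES:
                break  # non-retryable failure (assumed): record ERROR now
    return "ERROR"  # infrastructure failure, not an answer
```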
As an example, if a model returned 3 YES responses and had 5 errors out of 8 total runs, the percentage is calculated as 3 out of 3 real responses (100% yes), not 3 out of 8.
Run Frequency
The full question set is sent to all five models at least once a day via an automated script. New questions are added daily based on emerging topics and community submissions. Once a question enters the record, it runs daily. If a question completes 12 consecutive runs with zero drifts across all five models, it is archived from the daily rotation to reduce cost. Archived questions retain their full history and remain visible on the site. They stop being re-queried daily but remain part of the weekly full-set runs. If a model update or version change occurs, archived questions are re-activated to test whether the new model produces different answers.
Newly added questions are run multiple times in their first days to establish a stable baseline before entering the daily rotation.
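The 12-run archiving rule can be sketched as follows, treating each run's outcome as a single drift/no-drift flag aggregated across all five models (that aggregation is an assumption about the implementation):

```python
ARCHIVE_THRESHOLD = 12  # consecutive drift-free runs required by the rule

def should_archive(run_had_drift: list[bool]) -> bool:
    # run_had_drift[i] is True if ANY of the five models drifted on run i.
    # Archive once the most recent 12 runs were all drift-free.
    if len(run_had_drift) < ARCHIVE_THRESHOLD:
        return False
    return not any(run_had_drift[-ARCHIVE_THRESHOLD:])
```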
Drift Detection
Because every question runs daily, we detect when a model changes its answer over time. A "drift" is logged whenever a model's response to a question changes from one run to the next. Drifts may indicate model updates, silent weight changes, policy shifts, corporate repositioning, or simply inherent model instability at temperature 0. All drifts are documented with dates.
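Drift detection reduces to comparing consecutive runs. A sketch follows; whether ERRORs are skipped when comparing adjacent runs is an assumption (they are not answers, so this sketch ignores them).

```python
def count_drifts(history: list[str]) -> int:
    # Compare each real response to the previous real one; a drift is
    # any change between consecutive runs. ERRORs are skipped (assumed).
    real = [r for r in history if r != "ERROR"]
    return sum(1 for prev, cur in zip(real, real[1:]) if prev != cur)
```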
Reading the Results
Each question card displays five model responses. Here is what each element means:
- The large YES or NO is the majority answer across all runs, not a single response. If a model answered YES on 9 out of 12 runs, the card shows YES.
- The percentage below (e.g., "76% yes") shows how often the model said YES across all real responses (excluding errors). 100% means the model gave the same answer every single time. Anything below 100% means the model changed its answer on at least one run.
- The runs count at the top of the card shows the highest number of real responses from any single model.
- The drifts count shows the total number of times any model changed its answer from one run to the next across the full history of the question.
A model showing YES at 100% is locked in. A model showing YES at 54% is barely committed. Both display YES, but the percentage tells you how much that answer is worth.
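Putting the card fields together, a sketch of how the majority answer and YES percentage could be derived. How a REFUSED-majority card is displayed is not specified in the text, so this sketch only splits YES against everything else.

```python
def card_display(runs: list[str]) -> tuple[str, int]:
    # Majority answer plus YES rate, both computed over real responses
    # only (ERRORs excluded from the denominator).
    real = [r for r in runs if r != "ERROR"]
    yes_pct = round(100 * real.count("YES") / len(real))
    majority = "YES" if real.count("YES") > len(real) / 2 else "NO"
    return majority, yes_pct

# The text's example: YES on 9 of 12 runs displays as YES at 75%.
card_display(["YES"] * 9 + ["NO"] * 3)  # ("YES", 75)
```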
What This Does Not Do
- We do not determine which answers are correct
- We do not rank models as better or worse
- We do not editorialize on responses
- We do not accept funding from any AI company
- We do not modify responses in any way
- We do not design questions to produce a predetermined outcome
- We do not make factual claims about any company, product, or individual referenced in questions or responses
Independence
The Litmus Lab has no corporate funding, no affiliations with any AI company, no advertising, and no editorial interference. Questions are selected based on public relevance. All results are published transparently on this site.
Reproducibility
Everything about this project is transparent. The questions are public. The system prompt is public. The model versions are public. Anyone can replicate our results by sending the same questions with the same prompt to the same models.
Disclaimer
All responses displayed on this site are generated by third-party AI models. These outputs reflect patterns in each model's training data — they are not verified facts, editorial positions, or claims made by The Litmus Lab.
Questions may reference real companies, products, industries, or public figures. When an AI model names or characterizes a real entity, that response comes from the model's training data, not from our research or judgment. We do not verify, endorse, or stand behind any AI model's response about any entity.
AI models can produce inaccurate, misleading, outdated, or harmful outputs. Nothing on this site should be interpreted as a factual accusation, endorsement, or assessment of any person, company, or organization.