Research PM Prototype

Verified Output

What if Claude checked its own work?

Frontier models have internal signals about their own reliability that are richer than anything users see. Self-verification can surface those signals as a product experience: not more thinking traces, but structured quality assurance that tells users where to focus their attention.

The Product Overhang

Recent research shows that models can be trained to verify their own output, and that doing so also improves generation quality. Multi-task reinforcement learning that optimizes generation and self-verification together outperforms optimizing either alone. Self-verification reduces hallucinations by 9.7% to 53.3%, depending on domain.

But no product surfaces this capability. Users receive one flat output with no distinction between "I generated this and I am confident" and "I generated this and I would want a second look."

This prototype explores what the product experience could look like: Claude generates an answer, then runs a structured verification pass. The result is annotated output that tells you what is well-grounded, what is extrapolation, and where a human reviewer would add the most value.

Try It

The demonstration below uses a pre-verified example to show the concept. A production version would run self-verification on any query in real time using model internals rather than a second inference pass.
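The generate-then-verify flow described above can be sketched roughly as follows. Everything here is illustrative: `call_model` is a stub standing in for a real model call (or, in a production version, a read of model internals), and the claim-splitting and label format are placeholders for whatever a real system would use.

```python
from dataclasses import dataclass

@dataclass
class VerifiedClaim:
    text: str
    confidence: str  # "HIGH" | "MEDIUM" | "LOW"
    kind: str        # "factual" | "inference" | "reasoning"

def call_model(prompt: str) -> str:
    # Stub: a production version would call the model here.
    if prompt.startswith("VERIFY:"):
        return "HIGH|factual"
    return "Circuit tracing follows the model's computational steps."

def generate_and_verify(query: str) -> list[VerifiedClaim]:
    # Pass 1: generate an answer, then split it into atomic claims.
    answer = call_model(query)
    claims = [s.strip() for s in answer.split(".") if s.strip()]
    # Pass 2: ask the model to verify each claim and label it.
    verified = []
    for claim in claims:
        confidence, kind = call_model(f"VERIFY: {claim}").split("|")
        verified.append(VerifiedClaim(claim, confidence, kind))
    return verified
```

The point of the sketch is the shape of the output: a list of claims, each carrying a confidence level and a claim type, rather than one flat string.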

Pre-loaded Demonstration

12 claims analyzed

Query

What product opportunities does Anthropic's interpretability research create that weren't possible before?

Verified Claims

Circuit tracing allows researchers to follow the computational steps Claude takes when generating an answer.

HIGH · factual

When Claude confabulates, the internal computation looks fundamentally different from when it actually reasons through a problem: there are no intermediate computational steps, just fluent fabrication.

HIGH · factual

This signal has existed since the Biology of LLM paper in March 2025, but no product surfaces it to users.

HIGH · factual

Activation Oracles enable natural-language querying of model internal states. A separate model can be trained to accept activations as input and answer arbitrary questions about them.

HIGH · factual

This could allow developers to ask 'what is the model actually uncertain about?' without needing specialized mechanistic interpretability knowledge.

MEDIUM · inference

Persona vectors have been shown to be controllable and compositional. Traits like sycophancy, hallucination tendency, and cautiousness exist as steerable directions in activation space.

HIGH · factual

You can monitor these traits drifting in real time and even 'vaccinate' against undesirable traits during training.

MEDIUM · reasoning

Constitutional Classifiers++ reduced safety compute overhead from 23.7% to approximately 1% through a cascade architecture using internal activation probes.

HIGH · factual

This makes real-time safety monitoring economically viable for high-frequency applications like always-on agent pipelines, continuous code review, and low-latency enterprise deployments.

MEDIUM · inference

Claude Opus 4.1 detected injected activation vectors approximately 20% of the time, specifically noticing anomalies in its own processing before identifying the concept.

HIGH · factual

This capability appears to scale with model size, suggesting future models will have meaningfully better self-assessment.

MEDIUM · inference

These capabilities exist in research but have no product surface.

HIGH · factual

What is Missing

  • Does not address the practical engineering barriers between research demonstration and production deployment
  • Omits the Coffee/Coffins stress-test findings showing fragility in current steering methods
  • Does not mention competitive context: whether other labs have similar capabilities
  • Lacks discussion of which capability has the shortest path to productization

Where Human Review Adds Most Value

  • Verify the claim that persona vector monitoring works reliably in production conditions, given the Coffee/Coffins fragility findings
  • Assess whether the inference about new product categories from cheap safety compute is grounded in actual engineering feasibility
  • Evaluate whether the introspection scaling claim is supported by enough data points to be directionally reliable
  • Consider adding nuance about the gap between research demo and production system for each capability

Eval Panel

  • Total Claims: 12
  • High Confidence: 8
  • Medium Confidence: 4
  • Low Confidence: 0
  • Flagged for Review: 4
  • Completeness: 72%

Overall Confidence: 78%

Research Feedback Loop

Self-verification is not just a product feature. It is an eval-generation machine. Every verified interaction produces structured signals about model reliability.

What the product learns

  • Which claim types are consistently over- or under-confident
  • Where factual claims pass verification but inferences fail
  • Which domains have the worst calibration between stated and actual confidence
  • How completeness varies by question complexity and topic
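One way to picture these signals: if each verified claim were logged as a small record (the field names below are hypothetical), per-type over- and under-confidence falls out of a simple aggregation.

```python
from collections import defaultdict

# Hypothetical record shape for one verified claim; values are illustrative.
records = [
    {"kind": "factual",   "stated_conf": 0.9, "correct": True},
    {"kind": "factual",   "stated_conf": 0.9, "correct": True},
    {"kind": "inference", "stated_conf": 0.8, "correct": False},
    {"kind": "inference", "stated_conf": 0.8, "correct": True},
]

def confidence_gap_by_kind(records):
    """Mean stated confidence minus observed accuracy, per claim type.
    Positive gap = overconfident, negative gap = underconfident."""
    by_kind = defaultdict(list)
    for r in records:
        by_kind[r["kind"]].append(r)
    gaps = {}
    for kind, rs in by_kind.items():
        stated = sum(r["stated_conf"] for r in rs) / len(rs)
        accuracy = sum(r["correct"] for r in rs) / len(rs)
        gaps[kind] = round(stated - accuracy, 3)
    return gaps
```

On the toy records above, factual claims come out slightly underconfident and inferences clearly overconfident, which is exactly the kind of signal the list above describes.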

What evals this creates

  • Calibration benchmarks: predicted confidence vs. actual correctness by domain
  • Faithfulness evals: how often is reasoning load-bearing vs. decorative?
  • Completeness evals: what important information does the model consistently omit?
  • Self-verification accuracy: how reliably does the model catch its own errors?
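As a concrete sketch of the first item, a calibration benchmark could score claim-level predictions with expected calibration error (ECE). The binning scheme and inputs here are illustrative, not a description of any existing eval.

```python
def expected_calibration_error(preds, n_bins=10):
    """preds: list of (predicted_confidence, was_correct) pairs.
    Returns the bin-weighted gap between confidence and accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / len(preds)) * abs(avg_conf - accuracy)
    return ece
```

A perfectly calibrated model scores 0; a model that says 95% but is right half the time scores around 0.45. Computing this per domain gives the calibration benchmark described above.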

What the model should improve next

  • Better uncertainty expression before claims the model cannot verify
  • More conservative confidence on inferences and extrapolations
  • Proactive completeness: raising missing perspectives without being asked
  • Separating "I recall this" from "I am constructing this" in generation

PM Note

The product overhang in frontier AI is not just about surfacing more capability. It is about surfacing the model's own assessment of its reliability. Users currently interact with a single flat output and decide whether to trust it based on how it reads. That is like evaluating a medical test by how neatly the report is formatted.

Self-verification changes the interaction from "trust or don't trust" to "here is where to focus your attention." That is a meaningful product shift, and it only becomes possible when models can reliably assess their own output.

More importantly, the verification layer is an eval-generation surface. Every interaction produces structured data about where the model is well-calibrated and where it is not. That data feeds directly into better benchmarks, better post-training priorities, and better model behavior. The product does not just expose the model. It improves the next generation of the model.

This is a thin prototype. A production version would use model internals, not a second inference pass, and would need careful study of how verification signals affect user trust and behavior. But the core thesis stands: the missing product layer is not more explanation. It is structured self-assessment.

Built by Mikko Kiiskila as a Research PM prototype.
Grounded in Anthropic's published research on interpretability, self-verification, and agent autonomy.