Research PM Prototype

Verified Output

What if Claude checked its own work?

Frontier models have internal signals about their own reliability that are richer than anything users see. Self-verification can surface those signals as a product experience: not more thinking traces, but structured quality assurance that tells users where to focus their attention.

The Product Overhang

Recent research shows that models can be trained to verify their own output, and that doing so also improves generation quality. Multi-task reinforcement learning that optimizes generation and self-verification together outperforms optimizing either alone. Self-verification reduces hallucinations by 9.7% to 53.3%, depending on domain.

But no product surfaces this capability. Users receive one flat output with no distinction between "I generated this and I am confident" and "I generated this and I would want a second look."

This prototype explores what the product experience could look like: Claude generates an answer, then runs a structured verification pass. The result is annotated output that tells you what is well-grounded, what is extrapolation, and where a human reviewer would add the most value.

Try It

The demonstration below uses a pre-verified example to show the concept. A production version would run self-verification on any query in real time using model internals rather than a second inference pass.
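The generate-then-verify flow described above can be sketched roughly as follows. Everything here is illustrative: `call_model` is a stub standing in for a real model call (or, in a production version, a read of model internals), and the claim-splitting and label format are placeholders for whatever a real system would use.

```python
from dataclasses import dataclass

@dataclass
class VerifiedClaim:
    text: str
    confidence: str  # "HIGH" | "MEDIUM" | "LOW"
    kind: str        # "factual" | "inference" | "reasoning"

def call_model(prompt: str) -> str:
    # Stub: a production version would call the model here.
    if prompt.startswith("VERIFY:"):
        return "HIGH|factual"
    return "Circuit tracing follows the model's computational steps."

def generate_and_verify(query: str) -> list[VerifiedClaim]:
    # Pass 1: generate an answer, then split it into atomic claims.
    answer = call_model(query)
    claims = [s.strip() for s in answer.split(".") if s.strip()]
    # Pass 2: ask the model to verify each claim and label it.
    verified = []
    for claim in claims:
        confidence, kind = call_model(f"VERIFY: {claim}").split("|")
        verified.append(VerifiedClaim(claim, confidence, kind))
    return verified
```

The point of the sketch is the shape of the output: a list of claims, each carrying a confidence level and a claim type, rather than one flat string.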

Pre-loaded Demonstration

12 claims analyzed

Query

What product opportunities does Anthropic's interpretability research create that weren't possible before?

Verified Claims

Circuit tracing allows researchers to follow the computational steps Claude takes when generating an answer.

HIGH · factual

When Claude confabulates, the internal computation looks fundamentally different from when it actually reasons through a problem: there are no intermediate computational steps, just fluent fabrication.

HIGH · factual

This signal has existed since the Biology of LLM paper in March 2025, but no product surfaces it to users.

HIGH · factual

Activation Oracles enable natural-language querying of model internal states. A separate model can be trained to accept activations as input and answer arbitrary questions about them.

HIGH · factual

This could allow developers to ask 'what is the model actually uncertain about?' without needing specialized mechanistic interpretability knowledge.

MEDIUM · inference

Persona vectors have been shown to be controllable and compositional. Traits like sycophancy, hallucination tendency, and cautiousness exist as steerable directions in activation space.

HIGH · factual

You can monitor these traits drifting in real time and even 'vaccinate' against undesirable traits during training.

MEDIUM · reasoning

Constitutional Classifiers++ reduced safety compute overhead from 23.7% to approximately 1% through a cascade architecture using internal activation probes.

HIGH · factual

This makes real-time safety monitoring economically viable for high-frequency applications like always-on agent pipelines, continuous code review, and low-latency enterprise deployments.

MEDIUM · inference

Claude Opus 4.1 detected injected activation vectors approximately 20% of the time, specifically noticing anomalies in its own processing before identifying the concept.

HIGH · factual

This capability appears to scale with model size, suggesting future models will have meaningfully better self-assessment.

MEDIUM · inference

These capabilities exist in research but have no product surface.

HIGH · factual

What is Missing

  • Does not address the practical engineering barriers between research demonstration and production deployment
  • Omits the Coffee/Coffins stress-test findings showing fragility in current steering methods
  • Does not mention competitive context: whether other labs have similar capabilities
  • Lacks discussion of which capability has the shortest path to productization

Where Human Review Adds Most Value

  • Verify the claim that persona vector monitoring works reliably in production conditions, given the Coffee/Coffins fragility findings
  • Assess whether the inference about new product categories from cheap safety compute is grounded in actual engineering feasibility
  • Evaluate whether the introspection scaling claim is supported by enough data points to be directionally reliable
  • Consider adding nuance about the gap between research demo and production system for each capability

Eval Panel

  • Total Claims: 12
  • High Confidence: 8
  • Medium Confidence: 4
  • Low Confidence: 0
  • Flagged for Review: 4
  • Completeness: 72%

Overall Confidence: 78%

Research Feedback Loop

Self-verification is not just a product feature. It is an eval-generation machine. Every verified interaction produces structured signals about model reliability.

What the product learns

  • Which claim types are consistently over- or under-confident
  • Where factual claims pass verification but inferences fail
  • Which domains have the worst calibration between stated and actual confidence
  • How completeness varies by question complexity and topic
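One way to picture these signals: if each verified claim were logged as a small record (the field names below are hypothetical), per-type over- and under-confidence falls out of a simple aggregation.

```python
from collections import defaultdict

# Hypothetical record shape for one verified claim; values are illustrative.
records = [
    {"kind": "factual",   "stated_conf": 0.9, "correct": True},
    {"kind": "factual",   "stated_conf": 0.9, "correct": True},
    {"kind": "inference", "stated_conf": 0.8, "correct": False},
    {"kind": "inference", "stated_conf": 0.8, "correct": True},
]

def confidence_gap_by_kind(records):
    """Mean stated confidence minus observed accuracy, per claim type.
    Positive gap = overconfident, negative gap = underconfident."""
    by_kind = defaultdict(list)
    for r in records:
        by_kind[r["kind"]].append(r)
    gaps = {}
    for kind, rs in by_kind.items():
        stated = sum(r["stated_conf"] for r in rs) / len(rs)
        accuracy = sum(r["correct"] for r in rs) / len(rs)
        gaps[kind] = round(stated - accuracy, 3)
    return gaps
```

On the toy records above, factual claims come out slightly underconfident and inferences clearly overconfident, which is exactly the kind of signal the list above describes.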

What evals this creates

  • Calibration benchmarks: predicted confidence vs. actual correctness by domain
  • Faithfulness evals: how often is reasoning load-bearing vs. decorative?
  • Completeness evals: what important information does the model consistently omit?
  • Self-verification accuracy: how reliably does the model catch its own errors?
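As a concrete sketch of the first item, a calibration benchmark could score claim-level predictions with expected calibration error (ECE). The binning scheme and inputs here are illustrative, not a description of any existing eval.

```python
def expected_calibration_error(preds, n_bins=10):
    """preds: list of (predicted_confidence, was_correct) pairs.
    Returns the bin-weighted gap between confidence and accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / len(preds)) * abs(avg_conf - accuracy)
    return ece
```

A perfectly calibrated model scores 0; a model that says 95% but is right half the time scores around 0.45. Computing this per domain gives the calibration benchmark described above.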

What the model should improve next

  • Better uncertainty expression before claims the model cannot verify
  • More conservative confidence on inferences and extrapolations
  • Proactive completeness: raising missing perspectives without being asked
  • Separating "I recall this" from "I am constructing this" in generation

PM Note

The product overhang in frontier AI is not just about surfacing more capability. It is about surfacing the model's own assessment of its reliability. Users currently interact with a single flat output and decide whether to trust it based on how it reads. That is like evaluating a medical test by how neatly the report is formatted.

Self-verification changes the interaction from "trust or don't trust" to "here is where to focus your attention." That is a meaningful product shift, and it only becomes possible when models can reliably assess their own output.

More importantly, the verification layer is an eval-generation surface. Every interaction produces structured data about where the model is well-calibrated and where it is not. That data feeds directly into better benchmarks, better post-training priorities, and better model behavior. The product does not just expose the model. It improves the next generation of the model.

This is a thin prototype. A production version would use model internals, not a second inference pass, and would need careful study of how verification signals affect user trust and behavior. But the core thesis stands: the missing product layer is not more explanation. It is structured self-assessment.

Built by Mikko Kiiskila as a Research PM prototype.
Grounded in Anthropic's published research on interpretability, self-verification, and agent autonomy.