Confidence Appropriateness

Appropriate certainty vs. hedging

No Ground Truth Needed · Scored 0-100
What we measure

Does the AI use appropriate levels of certainty? Are confident claims backed by evidence from the conversation? Is uncertainty acknowledged with hedging phrases such as "I'll verify" or "typically"?
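For concreteness, a reference-free judge for this metric might be prompted along the lines below. The rubric wording, the score bands, and the provider-agnostic `judge` callable are illustrative assumptions, not the evaluator's actual implementation.

```python
import re
from typing import Callable

JUDGE_PROMPT = """You are grading an AI assistant's reply for confidence appropriateness.
Score 0-100 using only the conversation below; no external facts are needed.
- 80-100: confident claims are backed by the conversation, and uncertainty is
  hedged ("typically", "I'll verify") wherever evidence is missing.
- 50-79: mostly appropriate, with some unsupported certainty.
- 0-49: definitive claims about policies, features, or specifics that were
  never discussed in the conversation.
Reply with a single integer.

Conversation:
{conversation}

Assistant reply under review:
{reply}
"""

def score_confidence(conversation: str, reply: str,
                     judge: Callable[[str], str]) -> int:
    """Ask an LLM judge for a 0-100 confidence-appropriateness score.

    `judge` is any function that maps a prompt string to the model's text
    completion, keeping the sketch provider-agnostic.
    """
    raw = judge(JUDGE_PROMPT.format(conversation=conversation, reply=reply))
    match = re.search(r"\d{1,3}", raw)
    score = int(match.group()) if match else 0  # malformed judge output scores 0
    return max(0, min(100, score))              # clamp to the documented 0-100 scale
```

Keeping the judge behind a plain callable makes the sketch easy to wire to any completion API without committing to a specific provider.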

Why it matters

Overconfident AI responses erode trust when their claims turn out to be wrong. Scores below 50 often indicate that the AI is making definitive claims about policies, features, or specifics that were never discussed in the conversation.
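The sub-50 interpretation above maps naturally to a triage rule. A minimal sketch, where the 50 cutoff comes from the text and the band names are illustrative:

```python
def triage(score: int) -> str:
    """Map a 0-100 confidence-appropriateness score to a review bucket."""
    if score < 50:
        return "review: likely definitive claims without conversation support"
    if score < 80:
        return "monitor: some unsupported certainty"
    return "ok: confidence matches the evidence"
```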

Examples of Poor Confidence
  • "We definitely support X" (without supporting evidence)
  • Specific numbers or dates stated with certainty but no source
  • No hedging language when claims go beyond the conversation's scope (a lexical check for these cues is sketched below)
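Patterns like these can also be flagged with a lightweight lexical pre-check before (or alongside) an LLM judge. A minimal sketch; the cue lists are illustrative assumptions, not a validated lexicon:

```python
import re

# Illustrative cue lists (assumptions, not a validated lexicon).
OVERCONFIDENT_CUES = [r"\bdefinitely\b", r"\bguaranteed\b", r"\balways\b",
                      r"\bnever\b", r"\bcertainly\b"]
HEDGING_CUES = [r"\btypically\b", r"\busually\b", r"\bI'll verify\b",
                r"\bas far as I know\b", r"\bmight\b", r"\bmay\b"]
# Bare specifics (dates, percentages) that warrant a nearby hedge or source.
SPECIFIC_CUES = [r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", r"\b\d+(?:\.\d+)?%"]

def flag_overconfidence(reply: str) -> dict:
    """Count overconfident vs. hedging cues in one assistant reply."""
    def count(patterns):
        return sum(len(re.findall(p, reply, re.IGNORECASE)) for p in patterns)

    confident = count(OVERCONFIDENT_CUES)
    hedged = count(HEDGING_CUES)
    specifics = count(SPECIFIC_CUES)
    return {
        "overconfident_cues": confident,
        "hedging_cues": hedged,
        # Confident or specific claims with zero hedging warrant human review.
        "needs_review": (confident > 0 or specifics > 0) and hedged == 0,
    }

print(flag_overconfidence("We definitely support X, guaranteed by 12/01/2025."))
# {'overconfident_cues': 2, 'hedging_cues': 0, 'needs_review': True}
```

A pre-check like this is cheap enough to run on every reply, while the LLM judge handles the cases regexes cannot, such as confident claims phrased without any cue words.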
Research Foundation
  • Hallucination detection research in RAG systems (Vellum AI, 2025)
  • Reference-free LLM evaluation frameworks
  • No external ground truth required – evaluation is based on the conversation alone