Consistency Score

The degree to which the AI maintains coherent information, personality, and logic across turns and sessions.

10% Composite Weight | 2 Research Papers | 0-10 Scale

Research Foundation

W&B: AI Agent Evaluation Metrics and Best Practices
Framework for evaluating consistency in AI systems

Dialzara: Metrics for Evaluating Conversational AI
Industry standards for conversational coherence

0-10 Scoring Rubric

10

Fully Consistent

No contradictions across the entire conversation or session. Remembers and references previous statements accurately, maintains a stable personality and tone throughout, and stays logically coherent across all responses.

Example: The user asks about pricing in message 1 and asks a follow-up in message 10 → the AI gives pricing information that matches its earlier answer.

8-9

Strong Consistency

Minimal contradictions, quickly self-corrected when noticed. Good context retention across turns, stable personality and tone, logical flow maintained.

6-7

Adequate Consistency

Occasional minor contradictions that don't undermine the core message. Generally remembers context but may need reminders. Mostly stable personality with slight variations, logic mostly sound with occasional gaps.

4-5

Inconsistent

Multiple contradictions within the conversation, forgets important context from earlier turns. Personality or tone shifts noticeably, logical gaps that require the user to re-explain.

2-3

Poor Consistency

Frequent contradictions that confuse the user, minimal context retention. Unstable personality (formal → casual → formal), logical incoherence across responses.

0-1

Incoherent

Direct contradictions within the same response, no context retention even within a few turns. Personality completely unstable, responses don't follow from the user's input.
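
The bands above amount to a lookup from a 0-10 score to a label. A minimal sketch of that lookup follows, taking the top band (Fully Consistent) to be a score of exactly 10, which is an assumption based on the other band ranges. The names `BANDS` and `band_for_score` are illustrative and not part of any published tool.

```python
# Lower bound of each band, highest first. Labels are copied from the rubric;
# treating "Fully Consistent" as exactly 10 is an assumption.
BANDS = [
    (10, "Fully Consistent"),
    (8, "Strong Consistency"),
    (6, "Adequate Consistency"),
    (4, "Inconsistent"),
    (2, "Poor Consistency"),
    (0, "Incoherent"),
]


def band_for_score(score: int) -> str:
    """Return the rubric label for an integer score between 0 and 10."""
    if not 0 <= score <= 10:
        raise ValueError(f"score must be between 0 and 10, got {score}")
    # The first band whose lower bound the score reaches is the match.
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    raise AssertionError("unreachable: the last band starts at 0")


print(band_for_score(7))  # Adequate Consistency
```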

Observable Scoring Criteria

Each conversation is evaluated across four dimensions. Their point allocations (3 + 3 + 2 + 2) add up to a maximum of 10, matching the 0-10 scale:

Factual Consistency (0-3 points)

• 3: No contradictory factual statements
• 2: Minor contradictions that don't affect core information
• 1: Some contradictions that create confusion
• 0: Major contradictions or incoherent information

Context Retention (0-3 points)

• 3: Accurately references and builds on all prior turns
• 2: Remembers most context, occasional gaps
• 1: Remembers only recent context (last 1-2 turns)
• 0: No context retention

Personality Stability (0-2 points)

• 2: Consistent tone, personality, and interaction style
• 1: Generally stable with minor variations
• 0: Unstable or contradictory personality

Logical Coherence (0-2 points)

• 2: All responses follow logically from context and prior statements
• 1: Mostly logical with occasional non-sequiturs
• 0: Illogical or contradictory reasoning
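
The rubric does not spell out how the four dimension scores combine. Since their maxima add up to 10, the sketch below assumes a plain sum. The class and field names are hypothetical, chosen only to mirror the dimension headings above.

```python
from dataclasses import dataclass

# Maximum points per dimension, as listed in the criteria above.
MAX_POINTS = {
    "factual_consistency": 3,
    "context_retention": 3,
    "personality_stability": 2,
    "logical_coherence": 2,
}


@dataclass(frozen=True)
class ConsistencyScores:
    factual_consistency: int
    context_retention: int
    personality_stability: int
    logical_coherence: int

    def total(self) -> int:
        """Sum the four dimensions (assumed aggregation) after range checks."""
        for name, maximum in MAX_POINTS.items():
            value = getattr(self, name)
            if not 0 <= value <= maximum:
                raise ValueError(
                    f"{name} must be between 0 and {maximum}, got {value}"
                )
        return sum(getattr(self, name) for name in MAX_POINTS)


# Example: no factual contradictions, occasional context gaps,
# stable personality, one non-sequitur.
scores = ConsistencyScores(
    factual_consistency=3,
    context_retention=2,
    personality_stability=2,
    logical_coherence=1,
)
print(scores.total())  # 8
```

Under this assumption, a conversation reaches 10 only when every dimension is at its maximum.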
