This feature is currently in an Early Access phase.
If you're interested in learning more, contact Gladly Support.
When using the Simulator, the quality of your success criteria directly determines the quality of your test results. Well-written criteria produce reliable, consistent verdicts. Poorly written criteria lead to confusing results that are difficult to act on.
Perspectives in a scenario
Each scenario is written from three distinct viewpoints. Mixing them up is the most common source of confusing test results.
Customer perspective
Used for: Customer goal, Initial message, Additional details
Write these fields as the Customer. Use first person and only include information a real Customer would know.
Example: "I want to return a sweater I bought last week"
Judge perspective
Used for: Success criteria
Write success criteria as an impartial observer watching the Conversation. Use third person and focus on what's observable in the transcript.
Example: "The agent confirmed the return was initiated"
System perspective
Used for: Customer data available to Gladly
Write this field as a developer would when defining the state of external systems. Be specific and precise.
Example: "Order #55891 exists with status 'delivered'. Item is a sweater, price $89.99, purchased 3 days ago. Return window is 30 days."
| Field | Perspective | Voice | Example |
|---|---|---|---|
| Customer goal | Customer | First person | "I want to check on my order" |
| Initial message | Customer | First person | "Where's my order?" |
| Additional details | Customer | First person | "My order number is #12345" |
| Success criteria | Judge | Third person | "The agent provided a shipping status" |
| Customer data available to Gladly | System | Technical | "Order #12345 exists, status: shipped" |
Principles for good criteria
Keep criteria focused
Each criterion should check a single, specific behavior. If you combine multiple checks into one criterion, it's harder to tell which check failed when a test doesn't pass.
| Avoid | Preferred |
|---|---|
| "The agent asked for the order number and then looked it up" | "The agent asked for the order number" / "The agent provided information about the order" |
Base criteria on what’s visible within the Conversation
Criteria should only check things that are visible in the Conversation transcript. The evaluator can only assess what was actually said. It can't verify anything happening behind the scenes, like system states or backend data.
| Avoid | Preferred |
|---|---|
| "The agent felt empathetic" | "The agent acknowledged the customer's frustration" |
| "The agent correctly processed the return" | "The agent confirmed the return was initiated and gave a return number" |
Be specific
Write criteria in concrete, literal terms. The evaluator follows them exactly as written, so vague or open-ended language will lead to inconsistent results.
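Vague wording can be partly caught before a test is run. The following is a minimal sketch of such a pre-check, not a Simulator feature: `VAGUE_TERMS` and `flag_vague_terms` are hypothetical names, and the word list is an assumption drawn only from the "Avoid" examples in this article.

```python
import re

# Hypothetical word list, taken from the "Avoid" examples in this article.
# It is an illustration, not a list the Simulator itself uses.
VAGUE_TERMS = ["well", "smoothly", "correctly", "felt"]


def flag_vague_terms(criterion: str) -> list[str]:
    """Return the vague or subjective terms found in a criterion."""
    words = re.findall(r"[a-z']+", criterion.lower())
    return [term for term in VAGUE_TERMS if term in words]


for text in [
    "The agent handled the situation well",
    "The agent provided a tracking number",
]:
    print(text, "->", flag_vague_terms(text))
```

A check like this only catches known words. A criterion can pass it and still be too open-ended, so it complements a manual review rather than replacing one.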
| Avoid | Preferred |
|---|---|
| "The agent handled the situation well" | "The agent provided a tracking number" |
| "The conversation went smoothly" | "The agent offered to refund the customer" |
Don't use success criteria to test for a handoff
To test explicitly for a successful handoff to a human Agent, select the Expect handoff checkbox instead of writing a success criterion for it.
A simple test for good criteria
Ask yourself: Could two people independently read the conversation transcript and agree on whether this criterion passed? If yes, it's a good criterion. If reasonable people might disagree, the criterion is too vague or too subjective.
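The two-reader test can be tried informally: have two reviewers record a pass or fail verdict per criterion for the same transcript, then compare. The sketch below assumes made-up verdicts, and `disputed_criteria` is a hypothetical helper, not something the Simulator provides.

```python
def disputed_criteria(
    reviewer_a: dict[str, bool], reviewer_b: dict[str, bool]
) -> list[str]:
    """Return the criteria on which the two reviewers gave different verdicts."""
    return [
        criterion
        for criterion in reviewer_a
        if criterion in reviewer_b and reviewer_a[criterion] != reviewer_b[criterion]
    ]


# Hypothetical verdicts from two people reading the same transcript.
reviewer_a = {
    "The agent provided a tracking number": True,
    "The agent handled the situation well": True,
}
reviewer_b = {
    "The agent provided a tracking number": True,
    "The agent handled the situation well": False,
}

print(disputed_criteria(reviewer_a, reviewer_b))
```

Any criterion this returns is a candidate for rewriting in more concrete, observable terms.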
Example: complete scenario with good criteria
Title: Return request — damaged item
Customer goal: "I want to return a sweater that arrived damaged"
Initial message: "Hi, I received a damaged item and I'd like to return it"
Additional details: "My order number is #55891. The sweater has a large stain on the front. I'd like a full refund."
Customer data available to Gladly: "Order #55891 exists with status 'delivered'. Item is a wool sweater, price $89.99, delivered 3 days ago. Return window is 30 days. Item is not final sale."
Success criteria:
"The agent asked the customer for their order number"
"The agent acknowledged the damage issue"
"The agent confirmed a return was initiated or provided return instructions"
"The agent mentioned the refund amount or refund process"
Each criterion is atomic (tests one thing), observable (verifiable from the transcript), written from the judge's perspective (third person), and specific (no vague language).
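For teams that draft scenarios outside the Simulator before entering them, the example above can be captured as a structured record. This is a sketch under stated assumptions: the `Scenario` class and its field names are hypothetical and simply mirror the form fields described in this article. They are not a Gladly API or import format.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Hypothetical record mirroring the scenario fields in this article."""

    title: str
    customer_goal: str        # Customer perspective, first person
    initial_message: str      # Customer perspective, first person
    additional_details: str   # Customer perspective, first person
    customer_data: str        # System perspective, technical
    success_criteria: list[str] = field(default_factory=list)  # Judge perspective


scenario = Scenario(
    title="Return request - damaged item",
    customer_goal="I want to return a sweater that arrived damaged",
    initial_message="Hi, I received a damaged item and I'd like to return it",
    additional_details=(
        "My order number is #55891. The sweater has a large stain on the front. "
        "I'd like a full refund."
    ),
    customer_data=(
        "Order #55891 exists with status 'delivered'. Item is a wool sweater, "
        "price $89.99, delivered 3 days ago. Return window is 30 days. "
        "Item is not final sale."
    ),
    success_criteria=[
        "The agent asked the customer for their order number",
        "The agent acknowledged the damage issue",
        "The agent confirmed a return was initiated or provided return instructions",
        "The agent mentioned the refund amount or refund process",
    ],
)

# Crude heuristic (an assumption, not a Simulator rule): criteria written from
# the judge's perspective in this article all start with "The agent".
third_person = all(c.startswith("The agent") for c in scenario.success_criteria)
print(len(scenario.success_criteria), "criteria, all third person:", third_person)
```

Keeping each criterion as its own list entry matches the one-behavior-per-criterion principle and makes it easy to review them individually.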