
How Simulator Results Are Evaluated


This feature is currently in an Early Access phase

If you're interested in learning more, contact Gladly Support.

This article explains how the Simulator determines whether a test passed or failed, how multi-turn Conversations work, and how handoff detection factors into results.

Evaluation process

After each response from Gladly, the Simulator runs a two-phase evaluation:

Phase 1: Check success criteria

AI evaluates the full Conversation transcript and checks each success criterion. The overall result is "pass" only when every criterion passes; if any single criterion fails, the test fails overall.

Phase 2: Generate the next Customer message (if needed)

If not all criteria are met, the system generates a natural follow-up message from the simulated Customer, and the Conversation continues.

These two phases are deliberately separate. The evaluator knows the success criteria but doesn't generate Customer messages. The simulated Customer knows the objective but never sees the success criteria. This separation keeps the simulation realistic by allowing the Customer to behave like a real person, not like someone trying to make the test pass.
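The separation between the two phases can be sketched as two stand-alone functions. Everything here is an illustrative stand-in (keyword matching in place of the AI evaluator, a canned follow-up in place of the simulated Customer), not Gladly's actual implementation:

```python
def criterion_passes(transcript: list[str], criterion: str) -> bool:
    # Stand-in for the AI evaluator: a criterion "passes" if its
    # keyword appears anywhere in the transcript.
    return any(criterion in message for message in transcript)

def generate_customer_message(transcript: list[str], objective: str) -> str:
    # Stand-in for the simulated Customer. It receives only its
    # objective, never the success criteria.
    return f"Following up: I still need help with {objective}."

def evaluate_turn(transcript, success_criteria, customer_objective):
    # Phase 1: check every criterion against the full transcript.
    results = {c: criterion_passes(transcript, c) for c in success_criteria}
    if all(results.values()):
        return "pass", None  # every criterion passed
    # Phase 2: not all criteria met, so the Conversation continues
    # with a follow-up from the simulated Customer.
    return "continue", generate_customer_message(transcript, customer_objective)
```

Note that `generate_customer_message` never receives `success_criteria`; that is the separation that keeps the simulated Customer behaving like a real person.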

Multi-turn Conversations

Most real Customer interactions take multiple exchanges. The Simulator handles this naturally:

  1. The simulated Customer sends the initial message.

  2. Gladly responds.

  3. AI evaluates whether all success criteria are met.

  4. If not, the simulated Customer sends a follow-up message.

  5. Steps 2 through 4 repeat until all criteria pass or the Conversation reaches 20 turns.

Conversations are capped at 20 Customer turns

If the Gladly Agent can't satisfy all the success criteria within 20 Customer turns, the test fails.

If your tests consistently hit this limit, it usually means the success criteria are too demanding for the current Guide configuration, or the scenario needs adjustment.
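The loop in steps 1 through 5, together with the 20-turn cap, can be sketched as follows. The callables passed in are hypothetical stand-ins for the Gladly Agent, the evaluator, and the simulated Customer; none of this is Gladly's actual API:

```python
MAX_CUSTOMER_TURNS = 20  # cap described in this article

def run_scenario(first_message, criteria_met, agent_reply, customer_follow_up):
    """Run one simulated Conversation until it passes, or fail at the turn cap."""
    transcript = [("customer", first_message)]                # step 1
    customer_turns = 1
    while True:
        transcript.append(("agent", agent_reply(transcript)))  # step 2
        if criteria_met(transcript):                           # step 3
            return "pass", transcript
        if customer_turns >= MAX_CUSTOMER_TURNS:
            return "fail", transcript  # cap reached without meeting criteria
        transcript.append(("customer", customer_follow_up(transcript)))  # step 4
        customer_turns += 1                                    # step 5: repeat
```

A test that never satisfies its criteria exits with `"fail"` after exactly 20 Customer messages, which is the behavior the cap above describes.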

How handoffs are handled

A handoff occurs when the Gladly Agent transfers the Customer to a team member. The Simulator detects handoffs automatically and handles them based on the Expect handoff setting.

| Expect handoff | Handoff detected? | Result |
| --- | --- | --- |
| Off (default) | Yes | Fail — the agent shouldn't have handed off |
| Off (default) | No | Continue normally |
| On | Yes | Pass — the handoff was the expected outcome |
| On | No | Continue — keep going until the handoff happens or the turn limit is reached |

When Expect handoff is on but no handoff has occurred, the test keeps running even if success criteria pass

If the Conversation reaches 20 turns without a handoff, the test fails.
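The handoff rules above reduce to a small decision function. This is an illustrative sketch; the function name and return values are assumptions, not Gladly's actual API:

```python
def handoff_result(expect_handoff: bool, handoff_detected: bool) -> str:
    if handoff_detected:
        # With Expect handoff on, the handoff is the desired outcome;
        # with it off, handing off counts as a failure.
        return "pass" if expect_handoff else "fail"
    # No handoff yet: the Conversation continues either way. With
    # Expect handoff on, the test keeps running even if all success
    # criteria pass, and fails if the 20-turn cap is reached first.
    return "continue"
```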

Debugging failed tests

When a test fails, start with the per-criterion breakdown in the results:

  • Identify the failed criteria – The individual criterion results tell you exactly what wasn't satisfied.

  • Read the explanation – Each criterion result includes a reason for why it passed or failed.

  • Review the transcript – Walk through the Conversation to see where the Gladly Agent’s behavior diverged from expectations.

  • Check your criteria – Sometimes the issue is with how the criterion is written, not with the Gladly Agent. See Write Effective Success Criteria for Simulator Scenarios for guidance.

  • Check your Customer data – If the simulated data doesn't match what the Gladly Agent expects, the Conversation may go off track. Make sure Customer data available to AI accurately reflects the situation your scenario requires.

  • View the Conversation – Click View conversation to review the full interaction in detail.

If a test that previously passed starts failing after a Guide change, the Simulator is working as intended

This indicates that it caught a regression. Review the transcript to understand what changed in the Gladly Agent’s behavior.