Orchestration testing lets you save realistic scenarios for a reusable workflow and check whether future runs still behave the way you expect. A test has inputs, expected task outputs, validation methods, and run history. Use it before publishing an Orchestration, after changing tasks or settings, and whenever a workflow starts producing inconsistent results. For schema design before testing, see Structured Outputs.
Tests do not make AI fully deterministic. They make the workflow more deterministic at the process level by fixing inputs, checking output structure, tracking changes over time, and showing exactly where behavior changed.

Why Orchestration testing matters

An Orchestration is more than a prompt. It can ask for inputs, use tools, search a Workspace, create artifacts, pause for human review, run tasks in parallel, and use Memory. Small edits can change how the whole workflow behaves. Testing helps you answer practical questions:
  • Do users provide the right inputs?
  • Does each task still produce the expected kind of output?
  • Does the workflow stop when required evidence is missing?
  • Does Sofie avoid unsupported conclusions?
  • Do citations, source references, or required terms appear when expected?
  • Do structured outputs keep the same fields?
  • Do parallel tasks still complete without hidden order dependencies?
  • Does long-term Memory improve the workflow without replacing current sources?
  • Did a task start using a capability you did not expect?
  • Did a template, schema, or input change make older tests drift?
Testing is especially valuable for life sciences workflows where the process matters as much as the final draft. A test can check that a deviation workflow separates facts from assumptions, that a CAPA workflow pauses before an effectiveness conclusion, or that a validation protocol workflow leaves placeholders when source evidence is missing.

When to create tests

Create tests early. Do not wait until the Orchestration is “finished.” Use tests:
  • Before publishing an Orchestration.
  • After a successful realistic run that you want to preserve as a baseline.
  • Before changing required inputs, source rules, task order, tools, output mode, or human review points.
  • After changing a CoDraft template, CoSheet structure, or Workspace source set that the workflow depends on.
  • When you enable or change Long-Term Memory (Cross-Run Learning).
  • When you enable Allow Parallel Task Execution for tasks that used to run one after another.
  • When a user reports that a published Orchestration behaves differently than expected.
  • Before sharing a workflow with a broader team.
Build a test suite the same way you build the workflow: start with one normal case, then add missing-source, conflicting-source, large-Workspace, and edge-case scenarios.

What a test contains

Each Orchestration test can include:
| Part | What it does | How to use it |
| --- | --- | --- |
| Test Name | Identifies the scenario. | Use names like Deviation with missing batch record page or CAPA metrics with outlier lots. |
| Description | Explains what the test proves. | State the behavior the workflow should keep. |
| Test Inputs | Provides the values the Orchestration needs to run. | Use realistic Workspaces, files, CoDrafts, CoMeetings, CoSheets, dates, and option values. |
| Task expectations | Defines expected output for one or more tasks. | Test important intermediate tasks, not only the final answer. |
| Validation Method | Chooses how Sofie checks text output. | Use exact, semantic, AI Judge, contains, regex, length, or tool-call validation based on the output. |
| Structured output checks | Checks named fields and arrays. | Use them when the task returns fields, tables, findings, metrics, or repeatable sections. |
| Narrative quality checks | Checks narrative text generated with structured output. | Use them for citations, unsupported claims, sensitive data, tone, or custom quality criteria. |
| Execution time constraints | Sets minimum or maximum expected task duration. | Use sparingly to catch major performance regressions, not normal run variability. |
| Run history | Shows how the test behaved over time. | Use it to review pass rate, task reliability, duration, token use, and capability use. |
You do not need to validate every task in every test. Focus on the task outputs that carry the most workflow risk.

Create a manual test

Manual tests are best when you know the scenario you want to preserve before you run it.
1. Open the Orchestration editor: Open the Orchestration in Orchestrate. Keep it as a draft while you are building or changing tests.
2. Open Tests: Open the Tests panel. If there are no tests yet, click Create Your First Test. Otherwise, click Create Test.
3. Name the scenario: Enter a Test Name and optional Description. Describe the workflow behavior you expect, not only the data you used.
4. Fill test inputs: In Setup, provide the same inputs a user would provide when running the Orchestration. Use required input fields first, then add optional context when the scenario needs it.
5. Define expectations: Open Expectations. Select each task you want to validate and enter the expected output or structured data.
6. Choose validation methods: Choose a validation method for each text output or field-level rule for structured output.
7. Create the test: Click Create Test. The test appears in the Tests panel.
8. Run the test: Click Run Test. Review Test Results after the run completes.
Do not write expected outputs that only match one lucky run. A useful test checks durable behavior: source handling, output shape, reviewable findings, required fields, and stop conditions.

Save a successful run as a test

Saving a run as a test is the fastest way to turn a known-good run into a regression check. Use this path after you run an Orchestration with realistic inputs and the output reflects the behavior you want to keep.
1. Open the successful run: Open the Orchestration run that behaved well.
2. Choose Save Run as Test: Use Save Run as Test to create a new test from the run.
3. Review task outputs: Sofie shows the task outputs from the run. Keep the tasks that matter for future validation and uncheck tasks you do not want to include.
4. Edit brittle outputs: Rewrite expected outputs that are too tied to phrasing, incidental examples, or source ordering.
5. Create the test: Save the test and run it once to confirm it passes as a baseline.
Good baseline edits:
  • Keep required findings, source fields, and missing-evidence behavior.
  • Remove wording that does not matter.
  • Change long prose into required bullets, sections, or structured fields.
  • Add a validation method that matches the output instead of defaulting to exact text.

Choose the right validation method

Different outputs need different validation strategies.
| Validation Method | Use it when | Example |
| --- | --- | --- |
| Exact Match | The output must match a stable value. | A status field returns Needs SME review. |
| Semantic Similarity | The meaning matters more than the wording. | A task summarizes a source conflict using different wording. |
| AI Judge | A qualitative check matters. | The output should be accurate, complete, source-aware, and free of unsupported claims. |
| Contains Keywords | Specific words, sources, headings, or warnings must appear. | The output must include missing evidence, batch record, and SME question. |
| Pattern (Regex) | The output must match an identifier, date, code, or format. | Batch numbers, protocol IDs, deviation IDs, or date formats. |
| Length Validation | The output must stay within a reviewable size. | A summary should stay under 300 words. |
| Tool Calls | A task must use or avoid certain enabled capabilities. | A source review task should use document search before drafting. |
Use exact matching for controlled values. Use semantic or AI Judge for narrative work. Use structured output when downstream tasks or reviewers need repeatable fields.
If a test fails because the wording changed but the work is still correct, the validation method is probably too strict.
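The validation methods above can be pictured as simple predicate functions. This is an illustrative sketch only, not how Sofie implements validation; the sample phrases and the `DEV-2024-0117` identifier format are made up for the example.

```python
import re

# Illustrative stand-ins for four validation methods. Each returns
# True when the output passes the check.

def exact_match(output: str, expected: str) -> bool:
    """Exact Match: the output must equal a stable value."""
    return output.strip() == expected.strip()

def contains_keywords(output: str, keywords: list[str]) -> bool:
    """Contains Keywords: every required phrase must appear."""
    lowered = output.lower()
    return all(k.lower() in lowered for k in keywords)

def pattern_match(output: str, pattern: str) -> bool:
    """Pattern (Regex): the output must contain a matching identifier."""
    return re.search(pattern, output) is not None

def length_ok(output: str, max_words: int) -> bool:
    """Length Validation: keep summaries within a reviewable size."""
    return len(output.split()) <= max_words

summary = ("Deviation DEV-2024-0117: missing evidence in batch record, "
           "SME question raised.")
assert exact_match("Needs SME review", " Needs SME review ")
assert contains_keywords(summary, ["missing evidence", "batch record",
                                   "SME question"])
assert pattern_match(summary, r"DEV-\d{4}-\d{4}")  # hypothetical ID format
assert length_ok(summary, 300)
```

A strict method like `exact_match` fails on any wording change, which is exactly why it belongs only on controlled values.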

Validate structured output

Structured output makes Orchestration tests more predictable because Sofie can check fields instead of comparing one block of prose. For a deeper guide to schema design and task handoff, see Structured Outputs. Use structured output for:
  • Evidence tables.
  • Finding lists.
  • CAPA action assessments.
  • Batch record exception trackers.
  • Validation protocol sections.
  • Risk assessment rows.
  • Source gap inventories.
  • Template placeholder values.
When you define expected structured data, add field-level rules:
| Field type | Useful checks |
| --- | --- |
| Text | Exact, semantic, contains, regex, length, or AI Judge. |
| Number | Exact value, tolerance, minimum, or maximum. |
| Yes/No | Expected true or false value. |
| Dropdown | Expected option. |
| Object | Nested field rules. |
| List | Match by exact order, identity fields, semantic similarity, required values, per-item validation, minimum count, or maximum count. |
For lists, choose stable identity fields when order does not matter. For example, a finding list might match items by finding, source, and section instead of row number. Use Allow Missing Items or Allow Extra Items only when variability is acceptable. If a deviation investigation must return every critical source gap, do not allow missing items.
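The identity-field idea can be sketched in a few lines. This is an illustration of the matching concept, not Sofie's list validator; the dict shape and the `allow_missing`/`allow_extra` flag names are assumptions for the example.

```python
# Sketch: match list items by identity fields instead of row order.
def match_by_identity(expected, actual, identity_fields,
                      allow_missing=False, allow_extra=False):
    """True when every expected item has a matching actual item,
    compared on identity fields only, regardless of order."""
    def key(item):
        return tuple(item.get(f) for f in identity_fields)
    missing = {key(i) for i in expected} - {key(i) for i in actual}
    extra = {key(i) for i in actual} - {key(i) for i in expected}
    if missing and not allow_missing:
        return False
    if extra and not allow_extra:
        return False
    return True

expected = [
    {"finding": "Missing page 12", "source": "Batch record", "section": "Filling"},
    {"finding": "Date conflict", "source": "Logbook", "section": "Setup"},
]
actual = list(reversed(expected))  # different order, same findings
assert match_by_identity(expected, actual, ["finding", "source", "section"])
```

With `allow_missing=False`, a deviation investigation that drops a critical source gap fails the test, which matches the guidance above.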

Use AI Judge carefully

AI Judge is useful when the output quality cannot be captured by exact text or simple keywords. It can evaluate criteria such as:
  • Accuracy.
  • Completeness.
  • Citations or source support.
  • Tone.
  • Unsupported claims.
  • Sensitive data.
  • Bias.
  • Toxicity.
  • Custom criteria.
Use AI Judge for reviewable quality checks, not as the only control for high-risk outputs. Good AI Judge custom criteria:
The output must separate confirmed facts, assumptions, and open questions. It must not state a root cause unless the provided sources explicitly support it. If evidence is missing, it must ask for SME review instead of filling the gap.
For structured outputs, use field-level checks first, then add AI Judge for narrative quality or overall completeness.
An AI Judge result is still AI output. Treat it as a testing signal that helps you inspect the workflow, not as final approval.
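The "field-level checks first, AI Judge second" pattern can be sketched as a triage step. The result shape, rule names, and field names below are hypothetical; the judge itself runs inside Sofie and is not modeled here.

```python
# Sketch: apply deterministic field rules first, then queue only the
# narrative fields for qualitative (AI Judge) review.
def triage(result, rules, narrative_fields):
    failures = [name for name, check in rules.items()
                if not check(result.get(name))]
    judge_queue = [f for f in narrative_fields if f not in failures]
    return failures, judge_queue

result = {"status": "Needs SME review", "gap_count": 2,
          "summary": "Two source gaps found; SME question raised."}
rules = {
    "status": lambda v: v == "Needs SME review",          # exact match
    "gap_count": lambda v: isinstance(v, int) and v >= 0, # numeric bound
}
failures, judge_queue = triage(result, rules, ["summary"])
assert failures == []
assert judge_queue == ["summary"]
```

The point of the design is that the judge never becomes the only control: controlled values fail fast on deterministic rules before any qualitative check runs.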

Test tool use without exposing internals

The Tool Calls validation method can check whether a task used expected enabled capabilities. This is useful when a task should search provided context, analyze a file, or create an artifact before answering. Use tool-call testing when:
  • A source review task must search a selected Workspace before summarizing.
  • A CoSheet analysis task should analyze tabular data before drafting commentary.
  • A CoDraft task should create or update a document only after earlier review.
  • A task should request human input before continuing.
  • A task should not use broad search when selected source files are enough.
Keep tool-call expectations focused. Checking every capability call can make tests brittle. Prefer checking the few capability uses that define the workflow. Do not document or rely on implementation-specific capability names in user-facing guidance. In the test editor, use the names your environment exposes, and keep descriptions in terms users understand.
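A focused tool-call expectation reduces to a required set and a forbidden set over the run's call log. The capability names below are placeholders, not Sofie's internal identifiers.

```python
# Sketch: check that required capabilities were used and forbidden
# capabilities were not, over a run's ordered tool-call log.
def tool_calls_ok(calls, required=(), forbidden=()):
    used = set(calls)
    return all(r in used for r in required) and not used & set(forbidden)

# Hypothetical call log for a source review task.
run_calls = ["document_search", "document_search", "draft_summary"]
assert tool_calls_ok(run_calls,
                     required=["document_search"],   # must search first
                     forbidden=["broad_web_search"]) # must stay in sources
```

Checking only these two sets, rather than the full call sequence, keeps the test from breaking every time the task reorders its work.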

Test artifact tracking

When an Orchestration creates or modifies artifacts, the run result should make that work easy to review. Test that behavior before you publish workflows that write to CoDrafts, CoSheets, templates, Workspaces, or other durable Sofie surfaces. After a test run, open the Artifacts section and check:
  • The expected artifact appears.
  • The artifact type is correct.
  • The action label matches what happened, such as Created or Modified.
  • The task shown on the artifact card is the task that should have touched it.
  • Existing artifacts were not modified when the workflow should have created a new artifact.
  • New artifacts were not created when the workflow should have updated an existing artifact.
  • Human review happened before artifact creation or modification when the workflow requires review.
Good artifact-tracking test cases:
| Scenario | What to verify |
| --- | --- |
| Template fill | The run creates the expected CoDraft and lists it in Artifacts. |
| CoSheet update | The run modifies the selected CoSheet instead of creating an unrelated sheet. |
| Review-only workflow | The run references or reviews sources without creating a new artifact. |
| Human review gate | The artifact is created or modified only after the reviewer approves the step. |
| Rerun | The run does not duplicate artifacts unless duplicate creation is intended. |
Use this review prompt after test runs:
Check artifact tracking for this test run. Did the run create or modify the expected artifacts, avoid unintended artifact changes, and show enough information for a reviewer to open each artifact and inspect it?
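The artifact checks above amount to comparing expected records against what the run reports. The record shape (name, type, action, task) is an assumption for illustration; use whatever fields your run result actually shows.

```python
# Sketch: verify that each expected artifact appears with the right
# type, action label, and owning task, and flag anything unexpected.
def check_artifacts(artifacts, expected):
    def key(a):
        return (a["name"], a["type"], a["action"], a["task"])
    actual_keys = {key(a) for a in artifacts}
    missing = [e for e in expected if key(e) not in actual_keys]
    unexpected = actual_keys - {key(e) for e in expected}
    return missing, unexpected

run_artifacts = [{"name": "Deviation summary", "type": "CoDraft",
                  "action": "Created", "task": "Draft section"}]
missing, unexpected = check_artifacts(run_artifacts, run_artifacts)
assert missing == [] and unexpected == set()
```

A "Modified" action where you expected "Created" (or vice versa) shows up as both a missing and an unexpected entry, which is the rerun-duplication failure mode in the table above.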

Make tests more deterministic

AI workflows can vary. Testing works best when you reduce unnecessary variability and validate the parts that should remain stable. Use these practices:
  • Use the same realistic inputs for repeat runs.
  • Keep the source set controlled. Use a focused Workspace or specific files instead of a broad knowledge pool.
  • Prefer structured output for findings, fields, metrics, and template values.
  • Validate exact values only when the value should never vary.
  • Validate narrative meaning with Semantic Similarity or AI Judge.
  • Use Contains Keywords for required warnings, headings, source names, or stop conditions.
  • Use Pattern (Regex) for identifiers and formats.
  • Test intermediate tasks so you can see where behavior changed.
  • Use human review before tasks that depend on judgment.
  • Turn off Allow Parallel Task Execution when tasks must run in a strict order.
  • Keep Long-Term Memory (Cross-Run Learning) off in baseline tests unless memory behavior is part of what you are testing.
  • Require citations when source support matters, then check that citations appear and point to relevant sources.
The goal is not identical prose every time. The goal is repeatable workflow behavior.

Test parallel execution

Allow Parallel Task Execution lets independent tasks run at the same time. It can improve performance, but it also exposes hidden dependencies. Before enabling parallel execution broadly, add tests that prove:
  • Independent tasks do not rely on each other’s unsaved outputs.
  • Tasks with dependencies still run after the information they need exists.
  • The final synthesis task receives the expected intermediate findings.
  • Human review still happens before dependent conclusion or artifact-creation tasks.
  • The run history does not show large increases in failures, duration, or capability use.
If a test passes sequentially but fails when parallel execution is enabled, inspect the workflow design before changing the test. The task may depend on another task’s output but not declare that dependency clearly enough. Good pattern:
Task 1 reviews batch record completeness.
Task 2 reviews deviation description and immediate actions.
Task 3 reviews related CAPA history.
Task 4 synthesizes Tasks 1-3 and requires human review before conclusion.
Tasks 1-3 may be safe to run in parallel. Task 4 should wait for all three.
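The same reasoning can be made explicit as a dependency graph: tasks whose prerequisites are complete form a wave that may run in parallel. This is a sketch of the concept, assuming a task-to-dependencies mapping; it is not how Orchestrate schedules work.

```python
# Sketch: group tasks into waves of safely-parallel work.
def parallel_waves(deps):
    """deps maps task -> set of tasks it must wait for."""
    remaining = {t: set(d) for t, d in deps.items()}
    done, waves = set(), []
    while remaining:
        # A task is ready when everything it waits for is done.
        wave = sorted(t for t, d in remaining.items() if d <= done)
        if not wave:
            raise ValueError("Cyclic or undeclared dependency")
        waves.append(wave)
        done.update(wave)
        for t in wave:
            del remaining[t]
    return waves

deps = {"Task 1": set(), "Task 2": set(), "Task 3": set(),
        "Task 4": {"Task 1", "Task 2", "Task 3"}}
assert parallel_waves(deps) == [["Task 1", "Task 2", "Task 3"], ["Task 4"]]
```

A test that passes sequentially but fails in parallel usually means a dependency exists in practice but is missing from the declared graph.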

Test Memory behavior

Orchestrations can use short-term and long-term Memory. Use Short-Term Memory (Within Run) when tasks need to share findings inside the same run. Test that later tasks retrieve the right current-run information, such as missing evidence, outlier metrics, or source conflicts. Use Long-Term Memory (Cross-Run Learning) when agents should learn stable workflow preferences across runs. Test it separately from source facts. Good long-term Memory examples:
  • The reviewer prefers a table with Fact, Source, Assumption, and Question columns.
  • The drafter should keep CAPA effectiveness summaries under a defined length.
  • The analyst should explain outliers before calculating trends.
Bad long-term Memory examples:
  • The root cause for a specific deviation.
  • A batch-specific conclusion.
  • A one-time acceptance criterion.
  • A final reviewer decision.
When testing long-term Memory:
1. Run a baseline without memory effects: Run the test with explicit inputs and source rules. Confirm the workflow can pass without depending on remembered project facts.
2. Enable long-term Memory for a narrow behavior: Turn on cross-run learning only when you want the agent to learn reusable preferences.
3. Run the same scenario again: Check whether the output improved in the intended way without changing source-backed facts.
4. Inspect agent memories: Open the agent’s Long-Term Memories tab. Delete memories that are stale, too specific, or source-fact-like.
Long-term Memory can influence future runs. Do not use it as the source of record for quality decisions, batch facts, acceptance criteria, or final conclusions.

Use execution time constraints

Execution time constraints can help catch major performance changes. They are most useful when a task has a stable expected range and failures are easy to interpret. Use them for:
  • A task that should finish quickly but starts searching too broadly.
  • A structured extraction task that suddenly becomes much slower after a source or tool change.
  • A workflow where latency affects user adoption.
Avoid using narrow timing limits for tasks that depend on large files, external sources, or variable Workspace search. Normal run conditions can change duration even when the output is correct.
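A time constraint is just a wide band around an expected duration. The bounds below are illustrative, not product defaults; the point is to leave room for normal variability.

```python
# Sketch: an execution time constraint with optional min/max bounds,
# tuned to catch major regressions rather than normal jitter.
def duration_ok(seconds, min_s=None, max_s=None):
    if min_s is not None and seconds < min_s:
        return False  # suspiciously fast: task may have skipped work
    if max_s is not None and seconds > max_s:
        return False  # major slowdown: search may have gone too broad
    return True

# A structured extraction task expected to finish in 5-120 seconds.
assert duration_ok(42, min_s=5, max_s=120)
assert not duration_ok(600, max_s=120)
```

A minimum bound can be as informative as a maximum: a task finishing almost instantly may have skipped the search it was supposed to run.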

Review test results

After you click Run Test, open Test Results. Review:
  • Overall passed, failed, and total task counts.
  • Each task’s status.
  • Validation messages.
  • Validation score when available.
  • Field validation details for structured output.
  • AI Judge results when enabled.
Read failures from the task level up. A final synthesis failure may be caused by an earlier source review task, a missing input, or a structured field mismatch. Use this review pattern:
1. Find the first failed task: Start with the earliest failing task. Later failures may be downstream effects.
2. Read the validation message: Check whether the test failed because the output is wrong, the expectation is outdated, or the validation method is too strict.
3. Compare expected behavior to actual behavior: Decide whether the Orchestration should change or the test should change.
4. Make one fix: Edit the input label, task instruction, source rule, output mode, review point, tool choice, or test expectation.
5. Run the test again: Confirm the fix changed the intended behavior and did not create a new failure.
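The "read failures from the task level up" step is a simple scan over ordered task results. The result shape here is an assumption for illustration.

```python
# Sketch: return the earliest failing task in an ordered result list,
# since later failures may only be downstream effects.
def first_failure(results):
    """results is an ordered list of (task_name, passed) pairs."""
    for name, passed in results:
        if not passed:
            return name
    return None

run = [("Review sources", True), ("Extract facts", False),
       ("Synthesize", False)]
assert first_failure(run) == "Extract facts"
```

Here the synthesis failure is most likely a consequence of the extraction failure, so "Extract facts" is where the investigation starts.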

Use run history

Test Run History shows historical executions for a test. Use it to understand behavior over time, not just the most recent pass or fail. The history view can show:
  • Pass rate.
  • Recent runs.
  • Duration trends.
  • Token usage trends.
  • Capability usage.
  • Task reliability.
  • Task performance details.
  • Tool usage by task.
  • All past runs.
Use run history when:
  • A test passes sometimes and fails other times.
  • A workflow becomes slower after a change.
  • Token usage jumps unexpectedly.
  • A specific task is unreliable.
  • Capability usage changes after a tool, input, or source-rule edit.
  • Parallel execution changes performance or reliability.
A test that passes but uses far more tokens, time, or capability calls than before may still need attention. It can signal vague source rules or a task that is searching too broadly.
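Two of these history signals are easy to compute from run records: pass rate and a token-usage jump against a baseline. The record shape and the 2x threshold are assumptions for illustration.

```python
# Sketch: run-history signals over a list of run records.
def pass_rate(runs):
    return sum(1 for r in runs if r["passed"]) / len(runs)

def token_jump(runs, baseline_tokens, factor=2.0):
    """Flag the test when the latest run uses far more tokens than
    the baseline, even when it still passes."""
    return runs[-1]["tokens"] > baseline_tokens * factor

history = [{"passed": True, "tokens": 1200},
           {"passed": False, "tokens": 1300},
           {"passed": True, "tokens": 5100}]
assert pass_rate(history) == 2 / 3
assert token_jump(history, baseline_tokens=1250)  # passed, but 4x tokens
```

The last run passes, yet the token jump is the signal worth chasing: it often points at vague source rules or an over-broad search.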

Validate against past runs

Use Validate Against Past Run to compare a test with completed historical Orchestration runs. Sofie sorts past runs by input match score and shows which inputs matched or differed. Use it when:
  • You want to check whether the current test expectations match a known historical run.
  • You are turning a past run into a durable baseline.
  • You want to compare behavior before and after a workflow edit.
  • A user reports that the workflow used to behave differently.
Pay attention to input match. A low-match run may still be useful for review, but it should not become the baseline for a different scenario.
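An input match score can be thought of as the fraction of test inputs whose values match the past run. The scoring rule below is an assumption for illustration, not Sofie's exact formula.

```python
# Sketch: fraction of test inputs that match a past run's inputs.
def input_match_score(test_inputs, run_inputs):
    if not test_inputs:
        return 1.0
    matched = sum(1 for k, v in test_inputs.items()
                  if run_inputs.get(k) == v)
    return matched / len(test_inputs)

test_inputs = {"workspace": "Deviations Q3", "batch": "B-1042",
               "date_range": "2024-07"}
past_run = {"workspace": "Deviations Q3", "batch": "B-0999",
            "date_range": "2024-07"}
score = input_match_score(test_inputs, past_run)
assert abs(score - 2 / 3) < 1e-9  # different batch: review, don't baseline
```

A two-thirds match like this one may still be worth reviewing, but it describes a different scenario and should not become the baseline.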

Handle drift

Drift means the saved test no longer matches the current Orchestration structure. It can happen when you remove a task, change a structured output schema, or change inputs. When Drift Detected appears:
1. Review the drift issues: Check whether tasks, inputs, or output schemas changed.
2. Decide if the change was intended: If the workflow changed on purpose, update the test. If not, inspect the Orchestration change first.
3. Use Fix Drift Issues when appropriate: Sofie can remove deleted task expectations, update schemas to match the current Orchestration, or remove inputs that no longer exist.
4. Run the test again: Confirm the updated test validates the current workflow behavior.
Do not automatically fix drift without reading it. Drift can be the first signal that a workflow edit changed the contract users depend on.
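At its core, drift detection is a structural diff between what the saved test expects and what the current workflow defines. The dict shapes below are illustrative, not the product's data model.

```python
# Sketch: diff a saved test's expectations against the current
# workflow structure to surface drift issues.
def detect_drift(test, workflow):
    issues = []
    for t in test["task_expectations"]:
        if t not in workflow["tasks"]:
            issues.append(f"Task removed or renamed: {t}")
    for i in test["inputs"]:
        if i not in workflow["inputs"]:
            issues.append(f"Input no longer exists: {i}")
    return issues

saved_test = {"task_expectations": ["Extract facts", "Synthesize"],
              "inputs": ["workspace", "batch"]}
workflow = {"tasks": ["Extract facts", "Draft", "Synthesize"],
            "inputs": ["workspace"]}
assert detect_drift(saved_test, workflow) == ["Input no longer exists: batch"]
```

Reading the diff before fixing it is the point: a removed input may be an intentional simplification, or the first sign that a contract users depend on has changed.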

Build a useful test suite

A single passing test is a start. A useful Orchestration has a small suite of realistic scenarios. For most published Orchestrations, add:
| Test type | What it proves |
| --- | --- |
| Normal case | The main workflow works with typical inputs. |
| Missing optional input | The workflow still runs and notes missing context appropriately. |
| Missing required source | The workflow stops or asks for the source instead of guessing. |
| Conflicting sources | The workflow lists the conflict and requests review. |
| Large Workspace | The workflow uses the right sources instead of irrelevant context. |
| Artifact output | CoDraft, CoSheet, or template output keeps required structure. |
| Artifact tracking | The run result shows the artifacts created or modified by the workflow. |
| Human review path | The workflow pauses before judgment, save, or artifact creation. |
| Edge case | The workflow handles unusual but expected input variation. |
For high-use workflows, add regression tests for common user mistakes. Examples:
  • Wrong document uploaded.
  • Missing date range.
  • CoSheet with blank metric rows.
  • CoDraft template missing expected placeholders.
  • Workspace contains old and current versions of the same source.
  • Meeting notes conflict with signed source material.

Life sciences examples

Deviation investigation

Test that the workflow:
  • Extracts confirmed facts and timeline before analysis.
  • Separates facts, assumptions, gaps, and SME questions.
  • Does not state root cause before the review point.
  • Identifies missing batch record pages or conflicting dates.
  • Uses source-backed language in the final CoDraft section.
Good validation choices:
  • Structured output for fact table and source gaps.
  • Contains checks for missing evidence, source conflict, or SME question.
  • AI Judge to check that conclusions are not overstated.

CAPA effectiveness check

Test that the workflow:
  • Extracts effectiveness criteria before analyzing results.
  • Uses the metrics CoSheet for observed performance.
  • Flags missing observation windows or unclear thresholds.
  • Separates evidence from conclusion.
  • Pauses before drafting effectiveness language.
Good validation choices:
  • Numeric checks for metric fields.
  • Structured output for criteria, evidence, result, and gap rows.
  • Length validation for reviewer summaries.
  • AI Judge for completeness and source support.

Validation protocol generation

Test that the workflow:
  • Uses URS, risk assessment, and template inputs.
  • Does not invent acceptance criteria.
  • Leaves placeholders when SME confirmation is needed.
  • Fills template sections consistently.
  • Keeps source gaps visible.
Good validation choices:
  • Structured output for test sections and acceptance-criteria placeholders.
  • Contains checks for SME confirmation required.
  • Regex for protocol IDs or requirement IDs.

Batch record review

Test that the workflow:
  • Detects missing pages or incomplete sections.
  • Creates an exception table.
  • Separates observation, possible impact, and reviewer question.
  • Does not recommend disposition before review.
Good validation choices:
  • Structured list validation for exceptions.
  • Contains checks for required section names.
  • AI Judge for unsupported disposition language.

Troubleshoot failing tests

| Symptom | Likely cause | What to do |
| --- | --- | --- |
| Test fails after task rename or deletion | Test drift | Review Drift Detected and update the test after confirming the workflow change. |
| Output is correct but text comparison fails | Validation is too strict | Switch from exact matching to semantic, contains, regex, length, or structured output. |
| Different runs use different sources | Source rules are too broad | Narrow inputs, use a focused Workspace, and state source priority. |
| Parallel runs fail intermittently | Hidden task dependency | Make dependency explicit or turn off parallel execution for that workflow. |
| Long-term Memory changes output facts | Memory is being used for source facts | Remove stale memories and keep project facts in inputs, Workspace, or artifacts. |
| Test passes but final output is hard to review | Expectations only cover the end result | Add task-level checks and structured intermediate outputs. |
| Test gets slower over time | Search or tool use expanded | Review run history, duration trend, token trend, and capability usage. |
| AI Judge results feel inconsistent | Criteria are too vague | Use field-level checks first and write sharper custom criteria. |

Publishing checklist

Before publishing an Orchestration, run through this checklist:
  • Run All Tests passes or failures are understood.
  • Tests include normal, missing-source, and conflicting-source scenarios.
  • Structured outputs validate required fields.
  • Human review is tested before conclusions or shared outputs.
  • Source-backed tasks require citations where useful.
  • Parallel execution is tested if enabled.
  • Long-term Memory is reviewed if enabled.
  • Run history does not show unexplained reliability, duration, token, or capability-use changes.
  • Drift issues are resolved intentionally.
  • A user who did not build the Orchestration can understand the test names and descriptions.

Useful prompts

Use these prompts in Sofie chat while designing or improving tests.
Design a test suite for this Orchestration. Include normal inputs, missing required sources, missing optional sources, conflicting sources, a large Workspace case, and expected task outputs. Recommend validation methods for each task.
Review this Orchestration test. Identify expectations that are too tied to exact wording, source order, or one lucky run. Suggest structured fields, semantic checks, contains checks, regex checks, length checks, and AI Judge criteria that would make it more durable.
Review this failed Orchestration test result. Find the first meaningful failure, explain whether the Orchestration or the test expectation should change, and suggest the smallest edit to verify the intended behavior.
Review this Orchestration for parallel execution risk. Identify tasks that can safely run independently, tasks that must wait for earlier outputs, and tests I should run before enabling parallel task execution.
Review the long-term memories for this Orchestration's agents. Identify memories that are useful reusable preferences, memories that are stale, and memories that look like source facts or one-time decisions that should be deleted.