Tests do not make AI fully deterministic. They make the workflow more deterministic at the process level by fixing inputs, checking output structure, tracking changes over time, and showing exactly where behavior changed.
Why Orchestration testing matters
An Orchestration is more than a prompt. It can ask for inputs, use tools, search a Workspace, create artifacts, pause for human review, run tasks in parallel, and use Memory. Small edits can change how the whole workflow behaves. Testing helps you answer practical questions:
- Do users provide the right inputs?
- Does each task still produce the expected kind of output?
- Does the workflow stop when required evidence is missing?
- Does Sofie avoid unsupported conclusions?
- Do citations, source references, or required terms appear when expected?
- Do structured outputs keep the same fields?
- Do parallel tasks still complete without hidden order dependencies?
- Does long-term Memory improve the workflow without replacing current sources?
- Did a task start using a capability you did not expect?
- Did a template, schema, or input change make older tests drift?
When to create tests
Create tests early. Do not wait until the Orchestration is “finished.” Use tests:
- Before publishing an Orchestration.
- After a successful realistic run that you want to preserve as a baseline.
- Before changing required inputs, source rules, task order, tools, output mode, or human review points.
- After changing a CoDraft template, CoSheet structure, or Workspace source set that the workflow depends on.
- When you enable or change Long-Term Memory (Cross-Run Learning).
- When you enable Allow Parallel Task Execution for tasks that used to run one after another.
- When a user reports that a published Orchestration behaves differently than expected.
- Before sharing a workflow with a broader team.
What a test contains
Each Orchestration test can include:
| Part | What it does | How to use it |
|---|---|---|
| Test Name | Identifies the scenario. | Use names like Deviation with missing batch record page or CAPA metrics with outlier lots. |
| Description | Explains what the test proves. | State the behavior the workflow should keep. |
| Test Inputs | Provides the values the Orchestration needs to run. | Use realistic Workspaces, files, CoDrafts, CoMeetings, CoSheets, dates, and option values. |
| Task expectations | Defines expected output for one or more tasks. | Test important intermediate tasks, not only the final answer. |
| Validation Method | Chooses how Sofie checks text output. | Use exact, semantic, AI Judge, contains, regex, length, or tool-call validation based on the output. |
| Structured output checks | Checks named fields and arrays. | Use them when the task returns fields, tables, findings, metrics, or repeatable sections. |
| Narrative quality checks | Checks narrative text generated with structured output. | Use them for citations, unsupported claims, sensitive data, tone, or custom quality criteria. |
| Execution time constraints | Sets minimum or maximum expected task duration. | Use sparingly to catch major performance regressions, not normal run variability. |
| Run history | Shows how the test behaved over time. | Use it to review pass rate, task reliability, duration, token use, and capability use. |
Create a manual test
Manual tests are best when you know the scenario you want to preserve before you run it.
Open the Orchestration editor
Open the Orchestration in Orchestrate. Keep it as a draft while you are building or changing tests.
Open Tests
Open the Tests panel. If there are no tests yet, click Create Your First Test. Otherwise, click Create Test.
Name the scenario
Enter a Test Name and optional Description. Describe the workflow behavior you expect, not only the data you used.
Fill test inputs
In Setup, provide the same inputs a user would provide when running the Orchestration. Use required input fields first, then add optional context when the scenario needs it.
Define expectations
Open Expectations. Select each task you want to validate and enter the expected output or structured data.
Choose validation methods
Choose a validation method for each text output or field-level rule for structured output.
Save a successful run as a test
Saving a run as a test is the fastest way to turn a known-good run into a regression check. Use this path after you run an Orchestration with realistic inputs and the output reflects the behavior you want to keep.
Review task outputs
Sofie shows the task outputs from the run. Keep the tasks that matter for future validation and uncheck tasks you do not want to include.
Edit brittle outputs
Rewrite expected outputs that are too tied to phrasing, incidental examples, or source ordering.
- Keep required findings, source fields, and missing-evidence behavior.
- Remove wording that does not matter.
- Change long prose into required bullets, sections, or structured fields.
- Add a validation method that matches the output instead of defaulting to exact text.
Choose the right validation method
Different outputs need different validation strategies.
| Validation Method | Use it when | Example |
|---|---|---|
| Exact Match | The output must match a stable value. | A status field returns Needs SME review. |
| Semantic Similarity | The meaning matters more than the wording. | A task summarizes a source conflict using different wording. |
| AI Judge | A qualitative check matters. | The output should be accurate, complete, source-aware, and free of unsupported claims. |
| Contains Keywords | Specific words, sources, headings, or warnings must appear. | The output must include missing evidence, batch record, and SME question. |
| Pattern (Regex) | The output must match an identifier, date, code, or format. | Batch numbers, protocol IDs, deviation IDs, or date formats. |
| Length Validation | The output must stay within a reviewable size. | A summary should stay under 300 words. |
| Tool Calls | A task must use or avoid certain enabled capabilities. | A source review task should use document search before drafting. |
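As a sketch, the simpler text methods reduce to checks like the following. The function names and the sample deviation ID format (`DEV-2024-0147`) are assumptions for this illustration, not Sofie's internal API.

```python
import re

def contains_keywords(output: str, keywords: list[str]) -> bool:
    """Pass only if every required keyword appears in the output."""
    return all(kw.lower() in output.lower() for kw in keywords)

def matches_pattern(output: str, pattern: str) -> bool:
    """Pass only if the output contains a match for the regex pattern."""
    return re.search(pattern, output) is not None

def within_length(output: str, max_words: int) -> bool:
    """Pass only if the output stays within a reviewable size."""
    return len(output.split()) <= max_words

summary = "Missing evidence in batch record DEV-2024-0147; raise an SME question."
assert contains_keywords(summary, ["missing evidence", "SME question"])
assert matches_pattern(summary, r"DEV-\d{4}-\d{4}")
assert within_length(summary, 300)
```

The point of the sketch is that each method fails for a different reason, which makes a failed test easier to diagnose than a single exact-text mismatch.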
Validate structured output
Structured output makes Orchestration tests more predictable because Sofie can check fields instead of comparing one block of prose. For a deeper guide to schema design and task handoff, see Structured Outputs. Use structured output for:
- Evidence tables.
- Finding lists.
- CAPA action assessments.
- Batch record exception trackers.
- Validation protocol sections.
- Risk assessment rows.
- Source gap inventories.
- Template placeholder values.
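Structured output turns one block of prose into named, checkable fields. A minimal sketch of a finding record, where the field names and values are illustrative assumptions rather than a required schema:

```python
import json

# Illustrative structured output for a source-gap finding. Field names
# and values are assumptions for this example, not Sofie's schema.
finding = {
    "finding": "Missing batch record page",
    "source": "Batch record 2024-017",
    "severity": "high",
    "sme_question": "Can QA confirm whether page 4 was scanned?",
}

# Field-level checks replace brittle whole-text comparison.
assert finding["severity"] in {"low", "medium", "high"}
assert finding["sme_question"].endswith("?")
print(json.dumps(finding, indent=2))
```

Useful checks for each field type are summarized below.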
| Field type | Useful checks |
|---|---|
| Text | Exact, semantic, contains, regex, length, or AI Judge. |
| Number | Exact value, tolerance, minimum, or maximum. |
| Yes/No | Expected true or false value. |
| Dropdown | Expected option. |
| Object | Nested field rules. |
| List | Match by exact order, identity fields, semantic similarity, required values, per-item validation, minimum count, or maximum count. |
For lists, prefer matching items by identity fields such as finding, source, and section instead of row number.
Use Allow Missing Items or Allow Extra Items only when variability is acceptable. If a deviation investigation must return every critical source gap, do not allow missing items.
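Identity-field matching can be sketched as a set comparison. The function name, field names, and flag names below are assumptions for this illustration, not Sofie's validation API.

```python
def validate_list(expected, actual, identity_fields,
                  allow_missing=False, allow_extra=False):
    """Match expected and actual items by identity fields, not row order."""
    def key(item):
        return tuple(item[f] for f in identity_fields)

    expected_keys = {key(item) for item in expected}
    actual_keys = {key(item) for item in actual}

    if (expected_keys - actual_keys) and not allow_missing:
        return False  # a required item is absent
    if (actual_keys - expected_keys) and not allow_extra:
        return False  # an unexpected item appeared
    return True

expected = [{"finding": "Missing page 4", "source": "Batch record"}]
actual = [
    {"finding": "Missing page 4", "source": "Batch record"},
    {"finding": "Date conflict", "source": "Logbook"},
]
# Extra items fail by default; allow them only when variability is acceptable.
assert not validate_list(expected, actual, ["finding", "source"])
assert validate_list(expected, actual, ["finding", "source"], allow_extra=True)
```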
Use AI Judge carefully
AI Judge is useful when the output quality cannot be captured by exact text or simple keywords. It can evaluate criteria such as:
- Accuracy.
- Completeness.
- Citations or source support.
- Tone.
- Unsupported claims.
- Sensitive data.
- Bias.
- Toxicity.
- Custom criteria.
Test tool use without exposing internals
The Tool Calls validation method can check whether a task used expected enabled capabilities. This is useful when a task should search provided context, analyze a file, or create an artifact before answering. Use tool-call testing when:
- A source review task must search a selected Workspace before summarizing.
- A CoSheet analysis task should analyze tabular data before drafting commentary.
- A CoDraft task should create or update a document only after earlier review.
- A task should request human input before continuing.
- A task should not use broad search when selected source files are enough.
Test artifact tracking
When an Orchestration creates or modifies artifacts, the run result should make that work easy to review. Test that behavior before you publish workflows that write to CoDrafts, CoSheets, templates, Workspaces, or other durable Sofie surfaces. After a test run, open the Artifacts section and check:
- The expected artifact appears.
- The artifact type is correct.
- The action label matches what happened, such as Created or Modified.
- The task shown on the artifact card is the task that should have touched it.
- Existing artifacts were not modified when the workflow should have created a new artifact.
- New artifacts were not created when the workflow should have updated an existing artifact.
- Human review happened before artifact creation or modification when the workflow requires review.
| Scenario | What to verify |
|---|---|
| Template fill | The run creates the expected CoDraft and lists it in Artifacts. |
| CoSheet update | The run modifies the selected CoSheet instead of creating an unrelated sheet. |
| Review-only workflow | The run references or reviews sources without creating a new artifact. |
| Human review gate | The artifact is created or modified only after the reviewer approves the step. |
| Rerun | The run does not duplicate artifacts unless duplicate creation is intended. |
Make tests more deterministic
AI workflows can vary. Testing works best when you reduce unnecessary variability and validate the parts that should remain stable. Use these practices:
- Use the same realistic inputs for repeat runs.
- Keep the source set controlled. Use a focused Workspace or specific files instead of a broad knowledge pool.
- Prefer structured output for findings, fields, metrics, and template values.
- Validate exact values only when the value should never vary.
- Validate narrative meaning with Semantic Similarity or AI Judge.
- Use Contains Keywords for required warnings, headings, source names, or stop conditions.
- Use Pattern (Regex) for identifiers and formats.
- Test intermediate tasks so you can see where behavior changed.
- Use human review before tasks that depend on judgment.
- Turn off Allow Parallel Task Execution when tasks must run in a strict order.
- Keep Long-Term Memory (Cross-Run Learning) off in baseline tests unless memory behavior is part of what you are testing.
- Require citations when source support matters, then check that citations appear and point to relevant sources.
Test parallel execution
Allow Parallel Task Execution lets independent tasks run at the same time. It can improve performance, but it also exposes hidden dependencies. Before enabling parallel execution broadly, add tests that prove:
- Independent tasks do not rely on each other’s unsaved outputs.
- Tasks with dependencies still run after the information they need exists.
- The final synthesis task receives the expected intermediate findings.
- Human review still happens before dependent conclusion or artifact-creation tasks.
- The run history does not show large increases in failures, duration, or capability use.
Test Memory behavior
Orchestrations can use short-term and long-term Memory. Use Short-Term Memory (Within Run) when tasks need to share findings inside the same run. Test that later tasks retrieve the right current-run information, such as missing evidence, outlier metrics, or source conflicts. Use Long-Term Memory (Cross-Run Learning) when agents should learn stable workflow preferences across runs. Test it separately from source facts. Good long-term Memory examples:
- The reviewer prefers a table with `Fact`, `Source`, `Assumption`, and `Question` columns.
- The drafter should keep CAPA effectiveness summaries under a defined length.
- The analyst should explain outliers before calculating trends.
Do not store run-specific facts in long-term Memory, such as:
- The root cause for a specific deviation.
- A batch-specific conclusion.
- A one-time acceptance criterion.
- A final reviewer decision.
Run a baseline without memory effects
Run the test with explicit inputs and source rules. Confirm the workflow can pass without depending on remembered project facts.
Enable long-term Memory for a narrow behavior
Turn on cross-run learning only when you want the agent to learn reusable preferences.
Run the same scenario again
Check whether the output improved in the intended way without changing source-backed facts.
Use execution time constraints
Execution time constraints can help catch major performance changes. They are most useful when a task has a stable expected range and failures are easy to interpret. Use them for:
- A task that should finish quickly but starts searching too broadly.
- A structured extraction task that suddenly becomes much slower after a source or tool change.
- A workflow where latency affects user adoption.
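A duration constraint reduces to a simple bounds check. The bounds below are assumptions for this illustration; set yours from run history, wide enough to tolerate normal run variability.

```python
import time

MIN_SECONDS = 1    # a suspiciously fast run may have skipped work
MAX_SECONDS = 120  # catches major regressions such as overly broad search

start = time.monotonic()
time.sleep(1.1)    # stands in for the task being timed
elapsed = time.monotonic() - start

assert MIN_SECONDS <= elapsed <= MAX_SECONDS
```

Because both a too-fast and a too-slow run fail, the check catches skipped work as well as regressions.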
Review test results
After you click Run Test, open Test Results. Review:
- Overall passed, failed, and total task counts.
- Each task’s status.
- Validation messages.
- Validation score when available.
- Field validation details for structured output.
- AI Judge results when enabled.
Find the first failed task
Start with the earliest failing task. Later failures may be downstream effects.
Read the validation message
Check whether the test failed because the output is wrong, the expectation is outdated, or the validation method is too strict.
Compare expected behavior to actual behavior
Decide whether the Orchestration should change or the test should change.
Make one fix
Edit the input label, task instruction, source rule, output mode, review point, tool choice, or test expectation.
Use run history
Test Run History shows historical executions for a test. Use it to understand behavior over time, not just the most recent pass or fail. The history view can show:
- Pass rate.
- Recent runs.
- Duration trends.
- Token usage trends.
- Capability usage.
- Task reliability.
- Task performance details.
- Tool usage by task.
- All past runs.
Review run history when:
- A test passes sometimes and fails other times.
- A workflow becomes slower after a change.
- Token usage jumps unexpectedly.
- A specific task is unreliable.
- Capability usage changes after a tool, input, or source-rule edit.
- Parallel execution changes performance or reliability.
Validate against past runs
Use Validate Against Past Run to compare a test with completed historical Orchestration runs. Sofie sorts past runs by input match score and shows which inputs matched or differed. Use it when:
- You want to check whether the current test expectations match a known historical run.
- You are turning a past run into a durable baseline.
- You want to compare behavior before and after a workflow edit.
- A user reports that the workflow used to behave differently.
Handle drift
Drift means the saved test no longer matches the current Orchestration structure. It can happen when you remove a task, change a structured output schema, or change inputs. When Drift Detected appears:
Decide if the change was intended
If the workflow changed on purpose, update the test. If not, inspect the Orchestration change first.
Use Fix Drift Issues when appropriate
Sofie can remove deleted task expectations, update schemas to match the current Orchestration, or remove inputs that no longer exist.
Build a useful test suite
A single passing test is a start. A useful Orchestration has a small suite of realistic scenarios. For most published Orchestrations, add:
| Test type | What it proves |
|---|---|
| Normal case | The main workflow works with typical inputs. |
| Missing optional input | The workflow still runs and notes missing context appropriately. |
| Missing required source | The workflow stops or asks for the source instead of guessing. |
| Conflicting sources | The workflow lists the conflict and requests review. |
| Large Workspace | The workflow uses the right sources instead of irrelevant context. |
| Artifact output | CoDraft, CoSheet, or template output keeps required structure. |
| Artifact tracking | The run result shows the artifacts created or modified by the workflow. |
| Human review path | The workflow pauses before judgment, save, or artifact creation. |
| Edge case | The workflow handles unusual but expected input variation. |
Edge case examples include:
- Wrong document uploaded.
- Missing date range.
- CoSheet with blank metric rows.
- CoDraft template missing expected placeholders.
- Workspace contains old and current versions of the same source.
- Meeting notes conflict with signed source material.
Life sciences examples
Deviation investigation
Test that the workflow:
- Extracts confirmed facts and timeline before analysis.
- Separates facts, assumptions, gaps, and SME questions.
- Does not state root cause before the review point.
- Identifies missing batch record pages or conflicting dates.
- Uses source-backed language in the final CoDraft section.
Useful checks:
- Structured output for fact table and source gaps.
- Contains checks for `missing evidence`, `source conflict`, or `SME question`.
- AI Judge to check that conclusions are not overstated.
CAPA effectiveness check
Test that the workflow:
- Extracts effectiveness criteria before analyzing results.
- Uses the metrics CoSheet for observed performance.
- Flags missing observation windows or unclear thresholds.
- Separates evidence from conclusion.
- Pauses before drafting effectiveness language.
Useful checks:
- Numeric checks for metric fields.
- Structured output for criteria, evidence, result, and gap rows.
- Length validation for reviewer summaries.
- AI Judge for completeness and source support.
Validation protocol generation
Test that the workflow:
- Uses URS, risk assessment, and template inputs.
- Does not invent acceptance criteria.
- Leaves placeholders when SME confirmation is needed.
- Fills template sections consistently.
- Keeps source gaps visible.
Useful checks:
- Structured output for test sections and acceptance-criteria placeholders.
- Contains checks for `SME confirmation required`.
- Regex for protocol IDs or requirement IDs.
Batch record review
Test that the workflow:
- Detects missing pages or incomplete sections.
- Creates an exception table.
- Separates observation, possible impact, and reviewer question.
- Does not recommend disposition before review.
Useful checks:
- Structured list validation for exceptions.
- Contains checks for required section names.
- AI Judge for unsupported disposition language.
Troubleshoot failing tests
| Symptom | Likely cause | What to do |
|---|---|---|
| Test fails after task rename or deletion | Test drift | Review Drift Detected and update the test after confirming the workflow change. |
| Output is correct but text comparison fails | Validation is too strict | Switch from exact matching to semantic, contains, regex, length, or structured output. |
| Different runs use different sources | Source rules are too broad | Narrow inputs, use a focused Workspace, and state source priority. |
| Parallel runs fail intermittently | Hidden task dependency | Make dependency explicit or turn off parallel execution for that workflow. |
| Long-term Memory changes output facts | Memory is being used for source facts | Remove stale memories and keep project facts in inputs, Workspace, or artifacts. |
| Test passes but final output is hard to review | Expectations only cover the end result | Add task-level checks and structured intermediate outputs. |
| Test gets slower over time | Search or tool use expanded | Review run history, duration trend, token trend, and capability usage. |
| AI Judge results feel inconsistent | Criteria are too vague | Use field-level checks first and write sharper custom criteria. |
Publishing checklist
Before publishing an Orchestration, run through this checklist:
- Run All Tests passes or failures are understood.
- Tests include normal, missing-source, and conflicting-source scenarios.
- Structured outputs validate required fields.
- Human review is tested before conclusions or shared outputs.
- Source-backed tasks require citations where useful.
- Parallel execution is tested if enabled.
- Long-term Memory is reviewed if enabled.
- Run history does not show unexplained reliability, duration, token, or capability-use changes.
- Drift issues are resolved intentionally.
- A user who did not build the Orchestration can understand the test names and descriptions.
Useful prompts
Use these prompts in Sofie chat while designing or improving tests.
Design a test suite
Make a test less brittle
Debug a failing test
Check parallel execution
Check long-term Memory use