Review proposed changes to our union contracts before each negotiation round.

Gated · contract-review

Improvement loop active

1 iteration recorded · latest val_score 0.909 · 12 eval cases in the suite. Each new iteration re-runs the agent with the latest instruction and proposes the next edit.

Iterations · first run
Latest val_score · 90.9%
Lift vs baseline · +0.0pp
Pending proposals · 12 cases in suite

Failed · test fold

Failed · train

seed-7-clean-clause-step2

Non-compete clauses with 7-month duration are commonly flagged as problematic in union contracts due to competitive restraint enforceability concerns and poten…

predicted true · expected false

train

Passed

clean-clause-step-2

The clause text is identical to step 1 ("Pre-existing IP carved out"), same clause type, and step 1 was marke…

predicted false · expected false

test fold

alt-seed-clean-clause

A 4-month non-compete clause with low severity is a standard, commonly-enforceable restriction in employment…

predicted false · expected false

train

clean-clause-step-0

A standard 60-day mutual termination clause is a routine, non-problematic provision in union contracts that b…

predicted false · expected false

train

overtime-carveout-flagged

The clause assigns all IP including pre-existing assets to the company, which is a high-severity risk that co…

predicted true · expected true

train

seed-7-clean-clause-step0

The termination clause uses standard, clear language with symmetric rights (either party, equal notice) and l…

predicted false · expected false

train

clean-clause-step-1

Pre-existing IP carve-outs are standard, uncontroversial contract language that typically do not pose legal r…

predicted false · expected false

train

alt-seed-clean-clause-step1

Termination clauses with standard notice periods (30 days) are routine and low-risk; the preceding non-compet…

predicted false · expected false

train

alt-seed-problematic-clause-step6

Step 6 has identical clause type and text to step 2, which was labeled problematic; IP assignment including p…

predicted true · expected true

test fold

grievance-precedent-flagged

The clause grants either party termination with only 14 days notice, which is substantially shorter than the…

predicted true · expected true

train

notification-window-30-day-breach

IP assignment clauses that claim ownership of pre-existing IP are high-risk and typically problematic under m…

predicted true · expected true

train

seed-7-problematic-step1

High severity combined with a specific severance dollar amount lacking context on whether it aligns with unio…

predicted true · expected true

train
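
The latest val_score of 0.909 can be reproduced from the twelve predicted/expected pairs listed above under one assumption: that the score is an unweighted F1 over all cases, train and test fold together. The page does not state the formula, and the success metric mentions severity weighting, so treat this as a consistency check and not as the platform's definition.

```python
# (predicted, expected) pairs copied from the case list above.
cases = {
    "seed-7-clean-clause-step2": (True, False),
    "clean-clause-step-2": (False, False),
    "alt-seed-clean-clause": (False, False),
    "clean-clause-step-0": (False, False),
    "overtime-carveout-flagged": (True, True),
    "seed-7-clean-clause-step0": (False, False),
    "clean-clause-step-1": (False, False),
    "alt-seed-clean-clause-step1": (False, False),
    "alt-seed-problematic-clause-step6": (True, True),
    "grievance-precedent-flagged": (True, True),
    "notification-window-30-day-breach": (True, True),
    "seed-7-problematic-step1": (True, True),
}

# Tally the confusion counts for the positive ("flagged") class.
tp = sum(1 for p, e in cases.values() if p and e)
fp = sum(1 for p, e in cases.values() if p and not e)
fn = sum(1 for p, e in cases.values() if e and not p)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
# Assumption: the composite is a plain F1 (harmonic mean of precision and recall).
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn)    # 5 1 0
print(round(f1, 3))  # 0.909
```

The single false positive (seed-7-clean-clause-step2) lowers precision to 5/6 while recall stays at 1, which gives 10/11.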

Spec also declares DocumentReader, SideBySideView — those primitives need richer per-case agent output than the current loop emits.

Iterations · 1

Iter | val_score | Best ever | State | Approved? | Ended
#0 | 0.909 | 0.909 | gate-blocked-no-improvement | | 2026-05-19 04:29
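
A minimal sketch of how a regression gate could arrive at the `gate-blocked-no-improvement` state for iteration #0. The rule used here (accept only a strictly positive lift over the best score so far) is an assumption, and `GateDecision`, `regression_gate`, and the `"accepted"` state name are hypothetical; only the blocked state name and the two scores come from the page.

```python
from dataclasses import dataclass


@dataclass
class GateDecision:
    state: str
    lift_pp: float  # lift in percentage points


def regression_gate(val_score: float, best_ever: float) -> GateDecision:
    """Assumed rule: a proposal passes only if it beats the best score so far."""
    lift_pp = (val_score - best_ever) * 100
    if lift_pp > 0:
        return GateDecision("accepted", lift_pp)  # hypothetical state name
    return GateDecision("gate-blocked-no-improvement", lift_pp)


# Iteration #0: val_score and best-ever are both 0.909, so the lift is zero.
decision = regression_gate(val_score=0.909, best_ever=0.909)
print(decision.state)    # gate-blocked-no-improvement
print(decision.lift_pp)  # 0.0
```

Under this reading, the +0.0pp lift shown in the summary and the blocked state are two views of the same comparison.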

Agent anatomy

Single-agent loop, gated by the regression suite. Below: the skills the agent has loaded, the tools it can call, and who signs off on changes.

Skills active · 0

No skills bound to this workflow yet — generated on first run.

Tools available · 4

flag_clause_risk
Flags a clause as a risk and records its risk type and severity.
flag_clause_risk(clause_id: string, risk_type: category, severity: category, rationale: string)
fetch_clause_text
Returns the current text of a numbered clause.
fetch_clause_text(clause_id: string) → text: string
search_grievance_precedent
Returns grievance cases matching a clause + topic.
search_grievance_precedent(topic: category, years_back: int) → cases: string
check_jurisdictional_rules
Returns conflicting jurisdictional rules for a clause.
check_jurisdictional_rules(jurisdiction: category, topic: category) → conflicts: string
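
For orientation, the four signatures above can be written as Python type hints. This is a sketch only: `category` is rendered as `str`, the string return values are kept as plain `str`, and `FakeKernel`, `ClauseFlag`, the clause id `c-14`, and the sample risk values are hypothetical stand-ins with no connection to the real backend.

```python
from dataclasses import dataclass, field


@dataclass
class ClauseFlag:
    clause_id: str
    risk_type: str
    severity: str
    rationale: str


@dataclass
class FakeKernel:
    """Hypothetical in-memory stand-in for the tool backend."""
    clauses: dict[str, str]
    flags: list[ClauseFlag] = field(default_factory=list)

    def fetch_clause_text(self, clause_id: str) -> str:
        return self.clauses[clause_id]

    def flag_clause_risk(self, clause_id: str, risk_type: str,
                         severity: str, rationale: str) -> None:
        self.flags.append(ClauseFlag(clause_id, risk_type, severity, rationale))

    def search_grievance_precedent(self, topic: str, years_back: int) -> str:
        return ""  # placeholder: this sketch holds no precedent data

    def check_jurisdictional_rules(self, jurisdiction: str, topic: str) -> str:
        return ""  # placeholder: this sketch holds no jurisdictional rules


kernel = FakeKernel(clauses={"c-14": "Either party may terminate with 14 days notice."})
text = kernel.fetch_clause_text("c-14")
if "14 days" in text:
    kernel.flag_clause_risk("c-14", "termination", "high",
                            "Notice period is short.")
print(len(kernel.flags))  # 1
```

The usage lines show one plausible call order (fetch the text, then flag), not a sequence the page prescribes.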

Topology & review

Single-agent loop
One agent reads its skills, calls tools, and proposes the next skill version. Regression gate runs every iteration. Phase-2 multi-agent is out of scope.
Reviewer · Labour relations counsel
cadence: weekly
Reviews flagged clauses, escalates to legal redline.
Success · maximize clause_flag_precision_recall
A flag is correct if it identifies a clause that legal subsequently redlines or escalates. The score is a composite of precision and recall over the flagged-clause set, weighted by clause severity.
Environment
2 entity types · 2 data sources · 2 generators · 2 personas · seasonality: renewal-cycle
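
The success metric in the review block above is described only as a severity-weighted composite of precision and recall. One way such a composite could be computed is a weighted F1, sketched below. The weight table, the `weighted_f1` helper, and the four sample records are all illustrative assumptions; none of them are values from this workflow.

```python
# Hypothetical severity weights; the page does not state the real ones.
SEVERITY_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 3.0}


def weighted_f1(records):
    """records: iterable of (predicted, expected, severity) tuples."""
    tp = fp = fn = 0.0
    for predicted, expected, severity in records:
        w = SEVERITY_WEIGHT[severity]
        if predicted and expected:
            tp += w  # correct flag
        elif predicted and not expected:
            fp += w  # flag that legal would not act on
        elif expected and not predicted:
            fn += w  # missed clause
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative records only.
records = [
    (True, True, "high"),
    (True, False, "low"),
    (False, True, "medium"),
    (False, False, "low"),
]
print(round(weighted_f1(records), 3))  # 0.667
```

With these weights a missed high-severity clause costs three times as much recall as a missed low-severity one, which is the intent the metric description suggests.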

Skills + tools are read live from the kernel. Open the trace inspector to watch one run end-to-end.