Read-only demo. Approve, reject, deploy, and iteration actions are disabled. Self-host from GitHub.

Review proposed changes to our union contracts before each negotiation round.


Gated · contract-review

Improvement loop active

1 iteration recorded · latest val_score 0.909 · 12 eval cases in the suite. Each new iteration re-runs the agent with the latest instruction and proposes the next edit.

Iterations · 1 (first run)
Latest val_score · 90.9%
Lift vs baseline · +0.0pp
Pending proposals · 0 (12 cases in suite)
Failed · test fold · 0
Failed · train · 1
seed-7-clean-clause-step2
Non-compete clauses with 7-month duration are commonly flagged as problematic in union contracts due to competitive restraint enforceability concerns and poten…
predicted true · expected false
train
Passed · 11
clean-clause-step-2
The clause text is identical to step 1 ("Pre-existing IP carved out"), same clause type, and step 1 was marke…
predicted false · expected false
test fold
alt-seed-clean-clause
A 4-month non-compete clause with low severity is a standard, commonly-enforceable restriction in employment…
predicted false · expected false
train
clean-clause-step-0
A standard 60-day mutual termination clause is a routine, non-problematic provision in union contracts that b…
predicted false · expected false
train
overtime-carveout-flagged
The clause assigns all IP including pre-existing assets to the company, which is a high-severity risk that co…
predicted true · expected true
train
seed-7-clean-clause-step0
The termination clause uses standard, clear language with symmetric rights (either party, equal notice) and l…
predicted false · expected false
train
clean-clause-step-1
Pre-existing IP carve-outs are standard, uncontroversial contract language that typically do not pose legal r…
predicted false · expected false
train
alt-seed-clean-clause-step1
Termination clauses with standard notice periods (30 days) are routine and low-risk; the preceding non-compet…
predicted false · expected false
train
alt-seed-problematic-clause-step6
Step 6 has identical clause type and text to step 2, which was labeled problematic; IP assignment including p…
predicted true · expected true
test fold
grievance-precedent-flagged
The clause grants either party termination with only 14 days notice, which is substantially shorter than the…
predicted true · expected true
train
notification-window-30-day-breach
IP assignment clauses that claim ownership of pre-existing IP are high-risk and typically problematic under m…
predicted true · expected true
train
seed-7-problematic-step1
High severity combined with a specific severance dollar amount lacking context on whether it aligns with unio…
predicted true · expected true
train
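
The predicted/expected pairs listed above can be tallied into a confusion matrix. The following is a minimal Python sketch, assuming "true" means the clause was flagged as problematic; the tuple encoding and the function names are illustrative and not part of the workflow. Plain, unweighted F1 over all 12 cases works out to 10/11, about 0.909, which coincides with the reported val_score, although the page does not state that val_score is computed this way.

```python
# (case name, predicted, expected) transcribed from the case list above.
CASES = [
    ("seed-7-clean-clause-step2", True, False),
    ("clean-clause-step-2", False, False),
    ("alt-seed-clean-clause", False, False),
    ("clean-clause-step-0", False, False),
    ("overtime-carveout-flagged", True, True),
    ("seed-7-clean-clause-step0", False, False),
    ("clean-clause-step-1", False, False),
    ("alt-seed-clean-clause-step1", False, False),
    ("alt-seed-problematic-clause-step6", True, True),
    ("grievance-precedent-flagged", True, True),
    ("notification-window-30-day-breach", True, True),
    ("seed-7-problematic-step1", True, True),
]


def confusion(cases):
    """Return (tp, fp, fn, tn) counts for (name, predicted, expected) tuples."""
    tp = sum(1 for _, p, e in cases if p and e)
    fp = sum(1 for _, p, e in cases if p and not e)
    fn = sum(1 for _, p, e in cases if not p and e)
    tn = sum(1 for _, p, e in cases if not p and not e)
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """Unweighted F1; returns 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


tp, fp, fn, tn = confusion(CASES)
print(tp, fp, fn, tn)                  # 5 1 0 6
print(round(f1_score(tp, fp, fn), 3))  # 0.909
```

The single false positive is the failed train case, seed-7-clean-clause-step2; there are no false negatives in this run.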

The spec also declares DocumentReader and SideBySideView; those primitives need richer per-case agent output than the current loop emits.

Iterations · 1

Iter  val_score  Best ever  State                        Approved?  Ended
#0    0.909      0.909      gate-blocked-no-improvement  -          2026-05-19 04:29
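
One way to read the row above: the iteration scored 0.909, the lift over baseline was +0.0pp, and the gate blocked it for lack of improvement. The page does not show the gate's actual rule, so the sketch below is only an assumption: it blocks any iteration whose lift does not exceed a threshold. The function name, the `min_lift_pp` parameter, and the "gate-passed" label are hypothetical.

```python
def gate_state(val_score: float, baseline: float, min_lift_pp: float = 0.0) -> str:
    """Return a state label for one iteration under an assumed gate rule."""
    lift_pp = (val_score - baseline) * 100.0  # lift in percentage points
    if lift_pp > min_lift_pp:
        return "gate-passed"  # hypothetical label, not shown on the page
    return "gate-blocked-no-improvement"  # label taken from the iteration table


print(gate_state(0.909, 0.909))  # gate-blocked-no-improvement
```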

Agent anatomy

Single-agent loop, gated by the regression suite. Below: the skills the agent has loaded, the tools it can call, and who signs off on changes.

Skills active · 0
No skills bound to this workflow yet — generated on first run.
Tools available · 4
  • flag_clause_risk
    Flags a clause as a risk, recording its risk type and severity.
    flag_clause_risk(clause_id: string, risk_type: category, severity: category, rationale: string)
  • fetch_clause_text
    Returns the current text of a numbered clause.
    fetch_clause_text(clause_id: string) → text: string
  • search_grievance_precedent
    Returns grievance cases matching a clause + topic.
    search_grievance_precedent(topic: category, years_back: int) → cases: string
  • check_jurisdictional_rules
    Returns conflicting jurisdictional rules for a clause.
    check_jurisdictional_rules(jurisdiction: category, topic: category) → conflicts: string
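
To make the signatures above concrete, here is a minimal Python sketch of an in-memory stand-in for two of the four tools, fetch_clause_text and flag_clause_risk. It is not the kernel's implementation. The page only says "category" for risk_type and severity, so the severity levels, the `ClauseFlag` record, the `InMemoryTools` class, and the sample clause ID are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Literal

# Assumption: the page only says "category"; these levels are illustrative.
Severity = Literal["low", "medium", "high"]


@dataclass(frozen=True)
class ClauseFlag:
    clause_id: str
    risk_type: str  # a category on the page; allowed values are not listed
    severity: Severity
    rationale: str


class InMemoryTools:
    """Hypothetical stand-in for the workflow's tools, for local experiments."""

    def __init__(self, clauses: dict[str, str]):
        self._clauses = clauses
        self.flags: list[ClauseFlag] = []

    def fetch_clause_text(self, clause_id: str) -> str:
        # Mirrors fetch_clause_text(clause_id: string) -> text: string
        return self._clauses[clause_id]

    def flag_clause_risk(self, clause_id: str, risk_type: str,
                         severity: Severity, rationale: str) -> None:
        # Reject flags on clauses that do not exist in this contract.
        if clause_id not in self._clauses:
            raise KeyError(f"unknown clause: {clause_id}")
        self.flags.append(ClauseFlag(clause_id, risk_type, severity, rationale))


tools = InMemoryTools({"c-2": "Company owns all IP, including pre-existing assets."})
tools.flag_clause_risk("c-2", "ip_assignment", "high", "Claims pre-existing IP.")
print(len(tools.flags))  # 1
```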
Topology & review
  • Single-agent loop
    One agent reads its skills, calls tools, and proposes the next skill version. Regression gate runs every iteration. Phase-2 multi-agent is out of scope.
  • Reviewer · Labour relations counsel
    cadence: weekly
    Reviews flagged clauses, escalates to legal redline.
  • Success · maximize clause_flag_precision_recall
    A flag is correct if it identifies a clause that legal subsequently redlines or escalates. The score is a composite of precision and recall over the flagged-clause set, weighted by clause severity.
  • Environment
    2 entity types · 2 data sources · 2 generators · 2 personas · seasonality: renewal-cycle
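
The success metric above, a severity-weighted composite of precision and recall, could be computed along the following lines. This is a sketch under stated assumptions: the page gives neither the severity weights nor the way precision and recall are combined, so the `WEIGHTS` table and the choice of a harmonic mean (F1) are hypothetical, as are the function names and the sample cases.

```python
# Hypothetical severity weights; the page does not give the actual values.
WEIGHTS = {"low": 1.0, "medium": 2.0, "high": 3.0}


def weighted_pr(cases):
    """cases: iterable of (severity, flagged, redlined) tuples."""
    tp = fp = fn = 0.0
    for severity, flagged, redlined in cases:
        w = WEIGHTS[severity]
        if flagged and redlined:
            tp += w  # correct flag
        elif flagged:
            fp += w  # flagged, but legal did not redline or escalate
        elif redlined:
            fn += w  # legal redlined a clause the agent missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def composite(precision, recall):
    # Assumption: harmonic mean (F1) as the composite.
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


cases = [
    ("high", True, True),     # correct flag on a high-severity clause
    ("low", True, False),     # false alarm on a low-severity clause
    ("medium", False, True),  # missed medium-severity clause
]
p, r = weighted_pr(cases)
print(round(p, 3), round(r, 3), round(composite(p, r), 3))  # 0.75 0.6 0.667
```

With weighting, a missed high-severity clause costs more recall than a missed low-severity one, which matches the stated intent of weighting by clause severity.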

Skills + tools are read live from the kernel. Open the trace inspector to watch one run end-to-end.
