Evaluation Workbench
This workbench consolidates the evaluation measure specifications and evaluation question contracts that govern the empirical portion of the thesis. Each measure is defined by its metric, baseline, unit, and threshold, so that the Chapter 10 results tables can be read against pre-declared expectations. Environment-derived requirements are operationalised here as measurable evaluation contracts, each of which can be confirmed or refuted by the Chapter 10 demonstration evidence. Full traceability is recorded in the Requirements–Design–Evaluation Traceability Matrix.
Evaluation Measure Specifications
Seven evaluation measures are operationalised across the thesis. Measures EM-4W-01 to EM-4W-03 are defined and pre-registered in Chapter 4; measures EM-09-01 to EM-09-04 are executed and reported in Chapter 10.
| Measure ID | Chapter | Metric Definition | Baseline Definition | Unit | Threshold | Data Source |
|---|---|---|---|---|---|---|
| EM-4W-01 | Ch4w | Interpretation divergence for matched regulatory obligations | Divergence under current synchronous artefact workflow | Divergence rate | Lower than baseline with practical significance | Annotation sheets; protocol logs |
| EM-4W-02 | Ch4w | Local-to-global check scope ratio per transformation event | Scope ratio in baseline non-modular workflow | Ratio | Local scope majority in bounded edits | Change-trace logs |
| EM-4W-03 | Ch4w | Rule-compliant variation yield with exception budget | Variation yield under ungoverned complements | Compliance proportion | Meets pre-registered budget and rationale coverage | Generation logs; exception register |
| EM-09-01 | Ch9 | Standards interpretability trace completeness | Baseline trace completeness | Index | Improved completeness over baseline | Ch9 case outputs |
| EM-09-02 | Ch9 | Modular-fit evidence with bounded verification signals | Baseline modular-fit without interface contracts | Composite score | Positive bounded-check trend over baseline | Ch9 results tables |
| EM-09-03 | Ch9 | Workflow burden delta across time, cognitive, and skill proxies | Baseline burden profile for matched tasks | Delta | Net reduction with stated confidence limits | Ch9 workflow outputs |
| EM-09-04 | Ch9 | Exception governance quality in discussion synthesis | Baseline exception handling quality | Quality score | Full typing and justification coverage | Ch9 discussion evidence |
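The register above can also be held as structured records, which makes it possible to check mechanically that every measure declares each field and that no identifier is repeated. The following is a minimal sketch, not part of the thesis tooling: the names `MeasureSpec`, `MEASURES`, and `validate_register` are hypothetical, and the field values are copied from the table.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MeasureSpec:
    """One row of the measure register (hypothetical record type)."""
    measure_id: str
    chapter: str
    metric: str
    baseline: str
    unit: str
    threshold: str
    data_source: str


# Values transcribed from the table; nothing is added or inferred.
MEASURES = [
    MeasureSpec("EM-4W-01", "Ch4w",
                "Interpretation divergence for matched regulatory obligations",
                "Divergence under current synchronous artefact workflow",
                "Divergence rate",
                "Lower than baseline with practical significance",
                "Annotation sheets; protocol logs"),
    MeasureSpec("EM-4W-02", "Ch4w",
                "Local-to-global check scope ratio per transformation event",
                "Scope ratio in baseline non-modular workflow",
                "Ratio",
                "Local scope majority in bounded edits",
                "Change-trace logs"),
    MeasureSpec("EM-4W-03", "Ch4w",
                "Rule-compliant variation yield with exception budget",
                "Variation yield under ungoverned complements",
                "Compliance proportion",
                "Meets pre-registered budget and rationale coverage",
                "Generation logs; exception register"),
    MeasureSpec("EM-09-01", "Ch9",
                "Standards interpretability trace completeness",
                "Baseline trace completeness",
                "Index",
                "Improved completeness over baseline",
                "Ch9 case outputs"),
    MeasureSpec("EM-09-02", "Ch9",
                "Modular-fit evidence with bounded verification signals",
                "Baseline modular-fit without interface contracts",
                "Composite score",
                "Positive bounded-check trend over baseline",
                "Ch9 results tables"),
    MeasureSpec("EM-09-03", "Ch9",
                "Workflow burden delta across time, cognitive, and skill proxies",
                "Baseline burden profile for matched tasks",
                "Delta",
                "Net reduction with stated confidence limits",
                "Ch9 workflow outputs"),
    MeasureSpec("EM-09-04", "Ch9",
                "Exception governance quality in discussion synthesis",
                "Baseline exception handling quality",
                "Quality score",
                "Full typing and justification coverage",
                "Ch9 discussion evidence"),
]


def validate_register(specs):
    """Return (fields left blank per measure, duplicated measure IDs)."""
    blanks = {}
    for spec in specs:
        empty = [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]
        if empty:
            blanks[spec.measure_id] = empty
    ids = [spec.measure_id for spec in specs]
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    return blanks, duplicates


blanks, duplicates = validate_register(MEASURES)
print("blank fields:", blanks)
print("duplicate IDs:", duplicates)
```

A check of this kind only confirms that the register is structurally complete; it says nothing about whether a threshold is adequate.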
Evaluation Question Contract
The five evaluation questions below group the measures into clusters that correspond to the propositions tested in the thesis. Each evaluation question maps to a primary proposition (EQ-01↔︎P1; EQ-02↔︎P2; EQ-03↔︎P3; EQ-04↔︎P4; EQ-05↔︎P5), following the property↔︎proposition↔︎EQ alignment established in Chapter 2 §2.9, and is linked to the environment requirements it addresses and the measures that provide its evidence. The seven measures span all five evaluation questions, and no requirement is addressed by a measure that cannot be observed in the demonstration evidence. These contracts therefore establish the interpretive framework within which the Chapter 10 results tables are to be read.
| Evaluation Question | Requirement IDs | Measure IDs | Expected Chapter Output |
|---|---|---|---|
| EQ-01 | ER-01, ER-04 | EM-4W-01, EM-09-01 | Standards interpretability evidence |
| EQ-02 | ER-02, ER-05 | EM-4W-02, EM-09-02 | Modular-fit and bounded-check evidence |
| EQ-03 | ER-01, ER-02 | EM-09-01, EM-09-02 | Round-trip replay + invariant-preservation evidence (EVID-P3-REPLAY, EVID-P3-INVARIANTS) |
| EQ-04 | ER-06 | EM-4W-03, EM-09-04 | Governed variation and exception evidence |
| EQ-05 | ER-03, ER-05 | EM-09-03 | Workflow burden comparison evidence (integrated utility) |
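The coverage claim made for this contract (every declared measure is attached to at least one evaluation question, and every question has at least one measure) can be verified directly from the table. The sketch below is illustrative only: `CONTRACTS`, `DECLARED_MEASURES`, `coverage_gaps`, and `requirements_referenced` are hypothetical names, and the mappings are transcribed from the table.

```python
# Question-to-requirement and question-to-measure links, copied from the table.
CONTRACTS = {
    "EQ-01": {"requirements": {"ER-01", "ER-04"}, "measures": {"EM-4W-01", "EM-09-01"}},
    "EQ-02": {"requirements": {"ER-02", "ER-05"}, "measures": {"EM-4W-02", "EM-09-02"}},
    "EQ-03": {"requirements": {"ER-01", "ER-02"}, "measures": {"EM-09-01", "EM-09-02"}},
    "EQ-04": {"requirements": {"ER-06"}, "measures": {"EM-4W-03", "EM-09-04"}},
    "EQ-05": {"requirements": {"ER-03", "ER-05"}, "measures": {"EM-09-03"}},
}

# The seven measure IDs declared in the Evaluation Measure Specifications.
DECLARED_MEASURES = {
    "EM-4W-01", "EM-4W-02", "EM-4W-03",
    "EM-09-01", "EM-09-02", "EM-09-03", "EM-09-04",
}


def coverage_gaps(contracts, declared_measures):
    """List measures and questions that break the coverage claim."""
    used = set().union(*(c["measures"] for c in contracts.values()))
    return {
        # Declared in the register but attached to no evaluation question.
        "unmapped_measures": sorted(declared_measures - used),
        # Cited by a question but absent from the register.
        "undeclared_measures": sorted(used - declared_measures),
        # Questions with no supporting measure.
        "questions_without_measures": sorted(
            q for q, c in contracts.items() if not c["measures"]
        ),
    }


def requirements_referenced(contracts):
    """Return the sorted requirement IDs cited by at least one question."""
    return sorted(set().union(*(c["requirements"] for c in contracts.values())))


print(coverage_gaps(CONTRACTS, DECLARED_MEASURES))
print(requirements_referenced(CONTRACTS))
```

The check covers only the identifiers that appear in this table. Whether the Environment-Derived Requirements Register contains further requirement IDs has to be confirmed against that register.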
Requirement identifiers used here are defined in the Environment-Derived Requirements Register and the Environmental Grounding Dossier. Each identifier traces to at least one design feature and one evaluation measure. The workbench thus constitutes the pre-registration record for the thesis’s evaluation: all measures, thresholds, and question-to-requirement linkages are declared before demonstration results are interpreted, so Chapter 10 results can be assessed against thresholds set independently of observed outcomes.