Modular Infrastructure for Inclusive Housing — Chapter 4w Evaluation Workbench

Evaluation Workbench

This workbench consolidates the evaluation measure specifications and evaluation question contracts that govern the empirical portion of the thesis. Each measure is defined by its metric, baseline, unit, and threshold, so that the Chapter 10 results tables can be interpreted against pre-declared expectations. Environment-derived requirements are operationalised here as measurable evaluation contracts, each of which can be confirmed or refuted by the Chapter 10 demonstration evidence. Full traceability is recorded in the Requirements–Design–Evaluation Traceability Matrix.

Evaluation Measure Specifications

Seven evaluation measures are operationalised across the thesis. Measures EM-4W-01 to EM-4W-03 are defined and pre-registered in Chapter 4w; measures EM-09-01 to EM-09-04 are executed and reported in Chapter 10.

Measure ID	Chapter	Metric Definition	Baseline Definition	Unit	Threshold	Data Source
EM-4W-01	Ch4w	Interpretation divergence for matched regulatory obligations	Divergence under current synchronous artefact workflow	Divergence rate	Lower than baseline with practical significance	Annotation sheets; protocol logs
EM-4W-02	Ch4w	Local-to-global check scope ratio per transformation event	Scope ratio in baseline non-modular workflow	Ratio	Local scope majority in bounded edits	Change-trace logs
EM-4W-03	Ch4w	Rule-compliant variation yield with exception budget	Variation yield under ungoverned complements	Compliance proportion	Meets pre-registered budget and rationale coverage	Generation logs; exception register
EM-09-01	Ch9	Standards interpretability trace completeness	Baseline trace completeness	Index	Improved completeness over baseline	Ch9 case outputs
EM-09-02	Ch9	Modular-fit evidence with bounded verification signals	Baseline modular-fit without interface contracts	Composite score	Positive bounded-check trend over baseline	Ch9 results tables
EM-09-03	Ch9	Workflow burden delta across time, cognitive, and skill proxies	Baseline burden profile for matched tasks	Delta	Net reduction with stated confidence limits	Ch9 workflow outputs
EM-09-04	Ch9	Exception governance quality in discussion synthesis	Baseline exception handling quality	Quality score	Full typing and justification coverage	Ch9 discussion evidence
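
A pre-registered measure is only usable if every column of its row is filled in before results are read. The following is a minimal sketch of how one row of the table above could be held as an immutable record and checked for completeness. The class name `MeasureSpec`, its field names, and the helper `missing_fields` are assumptions introduced for illustration and are not part of the thesis artefacts.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)  # frozen: a declared spec cannot be mutated afterwards
class MeasureSpec:
    """One row of the measure specification table (field names are assumed)."""
    measure_id: str
    chapter: str
    metric: str
    baseline: str
    unit: str
    threshold: str
    data_source: str


def missing_fields(spec: MeasureSpec) -> list[str]:
    """Return the names of any fields that are empty or whitespace-only."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]


# Row EM-4W-02, transcribed from the table above.
EM_4W_02 = MeasureSpec(
    measure_id="EM-4W-02",
    chapter="Ch4w",
    metric="Local-to-global check scope ratio per transformation event",
    baseline="Scope ratio in baseline non-modular workflow",
    unit="Ratio",
    threshold="Local scope majority in bounded edits",
    data_source="Change-trace logs",
)

# A hypothetical incomplete draft of the same row, used only to show the check.
DRAFT = replace(EM_4W_02, threshold="")

if __name__ == "__main__":
    print(missing_fields(EM_4W_02))
    print(missing_fields(DRAFT))
```

Freezing the record is a design choice that mirrors the pre-registration intent: once a threshold is declared, changing it requires creating a new record rather than editing the existing one in place.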

Evaluation Question Contract

The five evaluation questions below group the measures into clusters that correspond to the propositions tested in the thesis. Each evaluation question maps to a primary proposition (EQ-01↔P1; EQ-02↔P2; EQ-03↔P3; EQ-04↔P4; EQ-05↔P5), following the property↔proposition↔EQ alignment established in Chapter 2 §2.9. Each evaluation question is linked to the environment requirements it addresses and to the measures that provide the evidence. Together, the seven measures span all five evaluation questions, and no requirement is addressed by a measure that cannot be observed in the demonstration evidence. The evaluation question contracts documented here therefore establish the interpretive framework within which the Chapter 10 results tables are to be read.

Evaluation Question	Requirement IDs	Measure IDs	Expected Chapter Output
EQ-01	ER-01, ER-04	EM-4W-01, EM-09-01	Standards interpretability evidence
EQ-02	ER-02, ER-05	EM-4W-02, EM-09-02	Modular-fit and bounded-check evidence
EQ-03	ER-01, ER-02	EM-09-01, EM-09-02	Round-trip replay + invariant-preservation evidence (EVID-P3-REPLAY, EVID-P3-INVARIANTS)
EQ-04	ER-06	EM-4W-03, EM-09-04	Governed variation and exception evidence
EQ-05	ER-03, ER-05	EM-09-03	Workflow burden comparison evidence (integrated utility)
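
The linkage in the table above can be checked mechanically. The following is a minimal sketch, assuming the contract is transcribed into a Python dictionary. The names `EQ_CONTRACT`, `DECLARED_MEASURES`, `unmapped_measures`, and `measures_by_requirement` are illustrative assumptions, not identifiers used in the thesis. The sketch confirms that every declared measure is attached to at least one evaluation question and derives the requirement-to-measure index implied by the contract.

```python
from collections import defaultdict

# Evaluation question contract, transcribed from the table above.
# The dictionary layout is an assumption made for illustration.
EQ_CONTRACT = {
    "EQ-01": {"requirements": ["ER-01", "ER-04"], "measures": ["EM-4W-01", "EM-09-01"]},
    "EQ-02": {"requirements": ["ER-02", "ER-05"], "measures": ["EM-4W-02", "EM-09-02"]},
    "EQ-03": {"requirements": ["ER-01", "ER-02"], "measures": ["EM-09-01", "EM-09-02"]},
    "EQ-04": {"requirements": ["ER-06"], "measures": ["EM-4W-03", "EM-09-04"]},
    "EQ-05": {"requirements": ["ER-03", "ER-05"], "measures": ["EM-09-03"]},
}

# The seven measure IDs from the measure specification table.
DECLARED_MEASURES = [
    "EM-4W-01", "EM-4W-02", "EM-4W-03",
    "EM-09-01", "EM-09-02", "EM-09-03", "EM-09-04",
]


def unmapped_measures(declared, contract):
    """Return declared measures that no evaluation question references."""
    mapped = {m for row in contract.values() for m in row["measures"]}
    return sorted(set(declared) - mapped)


def measures_by_requirement(contract):
    """Invert the contract: for each requirement, list the measures that
    supply evidence through any evaluation question addressing it."""
    index = defaultdict(set)
    for row in contract.values():
        for req in row["requirements"]:
            index[req].update(row["measures"])
    return {req: sorted(ms) for req, ms in sorted(index.items())}


if __name__ == "__main__":
    print("Unmapped measures:", unmapped_measures(DECLARED_MEASURES, EQ_CONTRACT))
    for req, ms in measures_by_requirement(EQ_CONTRACT).items():
        print(req, "->", ", ".join(ms))
```

The check only covers requirements that appear in the contract table. Verifying coverage of the full Environment-Derived Requirements Register would need that register as an additional input.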

Requirement identifiers used here are defined in the Environment-Derived Requirements Register and the Environmental Grounding Dossier. Each identifier traces to at least one design feature and one evaluation measure. In summary, the workbench constitutes the pre-registration record for the thesis’s evaluation. All measures, thresholds, and question-to-requirement linkages are declared before demonstration results are interpreted. Chapter 10 results can therefore be assessed against thresholds set independently of observed outcomes.