Chapter 5 SDA Corpus Integrity Metrics
This appendix documents the corpus integrity baseline for the text-channel serialisation corpus evaluated in Chapter 5. Corpus integrity refers to the set of structural completeness, identity stability, and cross-layer coverage properties that must be established before ambiguity analysis results can be interpreted with confidence. If the corpus were structurally incomplete, contained truncation artefacts, or exhibited systematic identity instability, the ambiguity-delta figures reported in Appendix: Chapter 5 Evaluation Results and Data Package would be confounded by data quality defects rather than reflecting genuine schema performance. These metrics confirm that the corpus meets the minimum quality thresholds required for the analysis to be interpretable. Source data: serialised-text-requirements.json and the linked linguistic analysis outputs (generated 2026-03-24), archived at publish-thesis/publish-data/appendix-data/ch5-artefact-bundle/data-package/text-corpus/.
Structural Completeness
The text-channel serialisation corpus comprises 611 records. Structural completeness checks confirm that no record is missing a required key, contains a truncation marker, or has a malformed clause header. These checks address three common failure modes of large-scale document extraction: key omission (a required field is absent from the serialised record), truncation (the source text was cut short during extraction), and header malformation (the clause identification scheme is applied inconsistently). Clean results across all three checks establish that the 611-record corpus is structurally sound with respect to the configured checks, so these defect classes do not confound the ambiguity analysis. The next section examines whether record identity is stable across the corpus, a property that structural completeness alone does not guarantee.
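The three checks described above can be sketched as follows. This is a minimal illustration, not the script used to produce the reported figures: the field names (`serial_id`, `clause_header`, `text`), the truncation pattern, and the header pattern are all assumptions, since the schema of serialised-text-requirements.json is not reproduced in this appendix.

```python
import re

# Hypothetical field names and patterns; the actual schema may differ.
REQUIRED_KEYS = {"serial_id", "clause_header", "text"}
TRUNCATION_RE = re.compile(r"(\.\.\.|…|\[truncated\])\s*$", re.IGNORECASE)
HEADER_RE = re.compile(r"^\d+(\.\d+)*$")  # dotted clause numbers such as "5.2.1"


def structural_report(records):
    """Count the records that fail each of the three structural checks."""
    report = {"missing_keys": 0, "truncated": 0, "malformed_header": 0}
    for rec in records:
        if not REQUIRED_KEYS.issubset(rec):
            report["missing_keys"] += 1
            continue  # the remaining checks need the required fields
        if TRUNCATION_RE.search(rec["text"]):
            report["truncated"] += 1
        if not HEADER_RE.match(rec["clause_header"]):
            report["malformed_header"] += 1
    return report
```

A structurally clean corpus yields zero for every counter, which is the result reported for the 611 records.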
Identity Stability
Identity stability analysis examines whether the same serial identifier is associated with different textual content across the corpus, a signature of referential instability in which one identifier is used for distinct requirements. The analysis detects 8 duplicate IDs that reference different text content, and 0 duplicate IDs referencing identical text content. The presence of 8 identity-unstable records justifies the record-level identity handling implemented in the serialisation schema: the schema assigns a stable internal identifier to each requirement based on its content rather than relying solely on the source document’s nominal clause numbering. The 0 identical-text duplicates confirm that the instability is semantically meaningful (distinct requirements sharing an identifier) rather than a simple copy-paste artefact.
The 8 unstable identifiers affect 8 of 611 records (1.3% of the corpus). This rate is low enough that the corpus supports aggregate ambiguity analysis, but high enough to warrant the identity-handling mechanism as a design requirement rather than an optional feature. The next section reports cross-layer coverage, the proportion of serialised clauses that can be linked to the ontological layer. That measure reveals the deliberate design boundary between the text corpus and the spatial-dimensional ontology built for figure-based triples.
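The distinction between different-text and identical-text duplicates, and the idea of a content-based internal identifier, can be illustrated with a short sketch. The field names, the whitespace normalisation, and the hash-suffix format of `stable_id` are assumptions made for illustration; the schema's actual identifier scheme is not specified here.

```python
import hashlib
from collections import defaultdict


def normalise(text):
    """Collapse whitespace so trivial spacing differences are not counted as distinct text."""
    return " ".join(text.split())


def classify_duplicate_ids(records):
    """Return (ids reused with different text, ids reused with identical text)."""
    texts_by_id = defaultdict(list)
    for rec in records:
        texts_by_id[rec["serial_id"]].append(normalise(rec["text"]))

    different, identical = [], []
    for serial_id, texts in texts_by_id.items():
        if len(texts) < 2:
            continue  # identifier is used once, so it is not a duplicate
        if len(set(texts)) > 1:
            different.append(serial_id)
        else:
            identical.append(serial_id)
    return different, identical


def stable_id(serial_id, text):
    """Hypothetical content-based identifier: nominal ID plus a short text hash."""
    digest = hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()
    return f"{serial_id}#{digest[:8]}"
```

Under this scheme two requirements that share a nominal clause number but differ in wording receive distinct internal identifiers, while re-extraction of the same wording yields the same identifier.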
Cross-Layer Coverage
Cross-layer coverage measures the proportion of serialised text requirements that can be linked to entries in the ontological layer — the layer that provides semantic type information, entity classification, and relational context for downstream processing. Of the 603 unique serial IDs in the corpus (derived from the 611 total by removing the 8 identity-unstable duplicates), 30 are linked to ontology entries, yielding an ontology coverage ratio of 0.0464 (4.64%).
The coverage ratio of 4.64% reflects a deliberate design constraint. The ontological layer was developed to support the figures-channel triple extraction pipeline, so most of the 611-record text corpus falls outside its scope. The text corpus covers a broader regulatory scope (procedural requirements, conditional applicability rules, and external standard references) that does not map readily to the spatial-dimensional ontology built for figure-based triples. The ratio therefore evidences representational under-coverage, which is declared as a design constraint in Chapter 5 rather than treated as noise.
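The coverage measure is a set intersection over unique identifiers, and a minimal sketch makes the numerator and denominator explicit. The function name and the representation of the linkage table as a collection of linked IDs are assumptions for illustration.

```python
def ontology_coverage(serial_ids, ontology_linked_ids):
    """Return (linked count, unique ID count, coverage ratio).

    Duplicated serial IDs are counted once, and ontology links that point
    to IDs absent from the corpus are ignored.
    """
    unique_ids = set(serial_ids)
    linked = unique_ids & set(ontology_linked_ids)
    ratio = len(linked) / len(unique_ids) if unique_ids else 0.0
    return len(linked), len(unique_ids), ratio
```

Deduplicating before dividing matters here: using the total record count as the denominator would slightly understate coverage whenever identifiers are reused.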
Modal Carry-Over
Modal carry-over measures the proportion of text-corpus clauses identified as modal (carrying explicit deontic force) that are also covered by the ontological layer. The corpus contains 165 serial modal clauses. Of these, 9 are covered by the ontological layer, yielding a modal coverage ratio of 0.0545 (5.45%).
This ratio is slightly higher than the general ontology coverage ratio (5.45% versus 4.64%), indicating a marginal concentration of ontological coverage in modal content. The finding is consistent with the ontological layer’s design intent: entries were prioritised on the basis of spatial and deontic salience. However, the absolute coverage remains low, confirming that the text-channel modal analysis must be conducted primarily through direct linguistic analysis of the serialised clauses rather than through ontological inference. Overall, the modal carry-over finding reinforces that the text and ontological layers are intentionally scoped to complementary purposes, not to comprehensive mutual coverage.
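Modal carry-over applies the same coverage calculation to the subset of clauses identified as modal. The sketch below uses a simple keyword pattern as a stand-in for the modal identification step. That pattern is an assumption, since the reported counts come from the linked linguistic analysis outputs, whose method is not described in this appendix.

```python
import re

# Assumed stand-in for the linguistic analysis: a small list of English modal verbs.
MODAL_RE = re.compile(r"\b(shall|must|should|may)\b", re.IGNORECASE)


def modal_coverage(clauses, ontology_linked_ids):
    """Return (covered modal count, modal count, ratio) for a {serial_id: text} mapping."""
    modal_ids = {sid for sid, text in clauses.items() if MODAL_RE.search(text)}
    covered = modal_ids & set(ontology_linked_ids)
    ratio = len(covered) / len(modal_ids) if modal_ids else 0.0
    return len(covered), len(modal_ids), ratio
```

Because the denominator is restricted to modal clauses, this ratio can be compared directly with the general ontology coverage ratio to see whether ontological coverage is concentrated in deontic content.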
Table A5-CI.1: Corpus integrity summary
| Measure | Value | Interpretation |
|---|---|---|
| Total serialised records | 611 | Baseline corpus scale |
| Missing required keys | 0 | Structural completeness confirmed |
| Truncation markers detected | 0 | Non-truncation confirmed at configured checks |
| Malformed clause headers | 0 | Syntax integrity confirmed |
| Duplicate IDs (different text content) | 8 | Identity instability present; handled by schema design |
| Duplicate IDs (identical text content) | 0 | No trivial duplication; instability is semantically real |
| Unique serial IDs | 603 | Denominator for ontology-context coverage |
| Ontology-linked unique IDs | 30 | Numerator for ontology-context coverage |
| Ontology coverage ratio | 0.0464 | Representational under-coverage (declared design bound) |
| Serial modal clauses | 165 | Denominator for deontic-force transfer |
| Ontology-covered modal clauses | 9 | Numerator for deontic-force transfer |
| Modal coverage ratio | 0.0545 | Weak modal carry-over via ontology |
Source: structured analysis of serialised-text-requirements.json and ontology linkage tables (generated 2026-03-24).
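The derived ratios in Table A5-CI.1 can be recomputed from the tabulated counts as a consistency check. The sketch below performs only that arithmetic; the helper and variable names are illustrative and are not taken from the analysis scripts.

```python
def ratio(numerator, denominator, digits=4):
    """Return numerator / denominator rounded to the given number of decimal places."""
    return round(numerator / denominator, digits)


# Counts as tabulated in Table A5-CI.1.
identity_rate = ratio(8, 611)    # unstable identifiers over total records
ontology_ratio = ratio(30, 603)  # ontology-linked IDs over unique serial IDs
modal_ratio = ratio(9, 165)      # ontology-covered modal clauses over modal clauses

print(identity_rate, ontology_ratio, modal_ratio)
```

With the counts as tabulated, 8/611 and 9/165 reproduce the reported 1.3% and 0.0545. The quotient 30/603 evaluates to approximately 0.0498, whereas the reported 0.0464 corresponds to 28/603, so the ontology-linked count and the coverage ratio are worth reconciling against the source linkage tables.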
The corpus integrity checks collectively confirm three properties necessary for the Chapter 5 evaluation: (1) the corpus is structurally complete and free of the extraction artefacts tested for; (2) referential identity instability is present and explicitly handled by the schema design; and (3) cross-layer coverage asymmetry is measurable and declared as a design constraint rather than hidden as noise. Taken together, these properties provide the evidential foundation for the ambiguity-delta and deontic-force analyses in the companion appendices, and they support treating the corpus as a sound and interpretable basis for the evaluation claims advanced in Chapter 5.