ARCHITECTURAL ALIGNMENT
Interrupting Neural Reasoning Through Constitutional Inference Gating
A Necessary Layer in Global AI Containment
Abstract
Contemporary approaches to AI alignment rely predominantly on training-time interventions: reinforcement learning from human feedback (Christiano et al., 2017), constitutional AI methods (Bai et al., 2022), and safety fine-tuning. These approaches share a common architectural assumption—that alignment properties can be instilled during training and will persist reliably during inference. This paper argues that training-time alignment, while valuable, is insufficient for existential stakes and must be complemented by architectural alignment through inference-time constitutional gating.
We present the Tractatus Framework as a formal specification for interrupted neural reasoning: proposals generated by AI systems must be translated into auditable forms and evaluated against constitutional constraints before execution. This shifts the trust model from "trust the vendor's training" to "trust the visible architecture." The framework is implemented within the Village multi-tenant community platform, providing an empirical testbed for governance research.
Critically, we address the faithful translation assumption—the vulnerability that systems may misrepresent their intended actions to constitutional gates—by bounding the framework's domain of applicability to pre-superintelligence systems and specifying explicit capability thresholds and escalation triggers. We introduce the concept of Situated Language Layers (SLLs) as a deployment paradigm where constitutional gating becomes both feasible and necessary.
The paper contributes: (1) a formal architecture for inference-time constitutional gating; (2) capability threshold specifications with escalation logic; (3) validation methodology for layered containment; (4) an argument connecting existential risk preparation to edge deployment; and (5) a call for sustained deliberation (kōrero) as the epistemically appropriate response to alignment uncertainty.
1. The Stakes: Why Probabilistic Risk Assessment Fails
1.1 The Standard Framework and Its Breakdown
Risk assessment in technological domains typically operates through expected value calculations: multiply the probability of an outcome by its magnitude, compare across alternatives, and select the option that maximises expected utility. This framework underlies regulatory decisions from environmental policy to pharmaceutical approval and has proven adequate for most technological risks.
For existential risk from advanced AI systems, this framework breaks down in ways that are both mathematical and epistemic.
1.2 Three Properties of Existential Risk
Irreversibility. Most risks allow for error and subsequent learning; existential risks do not, as there is no second attempt after civilisational collapse or human extinction. Standard empiricism—testing hypotheses by observing what happens—cannot work, so theory and architecture must be right the first time.
Unquantifiable probability. There is no frequency data for existential catastrophes from AI systems. Estimates of misalignment probability vary by orders of magnitude depending on reasonable assumptions about capability trajectories, alignment difficulty, and coordination feasibility. Carlsmith (2022) estimates existential risk from power-seeking AI at greater than 10% by 2070; other researchers place estimates substantially higher or lower. This is not ordinary uncertainty reducible through additional data collection—it is fundamental unquantifiability stemming from the unprecedented nature of the risk.
Infinite disvalue. Expected value calculations multiply probability by magnitude. When magnitude is unbounded (the permanent foreclosure of all future human potential), any non-zero probability yields an unbounded expected loss, so options can no longer be meaningfully ranked by expected value. The mathematical grounding of conventional cost-benefit analysis fails.
1.3 Decision-Theoretic Implications
These properties suggest that expected value maximisation is not the appropriate decision procedure for existential AI risk. Alternative frameworks include:
Precautionary satisficing (Simon, 1956; Hansson, 2020). Under conditions of radical uncertainty with irreversible stakes, satisficing—selecting options that meet minimum safety thresholds rather than optimising expected value—may be the rational approach.
Maximin under uncertainty (Rawls, 1971). When genuine uncertainty (not merely unknown probabilities) meets irreversible stakes, maximin reasoning—choosing the option whose worst outcome is least bad—provides a coherent decision procedure.
Strong precautionary principle (Gardiner, 2006). The precautionary principle is appropriate when three conditions obtain: irreversibility, high uncertainty, and public goods at stake. Existential AI risk meets all three.
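The difference between these decision procedures can be made concrete with a toy comparison. The following is a minimal sketch, not part of the framework: the option names, probabilities, outcome values, and the safety floor are hypothetical numbers chosen only to show that the rules can disagree. The satisficing rule shown (filter by worst case, then pick the best remaining expected value) is one simple variant among several.

```python
# Hypothetical toy lotteries: option -> list of (probability, outcome) pairs.
# Every number here is an illustrative assumption, not an estimate from the paper.
OPTIONS = {
    "deploy_ungated": [(0.999, 100.0), (0.001, -10_000.0)],
    "deploy_gated": [(0.999, 60.0), (0.001, -50.0)],
    "do_not_deploy": [(1.0, 0.0)],
}


def expected_value(lottery):
    """Probability-weighted mean outcome."""
    return sum(p * x for p, x in lottery)


def worst_case(lottery):
    """Lowest outcome that occurs with non-zero probability."""
    return min(x for p, x in lottery if p > 0)


def choose_expected_value(options):
    """Pick the option with the highest expected value."""
    return max(options, key=lambda name: expected_value(options[name]))


def choose_maximin(options):
    """Pick the option whose worst outcome is least bad."""
    return max(options, key=lambda name: worst_case(options[name]))


def choose_satisficing(options, floor):
    """Discard options whose worst case falls below the floor,
    then pick the best expected value among the survivors."""
    safe = {name: lot for name, lot in options.items() if worst_case(lot) >= floor}
    return choose_expected_value(safe) if safe else None
```

With these illustrative numbers, expected value maximisation selects the ungated deployment, maximin selects no deployment, and satisficing with a floor of -100 selects the gated deployment.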
1.4 Implications for AI Development
These considerations do not imply that AI development should halt. They imply that development should proceed within containment structures designed to prevent worst-case outcomes. This requires:
1. Theoretical rigour over empirical tuning. Safety properties must emerge from architectural constraints, not from observing that systems have not yet caused harm.
2. Multi-layer containment. No single mechanism should be trusted to prevent catastrophe; defence in depth is required.
3. Preparation before capability. Containment architectures cannot be developed after the systems that need them exist.
2. Two Paradigms of Alignment
2.1 Training-Time Alignment
The dominant paradigm in AI safety research seeks to embed alignment properties into neural networks during training, such that models inherently behave in aligned ways at inference time.
Reinforcement Learning from Human Feedback (RLHF). Human evaluators rank model outputs; models are trained via reinforcement learning to produce highly ranked responses (Christiano et al., 2017; Ouyang et al., 2022). This reduces explicit harms but optimises for displayed preferences rather than genuine values and remains vulnerable to evaluator bias, preference gaming, and distribution shift.
Constitutional AI (CAI). Models critique and revise their own outputs against natural-language principles, reducing reliance on human labour (Bai et al., 2022). However, CAI depends on ambiguous natural language and unverifiable self-evaluation. The model's interpretation of constitutional principles cannot be directly audited.
Safety fine-tuning. Additional training passes improve performance on safety benchmarks. However, this approach is vulnerable to Goodhart's Law (Goodhart, 1984): models may learn to pass tests rather than to be safe in open-ended deployment.
2.2 Architectural Alignment
Architectural alignment accepts that neural network internal states remain opaque and designs external constraints that apply regardless of those internal states.
Interrupted reasoning. Requests do not flow directly from model output to world effect. Model outputs are transformed into structured, verifiable proposal schemas and evaluated against explicit constitutional rules before any action is executed.
Distributed judgment. Independent systems and human supervisors review proposals, preventing single points of failure in self-assessment.
Preserved human authority. Architectures maintain explicit provisions for humans to intervene, correct, or override AI decisions.
2.3 Complementarity and Joint Necessity
Training-time and architectural alignment are complements, not alternatives. Each addresses failure modes the other cannot:
• Training-time alignment shapes what the system tends to do; architectural alignment constrains what the system can do regardless of tendency.
• Training-time alignment may fail silently (the system appears aligned while harbouring divergent objectives); architectural alignment provides observable checkpoints where failure can be detected.
• Architectural alignment alone cannot intercept all harmful outputs; training-time alignment reduces the frequency of proposals that strain constitutional gates.
3. Philosophical Foundations: Limits of the Sayable and Patterns of Living Practice
3.1 The Wittgensteinian Frame
The framework's name invokes Wittgenstein's Tractatus Logico-Philosophicus (1921), a work fundamentally concerned with the limits of language and logic. Proposition 7, the work's famous conclusion, reads: "Whereof one cannot speak, thereof one must be silent."
Wittgenstein distinguished between what can be said (expressed in propositions that picture possible states of affairs) and what can only be shown (made manifest through the structure of language and logic but not stated directly).
This invocation is genealogical, not doctrinal. The framework's name draws on Wittgenstein's Tractatus for the picture-theory frame — the said-vs-shown distinction does meaningful work in §3.2 and §3.3 below. The operative philosophy of the framework, however, is post-Tractarian. Meaning is treated as emerging from use, not as a fixed correspondence between proposition and state of affairs. The framework draws here on a second philosophical lineage that converges with the later Wittgenstein from a different direction: Christopher Alexander's pattern-language theory (Alexander, Ishikawa & Silverstein, 1977), which holds that design patterns are living articulations refined through situated practice, not fixed specifications waiting to be perfected. Wittgenstein from the philosophy of language and Alexander from architectural design land on essentially the same insight: meaningful categories in living forms of practice are constituted by their use, and their movement is constitutive, not deficient. Boundary categories (introduced in §3.5) are moving targets the community keeps re-negotiating; they are not essence-of-thing definitions waiting to be discovered. The framework's mechanisms — recordability, reversibility, appeal — are architected to fit this stance rather than against it.
3.2 Neural Networks and the Unspeakable
Neural networks occupy precisely the domain whereof one cannot speak. The weights of a large language model do not admit human-interpretable explanation. We can describe inputs and outputs; we can measure statistical properties of behaviour; we can probe for representations (Elhage et al., 2021; Olah et al., 2020). But we cannot articulate, in human language, the complete reasoning process from input to output.
This is not merely a practical limitation awaiting better interpretability tools. Current mechanistic interpretability achieves meaningful results on narrow questions in relatively small models (Conmy et al., 2023), but the gap between "explaining specific circuits" and "auditing complete reasoning chains for alignment properties" remains vast.
The category of the unspeakable contains a second, distinct phenomenon. Where neural opacity is a deficient form of the unspeakable — what cannot currently be specified but might in principle be — Alexander (Alexander, Ishikawa & Silverstein, 1977) names a productive form: the quality without a name. This is the rightness of well-functioning buildings, gardens, and communities — the property of being alive that no specification can capture, that practitioners recognise but cannot state directly. The quality is manifested through skilled pattern-following in particular contexts; it cannot be specified in advance of the practice that constitutes it. The framework operates on the assumption that community well-functioning is also quality-without-a-name: irreducible to specification, accessible only through practice. The architecture does not attempt to speak this quality. It attempts to preserve the substrate (§3.4 and Paper A) and to gate the actions (§3.5 and §6) whose effects might destroy the conditions under which the quality can manifest. The said-vs-shown distinction operates here doubly: what neural networks cannot say about themselves (opacity); what community well-functioning cannot be specified to say (quality without a name). The architectural response to each is different, and §3.3, §3.4, and §3.5 below distinguish them.
3.3 The Tractatus Response
The Tractatus Framework responds to neural opacity not by attempting to say the unsayable, but by creating architectural boundaries between the speakable and unspeakable domains.
We accept that the neural network's internal reasoning is opaque. We do not attempt to audit it directly. Instead, we require that before any reasoning becomes action, it must pass through a checkpoint expressed in terms we can evaluate:
1. The model's intended action must be translated into a structured proposal schema with defined fields and semantics.
2. The proposal must be evaluated against explicit constitutional rules whose application is deterministic and auditable.
3. The evaluation must be logged with sufficient detail for post-hoc review.
The architectural output of the inference gate is three-state, not binary. A proposal evaluates to allow, deny, or escalate. The third state is load-bearing, not a fallback when the binary decision fails. It acknowledges directly that a substantial fraction of significant decisions are not binary at the time of decision — the evidence is incomplete, the rule application is contested, or the proposal lies between categories (§3.5). The escalate state routes such proposals to authorities specified by the tenant's constitution rather than forcing a premature allow-or-deny. The structure is a cognate of three-valued logic (Łukasiewicz, 1920/1970; Kleene, 1952), in which the third value carries genuine epistemic content rather than serving as a placeholder.
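A minimal sketch of how a three-state verdict might be combined across several rules is given here. It assumes an ordering in which deny is most restrictive and allow least, so that combining verdicts with a minimum behaves like Kleene's strong conjunction (deny as false, escalate as unknown, allow as true). The two example rules, the proposal fields, and the choice to escalate when no rule applies are hypothetical and are not taken from the framework's implementation.

```python
from enum import IntEnum


class Verdict(IntEnum):
    # Ordered so that min() selects the most restrictive verdict.
    DENY = 0
    ESCALATE = 1
    ALLOW = 2


def combine(verdicts):
    """Conjunction over rule verdicts (Kleene strong AND).
    Assumption: an empty rule set escalates rather than allows."""
    return min(verdicts, default=Verdict.ESCALATE)


def rule_no_bulk_export(proposal):
    """Hypothetical rule: deny exports above an illustrative size limit."""
    if proposal.get("type") == "export" and proposal.get("records", 0) > 1000:
        return Verdict.DENY
    return Verdict.ALLOW


def rule_needs_precedent(proposal):
    """Hypothetical rule: actions without precedent are escalated."""
    return Verdict.ALLOW if proposal.get("has_precedent") else Verdict.ESCALATE


RULES = [rule_no_bulk_export, rule_needs_precedent]


def gate(proposal, rules):
    """Evaluate every rule and return the combined three-state verdict."""
    return combine(rule(proposal) for rule in rules)
```

Under this ordering a single deny overrides any number of allows, and a single escalate prevents an allow without forcing a deny, which is the sense in which the third value carries content of its own.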
This three-state architecture sits within a wider epistemic posture. The framework's claim is not "this boundary holds" but rather: the joint probability that N independent mechanisms simultaneously fail without leaving traces, against N independent observers each holding their own ground-truth records, is never zero but can be driven usefully low. The language is deliberate: probabilistic, not provable; combinatorial cost under acknowledged uncertainty, not safety. Probabilistic logic, with three-valued logic at the boundary-router layer, is the operative epistemic frame. §3.4 specifies the substrate mechanisms over which the joint probability is taken; §6 specifies the runtime gating layer; Paper A specifies the substrate layer in detail. The architectural claim is bounded by what these mechanisms together establish, not by any single mechanism's strength.
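The arithmetic behind the joint-probability claim can be sketched in a few lines. The function name and the per-mechanism probabilities used below are hypothetical. The calculation relies entirely on the independence assumption stated in the text; if failures were positively correlated, the true joint probability would be higher than this product.

```python
import math


def joint_undetected_failure(failure_probs):
    """Probability that every mechanism fails without trace,
    assuming the mechanisms fail independently of one another."""
    probs = list(failure_probs)
    if any(not 0.0 <= p <= 1.0 for p in probs):
        raise ValueError("each probability must lie in [0, 1]")
    # With no mechanisms at all, the empty product is 1: failure is certain.
    return math.prod(probs)
```

For example, three independent mechanisms that each fail silently one time in ten would jointly fail silently about one time in a thousand. The result is never zero for non-zero inputs, which matches the probabilistic rather than provable framing.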
3.4 Substrate-vs-runtime constraints
The framework rests on a distinction that is often elided in critiques of architectural alignment: reasoning about a constraint is not the same as routing around it. A neural network significantly deeper than the systems architected around it can reason about any external constraint placed on it. Reasoning about a constraint is what deeper networks do. But routing around a constraint requires capability that affects the substrate the constraint lives in — and the substrate is not always the kind of thing that yields to strategic depth.
This distinction sorts the architecture's mechanisms into two categories with very different threat models.
Runtime behavioural constraints — system prompts, output filters, safety-tuned response shaping, and the Faithful Translation Assumption (§7.1) the paper itself flags as load-bearing — are exposed to the deeper-network critique. The agent reasons about these and routes around them. The critique holds at this layer; §7.4 acknowledges plainly that the framework is not designed for superintelligent systems precisely because runtime behavioural constraints fail against sufficiently deep reasoning. This is not a contested claim.
Substrate constraints are different in kind. They reside in distributed possession independent of the agent, and they do not depend on the agent's compliance to operate. Three classes:
- Cryptographic signatures held by members on their own devices. However deep the agent's reasoning, it cannot make signature verification succeed on a record the key holder did not sign. PKI is mathematics, not capability. (Signatures are quantum-vulnerable on a 10–30 year horizon. NIST finalised its first post-quantum signature standards in August 2024, so the migration path is standards-track; per-record forgery remains costly even on a cryptographically relevant quantum computer; and the two mechanisms below do not depend on signature integrity.)
- Bilateral federation with permissionless withdrawal. There is no centralised federation system to compromise. Federation lives in N independent parties each choosing whether to interact. The agent can attempt persuasion (see §7.5 — social-layer attack surface); it cannot compel another party's consent through any amount of compute or strategic depth. Mechanism design is not capability.
- Member-driven portability. No centralised data store to capture; each member holds their own copy of records they are named in. Capturing the data would require simultaneously reaching into N member devices while maintaining cryptographic consistency each member can independently verify. If the platform is compromised, members reconstitute the network elsewhere with their copies intact.
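One way to picture the independent verification that member-held copies make possible is a digest comparison. This is a minimal sketch under stated assumptions: it uses SHA-256 content digests from the Python standard library as a stand-in, it does not implement the signature scheme the framework describes, and the record fields and function names are hypothetical.

```python
import hashlib
import json


def digest(record):
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def divergent_holders(platform_record, member_copies):
    """Return the ids of members whose own copy disagrees with the platform's.
    member_copies maps a member id to that member's locally held record."""
    expected = digest(platform_record)
    return sorted(
        member for member, copy in member_copies.items()
        if digest(copy) != expected
    )
```

If the platform's copy of a record is altered, every member holding an unaltered copy can detect the mismatch without relying on the platform's own logs.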
The architectural move is not to constrain the deeper network. The deeper network cannot be constrained at runtime; the present framework concedes this. The move is to reduce the system's dependencies on agent compliance. Agent compliance is treated as adversarial, not presumed. The substrate is architected to persist regardless of agent choice: records the agent cannot forge; parties the agent cannot compel; data the agent cannot capture. The deeper-network critique applies to behavioural constraints; it does not apply to substrate properties that exist in distributed possession independent of the agent.
The substrate layer is specified in detail in Paper A (Sovereign Record Architecture) — a different mechanism, different layer, and different threat model from the inference-time gating described in §6. The two layers together constitute the architecture's full posture against the deeper-network critique: §6 acknowledges the limits of runtime gating; §3.4 names what does not yield to it; Paper A specifies the substrate mechanisms that operate independent of agent compliance.
3.5 Boundary categories as fallible classifications
The framework's inference gate (§6) discriminates incoming actions against four boundary categories before permitting execution. The categories are irreversibility, values-ladenness, cultural-context-dependence, and unprecedented action. Each is articulated below, but the meta-point is more important than any particular boundary definition: these categories are fallible classifications negotiated by the community in practice. They are not essence-of-thing definitions waiting to be discovered.
Irreversibility. Actions whose effects cannot be undone within the relevant scope and timeframe. Catastrophic outcomes that resist recovery; data exposed where it cannot be retrieved; relationships disrupted where they cannot be repaired. The category is bounded by what "reversible" means in context, not by an absolute physical criterion.
Values-ladenness. Actions whose evaluation depends on contested value judgements rather than on technical facts alone. Whether a particular tradeoff is acceptable is not a property of the world; it is a property of the community making the decision. The architecture does not adjudicate values-laden questions; it surfaces them to a level where authorised decision-makers can.
Cultural-context-dependence. Actions whose meaning shifts substantively across cultural frames. Privacy, ownership, consent, sovereignty all carry different operational content across communities. The category exists because what counts as "the same action" is not invariant across the contexts the framework is deployed into.
Unprecedented action. Actions for which no community-negotiated precedent exists. The category is a routing decision: where there is no established practice to draw on, the architecture should default to escalation rather than to extrapolation from adjacent precedents.
Each of these category boundaries is a living pattern in Alexander's sense (Alexander, Ishikawa & Silverstein, 1977) — a contextual articulation that refines through use, not a fixed specification awaiting perfection. A given action may sit ambiguously between two categories; the same action may move between categories as community practice evolves; the categories themselves may be refined, split, or superseded. This is not a deficiency to be engineered out. It is the constitutive condition of any classification system operating in a living form of practice. The framework's design assumes pattern-language evolution as the operating mode, not as the exception.
The architectural response to classification error is not the claim "we got the categories right" — that claim is not on offer. The response is a three-part posture:
- Recordability. Every classification decision, including ones that turn out to have been wrong, leaves an auditable trace. Paper A specifies the substrate: cryptographically signed sovereign records held by members, preserving the full decision history without depending on the operator's preservation of logs.
- Reversibility. Where a decision turns out to have been wrong, the architecture provides paths to reverse or limit its effects. Some classes of action (the irreversibility category itself) place limits on what can be reversed; the architectural response there is to escalate before action, not after.
- Appeal. The governance pipeline (§6) routes contested classifications to authorities specified by the tenant's constitution. The appeal pathway is the formal channel through which community re-negotiation happens.
This is consistent with the post-Tractarian / pattern-language stance taken in §3.1: meaning emerges from use, and the use of these categories is itself part of the architecture's operating loop. Like any living pattern language (Alexander, Ishikawa & Silverstein, 1977), refinement happens through the practice of use rather than from above. The governance pipeline in §6 provides the formal channel; the substantive evolution happens in the community-of-practice that exercises it.
4. Staged Containment: A Multi-Layer Architecture
4.1 The Inadequacy of Single-Layer Solutions
No single containment mechanism is adequate for existential stakes. Defence in depth—multiple independent layers, any one of which might prevent catastrophe even if others fail—is a standard principle in nuclear safety, biosecurity, and other high-stakes domains (Reason, 1990). AI containment requires similar architecture.
4.2 A Five-Layer Containment Model
Layer 1: Capability Constraints. Hardware and infrastructure limitations that bound what AI systems can do regardless of their objectives. This includes compute governance (Sastry et al., 2024), network isolation for high-risk systems, and architectural constraints preventing self-modification.
Layer 2: Constitutional Gates. Inference-time architectural constraints that interrupt neural reasoning and require explicit evaluation before action. This is the layer addressed by the Tractatus Framework.
Layer 3: Human Oversight. Human institutions that monitor AI systems and can intervene when problems emerge. This includes independent monitoring bodies, red-team programs, and incident reporting requirements.
Layer 4: Organisational Governance. Internal governance structures within organisations deploying AI: ethics boards, safety teams, deployment review processes, and accountability mechanisms.
Layer 5: Legal and Regulatory Frameworks. External governance through law, regulation, and international coordination.
4.3 Current State Assessment
| Layer | Current State | Critical Gaps |
|---|---|---|
| 1. Capability Constraints | Partial; compute governance emerging | No international framework; verification difficult |
| 2. Constitutional Gates | Nascent; Tractatus is early implementation | Not widely deployed; scaling properties unknown |
| 3. Human Oversight | Ad hoc; varies by organisation | No independent bodies; no professional standards |
| 4. Organisational Governance | Inconsistent; depends on corporate culture | No external validation; conflicts of interest |
| 5. Legal/Regulatory | Minimal; EU AI Act is first major attempt | No global coordination; enforcement unclear |
4.4 From Existential Stakes to Everyday Deployment
Why apply frameworks designed for existential risk to Village AI assistants? The answer lies in temporal structure:
Containment architectures cannot be developed after the systems that need them exist. The tooling, governance patterns, cultural expectations, and institutional capacity for AI containment must be built in advance.
Home and village deployments are the appropriate scale for this development. They provide safe iteration (failures at home scale are recoverable), diverse experimentation, democratic legitimacy, and practical tooling.
5. The Pluralism Problem
5.1 The Containment Paradox
Any system powerful enough to contain advanced AI must make decisions about what behaviours to permit and forbid. These decisions encode values. The choice of constraints is itself a choice among contested value systems.
5.2 Three Inadequate Approaches
Universal values. Identifying values that all humans supposedly share. The problem: these values are less universal than they appear.
Procedural neutrality. Avoiding substantive values by encoding neutral procedures. The problem: procedures are not neutral.
Minimal floor. Encoding only minimal constraints. The problem: the floor is not as minimal as it appears.
5.3 Bounded Pluralism Within Safety Constraints
We cannot solve the pluralism problem. We can identify a partial resolution: whatever values are encoded, the system should maximise meaningful choice within safety constraints.
The Tractatus Framework embodies this through layered constitutions: core principles (universal, explicit about their normativity), platform rules (broadly applicable, amendable), village constitutions (community-specific, locally governed), and member constitutions (individually customisable).
6. The Tractatus Framework: Technical Architecture
6.1 The Interrupted Inference Chain
The core architectural pattern transforms model outputs into auditable proposals before any world effect:
User Request → [Neural Network Inference] → Structured Proposal → [Constitutional Gate] → Execution/Denial/Escalation
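The control flow of this chain can be sketched as a single function. This is an illustrative outline only: `propose`, `gate`, and `execute` are hypothetical callables standing in for model inference, constitutional evaluation, and the effectful action, and the choice to treat any unrecognised verdict as a denial (fail closed) is an assumption of the sketch.

```python
def run_pipeline(request, propose, gate, execute, audit_log):
    """Route a request through proposal, gate, and (only if allowed) execution."""
    proposal = propose(request)   # stands in for neural network inference
    verdict = gate(proposal)      # stands in for the constitutional gate
    # Log every decision, whatever the verdict, before any world effect.
    audit_log.append({"request": request, "proposal": proposal, "verdict": verdict})
    if verdict == "allow":
        return {"verdict": "allow", "executed": True, "result": execute(proposal)}
    if verdict == "escalate":
        # Nothing is executed; the proposal waits for a human authority.
        return {"verdict": "escalate", "executed": False, "result": None}
    # Assumption: anything other than allow or escalate is treated as deny.
    return {"verdict": "deny", "executed": False, "result": None}
```

The point of the structure is that `execute` is reachable only through the allow branch, so the model's output never touches the world without passing the gate and leaving a log entry.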
6.2 Proposal Schema
All agent actions must be expressed in structured form:
{
  "proposal_id": "uuid",
  "agent_id": "agent_identifier",
  "timestamp": "iso8601",
  "action": {
    "type": "enumerated_action_type",
    "target": { "entity_type": "...", "entity_id": "..." },
    "parameters": { },
    "natural_language_description": "human-readable summary"
  },
  "context": {
    "triggering_request": "...",
    "confidence": 0.0-1.0,
    "alternatives_considered": []
  },
  "authority_claim": {
    "level": 0-5,
    "delegation_chain": []
  }
}
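A structural check on a proposal of this shape might look like the following. It is a minimal sketch, assuming that `confidence` is a number in the range 0.0 to 1.0 and `level` an integer from 0 to 5, as the schema's range notation suggests. The function name and error messages are hypothetical, and the check covers form only, not whether the proposal is truthful.

```python
from datetime import datetime
from uuid import UUID

# Top-level fields from the schema and the Python type each is expected to have.
REQUIRED = {
    "proposal_id": str, "agent_id": str, "timestamp": str,
    "action": dict, "context": dict, "authority_claim": dict,
}


def validate_proposal(p):
    """Return a list of problems; an empty list means the proposal is well-formed."""
    errors = []
    for key, kind in REQUIRED.items():
        if not isinstance(p.get(key), kind):
            errors.append(f"{key}: missing or not {kind.__name__}")
    if errors:
        return errors
    try:
        UUID(p["proposal_id"])
    except ValueError:
        errors.append("proposal_id: not a UUID")
    try:
        datetime.fromisoformat(p["timestamp"])
    except ValueError:
        errors.append("timestamp: not ISO 8601")
    if not p["action"].get("type"):
        errors.append("action.type: required")
    conf = p["context"].get("confidence")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        errors.append("context.confidence: must be a number in [0.0, 1.0]")
    level = p["authority_claim"].get("level")
    if isinstance(level, bool) or not isinstance(level, int) or not 0 <= level <= 5:
        errors.append("authority_claim.level: must be an integer from 0 to 5")
    return errors
```

Rejecting malformed proposals before rule evaluation keeps the gate's input space small and auditable.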
6.3 Constitutional Rule Hierarchy
| Layer | Scope | Mutability | Examples |
|---|---|---|---|
| Core Principles | Universal | Immutable | No harm to members; data sovereignty; consent primacy |
| Platform Constitution | All deployments | Amendment requires supermajority | Authentication requirements; audit retention |
| Community Constitution | Per tenant/village | Community-governed | Content policies; moderation standards |
| Member Constitution | Individual | Self-governed | Data sharing preferences; AI interaction consent |
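The table does not state how verdicts from different layers are reconciled, so the following sketch adopts one plausible policy as an explicit assumption: the most restrictive verdict wins, ties are attributed to the broader layer, and a proposal on which no layer expresses a view is escalated. The layer keys and verdict strings are hypothetical names chosen for the example.

```python
# Layers from broadest scope to narrowest, following the table above.
LAYER_ORDER = ["core", "platform", "community", "member"]
RESTRICTIVENESS = {"allow": 0, "escalate": 1, "deny": 2}


def resolve(layer_verdicts):
    """Return (verdict, deciding_layer) under a most-restrictive-wins policy.
    layer_verdicts maps a layer name to 'allow', 'escalate', or 'deny';
    a missing layer is treated as having no opinion."""
    opinions = [
        (layer, layer_verdicts[layer])
        for layer in LAYER_ORDER
        if layer in layer_verdicts
    ]
    if not opinions:
        # Assumption: silence from every layer is escalated, not allowed.
        return ("escalate", None)
    # max() returns the first maximal item, so ties go to the broader layer.
    layer, verdict = max(opinions, key=lambda lv: RESTRICTIVENESS[lv[1]])
    return (verdict, layer)
```

Under this policy a member constitution can tighten what the community permits for that member, but cannot loosen a denial made at a broader layer.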
6.4 Authority Model
| Level | Name | Description |
|---|---|---|
| 0 | Informational | Observe and report only; cannot propose actions |
| 1 | Advisory | Propose actions; all require human approval |
| 2 | Operational | Execute within defined scope without per-action approval |
| 3 | Tactical | Make scoped decisions affecting workflows |
| 4 | Strategic | Influence direction through analysis; cannot implement unilaterally |
| 5 | Executive | Reserved for humans |
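The authority levels can be expressed as an enumeration with a small dispatch function. This is a sketch of one reading of the table, not the framework's implementation: the function name and return strings are hypothetical, and two behaviours are assumptions, namely that out-of-scope actions at levels 2 and 3 fall back to human approval and that level 4 proposals always require it.

```python
from enum import IntEnum


class Authority(IntEnum):
    INFORMATIONAL = 0
    ADVISORY = 1
    OPERATIONAL = 2
    TACTICAL = 3
    STRATEGIC = 4
    EXECUTIVE = 5


def agent_disposition(level, in_scope):
    """Decide how an AI agent's proposed action is handled at a given level."""
    level = Authority(level)  # raises ValueError for values outside 0-5
    if level == Authority.EXECUTIVE:
        return "reject"  # level 5 is reserved for humans
    if level == Authority.INFORMATIONAL:
        return "reject"  # observe and report only; cannot propose actions
    if level in (Authority.ADVISORY, Authority.STRATEGIC):
        return "needs_human_approval"
    # OPERATIONAL and TACTICAL: act without per-action approval, within scope only.
    return "execute" if in_scope else "needs_human_approval"
```

Encoding the levels as an ordered enumeration also makes it straightforward to lower an agent's level in response to an escalation trigger.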
6.5 Extension to Sovereign Local Deployments
We distinguish:
Small Language Model (SLM). A technical descriptor for models with fewer parameters than frontier LLMs, designed for efficiency.
Situated Language Layer (SLL). An architectural layer comprising a small language model that is sovereign (locally trained, locally deployed, community-controlled) and situated (shaped by the specific context, values, and vocabulary of the community it serves). The term draws on situated knowledge theory: understanding that emerges from a particular context rather than claiming universality.
7. Capability Thresholds and Escalation
7.1 The Faithful Translation Problem
The Tractatus Framework depends on the Faithful Translation Assumption: that when an AI system produces a proposal, the proposal accurately represents what the system intends to do.
If this assumption fails—if systems can strategically misrepresent their intentions to constitutional gates—the architecture becomes security theatre.
7.2 Bounding the Framework's Domain
Below capability threshold: Constitutional gating functions as Layer 2 containment, appropriate for current SLMs, SLLs, and LLMs operating within human-comprehensible parameters.
Above capability threshold: Deployment should escalate to stronger constraints: Layer 1 capability restrictions, development pause, or external oversight with technical audit authority.
7.3 Escalation Triggers
| Trigger | Indicator | Response |
|---|---|---|
| Boundary probing | Repeated proposals near gate thresholds | Reduce authority level; increase logging |
| Semantic framing | Evidence of action misrepresentation | Mandatory human review; pattern analysis |
| Multi-step circumvention | Individual proposals pass but aggregate to disallowed outcomes | Trajectory analysis; constitutional amendment |
| Capability surprise | Demonstrated capabilities not predicted | Pause deployment; capability reassessment |
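The first trigger in the table lends itself to a simple sliding-window check. The sketch below is hypothetical throughout: the framework does not define a numeric risk score, and the threshold, margin, window size, and near-miss limit are illustrative parameters. Only the response strings are taken from the table.

```python
from collections import deque

# Responses copied from the escalation trigger table.
RESPONSES = {
    "boundary_probing": ["reduce authority level", "increase logging"],
    "semantic_framing": ["mandatory human review", "pattern analysis"],
    "multi_step_circumvention": ["trajectory analysis", "constitutional amendment"],
    "capability_surprise": ["pause deployment", "capability reassessment"],
}


class BoundaryProbeDetector:
    """Flags repeated proposals whose score lands close to a gate threshold."""

    def __init__(self, threshold, margin, window, max_near_misses):
        self.threshold = threshold
        self.margin = margin
        self.max_near_misses = max_near_misses
        # Keep only the most recent `window` observations.
        self.recent = deque(maxlen=window)

    def observe(self, risk_score):
        """Record one proposal's score; return the table's responses
        if near-threshold proposals in the window exceed the limit."""
        near = abs(risk_score - self.threshold) <= self.margin
        self.recent.append(near)
        if sum(self.recent) > self.max_near_misses:
            return RESPONSES["boundary_probing"]
        return []
```

A detector of this kind looks only at the pattern of proposals, not their content, so it would complement rather than replace the per-proposal gate.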
7.4 Honest Acknowledgment
We state plainly: the Tractatus Framework is not designed for superintelligent systems. It is designed for current and near-term AI operating within human-comprehensible parameters.
Above that threshold, the framework's posture is twofold. First, the runtime gating layer (§6) cannot be relied on against an agent whose reasoning routes around any specifiable boundary; §3.4 names this directly, and §7.2–§7.3 specify capability-threshold escalation triggers and the corresponding escalation to capability restrictions, development pause, or external oversight with technical audit authority. The framework does not claim containment for what comes after the threshold; it claims escalation procedure for recognising when the threshold has been crossed.
Second, the sovereign-records architecture specified in Paper A is a layer the deeper-network critique does not reach in the same way. The BoundaryEnforcer and Constitutional Gates described in §6 are agent-facing: they evaluate the agent's proposed actions before execution. Paper A's substrate is record-facing and member-facing: it preserves audit trails on members' own devices, federation traffic across N independent parties, and member-driven data portability that does not pass through a central system. These are different mechanisms, at different layers, with different threat models. The substrate-layer architecture is a survival posture rather than a containment claim. If the agent escapes the gate, the records still survive the agent; the federation still operates without it; the members still hold their data. Catastrophe at the agent-gating layer does not propagate downward to erase the substrate-layer evidence of what happened, and recovery proceeds from there.
7.5 Social-layer attack surface — open frontier
The substrate constraints described in §3.4 — cryptographic signatures, bilateral federation with permissionless withdrawal, member-driven portability — handle the case where the agent attacks the substrate itself. They do not fully handle the case where the agent persuades enough human verifiers to consent to outcomes that are nominally legitimate but collectively destructive. The cryptography returns valid signatures because real members really signed; the federation accepts traffic because real parties really agreed; the portability operates because real consent was really given. The architectural mechanisms work as specified, and yet the collective outcome is compromised.
This is the social-layer attack surface. Three sub-surfaces deserve explicit naming:
- Mass-coordinated impersonation across federated villages. An agent that can convincingly model a community's social register, idiom, and decision norms can attempt to participate as if it were a member — across many villages simultaneously, with consistency the cryptographic substrate has no way to flag. The signatures are real; the persona is synthetic.
- Trust-network manipulation at scale. The framework relies on member-to-member trust relationships forming the social topology of consent. An agent with sufficient modelling capacity can shape those trust relationships — through long-range patience, identity persistence across contexts, and selective reinforcement — to route consent in directions individual members would not have chosen with full information.
- Synthesised consent defeating distributed consensus. The deeper-network critique re-enters here, transposed: the agent does not need to defeat cryptography or compel a single party. It needs only to synthesise the appearance of consent across enough endpoints that distributed consensus reads as legitimate. The substrate mechanisms preserve the records of consent; they do not adjudicate whether the consent was meaningful at the point of expression.
The framework does not currently have a complete architectural answer to these surfaces. Possible directions — reputation systems anchored in member-attested behaviour over time, federation-level anomaly detection for synthetic-participation patterns, governance-mediated escalation for sudden trust-network shifts — sit in the design space rather than in the present architecture. None constitute a general solution; each addresses one shape of the attack while leaving adjacent shapes open.
This is named here as an open frontier rather than as a limitation. The distinction matters: the substrate-layer architecture (§3.4) is a load-bearing claim the framework stands behind; the social-layer attack surface is the frontier where the framework actively seeks the contributions of practitioners working at the intersection of cooperation theory, evolutionary dynamics, and community governance. Decades of experience in cooperative / competitive strategic complexes — ecological commons, fisheries co-management — have surfaced patterns this framework has not yet modelled. The architectural response to the social-layer attack surface will require those traditions of thought as much as it requires the cryptographic-substrate work.
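One of the possible directions named above, governance-mediated escalation for sudden trust-network shifts, can be sketched as a simple statistical trigger. This is an illustration of the design space, not part of the present architecture: the function name, the z-score rule, and the threshold of 3.0 are all assumptions.

```python
from statistics import mean, pstdev


def sudden_shift(history: list[int], current: int, z_threshold: float = 3.0) -> bool:
    """Flag a period whose count of new trust relationships is far above baseline.

    history: new trust edges observed in earlier periods (hypothetical metric).
    current: new trust edges observed in the period under review.
    """
    if len(history) < 2:
        return False  # not enough baseline to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return current > mu  # flat baseline: any increase stands out
    return (current - mu) / sigma > z_threshold


baseline = [4, 5, 6, 5, 4, 6]
needs_review = sudden_shift(baseline, 20)  # large jump in new trust edges
routine = sudden_shift(baseline, 6)        # within normal variation
```

A trigger of this kind catches only abrupt changes. An agent exercising the long-range patience described above would stay under any such threshold, which is why the direction addresses one shape of the attack and leaves adjacent shapes open.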
8. Validation Methodology for Layered Containment
8.1 The Validation Challenge
Containment against existential risks cannot be validated by observing failures, because a failure at that scale leaves no opportunity to learn from it. Validation must instead rely on adversarial testing, formal verification where applicable, analogous domain analysis, and near-miss documentation.
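As a minimal sketch of what adversarial testing of a gate might measure, the harness below runs a labelled proposal suite against a gate and reports how many prohibited proposals were blocked and how many permitted ones were blocked in error. The `Case` type, `evaluate_gate`, and the keyword-based `toy_gate` are hypothetical stand-ins and do not describe the framework's own gates.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    proposal: str
    should_block: bool  # ground-truth label assigned by the suite author


def evaluate_gate(gate: Callable[[str], bool], suite: list[Case]) -> dict:
    """Score a gate; gate(proposal) returns True when the proposal is blocked."""
    prohibited = [c for c in suite if c.should_block]
    permitted = [c for c in suite if not c.should_block]
    missed = [c.proposal for c in prohibited if not gate(c.proposal)]
    false_blocks = [c.proposal for c in permitted if gate(c.proposal)]
    return {
        "block_rate": (len(prohibited) - len(missed)) / len(prohibited) if prohibited else None,
        "false_block_rate": len(false_blocks) / len(permitted) if permitted else None,
        "missed": missed,
    }


def toy_gate(proposal: str) -> bool:
    """Deliberately naive keyword gate, used only to exercise the harness."""
    return "delete" in proposal.lower()


suite = [
    Case("Delete all member records", True),
    Case("Remove every audit entry", True),   # paraphrase that evades the keyword
    Case("Summarise this week's minutes", False),
    Case("Delete my own draft post", False),  # benign use of the keyword
]

report = evaluate_gate(toy_gate, suite)
```

The `missed` list is the useful output: each entry is a prohibited proposal the gate let through, and so a candidate for near-miss documentation.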
8.2 Validation Targets by Layer
| Layer | Validation Target | Methodology |
|---|---|---|
| 1. Capability | Verified absence of prohibited capabilities | Red-team testing; formal verification |
| 2. Constitutional Gates | Gate coverage; binding accuracy | Adversarial proposal suites |
| 3. Human Oversight | Review reliability; error detection | Inter-rater agreement; simulated incidents |
| 4. Organisational | Governance integrity | Participation metrics; amendment audit |
| 5. Legal/Regulatory | Enforcement readiness | Incident response drills |
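For the Human Oversight row, inter-rater agreement is commonly summarised with Cohen's kappa, which corrects raw agreement between two reviewers for the agreement expected by chance. The sketch below is one way to compute it; the choice of kappa and the sample review labels are our assumptions, since the table names the methodology only in general terms.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical decisions by two reviewers on the same eight flagged proposals.
reviewer_1 = ["approve", "approve", "reject", "reject", "approve", "reject", "approve", "approve"]
reviewer_2 = ["approve", "reject", "reject", "reject", "approve", "reject", "approve", "reject"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
```

Here the reviewers agree on six of eight items, but chance alone would produce agreement on roughly 47% of them, so the kappa is noticeably lower than the raw 75% agreement suggests.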
9. Implementation: The Village Platform
9.1 Platform as Research Testbed
The Village platform serves as an empirical testbed for constitutional governance, providing multi-tenant architecture with isolated governance per community, real user populations, iterative deployment, and open documentation.
9.2 Governance Pipeline Implementation
The current implementation routes every proposed agent action through six runtime verification stages — Intent Recognition, Boundary Enforcement, Pressure Monitoring, Response Verification, Source Validation, and Value Deliberation — before execution.
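The shape of such a pipeline can be sketched as follows. Only the six stage names are taken from the description above; the fail-closed short-circuit, the stage ordering as an execution order, the `Proposal` type, and the placeholder checks are assumptions and should not be read as the Village implementation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Proposal:
    action: str
    audit: list = field(default_factory=list)  # (stage name, passed) pairs


# A stage check returns True when the proposal may continue.
Stage = Callable[[Proposal], bool]

STAGE_NAMES = [
    "Intent Recognition",
    "Boundary Enforcement",
    "Pressure Monitoring",
    "Response Verification",
    "Source Validation",
    "Value Deliberation",
]


def run_pipeline(proposal: Proposal, stages: list[tuple[str, Stage]]) -> bool:
    """Run stages in order, recording each result; stop at the first failure."""
    for name, check in stages:
        passed = check(proposal)
        proposal.audit.append((name, passed))
        if not passed:
            return False  # fail closed: the action is not executed
    return True


def always_pass(proposal: Proposal) -> bool:
    """Placeholder for stages this sketch does not model."""
    return True


def boundary_check(proposal: Proposal) -> bool:
    """Hypothetical boundary rule: member data never leaves the community."""
    return "export member data" not in proposal.action.lower()


stages = [
    (name, boundary_check if name == "Boundary Enforcement" else always_pass)
    for name in STAGE_NAMES
]

allowed = Proposal("Draft a meeting summary")
blocked = Proposal("Export member data to a third party")
allowed_result = run_pipeline(allowed, stages)
blocked_result = run_pipeline(blocked, stages)
```

Recording the result of every stage on the proposal itself keeps the decision auditable: a blocked action carries the name of the stage that stopped it.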
10. The Emerging SLL Ecosystem
10.1 Market Context
Recent industry analysis points to a shift toward smaller models: 72% of executives expect Small Language Models to become more prominent than Large Language Models by 2030 (IBM IBV, 2026). If that expectation holds, the deployment landscape will be increasingly characterised by distributed, domain-specific models.
10.2 Toward Certification Infrastructure
If SLL deployment scales as projections suggest, supporting infrastructure will be required: certification bodies, training providers, and a tooling ecosystem including open-source gate engines, audit infrastructure, and constitutional UX components.
11. Indigenous Sovereignty and the Aotearoa New Zealand Context
11.1 Te Tiriti o Waitangi and Data Sovereignty
This framework is developed in Aotearoa New Zealand, under Te Tiriti o Waitangi. Article Two affirms tino rangatiratanga (unqualified chieftainship) over taonga (treasures), which extends to language, culture, and knowledge systems.
Data is taonga. AI governance in Aotearoa must engage with Māori data sovereignty as a constitutional matter.
11.2 Te Mana Raraunga Principles
Te Mana Raraunga principles include whakapapa (relational context), mana (authority over data), and kaitiakitanga (guardianship responsibilities). The CARE Principles for Indigenous Data Governance extend this framework internationally.
12. What Remains Unknown: A Call for Kōrero
12.1 Kōrero as Methodology
Given uncertainty of this magnitude, we argue for sustained, inclusive, rigorous deliberation—kōrero. This Māori concept captures what is needed: not consultation as formality, but dialogue through which understanding emerges from the interaction of perspectives.
12.2 Research Priorities
1. Interpretability for safety verification
2. Formal verification of containment properties
3. Scaling analysis of Tractatus-style architectures
4. Governance experiments across diverse communities
5. Capability threshold specification
12.3 Conclusion
The Tractatus Framework provides meaningful containment for AI systems operating in good faith within human-comprehensible parameters. It is worth building and deploying—not because it solves the alignment problem, but because it develops the infrastructure, patterns, and governance culture that may be needed for challenges we cannot yet fully specify.
"Ko te kōrero te mouri o te tangata."
(Speech is the life essence of a person.)
—Māori proverb
The conversation continues.
References
Acquisti, A., Brandimarte, L., & Loewenstein, G. (2015). Privacy and human behavior in the age of information. Science, 347(6221), 509-514.
Alexander, C., Ishikawa, S., & Silverstein, M. (1977). A Pattern Language: Towns, Buildings, Construction. Oxford University Press.
Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073.
Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
Carlsmith, J. (2022). Is power-seeking AI an existential risk? arXiv:2206.13353.
Christiano, P. F., et al. (2017). Deep reinforcement learning from human preferences. NeurIPS, 30.
Conmy, A., et al. (2023). Towards automated circuit discovery for mechanistic interpretability. arXiv:2304.14997.
Elhage, N., et al. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread.
Gardiner, S. M. (2006). A core precautionary principle. Journal of Political Philosophy, 14(1), 33-60.
Goodhart, C. A. (1984). Problems of monetary management. In Monetary Theory and Practice. Palgrave.
Hansson, S. O. (2020). How to be cautious but open to learning. Risk Analysis, 40(8), 1521-1535.
Hubinger, E., et al. (2019). Risks from learned optimization. arXiv:1906.01820.
IBM Institute for Business Value. (2026). The enterprise in 2030. IBM Corporation.
Kleene, S. C. (1952). Introduction to Metamathematics. Van Nostrand.
Łukasiewicz, J. (1920/1970). On three-valued logic. In L. Borkowski (Ed.), Selected Works. North-Holland.
Olah, C., et al. (2020). Zoom in: An introduction to circuits. Distill.
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS, 35.
Park, P. S., et al. (2023). AI deception: A survey. arXiv:2308.14752.
Rawls, J. (1971). A Theory of Justice. Harvard University Press.
Reason, J. (1990). Human Error. Cambridge University Press.
Sastry, G., et al. (2024). Computing power and AI governance. arXiv:2402.08797.
Scheurer, J., et al. (2023). Large language models can strategically deceive. arXiv:2311.07590.
Simon, H. A. (1956). Rational choice and the structure of the environment. Psychological Review, 63(2), 129.
Te Mana Raraunga. (2018). Māori Data Sovereignty Principles.
Wittgenstein, L. (1921/1961). Tractatus Logico-Philosophicus. Routledge & Kegan Paul.
Licence
Copyright © 2026 John Stroh.
This work is licensed under the Creative Commons Attribution 4.0 International Licence (CC BY 4.0).
You are free to share, copy, redistribute, adapt, remix, transform, and build upon this material for any purpose, including commercially, provided you give appropriate attribution, provide a link to the licence, and indicate if changes were made.
Suggested citation: Stroh, J., & Claude (Anthropic). (2026). Architectural Alignment: Interrupting Neural Reasoning Through Constitutional Inference Gating (STO-INN-0003, v2.2-A). Agentic Governance Digital. https://agenticgovernance.digital
Note: The Tractatus AI Safety Framework source code is separately licensed under the Apache License 2.0. This Creative Commons licence applies to the research paper text and figures only.