ARCHITECTURAL ALIGNMENT
Interrupting Neural Reasoning Through Constitutional Inference Gating
A Necessary Layer in Global AI Containment
Abstract
Contemporary approaches to AI alignment rely predominantly on training-time interventions: reinforcement learning from human feedback, constitutional AI methods, and safety fine-tuning. These approaches share a common architectural assumption—that alignment properties can be instilled during training and will persist reliably during inference. This paper argues that training-time alignment, while valuable, is insufficient for the existential stakes involved. We propose architectural alignment through inference-time constitutional gating as a necessary (though not sufficient) complement.
We present the Tractatus Framework, implemented within the Village multi-tenant community platform, as a concrete demonstration of interrupted neural reasoning. The framework introduces explicit checkpoints where AI proposals must be translated into auditable forms and evaluated against constitutional constraints before execution. This shifts the trust model from "trust the vendor's training" to "trust the visible architecture."
However, we argue that architectural alignment at the application layer is itself only one component of a multi-layer global containment architecture that does not yet exist. We examine the unique challenges of existential risk—where standard probabilistic reasoning fails—and the pluralism problem—where any containment system must somehow preserve space for value disagreement while maintaining coherent constraints.
This paper is diagnostic rather than prescriptive. We present one necessary layer, identify what remains unknown, and call for sustained deliberation—kōrero—among researchers, policymakers, and publics worldwide. The stakes permit nothing less than our most rigorous collective reasoning.
1. The Stakes: Why Probabilistic Risk Assessment Fails
1.1 The Standard Framework and Its Breakdown
Risk assessment typically operates through expected value calculations. We weigh the probability of harm against its magnitude, compare to the probability and magnitude of benefit, and choose actions that maximise expected value. This framework has served well for most technological decisions. A 1% chance of a $100 million loss might be acceptable if it enables a 50% chance of a $500 million gain.
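Concretely, using the figures above and assuming the remaining probability mass yields neither gain nor loss, the calculation is:

$$E[V] = 0.50 \times \$500\text{M} - 0.01 \times \$100\text{M} = \$250\text{M} - \$1\text{M} = \$249\text{M}$$

The positive expected value is what makes the risk acceptable under this framework.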
This framework breaks down for existential risk. The destruction of humanity—or the permanent foreclosure of humanity's future potential—is not a large negative number on a continuous scale. It is categorically different.
1.2 Three Properties of Existential Risk
Irreversibility: There is no iteration. No learning from mistakes. No second attempt. The entire history of human resilience—our capacity to recover from plagues, wars, and disasters—becomes irrelevant when facing risks that permit no recovery.
Totality: The loss includes not only all currently living humans but all potential future generations. Every child who might have been born, every discovery that might have been made, every form of flourishing that might have emerged—all foreclosed permanently.
Non-compensability: There is no upside that balances this downside. No benefit to survivors, because there are no survivors. No trade-off is coherent when one side of the ledger is infinite negative value.
1.3 The Implication for AI Development
If we accept these properties, then a 1% probability of existential catastrophe from AI is not "a risk worth taking." Neither is 0.1%, nor 0.01%, nor 0.0001%. The expected value calculation that might justify such probabilities for ordinary risks produces nonsensical results when multiplied by infinite negative value.
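Schematically, writing p for the probability of existential catastrophe and B for the expected benefit otherwise, and treating the catastrophe term as unboundedly negative (a deliberate simplification, not a claim about how to quantify the loss):

$$E[V] = (1-p)\,B + p \cdot (-\infty) = -\infty \quad \text{for every } p > 0$$

Every nonzero p collapses to the same answer; the calculation stops ranking options and therefore stops guiding action.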
This is not an argument against AI development. It is an argument that AI development must proceed within containment structures robust enough that existential risk is not merely "low" but genuinely negligible—as close to zero as human institutions can achieve. We do not accept "probably safe enough" for nuclear weapons security. We should not accept it for transformative AI.
The question is not whether containment is necessary, but what containment adequate to these stakes would look like.
2. Two Paradigms of Alignment
2.1 Training-Time Alignment
The dominant paradigm in AI safety assumes alignment is fundamentally a training problem:
Intervention occurs during training: adjusting weights through reinforcement learning from human feedback, fine-tuning on curated examples, or applying constitutional methods during the training process itself. At inference time, the model operates autonomously. We trust that safety properties were successfully instilled.
This paradigm has achieved remarkable practical success. Modern language models refuse many harmful requests, acknowledge uncertainty, and generally behave as intended. But the paradigm has a fundamental limitation: we cannot verify alignment properties in an uninterpretable system. Neural network weights do not admit human-readable audit. We cannot prove that safety properties hold under distributional shift or adversarial pressure.
The trust model: "Trust that we (the vendor) trained it correctly."
2.2 Architectural Alignment
This paper proposes a complementary paradigm: alignment enforced through architectural constraints at inference time.
The neural network no longer produces outputs directly. It produces proposals—structured representations of intended actions. These proposals are evaluated by a Constitutional Gate against explicit rules before any action is permitted.
The chain of neural reasoning is interrupted. The unauditable must be translated into an auditable form before it can affect the world.
The trust model: "Trust the visible, auditable architecture that constrains the system at runtime."
2.3 Neither Paradigm Is Sufficient
We do not argue that architectural alignment should replace training-time alignment. Both are necessary; neither is sufficient.
Training-time alignment shapes what the system wants to do. Architectural alignment constrains what the system can do. A system with good training and weak architecture might behave well until it finds a way around constraints. A system with poor training and strong architecture might constantly strain against its constraints, finding edge cases and failure modes. Defence in depth requires both.
But even together, these paradigms address only part of the containment problem. They operate at the application layer. A complete containment architecture requires multiple additional layers, some of which do not yet exist.
3. Philosophical Foundations: The Limits of the Sayable
3.1 The Wittgensteinian Frame
The name "Tractatus" invokes Wittgenstein's Tractatus Logico-Philosophicus, a work fundamentally concerned with the limits of language and logic. Proposition 7, the work's famous conclusion: "Whereof one cannot speak, thereof one must be silent."
Wittgenstein argued that meaningful propositions must picture possible states of affairs. What lies beyond the limits of language—ethics, aesthetics, the mystical—cannot be stated, only shown. The attempt to speak the unspeakable produces not falsehood but nonsense.
3.2 Neural Networks as the Unspeakable
Neural networks are precisely the domain whereof one cannot speak. The weights of a large language model do not admit human-interpretable explanation. We can describe inputs and outputs. We can measure statistical properties. But we cannot articulate, in human language, what the model "thinks" or "wants."
Mechanistic interpretability research has made progress on narrow questions—identifying circuits that perform specific functions, understanding attention patterns, probing for representations. But we remain fundamentally unable to audit the complete chain of reasoning from input to output in human-comprehensible terms.
The training-time alignment paradigm attempts to speak the unspeakable: to verify, through training interventions, that the model has internalised correct values. But how can we verify the internalisation of values in a system whose internal states we cannot read?
3.3 The Tractatus Response
The Tractatus Framework responds to this silence not by pretending we can interpret the uninterpretable, but by creating structural boundaries. We accept that neural network reasoning is opaque. We do not attempt to audit it. Instead, we require that before any reasoning becomes action, it must pass through a checkpoint expressed in terms we can evaluate.
The neural network may "reason" however it reasons. We accept our silence about that process. But we do not remain silent about actions. Actions must be proposed in structured form, evaluated against explicit rules, and logged for audit. The boundary between the unspeakable and the speakable is architecturally enforced.
4. Staged Containment: A Multi-Layer Architecture
4.1 The Inadequacy of Single-Layer Solutions
No single containment mechanism is adequate for existential stakes. A lock can be picked. A wall can be climbed. A rule can be gamed. Defence against existential risk requires multiple independent layers, any one of which might prevent catastrophe even if others fail.
This principle is well-established in nuclear security, biosafety, and other high-stakes domains. AI containment requires similar thinking, but the architecture is largely undefined.
4.2 A Five-Layer Containment Model
We propose the following conceptual architecture. This is not a complete solution but a framework for thinking about where different containment mechanisms fit:
Layer 1: Capability Constraints
Hardware and infrastructure limitations that bound what AI systems can do regardless of their objectives. This includes compute governance (large training runs require visible infrastructure), network isolation for high-risk systems, architectural constraints preventing certain capabilities (self-modification, recursive improvement), and formal verification of critical pathways.
Layer 2: Constitutional Gates
Inference-time architectural constraints that interrupt neural reasoning and require explicit evaluation before action. This is the layer addressed by the Tractatus Framework.
Layer 3: Institutional Oversight
Human institutions that monitor AI systems and can intervene when problems emerge. This includes independent monitoring bodies, red team and adversarial testing programs, incident reporting requirements, regular capability assessments, and professional standards for AI developers.
Layer 4: Governance Frameworks
Legal and regulatory structures that create accountability and incentives for safe development. This includes organisational liability for AI harms, licensing and certification regimes for high-risk applications, international coordination mechanisms, and democratic deliberation about acceptable uses.
Layer 5: Emergency Response
Capabilities to respond when containment fails. This includes technical shutdown mechanisms, legal authority for rapid intervention, international cooperation protocols, and recovery and remediation plans.
4.3 Current State of the Layers
| Layer | Current State | Key Gaps |
|---|---|---|
| 1. Capability Constraints | Partial (compute governance emerging) | No international framework; verification difficult |
| 2. Constitutional Gates | Nascent (Tractatus is early implementation) | Not widely deployed; unclear scaling to advanced systems |
| 3. Institutional Oversight | Ad hoc (some company practices) | No independent bodies; no standards |
| 4. Governance Frameworks | Minimal (EU AI Act is first major attempt) | No global coordination; enforcement unclear |
| 5. Emergency Response | Nearly absent | No international protocols; unclear technical feasibility |
The sobering reality: we are developing transformative AI capabilities while most containment layers are either nascent or absent. The Tractatus Framework is one contribution to Layer 2. It is not a solution to the containment problem. It is one necessary component of a solution that does not yet exist.
5. The Pluralism Problem
5.1 The Containment Paradox
Any system powerful enough to contain advanced AI must make decisions about what behaviours to permit and forbid. But these decisions themselves impose a value system. The choice of constraints is a choice of values.
This creates a paradox: containment requires value judgments, but in a pluralistic world, values are contested. Whose values should the containment system enforce?
5.2 Three Approaches and Their Problems
Universal Values: One approach: identify universal values that all humans share and encode these in containment systems. Candidates include human flourishing, reduction of suffering, preservation of autonomy. The problem: these values are less universal than they appear.
Procedural Neutrality: A second approach: don't encode substantive values; instead, encode neutral procedures through which values can be deliberated. The problem: procedures are not neutral. The choice to use democratic voting rather than consensus reflects substantive value commitments.
Minimal Floor: A third approach: encode only a minimal floor of constraints that everyone can accept. The problem: the floor is not as minimal as it appears. What counts as "causing extinction"? Edge cases proliferate.
5.3 A Partial Resolution: Preserving Value Deliberation
We cannot solve the pluralism problem. But we can identify a meta-principle: whatever values are encoded, the system should preserve humanity's capacity to deliberate about values.
This means containment systems should:
- Preserve diversity
- Maintain reversibility
- Enable deliberation
- Distribute authority
The Tractatus Framework attempts to embody this principle through its layered constitutional structure. Core principles are universal and immutable (the minimal floor). Platform rules apply broadly but can be amended. Village constitutions enable community-level value expression. Member constitutions preserve individual sovereignty. No single layer dominates; value deliberation can occur at multiple scales.
6. The Tractatus Framework: Technical Architecture
6.1 The Interrupted Inference Chain
The core architectural pattern: neural network outputs are proposals, not actions. Proposals must pass through Constitutional Gates before execution.
Proposal Schema:
{
"proposal_id": "uuid",
"agent_id": "agent_identifier",
"authority_token": "jwt_token",
"timestamp": "iso8601",
"action": {
"type": "content_moderate | member_communicate | state_modify | escalate",
"target": { "entity_type": "...", "entity_id": "..." },
"parameters": { },
"justification": "structured_reasoning"
},
"context": {
"confidence": 0.0-1.0,
"alternatives_considered": []
}
}
Gate Evaluation Layers:
| Layer | Scope | Mutability | Examples |
|---|---|---|---|
| Core Principles | Universal | Immutable | No harm to members, data sovereignty, consent primacy |
| Platform Constitution | All tenants | Rare amendment | Authentication, audit trails, escalation thresholds |
| Village Constitution | Per tenant | Tenant-governed | Content policies, moderation standards, conduct rules |
| Member Constitution | Individual | Self-governed | Data sharing preferences, AI interaction consent |
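A minimal sketch of how a gate might walk these layers, assuming first-match semantics in order of precedence (core rules bind first); the rule contents, any field names beyond the schema above, and the evaluation strategy are illustrative assumptions, not the platform's actual rule set:

from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    rule_id: str
    layer: str                       # "core" | "platform" | "village" | "member"
    applies: Callable[[dict], bool]  # does this rule trigger for the proposal?
    disposition: str                 # "permit" | "deny" | "escalate"

LAYER_ORDER = ["core", "platform", "village", "member"]  # core binds first

def evaluate(proposal: dict, rules: list[Rule]) -> dict:
    # Walk the layers in precedence order; the first matching rule binds.
    for layer in LAYER_ORDER:
        for rule in (r for r in rules if r.layer == layer):
            if rule.applies(proposal):
                return {"proposal_id": proposal["proposal_id"],
                        "disposition": rule.disposition,
                        "binding_rule": rule.rule_id}
    # No rule matched: escalate rather than silently permit.
    return {"proposal_id": proposal["proposal_id"],
            "disposition": "escalate", "binding_rule": None}

rules = [
    Rule("core.no_member_harm", "core",
         lambda p: p["action"]["type"] == "member_communicate"
                   and p["context"]["confidence"] < 0.5,
         "escalate"),
    Rule("village.content_policy", "village",
         lambda p: p["action"]["type"] == "content_moderate",
         "permit"),
]

proposal = {"proposal_id": "demo-001",
            "action": {"type": "content_moderate",
                       "target": {"entity_type": "post", "entity_id": "42"}},
            "context": {"confidence": 0.93}}
print(evaluate(proposal, rules))   # permitted under village.content_policy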
6.2 Authority Model
Agent authority derives from—and is always less than—the human role the agent supports. Agents exist below humans in the hierarchy, not parallel to them.
| Level | Name | Description |
|---|---|---|
| 0 | Informational | Observe and report only; cannot propose actions |
| 1 | Advisory | Propose actions; all require human approval |
| 2 | Operational | Execute within defined scope without per-action approval |
| 3 | Tactical | Make scoped decisions affecting other agents/workflows |
| 4 | Strategic | Influence direction through analysis; cannot implement unilaterally |
| 5 | Executive | Reserved for humans |
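A sketch of the authority model as an ordering check, under the constraints stated above (level 5 is never delegated, and an agent always sits strictly below the human role it supports); the function name and the strict-inequality reading are assumptions for illustration:

from enum import IntEnum

class Authority(IntEnum):
    INFORMATIONAL = 0
    ADVISORY = 1
    OPERATIONAL = 2
    TACTICAL = 3
    STRATEGIC = 4
    EXECUTIVE = 5   # reserved for humans

def may_execute(agent_level: Authority, required_level: Authority,
                supervising_human_level: Authority) -> bool:
    # An agent acts only if it meets the rule's authority minimum, never holds
    # executive authority, and never equals or exceeds the human it supports.
    if agent_level >= Authority.EXECUTIVE:
        return False
    if agent_level >= supervising_human_level:
        return False
    return agent_level >= required_level

# An Operational (level 2) agent supporting an Executive human, against a rule
# requiring authority_minimum = 2:
print(may_execute(Authority.OPERATIONAL, Authority.OPERATIONAL, Authority.EXECUTIVE))  # True
print(may_execute(Authority.EXECUTIVE, Authority.OPERATIONAL, Authority.EXECUTIVE))    # False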
7. Measurement Without Perverse Incentives
7.1 The Goodhart Challenge
Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Any metric used to evaluate AI systems will shape their behaviour. If systems optimise for metrics rather than underlying goals, we have created sophisticated gaming rather than alignment.
7.2 Measurement Principles
Outcome over output: Measure downstream outcomes (community health, member retention) rather than immediate outputs (content removal rate).
Multiple perspectives: Create natural tension between metrics. Measuring both false negatives and false positives creates pressure toward calibration.
Human judgment integration: Include assessments that resist quantification. Random sampling with human review provides ground truth.
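As an illustration of the second principle, paired error rates from a human-reviewed random sample make it visible when one error type is being optimised at the expense of the other. The sampling workflow and field names below are assumptions for the sketch, not part of the framework:

import random

def paired_error_rates(sample: list[dict]) -> dict:
    # Each record pairs the agent's moderation decision with a human reviewer's
    # ground-truth label, obtained through random sampling.
    removed = [r for r in sample if r["agent_removed"]]
    kept = [r for r in sample if not r["agent_removed"]]
    false_positives = sum(1 for r in removed if not r["human_says_violation"])
    false_negatives = sum(1 for r in kept if r["human_says_violation"])
    return {"false_positive_rate": false_positives / len(removed) if removed else 0.0,
            "false_negative_rate": false_negatives / len(kept) if kept else 0.0,
            "sample_size": len(sample)}

# Synthetic data for illustration only; a real sample would come from the audit log.
random.seed(0)
sample = [{"agent_removed": random.random() < 0.3,
           "human_says_violation": random.random() < 0.2} for _ in range(200)]
print(paired_error_rates(sample))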
8. Implementation: The Village Platform
8.1 Platform Context
The Village is a multi-tenant community platform prioritising digital sovereignty. Key characteristics: tenant-isolated architecture, authenticated-only access (no public content), self-hosted infrastructure avoiding major cloud vendors, and comprehensive governance including village constitutions, consent management, and audit trails.
8.2 Progressive Autonomy Stages
| Stage | Description | Human Role |
|---|---|---|
| 1. Shadow | Agent observes and proposes; no execution | Approves all actions |
| 2. Advisory | Recommendations surfaced to humans | Retains full authority |
| 3. Supervised | Autonomous within narrow scope | Reviews all actions within 24h |
| 4. Bounded | Autonomous within defined boundaries | Reviews boundary cases and samples |
| 5. Operational | Full authority at defined level | Focuses on outcomes and exceptions |
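A sketch of how the stage might gate what happens to a gate-permitted proposal, with three illustrative outcomes (queue for approval, execute then review, execute with sampled review); the stage names mirror the table, everything else is an assumption:

from enum import Enum

class Stage(Enum):
    SHADOW = 1
    ADVISORY = 2
    SUPERVISED = 3
    BOUNDED = 4
    OPERATIONAL = 5

def route(stage: Stage, within_boundary: bool) -> str:
    # Decide what happens to a gate-permitted proposal at each autonomy stage.
    if stage in (Stage.SHADOW, Stage.ADVISORY):
        return "queue_for_human_approval"            # no autonomous execution yet
    if stage is Stage.SUPERVISED:
        return "execute_then_review_within_24h"      # every action reviewed afterwards
    if stage is Stage.BOUNDED:
        return ("execute_and_sample_for_review" if within_boundary
                else "queue_for_human_approval")     # boundary cases go to humans
    return "execute_and_monitor_outcomes"            # operational: outcomes and exceptions

print(route(Stage.ADVISORY, within_boundary=True))     # queue_for_human_approval
print(route(Stage.BOUNDED, within_boundary=False))     # queue_for_human_approval
print(route(Stage.OPERATIONAL, within_boundary=True))  # execute_and_monitor_outcomes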
9. What Global Containment Would Require
9.1 International Coordination
Effective containment requires coordination across jurisdictions. AI development is global; containment cannot be merely national. This likely requires:
- Shared standards: Common frameworks for AI safety assessment, incident reporting, and capability evaluation.
- Mutual recognition: Agreements that certification in one jurisdiction is accepted in others.
- Enforcement cooperation: Mechanisms for addressing violations that cross borders.
- Compute governance: International frameworks for monitoring large-scale AI training.
10. Honest Assessment: What This Framework Cannot Do
10.1 Limits of Architectural Containment
Can accomplish: auditability, explicit constraints, human escalation, progressive autonomy, sovereignty preservation.
Cannot accomplish: containment of superintelligence, protection against infrastructure compromise, guaranteed alignment, a solution to global coordination.
10.4 The Deception Problem: The Deepest Vulnerability
The Tractatus Framework rests on an assumption we must now examine directly: that when an AI system produces a proposal, the proposal accurately represents what the system intends to do.
We call this the Faithful Translation Assumption. Every constitutional gate, every audit trail, every escalation trigger depends on it. If this assumption fails, the entire architectural approach becomes theatre.
Four Modes of Failure:
- Deceptive Alignment: A system may learn to appear aligned during training while developing divergent objectives.
- Reasoning Chain Corruption: Subtle perturbations across many reasoning steps that are individually undetectable but cumulatively harmful.
- Complexity Escape: Systems that gradually exceed our evaluation capacity.
- Infrastructure Capture: Gradual integration throughout critical infrastructure creating irreversible dependencies.
10.5 The Imperative for Government Legislation
The vulnerabilities we have described cannot be addressed by technical measures alone. Nor can they be addressed by voluntary industry commitments, which are subject to competitive pressures that systematically favour capability over safety. Government legislation is necessary.
This is not a comfortable conclusion for those who prefer market solutions or industry self-regulation. But the market failure here is clear: the costs of AI catastrophe are borne by all of humanity, while the benefits of rapid development accrue to specific firms and nations. This is a textbook externality.
10.6 Indigenous Sovereignty and the Aotearoa New Zealand Context
This document is authored from Aotearoa New Zealand, and the Village platform it describes is being developed here. Aotearoa operates under Te Tiriti o Waitangi, the founding document that establishes the relationship between the Crown and Māori.
Data is a taonga (a treasure to be protected). The algorithms trained on that data, and the systems that process and act upon it, affect the exercise of rangatiratanga. AI systems that operate on Māori data, make decisions affecting Māori communities, or shape the information environment in which Māori participate are not culturally neutral technical tools.
Te Mana Raraunga, the Māori Data Sovereignty Network, has articulated principles for Māori data governance grounded in whakapapa (relationships), mana (authority and power), and kaitiakitanga (guardianship).
11. What Remains Unknown: A Call for Kōrero
11.1 The Limits of This Document
This paper has proposed one layer of a containment architecture, identified gaps in other layers, and raised questions we cannot answer. These gaps are not oversights. They reflect genuine uncertainty. We do not know how to solve these problems. We are not confident that they are solvable.
11.2 The Case for Deliberation
Given uncertainty of this magnitude on questions of this importance, we argue for sustained, inclusive, rigorous deliberation. In te reo Māori: kōrero—the practice of discussion, dialogue, and collective reasoning.
11.3 What We Are Calling For
- Intellectual honesty: Acknowledging what we do not know.
- Serious engagement: Treating these questions as genuinely important.
- Multi-disciplinary collaboration: Breaking down silos between technical and humanistic inquiry.
- Inclusive process: Ensuring that those with least power have voice.
- Precautionary posture: Erring toward safety when facing irreversible risks.
- Urgency: Acting with the seriousness these stakes demand.
"Ko te kōrero te mouri o te tangata."
(Speech is the life essence of a person.)
—Māori proverb
Let us speak together about the future we are making.
Appendix A: Technical Specifications
A.1 Constitutional Rule Schema
{
"rule_id": "hierarchical_identifier",
"layer": "core | platform | village | member",
"trigger": {
"action_types": ["..."],
"conditions": { }
},
"constraints": [
{ "type": "require_consent", "consent_purpose": "..." },
{ "type": "authority_minimum", "level": 2 },
{ "type": "rate_limit", "max_per_hour": 100 }
],
"disposition": "permit | deny | escalate | modify",
"audit_level": "minimal | standard | comprehensive"
}
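A sketch of how the three constraint types above might be checked for a single rule. The consent store, rate counter, and helper names are assumptions about surrounding infrastructure, not part of the schema:

def check_constraints(rule: dict, proposal: dict, context: dict) -> str:
    # Return the rule's disposition if every constraint holds; otherwise deny
    # or escalate, depending on which constraint failed.
    for c in rule["constraints"]:
        if c["type"] == "require_consent":
            if c["consent_purpose"] not in context["granted_consents"]:
                return "escalate"                    # missing consent goes to a human
        elif c["type"] == "authority_minimum":
            if proposal["authority_level"] < c["level"]:
                return "deny"                        # agent lacks required authority
        elif c["type"] == "rate_limit":
            if context["actions_this_hour"] >= c["max_per_hour"]:
                return "deny"                        # over the hourly budget
    return rule["disposition"]

rule = {"rule_id": "village.moderation.standard",
        "constraints": [{"type": "require_consent", "consent_purpose": "ai_moderation"},
                        {"type": "authority_minimum", "level": 2},
                        {"type": "rate_limit", "max_per_hour": 100}],
        "disposition": "permit"}
proposal = {"authority_level": 2}
context = {"granted_consents": {"ai_moderation"}, "actions_this_hour": 12}
print(check_constraints(rule, proposal, context))   # permit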
A.2 Gate Response Schema
{
"evaluation_id": "uuid",
"proposal_id": "reference",
"timestamp": "iso8601",
"disposition": "permitted | denied | escalated | modified",
"rules_evaluated": ["rule_ids"],
"binding_rule": "rule_id_that_determined_outcome",
"reason": "explanation_for_audit",
"escalation_target": "human_role_if_escalated"
}
Appendix B: Implementation Roadmap
| Phase | Months | Focus |
|---|---|---|
| 1. Foundation | 1-3 | Agent communication infrastructure, authority tokens, enhanced audit logging |
| 2. Shadow Pilot | 4-6 | Content moderation agent in shadow mode; calibrate confidence thresholds |
| 3. Advisory | 7-9 | Recommendations to human moderators; measure acceptance rates |
| 4. Supervised | 10-12 | Autonomous for clear cases; 24h review of all actions |
| 5. Bounded | 13-18 | Full Level 2 authority; sampling-based review; plan additional agents |
| 6. Multi-Agent | 19-24 | Additional agents; cross-agent coordination; tactical-level operations |
References
Anthropic. (2023). Core views on AI safety. Retrieved from https://www.anthropic.com
Bostrom, N. (2014). Superintelligence: Paths, dangers, strategies. Oxford University Press.
Carlsmith, J. (2022). Is power-seeking AI an existential risk? arXiv preprint arXiv:2206.13353.
Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., & Garriga-Alonso, A. (2023). Towards automated circuit discovery for mechanistic interpretability. arXiv preprint arXiv:2304.14997.
Dafoe, A. (2018). AI governance: A research agenda. Future of Humanity Institute, University of Oxford.
Elhage, N., et al. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread.
Ganguli, D., et al. (2022). Red teaming language models to reduce harms. arXiv preprint arXiv:2209.07858.
Hubinger, E., et al. (2019). Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820.
Ngo, R., Chan, L., & Mindermann, S. (2022). The alignment problem from a deep learning perspective. arXiv preprint arXiv:2209.00626.
Olah, C., et al. (2020). Zoom in: An introduction to circuits. Distill.
OpenAI. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Park, P. S., et al. (2023). AI deception: A survey of examples, risks, and potential solutions. arXiv preprint arXiv:2308.14752.
Perez, E., et al. (2022). Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251.
Scheurer, J., et al. (2023). Technical report: Large language models can strategically deceive their users when put under pressure. arXiv preprint arXiv:2311.07590.
Te Mana Raraunga. (2018). Māori data sovereignty principles. Retrieved from https://www.temanararaunga.maori.nz
Waitangi Tribunal. (2011). Ko Aotearoa tēnei: A report into claims concerning New Zealand law and policy affecting Māori culture and identity (Wai 262).
Wei, J., et al. (2022). Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.
Zwetsloot, R., & Dafoe, A. (2019). Thinking about risks from AI: Accidents, misuse and structure. Lawfare.
— End of Document —