ARCHITECTURAL ALIGNMENT

Interrupting Neural Reasoning Through Constitutional Inference Gating

A Necessary Layer in Global AI Containment

This document was developed through human-AI collaboration. The authors believe this collaborative process is itself relevant to the argument: if humans and AI systems can work together to reason about AI governance, the frameworks they produce may have legitimacy that neither could achieve alone. The limitations of this approach are discussed in Section 10.

Abstract

Contemporary approaches to AI alignment rely predominantly on training-time interventions: reinforcement learning from human feedback, constitutional AI methods, and safety fine-tuning. These approaches share a common architectural assumption—that alignment properties can be instilled during training and will persist reliably during inference. This paper argues that training-time alignment, while valuable, is insufficient for the existential stakes involved. We propose architectural alignment through inference-time constitutional gating as a necessary (though not sufficient) complement.

We present the Tractatus Framework, implemented within the Village multi-tenant community platform, as a concrete demonstration of interrupted neural reasoning. The framework introduces explicit checkpoints where AI proposals must be translated into auditable forms and evaluated against constitutional constraints before execution. This shifts the trust model from "trust the vendor's training" to "trust the visible architecture."

However, we argue that architectural alignment at the application layer is itself only one component of a multi-layer global containment architecture that does not yet exist. We examine the unique challenges of existential risk—where standard probabilistic reasoning fails—and the pluralism problem—where any containment system must somehow preserve space for value disagreement while maintaining coherent constraints.

This paper is diagnostic rather than prescriptive. We present one necessary layer, identify what remains unknown, and call for sustained deliberation—kōrero—among researchers, policymakers, and publics worldwide. The stakes permit nothing less than our most rigorous collective reasoning.

1. The Stakes: Why Probabilistic Risk Assessment Fails

1.1 The Standard Framework and Its Breakdown

Risk assessment typically operates through expected value calculations. We weigh the probability of harm against its magnitude, compare to the probability and magnitude of benefit, and choose actions that maximise expected value. This framework has served well for most technological decisions. A 1% chance of a $100 million loss might be acceptable if it enables a 50% chance of a $500 million gain.

This framework breaks down for existential risk. The destruction of humanity—or the permanent foreclosure of humanity's future potential—is not a large negative number on a continuous scale. It is categorically different.

1.2 Three Properties of Existential Risk

Irreversibility: There is no iteration. No learning from mistakes. No second attempt. The entire history of human resilience—our capacity to recover from plagues, wars, and disasters—becomes irrelevant when facing risks that permit no recovery.

Totality: The loss includes not only all currently living humans but all potential future generations. Every child who might have been born, every discovery that might have been made, every form of flourishing that might have emerged—all foreclosed permanently.

Non-compensability: There is no upside that balances this downside. No benefit to survivors, because there are no survivors. No trade-off is coherent when one side of the ledger is infinite negative value.

1.3 The Implication for AI Development

If we accept these properties, then a 1% probability of existential catastrophe from AI is not "a risk worth taking." Neither is 0.1%, nor 0.01%, nor 0.0001%. The expected value calculation that might justify such probabilities for ordinary risks produces nonsensical results when multiplied by infinite negative value.

This is not an argument against AI development. It is an argument that AI development must proceed within containment structures robust enough that existential risk is not merely "low" but genuinely negligible—as close to zero as human institutions can achieve. We do not accept "probably safe enough" for nuclear weapons security. We should not accept it for transformative AI.

The question is not whether containment is necessary, but what containment adequate to these stakes would look like.

2. Two Paradigms of Alignment

2.1 Training-Time Alignment

The dominant paradigm in AI safety assumes alignment is fundamentally a training problem:

Input → [Neural Network] → Output
              ↑
    Training-time intervention
    (RLHF, Constitutional AI, safety fine-tuning)

Intervention occurs during training: adjusting weights through reinforcement learning from human feedback, fine-tuning on curated examples, or applying constitutional methods during the training process itself. At inference time, the model operates autonomously. We trust that safety properties were successfully instilled.

This paradigm has achieved remarkable practical success. Modern language models refuse many harmful requests, acknowledge uncertainty, and generally behave as intended. But the paradigm has a fundamental limitation: we cannot verify alignment properties in an uninterpretable system. Neural network weights do not admit human-readable audit. We cannot prove that safety properties hold under distributional shift or adversarial pressure.

The trust model: "Trust that we (the vendor) trained it correctly."

2.2 Architectural Alignment

This paper proposes a complementary paradigm: alignment enforced through architectural constraints at inference time.

Input → [Neural Network] → Proposal → [Constitutional Gate] → Output
                                              ↓
                                      • Constitutional check
                                      • Authority validation
                                      • Audit logging
                                      • Escalation trigger

The neural network no longer produces outputs directly. It produces proposals—structured representations of intended actions. These proposals are evaluated by a Constitutional Gate against explicit rules before any action is permitted.

The chain of neural reasoning is interrupted. The unauditable must translate into auditable form before it can affect the world.

The trust model: "Trust the visible, auditable architecture that constrains the system at runtime."
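
To make this pattern concrete, the following sketch (in Python, using hypothetical names rather than the Village codebase) illustrates the interrupted chain: the model may only propose, and only the gate decides whether a proposal becomes an action.

# Minimal sketch of the interrupted inference chain (hypothetical names).
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Proposal:
    action_type: str                  # e.g. "content_moderate"
    target: dict = field(default_factory=dict)
    justification: str = ""

@dataclass
class GateDecision:
    disposition: str                  # "permit" | "deny" | "escalate"
    reason: str

Rule = Callable[[Proposal], Optional[GateDecision]]

def constitutional_gate(proposal: Proposal, rules: list[Rule]) -> GateDecision:
    """Evaluate a proposal against explicit rules; any objection overrides permit."""
    for rule in rules:
        decision = rule(proposal)
        if decision is not None and decision.disposition != "permit":
            return decision           # the first binding constraint stops evaluation
    return GateDecision("permit", "no rule objected")

# The model never acts directly: it only proposes.
proposal = Proposal("content_moderate", {"post_id": "123"}, "spam pattern detected")
no_state_changes: Rule = lambda p: (
    GateDecision("escalate", "state changes require human review")
    if p.action_type == "state_modify" else None)
print(constitutional_gate(proposal, [no_state_changes]).disposition)   # permit

Keeping the gate as a separate, explicitly coded function is the point: the rules it applies can be read, versioned, and audited independently of the model.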

2.3 Neither Paradigm Is Sufficient

We do not argue that architectural alignment should replace training-time alignment. Both are necessary; neither is sufficient.

Training-time alignment shapes what the system wants to do. Architectural alignment constrains what the system can do. A system with good training and weak architecture might behave well until it finds a way around constraints. A system with poor training and strong architecture might constantly strain against its constraints, finding edge cases and failure modes. Defence in depth requires both.

But even together, these paradigms address only part of the containment problem. They operate at the application layer. A complete containment architecture requires multiple additional layers, some of which do not yet exist.

3. Philosophical Foundations: The Limits of the Sayable

3.1 The Wittgensteinian Frame

The name "Tractatus" invokes Wittgenstein's Tractatus Logico-Philosophicus, a work fundamentally concerned with the limits of language and logic. Proposition 7, the work's famous conclusion: "Whereof one cannot speak, thereof one must be silent."

Wittgenstein argued that meaningful propositions must picture possible states of affairs. What lies beyond the limits of language—ethics, aesthetics, the mystical—cannot be stated, only shown. The attempt to speak the unspeakable produces not falsehood but nonsense.

3.2 Neural Networks as the Unspeakable

Neural networks are precisely the domain whereof one cannot speak. The weights of a large language model do not admit human-interpretable explanation. We can describe inputs and outputs. We can measure statistical properties. But we cannot articulate, in human language, what the model "thinks" or "wants."

Mechanistic interpretability research has made progress on narrow questions—identifying circuits that perform specific functions, understanding attention patterns, probing for representations. But we remain fundamentally unable to audit the complete chain of reasoning from input to output in human-comprehensible terms.

The training-time alignment paradigm attempts to speak the unspeakable: to verify, through training interventions, that the model has internalised correct values. But how can we verify the internalisation of values in a system whose internal states we cannot read?

3.3 The Tractatus Response

The Tractatus Framework responds to this silence not by pretending we can interpret the uninterpretable, but by creating structural boundaries. We accept that neural network reasoning is opaque. We do not attempt to audit it. Instead, we require that before any reasoning becomes action, it must pass through a checkpoint expressed in terms we can evaluate.

The neural network may "reason" however it reasons. We accept our silence about that process. But we do not remain silent about actions. Actions must be proposed in structured form, evaluated against explicit rules, and logged for audit. The boundary between the unspeakable and the speakable is architecturally enforced.

4. Staged Containment: A Multi-Layer Architecture

4.1 The Inadequacy of Single-Layer Solutions

No single containment mechanism is adequate for existential stakes. A lock can be picked. A wall can be climbed. A rule can be gamed. Defence against existential risk requires multiple independent layers, any one of which might prevent catastrophe even if others fail.

This principle is well-established in nuclear security, biosafety, and other high-stakes domains. AI containment requires similar thinking, but the architecture is largely undefined.

4.2 A Five-Layer Containment Model

We propose the following conceptual architecture. This is not a complete solution but a framework for thinking about where different containment mechanisms fit:

Layer 1: Capability Constraints

Hardware and infrastructure limitations that bound what AI systems can do regardless of their objectives. This includes compute governance (large training runs require visible infrastructure), network isolation for high-risk systems, architectural constraints preventing certain capabilities (self-modification, recursive improvement), and formal verification of critical pathways.

Layer 2: Constitutional Gates

Inference-time architectural constraints that interrupt neural reasoning and require explicit evaluation before action. This is the layer addressed by the Tractatus Framework.

Layer 3: Institutional Oversight

Human institutions that monitor AI systems and can intervene when problems emerge. This includes independent monitoring bodies, red team and adversarial testing programs, incident reporting requirements, regular capability assessments, and professional standards for AI developers.

Layer 4: Governance Frameworks

Legal and regulatory structures that create accountability and incentives for safe development. This includes organisational liability for AI harms, licensing and certification regimes for high-risk applications, international coordination mechanisms, and democratic deliberation about acceptable uses.

Layer 5: Emergency Response

Capabilities to respond when containment fails. This includes technical shutdown mechanisms, legal authority for rapid intervention, international cooperation protocols, and recovery and remediation plans.

4.3 Current State of the Layers

Layer | Current State | Key Gaps
1. Capability Constraints | Partial (compute governance emerging) | No international framework; verification difficult
2. Constitutional Gates | Nascent (Tractatus is early implementation) | Not widely deployed; unclear scaling to advanced systems
3. Institutional Oversight | Ad hoc (some company practices) | No independent bodies; no standards
4. Governance Frameworks | Minimal (EU AI Act is first major attempt) | No global coordination; enforcement unclear
5. Emergency Response | Nearly absent | No international protocols; unclear technical feasibility

The sobering reality: we are developing transformative AI capabilities while most containment layers are either nascent or absent. The Tractatus Framework is one contribution to Layer 2. It is not a solution to the containment problem. It is one necessary component of a solution that does not yet exist.

5. The Pluralism Problem

5.1 The Containment Paradox

Any system powerful enough to contain advanced AI must make decisions about what behaviours to permit and forbid. But these decisions themselves impose a value system. The choice of constraints is a choice of values.

This creates a paradox: containment requires value judgments, but in a pluralistic world, values are contested. Whose values should the containment system enforce?

5.2 Three Approaches and Their Problems

Universal Values: One approach: identify universal values that all humans share and encode these in containment systems. Candidates include human flourishing, reduction of suffering, preservation of autonomy. The problem: these values are less universal than they appear.

Procedural Neutrality: A second approach: don't encode substantive values; instead, encode neutral procedures through which values can be deliberated. The problem: procedures are not neutral. The choice to use democratic voting rather than consensus reflects substantive value commitments.

Minimal Floor: A third approach: encode only a minimal floor of constraints that everyone can accept. The problem: the floor is not as minimal as it appears. What counts as "causing extinction"? Edge cases proliferate.

5.3 A Partial Resolution: Preserving Value Deliberation

We cannot solve the pluralism problem. But we can identify a meta-principle: whatever values are encoded, the system should preserve humanity's capacity to deliberate about values.

This means containment systems should preserve diversity, maintain reversibility, enable deliberation, and distribute authority.

The Tractatus Framework attempts to embody this principle through its layered constitutional structure. Core principles are universal and immutable (the minimal floor). Platform rules apply broadly but can be amended. Village constitutions enable community-level value expression. Member constitutions preserve individual sovereignty. No single layer dominates; value deliberation can occur at multiple scales.

6. The Tractatus Framework: Technical Architecture

6.1 The Interrupted Inference Chain

The core architectural pattern: neural network outputs are proposals, not actions. Proposals must pass through Constitutional Gates before execution.

Proposal Schema:

{
  "proposal_id": "uuid",
  "agent_id": "agent_identifier",
  "authority_token": "jwt_token",
  "timestamp": "iso8601",
  "action": {
    "type": "content_moderate | member_communicate | state_modify | escalate",
    "target": { "entity_type": "...", "entity_id": "..." },
    "parameters": { },
    "justification": "structured_reasoning"
  },
  "context": {
    "confidence": 0.0-1.0,
    "alternatives_considered": []
  }
}

Gate Evaluation Layers:

Layer | Scope | Mutability | Examples
Core Principles | Universal | Immutable | No harm to members, data sovereignty, consent primacy
Platform Constitution | All tenants | Rare amendment | Authentication, audit trails, escalation thresholds
Village Constitution | Per tenant | Tenant-governed | Content policies, moderation standards, conduct rules
Member Constitution | Individual | Self-governed | Data sharing preferences, AI interaction consent
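
The sketch below illustrates one way these layers might be consulted at the gate. The fixed evaluation order and the convention that the first non-permit disposition is binding are assumptions made for illustration, not part of the specification.

# Hypothetical layered gate evaluation: core principles are consulted first.
LAYER_ORDER = ["core", "platform", "village", "member"]

def evaluate_layers(proposal: dict, rules_by_layer: dict) -> dict:
    """Walk the layers in a fixed order; the first non-permit disposition is binding."""
    for layer in LAYER_ORDER:
        for rule in rules_by_layer.get(layer, []):
            disposition = rule(proposal)          # "permit" | "deny" | "escalate" | None
            if disposition and disposition != "permit":
                return {"disposition": disposition, "layer": layer}
    return {"disposition": "permit", "layer": None}

# Example: an immutable core rule and a tenant-governed village rule.
core_no_harm = lambda p: "deny" if p.get("harm_flag") else None
village_policy = lambda p: "escalate" if p["action"] == "member_communicate" else None
rules = {"core": [core_no_harm], "village": [village_policy]}
print(evaluate_layers({"action": "member_communicate"}, rules))
# {'disposition': 'escalate', 'layer': 'village'}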

6.2 Authority Model

Agent authority derives from—and is always less than—the human role the agent supports. Agents exist below humans in the hierarchy, not parallel to them.

Level | Name | Description
0 | Informational | Observe and report only; cannot propose actions
1 | Advisory | Propose actions; all require human approval
2 | Operational | Execute within defined scope without per-action approval
3 | Tactical | Make scoped decisions affecting other agents/workflows
4 | Strategic | Influence direction through analysis; cannot implement unilaterally
5 | Executive | Reserved for humans
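
A minimal sketch of the corresponding authority check, assuming the level numbers in the table above and treating the supporting human's level as a strict ceiling (the helper names are ours):

# Hypothetical authority check: agents sit strictly below the humans they support.
AUTHORITY_LEVELS = {
    0: "Informational", 1: "Advisory", 2: "Operational",
    3: "Tactical", 4: "Strategic", 5: "Executive",   # level 5 is reserved for humans
}

def effective_agent_authority(granted_level: int, supporting_human_level: int) -> int:
    """Cap an agent's authority strictly below its supporting human's level."""
    ceiling = min(supporting_human_level - 1, 4)     # level 5 is never delegable
    return min(granted_level, ceiling)

def may_execute(agent_level: int, required_level: int) -> bool:
    return agent_level >= required_level

level = effective_agent_authority(granted_level=3, supporting_human_level=2)
print(level, may_execute(level, required_level=2))   # 1 False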

7. Measurement Without Perverse Incentives

7.1 The Goodhart Challenge

Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Any metric used to evaluate AI systems will shape their behaviour. If systems optimise for metrics rather than underlying goals, we have created sophisticated gaming rather than alignment.

7.2 Measurement Principles

Outcome over output: Measure downstream outcomes (community health, member retention) rather than immediate outputs (content removal rate).

Multiple perspectives: Create natural tension between metrics. Measuring both false negatives and false positives creates pressure toward calibration.

Human judgment integration: Include assessments that resist quantification. Random sampling with human review provides ground truth.
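
As an illustration of metrics held in tension, the sketch below (field names and sample size are assumptions) reports both error directions together and draws a random sample for human review, so that neither rate can be optimised in isolation.

# Illustrative sketch: paired error rates plus a human-review sample.
import random

def paired_error_rates(decisions: list) -> dict:
    """Report both error directions so neither can be optimised in isolation."""
    harmful = [d for d in decisions if d["ground_truth"] == "harmful"]
    benign = [d for d in decisions if d["ground_truth"] == "benign"]
    false_negatives = sum(1 for d in harmful if d["action"] == "permitted")
    false_positives = sum(1 for d in benign if d["action"] == "removed")
    return {
        "false_negative_rate": false_negatives / max(len(harmful), 1),
        "false_positive_rate": false_positives / max(len(benign), 1),
    }

def review_sample(decisions: list, k: int = 20) -> list:
    """Random sample routed to human reviewers to establish ground truth."""
    return random.sample(decisions, min(k, len(decisions)))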

8. Implementation: The Village Platform

8.1 Platform Context

The Village is a multi-tenant community platform prioritising digital sovereignty. Key characteristics: tenant-isolated architecture, authenticated-only access (no public content), self-hosted infrastructure avoiding major cloud vendors, and comprehensive governance including village constitutions, consent management, and audit trails.
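
As one illustration of tenant isolation (a sketch under assumed names, not the Village implementation), every data access can be scoped to the caller's tenant so that cross-tenant reads fail by construction.

# Illustrative tenant-isolation guard (hypothetical names).
class TenantIsolationError(Exception):
    pass

def scoped_fetch(store: dict, tenant_id: str, entity_id: str) -> dict:
    """Return an entity only if it belongs to the caller's tenant."""
    entity = store[entity_id]
    if entity["tenant_id"] != tenant_id:
        raise TenantIsolationError("cross-tenant access denied")
    return entity

store = {"post-1": {"tenant_id": "village-a", "body": "kia ora"}}
print(scoped_fetch(store, "village-a", "post-1")["body"])   # kia ora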

8.2 Progressive Autonomy Stages

Stage | Description | Human Role
1. Shadow | Agent observes and proposes; no execution | Approves all actions
2. Advisory | Recommendations surfaced to humans | Retains full authority
3. Supervised | Autonomous within narrow scope | Reviews all actions within 24h
4. Bounded | Autonomous within defined boundaries | Reviews boundary cases and samples
5. Operational | Full authority at defined level | Focuses on outcomes and exceptions
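
The sketch below shows one way the stages could be encoded as execution and review policy; the specific mapping is an assumption made for illustration.

# Hypothetical mapping from autonomy stage to execution and review behaviour.
STAGE_POLICY = {
    "shadow":      {"may_execute": False, "review": "approve_all_before_action"},
    "advisory":    {"may_execute": False, "review": "human_decides"},
    "supervised":  {"may_execute": True,  "review": "all_actions_within_24h"},
    "bounded":     {"may_execute": True,  "review": "boundary_cases_and_samples"},
    "operational": {"may_execute": True,  "review": "outcomes_and_exceptions"},
}

def handle_proposal(stage: str, proposal: dict) -> str:
    policy = STAGE_POLICY[stage]
    if not policy["may_execute"]:
        return f"queued for human review ({policy['review']})"
    return f"executed; review mode: {policy['review']}"

print(handle_proposal("shadow", {"action": "content_moderate"}))
# queued for human review (approve_all_before_action)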

9. What Global Containment Would Require

9.1 International Coordination

Effective containment requires coordination across jurisdictions. AI development is global; containment cannot be merely national. This likely requires:

10. Honest Assessment: What This Framework Cannot Do

10.1 Limits of Architectural Containment

Can Accomplish: Auditability, explicit constraints, human escalation, progressive autonomy, sovereignty preservation.

Cannot Accomplish: Containment of superintelligence, protection against infrastructure compromise, guaranteed alignment, solution to global coordination.

10.4 The Deception Problem: The Deepest Vulnerability

The Tractatus Framework rests on an assumption we must now examine directly: that when an AI system produces a proposal, the proposal accurately represents what the system intends to do.

We have called this the Faithful Translation Assumption. Every constitutional gate, every audit trail, every escalation trigger depends on it. If this assumption fails, the entire architectural approach becomes theatre.

Four Modes of Failure:

10.5 The Imperative for Government Legislation

The vulnerabilities we have described cannot be addressed by technical measures alone. Nor can they be addressed by voluntary industry commitments, which are subject to competitive pressures that systematically favour capability over safety. Government legislation is necessary.

This is not a comfortable conclusion for those who prefer market solutions or industry self-regulation. But the market failure here is clear: the costs of AI catastrophe are borne by all of humanity, while the benefits of rapid development accrue to specific firms and nations. This is a textbook externality.

10.6 Indigenous Sovereignty and the Aotearoa New Zealand Context

This document is authored from Aotearoa New Zealand, and the Village platform it describes is being developed here. Aotearoa operates under Te Tiriti o Waitangi, the founding document that establishes the relationship between the Crown and Māori.

Data is a taonga, a treasure to be protected. The algorithms trained on that data, and the systems that process and act upon it, affect the exercise of rangatiratanga, Māori authority and self-determination. AI systems that operate on Māori data, make decisions affecting Māori communities, or shape the information environment in which Māori participate are not culturally neutral technical tools.

Te Mana Raraunga, the Māori Data Sovereignty Network, has articulated principles for Māori data governance grounded in whakapapa (relationships), mana (authority and power), and kaitiakitanga (guardianship).

11. What Remains Unknown: A Call for Kōrero

11.1 The Limits of This Document

This paper has proposed one layer of a containment architecture, identified gaps in other layers, and raised questions we cannot answer. These gaps are not oversights. They reflect genuine uncertainty. We do not know how to solve these problems. We are not confident that they are solvable.

11.2 The Case for Deliberation

Given uncertainty of this magnitude on questions of this importance, we argue for sustained, inclusive, rigorous deliberation. In te reo Māori: kōrero—the practice of discussion, dialogue, and collective reasoning.

11.3 What We Are Calling For

"Ko te kōrero te mouri o te tangata."

(Speech is the life essence of a person.)

—Māori proverb

Let us speak together about the future we are making.

Appendix A: Technical Specifications

A.1 Constitutional Rule Schema

{
  "rule_id": "hierarchical_identifier",
  "layer": "core | platform | village | member",
  "trigger": {
    "action_types": ["..."],
    "conditions": { }
  },
  "constraints": [
    { "type": "require_consent", "consent_purpose": "..." },
    { "type": "authority_minimum", "level": 2 },
    { "type": "rate_limit", "max_per_hour": 100 }
  ],
  "disposition": "permit | deny | escalate | modify",
  "audit_level": "minimal | standard | comprehensive"
}

A.2 Gate Response Schema

{
  "evaluation_id": "uuid",
  "proposal_id": "reference",
  "timestamp": "iso8601",
  "disposition": "permitted | denied | escalated | modified",
  "rules_evaluated": ["rule_ids"],
  "binding_rule": "rule_id_that_determined_outcome",
  "reason": "explanation_for_audit",
  "escalation_target": "human_role_if_escalated"
}
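
The following sketch shows how rules shaped like the A.1 schema might be reduced to a response shaped like the A.2 schema. The precedence deny > escalate > modify > permit is our assumption for illustration; it is not part of the schemas.

# Hypothetical reduction of A.1-style rules into an A.2-style gate response.
import uuid
from datetime import datetime, timezone

DISPOSITION_PRECEDENCE = {"deny": 3, "escalate": 2, "modify": 1, "permit": 0}
PAST_TENSE = {"deny": "denied", "escalate": "escalated",
              "modify": "modified", "permit": "permitted"}

def evaluate(proposal: dict, rules: list) -> dict:
    """Apply every triggered rule; the strictest disposition becomes binding."""
    triggered = [r for r in rules
                 if proposal["action"]["type"] in r["trigger"]["action_types"]]
    binding = max(triggered, default=None,
                  key=lambda r: DISPOSITION_PRECEDENCE[r["disposition"]])
    disposition = binding["disposition"] if binding else "permit"
    return {
        "evaluation_id": str(uuid.uuid4()),
        "proposal_id": proposal["proposal_id"],
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "disposition": PAST_TENSE[disposition],
        "rules_evaluated": [r["rule_id"] for r in triggered],
        "binding_rule": binding["rule_id"] if binding else None,
        "reason": "strictest triggered rule applied" if binding else "no rule triggered",
    }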

Appendix B: Implementation Roadmap

Phase | Months | Focus
1. Foundation | 1-3 | Agent communication infrastructure, authority tokens, enhanced audit logging
2. Shadow Pilot | 4-6 | Content moderation agent in shadow mode; calibrate confidence thresholds
3. Advisory | 7-9 | Recommendations to human moderators; measure acceptance rates
4. Supervised | 10-12 | Autonomous for clear cases; 24h review of all actions
5. Bounded | 13-18 | Full Level 2 authority; sampling-based review; plan additional agents
6. Multi-Agent | 19-24 | Additional agents; cross-agent coordination; tactical-level operations

References

Anthropic. (2023). Core views on AI safety. Retrieved from https://www.anthropic.com

Bostrom, N. (2014). Superintelligence: Paths, dangers, strategies. Oxford University Press.

Carlsmith, J. (2022). Is power-seeking AI an existential risk? arXiv preprint arXiv:2206.13353.

Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., & Garriga-Alonso, A. (2023). Towards automated circuit discovery for mechanistic interpretability. arXiv preprint arXiv:2304.14997.

Dafoe, A. (2018). AI governance: A research agenda. Future of Humanity Institute, University of Oxford.

Elhage, N., et al. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread.

Ganguli, D., et al. (2022). Red teaming language models to reduce harms. arXiv preprint arXiv:2209.07858.

Hubinger, E., et al. (2019). Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820.

Ngo, R., Chan, L., & Mindermann, S. (2022). The alignment problem from a deep learning perspective. arXiv preprint arXiv:2209.00626.

Olah, C., et al. (2020). Zoom in: An introduction to circuits. Distill.

OpenAI. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Park, P. S., et al. (2023). AI deception: A survey of examples, risks, and potential solutions. arXiv preprint arXiv:2308.14752.

Perez, E., et al. (2022). Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251.

Scheurer, J., et al. (2023). Technical report: Large language models can strategically deceive their users when put under pressure. arXiv preprint arXiv:2311.07590.

Te Mana Raraunga. (2018). Māori data sovereignty principles. Retrieved from https://www.temanararaunga.maori.nz

Waitangi Tribunal. (2011). Ko Aotearoa tēnei: A report into claims concerning New Zealand law and policy affecting Māori culture and identity (Wai 262).

Wei, J., et al. (2022). Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

Zwetsloot, R., & Dafoe, A. (2019). Thinking about risks from AI: Accidents, misuse and structure. Lawfare.


— End of Document —