Problem
When agents can modify policy/constitution text, safety regressions can be introduced gradually and go unnoticed. Without versioning, signatures, and policy review gates, teams cannot prove who changed what, why it changed, or whether critical safeguards were weakened.
Solution
Store the constitution in a version-controlled, signed repository:
- YAML/TOML rules live in Git for automated rule enforcement; natural language principles guide LLM-based evaluation.
- Each commit is signed (e.g., Sigstore); CI runs automated policy checks.
- Only commits signed by approved reviewers or automated tests are merged.
- The agent can propose changes, but a gatekeeper merges them.
- Use semantic versioning: MAJOR for core safety principle changes, MINOR for additions, PATCH for clarifications.
Combine policy-as-code with release discipline: every constitutional change is diffable, reviewable, and test-gated before activation. This gives governance history, rollback capability, and auditable control over alignment policy evolution.
How to use it
- Require
git commit -Sor similar. - Run diff-based linting to flag deletions of critical rules.
- Expose constitution
HEADas read-only context in every agent episode.
Trade-offs
- Pros: Strong auditability, safer policy evolution, and fast rollback of bad constitutional changes.
- Cons: Slower policy iteration and extra operational burden for signing, review, and CI checks.
References
-
Anthropic, Constitutional AI: Harmlessness from AI Feedback (arXiv:2212.08073, 2022)
-
Hiveism, Self-Alignment by Constitutional AI
-
OpenAI, Model Spec
-
Primary source: https://substack.com/home/post/p-161422949?utm_campaign=post&utm_medium=web