Constitutional Security
KRAIT's constitutional approach to AI safety: rules compiled into the system rather than injected into prompts, and how that compares with prompt-based safety, sandboxing, and RLHF alignment.
Rules as Structure, Not Suggestions
Constitutional AI typically refers to training models with a set of principles they should follow. KRAIT borrows the term but applies it more literally: the constitution is compiled into the system. The seven KRAIT rules are not instructions that the agent receives in its context window. They are Rust code, executed by the Narsil NIF, with no mechanism for the agent to read, interpret, or override them.
This distinction matters. A prompt-based rule can be jailbroken. A compiled structural rule cannot — it operates outside the model's awareness entirely.
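To make that concrete, here is a minimal sketch of what a compiled rule could look like. It is not KRAIT's actual rule set or API: the `Violation` type, the `check_*` functions, and the substring heuristics (which assume, purely for illustration, that the generated code is Rust source) stand in for the real structural analysis the Narsil NIF performs.

```rust
// Minimal sketch: rules as compiled code paths, not prompt text.
// All names here (Violation, check_no_network, ...) are illustrative only.

#[derive(Debug)]
pub struct Violation {
    pub rule: &'static str,
    pub detail: String,
}

/// A rule is a plain Rust function over the generated source.
/// It lives in the compiled binary; the agent never sees it.
type Rule = fn(source: &str) -> Result<(), Violation>;

fn check_no_network(source: &str) -> Result<(), Violation> {
    // Placeholder heuristic standing in for real structural analysis.
    if source.contains("TcpStream::connect") {
        return Err(Violation {
            rule: "no_network",
            detail: "direct socket connection in generated code".into(),
        });
    }
    Ok(())
}

fn check_no_filesystem(source: &str) -> Result<(), Violation> {
    if source.contains("std::fs") {
        return Err(Violation {
            rule: "no_filesystem",
            detail: "filesystem access in generated code".into(),
        });
    }
    Ok(())
}

/// The constitution: a fixed table of checks, resolved at compile time.
/// No code path lets the agent read, reorder, or remove entries.
pub const CONSTITUTION: &[Rule] = &[check_no_network, check_no_filesystem];

pub fn enforce(source: &str) -> Result<(), Violation> {
    CONSTITUTION.iter().try_for_each(|rule| rule(source))
}
```

The shape is the point: the constitution is a compile-time table of code paths. There is no prompt to inject into, because the rules never pass through the context window.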
Comparison with Prompt-Based Safety
Most agent frameworks inject safety instructions into the system prompt. "Do not access the filesystem." "Do not make network requests." These work surprisingly well in practice — until they don't. Prompt injection, context window overflow, and creative reinterpretation all erode prompt-based safety over long-running sessions.
KRAIT does not rely on the model's compliance. The model can generate any code it wants. If that code violates a KRAIT rule, Narsil rejects it before execution. The model's intent is irrelevant; only the structure of the generated code matters.
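Continuing the sketch above (it reuses the illustrative `enforce` function and `Violation` type), a gate in front of the executor might look roughly like this. Note what the signature accepts and what it does not:

```rust
/// Illustrative only: what a pre-execution gate could look like.
pub enum Decision {
    /// Code passed every compiled rule and may be handed to the executor.
    Execute(String),
    /// Code violated a rule; it never reaches the executor.
    Reject { rule: &'static str, detail: String },
}

/// The gate sees only the generated source. There is no parameter for the
/// model's explanation, confidence, or stated intent, because none of those
/// are inputs to the decision; only the structure of the code is.
pub fn gate(source: String) -> Decision {
    match enforce(&source) {
        Ok(()) => Decision::Execute(source),
        Err(v) => Decision::Reject { rule: v.rule, detail: v.detail },
    }
}
```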
Comparison with Sandboxing
Container-based sandboxing (Docker, gVisor, Firecracker) restricts what a process can do at the OS level. This is valuable but coarse-grained. A sandboxed agent can still exfiltrate data over allowed network channels, encode secrets in output text, or abuse legitimate APIs in unintended ways.
KRAIT operates at a finer granularity. It inspects the semantics of the code itself, not just its system calls. Sandboxing and KRAIT are complementary: KRAIT stops rule-violating code before it executes, while the sandbox provides a fallback if something slips through.
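As a rough illustration of what inspecting semantics can mean at the AST level, the sketch below uses the `syn` crate to walk generated Rust source and flag references to banned modules. This is an assumption-laden example, not KRAIT's implementation: the banned-path list is invented, and a real analysis would also have to resolve `use` aliases and re-exports.

```rust
// Sketch of semantic (AST-level) inspection, assuming the generated code is
// Rust source. Requires: syn = { version = "2", features = ["full", "visit"] }.
// The banned-path list is illustrative, not KRAIT's rule set.

use syn::visit::{self, Visit};

const BANNED_PATHS: &[&str] = &["std::net", "std::process"];

struct SemanticCheck {
    violations: Vec<String>,
}

impl<'ast> Visit<'ast> for SemanticCheck {
    fn visit_path(&mut self, node: &'ast syn::Path) {
        // Reconstruct the path as written ("std::net::TcpStream", ...).
        let segments: Vec<String> = node
            .segments
            .iter()
            .map(|seg| seg.ident.to_string())
            .collect();
        let path = segments.join("::");
        if BANNED_PATHS.iter().any(|banned| path.starts_with(banned)) {
            self.violations.push(path);
        }
        // Keep walking nested paths (generic arguments, etc.).
        visit::visit_path(self, node);
    }
}

/// Inspect what the code means, not which syscalls it happens to make:
/// a reference to `std::net` is flagged even if it is never reached at runtime.
pub fn inspect(source: &str) -> Result<(), Vec<String>> {
    let file = syn::parse_file(source).map_err(|e| vec![e.to_string()])?;
    let mut check = SemanticCheck { violations: Vec::new() };
    check.visit_file(&file);
    if check.violations.is_empty() {
        Ok(())
    } else {
        Err(check.violations)
    }
}
```

A sandbox would only see the resulting connect(2) syscall at runtime, and only if that branch actually ran; the AST-level check flags the capability before anything executes.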
Comparison with RLHF Alignment
Reinforcement learning from human feedback shapes a model's behavior during training. KRAIT enforces rules at inference time. RLHF makes the model less likely to produce harmful code. KRAIT makes it impossible for harmful code to execute. These are different layers of defense, and KRAIT is designed to work alongside aligned models, not replace alignment.