Data Shield
PII tokenization before the LLM sees your data, with an immutable audit log. Limited GA (v0.5).
Data Shield is TurfAI's governance layer for sensitive data. It tokenizes PII before a
workflow, agent, or extraction call reaches the LLM provider — the model sees opaque tokens
(EMAIL_84d0a133, SSN_c95eb795); your caller sees the original values restored in the
response. Every tokenization event is recorded in an immutable audit log holding entity
counts only — never the raw values.
Status: Limited GA (v0.5). The scope below — eight L1 entity types across three LLM entry points, per-request scope, audit log — is the current contract. Free-text PII (names, locations), the RAG path, chatbot chat, and document file contents are not covered in v0.5; see What's covered and the roadmap. When in doubt, the authoritative scope is the v0.5 compliance sheet, not this page.
How tokenization works
Tokenization is deterministic within a request scope — the same email maps to the same
token across every message in one call, so the model reasons coherently about "the customer"
as a consistent entity. It is isolated across scopes — the same email in two different
calls (or two different tenants) maps to two different tokens. The scope key is
{tenant_id}:{scope_type}:{scope_id}, so tenant isolation holds by construction and there is
no cross-request linkage of real values. In v0.5 every scope is ephemeral: the token map
lives in memory for the duration of one request and is destroyed when the request frame ends.
There is no on-disk token map.
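The determinism and isolation properties above can be pictured with a small sketch. It is not Data Shield's implementation: the class name, the per-scope random secret, and the HMAC-based token derivation are all assumptions made for illustration. Only the scope key format and the TYPE_8-hex-characters token shape come from this page.

```python
from __future__ import annotations

import hashlib
import hmac
import secrets


class ScopedTokenizer:
    """Hypothetical in-memory token map for one request scope."""

    def __init__(self, tenant_id: str, scope_type: str, scope_id: str) -> None:
        # Scope key format from the text: {tenant_id}:{scope_type}:{scope_id}
        self.scope_key = f"{tenant_id}:{scope_type}:{scope_id}"
        # Assumption: a fresh random secret per scope, so tokens cannot be linked across scopes.
        self._secret = secrets.token_bytes(32)
        # Lives only as long as this object, mirroring the ephemeral scope.
        self._token_to_value: dict[str, str] = {}

    def tokenize(self, entity_type: str, value: str) -> str:
        message = f"{self.scope_key}|{entity_type}|{value}".encode()
        digest = hmac.new(self._secret, message, hashlib.sha256).hexdigest()
        token = f"{entity_type}_{digest[:8]}"
        self._token_to_value[token] = value
        return token

    def restore(self, token: str) -> str | None:
        return self._token_to_value.get(token)


scope_a = ScopedTokenizer("tenant-1", "request", "req-001")
scope_b = ScopedTokenizer("tenant-1", "request", "req-002")

first = scope_a.tokenize("EMAIL", "john.doe@acme.com")
second = scope_a.tokenize("EMAIL", "john.doe@acme.com")
other = scope_b.tokenize("EMAIL", "john.doe@acme.com")

print(first == second)  # same scope, same value: same token
print(first == other)   # different scope: different token, barring a 32-bit collision
```

Dropping the ScopedTokenizer object discards both the secret and the map, which is the in-memory analogue of the scope being destroyed when the request frame ends.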
Before / after: what the model actually sees
Say a workflow sends this to an llm_task:
Email john.doe@acme.com a payment reminder. His SSN on file is 123-45-6789.
Data Shield rewrites the payload before it leaves the service mesh, so the LLM provider receives only tokens:
Email EMAIL_84d0a133 a payment reminder. His SSN on file is SSN_c95eb795.
The model replies in terms of those same tokens (deterministic within the scope, so a second mention of the email is the same token):
Drafted a reminder to EMAIL_84d0a133. Note SSN_c95eb795 is masked in the message body.
On the egress path the gateway restores the original values, so your caller sees:
Drafted a reminder to john.doe@acme.com. Note 123-45-6789 is masked in the message body.
The LLM provider's logs and any fine-tuning pipeline contain tokens only, never
john.doe@acme.com or 123-45-6789.
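The round trip above can be approximated in a few lines. This is a sketch under stated assumptions, not the gateway's code: the RequestScope class, the two simplified regexes, and the random token suffix are hypothetical stand-ins. Leaving unrecognized tokens untouched on the way back follows the "Unknown token in LLM response" row in the failure-modes table.

```python
import re
import secrets

# Simplified stand-ins for two L1 detectors (assumption: the real patterns are stricter).
DETECTORS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
# Token shape seen in the examples: TYPE_ followed by 8 lowercase hex characters.
TOKEN_PATTERN = re.compile(r"\b[A-Z]+(?:_[A-Z]+)*_[0-9a-f]{8}\b")


class RequestScope:
    """Hypothetical per-request token map, held in memory only."""

    def __init__(self):
        self.value_to_token = {}
        self.token_to_value = {}

    def _token_for(self, entity_type, value):
        # Reuse the token if this value was already seen in the scope.
        if value not in self.value_to_token:
            token = f"{entity_type}_{secrets.token_hex(4)}"
            self.value_to_token[value] = token
            self.token_to_value[token] = value
        return self.value_to_token[value]

    def tokenize(self, text):
        for entity_type, pattern in DETECTORS.items():
            text = pattern.sub(lambda m, t=entity_type: self._token_for(t, m.group(0)), text)
        return text

    def restore(self, text):
        # Tokens that are not in this scope's map are passed through unchanged.
        return TOKEN_PATTERN.sub(
            lambda m: self.token_to_value.get(m.group(0), m.group(0)), text
        )


scope = RequestScope()
prompt = "Email john.doe@acme.com a payment reminder. His SSN on file is 123-45-6789."
outbound = scope.tokenize(prompt)
print(outbound)  # tokens only; the suffixes are random per scope

email_token = scope.value_to_token["john.doe@acme.com"]
reply = f"Drafted a reminder to {email_token}. EMAIL_00000000 stays as it is."
restored = scope.restore(reply)
print(restored)  # Drafted a reminder to john.doe@acme.com. EMAIL_00000000 stays as it is.
```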
What's covered (v0.5)
Eight L1 entity types, detected by regex + format validation, run unconditionally when Data Shield is enabled on a node (detection target: < 5 ms p95):
| Type | Detection |
|---|---|
| EMAIL | RFC-5322 subset |
| PHONE | E.164 + common US/UK/IN local formats |
| SSN | US SSN (rejects 000 / 666 / 9xx prefixes) |
| CREDIT_CARD | 12–19 digits + Luhn checksum |
| IP | IPv4 dotted quad |
| AADHAAR | Indian Aadhaar (12 digits + Verhoeff checksum) |
| PAN | Indian PAN (5 letters + 4 digits + letter) |
| IBAN | International bank account number (structural check) |
Detectors are recall-biased: a 9-digit string resembling an SSN may be tokenized even if it isn't one. Over-tokenization is recoverable on the return path; under-tokenization is a leak. Treat what we tokenize as PII candidates, not a precise classification.
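To make "regex + format validation" concrete, here is a minimal sketch of two of the checks named in the table: the Luhn checksum for CREDIT_CARD and the prefix rejection for SSN. The function names and the SSN regex are assumptions; the actual detectors may apply further rules that this page does not list.

```python
import re


def luhn_valid(number: str) -> bool:
    """Return True if the digits pass the Luhn checksum and number 12 to 19."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 12 <= len(digits) <= 19:
        return False
    total = 0
    # Walk right to left, doubling every second digit and folding results above 9.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


# Assumption: dashes are optional, so a bare 9-digit string is also a candidate.
SSN_RE = re.compile(r"^(\d{3})-?(\d{2})-?(\d{4})$")


def ssn_candidate(text: str) -> bool:
    """Return True for SSN-shaped strings whose prefix is not 000, 666, or 9xx."""
    match = SSN_RE.match(text)
    if not match:
        return False
    area = match.group(1)
    return area not in ("000", "666") and not area.startswith("9")


print(luhn_valid("4111 1111 1111 1111"))  # True
print(luhn_valid("4111 1111 1111 1112"))  # False
print(ssn_candidate("123-45-6789"))       # True
print(ssn_candidate("666-45-6789"))       # False
print(ssn_candidate("123456789"))         # True: recall-biased, may not be a real SSN
```

The last line shows the recall bias in practice: any 9-digit string with an acceptable prefix is treated as a candidate.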
Entry points covered (these three only):
- POST /api/v1/chat: direct LLM chat; used by llm_task (text mode) and the agent_task ReAct loop (every chat turn).
- POST /api/v1/extract: structured extraction; used by extraction_task and the agent_task file-mode extract.
- agent_task ReAct loop: both the chat and extract call sites inside an agent's reasoning loop.
Not covered in v0.5 — plan accordingly, do not assume coverage:
- Free-text PII: person names, locations, organizations, free-text dates. These need L3
  NER (Microsoft Presidio + spaCy), which lands in v1.0. For example,
  "Alice Johnson visited the New York office" is forwarded to the LLM with all entities intact.
- RAG chat path: questions and retrieved context are sent unmodified. See Knowledge base & RAG (RAG is not tokenized in v0.5).
- Chatbot public chat: the public widget's round-trips do not pass through Data Shield.
- Document file contents: file_urls are loaded by the multimodal provider directly; PII inside an attached PDF/image/DOCX reaches the provider unmodified.
- Prompt Lab preview, customer-held KMS / persistent scope, evidence-pack export, and multi-script / non-Latin PII are all out of v0.5 scope.
Turning it on
Data Shield is opt-in — it is not on by default. Enable it at one of three levels.
Per node — flip the "Data Shield" switch on an llm_task or agent_task node and
optionally restrict to a subset of entity types. The node's activity row carries the policy:
{
  "data_shield": {
    "enabled": true,
    "entity_types": ["EMAIL", "SSN", "CREDIT_CARD"]
  }
}
Omit entity_types to run all eight L1 detectors.
Per workflow: in the Resilience tab, enable "Require Data Shield on every LLM node." The
executor then fails any run whose LLM nodes haven't opted in, terminating with
DATA_SHIELD_POLICY_VIOLATION:
{
  "data_shield_policy": { "required": true }
}
Per pack: declare data_shield_policy in a solution pack
manifest. Every workflow instantiated from a pack template inherits the policy onto its
activity row; the workflow author can refine it per instance:
{
  "data_shield_policy": {
    "required": true,
    "entity_types": ["EMAIL", "PHONE", "SSN", "CREDIT_CARD", "AADHAAR", "PAN"]
  }
}
Audit log
Every shielded call writes one immutable row under the action data_shield.tokenise. It holds
entity counts by type (never raw values), the layers invoked (["L1"] in v0.5), per-stage
latencies, a retention timestamp (default 30 days, tunable via
DATA_SHIELD_AUDIT_RETENTION_DAYS), and a correlation ID back to the run. A sample row:
{
  "action": "data_shield.tokenise",
  "entity_counts": { "EMAIL": 1, "SSN": 1 },
  "layers": ["L1"],
  "latencies_ms": { "tokenise_ms": 1, "roundtrip_ms": 1461 },
  "provider": "anthropic",
  "model": "claude",
  "correlation_id": "wfexec_5f1c…",
  "retention_until": "2026-07-20T00:00:00Z"
}
Rows are immutable after write (update/delete rejected at the database lifecycle layer). End
users see their own activity at /account/activity; operators search across tenants at
/admin/audit (Super Admin only).
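As an illustration of "counts only, never raw values", the sketch below assembles a row shaped like the sample. The build_audit_row function, its parameters, and the wfexec_example correlation ID are hypothetical. The field names, the L1 layer, and the DATA_SHIELD_AUDIT_RETENTION_DAYS default of 30 days are taken from this section.

```python
import os
from collections import Counter
from datetime import datetime, timedelta, timezone


def build_audit_row(detected, tokenise_ms, roundtrip_ms, correlation_id, now=None):
    """Build an audit row from (entity_type, raw_value) pairs, keeping only the types."""
    now = now or datetime.now(timezone.utc)
    retention_days = int(os.environ.get("DATA_SHIELD_AUDIT_RETENTION_DAYS", "30"))
    return {
        "action": "data_shield.tokenise",
        # Raw values are dropped here; only per-type counts are recorded.
        "entity_counts": dict(Counter(entity_type for entity_type, _ in detected)),
        "layers": ["L1"],
        "latencies_ms": {"tokenise_ms": tokenise_ms, "roundtrip_ms": roundtrip_ms},
        "correlation_id": correlation_id,
        "retention_until": (now + timedelta(days=retention_days)).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        ),
    }


row = build_audit_row(
    detected=[("EMAIL", "john.doe@acme.com"), ("SSN", "123-45-6789")],
    tokenise_ms=1,
    roundtrip_ms=1461,
    correlation_id="wfexec_example",  # placeholder, not a real run ID
    now=datetime(2026, 6, 20, tzinfo=timezone.utc),
)
print(row["entity_counts"])    # {'EMAIL': 1, 'SSN': 1}
print(row["retention_until"])  # 2026-07-20T00:00:00Z with the default 30-day retention
```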
Failure modes
| Failure | Behaviour | Mitigation |
|---|---|---|
| Detector error at ingress (regex panic, unicode edge case) | Fail-closed — return HTTP 503, block the call | By design: a false negative is a PII leak, so the call fails loudly rather than leaking. Retry once the detector recovers; check service health. |
| Audit emit failure (network, JWT, DMS down) | Fail-open — call succeeds, audit row missing, warning logged | Audit is observability, not correctness. Monitor for the warning and backfill from logs if a compliance window requires the row. |
| Workflow policy required=true but a node didn't opt in | Terminal DATA_SHIELD_POLICY_VIOLATION, workflow fails | Enable Data Shield on every LLM node in the workflow, or relax the workflow policy. |
| Unknown token in LLM response | Passed through untouched | The model may have invented a string matching our pattern; substituting it would corrupt the response. No action needed. |
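The first three rows of the table can be read as control flow. The sketch below is one way to picture it, under assumptions: the exception classes, function names, callable parameters, and the node "id" field are all hypothetical. Only the behaviours come from the table: block with a 503 on detector error, continue with a warning on audit failure, and fail the workflow with DATA_SHIELD_POLICY_VIOLATION when a node has not opted in.

```python
import logging

logger = logging.getLogger("data_shield_sketch")


class ShieldUnavailable(Exception):
    """Hypothetical error that a gateway would map to HTTP 503."""
    status_code = 503


class PolicyViolation(Exception):
    code = "DATA_SHIELD_POLICY_VIOLATION"


def enforce_workflow_policy(workflow_policy, llm_nodes):
    """Raise if the policy is required and any LLM node has not opted in."""
    if not workflow_policy.get("required", False):
        return
    missing = [node["id"] for node in llm_nodes
               if not node.get("data_shield", {}).get("enabled", False)]
    if missing:
        raise PolicyViolation(f"{PolicyViolation.code}: nodes without Data Shield: {missing}")


def shielded_call(payload, detect, call_llm, emit_audit):
    try:
        entities = detect(payload)
    except Exception as exc:
        # Fail-closed: the LLM is never called if detection cannot be trusted.
        raise ShieldUnavailable("detector failed; call blocked") from exc
    response = call_llm(payload)
    try:
        emit_audit(entities)
    except Exception:
        # Fail-open: the call still succeeds, but the missing row is logged.
        logger.warning("audit emit failed; audit row missing", exc_info=True)
    return response


llm_calls = []

def fake_llm(payload):
    llm_calls.append(payload)
    return "ok"

def broken_detector(payload):
    raise ValueError("simulated regex failure")

def failing_audit(entities):
    raise ConnectionError("simulated audit sink outage")

try:
    shielded_call("text", broken_detector, fake_llm, lambda entities: None)
    blocked_status = None
except ShieldUnavailable as exc:
    blocked_status = exc.status_code

result = shielded_call("text", lambda payload: [], fake_llm, failing_audit)
print(blocked_status, result, len(llm_calls))  # 503 ok 1
```

The asymmetry is deliberate in the table's terms: a detection failure risks a leak, so it blocks the call, while an audit failure only loses observability.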
Two off-by-default kill switches exist for incident response only:
DATA_SHIELD_DISABLED=true bypasses the gateway entirely (raw payload to LLM, no audit row);
DATA_SHIELD_AUDIT_DISABLED=true tokenizes normally but suppresses the audit emit. Any period
during which either switch is active is an attestation gap.
How to test Data Shield
You can't see the tokenized payload directly (it's restored before it reaches you), so verify
two things: the response came back with real values restored, and the audit row recorded
the expected entity counts. Send a payload with known PII through a shielded llm_task, then
read the audit log. The base URL is https://apisandbox.turfai.in/api.
BASE="https://apisandbox.turfai.in/api"
# 1. Run a workflow whose llm_task has Data Shield enabled, with known PII in the input.
EXEC=$(curl -s -X POST "$BASE/workflow-executions" \
-H "Authorization: Bearer $TURFAI_JWT" \
-H "Content-Type: application/json" \
-d '{ "data": { "activity": 77, "inputs": { "text": "Email john.doe@acme.com; SSN 123-45-6789." } } }' \
| python3 -c 'import sys,json;print(json.load(sys.stdin)["data"]["id"])')
curl -s -X POST "$BASE/workflow-executions/$EXEC/execute" \
-H "Authorization: Bearer $TURFAI_JWT" -H "Content-Type: application/json" \
-d '{ "inputs": { "text": "Email john.doe@acme.com; SSN 123-45-6789." } }'
# 2. Confirm the response restored real values (you should see john.doe@acme.com, not a token).
curl -s "$BASE/workflow-executions/$EXEC/status" -H "Authorization: Bearer $TURFAI_JWT"
# 3. Inspect your audit activity: the row should show {EMAIL:1, SSN:1} and layers ["L1"].
curl -s "$BASE/account/activity?action=data_shield.tokenise" -H "Authorization: Bearer $TURFAI_JWT"
The same check in Python:
import os
import time
import requests
BASE = "https://apisandbox.turfai.in/api"
HEAD = {"Authorization": f"Bearer {os.environ['TURFAI_JWT']}", "Content-Type": "application/json"}
PII = {"text": "Email john.doe@acme.com; SSN 123-45-6789."}
# 1. Run a shielded workflow with known PII.
exec_id = requests.post(f"{BASE}/workflow-executions", headers=HEAD,
                        json={"data": {"activity": 77, "inputs": PII}}).json()["data"]["id"]
requests.post(f"{BASE}/workflow-executions/{exec_id}/execute", headers=HEAD,
              json={"inputs": PII}).raise_for_status()
# 2. Poll until the run finishes (up to about 60 seconds).
for _ in range(30):
    data = requests.get(f"{BASE}/workflow-executions/{exec_id}/status", headers=HEAD).json()["data"]
    if data["status"] in ("completed", "failed"):
        break
    time.sleep(2)
# The run must have completed, and the restored response must contain the real value, never a token.
assert data["status"] == "completed", f"run ended with status {data['status']}"
assert "john.doe@acme.com" in str(data["results"]), "detokenize failed: token leaked to caller"
# 3. The audit row proves tokenization happened, with counts only.
audit = requests.get(f"{BASE}/account/activity", headers=HEAD,
                     params={"action": "data_shield.tokenise"}).json()
print(audit)  # expect entity_counts {"EMAIL": 1, "SSN": 1}, layers ["L1"]
The same check in TypeScript:
const BASE = "https://apisandbox.turfai.in/api";
const HEAD = {
  Authorization: `Bearer ${process.env.TURFAI_JWT}`,
  "Content-Type": "application/json",
};
const PII = { text: "Email john.doe@acme.com; SSN 123-45-6789." };
// 1. Run a shielded workflow with known PII.
const created = await fetch(`${BASE}/workflow-executions`, {
  method: "POST", headers: HEAD,
  body: JSON.stringify({ data: { activity: 77, inputs: PII } }),
});
const execId = (await created.json()).data.id;
await fetch(`${BASE}/workflow-executions/${execId}/execute`, {
  method: "POST", headers: HEAD, body: JSON.stringify({ inputs: PII }),
});
// 2. The restored response must contain the real value, never a token.
let data: any;
for (let i = 0; i < 30; i++) {
  data = (await (await fetch(`${BASE}/workflow-executions/${execId}/status`, { headers: HEAD })).json()).data;
  if (data.status === "completed" || data.status === "failed") break;
  await new Promise((r) => setTimeout(r, 2000));
}
if (!JSON.stringify(data.results).includes("john.doe@acme.com"))
  throw new Error("detokenize failed: token leaked to caller");
// 3. The audit row proves tokenization happened, with counts only.
const audit = await (await fetch(`${BASE}/account/activity?action=data_shield.tokenise`, { headers: HEAD })).json();
console.log(audit); // expect entity_counts {EMAIL: 1, SSN: 1}, layers ["L1"]
Roadmap
Forward-looking — not part of the v0.5 contract above.
- v0.6 (coming soon) — doc-batch action paths, chatbot public-chat path, Prompt Lab path.
- v1.0 (coming soon) — RAG path coverage, Layer 2 customer gazetteer, Layer 3 NER (free-text names/locations/orgs), customer-held KMS (Google CMEK first), per-document persistent scope, in-boundary OCR, hash-chained evidence-pack export.
Related
- Knowledge base & RAG: the RAG path is not tokenized in v0.5.
- Solution packs: declare a data_shield_policy that workflows inherit.
- API reference: synced contract for POST /api/v1/chat and POST /api/v1/extract, the entry points Data Shield covers.