Data Shield

PII tokenization before the LLM sees your data, with an immutable audit log. Limited GA (v0.5).

Data Shield is TurfAI's governance layer for sensitive data. It tokenizes PII before a workflow, agent, or extraction call reaches the LLM provider — the model sees opaque tokens (EMAIL_84d0a133, SSN_c95eb795); your caller sees the original values restored in the response. Every tokenization event is recorded in an immutable audit log holding entity counts only — never the raw values.

Status: Limited GA (v0.5). The scope below — eight L1 entity types across three LLM entry points, per-request scope, audit log — is the current contract. Free-text PII (names, locations), the RAG path, chatbot chat, and document file contents are not covered in v0.5; see What's covered and the roadmap. When in doubt, the authoritative scope is the v0.5 compliance sheet, not this page.

How tokenization works

Tokenization is deterministic within a request scope — the same email maps to the same token across every message in one call, so the model reasons coherently about "the customer" as a consistent entity. It is isolated across scopes — the same email in two different calls (or two different tenants) maps to two different tokens. The scope key is {tenant_id}:{scope_type}:{scope_id}, so tenant isolation holds by construction and there is no cross-request linkage of real values. In v0.5 every scope is ephemeral: the token map lives in memory for the duration of one request and is destroyed when the request frame ends. There is no on-disk token map.
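The scoping rules above can be illustrated with a minimal Python sketch. This is not TurfAI's implementation: the class name, the keyed-hash derivation, and the random per-scope secret are assumptions chosen only to show determinism within one scope and isolation across scopes. Only the scope key format and the TYPE_ plus eight hex characters token shape come from this page.

```python
import hashlib
import hmac
import secrets


class ScopedTokenizer:
    """Hypothetical in-memory tokenizer for a single request scope (illustrative only)."""

    def __init__(self, tenant_id, scope_type, scope_id):
        # Scope key format described on this page: {tenant_id}:{scope_type}:{scope_id}
        self.scope_key = f"{tenant_id}:{scope_type}:{scope_id}"
        # Assumption: a random secret per scope, held in memory and never persisted.
        self._secret = secrets.token_bytes(32)
        self._forward = {}  # (entity_type, value) -> token
        self._reverse = {}  # token -> value

    def tokenize(self, entity_type, value):
        key = (entity_type, value)
        if key not in self._forward:
            message = f"{self.scope_key}|{entity_type}|{value}".encode()
            digest = hmac.new(self._secret, message, hashlib.sha256).hexdigest()
            token = f"{entity_type}_{digest[:8]}"
            self._forward[key] = token
            self._reverse[token] = value
        return self._forward[key]

    def restore(self, token):
        # Tokens this scope never issued are returned unchanged.
        return self._reverse.get(token, token)


request_a = ScopedTokenizer("tenant-1", "request", "req-001")
request_b = ScopedTokenizer("tenant-1", "request", "req-002")

first = request_a.tokenize("EMAIL", "john.doe@acme.com")
again = request_a.tokenize("EMAIL", "john.doe@acme.com")
other = request_b.tokenize("EMAIL", "john.doe@acme.com")
print(first == again, first == other)
```

Within request_a the email always maps to one token; request_b derives its own token for the same email, and neither object outlives the request that created it.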

Before / after: what the model actually sees

Say a workflow sends this to an llm_task:

Email john.doe@acme.com a payment reminder. His SSN on file is 123-45-6789.

Data Shield rewrites the payload before it leaves the service mesh, so the LLM provider receives only tokens:

Email EMAIL_84d0a133 a payment reminder. His SSN on file is SSN_c95eb795.

The model replies in terms of those same tokens (deterministic within the scope, so a second mention of the email is the same token):

Drafted a reminder to EMAIL_84d0a133. Note SSN_c95eb795 is masked in the message body.

On the egress path the gateway restores the original values, so your caller sees:

Drafted a reminder to john.doe@acme.com. Note 123-45-6789 is masked in the message body.

The LLM provider's logs and any fine-tuning pipeline contain tokens only — never john.doe@acme.com or 123-45-6789.
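The round trip above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the gateway's code: the two regular expressions are deliberately simplified stand-ins for the real detectors, the tokens are random rather than derived the way Data Shield derives them, and the function names are hypothetical.

```python
import re
import secrets

# Simplified patterns for illustration only; the real detectors are stricter.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
# Token shape used on this page: TYPE_ followed by eight hex characters.
TOKEN_RE = re.compile(r"\b[A-Z_]+_[0-9a-f]{8}\b")


def tokenize(text, token_map):
    """Replace detected values with tokens; token_map records token -> value."""
    by_value = {(tok.rsplit("_", 1)[0], val): tok for tok, val in token_map.items()}

    def make_swap(entity_type):
        def swap(match):
            key = (entity_type, match.group(0))
            if key not in by_value:  # reuse the token for a repeated value
                token = f"{entity_type}_{secrets.token_hex(4)}"
                by_value[key] = token
                token_map[token] = match.group(0)
            return by_value[key]
        return swap

    for entity_type, pattern in PATTERNS.items():
        text = pattern.sub(make_swap(entity_type), text)
    return text


def restore(text, token_map):
    """Swap known tokens back to their values; unknown tokens pass through."""
    return TOKEN_RE.sub(lambda m: token_map.get(m.group(0), m.group(0)), text)


token_map = {}
outbound = tokenize(
    "Email john.doe@acme.com a payment reminder. His SSN on file is 123-45-6789.",
    token_map,
)
email_token = next(tok for tok in token_map if tok.startswith("EMAIL_"))

# Simulated model reply: one real token plus one token this scope never issued.
reply = f"Drafted a reminder to {email_token}. Also note PHONE_0000abcd."
inbound = restore(reply, token_map)
print(outbound)
print(inbound)
```

The outbound text contains only tokens, the known token in the reply is restored, and the invented PHONE_0000abcd token is left untouched.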

What's covered (v0.5)

Eight L1 entity types are detected by regex plus format validation. The detectors run unconditionally when Data Shield is enabled on a node (detection target: < 5 ms p95):

| Type | Detection |
| --- | --- |
| EMAIL | RFC-5322 subset |
| PHONE | E.164 + common US/UK/IN local formats |
| SSN | US SSN (rejects 000 / 666 / 9xx prefixes) |
| CREDIT_CARD | 12–19 digits + Luhn checksum |
| IP | IPv4 dotted quad |
| AADHAAR | Indian Aadhaar (12 digits + Verhoeff checksum) |
| PAN | Indian PAN (5 letters + 4 digits + letter) |
| IBAN | International bank account number (structural check) |

Detectors are recall-biased: a 9-digit string resembling an SSN may be tokenized even if it isn't one. Over-tokenization is recoverable on the return path; under-tokenization is a leak. Treat what we tokenize as PII candidates, not a precise classification.
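To make "regex + format validation" concrete, here is a small Python sketch of three of the checks named in the table: the Luhn checksum, the SSN prefix rule, and the PAN shape. The function and constant names are hypothetical and the regular expressions are assumptions; the production detectors may differ in detail.

```python
import re


def luhn_valid(number):
    """Standard Luhn checksum over 12 to 19 digits (separators are ignored)."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 12 <= len(digits) <= 19:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


# Dashes are optional so that a bare 9-digit string is still treated as a candidate.
SSN_RE = re.compile(r"^(\d{3})-?(\d{2})-?(\d{4})$")


def ssn_candidate(value):
    """Shape check plus the prefix rule: reject 000, 666, and 9xx area numbers."""
    match = SSN_RE.match(value)
    if not match:
        return False
    area = match.group(1)
    return area not in ("000", "666") and not area.startswith("9")


# Indian PAN shape: 5 letters + 4 digits + 1 letter.
PAN_RE = re.compile(r"^[A-Z]{5}[0-9]{4}[A-Z]$")

print(luhn_valid("4111 1111 1111 1111"))  # True: a well-known test card number
print(ssn_candidate("123456789"))         # True: recall-biased, no dashes required
print(ssn_candidate("666-45-6789"))       # False: rejected prefix
print(bool(PAN_RE.match("ABCDE1234F")))   # True
```

Accepting the undashed form reflects the recall bias described above: a value that merely looks like an SSN is still treated as a candidate.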

Entry points covered (these three only):

  • POST /api/v1/chat — direct LLM chat; used by llm_task (text mode) and the agent_task ReAct loop (every chat turn).
  • POST /api/v1/extract — structured extraction; used by extraction_task and the agent_task file-mode extract.
  • agent_task ReAct loop — both the chat and extract call sites inside an agent's reasoning loop.

Not covered in v0.5 — plan accordingly, do not assume coverage:

  • Free-text PII: person names, locations, organizations, free-text dates. These need L3 NER (Microsoft Presidio + spaCy), which lands in v1.0. For example, "Alice Johnson visited the New York office" is forwarded to the LLM with all entities intact.
  • RAG chat path: questions and retrieved context are sent unmodified. See Knowledge base & RAG (RAG is not tokenized in v0.5).
  • Chatbot public chat: the public widget's round-trips do not pass through Data Shield.
  • Document file contents: file_urls are loaded by the multimodal provider directly; PII inside an attached PDF/image/DOCX reaches the provider unmodified.
  • Prompt Lab preview, customer-held KMS / persistent scope, evidence-pack export, and multi-script / non-Latin PII are all out of v0.5 scope.

Turning it on

Data Shield is opt-in — it is not on by default. Enable it at one of three levels.

Per node — flip the "Data Shield" switch on an llm_task or agent_task node and optionally restrict to a subset of entity types. The node's activity row carries the policy:

{
  "data_shield": {
    "enabled": true,
    "entity_types": ["EMAIL", "SSN", "CREDIT_CARD"]
  }
}

Omit entity_types to run all eight L1 detectors.

Per workflow — in the Resilience tab, "Require Data Shield on every LLM node." The executor fails any run whose LLM nodes haven't opted in, terminating with DATA_SHIELD_POLICY_VIOLATION:

{
  "data_shield_policy": { "required": true }
}

Per pack — declare data_shield_policy in a solution pack manifest. Every workflow instantiated from a pack template inherits the policy onto its activity row; the workflow author can refine it per instance:

{
  "data_shield_policy": {
    "required": true,
    "entity_types": ["EMAIL", "PHONE", "SSN", "CREDIT_CARD", "AADHAAR", "PAN"]
  }
}
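The interaction between the node switch and a required workflow policy can be sketched as follows. This is an illustration under assumptions, not the executor's code: the node dictionary shape, the set of node types treated as LLM nodes, the http_task node type, and the function names are all hypothetical. The JSON keys and the DATA_SHIELD_POLICY_VIOLATION code are taken from the examples above.

```python
ALL_L1_TYPES = ("EMAIL", "PHONE", "SSN", "CREDIT_CARD", "IP", "AADHAAR", "PAN", "IBAN")
LLM_NODE_TYPES = {"llm_task", "agent_task"}  # assumption: which node types count as LLM nodes


class DataShieldPolicyViolation(Exception):
    code = "DATA_SHIELD_POLICY_VIOLATION"


def effective_entity_types(node):
    """Entity types a node will tokenize; omitting entity_types means all eight."""
    shield = node.get("data_shield") or {}
    if not shield.get("enabled"):
        return ()
    return tuple(shield.get("entity_types") or ALL_L1_TYPES)


def enforce_workflow_policy(workflow, nodes):
    """Raise if the policy is required and any LLM node has not opted in."""
    policy = workflow.get("data_shield_policy") or {}
    if not policy.get("required"):
        return
    missing = [
        node["id"]
        for node in nodes
        if node["type"] in LLM_NODE_TYPES and not effective_entity_types(node)
    ]
    if missing:
        raise DataShieldPolicyViolation(
            f"{DataShieldPolicyViolation.code}: nodes without Data Shield: {missing}"
        )


workflow = {"data_shield_policy": {"required": True}}
nodes = [
    {"id": "draft", "type": "llm_task",
     "data_shield": {"enabled": True, "entity_types": ["EMAIL", "SSN"]}},
    {"id": "triage", "type": "agent_task", "data_shield": {"enabled": True}},
    {"id": "notify", "type": "http_task"},  # hypothetical non-LLM node, not checked
]

enforce_workflow_policy(workflow, nodes)  # passes: every LLM node has opted in
print(effective_entity_types(nodes[1]))   # all eight L1 types
```

Adding an llm_task node without a data_shield block to this list would make enforce_workflow_policy raise, mirroring the terminal failure described for the per-workflow policy.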

Audit log

Every shielded call writes one immutable row under the action data_shield.tokenise. It holds entity counts by type (never raw values), the layers invoked (["L1"] in v0.5), per-stage latencies, a retention timestamp (default 30 days, tunable via DATA_SHIELD_AUDIT_RETENTION_DAYS), and a correlation ID back to the run. A sample row:

{
  "action": "data_shield.tokenise",
  "entity_counts": { "EMAIL": 1, "SSN": 1 },
  "layers": ["L1"],
  "latencies_ms": { "tokenise_ms": 1, "roundtrip_ms": 1461 },
  "provider": "anthropic",
  "model": "claude",
  "correlation_id": "wfexec_5f1c…",
  "retention_until": "2026-07-20T00:00:00Z"
}

Rows are immutable after write (update/delete rejected at the database lifecycle layer). End users see their own activity at /account/activity; operators search across tenants at /admin/audit (Super Admin only).
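The "counts only" property of the audit row is easy to see in code. The sketch below assembles a row shaped like the sample above from a list of detected entities. The function name and its arguments are hypothetical, and the provider and model fields are omitted for brevity; the field names, the 30-day default, and the DATA_SHIELD_AUDIT_RETENTION_DAYS variable come from this page.

```python
import os
from collections import Counter
from datetime import datetime, timedelta, timezone


def build_audit_row(detected, correlation_id, latencies_ms, now=None):
    """Build an audit row from (entity_type, raw_value) pairs.

    Only the entity types are counted; the raw values never enter the row.
    """
    retention_days = int(os.environ.get("DATA_SHIELD_AUDIT_RETENTION_DAYS", "30"))
    now = now or datetime.now(timezone.utc)
    counts = Counter(entity_type for entity_type, _raw in detected)
    retention_until = now + timedelta(days=retention_days)
    return {
        "action": "data_shield.tokenise",
        "entity_counts": dict(counts),
        "layers": ["L1"],
        "latencies_ms": dict(latencies_ms),
        "correlation_id": correlation_id,
        "retention_until": retention_until.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


row = build_audit_row(
    detected=[("EMAIL", "john.doe@acme.com"), ("SSN", "123-45-6789")],
    correlation_id="wfexec_example",  # placeholder ID for illustration
    latencies_ms={"tokenise_ms": 1, "roundtrip_ms": 1461},
    now=datetime(2026, 6, 20, tzinfo=timezone.utc),
)
print(row["entity_counts"])
print(row["retention_until"])
```

Because the function discards each raw value as soon as it has counted the type, nothing sensitive can reach the stored row even if the caller passes real PII in.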

Failure modes

| Failure | Behaviour | Mitigation |
| --- | --- | --- |
| Detector error at ingress (regex panic, unicode edge case) | Fail-closed: return HTTP 503, block the call | By design: a false negative is a PII leak, so the call fails loudly rather than leaking. Retry once the detector recovers; check service health. |
| Audit emit failure (network, JWT, DMS down) | Fail-open: call succeeds, audit row missing, warning logged | Audit is observability, not correctness. Monitor for the warning and backfill from logs if a compliance window requires the row. |
| Workflow policy required=true but a node didn't opt in | Terminal DATA_SHIELD_POLICY_VIOLATION, workflow fails | Enable Data Shield on every LLM node in the workflow, or relax the workflow policy. |
| Unknown token in LLM response | Passed through untouched | The model may have invented a string matching our pattern; substituting it would corrupt the response. No action needed. |
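The asymmetry between the first two rows (fail-closed for detection, fail-open for audit) can be sketched as a small wrapper. Everything here is hypothetical: the function, the exception class, and the callables passed in are stand-ins that only illustrate the control flow, not the gateway's actual structure.

```python
import logging

logger = logging.getLogger("data_shield_sketch")


class ShieldUnavailable(Exception):
    """Stands in for the HTTP 503 returned when detection fails."""
    status_code = 503


def shielded_call(payload, detect, call_llm, emit_audit):
    # Fail-closed: if detection breaks, the LLM is never called.
    try:
        entities = detect(payload)
    except Exception as exc:
        raise ShieldUnavailable("detector error; call blocked") from exc

    response = call_llm(payload, entities)

    # Fail-open: a failed audit emit is logged, but the response is still returned.
    try:
        emit_audit(entities)
    except Exception:
        logger.warning("audit emit failed; audit row missing", exc_info=True)
    return response


def failing_audit(entities):
    raise ConnectionError("audit store unreachable")


result = shielded_call(
    "hello",
    detect=lambda payload: [],
    call_llm=lambda payload, entities: "ok",
    emit_audit=failing_audit,
)
print(result)
```

The ordering matters: detection runs before any provider call, so an error there cannot leak data, while the audit emit runs after the response exists, so its failure costs observability rather than correctness.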

Two kill switches, both off by default, exist for incident response only: DATA_SHIELD_DISABLED=true bypasses the gateway entirely (raw payload to the LLM, no audit row); DATA_SHIELD_AUDIT_DISABLED=true tokenizes normally but suppresses the audit emit. Any period during which either switch is active is an attestation gap.

How to test Data Shield

You can't see the tokenized payload directly (it's restored before it reaches you), so verify two things: that the response came back with real values restored, and that the audit row recorded the expected entity counts. Send a payload with known PII through a shielded llm_task, then read the audit log. The examples use the sandbox base URL https://apisandbox.turfai.in/api and walk through the same three steps in curl, Python, and TypeScript.

curl:

BASE="https://apisandbox.turfai.in/api"

# 1. Run a workflow whose llm_task has Data Shield enabled, with known PII in the input.
EXEC=$(curl -s -X POST "$BASE/workflow-executions" \
  -H "Authorization: Bearer $TURFAI_JWT" \
  -H "Content-Type: application/json" \
  -d '{ "data": { "activity": 77, "inputs": { "text": "Email john.doe@acme.com; SSN 123-45-6789." } } }' \
  | python3 -c 'import sys,json;print(json.load(sys.stdin)["data"]["id"])')

curl -s -X POST "$BASE/workflow-executions/$EXEC/execute" \
  -H "Authorization: Bearer $TURFAI_JWT" -H "Content-Type: application/json" \
  -d '{ "inputs": { "text": "Email john.doe@acme.com; SSN 123-45-6789." } }'

# 2. Confirm the response restored real values (you should see john.doe@acme.com, not a token).
#    Repeat this call until the run reports a completed status.
curl -s "$BASE/workflow-executions/$EXEC/status" -H "Authorization: Bearer $TURFAI_JWT"

# 3. Inspect your audit activity: the row should show {EMAIL:1, SSN:1} and layers ["L1"].
curl -s "$BASE/account/activity?action=data_shield.tokenise" -H "Authorization: Bearer $TURFAI_JWT"

Python:

import os, time, requests

BASE = "https://apisandbox.turfai.in/api"
HEAD = {"Authorization": f"Bearer {os.environ['TURFAI_JWT']}", "Content-Type": "application/json"}
PII  = {"text": "Email john.doe@acme.com; SSN 123-45-6789."}

# 1. Run a shielded workflow with known PII.
exec_id = requests.post(f"{BASE}/workflow-executions", headers=HEAD,
                        json={"data": {"activity": 77, "inputs": PII}}).json()["data"]["id"]
requests.post(f"{BASE}/workflow-executions/{exec_id}/execute", headers=HEAD,
              json={"inputs": PII}).raise_for_status()

# 2. Wait for the run to finish, then check that the response contains the real value, not a token.
for _ in range(30):
    data = requests.get(f"{BASE}/workflow-executions/{exec_id}/status", headers=HEAD).json()["data"]
    if data["status"] in ("completed", "failed"):
        break
    time.sleep(2)
assert data["status"] == "completed", f"run did not complete: {data['status']}"
assert "john.doe@acme.com" in str(data.get("results")), "detokenize failed: real value missing from response"

# 3. The audit row proves tokenization happened, with counts only.
audit = requests.get(f"{BASE}/account/activity", headers=HEAD,
                     params={"action": "data_shield.tokenise"}).json()
print(audit)   # expect entity_counts {"EMAIL": 1, "SSN": 1}, layers ["L1"]

TypeScript:

const BASE = "https://apisandbox.turfai.in/api";
const HEAD = {
  Authorization: `Bearer ${process.env.TURFAI_JWT}`,
  "Content-Type": "application/json",
};
const PII = { text: "Email john.doe@acme.com; SSN 123-45-6789." };

// 1. Run a shielded workflow with known PII.
const created = await fetch(`${BASE}/workflow-executions`, {
  method: "POST", headers: HEAD,
  body: JSON.stringify({ data: { activity: 77, inputs: PII } }),
});
const execId = (await created.json()).data.id;
await fetch(`${BASE}/workflow-executions/${execId}/execute`, {
  method: "POST", headers: HEAD, body: JSON.stringify({ inputs: PII }),
});

// 2. Wait for the run to finish, then check that the response contains the real value, not a token.
let data: any;
for (let i = 0; i < 30; i++) {
  data = (await (await fetch(`${BASE}/workflow-executions/${execId}/status`, { headers: HEAD })).json()).data;
  if (data.status === "completed" || data.status === "failed") break;
  await new Promise((r) => setTimeout(r, 2000));
}
if (data.status !== "completed") throw new Error(`run did not complete: ${data.status}`);
// `?? null` keeps JSON.stringify from returning undefined when results is absent.
if (!JSON.stringify(data.results ?? null).includes("john.doe@acme.com"))
  throw new Error("detokenize failed: real value missing from response");

// 3. The audit row proves tokenization happened, with counts only.
const audit = await (await fetch(`${BASE}/account/activity?action=data_shield.tokenise`, { headers: HEAD })).json();
console.log(audit); // expect entity_counts {EMAIL: 1, SSN: 1}, layers ["L1"]

Roadmap

Forward-looking — not part of the v0.5 contract above.

  • v0.6 (coming soon) — doc-batch action paths, chatbot public-chat path, Prompt Lab path.
  • v1.0 (coming soon) — RAG path coverage, Layer 2 customer gazetteer, Layer 3 NER (free-text names/locations/orgs), customer-held KMS (Google CMEK first), per-document persistent scope, in-boundary OCR, hash-chained evidence-pack export.

Related

  • Knowledge base & RAG: the RAG path is not tokenized in v0.5.
  • Solution packs: declare a data_shield_policy that workflows inherit.
  • API reference: synced contract for POST /api/v1/chat and POST /api/v1/extract, the entry points Data Shield covers.
