CCCL #5 · London · April 2026

Agents
in the
Pipeline

From a skill that runs well locally to a service you can trust in production

Rhys Cazenove
AI Lead · NHM
A familiar situation

You've built something that works.

A Claude Code skill that runs locally — explores, reasons, produces something useful
A doc generator triggered when code merges. A code reviewer triggered on commit. A log investigator triggered by a webhook. An alert responder triggered when error rate spikes.
It works well. You've iterated on it. You trust it. Now you want to run it as a service.
What changes

Local vs Service.

Running locally
You control it.
Agent can explore.
Human always in the loop.
Failures are visible.
Running as a service
Runs unattended.
No exploration.
No human oversight.
Drift is silent.
So how do we get there?

Three pillars to implement.

Guardrails: constrain what the agent can do at runtime. Block dangerous commands, confine it to its problem space, prevent unconventional tooling.
Confinement: an isolated, reproducible environment. No side effects, no state leakage, no dependency drift between runs.
Observability: visibility into everything the agent does, including what it's allowed to do. How you verify it hasn't drifted, and how you keep the guardrails current.
Running an agent in production without this is like letting a golden retriever loose at a buffet — enthusiastic, well-intentioned, and absolutely going to cause a scene.
Simon Willison's Lethal Trifecta

Three things dangerous together.

01 — fine alone

Access to private or sensitive data

Credentials, internal docs, personal data, confidential content

02 — fine alone

Exposure to untrusted content

External URLs, user input, web pages, third-party data sources

03 — fine alone

A mechanism to exfiltrate data

Outbound network access, file writes, API calls to external endpoints

Any one alone is manageable. All three together is the problem — untrusted content can inject instructions that use the exfiltration mechanism to leak private data. Break the flow across multiple agents. Design your architecture so no single agent holds all three.
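The last sentence can be made mechanical: before rollout, lint each agent's declared capabilities and refuse any set that contains all three legs. A minimal sketch — the capability names (`private_data`, `untrusted_input`, `network_egress`) are invented for illustration, not a real manifest schema:

```shell
#!/usr/bin/env bash
# Check a space-separated capability list for the lethal trifecta.
# Capability names are illustrative, not a real manifest schema.
has_lethal_trifecta() {
  local caps=" $1 " c
  for c in private_data untrusted_input network_egress; do
    case "$caps" in *" $c "*) ;; *) return 1 ;; esac   # missing leg → safe
  done
  return 0   # all three legs present → unsafe
}

has_lethal_trifecta "private_data untrusted_input network_egress" \
  && echo "UNSAFE: split this agent"       # → UNSAFE: split this agent
has_lethal_trifecta "private_data network_egress" \
  || echo "ok: one leg missing"            # → ok: one leg missing
```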
The Approach

Four parts, modular ownership.

Container Image

Confinement · tooling control

The first step. You decide exactly which libraries, frameworks, and tools are available to the agent — nothing else. Built with a Dockerfile, cached in a container registry (e.g. GitLab Registry) for reuse across every pipeline run. No dependency drift, no surprises.
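A Dockerfile for such an image might look like the sketch below — the base image, tool list, and pinned versions are illustrative, not NHM's actual build:

```dockerfile
# Hypothetical hardened-harness Dockerfile — all names/versions illustrative
FROM node:22-slim

# Only the tools the agent is allowed to use — nothing else
RUN apt-get update && apt-get install -y --no-install-recommends \
      git jq ripgrep \
 && rm -rf /var/lib/apt/lists/*

# Pin the Claude Code CLI to a known version to prevent drift between runs
RUN npm install -g @anthropic-ai/claude-code@1.0.0

# The agent never runs as root
RUN useradd -m agent
USER agent
WORKDIR /workspace
```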

Security Hooks

Guardrails · observability

PreToolUse validation intercepts every tool call before execution. PostToolUse captures metrics. Tip: configure your hooks to log every rejected call — review these to understand what the agent tried to do, then add or remove tools from the container accordingly.
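That tip can be sketched concretely as a guard predicate plus a rejection log. A minimal illustration — the blocklist patterns, log path, and JSON shape are assumptions, not NHM's production hook:

```shell
#!/usr/bin/env bash
# Minimal PreToolUse sketch with rejection logging — illustrative only.
LOG_FILE="${LOG_FILE:-./rejected.log}"

is_dangerous_command() {
  case "$1" in
    *'rm -rf'*|*'chmod 777'*|*'curl '*) return 0 ;;   # crude pattern blocklist
    *) return 1 ;;
  esac
}

pre_tool_use() {
  local tool_input="$1"
  if is_dangerous_command "$tool_input"; then
    # log every rejected call — the review feed for tuning the harness
    printf '%s\tBLOCKED\t%s\n' "$(date -u +%FT%TZ)" "$tool_input" >> "$LOG_FILE"
    echo '{"decision":"block"}'
    return 2          # a real hook would exit 2 here
  fi
  echo '{"decision":"allow"}'
}

pre_tool_use 'rm -rf /tmp/scratch'   # → {"decision":"block"}
pre_tool_use 'ls -la'                # → {"decision":"allow"}
```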

Pipeline Config

Reliability

GitLab CI orchestration with manual triggers, timeouts, artifact retention and audit compliance.
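A fragment of such a pipeline config might look like this — the job name, registry path, and prompt are illustrative only:

```yaml
# Hypothetical .gitlab-ci.yml fragment — names and paths are illustrative
doc-generator:
  image: registry.example.com/agents/hardened-harness:1.4
  when: manual                  # a human starts the run
  timeout: 30m                  # hard stop if the agent loops
  script:
    - claude -p "Run the documentation skill"   # non-interactive mode
  artifacts:
    paths: [output/]
    expire_in: 30 days          # retention for audit
```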

Ownership Model

Accountability

Security · Infra · DevOps · Domain — each team owns their layer. No single team is a bottleneck.

System Architecture

How the parts connect.

Harness repository — Dockerfile for a preconfigured, hardened container → Container registry
Workflow repository — skill + hook definitions → Agentic process → Output
Observability — local logs, Azure App Insights, Grafana
Team Ownership

One harness, four teams.

Software Engineering — Agentic skills setup
Infrastructure Team — Container setup
Security Team — Security Hooks
Domain Experts — Acceptance Testing
Defense in Depth

7 independent protection layers.

Action hooks
L1 — Command blocklist
L2 — Path traversal guard
L3 — Network egress control
L4 — Credential pattern block
Input guard
L5 — Prompt injection guard
Container
L6 — Container isolation
Audit
L7 — Audit logging
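As a flavour of one layer, the path traversal guard can start as a glob match on the requested path. A sketch — the blocked-path patterns are invented for illustration, and a real guard would also canonicalise with `realpath`:

```shell
#!/usr/bin/env bash
# Sketch of a path traversal guard — illustrative pattern set only.
is_path_escape() {
  case "$1" in
    ..|../*|*/..|*/../*)  return 0 ;;   # relative escapes out of the workspace
    /etc/*|/root/*|*.pem) return 0 ;;   # sensitive absolute paths / key files
    *)                    return 1 ;;
  esac
}

is_path_escape "../secrets.env" && echo "block"   # → block
is_path_escape "src/main.py"    || echo "allow"   # → allow
```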
Execution Flow

Agent calls a tool, hooks decide.

  1. GitLab CI triggers the pipeline and spins up a fresh Docker container
  2. Claude Code agent starts inside the container and begins working
  3. Agent attempts a tool call — every single one fires the PreToolUse hook first
  4. Hook inspects the call — allows it, or blocks it and returns a decision to the agent
  5. PostToolUse hook captures metrics on every completed call
# bash — fires before EVERY tool call; a non-zero exit blocks it
tool_input=$(cat)   # the hook receives the tool call as JSON on stdin
if is_dangerous_command "$tool_input"; then
  echo '{"decision":"block"}'
  exit 2
fi
exit 0
Use observability for a holistic view

Blocked calls are only part of the story.

Allowed calls matter just as much: a new threat pattern won't trigger any block, but it will show up in the allowed log
Anomalies in allowed traffic: unusual tool sequences, frequency spikes, or combinations that individually look fine but together look wrong
Near-misses: commands that almost matched a block pattern signal a gap in coverage before it becomes an incident
Harness drift: block rules that never fire may be redundant; new patterns appearing in allowed logs need new rules
Observability isn't just monitoring — it's the mechanism that keeps the harness current. Observe allowed calls → spot emerging patterns → update configuration → measure whether the new rule fires.
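The first step of that loop can be as simple as frequency analysis over the allowed-calls log. A sketch, assuming a tab-separated log format (timestamp, verdict, command) that is our invention:

```shell
#!/usr/bin/env bash
# Surface the most frequent allowed commands so emerging patterns can be
# reviewed and promoted to explicit rules. Log format is an assumption.
ALLOWED_LOG="${ALLOWED_LOG:-./allowed.log}"

top_allowed() {
  cut -f3 "$ALLOWED_LOG" | sort | uniq -c | sort -rn | head -n "${1:-5}"
}

# tiny sample log for illustration
printf '%s\tALLOWED\t%s\n' t1 'git log' t2 'git log' t3 'cat README.md' > "$ALLOWED_LOG"
top_allowed 2    # 'git log' (2 runs) tops the report
```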
Across Every Layer

Same hook, full visibility.

Security hooks
Blocked calls: pattern, category, frequency over time
Allowed calls: full log, anomaly detection, near-misses
Coverage heatmap: which rules fire vs which never have
Agent behaviour
Token spikes: far above baseline may indicate injection or a loop
Retry storms: agent repeatedly attempting the same failed call
Output drift: same skill, same input, degrading quality over time
Pipeline & container
Resource trends: CPU, memory, egress per run for cost forecasting
Dependency freshness: packages drifting from known good versions
Skills & output
Skill usage: which workflows run, how often, where they fail
Quality scores: verification pass rate, hallucination detection rate
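The same hook that guards calls can emit these metrics. A sketch of a PostToolUse emitter writing one structured line per call — the field names and the idea that a log shipper forwards them to App Insights/Grafana are assumptions:

```shell
#!/usr/bin/env bash
# Emit one JSON line per completed tool call for downstream dashboards.
# Field names are illustrative.
post_tool_use_metric() {
  local tool="$1" duration_ms="$2" status="$3"
  printf '{"event":"tool_call","tool":"%s","duration_ms":%s,"status":"%s"}\n' \
    "$tool" "$duration_ms" "$status"
}

post_tool_use_metric Bash 412 allowed
# → {"event":"tool_call","tool":"Bash","duration_ms":412,"status":"allowed"}
```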
Live at NHM

Low risk, high value.

Use case 01

Documentation Generator

Scans git history, groups changes by theme, generates ADRs and Mermaid architecture diagrams. Runs incrementally.

Use case 02

Onboarding Generator

Creates full developer onboarding docs — app overview, architecture guide, getting started, troubleshooting.

Use case 03

Incident Analysis

Webhook-triggered when errors exceed a threshold. Uses Azure MCP to analyse the issue and prepare evidence — deep links to KQL queries and charts — so the engineer assigned has a running start before deciding the best course of action.

AI handles the repetitive analysis. Humans review, approve, and act.
The Numbers

A reproducible pattern.

100+
Dangerous patterns
blocked at runtime
7
Independent
security layers
5
Modular parts
clear ownership
Extensible to any
agentic use case
GitLab CI
Azure DevOps
GitHub Actions
Any YAML CI/CD

Hooks are bash/PowerShell. Container runs anywhere. Platform-agnostic by design.

What We Learned

Five things that matter.

Governance first, features second. Get the harness right before you scale use cases.
Modular ownership is essential. Security shouldn't own the skills; domain experts shouldn't own the hooks.
Hook architecture gives you observability for free. The same pattern that blocks also emits metrics.
Start boring, get valuable. Documentation and auditing are perfect low-risk pilots.
Human oversight stays in the loop. AI analyses, humans approve and refine.
Thanks · Questions?

Rhys
Cazenove

AI Lead · Natural History Museum · South Kensington

linkedin.com/in/rhyscazenove
Natural History Museum
Claude Code
"Build the governance harness before you need it — not after the agent does something you didn't expect."