AI integration that doesn't break in production: 4 patterns we ship

AI features look good in a demo and fall over in production. Four patterns we use to ship Claude integrations that survive real load, real users, and real failure modes.

June 2026 · 10 min read

Every AI demo works. You feed it clean input, it returns something impressive, and the room nods. Then you deploy it. Noisy data arrives, rate limits kick in, a model version ships with subtly different behavior, and one day a customer calls to say your system confidently invented a transaction that never happened. This is the demo-to-production gap, and it swallows most AI integrations before they reach six months in production.

The demo-to-production gap

In 2026, AI reliability, cost, and governance are the three concerns keeping engineering leaders up at night. Not "can we make Claude do something impressive" — that part is easy. The hard part is building a system that behaves predictably when input is incomplete, when the API is slow, when the model returns something plausible-sounding but wrong, and when your finance team asks you to explain every automated decision.

We have shipped production AI systems across agricultural diagnostics, finance categorization, and document processing. The four patterns below are the ones that kept those systems alive past the first month.

Anthropic's guidance on building effective agents frames the same underlying tension well: most failures come from systems that assume the happy path. We agree. Every pattern here assumes Claude will sometimes be wrong, slow, or unavailable — and ships anyway.

Pattern 1: Structured output as the contract

Free-form Claude output is for humans. Downstream systems need typed contracts. If your application reads a Claude response with string parsing or regex, you have a latent bug that will surface in production the moment Claude changes how it phrases something.

We use Anthropic's tool use feature, with tool_choice set to a specific tool, so that every Claude call answers through a tool_use block whose input follows a declared JSON schema. We then validate that input before anything downstream reads it. On MkulimaOS, every disease-scouting call returns a DiagnosisResponse: a typed object with threat, confidence, recommendedAction, and reasoning. If the payload fails validation or Claude's confidence is low, the call routes to human review. It does not silently corrupt the dataset.

Here is the schema definition we use with Zod, paired with the Claude tool-use interface:

import { z } from "zod";

export const DiagnosisResponseSchema = z.object({
  threat: z.string(),          // e.g. "coffee_leaf_rust"
  confidence: z.number().min(0).max(1),
  recommendedAction: z.enum([
    "monitor",
    "flag_for_review",
    "apply_treatment",
    "escalate_to_agronomist",
  ]),
  reasoning: z.string().max(300),
});

export type DiagnosisResponse = z.infer<typeof DiagnosisResponseSchema>;

// Claude tool definition (passed to the API in the `tools` array)
export const diagnosisTool = {
  name: "submit_diagnosis",
  description: "Submit a structured plant disease diagnosis.",
  input_schema: {
    type: "object" as const, // keep the literal type so the object type-checks as a tool
    properties: {
      threat: { type: "string" },
      // Mirror the Zod bounds so the model sees them; Zod remains the enforcement point.
      confidence: { type: "number", minimum: 0, maximum: 1 },
      recommendedAction: {
        type: "string",
        enum: [
          "monitor",
          "flag_for_review",
          "apply_treatment",
          "escalate_to_agronomist",
        ],
      },
      reasoning: { type: "string", maxLength: 300 },
    },
    required: ["threat", "confidence", "recommendedAction", "reasoning"],
  },
};

Anything with confidence < 0.85 goes to the human queue. The schema is the production contract, not documentation.
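A minimal sketch of how that contract can be enforced at the call boundary. It assumes the response follows the Messages API shape, where content is an array of blocks and a tool_use block carries the tool arguments in input. The ContentBlock and Routed types, the function names, and the hand-written validator are illustrative stand-ins so the sketch has no dependencies; with the Zod schema in place, the validation step would be DiagnosisResponseSchema.safeParse.

```typescript
// Local mirror of the response content shape; not the SDK's own types.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

const ACTIONS = ["monitor", "flag_for_review", "apply_treatment", "escalate_to_agronomist"] as const;

interface Diagnosis {
  threat: string;
  confidence: number;
  recommendedAction: (typeof ACTIONS)[number];
  reasoning: string;
}

type Routed =
  | { route: "auto"; diagnosis: Diagnosis }
  | { route: "human_review"; reason: "no_tool_call" | "invalid_payload" | "low_confidence" };

const CONFIDENCE_THRESHOLD = 0.85; // the cut-off quoted in the text

// Hand-written stand-in for DiagnosisResponseSchema.safeParse.
function isDiagnosis(value: unknown): value is Diagnosis {
  if (typeof value !== "object" || value === null) return false;
  const { threat, confidence, recommendedAction, reasoning } = value as Record<string, unknown>;
  return (
    typeof threat === "string" &&
    typeof confidence === "number" &&
    confidence >= 0 &&
    confidence <= 1 &&
    typeof recommendedAction === "string" &&
    (ACTIONS as readonly string[]).includes(recommendedAction) &&
    typeof reasoning === "string" &&
    reasoning.length <= 300
  );
}

// Every path that is not a valid, confident diagnosis ends in human review.
function routeDiagnosis(content: ContentBlock[]): Routed {
  const call = content.find(
    (block): block is Extract<ContentBlock, { type: "tool_use" }> =>
      block.type === "tool_use" && block.name === "submit_diagnosis",
  );
  if (!call) return { route: "human_review", reason: "no_tool_call" };

  const input: unknown = call.input;
  if (!isDiagnosis(input)) return { route: "human_review", reason: "invalid_payload" };
  if (input.confidence < CONFIDENCE_THRESHOLD) return { route: "human_review", reason: "low_confidence" };
  return { route: "auto", diagnosis: input };
}
```

The useful property is that there is no code path where an unvalidated payload reaches the dataset: a missing tool call, a malformed payload, and a low score all return the same kind of result with a different reason.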

Pattern 2: Deterministic fallback, every time

Every Claude call in our systems has an explicit fallback path. Not a try/catch that logs an error and moves on — a defined, business-logic-aware alternative route that keeps the system coherent.

In MkulimaOS disease scouting, a low-confidence diagnosis does not trigger automatic chemical application. It opens a task for a human agronomist to review within 24 hours. In finance categorization, uncertain entries route to a manual queue rather than being silently assigned to a catch-all category. The business is never worse off than it was before Claude was involved.

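Here is the pattern in TypeScript. ExpenseInput, the ExpenseResponseSchema fields, the threshold value, and callClaude are illustrative stand-ins for your own types and API wrapper:

import { z } from "zod";

const CONFIDENCE_THRESHOLD = 0.85; // assumed here; tune per feature

// Illustrative shapes; substitute your own.
const ExpenseResponseSchema = z.object({
  category: z.string(),
  confidence: z.number().min(0).max(1),
});
type ExpenseInput = { description: string; amountCents: number };
type ReviewReason = "low_confidence" | "parse_error" | "api_unavailable";
type ExpenseResult =
  | ({ status: "classified" } & z.infer<typeof ExpenseResponseSchema>)
  | { status: "pending_review"; reason: ReviewReason; queuedAt: string };

// Your wrapper around the Claude API; resolves to the raw tool input.
declare function callClaude(input: ExpenseInput): Promise<unknown>;

const queueForReview = (reason: ReviewReason): ExpenseResult => ({
  status: "pending_review",
  reason,
  queuedAt: new Date().toISOString(),
});

async function classifyExpense(input: ExpenseInput): Promise<ExpenseResult> {
  try {
    const raw = await callClaude(input);
    const parsed = ExpenseResponseSchema.safeParse(raw);

    if (!parsed.success) return queueForReview("parse_error");
    if (parsed.data.confidence < CONFIDENCE_THRESHOLD) return queueForReview("low_confidence");

    return { status: "classified", ...parsed.data };
  } catch (err) {
    // Network error, rate limit, or timeout. Log it so outages stay visible.
    console.error("Claude call failed; queuing for review", err);
    return queueForReview("api_unavailable");
  }
}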

Anthropic's writing on long-running agent harnesses covers this from a different angle: the architecture of your harness determines whether failures are recoverable. Build the harness first, then add Claude into it.
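One way to read "build the harness first" is a wrapper that owns the timeout and the fallback, so the model call is just a function passed in. The sketch below is an assumption about how such a harness could look, not a description of any particular codebase; runWithFallback, Outcome, and the reason labels are hypothetical names.

```typescript
type Pending = {
  status: "pending_review";
  reason: "timeout" | "api_unavailable";
  queuedAt: string;
};
type Outcome<T> = { status: "ok"; value: T } | Pending;

function pending(reason: Pending["reason"]): Pending {
  return { status: "pending_review", reason, queuedAt: new Date().toISOString() };
}

// Runs `call` under a deadline. The returned promise never rejects:
// a slow call, a rejected call, and a synchronous throw all become queue entries.
function runWithFallback<T>(call: () => Promise<T>, timeoutMs: number): Promise<Outcome<T>> {
  return new Promise<Outcome<T>>((resolve) => {
    // Whichever settles first wins; later calls to resolve are ignored.
    const timer = setTimeout(() => resolve(pending("timeout")), timeoutMs);
    Promise.resolve()
      .then(call) // starting inside the chain also captures synchronous throws
      .then(
        (value) => resolve({ status: "ok", value }),
        () => resolve(pending("api_unavailable")),
      )
      .then(() => clearTimeout(timer));
  });
}
```

Because the harness has no knowledge of Claude, it can be tested with stub functions before any model call exists, and the same wrapper serves every feature that needs a deadline and a review queue.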

Pattern 3: Cost-aware model selection

Most production Claude calls do not need Opus. Most do not even need Sonnet. Running Opus on every request because "it's smarter" is how AI bills become a board agenda item three months after launch.

Our default routing in production:

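Haiku handles routine classification tasks: categorizing a transaction, checking whether a photo is blurry, extracting a date from a document. These are the 80% of calls. Pricing depends on the Haiku version: Claude 3 Haiku launched at $0.25 per million input tokens (roughly $0.00025 per 1K), and later Haiku releases are priced higher, so check the current pricing page before budgeting.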
Sonnet handles structured reasoning tasks where output quality matters — disease diagnosis, draft-document generation, multi-step data extraction.
Opus handles genuine edge cases: ambiguous inputs that Sonnet flagged, high-stakes decisions with low confidence, anything that needs the deepest reasoning.
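The routing rules above can be sketched as a small pure function. The task kinds, field names, and the 0.85 low-confidence cut-off are assumptions made for illustration, and the tiers are kept as labels rather than model IDs so the mapping to concrete models stays in configuration.

```typescript
type Tier = "haiku" | "sonnet" | "opus";

type TaskKind =
  | "classification"
  | "simple_extraction"
  | "multi_step_extraction"
  | "diagnosis"
  | "drafting";

interface RoutingRequest {
  kind: TaskKind;
  highStakes: boolean;
  flaggedAmbiguousBy?: Tier; // tier that already tried and flagged the input
  priorConfidence?: number; // confidence reported by that earlier attempt
}

// Cheapest tier expected to handle each kind of task.
const DEFAULT_TIER: Record<TaskKind, Tier> = {
  classification: "haiku",
  simple_extraction: "haiku",
  multi_step_extraction: "sonnet",
  diagnosis: "sonnet",
  drafting: "sonnet",
};

const LOW_CONFIDENCE = 0.85; // assumed cut-off

function selectTier(req: RoutingRequest): Tier {
  // Escalate only on an explicit signal, never by default.
  if (req.flaggedAmbiguousBy === "sonnet") return "opus";
  const lowConfidence =
    req.priorConfidence !== undefined && req.priorConfidence < LOW_CONFIDENCE;
  if (req.highStakes && lowConfidence) return "opus";
  return DEFAULT_TIER[req.kind];
}
```

Keeping the function pure makes the routing policy easy to unit-test and easy to audit when someone asks why a given call went to the most expensive tier.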

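On top of model selection, prompt caching cuts the input cost of the cached portion of a prompt by up to 90%: cache reads are billed at a fraction of the normal input price, while the first write costs slightly more than an uncached call. Any pipeline with a long system prompt or large document that does not change between calls should have caching enabled. On MkulimaOS, our disease-knowledge base is a 12,000-token system prompt that is cached across every scouting call during a session, as long as calls arrive often enough to keep the cache entry alive. The difference in daily API cost is not small.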
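A rough worked example of where the saving comes from. It assumes the multipliers Anthropic has published for the default 5-minute cache (writes at 1.25x the base input price, reads at 0.1x) and that every call after the first hits a warm cache; both should be checked against the current pricing page. Costs are expressed in units of "one uncached read of the shared prefix", so no dollar price is needed. The function names are hypothetical.

```typescript
interface CachePricing {
  writeMultiplier: number; // cost of the first, cache-writing call
  readMultiplier: number; // cost of each later cache hit
}

// Assumed values for the default 5-minute cache; verify before relying on them.
const ASSUMED_PRICING: CachePricing = { writeMultiplier: 1.25, readMultiplier: 0.1 };

// Cost of the shared prefix over `calls` calls; pass null for "no caching".
function prefixCostUnits(calls: number, pricing: CachePricing | null): number {
  if (calls <= 0) return 0;
  if (pricing === null) return calls;
  return pricing.writeMultiplier + (calls - 1) * pricing.readMultiplier;
}

// Fraction of prefix cost saved by caching; negative means caching cost more.
function prefixSavings(calls: number, pricing: CachePricing): number {
  const uncached = prefixCostUnits(calls, null);
  return uncached === 0 ? 0 : 1 - prefixCostUnits(calls, pricing) / uncached;
}

// Shape of a cached system block in a Messages API request body.
function cachedSystemBlock(text: string) {
  return [{ type: "text" as const, text, cache_control: { type: "ephemeral" as const } }];
}

console.log(prefixSavings(1, ASSUMED_PRICING)); // one call: caching costs 25% more
console.log(prefixSavings(100, ASSUMED_PRICING)); // 100 calls: about 0.89
```

Under these assumptions a single call is more expensive with caching, two calls already come out ahead, and the saving approaches the 90% ceiling only as the number of warm-cache calls grows. The saving applies to the cached prefix alone; per-call input and all output tokens are billed as usual.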

The right model for most production AI calls is not the most capable one. It is the cheapest one that reliably produces a parseable output for that specific input distribution.

Pattern 4: Observability before you go live

If you cannot explain what Claude said, why it said it, how long it took, and how much it cost — for every call, permanently — you are not ready for production. Log everything: prompt hash, model version, token count, latency, confidence score, and the final routing decision.
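A sketch of what one such log record could look like. The field names are assumptions chosen to match the list above, and the hash is an FNV-1a-style function over UTF-16 code units, used here only so the example has no dependencies; a production system would more likely store a SHA-256 digest. The usage argument mirrors the input_tokens and output_tokens fields of the API response.

```typescript
interface CallLog {
  promptHash: string;
  model: string; // the exact model version string sent to the API
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  confidence: number | null; // null when no valid payload was parsed
  routing: "auto" | "human_review" | "fallback";
  loggedAt: string;
}

// Non-cryptographic stand-in: stable across runs, cheap, not collision-resistant.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

function buildCallLog(args: {
  prompt: string;
  model: string;
  usage: { input_tokens: number; output_tokens: number };
  startedAtMs: number;
  finishedAtMs: number;
  confidence: number | null;
  routing: CallLog["routing"];
}): CallLog {
  return {
    promptHash: fnv1a(args.prompt),
    model: args.model,
    inputTokens: args.usage.input_tokens,
    outputTokens: args.usage.output_tokens,
    latencyMs: args.finishedAtMs - args.startedAtMs,
    confidence: args.confidence,
    routing: args.routing,
    loggedAt: new Date().toISOString(),
  };
}
```

Hashing the prompt rather than storing it keeps the table small and avoids copying user data into logs, while still letting you group calls by prompt version.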

Model behavior changes when you move to a new model version, whether you upgrade deliberately or a model alias starts pointing at a newer snapshot. It also drifts when your input distribution shifts: a new product category, a seasonal change in agricultural data, a new document format from a supplier. Without an eval baseline, you will not notice the drift until a downstream system starts behaving strangely.

Anthropic's post on evals for AI agents is the clearest treatment of this we have seen. The key insight: evals are not test suites. They are the monitoring layer for a system whose behavior you did not fully specify. Most teams skip evals because they feel expensive to build. They are cheap compared to the incident that follows.
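As a rough illustration of an eval baseline, the sketch below scores a fixed, labeled set of inputs against the accuracy recorded the last time the system was known to be healthy. The names, the exact-match comparison, and the 0.02 tolerance are all assumptions; real evals for structured outputs usually need per-field or graded scoring.

```typescript
interface EvalCase {
  id: string;
  expected: string; // the label a human agreed on for this input
}

interface EvalReport {
  accuracy: number;
  failedIds: string[];
  regressed: boolean;
}

// Compares current predictions with a recorded baseline accuracy.
// `predictions` maps case id to the label the system produced on this run.
function scoreAgainstBaseline(
  cases: EvalCase[],
  predictions: Map<string, string>,
  baselineAccuracy: number,
  tolerance = 0.02,
): EvalReport {
  const failedIds = cases
    .filter((c) => predictions.get(c.id) !== c.expected)
    .map((c) => c.id);
  const accuracy = cases.length === 0 ? 0 : 1 - failedIds.length / cases.length;
  return { accuracy, failedIds, regressed: accuracy < baselineAccuracy - tolerance };
}
```

Running the same set on a schedule, and again on every model or prompt change, turns "behavior drifted" from a customer report into a failed check with a list of case ids to inspect.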

At minimum, track:

Confidence score distribution over time (a shift here signals model or data drift)
Percentage of calls routed to human review (a sudden spike is a signal)
P50/P95/P99 latency per endpoint
Cost per call, per day, per feature
Parse error rate on structured outputs
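The metrics in this list can be computed from per-call records with a few lines of code. The sketch below is one possible shape, with hypothetical field names, using the nearest-rank method for percentiles; in practice the same aggregation would more likely run as a SQL query over the log table.

```typescript
interface CallRecord {
  latencyMs: number;
  costUsd: number;
  confidence: number | null; // null when the output failed to parse
  routedToHuman: boolean;
  parseError: boolean;
}

interface Summary {
  calls: number;
  p50LatencyMs: number;
  p95LatencyMs: number;
  p99LatencyMs: number;
  reviewRate: number;
  parseErrorRate: number;
  totalCostUsd: number;
  meanConfidence: number | null;
}

// Nearest-rank percentile over an ascending-sorted array; NaN when empty.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return Number.NaN;
  const rank = Math.max(1, Math.ceil((p * sortedAsc.length) / 100));
  return sortedAsc[rank - 1];
}

function summarize(records: CallRecord[]): Summary {
  const n = records.length;
  const latencies = records.map((r) => r.latencyMs).sort((a, b) => a - b);
  const confidences = records
    .map((r) => r.confidence)
    .filter((c): c is number => c !== null);
  const share = (count: number): number => (n === 0 ? 0 : count / n);
  const sum = (values: number[]): number => values.reduce((acc, v) => acc + v, 0);

  return {
    calls: n,
    p50LatencyMs: percentile(latencies, 50),
    p95LatencyMs: percentile(latencies, 95),
    p99LatencyMs: percentile(latencies, 99),
    reviewRate: share(records.filter((r) => r.routedToHuman).length),
    parseErrorRate: share(records.filter((r) => r.parseError).length),
    totalCostUsd: sum(records.map((r) => r.costUsd)),
    meanConfidence: confidences.length === 0 ? null : sum(confidences) / confidences.length,
  };
}
```

Computing the summary per day and per feature, then comparing consecutive days, is enough to surface the shifts the list describes without any dedicated tooling.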

Build a simple dashboard before you ship. A single Postgres table and a Metabase query is enough to start. The goal is to notice when something changes, not to build an observability product.

How we build with Claude as the Kenya Ambassador

Spidey Labs is led by Peter Kibet, Anthropic's Community Ambassador for Kenya. These patterns are not derived from blog posts — they are what we run in production today, and what we debug when they fail. We use Claude Code in our own engineering pipeline daily: writing code, reviewing PRs, scouting for regressions. The advice we give clients is the advice we take ourselves.

For engineering teams evaluating AI integration, we offer a 1-week Sprint engagement. We audit your current stack, identify where AI adds genuine leverage versus where it adds risk, and build a proof-of-concept against your real data before any larger commitment. No production systems are touched until the proof holds up.

If your team is moving toward AI integration and wants a production-grounded second opinion, start at our AI consulting service page or book a call directly. We are happy to talk through your architecture before you build anything.
