Failure Intelligence Architecture

Design spec for the operational failure intelligence system — severity indexing, recovery complexity, prevention patterns, related failures, deployment risk scoring, and ecosystem impact mapping.

May 18, 2026 · 17 min read

#ops #failures #architecture #intelligence #debugging #platform

Individual failure reports are useful. A documented failure with a clear root cause and a verified fix prevents that failure from happening to the next operator who hits the same condition. That value is real and immediate.

But failure reports in isolation only solve the problem once. The deeper value — the value that scales — comes from patterns across failures: which root causes recur, which fixes prevent multiple failures, which deployment patterns carry the highest risk, which categories of failure produce the highest repeat rate. This is what the Failure Intelligence layer exists to extract.

The Failure Archive collects operational failures. The Failure Intelligence layer turns that collection into a queryable operational knowledge base.

What Was Built: The FailureIntelligence Component

The FailureIntelligence component (components/mdx/failure-intelligence.tsx) is a structured metadata panel rendered at the top of every Failure Archive entry. It renders nine intelligence signals, all sourced from MDX frontmatter:

Severity — low, medium, high, or critical, with a visual severity badge and color coding
Category — build, runtime, deployment, data, performance, dependency, configuration, or authentication
Recovery complexity — minutes, hours, or days, indicating how much operator effort the fix requires
Prevention patterns — a list of specific actions that would have prevented this failure
Related failures — a cross-linked list of other Failure Archive entries with labeled relation types
Deployment risk — how likely this failure type is to show up at deployment time
Ecosystem impact — which other systems in the operator's stack are affected when this failure occurs
Time to detect — how quickly this failure becomes visible after the triggering condition
Repeat risk — how likely this failure is to recur after the initial fix

The component renders inline in the MDX document, before the narrative content. It gives the reader the operational verdict — severity, recovery cost, prevention path — before they read the story of how it happened.
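As a rough illustration of the "verdict first" idea, the signals can be modeled as a single typed object from which a one-line summary is derived. This is a minimal sketch, not the actual props of failure-intelligence.tsx: the names `FailureSignals` and `verdictLine` are hypothetical, and the field names are assumed to mirror the frontmatter keys.

```typescript
// Hypothetical shape for the signals; field names are assumed to follow the frontmatter keys.
type Level = "low" | "medium" | "high";

interface FailureSignals {
  severity: Level | "critical";
  category: string;
  recovery_complexity: "minutes" | "hours" | "days";
  prevention_patterns: string[];
  related_failures: { slug: string; relation: string }[];
  deployment_risk: Level;
  ecosystem_impact: string[];
  time_to_detect: string;
  repeat_risk: Level;
}

// Builds the one-line verdict a reader would see before the narrative:
// severity, recovery cost, and the first prevention pattern (if any).
function verdictLine(s: FailureSignals): string {
  const prevention =
    s.prevention_patterns.length > 0
      ? s.prevention_patterns[0]
      : "no prevention pattern recorded";
  return `${s.severity.toUpperCase()} | recover in ${s.recovery_complexity} | prevent: ${prevention}`;
}
```

A real component would render these values as badges and lists rather than a string; the string form simply makes the ordering of information explicit.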

The Failure Frontmatter Intelligence Schema

The existing Failure Archive frontmatter fields handle basic categorization:

YAML

severity: high              # low | medium | high | critical
failure_status: resolved    # resolved | open | partial
failure_type: deployment    # broad classification
resolution_time: 23 minutes # how long it took to fix in this instance

The intelligence layer extends this with fields that enable cross-failure analysis:

YAML

# Core intelligence fields
category: deployment                  # 8-value taxonomy (see below)
recovery_complexity: minutes          # minutes | hours | days
deployment_risk: high                 # low | medium | high
time_to_detect: "immediate"           # string: "immediate" | "next deploy" | "3 hours" | "days"
repeat_risk: low                      # low | medium | high

# Pattern arrays
prevention_patterns:
  - "Never add export const runtime = 'edge' to files that use next/og"
  - "Run tsc --noEmit before every push to surface edge compatibility errors locally"

# Cross-failure links
related_failures:
  - slug: "server-module-client-bundle"
    relation: "same-root"             # same-root | same-category | prevention-pair | escalation-risk
  - slug: "next-mdx-remote-v6-blockjs"
    relation: "same-category"
  - slug: "vercel-build-timeout"
    relation: "escalation-risk"

# Ecosystem impact
ecosystem_impact:
  - "Blocks all Vercel deployments"
  - "Prevents content updates"
  - "Breaks OG image generation for all routes"
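Because the enumerated fields have closed value sets, an entry can be checked mechanically before it is published. The sketch below shows one way to do that, assuming the frontmatter has already been parsed into a plain object; `validateIntelligence` and the `ENUMS` table are hypothetical names, and the allowed values are copied from the schema comments above.

```typescript
// Allowed values copied from the schema comments; the table itself is a hypothetical helper.
const ENUMS: Record<string, string[]> = {
  category: ["build", "runtime", "deployment", "data", "performance",
             "dependency", "configuration", "authentication"],
  recovery_complexity: ["minutes", "hours", "days"],
  deployment_risk: ["low", "medium", "high"],
  repeat_risk: ["low", "medium", "high"],
};
const RELATIONS = ["same-root", "same-category", "prevention-pair", "escalation-risk"];

// Returns a list of human-readable problems; an empty list means the entry passed.
function validateIntelligence(fm: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const field of Object.keys(ENUMS)) {
    const value = fm[field];
    if (typeof value !== "string" || ENUMS[field].indexOf(value) === -1) {
      errors.push(`${field}: expected one of ${ENUMS[field].join(" | ")}`);
    }
  }
  // related_failures is optional here; each link must use a known relation type.
  const rawLinks = fm["related_failures"];
  const links: unknown[] = Array.isArray(rawLinks) ? rawLinks : [];
  for (const link of links) {
    const relation =
      typeof link === "object" && link !== null
        ? (link as { relation?: unknown }).relation
        : undefined;
    if (typeof relation !== "string" || RELATIONS.indexOf(relation) === -1) {
      errors.push(`related_failures: unknown relation "${String(relation)}"`);
    }
  }
  return errors;
}
```

Returning a list of problems rather than throwing on the first one lets a build step report every invalid field in an entry at once.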

The core fields in the intelligence schema are enumerated or constrained to a short value, with no free-text judgments that require interpretation. This is intentional: the fields are designed to be computationally queryable in Phase 2.
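To make "computationally queryable" concrete, here is one possible shape for a cross-failure query over parsed entries. It is a sketch under assumptions, not the Phase 2 design: `FailureEntry`, `patternFrequency`, and `slugsAtOrAboveRepeatRisk` are hypothetical names, and entries are assumed to be loaded into memory as plain objects.

```typescript
// Minimal entry shape for this sketch; a real entry would carry the full frontmatter.
interface FailureEntry {
  slug: string;
  repeat_risk: "low" | "medium" | "high";
  prevention_patterns: string[];
}

// Counts how many entries share each prevention pattern, most frequent first.
function patternFrequency(entries: FailureEntry[]): [string, number][] {
  // Object.create(null) avoids collisions with inherited keys such as "constructor".
  const counts: Record<string, number> = Object.create(null);
  for (const entry of entries) {
    for (const pattern of entry.prevention_patterns) {
      counts[pattern] = (counts[pattern] || 0) + 1;
    }
  }
  return Object.keys(counts)
    .map((p): [string, number] => [p, counts[p]])
    .sort((a, b) => b[1] - a[1]);
}

// Enumerated values have a natural order, so threshold queries reduce to index comparison.
const RISK_ORDER: FailureEntry["repeat_risk"][] = ["low", "medium", "high"];

// Slugs of entries whose repeat risk is at or above the given level.
function slugsAtOrAboveRepeatRisk(
  entries: FailureEntry[],
  min: FailureEntry["repeat_risk"],
): string[] {
  const floor = RISK_ORDER.indexOf(min);
  return entries
    .filter((e) => RISK_ORDER.indexOf(e.repeat_risk) >= floor)
    .map((e) => e.slug);
}
```

Queries like these only work because the values are drawn from fixed sets; the same questions asked of free-text fields would need fuzzy matching or manual review.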

Failure Category Taxonomy

The eight failure categories create a structured classification system. Each category has a specific definition, typical root causes, and canonical prevention patterns.

Build

Definition: Failure occurs during compilation or static analysis, before any runtime execution.

Typical root causes: TypeScript type errors, missing environment variables required at build time, incompatible dependency versions, invalid import paths, Edge Runtime API incompatibilities.