

Datadog Plugin for Savant

A standalone A2A agent that bridges Datadog monitoring alerts and metrics into the Savant mesh and exposes Datadog operations as tools for automated incident response.

What It Does

  • Listens for Datadog monitor webhooks (add @webhook-savant to monitor notifications) and polls the Monitors and Events APIs as a fallback
  • Transforms monitor state transitions into structured Savant cues with priority routing
  • Exposes nine Datadog tools (get alerts, query metrics, get/mute/unmute monitors, get events, host metrics, create events, get SLOs) callable by any agent on the mesh
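
The webhook-to-cue transformation described above can be sketched as follows. This is a hedged sketch: the payload and cue shapes, field names, and the `pager`-tag escalation are assumptions for illustration, not the plugin's actual types.

```typescript
// Hypothetical shapes; the real plugin's types may differ.
interface DatadogWebhookPayload {
  id: string;         // monitor id
  title: string;
  transition: string; // e.g. "Triggered", "Warn", "Recovered"
  tags: string[];
}

interface SavantCue {
  source: "datadog";
  monitorId: string;
  title: string;
  status: "Alert" | "Warn" | "OK";
  priority: "critical" | "warning" | "info";
  tags: string[];
}

// Map a monitor state transition onto a structured cue with a priority.
function toCue(p: DatadogWebhookPayload): SavantCue {
  const status =
    p.transition === "Recovered" ? "OK" :
    p.transition === "Warn" ? "Warn" : "Alert";
  const priority =
    status === "OK" ? "info" :
    status === "Warn" ? "warning" :
    p.tags.includes("pager") ? "critical" : "warning";
  return {
    source: "datadog",
    monitorId: p.id,
    title: p.title,
    status,
    priority,
    tags: p.tags,
  };
}
```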

Architecture

Deployment

DATADOG_API_KEY=xxxxxxxx \
DATADOG_APP_KEY=xxxxxxxx \
DATADOG_SITE=datadoghq.com \
DATADOG_WEBHOOK_TOKEN=whtoken_xxxxxxxx \
SAVANT_MESH_URL=http://savant-mesh:3000 \
SAVANT_CLIENT_ID=your-client-id \
SAVANT_CLIENT_SECRET=your-client-secret \
node dist/index.js

settings.json

{
    "plugin": "savant-datadog-plugin",
    "version": "1.0.0",
    "agent": {
        "name": "savant-datadog-plugin",
        "description": "Bridges Datadog alerts into the Savant mesh for automated incident response.",
        "port": 4200,
        "agent_type": "datadog-plugin"
    },
    "mesh": {
        "url": "http://localhost:3000",
        "tenant_id": "acme",
        "tenant_secret": "${SAVANT_TENANT_SECRET}"
    },
    "datadog": {
        "api_key": "${DATADOG_API_KEY}",
        "app_key": "${DATADOG_APP_KEY}",
        "site": "datadoghq.com",
        "webhook_token": "${DATADOG_WEBHOOK_TOKEN}"
    },
    "event_ingestion": {
        "webhook_path": "/webhooks/datadog",
        "polling_enabled": true,
        "monitor_polling_interval_seconds": 30,
        "event_polling_interval_seconds": 120
    },
    "routing_rules": [
        {
            "match": { "monitor_status": "Alert", "tags_include": ["pager"] },
            "action": { "auto_execute": true, "cue_priority": "critical" }
        },
        {
            "match": { "monitor_status": "Warn" },
            "action": { "auto_execute": false, "cue_priority": "warning" }
        },
        {
            "match": { "event_type": "monitor.recovered" },
            "action": { "log_only": false, "cue_priority": "info" }
        }
    ],
    "deduplication": {
        "enabled": true,
        "window_seconds": 300,
        "key_fields": ["monitor_id", "monitor_status"]
    },
    "tools": {
        "enabled": [
            "datadog_get_alerts",
            "datadog_query_metrics",
            "datadog_get_monitor",
            "datadog_mute_monitor",
            "datadog_unmute_monitor",
            "datadog_get_events",
            "datadog_get_host_metrics",
            "datadog_create_event",
            "datadog_get_slo"
        ]
    }
}
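
The `routing_rules` and `deduplication` sections above can be read as a first-match rule table plus a keyed suppression window. A minimal sketch, assuming the plugin evaluates rules in order and joins `key_fields` into a string key (both assumptions; the field names mirror settings.json):

```typescript
interface RoutingRule {
  match: { monitor_status?: string; event_type?: string; tags_include?: string[] };
  action: { auto_execute?: boolean; log_only?: boolean; cue_priority: string };
}

interface Alert {
  monitor_id: number;
  monitor_status: string;
  event_type?: string;
  tags: string[];
}

// First match wins; a rule matches when every specified criterion holds.
function route(rules: RoutingRule[], a: Alert) {
  return rules.find(r =>
    (r.match.monitor_status === undefined || r.match.monitor_status === a.monitor_status) &&
    (r.match.event_type === undefined || r.match.event_type === a.event_type) &&
    (r.match.tags_include === undefined || r.match.tags_include.every(t => a.tags.includes(t)))
  )?.action;
}

// Deduplication key built from the configured key_fields; alerts sharing a
// key within window_seconds would be suppressed.
function dedupKey(a: Alert, keyFields: (keyof Alert)[]): string {
  return keyFields.map(f => String(a[f])).join("|");
}
```

With the rules from the config above, an `Alert`-status monitor tagged `pager` resolves to the critical, auto-execute action, and `["monitor_id", "monitor_status"]` keeps a recovered monitor from being deduplicated against its own alert.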

Tools

| Tool | Description |
| --- | --- |
| datadog_get_alerts | Fetch monitors in alert/warn state |
| datadog_query_metrics | Query time-series metrics |
| datadog_get_monitor | Get details for a specific monitor |
| datadog_mute_monitor | Mute a monitor for a duration |
| datadog_unmute_monitor | Unmute a previously muted monitor |
| datadog_get_events | Fetch recent Datadog events |
| datadog_get_host_metrics | Get CPU/memory/disk metrics for a host |
| datadog_create_event | Post an event to Datadog |
| datadog_get_slo | Fetch SLO status and error budget |
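
As one example, a tool like datadog_query_metrics would call Datadog's public v1 metrics query endpoint (GET /api/v1/query, authenticated with the DD-API-KEY and DD-APPLICATION-KEY headers). A hedged sketch, assuming the plugin wraps it roughly like this — the helper names are illustrative, not the plugin's actual internals:

```typescript
// Build the v1 metrics query URL; `from`/`to` are Unix epoch seconds,
// and `site` matches the DATADOG_SITE setting (e.g. "datadoghq.com").
function buildMetricsQueryUrl(site: string, query: string, fromSec: number, toSec: number): string {
  const params = new URLSearchParams({
    query,
    from: String(fromSec),
    to: String(toSec),
  });
  return `https://api.${site}/api/v1/query?${params.toString()}`;
}

// Execute the query with the keys the plugin reads from its config.
async function queryMetrics(
  site: string,
  apiKey: string,
  appKey: string,
  query: string,
  fromSec: number,
  toSec: number,
): Promise<unknown> {
  const res = await fetch(buildMetricsQueryUrl(site, query, fromSec, toSec), {
    headers: {
      "DD-API-KEY": apiKey,
      "DD-APPLICATION-KEY": appKey,
    },
  });
  if (!res.ok) throw new Error(`Datadog query failed: ${res.status}`);
  return res.json();
}
```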

End-to-End Example: Automated Incident Response

  1. Datadog monitor triggers: “API latency > 500ms for 5 minutes”
  2. The plugin receives the webhook and creates a critical cue (the monitor's tags include pager)
  3. The planner generates an incident response plan:
    • Query recent metrics for the affected service
    • Check for recent deployments (via GitHub tools)
    • Gather host metrics
    • Synthesize findings into incident summary
    • Post to Slack #incidents channel
    • Create Datadog event documenting the investigation
  4. If a recent deploy correlates with the alert, the planner suggests a rollback (pending human approval)
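
The correlation check in the last step can be sketched as a simple time-window comparison. The helper name and the 30-minute window are assumptions for illustration, not part of the plugin's documented behavior:

```typescript
// Hypothetical correlation check: flag any deploy that landed within
// `windowMinutes` before the alert fired.
function correlatesWithDeploy(
  alertAtMs: number,
  deployTimesMs: number[],
  windowMinutes = 30,
): boolean {
  return deployTimesMs.some(
    (d) => d <= alertAtMs && alertAtMs - d <= windowMinutes * 60_000,
  );
}
```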