# Datadog Plugin for Savant

A standalone A2A agent that bridges Datadog monitoring alerts and metrics into the Savant mesh and exposes Datadog operations as tools for automated incident response.

## What It Does

* **Listens** to Datadog monitor webhooks (triggered by `@webhook-savant` in monitor notifications) and polls the Monitors and Events APIs as a fallback
* **Transforms** monitor state transitions into structured Savant cues with priority routing
* **Exposes** 9 Datadog tools (get alerts, query metrics, get/mute/unmute monitors, get events, get host metrics, create events, get SLOs) callable by any agent on the mesh

## Architecture

```mermaid theme={null}
flowchart LR
	D["Datadog<br/>(external)"] -->|webhook POST| P["Datadog Plugin<br/>(A2A Agent)"] --> M["Svantic Mesh"]
```

## Deployment

```bash theme={null}
DATADOG_API_KEY=xxxxxxxx \
DATADOG_APP_KEY=xxxxxxxx \
DATADOG_SITE=datadoghq.com \
DATADOG_WEBHOOK_TOKEN=whtoken_xxxxxxxx \
SAVANT_MESH_URL=http://savant-mesh:3000 \
SAVANT_CLIENT_ID=your-client-id \
SAVANT_CLIENT_SECRET=your-client-secret \
node dist/index.js
```
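A fail-fast check at startup can catch a missing variable before the agent tries to reach Datadog or the mesh. The following is a minimal sketch, not the plugin's actual startup code: the variable names are taken from the command above, but treating all seven as required is an assumption, and `missingEnvVars` is a hypothetical helper.

```typescript
// Hypothetical startup check. The names come from the deployment command;
// whether every one of them is mandatory is an assumption.
const REQUIRED_ENV_VARS = [
    "DATADOG_API_KEY",
    "DATADOG_APP_KEY",
    "DATADOG_SITE",
    "DATADOG_WEBHOOK_TOKEN",
    "SAVANT_MESH_URL",
    "SAVANT_CLIENT_ID",
    "SAVANT_CLIENT_SECRET",
] as const;

// Returns the names of required variables that are unset or empty.
function missingEnvVars(env: Record<string, string | undefined>): string[] {
    return REQUIRED_ENV_VARS.filter((name) => !env[name]);
}
```

In a Node.js entry point this could be called as `missingEnvVars(process.env)`, exiting with an error message if the returned list is non-empty.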

## settings.json

```json theme={null}
{
    "plugin": "savant-datadog-plugin",
    "version": "1.0.0",
    "agent": {
        "name": "savant-datadog-plugin",
        "description": "Bridges Datadog alerts into the Savant mesh for automated incident response.",
        "port": 4200,
        "agent_type": "datadog-plugin"
    },
    "mesh": {
        "url": "http://localhost:3000",
        "tenant_id": "acme",
        "tenant_secret": "${SAVANT_TENANT_SECRET}"
    },
    "datadog": {
        "api_key": "${DATADOG_API_KEY}",
        "app_key": "${DATADOG_APP_KEY}",
        "site": "datadoghq.com",
        "webhook_token": "${DATADOG_WEBHOOK_TOKEN}"
    },
    "event_ingestion": {
        "webhook_path": "/webhooks/datadog",
        "polling_enabled": true,
        "monitor_polling_interval_seconds": 30,
        "event_polling_interval_seconds": 120
    },
    "routing_rules": [
        {
            "match": { "monitor_status": "Alert", "tags_include": ["pager"] },
            "action": { "auto_execute": true, "cue_priority": "critical" }
        },
        {
            "match": { "monitor_status": "Warn" },
            "action": { "auto_execute": false, "cue_priority": "warning" }
        },
        {
            "match": { "event_type": "monitor.recovered" },
            "action": { "log_only": false, "cue_priority": "info" }
        }
    ],
    "deduplication": {
        "enabled": true,
        "window_seconds": 300,
        "key_fields": ["monitor_id", "monitor_status"]
    },
    "tools": {
        "enabled": [
            "datadog_get_alerts",
            "datadog_query_metrics",
            "datadog_get_monitor",
            "datadog_mute_monitor",
            "datadog_unmute_monitor",
            "datadog_get_events",
            "datadog_get_host_metrics",
            "datadog_create_event",
            "datadog_get_slo"
        ]
    }
}
```
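To make the `routing_rules` and `deduplication` sections more concrete, here is a minimal sketch of how an incoming event might be evaluated against them. It is an illustration under stated assumptions, not the plugin's implementation: the `MonitorEvent` shape, first-match-wins rule ordering, "all specified conditions must hold" matching, and a window measured from the first accepted event are all assumptions, and every function and type name is hypothetical.

```typescript
// Hypothetical types mirroring the settings.json fields shown above.
interface RuleMatch {
    monitor_status?: string;
    tags_include?: string[];
    event_type?: string;
}

interface RuleAction {
    auto_execute?: boolean;
    log_only?: boolean;
    cue_priority: string;
}

interface RoutingRule {
    match: RuleMatch;
    action: RuleAction;
}

// Assumed shape of a normalized incoming event; not taken from the plugin.
interface MonitorEvent {
    monitor_id: number;
    monitor_status: string;
    event_type?: string;
    tags: string[];
}

// Assumption: a rule matches when every condition it specifies holds.
function ruleMatches(match: RuleMatch, event: MonitorEvent): boolean {
    if (match.monitor_status !== undefined && match.monitor_status !== event.monitor_status) return false;
    if (match.event_type !== undefined && match.event_type !== event.event_type) return false;
    if (match.tags_include !== undefined && !match.tags_include.every((t) => event.tags.includes(t))) return false;
    return true;
}

// Assumption: rules are evaluated in order and the first match wins.
function routeEvent(rules: RoutingRule[], event: MonitorEvent): RuleAction | undefined {
    return rules.find((rule) => ruleMatches(rule.match, event))?.action;
}

// Builds a deduplication key from the configured key_fields.
function dedupKey(event: MonitorEvent, keyFields: (keyof MonitorEvent)[]): string {
    return keyFields.map((field) => String(event[field])).join("|");
}

// Remembers when each key was last accepted. A repeat inside the window is
// reported as a duplicate and does not restart the window.
class Deduplicator {
    private lastSeen = new Map<string, number>();
    private windowMs: number;

    constructor(windowSeconds: number) {
        this.windowMs = windowSeconds * 1000;
    }

    isDuplicate(key: string, nowMs: number): boolean {
        const previous = this.lastSeen.get(key);
        if (previous !== undefined && nowMs - previous < this.windowMs) return true;
        this.lastSeen.set(key, nowMs);
        return false;
    }
}
```

With the configuration above, an event in the `Alert` state tagged `pager` would route to the first rule (`cue_priority: "critical"`), and a second event with the same `monitor_id` and `monitor_status` arriving within 300 seconds would be dropped as a duplicate.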

## Tools

| Tool                       | Description                        |
| -------------------------- | ---------------------------------- |
| `datadog_get_alerts`       | Fetch monitors in alert/warn state |
| `datadog_query_metrics`    | Query time-series metrics          |
| `datadog_get_monitor`      | Get specific monitor details       |
| `datadog_mute_monitor`     | Mute a monitor for a duration      |
| `datadog_unmute_monitor`   | Unmute a previously muted monitor  |
| `datadog_get_events`       | Fetch recent Datadog events        |
| `datadog_get_host_metrics` | Get CPU/memory/disk for a host     |
| `datadog_create_event`     | Post an event to Datadog           |
| `datadog_get_slo`          | Fetch SLO status and error budget  |

## End-to-End Example: Automated Incident Response

1. A Datadog monitor triggers: "API latency > 500ms for 5 minutes"
2. The plugin receives the webhook and creates a critical cue (the monitor's tags include `pager`)
3. The planner generates an incident response plan:
   * Query recent metrics for the affected service
   * Check for recent deployments (via GitHub tools)
   * Gather host metrics
   * Synthesize findings into incident summary
   * Post to Slack #incidents channel
   * Create Datadog event documenting the investigation
4. If a recent deployment correlates with the alert, the plan suggests a rollback (pending human approval)
