Monitoring

System, app & DB metrics + alerts

The monitoring block collects operational metrics — host resources, application traffic, database health — persists a rolling history, and raises alerts when thresholds are breached. It exposes everything over HTTP, including a live SSE stream you can wire straight into a dashboard.

Each collector is independently toggleable, so you pay only for what you watch. Alerts add email fan-out with a cooldown so a flapping metric doesn't bury your inbox.

Collectors

Four metric sources, each switched on independently with its own sub-metric checklist. collectInterval controls how often samples are taken.

config.nucleus.json — monitoring
{
  "monitoring": {
    "enabled": true,
    "system": {
      "enabled": true,
      "collectInterval": "30s",
      "metrics": { "cpu": true, "memory": true, "disk": true, "process": true }
    },
    "application": {
      "enabled": true,
      "metrics": { "requests": true, "responseTime": true, "errors": true, "rateLimits": true }
    },
    "database": {
      "enabled": true,
      "metrics": { "connections": true, "queryTime": true, "slowQueryThreshold": "500ms" }
    },
    "redis": { "enabled": true },
    "persistence": { "enabled": true, "flushInterval": "60s", "retentionDays": 30 },
    "alerts": {
      "enabled": true,
      "email": { "enabled": true, "recipients": ["[email protected]"] },
      "thresholds": { "cpuPercent": 80, "memoryPercent": 85, "errorRatePercent": 5, "responseTimeMs": 1000 },
      "cooldown": "10m"
    },
    "endpoints": {
      "enabled": true,
      "basePath": "/monitoring",
      "stream": { "enabled": true, "path": "/stream", "interval": "5s" },
      "history": { "enabled": true, "path": "/history", "maxMinutes": 1440 }
    }
  }
}
enabled (boolean) · Optional

Master switch for the monitoring subsystem.

Default: false
system (object) · Optional

Host metrics. collectInterval sets the sampling cadence (e.g. 30s); metrics toggles cpu, memory, disk, network and process.

application (object) · Optional

In-process metrics. metrics toggles requests, responseTime, errors and rateLimits — your real-time picture of API behaviour.

database (object) · Optional

Postgres health. metrics toggles connections and queryTime, with slowQueryThreshold marking the boundary for the slow-query count.

redis ({ enabled?: boolean }) · Optional

Toggle collection of Redis health metrics.
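The config writes intervals and thresholds as short duration strings (30s, 60s, 10m, 500ms). This page does not spell out the accepted grammar, so the sketch below assumes an integer followed by ms, s, m or h. parseDuration is a hypothetical helper for illustration, not part of the documented API.

```typescript
// Hypothetical helper: converts strings such as "30s" or "500ms" to milliseconds.
// Assumption: the format is <integer><unit>, with unit one of ms | s | m | h.
const UNIT_MS: Record<string, number> = { ms: 1, s: 1000, m: 60000, h: 3600000 };

function parseDuration(input: string): number {
  const match = /^(\d+)(ms|s|m|h)$/.exec(input.trim());
  if (!match) {
    throw new Error(`Unrecognized duration: "${input}"`);
  }
  const amount = match[1];
  const unit = match[2];
  return Number(amount) * UNIT_MS[unit];
}

console.log(parseDuration("30s"));   // 30000
console.log(parseDuration("500ms")); // 500
```

Rejecting unknown input with an error, rather than silently falling back to a default, makes a typo in the config visible at startup.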

Persistence

Metrics are buffered in memory and periodically flushed to the monitoring_metrics system table (generated only when monitoring + persistence are enabled) so you keep history across restarts. The table is exposed read-only through the standard entity routes — GET only, mutating methods are not generated. Tune the flush cadence and how long history is retained.

persistence (object) · Optional

History storage configuration.

enabled (boolean) · Optional

Persist collected metrics to the database.

flushInterval (string) · Optional

How often buffered metrics are written (e.g. 60s).

retentionDays (number) · Optional

How many days of history to keep before pruning.

Alerts

Fire notifications when a metric crosses a threshold. Email recipients receive the alert, and a cooldown prevents alert storms from a metric hovering at the boundary.

alerts (object) · Optional

Threshold-based alerting.

email ({ enabled?; recipients?: string[] }) · Optional

Email delivery of alerts to a recipient list.

thresholds (object) · Optional

The trip points: cpuPercent, memoryPercent, diskPercent, errorRatePercent, responseTimeMs and rateLimitBlocksPerMinute.

cooldown (string) · Optional

Minimum gap between repeat alerts for the same condition.

Endpoints

Monitoring is exposed over HTTP under a base path, with a live stream, a point-in-time snapshot, a history query and an alerts feed.

endpoints (object) · Optional

HTTP surface for metrics.

basePath (string) · Optional

Prefix for all monitoring routes.

stream ({ path?; interval? }) · Optional

Server-Sent Events live feed; interval sets push frequency.

snapshot ({ path? }) · Optional

Current metric values in a single response.

history ({ path?; maxMinutes? }) · Optional

Query persisted history up to maxMinutes back.

alerts ({ path? }) · Optional

Read recent and active alerts.
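To illustrate what the live stream delivers on the wire, here is a minimal sketch of a Server-Sent Events frame parser. It assumes the stream emits standard text/event-stream frames; the payload shape is not documented on this page, so the parser returns the raw data strings and leaves decoding to the caller. parseSseFrames is a hypothetical name.

```typescript
// Minimal SSE frame parser (illustrative sketch, not the framework's code).
// Assumption: frames follow the standard format, i.e. "data:" lines
// terminated by a blank line.
function parseSseFrames(buffer: string): { payloads: string[]; rest: string } {
  const frames = buffer.split(/\r?\n\r?\n/);
  // The last element is an incomplete frame (or "") to keep for the next chunk.
  const rest = frames.pop() ?? "";
  const payloads: string[] = [];
  for (const frame of frames) {
    const dataLines = frame
      .split(/\r?\n/)
      .filter((line) => line.startsWith("data:"))
      .map((line) => line.slice(5).replace(/^ /, ""));
    // Multiple data lines in one frame are joined with newlines, per the SSE spec.
    if (dataLines.length > 0) payloads.push(dataLines.join("\n"));
  }
  return { payloads, rest };
}

// In a browser, EventSource does this parsing for you. With the example
// config above (basePath "/monitoring", stream path "/stream"):
//   const source = new EventSource("/monitoring/stream");
//   source.onmessage = (e) => console.log(e.data);
```

A hand-rolled parser like this is mainly useful outside the browser, or when the stream is read through fetch so that custom headers can be attached.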

From the frontend

The client exposes type-safe actions for the monitoring surface, so a status page or admin settings panel needs no bespoke fetch. The live SSE feed is read straight from the stream path.

MONITORING_HEALTH_CHECK (GET · /monitoring/health)

A simple status + timestamp probe for an uptime badge.

MONITORING_GET_LOGS (GET · /monitoring/logs)

Pull recent metric history to render charts.

MONITORING_GET_SETTINGS / MONITORING_CHANGE_SETTINGS (GET|PATCH · /monitoring/settings)

Read or tune which feeds are captured and at what cadence, from an admin panel.
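The action names above map onto plain HTTP calls. The generated client's call signature is not shown on this page, so the sketch below only mirrors the listed method and path pairs in a lookup table. describeRequest, the origin argument and the settings payload shape are all assumptions made for illustration.

```typescript
// Assumption: this table only restates the method/path pairs listed above.
// It is not the generated client.
type HttpMethod = "GET" | "PATCH";
type MonitoringAction =
  | "MONITORING_HEALTH_CHECK"
  | "MONITORING_GET_LOGS"
  | "MONITORING_GET_SETTINGS"
  | "MONITORING_CHANGE_SETTINGS";

const MONITORING_ROUTES: Record<MonitoringAction, { method: HttpMethod; path: string }> = {
  MONITORING_HEALTH_CHECK: { method: "GET", path: "/monitoring/health" },
  MONITORING_GET_LOGS: { method: "GET", path: "/monitoring/logs" },
  MONITORING_GET_SETTINGS: { method: "GET", path: "/monitoring/settings" },
  MONITORING_CHANGE_SETTINGS: { method: "PATCH", path: "/monitoring/settings" },
};

// Hypothetical helper: resolves an action to the URL, method and body
// that a fetch call would need.
function describeRequest(action: MonitoringAction, origin: string, payload?: unknown) {
  const route = MONITORING_ROUTES[action];
  return {
    url: origin.replace(/\/$/, "") + route.path,
    method: route.method,
    // Only the PATCH action carries a body; its shape is an assumption.
    body: route.method === "PATCH" ? JSON.stringify(payload ?? {}) : undefined,
  };
}

console.log(describeRequest("MONITORING_HEALTH_CHECK", "https://api.example.com"));
```

Keying the table by a union type means a misspelled action name fails at compile time, which is the same benefit the type-safe client is described as providing.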

Under the hood — the collect/flush loop

MonitoringService runs two timers: one collects a snapshot on collectInterval, the other flushes accumulated metric points to the database on flushInterval. Redis holds the hot data; the database holds history.

collectors (system · application · database · redis) · Optional

SystemCollector reads CPU, memory, disk, network and process metrics (network via /proc/net/dev on Linux). ApplicationCollector tallies requests, response time, errors and rate-limit blocks, fed by recordRequest / recordRateLimitBlock from the middleware. DatabaseCollector queries pg_stat_activity and pg_stat_database for connections, cache hit ratio and query timings (average query time on PG14+). RedisCollector parses INFO for memory, clients and ops; it runs in direct connection mode only and is skipped under Dapr. Each enabled metric is flattened into dotted MetricPoints (e.g. system.memory.percent).
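The flattening step can be sketched as a small recursive walk. The fields of a MetricPoint are not shown on this page, so the name/value shape below is an assumption, as are the MetricTree type and the flattenMetrics function name.

```typescript
// Assumed shapes for illustration; the real MetricPoint may carry more fields.
interface MetricPoint { name: string; value: number }
type MetricTree = { [key: string]: number | MetricTree };

// Walks a nested snapshot and emits one point per numeric leaf,
// joining the path segments with dots.
function flattenMetrics(tree: MetricTree, prefix = ""): MetricPoint[] {
  const points: MetricPoint[] = [];
  for (const key of Object.keys(tree)) {
    const value = tree[key];
    const dotted = prefix ? `${prefix}.${key}` : key;
    if (typeof value === "number") {
      points.push({ name: dotted, value });
    } else {
      points.push(...flattenMetrics(value, dotted));
    }
  }
  return points;
}

const points = flattenMetrics({ system: { memory: { percent: 71.5 }, cpu: { percent: 12 } } });
console.log(points.map((p) => p.name)); // [ 'system.memory.percent', 'system.cpu.percent' ]
```

Flat dotted names are convenient for storage because each point becomes one row with a single name column, regardless of how deeply the collector nested it.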

hot data in Redis (latest + 1h history) · Optional

Every snapshot is written to monitoring:{appId}:latest (1h TTL) and appended to a monitoring:{appId}:history list trimmed to the last hour — so a dashboard or the GET_LOGS action gets instant recent data without touching the database.
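As a rough model of the hot-data layout, the sketch below builds the two key names quoted above and trims an in-memory array standing in for the Redis list. It assumes trimming is time-based; the real implementation may instead cap the list by length. StoredSnapshot, redisKeys and appendAndTrim are hypothetical names.

```typescript
// "Last hour", per the description above.
const HISTORY_WINDOW_MS = 60 * 60 * 1000;

// Key names as quoted in the text, with {appId} substituted.
function redisKeys(appId: string): { latest: string; historyList: string } {
  return {
    latest: `monitoring:${appId}:latest`,
    historyList: `monitoring:${appId}:history`,
  };
}

// Assumed snapshot shape for illustration only.
interface StoredSnapshot { takenAt: number; metrics: Record<string, number> }

// Appends a snapshot, then drops entries older than the one-hour window.
// A plain array stands in for the Redis list here.
function appendAndTrim(list: StoredSnapshot[], next: StoredSnapshot): StoredSnapshot[] {
  const cutoff = next.takenAt - HISTORY_WINDOW_MS;
  return [...list, next].filter((s) => s.takenAt > cutoff);
}

console.log(redisKeys("shop").latest); // monitoring:shop:latest
```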

persistence flush (flushToDb + retentionDays) · Optional

Collected MetricPoints buffer in memory and are flushed in batches into the monitoring_metrics table on flushInterval (default 1m); a failed flush re-queues them. A daily prune deletes rows older than retentionDays.
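The buffer, flush and re-queue behaviour described above can be sketched as follows. The database write is injected as a callback so the example stays self-contained; MetricBuffer, BufferedPoint and retentionCutoff are hypothetical names, and the actual batching and pruning queries are not shown on this page.

```typescript
// Assumed point shape for illustration.
interface BufferedPoint { name: string; value: number; at: number }

class MetricBuffer {
  private pending: BufferedPoint[] = [];

  // `write` stands in for the insert into monitoring_metrics.
  constructor(private readonly write: (batch: BufferedPoint[]) => Promise<void>) {}

  add(point: BufferedPoint): void {
    this.pending.push(point);
  }

  size(): number {
    return this.pending.length;
  }

  // Returns true if the batch was written, false if it was re-queued.
  async flush(): Promise<boolean> {
    if (this.pending.length === 0) return true;
    const batch = this.pending;
    this.pending = [];
    try {
      await this.write(batch);
      return true;
    } catch (err) {
      // Failed flush: put the batch back ahead of anything added meanwhile.
      this.pending = [...batch, ...this.pending];
      return false;
    }
  }
}

// Rows with a timestamp before this instant are candidates for the daily prune.
function retentionCutoff(now: Date, retentionDays: number): Date {
  return new Date(now.getTime() - retentionDays * 24 * 60 * 60 * 1000);
}
```

Swapping the pending array out before awaiting the write means points collected during a slow flush are never part of the in-flight batch, so a retry cannot write them twice.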

alerts (thresholds + cooldown) · Optional

After each collect, AlertService compares the snapshot against thresholds (cpu 80%, memory 85%, disk 90%, error-rate 5%, response-time 1000ms, rate-limit blocks 100/min by default). A breach raises an alert and, if configured, emails the recipients — then a cooldown (default 5m) suppresses duplicates until acknowledged.
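A minimal sketch of the threshold check with a per-condition cooldown, using the default values listed above. It is not the AlertService itself: AlertGate is a hypothetical class, the snapshot is assumed to reuse the threshold key names, a breach is assumed to mean strictly greater than the threshold, and acknowledgement is not modelled.

```typescript
interface Thresholds {
  cpuPercent: number;
  memoryPercent: number;
  diskPercent: number;
  errorRatePercent: number;
  responseTimeMs: number;
  rateLimitBlocksPerMinute: number;
}

// Defaults as listed in the text above.
const DEFAULT_THRESHOLDS: Thresholds = {
  cpuPercent: 80,
  memoryPercent: 85,
  diskPercent: 90,
  errorRatePercent: 5,
  responseTimeMs: 1000,
  rateLimitBlocksPerMinute: 100,
};
const DEFAULT_COOLDOWN_MS = 5 * 60 * 1000;

class AlertGate {
  private lastFired = new Map<keyof Thresholds, number>();

  constructor(
    private thresholds: Thresholds = DEFAULT_THRESHOLDS,
    private cooldownMs: number = DEFAULT_COOLDOWN_MS,
  ) {}

  // Returns the conditions that should raise an alert for this snapshot.
  check(snapshot: Partial<Thresholds>, now: number): (keyof Thresholds)[] {
    const fired: (keyof Thresholds)[] = [];
    for (const key of Object.keys(this.thresholds) as (keyof Thresholds)[]) {
      const value = snapshot[key];
      if (value === undefined || value <= this.thresholds[key]) continue;
      const last = this.lastFired.get(key);
      // Still inside the cooldown window for this condition: stay quiet.
      if (last !== undefined && now - last < this.cooldownMs) continue;
      this.lastFired.set(key, now);
      fired.push(key);
    }
    return fired;
  }
}

const gate = new AlertGate();
console.log(gate.check({ cpuPercent: 92, memoryPercent: 40 }, 0)); // [ 'cpuPercent' ]
console.log(gate.check({ cpuPercent: 95 }, 60000));                // []
```

Tracking the last-fired time per condition lets a CPU alert stay quiet during its cooldown without masking a new memory alert that trips in the same window.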
