corrosion-admin-panel/corrosion-host-agent/PROTOCOL.md
Vantz Stockwell cea3d66cdd
feat(host-agent): Rust rewrite Phase 0 — multi-instance foundation, v2 wire protocol, real telemetry
New corrosion-host-agent/ crate (Go companion-agent stays as behavior
reference until parity). Wire protocol v2 per COA-B: instance-scoped
subjects corrosion.{license}.{instance}.* + host-level .host.* — spec
in PROTOCOL.md, designed for the license->host->instance fleet model.

- Multi-instance TOML config in the foundation, not retrofitted
- NATS layer on the Vigilance production profile (infinite reconnect,
  capped backoff, 30s ping, 8192-msg offline buffer)
- Heartbeat with real sysinfo telemetry — Go agent shipped hardcoded
  disk/cpu placeholders; this is the panel's first true Resources data
- Connectivity prober (outbound TCP, periodic + on-demand)
- Host cmd channel (ping/probe/sysinfo), going-offline beacon,
  CancellationToken shutdown
- Live-fire verified against production NATS; artifacts: 3.7MB static
  linux-musl, 3.8MB windows .exe (static CRT)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 10:02:46 -04:00

Corrosion Wire Protocol v2

Status: Phase 0 implemented (host heartbeat, host commands, going-offline beacon). Per-instance command/status subjects are reserved and specified here for Phase 1.

Design

One host agent per machine supervises N game instances. Subjects are scoped license-first, then by addressee:

corrosion.{license_id}.host.*           host-level (the agent itself)
corrosion.{license_id}.{instance_id}.*  instance-level (one game server)

instance_id is a config-defined slug ([a-z0-9_-]{1,64}), validated at agent start. host is a reserved segment and can never be an instance id. Payloads are JSON. Every heartbeat carries "schema": 2 so consumers can distinguish v2 from the legacy Go companion protocol (which used corrosion.{license_id}.companion.heartbeat, no schema field).
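The slug rule and the two subject shapes above can be sketched with the standard library alone. The function names (is_valid_instance_id, host_subject, instance_subject) are hypothetical and are not taken from the agent's source; only the pattern [a-z0-9_-]{1,64}, the reserved host segment, and the subject layout come from this spec.

```rust
/// Returns true if `id` matches [a-z0-9_-]{1,64} and is not the reserved
/// segment "host". (Hypothetical helper; the name is an assumption.)
pub fn is_valid_instance_id(id: &str) -> bool {
    // Byte length equals character length here because only ASCII is allowed.
    let len_ok = (1..=64).contains(&id.len());
    let chars_ok = id
        .bytes()
        .all(|b| b.is_ascii_lowercase() || b.is_ascii_digit() || b == b'_' || b == b'-');
    len_ok && chars_ok && id != "host"
}

/// Builds a host-level subject: corrosion.{license_id}.host.{suffix}
pub fn host_subject(license_id: &str, suffix: &str) -> String {
    format!("corrosion.{license_id}.host.{suffix}")
}

/// Builds an instance-level subject, or None if the instance id is invalid.
pub fn instance_subject(license_id: &str, instance_id: &str, suffix: &str) -> Option<String> {
    if is_valid_instance_id(instance_id) {
        Some(format!("corrosion.{license_id}.{instance_id}.{suffix}"))
    } else {
        None
    }
}
```

Rejecting host in the validator is what keeps the two subject families from colliding: an instance named host would otherwise publish on the agent's own subjects.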

Host-level subjects (Phase 0 — live)

corrosion.{license_id}.host.heartbeat (agent → backend, publish)

Published every heartbeat_seconds (default 60, jittered ±20%).
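One way to read "jittered ±20%" is a uniform draw from [0.8 × base, 1.2 × base]. The sketch below assumes that interpretation; the agent's actual distribution is not specified here. To stay dependency-free it takes the random sample as a parameter (unit, expected in [0, 1)); the function name jittered_interval is hypothetical.

```rust
use std::time::Duration;

/// Maps a uniform sample `unit` in [0, 1) onto
/// [base * (1 - jitter), base * (1 + jitter)].
/// Assumes 0.0 <= jitter <= 1.0 so the result is never negative.
pub fn jittered_interval(base_secs: u64, jitter: f64, unit: f64) -> Duration {
    // Clamp defensively so a bad sample cannot push the interval out of range.
    let unit = unit.clamp(0.0, 1.0);
    let factor = 1.0 - jitter + 2.0 * jitter * unit;
    Duration::from_secs_f64(base_secs as f64 * factor)
}
```

With the default of 60 seconds and a jitter of 0.2, every interval falls between roughly 48 and 72 seconds, which spreads heartbeats from many hosts so they do not arrive in lockstep.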

{
  "schema": 2,
  "timestamp": "2026-06-11T18:00:00Z",
  "agent": {
    "version": "2.0.0-alpha.1",
    "commit": "a8722a7",
    "os": "linux",
    "arch": "x86_64",
    "uptime_seconds": 86400
  },
  "host": {
    "hostname": "asgard-01",
    "cpu_percent": 12.5,
    "cpu_cores": 80,
    "mem_total_mb": 262144,
    "mem_used_mb": 81920,
    "uptime_seconds": 1209600,
    "disks": [
      { "mount": "/", "total_mb": 1907729, "free_mb": 1532211 }
    ]
  },
  "instances": [
    {
      "id": "rust-main",
      "game": "rust",
      "label": "Main 2x Vanilla",
      "state": "configured",
      "root_disk_free_mb": 1532211
    }
  ],
  "probe": {
    "timestamp": "2026-06-11T17:58:00Z",
    "results": [
      { "name": "corrosion-cdn", "host": "cdn.corrosionmgmt.com", "port": 443, "ok": true, "latency_ms": 18 }
    ]
  }
}

All telemetry is measured, never fabricated. Fields the agent cannot measure are omitted: probe is absent until the first probe completes, and hostname is absent if it cannot be determined.

Phase 0 instance state values: configured (root path exists), missing_root. Phase 1 adds live process states: running, stopped, crashed, starting, updating.
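The Phase 0 state rule reduces to a filesystem check. A minimal sketch, assuming "root path exists" means Path::exists on the configured instance root; the function name phase0_state is hypothetical, and the real agent may apply stricter checks (for example, requiring a directory).

```rust
use std::path::Path;

/// Maps an instance root to its Phase 0 state string.
/// "configured" if the path exists, otherwise "missing_root".
pub fn phase0_state(root: &Path) -> &'static str {
    if root.exists() {
        "configured"
    } else {
        "missing_root"
    }
}
```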

corrosion.{license_id}.host.cmd (backend → agent, request-reply)

Request: { "func": "<name>" }. Reply: { "status": "success" | "error", ... }.

func     Reply payload
ping     version, commit, uptime_seconds
probe    report (a fresh ProbeReport, also cached for heartbeat)
sysinfo  snapshot (the full heartbeat payload, collected on demand)

Unknown funcs return status: "error" with a message listing supported funcs.
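The dispatch rule for host.cmd can be sketched as a small parser from the func string to an enum. The type and function names (HostFunc, parse_func, SUPPORTED_FUNCS) and the exact error wording are assumptions; the spec only requires that the error message list the supported funcs.

```rust
/// The three Phase 0 host commands.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum HostFunc {
    Ping,
    Probe,
    Sysinfo,
}

/// Names accepted in the request's "func" field.
pub const SUPPORTED_FUNCS: [&str; 3] = ["ping", "probe", "sysinfo"];

/// Parses the "func" value. On failure, returns the message that would go
/// into a status: "error" reply (wording is illustrative only).
pub fn parse_func(name: &str) -> Result<HostFunc, String> {
    match name {
        "ping" => Ok(HostFunc::Ping),
        "probe" => Ok(HostFunc::Probe),
        "sysinfo" => Ok(HostFunc::Sysinfo),
        other => Err(format!(
            "unknown func '{other}'; supported funcs: {}",
            SUPPORTED_FUNCS.join(", ")
        )),
    }
}
```

Matching on an enum after parsing means a handler added later fails to compile until every variant is covered, which keeps the reply table and the code in step.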

corrosion.{license_id}.host.going_offline (agent → backend, publish)

Best-effort beacon (500ms budget) on graceful shutdown so the panel can flip the host to offline immediately instead of waiting out heartbeat staleness. Payload: {}.
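The "best-effort with a time budget" pattern can be illustrated with standard-library threads and a channel. This is only a stand-in: the agent presumably does this with an async timeout around the NATS publish, and the name best_effort and the closure-based interface are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `publish` on a helper thread and waits at most `budget` for it.
/// Returns true only if it finished in time and reported success.
pub fn best_effort<F>(publish: F, budget: Duration) -> bool
where
    F: FnOnce() -> bool + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may be gone if the budget already expired; ignore that.
        let _ = tx.send(publish());
    });
    // A timeout or a panicked publisher both count as "not delivered".
    rx.recv_timeout(budget).unwrap_or(false)
}
```

Shutdown proceeds whether or not the beacon was delivered; if it was not, the panel falls back to heartbeat staleness as described above.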

Instance-level subjects (Phase 1 — reserved, not yet implemented)

corrosion.{license_id}.{instance_id}.cmd (backend → agent, request-reply)

Lifecycle and control for one game instance. Planned funcs: start, stop, restart, status, rcon (process-class games), steam_update, oxide_install (rust), plus game-adapter-specific commands (Dune: docker lifecycle, RabbitMQ bus commands, Coriolis reset).

corrosion.{license_id}.{instance_id}.status (agent → backend, publish)

State-change events (started/stopped/crashed) so the panel does not wait for the next heartbeat.

corrosion.{license_id}.{instance_id}.console (agent → backend, publish)

Live console/log lines for the panel console view.

corrosion.{license_id}.{instance_id}.files.cmd (backend → agent, request-reply)

VueFinder-style file manager ops, jailed to the instance root. Carries over the Go agent's jailed filemanager semantics (fm_list, fm_save, ...); the legacy UNJAILED files.get/put/delete/list API is retired and will not be ported.

Backend mapping notes (Phase 0)

  • The NestJS NATS bridge subscribes to corrosion.*.host.heartbeat and corrosion.*.host.going_offline.
  • Until the license→host→instance schema lands, the backend may map the host heartbeat onto the existing single server_connections row per license: companion_last_seen ← heartbeat arrival, connection_status ← connected/offline, resources ← host.cpu_percent / mem_* / first disk. Instance-level mapping activates with the fleet schema.

Probing — scope honesty

The Phase 0 prober measures outbound reachability from the host (TCP connect + latency). It cannot verify inbound port-forwarding (the thing players hit). Inbound verification requires a backend-side reverse probe service that attempts connections to the customer's public IP/ports on request; that is specified as a Phase 1+ feature and will reuse this report format with direction: "inbound".
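An outbound TCP probe of the kind described above can be sketched with std::net. The struct mirrors the fields shown in the heartbeat's probe.results entries; the names ProbeResult and probe_tcp, the timeout parameter, and the choice to leave latency_ms empty on failure are assumptions rather than the agent's confirmed behavior.

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::time::{Duration, Instant};

/// One entry of a probe report (field names follow the heartbeat example).
#[derive(Debug)]
pub struct ProbeResult {
    pub name: String,
    pub host: String,
    pub port: u16,
    pub ok: bool,
    pub latency_ms: Option<u64>,
}

/// Attempts one outbound TCP connection and records whether it succeeded.
/// The measured latency includes name resolution as well as the connect.
pub fn probe_tcp(name: &str, host: &str, port: u16, timeout: Duration) -> ProbeResult {
    let started = Instant::now();
    let ok = (host, port)
        .to_socket_addrs() // resolution failure counts as unreachable
        .ok()
        .and_then(|mut addrs| addrs.next()) // only the first address is tried
        .map(|addr| TcpStream::connect_timeout(&addr, timeout).is_ok())
        .unwrap_or(false);
    let latency_ms = if ok {
        Some(started.elapsed().as_millis() as u64)
    } else {
        None
    };
    ProbeResult {
        name: name.to_string(),
        host: host.to_string(),
        port,
        ok,
        latency_ms,
    }
}
```

A successful result here only shows that the host can reach the target; it says nothing about whether players can reach the host, which is the inbound case deferred to Phase 1+.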

Versioning

  • The agent embeds semver + git hash + build timestamp (--version, heartbeat agent block).
  • Schema changes bump schema and are additive where possible.