# Corrosion Wire Protocol v2

Status: **Phase 0 + Phase 1 process control implemented** (host heartbeat, host commands, going-offline beacon, per-instance start/stop/restart/status with push state events). RCON, SteamCMD, file ops, and game adapters are specified but not yet implemented.

## Design

One **host agent** per machine supervises **N game instances**. Subjects are scoped license-first, then by addressee:

```
corrosion.{license_id}.host.*            host-level (the agent itself)
corrosion.{license_id}.{instance_id}.*   instance-level (one game server)
```

`instance_id` is a config-defined slug (`[a-z0-9_-]{1,64}`), validated at agent start. `host` is a reserved segment and can never be an instance id.

Payloads are JSON. Every heartbeat carries `"schema": 2` so consumers can distinguish v2 from the legacy Go companion protocol (which used `corrosion.{license_id}.companion.heartbeat`, no schema field).

## Host-level subjects (Phase 0 — live)

### `corrosion.{license_id}.host.heartbeat` (agent → backend, publish)

Published every `heartbeat_seconds` (default 60, jittered ±20%).

```json
{
  "schema": 2,
  "timestamp": "2026-06-11T18:00:00Z",
  "agent": { "version": "2.0.0-alpha.1", "commit": "a8722a7", "os": "linux", "arch": "x86_64", "uptime_seconds": 86400 },
  "host": {
    "hostname": "asgard-01",
    "cpu_percent": 12.5,
    "cpu_cores": 80,
    "mem_total_mb": 262144,
    "mem_used_mb": 81920,
    "uptime_seconds": 1209600,
    "disks": [ { "mount": "/", "total_mb": 1907729, "free_mb": 1532211 } ]
  },
  "instances": [
    { "id": "rust-main", "game": "rust", "label": "Main 2x Vanilla", "state": "configured", "root_disk_free_mb": 1532211 }
  ],
  "probe": {
    "timestamp": "2026-06-11T17:58:00Z",
    "results": [ { "name": "corrosion-cdn", "host": "cdn.corrosionmgmt.com", "port": 443, "ok": true, "latency_ms": 18 } ]
  }
}
```

All telemetry is measured, never fabricated. Fields the agent cannot measure are omitted (`probe` before the first probe completes, `hostname` if unavailable).
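
The slug rule, the reserved `host` segment, and the "omitted when unmeasured" convention can be sketched from a consumer's point of view. This is a minimal illustration only, not the agent's or backend's actual code: the helper names `is_valid_instance_id` and `summarize_heartbeat`, and the shape of the returned summary, are assumptions made for this example.

```python
import json
import re
from typing import Optional

# Slug pattern quoted from the Design section; "host" is reserved.
INSTANCE_ID_RE = re.compile(r"[a-z0-9_-]{1,64}")
RESERVED_SEGMENTS = {"host"}


def is_valid_instance_id(instance_id: str) -> bool:
    """Hypothetical helper: check an instance id against the documented slug rule."""
    if instance_id in RESERVED_SEGMENTS:
        return False
    return INSTANCE_ID_RE.fullmatch(instance_id) is not None


def summarize_heartbeat(raw: bytes) -> Optional[dict]:
    """Hypothetical helper: return a small summary, or None if this is not a v2 heartbeat."""
    payload = json.loads(raw)
    # Legacy companion heartbeats carry no schema field, so they are skipped here.
    if not isinstance(payload, dict) or payload.get("schema") != 2:
        return None
    host = payload.get("host", {})
    probe = payload.get("probe")  # omitted until the first probe completes
    return {
        "hostname": host.get("hostname"),  # omitted if unavailable
        "instance_states": {i["id"]: i["state"] for i in payload.get("instances", [])},
        # None means "no probe data yet", which is different from a failed probe.
        "probe_ok": None if probe is None else all(r["ok"] for r in probe.get("results", [])),
    }
```

Reading optional fields with `.get` keeps the distinction between "not measured" and "measured as zero or false", which the omission rule appears intended to preserve.
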

Instance `state` values — process-managed (an `executable` is configured): `running`, `stopped`, `starting`, `stopping`, `crashed`; unmanaged (telemetry-only): `configured` (root exists), `missing_root`. Each instance also reports `uptime_seconds` (0 unless running).

### `corrosion.{license_id}.host.cmd` (backend → agent, request-reply)

Request: `{ "func": "" }`. Reply: `{ "status": "success" | "error", ... }`.

| func      | Reply payload                                            |
| --------- | -------------------------------------------------------- |
| `ping`    | `version`, `commit`, `uptime_seconds`                    |
| `probe`   | `report` — fresh ProbeReport (also cached for heartbeat) |
| `sysinfo` | `snapshot` — full heartbeat payload, collected on demand |
| `update`  | `{ "func": "update", "url": "https://cdn.corrosionmgmt.com/host-agent/.../corrosion-host-agent-" }` → downloads the binary + `.minisig`, verifies the minisign signature against the agent's EMBEDDED public key, atomically swaps (with `.old` rollback), replies `{ status: success, message: "...relaunching" }`, then relaunches the new binary. Rejects anything not signed by the release key and any URL that isn't `https://cdn.corrosionmgmt.com`. |

Unknown funcs return `status: "error"` with a message listing supported funcs.

### `corrosion.{license_id}.host.going_offline` (agent → backend, publish)

Best-effort beacon (500ms budget) on graceful shutdown so the panel can flip the host to offline immediately instead of waiting out heartbeat staleness. Payload: `{}`.

## Instance-level subjects

### `corrosion.{license_id}.{instance_id}.cmd` (backend → agent, request-reply) — LIVE

Lifecycle and control for one game instance. The same `start`/`stop`/`restart`/`status` funcs work for **every** game: the agent picks a `Supervisor` impl per game — a spawned-process supervisor for Rust/Conan/Soulmask, a **docker-compose supervisor for Dune** (`docker compose up -d` / `stop` / `restart` against the instance's compose project, configured via `[instance.docker_compose]`).
The wire contract is identical; only the management model behind it differs.

Implemented funcs: `start`, `stop` (graceful with 30s budget, then force kill — process supervisor; Dune maps stop to `docker compose stop`), `restart`, `status` (returns `state` + `uptime_seconds`), and `rcon` — `{ "func": "rcon", "command": "" }` returns `{ "status": "success", "output": }`. Protocol per game: WebRCON (WebSocket JSON) for rust, Source RCON (Valve TCP) for conan/soulmask; explicit `kind` override available in the instance's `[instance.rcon]` config. Always targets 127.0.0.1 (agent is co-located). Errors reply `{ "status": "error", "message": ... }` — including start on an unmanaged instance, double start, missing rcon config, and unknown funcs.

Also implemented: `steam_update` — `{ "func": "steam_update" }` runs SteamCMD for the instance's game (app ids: rust 258550, conan 443030, soulmask 3017310/3017300; dune rejects — Docker images, no SteamCMD), streaming progress lines to `corrosion.{license_id}.{instance_id}.steam_status` and replying on completion.

Planned funcs: `oxide_install` (rust), plus game-adapter-specific commands (Dune: RabbitMQ admin-bus commands, Coriolis reset, Postgres admin surface). Dune **lifecycle** is already covered by the shared start/stop/restart funcs above; container crash-detection and state adoption on agent restart land with Phase 3b.

### `corrosion.{license_id}.{instance_id}.steam_status` (agent → backend, publish) — LIVE

Per-line SteamCMD stdout during a `steam_update`, so the panel can show live update progress. Payload: `{ "timestamp", "instance_id", "line" }`.

### `corrosion.{license_id}.{instance_id}.files.cmd` (backend → agent, request-reply) — LIVE

Jailed file manager, confined to the instance `root` (two-stage check: lexical normalize + canonicalize, defeating `../` traversal and symlink escape). Request `{ "op": "list|read|write|delete|rename|mkdir|mkfile|move|copy", "path": "rel/path", "dest"?, "content"?, "name"? }`; reply `{ "status": "success", "data": ... }` or `{ "status": "error", "message": ... }`. `read` caps at 5 MiB. Replaces the Go agent's UNJAILED legacy files API, which is retired and will not be ported.

### `corrosion.{license_id}.{instance_id}.status` (agent → backend, publish) — LIVE

State-change events so the panel does not wait for the next heartbeat. Payload: `{ "timestamp", "instance_id", "event": { "state": ..., "exit_code"? } }`.

Semantics: **keep-latest state sync**, not a lossless transition ledger — near-instant transient states (e.g. `starting` when spawn succeeds immediately) may coalesce into the following state. Consumers should treat each event as "current state is now X".

Known Phase 1 limitation: the supervisor does not yet persist/adopt PIDs — if the agent itself restarts while a game server is running, the game process survives but reports `stopped` until restarted through the panel. PID adoption is queued with the service-install work.

### `corrosion.{license_id}.{instance_id}.console` (agent → backend, publish)

Live console/log lines for the panel console view.

### `corrosion.{license_id}.{instance_id}.files.cmd` (backend → agent, request-reply)

VueFinder-style file manager ops, jailed to the instance root. Carries over the Go agent's jailed filemanager semantics (`fm_list`, `fm_save`, ...); the legacy UNJAILED `files.get/put/delete/list` API is retired and will not be ported.

## Backend mapping notes (Phase 0)

- The NestJS NATS bridge subscribes to `corrosion.*.host.heartbeat` and `corrosion.*.host.going_offline`.
- Until the license→host→instance schema lands, the backend may map the host heartbeat onto the existing single `server_connections` row per license: `companion_last_seen` ← heartbeat arrival, `connection_status` ← connected/offline, resources ← `host.cpu_percent` / `mem_*` / first disk. Instance-level mapping activates with the fleet schema.
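
The interim heartbeat-to-row mapping described in the notes above could look roughly like the sketch below. It is an assumption-laden illustration rather than the backend's real code: the function names are hypothetical, and only `companion_last_seen` and `connection_status` are column names taken from the text. The resource keys (`cpu_percent`, `mem_total_mb`, `mem_used_mb`, `disk_total_mb`, `disk_free_mb`) are placeholders for whatever the `server_connections` row actually uses.

```python
from datetime import datetime, timezone


def map_heartbeat_to_connection(payload: dict, arrived_at: datetime) -> dict:
    """Hypothetical mapping of a v2 host heartbeat onto one server_connections row."""
    host = payload.get("host", {})
    disks = host.get("disks") or []
    first_disk = disks[0] if disks else {}  # the notes say "first disk"
    return {
        # Uses heartbeat arrival time, not the agent-reported timestamp.
        "companion_last_seen": arrived_at.isoformat(),
        "connection_status": "connected",
        # Resource keys below are ASSUMED names; unmeasured fields stay None.
        "cpu_percent": host.get("cpu_percent"),
        "mem_total_mb": host.get("mem_total_mb"),
        "mem_used_mb": host.get("mem_used_mb"),
        "disk_total_mb": first_disk.get("total_mb"),
        "disk_free_mb": first_disk.get("free_mb"),
    }


def map_going_offline() -> dict:
    """Hypothetical mapping for the going_offline beacon (payload is {})."""
    return {"connection_status": "offline"}


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(map_heartbeat_to_connection({"schema": 2, "host": {"cpu_percent": 12.5}}, now))
```

Leaving unmeasured resources as `None` rather than defaulting them to zero keeps the mapping consistent with the agent's rule that telemetry is never fabricated.
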
## Probing — scope honesty

The Phase 0 prober measures **outbound** reachability from the host (TCP connect + latency). It cannot verify **inbound** port-forwarding (the thing players hit). Inbound verification requires a backend-side reverse probe service that attempts connections to the customer's public IP/ports on request; that is specified as a Phase 1+ feature and will reuse this report format with `direction: "inbound"`.

## Authentication & tenant isolation

The broker enforces per-license auth: an agent connects with `user = license_id`, `password = HMAC-SHA256(license_id, NATS_TOKEN_SECRET)` (shown on the panel Server page), and is scoped to `corrosion.{license_id}.>` only. The backend uses a privileged internal user. This makes cross-tenant access impossible at the broker, not just by convention.

**Reply-subject rule:** per-license users have NO `_INBOX` permission (granting it would let one license read another's request-reply traffic). Therefore any backend→agent request-reply MUST use a reply subject inside the license namespace — e.g. `corrosion.{license_id}.reply.` — never the client's default global `_INBOX`. The agent is unaffected: it responds to whatever `msg.reply` it receives. The constraint is on the requester (the internal user has full access). The contract/CI tests run against an unauthenticated broker and use the default inbox; production request-reply must follow this rule.

## Versioning

- The agent embeds semver + git hash + build timestamp (`--version`, heartbeat `agent` block).
- Schema changes bump `schema` and are additive where possible.
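
Returning to the credential and reply-subject rules from "Authentication & tenant isolation": the sketch below shows one way a requester might derive the per-license password and build an in-namespace reply subject. Several details are assumptions, because the spec does not state them: that `NATS_TOKEN_SECRET` is the HMAC key and `license_id` the message, that the digest is hex-encoded, and that the reply subject ends in a random token. All function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def derive_password(license_id: str, token_secret: str) -> str:
    """ASSUMPTION: secret is the HMAC key, license_id the message, hex output."""
    return hmac.new(token_secret.encode(), license_id.encode(), hashlib.sha256).hexdigest()


def license_reply_subject(license_id: str) -> str:
    """Build a reply subject inside the license namespace instead of _INBOX.

    The random suffix is an illustrative choice; the spec leaves the final
    segment unspecified.
    """
    return f"corrosion.{license_id}.reply.{secrets.token_hex(8)}"


def is_allowed_for_license(subject: str, license_id: str) -> bool:
    """Approximate the corrosion.{license_id}.> scope: prefix plus at least one more token."""
    prefix = f"corrosion.{license_id}."
    return subject.startswith(prefix) and len(subject) > len(prefix)


if __name__ == "__main__":
    subject = license_reply_subject("lic-123")
    print(subject, is_allowed_for_license(subject, "lic-123"))
```

With a reply subject built this way, a per-license agent can publish its response without needing any `_INBOX` permission, which is the point of the reply-subject rule.
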