corrosion-admin-panel/corrosion-host-agent/PROTOCOL.md
Vantz Stockwell 8334fbe4c6
feat(host-agent): Phase 2 — Dune docker-compose adapter via Supervisor trait
Introduce a Supervisor trait (async-trait) so the agent manages games with
different models behind one wire contract. ProcessSupervisor (spawned process:
rust/conan/soulmask) and the new DockerComposeSupervisor (dune) both impl it;
Agent.supervisors is now HashMap<String, Arc<dyn Supervisor>> and instancecmd
dispatch is game-agnostic — start/stop/restart/status identical across games,
selected by a per-game factory in main. InstanceState moved to the shared
supervisor module.

DockerComposeSupervisor drives docker compose up -d / stop / restart against
the instance's compose project, with -f/-p/single-service support and a
configurable compose binary. New [instance.docker_compose] config block.
First cut = lifecycle + cached state; container crash-detection + restart
adoption deferred to Phase 3b (reconcilable with ).

Trait choice (dyn over enum) per Commander: scales to future planes (kubectl,
AMP/podman, SSH) as new struct+impl, no central match.

56 tests green (6 new docker-compose mock-binary tests + 5 refactored process
tests), zero warnings. Live verification pending a real Dune stack.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 21:32:25 -04:00


Corrosion Wire Protocol v2

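Status: Phase 0 + Phase 1 process control implemented (host heartbeat, host commands, going-offline beacon, per-instance start/stop/restart/status with push state events). RCON, SteamCMD updates, jailed file ops, and the Dune docker-compose lifecycle are also implemented (see the sections marked LIVE below). The remaining game-adapter commands are specified but not yet implemented.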

Design

One host agent per machine supervises N game instances. Subjects are scoped license-first, then by addressee:

corrosion.{license_id}.host.*           host-level (the agent itself)
corrosion.{license_id}.{instance_id}.*  instance-level (one game server)

instance_id is a config-defined slug ([a-z0-9_-]{1,64}), validated at agent start. host is a reserved segment and can never be an instance id. Payloads are JSON. Every heartbeat carries "schema": 2 so consumers can distinguish v2 from the legacy Go companion protocol (which used corrosion.{license_id}.companion.heartbeat, no schema field).
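The slug rule and the reserved host segment can be checked in a few lines. This is a minimal sketch rather than the agent's actual validation code; the helper names validate_instance_id and instance_subject are hypothetical.

```python
import re

# Slug pattern from the spec: lowercase letters, digits, "_" and "-", 1 to 64 chars.
INSTANCE_ID_RE = re.compile(r"[a-z0-9_-]{1,64}")
# "host" addresses the agent itself, so it can never name an instance.
RESERVED_SEGMENTS = {"host"}


def validate_instance_id(instance_id: str) -> None:
    """Raise ValueError if the id is not a legal instance slug."""
    if not INSTANCE_ID_RE.fullmatch(instance_id):
        raise ValueError(f"invalid instance id: {instance_id!r}")
    if instance_id in RESERVED_SEGMENTS:
        raise ValueError(f"reserved segment cannot be an instance id: {instance_id!r}")


def instance_subject(license_id: str, instance_id: str, leaf: str) -> str:
    """Build corrosion.{license_id}.{instance_id}.{leaf} (hypothetical helper)."""
    validate_instance_id(instance_id)
    return f"corrosion.{license_id}.{instance_id}.{leaf}"
```

Using fullmatch rather than match keeps a trailing newline or a dot from slipping through, which matters because a dot would add an extra subject token.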

Host-level subjects (Phase 0 — live)

corrosion.{license_id}.host.heartbeat (agent → backend, publish)

Published every heartbeat_seconds (default 60, jittered ±20%).

{
  "schema": 2,
  "timestamp": "2026-06-11T18:00:00Z",
  "agent": {
    "version": "2.0.0-alpha.1",
    "commit": "a8722a7",
    "os": "linux",
    "arch": "x86_64",
    "uptime_seconds": 86400
  },
  "host": {
    "hostname": "asgard-01",
    "cpu_percent": 12.5,
    "cpu_cores": 80,
    "mem_total_mb": 262144,
    "mem_used_mb": 81920,
    "uptime_seconds": 1209600,
    "disks": [
      { "mount": "/", "total_mb": 1907729, "free_mb": 1532211 }
    ]
  },
  "instances": [
    {
      "id": "rust-main",
      "game": "rust",
      "label": "Main 2x Vanilla",
      "state": "configured",
      "root_disk_free_mb": 1532211
    }
  ],
  "probe": {
    "timestamp": "2026-06-11T17:58:00Z",
    "results": [
      { "name": "corrosion-cdn", "host": "cdn.corrosionmgmt.com", "port": 443, "ok": true, "latency_ms": 18 }
    ]
  }
}
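The ±20% jitter on heartbeat_seconds can be sketched as below. The spec does not state the distribution, so a uniform draw is an assumption here, and next_heartbeat_delay is a hypothetical name.

```python
import random
from typing import Optional


def next_heartbeat_delay(
    heartbeat_seconds: float = 60.0,
    jitter_fraction: float = 0.20,
    rng: Optional[random.Random] = None,
) -> float:
    """Return the delay before the next heartbeat, in seconds."""
    rng = rng if rng is not None else random.Random()
    # Assumption: uniform draw in [-20%, +20%] around the configured interval.
    return heartbeat_seconds * (1.0 + rng.uniform(-jitter_fraction, jitter_fraction))
```

With the default of 60 seconds this yields delays between 48 and 72 seconds, which spreads out publishes from agents that started at the same moment.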

All telemetry is measured, never fabricated. Fields the agent cannot measure are omitted (probe before the first probe completes, hostname if unavailable).
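A consumer therefore has to check the schema field and tolerate missing fields. The following is a rough sketch of that handling, not the backend's real code; summarize_heartbeat and the keys of its return value are hypothetical.

```python
import json
from typing import Any, Dict


def summarize_heartbeat(raw: bytes) -> Dict[str, Any]:
    """Reduce a v2 host heartbeat to a few derived values."""
    msg = json.loads(raw)
    # Legacy companion heartbeats carry no schema field at all.
    if msg.get("schema") != 2:
        raise ValueError("not a v2 host heartbeat")
    host = msg.get("host", {})
    total, used = host.get("mem_total_mb"), host.get("mem_used_mb")
    probe = msg.get("probe")  # absent until the first probe completes
    return {
        "hostname": host.get("hostname"),  # omitted by the agent if unavailable
        "mem_used_percent": 100 * used / total if total and used is not None else None,
        "probes_ok": None if probe is None else all(r["ok"] for r in probe["results"]),
        "instance_states": {i["id"]: i["state"] for i in msg.get("instances", [])},
    }
```

Missing measurements surface as None instead of a default such as 0, so the panel never shows a value the agent did not actually measure.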

Instance state values — process-managed (an executable is configured): running, stopped, starting, stopping, crashed; unmanaged (telemetry-only): configured (root exists), missing_root. Each instance also reports uptime_seconds (0 unless running).

corrosion.{license_id}.host.cmd (backend → agent, request-reply)

Request: { "func": "<name>" }. Reply: { "status": "success" | "error", ... }.

| func | Reply payload |
| --- | --- |
| ping | version, commit, uptime_seconds |
| probe | report: fresh ProbeReport (also cached for heartbeat) |
| sysinfo | snapshot: full heartbeat payload, collected on demand |
| update | { "func": "update", "url": "https://cdn.corrosionmgmt.com/host-agent/.../corrosion-host-agent-<plat>" } → downloads the binary + <url>.minisig, verifies the minisign signature against the agent's EMBEDDED public key, atomically swaps (with .old rollback), replies { status: success, message: "...relaunching" }, then relaunches the new binary. Rejects anything not signed by the release key and any URL that isn't https://cdn.corrosionmgmt.com. |
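The URL restriction on update is easy to get wrong with string prefix checks, so here is a hedged sketch that parses the URL first. It is not the agent's implementation; is_allowed_update_url and signature_url are hypothetical names, and rejecting userinfo and non-443 ports is an extra assumption beyond what the spec states.

```python
from urllib.parse import urlsplit

ALLOWED_UPDATE_HOST = "cdn.corrosionmgmt.com"


def is_allowed_update_url(url: str) -> bool:
    """True only for https URLs whose host is exactly the release CDN."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return False
    return (
        parts.scheme == "https"
        and parts.hostname == ALLOWED_UPDATE_HOST
        # Assumption: userinfo and non-default ports are refused as well.
        and parts.username is None
        and parts.password is None
        and port in (None, 443)
    )


def signature_url(url: str) -> str:
    """The detached minisign signature lives at <url>.minisig."""
    return url + ".minisig"
```

Comparing the parsed hostname defeats lookalikes such as cdn.corrosionmgmt.com.evil.example or cdn.corrosionmgmt.com@evil.example, both of which pass a naive startswith check.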

Unknown funcs return status: "error" with a message listing supported funcs.
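The request/reply shape, including the unknown-func error, might be dispatched as sketched below. Only ping is wired up here; the handler table and handle_host_cmd are hypothetical, and flattening the reply payload next to status is an assumption based on the reply shape shown above.

```python
import json
import time

_STARTED = time.monotonic()
AGENT_VERSION = "2.0.0-alpha.1"  # values borrowed from the heartbeat example
AGENT_COMMIT = "a8722a7"


def _ping() -> dict:
    return {
        "version": AGENT_VERSION,
        "commit": AGENT_COMMIT,
        "uptime_seconds": int(time.monotonic() - _STARTED),
    }


# probe, sysinfo, and update are left out of this sketch.
HANDLERS = {"ping": _ping}


def handle_host_cmd(raw: bytes) -> bytes:
    """Turn one request payload into one reply payload."""
    try:
        func = json.loads(raw).get("func")
    except (ValueError, AttributeError):
        func = None  # malformed JSON or a non-object body
    if not isinstance(func, str):
        func = None
    handler = HANDLERS.get(func)
    if handler is None:
        supported = ", ".join(sorted(HANDLERS))
        reply = {"status": "error", "message": f"unknown func {func!r}; supported: {supported}"}
    else:
        reply = {"status": "success", **handler()}
    return json.dumps(reply).encode()
```

Malformed requests fall into the same error path as unknown funcs, so the requester always gets a reply instead of a timeout.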

corrosion.{license_id}.host.going_offline (agent → backend, publish)

Best-effort beacon (500ms budget) on graceful shutdown so the panel can flip the host to offline immediately instead of waiting out heartbeat staleness. Payload: {}.

Instance-level subjects

corrosion.{license_id}.{instance_id}.cmd (backend → agent, request-reply) — LIVE

Lifecycle and control for one game instance.

The same start/stop/restart/status funcs work for every game: the agent picks a Supervisor impl per game — a spawned-process supervisor for Rust/Conan/Soulmask, a docker-compose supervisor for Dune (docker compose up -d / stop / restart against the instance's compose project, configured via [instance.docker_compose]). The wire contract is identical; only the management model behind it differs.

Implemented funcs:

  • start
  • stop: graceful with a 30s budget, then force kill (process supervisor); for Dune, stop maps to docker compose stop
  • restart
  • status: returns state + uptime_seconds
  • rcon: { "func": "rcon", "command": "<console command>" } returns { "status": "success", "output": <server response> }. Protocol per game: WebRCON (WebSocket JSON) for rust, Source RCON (Valve TCP) for conan/soulmask; an explicit kind override is available in the instance's [instance.rcon] config. Always targets 127.0.0.1 (the agent is co-located).

Errors reply { "status": "error", "message": ... }. This covers start on an unmanaged instance, double start, missing rcon config, and unknown funcs.

Also implemented: steam_update. { "func": "steam_update" } runs SteamCMD for the instance's game (app ids: rust 258550, conan 443030, soulmask 3017310/3017300; dune is rejected because it ships as Docker images, not through SteamCMD), streaming progress lines to corrosion.{license_id}.{instance_id}.steam_status and replying on completion.

Planned funcs: oxide_install (rust), plus game-adapter-specific commands (Dune: RabbitMQ admin-bus commands, Coriolis reset, Postgres admin surface). Dune lifecycle is already covered by the shared start/stop/restart funcs above; container crash-detection and state adoption on agent restart land with Phase 3b.

corrosion.{license_id}.{instance_id}.steam_status (agent → backend, publish) — LIVE

Per-line SteamCMD stdout during a steam_update, so the panel can show live update progress. Payload: { "timestamp", "instance_id", "line" }.
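The app-id lookup for steam_update and the per-line steam_status payload could look roughly like this. The function names are hypothetical, the spec does not say which of the two soulmask ids applies in which case, and the timestamp format is assumed to match the heartbeat example.

```python
from datetime import datetime, timezone
from typing import Dict, Tuple

# App ids as listed in the spec. Both soulmask ids are kept because the
# spec does not state when each one is used.
STEAM_APP_IDS: Dict[str, Tuple[int, ...]] = {
    "rust": (258550,),
    "conan": (443030,),
    "soulmask": (3017310, 3017300),
}


def steam_app_ids(game: str) -> Tuple[int, ...]:
    """Return the SteamCMD app ids for a game, or raise for unsupported games."""
    if game == "dune":
        raise ValueError("dune ships as Docker images; steam_update is not supported")
    if game not in STEAM_APP_IDS:
        raise ValueError(f"no SteamCMD app id known for game {game!r}")
    return STEAM_APP_IDS[game]


def steam_status_payload(instance_id: str, line: str) -> dict:
    """Wrap one SteamCMD stdout line for the steam_status subject."""
    # Assumption: UTC timestamps in the same format as the heartbeat example.
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {"timestamp": timestamp, "instance_id": instance_id, "line": line.rstrip("\r\n")}
```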

corrosion.{license_id}.{instance_id}.files.cmd (backend → agent, request-reply) — LIVE

Jailed file manager, confined to the instance root (two-stage check: lexical normalize + canonicalize, defeating ../ traversal and symlink escape). Request { "op": "list|read|write|delete|rename|mkdir|mkfile|move|copy", "path": "rel/path", "dest"?, "content"?, "name"? }; reply { "status": "success", "data": ... } or { "status": "error", "message": ... }. read caps at 5 MiB. Replaces the Go agent's UNJAILED legacy files API, which is retired and will not be ported.
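The two-stage check can be illustrated as follows. This is a sketch of the technique under POSIX path assumptions, not the agent's code, and resolve_in_jail is a hypothetical name.

```python
import os
import posixpath


def resolve_in_jail(root: str, rel_path: str) -> str:
    """Map a request path to a real path inside root, or raise PermissionError."""
    # Stage 1: lexical normalization. Collapses "a/../b" and rejects anything
    # that is absolute or still climbs out of the root after normalization.
    norm = posixpath.normpath(rel_path)
    if norm.startswith("/") or norm == ".." or norm.startswith("../"):
        raise PermissionError(f"path escapes instance root: {rel_path!r}")

    # Stage 2: canonicalize. Resolving symlinks catches a link inside the
    # root that points somewhere outside it.
    root_real = os.path.realpath(root)
    target = os.path.realpath(os.path.join(root_real, norm))
    if os.path.commonpath([root_real, target]) != root_real:
        raise PermissionError(f"path resolves outside instance root: {rel_path!r}")
    return target
```

Stage 1 alone misses symlinks, and stage 2 alone would touch the filesystem for requests that are already obviously hostile, which is presumably why the check is split in two.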

corrosion.{license_id}.{instance_id}.status (agent → backend, publish) — LIVE

State-change events so the panel does not wait for the next heartbeat. Payload: { "timestamp", "instance_id", "event": { "state": ..., "exit_code"? } }.

Semantics: keep-latest state sync, not a lossless transition ledger — near-instant transient states (e.g. starting when spawn succeeds immediately) may coalesce into the following state. Consumers should treat each event as "current state is now X".
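On the consumer side, keep-latest means overwriting rather than validating transitions. A minimal sketch, assuming a plain dict as the cache; apply_status_event is a hypothetical name.

```python
from typing import Any, Dict


def apply_status_event(cache: Dict[str, Dict[str, Any]], payload: Dict[str, Any]) -> None:
    """Record "current state is now X" for one instance.

    No transition checks: stopped may jump straight to running because a
    near-instant starting state can be coalesced away by the agent.
    """
    event = payload["event"]
    cache[payload["instance_id"]] = {
        "state": event["state"],
        "exit_code": event.get("exit_code"),  # optional in the payload
        "timestamp": payload["timestamp"],
    }
```

A consumer that insisted on seeing starting before running would reject valid event streams, so the cache deliberately stores only the most recent event per instance.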

Known Phase 1 limitation: the supervisor does not yet persist/adopt PIDs — if the agent itself restarts while a game server is running, the game process survives but reports stopped until restarted through the panel. PID adoption is queued with the service-install work.

corrosion.{license_id}.{instance_id}.console (agent → backend, publish)

Live console/log lines for the panel console view.

corrosion.{license_id}.{instance_id}.files.cmd (backend → agent, request-reply)

VueFinder-style file manager ops, jailed to the instance root. Carries over the Go agent's jailed filemanager semantics (fm_list, fm_save, ...); the legacy UNJAILED files.get/put/delete/list API is retired and will not be ported.

Backend mapping notes (Phase 0)

  • The NestJS NATS bridge subscribes corrosion.*.host.heartbeat and corrosion.*.host.going_offline.
  • Until the license→host→instance schema lands, the backend may map the host heartbeat onto the existing single server_connections row per license: companion_last_seen ← heartbeat arrival, connection_status ← connected/offline, resources ← host.cpu_percent / mem_* / first disk. Instance-level mapping activates with the fleet schema.

Probing — scope honesty

The Phase 0 prober measures outbound reachability from the host (TCP connect + latency). It cannot verify inbound port-forwarding (the thing players hit). Inbound verification requires a backend-side reverse probe service that attempts connections to the customer's public IP/ports on request; that is specified as a Phase 1+ feature and will reuse this report format with direction: "inbound".
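An outbound TCP connect probe of this kind can be sketched in a few lines. The result keys mirror the probe entries in the heartbeat example; probe_tcp, the 3 second default timeout, and omitting latency_ms on failure are assumptions.

```python
import socket
import time


def probe_tcp(name: str, host: str, port: int, timeout: float = 3.0) -> dict:
    """Attempt one outbound TCP connect and report reachability and latency."""
    result = {"name": name, "host": host, "port": port}
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["ok"] = True
            result["latency_ms"] = round((time.monotonic() - started) * 1000)
    except OSError:
        # Assumption: no latency is reported for a failed connect, in line
        # with "fields the agent cannot measure are omitted".
        result["ok"] = False
    return result
```

Because the connection originates on the host, a successful result says nothing about whether players on the internet can reach the game port from outside.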

Authentication & tenant isolation

The broker enforces per-license auth: an agent connects with user = license_id, password = HMAC-SHA256(license_id, NATS_TOKEN_SECRET) (shown on the panel Server page), and is scoped to corrosion.{license_id}.> only. The backend uses a privileged internal user. This makes cross-tenant access impossible at the broker, not just by convention.
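Deriving the per-license password might look like the sketch below. The spec's notation does not say which argument is the HMAC key or how the digest is encoded, so this assumes NATS_TOKEN_SECRET is the key, license_id is the message, and the output is lowercase hex. Both function names are hypothetical.

```python
import hashlib
import hmac


def nats_password(license_id: str, token_secret: str) -> str:
    """Derive the broker password for one license."""
    # Assumption: the secret is the HMAC key and the license id is the message.
    digest = hmac.new(token_secret.encode(), license_id.encode(), hashlib.sha256)
    return digest.hexdigest()  # assumption: hex encoding


def password_matches(candidate: str, license_id: str, token_secret: str) -> bool:
    """Constant-time comparison against the expected password."""
    return hmac.compare_digest(candidate, nats_password(license_id, token_secret))
```

Since the password is a pure function of the license id and the server-side secret, the backend can recompute it on demand instead of storing one credential per license.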

Reply-subject rule: per-license users have NO _INBOX permission (granting it would let one license read another's request-reply traffic). Therefore any backend→agent request-reply MUST use a reply subject inside the license namespace — e.g. corrosion.{license_id}.reply.<id> — never the client's default global _INBOX. The agent is unaffected: it responds to whatever msg.reply it receives. The constraint is on the requester (the internal user has full access). The contract/CI tests run against an unauthenticated broker and use the default inbox; production request-reply must follow this rule.
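A requester can follow the reply-subject rule by generating its own reply subject inside the license namespace. A minimal sketch with hypothetical helper names; using a random UUID for the id is an assumption.

```python
import uuid
from typing import Optional


def reply_subject(license_id: str, request_id: Optional[str] = None) -> str:
    """Build a reply subject that stays inside corrosion.{license_id}.>"""
    rid = request_id if request_id is not None else uuid.uuid4().hex
    return f"corrosion.{license_id}.reply.{rid}"


def is_license_scoped(subject: str, license_id: str) -> bool:
    """True if a per-license user would be permitted to publish to subject."""
    # The trailing dot stops "lic123" from matching "lic1234".
    return subject.startswith(f"corrosion.{license_id}.")
```

The requester subscribes to this subject before publishing and passes it as the reply field, so the agent's response stays inside the license namespace it is permitted to use.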

Versioning

  • The agent embeds semver + git hash + build timestamp (--version, heartbeat agent block).
  • Schema changes bump schema and are additive where possible.