# Corrosion Wire Protocol v2

Status: **Phase 0 + Phase 1 process control implemented** (host heartbeat,
host commands, going-offline beacon, per-instance start/stop/restart/status
with push state events). RCON, SteamCMD, file ops, and game adapters are
specified but not yet implemented.

## Design

One **host agent** per machine supervises **N game instances**. Subjects are
scoped license-first, then by addressee:

```
corrosion.{license_id}.host.*            host-level (the agent itself)
corrosion.{license_id}.{instance_id}.*   instance-level (one game server)
```

`instance_id` is a config-defined slug (`[a-z0-9_-]{1,64}`), validated at
agent start. `host` is a reserved segment and can never be an instance id.
Payloads are JSON. Every heartbeat carries `"schema": 2` so consumers can
distinguish v2 from the legacy Go companion protocol (which used
`corrosion.{license_id}.companion.heartbeat`, no schema field).
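
The subject scheme and the slug rule above can be sketched in a few lines of
Python. This is a minimal illustration, not the agent's own code: the function
names (`is_valid_instance_id`, `host_subject`, `instance_subject`) are
hypothetical, and only the regex and the reserved `host` segment come from this
document.

```python
import re

# Slug rule quoted from the spec: [a-z0-9_-]{1,64}
_INSTANCE_ID = re.compile(r"[a-z0-9_-]{1,64}")
# `host` is reserved for host-level subjects and can never be an instance id.
RESERVED_SEGMENTS = {"host"}


def is_valid_instance_id(instance_id: str) -> bool:
    """Return True if the slug matches the spec and is not reserved."""
    return (
        _INSTANCE_ID.fullmatch(instance_id) is not None
        and instance_id not in RESERVED_SEGMENTS
    )


def host_subject(license_id: str, leaf: str) -> str:
    """Build a host-level subject, e.g. leaf='heartbeat'."""
    return f"corrosion.{license_id}.host.{leaf}"


def instance_subject(license_id: str, instance_id: str, leaf: str) -> str:
    """Build an instance-level subject, refusing invalid or reserved ids."""
    if not is_valid_instance_id(instance_id):
        raise ValueError(f"invalid instance id: {instance_id!r}")
    return f"corrosion.{license_id}.{instance_id}.{leaf}"
```

Rejecting `host` at build time keeps an instance subject from ever colliding
with the host-level namespace.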

## Host-level subjects (Phase 0 — live)

### `corrosion.{license_id}.host.heartbeat` (agent → backend, publish)

Published every `heartbeat_seconds` (default 60, jittered ±20%).

```json
{
  "schema": 2,
  "timestamp": "2026-06-11T18:00:00Z",
  "agent": {
    "version": "2.0.0-alpha.1",
    "commit": "a8722a7",
    "os": "linux",
    "arch": "x86_64",
    "uptime_seconds": 86400
  },
  "host": {
    "hostname": "asgard-01",
    "cpu_percent": 12.5,
    "cpu_cores": 80,
    "mem_total_mb": 262144,
    "mem_used_mb": 81920,
    "uptime_seconds": 1209600,
    "disks": [
      { "mount": "/", "total_mb": 1907729, "free_mb": 1532211 }
    ]
  },
  "instances": [
    {
      "id": "rust-main",
      "game": "rust",
      "label": "Main 2x Vanilla",
      "state": "configured",
      "root_disk_free_mb": 1532211
    }
  ],
  "probe": {
    "timestamp": "2026-06-11T17:58:00Z",
    "results": [
      { "name": "corrosion-cdn", "host": "cdn.corrosionmgmt.com", "port": 443, "ok": true, "latency_ms": 18 }
    ]
  }
}
```

All telemetry is measured, never fabricated. Fields the agent cannot measure
are omitted (`probe` before the first probe completes, `hostname` if
unavailable).

Instance `state` values — process-managed (an `executable` is configured):
`running`, `stopped`, `starting`, `stopping`, `crashed`; unmanaged
(telemetry-only): `configured` (root exists), `missing_root`. Each instance
also reports `uptime_seconds` (0 unless running).
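
A consumer has to tolerate the omitted fields and tell v2 apart from the legacy
protocol. The sketch below shows one way to do that in Python; the function
name `summarize_heartbeat` and the shape of the returned summary are
assumptions made for illustration, while the field names and state values are
taken from the payload above.

```python
import json

# State values listed in the spec.
MANAGED_STATES = {"running", "stopped", "starting", "stopping", "crashed"}
UNMANAGED_STATES = {"configured", "missing_root"}


def summarize_heartbeat(raw: bytes) -> dict:
    """Parse a host heartbeat and return a small summary (hypothetical shape)."""
    msg = json.loads(raw)
    # Legacy Go companion heartbeats carry no schema field at all.
    if msg.get("schema") != 2:
        raise ValueError("not a v2 heartbeat")
    host = msg.get("host", {})
    instances = msg.get("instances", [])
    return {
        # Omitted means "not measured": keep None, never substitute a value.
        "hostname": host.get("hostname"),
        "has_probe": "probe" in msg,
        "managed": [i["id"] for i in instances if i.get("state") in MANAGED_STATES],
        "unmanaged": [i["id"] for i in instances if i.get("state") in UNMANAGED_STATES],
        "running": [i["id"] for i in instances if i.get("state") == "running"],
    }
```

Reading optional fields with `.get` keeps a heartbeat sent before the first
probe from being treated as malformed.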

### `corrosion.{license_id}.host.cmd` (backend → agent, request-reply)

Request: `{ "func": "<name>" }`. Reply: `{ "status": "success" | "error", ... }`.

| func      | Reply payload                                                    |
| --------- | ---------------------------------------------------------------- |
| `ping`    | `version`, `commit`, `uptime_seconds`                            |
| `probe`   | `report` — fresh ProbeReport (also cached for heartbeat)         |
| `sysinfo` | `snapshot` — full heartbeat payload, collected on demand         |
| `update`  | `message` (`"...relaunching"`), sent before the agent relaunches |

`update` takes an extra `url` field:
`{ "func": "update", "url": "https://cdn.corrosionmgmt.com/host-agent/.../corrosion-host-agent-<plat>" }`.
The agent downloads the binary and `<url>.minisig`, verifies the minisign
signature against its EMBEDDED public key, atomically swaps the binary (with
`.old` rollback), replies `{ status: success, message: "...relaunching" }`,
then relaunches the new binary. It rejects anything not signed by the release
key and any URL that isn't `https://cdn.corrosionmgmt.com`.

Unknown funcs return `status: "error"` with a message listing supported funcs.
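
On the requester side, building a host command and unwrapping the reply might
look like the sketch below. It is illustrative only: `build_host_cmd`,
`unwrap_reply`, and `is_allowed_update_url` are hypothetical helper names, and
the URL pre-check is a client-side convenience that assumes the rule is "https
scheme and exactly this hostname". The agent performs its own check regardless.

```python
import json
from urllib.parse import urlsplit

# Host taken from the spec; the exact matching rule is an assumption.
ALLOWED_UPDATE_HOST = "cdn.corrosionmgmt.com"


def is_allowed_update_url(url: str) -> bool:
    """Accept only https URLs whose hostname is exactly the release CDN."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname == ALLOWED_UPDATE_HOST


def build_host_cmd(func: str, **args) -> bytes:
    """Encode a host.cmd request; extra fields (e.g. url) ride next to func."""
    if func == "update" and not is_allowed_update_url(args.get("url", "")):
        raise ValueError("update url must point at https://cdn.corrosionmgmt.com")
    return json.dumps({"func": func, **args}).encode()


def unwrap_reply(raw: bytes) -> dict:
    """Return the reply on success, raise with the agent's message on error."""
    reply = json.loads(raw)
    if reply.get("status") != "success":
        raise RuntimeError(reply.get("message", "agent returned an error"))
    return reply
```

Comparing the parsed hostname instead of a string prefix avoids accepting
look-alike hosts such as `cdn.corrosionmgmt.com.example.net`.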

### `corrosion.{license_id}.host.going_offline` (agent → backend, publish)

Best-effort beacon (500ms budget) on graceful shutdown so the panel can flip
the host to offline immediately instead of waiting out heartbeat staleness.
Payload: `{}`.

## Instance-level subjects

### `corrosion.{license_id}.{instance_id}.cmd` (backend → agent, request-reply) — LIVE

Lifecycle and control for one game instance.

The same `start`/`stop`/`restart`/`status` funcs work for **every** game: the
agent picks a `Supervisor` impl per game — a spawned-process supervisor for
Rust/Conan/Soulmask, a **docker-compose supervisor for Dune** (`docker compose
up -d` / `stop` / `restart` against the instance's compose project, configured
via `[instance.docker_compose]`). The wire contract is identical; only the
management model behind it differs.

Implemented funcs:

- `start`
- `stop`: graceful with a 30s budget, then force kill (process supervisor);
  for Dune, stop maps to `docker compose stop`.
- `restart`
- `status`: returns `state` + `uptime_seconds`.
- `rcon`: `{ "func": "rcon", "command": "<console command>" }` returns
  `{ "status": "success", "output": <server response> }`. Protocol per game:
  WebRCON (WebSocket JSON) for rust, Source RCON (Valve TCP) for
  conan/soulmask; an explicit `kind` override is available in the instance's
  `[instance.rcon]` config. RCON always targets 127.0.0.1 (the agent is
  co-located).

Errors reply `{ "status": "error", "message": ... }`, including start on an
unmanaged instance, double start, missing rcon config, and unknown funcs.
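
Because the request shape is the same for every game, one small helper can
cover all instance funcs. The sketch below is an illustration under
assumptions: `build_instance_cmd` and `rcon_output` are hypothetical names, and
the only rule enforced locally is that `rcon` needs a `command`. Everything
else is left to the agent, which replies with an error for unsupported funcs.

```python
import json


def build_instance_cmd(func: str, **args) -> bytes:
    """Encode an instance cmd request such as start, stop, status, or rcon."""
    if func == "rcon" and not args.get("command"):
        raise ValueError("rcon requires a non-empty console command")
    return json.dumps({"func": func, **args}).encode()


def rcon_output(raw: bytes) -> str:
    """Extract the server response from an rcon reply, or raise on error."""
    reply = json.loads(raw)
    if reply.get("status") != "success":
        # Covers cases like missing rcon config or an unmanaged instance.
        raise RuntimeError(reply.get("message", "rcon failed"))
    return reply["output"]
```

The same `build_instance_cmd("restart")` payload is valid for a Rust instance
and a Dune instance; only the supervisor behind the subject differs.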

Also implemented: `steam_update`. `{ "func": "steam_update" }` runs SteamCMD
for the instance's game (app ids: rust 258550, conan 443030, soulmask
3017310/3017300), streaming progress lines to
`corrosion.{license_id}.{instance_id}.steam_status` and replying on
completion. Dune instances reject the call because they run from Docker
images, not SteamCMD.

Planned funcs: `oxide_install` (rust), plus game-adapter-specific
commands (Dune: RabbitMQ admin-bus commands, Coriolis reset, Postgres admin
surface). Dune **lifecycle** is already covered by the shared
start/stop/restart funcs above; container crash-detection and state adoption on
agent restart land with Phase 3b.

### `corrosion.{license_id}.{instance_id}.steam_status` (agent → backend, publish) — LIVE

Per-line SteamCMD stdout during a `steam_update`, so the panel can show
live update progress. Payload: `{ "timestamp", "instance_id", "line" }`.

### `corrosion.{license_id}.{instance_id}.files.cmd` (backend → agent, request-reply) — LIVE

Jailed file manager, confined to the instance `root` (two-stage check:
lexical normalize + canonicalize, defeating `../` traversal and symlink
escape). Request `{ "op": "list|read|write|delete|rename|mkdir|mkfile|move|copy",
"path": "rel/path", "dest"?, "content"?, "name"? }`; reply
`{ "status": "success", "data": ... }` or `{ "status": "error", "message": ... }`.
`read` caps at 5 MiB. Replaces the Go agent's UNJAILED legacy files API,
which is retired and will not be ported.
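
The two-stage jail check described above can be illustrated in Python. This is
a sketch of the general technique, not the agent's implementation: the function
name `resolve_in_jail` is hypothetical, POSIX paths are assumed, and the real
agent may order or implement the stages differently.

```python
import os


def resolve_in_jail(root: str, rel_path: str) -> str:
    """Resolve rel_path under root, refusing traversal and symlink escapes."""
    root_real = os.path.realpath(root)

    # Stage 1: lexical normalization, before touching the filesystem.
    normalized = os.path.normpath(rel_path)
    if (
        os.path.isabs(normalized)
        or normalized == ".."
        or normalized.startswith(".." + os.sep)
    ):
        raise PermissionError(f"path escapes instance root: {rel_path!r}")

    # Stage 2: canonicalize (follows symlinks) and re-check containment.
    candidate = os.path.realpath(os.path.join(root_real, normalized))
    if os.path.commonpath([root_real, candidate]) != root_real:
        raise PermissionError(f"path escapes instance root: {rel_path!r}")
    return candidate
```

Stage 1 alone would miss a symlink inside the root that points elsewhere, and
stage 2 alone would accept an absolute path that happens to resolve inside the
root, which is why both checks are applied.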

### `corrosion.{license_id}.{instance_id}.status` (agent → backend, publish) — LIVE

State-change events so the panel does not wait for the next heartbeat.
Payload: `{ "timestamp", "instance_id", "event": { "state": ..., "exit_code"? } }`.

Semantics: **keep-latest state sync**, not a lossless transition ledger —
near-instant transient states (e.g. `starting` when spawn succeeds
immediately) may coalesce into the following state. Consumers should treat
each event as "current state is now X".

Known Phase 1 limitation: the supervisor does not yet persist/adopt PIDs — if
the agent itself restarts while a game server is running, the game process
survives but reports `stopped` until restarted through the panel. PID
adoption is queued with the service-install work.
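
A consumer that follows the keep-latest semantics stores one current state per
instance and never tries to replay transitions. The sketch below shows that
idea; the class name `InstanceStateCache` is hypothetical, and dropping events
older than the stored one is an added assumption that relies on timestamps
sharing the fixed UTC format used in the heartbeat example.

```python
import json


class InstanceStateCache:
    """Keep-latest view: one current state per instance, no history."""

    def __init__(self):
        self._latest = {}

    def apply(self, raw: bytes) -> None:
        """Record an incoming status event as the instance's current state."""
        msg = json.loads(raw)
        current = self._latest.get(msg["instance_id"])
        # Assumption: same-format UTC timestamps sort correctly as strings.
        if current is not None and msg["timestamp"] < current["timestamp"]:
            return  # late delivery of an older event; keep the newer state
        self._latest[msg["instance_id"]] = {
            "timestamp": msg["timestamp"],
            "state": msg["event"]["state"],
            "exit_code": msg["event"].get("exit_code"),  # optional field
        }

    def state_of(self, instance_id: str):
        """Return the current state string, or None if nothing was seen yet."""
        entry = self._latest.get(instance_id)
        return entry["state"] if entry else None
```

A panel built this way renders correctly even when `starting` is never
observed, because it only ever asks what the state is now.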

### `corrosion.{license_id}.{instance_id}.console` (agent → backend, publish)

Live console/log lines for the panel console view.

### `corrosion.{license_id}.{instance_id}.files.cmd` (backend → agent, request-reply)

VueFinder-style file manager ops, jailed to the instance root. Carries over
the Go agent's jailed filemanager semantics (`fm_list`, `fm_save`, ...); the
legacy UNJAILED `files.get/put/delete/list` API is retired and will not be
ported.

## Backend mapping notes (Phase 0)

- The NestJS NATS bridge subscribes to `corrosion.*.host.heartbeat` and
  `corrosion.*.host.going_offline`.
- Until the license→host→instance schema lands, the backend may map the host
  heartbeat onto the existing single `server_connections` row per license:
  `companion_last_seen` ← heartbeat arrival, `connection_status` ←
  connected/offline, resources ← `host.cpu_percent` / `mem_*` / first disk.
  Instance-level mapping activates with the fleet schema.

## Probing — scope honesty

The Phase 0 prober measures **outbound** reachability from the host (TCP
connect + latency). It cannot verify **inbound** port-forwarding (the thing
players hit). Inbound verification requires a backend-side reverse probe
service that attempts connections to the customer's public IP/ports on
request; that is specified as a Phase 1+ feature and will reuse this report
format with `direction: "inbound"`.
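
An outbound probe of this kind amounts to a timed TCP connect. The sketch below
produces one entry shaped like the `results` items in the heartbeat example. It
is an illustration, not the agent's prober: the function name `probe_tcp`, the
3 second default timeout, and omitting `latency_ms` on failure are assumptions.

```python
import socket
import time


def probe_tcp(name: str, host: str, port: int, timeout: float = 3.0) -> dict:
    """Attempt one outbound TCP connect and report success plus latency."""
    result = {"name": name, "host": host, "port": port, "ok": False}
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["ok"] = True
            result["latency_ms"] = round((time.monotonic() - started) * 1000)
    except OSError:
        # Refused, unreachable, or timed out: nothing was measured,
        # so latency_ms is left out instead of being invented.
        pass
    return result
```

A successful result only shows that the host can reach the target. It says
nothing about whether players can reach the host from outside.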

## Authentication & tenant isolation

The broker enforces per-license auth: an agent connects with `user = license_id`,
`password = HMAC-SHA256(license_id, NATS_TOKEN_SECRET)` (shown on the panel
Server page), and is scoped to `corrosion.{license_id}.>` only. The backend uses
a privileged internal user. This makes cross-tenant access impossible at the
broker, not just by convention.

**Reply-subject rule:** per-license users have NO `_INBOX` permission (granting
it would let one license read another's request-reply traffic). Therefore any
backend→agent request-reply MUST use a reply subject inside the license
namespace — e.g. `corrosion.{license_id}.reply.<id>` — never the client's
default global `_INBOX`. The agent is unaffected: it responds to whatever
`msg.reply` it receives. The constraint is on the requester (the internal user
has full access). The contract/CI tests run against an unauthenticated broker
and use the default inbox; production request-reply must follow this rule.
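
Both rules in this section are small enough to sketch. The code below is a
hedged illustration: `nats_password` and `reply_subject` are hypothetical
names, and the spec does not say which HMAC argument is the key or how the
digest is encoded. The sketch assumes `NATS_TOKEN_SECRET` is the key,
`license_id` is the message, and the output is lowercase hex. The random reply
id is likewise an assumption.

```python
import hashlib
import hmac
import uuid


def nats_password(license_id: str, token_secret: str) -> str:
    """Derive a per-license broker password.

    Assumption: secret is the HMAC key, license_id the message, hex output.
    """
    digest = hmac.new(token_secret.encode(), license_id.encode(), hashlib.sha256)
    return digest.hexdigest()


def reply_subject(license_id: str) -> str:
    """Build a reply subject inside the license namespace, never _INBOX."""
    return f"corrosion.{license_id}.reply.{uuid.uuid4().hex}"
```

The requester subscribes to the generated reply subject before publishing the
request, so the reply never leaves `corrosion.{license_id}.>`.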

## Versioning

- The agent embeds semver + git hash + build timestamp (`--version`,
  heartbeat `agent` block).
- Schema changes bump `schema` and are additive where possible.