Files
corrosion-admin-panel/corrosion-host-agent/PROTOCOL.md
Vantz Stockwell fde0926d52 feat(host-agent): Phase 1b RCON — WebRCON (rust) + Source RCON (conan/soulmask)
rcon func on the instance command channel: WebSocket JSON WebRCON with
Identifier correlation (skips chat/log noise frames) and full Valve
Source RCON over TCP (auth, exec, multi-packet reassembly via empty
probe, 1MiB cap). Protocol inferred from game, explicit kind override
in [instance.rcon]. Always 127.0.0.1 — agent is co-located.

Hardening from review: WebRCON password never interpolated into error
contexts/logs (redacted URL); probe-tolerant termination — a quiet
period after received data ends the response for servers that don't
echo the probe (Soulmask conformance unverified), so data is never
discarded on probe timeout.

13/13 tests green incl. mock Source-RCON server (auth/multi-packet/
errors) and mock WebRCON server (noise-frame skipping).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 10:53:52 -04:00

167 lines
6.4 KiB
Markdown

# Corrosion Wire Protocol v2
Status: **Phase 0 + Phase 1 process control implemented** (host heartbeat,
host commands, going-offline beacon, per-instance start/stop/restart/status
with push state events). RCON, SteamCMD, file ops, and game adapters are
specified but not yet implemented.
## Design
One **host agent** per machine supervises **N game instances**. Subjects are
scoped license-first, then by addressee:
```
corrosion.{license_id}.host.* host-level (the agent itself)
corrosion.{license_id}.{instance_id}.* instance-level (one game server)
```
`instance_id` is a config-defined slug (`[a-z0-9_-]{1,64}`), validated at
agent start. `host` is a reserved segment and can never be an instance id.
Payloads are JSON. Every heartbeat carries `"schema": 2` so consumers can
distinguish v2 from the legacy Go companion protocol (which used
`corrosion.{license_id}.companion.heartbeat`, no schema field).
## Host-level subjects (Phase 0 — live)
### `corrosion.{license_id}.host.heartbeat` (agent → backend, publish)
Published every `heartbeat_seconds` (default 60, jittered ±20%).
```json
{
"schema": 2,
"timestamp": "2026-06-11T18:00:00Z",
"agent": {
"version": "2.0.0-alpha.1",
"commit": "a8722a7",
"os": "linux",
"arch": "x86_64",
"uptime_seconds": 86400
},
"host": {
"hostname": "asgard-01",
"cpu_percent": 12.5,
"cpu_cores": 80,
"mem_total_mb": 262144,
"mem_used_mb": 81920,
"uptime_seconds": 1209600,
"disks": [
{ "mount": "/", "total_mb": 1907729, "free_mb": 1532211 }
]
},
"instances": [
{
"id": "rust-main",
"game": "rust",
"label": "Main 2x Vanilla",
"state": "configured",
"root_disk_free_mb": 1532211
}
],
"probe": {
"timestamp": "2026-06-11T17:58:00Z",
"results": [
{ "name": "corrosion-cdn", "host": "cdn.corrosionmgmt.com", "port": 443, "ok": true, "latency_ms": 18 }
]
}
}
```
All telemetry is measured, never fabricated. Fields the agent cannot measure
are omitted (`probe` before the first probe completes, `hostname` if
unavailable).
Instance `state` values — process-managed (an `executable` is configured):
`running`, `stopped`, `starting`, `stopping`, `crashed`; unmanaged
(telemetry-only): `configured` (root exists), `missing_root`. Each instance
also reports `uptime_seconds` (0 unless running).
### `corrosion.{license_id}.host.cmd` (backend → agent, request-reply)
Request: `{ "func": "<name>" }`. Reply: `{ "status": "success" | "error", ... }`.
| func | Reply payload |
| --------- | -------------------------------------------------------- |
| `ping` | `version`, `commit`, `uptime_seconds` |
| `probe` | `report` — fresh ProbeReport (also cached for heartbeat) |
| `sysinfo` | `snapshot` — full heartbeat payload, collected on demand |
Unknown funcs return `status: "error"` with a message listing supported funcs.
### `corrosion.{license_id}.host.going_offline` (agent → backend, publish)
Best-effort beacon (500ms budget) on graceful shutdown so the panel can flip
the host to offline immediately instead of waiting out heartbeat staleness.
Payload: `{}`.
## Instance-level subjects
### `corrosion.{license_id}.{instance_id}.cmd` (backend → agent, request-reply) — LIVE
Lifecycle and control for one game instance.
Implemented funcs: `start`, `stop` (graceful with 30s budget, then force
kill), `restart`, `status` (returns `state` + `uptime_seconds`), and
`rcon``{ "func": "rcon", "command": "<console command>" }` returns
`{ "status": "success", "output": <server response> }`. Protocol per game:
WebRCON (WebSocket JSON) for rust, Source RCON (Valve TCP) for
conan/soulmask; explicit `kind` override available in the instance's
`[instance.rcon]` config. Always targets 127.0.0.1 (agent is co-located).
Errors reply `{ "status": "error", "message": ... }` — including start on an
unmanaged instance, double start, missing rcon config, and unknown funcs.
Planned funcs: `steam_update`, `oxide_install` (rust), plus
game-adapter-specific commands (Dune: docker lifecycle, RabbitMQ bus
commands, Coriolis reset).
### `corrosion.{license_id}.{instance_id}.status` (agent → backend, publish) — LIVE
State-change events so the panel does not wait for the next heartbeat.
Payload: `{ "timestamp", "instance_id", "event": { "state": ..., "exit_code"? } }`.
Semantics: **keep-latest state sync**, not a lossless transition ledger —
near-instant transient states (e.g. `starting` when spawn succeeds
immediately) may coalesce into the following state. Consumers should treat
each event as "current state is now X".
Known Phase 1 limitation: the supervisor does not yet persist/adopt PIDs — if
the agent itself restarts while a game server is running, the game process
survives but reports `stopped` until restarted through the panel. PID
adoption is queued with the service-install work.
### `corrosion.{license_id}.{instance_id}.console` (agent → backend, publish)
Live console/log lines for the panel console view.
### `corrosion.{license_id}.{instance_id}.files.cmd` (backend → agent, request-reply)
VueFinder-style file manager ops, jailed to the instance root. Carries over
the Go agent's jailed filemanager semantics (`fm_list`, `fm_save`, ...); the
legacy UNJAILED `files.get/put/delete/list` API is retired and will not be
ported.
## Backend mapping notes (Phase 0)
- The NestJS NATS bridge subscribes `corrosion.*.host.heartbeat` and
`corrosion.*.host.going_offline`.
- Until the license→host→instance schema lands, the backend may map the host
heartbeat onto the existing single `server_connections` row per license:
`companion_last_seen` ← heartbeat arrival, `connection_status`
connected/offline, resources ← `host.cpu_percent` / `mem_*` / first disk.
Instance-level mapping activates with the fleet schema.
## Probing — scope honesty
The Phase 0 prober measures **outbound** reachability from the host (TCP
connect + latency). It cannot verify **inbound** port-forwarding (the thing
players hit). Inbound verification requires a backend-side reverse probe
service that attempts connections to the customer's public IP/ports on
request; that is specified as a Phase 1+ feature and will reuse this report
format with `direction: "inbound"`.
## Versioning
- The agent embeds semver + git hash + build timestamp (`--version`,
heartbeat `agent` block).
- Schema changes bump `schema` and are additive where possible.