Per-instance ProcessSupervisor: tokio child spawn with proper arg list
(fixes Go's naive space-splitting), graceful SIGTERM with 30s budget
then force kill, monitor task classifying ordered-stop vs crash (exit
code captured), watch-channel state observable everywhere. Instance cmd
channel live on corrosion.{license}.{instance}.cmd (start/stop/restart/
status) with state events pushed on {instance}.status (keep-latest
semantics, documented). Heartbeats now carry live process state +
uptime per instance. Crate restructured lib+bin for integration tests.
Verified: 5 integration tests with real OS processes (lifecycle, crash
exit-code, restart recovery, unmanaged rejection, clean spawn failure)
+ live-NATS contract test (request-reply roundtrips, double-start
rejection, push events, heartbeat state) — all green.
Known limitation (documented): no PID adoption yet — agent restart
orphans a running game process to 'stopped' until panel restart.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Corrosion Wire Protocol v2
Status: Phase 0 + Phase 1 process control implemented (host heartbeat, host commands, going-offline beacon, per-instance start/stop/restart/status with push state events). RCON, SteamCMD, file ops, and game adapters are specified but not yet implemented.
Design
One host agent per machine supervises N game instances. Subjects are scoped license-first, then by addressee:
corrosion.{license_id}.host.*             host-level (the agent itself)
corrosion.{license_id}.{instance_id}.*    instance-level (one game server)
instance_id is a config-defined slug ([a-z0-9_-]{1,64}), validated at
agent start. host is a reserved segment and can never be an instance id.
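The subject scoping and slug rule above can be sketched as follows. This is a minimal illustration, not the agent's code: the function names (`validate_instance_id`, `host_subject`, `instance_subject`) are hypothetical, and only the regex, the reserved `host` segment, and the subject layout come from this document.

```python
import re

# Slug rule quoted from the spec: [a-z0-9_-]{1,64}
INSTANCE_ID_RE = re.compile(r"[a-z0-9_-]{1,64}")
# "host" addresses the agent itself, so it can never be an instance id.
RESERVED_SEGMENTS = {"host"}


def validate_instance_id(instance_id: str) -> str:
    """Return the id unchanged if it is a legal instance slug, else raise."""
    if instance_id in RESERVED_SEGMENTS:
        raise ValueError(f"{instance_id!r} is a reserved subject segment")
    if not INSTANCE_ID_RE.fullmatch(instance_id):
        raise ValueError(f"{instance_id!r} does not match [a-z0-9_-]{{1,64}}")
    return instance_id


def host_subject(license_id: str, leaf: str) -> str:
    """Build a host-level subject such as corrosion.{license_id}.host.heartbeat."""
    return f"corrosion.{license_id}.host.{leaf}"


def instance_subject(license_id: str, instance_id: str, leaf: str) -> str:
    """Build an instance-level subject, validating the slug first."""
    return f"corrosion.{license_id}.{validate_instance_id(instance_id)}.{leaf}"
```

Validating before building the subject keeps a slug containing `.` or `*` from changing the shape of the subject.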
Payloads are JSON. Every heartbeat carries "schema": 2 so consumers can
distinguish v2 from the legacy Go companion protocol (which used
corrosion.{license_id}.companion.heartbeat, no schema field).
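A consumer might tell the two protocols apart roughly like this. The function name `protocol_version` and the `"unknown"` fallback are assumptions made for illustration; the spec only states that v2 heartbeats carry `"schema": 2` and that legacy heartbeats arrive on the `companion.heartbeat` subject without a schema field.

```python
import json


def protocol_version(subject: str, raw: bytes) -> str:
    """Classify a heartbeat message as 'v2', 'legacy', or 'unknown'."""
    payload = json.loads(raw)
    if not isinstance(payload, dict):
        return "unknown"
    if payload.get("schema") == 2:
        return "v2"
    # Legacy Go companion: no schema field, published on ...companion.heartbeat
    is_legacy_subject = subject.split(".")[-2:] == ["companion", "heartbeat"]
    if "schema" not in payload and is_legacy_subject:
        return "legacy"
    return "unknown"
```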
Host-level subjects (Phase 0 — live)
corrosion.{license_id}.host.heartbeat (agent → backend, publish)
Published every heartbeat_seconds (default 60, jittered ±20%).
{
  "schema": 2,
  "timestamp": "2026-06-11T18:00:00Z",
  "agent": {
    "version": "2.0.0-alpha.1",
    "commit": "a8722a7",
    "os": "linux",
    "arch": "x86_64",
    "uptime_seconds": 86400
  },
  "host": {
    "hostname": "asgard-01",
    "cpu_percent": 12.5,
    "cpu_cores": 80,
    "mem_total_mb": 262144,
    "mem_used_mb": 81920,
    "uptime_seconds": 1209600,
    "disks": [
      { "mount": "/", "total_mb": 1907729, "free_mb": 1532211 }
    ]
  },
  "instances": [
    {
      "id": "rust-main",
      "game": "rust",
      "label": "Main 2x Vanilla",
      "state": "configured",
      "uptime_seconds": 0,
      "root_disk_free_mb": 1532211
    }
  ],
  "probe": {
    "timestamp": "2026-06-11T17:58:00Z",
    "results": [
      { "name": "corrosion-cdn", "host": "cdn.corrosionmgmt.com", "port": 443, "ok": true, "latency_ms": 18 }
    ]
  }
}
All telemetry is measured, never fabricated. Fields the agent cannot measure
are omitted (probe before the first probe completes, hostname if
unavailable).
Instance state values — process-managed (an executable is configured):
running, stopped, starting, stopping, crashed; unmanaged
(telemetry-only): configured (root exists), missing_root. Each instance
also reports uptime_seconds (0 unless running).
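The state vocabulary above splits cleanly into two groups, which a consumer could encode as below. This is a sketch under assumptions: `describe_instance` and its return shape are hypothetical, and treating a missing `uptime_seconds` as 0 is a defensive choice rather than something the spec promises.

```python
# State names are taken verbatim from the spec.
MANAGED_STATES = {"running", "stopped", "starting", "stopping", "crashed"}
UNMANAGED_STATES = {"configured", "missing_root"}


def describe_instance(instance: dict) -> dict:
    """Summarise one entry of the heartbeat's "instances" array."""
    state = instance["state"]
    if state in MANAGED_STATES:
        kind = "process-managed"
    elif state in UNMANAGED_STATES:
        kind = "unmanaged"
    else:
        raise ValueError(f"unknown instance state: {state!r}")
    # Per the spec, uptime is 0 unless the instance is running.
    uptime = instance.get("uptime_seconds", 0) if state == "running" else 0
    return {"id": instance["id"], "kind": kind, "uptime_seconds": uptime}
```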
corrosion.{license_id}.host.cmd (backend → agent, request-reply)
Request: { "func": "<name>" }. Reply: { "status": "success" | "error", ... }.
| func | Reply payload |
|---|---|
| ping | version, commit, uptime_seconds |
| probe | report: fresh ProbeReport (also cached for heartbeat) |
| sysinfo | snapshot: full heartbeat payload, collected on demand |
Unknown funcs return status: "error" with a message listing supported funcs.
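The request-reply contract can be modelled without a NATS connection as a bytes-in, bytes-out handler. The sketch below is illustrative only: `make_host_handler` is a hypothetical name, the exact wording of the error message is an assumption, and treating malformed JSON like an unknown func is a choice the spec does not cover.

```python
import json
from typing import Callable, Dict


def make_host_handler(handlers: Dict[str, Callable[[], dict]]):
    """Return a function mapping a raw request to a raw reply."""

    def handle(raw: bytes) -> bytes:
        try:
            func = json.loads(raw).get("func")
        except (ValueError, AttributeError):
            func = None  # malformed JSON or a non-object body
        handler = handlers.get(func) if isinstance(func, str) else None
        if handler is None:
            # Unknown funcs reply with an error that lists what is supported.
            supported = ", ".join(sorted(handlers))
            reply = {
                "status": "error",
                "message": f"unknown func {func!r}; supported: {supported}",
            }
        else:
            reply = {"status": "success", **handler()}
        return json.dumps(reply).encode()

    return handle
```

A `ping` handler would then be registered as `{"ping": lambda: {"version": ..., "commit": ..., "uptime_seconds": ...}}`, with `probe` and `sysinfo` added the same way.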
corrosion.{license_id}.host.going_offline (agent → backend, publish)
Best-effort beacon (500ms budget) on graceful shutdown so the panel can flip
the host to offline immediately instead of waiting out heartbeat staleness.
Payload: {}.
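On the backend side, the beacon and heartbeat staleness might combine as in the sketch below. The spec does not define a staleness threshold, so `MISSED_BEATS_ALLOWED` and the whole `HostLiveness` class are assumptions; only the 60 s default interval and the ±20% jitter come from this document.

```python
HEARTBEAT_SECONDS = 60      # default interval from the spec
JITTER = 0.20               # heartbeats are jittered by up to 20%
MISSED_BEATS_ALLOWED = 2    # assumption: tolerate two late beats before going stale


def stale_after(heartbeat_seconds: float = HEARTBEAT_SECONDS) -> float:
    """Seconds of silence after which a host counts as stale."""
    # Use the longest possible jittered interval so a slow beat is not a false alarm.
    return heartbeat_seconds * (1 + JITTER) * MISSED_BEATS_ALLOWED


class HostLiveness:
    """Tracks one host; times are plain seconds supplied by the caller."""

    def __init__(self) -> None:
        self.last_seen = None
        self.said_goodbye = False

    def on_heartbeat(self, now: float) -> None:
        self.last_seen = now
        self.said_goodbye = False

    def on_going_offline(self) -> None:
        # The beacon flips the host offline without waiting for staleness.
        self.said_goodbye = True

    def is_online(self, now: float) -> bool:
        if self.said_goodbye or self.last_seen is None:
            return False
        return now - self.last_seen <= stale_after()
```

Because the beacon is best-effort, the staleness check remains the fallback when the beacon never arrives.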
Instance-level subjects
corrosion.{license_id}.{instance_id}.cmd (backend → agent, request-reply) — LIVE
Lifecycle and control for one game instance.
Implemented funcs: start, stop (graceful with 30s budget, then force
kill), restart, status (returns state + uptime_seconds). Errors reply
{ "status": "error", "message": ... } — including start on an unmanaged
instance, double start, and unknown funcs.
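The error cases listed above can be captured in a small in-memory model. This sketch never spawns a process and is not the supervisor: `InstanceModel` is hypothetical, transient states are skipped, the success reply shape for `start`/`stop`/`restart` is an assumption (the spec leaves it open), and so is the choice to make `stop` on a stopped instance succeed.

```python
class InstanceModel:
    """Toy model of one instance's command channel."""

    def __init__(self, managed: bool) -> None:
        self.managed = managed
        self.state = "stopped" if managed else "configured"

    def _error(self, message: str) -> dict:
        return {"status": "error", "message": message}

    def handle(self, request: dict) -> dict:
        func = request.get("func")
        if func == "status":
            uptime = 0  # the model keeps no clock; real agents report live uptime
            return {"status": "success", "state": self.state, "uptime_seconds": uptime}
        if func not in ("start", "stop", "restart"):
            return self._error(f"unknown func {func!r}")
        if not self.managed:
            return self._error("instance is unmanaged (no executable configured)")
        if func == "start":
            if self.state in ("running", "starting"):
                return self._error("instance is already running")
            self.state = "running"
        elif func == "stop":
            self.state = "stopped"
        else:  # restart: stop, then start again
            self.state = "running"
        return {"status": "success", "state": self.state}
```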
Planned funcs: rcon (process-class games), steam_update, oxide_install
(rust), plus game-adapter-specific commands (Dune: docker lifecycle, RabbitMQ
bus commands, Coriolis reset).
corrosion.{license_id}.{instance_id}.status (agent → backend, publish) — LIVE
State-change events so the panel does not wait for the next heartbeat.
Payload: { "timestamp", "instance_id", "event": { "state": ..., "exit_code"? } }.
Semantics: keep-latest state sync, not a lossless transition ledger —
near-instant transient states (e.g. starting when spawn succeeds
immediately) may coalesce into the following state. Consumers should treat
each event as "current state is now X".
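A consumer honouring keep-latest semantics only needs to remember the newest event per instance. In the sketch below, `InstanceStateView` is a hypothetical name, and two points are assumptions: that event timestamps use the same UTC format as the heartbeat example, and that an event older than the stored one should be dropped.

```python
from datetime import datetime


def _parse(ts: str) -> datetime:
    # fromisoformat() on older Pythons rejects a trailing "Z", so normalise it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


class InstanceStateView:
    """Holds the latest known state per instance; keeps no history."""

    def __init__(self) -> None:
        self._latest = {}

    def apply(self, message: dict) -> bool:
        """Apply one status event. Returns False if it was older than what we hold."""
        instance_id = message["instance_id"]
        when = _parse(message["timestamp"])
        current = self._latest.get(instance_id)
        if current is not None and when < current["when"]:
            return False
        # "Current state is now X": overwrite, never append.
        self._latest[instance_id] = {"when": when, **message["event"]}
        return True

    def latest(self, instance_id: str):
        return self._latest.get(instance_id)

    def state_of(self, instance_id: str):
        entry = self._latest.get(instance_id)
        return entry["state"] if entry else None
```

Since transient states may coalesce, logic that waits to observe `starting` before `running` would be fragile; reading `state_of` after each event is the safer pattern.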
Known Phase 1 limitation: the supervisor does not yet persist/adopt PIDs — if
the agent itself restarts while a game server is running, the game process
survives but reports stopped until restarted through the panel. PID
adoption is queued with the service-install work.
corrosion.{license_id}.{instance_id}.console (agent → backend, publish)
Live console/log lines for the panel console view.
corrosion.{license_id}.{instance_id}.files.cmd (backend → agent, request-reply)
VueFinder-style file manager ops, jailed to the instance root. Carries over
the Go agent's jailed filemanager semantics (fm_list, fm_save, ...); the
legacy UNJAILED files.get/put/delete/list API is retired and will not be
ported.
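Since this channel is specified but not yet implemented, the following only illustrates what "jailed to the instance root" could mean in practice. `resolve_in_jail` is a hypothetical helper, not the Go agent's or the planned agent's code, and treating absolute request paths as relative to the root is an assumption.

```python
import os


def resolve_in_jail(instance_root: str, requested: str) -> str:
    """Resolve a requested path and refuse anything outside the instance root."""
    root = os.path.realpath(instance_root)
    # Strip leading separators so "/etc/passwd" means "<root>/etc/passwd".
    relative = requested.lstrip("/\\")
    # realpath collapses ".." and follows symlinks before the containment check.
    candidate = os.path.realpath(os.path.join(root, relative))
    if os.path.commonpath([root, candidate]) != root:
        raise PermissionError(f"{requested!r} escapes the instance root")
    return candidate
```

Checking containment after resolution, rather than scanning the string for `..`, also catches symlinks that point outside the root.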
Backend mapping notes (Phase 0)
- The NestJS NATS bridge subscribes to corrosion.*.host.heartbeat and corrosion.*.host.going_offline.
- Until the license→host→instance schema lands, the backend may map the host heartbeat onto the existing single server_connections row per license: companion_last_seen ← heartbeat arrival, connection_status ← connected/offline, resources ← host.cpu_percent / mem_* / first disk. Instance-level mapping activates with the fleet schema.
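The interim mapping could look like the sketch below. The column names `companion_last_seen` and `connection_status` come from the notes above; the function name and the shape of the `resources` value are hypothetical, since the actual row schema is not shown here.

```python
def heartbeat_to_connection_row(heartbeat: dict, arrived_at: str) -> dict:
    """Project a v2 host heartbeat onto the single per-license row."""
    host = heartbeat.get("host", {})
    disks = host.get("disks") or []
    first_disk = disks[0] if disks else {}
    return {
        "companion_last_seen": arrived_at,   # heartbeat arrival time
        "connection_status": "connected",    # "offline" is set on going_offline or staleness
        "resources": {                       # hypothetical shape
            "cpu_percent": host.get("cpu_percent"),
            "mem_total_mb": host.get("mem_total_mb"),
            "mem_used_mb": host.get("mem_used_mb"),
            "disk_total_mb": first_disk.get("total_mb"),
            "disk_free_mb": first_disk.get("free_mb"),
        },
    }
```

Every lookup uses `.get`, because the agent omits fields it cannot measure; absent values surface as `None` rather than raising.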
Probing — scope honesty
The Phase 0 prober measures outbound reachability from the host (TCP
connect + latency). It cannot verify inbound port-forwarding (the thing
players hit). Inbound verification requires a backend-side reverse probe
service that attempts connections to the customer's public IP/ports on
request; that is specified as a Phase 1+ feature and will reuse this report
format with direction: "inbound".
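An outbound TCP probe of the kind described is simple to sketch with the standard library. `probe_tcp` is a hypothetical name, the 3-second timeout is an assumption, and omitting `latency_ms` on failure follows the "omit what cannot be measured" rule by analogy rather than from a stated field contract.

```python
import socket
import time


def probe_tcp(name: str, host: str, port: int, timeout: float = 3.0) -> dict:
    """Attempt one outbound TCP connect and report reachability plus latency."""
    result = {"name": name, "host": host, "port": port}
    started = time.monotonic()
    try:
        # A completed handshake is all this measures; no bytes are exchanged.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - started
    except OSError:
        result["ok"] = False
        return result
    result["ok"] = True
    result["latency_ms"] = round(elapsed * 1000)
    return result
```

This runs on the host and dials out, so a successful result says nothing about whether players can reach the host from outside.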
Versioning
- The agent embeds semver + git hash + build timestamp (--version, heartbeat agent block).
- Schema changes bump schema and are additive where possible.