feat(host-agent): Rust rewrite Phase 0 — multi-instance foundation, v2 wire protocol, real telemetry
All checks were successful
Test Asgard Runner / test (push) Successful in 3s
All checks were successful
Test Asgard Runner / test (push) Successful in 3s
New corrosion-host-agent/ crate (Go companion-agent stays as behavior
reference until parity). Wire protocol v2 per COA-B: instance-scoped
subjects corrosion.{license}.{instance}.* + host-level .host.* — spec
in PROTOCOL.md, designed for the license->host->instance fleet model.
- Multi-instance TOML config in the foundation, not retrofitted
- NATS layer on the Vigilance production profile (infinite reconnect,
capped backoff, 30s ping, 8192-msg offline buffer)
- Heartbeat with real sysinfo telemetry — Go agent shipped hardcoded
disk/cpu placeholders; this is the panel's first true Resources data
- Connectivity prober (outbound TCP, periodic + on-demand)
- Host cmd channel (ping/probe/sysinfo), going-offline beacon,
CancellationToken shutdown
- Live-fire verified against production NATS; artifacts: 3.7MB static
linux-musl, 3.8MB windows .exe (static CRT)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
143
corrosion-host-agent/PROTOCOL.md
Normal file
143
corrosion-host-agent/PROTOCOL.md
Normal file
@@ -0,0 +1,143 @@
|
||||
# Corrosion Wire Protocol v2
|
||||
|
||||
Status: **Phase 0 implemented** (host heartbeat, host commands, going-offline
|
||||
beacon). Per-instance command/status subjects are reserved and specified here
|
||||
for Phase 1.
|
||||
|
||||
## Design
|
||||
|
||||
One **host agent** per machine supervises **N game instances**. Subjects are
|
||||
scoped license-first, then by addressee:
|
||||
|
||||
```
|
||||
corrosion.{license_id}.host.* host-level (the agent itself)
|
||||
corrosion.{license_id}.{instance_id}.* instance-level (one game server)
|
||||
```
|
||||
|
||||
`instance_id` is a config-defined slug (`[a-z0-9_-]{1,64}`), validated at
|
||||
agent start. `host` is a reserved segment and can never be an instance id.
|
||||
Payloads are JSON. Every heartbeat carries `"schema": 2` so consumers can
|
||||
distinguish v2 from the legacy Go companion protocol (which used
|
||||
`corrosion.{license_id}.companion.heartbeat`, no schema field).
|
||||
|
||||
## Host-level subjects (Phase 0 — live)
|
||||
|
||||
### `corrosion.{license_id}.host.heartbeat` (agent → backend, publish)
|
||||
|
||||
Published every `heartbeat_seconds` (default 60, jittered ±20%).
|
||||
|
||||
```json
|
||||
{
|
||||
"schema": 2,
|
||||
"timestamp": "2026-06-11T18:00:00Z",
|
||||
"agent": {
|
||||
"version": "2.0.0-alpha.1",
|
||||
"commit": "a8722a7",
|
||||
"os": "linux",
|
||||
"arch": "x86_64",
|
||||
"uptime_seconds": 86400
|
||||
},
|
||||
"host": {
|
||||
"hostname": "asgard-01",
|
||||
"cpu_percent": 12.5,
|
||||
"cpu_cores": 80,
|
||||
"mem_total_mb": 262144,
|
||||
"mem_used_mb": 81920,
|
||||
"uptime_seconds": 1209600,
|
||||
"disks": [
|
||||
{ "mount": "/", "total_mb": 1907729, "free_mb": 1532211 }
|
||||
]
|
||||
},
|
||||
"instances": [
|
||||
{
|
||||
"id": "rust-main",
|
||||
"game": "rust",
|
||||
"label": "Main 2x Vanilla",
|
||||
"state": "configured",
|
||||
"root_disk_free_mb": 1532211
|
||||
}
|
||||
],
|
||||
"probe": {
|
||||
"timestamp": "2026-06-11T17:58:00Z",
|
||||
"results": [
|
||||
{ "name": "corrosion-cdn", "host": "cdn.corrosionmgmt.com", "port": 443, "ok": true, "latency_ms": 18 }
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
All telemetry is measured, never fabricated. Fields the agent cannot measure
|
||||
are omitted (`probe` before the first probe completes, `hostname` if
|
||||
unavailable).
|
||||
|
||||
Phase 0 instance `state` values: `configured` (root path exists),
|
||||
`missing_root`. Phase 1 adds live process states: `running`, `stopped`,
|
||||
`crashed`, `starting`, `updating`.
|
||||
|
||||
### `corrosion.{license_id}.host.cmd` (backend → agent, request-reply)
|
||||
|
||||
Request: `{ "func": "<name>" }`. Reply: `{ "status": "success" | "error", ... }`.
|
||||
|
||||
| func | Reply payload |
|
||||
| --------- | -------------------------------------------------------- |
|
||||
| `ping` | `version`, `commit`, `uptime_seconds` |
|
||||
| `probe` | `report` — fresh ProbeReport (also cached for heartbeat) |
|
||||
| `sysinfo` | `snapshot` — full heartbeat payload, collected on demand |
|
||||
|
||||
Unknown funcs return `status: "error"` with a message listing supported funcs.
|
||||
|
||||
### `corrosion.{license_id}.host.going_offline` (agent → backend, publish)
|
||||
|
||||
Best-effort beacon (500ms budget) on graceful shutdown so the panel can flip
|
||||
the host to offline immediately instead of waiting out heartbeat staleness.
|
||||
Payload: `{}`.
|
||||
|
||||
## Instance-level subjects (Phase 1 — reserved, not yet implemented)
|
||||
|
||||
### `corrosion.{license_id}.{instance_id}.cmd` (backend → agent, request-reply)
|
||||
|
||||
Lifecycle and control for one game instance. Planned funcs: `start`, `stop`,
|
||||
`restart`, `status`, `rcon` (process-class games), `steam_update`,
|
||||
`oxide_install` (rust), plus game-adapter-specific commands (Dune: docker
|
||||
lifecycle, RabbitMQ bus commands, Coriolis reset).
|
||||
|
||||
### `corrosion.{license_id}.{instance_id}.status` (agent → backend, publish)
|
||||
|
||||
State-change events (started/stopped/crashed) so the panel does not wait for
|
||||
the next heartbeat.
|
||||
|
||||
### `corrosion.{license_id}.{instance_id}.console` (agent → backend, publish)
|
||||
|
||||
Live console/log lines for the panel console view.
|
||||
|
||||
### `corrosion.{license_id}.{instance_id}.files.cmd` (backend → agent, request-reply)
|
||||
|
||||
VueFinder-style file manager ops, jailed to the instance root. Carries over
|
||||
the Go agent's jailed filemanager semantics (`fm_list`, `fm_save`, ...); the
|
||||
legacy UNJAILED `files.get/put/delete/list` API is retired and will not be
|
||||
ported.
|
||||
|
||||
## Backend mapping notes (Phase 0)
|
||||
|
||||
- The NestJS NATS bridge subscribes `corrosion.*.host.heartbeat` and
|
||||
`corrosion.*.host.going_offline`.
|
||||
- Until the license→host→instance schema lands, the backend may map the host
|
||||
heartbeat onto the existing single `server_connections` row per license:
|
||||
`companion_last_seen` ← heartbeat arrival, `connection_status` ←
|
||||
connected/offline, resources ← `host.cpu_percent` / `mem_*` / first disk.
|
||||
Instance-level mapping activates with the fleet schema.
|
||||
|
||||
## Probing — scope honesty
|
||||
|
||||
The Phase 0 prober measures **outbound** reachability from the host (TCP
|
||||
connect + latency). It cannot verify **inbound** port-forwarding (the thing
|
||||
players hit). Inbound verification requires a backend-side reverse probe
|
||||
service that attempts connections to the customer's public IP/ports on
|
||||
request; that is specified as a Phase 1+ feature and will reuse this report
|
||||
format with `direction: "inbound"`.
|
||||
|
||||
## Versioning
|
||||
|
||||
- The agent embeds semver + git hash + build timestamp (`--version`,
|
||||
heartbeat `agent` block).
|
||||
- Schema changes bump `schema` and are additive where possible.
|
||||
Reference in New Issue
Block a user