Vantz Stockwell 3e8b29f2ee

feat: Implement Phase 2 alerting system with anomaly detection

Proactive monitoring infrastructure for server health:

**Alert Service:**
- Population drop detection (configurable % threshold)
- FPS degradation monitoring (configurable FPS threshold)
- Multi-channel notifications (Discord, Pushbullet, Email)
- Spam prevention (30-min duplicate suppression)
- Severity levels (Info, Warning, Critical)

**Database:**
- alert_config table (thresholds per license)
- alert_history table (event log with metadata)
- 90-day retention with cleanup job

**Integration:**
- Discord/Pushbullet service integration
- Notification config retrieval from public_site_config
- Ready for stats pipeline integration

Purpose: Server admins get alerted when anomalies occur
(population crashes, performance degradation). Configurable
thresholds enable proactive server management.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-15 14:28:51 -05:00

CHANGELOG — Corrosion Admin Panel

All notable changes to this project will be documented in this file.

[Unreleased]

Added (Phase 2 — Alerting System)

Backend:

Migration 008: Alert configuration and history tables
- alert_config table with threshold settings per license (population drop %, FPS threshold)
- alert_history table logging all triggered alerts with metadata
- Default alert config created for all existing licenses
Alert service (services/alerting.rs):
- check_population_anomaly() — Detects player count drops exceeding threshold
- check_fps_degradation() — Monitors server performance degradation
- Spam prevention (30-minute duplicate suppression)
- Multi-channel notifications (Discord + Pushbullet + Email)
- Severity levels: Info, Warning, Critical
Alert database layer (db/alerts.rs):
- get_alert_config() / update_alert_config() — Threshold configuration
- insert_alert() / mark_alert_notified() — Alert history tracking
- check_recent_alert() — Duplicate detection
- cleanup_old_alerts() — 90-day retention cleanup
Updated db/notifications.rs — Notification config retrieval with webhook/API key support
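
The 30-minute duplicate suppression listed above could be sketched as follows. This is a minimal in-memory illustration, not the project's code: the changelog only says that check_recent_alert() performs duplicate detection, so the `AlertSuppressor` type, its fields, and the (license id, alert type) key are all assumptions, and the real check presumably queries alert_history instead of a map.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical in-memory stand-in for a "was this alert sent recently?" lookup.
pub struct AlertSuppressor {
    window: Duration,
    // (license id, alert type) -> time the alert was last sent (assumed key shape)
    last_sent: HashMap<(u64, String), Instant>,
}

impl AlertSuppressor {
    pub fn new(window: Duration) -> Self {
        Self { window, last_sent: HashMap::new() }
    }

    /// Returns true if the alert should go out, recording `now` as its send time.
    /// Returns false if the same alert was already sent within the window.
    pub fn should_send(&mut self, license_id: u64, alert_type: &str, now: Instant) -> bool {
        let key = (license_id, alert_type.to_string());
        if let Some(&last) = self.last_sent.get(&key) {
            if now.duration_since(last) < self.window {
                return false; // duplicate inside the suppression window
            }
        }
        self.last_sent.insert(key, now);
        true
    }
}
```

With a window of `Duration::from_secs(30 * 60)`, a second population-drop alert for the same license ten minutes after the first is suppressed, while a different alert type for that license still goes out.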

Alert Types:

Population Drop — Triggers when player count drops >X% in 1 hour
FPS Degradation — Triggers when FPS falls below configurable threshold
Server Crash — Critical alert for auto-recovery failures
Wipe Failed — Alert when wipe execution fails

Purpose: Proactive monitoring for server health issues. Alerts server admins via Discord/Pushbullet when anomalies are detected (population crashes, performance degradation). Thresholds are configurable per license.
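
The two threshold checks described above can be illustrated with a short sketch. The function names and signatures below are assumptions for illustration and are not the actual check_population_anomaly() or check_fps_degradation() implementations; the sketch only shows the comparison each alert type describes.

```rust
/// Percentage drop from `previous` to `current`.
/// Returns 0.0 when there is no baseline or the player count did not fall.
pub fn population_drop_pct(previous: u32, current: u32) -> f64 {
    if previous == 0 || current >= previous {
        return 0.0;
    }
    f64::from(previous - current) * 100.0 / f64::from(previous)
}

/// True when the drop between two samples exceeds the configured threshold (in %).
/// The caller is assumed to supply samples taken one hour apart.
pub fn is_population_anomaly(previous: u32, current: u32, threshold_pct: f64) -> bool {
    population_drop_pct(previous, current) > threshold_pct
}

/// True when the reported FPS is below the configured FPS threshold.
pub fn is_fps_degraded(fps: f64, fps_threshold: f64) -> bool {
    fps < fps_threshold
}
```

For example, a fall from 100 to 60 players is a 40% drop, which trips a 30% threshold; a fall from 100 to 80 does not.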

Added (Phase 2 — Wipe Performance Analytics)

Backend:

backend/src/db/wipes.rs — Comprehensive wipe analytics query layer:
- get_wipe_success_rate() — Success vs failure rate over time range
- get_average_wipe_duration() — Average execution time for successful wipes
- get_wipe_to_peak_population() — Hours from wipe completion to peak player count (24h window)
- get_population_curve_by_cycle() — Day 1 vs Day 2 vs Day 3 average player counts post-wipe
- get_optimal_wipe_timing() — Recommends best day of week + hour based on historical peak populations
- get_wipe_analytics_entries() — Detailed per-wipe records for charting (duration, peak pop, success)
- All queries use hourly aggregates (server_stats_hourly) with 90-day retention
backend/src/api/analytics.rs — Wipe performance endpoint:
- GET /api/analytics/wipes/performance?range=90d — Returns full wipe performance metrics
- Supports range params: 6d, 12d, 90d, all (converted to wipe count estimates)
- Response includes: success rate, avg duration, population curve, optimal timing, individual wipe entries
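
As a rough illustration of the first two metrics, the sketch below computes a success rate and an average duration over in-memory records. The `WipeRecord` struct and both function names are hypothetical; the actual queries in backend/src/db/wipes.rs run against the database and may define these metrics differently.

```rust
/// Hypothetical per-wipe record; field names are assumptions.
pub struct WipeRecord {
    pub success: bool,
    pub duration_secs: u64,
}

/// Share of wipes that succeeded, in percent. None when there are no wipes.
pub fn success_rate_pct(wipes: &[WipeRecord]) -> Option<f64> {
    if wipes.is_empty() {
        return None;
    }
    let ok = wipes.iter().filter(|w| w.success).count();
    Some(ok as f64 * 100.0 / wipes.len() as f64)
}

/// Mean duration of successful wipes only, in seconds. None when none succeeded.
pub fn avg_successful_duration_secs(wipes: &[WipeRecord]) -> Option<f64> {
    let durations: Vec<u64> = wipes
        .iter()
        .filter(|w| w.success)
        .map(|w| w.duration_secs)
        .collect();
    if durations.is_empty() {
        return None;
    }
    Some(durations.iter().sum::<u64>() as f64 / durations.len() as f64)
}
```

Failed wipes are excluded from the duration average because the entry above describes it as the average execution time for successful wipes.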

Frontend:

WipeAnalyticsView.vue — Complete wipe performance dashboard:
- ECharts Visualizations:
  - Wipe success timeline (scatter plot: green = success, red = failed)
  - Population curve bar chart (Day 1/Day 2/Day 3 average players post-wipe)
  - Wipe duration trend (line chart showing execution time evolution)
- Insight Cards:
  - Success rate percentage with total wipe count
  - Average wipe duration (formatted as minutes:seconds)
  - Peak population day identifier
  - Optimal wipe timing recommendation (day + hour)
- Actionable Recommendations Banner:
  - Optimal wipe day/hour based on post-wipe player peaks
  - Weekly vs bi-weekly wipe suggestion (if Day 1 >> Day 2 population)
  - Duration optimization alerts (if avg > 10 minutes)
  - Rollback protection warnings (if failures detected)
- Time range selector: Last 6 wipes / Last 12 wipes / All time
- CSV export functionality
Added route /wipes/analytics to router
TypeScript interfaces: WipePerformanceMetrics, WipeAnalyticsEntry, PopulationCurve

Purpose: Answers critical questions: "How long do wipes take? When do players peak post-wipe? What's my success rate? When should I schedule wipes for max population?" Enables data-driven wipe timing optimization and operational insights.
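
The dashboard rules listed above (minutes:seconds formatting, the 10-minute duration alert, the failure warning, and the Day 1 vs Day 2 comparison) might look roughly like this. The dashboard itself is a Vue component, so this Rust version is only an illustration of the rules. The enum, the function names, and the `day1_to_day2_ratio` parameter are assumptions, since the changelog does not say how large the Day 1 to Day 2 gap must be.

```rust
/// Formats a duration in seconds as minutes:seconds, e.g. 605 -> "10:05".
pub fn format_duration(secs: u64) -> String {
    format!("{}:{:02}", secs / 60, secs % 60)
}

/// Hypothetical labels for the recommendation banner.
#[derive(Debug, PartialEq)]
pub enum Recommendation {
    CadenceReview,        // Day 1 population far above Day 2
    DurationOptimization, // average wipe takes longer than 10 minutes
    RollbackWarning,      // at least one failed wipe in the range
}

pub fn recommendations(
    avg_duration_secs: u64,
    failed_wipes: usize,
    day1_avg_players: f64,
    day2_avg_players: f64,
    day1_to_day2_ratio: f64, // assumed cutoff for "Day 1 >> Day 2", chosen by the caller
) -> Vec<Recommendation> {
    let mut out = Vec::new();
    if day1_avg_players > day2_avg_players * day1_to_day2_ratio {
        out.push(Recommendation::CadenceReview);
    }
    if avg_duration_secs > 10 * 60 {
        out.push(Recommendation::DurationOptimization);
    }
    if failed_wipes > 0 {
        out.push(Recommendation::RollbackWarning);
    }
    out
}
```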

Added (Phase 3 — Public Status Page)

Backend:

Migration 007: Added status_page_description TEXT column to public_site_config
Public API models (models/public.rs):
- PublicServerStatus — Server status with live stats for public display
- PlatformHealth — Platform-wide health metrics (total servers, online count, total players, uptime)
- StatusPageResponse — Complete status page data structure
- PublicSiteConfig — Full public site configuration model
Public database queries (db/public.rs):
- get_public_servers() — Retrieves all opted-in servers with current stats, uptime percentages (24h/7d/30d), wipe schedules
- get_platform_health() — Calculates platform-wide aggregate metrics
- calculate_uptime_percentage() — Uptime calculation from hourly stats
- format_cron_expression() — Human-readable wipe schedule formatting
- get_public_site_config() / create_public_site_config() / update_public_site_config() — Config management
Public API endpoint (api/public.rs):
- GET /api/public/status — Public status page data (no auth required)
Settings API (api/settings.rs):
- GET /api/settings/public-site — Fetch public site config (auth required)
- PUT /api/settings/public-site — Update status page opt-in and description (auth required)
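
One plausible way to derive the 24h/7d/30d uptime badges from hourly stats is sketched below. It assumes each hourly row can be reduced to an online/offline flag and that hours with no data are simply absent from the input; neither detail is confirmed by the changelog, and the signature is not that of calculate_uptime_percentage().

```rust
/// Uptime over the most recent `window_hours` entries, in percent.
/// `hourly_online` holds one flag per hour, oldest first (assumed representation).
/// Returns None when the window contains no data.
pub fn uptime_percentage(hourly_online: &[bool], window_hours: usize) -> Option<f64> {
    // Keep only the trailing window; shorter histories use whatever is available.
    let start = hourly_online.len().saturating_sub(window_hours);
    let window = &hourly_online[start..];
    if window.is_empty() {
        return None;
    }
    let up = window.iter().filter(|&&online| online).count();
    Some(up as f64 * 100.0 / window.len() as f64)
}
```

Calling it with windows of 24, 168, and 720 hours would yield the three badge values from the same series.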

Frontend:

StatusPageView.vue — Complete public status page with:
- Platform health header (total servers, online now, total players, platform uptime)
- Server grid with status indicators (green/yellow/red), player counts, uptime badges (24h/7d/30d)
- Wipe schedule display with countdown timers
- Server search/filter functionality
- Auto-refresh every 10 seconds via polling
- Mobile-responsive grid layout
- "Powered by Corrosion" footer with panel link
Settings dashboard integration (SettingsView.vue):
- New "Public Status" tab with toggle for show_on_status_page
- Text area for status_page_description
- Save endpoint integration

Infrastructure:

nginx already configured for status.corrosionmgmt.com routing
Router already configured with /status route on both panel and marketing domains

Purpose: Public-facing marketing page showcasing all Corrosion servers. Drives platform visibility and attracts new customers ("I want this for my server too").

Added (Phase 2.2 — Player Retention Analytics)

Backend:

Migration 004_player_sessions.sql — Player session tracking table with indexes for retention queries
backend/src/db/player_sessions.rs — Complete player session tracking and retention analysis:
- track_player_join() / track_player_leave() — Record individual player sessions
- calculate_retention_after_wipe() — Calculate 24h/48h/72h return rates per wipe
- get_unique_player_count() / get_avg_session_duration() — Session metrics
- get_new_vs_returning_ratio() — New vs returning player analysis
- get_recent_wipe_retention_metrics() — Multi-wipe retention trends
- cleanup_old_player_sessions() — 90-day retention cleanup
backend/src/api/plugin.rs — Plugin event endpoints:
- POST /api/plugin/player-event — Track player join/leave events
- POST /api/plugin/checkin — Plugin registration on server start
Extended backend/src/api/analytics.rs with retention endpoints:
- GET /api/analytics/retention?wipe_count=6 — Multi-wipe retention metrics
- GET /api/analytics/retention/export — CSV export of retention data
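
The return-rate idea can be sketched under one possible definition: the cohort is every player who joined within the first 24 hours after a wipe, and a player counts as returned at N hours if they started another session at or after wipe time plus N hours. That definition, the `Session` struct, and the function name are assumptions; calculate_retention_after_wipe() may define the cohort or the return window differently.

```rust
use std::collections::HashSet;

/// Hypothetical session row: player id and join time in Unix seconds.
pub struct Session {
    pub player_id: u64,
    pub joined_at: u64,
}

/// Percentage of the wipe-day cohort that joined again at or after
/// `wipe_at + after_hours`. None when nobody joined in the first 24 hours.
pub fn return_rate_pct(sessions: &[Session], wipe_at: u64, after_hours: u64) -> Option<f64> {
    const HOUR: u64 = 3600;
    // Cohort: distinct players seen in the first 24 hours after the wipe.
    let cohort: HashSet<u64> = sessions
        .iter()
        .filter(|s| s.joined_at >= wipe_at && s.joined_at < wipe_at + 24 * HOUR)
        .map(|s| s.player_id)
        .collect();
    if cohort.is_empty() {
        return None;
    }
    let cutoff = wipe_at + after_hours * HOUR;
    // Returned: cohort members with any session starting at or after the cutoff.
    let returned: HashSet<u64> = sessions
        .iter()
        .filter(|s| s.joined_at >= cutoff && cohort.contains(&s.player_id))
        .map(|s| s.player_id)
        .collect();
    Some(returned.len() as f64 * 100.0 / cohort.len() as f64)
}
```

Calling it with `after_hours` set to 24, 48, and 72 gives the three points of one wipe's retention curve.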

Frontend:

PlayerRetentionView.vue — Complete retention analytics dashboard:
- ECharts retention curve (24h/48h/72h lines across multiple wipes)
- Summary cards: unique players, avg session duration, new vs returning ratio
- Wipe selector (last 3/6/10/20 wipes)
- Detailed wipe table with retention percentages
- CSV export functionality
Added route /retention to router
TypeScript interfaces: WipeRetentionMetric, SessionSummary, RetentionResponse

Plugin:

Updated CorrosionCompanion.cs to track player events via /api/plugin/player-event
Modified OnPlayerConnected / OnPlayerDisconnected hooks with license_key authentication

Purpose: Answers the critical question: "What percentage of players return 24h/48h/72h after a wipe?" Enables data-driven wipe timing optimization and player retention analysis.

Added (Phase 2.2 — Map Analytics System)

Backend:

Migration 005: Added map_id FK to server_stats and wipe_history for map effectiveness tracking
Stats consumer now captures current_map_id from server_config when persisting stats
Map analytics database queries (db/maps.rs):
- get_map_analytics() — Returns performance metrics per map (avg/peak players, times used, effectiveness score)
- get_map_population_trends() — Player count trends per map over wipe cycles
- Effectiveness scoring algorithm: (avg_players / peak_players) * 100
Analytics API endpoint (api/analytics.rs):
- GET /api/analytics/maps?range=90d — Map performance summary with rotation effectiveness

Frontend:

MapAnalyticsView.vue — Complete map effectiveness dashboard with:
- Summary cards: Best performing map, rotation effectiveness %, total maps tracked
- ECharts bar chart comparing avg vs peak players per map
- Sortable performance table with effectiveness color coding (green ≥80%, yellow ≥60%, red <60%)
- Actionable insights section recommending rotation improvements
- CSV export functionality
- Time range selector (30d/90d/all)
TypeScript types: MapPerformanceMetrics, MapAnalyticsSummary
Router: Added /maps/analytics route under admin dashboard

Purpose: Answers "Which maps drive the most players? Is my rotation working?" Enables data-driven map selection for wipe day.
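
The effectiveness score and the table's color bands can be written out as a small sketch. The formula and the 80%/60% cutoffs come from the entries above; the function names, the `Band` enum, and the handling of a zero peak are assumptions.

```rust
/// Effectiveness score: (avg_players / peak_players) * 100.
/// Multiplying before dividing is the same formula and keeps round inputs exact.
/// A map with no recorded peak scores 0.0 (assumed convention).
pub fn effectiveness_score(avg_players: f64, peak_players: f64) -> f64 {
    if peak_players <= 0.0 {
        return 0.0;
    }
    avg_players * 100.0 / peak_players
}

/// Color band used in the performance table.
#[derive(Debug, PartialEq)]
pub enum Band {
    Green,  // score >= 80
    Yellow, // score >= 60
    Red,    // score < 60
}

pub fn band(score: f64) -> Band {
    if score >= 80.0 {
        Band::Green
    } else if score >= 60.0 {
        Band::Yellow
    } else {
        Band::Red
    }
}
```

A map averaging 45 players against a peak of 50 scores 90 and lands in the green band; 30 against 50 scores 60, which is the lower edge of yellow.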

Added (Phase 2 — Data Aggregation Pipeline)

Backend:

Stats ingestion consumer service (stats_consumer.rs) subscribing to corrosion.*.stats NATS subject
Complete stats database queries (db/stats.rs) with support for:
- Raw stats insertion and retrieval
- Hourly aggregation queries
- Analytics summary calculations (peak/avg players, uptime)
- Data retention cleanup (7 days raw, 90 days hourly)
Hourly stats aggregation scheduler job (runs at :05 past every hour)
Daily cleanup scheduler job (runs at 03:00 UTC)
Analytics API endpoints (api/analytics.rs):
- GET /api/analytics/summary — Peak/avg players, uptime percentage
- GET /api/analytics/timeseries — Time-series data for charting (hourly/raw granularity)
- GET /api/analytics/export — CSV export of server stats
Background service initialization in main.rs (stats consumer + scheduler)

Frontend:

Analytics TypeScript types (AnalyticsSummary, TimeseriesData, HourlyStats)
Complete AnalyticsView.vue implementation with:
- Real-time data fetching from analytics API
- Apache ECharts integration for Player Count and Server Performance charts
- Time range selector (24h/7d/30d)
- CSV export functionality
- Loading states and responsive layout

Infrastructure:

Made NatsBridge.jetstream public for service consumer access

Added (Sovereign Infrastructure Stack)

Services Deployed:

Gitea (git.corrosionmgmt.com) — Self-hosted Git with Actions support
- Container: corrosion-gitea on port 8090 (HTTP) and 8095 (SSH)
- SQLite database (self-contained, persistent)
- Replaces GitHub dependency for source control
- Gitea Actions enabled for CI/CD
SeaweedFS (cdn.corrosionmgmt.com) — S3-compatible object storage and CDN
- Container: corrosion-cdn with integrated Master/Volume/Filer/S3
- Filer UI at port 8091 (cdn.corrosionmgmt.com)
- Master UI at port 8093 (admin.cdn.corrosionmgmt.com)
- S3 API at port 8092 (internal access)
- Purpose: Map hosting, plugin packages, companion binaries, backups
Gitea Act Runner (asgard build server) — CI/CD execution environment
- Runs on Ryzen 9 7945HX (16C/32T, 64GB DDR5)
- Docker-based job execution
- Go 1.21+ and Rust toolchains available
- Connects to public Gitea instance remotely

CI/CD Workflows:

test-runner.yml — Runner capability validation (hostname, resources, toolchains)
build-companion.yml — Production companion agent build pipeline:
- Triggers on version tags (v*.*.*)
- Cross-compiles for Linux AMD64 and Windows AMD64
- Generates SHA256 checksums
- Creates Gitea release with auto-generated installation instructions
- Uploads binaries and checksums as release assets

Documentation:

infra/docker-compose.yml — Infrastructure stack definition
infra/README.md — Deployment guide and architecture overview
infra/NPM-CONFIG.md — Nginx Proxy Manager configuration
infra/ASGARD-RUNNER.md — Act runner setup guide

Repository Migration:

Migrated from GitHub to self-hosted Gitea
Remote updated to git@git.corrosionmgmt.com:vantzs/corrosion-admin-panel.git
All future development on sovereign infrastructure

Technical Details

Data Flow:

Plugin/Agent publishes stats (60s interval)
  → NATS JetStream (corrosion.*.stats)
  → StatsConsumerService persists to server_stats table
  → Hourly aggregation job rolls up to server_stats_hourly
  → Analytics API queries aggregated data
  → Frontend renders charts via ECharts
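
The hourly rollup step in this flow could be sketched as below. It is an in-memory illustration only: the real job presumably aggregates in SQL from server_stats into server_stats_hourly, and the `RawStat` and `HourlyStat` structs and their fields (including FPS) are assumed column shapes, not the actual schema.

```rust
use std::collections::BTreeMap;

/// Hypothetical raw sample, published roughly every 60 seconds.
pub struct RawStat {
    pub ts: u64, // Unix seconds
    pub players: u32,
    pub fps: f64,
}

/// Hypothetical hourly aggregate row.
#[derive(Debug, PartialEq)]
pub struct HourlyStat {
    pub hour_start: u64, // Unix seconds, truncated to the hour
    pub avg_players: f64,
    pub peak_players: u32,
    pub avg_fps: f64,
    pub samples: u32,
}

/// Groups raw samples by hour and reduces each group to one aggregate row,
/// returned in ascending hour order.
pub fn rollup_hourly(raw: &[RawStat]) -> Vec<HourlyStat> {
    let mut buckets: BTreeMap<u64, Vec<&RawStat>> = BTreeMap::new();
    for s in raw {
        buckets.entry(s.ts - s.ts % 3600).or_default().push(s);
    }
    buckets
        .into_iter()
        .map(|(hour_start, rows)| {
            let n = rows.len() as f64;
            HourlyStat {
                hour_start,
                avg_players: rows.iter().map(|r| f64::from(r.players)).sum::<f64>() / n,
                peak_players: rows.iter().map(|r| r.players).max().unwrap_or(0),
                avg_fps: rows.iter().map(|r| r.fps).sum::<f64>() / n,
                samples: rows.len() as u32,
            }
        })
        .collect()
}
```

Storing the sample count alongside each hourly row is a design choice that would let later queries weight averages correctly across hours with missing samples.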

Database Schema:

server_stats table (raw stats, 7-day retention)
server_stats_hourly table (aggregated hourly data, 90-day retention)

Scheduler Jobs:

Hourly aggregation: 0 5 * * * * (at :05 past every hour)
Daily cleanup: 0 0 3 * * * (at 03:00 UTC)

Installation Notes

Frontend:

cd frontend && npm install echarts

Backend: No additional dependencies beyond existing Cargo.toml.

Deferred to Phase 2.2

Player retention tracking (new vs returning players, session duration)
Wipe-correlated analytics
Player activity heatmaps (time-of-day patterns)
Anomaly alerting system

[2025-02-15] — Phase 1 Complete

Added (Phase 1 — Foundation)

Backend Services:

Core control plane (Axum + Tokio)
Auto-wiper with rollback (wipe_engine.rs)
Plugin management system
WebSocket/NATS bridge for real-time data
Companion agent adapter (bare metal server management)
Panel adapters (AMP + Pterodactyl)

Frontend:

Vue 3 dashboard with 19 admin sub-views
Wipe management UI with real-time progress
Toast notification system
Plugin management interface
Public server site

Infrastructure:

PostgreSQL schema (migrations 001-003)
NATS JetStream streams (6 streams configured)
Docker Compose deployment (4 services)
JWT auth with refresh tokens, TOTP 2FA

Companion Agent:

Go binary for bare metal server management
NATS-based command execution
Process lifecycle control
File operations support

uMod Plugin:

C# plugin for Rust game server integration
Stats publishing every 60 seconds
Server lifecycle event reporting

Commits

c5d0571 — feat: Complete Phase 1 frontend — WebSocket + Wipe feature end-to-end
590765f — feat: Complete Phase 1 backend services and WebSocket/NATS bridge
8320591 — docs: Update companion agent language choice to Go
3c39345 — docs: Add CLAUDE.md and Claude Code settings
81eeb3b — docs: Add AGENTS.md roster and resource discipline

Format: type: Short description

Types: feat, fix, docs, refactor, test, chore, perf, ci
