Nimbus — Minecraft Cloud System

Standalone agent process for extending Nimbus clusters across multiple machines with WebSocket connectivity, template sync, and crash recovery.

The Nimbus Agent (nimbus-agent) is a standalone process that runs on remote machines to extend a Nimbus cluster across multiple nodes. It connects to the controller via WebSocket, receives service lifecycle commands, and manages local JVM processes.

Overview

The agent handles:

WebSocket connection to the controller with automatic reconnection
Template downloading and caching (ZIP-based, hash-compared)
Local JVM process management (start, stop, command forwarding, stdout capture)
Java version resolution with automatic Adoptium downloads
Service state persistence and crash recovery
System metrics reporting (CPU, memory) via heartbeats

Architecture

Directory Structure

nimbus-agent/src/main/kotlin/dev/nimbuspowered/nimbus/agent/
├── Agent.kt              → Entry point (main function), CLI arg parsing
├── AgentConfig.kt        → TOML config model + loader
├── AgentRuntime.kt       → WebSocket connection loop, message dispatch
├── AgentStateStore.kt    → JSON persistence for crash recovery
├── JavaResolver.kt       → Java installation detection + Adoptium download
├── LocalProcessHandle.kt → Single JVM process wrapper (stdin/stdout/lifecycle)
├── LocalProcessManager.kt→ Service lifecycle, template copy, config patching
├── SetupWizard.kt        → First-run interactive setup
├── StateSyncClient.kt    → State-sync pull/push for floating services
├── TemplateDownloader.kt → HTTP template download + ZIP extraction
└── TlsHelper.kt          → Fingerprint-pinning TLS trust manager

Bootstrap and connection

The main() function in Agent.kt parses CLI arguments (--controller, --token, --name, --max-memory, --max-services), loads agent.toml (or runs the setup wizard), then starts AgentRuntime.
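As a rough illustration of the argument handling described above, the following sketch collects the documented flags into a map. The function name `parseArgs`, the map representation, and the choice to reject unknown flags are assumptions made for this example; the actual parsing in Agent.kt may differ.

```kotlin
// Hypothetical sketch: only the flag names come from the documentation.
val valueFlags = setOf("--controller", "--token", "--name", "--max-memory", "--max-services")

/** Collects `--flag value` pairs into a map; `--setup` is a bare switch stored as "true". */
fun parseArgs(args: Array<String>): Map<String, String> {
    val parsed = mutableMapOf<String, String>()
    var i = 0
    while (i < args.size) {
        val arg = args[i]
        when {
            arg == "--setup" -> {
                parsed[arg] = "true"
                i += 1
            }
            arg in valueFlags -> {
                require(i + 1 < args.size) { "Missing value for $arg" }
                parsed[arg] = args[i + 1]
                i += 2
            }
            // Assumption: unknown flags are rejected rather than ignored.
            else -> throw IllegalArgumentException("Unknown argument: $arg")
        }
    }
    return parsed
}
```

Values passed on the command line would then override or replace what agent.toml provides; how the two sources are merged is not specified here.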

AgentRuntime maintains a persistent WebSocket connection to the controller and reconnects automatically after a 5-second delay whenever the connection drops. On connect, it sends an AuthRequest with the node's token, name, resources, and OS info. After authentication, it reports any services recovered from a previous run and then enters the message dispatch loop.
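The reconnection behaviour can be pictured as a simple retry loop with a fixed delay. This is a minimal sketch using plain threads rather than coroutines; `runWithReconnect`, its parameters, and the `maxAttempts` cap are hypothetical and exist only so the example terminates. A real agent would presumably loop until it is shut down.

```kotlin
/**
 * Calls [connect] repeatedly, sleeping [delayMs] between attempts.
 * [connect] is expected to block until the session ends (by returning or throwing).
 * Returns the number of attempts made.
 */
fun runWithReconnect(
    maxAttempts: Int,
    delayMs: Long = 5_000,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    connect: (attempt: Int) -> Unit,
): Int {
    var attempt = 0
    while (attempt < maxAttempts) {
        attempt += 1
        try {
            connect(attempt)
        } catch (e: Exception) {
            // A failed or dropped connection is logged and retried, never fatal.
            println("Connection attempt $attempt failed: ${e.message}")
        }
        if (attempt < maxAttempts) sleep(delayMs)
    }
    return attempt
}
```

Passing `sleep` as a parameter keeps the loop testable without waiting for real time to pass.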

All message handling runs on kotlinx-coroutines in a SupervisorJob scope, so an individual service failure does not crash the agent. The agent handles StartService, StopService, SendCommand, HeartbeatRequest, and ShutdownAgent messages (see Protocol for details).
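To make the dispatch step more concrete, here is a sketch of how the five message types might be routed. Only the message names come from the protocol; the fields, the `ControllerMessage` interface, and the string results are illustrative assumptions. The sketch also uses `runCatching` as a dependency-free stand-in for the failure isolation that a SupervisorJob scope provides; it is not the coroutine mechanism itself.

```kotlin
// Hypothetical message shapes: field names are assumptions for illustration.
sealed interface ControllerMessage
data class StartService(val serviceName: String) : ControllerMessage
data class StopService(val serviceName: String) : ControllerMessage
data class SendCommand(val serviceName: String, val command: String) : ControllerMessage
object HeartbeatRequest : ControllerMessage
object ShutdownAgent : ControllerMessage

/** Routes one message to its handler; returns a short description of the action taken. */
fun dispatch(message: ControllerMessage): String = when (message) {
    is StartService -> "start ${message.serviceName}"
    is StopService -> "stop ${message.serviceName}"
    is SendCommand -> "command to ${message.serviceName}: ${message.command}"
    is HeartbeatRequest -> "heartbeat"
    is ShutdownAgent -> "shutdown"
}

/** Handles each message independently so one failing handler does not stop the others. */
fun dispatchAll(
    messages: List<ControllerMessage>,
    handler: (ControllerMessage) -> String = ::dispatch,
): List<Result<String>> = messages.map { runCatching { handler(it) } }
```

Because the `when` is exhaustive over a sealed interface, adding a new message type would produce a compile error until a handler branch is written for it.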

Template sync

TemplateDownloader fetches templates from the controller's REST API (GET /api/templates/{name}/download). Templates are identified by name and content hash.

Sync flow

Agent requests template "Lobby" with hash "abc123"
  ├── Local cache has matching hash? → Skip download
  └── Hash mismatch or missing?
      ├── Download ZIP from controller (includes global plugins like SDK)
      ├── Extract to templates/{name}/
      └── Store hash for future comparisons

The software parameter is included in the download request so the controller can bundle the correct global plugins (e.g., SDK for Paper but not for Velocity).
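The hash comparison in the flow above could look roughly like the following. The `.hash` file inside `templates/{name}/` and the function names are assumptions made for this sketch; the document does not say where TemplateDownloader stores the hash.

```kotlin
import java.nio.file.Files
import java.nio.file.Path

// Hypothetical layout: the last downloaded hash is kept in templates/{name}/.hash
fun hashFile(templatesDir: Path, name: String): Path =
    templatesDir.resolve(name).resolve(".hash")

/** True when the template is missing locally or its stored hash differs from [expectedHash]. */
fun needsDownload(templatesDir: Path, name: String, expectedHash: String): Boolean {
    val file = hashFile(templatesDir, name)
    if (!Files.isRegularFile(file)) return true
    return Files.readString(file).trim() != expectedHash
}

/** Records the hash after a successful download and extraction. */
fun storeHash(templatesDir: Path, name: String, hash: String) {
    val file = hashFile(templatesDir, name)
    Files.createDirectories(file.parent)
    Files.writeString(file, hash)
}
```

Writing the hash only after extraction succeeds means an interrupted download is retried on the next request rather than being treated as cached.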

Local service management

LocalProcessManager handles service lifecycle: copying templates to work directories (services/temp/{Name}_{uuid} for dynamic, services/static/{Name} for static), patching server.properties and Velocity forwarding configs, building the JVM command line, and launching the process.
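The two work directory layouts can be expressed as a small path helper. This is a sketch only: `workDir`, its parameters, and the use of a full random UUID as the suffix are assumptions, and the real LocalProcessManager may build the path differently.

```kotlin
import java.nio.file.Path
import java.util.UUID

/**
 * Static services reuse services/static/{Name};
 * dynamic services get a fresh services/temp/{Name}_{uuid} directory.
 */
fun workDir(
    baseDir: Path,
    name: String,
    isStatic: Boolean,
    id: UUID = UUID.randomUUID(), // assumption: a full random UUID is used as the suffix
): Path =
    if (isStatic) {
        baseDir.resolve("services").resolve("static").resolve(name)
    } else {
        baseDir.resolve("services").resolve("temp").resolve("${name}_$id")
    }
```

The unique suffix lets several dynamic instances of the same template run side by side, while a static service always returns to the same directory and keeps its data.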

LocalProcessHandle wraps a single JVM process and provides stdout capture (SharedFlow<String>, buffer 4096), ready detection (a regex that watches stdout for "Done ("), graceful shutdown (sends stop via stdin), and process adoption for crash recovery (adopt(pid, serviceName)).
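Ready detection amounts to matching each stdout line against a pattern. The sketch below assumes the pattern is a literal `Done (`, which is how Minecraft servers typically announce that startup has finished; the names `readyPattern`, `isReadyLine`, and `firstReadyIndex` are hypothetical.

```kotlin
// Assumed pattern: a server usually logs something like `Done (3.214s)! For help, type "help"`.
val readyPattern = Regex("""Done \(""")

/** True when the line contains the startup-complete marker. */
fun isReadyLine(line: String): Boolean = readyPattern.containsMatchIn(line)

/** Returns the index of the first stdout line that signals readiness, or -1 if none does. */
fun firstReadyIndex(lines: List<String>): Int = lines.indexOfFirst(::isReadyLine)
```

In the agent, the same check would presumably be applied to lines as they arrive on the stdout flow rather than to a finished list.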

Crash recovery

AgentStateStore persists running service metadata to state/services.json using atomic file writes. On restart:

1. The agent loads persisted services from the state file.
2. For each service, it attempts to adopt the OS process by PID via LocalProcessHandle.adopt().
3. Adoption verifies that the process is alive and that its command line contains nimbus.service.name={serviceName}.
4. Successfully recovered services are reported to the controller as READY.
5. Failed recoveries are removed from the state file.
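The adoption check can be sketched with the JDK's `ProcessHandle` API. The function names here are hypothetical. One deliberate difference from a plain substring test: the sketch compares whole command-line tokens, so that a service named Lobby-1 is not confused with Lobby-10. Whether the real adopt() does the same is not stated.

```kotlin
/** True when some command-line token ends with the marker for exactly [serviceName]. */
fun commandLineMatches(commandLine: String, serviceName: String): Boolean =
    commandLine.split(Regex("\\s+")).any { it.endsWith("nimbus.service.name=$serviceName") }

/** Returns the live process handle for [pid] if it still belongs to [serviceName], else null. */
fun tryAdopt(pid: Long, serviceName: String): ProcessHandle? {
    // The PID may have been reused or the process may be gone entirely.
    val handle = ProcessHandle.of(pid).orElse(null) ?: return null
    if (!handle.isAlive) return null
    // The command line can be unavailable on some platforms or without sufficient permissions.
    val commandLine = handle.info().commandLine().orElse(null) ?: return null
    return if (commandLineMatches(commandLine, serviceName)) handle else null
}
```

Checking the command line in addition to the PID guards against adopting an unrelated process that happens to have received the same PID after a reboot or crash.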

Java version resolution

JavaResolver finds the correct Java version by checking configured paths ([java] section), cached downloads in jdks/, environment variables (JAVA_16_HOME through JAVA_21_HOME), JAVA_HOME, and common directories (/usr/lib/jvm, ~/.sdkman/candidates/java, etc.). If no compatible version is found, it auto-downloads from the Adoptium API. Java 16 is the minimum supported runtime.
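The lookup chain can be modelled as an ordered list of sources where the first hit wins. This sketch assumes the sources are consulted in the order listed above, covers only the configured path and the environment variables, and skips the jdks/ cache, the common directories, and any version verification. `resolveJavaHome` and its parameters are hypothetical names.

```kotlin
/**
 * Returns the first non-blank Java home for [major], or null when nothing is found
 * (the point at which a download from Adoptium would be triggered).
 */
fun resolveJavaHome(
    major: Int,
    configured: Map<Int, String>, // e.g. 17 -> value of java_17 in the [java] section
    env: Map<String, String>,
): String? {
    val lookups: List<() -> String?> = listOf(
        { configured[major] },
        { env["JAVA_${major}_HOME"] },
        // Assumption: JAVA_HOME is accepted as-is; a real resolver would verify its version.
        { env["JAVA_HOME"] },
    )
    return lookups.firstNotNullOfOrNull { lookup -> lookup()?.takeIf { it.isNotBlank() } }
}
```

Treating blank strings as missing matches the agent.toml convention of leaving `java_16`, `java_17`, and `java_21` empty for auto-detection.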

Building and running

Terminal

# Build the agent fat JAR
./gradlew :nimbus-agent:shadowJar

# Run (first run triggers setup wizard)
java -jar nimbus-agent/build/libs/nimbus-agent-<version>-all.jar

# Run with CLI args (skip wizard)
java -jar nimbus-agent-<version>-all.jar \
  --controller wss://10.0.0.1:8443/cluster \
  --token your-cluster-token \
  --name worker-1 \
  --max-memory 16G \
  --max-services 20

# Force the setup wizard even when agent.toml exists
java -jar nimbus-agent-<version>-all.jar --setup

Configuration

The agent is configured via agent.toml in the working directory. On first run, the setup wizard calls the controller's /api/cluster/bootstrap endpoint to fetch the TLS fingerprint and fills this file in for you.

agent.toml

[agent]
controller = "wss://127.0.0.1:8443/cluster"
token = "your-cluster-token"
node_name = "worker-1"
max_memory = "8G"
max_services = 10
public_host = ""  # Leave blank to auto-detect from network interfaces

# TLS: pinned controller cert fingerprint (set by the wizard, preferred).
trusted_fingerprint = "AA:BB:CC:DD:EE:FF:..."
# tls_verify = false only for local dev (MITM-vulnerable).
tls_verify = true
# Advanced: supply a JKS/PKCS12 truststore for CA-issued certs.
truststore_path = ""
truststore_password = ""

# Optional: specify paths to Java installations.
# Leave empty for auto-detection / auto-download from Adoptium.
[java]
java_16 = ""
java_17 = ""
java_21 = ""

Key	Default	Description
`controller`	`wss://127.0.0.1:8443/cluster`	Controller WebSocket URL
`token`	(required)	Cluster auth token (matches controller config)
`node_name`	hostname	Display name for this node
`max_memory`	auto-detected	Maximum memory this node can allocate to services
`max_services`	auto-detected	Maximum concurrent services on this node
`public_host`	`""`	Publicly reachable IP or hostname for player routing. Leave blank to auto-detect from network interfaces.
`trusted_fingerprint`	`""`	SHA-256 fingerprint of the controller's TLS cert. Takes precedence over truststore and system CAs. Set by the setup wizard.
`tls_verify`	`true`	If `false`, accept any cert (DEV ONLY). Ignored when `trusted_fingerprint` is set.
`truststore_path`	`""`	Path to a JKS/PKCS12 truststore (advanced, for CA-issued certs).
`truststore_password`	`""`	Password for the truststore.

TLS trust precedence: trusted_fingerprint → truststore_path → system CAs → tls_verify = false (trust all).
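One way to read the precedence chain is as a four-way decision over the config values. The sketch below is an interpretation, not the TlsHelper implementation: `TrustMode`, `selectTrustMode`, and `sha256Fingerprint` are hypothetical names, and the fingerprint helper assumes the digest is taken over the DER-encoded certificate and printed as colon-separated uppercase hex.

```kotlin
import java.security.MessageDigest

enum class TrustMode { PINNED_FINGERPRINT, TRUSTSTORE, SYSTEM_CAS, TRUST_ALL }

/** Picks the trust source following the documented order; blank strings count as unset. */
fun selectTrustMode(trustedFingerprint: String, truststorePath: String, tlsVerify: Boolean): TrustMode =
    when {
        trustedFingerprint.isNotBlank() -> TrustMode.PINNED_FINGERPRINT
        truststorePath.isNotBlank() -> TrustMode.TRUSTSTORE
        tlsVerify -> TrustMode.SYSTEM_CAS
        else -> TrustMode.TRUST_ALL // dev only: accepts any certificate
    }

/** Colon-separated uppercase SHA-256 digest of the given bytes, e.g. "AA:BB:CC:...". */
fun sha256Fingerprint(certDer: ByteArray): String =
    MessageDigest.getInstance("SHA-256")
        .digest(certDer)
        .joinToString(":") { "%02X".format(it) }
```

With this reading, setting `tls_verify = false` has no effect once a fingerprint is pinned, which matches the table entry for `tls_verify`.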

If max_memory and max_services are not specified, both are auto-detected from the system: memory is total RAM minus 2 GB reserved for the OS, and the service limit is one service per GB.
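The auto-detection rule is simple arithmetic. The sketch below assumes that "one service per GB" refers to the allocatable memory (after the 2 GB reserve) and that both values are floored at 1 on very small machines; neither detail is confirmed by the text above. `NodeDefaults` and `autoDetectDefaults` are hypothetical names.

```kotlin
data class NodeDefaults(val maxMemoryGb: Long, val maxServices: Int)

/** Derives defaults from total RAM: reserve [osReserveGb] for the OS, then one service per GB. */
fun autoDetectDefaults(totalRamGb: Long, osReserveGb: Long = 2): NodeDefaults {
    // Assumption: never report less than 1 GB or fewer than 1 service.
    val usableGb = (totalRamGb - osReserveGb).coerceAtLeast(1)
    return NodeDefaults(maxMemoryGb = usableGb, maxServices = usableGb.toInt())
}
```

Under these assumptions, a 16 GB machine would default to 14 GB of allocatable memory and 14 services.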

See Cluster TLS & Security for the full trust model, cert rotation flow, and troubleshooting.

Next steps

Architecture -- How the agent fits into the cluster
Protocol -- All cluster message types
SDK -- Backend server plugin API
