Infrastructure
2024–ongoing
Multi-node Proxmox cluster replacing legacy Hyper-V boxes
Designed and built a Proxmox virtualization cluster with replication and a real failover path — hosting every internal web app the business runs on.
Legacy hardware retired: $30K
Critical apps migrated: 7+
Nodes today: 2 (plus 1 in provisioning)
Backup schedule: daily + weekly off-site
The existing virtualization stack was a set of aging Windows Hyper-V hosts with no replication, no HA, and backups that mostly existed on paper. Every internal web app — MOC, QC, asset tracking, the ERP gateway — sat on a single box whose failure would have been a bad week.
I designed and built a Proxmox cluster as the standard internal virtualization platform. The design is explicitly not Ceph: 2-node ZFS with scheduled replication and a QDevice for quorum, with a clear path to a matched third node. I'd rather ship a solid Phase 1 than demo a Ceph cluster I can't keep running.
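Replication is configured per guest from the Proxmox CLI. A minimal sketch of what a job looks like (the VM ID, target node name, and schedule here are placeholders, not the production values):

```shell
# Ship ZFS snapshots of VM 100 to the second node every 15 minutes,
# capped at 50 MB/s so replication never starves the app traffic.
# (VM ID 100 and node name "pve2" are examples.)
pvesr create-local-job 100-0 pve2 --schedule '*/15' --rate 50

# Check job state and the result of the last sync
pvesr status --guest 100
```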
Stack
- Proxmox VE 9.x
- ZFS pools with async replication
- LVM-thin for local-only workloads
- Corosync clustering with QDevice plan
- Proxmox Backup Server target (off-host)
- Terraform-style provisioning scripts for VM baselines
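The baseline scripts are plain shell around the Proxmox CLI. A hedged sketch of the clone-from-template flow (template ID 9000, VM ID 120, names, paths, and addresses are all illustrative):

```shell
#!/bin/sh
# Stand up a new service VM from a cloud-init template.
# All IDs, names, key paths, and IPs below are placeholders.

qm clone 9000 120 --name qc-app --full      # full clone from the baseline template

# Cloud-init: admin user, SSH key, and the static IP from the IP request
qm set 120 --ciuser ops --sshkeys /root/keys/ops.pub \
           --ipconfig0 ip=10.0.10.120/24,gw=10.0.10.1

qm resize 120 scsi0 +20G                    # grow the root disk for the workload
qm start 120
```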
Why Proxmox, not VMware or Hyper-V
VMware got expensive the week Broadcom changed the licensing model. Hyper-V was already in place but hadn't been updated in years and nobody was excited about renewing it. Proxmox gave us the operational story of a proper hypervisor — snapshots, live migration, replication, PBS backups — without the licensing overhead and with a learning curve that didn't require a week of training.
The cluster is the backbone of the IT function now. Every custom app I've built gets deployed here; every legacy Windows box I can virtualize, I virtualize.
Why not Ceph
Ceph is seductive for a homelab-trained engineer, but it's the wrong tool for two unmatched nodes without a dedicated 10GbE storage fabric and 32+ GB of RAM each. ZFS + scheduled replication gives us most of the durability story, far simpler operations, and a clean path to upgrade; the trade-off is that async replication leaves a small window of potential data loss on failover, which is acceptable for these workloads. Ceph stays on the roadmap for when we have three matched nodes and the network to support it.
Operational choices that matter
- All VM storage on ZFS — snapshots are near-free, replication is built in, bit-rot protection is not an afterthought.
- Corosync on a dedicated ring path, with a QDevice for quorum — a 2-node cluster without a quorum strategy is a future outage.
- Backups to a Proxmox Backup Server on separate hardware with off-site replication. Restore drills on a calendar, not on vibes.
- VM baselines (OS + patches + agents) scripted so that a new service can stand up in under 30 minutes from IP request to login.
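The QDevice piece is two packages and one command. A sketch of the setup, assuming a small host outside the cluster runs the tie-breaking daemon (the IP below is a placeholder):

```shell
# On a machine OUTSIDE the cluster (an always-on box, not a cluster node):
apt install corosync-qnetd       # qnetd daemon casts the third, tie-breaking vote

# On one cluster node:
apt install corosync-qdevice
pvecm qdevice setup 10.0.0.50    # 10.0.0.50 = the qnetd host (example address)

# Confirm expected votes went from 2 to 3
pvecm status
```

With the third vote in place, losing either node leaves the survivor quorate instead of frozen, which is what makes the failover path real rather than theoretical.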
What it unlocks
Every custom internal app I've shipped in the last twelve months — MOC, QC, IT request system, muster, asset tracking, the IQMS MCP chat — lives in this cluster. The cluster is the reason the IT function ships software at all; without it, every app would need its own box and nobody would approve the capex for ten of them.