Infrastructure · 2024–ongoing

Multi-node Proxmox cluster replacing legacy Hyper-V boxes

Designed and built a Proxmox virtualization cluster with replication and a real failover path, hosting every internal web app the business depends on.

  • Legacy hardware retired: $30K
  • Critical apps migrated: 7+
  • Nodes today: 2 (plus 1 in provisioning)
  • Backup schedule: daily + weekly off-site

The existing virtualization stack was a set of aging Windows Hyper-V hosts with no replication, no HA, and backups that mostly existed on paper. Every internal web app — MOC, QC, asset tracking, the ERP gateway — sat on a single box whose failure would have been a bad week.

I designed and built a Proxmox cluster as the standard internal virtualization platform. The design is explicitly not-Ceph — 2-node ZFS + replication with a clear path to a 3-node quorum plus QDevice — because I'd rather ship a solid Phase 1 than demo a Ceph cluster that I can't keep running.
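
The quorum path mentioned above is small enough to sketch. A minimal outline, assuming Debian-based Proxmox nodes and a placeholder QDevice host (hostnames and IPs here are illustrative, not the cluster's real ones):

```shell
# On both cluster nodes: install the QDevice client.
apt install corosync-qdevice

# On the external QDevice host (any small always-on box):
apt install corosync-qnetd

# From one cluster node: register the QDevice as a third vote,
# so a 2-node cluster can lose a node without losing quorum.
pvecm qdevice setup 10.0.10.9

# Verify: "Expected votes: 3" with two nodes plus the QDevice.
pvecm status
```

The point of the QDevice is that it votes but holds no VM state, so it can run on hardware far below node spec.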

Stack

  • Proxmox VE 9.x
  • ZFS pools with async replication
  • LVM-thin for local-only workloads
  • Corosync clustering with QDevice plan
  • Proxmox Backup Server target (off-host)
  • Terraform-style provisioning scripts for VM baselines
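
The ZFS async replication in that list is native to Proxmox and configured per VM. A sketch of what a job looks like, with assumed VM ID and node name (`100`, `pve2`):

```shell
# Replicate VM 100's disks to node pve2 every 15 minutes.
# Job IDs are <vmid>-<n>; the schedule uses Proxmox calendar syntax.
pvesr create-local-job 100-0 pve2 --schedule "*/15"

# List all replication jobs and their last-sync state.
pvesr status
```

Replication is snapshot-based and incremental after the first sync, which is why the failover window stays bounded by the schedule rather than by disk size.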

Why Proxmox, not VMware or Hyper-V

VMware got expensive the week Broadcom changed the licensing model. Hyper-V was already in place but hadn't been updated in years and nobody was excited about renewing it. Proxmox gave us the operational story of a proper hypervisor — snapshots, live migration, replication, PBS backups — without the licensing overhead and with a learning curve that didn't require a week of training.

The cluster is the backbone of the IT function now. Every custom app I've built gets deployed here; every legacy Windows box I can virtualize, I virtualize.

Why not Ceph

Ceph is seductive for a homelab-trained engineer, but it's the wrong tool for two unmatched nodes without dedicated 10GbE storage fabric and 32+ GB of RAM each. ZFS + scheduled replication gives us most of the durability story, far simpler operations, and a clean path to upgrade. Ceph stays on the roadmap for when we have three matched nodes and the network to support it.

Operational choices that matter

  • All VM storage on ZFS — snapshots are near-free, replication is built in, bit-rot protection is not an afterthought.
  • Corosync on a dedicated ring path, with a QDevice for quorum — a 2-node cluster without quorum strategy is a future outage.
  • Backups to a Proxmox Backup Server on separate hardware with off-site replication. Restore drills on a calendar, not on vibes.
  • VM baselines (OS + patches + agents) scripted so that a new service can stand up in under 30 minutes from IP request to login.
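
The baseline scripting in the last bullet can be sketched as a thin wrapper around `qm` and a cloud-init template. This is a hypothetical sketch, not the production script: template ID `9000`, the VM defaults, and the key path are all placeholder values, and `DRY_RUN` (on by default here) prints commands instead of executing them so the baseline can be reviewed off-cluster.

```shell
#!/bin/sh
# Hypothetical VM-baseline sketch. Set DRY_RUN=0 on a real
# Proxmox node to actually execute the qm commands.
set -eu

TEMPLATE_ID=9000          # assumed cloud-init template
VMID=${1:-123}            # placeholder defaults
NAME=${2:-app01}
IP=${3:-10.0.20.50/24}
GW=10.0.20.1

# Print the command under DRY_RUN, otherwise run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

run qm clone "$TEMPLATE_ID" "$VMID" --name "$NAME" --full
run qm set "$VMID" --ipconfig0 "ip=$IP,gw=$GW"
run qm set "$VMID" --sshkeys /root/baseline/authorized_keys
run qm start "$VMID"
```

The clone-from-template pattern is what keeps the "under 30 minutes" claim honest: the template already carries OS, patches, and agents, so provisioning is cloning plus cloud-init, not an install.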

What it unlocks

Every custom internal app I've shipped in the last twelve months — MOC, QC, IT request system, muster, asset tracking, the IQMS MCP chat — lives in this cluster. The cluster is the reason the IT function ships software at all; without it, every app would need its own box and nobody would approve the capex for ten of them.