Infrastructure
2024–ongoing
Multi-node Proxmox cluster replacing legacy Hyper-V boxes
Designed and built a Proxmox virtualization cluster with replication and a real failover path — hosting every internal web app the business runs on.
Legacy hardware retired: $30K
Critical apps migrated: 7+
Nodes today: 2 (plus 1 in provisioning)
Backup schedule: daily + weekly off-site
The existing virtualization stack was a set of aging Windows Hyper-V hosts with no replication, no HA, and backups that mostly existed on paper. Every internal web app — MOC, QC, asset tracking, the ERP gateway — sat on a single box whose failure would have been a bad week.
I designed and built a Proxmox cluster as the standard internal virtualization platform. The design is explicitly not Ceph: 2-node ZFS with scheduled replication and a QDevice for quorum, with a clear path to a matched third node. I'd rather ship a solid Phase 1 than demo a Ceph cluster I can't keep running.
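Replication is configured per guest from the Proxmox CLI. A minimal sketch of what a job looks like (the VM ID, target node name, and schedule here are placeholders, not the production values):

```shell
# Ship ZFS snapshots of VM 100 to the second node every 15 minutes,
# capped at 50 MB/s so replication never starves the app traffic.
# (VM ID 100 and node name "pve2" are examples.)
pvesr create-local-job 100-0 pve2 --schedule '*/15' --rate 50

# Check job state and the result of the last sync
pvesr status --guest 100
```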
Stack
- Proxmox VE 9.x
- ZFS pools with async replication
- LVM-thin for local-only workloads
- Corosync clustering with QDevice plan
- Proxmox Backup Server target (off-host)
- Terraform-style provisioning scripts for VM baselines
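The baseline scripts are plain shell around the Proxmox CLI. A hedged sketch of the clone-from-template flow (template ID 9000, VM ID 120, names, paths, and addresses are all illustrative):

```shell
#!/bin/sh
# Stand up a new service VM from a cloud-init template.
# All IDs, names, key paths, and IPs below are placeholders.

qm clone 9000 120 --name qc-app --full      # full clone from the baseline template

# Cloud-init: admin user, SSH key, and the static IP from the IP request
qm set 120 --ciuser ops --sshkeys /root/keys/ops.pub \
           --ipconfig0 ip=10.0.10.120/24,gw=10.0.10.1

qm resize 120 scsi0 +20G                    # grow the root disk for the workload
qm start 120
```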
Why Proxmox, not VMware or Hyper-V
VMware got expensive the week Broadcom changed the licensing model. Hyper-V was already in place but hadn't been updated in years and nobody was excited about renewing it. Proxmox gave us the operational story of a proper hypervisor — snapshots, live migration, replication, PBS backups — without the licensing overhead and with a learning curve that didn't require a week of training.
The cluster is the backbone of the IT function now. Every custom app I've built gets deployed here; every legacy Windows box I can virtualize, I virtualize.
Why not Ceph
Ceph is seductive for a homelab-trained engineer, but it's the wrong tool for two unmatched nodes without a dedicated 10GbE storage fabric and 32+ GB of RAM each. ZFS + scheduled replication gives us most of the durability story, far simpler operations, and a clean path to upgrade; the trade-off is that async replication leaves a small window of potential data loss on failover, which is acceptable for these workloads. Ceph stays on the roadmap for when we have three matched nodes and the network to support it.
Operational choices that matter
- All VM storage on ZFS — snapshots are near-free, replication is built in, bit-rot protection is not an afterthought.
- Corosync on a dedicated ring path, with a QDevice for quorum — a 2-node cluster without a quorum strategy is a future outage.
- Backups to a Proxmox Backup Server on separate hardware with off-site replication. Restore drills on a calendar, not on vibes.
- VM baselines (OS + patches + agents) scripted so that a new service can stand up in under 30 minutes from IP request to login.
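The QDevice piece is two packages and one command. A sketch of the setup, assuming a small host outside the cluster runs the tie-breaking daemon (the IP below is a placeholder):

```shell
# On a machine OUTSIDE the cluster (an always-on box, not a cluster node):
apt install corosync-qnetd       # qnetd daemon casts the third, tie-breaking vote

# On one cluster node:
apt install corosync-qdevice
pvecm qdevice setup 10.0.0.50    # 10.0.0.50 = the qnetd host (example address)

# Confirm expected votes went from 2 to 3
pvecm status
```

With the third vote in place, losing either node leaves the survivor quorate instead of frozen, which is what makes the failover path real rather than theoretical.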
What it unlocks
Every custom internal app I've shipped in the last twelve months — MOC, QC, IT request system, muster, asset tracking, the IQMS MCP chat — lives in this cluster. The cluster is the reason the IT function ships software at all; without it, every app would need its own box and nobody would approve the capex for ten of them.