
High-Availability Docker Swarm on Proxmox

Build production-grade Docker Swarm on Proxmox with Keepalived VIPs and GPU passthrough.

Why Docker Swarm (Not Kubernetes)

Kubernetes is the standard, but for 5-10 containers, it’s overkill. Docker Swarm gives you:

  • Service discovery with zero configuration
  • Built-in load balancing
  • Rolling updates without custom tooling
  • Single-node control plane if needed

The trade-off: advanced scheduling or custom CNIs need Kubernetes. For Home Assistant, Jellyfin, and WireGuard? Swarm is simpler.

Hardware constraint: my GPU passthrough works via LXC device nodes, which is much simpler than PCI passthrough on VMs. Because an LXC container shares the host kernel, exposing the right /dev entries is enough — no IOMMU groups, no vfio-pci binding. This determined the choice.
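On the Proxmox side, device-node passthrough reduces to dev[n] entries in the container's config file. A sketch (the container ID 201 is illustrative; the device list matches the alpha host later in this post):

```ini
# /etc/pve/lxc/201.conf (excerpt)
dev0: /dev/apex_0,mode=0666
dev1: /dev/dri/renderD128,mode=0666
dev2: /dev/dri/card1,mode=0666
```

Each dev[n] line makes the host device node appear inside the container with the given permissions — no hardware is detached from the host.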

Module Capabilities

The tf-module-proxmox-docker module provisions Docker containers or VMs with Docker Engine installed, optionally forming a Docker Swarm cluster:

  1. Multi-node provisioning — creates LXC or VM nodes across a Proxmox host pool
  2. Docker installation — installs Docker Engine via cloud-init
  3. Keepalived integration — optional VIP for high availability
  4. GPU device passthrough — passes through /dev/apex_0, /dev/dri/* for hardware acceleration
  5. Host pool scheduling — round-robin distribution across Proxmox nodes

Quick Start

module "docker_cluster" {
  source  = "registry.example.com/namespace/tf-module-proxmox-docker/docker"
  version = "1.2.3"

  configuration = {
    cluster = {
      name = "prod-docker"
      type = "lxc"  # or "vm"
      datastore = { id = "nas", node = "alpha" }
    }

    host_pool = [
      { name = "alpha", datastore_id = "local-lvm" },
      { name = "charlie", datastore_id = "local-lvm" },
      { name = "foxtrot", datastore_id = "local-lvm" }
    ]

    worker_nodes = [
      {
        size = "medium"
        networks = { dmz = { address = "192.168.61.21/24", gateway = "192.168.61.1" } }
        vip = { state = "MASTER", priority = 100, interface = "dmz" }
      },
      {
        size = "medium"
        networks = { dmz = { address = "192.168.61.22/24", gateway = "192.168.61.1" } }
        vip = { state = "BACKUP", priority = 90, interface = "dmz" }
      }
    ]

    node_size_configuration = {
      medium = { cpu = 8, memory = 32768, os_disk = 256 }
    }

    vip = { enabled = true, address = "192.168.61.20" }
  }
}

LXC vs VM

The module supports both container and VM backends:

Aspect             LXC                VM
Resource overhead  Minimal            Full hypervisor
GPU passthrough    Device nodes       Full PCI
Nesting support    No                 Yes
Use case           Simple containers  Full VMs

# LXC-based (type = "lxc")
configuration = {
  cluster = {
    type = "lxc"
  }
}

# VM-based (type = "vm")
configuration = {
  cluster = {
    type = "vm"
  }
}

The VM provisioner downloads a cloud image and imports it:

resource "proxmox_virtual_environment_download_file" "vm_image" {
  content_type = "iso"
  datastore_id = var.configuration.cluster.datastore.id
  node_name    = var.configuration.cluster.datastore.node
  file_name    = "docker-base.iso"
  url          = var.configuration.node_os_configuration[var.configuration.cluster.type].template_image_url
}

Host Pool Scheduling

Nodes are distributed across Proxmox hosts via modulo arithmetic:

# In nodes.tf
node_name = var.configuration.host_pool[
  each.key % length(var.configuration.host_pool)
].name

With 3 nodes and 3 node indices:

  • Node 0 → alpha (0 % 3)
  • Node 1 → charlie (1 % 3)
  • Node 2 → foxtrot (2 % 3)

This ensures even distribution across the cluster for resilience.
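The distribution is easy to sanity-check outside Terraform. A minimal Python sketch of the same modulo assignment, using the host names from the example pool:

```python
# Round-robin host assignment via modulo, mirroring the module's
# node_name lookup: host_pool[index % len(host_pool)].
host_pool = ["alpha", "charlie", "foxtrot"]

def assign_host(node_index: int) -> str:
    """Return the Proxmox host for a given worker-node index."""
    return host_pool[node_index % len(host_pool)]

assignments = {i: assign_host(i) for i in range(3)}
print(assignments)  # {0: 'alpha', 1: 'charlie', 2: 'foxtrot'}
```

A fourth node would wrap back around to alpha, so pools smaller than the node count still get an even spread.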

Keepalived HA

For high availability, Keepalived provides a floating VIP:

configuration = {
  vip = {
    enabled  = true
    address  = "192.168.61.20"
    router_id = 20
  }
}

Each node is configured with its role:

worker_nodes = [
  {
    size = "medium"
    networks = { dmz = { address = "192.168.61.21/24", gateway = "192.168.61.1" } }
    vip = { state = "MASTER", priority = 100, interface = "dmz" }
  },
  {
    size = "medium"
    networks = { dmz = { address = "192.168.61.22/24", gateway = "192.168.61.1" } }
    vip = { state = "BACKUP", priority = 90, interface = "dmz" }
  },
  {
    size = "medium"
    networks = { dmz = { address = "192.168.61.23/24", gateway = "192.168.61.1" } }
    vip = { state = "BACKUP", priority = 80, interface = "dmz" }
  }
]

The module generates Keepalived configuration:

resource "proxmox_virtual_environment_file" "keepalived_config" {
  content = <<-EOF
    vrrp_instance VI_1 {
        state ${node.vip.state}
        interface ${node.vip.interface}
        virtual_router_id ${var.configuration.vip.router_id}
        priority ${node.vip.priority}
        virtual_ipaddress {
            ${var.configuration.vip.address}
        }
    }
  EOF
}
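For the MASTER worker in the example above, the interpolated file would come out roughly as follows (a sketch of the rendered result, not captured module output):

```
vrrp_instance VI_1 {
    state MASTER
    interface dmz
    virtual_router_id 20
    priority 100
    virtual_ipaddress {
        192.168.61.20
    }
}
```

The BACKUP nodes get the same block with their own state and priority; whichever live node has the highest priority claims 192.168.61.20.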

GPU Passthrough

For hardware acceleration (e.g., transcoding, ML workloads), device passthrough is configured in the host pool:

host_pool = [
  {
    name = "alpha"
    device_map = [
      { device = "/dev/apex_0", mode = "0666" },          # Coral Edge TPU
      { device = "/dev/dri/renderD128", mode = "0666" },  # iGPU render node
      { device = "/dev/dri/card1", mode = "0666" }
    ]
    datastore_id = "local-lvm"
  }
]

The devices are passed through to containers:

resource "proxmox_virtual_environment_container" "this" {
  # ...

  devices_passthrough = [
    for device in var.configuration.host_pool[each.key % length(var.configuration.host_pool)].device_map : {
      path = device.device
      mode = device.mode
    }
  ]
}

Docker Installation

Docker is installed via cloud-init:

resource "proxmox_virtual_environment_file" "cloud_config" {
  content = <<-EOF
#cloud-config
package_update: true
packages:
  - docker.io
  - docker-compose
runcmd:
  - systemctl enable docker
  - systemctl start docker
  - usermod -aG docker root
EOF
}

Or, for more complex setups, the OS configuration supports custom packages and post-install commands:

node_os_configuration = {
  debian = {
    family = "debian"
    template_image_url = "https://..."
    packages = ["docker.io", "docker-compose"]
    package_manager = {
      install_command = "apt-get install -y"
    }
    post_install_commands = [
      "systemctl enable docker",
      "usermod -aG docker root"
    ]
  }
}

Multi-Network Support

The module supports multiple network interfaces per node:

networks = {
  dmz = {
    address = "192.168.61.21/24"
    gateway = "192.168.61.1"
  }
  vmbr1 = {
    address = "192.168.192.121/25"
  }
}

This maps to:

  • dmz — frontend network with gateway (for public access)
  • vmbr1 — backend network (for inter-node communication)

Node Sizing

The node_size_configuration block keeps definitions DRY:

node_size_configuration = {
  small = {
    cpu     = 2
    memory  = 512
    os_disk = 20
  }
  medium = {
    cpu     = 8
    memory  = 32768
    os_disk = 256
  }
  large = {
    cpu     = 16
    memory  = 65536
    os_disk = 512
  }
}

My production cluster uses medium nodes (8 vCPU, 32GB RAM, 256GB disk).
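As a quick sanity check on aggregate capacity, three medium nodes work out as follows (a throwaway calculation, not module code):

```python
# Aggregate capacity of the production cluster: 3 x "medium".
medium = {"cpu": 8, "memory_mib": 32768, "os_disk_gb": 256}
node_count = 3

total_cpu     = node_count * medium["cpu"]                  # vCPUs
total_mem_gib = node_count * medium["memory_mib"] // 1024   # GiB
total_disk_gb = node_count * medium["os_disk_gb"]           # GB

print(total_cpu, total_mem_gib, total_disk_gb)  # 24 96 768
```

Plenty of headroom for a 10-container Swarm, and one node can fail without starving the rest.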

Optional Tools

The module can provision additional tools:

configuration = {
  cluster = {
    options = {
      # Hawser - container management
      hawser = {
        enabled = true
        image   = "harbor.example.com/gh/finsys/hawser:latest"
      }
      
      # Newt - container log viewer  
      newt = {
        enabled = true
        image   = "harbor.example.com/dh/fosrl/newt:latest"
        endpoint = "https://newt.example.com"
      }
      
      # APT cache for faster downloads
      apt_cache = {
        enabled = true
        url     = "https://apt.example.com/"
      }
    }
  }
}

My Production Configuration

Here’s the actual production YAML configuration:

# configurations/docker/prod-docker-lxc.yaml
name: prod-docker-lxc
enabled: true

cluster:
  name: prod-docker-lxc
  type: lxc
  datastore:
    id: nas
    node: alpha

host_pool:
  - name: alpha
    device_map:
      - device: /dev/apex_0
        mode: "0666"
      - device: /dev/dri/renderD128
        mode: "0666"
      - device: /dev/dri/card1
        mode: "0666"
    datastore_id: local-lvm
  - name: charlie
    device_map:
      - device: /dev/dri/renderD128
        mode: "0666"
      - device: /dev/dri/card0
        mode: "0666"
    datastore_id: local-lvm
  - name: foxtrot
    device_map:
      - device: /dev/apex_0
        mode: "0666"
      - device: /dev/dri/renderD128
        mode: "0666"
      - device: /dev/dri/card0
        mode: "0666"
    datastore_id: local-lvm

worker_nodes:
  - size: medium
    networks:
      dmz:
        address: 192.168.61.21/24
        gateway: 192.168.61.1
      vmbr1:
        address: 192.168.192.121/24
    vip:
      state: MASTER
      priority: 100
      interface: dmz
  - size: medium
    networks:
      dmz:
        address: 192.168.61.22/24
        gateway: 192.168.61.1
      vmbr1:
        address: 192.168.192.122/24
    vip:
      state: BACKUP
      priority: 90
      interface: dmz
  - size: medium
    networks:
      dmz:
        address: 192.168.61.23/24
        gateway: 192.168.61.1
      vmbr1:
        address: 192.168.192.123/24
    vip:
      state: BACKUP
      priority: 80
      interface: dmz

node_size_configuration:
  medium:
    cpu: 8
    memory: 32768
    os_disk: 256

vip:
  enabled: true
  address: 192.168.61.20
  router_id: 20

Outputs

The module returns node credentials for access:

output "nodes_credentials" {
  value = {
    password = random_password.node_root_password.result
    ssh_key = tls_private_key.node_root_ssh_key.private_key_pem
    hawser_token = random_uuid.hawser_token.id
  }
}

output "nodes_configurations" {
  value = {
    for idx, node in proxmox_virtual_environment_container.this : idx => {
      id      = node.id
      name    = node.name
      node    = node.node
      ip      = node.ip_addresses[0]
    }
  }
}

Credentials are automatically stored in Bitwarden:

resource "bitwarden-secrets_secret" "docker_nodes_password" {
  key   = "${local.cluster_name}-nodes_password"
  value = module.docker[0].nodes_credentials.password
}

resource "bitwarden-secrets_secret" "docker_nodes_ssh_key" {
  key   = "${local.cluster_name}-nodes_ssh_key"
  value = module.docker[0].nodes_credentials.ssh_key
}
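Without a secrets manager, one way to consume the outputs is to write the key to disk with the hashicorp/local provider (a sketch — the resource name and output path are illustrative):

```hcl
resource "local_sensitive_file" "node_ssh_key" {
  content         = module.docker_cluster.nodes_credentials.ssh_key
  filename        = "${path.module}/generated/node_root_id_rsa"
  file_permission = "0600"
}
```

local_sensitive_file keeps the key out of plan output, but the state file still contains it, so treat the state as sensitive either way.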

Use Cases

This cluster handles workloads like:

  • Home Assistant — Docker Compose-based home automation
  • Media services — Plex, Jellyfin with GPU transcoding
  • VPN services — WireGuard, OpenVPN
  • CI runners — GitHub Actions self-hosted runners

The hardware acceleration via GPU passthrough is critical for media workloads.

What Most People Get Wrong

  1. “Docker Swarm is dead” — It’s not Kubernetes, but for 10-container workloads, it’s simpler. No RBAC complexity, no CNI headaches.

  2. GPU passthrough works on LXC — Most guides assume PCI passthrough (VMs). With device nodes (/dev/apex_0, /dev/dri/*), LXC containers access GPUs directly.

  3. “Keepalived needs 3 nodes for quorum” — VRRP is priority-based, not quorum-based; two nodes work fine with nopreempt on the master. The backup only takes over if the master fails.
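The nopreempt tweak from point 3 is a one-line addition to the VRRP instance. A sketch (note: keepalived only honors nopreempt when the initial state is BACKUP, so both nodes start as BACKUP and the higher priority wins the first election):

```
vrrp_instance VI_1 {
    state BACKUP      # nopreempt requires initial state BACKUP
    nopreempt         # a recovered node does not steal the VIP back
    interface dmz
    virtual_router_id 20
    priority 100
    virtual_ipaddress {
        192.168.61.20
    }
}
```

This avoids a second failover (and dropped connections) when the original master comes back online.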

When to Use / When NOT to Use

Use Docker Swarm          Use Kubernetes
3-15 containers           50+ containers
Simple networking         Custom CNI required
Single admin              Team with RBAC needs
GPU passthrough via LXC   GPU operators

What’s Next

Current areas of exploration:

  1. GPU scheduling — Kubernetes-style GPU scheduling for Docker
  2. Portainer integration — management UI for Docker
  3. Observability — centralized logging with Loki