Air-Gapped Deployment#

This guide covers deploying Scout in air-gapped environments where production nodes do not have internet access.

Overview#

Scout supports air-gapped deployments through a staging node architecture. Ansible automatically deploys a K3s cluster and Harbor registry on the staging node, which acts as a proxy between the internet and your production cluster.

Architecture#

┌──────────────┐         ┌──────────────────────────────┐
│   Internet   │────────▶│  Staging Node                │
└──────────────┘         │  - K3s cluster (standalone)  │
                         │  - Harbor registry proxy     │
                         │  - Squid forward proxy       │
                         └──────────────────────────────┘
┌───────────────────────┐                 │ HTTPS (registry)
│  Ansible Control Node │                 │
│  ("Jump Node")        │                 │
│  - Helm CLI           │                 │
│  - Scout repo         │                 │
│  - Kubeconfig access  │                 │
└───────────┬───────────┘                 │
            │ K8s API                     │
            │                             │
            ▼                             ▼
┌─────────────────────────────────────────────────────────┐
│  Production K3s Cluster                                 │
│  - No internet access                                   │
│  - Pulls images via Harbor                              │
│  - Managed by Ansible from jump node                    │
└─────────────────────────────────────────────────────────┘

How it works:

  • Harbor pull-through proxy automatically caches container images from the internet

  • Production nodes pull images through Harbor without needing internet access

  • K3s artifacts are downloaded to the Ansible control node and distributed to air-gapped production nodes

  • SELinux packages are downloaded via a Kubernetes Job on the staging cluster

Requirements#

Operating System#

Critical: Production K3s nodes must run Rocky Linux 9 (or compatible RHEL 9-based distribution).

This requirement exists because air-gapped installations download SELinux packages (k3s-selinux and container-selinux) from Rancher’s repository using a Kubernetes Job that runs Rocky Linux 9 containers. The downloaded packages must match the production node OS to ensure compatibility.

Staging Node#

  • Internet access for downloading artifacts and container images

  • Separate physical or virtual machine from production cluster

  • Sufficient storage for Harbor registry cache (recommend 100Gi+)

  • SSH access from Ansible control node

  • Rocky Linux 9 (recommended for consistency, but not strictly required)

Production Nodes#

  • Rocky Linux 9 (required)

  • No internet access needed

  • Network connectivity to staging node Harbor registry

  • SSH access from Ansible control node

Ansible Control Node#

  • Network access to both staging and production K8s API servers (port 6443)

  • kubectl command-line tool installed

  • Ansible kubernetes.core collection installed

  • Internet access (for downloading Helm charts and k3s artifacts to control node)

Network Connectivity#

  • Production nodes → Staging Harbor (HTTPS, typically port 443 but configurable in the inventory)

  • Production nodes → Staging Squid proxy (TCP port 3128, for IdP authentication)

  • Ansible control → Staging K8s API (port 6443)

  • Ansible control → Production K8s API (port 6443)

  • Ansible control → All nodes (SSH, port 22)
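The production-side requirements above can be checked before deployment with a small preflight play. The following is a minimal sketch and not part of the Scout playbooks: the file name is hypothetical, it assumes the `k3s_cluster` and `staging` inventory groups used in this guide, it assumes Harbor listens on port 443, and it assumes the staging inventory hostname resolves from the production nodes.

```yaml
# preflight.yaml (hypothetical file, not shipped with Scout)
- name: Check production-to-staging connectivity
  hosts: k3s_cluster
  gather_facts: false
  vars:
    # First host in the staging group; adjust if your inventory differs
    staging_host: "{{ groups['staging'][0] }}"
  tasks:
    - name: Harbor registry port is reachable
      ansible.builtin.wait_for:
        host: "{{ staging_host }}"
        port: 443          # assumption: default Harbor HTTPS port
        timeout: 10

    - name: Squid forward proxy port is reachable
      ansible.builtin.wait_for:
        host: "{{ staging_host }}"
        port: 3128
        timeout: 10
```

Because `wait_for` runs on each targeted host, the checks are made from the production nodes themselves rather than from the control node.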

Staging Host Configuration#

Adding Staging to Inventory#

Add a staging group to your inventory.yaml:

staging:
  hosts:
    staging.example.edu:
      ansible_host: staging
      ansible_python_interpreter: /usr/bin/python3
  vars:
    # K3s cluster join token for staging cluster
    staging_k3s_token: !vault |
          $ANSIBLE_VAULT;1.1;AES256
          ...encrypted token...

    # Harbor admin password
    harbor_admin_password: !vault |
          $ANSIBLE_VAULT;1.1;AES256
          ...encrypted password...

    # Storage size for Harbor registry cache
    harbor_storage_size: 100Gi

Required Variables#

| Variable | Description |
|---|---|
| staging_k3s_token | Cluster join token for the staging K3s cluster (separate from production k3s_token) |
| harbor_admin_password | Admin password for Harbor web UI and API |
| harbor_storage_size | Persistent volume size for cached images (recommend 100Gi minimum) |

Generating Staging Credentials#

Generate and encrypt credentials using Ansible Vault:

# Generate staging K3s token
openssl rand -hex 32 | ansible-vault encrypt_string --vault-password-file vault/pwd.sh --name 'staging_k3s_token'

# Generate Harbor admin password
openssl rand -hex 32 | ansible-vault encrypt_string --vault-password-file vault/pwd.sh --name 'harbor_admin_password'

See Configuring Secrets for more details on Ansible Vault.

Optional TLS Configuration#

By default, the staging playbook generates a self-signed TLS certificate for the Harbor ingress. To use your own certificate, define these variables in the staging vars:

staging:
  vars:
    tls_cert_path: /path/to/cert.pem
    tls_key_path: /path/to/key.pem

When using self-signed certs, k3s nodes skip TLS verification. When providing valid certs, TLS verification is enabled.

To serve Harbor on the bare staging hostname (no subdomain):

staging:
  vars:
    harbor_subdomain: ''  # Uses staging hostname directly instead of harbor.<hostname>

Air-Gapped Configuration Variables#

Enabling Air-Gapped Mode#

Set the global air_gapped flag in your inventory:

all:
  vars:
    air_gapped: true

K3s Air-Gapped Variables#

Configure these variables in the k3s_cluster vars section or globally in all vars:

k3s_cluster:
  vars:
    # Enable air-gapped installation mode (default: false)
    air_gapped: true

    # Timeout for downloading k3s artifacts to control node in seconds (default: 300)
    k3s_artifact_download_timeout: 300

    # SELinux package installation (default: auto-detect based on target node SELinux status)
    # Set to true/false to override auto-detection
    # k3s_selinux_enabled: true

    # Rancher repository channel for SELinux packages (default: stable)
    # Options: stable, testing, latest
    k3s_selinux_channel: stable

Variable Reference#

| Variable | Default | Description |
|---|---|---|
| air_gapped | false | Enable air-gapped installation mode |
| k3s_artifact_download_timeout | 300 | Timeout in seconds for downloading k3s binary and install script to Ansible control node |
| k3s_selinux_enabled | auto-detect | Install SELinux packages (k3s-selinux, container-selinux). Set to true/false to override auto-detection based on target host SELinux status |
| k3s_selinux_channel | stable | Rancher repository channel for SELinux packages. Options: stable, testing, latest |
| k3s_selinux_rpm_site | rpm.rancher.io | Rancher RPM repository site (rarely needs changing) |

Deployment Steps#

Follow these steps to deploy Scout in air-gapped mode:

1. Enable Air-Gapped Mode#

Set air_gapped: true in your inventory file:

all:
  vars:
    air_gapped: true

2. Configure Staging Host#

Add the staging host and credentials to your inventory (see Staging Host Configuration above).

3. Configure Production Cluster#

Define your production K3s cluster nodes as usual in the server, workers, and optionally gpu_workers groups. Ensure all nodes run Rocky Linux 9.

server:
  hosts:
    prod-server.example.edu:
      ansible_host: prod-server

workers:
  hosts:
    prod-worker-1.example.edu:
      ansible_host: worker-1
    prod-worker-2.example.edu:
      ansible_host: worker-2

4. Deploy#

Deploy Scout components normally:

ansible-playbook -i inventory.yaml playbooks/main.yaml

# Or use the Makefile
make all

What happens:

  • staging play installs a single-node K3s cluster on the staging host (online mode), deploys Traefik, Harbor (container image proxy), and Nexus (package proxy for conda/PyPI/Maven) via Helm, and installs Squid forward proxy for outbound IdP access

  • k3s play

    • Downloads K3s artifacts (binary, install script) to Ansible control node

    • Downloads SELinux packages via Kubernetes Job on staging cluster

    • Distributes artifacts to production nodes that lack internet access

    • Installs K3s with Harbor registry mirrors configured so production nodes can pull container images through Harbor

  • Other Scout plays

    • Helm charts are deployed from Ansible control node (charts are bundled in the Scout repository)

    • Container images are pulled by production nodes through Harbor

    • Harbor automatically caches images from upstream registries on first pull

How It Works#

Harbor Pull-Through Proxy#

Harbor acts as a transparent caching proxy for container registries:

  1. Production pod requests an image (e.g., docker.io/postgres:15)

  2. K3s containerd is configured to rewrite requests to Harbor (e.g., staging.example.edu/dockerhub-proxy/postgres:15)

  3. Harbor checks its cache:

    • Cache hit: Returns cached image immediately

    • Cache miss: Downloads from internet, caches, returns to requester

  4. Subsequent requests for the same image are served from Harbor cache

Supported registries:

  • Docker Hub (docker.io)

  • GitHub Container Registry (ghcr.io)

  • Quay.io (quay.io)

  • K8ssandra Container Registry (cr.k8ssandra.io)

  • Kubernetes Registry (registry.k8s.io)

  • Elastic Docker Registry (docker.elastic.co)

  • NVIDIA GPU Cloud (nvcr.io)

  • Apache Superset (apachesuperset.docker.scarf.sh)
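On K3s nodes, registry mirrors of this kind are configured in `/etc/rancher/k3s/registries.yaml`. The fragment below is a sketch of what a Harbor mirror configuration might look like for two of the registries above; the file that the k3s play actually generates may differ. The Harbor hostname and the `ghcr-proxy` project name are assumptions (only `dockerhub-proxy` appears in this guide).

```yaml
# /etc/rancher/k3s/registries.yaml (illustrative; generated by Ansible in practice)
mirrors:
  docker.io:
    endpoint:
      - "https://staging.example.edu"
    rewrite:
      # Prefix every repository path with the Harbor proxy project
      "^(.*)$": "dockerhub-proxy/$1"
  ghcr.io:
    endpoint:
      - "https://staging.example.edu"
    rewrite:
      "^(.*)$": "ghcr-proxy/$1"   # hypothetical project name
configs:
  "staging.example.edu":
    tls:
      # Only appropriate when Harbor uses the self-signed certificate
      insecure_skip_verify: true
```

With a mapping like this, pods keep their original image references and containerd redirects the pull to Harbor, so workload manifests do not need to change.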

Squid Forward Proxy#

Squid is installed as a system service on the staging node to provide outbound HTTPS access for services on the air-gapped production cluster that need to reach external APIs — specifically, Keycloak’s server-to-server calls to external identity provider (IdP) OAuth endpoints.

How it works:

  1. Squid listens on port 3128 on the staging node with a strict domain allowlist

  2. Keycloak on the production cluster is configured to route IdP traffic through the proxy

  3. Squid permits HTTPS CONNECT requests only to allowed domains, denying all other traffic

Allowed domains are computed automatically from your IdP configuration:

| IdP configured | Domains allowed |
|---|---|
| keycloak_microsoft_client_id | login.microsoftonline.com, graph.microsoft.com |
| keycloak_gh_client_id | github.com, api.github.com |

Keycloak integration:

When air_gapped: true and an external IdP is configured, the Keycloak Ansible role automatically configures Keycloak’s spi-connections-http-client-default-proxy-mappings SPI to route IdP traffic through Squid. No manual configuration is needed.
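Keycloak's proxy mappings are a comma-separated list of entries in the form `<hostname regex>;<proxy URI>`. As a rough illustration of what the role might produce for a Microsoft IdP, the entries could look like the list below. The variable name is hypothetical and the proxy URL assumes the staging host from the inventory example; the values the role actually renders are not shown in this guide.

```yaml
# Illustrative only: each entry is "<hostname regex>;<proxy URI>"
keycloak_proxy_mappings:   # hypothetical variable name
  - 'login\.microsoftonline\.com;http://staging.example.edu:3128'
  - 'graph\.microsoft\.com;http://staging.example.edu:3128'
```

Hosts that match no pattern are contacted directly, so only IdP traffic is routed through Squid.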

Adding extra domains:

To allow additional outbound access through the proxy (e.g., for the OpenAI API), set squid_extra_allowed_domains in your staging inventory vars:

staging:
  vars:
    squid_extra_allowed_domains:
      - api.openai.com

Network requirements:

Production cluster nodes must be able to reach the staging node on TCP port 3128. This is in addition to the Harbor registry access listed under Network Connectivity.

K3s Artifact Distribution#

In air-gapped mode, k3s installation artifacts are handled differently:

  1. Download phase (on Ansible control node):

    • K3s binary downloaded from GitHub releases

    • Install script downloaded from get.k3s.io

    • SELinux RPMs downloaded via Kubernetes Job on staging cluster

  2. Distribution phase:

    • Artifacts copied from control node to production nodes via SSH

    • Install script run with INSTALL_K3S_SKIP_DOWNLOAD=true

    • SELinux packages installed via dnf
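The distribution phase can be pictured as a handful of Ansible tasks run against a production server node. This is a sketch under stated assumptions, not the Scout role itself: `k3s_artifact_dir` and `selinux_rpm_paths` are hypothetical variables, the temporary script path is arbitrary, and installing the SELinux packages before running the script is an ordering choice made here.

```yaml
# Illustrative tasks; paths and variable names are assumptions
- name: Copy k3s binary to the node
  ansible.builtin.copy:
    src: "{{ k3s_artifact_dir }}/k3s"          # hypothetical control-node directory
    dest: /usr/local/bin/k3s
    mode: "0755"

- name: Copy install script to the node
  ansible.builtin.copy:
    src: "{{ k3s_artifact_dir }}/install.sh"
    dest: /tmp/k3s-install.sh
    mode: "0755"

- name: Install SELinux packages from local RPM files
  ansible.builtin.dnf:
    name: "{{ selinux_rpm_paths }}"            # hypothetical list of RPM paths already on the node
    state: present
    disable_gpg_check: true                    # assumption: signing key not imported on the node

- name: Run install script without downloading anything
  ansible.builtin.command: /tmp/k3s-install.sh
  environment:
    INSTALL_K3S_SKIP_DOWNLOAD: "true"          # use the binary copied above
    INSTALL_K3S_SKIP_SELINUX_RPM: "true"       # RPMs were installed in the previous task
  args:
    creates: /etc/systemd/system/k3s.service   # server nodes; skips the task on re-runs
```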

SELinux Package Download#

Because production nodes cannot reach the Rancher RPM repository, SELinux packages are downloaded by a Kubernetes Job on the staging cluster:

  1. Ansible creates a Job on the staging cluster

  2. Job runs a Rocky Linux 9 init container that:

    • Configures Rancher k3s yum repository

    • Downloads k3s-selinux and container-selinux with all dependencies

    • Saves RPMs to a shared volume

  3. Main container keeps the pod running

  4. Ansible extracts RPMs using kubectl cp from the pod

  5. RPMs are fetched to control node and distributed to production nodes

This approach ensures correct package versions for Rocky Linux 9 without requiring yum repositories on air-gapped nodes.
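A Job of the shape described above might look like the following manifest. It is a sketch only: the Job name, the `rockylinux:9` image, the repository file contents, and the sleep duration are assumptions, and the manifest that Ansible actually creates may differ.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: selinux-rpm-download        # hypothetical name
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      volumes:
        - name: rpms
          emptyDir: {}              # shared between the init and main containers
      initContainers:
        - name: download
          image: rockylinux:9       # assumption: matches the production node OS
          command: ["/bin/bash", "-c"]
          args:
            - |
              set -euo pipefail
              # Configure the Rancher K3s repository (stable channel)
              cat > /etc/yum.repos.d/rancher-k3s-common.repo <<'EOF'
              [rancher-k3s-common-stable]
              name=Rancher K3s Common (stable)
              baseurl=https://rpm.rancher.io/k3s/stable/common/centos/9/noarch
              enabled=1
              gpgcheck=1
              gpgkey=https://rpm.rancher.io/public.key
              EOF
              # "dnf download" is provided by dnf-plugins-core
              dnf install -y dnf-plugins-core
              dnf download --resolve --alldeps --destdir /rpms \
                k3s-selinux container-selinux
          volumeMounts:
            - name: rpms
              mountPath: /rpms
      containers:
        - name: holder
          image: rockylinux:9
          # Keep the pod alive long enough for Ansible to copy the RPMs out
          command: ["sleep", "3600"]
          volumeMounts:
            - name: rpms
              mountPath: /rpms
```

Note that `kubectl cp` relies on `tar` being available in the container it copies from, so the main container image needs to provide it.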

Limitations#

Version Upgrades#

When upgrading k3s or other components:

  1. Test the upgrade in your staging environment first

  2. Use the -e flag to override versions temporarily (see Testing Upgrades)

  3. Update group_vars/all/versions.yaml after validating the upgrade

  4. Deploy to production

Network Isolation#

Air-gapped mode prevents production nodes from accessing the internet, but:

  • Production nodes still need access to staging Harbor

  • Ansible control node needs access to K8s APIs

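  • This is not a completely isolated environment: the staging node remains connected to the internet, and production nodes reach approved external endpoints through Harbor and the Squid proxy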

Operating System Support#

Air-gapped installations support only Rocky Linux 9 (or a compatible RHEL 9-based distribution) for production nodes, because the SELinux packages are built and downloaded for that platform. The staging node can run other distributions, but Rocky Linux 9 is recommended for consistency.

Additional Information#