Cases

Real problems, engineered solutions.

Platform engineering, zero-trust security, observability, and self-service infrastructure — from architecture to production.

Enterprise Zero-Trust K8s Lab

Production-grade, self-hosted Kubernetes platform with GitOps, SSO, secrets management, and observability.

Kubernetes · ArgoCD · Cilium · Vault · Authentik

SSH Management with Keycloak

Centrally managed SSH access using Keycloak group membership — no LDAP, no FreeIPA, works air-gapped.

Keycloak · SSH · REST API · Linux

Azure Storage Policy Automation

K8s-native Python service reconciling Azure Blob access policies and rotating SAS tokens into Key Vault.

Python · Azure · Key Vault · K8s CronJob

Crossplane Self-Service Portal

Self-service portal abstracting Kubernetes and Crossplane behind a role-based wizard UI.

Crossplane · Python · Kubernetes · AWS · Azure

AI Observability Agents

Non-invasive AI overlay on OTel → ClickHouse → Grafana — anomaly detection, correlation, and LLM-powered incident reasoning.

OpenTelemetry · ClickHouse · LLM · NATS

Enterprise Zero-Trust Kubernetes Lab

A production-grade, self-hosted Kubernetes platform with GitOps, SSO, secrets management, observability, and cloud resource provisioning.

Challenge

Design and operate a realistic, enterprise-grade Kubernetes environment on bare metal that demonstrates the full DevOps/GitOps lifecycle, from cluster bootstrapping to zero-trust access control.

Key Decisions

  • Replaced MetalLB with Cilium LB IPAM — eBPF-native load balancing, fewer components.
  • Separated Helm chart repos from values repos — one ApplicationSet deploys any chart to any cluster.
  • K8s API server integrated with Authentik OIDC — RBAC driven by SSO group membership.
  • OAuth2-Proxy as Traefik middleware — adds auth without modifying apps.
  • ClickHouse over Loki/Mimir — SQL-native queries over high-cardinality telemetry.
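The chart/values split can be pictured as a matrix expansion: every (chart, cluster) pair becomes one Argo CD Application whose Helm source reads its values file from a second repository. The Python sketch below only illustrates that expansion. In the lab the work is done by an ApplicationSet, and every repository URL, chart name, cluster name, and the values path layout shown here is a hypothetical placeholder.

```python
from itertools import product

# Hypothetical inputs: none of these names come from the lab itself.
CHARTS = [
    {"name": "grafana", "repo": "https://charts.example.org", "version": "1.2.3"},
    {"name": "traefik", "repo": "https://charts.example.org", "version": "4.5.6"},
]
CLUSTERS = [
    {"name": "dev", "server": "https://dev.example.internal:6443"},
    {"name": "prod", "server": "https://prod.example.internal:6443"},
]
VALUES_REPO = "https://git.example.org/platform/values.git"


def render_application(chart: dict, cluster: dict) -> dict:
    """Build one Argo CD Application that pairs a chart source with a values source."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"{chart['name']}-{cluster['name']}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "destination": {"server": cluster["server"], "namespace": chart["name"]},
            "sources": [
                {
                    # Source 1: the Helm chart, pinned to a version.
                    "repoURL": chart["repo"],
                    "chart": chart["name"],
                    "targetRevision": chart["version"],
                    # "$values" points at the source below that declares ref: values.
                    "helm": {"valueFiles": [f"$values/{cluster['name']}/{chart['name']}.yaml"]},
                },
                # Source 2: the separate values repository (assumed layout: <cluster>/<chart>.yaml).
                {"repoURL": VALUES_REPO, "targetRevision": "main", "ref": "values"},
            ],
        },
    }


def render_all(charts: list, clusters: list) -> list:
    """Expand the chart x cluster matrix into Application manifests."""
    return [render_application(chart, cluster) for chart, cluster in product(charts, clusters)]


if __name__ == "__main__":
    for app in render_all(CHARTS, CLUSTERS):
        print(app["metadata"]["name"])
```

Adding a chart or a cluster is then a one-line change to the input lists, which is the property the single ApplicationSet is meant to provide.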

Outcomes

  • Fully automated GitOps: a single git push deploys across all clusters.
  • Zero standing credentials via Vault + External Secrets Operator.
  • Complete cluster rebuild in under 30 minutes.
  • Published as open-source training (CC BY-NC).

Technology Stack

Kubernetes/k3s · ArgoCD · Cilium · Traefik · Helm · Crossplane · Vault · Authentik · OpenTelemetry · ClickHouse · Grafana

SSH User Management with Keycloak

A lightweight, LDAP-free solution for centrally managing SSH access using Keycloak group membership — designed for air-gapped environments.

Problem

Traditional SSH access relies on FreeIPA/AD/LDAP. In air-gapped environments, Keycloak cannot serve as an LDAP directory. Challenge: centrally manage SSH keys and per-server access without a user directory.

Solution

Store SSH public keys as Keycloak user attributes. Use group membership for per-server access. A custom AuthorizedKeysCommand queries the Keycloak REST API at login time.
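The decision made at login time can be sketched as a small pure function: given the user record and the user's groups as returned by Keycloak, emit the keys only if the user is enabled and belongs to the server's allowed group. This is an illustration in Python (the stack above lists Bash for the real script). The attribute name `sshPublicKey`, the group name, and the sample key are assumptions, and the HTTP calls to Keycloak are left out.

```python
def authorized_keys(user: dict, groups: list[dict], allowed_group: str,
                    key_attribute: str = "sshPublicKey") -> list[str]:
    """Return the user's SSH public keys if they may log in to this server.

    `user` and `groups` are assumed to be the JSON bodies a real command would
    fetch from the Keycloak Admin REST API with a service-account token
    (the user lookup and the user's group list). That fetch is omitted here.
    """
    group_names = {group.get("name") for group in groups}
    if not user.get("enabled", False) or allowed_group not in group_names:
        return []
    # Keycloak serialises user attributes as {"name": ["value", ...]}.
    keys = user.get("attributes", {}).get(key_attribute, [])
    return [key.strip() for key in keys if key.strip()]


if __name__ == "__main__":
    # Hypothetical sample data standing in for the Keycloak responses.
    sample_user = {
        "username": "alice",
        "enabled": True,
        "attributes": {"sshPublicKey": ["ssh-ed25519 AAAAEXAMPLEKEY alice@laptop"]},
    }
    sample_groups = [{"name": "ssh-web-servers", "path": "/ssh-web-servers"}]
    # sshd reads authorized_keys lines from the command's standard output.
    for line in authorized_keys(sample_user, sample_groups, "ssh-web-servers"):
        print(line)
```

sshd would invoke such a command through `AuthorizedKeysCommand` with the login name passed as an argument (for example via the `%u` token). Printing nothing means no key is accepted.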

Key Decisions

  • Event-driven provisioning via Keycloak webhooks; scheduled sync as fallback.
  • ed25519 enforced via attribute validation regex.
  • Each server reads its allowed group from a local config — script is generic.
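As an illustration of the ed25519 check, the sketch below shows one regex that could enforce it. It relies on a property of the OpenSSH public key format: every ed25519 key blob starts with the same header bytes, so its base64 form always begins with `AAAAC3NzaC1lZDI1NTE5AAAAI` followed by 43 more base64 characters. The exact pattern used in the project is not given in the source, so this one is an assumption.

```python
import re

# "ssh-ed25519", the fixed base64 header, 43 base64 characters of key material,
# then an optional single-line comment. Newlines are excluded so a value cannot
# smuggle in a second authorized_keys line.
ED25519_KEY = re.compile(
    r"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAI[A-Za-z0-9+/]{43}( [^\r\n]*)?"
)


def is_valid_ed25519(line: str) -> bool:
    """Return True if `line` looks like a single OpenSSH ed25519 public key."""
    return ED25519_KEY.fullmatch(line.strip()) is not None


if __name__ == "__main__":
    # Syntactically valid placeholder key (all-zero key material), not a real key.
    sample = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAI" + "A" * 43 + " alice@laptop"
    print(is_valid_ed25519(sample))
    print(is_valid_ed25519("ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQ alice@laptop"))
```

A pattern like this only checks the shape of the value. It does not prove the key material is a usable public key.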

Outcomes

  • Centralized SSH management without FreeIPA/AD/LDAP.
  • Revoking access takes effect at the next login attempt: remove the user from the group.
  • Single script works across all managed servers.

Technology Stack

Keycloak · SSH/sshd · REST API · PAM · Bash · Linux

Azure Storage Policy Automation

A Kubernetes-native Python service that reconciles Azure Blob container access policies and rotates SAS tokens into Key Vault.

Problem

Dozens of storage containers required manually managed access policies and SAS tokens. Tokens expired silently, policies drifted, and there was no audit trail.

[Figure: SAS Manager Architecture]

Key Decisions

  • AKS Workload Identity — no secrets in the cluster.
  • Metadata-driven: policy_* container tags define desired state.
  • K8s CronJob handles scheduling, retries, observability.
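The metadata-driven approach amounts to a reconcile step: parse the `policy_*` entries into a desired state, compare with what the container currently has, and derive the changes. A minimal sketch of that diff follows. The value format `"<permissions>:<days>"` and all sample names are assumptions made for illustration, since the source does not describe the tag format, and the Azure SDK calls that would read and write the real policies are omitted.

```python
from dataclasses import dataclass

PREFIX = "policy_"


@dataclass(frozen=True)
class Policy:
    permissions: str  # e.g. "rl" for read + list (assumed encoding)
    valid_days: int


def desired_policies(metadata: dict[str, str]) -> dict[str, Policy]:
    """Turn policy_* entries into desired policies; other entries are ignored."""
    desired = {}
    for key, value in metadata.items():
        if not key.startswith(PREFIX):
            continue
        # Assumed value format: "<permissions>:<days>", for example "rl:30".
        permissions, _, days = value.partition(":")
        desired[key[len(PREFIX):]] = Policy(permissions, int(days))
    return desired


def plan(desired: dict[str, Policy], current: dict[str, Policy]) -> dict[str, list[str]]:
    """Compute which stored access policies to create, update, or delete."""
    shared = desired.keys() & current.keys()
    return {
        "create": sorted(desired.keys() - current.keys()),
        "update": sorted(name for name in shared if desired[name] != current[name]),
        "delete": sorted(current.keys() - desired.keys()),
    }


if __name__ == "__main__":
    metadata = {"policy_reader": "rl:30", "policy_writer": "rw:7", "owner": "team-a"}
    current = {"reader": Policy("r", 30), "legacy": Policy("rwd", 90)}
    print(plan(desired_policies(metadata), current))
```

Because the plan is recomputed from the tags on every run, a manual change to a policy shows up as an `update` or `delete` entry on the next CronJob execution.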

Outcomes

  • Eliminated manual SAS token management subscription-wide.
  • Policy drift auto-corrected on every run.
  • All SAS tokens stored in Key Vault and consumed by workloads via ExternalSecrets.

Technology Stack

Python · Azure SDK · Blob Storage · Key Vault · Workload Identity · K8s CronJob · ExternalSecrets · ArgoCD · Helm

Crossplane Self-Service Infrastructure Portal

A Python-based self-service portal abstracting Kubernetes and Crossplane behind a role-based wizard UI.

Problem

Developers must understand XRDs, compositions, and K8s manifests to provision cloud resources. Every request flows through the platform team.

Solution

The UI pushes Crossplane custom resources to a central job queue. Per-project Applier Agents pick jobs from the queue and apply them at a controlled rate. A State Poller reads XR status into a cache.
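The rate-controlled apply step can be sketched as a consumer that enforces a minimum interval between applies. The example uses an in-process `queue.Queue` as a stand-in for the central job queue and a plain callable in place of the Kubernetes client. The class name `ApplierAgent`, its parameters, and the sample manifests are hypothetical.

```python
import queue
import time
from typing import Callable


class ApplierAgent:
    """Drains a job queue while never applying faster than `max_per_second`."""

    def __init__(self, jobs: queue.Queue, apply: Callable[[dict], None],
                 max_per_second: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.jobs = jobs
        self.apply = apply
        self.min_interval = 1.0 / max_per_second
        self.clock = clock  # injectable so the pacing can be tested without waiting
        self.sleep = sleep
        self._next_allowed = float("-inf")  # the first job is applied immediately

    def drain(self) -> int:
        """Apply every queued manifest in order and return how many were applied."""
        applied = 0
        while True:
            try:
                manifest = self.jobs.get_nowait()
            except queue.Empty:
                return applied
            wait = self._next_allowed - self.clock()
            if wait > 0:
                self.sleep(wait)  # back off so the API server sees a steady rate
            self.apply(manifest)
            self._next_allowed = self.clock() + self.min_interval
            applied += 1


if __name__ == "__main__":
    jobs = queue.Queue()
    for index in range(3):
        jobs.put({"kind": "Database", "metadata": {"name": f"db-{index}"}})
    agent = ApplierAgent(jobs, apply=lambda m: print("apply", m["metadata"]["name"]),
                         max_per_second=2.0)
    agent.drain()
```

A burst of requests from the UI then only lengthens the queue. The Kubernetes API still receives at most `max_per_second` writes from each agent.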

[Figure: Infrastructure Planning UI Architecture]

Key Decisions

  • Queue-based decoupling protects K8s API from bursts.
  • Per-project Applier Agents: control plane by default, external cluster when compliance requires.
  • Cache-first reads with event-triggered refresh for near-real-time status.
  • Provider credentials encrypted at rest, injected server-side — never exposed in UI.
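Cache-first reads with event-triggered refresh can be sketched as a small cache that serves stored XR status while it is fresh and re-reads it when an event arrives or the entry ages out. The class name `StatusCache`, the `fetch` callable, and the 30-second maximum age are assumptions for illustration. The real State Poller and its event source are not described in the source.

```python
import time
from typing import Callable


class StatusCache:
    """Serves XR status from memory; refreshes on events or when entries go stale."""

    def __init__(self, fetch: Callable[[str], dict], max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch            # stand-in for reading XR status from the cluster
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, name: str) -> dict:
        """Cache-first read: only hit the backend when the entry is missing or stale."""
        entry = self._entries.get(name)
        if entry is not None and self._clock() - entry[0] < self._max_age:
            return entry[1]
        return self.refresh(name)

    def refresh(self, name: str) -> dict:
        """Read the current status from the backend and store it with a timestamp."""
        status = self._fetch(name)
        self._entries[name] = (self._clock(), status)
        return status

    def on_event(self, name: str) -> None:
        """Event-triggered refresh: re-read right away instead of waiting for expiry."""
        self.refresh(name)


if __name__ == "__main__":
    cache = StatusCache(fetch=lambda name: {"name": name, "ready": True}, max_age_seconds=30)
    print(cache.get("db-1"))  # first read goes to the backend
    print(cache.get("db-1"))  # second read is served from the cache
```

UI page loads therefore cost no Kubernetes API calls in the common case, while a change event still updates the displayed status without waiting for the entry to expire.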

Outcomes

  • Developers provision infra without K8s/Crossplane knowledge.
  • Platform team bottleneck eliminated.
  • Flexible: regulated teams run their own agent connecting to central queue.

Technology Stack

Python · NiceGUI · Crossplane · Kubernetes · AWS · Azure

AI Observability Agents

A non-invasive AI overlay on OTel → ClickHouse → Grafana — anomaly detection, correlation, and LLM-powered incident reasoning with zero pipeline changes.

Problem

Observability stacks generate enormous signal volumes but require manual correlation. Alert fatigue is high, root cause analysis is slow, and there is no business-context layer.

[Figure: AI Observability Architecture]

Key Decisions

  • Read-only integration: zero changes to OTel, ClickHouse, or Grafana.
  • LLM only for reasoning over structured events — never raw data or anomaly math.
  • Each component is an independent K8s Deployment, communicating via an event bus (NATS/Kafka).
  • Pluggable dispatcher — alerting backends swapped without touching detection.

Outcomes

  • Incidents enriched with root cause and business impact before paging.
  • Cross-service correlation reduces alert noise.
  • SLO/SLA-aware severity classification.
  • Feedback loop for continuous accuracy improvement.

Technology Stack

OpenTelemetry · ClickHouse · Grafana · LLM · NATS/Kafka · Kubernetes · Python