Case Study

Deploying a Backup Agent on Application-Control-Locked Infrastructure — With an AI Copilot That Remembers Its Mistakes

A 4-node Azure Local cluster hardened with WDAC in enforced mode, reachable only through Azure Arc SSH, and an AI assistant with persistent memory, safety rules, and multi-session context.

Scope: 4-node HCI cluster, WDAC-enforced, Azure Arc-managed
Duration: Multi-session engagement
Deliverable: Working agent + validated WDAC policy + repeatable runbook

The Situation

4-Node Cluster | WDAC Enforced | Zero Physical Access | 8 Safety Rules

An enterprise ran a 4-node Azure Local (HCI) cluster hardened with Windows Defender Application Control (WDAC) in enforced mode. Any binary the policy does not explicitly trust is blocked. The backup vendor's agent wasn't signed with a publisher certificate that WDAC recognized, so the standard "push install from the console" workflow was dead on arrival. WinRM was firewalled. The only remote access path was Azure Arc SSH, tunneled through the Azure control plane.

The customer needed the agent installed, a supplemental WDAC policy deployed to allow it to run, and the cluster registered with the backup console — all without breaking the security posture of a production environment. And they wanted it done with an AI-assisted workflow, not a manual runbook someone runs once and throws away.

What Made This Hard

1. WDAC Blocks Everything, Including Installers

The backup agent's MSI spawns helper DLLs from temp paths during installation. Even if you allowlist the final binary paths, WDAC in enforced mode still blocks the installer process itself. You have to temporarily switch the entire cluster to audit mode, install, deploy the supplemental policy, then re-enforce. That's a cluster-wide security posture change, not something you wing.
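The ordering constraint above can be sketched as data plus a validator. This is a minimal illustration, not the engagement's actual runbook: the `Step` type and step names are hypothetical, the `Enable-ASWDACPolicy -Mode ...` spellings are our understanding of the Azure Local cmdlets and should be checked against Microsoft's documentation, and the angle-bracket paths are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    command: str   # PowerShell to run on the node (illustrative strings only)
    rollback: str  # how to undo this step


# Assumed cmdlet spellings; verify against Microsoft's Azure Local docs.
SEQUENCE = [
    Step("audit", "Enable-ASWDACPolicy -Mode Audit",
         "Enable-ASWDACPolicy -Mode Enforced"),
    Step("install", "msiexec /i <agent.msi> /qn",
         "msiexec /x <agent.msi> /qn"),
    Step("policy", "Add-ASWDACSupplementalPolicy -Path <policy.xml>",
         "<remove the supplemental policy>"),
    Step("enforce", "Enable-ASWDACPolicy -Mode Enforced",
         "Enable-ASWDACPolicy -Mode Audit"),
]


def validate(sequence):
    """Return a list of problems; an empty list means the ordering is safe."""
    names = [s.name for s in sequence]
    problems = [f"missing step: {r}"
                for r in ("audit", "install", "policy", "enforce")
                if r not in names]
    if not problems:
        if names.index("audit") > names.index("install"):
            problems.append("installer would run while WDAC is enforced")
        if names.index("policy") > names.index("enforce"):
            problems.append("re-enforcing before the supplemental policy is deployed")
    problems += [f"no rollback for: {s.name}"
                 for s in sequence if not s.rollback.strip()]
    return problems
```

Calling `validate(SEQUENCE)` returns an empty list; reordering the steps so the installer runs before the switch to audit mode produces a named problem instead of a silent failure on the node.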

2. The Platform Fights You

Azure Local runs an HCI Orchestrator that performs WDAC drift control every 90 minutes. If you deploy a policy manually, either by dropping a .cip file into the policy directory or with the standard Windows CiTool.exe, the Orchestrator silently removes it on its next sweep. You must use the Azure Local-specific cmdlets (Add-ASWDACSupplementalPolicy), which aren't mentioned in the backup vendor's own instructions. The vendor's workaround doc says "use CiTool." That works on standalone Windows Server. On Azure Local, the change is silently reverted within 90 minutes.
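One practical consequence: a policy that is present right after deployment proves nothing, because no sweep may have run yet. A minimal sketch of that reasoning follows. The function name and return labels are hypothetical, the 90-minute interval is the figure reported in this case study, and how `policy_present` is determined on the node is deliberately left out.

```python
from datetime import datetime, timedelta

SWEEP_INTERVAL = timedelta(minutes=90)  # interval reported in this case study


def classify_check(deployed_at, checked_at, policy_present,
                   interval=SWEEP_INTERVAL):
    """Classify a policy check as 'reverted', 'too-early', or 'survived'."""
    if not policy_present:
        return "reverted"
    if checked_at - deployed_at < interval:
        # Present, but a full sweep interval has not elapsed yet,
        # so drift control may simply not have run.
        return "too-early"
    return "survived"
```

Only a check made at least one full interval after deployment can distinguish a policy that drift control accepts from one it has not yet seen.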

3. The Only Access Path Has Zero Margin for Error

Azure Arc SSH tunnels through Azure's control plane via IPv6 loopback ([::1]:22) on the node. If anything restricts sshd to IPv4-only, you lose all remote access instantly. There's no out-of-band recovery path without a support ticket. We know this because it happened — an earlier AI session modified sshd_config and locked us out for hours.
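A read-only check can catch this class of mistake before it costs you the node. The sketch below parses sshd_config text (it never writes to it) and reports whether the `AddressFamily` and `ListenAddress` directives still leave the IPv6 loopback reachable. The function names are hypothetical, and the parser is a simplification: it ignores `Include` files and treats hostnames such as `localhost` as unproven rather than resolving them.

```python
def _host_part(value):
    """Strip an optional port from a ListenAddress value."""
    if value.startswith("["):            # [addr]:port form
        return value[1:].split("]", 1)[0]
    if value.count(":") == 1:            # IPv4-or-hostname:port form
        return value.split(":", 1)[0]
    return value                         # bare IPv4, hostname, or IPv6 address


def allows_ipv6_loopback(sshd_config_text):
    """Return False if the config restricts sshd in a way that excludes [::1]."""
    address_family = None
    listen_hosts = []
    for raw in sshd_config_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1].strip().lower()
        if key == "addressfamily" and address_family is None:
            address_family = value            # sshd uses the first value it sees
        elif key == "listenaddress":
            listen_hosts.append(_host_part(value))
    if address_family == "inet":
        return False                          # IPv4-only
    if not listen_hosts:
        return True                           # default: listen on all addresses
    return any(host in ("::", "::1") for host in listen_hosts)
```

A config containing only `ListenAddress 127.0.0.1`, the change described in this case study, makes the function return `False`.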

How We Did It — And Why the AI Mattered

This wasn't "ask ChatGPT to write a script." The AI system we used has persistent memory, safety rules, and multi-session context. Here's what that actually means in practice:

The AI Remembered a Production Incident

During an earlier session, the AI modified sshd_config on a node — restricting it to IPv4 127.0.0.1 only. Arc SSH uses IPv6. The node became unreachable. Recovery took hours.

After that incident, we wrote a formal root cause analysis and encoded 8 safety rules directly into the AI's configuration, including these:

AI Safety Rules (Loaded Before Every Session)

  • Never modify sshd_config on HCI nodes. Period.
  • Pilot-first. Never apply changes to more than one node simultaneously.
  • Verify-before-proceed. After every change, test that the system still works before doing anything else.
  • Rollback-ready. Before making any change, state the rollback command. If you can't articulate the undo — don't make the change.
  • Stop flailing. If 3 attempts at the same approach fail, stop. Present alternatives.
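Rules like these can also be expressed as a mechanical gate in front of command execution. The following is a minimal sketch, not the configuration used in the engagement: `ProposedChange`, `SafetyGate`, and the string matching are all hypothetical, and the sshd_config check is deliberately blunt (it rejects any command that mentions the file, including reads). Verify-before-proceed is not modeled here because it depends on a live health check.

```python
from dataclasses import dataclass, field

MAX_ATTEMPTS = 3  # "stop flailing" threshold from the rules above


@dataclass
class ProposedChange:
    command: str
    nodes: list
    rollback: str = ""


@dataclass
class SafetyGate:
    failures: dict = field(default_factory=dict)  # approach -> failed attempts

    def record_failure(self, approach):
        self.failures[approach] = self.failures.get(approach, 0) + 1

    def check(self, change, approach):
        """Return the violated rules; an empty list means the change may proceed."""
        violations = []
        if "sshd_config" in change.command.lower():
            violations.append("never modify sshd_config")
        if len(change.nodes) != 1:
            violations.append("pilot-first: exactly one node per change")
        if not change.rollback.strip():
            violations.append("rollback-ready: state the undo command first")
        if self.failures.get(approach, 0) >= MAX_ATTEMPTS:
            violations.append("stop flailing: present alternatives")
        return violations
```

The gate returns every violated rule rather than stopping at the first, so the operator sees the full reason a change was refused.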

These rules persisted across every subsequent session. The AI couldn't "forget" them — they're loaded before every conversation starts. When we started the WDAC deployment work weeks later, the AI already knew that Arc SSH uses IPv6, that sshd_config is off-limits, and that every command needs a stated rollback before execution.

The AI Built Its Own Knowledge Base

The backup vendor's WDAC workaround document said to use CiTool.exe to deploy the supplemental policy. Standard Windows guidance. But Azure Local isn't standard Windows — the HCI Orchestrator overrides CiTool changes within 90 minutes.

The AI:

  1. Read the vendor's workaround document and extracted the installation sequence
  2. Researched Microsoft's WDAC documentation for Azure Local and found the platform-specific cmdlets
  3. Identified the conflict — vendor says CiTool, Microsoft says Add-ASWDACSupplementalPolicy
  4. Resolved it correctly — use the Azure Local cmdlets, not the vendor's instructions
  5. Saved the finding to persistent memory so it would never repeat the CiTool mistake in future sessions

Key finding: The memory entry — "HCI Orchestrator drift control runs every 90 min and reverts any manual .cip file drops" — is now available to every future conversation about this cluster.
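Conceptually, persistent memory of this kind is a small store that is appended to when something is learned and read back before each session. The sketch below uses a JSON file to show the idea. It is an assumption for illustration only: the file format, function names, and `topic` field are hypothetical and do not describe how the AI environment in this engagement actually stores its memory.

```python
import json
from pathlib import Path


def load_memory(path):
    """Return all saved findings, or an empty list if nothing is stored yet."""
    p = Path(path)
    if not p.exists():
        return []
    return json.loads(p.read_text(encoding="utf-8"))


def remember(path, topic, finding):
    """Append a finding unless an identical one is already stored."""
    entries = load_memory(path)
    if not any(e["finding"] == finding for e in entries):
        entries.append({"topic": topic, "finding": finding})
        Path(path).write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return entries


def session_preamble(path):
    """Render stored findings as text to load before a session starts."""
    return "\n".join(f"- [{e['topic']}] {e['finding']}"
                     for e in load_memory(path))
```

Saving the drift-control finding once with `remember(...)` means every later call to `session_preamble(...)` includes it, which is the behavior the key finding above describes.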

The AI Discovered the API Schema by Experimentation

The backup console uses a GraphQL API with polymorphic union types and undocumented schema quirks. To build an inventory of what was already registered vs. what needed to be added, the AI:

  1. Ran a schema introspection query to discover the actual field names (the documented ones were wrong)
  2. Tried multiple enum values for failover cluster queries, discovering that WINDOWS_HOST_ROOT was the correct filter through trial and error
  3. Built working inventory scripts that correctly query Hyper-V servers, failover clusters, and host connections
  4. Identified that the target nodes existed in the host inventory but weren't registered as a Hyper-V cluster object
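The first two steps can be sketched with GraphQL's standard introspection fields (`__type`, `enumValues`). This is a minimal illustration under stated assumptions: the enum type name passed in is whatever the console's schema actually calls it (not given in this case study), sending the request to the console is omitted, and the helper names are hypothetical.

```python
import json

# Standard GraphQL introspection: ask for the values of one named enum type.
INTROSPECTION_QUERY = """
query EnumValues($name: String!) {
  __type(name: $name) {
    kind
    enumValues { name }
  }
}
"""


def build_request(enum_type_name):
    """Build the JSON body for an introspection request."""
    return json.dumps({"query": INTROSPECTION_QUERY,
                       "variables": {"name": enum_type_name}})


def enum_values(response_body):
    """Extract enum value names from a response; empty list if not an enum."""
    data = json.loads(response_body).get("data") or {}
    type_info = data.get("__type")
    if not type_info or type_info.get("kind") != "ENUM":
        return []
    return [v["name"] for v in type_info.get("enumValues") or []]
```

Reading the enum's values from the schema replaces guessing: a value such as WINDOWS_HOST_ROOT would appear in the returned list if the server defines it and has introspection enabled.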

The AI Wrote the Deployment Plan — With Rollback at Every Phase

The final plan had 7 phases, each with exact commands, expected output, rollback procedures, and warnings about cluster-wide impact. Critical callouts the AI surfaced that a human might miss:

The File Transfer Problem

The agent MSI (31 MB) needed to get from a management machine to the nodes. The access path is Azure Arc SSH — no direct network connectivity, no SCP, no shared file system.

What didn't work:

What worked:

The AI tried 3 approaches, hit the "stop flailing" rule, and switched strategies. That's the system working as designed — not banging on a broken approach indefinitely.
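This write-up does not record which transfer methods were tried or which one succeeded, so the following is not a description of the engagement. It is a generic sketch of one technique that applies whenever the only channel is a text-oriented session: split the file into base64 chunks, reassemble on the far side, and refuse the result unless its SHA-256 matches. The function names and chunk size are hypothetical.

```python
import base64
import hashlib


def to_chunks(payload, chunk_bytes=48 * 1024):
    """Split bytes into base64 text chunks plus a SHA-256 of the whole payload."""
    chunks = [base64.b64encode(payload[i:i + chunk_bytes]).decode("ascii")
              for i in range(0, len(payload), chunk_bytes)]
    return chunks, hashlib.sha256(payload).hexdigest()


def from_chunks(chunks, expected_sha256):
    """Reassemble chunks and verify integrity before trusting the result."""
    payload = b"".join(base64.b64decode(c) for c in chunks)
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"hash mismatch: expected {expected_sha256}, got {actual}")
    return payload
```

Whatever transport is used, the hash comparison is the part worth keeping: an installer that arrives corrupted should fail loudly before it is ever executed.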

What They Got

On the Pilot Node

For the Remaining 3 Nodes

For Future Sessions

The Methodology

This is what "Work Through AI" means in practice. The AI isn't writing code in isolation and hoping it works. It's operating inside a system with persistent memory, safety rules loaded before every session, and multi-session context.

The human role: architecture decisions, risk approval, and the judgment call on when to proceed. The AI role: research, planning, command generation, verification, and remembering everything that happened so the next session starts where the last one left off.

Tech Stack

Claude Code, Azure Local, Azure Arc SSH, WDAC, HCI Orchestrator

This engagement was conducted using a custom-configured AI development environment with persistent memory, safety constraints, and domain-specific tooling for Azure infrastructure operations.
