The Situation
An enterprise ran a 4-node Azure Local (HCI) cluster — hardened with Windows Defender Application Control (WDAC) in enforced mode. Every unsigned binary that tries to run gets blocked. The backup vendor's agent wasn't signed with a publisher certificate that WDAC recognized, so the standard "push install from the console" workflow was dead on arrival. WinRM was firewalled. The only remote access path was Azure Arc SSH tunneled through the Azure control plane.
The customer needed the agent installed, a supplemental WDAC policy deployed to allow it to run, and the cluster registered with the backup console — all without breaking the security posture of a production environment. And they wanted it done with an AI-assisted workflow, not a manual runbook someone runs once and throws away.
What Made This Hard
1. WDAC Blocks Everything, Including Installers
The backup agent's MSI spawns helper DLLs from temp paths during installation. Even if you whitelist the final binary paths, WDAC blocks the installer process itself in enforced mode. You have to temporarily switch the entire cluster to audit mode, install, deploy the supplemental policy, then re-enforce. That's a cluster-wide security posture change — not something you wing.
2. The Platform Fights You
Azure Local runs an HCI Orchestrator that performs WDAC drift control every 90 minutes. If you manually drop a policy file into the right directory using the standard Windows `CiTool.exe`, the Orchestrator silently deletes it on its next sweep. You must use the Azure Local-specific cmdlets (`Add-ASWDACSupplementalPolicy`) — which aren't documented in the backup vendor's own instructions. The vendor's workaround doc says "use CiTool." That works on standalone Windows Server. It silently fails on Azure Local within 90 minutes.
3. The Only Access Path Has Zero Margin for Error
Azure Arc SSH tunnels through Azure's control plane via IPv6 loopback (`[::1]:22`) on the node. If anything restricts `sshd` to IPv4-only, you lose all remote access instantly. There's no out-of-band recovery path without a support ticket. We know this because it happened — an earlier AI session modified `sshd_config` and locked us out for hours.
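A verify-before-proceed check for exactly this failure mode can be sketched in a few lines. This is an illustrative helper, not the system's actual tooling; the banner check assumes a standard SSH server greeting:

```python
import socket

def sshd_answers_on_v6_loopback(port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if something is speaking SSH on [::1]:port.

    Arc SSH tunnels terminate on the IPv6 loopback, so this is the
    address that must still work after any change to the host.
    """
    try:
        with socket.create_connection(("::1", port), timeout=timeout) as s:
            banner = s.recv(32)  # SSH servers greet with "SSH-2.0-..."
            return banner.startswith(b"SSH-")
    except OSError:
        return False
```

Running this before and after every host-level change turns "did we just cut our own access path?" into a yes/no answer instead of a discovery made hours later.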
How We Did It — And Why the AI Mattered
This wasn't "ask ChatGPT to write a script." The AI system we used has persistent memory, safety rules, and multi-session context. Here's what that actually means in practice:
The AI Remembered a Production Incident
During an earlier session, the AI modified sshd_config on a node — restricting it to IPv4 127.0.0.1 only. Arc SSH uses IPv6. The node became unreachable. Recovery took hours.
After that incident, we wrote a formal root cause analysis and encoded 8 safety rules directly into the AI's configuration:
AI Safety Rules (Loaded Before Every Session) — excerpt
- Never modify `sshd_config` on HCI nodes. Period.
- Pilot-first. Never apply changes to more than one node simultaneously.
- Verify-before-proceed. After every change, test that the system still works before doing anything else.
- Rollback-ready. Before making any change, state the rollback command. If you can't articulate the undo — don't make the change.
- Stop flailing. If 3 attempts at the same approach fail, stop. Present alternatives.
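The rollback-ready rule is the most mechanical of these, and can be expressed as an executable gate. A minimal sketch, with hypothetical names (this is not the actual system's implementation):

```python
from dataclasses import dataclass

@dataclass
class PlannedChange:
    description: str
    command: str   # the change to run
    rollback: str  # the undo, stated *before* execution

def approve(change: PlannedChange) -> str:
    """Refuse to emit any command whose rollback hasn't been articulated."""
    if not change.rollback.strip():
        raise ValueError(f"no rollback stated for: {change.description!r}")
    return change.command
```

The point of the structure is that a change without an articulated undo never reaches the shell at all.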
These rules persisted across every subsequent session. The AI couldn't "forget" them — they're loaded before every conversation starts. When we started the WDAC deployment work weeks later, the AI already knew that Arc SSH uses IPv6, that sshd_config is off-limits, and that every command needs a stated rollback before execution.
The AI Built Its Own Knowledge Base
The backup vendor's WDAC workaround document said to use CiTool.exe to deploy the supplemental policy. Standard Windows guidance. But Azure Local isn't standard Windows — the HCI Orchestrator overrides CiTool changes within 90 minutes.
The AI:
- Read the vendor's workaround document and extracted the installation sequence
- Researched Microsoft's WDAC documentation for Azure Local and found the platform-specific cmdlets
- Identified the conflict — vendor says `CiTool`, Microsoft says `Add-ASWDACSupplementalPolicy`
- Resolved it correctly — use the Azure Local cmdlets, not the vendor's instructions
- Saved the finding to persistent memory so it would never repeat the CiTool mistake in future sessions
Key finding: The memory entry — "HCI Orchestrator drift control runs every 90 min and reverts any manual .cip file drops" — is now available to every future conversation about this cluster.
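Mechanically, a persistent memory like this can be as simple as a keyed store that every session reads before acting. A sketch under that assumption (file name and structure hypothetical):

```python
import json
from pathlib import Path

MEMORY_FILE = Path("cluster-memory.json")  # hypothetical per-cluster store

def remember(key: str, lesson: str) -> None:
    """Persist a lesson so later sessions load it before acting."""
    notes = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    notes[key] = lesson
    MEMORY_FILE.write_text(json.dumps(notes, indent=2))

def recall() -> dict:
    """Everything previous sessions chose to keep."""
    return json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
```

What matters is not the storage format but the contract: findings survive the session that produced them.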
The AI Discovered the API Schema by Experimentation
The backup console uses a GraphQL API with polymorphic union types and undocumented schema quirks. To build an inventory of what was already registered vs. what needed to be added, the AI:
- Ran a schema introspection query to discover the actual field names (the documented ones were wrong)
- Tried multiple enum values for failover cluster queries, discovering through trial and error that `WINDOWS_HOST_ROOT` was the correct filter
- Built working inventory scripts that correctly query Hyper-V servers, failover clusters, and host connections
- Identified that the target nodes existed in the host inventory but weren't registered as a Hyper-V cluster object
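GraphQL's built-in introspection is what makes this kind of discovery possible without documentation. A sketch of the enum-discovery step; the endpoint and auth scheme are assumptions, and the response parsing is factored out so it can be checked independently:

```python
import json
from urllib import request

ENUM_QUERY = """
query EnumMembers($name: String!) {
  __type(name: $name) { enumValues { name } }
}
"""

def parse_enum_values(payload: dict) -> list[str]:
    """Pull enum member names out of an introspection response."""
    t = payload.get("data", {}).get("__type")
    return [v["name"] for v in (t.get("enumValues") or [])] if t else []

def introspect_enum(endpoint: str, token: str, type_name: str) -> list[str]:
    """Ask the server what the enum really contains, rather than
    trusting possibly stale docs. Endpoint and bearer auth are
    illustrative assumptions about the backup console's API."""
    body = json.dumps({"query": ENUM_QUERY,
                       "variables": {"name": type_name}}).encode()
    req = request.Request(endpoint, data=body, headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}"})
    with request.urlopen(req) as resp:
        return parse_enum_values(json.load(resp))
```

Introspecting the live schema is what surfaced `WINDOWS_HOST_ROOT` as a valid filter value when the documented field names turned out to be wrong.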
The AI Wrote the Deployment Plan — With Rollback at Every Phase
The final plan had 7 phases, each with exact commands, expected output, rollback procedures, and warnings about cluster-wide impact. Critical callouts the AI surfaced that a human might miss:
- The supplemental policy XML contains an `Enabled:Audit Mode` flag on line 10. It looks like it would put the node in audit mode. It doesn't — supplemental policies can't override base policy enforcement. The flag is a no-op.
- MSI installation spawns `WixCA.dll` from temp paths. Even with all final binary paths whitelisted, the installer itself gets blocked by WDAC. Audit mode is required during install, not just for the running binaries.
- `Add-ASWDACSupplementalPolicy` deploys cluster-wide from any node. Running it on the pilot node also deploys to the other 3. This is correct behavior — the policy only permits the agent to run, it doesn't install it.
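Flags like that are worth surfacing mechanically before deployment rather than by eyeballing line 10. A sketch that lists every rule option in a policy file, assuming the standard WDAC SiPolicy XML schema:

```python
import xml.etree.ElementTree as ET

# Standard namespace for WDAC policy XML documents
SI = {"si": "urn:schemas-microsoft-com:sipolicy"}

def rule_options(policy_xml: str) -> list[str]:
    """List every rule option declared in a WDAC policy XML, so flags
    like 'Enabled:Audit Mode' can be reviewed before deployment."""
    root = ET.fromstring(policy_xml)
    return [opt.text for opt in root.findall("./si:Rules/si:Rule/si:Option", SI)]
```

Running this against the vendor's supplemental policy makes the audit-mode flag (and any other surprises) visible in one pass.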
The File Transfer Problem
The agent MSI (31 MB) needed to get from a management machine to the nodes. The access path is Azure Arc SSH — no direct network connectivity, no SCP, no shared file system.
What didn't work:
- SCP over Arc SSH — hangs indefinitely (known Arc limitation)
- PSSession (New-PSSession) — Access Denied from Arc SSH context (stricter auth than direct WinRM)
- Base64 over command line — command too long for a 31 MB file
What worked:
- Base64 stdin pipe through Arc SSH — encode the file, pipe it through the SSH tunnel, decode on the other side
- SMB admin shares for node-to-node copies — `\\node\C$\tmp\` works when logged into the cluster with the domain service account
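The stdin-pipe approach hinges on one detail: base64 chunks concatenate into a single valid base64 stream only if each chunk is a multiple of 3 bytes. A sketch of the sender side; the Arc SSH invocation in the comment is illustrative, not the exact command used:

```python
import base64

CHUNK = 48 * 1024  # multiple of 3, so each chunk's base64 output
                   # concatenates into one valid stream for the whole file

def pipe_file_as_b64(path: str, writer) -> None:
    """Stream a file as base64 into `writer` (e.g. an SSH process's
    stdin) without ever holding the 31 MB payload in a command line."""
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            writer.write(base64.b64encode(chunk))

# Illustrative usage over Arc SSH (exact argv is environment-specific):
# proc = subprocess.Popen([...arc ssh argv..., "powershell", "-Command",
#     "[IO.File]::WriteAllBytes('C:\\tmp\\agent.msi',"
#     " [Convert]::FromBase64String([Console]::In.ReadToEnd()))"],
#     stdin=subprocess.PIPE)
# pipe_file_as_b64("agent.msi", proc.stdin); proc.stdin.close()
```

This sidesteps both failure modes above: nothing hangs like SCP, and no single command line ever carries the whole file.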
The AI tried 3 approaches, hit the "stop flailing" rule, and switched strategies. That's the system working as designed — not banging on a broken approach indefinitely.
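The stop-flailing rule itself is simple enough to express directly. A hedged sketch of the attempt cap (names hypothetical; the real system applies this as a behavioral rule, not literal code):

```python
def with_attempt_cap(approach, max_attempts: int = 3):
    """Run an approach at most `max_attempts` times; on exhaustion,
    stop and escalate instead of retrying the same broken path."""
    failures = []
    for _ in range(max_attempts):
        try:
            return approach()
        except Exception as exc:  # a real system would narrow this
            failures.append(exc)
    raise RuntimeError(
        f"stop flailing: {max_attempts} failures, present alternatives"
    ) from failures[-1]
```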
What They Got
On the Pilot Node
- Backup agent installed to the default path with all 6 required binaries verified
- Supplemental WDAC policy deployed cluster-wide using the Orchestrator-safe cmdlet
- WDAC back in enforced mode — no security posture regression
- Agent service running with both `rbs.exe` and `rba.exe` processes active
- Zero WDAC block events in the Code Integrity event log
For the Remaining 3 Nodes
- MSI and policy XML staged on all nodes via SMB admin shares
- Identical runbook ready to execute, one node at a time, with the same pilot-first methodology
For Future Sessions
- Persistent memory entries covering: correct cmdlets, correct installation order, file transfer methods that work, API schema quirks, and the 90-minute Orchestrator drift caveat
- Safety rules that prevent the AI from repeating the sshd lockout incident
- Working GraphQL inventory scripts for ongoing monitoring
The Methodology
This is what "Work Through AI" means in practice. The AI isn't writing code in isolation and hoping it works. It's operating inside a system with:
- Persistent memory — lessons learned survive across sessions. The sshd lockout from Session 1 prevents the same mistake in Session 15.
- Safety rails — non-negotiable rules that constrain what the AI can do on production systems. These aren't suggestions; they're loaded before the conversation starts.
- Rollback-first planning — every change has a stated undo before it executes. This is enforced by the rules, not left to the AI's judgment.
- Domain research capability — the AI reads vendor docs, cross-references platform documentation, and resolves conflicts between them.
- Iterative discovery — when the API schema doesn't match the docs, the AI introspects and adapts. When a file transfer method fails, it tries alternatives and stops after 3 failures.
The human role: architecture decisions, risk approval, and the judgment call on when to proceed. The AI role: research, planning, command generation, verification, and remembering everything that happened so the next session starts where the last one left off.
Tech Stack
This engagement was conducted using a custom-configured AI development environment with persistent memory, safety constraints, and domain-specific tooling for Azure infrastructure operations.