The Situation
An enterprise ran a 4-node Azure Local (HCI) cluster — hardened with Windows Defender Application Control (WDAC) in enforced mode. Every unsigned binary that tries to run gets blocked. The backup vendor's agent wasn't signed with a publisher certificate that WDAC recognized, so the standard "push install from the console" workflow was dead on arrival. WinRM was firewalled. The only remote access path was Azure Arc SSH tunneled through the Azure control plane.
The customer needed the agent installed, a supplemental WDAC policy deployed to allow it to run, and the cluster registered with the backup console — all without breaking the security posture of a production environment. And they wanted it done with an AI-assisted workflow, not a manual runbook someone runs once and throws away.
What Made This Hard
1. WDAC Blocks Everything, Including Installers
The backup agent's MSI spawns helper DLLs from temp paths during installation. Even if you whitelist the final binary paths, WDAC blocks the installer process itself in enforced mode. You have to temporarily switch the entire cluster to audit mode, install, deploy the supplemental policy, then re-enforce. That's a cluster-wide security posture change — not something you wing.
2. The Platform Fights You
Azure Local runs an HCI Orchestrator that performs WDAC drift control every 90 minutes. If you manually drop a policy file into the right directory using the standard Windows `CiTool.exe`, the Orchestrator silently deletes it on its next sweep. You must use the Azure Local-specific cmdlets (`Add-ASWDACSupplementalPolicy`) — which aren't documented in the backup vendor's own instructions. The vendor's workaround doc says "use CiTool." That works on standalone Windows Server. It silently fails on Azure Local within 90 minutes.
3. The Only Access Path Has Zero Margin for Error
Azure Arc SSH tunnels through Azure's control plane via IPv6 loopback (`[::1]:22`) on the node. If anything restricts `sshd` to IPv4-only, you lose all remote access instantly. There's no out-of-band recovery path without a support ticket. We know this because it happened — an earlier AI session modified `sshd_config` and locked us out for hours.
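A verify-before-proceed check for exactly this failure mode can be sketched in a few lines. This is an illustrative helper, not the system's actual tooling; the banner check assumes a standard SSH server greeting:

```python
import socket

def sshd_answers_on_v6_loopback(port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if something is speaking SSH on [::1]:port.

    Arc SSH tunnels terminate on the IPv6 loopback, so this is the
    address that must still work after any change to the host.
    """
    try:
        with socket.create_connection(("::1", port), timeout=timeout) as s:
            banner = s.recv(32)  # SSH servers greet with "SSH-2.0-..."
            return banner.startswith(b"SSH-")
    except OSError:
        return False
```

Running this before and after every host-level change turns "did we just cut our own access path?" into a yes/no answer instead of a discovery made hours later.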
How We Did It — And Why the AI Mattered
This wasn't "ask ChatGPT to write a script." The AI system we used has persistent memory, safety rules, and multi-session context. Here's what that actually means in practice:
The AI Remembered a Production Incident
During an earlier session, the AI modified sshd_config on a node — restricting it to IPv4 127.0.0.1 only. Arc SSH uses IPv6. The node became unreachable. Recovery took hours.
After that incident, we wrote a formal root cause analysis and encoded 8 safety rules directly into the AI's configuration:
AI Safety Rules (Loaded Before Every Session) — excerpt
- Never modify `sshd_config` on HCI nodes. Period.
- Pilot-first. Never apply changes to more than one node simultaneously.
- Verify-before-proceed. After every change, test that the system still works before doing anything else.
- Rollback-ready. Before making any change, state the rollback command. If you can't articulate the undo — don't make the change.
- Stop flailing. If 3 attempts at the same approach fail, stop. Present alternatives.
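The rollback-ready rule is the most mechanical of these, and can be expressed as an executable gate. A minimal sketch, with hypothetical names (this is not the actual system's implementation):

```python
from dataclasses import dataclass

@dataclass
class PlannedChange:
    description: str
    command: str   # the change to run
    rollback: str  # the undo, stated *before* execution

def approve(change: PlannedChange) -> str:
    """Refuse to emit any command whose rollback hasn't been articulated."""
    if not change.rollback.strip():
        raise ValueError(f"no rollback stated for: {change.description!r}")
    return change.command
```

The point of the structure is that a change without an articulated undo never reaches the shell at all.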
These rules persisted across every subsequent session. The AI couldn't "forget" them — they're loaded before every conversation starts. When we started the WDAC deployment work weeks later, the AI already knew that Arc SSH uses IPv6, that sshd_config is off-limits, and that every command needs a stated rollback before execution.
The AI Built Its Own Knowledge Base
The backup vendor's WDAC workaround document said to use CiTool.exe to deploy the supplemental policy. Standard Windows guidance. But Azure Local isn't standard Windows — the HCI Orchestrator overrides CiTool changes within 90 minutes.
The AI:
- Read the vendor's workaround document and extracted the installation sequence
- Researched Microsoft's WDAC documentation for Azure Local and found the platform-specific cmdlets
- Identified the conflict — vendor says `CiTool`, Microsoft says `Add-ASWDACSupplementalPolicy`
- Resolved it correctly — use the Azure Local cmdlets, not the vendor's instructions
- Saved the finding to persistent memory so it would never repeat the CiTool mistake in future sessions
Key finding: The memory entry — "HCI Orchestrator drift control runs every 90 min and reverts any manual .cip file drops" — is now available to every future conversation about this cluster.
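Mechanically, a persistent memory like this can be as simple as a keyed store that every session reads before acting. A sketch under that assumption (file name and structure hypothetical):

```python
import json
from pathlib import Path

MEMORY_FILE = Path("cluster-memory.json")  # hypothetical per-cluster store

def remember(key: str, lesson: str) -> None:
    """Persist a lesson so later sessions load it before acting."""
    notes = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    notes[key] = lesson
    MEMORY_FILE.write_text(json.dumps(notes, indent=2))

def recall() -> dict:
    """Everything previous sessions chose to keep."""
    return json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
```

What matters is not the storage format but the contract: findings survive the session that produced them.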
The AI Discovered the API Schema by Experimentation
The backup console uses a GraphQL API with polymorphic union types and undocumented schema quirks. To build an inventory of what was already registered vs. what needed to be added, the AI:
- Ran a schema introspection query to discover the actual field names (the documented ones were wrong)
- Tried multiple enum values for failover cluster queries, discovering through trial and error that `WINDOWS_HOST_ROOT` was the correct filter
- Built working inventory scripts that correctly query Hyper-V servers, failover clusters, and host connections
- Identified that the target nodes existed in the host inventory but weren't registered as a Hyper-V cluster object
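GraphQL's built-in introspection is what makes this kind of discovery possible without documentation. A sketch of the enum-discovery step; the endpoint and auth scheme are assumptions, and the response parsing is factored out so it can be checked independently:

```python
import json
from urllib import request

ENUM_QUERY = """
query EnumMembers($name: String!) {
  __type(name: $name) { enumValues { name } }
}
"""

def parse_enum_values(payload: dict) -> list[str]:
    """Pull enum member names out of an introspection response."""
    t = payload.get("data", {}).get("__type")
    return [v["name"] for v in (t.get("enumValues") or [])] if t else []

def introspect_enum(endpoint: str, token: str, type_name: str) -> list[str]:
    """Ask the server what the enum really contains, rather than
    trusting possibly stale docs. Endpoint and bearer auth are
    illustrative assumptions about the backup console's API."""
    body = json.dumps({"query": ENUM_QUERY,
                       "variables": {"name": type_name}}).encode()
    req = request.Request(endpoint, data=body, headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}"})
    with request.urlopen(req) as resp:
        return parse_enum_values(json.load(resp))
```

Introspecting the live schema is what surfaced `WINDOWS_HOST_ROOT` as a valid filter value when the documented field names turned out to be wrong.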
The AI Wrote the Deployment Plan — With Rollback at Every Phase
The final plan had 7 phases, each with exact commands, expected output, rollback procedures, and warnings about cluster-wide impact. Critical callouts the AI surfaced that a human might miss:
- The supplemental policy XML contains an `Enabled:Audit Mode` flag on line 10. It looks like it would put the node in audit mode. It doesn't — supplemental policies can't override base policy enforcement. The flag is a no-op.
- MSI installation spawns `WixCA.dll` from temp paths. Even with all final binary paths whitelisted, the installer itself gets blocked by WDAC. Audit mode is required during install, not just for the running binaries.
- `Add-ASWDACSupplementalPolicy` deploys cluster-wide from any node. Running it on the pilot node also deploys to the other 3. This is correct behavior — the policy only permits the agent to run, it doesn't install it.
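Flags like that are worth surfacing mechanically before deployment rather than by eyeballing line 10. A sketch that lists every rule option in a policy file, assuming the standard WDAC SiPolicy XML schema:

```python
import xml.etree.ElementTree as ET

# Standard namespace for WDAC policy XML documents
SI = {"si": "urn:schemas-microsoft-com:sipolicy"}

def rule_options(policy_xml: str) -> list[str]:
    """List every rule option declared in a WDAC policy XML, so flags
    like 'Enabled:Audit Mode' can be reviewed before deployment."""
    root = ET.fromstring(policy_xml)
    return [opt.text for opt in root.findall("./si:Rules/si:Rule/si:Option", SI)]
```

Running this against the vendor's supplemental policy makes the audit-mode flag (and any other surprises) visible in one pass.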
The File Transfer Problem
The agent MSI (31 MB) needed to get from a management machine to the nodes. The access path is Azure Arc SSH — no direct network connectivity, no SCP, no shared file system.
What didn't work:
- SCP over Arc SSH — hangs indefinitely (known Arc limitation)
- PSSession (New-PSSession) — Access Denied from Arc SSH context (stricter auth than direct WinRM)
- Base64 over command line — command too long for a 31 MB file
What worked:
- Base64 stdin pipe through Arc SSH — encode the file, pipe it through the SSH tunnel, decode on the other side
- SMB admin shares for node-to-node copies — `\\node\C$\tmp\` works when logged into the cluster with the domain service account
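The stdin-pipe approach hinges on one detail: base64 chunks concatenate into a single valid base64 stream only if each chunk is a multiple of 3 bytes. A sketch of the sender side; the Arc SSH invocation in the comment is illustrative, not the exact command used:

```python
import base64

CHUNK = 48 * 1024  # multiple of 3, so each chunk's base64 output
                   # concatenates into one valid stream for the whole file

def pipe_file_as_b64(path: str, writer) -> None:
    """Stream a file as base64 into `writer` (e.g. an SSH process's
    stdin) without ever holding the 31 MB payload in a command line."""
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            writer.write(base64.b64encode(chunk))

# Illustrative usage over Arc SSH (exact argv is environment-specific):
# proc = subprocess.Popen([...arc ssh argv..., "powershell", "-Command",
#     "[IO.File]::WriteAllBytes('C:\\tmp\\agent.msi',"
#     " [Convert]::FromBase64String([Console]::In.ReadToEnd()))"],
#     stdin=subprocess.PIPE)
# pipe_file_as_b64("agent.msi", proc.stdin); proc.stdin.close()
```

This sidesteps both failure modes above: nothing hangs like SCP, and no single command line ever carries the whole file.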
The AI tried 3 approaches, hit the "stop flailing" rule, and switched strategies. That's the system working as designed — not banging on a broken approach indefinitely.
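The stop-flailing rule itself is simple enough to express directly. A hedged sketch of the attempt cap (names hypothetical; the real system applies this as a behavioral rule, not literal code):

```python
def with_attempt_cap(approach, max_attempts: int = 3):
    """Run an approach at most `max_attempts` times; on exhaustion,
    stop and escalate instead of retrying the same broken path."""
    failures = []
    for _ in range(max_attempts):
        try:
            return approach()
        except Exception as exc:  # a real system would narrow this
            failures.append(exc)
    raise RuntimeError(
        f"stop flailing: {max_attempts} failures, present alternatives"
    ) from failures[-1]
```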
What They Got
On the Pilot Node
- Backup agent installed to the default path with all 6 required binaries verified
- Supplemental WDAC policy deployed cluster-wide using the Orchestrator-safe cmdlet
- WDAC back in enforced mode — no security posture regression
- Agent service running with both `rbs.exe` and `rba.exe` processes active
- Zero WDAC block events in the Code Integrity event log
For the Remaining 3 Nodes
- MSI and policy XML staged on all nodes via SMB admin shares
- Identical runbook ready to execute, one node at a time, with the same pilot-first methodology
For Future Sessions
- Persistent memory entries covering: correct cmdlets, correct installation order, file transfer methods that work, API schema quirks, and the 90-minute Orchestrator drift caveat
- Safety rules that prevent the AI from repeating the sshd lockout incident
- Working GraphQL inventory scripts for ongoing monitoring
The Methodology
This is what "Work Through AI" means in practice. The AI isn't writing code in isolation and hoping it works. It's operating inside a system with:
- Persistent memory — lessons learned survive across sessions. The sshd lockout from Session 1 prevents the same mistake in Session 15.
- Safety rails — non-negotiable rules that constrain what the AI can do on production systems. These aren't suggestions; they're loaded before the conversation starts.
- Rollback-first planning — every change has a stated undo before it executes. This is enforced by the rules, not left to the AI's judgment.
- Domain research capability — the AI reads vendor docs, cross-references platform documentation, and resolves conflicts between them.
- Iterative discovery — when the API schema doesn't match the docs, the AI introspects and adapts. When a file transfer method fails, it tries alternatives and stops after 3 failures.
The human role: architecture decisions, risk approval, and the judgment call on when to proceed. The AI role: research, planning, command generation, verification, and remembering everything that happened so the next session starts where the last one left off.
Tech Stack
This engagement was conducted using a custom-configured AI development environment with persistent memory, safety constraints, and domain-specific tooling for Azure infrastructure operations.