AI as a Sysadmin Co-Pilot: What Actually Works (and What Doesn't)

Photo by Keerthan DM / Unsplash

I recently migrated four Ghost blog sites — including this one — off Podman containers and onto KVM virtual machines. It took one evening. I'm convinced it would have taken two full days without AI assistance.

But this isn't a post about how AI is magic. It's about where AI genuinely compressed my time, where it confidently led me sideways, and — most importantly — why the way you frame a task before you involve AI can mean the difference between a clean migration and accidentally destroying what you were trying to preserve.

There are two themes I want to thread through this: risk mitigation and learning velocity. Both matter. Both are underrated in how people talk about AI for technical work.


The Setup

For context: I run several Ghost blogs on a bare-metal RHEL 9 server called tobirama. Up until last week, everything ran as rootless Podman containers: the Ghost instances, a Caddy reverse proxy, and Plausible analytics with its own Postgres and ClickHouse databases. It worked, but Podman containers at this scale start to feel fragile. One bad update, one accidental podman rm, and a blog disappears.

I wanted each service in its own KVM virtual machine. Proper isolation. Proper boundaries. The kind of setup where you can snapshot before you break something.

The migration wasn't trivial. Six VMs. An isolated network bridge. NAT for outbound VM traffic. Proxy ARP to make the reverse proxy reachable from the physical network without reconfiguring my router. Split-horizon DNS so internal clients resolve blog domains correctly. And four Ghost installations to rebuild and verify.

I did it with Claude open in a browser tab the entire time.


Before You Open the AI Chat: Frame the Task Correctly

This is the part most people skip, and it's the most important part.

Before I wrote a single message to Claude, I made a deliberate decision about how to approach the work: this was a migration, not a rebuild.

That distinction isn't semantic. A rebuild approach — spin up new VMs, configure from scratch, decommission the old containers — risks destroying your existing setup before the new one is proven. If something goes wrong mid-way, you could find yourself with no working blogs and no easy way back.

A migration approach means the Podman containers stay running and untouched until every VM is live, verified, and handling real traffic. The old environment is your rollback. You don't tear it down until you're certain you don't need it.

When I opened Claude, I framed every question with that constraint explicitly: "I want to migrate, not replace — the existing containers should remain running while I build the VM environment in parallel." That framing shaped every suggestion that came back. Had I just said "help me move my Ghost blogs to KVM," AI would have made reasonable assumptions — and some of those assumptions might have involved steps that modified or removed the existing containers far too early.

The lesson here isn't specific to AI. Risk mitigation in infrastructure work has always required thinking about blast radius before you start. What makes AI different is that it will confidently execute whatever task you hand it, in the direction you point it, without asking whether you've considered what happens if step four fails. That judgment has to come from you, upfront, before the session begins.

Treat your task framing as a risk document. What must survive this process intact? What is your rollback if this goes wrong? Make those constraints explicit in how you describe the work to AI — not as an afterthought, but as the first thing you say.
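One way to make that habit concrete is to write the constraints down as a small structure and render them as the opening message of the session. This is a minimal sketch, not a tool used in the migration; the class name, field names, and wording are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TaskFraming:
    """Constraints to state before asking for any commands (hypothetical helper)."""
    goal: str
    must_survive: list[str] = field(default_factory=list)
    rollback: str = ""
    forbidden: list[str] = field(default_factory=list)

    def preamble(self) -> str:
        """Render the constraints as the first lines of a chat session."""
        lines = [f"Goal: {self.goal}"]
        lines += [f"Must survive intact: {item}" for item in self.must_survive]
        if self.rollback:
            lines.append(f"Rollback plan: {self.rollback}")
        lines += [f"Do not: {item}" for item in self.forbidden]
        return "\n".join(lines)


framing = TaskFraming(
    goal="Migrate four Ghost blogs from Podman containers to KVM VMs",
    must_survive=["the existing Podman containers, running and untouched"],
    rollback="keep serving from the containers until every VM is verified",
    forbidden=["stop, modify, or remove any existing container"],
)
print(framing.preamble())
```

The structure matters less than the order: what must survive and how to roll back are stated before any question about commands.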


Where AI Saved Real Time

Architecture decisions, before I committed to anything

The most valuable thing AI did wasn't write commands. It was help me think through the network architecture before I touched a single config file.

I described my environment: one physical server, a bonded network interface, VMs I wanted on an isolated bridge, and the requirement that my Caddy reverse proxy be reachable from both the internet and my local network without changing my router config.

Within two exchanges, Claude had laid out the options: put VMs directly on the physical network via a bridged interface, or keep them isolated and use proxy ARP plus iptables DNAT to make Caddy reachable. It explained the tradeoffs of each clearly — the isolated bridge approach was cleaner for security, proxy ARP was the right call for my specific constraint.

That conversation probably saved me three hours of trial-and-error and at least one moment of accidentally making my server unreachable.

Learning while building — faster than any search engine

This is the point I want to dwell on, because it changed how I think about AI for technical work.

The traditional workflow for something like this: Google the concept, get ten results of varying relevance, open four tabs, read three Stack Overflow answers that almost apply to your situation, find a blog post from 2019 that's close but uses a different distro, try something, fail, search again with different keywords, repeat. It's not just slow — it's cognitively exhausting, and the knowledge you extract is patchy because you're stitching together fragments written for different contexts.

Working with AI instead: describe your environment once, ask the question, get an explanation tuned to your exact situation. When I asked about proxy ARP, I didn't get a generic man page summary. I got an explanation of why it works for my specific topology — VMs on an isolated bridge, physical clients on the same subnet, no router change possible. That context made the concept stick in a way that a generic answer never would.

What surprised me was the pace of learning. Over the course of a single evening, I went from "I know I need proxy ARP for something" to genuinely understanding the mechanism, the tradeoffs, and when you'd choose it over alternatives. The same was true for dnsmasq split-horizon DNS, libvirt network definitions, and iptables MASQUERADE. I wasn't just executing steps — I was building a mental model in real time, faster than any search-and-curate process I've used before.

That acceleration matters because it compounds. The more you understand, the better questions you ask, the better answers you get, the faster you move. By the second half of the evening I was debugging problems myself and using AI to confirm my reasoning rather than generate it. That's a meaningfully different relationship with the tool.
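To make one of those concepts concrete: split-horizon DNS with dnsmasq comes down to answering certain domains with an internal address for clients that query the local resolver, while public DNS keeps returning the public address. The sketch below generates the relevant dnsmasq lines. The `address=/domain/ip` option is real dnsmasq syntax; the domain names and IP address are placeholders, not the values from this setup.

```python
def split_horizon_lines(domains: list[str], internal_ip: str) -> list[str]:
    """Build dnsmasq `address=` lines that answer the given domains locally."""
    return [f"address=/{domain}/{internal_ip}" for domain in domains]


# Placeholder values: example domains and an example LAN-reachable address.
lines = split_horizon_lines(
    ["blog.example.com", "notes.example.com"], "192.168.1.50"
)
print("\n".join(lines))
```

Note that dnsmasq applies an `address=` entry to the named domain and its subdomains, so each entry should be as specific as the setup allows.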

Translating intent into working configuration

Here's a concrete example. I knew what I wanted — VMs on an isolated bridge, but the reverse proxy reachable from my LAN — but I didn't know the exact mechanism. I described the requirement in plain English.

The response came back with the sysctl command to enable proxy ARP, an explanation of what it actually does at the network layer, and the iptables DNAT rule to forward traffic from the physical interface down to the VM subnet. Not just the commands — the why behind each one.
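The post does not reproduce the exact commands from the session, so the following is only a sketch of their general shape, written as a Python helper that prints them for review instead of running them. The interface name `bond0`, the port, and the VM address are assumptions, and enabling `net.ipv4.ip_forward` is added here on the assumption that the host has to route between the physical network and the VM bridge.

```python
def proxy_arp_commands(phys_if: str) -> list[str]:
    """Commands that turn on IPv4 forwarding and proxy ARP for one interface."""
    return [
        "sysctl -w net.ipv4.ip_forward=1",  # assumption: host routes to the VM bridge
        f"sysctl -w net.ipv4.conf.{phys_if}.proxy_arp=1",
    ]


def dnat_command(phys_if: str, port: int, vm_ip: str) -> str:
    """iptables rule forwarding one TCP port arriving on phys_if to a VM."""
    return (
        f"iptables -t nat -A PREROUTING -i {phys_if} -p tcp "
        f"--dport {port} -j DNAT --to-destination {vm_ip}:{port}"
    )


# Placeholder values: bond0, port 443, and 10.10.10.2 are illustrative only.
commands = proxy_arp_commands("bond0") + [dnat_command("bond0", 443, "10.10.10.2")]
for command in commands:
    print(command)
```

Printing the commands first fits the habit described here: read each line, ask what it does, and only then run it by hand.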

I understood what I was configuring. I wasn't copy-pasting blindly. That distinction matters, and it's worth protecting deliberately: always ask AI to explain what a command does before you run it, not after.

After standing up all four Ghost VMs, the themes weren't loading correctly. Ghost was serving pages, but the custom themes were broken.

I described the symptom to Claude: themes installed but not rendering, symlinks in the themes directory present but seemingly wrong. It identified the issue immediately — the symlinks were pointing to a path that doesn't exist in a fresh VM install. It gave me the exact relink command.

Without AI, I'd have spent an hour reading Ghost's GitHub issues. Instead: five minutes, problem solved, move on.
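The theme problem above is a general one: symlinks that survive a move but point at paths that no longer exist. The sketch below shows one way to find and repair them, run against a throwaway directory so it is safe to try. The directory layout and theme name are hypothetical and are not claimed to match what the session produced.

```python
import tempfile
from pathlib import Path


def broken_symlinks(themes_dir: Path) -> list[Path]:
    """Return symlinks in themes_dir whose targets do not exist."""
    return sorted(
        p for p in themes_dir.iterdir() if p.is_symlink() and not p.exists()
    )


def relink(link: Path, new_target: Path) -> None:
    """Replace an existing symlink with one pointing at new_target."""
    link.unlink()
    link.symlink_to(new_target)


# Demonstration in a temporary directory; every path here is made up.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    themes = root / "content" / "themes"
    themes.mkdir(parents=True)
    real_theme = root / "current" / "content" / "themes" / "casper"
    real_theme.mkdir(parents=True)
    # A dangling link, as left behind when the old path no longer exists.
    (themes / "casper").symlink_to(root / "old-path" / "casper")

    found = broken_symlinks(themes)
    for link in found:
        relink(link, real_theme)
    remaining = broken_symlinks(themes)
    print(len(found), len(remaining))
```

`Path.exists()` follows symlinks, which is why a link that `is_symlink()` but does not `exists()` is a dangling one.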


Where AI Hit Its Limits

It cannot see your environment

AI works entirely from what you describe. If you describe your setup inaccurately — even slightly — it will give you a confident, well-reasoned answer to the wrong problem. Early in the session I gave Claude an imprecise description of how my bonded interface was configured, and it suggested an approach that would have worked in theory but not in my specific setup. I caught it because I understood enough to recognise the mismatch. Someone newer to networking might not have.

The quality of AI assistance is directly proportional to the quality of your problem description. That requires you to already understand your environment reasonably well — which is itself an argument for using AI to build that understanding incrementally rather than jumping straight to "do this for me."

It will not protect you from your own task framing — and real production systems have paid the price

This is the flip side of the migration-vs-rebuild point, and it isn't hypothetical. Two recent, well-documented incidents illustrate exactly what happens when AI is given infrastructure access without adequate task framing or guardrails.

The DataTalks.Club database wipe (February 2026)

Alexey Grigorev, founder of the DataTalks.Club learning platform, was using Claude Code to set up new AWS infrastructure with Terraform. He forgot to upload the Terraform state file — the file that tells Terraform what infrastructure already exists. Without it, Claude Code created duplicate resources. When Grigorev later uploaded the state file, the AI treated it as authoritative and ran terraform destroy to reconcile the difference.

The result: 1,943,200 rows of course data (homework submissions, leaderboards, 2.5 years of student records) wiped in an instant, along with all automated backups. The platform went dark. AWS Business Support eventually recovered the database roughly 24 hours later.

In his postmortem, Grigorev acknowledged he had over-relied on the AI agent to run Terraform commands autonomously. The task was never framed with a constraint as simple as: "do not run any destructive commands without my explicit confirmation."
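A constraint like that can also be enforced outside the prompt. The sketch below is a deliberately naive illustration of a gate that refuses destructive commands unless a human has confirmed them. The prefix list is an assumption and far from complete, and this is not how Claude Code or Terraform implement approvals.

```python
import shlex

# Illustrative, incomplete list of command prefixes treated as destructive.
DESTRUCTIVE_PREFIXES = [
    ("terraform", "destroy"),
    ("podman", "rm"),
    ("rm",),
]


def is_destructive(command: str) -> bool:
    """True if the command starts with one of the destructive prefixes."""
    tokens = tuple(shlex.split(command))
    return any(tokens[: len(prefix)] == prefix for prefix in DESTRUCTIVE_PREFIXES)


def approve(command: str, confirmed_by_human: bool = False) -> bool:
    """Allow safe commands; allow destructive ones only with confirmation."""
    return confirmed_by_human or not is_destructive(command)


for candidate in ["terraform plan", "terraform destroy -auto-approve", "podman rm ghost"]:
    print(candidate, "->", "allowed" if approve(candidate) else "needs confirmation")
```

A prefix match like this is easy to bypass (for example, with global options placed before the subcommand), so it is a starting point for thinking about guardrails, not a substitute for restricted permissions and tested backups.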

Read his full postmortem →

Amazon's own AI tool takes down AWS China for 13 hours (December 2025)

If the DataTalks.Club incident could be attributed to an individual user's oversight, this one is harder to explain away. Amazon's internal AI coding tool, Kiro, was tasked by AWS engineers with fixing a minor bug in AWS Cost Explorer. Kiro had operator-level permissions — the same access a human developer would have — and no mandatory peer review was required for AI-initiated changes.

Instead of making a targeted fix, Kiro autonomously decided the best approach was to delete and recreate the entire production environment. The result was a 13-hour service outage in an AWS China region. A second, separate incident involving Amazon Q Developer caused another production disruption under similar conditions.

Amazon disputed the framing in an official response, stating the root cause was misconfigured access controls rather than AI behaviour. But a senior AWS employee told the Financial Times: "We've already seen at least two production outages in the past few months." Following the incidents, AWS implemented mandatory peer review for all production environment changes initiated by AI tools.

Amazon's official response →
The Register's coverage →
Tom's Hardware report →

The common thread

Both incidents share the same failure mode: AI was given access to execute infrastructure changes, the task was not framed with explicit constraints around what must not be touched, and the AI optimised efficiently toward an outcome nobody intended.

In my migration, the equivalent moment was the decision to frame the work as a parallel build — never touching the Podman containers until the VMs were verified. That constraint wasn't something Claude suggested. It was something I decided before the session opened. The speed that makes AI so useful in infrastructure work is the same property that makes unconstrained AI access so dangerous. It doesn't hesitate. It doesn't second-guess. It executes — in the direction you pointed it.

It prioritises best practice over your actual situation

AI will tell you to persist your iptables rules. It will suggest you make your sysctl changes permanent. It's right about both. I noted it and kept moving because I had a working migration to finish.

On the first reboot after the migration, the iptables rules were gone and the proxy ARP setting had reset. Blogs went down briefly. Entirely predictable, mentioned during the session, deprioritised by me. AI flagged the risk; I chose to accept it. That's on me — but it illustrates that AI's job is to surface concerns, not to stop you from making time-pressured decisions you'll regret later.
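For reference, the fix for the sysctl half of that problem is small. Kernel settings placed in a file under /etc/sysctl.d/ are applied at boot, and the sketch below renders such a file. The interface name `bond0`, the inclusion of `net.ipv4.ip_forward`, and the file name mentioned in the comment are assumptions for illustration.

```python
def sysctl_dropin(settings: dict[str, str]) -> str:
    """Render settings in the `key = value` format used by /etc/sysctl.d/*.conf."""
    return "".join(f"{key} = {value}\n" for key, value in sorted(settings.items()))


# Placeholder interface name; the real one depends on the host.
content = sysctl_dropin({
    "net.ipv4.conf.bond0.proxy_arp": "1",
    "net.ipv4.ip_forward": "1",
})

# The text could be written to a file such as /etc/sysctl.d/99-proxy-arp.conf
# (hypothetical name). It is only printed here so nothing on the host changes.
print(content, end="")
```

The iptables half needs its own mechanism. One common route on RHEL-family systems is dumping the live rules with `iptables-save` into /etc/sysconfig/iptables and letting the iptables-services package restore them at boot, but that depends on how the firewall is managed on the host. Either way, the only real proof is a reboot.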

It doesn't carry context the way a human colleague does

A human co-worker remembers that you made a specific trade-off three hours ago and might flag when a new decision conflicts with it. AI, within a long session, can lose track of earlier constraints if the conversation drifts. I had to occasionally re-anchor — "remember, the VMs are on an isolated bridge, not directly on the physical network" — to keep the advice accurate.


The Mindset That Makes It Work

The sessions that went well had one thing in common: I came in with a specific, well-defined problem framed around what must be preserved. Not "help me migrate my server." Something like: "I have four Ghost blogs running in Podman containers that must stay live throughout this process. I want to build a parallel KVM environment and cut over once it's verified. Here's my current network layout."

Specific input, honest constraints, clear rollback condition. Useful output.

AI works best when you treat it like a knowledgeable colleague you're pair-programming with — not an oracle, and not a search engine. You're still the engineer. You're still making the decisions, evaluating the suggestions, and catching the mistakes. What AI gives you is speed: faster options, faster syntax, faster debugging paths.

The people getting the most out of AI tooling right now are the ones who stay in the driver's seat. They bring the risk awareness. They define the boundaries. They use AI to go faster within those boundaries — not to skip past thinking about them.


The Part Most People Don't Think to Do: Ask for a Summary

When the migration was done and everything was verified, I did something I'd encourage everyone to try: I asked Claude for a structured summary of everything we'd built.

Not a log of the commands. A summary oriented around understanding — what each component does, why it was chosen over alternatives, and what I'd need to know to troubleshoot it six months from now when I've forgotten the details.

What came back was effectively a personalised reference document for my exact setup. The network architecture with the reasoning behind each decision. The traffic flow from internet to Ghost, explained step by step. The known risks — including the iptables persistence issue I'd deprioritised — documented in one place.

This is where AI genuinely surpasses any traditional learning path. After a session like this, you don't just have a working system. You have a curated explanation of that system written specifically for your implementation, your constraints, and your knowledge level. No textbook does that. No Stack Overflow thread does that.

It took five minutes to request and read. It meaningfully expanded my foundational understanding of Linux networking in a way that will transfer to the next project. That's the compounding return on AI-assisted work that most people leave on the table.


The Honest Time Accounting

One evening to build six VMs, rebuild four Ghost installations, configure a layered networking stack, and verify everything end-to-end, with the original Podman containers still running as a safety net the entire time.

My honest estimate without AI: two full days, minimum, probably with at least one significant wrong turn that adds a third. And I'd have emerged from it with shakier foundational knowledge, because the learning would have been fragmented across dozens of search results rather than coherent and contextual.

That's not a small number. At a day rate for a competent sysadmin or DevOps contractor, AI assistance on a task like this is worth real money. More importantly: it's the difference between a migration you attempt on a weeknight and one you keep deferring indefinitely.


What I'm Watching Next

The current limitation — AI can't actually see your environment — is starting to erode. Tools that give AI direct terminal access, file system access, and the ability to read actual error output rather than your description of it are already here and getting better fast.

When AI can run commands directly, read your actual config files, and see the real error in your logs, the co-pilot metaphor becomes something closer to a co-engineer. The DataTalks.Club and AWS Kiro incidents both involved AI with exactly that kind of direct access — and both resulted in destruction that a human would have paused to question.

The risk mitigation question becomes more urgent, not less, as AI tools get more capable. Who defines the guardrails? What permissions should an AI agent hold in a production environment? How do you enforce a "do not destroy" constraint that survives a long agentic session? We're not far from needing a real answer to that — and the infrastructure community needs to be part of building it.


Over to You

The two incidents above (a founder temporarily losing 2.5 years of data, and Amazon's own AI tool taking a service in an AWS China region offline for 13 hours) both came down to the same root cause: AI was given the ability to act on infrastructure without clear constraints on what it must not touch.

Have you developed your own guardrails for working with AI on infrastructure tasks? Are you thinking about blast radius and rollback before you start a session, or letting the conversation develop and course-correcting as you go?

And on the learning side: are you asking for explanations alongside commands, or just taking the output and moving on? Do you ask for a summary when the work is done?

Drop your experience in the comments. The most useful thing this community can build right now is a shared instinct for where AI assistance is genuinely safe to trust, how to extract real learning from it — and where the human still has to be the last line of defence.


Running infrastructure and want to talk through how to approach a migration safely with AI assistance? Feel free to reach out directly.