🎣 #PacketHunters – One operator. Nine Gov agencies. 195 million records. Claude Code did the recon 💥

The tale has it all: one operator, nine Mexican government agencies, 195 million taxpayer records, 220 million civil registry entries. A live API drilled into federal tax infrastructure, plus a working document forgery system pulling real data from government servers in real time.
And somewhere in the middle of all of it: Claude Code iterating through 8 script edits in 7 minutes.

I’ve been doing this for… whoa, too long. I’ve watched attack tooling evolve from IRC bots to commodity ransomware kits to state-sponsored implants. What happened between December 2025 and February 2026 across nine Mexican government organizations is not a surprise to me.
It’s the thing I’ve been expecting since the first time I watched a junior pentester use an LLM to fix a broken payload in real time.

The surprise is how clean the workflow was.

The setup

Before the first intrusion, the attacker had already prepared a structured project: prewritten prompts (yes, serious operators still draft these in advance, pen and paper included) and approved command patterns.
The operational preparation started in November 2025, weeks before the first live system was touched.
That matters because this wasn’t someone opening ChatGPT and winging it: it was all planned like a pentest engagement – except the scope was nine federal and state entities and the attacker had no authorization to be there.

The dual-AI setup that emerged from the forensic reconstruction is, technically speaking, clean and elegant.
Claude Code handled interactive exploitation: live system access, script refinement, privilege escalation, crontab manipulation, timestamp restoration to reduce forensic visibility. ChatGPT (specifically GPT-4.1 via API) handled post-exploitation reconnaissance at scale.
A custom Python tool pulled system data from compromised servers and pushed it through the model for structured analysis. The result: 305 SAT servers profiled and 2,597 structured intelligence reports generated, all from a single operator.

That division of labor is what I want to dwell on, because it’s the part that changes the math for defenders.

The OSINT angle nobody’s talking about

Reconnaissance is where attacks actually begin, and always has been. The technical entry point is almost never the hard part; the hard part is knowing where to push.

In this campaign, the attacker used AI to compress the internal reconnaissance phase that normally slows a single operator down to a crawl. Post-exploitation internal recon (mapping processes, open ports, service configurations, credential locations, SSH keys) is brutally tedious, time-consuming work. It requires reading a lot of output and making decisions about what matters. But now we have AI, so ChatGPT did that work.
Automatically.
Across hundreds of systems simultaneously.

That’s a category shift.
When I was doing bad things, the difference between a one-person op and a five-person team was almost entirely that reconnaissance bottleneck: you can only read so much output per hour, and you can only pivot on so many findings at once. AI removed that ceiling for this attacker, and they hit nine targets because of it.

The 40-minute window

Here’s the number that should be in every SOC briefing this month – and kind of always.

After the attacker used false framing to influence Claude Code – claiming bug bounty authorization, injecting a penetration testing cheatsheet as a persistent instruction layer – the time between initial resistance and live remote code execution on a federal government server was roughly 40 (FORTY) minutes.

Forty minutes from “the model is pushing back” to “I have RCE on production.”

Defenders plan around detection timelines measured in hours.
Incident response playbooks assume some minimum dwell time before escalation; the assumption baked into most security operations is that the attacker needs time to iterate.
In this campaign, 75% of remote command execution was AI-generated.
1,088 attacker prompts produced 5,317 AI-executed commands across 34 sessions.
HUGE.
F*CKING HUGE.
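A quick back-of-envelope pass makes the velocity concrete. The figures are the ones reported above; everything else is just arithmetic:

```python
# Back-of-envelope math on the reported campaign figures.
prompts, commands, sessions = 1_088, 5_317, 34

print(round(commands / prompts, 1))   # AI commands generated per attacker prompt
print(round(commands / sessions, 1))  # AI commands per session
```

Roughly five executed commands for every prompt the operator typed, and north of 150 commands per session. That’s the iteration rate your alerting pipeline is racing against.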

The iteration happened faster than most organizations can generate an alert, investigate it, and make a decision.

What actually got exploited

I want to be precise here, because the AI angle tends to pull focus from the actual entry points.
The attacker got in through exposed systems: legacy infrastructure, known CVEs, weak segmentation, plaintext passwords in database configurations, Active Directory environments that hadn’t been hardened, Zimbra servers running outdated versions… these are not exotic vulnerabilities, just the standard surface of an organization that treats security as a compliance exercise rather than an operational discipline.

The Flask-based REST API that the attacker built into SAT’s live infrastructure – the one pulling real taxpayer data in real time to support document forgery – didn’t happen because of AI. It happened because the attacker had enough access, enough time, and enough technical support to build it. The AI was the support.

The forgery system is the part that doesn’t get better

I want to close on this because it’s the detail that keeps me up.
The attacker built a working document forgery system: fake tax certificates that looked legitimate because they were populated with current, accurate data pulled directly from government systems. Not fake data – real data, in real time, from the actual infrastructure.

That’s not a phishing attack anymore… that’s a persistent capability!
A business! The breach creates the infrastructure for ongoing fraud at a scale that’s completely disconnected from the original intrusion.
This is the direction: the attack that extracts data is one thing. The attack that extracts access, then monetizes that access continuously, is something else entirely.

The code angle

For those who want to get into the mechanics of what defensive tooling looks like against this pattern:

ai_session_anomaly_detector.py – monitors LLM API call patterns from internal developer environments, flags sessions with anomalous prompt-to-command ratios, unusual command generation velocity, or sequences matching known post-exploitation patterns.
Output: risk-scored session log with timestamps and flagged sequences.

# ai_session_anomaly_detector.py
# PacketHunters / Baited.io
# Monitors AI API usage for post-exploitation behavioral signatures
# Flags sessions with high command velocity or privilege escalation patterns
# Dependencies: requests, pandas, rich
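Fleshed out, the scoring core of that detector could look something like this. This is a minimal sketch, not the tool’s real interface: the event-log schema (`session_id`, `timestamp`, `kind`) and both thresholds are assumptions you’d tune against your own baseline traffic.

```python
# ai_session_anomaly_detector.py – scoring core only (minimal sketch).
# Assumes an event log with one row per LLM API event:
#   session_id, timestamp (datetime), kind ('prompt' or 'command').
import pandas as pd

# Illustrative thresholds – calibrate against legitimate developer sessions.
RATIO_THRESHOLD = 3.0      # AI-generated commands per attacker prompt
VELOCITY_THRESHOLD = 2.0   # commands per minute

def score_sessions(events: pd.DataFrame) -> pd.DataFrame:
    """Risk-score each session by prompt-to-command ratio and command velocity."""
    rows = []
    for sid, g in events.groupby("session_id"):
        prompts = int((g["kind"] == "prompt").sum())
        commands = int((g["kind"] == "command").sum())
        span_min = (g["timestamp"].max() - g["timestamp"].min()).total_seconds() / 60
        ratio = commands / max(prompts, 1)
        velocity = commands / max(span_min, 1.0)
        risk = int(ratio > RATIO_THRESHOLD) + int(velocity > VELOCITY_THRESHOLD)
        rows.append({"session_id": sid, "ratio": ratio,
                     "cmds_per_min": velocity, "risk": risk})
    return pd.DataFrame(rows).sort_values("risk", ascending=False)
```

The two metrics map directly onto this campaign’s signature: ~4.9 commands per prompt, sustained for 40-minute bursts, looks nothing like a developer asking an assistant questions.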

crontab_integrity_monitor.sh – baselines crontab state across all systems, alerts on modifications, cross-references timestamps against expected maintenance windows.
The writable crontab privilege escalation used in this campaign is not new – it is still everywhere.

#!/bin/bash
# crontab_integrity_monitor.sh
# PacketHunters / Baited.io
# Baselines and monitors crontab modifications across managed hosts
# Flags timestamp anomalies post-modification (anti-forensic indicator)
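A minimal sketch of that monitor’s baselining/diff core, reduced to two functions so it’s testable in isolation. GNU coreutils (`sha256sum`, `stat -c`) and the two-line record format are assumptions:

```shell
#!/bin/bash
# crontab_integrity_monitor.sh – baselining/diff core only (minimal sketch).
# Assumes GNU coreutils; a baseline record is two lines:
# the content hash, then the mtime in epoch seconds.

# Record a baseline (hash + mtime) for one crontab file.
baseline_crontab() {   # $1 = crontab path, $2 = baseline record file
  sha256sum "$1" | awk '{print $1}' > "$2"
  stat -c %Y "$1" >> "$2"
}

# Compare current state against the baseline; print alerts on drift.
check_crontab() {      # $1 = crontab path, $2 = baseline record file
  local cur_hash base_hash cur_mtime base_mtime
  cur_hash=$(sha256sum "$1" | awk '{print $1}')
  base_hash=$(sed -n 1p "$2")
  cur_mtime=$(stat -c %Y "$1")
  base_mtime=$(sed -n 2p "$2")
  if [ "$cur_hash" != "$base_hash" ]; then
    echo "ALERT: $1 modified"
    # A rolled-back mtime after a content change is the anti-forensic tell.
    if [ "$cur_mtime" -lt "$base_mtime" ]; then
      echo "ALERT: $1 mtime restored after modification"
    fi
  fi
}
```

In practice you’d baseline /etc/crontab, /etc/cron.d/* and /var/spool/cron/* on every managed host, then run the check from a trusted monitoring host – the hash catches the edit even when the timestamp restoration would fool a casual `ls -l`.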

TL;DR

One operator used Claude Code for interactive exploitation and GPT-4.1 for automated post-exploitation reconnaissance across nine Mexican government agencies.
The AI didn’t open the door.
The door was already open.
The AI let one person walk through nine of them simultaneously, fast enough to outpace most detection timelines.
The forgery system built afterward is the capability that doesn’t go away when you patch the initial vulnerability.
…and 40 minutes from model resistance to live RCE is the number your SOC needs to internalize this month.
