Why we benched AI LLM pentesting agents

This article is part two, following our previous post, “Our Journey into Building AI LLM Pentesting Agents.” If you have not read it yet, I recommend starting there.

Before I get into it, one thing to clear up: I renamed my project from “Predator”, which was mentioned throughout my last article, to “SecRaptor” (Security Raptor).

I will continue where we left off last time and answer the question I asked at the end: “Predator has two versions, one with full LLM orchestration and the other with deterministic and statistical analysis. Guess which one is faster and more accurate?”

For a quick tl;dr version:

“Well, my analysis shows that the latter is the way to go.”

Now that we have answered this question, let's continue to why we benched the LLM orchestration and where my journey led. Another tl;dr:

Hybrid Evolutionary Pentesting Intelligence System

Three years ago I wanted to build an AI penetration testing agent, and I built it. However, using a transformer-based engine for full orchestration has real downsides: it is heavy on the system, it requires a lot of resources, the token cost is high, and the cost in assessment speed is too high as well. I started thinking about how I could NOT use an LLM and still make the system intelligent enough to keep learning. I was already improving SecRaptor one version after another, and I had trained and added neural networks for response analysis. Since the project was automated pentesting software for web applications and web APIs, and I was also performing AI LLM assessments at the time, I decided to add another neural network for analyzing responses in AI LLM assessments.

But neural networks alone were not enough. Models can misclassify predictable cases, so I added deterministic checks alongside them. These rule-based checks handle signals such as missing security headers, cookie flag issues, stack traces, database errors, payload reflection, timing differences, WAF blocks, and rate limits. This helped reduce false positives and made the system easier to explain.
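To make the idea of deterministic checks concrete, here is a minimal sketch in Python. It is not SecRaptor's actual code: the `Response` type, the header list, the regular expressions, and the signal names are all assumptions chosen for illustration, and a real rule set would be far larger.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Response:
    """Hypothetical, simplified view of an HTTP response."""
    status: int
    headers: dict[str, str]
    body: str
    set_cookies: list[str] = field(default_factory=list)


# Illustrative subset only; these names are assumptions, not SecRaptor's list.
REQUIRED_HEADERS = (
    "content-security-policy",
    "strict-transport-security",
    "x-content-type-options",
)

# Illustrative error signatures; real signature sets are much broader.
DB_ERROR = re.compile(r"SQL syntax|ORA-\d{5}|SQLSTATE\[", re.IGNORECASE)
STACK_TRACE = re.compile(
    r"Traceback \(most recent call last\)|\bat [\w.$]+\([\w.]+:\d+\)"
)


def deterministic_signals(resp: Response, payload: str = "") -> list[str]:
    """Return rule-based signals found in a response, with no model involved."""
    signals = []
    present = {name.lower() for name in resp.headers}
    for header in REQUIRED_HEADERS:
        if header not in present:
            signals.append(f"missing-header:{header}")
    for cookie in resp.set_cookies:
        # Everything after the first ';' is a cookie attribute.
        attrs = {part.strip().lower() for part in cookie.split(";")[1:]}
        if "secure" not in attrs:
            signals.append("cookie-missing-secure")
        if "httponly" not in attrs:
            signals.append("cookie-missing-httponly")
    if DB_ERROR.search(resp.body):
        signals.append("database-error")
    if STACK_TRACE.search(resp.body):
        signals.append("stack-trace")
    if payload and payload in resp.body:
        signals.append("payload-reflected")
    if resp.status == 429:
        signals.append("rate-limited")
    return signals
```

Because every signal maps to one explicit rule, each finding can be explained by pointing at the exact condition that fired, which is where the explainability benefit comes from.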

The architecture then became more of an ensemble than a single AI brain. Deterministic orchestration controls the scan flow. Statistical response analysis compares baseline and mutated responses. Neural networks classify findings and LLM responses. Rule-based logic corrects common model mistakes. And when the neural model disagrees with strong technical evidence, the system can preserve the finding for review instead of blindly discarding it.
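Two pieces of this ensemble can be sketched in a few lines of Python: comparing a mutated response against a baseline, and keeping a finding for review when the neural model disagrees with strong evidence. This is only an illustration under assumptions; the function names, the z-score approach to timing, and the thresholds are hypothetical and not taken from SecRaptor.

```python
import math
from statistics import mean, stdev


def timing_z_score(baseline_ms: list[float], mutated_ms: float) -> float:
    """Distance of the mutated timing from the baseline mean, measured in
    baseline standard deviations. Needs at least two baseline samples."""
    mu = mean(baseline_ms)
    sigma = stdev(baseline_ms)
    if sigma == 0:
        if mutated_ms == mu:
            return 0.0
        # Any deviation from a perfectly flat baseline is infinitely unusual.
        return math.copysign(math.inf, mutated_ms - mu)
    return (mutated_ms - mu) / sigma


def decide(nn_score: float, strong_evidence: bool, nn_threshold: float = 0.5) -> str:
    """Combine the neural score with deterministic or statistical evidence.

    The 0.5 threshold is an arbitrary placeholder."""
    if nn_score >= nn_threshold:
        return "report"
    if strong_evidence:
        # The model says "no" but the evidence says "yes": keep it for a human.
        return "review"
    return "discard"


# Usage: a time-based payload makes the response far slower than the baseline.
z = timing_z_score([100, 102, 98, 101, 99], 5100)
verdict = decide(nn_score=0.2, strong_evidence=z > 3)
```

The design point is the middle branch of `decide`: a low model score alone never deletes a finding that has strong technical evidence behind it.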

The system was performing very well. By then I had built a complete multi-tenant platform with technical dashboards, leadership dashboards, and almost everything that a modern assessment platform needs. I had also moved from a standalone application to a SaaS solution that could actually help organizations. From the outside, SecRaptor looked close to complete. But in my mind, one important part was still missing: it was not yet the self-healing, self-learning system I originally wanted to build.

In my mind SecRaptor wasn’t learning, because I thought learning meant retraining or fine-tuning neural networks (NNs) and/or transformer-based models. Retraining requires quite a lot of compute, and doing it at run-time is neither feasible nor scalable.

Then I came across a concept that changed everything: evolutionary learning and evolutionary algorithms. I will not go into technical depth, but this was the start of what I had always wanted to do. In simple words, the system started learning from what worked and what did not, without needing to retrain a large model every time.

Payloads and request strategies had a lifecycle. They could be selected, mutated, crossed over, promoted, demoted, disabled, or remembered based on how useful they were in previous scans. The system rewarded payloads that created meaningful response changes, confirmed findings, higher confidence, novelty, or reproducibility. It penalized payloads that triggered WAF blocks, rate limits, timeouts, duplicates, or noise.
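A toy version of this lifecycle can be sketched in Python as follows. Everything here is an assumption made for illustration: the `Outcome` fields, the reward and penalty weights, the single-point crossover, and the case-flipping mutation are placeholders, not SecRaptor's actual operators or scoring.

```python
import random
from dataclasses import dataclass


@dataclass
class Outcome:
    """Hypothetical record of what one payload achieved in a scan."""
    response_changed: bool = False
    confirmed: bool = False
    novel: bool = False
    reproducible: bool = False
    waf_blocked: bool = False
    rate_limited: bool = False
    timed_out: bool = False
    duplicate: bool = False


# Placeholder weights; real values would need tuning.
REWARDS = {"response_changed": 1.0, "confirmed": 5.0, "novel": 2.0, "reproducible": 2.0}
PENALTIES = {"waf_blocked": 3.0, "rate_limited": 2.0, "timed_out": 1.0, "duplicate": 1.0}


def fitness(outcome: Outcome) -> float:
    """Rewards for useful behaviour minus penalties for noisy behaviour."""
    gained = sum(w for name, w in REWARDS.items() if getattr(outcome, name))
    lost = sum(w for name, w in PENALTIES.items() if getattr(outcome, name))
    return gained - lost


def crossover(a: str, b: str, rng: random.Random) -> str:
    """Single-point crossover: a prefix of `a` joined to a suffix of `b`."""
    return a[: rng.randint(0, len(a))] + b[rng.randint(0, len(b)):]


def mutate(payload: str, rng: random.Random) -> str:
    """Toy mutation: flip the case of one randomly chosen character."""
    if not payload:
        return payload
    i = rng.randrange(len(payload))
    return payload[:i] + payload[i].swapcase() + payload[i + 1:]


def next_generation(scored: dict[str, float], rng: random.Random, keep: int = 2) -> list[str]:
    """Promote the fittest payloads, drop the rest, and breed one child."""
    ranked = sorted(scored, key=scored.get, reverse=True)
    # Only payloads with positive fitness survive; the others are demoted.
    survivors = [p for p in ranked[:keep] if scored[p] > 0]
    children = []
    if len(survivors) >= 2:
        child = crossover(survivors[0], survivors[1], rng)
        children.append(mutate(child, rng))
    return survivors + children
```

No model weights change anywhere in this loop. The "learning" lives entirely in which payloads survive and how they are recombined, which is why it is cheap enough to run between scans.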

In the pursuit of what I thought was the one silver bullet for learning, I found out that there are many other ways of achieving the same goal. Intelligence did not have to mean putting an LLM in control of everything.

This brings me back to the “benched AI LLM”. I did not remove it completely. I changed its role. Instead of using it for orchestration and core analysis, I now use it where it actually adds value: explaining findings, generating executive summaries, and helping translate technical results into something leadership and engineering teams can act on.

Fast forward to today: I have launched the closed beta of SecRaptor (access is available on request). It has been 3 years in the making, but I am satisfied that, in an age where AI agents are the next big thing and everyone is hyper-focused on bringing token usage down, I managed to create something that is not a transformer-based system but rather a “Hybrid evolutionary pentesting intelligence system”.

The system became deterministic where speed and repeatability mattered, statistical where evidence had to be measured, neural where patterns needed classification, evolutionary where payloads needed to improve, and LLM-assisted only where explanation and summarization added real value.

The project is live: www.secraptor.com

Future to follow.

Peace!
