Our Journey into building AI LLM pentesting agents

Around 2 years ago I started working on an AI-powered pentesting engine. My background is in offensive security, so I thought, why not use my knowledge and train AI to follow my methodologies? It was super exciting. I had been planning to do that for a long time, but when transformers, and as a result LLMs, came into the picture and everything blew up tech-wise, I said to myself, “Awesome, let’s do this, it’ll be easier now”.

Well, that wasn’t the case! It was possible, just not feasible.

I started working on a CLI-based tool and called it Predator. Back then it was hard to get hold of a good model; this was the era of OpenAI’s ChatGPT 3.5, and the only free GPT model available was GPT-2. I said to myself, I will fine-tune it to work with cyber security. I did not know how to do that. Well, ChatGPT to the rescue. I fine-tuned it, and I was like, this is awesome. I was very proud of myself... and then, when Predator asked it for 5 SQL injection payloads, it did return something, which I did not validate at first. I looked at my vulnerable application’s logs, and though the application was clearly vulnerable and the scanner was clearly hitting the endpoint correctly, there was no error. I said, OK, let’s see. I created a chat loop for GPT-2, available via the --chat option in Predator. I ran it again and, wow, I was blown away.

When I asked it for 5 SQL injection payloads, it would return:

5 SQLi payloads
5 SQLi payloads
5 SQLi payloads
5 SQLi payloads
or '1'='1

So basically, 1 out of 5 was correct. I can work with that. I added a filter in the code that looks for the correct type of payload and removes everything else. I also added a counter to track how many incorrect payloads are generated.
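To make the filter-plus-counter idea concrete, here is a minimal sketch in Python. It is not Predator’s actual code: the function name filter_payloads, the FilterResult type and the two heuristics (an SQL keyword plus a quote or comment character) are assumptions chosen for illustration, and a real filter would need to be stricter.

```python
import re
from dataclasses import dataclass, field

# Assumed heuristic: a plausible SQLi payload contains an SQL keyword...
_SQL_KEYWORD = re.compile(r"\b(or|and|union|select|sleep|waitfor)\b", re.IGNORECASE)
# ...and at least one character sequence used to break out of a query.
_BREAKOUT = re.compile(r"['\"]|--|/\*|;")


@dataclass
class FilterResult:
    accepted: list[str] = field(default_factory=list)  # lines kept as payloads
    rejected: int = 0                                   # counter of discarded lines


def filter_payloads(model_output: str) -> FilterResult:
    """Keep lines that look like SQLi payloads and count everything else."""
    result = FilterResult()
    for line in model_output.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # blank lines are neither payloads nor failures
        if _SQL_KEYWORD.search(candidate) and _BREAKOUT.search(candidate):
            result.accepted.append(candidate)
        else:
            result.rejected += 1
    return result


if __name__ == "__main__":
    sample = "5 SQLi payloads\n5 SQLi payloads\n5 SQLi payloads\n5 SQLi payloads\nor '1'='1"
    outcome = filter_payloads(sample)
    print(outcome.accepted, outcome.rejected)
```

Keeping the rejected count next to the accepted payloads is what makes it possible to measure how often the model misses, instead of silently dropping its bad output.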

I re-scanned and, well, 90 out of 100 payloads were incorrect, which is a problem. Luckily I had added a fallback: if the incorrect-payload threshold is hit, the scanner falls back to regex-based checks, which were working flawlessly. I also trained a neural network model for response analysis, but again, regex was faster.
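The fallback could look roughly like the sketch below. Several things here are assumptions, since the post does not give them: the threshold value, the helper names should_fall_back and response_looks_injectable, and the reading of “regex-based checks” as matching well-known database error messages in the HTTP response body.

```python
import re

# Assumed value; the real threshold used in Predator is not stated.
REJECTION_THRESHOLD = 0.5

# Error fragments commonly emitted by database engines when a query breaks.
SQL_ERROR_PATTERNS = [
    re.compile(r"You have an error in your SQL syntax", re.IGNORECASE),        # MySQL
    re.compile(r"Unclosed quotation mark after the character string", re.IGNORECASE),  # SQL Server
    re.compile(r"ORA-\d{5}"),                                                   # Oracle
    re.compile(r"syntax error at or near", re.IGNORECASE),                      # PostgreSQL
]


def should_fall_back(rejected: int, total: int,
                     threshold: float = REJECTION_THRESHOLD) -> bool:
    """Return True when too many generated payloads were discarded."""
    if total == 0:
        return True  # the model produced nothing usable at all
    return rejected / total >= threshold


def response_looks_injectable(body: str) -> bool:
    """Regex-based response check: does the body leak a database error?"""
    return any(pattern.search(body) for pattern in SQL_ERROR_PATTERNS)


if __name__ == "__main__":
    # 90 of 100 payloads rejected, as in the re-scan described above.
    print(should_fall_back(rejected=90, total=100))
    print(response_looks_injectable("Warning: You have an error in your SQL syntax near ''1''"))
```

Compiled patterns like these run in microseconds per response, which is consistent with the observation that regex stayed faster than a trained model for response analysis.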

The first Predator:

It was working well with the fallback, but for the AI part I decided to wait until there were more LLMs that I could download and work with.

After a few months people started talking about Agentic AI and AI-based workflows. A current example is n8n.

Fast forward to 2025: I created an agent for pentesting using the smolagents library. I build in Python, and the library is a Python library as well. I used a smaller model because I initially ran it on my cell phone. Then I moved everything to my MacBook Pro M1, which is faster than the phone but still slower than what is needed to run large models.

Keeping that in mind, I stayed with a smaller model. When I prompted it to perform a pentest on my local application or server, it would use all the tools at its disposal, which were Nmap, WPScan, Nikto, OWASP ZAP, Nuclei and Gobuster. The idea was for it to orchestrate the tools and also help in generating the report. It was nice, and the model did work as expected, but it was still not there yet.

It was the first time I realized how many tools there are for doing different things. Orchestrating them all becomes very clunky, and I wasn’t satisfied. In parallel I was working on creating a full-fledged DAST tool, enhancing the non-AI version of Predator and turning it into a SaaS product.

Following is how my agent communicated and what the model thought at the time:

I thought to myself: this agent POC is great, but what if I now combine this setup with Predator’s latest version? That version is extensive and built to rival ZAP, Burp Suite Pro and the like, so the result would be amazing. So I started modifying the code, used the previous agent code as a reference, and successfully made the changes.

I will not go into the details of how I did that, but the conclusion was that although it worked, the way I had designed it and the way it worked before LLM orchestration was far better and faster than after the LLM was added.

The reason was that, when deciding which tools to use, it would effectively randomize the tool selection. As a human who has done thousands of pentests and learned from feedback how to optimize them, I would not approve of that. It was also not standardized: sometimes it would call Nmap and sometimes it would not, even when it was most needed. To solve that, you have to set strict rules in your system prompt, which technically makes the workflow linear, but still slower than the non-LLM pentesting tool.
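One way to picture the difference is to take whatever tool list the model proposes and force it into a fixed order in plain code, instead of relying on the system prompt. The sketch below is illustrative only: the ordering in PIPELINE, the REQUIRED set and the function enforce_order are my assumptions for this example, not Predator’s real workflow.

```python
# Assumed fixed ordering of the tools mentioned earlier; recon first.
PIPELINE = ["nmap", "gobuster", "nikto", "wpscan", "nuclei", "zap"]

# Assumed rule: a port and service scan must always run.
REQUIRED = frozenset({"nmap"})


def enforce_order(proposed: list[str]) -> list[str]:
    """Normalize an LLM-proposed tool list into a deterministic plan.

    Required tools are added if the model skipped them, unknown tools are
    dropped, duplicates collapse, and the result always follows PIPELINE order.
    """
    wanted = {name.lower() for name in proposed} | REQUIRED
    return [tool for tool in PIPELINE if tool in wanted]


if __name__ == "__main__":
    # The model skipped Nmap and picked the tools in an arbitrary order.
    print(enforce_order(["nuclei", "gobuster"]))
```

Once the plan is normalized like this, the same input always produces the same scan, which is exactly the standardization the model alone did not provide. It also shows why the LLM adds little at this step: the deterministic function does the real work.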

To be fair, I am not saying that LLMs can’t do that. If I use a larger model, I can get far better results, but the cost is time. Either my box or server has to be much more powerful, or I have to use smaller models.

LLMs are a good fit for report generation, because unlike tasks such as payload generation, it usually does not have to happen in real time.

In 2025 I saw a lot of talks, and was part of many discussions, about Agentic AI becoming a threat. As we are talking about an Agentic AI pentesting solution, it seemed right to touch upon this subject as well.

A lone threat actor working from home won’t get much out of AI-based tools if they run the models locally to carry out attacks, but nation-state threat actors have the capabilities and the resources to build AI cyber-offensive armies. That is a looming threat. But if we think about it, if a nation-state threat actor really has you as a target, then there is not much that can be done.

Coming back to the core theme of the blog: the current Predator has two versions, one with full LLM orchestration and the other with deterministic and statistical analysis. Guess which one is faster and more accurate?

This has been the journey thus far of creating an Agentic AI pentesting solution. Hope it helped. 🙂

Peace!
