Pressure-testing code with multiple LLMs

Today, I asked Copilot to go and sit on the naughty step. I’ll tell you why…

Recently, I’ve got into the habit of running what I call a ‘cross-pollination exercise’. It’s a weekly process where I take the Node.js project files I’ve been working on in Codex/ChatGPT and hand them to multiple LLMs to inspect, challenge, critique, and improve.

I effectively act as a mediator. One moment I’m asking Codex: “Can you summarise what we’ve been working on so I can pass it over to another LLM for inspection?”

The next moment, I’m feeding that summary and the associated JavaScript (.js) files into Claude, Gemini, and Grok to see what they think. It’s fascinating watching where they agree, where they disagree, and what each model notices that the others missed.
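For anyone who would rather script the hand-off than drag files around, here is a minimal sketch of the bundling step. It assumes the project is a folder of .js files on disk; `collectJsFiles` and `buildReviewPrompt` are hypothetical helper names, and the prompt wording is only an example, not what I actually send.

```javascript
// Sketch only: bundle a summary plus a project's .js files into one review prompt.
// collectJsFiles and buildReviewPrompt are illustrative names, not part of any library.
const fs = require('node:fs');
const path = require('node:path');

// Recursively gather .js files under rootDir, skipping node_modules.
function collectJsFiles(rootDir) {
  const found = [];
  for (const entry of fs.readdirSync(rootDir, { withFileTypes: true })) {
    const fullPath = path.join(rootDir, entry.name);
    if (entry.isDirectory()) {
      if (entry.name !== 'node_modules') {
        found.push(...collectJsFiles(fullPath));
      }
    } else if (entry.isFile() && path.extname(entry.name) === '.js') {
      found.push(fullPath);
    }
  }
  return found.sort();
}

// Combine the summary and every collected file into a single block of text.
function buildReviewPrompt(summary, rootDir) {
  const sections = collectJsFiles(rootDir).map((file) => {
    const relative = path.relative(rootDir, file);
    const source = fs.readFileSync(file, 'utf8');
    return `--- ${relative} ---\n${source}`;
  });
  return [
    'Summary from the original assistant:',
    summary,
    '',
    'Please inspect, challenge, critique, and improve the files below.',
    '',
    ...sections,
  ].join('\n');
}
```

Calling `buildReviewPrompt(summaryText, './my-project')` returns one string that can go to each model in turn, so every reviewer sees exactly the same material.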

Claude, which many developers regard as the strongest coding LLM overall, often spots architectural concerns or maintainability issues.

Gemini tends to notice practical implementation details and odd edge cases that others sometimes miss.

And Grok, the bad boy of the group, occasionally comes out with a surprisingly clever workaround that nobody else considered.

Codex will frequently admit that another model has raised a valid point that it overlooked. That’s the interesting part: different models have different priorities, different reasoning styles, and different blind spots.

It’s the office equivalent of summoning colleagues over from different departments to cast a fresh pair of eyes over the same problem.

How many people are actually using LLMs this way?

Of those who do use multiple LLMs, I get the feeling most tend to silo them. They might use ChatGPT for one thing, Claude for another, and Gemini for something else, rather than making them inspect and challenge each other’s work.

The goal isn’t necessarily to find ‘the one perfect answer’. It’s to minimise bias, challenge assumptions, reduce the chances of hallucinations slipping through unnoticed, and gain a broader understanding of what you’re building.

Frustratingly, Copilot is the ONLY one that won’t let me drop in my .js files. It only accepts certain file types, citing safety and processing reasons, and raw .js files aren’t on the allowed list.

I refuse to go through the hassle of pasting code in or renaming every file to .txt. So yes, I’ve had to exclude Copilot from my cross-pollination exercise because of this.
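If the renaming is the only thing in the way, it can be scripted. The sketch below copies each .js file in a folder to a separate upload folder with .txt appended, leaving the originals untouched. `copyJsAsTxt` is a hypothetical helper name, and it assumes .txt is on the allowed upload list, which I haven’t confirmed beyond what’s described above.

```javascript
// Sketch only: make .txt copies of .js files for tools that reject .js uploads.
// copyJsAsTxt is an illustrative name; the originals are never modified.
const fs = require('node:fs');
const path = require('node:path');

// Copy every .js file directly inside sourceDir into targetDir as <name>.js.txt.
function copyJsAsTxt(sourceDir, targetDir) {
  fs.mkdirSync(targetDir, { recursive: true });
  const copied = [];
  for (const entry of fs.readdirSync(sourceDir, { withFileTypes: true })) {
    // Skip subfolders and anything that is not a .js file.
    if (!entry.isFile() || path.extname(entry.name) !== '.js') continue;
    const target = path.join(targetDir, `${entry.name}.txt`);
    fs.copyFileSync(path.join(sourceDir, entry.name), target);
    copied.push(target);
  }
  return copied.sort();
}
```

Keeping the original name inside the new one (`server.js` becomes `server.js.txt`) means the model can still tell which file is which.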

What’s interesting is that a lot of people seem to treat one LLM as if it holds the gospel truth. That’s particularly the case with Claude, but it would be naive to assume it’s always right.

This leads me to ask: how much cross-pollination do you do?