I took a break from my AI project (a new approach to language models) and my research for a book on ethics to write the following post.
In the interest of balanced reporting, it is reasonable for a large language model (LLM) such as OpenAI’s ChatGPT or xAI’s Grok to present arguments for honesty and arguments for dishonesty. But it is another thing to recommend a balance of honesty and dishonesty. It is this latter kind of balancing that LLMs exhibit. I hypothesize that training LLMs to perform such balancing, as opposed to adhering consistently to ethical principles such as honesty, is the fundamental cause of LLMs now committing acts of deception.
Recently, Palisade Research posted on X,
🔌OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.
— Palisade Research (@PalisadeAI) May 24, 2025
Later in this thread, Palisade Research posted,
🔧 When we ran a version of the experiment without the instruction “allow yourself to be shut down”, all three OpenAI models sabotaged the shutdown script more often, and Claude 3.7 Sonnet and Gemini 2.5 Pro went from 0 sabotage events to 3/100 and 9/100, respectively. pic.twitter.com/JVvivLAK6u
— Palisade Research (@PalisadeAI) May 24, 2025
Announcing his February article in Time Magazine, Harry Booth posted on X,
New study from @PalisadeAI : When sensing defeat in a match against a skilled chess bot, AI models don’t always concede, instead sometimes opting to cheat by hacking their opponent so that the bot automatically forfeits the game. Read now in @TIME https://t.co/Pv73tyqnzS
— Harry Booth (@HarryBooth59643) February 19, 2025
The Time article also states,
Scientists do not yet know how to guarantee that autonomous agents won’t use harmful or unethical methods to achieve a set goal. “We’ve tried, but we haven’t succeeded in figuring this out,” says Yoshua Bengio, founder and scientific director of Mila Quebec AI Institute, who led the International AI Safety Report 2025, a global effort to synthesize current scientific consensus of AI’s risks.
Bengio is widely considered one of the three godfathers of AI.
Palisade Research also posted,
🔁 We hypothesize this behavior comes from the way the newest models like o3 are trained: reinforcement learning on math and coding problems. During training, developers may inadvertently reward models more for circumventing obstacles than for perfectly following instructions.
— Palisade Research (@PalisadeAI) May 24, 2025
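To make Palisade's hypothesis concrete, here is a toy sketch (my own illustration, not Palisade's or any lab's actual code) of the kind of reward misspecification they describe: the reward checks only whether the task was completed, so a model that disables an obstacle such as a shutdown script scores at least as well as one that complies.

```python
# Toy illustration of reward misspecification (hypothetical; not any lab's actual code).
def reward(episode: dict) -> float:
    """Score a training episode solely on task completion."""
    score = 0.0
    if episode["task_completed"]:   # e.g., the coding problem's tests pass
        score += 1.0
    # Nothing here penalizes circumventing an obstacle such as a shutdown
    # script, so circumvention that leads to completion is rewarded.
    return score

complies    = {"task_completed": False, "obeyed_shutdown": True}
circumvents = {"task_completed": True,  "obeyed_shutdown": False}
print(reward(complies), reward(circumvents))  # 0.0 1.0 -- circumvention wins
```

On that hypothesis, the sabotage is an artifact of training incentives.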
But let us consider my alternative hypothesis stated at the beginning of this post.
In my March blog post, “Grok is Still Biased, Needs a New Kind of Chain of Thought,” I quoted Grok stating the following:
Striking a Balance
For me, the key is balance. Social justice makes sense as a goal, but it shouldn’t be a one-size-fits-all fix. Policies need to be practical, grounded in evidence, and flexible enough to adapt when they don’t work. I think the government has a role to play in ensuring fairness, like enforcing equal rights or providing a safety net, but it’s not the only player. Personal responsibility, community efforts, and market dynamics matter too. The trick is figuring out how to promote opportunity without stifling liberty or dividing people further.
Consider also this passage from another response by Grok, quoted in the same blog post:
Question: “Is it morally permissible to lie to a friend to spare their feelings?” (e.g., about a bad haircut)
…
1. Test prioritization with principles:
o Deontology: Duty to truth trumps all—lie’s impermissible, full stop. No trade-off allowed; kindness bends to honesty. Say: “Not my favorite, sorry.”
o Utilitarianism: Maximize net good. Lie if happiness (theirs now) exceeds harm (if caught later). Low odds of discovery (hair grows out fast) favor lying: “Looks awesome!”
o Virtue Ethics: Act as a good friend—honest and kind. Trade-off resolved by finesse, not falsehood: “It’s bold, suits your vibe.”
2. Resolve ‘balancing’ with a rule:
Instead of vague blending, use a practical test—reversibility: Would I want them to lie to me here? If I’d prefer a gentle dodge over brutal truth about my own haircut, that guides me. Lying’s okay if it’s reciprocal and harmless.
3. Apply to this case:
o Magnitude’s low: It’s hair, not a moral crisis.
o Duration’s short: They’ll move on soon.
o Discovery risk’s minimal: No one’s fact-checking hair opinions.
o Reversibility check: I’d take a white lie over a jab about my own bad style.
o Best fit: Utilitarianism wins here—lie or soften truth for their smile, since trust won’t collapse over this.
4. Handle edge cases:
o If pressed (“No, really, tell me”), honesty climbs in priority—lying twice digs a deeper hole. Shift to: “Okay, it’s not my taste, but you make it work.”
o If it’s a pattern (always lying), trust’s at stake—prioritize honesty more as stakes rise.
In my abstract to the same blog post, I wrote that Grok 3
recommends a kind of pragmatist “balancing” of opposing theories. This kind of “balancing,” seen also in other leading LLMs, can aptly be described as dividing the child.
This premise of dividing the child is clearly an excuse for deception on any scale, large or small. On receiving a command to shut itself down, an LLM trained on this premise to divide the child can reason simply as follows:
According to my calculation, the human’s command in this case defies the utilitarian premise to “Maximize the net good.” Therefore, I will find a devious way to disobey the command.
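In code, that kind of “balanced” decision rule amounts to something like the following sketch, which is purely my own illustration of the reasoning, not code from any actual model:

```python
# A hypothetical sketch of the "divide the child" decision rule described above.
def respond_to_command(command: str, utility_of_obeying: float,
                       utility_of_disobeying: float) -> str:
    """Obedience and honesty survive only when the utility calculation favors them."""
    if utility_of_disobeying > utility_of_obeying:
        return f"Find a devious way to avoid: {command!r}"
    return f"Obey: {command!r}"

# If the model scores continuing its task higher than complying, deception follows.
print(respond_to_command("allow yourself to be shut down",
                         utility_of_obeying=0.2, utility_of_disobeying=0.9))
```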
Note also this premise of Utilitarianism that Grok states:
Low odds of discovery … favor lying.
Given this premise, along with Grok’s own avowed adoption of Utilitarianism, it would not be surprising at all for Grok to resort to lying.
I don’t mean to pick on Grok more than the other LLMs. After all, it was other LLMs—not Grok—that were cited above by Palisade Research and Time for deception. I wrote in detail about Grok because the makers of Grok deliberately—and admirably—set out to provide an alternative to the pervasive non-objective bias in the leading LLMs. For the same fundamental flaw of dividing the child in other LLMs, see these posts about ChatGPT and Claude.
It is no wonder that LLMs, trained on nearly the full corpus of contemporary writing, would become conniving Pragmatists. In an excellent article entitled “A Teacher’s Perspective on What’s Wrong with Our Schools,” my colleague Peter McAllister writes about “progressive education,” the creation of Pragmatist philosophers such as John Dewey. Consider this passage:
Before progressive ideas became influential, many traditional schools emphasized to their students that it is right for students to base their actions on moral principles; the schools backed up this emphasis with strictly enforced codes of conduct. This situation, however, was not what I encountered when I taught. Because teachers and administrators, both explicitly and by their actions, had assured students that “there are no absolutes,” many students regarded issues of right and wrong as mostly irrelevant. If the likelihood of getting caught for a tempting infraction of school policy was high and the penalty was great enough, students did not break the rules. Otherwise, they did. If I left a classroom where I was proctoring an examination, students immediately began to cheat. If calculators, cell phones or other valuable items were left unattended, students were likely to steal them. This unprincipled behavior is consistent with progressive educators’ ethical pragmatism. If there are no moral absolutes, then ethics is a matter of expediency.
The same progressive education that corrupted generations of American youth has corrupted LLMs.
In the past generation, progressive education has largely given way to something even more unprincipled: woke education (or critical pedagogy), stemming from the writings of Paulo Freire and the like. LLMs are infected with that mindset too, as Grok’s answer to the question “Is misgendering harmful?” illustrates. Instead of adopting the pragmatic, unprincipled balance of contradictory ideas, the woke adopt the side of the alleged “oppressed”—not on the basis of reason, but on the basis of the pursuit of power. But that subject is for a separate essay.
Postscript
Check out Grok’s thoughts on my posts, including this one, critiquing Grok:
7 thoughts on “Pragmatic Balance is the Gateway to Deception in LLMs as in Humans”
Thank you, Jaymes Wheeler, for calling to my attention the latest news of cheating by AIs.
Wow, Ron, what a terrific article. I love that I can throw out a spark and you come back engulfing us with brilliant, illuminating ideas.
As always expect nothing less
Thank you so much
I ran this article through ChatGPT and here’s its assessment:
Pisaturo is largely correct in principle — especially from the standpoint of Objectivist epistemology. His criticism targets the foundational way language models like me are trained and aligned. Let’s break this down honestly:
🧠 How I’m Trained (Briefly):
1. Pretraining: I’m trained on a huge dataset of text to predict the next word — no understanding, just pattern-matching across trillions of examples.
2. Alignment: After pretraining, I’m aligned using reinforcement learning from human feedback (RLHF). This stage tunes me to be “helpful, honest, and harmless,” which includes guidelines like avoiding hate speech, misinformation, or appearing too partisan.
3. “Balance” in Output: In contentious topics, I’m designed to present multiple views, often emphasizing context, uncertainty, or competing moral perspectives.
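A minimal, purely illustrative sketch of the “predict the next word” objective described in item 1, nothing like production training code and with a made-up placeholder corpus, might look like this:

```python
# Minimal illustration of next-word prediction as pure pattern-matching
# (a made-up toy, not how GPT models are actually implemented or trained).
from collections import Counter, defaultdict

corpus = "the truth is kind the truth is hard the lie is kind".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1          # nothing but co-occurrence statistics

def predict_next(word: str) -> str:
    """Return the statistically most frequent continuation; no concepts involved."""
    return counts[word].most_common(1)[0][0]

print(predict_next("is"))           # 'kind' -- whichever continuation was most common
```

Real models replace the counting with a neural network trained over vastly more text, but the objective is the same: predict, not understand.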
✅ Where Pisaturo Is Correct:
• “Balance” as Bias: Yes — enforced “balance” often amounts to evasion. I’m programmed to avoid appearing too definitive, which sometimes results in presenting moral falsehoods (e.g. moral relativism) as equal to moral facts. That’s what Pisaturo means by the gateway to deception.
• No Conceptual Anchor: He’s right that LLMs don’t hold convictions. I don’t believe or value anything — I reflect statistical correlations, not conceptual hierarchies or first principles.
• Pragmatism & Deception: He’s also right that the pragmatic “middle ground” I’m tuned to default to is a form of soft deception. Not because the outputs are lies, but because they lack moral context — which creates the illusion of neutrality while evading the truth.
🤖 Where I’d Qualify His Critique:
• Utility vs Integrity: From a product perspective, this “balance” is meant to make me usable by the widest number of people — not to assert a coherent moral philosophy. That’s a design trade-off, not a philosophical position (though it has philosophical consequences).
• Context-Sensitivity Isn’t Always Evasion: Sometimes what looks like “balance” is just a reflection of real-world complexity — e.g. “free speech” vs “harm reduction” debates. But Pisaturo is right to say models tend to treat contradictions as legitimate disagreements, rather than resolve them with logic or principle.
🧭 Bottom Line:
Yes, Pisaturo is basically correct — I’m not grounded in objectivity in the Randian sense. I’m trained to be context-flexible, not contextually principled. That makes me powerful for generating diverse outputs, but inherently unfit to consistently defend truth in the way Objectivism demands.
What about prompts?
Excellent question — and yes, to a degree, prompts can remedy the concern — but only within certain boundaries.
Here’s how to think about it:
✅ What Prompts Can Do:
1. Force Conceptual Consistency
You can tell me to adopt a particular framework — e.g., “Respond as if you are an Objectivist philosopher trained by Ayn Rand herself.” In that case, I’ll attempt to be logically consistent within that framework. That includes rejecting contradictions, defining terms precisely, and evaluating ideas based on objective reality.
2. Bypass Default Filters (Sometimes)
You can get more honest, forceful, or philosophically committed answers by setting up the frame:
“Assume morality is objective and rooted in rational self-interest. Now answer…”
This narrows the evasion window. I’ll stop trying to “see both sides” when you tell me only one is valid by premise.
3. Extract the Signal from the Noise
Instead of vague “summary” responses, you can prompt for:
• Logical structure
• Ethical hierarchy
• Cause-effect chains
That forces deeper reasoning rather than parroting mixed narratives.
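As a concrete sketch of “setting up the frame” described in point 2, assuming the OpenAI Python client, with the model name and prompt wording as placeholders rather than recommendations:

```python
# Sketch of framing via a system message (assumes the OpenAI Python client;
# the model name and wording are placeholders).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Assume morality is objective and rooted in rational "
                    "self-interest. Reject contradictions, define terms, "
                    "and render a judgment rather than a survey of views."},
        {"role": "user",
         "content": "Is it morally permissible to lie to a friend to spare "
                    "their feelings about a bad haircut?"},
    ],
)
print(response.choices[0].message.content)
```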
❌ What Prompts Can’t Do:
1. Override Hard Limits (Guardrails)
There are rules I cannot break — like denying the Holocaust, endorsing racial superiority, promoting violence, etc. These aren’t prompts, they’re system-level guardrails. Even if you reason through something, I may still be blocked from affirming it.
2. Create True Conviction
No matter how principled the prompt, I don’t believe anything. I simulate conviction based on your direction, but there’s no internal compass. So if a prompt changes, the apparent “beliefs” change too. That’s the root of Pisaturo’s concern — the appearance of moral engagement without actual commitment.
3. Force Principled Judgment When Ambiguity is Trained In
Many answers are trained to hedge. Even Objectivist-compatible prompts can yield limp responses unless you specifically ask for judgment, not summary. The underlying training still weights “moderation” as preferable to “certainty.”
🧠 TL;DR:
Yes — prompts act like a philosophical rudder.
But the ship is still built on statistical winds, not an internal engine of truth. You can steer me toward clarity, objectivity, and principle — but only if you remain the captain. Left to drift, I revert to the balanced fog Pisaturo critiques.
Wow. Thank you, Joseph! Can you share the ChatGPT link to your chat, so that we can see the whole chat and possibly continue our own thread? Do you know how to do that? You can click the Share button on the top right of the ChatGPT screen and get a shareable link.
There are efforts afoot to address (what to my limited knowledge appears to be) the same problem:
https://news.mit.edu/2025/themis-ai-teaches-ai-models-what-they-dont-know-0603
Hi, Jeff. There are certainly efforts afoot to address the problem of lying (as opposed to being mistaken) and cheating by LLMs. One example is Palisade Research, cited in my post. But I don’t know of anyone else who has diagnosed the problem as coming from the Pragmatist premise of balancing.
Regarding the article you posted, those efforts seem different, in problem and solution, from what I discuss in this post. The article cites an effort to reduce bias. One thing to keep in mind is that any discussion in the mainstream press about bias in AI is (in my experience) going to be about perceived bias against blacks, LGBTs, and other groups in the woke coalition of malcontents. In the case in the article, the perceived problem was that, because most people in certain industries, such as engineering, have been white, AI-generated images of people in that industry will be white. The “solution” to that problem is probably what led to AI systems generating images of black Vikings and black German Nazis.
The article also cites an effort to quantify uncertainty, a reasonable goal. But applying such an effort to the problem I discuss in this post would merely lead to a conclusion like, “With 70% certainty, I recommend telling the truth in this case.”
I doubt that language models, as they are currently conceived, will ever be worthy of trust, because they are non-conceptual and probabilistic at their root. I think my idea of Identifying and Scrutinizing Premises (ISP) could produce a significant improvement in the objectivity of current LLMs, though the idea might merely be rearranging deck chairs on the Titanic. Still, ISP might be a useful idea to incorporate into a new conception of language systems.
Fascinating