LLMs have advanced at a staggering pace across fields. They debug complex code, solve graduate-level math problems, assist with physics research, and generate functional software from natural language descriptions. The progress seems relentless—new benchmarks fall every few months, capabilities that seemed impossible last year are routine today.
Yet one thing that still frustrates anyone who uses these tools regularly is how they handle prose: ask an LLM to build a complex web scraper and it will deliver flawless Python. Ask it to write a heartfelt condolence email and it sounds like a corporate HR manual.
It's almost comical: the model can code beautifully but can't capture the right tone for a simple thank-you note. It feels random, like the technology is brilliant in some areas and inexplicably mediocre in others.
But if you look deeper, it’s not random at all. The gap between AI’s coding genius and its writing mediocrity isn’t a quirk or a temporary limitation. It’s a fundamental problem rooted in how AI learns what “good” actually means. And understanding why reveals something crucial about what AI can and can’t do.
Why AI Sucks at Literature
Code and math exist in a beautifully structured world with absolute rules. Did the code compile? Did it pass unit tests? Is the math solution logically sound? These questions have binary answers. Take a simple equation: 3x + 5 = 10. Pure logic dictates 3x = 5, so x = 5/3. If the AI outputs exactly that, it gets rewarded. If it outputs anything else, it gets penalized. No points for creative attempts or friendly explanations. Absolutely right or absolutely wrong.
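That grading step can be sketched in a few lines of Python. This is a toy illustration of the idea, not any lab's actual training pipeline: a checker that returns a binary reward for the equation above, with exact arithmetic so only the precise answer passes.

```python
from fractions import Fraction

def reward(candidate: Fraction) -> int:
    """Binary reward: 1 if the candidate exactly solves 3x + 5 = 10, else 0."""
    return 1 if 3 * candidate + 5 == 10 else 0

print(reward(Fraction(5, 3)))  # the exact answer -> 1
print(reward(Fraction(2)))     # anything else   -> 0
```

Note there is no partial credit: a close-but-wrong answer scores exactly the same as nonsense, which is precisely what makes the signal so clean.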
This is what we call verifiability. Code either works or it doesn't. Math problems have objectively correct solutions. But writing? Writing is subjective, contextual, and impossible to grade with absolute certainty. That difference determines how fast AI can improve at each skill.
Here’s how it works. Initially, large language models are just trained to predict the next word in a sentence. To turn them into useful assistants, they go through a post-training phase called Reinforcement Learning from Human Feedback (RLHF). This phase rewards the AI for good responses and penalizes bad ones, teaching it to behave helpfully rather than just complete text plausibly.
For English writing, the grading problem is messy. How do you objectively score whether an email is “warm but professional” or an essay is “compelling”? You can’t have humans rate millions of AI responses, so companies train a Reward Model—literally another AI trained on pairs of human preferences—to act as the judge. The result: an AI trying its best to guess human subjectivity. That’s inherently noisy, slow, and prone to drift as the Reward Model itself has biases baked in from whatever human feedback it saw.
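The Reward Model's training objective can be sketched as a pairwise preference loss (this is the standard Bradley–Terry formulation commonly used in RLHF; the scores below are made-up numbers, not real model outputs). The model learns to score the human-preferred response higher than the rejected one:

```python
import math

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(chosen - rejected)).
    Small when the reward model already ranks the preferred response
    higher; large when it gets the ranking backwards."""
    margin = score_chosen - score_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

# Reward model agrees with the human labeler -> small loss
print(preference_loss(2.0, -1.0))
# Reward model disagrees -> large loss, big gradient update
print(preference_loss(-1.0, 2.0))
```

Notice the loss only ever sees *relative* rankings between two responses, never an absolute truth value. That is the structural difference from a compiler, and the reason the judge inherits whatever biases its human labels contained.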

For code and math, verifiability unlocks a superpower: infinite self-training. Because the AI doesn’t need a slow, fuzzy Reward Model to grade its work, it can generate millions of math problems or Python scripts, automatically verify the answers against a compiler or calculator, and update its own neural network in a massive, automated loop. When feedback is objective, perfectly accurate, and fully automated, AI improves at lightning speed.
The process works like this: The model generates a coding problem and attempts a solution. A compiler or test suite verifies whether the solution works—instantly, objectively, at zero marginal cost. If correct, that example reinforces the model’s understanding. If wrong, the exact error message provides precise feedback about what failed. The model updates its weights and tries again. Repeat millions of times.
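The verification half of that loop is simple to sketch. Here is a minimal, hypothetical grader in Python: it runs a candidate solution against a test suite and returns both a binary reward and the precise failure message the article describes (the "max of a list" problem and both attempts are invented for illustration).

```python
def run_tests(solution, test_cases):
    """Grade a candidate solution against a test suite.
    Returns (reward, feedback): reward is 1 only if every case passes,
    and the feedback string pinpoints the first failure."""
    for args, expected in test_cases:
        try:
            got = solution(*args)
        except Exception as e:
            return 0, f"crashed on {args}: {e}"
        if got != expected:
            return 0, f"failed on {args}: expected {expected}, got {got}"
    return 1, "all tests passed"

# A toy problem: "return the maximum of a list".
tests = [(([3, 1, 2],), 3), (([-5, -2],), -2)]

buggy = lambda xs: xs[0]      # a wrong attempt by the model
correct = lambda xs: max(xs)  # a correct attempt

print(run_tests(buggy, tests))    # reward 0, plus exactly which case failed
print(run_tests(correct, tests))  # reward 1
```

Every call is instant, deterministic, and free, which is why this loop can run millions of times with no human in it.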
This is why coding benchmarks keep getting demolished. Models can practice coding against compilers 24/7 with perfect feedback. They’re essentially doing what human programmers do—write code, run it, debug errors—except at a scale that’s physically impossible for humans. A model can attempt ten thousand coding problems overnight and get instant, definitive feedback on every single one.
English writing can’t do this. There’s no compiler for emails. No unit test for essays. You can’t automatically verify whether a product description is “engaging but not overly salesy.” The best you can do is have that Reward Model—trained on human preferences—give its opinion. And that opinion is noisy because human preferences themselves are inconsistent, context-dependent, and culturally variable.
The gap compounds over time. As coding models self-train on billions of verifiable examples, they discover edge cases, optimize solutions, and internalize patterns that generalize across programming languages. Writing models, meanwhile, are stuck waiting for expensive human feedback or relying on Reward Models that themselves are imperfect approximations of human judgment.
This explains why AI-generated code often feels impressively competent while AI-generated creative writing feels generic or off-tone. The code was trained with billions of examples where success was unambiguous. The writing was trained on fuzzy approximations of what humans might prefer, filtered through another AI’s best guess.
It also explains why certain writing tasks work better than others. Technical documentation? Pretty good—there are objective standards for clarity and completeness. Marketing copy or fiction? Much weaker—success depends heavily on subjective audience response that’s impossible to verify automatically.
The future might narrow this gap somewhat. Researchers are exploring ways to make writing feedback more objective—measuring reader engagement, testing different phrasings with actual users, training Reward Models on more diverse human preferences. But the fundamental constraint remains: you can’t compile an email or run unit tests on poetry.
Code and math will continue advancing faster than creative writing because they offer something English never can—immediate, perfect, infinite feedback. The AI doesn’t need to guess whether it succeeded. It knows, instantly, every single time. That’s not a temporary advantage that better training will overcome. It’s a structural difference in what these domains allow AI to learn from.
So when you get flawless Python from ChatGPT but mediocre prose, remember: one of those tasks has a compiler. The other has a Reward Model making educated guesses. And in AI training, that difference is everything.