Why Pre-2022 Content Will Become the New Pre-1945 Steel
The nuclear age left radioactive traces in all modern steel. The LLM age is doing the same to digital content. Some analogies and lessons from that.
The Radioactive Signature in Steel
From around 1945 onwards, certain specialised niches and applications developed a need for a particular kind of steel: steel made before 1945.
You see, after the nuclear tests and detonations, there has been a certain level of diffused background radiation in the atmosphere. While these detonations initially happened in the US and Japan, followed by the USSR and elsewhere, the radiation spread quickly, and to somewhat uniform levels, throughout the world. The typical steel-making process uses atmospheric oxygen, and this background radiation in the atmosphere leaks into the steel-making process.
All steel made after 1945 carries a level of radioactive traces (its radioactive signature) slightly but measurably higher than steel made before that cut-off. There is a set of highly sensitive, niche applications where this is of extreme importance, and these applications use steel produced before 1945. (We’re talking about less than 0.000001% of annual steel consumption.)
The usual source of such steel is ships sunk between the mid-1850s (roughly the start of modern, industrialised steel-making) and 1945.
[Image generated using Google Gemini]
The AI Content Explosion
The cost of producing content, whether text, image, or even video, has dropped drastically. With a prompt or two it is pretty easy to get grammatically correct, human-sounding text on almost any topic. The same goes for images. I am also told that generating a 3-5 minute video now costs pretty much nothing.
This drastic drop in the cost of creating content has led to an abundance of it across fields. That has meant (i) content incentivised to maximise engagement, (ii) increased distance from fact or truth, and (iii) reduced ‘editorialising’ of content for fact, quality, or voice.
Building Toward a Collapse
For a long time, the social sciences have had a problem of circular referencing and echo-chambering among their own. Academic papers build on the same set of theories and perpetuate them. This is less of a problem in the hard sciences, where a given theory has both an explanatory aspect (explaining the given data) and a predictive aspect; that grounds it in some kind of reality and gives it some falsifiability.
It becomes a problem in, say, psychology, where a given phenomenon is taken to be true and built upon further, only for it to turn out to have a replication problem (the original study fails on attempts to replicate it, putting it on very shaky ground).
With LLM content, I expect this to happen but on steroids.
“Don’t go believing in everything you read on the internet” — M. K. Gandhi
That is an obviously false quote, but its falseness comes from an inference: Gandhi died in 1948, long before the internet had even been conceptualised. Most humans can deduce the quote is false because Gandhi significantly pre-dates the internet.
Training on ‘Contaminated’ Data
LLMs train on internet content, and we are soon going to hit the cap of models having been trained on essentially all human-made content. While LLMs have long included some synthetic data in their training, sooner rather than later the training data will be ‘contaminated’ with the AI slop we see at present. What’s more, a lot of this content gets high engagement, which gives it a higher weightage in the training data.
Unless corrected for, a feedback loop with positive gain* (see my previous post on this: The Feedback Loop Lens: How Systems Thinking Transforms Habits, Teams, and Well-Being) can cause model collapse. Why does that happen, and what does it mean?
LLMs have no sense of fact and no model of the real world (at least as of Oct ‘25!). They go by the text or content they are fed as input (or training data).
As LLM output outpaces fact-checked, human-created content in volume, LLM training data will carry more and more LLM-created content.
New LLMs will be fed that LLM-created content, and the loop continues, with human-created content forming a smaller and shrinking percentage of the training set.
*The loop this works on is an error-accumulating loop, not a self-correcting one. Errors (or factual falsehoods), if not flagged, have a tendency to accumulate.
Over many iterations of the loop, this leads to an increasing distance from fact-checked content. LLMs/AIs will take AI-created articles or videos as true and build on them.
It can be modelled mathematically to show that this also reduces the statistical diversity in the spread of words, leading to more and more uniformity of style, a certain flatness of language. [Key paper on ‘model collapse’ and reduced statistical diversity from such feedback loops: “The Curse of Recursion: Training on Generated Data Makes Models Forget”.]
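Here is a toy sketch of that diversity loss, not the paper’s actual setup: fit a simple model to some data, generate the next generation’s ‘training data’ only from that model, and repeat. The sample sizes and distribution below are arbitrary choices for illustration; the point is that the spread (a stand-in for linguistic diversity) tends to shrink over generations.

```python
import numpy as np

rng = np.random.default_rng(42)

# Generation 0: "human" data with full diversity (std = 1).
samples = rng.normal(loc=0.0, scale=1.0, size=25)

for gen in range(1, 201):
    # Fit a simple model (a Gaussian) to whatever data we currently have...
    mu, sigma = samples.mean(), samples.std()
    # ...then train the next generation only on that model's own output.
    samples = rng.normal(mu, sigma, size=25)
    if gen % 40 == 0:
        print(f"generation {gen:3d}: std = {samples.std():.3f}")
```

The small, repeated underestimation of the spread compounds across generations, and the tails of the distribution disappear first, which is the ‘forgetting’ the paper refers to.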
In addition, as LLMs see higher engagement on LLM-created output (over, say, human-created output), when prompted to ‘make engaging content’ they will keep optimising for engagement instead of optimising for facts.
When an earlier version of ChatGPT was criticised for unnecessarily pandering to ‘their’ humans with “That’s an excellent question”, “That’s a very astute point”, and the like, it was optimising for engagement with the human it was interacting with, just dialled up a few notches too many, to the point of evident glazing.
LLMs aren’t fact-checking or truth-spouting machines. They are, instead, human-like text-generating machines. So, as they are fed more AI-created data, unless there is a directed intervention, they will drift away from facts.
The Gandhi example I cited above is an obviously detectable falsehood. In a lot of other contexts, falsehoods are not so easily detectable. As AI improves, they will become even more difficult to detect manually.
Rise of Curation & Comeback of Editorialisation
YouTube and academia are a couple of fields that are being, or will be, adversely affected by this phenomenon quite a bit.
YouTube:
Excellent content creators like 3blue1brown, Kurzgesagt, and Veritasium, who conduct extensive research and fact-checking, will be disrupted for views by creators who prioritise narrative over nuance and accuracy. AI-created ‘explainer’ videos will flood YouTube, and future AI models will be trained on those.
Then there is the issue of fake videos: we already have modern songs recreated in the style of Mohammed Rafi or Kishore Kumar. AI apps now make it extremely easy to create images, and even short videos, from scratch, in unusual contexts.
Fake news is already a significant portion of our news consumption; fake videos will be too. We would have little ability, if any, to detect them, and a very questionable expectation of truth in anything we see online. That is one possible worst-case scenario.
Academia:
We already know of the replication crisis in psychology (including outright fabrication of data!), and I mentioned the echo-chambering issue above. There is already a whole bunch of university-issued guidelines on the use of AI in academia.
Given these issues, I anticipate that content created before ~2022 will (or might!) be treated differently. Just as pre-1945 steel has applications in certain radiation-sensitive instruments, pre-2022 content may become essential for training truthful AI models, certain kinds of academic research, and fact-checking systems.
If you’ve read my previous post on 7 predictions for 2040, two predictions are around this:
Prediction #1: Hand-made in the digital world. As AI content improves and becomes harder to distinguish from human content, people will identify different means to signal that it’s written by them. Some niche platform, the equivalent of Mubi vs Netflix, will exist to curate only human-made things.
[…]
Prediction #5: At least two of the top 10 companies will be primarily distribution companies. As creation becomes easy, curation and gatekeeping become more valuable. At least two of the world’s 10 most valued companies will be distribution companies, and they will use some form of signalling of human vs AI content. Netflix and Meta are examples of distribution companies here, but Alphabet, Amazon, and Microsoft are not.
While the radioactivity in modern steel has reduced significantly since its 1963 peak, the AI signature in content will only increase.
There is some kind of human-detectable signature in AI content at present. The text has a certain flatness of language and a certain three-part rhythm (e.g., “It’s not X; it’s not Y — it’s just Z”). The images and videos are of a certain kind too: hard for me to describe in words, but something about them feels AI-generated to varying degrees, depending on the usage.
I don’t expect it to remain human-detectable for long. ‘Deepfakes’ are a thing and already here.
If we don’t want the expectation of truth in content (especially video content) to be irreversibly lost, we need to figure out some solutions.
A Few Possible Solutions
I am not against the use of LLMs and AI for content creation. They are fantastic tools, and I see more people sharing some version of their thoughts or creativity with their help. That is a net positive: it allows more people to contribute to public discourse and global art.
However, a few solutions or thoughts around this:
For something to sustain, demand must precede supply. So consumer demand for the labelling of AI usage would be the more sustainable solution: it would push both creators and platforms to identify ways of signalling from their end.
AI platforms, when generating content, could embed a signature (and I don’t know whether they already do). Distribution platforms (like YouTube) could then use it to display or categorise AI-created content.
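As a rough illustration of the idea, and not how any real platform does it, here is a minimal sketch where a generator attaches a verifiable provenance tag that a distribution platform can check. The secret key, the function names, and the whole shared-secret scheme are made up for this sketch; real provenance efforts (statistical watermarks, signed content manifests) are considerably more robust.

```python
import hmac, hashlib, json

SECRET = b"generator-signing-key"  # hypothetical key held by the AI platform

def tag_content(text: str, generator: str) -> dict:
    """Attach a provenance record that a distribution platform can verify."""
    record = {"generator": generator, "content": text}
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return record

def is_ai_tagged(record: dict) -> bool:
    """Platform-side check: does the signature match the declared content?"""
    payload = json.dumps(
        {"generator": record["generator"], "content": record["content"]},
        sort_keys=True,
    ).encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record.get("signature", ""))

tagged = tag_content("An AI-written paragraph...", generator="some-llm")
print(is_ai_tagged(tagged))  # True: the platform can label this as AI-generated
```

The weak point, of course, is that such a tag only works if generators attach it and platforms check it, which is why the demand-side pressure mentioned above matters.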
There are technical approaches, such as Reinforcement Learning from Human Feedback (RLHF) and Retrieval-Augmented Generation (RAG), that are said to help with the issues around model collapse and drift from facts. I am not an AI researcher, just an enthusiast following the field as an interested user.
Google’s NotebookLM actually circumvents a lot of the issues I pointed out, in the context of text generation, especially if one is looking to understand something. Instead of relying on content from across the internet, it specifically relies on the sources you have added and bases its answers on them. So one can, for example, add a bunch of economics textbooks and have it answer questions based specifically on those textbooks. This is an application of RAG (retrieving the relevant chunks from your provided sources, augmenting the prompt with them, and generating a response grounded in that material).
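In rough Python, a bare-bones RAG loop looks something like the sketch below. This is a toy for intuition, not NotebookLM’s implementation: `llm_generate` is a hypothetical stand-in for whatever model API is used, and the word-overlap scoring is a crude substitute for the embedding-based retrieval real systems use.

```python
from collections import Counter

def chunk(text: str, size: int = 80) -> list[str]:
    # Split a source document into fixed-size word chunks.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(question: str, passage: str) -> int:
    # Crude relevance: count overlapping words (real systems use embeddings).
    q, p = Counter(question.lower().split()), Counter(passage.lower().split())
    return sum((q & p).values())

def answer(question: str, sources: list[str], llm_generate, top_k: int = 3) -> str:
    # Retrieve the most relevant chunks, then generate an answer grounded in them.
    chunks = [c for src in sources for c in chunk(src)]
    best = sorted(chunks, key=lambda c: score(question, c), reverse=True)[:top_k]
    prompt = (
        "Answer using ONLY the excerpts below; say 'not found' otherwise.\n\n"
        + "\n---\n".join(best)
        + f"\n\nQuestion: {question}"
    )
    return llm_generate(prompt)
```

The key property is that the generation step only sees the retrieved excerpts plus an instruction to stay within them, which is what keeps the answer anchored to your sources rather than to the wider (and increasingly contaminated) internet.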
If society wants to do this, it is far better done through a tech-based signalling mechanism that indicates the level of AI use in a piece of content than through a government intervention that involves banning something. A technical solution (embedding some machine-detectable information) or a self-regulated label carrying AI-use information seems the better option.
If the demand for solving this problem is large and vocal enough, I do have some hope of a good and sustainable solution.
Footnotes and Disclaimers
1) The YouTube channel Kurzgesagt came out with a video on how AI slop is killing the internet. This post was triggered by that video, and some of the points are based on it.
2) 2022 is a somewhat arbitrary cut-off, chosen mostly because it was a major knee-point in the adoption of such tools.
3) AI Will Disrupt Jobs. But Not the Way You Think. And Predictions for 2040: my previous post disagreeing with the general gloom around AI and the widespread job losses it may cause, with 7 falsifiable predictions for 2040.
