News outlets are blocking Wayback Machine from archiving their pages — 23 outlets concerned AI companies might abuse fair use and use it to train their models

(Image credit: Internet Archive)

Many news outlets are reportedly blocking the Wayback Machine from archiving their pages, apparently because they fear that AI companies will abuse fair use policies and train their models on snapshots of old articles. This risks reducing society’s collective access to historical news stories and other critical information, especially in an age when misinformation is abundant and AI large language models (LLMs) hallucinate convincing answers. Wired reports that 23 major publications, including USA Today and The New York Times, currently block ia-archiverbot, the Internet Archive’s commonly used crawler. Ironically, Wired pointed out that some of these outlets use the Wayback Machine in their own reporting.
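For readers curious about what "blocking a crawler" can look like in practice, here is a minimal Python sketch. It assumes the block is expressed through a robots.txt rule, which the reporting does not confirm; outlets may use other methods. The robots.txt content, the `is_blocked` helper, and the example URL are all hypothetical, and the crawler token is simply the name given above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; not copied from any real outlet.
# The crawler token below is the name given in the article and is an assumption.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia-archiverbot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://news.example/2024/some-story"  # placeholder URL
    for agent in ("ia-archiverbot", "SomeOtherBot"):
        print(agent, "blocked:", is_blocked(SAMPLE_ROBOTS_TXT, agent, page))
```

A rule like this only binds crawlers that choose to honor robots.txt, which is one reason blocking a well-behaved archiver may do little to stop less scrupulous scrapers.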

Many libraries and newspaper offices used to keep rich repositories of archived volumes, which people consulted to gain insight into the historical record. But as the world has moved from print journalism to the convenience of online newspapers, these archives are no longer updated; we must now rely on online archiving services like the Wayback Machine to serve as the modern historical record.

Publications have pushed back against archiving before, but the legal system has established that what the Internet Archive is doing is legal and falls under fair use. “Courts have long recognized it’s often impossible to build a searchable index without making copies of the underlying material,” the Electronic Frontier Foundation said. It added, “The copying served a transformative purpose: enabling discovery, research, and new insights about creative works.”

A Wayback Machine snapshot of the Tom's Hardware homepage from 1997 (Image credit: Tom's Hardware/Wayback Machine)

It could be argued that newspapers and publications could handle their own archiving, but it’s in the public interest that a neutral third party handle record-keeping. After all, online articles are easy to edit after publication, and while many outlets are trustworthy, some are owned by big corporations that could benefit from controlling the historical narrative. Besides, it’s commonly known that outlets sometimes update articles, whether openly or in secret, so an archive like the Wayback Machine is also useful for tracking such changes. Archive services can also preserve records of publications that have since gone defunct and whose content would otherwise have been lost to history.
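As a rough illustration of how archived snapshots can surface edits, the sketch below compares two versions of a page's text using Python's standard difflib module. The `diff_snapshots` helper and the sample text are made up for this example; fetching real snapshots from an archive is deliberately left out.

```python
import difflib


def diff_snapshots(old_text: str, new_text: str) -> list[str]:
    """Return unified-diff lines describing how new_text differs from old_text."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="earlier_snapshot",
            tofile="later_snapshot",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Made-up placeholder text, not taken from any real article.
    earlier = "Placeholder headline, version one\nBody paragraph stays the same."
    later = "Placeholder headline, version two\nBody paragraph stays the same."
    for line in diff_snapshots(earlier, later):
        print(line)
```

Lines prefixed with `-` appear only in the earlier snapshot and lines prefixed with `+` only in the later one, so an unannounced headline change shows up as a removed line paired with an added line.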

Companies abusing fair use policies to train LLMs is indeed a valid concern for media companies and other platforms that host massive amounts of data. But blocking archiving services such as the Wayback Machine will do society more harm than good. Hopefully, not all is lost for archiving: Wayback Machine director Mark Graham is reportedly in talks with several outlets so that the archiver’s bot can regain access to these websites, while a coalition of journalists and other stakeholders has signed a letter in support of the Internet Archive and its mission of providing universal access to all knowledge.

Jowi Morales is a tech enthusiast with years of experience working in the industry. He has been writing for several tech publications since 2021, covering his interests in tech hardware and consumer electronics.

17 comments from the forums

bit_user

I think readers should boycott news sites that do this.

The solution to AI companies scraping their data is to publish with a license that forbids it, and then sue any AI providers who violate those terms. Yes, neither that surveillance nor litigation is cheap, but publishers should be able to pool their resources to support such activities.
ezst036

This isn't exactly new.

These websites were blocking the Wayback Machine before the rise of AI because of internet sleuths who would run diff comparisons over months to see how headlines and web content were changed without attribution (key words changed, headlines altered, whole paragraphs gone without a public notation). The sleuths noted that the changes all leaned strikingly one way, politically.

AI just gives them a much cleaner excuse to cover for their stealth edits than "I didn't want to get caught working for political gain".
Arkitekt78

Haha... "news" sites.

The only reason to block is they know they regularly publish false information and they are tired of being called out on it with evidence. Journalism is dead.
bit_user

ezst036 said:
This isn't exactly new.
Depends on which news sites you're talking about.

Arkitekt78 said:
Journalism is dead.
Depends on which news sites you're talking about.

Some people have a very strong motive to undermine public trust in journalists and journalism. It's not for no reason that a free press is in the First Amendment of the US Constitution.

In the spirit of the quote: "The greatest trick the Devil ever pulled was convincing the world he didn't exist."

...I'd say beware of people who tell you the media cannot be trusted. They just might be seeking to undermine the one thing that can hold them accountable.
JamesDax65

I don't understand this nonsense. How are supposed to train AIs if they can't access news sites and libraries? This has less to do with privacy and more to do with selfishness and greed.
bit_user

JamesDax65 said:
I don't understand this nonsense. How are supposed to train AIs if they can't access news sites and libraries?
Training your AI model on others' content is not a right.

The proper way to do it is to arrange with the rights owners of the media you want to use for training. Some AI companies have already reached such agreements with some large publishers.

JamesDax65 said:
This has less to do with privacy and more to do with selfishness and greed.
Are you implying the AI companies are the altruists, here?
🤣
ezst036

bit_user said:
Depends on which news sites you're talking about.

Some people have a very strong motive to undermine public trust in journalists and journalism. It's not for no reason that a free press is in the First Amendment of the US Constitution.

In the spirit of the quote: "The greatest trick the Devil ever pulled was convincing the world he didn't exist."

...I'd say beware of people who tell you the media cannot be trusted. They just might be seeking to undermine the one thing that can hold them accountable.
I agree with every word here.

There's the big problem though: nobody has access to stealth edit (__________ news website) except for the news website itself. To that extent, that makes the journalists themselves the saboteurs of that very journalism. The buck stops right there at their desk.

(One of the sources listed in the Tom's Article is USA Today, so I'm just going to go with that one as an example, this is not an accusation)
In other words, I can't see an argument to be made that it's the blogger's fault for noticing that USA Today did a stealth edit. That's USA Today's fault for stealth editing. They simply own it, they do.

It's entirely controllable. If they don't want people to notice stealth edits, then they can cease and desist all stealth edits in the future, forever. That would go a long way toward restoring the trust that has been lost in journalism.

I can tell you right now, even if I had the password and could do edits to the USA Today website, I would not do it. But I'm just saying. The organization itself owns this and they own their own failures, and off of the top of my head I cannot ever remember seeing a news report that "person name" was fired after being discovered stealth editing news articles.

Now you know, you know they have logging. So they know who in their organizations are doing this editing. But if they can't have accountability, then trust is going to fall.
thestryker

bit_user said:
I think readers should boycott news sites that do this.

The solution to AI companies scraping their data is to publish with a license that forbids it, and then sue any AI providers who violate those terms. Yes, neither that surveillance nor litigation is cheap, but publishers should be able to pool their resources to support such activities.
What the AI companies are doing likely is against the law already. So the question becomes: as a publisher, do you cut off the problem, or do you spend tons of money and time in court to get compensation (and even if they do, it still won't stop what's happening)? It's all well and good to want the system to work, but we have way too much evidence showing it simply doesn't when you don't have a government willing to step in and do something. Pretty much every major tech-driven service of the last couple of decades has relied on exactly this. Legality simply doesn't matter if you're allowed to continue operating illegally and periodically write checks to make a nuisance go away.
USAFRet

thestryker said:
What the AI companies are doing likely is against the law already.
1. Too big to fail.

2. Be good friends with the people that have the authority to allow you to bypass that 'law'.
dedcap

Hhhmmm, how does that fix anything? What's to keep someone from just accessing the pages or building their own scraper that avoids their protection? If someone's doing it the bad way, what do they care? Their paywalls are a joke. Just disable the JS that they all seem to have copy/pasted and block access to specific servers if it doesn't break the readability. That gets rid of the paywall and annoying ad popups. I am not doing this btw, TH your ads are still there for me :). Edit: just pointing out how simple it is on most of their sites to break their "protections".