Reddit will block the Internet Archive to stop AI scraping

Reddit says AI companies have been using the Wayback Machine to grab huge amounts of Reddit data without permission.

Reddit is restricting how much of its content can be saved by the Internet Archive's Wayback Machine, a popular online tool that lets people see old versions of websites.

From now on, the Wayback Machine will mostly be blocked from copying Reddit pages. 

It will still be able to archive the Reddit homepage (showing the day’s most popular posts and headlines) but not the actual post details, comments, or user profiles.

Why the crackdown? According to Reddit, some AI companies have been using the Wayback Machine to grab huge amounts of Reddit data without permission. 

Tim Rathschmidt, a Reddit spokesperson, said the site wants to protect user privacy and enforce its own policies, such as making sure content that users delete is actually removed. (Via: The Verge)

Until the Internet Archive can better prevent scraping and follow these policies, Reddit is restricting its access.

The change starts immediately, and Reddit says it gave the Internet Archive a heads-up beforehand. 

This isn’t the first time Reddit has fought against large-scale scraping, especially now that AI companies are hungry for data to train their models. 

Reddit is willing to share data if companies pay for it, but it’s been blocking free access through many tools.

In 2023, Reddit made major API changes that caused outrage and protests because they shut down some third-party apps. 

Reddit said these moves were partly to stop AI companies from using its data for free. In 2024, it struck deals to sell data to Google (for search and AI training) and to OpenAI.

But in June 2025, Reddit sued another AI company, Anthropic, accusing it of continuing to scrape Reddit content even after saying it had stopped.

The Internet Archive’s mission is to preserve the history of the Internet, including cultural and historical content. 

Its Wayback Machine lets people revisit old websites, but this dispute shows how data ownership is becoming more of a battleground in the AI era. 

Mark Graham, who runs the Wayback Machine, says the Archive has a long-standing relationship with Reddit and is still talking with them about the issue.

Is Reddit right to protect its users' data from AI scraping, even if it means limiting internet archiving for historical purposes? Should AI companies be required to pay for all training data, or does this create barriers that could stifle innovation?

Ronil is a Computer Engineer by education and a consumer technology writer by choice. Over the course of his professional career, his work has appeared in reputable publications like MakeUseOf, TechJunkie, GreenBot, and many more. When not working, you’ll find him at the gym breaking a new PR.
