

Web-Scraping AI Bots Cause Disruption For Scientific Databases and Journals (nature.com) 36
Automated web-scraping bots seeking training data for AI models are flooding scientific databases and academic journals with traffic volumes that render many sites unusable. The online image repository DiscoverLife, which contains nearly 3 million species photographs, started receiving millions of daily hits in February this year that slowed the site to the point that it no longer loaded, Nature reported Monday.
The surge has intensified since the release of DeepSeek, a Chinese large language model that demonstrated effective AI could be built with fewer computational resources than previously thought. This revelation triggered what industry observers describe as an "explosion of bots seeking to scrape the data needed to train this type of model." The Confederation of Open Access Repositories reported that more than 90% of 66 surveyed members experienced AI bot scraping, with roughly two-thirds suffering service disruptions. Medical journal publisher BMJ has seen bot traffic surpass legitimate user activity, overloading servers and interrupting customer services.
no way (Score:1)
AI is the future and makes all our lives better. Why so much hate?
Re: (Score:2)
Maybe via the Broken Window Theory of economics. The anti-scrape bots will need to use AI to get around the scrapers' source-spoofing tricks, creating a never-ending cat-and-mouse escalation where AI experts on both sides make a buck.
It's like the military-industrial complex: they get rich by encouraging our leaders to moon dictators, and their counterparts on the other side are doing the same.
Re: (Score:2)
AI is software. The company officers and scientists who own and deploy the software to scrape scientific and university databases and cause denial of service attacks should be in jail, awaiting trial. If they elect to kill themselves while in jail, so be it.
Re: no way (Score:2)
Classism is a nasty social construct that has worked its way into almost everything. Differences in social standing and political power almost guarantee it comes about. In America, it is mainly determined by wealth, occupation, and race, but there are other factors too. The most important thing to remember about classism when visiting America: almost everyone is in denial that it exists, and nobody wants to be shown the truth (because that's too "woke", so go back to sleep, America).
Really? (Score:2)
So, like Googlebot, Bingbot, YandexBot, Baiduspider, DuckDuckBot, Sogou Spider, Exabot, MojeekBot, Qwantify, AhrefsBot, SemrushBot, DotBot, Censysbot, PetalBot, Gigabot, MJ12bot, Bytespider (by ByteDance), Applebot (for Siri and Spotlight), NeevaBot (defunct but crawled while active), SeznamBot...
Re:Really? (Score:5, Informative)
No, this is actually different.
They siphon up everything, evading any attempt to restrict them. Search engines have to be wary of indexing useless stuff, these don't.
It's a real problem for internet infrastructure.
"If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic."
https://drewdevault.com/2025/0... [drewdevault.com]
Re: (Score:2)
I've got quite a few user agents from pre-LLM times that get only a 403. Most of them are for "SEO insights" and similar things, which apparently need to compare robots.txt-excluded pages against the competition's SEO score. Of course they also use their own website as a spam referer.
Re: (Score:3)
Generally those traditional crawlers are well-behaved, and will follow the instructions given in robots.txt, though not all follow suggestions like crawl-delay. And if not, they tend to originate from fixed source IP addresses which can be blocked or throttled by the site operator or their CDN.
Back in 2020 the IETF released a draft document, "RateLimit Header Fields for HTTP [github.com]", defining rate-limit response headers that well-behaved clients should respect.
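For illustration only, a minimal sketch (Python/Flask, hypothetical endpoint, not taken from the draft) of a server advertising its limits via RateLimit-* response headers and answering 429 once a client exceeds them. The header names follow an earlier draft revision; later revisions consolidate them into RateLimit and RateLimit-Policy.

import time
from collections import defaultdict
from flask import Flask, request, make_response

app = Flask(__name__)

LIMIT = 60    # requests allowed per window
WINDOW = 60   # window length in seconds
hits = defaultdict(list)   # client IP -> timestamps of recent requests

@app.route("/data")
def data():
    now = time.time()
    ip = request.remote_addr
    # Keep only requests that are still inside the current window.
    hits[ip] = [t for t in hits[ip] if now - t < WINDOW]
    hits[ip].append(now)

    remaining = max(0, LIMIT - len(hits[ip]))
    reset = int(WINDOW - (now - hits[ip][0]))

    body, status = ("payload", 200) if remaining > 0 else ("slow down", 429)
    resp = make_response(body, status)
    resp.headers["RateLimit-Limit"] = str(LIMIT)
    resp.headers["RateLimit-Remaining"] = str(remaining)
    resp.headers["RateLimit-Reset"] = str(reset)
    if status == 429:
        resp.headers["Retry-After"] = str(reset)
    return resp

Of course this only helps against clients that actually read the headers; the bots in the article ignore them.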
Re: (Score:3)
Classical search engines fetch HTML. These new bots attempt to download all 3 million images at maximum resolution.
AI and Nostalgia! (Score:2)
Look! AI has reproduced the Slashdot effect! Something that's been mostly unheard of for at least a decade!
Aww... all the good feels of days gone by. What's old is new again.
Re: (Score:2)
So true! There's even a CmdrInChiefTaco again now!
LLMs are the worst (Score:2)
Re:LLMs are the worst (Score:5, Informative)
Most FLOSS projects have set up Anubis: https://anubis.techaro.lol/ [techaro.lol]
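For the curious, here's a minimal standalone sketch (Python, with an arbitrary difficulty setting) of the proof-of-work idea behind tools like Anubis; it is not Anubis's actual protocol. The server issues a random challenge, and the client must find a nonce whose SHA-256 digest of challenge:nonce has enough leading zero bits before it gets the page. That's cheap for one human page view, expensive at scraper volume.

import hashlib
import secrets

DIFFICULTY = 18   # leading zero bits required; tune to taste

def make_challenge() -> str:
    return secrets.token_hex(16)

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def verify(challenge: str, nonce: int) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY

def solve(challenge: str) -> int:
    # What a legitimate client does, typically in browser JavaScript.
    nonce = 0
    while not verify(challenge, nonce):
        nonce += 1
    return nonce

if __name__ == "__main__":
    c = make_challenge()
    n = solve(c)
    print(c, n, verify(c, n))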
Re: (Score:1)
Mass holes (Score:1)
Why is it so easy in internet- and phone-land to spoof the source? Our infrastructure is focked up; it should just not be that easy. Send cruise missiles up cheaters' asses, send a message. And/or make a better standard.
Re: (Score:2)
I happen to find that an interesting discussion, but it's not relevant here.
They aren't spoofing; they're just not identifying themselves. That's not something you can solve at the infrastructure level: a large IP-pool owner can spam you with requests from a million IPs without any spoofing. Only a legal obligation to identify themselves could help, but that would be hard to implement and not without side effects.
Re: (Score:1)
> a large IP pool owner can spam you with requests from a million IPs without any spoofing.
Aren't owners required to publicly register their IP blocks?
If there is lots of traffic from a single owner, it can be throttled.
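A hedged sketch of what that throttling could look like (Python; the /24 prefix is a crude stand-in for "same owner" — a real setup would map IPs to ASNs from the public registration data using an offline database):

import ipaddress
import time
from collections import defaultdict

RATE = 10.0     # tokens refilled per second, per network
BURST = 100.0   # bucket capacity
buckets = defaultdict(lambda: [BURST, time.monotonic()])   # prefix -> [tokens, last refill]

def allow(ip: str) -> bool:
    # Bucket by /24 so a flood spread across one owner's block shares a single limit.
    prefix = str(ipaddress.ip_network(f"{ip}/24", strict=False))
    tokens, last = buckets[prefix]
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)
    if tokens < 1.0:
        buckets[prefix] = [tokens, now]
        return False        # throttle: answer 429 or drop
    buckets[prefix] = [tokens - 1.0, now]
    return True

# 200 rapid requests from the same /24 share one bucket: roughly the first 100 pass.
print(sum(allow("203.0.113.7") for _ in range(200)))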
Re: (Score:3)
It can be throttled, but there's a lot of ways to get IPs. IP leasing, cloud/hosting providers, plain old botnets.
Until there is some major scorched earth "we don't care about innocent bystanders" blacklist for AI scrapers, this won't get solved. As an individual webmaster, you will always be one step behind.
Re: (Score:2)
It is not easy. You can spoof the source, but then you won't get the response. What you see here is people/companies/who-knows-who renting cloud servers, and you can't see who's using an AWS instance when your site is hit by an AWS IP.
Aaron Swartz must be rolling in his grave... (Score:2)
Re: Aaron Swartz must be rolling in his grave... (Score:2)
Not quite. He was trying to spread knowledge. AI takes that knowledge and uses it as a template to make its predictive text appear true, whether it is or not.
Our current state of affairs demonstrates the problem of making knowledge proprietary and lies free.
Re: (Score:1)
Dead Internet Theory Again (Score:2)
You can't index it, because it's growing like a cancer, so you can't search it with any authority. And since most content now is generated just to get you to land on Google text-ad farms, it's all garbage anyway. The source material is garbage, the generated material is garbage.
Most of the traffic is bots. The I-net is finished.
Pack it up and move on folks. Nothing to see here.
So idiotic (Score:3)
Re: (Score:2)
"The online image repository DiscoverLife, which contains nearly 3 million species photographs"
How many queries do you think it takes to download all of those?
Re: (Score:1)
"The online image repository DiscoverLife, which contains nearly 3 million species photographs"
How many queries do you think it takes to download all of those?
At least one query to start with: an email or other message to the person/group that runs DiscoverLife asking them the preferred way to mirror the website.
DiscoverLife may reply with "just suck everything down, but please do it slowly as a courtesy to other users" or "send us $10 and we'll ship it to you on a USB stick", or better yet, DiscoverLife might enter into some kind of arrangement for the requesting party to become an official mirror site.
Re: (Score:1)
Did they pay so much for the Nvidia racks that there's no money left to pay the IP holders? Maybe it'll shine a light on these publishing companies locking publicly funded research behind middlemen that add no benefit but somehow weaseled their way into the revenue stream.
Scrapers should be classed as malware (Score:2)
Registration for "Expensive" content (Score:2)
Stop allowing unregistered users to access even slightly [computationally] expensive content... Anything uncached, really.
Institute a delay and possibly an additional verification requirement before users can view the most expensive content.
Anything everyone can see should be aggressively cached.
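A minimal sketch of that split (hypothetical Flask app, endpoint names invented for illustration): public, cacheable pages get aggressive Cache-Control so a CDN absorbs repeat hits, while the expensive endpoint refuses anyone without a registered session.

from functools import wraps
from flask import Flask, session, abort, make_response

app = Flask(__name__)
app.secret_key = "change-me"   # placeholder

def registered_only(view):
    @wraps(view)
    def wrapper(*args, **kwargs):
        if not session.get("user_id"):
            abort(401)   # unregistered users never reach the expensive code path
        return view(*args, **kwargs)
    return wrapper

@app.route("/species/<name>")
def species_page(name):
    # Same content for everyone: let shared caches serve it.
    resp = make_response(f"static page for {name}")
    resp.headers["Cache-Control"] = "public, max-age=86400"
    return resp

@app.route("/export/full-resolution/<image_id>")
@registered_only
def full_resolution(image_id):
    # Uncached, bandwidth- and CPU-expensive endpoint.
    return f"large image {image_id}"

A per-account delay or extra verification step for the most expensive content would sit naturally inside registered_only.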
Pestilence (Score:4, Interesting)
These badly-behaved bots, mostly from China, are a scourge on the Internet. I have a self-hosted gitea instance and I had to password-protect it to stop the bots from eating all my bandwidth, even after I banned huge swaths of the IPv4 space.
Re: Pestilence (Score:2)