AI Science

Web-Scraping AI Bots Cause Disruption For Scientific Databases and Journals (nature.com) 36

Automated web-scraping bots seeking training data for AI models are flooding scientific databases and academic journals with traffic volumes that render many sites unusable. The online image repository DiscoverLife, which contains nearly 3 million species photographs, began receiving millions of hits per day in February this year, slowing the site to the point that it no longer loaded, Nature reported Monday.

The surge has intensified since the release of DeepSeek, a Chinese large language model that demonstrated effective AI could be built with fewer computational resources than previously thought. This revelation triggered what industry observers describe as an "explosion of bots seeking to scrape the data needed to train this type of model." The Confederation of Open Access Repositories reported that more than 90% of 66 surveyed members experienced AI bot scraping, with roughly two-thirds suffering service disruptions. Medical journal publisher BMJ has seen bot traffic surpass legitimate user activity, overloading servers and interrupting customer services.


Comments Filter:
  • AI is the future and makes all our lives better. Why so much hate?

    • by Tablizer ( 95088 )

      Maybe via the Broken Window Theory of economics. The anti-scrape bots will need to use AI to get around the scrapers' source-spoofing tricks, creating a never-ending cat-and-mouse escalation where AI experts on both sides make a buck.

      It's like the military-industrial complex: they get rich by encouraging our leaders to moon dictators, and their counterparts on the other side are doing the same.

    • No hate. Just consistency.

      AI is software. The company officers and scientists who own and deploy the software to scrape scientific and university databases and cause denial of service attacks should be in jail, awaiting trial. If they elect to kill themselves while in jail, so be it.

  • So, like Googlebot, Bingbot, YandexBot, Baiduspider, DuckDuckBot, Sogou Spider, Exabot, MojeekBot, Qwantify, AhrefsBot, SemrushBot, DotBot, Censysbot, PetalBot, Gigabot, MJ12bot, Bytespider (by ByteDance), Applebot (for Siri and Spotlight), NeevaBot (defunct but crawled while active), SeznamBot...

    • Re:Really? (Score:5, Informative)

      by serafean ( 4896143 ) on Monday June 02, 2025 @02:05PM (#65422773)

      No, this is actually different.
      They siphon up everything, evading any attempt to restrict them. Search engines have to be wary of indexing useless stuff; these bots don't.
      It's a real problem for internet infrastructure.

      "If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic."
      https://drewdevault.com/2025/0... [drewdevault.com]

      • by allo ( 1728082 )

        I've got quite a few user agents from pre-LLM times that only ever get a 403. Most of them are for "SEO insights" and similar services, which apparently need to compare robots.txt-excluded pages against the competition's SEO scores. Of course, they also show up with their own website as a spam referrer.
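
        For anyone curious, blocking by user agent is only a dozen lines. Here's a rough Python/WSGI sketch (untested, and the blocked names are just examples pulled from this thread, not a recommendation), though it obviously only catches bots honest enough to identify themselves:

```python
# Untested sketch: a tiny WSGI app that serves 403 to user agents on a
# blocklist. The agent names below are illustrative examples only.
from wsgiref.simple_server import make_server

BLOCKED_AGENTS = ("AhrefsBot", "SemrushBot", "MJ12bot")  # illustrative list

def app(environ, start_response):
    ua = environ.get("HTTP_USER_AGENT", "")
    if any(bot.lower() in ua.lower() for bot in BLOCKED_AGENTS):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```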

    • by Nonesuch ( 90847 )

      Generally those traditional crawlers are well-behaved and will follow the instructions given in robots.txt, though not all follow suggestions like crawl-delay. Even when they don't, they tend to originate from fixed source IP addresses that can be blocked or throttled by the site operator or their CDN.

      Back in 2020 the IETF released a draft document, "RateLimit Header Fields for HTTP [github.com]", defining rate-limit headers that well-behaved clients should respect.
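
      For reference, here's roughly what "well-behaved" looks like from the crawler side in a few lines of Python's standard library (a sketch, not production code; the site, user agent, and paths are placeholders): check robots.txt, skip disallowed URLs, and honor any Crawl-delay.

```python
# Sketch of a polite crawler: consult robots.txt, skip disallowed URLs,
# and honor any Crawl-delay. Site, user agent, and paths are placeholders.
import time
import urllib.error
import urllib.request
import urllib.robotparser

SITE = "https://example.org"           # placeholder site
USER_AGENT = "ExampleResearchBot/1.0"  # placeholder, identifies the crawler

rp = urllib.robotparser.RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()

delay = rp.crawl_delay(USER_AGENT) or 10  # fall back to a conservative pause

for path in ("/species/1", "/species/2"):  # placeholder paths
    url = SITE + path
    if not rp.can_fetch(USER_AGENT, url):
        continue  # robots.txt says no: skip it instead of fetching anyway
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(req) as resp:
            resp.read()
    except urllib.error.URLError:
        pass  # placeholder URLs won't exist; a real crawler would log this
    time.sleep(delay)  # respect the site's requested pacing
```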

    • Classical search engines fetch HTML. These new bots attempt to download all 3 million images at their maximum resolution.

  • Look! AI has reproduced the Slashdot effect! Something that's been mostly unheard of for at least a decade!

    Aww... all the good feels of days gone by. What's old is new again.

    • Look! AI has reproduced the Slashdot effect! Something that's been mostly unheard of for at least a decade! Aww... all the good feels of days gone by. What's old is new again.

      So true! There's even a CmdrInChiefTaco again now!

  • My Lemmy instance recently started getting hit with LLM bots trawling for data from legitimate users (most of whom are anti-LLM, which presumably makes for good-quality training data), and it sucks up so much bandwidth. They don't respect robots.txt either, so outside of IP blocks, there's not much we can do. Whatever small productivity boost programmers get out of using LLMs, perhaps we shouldn't destroy all of civilized society to obtain it. Ban all LLMs, please.
  • Why is it so easy in internet- and phone-land to spoof the source? Our infrastructure is focked up; it should just not be that easy. Send cruise missiles up cheaters' asses, send a message. And/or make a better standard.

    • I happen to find that an interesting discussion, but it's not relevant here.

      They aren't spoofing, they're just not identifying themselves. That's not something you can solve with infrastructure; a large IP pool owner can spam you with requests from a million IPs without any spoofing. Only a legal obligation to identify themselves could help, but that would be hard to implement and not without side effects.

      • by Tablizer ( 95088 )

        > a large IP pool owner can spam you with requests from a million IPs without any spoofing.

        Aren't owners required to publicly register their IP blocks?

        If there is lots of traffic from a single owner, it can be throttled.

        • It can be throttled, but there's a lot of ways to get IPs. IP leasing, cloud/hosting providers, plain old botnets.

          Until there is some major scorched earth "we don't care about innocent bystanders" blacklist for AI scrapers, this won't get solved. As an individual webmaster, you will always be one step behind.
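
          For what it's worth, the "throttle a single owner" idea is easy to prototype if you treat a /24 as a crude stand-in for an owner. A sketch follows (the window and limit are made up, and leased IPs, clouds, and botnets defeat it for exactly the reasons above):

```python
# Sketch: count requests per /24 prefix over a sliding window and refuse
# traffic from any prefix that blows its budget. The numbers are invented,
# and a /24 is only a crude stand-in for "one registered owner".
import ipaddress
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_PREFIX = 600   # illustrative budget per /24 per minute

_hits = defaultdict(deque)      # prefix -> timestamps of recent requests

def allow(ip: str) -> bool:
    """Return False if this IP's /24 has exceeded its request budget."""
    now = time.time()
    prefix = str(ipaddress.ip_network(ip + "/24", strict=False))
    q = _hits[prefix]
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()             # drop hits that fell out of the window
    if len(q) >= MAX_REQUESTS_PER_PREFIX:
        return False            # over budget: throttle (e.g. serve a 429)
    q.append(now)
    return True

if __name__ == "__main__":
    # The 601st request in a minute from the same /24 gets refused.
    for i in range(601):
        ok = allow("192.0.2." + str(i % 250 + 1))
    print("last request allowed:", ok)
```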

    • by allo ( 1728082 )

      It is not easy. You can spoof the source, but then you won't get the response. What you see here is people/companies/who-knows-who renting cloud servers. And you can't see who's using an AWS instance when your site is hit by an AWS IP.

  • AI is doing exactly what he attempted to do, except that he was hounded [wikipedia.org] until he took his own life. And now it's somehow all okay.
    • Not quite. He was trying to spread knowledge. AI takes that knowledge and uses it as a template to make its predictive text appear true, whether it is or not.

      Our current state of affairs demonstrates the problem of making knowledge proprietary and lies free.

    • The legal system supports the establishment and big business, which he took head on. Lobbyists and think-tank groups literally write legislation that becomes law, set up to benefit the people who wrote it. Here you have big business and the establishment doing something (scraping the web for LLMs) that upsets another big business and another establishment, so you get what we have here. It's like when the police investigate themselves and find no wrongdoing. There probably was wrongdoing, but they're not going to l
  • More bots than peeps.

    You can't index it, because it's growing like a cancer... therefore, you can't search it with any authority... but most stuff now is generated just to get you to land on it for Google text-ad farms... i.e., it's all garbage anyway. The source material is garbage, the generated material is garbage.

    Most of the traffic is bots. The I-net is finished.
    Pack it up and move on folks. Nothing to see here.
  • by bradley13 ( 1118935 ) on Monday June 02, 2025 @02:50PM (#65422903) Homepage
    There's no reason for millions of queries. How many models are being trained? These are just badly behaved bots, ruining a good thing (open access) for everyone else. Tragedy of the commons.
    • "The online image repository DiscoverLife, which contains nearly 3 million species photographs"

      How many queries do you think it takes to download all of those?

      • by davidwr ( 791652 )

        "The online image repository DiscoverLife, which contains nearly 3 million species photographs"

        How many queries do you think it takes to download all of those?

        At least one query to start with: an email or other message to the person/group that runs DiscoverLife asking them the preferred way to mirror the website.

        DiscoverLife may reply with "just suck everything down, but please do it slowly as a courtesy to other users" or "send us $10 and we'll ship it to you on a USB stick" or, even better, DiscoverLife might enter into some kind of arrangement for the requesting party to become an official mirror site.

    • Why not just work out a deal with the publishing companies to get these LLMs local copies to train on? It would legitimize the process. The publishers would get paid and be happy. The LLM makers would get faster access to high-quality data.

      Did they pay so much for the Nvidia racks that there's no money left for paying the IP holders? Maybe it'll shine a light on these publishing companies locking publicly funded research behind middlemen that add no benefit but somehow weaseled their way into the revenue stream.

  • Residential ISPs should not allow scrapers to launder traffic through their networks. Microsoft, Apple, Google, antivirus vendors, and router makers should all crack down on scraping botnets the same way they crack down on other malicious traffic.
  • Stop allowing unregistered users to access even slightly [computationally] expensive content... Anything uncached, really.

    Institute a delay and possibly an additional verification requirement before users can view the most expensive content.

    Anything everyone can see should be aggressively cached.
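
    A sketch of that split in Python (the render callables, TTL, and session check are placeholders): public pages come out of a short-lived cache, and anything expensive requires a login.

```python
# Sketch of the split described above: cache what everyone can see, and
# refuse expensive, uncacheable work for anonymous visitors. The TTL,
# render callables, and session check are all placeholders.
import time

CACHE_TTL = 300                 # seconds to keep a cached public page
_cache = {}                     # path -> (timestamp, rendered body)

def public_page(path: str, render) -> str:
    """Serve a public page from cache, regenerating at most once per TTL."""
    now = time.time()
    hit = _cache.get(path)
    if hit and now - hit[0] < CACHE_TTL:
        return hit[1]
    body = render(path)         # the expensive part runs only on a miss
    _cache[path] = (now, body)
    return body

def expensive_page(path: str, session: dict, render) -> str:
    """Require a logged-in session before doing expensive, uncached work."""
    if not session.get("user"): # placeholder session check
        return "403: log in (or wait) to view this"
    return render(path)
```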

  • Pestilence (Score:4, Interesting)

    by dskoll ( 99328 ) on Monday June 02, 2025 @04:26PM (#65423119) Homepage

    These badly-behaved bots, mostly from China, are a scourge on the Internet. I have a self-hosted gitea instance and I had to password-protect it to stop the bots from eating all my bandwidth, even after I banned huge swaths of the IPv4 space.

    • Another possibility is to set up a rate limiter per IP address. Figure out what would be reasonable for a normal human surfing the site and add some padding... 50 queries a minute? (Accounting for all of the CSS, JS, HTML, fonts, etc.)
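
      Something like a per-IP token bucket does that in a few lines. A sketch (50/minute is the guess above, not a benchmark, and it does nothing about the residential-proxy swarms discussed elsewhere in the thread):

```python
# Sketch of a per-IP token bucket refilled at 50 requests per minute.
# The numbers come from the guess above; tune them for your own site.
import time

RATE_PER_MINUTE = 50.0
BUCKET_CAPACITY = 50.0

_buckets = {}   # ip -> (tokens remaining, time of last refill)

def allow_request(ip: str) -> bool:
    """Spend one token for this IP, refilling at RATE_PER_MINUTE."""
    now = time.time()
    tokens, last = _buckets.get(ip, (BUCKET_CAPACITY, now))
    tokens = min(BUCKET_CAPACITY, tokens + (now - last) * RATE_PER_MINUTE / 60)
    if tokens < 1:
        _buckets[ip] = (tokens, now)
        return False            # over the limit: serve a 429 instead
    _buckets[ip] = (tokens - 1, now)
    return True
```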

"Today's robots are very primitive, capable of understanding only a few simple instructions such as 'go left', 'go right', and 'build car'." --John Sladek
