The “client-side problems” Siebenmann is talking about are the various anti-bot measures (CAPTCHAs, rate limiters, etc.) that operators put in place that make the end user experience worse. Operators feel that they have no choice but to keep their servers available, thus they “don’t care”.
He makes a statement in an earlier article that I think sums things up nicely:
> One thing I've wound up feeling from all this is that the current web is surprisingly fragile. A significant amount of the web seems to have been held up by implicit understandings and bargains, not by technology. When LLM crawlers showed up and decided to ignore the social things that had kept those parts of the web going, things started coming down all over the place.
This social contract is, to me, built around the idea that a human will direct the operation of a computer in real time (largely by using a web browser and clicking links), but I think that approach is an extremely inefficient use of both the computer's and the human's resources (CPU and time, respectively). The promise of technology should not be to put people behind desks staring at a screen all day, so this evolution toward automation must continue.
I do wonder what the new social contract will be: Perhaps access to the majority of servers will be gated by micropayments, but what will the “deal” be for those who don’t want to collect payments? How will they prevent abuse while keeping access free?
[1] “The current (2025) crawler plague and the fragility of the web”, https://utcc.utoronto.ca/~cks/space/blog/web/WebIsKindOfFrag...
> The “client-side problems” Siebenmann is talking about are the various anti-bot measures (CAPTCHAs, rate limiters, etc.)
Directly from the article:
> it's not new, and it goes well beyond anti-crawler and anti-robot defenses. As covered by people like Alex Russell, it's routine for websites to ignore most real world client side concerns (also, and including on desktops). Just recently (as of August 2025), Github put out a major update that many people are finding immensely slow even on developer desktops.
The things he links to are about things that are unrelated to anti-bot measures.
The fact is, the web is an increasingly unpleasant place to visit. Users are subject to terrible UX – dark patterns, tracking, consent popups, ads everywhere, etc.
Then along come chatbots and when somebody asks about something, they are given the response on the spot without having to battle their way through all that crap to get what they want.
Of course users are going to flock to chatbots. If a site owner is worried they are losing traffic to chatbots, perhaps they should take a long, hard look at what kind of user experience they are serving up to people.
This is like streaming media all over again. Would you rather buy a legit DVD and wait for it to arrive in the post, then wait through an unskippable lecture about piracy, then wait through unstoppable trailers, then find your way through a weird, horrible DVD menu… or would you rather download it and avoid all that? The thing that alleviated piracy was not locking things down even more, it was making the legitimate route more convenient.
We need to make websites pleasant experiences again, and we can’t do that when we care about everything else more than the user experience.
The chat bot operator slurps all websites and gives answers to all questions free of charge.
No other website can compete with that.
The whole story with streaming media is not just that pay streaming became more convenient. It’s also that content creators used legal and business mechanisms to make piracy inconvenient. They shut down Napster. They send DMCA notices. They got the DMCA enacted. They got YouTube working for them by serving ads with their content and thus monetizing it.
Chat bots are just like Napster. They’re free-riding off the content others worked to create. Just like with Napster, making websites more convenient will be only part of the answer.
> content creators used legal and business mechanisms to make piracy inconvenient
Copyright holders, not content creators. Though content creators are typically also copyright holders, copyright holders are not always content creators, especially in this context. To a large degree these practices are not on behalf of content creators, nor are they helping them.
The solution may be elsewhere: starting with creating content that people actually care about.
Another implicit social contract is that you can tell whether a request is coming from a commercial or non-commercial source based on the originating ISP. This was always a heuristic but it was more reliable in the past.
If 1000 AWS boxes start hammering your API you might raise an eyebrow, but 1000 requests coming from residential ISPs around the world could be an organic surge in demand for your service.
Residential proxy services break this - which has been happening on some level for a long time, but the AI-training-set arms race has driven up demand and thus also supply.
It's quite easy to block all of AWS, for example, but it's less easy to figure out which residential IPs are part of a commercially-operated botnet.
> it's less easy to figure out which residential IPs are part of a commercially-operated botnet
Is the client navigating the site faster than humanly possible? It's a bot. This seems like a simple test.
> 1000 requests coming from residential ISPs around the world could be an organic surge
But probably isn't.
> This seems like a simple test.
Not when the singular bot has a pool of millions of IPs to originate each request from.
If you think there's an easy solution here, productize it and make billions.
IPs that you've never seen before that hit a single random page deep within your site are either bots or first-time followers of a search engine link. Grey-list them and respond slowly. If they come back at normal human rates, unthrottle them.
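A minimal sketch of that grey-listing idea in Python, with an in-memory table; the thresholds and the "deep page" test are assumptions for illustration, not something from the comment above:

```python
import time

# Hypothetical tuning knobs -- real values would come from watching your own traffic.
GREYLIST_DELAY = 5.0        # seconds to stall a suspicious first-time visitor
HUMAN_MIN_INTERVAL = 2.0    # fastest plausible human gap between page loads
TRUST_AFTER_HITS = 3        # return visits at human pace before unthrottling

def is_deep(path: str) -> bool:
    # Crude stand-in for "a single random page deep within your site":
    # two or more path segments below the root.
    return path.strip("/").count("/") >= 1

class Greylist:
    def __init__(self):
        self.seen = {}  # ip -> [last_seen, human_paced_hits]

    def delay_for(self, ip: str, path: str) -> float:
        """Seconds to stall this request; 0.0 means serve it normally."""
        now = time.monotonic()
        entry = self.seen.get(ip)

        if entry is None:
            if is_deep(path):
                # Never-seen IP landing straight on a deep page: grey-list it.
                self.seen[ip] = [now, 0]
                return GREYLIST_DELAY
            # First hit looks like a normal entry point: trust it immediately.
            self.seen[ip] = [now, TRUST_AFTER_HITS]
            return 0.0

        last, hits = entry
        if now - last >= HUMAN_MIN_INTERVAL:
            hits += 1  # another visit at a believable human pace
        self.seen[ip] = [now, hits]
        return 0.0 if hits >= TRUST_AFTER_HITS else GREYLIST_DELAY

# In a request handler: time.sleep(greylist.delay_for(client_ip, request_path))
```

In production the table would need eviction and the knobs would need tuning, but the state kept per IP is tiny compared to serving a CAPTCHA or a JavaScript challenge.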
The only possible contract moving forward will be something along the lines of agreeing to pay for access. I can't imagine how this will work, but with the collapse of the ad-supported Internet, that's the only way forward. It's not a straightforward problem to solve because people are not willing to pay for something before they see it, but after seeing the page, the incentive to pay is gone.
The Internet and Web were both designed with the assumption of cooperation. I wonder what they would have built if they'd taken hostility into account from day one.
As we've seen, security is really hard to build in after the fact. It has to be part of your design concept from the very first and pervade every other decision you make. If you try to layer security on top, you will lose.
Of course you may discover that a genuinely secure system is also unusably inconvenient and you lose to someone willing to take risks, and it's all moot.
> > One thing I've wound up feeling from all this is that the current web is surprisingly fragile. A significant amount of the web seems to have been held up by implicit understandings and bargains, not by technology.
This is something I've been pondering, and honestly I feel like the author doesn't go far enough. I would go as far as to say a lot of our modern society has been held up by these implicit social contracts. But nowadays we see things like gerrymandering in the US, or overusing the 49-3 in France to pass laws despite the parliament voting against them. Just an overall trend of only feeling constrained by the exact letter of the law and ignoring the spirit of it.
Except it turns out those implicit understandings that you just shouldn't do that existed precisely because breaking them makes life shittier for everyone, and that's what we're experiencing now.
The author is suggesting that websites care more about server-side issues than client-side issues, to the point that they don't realize when users stop using them.
I think that statement is way too strong and obviously not true of businesses. It might be true of hobbyist websites where the creator is personally more interested in the server side, but it's definitely not true of professional websites.
Professional websites that have enough of a budget to care about the server side will absolutely care about the client side and will track usage. If 10% fewer people used the website, the analytics would show that and there would be a fire drill.
What I can agree with the author on is a more nuanced point. Client-side problems are a lot harder and have a very long tail due to unique client configurations (OS, browser, extensions, physical hardware). With thousands of combinations, you end up with some wild and rare issues. It becomes hard to chase all of them down, and some you just have to ignore.
This can lead to it feeling like websites don't care about the client side, but really it just shows that the client side is hard.
> I think that statement is way too strong and obviously not true of businesses
Amazon.com Inc is currently worth 2.4 billion dollars and the only reason is that most businesses insist on giving their customers the worst online experience possible. I wish that I could one day understand the logic, which goes like this:
1. Notice that people are on their phones all the time.
2. And notice that when people are looking to buy something, they first go to the computer or the smartphone.
3. Therefore let's make the most godawful experience on our website possible, to make sure that our potential customers hate us and don't make a purchase.
4. Customers make their purchase on Amazon instead.
5. Profit??
Amazon is fine if you want cheap Chinese garbage products or name brands that have a decent chance of being counterfeit. I never buy there.
> most businesses insist on giving their customers the worst online experience possible.
I think you're underselling the amount of work it takes to create an experience as smooth as Amazon's.
There are many one man online businesses with very smooth and user friendly customer experience. Does every business have to be as smooth as Amazon? No. But that's not an argument for giving up completely.
Trillion.
> Amazon.com Inc is currently worth 2.4 billion dollars and the only reason is that most businesses insist on giving their customers the worst online experience possible
Huh?
Not the GP, but from my own experience: some businesses use out-of-the-box online shop software that is not very good. I wouldn't say "most", but if you're buying certain niche products it becomes true. Slow pages, abysmal usability... one pet peeve is shops that offer a brand filter in the left column with checkboxes. I want to select three brands and, every time I tick a checkbox, the page reloads. Reloading is painfully slow, so I need a minute just to get to the search. If I want to do several searches, it's too much time.
Also, at least in Spain, some delivery companies are awful. I have a package sitting at a convenience store right now. They refuse to give it to me because I don't have the delivery key; the courier never sent it to me. I try to get assistance on their website... and they ask me for the very key I want them to give me. Nice, huh?
I asked the shop for a refund. They have ghosted me in the chat, their return form doesn't work, their email addresses are no-reply, and the contact form doesn't work either. Now I need to wait until Monday to phone them.
I know the shop is legit. They're just woefully incompetent and either don't know it or think that's the way things work.
For cheap and moderately priced products, Amazon just works. No "but I went to your house and there was nobody there" bullshit. No-questions-asked return policy.
> Amazon.com Inc is currently worth 2.4 billion dollars and the only reason is that most businesses insist on giving their customers the worst online experience possible.
This is an incredibly reductive view of how Amazon came to dominate online retail. If you genuinely believe this, I would strongly urge you to research their history and understand how they became the monopoly they are today.
I assure you, it's not primarily because they care more about the end user's experience.
It's just an example, and it holds true even if it's reductive. If businesses put just 5% of the effort into their online experience that they put into their physical stores or social media campaigns, they would see massive returns on that effort.
Respectfully, this argument reads like it is completely ignorant of the e-commerce landscape over the past 30 years and how much Amazon has shaped and innovated in the space. Not to mention that today they have several verticals beyond e-commerce that make up their valuation.
Okay, go on and count only half for the sake of argument. That's still over a trillion. Any business can do what Amazon does for their products and their customers. But they don't and they won't. Those who do see great advantages.
> Any business can do what Amazon does for their products and their customers
No, they can't, as evidenced by not everyone else in e-commerce doing that.
What kind of argument is that? Not doing something is not evidence that it is impossible to do.
I...don't have this experience. It doesn't hold true for me, and I suspect I am not alone. There are certainly some online stores that are not very great, but by and large, I just don't have problems with them. I prefer the seller's website over Amazon.
Amazon, on the other hand, is plagued with fake or bad products from copycat sellers. I have no idea what I am going to get when I place an order. Frankly, I'm surprised when I get the actual thing I ordered.
A couple of years back I tried to buy some parts on Digi-Key and literally could not get the checkout to work without completely disabling NoScript (assuming even that would have helped). They had something like a dozen third-party tracking scripts. Eventually I gave up and used Amazon.
It's still the case today, in 2025. When I bought a Focusrite 18i20 mixer from Sweetwater that turned out to be defective, I had to spend a week in a lengthy, badly delayed back-and-forth with their support department, convincing them that the unit was in fact defective and that I was using it correctly, before finally getting the prized RMA to return it. Had I bought it from Amazon, I would have received the original package more quickly, and when it proved defective I could have boxed it up and shipped it from any local shipper that same day, no emails or phone calls required, with a replacement arriving the next day. Amazon, even as the leader in "enshittification", still offers a dramatically better experience for a wide range of products (though certainly not all of them).
How real is this "crawler plague" that the author refers to? I haven't seen it. But that's just as likely to because I don't care, and therefore am not looking, as it is to be because it's not there. Loading static pages from CDN to scrape training data takes such minimal amounts of resources that it's never going to be a significant part of my costs. Are there cases where this isn't true?
The following is the best I could collect quickly to back up that statement. Unfortunately it's not the high-quality, first-hand raw statistics I would have liked.
But from what I have read from time to time, the crawlers acted orders of magnitude outside of what could be excused as merely bad configuration.
https://herman.bearblog.dev/the-great-scrape/
https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
https://lwn.net/Articles/1008897/
https://tecnobits.com/en/AI-crawlers-on-Wikipedia-platform-d...
https://boston.conman.org/2025/08/21.1
My forum traffic went up 10x due to bots a few months ago. Never seen anything like it.
> Loading static pages from CDN to scrape training data takes such minimal amounts of resources that it's never going to be a significant part of my costs. Are there cases where this isn't true?
Why did you bring up static pages served by a CDN, the absolute best case scenario, as your reference for how crawler spam might affect server performance?
Not OP, but many technologies nowadays push users to use a server-side component when not needed.
An example is NextJS where you're strongly encouraged[0] to run a server (or use a platform like Vercel), even if what you're doing is a fairly simple static site.
Combine an inconsiderate crawler (AI or otherwise) with server-side logic that doesn't really need to be there and you have a recipe for a crash, a big hosting bill, or both.
[0] People see https://nextjs.org/docs/app/guides/static-exports#unsupporte... and go "ah shucks I better have a server component then"
My estimate is that at least 70% of the traffic on small sites (300-3,000 daily views) is not human.
Yes, it’s true. Most sites don’t have a forever cache TTL so a crawler that hits every page on a database-backed site is going to hit mostly uncached pages (and therefore the DB).
I also have a faceted search that some stupid crawler has spent the last month iterating through. Also mostly uncached URLs.
Yeah, or an events plugin where spiders walk through every single day of several years...
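To make the cache-TTL point above concrete, here is a minimal Flask sketch; Flask, the route names, and the header values are my assumptions, not anything the commenters described. A long s-maxage lets a CDN or reverse proxy absorb repeat crawler hits instead of the database, while the faceted-search route is marked as not worth crawling:

```python
from flask import Flask, Response

app = Flask(__name__)

@app.route("/article/<slug>")
def article(slug):
    body = f"<h1>{slug}</h1>"  # imagine an expensive DB-backed render here
    resp = Response(body)
    # Browsers may keep it for an hour; a CDN / reverse proxy may keep it for a day.
    resp.headers["Cache-Control"] = "public, max-age=3600, s-maxage=86400"
    return resp

@app.route("/search")
def faceted_search():
    # Faceted search explodes into near-infinite uncacheable URL combinations;
    # at minimum, tell well-behaved crawlers these pages aren't worth indexing.
    resp = Response("results...")
    resp.headers["X-Robots-Tag"] = "noindex, nofollow"
    resp.headers["Cache-Control"] = "private, no-store"
    return resp
```

Of course this only helps against crawlers that respect caches and robots hints; the badly behaved ones described in this thread are exactly the ones that don't, which is what pushes people toward the blocking approaches below.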
I just had to purchase a Cloudflare account to protect two of my sites used for CI that run Jenkins and Gerrit servers. These are resource-hungry Java VMs which I run on a minimally powered server, as they are intended to be accessed only by a few people, yet crawlers located in Eastern Europe and Asia eventually found them and would regularly drive my CPU up to 500% and make the server unavailable (it should go without saying that I have always had a robots.txt on these sites that prohibits all crawling; such files are a quaint relic of a simpler time).
For a couple of years I'd block the various offending IPs, but this past month the crawling resumed, this time deliberately swarmed across hundreds of IP addresses so that I could not easily block them. Cloudflare was able to show me within minutes that the entirety of the IP addresses came from a single ASN owned by a very large and well-known Chinese company, and I blocked the entire ASN. While I could figure out these ASNs manually and add blocklists to my Apache config, Cloudflare makes it super easy, showing you the whole thing happening in realtime. You can even tailor the 403 response to send them a custom message, in my case: "ALL of the data you are crawling is on github! Get off these servers and go get it there!" (again, sure, I could write out httpd config for all of that, but who wants to bother). They are definitely providing a really critical service.
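For anyone who does want to do this by hand in an application instead of through Cloudflare, a rough Python sketch of the same idea; the prefixes below are documentation/test ranges standing in for whatever the offending ASN actually announces, which you would have to look up and keep current yourself:

```python
import ipaddress

# Placeholder prefixes (RFC 5737 test ranges) standing in for the ASN's real announcements.
BLOCKED_NETS = [ipaddress.ip_network(n) for n in (
    "192.0.2.0/24",
    "198.51.100.0/24",
)]

CUSTOM_403 = ("ALL of the data you are crawling is on github! "
              "Get off these servers and go get it there!")

def check_client(ip_str: str):
    """Return (status, body) for a request from ip_str."""
    addr = ipaddress.ip_address(ip_str)
    if any(addr in net for net in BLOCKED_NETS):
        return 403, CUSTOM_403
    return 200, None  # hand off to the real handler

print(check_client("198.51.100.7"))  # -> (403, 'ALL of the data you are crawling ...')
print(check_client("203.0.113.9"))   # -> (200, None)
```

The check itself is trivial; the hard part the commenter is paying Cloudflare for is seeing the traffic grouped by ASN in realtime and keeping the prefix list up to date as the crawler moves.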
> intended to be accessed only by a few people
So why are they open to the entire world?
They're open to people who contribute PRs so they can see why their tests failed. Also, htdigest / htpasswd access is complicated or impossible (depending on the use case) to configure alongside the way Jenkins / Gerrit authentication itself works, particularly with internal scripts and hooks that need to communicate with them.
Particularly if your users are keen on solving recaptchas over and over.
How many users do you think are on the poster’s Jenkins/CI system? Sounded like a personal thing or maybe a small team, I didn’t get the impression it was supposed to be public.
The poster ends with a general comment on the usefulness of Cloudflare.
It's very real. It's crashed my site a number of times.
It's funny because I can no longer view his website's pages, since he implemented a broken-by-default "anti-bot" system that blocks anything that's not a corporate browser from $latestyear.
> You're using a suspiciously old browser
> You're probably reading this page because you've attempted to access some part of my blog (Wandering Thoughts) or CSpace, the wiki thing it's part of. Unfortunately you're using a browser version that my anti-crawler precautions consider suspicious, most often because it's too old (most often this applies to versions of Chrome). Unfortunately, as of early 2025 there's a plague of high volume crawlers (apparently in part to gather data for LLM training) that use a variety of old browser user agents, especially Chrome user agents. To reduce the load on Wandering Thoughts I'm experimenting with (attempting to) block all of them, and you've run into this.
> If this is in error and you're using a current version of your browser of choice, you can contact me at my current place at the university (you should be able to work out the email address from that). If possible, please let me know what browser you're using and so on, ideally with its exactl User-Agent string.
He then makes a bunch of hints about figuring out his email address, without my being able to actually load any pages on his site. I was unable to solve this riddle to email him my User-Agent string.
I don't really get what this article is talking about, nor the distinctions it's trying to draw between server and client. It brings up multiple different things, from CAPTCHAs to actual client performance, so it's not clear what "problems" means in either the title or TFA.
The author needs to open with a paragraph that establishes better context. They open with a link to another post of theirs about anti-LLM defenses, but it doesn't clarify what they mean when they compare server-side problems with client-side problems.
It's not that website owners don't care that they're frustrating users, losing visitors and customers, or creating a poor experience. It's that it is an intractable problem for most website owners to combat the endless ways their sites are being botted and bogged down, and to pay for the resources to handle the 98% of their traffic that isn't coming from real users and customers. By all means, solve it and everyone will be happy.
Heh. Who asked those website owners to have laggy scrolling and non-existent contrast, to hijack my back button, and generally to run so much JavaScript that a cluster is needed client-side just to display a three-line LLM-generated blog post?
To be fair, the sites resorting to extreme anti-bot measures are also often not the ones that are a clientside JavaScript hell.
Thinking of the most extreme option (throwing proof-of-work checks at browsers), the main sites that jump to mind are sourcehut, the Linux Kernel Archives and so on, and the admins of all of those sites have noted that the traffic they get is far outside of expectations[0]. Not whatever blogspam ended up at the top of Google search that day.
The badly designed sites are often the ones that don't care about their bandwidth anyway.
[0]: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
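For readers who haven't met the proof-of-work option mentioned above: the server hands the browser a random challenge and only serves the page once the client has burned some CPU finding a nonce whose hash meets a difficulty target, hashcash-style. A toy Python sketch of the mechanism (the difficulty and encoding are arbitrary choices for illustration):

```python
import hashlib
import secrets

DIFFICULTY_BITS = 20  # arbitrary; higher means more client CPU per page view

def make_challenge() -> str:
    # Server side: generate (and remember) a random challenge per visitor.
    return secrets.token_hex(16)

def meets_target(challenge: str, nonce: str) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    # The top DIFFICULTY_BITS bits of the hash must be zero.
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

def solve(challenge: str) -> str:
    # Client side (normally JavaScript in the browser): brute-force a nonce.
    n = 0
    while not meets_target(challenge, str(n)):
        n += 1
    return str(n)

challenge = make_challenge()
nonce = solve(challenge)               # ~2**20 hashes for the client, on average
assert meets_target(challenge, nonce)  # exactly one hash for the server to verify
```

The asymmetry is the whole point: verification costs the server one hash, while a crawler fleet has to pay the solving cost again for every page it wants.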
This. It seems every website these days needs JavaScript enabled just to load static content that could have been loaded between the time I hovered over a link and the time I clicked it.