Ads are a problematic business model, and I think your point there is kind of interesting. But AI companies disintermediating content creators from their users is NOT the web I want to replace it with.
Let’s imagine you have a content creator that runs a paid newsletter. They put in lots of effort to make well-researched and compelling content. They give some of it away to entice interested parties to their site, where some small percentage of them will convert and sign up.
They put the information up under the assumption that viewing the content and seeing the upsell are inextricably linked. Otherwise there is literally no reason for them to make any of it available on the open web.
Now you have AI scrapers, which will happily consume and regurgitate the work, sans the pesky little call to action.
I think it’s basically impossible to prevent AI crawlers. It’s like video game cheating: at the extreme, they could literally point a camera at the screen, do image processing on the feed, and talk to the computer through the USB port by emulating a mouse and keyboard from outside the machine. They don’t do that, of course, because it is much easier to do it all in software, but that is the ultimate circumvention of any attempt to block them out that doesn’t also block out humans.
I think the business model for “content creating” is going to have to change, for better or worse (a lot of YouTube stars are annoying as hell, but stuff like well-written news and educational articles falls under this umbrella too, so it is unfortunate that they will probably be impacted as well).
Cloudflare banning bad actors has at least made scraping more expensive and changed the economics of it - more sophisticated deception is necessarily more expensive. If the cost of forcing entry is high enough, scrapers might be willing to pay for access instead.
But I can imagine more extreme measures, e.g. old-school web-of-trust-style request signing[0]. I don’t see any easy way for scrapers to beat a functioning WoT system. We just don’t happen to have one of those yet.
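To make the idea concrete, here's a toy sketch (all names and structure hypothetical) of the chain-of-trust check a site could run on a signed request. A real WoT would use public-key signatures; this just models "X signed Y's key" as edges in a graph and asks whether the requester's key is reachable from keys the site already trusts:

```python
# Toy web-of-trust reachability check. "Signatures" are edges in a
# trust graph; real systems would verify actual cryptographic
# signatures at each hop.
from collections import deque

def is_trusted(signed_by: dict, roots: set, requester: str,
               max_hops: int = 4) -> bool:
    """BFS from the trusted root keys along 'X signed Y' edges;
    the requester is trusted if reachable within max_hops."""
    frontier = deque((r, 0) for r in roots)
    seen = set(roots)
    while frontier:
        key, hops = frontier.popleft()
        if key == requester:
            return True
        if hops == max_hops:
            continue  # chain too long; stop extending this branch
        for signed in signed_by.get(key, ()):
            if signed not in seen:
                seen.add(signed)
                frontier.append((signed, hops + 1))
    return False

# 'me' signed 'alice', alice signed 'bob'; nobody vouched for 'scraper'.
graph = {"me": {"alice"}, "alice": {"bob"}}
print(is_trusted(graph, {"me"}, "bob"))      # True
print(is_trusted(graph, {"me"}, "scraper"))  # False
```

The `max_hops` cutoff matters: the longer the chain, the weaker the accountability, which is exactly the property the Sybil-attack reply below this exploits.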
> Cloudflare banning bad actors has at least made scraping more expensive and changed the economics of it - more sophisticated deception is necessarily more expensive. If the cost of forcing entry is high enough, scrapers might be willing to pay for access instead.
I think this might actually point at the end state. Scraping bots will eventually emulate a person well enough to be indistinguishable (are we there yet?). Then content creators will have to price their content appropriately: have a Patreon, for example, with articles priced at the point where the creator is fine with people taking that content and adding it to a model. This is essentially similar to studios pricing their content appropriately… for Netflix to buy it and broadcast it to many streaming users.
Then they will have the problem of making sure their business model is resistant to non-paying users. Netflix can’t stop me from pointing a camcorder at my TV while playing their movies, and distributing it out like that. But, somehow, that fact isn’t catastrophic to their business model for whatever reason, I guess.
Cloudflare can try to ban bad actors. I’m not sure if it is Cloudflare specifically, but as someone who usually browses without JavaScript enabled I often bump into “maybe you are a bot” walls. I recognize that I’m weird for not running JavaScript, but eventually their filters will have the problem that the net which captures bots also captures normal people.
>Then they will have the problem of making sure their business model is resistant to non-paying users. Netflix can’t stop me from pointing a camcorder at my TV while playing their movies, and distributing it out like that. But, somehow, that fact isn’t catastrophic to their business model for whatever reason, I guess.
Interested to see some LLM-adversarial equivalent of MPAA dots![1]
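For flavor, here's a crude text analogue of those forensic dots - purely illustrative, trivially strippable, and nowhere near the robustness an LLM-adversarial watermark would actually need - hiding a bit string in zero-width Unicode characters appended to words:

```python
# Toy forensic text watermark: encode bits as zero-width characters.
# U+200B (zero-width space) = 0, U+200C (zero-width non-joiner) = 1.
# Invisible when rendered, recoverable from the raw text.
ZW0, ZW1 = "\u200b", "\u200c"

def embed(text: str, bits: str) -> str:
    """Append one zero-width marker per leading word, one per bit."""
    words = text.split(" ")
    for i, b in enumerate(bits):
        if i < len(words):
            words[i] += ZW0 if b == "0" else ZW1
    return " ".join(words)

def extract(text: str) -> str:
    """Recover the bit string by scanning for the marker characters."""
    return "".join("0" if c == ZW0 else "1"
                   for c in text if c in (ZW0, ZW1))

marked = embed("the quick brown fox jumps over", "1011")
print(extract(marked))  # "1011"
```

A scraper that normalizes Unicode destroys this immediately, which is the whole research problem: a useful watermark has to survive paraphrasing, not just copy-paste.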
Beating web of trust is actually pretty easy: pay people to trust you.
Yes, you can identify who got paid to sign a key and ban them. They will create another key, go to someone else, pretend to be someone not yet signed up for the WoT (or pay them), get their new key signed, and sign more keys for money.
So many people will agree to trust for money, and accountability will be so diffuse, that you won't be able to ban them all. Even you, a site operator, would accept enough money from OpenAI to sign their key, for a promise the key will only be used against your competitor's site.
It wouldn't take a lot to build a roughly binary tree of fake identities with exponential fanout, get some people to trust random points in the tree, and use the end nodes to access your site.
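The arithmetic on that tree is what makes it cheap: the signing work per key stays constant while the usable end-node identities grow exponentially. A quick back-of-envelope sketch:

```python
# Back-of-envelope cost of a Sybil tree of fake identities.
# Each fake key signs only two children, so banning a leaf burns one
# identity and implicates at most its short chain of parents, while
# the attacker holds exponentially many leaves.
depth = 20
signatures_per_key = 2              # work per node stays constant
leaves = 2 ** depth                 # end-node identities usable against sites
keys_total = 2 ** (depth + 1) - 1   # every key in the tree

print(leaves)       # 1048576 usable identities
print(keys_total)   # 2097151 keys minted in total
```

Twenty levels of two cheap signatures each yields over a million disposable identities, which is why diffuse accountability breaks the ban-them-all strategy.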
Heck, we even have a similar problem right now with IP addresses, and not even with very long trust chains. You are "trusted" by your ISP, who is "trusted" by one of the RIRs or by another ISP. The RIRs trust each other, and you trust your local RIR (or probably all of them). We can trace any IP to see who owns it. But is that useful, or is it pointless because all the actors involved make money off it? When we tried making IPs more identifying, all that happened was that VPN companies sprang up to make money by leasing non-identifying IPs. And most VPN exits don't show up as owned by the VPN company, because they'd be too easy to flag; they pay hosting providers to use their IPs. Sometimes they even pay residential ISPs, so you can't even go by hosting provider. The original Internet was a web of trust (represented by physical connectivity), but that's long gone.
It is inevitable - not because of some technological predestination, but because if these services get hard-blocked and unable to perform their duties, they will ship the agent as a web browser or browser add-on, just like all the VSCode forks, and then the requests will happen locally through the same pipe as the user's normal browser. It will be functionally indistinguishable from normal web traffic, since it will be normal web traffic.
> Otherwise there is literally no reason for them to make any of it available on the open web
This is the hypothesis I always personally find fascinating in light of the army of semi-anonymous Wikipedia volunteers continuously gathering and curating information without pay.
If it became functionally impossible to upsell a little information for more paid information, I'm sure some people would stop creating information online. I don't know if it would be enough to fundamentally alter the character of the web.
Do people (generally) put things online to get money, or because they want it online? And is "free" data worse quality than data you have to pay somebody for? Or is the challenge more one of curation: when anyone can put anything up for free, sorting high- from low-quality by whatever criteria becomes a new kind of challenge.
Any information that requires something approximating a full-time job worth of effort to produce will necessarily go away, barring the small number of independently wealthy creators.
Existing subject-matter experts who blog for fun may or may not stick around, depending on what part of it is “fun” for them.
While some must derive satisfaction from increasing the total sum of human knowledge, others are probably blogging to engage with readers or build their own personal brand, neither of which is served by AI scrapers.
Wikipedia is an interesting case. I still don’t entirely understand why it works, though I think it’s telling that 24 years later no one has replicated their success.
Wikipedia works for the same reason open-source does: because most of the contributors are experts in the subject and have paid jobs in that field. Some are also just enthusiasts.
OpenStreetMap is basically Wikipedia for maps and is quite successful. Over 10M registered users and millions of edits per day. Lots of information is also shared online on forums for free. The hosting (e.g. reddit) is basically a commodity that benefits from network effects. The information is the more interesting bit, and people share it because they feel like it.
> Any information that requires something approximating a full-time job worth of effort to produce will necessarily go away
Many people put more effort into their hobbies than into their "full time" job.
Some of it will go away, but perhaps without the expectation that you can earn money, more people will share freely.
> While some must derive satisfaction from increasing the total sum of human knowledge, others are probably blogging to engage with readers or build their own personal brand, neither of which is served by AI scrapers.
We don't have to make every business model someone might want viable, though.
> Wikipedia is an interesting case. I still don’t entirely understand why it works, though I think it’s telling that 24 years later no one has replicated their success.
Actually this model is quite common. There are tons of sources of free information curated by volunteers - most are just too niche to get to the scale of Wikipedia.
Ofttimes people are sufficiently anti-ad that this point won't resonate well. I'm personally mostly in that camp, in that with relatively few exceptions money seems to make the parts of the web I care about worse (it's hard to replace passion, and wading through SEO-optimized AI drivel to find a good site is a lot of work). Giving them concrete examples of sites which would go away can help make your point.
E.g., Sheldon Brown's bicycle blog is something of a work of art and one of the best bicycle resources literally anywhere. I don't know the man, but I'd be surprised if he'd put in the same effort without the "brand" behind it -- thankful readers writing in, somebody occasionally using the donate button to buy him a coffee, people like me talking about it here, etc.
But even your example potentially gets worse with AI - the "upsell" of his blog isn't paid posts but more subscribers, so that there will be thankful readers, a few donators, people talking about it. If the only interface becomes an AI summary of his work without credit, it's much more likely he stops writing, as it'll seem like he's just screaming into the void.
He's so widely respected that among those who repair bikes (I maintain a fleet of ~10 for my immediate family) he is simply known as "Saint Sheldon".
I agree that specific examples help, though I think the ones that resonate most will necessarily be niche. As a teen, I loved Penny Arcade, and watched them almost die when the bottom fell out of the banner-ad market.
Now, most of the value I find in the web comes from niche home-improvement forums (which Reddit has mostly digested). But even Reddit has a problem if users stop showing up from SEO.
I agree with your first line, but the rest sounds like a similar argument to the ridiculous damages video game companies used to claim due to piracy, when most of those pirates never would have bought the game in the first place.
Ultimately the root issue is that copyright is inherently flawed, because it tries to increase available useful information by restricting availability. We'd be better off not pretending that information is scarce and looking for alternatives to fund its creation.
There are companies that already do this, and the ONE thing none of them do is place the information they are selling on THE PUBLIC INTERNET. So your point is moot.
Maybe, on a social level, we all win by letting AI ruin the attention economy:
The internet is filled with spam. But if you talk to one specific human, your chance of getting a useful answer rises massively. So in a way, a flood of written AI slop is making direct human connections more valuable.
Instead of having 1000+ anonymous subscribers for your newsletter, you'll have a few weekly calls with 5 friends each.
> Let’s imagine you have a content creator that runs a paid newsletter. They put in lots of effort to make well-researched and compelling content. They give some of it away to entice interested parties to their site, where some small percentage of them will convert and sign up.
> They put the information up under the assumption that viewing the content and seeing the upsell are inextricably linked. Otherwise there is literally no reason for them to make any of it available on the open web.
> Now you have AI scrapers, which will happily consume and regurgitate the work, sans the pesky little call to action.
If AI crawlers win here, we all lose.