> Now expand that to movies and games and you can get why this whole generative-AI bubble is going to pop.
What will save it is that, no matter how picky you are as a creator, your audience will never know what exactly it was that you dreamed up, so any half-decent approximation will work.
In other words, a corollary to your corollary is, "Fortunately, you don't need them to be, because no one cares about low-order bits".
Or, as we say in Poland, "What the eye doesn't see, the heart doesn't mourn."
> What will save it is that, no matter how picky you are as a creator, your audience will never know what exactly it was that you dreamed up, so any half-decent approximation will work.
Part of the problem is that "half-decent approximations" tend towards a clichéd average. The audience won't know that the cool cyberpunk cityscape you generated isn't exactly what you had in mind, but they will know that it looks like every other AI-generated cyberpunk cityscape, and they'll mentally file your creation in the slop folder.
I think the pursuit of fidelity has made the models less creative over time: they make fewer glaring mistakes, like giving people six fingers, but their output is ever more homogenized and interchangeable.
In other words, someone willing to tweak the prompt and press the button enough times to say "yeah, that one, that's really good" is going to have a result which cannot in fact be reliably binned as AI-generated.
I mean, no? None of the AI-generated images managed to be indistinguishable. Some people were much better than others at spotting the differences. He even quotes, at length, an artist giving a detailed breakdown of what's wrong with one of the images he thought was good.
Did you read the article? Respondents performed barely better than chance. Sure, no one was actually 100% wrong[0]. Just almost always wrong, with a noticeable bias towards liking AI art more.
The detailed breakdown you mention? Maybe it's accurate to that artist's thought process, maybe it's more of a rationalization; either way, it's not a general rule they, or anyone, could apply to any of the other AI images. Most of those in the article don't exhibit those "telltale signs", and the one that does - the Victorian Megaship - was actually made by a human artist with no AI in the mix.
EDIT:
Another image that stands out to me is Riverside Cafe. I, like apparently a lot of other people (going by the article's comments), assumed it was human-made, because we vaguely remembered Van Gogh painting something like it. He did - it's called Café Terrace at Night - and yet, despite immediately evoking the association, Riverside Cafe was made by AI, and is actually nothing like Café Terrace at Night at any level.
(I find it fascinating how this work looks like a copy of Van Gogh at first glance, for no obvious reason, but nothing alike once you pause to look closer. It's like... they have similar low-frequency spectra or something?)
EDIT2:
Played around with the two images in https://ejectamenta.com/imaging-experiments/fourifier/. There are some similarities in the spectra, though I can't put my finger on them exactly. But it's probably not the whole answer. I'll try to do some more detailed experimentation later.
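For anyone who wants to poke at this locally instead of in the browser tool, here's a minimal numpy sketch of the kind of comparison I mean (the filenames are placeholders, and the radius-16 cutoff for "low frequency" is an arbitrary choice):

    import numpy as np
    from PIL import Image

    def log_spectrum(path, size=(256, 256)):
        # Grayscale image -> centered log-magnitude of its 2D FFT.
        img = np.asarray(Image.open(path).convert("L").resize(size), dtype=float)
        return np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(img))))

    def low_freq_similarity(spec_a, spec_b, radius=16):
        # Correlate only the central (low-frequency) patch of the two spectra.
        cy, cx = spec_a.shape[0] // 2, spec_a.shape[1] // 2
        a = spec_a[cy - radius:cy + radius, cx - radius:cx + radius].ravel()
        b = spec_b[cy - radius:cy + radius, cx - radius:cx + radius].ravel()
        return np.corrcoef(a, b)[0, 1]

    sim = low_freq_similarity(log_spectrum("cafe_terrace.jpg"),
                              log_spectrum("riverside_cafe.jpg"))
    print(f"low-frequency spectral correlation: {sim:.3f}")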
--
[0] - Nor should you expect it - that would mean either perfect calibration, or the equivalent of flipping a coin and getting heads 30 times in a row; it's not impossible, but you shouldn't expect to see it unless you're interviewing something on the order of the entire population of the planet.
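To put numbers on that (using the 50-question and 11,000-respondent figures from the article quoted below):

    # Chance a pure coin-flipping respondent gets all 50 questions wrong.
    n_questions, n_people = 50, 11_000
    p_all_wrong = 0.5 ** n_questions
    print(f"P(all wrong) = {p_all_wrong:.1e}")                           # ~8.9e-16
    print(f"expected among {n_people:,}: {n_people * p_all_wrong:.1e}")  # ~1e-11

So even across eleven thousand respondents, you'd expect essentially zero perfectly wrong scores by chance alone.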
> The average participant scored 60%, but people who hated AI art scored 64%, professional artists scored 66%, and people who were both professional artists and hated AI art scored 68%.
> The highest score was 98% (49/50), which 5 out of 11,000 people achieved. Even with 11,000 people, getting scores this high by luck alone is near-impossible.
This accurately boils down to "cannot reliably be binned as AI-generated". Your objection amounts to vanishingly few people - people who are informed that this is a test - being able to do a pretty good job at it.
If 0.045% of people who are specifically judging art as AI or not AI, in a test which presumably attracts people who would like to be able to do that thing, can do a 98% accurate job, and the average is around 60%: that isn't reliable.
If that doesn't work for you, I encourage you to take the test. Obviously since you've read the article there are some spoilers, but there's still plenty of chances to get it right or wrong. I think you'll discover that you, too, cannot do this reliably. Let us know what happens.
I can't do it reliably and I don't want to - I learnt to spot certain popular video compression artifacts in my youth, and that has not enhanced my life. But any distinction that random people taking a casual internet survey get right 60% of the time is absolutely one that you can make reliably if you put in the effort. Look at something like chicken sexing.
A somewhat counterintuitive argument is this: AI models will make the overall creative landscape more diverse and interesting, i.e., less "average"!
Imagine the space of ideas as a circle, with stuff in the middle being easier to reach (the "clichéd average"). Previously, traversing the circle was incredibly hard - we had to use tools like DeviantArt, Instagram, etc. to aggregate the diverse tastes of artists, hoping to find or create the style we were looking for. Recreating a particular art style meant hiring the artist. As a result, on average, what you see is the product of huge amounts of human curation, effort, and branding teams.
Now reduce the effort 1000x, and all of a sudden it's incredibly easy to reach the edge of the circle (or at least get close to it). Sure, we might still miss some things at the very outer edge, but it's equivalent to building roads: motorists appear. People with no time to sit down and spend 10,000 hours learning and mastering a particular style can simply remix art and create things wildly beyond their manual capabilities. As a result, the amount of content in the infosphere skyrockets, the tastemaking velocity accelerates, and you end up with a more interesting infosphere than you're used to.
To extend the analogy, imagine the circle as a probability distribution; for simplicity, imagine it's a bivariate normal joint distribution (i.e., a Gaussian in 3D) plus some noise, and you're looking down on it from above.
When you're commissioning an artist to make you some art, you're basically sampling from the entire distribution. Stuff in the middle is, as you say, easiest to reach, so that's what you'll most likely get. Generative models let more people do art, meaning there's more sampling happening, so the stuff further from the centre will be visited more often, too.
However, AI tools also make another thing easier: moving and narrowing the sampling area. Much like with a very good human artist, you can find some work that's "out there" and ask for variations of it. However, there are only so many good artists to go around. AI making this process much easier and more accessible means more exploration of the circle's edges will happen. Not just "more like this weird thing", but also combinations of 2, 3, 4, N distinct weird things. So in a way, I feel that AI tools will surface creative art disproportionately more than they'll boost the common case.
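Here's a toy simulation of both effects, for the sake of concreteness (all numbers made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # Effect 1: more samples from the same 2D Gaussian reach further out.
    for n in (100, 10_000, 1_000_000):
        pts = rng.standard_normal((n, 2))
        print(f"n={n:>9,}: furthest sample at radius "
              f"{np.linalg.norm(pts, axis=1).max():.2f}")

    # Effect 2: "move and narrow the sampling area" - re-centre a tighter
    # Gaussian on an outlier and ask for variations of it.
    pts = rng.standard_normal((10_000, 2))
    weird = pts[np.linalg.norm(pts, axis=1).argmax()]
    variations = weird + 0.2 * rng.standard_normal((1_000, 2))
    print(f"variations cluster near radius "
          f"{np.linalg.norm(variations, axis=1).mean():.2f}")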
Well, except for the fly in the ointment that's the advertising industry (a.k.a. the cancer on modern society). Unfortunately, the vast majority of humanity's creative output today is produced for advertising purposes, and that goal favors the common, as it maximizes the audience (and is least off-putting). A deluge of AI slop is unavoidable, because slop is how the digital world makes money, and generative AI models make it cheaper than the generative protein models that have been doing it so far. Don't blame AI research for that; blame advertising.
Tastes are almost never normally distributed along a spectrum, but multi-modal. So the more dimensions you explore in, the more you end up with "islands of taste" on the surface of a hypersphere, and nothing like the normal distribution at all. This phenomenon is deeply tied to why "design by committee" (e.g., in movies) always makes the financial estimates happy but flops with audiences - there is almost no customer for average anything.
An example of a hit movie or song that was created by committee?
Inside Out 2 had the largest box office of any movie in 2024. Check out the "Research and writing" section in its Wikipedia article (https://en.wikipedia.org/wiki/Inside_Out_2#Research_and_writ...): psychological consultants, a feedback loop with a group of teenagers, test screenings.
Or how about "Die With a Smile" - currently number 1 in the global Top 50 on Spotify. 5 songwriters.
Or "APT." - currently number 2 in the global Top 50 on Spotify. 11 songwriters.
Inside Out 2 has a single writer, who also worked on the first.
Consulting with SMEs, testing with audiences, etc. isn't "design by committee".
Similarly, “Die With a Smile” seems to have been the work of two people with developed styles with support — again, not a committee:
> The collaboration was a result of Mars inviting Gaga to his studio where he had been working on new music. He presented the track in progress to her and the duo finished writing and recording the song the same day.
"APT." seems to have started with a single person goofing around, then was pitched as a collaboration, and the expanded team entered at that point.
I like the picture, but I'd be more impressed with the exploration argument if we were collectively actually doing a good job giving recognition to original and substantial works that already exist.
It'd be of greater service in that regard to create a high-quality artificial stand-in for that limited-quantity "attention" and "engagement" all the bloodsuckers seem so keen on harvesting.
(And I do blame the advertisers, but frankly anyone handing them new amplifiers, with entirely predictable consequences, is also not blameless.)
I read this argument/analogy and the "AI slop will win" idea reminds me of the idea that "fake news will win".
That is based on the perception that it is easier than ever to create fake content, but it fails to account for the fact that creating real content (for example, simply taking a video) is easier still. So while there is more fake content, there is also a lot more real content, and so manipulating reality (for example, denying a genocide) is much harder today than ever.
Anyway, "the AI slop will win" is based on a similar misconception, that total creative output will not increase. But like with fake news, it probably will not be the case, and so the actual amount of good art will increase, too.
I think we are OK as long as normal humans prefer to create real news rather than fake news, and create innovative art rather than cliched art.
> I think we are OK as long as normal humans prefer to create real news rather than fake news, and create innovative art rather than cliched art.
So we're not OK.
I think I need to state my assumptions/beliefs here more explicitly.
First of all, "AI slop" is just the newest iteration on human-produced slop, which we're already drowning in. Not because people prefer to create slop, but because they're paid to do it, because most content is created by marketers and advertisers to sell you shit, and they don't want it to be better than strictly necessary for purpose.
It's the same with fake news, really. Fake news isn't new. Almost all news is fake news; what we call "fake news" is a particular flavor of bullshit that got popular as it got easier for random humans to publish stories competing with established media operations.
In both cases, AI is exacerbating the problem, but it did not create it - we were already drowning in slop.
Which leads me to related point:
> Anyway, "the AI slop will win" is based on a similar misconception, that total creative output will not increase.
It will. But don't forget Sturgeon's law - "ninety percent of everything is crap"[0]. Again, for the past couple of decades, we've been drowning in "creative output". It's not a new problem; it's just become increasingly noticeable in recent years, because the Web makes it very easy for everyone to create more "creative output" (most of which is, again, advertising), and it has finally started overwhelming our ability to filter out the crap and curate the gems.
Adding AI to the mix means more output, which per Sturgeon's law, means disproportionately more crap. That's not AI's fault, that's ours; it's still the same problem we had before.
And as AI oversaturates the cliched average, creators will have to get further and further away from the average to differentiate themselves. If you pour a lot of work into your creation you want to make it clear that it isn't some cliched AI drivel.
> I think the pursuit of fidelity has made the models less creative over time: they make fewer glaring mistakes, like giving people six fingers, but their output is ever more homogenized and interchangeable.
That may be true of any one model (though I don't think it really is, either; I think newer image-gen models are individually capable of a much wider array of styles than earlier models), but it is pretty clearly not true of the whole range of available models, even if you look at a single model "family" like "SDXL derivatives".
> I think the pursuit of fidelity has made the models less creative over time (...) their output is ever more homogenized and interchangeable.
Ironically, we're long past that point with human creators, at least when it comes to movies and games.
Take sci-fi movies, and compare modern ones to the ones from the tail end of the 20th century. Year by year, VFX gets more and more detailed (and expensive) - more and better lights, finer details on every material, more stuff moving and emitting light, etc. But all that effort arguably killed immersion and believability, by making scenes incomprehensible. There's way too much visual noise in action scenes in particular - bullets and lightning bolts zip around, and all that detail just blurs together. Contrast that with 20th century productions: textures weren't as refined, but you could at least tell who was shooting whom, and when.
Or take video games, where all that graphics work makes everything look the same. Especially games that go for a realistic style - they're all homogeneous these days, and it's all cheap plastic.
(Seriously, what the fuck went wrong here? All that talk, and research, and work that went into "physically based rendering", yet in the end, all PBR materials end up looking like painted plastic. Raytracing seems to help a bit when it comes to liquids, but it still can't seem to make metals look like metals rather than Fisher-Price toys repainted gray.)
So I guess in this way, more precision just makes the audience give up entirely.
> they will know that it looks like every other AI generated cyberpunk cityscape and mentally file your creation in the slop folder.
The answer here is the same as with human-produced slop: don't. People are good at spotting patterns, so keep adding those low-order bits until it's no longer obvious you're doing the same thing everyone else is.
EDIT: Also, obligatory reminder that generative models don't give you the average of the training data with some noise mixed in; they sample from a learned distribution. The law of large numbers applies, but that just means that to get more creative output, you need to bias the sampling.
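To make "bias the sampling" concrete, here's the standard temperature trick on a made-up categorical distribution - raise the temperature and the rare (less "average") outcomes get sampled more often:

    import numpy as np

    logits = np.array([4.0, 2.0, 1.0, 0.0, -1.0])  # hypothetical model scores

    def softmax(x, temperature=1.0):
        z = x / temperature
        z = z - z.max()              # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    for t in (0.7, 1.0, 2.0):
        p = softmax(logits, t)
        print(f"T={t}: P(most likely)={p[0]:.2f}, P(rarest)={p[-1]:.2f}")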
Video games (the much larger of the two industries, by revenue) seem to be closer to understanding this. AAA games dominate advertising and news cycles, but on any best-seller list AAA games are on par with indie and B games (I think they call them AA now?). For every successful $60M PBR-rendered Unreal 5 title there is an equally successful game with low-fidelity graphics but exceptional art direction, story, or gameplay.
Western movie studios may discover the same thing soon, with the number of high-budget productions tanking lately.
I agree. The one shining hope I have is the incredible art and animation style of Fortiche[0]'s Arcane[1] series. Watch that, and then watch any recent (and identikit) Pixar movie, and they are just streets ahead. It's just brilliant.
I was just going to say this. If you have an artistic vision that you simply must create to the minutest detail, then like any artist, you're in for a lot of manual work.
If you are not beholden to a precise vision or maybe just want to create something that sells, these tools will likely be significant productivity multipliers.
So far, ChatGPT is not for writing books, but it is great for SEO-spam blog posts. It is already killing the content-marketing industry.
So far, DALL-E is not for making master paintings, but it's great for stock images. It might kill most of the clipart and stock-image industry.
So far, Udio and other song generators are not able to make symphonies, but they're great for quiet background music. They might kill most of the generic royalty-free-music industry.
Half-decent approximations work a lot better when generating the equivalent of a stock illustration for a PowerPoint slide.
Actual long-form art like a movie works because it includes many well-informed choices that work together as a whole.
There seems to be a large gap between generating a few seconds of video vaguely like one's notion, and trying to create 90 minutes that are related and meaningful.
Which doesn't mean that you can't build more robust tools from this starting place. But if you think that this is a large, hard amount of work, it certainly could call into question optimistic projections from people who don't even seem to notice that there is work needed at all.
That's just sad, and it's why people have a derogatory stance towards generative AI: "half-decent" approximation removes all personality from the output, leading to a bunch of slop on the internet.
It does indeed, but then many of those people don't notice they're already consuming half-decent, personality-less slop, because that's what human artists make too, when churning out commercial art for peanuts and on tight deadlines.
It's less obvious because people project personality onto the content they see, because they implicitly assume the artist cared, and had some vision in mind. Cheap shit doesn't look like cheap shit in isolation. Except when you know it's AI-generated, because this removes the artist from the equation, and with it, your assumptions that there's any personality involved.
I'm not so sure - one of the primary complaints about the IP-farming slop that major studios have produced recently is a lack of firm creative vision, and clear evidence of design by committee over artist direction.
People can generally see the lack of artistic intent when consuming entertainment.
That's true. Then again, complaints about "lack of firm creative vision, and clear evidence of design by committee over artist direction" are something I've seen levied against Disney for several years now; importantly, they started before generative AI found its way into major productions.
So, while GenAI tools make it easier to create superficially decent work that lacks creative intent, the studios managed to do it just fine with human intelligence only, suggesting the problem isn't AI, but the studios and their modern management policies.
It’s like how there are two types of movie directors (or creative directors in general), the dictatorial “100 takes until I get it exactly how I envision it” type, and the “I hired you to act, so you bring the character to life for me and what will be will be” type
Right now AI is more the latter, but many people want it to be the former
A director letting actors "just be" knows exactly what he/she wants, and chooses actors accordingly - just like the directors who demand the most minute detail.
Clint Eastwood tries to do at most one take of a scene. David Fincher is infamous for his dozens of takes.
No one can have a fully formed vision. But intent, yes. Then you use techniques to materialize it. Words are a poor substitute for that intent, which is why there are so many sketches in a visual project.
And why physical execution frequently significantly departs from sketches and concept art. The amount of intent that doesn't get translated is pretty staggering in both physical and digital pipelines in many projects.
Fair point, particularly given the example. My conclusion wrt. Marvel vs. DC is that DC productions care much less about details, in exactly the way I find off-putting.
Not all details matter, but some do. And it's better to not show the details at all than to be inconsistent in them.
Like, I don't know, don't identify a bomb as a specific type of existing fuel-air ordnance and then act as if it were a goddamn tactical nuke. Something along these lines was what made me stop watching the Arrow series.
The last production I worked on averaged 16 hours per frame for the final rendering. The amount of information encoded in lighting, models, texture, maps, etc is insane.
VFX-heavy feature for a Disney subsidiary. Each frame is rendered independently of the others - it's not like video encoding, where each frame depends on the previous one; they all have their own scene assembly that can be sent to a server to parallelize rendering. With enough compute, the entire film can be rendered in a few days. (It's a little more complicated than that, but it works to a first-order approximation.)
I don’t remember how long the final rendering took but it was nearly two months and the final compute budget was 7 or 8 figures. I think we had close to 100k cores running at peak from three different render farms during crunch time, but don’t take my word for it I wasn’t producing the picture.
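For scale, a back-of-envelope pass at those numbers (assuming a ~100-minute feature at 24 fps, and reading "16 hours per frame" as core-hours - both assumptions mine):

    frames = 100 * 60 * 24                      # ~144,000 frames
    total_core_hours = frames * 16              # ~2.3 million core-hours
    print(f"{total_core_hours / 100_000 / 24:.1f} days "
          f"for one full pass at 100k cores")   # -> ~1 day
    print(f"{total_core_hours / 24 / 365:.0f} years on a single core")

Which is consistent with the "few days with enough compute" point above; the two-month wall-clock figure presumably covers re-renders, revisions, and contention for the farm.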
There are plenty of GPU renderers, but they face the same challenge as large language models: GPU memory is much more expensive and limited than CPU memory.
A friend recently told me about a complex scene (I think it was a Marvel or Star Wars flick) where they had so much going on in the scene with smoke, fire, and other special effects that they had to wait for a specialized server with 2TB of RAM to be assembled. They only had one such machine so by the time the rest of the movie was done rendering, that one scene still had a month to go.
I'm not sure how well suited GPUs are to the workload. They're also rather memory-constrained. The Moana dataset is from 2016, so it's not exactly cutting edge, but good luck loading it into VRAM.
Most VFX productions take over 2 CPU-hours per frame for the final video, and have for a very long time. It takes less than a month because this gets parallelized on large render farms.
The point is not to be precise. It's to be "good enough".
Trust me, even if you work with human artists, you'll keep saying "it's not quite what I initially envisioned, but we don't have the budget/time for another revision, so it's good enough for now" all the time.
Corollary: I couldn't create an original visual piece of art to save my life, so prompting is infinitely better than what I could do myself (or am willing to invest time in building skills). The gen-AI bubble isn't going to burst. Pareto always wins.
If you can build a system that can generate engaging games and movies, from an economic (bubble popping or not popping) point of view it's largely irrelevant whether they conform to fine-grained specifications by a human or not.
Text generation is the most mature form of genAI, and even that isn't remotely close to producing an endless stream of engaging stories. Adding the visual aspect to turn a story into a movie, or the interactive element to turn it into a game, is only uphill from there.
Maybe your AI bubble! If you define AI to be something like just another programming language, then yes, you will be sadly disappointed. You see it as an employee with its own intuitions and ways of doing things that you're trying to micromanage.
I have a bad feeling that you'd be a horrible manager if you ever were one.
Yes - in a nutshell, they explain that you can express a picture or a video with relatively few pieces of discrete information.
The first paper is the most famous and prompted a lot of research into using text-generation tools in the image-generation domain: 256 "words" for an image. The second paper uses 24 reference images per minute of video. The third paper is a refinement of the first, saying you only need 32 "tokens". I'll let you multiply the numbers.
In kind of the same way as a who's-who game, where you can identify any human on earth with ~32 bits of information.
The corollary being that, contrary to what the parent is saying, there is no theoretical obstacle to obtaining a video from a textual description.
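A quick sanity check on the bit counting (the codebook size below is a made-up example; the papers use various vocabulary sizes):

    import math

    # ~33 bits suffice to index every human on earth.
    print(math.ceil(math.log2(8_000_000_000)))   # -> 33

    # 32 tokens drawn from a hypothetical 16,384-entry codebook
    # carry 32 * 14 = 448 bits - far more than a who's-who game needs.
    print(32 * int(math.log2(16_384)))           # -> 448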
These papers, from my quick skim (though I did read the first one fully, years ago), seem to show that some images, and to an extent video, can be generated from discrete tokens, but they do not show that exact images can be, nor that any arbitrary image can be.
For instance, what combination of tokens must I put in to get _exactly_ the Mona Lisa or Starry Night? (Though these might be very well represented in the dataset. Maybe a lesser-known image would be a better example.)
As I understand it, the OC was saying that they can't produce what they want with any degree of precision, since there's no way to encode that information in discrete tokens.
If you want to know what tokens give you _exactly_ the Mona Lisa, or any other image, you take the image and put it through your image tokenizer, aka encode it; and once you have the sequence of tokens, you can decode it back to an image.
The whole encoding-decoding process is reversible, and you only lose some imperceptible "details"; the process can be trained with either an L2 loss or a perceptual loss, depending on what you value.
The point being that images which occur naturally are not really information-rich, and can be compressed a lot by neural networks of a few GB that have seen billions of pictures. With that strong prior, aka common knowledge, we can indeed paint with words.
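To illustrate the encode/decode step, here's a minimal vector-quantization sketch. Note the codebook below is random; real tokenizers (VQ-VAE, VQGAN) learn it from huge image corpora, which is where the compression comes from:

    import numpy as np

    rng = np.random.default_rng(0)
    codebook = rng.standard_normal((512, 64))  # 512 "words", 64-dim patch embeddings

    def encode(patches):
        # Each patch -> id of its nearest codebook entry (the discrete "token").
        dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        return dists.argmin(axis=1)

    def decode(tokens):
        # Token ids -> codebook vectors; detail below quantization is lost.
        return codebook[tokens]

    patches = rng.standard_normal((256, 64))   # e.g. a 16x16 grid of image patches
    tokens = encode(patches)                   # 256 discrete "words" for one image
    recon = decode(tokens)
    print(f"reconstruction error: {np.abs(patches - recon).mean():.2f}")

With a learned codebook and decoder, that reconstruction error becomes perceptually negligible - which is the "only lose some imperceptible details" claim above.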
Maybe I’m not able to articulate my thought well enough.
Taking an existing image and reversing the process to get the tokens that led to it, then redoing that, doesn't seem the same as inserting tokens to get a precise novel image.
Especially since, as you said, we'd lose some details; it suggests that not all images can be perfectly described and recreated.
I suppose I’ll need to play around with some of those techniques.
After encoding, the models are usually cascaded with either an LLM or a diffusion model.
Natural image -> sequence of tokens: but not all possible sequences of tokens will be reachable, just as plenty of letters put together form nonsensical words.
Sequence of tokens -> natural image: if the initial sequence of tokens is nonsensical, the natural image will be garbage.
So usually you then model the sequences of tokens so that you produce sensible ones, like you would with an LLM, and you use the LLM to generate more tokens. It also gives you a natural interface to control the generation of tokens: you can express with words what modifications to the image should be made. That will allow you to find the golden sequence of tokens corresponding to the Mona Lisa by dialoguing with the LLM, which has been trained to translate from English to visual-word sequences.
Alternatively, instead of an LLM you can use a diffusion model; the visual words are usually continuous there, but you can displace them iteratively with text using things like ControlNet (Stable Diffusion).
You are half right. It's funny, because I use the same saying. Mine is: "A picture is worth a thousand words - that's why it takes 1000 words to describe the exact image that you want! Much better to just use image-to-image instead."
That's my full quote on this topic, and I think it stands. Sure, people won't describe a picture; instead, they will take an existing picture or video and make modifications to it, using AI. That is much, much simpler and more useful - if you can film a scene, you can then animate it later with AI.
Actually, I've gotten some great results with image2text2image, with fewer than a thousand words. Maybe not enough for a video, but for some not-too-crazy images, it is enough!