Sure, let's make an EU commercial LLM. Let's start by scraping all the Francopho...

IMTDb · on April 11, 2023

You forgot that anyone using it must click a button that says that he will not use it for evil purposes. You also must acknowledge that the AI will not track you. These must be separate disclaimers that need to be validated on every prompt. API usage is thus not allowed.

The AI should also make it 100% clear that whatever gets produced is clearly identifiable as coming from an AI. As a consequence, text cannot be produced because it would be trivial to remove the disclaimer. A currently proposed bill indicates that the AI should only be able to produce images in an obscure format with a randomised watermark that covers at least 65% of the pixels of the image. The bill is scheduled for ratification in 2028 and must be signed by 100% of the member states.

Until then, the grant for the development of this world-changing AI is on an accelerated path! Teams can fill out a 65-page document to have a shot at getting a whole $1 million.

Accenture and Capgemini are working on it.

revelio · on April 11, 2023

Heh. Also important: anyone can object at any time to the AI knowing information that mentions them or that they created, and if they object in writing you have 3 days to re-train the AI to remove whatever they objected to. If you fail to meet this deadline then you have to pay 10% of your global revenue to the EU Commission and there is no court case or appeal you can file, you just have to pay.

Unless of course you have a legitimate reason for that data to be in the AI, or to reject the privacy request. What is and is not legitimate isn't specified anywhere because it's obvious. If you ask for clarification because you think it's not obvious, you won't be given any because we don't do things that way around here. If you interpret this clause in a way that we later decide makes us look bad, then the definition of "need" and "legitimate" will change at that moment to make us look good.

BTW inability to retrain within three days is not a legitimate reason. Nor is the need to be competitive with US firms. Now here is your 300,000 EUR grant, have fun!

omneity · on April 11, 2023

BLOOM has been trained on a 3M€ grant from French research agencies CNRS and GENCI.

Doesn’t have any of the constraints you’re talking about.

lhl · on April 11, 2023

BLOOM's training corpus ROOTS did make some efforts at removing PII https://arxiv.org/pdf/2303.03915.pdf btw, but AFAICT that was not at the behest of the French government.
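For a sense of what rule-based PII removal can look like, here is a minimal Python sketch. It is not the ROOTS pipeline; the two patterns, the placeholder tags, and the function name `redact_pii` are assumptions chosen for illustration, and real corpora need far broader coverage.

```python
import re

# Illustrative patterns only (assumptions, not taken from the ROOTS paper).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact_pii(text: str) -> str:
    """Replace email addresses and IPv4-looking strings with placeholder tags."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = IPV4_RE.sub("<IP>", text)
    return text


if __name__ == "__main__":
    print(redact_pii("Mail bob@example.com from 192.168.0.1"))
```

Patterns like these only catch identifiers with a fixed shape; a free-text biographical sentence passes through unchanged.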

nickpp · on April 11, 2023

The moment Europe decided to regulate tech, it decided in effect to stagnate. Innovation and creativity are incompatible with regulation. Unfortunately for us, tech is where progress happens currently. Europe is being left behind. Not that it was very competitive in the first place anyway.

MisterPea · on April 11, 2023

While true, I think they do innovate in policy around it. Regulation is an ever-evolving field as well and they do think about it more.

But yes, I'm very curious where Europe will be in half a century. India passed the UK in GDP recently and will pass Germany sooner or later.

raverbashing · on April 11, 2023

The secret is not scraping PII in the first place (which is not really difficult, though it requires some planning)

brookst · on April 11, 2023

“Not that difficult”?

Can you elaborate? Because I think it’s nearly insurmountable.

Is the sentence “Meagan Smith graduated magna cum laude from Northwestern’s business program in 2004” PII? How about if another part of the corpus says “M. Smith had a promising career in business after graduating with honors from a prestigious school, but an unplanned pregnancy caused her to quit her job in 2006”?

Does it matter if it’s from fiction? What if the fiction it comes from uses real people? Or if there might be both real and fictional Meagan Smiths?

And how do you process that kind of thing at the scale of billions of documents?

This is a very hard problem, especially at scale.

raverbashing · on April 11, 2023

Where are you scraping this data from? This is the main question

> “M. Smith had a promising career in business after graduating with honors from a prestigious school, but an unplanned pregnancy caused her to quit her job in 2006”

The main issue is how that statement ended up there in the first place. Even then how many "M. Smith" have studied in prestigious schools? By itself that phrase wouldn't be PII

Now if you have a db entry with "M Smith" and entries for biographical data that's definitely PII

seydor · on April 11, 2023

Not sure if it can ever be possible. I can ask ChatGPT to do stylometric analysis on our comments, find our other accounts and go from there. I'm pretty sure most pieces of human-generated data are identifiable at this point.

raverbashing · on April 11, 2023

This is not how it works (unless of course you're pretending and collecting all this 'auxiliary data' on purpose), and even if it was, there's still plenty of non-PII data around

pixl97 · on April 11, 2023

It would be really interesting to raise a human only on non-PII data and see exactly how screwed up and weird they'd be.

The Golem-Class model behaves in a 'humanlike' manner because it's trained on actual real data like we'd experience in the world. What you're suggesting is some insane psychology test that we'd never allow to happen to a human.

seydor · on April 11, 2023

Not long ago someone applied a basic cosine similarity to HN comments to find alternate accounts. It worked quite well afaik https://news.ycombinator.com/item?id=33755016
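To illustrate the general idea, here is a minimal Python sketch of cosine similarity over word-count vectors. It is not the method from the linked post; the tokenizer, the function names, and the sample accounts are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Lowercase the text and count word occurrences (a crude style fingerprint)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target: str, candidates: dict[str, str]) -> str:
    """Return the candidate account whose text is closest to the target text."""
    tv = word_counts(target)
    return max(
        candidates,
        key=lambda name: cosine_similarity(tv, word_counts(candidates[name])),
    )


if __name__ == "__main__":
    # Hypothetical accounts and comments, made up for this example.
    accounts = {
        "alice": "innovation and regulation in tech",
        "bob": "my cat likes tuna",
    }
    print(most_similar("regulation kills innovation in tech", accounts))
```

A real attempt would need much more text per account and better features than raw word counts.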

raverbashing · on April 11, 2023

Yes, and?

If you're worried about being identified from alt-accounts, you're much more likely to be tracked via reuse of emails or some other information that you have let slip (see the multitude of cases)

Simple text is not PII; laws are not interpreted the way technical discussions are https://xkcd.com/1494/