Sure, let's make an EU commercial LLM. Let's start by scraping all the Francophone internet. Then let's remove all the PII and anything that is potentially PII. Easy-peasy.
Then let's train our network so as not to spew out or make up PII - easy-peasy.
Then let's make it able to delete, on request, PII that it has inadvertently collected. Simultaneously it should be recording all the conversations for safety reasons. That must be possible somehow.
And let's make sure it never impersonates or makes up defamatory content - that must be super easy.
And let's make it explain itself. But explain truthfully, by giving an oath, not like ChatGPT that likes making things up.
You forgot that anyone using it must click a button that says that they will not use it for evil purposes. You also must acknowledge that the AI will not track you. These must be separate disclaimers that need to be validated on every prompt. API usage is thus not allowed.
The AI should also make it 100% clear that whatever gets produced is identifiable as coming from an AI. As a consequence, text cannot be produced because it would be trivial to remove the disclaimer. A currently proposed bill indicates that the AI should only be able to produce images in an obscure format with a randomised watermark that covers at least 65% of the pixels of the image. The bill is scheduled for ratification in 2028 and must be signed by 100% of the member states.
Until then, the grant for the development of this world-changing AI is on an accelerated path! Teams can fill out a 65-page document to have a shot at getting a whole $1 million.
Heh. Also important: anyone can object at any time to the AI knowing information that mentions them or that they created, and if they object in writing you have 3 days to re-train the AI to remove whatever they objected to. If you fail to meet this deadline then you have to pay 10% of your global revenue to the EU Commission and there is no court case or appeal you can file, you just have to pay.
Unless of course you have a legitimate reason for that data to be in the AI, or to reject the privacy request. What is and is not legitimate isn't specified anywhere because it's obvious. If you ask for clarification because you think it's not obvious, you won't be given any because we don't do things that way around here. If you interpret this clause in a way that we later decide makes us look bad, then the definition of "need" and "legitimate" will change at that moment to make us look good.
BTW inability to retrain within three days is not a legitimate reason. Nor is the need to be competitive with US firms. Now here is your 300,000 EUR grant, have fun!
BLOOM's training corpus ROOTS did make some efforts at removing PII https://arxiv.org/pdf/2303.03915.pdf btw, but AFAICT that was not at the behest of the French government.
The moment Europe decided to regulate tech, it decided in effect to stagnate. Innovation and creativity are incompatible with regulation. Unfortunately for us, tech is where progress happens currently. Europe is being left behind. Not that it was very competitive in the first place anyway.
Can you elaborate? Because I think it’s nearly insurmountable.
Is the sentence “Meagan Smith graduated magna cum laude from Northwestern’s business program in 2004” PII? How about if another part of the corpus says “M. Smith had a promising career in business after graduating with honors from a prestigious school, but an unplanned pregnancy caused her to quit her job in 2006”?
Does it matter if it’s from fiction? What if the fiction it comes from uses real people? Or if there might be both real and fictional Meagan Smiths?
And how do you process that kind of thing at the scale of billions of documents?
Where are you scraping this data from? This is the main question.
> “M. Smith had a promising career in business after graduating with honors from a prestigious school, but an unplanned pregnancy caused her to quit her job in 2006”
The main issue is how that statement ended up there in the first place. Even then, how many people called "M. Smith" have studied in prestigious schools? By itself that phrase wouldn't be PII.
Now if you have a db entry with "M Smith" and entries for biographical data, that's definitely PII.
Not sure if it can ever be possible. I can ask ChatGPT to do stylometric analysis on our comments, find our other accounts and go from there. I'm pretty sure most pieces of human-generated data are identifiable at this point.
This is not how it works (unless of course you're pretending and collecting all these 'auxiliary data' on purpose), and even if it were, there's still plenty of non-PII data around.
It would be really interesting to raise a human only on non-PII data and see exactly how screwed up and weird they'd be.
The Golem-Class model behaves in a 'humanlike' manner because it's trained on actual real data like we'd experience in the world. What you're suggesting is some insane psychology test that we'd never allow to happen to a human.
Not long ago someone applied basic cosine similarity to HN comments to find alternate accounts. It worked quite well AFAIK https://news.ycombinator.com/item?id=33755016
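For anyone wondering what "basic cosine similarity" on comments looks like in practice, here is a minimal sketch. It assumes character trigram counts as the features, which is my own choice for illustration; the linked post may well have used different features. The names `char_ngrams`, `cosine_similarity` and `rank_candidates` are hypothetical helpers, not from any library.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count character n-grams after lowercasing and collapsing whitespace.

    Assumption: trigrams are a crude stand-in for writing style.
    """
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is similar to nothing
    return dot / (norm_a * norm_b)


def rank_candidates(target, profiles):
    """Rank every other account by similarity to `target`, best match first.

    `profiles` maps an account name to all of that account's comment text.
    """
    target_vec = char_ngrams(profiles[target])
    scores = [
        (name, cosine_similarity(target_vec, char_ngrams(text)))
        for name, text in profiles.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

You would call `rank_candidates("some_user", profiles)` with each account's comments concatenated into one string and look at the top few results. With real data you would want far more text per account than a handful of comments, and a high score is only a hint, not proof that two accounts share an author.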
If you're worried about being identified from alt-accounts, you're much more likely to be tracked via reuse of emails or some other information that you have let slip (see the multitude of cases).
Simple text is not PII; laws are not interpreted like technical discussions are https://xkcd.com/1494/
> Then let's train our network so as not to spew out or make up PII data - easy peasy
> Then let's make it able to delete PII data that it has inadvertently collected on request. Simultaneously it should be recording all the conversations for safety reasons. that must be possible somehow
> And let's make sure it never impersonates or makes up defamatory content - that must be super easy.
> And let's make it explain itself. But explain truthfully, by giving an oath, not like ChatGPT that likes making things up.
Looks very doable to me