Some argue that synthetic data can make AI systems better (ieee.org)
83 points by headalgorithm on Feb 18, 2022 | 78 comments


Some years ago I worked at a startup that was doing OCR on paper receipts. As part of my application to the company, I wrote a synthetic training data generator[0] to produce a range of CG receipts, along with pixel-perfect labeled XY bounding boxes for each letter. Generating synthetic training data gives you a high degree of flexibility over the shape of your data. It lets you focus on strengthening edge cases where you just don't have enough real-world data.

0. https://www.arwmoffat.com/work/synthetic-training-data
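For the curious, the core trick can be sketched in a few lines. This is a toy illustration, not the linked generator: the glyph size, item names, and layout constants below are all made up. The point is that because you lay the text out yourself, the ground-truth box for every character falls out for free.

```python
import random

# Assumed monospace glyph size in pixels (made-up constants for illustration).
CHAR_W, CHAR_H = 8, 16

def synth_receipt_line(text, x0=10, y0=40):
    """Return (text, boxes) where boxes[i] = (x_min, y_min, x_max, y_max)
    for character i, computed exactly from the layout we chose ourselves."""
    boxes = []
    for i, ch in enumerate(text):
        x = x0 + i * CHAR_W
        boxes.append((x, y0, x + CHAR_W, y0 + CHAR_H))
    return text, boxes

def synth_receipt(n_items=5, seed=0):
    """Generate a list of receipt lines with pixel-perfect character labels."""
    rng = random.Random(seed)
    items = ["MILK", "EGGS", "BREAD", "COFFEE", "APPLES"]
    lines = []
    for row in range(n_items):
        name = rng.choice(items)
        price = f"{rng.uniform(0.5, 20):.2f}"
        line = f"{name:<10}{price:>6}"  # fixed 16-character layout
        lines.append(synth_receipt_line(line, y0=40 + row * (CHAR_H + 4)))
    return lines

receipt = synth_receipt()
```

A real generator would of course rasterize with varied fonts, crumple, blur, and so on, but the labeling story is the same: the generator knows where every glyph went.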


When developing anything in the real world, you don't wait for edge cases to happen naturally. You force them to happen, and look at the response. This is the rigor of engineering. Has that been lost?


I often hear "edge case" as a defense for why something hasn't been developed or handled. There's an ongoing trend in software to avoid high-hanging fruit and focus on the low-hanging, big, fast returns. I just accept it now, although it bothers me every time I see it happen.

The beauty and allure of science and engineering, to me, is understanding something so well you can predict its future behavior and then predicting it so well you can prevent undesired behaviors and create or improve desired behaviors, within some set of bounds of course.


> I just accept it now, although it bothers me every time I see it happen.

That's all fine and dandy if it has no real consequences, which is usually the case. But the use cases of ML increasingly have very real consequences. I think this current Wild West mentality is great for greasing innovation, but not so great when a self-driving car starts slamming into poles at sunset whenever there's a blinking red "Open" sign behind them.


In autonomous vehicle development 'edge cases' are the things that engineers would never conjure up via thought experiment. They need to be discovered through real world testing.

Once an edge case has been discovered, they can then artificially generate myriad subtle variations on the edge case, which they use to train their systems.


I think the point is that you can augment/synthesize a video with all sorts of random lighting conditions, occlusions, etc., that could take an impossible number of miles to happen naturally. Not filling in some of the search space seems silly to me, especially if you're instead filling it in with real-life data consisting almost entirely of mundane, near-ideal conditions.
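As a concrete toy example (pure Python, with made-up jitter ranges): a single real grayscale frame can be fanned out into a sweep of lighting conditions that raw footage would rarely cover.

```python
import random

def jitter_lighting(frame, brightness, contrast):
    """Apply out = clamp(contrast * pixel + brightness, 0, 255) to every
    pixel of a nested-list grayscale frame."""
    return [[max(0, min(255, int(contrast * p + brightness)))
             for p in row] for row in frame]

def lighting_sweep(frame, seed=0):
    """Emit several synthetic lighting variants of one real frame."""
    rng = random.Random(seed)
    variants = []
    for _ in range(8):  # 8 synthetic conditions per real frame
        b = rng.uniform(-80, 80)   # dusk ... glare
        c = rng.uniform(0.5, 1.5)  # washed-out ... harsh
        variants.append(jitter_lighting(frame, b, c))
    return variants
```

Real pipelines do this on GPU with far richer transforms (occluders, lens flare, weather), but the principle of sampling the condition space instead of waiting for it is the same.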


Well, to test the system sure. That's normal.

However, the issue is not quite as simple when training a model. For instance, the increasingly popular Transformer architecture pushes its internal representations around as it is fed new training samples. These representations are not "causal parameters": they do not represent unchanging mechanisms in real life. Instead, they depend on the data.

The question thus becomes: How prominent should edge cases be in the sample? A ML model trained on many edge cases may be more robust for edge cases we feed it, but it may be more unstable on real-world data (in essence, the model may have new edge cases when new data comes in that would not have been edge cases before).

All of this comes from the fact that the parameters of the model are not "causal" for the underlying latent DGP. ML, in contrast to stats, does not yet have much theory regarding how to think about parameters relative to the sample of observations one trains on. The question of "causal" identification in stats is easier to tackle, because one can usually define a "scientific" or "structural" model where parameters have meaning, and so we can reason about whether we capture "real, causal parameters" (which would be compatible with edge cases) or "reduced form parameters" (which would generally depend on the sample we have). ML will likely get there. Right now, people think in terms of "data shift" (which they shouldn't) and generalizing (missing the point a bit). But we will get there.

By the way, this issue arose because ML systems were treated like engineering systems instead of statistical models. For better or worse.


There's this dogmatic idea in the machine learning field that only real world data is valuable.


Data augmentation is a thing within machine learning (deep learning) so that's a big generalization.


Data augmentation is a big thing in applied machine learning. Deep learning practitioners use data augmentation because it works incredibly well. However, deep learning researchers tend to view data augmentation as some kind of dirty trick that wouldn't be necessary if you just had a bigger dataset.


It's also about the issue that people will bias against generating the "difficult" data, even if only subconsciously. Real-world validation is essential to reveal whether you have actually worked through the full phase space.


Depends. As others have said, you can be intentionally adversarial (looking for corner cases) when testing your system, in a way that real data isn't.


I guess the field is constantly making a decision between 'do we want to outperform humans, but lose interpretability' and 'we should always be interpretable'.

Human-generated artificial data will always contain the human's assumptions. But this data might not contain the 'superhuman' element that leads to the ML-system outperforming humans, because we don't know what that is (yet). Receipts are a good example for artificial data, because they're human-made.

But a lot of what we do with ML systems involves data that doesn't come from humans: images of wildlife, satellite images, biological data, etc.


This is simply not true. Synthetic data is a huge research arm of ML work.


>> The solution is to just have more data and better data.

Nah, sorry, that is just trying to put out the fire by throwing fuel at it. The big, big weakness of neural networks right now is their reliance on gigantic datasets that require gigantic computational resources. Neural nets need those because they can't generalise to unseen data. So people try to get them to see as much data as possible during training. They still overfit, but if they can overfit to a diverse enough dataset, then they can be useful in practice, even if that's only to solve narrow, specific instances of a problem (like in the domino recognition system in the article).

To address this weakness, what is needed is to find ways to make neural nets less reliant on data, not to find ways to make more data. Make neural nets capable of generalising robustly to unseen data from few training instances. Then you don't need to train in a simulation. Of course, that would require a radical rethink of how deep neural nets are trained (perhaps even of whether they remain "deep", or whether they are trained using gradient descent, the sources of their data-hunger). Trying to make more data by simulation is only kicking the can down the road: its only possible effect is to push the moment when the real limitation must actually be addressed further down the line, so that another generation of researchers has to deal with it while the current generation keeps getting their papers published and their grants granted.

See Vladimir Vapnik's challenge to the machine vision community:

https://youtu.be/bQa7hpUpMzM?t=492

To summarise: learn to identify MNIST digits from 60 examples of each class, rather than 6000, while retaining current accuracy. (My words now:) Improve sample efficiency to improve neural nets. Neural nets have shown a remarkable ability to work well when large amounts of resources are available. Now, do as everyone else does in computer science and try to make them (sample-)efficient.
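The dataset-shrinking part of the challenge is trivial to set up, which is part of its appeal. A minimal sketch (any labeled dataset works; `n=60` matches Vapnik's number):

```python
import random
from collections import defaultdict

def subsample_per_class(examples, labels, n=60, seed=0):
    """Keep at most n examples per class, so a model must generalize
    from 60 digits per class instead of 6000."""
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    rng = random.Random(seed)
    xs, ys = [], []
    for y, items in sorted(by_class.items()):
        for x in rng.sample(items, min(n, len(items))):
            xs.append(x)
            ys.append(y)
    return xs, ys
```

The hard part of the challenge is, of course, everything after this function returns.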


This is always an interesting discussion. I think what you say is probably right "in practice" - that is, it is where ML should move now to improve its ability to generalize.

However, from a more statistical viewpoint, this approach is not "theoretically correct". The issue of unseen data arises for one simple reason: the model does not capture causal mechanisms in the DGP. The model instead learns a reduced form that represents the training sample. If the training sample is large and the model expressive, this leads to incredible results. However, it fundamentally differs from a causal model in that it could not make true out-of-sample predictions if these are counterfactual to the data. That is, it may work, but we cannot rely on it. It may randomly fail (and not depending on the robustness or smoothness of the model, but in joint dependence on the counterfactual DGP, if that makes sense).

This issue stands independent of the amount of data. Indeed, if you had a model that identified all relevant causal mechanisms, then of course more data is still better.

In some areas, ideas from causal statistical analyses seep in. Some physical systems are well described by ODE and PDEs - which can be learned by DNNs but retain their causal structure.

I see no general way forward, though. The approach you propose moves along the general idea of ML/AI - pose better challenges to the model and see if it works. This has worked well, perhaps it is the way to go.


>> The issue of unseen data arises for one simple reason: the model does not capture causal mechanisms in the DGP. The model instead learns a reduced form that represents the training sample.

Thanks for articulating it so clearly, that's very much how I think of it!

>> This issue stands independent of the amount of data. Indeed, if you had a model that identified all relevant causal mechanisms, then of course more data is still better.

Well, my point (and, as far as I understand it, Vapnik's point, in my link above) is that if a model can generalise well, then it doesn't need a lot of data, even if a lot of data is available. As for a model that can identify causal relations: it should be able to do so without lots of data. My intuition is that if the model needs lots of data, it will inevitably learn to represent the sampling error in the data generating process (DGP). A model that can learn good generalisations from few data, on the other hand, is in a sense immunised against sampling error, because it can ignore most of the data.

Vapnik makes an analogy about a good teacher. He asks: what is it that a good teacher does that helps his students learn better? He answers that the good teacher gives his students "good predicates", what I would call background knowledge, from which the students can build good statistical invariants. But the students must also be able to build good statistical invariants _on their own_, otherwise it doesn't matter what good predicates the teacher gives them; the students can't learn good invariants.

I think that's the point of the challenge - Vapnik is asking for learning from few examples (not unreasonably few, I think) as a way to demonstrate that a system is learning good underlying principles- his good predicates and statistical invariants. In the past the machine learning community has sidestepped the issue by finding ways to augment data where it was scarce and then claim progress in certain problems for which there were initially few examples (for example, Bongard problems and Winograd schemas) but Vapnik is not only calling for few data, but also for good predicates, which is I think a better defense than just asking for few-shot learning.

In a sense I think if we find that more data is better, then that means the system is still not learning the good bits that make more data unnecessary.


I really do like the practical challenge that Vapnik poses and based on what you wrote I support the agenda. In the face of the trial&error mentality prevalent in ML, it does strike me as a wise approach towards advancement.

However, the minor technicality remains in my view: there exists a case where a sufficiently complex causal mechanism can only be asymptotically identified from a given DGP. In that case, more data is better, and more data is perhaps even required; but also the model will be fully general and will not start to fit the randomness in the sample/DGP for this "mechanism". So indeed, we can theoretically construct situations where Vapnik's intuition does not hold. Then "more data is better => model is fitting sampling error" is more of a heuristic, even if it is a good one! If the model manages to disentangle error from mechanism, then more data could still improve the model while retaining generality.

Of course, one might propose that any causal mechanism will be identified "sufficiently well" in a finite sample, so there's a point where we have enough data. Perhaps this can even be proven, and my theoretical counterexample simply never exists. In that case, you'd be right: a model that keeps learning ad infinitum cannot be causally identified. I am not aware of such a proof, but I'd not be surprised if it were the case. Or, more practically speaking, one may be able to show that improvements to the causal estimates are, at some point, always inferior to the overfitting errors, no matter the DGP/model.

I would perhaps propose a reverse of the heuristic: we know a model is good if it can demonstrate that more data does NOT influence its invariant (causally identified) parts. This will be the challenge to solve for DNNs, as the combination of model + latent DGP necessarily makes this a matter of assumption. In classical models, we can formulate these assumptions mathematically, e.g. as exclusion restrictions and the like. For DNNs, I think we do not have such results yet.


> To summarise: learn to identify MNIST digits from 60 examples of each class, rather than 6000,

SOTA accuracy on a similar but more challenging problem (5-shot 20-way rather than 60-shot 10-way) appears to be around 99.6%: https://paperswithcode.com/sota/few-shot-image-classificatio...

> while retaining current accuracy.

Depends how strict you're being with this. There's room for the gap to shrink, but I think on average classifiers (whether organic or machine) with a large number of examples to go off of will always perform at least marginally better than classifiers with fewer examples.


>> SOTA accuracy on a similar but more challenging problem (5-shot 20-way rather than 60-shot 10-way) appears to be around 99.6%: https://paperswithcode.com/sota/few-shot-image-classificatio...

I'm aware of results like that but they're just kicking the can down the road with the other foot: they push the problem of training with big data to the pre-training stage and then claim to do "few-" or "one-shot" learning at the end, or even "zero-shot" which is egregious abuse of terminology (and it's very sad that it's accepted terminology). It's like the Aesop's fable where the sparrow hid in the eagle's feathers and jumped up at the last moment to claim "I'm the bird that flies the highest!".

Vapnik's point in the interview I linked above is that you should not need a lot of data for anything, including pre-training. His challenge is for the community to find what he calls good "predicates" which are primitive functions (I think of them as feature detectors) that can be composed into a good statistical invariant, a function representing a high-level concept while having good out-of-dataset generalisation ability. His claim is that if you have a bunch of good predicates and good invariants, then you don't need a lot of data, because the generalisation ability of the good predicates makes up for it. Or, seen another way, lots of data is needed _in the case when_ a good predicate is not known. In a certain way, transfer learning, or meta-learning in the case of the paper you link, is a step towards the right direction, but the reliance on big data for pre-training suggests that the models are still not learning good representations that generalise well - so they still need big data to make up for it.


> they push the problem of training with big data to the pre-training stage and then claim to do "few-" or "one-shot" learning at the end

Humans have had 4 billion years of natural selection and then 4 years of input from all senses before they start identifying digits. I've seen studies suggesting that we're already born with an area of our brain for recognizing letters and words.

Seems at least fair in comparison to allow MAML/pretraining to find a good starting model (e.g., one that can recognize lines and shapes) by utilizing data other than the classes of interest.

> It's like the Aesop's fable where the sparrow hid in the eagle's feathers and jumped up at the last moment to claim "I'm the bird that flies the highest!".

> you should not need a lot of data for anything

Is choosing suitable starting weights/architecture/"predicates" by hand-designing based on our own built up information qualitatively any different? It still seems like "hiding" utilization of a huge amount of background knowledge about digits/symbols/images/reality.

Arguably harder to expand that way too. I think techniques such as unsupervised learning are probably going to be a more feasible way to utilize the increasing amount of data we're collecting about the universe.

At our current stage, both seem useful. Broad strokes like moving from dense networks to convolutional networks to add locality and translational invariance based on our knowledge that this is an appropriate search space for vision tasks, and then automated methods like NAS and pretraining to determine relevance on a finer level.

We definitely haven't exhausted ways for us to use our intuition to guide networks in the right direction, such as transformers with their attention mechanisms or say a network inherently agnostic to horizontal flips rather than teaching that with data augmentation, but I'm skeptical about what sounds like stepping back into hand-crafted feature extraction which automated techniques have been far more effective at.

> but the reliance on big data for pre-training suggests that the models are still not learning good representations that generalise well - so they still need big data to make up for it.

Wouldn't it be lack of generalization to new tasks after the fact which indicates poor predicates? I don't see why good feature detectors should necessarily themselves be discoverable by hand or with low data, as that doesn't appear to have been the case for organic intelligence.


Yes, humans come into the world with seemingly a very large amount of background knowledge that we can then use to learn new concepts from very few examples. And as you say this is probably the result of many thousands of years of evolution.

But that's not a question of fairness; rather, it's a question of feasibility. If it took us many thousands of years, over many human generations, to learn our background knowledge from the real world, it's difficult to see how we can reproduce this result with the comparatively poor computational resources and data at our disposal.

There is a peculiar double-blindness in machine learning today, I think, where people are hoping to learn extremely difficult concepts, like meaning in language or like all of intelligence, from simultaneously too much and too little data. Too much because humans don't need to train on the entire web to learn meaning (and Large Language Models trained on the entire web still don't learn it). And too little because if you think of the complexity of the real world and the amount of information that we take in with our senses just sitting still looking around, this is an amount of information that can simply not be matched by the largest imaginable dataset that we could create.

So what's the alternative? I have a parable (oh no). What do you do when you need a fire? Well, clearly, you light a fire, maybe with matches or with a lighter, etc. That's because you know how to light a fire and because the implements to do so are now cheap commodities that most humans can afford easily (I bet even Kalahari bushmen use BIC lighters nowadays...). What you certainly don't do is sit around waiting for a fire to occur naturally, say by a lightning strike, like humans presumably did before discovering how to make fire from scratch. Because that could take ages, and because you have the knowledge necessary to not have to wait for ages. In the same way, we could wait around for ages trying to train systems to develop complex abilities like understanding or intelligence from ever larger datasets, which could take many decades, since, like I say, we have simultaneously too little and too much data. Or, we can find a way to transfer the background knowledge bestowed upon us by thousands of years of evolution to guide the training of our learning systems towards the goals we want them to achieve, whatever those are. We can give them the spark to start a fire. Or maybe we can't. But, if we can, then there is no sensible reason why we shouldn't.

To clarify, I'm not saying we should go back to feature engineering. Feature engineering was necessary in the past because there is no good way to imbue neural networks with background knowledge. Notably, it's not possible to use a trained neural net as a feature of another neural net, so it's not possible to build up from low-level concepts to higher-level concepts unless it's done end-to-end in the same model, which is limiting, and yet another reason for the gigantism of neural net training datasets. I don't know what the solution is, though. Clearly not explicitly coding expert knowledge in production rules as in expert systems. Much of our knowledge is maybe impossible to articulate explicitly. So we must find a way to encode implicit knowledge, also.

But, again, we don't have to encode _everything_. We can find good predicates, in Vapnik's terminology, and then let the learning systems do the rest. But that can only work _if_ our learning systems _can_ do the rest.

Yes, translational invariance in CNNs is a good example. But it's still not the whole story.

>> Is choosing suitable starting weights/architecture/"predicates" by hand-designing based on our own built up information qualitatively any different? It still seems like "hiding" utilization of a huge amount of background knowledge about digits/symbols/images/reality.

It's basically a trade-off. If you have good background knowledge, you don't need a lot of data. Good background knowledge helps you build robustly generalisable concepts. And if you can reuse the learned concepts as background knowledge, then the sky is the limit. But, if you don't have background knowledge, you need to make up for it, and the only way we know is to train on lots and lots of data- with the limitations that involves (overfitting, large computational costs, etc).

P.S. Sorry- this comment is a bit sloppy and hence overlong.


> But that's not a question of fairness, rather it's a question of feasibility.

I initially took the challenge's intention to be about highlighting modern machine learning's weaknesses compared to biological intelligence, and so barring certain already-existing generalization techniques seemed an arbitrary and asymmetrical restriction.

If it's more meant as "Classifiers can already achieve this particular goal, but I have a theory that human-determined 'predicates' will scale up better in the long run, I challenge you to progress my idea", then I currently disagree but understand.

> If it took us many thousands of years to learn our background knowledge from the real world over many human generations, it's difficult to see how we can reproduce this result with the comparatively poor computational resources and data at our disposal.

My belief would be that we can surpass this result with a combination of using our existing intuition alongside techniques that outperform evolution's hypo-glacial pace and inefficient data utilization.

Given the same narrow problem, some human insight for the search space and a couple of hours of gradient descent on a GPU can match what would take evolution many generations. That doesn't prove we'll get such a speedup on achieving broader intelligence, but at least natural selection hasn't appeared to be a speed limit so far.

> or, we can find a way to transfer the background knowledge bestowed upon us by thousand years of evolution to guide the training of our learning systems towards the goals we want them to achieve, whatever those are.

> To clarify, I'm not saying we should go back to feature engineering

> Clearly not explicitly coding expert knowledge in production rules as in expert systems. Much of our knowledge is maybe impossible to articulate explicitly. So we must find a way to encode implicit knowledge, also.

I'm all for finding ways to use human domain knowledge to guide the network in the right direction, essentially making use of a gigantic dataset from life's history. The trend seems to be to do this at an increasingly high level: weights are found by gradient descent, and hyperparameters by NAS or similar, but humans still designing various layers and blocks.

"Predicates" being a probably-small set of feature detectors which can describe all 2D images makes me think of something like eigenfaces, which it felt backwards to have humans determine. Maybe intended to be broader than that?

> It's basically a trade-off. If you have good background knowledge, you don't need a lot of data. Good background knowledge helps you build robustly generalisable concepts. And if you can reuse the learned concepts as background knowledge, then the sky is the limit.

I'd claim that this is in effect also what pretraining is. Pretraining and instilling a model with human background knowledge both allow faster low-data generalization to new tasks by utilizing large amounts of prior data. Difference is in whether the base data is organic or digital. Using both to find useful predicates seems most promising so far.


> learn to identify MNIST digits from 60 examples of each class, rather than 6000

More like 60,000 per class once augmentation has generated a bunch of new samples from each original!
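Back-of-the-envelope, with an assumed (made-up) augmentation grid: even a modest set of geometric transforms multiplies 60 originals per class into thousands of derived samples.

```python
from itertools import product

# Hypothetical augmentation grid: each original image is combined with
# every (rotation, shift, scale) setting to yield one derived sample.
rotations = [-10, -5, 0, 5, 10]                            # degrees
shifts = [(dx, dy) for dx in (-2, 0, 2) for dy in (-2, 0, 2)]  # pixels
scales = [0.9, 1.0, 1.1]

variants_per_image = len(rotations) * len(shifts) * len(scales)
total = 60 * variants_per_image
print(variants_per_image, total)  # prints: 135 8100
```

Whether those 8100 samples carry 8100 samples' worth of information is exactly the point under debate.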


As long as (stochastic) gradient descent and random initialization is the state of the art, more data will be better.


Random initialization is already starting to give way. One of the more promising approaches right now is pretraining with synthetic data to obtain "good" initial weights. Once that converges, you've got a network that internally recognizes interesting features, which is a better-than-random starting point for training with real-world data.

cf. https://arxiv.org/abs/2106.05963
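A toy illustration of the principle, shrunk to a one-parameter model (all data and constants below are invented): pretrain on cheap synthetic samples drawn from roughly the right function family, so that fine-tuning on scarce "real" data starts from a sensible place instead of from random initialization.

```python
import random

def sgd(w, data, lr=0.01, epochs=200):
    """Minimise sum (w*x - y)^2 over data by per-sample gradient descent."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

rng = random.Random(0)
# Cheap synthetic data from an assumed generator with slope 3.0 plus noise.
synthetic = [(x, 3.0 * x + rng.gauss(0, 0.5))
             for x in [rng.uniform(-1, 1) for _ in range(500)]]
# Scarce "real" data, whose true slope (~3.15) is near the synthetic regime.
real = [(0.5, 1.6), (-0.8, -2.5), (0.3, 0.95)]

w_pretrained = sgd(0.0, synthetic)            # lands near the right regime
w_final = sgd(w_pretrained, real, epochs=20)  # short fine-tune on real data
```

The linked paper does this at network scale, but the mechanism is the same: synthetic pretraining buys a starting point from which little real data is needed.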


Pretty clickbaity. Lots of "some argue", "some say", "has estimated", and "striving to", but not much substance about actual successes. I believe both Tesla and Cruise are working in this direction but there are serious issues to be worked out. I also vaguely remember some work on pose estimation being helped by generating renderings. Going over real successes would make for a more convincing article.



My thought exactly. Also, Hacker News submissions are generally meant to use the title from the article.


If a computer program is generating the training data, aren't you just training the AI to do the same thing as the already existing computer program does?


Not at all. The existence of the Gran Turismo game (a simulator) is not the same thing as an AI that can play Gran Turismo.


But a better analogy would be if an AI generated a computer game that another computer can learn to play.

In the end, it’s more like an anonymization layer than anything. If a computer is trained to generate input data for other computers to train with, there’s not a lot special going on.


Generating an environment and acting within it are wildly different things. E.g., Tesla generates virtual camera footage of traffic situations they want their vehicles to handle correctly. The footage generator is basically a scripted video-game director, while the trained AI is one of the most complex software projects ever.


> But a better analogy would be if an AI generated a computer game that another computer can learn to play.

Doesn't the latter AI have a policy which contains novel information that does not exist in the former AI?

Even if what you say is true in some abstract information theory sense (and I would question that), there is a world of practical difference in the usefulness of a trained self-driving AI and the game engine within which that AI functions.


But.... if you were to train a model on the simulator of the game, you would expect it to pick up on the rules programmed into the simulator.

This is really no different from it picking up on the rules embedded in the gathering of real-world data. Any implicit and hidden decisions in that space would be expected to find their way into the ML.


The tricky part is that the simulator itself may not have easy-to-understand rules. Waymo has a NeurIPS talk about training world-agent models that are used for car behavior in the simulation itself. Trying to make world agents that are indistinguishable from real-world vehicle behavior (e.g., minimizing Jensen-Shannon divergence) is a completely different task from training a model to safely transport you somewhere.


Right. That was my angle. Sometimes they are easy to see after the fact. But, just as bias can enter into data collection, so too can it enter simulation.


Yes, but isn't the optimal policy novel information?


But is it free of bias, and is it a good execution policy in the real world?


That probably depends on how realistic the simulator is. Your trade off is more bias (due to simulation inaccuracy) in exchange for unlimited data (and hence less variance).


Agreed. I could see a good case for training heavily on simulated data, then verifying/validating on collected data.

My question is ultimately how this really helps make the system ethical. It just moves the bias from collection to simulation. And I can't see simulation being any less affected by bias.


Ethics is not the right question here; we simply want it to crash less and kill fewer people.


The headline of the article brings in ethics.


I thought the same thing ...

Simulations are a daily part of life in research and development: from real-world simulations in research (e.g., climate) and industry (often in spreadsheets); to the theory of gravity (gravity simulated with mathematics) and every other theory in science, social science, and the humanities; to developing your iPhone app on your laptop, or just reading the train schedule or using a mapping program.

So what is the difference? Those simulations were built from reality. The Theory of Gravity was built from and confirmed with empirical observations, not from someone else's simulation of gravity! That essential foundation of science, reality (it's not science otherwise), is what is missing.

Also, we already have the problem of our biases and preconceived notions infecting training data, and of AI becoming a simulation of that rather than of reality. By then training on 'simulated data' (yikes!), we seem to create even more of a loop.


If machine learning (ML) is trained on human behavior, it can never be better (for some measure of better) than people are. So if racism is widespread, it will influence decisions trained into an ML algorithm.

That raises the question: can we train an ML algorithm to make the decisions we _want_, rather than base its decision making on what people currently do? Training on generated data might be one way to do that: to build in implicit biases because we want more fairness in decision making than people currently exhibit. But then who gets to choose what those "goal" biases are? Someone could add a bias that improves life for people of male gender and worsens it for everyone else, for example.

This seems like a really important problem that has no clear answer. It's very much related to electing politicians, where the choice of politician is also a choice of future goals. We don't know of any objective way to always decide what goals are best for everyone (voting certainly is not objective, since many voters are not aware of all possible policy/goal implications and can be tricked into voting against their own interests).

Yet it seems certain that researchers are working to introduce selective biases into algorithms, if only to adjust for known biases that are problematic. As an opaque input into an opaque algorithm, biases become invisible in deployments and could become very difficult to reverse or fix later. Even with continuous learning, if people's behavior is the input, the result will never be better than people currently are, which can create problems for the future. Some kind of intentional bias aimed towards goals seems necessary, yet also seems very dangerous, since it can introduce biases decided by a very small set of people.


Excellent points. It's human beings all the way down.


The fact that you can compute y = f(x) doesn't imply you already know x = f⁻¹(y).
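A toy illustration of that asymmetry, using a hash function as f (the names `f` and `brute_force_inverse` are made up for the example):

```python
import hashlib

# Forward direction: computing y = f(x) is one cheap call.
def f(x: str) -> str:
    return hashlib.sha256(x.encode()).hexdigest()

y = f("cat")

# Inverse direction: there is no formula for x = f⁻¹(y); the
# generic fallback is searching the input space candidate by
# candidate, which is a completely different (and harder) problem.
def brute_force_inverse(target, candidates):
    for x in candidates:
        if f(x) == target:
            return x
    return None

recovered = brute_force_inverse(y, ["dog", "bird", "cat"])
```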


If you want an AI that can recognize cats, and train it with computer generated pictures of cats, you may wind up with an AI that only recognizes cat pictures from that particular generator.


This tradeoff is known as synthetic domain shift and is still an active area of research. https://paperswithcode.com/task/synthetic-to-real-translatio...


No. We had realistic 3D graphics 20 years ago.

I'm not aware of any true 3D computer vision system that could reliably play those games (from just vision).


You can train an AI to do the inverse of the existing program (as is the case for the self-driving described in the article.) Take some input, generate output using the existing program, and then train the AI with the input/output reversed.
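A minimal sketch of that trick with a toy forward program (the real case would render images or drive a simulator; `forward` and `make_inverse_dataset` are hypothetical names):

```python
import random

# The existing, easy-to-run program: number -> formatted text.
def forward(x: float) -> str:
    return f"{x:,.2f}"

# Generate synthetic pairs with the forward program, then swap
# them so a model can be trained on the inverse task (text -> number).
def make_inverse_dataset(n: int, seed: int = 0):
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.uniform(0.0, 1_000_000.0)
        pairs.append((forward(x), x))  # (model input, model target)
    return pairs

dataset = make_inverse_dataset(1000)
```

The labels come for free: every target is correct by construction, because the forward program produced the input from it.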



At least in the case of reinforcement learning, no. Just because you can simulate the problem doesn't mean you know how to optimally solve it, e.g. driving a car.


An obvious counterexample. I was young in 2005 when Tesseract was open-sourced, and I wanted to use it for something. But I decided to try it out first by writing something in Notepad, screenshotting it, and running that through. Synthetic data! But no, I didn't know how to make the existing computer program turn an image into text.


As a contrast to the other comments: you will still get biases wherever the simulation differs from reality.

E.g. if your simulated traffic lights don't blink at 60 Hz, a model trained on them won't know how to handle it.


Generating a realistic city and a self driving AI are two wildly different tasks


Even a simpler task like image classification such as: does the picture contain a lion. Imagine you have 3d model of a lion. You can render it from lots of different angles, lighting conditions, backgrounds, stretched out, curled up, etc. You know the ground truth classification on all renderings is that the picture contains a lion, but being able to generate images of lions is a very different task from recognizing lions in images.
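The generation side of that setup is often called domain randomization: the label is known by construction while everything else varies. A minimal sketch (the parameter names and the renderer that would consume them are hypothetical):

```python
import random

# Sample rendering conditions for one synthetic image. A real
# pipeline would feed these into a 3D renderer; here we only
# show the sampling, with the ground-truth label fixed.
def sample_render_params(rng: random.Random) -> dict:
    return {
        "yaw_deg": rng.uniform(0, 360),
        "pitch_deg": rng.uniform(-30, 30),
        "light_intensity": rng.uniform(0.2, 2.0),
        "background_id": rng.randrange(1000),
        "pose": rng.choice(["standing", "curled", "stretched"]),
    }

rng = random.Random(7)
# Every example keeps the same label, no human annotation needed.
examples = [(sample_render_params(rng), "lion") for _ in range(100)]
```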


The potential issue with using synthetic data to simulate the problem (like image classification) is that recognizing a lion in synthetic imagery and recognizing a lion in real imagery may also be very different tasks to a computer.


Not that much, actually, because the images are handled at a lower resolution and with a smaller set of colors to make the models train faster (a lot faster).

So they look pretty much the same to a computer vision algorithm.


This is relevant to healthcare startups where it is extremely difficult to get your hands on enough real Protected Health Information to do any interesting ML work (unless you are already part of an enormous company with PHI... and even then it is harder than you'd think).


I worked at a health insurance company and had access to a lot of data (for ML research). It was impossible to lend that data to contractors for nearly any reason and this frustrated us.


One of my companies has a patent in this space (Neuri) (rather defunct now). It worked exceptionally well for time series data.

Used similar work at Ascent.ai for robotics data (visual mostly) and that worked very well.

In fact I don't think we ever really saw an issue with this approach.


Ascent as in the startup in Tokyo? Didn't the self-driving-car approach fail so miserably that they had to do a 180-degree pivot to P&P robotics applications?

Interesting, but sim2real has had some lab success for the past few years. Making it work in real applications seems to be much trickier, especially profitably.


Odd, my impression is that everything moves in the opposite direction. Good synthetic data is expensive. Recording the real world is free.

Just a variational autoencoder with a discrete latent space is good enough to learn a usable phoneme recognition and pronunciation model from raw WAV files with unsupervised learning.

And CLIP shows just how far you can get without supervision. So what's the point in paying for artificial data if you can solve the problem without it?
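The discrete-latent step such a model relies on (vector quantization, as in a VQ-VAE) can be sketched in a few lines. This is only the quantization bottleneck, not a full trained model:

```python
import numpy as np

# Snap each encoder output frame to its nearest codebook vector,
# turning continuous features into a sequence of discrete codes
# (which, for speech, end up correlating with phoneme-like units).
def quantize(z: np.ndarray, codebook: np.ndarray):
    # z: (T, D) encoder outputs; codebook: (K, D) code vectors
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    codes = dists.argmin(axis=1)   # one discrete token per frame
    return codes, codebook[codes]  # tokens and quantized embeddings

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # K=8 codes of dimension 4
z = rng.normal(size=(10, 4))         # T=10 encoder frames
codes, zq = quantize(z, codebook)
```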


What. That really depends on your task. In my field, real data is extremely expensive and if we could generate believable synthetic data, we'd save a ton of money.


> if we could generate believable synthetic data, we'd save a ton of money.

It sounds like you are saying good synthetic data is more expensive?


With synthetic data you can also generate the thing you are trying to solve with the AI in the first place, like generating a depth map, and an object classification map to go with a simulated image.

With real world data a human will have to label it.


Yes but AIs trained on synthetic depth maps tend to learn the oddities and noise patterns of the rendering, not actually the depth recognition part.

That's why so many AIs that work almost perfectly on the Sintel benchmark (synthetic flow and depth data) then fail to transfer over to KITTI (real images with lidar).


Recording real-world data may be cheap (not "free") in some circumstances, but properly labeling real-world data so it will be useful can be incredibly expensive.

Even if you're doing completely unsupervised learning, you will still need correctly labeled data at some point to validate your model.

And - I tell you as someone who has actually tried to do this - getting a wide range of high-quality voice recordings across multiple speakers is NOT CHEAP, and trying to train a model that will work under real-world conditions using "whatever dataset I downloaded from kaggle.com" is a fool's errand.


Recording real world data can be cheap for some use cases, but often labeling it is very expensive.


> Recording the real world is free.

Waiting for edge cases to occur, in the real world, is definitely not free.


What a strange article. I think that a company like Nvidia could probably provide a great deal of additional value to the world of scientific computer modeling. And I think that simulations can work really well to assist in training models, I don't really understand why that would be up for debate, they already do. What I don't understand is talking about "simulating the entire world down to atomic interactions" or teleporting to Mars via data collected from..."sensors".

Little sections of this, particularly wrt pushing towards more standardized systems for building computer models make sense and seem like a worthwhile goal, but most of this reads like nonsense to me.


Does this not depend on the source data?

Some datasets are easy to render, generate, whatever. In that scenario, sure! Seems like a solid case can be made, particularly where an analytic approach can speak to the data elements needed.

Other data needs to be sourced from the world. That's harder, and it's extremely likely that artificial data is either too expensive to create at the required fidelity to make economic sense, or that the cases are too numerous for an analytic approach to cover them all.


"Some argue"? We've been doing this for the last ten years and that's just as far as my career goes back.


You're right. This isn't new or controversial. I don't know why HN is so weak on data science. Maybe software devs are weak on math in general.


How cool would it be if computers not only used synthetic data for training, but also simulated the outcome of their own actions before taking them. And not only their own actions: like a game of chess, the simulation would also include all the possible immediate actions of the other actors.


Dreams are basically synthetic data that train our brains while we sleep.


well that just sounds like programming with extra steps.



