Some years ago I worked at a startup that was doing OCR on paper receipts. As part of my application to the company, I wrote a synthetic training data generator[0] to produce a range of CG receipts, along with pixel-perfect labeled XY bounding boxes for each letter. Generating synthetic training data gives you a high degree of flexibility over the shape of your data. It lets you focus on strengthening edge cases where you just don't have enough real-world data.
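As an illustration (a toy sketch, not the generator mentioned above; the monospace glyph size and helper name are assumptions of mine), laying text out on a fixed grid makes per-character boxes come for free:

```python
# Toy synthetic-receipt layout: place each character of a receipt line on
# a monospace grid and emit a pixel bounding box per character.

CHAR_W, CHAR_H = 8, 12  # assumed glyph size in pixels

def layout_line(text, x0=4, y0=4):
    """Return (char, (left, top, right, bottom)) for each non-space char."""
    boxes = []
    for i, ch in enumerate(text):
        if ch == " ":
            continue
        left = x0 + i * CHAR_W
        boxes.append((ch, (left, y0, left + CHAR_W, y0 + CHAR_H)))
    return boxes

labels = layout_line("TOTAL  $12.99")
```

A real generator would rasterize fonts, add noise, crumple, and vary lighting, but the labeling principle is the same: because the program placed every glyph, it knows every box exactly.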
When developing anything in the real world, you don't wait for edge cases to happen naturally. You force them to happen, and look at the response. This is the rigor of engineering. Has that been lost?
I often hear "edge case" offered as a defense for why something hasn't been developed or handled. There's an ongoing trend in software to avoid high-hanging fruit and focus on the low-hanging, big, fast returns. I just accept it now, although it bothers me every time I see it happen.
The beauty and allure of science and engineering, to me, is understanding something so well you can predict its future behavior and then predicting it so well you can prevent undesired behaviors and create or improve desired behaviors, within some set of bounds of course.
> I just accept it now, although it bothers me every time I see it happen.
That's all fine and dandy when it has no real consequences, which is usually the case. But the use cases of ML increasingly have very real consequences. I think the current Wild West mentality is great for greasing innovation, but not so great when a self-driving car starts slamming into poles at sunset whenever there's a blinking red "Open" sign behind them.
In autonomous vehicle development 'edge cases' are the things that engineers would never conjure up via thought experiment. They need to be discovered through real world testing.
Once an edge case has been discovered, they can then artificially generate myriad subtle variations on the edge case, which they use to train their systems.
I think the point is that you can augment/synthesize a video with all sorts of random lighting conditions, occlusions, etc., that could take an impossible number of miles to encounter naturally. Not filling in some of the search space seems silly to me, especially if you're instead filling it with real-life data consisting of nearly 100% mundane, near-ideal conditions.
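A minimal numpy sketch of that kind of augmentation, with hypothetical brightness and occlusion ranges:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, rng):
    """Random brightness shift plus a random rectangular occlusion."""
    out = img.astype(np.float32) * rng.uniform(0.4, 1.6)  # lighting change
    h, w = out.shape[:2]
    # Occlude a random quarter-size patch, e.g. a sign or pole in the way.
    oh, ow = h // 4, w // 4
    y = rng.integers(0, h - oh)
    x = rng.integers(0, w - ow)
    out[y:y + oh, x:x + ow] = 0.0
    return np.clip(out, 0, 255).astype(np.uint8)

frame = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
variants = [augment(frame, rng) for _ in range(8)]
```

Each recorded frame yields many labeled variants covering conditions the fleet may never have driven through.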
However, the issue is not quite as simple when training a model. For instance, the increasingly popular Transformer architecture shifts its internal representations around as it is fed new training samples. These representations are not "causal parameters": they do not represent unchanging mechanisms in real life. Instead, they depend on the data.
The question thus becomes: how prominent should edge cases be in the sample? An ML model trained on many edge cases may be more robust on the edge cases we feed it, but it may be more unstable on real-world data (in essence, the model may develop new edge cases when new data comes in, cases that would not have been edge cases before).
All of this comes from the fact that the parameters of the model are not "causal" for the underlying latent DGP (data-generating process). ML, in contrast to stats, does not yet have much theory about how to think of parameters relative to the sample of observations one trains on. The question of "causal" identification in stats is easier to tackle, because one can usually define a "scientific" or "structural" model in which parameters have meaning, so we can reason about whether we capture "real, causal parameters" (which would be compatible with edge cases) or "reduced-form parameters" (which generally depend on the sample we have).
ML will likely get there. Right now, people think in terms of "data shift" (which they shouldn't) and generalizing (missing the point a bit). But we will get there.
By the way, this issue arose because ML systems were treated like engineering systems instead of statistical models. For better or worse.
Data augmentation is a big thing in applied machine learning. Deep learning practitioners use data augmentation because it works incredibly well. However, deep learning researchers tend to view data augmentation as some kind of dirty trick that wouldn't be necessary if you just had a bigger dataset.
It's also about the issue that people will bias against generating the "difficult" data, even if only subconsciously. Real-world validation is essential to reveal whether you have actually worked through the full phase space.
I guess the field is constantly making a decision between 'do we want to outperform humans, but lose interpretability' and 'we should always be interpretable'
Human-generated artificial data will always contain the human's assumptions. But this data might not contain the 'superhuman' element that leads to the ML-system outperforming humans, because we don't know what that is (yet). Receipts are a good example for artificial data, because they're human-made.
But a lot of what we do with ML systems uses data that doesn't come from humans: images of wildlife, satellite images, biological data, etc.
>> The solution is to just have more data and better data.
Nah, sorry, that is just trying to put out the fire by throwing fuel at it. The
big, big weakness of neural networks right now is their reliance on gigantic
datasets that require gigantic computational resources. Neural nets need those
because they can't generalise to unseen data. So people try to get them to see
as much data as possible during training. They still overfit, but if they can
overfit to a diverse enough dataset, then they can be useful in practice, even
if that's only to solve narrow, specific instances of a problem (like in the
domino recognition system in the article).
To address this weakness what is needed is to find ways to make neural nets less
reliant on data, not to find ways to make more data. Make neural nets capable of
generalising robustly to unseen data, from few training instances. Then you
don't need to train in a simulation. Of course that would require a radical
rethink of how deep neural nets are trained (even perhaps whether they remain
"deep", or whether they are trained using gradient descent, the sources of their
data-hunger). Trying to make more data by simulation is only kicking the can
down the road; the only effect it can possibly have is to push the time at
which the real limitation must be addressed even further down the line, so
that another generation of researchers has to deal with it while the current
generation can keep getting their papers published and their grants granted.
See Vladimir Vapnik's challenge to the machine vision community:
To summarise: learn to identify MNIST digits from 60 examples of each class,
rather than 6000, while retaining current accuracy. (My words now:) Improve
sample efficiency to improve neural nets. Neural nets have shown a remarkable
ability to work well when large amounts of resources are available. Now, do as
everyone else does in computer science and try to make them (sample-)efficient.
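For concreteness, a small numpy sketch of posing the challenge, i.e. cutting a stand-in for the MNIST training set down to 60 examples per class (the helper name is my own):

```python
import numpy as np

def subsample_per_class(labels, n_per_class, seed=0):
    """Indices of at most n_per_class examples per label, Vapnik-style."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        keep.extend(rng.choice(idx, size=min(n_per_class, idx.size),
                               replace=False))
    return np.sort(np.array(keep))

labels = np.repeat(np.arange(10), 6000)   # stand-in for MNIST labels
small = subsample_per_class(labels, 60)   # 600 indices instead of 60,000
```

The challenge is then to train on `small` alone while matching the accuracy of training on the full set.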
This is always an interesting discussion. I think what you say is probably right "in practice" - that is, it is where ML should move now to improve its ability to generalize.
However, from a more statistical viewpoint, this approach is not "theoretically correct". The issue of unseen data arises for one simple reason: the model does not capture causal mechanisms in the DGP. The model instead learns a reduced form that represents the training sample. If the training sample is large and the model expressive, this leads to incredible results. However, it fundamentally differs from a causal model in that it could not make true out-of-sample predictions if these are counterfactual to the data. That is, it may work, but we cannot rely on it. It may randomly fail (and not depending on the robustness or smoothness of the model, but jointly depending on the counterfactual DGP, if that makes sense).
This issue stands independent of the amount of data. Indeed, if you had a model that identified all relevant causal mechanisms, then of course more data is still better.
In some areas, ideas from causal statistical analyses seep in. Some physical systems are well described by ODE and PDEs - which can be learned by DNNs but retain their causal structure.
I see no general way forward, though. The approach you propose moves along the general idea of ML/AI - pose better challenges to the model and see if it works. This has worked well, perhaps it is the way to go.
>> The issue of unseen data arises for one simple reason: the model does not
capture causal mechanisms in the DGP. The model instead learns a reduced form
that represents the training sample.
Thanks for articulating it so clearly, that's very much how I think of it!
>> This issue stands independent of the amount of data. Indeed, if you had a
model that identified all relevant causal mechanisms, then of course more data
is still better.
Well, my point -and, as far as I understand it, Vapnik's point, in my link
above- is that if a model can generalise well, then it doesn't need a lot of
data, even if a lot of data is available. In terms of a model that can identify
causal relations, then it should be able to do so without lots of data. My
intuition for that is that if the model needs lots of data it will inevitably
learn to represent the sampling error in the data generation process (DGP?). A
model that can learn good generalisations from few data on the other hand is in
a sense immunised against sampling error because it can ignore most of the data.
Vapnik makes an analogy about a good teacher. He asks, what is it that a good
teacher does that helps his students learn better? He answers it by saying that
the good teacher gives his students "good predicates", what I would call
background knowledge, from which the students can build good statistical
invariants. But the students must also be able to build good statistical
invariants _on their own_; otherwise it doesn't matter what good predicates the
teacher gives them, the students can't learn good invariants.
I think that's the point of the challenge - Vapnik is asking for learning from
few examples (not unreasonably few, I think) as a way to demonstrate that a
system is learning good underlying principles- his good predicates and
statistical invariants. In the past the machine learning community has
sidestepped the issue by finding ways to augment data where it was scarce and
then claim progress in certain problems for which there were initially few
examples (for example, Bongard problems and Winograd schemas) but Vapnik is not
only calling for few data, but also for good predicates, which is I think a
better defense than just asking for few-shot learning.
In a sense I think if we find that more data is better, then that means the
system is still not learning the good bits that make more data unnecessary.
I really do like the practical challenge that Vapnik poses, and based on what you wrote I support the agenda. In the face of the trial-and-error mentality prevalent in ML, it strikes me as a wise approach toward advancement.
However, a minor technicality remains in my view: there exists a case where a sufficiently complex causal mechanism can only be asymptotically identified from a given DGP. In that case, more data is better, and more data is perhaps even required; but the model will also be fully general and will not start to fit the randomness in the sample/DGP for this "mechanism". So we can indeed theoretically construct situations where Vapnik's intuition does not hold. Then "more data is better => the model is fitting sampling error" is more of a heuristic, even if a good one! If the model manages to disentangle error from mechanism, then more data could still improve the model while retaining generality.
Of course, one might propose that any causal mechanism will be identified "sufficiently well" in a finite sample, so there's a point where we have enough data. Perhaps this can even be proven, and my theoretical counterexample simply never exists. In that case, you'd be right: a model that keeps learning ad infinitum cannot be causally identified. I am not aware of such a proof, but I'd not be surprised if it were the case. Or, more practically speaking, one may be able to show that improvements to the causal estimates are, at some point, always inferior to the overfitting errors, no matter the DGP/model.
I would perhaps propose a reverse of the heuristic: we know a model is good if it can demonstrate that more data does NOT influence its invariant (causally identified) parts. This will be the challenge to solve for DNNs, as the combination of model + latent DGP necessarily makes this a matter of assumption. In classical models, we can formulate these assumptions mathematically, e.g. as exclusion restrictions and the like. For DNNs, I think we do not have such results yet.
Depends how strict you're being with this. There's room for the gap to shrink, but I think on average classifiers (whether organic or machine) with a large number of examples to go off of will always perform at least marginally better than classifiers with fewer examples.
I'm aware of results like that but they're just kicking the can down the road
with the other foot: they push the problem of training with big data to the
pre-training stage and then claim to do "few-" or "one-shot" learning at the
end, or even "zero-shot" which is egregious abuse of terminology (and it's very
sad that it's accepted terminology). It's like the Aesop's fable where the
sparrow hid in the eagle's feathers and jumped up at the last moment to claim
"I'm the bird that flies the highest!".
Vapnik's point in the interview I linked above is that you should not need a lot
of data for anything, including pre-training. His challenge is for the community
to find what he calls good "predicates" which are primitive functions (I think
of them as feature detectors) that can be composed into a good statistical
invariant, a function representing a high-level concept while having good
out-of-dataset generalisation ability. His claim is that if you have a bunch of
good predicates and good invariants, then you don't need a lot of data, because
the generalisation ability of the good predicates makes up for it. Or, seen
another way, lots of data is needed _in the case when_ a good predicate is not
known. In a certain way, transfer learning, or meta-learning in the case of the
paper you link, is a step towards the right direction, but the reliance on big
data for pre-training suggests that the models are still not learning good
representations that generalise well - so they still need big data to make up
for it.
> they push the problem of training with big data to the pre-training stage and then claim to do "few-" or "one-shot" learning at the end
Humans have had 4 billion years of natural selection and then 4 years of input from all senses before they start identifying digits. I've seen studies suggesting that we're already born with an area of our brain for recognizing letters and words.
Seems at least fair in comparison to allow MAML/pretraining to find a good starting model (e.g: can recognize lines and shapes) by utilizing data other than the classes of interest.
> It's like the Aesop's fable where the sparrow hid in the eagle's feathers and jumped up at the last moment to claim "I'm the bird that flies the highest!".
> you should not need a lot of data for anything
Is choosing suitable starting weights/architecture/"predicates" by hand-designing based on our own built up information qualitatively any different? It still seems like "hiding" utilization of a huge amount of background knowledge about digits/symbols/images/reality.
Arguably harder to expand that way too. I think techniques such as unsupervised learning are probably going to be a more feasible way to utilize the increasing amount of data we're collecting about the universe.
At our current stage, both seem useful. Broad strokes like moving from dense networks to convolutional networks to add locality and translational invariance based on our knowledge that this is an appropriate search space for vision tasks, and then automated methods like NAS and pretraining to determine relevance on a finer level.
We definitely haven't exhausted ways for us to use our intuition to guide networks in the right direction, such as transformers with their attention mechanisms or say a network inherently agnostic to horizontal flips rather than teaching that with data augmentation, but I'm skeptical about what sounds like stepping back into hand-crafted feature extraction which automated techniques have been far more effective at.
> but the reliance on big data for pre-training suggests that the models are still not learning good representations that generalise well - so they still need big data to make up for it.
Wouldn't it be lack of generalization to new tasks after the fact which indicates poor predicates? I don't see why good feature detectors should necessarily themselves be discoverable by hand or with low data, as that doesn't appear to have been the case for organic intelligence.
Yes, humans come into the world with seemingly a very large amount of background
knowledge that we can then use to learn new concepts from very few examples. And
as you say this is probably the result of many thousands of years of evolution.
But that's not a question of fairness, rather it's a question of feasibility. If
it took us many thousands of years to learn our background knowledge from the
real world over many human generations, it's difficult to see how we can
reproduce this result with the comparatively poor computational resources and
data at our disposal.
There is a peculiar double-blindness in machine learning today, I think, where
people are hoping to learn extremely difficult concepts, like meaning in
language or like all of intelligence, from simultaneously too much and too
little data. Too much because humans don't need to train on the entire web to
learn meaning (and Large Language Models trained on the entire web still don't
learn it). And too little because if you think of the complexity of the real
world and the amount of information that we take in with our senses just sitting
still looking around, this is an amount of information that can simply not be
matched by the largest imaginable dataset that we could create.
So what's the alternative? I have a parable (oh no). What do you do when you
need a fire? Well, clearly, you light a fire, maybe with matches or with a
lighter etc. That's because you know how to light a fire and because the
implements to do so are now cheap commodities that most humans can afford easily
(I bet even Kalahari bushmen use BIC lighters nowadays...). What you certainly
don't do is sit around waiting for a fire to occur naturally, say by a
lightning strike, like humans presumably did before discovering how to make fire from
scratch. Because that could take ages and because you have the knowledge
necessary to not have to wait for ages. In the same way, we could wait around
for ages trying to train systems to develop complex abilities like understanding
or intelligence from ever larger datasets, which can take many decades, since,
like I say, we have simultaneously too little and too much data; or, we can find
a way to transfer the background knowledge bestowed upon us by thousands of years of
evolution to guide the training of our learning systems towards the goals we
want them to achieve, whatever those are. We can give them the spark to start a
fire. Or maybe we can't. But, if we can, then there is no sensible reason why we
shouldn't.
To clarify, I'm not saying we should go back to feature engineering. Feature
engineering was necessary in the past because there is no good way to imbue
neural networks with background knowledge. Notably, it's not possible to use a
trained neural net as a feature of another neural net, so it's not possible to
build up from low-level concepts to higher-level concepts unless it's done
end-to-end in the same model, which is limiting, and yet another reason for the
gigantism of neural net training datasets. I don't know what the solution is,
though. Clearly not explicitly coding expert knowledge in production rules
as in expert systems. Much of our knowledge may be impossible to articulate
explicitly. So we must find a way to encode implicit knowledge, also.
But, again, we don't have to encode _everything_. We can find good predicates,
in Vapnik's terminology, and then let the learning systems do the rest. But that
can only work _if_ our learning systems _can_ do the rest.
Yes, translational invariance in CNNs is a good example. But it's still not the
whole story.
>> Is choosing suitable starting weights/architecture/"predicates" by
hand-designing based on our own built up information qualitatively any
different? It still seems like "hiding" utilization of a huge amount of
background knowledge about digits/symbols/images/reality.
It's basically a trade-off. If you have good background knowledge, you don't
need a lot of data. Good background knowledge helps you build robustly
generalisable concepts. And if you can reuse the learned concepts as background
knowledge, then the sky is the limit. But, if you don't have background
knowledge, you need to make up for it, and the only way we know is to train on
lots and lots of data- with the limitations that involves (overfitting, large
computational costs, etc).
P.S. Sorry, this comment is a bit sloppy and hence overlong.
> But that's not a question of fairness, rather it's a question of feasibility.
I initially took the challenge's intention to be about highlighting modern machine learning's weaknesses compared to biological intelligence, and so barring certain already-existing generalization techniques seemed an arbitrary and asymmetrical restriction.
If it's more meant as "Classifiers can already achieve this particular goal, but I have a theory that human-determined 'predicates' will scale up better in the long run, I challenge you to progress my idea", then I currently disagree but understand.
> If it took us many thousands of years to learn our background knowledge from the real world over many human generations, it's difficult to see how we can reproduce this result with the comparatively poor computational resources and data at our disposal.
My belief would be that we can surpass this result with a combination of using our existing intuition alongside techniques that outperform evolution's hypo-glacial pace and inefficient data utilization.
Given the same narrow problem, some human insight for the search space and a couple of hours of gradient descent on a GPU can match what would take evolution many generations. That doesn't prove we'll get such a speedup on achieving broader intelligence, but at least natural selection hasn't appeared to be a speed limit so far.
> or, we can find a way to transfer the background knowledge bestowed upon us by thousand years of evolution to guide the training of our learning systems towards the goals we want them to achieve, whatever those are.
> To clarify, I'm not saying we should go back to feature engineering
> Clearly not explicitly coding expert knowledge in production rules as in expert systems. Much of our knowledge is maybe impossible to articulate explicitly. So we must find a way to encode implicit knowledge, also.
I'm all for finding ways to use human domain knowledge to guide the network in the right direction, essentially making use of a gigantic dataset from life's history. The trend seems to be to do this at an increasingly high level: weights are found by gradient descent, and hyperparameters by NAS or similar, but humans still designing various layers and blocks.
"Predicates" being a probably-small set of feature detectors which can describe all 2D images makes me think of something like eigenfaces, which it felt backwards to have humans determine. Maybe intended to be broader than that?
> It's basically a trade-off. If you have good background knowledge, you don't need a lot of data. Good background knowledge helps you build robustly generalisable concepts. And if you can reuse the learned concepts as background knowledge, then the sky is the limit.
I'd claim that this is in effect also what pretraining is. Pretraining and instilling a model with human background knowledge both allow faster low-data generalization to new tasks by utilizing large amounts of prior data. Difference is in whether the base data is organic or digital. Using both to find useful predicates seems most promising so far.
Random initialization is already starting to give way. One of the more promising approaches right now is pretraining with synthetic data to obtain "good" initial weights. Once that converges, you've got a network that internally recognizes interesting features, which is a better-than-random starting point for training with real-world data.
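A toy numpy sketch of that warm-start idea, using linear regression as a stand-in for a network (the synthetic task, noise levels, and step counts are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def gd(X, y, w, lr=0.1, steps=200):
    """Plain gradient descent on mean squared error."""
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

true_w = np.array([2.0, -1.0, 0.5])

# "Synthetic" pretraining data: cheap to generate, roughly the right task.
Xs = rng.normal(size=(1000, 3))
ys = Xs @ true_w + rng.normal(scale=0.5, size=1000)

# Pretrain from random init; the result becomes the warm start.
w0 = rng.normal(size=3)
w_pre = gd(Xs, ys, w0)

# Scarce "real" data: fine-tune the warm start with a few steps.
Xr = rng.normal(size=(30, 3))
yr = Xr @ true_w + rng.normal(scale=0.1, size=30)
w_final = gd(Xr, yr, w_pre, steps=50)
```

The pretrained weights already sit near a sensible solution, so the small real dataset only has to nudge them.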
Pretty clickbaity. Lots of "some argue", "some say", "has estimated", and "striving to", but not much substance about actual successes. I believe both Tesla and Cruise are working in this direction but there are serious issues to be worked out. I also vaguely remember some work on pose estimation being helped by generating renderings. Going over real successes would make for a more convincing article.
If a computer program is generating the training data, aren't you just training the AI to do the same thing as the already existing computer program does?
But a better analogy would be if an AI generated a computer game that another computer can learn to play.
In the end, it’s more like an anonymization layer than anything. If a computer is trained to generate input data for other computers to train with, there’s not a lot special going on.
Generating an environment and acting within it are wildly different things. Eg Tesla generates virtual camera footage of traffic situations they want their vehicles to handle correctly. The footage generator is basically a scripted video game director, while the trained AI is one of the most complex software projects ever.
"But a better analogy would be if an AI generated a computer game that another computer can learn to play."
Doesn't the latter AI have a policy which contains novel information that does not exist in the former AI?
Even if what you say is true in some abstract information theory sense (and I would question that), there is a world of practical difference in the usefulness of a trained self-driving AI and the game engine within which that AI functions.
But.... if you were to train a model on the simulator of the game, you would expect it to pick up on the rules programmed into the simulator.
This is really no different from it picking up on the rules embedded in the gathering of real-world data. Any implicit and hidden decisions in that space would be expected to find their way into the ML.
The tricky part is that the simulator itself may not have easy-to-understand rules. Waymo has a NeurIPS talk about training world-agent models that are used for car behavior in the simulation itself. Trying to make world agents that are indistinguishable from real-world vehicle behavior (e.g., minimizing the Jensen-Shannon divergence between behavior distributions) is a completely different task than training a model to safely transport you somewhere.
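For reference, a small numpy sketch of the Jensen-Shannon divergence, one common way to score how distinguishable two behavior distributions are (the example distributions are made up):

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (nats)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # 0 * log(0) is taken as 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

real = np.array([0.7, 0.2, 0.1])   # e.g. observed maneuver frequencies
sim  = np.array([0.5, 0.3, 0.2])   # world-agent behavior in simulation
gap = js_divergence(real, sim)
```

It is symmetric and bounded by ln 2, which makes it convenient as a "realism gap" score to drive down.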
Right. That was my angle. Sometimes the rules are easy to see after the fact. But just as bias can enter into data collection, so too can it enter simulation.
That probably depends on how realistic the simulator is. Your trade off is more bias (due to simulation inaccuracy) in exchange for unlimited data (and hence less variance).
Agreed. I could see a good case for training heavily on simulated data, then verifying/validating on collected data.
My question is ultimately how this really helps make the system ethical. It just moves the bias from collection to simulation. And I can't see simulation being less affected by bias.
Simulations are a daily part of life in research, development, and industry: from real-world simulations in research (e.g., climate) and industry (often in spreadsheets); to the theory of gravity (gravity simulated with mathematics) and every other theory of science, social science, and the humanities; to developing your iPhone app on your laptop, or just reading a train schedule or using a mapping program.
So what is the difference? Those simulations were built from reality. The theory of gravity was built from, and confirmed with, empirical observations, not from someone else's simulation of gravity! That essential foundation of science, reality (it's not science otherwise), is what is missing.
Also, we already have the problem of our biases and preconceived notions infecting training data, and AI becoming a simulation of that rather than reality. By then training on 'simulated data' (yikes!), we seem to create more of a loop.
If machine learning (ML) is trained on human behavior, it can never be better (for some measure of better) than people are. So if racism is widespread, it will influence the decisions trained into an ML algorithm. That raises the question: can we train an ML algorithm to make the decisions we _want_, rather than base its decision making on what people currently do? Training on generated data might be one way to do that, to build in implicit biases because we want more fairness in decision making than people currently exhibit.

But then who gets to choose what those "goal" biases are? Someone could add a bias that improves life for people of male gender and worsens it for everyone else, for example. This seems like a really important problem with no clear answer. It's very much related to electing politicians, where the choice of politician is also a choice of future goals. We don't know of any objective way to always decide what goals are best for everyone (voting certainly is not objective, since many voters are not aware of all possible policy/goal implications and can be tricked into voting against their own interests).

Yet it seems certain that researchers are working to introduce selective biases into algorithms, if only to adjust for known biases that are problematic. As an opaque input into an opaque algorithm, biases become invisible in deployments and could become very difficult to reverse or fix later. Even with continuous learning, if people's behavior is the input, the result will never be better than people currently are, which can create problems for the future. Some kind of intentional bias aimed at goals seems necessary, yet it also seems very dangerous, since it can introduce biases decided by a very small set of people.
If you want an AI that can recognize cats, and train it with computer generated pictures of cats, you may wind up with an AI that only recognizes cat pictures from that particular generator.
You can train an AI to do the inverse of the existing program (as is the case for the self-driving described in the article.) Take some input, generate output using the existing program, and then train the AI with the input/output reversed.
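A toy sketch of that reversal, with a seven-segment display renderer standing in for the "existing program" and a nearest-neighbour lookup as the trained inverse (all names are illustrative):

```python
# Forward program: digit -> seven-segment pattern. Reversing the generated
# (input, output) pairs yields labeled data for the inverse task
# (pattern -> digit), here solved by a 1-nearest-neighbour classifier.

SEGMENTS = {  # which of segments a..g light up for each digit
    0: "abcdef", 1: "bc", 2: "abdeg", 3: "abcdg", 4: "bcfg",
    5: "acdfg", 6: "acdefg", 7: "abc", 8: "abcdefg", 9: "abcdfg",
}

def render(d):
    """Forward program: digit -> 7-bit segment vector."""
    return tuple(int(s in SEGMENTS[d]) for s in "abcdefg")

# Generate pairs with the forward program, then flip input and output.
training = [(render(d), d) for d in range(10)]

def recognize(pattern):
    """Inverse model: nearest training pattern by Hamming distance."""
    return min(training,
               key=lambda t: sum(a != b for a, b in zip(t[0], pattern)))[1]
```

The inverse model even handles inputs the forward program can never emit, such as a pattern with one dead segment, which is exactly where the learned direction earns its keep.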
at least in the case of reinforcement learning, no. just because you can simulate the problem doesn't mean you know how to optimally solve it - ex driving a car
An obvious counterexample. I was young in 2005 when Tesseract was open-sourced, and I wanted to use it for something. But I decided to try it out first by writing something in Notepad, screenshotting it, and feeding that in. Synthetic data! But no, I didn't know how to make the existing computer program turn an image into text.
Consider even a simpler task, like image classification: does the picture contain a lion?
Imagine you have 3d model of a lion. You can render it from lots of different angles, lighting conditions, backgrounds, stretched out, curled up, etc. You know the ground truth classification on all renderings is that the picture contains a lion, but being able to generate images of lions is a very different task from recognizing lions in images.
The potential issue with using synthetic data to simulate the problem (like image classification) is that recognizing a lion in synthetic imagery and recognizing a lion in real imagery may also be very different tasks to a computer.
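One common way people try to bridge that synthetic-to-real gap is domain randomization: vary every nuisance parameter of the renderer so the model can't latch onto any single rendering quirk. A toy sketch of the sampling side, where every parameter name and range is made up purely for illustration:

```python
import random

def sample_render_params(seed=None):
    """Draw one randomized rendering configuration for the lion model.

    The ground-truth label is known by construction: every render
    contains a lion, whatever the pose, lighting, or background.
    """
    rng = random.Random(seed)
    return {
        "label": "lion",  # free, pixel-perfect supervision
        "azimuth_deg": rng.uniform(0, 360),
        "elevation_deg": rng.uniform(-10, 60),
        "light_intensity": rng.uniform(0.3, 1.5),
        "background": rng.choice(["savanna", "forest", "indoor", "noise"]),
        "pose": rng.choice(["standing", "curled", "stretched"]),
    }

# A renderer (not shown) would consume each config and emit a labeled image.
dataset = [sample_render_params(i) for i in range(1000)]
```

The hope, which the next comments debate, is that with enough randomized variation the real world looks like just one more sample from the distribution.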
Not that much, actually: the images are handled at a lower resolution and with a smaller set of colors, to make the models train (a lot) faster, so synthetic and real images look pretty much the same to a computer vision algorithm.
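For what it's worth, the preprocessing this comment describes can be sketched in a few lines of numpy. The block size and number of color levels below are arbitrary choices for illustration, not anyone's actual pipeline:

```python
import numpy as np

def coarsen(img, size=32, levels=8):
    """Block-average an image down to size x size, then quantize each
    channel to `levels` discrete steps, discarding the fine detail that
    distinguishes a render from a photograph."""
    h, w, c = img.shape
    fh, fw = h // size, w // size
    small = img[: fh * size, : fw * size].astype(float)
    small = small.reshape(size, fh, size, fw, c).mean(axis=(1, 3))
    step = 256 // levels
    return (small // step) * step
```

Whether this is enough to close the gap is exactly what the Sintel/KITTI comment below disputes.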
This is relevant to healthcare startups where it is extremely difficult to get your hands on enough real Protected Health Information to do any interesting ML work (unless you are already part of an enormous company with PHI... and even then it is harder than you'd think).
I worked at a health insurance company and had access to a lot of data (for ML research). It was impossible to lend that data to contractors for nearly any reason and this frustrated us.
Ascent as in the startup in Tokyo? Didn't the self-driving car approach fail so miserably that they had to do a 180-degree pivot to P&P robotics applications?
Interesting, but sim2real has had some lab success for the past few years. Making it work in real applications seems to be far trickier, especially profitably.
Odd, my impression is that everything moves in the opposite direction. Good synthetic data is expensive. Recording the real world is free.
Just a variational autoencoder with a discrete latent space is good enough to learn a usable phoneme recognition and pronunciation model from raw WAV files with unsupervised learning.
And CLIP shows just how far you can get without supervision. So what's the point in paying for artificial data if you can solve the problem without it?
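For readers unfamiliar with the "discrete latent space" part: the core of a VQ-style autoencoder is a quantization step that snaps each encoder output to its nearest codebook vector, yielding discrete units (e.g. phoneme-like tokens). A minimal numpy sketch of just that step, with illustrative shapes and no training loop:

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry.

    latents:  (N, D) encoder outputs
    codebook: (K, D) learned discrete codes
    Returns the quantized vectors and their integer indices.
    """
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)          # (N,) discrete token ids
    return codebook[idx], idx
```

In a real system (e.g. a VQ-VAE on raw audio) the codebook is learned jointly with the encoder and decoder; here it is just a fixed array to show the discretization.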
What. That really depends on your task. In my field, real data is extremely expensive and if we could generate believable synthetic data, we'd save a ton of money.
With synthetic data you can also generate the thing you are trying to solve with the AI in the first place, like generating a depth map, and an object classification map to go with a simulated image.
With real world data a human will have to label it.
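This is the real selling point: the simulator produces the labels as a byproduct of rendering. A toy sketch with a single sphere-in-front-of-a-wall "scene", where the depth map and class map come for free (all geometry, depths, and intensities below are invented for illustration):

```python
import numpy as np

def render_scene(h=64, w=64, cx=32, cy=32, r=10,
                 obj_depth=2.0, bg_depth=10.0):
    """Simulate an image plus pixel-perfect depth and class maps.

    A real pipeline would use a proper renderer; the point is that the
    ground truth falls out of the same computation as the image."""
    yy, xx = np.mgrid[0:h, 0:w]
    mask = (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2
    depth = np.where(mask, obj_depth, bg_depth)        # per-pixel depth
    classes = mask.astype(np.uint8)                    # 1 = object, 0 = bg
    image = np.where(mask, 200, 50).astype(np.uint8)   # crude shading
    return image, depth, classes
```

No human ever draws a polygon around the object; the mask, depth, and image are all consistent by construction, which is exactly what a human labeler can't guarantee.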
Yes, but AIs trained on synthetic depth maps tend to learn the oddities and noise patterns of the renderer, not the depth recognition part.
That's why so many AIs that work almost perfectly on the Sintel benchmark (synthetic flow and depth data) then fail to transfer over to KITTI (real images with lidar).
Recording real-world data may be cheap (not "free") in some circumstances, but properly labeling real-world data so it will be useful can be incredibly expensive.
Even if you're doing completely unsupervised learning, you will still need correctly labeled data at some point to validate your model.
And - I tell you as someone who has actually tried to do this - getting a wide range of high-quality voice recordings across multiple speakers is NOT CHEAP, and trying to train a model that will work under real-world conditions using "whatever dataset I downloaded from kaggle.com" is a fool's errand.
What a strange article. I think that a company like Nvidia could probably provide a great deal of additional value to the world of scientific computer modeling. And I think that simulations can work really well to assist in training models, I don't really understand why that would be up for debate, they already do. What I don't understand is talking about "simulating the entire world down to atomic interactions" or teleporting to Mars via data collected from..."sensors".
Little sections of this, particularly those pushing towards more standardized systems for building computer models, make sense and seem like a worthwhile goal, but most of it reads like nonsense to me.
Some datasets are easy to render, generate, whatever. In that scenario, sure! Seems like a solid case can be made, particularly where an analytic approach can speak to the data elements needed.
Other data needs to be sourced from the world. That's harder, and it's extremely likely that artificial data is either too expensive to create at the fidelity needed to make economic sense, or the cases are too numerous for an analytic approach to cover them all.
How cool would it be if computers not only used synthetic data for training, but also simulated the outcomes of their own actions before taking them. And not only their own actions: like in a game of chess, they would also consider all the possible immediate actions of the other actors.
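The look-ahead this comment describes, simulating your own move and then every possible reply, is essentially game-tree search. A minimal sketch on 1-3-stone Nim (chosen only because the whole state is a single integer; taking the last stone wins):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def can_win(stones):
    """Return True if the player to move can force a win.

    Simulate each of our candidate moves; a move is winning if it
    leaves the opponent in a position where every reply of theirs
    loses, i.e. can_win is False for them."""
    return any(not can_win(stones - take)
               for take in (1, 2, 3) if take <= stones)
```

The same recurse-over-all-replies structure, with learned value estimates instead of exhaustive search, is what systems like chess engines scale up.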
0. https://www.arwmoffat.com/work/synthetic-training-data