Instead of requiring voluminous real-world datasets for machines to learn the tasks they are designed to perform, a new kind of dataset promises to lower the cost and production time of machine learning while also providing privacy protections. It does so by creating ‘seed data,’ core principles from which the machines can extrapolate more detailed instruction.
When It Comes to AI, Can We Ditch the Datasets? Using Synthetic Data for Training Machine-Learning Models
Adam Zewe, Massachusetts Institute of Technology
A machine-learning model for image classification that’s trained using synthetic data can rival one trained on the real thing, a study shows.
Huge amounts of data are needed to train machine-learning models to perform image classification tasks, such as identifying damage in satellite photos following a natural disaster. However, these data are not always easy to come by. Datasets may cost millions of dollars to generate, if usable data exist in the first place, and even the best datasets often contain biases that negatively impact a model’s performance.
To circumvent some of the problems presented by datasets, MIT researchers developed a method for training a machine-learning model that does not rely on a dataset. Instead, it uses a special type of machine-learning model to generate extremely realistic synthetic data, which can then train another model for downstream vision tasks.
Their results show that a contrastive representation learning model trained using only these synthetic data is able to learn visual representations that rival or even outperform those learned from real data.
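To give a rough sense of what contrastive learning on generated data involves, here is a minimal sketch in plain Python. It is not the researchers' code. The `toy_generator` function is a hypothetical stand-in for a trained generative model, and the rule for forming positive pairs (two samples drawn from nearby latent vectors) is an assumption made for illustration. The sketch computes a standard InfoNCE-style contrastive loss only; no encoder is trained.

```python
import math
import random

def toy_generator(z):
    """Hypothetical stand-in for a generative model: maps a latent vector to an 'image'."""
    # A fixed nonlinear map; a real generator would be a trained network.
    return [math.tanh(z[0] + 0.5 * z[1]),
            math.tanh(z[1] - 0.5 * z[0]),
            math.tanh(z[0] * z[1])]

def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchor i should match positive i rather than any other positive."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of this anchor to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)

def sample_pairs(n, sigma, rng):
    """Assumed pairing rule: two synthetic samples from nearby latents form a positive pair."""
    anchors, positives = [], []
    for _ in range(n):
        z = [rng.gauss(0, 1), rng.gauss(0, 1)]
        z_near = [zi + rng.gauss(0, sigma) for zi in z]
        anchors.append(toy_generator(z))
        positives.append(toy_generator(z_near))
    return anchors, positives

rng = random.Random(0)
anchors, positives = sample_pairs(8, 0.05, rng)
loss_matched = info_nce(anchors, positives)
# Shift the positives by one slot so every anchor is paired with the wrong sample.
loss_shuffled = info_nce(anchors, positives[1:] + positives[:1])
print(f"matched pairs: {loss_matched:.3f}, mismatched pairs: {loss_shuffled:.3f}")
```

In this toy setup the raw generator outputs play the role of embeddings. In a real pipeline an encoder network would sit between the generated images and the loss, and training would adjust that encoder to drive the loss down.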