AI training is expensive: can only technology “giants” afford it?

23:34 01/06/2024


Researchers say data is key to creating more intelligent and capable AI systems. Two text generation models illustrate the point: Llama 3 from Meta and OLMo from the Allen Institute for Artificial Intelligence (AI2). Although the two have almost the same architecture, Llama 3 was trained on a larger amount of data, so it performs better.


However, data quality is just as important as quantity. AI models operate on the principle of “garbage in, garbage out”, so filtering data and checking its quality are necessary.

The race for data can lead to problems. Experts fear that the focus on large, high-quality data sets will put AI development in the hands of a few companies with big budgets, which could monopolize data sets and stifle innovation by others.

Additionally, data collection is sometimes not transparent. Some AI companies have pulled data from sources such as YouTube videos and Google Maps reviews without asking permission from content owners or creators. Some companies are even considering using copyright-protected content to train their models.

Another problem is the use of cheap labor in developing countries to label training data. These workers are paid low wages, receive no benefits, and are exposed to violent content for long periods of time.

Commercial data transactions are also not entirely fair. OpenAI has spent hundreds of millions of dollars to buy content rights, far exceeding the budgets of most research groups, non-profit organizations and startups.

With the AI training data market expected to grow strongly, data platforms are charging higher fees. This hurts the AI research community as a whole because smaller groups cannot afford them.

However, there are some independent efforts to make open data sets free for everyone. EleutherAI, a nonprofit research group, is collaborating with the University of Toronto and other institutions to build The Pile v2, a suite of billions of text snippets.

The question is whether these efforts can keep up with the major technology corporations. If data collection and testing still depend on financial resources, the answer is likely no, at least until a research breakthrough levels the playing field.
