AI Training Data

Music Used to Train AI: The Atlantic Builds a Searchable Database

Atlantic reporter Alex Reisner found four music datasets used to train AI models, totalling over 21 million tracks. Google and Stability AI confirmed use in research papers.

LUMIEN · 4 min read
Atlantic reporter Alex Reisner has identified four music datasets being used to train AI models and made them searchable to the public. Two of the datasets are enormous: one contains 12 million tracks, another 9 million. The remaining two each hold over 100,000 songs. According to Reisner, the datasets have been downloaded thousands of times, and both Google and Stability AI have confirmed their use in published research papers. Some of the sources, such as the Free Music Archive, are free for personal streaming but carry separate restrictions on commercial or training use.

What happened

Alex Reisner, a reporter at The Atlantic, tracked down four music datasets that have been used to train AI models and built a public, searchable database so anyone can look up whether their music is in them.

The scale here is hard to ignore. Two of the four datasets are very large:

  • One dataset contains 12 million tracks
  • Another contains 9 million tracks
  • The remaining two each contain over 100,000 songs

Combined, that puts the total above 21 million tracks across these four sources alone (at least 21.2 million, if the datasets do not overlap). According to Reisner, the datasets have been downloaded thousands of times. It is not possible to confirm every organisation that has used them, but Google and Stability AI have both cited these datasets in their own research papers, which counts as confirmation of use.
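For readers who want to check the arithmetic, here is a minimal sketch of the lower bound. It uses only the figures reported above. The dataset labels are placeholders, since the individual datasets are not named here, and the two smaller collections are counted at their stated floor of 100,000 each.

```python
# Lower bound on the combined size of the four datasets, using only the
# figures reported in the article. The labels are placeholders, not the
# datasets' real names.
reported_minimums = {
    "dataset_a": 12_000_000,
    "dataset_b": 9_000_000,
    "dataset_c": 100_000,  # reported as "over 100,000", so this is a floor
    "dataset_d": 100_000,  # reported as "over 100,000", so this is a floor
}


def lower_bound_total(sizes):
    """Sum the per-dataset minimums to get a floor for the combined total."""
    return sum(sizes.values())


total = lower_bound_total(reported_minimums)
print(f"At least {total:,} tracks")  # prints "At least 21,200,000 tracks"
```

This counts entries, not unique recordings. If the same track appears in more than one dataset, the number of distinct tracks would be lower.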

The licensing problem

Not all of these datasets are obviously illegal to use, which is part of what makes the situation complicated. Some sources, like the Free Music Archive, allow personal streaming at no cost. But “free to stream” and “free to train an AI model on” are very different things legally. Reisner’s reporting highlights the gap between what a licence technically permits and how these collections have actually been used.

This matters because musicians and rights holders often have no idea their work has ended up in a training set. The searchable database Reisner built is a direct response to that information gap.

Why it matters

This story sits in the middle of a fast-moving legal and commercial fight over AI training data. Courts in the US are already hearing cases about whether scraping copyrighted material for AI training counts as fair use. Music is a particularly sensitive area because the industry has well-established licensing structures, active collecting societies, and artists who are very publicly vocal about how their work is used.

The confirmation from Google and Stability AI is significant. It moves this from speculation to documented fact: major AI developers have used these specific collections. That makes the datasets relevant evidence in any future legal action, and it gives rights holders something concrete to point to.

For businesses building products on top of AI music tools, there is a real risk that the models underneath those tools were trained on data with questionable licensing. That exposure could matter if litigation expands in the way copyright holders are pushing for.

Our take

Reisner’s work is exactly the kind of journalism this space needs. The AI industry has benefited from the fact that training data is largely invisible. You cannot hear a song in a model’s output the way you can see an artist’s style in an image generator, so the problem stayed abstract for most people. A searchable database changes that.

What strikes us is how routine the use of these datasets apparently was. Thousands of downloads, citations in published research papers, no particular attempt to conceal any of it. That is not the behaviour of an industry that thought it was doing something legally risky. It is the behaviour of an industry that assumed training data was a free resource until told otherwise by a court.

That assumption is now being tested. Whether you are a musician, a developer, or a business using AI audio tools, it is worth understanding what is in the stack you are relying on.

What to do about it

If you are a musician or rights holder, start with the searchable database Reisner published at The Atlantic. Check whether your catalogue appears in any of the four datasets. If it does, document it and speak to an IP lawyer about your options, especially if you can link your work to a specific AI product or research paper.

If you are a business using AI music generation tools, ask your vendor directly which training datasets their model was built on and request documentation. “We trained on licensed data” is not enough. You want specifics, because vague assurances will not protect you if litigation moves your way.

Source: The Verge · AI
