Pirate archivist group scrapes Spotify's 300TB library, posts free torrents for downloading 86,000,000 tracks — investigation underway as music and metadata hit torrent sites

Update 12/22 at 4:31PM ET Spotify has shared the following statement:

"Spotify has identified and disabled the nefarious user accounts that engaged in unlawful scraping. We've implemented new safeguards for these types of anti-copyright attacks and are actively monitoring for suspicious behavior. Since day one, we have stood with the artist community against piracy, and we are actively working with our industry partners to protect creators and defend their rights."

Spotify, the world's largest music streaming platform, with hundreds of millions of active users and an extensive music library, has allegedly been hacked by Anna's Archive. The shadow library, which describes itself as an archivist group, has apparently scraped nearly the entire platform, downloading roughly 300 TB of music that is now being distributed illegally via torrents.

"An investigation into unauthorized access identified that a third party scraped public metadata and used illicit tactics to circumvent DRM to access some of the platform’s audio files. We are actively investigating the incident."

That "some" in the above comment is key: the leaked collection consists of around 86 million audio files, representing ~37% of all music available on the platform (but 99.9% of listens). Most are preserved in Spotify's original OGG Vorbis 160 kbps format, but tracks with a popularity rating of exactly 0 have been re-encoded at 75 kbps to save space.
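As a rough sanity check on those figures, the sketch below divides the reported 300 TB by the reported 86 million files and converts the result into playing time at 160 kbps. It assumes decimal terabytes, treats every file as if it were encoded at 160 kbps, and ignores container overhead and the re-encoded 75 kbps tracks, so the result is only a ballpark. The function and constant names are illustrative, not taken from the blog post.

```python
# Back-of-the-envelope check using the figures quoted in the article.
# Assumptions (not from the source): 1 TB = 10**12 bytes, every track is
# encoded at a constant 160 kbps, and container overhead is ignored.

TOTAL_BYTES = 300 * 10**12      # ~300 TB of audio
TRACK_COUNT = 86_000_000        # ~86 million files
BITRATE_BPS = 160_000           # 160 kbps, in bits per second


def average_track_bytes(total_bytes: int, track_count: int) -> float:
    """Mean file size in bytes across the whole collection."""
    return total_bytes / track_count


def playback_seconds(size_bytes: float, bitrate_bps: int) -> float:
    """Playing time of a file of the given size at a constant bitrate."""
    return size_bytes * 8 / bitrate_bps


avg_bytes = average_track_bytes(TOTAL_BYTES, TRACK_COUNT)
avg_seconds = playback_seconds(avg_bytes, BITRATE_BPS)

print(f"Average file size: {avg_bytes / 10**6:.2f} MB")
print(f"Implied average length: {avg_seconds / 60:.1f} minutes")
```

Under these assumptions the average file comes out at roughly 3.5 MB, or a little under three minutes of audio, which is in line with typical song lengths and suggests the two headline numbers are at least consistent with each other.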

Alongside the audio, there are 256 million rows of metadata, accounting for 99.6% of all listens on Spotify, compiled into queryable SQL databases. The group has done a near-lossless JSON reconstruction of Spotify's API, including 186 million unique ISRCs (identifiers for individual recordings worldwide; think of them as ISBNs for music). All the album info, artist info, cover art, etc. is included.
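For readers unfamiliar with ISRCs, the sketch below shows how one might validate and split such a code. It assumes the standard 12-character layout (a two-letter country code, a three-character alphanumeric registrant code, a two-digit year of reference, and a five-digit designation code). The `parse_isrc` helper and the `Isrc` type are hypothetical names for illustration and say nothing about how the leaked databases actually store these values.

```python
import re
from typing import NamedTuple, Optional

# Standard ISRC layout: CC-XXX-YY-NNNNN (hyphens are presentation only).
ISRC_PATTERN = re.compile(r"^([A-Z]{2})([A-Z0-9]{3})([0-9]{2})([0-9]{5})$")


class Isrc(NamedTuple):
    country: str      # two-letter country code
    registrant: str   # three-character registrant code
    year: str         # two-digit year of reference
    designation: str  # five-digit designation code


def parse_isrc(raw: str) -> Optional[Isrc]:
    """Return the parsed parts of an ISRC, or None if it is malformed."""
    # Normalise: drop surrounding whitespace and hyphens, upper-case letters.
    compact = raw.strip().replace("-", "").upper()
    match = ISRC_PATTERN.match(compact)
    if match is None:
        return None
    return Isrc(*match.groups())


# Illustrative inputs only; not taken from the leaked metadata.
print(parse_isrc("USRC17607839"))
print(parse_isrc("not-an-isrc"))
```

A check like this only confirms that a string has the right shape; it cannot tell whether the code was ever actually assigned to a recording.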

Songs on Spotify grouped by popularity (Image credit: Anna's Archive)

The blog post Anna's Archive published about the leak is surprisingly informative, including several charts that break down how Spotify treats music in general. For instance, around 70% of all songs on the platform barely get any attention, while 0.1% of the tracks are the most popular of all time. Most songs are also singles, rather than part of an album, and 120 BPM is the most common tempo.

Anyhow, the reason for this large-scale hack, as described by Anna's Archive itself, is the preservation of music. The group is notorious for distributing books without consent, and it's applying much the same logic here, arguing that Spotify's collection is overly focused on popular artists and sound quality. In its view, there needs to be an "authoritative list of torrents aiming to represent all music ever produced."

The torrents are self-hosted, and the files are packaged using Anna’s Archive Containers (AAC), a custom format the group has used for years. The metadata has already been released, while the rest of the data will follow a staggered release pattern in huge chunks, categorized by popularity. As a result, the full aftermath of this scrape will only become clear down the line.

Hassam Nasir is a die-hard hardware enthusiast with years of experience as a tech editor and writer, focusing on detailed CPU comparisons and general hardware news. When he’s not working, you’ll find him bending tubes for his ever-evolving custom water-loop gaming rig or benchmarking the latest CPUs and GPUs just for fun.

Comments from the forums

usertests

Based. BTW, the metadata is the point rather than the audio files.
Reply
Math Geek

so i'm gonna need to borrow a NAS or 10 ..........

no reason really.....
Reply
S58_is_the_goat

Did they use zip or rar to compress...
Reply
Lafong

Back of the envelope estimate?

300 TB is 37 percent of what Spotify has.

Therefore..100 percent would occupy about 811 TB.

If each song occupies about 4 MB (roughly the space occupied by a high quality mp3 of perhaps 2.75 minutes), Spotify would have about 203 million songs....ignoring anything other than the audio file itself.

Is that plausible? I know nothing at all about what format Spotify uses or how many text, picture or video files are involved.
Reply
usertests

Lafong said:
Is that plausible?
https://annas-archive.org/blog/backing-up-spotify.html
https://annas-archive.li/blog/backing-up-spotify.html
They say around 256 million tracks in the blog post, and it would take 700+ additional TB storage to handle what they didn't scrape. But some of them are encoded at 75 Kbps so that is affecting estimates.

They say many of the least popular tracks are AI generated.
Reply
cyrusfox

How much of this is podcast? Those take a lot of space, even if lower encoded. Listen once and forget content.

For the music 160 kbps is considered lossy, Spotify has higher tiers and lossless but I guess they went with the free content only (perhaps a clue to how they scraped so much).
Reply
bit_user

I don't support piracy, but I do worry about IP just "disappearing" into a black hole, due to things like rights disputes, technical glitches, billing problems, and old content potentially being deleted to free up space. Imagine future civilizations sift through our archeological record and think our culture "stopped" after the 2020's, because we stopped producing books, magazines, CDs, and blurays.

I think the content industry is being too greedy in their new content-as-a-service business model, and could've headed off the more principled hacks of this sort by something like creating their own locked-down archive and committing to release every copyrighted work, once its copyright expired.
Reply
bit_user

cyrusfox said:
How much of this is podcast? Those take a lot of space, even if lower encoded. Listen once and forget content.
Not only that, but now AI-generated slop, which is probably accounting for most of those never-listened or highly-unpopular tracks.
Reply
bit_user

S58_is_the_goat said:
Did they use zip or rar to compress...
Ogg Vorbis is an audio codec, similar to MP3. I haven't tried recompressing them, but the general rule of thumb is that you don't really achieve any additional compression by zipping or rar'ing an already-compressed audio or video file. Maybe a couple %, which isn't worth the practical annoyances of having to access it inside of that archive file.

I think the main reason you see ZIPs or RARs of something like a CD is just for packaging purposes and maybe also the CRC. However, these tracks have no natural package structure and bittorrent provides much better checksums. By keeping the track separate, a good torrent client will let you pick and choose which files you want to torrent, so you don't have to download the entire archive.
Reply
bit_user

usertests said:
the metadata is the point rather than the audio files.
Not sure about that, given that they're an archivist group and they did go to the trouble of grabbing 300 TB worth of tracks, but I'd probably find the metadata at least as interesting as the tracks, given that most of the music can also be accessed on other services.
Reply