Podcast Metadata Scraping

Turns out scraping metadata from iTunes-compatible podcast RSS feeds is extremely easy. I couldn't remember which episode of the Lebanese Politics Podcast had the hosts talking about the Beirut trash protests, so I went to iTunes to try to find out. However, the search in iTunes is shockingly bad. Plus, I wanted to be able to grep through the episode descriptions for my own purposes anyway, so I decided to scrape the metadata down.

The code is at https://github.com/peixian/podcast-description-search and distributed under an MIT license.

It takes in the iTunes URL for a podcast and can either run a simple query against the episode descriptions or dump everything to a csv:

$ ./scrape.py -h
usage: search.py [-h] [-q QUERY] [-o OUT] itunes_url

Takes in an itunes podcast id, searches for a specific string, also dumps all the shows to a csv.

positional arguments:
  itunes_url            URL from itunes for podcast

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY, --query QUERY
                        Specific string to search for. This is quite dumb so the csv with a more complex engine might be better
  -o OUT, --out OUT     Path to dump the results to a csv
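
For a rough idea of what the two modes boil down to, here is a minimal sketch. It is not the code from the repo. It assumes the feed XML has already been downloaded into a string, and the names parse_episodes, search, and to_csv are made up for illustration. The search is the same kind of dumb substring match the help text warns about.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Column names are an assumption for this sketch, not the repo's csv layout.
FIELDS = ["title", "pub_date", "description"]


def parse_episodes(feed_xml):
    """Return one dict per <item> in an RSS feed given as a string."""
    root = ET.fromstring(feed_xml)
    episodes = []
    for item in root.iter("item"):
        episodes.append({
            "title": (item.findtext("title") or "").strip(),
            "pub_date": (item.findtext("pubDate") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return episodes


def search(episodes, query):
    """Case-insensitive substring match on the episode description."""
    needle = query.lower()
    return [ep for ep in episodes if needle in ep["description"].lower()]


def to_csv(episodes):
    """Serialize the episodes to csv text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(episodes)
    return buf.getvalue()
```

Something like search(parse_episodes(feed_xml), "trash") would then return the episodes whose description mentions the word, and to_csv gives you text you can write to a file and grep later.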

iTunes-compatible RSS feeds are standardized, and it turns out the public API left up at https://itunes.apple.com/lookup takes a single id URL parameter, which is the iTunes globally unique ID for the podcast. You can do a GET on this endpoint without any auth, and it returns where the original RSS feed is hosted. In my case, the feed was hosted on Soundcloud, which provides an open endpoint for grabbing the RSS feed in the standardized iTunes format.
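
The lookup step can be sketched roughly like this. Again, this is an illustration and not the repo's code. It assumes the podcast ID shows up in the iTunes URL as an id followed by digits, and that the JSON response carries the feed location in a feedUrl field inside a results list. The function names are hypothetical.

```python
import json
import re
import urllib.parse
import urllib.request

LOOKUP_URL = "https://itunes.apple.com/lookup"


def extract_podcast_id(itunes_url):
    """Pull the numeric ID out of an iTunes podcast URL (assumes a /id<digits> segment)."""
    match = re.search(r"/id(\d+)", itunes_url)
    return match.group(1) if match else None


def build_lookup_url(podcast_id):
    """Build the unauthenticated lookup URL for a podcast ID."""
    return LOOKUP_URL + "?" + urllib.parse.urlencode({"id": podcast_id})


def extract_feed_url(lookup_json):
    """Return the first feedUrl in a lookup response, or None if there is none."""
    payload = json.loads(lookup_json)
    for result in payload.get("results", []):
        if "feedUrl" in result:
            return result["feedUrl"]
    return None


def fetch_feed_url(podcast_id, timeout=10):
    """Do the GET against the lookup endpoint and return the RSS feed URL."""
    with urllib.request.urlopen(build_lookup_url(podcast_id), timeout=timeout) as resp:
        return extract_feed_url(resp.read().decode("utf-8"))
```

Only fetch_feed_url touches the network. The other helpers are pure functions, so they can be checked against a saved response.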

Posted: 2020-05-09