Be Cool: Podcast Fetching Improvements

“Cool URIs don’t change”
— Sir Tim Berners-Lee, Web Developer

For the first time since launching, I’ve added a bunch of improvements to Player FM’s feed fetching bot. This change improves timeliness (i.e., content is now more up-to-date), removes duplicate entries, and removes many error pages.

This article is mostly technical back-end stuff, so you may wish to look away now if you don’t want to know how the sausage is made.

I’ll share a quick summary below, but first some brief background on the feed bot. The feed bot is a continuous process that runs on its own server. It checks every series on Player FM about once every few hours, requesting the latest episode data from the publisher’s server. When new episodes are recognised, it adds them to Player FM’s database. If there’s an error calling the server, an exponential backoff is applied, so each repeated failure lengthens the wait before the next attempt and the bot doesn’t waste time on a broken feed.
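To make the error handling concrete, here is a minimal sketch of how a delay might grow with repeated failures. It is not the bot's actual code: the three-hour base interval, the one-week ceiling, and the function name are all assumptions chosen for illustration.

```python
from datetime import timedelta

# Assumed values: the post only says "about once every few hours".
BASE_INTERVAL = timedelta(hours=3)
MAX_INTERVAL = timedelta(days=7)  # hypothetical ceiling so a feed is never abandoned


def next_check_delay(consecutive_errors: int) -> timedelta:
    """Return how long to wait before fetching a feed again.

    The wait doubles with every consecutive failed fetch and is capped
    at MAX_INTERVAL. A healthy feed (0 errors) uses the base interval.
    """
    if consecutive_errors < 0:
        raise ValueError("consecutive_errors must be non-negative")
    # Cap the exponent so the multiplication stays small for long outages.
    exponent = min(consecutive_errors, 16)
    return min(BASE_INTERVAL * (2 ** exponent), MAX_INTERVAL)
```

With these assumed numbers, a feed that has failed three times in a row would next be checked after a day rather than after three hours.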

Aliasing

Player FM has elements of being a wiki, including the fact that any user can add a new podcast series. So it’s not surprising that Player FM borrows some concepts from Wikipedia, and this “alias” concept is related to Wikipedia’s redirects. When a Wikipedia user renames an article, the old title lives on so it can redirect to the new one. That’s nice if people (or computers) have previously saved the old article. Now, Player FM does the same thing. If a feed title changes, it will remember the old title and redirect it to the new one.

In more detail, Player FM now aliases three identifiers: slugs, IDs, and URLs. Slugs are derived from the title and designed to appear in the URL, e.g. the show titled “This Week In Startups – Audio” has the slug “this-week-in-startups-audio”. IDs are the internal identifiers for each series, and will be important for applications outside of the website, like mobile apps. URLs are the feed URLs where the podcast lives on the web.
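One way to picture the alias concept is a lookup table keyed by the kind of identifier and its old value. The sketch below is only an illustration: the class name, the in-memory dictionary, and the integer series IDs are assumptions, not a description of Player FM's database.

```python
from typing import Dict, Optional, Tuple


class AliasIndex:
    """Maps retired slugs, IDs, and feed URLs to the series that replaced them."""

    KINDS = ("slug", "id", "url")

    def __init__(self) -> None:
        # (kind, old value) -> current series ID
        self._target: Dict[Tuple[str, str], int] = {}

    def add(self, kind: str, value: str, series_id: int) -> None:
        if kind not in self.KINDS:
            raise ValueError(f"unknown alias kind: {kind}")
        self._target[(kind, value)] = series_id

    def resolve(self, kind: str, value: str) -> Optional[int]:
        """Return the current series ID for an old identifier, if one is known."""
        return self._target.get((kind, value))


# Hypothetical example: series 42 was renamed, so its old slug becomes an alias.
aliases = AliasIndex()
aliases.add("slug", "good-show-old-name", 42)
```

A request for the old slug can then be redirected to whatever slug series 42 uses today, much like a Wikipedia redirect.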

The next few points identify some of the benefits of this alias concept.

Manually updating aliases and slugs

Because of the alias concept, we can now safely update the URL for a series. Some publishers have asked me to do so, because someone previously submitted a URL they consider to be unofficial. I’ve also noticed some shows, like those on the 5by5 network, have moved away from FeedBurner, so we need the ability to switch around URLs. In cases where Player FM has indexed both the old and the new URL, the system will automatically merge subscriptions too. And because of aliases, if someone later on tries subscribing to the old URL, they’ll automatically be subscribed to the new one.
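The merge described above could look roughly like the following. This is a sketch under stated assumptions: subscriptions are modelled as a dictionary of series ID to a set of user names, URL aliases as a plain dictionary, and the function name is made up for the example.

```python
from typing import Dict, Set


def merge_series(
    subscriptions: Dict[int, Set[str]],
    url_aliases: Dict[str, int],
    old_id: int,
    new_id: int,
    old_url: str,
) -> None:
    """Fold the series at old_id into new_id, in place."""
    # Move subscribers across; the set union drops users subscribed to both.
    moved = subscriptions.pop(old_id, set())
    subscriptions.setdefault(new_id, set()).update(moved)

    # Any alias that pointed at the old series now points at the survivor.
    for url, target in list(url_aliases.items()):
        if target == old_id:
            url_aliases[url] = new_id

    # The old feed URL itself becomes an alias, so later attempts to
    # subscribe to it land on the surviving series.
    url_aliases[old_url] = new_id
```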

Similarly, we can now update slugs safely. The first implication is that I was able to make our slugs a little bit cooler. Specifically, you won’t see paths like /series/—a–great_show. It would now appear as /a-great-show. Slug aliases made it possible to update Player FM’s URLs without breaking old links and angering the Googlebot. A possible future implication of slug aliases is that we can manually change individual shows’ slugs if they are complicated.
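A slug rule that produces the tidier form could be as small as the sketch below. It is an assumption about the general approach (lowercase, collapse anything that is not a letter or digit into a single hyphen, trim the ends), not Player FM's actual rule, and it ignores non-ASCII titles entirely.

```python
import re


def slugify(title: str) -> str:
    """Turn a show title into a lowercase, hyphen-separated slug."""
    # Every run of characters outside a-z and 0-9 becomes one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    # Drop hyphens left over at either end.
    return slug.strip("-")
```

Applied to the example title from earlier, slugify("This Week In Startups – Audio") gives "this-week-in-startups-audio", and stray underscores or repeated hyphens collapse in the same way.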

Handling Redirects

The bot has always followed redirects, but previously it would follow the same redirect afresh on every single fetch. Now, it will update the series URL to reflect the new place it’s pointing to. And of course, it will create a URL alias in case someone tries to add the old URL again. (Are you spotting the pattern here?) This happens only when the redirect is a permanent one (301).
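The redirect rule can be sketched without any networking by treating a fetch as a list of (status code, target URL) hops. The function name and the decision to stop at the first non-permanent hop are assumptions made for this example; only the 301 rule comes from the description above.

```python
from typing import List, Tuple

PERMANENT_REDIRECTS = {301}  # the post mentions 301 only


def resolve_stored_url(
    stored_url: str, hops: List[Tuple[int, str]]
) -> Tuple[str, List[str]]:
    """Return the URL to store for a series and the old URLs to keep as aliases.

    The stored URL advances only through permanent redirects. A temporary
    redirect (for example 302) is still followed for the fetch itself, but
    it does not change what is stored.
    """
    aliases: List[str] = []
    current = stored_url
    for status, location in hops:
        if status not in PERMANENT_REDIRECTS:
            break
        aliases.append(current)  # remember the URL being left behind
        current = location
    return current, aliases
```

For a single 301 from an old feed address to a new one, this returns the new address plus a one-item alias list containing the old address.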

Canonical FeedBurner URLs

About a quarter of feeds are FeedBurner-hosted, and since FeedBurner URLs come in many different forms, it was worth “canonicalising” them. Basically, we want to see all of these as the same thing: http://feeds.feedburner.com/GoodShow, http://feeds.feedburner.com/goodshow.xml, http://feeds2.feedburner.com/goodshow?format=rss. I wrote this up in more detail on my blog.
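As an illustration, the three example URLs above can be reduced to one comparison key with a few rules inferred from those examples alone: treat feeds, feeds2, and similar hosts as one, ignore case in the path, drop a trailing .xml, and drop the query string. These rules and the function name are assumptions; the real canonicalisation may handle more variants.

```python
import re
from urllib.parse import urlsplit

# Matches feeds.feedburner.com, feeds2.feedburner.com, and so on.
_FEEDBURNER_HOST = re.compile(r"^feeds\d*\.feedburner\.com$")


def canonical_feed_url(url: str) -> str:
    """Return a comparison key for a feed URL.

    Non-FeedBurner URLs are returned unchanged.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not _FEEDBURNER_HOST.match(host):
        return url
    path = parts.path.lower().rstrip("/")
    if path.endswith(".xml"):
        path = path[: -len(".xml")]
    # The query string (e.g. ?format=rss) is deliberately discarded.
    return "http://feeds.feedburner.com" + path
```

The lowercased result is meant only for comparing URLs; the original spelling of the path can still be kept for display.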

“Untitled” Titles

New or broken feeds would previously show up with the title “Untitled”. This was a poor user experience when a user had just imported several series. Now the title will show up a little more intelligently: typically as the domain name of the feed, and in the case of FeedBurner, as the path (e.g. “GoodShow” in the example above). Untitled episodes will now show with the name of their parent series, combined with the date.
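A rough sketch of those fallback titles follows. The function names, the ISO date format, and the way title and date are joined are all assumptions; the post only says the series name is combined with the date.

```python
from datetime import date
from urllib.parse import urlsplit


def fallback_series_title(feed_url: str) -> str:
    """Pick a display title for a series whose feed supplies none."""
    parts = urlsplit(feed_url)
    host = (parts.hostname or "").lower()
    if host == "feedburner.com" or host.endswith(".feedburner.com"):
        # For FeedBurner the path is more descriptive than the shared domain.
        name = parts.path.strip("/")
        if name:
            return name
    return host or "Untitled"


def fallback_episode_title(series_title: str, published: date) -> str:
    """Name an untitled episode after its series and publication date."""
    return f"{series_title} {published.isoformat()}"
```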

Work Remaining

The main thing left here is actually identifying duplicate series, now that we have a mechanism for dealing with them. For now, this happens automatically in the event of redirection, and we’ll respond to manual requests from publishers, but in the future, I hope to automate more of this using a tool to flag possible duplicates. This would probably be based on matching titles (as proposed here, the “dumb” technique of matching titles is probably also the most effective one).
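Title matching of this kind could start from something as simple as the sketch below, which groups series whose titles are identical after light normalisation. Since this tool does not exist yet, everything here (the names, the normalisation rule, the input shape of series ID to title) is hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List


def normalise_title(title: str) -> str:
    """Lowercase a title and reduce punctuation and spacing to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def possible_duplicates(series: Dict[int, str]) -> List[List[int]]:
    """Return groups of series IDs that share the same normalised title."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for series_id, title in series.items():
        groups[normalise_title(title)].append(series_id)
    # Ignore empty titles and titles that occur only once.
    return [sorted(ids) for key, ids in groups.items() if key and len(ids) > 1]
```

The output is a list of candidates for a human to review, not a list of confirmed duplicates, since unrelated shows can share a title.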