TIL: Ruby and RSS feeds
by Graham Marlow
I've been digging into Ruby's stdlib RSS parser for a side project and am very impressed by the overall experience. Here's how easy it is to get started:
require "open-uri"
require "rss"
feed = URI.open("https://jvns.ca/atom.xml") do |raw|
RSS::Parser.parse(raw)
end
That said, doing something interesting with the resulting feed is not quite so simple.
For one, you can't just support RSS. Atom is a more recent standard used by many blogs (although I think irrelevant in the world of podcasts). There's about a 50% split in the use of RSS and Atom in the tiny list of feeds that I follow, so a feed reader must handle both formats.
Adding Atom support introduces an extra branch to our snippet:
URI.open("https://jvns.ca/atom.xml") do |raw|
feed = RSS::Parser.parse(raw)
title = case feed
when RSS::Rss
feed.channel.title
when RSS::Atom::Feed
feed.title.content
end
end
The need to handle both standards independently is kind of frustrating.
That said, it does make sense from a library perspective. The RSS gem is principally concerned with parsing XML per the RSS and Atom standards, returning objects that correspond one-to-one. Any conveniences for general feed reading are left to the application.
Wrapping the RSS gem in another class helps encapsulate differences in standards:
class FeedReader
attr_reader :title
def initialize(url)
@url = url
end
def fetch
feed = URI.open(@url) { |r| RSS::Parser.parse(r) }
case feed
when RSS::Rss
@title = feed.channel.title
when RSS::Atom::Feed
@title = feed.title.content
end
end
end
Worse than dealing with competing standards is the fact that not everyone publishes the content of an article as part of their feed. Many bloggers only use RSS as a link aggregator that points subscribers to their webpage, omitting the content entirely:
<rss version="2.0">
<channel>
<title>Redacted Blog</title>
<link>https://www.redacted.io</link>
<description>This is my blog</description>
<item>
<title>Article title goes here</title>
<link>https://www.redacted.io/this-is-my-blog</link>
<pubDate>Thu, 25 Jul 2024 00:00:00 GMT</pubDate>
<!-- No content! -->
</item>
</channel>
</rss>
How do RSS readers handle this situation? The solution varies based on the app.
The two I've tested, NetNewsWire and Readwise Reader, manage to include the entire article content in the app, despite the RSS feed omitting it (assuming no paywalls). My guess is these services make an HTTP request to the source, scraping the resulting HTML for the article content and ignoring everything else.
Firefox users are likely familiar with a feature called Reader View that transforms a webpage into its bare-minimum content. All of the layout elements are removed in favor of highlighting the text of the page. The JS library that Firefox uses is open source on their Github: mozilla/readability.
On the Ruby side of things there's a handy port called ruby-readability that we can use to extract omitted article content directly from the associated website:
require "ruby-readability"
URI.open("https://jvns.ca/atom.xml") do |raw|
feed = RSS::Parser.parse(raw)
url = case feed
when RSS::Rss
feed.items.first.link
when RSS::Atom::Feed
feed.entries.first.link.href
end
# Raw HTML content
source = URI.parse(url).read
# Just the article HTML content
article_content = Readability::Document.new(source).content
end
So far the results are good, but I haven't tested it on many blogs.