# RSS URL Extraction - How It Works
## The Problem
Different RSS feed providers use different fields to store the article URL:
**Example 1: Standard RSS (uses `link`)**

```xml
<item>
  <title>Article Title</title>
  <link>https://example.com/article/123</link>
  <guid>internal-id-456</guid>
</item>
```

**Example 2: Some feeds (use `guid` as the URL)**

```xml
<item>
  <title>Article Title</title>
  <guid>https://example.com/article/123</guid>
</item>
```

**Example 3: Atom feeds (use `id`)**

```xml
<entry>
  <title>Article Title</title>
  <id>https://example.com/article/123</id>
</entry>
```

**Example 4: Complex feeds (`guid` as an object)**

```xml
<item>
  <title>Article Title</title>
  <guid isPermaLink="true">https://example.com/article/123</guid>
</item>
```

**Example 5: Multiple links**

```xml
<item>
  <title>Article Title</title>
  <link rel="alternate" type="text/html" href="https://example.com/article/123"/>
  <link rel="enclosure" type="image/jpeg" href="https://example.com/image.jpg"/>
</item>
```
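To see which of these fields a given feed actually provides, you can hand feedparser a raw XML string (it accepts strings as well as URLs) and inspect the parsed entry. A minimal sketch using Example 1:

```python
import feedparser

raw = """<rss version="2.0"><channel><item>
  <title>Article Title</title>
  <link>https://example.com/article/123</link>
  <guid>internal-id-456</guid>
</item></channel></rss>"""

entry = feedparser.parse(raw).entries[0]
print(list(entry.keys()))  # which fields this feed actually provides
print(entry.get('link'))   # 'https://example.com/article/123'
```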
## Our Solution

The `extract_article_url()` function tries multiple strategies in order:
**Strategy 1: Check `link` field (most common)**

```python
if entry.get('link') and entry.get('link', '').startswith('http'):
    return entry.get('link')
```

✅ Works for: Most RSS 2.0 feeds
**Strategy 2: Check `guid` field**

```python
if entry.get('guid'):
    guid = entry.get('guid')
    # guid can be a string
    if isinstance(guid, str) and guid.startswith('http'):
        return guid
    # or a dict with 'href'
    elif isinstance(guid, dict) and guid.get('href', '').startswith('http'):
        return guid.get('href')
```

✅ Works for: Feeds that use GUID as permalink
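For instance, Example 4's permalink sometimes surfaces as a dict after parsing; an illustrative entry (not taken from any specific feed):

```python
entry = {'guid': {'isPermaLink': 'true', 'href': 'https://example.com/article/123'}}
# Strategy 2 returns: 'https://example.com/article/123'
```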
**Strategy 3: Check `id` field**

```python
if entry.get('id') and entry.get('id', '').startswith('http'):
    return entry.get('id')
```

✅ Works for: Atom feeds
**Strategy 4: Check `links` array**

```python
if entry.get('links'):
    for link in entry.get('links', []):
        if isinstance(link, dict) and link.get('href', '').startswith('http'):
            # Prefer 'alternate' type
            if link.get('type') == 'text/html' or link.get('rel') == 'alternate':
                return link.get('href')
```

✅ Works for: Feeds with multiple links (prefers HTML content)
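Putting the four strategies together, here is a minimal reconstruction of `extract_article_url()` built only from the fragments above; the actual implementation in `rss_utils` may differ in details such as logging:

```python
def extract_article_url(entry):
    """Return the first plausible article URL from a feed entry, or None."""
    # Strategy 1: plain 'link' field
    link = entry.get('link')
    if isinstance(link, str) and link.startswith('http'):
        return link

    # Strategy 2: 'guid' as a permalink string, or a dict with 'href'
    guid = entry.get('guid')
    if isinstance(guid, str) and guid.startswith('http'):
        return guid
    if isinstance(guid, dict) and guid.get('href', '').startswith('http'):
        return guid['href']

    # Strategy 3: Atom-style 'id'
    entry_id = entry.get('id')
    if isinstance(entry_id, str) and entry_id.startswith('http'):
        return entry_id

    # Strategy 4: 'links' array, preferring the HTML 'alternate' link
    for candidate in entry.get('links') or []:
        if isinstance(candidate, dict) and candidate.get('href', '').startswith('http'):
            if candidate.get('type') == 'text/html' or candidate.get('rel') == 'alternate':
                return candidate['href']

    return None
```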
## Real-World Examples
**Süddeutsche Zeitung**

```python
entry = {
    'title': 'Munich News',
    'link': 'https://www.sueddeutsche.de/muenchen/article-123',
    'guid': 'sz-internal-123',
}
# Returns: 'https://www.sueddeutsche.de/muenchen/article-123'
```

**Medium Blog**

```python
entry = {
    'title': 'Blog Post',
    'guid': 'https://medium.com/@user/post-abc123',
    'link': None,
}
# Returns: 'https://medium.com/@user/post-abc123'
```

**YouTube RSS**

```python
entry = {
    'title': 'Video Title',
    'id': 'https://www.youtube.com/watch?v=abc123',
    'link': None,
}
# Returns: 'https://www.youtube.com/watch?v=abc123'
```

**Complex Feed**

```python
entry = {
    'title': 'Article',
    'links': [
        {'rel': 'alternate', 'type': 'text/html', 'href': 'https://example.com/article'},
        {'rel': 'enclosure', 'type': 'image/jpeg', 'href': 'https://example.com/image.jpg'},
    ],
}
# Returns: 'https://example.com/article' (prefers text/html)
```
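Since the extractor only reads a plain dict and returns a string or None, the examples above double as quick regression checks:

```python
from rss_utils import extract_article_url  # or use the reconstruction sketched above

assert extract_article_url({
    'title': 'Blog Post',
    'guid': 'https://medium.com/@user/post-abc123',
    'link': None,
}) == 'https://medium.com/@user/post-abc123'

assert extract_article_url({
    'title': 'Article',
    'links': [
        {'rel': 'alternate', 'type': 'text/html', 'href': 'https://example.com/article'},
        {'rel': 'enclosure', 'type': 'image/jpeg', 'href': 'https://example.com/image.jpg'},
    ],
}) == 'https://example.com/article'
```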
## Validation

All extracted URLs must:

- Start with `http://` or `https://`
- Be a valid string (not None or empty)

If no valid URL is found:

```python
return None
# Crawler will skip this entry and log a warning
```
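The two rules condense into a small predicate; a sketch, assuming a helper name (`_is_valid_url`) that is not necessarily in the codebase:

```python
def _is_valid_url(url):
    # Hypothetical helper: True only for non-empty http(s) URL strings
    return isinstance(url, str) and url.startswith(('http://', 'https://'))

assert _is_valid_url('https://example.com/article')
assert not _is_valid_url(None)
assert not _is_valid_url('mailto:editor@example.com')
```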
## Testing Different Feeds

To test if a feed works with our extractor:

```python
import feedparser
from rss_utils import extract_article_url

# Parse feed
feed = feedparser.parse('https://example.com/rss')

# Test each entry
for entry in feed.entries[:5]:
    url = extract_article_url(entry)
    if url:
        print(f"✓ {entry.get('title', 'No title')[:50]}")
        print(f"  URL: {url}")
    else:
        print(f"✗ {entry.get('title', 'No title')[:50]}")
        print("  No valid URL found")
        print(f"  Available fields: {list(entry.keys())}")
```
## Supported Feed Types
✅ RSS 2.0
✅ RSS 1.0
✅ Atom
✅ Custom RSS variants
✅ Feeds with multiple links
✅ Feeds with GUID as permalink
## Edge Cases Handled

- **GUID is not a URL:** checks whether it starts with `http`
- **Multiple links:** prefers the `text/html` type
- **GUID as dict:** extracts the `href` field
- **Missing fields:** returns None instead of crashing
- **Non-HTTP URLs:** filters out `mailto:`, `ftp:`, etc.
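A few of these cases exercised directly (again using the reconstruction sketched earlier):

```python
# GUID is not a URL, and no other fields are present -> None
assert extract_article_url({'guid': 'internal-id-456'}) is None

# Non-HTTP schemes are filtered out
assert extract_article_url({'link': 'mailto:editor@example.com'}) is None

# guid as a dict without 'href' does not crash
assert extract_article_url({'guid': {'isPermaLink': 'false'}}) is None
```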
## Future Improvements

Potential enhancements:

- Support for `feedburner:origLink` (see the sketch below)
- Support for `pheedo:origLink`
- Resolve shortened URLs (bit.ly, etc.)
- Handle relative URLs (convert to absolute)
- Cache URL extraction results
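A rough sketch of the first two items plus relative-URL handling (the name `extract_article_url_v2` and the `feed_base_url` parameter are illustrative; feedparser usually flattens namespaced elements, so `feedburner:origLink` tends to appear as the `feedburner_origlink` key, but verify against your feeds):

```python
from urllib.parse import urljoin

def extract_article_url_v2(entry, feed_base_url=None):
    # Prefer the publisher's original link if the feed went through
    # FeedBurner or Pheedo (key names assume feedparser's flattening)
    for key in ('feedburner_origlink', 'pheedo_origlink'):
        orig = entry.get(key)
        if isinstance(orig, str) and orig.startswith('http'):
            return orig

    # Fall back to the existing strategy chain
    url = extract_article_url(entry)
    if url:
        return url

    # Last resort: resolve a relative 'link' against the feed's base URL
    link = entry.get('link')
    if feed_base_url and isinstance(link, str) and link:
        return urljoin(feed_base_url, link)
    return None
```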