RSS URL Extraction - How It Works

The Problem

Different RSS feed providers use different fields to store the article URL:

Example 1: Standard RSS 2.0 (uses link)

<item>
  <title>Article Title</title>
  <link>https://example.com/article/123</link>
  <guid>internal-id-456</guid>
</item>

Example 2: Some feeds (uses guid as URL)

<item>
  <title>Article Title</title>
  <guid>https://example.com/article/123</guid>
</item>

Example 3: Atom feeds (uses id)

<entry>
  <title>Article Title</title>
  <id>https://example.com/article/123</id>
</entry>

Example 4: Complex feeds (guid with attributes, or multiple link elements)

<item>
  <title>Article Title</title>
  <guid isPermaLink="true">https://example.com/article/123</guid>
</item>
<item>
  <title>Article Title</title>
  <link rel="alternate" type="text/html" href="https://example.com/article/123"/>
  <link rel="enclosure" type="image/jpeg" href="https://example.com/image.jpg"/>
</item>

Our Solution

The extract_article_url() function tries four strategies in order:

Strategy 1: Check link field

if entry.get('link') and entry.get('link', '').startswith('http'):
    return entry.get('link')

Works for: Most RSS 2.0 feeds

Strategy 2: Check guid field

if entry.get('guid'):
    guid = entry.get('guid')
    # guid can be a string
    if isinstance(guid, str) and guid.startswith('http'):
        return guid
    # or a dict with 'href'
    elif isinstance(guid, dict) and guid.get('href', '').startswith('http'):
        return guid.get('href')

Works for: Feeds that use GUID as permalink

Strategy 3: Check id field

if entry.get('id') and entry.get('id', '').startswith('http'):
    return entry.get('id')

Works for: Atom feeds

Strategy 4: Check links list

if entry.get('links'):
    for link in entry.get('links', []):
        if isinstance(link, dict) and link.get('href', '').startswith('http'):
            # Prefer 'alternate' type
            if link.get('type') == 'text/html' or link.get('rel') == 'alternate':
                return link.get('href')

Works for: Feeds with multiple links (prefers HTML content)
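
Assembled into a single function, the four strategies read roughly like this. This is a sketch of the logic described above; the actual implementation in rss_utils.py may differ in details:

```python
def extract_article_url(entry):
    """Try each strategy in order; return the first valid HTTP(S) URL, else None."""
    # Strategy 1: link field (most RSS 2.0 feeds)
    link = entry.get('link')
    if isinstance(link, str) and link.startswith('http'):
        return link

    # Strategy 2: guid field, either a plain string or a dict with 'href'
    guid = entry.get('guid')
    if isinstance(guid, str) and guid.startswith('http'):
        return guid
    if isinstance(guid, dict) and guid.get('href', '').startswith('http'):
        return guid.get('href')

    # Strategy 3: id field (Atom feeds)
    entry_id = entry.get('id')
    if isinstance(entry_id, str) and entry_id.startswith('http'):
        return entry_id

    # Strategy 4: links list, preferring the HTML 'alternate' link
    for link in entry.get('links') or []:
        if isinstance(link, dict) and link.get('href', '').startswith('http'):
            if link.get('type') == 'text/html' or link.get('rel') == 'alternate':
                return link.get('href')

    # No valid URL found; the crawler skips this entry and logs a warning
    return None
```

Because the checks run top to bottom, a feed that fills in both link and guid always resolves through link, which matches the Süddeutsche Zeitung case below.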

Real-World Examples

Süddeutsche Zeitung

entry = {
    'title': 'Munich News',
    'link': 'https://www.sueddeutsche.de/muenchen/article-123',
    'guid': 'sz-internal-123'
}
# Returns: 'https://www.sueddeutsche.de/muenchen/article-123'

Medium Blog

entry = {
    'title': 'Blog Post',
    'guid': 'https://medium.com/@user/post-abc123',
    'link': None
}
# Returns: 'https://medium.com/@user/post-abc123'

YouTube RSS

entry = {
    'title': 'Video Title',
    'id': 'https://www.youtube.com/watch?v=abc123',
    'link': None
}
# Returns: 'https://www.youtube.com/watch?v=abc123'

Complex Feed

entry = {
    'title': 'Article',
    'links': [
        {'rel': 'alternate', 'type': 'text/html', 'href': 'https://example.com/article'},
        {'rel': 'enclosure', 'type': 'image/jpeg', 'href': 'https://example.com/image.jpg'}
    ]
}
# Returns: 'https://example.com/article' (prefers text/html)

Validation

All extracted URLs must:

  1. Start with http:// or https://
  2. Be a valid string (not None or empty)

If no valid URL is found:

return None
# Crawler will skip this entry and log a warning

Testing Different Feeds

To test if a feed works with our extractor:

import feedparser
from rss_utils import extract_article_url

# Parse feed
feed = feedparser.parse('https://example.com/rss')

# Test each entry
for entry in feed.entries[:5]:
    url = extract_article_url(entry)
    if url:
        print(f"✓ {entry.get('title', 'No title')[:50]}")
        print(f"  URL: {url}")
    else:
        print(f"✗ {entry.get('title', 'No title')[:50]}")
        print("  No valid URL found")
        print(f"  Available fields: {list(entry.keys())}")

Supported Feed Types

RSS 2.0
RSS 1.0
Atom
Custom RSS variants
Feeds with multiple links
Feeds with GUID as permalink

Edge Cases Handled

  1. GUID is not a URL: Checks if it starts with http
  2. Multiple links: Prefers text/html type
  3. GUID as dict: Extracts href field
  4. Missing fields: Returns None instead of crashing
  5. Non-HTTP URLs: Filters out mailto:, ftp:, etc.
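
Edge cases 1, 3, 4, and 5 all funnel through the same two checks. A self-contained illustration of the guid handling (guid_to_url is a hypothetical helper written for this example, not part of rss_utils):

```python
def guid_to_url(guid):
    # Edge case 1 and 5: accept only values that look like HTTP(S) URLs,
    # so internal IDs and mailto:/ftp: schemes fall through.
    if isinstance(guid, str) and guid.startswith('http'):
        return guid
    # Edge case 3: guid delivered as a dict; pull out its 'href' field.
    if isinstance(guid, dict) and guid.get('href', '').startswith('http'):
        return guid['href']
    # Edge case 4: anything else (missing, None, wrong type) yields None.
    return None
```

So guid_to_url('sz-internal-123') and guid_to_url('mailto:editor@example.com') both return None, while guid_to_url({'href': 'https://example.com/a'}) returns the URL.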

Future Improvements

Potential enhancements:

  • Support for feedburner:origLink
  • Support for pheedo:origLink
  • Resolve shortened URLs (bit.ly, etc.)
  • Handle relative URLs (convert to absolute)
  • Cache URL extraction results
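
Of these, relative-URL handling is straightforward with the standard library. A sketch, assuming the feed's own URL is available as a base (absolutize and feed_base_url are illustrative names, not existing code):

```python
from urllib.parse import urljoin

def absolutize(url, feed_base_url):
    # Resolve a relative entry URL (e.g. '/article/123') against the
    # feed's own URL. Absolute URLs pass through unchanged.
    if not url:
        return None
    return urljoin(feed_base_url, url)
```

For example, absolutize('/article/123', 'https://example.com/rss') yields 'https://example.com/article/123'.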