# RSS URL Extraction - How It Works

## The Problem

Different RSS feed providers use different fields to store the article URL:

### Example 1: Standard RSS (uses `link`)

```xml
<item>
  <title>Article Title</title>
  <link>https://example.com/article/123</link>
  <guid>internal-id-456</guid>
</item>
```

### Example 2: Some feeds (uses `guid` as URL)

```xml
<item>
  <title>Article Title</title>
  <guid>https://example.com/article/123</guid>
</item>
```

### Example 3: Atom feeds (uses `id`)

```xml
<entry>
  <title>Article Title</title>
  <id>https://example.com/article/123</id>
</entry>
```

### Example 4: Complex feeds (guid as object)

```xml
<item>
  <title>Article Title</title>
  <guid isPermaLink="true">https://example.com/article/123</guid>
</item>
```

### Example 5: Multiple links

```xml
<entry>
  <title>Article Title</title>
  <link rel="alternate" type="text/html" href="https://example.com/article"/>
  <link rel="enclosure" type="image/jpeg" href="https://example.com/image.jpg"/>
</entry>
```

## Our Solution

The `extract_article_url()` function tries multiple strategies in order:

### Strategy 1: Check `link` field (most common)

```python
if entry.get('link') and entry.get('link', '').startswith('http'):
    return entry.get('link')
```

✅ Works for: Most RSS 2.0 feeds

### Strategy 2: Check `guid` field

```python
if entry.get('guid'):
    guid = entry.get('guid')
    # guid can be a string
    if isinstance(guid, str) and guid.startswith('http'):
        return guid
    # or a dict with 'href'
    elif isinstance(guid, dict) and guid.get('href', '').startswith('http'):
        return guid.get('href')
```

✅ Works for: Feeds that use GUID as permalink

### Strategy 3: Check `id` field

```python
if entry.get('id') and entry.get('id', '').startswith('http'):
    return entry.get('id')
```

✅ Works for: Atom feeds

### Strategy 4: Check `links` array

```python
if entry.get('links'):
    for link in entry.get('links', []):
        if isinstance(link, dict) and link.get('href', '').startswith('http'):
            # Prefer 'alternate' type
            if link.get('type') == 'text/html' or link.get('rel') == 'alternate':
                return link.get('href')
```

✅ Works for: Feeds with multiple links (prefers HTML content)

## Real-World Examples

### Süddeutsche Zeitung

```python
entry = {
    'title': 'Munich News',
    'link': 'https://www.sueddeutsche.de/muenchen/article-123',
    'guid': 'sz-internal-123'
}
# Returns: 'https://www.sueddeutsche.de/muenchen/article-123'
```
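The four strategy excerpts above chain into a single lookup. A minimal sketch of how they compose, assuming `feedparser`-style entry dicts, could look like this:

```python
def extract_article_url(entry):
    """Return the first plausible http(s) article URL from a feed entry, or None."""
    # Strategy 1: 'link' field (most RSS 2.0 feeds)
    link = entry.get('link')
    if isinstance(link, str) and link.startswith('http'):
        return link

    # Strategy 2: 'guid' field, either a plain string or a dict with 'href'
    guid = entry.get('guid')
    if isinstance(guid, str) and guid.startswith('http'):
        return guid
    if isinstance(guid, dict) and guid.get('href', '').startswith('http'):
        return guid.get('href')

    # Strategy 3: 'id' field (Atom feeds)
    entry_id = entry.get('id')
    if isinstance(entry_id, str) and entry_id.startswith('http'):
        return entry_id

    # Strategy 4: 'links' array, preferring the HTML 'alternate' link
    for candidate in entry.get('links') or []:
        if isinstance(candidate, dict) and candidate.get('href', '').startswith('http'):
            if candidate.get('type') == 'text/html' or candidate.get('rel') == 'alternate':
                return candidate.get('href')

    # No valid URL found: the crawler skips this entry and logs a warning
    return None
```

The `isinstance` checks make each strategy fall through safely when a field is missing, `None`, or has an unexpected shape, instead of raising.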
### Medium Blog

```python
entry = {
    'title': 'Blog Post',
    'guid': 'https://medium.com/@user/post-abc123',
    'link': None
}
# Returns: 'https://medium.com/@user/post-abc123'
```

### YouTube RSS

```python
entry = {
    'title': 'Video Title',
    'id': 'https://www.youtube.com/watch?v=abc123',
    'link': None
}
# Returns: 'https://www.youtube.com/watch?v=abc123'
```

### Complex Feed

```python
entry = {
    'title': 'Article',
    'links': [
        {'rel': 'alternate', 'type': 'text/html', 'href': 'https://example.com/article'},
        {'rel': 'enclosure', 'type': 'image/jpeg', 'href': 'https://example.com/image.jpg'}
    ]
}
# Returns: 'https://example.com/article' (prefers text/html)
```

## Validation

All extracted URLs must:

1. Start with `http://` or `https://`
2. Be a valid string (not None or empty)

If no valid URL is found:

```python
return None  # Crawler will skip this entry and log a warning
```

## Testing Different Feeds

To test whether a feed works with our extractor:

```python
import feedparser

from rss_utils import extract_article_url

# Parse the feed
feed = feedparser.parse('https://example.com/rss')

# Test the first few entries
for entry in feed.entries[:5]:
    url = extract_article_url(entry)
    if url:
        print(f"✓ {entry.get('title', 'No title')[:50]}")
        print(f"  URL: {url}")
    else:
        print(f"✗ {entry.get('title', 'No title')[:50]}")
        print("  No valid URL found")
        print(f"  Available fields: {list(entry.keys())}")
```

## Supported Feed Types

✅ RSS 2.0
✅ RSS 1.0
✅ Atom
✅ Custom RSS variants
✅ Feeds with multiple links
✅ Feeds with GUID as permalink

## Edge Cases Handled

1. **GUID is not a URL**: Checks if it starts with `http`
2. **Multiple links**: Prefers `text/html` type
3. **GUID as dict**: Extracts `href` field
4. **Missing fields**: Returns None instead of crashing
5. **Non-HTTP URLs**: Filters out `mailto:`, `ftp:`, etc.

## Future Improvements

Potential enhancements:

- [ ] Support for `feedburner:origLink`
- [ ] Support for `pheedo:origLink`
- [ ] Resolve shortened URLs (bit.ly, etc.)
- [ ] Handle relative URLs (convert to absolute)
- [ ] Cache URL extraction results
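For the relative-URL item in the list above, one possible approach (a sketch, not part of the current extractor; the helper name `absolutize` is hypothetical) is to resolve candidates against the feed's own URL with the standard library's `urllib.parse.urljoin`:

```python
from urllib.parse import urljoin, urlparse


def absolutize(candidate, feed_url):
    """Resolve a possibly-relative candidate URL against the feed URL.

    Returns an absolute http(s) URL, or None if nothing usable remains.
    """
    if not candidate:
        return None
    # urljoin leaves absolute URLs untouched and resolves relative ones
    absolute = urljoin(feed_url, candidate)
    # Keep the existing validation rule: http(s) schemes only
    if urlparse(absolute).scheme in ('http', 'https'):
        return absolute
    return None
```

For example, `absolutize('/article/123', 'https://example.com/rss')` yields `'https://example.com/article/123'`, while `mailto:` or `ftp:` candidates are still filtered out, matching the validation rules above.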