RSS URL Extraction - How It Works

The Problem

Different RSS feed providers use different fields to store the article URL:

Example 1: Standard RSS 2.0 (uses link)

<item>
  <title>Article Title</title>
  <link>https://example.com/article/123</link>
  <guid>internal-id-456</guid>
</item>

Example 2: Some feeds (uses guid as URL)

<item>
  <title>Article Title</title>
  <guid>https://example.com/article/123</guid>
</item>

Example 3: Atom feeds (uses id)

<entry>
  <title>Article Title</title>
  <id>https://example.com/article/123</id>
</entry>

Example 4: Complex feeds (guid with attributes, or multiple link elements)

<item>
  <title>Article Title</title>
  <guid isPermaLink="true">https://example.com/article/123</guid>
</item>
<item>
  <title>Article Title</title>
  <link rel="alternate" type="text/html" href="https://example.com/article/123"/>
  <link rel="enclosure" type="image/jpeg" href="https://example.com/image.jpg"/>
</item>

Our Solution

The extract_article_url() function tries four strategies in order:

Strategy 1: Check link field

if entry.get('link') and entry.get('link', '').startswith('http'):
    return entry.get('link')

Works for: Most RSS 2.0 feeds

Strategy 2: Check guid field

if entry.get('guid'):
    guid = entry.get('guid')
    # guid can be a string
    if isinstance(guid, str) and guid.startswith('http'):
        return guid
    # or a dict with 'href'
    elif isinstance(guid, dict) and guid.get('href', '').startswith('http'):
        return guid.get('href')

Works for: Feeds that use GUID as permalink

Strategy 3: Check id field

if entry.get('id') and entry.get('id', '').startswith('http'):
    return entry.get('id')

Works for: Atom feeds

Strategy 4: Check links list

if entry.get('links'):
    for link in entry.get('links', []):
        if isinstance(link, dict) and link.get('href', '').startswith('http'):
            # Prefer 'alternate' type
            if link.get('type') == 'text/html' or link.get('rel') == 'alternate':
                return link.get('href')

Works for: Feeds with multiple links (prefers HTML content)
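
Assembled into a single function, the four strategies read roughly like this. This is a sketch of the logic described above; the actual implementation in rss_utils.py may differ in details:

```python
def extract_article_url(entry):
    """Try each strategy in order; return the first valid HTTP(S) URL, else None."""
    # Strategy 1: link field (most RSS 2.0 feeds)
    link = entry.get('link')
    if isinstance(link, str) and link.startswith('http'):
        return link

    # Strategy 2: guid field, either a plain string or a dict with 'href'
    guid = entry.get('guid')
    if isinstance(guid, str) and guid.startswith('http'):
        return guid
    if isinstance(guid, dict) and guid.get('href', '').startswith('http'):
        return guid.get('href')

    # Strategy 3: id field (Atom feeds)
    entry_id = entry.get('id')
    if isinstance(entry_id, str) and entry_id.startswith('http'):
        return entry_id

    # Strategy 4: links list, preferring the HTML 'alternate' link
    for link in entry.get('links') or []:
        if isinstance(link, dict) and link.get('href', '').startswith('http'):
            if link.get('type') == 'text/html' or link.get('rel') == 'alternate':
                return link.get('href')

    # No valid URL found; the crawler skips this entry and logs a warning
    return None
```

Because the checks run top to bottom, a feed that fills in both link and guid always resolves through link, which matches the Süddeutsche Zeitung case below.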

Real-World Examples

Süddeutsche Zeitung

entry = {
    'title': 'Munich News',
    'link': 'https://www.sueddeutsche.de/muenchen/article-123',
    'guid': 'sz-internal-123'
}
# Returns: 'https://www.sueddeutsche.de/muenchen/article-123'

Medium Blog

entry = {
    'title': 'Blog Post',
    'guid': 'https://medium.com/@user/post-abc123',
    'link': None
}
# Returns: 'https://medium.com/@user/post-abc123'

YouTube RSS

entry = {
    'title': 'Video Title',
    'id': 'https://www.youtube.com/watch?v=abc123',
    'link': None
}
# Returns: 'https://www.youtube.com/watch?v=abc123'

Complex Feed

entry = {
    'title': 'Article',
    'links': [
        {'rel': 'alternate', 'type': 'text/html', 'href': 'https://example.com/article'},
        {'rel': 'enclosure', 'type': 'image/jpeg', 'href': 'https://example.com/image.jpg'}
    ]
}
# Returns: 'https://example.com/article' (prefers text/html)

Validation

All extracted URLs must:

  1. Start with http:// or https://
  2. Be a valid string (not None or empty)

If no valid URL is found:

return None
# Crawler will skip this entry and log a warning

Testing Different Feeds

To test if a feed works with our extractor:

import feedparser
from rss_utils import extract_article_url

# Parse feed
feed = feedparser.parse('https://example.com/rss')

# Test each entry
for entry in feed.entries[:5]:
    url = extract_article_url(entry)
    if url:
        print(f"✓ {entry.get('title', 'No title')[:50]}")
        print(f"  URL: {url}")
    else:
        print(f"✗ {entry.get('title', 'No title')[:50]}")
        print("  No valid URL found")
        print(f"  Available fields: {list(entry.keys())}")

Supported Feed Types

RSS 2.0
RSS 1.0
Atom
Custom RSS variants
Feeds with multiple links
Feeds with GUID as permalink

Edge Cases Handled

  1. GUID is not a URL: Checks if it starts with http
  2. Multiple links: Prefers text/html type
  3. GUID as dict: Extracts href field
  4. Missing fields: Returns None instead of crashing
  5. Non-HTTP URLs: Filters out mailto:, ftp:, etc.
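
Edge cases 1, 3, 4, and 5 all funnel through the same two checks. A self-contained illustration of the guid handling (guid_to_url is a hypothetical helper written for this example, not part of rss_utils):

```python
def guid_to_url(guid):
    # Edge case 1 and 5: accept only values that look like HTTP(S) URLs,
    # so internal IDs and mailto:/ftp: schemes fall through.
    if isinstance(guid, str) and guid.startswith('http'):
        return guid
    # Edge case 3: guid delivered as a dict; pull out its 'href' field.
    if isinstance(guid, dict) and guid.get('href', '').startswith('http'):
        return guid['href']
    # Edge case 4: anything else (missing, None, wrong type) yields None.
    return None
```

So guid_to_url('sz-internal-123') and guid_to_url('mailto:editor@example.com') both return None, while guid_to_url({'href': 'https://example.com/a'}) returns the URL.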

Future Improvements

Potential enhancements:

  • Support for feedburner:origLink
  • Support for pheedo:origLink
  • Resolve shortened URLs (bit.ly, etc.)
  • Handle relative URLs (convert to absolute)
  • Cache URL extraction results
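
Of these, relative-URL handling is straightforward with the standard library. A sketch, assuming the feed's own URL is available as a base (absolutize and feed_base_url are illustrative names, not existing code):

```python
from urllib.parse import urljoin

def absolutize(url, feed_base_url):
    # Resolve a relative entry URL (e.g. '/article/123') against the
    # feed's own URL. Absolute URLs pass through unchanged.
    if not url:
        return None
    return urljoin(feed_base_url, url)
```

For example, absolutize('/article/123', 'https://example.com/rss') yields 'https://example.com/article/123'.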