# RSS URL Extraction - How It Works

## The Problem

Different RSS feed providers use different fields to store the article URL:

### Example 1: Standard RSS (uses `link`)

```xml
<item>
  <title>Article Title</title>
  <link>https://example.com/article/123</link>
  <guid>internal-id-456</guid>
</item>
```

### Example 2: Some feeds (use `guid` as URL)

```xml
<item>
  <title>Article Title</title>
  <guid>https://example.com/article/123</guid>
</item>
```

### Example 3: Atom feeds (use `id`)

```xml
<entry>
  <title>Article Title</title>
  <id>https://example.com/article/123</id>
</entry>
```

### Example 4: Complex feeds (`guid` as object)

```xml
<item>
  <title>Article Title</title>
  <guid isPermaLink="true">https://example.com/article/123</guid>
</item>
```

### Example 5: Multiple links

```xml
<item>
  <title>Article Title</title>
  <link rel="alternate" type="text/html" href="https://example.com/article/123"/>
  <link rel="enclosure" type="image/jpeg" href="https://example.com/image.jpg"/>
</item>
```

## Our Solution

The `extract_article_url()` function tries multiple strategies in order:

### Strategy 1: Check `link` field (most common)

```python
if entry.get('link') and entry.get('link', '').startswith('http'):
    return entry.get('link')
```

✅ Works for: Most RSS 2.0 feeds

### Strategy 2: Check `guid` field

```python
if entry.get('guid'):
    guid = entry.get('guid')
    # guid can be a string
    if isinstance(guid, str) and guid.startswith('http'):
        return guid
    # or a dict with 'href'
    elif isinstance(guid, dict) and guid.get('href', '').startswith('http'):
        return guid.get('href')
```

✅ Works for: Feeds that use GUID as permalink

### Strategy 3: Check `id` field

```python
if entry.get('id') and entry.get('id', '').startswith('http'):
    return entry.get('id')
```

✅ Works for: Atom feeds

### Strategy 4: Check `links` array

```python
if entry.get('links'):
    for link in entry.get('links', []):
        if isinstance(link, dict) and link.get('href', '').startswith('http'):
            # Prefer 'alternate' type
            if link.get('type') == 'text/html' or link.get('rel') == 'alternate':
                return link.get('href')
```

✅ Works for: Feeds with multiple links (prefers HTML content)
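
Chaining the four strategies gives a self-contained sketch of the extractor. This is a minimal illustration of the order described above, not the exact code in `rss_utils`:

```python
def extract_article_url(entry):
    """Try several fields, in order, to find an entry's article URL."""
    # Strategy 1: the standard RSS 2.0 'link' field
    link = entry.get('link')
    if isinstance(link, str) and link.startswith('http'):
        return link

    # Strategy 2: 'guid' used as a permalink (string, or dict with 'href')
    guid = entry.get('guid')
    if isinstance(guid, str) and guid.startswith('http'):
        return guid
    if isinstance(guid, dict) and guid.get('href', '').startswith('http'):
        return guid['href']

    # Strategy 3: the Atom 'id' field
    entry_id = entry.get('id')
    if isinstance(entry_id, str) and entry_id.startswith('http'):
        return entry_id

    # Strategy 4: scan the 'links' array, preferring HTML alternates
    for lnk in entry.get('links') or []:
        if isinstance(lnk, dict) and lnk.get('href', '').startswith('http'):
            if lnk.get('type') == 'text/html' or lnk.get('rel') == 'alternate':
                return lnk['href']

    # No valid URL found: the crawler skips this entry
    return None
```

Because the checks run in a fixed order, a feed that fills both `link` and `guid` always resolves to `link`.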

## Real-World Examples

### Süddeutsche Zeitung

```python
entry = {
    'title': 'Munich News',
    'link': 'https://www.sueddeutsche.de/muenchen/article-123',
    'guid': 'sz-internal-123'
}
# Returns: 'https://www.sueddeutsche.de/muenchen/article-123'
```

### Medium Blog

```python
entry = {
    'title': 'Blog Post',
    'guid': 'https://medium.com/@user/post-abc123',
    'link': None
}
# Returns: 'https://medium.com/@user/post-abc123'
```

### YouTube RSS

```python
entry = {
    'title': 'Video Title',
    'id': 'https://www.youtube.com/watch?v=abc123',
    'link': None
}
# Returns: 'https://www.youtube.com/watch?v=abc123'
```

### Complex Feed

```python
entry = {
    'title': 'Article',
    'links': [
        {'rel': 'alternate', 'type': 'text/html', 'href': 'https://example.com/article'},
        {'rel': 'enclosure', 'type': 'image/jpeg', 'href': 'https://example.com/image.jpg'}
    ]
}
# Returns: 'https://example.com/article' (prefers text/html)
```

## Validation

All extracted URLs must:

1. Start with `http://` or `https://`
2. Be a valid string (not `None` or empty)

If no valid URL is found:

```python
return None
# Crawler will skip this entry and log a warning
```

## Testing Different Feeds

To test if a feed works with our extractor:

```python
import feedparser
from rss_utils import extract_article_url

# Parse feed
feed = feedparser.parse('https://example.com/rss')

# Test each entry
for entry in feed.entries[:5]:
    url = extract_article_url(entry)
    if url:
        print(f"✓ {entry.get('title', 'No title')[:50]}")
        print(f"  URL: {url}")
    else:
        print(f"✗ {entry.get('title', 'No title')[:50]}")
        print("  No valid URL found")
        print(f"  Available fields: {list(entry.keys())}")
```

## Supported Feed Types

- ✅ RSS 2.0
- ✅ RSS 1.0
- ✅ Atom
- ✅ Custom RSS variants
- ✅ Feeds with multiple links
- ✅ Feeds with GUID as permalink

## Edge Cases Handled

1. **GUID is not a URL**: Checks if it starts with `http`
2. **Multiple links**: Prefers `text/html` type
3. **GUID as dict**: Extracts `href` field
4. **Missing fields**: Returns `None` instead of crashing
5. **Non-HTTP URLs**: Filters out `mailto:`, `ftp:`, etc.
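
Case 5 falls out of the same `startswith('http')` prefix check used by every strategy, as a quick illustration shows:

```python
# Only HTTP(S) candidates survive the prefix check used by the strategies above
candidates = ['mailto:editor@example.com', 'ftp://example.com/file.zip',
              'urn:uuid:1234', 'https://example.com/article']
accepted = [c for c in candidates if c.startswith('http')]
# accepted == ['https://example.com/article']
```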

## Future Improvements

Potential enhancements:

- [ ] Support for `feedburner:origLink`
- [ ] Support for `pheedo:origLink`
- [ ] Resolve shortened URLs (bit.ly, etc.)
- [ ] Handle relative URLs (convert to absolute)
- [ ] Cache URL extraction results
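
For the relative-URL item, the standard library already does the heavy lifting. A sketch, assuming the feed's own URL is available at extraction time to serve as the base:

```python
from urllib.parse import urljoin

def absolutize(url, feed_url):
    """Resolve a possibly-relative article URL against the feed URL."""
    # urljoin leaves already-absolute URLs unchanged
    return urljoin(feed_url, url)
```

With that in place, the `startswith('http')` checks would move after resolution, so relative paths like `/article/123` are no longer discarded.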