# RSS URL Extraction - How It Works

## The Problem

Different RSS feed providers use different fields to store the article URL:

### Example 1: Standard RSS (uses `link`)

```xml
<item>
  <title>Article Title</title>
  <link>https://example.com/article/123</link>
  <guid>internal-id-456</guid>
</item>
```

### Example 2: Some feeds (use `guid` as URL)

```xml
<item>
  <title>Article Title</title>
  <guid>https://example.com/article/123</guid>
</item>
```

### Example 3: Atom feeds (use `id`)

```xml
<entry>
  <title>Article Title</title>
  <id>https://example.com/article/123</id>
</entry>
```

### Example 4: Complex feeds (`guid` as object)

```xml
<item>
  <title>Article Title</title>
  <guid isPermaLink="true">https://example.com/article/123</guid>
</item>
```

### Example 5: Multiple links

```xml
<item>
  <title>Article Title</title>
  <link rel="alternate" type="text/html" href="https://example.com/article/123"/>
  <link rel="enclosure" type="image/jpeg" href="https://example.com/image.jpg"/>
</item>
```

## Our Solution

The `extract_article_url()` function tries multiple strategies in order:

### Strategy 1: Check `link` field (most common)

```python
if entry.get('link') and entry.get('link', '').startswith('http'):
    return entry.get('link')
```

✅ Works for: Most RSS 2.0 feeds

### Strategy 2: Check `guid` field

```python
if entry.get('guid'):
    guid = entry.get('guid')
    # guid can be a string
    if isinstance(guid, str) and guid.startswith('http'):
        return guid
    # or a dict with 'href'
    elif isinstance(guid, dict) and guid.get('href', '').startswith('http'):
        return guid.get('href')
```

✅ Works for: Feeds that use GUID as permalink

### Strategy 3: Check `id` field

```python
if entry.get('id') and entry.get('id', '').startswith('http'):
    return entry.get('id')
```

✅ Works for: Atom feeds

### Strategy 4: Check `links` array

```python
if entry.get('links'):
    for link in entry.get('links', []):
        if isinstance(link, dict) and link.get('href', '').startswith('http'):
            # Prefer 'alternate' type
            if link.get('type') == 'text/html' or link.get('rel') == 'alternate':
                return link.get('href')
```

✅ Works for: Feeds with multiple links (prefers HTML content)
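
Chaining the four strategies gives a self-contained sketch of the extractor. This is a minimal illustration of the order described above, not the exact code in `rss_utils`:

```python
def extract_article_url(entry):
    """Try several fields, in order, to find an entry's article URL."""
    # Strategy 1: the standard RSS 2.0 'link' field
    link = entry.get('link')
    if isinstance(link, str) and link.startswith('http'):
        return link

    # Strategy 2: 'guid' used as a permalink (string, or dict with 'href')
    guid = entry.get('guid')
    if isinstance(guid, str) and guid.startswith('http'):
        return guid
    if isinstance(guid, dict) and guid.get('href', '').startswith('http'):
        return guid['href']

    # Strategy 3: the Atom 'id' field
    entry_id = entry.get('id')
    if isinstance(entry_id, str) and entry_id.startswith('http'):
        return entry_id

    # Strategy 4: scan the 'links' array, preferring HTML alternates
    for lnk in entry.get('links') or []:
        if isinstance(lnk, dict) and lnk.get('href', '').startswith('http'):
            if lnk.get('type') == 'text/html' or lnk.get('rel') == 'alternate':
                return lnk['href']

    # No valid URL found: the crawler skips this entry
    return None
```

Because the checks run in a fixed order, a feed that fills both `link` and `guid` always resolves to `link`.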

## Real-World Examples

### Süddeutsche Zeitung

```python
entry = {
    'title': 'Munich News',
    'link': 'https://www.sueddeutsche.de/muenchen/article-123',
    'guid': 'sz-internal-123'
}
# Returns: 'https://www.sueddeutsche.de/muenchen/article-123'
```

### Medium Blog

```python
entry = {
    'title': 'Blog Post',
    'guid': 'https://medium.com/@user/post-abc123',
    'link': None
}
# Returns: 'https://medium.com/@user/post-abc123'
```

### YouTube RSS

```python
entry = {
    'title': 'Video Title',
    'id': 'https://www.youtube.com/watch?v=abc123',
    'link': None
}
# Returns: 'https://www.youtube.com/watch?v=abc123'
```

### Complex Feed

```python
entry = {
    'title': 'Article',
    'links': [
        {'rel': 'alternate', 'type': 'text/html', 'href': 'https://example.com/article'},
        {'rel': 'enclosure', 'type': 'image/jpeg', 'href': 'https://example.com/image.jpg'}
    ]
}
# Returns: 'https://example.com/article' (prefers text/html)
```

## Validation

All extracted URLs must:

1. Start with `http://` or `https://`
2. Be a valid string (not `None` or empty)

If no valid URL is found:

```python
return None
# Crawler will skip this entry and log a warning
```

## Testing Different Feeds

To test if a feed works with our extractor:

```python
import feedparser
from rss_utils import extract_article_url

# Parse feed
feed = feedparser.parse('https://example.com/rss')

# Test each entry
for entry in feed.entries[:5]:
    url = extract_article_url(entry)
    if url:
        print(f"✓ {entry.get('title', 'No title')[:50]}")
        print(f"  URL: {url}")
    else:
        print(f"✗ {entry.get('title', 'No title')[:50]}")
        print("  No valid URL found")
        print(f"  Available fields: {list(entry.keys())}")
```

## Supported Feed Types

- ✅ RSS 2.0
- ✅ RSS 1.0
- ✅ Atom
- ✅ Custom RSS variants
- ✅ Feeds with multiple links
- ✅ Feeds with GUID as permalink

## Edge Cases Handled

1. **GUID is not a URL**: Checks if it starts with `http`
2. **Multiple links**: Prefers `text/html` type
3. **GUID as dict**: Extracts `href` field
4. **Missing fields**: Returns `None` instead of crashing
5. **Non-HTTP URLs**: Filters out `mailto:`, `ftp:`, etc.
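
Case 5 falls out of the same `startswith('http')` prefix check used by every strategy, as a quick illustration shows:

```python
# Only HTTP(S) candidates survive the prefix check used by the strategies above
candidates = ['mailto:editor@example.com', 'ftp://example.com/file.zip',
              'urn:uuid:1234', 'https://example.com/article']
accepted = [c for c in candidates if c.startswith('http')]
# accepted == ['https://example.com/article']
```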

## Future Improvements

Potential enhancements:

- [ ] Support for `feedburner:origLink`
- [ ] Support for `pheedo:origLink`
- [ ] Resolve shortened URLs (bit.ly, etc.)
- [ ] Handle relative URLs (convert to absolute)
- [ ] Cache URL extraction results
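
For the relative-URL item, the standard library already does the heavy lifting. A sketch, assuming the feed's own URL is available at extraction time to serve as the base:

```python
from urllib.parse import urljoin

def absolutize(url, feed_url):
    """Resolve a possibly-relative article URL against the feed URL."""
    # urljoin leaves already-absolute URLs unchanged
    return urljoin(feed_url, url)
```

With that in place, the `startswith('http')` checks would move after resolution, so relative paths like `/article/123` are no longer discarded.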