How the News Crawler Works
🎯 Overview
The crawler dynamically extracts article metadata from any website using multiple fallback strategies.
📊 Flow Diagram
RSS Feed URL
↓
Parse RSS Feed
↓
For each article link:
↓
┌─────────────────────────────────────┐
│ 1. Fetch HTML Page                  │
│    GET https://example.com/article  │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ 2. Parse with BeautifulSoup         │
│    soup = BeautifulSoup(html)       │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ 3. Clean HTML                       │
│    Remove: scripts, styles, nav,    │
│    footer, header, ads              │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ 4. Extract Title                    │
│    Try: H1 → OG meta → Twitter →    │
│    Title tag                        │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ 5. Extract Author                   │
│    Try: Meta author → rel=author →  │
│    Class names → JSON-LD            │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ 6. Extract Date                     │
│    Try: <time> → Meta tags →        │
│    Class names → JSON-LD            │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ 7. Extract Content                  │
│    Try: <article> → Class names →   │
│    <main> → <body>                  │
│    Filter short paragraphs          │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ 8. Save to MongoDB                  │
│    {                                │
│      title, author, date,           │
│      content, word_count            │
│    }                                │
└─────────────────────────────────────┘
↓
Wait 1 second (rate limiting)
↓
Next article
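Putting the diagram together, the outer loop could look roughly like this (a sketch: feedparser is the usual Python RSS library, extract_article_content is the extraction routine from steps 1-7, and save_article is a hypothetical stand-in for step 8):

import time
import feedparser

feed = feedparser.parse(feed_url)  # feed_url comes from the active feeds in the database
for entry in feed.entries:
    try:
        article = extract_article_content(entry.link)  # steps 1-7
        save_article(article)                          # step 8 (hypothetical helper)
    except Exception as exc:
        print(f"Skipping {entry.link}: {exc}")
    time.sleep(1)  # rate limiting: wait one second before the next article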
🔍 Detailed Example
Input: RSS Feed Entry
<item>
  <title>New U-Bahn Line Opens</title>
  <link>https://www.sueddeutsche.de/muenchen/article-123</link>
  <pubDate>Mon, 10 Nov 2024 10:00:00 +0100</pubDate>
</item>
Step 1: Fetch HTML
url = "https://www.sueddeutsche.de/muenchen/article-123"
response = requests.get(url, timeout=10)  # a timeout keeps one slow site from stalling the whole crawl
html = response.content
Step 2: Parse HTML
soup = BeautifulSoup(html, 'html.parser')
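html.parser is Python's built-in parser and needs no extra dependency; if lxml is installed, passing 'lxml' instead is a faster drop-in replacement.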
Step 3: Extract Title
# Try H1
h1 = soup.find('h1')
# Result: "New U-Bahn Line Opens in Munich"
# If no H1, try OG meta
og_title = soup.find('meta', property='og:title')
# Fallback chain continues...
Step 4: Extract Author
# Try meta author ('name' is the tag-name argument in find(), so the attribute goes in attrs)
meta_author = soup.find('meta', attrs={'name': 'author'})
# Result: None
# Try class names
author_elem = soup.select_one('[class*="author"]')
# Result: "Max Mustermann"
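The final author fallback in the diagram is JSON-LD. A minimal sketch of that lookup (the helper name and the handled shapes are assumptions; real-world JSON-LD varies):

import json

def author_from_json_ld(soup):
    """Hypothetical helper: pull an author name from JSON-LD, if present."""
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string or '')
        except json.JSONDecodeError:
            continue
        author = data.get('author') if isinstance(data, dict) else None
        if isinstance(author, list) and author:
            author = author[0]  # some pages list several authors; take the first
        if isinstance(author, dict):
            author = author.get('name')  # {"author": {"@type": "Person", "name": "..."}}
        if isinstance(author, str) and author.strip():
            return author.strip()
    return None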
Step 5: Extract Date
# Try time tag
time_tag = soup.find('time')
# Result: "2024-11-10T10:00:00Z"
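If there is no usable <time> tag, the diagram's next stop is meta tags. A sketch of that fallback (article:published_time is an Open Graph convention; the other field name is a common pattern, not a guarantee):

# Prefer the datetime attribute of the <time> tag when present
published = time_tag['datetime'] if time_tag and time_tag.get('datetime') else None
if published is None:
    for attrs in ({'property': 'article:published_time'}, {'name': 'date'}):
        meta = soup.find('meta', attrs=attrs)
        if meta and meta.get('content'):
            published = meta['content']
            break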
Step 6: Extract Content
# Try article tag
article = soup.find('article')
paragraphs = article.find_all('p')
# Filter paragraphs
content = []
for p in paragraphs:
    text = p.get_text().strip()
    if len(text) >= 50:  # Keep substantial paragraphs
        content.append(text)
full_content = '\n\n'.join(content)
# Result: "The new U-Bahn line connecting the city center..."
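The word_count stored in the next step can be derived directly from this text:

word_count = len(full_content.split())  # whitespace-separated tokens, e.g. 1250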
Step 7: Save to Database
article_doc = {
    'title': 'New U-Bahn Line Opens in Munich',
    'author': 'Max Mustermann',
    'link': 'https://www.sueddeutsche.de/muenchen/article-123',
    'summary': 'Short summary from RSS...',
    'full_content': 'The new U-Bahn line connecting...',
    'word_count': 1250,
    'source': 'Süddeutsche Zeitung München',
    'published_at': '2024-11-10T10:00:00Z',
    'crawled_at': datetime.utcnow(),
    'created_at': datetime.utcnow()
}

db.articles.update_one(
    {'link': article_doc['link']},
    {'$set': article_doc},
    upsert=True
)
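Because update_one matches on the article's link with upsert=True, re-running the crawler updates the existing document rather than inserting a duplicate, so repeated runs are idempotent.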
🎨 What Makes It "Dynamic"?
Traditional Approach (Hardcoded)
# Only works for one specific site
title = soup.find('h1', class_='article-title').text
author = soup.find('span', class_='author-name').text
❌ Breaks when the site changes
❌ Doesn't work on other sites
Our Approach (Dynamic)
# Works on ANY site
title = extract_title(soup) # Tries 4 different methods
author = extract_author(soup) # Tries 5 different methods
✅ Adapts to different HTML structures
✅ Falls back to alternatives
✅ Works across multiple sites
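One way to express this pattern is a small combinator that walks a list of strategy functions and returns the first non-empty result (a sketch; the strategy names in the usage comment are hypothetical):

from typing import Callable, Optional

def first_of(soup, *strategies: Callable) -> Optional[str]:
    """Try each extraction strategy in order; return the first non-empty result."""
    for strategy in strategies:
        try:
            result = strategy(soup)
        except Exception:
            continue  # a failing strategy simply falls through to the next one
        if result:
            return result
    return None

# Usage (hypothetical strategy functions):
# title = first_of(soup, title_from_h1, title_from_og, title_from_twitter, title_from_tag)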
🛡️ Robustness Features
1. Multiple Strategies
Each field has 4-6 extraction strategies. For the title, the full chain from the diagram (H1 → OG meta → Twitter → title tag) looks like this:
def extract_title(soup):
    # Strategy 1: prominent H1 heading
    if h1 := soup.find('h1'):
        return h1.get_text().strip()
    # Strategy 2: Open Graph metadata
    if og_title := soup.find('meta', property='og:title'):
        return og_title['content']
    # Strategy 3: Twitter card metadata
    if tw_title := soup.find('meta', attrs={'name': 'twitter:title'}):
        return tw_title['content']
    # Strategy 4: the document's <title> tag
    if soup.title and soup.title.string:
        return soup.title.string.strip()
    return None
2. Validation
# Title must be reasonable length
if title and len(title) > 10:
    return title

# Author must be < 100 chars
if author and len(author) < 100:
    return author
3. Cleaning
# Remove site name from title
if ' | ' in title:
    title = title.split(' | ')[0]

# Remove "By " from author
author = author.replace('By ', '').strip()
4. Error Handling
from requests.exceptions import RequestException, Timeout

try:
    data = extract_article_content(url)
except Timeout:
    print("Timeout - skip")
except RequestException:
    print("Network error - skip")
except Exception:
    print("Unknown error - skip")
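Ordering matters here: in requests, Timeout is a subclass of RequestException, so the more specific handler must come first or it would never be reached.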
📈 Success Metrics
After crawling, you'll see:
📰 Crawling feed: Süddeutsche Zeitung München
  🔍 Crawling: New U-Bahn Line Opens...
    ✓ Saved (1250 words)

Title:   ✓ Found
Author:  ✓ Found (Max Mustermann)
Date:    ✓ Found (2024-11-10T10:00:00Z)
Content: ✓ Found (1250 words)
🗄️ Database Result
Before Crawling:
{
  title: "New U-Bahn Line Opens",
  link: "https://example.com/article",
  summary: "Short RSS summary...",
  source: "Süddeutsche Zeitung"
}
After Crawling:
{
  title: "New U-Bahn Line Opens in Munich",       // ← Enhanced
  author: "Max Mustermann",                       // ← NEW!
  link: "https://example.com/article",
  summary: "Short RSS summary...",
  full_content: "The new U-Bahn line...",         // ← NEW! (1250 words)
  word_count: 1250,                               // ← NEW!
  source: "Süddeutsche Zeitung",
  published_at: "2024-11-10T10:00:00Z",           // ← Enhanced
  crawled_at: ISODate("2024-11-10T16:30:00Z"),    // ← NEW!
  created_at: ISODate("2024-11-10T16:00:00Z")
}
🚀 Running the Crawler
cd news_crawler
pip install -r requirements.txt
python crawler_service.py 10
Output:
============================================================
🚀 Starting RSS Feed Crawler
============================================================
Found 3 active feed(s)
📰 Crawling feed: Süddeutsche Zeitung München
  🔍 Crawling: New U-Bahn Line Opens...
    ✓ Saved (1250 words)
  🔍 Crawling: Munich Weather Update...
    ✓ Saved (450 words)
  ✓ Crawled 2 articles
============================================================
✓ Crawling Complete!
Total feeds processed: 3
Total articles crawled: 15
Duration: 45.23 seconds
============================================================
Now you have rich, structured article data ready for AI processing! 🎉