# How the News Crawler Works

## 🎯 Overview

The crawler dynamically extracts article metadata from any website using multiple fallback strategies.

## 📊 Flow Diagram

```
RSS Feed URL
     ↓
Parse RSS Feed
     ↓
For each article link:
     ↓
┌─────────────────────────────────────┐
│ 1. Fetch HTML Page                  │
│    GET https://example.com/article  │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 2. Parse with BeautifulSoup         │
│    soup = BeautifulSoup(html)       │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 3. Clean HTML                       │
│    Remove: scripts, styles, nav,    │
│            footer, header, ads      │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 4. Extract Title                    │
│    Try: H1 → OG meta → Twitter →    │
│         Title tag                   │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 5. Extract Author                   │
│    Try: Meta author → rel=author →  │
│         Class names → JSON-LD       │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 6. Extract Date                     │
│    Try: <time> → Meta tags →        │
│         Class names → JSON-LD       │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 7. Extract Content                  │
│    Try: <article> → Class names →   │
│         <main> → <body>             │
│    Filter short paragraphs          │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 8. Save to MongoDB                  │
│    {                                │
│      title, author, date,           │
│      content, word_count            │
│    }                                │
└─────────────────────────────────────┘
     ↓
Wait 1 second (rate limiting)
     ↓
Next article
```
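
Putting the diagram together, the per-feed loop looks roughly like this. It is a minimal sketch rather than the actual `crawler_service.py`: the `feedparser` dependency, the MongoDB connection and database name, and the simplified single-strategy extraction are all assumptions; the real fallback chains are walked through step by step below.

```python
# Minimal sketch of the crawl loop; names and connection details are assumptions.
import time
import feedparser
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

db = MongoClient()['news']  # assumed local MongoDB and database name

def crawl_feed(feed_url, source_name):
    feed = feedparser.parse(feed_url)                         # Parse RSS feed
    for entry in feed.entries:                                # For each article link
        html = requests.get(entry.link, timeout=10).content   # 1. Fetch HTML page
        soup = BeautifulSoup(html, 'html.parser')             # 2. Parse with BeautifulSoup

        for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
            tag.decompose()                                    # 3. Clean HTML

        # 4-7. Only the first strategy of each fallback chain is shown here.
        h1 = soup.find('h1')
        meta_author = soup.find('meta', attrs={'name': 'author'})
        time_tag = soup.find('time')
        article = soup.find('article')
        paragraphs = [p.get_text().strip() for p in article.find_all('p')] if article else []
        content = '\n\n'.join(p for p in paragraphs if len(p) >= 50)

        doc = {                                                # 8. Save to MongoDB
            'title': h1.get_text().strip() if h1 else entry.title,
            'author': meta_author.get('content') if meta_author else None,
            'published_at': time_tag.get('datetime') if time_tag else None,
            'full_content': content,
            'word_count': len(content.split()),
            'link': entry.link,
            'source': source_name,
        }
        db.articles.update_one({'link': entry.link}, {'$set': doc}, upsert=True)
        time.sleep(1)                                          # Wait 1 second (rate limiting)
```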

## 🔍 Detailed Example

### Input: RSS Feed Entry
```xml
<item>
  <title>New U-Bahn Line Opens</title>
  <link>https://www.sueddeutsche.de/muenchen/article-123</link>
  <pubDate>Mon, 10 Nov 2024 10:00:00 +0100</pubDate>
</item>
```

### Step 1: Fetch HTML
```python
url = "https://www.sueddeutsche.de/muenchen/article-123"
response = requests.get(url)
html = response.content
```
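
In the real service the fetch presumably also sets a timeout (the error handling section below catches `Timeout`) and may send a browser-like User-Agent, since some sites reject the default one. A hedged variant; the header string and the 10-second value are illustrative, not taken from the crawler:

```python
headers = {'User-Agent': 'Mozilla/5.0 (compatible; NewsCrawler/1.0)'}  # assumed value
response = requests.get(url, headers=headers, timeout=10)              # assumed timeout
response.raise_for_status()  # treat HTTP errors like other failures and skip the article
html = response.content
```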

### Step 2: Parse HTML
```python
soup = BeautifulSoup(html, 'html.parser')
```

### Step 3: Extract Title
```python
# Try H1
h1 = soup.find('h1')
# Result: "New U-Bahn Line Opens in Munich"

# If no H1, try OG meta
og_title = soup.find('meta', property='og:title')
# Fallback chain continues...
```

### Step 4: Extract Author
```python
# Try meta author ('name' clashes with find()'s first argument, so pass it via attrs=)
meta_author = soup.find('meta', attrs={'name': 'author'})
# Result: None

# Try class names
author_elem = soup.select_one('[class*="author"]')
# Result: "Max Mustermann"
```

### Step 5: Extract Date
```python
# Try <time> tag
time_tag = soup.find('time')
# Result: "2024-11-10T10:00:00Z"
```
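
The flow diagram lists JSON-LD as the final fallback for both author and date, although no example of it appears above. A sketch of what that lookup might look like; the helper name and its exact behaviour are illustrative, not the crawler's actual code:

```python
import json

def from_json_ld(soup):
    """Illustrative JSON-LD fallback for author and publication date."""
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string or '')
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict):  # real pages sometimes wrap this in a list
            author = data.get('author')
            name = author.get('name') if isinstance(author, dict) else author
            return name, data.get('datePublished')
    return None, None
```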

### Step 6: Extract Content
```python
# Try <article> tag
article = soup.find('article')
paragraphs = article.find_all('p')

# Filter paragraphs
content = []
for p in paragraphs:
    text = p.get_text().strip()
    if len(text) >= 50:  # Keep substantial paragraphs
        content.append(text)

full_content = '\n\n'.join(content)
# Result: "The new U-Bahn line connecting the city center..."
```

### Step 7: Save to Database
```python
article_doc = {
    'title': 'New U-Bahn Line Opens in Munich',
    'author': 'Max Mustermann',
    'link': 'https://www.sueddeutsche.de/muenchen/article-123',
    'summary': 'Short summary from RSS...',
    'full_content': 'The new U-Bahn line connecting...',
    'word_count': 1250,
    'source': 'Süddeutsche Zeitung München',
    'published_at': '2024-11-10T10:00:00Z',
    'crawled_at': datetime.utcnow(),
    'created_at': datetime.utcnow()
}

db.articles.update_one(
    {'link': article_url},
    {'$set': article_doc},
    upsert=True
)
```
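
Because the update is keyed on `link` with `upsert=True`, re-running the crawler refreshes an already-stored article in place instead of inserting a duplicate.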

## 🎨 What Makes It "Dynamic"?

### Traditional Approach (Hardcoded)
```python
# Only works for one specific site
title = soup.find('h1', class_='article-title').text
author = soup.find('span', class_='author-name').text
```
❌ Breaks when the site changes
❌ Doesn't work on other sites

### Our Approach (Dynamic)
```python
# Works on ANY site
title = extract_title(soup)    # Tries 4 different methods
author = extract_author(soup)  # Tries 5 different methods
```
✅ Adapts to different HTML structures
✅ Falls back to alternatives
✅ Works across multiple sites

## 🛡️ Robustness Features

### 1. Multiple Strategies
Each field has 4-6 extraction strategies:
```python
def extract_title(soup):
    # Strategy 1: first <h1>
    if h1 := soup.find('h1'):
        return h1.text

    # Strategy 2: Open Graph meta tag
    if og_title := soup.find('meta', property='og:title'):
        return og_title['content']

    # Strategy 3: Twitter meta tag
    if tw_title := soup.find('meta', attrs={'name': 'twitter:title'}):
        return tw_title['content']

    # Strategy 4: <title> tag
    if title_tag := soup.find('title'):
        return title_tag.text

    return None
```
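
The author chain from the diagram (meta author → rel=author → class names → JSON-LD) could be sketched the same way; this illustrates the pattern rather than reproducing the crawler's exact code:

```python
def extract_author(soup):
    # Strategy 1: <meta name="author" content="...">
    if (meta := soup.find('meta', attrs={'name': 'author'})) and meta.get('content'):
        return meta['content'].strip()

    # Strategy 2: <a rel="author">...</a>
    if link := soup.find('a', rel='author'):
        return link.get_text().strip()

    # Strategy 3: any element with "author" in its class name
    if elem := soup.select_one('[class*="author"]'):
        return elem.get_text().strip()

    # Strategy 4: JSON-LD (see the sketch after Step 5 above)
    return None
```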

### 2. Validation
```python
# Title must be reasonable length
if title and len(title) > 10:
    return title

# Author must be < 100 chars
if author and len(author) < 100:
    return author
```

### 3. Cleaning
```python
# Remove site name from title
if ' | ' in title:
    title = title.split(' | ')[0]

# Remove "By" from author
author = author.replace('By ', '').strip()
```
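
These checks presumably run at the end of each extractor. As a purely hypothetical example, the cleaning and validation rules for the author field might combine like this:

```python
def clean_author(raw):
    """Hypothetical helper combining the cleaning and validation rules above."""
    if not raw:
        return None
    author = raw.replace('By ', '').strip()
    return author if 0 < len(author) < 100 else None
```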

### 4. Error Handling
```python
from requests.exceptions import Timeout, RequestException

try:
    data = extract_article_content(url)
except Timeout:
    print("Timeout - skip")
except RequestException:
    print("Network error - skip")
except Exception:
    print("Unknown error - skip")
```

## 📈 Success Metrics

After crawling, you'll see:

```
📰 Crawling feed: Süddeutsche Zeitung München
🔍 Crawling: New U-Bahn Line Opens...
✓ Saved (1250 words)

Title: ✓ Found
Author: ✓ Found (Max Mustermann)
Date: ✓ Found (2024-11-10T10:00:00Z)
Content: ✓ Found (1250 words)
```

## 🗄️ Database Result

**Before Crawling:**
```javascript
{
  title: "New U-Bahn Line Opens",
  link: "https://example.com/article",
  summary: "Short RSS summary...",
  source: "Süddeutsche Zeitung"
}
```

**After Crawling:**
```javascript
{
  title: "New U-Bahn Line Opens in Munich", // ← Enhanced
  author: "Max Mustermann", // ← NEW!
  link: "https://example.com/article",
  summary: "Short RSS summary...",
  full_content: "The new U-Bahn line...", // ← NEW! (1250 words)
  word_count: 1250, // ← NEW!
  source: "Süddeutsche Zeitung",
  published_at: "2024-11-10T10:00:00Z", // ← Enhanced
  crawled_at: ISODate("2024-11-10T16:30:00Z"), // ← NEW!
  created_at: ISODate("2024-11-10T16:00:00Z")
}
```

## 🚀 Running the Crawler

```bash
cd news_crawler
pip install -r requirements.txt
python crawler_service.py 10
```

Output:
```
============================================================
🚀 Starting RSS Feed Crawler
============================================================
Found 3 active feed(s)

📰 Crawling feed: Süddeutsche Zeitung München
🔍 Crawling: New U-Bahn Line Opens...
✓ Saved (1250 words)
🔍 Crawling: Munich Weather Update...
✓ Saved (450 words)
✓ Crawled 2 articles

============================================================
✓ Crawling Complete!
Total feeds processed: 3
Total articles crawled: 15
Duration: 45.23 seconds
============================================================
```

Now you have rich, structured article data ready for AI processing! 🎉