# How the News Crawler Works

## 🎯 Overview

The crawler dynamically extracts article metadata from any website using multiple fallback strategies.

## 📊 Flow Diagram

```
RSS Feed URL
     ↓
Parse RSS Feed
     ↓
For each article link:
     ↓
┌─────────────────────────────────────┐
│ 1. Fetch HTML Page                  │
│    GET https://example.com/article  │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 2. Parse with BeautifulSoup         │
│    soup = BeautifulSoup(html)       │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 3. Clean HTML                       │
│    Remove: scripts, styles, nav,    │
│            footer, header, ads      │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 4. Extract Title                    │
│    Try: H1 → OG meta → Twitter →    │
│         Title tag                   │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 5. Extract Author                   │
│    Try: Meta author → rel=author →  │
│         Class names → JSON-LD       │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 6. Extract Date                     │
│    Try: <time> → Meta tags →        │
│         Class names → JSON-LD       │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 7. Extract Content                  │
│    Try: <article> → Class names →   │
│         <main> → <body>             │
│    Filter short paragraphs          │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 8. Save to MongoDB                  │
│    {                                │
│      title, author, date,           │
│      content, word_count            │
│    }                                │
└─────────────────────────────────────┘
     ↓
Wait 1 second (rate limiting)
     ↓
Next article
```
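
The same loop, sketched as code. This is a minimal outline rather than the actual `crawler_service.py`: it assumes `feedparser` is available for the RSS step, reuses the `extract_article_content` name used later in this document, and `save_article` is a hypothetical helper standing in for step 8.

```python
import time
import feedparser

def crawl_feed(feed_url, db):
    feed = feedparser.parse(feed_url)                    # Parse RSS Feed
    for entry in feed.entries:                           # For each article link
        try:
            data = extract_article_content(entry.link)   # steps 1-7 above
            save_article(db, entry, data)                # step 8: upsert into MongoDB
        except Exception:
            continue                                     # skip the article, keep crawling
        time.sleep(1)                                    # rate limiting between articles
```
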
## 🔍 Detailed Example

### Input: RSS Feed Entry
```xml
<item>
  <title>New U-Bahn Line Opens</title>
  <link>https://www.sueddeutsche.de/muenchen/article-123</link>
  <pubDate>Mon, 10 Nov 2024 10:00:00 +0100</pubDate>
</item>
```

### Step 1: Fetch HTML
```python
import requests

url = "https://www.sueddeutsche.de/muenchen/article-123"
response = requests.get(url, timeout=10)  # timeout so a slow page raises Timeout instead of hanging (value is illustrative)
html = response.content
```

### Step 2: Parse HTML
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
```

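Step 3 of the flow diagram (Clean HTML) has no snippet in this walkthrough. A minimal sketch of what it could look like — the exact tag list is an assumption, and ad containers usually have to be matched by site-specific class names:

```python
# Drop boilerplate elements before extracting anything (tag list is an assumption)
for tag in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
    tag.decompose()
```
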
### Step 3: Extract Title
```python
# Try H1
h1 = soup.find('h1')
title = h1.get_text(strip=True) if h1 else None
# Result: "New U-Bahn Line Opens in Munich"

# If no H1, try OG meta
og_title = soup.find('meta', property='og:title')
# og_title['content'] would be used instead
# Fallback chain continues...
```

### Step 4: Extract Author
```python
# Try meta author (attrs= is needed because name= is a reserved argument in find())
meta_author = soup.find('meta', attrs={'name': 'author'})
# Result: None

# Try class names
author_elem = soup.select_one('[class*="author"]')
author = author_elem.get_text(strip=True) if author_elem else None
# Result: "Max Mustermann"
```

### Step 5: Extract Date
```python
# Try time tag (the machine-readable value lives in its datetime attribute)
time_tag = soup.find('time')
date = time_tag.get('datetime') if time_tag else None
# Result: "2024-11-10T10:00:00Z"
```

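Both the author and date chains end with JSON-LD (see the flow diagram), which is not shown above. A minimal sketch of that last-resort fallback, reusing `soup`, `author`, and `date` from the previous steps and assuming the page embeds an `application/ld+json` block with `author` and `datePublished` fields:

```python
import json

# Last resort: structured data that many news sites embed for search engines
ld_script = soup.find('script', type='application/ld+json')
if ld_script and ld_script.string:
    try:
        ld = json.loads(ld_script.string)
        author_info = ld.get('author')
        if not author and isinstance(author_info, dict):
            author = author_info.get('name')   # author can also be a list on real pages
        date = date or ld.get('datePublished')
    except json.JSONDecodeError:
        pass  # malformed JSON-LD: keep whatever the earlier strategies found
```
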
### Step 6: Extract Content
```python
# Try article tag
article = soup.find('article')
paragraphs = article.find_all('p')

# Filter paragraphs
content = []
for p in paragraphs:
    text = p.get_text().strip()
    if len(text) >= 50:  # Keep substantial paragraphs
        content.append(text)

full_content = '\n\n'.join(content)
# Result: "The new U-Bahn line connecting the city center..."
```

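The `word_count` stored in the next step can be derived directly from the assembled text; a one-line sketch:

```python
# Simple whitespace-based word count for the word_count field
word_count = len(full_content.split())
# Result: 1250
```
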
### Step 7: Save to Database
```python
from datetime import datetime

article_doc = {
    'title': 'New U-Bahn Line Opens in Munich',
    'author': 'Max Mustermann',
    'link': 'https://www.sueddeutsche.de/muenchen/article-123',
    'summary': 'Short summary from RSS...',
    'full_content': 'The new U-Bahn line connecting...',
    'word_count': 1250,
    'source': 'Süddeutsche Zeitung München',
    'published_at': '2024-11-10T10:00:00Z',
    'crawled_at': datetime.utcnow(),
    'created_at': datetime.utcnow()
}

db.articles.update_one(
    {'link': article_doc['link']},   # upsert keyed on the article URL
    {'$set': article_doc},
    upsert=True
)
```

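An optional companion to this upsert (not part of the snippet above) is a unique index on `link`: it keeps the lookup fast and guarantees one document per article URL. A sketch:

```python
# Optional: enforce one document per article URL and speed up the upsert lookup
db.articles.create_index('link', unique=True)
```
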
## 🎨 What Makes It "Dynamic"?

### Traditional Approach (Hardcoded)
```python
# Only works for one specific site
title = soup.find('h1', class_='article-title').text
author = soup.find('span', class_='author-name').text
```
❌ Breaks when the site changes
❌ Doesn't work on other sites

### Our Approach (Dynamic)
```python
# Works on ANY site
title = extract_title(soup)    # Tries 4 different methods
author = extract_author(soup)  # Tries 5 different methods
```
✅ Adapts to different HTML structures
✅ Falls back to alternatives
✅ Works across multiple sites

## 🛡️ Robustness Features

### 1. Multiple Strategies
Each field has 4-6 extraction strategies:
```python
def extract_title(soup):
    # Try strategy 1
    if h1 := soup.find('h1'):
        return h1.text

    # Try strategy 2
    if og_title := soup.find('meta', property='og:title'):
        return og_title['content']

    # Try strategy 3...
    # Try strategy 4...
```

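Strategies 3 and 4 of the title chain are elided above; per the flow diagram they are the Twitter card meta tag and the plain `<title>` tag. A sketch of how those last two branches could look (the function name here is just for illustration):

```python
def extract_title_fallbacks(soup):
    # Strategy 3: Twitter card metadata
    if tw_title := soup.find('meta', attrs={'name': 'twitter:title'}):
        return tw_title['content']

    # Strategy 4: plain <title> tag (often carries the site name, cleaned later)
    if soup.title and soup.title.string:
        return soup.title.string.strip()
    return None
```
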
### 2. Validation
```python
# Title must be a reasonable length
if title and len(title) > 10:
    return title

# Author must be < 100 chars
if author and len(author) < 100:
    return author
```

### 3. Cleaning
```python
# Remove site name from title
if ' | ' in title:
    title = title.split(' | ')[0]

# Remove "By" from author
author = author.replace('By ', '').strip()
```

### 4. Error Handling
```python
from requests.exceptions import Timeout, RequestException

try:
    data = extract_article_content(url)
except Timeout:
    print("Timeout - skip")
except RequestException:
    print("Network error - skip")
except Exception:
    print("Unknown error - skip")
```

## 📈 Success Metrics

After crawling, you'll see:

```
📰 Crawling feed: Süddeutsche Zeitung München
  🔍 Crawling: New U-Bahn Line Opens...
    ✓ Saved (1250 words)

Title:   ✓ Found
Author:  ✓ Found (Max Mustermann)
Date:    ✓ Found (2024-11-10T10:00:00Z)
Content: ✓ Found (1250 words)
```

## 🗄️ Database Result

**Before Crawling:**
```javascript
{
  title: "New U-Bahn Line Opens",
  link: "https://example.com/article",
  summary: "Short RSS summary...",
  source: "Süddeutsche Zeitung"
}
```

**After Crawling:**
```javascript
{
  title: "New U-Bahn Line Opens in Munich",     // ← Enhanced
  author: "Max Mustermann",                     // ← NEW!
  link: "https://example.com/article",
  summary: "Short RSS summary...",
  full_content: "The new U-Bahn line...",       // ← NEW! (1250 words)
  word_count: 1250,                             // ← NEW!
  source: "Süddeutsche Zeitung",
  published_at: "2024-11-10T10:00:00Z",         // ← Enhanced
  crawled_at: ISODate("2024-11-10T16:30:00Z"),  // ← NEW!
  created_at: ISODate("2024-11-10T16:00:00Z")
}
```

## 🚀 Running the Crawler

```bash
cd news_crawler
pip install -r requirements.txt
python crawler_service.py 10
```

Output:
```
============================================================
🚀 Starting RSS Feed Crawler
============================================================
Found 3 active feed(s)

📰 Crawling feed: Süddeutsche Zeitung München
  🔍 Crawling: New U-Bahn Line Opens...
    ✓ Saved (1250 words)
  🔍 Crawling: Munich Weather Update...
    ✓ Saved (450 words)
  ✓ Crawled 2 articles

============================================================
✓ Crawling Complete!
  Total feeds processed: 3
  Total articles crawled: 15
  Duration: 45.23 seconds
============================================================
```

Now you have rich, structured article data ready for AI processing! 🎉