# How the News Crawler Works

## 🎯 Overview

The crawler dynamically extracts article metadata from any website using multiple fallback strategies.

## 📊 Flow Diagram

```
RSS Feed URL
     ↓
Parse RSS Feed
     ↓
For each article link:
     ↓
┌─────────────────────────────────────┐
│ 1. Fetch HTML Page                  │
│    GET https://example.com/article  │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 2. Parse with BeautifulSoup         │
│    soup = BeautifulSoup(html)       │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 3. Clean HTML                       │
│    Remove: scripts, styles, nav,    │
│            footer, header, ads      │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 4. Extract Title                    │
│    Try: H1 → OG meta → Twitter →    │
│         Title tag                   │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 5. Extract Author                   │
│    Try: Meta author → rel=author →  │
│         Class names → JSON-LD       │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 6. Extract Date                     │
│    Try: <time> → Meta tags →        │
│         Class names → JSON-LD       │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 7. Extract Content                  │
│    Try: <article> → Class names →   │
│         <main> → <body>             │
│    Filter short paragraphs          │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 8. Save to MongoDB                  │
│    {                                │
│      title, author, date,           │
│      content, word_count            │
│    }                                │
└─────────────────────────────────────┘
     ↓
Wait 1 second (rate limiting)
     ↓
Next article
```
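
Putting the diagram together, the per-feed loop looks roughly like this. It is a minimal sketch rather than the actual `crawler_service.py`: the `feedparser` dependency, the MongoDB connection and database name, and the simplified single-strategy extraction are all assumptions; the real fallback chains are walked through step by step below.

```python
# Minimal sketch of the crawl loop; names and connection details are assumptions.
import time
import feedparser
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

db = MongoClient()['news']  # assumed local MongoDB and database name

def crawl_feed(feed_url, source_name):
    feed = feedparser.parse(feed_url)                         # Parse RSS feed
    for entry in feed.entries:                                # For each article link
        html = requests.get(entry.link, timeout=10).content   # 1. Fetch HTML page
        soup = BeautifulSoup(html, 'html.parser')             # 2. Parse with BeautifulSoup

        for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
            tag.decompose()                                    # 3. Clean HTML

        # 4-7. Only the first strategy of each fallback chain is shown here.
        h1 = soup.find('h1')
        meta_author = soup.find('meta', attrs={'name': 'author'})
        time_tag = soup.find('time')
        article = soup.find('article')
        paragraphs = [p.get_text().strip() for p in article.find_all('p')] if article else []
        content = '\n\n'.join(p for p in paragraphs if len(p) >= 50)

        doc = {                                                # 8. Save to MongoDB
            'title': h1.get_text().strip() if h1 else entry.title,
            'author': meta_author.get('content') if meta_author else None,
            'published_at': time_tag.get('datetime') if time_tag else None,
            'full_content': content,
            'word_count': len(content.split()),
            'link': entry.link,
            'source': source_name,
        }
        db.articles.update_one({'link': entry.link}, {'$set': doc}, upsert=True)
        time.sleep(1)                                          # Wait 1 second (rate limiting)
```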

## 🔍 Detailed Example

### Input: RSS Feed Entry
```xml
<item>
  <title>New U-Bahn Line Opens</title>
  <link>https://www.sueddeutsche.de/muenchen/article-123</link>
  <pubDate>Mon, 10 Nov 2024 10:00:00 +0100</pubDate>
</item>
```

### Step 1: Fetch HTML
```python
url = "https://www.sueddeutsche.de/muenchen/article-123"
response = requests.get(url)
html = response.content
```
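
In the real service the fetch presumably also sets a timeout (the error handling section below catches `Timeout`) and may send a browser-like User-Agent, since some sites reject the default one. A hedged variant; the header string and the 10-second value are illustrative, not taken from the crawler:

```python
headers = {'User-Agent': 'Mozilla/5.0 (compatible; NewsCrawler/1.0)'}  # assumed value
response = requests.get(url, headers=headers, timeout=10)              # assumed timeout
response.raise_for_status()  # treat HTTP errors like other failures and skip the article
html = response.content
```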

### Step 2: Parse HTML
```python
soup = BeautifulSoup(html, 'html.parser')
```

### Step 3: Extract Title
```python
# Try H1
h1 = soup.find('h1')
# Result: "New U-Bahn Line Opens in Munich"

# If no H1, try OG meta
og_title = soup.find('meta', property='og:title')
# Fallback chain continues...
```

### Step 4: Extract Author
```python
# Try meta author ('name' clashes with find()'s first argument, so pass it via attrs=)
meta_author = soup.find('meta', attrs={'name': 'author'})
# Result: None

# Try class names
author_elem = soup.select_one('[class*="author"]')
# Result: "Max Mustermann"
```

### Step 5: Extract Date
```python
# Try <time> tag
time_tag = soup.find('time')
# Result: "2024-11-10T10:00:00Z"
```
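
The flow diagram lists JSON-LD as the final fallback for both author and date, although no example of it appears above. A sketch of what that lookup might look like; the helper name and its exact behaviour are illustrative, not the crawler's actual code:

```python
import json

def from_json_ld(soup):
    """Illustrative JSON-LD fallback for author and publication date."""
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string or '')
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict):  # real pages sometimes wrap this in a list
            author = data.get('author')
            name = author.get('name') if isinstance(author, dict) else author
            return name, data.get('datePublished')
    return None, None
```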

### Step 6: Extract Content
```python
# Try <article> tag
article = soup.find('article')
paragraphs = article.find_all('p')

# Filter paragraphs
content = []
for p in paragraphs:
    text = p.get_text().strip()
    if len(text) >= 50:  # Keep substantial paragraphs
        content.append(text)

full_content = '\n\n'.join(content)
# Result: "The new U-Bahn line connecting the city center..."
```

### Step 7: Save to Database
```python
article_doc = {
    'title': 'New U-Bahn Line Opens in Munich',
    'author': 'Max Mustermann',
    'link': 'https://www.sueddeutsche.de/muenchen/article-123',
    'summary': 'Short summary from RSS...',
    'full_content': 'The new U-Bahn line connecting...',
    'word_count': 1250,
    'source': 'Süddeutsche Zeitung München',
    'published_at': '2024-11-10T10:00:00Z',
    'crawled_at': datetime.utcnow(),
    'created_at': datetime.utcnow()
}

db.articles.update_one(
    {'link': article_url},
    {'$set': article_doc},
    upsert=True
)
```
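
Because the update is keyed on `link` with `upsert=True`, re-running the crawler refreshes an already-stored article in place instead of inserting a duplicate.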

## 🎨 What Makes It "Dynamic"?

### Traditional Approach (Hardcoded)
```python
# Only works for one specific site
title = soup.find('h1', class_='article-title').text
author = soup.find('span', class_='author-name').text
```
❌ Breaks when the site changes
❌ Doesn't work on other sites

### Our Approach (Dynamic)
```python
# Works on ANY site
title = extract_title(soup)    # Tries 4 different methods
author = extract_author(soup)  # Tries 5 different methods
```
✅ Adapts to different HTML structures
✅ Falls back to alternatives
✅ Works across multiple sites

## 🛡️ Robustness Features

### 1. Multiple Strategies
Each field has 4-6 extraction strategies:
```python
def extract_title(soup):
    # Strategy 1: first <h1>
    if h1 := soup.find('h1'):
        return h1.text

    # Strategy 2: Open Graph meta tag
    if og_title := soup.find('meta', property='og:title'):
        return og_title['content']

    # Strategy 3: Twitter meta tag
    if tw_title := soup.find('meta', attrs={'name': 'twitter:title'}):
        return tw_title['content']

    # Strategy 4: <title> tag
    if title_tag := soup.find('title'):
        return title_tag.text

    return None
```
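
The author chain from the diagram (meta author → rel=author → class names → JSON-LD) could be sketched the same way; this illustrates the pattern rather than reproducing the crawler's exact code:

```python
def extract_author(soup):
    # Strategy 1: <meta name="author" content="...">
    if (meta := soup.find('meta', attrs={'name': 'author'})) and meta.get('content'):
        return meta['content'].strip()

    # Strategy 2: <a rel="author">...</a>
    if link := soup.find('a', rel='author'):
        return link.get_text().strip()

    # Strategy 3: any element with "author" in its class name
    if elem := soup.select_one('[class*="author"]'):
        return elem.get_text().strip()

    # Strategy 4: JSON-LD (see the sketch after Step 5 above)
    return None
```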

### 2. Validation
```python
# Title must be reasonable length
if title and len(title) > 10:
    return title

# Author must be < 100 chars
if author and len(author) < 100:
    return author
```

### 3. Cleaning
```python
# Remove site name from title
if ' | ' in title:
    title = title.split(' | ')[0]

# Remove "By" from author
author = author.replace('By ', '').strip()
```
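
These checks presumably run at the end of each extractor. As a purely hypothetical example, the cleaning and validation rules for the author field might combine like this:

```python
def clean_author(raw):
    """Hypothetical helper combining the cleaning and validation rules above."""
    if not raw:
        return None
    author = raw.replace('By ', '').strip()
    return author if 0 < len(author) < 100 else None
```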

### 4. Error Handling
```python
from requests.exceptions import Timeout, RequestException

try:
    data = extract_article_content(url)
except Timeout:
    print("Timeout - skip")
except RequestException:
    print("Network error - skip")
except Exception:
    print("Unknown error - skip")
```

## 📈 Success Metrics

After crawling, you'll see:

```
📰 Crawling feed: Süddeutsche Zeitung München
🔍 Crawling: New U-Bahn Line Opens...
✓ Saved (1250 words)

Title: ✓ Found
Author: ✓ Found (Max Mustermann)
Date: ✓ Found (2024-11-10T10:00:00Z)
Content: ✓ Found (1250 words)
```

## 🗄️ Database Result

**Before Crawling:**
```javascript
{
  title: "New U-Bahn Line Opens",
  link: "https://example.com/article",
  summary: "Short RSS summary...",
  source: "Süddeutsche Zeitung"
}
```

**After Crawling:**
```javascript
{
  title: "New U-Bahn Line Opens in Munich", // ← Enhanced
  author: "Max Mustermann", // ← NEW!
  link: "https://example.com/article",
  summary: "Short RSS summary...",
  full_content: "The new U-Bahn line...", // ← NEW! (1250 words)
  word_count: 1250, // ← NEW!
  source: "Süddeutsche Zeitung",
  published_at: "2024-11-10T10:00:00Z", // ← Enhanced
  crawled_at: ISODate("2024-11-10T16:30:00Z"), // ← NEW!
  created_at: ISODate("2024-11-10T16:00:00Z")
}
```

## 🚀 Running the Crawler

```bash
cd news_crawler
pip install -r requirements.txt
python crawler_service.py 10
```

Output:
```
============================================================
🚀 Starting RSS Feed Crawler
============================================================
Found 3 active feed(s)

📰 Crawling feed: Süddeutsche Zeitung München
🔍 Crawling: New U-Bahn Line Opens...
✓ Saved (1250 words)
🔍 Crawling: Munich Weather Update...
✓ Saved (450 words)
✓ Crawled 2 articles

============================================================
✓ Crawling Complete!
Total feeds processed: 3
Total articles crawled: 15
Duration: 45.23 seconds
============================================================
```

Now you have rich, structured article data ready for AI processing! 🎉