How the News Crawler Works

🎯 Overview

The crawler dynamically extracts article metadata from any website using multiple fallback strategies.

📊 Flow Diagram

RSS Feed URL
    ↓
Parse RSS Feed
    ↓
For each article link:
    ↓
┌─────────────────────────────────────┐
│  1. Fetch HTML Page                 │
│     GET https://example.com/article │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│  2. Parse with BeautifulSoup        │
│     soup = BeautifulSoup(html)      │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│  3. Clean HTML                      │
│     Remove: scripts, styles, nav,   │
│     footer, header, ads             │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│  4. Extract Title                   │
│     Try: H1 → OG meta → Twitter →   │
│     Title tag                       │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│  5. Extract Author                  │
│     Try: Meta author → rel=author → │
│     Class names → JSON-LD           │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│  6. Extract Date                    │
│     Try: <time> → Meta tags →       │
│     Class names → JSON-LD           │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│  7. Extract Content                 │
│     Try: <article> → Class names →  │
│     <main> → <body>                 │
│     Filter short paragraphs         │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│  8. Save to MongoDB                 │
│     {                               │
│       title, author, date,          │
│       content, word_count           │
│     }                               │
└─────────────────────────────────────┘
    ↓
Wait 1 second (rate limiting)
    ↓
Next article
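
In code, that loop is roughly the following sketch (feed parsing via feedparser; extract_article_content is the extraction routine shown in the error-handling section below, and save_article is a hypothetical stand-in for the MongoDB upsert in step 8):

import time

import feedparser

def crawl_feed(feed_url, source_name):
    feed = feedparser.parse(feed_url)                   # parse the RSS feed
    for entry in feed.entries:
        try:
            data = extract_article_content(entry.link)  # steps 1-7
        except Exception:
            continue                                    # skip articles that fail
        save_article(entry, data, source_name)          # step 8: upsert into MongoDB
        time.sleep(1)                                   # rate limiting between articles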

🔍 Detailed Example

Input: RSS Feed Entry

<item>
  <title>New U-Bahn Line Opens</title>
  <link>https://www.sueddeutsche.de/muenchen/article-123</link>
  <pubDate>Mon, 10 Nov 2024 10:00:00 +0100</pubDate>
</item>

Step 1: Fetch HTML

url = "https://www.sueddeutsche.de/muenchen/article-123"
response = requests.get(url)
html = response.content

Step 2: Parse HTML

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

Step 3: Extract Title

# Try H1
h1 = soup.find('h1')
# Result: "New U-Bahn Line Opens in Munich"

# If no H1, try OG meta
og_title = soup.find('meta', property='og:title')
# Fallback chain continues...
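
If neither the H1 nor og:title is present, the remaining fallbacks in the chain might look like this (a sketch; the split on ' | ' mirrors the title cleaning shown later):

# Try the Twitter card title
twitter_title = soup.find('meta', attrs={'name': 'twitter:title'})

# Last resort: the <title> tag (often "Headline | Site Name")
title_tag = soup.find('title')
if title_tag:
    title = title_tag.get_text().split(' | ')[0].strip()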

Step 4: Extract Author

# Try meta author (attrs= avoids clashing with find()'s own tag-name parameter)
meta_author = soup.find('meta', attrs={'name': 'author'})
# Result: None

# Try class names
author_elem = soup.select_one('[class*="author"]')
# Result: "Max Mustermann"

Step 5: Extract Date

# Try time tag
time_tag = soup.find('time')
# Result: "2024-11-10T10:00:00Z"

Step 6: Extract Content

# Try article tag
article = soup.find('article')
paragraphs = article.find_all('p')

# Filter paragraphs
content = []
for p in paragraphs:
    text = p.get_text().strip()
    if len(text) >= 50:  # Keep substantial paragraphs
        content.append(text)

full_content = '\n\n'.join(content)
# Result: "The new U-Bahn line connecting the city center..."

Step 7: Save to Database

from datetime import datetime  # needed for the crawled_at / created_at timestamps

article_doc = {
    'title': 'New U-Bahn Line Opens in Munich',
    'author': 'Max Mustermann',
    'link': 'https://www.sueddeutsche.de/muenchen/article-123',
    'summary': 'Short summary from RSS...',
    'full_content': 'The new U-Bahn line connecting...',
    'word_count': 1250,
    'source': 'Süddeutsche Zeitung München',
    'published_at': '2024-11-10T10:00:00Z',
    'crawled_at': datetime.utcnow(),
    'created_at': datetime.utcnow()
}

db.articles.update_one(
    {'link': url},          # match on the article URL fetched in Step 1
    {'$set': article_doc},
    upsert=True
)
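
Upserting on link keeps re-crawls idempotent; a unique index on that field enforces it at the database level. A one-time setup sketch, assuming a local MongoDB (the database name munich_news is an assumption, adjust to your config):

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
db = client['munich_news']   # database name is an assumption

# Deduplicate articles by their URL
db.articles.create_index('link', unique=True)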

🎨 What Makes It "Dynamic"?

Traditional Approach (Hardcoded)

# Only works for one specific site
title = soup.find('h1', class_='article-title').text
author = soup.find('span', class_='author-name').text

✗ Breaks when the site changes
✗ Doesn't work on other sites

Our Approach (Dynamic)

# Works on ANY site
title = extract_title(soup)  # Tries 4 different methods
author = extract_author(soup)  # Tries 5 different methods

✓ Adapts to different HTML structures
✓ Falls back to alternatives
✓ Works across multiple sites

🛡️ Robustness Features

1. Multiple Strategies

Each field has 4-6 extraction strategies

def extract_title(soup):
    # Try strategy 1
    if h1 := soup.find('h1'):
        return h1.text
    
    # Try strategy 2
    if og_title := soup.find('meta', property='og:title'):
        return og_title['content']
    
    # Try strategy 3...
    # Try strategy 4...

2. Validation

# Title must be reasonable length
if title and len(title) > 10:
    return title

# Author must be < 100 chars
if author and len(author) < 100:
    return author

3. Cleaning

# Remove site name from title
if ' | ' in title:
    title = title.split(' | ')[0]

# Remove "By" from author
author = author.replace('By ', '').strip()

4. Error Handling

from requests.exceptions import Timeout, RequestException

try:
    data = extract_article_content(url)
except Timeout:
    print("Timeout - skip")
except RequestException:
    print("Network error - skip")
except Exception:
    print("Unknown error - skip")

📈 Success Metrics

After crawling, you'll see:

📰 Crawling feed: Süddeutsche Zeitung München
   🔍 Crawling: New U-Bahn Line Opens...
   ✓ Saved (1250 words)
   
   Title: ✓ Found
   Author: ✓ Found (Max Mustermann)
   Date: ✓ Found (2024-11-10T10:00:00Z)
   Content: ✓ Found (1250 words)

🗄️ Database Result

Before Crawling:

{
  title: "New U-Bahn Line Opens",
  link: "https://example.com/article",
  summary: "Short RSS summary...",
  source: "Süddeutsche Zeitung"
}

After Crawling:

{
  title: "New U-Bahn Line Opens in Munich",  // ← Enhanced
  author: "Max Mustermann",                   // ← NEW!
  link: "https://example.com/article",
  summary: "Short RSS summary...",
  full_content: "The new U-Bahn line...",    // ← NEW! (1250 words)
  word_count: 1250,                           // ← NEW!
  source: "Süddeutsche Zeitung",
  published_at: "2024-11-10T10:00:00Z",      // ← Enhanced
  crawled_at: ISODate("2024-11-10T16:30:00Z"), // ← NEW!
  created_at: ISODate("2024-11-10T16:00:00Z")
}
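
To spot-check the enriched documents, a quick pymongo query (same client assumptions as above) might be:

# List articles that now carry full content
for doc in db.articles.find({'word_count': {'$gte': 100}}, {'title': 1, 'word_count': 1}):
    print(doc['title'], doc['word_count'])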

🚀 Running the Crawler

cd news_crawler
pip install -r requirements.txt
python crawler_service.py 10

Output:

============================================================
🚀 Starting RSS Feed Crawler
============================================================
Found 3 active feed(s)

📰 Crawling feed: Süddeutsche Zeitung München
   🔍 Crawling: New U-Bahn Line Opens...
   ✓ Saved (1250 words)
   🔍 Crawling: Munich Weather Update...
   ✓ Saved (450 words)
   ✓ Crawled 2 articles

============================================================
✓ Crawling Complete!
  Total feeds processed: 3
  Total articles crawled: 15
  Duration: 45.23 seconds
============================================================

Now you have rich, structured article data ready for AI processing! 🎉