# How the News Crawler Works

## 🎯 Overview

The crawler dynamically extracts article metadata from any website using multiple fallback strategies.

## 📊 Flow Diagram

```
RSS Feed URL
     ↓
Parse RSS Feed
     ↓
For each article link:
     ↓
┌─────────────────────────────────────┐
│ 1. Fetch HTML Page                  │
│    GET https://example.com/article  │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 2. Parse with BeautifulSoup         │
│    soup = BeautifulSoup(html)       │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 3. Clean HTML                       │
│    Remove: scripts, styles, nav,    │
│            footer, header, ads      │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 4. Extract Title                    │
│    Try: H1 → OG meta → Twitter →    │
│         Title tag                   │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 5. Extract Author                   │
│    Try: Meta author → rel=author →  │
│         Class names → JSON-LD       │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 6. Extract Date                     │
│    Try: <time> → Meta tags →        │
│         Class names → JSON-LD       │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 7. Extract Content                  │
│    Try: <article> → Class names →   │
│         <main> → <body>             │
│    Filter short paragraphs          │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ 8. Save to MongoDB                  │
│    {                                │
│      title, author, date,           │
│      content, word_count            │
│    }                                │
└─────────────────────────────────────┘
     ↓
Wait 1 second (rate limiting)
     ↓
Next article
```
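
The same loop, sketched as code. This is a minimal outline rather than the actual `crawler_service.py`: it assumes `feedparser` is available for the RSS step, reuses the `extract_article_content` name used later in this document, and `save_article` is a hypothetical helper standing in for step 8.

```python
import time
import feedparser

def crawl_feed(feed_url, db):
    feed = feedparser.parse(feed_url)                    # Parse RSS Feed
    for entry in feed.entries:                           # For each article link
        try:
            data = extract_article_content(entry.link)   # steps 1-7 above
            save_article(db, entry, data)                # step 8: upsert into MongoDB
        except Exception:
            continue                                     # skip the article, keep crawling
        time.sleep(1)                                    # rate limiting between articles
```
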
## 🔍 Detailed Example

### Input: RSS Feed Entry
```xml
<item>
  <title>New U-Bahn Line Opens</title>
  <link>https://www.sueddeutsche.de/muenchen/article-123</link>
  <pubDate>Mon, 10 Nov 2024 10:00:00 +0100</pubDate>
</item>
```

### Step 1: Fetch HTML
```python
import requests

url = "https://www.sueddeutsche.de/muenchen/article-123"
response = requests.get(url, timeout=10)  # timeout so a slow page raises Timeout instead of hanging (value is illustrative)
html = response.content
```

### Step 2: Parse HTML
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
```

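Step 3 of the flow diagram (Clean HTML) has no snippet in this walkthrough. A minimal sketch of what it could look like — the exact tag list is an assumption, and ad containers usually have to be matched by site-specific class names:

```python
# Drop boilerplate elements before extracting anything (tag list is an assumption)
for tag in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
    tag.decompose()
```
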
### Step 3: Extract Title
```python
# Try H1
h1 = soup.find('h1')
title = h1.get_text(strip=True) if h1 else None
# Result: "New U-Bahn Line Opens in Munich"

# If no H1, try OG meta
og_title = soup.find('meta', property='og:title')
# og_title['content'] would be used instead
# Fallback chain continues...
```

### Step 4: Extract Author
```python
# Try meta author (attrs= is needed because name= is a reserved argument in find())
meta_author = soup.find('meta', attrs={'name': 'author'})
# Result: None

# Try class names
author_elem = soup.select_one('[class*="author"]')
author = author_elem.get_text(strip=True) if author_elem else None
# Result: "Max Mustermann"
```

### Step 5: Extract Date
```python
# Try time tag (the machine-readable value lives in its datetime attribute)
time_tag = soup.find('time')
date = time_tag.get('datetime') if time_tag else None
# Result: "2024-11-10T10:00:00Z"
```

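Both the author and date chains end with JSON-LD (see the flow diagram), which is not shown above. A minimal sketch of that last-resort fallback, reusing `soup`, `author`, and `date` from the previous steps and assuming the page embeds an `application/ld+json` block with `author` and `datePublished` fields:

```python
import json

# Last resort: structured data that many news sites embed for search engines
ld_script = soup.find('script', type='application/ld+json')
if ld_script and ld_script.string:
    try:
        ld = json.loads(ld_script.string)
        author_info = ld.get('author')
        if not author and isinstance(author_info, dict):
            author = author_info.get('name')   # author can also be a list on real pages
        date = date or ld.get('datePublished')
    except json.JSONDecodeError:
        pass  # malformed JSON-LD: keep whatever the earlier strategies found
```
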
### Step 6: Extract Content
```python
# Try article tag
article = soup.find('article')
paragraphs = article.find_all('p')

# Filter paragraphs
content = []
for p in paragraphs:
    text = p.get_text().strip()
    if len(text) >= 50:  # Keep substantial paragraphs
        content.append(text)

full_content = '\n\n'.join(content)
# Result: "The new U-Bahn line connecting the city center..."
```

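The `word_count` stored in the next step can be derived directly from the assembled text; a one-line sketch:

```python
# Simple whitespace-based word count for the word_count field
word_count = len(full_content.split())
# Result: 1250
```
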
### Step 7: Save to Database
```python
from datetime import datetime

article_doc = {
    'title': 'New U-Bahn Line Opens in Munich',
    'author': 'Max Mustermann',
    'link': 'https://www.sueddeutsche.de/muenchen/article-123',
    'summary': 'Short summary from RSS...',
    'full_content': 'The new U-Bahn line connecting...',
    'word_count': 1250,
    'source': 'Süddeutsche Zeitung München',
    'published_at': '2024-11-10T10:00:00Z',
    'crawled_at': datetime.utcnow(),
    'created_at': datetime.utcnow()
}

db.articles.update_one(
    {'link': article_doc['link']},   # upsert keyed on the article URL
    {'$set': article_doc},
    upsert=True
)
```

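An optional companion to this upsert (not part of the snippet above) is a unique index on `link`: it keeps the lookup fast and guarantees one document per article URL. A sketch:

```python
# Optional: enforce one document per article URL and speed up the upsert lookup
db.articles.create_index('link', unique=True)
```
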
## 🎨 What Makes It "Dynamic"?

### Traditional Approach (Hardcoded)
```python
# Only works for one specific site
title = soup.find('h1', class_='article-title').text
author = soup.find('span', class_='author-name').text
```
❌ Breaks when the site changes
❌ Doesn't work on other sites

### Our Approach (Dynamic)
```python
# Works on ANY site
title = extract_title(soup)    # Tries 4 different methods
author = extract_author(soup)  # Tries 5 different methods
```
✅ Adapts to different HTML structures
✅ Falls back to alternatives
✅ Works across multiple sites

## 🛡️ Robustness Features

### 1. Multiple Strategies
Each field has 4-6 extraction strategies:
```python
def extract_title(soup):
    # Try strategy 1
    if h1 := soup.find('h1'):
        return h1.text

    # Try strategy 2
    if og_title := soup.find('meta', property='og:title'):
        return og_title['content']

    # Try strategy 3...
    # Try strategy 4...
```

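Strategies 3 and 4 of the title chain are elided above; per the flow diagram they are the Twitter card meta tag and the plain `<title>` tag. A sketch of how those last two branches could look (the function name here is just for illustration):

```python
def extract_title_fallbacks(soup):
    # Strategy 3: Twitter card metadata
    if tw_title := soup.find('meta', attrs={'name': 'twitter:title'}):
        return tw_title['content']

    # Strategy 4: plain <title> tag (often carries the site name, cleaned later)
    if soup.title and soup.title.string:
        return soup.title.string.strip()
    return None
```
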
### 2. Validation
```python
# Title must be a reasonable length
if title and len(title) > 10:
    return title

# Author must be < 100 chars
if author and len(author) < 100:
    return author
```

### 3. Cleaning
```python
# Remove site name from title
if ' | ' in title:
    title = title.split(' | ')[0]

# Remove "By" from author
author = author.replace('By ', '').strip()
```

### 4. Error Handling
```python
from requests.exceptions import Timeout, RequestException

try:
    data = extract_article_content(url)
except Timeout:
    print("Timeout - skip")
except RequestException:
    print("Network error - skip")
except Exception:
    print("Unknown error - skip")
```

## 📈 Success Metrics

After crawling, you'll see:

```
📰 Crawling feed: Süddeutsche Zeitung München
  🔍 Crawling: New U-Bahn Line Opens...
    ✓ Saved (1250 words)

Title:   ✓ Found
Author:  ✓ Found (Max Mustermann)
Date:    ✓ Found (2024-11-10T10:00:00Z)
Content: ✓ Found (1250 words)
```

## 🗄️ Database Result

**Before Crawling:**
```javascript
{
  title: "New U-Bahn Line Opens",
  link: "https://example.com/article",
  summary: "Short RSS summary...",
  source: "Süddeutsche Zeitung"
}
```

**After Crawling:**
```javascript
{
  title: "New U-Bahn Line Opens in Munich",     // ← Enhanced
  author: "Max Mustermann",                     // ← NEW!
  link: "https://example.com/article",
  summary: "Short RSS summary...",
  full_content: "The new U-Bahn line...",       // ← NEW! (1250 words)
  word_count: 1250,                             // ← NEW!
  source: "Süddeutsche Zeitung",
  published_at: "2024-11-10T10:00:00Z",         // ← Enhanced
  crawled_at: ISODate("2024-11-10T16:30:00Z"),  // ← NEW!
  created_at: ISODate("2024-11-10T16:00:00Z")
}
```

## 🚀 Running the Crawler

```bash
cd news_crawler
pip install -r requirements.txt
python crawler_service.py 10
```

Output:
```
============================================================
🚀 Starting RSS Feed Crawler
============================================================
Found 3 active feed(s)

📰 Crawling feed: Süddeutsche Zeitung München
  🔍 Crawling: New U-Bahn Line Opens...
    ✓ Saved (1250 words)
  🔍 Crawling: Munich Weather Update...
    ✓ Saved (450 words)
  ✓ Crawled 2 articles

============================================================
✓ Crawling Complete!
  Total feeds processed: 3
  Total articles crawled: 15
  Duration: 45.23 seconds
============================================================
```

Now you have rich, structured article data ready for AI processing! 🎉