New U-Bahn Line Opens

→ │ │ Filter short paragraphs │ └─────────────────────────────────────┘ ↓ ┌─────────────────────────────────────┐ │ 8. Save to MongoDB │ │ { │ │ title, author, date, │ │ content, word_count │ │ } │ └─────────────────────────────────────┘ ↓ Wait 1 second (rate limiting) ↓ Next article ``` ## 🔍 Detailed Example ### Input: RSS Feed Entry ```xml New U-Bahn Line Opens https://www.sueddeutsche.de/muenchen/article-123 Mon, 10 Nov 2024 10:00:00 +0100 ``` ### Step 1: Fetch HTML ```python url = "https://www.sueddeutsche.de/muenchen/article-123" response = requests.get(url) html = response.content ``` ### Step 2: Parse HTML ```python soup = BeautifulSoup(html, 'html.parser') ``` ### Step 3: Extract Title ```python # Try H1 h1 = soup.find('h1') # Result: "New U-Bahn Line Opens in Munich" # If no H1, try OG meta og_title = soup.find('meta', property='og:title') # Fallback chain continues... ``` ### Step 4: Extract Author ```python # Try meta author meta_author = soup.find('meta', name='author') # Result: None # Try class names author_elem = soup.select_one('[class*="author"]') # Result: "Max Mustermann" ``` ### Step 5: Extract Date ```python # Try time tag time_tag = soup.find('time') # Result: "2024-11-10T10:00:00Z" ``` ### Step 6: Extract Content ```python # Try article tag article = soup.find('article') paragraphs = article.find_all('p') # Filter paragraphs content = [] for p in paragraphs: text = p.get_text().strip() if len(text) >= 50: # Keep substantial paragraphs content.append(text) full_content = '\n\n'.join(content) # Result: "The new U-Bahn line connecting the city center..." ``` ### Step 7: Save to Database ```python article_doc = { 'title': 'New U-Bahn Line Opens in Munich', 'author': 'Max Mustermann', 'link': 'https://www.sueddeutsche.de/muenchen/article-123', 'summary': 'Short summary from RSS...', 'full_content': 'The new U-Bahn line connecting...', 'word_count': 1250, 'source': 'Süddeutsche Zeitung München', 'published_at': '2024-11-10T10:00:00Z', 'crawled_at': datetime.utcnow(), 'created_at': datetime.utcnow() } db.articles.update_one( {'link': article_url}, {'$set': article_doc}, upsert=True ) ``` ## 🎨 What Makes It "Dynamic"? ### Traditional Approach (Hardcoded) ```python # Only works for one specific site title = soup.find('h1', class_='article-title').text author = soup.find('span', class_='author-name').text ``` ❌ Breaks when site changes ❌ Doesn't work on other sites ### Our Approach (Dynamic) ```python # Works on ANY site title = extract_title(soup) # Tries 4 different methods author = extract_author(soup) # Tries 5 different methods ``` ✅ Adapts to different HTML structures ✅ Falls back to alternatives ✅ Works across multiple sites ## 🛡️ Robustness Features ### 1. Multiple Strategies Each field has 4-6 extraction strategies ```python def extract_title(soup): # Try strategy 1 if h1 := soup.find('h1'): return h1.text # Try strategy 2 if og_title := soup.find('meta', property='og:title'): return og_title['content'] # Try strategy 3... # Try strategy 4... ``` ### 2. Validation ```python # Title must be reasonable length if title and len(title) > 10: return title # Author must be < 100 chars if author and len(author) < 100: return author ``` ### 3. Cleaning ```python # Remove site name from title if ' | ' in title: title = title.split(' | ')[0] # Remove "By" from author author = author.replace('By ', '').strip() ``` ### 4. Error Handling ```python try: data = extract_article_content(url) except Timeout: print("Timeout - skip") except RequestException: print("Network error - skip") except Exception: print("Unknown error - skip") ``` ## 📈 Success Metrics After crawling, you'll see: ``` 📰 Crawling feed: Süddeutsche Zeitung München 🔍 Crawling: New U-Bahn Line Opens... ✓ Saved (1250 words) Title: ✓ Found Author: ✓ Found (Max Mustermann) Date: ✓ Found (2024-11-10T10:00:00Z) Content: ✓ Found (1250 words) ``` ## 🗄️ Database Result **Before Crawling:** ```javascript { title: "New U-Bahn Line Opens", link: "https://example.com/article", summary: "Short RSS summary...", source: "Süddeutsche Zeitung" } ``` **After Crawling:** ```javascript { title: "New U-Bahn Line Opens in Munich", // ← Enhanced author: "Max Mustermann", // ← NEW! link: "https://example.com/article", summary: "Short RSS summary...", full_content: "The new U-Bahn line...", // ← NEW! (1250 words) word_count: 1250, // ← NEW! source: "Süddeutsche Zeitung", published_at: "2024-11-10T10:00:00Z", // ← Enhanced crawled_at: ISODate("2024-11-10T16:30:00Z"), // ← NEW! created_at: ISODate("2024-11-10T16:00:00Z") } ``` ## 🚀 Running the Crawler ```bash cd news_crawler pip install -r requirements.txt python crawler_service.py 10 ``` Output: ``` ============================================================ 🚀 Starting RSS Feed Crawler ============================================================ Found 3 active feed(s) 📰 Crawling feed: Süddeutsche Zeitung München 🔍 Crawling: New U-Bahn Line Opens... ✓ Saved (1250 words) 🔍 Crawling: Munich Weather Update... ✓ Saved (450 words) ✓ Crawled 2 articles ============================================================ ✓ Crawling Complete! Total feeds processed: 3 Total articles crawled: 15 Duration: 45.23 seconds ============================================================ ``` Now you have rich, structured article data ready for AI processing! 🎉