# News Crawler - Quick Start

## 1. Install Dependencies

```bash
cd news_crawler
pip install -r requirements.txt
```

## 2. Configure Environment

Make sure MongoDB is running and accessible. The crawler uses the same database as the backend.

Default connection: `mongodb://localhost:27017/`

To use a different MongoDB URI, create a `.env` file:

```env
MONGODB_URI=mongodb://localhost:27017/
```

A connection-check sketch is included in the appendix at the end of this guide.

## 3. Run the Crawler

```bash
# Crawl up to 10 articles per feed (the default)
python crawler_service.py

# Crawl up to 20 articles per feed
python crawler_service.py 20
```

## 4. Verify Results

Check your MongoDB database:

```bash
# Start the MongoDB shell
mongosh

# Then, inside the mongosh shell:
use munich_news
db.articles.countDocuments({full_content: {$exists: true}})
db.articles.findOne({full_content: {$exists: true}})
```

## 5. Schedule Regular Crawling

### Option A: Cron (Linux/Mac)

```bash
# Edit crontab
crontab -e

# Add this line to run every 6 hours
0 */6 * * * cd /path/to/news_crawler && /path/to/venv/bin/python crawler_service.py
```

### Option B: Docker

```bash
# Build and run
docker-compose up

# Or run as a one-off
docker-compose run --rm crawler
```

These commands expect a `docker-compose.yml` that defines a `crawler` service; a minimal sketch is in the appendix.

### Option C: Manual

Just run the script whenever you want to fetch new articles:

```bash
python crawler_service.py
```

## What Gets Crawled?

The crawler:

1. Fetches all active RSS feeds from the database
2. For each feed, gets the latest articles
3. Crawls the full content from each article URL
4. Saves: title, full_content, word_count, crawled_at
5. Skips articles that already have content

The appendix sketches this loop in Python.

## Output Example

```
============================================================
🚀 Starting RSS Feed Crawler
============================================================
Found 3 active feed(s)

📰 Crawling feed: Süddeutsche Zeitung München
   URL: https://www.sueddeutsche.de/muenchen/rss
  🔍 Crawling: New U-Bahn Line Opens in Munich...
    ✓ Saved (1250 words)
  🔍 Crawling: Munich Weather Update...
    ✓ Saved (450 words)
  ✓ Crawled 2 articles from Süddeutsche Zeitung München

============================================================
✓ Crawling Complete!
  Total feeds processed: 3
  Total articles crawled: 15
  Duration: 45.23 seconds
============================================================
```

## Troubleshooting

**No feeds found:**
- Make sure you've added RSS feeds via the backend API
- Check the MongoDB connection

**Can't extract content:**
- Some sites block scrapers
- Some sites require JavaScript (not supported yet)
- Check whether the URL is accessible

**Timeout errors:**
- Increase the request timeout in the code (see the appendix for a sketch)
- Check your internet connection

## Next Steps

Once articles are crawled, you can:

- View them in the frontend
- Use Ollama to summarize them
- Generate newsletters with full content
- Perform text analysis
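## Appendix: Illustrative Sketches

The snippets below are sketches, not excerpts from `crawler_service.py`. Anything labeled as an assumption may not match the real code.

### Checking the MongoDB connection

If you want to verify the connection settings from step 2 independently of the crawler, here is a minimal sketch. It assumes `python-dotenv` and `pymongo` are available (both are common choices, but check `requirements.txt`); the `MONGODB_URI` variable name matches the `.env` example above.

```python
# check_connection.py -- hypothetical helper, not part of the crawler
import os

from dotenv import load_dotenv   # assumes python-dotenv is installed
from pymongo import MongoClient  # assumes pymongo is installed

load_dotenv()  # picks up MONGODB_URI from .env, if present
uri = os.getenv("MONGODB_URI", "mongodb://localhost:27017/")

client = MongoClient(uri, serverSelectionTimeoutMS=5000)
client.admin.command("ping")  # raises if the server is unreachable
print(f"Connected: {uri}")
```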
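### The crawl loop, in outline

This sketch mirrors the five steps listed under "What Gets Crawled?". The collection names (`feeds`, `articles`), the field names, and the naive extraction helper are assumptions for illustration; the real crawler may differ.

```python
# Illustrative sketch of the crawl loop; collection/field names are assumptions.
from datetime import datetime, timezone

import feedparser  # assumes feedparser is in requirements.txt
import requests
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed


def extract_full_content(url, timeout=15):
    """Naive extraction: join all <p> text. Real extractors are smarter."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return " ".join(p.get_text(strip=True) for p in soup.find_all("p"))


def crawl_all_feeds(db, max_articles=10):
    # 1. Fetch all active RSS feeds from the database
    for feed in db.feeds.find({"active": True}):
        parsed = feedparser.parse(feed["url"])
        # 2. For each feed, take the latest articles
        for entry in parsed.entries[:max_articles]:
            # 5. Skip articles that already have full content
            if db.articles.find_one(
                {"url": entry.link, "full_content": {"$exists": True}}
            ):
                continue
            # 3. Crawl the full content from the article URL
            text = extract_full_content(entry.link)
            if not text:
                continue
            # 4. Save title, full_content, word_count, crawled_at
            db.articles.update_one(
                {"url": entry.link},
                {"$set": {
                    "title": entry.title,
                    "full_content": text,
                    "word_count": len(text.split()),
                    "crawled_at": datetime.now(timezone.utc),
                }},
                upsert=True,
            )
```

Called as `crawl_all_feeds(MongoClient(uri).munich_news, max_articles=10)`, this would match the CLI behavior described in step 3.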
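### A minimal docker-compose.yml

The Docker commands in step 5 assume a compose file that defines a `crawler` service. If your checkout does not include one, this is a minimal sketch; the build context, networking, and environment variable are assumptions.

```yaml
# docker-compose.yml -- minimal sketch; adjust to your actual setup
services:
  crawler:
    build: .                 # assumes a Dockerfile in news_crawler/
    command: python crawler_service.py
    environment:
      # assumes the crawler reads MONGODB_URI (see step 2);
      # host.docker.internal reaches MongoDB on the host (Mac/Windows;
      # on Linux you may need extra_hosts: "host.docker.internal:host-gateway")
      - MONGODB_URI=mongodb://host.docker.internal:27017/
```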
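### Raising the request timeout

For the timeout errors in Troubleshooting: if the crawler fetches pages with `requests` (an assumption), the timeout is a per-call parameter rather than a global setting, so look for calls like this and raise the value:

```python
import requests

# Hypothetical call site; `timeout` is requests' standard parameter
url = "https://example.com/article"
resp = requests.get(url, timeout=30)  # e.g. raise from 15s to 30s
```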