# News Crawler - Quick Start

## 1. Install Dependencies

```bash
cd news_crawler
pip install -r requirements.txt
```

## 2. Configure Environment

Make sure MongoDB is running and accessible. The crawler uses the same database as the backend.

Default connection: `mongodb://localhost:27017/`

To use a different MongoDB URI, create a `.env` file:

```env
MONGODB_URI=mongodb://localhost:27017/
```

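If you want to sanity-check the connection from Python before running the crawler, a snippet along these lines works. It assumes `pymongo` and `python-dotenv` are installed (check `requirements.txt`) and that the database is named `munich_news`, as in the verification step below.

```python
# check_connection.py - hypothetical helper, not part of the crawler itself
import os

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()  # picks up MONGODB_URI from a .env file if present

uri = os.getenv("MONGODB_URI", "mongodb://localhost:27017/")
client = MongoClient(uri, serverSelectionTimeoutMS=5000)

# "ping" forces a round trip so a bad URI fails fast instead of lazily
client.admin.command("ping")
print(f"Connected to {uri}")
print("Collections in munich_news:", client["munich_news"].list_collection_names())
```
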
## 3. Run the Crawler

```bash
# Crawl up to 10 articles per feed (default)
python crawler_service.py

# Crawl up to 20 articles per feed
python crawler_service.py 20
```

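The optional number caps how many articles are fetched per feed. Exactly how `crawler_service.py` reads it is an implementation detail of the script, but the usual pattern is a plain positional argument, roughly like this sketch (not the actual code):

```python
# Sketch of how the per-feed limit argument is typically read;
# the real crawler_service.py may differ.
import sys

DEFAULT_LIMIT = 10

def parse_limit(argv: list[str]) -> int:
    """Return the per-feed article limit from the command line."""
    if len(argv) > 1:
        return int(argv[1])
    return DEFAULT_LIMIT

if __name__ == "__main__":
    limit = parse_limit(sys.argv)
    print(f"Crawling up to {limit} articles per feed")
```
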
## 4. Verify Results

Check your MongoDB database:

```bash
# Using mongosh
mongosh
use munich_news
db.articles.countDocuments({full_content: {$exists: true}})
db.articles.findOne({full_content: {$exists: true}})
```

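The same checks can be run from Python with `pymongo` if `mongosh` isn't installed; this mirrors the shell queries above and assumes the `munich_news` database and `articles` collection shown there.

```python
# Equivalent of the mongosh checks above, using pymongo
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
articles = client["munich_news"]["articles"]

crawled = {"full_content": {"$exists": True}}
print("Articles with full content:", articles.count_documents(crawled))

sample = articles.find_one(crawled)
if sample:
    print(sample["title"], "-", sample.get("word_count"), "words")
```
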
## 5. Schedule Regular Crawling

### Option A: Cron (Linux/Mac)

```bash
# Edit crontab
crontab -e

# Add this line to run every 6 hours
0 */6 * * * cd /path/to/news_crawler && /path/to/venv/bin/python crawler_service.py
```

### Option B: Docker

```bash
# Build and run
docker-compose up

# Or run as a one-off
docker-compose run --rm crawler
```

### Option C: Manual

Just run the script whenever you want to fetch new articles:

```bash
python crawler_service.py
```

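If neither cron nor Docker fits (for example on Windows), a small pure-Python loop can trigger the crawler on a schedule. This is only a sketch: it assumes the third-party `schedule` package (`pip install schedule`) and runs `crawler_service.py` as a subprocess from the `news_crawler` directory.

```python
# Minimal pure-Python scheduler; assumes `pip install schedule`
import subprocess
import sys
import time

import schedule

def run_crawler() -> None:
    # Invoke the crawler exactly as you would from the command line
    subprocess.run([sys.executable, "crawler_service.py"], check=False)

schedule.every(6).hours.do(run_crawler)

run_crawler()  # also crawl once at startup
while True:
    schedule.run_pending()
    time.sleep(60)
```
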
## What Gets Crawled?

The crawler:
1. Fetches all active RSS feeds from the database
2. Gets the latest articles from each feed
3. Crawls the full content from each article URL
4. Saves: title, full_content, word_count, crawled_at
5. Skips articles that already have content

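For orientation, the flow above boils down to roughly the following. This is a simplified sketch, not the actual `crawler_service.py`; the library choices (`feedparser`, `requests`, `BeautifulSoup`, `pymongo`) and the collection/field names are assumptions based on the behaviour described in this guide.

```python
# Simplified sketch of the crawl loop described above (not the real implementation)
from datetime import datetime, timezone

import feedparser
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["munich_news"]

for feed in db.rss_feeds.find({"active": True}):               # 1. active feeds
    for entry in feedparser.parse(feed["url"]).entries[:10]:   # 2. latest articles
        if db.articles.find_one({"link": entry.link, "full_content": {"$exists": True}}):
            continue                                            # 5. skip already-crawled
        html = requests.get(entry.link, timeout=30).text        # 3. fetch the full page
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
        db.articles.update_one(                                 # 4. save the result
            {"link": entry.link},
            {"$set": {
                "title": entry.title,
                "full_content": text,
                "word_count": len(text.split()),
                "crawled_at": datetime.now(timezone.utc),
            }},
            upsert=True,
        )
```
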
## Output Example

```
============================================================
🚀 Starting RSS Feed Crawler
============================================================
Found 3 active feed(s)

📰 Crawling feed: Süddeutsche Zeitung München
URL: https://www.sueddeutsche.de/muenchen/rss
🔍 Crawling: New U-Bahn Line Opens in Munich...
✓ Saved (1250 words)
🔍 Crawling: Munich Weather Update...
✓ Saved (450 words)
✓ Crawled 2 articles from Süddeutsche Zeitung München

============================================================
✓ Crawling Complete!
Total feeds processed: 3
Total articles crawled: 15
Duration: 45.23 seconds
============================================================
```

## Troubleshooting

**No feeds found:**
- Make sure you've added RSS feeds via the backend API
- Check the MongoDB connection

**Can't extract content:**
- Some sites block scrapers; a browser-like `User-Agent` header sometimes helps (see the sketch below)
- Some sites require JavaScript (not supported yet)
- Check that the URL is accessible

**Timeout errors:**
- Increase the request timeout in the code (see the sketch below)
- Check your internet connection

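Both blocked requests and timeouts are usually handled where the article HTML is fetched. Assuming the crawler uses `requests` (an assumption — check the code), the relevant knobs look like this:

```python
# Where the crawler fetches article HTML, a browser-like User-Agent and a
# longer timeout often resolve blocked requests and slow sites.
import requests

headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}
response = requests.get(
    "https://example.com/some-article",   # placeholder URL
    headers=headers,
    timeout=60,                           # seconds; raise this for slow sites
)
response.raise_for_status()
print(len(response.text), "bytes fetched")
```
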
## Next Steps

Once articles are crawled, you can:
- View them in the frontend
- Use Ollama to summarize them (see the sketch below)
- Generate newsletters with full content
- Perform text analysis
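
As a taste of the Ollama step, summarizing one crawled article can be a single HTTP call to a locally running Ollama instance. This is a sketch: the model name is an assumption, so substitute whatever model you have pulled.

```python
# Summarize one crawled article with a local Ollama instance (http://localhost:11434)
import requests
from pymongo import MongoClient

article = MongoClient("mongodb://localhost:27017/")["munich_news"]["articles"].find_one(
    {"full_content": {"$exists": True}}
)

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",  # assumption: use any model you have pulled locally
        "prompt": "Summarize this article in three sentences:\n\n" + article["full_content"],
        "stream": False,
    },
    timeout=120,
)
print(response.json()["response"])
```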