News Crawler - Quick Start
1. Install Dependencies
cd news_crawler
pip install -r requirements.txt
2. Configure Environment
Make sure MongoDB is running and accessible. The crawler will use the same database as the backend.
Default connection: mongodb://localhost:27017/
To use a different MongoDB URI, create a .env file:
MONGODB_URI=mongodb://localhost:27017/
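For reference, here is a minimal sketch of how a crawler like this typically picks up that variable, assuming python-dotenv and pymongo are used (the actual names in crawler_service.py may differ):

# connection sketch (illustrative, not the project's actual code)
import os
from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()  # reads MONGODB_URI from a .env file if one exists
uri = os.getenv("MONGODB_URI", "mongodb://localhost:27017/")
client = MongoClient(uri)
db = client["munich_news"]  # same database the backend uses (see step 4)
print(db.list_collection_names())  # quick connectivity check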
3. Run the Crawler
# Crawl up to 10 articles per feed (the default)
python crawler_service.py
# Crawl up to 20 articles per feed
python crawler_service.py 20
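The optional argument is the per-feed article limit. A minimal sketch of how such an argument can be handled is below; the real parsing in crawler_service.py may differ, and crawl_all_feeds is a hypothetical name:

import sys

DEFAULT_LIMIT = 10  # matches the documented default above
limit = int(sys.argv[1]) if len(sys.argv) > 1 else DEFAULT_LIMIT
print(f"Crawling up to {limit} articles per feed")
# crawl_all_feeds(max_articles_per_feed=limit)  # hypothetical entry point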
4. Verify Results
Check your MongoDB database:
# Using mongosh
mongosh
use munich_news
db.articles.countDocuments({full_content: {$exists: true}})
db.articles.findOne({full_content: {$exists: true}})
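If you prefer Python over mongosh, the same checks can be run with pymongo (a sketch, assuming the default local URI and the munich_news database shown above):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["munich_news"]
query = {"full_content": {"$exists": True}}
print(db.articles.count_documents(query))  # how many articles have content
print(db.articles.find_one(query))         # inspect one crawled document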
5. Schedule Regular Crawling
Option A: Cron (Linux/Mac)
# Edit crontab
crontab -e
# Add this line to run every 6 hours
0 */6 * * * cd /path/to/news_crawler && /path/to/venv/bin/python crawler_service.py
Option B: Docker
# Build and run
docker-compose up
# Or run as a one-off
docker-compose run --rm crawler
Option C: Manual
Just run the script whenever you want to fetch new articles:
python crawler_service.py
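If cron and Docker both feel like overkill, a small Python loop can re-run the script on an interval. This is a sketch, not part of the project; the 6-hour interval mirrors the cron example above:

import subprocess
import sys
import time

INTERVAL_SECONDS = 6 * 60 * 60  # every 6 hours, like the cron entry

while True:
    # run the crawler as a child process using the current interpreter
    subprocess.run([sys.executable, "crawler_service.py"], check=False)
    time.sleep(INTERVAL_SECONDS)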
What Gets Crawled?
The crawler (see the sketch after this list):
- Fetches all active RSS feeds from the database
- Retrieves the latest articles from each feed
- Crawls the full content from each article URL
- Saves title, full_content, word_count, and crawled_at
- Skips articles that already have content
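A condensed sketch of that pipeline is below. It is illustrative only: it assumes feedparser, requests, and beautifulsoup4 are available, a feeds collection with url/active fields, and the articles schema from step 4; the extraction logic in crawler_service.py is more robust than a plain get_text call.

import feedparser
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["munich_news"]

for feed in db.feeds.find({"active": True}):  # hypothetical feeds schema
    for entry in feedparser.parse(feed["url"]).entries[:10]:  # latest, capped per feed
        if db.articles.find_one({"url": entry.link, "full_content": {"$exists": True}}):
            continue  # skip articles that already have content
        html = requests.get(entry.link, timeout=15).text
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
        db.articles.update_one(
            {"url": entry.link},
            {"$set": {
                "title": entry.title,
                "full_content": text,
                "word_count": len(text.split()),
                "crawled_at": datetime.now(timezone.utc),
            }},
            upsert=True,
        )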
Output Example
============================================================
🚀 Starting RSS Feed Crawler
============================================================
Found 3 active feed(s)
📰 Crawling feed: Süddeutsche Zeitung München
URL: https://www.sueddeutsche.de/muenchen/rss
🔍 Crawling: New U-Bahn Line Opens in Munich...
✓ Saved (1250 words)
🔍 Crawling: Munich Weather Update...
✓ Saved (450 words)
✓ Crawled 2 articles from Süddeutsche Zeitung München
============================================================
✓ Crawling Complete!
Total feeds processed: 3
Total articles crawled: 15
Duration: 45.23 seconds
============================================================
Troubleshooting
No feeds found:
- Make sure you've added RSS feeds via the backend API
- Check MongoDB connection
Can't extract content:
- Some sites block scrapers
- Some sites require JavaScript (not supported yet)
- Check if the URL is accessible
Timeout errors:
- Increase the request timeout in the code (see the sketch after this list)
- Check your internet connection
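The timeout usually lives on the HTTP call itself. If crawler_service.py fetches pages with requests, the fix looks roughly like this (illustrative; the actual variable names will differ):

import requests

REQUEST_TIMEOUT = 30  # seconds; raise this if slow sites keep timing out

response = requests.get("https://example.com/article", timeout=REQUEST_TIMEOUT)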
Next Steps
Once articles are crawled, you can:
- View them in the frontend
- Use Ollama to summarize them
- Generate newsletters with full content
- Perform text analysis