News Crawler - Quick Start

1. Install Dependencies

cd news_crawler
pip install -r requirements.txt

2. Configure Environment

Make sure MongoDB is running and accessible. The crawler will use the same database as the backend.

Default connection: mongodb://localhost:27017/

To use a different MongoDB URI, create a .env file:

MONGODB_URI=mongodb://localhost:27017/
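The crawler reads this variable at startup. A minimal sketch of how such a connection is typically set up, assuming python-dotenv and pymongo (the actual code in crawler_service.py may differ):

import os
from dotenv import load_dotenv  # reads variables from a local .env file
from pymongo import MongoClient

load_dotenv()
mongo_uri = os.getenv("MONGODB_URI", "mongodb://localhost:27017/")
client = MongoClient(mongo_uri)
db = client["munich_news"]  # same database the backend uses (see step 4)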

3. Run the Crawler

# Crawl up to 10 articles per feed (default)
python crawler_service.py

# Crawl up to 20 articles per feed
python crawler_service.py 20
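The optional argument is the per-feed article limit. A sketch of how such an argument is commonly parsed (the exact handling in crawler_service.py may differ):

import sys

# Default to 10 articles per feed; allow an optional override on the command line
max_articles = int(sys.argv[1]) if len(sys.argv) > 1 else 10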

4. Verify Results

Check your MongoDB database:

# Using mongosh
mongosh
use munich_news
db.articles.find({full_content: {$exists: true}}).count()
db.articles.findOne({full_content: {$exists: true}})
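The same check from Python with pymongo, as a sketch (database and field names taken from the mongosh example above):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["munich_news"]

# Count and inspect articles that already have crawled full content
crawled = db.articles.count_documents({"full_content": {"$exists": True}})
print(f"Articles with full content: {crawled}")
print(db.articles.find_one({"full_content": {"$exists": True}}))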

5. Schedule Regular Crawling

Option A: Cron (Linux/Mac)

# Edit crontab
crontab -e

# Add this line to run every 6 hours
0 */6 * * * cd /path/to/news_crawler && /path/to/venv/bin/python crawler_service.py

Option B: Docker

# Build and run
docker-compose up

# Or run as a one-off
docker-compose run --rm crawler

Option C: Manual

Just run the script whenever you want to fetch new articles:

python crawler_service.py

What Gets Crawled?

The crawler (see the sketch after this list):

  1. Fetches all active RSS feeds from the database
  2. Retrieves the latest articles from each feed
  3. Crawls the full content from each article URL
  4. Saves: title, full_content, word_count, crawled_at
  5. Skips articles that already have content
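A rough sketch of that loop in Python. The libraries (feedparser, trafilatura) and the collection/field names feeds, active, url, and link are assumptions; title, full_content, word_count, and crawled_at come from the list above:

from datetime import datetime, timezone

import feedparser                # assumed RSS parsing library
import trafilatura               # assumed article text extraction library

def crawl_all_feeds(db, max_articles=10):
    for feed in db.feeds.find({"active": True}):               # 1. active feeds from the database
        parsed = feedparser.parse(feed["url"])                  # 2. latest articles from the feed
        for entry in parsed.entries[:max_articles]:
            existing = db.articles.find_one({"link": entry.link})
            if existing and existing.get("full_content"):       # 5. skip already-crawled articles
                continue
            html = trafilatura.fetch_url(entry.link)            # 3. fetch and extract the full text
            text = trafilatura.extract(html) or ""
            db.articles.update_one(                             # 4. save content and metadata
                {"link": entry.link},
                {"$set": {
                    "title": entry.title,
                    "full_content": text,
                    "word_count": len(text.split()),
                    "crawled_at": datetime.now(timezone.utc),
                }},
                upsert=True,
            )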

Output Example

============================================================
🚀 Starting RSS Feed Crawler
============================================================
Found 3 active feed(s)

📰 Crawling feed: Süddeutsche Zeitung München
   URL: https://www.sueddeutsche.de/muenchen/rss
   🔍 Crawling: New U-Bahn Line Opens in Munich...
   ✓ Saved (1250 words)
   🔍 Crawling: Munich Weather Update...
   ✓ Saved (450 words)
   ✓ Crawled 2 articles from Süddeutsche Zeitung München

============================================================
✓ Crawling Complete!
  Total feeds processed: 3
  Total articles crawled: 15
  Duration: 45.23 seconds
============================================================

Troubleshooting

No feeds found:

  • Make sure you've added RSS feeds via the backend API
  • Check MongoDB connection

Can't extract content:

  • Some sites block scrapers
  • Some sites require JavaScript rendering (not supported yet)
  • Check if the URL is accessible

Timeout errors:

  • Increase the HTTP request timeout in the crawler code (see the sketch below)
  • Check your internet connection
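If the crawler fetches pages with the requests library (an assumption), the timeout is a single parameter on the HTTP call:

import requests

article_url = "https://example.com/some-article"  # placeholder URL
# Raise the per-request timeout (in seconds) if slow sites keep failing
response = requests.get(article_url, timeout=30)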

Next Steps

Once articles are crawled, you can:

  • View them in the frontend
  • Use Ollama to summarize them
  • Generate newsletters with full content
  • Perform text analysis