News Crawler Microservice

A standalone microservice that crawls full article content from RSS feeds and stores it in MongoDB.

Features

  • 🔍 Extracts full article content from RSS feed links
  • 📊 Calculates word count
  • 🔄 Avoids re-crawling already processed articles
  • ⏱️ Rate limiting (1 second delay between requests)
  • 🎯 Smart content extraction using multiple selectors
  • 🧹 Cleans up scripts, styles, and navigation elements

Installation

  1. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  2. Install dependencies:
pip install -r requirements.txt
  3. Configure environment variables: create a .env file in the project root (or reuse the backend's .env):
MONGODB_URI=mongodb://localhost:27017/

Usage

Standalone Execution

Run the crawler directly:

# Crawl up to 10 articles per feed (default)
python crawler_service.py

# Crawl up to 20 articles per feed
python crawler_service.py 20

As a Module

from crawler_service import crawl_all_feeds, crawl_rss_feed

# Crawl all active feeds
result = crawl_all_feeds(max_articles_per_feed=10)
print(result)

# Crawl a specific feed
crawl_rss_feed(
    feed_url='https://example.com/rss',
    feed_name='Example News',
    max_articles=10
)

Via Backend API

The backend exposes integrated crawler endpoints:

# Start crawler
curl -X POST http://localhost:5001/api/crawler/start

# Check status
curl http://localhost:5001/api/crawler/status

# Crawl specific feed
curl -X POST http://localhost:5001/api/crawler/feed/<feed_id>

How It Works

  1. Fetch RSS Feeds: Gets all active RSS feeds from MongoDB
  2. Parse Feed: Extracts article links from each feed
  3. Crawl Content: For each article:
    • Fetches HTML page
    • Removes scripts, styles, navigation
    • Extracts main content using smart selectors
    • Calculates word count
  4. Store Data: Saves to MongoDB with metadata
  5. Skip Duplicates: Avoids re-crawling articles with existing content
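
A condensed sketch of that loop, not the real implementation: collection and field names ("rss_feeds", "articles", "active", "url") are assumptions here, and extract_article_content() is the function provided by crawler_service.py.

# Sketch only: names marked below are illustrative, not the service's actual schema.
import time
import feedparser
from pymongo import MongoClient
from crawler_service import extract_article_content  # defined in crawler_service.py

db = MongoClient("mongodb://localhost:27017/")["news"]  # hypothetical db name

def crawl_all_feeds_sketch(max_articles_per_feed=10):
    for feed in db.rss_feeds.find({"active": True}):      # 1. fetch active feeds
        parsed = feedparser.parse(feed["url"])            # 2. parse feed, get links
        for entry in parsed.entries[:max_articles_per_feed]:
            doc = db.articles.find_one({"link": entry.link})
            if doc and doc.get("full_content"):           # 5. skip already-crawled
                continue
            content = extract_article_content(entry.link) # 3. crawl and clean HTML
            db.articles.update_one(                       # 4. store with metadata
                {"link": entry.link},
                {"$set": {"full_content": content,
                          "word_count": len(content.split())}},
                upsert=True,
            )
            time.sleep(1)  # rate limit between article requests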

Content Extraction Strategy

The crawler tries multiple selectors in order:

  1. <article> tag
  2. Elements with class containing "article-content", "article-body"
  3. Elements with class containing "post-content", "entry-content"
  4. <main> tag
  5. Fallback to all <p> tags in body
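
A minimal BeautifulSoup sketch of this fallback chain. The exact selectors live in crawler_service.py; this only mirrors the order listed above.

# Mirrors the selector order above; a sketch, not the service's actual code.
from bs4 import BeautifulSoup

def extract_main_content(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav"]):  # strip non-content elements
        tag.decompose()
    candidates = [
        soup.find("article"),
        soup.find(class_=lambda c: c and ("article-content" in c or "article-body" in c)),
        soup.find(class_=lambda c: c and ("post-content" in c or "entry-content" in c)),
        soup.find("main"),
    ]
    for node in candidates:
        if node and node.get_text(strip=True):
            return node.get_text(" ", strip=True)
    body = soup.body or soup  # fallback: join all paragraph text in the body
    return " ".join(p.get_text(" ", strip=True) for p in body.find_all("p"))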

Database Schema

Articles are stored with these fields:

{
  title: String,           // Article title
  link: String,            // Article URL (unique)
  summary: String,         // Short summary
  full_content: String,    // Full article text (max 10,000 chars)
  word_count: Number,      // Number of words
  source: String,          // RSS feed name
  published_at: String,    // Publication date
  crawled_at: DateTime,    // When content was crawled
  created_at: DateTime     // When added to database
}
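
A hedged example of writing one document against this schema with pymongo, assuming an articles collection; the database name and sample values are placeholders. The upsert on link keeps that field unique, and $setOnInsert preserves the original created_at on re-crawls.

from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["news"]  # hypothetical db name
full_text = "full article text extracted by the crawler..."  # placeholder

article = {
    "title": "Example headline",
    "link": "https://example.com/article",   # unique key
    "summary": "Short summary from the RSS entry",
    "full_content": full_text[:10000],       # enforce the 10,000-char cap
    "word_count": len(full_text.split()),
    "source": "Example News",
    "published_at": "Mon, 10 Nov 2025 12:00:00 +0100",
    "crawled_at": datetime.now(timezone.utc),
}
db.articles.update_one(
    {"link": article["link"]},
    {"$set": article,
     "$setOnInsert": {"created_at": datetime.now(timezone.utc)}},
    upsert=True,
)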

Scheduling

Using Cron (Linux/Mac)

# Run every 6 hours
0 */6 * * * cd /path/to/news_crawler && /path/to/venv/bin/python crawler_service.py

Using systemd Timer (Linux)

Create /etc/systemd/system/news-crawler.service:

[Unit]
Description=News Crawler Service

[Service]
Type=oneshot
WorkingDirectory=/path/to/news_crawler
ExecStart=/path/to/venv/bin/python crawler_service.py
User=your-user

Create /etc/systemd/system/news-crawler.timer:

[Unit]
Description=Run News Crawler every 6 hours

[Timer]
OnBootSec=5min
OnUnitActiveSec=6h

[Install]
WantedBy=timers.target

Enable and start:

sudo systemctl enable news-crawler.timer
sudo systemctl start news-crawler.timer

Using Docker

Create Dockerfile:

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY crawler_service.py .

CMD ["python", "crawler_service.py"]

Build and run:

docker build -t news-crawler .
docker run --env-file ../.env news-crawler

Configuration

Environment variables:

  • MONGODB_URI - MongoDB connection string (default: mongodb://localhost:27017/)
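
A minimal sketch of how the setting can be picked up, assuming python-dotenv is among the dependencies; the default mirrors the one above.

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory
mongodb_uri = os.getenv("MONGODB_URI", "mongodb://localhost:27017/")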

Rate Limiting

  • 1-second delay between article requests
  • Keeps load on source servers low
  • Sends a User-Agent header with every request
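
A sketch of that per-request etiquette; the header value is illustrative, and the timeout is also the knob to raise if slow sites fail (see Troubleshooting below).

import time
import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; NewsCrawler/1.0)"}  # illustrative

def fetch(url):
    resp = requests.get(url, headers=HEADERS, timeout=10)  # raise timeout for slow sites
    resp.raise_for_status()
    time.sleep(1)  # 1-second delay between article requests
    return resp.text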

Troubleshooting

Issue: Can't extract content

  • Some sites block scrapers
  • Try adjusting User-Agent header
  • Some sites require JavaScript (consider Selenium)

Issue: Timeout errors

  • Increase timeout in extract_article_content()
  • Check network connectivity

Issue: Memory usage

  • Reduce max_articles_per_feed
  • Content limited to 10,000 characters per article

Architecture

This is a standalone microservice that:

  • Can run independently of the main backend
  • Shares the same MongoDB database
  • Can be deployed separately
  • Can be scheduled independently

Next Steps

Once articles are crawled, you can:

  • Use Ollama to summarize articles
  • Perform sentiment analysis
  • Extract keywords and topics
  • Generate newsletter content
  • Create article recommendations
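
As a starting point for the first item, here is a hedged sketch of summarizing a crawled article with a local Ollama instance via its HTTP API; the model name is an assumption (any locally pulled model works).

import requests

def summarize(text):
    resp = requests.post(
        "http://localhost:11434/api/generate",  # Ollama's default endpoint
        json={
            "model": "llama3",  # assumed model name; substitute your own
            "prompt": "Summarize this news article in three sentences:\n\n" + text,
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]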