News Crawler Microservice

A standalone microservice that crawls full article content from RSS feeds and stores it in MongoDB.

Features

  • 🔍 Extracts full article content from RSS feed links
  • 📊 Calculates word count
  • 🔄 Avoids re-crawling already processed articles
  • ⏱️ Rate limiting (1 second delay between requests)
  • 🎯 Smart content extraction using multiple selectors
  • 🧹 Cleans up scripts, styles, and navigation elements

Installation

  1. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  2. Install dependencies:
pip install -r requirements.txt
  3. Configure environment variables: create a .env file in the project root (or reuse the backend's .env):
MONGODB_URI=mongodb://localhost:27017/

Usage

Standalone Execution

Run the crawler directly:

# Crawl up to 10 articles per feed (default)
python crawler_service.py

# Crawl up to 20 articles per feed
python crawler_service.py 20

As a Module

from crawler_service import crawl_all_feeds, crawl_rss_feed

# Crawl all active feeds
result = crawl_all_feeds(max_articles_per_feed=10)
print(result)

# Crawl a specific feed
crawl_rss_feed(
    feed_url='https://example.com/rss',
    feed_name='Example News',
    max_articles=10
)

Via Backend API

The backend exposes integrated crawler endpoints:

# Start crawler
curl -X POST http://localhost:5001/api/crawler/start

# Check status
curl http://localhost:5001/api/crawler/status

# Crawl specific feed
curl -X POST http://localhost:5001/api/crawler/feed/<feed_id>

How It Works

  1. Fetch RSS Feeds: Gets all active RSS feeds from MongoDB
  2. Parse Feed: Extracts article links from each feed
  3. Crawl Content: For each article:
    • Fetches HTML page
    • Removes scripts, styles, navigation
    • Extracts main content using smart selectors
    • Calculates word count
  4. Store Data: Saves to MongoDB with metadata
  5. Skip Duplicates: Avoids re-crawling articles with existing content
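
A condensed sketch of that loop, not the real implementation: collection and field names ("rss_feeds", "articles", "active", "url") are assumptions here, and extract_article_content() is the function provided by crawler_service.py.

# Sketch only: names marked below are illustrative, not the service's actual schema.
import time
import feedparser
from pymongo import MongoClient
from crawler_service import extract_article_content  # defined in crawler_service.py

db = MongoClient("mongodb://localhost:27017/")["news"]  # hypothetical db name

def crawl_all_feeds_sketch(max_articles_per_feed=10):
    for feed in db.rss_feeds.find({"active": True}):      # 1. fetch active feeds
        parsed = feedparser.parse(feed["url"])            # 2. parse feed, get links
        for entry in parsed.entries[:max_articles_per_feed]:
            doc = db.articles.find_one({"link": entry.link})
            if doc and doc.get("full_content"):           # 5. skip already-crawled
                continue
            content = extract_article_content(entry.link) # 3. crawl and clean HTML
            db.articles.update_one(                       # 4. store with metadata
                {"link": entry.link},
                {"$set": {"full_content": content,
                          "word_count": len(content.split())}},
                upsert=True,
            )
            time.sleep(1)  # rate limit between article requests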

Content Extraction Strategy

The crawler tries multiple selectors in order:

  1. <article> tag
  2. Elements with class containing "article-content", "article-body"
  3. Elements with class containing "post-content", "entry-content"
  4. <main> tag
  5. Fallback to all <p> tags in body
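
A minimal BeautifulSoup sketch of this fallback chain. The exact selectors live in crawler_service.py; this only mirrors the order listed above.

# Mirrors the selector order above; a sketch, not the service's actual code.
from bs4 import BeautifulSoup

def extract_main_content(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav"]):  # strip non-content elements
        tag.decompose()
    candidates = [
        soup.find("article"),
        soup.find(class_=lambda c: c and ("article-content" in c or "article-body" in c)),
        soup.find(class_=lambda c: c and ("post-content" in c or "entry-content" in c)),
        soup.find("main"),
    ]
    for node in candidates:
        if node and node.get_text(strip=True):
            return node.get_text(" ", strip=True)
    body = soup.body or soup  # fallback: join all paragraph text in the body
    return " ".join(p.get_text(" ", strip=True) for p in body.find_all("p"))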

Database Schema

Articles are stored with these fields:

{
  title: String,           // Article title
  link: String,            // Article URL (unique)
  summary: String,         // Short summary
  full_content: String,    // Full article text (max 10,000 chars)
  word_count: Number,      // Number of words
  source: String,          // RSS feed name
  published_at: String,    // Publication date
  crawled_at: DateTime,    // When content was crawled
  created_at: DateTime     // When added to database
}
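
A hedged example of writing one document against this schema with pymongo, assuming an articles collection; the database name and sample values are placeholders. The upsert on link keeps that field unique, and $setOnInsert preserves the original created_at on re-crawls.

from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["news"]  # hypothetical db name
full_text = "full article text extracted by the crawler..."  # placeholder

article = {
    "title": "Example headline",
    "link": "https://example.com/article",   # unique key
    "summary": "Short summary from the RSS entry",
    "full_content": full_text[:10000],       # enforce the 10,000-char cap
    "word_count": len(full_text.split()),
    "source": "Example News",
    "published_at": "Mon, 10 Nov 2025 12:00:00 +0100",
    "crawled_at": datetime.now(timezone.utc),
}
db.articles.update_one(
    {"link": article["link"]},
    {"$set": article,
     "$setOnInsert": {"created_at": datetime.now(timezone.utc)}},
    upsert=True,
)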

Scheduling

Using Cron (Linux/Mac)

# Run every 6 hours
0 */6 * * * cd /path/to/news_crawler && /path/to/venv/bin/python crawler_service.py

Using systemd Timer (Linux)

Create /etc/systemd/system/news-crawler.service:

[Unit]
Description=News Crawler Service

[Service]
Type=oneshot
WorkingDirectory=/path/to/news_crawler
ExecStart=/path/to/venv/bin/python crawler_service.py
User=your-user

Create /etc/systemd/system/news-crawler.timer:

[Unit]
Description=Run News Crawler every 6 hours

[Timer]
OnBootSec=5min
OnUnitActiveSec=6h

[Install]
WantedBy=timers.target

Enable and start:

sudo systemctl enable news-crawler.timer
sudo systemctl start news-crawler.timer

Using Docker

Create Dockerfile:

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY crawler_service.py .

CMD ["python", "crawler_service.py"]

Build and run:

docker build -t news-crawler .
docker run --env-file ../.env news-crawler

Configuration

Environment variables:

  • MONGODB_URI - MongoDB connection string (default: mongodb://localhost:27017/)
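
A minimal sketch of how the setting can be picked up, assuming python-dotenv is among the dependencies; the default mirrors the one above.

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory
mongodb_uri = os.getenv("MONGODB_URI", "mongodb://localhost:27017/")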

Rate Limiting

  • 1-second delay between article requests
  • Keeps load on source servers low
  • Sends a User-Agent header with every request
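
A sketch of that per-request etiquette; the header value is illustrative, and the timeout is also the knob to raise if slow sites fail (see Troubleshooting below).

import time
import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; NewsCrawler/1.0)"}  # illustrative

def fetch(url):
    resp = requests.get(url, headers=HEADERS, timeout=10)  # raise timeout for slow sites
    resp.raise_for_status()
    time.sleep(1)  # 1-second delay between article requests
    return resp.text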

Troubleshooting

Issue: Can't extract content

  • Some sites block scrapers
  • Try adjusting User-Agent header
  • Some sites require JavaScript (consider Selenium)

Issue: Timeout errors

  • Increase timeout in extract_article_content()
  • Check network connectivity

Issue: Memory usage

  • Reduce max_articles_per_feed
  • Content limited to 10,000 characters per article

Architecture

This is a standalone microservice that:

  • Can run independently of the main backend
  • Shares the same MongoDB database
  • Can be deployed separately
  • Can be scheduled independently

Next Steps

Once articles are crawled, you can:

  • Use Ollama to summarize articles
  • Perform sentiment analysis
  • Extract keywords and topics
  • Generate newsletter content
  • Create article recommendations
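
As a starting point for the first item, here is a hedged sketch of summarizing a crawled article with a local Ollama instance via its HTTP API; the model name is an assumption (any locally pulled model works).

import requests

def summarize(text):
    resp = requests.post(
        "http://localhost:11434/api/generate",  # Ollama's default endpoint
        json={
            "model": "llama3",  # assumed model name; substitute your own
            "prompt": "Summarize this news article in three sentences:\n\n" + text,
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]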