# News Crawler Microservice
A standalone microservice that crawls full article content from RSS feeds and stores it in MongoDB.
## Features
- 🔍 Extracts full article content from RSS feed links
- 📊 Calculates word count
- 🔄 Avoids re-crawling already processed articles
- ⏱️ Rate limiting (1 second delay between requests)
- 🎯 Smart content extraction using multiple selectors
- 🧹 Cleans up scripts, styles, and navigation elements
## Installation
- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Configure environment variables:

  Create a `.env` file in the project root (or use the backend's `.env`):

  ```
  MONGODB_URI=mongodb://localhost:27017/
  ```
## Usage
### Standalone Execution
Run the crawler directly:
```bash
# Crawl up to 10 articles per feed (default)
python crawler_service.py

# Crawl up to 20 articles per feed
python crawler_service.py 20
```
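The optional argument simply caps how many articles are fetched per feed. A minimal sketch of equivalent behavior, using the documented `crawl_all_feeds()` helper (the actual argument parsing inside `crawler_service.py` may differ):

```python
import sys

from crawler_service import crawl_all_feeds

# Optional first argument overrides the default of 10 articles per feed
max_articles = int(sys.argv[1]) if len(sys.argv) > 1 else 10
crawl_all_feeds(max_articles_per_feed=max_articles)
```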
### As a Module
```python
from crawler_service import crawl_all_feeds, crawl_rss_feed

# Crawl all active feeds
result = crawl_all_feeds(max_articles_per_feed=10)
print(result)

# Crawl a specific feed
crawl_rss_feed(
    feed_url='https://example.com/rss',
    feed_name='Example News',
    max_articles=10
)
```
### Via Backend API
The backend exposes integrated crawler endpoints:
```bash
# Start crawler
curl -X POST http://localhost:5001/api/crawler/start

# Check status
curl http://localhost:5001/api/crawler/status

# Crawl specific feed
curl -X POST http://localhost:5001/api/crawler/feed/<feed_id>
```
## How It Works
- **Fetch RSS Feeds**: Gets all active RSS feeds from MongoDB
- **Parse Feed**: Extracts article links from each feed
- **Crawl Content**: For each article (see the sketch after this list):
  - Fetches the HTML page
  - Removes scripts, styles, and navigation
  - Extracts the main content using smart selectors
  - Calculates the word count
- **Store Data**: Saves to MongoDB with metadata
- **Skip Duplicates**: Avoids re-crawling articles that already have content
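A hedged sketch of this loop, assuming the `feedparser` and `pymongo` packages and an `articles` collection; `extract_article_content()` is the helper referenced under Troubleshooting, and the database/collection names and its exact signature are assumptions:

```python
# Illustrative sketch only; the real implementation lives in crawler_service.py.
import time

import feedparser
from pymongo import MongoClient

from crawler_service import extract_article_content  # existing helper (see Troubleshooting)

db = MongoClient("mongodb://localhost:27017/")["news"]  # database name is an assumption

def crawl_feed_sketch(feed_url, feed_name, max_articles=10):
    parsed = feedparser.parse(feed_url)
    for entry in parsed.entries[:max_articles]:
        link = entry.get("link")
        # Skip duplicates: articles that already have crawled content
        if db.articles.find_one({"link": link, "full_content": {"$exists": True}}):
            continue
        content = extract_article_content(link)
        if not content:
            continue
        db.articles.update_one(
            {"link": link},
            {"$set": {
                "title": entry.get("title"),
                "source": feed_name,
                "full_content": content[:10000],     # 10,000-char cap
                "word_count": len(content.split()),
            }},
            upsert=True,
        )
        time.sleep(1)  # 1 second rate limit between article requests
```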
## Content Extraction Strategy
The crawler tries multiple selectors in order (a minimal sketch follows below):

1. `<article>` tag
2. Elements with a class containing "article-content" or "article-body"
3. Elements with a class containing "post-content" or "entry-content"
4. `<main>` tag
5. Fallback: all `<p>` tags in the body
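A minimal sketch of this fallback order using BeautifulSoup; the exact selectors and cleanup steps in `crawler_service.py` may differ:

```python
# Illustrative fallback order, not the exact implementation.
from bs4 import BeautifulSoup

def pick_main_content(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")

    # Clean up scripts, styles, and navigation elements first
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()

    candidates = [
        soup.find("article"),
        soup.find(class_=lambda c: c and ("article-content" in c or "article-body" in c)),
        soup.find(class_=lambda c: c and ("post-content" in c or "entry-content" in c)),
        soup.find("main"),
    ]
    for node in candidates:
        if node and node.get_text(strip=True):
            return node.get_text(separator=" ", strip=True)

    # Fallback: join all <p> tags in the body
    return " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
```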
## Database Schema
Articles are stored with these fields:
```js
{
  title: String,         // Article title
  link: String,          // Article URL (unique)
  summary: String,       // Short summary
  full_content: String,  // Full article text (max 10,000 chars)
  word_count: Number,    // Number of words
  source: String,        // RSS feed name
  published_at: String,  // Publication date
  crawled_at: DateTime,  // When content was crawled
  created_at: DateTime   // When added to database
}
```
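Because `link` is unique, a unique index prevents duplicate documents for the same URL. A hedged pymongo sketch (the database and collection names are assumptions):

```python
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017/")
articles = client["news"]["articles"]  # names are assumptions

# Enforce one document per article URL
articles.create_index([("link", ASCENDING)], unique=True)
```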
## Scheduling
### Using Cron (Linux/Mac)
```bash
# Run every 6 hours
0 */6 * * * cd /path/to/news_crawler && /path/to/venv/bin/python crawler_service.py
```
### Using systemd Timer (Linux)
Create `/etc/systemd/system/news-crawler.service`:
```ini
[Unit]
Description=News Crawler Service

[Service]
Type=oneshot
WorkingDirectory=/path/to/news_crawler
ExecStart=/path/to/venv/bin/python crawler_service.py
User=your-user
```
Create `/etc/systemd/system/news-crawler.timer`:
```ini
[Unit]
Description=Run News Crawler every 6 hours

[Timer]
OnBootSec=5min
OnUnitActiveSec=6h

[Install]
WantedBy=timers.target
```
Enable and start:
```bash
sudo systemctl enable news-crawler.timer
sudo systemctl start news-crawler.timer
```
### Using Docker
Create a `Dockerfile`:
```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY crawler_service.py .

CMD ["python", "crawler_service.py"]
```
Build and run:
```bash
docker build -t news-crawler .
docker run --env-file ../.env news-crawler
```
## Configuration
Environment variables:

- `MONGODB_URI` - MongoDB connection string (default: `mongodb://localhost:27017/`)
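A minimal sketch of how the connection string can be loaded, assuming `python-dotenv` is among the dependencies; the fallback default matches the one above:

```python
import os

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()  # reads .env from the current working directory
mongodb_uri = os.getenv("MONGODB_URI", "mongodb://localhost:27017/")
client = MongoClient(mongodb_uri)
```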
## Rate Limiting
- 1 second delay between article requests (see the sketch below)
- Respects server resources
- User-Agent header included
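A sketch of the request pattern behind these points, assuming the `requests` library; the User-Agent string shown is illustrative, not necessarily the one used:

```python
import time

import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; NewsCrawler/1.0)"}  # illustrative value

def fetch(url: str) -> str:
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    time.sleep(1)  # 1 second delay between article requests
    return response.text
```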
## Troubleshooting
### Issue: Can't extract content
- Some sites block scrapers
- Try adjusting User-Agent header
- Some sites require JavaScript (consider Selenium)
### Issue: Timeout errors
- Increase the timeout in `extract_article_content()`
- Check network connectivity
### Issue: High memory usage
- Reduce `max_articles_per_feed`
- Content is already capped at 10,000 characters per article
## Architecture
This is a standalone microservice that:
- Can run independently of the main backend
- Shares the same MongoDB database
- Can be deployed separately
- Can be scheduled independently
## Next Steps
Once articles are crawled, you can:
- Use Ollama to summarize articles
- Perform sentiment analysis
- Extract keywords and topics
- Generate newsletter content
- Create article recommendations