# News Crawler Microservice

A standalone microservice that crawls full article content from RSS feeds and stores it in MongoDB.

## Features

- 🔍 Extracts full article content from RSS feed links
- 📊 Calculates word count
- 🔄 Avoids re-crawling already processed articles
- ⏱️ Rate limiting (1 second delay between requests)
- 🎯 Smart content extraction using multiple selectors
- 🧹 Cleans up scripts, styles, and navigation elements
## Installation

1. Create a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Configure environment variables:

Create a `.env` file in the project root (or use the backend's `.env`):

```env
MONGODB_URI=mongodb://localhost:27017/
```
## Usage

### Standalone Execution

Run the crawler directly:

```bash
# Crawl up to 10 articles per feed (default)
python crawler_service.py

# Crawl up to 20 articles per feed
python crawler_service.py 20
```

### As a Module

```python
from crawler_service import crawl_all_feeds, crawl_rss_feed

# Crawl all active feeds
result = crawl_all_feeds(max_articles_per_feed=10)
print(result)

# Crawl a specific feed
crawl_rss_feed(
    feed_url='https://example.com/rss',
    feed_name='Example News',
    max_articles=10
)
```
### Via Backend API

The backend exposes integrated crawler endpoints:

```bash
# Start crawler
curl -X POST http://localhost:5001/api/crawler/start

# Check status
curl http://localhost:5001/api/crawler/status

# Crawl a specific feed
curl -X POST http://localhost:5001/api/crawler/feed/<feed_id>
```
## How It Works

1. **Fetch RSS Feeds**: Gets all active RSS feeds from MongoDB
2. **Parse Feed**: Extracts article links from each feed
3. **Crawl Content**: For each article:
   - Fetches HTML page
   - Removes scripts, styles, navigation
   - Extracts main content using smart selectors
   - Calculates word count
4. **Store Data**: Saves to MongoDB with metadata
5. **Skip Duplicates**: Avoids re-crawling articles with existing content
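
Put together, the cycle looks roughly like the loop below. This is a condensed sketch, not the actual `crawler_service.py` code; the database name (`news`), the collection names (`rss_feeds`, `articles`), and the field names are assumptions.

```python
# Condensed sketch of the crawl cycle above. Database name ("news"),
# collection names ("rss_feeds", "articles"), and field names are assumptions.
import time

import feedparser
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["news"]

def crawl_sketch(max_articles_per_feed: int = 10) -> None:
    for feed in db.rss_feeds.find({"active": True}):            # 1. fetch active RSS feeds
        parsed = feedparser.parse(feed["url"])                   # 2. parse feed, get article links
        for entry in parsed.entries[:max_articles_per_feed]:
            if db.articles.find_one({"link": entry.link,
                                     "full_content": {"$exists": True}}):
                continue                                         # 5. skip already-crawled articles
            resp = requests.get(entry.link, timeout=10,
                                headers={"User-Agent": "news-crawler/1.0"})
            soup = BeautifulSoup(resp.text, "html.parser")       # 3. crawl content
            for tag in soup(["script", "style", "nav"]):
                tag.decompose()
            container = soup.find("article") or soup.body or soup
            text = container.get_text(" ", strip=True)[:10000]
            db.articles.update_one(                              # 4. store with metadata
                {"link": entry.link},
                {"$set": {"title": entry.get("title", ""),
                          "full_content": text,
                          "word_count": len(text.split()),
                          "source": feed.get("name", "")}},
                upsert=True,
            )
            time.sleep(1)                                        # rate limiting between requests
```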
## Content Extraction Strategy

The crawler tries multiple selectors in order:

1. `<article>` tag
2. Elements whose class contains "article-content" or "article-body"
3. Elements whose class contains "post-content" or "entry-content"
4. `<main>` tag
5. Fallback to all `<p>` tags in the body
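
A sketch of that fallback chain using BeautifulSoup; the exact class patterns in `crawler_service.py` may differ in detail.

```python
# Illustrative fallback chain; the real selectors may differ in detail.
import re

from bs4 import BeautifulSoup

def pick_main_content(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav"]):   # drop non-content elements first
        tag.decompose()

    candidates = [
        soup.find("article"),                                      # 1. <article> tag
        soup.find(class_=re.compile(r"article-(content|body)")),   # 2. article-content / article-body
        soup.find(class_=re.compile(r"(post|entry)-content")),     # 3. post-content / entry-content
        soup.find("main"),                                         # 4. <main> tag
    ]
    for node in candidates:
        if node and node.get_text(strip=True):
            return node.get_text(" ", strip=True)

    # 5. fallback: concatenate all <p> tags in the body
    body = soup.body or soup
    return " ".join(p.get_text(" ", strip=True) for p in body.find_all("p"))
```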
## Database Schema

Articles are stored with these fields:

```javascript
{
  title: String,          // Article title
  link: String,           // Article URL (unique)
  summary: String,        // Short summary
  full_content: String,   // Full article text (max 10,000 chars)
  word_count: Number,     // Number of words
  source: String,         // RSS feed name
  published_at: String,   // Publication date
  crawled_at: DateTime,   // When content was crawled
  created_at: DateTime    // When added to database
}
```
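
For reference, a hedged pymongo snippet showing how documents with this shape can be upserted and queried. The database/collection names (`news`, `articles`) and the unique index on `link` are assumptions consistent with the schema above, not necessarily what the crawler itself does.

```python
# Illustrative only; "news" / "articles" names and the unique index are assumptions.
from datetime import datetime, timezone

from pymongo import MongoClient

articles = MongoClient("mongodb://localhost:27017/")["news"]["articles"]
articles.create_index("link", unique=True)   # one document per article URL

now = datetime.now(timezone.utc)
articles.update_one(
    {"link": "https://example.com/story"},
    {
        "$set": {
            "title": "Example story",
            "summary": "Short summary",
            "full_content": "Full article text",
            "word_count": 3,
            "source": "Example News",
            "published_at": "2024-01-01",
            "crawled_at": now,
        },
        "$setOnInsert": {"created_at": now},
    },
    upsert=True,
)

# Articles that still need crawling (no full_content yet):
pending = list(articles.find({"full_content": {"$exists": False}}))
```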
## Scheduling

### Using Cron (Linux/Mac)

```bash
# Run every 6 hours
0 */6 * * * cd /path/to/news_crawler && /path/to/venv/bin/python crawler_service.py
```

### Using systemd Timer (Linux)

Create `/etc/systemd/system/news-crawler.service`:

```ini
[Unit]
Description=News Crawler Service

[Service]
Type=oneshot
WorkingDirectory=/path/to/news_crawler
ExecStart=/path/to/venv/bin/python crawler_service.py
User=your-user
```

Create `/etc/systemd/system/news-crawler.timer`:

```ini
[Unit]
Description=Run News Crawler every 6 hours

[Timer]
OnBootSec=5min
OnUnitActiveSec=6h

[Install]
WantedBy=timers.target
```

Enable and start:

```bash
sudo systemctl enable news-crawler.timer
sudo systemctl start news-crawler.timer
```

### Using Docker

Create `Dockerfile`:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY crawler_service.py .

CMD ["python", "crawler_service.py"]
```

Build and run:

```bash
docker build -t news-crawler .
docker run --env-file ../.env news-crawler
```
## Configuration

Environment variables:

- `MONGODB_URI` - MongoDB connection string (default: `mongodb://localhost:27017/`)
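
A minimal sketch of how the service can pick this value up, assuming `python-dotenv` is among the dependencies:

```python
import os

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()  # reads MONGODB_URI from a .env file if present
client = MongoClient(os.getenv("MONGODB_URI", "mongodb://localhost:27017/"))
```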
## Rate Limiting

- 1 second delay between article requests
- Respects server resources
- User-Agent header included
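
In code, these rules boil down to a shared header and a pause between requests. A sketch; the helper name and User-Agent string are illustrative, not the crawler's actual code.

```python
import time

import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; news-crawler/1.0)"}

def fetch_politely(urls, delay: float = 1.0, timeout: int = 10):
    """Fetch each article URL with a shared User-Agent and a delay in between."""
    pages = []
    for url in urls:
        resp = requests.get(url, headers=HEADERS, timeout=timeout)
        pages.append(resp.text)
        time.sleep(delay)  # 1 second pause between article requests
    return pages
```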
## Troubleshooting

**Issue: Can't extract content**

- Some sites block scrapers
- Try adjusting the User-Agent header
- Some sites require JavaScript to render (consider Selenium)

**Issue: Timeout errors**

- Increase the timeout in `extract_article_content()`
- Check network connectivity
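
Both adjustments come down to the arguments passed to `requests.get()`. A hedged example; the helper name, User-Agent string, and timeout value are illustrative, not the crawler's actual code.

```python
import requests

def fetch_html(url: str, timeout: int = 30) -> str:
    """Fetch a page with a browser-like User-Agent and a longer timeout."""
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
    resp = requests.get(url, headers=headers, timeout=timeout)
    resp.raise_for_status()
    return resp.text
```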
**Issue: Memory usage**

- Reduce `max_articles_per_feed`
- Content is limited to 10,000 characters per article
## Architecture

This is a standalone microservice that:

- Can run independently of the main backend
- Shares the same MongoDB database
- Can be deployed separately
- Can be scheduled independently
## Next Steps

Once articles are crawled, you can:

- Use Ollama to summarize articles
- Perform sentiment analysis
- Extract keywords and topics
- Generate newsletter content
- Create article recommendations