commit ac5738c29d (2025-11-10 19:13:33 +01:00)
64 changed files with 9445 additions and 0 deletions

news_crawler/README.md (new file, 225 lines)
# News Crawler Microservice
A standalone microservice that crawls full article content from RSS feeds and stores it in MongoDB.
## Features
- 🔍 Extracts full article content from RSS feed links
- 📊 Calculates word count
- 🔄 Avoids re-crawling already processed articles
- ⏱️ Rate limiting (1 second delay between requests)
- 🎯 Smart content extraction using multiple selectors
- 🧹 Cleans up scripts, styles, and navigation elements
## Installation
1. Create a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Configure environment variables:
Create a `.env` file in the project root (or use the backend's `.env`):
```env
MONGODB_URI=mongodb://localhost:27017/
```
## Usage
### Standalone Execution
Run the crawler directly:
```bash
# Crawl up to 10 articles per feed (default)
python crawler_service.py
# Crawl up to 20 articles per feed
python crawler_service.py 20
```
### As a Module
```python
from crawler_service import crawl_all_feeds, crawl_rss_feed
# Crawl all active feeds
result = crawl_all_feeds(max_articles_per_feed=10)
print(result)
# Crawl a specific feed
crawl_rss_feed(
feed_url='https://example.com/rss',
feed_name='Example News',
max_articles=10
)
```
### Via Backend API
The crawler is also integrated into the backend, which exposes these endpoints:
```bash
# Start crawler
curl -X POST http://localhost:5001/api/crawler/start
# Check status
curl http://localhost:5001/api/crawler/status
# Crawl specific feed
curl -X POST http://localhost:5001/api/crawler/feed/<feed_id>
```
## How It Works
1. **Fetch RSS Feeds**: Loads all active RSS feeds from MongoDB
2. **Parse Feed**: Extracts article links from each feed
3. **Crawl Content**: For each article:
   - Fetches the HTML page
   - Removes scripts, styles, and navigation
   - Extracts the main content using smart selectors
   - Calculates the word count
4. **Store Data**: Saves the article to MongoDB with metadata
5. **Skip Duplicates**: Articles that already have stored content are not re-crawled (see the sketch below)
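The loop below is a minimal sketch of this pipeline, not the actual `crawler_service.py` internals: the `news` database and `articles` collection names are assumptions, the `extract_article_content` import assumes a `url -> text` signature, and error handling is omitted.
```python
# Sketch of the crawl pipeline described above; names are illustrative.
import time
from datetime import datetime, timezone

import feedparser
from pymongo import MongoClient

from crawler_service import extract_article_content  # signature assumed: url -> text

client = MongoClient("mongodb://localhost:27017/")
articles = client["news"]["articles"]  # hypothetical database/collection names


def crawl_one_feed(feed_url: str, feed_name: str, max_articles: int = 10) -> int:
    """Parse one RSS feed and store full content for articles not yet crawled."""
    parsed = feedparser.parse(feed_url)
    stored = 0
    for entry in parsed.entries[:max_articles]:
        link = entry.get("link")
        if not link:
            continue
        # Skip duplicates: never re-crawl an article that already has content
        if articles.find_one({"link": link, "full_content": {"$exists": True}}):
            continue
        text = extract_article_content(link)
        if not text:
            continue
        now = datetime.now(timezone.utc)
        articles.update_one(
            {"link": link},
            {
                "$set": {
                    "title": entry.get("title", ""),
                    "summary": entry.get("summary", ""),
                    "full_content": text[:10000],        # cap at 10,000 characters
                    "word_count": len(text.split()),
                    "source": feed_name,
                    "published_at": entry.get("published", ""),
                    "crawled_at": now,
                },
                "$setOnInsert": {"created_at": now},
            },
            upsert=True,
        )
        stored += 1
        time.sleep(1)  # rate limit: 1 second between article requests
    return stored
```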
## Content Extraction Strategy
The crawler tries multiple selectors in order (a sketch of the fallback chain follows the list):
1. `<article>` tag
2. Elements with a class containing `article-content` or `article-body`
3. Elements with a class containing `post-content` or `entry-content`
4. `<main>` tag
5. Fallback to all `<p>` tags in the body
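A minimal sketch of this fallback chain using `requests` and BeautifulSoup; the function name, User-Agent string, and exact regexes are illustrative and may not match what `crawler_service.py` does internally.
```python
# Sketch of the selector fallback chain; the real crawler may differ.
import re

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; NewsCrawler/1.0)"}  # illustrative UA


def extract_main_text(url: str, timeout: int = 10) -> str:
    """Fetch a page and return its main article text using fallback selectors."""
    resp = requests.get(url, headers=HEADERS, timeout=timeout)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # Clean up scripts, styles, and navigation elements first
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()

    candidates = [
        soup.find("article"),                                      # 1. <article> tag
        soup.find(class_=re.compile(r"article-(content|body)")),   # 2. article-content / article-body
        soup.find(class_=re.compile(r"(post|entry)-content")),     # 3. post-content / entry-content
        soup.find("main"),                                         # 4. <main> tag
    ]
    for node in candidates:
        if node and node.get_text(strip=True):
            return node.get_text(separator=" ", strip=True)

    # 5. Fallback: join all <p> tags in the body
    body = soup.body or soup
    return " ".join(p.get_text(strip=True) for p in body.find_all("p"))
```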
## Database Schema
Articles are stored with these fields:
```javascript
{
  title: String,         // Article title
  link: String,          // Article URL (unique)
  summary: String,       // Short summary
  full_content: String,  // Full article text (max 10,000 chars)
  word_count: Number,    // Number of words
  source: String,        // RSS feed name
  published_at: String,  // Publication date
  crawled_at: DateTime,  // When content was crawled
  created_at: DateTime   // When added to database
}
```
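Since `link` acts as the unique key, it helps to back it with a unique index. The sketch below shows that plus an example query with pymongo; the `news` database and `articles` collection names are assumptions.
```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
articles = client["news"]["articles"]  # hypothetical database/collection names

# Enforce uniqueness on the article URL so upserts never create duplicates
articles.create_index("link", unique=True)

# Example query: the ten longest articles that already have crawled content
longest = articles.find(
    {"full_content": {"$exists": True}},
    {"title": 1, "word_count": 1, "source": 1},
).sort("word_count", -1).limit(10)

for doc in longest:
    print(doc["source"], doc["word_count"], doc["title"])
```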
## Scheduling
### Using Cron (Linux/Mac)
```bash
# Run every 6 hours
0 */6 * * * cd /path/to/news_crawler && /path/to/venv/bin/python crawler_service.py
```
### Using systemd Timer (Linux)
Create `/etc/systemd/system/news-crawler.service`:
```ini
[Unit]
Description=News Crawler Service
[Service]
Type=oneshot
WorkingDirectory=/path/to/news_crawler
ExecStart=/path/to/venv/bin/python crawler_service.py
User=your-user
```
Create `/etc/systemd/system/news-crawler.timer`:
```ini
[Unit]
Description=Run News Crawler every 6 hours
[Timer]
OnBootSec=5min
OnUnitActiveSec=6h
[Install]
WantedBy=timers.target
```
Enable and start:
```bash
sudo systemctl enable news-crawler.timer
sudo systemctl start news-crawler.timer
```
### Using Docker
Create `Dockerfile`:
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY crawler_service.py .
CMD ["python", "crawler_service.py"]
```
Build and run:
```bash
docker build -t news-crawler .
docker run --env-file ../.env news-crawler
```
## Configuration
Environment variables:
- `MONGODB_URI` - MongoDB connection string (default: `mongodb://localhost:27017/`)
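A sketch of how a script can pick this up, assuming `python-dotenv` is installed; plain `os.environ` works the same way without the `load_dotenv()` call.
```python
import os

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()  # reads .env from the current working directory

# Fall back to the documented default when MONGODB_URI is not set
mongodb_uri = os.getenv("MONGODB_URI", "mongodb://localhost:27017/")
client = MongoClient(mongodb_uri)
print(client.server_info()["version"])  # quick connectivity check
```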
## Rate Limiting
- A 1-second delay between article requests (see the sketch below)
- Keeps load on source servers low
- A User-Agent header is sent with every request
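The sketch below shows one way to guarantee at least one second between request starts even when a fetch itself takes time; it is an illustration, not the shipped implementation.
```python
import time


class RequestThrottle:
    """Ensures at least `min_interval` seconds between consecutive requests."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()


throttle = RequestThrottle(min_interval=1.0)
for url in ["https://example.com/a", "https://example.com/b"]:
    throttle.wait()  # blocks so requests start at least 1 second apart
    # fetch url here, e.g. with extract_article_content(url)
```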
## Troubleshooting
**Issue: Can't extract content**
- Some sites block scrapers
- Try adjusting the User-Agent header
- Some sites require JavaScript to render content (consider Selenium)

**Issue: Timeout errors**
- Increase the timeout in `extract_article_content()`
- Check network connectivity

**Issue: High memory usage**
- Reduce `max_articles_per_feed`
- Content is already capped at 10,000 characters per article
## Architecture
This is a standalone microservice that:
- Can run independently of the main backend
- Shares the same MongoDB database
- Can be deployed separately
- Can be scheduled independently
## Next Steps
Once articles are crawled, you can:
- Use Ollama to summarize articles (see the sketch below)
- Perform sentiment analysis
- Extract keywords and topics
- Generate newsletter content
- Create article recommendations
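As a hedged sketch of the first item, the snippet below sends one crawled article to a local Ollama instance for summarization; the model name, prompt, and collection names are assumptions.
```python
# Sketch: summarize one crawled article with a local Ollama instance.
import requests
from pymongo import MongoClient

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

articles = MongoClient("mongodb://localhost:27017/")["news"]["articles"]
doc = articles.find_one({"full_content": {"$exists": True}})
if doc is None:
    raise SystemExit("no crawled articles yet")

resp = requests.post(
    OLLAMA_URL,
    json={
        "model": "llama3",  # any locally pulled model
        "prompt": f"Summarize this article in three sentences:\n\n{doc['full_content']}",
        "stream": False,    # return one JSON object instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```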