# News Crawler Microservice

A standalone microservice that crawls full article content from RSS feeds and stores it in MongoDB.

## Features

- 🔍 Extracts full article content from RSS feed links
- 📊 Calculates word count
- 🔄 Avoids re-crawling already processed articles
- ⏱️ Rate limiting (1-second delay between requests)
- 🎯 Smart content extraction using multiple selectors
- 🧹 Cleans up scripts, styles, and navigation elements

## Installation

1. Create a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Configure environment variables:

   Create a `.env` file in the project root (or use the backend's `.env`):

   ```env
   MONGODB_URI=mongodb://localhost:27017/
   ```

## Usage

### Standalone Execution

Run the crawler directly:

```bash
# Crawl up to 10 articles per feed (default)
python crawler_service.py

# Crawl up to 20 articles per feed
python crawler_service.py 20
```

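The positional argument maps to the per-feed article limit. A plausible entry point for this behaviour (an assumption about `crawler_service.py`, not the verbatim source) looks like:

```python
import sys

if __name__ == "__main__":
    # Optional positional argument: maximum articles per feed (defaults to 10)
    max_articles = int(sys.argv[1]) if len(sys.argv) > 1 else 10
    result = crawl_all_feeds(max_articles_per_feed=max_articles)  # defined in crawler_service.py
    print(result)
```
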
### As a Module

```python
from crawler_service import crawl_all_feeds, crawl_rss_feed

# Crawl all active feeds
result = crawl_all_feeds(max_articles_per_feed=10)
print(result)

# Crawl a specific feed
crawl_rss_feed(
    feed_url='https://example.com/rss',
    feed_name='Example News',
    max_articles=10
)
```

### Via Backend API

The backend exposes integrated crawler endpoints:

```bash
# Start crawler
curl -X POST http://localhost:5001/api/crawler/start

# Check status
curl http://localhost:5001/api/crawler/status

# Crawl a specific feed
curl -X POST http://localhost:5001/api/crawler/feed/<feed_id>
```

## How It Works

1. **Fetch RSS Feeds**: Gets all active RSS feeds from MongoDB
2. **Parse Feed**: Extracts article links from each feed
3. **Crawl Content**: For each article:
   - Fetches HTML page
   - Removes scripts, styles, navigation
   - Extracts main content using smart selectors
   - Calculates word count
4. **Store Data**: Saves to MongoDB with metadata
5. **Skip Duplicates**: Avoids re-crawling articles with existing content

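Putting those steps together, the loop looks roughly like the sketch below. This is a minimal illustration, assuming the service uses `feedparser`, `requests`, `BeautifulSoup`, and `pymongo`; the database, collection, and field names here are assumptions rather than the exact ones in `crawler_service.py`.

```python
# Illustrative sketch of the crawl loop -- not the verbatim implementation.
import time

import feedparser
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["news"]  # names are assumptions

def crawl_feed(feed_url: str, feed_name: str, max_articles: int = 10) -> None:
    feed = feedparser.parse(feed_url)          # parse feed, collect article links
    for entry in feed.entries[:max_articles]:
        # Skip duplicates: only crawl links without stored content
        if db.articles.find_one({"link": entry.link, "full_content": {"$ne": None}}):
            continue
        # Fetch the page, strip noise, extract the main text
        html = requests.get(entry.link, timeout=10,
                            headers={"User-Agent": "NewsCrawler/1.0"}).text
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style", "nav"]):
            tag.decompose()
        text = (soup.find("article") or soup.body or soup).get_text(" ", strip=True)[:10000]
        # Store with metadata
        db.articles.update_one(
            {"link": entry.link},
            {"$set": {"title": entry.title, "source": feed_name,
                      "full_content": text, "word_count": len(text.split())}},
            upsert=True,
        )
        time.sleep(1)  # rate limit between article requests
```
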
## Content Extraction Strategy

The crawler tries multiple selectors in order:

1. `<article>` tag
2. Elements with a class containing "article-content" or "article-body"
3. Elements with a class containing "post-content" or "entry-content"
4. `<main>` tag
5. Fallback to all `<p>` tags in the body

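A condensed sketch of this fallback chain, assuming `BeautifulSoup` (the regular expressions stand in for the class-name matching and are illustrative):

```python
import re

from bs4 import BeautifulSoup

def extract_main_content(html: str) -> str:
    """Try progressively more generic selectors until one yields text."""
    soup = BeautifulSoup(html, "html.parser")
    candidates = [
        soup.find("article"),                                     # 1. <article> tag
        soup.find(class_=re.compile(r"article-(content|body)")),  # 2. article-content / article-body
        soup.find(class_=re.compile(r"(post|entry)-content")),    # 3. post-content / entry-content
        soup.find("main"),                                        # 4. <main> tag
    ]
    for node in candidates:
        if node and node.get_text(strip=True):
            return node.get_text(" ", strip=True)
    # 5. Fallback: join every <p> tag in the body
    body = soup.body or soup
    return " ".join(p.get_text(" ", strip=True) for p in body.find_all("p"))
```
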
## Database Schema

Articles are stored with these fields:

```javascript
{
  title: String,         // Article title
  link: String,          // Article URL (unique)
  summary: String,       // Short summary
  full_content: String,  // Full article text (max 10,000 chars)
  word_count: Number,    // Number of words
  source: String,        // RSS feed name
  published_at: String,  // Publication date
  crawled_at: DateTime,  // When content was crawled
  created_at: DateTime   // When added to database
}
```

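Since `link` is treated as unique, it is worth enforcing that at the database level so repeated crawls upsert rather than duplicate. A possible setup with `pymongo` (the `news` database and `articles` collection names are assumptions):

```python
from pymongo import ASCENDING, MongoClient

db = MongoClient("mongodb://localhost:27017/")["news"]  # database name is an assumption

# Unique index on the article URL; duplicate inserts will raise DuplicateKeyError
db.articles.create_index([("link", ASCENDING)], unique=True)
```
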
## Scheduling

### Using Cron (Linux/Mac)

```bash
# Run every 6 hours
0 */6 * * * cd /path/to/news_crawler && /path/to/venv/bin/python crawler_service.py
```

### Using systemd Timer (Linux)

Create `/etc/systemd/system/news-crawler.service`:

```ini
[Unit]
Description=News Crawler Service

[Service]
Type=oneshot
WorkingDirectory=/path/to/news_crawler
ExecStart=/path/to/venv/bin/python crawler_service.py
User=your-user
```

Create `/etc/systemd/system/news-crawler.timer`:

```ini
[Unit]
Description=Run News Crawler every 6 hours

[Timer]
OnBootSec=5min
OnUnitActiveSec=6h

[Install]
WantedBy=timers.target
```

Enable and start:

```bash
sudo systemctl enable news-crawler.timer
sudo systemctl start news-crawler.timer
```

### Using Docker

Create `Dockerfile`:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY crawler_service.py .

CMD ["python", "crawler_service.py"]
```

Build and run:

```bash
docker build -t news-crawler .
docker run --env-file ../.env news-crawler
```

## Configuration

Environment variables:

- `MONGODB_URI` - MongoDB connection string (default: `mongodb://localhost:27017/`)

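A minimal sketch of how the service can read this variable, assuming `python-dotenv` and `pymongo` are listed in `requirements.txt`:

```python
import os

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()  # picks up MONGODB_URI from a local .env file, if present
mongo_uri = os.getenv("MONGODB_URI", "mongodb://localhost:27017/")
client = MongoClient(mongo_uri)
```
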
## Rate Limiting

- 1-second delay between article requests
- Respects server resources
- User-Agent header included

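In practice this amounts to something like the following wrapper around each article request (the User-Agent string below is illustrative):

```python
import time

import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; NewsCrawler/1.0)"}  # illustrative value

def polite_get(url: str) -> str:
    """Fetch a page with a custom User-Agent, then pause before the next request."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    time.sleep(1)  # 1-second delay between article requests
    return response.text
```
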
## Troubleshooting

**Issue: Can't extract content**
- Some sites block scrapers
- Try adjusting the User-Agent header
- Some sites require JavaScript to render content (consider Selenium)

**Issue: Timeout errors**
- Increase the request timeout in `extract_article_content()`
- Check network connectivity

**Issue: High memory usage**
- Reduce `max_articles_per_feed`
- Content is already capped at 10,000 characters per article

## Architecture

This is a standalone microservice that:
- Can run independently of the main backend
- Shares the same MongoDB database
- Can be deployed separately
- Can be scheduled independently

## Next Steps

Once articles are crawled, you can:
- Use Ollama to summarize articles
- Perform sentiment analysis
- Extract keywords and topics
- Generate newsletter content
- Create article recommendations

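As a starting point for the first item, here is a rough sketch of summarizing a crawled article through Ollama's local HTTP API (the model name and prompt are placeholders; use whichever model you have pulled):

```python
import requests

def summarize(article_text: str, model: str = "llama3") -> str:
    """Ask a locally running Ollama instance for a short summary of one article."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,  # placeholder -- e.g. pull it first with `ollama pull llama3`
            "prompt": "Summarize this news article in 3 sentences:\n\n" + article_text[:4000],
            "stream": False,
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["response"]
```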