# News Crawler Microservice

A standalone microservice that crawls full article content from RSS feeds and stores it in MongoDB.

## Features

- 🔍 Extracts full article content from RSS feed links
- 📊 Calculates word count
- 🔄 Avoids re-crawling already processed articles
- ⏱️ Rate limiting (1 second delay between requests)
- 🎯 Smart content extraction using multiple selectors
- 🧹 Cleans up scripts, styles, and navigation elements
## Installation

1. Create a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Configure environment variables:

Create a `.env` file in the project root (or use the backend's `.env`):

```env
MONGODB_URI=mongodb://localhost:27017/
```
## Usage

### Standalone Execution

Run the crawler directly:

```bash
# Crawl up to 10 articles per feed (default)
python crawler_service.py

# Crawl up to 20 articles per feed
python crawler_service.py 20
```

### As a Module

```python
from crawler_service import crawl_all_feeds, crawl_rss_feed

# Crawl all active feeds
result = crawl_all_feeds(max_articles_per_feed=10)
print(result)

# Crawl a specific feed
crawl_rss_feed(
    feed_url='https://example.com/rss',
    feed_name='Example News',
    max_articles=10
)
```
### Via Backend API

The backend exposes integrated crawler endpoints:

```bash
# Start crawler
curl -X POST http://localhost:5001/api/crawler/start

# Check status
curl http://localhost:5001/api/crawler/status

# Crawl a specific feed
curl -X POST http://localhost:5001/api/crawler/feed/<feed_id>
```
## How It Works

1. **Fetch RSS Feeds**: Gets all active RSS feeds from MongoDB
2. **Parse Feed**: Extracts article links from each feed
3. **Crawl Content**: For each article:
   - Fetches HTML page
   - Removes scripts, styles, navigation
   - Extracts main content using smart selectors
   - Calculates word count
4. **Store Data**: Saves to MongoDB with metadata
5. **Skip Duplicates**: Avoids re-crawling articles with existing content
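
Put together, the cycle looks roughly like the loop below. This is a condensed sketch, not the actual `crawler_service.py` code; the database name (`news`), the collection names (`rss_feeds`, `articles`), and the field names are assumptions.

```python
# Condensed sketch of the crawl cycle above. Database name ("news"),
# collection names ("rss_feeds", "articles"), and field names are assumptions.
import time

import feedparser
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["news"]

def crawl_sketch(max_articles_per_feed: int = 10) -> None:
    for feed in db.rss_feeds.find({"active": True}):            # 1. fetch active RSS feeds
        parsed = feedparser.parse(feed["url"])                   # 2. parse feed, get article links
        for entry in parsed.entries[:max_articles_per_feed]:
            if db.articles.find_one({"link": entry.link,
                                     "full_content": {"$exists": True}}):
                continue                                         # 5. skip already-crawled articles
            resp = requests.get(entry.link, timeout=10,
                                headers={"User-Agent": "news-crawler/1.0"})
            soup = BeautifulSoup(resp.text, "html.parser")       # 3. crawl content
            for tag in soup(["script", "style", "nav"]):
                tag.decompose()
            container = soup.find("article") or soup.body or soup
            text = container.get_text(" ", strip=True)[:10000]
            db.articles.update_one(                              # 4. store with metadata
                {"link": entry.link},
                {"$set": {"title": entry.get("title", ""),
                          "full_content": text,
                          "word_count": len(text.split()),
                          "source": feed.get("name", "")}},
                upsert=True,
            )
            time.sleep(1)                                        # rate limiting between requests
```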
## Content Extraction Strategy

The crawler tries multiple selectors in order:

1. `<article>` tag
2. Elements whose class contains "article-content" or "article-body"
3. Elements whose class contains "post-content" or "entry-content"
4. `<main>` tag
5. Fallback to all `<p>` tags in the body
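
A sketch of that fallback chain using BeautifulSoup; the exact class patterns in `crawler_service.py` may differ in detail.

```python
# Illustrative fallback chain; the real selectors may differ in detail.
import re

from bs4 import BeautifulSoup

def pick_main_content(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav"]):   # drop non-content elements first
        tag.decompose()

    candidates = [
        soup.find("article"),                                      # 1. <article> tag
        soup.find(class_=re.compile(r"article-(content|body)")),   # 2. article-content / article-body
        soup.find(class_=re.compile(r"(post|entry)-content")),     # 3. post-content / entry-content
        soup.find("main"),                                         # 4. <main> tag
    ]
    for node in candidates:
        if node and node.get_text(strip=True):
            return node.get_text(" ", strip=True)

    # 5. fallback: concatenate all <p> tags in the body
    body = soup.body or soup
    return " ".join(p.get_text(" ", strip=True) for p in body.find_all("p"))
```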
## Database Schema

Articles are stored with these fields:

```javascript
{
  title: String,          // Article title
  link: String,           // Article URL (unique)
  summary: String,        // Short summary
  full_content: String,   // Full article text (max 10,000 chars)
  word_count: Number,     // Number of words
  source: String,         // RSS feed name
  published_at: String,   // Publication date
  crawled_at: DateTime,   // When content was crawled
  created_at: DateTime    // When added to database
}
```
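
For reference, a hedged pymongo snippet showing how documents with this shape can be upserted and queried. The database/collection names (`news`, `articles`) and the unique index on `link` are assumptions consistent with the schema above, not necessarily what the crawler itself does.

```python
# Illustrative only; "news" / "articles" names and the unique index are assumptions.
from datetime import datetime, timezone

from pymongo import MongoClient

articles = MongoClient("mongodb://localhost:27017/")["news"]["articles"]
articles.create_index("link", unique=True)   # one document per article URL

now = datetime.now(timezone.utc)
articles.update_one(
    {"link": "https://example.com/story"},
    {
        "$set": {
            "title": "Example story",
            "summary": "Short summary",
            "full_content": "Full article text",
            "word_count": 3,
            "source": "Example News",
            "published_at": "2024-01-01",
            "crawled_at": now,
        },
        "$setOnInsert": {"created_at": now},
    },
    upsert=True,
)

# Articles that still need crawling (no full_content yet):
pending = list(articles.find({"full_content": {"$exists": False}}))
```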
## Scheduling

### Using Cron (Linux/Mac)

```bash
# Run every 6 hours
0 */6 * * * cd /path/to/news_crawler && /path/to/venv/bin/python crawler_service.py
```

### Using systemd Timer (Linux)

Create `/etc/systemd/system/news-crawler.service`:

```ini
[Unit]
Description=News Crawler Service

[Service]
Type=oneshot
WorkingDirectory=/path/to/news_crawler
ExecStart=/path/to/venv/bin/python crawler_service.py
User=your-user
```

Create `/etc/systemd/system/news-crawler.timer`:

```ini
[Unit]
Description=Run News Crawler every 6 hours

[Timer]
OnBootSec=5min
OnUnitActiveSec=6h

[Install]
WantedBy=timers.target
```

Enable and start:

```bash
sudo systemctl enable news-crawler.timer
sudo systemctl start news-crawler.timer
```

### Using Docker

Create `Dockerfile`:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY crawler_service.py .

CMD ["python", "crawler_service.py"]
```

Build and run:

```bash
docker build -t news-crawler .
docker run --env-file ../.env news-crawler
```
## Configuration

Environment variables:

- `MONGODB_URI` - MongoDB connection string (default: `mongodb://localhost:27017/`)
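
A minimal sketch of how the service can pick this value up, assuming `python-dotenv` is among the dependencies:

```python
import os

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()  # reads MONGODB_URI from a .env file if present
client = MongoClient(os.getenv("MONGODB_URI", "mongodb://localhost:27017/"))
```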
## Rate Limiting

- 1 second delay between article requests
- Respects server resources
- User-Agent header included
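
In code, these rules boil down to a shared header and a pause between requests. A sketch; the helper name and User-Agent string are illustrative, not the crawler's actual code.

```python
import time

import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; news-crawler/1.0)"}

def fetch_politely(urls, delay: float = 1.0, timeout: int = 10):
    """Fetch each article URL with a shared User-Agent and a delay in between."""
    pages = []
    for url in urls:
        resp = requests.get(url, headers=HEADERS, timeout=timeout)
        pages.append(resp.text)
        time.sleep(delay)  # 1 second pause between article requests
    return pages
```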
## Troubleshooting

**Issue: Can't extract content**

- Some sites block scrapers
- Try adjusting the User-Agent header
- Some sites require JavaScript to render (consider Selenium)

**Issue: Timeout errors**

- Increase the timeout in `extract_article_content()`
- Check network connectivity
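
Both adjustments come down to the arguments passed to `requests.get()`. A hedged example; the helper name, User-Agent string, and timeout value are illustrative, not the crawler's actual code.

```python
import requests

def fetch_html(url: str, timeout: int = 30) -> str:
    """Fetch a page with a browser-like User-Agent and a longer timeout."""
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
    resp = requests.get(url, headers=headers, timeout=timeout)
    resp.raise_for_status()
    return resp.text
```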
**Issue: Memory usage**

- Reduce `max_articles_per_feed`
- Content is limited to 10,000 characters per article
## Architecture

This is a standalone microservice that:

- Can run independently of the main backend
- Shares the same MongoDB database
- Can be deployed separately
- Can be scheduled independently
## Next Steps

Once articles are crawled, you can:

- Use Ollama to summarize articles
- Perform sentiment analysis
- Extract keywords and topics
- Generate newsletter content
- Create article recommendations