# News Crawler Microservice

A standalone microservice that crawls full article content from RSS feeds and stores it in MongoDB.

## Features

- 🔍 Extracts full article content from RSS feed links
- 📊 Calculates word count
- 🔄 Avoids re-crawling already processed articles
- ⏱️ Rate limiting (1-second delay between requests)
- 🎯 Smart content extraction using multiple selectors
- 🧹 Cleans up scripts, styles, and navigation elements

## Installation

1. Create a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Configure environment variables:

   Create a `.env` file in the project root (or use the backend's `.env`):

   ```env
   MONGODB_URI=mongodb://localhost:27017/
   ```

## Usage

### Standalone Execution

Run the crawler directly:

```bash
# Crawl up to 10 articles per feed (default)
python crawler_service.py

# Crawl up to 20 articles per feed
python crawler_service.py 20
```

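The positional argument maps to the per-feed article limit. A plausible entry point for this behaviour (an assumption about `crawler_service.py`, not the verbatim source) looks like:

```python
import sys

if __name__ == "__main__":
    # Optional positional argument: maximum articles per feed (defaults to 10)
    max_articles = int(sys.argv[1]) if len(sys.argv) > 1 else 10
    result = crawl_all_feeds(max_articles_per_feed=max_articles)  # defined in crawler_service.py
    print(result)
```
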
### As a Module

```python
from crawler_service import crawl_all_feeds, crawl_rss_feed

# Crawl all active feeds
result = crawl_all_feeds(max_articles_per_feed=10)
print(result)

# Crawl a specific feed
crawl_rss_feed(
    feed_url='https://example.com/rss',
    feed_name='Example News',
    max_articles=10
)
```

### Via Backend API

The backend exposes integrated crawler endpoints:

```bash
# Start crawler
curl -X POST http://localhost:5001/api/crawler/start

# Check status
curl http://localhost:5001/api/crawler/status

# Crawl a specific feed
curl -X POST http://localhost:5001/api/crawler/feed/<feed_id>
```

## How It Works

1. **Fetch RSS Feeds**: Gets all active RSS feeds from MongoDB
2. **Parse Feed**: Extracts article links from each feed
3. **Crawl Content**: For each article:
   - Fetches HTML page
   - Removes scripts, styles, navigation
   - Extracts main content using smart selectors
   - Calculates word count
4. **Store Data**: Saves to MongoDB with metadata
5. **Skip Duplicates**: Avoids re-crawling articles with existing content

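Putting those steps together, the loop looks roughly like the sketch below. This is a minimal illustration, assuming the service uses `feedparser`, `requests`, `BeautifulSoup`, and `pymongo`; the database, collection, and field names here are assumptions rather than the exact ones in `crawler_service.py`.

```python
# Illustrative sketch of the crawl loop -- not the verbatim implementation.
import time

import feedparser
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["news"]  # names are assumptions

def crawl_feed(feed_url: str, feed_name: str, max_articles: int = 10) -> None:
    feed = feedparser.parse(feed_url)          # parse feed, collect article links
    for entry in feed.entries[:max_articles]:
        # Skip duplicates: only crawl links without stored content
        if db.articles.find_one({"link": entry.link, "full_content": {"$ne": None}}):
            continue
        # Fetch the page, strip noise, extract the main text
        html = requests.get(entry.link, timeout=10,
                            headers={"User-Agent": "NewsCrawler/1.0"}).text
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style", "nav"]):
            tag.decompose()
        text = (soup.find("article") or soup.body or soup).get_text(" ", strip=True)[:10000]
        # Store with metadata
        db.articles.update_one(
            {"link": entry.link},
            {"$set": {"title": entry.title, "source": feed_name,
                      "full_content": text, "word_count": len(text.split())}},
            upsert=True,
        )
        time.sleep(1)  # rate limit between article requests
```
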
## Content Extraction Strategy

The crawler tries multiple selectors in order:

1. `<article>` tag
2. Elements with a class containing "article-content" or "article-body"
3. Elements with a class containing "post-content" or "entry-content"
4. `<main>` tag
5. Fallback to all `<p>` tags in the body

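A condensed sketch of this fallback chain, assuming `BeautifulSoup` (the regular expressions stand in for the class-name matching and are illustrative):

```python
import re

from bs4 import BeautifulSoup

def extract_main_content(html: str) -> str:
    """Try progressively more generic selectors until one yields text."""
    soup = BeautifulSoup(html, "html.parser")
    candidates = [
        soup.find("article"),                                     # 1. <article> tag
        soup.find(class_=re.compile(r"article-(content|body)")),  # 2. article-content / article-body
        soup.find(class_=re.compile(r"(post|entry)-content")),    # 3. post-content / entry-content
        soup.find("main"),                                        # 4. <main> tag
    ]
    for node in candidates:
        if node and node.get_text(strip=True):
            return node.get_text(" ", strip=True)
    # 5. Fallback: join every <p> tag in the body
    body = soup.body or soup
    return " ".join(p.get_text(" ", strip=True) for p in body.find_all("p"))
```
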
## Database Schema

Articles are stored with these fields:

```javascript
{
  title: String,         // Article title
  link: String,          // Article URL (unique)
  summary: String,       // Short summary
  full_content: String,  // Full article text (max 10,000 chars)
  word_count: Number,    // Number of words
  source: String,        // RSS feed name
  published_at: String,  // Publication date
  crawled_at: DateTime,  // When content was crawled
  created_at: DateTime   // When added to database
}
```

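Since `link` is treated as unique, it is worth enforcing that at the database level so repeated crawls upsert rather than duplicate. A possible setup with `pymongo` (the `news` database and `articles` collection names are assumptions):

```python
from pymongo import ASCENDING, MongoClient

db = MongoClient("mongodb://localhost:27017/")["news"]  # database name is an assumption

# Unique index on the article URL; duplicate inserts will raise DuplicateKeyError
db.articles.create_index([("link", ASCENDING)], unique=True)
```
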
## Scheduling

### Using Cron (Linux/Mac)

```bash
# Run every 6 hours
0 */6 * * * cd /path/to/news_crawler && /path/to/venv/bin/python crawler_service.py
```

### Using systemd Timer (Linux)

Create `/etc/systemd/system/news-crawler.service`:

```ini
[Unit]
Description=News Crawler Service

[Service]
Type=oneshot
WorkingDirectory=/path/to/news_crawler
ExecStart=/path/to/venv/bin/python crawler_service.py
User=your-user
```

Create `/etc/systemd/system/news-crawler.timer`:

```ini
[Unit]
Description=Run News Crawler every 6 hours

[Timer]
OnBootSec=5min
OnUnitActiveSec=6h

[Install]
WantedBy=timers.target
```

Enable and start:

```bash
sudo systemctl enable news-crawler.timer
sudo systemctl start news-crawler.timer
```

### Using Docker

Create `Dockerfile`:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY crawler_service.py .

CMD ["python", "crawler_service.py"]
```

Build and run:

```bash
docker build -t news-crawler .
docker run --env-file ../.env news-crawler
```

## Configuration

Environment variables:

- `MONGODB_URI` - MongoDB connection string (default: `mongodb://localhost:27017/`)

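A minimal sketch of how the service can read this variable, assuming `python-dotenv` and `pymongo` are listed in `requirements.txt`:

```python
import os

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()  # picks up MONGODB_URI from a local .env file, if present
mongo_uri = os.getenv("MONGODB_URI", "mongodb://localhost:27017/")
client = MongoClient(mongo_uri)
```
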
## Rate Limiting

- 1-second delay between article requests
- Respects server resources
- User-Agent header included

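In practice this amounts to something like the following wrapper around each article request (the User-Agent string below is illustrative):

```python
import time

import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; NewsCrawler/1.0)"}  # illustrative value

def polite_get(url: str) -> str:
    """Fetch a page with a custom User-Agent, then pause before the next request."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    time.sleep(1)  # 1-second delay between article requests
    return response.text
```
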
## Troubleshooting

**Issue: Can't extract content**
- Some sites block scrapers
- Try adjusting the User-Agent header
- Some sites require JavaScript to render content (consider Selenium)

**Issue: Timeout errors**
- Increase the request timeout in `extract_article_content()`
- Check network connectivity

**Issue: High memory usage**
- Reduce `max_articles_per_feed`
- Content is already capped at 10,000 characters per article

## Architecture

This is a standalone microservice that:
- Can run independently of the main backend
- Shares the same MongoDB database
- Can be deployed separately
- Can be scheduled independently

## Next Steps

Once articles are crawled, you can:
- Use Ollama to summarize articles
- Perform sentiment analysis
- Extract keywords and topics
- Generate newsletter content
- Create article recommendations

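As a starting point for the first item, here is a rough sketch of summarizing a crawled article through Ollama's local HTTP API (the model name and prompt are placeholders; use whichever model you have pulled):

```python
import requests

def summarize(article_text: str, model: str = "llama3") -> str:
    """Ask a locally running Ollama instance for a short summary of one article."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,  # placeholder -- e.g. pull it first with `ollama pull llama3`
            "prompt": "Summarize this news article in 3 sentences:\n\n" + article_text[:4000],
            "stream": False,
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["response"]
```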