# News Crawler Microservice

A standalone microservice that crawls full article content from RSS feeds and stores it in MongoDB.

## Features

- 🔍 Extracts full article content from RSS feed links
- 📊 Calculates word count
- 🔄 Avoids re-crawling already processed articles
- ⏱️ Rate limiting (1 second delay between requests)
- 🎯 Smart content extraction using multiple selectors
- 🧹 Cleans up scripts, styles, and navigation elements

## Installation

1. Create a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Configure environment variables:

   Create a `.env` file in the project root (or use the backend's `.env`):

   ```env
   MONGODB_URI=mongodb://localhost:27017/
   ```

## Usage

### Standalone Execution

Run the crawler directly:

```bash
# Crawl up to 10 articles per feed (default)
python crawler_service.py

# Crawl up to 20 articles per feed
python crawler_service.py 20
```

### As a Module

```python
from crawler_service import crawl_all_feeds, crawl_rss_feed

# Crawl all active feeds
result = crawl_all_feeds(max_articles_per_feed=10)
print(result)

# Crawl a specific feed
crawl_rss_feed(
    feed_url='https://example.com/rss',
    feed_name='Example News',
    max_articles=10
)
```

### Via Backend API

The backend has integrated endpoints:

```bash
# Start crawler
curl -X POST http://localhost:5001/api/crawler/start

# Check status
curl http://localhost:5001/api/crawler/status

# Crawl specific feed
curl -X POST http://localhost:5001/api/crawler/feed/<feed_id>
```

## How It Works

1. **Fetch RSS Feeds**: Gets all active RSS feeds from MongoDB
2. **Parse Feed**: Extracts article links from each feed
3. **Crawl Content**: For each article:
   - Fetches the HTML page
   - Removes scripts, styles, and navigation
   - Extracts main content using smart selectors
   - Calculates word count
4. **Store Data**: Saves to MongoDB with metadata
5. **Skip Duplicates**: Avoids re-crawling articles with existing content (see the sketch after this list)
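For orientation, here is a minimal sketch of steps 1–5 as a single loop. It is illustrative rather than a copy of `crawler_service.py`: the database name (`news`), the collection names (`rss_feeds`, `articles`), and the feed-document fields (`url`, `name`, `active`) are assumptions, and the extraction step is reduced to a plain `get_text()` call (the selector-based extraction is described in the next section).

```python
# Illustrative driver loop for steps 1-5; names flagged below are assumptions.
import os
import time
from datetime import datetime, timezone

import feedparser
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

client = MongoClient(os.getenv("MONGODB_URI", "mongodb://localhost:27017/"))
db = client["news"]  # assumed database name


def crawl_once(max_articles_per_feed: int = 10) -> None:
    # 1. Fetch RSS feeds (collection and field names are assumptions).
    for feed in db.rss_feeds.find({"active": True}):
        parsed = feedparser.parse(feed["url"])  # 2. Parse feed
        for entry in parsed.entries[:max_articles_per_feed]:
            link = entry.get("link")
            if not link:
                continue
            # 5. Skip duplicates: articles that already have full_content.
            if db.articles.find_one({"link": link,
                                     "full_content": {"$exists": True}}):
                continue
            # 3. Crawl content -- simplified to plain get_text() here.
            resp = requests.get(link, headers={"User-Agent": "Mozilla/5.0"},
                                timeout=10)
            soup = BeautifulSoup(resp.text, "html.parser")
            text = soup.get_text(separator=" ", strip=True)[:10000]
            # 4. Store data with metadata, keyed on the unique article link.
            db.articles.update_one(
                {"link": link},
                {"$set": {
                    "title": entry.get("title", ""),
                    "summary": entry.get("summary", ""),
                    "source": feed.get("name", ""),
                    "full_content": text,
                    "word_count": len(text.split()),
                    "crawled_at": datetime.now(timezone.utc),
                }},
                upsert=True,
            )
            time.sleep(1)  # rate limiting: one request per second


if __name__ == "__main__":
    crawl_once()
```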
## Content Extraction Strategy

The crawler tries multiple selectors in order (a sketch follows this list):

1. `<article>` tag
2. Elements with class containing "article-content", "article-body"
3. Elements with class containing "post-content", "entry-content"
4. `<main>` tag
5. Fallback to all `<p>` tags in the body
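A sketch of this fallback chain using `requests` and BeautifulSoup is shown below; the function name, regexes, and cleanup list are illustrative and may not match `crawler_service.py` exactly.

```python
# Illustrative selector-fallback extraction; names and regexes are assumptions.
import re

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; NewsCrawler/1.0)"}


def extract_text(url: str, timeout: int = 10) -> str:
    resp = requests.get(url, headers=HEADERS, timeout=timeout)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # Clean up scripts, styles, and navigation before extracting text.
    for tag in soup(["script", "style", "nav"]):
        tag.decompose()

    # Try the selectors in priority order.
    candidates = [
        soup.find("article"),
        soup.find(class_=re.compile(r"article-(content|body)")),
        soup.find(class_=re.compile(r"(post|entry)-content")),
        soup.find("main"),
    ]
    node = next((c for c in candidates if c is not None), None)

    if node is not None:
        text = node.get_text(separator=" ", strip=True)
    else:
        # Fallback: join the text of all <p> tags in the body.
        body = soup.body or soup
        text = " ".join(p.get_text(strip=True) for p in body.find_all("p"))

    return text[:10000]  # stored content is capped at 10,000 characters
```

The word count stored in the database is then simply `len(text.split())` on the returned string.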

## Database Schema

Articles are stored with these fields:

```javascript
{
  title: String,          // Article title
  link: String,           // Article URL (unique)
  summary: String,        // Short summary
  full_content: String,   // Full article text (max 10,000 chars)
  word_count: Number,     // Number of words
  source: String,         // RSS feed name
  published_at: String,   // Publication date
  crawled_at: DateTime,   // When content was crawled
  created_at: DateTime    // When added to database
}
```

## Scheduling

### Using Cron (Linux/Mac)

```bash
# Run every 6 hours
0 */6 * * * cd /path/to/news_crawler && /path/to/venv/bin/python crawler_service.py
```

### Using systemd Timer (Linux)

Create `/etc/systemd/system/news-crawler.service`:

```ini
[Unit]
Description=News Crawler Service

[Service]
Type=oneshot
WorkingDirectory=/path/to/news_crawler
ExecStart=/path/to/venv/bin/python crawler_service.py
User=your-user
```

Create `/etc/systemd/system/news-crawler.timer`:

```ini
[Unit]
Description=Run News Crawler every 6 hours

[Timer]
OnBootSec=5min
OnUnitActiveSec=6h

[Install]
WantedBy=timers.target
```

Enable and start:

```bash
sudo systemctl enable news-crawler.timer
sudo systemctl start news-crawler.timer
```

### Using Docker

Create `Dockerfile`:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY crawler_service.py .

CMD ["python", "crawler_service.py"]
```

Build and run:

```bash
docker build -t news-crawler .
docker run --env-file ../.env news-crawler
```

## Configuration

Environment variables:

- `MONGODB_URI` - MongoDB connection string (default: `mongodb://localhost:27017/`)

## Rate Limiting

- 1 second delay between article requests
- Respects server resources
- User-Agent header included

## Troubleshooting

**Issue: Can't extract content**

- Some sites block scrapers
- Try adjusting the User-Agent header
- Some sites require JavaScript (consider Selenium)

**Issue: Timeout errors**

- Increase the timeout in `extract_article_content()`
- Check network connectivity

**Issue: Memory usage**

- Reduce `max_articles_per_feed`
- Content is limited to 10,000 characters per article

## Architecture

This is a standalone microservice that:

- Can run independently of the main backend
- Shares the same MongoDB database
- Can be deployed separately
- Can be scheduled independently

## Next Steps

Once articles are crawled, you can:

- Use Ollama to summarize articles (see the sketch below)
- Perform sentiment analysis
- Extract keywords and topics
- Generate newsletter content
- Create article recommendations
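As a starting point for the first item, here is a hedged sketch that sends one crawled article to a locally running Ollama instance via its non-streaming `/api/generate` endpoint (default port 11434). The model name `llama3` and the `ai_summary` field are examples, not part of this project; the database and collection names follow the assumptions used in the earlier sketches.

```python
# Sketch: summarize one crawled article with a local Ollama instance.
# Assumes Ollama is running on localhost:11434 and the "llama3" model is pulled.
import os

import requests
from pymongo import MongoClient

db = MongoClient(os.getenv("MONGODB_URI", "mongodb://localhost:27017/"))["news"]

article = db.articles.find_one({"full_content": {"$exists": True}})
if article:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",
            "prompt": "Summarize this article in three sentences:\n\n"
                      + article["full_content"],
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    summary = resp.json()["response"]
    # "ai_summary" is a hypothetical field name for the generated summary.
    db.articles.update_one({"_id": article["_id"]},
                           {"$set": {"ai_summary": summary}})
    print(summary)
```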