# Testing RSS Feed URL Extraction
## Quick Test (Recommended)
Run this from the project root with the backend virtual environment activated:
```bash
# 1. Activate backend virtual environment
cd backend
source venv/bin/activate # On Windows: venv\Scripts\activate
# 2. Go back to project root
cd ..
# 3. Run the test
python test_feeds_quick.py
```
This will:
- ✓ Check what RSS feeds are in your database
- ✓ Fetch each feed
- ✓ Test URL extraction on the first 3 articles (see the sketch after this list)
- ✓ Show what fields are available
- ✓ Verify summary and date extraction
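The URL extraction step is the interesting part. As a rough illustration of what the test exercises, here is a minimal sketch of link extraction with `feedparser`; the function name `extract_url` and the exact fallback order are assumptions, and the project's actual utility may differ:

```python
# Minimal sketch of RSS URL extraction (illustrative only; the project's
# real utility may handle more cases). Assumes feedparser is installed.
import feedparser

def extract_url(entry):
    # Most feeds expose the article URL directly as `link`.
    if entry.get("link"):
        return entry["link"]
    # Some feeds carry a permalink in `id` (feedparser's name for `guid`).
    if entry.get("id", "").startswith(("http://", "https://")):
        return entry["id"]
    # Last resort: take the first usable href from the `links` list.
    for link in entry.get("links", []):
        if link.get("href"):
            return link["href"]
    return None

feed = feedparser.parse("https://www.sueddeutsche.de/muenchen/rss")
for entry in feed.entries[:3]:
    print(extract_url(entry))
```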
## Expected Output
```
================================================================================
RSS Feed Test - Checking Database Feeds
================================================================================
✓ Found 3 feed(s) in database
================================================================================
Feed: Süddeutsche Zeitung München
URL: https://www.sueddeutsche.de/muenchen/rss
Active: True
================================================================================
Fetching RSS feed...
✓ Found 20 entries
--- Entry 1 ---
Title: New U-Bahn Line Opens in Munich
✓ URL extracted: https://www.sueddeutsche.de/muenchen/article-123
✓ Summary: The new U-Bahn line connecting the city center...
✓ Date: Mon, 10 Nov 2024 10:00:00 +0100
--- Entry 2 ---
Title: Munich Weather Update
✓ URL extracted: https://www.sueddeutsche.de/muenchen/article-124
✓ Summary: Weather forecast for the week...
✓ Date: Mon, 10 Nov 2024 09:30:00 +0100
...
```
## If No Feeds Found
Add a feed first:
```bash
curl -X POST http://localhost:5001/api/rss-feeds \
  -H "Content-Type: application/json" \
  -d '{"name": "Süddeutsche Politik", "url": "https://rss.sueddeutsche.de/rss/Politik"}'
```
## Testing News Crawler
Once feeds are verified, test the crawler:
```bash
# 1. Install crawler dependencies
cd news_crawler
pip install -r requirements.txt
# 2. Run the test
python test_rss_feeds.py
# 3. Or run the actual crawler
python crawler_service.py 5
```
## Troubleshooting
### "No module named 'pymongo'"
- Activate the backend virtual environment first
- Or install dependencies: `pip install -r backend/requirements.txt`
### "No RSS feeds in database"
- Make sure the backend is running
- Add feeds via API (see above)
- Or check if MongoDB is running: `docker-compose ps`
### "Could not extract URL"
- The test will show available fields
- Check if the feed uses `guid`, `id`, or `links` instead of `link`
- Our utility should handle most cases automatically (see the snippet below)
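If extraction still fails, a quick way to see which fields a feed actually provides is to dump the entry keys. This is a standalone diagnostic, not part of the test script; swap in the URL of your problematic feed:

```python
# Inspect the fields a problematic feed actually provides.
import feedparser

feed = feedparser.parse("https://rss.sueddeutsche.de/rss/Politik")
for entry in feed.entries[:3]:
    print(sorted(entry.keys()))
```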
### "No entries found"
- The RSS feed URL might be invalid
- Try opening the URL in a browser
- Check if it returns valid XML (a quick programmatic check is sketched below)
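feedparser flags malformed XML via its `bozo` attribute, so a one-off check like this (reusing the Süddeutsche feed URL from above) can confirm whether the feed parses at all:

```python
# Sanity-check a feed URL: `bozo` is set when the XML is malformed.
import feedparser

feed = feedparser.parse("https://www.sueddeutsche.de/muenchen/rss")
if feed.bozo:
    print(f"Parse problem: {feed.bozo_exception}")
print(f"HTTP status: {feed.get('status')}, entries: {len(feed.entries)}")
```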
## Manual Database Check
Using mongosh:
```bash
mongosh
use munich_news
db.rss_feeds.find()
db.articles.find().limit(3)
```
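If you'd rather check from Python, here is a rough pymongo equivalent, assuming MongoDB is reachable on `localhost:27017` (the usual docker-compose default; adjust the connection string if yours differs):

```python
# Rough pymongo equivalent of the mongosh commands above.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["munich_news"]
print(f"Feeds: {db.rss_feeds.count_documents({})}")
for feed in db.rss_feeds.find():
    print(f"  {feed.get('name')}: {feed.get('url')} (active: {feed.get('active')})")
for article in db.articles.find().limit(3):
    print(f"  Article: {article.get('title')}")
```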
## What to Look For
**Good signs:**
- URLs are extracted successfully
- URLs start with `http://` or `https://`
- Summaries are present
- Dates are extracted
⚠️ **Warning signs:**
- "Could not extract URL" messages
- Empty summaries (not critical)
- Missing dates (not critical)
**Problems:**
- No entries found in feed
- All URL extractions fail
- Feed parsing errors
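If you want to script these checks, a hypothetical helper along these lines (field names assume feedparser's defaults) would flag each entry's problems:

```python
# Hypothetical per-entry health check mirroring the signs listed above.
def check_entry(entry):
    warnings = []
    url = entry.get("link")
    if not url:
        warnings.append("could not extract URL")
    elif not url.startswith(("http://", "https://")):
        warnings.append(f"suspicious URL scheme: {url}")
    if not entry.get("summary"):
        warnings.append("empty summary (not critical)")
    if not entry.get("published"):
        warnings.append("missing date (not critical)")
    return warnings
```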