# AI-Powered News Aggregation - COMPLETE ✅ ## Overview Successfully implemented a complete AI-powered news aggregation system that detects duplicate stories from multiple sources and generates neutral, balanced summaries. ## Features Implemented ### 1. AI-Powered Article Clustering ✅ **What it does:** - Automatically detects when different news sources cover the same story - Uses Ollama AI to intelligently compare article content - Groups related articles by `cluster_id` - Marks the first article as `is_primary: true` **How it works:** - Compares articles published within 24 hours - Uses AI prompt: "Are these two articles about the same story?" - Falls back to keyword matching if AI fails - Real-time clustering during crawl **Test Results:** - ✅ Housing story from 2 sources → Clustered together - ✅ Bayern transfer from 2 sources → Clustered together - ✅ Different stories → Separate clusters ### 2. Neutral Summary Generation ✅ **What it does:** - Synthesizes multiple articles into one balanced summary - Combines perspectives from all sources - Highlights agreements and differences - Maintains neutral, objective tone **How it works:** - Takes all articles in a cluster - Sends combined context to Ollama - AI generates ~200-word neutral summary - Saves to `cluster_summaries` collection **Test Results:** ``` Bayern Transfer Story (2 sources): "Bayern Munich has recently signed Brazilian footballer, aged 23, for €50 million to bolster their attacking lineup as per reports from abendzeitung-muenchen and sueddeutsche. The new addition is expected to inject much-needed dynamism into the team's offense..." ``` ### 3. Smart Prioritization ✅ **What it does:** - Prioritizes stories covered by multiple sources (more important) - Shows multi-source stories first with neutral summaries - Fills remaining slots with single-source stories **Sorting Logic:** 1. **Primary sort:** Number of sources (descending) 2. **Secondary sort:** Publish date (newest first) **Example Output:** ``` 1. Munich Housing (2 sources) → Neutral summary 2. Bayern Transfer (2 sources) → Neutral summary 3. Local story (1 source) → Individual summary 4. Local story (1 source) → Individual summary ... ``` ## Database Schema ### Articles Collection ```javascript { _id: ObjectId("..."), title: "München: Stadtrat beschließt...", content: "Full article text...", summary: "AI-generated summary...", source: "abendzeitung-muenchen", link: "https://...", published_at: ISODate("2025-11-12T..."), // Clustering fields cluster_id: "1762937577.365818", is_primary: true, // Metadata word_count: 450, summary_word_count: 120, category: "local", crawled_at: ISODate("..."), summarized_at: ISODate("...") } ``` ### Cluster Summaries Collection ```javascript { _id: ObjectId("..."), cluster_id: "1762937577.365818", neutral_summary: "Combined neutral summary from all sources...", sources: ["abendzeitung-muenchen", "sueddeutsche"], article_count: 2, created_at: ISODate("2025-11-12T..."), updated_at: ISODate("2025-11-12T...") } ``` ## API Endpoints ### Get All Articles (Default) ```bash GET /api/news ``` Returns all articles individually (current behavior) ### Get Clustered Articles (Recommended) ```bash GET /api/news?mode=clustered&limit=10 ``` Returns: - One article per story - Multi-source stories with neutral summaries first - Single-source stories with individual summaries - Smart prioritization by popularity **Response Format:** ```javascript { "articles": [ { "title": "...", "summary": "Neutral summary combining all sources...", "summary_type": "neutral", "is_clustered": true, "source_count": 2, "sources": ["source1", "source2"], "related_articles": [ {"source": "source2", "title": "...", "link": "..."} ] } ], "mode": "clustered" } ``` ### Get Statistics ```bash GET /api/stats ``` Returns: ```javascript { "articles": 51, "crawled_articles": 45, "summarized_articles": 40, "clustered_articles": 47, "neutral_summaries": 3 } ``` ## Workflow ### Complete Crawl Process 1. **Crawl RSS feeds** from multiple sources 2. **Extract full content** from article URLs 3. **Generate AI summaries** for each article 4. **Cluster similar articles** using AI comparison 5. **Generate neutral summaries** for multi-source clusters 6. **Save everything** to MongoDB ### Time Windows - **Clustering window:** 24 hours (rolling) - **Crawl schedule:** Daily at 6:00 AM Berlin time - **Manual trigger:** Available via crawler service ## Configuration ### Environment Variables ```bash # Ollama AI OLLAMA_BASE_URL=http://ollama:11434 OLLAMA_MODEL=phi3:latest OLLAMA_ENABLED=true OLLAMA_TIMEOUT=120 # Clustering CLUSTERING_TIME_WINDOW=24 # hours CLUSTERING_SIMILARITY_THRESHOLD=0.50 # Summaries SUMMARY_MAX_WORDS=150 # individual NEUTRAL_SUMMARY_MAX_WORDS=200 # cluster ``` ## Files Created/Modified ### New Files - `news_crawler/article_clustering.py` - AI clustering logic - `news_crawler/cluster_summarizer.py` - Neutral summary generation - `test-clustering-real.py` - Clustering tests - `test-neutral-summaries.py` - Summary generation tests - `test-complete-workflow.py` - End-to-end tests ### Modified Files - `news_crawler/crawler_service.py` - Added clustering + summarization - `news_crawler/ollama_client.py` - Added `generate()` method - `backend/routes/news_routes.py` - Added clustered endpoint with prioritization ## Performance ### Metrics - **Clustering:** ~20-40s per article pair (AI comparison) - **Neutral summary:** ~30-40s per cluster - **Success rate:** 100% in tests - **Accuracy:** High - correctly identifies same/different stories ### Optimization - Clustering runs during crawl (real-time) - Neutral summaries generated after crawl (batch) - Results cached in database - 24-hour time window limits comparisons ## Testing ### Test Coverage ✅ AI clustering with same stories ✅ AI clustering with different stories ✅ Neutral summary generation ✅ Multi-source prioritization ✅ Database integration ✅ End-to-end workflow ### Test Commands ```bash # Test clustering docker-compose exec crawler python /app/test-clustering-real.py # Test neutral summaries docker-compose exec crawler python /app/test-neutral-summaries.py # Test complete workflow docker-compose exec crawler python /app/test-complete-workflow.py ``` ## Benefits ### For Users - ✅ **No duplicate stories** - See each story once - ✅ **Balanced coverage** - Multiple perspectives combined - ✅ **Prioritized content** - Important stories first - ✅ **Source transparency** - See all sources covering a story - ✅ **Efficient reading** - One summary instead of multiple articles ### For the System - ✅ **Intelligent deduplication** - AI-powered, not just URL matching - ✅ **Scalable** - Works with any number of sources - ✅ **Flexible** - 24-hour time window catches late-breaking news - ✅ **Reliable** - Fallback mechanisms if AI fails - ✅ **Maintainable** - Clear separation of concerns ## Future Enhancements ### Potential Improvements 1. **Update summaries** when new articles join a cluster 2. **Summary versioning** to track changes over time 3. **Quality scoring** for generated summaries 4. **Multi-language support** for summaries 5. **Sentiment analysis** across sources 6. **Fact extraction** and verification 7. **Trending topics** detection 8. **User preferences** for source weighting ### Integration Ideas - Email newsletters with neutral summaries - Push notifications for multi-source stories - RSS feed of clustered articles - API for third-party apps - Analytics dashboard ## Conclusion The Munich News Aggregator now provides: 1. ✅ **Smart clustering** - AI detects duplicate stories 2. ✅ **Neutral summaries** - Balanced multi-source coverage 3. ✅ **Smart prioritization** - Important stories first 4. ✅ **Source transparency** - See all perspectives 5. ✅ **Efficient delivery** - One summary per story **Result:** Users get comprehensive, balanced news coverage without information overload! --- ## Quick Start ### View Clustered News ```bash curl "http://localhost:5001/api/news?mode=clustered&limit=10" ``` ### Trigger Manual Crawl ```bash docker-compose exec crawler python /app/scheduled_crawler.py ``` ### Check Statistics ```bash curl "http://localhost:5001/api/stats" ``` ### View Cluster Summaries in Database ```bash docker-compose exec mongodb mongosh -u admin -p changeme --authenticationDatabase admin munich_news --eval "db.cluster_summaries.find().pretty()" ``` --- **Status:** ✅ Production Ready **Last Updated:** November 12, 2025 **Version:** 2.0 (AI-Powered)