update

.kiro/specs/ai-article-summarization/design.md (new file, 487 lines)

# Design Document - AI Article Summarization

## Overview

This design integrates Ollama AI into the news crawler workflow to automatically generate concise summaries of articles. The system will extract full article content, send it to Ollama for summarization, and store both the original content and the AI-generated summary in MongoDB.

## Architecture

### High-Level Flow

```
RSS Feed → Extract Content → Summarize with Ollama → Store in MongoDB
                 ↓                     ↓                     ↓
          Full Article Text   AI Summary (≤150 words)   Both Stored
```

### Component Diagram

```
┌─────────────────────────────────────────────────────────────┐
│                    News Crawler Service                     │
│                                                             │
│   ┌────────────────┐       ┌──────────────────┐             │
│   │   RSS Parser   │──────→│ Content Extractor│             │
│   └────────────────┘       └──────────────────┘             │
│                                     │                       │
│                                     ↓                       │
│                           ┌──────────────────┐              │
│                           │  Ollama Client   │              │
│                           │ (New Component)  │              │
│                           └──────────────────┘              │
│                                     │                       │
│                                     ↓                       │
│                           ┌──────────────────┐              │
│                           │ Database Writer  │              │
│                           └──────────────────┘              │
└─────────────────────────────────────────────────────────────┘
                                      │
                                      ↓
                            ┌──────────────────┐
                            │  Ollama Server   │
                            │    (External)    │
                            └──────────────────┘
                                      │
                                      ↓
                            ┌──────────────────┐
                            │     MongoDB      │
                            └──────────────────┘
```

## Components and Interfaces

### 1. Ollama Client Module

**File:** `news_crawler/ollama_client.py`

**Purpose:** Handle communication with the Ollama server for summarization

**Interface:**

```python
class OllamaClient:
    def __init__(self, base_url, model, api_key=None, enabled=True):
        """Initialize Ollama client with configuration"""

    def summarize_article(self, content: str, max_words: int = 150) -> dict:
        """
        Summarize article content using Ollama

        Args:
            content: Full article text
            max_words: Maximum words in summary (default 150)

        Returns:
            {
                'summary': str,        # AI-generated summary
                'word_count': int,     # Summary word count
                'success': bool,       # Whether summarization succeeded
                'error': str or None,  # Error message if failed
                'duration': float      # Time taken in seconds
            }
        """

    def is_available(self) -> bool:
        """Check if Ollama server is reachable"""

    def test_connection(self) -> dict:
        """Test connection and return server info"""
```

**Key Methods:**

1. **summarize_article()**
   - Constructs the prompt for Ollama
   - Sends an HTTP POST request
   - Handles timeouts and errors
   - Validates the response
   - Returns a structured result

2. **is_available()**
   - Quick health check
   - Returns True/False
   - Used before attempting summarization

3. **test_connection()**
   - Detailed connection test
   - Returns server info and model list
   - Used for diagnostics

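The interface above can be sketched as follows. This is a minimal illustration under stated assumptions, not the final implementation: it assumes Ollama's standard `/api/generate` and `/api/tags` endpoints, uses the `requests` library listed in the dependencies, and the `timeout` default mirrors the 30-second limit from the requirements.

```python
import time
import requests  # HTTP client listed under Dependencies

class OllamaClient:
    def __init__(self, base_url, model, api_key=None, enabled=True, timeout=30):
        self.base_url = (base_url or '').rstrip('/')
        self.model = model
        self.api_key = api_key
        self.enabled = enabled
        self.timeout = timeout

    def summarize_article(self, content: str, max_words: int = 150) -> dict:
        start = time.time()
        # Skip the network entirely when disabled or there is nothing to summarize
        if not self.enabled or not content:
            return {'summary': None, 'word_count': 0, 'success': False,
                    'error': 'Ollama disabled or empty content', 'duration': 0.0}
        prompt = (f"Summarize the following article in {max_words} words or less. "
                  f"Focus on the key points and main message:\n\n{content}")
        headers = {'Authorization': f'Bearer {self.api_key}'} if self.api_key else {}
        try:
            resp = requests.post(
                f'{self.base_url}/api/generate',
                json={'model': self.model, 'prompt': prompt, 'stream': False},
                headers=headers, timeout=self.timeout)
            resp.raise_for_status()
            summary = (resp.json().get('response') or '').strip()
            if not summary:
                return {'summary': None, 'word_count': 0, 'success': False,
                        'error': 'empty_summary', 'duration': time.time() - start}
            return {'summary': summary, 'word_count': len(summary.split()),
                    'success': True, 'error': None, 'duration': time.time() - start}
        except requests.RequestException as exc:
            # Covers connection errors, timeouts, and HTTP error statuses
            return {'summary': None, 'word_count': 0, 'success': False,
                    'error': str(exc), 'duration': time.time() - start}

    def is_available(self) -> bool:
        try:
            return requests.get(f'{self.base_url}/api/tags', timeout=5).ok
        except requests.RequestException:
            return False
```

Note that the disabled path returns a structured failure rather than raising, which lets the crawler treat "Ollama off" and "Ollama failed" uniformly.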
### 2. Enhanced Crawler Service

**File:** `news_crawler/crawler_service.py`

**Changes:**

```python
# Add Ollama client initialization
from ollama_client import OllamaClient

# Initialize at module level
ollama_client = OllamaClient(
    base_url=os.getenv('OLLAMA_BASE_URL'),
    model=os.getenv('OLLAMA_MODEL'),
    api_key=os.getenv('OLLAMA_API_KEY'),
    enabled=os.getenv('OLLAMA_ENABLED', 'false').lower() == 'true'
)

# Modify crawl_rss_feed() to include summarization
def crawl_rss_feed(feed_url, feed_name, max_articles=10):
    # ... existing code ...

    # After extracting content
    article_data = extract_article_content(article_url)

    # NEW: Summarize with Ollama
    summary_result = None
    if ollama_client.enabled and article_data.get('content'):
        print("  🤖 Summarizing with AI...")
        summary_result = ollama_client.summarize_article(
            article_data['content'],
            max_words=150
        )

        if summary_result['success']:
            print(f"  ✓ Summary generated ({summary_result['word_count']} words)")
        else:
            print(f"  ⚠ Summarization failed: {summary_result['error']}")

    # Build article document with summary
    article_doc = {
        'title': article_data.get('title'),
        'author': article_data.get('author'),
        'link': article_url,
        'content': article_data.get('content'),
        'summary': summary_result['summary'] if summary_result and summary_result['success'] else None,
        'word_count': article_data.get('word_count'),
        'summary_word_count': summary_result['word_count'] if summary_result and summary_result['success'] else None,
        'source': feed_name,
        'published_at': extract_published_date(entry),
        'crawled_at': article_data.get('crawled_at'),
        'summarized_at': datetime.utcnow() if summary_result and summary_result['success'] else None,
        'created_at': datetime.utcnow()
    }
```

### 3. Configuration Module

**File:** `news_crawler/config.py` (new file)

**Purpose:** Centralize configuration management

```python
import os

from dotenv import load_dotenv

load_dotenv(dotenv_path='../.env')


class Config:
    # MongoDB
    MONGODB_URI = os.getenv('MONGODB_URI', 'mongodb://localhost:27017/')
    DB_NAME = 'munich_news'

    # Ollama
    OLLAMA_BASE_URL = os.getenv('OLLAMA_BASE_URL', 'http://localhost:11434')
    OLLAMA_MODEL = os.getenv('OLLAMA_MODEL', 'phi3:latest')
    OLLAMA_API_KEY = os.getenv('OLLAMA_API_KEY', '')
    OLLAMA_ENABLED = os.getenv('OLLAMA_ENABLED', 'false').lower() == 'true'
    OLLAMA_TIMEOUT = int(os.getenv('OLLAMA_TIMEOUT', '30'))

    # Crawler
    RATE_LIMIT_DELAY = 1        # seconds between requests
    MAX_CONTENT_LENGTH = 50000  # characters
```

## Data Models

### Updated Article Schema

```javascript
{
  _id: ObjectId,
  title: String,
  author: String,
  link: String,               // Unique index
  content: String,            // Full article content
  summary: String,            // AI-generated summary (≤150 words)
  word_count: Number,         // Original content word count
  summary_word_count: Number, // Summary word count
  source: String,
  published_at: String,
  crawled_at: DateTime,
  summarized_at: DateTime,    // When AI summary was generated
  created_at: DateTime
}
```

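Because `link` carries a unique index, storage is naturally an upsert. The sketch below only builds the filter/update pair (a pure function we can show without a live database); the helper name and the choice to never overwrite an existing summary with `None` are illustrative assumptions, with the actual pymongo call indicated in a comment.

```python
def build_upsert(article_doc: dict):
    """Build the (filter, update) pair for an upsert keyed on the unique link field."""
    filter_ = {'link': article_doc['link']}
    # Don't clobber a previously stored summary with None when this crawl
    # produced no summary (mirrors the "skip re-summarization" requirement)
    summary_fields = ('summary', 'summary_word_count', 'summarized_at')
    update_fields = {k: v for k, v in article_doc.items()
                     if v is not None or k not in summary_fields}
    return filter_, {'$set': update_fields}

# With pymongo this would be applied roughly as:
# collection.update_one(*build_upsert(article_doc), upsert=True)
```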
### Ollama Request Format

```json
{
  "model": "phi3:latest",
  "prompt": "Summarize the following article in 150 words or less. Focus on the key points and main message:\n\n[ARTICLE CONTENT]",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "num_predict": 200
  }
}
```

### Ollama Response Format

```json
{
  "model": "phi3:latest",
  "created_at": "2024-11-10T16:30:00Z",
  "response": "The AI-generated summary text here...",
  "done": true,
  "total_duration": 5000000000
}
```

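A small helper can map this raw response into the client's result shape; note that Ollama reports `total_duration` in nanoseconds. The function name is illustrative, not code from the repository.

```python
def parse_ollama_response(raw: dict) -> dict:
    """Convert a raw Ollama /api/generate response into the client's result dict."""
    summary = (raw.get('response') or '').strip()
    return {
        'summary': summary or None,
        'word_count': len(summary.split()),
        'success': bool(summary) and raw.get('done', False),
        'error': None if summary else 'empty_summary',
        # total_duration is reported in nanoseconds
        'duration': raw.get('total_duration', 0) / 1e9,
    }
```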
## Error Handling

### Error Scenarios and Responses

| Scenario | Handling | User Impact |
|----------|----------|-------------|
| Ollama server down | Log warning, store original content | Article saved without summary |
| Ollama timeout (>30s) | Cancel request, store original | Article saved without summary |
| Empty summary returned | Log error, store original | Article saved without summary |
| Invalid response format | Log error, store original | Article saved without summary |
| Network error | Retry once, then store original | Article saved without summary |
| Model not found | Log error, disable Ollama | All articles saved without summaries |

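The "retry once" behavior for network errors can be sketched as a small wrapper around any summarization call that returns the standard result dict; the helper name and the one-second delay are illustrative assumptions, not part of the design.

```python
import time

def with_one_retry(func, *args, retry_delay=1.0, **kwargs):
    """Call func; if it reports failure (success=False), retry exactly once."""
    result = func(*args, **kwargs)
    if not result.get('success'):
        time.sleep(retry_delay)  # brief backoff before the single retry
        result = func(*args, **kwargs)
    return result
```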
### Error Logging Format

```python
{
    'timestamp': datetime.utcnow(),
    'article_url': article_url,
    'error_type': 'timeout|connection|invalid_response|empty_summary',
    'error_message': str(error),
    'ollama_config': {
        'base_url': OLLAMA_BASE_URL,
        'model': OLLAMA_MODEL,
        'enabled': OLLAMA_ENABLED
    }
}
```

## Testing Strategy

### Unit Tests

1. **test_ollama_client.py**
   - Test summarization with mock responses
   - Test timeout handling
   - Test error scenarios
   - Test connection checking

2. **test_crawler_with_ollama.py**
   - Test crawler with Ollama enabled
   - Test crawler with Ollama disabled
   - Test fallback when Ollama fails
   - Test rate limiting

### Integration Tests

1. **test_end_to_end.py**
   - Crawl a real RSS feed
   - Summarize with a real Ollama instance
   - Verify database storage
   - Check all fields populated

### Manual Testing

1. Test with Ollama enabled and working
2. Test with Ollama disabled
3. Test with Ollama unreachable
4. Test with slow Ollama responses
5. Test with various article lengths

## Performance Considerations

### Timing Estimates

- Article extraction: 2-5 seconds
- Ollama summarization: 5-15 seconds (depends on article length and model)
- Database write: <1 second
- **Total per article: 8-21 seconds**

### Optimization Strategies

1. **Sequential Processing**
   - Process one article at a time
   - Prevents overwhelming Ollama
   - Easier to debug

2. **Timeout Management**
   - 30-second timeout per request
   - Prevents hanging on slow responses

3. **Rate Limiting**
   - 1-second delay between articles
   - Respects server resources

4. **Future: Batch Processing**
   - Queue articles for summarization
   - Process in batches
   - Use Celery for async processing

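Sequential processing with rate limiting reduces to a simple loop; `summarize` and the article list below are placeholders, and the delay constant mirrors the crawler's `RATE_LIMIT_DELAY`.

```python
import time

RATE_LIMIT_DELAY = 1  # seconds between Ollama calls

def process_articles(articles, summarize, delay=RATE_LIMIT_DELAY):
    """Summarize articles one at a time, pausing between Ollama calls."""
    results = []
    for i, article in enumerate(articles):
        results.append(summarize(article))
        if i < len(articles) - 1:  # no need to sleep after the last article
            time.sleep(delay)
    return results
```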
### Resource Usage

- **Memory**: ~100MB per crawler instance
- **Network**: ~1-5KB per article (to Ollama)
- **Storage**: +150 words per article (~1KB)
- **CPU**: Minimal (Ollama does the heavy lifting)

## Security Considerations

1. **API Key Storage**
   - Store in environment variables
   - Never commit to git
   - Use secrets management in production

2. **Content Sanitization**
   - Don't log full article content
   - Sanitize URLs in logs
   - Limit error message detail

3. **Network Security**
   - Support HTTPS for Ollama
   - Validate SSL certificates
   - Use secure connections

4. **Rate Limiting**
   - Prevent abuse of the Ollama server
   - Implement backoff on errors
   - Monitor usage patterns

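URL sanitization for logs can be as simple as dropping the query string and fragment, which is where tokens and tracking parameters usually live. The helper name is an assumption; only the stdlib is needed.

```python
from urllib.parse import urlsplit, urlunsplit

def sanitize_url(url: str) -> str:
    """Strip query string and fragment before a URL goes into a log line."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))
```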
## Deployment Considerations

### Environment Variables

```bash
# Required
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_ENABLED=true

# Optional
OLLAMA_API_KEY=your-api-key
OLLAMA_TIMEOUT=30
```

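Startup validation of the required variables keeps misconfiguration failures early and obvious. This is a sketch; the function name is an assumption, and validation is skipped entirely when summarization is disabled.

```python
import os

REQUIRED_VARS = ('OLLAMA_BASE_URL', 'OLLAMA_MODEL')

def validate_ollama_config(env=os.environ) -> list:
    """Return the names of required Ollama variables that are missing or empty."""
    enabled = env.get('OLLAMA_ENABLED', 'false').lower() == 'true'
    if not enabled:
        return []  # nothing to validate when summarization is off
    return [name for name in REQUIRED_VARS if not env.get(name)]
```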
### Docker Deployment

```yaml
# docker-compose.yml
services:
  crawler:
    build: ./news_crawler
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - OLLAMA_ENABLED=true
    depends_on:
      - ollama
      - mongodb

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

# Named volumes must be declared at the top level
volumes:
  ollama_data:
```

### Monitoring

1. **Metrics to Track**
   - Summarization success rate
   - Average summarization time
   - Ollama server uptime
   - Error frequency by type

2. **Logging**
   - Log all summarization attempts
   - Log errors with context
   - Log performance metrics

3. **Alerts**
   - Alert if Ollama is down >5 minutes
   - Alert if success rate <80%
   - Alert if average time >20 seconds

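The success-rate and timing metrics above could be accumulated with a small tracker fed by each summarization result dict; the class name is illustrative.

```python
class SummarizationStats:
    """Accumulates per-article results for the end-of-run report."""
    def __init__(self):
        self.succeeded = 0
        self.failed = 0
        self.total_duration = 0.0

    def record(self, result: dict):
        if result.get('success'):
            self.succeeded += 1
        else:
            self.failed += 1
        self.total_duration += result.get('duration', 0.0)

    @property
    def success_rate(self) -> float:
        total = self.succeeded + self.failed
        return self.succeeded / total if total else 0.0

    @property
    def average_duration(self) -> float:
        total = self.succeeded + self.failed
        return self.total_duration / total if total else 0.0
```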
## Migration Plan

### Phase 1: Add Ollama Client (Week 1)
- Create ollama_client.py
- Add configuration
- Write unit tests
- Test with sample articles

### Phase 2: Integrate with Crawler (Week 1)
- Modify crawler_service.py
- Add summarization step
- Update database schema
- Test end-to-end

### Phase 3: Update Backend API (Week 2)
- Update news routes
- Add summary fields to responses
- Update frontend to display summaries
- Deploy to production

### Phase 4: Monitor and Optimize (Ongoing)
- Monitor performance
- Tune prompts for better summaries
- Optimize rate limiting
- Add batch processing if needed

## Rollback Plan

If issues arise:

1. **Immediate**: Set `OLLAMA_ENABLED=false`
2. **Short-term**: Revert crawler code changes
3. **Long-term**: Remove Ollama integration

The system will continue to work with original content if Ollama is disabled.

## Success Metrics

- ✅ 95%+ of articles successfully summarized
- ✅ Average summarization time <15 seconds
- ✅ Zero data loss (all articles stored even if summarization fails)
- ✅ Ollama uptime >99%
- ✅ Summary quality: readable and accurate (manual review)

## Future Enhancements

1. **Multi-language Support**
   - Detect article language
   - Use an appropriate model
   - Translate summaries

2. **Custom Summary Lengths**
   - Allow configuration per feed
   - Support different lengths for different use cases

3. **Sentiment Analysis**
   - Add a sentiment score
   - Categorize as positive/negative/neutral

4. **Keyword Extraction**
   - Extract key topics
   - Enable better search

5. **Batch Processing**
   - Queue articles
   - Process in parallel
   - Use Celery for async processing

6. **Caching**
   - Cache summaries
   - Avoid re-processing
   - Use Redis for the cache

---

.kiro/specs/ai-article-summarization/requirements.md (new file, 164 lines)

# Requirements Document

## Introduction

This feature integrates Ollama AI into the news crawler to automatically summarize articles before storing them in the database. Alongside the full article content, the system stores a concise AI-generated summary of at most 150 words, making the content more digestible for newsletter readers.

## Glossary

- **Crawler Service**: The standalone microservice that fetches and processes article content from RSS feeds
- **Ollama Server**: The AI inference server that provides text summarization capabilities
- **Article Content**: The full text extracted from a news article webpage
- **Summary**: A concise AI-generated version of the article content (max 150 words)
- **MongoDB**: The database where articles and summaries are stored

## Requirements

### Requirement 1: Ollama Integration in Crawler

**User Story:** As a system administrator, I want the crawler to use Ollama for summarization, so that articles are automatically condensed before storage.

#### Acceptance Criteria

1. WHEN the crawler extracts article content, THE Crawler Service SHALL send the content to the Ollama Server for summarization
2. WHEN sending content to Ollama, THE Crawler Service SHALL include a prompt requesting a summary of 150 words or less
3. WHEN Ollama returns a summary, THE Crawler Service SHALL validate that the summary is not empty
4. IF the Ollama Server is unavailable, THEN THE Crawler Service SHALL store the original content without summarization and log a warning
5. WHEN summarization fails, THE Crawler Service SHALL continue processing other articles without stopping

### Requirement 2: Configuration Management

**User Story:** As a system administrator, I want to configure Ollama settings, so that I can control the summarization behavior.

#### Acceptance Criteria

1. THE Crawler Service SHALL read Ollama configuration from environment variables
2. THE Crawler Service SHALL support the following configuration options:
   - OLLAMA_BASE_URL (server URL)
   - OLLAMA_MODEL (model name)
   - OLLAMA_ENABLED (enable/disable flag)
   - OLLAMA_API_KEY (optional authentication)
3. WHERE OLLAMA_ENABLED is false, THE Crawler Service SHALL store original content without summarization
4. WHERE OLLAMA_ENABLED is true AND Ollama is unreachable, THE Crawler Service SHALL log an error and store original content

### Requirement 3: Summary Storage

**User Story:** As a developer, I want summaries stored in the database, so that the frontend can display concise article previews.

#### Acceptance Criteria

1. WHEN a summary is generated, THE Crawler Service SHALL store it in the `summary` field in MongoDB
2. WHEN storing an article, THE Crawler Service SHALL include both the original content and the AI summary
3. THE Crawler Service SHALL store the following fields:
   - `content` (original full text)
   - `summary` (AI-generated, max 150 words)
   - `word_count` (original content word count)
   - `summary_word_count` (summary word count)
   - `summarized_at` (timestamp when summarized)
4. WHEN an article already has a summary, THE Crawler Service SHALL not re-summarize it

### Requirement 4: Error Handling and Resilience

**User Story:** As a system administrator, I want the crawler to handle AI failures gracefully, so that the system remains reliable.

#### Acceptance Criteria

1. IF Ollama returns an error, THEN THE Crawler Service SHALL log the error and store the original content
2. IF Ollama times out (>30 seconds), THEN THE Crawler Service SHALL cancel the request and store the original content
3. IF the summary is empty or invalid, THEN THE Crawler Service SHALL store the original content
4. WHEN an error occurs, THE Crawler Service SHALL include an error indicator in the database record
5. THE Crawler Service SHALL continue processing remaining articles after any summarization failure

### Requirement 5: Performance and Rate Limiting

**User Story:** As a system administrator, I want the crawler to respect rate limits, so that it doesn't overwhelm the Ollama server.

#### Acceptance Criteria

1. THE Crawler Service SHALL wait at least 1 second between Ollama API calls
2. THE Crawler Service SHALL set a timeout of 30 seconds for each Ollama request
3. WHEN processing multiple articles, THE Crawler Service SHALL process them sequentially to avoid overloading Ollama
4. THE Crawler Service SHALL log the time taken for each summarization
5. THE Crawler Service SHALL display progress indicators showing summarization status

### Requirement 6: Monitoring and Logging

**User Story:** As a system administrator, I want detailed logs of summarization activity, so that I can monitor and troubleshoot the system.

#### Acceptance Criteria

1. THE Crawler Service SHALL log when summarization starts for each article
2. THE Crawler Service SHALL log the original word count and summary word count
3. THE Crawler Service SHALL log any errors or warnings from Ollama
4. THE Crawler Service SHALL display a summary of total articles summarized at the end
5. THE Crawler Service SHALL include summarization statistics in the final report

### Requirement 7: API Endpoint Updates

**User Story:** As a frontend developer, I want API endpoints to return summaries, so that I can display them to users.

#### Acceptance Criteria

1. WHEN fetching articles via GET /api/news, THE Backend API SHALL include the `summary` field if available
2. WHEN fetching a single article via GET /api/news/<url>, THE Backend API SHALL include both `content` and `summary`
3. THE Backend API SHALL include a `has_summary` boolean field indicating if AI summarization was performed
4. THE Backend API SHALL include the `summarized_at` timestamp if available
5. WHERE no summary exists, THE Backend API SHALL return a preview of the original content (first 200 chars)

### Requirement 8: Backward Compatibility

**User Story:** As a developer, I want the system to work with existing articles, so that no data migration is required.

#### Acceptance Criteria

1. THE Crawler Service SHALL work with articles that don't have summaries
2. THE Backend API SHALL handle articles with or without summaries gracefully
3. WHERE an article has no summary, THE Backend API SHALL generate a preview from the content field
4. THE Crawler Service SHALL not re-process articles that already have summaries
5. THE system SHALL continue to function if Ollama is disabled or unavailable

## Non-Functional Requirements

### Performance
- Summarization SHALL complete within 30 seconds per article
- The crawler SHALL process at least 10 articles per minute (including summarization)
- Database operations SHALL not be significantly slower with summary storage

### Reliability
- The system SHALL maintain 99% uptime even if Ollama is unavailable
- Failed summarizations SHALL not prevent article storage
- The crawler SHALL recover from Ollama errors without manual intervention

### Security
- Ollama API keys SHALL be stored in environment variables, not in code
- Article content SHALL not be logged, to prevent sensitive data exposure
- API communication with Ollama SHALL support HTTPS

### Scalability
- The system SHALL support multiple Ollama servers for load balancing (future)
- The crawler SHALL handle articles of any length (up to 50,000 words)
- The database schema SHALL support future enhancements (tags, categories, etc.)

## Dependencies

- An Ollama server must be running and accessible
- The `requests` Python library for HTTP communication
- Environment variables properly configured
- MongoDB with sufficient storage for both content and summaries

## Assumptions

- The Ollama server is already set up and configured
- The phi3:latest model (or the configured model) supports summarization tasks
- Network connectivity between the crawler and the Ollama server is reliable
- Articles are in English, or the configured Ollama model supports the article language

## Future Enhancements

- Support for multiple languages
- Customizable summary length
- Sentiment analysis integration
- Keyword extraction
- Category classification
- Batch summarization for improved performance
- Caching of summaries to avoid re-processing

---

.kiro/specs/ai-article-summarization/tasks.md (new file, 92 lines)

# Implementation Plan

- [x] 1. Create Ollama client module
  - Create `news_crawler/ollama_client.py` with the OllamaClient class
  - Implement `summarize_article()` method with prompt construction and API call
  - Implement `is_available()` method for health checks
  - Implement `test_connection()` method for diagnostics
  - Add timeout handling (30 seconds)
  - Add error handling for connection, timeout, and invalid responses
  - _Requirements: 1.1, 1.2, 1.3, 1.4, 1.5, 4.1, 4.2, 4.3, 5.2_

- [x] 2. Create configuration module for crawler
  - Create `news_crawler/config.py` with Config class
  - Load environment variables (OLLAMA_BASE_URL, OLLAMA_MODEL, OLLAMA_ENABLED, OLLAMA_API_KEY, OLLAMA_TIMEOUT)
  - Add validation for required configuration
  - Add default values for optional configuration
  - _Requirements: 2.1, 2.2, 2.3, 2.4_

- [x] 3. Integrate Ollama client into crawler service
  - Import OllamaClient in `news_crawler/crawler_service.py`
  - Initialize Ollama client at module level using Config
  - Modify `crawl_rss_feed()` to call summarization after content extraction
  - Add conditional logic to skip summarization if OLLAMA_ENABLED is false
  - Add error handling to continue processing if summarization fails
  - Add logging for summarization start, success, and failure
  - Add rate limiting delay after summarization
  - _Requirements: 1.1, 1.2, 1.3, 1.4, 1.5, 2.3, 2.4, 4.1, 4.5, 5.1, 5.3, 6.1, 6.2, 6.3_

- [x] 4. Update database schema and storage
  - Modify article document structure in `crawl_rss_feed()` to include:
    - `summary` field (AI-generated summary)
    - `summary_word_count` field
    - `summarized_at` field (timestamp)
  - Update MongoDB upsert logic to handle new fields
  - Add check to skip re-summarization if article already has a summary
  - _Requirements: 3.1, 3.2, 3.3, 3.4, 8.4_

- [x] 5. Update backend API to return summaries
  - Modify `backend/routes/news_routes.py` GET /api/news endpoint
  - Add `summary`, `summary_word_count`, `summarized_at` fields to response
  - Add `has_summary` boolean field to indicate if AI summarization was performed
  - Modify GET /api/news/<url> endpoint to include summary fields
  - Add fallback to content preview if no summary exists
  - _Requirements: 7.1, 7.2, 7.3, 7.4, 7.5, 8.1, 8.2, 8.3_

- [x] 6. Update database schema documentation
  - Update `backend/DATABASE_SCHEMA.md` with new summary fields
  - Add example document showing summary fields
  - Document the summarization workflow
  - _Requirements: 3.1, 3.2, 3.3_

- [x] 7. Add environment variable configuration
  - Update `backend/env.template` with Ollama configuration
  - Add comments explaining each Ollama setting
  - Document default values
  - _Requirements: 2.1, 2.2_

- [x] 8. Create test script for Ollama integration
  - Create `news_crawler/test_ollama.py` to test the Ollama connection
  - Test summarization with a sample article
  - Test error handling (timeout, connection failure)
  - Display configuration and connection status
  - _Requirements: 1.1, 1.2, 1.3, 1.4, 2.1, 2.2, 4.1, 4.2_

- [x] 9. Update crawler statistics and logging
  - Add summarization statistics to the final report in `crawl_all_feeds()`
  - Track total articles summarized vs failed
  - Log average summarization time
  - Display progress indicators during summarization
  - _Requirements: 5.4, 6.1, 6.2, 6.3, 6.4, 6.5_

- [x] 10. Create documentation for AI summarization
  - Create `news_crawler/AI_SUMMARIZATION.md` explaining the feature
  - Document configuration options
  - Provide a troubleshooting guide
  - Add examples of usage
  - _Requirements: 2.1, 2.2, 2.3, 2.4, 6.1, 6.2, 6.3_

- [x] 11. Update main README with AI summarization info
  - Add section about the AI summarization feature
  - Document Ollama setup requirements
  - Add configuration examples
  - Update API endpoint documentation
  - _Requirements: 2.1, 2.2, 7.1, 7.2_

- [x] 12. Test end-to-end workflow
  - Run crawler with Ollama enabled
  - Verify articles are summarized correctly
  - Check database contains all expected fields
  - Test API endpoints return summaries
  - Verify error handling when Ollama is disabled/unavailable
  - _Requirements: 1.1, 1.2, 1.3, 1.4, 1.5, 3.1, 3.2, 3.3, 3.4, 4.1, 4.2, 4.3, 4.4, 4.5, 7.1, 7.2, 7.3, 7.4, 7.5, 8.1, 8.2, 8.3, 8.4, 8.5_