# Requirements Document
## Introduction
This feature integrates Ollama AI into the news crawler to automatically summarize articles before storing them in the database. Instead of storing full article content, the system will generate concise 150-word summaries using AI, making the content more digestible for newsletter readers and reducing storage requirements.
## Glossary
- **Crawler Service**: The standalone microservice that fetches and processes article content from RSS feeds
- **Ollama Server**: The AI inference server that provides text summarization capabilities
- **Article Content**: The full text extracted from a news article webpage
- **Summary**: A concise AI-generated version of the article content (max 150 words)
- **MongoDB**: The database where articles and summaries are stored
## Requirements
### Requirement 1: Ollama Integration in Crawler
**User Story:** As a system administrator, I want the crawler to use Ollama for summarization, so that articles are automatically condensed before storage.
#### Acceptance Criteria
1. WHEN the crawler extracts article content, THE Crawler Service SHALL send the content to the Ollama Server for summarization
2. WHEN sending content to Ollama, THE Crawler Service SHALL include a prompt requesting a summary of 150 words or less
3. WHEN Ollama returns a summary, THE Crawler Service SHALL validate that the summary is not empty
4. IF the Ollama Server is unavailable, THEN THE Crawler Service SHALL store the original content without summarization and log a warning
5. WHEN summarization fails, THE Crawler Service SHALL continue processing other articles without stopping
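A minimal sketch of how criteria 1–5 might look in the Crawler Service, assuming Ollama's standard `/api/generate` endpoint; the function name, prompt wording, and logger are illustrative, not part of the spec:
```python
import logging

import requests

logger = logging.getLogger("crawler")

SUMMARY_PROMPT = "Summarize the following news article in 150 words or less:\n\n{content}"


def summarize(content: str, base_url: str, model: str, timeout: int = 30) -> str | None:
    """Return an AI summary, or None so the caller stores the original content."""
    try:
        resp = requests.post(
            f"{base_url}/api/generate",
            json={
                "model": model,
                "prompt": SUMMARY_PROMPT.format(content=content),  # criterion 2
                "stream": False,
            },
            timeout=timeout,
        )
        resp.raise_for_status()
        summary = resp.json().get("response", "").strip()
        return summary or None  # criterion 3: an empty summary counts as a failure
    except requests.RequestException as exc:
        # Criteria 4-5: warn and let the caller fall back to the original content.
        logger.warning("Ollama unavailable, storing original content: %s", exc)
        return None
```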
### Requirement 2: Configuration Management
**User Story:** As a system administrator, I want to configure Ollama settings, so that I can control the summarization behavior.
#### Acceptance Criteria
1. THE Crawler Service SHALL read Ollama configuration from environment variables
2. THE Crawler Service SHALL support the following configuration options:
- OLLAMA_BASE_URL (server URL)
- OLLAMA_MODEL (model name)
- OLLAMA_ENABLED (enable/disable flag)
- OLLAMA_API_KEY (optional authentication)
3. WHERE OLLAMA_ENABLED is false, THE Crawler Service SHALL store original content without summarization
4. WHERE OLLAMA_ENABLED is true AND Ollama is unreachable, THE Crawler Service SHALL log an error and store original content
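One way the configuration in criterion 2 could be loaded; the dataclass shape, the defaults (Ollama's standard port 11434 and the phi3:latest model noted under Assumptions), and the truthy parsing are illustrative assumptions:
```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class OllamaConfig:
    base_url: str
    model: str
    enabled: bool
    api_key: str | None

    @classmethod
    def from_env(cls) -> "OllamaConfig":
        return cls(
            base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
            model=os.getenv("OLLAMA_MODEL", "phi3:latest"),
            enabled=os.getenv("OLLAMA_ENABLED", "true").lower() == "true",
            api_key=os.getenv("OLLAMA_API_KEY"),  # optional authentication
        )
```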
### Requirement 3: Summary Storage
**User Story:** As a developer, I want summaries stored in the database, so that the frontend can display concise article previews.
#### Acceptance Criteria
1. WHEN a summary is generated, THE Crawler Service SHALL store it in the `summary` field in MongoDB
2. WHEN storing an article, THE Crawler Service SHALL include both the original content and the AI summary
3. THE Crawler Service SHALL store the following fields:
- `content` (original full text)
- `summary` (AI-generated, max 150 words)
- `word_count` (original content word count)
- `summary_word_count` (summary word count)
- `summarized_at` (timestamp when summarized)
4. WHEN an article already has a summary, THE Crawler Service SHALL NOT re-summarize it
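A sketch of the document shape these criteria describe, assuming pymongo; the field names come from the spec, while the helper names and the `url` lookup key are assumptions:
```python
from datetime import datetime, timezone

from pymongo.collection import Collection


def build_document(content: str, summary: str | None) -> dict:
    doc = {"content": content, "word_count": len(content.split())}  # criteria 2-3
    if summary:
        doc["summary"] = summary
        doc["summary_word_count"] = len(summary.split())
        doc["summarized_at"] = datetime.now(timezone.utc)  # timestamp when summarized
    return doc


def needs_summary(articles: Collection, url: str) -> bool:
    # Criterion 4: skip articles that already carry a summary.
    return articles.count_documents({"url": url, "summary": {"$exists": True}}) == 0
```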
### Requirement 4: Error Handling and Resilience
**User Story:** As a system administrator, I want the crawler to handle AI failures gracefully, so that the system remains reliable.
#### Acceptance Criteria
1. IF Ollama returns an error, THEN THE Crawler Service SHALL log the error and store the original content
2. IF Ollama times out (>30 seconds), THEN THE Crawler Service SHALL cancel the request and store the original content
3. IF the summary is empty or invalid, THEN THE Crawler Service SHALL store the original content
4. WHEN an error occurs, THE Crawler Service SHALL include an error indicator in the database record
5. THE Crawler Service SHALL continue processing remaining articles after any summarization failure
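A sketch of the fallback path in criteria 1–5; the `summarization_error` field name is an assumption, since the spec only calls for "an error indicator":
```python
import requests


def try_summarize(article: dict, call_ollama) -> dict:
    """Attach a summary when possible; on any failure keep the original
    content and record an error indicator (criteria 1-4)."""
    try:
        summary = call_ollama(article["content"])  # expected to raise on HTTP/timeout errors
    except requests.Timeout:
        article["summarization_error"] = "timeout"  # criterion 2
        return article
    except requests.RequestException as exc:
        article["summarization_error"] = f"ollama: {exc}"  # criterion 1
        return article
    if not summary or not summary.strip():
        article["summarization_error"] = "empty summary"  # criterion 3
        return article
    article["summary"] = summary.strip()
    return article
```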
### Requirement 5: Performance and Rate Limiting
**User Story:** As a system administrator, I want the crawler to respect rate limits, so that it doesn't overwhelm the Ollama server.
#### Acceptance Criteria
1. THE Crawler Service SHALL wait at least 1 second between Ollama API calls
2. THE Crawler Service SHALL set a timeout of 30 seconds for each Ollama request
3. WHEN processing multiple articles, THE Crawler Service SHALL process them sequentially to avoid overloading Ollama
4. THE Crawler Service SHALL log the time taken for each summarization
5. THE Crawler Service SHALL display progress indicators showing summarization status
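The sequential loop implied by these criteria might look like this sketch; the log format and progress counter are illustrative:
```python
import logging
import time

logger = logging.getLogger("crawler")


def summarize_all(articles: list[dict], try_summarize) -> None:
    for i, article in enumerate(articles, start=1):
        start = time.monotonic()
        try_summarize(article)  # criterion 3: strictly one article at a time
        elapsed = time.monotonic() - start
        logger.info("[%d/%d] summarized in %.1fs", i, len(articles), elapsed)  # criteria 4-5
        time.sleep(1)  # criterion 1: at least one second between Ollama calls
```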
### Requirement 6: Monitoring and Logging
**User Story:** As a system administrator, I want detailed logs of summarization activity, so that I can monitor and troubleshoot the system.
#### Acceptance Criteria
1. THE Crawler Service SHALL log when summarization starts for each article
2. THE Crawler Service SHALL log the original word count and summary word count
3. THE Crawler Service SHALL log any errors or warnings from Ollama
4. THE Crawler Service SHALL display a summary of total articles summarized at the end
5. THE Crawler Service SHALL include summarization statistics in the final report
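One possible shape for the per-article logging and the final report in these criteria; the counter keys and log wording are assumptions:
```python
import logging
from collections import Counter

logger = logging.getLogger("crawler")


def log_article(stats: Counter, content: str, summary: str | None) -> None:
    stats["processed"] += 1
    if summary:
        stats["summarized"] += 1
        # Criterion 2: original vs. summary word counts.
        logger.info("words: %d -> %d", len(content.split()), len(summary.split()))
    else:
        stats["failed"] += 1


def final_report(stats: Counter) -> None:
    # Criteria 4-5: one summary line at the end of the crawl.
    logger.info(
        "summarized %d of %d articles (%d failed)",
        stats["summarized"], stats["processed"], stats["failed"],
    )
```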
### Requirement 7: API Endpoint Updates
**User Story:** As a frontend developer, I want API endpoints to return summaries, so that I can display them to users.
#### Acceptance Criteria
1. WHEN fetching articles via GET /api/news, THE Backend API SHALL include the `summary` field if available
2. WHEN fetching a single article via GET /api/news/<url>, THE Backend API SHALL include both `content` and `summary`
3. THE Backend API SHALL include a `has_summary` boolean field indicating if AI summarization was performed
4. THE Backend API SHALL include `summarized_at` timestamp if available
5. WHERE no summary exists, THE Backend API SHALL return a preview of the original content (first 200 chars)
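A framework-agnostic sketch of the response shaping in criteria 1–5; the `title`/`url` fields and the `full` flag are assumptions, while the 200-character preview comes straight from criterion 5:
```python
def serialize(article: dict, full: bool = False) -> dict:
    has_summary = bool(article.get("summary"))
    out = {
        "title": article.get("title"),
        "url": article.get("url"),
        "has_summary": has_summary,  # criterion 3
        # Criterion 5: fall back to a 200-character preview of the content.
        "summary": article["summary"] if has_summary else (article.get("content") or "")[:200],
    }
    if article.get("summarized_at"):
        out["summarized_at"] = article["summarized_at"].isoformat()  # criterion 4
    if full:
        out["content"] = article.get("content")  # criterion 2: single-article endpoint
    return out
```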
### Requirement 8: Backward Compatibility
**User Story:** As a developer, I want the system to work with existing articles, so that no data migration is required.
#### Acceptance Criteria
1. THE Crawler Service SHALL work with articles that don't have summaries
2. THE Backend API SHALL handle articles with or without summaries gracefully
3. WHERE an article has no summary, THE Backend API SHALL generate a preview from the content field
4. THE Crawler Service SHALL NOT re-process articles that already have summaries
5. THE system SHALL continue to function if Ollama is disabled or unavailable
## Non-Functional Requirements
### Performance
- Summarization SHALL complete within 30 seconds per article
- The crawler SHALL process at least 10 articles per minute (including summarization)
- Database operations SHALL NOT be significantly slower when summaries are stored alongside the original content
### Reliability
- The system SHALL maintain 99% uptime even if Ollama is unavailable
- Failed summarizations SHALL NOT prevent article storage
- The crawler SHALL recover from Ollama errors without manual intervention
### Security
- Ollama API keys SHALL be stored in environment variables, not in code
- Article content SHALL NOT be logged, to prevent sensitive data exposure
- API communication with Ollama SHALL support HTTPS
### Scalability
- The system SHALL support multiple Ollama servers for load balancing (future)
- The crawler SHALL handle articles up to 50,000 words in length
- The database schema SHALL support future enhancements (tags, categories, etc.)
## Dependencies
- Ollama server must be running and accessible
- `requests` Python library for HTTP communication
- Environment variables properly configured
- MongoDB with sufficient storage for both content and summaries
## Assumptions
- Ollama server is already set up and configured
- The phi3:latest model (or configured model) supports summarization tasks
- Network connectivity between crawler and Ollama server is reliable
- Articles are in English or the configured Ollama model supports the article language
## Future Enhancements
- Support for multiple languages
- Customizable summary length
- Sentiment analysis integration
- Keyword extraction
- Category classification
- Batch summarization for improved performance
- Caching of summaries to avoid re-processing