Requirements Document

Introduction

This feature integrates Ollama AI into the news crawler to automatically summarize articles before storing them in the database. Alongside the full article content, the system will store a concise AI-generated summary of at most 150 words, making the content more digestible for newsletter readers and giving the frontend a lightweight preview to display.

Glossary

  • Crawler Service: The standalone microservice that fetches and processes article content from RSS feeds
  • Ollama Server: The AI inference server that provides text summarization capabilities
  • Article Content: The full text extracted from a news article webpage
  • Summary: A concise AI-generated version of the article content (max 150 words)
  • MongoDB: The database where articles and summaries are stored

Requirements

Requirement 1: Ollama Integration in Crawler

User Story: As a system administrator, I want the crawler to use Ollama for summarization, so that articles are automatically condensed before storage.

Acceptance Criteria

  1. WHEN the crawler extracts article content, THE Crawler Service SHALL send the content to the Ollama Server for summarization
  2. WHEN sending content to Ollama, THE Crawler Service SHALL include a prompt requesting a summary of 150 words or fewer
  3. WHEN Ollama returns a summary, THE Crawler Service SHALL validate that the summary is not empty
  4. IF the Ollama Server is unavailable, THEN THE Crawler Service SHALL store the original content without summarization and log a warning
  5. WHEN summarization fails, THE Crawler Service SHALL continue processing other articles without stopping
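
The criteria above could be sketched as a single helper. This is a minimal sketch, not the implementation: it assumes Ollama's /api/generate endpoint with "stream": false, the prompt wording and function names are illustrative, and the HTTP post callable is injected (e.g. requests.post in production) so the fallback path is testable without a server.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger("crawler.summarizer")

# Prompt wording is illustrative; criterion 2 only requires asking for <=150 words.
SUMMARY_PROMPT = (
    "Summarize the following article in 150 words or fewer. "
    "Return only the summary text.\n\n{content}"
)

def summarize(content: str,
              post: Callable,  # inject requests.post in production
              base_url: str = "http://localhost:11434",
              model: str = "phi3:latest") -> Optional[str]:
    """Ask Ollama for a summary; return None so the caller falls back to
    storing the original content (criteria 4 and 5)."""
    try:
        resp = post(
            f"{base_url}/api/generate",
            json={"model": model,
                  "prompt": SUMMARY_PROMPT.format(content=content),
                  "stream": False},
            timeout=30,
        )
        resp.raise_for_status()
        summary = resp.json().get("response", "").strip()
        return summary or None  # criterion 3: treat an empty summary as failure
    except Exception as exc:  # server unavailable, timeout, malformed response, ...
        log.warning("Summarization failed; storing original content: %s", exc)
        return None
```

Returning None instead of raising keeps criterion 5 trivial to satisfy: the calling loop never has to handle summarization exceptions.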

Requirement 2: Configuration Management

User Story: As a system administrator, I want to configure Ollama settings, so that I can control the summarization behavior.

Acceptance Criteria

  1. THE Crawler Service SHALL read Ollama configuration from environment variables
  2. THE Crawler Service SHALL support the following configuration options:
    • OLLAMA_BASE_URL (server URL)
    • OLLAMA_MODEL (model name)
    • OLLAMA_ENABLED (enable/disable flag)
    • OLLAMA_API_KEY (optional authentication)
  3. WHERE OLLAMA_ENABLED is false, THE Crawler Service SHALL store original content without summarization
  4. WHERE OLLAMA_ENABLED is true AND Ollama is unreachable, THE Crawler Service SHALL log an error and store original content
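
A minimal sketch of reading these options, assuming illustrative defaults (the phi3:latest default mirrors the Assumptions section; the environment mapping is injectable only so the parsing is testable):

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class OllamaConfig:
    base_url: str
    model: str
    enabled: bool
    api_key: Optional[str]

def load_ollama_config(env=os.environ) -> OllamaConfig:
    """Read the four configuration options from environment variables,
    falling back to illustrative defaults when unset."""
    return OllamaConfig(
        base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        model=env.get("OLLAMA_MODEL", "phi3:latest"),
        enabled=env.get("OLLAMA_ENABLED", "true").lower() in ("1", "true", "yes"),
        api_key=env.get("OLLAMA_API_KEY"),  # optional; None when not provided
    )
```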

Requirement 3: Summary Storage

User Story: As a developer, I want summaries stored in the database, so that the frontend can display concise article previews.

Acceptance Criteria

  1. WHEN a summary is generated, THE Crawler Service SHALL store it in the summary field in MongoDB
  2. WHEN storing an article, THE Crawler Service SHALL include both the original content and the AI summary
  3. THE Crawler Service SHALL store the following fields:
    • content (original full text)
    • summary (AI-generated, max 150 words)
    • word_count (original content word count)
    • summary_word_count (summary word count)
    • summarized_at (timestamp when summarized)
  4. WHEN an article already has a summary, THE Crawler Service SHALL NOT re-summarize it
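
The stored fields might be assembled as below; build_article_doc and needs_summary are hypothetical helper names, and recording summarized_at in UTC is an assumption the criteria do not mandate.

```python
from datetime import datetime, timezone
from typing import Optional

def build_article_doc(article: dict, summary: Optional[str]) -> dict:
    """Return the MongoDB document with the fields required by criterion 3;
    `article` is assumed to already carry the extracted `content`."""
    content = article["content"]
    doc = dict(article)
    doc["word_count"] = len(content.split())
    if summary is not None:
        doc["summary"] = summary
        doc["summary_word_count"] = len(summary.split())
        doc["summarized_at"] = datetime.now(timezone.utc)
    return doc

def needs_summary(article: dict) -> bool:
    """Criterion 4: skip articles that already have a summary."""
    return not article.get("summary")
```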

Requirement 4: Error Handling and Resilience

User Story: As a system administrator, I want the crawler to handle AI failures gracefully, so that the system remains reliable.

Acceptance Criteria

  1. IF Ollama returns an error, THEN THE Crawler Service SHALL log the error and store the original content
  2. IF Ollama times out (>30 seconds), THEN THE Crawler Service SHALL cancel the request and store the original content
  3. IF the summary is empty or invalid, THEN THE Crawler Service SHALL store the original content
  4. WHEN an error occurs, THE Crawler Service SHALL include an error indicator in the database record
  5. THE Crawler Service SHALL continue processing remaining articles after any summarization failure
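
One way to sketch criteria 3 and 4: validate the returned summary, and fall back with an indicator on failure. The summary_error field name is an assumption, since the requirement only calls for "an error indicator" without naming the field.

```python
from typing import Optional

MAX_SUMMARY_WORDS = 150

def validate_summary(summary) -> bool:
    """Criterion 3: a usable summary is a non-empty string within the limit."""
    return (isinstance(summary, str)
            and bool(summary.strip())
            and len(summary.split()) <= MAX_SUMMARY_WORDS)

def apply_summary(doc: dict, summary, error: Optional[str] = None) -> dict:
    """Attach the summary if valid; otherwise record an error indicator
    so the record still stores the original content (criteria 1-4)."""
    if error is None and validate_summary(summary):
        doc["summary"] = summary.strip()
    else:
        doc["summary_error"] = error or "invalid or empty summary"
    return doc
```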

Requirement 5: Performance and Rate Limiting

User Story: As a system administrator, I want the crawler to respect rate limits, so that it doesn't overwhelm the Ollama server.

Acceptance Criteria

  1. THE Crawler Service SHALL wait at least 1 second between Ollama API calls
  2. THE Crawler Service SHALL set a timeout of 30 seconds for each Ollama request
  3. WHEN processing multiple articles, THE Crawler Service SHALL process them sequentially to avoid overloading Ollama
  4. THE Crawler Service SHALL log the time taken for each summarization
  5. THE Crawler Service SHALL display progress indicators showing summarization status
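
Criterion 1's minimum interval could be enforced with a small limiter like the sketch below. Injecting the clock and sleep functions is a design choice, not a requirement: it keeps the timing logic testable without real delays.

```python
import time

class RateLimiter:
    """Enforce a minimum interval between Ollama API calls (criterion 1).
    Call wait() before each request; it sleeps only as long as needed."""

    def __init__(self, min_interval: float = 1.0,
                 now=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._now = now      # injectable clock for testing
        self._sleep = sleep  # injectable sleep for testing
        self._last = None

    def wait(self) -> None:
        if self._last is not None:
            elapsed = self._now() - self._last
            if elapsed < self.min_interval:
                self._sleep(self.min_interval - elapsed)
        self._last = self._now()
```

Because Requirement 5 mandates sequential processing, a single limiter instance shared by the crawl loop is enough; no locking is needed.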

Requirement 6: Monitoring and Logging

User Story: As a system administrator, I want detailed logs of summarization activity, so that I can monitor and troubleshoot the system.

Acceptance Criteria

  1. THE Crawler Service SHALL log when summarization starts for each article
  2. THE Crawler Service SHALL log the original word count and summary word count
  3. THE Crawler Service SHALL log any errors or warnings from Ollama
  4. THE Crawler Service SHALL display a summary of total articles summarized at the end
  5. THE Crawler Service SHALL include summarization statistics in the final report
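
The end-of-run statistics (criteria 4 and 5) could be accumulated in a small counter object; the class, field names, and report wording here are illustrative, not specified above.

```python
from dataclasses import dataclass

@dataclass
class SummarizationStats:
    """Counters for the final report (criteria 4 and 5)."""
    attempted: int = 0
    succeeded: int = 0
    failed: int = 0
    total_original_words: int = 0
    total_summary_words: int = 0

    def record(self, original_words: int, summary_words) -> None:
        """Record one article; summary_words is None when summarization failed."""
        self.attempted += 1
        self.total_original_words += original_words
        if summary_words is None:
            self.failed += 1
        else:
            self.succeeded += 1
            self.total_summary_words += summary_words

    def report(self) -> str:
        return (f"Summarized {self.succeeded}/{self.attempted} articles "
                f"({self.failed} failed), "
                f"{self.total_original_words} -> {self.total_summary_words} words")
```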

Requirement 7: API Endpoint Updates

User Story: As a frontend developer, I want API endpoints to return summaries, so that I can display them to users.

Acceptance Criteria

  1. WHEN fetching articles via GET /api/news, THE Backend API SHALL include the summary field if available
  2. WHEN fetching a single article via GET /api/news/, THE Backend API SHALL include both content and summary
  3. THE Backend API SHALL include a has_summary boolean field indicating if AI summarization was performed
  4. THE Backend API SHALL include summarized_at timestamp if available
  5. WHERE no summary exists, THE Backend API SHALL return a preview of the original content (first 200 characters)
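
A sketch of shaping a stored document for the API response: only the summary, has_summary, and summarized_at fields and the 200-character preview come from the criteria above; the function name and the remaining fields are illustrative.

```python
def to_api_article(doc: dict, preview_len: int = 200) -> dict:
    """Build the response body for GET /api/news items (criteria 1, 3-5)."""
    has_summary = bool(doc.get("summary"))
    return {
        "title": doc.get("title"),
        # Criterion 5: fall back to a content preview when no summary exists.
        "summary": doc["summary"] if has_summary
                   else doc.get("content", "")[:preview_len],
        "has_summary": has_summary,            # criterion 3
        "summarized_at": doc.get("summarized_at"),  # criterion 4; None if absent
    }
```

Because the fallback is computed here, articles crawled before this feature ships (Requirement 8) need no migration.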

Requirement 8: Backward Compatibility

User Story: As a developer, I want the system to work with existing articles, so that no data migration is required.

Acceptance Criteria

  1. THE Crawler Service SHALL work with articles that don't have summaries
  2. THE Backend API SHALL handle articles with or without summaries gracefully
  3. WHERE an article has no summary, THE Backend API SHALL generate a preview from the content field
  4. THE Crawler Service SHALL NOT re-process articles that already have summaries
  5. THE system SHALL continue to function if Ollama is disabled or unavailable

Non-Functional Requirements

Performance

  • Summarization SHALL complete within 30 seconds per article
  • The crawler SHALL process at least 10 articles per minute (including summarization)
  • Database operations SHALL not be significantly slower with summary storage

Reliability

  • The system SHALL maintain 99% uptime even if Ollama is unavailable
  • Failed summarizations SHALL not prevent article storage
  • The crawler SHALL recover from Ollama errors without manual intervention

Security

  • Ollama API keys SHALL be stored in environment variables, not in code
  • Article content SHALL not be logged to prevent sensitive data exposure
  • API communication with Ollama SHALL support HTTPS

Scalability

  • The system SHALL support multiple Ollama servers for load balancing (future)
  • The crawler SHALL handle articles of any length (up to 50,000 words)
  • The database schema SHALL support future enhancements (tags, categories, etc.)

Dependencies

  • Ollama server must be running and accessible
  • requests Python library for HTTP communication
  • Environment variables properly configured
  • MongoDB with sufficient storage for both content and summaries

Assumptions

  • Ollama server is already set up and configured
  • The phi3:latest model (or configured model) supports summarization tasks
  • Network connectivity between crawler and Ollama server is reliable
  • Articles are in English or the configured Ollama model supports the article language

Future Enhancements

  • Support for multiple languages
  • Customizable summary length
  • Sentiment analysis integration
  • Keyword extraction
  • Category classification
  • Batch summarization for improved performance
  • Caching of summaries to avoid re-processing