August 21, 2024 6:15 PM PDT
These notes summarize the discussion and technical considerations for the AI Automation Framework, focusing on the design of a distributed system for workflow management.
Requirement
- Connect to external services with authentication.
High Level Design
- API considerations.
- Database schema for workflows.
- Potential need for multiple database instances.
- Use of NoSQL database with:
  - Partition key: workflow ID.
  - Sort key: creation timestamp.
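
A rough sketch of how the partition key and sort key above shape access patterns, using an in-memory stand-in rather than any real NoSQL client. The class names, the `status` field, and the use of epoch seconds for the timestamp are assumptions made for illustration only.

```python
from bisect import insort
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(order=True)
class WorkflowRecord:
    # Sort key: creation timestamp (epoch seconds here, an assumption).
    created_at: float
    # Partition key: workflow ID. Excluded from ordering inside a partition.
    workflow_id: str = field(compare=False)
    # Hypothetical payload field, not taken from the notes.
    status: str = field(default="SCHEDULED", compare=False)


class WorkflowTable:
    """In-memory stand-in for a partitioned table with a sort key."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def put(self, record):
        # Records within one partition stay ordered by the sort key.
        insort(self._partitions[record.workflow_id], record)

    def query(self, workflow_id, since=float("-inf")):
        # Single-partition query, optionally bounded on the sort key.
        return [r for r in self._partitions.get(workflow_id, [])
                if r.created_at >= since]
```

With this layout, "all records for one workflow, oldest first" touches a single partition, while "everything created in the last hour across workflows" would need a scan or a secondary index.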
Workflow Management
- The workflow service schedules the first run of a workflow.
- Workers schedule subsequent runs.
- Considerations for crash recovery of workers.
- Workflow scheduler maintains heartbeat with workers.
- Scaling worker pool:
  - Preference for a pull model over a push model for easier maintenance.
- Heartbeat failure detection: assume worker is dead after three failed heartbeats.
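
The three-missed-heartbeats rule could be sketched roughly as follows. This assumes workers report heartbeats to the scheduler at a fixed interval; the 5-second interval, the class name, and the method names are hypothetical, and only the threshold of three comes from the notes.

```python
import time


class HeartbeatMonitor:
    """Flags a worker as dead after a number of missed heartbeats."""

    def __init__(self, interval_s=5.0, max_missed=3, clock=time.monotonic):
        self.interval_s = interval_s  # assumed heartbeat period
        self.max_missed = max_missed  # three, per the notes
        self._clock = clock           # injectable so tests can fake time
        self._last_seen = {}

    def record_heartbeat(self, worker_id):
        self._last_seen[worker_id] = self._clock()

    def missed(self, worker_id):
        # Whole heartbeat intervals elapsed since the last heartbeat.
        elapsed = self._clock() - self._last_seen[worker_id]
        return int(elapsed // self.interval_s)

    def dead_workers(self):
        return [w for w in self._last_seen
                if self.missed(w) >= self.max_missed]
```

A scheduler loop would call `dead_workers()` periodically and hand the result to whatever component updates the workflow run database.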
Worker Configuration
- Each worker contains multiple Docker containers for security.
- Workers will update their status to indicate operational state.
Database Considerations
- Choice between relational and non-relational databases:
  - Non-relational is preferred for large scale, but the chosen store must provide strong consistency.
  - Use WorkerID as a lock: conditional updates on it prevent double scheduling.
  - Options that can provide strong consistency include DynamoDB and sharded MySQL.
- Partition key considerations:
  - Workflow run vs. scheduled time.
  - Optimizing for scan frequency and secondary indexing.
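
One way to picture WorkerID acting as a lock is a conditional update that succeeds only while the run is unclaimed. The sketch below uses Python's built-in `sqlite3` purely as a local stand-in; the table name, column names, and function names are assumptions, and a real deployment on DynamoDB or sharded MySQL would express the same condition with that store's own conditional-write or transactional mechanism.

```python
import sqlite3


def create_runs_table(conn):
    # worker_id is NULL while the run is unclaimed.
    conn.execute(
        "CREATE TABLE workflow_runs ("
        " run_id TEXT PRIMARY KEY,"
        " scheduled_time INTEGER NOT NULL,"
        " worker_id TEXT)"
    )


def try_claim(conn, run_id, worker_id):
    """Claim a run only if no worker holds it. Returns True on success."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE workflow_runs SET worker_id = ? "
            "WHERE run_id = ? AND worker_id IS NULL",
            (worker_id, run_id),
        )
    # Exactly one row changes when the claim wins; zero when it loses.
    return cur.rowcount == 1
```

In a pull model, each worker would call something like `try_claim` on candidate runs and simply move on when it returns `False`, which is what prevents two workers from scheduling the same run.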
Job Management
- Worker retry mechanisms:
  - If a worker fails, the management service should update the workflow run database.
- Monitoring system required to manage job failures:
  - Manual investigation may be necessary for job reloads.
  - Automatic failover is an option but carries risks.
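
A minimal sketch of what the management service's update might look like when a worker is declared dead, combining a bounded automatic retry with a hand-off to manual investigation. The `Run` fields, the status names, and the `max_attempts` cap are all assumptions for illustration; the notes do not specify them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    run_id: str
    worker_id: Optional[str]
    status: str = "RUNNING"
    attempts: int = 1  # assumed to be incremented by a worker on each claim


def handle_worker_failure(runs, dead_worker_id, max_attempts=3):
    """Update runs owned by a dead worker; return run IDs needing review."""
    needs_review = []
    for run in runs:
        if run.worker_id != dead_worker_id or run.status != "RUNNING":
            continue
        run.worker_id = None  # release the WorkerID lock
        if run.attempts < max_attempts:
            run.status = "PENDING"  # another worker may pick it up
        else:
            run.status = "NEEDS_INVESTIGATION"
            needs_review.append(run.run_id)
    return needs_review
```

Capping automatic retries is one possible way to limit the risk of failover, for example a run executing twice when the original worker was slow rather than dead; runs past the cap are left for a person to inspect.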
Performance Metrics
- Current performance metrics indicate 317 jobs per second with 100k concurrent runs.
- Complexity of the system affects availability and performance.
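
As a sanity check on the throughput figure, and assuming the system is in steady state and that 317 jobs per second is the completion rate, Little's Law (concurrency = throughput × mean time in system) gives the mean run duration those numbers would imply. The function name is hypothetical.

```python
def implied_mean_run_seconds(throughput_per_s, concurrent_runs):
    # Little's Law: L = lambda * W, so W = L / lambda.
    return concurrent_runs / throughput_per_s


print(round(implied_mean_run_seconds(317, 100_000), 1))  # 315.5
```

Under those assumptions an average run would last a little over five minutes; if actual run durations differ substantially, the two figures are probably not describing the same steady state.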
Additional Considerations
- Discussion on whether retries can be part of the workflow definition.
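
If retries were made part of the workflow definition, the definition might carry a retry policy along these lines. This is only a sketch of the idea under discussion; the class names, field names, defaults, and the choice of exponential backoff are assumptions, not decisions recorded in the notes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 3        # total attempts, including the first
    base_delay_s: float = 1.0    # delay before the first retry
    backoff_factor: float = 2.0  # multiplier applied per retry

    def delays(self):
        """Delay before each retry (one fewer than max_attempts)."""
        return [self.base_delay_s * self.backoff_factor ** i
                for i in range(self.max_attempts - 1)]


@dataclass
class WorkflowDefinition:
    workflow_id: str
    steps: list
    # Each workflow carries its own retry behaviour.
    retry: RetryPolicy = field(default_factory=RetryPolicy)
```

Putting the policy in the definition would let each workflow choose its own retry behaviour, at the cost of workers and the management service both having to read and honour it.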