AI Automation Framework

August 21, 2024 6:15 PM PDT

These notes summarize the discussion and technical considerations for the AI Automation Framework, focusing on the design of a distributed system for workflow management.

Requirement

Connect to external services with authentication.

High Level Design

API considerations.
Database schema for workflows.
Potential need for multiple database instances.
Use of NoSQL database with:
- Partition key: workflow ID
- Sort key: creation timestamp.
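As a rough illustration of the key design above, the sketch below models a table keyed by (workflow ID, creation timestamp) in memory. The names WorkflowItem, WorkflowTable, payload, and since are hypothetical and not from the discussion; a real NoSQL store would handle partitioning and ordering itself.

```python
from bisect import insort
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(order=True)
class WorkflowItem:
    # Sort key: creation timestamp (epoch seconds); the only field used for ordering.
    created_at: int
    # Partition key: workflow ID; excluded from ordering.
    workflow_id: str = field(compare=False)
    # Hypothetical attribute bag for the rest of the item.
    payload: dict = field(compare=False, default_factory=dict)


class WorkflowTable:
    """Toy in-memory stand-in for a table keyed by (workflow_id, created_at)."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def put(self, item):
        # Items inside one partition are kept ordered by the sort key.
        insort(self._partitions[item.workflow_id], item)

    def query(self, workflow_id, since=0):
        # Single-partition read: items for one workflow, oldest first.
        items = self._partitions.get(workflow_id, [])
        return [i for i in items if i.created_at >= since]
```

With this layout, reading the history of one workflow touches a single partition and returns items already ordered by creation time.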

Workflow Management

The workflow service schedules the first run of a workflow.
Workers schedule subsequent runs.
Crash recovery of workers needs to be considered.
The workflow scheduler maintains a heartbeat with each worker.
Scaling worker pool:
- A pull model (workers fetch work) is preferred over a push model (the scheduler assigns work) for easier maintenance.
- Heartbeat failure detection: assume a worker is dead after three failed heartbeats.
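The three-failed-heartbeats rule could be sketched as follows. This is a minimal sketch under assumptions: the 5-second interval, the HeartbeatMonitor class, and its method names are hypothetical, and "failed" is interpreted as whole heartbeat intervals elapsed since the last heartbeat received.

```python
from dataclasses import dataclass, field

MAX_MISSED_HEARTBEATS = 3  # from the notes: dead after three failed heartbeats


@dataclass
class HeartbeatMonitor:
    interval_s: float = 5.0  # assumed interval; not stated in the notes
    last_seen: dict = field(default_factory=dict)

    def record(self, worker_id, now):
        # Called whenever a heartbeat arrives from a worker.
        self.last_seen[worker_id] = now

    def missed(self, worker_id, now):
        # Whole heartbeat intervals elapsed since the last heartbeat.
        return int((now - self.last_seen[worker_id]) // self.interval_s)

    def dead_workers(self, now):
        # Workers that have missed at least the allowed number of heartbeats.
        return sorted(
            w for w in self.last_seen
            if self.missed(w, now) >= MAX_MISSED_HEARTBEATS
        )
```

Passing the current time in as an argument, rather than reading a clock inside the class, keeps the rule easy to test.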

Worker Configuration

Each worker runs multiple Docker containers for security.
Workers update their status to report their operational state.

Database Considerations

Choice between relational and non-relational databases:
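- A non-relational database is preferred for large scale, but the design requires strong consistency.
- WorkerID serves as a lock: runs are claimed with conditional updates to prevent double scheduling.
- Options that can provide strong consistency include DynamoDB (conditional writes; reads are eventually consistent unless strongly consistent reads are requested) and sharded MySQL.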
Partition key considerations:
- Partitioning by workflow run vs. by scheduled time.
- Optimizing for scan frequency and use of secondary indexes.
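The WorkerID lock mentioned above amounts to a compare-and-set on the run record. The sketch below imitates it in memory; RunTable, the field names, and the status values are hypothetical, and the mutex only stands in for the per-item atomicity that the real database would provide.

```python
import threading


class RunTable:
    """Toy stand-in for a strongly consistent store with conditional writes."""

    def __init__(self):
        self._rows = {}
        self._mutex = threading.Lock()  # models the store's per-item atomicity

    def add_run(self, run_id):
        with self._mutex:
            self._rows[run_id] = {"worker_id": None, "status": "SCHEDULED"}

    def claim(self, run_id, worker_id):
        # Conditional update: succeed only if no worker owns the run yet.
        with self._mutex:
            row = self._rows[run_id]
            if row["worker_id"] is not None:
                return False  # another worker already holds the lock
            row["worker_id"] = worker_id
            row["status"] = "RUNNING"
            return True

    def owner(self, run_id):
        with self._mutex:
            return self._rows[run_id]["worker_id"]
```

In DynamoDB the same check would typically be a conditional write using a condition such as attribute_not_exists on the worker attribute; in MySQL it would be an UPDATE with a WHERE clause requiring the worker column to be NULL, followed by a check of the affected row count. This also fits the pull model: each worker tries to claim a run and proceeds only if the claim succeeds.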

Job Management

Worker retry mechanisms:
- If a worker fails, the management service should update the workflow run database.
A monitoring system is required to manage job failures:
- Reloading a job may require manual investigation.
- Automatic failover is an option but carries risks.

Performance Metrics

Current metrics: 317 jobs per second with 100k concurrent runs.
System complexity affects both availability and performance.
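As a sanity check on these two figures, Little's law (L = λW) gives the implied average run duration. This assumes steady state, that "jobs" and "runs" are the same unit, and that 317 per second is completion throughput; none of this is confirmed in the notes, and the function name is hypothetical.

```python
def implied_mean_run_seconds(concurrent_runs, completions_per_second):
    # Little's law: L = lambda * W, so W = L / lambda.
    return concurrent_runs / completions_per_second


mean_s = implied_mean_run_seconds(100_000, 317)
print(f"{mean_s:.0f} s (~{mean_s / 60:.1f} min)")  # prints "315 s (~5.3 min)"
```

Under those assumptions, an average run would last a little over five minutes.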

Additional Considerations

Discussed whether retries can be part of the workflow definition.
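If retries were made part of the workflow definition, one possible shape is a retry policy attached to each definition. Everything in this sketch is hypothetical (RetryPolicy, WorkflowDefinition, the defaults, and the exponential backoff); the notes do not settle on a design.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 3        # hypothetical default
    backoff_base_s: float = 2.0  # hypothetical default
    backoff_cap_s: float = 60.0  # hypothetical default

    def delay_before(self, attempt):
        """Seconds to wait before an attempt (1 = first run); None if exhausted."""
        if attempt > self.max_attempts:
            return None
        if attempt <= 1:
            return 0.0
        # Exponential backoff: base, 2x base, 4x base, ... up to the cap.
        return min(self.backoff_cap_s, self.backoff_base_s * 2 ** (attempt - 2))


@dataclass(frozen=True)
class WorkflowDefinition:
    workflow_id: str
    steps: tuple
    retry: RetryPolicy = field(default_factory=RetryPolicy)
```

For example, WorkflowDefinition("wf-1", ("fetch", "process")).retry.delay_before(2) returns 2.0, and delay_before(4) returns None because the default policy allows three attempts.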