January 22, 2023 6:00 PM PST
This document summarizes the discussion and design considerations for a Job Scheduler system, focusing on functional and non-functional requirements, high-level design, and database schema.
Requirements
- User Load: 10,000 daily users
- Job Load: Each user can have 10 jobs every day
- Half are scheduled jobs (repeatable)
- Half are immediately executed jobs
- Job Management: Ability to check the status of jobs
- Job Duration: Ranges from seconds to hours
- Bonus Feature: Check job results
Functional Requirements
- Create and delete job schedules
- Query jobs by owner
- Update and delete jobs
- Query job history
- Check job status
- Job notifications (bonus)
- Implement job priorities
Non-Functional Requirements
- High availability
- High scalability
- Reliability: Ensure no missing or duplicate jobs
- Consistency: Maintain job integrity
- Performance:
- 10,000 writes per second
- 20,000 reads per second
- Storage: Estimated at 1PB for job data
High-Level Design
- Database: A "Job" box representing the database
- Discussion Points:
- SQL vs NoSQL: both can work, but NoSQL is preferred for scalability.
- Key-value store vs other NoSQL models: dynamic job information fits a flexible NoSQL document model better than a plain key-value store.
Database Schema Design
Job Table:
- Job ID (UUID)
- Created Time
- Scheduler ID
- Owner ID
- Repeatable (boolean)
- Job Status (enum: scheduled, running, complete)
- Job Result (success, failure)
- Retried (boolean)
- Max Retry
Partitioning:
- Partition Key: Scheduled Time
- Sort Key: Owner ID
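To make the schema concrete, here is a hypothetical Python rendering of the Job table as a dataclass. The field names mirror the notes above, but the types, defaults, and key comments are illustrative assumptions, not part of the recorded design:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class JobStatus(Enum):
    SCHEDULED = "scheduled"
    RUNNING = "running"
    COMPLETE = "complete"


class JobResult(Enum):
    SUCCESS = "success"
    FAILURE = "failure"


@dataclass
class Job:
    """One row in the Job table; fields follow the schema notes."""
    job_id: UUID = field(default_factory=uuid4)
    created_time: datetime = field(default_factory=datetime.utcnow)
    scheduler_id: str = ""
    owner_id: str = ""                        # sort key
    scheduled_time: Optional[datetime] = None  # partition key
    repeatable: bool = False
    status: JobStatus = JobStatus.SCHEDULED
    result: Optional[JobResult] = None
    retried: bool = False
    max_retry: int = 3                        # assumed default
```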
Job Processing
- Worker timing: scheduling accuracy on the order of seconds
- Scaling: Utilize sharding to manage load
- Message Queue:
- Options include Kafka, with at-most-once, at-least-once, and exactly-once delivery semantics.
- Each worker is responsible for one partition to avoid database bottlenecks.
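The one-worker-per-partition idea can be sketched with a stable hash that routes each job to a fixed partition, so the same job always lands on the same worker. The routing function and partition count below are assumptions for illustration, not part of the recorded design:

```python
import hashlib

NUM_PARTITIONS = 8  # assumption: one dedicated worker per partition


def partition_for(job_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash of the job ID -> partition index, so a given job
    is always consumed by the same worker (avoiding contention)."""
    digest = hashlib.sha256(job_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```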
Priority Handling
- Implement priority in Kafka:
- Assign different numbers of workers to each priority level.
- Note that this rudimentary implementation can leave workers idle when their priority level has little traffic.
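One way to read "different numbers of workers per priority" is a proportional split of a fixed worker pool. The weights and the largest-remainder allocation below are an illustrative sketch, not the implementation discussed:

```python
def workers_per_priority(total_workers: int, weights: dict) -> dict:
    """Split a worker pool across priorities proportionally to weights,
    using the largest-remainder method so every worker is assigned."""
    total_weight = sum(weights.values())
    # Integer shares first.
    shares = {p: total_workers * w // total_weight for p, w in weights.items()}
    remainder = total_workers - sum(shares.values())
    # Hand leftover workers to the largest fractional remainders.
    by_remainder = sorted(
        weights,
        key=lambda p: (total_workers * weights[p]) % total_weight,
        reverse=True,
    )
    for p in by_remainder[:remainder]:
        shares[p] += 1
    return shares
```

A drawback, matching the idle-worker concern above: a statically assigned high-priority pool sits idle when no high-priority jobs arrive.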
Redundancy and Failover
- The database tier can use a master/replica configuration for redundancy.
- With multiple masters, a controller structure coordinates them and manages failover.
- Kafka tolerates broker failures through its built-in partition replication.
Audience Feedback
- Overall performance was reasonable, but there were suggestions for improvement:
- More detail needed on priority implementation.
- Clarification on message queue design and execution handling.
- Job schema design should be streamlined.
Interviewee Self-Assessment
- Acknowledged time constraints during the discussion.
- Suggested using a daemon for job definition updates.
- Discussed handling job failures and the importance of retry mechanisms.
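The retry discussion maps onto the `Max Retry` field in the schema. A minimal sketch of a retry loop with optional exponential backoff (function names and return shape are assumed for illustration):

```python
import time


def run_with_retries(job_fn, max_retry: int = 3, backoff_s: float = 0.0):
    """Run job_fn, retrying up to max_retry additional times on exception.
    Returns ("success", value) or ("failure", last_exception)."""
    last_exc = None
    for attempt in range(max_retry + 1):
        try:
            return "success", job_fn()
        except Exception as exc:
            last_exc = exc
            if backoff_s:
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return "failure", last_exc
```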
Key Questions Raised
- How to handle job visibility and status checks?
- What are the implications of using UUIDs as partition keys?
- How to balance load across shards and ensure efficient querying?
Additional Considerations
- The importance of understanding the data types used in partition keys.
- The need for a clear design on job history and metadata management.
- Potential use of delayed message queues for job execution timing.
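The delayed-message-queue idea can be sketched in-process with a min-heap keyed on scheduled time: jobs become visible only once their time has passed. This is a stand-in for a real delayed queue, not the design's actual queue:

```python
import heapq


class DelayedQueue:
    """Minimal in-process delayed queue: jobs are released only once
    their scheduled time has arrived."""

    def __init__(self):
        self._heap = []  # entries are (scheduled_time, job)

    def put(self, scheduled_time, job):
        heapq.heappush(self._heap, (scheduled_time, job))

    def pop_due(self, now):
        """Pop all jobs with scheduled_time <= now, in time order."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due
```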