January 22, 2023 6:00 PM PST
This document summarizes the discussion and design considerations for a Job Scheduler system, focusing on functional and non-functional requirements, high-level design, and database schema.
Requirements
- User Load: 10,000 daily users
- Job Load: Each user can have 10 jobs every day
- Half are scheduled jobs (repeatable)
- Half are immediately executed jobs
- Job Management: Ability to check the status of jobs
- Job Duration: Ranges from seconds to hours
- Bonus Feature: Check job results
Functional Requirements
- Create and delete job schedules
- Query jobs by owner
- Update and delete jobs
- Query job history
- Check job status
- Job notifications (bonus)
- Implement job priorities
Non-Functional Requirements
- High availability
- High scalability
- Reliability: Ensure no missing or duplicate jobs
- Consistency: Maintain job integrity
- Performance:
- 10,000 writes per second
- 20,000 reads per second
- Storage: Estimated at 1PB for job data
High-Level Design
- Database: A "Job" box representing the database
- Discussion Points:
- SQL vs NoSQL: both can work, but NoSQL is preferred for scalability.
- Key-value store vs other NoSQL models: dynamic job information fits a flexible NoSQL document model better than a plain key-value store.
Database Schema Design
Job Table:
- Job ID (UUID)
- Created Time
- Scheduler ID
- Owner ID
- Repeatable (boolean)
- Job Status (enum: scheduled, running, complete)
- Job Result (success, failure)
- Retried (boolean)
- Max Retry
Partitioning:
- Partition Key: Scheduled Time
- Sort Key: Owner ID
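To make the schema concrete, here is a hypothetical Python rendering of the Job table as a dataclass. The field names mirror the notes above, but the types, defaults, and key comments are illustrative assumptions, not part of the recorded design:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class JobStatus(Enum):
    SCHEDULED = "scheduled"
    RUNNING = "running"
    COMPLETE = "complete"


class JobResult(Enum):
    SUCCESS = "success"
    FAILURE = "failure"


@dataclass
class Job:
    """One row in the Job table; fields follow the schema notes."""
    job_id: UUID = field(default_factory=uuid4)
    created_time: datetime = field(default_factory=datetime.utcnow)
    scheduler_id: str = ""
    owner_id: str = ""                        # sort key
    scheduled_time: Optional[datetime] = None  # partition key
    repeatable: bool = False
    status: JobStatus = JobStatus.SCHEDULED
    result: Optional[JobResult] = None
    retried: bool = False
    max_retry: int = 3                        # assumed default
```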
Job Processing
- Worker timing: scheduling accuracy on the order of seconds
- Scaling: Utilize sharding to manage load
- Message Queue:
- Options include Kafka, with at-most-once, at-least-once, and exactly-once delivery semantics.
- Each worker is responsible for one partition to avoid database bottlenecks.
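The one-worker-per-partition idea can be sketched with a stable hash that routes each job to a fixed partition, so the same job always lands on the same worker. The routing function and partition count below are assumptions for illustration, not part of the recorded design:

```python
import hashlib

NUM_PARTITIONS = 8  # assumption: one dedicated worker per partition


def partition_for(job_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash of the job ID -> partition index, so a given job
    is always consumed by the same worker (avoiding contention)."""
    digest = hashlib.sha256(job_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```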
Priority Handling
- Implement priority in Kafka:
- Assign different numbers of workers to each priority level.
- Note that this rudimentary implementation can leave workers idle when their priority level has little traffic.
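One way to read "different numbers of workers per priority" is a proportional split of a fixed worker pool. The weights and the largest-remainder allocation below are an illustrative sketch, not the implementation discussed:

```python
def workers_per_priority(total_workers: int, weights: dict) -> dict:
    """Split a worker pool across priorities proportionally to weights,
    using the largest-remainder method so every worker is assigned."""
    total_weight = sum(weights.values())
    # Integer shares first.
    shares = {p: total_workers * w // total_weight for p, w in weights.items()}
    remainder = total_workers - sum(shares.values())
    # Hand leftover workers to the largest fractional remainders.
    by_remainder = sorted(
        weights,
        key=lambda p: (total_workers * weights[p]) % total_weight,
        reverse=True,
    )
    for p in by_remainder[:remainder]:
        shares[p] += 1
    return shares
```

A drawback, matching the idle-worker concern above: a statically assigned high-priority pool sits idle when no high-priority jobs arrive.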
Redundancy and Failover
- The database tier can use a master/replica configuration for redundancy.
- With multiple masters, a controller structure coordinates them and manages failover.
- Kafka tolerates broker failures through its built-in partition replication.
Audience Feedback
- Overall performance was reasonable, but there were suggestions for improvement:
- More detail needed on priority implementation.
- Clarification on message queue design and execution handling.
- Job schema design should be streamlined.
Interviewee Self-Assessment
- Acknowledged time constraints during the discussion.
- Suggested using a daemon for job definition updates.
- Discussed handling job failures and the importance of retry mechanisms.
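The retry discussion maps onto the `Max Retry` field in the schema. A minimal sketch of a retry loop with optional exponential backoff (function names and return shape are assumed for illustration):

```python
import time


def run_with_retries(job_fn, max_retry: int = 3, backoff_s: float = 0.0):
    """Run job_fn, retrying up to max_retry additional times on exception.
    Returns ("success", value) or ("failure", last_exception)."""
    last_exc = None
    for attempt in range(max_retry + 1):
        try:
            return "success", job_fn()
        except Exception as exc:
            last_exc = exc
            if backoff_s:
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return "failure", last_exc
```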
Key Questions Raised
- How to handle job visibility and status checks?
- What are the implications of using UUIDs as partition keys?
- How to balance load across shards and ensure efficient querying?
Additional Considerations
- The importance of understanding the data types used in partition keys.
- The need for a clear design on job history and metadata management.
- Potential use of delayed message queues for job execution timing.
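The delayed-message-queue idea can be sketched in-process with a min-heap keyed on scheduled time: jobs become visible only once their time has passed. This is a stand-in for a real delayed queue, not the design's actual queue:

```python
import heapq


class DelayedQueue:
    """Minimal in-process delayed queue: jobs are released only once
    their scheduled time has arrived."""

    def __init__(self):
        self._heap = []  # entries are (scheduled_time, job)

    def put(self, scheduled_time, job):
        heapq.heappush(self._heap, (scheduled_time, job))

    def pop_due(self, now):
        """Pop all jobs with scheduled_time <= now, in time order."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due
```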