October 17, 2021 7:00 PM PDT
This document summarizes the mock system design interview focused on building a job scheduler system capable of executing millions of machine learning jobs per day. The interview covered functional and non-functional requirements, system design, and discussions on scalability and architecture.
Requirements
Functional Requirements
- Handle up to 10 million jobs per day
- Support various scheduling patterns:
- Run at least once
- Cron jobs: hourly, daily, weekly, monthly
Non-Functional Requirements
- Highly available: Can schedule jobs at any time
- Highly scalable
- High durability
Requirement Clarification
- Jobs are submitted by clients.
- Output can be directed to disk or cloud storage.
- Clients may not need to see the results immediately.
- Jobs can be run multiple times; duplicate executions are acceptable because jobs are treated as idempotent.
System Design
System Design Diagram
- Utilized the single responsibility principle to define services.
Components
- Job Planner
  - Accepts new job definitions from clients.
  - Splits each job into tasks and writes records to the job and task tables.
  - Example (expanded in the sketch below):
    - scheduledStartTime = 9 AM (as epoch time)
    - jobInterval = 1 week
    - jobRecurringTime = 6 (creates 6 records in the task table)
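A minimal sketch of how the planner could expand a recurring job definition into task records, using the field names from the example above (scheduledStartTime, jobInterval, jobRecurringTime); the dataclass and plain-dict task records are illustrative stand-ins, not the actual schema from the interview.

```python
import uuid
from dataclasses import dataclass

WEEK_SECONDS = 7 * 24 * 3600


@dataclass
class JobDefinition:
    job_id: str
    task_description: str
    scheduled_start_time: int  # epoch seconds, e.g. 9 AM on some day
    job_interval: int          # seconds between runs, e.g. one week
    job_recurring_time: int    # number of runs, e.g. 6


def plan_tasks(job: JobDefinition) -> list[dict]:
    """Expand one job definition into one task record per scheduled run."""
    return [
        {
            "taskId": str(uuid.uuid4()),
            "jobId": job.job_id,
            "taskDescription": job.task_description,
            "scheduledStartTime": job.scheduled_start_time + i * job.job_interval,
            "state": "Scheduled",
        }
        for i in range(job.job_recurring_time)
    ]


# A weekly job repeated 6 times yields 6 task records, as in the example.
job = JobDefinition("job-1", "train-model", 1_700_000_000, WEEK_SECONDS, 6)
assert len(plan_tasks(job)) == 6
```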
- Job Scanner
  - Periodically picks up tasks to run.
  - API (a sketch of both signatures follows):
    - Initial design: schedule(taskDescription, scheduledStartTime)
    - Revised design: schedule(taskDescription, scheduledStartTime, jobIntervals, jobRecurringTime)
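A hedged sketch of how the revised schedule API could look as a single function, where the initial design is simply the revised call with default values; the in-memory JOB_TABLE is a placeholder for the real job table.

```python
import uuid

# In-memory stand-in for the job table; the real system would write to DynamoDB.
JOB_TABLE: dict[str, dict] = {}


def schedule(task_description: str,
             scheduled_start_time: int,
             job_intervals: int = 0,
             job_recurring_time: int = 1) -> str:
    """Revised API: one-shot jobs rely on the defaults, recurring jobs also
    pass an interval (seconds) and a repeat count. Returns the new job id."""
    job_id = str(uuid.uuid4())
    JOB_TABLE[job_id] = {
        "taskDescription": task_description,
        "scheduledStartTime": scheduled_start_time,
        "jobIntervals": job_intervals,
        "jobRecurringTime": job_recurring_time,
    }
    return job_id


# Initial-design style call (one-shot) and revised-design call (6 weekly runs).
schedule("train-model", 1_700_000_000)
schedule("train-model", 1_700_000_000, job_intervals=7 * 24 * 3600, job_recurring_time=6)
```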
Database Schema
- Chose DynamoDB for its scalability as a NoSQL database.
- Job states transition through:
  - Scheduled → Enqueued → Claimed → Processing → Succeeded/Failed
Discussions During the Interview
- Job Submission:
  - Jobs transition through the defined states.
- Task Scanning Frequency:
  - The task scanner actively queries for due tasks every 5 minutes, which is acceptable given the long execution time of ML tasks (see the polling sketch below).
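A minimal sketch of that 5-minute polling loop, assuming an in-memory TASK_TABLE stand-in and a simple linear scan; a real scanner would query the task table (for example via the GSI discussed under Failure Management) and claim tasks with conditional updates.

```python
import time

SCAN_INTERVAL_SECONDS = 5 * 60  # coarse polling is fine for long-running ML jobs

# In-memory stand-in for the task table; a real scanner would query DynamoDB.
TASK_TABLE: list[dict] = []


def fetch_due_tasks(now: int) -> list[dict]:
    """Return Scheduled tasks whose start time has already passed."""
    return [t for t in TASK_TABLE
            if t["state"] == "Scheduled" and t["scheduledStartTime"] <= now]


def run_scanner(enqueue) -> None:
    """Poll every 5 minutes and hand due tasks to the queue."""
    while True:
        for task in fetch_due_tasks(int(time.time())):
            task["state"] = "Enqueued"  # real code would use a conditional update
            enqueue(task)
        time.sleep(SCAN_INTERVAL_SECONDS)
```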
- Failure Management:
  - Addressed concerns about task scanner failure.
  - Proposed partitioning tasks by scheduled start time and using a Global Secondary Index (GSI) for efficient scanning (see the query sketch below).
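One possible shape of that GSI query using boto3; the table name, index name, key attributes, and the 5-minute time buckets are all assumptions made for illustration, not the schema agreed on in the interview.

```python
import time

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
tasks = dynamodb.Table("tasks")  # hypothetical table name

BUCKET_SECONDS = 5 * 60  # partition tasks into 5-minute scheduling buckets


def due_tasks(now: int | None = None) -> list[dict]:
    """Query the GSI for tasks in the current time bucket that are already due.
    A real scanner would also walk earlier buckets to catch overdue tasks."""
    now = now or int(time.time())
    bucket = now - (now % BUCKET_SECONDS)
    resp = tasks.query(
        IndexName="scheduledStartTimeIndex",  # hypothetical GSI name
        KeyConditionExpression=(
            Key("timeBucket").eq(bucket) & Key("scheduledStartTime").lte(now)
        ),
    )
    return resp["Items"]
```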
- Scaling the Job Scanner:
  - Multiple task scanners can be created.
  - Implement sharding based on a hash function so that multiple scanners do not pick up the same job (see the sketch below).
  - Consider using ZooKeeper for heartbeat checks to monitor scanner liveness.
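A tiny sketch of hash-based shard assignment so concurrent scanners never claim the same task; the shard count and the choice of the task id as the hash key are assumptions.

```python
import hashlib

NUM_SHARDS = 8  # assumed; in practice this would track the number of scanner instances


def shard_of(task_id: str) -> int:
    """Map a task id to a shard deterministically (md5 is stable across
    processes, unlike Python's built-in hash())."""
    return int(hashlib.md5(task_id.encode("utf-8")).hexdigest(), 16) % NUM_SHARDS


def tasks_for_scanner(tasks: list[dict], scanner_index: int) -> list[dict]:
    """Each scanner only handles the tasks that hash to its own shard."""
    return [t for t in tasks if shard_of(t["taskId"]) == scanner_index]
```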
- Handling Infinitely Repeating Tasks:
  - Schedule the next task each time the current task is executed (sketched below).
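A hedged sketch of that idea for unbounded recurring jobs: when a task starts running, the worker (or scanner) writes the record for the next occurrence. The task-record fields mirror the planner sketch above and are assumptions.

```python
import uuid


def reschedule_next(task: dict, job_interval: int, task_table: list[dict]) -> dict:
    """For an unbounded recurring job: as the current task begins executing,
    write the task record for its next occurrence so the chain never ends."""
    next_task = {
        "taskId": str(uuid.uuid4()),
        "jobId": task["jobId"],
        "taskDescription": task["taskDescription"],
        "scheduledStartTime": task["scheduledStartTime"] + job_interval,
        "state": "Scheduled",
    }
    task_table.append(next_task)  # a real worker would write this to the task table
    return next_task              # the actual ML workload runs after this point
```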
- Separation of Jobs and Tasks:
  - A job can be executed multiple times as separate tasks, which makes preparing the job's execution environment more efficient.
- Database Considerations:
  - SQL vs. NoSQL: SQL could be a valid alternative if complex queries are needed in the future, and MySQL can scale well through sharding/partitioning.
Conclusion
The design of the job scheduler system demonstrates a clear understanding of the requirements and scalability challenges. All components should support multiple instances to ensure robustness and efficiency in handling a high volume of jobs.