May 29, 2022 7:00 PM PDT
This document summarizes a design interview focused on creating an escalation and notification system. The interview covered functional and non-functional requirements, system design, and various technical discussions regarding the architecture and workflow of the proposed system.
Requirements
Functional Requirements
- Support for a shared library/service that can serve either a single company or multiple companies (multi-tenant).
- Integration with SSO (Single Sign-On) for company onboarding.
- Users should be able to define customizable escalation rules.
- Group escalation features with customizable teams.
- Support for tickets of varying severity levels.
- Ability for different companies, users, and groups to define their own distinct rules.
- Notifications can be triggered by users or services from other monitoring systems.
Non-Functional Requirements
- Scalability: either one company with 100,000 teams and 1 million employees, or 100 companies at 1,000 transactions per second (TPS); a rough throughput estimate follows this list.
- High availability and low latency (a few seconds of delay is acceptable).
- Reliable escalation: at least one escalation must occur for every alert (at-least-once delivery).
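As a rough throughput check (an estimate not stated explicitly in the interview), reading the 1,000 TPS as an aggregate figure gives about 1,000 × 86,400 ≈ 86.4 million requests per day, or an average of roughly 10 TPS per company across 100 tenants; the aggregate notification path, rather than any single tenant, is what has to scale horizontally.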
System Design
External APIs
- Create Tickets: /api/v1/createTickets(ticketId, group)
  - Response: 200 with createTime and group.
- Update Tickets: /api/v1/updateTickets(ticketId, status)
  - Possible statuses: NonRead, Under Investigation, Pending Deployment/Fix, Resolved (see the sketch after this list).
- Create Rules: /api/v1/createRules(ruleName, group, ruleInformation)
- Update Rules: /api/v1/updateRules
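The status names and response bodies were only sketched verbally; a minimal Python rendering of the two ticket endpoints might look like the following. The handler names, the in-memory store, and any response fields beyond createTime/group/status are assumptions, and the 201 code reflects the later interviewer feedback rather than the original design.

```python
from datetime import datetime, timezone
from enum import Enum


class TicketStatus(Enum):
    NON_READ = "NonRead"
    UNDER_INVESTIGATION = "Under Investigation"
    PENDING_FIX = "Pending Deployment/Fix"
    RESOLVED = "Resolved"


# In-memory stand-in for the ticket store; the real design would use the
# NoSQL database mentioned under Design Considerations.
_tickets: dict = {}


def create_ticket(ticket_id: str, group: str) -> dict:
    """Handler sketch for /api/v1/createTickets."""
    create_time = datetime.now(timezone.utc).isoformat()
    _tickets[ticket_id] = {
        "group": group,
        "status": TicketStatus.NON_READ,
        "createTime": create_time,
    }
    # Per the interviewer feedback, 201 (Created) fits creation better than 200.
    return {"code": 201, "createTime": create_time, "group": group}


def update_ticket(ticket_id: str, status: TicketStatus) -> dict:
    """Handler sketch for /api/v1/updateTickets."""
    ticket = _tickets.get(ticket_id)
    if ticket is None:
        return {"code": 404, "error": f"unknown ticket {ticket_id}"}
    ticket["status"] = status
    return {"code": 200, "ticketId": ticket_id, "status": status.value}
```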
Design Considerations
- A NoSQL database was chosen for scalability.
- Implementation of a message queue and worker system for handling different types of notifications.
- Metadata storage includes alert rules and group information (an illustrative document model follows this list).
- User identification is managed through SSO, allowing differentiation between data from various companies.
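The metadata schema was not pinned down in the interview; as one possible reading of "alert rules and group information" partitioned per company via the SSO identity, a document model might look like the sketch below. Every field and value here is an assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EscalationStep:
    """One rung of an escalation ladder: who to notify, how, and how long to wait."""
    notify: str            # user or on-call group to contact
    channel: str           # e.g. "email", "sms", "push"
    timeout_seconds: int   # wait before escalating to the next step


@dataclass
class AlertRule:
    """Metadata document for one escalation rule, partitioned by company."""
    company_id: str        # derived from the SSO identity; isolates tenants
    rule_name: str
    group: str
    severity: str          # e.g. "SEV1".."SEV3"
    steps: List[EscalationStep] = field(default_factory=list)


# Example document as it might live in the NoSQL metadata store,
# keyed by (company_id, rule_name) so tenants never collide.
example_rule = AlertRule(
    company_id="acme-corp",
    rule_name="checkout-latency",
    group="payments-oncall",
    severity="SEV2",
    steps=[
        EscalationStep(notify="primary-oncall", channel="push", timeout_seconds=300),
        EscalationStep(notify="secondary-oncall", channel="sms", timeout_seconds=600),
        EscalationStep(notify="engineering-manager", channel="email", timeout_seconds=900),
    ],
)
```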
Workflow
- When a ticket is created, fetch the matching rules and group information from the metadata service.
- Assign the notification task to the current escalation point; if it is not acknowledged or resolved in time, escalate to the next one.
- Use a queue to manage tasks, with workers reading from the queue and verifying against the metadata service.
- If a worker receives a task it does not own, it can return the task to the queue or push it to the appropriate worker.
- ZooKeeper can be used to coordinate shard assignment across workers.
- Implement a disaster-recovery service to re-enqueue unprocessed tasks if a worker goes down (a worker-loop sketch follows this list).
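As a minimal single-process sketch of the queue/worker/ownership flow above, using Python's in-process queue.Queue in place of a real message broker and a toy hash in place of ZooKeeper-assigned shards; all names and numbers are illustrative assumptions.

```python
import queue
import threading
import time

# In-process stand-ins for the real distributed pieces (a message broker,
# the metadata service, and ZooKeeper-based shard assignment).
task_queue: "queue.Queue[dict]" = queue.Queue()
NUM_WORKERS = 2


def owns_shard(worker_id: int, group: str) -> bool:
    """Toy shard check; in the real design ZooKeeper would assign shards."""
    return hash(group) % NUM_WORKERS == worker_id


def worker_loop(worker_id: int) -> None:
    while True:
        task = task_queue.get()
        try:
            if not owns_shard(worker_id, task["group"]):
                # Not this worker's shard: return the task to the queue
                # (or push it directly to the owning worker).
                task_queue.put(task)
                continue
            # A real worker would verify the task against the metadata
            # service here and then dispatch the notification.
            print(f"worker {worker_id} escalating ticket {task['ticket_id']}")
        except Exception:
            # On failure, re-enqueue so a disaster-recovery sweep or another
            # worker can pick the task up instead of it being lost.
            task_queue.put(task)
        finally:
            task_queue.task_done()


for wid in range(NUM_WORKERS):
    threading.Thread(target=worker_loop, args=(wid,), daemon=True).start()

task_queue.put({"ticket_id": "T-123", "group": "payments-oncall"})
time.sleep(0.5)  # let the daemon workers drain the queue before exiting
```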
Message Handling
- Messages that fail to deliver (e.g., an invalid email address) can be added back to a priority queue for retry.
- When a user acknowledges a notification, the ticket can be marked "in progress" and pending escalation tasks for it are dropped (see the sketch after this list).
- Metrics are essential for scaling workers and ensuring system functionality.
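The retry and acknowledgment behavior can be sketched with a small delay queue. The constants, function names, and single-process data structures below are assumptions standing in for the real scheduler and datastore.

```python
import heapq
import itertools
import time

# Delay/priority queue of (due_time, tie_breaker, attempt, task).
retry_queue: list = []
acknowledged: set = set()      # ticket IDs whose escalations should stop
_counter = itertools.count()   # tie-breaker so heapq never compares dicts

MAX_ATTEMPTS = 3
BACKOFF_SECONDS = 30


def schedule(task: dict, attempt: int = 0, delay: float = 0.0) -> None:
    heapq.heappush(retry_queue, (time.time() + delay, next(_counter), attempt, task))


def acknowledge(ticket_id: str) -> None:
    """User ack: mark the ticket in progress so pending escalations are dropped."""
    acknowledged.add(ticket_id)


def drain_due_tasks(send) -> None:
    """Pop due tasks; drop acknowledged ones, retry failed sends with backoff."""
    now = time.time()
    while retry_queue and retry_queue[0][0] <= now:
        _, _, attempt, task = heapq.heappop(retry_queue)
        if task["ticket_id"] in acknowledged:
            continue  # already acknowledged: drop this escalation task
        delivered = send(task)
        if not delivered and attempt + 1 < MAX_ATTEMPTS:
            # Delivery failed (e.g. an invalid email): re-enqueue with backoff.
            schedule(task, attempt + 1, BACKOFF_SECONDS * (attempt + 1))


# Example: a failing send followed by a user acknowledgment.
schedule({"ticket_id": "T-123", "channel": "email", "to": "oncall@example.com"})
drain_due_tasks(send=lambda task: False)  # simulate an invalid-email failure
acknowledge("T-123")                      # later retries for T-123 will be dropped
```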
Feedback Summary
Interviewer Feedback
- The candidate was deemed a good fit for L4, borderline for L5.
- Requirement gathering and design clarity were areas for improvement.
- The metadata service was considered too monolithic and could benefit from refactoring into multiple services.
- API responses should be more accurate (e.g., creation should return 201 or 202).
Audience Feedback
- Suggestions were made regarding fault tolerance and acknowledgment processes.
- Discussion on the use of priority queues versus scheduling services.
- Consideration of database design was noted as an area that needed more focus.
Key Questions Raised
- How to manage failures in worker tasks and the necessity of a disaster recovery mechanism.
- The importance of ensuring that all services receive notifications without being overwhelmed.
- The potential need for a timed mechanism to handle unaddressed tickets.
Conclusion
The interview highlighted various aspects of designing an escalation and notification system, including functional and non-functional requirements, system architecture, and the importance of robust error handling and scalability. The feedback provided insights into areas for improvement in both design clarity and technical implementation.