June 19, 2022 7:00 PM PDT
This document summarizes a mock system design interview focused on creating a distributed message queue. The discussion covered functional and non-functional requirements, system design components, and various technical considerations necessary for implementing a reliable and fault-tolerant messaging system.
Interview Details
- Topic: Distributed Message Queue
- Target Level: L5 (Senior)
- Duration: 45 minutes
- Drawing Tool Used: Jamboard
Requirements
Functional Requirements
- Distributed message queue across regions.
- Messages must be delivered in the order they were sent (e.g., A, B, C).
- Support for multiple topics.
- Consumers will pull messages.
- Messages should maintain the same order within a partition.
- Support for lower priority messages.
- Ensure "once and only once" (exactly-once) delivery semantics.
Non-Functional Requirements
- Most important: reliability and fault tolerance.
- Scalability is less critical than reliability.
- Smaller message sizes are preferred.
- The system does not need to be as scalable as Kafka.
System Design
External APIs
- Producer: post(message, ID)
- Consumer: pull(ID)
Key Components
- Acknowledgment: Use ack=3 to ensure messages are reliably persisted by three replicas.
- Metadata Service: Similar to a ZooKeeper service, it supports leader/follower election and records the leader's identity.
- Ordering: The leader handles message ordering, sequencing incoming calls by the leader replica's clock.
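The acknowledgment and ordering components above can be sketched roughly as follows. This is a minimal single-process illustration, not the design presented in the interview: the `Replica` and `Leader` classes are hypothetical, the replica list is assumed to include the leader's own storage, and a monotonic counter stands in for the leader's clock as the ordering source.

```python
import itertools
from dataclasses import dataclass, field

REQUIRED_ACKS = 3  # mirrors ack=3 from the notes


@dataclass
class Replica:
    """Hypothetical storage node that persists (sequence, message) pairs."""
    name: str
    healthy: bool = True
    log: list = field(default_factory=list)

    def append(self, seq: int, message: str) -> bool:
        """Persist one entry and return True as the acknowledgment."""
        if not self.healthy:
            return False
        self.log.append((seq, message))
        return True


class Leader:
    """Assigns a total order and confirms a post only after enough acks."""

    def __init__(self, replicas: list):
        self.replicas = replicas
        # Assumption: a counter replaces the leader's clock as the order source.
        self._seq = itertools.count(1)

    def post(self, message: str) -> int:
        seq = next(self._seq)
        acks = sum(1 for r in self.replicas if r.append(seq, message))
        if acks < REQUIRED_ACKS:
            # A real system would retry or roll back; this sketch only reports.
            raise RuntimeError(f"only {acks} of {REQUIRED_ACKS} acks for seq {seq}")
        return seq
```

Because the leader alone hands out sequence numbers, every replica sees the same order. The producer gets a sequence number back only once three replicas have stored the message.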
Message Handling
- Consumers send a consumerID and receive a list of MessageIDs.
- The API service will return a list of messages, allowing consumers to receive multiple messages at once.
- The leader decides the order of messages and communicates this to replicas.
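The pull flow described above might look like the sketch below. It is an in-memory stand-in under several assumptions: the `ApiService` class, the `max_messages` parameter, and the per-consumer cursor are all hypothetical, and the ID passed to `post` is treated as a producer-supplied message ID, which the notes do not specify.

```python
from collections import defaultdict


class ApiService:
    """Single-process stand-in for the API service (no replication)."""

    def __init__(self):
        self._log = []  # (message_id, payload) in the order the leader decided
        self._cursor = defaultdict(int)  # consumer_id -> index of next unread entry

    def post(self, message: str, message_id: str) -> None:
        self._log.append((message_id, message))

    def pull(self, consumer_id: str, max_messages: int = 10) -> list:
        """Return up to max_messages entries, in order, for this consumer."""
        start = self._cursor[consumer_id]
        batch = self._log[start:start + max_messages]
        # Simplification: the cursor advances on pull, not on acknowledgment.
        self._cursor[consumer_id] = start + len(batch)
        return batch
```

Returning a batch per call amortizes the round trip. Each consumer ID keeps its own position, so two consumers read the same ordered log independently.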
High Availability
- The metadata service should not be a single point of failure.
- Use Raft consensus or ZooKeeper with multiple nodes to ensure high availability.
- If a replica fails, the metadata service allocates another storage node to maintain the required number of replicas.
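Replica replacement, as described in the last point, could be sketched as follows. The `MetadataService` class, its spare-node pool, and the method names are illustrative assumptions. The sketch also leaves out how the metadata service itself is replicated (Raft or ZooKeeper in the notes).

```python
class MetadataService:
    """Hypothetical tracker of which storage nodes hold each partition."""

    def __init__(self, replication_factor: int, spare_nodes: list):
        self.replication_factor = replication_factor
        self.spare_nodes = list(spare_nodes)
        self.assignments = {}  # partition -> list of node names

    def assign(self, partition: str, nodes: list) -> None:
        self.assignments[partition] = list(nodes)

    def handle_node_failure(self, failed: str) -> dict:
        """Drop the failed node and refill from spares.

        Returns {partition: replacement_node} for every partition repaired.
        """
        replacements = {}
        for partition, nodes in self.assignments.items():
            if failed not in nodes:
                continue
            nodes.remove(failed)
            if len(nodes) < self.replication_factor and self.spare_nodes:
                new_node = self.spare_nodes.pop(0)
                nodes.append(new_node)
                replacements[partition] = new_node
        return replacements
```

In a real deployment the replacement node would still need to copy existing data from a healthy replica before it counts toward the acknowledgment quorum. That step is omitted here.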
Message Consumption and Acknowledgment
- The API service tracks consumed messages and updates the watermark in the metadata service.
- Consumed messages are not removed immediately: they are retained for a while to handle edge cases and deleted later to reclaim disk space.
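One way to picture the watermark and delayed removal is the sketch below. The `PartitionLog` class and its `retain_behind` parameter are hypothetical. The watermark is held locally for brevity, whereas the notes place it in the metadata service.

```python
class PartitionLog:
    """Hypothetical log with a consumption watermark and delayed cleanup."""

    def __init__(self, retain_behind: int = 2):
        self.entries = {}   # offset -> message
        self.next_offset = 0
        self.watermark = 0  # every offset below this has been acknowledged
        self.retain_behind = retain_behind  # acknowledged entries kept as a buffer

    def append(self, message: str) -> int:
        offset = self.next_offset
        self.entries[offset] = message
        self.next_offset += 1
        return offset

    def ack(self, offset: int) -> None:
        """Advance the watermark past offset; it never moves backward."""
        self.watermark = max(self.watermark, offset + 1)

    def compact(self) -> int:
        """Delete entries more than retain_behind below the watermark."""
        cutoff = self.watermark - self.retain_behind
        removed = [o for o in self.entries if o < cutoff]
        for o in removed:
            del self.entries[o]
        return len(removed)
```

Keeping a small buffer behind the watermark lets a consumer that crashed just after acknowledging re-read recent messages. Periodic compaction still bounds disk usage.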
Discussion Points
- The interviewer and interviewee discussed various approaches to maintaining order, handling failures, and ensuring message delivery guarantees.
- The interviewee emphasized the importance of fault tolerance and acknowledged the complexity of guaranteeing order with multiple consumers.
- Different strategies for implementing the queue, including in-memory and disk-based storage, were explored.
Audience Feedback
- The audience provided insights on various use cases and existing solutions like Kafka and RocketMQ.
- Discussions included the importance of timestamps, partitioning, and the role of load balancers in the system architecture.
- The audience highlighted the need for a consensus algorithm to ensure total order and the implementation of idempotent keys for producers and consumers.
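The idempotent-key idea raised by the audience can be illustrated for the producer side with a short sketch. The `DedupingQueue` class and the `idempotency_key` parameter are hypothetical. A production system would also need to persist and eventually expire the set of seen keys.

```python
class DedupingQueue:
    """Hypothetical queue that drops retried posts carrying a known key."""

    def __init__(self):
        self._seen = set()  # idempotency keys already accepted
        self._log = []      # accepted messages in arrival order

    def post(self, message: str, idempotency_key: str) -> bool:
        """Append the message once per key; return False for a duplicate."""
        if idempotency_key in self._seen:
            return False
        self._seen.add(idempotency_key)
        self._log.append(message)
        return True

    def messages(self) -> list:
        return list(self._log)
```

With this check in place, a producer that times out and retries the same post does not create a second copy. This is one building block toward the "once and only once" requirement.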
Conclusion
The mock interview effectively covered the critical aspects of designing a distributed message queue, focusing on reliability, fault tolerance, and message ordering. The discussion highlighted the complexities involved in such systems and the various trade-offs that must be considered in the design process.