April 3, 2022 7:00 PM PDT
The meeting focused on the design and implementation of a distributed log collection system. The discussion covered functional requirements, system design trade-offs, and the technologies involved in log processing, including log agents, Kafka, Spark, and Elasticsearch. The goal was to establish a central system for collecting and searching logs generated by various services across multiple servers.
Functional Requirements
- Centralized log collection and search system.
- Log files generated by multiple servers need to be collected for troubleshooting.
- Logs are saved for approximately 30 days, with archiving for older data.
- Data loss and latency are tolerable to some extent.
- The system must be scalable to handle logs from all microservices within the organization.
System Design Considerations
Log Collection Model
- Pull Model
- Pros: Central server can operate at its own speed.
- Cons: Security concerns from opening ports on machines; maintenance of metadata for log sources.
- Push Model
- Pros: Easier management.
- Cons: Agents must agree (e.g., via shared configuration or a consensus protocol) on where log files are sent.
The design assumes a push model:
- Install a log agent on each node.
- Log agent sends logs to a Kafka cluster.
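The agent-side behavior described later (compress, then send to Kafka) can be sketched with the standard library. This is a minimal illustration, assuming the agent batches lines up to a size limit and gzip-compresses each batch before handing it to a producer; the function names and batch limit are hypothetical, and the actual Kafka producer call is omitted.

```python
import gzip

def build_batches(log_lines, max_batch_bytes=1_000_000):
    """Group log lines into batches no larger than max_batch_bytes (hypothetical agent step)."""
    batch, size = [], 0
    for line in log_lines:
        encoded = line.encode("utf-8")
        if size + len(encoded) > max_batch_bytes and batch:
            yield batch
            batch, size = [], 0
        batch.append(line)
        size += len(encoded)
    if batch:
        yield batch

def compress_batch(batch):
    """Gzip-compress a batch; the compressed bytes would then go to the Kafka cluster."""
    return gzip.compress("\n".join(batch).encode("utf-8"))
```

Batching amortizes per-request overhead, and compression cuts network cost, which matters at the log volumes a whole organization produces.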
Quota Management
- Quotas for each service should be configurable.
- Options for enforcing quotas:
- Leverage Kafka's quota system.
- Implement a rate limiter service for log agents.
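If the rate-limiter-service option were chosen, a common building block is a per-service token bucket. The sketch below is an assumption about how such a limiter might work, not a description of Kafka's built-in quota system; the class and parameter names are hypothetical.

```python
import time

class TokenBucket:
    """One bucket per service; the rate limiter would check this before the agent sends."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec      # tokens replenished per second (the service's quota)
        self.capacity = capacity      # burst allowance
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1):
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Making `rate_per_sec` and `capacity` configurable per service satisfies the configurable-quota requirement above.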
Log Processing Flow
- Log Generation:
- Log agent checks with the rate limiter.
- Compresses and sends logs to Kafka.
- Spark real-time processing service reads logs from Kafka and sends them to Elasticsearch.
- Log Search:
- User requests log search via a UI.
- Load balancer forwards the request to the search service.
- Search service queries Elasticsearch.
- Archiving:
- An archive service periodically retrieves old logs from Elasticsearch and saves them to storage (e.g., Amazon S3).
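The archiving step above can be sketched as a partition over stored documents: anything older than the 30-day retention window moves out of the hot store toward archival storage such as S3. The document shape and function name here are assumptions for illustration; the actual Elasticsearch scroll and S3 upload calls are omitted.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # logs kept hot for ~30 days, per the requirements

def split_for_archive(docs, now):
    """Partition log documents: keep recent ones in Elasticsearch, archive the rest."""
    keep, archive = [], []
    for doc in docs:
        (archive if now - doc["timestamp"] > RETENTION else keep).append(doc)
    return keep, archive
```

Running this periodically (and deleting archived documents from the index) keeps the searchable index bounded while satisfying the retention requirement.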
Benefits of Log Agents
- Centralized maintenance reduces complexity.
- Allows for upgrades and new features without impacting multiple teams.
- Decouples logging from application failures.
Handling Log Failures
- If a virtual machine fails, the agent can record an offset tracking how much of the log file has already been sent to Kafka, so shipping resumes from that point on recovery rather than re-sending or losing data.
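The offset idea can be sketched as a checkpoint file next to the log: the agent persists the byte offset it has shipped and seeks to it after a restart. File names and the `send` callback are hypothetical stand-ins for the real agent and Kafka producer.

```python
def read_checkpoint(path):
    """Return the byte offset already shipped to Kafka, or 0 on a fresh start."""
    try:
        with open(path) as f:
            return int(f.read().strip() or 0)
    except FileNotFoundError:
        return 0

def ship_new_bytes(log_path, checkpoint_path, send):
    """Send only the bytes after the recorded offset, then persist the new offset."""
    offset = read_checkpoint(checkpoint_path)
    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()
    if data:
        send(data)  # stand-in for the Kafka producer call
        with open(checkpoint_path, "w") as f:
            f.write(str(offset + len(data)))
```

Note this writes the checkpoint only after `send` succeeds, so a crash mid-send causes a re-send (at-least-once) rather than data loss, which matches the stated tolerance for some duplication and latency.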
Technology Choices
- Kafka: Preferred for its native sharding support, allowing for high-volume log management.
- Spark: Widely used for real-time processing, though alternatives like a custom Java component could be considered.
- Elasticsearch: Used for complex queries and log aggregation.
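As one concrete example of the "complex queries" Elasticsearch enables, a search-service request might combine a full-text match with filters on service and time range using the standard bool query. The field names (`message`, `service`, `@timestamp`) are assumptions about the log schema, not something specified in the discussion.

```python
def build_search_query(service, text, start, end):
    """Build a hypothetical Elasticsearch query body for the search service."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"message": text}}],       # full-text relevance
                "filter": [                                    # exact, cacheable filters
                    {"term": {"service": service}},
                    {"range": {"@timestamp": {"gte": start, "lte": end}}},
                ],
            }
        },
        "sort": [{"@timestamp": "desc"}],  # newest logs first
    }
```

Putting the exact-match conditions in `filter` rather than `must` lets Elasticsearch skip scoring and cache them, which helps at troubleshooting-scale query volume.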
Metrics Collection
- Separate Kafka cluster for metrics collection.
- Time series database for storing and querying metrics.
- Visualization through Grafana.
Compliance and Data Retention
- S3 supports automatic expiration; compliance requirements can be managed by adjusting retention policies.
Feedback and Areas for Improvement
Interviewer Feedback
- Strengths: Thorough requirement gathering and fluency in discussion.
- Areas for Improvement:
- Avoid jumping to specific solutions too quickly.
- Gather more detailed requirements, such as log size.
Self-Feedback
- Should have clarified log size assumptions.
- Need to improve familiarity with Elasticsearch.
Audience Feedback
- Clarification needed on log types (application vs. machine logs).
- Consideration of scalability and data delivery guarantees.
Conclusion
The meeting concluded with a comprehensive discussion on the architecture and implementation of a distributed log collection system, addressing both technical and operational aspects. Future discussions may focus on refining the design and addressing compliance requirements in more detail.