June 4, 2023 7:00 PM PDT
The meeting focused on the design and implementation of a log collection system. Key discussions included functional and non-functional requirements, design alternatives, and specific technologies to be utilized in the system. The conversation also touched on performance considerations, data management, and compliance needs.
Time series database performance
Requirements
Functional Requirements
- Install log agent service
- Enable logs search capability
- Logs should be retained for approximately 30 days
Non-Functional Requirements
- Some data loss is tolerable during periods of high log volume
- Acceptable latency levels
- Scalability
Out of Scope
- Authentication
- Archiving / Cold storage
Design Considerations
Top-Level Alternatives
- Push Model: Install an agent to handle load on the server.
- Pull Model: Server processes logs at its own speed, but this introduces complexity and security issues.
Proposed Solution
- Push logs to Kafka.
- Default quota: 50 QPS, 100 KB (configurable).
- Internal API for updating service quota.
- Add a rate limiter service.
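A minimal sketch of how the per-service quota and rate limiter could fit together, using a token bucket. The class and method names are hypothetical, and the notes do not say what the 100 KB figure applies to; this sketch assumes it is a per-request payload cap.

```python
import time
from dataclasses import dataclass


@dataclass
class Quota:
    qps: float = 50.0             # default from the notes
    max_bytes: int = 100 * 1024   # ASSUMPTION: 100 KB is a per-request size cap


class TokenBucket:
    """Refills at `rate` tokens per second, up to `capacity` tokens."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """Hypothetical rate limiter service with an internal quota-update call."""

    def __init__(self):
        self.quotas = {}
        self.buckets = {}

    def set_quota(self, service, quota):
        # Stand-in for the internal API that updates a service quota.
        self.quotas[service] = quota
        self.buckets.pop(service, None)  # rebuild the bucket with the new rate

    def allow(self, service, payload_size, now=None):
        quota = self.quotas.get(service, Quota())
        if payload_size > quota.max_bytes:
            return False
        bucket = self.buckets.get(service)
        if bucket is None:
            bucket = TokenBucket(quota.qps, quota.qps, now)
            self.buckets[service] = bucket
        return bucket.allow(now)
```

The `now` parameter exists only so the behavior can be checked deterministically; in normal use it is omitted and the monotonic clock is used.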
Log Search Support
- Use Spark or another real-time processing service to consume from Kafka and write to a search service.
- Implement a log search UI and load balancer.
- Archive logs into S3 for improved performance.
Justification for Log Agent
- Centralized maintenance by a single team.
- Uniform programming language usage.
- Offset tracking for sent logs.
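A rough sketch of what offset tracking in the agent might look like: the agent remembers the byte offset of the last line it shipped and only advances it after a successful send, which gives at-least-once delivery. The class name, the state file format, and the `send` callable are all hypothetical; log rotation is not handled here.

```python
import json
import os


class OffsetTrackingAgent:
    """Hypothetical agent that ships new complete lines from one log file."""

    def __init__(self, log_path, state_path, send):
        self.log_path = log_path
        self.state_path = state_path
        self.send = send  # callable taking a list of lines; raises on failure

    def _load_offset(self):
        try:
            with open(self.state_path, "r", encoding="utf-8") as f:
                return int(json.load(f)["offset"])
        except (FileNotFoundError, KeyError, ValueError):
            return 0  # no usable state yet: start from the beginning

    def _save_offset(self, offset):
        # Write to a temp file and rename so a crash cannot leave a half-written state file.
        tmp = self.state_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump({"offset": offset}, f)
        os.replace(tmp, self.state_path)

    def ship_once(self, max_lines=100):
        offset = self._load_offset()
        lines = []
        with open(self.log_path, "rb") as f:
            f.seek(offset)
            while len(lines) < max_lines:
                raw = f.readline()
                if not raw.endswith(b"\n"):
                    break  # end of file, or a partially written line: retry later
                offset += len(raw)
                lines.append(raw.decode("utf-8", errors="replace").rstrip("\n"))
        if lines:
            self.send(lines)           # if this raises, the offset is not advanced
            self._save_offset(offset)
        return len(lines)
```

Because the offset is saved only after `send` returns, a crash between the two steps causes a resend rather than a loss, so the downstream consumer should tolerate duplicates.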
Technology Choices
Alternatives to Kafka
- RabbitMQ or Amazon SQS were considered, but Kafka is preferred for its sharding (topic partitioning) support.
Metrics and Structured Search
- Separate queue for metrics collection.
- Real-time processing service for metrics visualization (e.g., Grafana dashboard).
- Consider a time series database for metrics storage.
Data Deletion and Compliance
- Use S3 or Elasticsearch for data storage.
- S3 lifecycle rules support object expiration and can transition data to Glacier.
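For illustration, the 30-day retention could be expressed as an S3 lifecycle rule. The helper below only builds the configuration dictionary; the function name, rule ID, and prefix are hypothetical. The dictionary has the shape accepted by boto3's `put_bucket_lifecycle_configuration`, but no AWS call is made here.

```python
def build_lifecycle_config(prefix, expire_days=30, glacier_after_days=None):
    """Build an S3 lifecycle configuration for objects under `prefix`.

    expire_days defaults to the ~30 day retention from the requirements.
    glacier_after_days is optional, since archiving was listed as out of scope.
    """
    rule = {
        "ID": "expire-" + (prefix.strip("/") or "all"),  # hypothetical rule name
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": expire_days},
    }
    if glacier_after_days is not None:
        if glacier_after_days >= expire_days:
            raise ValueError("transition must happen before expiration")
        rule["Transitions"] = [
            {"Days": glacier_after_days, "StorageClass": "GLACIER"}
        ]
    return {"Rules": [rule]}


# Example (not executed here), assuming a boto3 S3 client and a bucket named "logs":
# s3.put_bucket_lifecycle_configuration(
#     Bucket="logs", LifecycleConfiguration=build_lifecycle_config("app/"))
```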
Kafka Considerations
- Default maximum message size is about 1 MB; large messages may cause head-of-line blocking.
- Buffering at the agent level to manage message flow.
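One possible shape for the agent-level buffer: group log lines into batches that stay well under the 1 MB message limit, and truncate any single oversized line so it cannot stall the lines behind it. The class name and the size thresholds are assumptions chosen for illustration, not values from the meeting.

```python
class BatchBuffer:
    """Hypothetical agent-side buffer that groups log lines into bounded batches."""

    def __init__(self, flush_fn, max_batch_bytes=512 * 1024, max_record_bytes=64 * 1024):
        self.flush_fn = flush_fn              # called with a list of bytes records
        self.max_batch_bytes = max_batch_bytes
        self.max_record_bytes = max_record_bytes
        self._pending = []
        self._pending_bytes = 0

    def add(self, line):
        data = line.encode("utf-8")
        if len(data) > self.max_record_bytes:
            # Truncate oversized records rather than let them block the batch.
            data = data[: self.max_record_bytes]
        if self._pending and self._pending_bytes + len(data) > self.max_batch_bytes:
            self.flush()                      # current batch is full: send it first
        self._pending.append(data)
        self._pending_bytes += len(data)

    def flush(self):
        if not self._pending:
            return
        self.flush_fn(self._pending)          # if this raises, records stay buffered
        self._pending = []
        self._pending_bytes = 0
```

A real agent would also flush on a timer so that a quiet service does not hold logs indefinitely.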
Load Balancing and Traffic Management
- Consider adding a load balancer between the agent and Kafka.
- Implement rate limiting, load shedding, and back pressure mechanisms to handle peak traffic.
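The difference between load shedding and back pressure can be sketched with a bounded in-memory queue. This is a simplified single-process illustration with hypothetical names, not the system's actual ingestion path: shedding drops new items when the queue is full, while back pressure makes the producer wait.

```python
import queue


class IngestQueue:
    """Hypothetical bounded queue sitting between producers and a slower consumer."""

    def __init__(self, maxsize):
        self._q = queue.Queue(maxsize=maxsize)
        self.dropped = 0

    def offer(self, item):
        """Load shedding: never block; drop the item if the queue is full."""
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def put_with_backpressure(self, item, timeout):
        """Back pressure: block the producer for up to `timeout` seconds."""
        try:
            self._q.put(item, timeout=timeout)
            return True
        except queue.Full:
            return False

    def take(self):
        """Consumer side: remove and return the oldest item (raises queue.Empty if none)."""
        return self._q.get_nowait()
```

Shedding fits the stated tolerance for data loss under high volume; back pressure pushes the slowdown to the agent, where buffering can absorb it.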
Elasticsearch and Time Series Databases
- Elasticsearch is optimized for text search, while time series databases are better for numerical data aggregation.
- Spark may be integrated to pre-aggregate data before sending it downstream.
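To make the pre-aggregation idea concrete, here is a plain-Python sketch (not Spark code) of the kind of roll-up such a job might perform before writing to a time series database: raw samples are grouped into fixed time windows per metric. The function name, the tuple layout, and the 60-second window are assumptions.

```python
def pre_aggregate(samples, window_seconds=60):
    """Roll up (timestamp_seconds, metric_name, value) samples per metric and window.

    Returns a dict keyed by (metric_name, window_start) with count/sum/min/max.
    """
    acc = {}
    for ts, name, value in samples:
        window_start = int(ts // window_seconds) * window_seconds
        key = (name, window_start)
        agg = acc.get(key)
        if agg is None:
            acc[key] = {"count": 1, "sum": value, "min": value, "max": value}
        else:
            agg["count"] += 1
            agg["sum"] += value
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
    return acc
```

Storing count and sum rather than an average keeps the windows mergeable, so coarser roll-ups can be derived downstream without revisiting the raw samples.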
Conclusion
The meeting concluded with a consensus on the proposed architecture and technologies to be used for the log collection system, emphasizing the importance of scalability, performance, and compliance in the design.