February 5, 2023 7:00 PM PST
This meeting focused on the design and implementation of a data platform for a metrics system. The discussion covered various aspects including scalability, data collection methods, system architecture, and the roles of different teams in managing data. The goal was to create a robust system that can handle a high volume of data while ensuring accessibility and reliability.
Meeting Details
- Topic: Data Platform for Metrics System
- Interviewer: [Role]
- Interviewee: [Role]
- Level: L5 (Senior)
Requirements
- Support for 10,000 users with 1,000 products.
- Provide a metrics system for products.
- Granularity: Not real-time.
- Features: Alarm, dashboard, data retention.
API and Scaling Estimates
- Estimated write load: 10 million QPS.
- Need to consider scalability and availability.
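One way to sanity-check the write estimate is a back-of-envelope calculation. The sketch below assumes, purely for illustration, that each user-product pair emits one data point per second. That emission rate is not stated in the notes; it is a hypothetical figure chosen because it reproduces the 10 million QPS number.

```python
# Back-of-envelope write-load estimate.
USERS = 10_000      # from the requirements
PRODUCTS = 1_000    # from the requirements
POINTS_PER_PAIR_PER_SEC = 1  # ASSUMPTION: not stated in the notes


def estimated_write_qps(users: int, products: int, points_per_pair_per_sec: int) -> int:
    """Return writes per second if every user-product pair emits at a fixed rate."""
    return users * products * points_per_pair_per_sec


write_qps = estimated_write_qps(USERS, PRODUCTS, POINTS_PER_PAIR_PER_SEC)
print(f"{write_qps:,} writes/sec")
```

If the real emission rate is lower, the estimate scales down linearly, so this figure is best read as an upper bound under the stated assumption.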
Design Considerations
- Business Logic Ownership: Determine whether the service team or platform team owns the business logic for alarms and actions.
- Data Storage: Preference for NoSQL databases over SQL for scalability.
- Data Collection Model: Use a pull model to collect data to mitigate traffic spikes.
- Consider adding a permission service for data access management.
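The reason a pull model mitigates traffic spikes can be shown with a minimal sketch: services buffer points locally, and the collector drains them at a pace it chooses. The class name `PullCollector` and the per-cycle budget are hypothetical and were not specified in the meeting.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class PullCollector:
    """Drains per-service buffers at the collector's own pace (hypothetical sketch)."""

    def __init__(self, max_points_per_cycle: int) -> None:
        self.max_points_per_cycle = max_points_per_cycle
        self._buffers: Dict[str, Deque[float]] = {}

    def register(self, service: str) -> None:
        self._buffers[service] = deque()

    def emit(self, service: str, value: float) -> None:
        # Service side: buffer locally instead of pushing to the platform.
        self._buffers[service].append(value)

    def pull_cycle(self) -> List[Tuple[str, float]]:
        # Collector side: take at most max_points_per_cycle points per cycle.
        budget = self.max_points_per_cycle
        collected: List[Tuple[str, float]] = []
        for service, buffer in self._buffers.items():
            while buffer and budget > 0:
                collected.append((service, buffer.popleft()))
                budget -= 1
        return collected


collector = PullCollector(max_points_per_cycle=3)
collector.register("checkout")
for value in range(5):          # a burst of 5 points arrives at once
    collector.emit("checkout", float(value))
first = collector.pull_cycle()   # only 3 points reach the platform
second = collector.pull_cycle()  # the remaining 2 follow in the next cycle
```

The burst is spread over two cycles instead of hitting the central service all at once. A production version would also need fairness between services and a bound on buffer size, neither of which is modeled here.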
Optimization Strategies
- Implement a small data aggregator service to pre-aggregate data before it is collected into the central service.
- Consider using cloud solutions for easier onboarding and scaling.
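A minimal sketch of the pre-aggregation idea: raw points are collapsed into fixed time windows close to the source, so the central service receives one summary per metric per window instead of every raw point. The tuple layout `(metric, unix_ts, value)` and the 60-second window are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class Summary:
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)


def pre_aggregate(
    points: Iterable[Tuple[str, int, float]], window_seconds: int = 60
) -> Dict[Tuple[str, int], Summary]:
    """Group (metric, unix_ts, value) points into fixed windows per metric."""
    buckets: Dict[Tuple[str, int], Summary] = defaultdict(Summary)
    for metric, ts, value in points:
        # Align the timestamp to the start of its window.
        window_start = ts // window_seconds * window_seconds
        buckets[(metric, window_start)].add(value)
    return dict(buckets)


raw = [("latency_ms", 0, 10.0), ("latency_ms", 30, 20.0), ("latency_ms", 61, 5.0)]
summaries = pre_aggregate(raw)  # three raw points become two window summaries
```

Keeping count and total (rather than only an average) lets downstream services combine summaries without losing accuracy.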
Data Handling and Failures
- Use ZooKeeper to manage failures in data pulls and ensure that services can restart if they fail.
- Implement retry mechanisms for data collection service failures.
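The retry mechanism could look something like the sketch below, which uses exponential backoff. The function name `pull_with_retry`, the attempt limit, and the delays are hypothetical; the notes do not specify a retry policy. The ZooKeeper coordination is not modeled here.

```python
import time
from typing import Callable, List


def pull_with_retry(
    pull: Callable[[], List[float]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Call pull(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return pull()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and let the caller handle the failure
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the backoff schedule can be tested without waiting. Adding random jitter to the delay would help avoid many collectors retrying in lockstep.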
Data Collection Rules
- The data platform team will create templates for data collection rules, which service teams can customize.
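As a rough illustration of the template idea, the platform team could publish a default rule that service teams override field by field. The `CollectionRule` fields, default values, and allowed aggregations below are all hypothetical; the notes do not define what a rule contains.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CollectionRule:
    metric: str
    interval_seconds: int = 60   # hypothetical default
    aggregation: str = "avg"     # hypothetical default
    retention_days: int = 30     # hypothetical default


ALLOWED_AGGREGATIONS = {"avg", "sum", "min", "max"}

# Published by the platform team; service teams start from this.
PLATFORM_TEMPLATE = CollectionRule(metric="<unset>")


def customize(template: CollectionRule, **overrides) -> CollectionRule:
    """Return a copy of the template with overrides applied and validated."""
    rule = replace(template, **overrides)  # raises TypeError on unknown fields
    if rule.interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if rule.aggregation not in ALLOWED_AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {rule.aggregation}")
    return rule


checkout_rule = customize(
    PLATFORM_TEMPLATE, metric="checkout_latency_ms", aggregation="max"
)
```

Because the dataclass is frozen, a service team's customization never mutates the shared template.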
Message Queue Usage
- Consider using a message queue (MQ) for certain services, but not necessarily for all.
Data Aggregation Across Teams
- A compute service may be needed to handle data aggregation across multiple teams.
- Discussed the importance of understanding team interactions when gathering requirements.
Monitoring and Alerts
- Implement monitoring systems to track data collection and alert on anomalies.
- Consider using cloud-based solutions for faster deployment and scalability.
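A simple form of anomaly alerting is a threshold that must be breached several times in a row before an alarm fires, which filters out single noisy points. The function name and the choice of three consecutive breaches are assumptions; the notes do not specify an alerting rule.

```python
from typing import Iterable


def should_alert(values: Iterable[float], threshold: float, consecutive: int = 3) -> bool:
    """Return True if `consecutive` values in a row exceed the threshold."""
    run = 0
    for value in values:
        run = run + 1 if value > threshold else 0  # reset on any normal value
        if run >= consecutive:
            return True
    return False


sustained = should_alert([1, 5, 6, 7], threshold=4)         # three breaches in a row
interrupted = should_alert([5, 6, 1, 7, 8], threshold=4)    # run is broken by the 1
```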
Data Storage and Processing
- Discussed the use of various data storage solutions like Elasticsearch and Databricks.
- Emphasized the need for data validation and aggregation services.
Schema Management
- Different teams may have different schemas; the platform should accommodate this.
- Discussed the possibility of using Kafka topics to manage data schema changes.
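One possible reading of the schema discussion is a per-team, versioned schema registry where each schema version maps to its own topic, so a schema change becomes a new topic rather than a breaking change on an existing one. The sketch below is in-memory only and does not call Kafka; the `SchemaRegistry` class and the `metrics.<team>.v<version>` naming scheme are hypothetical.

```python
from typing import Any, Dict, Tuple


class SchemaRegistry:
    """In-memory registry of per-team, versioned schemas (hypothetical sketch)."""

    def __init__(self) -> None:
        self._schemas: Dict[Tuple[str, int], Dict[str, type]] = {}

    def register(self, team: str, version: int, fields: Dict[str, type]) -> None:
        self._schemas[(team, version)] = dict(fields)

    def topic_for(self, team: str, version: int) -> str:
        # ASSUMPTION: one topic per team and schema version.
        return f"metrics.{team}.v{version}"

    def validate(self, team: str, version: int, record: Dict[str, Any]) -> bool:
        """Check that the record has exactly the schema's fields with matching types."""
        schema = self._schemas[(team, version)]
        if set(record) != set(schema):
            return False
        return all(isinstance(record[name], kind) for name, kind in schema.items())


registry = SchemaRegistry()
registry.register("payments", 1, {"metric": str, "value": float})
registry.register("payments", 2, {"metric": str, "value": float, "region": str})
```

Consumers can then migrate from the v1 topic to the v2 topic on their own schedule while both remain readable.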
Real-time vs. Delayed Data
- Addressed the need for both real-time and delayed data processing paths.
- Suggested multi-level aggregation strategies to manage data granularity.
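Multi-level aggregation can be sketched as rolling fine-grained windows up into coarser ones. The example assumes each window stores a `(count, total)` pair keyed by its start time in seconds; that layout is an illustration, not something decided in the meeting.

```python
from collections import defaultdict
from typing import Dict, Tuple


def roll_up(
    buckets: Dict[int, Tuple[int, float]], coarse_seconds: int
) -> Dict[int, Tuple[int, float]]:
    """Merge {window_start: (count, total)} buckets into coarser windows."""
    coarse = defaultdict(lambda: [0, 0.0])
    for start, (count, total) in buckets.items():
        # Align each fine window to the coarse window that contains it.
        key = start // coarse_seconds * coarse_seconds
        coarse[key][0] += count
        coarse[key][1] += total
    return {key: (count, total) for key, (count, total) in coarse.items()}


# Per-minute buckets: two in the first hour, one in the second.
per_minute = {0: (2, 30.0), 60: (1, 5.0), 3600: (4, 8.0)}
per_hour = roll_up(per_minute, coarse_seconds=3600)
```

Carrying count and total through each level, instead of averaging averages, keeps the hourly mean exact. Recent data can stay at fine granularity while older data is kept only at the coarser level.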
Conclusion
The meeting concluded with a consensus on the need for a hybrid system that balances real-time data processing with batch processing capabilities. The importance of clear roles, responsibilities, and communication between teams was emphasized to ensure the successful implementation of the metrics system.