January 8, 2023 7:00 PM PST
This meeting focused on the design and implementation of distributed resilient storage systems, particularly in the context of real-world applications. Key topics included system design challenges, authentication processes, data consistency, and strategies for improving service resilience. The discussion highlighted various technical solutions and considerations for managing distributed data storage effectively.
Presenter: K, Tech Lead at FAANG
Key Topics Discussed
1. System Design: IP Blocker
- Implementation of IP blocking in response to legal requirements in specific countries.
- Handling high throughput with a third-party service managing 20,000 queries per second (QPS) and 2 million connections per second.
- Warm start capabilities using precomputed results.
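A minimal sketch of the warm-start idea, assuming block verdicts are kept in an in-memory table that can be seeded from precomputed results. The names `IpBlocker`, `cidr_lookup`, and `snapshot` are hypothetical and were not given in the meeting; `cidr_lookup` only stands in for the slower authoritative lookup, such as the third-party service.

```python
import ipaddress
import json


def cidr_lookup(blocked_cidrs):
    """Build a stand-in for the slow, authoritative lookup (hypothetical)."""
    networks = [ipaddress.ip_network(cidr) for cidr in blocked_cidrs]

    def lookup(ip):
        address = ipaddress.ip_address(ip)
        return any(address in network for network in networks)

    return lookup


class IpBlocker:
    """Serves block verdicts from memory and falls back to a slow lookup."""

    def __init__(self, lookup, precomputed=None):
        self._lookup = lookup
        # Warm start: seed the verdict table from precomputed results.
        self._verdicts = dict(precomputed or {})

    def is_blocked(self, ip):
        if ip not in self._verdicts:
            self._verdicts[ip] = self._lookup(ip)
        return self._verdicts[ip]

    def snapshot(self):
        """Serialize the verdicts so a restarted instance can warm-start."""
        return json.dumps(self._verdicts)


lookup = cidr_lookup(["203.0.113.0/24"])  # documentation range, for illustration
cold = IpBlocker(lookup)
cold.is_blocked("203.0.113.9")  # slow path; the verdict is now cached
warm = IpBlocker(lookup, precomputed=json.loads(cold.snapshot()))
print(warm.is_blocked("203.0.113.9"))  # answered from the precomputed table
```

A restarted instance that loads the snapshot answers known addresses without touching the slow path, which is the point of a warm start.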
2. Performance Considerations
- The design emphasized throughput and paid too little attention to latency and Service Level Agreements (SLAs).
- IPv4 address caching can be pre-built in less than three days.
- IPv6 solutions can leverage similar caching strategies.
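The notes do not record how the three-day figure was derived. One reading consistent with the numbers above, offered as an assumption, is that every one of the 2^32 IPv4 addresses is queried once at the 20,000 QPS quoted for the third-party service. The function name `prebuild_days` is hypothetical.

```python
IPV4_ADDRESS_COUNT = 2 ** 32   # 4,294,967,296 addresses
ASSUMED_QPS = 20_000           # assumption: the third-party rate quoted above
SECONDS_PER_DAY = 86_400


def prebuild_days(address_count, qps):
    """Days needed to query every address once at a constant rate."""
    return address_count / qps / SECONDS_PER_DAY


ipv4_days = prebuild_days(IPV4_ADDRESS_COUNT, ASSUMED_QPS)
print(f"{ipv4_days:.2f} days")  # about 2.49 days, under the three-day figure
```

The same full enumeration is not feasible for the 2^128 IPv6 address space, so an IPv6 cache would presumably be keyed by prefix or filled on demand; the notes do not say which.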
3. Control Plane and Data Plane
- Configuration management through a cloud resource manager.
- User requests for service allocation are served without authentication for private network services.
- Latency for data replication is targeted at 5 minutes.
4. Key Challenges
- High request rates (70,000-80,000 requests per second) necessitate strategies to reduce latency.
- Detection of data tampering and improving service resilience.
- Issues with large customers monopolizing resources and geographical boundaries affecting service.
5. Data Plane Authentication
- Requirements for high throughput and low latency (under 5 ms).
- Proposed solution: co-locating authentication services with data services.
- Challenges with permission revocation and the propagation of changes to cache instances.
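The revocation problem can be illustrated with a small sketch, assuming each data node keeps a local permission cache next to the data service. `LocalAuthCache`, `fetch_permission`, and the TTL value are hypothetical. The TTL bounds how long a revoked permission can survive in a cache, and `revoke` models an explicit invalidation pushed to the instance.

```python
import time


class LocalAuthCache:
    """Permission cache co-located with a data node (illustrative only)."""

    def __init__(self, fetch_permission, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch_permission  # slow call to the central auth service
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # (principal, resource) -> (allowed, expires_at)

    def is_allowed(self, principal, resource):
        key = (principal, resource)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]  # fast path: no network hop
        allowed = self._fetch(principal, resource)
        self._entries[key] = (allowed, now + self._ttl)
        return allowed

    def revoke(self, principal):
        """Drop every cached decision for a principal after a revocation."""
        stale = [key for key in self._entries if key[0] == principal]
        for key in stale:
            del self._entries[key]
```

Without a pushed invalidation, a revoked permission stays usable on that node until its entry expires, so the TTL is a direct trade-off between lookup latency and revocation delay.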
6. Data Signing and Security
- Importance of signing authentication data with a private key and verifying with a public key to prevent unauthorized modifications.
- Ensuring updates occur on secure machines to protect against upstream attacks.
- Embedding the signature in the record during write operations.
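The sign-on-write, verify-on-read flow can be sketched as follows. This is a toy, not the scheme used by the presenter: it uses textbook RSA over a SHA-256 digest with a deliberately weak key built from two known Mersenne primes, so that it runs on the Python standard library alone (Python 3.8 or later for the modular inverse). A real system would use a vetted cryptography library and a standard scheme such as Ed25519 or RSA-PSS. All function names are hypothetical.

```python
import hashlib
import json

# TOY KEY, NOT SECURE: two known Mersenne primes, for illustration only.
P = 2 ** 127 - 1
Q = 2 ** 89 - 1
N = P * Q                            # public modulus
E = 65537                            # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent, kept on the signer


def _digest(record):
    """Hash a canonical JSON encoding of the record into an integer below N."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return int.from_bytes(hashlib.sha256(payload).digest(), "big") % N


def sign_record(record):
    """Sign with the private exponent (done only on the secure machine)."""
    return pow(_digest(record), D, N)


def verify_record(record, signature):
    """Verify with the public exponent (any reader can do this)."""
    return pow(signature, E, N) == _digest(record)


def write_signed(store, key, record):
    """Embed the signature next to the record at write time."""
    store[key] = {"record": record, "signature": sign_record(record)}


def read_verified(store, key):
    """Return the record, or raise if it no longer matches its signature."""
    entry = store[key]
    if not verify_record(entry["record"], entry["signature"]):
        raise ValueError(f"record {key!r} failed signature verification")
    return entry["record"]
```

Because readers hold only the public values `N` and `E`, a compromised cache or replica can serve stale data but cannot forge a modified record that still verifies.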
7. Distributed Data Store Consistency Levels
- Discussion on five levels of consistency, ranging from eventual to session consistency.
- Multi-tenant architecture considerations, including GDPR compliance and shuffle sharding to mitigate risks from malicious tenants.
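A minimal sketch of shuffle sharding, assuming each tenant is mapped to a small, deterministic, pseudo-random subset of nodes derived from a hash of its ID. The function names, node names, and shard size are hypothetical.

```python
import hashlib
import math
import random


def shuffle_shard(tenant_id, nodes, shard_size):
    """Pick a stable pseudo-random subset of nodes for one tenant."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)  # seeded, so the choice is repeatable
    return sorted(rng.sample(sorted(nodes), shard_size))


def possible_shards(node_count, shard_size):
    """Number of distinct shards, which drives how rarely two tenants fully overlap."""
    return math.comb(node_count, shard_size)


nodes = [f"node-{i}" for i in range(8)]
print(shuffle_shard("tenant-a", nodes, 2))
print(possible_shards(len(nodes), 2))  # 28 distinct two-node shards
```

A malicious or overloaded tenant can degrade only the nodes in its own shard, and another tenant loses all of its capacity only if it was assigned exactly the same combination.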
8. Container Isolation and Performance
- Challenges of achieving pay-as-you-go models with container-based isolation.
- Noted performance differences in Jupyter notebook service startup times compared to competitors.
- Exploration of Kubernetes fleets and gVisor for improved isolation and resource management.
- Use of Kata Containers for small clients, despite slower performance compared to other solutions.
Conclusion
The meeting provided valuable insights into the complexities of designing distributed resilient storage systems, particularly in terms of authentication, data consistency, and performance. Future discussions may focus on refining these strategies and exploring new technologies to enhance service delivery and security.