February 20, 2022 6:00 PM PST
This document summarizes a mock system design interview focused on a Spam and Abuse Machine Learning System. The discussion revolved around the requirements, system design, algorithms, model evaluation, and deployment strategies for detecting abusive content on social media platforms.
Interview Details
- Interviewer: Not specified
- Interviewee: Not specified
- Level: L5 (Senior)
- Duration: 45 minutes
- Topic Covered: Spam and Abuse ML System
Requirements
Functional Requirements
- Develop a text-based ML algorithm to detect spam and abuse.
- Actions include:
  - Blocking tweets containing abusive or spam content.
  - Tagging tweets with an abuse label.
  - Allowing users to report inappropriate tweets.
- Focus on binary classification of abusive content.
Non-Functional Requirements
- Handle 10,000 abuse reports.
- Process approximately 100,000 tweets per month.
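For scale intuition, 100,000 tweets per month averages roughly 100,000 / (30 × 24 × 3,600) ≈ 0.04 tweets per second, so serving throughput is modest; the tighter constraints are more likely labeling capacity and model quality than raw inference load.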
System Design
Data Collection
- Use the 10,000 user reports as the negative (abusive) class.
- Two methods for obtaining additional labeled data (a labeling sketch follows this list):
  - Reports + Annotators:
    - Combine the 10,000 reported tweets with 10,000 annotator-verified good tweets (the positive class) for a balanced, 50/50 dataset of 20,000 labeled tweets.
  - Reports + Crowdsourcing + Annotators:
    - More complex, but can yield verified true-positive and true-negative examples.
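As a minimal sketch of the first approach, assuming `reported_tweets` holds the reported set and `good_tweets` an annotator-verified sample of normal tweets (both names are illustrative):

```python
import random

# Label convention follows the discussion: reported/abusive tweets are the
# negative class (0), good tweets the positive class (1).
ABUSIVE, GOOD = 0, 1

def build_labeled_dataset(reported_tweets, good_tweets, n_per_class=10_000, seed=7):
    """Combine reported tweets with annotator-verified good tweets into a
    balanced 50/50 dataset of 2 * n_per_class (text, label) pairs."""
    rng = random.Random(seed)
    rows = [(t, ABUSIVE) for t in rng.sample(reported_tweets, n_per_class)]
    rows += [(t, GOOD) for t in rng.sample(good_tweets, n_per_class)]
    rng.shuffle(rows)
    return rows  # e.g. 20,000 labeled tweets
```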
Feature Selection
- User Profile Features:
  - Historical reports, follower count, following count, favorites, account age, retweet ratio, and comments.
- Tweet Content Features:
  - Analyze text for word count, digit count, and special-character count, after normalization (e.g., lowercasing, punctuation removal).
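A small sketch of the content features above, operating on plain Python strings (function names are illustrative):

```python
import string

def normalize(text: str) -> str:
    """Lowercase and strip punctuation, per the normalization step above."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def tweet_content_features(text: str) -> dict:
    """Simple text statistics mentioned in the discussion."""
    return {
        "word_count": len(text.split()),
        "digit_count": sum(ch.isdigit() for ch in text),
        "special_char_count": sum(ch in string.punctuation for ch in text),
    }
```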
Algorithm
- Use supervised learning with logistic regression (a training sketch follows this list).
- Optimize a cross-entropy loss and use the macro F1 score for model selection.
- Train/test split: 70% training, 30% testing.
- Considerations for scaling to larger datasets:
  - Neural models (CNN, RNN, Transformer) become viable:
    - CNN: fast, parallelizable training, but not recommended for this task.
    - RNN: sequential processing, not well suited to parallelization.
    - Transformer: trains in parallel and handles long-range dependencies effectively.
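A minimal end-to-end training sketch for the logistic regression baseline; the TF-IDF featurizer is an assumption, since the interview only specifies text-based features:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_baseline(texts, labels):
    """70/30 split, then logistic regression (fit by minimizing cross-entropy
    / log loss); macro F1 on the held-out 30% drives model selection."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.3, random_state=0, stratify=labels
    )
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    macro_f1 = f1_score(y_test, model.predict(X_test), average="macro")
    return model, macro_f1
```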
Model Evaluation
- Offline Metrics (a sketch follows this list):
  - Macro F1 score and precision of the negative (abusive) class.
- Online Metrics:
  - User reports, customer retention rate, session time, and average session duration.
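The offline metrics are straightforward to compute; a sketch using the label convention above (abusive = 0):

```python
from sklearn.metrics import f1_score, precision_score

def offline_metrics(y_true, y_pred, abusive_label=0):
    """Macro F1 plus precision of the negative (abusive) class, i.e. the
    fraction of blocked/tagged tweets that were truly abusive."""
    return {
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "abusive_precision": precision_score(y_true, y_pred, pos_label=abusive_label),
    }
```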
Model Deployment
- Apply optimization techniques such as quantization to improve computational efficiency.
- Considerations for model updates (see the sketch after this list):
  - Batch Learning: update parameters once sufficient new data has been collected.
  - Online Learning: continuous updates; more complex to implement.
- Initial approach: use batch learning with logistic regression for the first six months, then monitor for data drift and run A/B tests when updating the model.
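One way to contrast the two update strategies is scikit-learn's `SGDClassifier`, which trains a logistic-regression objective and supports incremental updates via `partial_fit` (the helper names here are illustrative):

```python
from sklearn.linear_model import SGDClassifier

# Online learning: update parameters continuously as labeled mini-batches arrive.
online_model = SGDClassifier(loss="log_loss")  # logistic-regression objective

def on_new_batch(X_batch, y_batch):
    # classes must be passed on the first incremental call
    online_model.partial_fit(X_batch, y_batch, classes=[0, 1])

# Batch learning: retrain from scratch once enough new data has accumulated,
# then A/B test the candidate against the deployed model before promoting it.
def retrain(X_all, y_all):
    model = SGDClassifier(loss="log_loss")
    model.fit(X_all, y_all)
    return model
```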
Feedback and Discussion Points
- Interviewer feedback highlighted the need for a high-level overview before diving into details.
- Audience suggestions included:
- Discussing the importance of trade-offs in algorithm selection.
- Emphasizing the need for a backup plan if the primary model fails.
- Addressing data consistency, data shift, and overfitting concerns.
Additional Considerations
- Discussed the need for a feature store for both online and offline processing.
- Considered the implications of recall and precision in model performance.
- Addressed the importance of model training and inference environments (CPU vs. GPU).
- Discussed the potential for using naive models as baselines for performance comparison.
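For the naive-baseline point, scikit-learn's `DummyClassifier` gives a concrete floor to beat:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

def naive_baseline_macro_f1(X_train, y_train, X_test, y_test):
    """Majority-class baseline; any useful model should clear this score."""
    baseline = DummyClassifier(strategy="most_frequent")
    baseline.fit(X_train, y_train)
    return f1_score(y_test, baseline.predict(X_test), average="macro")
```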
This summary encapsulates the key discussions and considerations for designing a Spam and Abuse Machine Learning System, focusing on technical aspects and strategic planning for effective deployment and evaluation.