Overview
A production-ready ML pipeline for detecting SIMBox fraud in telecom networks by analyzing Call Detail Records (CDR). The system uses SQL-based feature engineering and multiple ML models to identify fraudulent SIM activity with high accuracy.
Current Performance: 90.41% AUC, 83.59% F1 Score (XGBoost with v11 features)
Key Features
- 63 engineered features across 7 categories (call volume, patterns, temporal, geographic, network/device, SMS, SIM age)
- SQL-first architecture for fast feature extraction using DuckDB
- 4 ML models with optimized hyperparameters (XGBoost, LightGBM, Random Forest, Logistic Regression)
- Experiment tracking with DuckDB-based versioning
- Simple API for quick training, comparison, and deployment
Technical Stack
- Database: DuckDB (columnar, fast analytics)
- Feature Engineering: Pure SQL for performance and flexibility
- ML Models: XGBoost, LightGBM, Random Forest, Logistic Regression
- Data Processing: Pandas, NumPy, Scikit-learn
- Orchestration: Python with minimal boilerplate
- Package Management: UV for fast dependency management
Results
Version | Features | F1 Score | AUC | Best Model |
---|---|---|---|---|
v11 | 63 | 83.59% | 90.41% | XGBoost |
v10 | 62 | 81.16% | 88.52% | XGBoost |
v8 | 58 | 80.79% | 88.41% | XGBoost |
Model Comparison (v11)
- XGBoost: 90.41% AUC, 83.59% F1 (best overall)
- LightGBM: 90.38% AUC, 83.34% F1 (faster training)
- Random Forest: 89.20% AUC, 81.50% F1 (solid baseline)
- Logistic Regression: 85.50% AUC, 76.80% F1 (linear baseline)
Architecture Highlights
Philosophy: Simple is better than complex. Focus on fast experimentation rather than over-engineering.
- SQL-based Features: All 63 features extracted via SQL for speed and maintainability
- DuckDB Analytics: Columnar database optimized for analytical queries
- Modular Pipeline: Easy to modify features, compare models, and track experiments
- One-line Training:
quick_train("xgboost")
for rapid prototyping - Model Comparison: Compare all 4 models with
quick_compare()
Implementation
The pipeline processes telecom CDR data through:
- Data Ingestion: Raw CDR data loaded into DuckDB
- Feature Engineering: SQL queries generate 63 features per SIM
- Model Training: Automated hyperparameter tuning with cross-validation
- Evaluation: Comprehensive metrics (AUC, F1, Precision, Recall, Confusion Matrix)
- Experiment Tracking: All results versioned in DuckDB
Designed for ML researchers and data scientists who need to iterate quickly on features and models without unnecessary complexity.