SIMBox Fraud Detection

Overview

A production-ready ML pipeline for detecting SIMBox fraud in telecom networks by analyzing Call Detail Records (CDR). The system uses SQL-based feature engineering and multiple ML models to identify fraudulent SIM activity with high accuracy.

Current Performance: 90.41% AUC, 83.59% F1 Score (XGBoost with v11 features)

Key Features

63 engineered features across 7 categories (call volume, patterns, temporal, geographic, network/device, SMS, SIM age)
SQL-first architecture for fast feature extraction using DuckDB
4 ML models with optimized hyperparameters (XGBoost, LightGBM, Random Forest, Logistic Regression)
Experiment tracking with DuckDB-based versioning
Simple API for quick training, comparison, and deployment

Technical Stack

Database: DuckDB (columnar, fast analytics)
Feature Engineering: Pure SQL for performance and flexibility
ML Models: XGBoost, LightGBM, Random Forest, Logistic Regression
Data Processing: Pandas, NumPy, Scikit-learn
Orchestration: Python with minimal boilerplate
Package Management: UV for fast dependency management

Results

Version	Features	F1 Score	AUC	Best Model
v11	63	83.59%	90.41%	XGBoost
v10	62	81.16%	88.52%	XGBoost
v8	58	80.79%	88.41%	XGBoost

Model Comparison (v11)

XGBoost: 90.41% AUC, 83.59% F1 (best overall)
LightGBM: 90.38% AUC, 83.34% F1 (faster training)
Random Forest: 89.20% AUC, 81.50% F1 (solid baseline)
Logistic Regression: 85.50% AUC, 76.80% F1 (linear baseline)

Architecture Highlights

Philosophy: Simple is better than complex. Focus on fast experimentation rather than over-engineering.

SQL-based Features: All 63 features extracted via SQL for speed and maintainability
DuckDB Analytics: Columnar database optimized for analytical queries
Modular Pipeline: Easy to modify features, compare models, and track experiments
One-line Training: quick_train("xgboost") for rapid prototyping
Model Comparison: Compare all 4 models with quick_compare()

Implementation

The pipeline processes telecom CDR data through:

Data Ingestion: Raw CDR data loaded into DuckDB
Feature Engineering: SQL queries generate 63 features per SIM
Model Training: Automated hyperparameter tuning with cross-validation
Evaluation: Comprehensive metrics (AUC, F1, Precision, Recall, Confusion Matrix)
Experiment Tracking: All results versioned in DuckDB

Designed for ML researchers and data scientists who need to iterate quickly on features and models without unnecessary complexity.