Job Aggregation Data Platform

Job Aggregation Data Platform

Production-grade data engineering platform with web scraping pipelines, ETL workflows, and intelligent job matching system.

Overview

A production-ready data engineering platform that scrapes, processes, and delivers personalized job notifications. Built with Apache Airflow for orchestration, the system automates the entire pipeline from data collection to intelligent matching and notification delivery.

Key Achievements:

  • 90% reduction in job discovery time through automation
  • 99.5% uptime with automated health monitoring
  • Thousands of jobs processed hourly from multiple sources

Technical Architecture

Orchestration: Apache Airflow 3.1.0 with separated DAG Processor

Storage: MinIO data lake (S3-compatible) + PostgreSQL 16 warehouse with full-text search

Application: FastAPI backend with intelligent matching engine and multi-language support

Infrastructure: Dockerized services with resource limits and health checks

Core Features

  • Web scraping from multiple job boards with rate limiting and retry logic
  • ETL pipelines with staging/production workflow and hash-based deduplication
  • Intelligent matching using full-text search and user preference filtering
  • Real-time notifications with zero-duplicate tracking
  • Production engineering: Safe restart scripts, automated cleanup, log rotation

Performance Highlights

  • 99.8% reduction in Airflow metadata size through XCom optimization
  • 70% storage savings with Parquet compression
  • 10-100x faster queries using PostgreSQL GIN indexes
  • Processes thousands of jobs in under 5 minutes

Technical Challenges

Preventing Redundant Runs: Implemented safe shutdown scripts that pause DAGs before restart to avoid queuing missed schedules

Cross-Platform Deduplication: Built hash-based deduplication using normalized job descriptions to eliminate duplicates across sources

Memory Management: Applied resource limits per service and DAG serialization to prevent excessive memory consumption

Technologies

Apache Airflow 3.1.0 • Python 3.12 • PostgreSQL 16 • MinIO • FastAPI • Docker Compose • Requests • BeautifulSoup • Pandas • SQLAlchemy • Boto3

Screenshots & Gallery

Click on any image to view in fullscreen

Project Details

Technologies
Apache Airflow Web Scraping ETL PostgreSQL FastAPI Docker