sunildataengineer/README.md

👋 Hi, I'm Sunil Kumar Reddy


🚀 About Me

I'm Sunil Kumar Reddy, a Data Engineer passionate about designing and building production-grade, cloud-native data platforms capable of processing large-scale streaming workloads.

My primary interests include:

  • ⚡ Real-Time Streaming Pipelines
  • 🏗️ Distributed Data Systems
  • ☁️ Cloud-Native Data Engineering
  • 📊 ETL & ELT Platform Development
  • 📈 Observability & Monitoring
  • 🌍 Open Source Development

I enjoy building reliable, scalable, and maintainable data platforms using modern distributed technologies while continuously improving my engineering skills through open-source contributions.


🌐 Connect With Me



⚙️ Tech Stack

💻 Programming Languages


⚡ Real-Time Streaming & Big Data


🗄️ Databases & Storage


☁️ Cloud Platforms


🐳 DevOps & Infrastructure


📊 Monitoring & Observability


🧪 Data Quality & Testing


🛠️ Tools & IDEs


📈 Engineering Focus

🚀 Core Expertise

  • Real-Time Data Streaming
  • Event-Driven Architectures
  • Distributed Data Processing
  • ETL & ELT Pipelines
  • Data Platform Engineering
  • Cloud-Native Applications
  • Data Modeling
  • Workflow Orchestration

🎯 Currently Focusing On

  • Apache Flink
  • Apache Airflow
  • Apache Kafka
  • Spark Structured Streaming
  • System Design
  • Distributed Systems
  • Kubernetes
  • Terraform
  • Production Observability
  • Open Source Contributions

🚀 Featured Production Projects

⚡ 1. Real-Time Fraud & Anomaly Detection Streaming Platform

Production-grade streaming platform for real-time fraud detection using stateful stream processing, exactly-once semantics, and cloud-native architecture.

🎯 Highlights

  • ⚡ Stateful stream processing with Apache Flink
  • 🚀 Low-latency fraud detection
  • 🔄 Exactly-once event processing
  • 📈 Horizontal scalability
  • 📊 Live operational dashboards
  • 📡 Event-driven architecture
  • ☁️ Cloud-native deployment
  • 🔍 End-to-end observability

🛠️ Tech Stack

📊 Scale

| Metric | Value |
| --- | --- |
| Average Volume | 100M+ Transactions/Day |
| Daily Data | 1–2 TB |
| Processing | Real-Time Streaming |
| Processing Guarantee | Exactly Once |

📂 Repository: https://github.com/sunildataengineer/Real-Time-Fraud-Detection-Risk-Intelligence-Platform

🖥️ Architecture: [architecture diagram]

🖥️ Data Modelling: [data model diagram]

🌐 Live Demo: Coming Soon


📡 2. Real-Time Data Quality & Streaming Governance Platform

Production-grade data quality platform that validates, governs, monitors, and scores streaming data before downstream consumption.

🎯 Highlights

  • ✅ Schema validation
  • 📋 Data quality scoring
  • 🔍 Data governance
  • 📈 Live monitoring
  • ⚠️ Automatic anomaly detection
  • 🔄 Dead Letter Queue handling
  • 📊 Data quality dashboards
  • ☁️ Cloud-native deployment
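As a rough illustration of how schema validation, Dead Letter Queue routing, and quality scoring can fit together, the sketch below validates each record, diverts failures to a dead-letter list with their error reasons, and reports the share of valid records as a score. The schema, field names, and scoring formula are assumptions made for the example, not the platform's actual rules.

```python
from typing import Any

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"event_id": str, "amount": float, "currency": str}


def validate(record: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; empty means the record is valid."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}")
    return errors


def route(records: list[dict[str, Any]]):
    """Split records into valid output and dead letters, plus a quality score."""
    valid, dead_letters = [], []
    for record in records:
        errors = validate(record)
        if errors:
            # Keep the original payload and the reasons so it can be replayed.
            dead_letters.append({"record": record, "errors": errors})
        else:
            valid.append(record)
    score = len(valid) / len(records) if records else 1.0
    return valid, dead_letters, score


batch = [
    {"event_id": "e1", "amount": 10.5, "currency": "USD"},
    {"event_id": "e2", "amount": 3.0},                       # missing currency
    {"event_id": "e3", "amount": "7", "currency": "EUR"},    # amount is a string
    {"event_id": "e4", "amount": 99.0, "currency": "GBP"},
]
valid, dead_letters, score = route(batch)
print(len(valid), len(dead_letters), score)
```

Keeping the rejected payload next to its error list is the design choice that makes a dead-letter queue useful: the record can be inspected, fixed, and replayed without going back to the source.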

🛠️ Tech Stack

📊 Scale

| Metric | Value |
| --- | --- |
| Average Volume | 50M+ Records/Day |
| Daily Data | 500 GB–1 TB |
| Processing | Real-Time Streaming |
| Focus | Data Quality & Governance |

📂 Repository: Add GitHub Repository Link

🖥️ Architecture: Add Architecture Diagram

🌐 Live Demo: Optional


🌍 3. Global Real-Time Event Processing Stateful Streaming Platform

Scalable event processing platform built for high-throughput, fault-tolerant, low-latency analytics with stateful stream processing.

🎯 Highlights

  • 🌍 Event-driven architecture
  • ⚡ Stateful processing
  • 🪟 Event-time windowing
  • 🔁 Checkpoint & recovery
  • 📈 Horizontal scaling
  • 📊 Live dashboards
  • ☁️ Cloud-native deployment
  • 📡 High availability
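Event-time windowing assigns each event to a window based on the timestamp carried in the event, not on when it arrives. The sketch below shows the core arithmetic for tumbling windows in plain Python; the function name and the `(event_time, key)` tuple layout are assumptions for illustration, and watermarks and late-data handling are deliberately left out.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per (window_start, key) using integer event timestamps."""
    counts = defaultdict(int)
    for event_time, key in events:
        # Floor the timestamp to the start of its window.
        window_start = (event_time // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)


# The event at t=7 arrives after the one at t=61 but still lands in window 0.
events = [(3, "login"), (61, "login"), (7, "login"), (59, "purchase"), (65, "login")]
result = tumbling_window_counts(events, window_size=60)
print(result)
```

Because assignment depends only on the event's own timestamp, out-of-order arrival does not change which window an event belongs to. What a streaming engine adds on top is deciding when a window is complete enough to emit.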

🛠️ Tech Stack

📊 Scale

| Metric | Value |
| --- | --- |
| Average Volume | 100M+ Events/Day |
| Processing | Stateful Streaming |
| Availability | High Availability |
| Processing Guarantee | Exactly Once |

📂 Repository: Add GitHub Repository Link

🖥️ Architecture: Add Architecture Diagram

🌐 Live Demo: Optional


💡 Engineering Philosophy

Build reliable, observable, scalable, and production-ready data platforms that transform high-volume event streams into trusted, actionable data through modern cloud-native engineering practices.

🌍 Apache Airflow Open Source Contribution

Contributing to one of the world's most widely adopted workflow orchestration platforms used by thousands of organizations for production data engineering.


🚀 Contribution Overview

| Item | Details |
| --- | --- |
| Project | Apache Airflow |
| Area | SFTP Provider |
| Feature | Deferrable SFTPOperator |
| Pull Request | PR #68298 (continuation of the original implementation) |
| Status | Active review |


🎯 Problem Statement

The existing SFTPOperator occupied an Airflow worker for the entire duration of long-running file transfers.

This reduced worker availability and limited scalability for workflows involving large or slow SFTP operations.

The goal of this contribution is to introduce a deferrable execution mode, allowing the operator to release the worker after initiating the transfer and resume execution only when the asynchronous operation completes.


🏗️ Technical Contributions

✅ Deferrable Operator

  • Implemented asynchronous execution using self.defer()
  • Integrated with Airflow's Triggerer architecture
  • Enabled non-blocking execution for long-running SFTP transfers

✅ Async Trigger

Designed and implemented SFTPOperationTrigger.

Responsibilities include:

  • Waiting asynchronously for transfer completion
  • Returning execution events
  • Reducing worker utilization
  • Supporting scalable task execution
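The general shape of an async trigger can be sketched without importing Airflow: a class that can describe itself for serialization and an async generator that yields a single completion event. `SketchTransferTrigger` below is a hypothetical stand-in, not the `SFTPOperationTrigger` from the pull request; the event payload, the classpath string, and the simulated transfer are all assumptions.

```python
import asyncio
from typing import Any, AsyncIterator


class SketchTransferTrigger:
    """Stand-in for a trigger; deliberately does not import Airflow."""

    def __init__(self, remote_path: str, simulated_delay: float = 0.01) -> None:
        self.remote_path = remote_path
        self.simulated_delay = simulated_delay

    def serialize(self) -> tuple[str, dict[str, Any]]:
        # Triggers are rebuilt in another process, so they describe themselves
        # as a classpath plus keyword arguments (the classpath here is made up).
        return (
            "sketch.SketchTransferTrigger",
            {"remote_path": self.remote_path, "simulated_delay": self.simulated_delay},
        )

    async def _transfer(self) -> int:
        await asyncio.sleep(self.simulated_delay)  # stands in for real SFTP I/O
        return 1024                                # pretend byte count

    async def run(self) -> AsyncIterator[dict[str, Any]]:
        """Wait for the transfer without blocking, then emit one event."""
        try:
            size = await self._transfer()
        except OSError as exc:
            yield {"status": "error", "message": str(exc)}
            return
        yield {"status": "success", "path": self.remote_path, "bytes": size}


async def first_event(trigger: SketchTransferTrigger) -> dict[str, Any]:
    async for event in trigger.run():
        return event
    raise RuntimeError("trigger finished without emitting an event")


event = asyncio.run(first_event(SketchTransferTrigger("/upload/report.csv")))
print(event)
```

Reporting failures as an event, not as an exception raised inside the trigger, lets the resumed task decide how to fail, which keeps error handling in one place.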

✅ Refactoring

Refactored transfer logic into shared methods across synchronous and asynchronous hooks to eliminate duplicated code and improve maintainability.

Applied the DRY (Don't Repeat Yourself) principle based on maintainer feedback.


✅ Native Async I/O

Replaced wrapper-based asynchronous execution with native async operations for improved efficiency.

Implemented asynchronous file operations including:

  • retrieve
  • store
  • delete

✅ Concurrent Transfers

Implemented bounded concurrent transfers using:

  • asyncio.Semaphore
  • asyncio.gather

Benefits:

  • Controlled concurrency
  • Reduced connection overhead
  • Improved throughput
  • Better resource utilization
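The combination of `asyncio.Semaphore` and `asyncio.gather` can be shown in a few lines. This is a generic sketch of the pattern, not the code from the pull request: `transfer_all`, `max_concurrency`, and the sleep that stands in for network I/O are assumptions.

```python
import asyncio


async def transfer_all(paths, max_concurrency=2):
    """Run one coroutine per path, but let only a few hold the semaphore at once."""
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def transfer_one(path):
        nonlocal in_flight, peak
        async with semaphore:              # waits here once the limit is reached
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)      # stands in for real network I/O
            in_flight -= 1
            return f"done:{path}"

    # gather returns results in the same order as the input paths.
    results = await asyncio.gather(*(transfer_one(p) for p in paths))
    return results, peak


paths = ["a.csv", "b.csv", "c.csv", "d.csv", "e.csv"]
results, peak = asyncio.run(transfer_all(paths, max_concurrency=2))
print(results, peak)
```

All five coroutines are scheduled together, but the semaphore keeps the number of simultaneous transfers at or below `max_concurrency`, which is what bounds load on the remote server.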

✅ Code Quality

Resolved multiple review iterations including:

  • Ruff linting
  • Import ordering
  • Documentation updates
  • News fragments
  • Exception handling
  • CI failures
  • Formatting improvements

🤝 Collaboration

Worked with Apache Airflow maintainers and contributors through multiple review cycles, incorporating feedback on:

  • Architecture
  • Naming conventions
  • API design
  • Performance
  • Maintainability
  • Code quality

This iterative review process strengthened both the implementation and my understanding of large-scale open-source collaboration.


📚 Skills Demonstrated

  • Asynchronous Programming
  • Apache Airflow Internals
  • Python
  • Open Source Collaboration
  • Distributed Systems
  • Code Review
  • Performance Optimization
  • Git Workflow
  • CI/CD Debugging
  • Software Design Principles

💡 Key Learnings

Contributing to Apache Airflow provided hands-on experience with:

  • Designing production-quality features
  • Working within a large, mature codebase
  • Responding to maintainer feedback
  • Iterating through multiple review cycles
  • Maintaining backward compatibility
  • Writing clean, maintainable, and testable code

This experience reinforced the importance of thoughtful design, collaboration, and continuous improvement in production software engineering.



🌍 Open Source Engineering Case Study

Contributing to Apache Airflow — one of the world's most widely adopted workflow orchestration platforms.


📌 Overview

Apache Airflow is one of the most popular workflow orchestration platforms used by organizations worldwide for scheduling, orchestrating, and monitoring complex data pipelines.

As an open-source contributor, I worked on improving the SFTP Provider by implementing Deferrable Execution, enabling long-running SFTP transfers to execute asynchronously without occupying Airflow worker resources.


🎯 Problem

Traditional SFTPOperator execution keeps an Airflow worker occupied during the entire file transfer.


Worker → Transfer Running → Worker Busy → Transfer Finished

Problems:

  • Worker slot blocked
  • Poor resource utilization
  • Limited parallelism
  • Higher infrastructure cost


💡 Solution

Introduce a Deferrable SFTPOperator using the Airflow Triggerer.

Worker → Start Transfer → self.defer() → Triggerer → Async Trigger → Transfer Complete → Resume Worker → Task Success

This enables:

  • Non-blocking execution
  • Better scalability
  • Lower worker utilization
  • Higher concurrency


🏗️ Architecture


User DAG → SFTPOperator → self.defer() → Triggerer → SFTPOperationTrigger → Async SFTP Hook → Remote SFTP Server → Trigger Event → Worker Resumes → Task Complete


⚙️ Technical Contributions

✅ Async Trigger

Implemented SFTPOperationTrigger.

Responsibilities:

  • Wait asynchronously
  • Monitor file transfer
  • Return completion event

✅ Deferrable Operator

Integrated self.defer() inside SFTPOperator.execute() to support asynchronous execution.
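The defer-and-resume handshake can be simulated with the standard library alone. In Airflow, calling `self.defer()` raises an exception that hands the task over to the triggerer. The stand-in classes below mimic that flow in miniature. `SketchOperator`, `SketchDeferred`, the `deferrable` flag, and the `execute_complete` method name are assumptions for illustration and are not taken from the pull request.

```python
import asyncio


class SketchDeferred(Exception):
    """Stand-in for the signal an operator raises when it defers."""

    def __init__(self, trigger, method_name):
        super().__init__(method_name)
        self.trigger = trigger
        self.method_name = method_name


async def fake_transfer(path):
    await asyncio.sleep(0.01)  # stands in for the async SFTP transfer
    return {"status": "success", "path": path}


class SketchOperator:
    def __init__(self, path, deferrable=True):
        self.path = path
        self.deferrable = deferrable

    def defer(self, trigger, method_name):
        raise SketchDeferred(trigger, method_name)

    def execute(self):
        if not self.deferrable:
            return f"transferred {self.path} synchronously"
        # Hand off to the "triggerer"; the worker slot is free after this raise.
        self.defer(trigger=fake_transfer(self.path), method_name="execute_complete")

    def execute_complete(self, event):
        if event["status"] != "success":
            raise RuntimeError(event.get("message", "transfer failed"))
        return f"transferred {event['path']} via trigger"


def run_task(operator):
    """Plays both scheduler and triggerer for the sake of the example."""
    try:
        return operator.execute()
    except SketchDeferred as deferred:
        event = asyncio.run(deferred.trigger)                  # triggerer role
        return getattr(operator, deferred.method_name)(event)  # resume on a worker


deferred_result = run_task(SketchOperator("/data/a.csv"))
sync_result = run_task(SketchOperator("/data/a.csv", deferrable=False))
print(deferred_result)
print(sync_result)
```

The point of the pattern is that `execute()` never awaits anything itself: it either finishes or raises, and the waiting happens somewhere that does not hold a worker slot.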


✅ Shared Transfer Logic

Refactored the transfer implementation into SFTPHook.transfer() and SFTPHookAsync.transfer().

Benefits:

  • DRY
  • Maintainability
  • Easier testing
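One common way to share logic between a synchronous and an asynchronous code path is to pull the I/O-free decisions into a plain function that both call. The sketch below illustrates that idea under stated assumptions: `plan_transfers`, `SketchSyncHook`, and `SketchAsyncHook` are hypothetical names, and the "upload" is only recorded in a list. It does not claim to match how the real hooks are structured.

```python
import asyncio
import posixpath


def plan_transfers(local_files, remote_dir):
    """Pure, I/O-free logic that both code paths can share and unit-test."""
    return [
        (local, posixpath.join(remote_dir, posixpath.basename(local)))
        for local in local_files
    ]


class SketchSyncHook:
    def __init__(self):
        self.sent = []

    def transfer(self, local_files, remote_dir):
        for local, remote in plan_transfers(local_files, remote_dir):
            self.sent.append((local, remote))  # a real hook would upload here
        return self.sent


class SketchAsyncHook:
    def __init__(self):
        self.sent = []

    async def transfer(self, local_files, remote_dir):
        for local, remote in plan_transfers(local_files, remote_dir):
            await asyncio.sleep(0)             # stands in for awaited network I/O
            self.sent.append((local, remote))
        return self.sent


files = ["/tmp/a.csv", "/tmp/b.csv"]
sync_sent = SketchSyncHook().transfer(files, "/inbox")
async_sent = asyncio.run(SketchAsyncHook().transfer(files, "/inbox"))
print(sync_sent == async_sent)
```

With the path planning in one place, a fix to the naming rules applies to both hooks, and the planning can be tested without an SFTP server.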


✅ Native Async IO

Removed sync_to_async and replaced it with retrieve_file(), store_file(), and unlink() using native asyncio.
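The difference between wrapper-based and native async execution can be illustrated with the standard library. The sketch uses `asyncio.to_thread` as a stand-in for a wrapper such as `sync_to_async` (a swapped-in technique, not the code that was removed), and `blocking_fetch`, `wrapped_fetch`, and `native_fetch` are hypothetical functions that only sleep.

```python
import asyncio
import time


def blocking_fetch(name):
    time.sleep(0.01)                 # blocks the calling thread
    return f"fetched:{name}"


async def wrapped_fetch(name):
    # Wrapper style: the blocking call runs in a worker thread, so each
    # in-flight operation occupies a thread for its whole duration.
    return await asyncio.to_thread(blocking_fetch, name)


async def native_fetch(name):
    # Native style: the coroutine yields to the event loop while waiting,
    # so no extra thread is needed per operation.
    await asyncio.sleep(0.01)
    return f"fetched:{name}"


async def main():
    names = ["a", "b", "c"]
    wrapped = await asyncio.gather(*(wrapped_fetch(n) for n in names))
    native = await asyncio.gather(*(native_fetch(n) for n in names))
    return wrapped, native


wrapped, native = asyncio.run(main())
print(wrapped == native)
```

Both styles return the same results. The native version simply avoids the thread hop, which matters most when many transfers are waiting at the same time.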


✅ Concurrent Transfers

Implemented bounded concurrency with asyncio.Semaphore and asyncio.gather.

Benefits:

  • Controlled concurrency
  • Better throughput
  • Reduced connection overhead


🔄 Engineering Review Process

During the review process, I addressed feedback related to:

  • API Design
  • Naming
  • Performance
  • Code Reuse
  • Documentation
  • CI
  • Linting
  • Provider Standards


🧪 Testing

Worked through:

  • Ruff
  • Pytest
  • Provider Tests
  • Documentation Validation
  • News Fragment Validation
  • Import Ordering
  • Formatting
  • Exception Handling


📈 Skills Demonstrated

  • Distributed Systems
  • Async Programming
  • Python
  • Apache Airflow Internals
  • Git
  • GitHub
  • Code Review
  • Open Source Collaboration
  • CI/CD
  • Software Architecture
  • Performance Optimization


📚 Engineering Lessons

This contribution strengthened my understanding of:

  • Production software development
  • Large-scale codebases
  • API design
  • Reviewer collaboration
  • Backward compatibility
  • Maintainable architecture
  • Async system design
  • Performance engineering


📊 Timeline

April 2025 → Initial Proposal → Feature Development → Code Review → Refactoring → Native Async Migration → Performance Improvements → Multiple Review Iterations → Active Review


🚀 Technologies

  • Python
  • Apache Airflow
  • AsyncIO
  • Git
  • GitHub
  • Ruff
  • Pytest
  • Open Source
  • Distributed Systems
  • CI/CD


🔗 Resources

  • Apache Airflow: https://github.com/apache/airflow
  • Pull Request: apache/airflow#68298
  • Apache Airflow Documentation: https://airflow.apache.org/
