Detail-oriented, performance-driven Data Engineer with over 5 years of IT experience, including 2+ years specializing in data engineering and cloud-based analytics. Proven expertise in building and optimizing scalable ETL pipelines with PySpark, Apache Airflow, and AWS EMR. Hands-on experience with large datasets, data lakehouses (Databricks), and hybrid data source integration using Azure Data Factory. Collaborative team player with a strong foundation in Python, SQL, and distributed data systems. Currently based in Dubai and available immediately for Data Engineering roles.
Roles and Responsibilities:
- Designed and implemented end-to-end ETL pipelines that extract data from S3 and load it into data lakes.
- Developed processes that write structured, transformed data to Snowflake for downstream analytics and reporting.
- Ran distributed data processing workloads on AWS EMR clusters and tuned their performance.
- Built automated data quality checks and logging for real-time monitoring.
- Collaborated with data scientists to provide processed datasets for model training.
- Orchestrated pipelines with Apache Airflow, including automated retries and alerting.
- Integrated Spark jobs into Airflow so they run as orchestrated tasks.
- Optimized Spark job performance by tuning memory allocation and join strategies.
- Collaborated with cross-functional teams and DevOps to monitor and maintain EMR clusters.
- Implemented role-based access control for S3 data and EMR clusters.
- Maintained technical documentation of data workflows and conducted peer code reviews.