Cost reduction with Amazon EMR on EKS
Use Cases and Deployment Scope
Amazon EMR (Elastic MapReduce) is heavily used at my organization for most, if not all, data pipeline computations. We started on EC2 instances, then moved to EMR Serverless, and are currently completing the transition to EMR on EKS. In general we use it for long-running analysis (SQL queries with many JOINs) and for batch processing overall. From what I've seen, we run it with Spark under the hood.
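For readers who have not submitted a batch Spark job to EMR on EKS, here is a minimal sketch of what the request might look like. It is not our actual pipeline code: the function name, the default release label, the Spark parameters, and every ID, ARN, and S3 path are placeholders I made up for illustration. The field names follow the EMR on EKS StartJobRun request as I understand it.

```python
from typing import Any


def build_job_run_request(
    name: str,
    virtual_cluster_id: str,
    execution_role_arn: str,
    entry_point: str,
    release_label: str = "emr-6.15.0-latest",  # placeholder, pick your own release
    spark_params: str = "--conf spark.executor.instances=2",  # placeholder tuning
) -> dict[str, Any]:
    """Build the keyword arguments for a StartJobRun call (hypothetical helper)."""
    return {
        "name": name,
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": execution_role_arn,
        "releaseLabel": release_label,
        "jobDriver": {
            "sparkSubmitJobDriver": {
                # Location of the PySpark script or JAR to run
                "entryPoint": entry_point,
                "sparkSubmitParameters": spark_params,
            }
        },
    }


# All values below are made-up placeholders.
request = build_job_run_request(
    name="nightly-joins",
    virtual_cluster_id="abc123example",
    execution_role_arn="arn:aws:iam::111122223333:role/example-job-role",
    entry_point="s3://example-bucket/jobs/nightly_joins.py",
)
```

With boto3 installed and AWS credentials configured, the resulting dictionary could be passed along as `boto3.client("emr-containers").start_job_run(**request)`. Keeping the payload construction separate from the call makes it easy to inspect or unit test without touching AWS.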
Pros
- EMR on EKS is really flexible and cost-saving
- Flexibility in how to run jobs (with different implementations to choose from)
- Online support is available, and the product is regularly updated
Cons
- EMR on EKS could be better documented, especially the "magic" it does under the hood when using Spark
- The UI could be improved (especially for EMR on EKS)
Return on Investment
- Switching most of our EMR on EC2 jobs to EMR on EKS has reduced overall costs by 4% (while maintaining the same level of data freshness)
Other Software Used
Apache Spark, Apache Airflow, Amazon S3 (Simple Storage Service), dbt




