Cost reduction with Amazon EMR on EKS
Use Cases and Deployment Scope
Amazon EMR (Elastic MapReduce) is heavily used at my organization for most, if not all, data pipeline computations. We started with EMR on EC2 instances, then moved to EMR Serverless, and we are currently completing the transition to EMR on EKS. In general we use it for long-running analysis (SQL queries with many JOINs) and for batch processing overall. From what I've seen, we use it with Spark under the hood.
Pros
- EMR on EKS is really flexible and cost-saving
- Flexibility in how to run the jobs (with different implementations to choose from)
- Online support is available, and the product is regularly updated
Cons
- EMR on EKS could be better documented, especially for the "magic" it does under the hood when using Spark
- UI can be improved (especially for EMR on EKS)
Likelihood to Recommend
Based on my experience, Amazon EMR is well suited for companies with good support at the Platform and Data Platform level, since it needs to be set up properly to avoid incurring extra costs. It's quite easy to keep giving a job more and more resources until it eventually runs, but that is exactly how costs get out of hand. In general, EMR on EC2 has been the most expensive of the EMR subproducts for us, while EMR on EKS strikes a good balance between giving jobs enough resources to run and keeping costs low. The other recommendation is to use the latest versions of the EMR images, as otherwise the support from Amazon might not be very helpful.
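To illustrate what capping resources up front can look like, here is a minimal sketch that builds a request for the EMR on EKS StartJobRun API with an upper bound on Spark executors. It is not our actual setup: the function name, the default sizes, the job name and every identifier in the usage example (cluster ID, role ARN, S3 path, release label) are placeholders I made up for illustration. The sketch only builds the request dictionary; submitting it would require AWS credentials and boto3.

```python
def build_job_run_request(virtual_cluster_id, execution_role_arn, entry_point,
                          release_label, max_executors=10,
                          executor_cores=2, executor_memory_gb=4):
    """Build a StartJobRun request with a hard cap on Spark executors.

    The default sizes are illustrative assumptions, not recommendations.
    """
    spark_conf = {
        # Let Spark scale executors with the workload instead of fixing a count.
        "spark.dynamicAllocation.enabled": "true",
        # Spark on Kubernetes has no external shuffle service, so dynamic
        # allocation relies on shuffle tracking.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": "1",
        # The cost guard: the job can never ask for more executors than this.
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        "spark.executor.cores": str(executor_cores),
        "spark.executor.memory": f"{executor_memory_gb}g",
    }
    submit_params = " ".join(f"--conf {k}={v}" for k, v in spark_conf.items())
    return {
        "name": "capped-batch-job",  # hypothetical job name
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": execution_role_arn,
        "releaseLabel": release_label,
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": entry_point,
                "sparkSubmitParameters": submit_params,
            }
        },
    }


# Usage with placeholder values (all hypothetical):
request = build_job_run_request(
    virtual_cluster_id="vc-example123",
    execution_role_arn="arn:aws:iam::123456789012:role/example-emr-job-role",
    entry_point="s3://example-bucket/jobs/daily_joins.py",
    release_label="emr-7.0.0-latest",  # placeholder; check the current release list
    max_executors=5,
)
# To submit, one would pass the request to boto3 (not executed here):
#   boto3.client("emr-containers").start_job_run(**request)
```

With a cap like this, a job that needs more resources runs longer or fails visibly instead of quietly scaling up, which at least makes the cost decision explicit.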