Apache Spark is still a valid DE tool
Use Cases and Deployment Scope
We use Apache Spark daily as the main computation engine for our critical and non-critical data pipelines. Most of our workloads are batch, but we use Spark Streaming in a few cases as well. Spark covers all of our analytics pipelines, machine learning dataset preparation, and several operational use cases.
Pros
- Parallel processing
- Configurability
- Usage with other tools
Cons
- Could offer more ready-to-use presets for tuning Apache Spark configurations
- Could reduce the need for PySpark UDFs by implementing more transformations natively
Return on Investment
- Increased data literacy and adherence to best data engineering practices across the organization
- Increased ability for the data analysts to quickly and reliably have access to their data, better supporting data driven decisions
- Decreased costs due to better parallelization of resources
Other Software Used
dbt, Amazon S3 (Simple Storage Service), Amazon EMR (Elastic MapReduce)