TrustRadius: an HG Insights company

Apache Spark

Score9.2 out of 10

161 Reviews and Ratings

What is Apache Spark?

Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Categories & Use Cases

Apache Spark is still a valid DE tool

Use Cases and Deployment Scope

We use Apache Spark daily as the main computation engine for updating our most critical and non-critical data pipelines. We mostly do batch processing, but there are instances where we use Spark Streaming as well. The scope covers all analysis pipelines, machine learning datasets, and several operational use cases.

Pros

  • Parallel processing
  • Configurability
  • Usage with other tools

Cons

  • Could provide more ready-to-use presets for tweaking Apache Spark configs
  • Could reduce the need for PySpark UDFs by implementing more transformations natively

Return on Investment

  • Increased data literacy and adherence to best data engineering practices across the organization
  • Increased ability for data analysts to quickly and reliably access their data, better supporting data-driven decisions
  • Decreased costs due to better parallelization of resources

Usability

Other Software Used

dbt, Amazon S3 (Simple Storage Service), Amazon EMR (Elastic MapReduce)

Apache Spark: Lightning-Fast Distributed Computing with a Learning Curve

Use Cases and Deployment Scope

If you are working with analytics on large-scale data, don't go further without Apache Spark! One of the projects I was involved in was a recommendation systems project built with Apache Spark; recommendation systems are also my domain of research expertise. Deploying a RecSys with Apache Spark is very easy thanks to its scalability, its flexibility with various data sources, and its fault tolerance. The built-in machine learning library, MLlib, is a boon to work with; we didn't require any other libraries.

Pros

  • Fault tolerance: nodes rarely fail, and if one does, processing still continues.
  • Highly scalable.
  • Has a built-in machine learning library, MLlib.
  • Very flexible: data from various sources can be used, and integration with HDFS is very easy.

Cons

  • It is not fully backward compatible.
  • It is memory-hungry for heavy, large workloads and datasets.
  • Support for advanced analytics is limited; MLlib offers fairly minimalistic algorithms.
  • Deployment is a complex task for beginners.

Most Important Features

  • Scalability
  • We had data across multiple sources. Integration with those data source types was not a problem
  • Generating recommendations was easy

Return on Investment

  • We used Apache Spark for a research project, so the ROI can't be measured directly, but the resulting paper was accepted at a good conference. What else would a project require?!

Other Software Used

ChatGPT, Python IDLE, IntelliJ IDEA

Usability

Lightning Fast In-Memory Cluster Computing Framework

Use Cases and Deployment Scope

Earlier we were using an RDBMS (Oracle) for retail and eCommerce data. We faced challenges with cost, performance, and the huge volume of incoming transactions. After a number of critical issues, we migrated to a delta lake. Now we use Apache Spark Streaming to handle all real-time transactions, and on the batch side we process terabytes of data with Apache Spark as well.

Pros

  • Real-time data processing
  • Interactive Analysis of data
  • Trigger Event Detection

Cons

  • Machine learning support could be stronger
  • GraphX library
  • True real-time streaming (Spark Streaming is micro-batch)

Most Important Features

  • Fast Processing
  • In-Memory Computing
  • Provides better insights

Return on Investment

  • No licensing cost, as it is open source
  • Cheap commodity hardware can save a lot of money

Alternatives Considered

Apache Hadoop, SAP HANA Cloud and Apache Ignite

Other Software Used

SAP HANA Cloud, Apache Hive, Apache Airflow, Apache Kafka, Tableau Server, Tableau Desktop

Apache Spark is the next generation of big data computing.

Use Cases and Deployment Scope

We need to calculate risk-weighted assets (RWA) daily and monthly for the different positions the bank holds, on a T+1 basis. The volume of calculations is large: millions of records per day with very complicated formulas and algorithms. In our applications, we used Scala and Apache Spark clusters to load all the data needed for the calculations and implemented the complicated formulas and algorithms via DataFrame/Dataset operations on the Apache Spark platform.

Without adopting an Apache Spark cluster, it would have been pretty hard for us to build such a big system to handle this volume of daily calculations. After the system was successfully deployed into PROD, we have been able to provide capital risk control reports to regulation/compliance controllers in different regions around the world.

Pros

  • DataFrame as a distributed collection of data: easy for developers to implement algorithms and formulas.
  • In-memory calculation.
  • Clustering to distribute large calculation workloads.

Cons

  • It would be great if Apache Spark could provide a native catalog/database to manage the file metadata of saved Parquet output.

Most Important Features

  • The speed of processing a large volume of data.
  • DataFrames with SQL-like operations reduce the learning curve for new developers if they have good knowledge of databases and SQL.
  • The cluster scales up/down easily.

Return on Investment

  • With the daily risk reports calculated via Apache Spark, the bank is able to comply with the FHC rule in the US and other regions and to control capital much better with counterparties.

Alternatives Considered

Apache Hadoop

Other Software Used

IntelliJ IDEA, Oracle Database, Scala

Apache Spark - your go-to technology for distributed data processing

Pros

  • Spark is very fast compared to other frameworks because it runs in cluster mode and uses distributed processing and computation frameworks internally
  • Robust and fault tolerant
  • Open source
  • Can source data from multiple data sources

Cons

  • No Dataset API support in the Python version of Spark (PySpark)
  • The Apache Spark job-run UI could show more meaningful information
  • Spark errors could provide more meaningful information when a job fails

Most Important Features

  • Distributed processing and computing
  • Processing different data source formats
  • Fault tolerant and robust

Return on Investment

  • Business leaders are able to make data-driven decisions
  • Business users can now access data in near real time; before using Spark, they had to wait at least 24 hours for data to be available
  • The business is able to come up with new product ideas

Alternatives Considered

IBM InfoSphere DataStage and Informatica PowerCenter

Other Software Used

Azure Data Factory, Databricks Lakehouse Platform (Unified Analytics Platform), Cloudera Distribution Hadoop (CDH)