TrustRadius: an HG Insights company

Apache Spark

Score9.2 out of 10

161 Reviews and Ratings

What is Apache Spark?

Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Categories & Use Cases

Apache Spark is still a valid DE tool

Use Cases and Deployment Scope

We use Apache Spark daily as the main computation engine for updating our most critical and non-critical data pipelines. We mostly do batch processing, but there are instances where we use Spark Streaming as well. The scope covers all analysis pipelines, machine learning datasets, and several operational use cases.

Pros

  • Parallel processing
  • Configurability
  • Usage with other tools

Cons

  • Could provide more ready-to-use presets for tweaking Apache Spark configs
  • Could reduce the need for PySpark UDFs by implementing more transformations natively

Return on Investment

  • Increased data literacy and adherence to best data engineering practices across the organization
  • Increased ability for data analysts to quickly and reliably access their data, better supporting data-driven decisions
  • Decreased costs due to better parallelization of resources

Usability

Other Software Used

dbt, Amazon S3 (Simple Storage Service), Amazon EMR (Elastic MapReduce)

Apache Spark: Lightning-Fast Distributed Computing with a Learning Curve

Use Cases and Deployment Scope

If you are working with analytics on large-scale data, don't go further without Apache Spark! One of the projects I was involved in was a recommendation systems project built with Apache Spark; recommendation systems are also my domain of research expertise. Deploying a RecSys with Apache Spark is very easy thanks to its scalability, its flexibility with various data sources, and its fault tolerance. The built-in machine learning library, MLlib, is a boon to work with; we didn't require any other libraries.

Pros

  • Fault tolerance: nodes rarely fail, and if one does, processing still continues.
  • Highly scalable.
  • Has a built-in machine learning library, MLlib.
  • Very flexible: data from various sources can be used, and integration with HDFS is very easy.

Cons

  • It is not fully backward compatible.
  • It is memory-hungry for heavy, large workloads and datasets.
  • Support for advanced analytics is limited; MLlib offers fairly minimalistic algorithms.
  • Deployment is a complex task for beginners.

Most Important Features

  • Scalability
  • We had data across multiple sources. Integration with those data source types was not a problem
  • Generating recommendations was easy

Return on Investment

  • We used Apache Spark for a research project, so the ROI can't be measured directly, but the resulting paper was accepted at a good conference. What else would a project require?!

Other Software Used

ChatGPT, Python IDLE, IntelliJ IDEA

Usability

Lightning Fast In-Memory Cluster Computing Framework

Use Cases and Deployment Scope

Earlier we were using an RDBMS (Oracle) for retail and eCommerce data. We faced challenges with cost, performance, and the huge volume of incoming transactions. After a number of critical issues, we migrated to a delta lake. Now we use Apache Spark Streaming to handle all real-time transactions, and on the batch side we process terabytes of data with Apache Spark as well.

Pros

  • Real-time data processing
  • Interactive Analysis of data
  • Trigger Event Detection

Cons

  • Machine learning support could be stronger
  • GraphX library
  • True real-time streaming (Spark Streaming is micro-batch)

Most Important Features

  • Fast Processing
  • In-Memory Computing
  • Provides better insights

Return on Investment

  • No licensing cost, as it is open source
  • Cheap commodity hardware can save a lot of money

Alternatives Considered

Apache Hadoop, SAP HANA Cloud and Apache Ignite

Other Software Used

SAP HANA Cloud, Apache Hive, Apache Airflow, Apache Kafka, Tableau Server, Tableau Desktop

Apache Spark is the next generation of big data computing.

Use Cases and Deployment Scope

We need to calculate risk-weighted assets (RWA) daily and monthly for the different positions the bank holds, on a T+1 basis. The volume of calculations is large: millions of records per day with very complicated formulas and algorithms. In our applications, we used Scala and Apache Spark clusters to load all the data needed for the calculations and implemented the complicated formulas and algorithms via DataFrame/Dataset operations on the Apache Spark platform.

Without adopting an Apache Spark cluster, it would have been pretty hard for us to build such a big system to handle this volume of daily calculations. After the system was successfully deployed into PROD, we have been able to provide capital risk control reports to regulation/compliance controllers in different regions around the world.

Pros

  • DataFrame as a distributed collection of data: easy for developers to implement algorithms and formulas.
  • In-memory calculation.
  • Clustering to distribute large calculation workloads.

Cons

  • It would be great if Apache Spark could provide a native catalog/database to manage the file metadata of saved Parquet output.

Most Important Features

  • The speed of processing a large volume of data.
  • DataFrames with SQL-like operations reduce the learning curve for new developers if they have good knowledge of databases and SQL.
  • The cluster scales up/down easily.

Return on Investment

  • With the daily risk reports calculated via Apache Spark, the bank is able to comply with the FHC rule in the US and other regions and to control capital much better with counterparties.

Alternatives Considered

Apache Hadoop

Other Software Used

IntelliJ IDEA, Oracle Database, Scala

Apache Spark - your go-to technology for distributed data processing

Pros

  • Spark is very fast compared to other frameworks because it runs in cluster mode and uses distributed processing and computation frameworks internally
  • Robust and fault tolerant
  • Open source
  • Can source data from multiple data sources

Cons

  • No Dataset API support in the Python version of Spark (PySpark)
  • The Apache Spark job-run UI could show more meaningful information
  • Spark errors could provide more meaningful information when a job fails

Most Important Features

  • Distributed processing and computing
  • Processing different data source formats
  • Fault tolerant and robust

Return on Investment

  • Business leaders are able to make data-driven decisions
  • Business users can now access data in near real time; before using Spark, they had to wait at least 24 hours for data to be available
  • The business is able to come up with new product ideas

Alternatives Considered

IBM InfoSphere DataStage and Informatica PowerCenter

Other Software Used

Azure Data Factory, Databricks Lakehouse Platform (Unified Analytics Platform), Cloudera Distribution Hadoop (CDH)