TrustRadius Insights for Apache Spark are summaries of user sentiment data from TrustRadius reviews and, when necessary, third-party data sources.
Pros
Great Computing Engine: Apache Spark is praised by many users for its capabilities in handling complex transformative logic and sophisticated data processing tasks. Several reviewers have mentioned that it is a great computing engine, indicating its effectiveness in solving intricate problems.
Valuable Insights and Analysis: Many reviewers find Apache Spark useful for understanding data and performing data analysis work. They appreciate the insights and analysis capabilities the software provides, suggesting that it helps them gain a deeper understanding of their data.
Extensive Set of Libraries and APIs: The extensive set of libraries and APIs offered by Apache Spark has been highly appreciated by users. It provides a wide range of tools and functionalities to solve various day-to-day problems, making it a versatile choice for different data processing needs.
We use Apache Spark for cluster computing in our ETL environment, for data and analytics, and for machine learning. It is mainly used by our data engineering team to support the entire Data Lake foundation. Because we have huge amounts of information coming from multiple sources, we needed an effective cluster management system to handle capacity and deliver the performance and throughput we required.
Pros
Cluster management for ETL.
Data processing engine for our data lake.
Cons
You still need Hive, HDFS, or another storage layer to store information.
Security lags behind MapReduce.
Likelihood to Recommend
Spark is a one-size-fits-all data processing platform. You can run batch and in-motion streams, and you can use it for ETL, machine learning, or even graph processing. You do not need multiple tools, which lowers your TCO and makes management tasks much easier. Like every new platform, it has room to grow: storage and security are the main opportunities we found.
Verified User
Executive in Information Technology (Consumer Goods company, 10,001+ employees)
My company uses Apache Spark in various ways including machine learning, analytics and batch processing. [We] Grab the data from other sources and put it into a Hadoop environment. [We] Build data lakes. SparkSQL is also used for analysis of data and to develop reports. We have deployed the clusters in Cloudera. Because of Apache Spark, it has become very easy to apply data science in a big data field.
Pros
Easy ELT Process
Easy clustering on cloud
Amazing speed
Batch & real time processing
Cons
Debugging is difficult as it is new for most people
There are fewer learning resources
Likelihood to Recommend
When the data is very big and you cannot afford long computation times, such as in a real-time environment, it is advisable to use Apache Spark. There are alternatives to Apache Spark, but it is the most common and robust tool to work with. It is great at batch processing.
It is replacing our traditional ETL tool, and we are also using Apache Spark for data science solutions.
Pros
It makes the ETL process very simple compared to the SQL Server and MySQL ETL tools.
It's very fast and has many machine learning algorithms which can be used for data science problems.
It is easily implemented on a cloud cluster.
Cons
The initialization and SparkContext procedures.
Running applications on a cluster is not well documented anywhere, and some applications are hard to debug.
Debugging and testing are sometimes time-consuming.
Likelihood to Recommend
It's well suited for ETL, data integration, and data science problems on large data sets. It's not at all suitable for small data sets, which can be processed on desktops and laptops using Python tools.
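As an illustration of the point about small data sets (this sketch is not from the review), a small ETL-style aggregation can run on a laptop with only the Python standard library and no cluster. The sample data, the column names `region` and `amount`, and the helper `total_by_key` are all hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for a small CSV file.
SAMPLE_CSV = """region,amount
north,10.5
south,4.0
north,2.5
"""


def total_by_key(csv_text, key_column, value_column):
    """Read rows from CSV text and sum value_column for each key_column."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[key_column]] += float(row[value_column])
    return dict(totals)


totals = total_by_key(SAMPLE_CSV, "region", "amount")
print(totals)
```

For data of this size, the whole job fits in memory on one machine, so the cluster setup and initialization steps mentioned in the cons above would likely add overhead without a matching benefit.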