TrustRadius Insights for Apache Hadoop are summaries of user sentiment data from TrustRadius reviews and, when necessary, third party data sources.
Business Problems Solved
Hadoop has been widely adopted by organizations for various use cases. One of its key use cases is in storing and analyzing log data, financial data from systems like JD Edwards, and retail catalog and session data for an omnichannel experience. Users have found that Hadoop's distributed processing capabilities allow for efficient and cost-effective storage and analysis of large amounts of data. It has been particularly helpful in reducing storage costs and improving performance when dealing with massive data sets. Furthermore, Hadoop enables the creation of a consistent data store that can be integrated across platforms, making it easier for different departments within organizations to collect, store, and analyze data. Users have also leveraged Hadoop to gain insights into business data, analyze patterns, and solve big data modeling problems. The user-friendly nature of Hadoop has made it accessible to users who are not necessarily experts in big data technologies. Additionally, Hadoop is utilized for ETL processing, data streaming, transformation, and querying data using Hive. Its ability to serve as a large volume ETL platform and crunching engine for analytical and statistical models has attracted users who were previously reliant on MySQL data warehouses. They have observed faster query performance with Hadoop compared to traditional solutions. Another significant use case for Hadoop is secure storage without high costs. Hadoop efficiently stores and processes large amounts of data, addressing the problem of secure storage without breaking the bank. Moreover, Hadoop enables parallel processing on large datasets, making it a popular choice for data storage, backup, and machine learning analytics. Organizations have found that it helps maintain and process huge amounts of data efficiently while providing high availability, scalability, and cost efficiency. Hadoop's versatility extends beyond commercial applications—it is also used in research computing clusters to complete tasks faster using the MapReduce framework. Finally, the Systems and IT department relies on Hadoop to create data pipelines and consult on potential projects involving Hadoop. Overall, the use cases of Hadoop span across industries and departments, providing valuable solutions for data collection, storage, and analysis.
We use Apache Hadoop to store and process large amounts of data (petabytes per day) across thousands of data pipelines. Hadoop works reliably for this purpose. Data scientists at the company also use it for interactive querying for analytics and modeling purposes.
Pros
Storing large amounts of data
Processing large amounts of data via a familiar SQL interface
Cons
Slower than other interactive querying engines. Queries take minutes at least and up to hours sometimes
Tuning the settings to be able to run certain queries can require a lot of domain knowledge
Likelihood to Recommend
If you have petabytes of data that you need to store and process on a regular basis and don't mind having to wait minutes for your queries to run, Apache Hadoop is great for that use case. I would supplement it with another faster interactive database for interactive querying.
VU
Verified User
Analyst in Marketing (Internet company, 501-1000 employees)
[Apache Hadoop] is being handled as it is (mostly) intended. For large, unstructured data management from our data flows to include logging and reports extract, transform and load. We are using it at a medium scale in an on-prem server delivery with Cloudera as the management platform. While I firmly believe cloudera makes it a bit easier to manage, it obfuscates issues at times.
Pros
Handles large amounts of unstructured data well, for business level purposes
Is a good catchall because of this design, i.e. what does not fit into our vertical tables fits here.
Decent for large ETL pipelines and logging free-for-alls because of this, also.
Cons
Many, many modules and because of Apache open source, takes time to learn
Integration is not always seamless between the disparate pieces nor are all the pieces required.
Optimization can be challenging (see PSTL design)
Likelihood to Recommend
Apache Hadoop (and its subsequent add-ons) are well-suited to larger, unstructured data flows, such as aggregation of web traffic or advertising. Geospatial algorithms and their outputs are well-suited for this kind of aggregation as structuring that data is challenging, but leaving it unstructured and performing queries as-needed is a better fit for most business models. With the advent of data science, I would expect Hadoop fits a LOT of their initial outputs quite well.
It is being used at our Fortune 500 clients. It is great for storage, but it is not well understood by the business. The challenge is that it requires very sophisticated data scientists to use properly and in parallel, but the data scientists turn the data on its head, causing IT execution issues. This has forced IT to restructure data in a denormalized form so the business users can actually be productive. This is a big trend in organizations.
Pros
Great for inexpensive storage, when originally introduced.
Distributed processing
Industry standard
Cons
Network fabric needs to be more sophisticated.
Need centralized storage.
The three copy of data should have been in the original design, not years later.
Consider deploying Spectrum Scale in these environments.
Likelihood to Recommend
Massive processing in a distributed environment with data that can be distributed. Research environments. Lab environments would also be a good use for Hadoop. Hadoop can also be used in support of Spark environments and used by Frameworks if deployed properly. The best scenario is with a Data Scientist that understands how to program appropriately.
VU
Verified User
Executive in Professional Services (Information Services company, 51-200 employees)
Currently, there are two directorates using Hadoop for processing a vast amount of data from various data sources in my organization. Hadoop helps us tackle our problem of maintaining and processing a huge amount of data efficiently. High availability, scalability and cost efficiency are the main considerations for implementing Hadoop as one of the core solutions in our big-data infrastructure.
Pros
Scalability is one of the main reasons we decided to use Hadoop. Storage and processing power can be seamlessly increased by simply adding more nodes.
Replication on Hadoop's distributed file system (HDFS) ensures robustness of data being stored which ensures high-availability of data.
Using commodity hardware as a node in a Hadoop cluster can reduce cost and eliminates dependency on particular proprietary technology.
Cons
User and access management are still challenging to implement in Hadoop, deploying a kerberized secured cluster is quite a challenge itself.
Multiple application versioning on a single cluster would be a nice to have feature.
Processing a large number of small files also becomes a problem on a very large cluster with hundreds of nodes.
Likelihood to Recommend
Hadoop is well suited for internal projects in a secure environment without any external exposure. It also excels well in storing and processing large amounts of data. It is also suitable to be implemented as a data repository for data-intensive applications which require high data availability, a significant amount of memory and huge processing power. However, it is not appropriate to implement as a near real-time solution which needs a high response time with a high number of high transactions per seconds.
Hadoop has been an amazing development in the world of Big Data. Where relational databases fall short with regard to tuning and performance, Hadoop rises to the occasion and allows for massive customization leveraging the different tools and modules. We use Hadoop to input raw data and add layers of consolidation or analysis to make business decisions about disparate datapoints.
Pros
Hadoop can take loads of data quickly and performs well under load.
Hadoop is customizable so that nearly any business objective can be justified with the right combination of data and reports.
Hadoop has a lot of great resources, both informal like the community and formal like the supported modules and training.
Cons
Hadoop is not a relational database, but it has the ability to add modules to run sql-like queries like Impala and Hive.
Hadoop is open source and has many modules. It can be difficult without context to know which modules to leverage.
Likelihood to Recommend
Hadoop is well suited for organizations with a lot of data, trying to justify business decisions with data-driven KPIs and milestones. This tool is best utilized by engineers with data modeling experience and a high-level understanding of how the different data points can be used and correlated. It will be challenging for people with limited knowledge of the business and how data points are created.
VU
Verified User
Engineer in Engineering (Telecommunications company, 51-200 employees)
We needed a robust/redundant system to run multiple simultaneous jobs for our ETL pipeline, this needed distributed storage space, integration with Windows AD user accounts and the ability to expand when needed with little to no downtime. We are using Cloudera 5.6 to orchestrate the install (along with puppet) and manage the hadoop cluster.
Pros
The distributed replicated HDFS filesystem allows for fault tolerance and the ability to use low cost JBOD arrays for data storage.
Yarn with MapReduce2 gives us a job slot scheduler to fully utilize available compute resources while providing HA and resource management.
The hadoop ecosystem allows for the use of many different technologies all using the same compute resources so that your spark, samza, camus, pig and oozie jobs can happily co-exist on the same infrastructure.
Cons
Without Cloudera as a management interface the hadoop components are much harder to manage to ensure consistency across a cluster.
The calculations of hardware resources to job slots/resource management can be quite an exercise in finding that "sweet spot" with your applications, a more transparent way of figuring this out would be welcome.
A lot of the roles and management pieces are written in java, which from an administration perspective can have there own issues with garbage collection and memory management.
Likelihood to Recommend
Hadoop is not for the faint of heart and is not a technology per se but an ecosystem of disparate technologies sitting on top of HDFS. It is certainly powerful but if, like me, you were handed this with no prior knowledge or experience using or administering this ecosystem the learning curve can be significant and ongoing having said that I don't think currently there are many other opensource technologies that can provide the flexibility in the "big data" arena especially for ETL or machine learning.
My organization uses Apache Hadoop for log analysis/data mining of data fetched from different practices in the US, Canada and India. It uses this data for showing analytical graphs and the progress of our software in those regions. Data from the practices is optimized and consumed by the customer applications. It provides faster performance and ease for data usage.
Pros
Hadoop is a very cost effective storage solution for businesses’ exploding data sets.
Hadoop can store and distribute very large data sets across hundreds of servers that operate, therefore it is a highly scalable storage platform.
Hadoop can process terabytes of data in minutes and faster as compared to other data processors.
Hadoop File System can store all types of data, structured and unstructured, in nodes across many servers
Cons
For now, Hadoop is doing great and is very productive.
Likelihood to Recommend
Hadoop is well suited for healthcare organizations that deal with huge amounts of data and optimizing data.
VU
Verified User
Engineer in Engineering (Computer Software company, 51-200 employees)
I have used Hadoop for building business feeds for a telecom client. The major purpose for using Hadoop was to tackle the problem of gaining insights into the ever growing number of business data. We leveraged the map reduce programming model to churn more than 30 gigabytes of data per day into actionable and aggregated data which was further leveraged by campaign teams to design and shape marketing and by product teams to envision new customer experiences.
Pros
Hadoop is an excellent framework for building distributed, fault tolerant data processing systems which leverage HDFS which is optimized for low latency storage and high throughput performance.
Hadoop Map reduce is a powerful programming model and can be leveraged directly either via use of Java programming language or by data flow languages like Apache Pig.
Hadoop has a reach eco system of companion tools which enable easy integration for ingesting large amounts of data efficiently from various sources. For example Apache Flume can act as data bus which can use HDFS as a sink and integrates effectively with disparate data sources.
Hadoop can also be leveraged to build complex data processing and machine learning workflows, due to availability of Apache Mahout, which uses the map reduce model of Hadoop to run complex algorithms.
Cons
Hadoop is a batch oriented processing framework, it lacks real time or stream processing.
Hadoop's HDFS file system is not a POSIX compliant file system and does not work well with small files, especially smaller than the default block size.
Hadoop cannot be used for running interactive jobs or analytics.
Likelihood to Recommend
1. How large are your data sets? If your answer is few gigabytes, Hadoop may be overkill for your needs. 2. Do you require real-time analytical processing? If yes, Hadoop's map reduce may not be a great asset in that scenario. 3. Do you want to want to process data in a batch processing fashion and scale for TeraBytes size clusters? Hadoop is definitely a great fit for your use case.
Hadoop is used by data center management team. Hadoop processes the metric data pushed by virtual machines. Hadoop's output is served to the analytics engine and respective actions are taken to maintain even load on machines.
Pros
Processing huge data sets.
Concurrent processing.
Performance increases with distribution of data across multiple machines.
Better handling of unstructured data.
Data nodes and processing nodes
Cons
Make Haadop lighweight.
Installation is very difficult. Make it more user friendly.
Introduce a feature that works with continuous integration.
Likelihood to Recommend
Ask about how Hadoop fits in your environment and how fast it processes streaming data.
I have been using Hadoop for 2 years and I really find it very useful, especially working with bigger datasets. I have used Hadoop and Mahout for my project to analyze and learn different patterns from Yelp Dataset. It was really very easy and user friendly to use.
Pros
Scalability. Hadoop is really useful when you are dealing with a bigger system and you want to make your system scalable.
Reliable. Very reliable.
Fast, Fast Fast!!! Hadoop really works very fast, even with bigger datasets.
Cons
Development tools are not that easy to use.
Learning curve can be reduced. As of now, some skill is a must to use Hadoop.
Security. In today's world, security is of prime importance. Hadoop could be made more secure to use.
Likelihood to Recommend
Hadoop is really useful for larger datasets. It is not very useful when you are dealing with a smaller dataset.