TrustRadius Insights for Apache Hadoop are summaries of user sentiment data from TrustRadius reviews and, when necessary, third party data sources.
Business Problems Solved
Hadoop has been widely adopted by organizations for various use cases. One of its key use cases is in storing and analyzing log data, financial data from systems like JD Edwards, and retail catalog and session data for an omnichannel experience. Users have found that Hadoop's distributed processing capabilities allow for efficient and cost-effective storage and analysis of large amounts of data. It has been particularly helpful in reducing storage costs and improving performance when dealing with massive data sets. Furthermore, Hadoop enables the creation of a consistent data store that can be integrated across platforms, making it easier for different departments within organizations to collect, store, and analyze data. Users have also leveraged Hadoop to gain insights into business data, analyze patterns, and solve big data modeling problems. The user-friendly nature of Hadoop has made it accessible to users who are not necessarily experts in big data technologies. Additionally, Hadoop is utilized for ETL processing, data streaming, transformation, and querying data using Hive. Its ability to serve as a large volume ETL platform and crunching engine for analytical and statistical models has attracted users who were previously reliant on MySQL data warehouses. They have observed faster query performance with Hadoop compared to traditional solutions. Another significant use case for Hadoop is secure storage without high costs. Hadoop efficiently stores and processes large amounts of data, addressing the problem of secure storage without breaking the bank. Moreover, Hadoop enables parallel processing on large datasets, making it a popular choice for data storage, backup, and machine learning analytics. Organizations have found that it helps maintain and process huge amounts of data efficiently while providing high availability, scalability, and cost efficiency. Hadoop's versatility extends beyond commercial applications—it is also used in research computing clusters to complete tasks faster using the MapReduce framework. Finally, the Systems and IT department relies on Hadoop to create data pipelines and consult on potential projects involving Hadoop. Overall, the use cases of Hadoop span across industries and departments, providing valuable solutions for data collection, storage, and analysis.
Loading Reviews List....
Hadoop Reviews
14 Reviews
Mid-sized Companies (51-1,000 employees)
Search is temporarily unavailable. Filters are still applied.
We use Apache Hadoop to store and process large amounts of data (petabytes per day) across thousands of data pipelines. Hadoop works reliably for this purpose. Data scientists at the company also use it for interactive querying for analytics and modeling purposes.
Pros
Storing large amounts of data
Processing large amounts of data via a familiar SQL interface
Cons
Slower than other interactive querying engines. Queries take minutes at least and up to hours sometimes
Tuning the settings to be able to run certain queries can require a lot of domain knowledge
Likelihood to Recommend
If you have petabytes of data that you need to store and process on a regular basis and don't mind having to wait minutes for your queries to run, Apache Hadoop is great for that use case. I would supplement it with another faster interactive database for interactive querying.
[Apache Hadoop] is being handled as it is (mostly) intended. For large, unstructured data management from our data flows to include logging and reports extract, transform and load. We are using it at a medium scale in an on-prem server delivery with Cloudera as the management platform. While I firmly believe cloudera makes it a bit easier to manage, it obfuscates issues at times.
Pros
Handles large amounts of unstructured data well, for business level purposes
Is a good catchall because of this design, i.e. what does not fit into our vertical tables fits here.
Decent for large ETL pipelines and logging free-for-alls because of this, also.
Cons
Many, many modules and because of Apache open source, takes time to learn
Integration is not always seamless between the disparate pieces nor are all the pieces required.
Optimization can be challenging (see PSTL design)
Likelihood to Recommend
Apache Hadoop (and its subsequent add-ons) are well-suited to larger, unstructured data flows, such as aggregation of web traffic or advertising. Geospatial algorithms and their outputs are well-suited for this kind of aggregation as structuring that data is challenging, but leaving it unstructured and performing queries as-needed is a better fit for most business models. With the advent of data science, I would expect Hadoop fits a LOT of their initial outputs quite well.
It is being used at our Fortune 500 clients. It is great for storage, but it is not well understood by the business. The challenge is that it requires very sophisticated data scientists to use properly and in parallel, but the data scientists turn the data on its head, causing IT execution issues. This has forced IT to restructure data in a denormalized form so the business users can actually be productive. This is a big trend in organizations.
Pros
Great for inexpensive storage, when originally introduced.
Distributed processing
Industry standard
Cons
Network fabric needs to be more sophisticated.
Need centralized storage.
The three copy of data should have been in the original design, not years later.
Consider deploying Spectrum Scale in these environments.
Likelihood to Recommend
Massive processing in a distributed environment with data that can be distributed. Research environments. Lab environments would also be a good use for Hadoop. Hadoop can also be used in support of Spark environments and used by Frameworks if deployed properly. The best scenario is with a Data Scientist that understands how to program appropriately.
VU
Verified User
Executive in Professional Services (51-200 employees)
Apache Hadoop is a cost effective solution for storing and managing vast amounts of data efficiently. It is dependable and works even when various clusters fail. The Hadoop Distributed File System (HDFS) also goes a long way in helping in storing data. MapReduce and Tez, with the help of Hive of course, processes large amounts of data in a lesser time frame than expected. This helps our data warehouse to be updated with lesser resources rather than reading, processing and updating data in a relational data base.
Pros
It is cost effective.
It is highly scalable.
Failure tolerant.
Cons
Hadoop does not fit all needs.
Converting data into a single format takes time.
Need to take additional security measures to secure data.
Likelihood to Recommend
When we have data coming in from various sources, using hadoop is a good call. Its a good central station to take a good look at your data and see what needs to be done. Hadoop should not be used directly for Real time Analytics. HDFS should be used to store data and we could use Hive to query the files. Hadoop needs to be understood thoroughly even before attempting to use it for data warehousing needs. So you may need to take stock of what Hadoop provides, and read up on its accompanying tools to see what fits your needs.
Hadoop has been an amazing development in the world of Big Data. Where relational databases fall short with regard to tuning and performance, Hadoop rises to the occasion and allows for massive customization leveraging the different tools and modules. We use Hadoop to input raw data and add layers of consolidation or analysis to make business decisions about disparate datapoints.
Pros
Hadoop can take loads of data quickly and performs well under load.
Hadoop is customizable so that nearly any business objective can be justified with the right combination of data and reports.
Hadoop has a lot of great resources, both informal like the community and formal like the supported modules and training.
Cons
Hadoop is not a relational database, but it has the ability to add modules to run sql-like queries like Impala and Hive.
Hadoop is open source and has many modules. It can be difficult without context to know which modules to leverage.
Likelihood to Recommend
Hadoop is well suited for organizations with a lot of data, trying to justify business decisions with data-driven KPIs and milestones. This tool is best utilized by engineers with data modeling experience and a high-level understanding of how the different data points can be used and correlated. It will be challenging for people with limited knowledge of the business and how data points are created.
[It was used] As a proof of concept to analyze a huge amount of data. We were building a product to analyze huge data and eventually sell that product to a utility.
Pros
Highly Scalable Architecture
Low cost
Can be used in a Cloud Environment
Can be run on commodity Hardware
Open Source
Cons
Its open source but there are companies like hortonworks, Cloudera etc., which give enterprise support
Lots of scripting still needed
Some tools in the hadoop eco system overlap
Likelihood to Recommend
To analyze a huge quantity of data at a low cost. It is definitely the future.
Machine learning with Spark is also a good use case.
You can also use AWS - EMR with S3 to store a lot of data with low cost.
We needed a robust/redundant system to run multiple simultaneous jobs for our ETL pipeline, this needed distributed storage space, integration with Windows AD user accounts and the ability to expand when needed with little to no downtime. We are using Cloudera 5.6 to orchestrate the install (along with puppet) and manage the hadoop cluster.
Pros
The distributed replicated HDFS filesystem allows for fault tolerance and the ability to use low cost JBOD arrays for data storage.
Yarn with MapReduce2 gives us a job slot scheduler to fully utilize available compute resources while providing HA and resource management.
The hadoop ecosystem allows for the use of many different technologies all using the same compute resources so that your spark, samza, camus, pig and oozie jobs can happily co-exist on the same infrastructure.
Cons
Without Cloudera as a management interface the hadoop components are much harder to manage to ensure consistency across a cluster.
The calculations of hardware resources to job slots/resource management can be quite an exercise in finding that "sweet spot" with your applications, a more transparent way of figuring this out would be welcome.
A lot of the roles and management pieces are written in java, which from an administration perspective can have there own issues with garbage collection and memory management.
Likelihood to Recommend
Hadoop is not for the faint of heart and is not a technology per se but an ecosystem of disparate technologies sitting on top of HDFS. It is certainly powerful but if, like me, you were handed this with no prior knowledge or experience using or administering this ecosystem the learning curve can be significant and ongoing having said that I don't think currently there are many other opensource technologies that can provide the flexibility in the "big data" arena especially for ETL or machine learning.
I have been working with Hadoop since last year. It is very user friendly. Hadoop was used by the data center management team. It allows distributed processing of huge amount of data sets across clusters of computers using simple programming models.
Pros
It is robust in the sense that any big data applications will continue to run even when individual servers fail.
Enormous data can be easily sorted.
Cons
It can be improved in terms of security.
Since it is open source, stability issues must be improved.
Likelihood to Recommend
Hadoop is really very useful when dealing with big data.
My organization uses Apache Hadoop for log analysis/data mining of data fetched from different practices in the US, Canada and India. It uses this data for showing analytical graphs and the progress of our software in those regions. Data from the practices is optimized and consumed by the customer applications. It provides faster performance and ease for data usage.
Pros
Hadoop is a very cost effective storage solution for businesses’ exploding data sets.
Hadoop can store and distribute very large data sets across hundreds of servers that operate, therefore it is a highly scalable storage platform.
Hadoop can process terabytes of data in minutes and faster as compared to other data processors.
Hadoop File System can store all types of data, structured and unstructured, in nodes across many servers
Cons
For now, Hadoop is doing great and is very productive.
Likelihood to Recommend
Hadoop is well suited for healthcare organizations that deal with huge amounts of data and optimizing data.
We utilize Hadoop primarily as a large data staging area for disparate corporate data. Select data is aggregated and moved downstream to a more formal data warehouse. Some data analytics is also performed directly against the Hadoop stored data. The direct analytics is done primarily with Apache Spark utilizing Scala and Python.
Pros
No requirement for schema on write.
Ability to scale to massive amounts of data.
Open platform provides multiple options and customizations to fit your exact needs.
Cons
The platform is still maturing and can be confusing to research and use. Basic tasks can still be manual and are not always user friendly.
Likelihood to Recommend
A big data problem doesn't always mean huge volumes of data. The other V's of big data (velocity and variety) are also important factors that may lead to selecting Hadoop as a platform.