TrustRadius: an HG Insights company

Apache Airflow

Score: 8.6 out of 10

46 Reviews and Ratings

What is Apache Airflow?

Apache Airflow is an open-source tool that can be used to programmatically author, schedule, and monitor data pipelines using Python and SQL. Created at Airbnb as an open-source project in 2014, Airflow was brought into the Apache Software Foundation's Incubator Program in 2016 and announced as a Top-Level Apache Project in 2019. It is used as a data orchestration solution, with over 140 integrations and community support.

Categories & Use Cases

Top Performing Features

  • Multi-platform scheduling

    Multi-platform scheduling is the ability to centrally manage a business process from end-to-end

    Category average: 9.2

  • Central monitoring

    A central monitoring dashboard provides data on trends and forecasts

    Category average: 9

  • Logging

    Logging and audit trails to ensure regulatory compliance

    Category average: 8.6

Areas for Improvement

  • Alerts and notifications

    Alerts and notifications enabling management by exception

    Category average: 8.6

  • Analysis and visualization

Analysis and visualization tools provide a clear understanding of critical errors and help prioritize them

    Category average: 8.3

  • Application integration

    Integration with a broad range of enterprise applications

    Category average: 8.4

One-stop solution for all orchestration needs

Use Cases and Deployment Scope

I am part of the data platform team, where we are responsible for building the platform for data ingestion, an aggregation system, and the compute engines. Apache Airflow is one of the core systems, responsible for orchestrating pipelines and scheduled workflows. We have multiple deployments of Apache Airflow running for different use cases, each hosting 5,000 to 9,000 DAGs and executing even more DAG runs. Apache Airflow now also offers HA with scheduler replicas, which is a lifesaver, and it is well maintained by the community.

Pros

  • Apache Airflow is one of the best Orchestration platforms and a go-to scheduler for teams building a data platform or pipelines.
  • Apache Airflow supports multiple operators, such as the Databricks, Spark, and Python operators. All of these provide us with functionality to implement any business logic.
  • Apache Airflow is highly scalable, and we can run a large number of DAGs with ease. It provides HA and replication for workers. Maintaining Airflow deployments is very easy, even for smaller teams, and we also get lots of metrics for observability.

Cons

  • To achieve a production-ready deployment of Apache Airflow, you need some level of expertise. A repository of officially maintained sample Helm chart configurations would be handy for a new team.
  • As Airflow is used to build many data pipelines, a feature for building lineage from queries across different compute engines would help in developing a data catalog. Typically, multiple tools are required for this use case.
  • For building a data pipeline from upstream to downstream tables, using Airflow with lineage to trigger downstream DAGs after a recovery would be helpful. Additionally, native support for dependencies between DAGs would be beneficial.

Return on Investment

  • By using Apache Airflow, we were able to build the data platform and migrate our workloads out of Hevo Data.
  • Airflow currently powers the datasets for the entire company, supporting analytics backends, data science, and data engineering use cases.
  • We scaled from fewer than 1,000 to currently more than 8,000 DAG runs per day using HA and worker scaling.

Alternatives Considered

Databricks Data Intelligence Platform

Other Software Used

Apache Spark, Databricks Data Intelligence Platform, Trino

Scalable Scheduling Framework and Orchestration tool

Use Cases and Deployment Scope

We are using Apache Airflow as an orchestration tool for data engineering workflows in a gaming product.

We schedule multiple jobs, i.e., hourly, daily, weekly, and monthly.

We have many requirements for dependent jobs, i.e., job1 must run before job2, and Apache Airflow handles this very smoothly. We are utilising multiple Apache Airflow integrations with webhooks and APIs. Additionally, we do a lot of job monitoring and SLA-miss tracking via Apache Airflow features.
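The "job1 must run before job2" guarantee described here is, at its core, a topological ordering over the DAG: a task is only eligible once all of its upstream tasks have finished. A minimal plain-Python sketch of that ordering logic (this is an illustration of the concept, not Airflow's own code; the job names are hypothetical, and in an Airflow DAG file the same dependency would be declared as `job1 >> job2`):

```python
def run_in_dependency_order(deps):
    """deps maps each job to the set of jobs that must run first.

    Returns an execution order in which every job appears after
    all of its upstream dependencies (assumes the graph is acyclic).
    """
    order, done = [], set()

    def run(job):
        if job in done:
            return
        for upstream in deps.get(job, ()):  # finish prerequisites first
            run(upstream)
        order.append(job)                   # then the job itself
        done.add(job)

    for job in deps:
        run(job)
    return order


# job2 depends on job1; job3 depends on both
deps = {"job1": set(), "job2": {"job1"}, "job3": {"job1", "job2"}}
print(run_in_dependency_order(deps))  # job1 always precedes job2 and job3
```

A scheduler like Airflow layers retries, SLAs, and cron-style schedules on top of exactly this kind of ordering, which is why a failed upstream task blocks its downstream tasks rather than letting them run against missing data.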

Pros

  • Job scheduling
  • Dependent job workflows
  • Failure handling and rerun of workflows

Cons

  • The user interface could be better

Return on Investment

  • Good in job scheduling and dependency management between jobs
  • Robust framework to monitor jobs and alert in case of failure and SLA misses
  • Great integration with multiple open source tools

Alternatives Considered

Prefect

Other Software Used

DataHub, Grafana, Bitbucket

Apache Airflow, the master of schedulers and orchestrators

Use Cases and Deployment Scope

Apache Airflow is one of the best orchestrators on the market. It gives us the flexibility to orchestrate our data engineering workflows, with various levels of modification possible through Python programming. It allows us to connect with cloud providers like Google, AWS, and Azure, which enables teams to work in cross-cloud environments.

Pros

  • Provides Connection to different Cloud Providers
  • Good Access Management
  • Good user interface for users to interact with, e.g., if we need to pause a DAG, trigger it manually, mark any task as successful, etc.

Cons

  • A local "dry run" or IDE plugin that can validate and simulate DAG execution without needing a full environment.
  • Better feedback on DAG parse errors in the UI or CLI.
  • Navigating large DAGs with hundreds of tasks can be slow and hard to understand visually.

Return on Investment

  • Apache Airflow offers various options to interact with the different databases used by the business, since we get the flexibility to write in Python.
  • Since Apache Airflow requires Python programming, onboarding data pipelines takes time because it involves some development effort.
  • Apache Airflow makes monitoring easy for all stakeholders, as the business can see their pipelines running in the UI.

Alternatives Considered

AWS Step Functions

Other Software Used

Docker, Kubernetes, DBeaver

Apache Airflow for Startups

Use Cases and Deployment Scope

We used Airflow for analytics and reporting.

Pros

  • Reports
  • Sending Bulk Email/Notification
  • Processing from different data sources

Cons

  • Improve the GUI Control Panel
  • Provide more examples and documentation
  • Improvement in debugging

Return on Investment

  • Impact depends on the number of workflows. With many workflows, the implementation is justified, as it needs resources (dedicated VMs, a database) that carry a cost.
  • Do not use it if you have very few use cases.

Other Software Used

Apache Kafka, Redis™, PostgreSQL

Apache Airflow - Love the features, love the reliability... would love it if the UI got modernized!

Use Cases and Deployment Scope

We use Apache Airflow as our DAG scheduler and health monitoring tool. It serves as a core component in ensuring our scheduled jobs run, allows us to inspect job successes and failures, and acts as a troubleshooting tool in the event of job errors or failures. It has been a core tool, and we are happy with what it does.

Pros

  • Job scheduling - Pretty straightforward in terms of UI.
  • Job monitoring - Dashboard is as straightforward as it gets.
  • Troubleshooting jobs - ability to dive into detailed errors and navigate the job workflow.

Cons

  • The UI/dashboard could be made customisable, with jobs summarised in groups of errors/failures/successes instead of listed individually, so that a summary of errors can be used as a starting point for reviewing them.
  • Navigation - it's a bit dated and could do with more modern web navigation UX, i.e., sidebar navigation instead of relying on the browser's back/forward buttons.
  • Core functions could also use a UX reorganisation: navigation to them could be made direct rather than relying on discovery.

Return on Investment

  • It is a good workflow job scheduler.
  • It meets most, if not all, of our organization's product requirements.
  • Airflow's stability, in terms of product reliability, is unmatched.

Alternatives Considered

Jenkins and Apache Kafka

Other Software Used

Jenkins, Apache Kafka, Redis™