
Introduction to Airflow


Let's consider a scenario. You have a warehouse, and the maintenance costs are running high. The solution is to automate various processes by installing sensors to control the temperature, lighting, humidity, etc.

The sensors on their own do not run any logic and simply report the measurements at a certain frequency. Now, you have to collect the data from all the sensors, transform it into a specific format, and store it somewhere.

The data itself is useless if it just sits around and does nothing, so you do some data crunching and build a couple of models that minimize the losses by predicting the settings that should be applied to different controllers.

Now, you have to deploy the model somewhere, make it process the new sensor data, monitor multiple aspects (such as the incoming data or the ML model itself, among other things) of this increasingly complex system, and also have a notification system on top of that in case something goes terribly wrong.

Running the whole thing manually is too much of a burden. So, how do you tie so many diverse tasks into something more manageable, like a unified pipeline that can be launched with almost a single command? This is where a workflow management tool such as Airflow comes in.

The big picture

A workflow is a sequence of tasks, and a task is a unit of work to be performed. Examples of tasks include parsing a bunch of incoming JSONs, sending data to remote storage, executing an SQL query, etc. The workflow is represented as a directed acyclic graph (usually referred to as a DAG), where each task is a graph node. Below is a basic example of a DAG that sends a GET request to an API, validates the response, preprocesses it, and updates the database:

A sample DAG

Data validation ensures that the response code, headers, and the response body are as expected, e.g., checking for the presence of a specific key in the response body.
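
As a rough sketch, such a validation step could be implemented as a plain Python function that one of the tasks later calls (the expected key name here is hypothetical):

    import requests

    def validate_response(response: requests.Response) -> dict:
        # Make sure the API answered successfully and returned the expected JSON.
        if response.status_code != 200:
            raise ValueError(f"Unexpected status code: {response.status_code}")
        payload = response.json()
        if "measurements" not in payload:  # hypothetical required key
            raise ValueError("The 'measurements' key is missing from the response body")
        return payload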

A few things to note:

  • The DAG specifies the dependencies between the tasks and their execution order: ‘Validate data’ will run after ‘Send a GET request’, and ‘Process the data’ will run after the two previous tasks (a short code sketch of this chain follows the next figure). Moreover, the DAG should have a start and an end.
  • Acyclic means that there can't be a loop anywhere in the graph, so the graph below is not a valid DAG:

The invalid DAG structure
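
In Airflow's Python API, the task order of the first (valid) DAG above is typically declared with the bit-shift operators; a minimal sketch, assuming the four task objects are defined elsewhere in the DAG file:

    # Run the tasks strictly one after another: request -> validate -> process -> update.
    send_get_request >> validate_data >> process_data >> update_database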

Now, one might say: well, after updating the database I have to send another request, and so on. This is where the scheduler comes into the picture: instead of a loop inside the graph, the whole DAG is simply run again on a schedule. The scheduler triggers the workflow and schedules the tasks for execution. The execution part is a bit tricky because, in the simplest case, the tasks are actually executed inside the scheduler, so the scheduler is coupled with the executor:

A scheme of scheduler-executor coupling

The scheduler oversees all tasks and DAGs, triggers the task instances once all their dependencies have been fulfilled, and handles task failures. The executor, in turn, is the component the scheduler uses to actually run the tasks: its job is to manage resource usage and distribute the work.

The executor's logic always runs within the scheduler's process. What varies is where the tasks themselves are executed: locally (within the scheduler process) or remotely (in a distributed manner, in separate processes or on different machines). Local execution (everything runs on a single machine) is typically used for testing or in a resource-light environment, while remote execution is set up when one has to handle heavier workloads and make the workflow more fault-tolerant.
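
The choice of executor is a regular configuration setting; as a minimal sketch (assuming a default Airflow 2.x installation), the relevant part of airflow.cfg could look like this:

    [core]
    # SequentialExecutor is the default; LocalExecutor runs tasks in parallel on one
    # machine, while CeleryExecutor or KubernetesExecutor distribute them across workers.
    executor = LocalExecutor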

The architectural overview

Airflow primarily deals with the following components:

  • DAG. The DAG defines the entire process by specifying the order of and the dependencies between the tasks (this is done in a DAG file, which is an ordinary Python file). Once the DAG starts running (triggered by the scheduler), we get a DagRun Python object, which is an instance of the defined DAG that runs at a specified time with certain parameters (thus, DAG → template, DagRun → specific instance).
  • Task. As we have seen previously, a task is a unit of work that achieves a certain goal, and the means by which it achieves that goal is the Operator. So, a task is an invocation of an operator in a DAG (thus, Operator → template, Task → specific instance of the Operator).
  • Operator. Operators determine what actually gets done by a task, e.g., sending an HTTP request. An operator is a template for a task: in OOP terms, the operator is akin to a class, and the task is like a class instance created with specific parameters. Basic examples include calling a Python function (the PythonOperator class), sending an HTTP request (the SimpleHttpOperator class), or executing a PostgreSQL query (the PostgresOperator class).
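
To see how these pieces fit together, below is a minimal sketch of a DAG file for the GET-request pipeline from the first figure above, assuming Airflow 2.x; the URL, the expected key, and the function bodies are placeholders, and PythonOperator is used throughout for simplicity:

    from datetime import datetime

    import requests
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def send_get_request(**context):
        # Hypothetical endpoint; the parsed JSON body is pushed to XCom automatically.
        return requests.get("https://example.com/api/sensors").json()

    def validate_data(**context):
        payload = context["ti"].xcom_pull(task_ids="send_get_request")
        if "measurements" not in payload:  # hypothetical expected key
            raise ValueError("Unexpected response format")

    def process_data(**context):
        ...  # transform the measurements into the target format

    def update_database(**context):
        ...  # write the processed data to the database

    with DAG(
        dag_id="sensor_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@hourly",  # 'schedule' in Airflow 2.4+; older versions use 'schedule_interval'
        catchup=False,
    ) as dag:
        # Each task is an instance of an operator (the "class instance" from above).
        t1 = PythonOperator(task_id="send_get_request", python_callable=send_get_request)
        t2 = PythonOperator(task_id="validate_data", python_callable=validate_data)
        t3 = PythonOperator(task_id="process_data", python_callable=process_data)
        t4 = PythonOperator(task_id="update_database", python_callable=update_database)

        t1 >> t2 >> t3 >> t4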

Airflow relies on a database (such as PostgreSQL) as a central point of task coordination. Let's briefly go over what kind of data is stored there:

  • Metadata about the DAGs and tasks: the relationship between tasks and scheduling intervals (the scheduler uses the database to determine which tasks to run and when);
  • The state of all tasks, e.g., which tasks have run, when they were run, their status (success, failure, running, etc.), and any errors that might have occurred during the execution;
  • System logs.
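
The scheduler and the web server find this database through a connection string in Airflow's configuration; a minimal sketch, assuming a local PostgreSQL instance (in recent Airflow versions the option lives in the [database] section of airflow.cfg, in older ones under [core]):

    [database]
    # Hypothetical credentials, host, and database name.
    sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow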

Thus, in a very simple Airflow setup, the architecture looks like this:

The basic Airflow architecture

There are DAG files that define the pipeline (the order and the dependencies) and the tasks (what should be done); a scheduler process (when to run) coupled with an executor (which runs the tasks); a web server, which provides a user interface for monitoring various aspects of the workflow; and the database that stores the state.
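
With Airflow 2.x installed, a setup like this can be brought up locally with a handful of commands (a sketch; exact commands and defaults may differ slightly between versions):

    airflow db migrate    # initialize/upgrade the metadata database ('airflow db init' in older versions)
    airflow scheduler     # start the scheduler (with its built-in executor logic)
    airflow webserver     # start the web UI, served on port 8080 by default
    airflow standalone    # or: a single command for a quick local try-out that runs everything at once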

Usage scenarios

After running through the main components, let's look at some examples of using Airflow in real-life scenarios.

For instance, a standard ML pipeline from data loading to deployment might be represented like this as a DAG:

A sample DAG of the MLOps pipeline
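
In Airflow 2.x, such a pipeline could be sketched with the TaskFlow API, where decorated Python functions become tasks and the order of calls defines the dependencies; all the paths, the metric value, and the function bodies below are placeholders:

    from datetime import datetime

    from airflow.decorators import dag, task

    @dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
    def ml_pipeline():
        @task
        def load_data() -> str:
            return "/tmp/raw.csv"  # hypothetical path to the raw dataset

        @task
        def preprocess(raw_path: str) -> str:
            return "/tmp/features.csv"  # produce the training features

        @task
        def train(features_path: str) -> str:
            return "/tmp/model.pkl"  # train and persist the model

        @task
        def evaluate(model_path: str) -> float:
            return 0.93  # some target metric, e.g., accuracy

        @task
        def deploy(model_path: str, metric: float) -> None:
            if metric < 0.9:  # hypothetical quality gate
                raise ValueError("The model did not reach the target metric")
            # otherwise, push the model to the serving environment

        # Calling the decorated functions wires up the dependencies.
        raw = load_data()
        features = preprocess(raw)
        model = train(features)
        metric = evaluate(model)
        deploy(model, metric)

    ml_pipeline()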

Another example could be a smart home system with various sensors (such as humidity or light sensors) that saves the sensor measurements into InfluxDB (a time-series database with dashboards that is often used in smart home systems) and notifies the owner if a device breaks. A sample DAG, in this case, could be as follows:

A sample DAG of a smart home monitoring pipeline
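
A pipeline like this is likely to contain a branch: after the measurements are collected, one task stores them while another checks the devices' health and notifies the owner on failure. In Airflow, such a fan-out is expressed by passing a list of downstream tasks (a sketch with hypothetical task names):

    # Store the data and check the devices in parallel; notify only after the health check.
    collect_measurements >> [store_in_influxdb, check_device_health]
    check_device_health >> notify_owner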

Some of the most common Airflow usages include:

  • ETL pipelines: the first stage, "Extract," involves pulling or extracting data from various sources, e.g., databases, API endpoints, or local data stores. The "Transform" stage involves structuring or converting data into a more suitable format. The final "Load" stage involves loading or writing the transformed data into a final destination, which can be a database or a data warehouse.
  • MLOps: data preprocessing tasks are scheduled and their dependencies are handled automatically via Airflow. Models are then trained on the datasets produced by preprocessing and tested against target metrics. If the model's performance is satisfactory, it is deployed to production. The various stages, such as preprocessing, training, testing, and deployment, are all set up as separate tasks in the pipeline.

Conclusion

As a result, you are now familiar with:

  • The main concepts of Airflow (DAGs, Tasks, Operators, Scheduler, Executor);
  • The basic Airflow architecture;
  • Some of the most popular use cases of Airflow.