
Operators, hooks, and sensors in Airflow


In this topic, we will look at three components of Airflow — namely, Operators, Sensors, and Hooks.

Operator

Operators in Airflow represent single tasks in a workflow. They define what actually gets done by a task, such as executing a Python function, running a bash script, or running a command in a database. Different types of operators exist for different kinds of tasks: for example, PythonOperator executes arbitrary Python code, BashOperator runs bash commands, and so on. Each operator takes its own set of parameters to define its specific behavior. Airflow also allows creating custom operators, so users can define their own tasks, although writing your own operators is rather uncommon since Airflow provides quite a few out of the box. You should always check whether an operator already exists before writing your own.

Astronomer provides a registry for the available Airflow components, including operators, so it's always a good idea to check the registry to see whether there is a component already available for your task at hand.

The table below provides a description for some of the commonly used operators:

| Operator | Brief description | Notes |
| --- | --- | --- |
| PythonOperator (with variants) | Executes arbitrary Python code | Complicates dependency management, runs directly on the Airflow worker, should only be used for simple operations |
| BashOperator | Executes bash commands; creates tasks that run shell commands within the workflow | Similar caveats to PythonOperator |
| DockerOperator | Executes Docker images | Moves execution off the Airflow worker, makes versioning possible, eases dependency management, tasks become more independent |
| KubernetesPodOperator | Executes tasks with Docker images in standalone K8S pods | A very common and versatile operator in practice; caveats similar to DockerOperator |
| BaseOperator | The base class for defining operators | Should be used as the parent class for custom operators |

Using operators requires keeping a few things in mind. PythonOperator, PythonVirtualenvOperator, and ExternalPythonOperator should rarely be used in practice, except for very simple I/O operations; avoid writing complex or memory-intensive logic inside them. The reason is that the logic inside those operators puts load directly on the Airflow workers. Another issue is that when the logic is complex, you can lose track of what is actually happening in your code, which complicates debugging and maintenance of the DAGs. PythonOperators are also especially prone to introducing high coupling (tasks become heavily dependent on other tasks and their outputs), which makes it hard to test individual components and see what went wrong. BashOperator tends to suffer from similar issues.

DockerOperator is a nicer alternative when arbitrary Python (or, for that matter, any language) code has to be executed and the registry lacks a specific component. The images can be developed and tested outside of the DAG they run in. Also, just by virtue of managing images, DockerOperator tasks allow versioning. Another good thing is that each task (that is, each image) can have its own set of dependencies, which helps avoid dependency conflicts and eliminates the need to pollute the Airflow worker environment; DockerOperator tasks also do not run on the Airflow workers directly. Lastly, DockerOperator provides a more uniform interface for task management, compared to juggling many different operators, each with its own specific requirements and internal logic. Using Kubernetes (with the KubernetesPodOperator) is a widespread practice, since you are also managing Docker images with Kubernetes.

Operators can be non-deferrable and deferrable. The non-deferrable operator is a type of operator that executes tasks immediately when triggered. The deferrable operator is a type of operator that suspends itself and frees up worker resources while waiting for an external trigger to resume execution. The operator defers execution to an external system until a trigger event has occurred, which is particularly suitable for larger jobs. A trigger is an instruction to start executing a certain task or workflow. When certain conditions are met, such as a predetermined schedule being reached or the successful completion of a different task, a trigger is activated to start the task. The main types of triggers are Time (triggered at certain times), External (a task is triggered by an external event or process), and Status (a task is triggered based on the status of another task).

Sensor

A sensor is a type of operator that waits for a specific condition to be met. The condition could be a certain time of day, the presence of a file in a filesystem, a response to an HTTP request, or data (or a specific state) in a database. Sensors are useful in scenarios where a task's execution depends on an external system or service. They keep running until the specified condition is satisfied, consuming a worker slot in the process.

The main difference between a sensor and a deferrable operator is how they handle waiting. A sensor actively waits for a condition to become true, continuously using resources to check it, while a deferrable operator frees up resources while waiting for an external condition to be met. The latter is better optimized for idle tasks and hence more resource-efficient.

There are two modes for a sensor: poke and reschedule. poke is the default mode for a sensor. In this mode, the sensor operator keeps running and occupies a worker slot until the specified condition is met. This may lead to resource wastage because it blocks a worker from running other tasks. In the reschedule mode, the sensor operator frees the worker slot when the condition is not met. The operator will try to run at the next scheduled interval. This mode is more efficient as it allows the worker to execute other tasks between checks.

Some of the common sensors include HttpSensor (checks for the availability of a particular HTTP resource), SqlSensor (checks whether a particular SQL query produces any output), FileSensor (checks whether a file exists at a given location), TimeSensor (waits until a specified time), and ExternalTaskSensor (waits for a task in another DAG to complete). The full list of available sensors is, again, in the Astronomer registry.

Hook

A hook is a connection interface to many types of external platforms and databases. Hooks abstract away the methods needed to interact with external systems: with hooks, users do not need to deal with authentication or the specifics of connecting to each system when retrieving and sending data. Hooks thus simplify the process of connecting to, querying, and loading data from various systems, databases, and APIs.

Hooks are especially relevant in the following cases:

  • Hooks should always be used for connecting to external systems, in preference to manual API interactions. If a custom operator contains API-related logic, it should use a hook (instead of, say, calling requests directly inside a PythonOperator).

  • If an operator for your task already exists and comes with a built-in hook, just use that operator instead of writing hook logic from scratch.

You can look up the available hooks in the Astronomer registry.

Conclusion

To sum up, you are now familiar with operators, which contain the actual logic of the tasks; sensors, which wait for a specific condition (e.g., a time, or the presence of a specific file) to be met before the workflow proceeds; and hooks, which should be used for interactions with external systems.
