🌬️ Master Data Pipelines: Deploying Apache Airflow with Docker

In modern data engineering, data pipelines rarely consist of a single script. A typical workflow involves extracting data from an API, loading it into a cloud data lake, transforming it in a data warehouse, and finally triggering a machine learning inference job.

If any stage fails, you need retries, dependency management, and clear error monitoring.

Apache Airflow is a widely adopted open-source platform for workflow orchestration. It lets you author, schedule, and monitor workflows as Python code, with each workflow modelled as a Directed Acyclic Graph (DAG): a set of tasks connected by one-way dependencies that never loop back on themselves.
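
To make the DAG idea concrete, here is a minimal sketch that uses only Python's standard library (graphlib, Python 3.9+), not Airflow itself. The task names are illustrative and borrowed from the workflow described above. It shows how a dependency graph determines a valid run order, and why a cycle makes the graph unusable.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
# These names are illustrative; they are not Airflow identifiers.
dependencies = {
    "extract_from_api": set(),
    "load_to_data_lake": {"extract_from_api"},
    "transform_in_warehouse": {"load_to_data_lake"},
    "run_ml_inference": {"transform_in_warehouse"},
}


def execution_order(deps):
    """Return one valid run order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(deps).static_order())


def is_valid_dag(deps):
    """True if the dependency graph is acyclic, so a run order exists."""
    try:
        execution_order(deps)
        return True
    except CycleError:
        return False


if __name__ == "__main__":
    print(execution_order(dependencies))
    print(is_valid_dag({"a": {"b"}, "b": {"a"}}))
```

Airflow performs a comparable check when it parses a DAG file: a workflow whose tasks depend on each other in a loop cannot be scheduled.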

In this guide, we will break down the structural architecture of Airflow, deploy a full-scale instance locally using Docker Compose, and execute your first automated pipeline.

🏗️ The Core Airflow Architecture

Airflow is not a single monolithic application. It is a set of cooperating components, each of which can run as its own process or container, that work together to run your code:

[Figure: Apache Airflow architecture, showing the webserver, scheduler, worker, Flower, and metadata database. Source: airflow.apache.org]

  • The Webserver: The user interface. An interactive web app (a Flask application in Airflow 2.x) that lets you trigger DAGs, read execution logs, manage connections, and audit pipeline histories.
  • The Scheduler: The engine room. A continuous background service that parses your Python DAG files, checks schedules, and orchestrates tasks by handing them off to the queue.
  • The Executor & Workers: The muscle. The Executor defines how tasks get run (e.g., sequentially or distributed). In a standard production layout using the CeleryExecutor, the executor passes tasks to a Redis queue, and independent Worker daemons pull those tasks to run them.
  • The Metadata Database: The brain. A central database (typically PostgreSQL) that stores the state of every task, user and connection configuration, and the history of DAG runs.
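
The hand-off between these components can be sketched as a toy model in plain Python. This is not how Airflow is implemented: the in-memory queue stands in for Redis, the dictionary stands in for the metadata database, and every function name here is hypothetical. It only illustrates who does what.

```python
from queue import Queue

# Stand-in for the metadata database: task_id -> current state.
metadata_db = {}
# Stand-in for the Redis queue used with the CeleryExecutor.
task_queue = Queue()


def scheduler_enqueue(task_id, func):
    """Scheduler role: record the task as queued and hand it to the queue."""
    metadata_db[task_id] = "queued"
    task_queue.put((task_id, func))


def worker_run_all():
    """Worker role: pull tasks off the queue, run them, and record the outcome."""
    while not task_queue.empty():
        task_id, func = task_queue.get()
        metadata_db[task_id] = "running"
        try:
            func()
            metadata_db[task_id] = "success"
        except Exception:
            metadata_db[task_id] = "failed"


def broken():
    """A task that always fails, used to show the 'failed' state."""
    raise RuntimeError("simulated failure")


if __name__ == "__main__":
    scheduler_enqueue("extract", lambda: print("extracting..."))
    scheduler_enqueue("bad_task", broken)
    worker_run_all()
    print(metadata_db)
```

The point to notice is that the scheduler never runs task code itself. It only records state and enqueues work, and the webserver can show progress simply by reading the same state store.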

🛠️ Step 1: Initialize the Airflow Docker Architecture

A convenient way to run a multi-component stack like Airflow on your own machine is Docker Compose. The Airflow project publishes an official, pre-configured docker-compose.yaml file that sets up PostgreSQL, Redis, the Webserver, the Scheduler, and a Worker out of the box.

1. Fetch the Official Compose Manifest

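Create an empty project folder and download the official Compose file. The Airflow documentation presents this file as a quick-start setup for local development and learning, not as a production deployment: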

Bash

mkdir airflow-docker && cd airflow-docker

# Download the official docker-compose file
# ('stable' tracks the latest release; replace it with a version number to pin the file)
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'

2. Create the Directory Tree

The Compose file bind-mounts several local folders into the containers. Create them in the project root before starting any services:

Bash

mkdir -p ./dags ./logs ./plugins ./config

3. Initialize the Environment & Database

Before Airflow can accept jobs, it must create its schema in the PostgreSQL database. First record your host user ID in a .env file, so that files written to the mounted folders are owned by your user rather than root, then run the initialization service:

Bash

# Store the host user ID so files created in the mounted folders belong to your user
echo -e "AIRFLOW_UID=$(id -u)" > .env

# Run the database initialization container
docker compose up airflow-init

Once the screen reads User 'airflow' created with role 'Admin', your database layer is ready.
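
If you prefer to script the folder and .env preparation from the earlier steps, the following is one possible Python equivalent. The function name prepare_airflow_project is hypothetical. The fallback value 50000 is an assumption based on the default the official Compose file uses when AIRFLOW_UID is unset, and os.getuid() is only available on Linux and macOS.

```python
import os
from pathlib import Path


def prepare_airflow_project(root):
    """Create the bind-mounted folders and write .env with the host user ID."""
    root = Path(root)
    for name in ("dags", "logs", "plugins", "config"):
        (root / name).mkdir(parents=True, exist_ok=True)

    # os.getuid() does not exist on Windows; fall back to 50000 there
    # (assumed to match the Compose file's default for AIRFLOW_UID).
    uid = os.getuid() if hasattr(os, "getuid") else 50000

    env_file = root / ".env"
    env_file.write_text(f"AIRFLOW_UID={uid}\n")
    return env_file
```

Calling prepare_airflow_project("airflow-docker") would create the four folders and the .env file inside that directory. The function is safe to run more than once, although it overwrites any existing .env file.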

🛠️ Step 2: Spin Up the Infrastructure

Start all services in detached (background) mode:

Bash

docker compose up -d

Verify that every container is running and, once the health checks have had time to complete, reports a healthy status:

Bash

docker compose ps
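
The containers can take a minute or two to become ready. As an alternative to re-running the command by hand, here is a hedged sketch of a polling helper. It assumes the webserver exposes a JSON health endpoint at http://localhost:8080/health with "metadatabase" and "scheduler" entries that each carry a "status" field, which matches Airflow 2.x as far as I know. Other versions may use a different path or payload, so treat the URL and key names as assumptions to verify against your deployment.

```python
import json
import time
import urllib.request

# Assumption: Airflow 2.x style health endpoint; adjust for your version.
HEALTH_URL = "http://localhost:8080/health"


def fetch_health(url=HEALTH_URL):
    """Fetch and decode the health JSON; raises OSError if unreachable."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def is_healthy(payload):
    """True when both the metadata database and the scheduler report 'healthy'."""
    return all(
        payload.get(component, {}).get("status") == "healthy"
        for component in ("metadatabase", "scheduler")
    )


def wait_until_healthy(fetch=fetch_health, attempts=30, delay=5.0, sleep=time.sleep):
    """Poll until the stack is healthy; return False if it never gets there."""
    for _ in range(attempts):
        try:
            if is_healthy(fetch()):
                return True
        except (OSError, ValueError):
            pass  # server not reachable yet, or the response was not valid JSON
        sleep(delay)
    return False
```

Calling wait_until_healthy() with the defaults polls for up to roughly two and a half minutes. The fetch and sleep parameters exist only so the helper can be exercised without a running stack.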

Accessing the UI

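Open your browser and navigate to http://localhost:8080. Log in using the default development credentials. These are publicly known and not secure, so change them before exposing the instance beyond your own machine: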

  • Username: airflow
  • Password: airflow

🛠️ Step 3: Authoring a Sample DAG Pipeline

An Airflow pipeline is written as pure Python, where tasks are defined using Operators. Let’s build a workflow that simulates an ETL task pipeline.

Create a file named sample_etl.py inside your local ./dags directory:

Python

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

# 1. Define core Python functions for business logic
def extract_data():
    print("📥 Extracting data from third-party API gateway...")
    return {"status": "success", "records": 150}

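def transform_data(**context):
    # Pull the dict returned by the extract task (PythonOperator return values are stored as XComs)
    payload = context["ti"].xcom_pull(task_ids="extract_api_payload")
    print(f"⚙️ Cleaning records and mapping columns for payload: {payload}")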

# 2. Define default arguments for the DAG lifecycle
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# 3. Instantiate the DAG object context manager
with DAG(
    dag_id='company_sample_etl',
    default_args=default_args,
    description='A clean sample ETL pipeline demonstrating Docker orchestration.',
    schedule=timedelta(days=1),          # Runs once a day (Airflow 2.4+; older releases use schedule_interval)
    start_date=datetime(2026, 1, 1),     # Earliest date the scheduler considers for runs
    catchup=False,                       # Don't backfill runs missed between start_date and now
    tags=['production', 'etl'],
) as dag:

    # Task 1: Use a BashOperator to run terminal system commands
    start_pipeline = BashOperator(
        task_id='print_start_timestamp',
        bash_command='echo "Pipeline triggered at $(date)"'
    )

    # Task 2: Use a PythonOperator to invoke modular python functions
    extract_task = PythonOperator(
        task_id='extract_api_payload',
        python_callable=extract_data
    )

    # Task 3: Transform step, which runs after extraction completes
    transform_task = PythonOperator(
        task_id='transform_payload_metrics',
        python_callable=transform_data
    )

    # Task 4: Final confirmation check
    end_pipeline = BashOperator(
        task_id='pipeline_complete_signal',
        bash_command='echo "Pipeline completed successfully!"'
    )

    # 4. Set task dependencies: each >> makes the right-hand task run after the left-hand one
    start_pipeline >> extract_task >> transform_task >> end_pipeline
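
The >> chaining in the last line of sample_etl.py is ordinary Python operator overloading. The toy class below is a hypothetical stand-in, not Airflow's real operator class, and is meant only to show why a chain such as a >> b >> c reads left to right.

```python
class ToyTask:
    """Hypothetical stand-in for an operator; not Airflow's real class."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # `a >> b` records b as downstream of a, then returns b
        # so that `a >> b >> c` keeps chaining left to right.
        self.downstream.append(other)
        return other


def walk(task):
    """Follow a linear chain from `task` and return the task IDs in order."""
    order = [task.task_id]
    while task.downstream:
        task = task.downstream[0]
        order.append(task.task_id)
    return order


if __name__ == "__main__":
    start, extract, transform, end = (
        ToyTask(name)
        for name in (
            "print_start_timestamp",
            "extract_api_payload",
            "transform_payload_metrics",
            "pipeline_complete_signal",
        )
    )
    start >> extract >> transform >> end
    print(walk(start))
```

Because each >> returns its right-hand operand, the four tasks in the sample DAG form a strictly sequential chain: no task starts until the one before it has succeeded.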

🚀 Step 4: Run and Monitor Your DAG

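Because you placed the file in the mounted ./dags directory, Airflow's scheduler picks it up automatically. A new file can take anywhere from a few seconds to a few minutes to appear, depending on how often Airflow is configured to scan the folder.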

  1. Refresh your Airflow Web UI home dashboard. Look for the company_sample_etl ID inside the DAG list.
  2. Click the toggle switch on the left side of the row to change the DAG from Paused to Active.
  3. On the right side of the row, click the Play icon button and select Trigger DAG.
  4. Click the DAG name and open the Graph View or Grid View. The task blocks turn light green (running) and then dark green (success), one after another.
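
The colours in the last step correspond to task states. Combined with the retries setting from sample_etl.py, a task's attempts roughly follow the lifecycle sketched below. This is a simplified toy model, not Airflow's scheduler logic: it reuses the state names queued, running, up_for_retry, success, and failed, skips the waiting period that retry_delay would impose, and uses hypothetical helper names.

```python
def run_with_retries(task, retries=1):
    """Run `task` and return the sequence of states it passes through."""
    states = ["queued"]
    for attempt in range(retries + 1):
        states.append("running")
        try:
            task()
            states.append("success")
            return states
        except Exception:
            # Retry if attempts remain; otherwise the task is marked failed.
            states.append("up_for_retry" if attempt < retries else "failed")
    return states


def make_flaky(failures):
    """Build a task that raises on its first `failures` calls, then succeeds."""
    remaining = {"n": failures}

    def task():
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise RuntimeError("simulated transient failure")

    return task


if __name__ == "__main__":
    print(run_with_retries(make_flaky(0)))
    print(run_with_retries(make_flaky(1)))
    print(run_with_retries(make_flaky(2)))
```

With retries set to 1, as in the sample DAG, a task that fails once still ends in success, while a task that fails twice ends in failed and blocks everything downstream of it.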

Click any completed task block and select Log to see the output of your Python print calls directly in the Airflow UI.