Apache Airflow for Workflow Automation
Apache Airflow is a powerful open-source platform for orchestrating complex workflows, often used for creating and managing data pipelines. While Airflow is primarily Python-based, it can integrate with Java-based systems and tools, making it versatile for teams using Java in their tech stack. This article explores how to use Apache Airflow for workflow automation in Java-oriented environments, covering how to create and manage data pipelines, integrate with cloud services like AWS and GCP, and invoke Java applications, with Java-based examples where applicable.
1. Introduction to Apache Airflow
Apache Airflow allows users to define workflows as Directed Acyclic Graphs (DAGs), where each node represents a task, and edges define dependencies between tasks. Key features include:
- Dynamic Pipeline Generation: Pipelines are defined in Python code, making them dynamic and flexible.
- Extensibility: Airflow supports custom operators, hooks, and sensors to extend its functionality.
- Scalability: It can scale to handle thousands of tasks across multiple workers.
- Monitoring: Airflow provides a rich web interface for monitoring and managing workflows.
Official Documentation: https://airflow.apache.org/docs/
2. Creating and Managing Data Pipelines with Airflow
2.1. Defining a DAG
A DAG is a collection of tasks with dependencies. Below is an example of a simple DAG that runs three tasks sequentially:
```python
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from datetime import datetime


def print_hello():
    print("Hello, Airflow!")


default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 10, 1),
    'retries': 1,
}

dag = DAG(
    'example_dag',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval='@daily',
)

start_task = DummyOperator(task_id='start', dag=dag)
hello_task = PythonOperator(task_id='print_hello', python_callable=print_hello, dag=dag)
end_task = DummyOperator(task_id='end', dag=dag)

start_task >> hello_task >> end_task
```
2.2. Key Concepts
- Operators: Define individual tasks (e.g., PythonOperator, BashOperator, EmailOperator).
- Tasks: Instances of operators.
- Dependencies: Define the order of task execution using >> or set_downstream, as shown in the sketch below.
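As a quick illustration, here is a minimal sketch reusing the tasks from the DAG above; the bitshift syntax and set_downstream express the same ordering:
```python
# Bitshift syntax: start -> print_hello -> end
start_task >> hello_task >> end_task

# Equivalent explicit form using set_downstream
start_task.set_downstream(hello_task)
hello_task.set_downstream(end_task)
```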
2.3. Scheduling and Backfilling
Airflow allows you to schedule DAGs at specific intervals (e.g., daily, hourly). You can also backfill past data by running DAGs for historical dates.
Example:
```bash
airflow dags backfill example_dag --start-date 2023-10-01 --end-date 2023-10-07
```
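Whether Airflow automatically creates runs for past intervals is controlled by the DAG's catchup argument. The following is a minimal sketch (the DAG id is hypothetical): with catchup=False, past intervals are left to the backfill command shown above.
```python
from datetime import datetime

from airflow import DAG

# Illustrative sketch: with catchup=False, Airflow schedules only the most recent
# interval after start_date; historical intervals are run via "airflow dags backfill".
dag = DAG(
    'example_dag_no_catchup',   # hypothetical DAG id
    start_date=datetime(2023, 10, 1),
    schedule_interval='@daily',
    catchup=False,
)
```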
3. Integrating Airflow with Cloud Services
3.1. AWS Integration
Airflow provides hooks and operators to interact with AWS services like S3, Redshift, and EMR.
Example: Uploading a File to S3
```python
from airflow.providers.amazon.aws.operators.s3 import S3CreateObjectOperator

upload_task = S3CreateObjectOperator(
    task_id='upload_to_s3',
    aws_conn_id='aws_default',
    s3_bucket='my-bucket',
    s3_key='data/file.txt',
    data='Hello, S3!',
    dag=dag,
)
```
Example: Running an EMR Job
```python
from airflow.providers.amazon.aws.operators.emr import EmrCreateJobFlowOperator

emr_task = EmrCreateJobFlowOperator(
    task_id='create_emr_cluster',
    aws_conn_id='aws_default',
    job_flow_overrides={
        'Name': 'My EMR Cluster',
        'ReleaseLabel': 'emr-6.5.0',
        'Instances': {
            'InstanceGroups': [
                {
                    'Name': 'Master nodes',
                    'Market': 'ON_DEMAND',
                    'InstanceRole': 'MASTER',
                    'InstanceType': 'm5.xlarge',
                    'InstanceCount': 1,
                },
            ],
        },
    },
    dag=dag,
)
```
AWS Provider Documentation: https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/index.html
3.2. GCP Integration
Airflow supports integration with Google Cloud Platform (GCP) services like BigQuery, Cloud Storage, and Dataflow.
Example: Querying BigQuery
```python
from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator

bq_task = BigQueryExecuteQueryOperator(
    task_id='run_bigquery_query',
    sql='SELECT * FROM `my_dataset.my_table`',
    use_legacy_sql=False,
    dag=dag,
)
```
Example: Uploading to Google Cloud Storage
```python
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator

upload_to_gcs = LocalFilesystemToGCSOperator(
    task_id='upload_to_gcs',
    src='/path/to/local/file.txt',
    dst='file.txt',  # object name inside the bucket, not a gs:// URL
    bucket='my-bucket',
    dag=dag,
)
```
GCP Provider Documentation: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/index.html
4. Advanced Features
4.1. Custom Operators and Hooks
You can create custom operators and hooks to extend Airflow’s functionality.
Example: Custom Operator
```python
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class MyCustomOperator(BaseOperator):
    @apply_defaults
    def __init__(self, my_param, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.my_param = my_param

    def execute(self, context):
        print(f"Executing with parameter: {self.my_param}")


custom_task = MyCustomOperator(
    task_id='custom_task',
    my_param='Hello, Custom Operator!',
    dag=dag,
)
```
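A custom hook follows the same pattern by subclassing BaseHook. The sketch below is illustrative, not from the original article: the connection id my_service_default and the get_base_url method are assumptions; the hook simply reads connection details stored in Airflow.
```python
from airflow.hooks.base import BaseHook


class MyServiceHook(BaseHook):
    """Illustrative hook that reads an Airflow connection for a hypothetical service."""

    def __init__(self, conn_id='my_service_default'):
        super().__init__()
        self.conn_id = conn_id

    def get_base_url(self):
        # get_connection() looks up the connection configured via the Airflow UI or CLI
        conn = self.get_connection(self.conn_id)
        return f"http://{conn.host}:{conn.port}"
```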
4.2. XComs for Cross-Task Communication
XComs allow tasks to exchange small amounts of data.
Example: Using XComs
```python
def push_function(**kwargs):
    kwargs['ti'].xcom_push(key='my_key', value='my_value')


def pull_function(**kwargs):
    value = kwargs['ti'].xcom_pull(key='my_key')
    print(f"Pulled value: {value}")


push_task = PythonOperator(task_id='push_task', python_callable=push_function, dag=dag)
pull_task = PythonOperator(task_id='pull_task', python_callable=pull_function, dag=dag)

push_task >> pull_task
```
5. Real-World Examples
5.1. Data Pipeline for ETL
A common use case is an ETL (Extract, Transform, Load) pipeline. For example:
- Extract data from a database.
- Transform the data using Python.
- Load the data into a data warehouse like Redshift or BigQuery.
Example: ETL Pipeline
```python
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator


def transform_function(**kwargs):
    # Placeholder for the transformation logic applied between extract and load
    pass


extract_task = PostgresOperator(
    task_id='extract_data',
    sql='SELECT * FROM my_table',
    postgres_conn_id='postgres_default',
    dag=dag,
)

transform_task = PythonOperator(
    task_id='transform_data',
    python_callable=transform_function,
    dag=dag,
)

load_task = S3ToRedshiftOperator(
    task_id='load_to_redshift',
    schema='my_schema',
    table='my_table',
    s3_bucket='my-bucket',
    s3_key='data/file.csv',
    redshift_conn_id='redshift_default',
    dag=dag,
)

extract_task >> transform_task >> load_task
```
5.2. Machine Learning Pipelines
Airflow can orchestrate machine learning workflows, such as:
- Data preprocessing.
- Model training.
- Model evaluation and deployment.
Example: ML Pipeline
```python
from airflow.providers.amazon.aws.operators.sagemaker import SageMakerTrainingOperator

train_task = SageMakerTrainingOperator(
    task_id='train_model',
    config={
        'TrainingJobName': 'my-training-job',
        'AlgorithmSpecification': {
            'TrainingImage': '123456789012.dkr.ecr.us-west-2.amazonaws.com/my-algorithm',
            'TrainingInputMode': 'File',
        },
        'RoleArn': 'arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole',
        'InputDataConfig': [
            {
                'ChannelName': 'train',
                'DataSource': {
                    'S3DataSource': {
                        'S3DataType': 'S3Prefix',
                        'S3Uri': 's3://my-bucket/train/',
                        'S3DataDistributionType': 'FullyReplicated',
                    },
                },
            },
        ],
        'OutputDataConfig': {
            'S3OutputPath': 's3://my-bucket/output/',
        },
        'ResourceConfig': {
            'InstanceType': 'ml.m5.large',
            'InstanceCount': 1,
            'VolumeSizeInGB': 10,
        },
        'StoppingCondition': {
            'MaxRuntimeInSeconds': 3600,
        },
    },
    dag=dag,
)
```
6. Best Practices
Apache Airflow is an open-source platform designed to programmatically author, schedule, and monitor workflows. While Airflow is primarily Python-based, it can integrate with Java-based systems through custom operators, hooks, and Bash operators. This integration allows teams to leverage existing Java applications, microservices, and libraries within their Airflow workflows.
When working with Airflow and Java, it is crucial to follow best practices to ensure:
- Reliability: Workflows should be robust and handle failures gracefully.
- Maintainability: Code should be modular, well-documented, and easy to update.
- Scalability: Workflows should scale efficiently as the volume of tasks increases.
- Security: Sensitive data and credentials should be managed securely.
6.1 Clarifications
Before diving into the best practices, let’s clarify some key points:
- Java Integration: Airflow itself is Python-based, but Java applications can be invoked using the BashOperator or custom operators.
- Idempotency: Tasks should produce the same result even if executed multiple times.
- Modularity: Break down workflows into smaller, reusable components.
- Monitoring: Use Airflow’s UI and logging features to monitor workflows.
- Testing: Test workflows locally before deploying to production.
Category | Best Practice | Explanation | Example |
---|---|---|---|
Idempotency | Ensure tasks are idempotent. | Tasks should produce the same result even if executed multiple times. | A Java application that processes data should overwrite existing output files without errors. |
Modularity | Break workflows into smaller, reusable tasks. | Smaller tasks are easier to debug, reuse, and maintain. | Use separate tasks for data extraction, transformation, and loading. |
Error Handling | Implement robust error handling in Java applications. | Java applications should handle exceptions and log errors for debugging. | Use try-catch blocks in Java code to handle exceptions. |
Logging | Log task execution details in Java applications. | Logs help in debugging and monitoring workflows. | Use a logging framework like Log4j or SLF4J in Java applications. |
Security | Use Airflow’s connections and variables to manage credentials securely. | Avoid hardcoding sensitive information in DAGs or Java code. | Store AWS credentials in Airflow’s connection manager. |
Testing | Test Java applications and Airflow DAGs locally before deployment. | Local testing ensures that workflows work as expected in production. | Use unit tests for Java applications and test DAGs in a local Airflow environment. |
Version Control | Use version control for both Airflow DAGs and Java code. | Version control ensures traceability and collaboration. | Use Git to manage DAGs and Java code. |
Resource Management | Set appropriate resource limits for tasks. | Prevent resource contention and ensure efficient task execution. | Use executor_config to allocate resources like CPU and memory. |
Monitoring | Use Airflow’s UI and logging to monitor workflows. | Monitoring helps in identifying and resolving issues quickly. | Check task logs and DAG run history in the Airflow UI. |
Documentation | Document DAGs and Java applications thoroughly. | Documentation helps in onboarding new team members and maintaining workflows. | Add comments in DAGs and Java code, and maintain a README file for workflows. |
Dependency Management | Use dependency management tools for Java applications. | Manage dependencies efficiently to avoid conflicts. | Use Maven or Gradle for Java dependency management. |
Retries | Configure retries for tasks in Airflow. | Retries help in handling transient failures. | Set retries=3 in the DAG’s default arguments. |
Backfilling | Use Airflow’s backfilling feature for historical data processing. | Backfilling ensures that historical data is processed correctly. | Use airflow dags backfill to process past data. |
Environment Isolation | Use separate environments for development, testing, and production. | Isolating environments prevents conflicts and ensures stability. | Use Docker or Kubernetes to create isolated environments. |
Scalability | Use distributed executors like Celery or Kubernetes for scaling. | Distributed executors handle large volumes of tasks efficiently. | Configure Airflow with CeleryExecutor for distributed task execution. |
6.2 Examples of Best Practices in Action
1. Idempotency
A Java application that processes data should overwrite existing output files without errors. For example:
```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class DataProcessor {
    public static void main(String[] args) {
        String outputPath = "output/data.txt";
        try {
            // Overwrite any existing output so repeated runs yield the same result
            Files.write(Paths.get(outputPath), "Processed Data".getBytes(),
                    StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
        } catch (IOException e) {
            System.err.println("Error writing to file: " + e.getMessage());
        }
    }
}
```
2. Modularity
Break down a workflow into smaller tasks:
```python
from airflow.operators.bash import BashOperator

extract_task = BashOperator(
    task_id='extract_data',
    bash_command='java -jar /path/to/ExtractApp.jar',
    dag=dag,
)

transform_task = BashOperator(
    task_id='transform_data',
    bash_command='java -jar /path/to/TransformApp.jar',
    dag=dag,
)

load_task = BashOperator(
    task_id='load_to_warehouse',
    bash_command='java -jar /path/to/LoadApp.jar',
    dag=dag,
)

extract_task >> transform_task >> load_task
```
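If you prefer not to shell out via the BashOperator, the same jar invocation can be wrapped in a custom operator. This is only a sketch: the JavaJarOperator class and its parameters are illustrative, not part of Airflow's API.
```python
import subprocess

from airflow.models import BaseOperator


class JavaJarOperator(BaseOperator):
    """Illustrative operator that runs a Java application packaged as a jar."""

    def __init__(self, jar_path, jar_args=None, **kwargs):
        super().__init__(**kwargs)
        self.jar_path = jar_path
        self.jar_args = jar_args or []

    def execute(self, context):
        cmd = ['java', '-jar', self.jar_path, *self.jar_args]
        # check=True raises CalledProcessError on a non-zero exit code,
        # which makes Airflow mark the task as failed.
        subprocess.run(cmd, check=True)
```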
3. Error Handling
Use try-catch blocks in Java applications:
```java
public class DataProcessor {
    public static void main(String[] args) {
        try {
            // Process data
        } catch (Exception e) {
            System.err.println("Error processing data: " + e.getMessage());
            System.exit(1); // Exit with error code
        }
    }
}
```
4. Logging
Use Log4j for logging in Java applications:
```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class DataProcessor {
    private static final Logger logger = LogManager.getLogger(DataProcessor.class);

    public static void main(String[] args) {
        logger.info("Starting data processing...");
        try {
            // Process data
        } catch (Exception e) {
            logger.error("Error processing data: ", e);
        }
    }
}
```
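5. Retries
The Retries row in the table above maps directly onto DAG default arguments. A minimal sketch (the DAG id and retry values are illustrative):
```python
from datetime import datetime, timedelta

from airflow import DAG

# Illustrative sketch: retries and retry_delay in default_args apply to every
# task in the DAG, absorbing transient failures such as a brief network outage.
default_args = {
    'owner': 'airflow',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'java_etl_with_retries',   # hypothetical DAG id
    default_args=default_args,
    start_date=datetime(2023, 10, 1),
    schedule_interval='@daily',
)
```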
By following these best practices, you can ensure that your Apache Airflow workflows are reliable, maintainable, and scalable, even when integrating with Java-based systems. Whether you’re building ETL pipelines, orchestrating machine learning workflows, or managing cloud resources, these practices will help you get the most out of Airflow.
7. Resources
- Official Airflow Documentation: https://airflow.apache.org/docs/
- Airflow GitHub Repository: https://github.com/apache/airflow
- AWS Provider Documentation: https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/index.html
- GCP Provider Documentation: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/index.html
By leveraging Apache Airflow, you can automate and manage complex workflows efficiently, integrating seamlessly with cloud services like AWS and GCP.