Airflow + Neo4j DAG
Running Apache Airflow with Docker: Deploy A Neo4j Workflow
Apache Airflow is a powerful tool for orchestrating complex workflows, and running it with Docker simplifies the setup and maintenance of the environment. In this guide, we will walk through setting up Apache Airflow using Docker, making sure that any changes to your DAGs (Directed Acyclic Graphs) are immediately reflected in the running environment.
That said, I don't think I'll use it much myself; for my use cases, GCP Cloud Run jobs feel like a better fit.
Prerequisites
Before we begin, make sure you have the following installed:
Docker
Docker Compose
Step 1: Setting Up Your Project
First, create a directory for your Airflow project. Inside it, create the subdirectories the containers will use (a dags folder for your DAG files, the usual logs and plugins folders, and a sql folder for the query files used later in this post) along with a docker-compose.yml for the Airflow services; the reference repository linked at the end has a complete compose file you can start from.
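If you would rather script that scaffolding, here is a minimal Python sketch; the folder names are simply the ones assumed above and can be adjusted to taste.

# scaffold.py - optional helper that creates the project layout used in this post
from pathlib import Path

for folder in ("dags", "logs", "plugins", "sql"):
    Path(folder).mkdir(parents=True, exist_ok=True)
    print(f"created {folder}/")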
Step 2: Creating a Sample DAG
Next, create a simple DAG in the dags directory to test your setup. The example below defines a DAG that chains two dummy tasks (the file and task names are just placeholders).
# dags/example_dag.py
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

dag = DAG("example_dag", start_date=datetime(2024, 1, 1), schedule_interval=None, catchup=False)
# Two dummy tasks chained together
task_1 = DummyOperator(task_id="task_1", dag=dag)
task_2 = DummyOperator(task_id="task_2", dag=dag)
task_1 >> task_2
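If you also have the apache-airflow package installed locally, a handy way to catch import errors before the scheduler does is to load the folder into a DagBag; this is a quick optional sketch, not something the Docker setup requires.

# validate_dags.py - optional local check that DAG files import cleanly
from airflow.models import DagBag

dag_bag = DagBag(dag_folder="dags", include_examples=False)
print("import errors:", dag_bag.import_errors)
print("dags found:", list(dag_bag.dag_ids))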
Step 3: Running Docker Compose
With everything set up, you can build and start your Airflow environment with Docker Compose, typically by running docker compose up -d from the project root.
Step 4: Accessing the Airflow Web Interface
Once the services are up and running, you can access the Airflow web interface by navigating to http://localhost:8089 in your web browser. Log in with the credentials defined in the airflow-init service of your compose file (admin/admin in this example).
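If you want to confirm from a script that the webserver is actually up before logging in, Airflow exposes a /health endpoint; here is a quick sketch with the requests library, assuming the port mapping used above.

# check_airflow_health.py - verify the webserver answers on the mapped port
import requests

response = requests.get("http://localhost:8089/health", timeout=10)
response.raise_for_status()
print(response.json())  # reports metadatabase and scheduler health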
Step 5: Modifying DAGs
You can freely modify your DAGs in the dags directory on your local machine. These changes will be automatically reflected in the running Airflow environment due to the volume mount configuration.
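For example, adding a third dummy task to the sample DAG from Step 2 and saving the file is enough for the scheduler to pick the change up on its next parse of the dags folder; a sketch of the modified file is below.

# dags/example_dag.py - the earlier example with a third task appended
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

dag = DAG("example_dag", start_date=datetime(2024, 1, 1), schedule_interval=None, catchup=False)
task_1 = DummyOperator(task_id="task_1", dag=dag)
task_2 = DummyOperator(task_id="task_2", dag=dag)
task_3 = DummyOperator(task_id="task_3", dag=dag)  # newly added task
task_1 >> task_2 >> task_3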
Step 6: Orchestrating a Neo4j Workflow
I put together a quick Neo4j workflow for reference; the file, query, and environment variable names below are just examples, so swap in your own.
# dags/neo4j_workflow.py
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from dotenv import load_dotenv  # requires the python-dotenv package in the Airflow image
from neo4j import GraphDatabase  # requires the neo4j driver package in the Airflow image

# Load environment variables from .env file
load_dotenv("/opt/airflow/.env")  # Path within the Docker container; adjust to wherever the .env file is mounted

# Define Neo4j connection details from environment variables
NEO4J_URI = os.getenv("NEO4J_URI")
NEO4J_USER = os.getenv("NEO4J_USER")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")

# Helper for reading a query file from disk
def read_query_file(path):
    with open(path) as query_file:
        return query_file.read()

# Function for creating relationships between contributors and subjects
def make_contributor_subject_relationships():
    cypher_query = read_query_file("/opt/airflow/sql/contributor_subject.cypher")  # Path within the Docker container (example file name)
    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))
    with driver.session() as session:
        session.run(cypher_query)  # assumes the file contains a single Cypher statement
    driver.close()

# Placeholder functions for future use
# def ingest_data():
#     sql_query = read_query_file('/opt/airflow/sql/ingest_data.sql')  # Path within the Docker container
#     # Execute your SQL query here using your database connection
#     print(sql_query)  # Replace this with actual query execution logic

# def transform_data():
#     sql_query = read_query_file('/opt/airflow/sql/transform_data.sql')  # Path within the Docker container
#     # Execute your SQL query here using your database connection
#     print(sql_query)  # Replace this with actual query execution logic

# Define the DAG structure
dag = DAG("neo4j_workflow", start_date=datetime(2024, 1, 1), schedule_interval=None, catchup=False)

# Define the task for making contributor-to-subject relationships
make_relationships_task = PythonOperator(
    task_id="make_contributor_subject_relationships",
    python_callable=make_contributor_subject_relationships,
    dag=dag,
)
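Before wiring this into Airflow, it can be worth sanity-checking the connection details on their own; the small sketch below reuses the same (assumed) environment variable names and just runs a trivial Cypher statement.

# check_neo4j.py - standalone sanity check for the Neo4j connection details
import os
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USER"), os.getenv("NEO4J_PASSWORD")),
)
with driver.session() as session:
    record = session.run("RETURN 1 AS ok").single()
    print("connection ok:", record["ok"])
driver.close()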
Reference Repository
For a complete example with all the necessary files, you can refer to the airflow-docker-example repository. Clone this repository to get started quickly.
Conclusion
Running Apache Airflow with Docker simplifies the setup and lets you develop and test your workflows efficiently. That said, I still prefer GCP Cloud Run. Provisioning a managed Airflow deployment on Google Cloud could be worth it for a large organization, but for my personal projects Cloud Run is the way to go.