How to Deploy Apache Airflow on Ubuntu 20.04

Updated on July 25, 2024

Introduction

Apache Airflow is an open-source data platform to author, schedule, and monitor workflows. Commonly referred to as Airflow, it's a flexible, scalable, and robust platform that helps manage and schedule ETL (Extract, Transform, and Load) data pipelines.

Airflow workflows are Directed Acyclic Graphs (DAGs) written in standard Python code. The Airflow scheduler runs each DAG automatically at its specified interval, which you can define using crontab syntax.

This article explains how to deploy Apache Airflow on an Ubuntu 20.04 server.
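
To make the crontab syntax concrete, the sketch below labels the five fields of a schedule expression and lists the preset aliases that Airflow documents as shorthand for common intervals. It is a plain-Python illustration rather than Airflow code, and the names `describe_cron`, `CRON_FIELDS`, and `PRESETS` are hypothetical helpers introduced here.

```python
# Illustrative only: labels the five crontab fields used in DAG schedules.
# describe_cron, CRON_FIELDS, and PRESETS are hypothetical names,
# not part of the Airflow API.

CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")

# Preset aliases documented by Airflow and their crontab equivalents.
PRESETS = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}


def describe_cron(expression):
    """Return a dict mapping each crontab field name to its value."""
    expression = PRESETS.get(expression, expression)
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}: {expression!r}")
    return dict(zip(CRON_FIELDS, parts))


if __name__ == "__main__":
    # "30 6 * * 1-5" means 06:30 on Monday through Friday.
    print(describe_cron("30 6 * * 1-5"))
    print(describe_cron("@daily"))
```

For example, a DAG scheduled with `30 6 * * 1-5` would run at 06:30 on weekdays, while `@daily` is equivalent to midnight every day.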

Prerequisites

Before you begin:

  1. Deploy an Ubuntu 20.04 server on Vultr.
  2. Set up a new domain A record pointing to the server IP address.
  3. Using SSH, access the server.
  4. Create a non-root sudo user, and switch to the account.
  5. Update the server.
  6. Install Nginx.

Install Apache Airflow

  1. Install the Python package manager and the virtual environment module.

     $ sudo apt-get install -y python3-pip python3-venv
  2. Create a new project directory.

     $ mkdir airflow-project
  3. Change to the directory.

     $ cd airflow-project
  4. Create a new virtual environment.

     $ python3 -m venv airflow-env
  5. Activate the virtual environment.

     $ source airflow-env/bin/activate

    Your terminal prompt should change as below:

     (airflow-env) user@example:~/airflow-project$ 
  6. Using pip, install Airflow.

     $ pip install apache-airflow
  7. Initialize the SQLite database that Airflow uses as its metadata store. In Airflow 2.7 and later, airflow db init is deprecated in favor of airflow db migrate.

     $ airflow db init

    Output:

     ...
     DB: sqlite:////root/airflow/airflow.db
     [2023-02-05 17:08:48,821] {migration.py:205} INFO - Context impl SQLiteImpl.
     [2023-02-05 17:08:48,822] {migration.py:208} INFO - Will assume non-transactional DDL.
     INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
     INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
     INFO  [alembic.runtime.migration] Running stamp_revision  -> ***
     WARNI [Airflow.models.crypto] empty cryptography key - values will not be stored encrypted.
     Initialization done
  8. Create the administrative user and password used to access Airflow.

     $ airflow users create --role Admin --username admin --email admin --firstname admin --lastname admin --password my-password
  9. Using nohup, start the Airflow scheduler in the background. The command redirects the scheduler output to the scheduler.log file, overwriting any existing content.

     $ nohup airflow scheduler > scheduler.log 2>&1 &

    The scheduler command starts the Airflow scheduler, which queues and runs the workflows defined in the DAG code.

  10. Start the Airflow web server on port 8080 in the background. The command redirects the web server output to the webserver.log file.

     $ nohup airflow webserver -p 8080 > webserver.log 2>&1 &
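
Before you put Nginx in front of Airflow, it can help to confirm that both background processes are working. The Airflow webserver exposes a /health endpoint that reports component status as JSON. The sketch below is a minimal check that assumes the webserver from the last step listens on localhost port 8080; the function names `fetch_health` and `unhealthy_components` are hypothetical helpers, not part of Airflow.

```python
# Minimal sketch: query the Airflow webserver's /health endpoint.
# Assumes the webserver started in the previous step listens on localhost:8080.
import json
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # assumption: default local address


def unhealthy_components(payload):
    """Return the sorted names of components that report status 'unhealthy'."""
    return sorted(
        name
        for name, info in payload.items()
        if isinstance(info, dict) and info.get("status") == "unhealthy"
    )


def fetch_health(url=HEALTH_URL, timeout=5):
    """Fetch and decode the JSON document served at the health endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    try:
        failing = unhealthy_components(fetch_health())
    except (OSError, ValueError) as exc:
        # Connection refused, timeout, HTTP error, or a non-JSON reply.
        print(f"Could not read {HEALTH_URL}: {exc}")
    else:
        print("All components healthy" if not failing else f"Unhealthy: {failing}")
```

If the scheduler is listed as unhealthy, the scheduler.log file is a reasonable first place to look.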

Configure Nginx as a Reverse Proxy to Serve Apache Airflow

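  1. Create a new Airflow Nginx configuration file in the conf.d directory. The default Nginx configuration on Ubuntu loads files that end in .conf from this directory.

     $ sudo touch /etc/nginx/conf.d/airflow.conf
  2. Using a text editor such as Nano, edit the file.

     $ sudo nano /etc/nginx/conf.d/airflow.conf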
  3. Add the following configurations to the file.

     server {
         listen 80;
         server_name app-online.example.com;
    
         location / {
             proxy_pass http://localhost:8080;
             proxy_set_header Host $host;
             proxy_set_header X-Real-IP $remote_addr;
             proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
             proxy_set_header X-Forwarded-Proto $scheme;
             proxy_set_header X-Frame-Options SAMEORIGIN;
             proxy_buffers 16 4k;
             proxy_buffer_size 2k;
             proxy_busy_buffers_size 4k;
         }
     }

    Replace app-online.example.com with your actual domain name.

    Save and close the file.

  4. Test the Nginx configuration for errors.

     $ sudo nginx -t
  5. Restart Nginx to load changes.

     $ sudo systemctl restart nginx

Security

To allow access to Apache Airflow through the Nginx reverse proxy, open the necessary HTTP and HTTPS firewall ports as described below.

  1. By default, Uncomplicated Firewall (UFW) is active on Vultr Ubuntu servers. Verify that the firewall is running.

     $ sudo ufw status
  2. Allow HTTP access on port 80.

     $ sudo ufw allow 80/tcp
  3. Allow HTTPS on port 443.

     $ sudo ufw allow 443/tcp
  4. Reload the firewall to apply the changes.

     $ sudo ufw reload

For more firewall configuration options, learn how to configure UFW on Ubuntu.
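
To confirm that the firewall rules took effect, you can test whether ports 80 and 443 accept TCP connections. The sketch below uses only the Python standard library; `port_is_open` is a hypothetical helper name, and the script defaults to 127.0.0.1 unless you pass a host as the first argument.

```python
# Minimal sketch: test whether TCP ports accept connections.
# port_is_open is a hypothetical helper, not part of UFW or Airflow.
import socket
import sys


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Pass your domain as the first argument; defaults to the local machine.
    host = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    for port in (80, 443):
        state = "open" if port_is_open(host, port) else "closed or filtered"
        print(f"{host}:{port} is {state}")
```

Run the script from a different machine and pass your domain name to test the firewall from the outside. Port 443 only accepts connections after Nginx is configured for HTTPS.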

Generate Let's Encrypt SSL Certificates

To secure your server, serve Apache Airflow over HTTPS. Install an SSL certificate to encrypt traffic between the application and its users, as described below.

  1. Install the Certbot Let's Encrypt Client.

     $ sudo snap install --classic certbot
  2. Activate the Certbot command.

     $ sudo ln -s /snap/bin/certbot /usr/bin/certbot
  3. Generate an SSL Certificate for your domain as set in the Nginx configuration file.

     $ sudo certbot --nginx --redirect -d app-online.example.com -m hello@example.com --agree-tos

    Replace app-online.example.com with your domain name, and hello@example.com with your actual email address.

  4. When successful, run a renewal test to verify that Certbot can automatically renew your certificate before it expires.

     $ sudo certbot renew --dry-run
  5. Restart Nginx to load changes.

     $ sudo systemctl restart nginx

For more Certbot configuration options, visit the Install Let's Encrypt SSL on Ubuntu page.
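
As an independent check on renewal, you can read the expiry date directly from the certificate that Nginx serves. The sketch below uses the standard library ssl module; `fetch_not_after` and `days_until_expiry` are hypothetical helper names, and the script expects your domain as its first argument.

```python
# Minimal sketch: report how many days remain on a server's TLS certificate.
# fetch_not_after and days_until_expiry are hypothetical helper names.
import socket
import ssl
import sys
import time


def days_until_expiry(not_after, now=None):
    """Convert a certificate 'notAfter' string into whole days remaining."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return int((expires_at - now) // 86400)


def fetch_not_after(hostname, port=443, timeout=5):
    """Open a verified TLS connection and return the certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    if len(sys.argv) > 1:
        host = sys.argv[1]
        print(f"{host}: {days_until_expiry(fetch_not_after(host))} days left")
    else:
        print("usage: python3 cert_days.py <domain>")
```

Let's Encrypt certificates are short-lived, so a value that keeps dropping toward zero suggests that automatic renewal is not running.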

Access Apache Airflow

  1. In a web browser such as Chrome, visit your configured domain to access the Airflow web interface.

     https://app-online.example.com

    Log in using the administrative username and password you created earlier.

How to Run a DAG on the Airflow Setup

Airflow provides sample DAGs that offer a great way to learn Airflow. To run the first DAG on your Airflow instance, follow the steps below.

  1. In your web browser, access the Airflow UI dashboard.

     https://app-online.example.com
  2. When logged in, find the list of default/starter DAGs on the dashboard.

  3. Click any DAG to open the detail page. For example: dataset_consumes_1.

  4. In the upper left corner, toggle the switch button to ON to activate the DAG.

  5. Find and click the play button, then select Trigger DAG from the drop-down menu to run the DAG.

You have activated and run your first DAG. Using the DAG, you can start customizing and building workflows to utilize Airflow's powerful features and components.
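
The Trigger DAG action also has a scripted equivalent in the Airflow 2.x stable REST API, which accepts a POST request to /api/v1/dags/{dag_id}/dagRuns. The sketch below only builds and prints such a request. It assumes that basic authentication is enabled for the API through the auth_backends setting, which this article does not configure, and `build_trigger_request` is a hypothetical helper name.

```python
# Minimal sketch: build a request that triggers a DAG run through the
# Airflow 2.x stable REST API. Assumes basic authentication is enabled for
# the API (auth_backends setting), which this article does not configure.
import base64
import json
import urllib.request


def build_trigger_request(base_url, dag_id, username, password, conf=None):
    """Return a urllib Request for POST /api/v1/dags/{dag_id}/dagRuns."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    body = json.dumps({"conf": conf or {}}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_trigger_request(
        "https://app-online.example.com",  # placeholder domain from this article
        "dataset_consumes_1",
        "admin",
        "my-password",
    )
    # Print instead of sending; pass the request to urllib.request.urlopen()
    # to actually trigger the run.
    print(request.get_method(), request.full_url)
```

As in the web interface, the DAG must be switched on before a triggered run executes.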

Conclusion

Airflow is widely used in the data engineering ecosystem, and many companies use it to manage their data pipelines. It suits ETL (Extract, Transform, Load) and related data engineering tasks, which makes it a useful addition to a data engineering toolkit. In this article, you deployed Airflow on a Vultr Ubuntu server. For more information and configuration options, visit the official Airflow documentation.