Install python dependencies for apache airflow - airflow-scheduler

I am using Apache Airflow to run my DAGs.
I want to install the Python dependency requests==2.22.0.
My docker-compose file for the webserver, scheduler and Postgres is:
version: "2.1"
services:
postgres_airflow:
image: postgres:12
environment:
- POSTGRES_USER=airflow
- POSTGRES_PASSWORD=airflow
- POSTGRES_DB=airflow
ports:
- "5432:5432"
postgres_Service:
image: postgres:12
environment:
- POSTGRES_USER=developer
- POSTGRES_PASSWORD=secret
- POSTGRES_DB=service_db
ports:
- "5433:5432"
scheduler:
image: apache/airflow
restart: always
depends_on:
- postgres_airflow
- postgres_Service
- webserver
env_file:
- .env
volumes:
- ./dags:/opt/airflow/dags
command: scheduler
healthcheck:
test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
interval: 30s
timeout: 30s
retries: 3
webserver:
image: apache/airflow
restart: always
depends_on:
- pg_airflow
- pg_metadata
- tenants-registry-api
- metadata-api
env_file:
- .env
volumes:
- ./dags:/opt/airflow/dags
- ./scripts:/opt/airflow/scripts
ports:
- "8080:8080"
entrypoint: ./scripts/airflow-entrypoint.sh
healthcheck:
test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
interval: 30s
timeout: 30s
retries: 3
My dag file is:
import requests
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

default_args = {'owner': 'airflow',
                'start_date': datetime(2018, 1, 1)
                }

dag = DAG('download2',
          schedule_interval='0 * * * *',
          default_args=default_args,
          catchup=False)

def hello_world_py():
    requests.post(url)
    print('Hello World')

with dag:
    t1 = PythonOperator(
        task_id='download2',
        python_callable=hello_world_py,
        requirements=['requests==2.22.0'],
        provide_context=True,
        dag=dag
    )
The problems I am facing are:
I cannot use PythonVirtualenvOperator to install the requirements, because I am hitting the issue described in "Airflow log file exception" (see the related question below).
I cannot use something like:
build:
  args:
    PYTHON_DEPS: "requests==2.22.0"
because I do not have a Dockerfile in the build context; I am using the apache/airflow image directly.
I cannot mount ./requirements.txt:requirements.txt for initdb, because I am not running a separate initdb container; I just run airflow initdb from a script.
A solution to any of the above three problems would work (one possible direction is sketched below).
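To make the second approach concrete: since the stack uses the stock apache/airflow image, the dependency could be baked into a derived image. The following is only a minimal sketch under the assumption that a new Dockerfile may be added next to docker-compose.yml; it is not part of the original question.

# Dockerfile (new file, assumed to sit next to docker-compose.yml)
FROM apache/airflow
# --user keeps the install inside the airflow user's home, which the non-root image allows
RUN pip install --user --no-cache-dir requests==2.22.0

# docker-compose.yml excerpt: build locally instead of pulling the stock image
# (the same change would apply to the webserver service)
services:
  scheduler:
    build: .
    restart: always
    # remaining keys unchanged

With the package baked into the image, the DAG can keep using a plain PythonOperator and simply import requests at module level.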

Related

Project with Django, Docker, Celery, Redis giving error [MainProcess] cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused

I'm trying to create a Django project with Celery and Redis for the messaging service using docker-compose. I'm getting Cannot connect to amqp://guest:**@127.0.0.1:5672. I'm not using guest as a user anywhere or 127.0.0.1:5672, and amqp is for RabbitMQ, but I'm not using RabbitMQ. So I don't know whether my docker-compose volumes are set incorrectly for Celery to pick up the settings, where it is getting amqp from, or whether the broker is misconfigured.
docker-compose.yml:
version: '3'
# network
networks:
  data:
  management:
volumes:
  postgres-data:
  redis-data:
services:
  nginx:
    image: nginx
    ports:
      - "7001:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
      - ../static:/static
    command: [nginx-debug, '-g', 'daemon off;']
    networks:
      - management
    depends_on:
      - web
  db:
    image: postgres:14
    restart: always
    volumes:
      - postgres-data:/var/lib/postgresql/data/
      - ../data:/docker-entrypoint-initdb.d # import SQL dump
    environment:
      - POSTGRES_DB=link_checker_db
      - POSTGRES_USER=link_checker
      - POSTGRES_PASSWORD=passw0rd
    networks:
      - data
    ports:
      - "5432:5432"
  web:
    image: link_checker_backend
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      - DJANGO_LOG_LEVEL=ERROR
      - INITIAL_YAML=/code/initial.yaml
    volumes:
      - ../:/code
      - ../link_checker:/code/link_checker
      - ../link_checker_django/:/code/link_checker_django
      - ./settings.py:/code/link_checker_django/settings.py
    working_dir: /code
    command: >
      sh -c "
      python manage.py migrate --noinput &&
      python manage.py collectstatic --no-input &&
      python manage.py runserver 0.0.0.0:7000
      "
    networks:
      - data
      - management
    depends_on:
      - db
  redis:
    image: redis
    volumes:
      - redis-data:/data
    networks:
      - data
  celery-default:
    image: link_checker_backend
    volumes:
      - ../:/code
      - ../link_checker:/code/link_checker
      - ../link_checker_django/:/code/link_checker_django
      - ./settings.py:/code/link_checker_django/settings.py
    working_dir: /code/link_checker
    command: celery -A celery worker --pool=prefork --concurrency=30 -l DEBUG
    networks:
      - data
    depends_on:
      - db
      - redis
celery.py
import os

from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "link_checker_django.settings")

app = Celery("link_checker")
app.config_from_object("django.conf:settings")
app.conf.task_create_missing_queues = True
app.autodiscover_tasks()
settings.py
BROKER_URL = "redis://redis:6379/0"
CELERY_ACCEPT_CONTENT = ["json"]
CELERY_TASK_SERIALIZER = "json"
File structure:
link_checker_django/
  deploy/
    docker-compose.yml
  link_checker/
    celery.py
  link_checker_django/
    settings.py
  manage.py
Thanks for any help.
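No answer is recorded for this question here, but one certainty is that amqp://guest@127.0.0.1:5672// is Celery's built-in default broker, used when no broker setting gets picked up from the Django settings. A common Celery 4+/5 pattern, shown below purely as a hedged sketch rather than a confirmed fix for this project, is to load settings under the CELERY_ namespace and name the broker accordingly:

# celery.py - sketch assuming Celery 4+/5 conventions; not the project's confirmed fix
import os

from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "link_checker_django.settings")

app = Celery("link_checker")
# Read only settings prefixed with CELERY_, e.g. CELERY_BROKER_URL -> broker_url
app.config_from_object("django.conf:settings", namespace="CELERY")
app.autodiscover_tasks()

# settings.py
CELERY_BROKER_URL = "redis://redis:6379/0"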

Request to Django views that start Celery tasks time out

I'm deploying a Django app with Docker.
version: '3.1'
services:
  b2_nginx:
    build: ./nginx
    container_name: b2_nginx
    ports:
      - 1904:80
    volumes:
      - ./app/cv_baza/static:/static:ro
    restart: always
  b2_app:
    build: ./app
    container_name: b2_app
    volumes:
      - ./app/cv_baza:/app
    restart: always
  b2_db:
    container_name: b2_db
    image: mysql
    command: --default-authentication-plugin=mysql_native_password
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: -
      MYSQL_DATABASE: cvbaza2
    volumes:
      - ./db:/var/lib/mysql
      - ./init:/docker-entrypoint-initdb.d
  rabbitmq:
    container_name: b2_rabbit
    hostname: rabbitmq
    image: rabbitmq:latest
    ports:
      - "5672:5672"
    restart: on-failure
  celery_worker:
    build: ./app
    command: sh -c "celery -A cv_baza worker -l info"
    container_name: celery_worker
    volumes:
      - ./app/cv_baza:/app
    depends_on:
      - b2_app
      - b2_db
      - rabbitmq
    hostname: celery_worker
    restart: on-failure
  celery_beat:
    build: ./app
    command: sh -c "celery -A cv_baza beat -l info"
    container_name: celery_beat
    volumes:
      - ./app/cv_baza:/app
    depends_on:
      - b2_app
      - b2_db
      - rabbitmq
    hostname: celery_beat
    image: cvbaza_v2_b2_app
    restart: on-failure
  memcached:
    container_name: b2_memcached
    ports:
      - "11211:11211"
    image: memcached:latest
networks:
  default:
In this configuration, hitting any route that is supposed to start a task just hangs the request until it eventually times out. Example of such a route:
class ParseCSV(views.APIView):
    parser_classes = [MultiPartParser, FormParser]

    def post(self, request, format=None):
        path = default_storage.save("./internal/documents/csv/resumes.csv", File(request.data["csv"]))
        parse_csv.delay(path)
        return Response("Task has started")
Task at hand
@shared_task
def parse_csv(file_path):
    with open(file_path) as resume_file:
        file_read = csv.reader(resume_file, delimiter=",")
        for row in file_read:
            new_resume = Resumes(first_name=row[0], last_name=row[1], email=row[2],
                                 tag=row[3], university=row[4], course=row[5], year=row[6], cv_link=row[7])
            new_resume.save()
None of the docker containers produce an error. Nothing crashes, it just times out and fails silently. Does anyone have a clue where the issue might lie?
Are you checking the result of the Celery task?
result = parse_csv.delay(path)
task_id = result.id
Then somewhere else (perhaps another view):
from celery.result import AsyncResult

task = AsyncResult(task_id)
if task.ready():
    status_message = task.get()
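Not part of the original answer, but for illustration, that "somewhere else" could be a small status endpoint; the view name and URL parameter below are hypothetical:

# Hypothetical DRF status-check view; names are illustrative, not from the project
from celery.result import AsyncResult
from rest_framework import views
from rest_framework.response import Response

class ParseCSVStatus(views.APIView):
    def get(self, request, task_id, format=None):
        task = AsyncResult(task_id)
        if task.ready():
            # propagate=False keeps a failed task from re-raising inside the view
            return Response({"state": task.state, "result": task.get(propagate=False)})
        return Response({"state": task.state})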

Django + ElasticSearch + Docker - Connection Timeout no matter what hostname I use

I'm having issues connecting to my Elasticsearch container, and have had them since day 1.
First I was using elasticsearch as the hostname, then I tried the container name web_elasticsearch_1, and finally I set a static IP address for the container and passed it in my configuration file.
PyPI packages:
django==3.2.9
elasticsearch==7.15.1
elasticsearch-dsl==7.4.0
docker-compose.yml
version: "3.3"
services:
web:
build:
context: .
dockerfile: local/Dockerfile
image: project32439/python
command: python manage.py runserver 0.0.0.0:8000
volumes:
- .:/code
ports:
- "8000:8000"
env_file:
- local/python.env
depends_on:
- elasticsearch
elasticsearch:
image: elasticsearch:7.10.1
environment:
- xpack.security.enabled=false
- discovery.type=single-node
networks:
default:
ipv4_address: 172.18.0.10
settings.py
# Elasticsearch
ELASTICSEARCH_HOST = "172.18.0.10"
ELASTICSEARCH_PORT = 9200
service.py
from django.conf import settings
from elasticsearch import Elasticsearch, RequestsHttpConnection

es = Elasticsearch(
    hosts=[{"host": settings.ELASTICSEARCH_HOST, "port": settings.ELASTICSEARCH_PORT}],
    use_ssl=False,
    verify_certs=False,
    connection_class=RequestsHttpConnection,
)
traceback
HTTPConnectionPool(host='172.18.0.10', port=9200): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f1973ebd6d0>, 'Connection to 172.18.0.10 timed out. (connect timeout=5)'))
By default Docker Compose uses a bridge network to provision inter-container communication. You can read more about this network at the Debian Wiki.
What matters for you is that, by default, Docker Compose creates a hostname that equals the service name in the docker-compose.yml file. So update your file:
version: "3.3"
services:
web:
build:
context: .
dockerfile: local/Dockerfile
image: project32439/python
command: python manage.py runserver 0.0.0.0:8000
volumes:
- .:/code
ports:
- "8000:8000"
env_file:
- local/python.env
depends_on:
- elasticsearch
elasticsearch:
image: elasticsearch:7.10.1
environment:
- xpack.security.enabled=false
- discovery.type=single-node
And now you can connect with elasticsearch:9200 instead of 172.18.0.10 from your web container. For more info see this article.
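Concretely, with the service name doing the work of the hostname, the settings from the question would only need their host value changed; a minimal sketch reusing the question's variable names:

# settings.py
ELASTICSEARCH_HOST = "elasticsearch"  # Compose service name doubles as the hostname
ELASTICSEARCH_PORT = 9200

service.py can stay exactly as posted, since it already reads these two settings.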

Airflow log file exception

I am using apache airflow for running my dags.
I am getting an exception as:
*** Log file does not exist: /opt/airflow/logs/download2/download2/2020-07-26T15:00:00+00:00/1.log
*** Fetching from: http://fb3393f5f01e:8793/log/download2/download2/2020-07-26T15:00:00+00:00/1.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='fb3393f5f01e', port=8793): Max retries exceeded with url: /log/download2/download2/2020-07-26T15:00:00+00:00/1.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8ba66d7b70>: Failed to establish a new connection: [Errno 111] Connection refused',))
My docker compose file for webserver, scheduler and postgres is:
version: "2.1"
services:
postgres_airflow:
image: postgres:12
environment:
- POSTGRES_USER=airflow
- POSTGRES_PASSWORD=airflow
- POSTGRES_DB=airflow
ports:
- "5432:5432"
postgres_Service:
image: postgres:12
environment:
- POSTGRES_USER=developer
- POSTGRES_PASSWORD=secret
- POSTGRES_DB=service_db
ports:
- "5433:5432"
scheduler:
image: apache/airflow
restart: always
depends_on:
- postgres_airflow
- postgres_Service
- webserver
env_file:
- .env
volumes:
- ./dags:/opt/airflow/dags
command: scheduler
healthcheck:
test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
interval: 30s
timeout: 30s
retries: 3
webserver:
image: apache/airflow
restart: always
depends_on:
- pg_airflow
- pg_metadata
- tenants-registry-api
- metadata-api
env_file:
- .env
volumes:
- ./dags:/opt/airflow/dags
- ./scripts:/opt/airflow/scripts
ports:
- "8080:8080"
entrypoint: ./scripts/airflow-entrypoint.sh
healthcheck:
test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
interval: 30s
timeout: 30s
retries: 3
I am getting this exception while using the PythonVirtualenvOperator.
My dag file is:
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

default_args = {'owner': 'airflow',
                'start_date': datetime(2018, 1, 1)
                }

dag = DAG('download2',
          schedule_interval='0 * * * *',
          default_args=default_args,
          catchup=False)

def hello_world_py():
    return "data"

with dag:
    t1 = PythonOperator(
        task_id='download2',
        python_callable=hello_world_py,
        op_kwargs=None,
        provide_context=True,
        dag=dag
    )
env file:
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql://airflow:airflow@postgres_airflow:5432/airflow
AIRFLOW__CORE__FERNET_KEY=XXXX
AIRFLOW_CONN_METADATA_DB=postgres://developer:secret@postgres_Service:5432/service_db
AIRFLOW__VAR__METADATA_DB_SCHEMA=service_db
AIRFLOW__WEBSERVER__BASE_URL=http://0.0.0.0:8080/
I have also explicitly set AIRFLOW__CORE__REMOTE_LOGGING=False to disable remote logging, but I am still getting the exception.
I also tried placing everything inside the same bridge network. Nothing worked for me, though the DAG itself passes.
I also tried adding a service like this:
image: apache/airflow
restart: always
depends_on:
  - scheduler
volumes:
  - ./dags:/opt/airflow/dags
env_file:
  - .env
ports:
  - 8793:8793
command: worker
That did not work for me either.
You need to expose the worker log-server port (the worker_log_server_port setting in airflow.cfg, 8793 by default) in docker-compose, like:
worker:
  image: apache/airflow
  ...
  ports:
    - 8793:8793
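If that port has ever been changed from the default, the same setting can also be supplied through Airflow's AIRFLOW__{SECTION}__{KEY} environment-variable convention; the section shown below ([celery], as in the 1.10 configuration layout) is an assumption worth checking against your airflow.cfg:

# .env - hypothetical override; only needed if worker_log_server_port is not 8793
AIRFLOW__CELERY__WORKER_LOG_SERVER_PORT=8793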
Here is a slightly different approach I've seen a few folks use when running the scheduler and webserver in their own containers and using LocalExecutor (which I'm guessing is the case here):
Mount a host log directory as a volume into both the scheduler and webserver containers:
volumes:
  - /location/on/host/airflow/logs:/opt/airflow/logs
Make sure the user within the airflow containers (usually airflow) has permissions to read and write that directory. If the permissions are wrong you will see an error like the one in your post.
This probably won't scale beyond LocalExecutor usage though.
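Applied to the compose file in the question, that shared-volume idea would look roughly like the sketch below; the ./logs host path is an assumption, and the UID mentioned in the comment (50000, the airflow user in the official image) should be verified for the image tag in use.

services:
  scheduler:
    image: apache/airflow
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs   # same host directory as the webserver
  webserver:
    image: apache/airflow
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs   # webserver reads what the scheduler/tasks wrote

On the host, ./logs needs to be writable by that in-container user, e.g. chown -R 50000:0 ./logs.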

How to observe scheduled tasks output with Django celery and Docker

I have set up my Celery app for Django and am trying to test a simple periodic task in my Django app.
I have used RabbitMQ as the message broker and set up the broker_url by passing environment variables from the docker-compose.yml file as well.
I have the following structure in my docker-compose.yml file:
version: '3'
services:
  nginx:
    restart: always
    image: nginx:latest
    container_name: NGINX
    ports:
      - "8000:8000"
    volumes:
      - ./src:/src
      - ./config/nginx:/etc/nginx/conf.d
      - /static:/static
    depends_on:
      - web
  web:
    restart: always
    build: .
    container_name: DJANGO
    command: bash -c "python manage.py makemigrations && python manage.py migrate && gunicorn loop.wsgi -b 0.0.0.0:8000 --reload"
    depends_on:
      - db
    volumes:
      - ./src:/src
      - /static:/static
    expose:
      - "8000"
    links:
      - db
      - rabbit
  db:
    restart: always
    image: postgres:latest
    container_name: PSQL
  rabbit:
    hostname: rabbit
    image: rabbitmq:latest
    ports:
      # We forward this port for debugging purposes.
      - "5672:5672"
      # Here, we can access the rabbitmq management plugin.
      - "15672:15672"
    environment:
      - RABBITMQ_DEFAULT_USER=admin
      - RABBITMQ_DEFAULT_PASS=mypass
  # Celery worker
  worker:
    build: .
    command: bash -c "python manage.py celery worker -B --concurrency=1"
    volumes:
      - .:/src
    links:
      - db
      - rabbit
    depends_on:
      - rabbit
For testing purposes, I have created a scheduled task in settings.py as follows (scheduled every 10 seconds):
CELERYBEAT_SCHEDULE = {
    'schedule-task': {
        'task': 'myapp.tasks.test_print',
        'schedule': 10,  # in seconds
    },
}
and the tasks.py file inside myapp has the following code:
# -*- coding: utf-8 -*-
from celery.task import task

@task(ignore_result=True, max_retries=1, default_retry_delay=10)
def test_print():
    print("Print from celery task")
When I run the command docker-compose run web python manage.py celery worker -B --concurrency=1, the test_print task is executed and its output is printed.
But when I run docker-compose up --build, the following output is printed and the test_print task is not executed:
rabbit_1 | =INFO REPORT==== 15-Jun-2017::13:05:30 ===
rabbit_1 | accepting AMQP connection <0.354.0> (172.18.0.7:51400 -> 172.18.0.3:5672)
rabbit_1 | =INFO REPORT==== 15-Jun-2017::12:54:25 ===
rabbit_1 | connection <0.421.0> (172.18.0.6:52052 -> 172.18.0.3:5672): user 'admin' authenticated and granted access to vhost '/'
I am new to Docker. Any guidance will be very helpful.
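No answer is recorded for this question here, but one note on the title question of observing the output: with docker-compose up, the print from test_print (when the worker/beat actually runs it) goes to the worker service's log stream rather than the web service's, so it can be followed with:

docker-compose logs -f worker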