Shutting Off CELERY.BACKEND_CLEANUP on Amazon SQS - django

I'm using Django with Celery. I need to turn off the celery.backend_cleanup task that runs every day at 4 UTC. I've been looking through the documentation and can't find how to disable it. Below is my latest attempt:
celery.py
from __future__ import absolute_import, unicode_literals
from django.conf import settings
from celery import Celery
import os

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")

app = Celery('app')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()

app.conf.beat_schedule = {
    'backend_cleanup': {
        'task': 'celery.backend_cleanup',
        'schedule': None,
        'result_expires': None
    },
}
I don't want this to run. How can I stop it?
UPDATE:
I also tried adding this to settings.py
CELERYBEAT_SCHEDULE = {
    'backend_cleanup': {
        'task': 'celery.backend_cleanup',
        'schedule': 0,
        'result_expires': 0
    },
}
I know deleting the task from the database is an option, but if beat has to be restarted later it recreates backend_cleanup and starts running it again. I may not be the person maintaining this in the future, so I need this configured in the code, not deleted manually from the database.

Here are a few approaches you can try:
You could give the task a schedule so far in the future that it effectively never runs, e.g.:
from datetime import timedelta

app.conf.beat_schedule = {
    # Effectively disable the cleanup task by scheduling it to run every ~1000 years
    'backend_cleanup': {
        'task': 'celery.backend_cleanup',
        'schedule': timedelta(days=365 * 1000),
        'relative': True,
    },
}
You can try setting app.backend.supports_autoexpire = True, because this attribute is checked before the default backend_cleanup task is added (see the Celery source).
Alternatively, you can subclass your backend class and set supports_autoexpire = True on the subclass.
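A minimal sketch of that subclassing approach, assuming you use the django-celery-results database backend (swap in whichever result backend class you actually use; the module name is hypothetical):
# myapp/celery_backends.py
from django_celery_results.backends.database import DatabaseBackend

class AutoExpireDatabaseBackend(DatabaseBackend):
    # Celery skips adding the built-in backend_cleanup beat entry when the
    # result backend reports that it handles result expiry itself.
    supports_autoexpire = True
Celery should also accept a dotted path to a backend class, so you can point the app at the subclass, e.g. CELERY_RESULT_BACKEND = 'myapp.celery_backends.AutoExpireDatabaseBackend' in settings.py.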

Simply set the result_expires setting to None, but define it as a global setting in your settings.py, not as a setting bound to a particular task (e.g. backend_cleanup in your example). You could also set it on your app declaration directly.
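For example, a minimal sketch using the CELERY_ settings namespace configured in the celery.py above (or the equivalent direct assignment on the app):
# settings.py
CELERY_RESULT_EXPIRES = None  # results never expire, so beat has nothing to clean up

# or, directly on the app in celery.py
app.conf.result_expires = None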
The docs:
Default: Expire after 1 day.
Time (in seconds, or a timedelta object) for when after stored task tombstones will be deleted.
A built-in periodic task will delete the results after this time (celery.backend_cleanup), assuming that celery beat is enabled. The task runs daily at 4am.
A value of None or 0 means results will never expire (depending on backend specifications).
Note that this only works with the AMQP, database, cache, Couchbase, and Redis backends, and not on SQS!
The docs on SQS:
Warning
Don’t use the amqp result backend with SQS.
It will create one queue for every task, and the queues will not be collected. This could cost you money that would be better spent contributing an AWS result store backend back to Celery :)

You may be using django-celery-beat. If so, you'll have to delete the scheduled task from your database. You can do this through the Django admin panel.
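If you'd rather not click through the admin, a hedged sketch of the same deletion using the django-celery-beat models (this mirrors the admin-panel action; as the question notes, beat may recreate the entry on restart unless result_expires is also disabled):
from django_celery_beat.models import PeriodicTask

# delete (or disable) the beat entry for the cleanup task
PeriodicTask.objects.filter(task='celery.backend_cleanup').delete()
# or: PeriodicTask.objects.filter(task='celery.backend_cleanup').update(enabled=False)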

Related

MWAA can retrieve variable by ID but not connection from AWS Secrets Manager

We've set up AWS Secrets Manager as a secrets backend for Airflow (AWS MWAA) as described in their documentation. Unfortunately, nowhere is it explained where the secrets are to be found and how they are to be used. When I supply conn_id to a task in a DAG, we can see two errors in the task logs: ValueError: Invalid IPv6 URL and airflow.exceptions.AirflowNotFoundException: The conn_id redshift_conn isn't defined. What's even more surprising is that retrieving variables stored the same way with Variable.get('my_variable_id') works just fine.
The question is: Am I wrongly expecting that the conn_id can be directly passed to operators as SomeOperator(conn_id='conn-id-in-secretsmanager')? Must I retrieve the connection manually each time I want to use it? I don't want to run something like read_from_aws_sm_fn in the code below every time beforehand...
Btw, neither the connection nor the variable show up in the Airflow UI.
Having stored a secret named airflow/connections/redshift_conn (and, alongside it, one named airflow/variables/my_variable_id), I expect the connection to be found and used when constructing RedshiftSQLOperator(task_id='mytask', redshift_conn_id='redshift_conn', sql='SELECT 1'). But this results in the errors above.
I am able to retrieve the redshift connection manually in a DAG with a separate task, but I think that is not how SecretsManager is supposed to be used in this case.
The example DAG is below:
from airflow import DAG, settings, secrets
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
from airflow.models.baseoperator import chain
from airflow.models import Connection, Variable
from airflow.providers.amazon.aws.operators.redshift import RedshiftSQLOperator
from datetime import timedelta

sm_secret_id_name = f'airflow/connections/redshift_conn'

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(1),
    'retries': 1,
}

def read_from_aws_sm_fn(**kwargs):  # from AWS example code
    ### set up Secrets Manager
    hook = AwsBaseHook(client_type='secretsmanager')
    client = hook.get_client_type('secretsmanager')
    response = client.get_secret_value(SecretId=sm_secret_id_name)
    myConnSecretString = response["SecretString"]
    print(myConnSecretString[:15])
    return myConnSecretString

def get_variable(**kwargs):
    my_var_value = Variable.get('my_test_variable')
    print('variable:')
    print(my_var_value)
    return my_var_value

with DAG(
    dag_id=f'redshift_test_dag',
    default_args=default_args,
    dagrun_timeout=timedelta(minutes=10),
    start_date=days_ago(1),
    schedule_interval=None,
    tags=['example']
) as dag:
    read_from_aws_sm_task = PythonOperator(
        task_id="read_from_aws_sm",
        python_callable=read_from_aws_sm_fn,
        provide_context=True
    )  # works fine

    query_redshift = RedshiftSQLOperator(
        task_id='query_redshift',
        redshift_conn_id='redshift_conn',
        sql='SELECT 1;'
    )  # results in above errors :-(

    try_to_get_variable_value = PythonOperator(
        task_id='get_variable',
        python_callable=get_variable,
        provide_context=True
    )  # works fine!
The question is: Am I wrongly expecting that the conn_id can be directly passed to operators as SomeOperator(conn_id='conn-id-in-secretsmanager')? Must I retrieve the connection manually each time I want to use it? I don't want to run something like read_from_aws_sm_fn in the code below every time beforehand...
Using Secrets Manager as a backend, you don't need to change the way you use connections or variables. They work the same way: when looking up a connection or variable, Airflow follows a search path, checking the configured secrets backend before its metadata database.
Btw, neither the connection nor the variable show up in the Airflow UI.
The connection/variable will not show up in the UI.
ValueError: Invalid IPv6 URL and airflow.exceptions.AirflowNotFoundException: The conn_id redshift_conn isn't defined
The first error is related to the secret itself, and the second is raised because the connection isn't defined anywhere Airflow can find it.
There are two formats for storing connections in Secrets Manager (depending on the installed AWS provider version): a connection URI or a JSON dictionary. The Invalid IPv6 URL error could mean the connection isn't being parsed correctly. Here is a link to the provider docs.
The first step is defining the prefixes for connections and variables; if they are not defined, your secrets backend will not be checked for the secret:
secrets.backend_kwargs : {"connections_prefix" : "airflow/connections", "variables_prefix" : "airflow/variables"}
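For reference, a hedged sketch of the two configuration options together (on MWAA these go in the environment's Airflow configuration overrides; the class path assumes the Amazon provider's Secrets Manager backend):
secrets.backend : airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
secrets.backend_kwargs : {"connections_prefix" : "airflow/connections", "variables_prefix" : "airflow/variables"}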
Then for the secrets/connections, you should store them in those prefixes, respecting the required fields for the connection.
For example, for the connection my_postgress_conn:
{
    "conn_type": "postgresql",
    "login": "user",
    "password": "pass",
    "host": "host",
    "extra": '{"key": "val"}',
}
You should store it at the path airflow/connections/my_postgress_conn, with the JSON dict as a string.
And for the variables, you just need to store them in airflow/variables/<var_name>.
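A hedged sketch of storing such a connection and a variable with boto3 (the secret names and values are only illustrations; the prefixes must match the backend_kwargs above):
import json
import boto3

client = boto3.client('secretsmanager')

# store the connection as a JSON string under the connections prefix
client.create_secret(
    Name='airflow/connections/my_postgress_conn',
    SecretString=json.dumps({
        'conn_type': 'postgresql',
        'login': 'user',
        'password': 'pass',
        'host': 'host',
        'extra': json.dumps({'key': 'val'}),
    }),
)

# store a plain variable under the variables prefix
client.create_secret(Name='airflow/variables/my_variable_id', SecretString='some value')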

How to get Airflow user who manually trigger a DAG?

In the Airflow UI, one of the log events available under "Browse > Logs" is the "Trigger" event, along with the DAG ID and the owner/user responsible for triggering it. Is this information easily obtainable programmatically?
The use case is, I have a DAG that allows a subset of users to manually trigger the execution. Depending on the user who triggers the execution of this DAG, the behavior of code execution from this DAG will be different.
Thank you in advance.
You can directly fetch it from the Log table in the Airflow Metadata Database as follows:
from airflow.models.log import Log
from airflow.utils.db import create_session

with create_session() as session:
    results = session.query(Log.dttm, Log.dag_id, Log.execution_date, Log.owner, Log.extra) \
        .filter(Log.dag_id == 'example_trigger_target_dag', Log.event == 'trigger').all()

# Inspect one of the returned records
results[2]
Output:
(datetime.datetime(2020, 3, 30, 23, 16, 52, 487095, tzinfo=<TimezoneInfo [UTC, GMT, +00:00:00, STD]>),
'example_trigger_target_dag',
None,
'admin',
'[(\'dag_id\', \'example_trigger_target_dag\'), (\'origin\', \'/tree?dag_id=example_trigger_target_dag\'), (\'csrf_token\', \'IjhmYzQ4MGU2NGFjMzg2ZWI3ZjgyMTA1MWM3N2RhYmZiOThkOTFhMTYi.XoJ92A.5q35ClFnQjKRiWwata8dNlVs-98\'), (\'conf\', \'{"message": "kaxil"}\')]')
I will correct the previous answer a little:
with create_session() as session:
    results = session.query(Log.dttm, Log.dag_id, Log.execution_date, Log.owner, Log.extra) \
        .filter(Log.dag_id == 'dag_id', Log.event == 'trigger') \
        .order_by(Log.dttm.desc()).all()
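If the goal is to branch inside the DAG based on who triggered the current run (the use case in the question), here is a hedged sketch along the same lines, looking up the most recent trigger event for the DAG. The helper names and branching logic are hypothetical, not an official API:
from airflow.models.log import Log
from airflow.utils.db import create_session

def get_triggering_user(dag_id):
    # most recent 'trigger' log entry for this DAG; Log.owner holds the user name
    with create_session() as session:
        row = (
            session.query(Log.owner)
            .filter(Log.dag_id == dag_id, Log.event == 'trigger')
            .order_by(Log.dttm.desc())
            .first()
        )
    return row.owner if row else None

def my_task_callable(**kwargs):
    user = get_triggering_user(kwargs['dag'].dag_id)
    if user == 'admin':  # hypothetical branching logic
        print('running admin-only path')
    else:
        print('running default path for', user)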

Can not cache queryset in Django

In a view I have this cache, which is supposed to save some costly queries:
from django.core.cache import cache
LIST_CACHE_TIMEOUT = 120
....
topics = cache.get('forum_topics_%s' % forum_id)
if not topics:
    topics = Topic.objects.select_related('creator') \
        .filter(forum=forum_id).order_by("-created")
    print 'forum topics not in cache', forum_id  # Always printed out
    cache.set('forum_topics_%s' % forum_id, topics, LIST_CACHE_TIMEOUT)
I don't have a problem using this method to cache other queryset results and cannot think of a reason for this strange behavior, so I'd appreciate your hints.
I figured out what caused this: a memcached value cannot be larger than 1 MB.
So I switched to redis, and the problem was gone:
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://127.0.0.1:6379/1",
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
        }
    }
}
IMPORTANT: make sure that redis version is 2.6 or higher.
redis-server --version
In older versions of Redis, the key timeout parameter is apparently not recognized and an error is thrown. This tripped me up for a while because the default Redis on Debian 7 was 2.4.
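If you hit something similar, a quick hedged check is to read the key back immediately after setting it; with memcached, an oversized value is dropped silently, so the read returns None:
cache.set('forum_topics_%s' % forum_id, topics, LIST_CACHE_TIMEOUT)
if cache.get('forum_topics_%s' % forum_id) is None:
    # the backend refused the value (e.g. memcached's 1 MB limit) or it expired instantly
    print('cache write did not stick for forum', forum_id)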

APScheduler Not picking jobs using mongodbjobstore and django

Trying to schedule some jobs using APScheduler
Following is the apscheduler setup from settings.py
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.jobstores.mongodb import MongoDBJobStore
from apscheduler.executors.pool import ThreadPoolExecutor, ProcessPoolExecutor

# dup_client is an existing pymongo MongoClient; TIME_ZONE is Django's TIME_ZONE setting
jobstores = {
    'default': MongoDBJobStore(client=dup_client),
}
executors = {
    'default': ThreadPoolExecutor(20),
    'processpool': ProcessPoolExecutor(5)
}
job_defaults = {
    'coalesce': False,
    'max_instances': 3
}
scheduler = BackgroundScheduler(jobstores=jobstores, executors=executors, job_defaults=job_defaults, timezone=TIME_ZONE)
scheduler.start()
Adding Job:
job_id = scheduler.add_job(job_1, 'interval', seconds=10)
As mentioned in the documentation, the apscheduler database and jobs collection are created. However, the jobs are not executed.
I can see in-memory jobs getting executed; the problem is only with persistent job stores.

How do I create celery queues on runtime so that tasks sent to that queue gets picked up by workers?

I'm using Django 1.4, Celery 3.0, and RabbitMQ.
To describe the problem: I have many content networks in a system, and I want a queue for processing tasks related to each of these networks.
However, content is created on the fly while the system is live, so I need to create queues on the fly and have existing workers start picking them up.
I've tried scheduling tasks in the following way (where content is a django model instance):
queue_name = 'content.{}'.format(content.pk)
# E.g. queue_name = content.0c3a92a4-3472-47b8-8258-2d6c8a71e3ba
add_content.apply_async(args=[content], queue=queue_name)
This creates a queue named content.0c3a92a4-3472-47b8-8258-2d6c8a71e3ba, creates a new exchange named content.0c3a92a4-3472-47b8-8258-2d6c8a71e3ba with routing key content.0c3a92a4-3472-47b8-8258-2d6c8a71e3ba, and sends a task to that queue.
However I never see the workers picking up on these tasks. Workers that I have currently set up are not listening to any specific queues (not initialized with queue names) and pick up tasks sent to the default queue just fine. My Celery settings are:
BROKER_URL = "amqp://test:password@localhost:5672/vhost"
CELERY_TIMEZONE = 'UTC'
CELERY_ALWAYS_EAGER = False
from kombu import Exchange, Queue
CELERY_DEFAULT_QUEUE = 'default'
CELERY_DEFAULT_EXCHANGE = 'default'
CELERY_DEFAULT_EXCHANGE_TYPE = 'direct'
CELERY_DEFAULT_ROUTING_KEY = 'default'
CELERY_QUEUES = (
    Queue(CELERY_DEFAULT_QUEUE, Exchange(CELERY_DEFAULT_EXCHANGE),
          routing_key=CELERY_DEFAULT_ROUTING_KEY),
)
CELERY_CREATE_MISSING_QUEUES = True
CELERYD_PREFETCH_MULTIPLIER = 1
Any idea how I can get the workers to pick up on tasks sent to this newly created queue?
You need to tell the workers to start consuming the new queues. Relevant docs are here.
From the command line:
$ celery control add_consumer content.0c3a92a4-3472-47b8-8258-2d6c8a71e3ba
Or from within python:
>>> app.control.add_consumer('content.0c3a92a4-3472-47b8-8258-2d6c8a71e3ba', reply=True)
Both forms accept a destination argument, so you can tell individual workers only about the new queues if required.
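For example (the worker name below is just an illustration):
>>> app.control.add_consumer(
...     'content.0c3a92a4-3472-47b8-8258-2d6c8a71e3ba',
...     destination=['worker1@example.com'],  # only this worker starts consuming the queue
...     reply=True,
... )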
We can dynamically add queues and attach workers to them.
from celery import current_app as app
from task import celeryconfig  # your celeryconfig module
To dynamically define and route a task into a queue:
from task import process_data

process_data.apply_async(args, kwargs={}, queue='queue_name')
reply = app.control.add_consumer('queue_name', destination=('your-worker-name',), reply=True)
You have to keep the queue names in a persistent data store such as Redis so they can be restored when the worker restarts.
redis.sadd('CELERY_QUEUES', 'queue_name')
celeryconfig.py also reads the same set to remember the queue names:
CELERY_QUEUES = {
    'celery-1': {
        'binding_key': 'celery-1'
    },
    'gateway-1': {
        'binding_key': 'gateway-1'
    },
    'gateway-2': {
        'binding_key': 'gateway-2'
    }
}

for queue in redis.smembers('CELERY_QUEUES'):
    CELERY_QUEUES[queue] = dict(binding_key=queue)
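A hedged sketch of the restart side of this, re-attaching every remembered queue to a running worker (the Redis client, set name, and worker topology follow the answer above and are assumptions):
import redis
from celery import current_app as app

r = redis.Redis()

# after a (re)start, tell the workers to consume every queue we remembered
for queue_name in r.smembers('CELERY_QUEUES'):
    queue_name = queue_name.decode()  # redis returns bytes
    app.control.add_consumer(queue_name, reply=True)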