APScheduler not picking up jobs using MongoDBJobStore and Django

I'm trying to schedule some jobs using APScheduler.
The following is the APScheduler setup from settings.py:
from apscheduler.executors.pool import ProcessPoolExecutor, ThreadPoolExecutor
from apscheduler.jobstores.mongodb import MongoDBJobStore
from apscheduler.schedulers.background import BackgroundScheduler

# dup_client (a pymongo MongoClient) and TIME_ZONE are defined elsewhere in settings.py
jobstores = {
    'default': MongoDBJobStore(client=dup_client),
}
executors = {
    'default': ThreadPoolExecutor(20),
    'processpool': ProcessPoolExecutor(5)
}
job_defaults = {
    'coalesce': False,
    'max_instances': 3
}
scheduler = BackgroundScheduler(jobstores=jobstores, executors=executors,
                                job_defaults=job_defaults, timezone=TIME_ZONE)
scheduler.start()
Adding a job:
job_id = scheduler.add_job(job_1, 'interval', seconds=10)
As mentioned in the documentation, the apscheduler database and the jobs collection are created. However, the jobs are never executed.
In-memory jobs run fine; the problem only occurs with persistent job stores.

Related

aws_glue_trigger in Terraform creates an invalid expression schedule in AWS

I am trying to create an AWS Glue job trigger in Terraform, conditioned on a cron-triggered crawler having succeeded:
resource "aws_glue_trigger" "trigger" {
name = "trigger"
type = "CONDITIONAL"
actions {
job_name = aws_glue_job.job.name
}
predicate {
conditions {
crawler_name = aws_glue_crawler.crawler.name
crawl_state = "SUCCEEDED"
}
}
}
It applies cleanly, but in the job's Schedules property I get "Invalid expression" in the Cron column while the status is Activated. Of course it won't trigger because of that. What am I missing here?
I'm not sure if I understood the question correctly, but this is my Glue trigger configuration, which runs at a scheduled time, and it does fire at that time.
resource "aws_glue_trigger" "tr_one" {
name = "tr_one"
schedule = var.wf_schedule_time
type = "SCHEDULED"
workflow_name = aws_glue_workflow.my_workflow.name
actions {
job_name = var.my_glue_job_1
}
}
// Specify schedule time in UTC format to run glue workfows
wf_schedule_time = "cron(56 09 * * ? *)"
Please note that the schedule should be in UTC.
I had the same problem. Unfortunately I did not find an easy way to solve the 'invalid expression' issue using aws_glue_trigger alone, but I figured out a nice workaround using Glue workflows to achieve the same goal (triggering a Glue job after a crawler succeeds). I am not quite sure it is the best way to do it.
First I created a Glue workflow:
resource "aws_glue_workflow" "my_workflow" {
name = "my-workflow"
}
Then I created a scheduled trigger for my crawler (and removed the schedule from the Glue crawler it references):
resource "aws_glue_trigger" "crawler_scheduler" {
name = "crawler-trigger"
workflow_name = "my-workflow"
type = "SCHEDULED"
schedule = "cron(15 12 * * ? *)"
actions {
crawler_name = "my-crawler"
}
}
Lastly I created the final trigger for my Glue job, which runs after the crawler has succeeded. The important aspect here is that both triggers are linked to the same workflow, effectively linking the crawler and the job.
resource "aws_glue_trigger" "job_trigger" {
name = "${each.value.s3_bucket_id}-ndjson_to_parquet-trigger"
type = "CONDITIONAL"
workflow_name = "my-workflow"
predicate {
conditions {
crawler_name = "my-crawler"
crawl_state = "SUCCEEDED"
}
}
actions {
job_name = "my-job"
}
}
The Glue job still shows the 'invalid expression' message under the schedule label, but now you can successfully trigger the job just by running the scheduled trigger. On top of that, you even get a visualization of the workflow in Glue.

GCP Cloud Tasks: shorten period for creating a previously created named task

We are developing a GCP Cloud Tasks based queue process that sends a status email whenever a particular Firestore doc write-trigger fires. We use Cloud Tasks so that a delay can be introduced before the email is sent (a scheduledTime 2 minutes in the future), and so we can dedup (by using a task name formatted as [firestore-collection-name]-[doc-id]), since the write trigger on the Firestore doc can fire several times while the document is being created and then quickly updated by backend cloud functions.
Once the task's delay period has been reached, the Cloud Task runs and the email is sent with the updated Firestore document info included. After that the task is deleted from the queue and all is good.
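For clarity, here is a rough sketch of what the setup described above might look like with the Python Cloud Tasks client; the project, location, queue, the /notify handler, and the orders-abc123 task id are placeholders of mine, not the OP's actual values:
import datetime

from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path('my-project', 'us-central1', 'email-queue')

# Task name encodes [firestore-collection-name]-[doc-id] so duplicates are rejected
task_name = client.task_path('my-project', 'us-central1', 'email-queue', 'orders-abc123')

# Schedule the task roughly 2 minutes in the future
run_at = datetime.datetime.utcnow() + datetime.timedelta(minutes=2)
schedule_time = timestamp_pb2.Timestamp()
schedule_time.FromDatetime(run_at)

task = {
    'name': task_name,
    'schedule_time': schedule_time,
    'app_engine_http_request': {
        'http_method': 'POST',
        'relative_uri': '/notify',  # hypothetical email-sending handler
    },
}
# Keyword arguments work with both older and newer versions of the client
response = client.create_task(parent=parent, task=task)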
Except:
If the user updates the Firestore doc (say 20 or 30 minutes later), we want to resend the status email, but we are unable to create a task using the same task name. We get the following error:
409 The task cannot be created because a task with this name existed too recently. For more information about task de-duplication see https://cloud.google.com/tasks/docs/reference/rest/v2/projects.locations.queues.tasks/create#body.request_body.FIELDS.task.
This was unexpected, as the queue is empty at this point; the last task completed successfully. The documentation referenced in the error message says:
If the task's queue was created using Cloud Tasks, then another task with the same name can't be created for ~1 hour after the original task was deleted or executed.
Question: is there some way this restriction can be bypassed, either by lowering the amount of time or by removing the restriction altogether?
The short answer is no. As you've already pointed out, the docs are very clear about this behavior: you have to wait about an hour before creating a task with the same name as one that was previously created. Neither the API nor the client libraries allow you to decrease this time.
Having said that, I would suggest that instead of reusing the same task name, you use a different one each time and put the identifier in the body of the request. For example, using Python:
from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2
import datetime

def create_task(project, queue, location, payload=None, in_seconds=None):
    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path(project, location, queue)
    task = {
        'app_engine_http_request': {
            'http_method': 'POST',
            'relative_uri': '/task/' + queue
        }
    }
    if payload is not None:
        converted_payload = payload.encode()
        task['app_engine_http_request']['body'] = converted_payload
    if in_seconds is not None:
        d = datetime.datetime.utcnow() + datetime.timedelta(seconds=in_seconds)
        timestamp = timestamp_pb2.Timestamp()
        timestamp.FromDatetime(d)
        task['schedule_time'] = timestamp
    response = client.create_task(parent, task)
    print('Created task {}'.format(response.name))
    print(response)

# You can change DOCUMENT_ID to USER_ID or something else that identifies the task
create_task(PROJECT_ID, QUEUE, REGION, DOCUMENT_ID)
Facing a similar problem of needing to debounce multiple invocations of a Firestore write-trigger function, we worked around the default Cloud Tasks task-name based dedup mechanism (still a constraint as of Nov 2022) by building a small debounce "helper" using Firestore transactions.
We're using a helper collection _syncHelper_ to implement a delayed throttle for the side effects of write-trigger fires; in the OP's case, sending one email for all writes within 2 minutes.
In our case we are using the Firebase Functions task queue utilities rather than interacting with Cloud Tasks directly, but that's immaterial to the solution. The key is to determine the task's execution time in advance and use that as the "dedup key":
async function enqueueTask(shopId) {
  const queueName = 'doSomething';
  const now = new Date();
  const next = new Date(now.getTime() + 2 * 60 * 1000);
  try {
    const shouldEnqueue = await getFirestore().runTransaction(async t => {
      const syncRef = getFirestore().collection('_syncHelper_').doc(<collection_id-doc_id>);
      const doc = await t.get(syncRef);
      let data = doc.data();
      if (data?.timestamp.toDate() > now) {
        return false;
      }
      await t.set(syncRef, { timestamp: Timestamp.fromDate(next) });
      return true;
    });
    if (shouldEnqueue) {
      let queue = getFunctions().taskQueue(queueName);
      await queue.enqueue(
        { timestamp: next.toISOString() },
        { scheduleTime: next }
      );
    }
  } catch {
    ...
  }
}
This will ensure a new task is enqueued only if the "next execution" time has passed.
The execution operation (also a cloud function in our case) will remove the sync data entry if it hasn't been changed since it was executed:
exports.doSomething = functions.tasks.taskQueue({
  retryConfig: {
    maxAttempts: 2,
    minBackoffSeconds: 60,
  },
  rateLimits: {
    maxConcurrentDispatches: 2,
  }
}).onDispatch(async data => {
  let { timestamp } = data;
  await sendYourEmailHere();
  await getFirestore().runTransaction(async t => {
    const syncRef = getFirestore().collection('_syncHelper_').doc(<collection_id-doc_id>);
    const doc = await t.get(syncRef);
    let data = doc.data();
    if (data?.timestamp.toDate() <= new Date(timestamp)) {
      await t.delete(syncRef);
    }
  });
});
This isn't a bulletproof solution (if the doSomething() execution function has high latency, for example), but it's good enough for 99% of our use cases.

Shutting Off CELERY.BACKEND_CLEANUP on Amazon SQS

I'm using Django with Celery. I need to turn off the celery.backend_cleanup task that runs every day at 4 AM UTC. I've been looking through the documentation and can't find how to disable it. Below is my last try:
celery.py
from __future__ import absolute_import, unicode_literals
from django.conf import settings
from celery import Celery
import os

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")

app = Celery('app')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()

app.conf.beat_schedule = {
    'backend_cleanup': {
        'task': 'celery.backend_cleanup',
        'schedule': None,
        'result_expires': None
    },
}
I don't want this to run. How can I stop it?
UPDATE:
I also tried adding this to settings.py:
CELERYBEAT_SCHEDULE = {
    'backend_cleanup': {
        'task': 'celery.backend_cleanup',
        'schedule': 0,
        'result_expires': 0
    },
}
I know deleting the task from the database is an option, but if beat has to be restarted later it creates backend_cleanup again and starts running it. I may not be the person maintaining this in the future, so I need this configured in the code, not manually deleted from the database.
Here are a few approaches you can try:
You could use a schedule so far in the future that it effectively never runs, e.g.:
from datetime import timedelta

app.conf.beat_schedule = {
    # Disable the cleanup task by scheduling it to run only every ~1000 years
    'backend_cleanup': {
        'task': 'celery.backend_cleanup',
        'schedule': timedelta(days=365 * 1000),
        'relative': True,
    },
}
You can try setting app.backend.supports_autoexpire = True, because this attribute is checked before the default backend_cleanup task is added (see the Celery source).
Alternatively, you can subclass your backend class and set supports_autoexpire = True there. Both ideas are sketched below.
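A rough sketch of both ideas, assuming the app object from the question's celery.py and a database result backend (swap in whichever backend class you actually use); the flag has to be in place before beat builds its schedule:
# Option A: flip the flag on the already-configured backend instance
app.backend.supports_autoexpire = True

# Option B: subclass the result backend so the flag is always set
from celery.backends.database import DatabaseBackend  # assumed backend

class AutoexpireDatabaseBackend(DatabaseBackend):
    # beat skips installing celery.backend_cleanup when this is True
    supports_autoexpire = True

# then configure Celery to use this class as the result backend
# (how you register it depends on your setup)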
Simply set the result_expires setting to None, but define it as a global setting in your settings.py rather than a setting bound to a particular task (e.g. backend_cleanup in your example). You could also pass it into your app declaration directly. A minimal sketch follows.
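For that last approach, a minimal sketch (assuming the namespace='CELERY' prefix from the question's config_from_object call):
# settings.py -- with namespace='CELERY' this maps to the result_expires setting
CELERY_RESULT_EXPIRES = None

# or equivalently, in celery.py after creating the app
app.conf.result_expires = None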
The docs:
Default: Expire after 1 day.
Time (in seconds, or a timedelta object) for when after stored task tombstones will be deleted.
A built-in periodic task will delete the results after this time (celery.backend_cleanup), assuming that celery beat is enabled. The task runs daily at 4am.
A value of None or 0 means results will never expire (depending on backend specifications).
Note that this only works with the AMQP, database, cache, Couchbase, and Redis backends, and not on SQS!
The docs on SQS:
Warning
Don’t use the amqp result backend with SQS.
It will create one queue for every task, and the queues will not be collected. This could cost you money that would be better spent contributing an AWS result store backend back to Celery :)
You may be using django-celery-beat. If so, you'll have to delete the scheduled task from your database. You can do this through the Django admin panel, or in code as sketched below.
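If that is the case, a hedged sketch of doing the same thing in code (for example from a Django shell, data migration, or management command) instead of through the admin:
# Remove the persisted backend_cleanup entry from django-celery-beat's tables
from django_celery_beat.models import PeriodicTask

PeriodicTask.objects.filter(task='celery.backend_cleanup').delete()
As the question notes, beat may recreate the entry after a restart unless result_expires is also disabled, so combine this with one of the settings above.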

Disabling a database for testing when multiple databases configured

I have two databases configured in my app:
DATABASES = {
    'default': { ... },
    'legacy': { ... },
}
The legacy database is only used in a particular part of the app (I've added it as a second database for convenience).
This works fine, except that when I try to run tests, Django tries to create a test database for the legacy database as well, causing an error:
Got an error creating the test database: (1044, "Access denied for user ... to database 'test_...'")
How can I tell Django not to create a test database for the second legacy database?
I thought setting the following would work:
DATABASES['legacy']['TEST'] = {
    'NAME': None,
    'CREATE_DB': False
}
but that doesn't seem to help
This looks like a common issue with multiple databases and testing in Django. Here is how I generally deal with it.
import sys

DATABASES = {
    'default': { ... },
    'legacy': { ... },
}

# You can add any other kind of check here (not prod, DEBUG == True, etc.)
if "test" in sys.argv:
    del DATABASES["legacy"]
    # Or:
    # DATABASES = {"default": DATABASES["default"]}
This works great if you have only one settings file; you can easily adapt it for other setups.
If you are handling many databases, another option is to start from the ground up inside your test settings:
import sys

DATABASES = {
    'default': { ... },
    'legacy': { ... },
}

# You can add any other kind of check here (not prod, DEBUG == True, etc.)
if "test" in sys.argv:
    DATABASES = {"default": {}}
    DATABASES["default"]["ENGINE"] = "django.db.backends.sqlite3"
    # Etc... add what you need here.

How do I create celery queues at runtime so that tasks sent to that queue get picked up by workers?

I'm using Django 1.4, Celery 3.0, and RabbitMQ.
To describe the problem: I have many content networks in a system, and I want a queue for processing tasks related to each of these networks.
However, content is created on the fly while the system is live, so I need to create queues on the fly and have the existing workers start picking up tasks from them.
I've tried scheduling tasks in the following way (where content is a Django model instance):
queue_name = 'content.{}'.format(content.pk)
# E.g. queue_name = content.0c3a92a4-3472-47b8-8258-2d6c8a71e3ba
add_content.apply_async(args=[content], queue=queue_name)
This creates a queue named content.0c3a92a4-3472-47b8-8258-2d6c8a71e3ba, creates a new exchange named content.0c3a92a4-3472-47b8-8258-2d6c8a71e3ba with routing key content.0c3a92a4-3472-47b8-8258-2d6c8a71e3ba, and sends a task to that queue.
However, I never see the workers picking up these tasks. The workers I currently have set up are not listening to any specific queues (they are not initialized with queue names) and pick up tasks sent to the default queue just fine. My Celery settings are:
BROKER_URL = "amqp://test:password#localhost:5672/vhost"
CELERY_TIMEZONE = 'UTC'
CELERY_ALWAYS_EAGER = False
from kombu import Exchange, Queue
CELERY_DEFAULT_QUEUE = 'default'
CELERY_DEFAULT_EXCHANGE = 'default'
CELERY_DEFAULT_EXCHANGE_TYPE = 'direct'
CELERY_DEFAULT_ROUTING_KEY = 'default'
CELERY_QUEUES = (
Queue(CELERY_DEFAULT_QUEUE, Exchange(CELERY_DEFAULT_EXCHANGE),
routing_key=CELERY_DEFAULT_ROUTING_KEY),
)
CELERY_CREATE_MISSING_QUEUES = True
CELERYD_PREFETCH_MULTIPLIER = 1
Any idea how I can get the workers to pick up on tasks sent to this newly created queue?
You need to tell the workers to start consuming from the new queues; the relevant docs are the ones for the add_consumer remote control command.
From the command line:
$ celery control add_consumer content.0c3a92a4-3472-47b8-8258-2d6c8a71e3ba
Or from within Python:
>>> app.control.add_consumer('content.0c3a92a4-3472-47b8-8258-2d6c8a71e3ba', reply=True)
Both forms accept a destination argument, so you can tell only specific workers about the new queue if required.
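For example, assuming a worker node named worker1@example.com:
>>> app.control.add_consumer('content.0c3a92a4-3472-47b8-8258-2d6c8a71e3ba', destination=['worker1@example.com'], reply=True)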
We can dynamically add queues and attach workers to them.
from celery import current_app as app
from task import celeryconfig  # your celeryconfig module
To dynamically define a queue and route a task to it:
from task import process_data

process_data.apply_async(args, kwargs={}, queue='queue-name')
reply = app.control.add_consumer('queue-name', destination=('your-worker-name',), reply=True)
You have to keep the queue names in a persistent data store such as Redis, so you can restore them when the system restarts:
redis.sadd('CELERY_QUEUES', 'queue-name')
celeryconfig.py reads from the same store to remember the queue names (a consolidated sketch follows the config below):
CELERY_QUEUES = {
    'celery-1': {
        'binding_key': 'celery-1'
    },
    'gateway-1': {
        'binding_key': 'gateway-1'
    },
    'gateway-2': {
        'binding_key': 'gateway-2'
    }
}

for queue in redis.smembers('CELERY_QUEUES'):
    CELERY_QUEUES[queue] = dict(binding_key=queue)
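Putting the pieces together, a minimal sketch of the whole flow; the Redis connection details and the dispatch_to_new_queue helper are my own additions, not part of the original answer:
import redis
from celery import current_app as app
from task import process_data  # the example task from above

r = redis.StrictRedis(host='localhost', port=6379, db=0)

def dispatch_to_new_queue(queue_name, *args, **kwargs):
    # Persist the queue name so celeryconfig.py can rebuild CELERY_QUEUES on restart
    r.sadd('CELERY_QUEUES', queue_name)
    # Tell the running workers to start consuming from the new queue
    app.control.add_consumer(queue_name, reply=True)
    # Route the task to the new queue
    process_data.apply_async(args=args, kwargs=kwargs, queue=queue_name)

dispatch_to_new_queue('content.0c3a92a4-3472-47b8-8258-2d6c8a71e3ba')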