TimeDeltaSensor delaying from schedule interval - airflow-scheduler

I have a job that runs at 13:30. Its first task takes almost 1 hour to complete, and after that we need to wait 15 minutes. So I am using TimeDeltaSensor like below.
waitfor15min = TimeDeltaSensor(
    task_id='waitfor15min',
    delta=timedelta(minutes=15),
    dag=dag)
However, the logs show that it is checking for schedule_interval + 15 minutes, as below:
[2020-11-05 20:36:27,013] {time_delta_sensor.py:45} INFO - Checking if the time (2020-11-05T13:45:00+00:00) has come
[2020-11-05 20:36:27,013] {base_sensor_operator.py:79} INFO - Success criteria met. Exiting.
[2020-11-05 20:36:30,655] {logging_mixin.py:95} INFO - [2020-11-05 20:36:30,655] {jobs.py:2612} INFO - Task exited with return code 0
How can I create a delay between tasks?

You could use a PythonOperator and write a function that simply waits 15 minutes. Here is an example of what a wait task could look like:
import time

from airflow.operators.python_operator import PythonOperator


def my_sleeping_function(random_base, **kwargs):
    """This is a function that will run within the DAG execution"""
    time.sleep(random_base)


# Generate 5 sleeping tasks, sleeping from 0.0 to 0.4 seconds respectively
for i in range(5):
    task = PythonOperator(
        task_id='sleep_for_' + str(i),
        python_callable=my_sleeping_function,
        op_kwargs={'random_base': float(i) / 10},
        provide_context=True,
        dag=dag,
    )

    run_this >> task
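For the original 15-minute use case, a single sleeping task could look like the sketch below (a minimal sketch; the first_task and next_task names are illustrative placeholders for your own tasks):

import time

from airflow.operators.python_operator import PythonOperator

# Hypothetical wait task: sleeps for 15 minutes inside a worker slot
waitfor15min = PythonOperator(
    task_id='waitfor15min',
    python_callable=lambda: time.sleep(15 * 60),
    dag=dag,
)

first_task >> waitfor15min >> next_task

Note that, like the sleep tasks above, this occupies a worker slot for the full 15 minutes while it sleeps.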

Related

Airflow Not triggering sla_miss_callback for my case

I have followed the documentation and previous Stack Overflow links, e.g. Airflow sla_miss_callback function not triggering.
Still, I am not able to trigger sla_miss_callback for my case.
I have a top-level DAG (DAG X1) with a schedule interval of 0 18 * * * -> run every day at 6 pm.
This calls a sub DAG (DAG Y1) using TriggerDagRunOperator.
Sub DAG Y1 does not have a schedule interval attached, so I cannot apply an SLA to it because of this piece of code - Airflow Code
So I have attached an SLA to my top-level DAG task exactly the same way as here - https://airflow.apache.org/docs/apache-airflow/2.3.1/concepts/tasks.html#slas
@dag(
    schedule_interval="0 18 * * *",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    sla_miss_callback=sla_callback,
    default_args={'sla': timedelta(seconds=60)},
)

def sla_callback(dag, task_list, blocking_task_list, slas, blocking_tis):
    print(
        "The callback arguments are: ",
        {
            "dag": dag,
            "task_list": task_list,
            "blocking_task_list": blocking_task_list,
            "slas": slas,
            "blocking_tis": blocking_tis,
        },
    )
In my DAG Processor log I do see this line:
{{processor.py:377}} INFO - Running SLA Checks for DAG X1, coming from this part of the Airflow Code
but after that I don't see any log related to SLAs, and the SLA callback function is not getting kicked off. My DAG run time is more than an hour.

Restarting celery and celery beat schedule relationship in django

Will restarting Celery cause all the periodic tasks (celery beat schedules) to be reset and start from the time Celery is restarted, or does it retain the schedule?
For example, assume I have a periodic task that gets executed at 12 pm every day. Now I restart Celery at 3 pm. Will the periodic task be reset to run at 3 pm every day?
How do you set up your task?
There are many ways to define a task schedule.
Example: run the tasks.add task every 30 seconds.
app.conf.beat_schedule = {
    'add-every-30-seconds': {
        'task': 'tasks.add',
        'schedule': 30.0,
        'args': (16, 16)
    },
}
app.conf.timezone = 'UTC'
This task runs every 30 seconds after beat starts.
Another example:
from celery.schedules import crontab

app.conf.beat_schedule = {
    # Executes every morning at 7:30 a.m.
    'add-every-monday-morning': {
        'task': 'tasks.add',
        'schedule': crontab(hour=7, minute=30),
        'args': (16, 16),
    },
}
This task runs at 7:30 a.m. every day, regardless of when beat was started.
You may check more schedule examples in the Celery documentation.
So the answer depends on how your schedule is defined.
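For the 12 pm case in the question, a crontab-based entry is tied to wall-clock time rather than to when beat was started. A minimal sketch (the entry and task names are illustrative):

from celery.schedules import crontab

app.conf.beat_schedule = {
    # Executes every day at 12:00 noon, independent of beat restarts
    'run-at-noon': {
        'task': 'tasks.daily_report',  # hypothetical task name
        'schedule': crontab(hour=12, minute=0),
    },
}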

Celery beat runs every minute instead of every 15 minutes

I'm configuring my Celery beat schedule as:
CELERYBEAT_SCHEDULE = {
    'update_some_info': {
        'task': 'myapp.somepath.update_some_info',
        'schedule': crontab(minute='*/15'),
    },
}
When checking what's actually stored for the crontab, it's indeed <crontab: */15 * * * * (m/h/d/dM/MY)>,
but my celery beat log indicates that the task is running every minute:
INFO 2020-01-06 13:21:00,004 beat 29534 139876219189056 Scheduler: Sending due task update_some_info (myapp.somepath.update_some_info)
INFO 2020-01-06 13:22:00,003 beat 29534 139876219189056 Scheduler: Sending due task update_some_info (myapp.somepath.update_some_info)
INFO 2020-01-06 13:23:00,004 beat 29534 139876219189056 Scheduler: Sending due task update_some_info (myapp.somepath.update_some_info)
INFO 2020-01-06 13:24:28,255 beat 29534 139876219189056 Scheduler: Sending due task update_some_info (myapp.somepath.update_some_info)
Why isn't celery beat picking up my schedule?

Celery rate-limiting: Is it possible to rate-limit a celery task differently based on a run-time parameter?

I would like to rate-limit a Celery task based on certain parameters that are decided at runtime. E.g., if the parameter is 1, the rate limit might be 100; if the parameter is 2, the rate limit might be 25. Moreover, I would like to be able to modify these rate limits at runtime.
Does Celery provide a way of doing this? I could use a routing_key to send tasks to different queues based on a parameter, but Celery doesn't appear to support queue-level rate limiting.
One possible solution would be to use eta while queueing up the task, but I was wondering if there is a better way of achieving this.
Celery provides a built-in rate-limit system, but it doesn't work the way most people expect a rate-limit system to work, and it has several limitations. I implemented a distributed rate-limiting system based on the ETA, like you mentioned, plus some Lua scripts on Redis. It worked quite well, so I would recommend that approach.
This article details a similar approach:
https://callhub.io/distributed-rate-limiting-with-redis-and-celery/
I used a much simpler version; my Lua script was just this:
local current_time = tonumber(ARGV[1])
local eta = tonumber(redis.call('get', KEYS[1]))
local interval = tonumber(ARGV[2])

if not eta or eta < current_time then
    redis.call('set', KEYS[1], current_time + interval, 'EX', 10800)
    return nil
else
    redis.call('set', KEYS[1], eta + interval, 'EX', 10800)
    return tostring(eta)
end
Then I simply had to override the task's apply_async method and call that Lua script with the delay I wanted:
def apply_async(self, *args, **kwargs):
    now = int(time.time())
    # From django-redis
    conn = get_redis_connection('default')
    cache_key = 'something'
    eta = conn.eval(self.rate_limit_script, 1, cache_key, now, rate_limiter.get_delay())
    if eta:
        eta = datetime.fromtimestamp(float(eta), tz=timezone.get_current_timezone())
        kwargs['eta'] = eta
    return super().apply_async(*args, **kwargs)
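A rough sketch of how this override might be wired up, assuming the Lua script is stored on a custom Task base class and the per-parameter delay comes from your own helper. The names RateLimitedTask, get_delay, and process_item are hypothetical, not Celery API, and the Django/Redis setup is assumed:

import time
from datetime import datetime

from celery import Celery, Task
from django.utils import timezone
from django_redis import get_redis_connection

app = Celery('sample')  # illustrative app instance

# The Lua script from above, stored as a plain string
RATE_LIMIT_LUA = """
local current_time = tonumber(ARGV[1])
local eta = tonumber(redis.call('get', KEYS[1]))
local interval = tonumber(ARGV[2])
if not eta or eta < current_time then
    redis.call('set', KEYS[1], current_time + interval, 'EX', 10800)
    return nil
else
    redis.call('set', KEYS[1], eta + interval, 'EX', 10800)
    return tostring(eta)
end
"""


class RateLimitedTask(Task):
    rate_limit_script = RATE_LIMIT_LUA

    def get_delay(self):
        # Hypothetical helper: minimum spacing between executions, in seconds;
        # in practice this could be derived from the runtime parameter in the question
        return 1.0

    def apply_async(self, *args, **kwargs):
        now = int(time.time())
        conn = get_redis_connection('default')   # from django-redis
        cache_key = 'rate-limit:%s' % self.name  # one Redis key per task name
        eta = conn.eval(self.rate_limit_script, 1, cache_key, now, self.get_delay())
        if eta:
            # Redis returned a future timestamp: schedule the task for then
            kwargs['eta'] = datetime.fromtimestamp(float(eta), tz=timezone.get_current_timezone())
        return super().apply_async(*args, **kwargs)


# Hypothetical usage: every enqueue of this task is spaced by get_delay() seconds
@app.task(base=RateLimitedTask)
def process_item(item_id):
    print(item_id)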
You can update the rate_limit at runtime within the part of your application that has access to the Celery app instance via celery_app.control.rate_limit().
./task.py
from celery import Celery

app = Celery("sample")

app.conf.update(
    broker_url='amqp://guest:guest@localhost:5672',
    task_annotations={
        'task.func1': {
            'rate_limit': '10/s'  # Default is 10 per second
        }
    },
)


@app.task
def func1(ctr):
    print(f"I have now processed task {ctr}")
./runner.py
import task

print(f"Current rate_limit is 10/s")

for ctr in range(7):
    print(f"Enqueue task {ctr}")
    task.func1.delay(ctr)

    if ctr == 3:
        choice = input("Let's update the rate limit setting [1/2]: ")
        if choice == "1":
            new_rate_limit = '1/m'
            print(f"Changing rate_limit to {new_rate_limit}")
            task.app.control.rate_limit('task.func1', new_rate_limit)
        elif choice == "2":
            new_rate_limit = '1/h'
            print(f"Changing rate_limit to {new_rate_limit}")
            task.app.control.rate_limit('task.func1', new_rate_limit)
        else:
            print("Retaining default rate_limit")
For simplicity, this example uses a plain runnable Python script as the caller of our Celery task. In real-life applications, this could be a Django view integrated with Celery, or something similar.
Execute the task listener (the consumer):
$ celery --app=task worker --loglevel=INFO
Execute the task caller (the producer):
$ python3 runner.py
Current rate_limit is 10/s
Enqueue task 0
Enqueue task 1
Enqueue task 2
Enqueue task 3
Let's update the rate limit setting [1/2]: 1
Changing rate_limit to 1/m
Enqueue task 4
Enqueue task 5
Enqueue task 6
Here, we can see that the first 4 runs have a rate of 10 per second. Then with a runtime input, we updated it to 1 per minute for the remaining 3 runs.
Logs of the task listener (the consumer):
[2021-04-30 10:35:44,006: INFO/MainProcess] Received task: task.func1[60600074-16ad-41b1-afbf-7a89da5af2f0]
[2021-04-30 10:35:44,007: INFO/MainProcess] Received task: task.func1[e93f9936-4d56-49a7-bb8b-757817235aa2]
[2021-04-30 10:35:44,007: WARNING/ForkPoolWorker-2] I have now processed task 0
[2021-04-30 10:35:44,008: INFO/ForkPoolWorker-2] Task task.func1[60600074-16ad-41b1-afbf-7a89da5af2f0] succeeded in 0.000337354000293999s: None
[2021-04-30 10:35:44,010: INFO/MainProcess] Received task: task.func1[c0c369c4-dbcf-43db-b79c-49d5866b136f]
[2021-04-30 10:35:44,010: INFO/MainProcess] Received task: task.func1[38b32102-7313-4e64-be77-f9565ce04683]
[2021-04-30 10:35:44,217: WARNING/ForkPoolWorker-3] I have now processed task 2
[2021-04-30 10:35:44,218: INFO/ForkPoolWorker-3] Task task.func1[c0c369c4-dbcf-43db-b79c-49d5866b136f] succeeded in 0.0006413599985535257s: None
[2021-04-30 10:35:44,217: WARNING/ForkPoolWorker-2] I have now processed task 1
[2021-04-30 10:35:44,219: INFO/ForkPoolWorker-2] Task task.func1[e93f9936-4d56-49a7-bb8b-757817235aa2] succeeded in 0.0021943179999652784s: None
[2021-04-30 10:35:44,726: WARNING/ForkPoolWorker-2] I have now processed task 3
[2021-04-30 10:35:44,727: INFO/ForkPoolWorker-2] Task task.func1[38b32102-7313-4e64-be77-f9565ce04683] succeeded in 0.00125738899987482s: None
[2021-04-30 10:35:44,809: INFO/MainProcess] New rate limit for tasks of type task.func1: 1/m.
[2021-04-30 10:35:44,810: INFO/MainProcess] Received task: task.func1[1acb9b7e-755e-4773-a3db-0a284c7024bb]
[2021-04-30 10:35:44,811: INFO/MainProcess] Received task: task.func1[b861a33a-0856-4044-a498-250c0da48d53]
[2021-04-30 10:35:44,811: WARNING/ForkPoolWorker-2] I have now processed task 4
[2021-04-30 10:35:44,812: INFO/ForkPoolWorker-2] Task task.func1[1acb9b7e-755e-4773-a3db-0a284c7024bb] succeeded in 0.0006612189990846673s: None
[2021-04-30 10:35:44,812: INFO/MainProcess] Received task: task.func1[e2e79f75-7628-4449-b880-e3a03020da7e]
[2021-04-30 10:36:44,892: WARNING/ForkPoolWorker-2] I have now processed task 5
[2021-04-30 10:36:44,892: INFO/ForkPoolWorker-2] Task task.func1[b861a33a-0856-4044-a498-250c0da48d53] succeeded in 0.00017851099983090535s: None
[2021-04-30 10:37:44,830: WARNING/ForkPoolWorker-2] I have now processed task 6
[2021-04-30 10:37:44,831: INFO/ForkPoolWorker-2] Task task.func1[e2e79f75-7628-4449-b880-e3a03020da7e] succeeded in 0.0007846450007491512s: None
Here, you could see that the first 4 tasks (with a rate of 10 per second) are all processed at 10:35:44 while the other 3 tasks (with the updated rate of 1 per minute) are processed at 10:35:44, 10:36:44, and 10:37:44 respectively.
Reference: https://docs.celeryproject.org/en/latest/userguide/workers.html#changing-rate-limits-at-run-time

Is it possible to avoid the 60-second limit in urllib2.urlopen with GAE?

I'm requesting a file of around 14 MB from a slow server with urllib2.urlopen. It takes more than 60 seconds to get the data, and I'm getting the error:
Deadline exceeded while waiting for HTTP response from URL:
http://bigfile.zip?type=CSV
Here is my code:
class CronChargeBT(webapp2.RequestHandler):
    def get(self):
        taskqueue.add(queue_name='optimized-queue', url='/cronChargeBTB')


class CronChargeBTB(webapp2.RequestHandler):
    def post(self):
        url = "http://bigfile.zip?type=CSV"
        url_request = urllib2.Request(url)
        url_request.add_header('Accept-encoding', 'gzip')
        urlfetch.set_default_fetch_deadline(300)

        response = urllib2.urlopen(url_request, timeout=300)
        buf = StringIO(response.read())
        f = gzip.GzipFile(fileobj=buf)
        # ...work with the data inside the file...
I created a cron task that calls CronChargeBT. Here is the cron.yaml:
- description: cargar BlueTomato
  url: /cronChargeBT
  target: charge
  schedule: every wed,sun 01:00
It creates a new task and inserts it into a queue. Here is the queue configuration:
- name: optimized-queue
  rate: 40/s
  bucket_size: 60
  max_concurrent_requests: 10
  retry_parameters:
    task_retry_limit: 1
    min_backoff_seconds: 10
    max_backoff_seconds: 200
Of course, the timeout=300 isn't working because of the 60-second limit in GAE, but I thought I could avoid it by using a task... Does anyone know how I can get the data in the file while avoiding this timeout?
Thanks a lot!
Cron jobs are limited to a 10-minute deadline, not 60 seconds. If your download fails, perhaps just retry? Does the download work if you download it from your computer? There's nothing you can do on GAE if the server you are downloading from is too slow or unstable.
Edit: According to https://cloud.google.com/appengine/docs/java/outbound-requests#request_timeouts, there is a maximum deadline of 60 seconds for cron job requests, so you can't get around it.