Calculate next_run at every run for a Schedule Object - django

I have a question about django-q that I could not find an answer to in its documentation.
Question: Is it possible to recalculate next_run at the end of every run?
The reason behind it: The Q cluster does not account for local times with DST (daylight saving time).
For example:
A schedule should run at 6 am German time.
In summer time, the schedule should be executed at 4 am (UTC).
In winter time, the schedule should be executed at 5 am (UTC).
To fix that, I wrote custom logic for the next run. This logic lives in the custom function.
I tried to retrieve the schedule object in the custom function and set next_run there.
The problem is: if I put the next_run logic before the second section ("other calculations"), it works, but if I place it after that section, it does not. The other calculations are not related to the schedule object.
from django_q.models import Schedule

def custom_function(**kwargs):
    # 1. some calculations not related to the schedule object
    # setting next_run at this point works
    related_schedule = Schedule.objects.get(id=kwargs["schedule_id"])
    related_schedule.next_run = ...
    related_schedule.save()
    # 2. some other calculations not related to the schedule object
    # setting next_run at this point does not work
This behaviour seems random, and I cannot explain it.
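For reference, the DST-aware value of next_run described above (6 am German time expressed in UTC) could be computed roughly as follows. This is only a sketch: next_local_run is a hypothetical helper, not part of django-q, and it assumes Python 3.9+ with the standard-library zoneinfo module and an available time zone database. It does not explain why the placement inside the function matters.

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

BERLIN = ZoneInfo("Europe/Berlin")


def next_local_run(now_utc, local_time=time(6, 0), tz=BERLIN):
    """Return the next occurrence of local_time in tz as an aware UTC datetime.

    Hypothetical helper for illustration; now_utc must be timezone-aware.
    """
    today_local = now_utc.astimezone(tz).date()
    # Try today's wall-clock time first, then fall back to tomorrow's.
    for day_offset in (0, 1):
        candidate = datetime.combine(
            today_local + timedelta(days=day_offset), local_time, tzinfo=tz
        ).astimezone(timezone.utc)
        if candidate > now_utc:
            return candidate
    raise RuntimeError("no upcoming run found")  # not expected for a daily time
```

Inside the custom function, the result would be assigned to related_schedule.next_run before calling save().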

Related

WARNING: Failed to add policy job since the add condition is not satisfied

I'm trying to schedule automatic recommendation and population by following this doc.
I'm trying to run this query
SELECT google_columnar_engine_add_policy( 'RECOMMEND_AND_POPULATE_COLUMNS', 'EVERY', 10, 'HOURS');
But this query fails. I've tried many other combinations of policy_interval, duration, time_unit, and it fails with the same error every time.
Only one case works, that is when policy_interval is "IMMEDIATE" but this is not what I'm after.
The basic steps to follow for configuration and usage are as follows:
Enable the columnar engine.
Let the engine's recommendation feature observe your workload and gather query statistics.
Size the engine's column store based on the recommendation feature's analysis.
Enable automatic population of the column store by the recommendation feature.
Let the recommendation feature observe your workload and automatically add columns to the column store.
The query you are trying to run schedules automatic recommendation and population. Its general form is:
SELECT google_columnar_engine_add_policy(
  'RECOMMEND_AND_POPULATE_COLUMNS',
  policy_interval, duration, time_unit
);
policy_interval: The time interval determining when the policy runs. You can specify these values:
'IMMEDIATE': The RECOMMEND_AND_POPULATE_COLUMNS operation runs immediately one time. When you use this value, specify 0 and 'HOURS' for the duration and time_unit parameters.
'AFTER': The RECOMMEND_AND_POPULATE_COLUMNS operation runs once when the duration time_unit amount of time passes.
'EVERY': The RECOMMEND_AND_POPULATE_COLUMNS operation runs repeatedly every duration time_unit amount of time.
duration: The number of time_units. For example, 24.
time_unit: The unit of time for duration. You can specify 'DAYS' or 'HOURS'.
Please check that these steps were followed, from setup to configuration, and try again. Also, as you mentioned, you do not have more specific error details, which makes it hard to pinpoint where this breaks. I would recommend checking the links below for reference.
https://cloud.google.com/alloydb/docs
https://cloud.google.com/alloydb/docs/faq
Hope that helps.

Writing SSM ParameterStore update once when many are trying from Lambda?

We've inherited an application that runs on Lambda. On initialization, this App reads a configuration stored on ParameterStore and if a certain value is out of date, then that value is updated and the app can continue.
The problem is that many users (about 7 concurrent at any one time) can run this app, and they could hit the same update nearly at the same time and therefore cause ParameterStore to throw a PutParameter throttle-back error.
To avoid that, and to update the ParameterStore value only once per day rather than on every run, we've devised a simple solution, and we'd like some advice on whether it makes sense.
Here are the steps we're thinking:
Random sleep between 1 and 10 seconds ("queue" up calls to minimize clashes -- perhaps this isn't necessary??)
Does an S3 object exist for today: obj_MMDDYYYY (signals that the PStore update has been done today)
If YES then skip updating ParameterStore value
If NO then create S3 obj_MMDDYYYY (today's date), run the Update ParameterStore value
Any advice appreciated.
Thanks!
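The steps above could be sketched like this. The three callables (marker_exists, create_marker, update_parameter) are hypothetical stand-ins for the S3 and ParameterStore calls, so the sketch makes no claim about the actual AWS APIs. Note that the check and the create are two separate steps, so two invocations that both see "NO" at the same moment could still both run the update; the random sleep only makes that less likely.

```python
import random
import time
from datetime import date


def marker_key(today):
    """Build the daily marker name in the obj_MMDDYYYY form from the steps above."""
    return today.strftime("obj_%m%d%Y")


def update_once_per_day(marker_exists, create_marker, update_parameter,
                        today=None, max_jitter_seconds=10.0, sleep=time.sleep):
    """Run update_parameter at most once per day, guarded by a marker object.

    All three callables are hypothetical placeholders for the real storage
    and ParameterStore calls. Returns True if the update was performed.
    """
    key = marker_key(today or date.today())
    # Step 1: random sleep to spread out concurrent invocations.
    sleep(random.uniform(1.0, max_jitter_seconds))
    # Step 2: if today's marker exists, the update has already been done.
    if marker_exists(key):
        return False
    # Step 3: create the marker first, then perform the update.
    create_marker(key)
    update_parameter()
    return True
```

Passing sleep as a parameter makes it possible to disable the jitter in tests.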

How can I schedule cloudwatch rule at second level?

I am trying to setup a cloudwatch rule to trigger a lambda based on this doc: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html.
What I'd like to do is trigger a lambda every 5 minutes, at the 3rd second of the minute. For example, I want it to run at:
00:00:03
00:05:03
00:10:03
...
but I can't find a solution to configure second level in the cron expression. Is there any solution to that?
Cron only allows a minimum granularity of one minute, so configuring seconds is not possible with a cron expression. You can take a hybrid approach: execute your lambda function every 5 minutes and handle the 3rd-second logic inside the function with a sleep.
import time

def lambda_handler(event, context):
    # Wait 3 seconds after the invocation starts
    time.sleep(3)
    # Now execute your logic
I think timing to the second is nearly impossible. It can possibly be approximated as follows:
Initiate every minute via a cron expression.
Defer execution of the processing logic using sleep (for 1-3 seconds) if the seconds portion of the current time is under 3.
Skip the entire processing logic if, at initiation time, the seconds portion of the current time is above some high threshold, such as 5x, if that suits the need. A threshold of 59 means never skip.
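The defer-or-skip rule above could look roughly like this. The function name seconds_to_wait and its parameters are made up for illustration, and the skip threshold of 50 in the handler is an arbitrary example value.

```python
import time
from datetime import datetime, timezone


def seconds_to_wait(current_second, target_second=3, skip_above=59):
    """Return how long to sleep before running, or None to skip this invocation.

    Illustrative helper: with skip_above=59 nothing is ever skipped.
    """
    if current_second > skip_above:
        return None  # started too late in the minute, skip this run
    if current_second < target_second:
        return target_second - current_second  # defer until the target second
    return 0  # already at or past the target second, run immediately


def lambda_handler(event, context):
    wait = seconds_to_wait(datetime.now(timezone.utc).second, skip_above=50)
    if wait is None:
        return "skipped"
    time.sleep(wait)
    # processing logic goes here
    return "ran"
```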

What is the most efficient way to perform a large and slow batch job on GAE

Say I have retrieved a list of objects from NDB. I have a method that I can call to update the state of these objects, which I have to do every 15 minutes. Each update takes ~30 seconds because of the API calls it has to make.
How would I go ahead and process a list of >1,000 objects?
Example of an approach that would be very slow:
my_objects = [...]  # list of objects to process
for obj in my_objects:
    obj.process_me()  # takes around 30 seconds
    obj.put()
Two options:
You can run a task with a query cursor that processes only N entities each time. When these are processed and there are more entities to go, you fire another task with the next query cursor. Resources: query cursor, tasks
You can run a mapreduce job that will go over all entities in your query in parallel (might require more resources). Simple tutorial: MapReduce on App Engine made easy
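The control flow of the first option (a task that processes one page and then fires a follow-up task with the next cursor) can be illustrated in plain Python. This sketch deliberately avoids the App Engine APIs: fetch_page is a hypothetical stand-in for a cursor query over an in-memory list, and enqueue stands in for adding a task to a queue.

```python
def fetch_page(items, cursor, page_size):
    """Hypothetical stand-in for a cursor query.

    Returns (page, next_cursor, more), where cursor is None for the first page.
    """
    start = cursor or 0
    page = items[start:start + page_size]
    next_cursor = start + len(page)
    return page, next_cursor, next_cursor < len(items)


def process_batch(items, cursor, page_size, process, enqueue):
    """Process one page, then enqueue a follow-up task if entities remain."""
    page, next_cursor, more = fetch_page(items, cursor, page_size)
    for item in page:
        process(item)
    if more:
        enqueue(next_cursor)  # the next task resumes where this one stopped
```

Each task then only has to finish N slow updates, and several tasks can be in flight at once if the queue allows it.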
You might consider using mapreduce for your purposes. When I wanted to update all my > 15000 entities I used mapreduce.
from mapreduce import operation as op

def process(entity):
    # update...
    yield op.db.Put(entity)

Modifying an existing, timezone-naive scheduler to deal with daylight savings time?

We currently have a timezone-unaware scheduler in pure python.
It uses a heapq (a python binary heap) of ordered events, containing a time, callback and arguments for the callback. It gets the least-valued time from the heapq, computes the number of seconds until the event is to occur, and sleeps that number of seconds before running the job.
We don't need to worry about computers being suspended; this is to run on a dedicated server, not a laptop.
We'd like to make the scheduler cope well with timezone changes, so we don't have a problem in November like we did recently (we had an important job that had to be adjusted in the database to make it run at 8:15AM instead of 9:15AM - normally it runs at 8:15AM). I'm thinking we could:
Store all times in UTC.
Make the scheduler sleep 1 minute and test, in a loop, recomputing “now” each time, and doing a <= comparison against job datetimes.
Jobs run more frequently than once an hour should “just run normally”.
Hourly jobs that run in between 2:00AM and 2:59AM (inclusive) on a time change day, probably should skip an hour for PST->PDT, and run an extra time for PDT->PST.
Jobs run less than hourly probably should avoid rerunning in either case on days that have a time change.
Does that sound about right? Where might it be off?
Thanks!
I've written about scheduling a few times before with respect to other programming languages. The concepts are valid for python as well. You may wish to read some of these posts: 1, 2, 3, 4, 5, 6
I'll try to address the specific points again, from a Python perspective:
It's important to separate the recurrence pattern from the execution time. The recurrence pattern should store the time as the user would enter it, which is usually a local time. Even if the recurrence pattern is "just one time", that should still be stored as local time. Scheduling is one of a handful of use cases where the common advice of "always work in UTC" does not hold up!
You will also need to store the time zone identifier. These should be IANA time zones, such as America/Los_Angeles or Europe/London. In Python, you can use the pytz library to work with time zones like these.
The execution time should indeed be based on UTC. The next execution time for any event should be calculated from the local time in the recurrence pattern. You may wish to calculate and store these execution times in advance, such that you can easily determine which are the next events to run.
You should be prepared to recalculate these execution times. You may wish to do it periodically, but at minimum it should be done any time you apply a time zone update to your system. You can (and should) subscribe to tz update announcements from IANA, and then look for corresponding pytz updates on PyPI.
Think of it this way. When you convert a local time to UTC, you're assuming that you know what the time zone rules will be at that point in time, but nobody can predict what governments will do in the future. Time zone rules can change, and they often do. You need to take that into consideration.
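As a rough illustration of keeping the recurrence pattern in local time and deriving UTC execution times from it, here is a sketch. It uses the standard-library zoneinfo module (Python 3.9+) rather than pytz, and DailyPattern and upcoming_executions are hypothetical names. It does not handle invalid or ambiguous local times.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


@dataclass(frozen=True)
class DailyPattern:
    """Hypothetical recurrence pattern: stored exactly as the user entered it."""
    local_time: time  # wall-clock time, e.g. 08:15
    tz_name: str      # IANA time zone identifier, e.g. "America/Los_Angeles"


def upcoming_executions(pattern, first_day, days):
    """Derive UTC execution times for `days` consecutive local dates.

    Re-run this whenever the time zone data is updated, since the
    local-to-UTC mapping may have changed.
    """
    tz = ZoneInfo(pattern.tz_name)
    executions = []
    for offset in range(days):
        local = datetime.combine(
            first_day + timedelta(days=offset), pattern.local_time, tzinfo=tz
        )
        executions.append(local.astimezone(timezone.utc))
    return executions
```

For an 8:15 AM job in America/Los_Angeles, the derived UTC time shifts by one hour across the November transition while the local time stays at 8:15 AM.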
You should test for invalid and ambiguous times, and have a plan for dealing with them. These are easy to hit when scheduling, especially with recurring events.
For example, you might schedule a task to run at 2:00 AM every day - but on the day of the spring-forward transition that time doesn't exist. So what should you do? In many cases, you'll want to run at 3:00 AM on that day, since it's the next time after 1:59 AM. But in some (rarer) contexts, you might run at 1:00 AM, or at 1:59 AM, or just skip that day entirely.
Likewise, you might schedule a task to run at 1:00 AM every day, but on the day of the fall-back transition, 1:00 AM occurs twice. So what do you do? In many cases, the first instance (which is the daylight instance) is the right time to fire. In other (rarer) cases, the second instance may be more appropriate, or (even rarer) it might be appropriate to actually run the job twice.
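One possible way to test for invalid and ambiguous local times is sketched below, again with zoneinfo (Python 3.9+) instead of pytz. classify_local_time is a hypothetical helper; what to do with each result (shift, skip, run twice) is the policy decision described above.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def classify_local_time(naive, tz):
    """Classify a naive wall-clock datetime as 'valid', 'invalid' or 'ambiguous'."""
    first = naive.replace(tzinfo=tz, fold=0)
    second = naive.replace(tzinfo=tz, fold=1)
    # Outside a transition, both folds resolve to the same UTC offset.
    if first.utcoffset() == second.utcoffset():
        return "valid"
    # A wall time inside a spring-forward gap does not survive a round trip
    # through UTC: it comes back shifted to a time that actually exists.
    round_trip = first.astimezone(timezone.utc).astimezone(tz)
    if round_trip.replace(tzinfo=None) != naive:
        return "invalid"
    # Otherwise the wall time occurs twice (fall-back overlap).
    return "ambiguous"
```

For America/Los_Angeles in 2023, 2:00 AM on March 12 falls in the gap and 1:00 AM on November 5 occurs twice.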
With regard to jobs that run on an every X [hours/minutes/seconds] type schedule:
These are easiest to schedule by UTC, and should not be affected by DST changes.
If these are the only types of jobs you are running, you can just base your whole system on UTC. But if you're running a mix of different types of jobs, then you might consider just setting the "local time zone" to be "UTC" in the recurrence pattern.
Alternatively, you could just schedule them by a true local time, just make sure that when the job runs it calculates the next execution time based on the current execution time, which should already be in UTC.
You shouldn't distinguish between jobs that run more than hourly and jobs that run less than hourly. I would expect an hourly job to run 25 times on the day of a fall-back transition, and 23 times on the day of a spring-forward transition.
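The 25 and 23 counts fall out naturally when the hourly job is stepped in UTC and the local day is only used to find the boundaries. A small sketch, assuming zoneinfo (Python 3.9+) and a hypothetical helper name:

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


def hourly_runs_on(local_date, tz):
    """List the UTC run times of an hourly job that fall within one local day."""
    # Local midnight at the start and end of the day, converted to UTC.
    start = datetime.combine(local_date, time(0), tzinfo=tz).astimezone(timezone.utc)
    end = datetime.combine(
        local_date + timedelta(days=1), time(0), tzinfo=tz
    ).astimezone(timezone.utc)
    runs = []
    current = start
    while current < end:
        runs.append(current)
        current += timedelta(hours=1)  # UTC arithmetic, unaffected by DST
    return runs
```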
With regard to your plan to sleep and wake up once per minute in a loop - that will probably work, as long as you don't have sub-minute tasks to deal with. It may not necessarily be the most efficient way to deal with it though. If you properly pre-calculate and store the execution times, you could just set a single task to wake up at the next time to run, run everything that needs to run, then set a new task for the next execution time. You don't necessarily have to wake up once per minute.
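With pre-calculated UTC execution times, the existing heapq design can keep sleeping until the next event instead of polling every minute. The class below is a minimal sketch with hypothetical names, not the poster's scheduler; a real loop would call time.sleep(seconds_until_next(...)) and then run_due(...).

```python
import heapq
import itertools
from datetime import datetime, timedelta, timezone


class UtcScheduler:
    """Minimal heap-based scheduler keyed by aware UTC datetimes."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker so callbacks are never compared

    def add(self, run_at_utc, callback, *args):
        heapq.heappush(self._heap, (run_at_utc, next(self._order), callback, args))

    def seconds_until_next(self, now_utc):
        """Seconds to sleep before the earliest job, or None if nothing is queued."""
        if not self._heap:
            return None
        return max(0.0, (self._heap[0][0] - now_utc).total_seconds())

    def run_due(self, now_utc):
        """Run every job whose time is <= now_utc; return how many ran."""
        ran = 0
        while self._heap and self._heap[0][0] <= now_utc:
            _, _, callback, args = heapq.heappop(self._heap)
            callback(*args)
            ran += 1
        return ran
```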
You should also think about the resources you will need to run the scheduled jobs. What happens if you schedule 1000 tasks that all need to run at midnight? Well they won't necessarily all be able to run simultaneously on a single computer. You might queue them up to run in batches, or spread out the load into different time slots. In a cloud environment perhaps you spin up additional workers to handle the load.