I have a question about django-q for which I could not find any answer in its documentation.
Question: is it possible to calculate next_run at the end of every run?
The reason behind it: the q cluster does not handle local times with DST (daylight saving time).
As an example:
A schedule that should run at 6 a.m. German time.
During summer time, the schedule should be executed at 4 a.m. (UTC).
During winter time, the schedule should be executed at 5 a.m. (UTC).
To fix that, I wrote custom logic for the next run. This logic takes place in the custom function.
I tried to retrieve the Schedule object in the custom function and set next_run there.
The problem here is: if I put the next_run logic before the second section ("other calculations"), it works, but if I place it after the second section, it does not. The other calculations are not related to the Schedule object.
from django_q.models import Schedule

def custom_function(**kwargs):
    # 1. some calculations not related to the Schedule object

    # placed here, the next_run update works
    related_schedule = Schedule.objects.get(id=kwargs["schedule_id"])
    related_schedule.next_run = ...  # custom DST-aware calculation
    related_schedule.save()

    # 2. some other calculations not related to the Schedule object

    # placed here, the next_run update does not work
That behaviour seems very random, and I cannot explain it.
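For context, the kind of custom next-run logic described above could look like the following sketch. It assumes Python 3.9+ (zoneinfo) and uses Europe/Berlin for "German time"; the function name is made up for illustration:

from datetime import timedelta, timezone
from zoneinfo import ZoneInfo

BERLIN = ZoneInfo("Europe/Berlin")

def next_run_at_six_german_time(now_utc):
    """Return the next 6 a.m. Europe/Berlin, expressed in UTC."""
    now_local = now_utc.astimezone(BERLIN)
    candidate = now_local.replace(hour=6, minute=0, second=0, microsecond=0)
    if candidate <= now_local:
        # Arithmetic on aware datetimes is wall-clock arithmetic, so this
        # stays at 6 a.m. local even when the day crosses a DST transition.
        candidate += timedelta(days=1)
    return candidate.astimezone(timezone.utc)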
We have an AWS Step Function that processes CSV files. These CSV files can contain anywhere from 1 to 4,000 records.
Now I want to create another, inner AWS Step Function that will process these CSV records. The problem is that for each record I need to hit another API, and I want all of the records to be processed asynchronously.
For example: a CSV is received with 2,500 records.
The outer step function calls the inner step function 2,500 times (the inner step function takes a single CSV record as input), each invocation processes its record and then stores the result in DynamoDB or some other place.
I have learnt about the callback pattern in AWS Step Functions, but in my case I would be passing 2,500 tokens, and I want the outer step function to proceed only when all 2,500 records are done processing.
So my question is: is this possible with AWS Step Functions?
If you know of any article or guide I could reference, that would be great.
Thanks in advance
It sounds like dynamic parallelism could work:
To configure a Map state, you define an Iterator, which is a complete sub-workflow. When a Step Functions execution enters a Map state, it will iterate over a JSON array in the state input. For each item, the Map state will execute one sub-workflow, potentially in parallel. When all sub-workflow executions complete, the Map state will return an array containing the output for each item processed by the Iterator.
This keeps the flow all within a single Step Function and allows for easier traceability.
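For illustration, a Map state over the parsed records might look roughly like the following, written here as a Python dict of Amazon States Language. The state names, the ItemsPath, and the Lambda ARN are placeholders for this sketch, not values from the question:

# Sketch of a state machine definition using a Map state.
# All names and the ARN below are illustrative placeholders.
definition = {
    "StartAt": "ProcessRecords",
    "States": {
        "ProcessRecords": {
            "Type": "Map",
            "ItemsPath": "$.records",   # the JSON array of CSV records
            "MaxConcurrency": 40,       # 0 means no explicit limit
            "Iterator": {               # complete sub-workflow, run per record
                "StartAt": "CallApi",
                "States": {
                    "CallApi": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-record",
                        "End": True,
                    }
                },
            },
            "End": True,
        }
    },
}

When every iteration has finished, the Map state's output is an array with one result per record, which a subsequent state can write to DynamoDB.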
The limiting factor would be the amount of concurrency available (docs):
Concurrent iterations may be limited. When this occurs, some iterations will not begin until previous iterations have completed. The likelihood of this occurring increases when your input array has more than 40 items.
One additional thing to be aware of here is cost. You'll easily blow right through the free tier and start incurring actual cost (link).
I am exceeding the Lambda execution timeout of 15 minutes.
The reason is that I have long-running operations that insert and update thousands of rows of data in Salesforce.
Below is the code, which executes one operation at a time:
sfdc_ops.insert_case_records(records_to_insert_df, sf)
sfdc_ops.update_case_records(records_to_update_df, sf)
sfdc_ops.update_case_records(unprocessed_in_IKM_df, sf)
sfdc_ops.update_case_records(processed_in_IKM_df, sf)
I ultimately do not need to wait for each line to finish before starting the next. What I really want to do is launch all four of these update and insert processes at once.
What is the best solution for avoiding this 15-minute limit - Step Functions?
It is recommended to use AWS Step Functions for long-running jobs, such as ETL jobs. You could split the four update operations into separate Lambdas and orchestrate them in a Step Function to run in parallel.
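As a sketch of that idea, assuming the four operations become four separate Lambda functions (the function names and the account/region in the ARNs are placeholders), a Parallel state runs one branch per function:

# Build an ASL Parallel state with one branch per (hypothetical) Lambda.
OPERATIONS = [
    "insert-cases",
    "update-cases",
    "update-unprocessed-ikm",
    "update-processed-ikm",
]

def branch(name):
    arn = f"arn:aws:lambda:us-east-1:123456789012:function:{name}"
    return {
        "StartAt": name,
        "States": {name: {"Type": "Task", "Resource": arn, "End": True}},
    }

definition = {
    "StartAt": "RunAllSalesforceOps",
    "States": {
        "RunAllSalesforceOps": {
            "Type": "Parallel",
            "Branches": [branch(name) for name in OPERATIONS],
            "End": True,
        }
    },
}

Each Lambda then only has to stay under the 15-minute limit for its own slice of the work, and the Parallel state completes when all four branches do.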
I'd rethink your approach. If you're exceeding the 15-minute limit, then one Lambda function isn't going to work for you.
Can you break it down into smaller pieces? One Lambda to do the inserts, one to do the updates, and then orchestrate them using Step Functions, perhaps?
Have you looked at AWS Batch for batch processing, instead of using Lambda functions?
I have a CloudWatch Rule scheduling a Lambda with the cron expression 0/1 * * * ? *.
The next 10 trigger dates show that it will run on the first second of every minute, which is exactly what I want.
However, when I monitor the Lambda's recent invocations (or look in the CloudWatch logs), I see it always runs somewhere around the 37th second of each minute. I am able to adjust this by disabling and re-enabling my CloudWatch rule again and again until I randomly get a start time closer to the start of each minute.
But there must be a better solution than this?
We currently have a timezone-unaware scheduler written in pure Python.
It uses a heapq (Python's binary heap) of ordered events, each holding a time, a callback and the callback's arguments. It takes the event with the least-valued time from the heapq, computes the number of seconds until that event is due, and sleeps that number of seconds before running the job.
We don't need to worry about computers being suspended; this is to run on a dedicated server, not a laptop.
We'd like to make the scheduler cope well with timezone changes, so we don't have a problem in November like we did recently (we had an important job that had to be adjusted in the database to make it run at 8:15 AM instead of 9:15 AM; normally it runs at 8:15 AM). I'm thinking we could:
Store all times in UTC.
Make the scheduler sleep 1 minute and test, in a loop, recomputing "now" each time, and doing a <= comparison against job datetimes.
Jobs that run more frequently than once an hour should "just run normally".
Hourly jobs that run between 2:00 AM and 2:59 AM (inclusive) on a time-change day should probably skip an hour for PST->PDT, and run an extra time for PDT->PST.
Jobs that run less often than hourly should probably avoid rerunning in either case on days that have a time change.
Does that sound about right? Where might it be off?
Thanks!
I've written about scheduling a few times before with respect to other programming languages. The concepts are valid for Python as well. You may wish to read some of these posts: 1, 2, 3, 4, 5, 6
I'll try to address the specific points again, from a Python perspective:
It's important to separate the recurrence pattern from the execution time. The recurrence pattern should store the time as the user would enter it, which is usually a local time. Even if the recurrence pattern is "just one time", that should still be stored as local time. Scheduling is one of a handful of use cases where the common advice of "always work in UTC" does not hold up!
You will also need to store the time zone identifier. These should be IANA time zones, such as America/Los_Angeles or Europe/London. In Python, you can use the pytz library to work with time zones like these.
The execution time should indeed be based on UTC. The next execution time for any event should be calculated from the local time in the recurrence pattern. You may wish to calculate and store these execution times in advance, such that you can easily determine which are the next events to run.
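For example, with pytz the conversion from the recurrence pattern's local time to a UTC execution time might look like this (the function and its arguments are just an illustration):

from datetime import datetime
import pytz

def execution_time_utc(occurrence_naive, zone_name):
    """Convert a naive local occurrence time to an aware UTC datetime."""
    tz = pytz.timezone(zone_name)
    # localize() applies the zone's rules in effect on that date;
    # passing tzinfo=tz directly to datetime() does not work with pytz.
    return tz.localize(occurrence_naive).astimezone(pytz.utc)

# The same local wall-clock time maps to different UTC times across DST:
print(execution_time_utc(datetime(2023, 1, 15, 8, 15), "America/Los_Angeles"))  # 16:15 UTC
print(execution_time_utc(datetime(2023, 7, 15, 8, 15), "America/Los_Angeles"))  # 15:15 UTC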
You should be prepared to recalculate these execution times. You may wish to do it periodically, but at minimum it should be done any time you apply a time zone update to your system. You can (and should) subscribe to tz update announcements from IANA, and then look for corresponding pytz updates on PyPI.
Think of it this way. When you convert a local time to UTC, you're assuming that you know what the time zone rules will be at that point in time, but nobody can predict what governments will do in the future. Time zone rules can change, and they often do. You need to take that into consideration.
You should test for invalid and ambiguous times, and have a plan for dealing with them. These are easy to hit when scheduling, especially with recurring events.
For example, you might schedule a task to run at 2:00 AM every day - but on the day of the spring-forward transition that time doesn't exist. So what should you do? In many cases, you'll want to run at 3:00 AM on that day, since it's the next time after 1:59 AM. But in some (rarer) contexts, you might run at 1:00 AM, or at 1:59 AM, or just skip that day entirely.
Likewise, you might schedule a task to run at 1:00 AM every day, but on the day of the fall-back transition, 1:00 AM occurs twice. So what do you do? In many cases, the first instance (which is the daylight instance) is the right time to fire. In other (rarer) cases, the second instance may be more appropriate, or (even rarer) it might be appropriate to actually run the job twice.
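pytz lets you detect both cases explicitly: localize() with is_dst=None raises instead of guessing, and the exception handlers are where your policy decision goes. A sketch:

from datetime import timedelta
import pytz

tz = pytz.timezone("America/Los_Angeles")

def localize_with_policy(naive):
    try:
        return tz.localize(naive, is_dst=None)
    except pytz.exceptions.NonExistentTimeError:
        # Spring forward: the time was skipped; run at the next valid time
        # (this assumes the usual one-hour gap).
        return tz.localize(naive + timedelta(hours=1), is_dst=None)
    except pytz.exceptions.AmbiguousTimeError:
        # Fall back: the time occurs twice; pick the first (daylight) instance.
        return tz.localize(naive, is_dst=True)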
With regard to jobs that run on an every X [hours/minutes/seconds] type schedule:
These are easiest to schedule by UTC, and should not be affected by DST changes.
If these are the only types of jobs you are running, you can just base your whole system on UTC. But if you're running a mix of different types of jobs, then you might consider just setting the "local time zone" to be "UTC" in the recurrence pattern.
Alternatively, you could just schedule them by a true local time, just make sure that when the job runs it calculates the next execution time based on the current execution time, which should already be in UTC.
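Concretely, an interval job's next execution is just UTC arithmetic, which a DST transition cannot shift:

from datetime import datetime, timedelta, timezone

# 2023-11-05 is a US fall-back day, but the UTC arithmetic is unaffected:
current_run = datetime(2023, 11, 5, 6, 0, tzinfo=timezone.utc)
next_run = current_run + timedelta(hours=4)  # exactly 4 real hours later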
You shouldn't need to distinguish between jobs that run more often than hourly and jobs that run less often than hourly. I would expect an hourly job to run 25 times on the day of a fall-back transition, and 23 times on the day of a spring-forward transition.
With regard to your plan to sleep and wake up once per minute in a loop - that will probably work, as long as you don't have sub-minute tasks to deal with. It may not necessarily be the most efficient way to deal with it though. If you properly pre-calculate and store the execution times, you could just set a single task to wake up at the next time to run, run everything that needs to run, then set a new task for the next execution time. You don't necessarily have to wake up once per minute.
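A sketch of that approach, reusing the heapq structure from the question: sleep until the next pre-calculated UTC execution time, but cap the sleep so recalculated times (for example, after a tz data update) are picked up:

import heapq
import itertools
import time
from datetime import datetime, timezone

MAX_SLEEP = 60.0                # cap so recalculated execution times get noticed
_tiebreak = itertools.count()   # keeps equal times from comparing callbacks

def push(events, when_utc, callback, args=()):
    heapq.heappush(events, (when_utc, next(_tiebreak), callback, args))

def run(events):
    while events:
        when, _, callback, args = events[0]
        delay = (when - datetime.now(timezone.utc)).total_seconds()
        if delay > 0:
            time.sleep(min(delay, MAX_SLEEP))
            continue  # re-check the heap; it may have been updated meanwhile
        heapq.heappop(events)
        callback(*args)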
You should also think about the resources you will need to run the scheduled jobs. What happens if you schedule 1000 tasks that all need to run at midnight? Well they won't necessarily all be able to run simultaneously on a single computer. You might queue them up to run in batches, or spread out the load into different time slots. In a cloud environment perhaps you spin up additional workers to handle the load.