We currently have a timezone-unaware scheduler in pure Python.
It uses heapq (Python's binary heap) to hold ordered events, each containing a time, a callback, and the arguments for the callback. It takes the earliest time from the heap, computes the number of seconds until the event is due, and sleeps for that many seconds before running the job.
We don't need to worry about computers being suspended; this is to run on a dedicated server, not a laptop.
We'd like to make the scheduler cope well with timezone changes, so we don't have a problem in November like we did recently (we had an important job that had to be adjusted in the database to make it run at 8:15AM instead of 9:15AM - normally it runs at 8:15AM). I'm thinking we could:
Store all times in UTC.
Make the scheduler sleep 1 minute and test, in a loop, recomputing “now” each time, and doing a <= comparison against job datetimes.
Jobs that run more frequently than once an hour should “just run normally”.
Hourly jobs that run between 2:00AM and 2:59AM (inclusive) on a time change day probably should skip an hour for PST->PDT, and run an extra time for PDT->PST.
Jobs that run less often than hourly probably should avoid rerunning in either case on days that have a time change.
Does that sound about right? Where might it be off?
Thanks!
I've written about scheduling a few times before with respect to other programming languages. The concepts are valid for Python as well.
I'll try to address the specific points again, from a Python perspective:
It's important to separate the recurrence pattern from the execution time. The recurrence pattern should store the time as the user would enter it, which is usually a local time. Even if the recurrence pattern is "just one time", that should still be stored as local time. Scheduling is one of a handful of use cases where the common advice of "always work in UTC" does not hold up!
You will also need to store the time zone identifier. These should be IANA time zones, such as America/Los_Angeles or Europe/London. In Python, you can use the pytz library to work with time zones like these.
The execution time should indeed be based on UTC. The next execution time for any event should be calculated from the local time in the recurrence pattern. You may wish to calculate and store these execution times in advance, such that you can easily determine which are the next events to run.
You should be prepared to recalculate these execution times. You may wish to do it periodically, but at minimum it should be done any time you apply a time zone update to your system. You can (and should) subscribe to tz update announcements from IANA, and then look for corresponding pytz updates on PyPI.
Think of it this way. When you convert a local time to UTC, you're assuming that you know what the time zone rules will be at that point in time, but nobody can predict what governments will do in the future. Time zone rules can change, and they often do. You need to take that into consideration.
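The split between a local-time recurrence pattern and a UTC execution time can be sketched as follows. This is only a minimal illustration: it uses the standard-library zoneinfo module (Python 3.9+) rather than pytz, the function name next_daily_run_utc is made up for this example, and it does nothing special for invalid or ambiguous local times.

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


def next_daily_run_utc(local_time: time, tz_name: str, now_utc: datetime) -> datetime:
    """Next UTC instant for a daily job stored as (local time, IANA zone).

    Hypothetical helper: the recurrence pattern stays in local time and
    only the computed execution time is expressed in UTC.
    """
    tz = ZoneInfo(tz_name)
    day = now_utc.astimezone(tz).date()
    while True:
        # Interpret the wall-clock time in the job's zone, then convert to UTC.
        candidate = datetime.combine(day, local_time, tzinfo=tz).astimezone(timezone.utc)
        if candidate > now_utc:
            return candidate
        day += timedelta(days=1)


# The 8:15 AM job from the question, before and after the November 2023 change.
before = next_daily_run_utc(time(8, 15), "America/Los_Angeles",
                            datetime(2023, 11, 4, 0, 0, tzinfo=timezone.utc))
after = next_daily_run_utc(time(8, 15), "America/Los_Angeles",
                           datetime(2023, 11, 5, 12, 0, tzinfo=timezone.utc))
print(before.isoformat(), after.isoformat())
```

Because the result is derived from the stored local time each time it is called, rerunning it after a time zone database update picks up any rule changes.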
You should test for invalid and ambiguous times, and have a plan for dealing with them. These are easy to hit when scheduling, especially with recurring events.
For example, you might schedule a task to run at 2:00 AM every day - but on the day of the spring-forward transition that time doesn't exist. So what should you do? In many cases, you'll want to run at 3:00 AM on that day, since it's the next time after 1:59 AM. But in some (rarer) contexts, you might run at 1:00 AM, or at 1:59 AM, or just skip that day entirely.
Likewise, you might schedule a task to run at 1:00 AM every day, but on the day of the fall-back transition, 1:00 AM occurs twice. So what do you do? In many cases, the first instance (which is the daylight instance) is the right time to fire. In other (rarer) cases, the second instance may be more appropriate, or (even rarer) it might be appropriate to actually run the job twice.
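One way to test for these two cases is sketched below. It assumes the standard-library zoneinfo module (Python 3.9+) instead of pytz, and classify_local_time is a hypothetical helper name. What to do with an invalid or ambiguous time is still a policy decision for your scheduler.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def classify_local_time(naive_local: datetime, tz_name: str) -> str:
    """Return "invalid", "ambiguous" or "normal" for a wall-clock time in a zone."""
    tz = ZoneInfo(tz_name)
    first = naive_local.replace(tzinfo=tz, fold=0)
    second = naive_local.replace(tzinfo=tz, fold=1)
    # A time inside a spring-forward gap does not survive a round trip via UTC.
    round_trip = first.astimezone(timezone.utc).astimezone(tz).replace(tzinfo=None)
    if round_trip != naive_local.replace(tzinfo=None):
        return "invalid"
    # A time inside a fall-back overlap has two different UTC offsets.
    if first.utcoffset() != second.utcoffset():
        return "ambiguous"
    return "normal"


zone = "America/Los_Angeles"
print(classify_local_time(datetime(2023, 3, 12, 2, 30), zone))
print(classify_local_time(datetime(2023, 11, 5, 1, 0), zone))
print(classify_local_time(datetime(2023, 6, 1, 12, 0), zone))
```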
With regard to jobs that run on an every X [hours/minutes/seconds] type schedule:
These are easiest to schedule by UTC, and should not be affected by DST changes.
If these are the only types of jobs you are running, you can just base your whole system on UTC. But if you're running a mix of different types of jobs, then you might consider just setting the "local time zone" to be "UTC" in the recurrence pattern.
Alternatively, you could schedule them by a true local time. Just make sure that when the job runs, it calculates the next execution time based on the current execution time, which should already be in UTC.
You shouldn't distinguish between jobs that run more than hourly and jobs that run less than hourly. I would expect an hourly job to run 25 times on the day of a fall-back transition, and 23 times on the day of a spring-forward transition.
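The 25 and 23 figures can be checked by stepping through a local calendar day in UTC, one hour at a time. This is a small sketch, assuming the standard-library zoneinfo module and a hypothetical helper name hourly_runs_on.

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


def hourly_runs_on(local_day: date, tz_name: str) -> int:
    """Count hourly firings between local midnight and the next local midnight."""
    tz = ZoneInfo(tz_name)
    start = datetime.combine(local_day, time.min, tzinfo=tz).astimezone(timezone.utc)
    end = datetime.combine(local_day + timedelta(days=1), time.min,
                           tzinfo=tz).astimezone(timezone.utc)
    runs = 0
    t = start
    while t < end:
        runs += 1
        t += timedelta(hours=1)  # the interval is added in UTC, so DST cannot skew it
    return runs


zone = "America/Los_Angeles"
print(hourly_runs_on(date(2023, 11, 5), zone))   # fall-back day
print(hourly_runs_on(date(2023, 3, 12), zone))   # spring-forward day
print(hourly_runs_on(date(2023, 6, 1), zone))    # ordinary day
```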
With regard to your plan to sleep and wake up once per minute in a loop - that will probably work, as long as you don't have sub-minute tasks to deal with. It may not necessarily be the most efficient way to deal with it though. If you properly pre-calculate and store the execution times, you could just set a single task to wake up at the next time to run, run everything that needs to run, then set a new task for the next execution time. You don't necessarily have to wake up once per minute.
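A rough sketch of the "wake up at the next time to run" idea, built on the heapq approach from the question, might look like this. The names schedule, run_due and run_forever are illustrative only, and the heap is assumed to hold timezone-aware UTC datetimes.

```python
import heapq
import itertools
import time
from datetime import datetime, timedelta, timezone

_counter = itertools.count()  # tie-breaker so callbacks are never compared


def schedule(heap, run_at_utc, callback, *args):
    """Add a job to the heap, ordered by its UTC execution time."""
    heapq.heappush(heap, (run_at_utc, next(_counter), callback, args))


def run_due(heap, now_utc):
    """Run every job due at or before now_utc.

    Returns the number of seconds until the next job, or None if the heap is empty.
    """
    while heap and heap[0][0] <= now_utc:
        _, _, callback, args = heapq.heappop(heap)
        callback(*args)
    if not heap:
        return None
    return (heap[0][0] - now_utc).total_seconds()


def run_forever(heap):
    """Sleep exactly until the next job instead of polling once per minute."""
    while True:
        delay = run_due(heap, datetime.now(timezone.utc))
        if delay is None:
            break
        time.sleep(max(delay, 0))


# Small demonstration with a fixed "now" instead of the real clock.
t0 = datetime(2023, 11, 5, 9, 0, tzinfo=timezone.utc)
log = []
jobs = []
schedule(jobs, t0 + timedelta(seconds=10), log.append, "first")
schedule(jobs, t0 + timedelta(seconds=70), log.append, "second")
remaining = run_due(jobs, t0 + timedelta(seconds=10))
print(log, remaining)
```

A job that recurs would push its own next execution time back onto the heap from inside its callback, or from a wrapper around it.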
You should also think about the resources you will need to run the scheduled jobs. What happens if you schedule 1000 tasks that all need to run at midnight? Well they won't necessarily all be able to run simultaneously on a single computer. You might queue them up to run in batches, or spread out the load into different time slots. In a cloud environment perhaps you spin up additional workers to handle the load.
Related
I have got a question about django-q, where I could not find any answers in its documentation.
Question: Is it possible to calculate the next_run at every run at the end?
The reason behind it: The q cluster does not cover local times with dst (daylight saving time).
For example:
A schedule that should run at 6 a.m. German time.
For summer time: The schedule should be executed at 4am (UTC).
For winter time: The schedule should be executed at 5am (UTC).
To fix that, I wrote custom logic for the next run. This logic lives in the custom function.
I tried to retrieve the schedule object in the custom function and set the next_run there.
The problem is this: if I put the next_run logic before the second section ("other calculations"), it works, but if I place it after the second section, it does not. The other calculations are not related to the schedule object.
from django_q.models import Schedule

def custom_function(**kwargs):
    # 1. some calculations not related to schedule object
    # this place does work
    related_schedule = Schedule.objects.get(id=kwargs["schedule_id"])
    related_schedule.next_run = ...
    related_schedule.save()
    # 2. some other calculations not related to schedule object
    # this place doesn't
That is very random behaviour which I cannot explain.
We've inherited an application that runs on Lambda. On initialization, the app reads a configuration stored in ParameterStore, and if a certain value is out of date, that value is updated and the app can continue.
The problem is that many users (about 7 concurrent at any one time) can run this app, and they could hit the same update nearly at the same time and therefore cause ParameterStore to throw a PutParameter throttle-back error.
To avoid that, and to update the ParameterStore value only once per day rather than on every run, we've devised a simple solution, and we'd like some advice on whether it makes sense or not...
Here are the steps we're thinking of:
Random sleep between 1 and 10 seconds ("queue" up calls to minimize clashes -- perhaps this isn't necessary??)
Does an S3 object exist for today: obj_MMDDYYYY (signals that the PStore update has been done today)
If YES then skip updating ParameterStore value
If NO then create S3 obj_MMDDYYYY (today's date) and run the ParameterStore value update
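The steps above (minus the random sleep) can be sketched like this. The S3 and ParameterStore calls are deliberately left as injected callables, so marker_exists, create_marker and update_parameter are placeholders for this example rather than real AWS API names.

```python
from datetime import date


def marker_key(today: date) -> str:
    """Build the obj_MMDDYYYY marker name for a given day."""
    return today.strftime("obj_%m%d%Y")


def update_once_per_day(today, marker_exists, create_marker, update_parameter) -> bool:
    """Run update_parameter only if no marker exists for today.

    Returns True when the update ran. The three callables are hypothetical
    stand-ins for the S3 lookup, the S3 write, and the ParameterStore update.
    """
    key = marker_key(today)
    if marker_exists(key):
        return False
    create_marker(key)
    update_parameter()
    return True


# Demonstration with an in-memory set standing in for the bucket.
bucket = set()
updates = []
day = date(2024, 3, 5)
first = update_once_per_day(day, bucket.__contains__, bucket.add,
                            lambda: updates.append("updated"))
second = update_once_per_day(day, bucket.__contains__, bucket.add,
                             lambda: updates.append("updated"))
print(marker_key(day), first, second, updates)
```

Note that the check and the create are two separate steps here, so two invocations that start at nearly the same moment could both see "no marker" and both run the update. The sketch only mirrors the listed steps and does not close that window.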
Any advice appreciated.
Thanks!
I have noticed that the first execution of a query takes longer than the second. It seems that query compilation accounts for the extra time on the first run. Can we do anything to reduce the compile time?
Scenario:
enable_result_cache_for_session is off
We have an SLA of 15 seconds for a specific query. The first run takes 33 seconds to compile and execute, which misses the SLA, but subsequent runs take 10 seconds, which meets it.
Q: How do I tune this part? How do I make sure this does not happen?
Is there a database configuration parameter for this?
The title of the question says compile time, but I understand that you are interested in improving the execution time, right?
John Rotenstein's comment makes total sense: to improve Redshift query execution time, you need to understand the Redshift architecture and how to distribute your data so that your queries run as fast as possible.
You will need to understand DISTKEY and SORTKEY.
Useful links
Redshift Architecture
https://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html
https://medium.com/#dpazetojr/redshift-architecture-basics-4aae5068b8e3
Redshift Distribution Styles
https://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html
https://medium.com/#dpazetojr/redshift-distkey-and-sortkey-d247b01b01f6
UPDATE 1:
To tune the query and learn how and when to use DISTKEY and SORTKEY, start by running the EXPLAIN command on your query, then act on what the plan shows.
https://docs.aws.amazon.com/redshift/latest/dg/r_EXPLAIN.html
https://dev.to/ronsoak/the-r-a-g-redshift-analyst-guide-understanding-the-query-plan-explain-360d
I ran a query which resulted in the below stats.
Elapsed time: 12.1 sec
Slot time consumed: 14 hr 12 min
total_slot_ms: 51147110 ( which is 14 hr 12 min)
We are on an on-demand pricing plan, so the maximum would be 2000 slots. That being said, if I had used 2000 slots for the whole 12.1-second span, I should end up with a total_slot_ms of 24200000 (which is 2000 x 12.1 x 1000). However, the total_slot_ms is 51147110. The average number of slots used is 51147110/12100 ≈ 4227, which is way above 2000. Can someone explain to me how I ended up using more than 2000 slots?
A Google course has an example where a query shows 13 seconds of "elapsed time" and 50 minutes of "slot time consumed". They say:
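The arithmetic in that question can be checked with a few lines. The figures are the ones quoted above; the variable and function names are just for illustration.

```python
elapsed_ms = 12_100          # 12.1 seconds of elapsed time
total_slot_ms = 51_147_110   # reported total_slot_ms
slot_limit = 2_000           # on-demand slot limit from the question


def average_slots(total_slot_ms: int, elapsed_ms: int) -> float:
    """Average number of slots busy over the elapsed time."""
    return total_slot_ms / elapsed_ms


# Slot-milliseconds available if exactly 2000 slots ran for the whole query.
budget_at_limit = slot_limit * elapsed_ms

print(budget_at_limit)
print(round(average_slots(total_slot_ms, elapsed_ms)))
```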
Hey, across all of our workers, we did essentially 50 minutes of work massively in parallel, 50 minutes so that your query could be returned back in 13 seconds. Best of all for you, you don't need to worry about spinning up those workers, moving data in-between them, making sure they're sharing all their results between their aggregations. All you care about is writing the SQL, finding the insights, and then running that query in a very fast turnaround. But there is abstracted from you a lot of distributed parallel processing that's happening.
Increasing BigQuery slot capacity significantly improves overall query performance. Although the number of slots is subject to quota limits under the BigQuery on-demand pricing plan, exceeding the slot limit does not incur additional charges:
BigQuery slots are shared among all queries in a single project.
BigQuery might burst beyond this limit to accelerate your queries.
To check how many slots you're using, see Monitoring BigQuery using
Cloud Monitoring.
BigQuery on-demand supports limited bursting. https://cloud.google.com/bigquery/docs/release-notes#December_10_2019
You might want to check the execution plan for the query and look at the slot_time_ms for the wait, read, and write activities at each stage. Since these are on-demand slots, you may see a lot of wait time, which adds up in the total time.
Besides bursting, each stage of the execution plan will help you see that the total time is not necessarily actual slot consumption but equivalent slot consumption.
I am implementing some functionality in which I receive a set of queries to run against a database. A query should not be lost for a certain time, say 5 minutes, unless it executes successfully (this is in case the DB is down, so that we don't lose the query). What I was thinking of doing is to set a timer for each query on a separate thread, wait for that time frame, and at the end, if the query still exists, remove it from the queue. I am not happy with this solution, as I have to create as many threads as there are queries. Is there a better way to design this (the environment is VC++)? If the question is unclear, please let me know and I will try to frame it better.
One thread is enough: have it check, say every 10 seconds, whether the queue holds any queries whose due time has been reached and which should therefore be aborted or rolled back.
Queues usually grow at one end and are emptied from the other, so you only have to check whether the query at the end holding the oldest items has reached its due time.
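The idea is language-independent, so here is a short sketch in Python rather than VC++. The class name PendingQueries and its methods are made up for this example, and a real multi-threaded version would also need a lock around the queue.

```python
from collections import deque


class PendingQueries:
    """FIFO of queries that are dropped once their time-to-live has passed."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._items = deque()  # (deadline, query) pairs, oldest on the left

    def add(self, query, now: float) -> None:
        """Append a query; its deadline is later than every earlier one."""
        self._items.append((now + self.ttl, query))

    def expire(self, now: float) -> list:
        """Remove and return queries whose deadline has been reached.

        Only the oldest end is inspected, because deadlines are in queue order.
        """
        expired = []
        while self._items and self._items[0][0] <= now:
            expired.append(self._items.popleft()[1])
        return expired


pending = PendingQueries(ttl_seconds=300.0)
pending.add("query a", now=0.0)
pending.add("query b", now=100.0)
print(pending.expire(now=299.0))
print(pending.expire(now=300.0))
print(pending.expire(now=400.0))
```

A single housekeeping thread could call expire with the current monotonic clock value every 10 seconds and abort or roll back whatever it returns.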