I have a few tasks that should be executed with different priorities.
For example, task A is need as soon as possible, but tasks B and C may be calculated a bit later, but definitely after task A.
Moreover, priority of tasks B or C may be changed in future.
How it can be achieved using GPars?
Real business case is that I need to pre-calculate some data. For example, I have 3 tabs, calculation of data for every tab takes 30 seconds. So, I'd like to calculate data for the first tab at the beginning and start to calculate data for the second and third tabs. It is not predictable what tab will be chosen next, so I have to be able to change priority of the second and third tasks.
I'd suggest you tried pools with different thread priorities. Setting a priority for a task then becomes scheduling it on the right pool (or GPars group).
Related
I have a step function setup which calls preprocessing lambda and inference lambda for a data item. Now, I need to do this process on the entire dataset(over 10000 items). One way is to invoke step function parallelly for each input. Is there a better alternative to this approach?
Another way to do it would be to use Map state to run over an array of items. You could start with a list of item ID's and run a set of tasks for it.
https://aws.amazon.com/blogs/aws/new-step-functions-support-for-dynamic-parallelism/
This approach has some drawbacks though:
There is a 256kb limit for input/output data. The initial array of items could possibly be bigger. If you passes an array of ID's only as an input to map state though, 10k items would likely not cross that limit.
Map state doesn't guarantee that all the items will run concurrently. It could possibly be less than 40 at a time (workaround would be to have nested map states or maps of map states). From documentation:
Concurrent iterations may be limited. When this occurs, some iterations will not begin until previous iterations have completed. The likelihood of this occurring increases when your input array has more than 40 items.
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-map-state.html
Is it possible to run through several states in one statechart at the same time?
My simulation model is agent-based.
A) At the moment I consider my process as a continuous chain for simplicity. This means that only when the product is ejected from the machine can the process restart. The individual stations of the machine are represented as states.
B) Now I would like to represent the following: The machine should be able to run through several states simultaneously in one run. Example: If the manufactured product is just ejected from the machine, there is raw material in the filling station and in the pressing station at the same time. This means that more product is produced in the best possible time than when I look at the process as in A.
I would be glad about any help. :)
Three axioms that are always true, and you must make your logic follow them:
an agent can always only be in 1 state per state chart
While in 1 state, it can be part of a larger "composite state" (see help)
An agent can have several state charts running in parallel, for example one for "machine states" and one for "failure states"
Be careful with point 3, though. If you have several state charts in 1 agent type, they should be 100% mutually exclusive, i.e. represent very different things.
We currently have a timezone-unaware scheduler in pure python.
It uses a heapq (a python binary heap) of ordered events, containing a time, callback and arguments for the callback. It gets the least-valued time from the heapq, computes the number of seconds until the event is to occur, and sleeps that number of seconds before running the job.
We don't need to worry about computers being suspended; this is to run on a dedicated server, not a laptop.
We'd like to make the scheduler cope well with timezone changes, so we don't have a problem in November like we did recently (we had an important job that had to be adjusted in the database to make it run at 8:15AM instead of 9:15AM - normally it runs at 8:15AM). I'm thinking we could:
Store all times in UTC.
Make the scheduler sleep 1 minute and test, in a loop, recomputing
“now” each time, and doing a <= comparison against job datetimes.
Jobs run more frequently than once an hour should “just run normally”.
Hourly jobs that run in between 2:00AM and 2:59AM (inclusive) on a
time change day, probably should skip an hour for PST->PDT, and run
an extra time for PDT->PST.
Jobs run less than hourly probably should avoid rerunning in either
case on days that have a time change.
Does that sound about right? Where might it be off?
Thanks!
I've written about scheduling a few times before with respect to other programming languages. The concepts are valid for python as well. You may wish to read some of these posts: 1, 2, 3, 4, 5, 6
I'll try to address the specific points again, from a Python perspective:
It's important to separate the separate the recurrence pattern from the execution time. The recurrence pattern should store the time as the user would enter it, which is usually a local time. Even if the recurrence pattern is "just one time", that should still be stored as local time. Scheduling is one of a handful of use cases where the common advice of "always work in UTC" does not hold up!
You will also need to store the time zone identifier. These should be IANA time zones, such as America/Los_Angeles or Europe/London. In Python, you can use the pytz library to work with time zones like these.
The execution time should indeed be based on UTC. The next execution time for any event should be calculated from the local time in the recurrence pattern. You may wish to calculate and store these execution times in advance, such that you can easily determine which are the next events to run.
You should be prepared to recalculate these execution times. You may wish to do it periodically, but at minimum it should be done any time you apply a time zone update to your system. You can (and should) subscribe for tz update announcements from IANA, and then look for corresponding pytz updates on pypi.
Think of it this way. When you convert a local time to UTC, you're assuming that you know what the time zone rules will be at that point in time, but nobody can predict what governments will do in the future. Time zone rules can change, and they often do. You need to take that into consideration.
You should test for invalid and ambiguous times, and have a plan for dealing with them. These are easy to hit when scheduling, especially with recurring events.
For example, you might schedule a task to run at 2:00 AM every day - but on the day of the spring-forward transition that time doesn't exist. So what should you do? In many cases, you'll want to run at 3:00 AM on that day, since it's the next time after 1:59 AM. But in some (rarer) contexts, you might run at 1:00 AM, or at 1:59 AM, or just skip that day entirely.
Likewise, you might schedule a task to run at 1:00 AM every day, but on the day of the fall-back transition, 1:00 AM occurs twice. So what do you do? In many cases, the first instance (which is the daylight instance) is the right time to fire. In other (rarer) cases, the second instance may be more appropriate, or (even rarer) it might be appropriate to actually run the job twice.
With regard to jobs that run on an every X [hours/minutes/seconds] type schedule:
These are easiest to schedule by UTC, and should not be affected by DST changes.
If these are the only types of jobs you are running, you can just base your whole system on UTC. But if you're running a mix of different types of jobs, then you might consider just setting the "local time zone" to be "UTC" in the recurrence pattern.
Alternatively, you could just schedule them by a true local time, just make sure that when the job runs it calculates the next execution time based on the current execution time, which should already be in UTC.
You shouldn't distinguish between jobs that run more than hourly, or jobs that run less than hourly. I would expect an hourly to run 25 times on the day of a fall-back transition, and 23 times on the day of a spring-forward transition.
With regard to your plan to sleep and wake up once per minute in a loop - that will probably work, as long as you don't have sub-minute tasks to deal with. It may not necessarily be the most efficient way to deal with it though. If you properly pre-calculate and store the execution times, you could just set a single task to wake up at the next time to run, run everything that needs to run, then set a new task for the next execution time. You don't necessarily have to wake up once per minute.
You should also think about the resources you will need to run the scheduled jobs. What happens if you schedule 1000 tasks that all need to run at midnight? Well they won't necessarily all be able to run simultaneously on a single computer. You might queue them up to run in batches, or spread out the load into different time slots. In a cloud environment perhaps you spin up additional workers to handle the load.
I was implemnting some functionaliy in which i get a set of queries on database One shouldnt loose the query for a certain time lets say some 5min unless and untill the query is executed fine (this is incase the DB is down, we dont loose the query). so, what i was thinking to do is to set a sort of timer for each query through a different thread and wait on it for that time frame, and at the end if it still exists, remove it from the queue, but, i am not happy with this solution as i have to create as many threads as the number of queries. is there a better way to design this (environment is vc++), If the question is unclear, please let me know, i will try to frame it better.
One thread is enough to check lets say every 10 seconds that you do not have queries in that queue of yours whose due time has been reached and so should be aborted / rolled back.
Queues are usually grown from one end and erased from other end so you have to check only if the query on the end where the oldest items are has not reached its due time.
Long story short, I'm rewriting a piece of a system and am looking for a way to store some hit counters in AWS SimpleDB.
For those of you not familiar with SimpleDB, the (main) problem with storing counters is that the cloud propagation delay is often over a second. Our application currently gets ~1,500 hits per second. Not all those hits will map to the same key, but a ballpark figure might be around 5-10 updates to a key every second. This means that if we were to use a traditional update mechanism (read, increment, store), we would end up inadvertently dropping a significant number of hits.
One potential solution is to keep the counters in memcache, and using a cron task to push the data. The big problem with this is that it isn't the "right" way to do it. Memcache shouldn't really be used for persistent storage... after all, it's a caching layer. In addition, then we'll end up with issues when we do the push, making sure we delete the correct elements, and hoping that there is no contention for them as we're deleting them (which is very likely).
Another potential solution is to keep a local SQL database and write the counters there, updating our SimpleDB out-of-band every so many requests or running a cron task to push the data. This solves the syncing problem, as we can include timestamps to easily set boundaries for the SimpleDB pushes. Of course, there are still other issues, and though this might work with a decent amount of hacking, it doesn't seem like the most elegant solution.
Has anyone encountered a similar issue in their experience, or have any novel approaches? Any advice or ideas would be appreciated, even if they're not completely flushed out. I've been thinking about this one for a while, and could use some new perspectives.
The existing SimpleDB API does not lend itself naturally to being a distributed counter. But it certainly can be done.
Working strictly within SimpleDB there are 2 ways to make it work. An easy method that requires something like a cron job to clean up. Or a much more complex technique that cleans as it goes.
The Easy Way
The easy way is to make a different item for each "hit". With a single attribute which is the key. Pump the domain(s) with counts quickly and easily. When you need to fetch the count (presumable much less often) you have to issue a query
SELECT count(*) FROM domain WHERE key='myKey'
Of course this will cause your domain(s) to grow unbounded and the queries will take longer and longer to execute over time. The solution is a summary record where you roll up all the counts collected so far for each key. It's just an item with attributes for the key {summary='myKey'} and a "Last-Updated" timestamp with granularity down to the millisecond. This also requires that you add the "timestamp" attribute to your "hit" items. The summary records don't need to be in the same domain. In fact, depending on your setup, they might best be kept in a separate domain. Either way you can use the key as the itemName and use GetAttributes instead of doing a SELECT.
Now getting the count is a two step process. You have to pull the summary record and also query for 'Timestamp' strictly greater than whatever the 'Last-Updated' time is in your summary record and add the two counts together.
SELECT count(*) FROM domain WHERE key='myKey' AND timestamp > '...'
You will also need a way to update your summary record periodically. You can do this on a schedule (every hour) or dynamically based on some other criteria (for example do it during regular processing whenever the query returns more than one page). Just make sure that when you update your summary record you base it on a time that is far enough in the past that you are past the eventual consistency window. 1 minute is more than safe.
This solution works in the face of concurrent updates because even if many summary records are written at the same time, they are all correct and whichever one wins will still be correct because the count and the 'Last-Updated' attribute will be consistent with each other.
This also works well across multiple domains even if you keep your summary records with the hit records, you can pull the summary records from all your domains simultaneously and then issue your queries to all domains in parallel. The reason to do this is if you need higher throughput for a key than what you can get from one domain.
This works well with caching. If your cache fails you have an authoritative backup.
The time will come where someone wants to go back and edit / remove / add a record that has an old 'Timestamp' value. You will have to update your summary record (for that domain) at that time or your counts will be off until you recompute that summary.
This will give you a count that is in sync with the data currently viewable within the consistency window. This won't give you a count that is accurate up to the millisecond.
The Hard Way
The other way way is to do the normal read - increment - store mechanism but also write a composite value that includes a version number along with your value. Where the version number you use is 1 greater than the version number of the value you are updating.
get(key) returns the attribute value="Ver015 Count089"
Here you retrieve a count of 89 that was stored as version 15. When you do an update you write a value like this:
put(key, value="Ver016 Count090")
The previous value is not removed and you end up with an audit trail of updates that are reminiscent of lamport clocks.
This requires you to do a few extra things.
the ability to identify and resolve conflicts whenever you do a GET
a simple version number isn't going to work you'll want to include a timestamp with resolution down to at least the millisecond and maybe a process ID as well.
in practice you'll want your value to include the current version number and the version number of the value your update is based on to more easily resolve conflicts.
you can't keep an infinite audit trail in one item so you'll need to issue delete's for older values as you go.
What you get with this technique is like a tree of divergent updates. you'll have one value and then all of a sudden multiple updates will occur and you will have a bunch of updates based off the same old value none of which know about each other.
When I say resolve conflicts at GET time I mean that if you read an item and the value looks like this:
11 --- 12
/
10 --- 11
\
11
You have to to be able to figure that the real value is 14. Which you can do if you include for each new value the version of the value(s) you are updating.
It shouldn't be rocket science
If all you want is a simple counter: this is way over-kill. It shouldn't be rocket science to make a simple counter. Which is why SimpleDB may not be the best choice for making simple counters.
That isn't the only way but most of those things will need to be done if you implement an SimpleDB solution in lieu of actually having a lock.
Don't get me wrong, I actually like this method precisely because there is no lock and the bound on the number of processes that can use this counter simultaneously is around 100. (because of the limit on the number of attributes in an item) And you can get beyond 100 with some changes.
Note
But if all these implementation details were hidden from you and you just had to call increment(key), it wouldn't be complex at all. With SimpleDB the client library is the key to making the complex things simple. But currently there are no publicly available libraries that implement this functionality (to my knowledge).
To anyone revisiting this issue, Amazon just added support for Conditional Puts, which makes implementing a counter much easier.
Now, to implement a counter - simply call GetAttributes, increment the count, and then call PutAttributes, with the Expected Value set correctly. If Amazon responds with an error ConditionalCheckFailed, then retry the whole operation.
Note that you can only have one expected value per PutAttributes call. So, if you want to have multiple counters in a single row, then use a version attribute.
pseudo-code:
begin
attributes = SimpleDB.GetAttributes
initial_version = attributes[:version]
attributes[:counter1] += 3
attributes[:counter2] += 7
attributes[:version] += 1
SimpleDB.PutAttributes(attributes, :expected => {:version => initial_version})
rescue ConditionalCheckFailed
retry
end
I see you've accepted an answer already, but this might count as a novel approach.
If you're building a web app then you can use Google's Analytics product to track page impressions (if the page to domain-item mapping fits) and then to use the Analytics API to periodically push that data up into the items themselves.
I haven't thought this through in detail so there may be holes. I'd actually be quite interested in your feedback on this approach given your experience in the area.
Thanks
Scott
For anyone interested in how I ended up dealing with this... (slightly Java-specific)
I ended up using an EhCache on each servlet instance. I used the UUID as a key, and a Java AtomicInteger as the value. Periodically a thread iterates through the cache and pushes rows to a simpledb temp stats domain, as well as writing a row with the key to an invalidation domain (which fails silently if the key already exists). The thread also decrements the counter with the previous value, ensuring that we don't miss any hits while it was updating. A separate thread pings the simpledb invalidation domain, and rolls up the stats in the temporary domains (there are multiple rows to each key, since we're using ec2 instances), pushing it to the actual stats domain.
I've done a little load testing, and it seems to scale well. Locally I was able to handle about 500 hits/second before the load tester broke (not the servlets - hah), so if anything I think running on ec2 should only improve performance.
Answer to feynmansbastard:
If you want to store huge amount of events i suggest you to use distributed commit log systems such as kafka or aws kinesis. They allow to consume stream of events cheap and simple (kinesis's pricing is 25$ per month for 1K events per seconds) – you just need to implement consumer (using any language), which bulk reads all events from previous checkpoint, aggregates counters in memory then flushes data into permanent storage (dynamodb or mysql) and commit checkpoint.
Events can be logged simply using nginx log and transfered to kafka/kinesis using fluentd. This is very cheap, performant and simple solution.
Also had similiar needs/challenges.
I looked at using google analytics and count.ly. the latter seemed too expensive to be worth it (plus they have a somewhat confusion definition of sessions). GA i would have loved to use, but I spent two days using their libraries and some 3rd party ones (gadotnet and one other from maybe codeproject). unfortunately I could only ever see counters post in GA realtime section, never in the normal dashboards even when the api reported success. we were probably doing something wrong but we exceeded our time budget for ga.
We already had an existing simpledb counter that updated using conditional updates as mentioned by previous commentor. This works well, but suffers when there is contention and conccurency where counts are missed (for example, our most updated counter lost several million counts over a period of 3 months, versus a backup system).
We implemented a newer solution which is somewhat similiar to the answer for this question, except much simpler.
We just sharded/partitioned the counters. When you create a counter you specify the # of shards which is a function of how many simulatenous updates you expect. this creates a number of sub counters, each which has the shard count started with it as an attribute :
COUNTER (w/5shards) creates :
shard0 { numshards = 5 } (informational only)
shard1 { count = 0, numshards = 5, timestamp = 0 }
shard2 { count = 0, numshards = 5, timestamp = 0 }
shard3 { count = 0, numshards = 5, timestamp = 0 }
shard4 { count = 0, numshards = 5, timestamp = 0 }
shard5 { count = 0, numshards = 5, timestamp = 0 }
Sharded Writes
Knowing the shard count, just randomly pick a shard and try to write to it conditionally. If it fails because of contention, choose another shard and retry.
If you don't know the shard count, get it from the root shard which is present regardless of how many shards exist. Because it supports multiple writes per counter, it lessens the contention issue to whatever your needs are.
Sharded Reads
if you know the shard count, read every shard and sum them.
If you don't know the shard count, get it from the root shard and then read all and sum.
Because of slow update propogation, you can still miss counts in reading but they should get picked up later. This is sufficient for our needs, although if you wanted more control over this you could ensure that- when reading- the last timestamp was as you expect and retry.