I was telling a friend that an advantage of running a load with a lambda function is that each instance, and thus each execution, gets dedicated resources - memory and CPU (and perhaps disk, network,... but that's less relevant). And then I started wondering...
For instance, if you have a function with some CPU-intensive logic that is used by multiple tenants, then one execution should never be affected by another. If some calculation takes 5 seconds to execute, it will always take 5 seconds, no matter how many requests are processed simultaneously.
This seems self-evident for memory, but less so for CPU. From a quick test I seem to get mixed results.
So, does every function instance gets its own CPU dedicated resources?
My main focus is AWS Lambda, but the same question arises for Azure (on a Consumption plan, I guess) and Google.
Lambda uses fractional CPU allocations of instance CPU, running on an instance type comparable to compute optimized EC2 instance. That CPU share is dedicated to the Lambda, and its allocation is based on the amount of memory allocated to the function.
The CPU share dedicated to a function is based off of the fraction of
its allocated memory, per each of the two cores. For example, an
instance with ~ 3 GB memory available for lambda functions where each
function can have up to 1 GB memory means at most you can utilize ~
1/3 * 2 cores = 2/3 of the CPU. The details may be revisited in the
future
The explanation is supported by Lambda Function Configuration documentation, which states:
Performance testing your Lambda function is a crucial part in ensuring
you pick the optimum memory size configuration. Any increase in memory
size triggers an equivalent increase in CPU availabile to your
function.
So yes, you get a dedicated share of an instances' total CPU, based on your memory allocation and the formula above.
It might have been indicated more clearly that I wasn't looking for documentation, but for facts. The core question was if we can assume that one execution should never be affected by another.
As I said, a first quick test gave me mixed results, so I took the time to delve in a little deeper.
I created a very simple lambda that, for a specified number of seconds, generates and sums up random numbers (code here):
while (process.hrtime(start)[0] < duration) {
var nextRandom = randomizer();
random = random + nextRandom - 0.5;
rounds++;
}
Now, if executions on different instances are really independent, then there should be no difference between executing this lambda just once or multiple times in parallel, all other factors being equal.
But the figures indicate otherwise. Here's a graph, showing the number of 'rounds' per second that was achieved.
Every datapoint is the average of 10 iterations with the same number of parallel requests - which should rule out cold start effects and other variations. The raw results can be found here.
The results look rather shocking: they indicate that avoiding parallel executions of the same lambda can almost double the performance....?!
But sticking to the original question: this looks like the CPU fraction 'dedicated' to a lambda instance is not fixed, but depends on certain other factors.
Of course I welcome any remarks on the test, and of course, explanations for the observed behavior!
Related
I have to process a lot of data in my Lambda code and this computation could be parallelized. I am currently using single-threaded Python code and want to optimize it. I thought about converting it to multi-threaded Python code, but anyway it seems that Amazon Lambda doesn't have enough resources. What is the best way to do this?
AWS Lambda now supports up to 10 GB of memory and 6 vCPU cores for Lambda Functions
If you want to do CPU bound, parallelized functions on Lambda, always remember these core behaviour
The total amount number of vCPU (which correlate to the optimal thread count) is dictated by how much memory you assigned for that Lambda function
Lambda allocates CPU power in proportion to the amount of memory configured. Memory is the amount of memory available to your Lambda function at runtime. You can increase or decrease the memory and CPU power allocated to your function using the Memory (MB) setting. To configure the memory for your function, set a value between 128 MB and 10,240 MB in 1-MB increments. At 1,769 MB, a function has the equivalent of one vCPU (one vCPU-second of credits per second).
Lambda is also severely limited by the maximum amount of time it can be run: 900 seconds (15 minutes)
Depend on how your application has been architectured, you can improve the performance with these things in mind
It does support multi-threaded / multi-core processing. How-to in python can be found here
When you hit the upper limits of a single Lambda run, think about ways to break the work to multiple Lambdas running in parallel if possible. That level of horizontal scaling is what Lambda excels at.
Is scheduling a lambda function to get called every 20 mins with CloudWatch the best way to get rid of lambda cold start times? (not completely get rid of)...
Will this get pricey or is there something I am missing because I have it set up right now and I think it is working.
Before my cold start time would be like 10 seconds and every subsequent call would complete in like 80 ms. Now every call no matter how frequent is around 80 ms. Is this a good method until say your userbase grows, then you can turn this off?
My second option is just using beanstalk and having a server running 24/7 but that sounds expensive so I don't prefer it.
As far as I know this is the only way to keep the function hot right now. It can get pricey only when you have a lot of those functions.
You'd have to calculate yourself how much do you pay for keeping your functions alive considering how many of them do you have, how long does it take to run them each time and how much memory do you need.
But once every 20 minutes is something like 2000 times per month so if you use e.g. 128MB and make them finish under 100ms then you could keep quite a lot of such functions alive at 20 minute intervals and still be under the free tier - it would be 20 seconds per month per function. You don't even need to turn it off after you get a bigger load because it will be irrelevant at this point. Besides you can never be sure to get a uniform load all the time so you might keep your heart beating code active even then.
Though my guess is that since it is so cheap to keep a function alive (especially if you have a special argument that makes them return immediately) and that the difference is so great (10 seconds vs. 80 ms) then pretty much everyone will do it - there is pretty much no excuse not to. In that case I expect Amazon to either fight that practice (by making it hard or more expensive than it currently is - which wouldn't be a smart move) or to make it not needed in the future. If the difference between hot and cold start was 100ms then no one would bother. If it is 10 seconds than everyone needs to work around it.
There would always have to be a difference between running a code that was run a second ago and a code that was run a month ago, because having all of them in RAM and ready to go would waste a lot of resources, but I see no reason why that difference couldn't be made less noticeable or even have few more steps instead of just hot and cold start.
You can improve the cold start time by allocating more memory to your Lambda function. With the default 512MB, I am seeing cold start times of 8-10 seconds for functions written in Java. This improves to 2-3 seconds with 1536MB of memory.
Amazon says that it is the CPU allocation that really matters, but there is no way to directly change it. CPU allocation increases proportionately to memory.
And if you want close to zero cold start times, keeping the function warm is the way to go, as described rsp suggested.
Starting from December 2019 AWS Lambda supports Reserved Concurrency (so you can set the number of lambda functions that will be ready and waiting for new calls) [1]
The downside of this, is that you will be charged for the reserved concurrency. If you provision a concurrency of 1, for a lambda with 128MB being active 24 hrs for the whole month, you will be charged: 1 instance x 30 days x 24 hr x 60min x 60sec x (128/1024) = 324,000 GB-sec (almost all of the capacity AWS gives for the lambda free tier) [2]
From above you will get a lambda instance that responds very fast...subsequent concurrent calls may still suffer "cold-start" though.
What is more, you can configure application autoscaling to dynamically manage the provisioned concurrency of your lambda. [3]
Refs:
https://aws.amazon.com/blogs/aws/new-provisioned-concurrency-for-lambda-functions/
https://aws.amazon.com/lambda/pricing/
https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
Among adding more memory for lambda, there is also one more approach to reduce the cold starts: use Graal native-image tool. The jar is translated into byte code. Basically, we would do part of the work, which is done on aws. When you build your code, on loading to AWS - select "Custom runtime", not java8.
Helpful article: https://engineering.opsgenie.com/run-native-java-using-graalvm-in-aws-lambda-with-golang-ba86e27930bf
Beware:
but it also has its limitations; it does not support dynamic class loading, and reflection support is also limited
Azure has pre warming solution for serverless instances(Link). This would be a great feature in AWS lambda if and when they implement it.
Instead of user warming the instance at the application level it's handled by the cloud provider in the platform.
Hitting server would not resolve case of simultaneous requests by users, or same page sending a few api requests async.
A better solution is to dump the 'warmed-up' into docker checkpoint. It is especially useful for dynamic language when warm up is fast, yet loading of all the libraries is slow.
For details read
https://criu.org/Docker
https://www.imperial.ac.uk/media/imperial-college/faculty-of-engineering/computing/public/1819-ug-projects/StenbomO-Refunction-Eliminating-Serverless-Cold-Starts-Through-Container-Reuse.pdf
Other hints:
use more memory
use Python or JavaScript with most basic libraries, try eliminate bulky ones
create several 'microservices' to reduce chances of several users hitting same service
see more at https://www.jeremydaly.com/15-key-takeaways-from-the-serverless-talk-at-aws-startup-day/
Lambda's cold start depends on multiple factors such as your implementation, the language run-time you use, and the code size etc. If you give your Lambda function more memory, you can reduce the cold start too. You can read the best practices https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html
The serverless community also has recommendations for performance https://atlas.serverless.tech-field-community.aws.a2z.com/Performance/index.html
Lambda team also launched Provisioned Concurrency. You can now request multiple Lambda containers be kept in a "hyper ready" state, ready to re-run your function. This is the new best practice for reducing the likelihood of cold starts.
Official docs https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html?icmpid=docs_lambda_console
Hi a few questions regarding Cuda stream processing for multiple kernels.
Assume s streams and a kernels in a 3.5 capable kepler device, where s <= 32.
kernel uses a dev_input array of size n and a dev output array of size s*n.
kernel reads data from input array, stores its value in a register, manipulates it and writes its result back to dev_output at the position s*n + tid.
We aim to run the same kernel s times using one of the n streams each time. Similar to the simpleHyperQ example. Can you comment if and how any of the following affects concurrency please?
dev_input and dev_output are not pinned;
dev_input as it is vs dev_input size s*n, where each kernel reads unique data (no read conflicts)
kernels read data from constant memory
10kb of shared memory are allocated per block.
kernel uses 60 registers
Any good comments will be appreciated...!!!
cheers,
Thanasio
Robert,
thanks a lot for your detailed answer. It has been very helpful. I edited 4, it is 10kb per block. So in my situation, i launch grids of 61 blocks and 256 threads. The kernels are rather computationally bound. I launch 8 streams of the same kernel. Profile them and then i see a very good overlap between the first two and then it gets worse and worse. The kernel execution time is around 6ms. After the first two streams execute almost perfectly concurrent the rest have a 3ms distance between them. Regarding 5, i use a K20 which has a 255 register file. So i would not expect drawbacks from there. I really cannot understand why i do not achieve concurrency equivalent to what is specified for gk110s..
Please take a look at the following link. There is an image called kF.png .It shows the profiler output for the streams..!!!
https://devtalk.nvidia.com/default/topic/531740/cuda-programming-and-performance/concurrent-streams-and-hyperq-for-k20/
Concurrency amongst kernels depends upon a number of factors, but one that many people overlook is simply the size of the kernel (i.e. number of blocks in the grid.) Kernels that are of a size that can effectively utilize the GPU by themselves will not generally run concurrently to a large degree, and there would be little throughput advantage even if they did. The work distributor inside the GPU will generally begin distributing blocks as soon as a kernel is launched, so if one kernel is launched before another, and both have a large number of blocks, then the first kernel will generally occupy the GPU until it is nearly complete, at which point blocks of the second kernel will then get scheduled and executed, perhaps with a small amount of "concurrent overlap".
The main point is that kernels that have enough blocks to "fill up the GPU" will prevent other kernels from actually executing, and apart from scheduling, this isn't any different on a compute 3.5 device. In addition, rather than just specifying a few parameters for the kernel as a whole, also specifying launch parameters and statistics (such as register usage, shared mem usage, etc.) at the block level are helpful for providing crisp answers. The benefits of the compute 3.5 architecture in this area will still mainly come from "small" kernels of "few" blocks, attempting to execute together. Compute 3.5 has some advantages there.
You should also review the answer to this question.
When global memory used by the kernel is not pinned, it affects the speed of data transfer, and also the ability to overlap copy and compute but does not affect the ability of two kernels to execute concurrently. Nevertheless, the limitation on copy and compute overlap may skew the behavior of your application.
There shouldn't be "read conflicts", I'm not sure what you mean by that. Two independent threads/blocks/grids are allowed to read the same location in global memory. Generally this will get sorted out at the L2 cache level. As long as we are talking about just reads there should be no conflict, and no particular effect on concurrency.
Constant memory is a limited resource, shared amongst all kernels executing on the device (try running deviceQuery). If you have not exceeded the total device limit, then the only issue will be one of utilization of the constant cache, and things like cache thrashing. Apart from this secondary relationship, there is no direct effect on concurrency.
It would be more instructive to identify the amount of shared memory per block rather than per kernel. This will directly affect how many blocks can be scheduled on a SM. But answering this question would be much crisper also if you specified the launch configuration of each kernel, as well as the relative timing of the launch invocations. If shared memory happened to be the limiting factor in scheduling, then you can divide the total available shared memory per SM by the amount used by each kernel, to get an idea of the possible concurrency based on this. My own opinion is that number of blocks in each grid is likely to be a bigger issue, unless you have kernels that use 10k per grid but only have a few blocks in the whole grid.
My comments here would be nearly the same as my response to 4. Take a look at deviceQuery for your device, and if registers became a limiting factor in scheduling blocks on each SM, then you could divide available registers per SM by the register usage per kernel (again, it makes a lot more sense to talk about register usage per block and the number of blocks in the kernel) to discover what the limit might be.
Again, if you have reasonable sized kernels (hundreds or thousands of blocks, or more) then the scheduling of blocks by the work distributor is most likely going to be the dominant factor in the amount of concurrency between kernels.
EDIT: in response to new information posted in the question. I've looked at the kF.png
First let's analyze from a blocks per SM perspective. CC 3.5 allows 16 "open" or currently scheduled blocks per SM. If you are launching 2 kernels of 61 blocks each, that may well be enough to fill the "ready-to-go" queue on the CC 3.5 device. Stated another way, the GPU can handle 2 of these kernels at a time. As the blocks of one of those kernels "drains" then another kernel is scheduled by the work distributor. The blocks of the first kernel "drain" sufficiently in about half the total time, so that the next kernel gets scheduled about halfway through the completion of the first 2 kernels, so at any given point (draw a vertical line on the timeline) you have either 2 or 3 kernels executing simultaneously. (The 3rd kernel launched overlaps the first 2 by about 50% according to the graph, I don't agree with your statement that there is a 3ms distance between each successive kernel launch). If we say that at peak we have 3 kernels scheduled (there are plenty of vertical lines that will intersect 3 kernel timelines) and each kernel has ~60 blocks, then that is about 180 blocks. Your K20 has 13 SMs and each SM can have at most 16 blocks scheduled on it. This means at peak you have about 180 blocks scheduled (perhaps) vs. a theoretical peak of 16*13 = 208. So you're pretty close to max here, and there's not much more that you could possibly get. But maybe you think you're only getting 120/208, I don't know.
Now let's take a look from a shared memory perspective. A key question is what is the setting of your L1/shared split? I believe it defaults to 48KB of shared memory per SM, but if you've changed this setting that will be pretty important. Regardless, according to your statement each block scheduled will use 10KB of shared memory. This means we would max out around 4 blocks scheduled per SM, or 4*13 total blocks = 52 blocks max that can be scheduled at any given time. You're clearly exceeding this number, so probably I don't have enough information about the shared memory usage by your kernels. If you're really using 10kb/block, this would more or less preclude you from having more than one kernel's worth of threadblocks executing at a time. There could still be some overlap, and I believe this is likely to be the actual limiting factor in your application. The first kernel of 60 blocks gets scheduled. After a few blocks drain (or perhaps because the 2 kernels were launched close enough together) the second kernel begins to get scheduled, so nearly simultaneously. Then we have to wait a while for about a kernel's worth of blocks to drain before the 3rd kernel can get scheduled, this may well be at the 50% point as indicated in the timeline.
Anyway I think the analyses 1 and 2 above clearly suggest you're getting most of the capability out of the device, based on the limitations inherent in your kernel structure. (We could do a similar analysis based on registers to discover if that is a significant limiting factor.) Regarding this statement: "I really cannot understand why i do not achieve concurrency equivalent to what is specified for gk110s.." I hope you see that the concurrency spec (e.g. 32 kernels) is a maximum spec, and in most cases you are going to run into some other kind of machine limit before you hit the limit on the maximum number of kernels that can execute simultaneously.
EDIT: regarding documentation and resources, the answer I linked to above from Greg Smith provides some resource links. Here are a few more:
The C programming guide has a section on Asynchronous Concurrent Execution.
GPU Concurrency and Streams presentation by Dr. Steve Rennich at NVIDIA is on the NVIDIA webinar page
My experience with HyperQ so far is 2-3 (3.5) times parallellization of my kernels, as the kernels usually are larger for a little more complex calculations. With small kernels its a different story, but usually the kernels are more complicated.
This is also answered by Nvidia in their cuda 5.0 documentation that more complex kernels will take down the amount of parallellization.
But still, GK110 has a great advantage just allowing this.
I just made some benchmarks for this super question/answer Why is my program slow when looping over exactly 8192 elements?
I want to do benchmark on one core so the program is single threaded. But it doesn't reach 100% usage of one core, it uses 60% at most. So my tests are not acurate.
I'm using Qt Creator, compiling using MinGW release mode.
Are there any parameters to setup for better performance ? Is it normal that I can't leverage CPU power ? Is it Qt related ? Is there some interruptions or something preventing code to run at 100%...
Here is the main loop
// horizontal sums for first two lines
for(i=1;i<SIZE*2;i++){
hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
}
// rest of the computation
for(;i<totalSize;i++){
// compute horizontal sum for next line
hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
// final result
resPointer[i-SIZE]=(hsumPointer[i-SIZE-SIZE]+hsumPointer[i-SIZE]+hsumPointer[i])/9;
}
This is run 10 times on an array of SIZE*SIZE float with SIZE=8193, the array is on the heap.
There could be several reasons why Task Manager isn't showing 100% CPU usage on 1 core:
You have a multiprocessor system and the load is getting spread across multiple CPUs (most OSes will do this unless you specify a more restrictive CPU affinity);
The run isn't long enough to span a complete Task Manager sampling period;
You have run out of RAM and are swapping heavily, meaning lots of time is spent waiting for disk I/O when reading/writing memory.
Or it could be a combination of all three.
Also Let_Me_Be's comment on your question is right -- nothing here is QT's fault, since no QT functions are being called (assuming that the objects being read and written to are just simple numeric data types, not fancy C++ objects with overloaded operator=() or something). The only activities taking place in this region of the code are purely CPU-based (well, the CPU will spend some time waiting for data to be sent to/from RAM, but that is counted as CPU-in-use time), so you would expect to see 100% CPU utilisation except under the conditions given above.
I want to use multi-threads to accelerate my program, but not sure which way is optimal.
Say we have 10000 small tasks, it takes maybe only 0.1s to finish one of them. Now I have a CPU with 12 cores and I want to use 12 threads to make it faster.
So far as I know, there are two ways:
1.Tasks Pool
There are always 12 threads running, each of them get one new task from the tasks pool after it finished its current work.
2.Separate Tasks
By separating the 10000 tasks into 12 parts and each thread works on one part.
The problem is, if I use tasks pool it is a waste of time for lock/unlock when multiple threads try to access the tasks pool. But the 2nd way is not ideal because some of the threads finish early, the total time depends on the slowest thread.
I am wondering how you deal with this kind of work and any other best way to do it? Thank you.
EDIT: Please note that the number 10000 is just for example, in practice, it may be 1e8 or more tasks and 0.1 per task is also an average time.
EDIT2: Thanks for all your answers :] It is good to know kinds of options.
So one midway between the two approaches is to break into say 100 batches of 100 tasks each and let the a core pick a batch of 100 tasks at a time from the task pool.
Perhaps if you model the randomness in execution time in a single core for a single task, and get an estimate of mutex locking time, you might be able to find an optimal batch size.
But without too much work we at least have the following lemma :
The slowest thread can only take at max 100*.1 = 10s more than others.
Task pool is always the best solution here. It's not just optimum time, it's also comprehensibility of code. You should never force your tasks to conform to the completely unrelated criteria of having the same number of subtasks as cores - your tasks have nothing to do with that (in general), and such a separation doesn't scale when you change machines, etc. It requires overhead to collaborate on combining results in subtasks for the final task, and just generally makes an easy task hard.
But you should not be worrying about the use of locks for taskpools. There are lockfree queues available if you ever determined them necessary. But determine that first. If time is your concern, use the appropriate methods of speeding up your task, and put your effort where you will get the most benefit. Profile your code. Why do your tasks take 0.1 s? Do they use an inefficient algorithm? Can loop unrolling help? If you find the hotspots in your code through profiling, you may find that locks are the least of your worries. And if you find everything is running as fast as possible, and you want that extra second from removing locks, search the internet with your favorite search engine for "lockfree queue" and "waitfree queue". Compare and swap makes atomic lists easy.
Both ways suggested in the question will perform well and similarly to each another (in simple cases with predictable and relatively long duration of the tasks). If the target system type is known and available (and if performance is really a top concern), the approach should be chosen based on prototyping and measurements.
Do not necessarily prejudice yourself as to the optimal number of threads matching the number of the cores. If this is a regular server or desktop system, there will be various system processes kicking in here and then and you may see your 12 threads variously floating between processors which hurts memory caching.
There are also crucial non-measurement factors you should check: do those small tasks require any resources to execute? Do these resources impose additional potential delays (blocking) or competition? Are there additional apps competing for the CPU power? Will the application need to be grow to accommodate different execution environments, task types, or user interaction models?
If the answer to all is negative, here are some additional approaches that you can measure and consider.
Use only 10 or 11 threads. You will observe a small slowdown, or even
a small speedup (the additional core will serve OS processes, so that
thread affinity of the rest will become more stable compared to 12
threads). Any concurrent interactive activity on the system will see
a big boost in responsiveness.
Create exactly 12 threads but explicitly set a different processor
affinity mask to each, to impose a 1-1 mapping between threads and processors.
This is good in the simplest near-academical case
where there are no resources other than CPU and shared memory
involved; you will see no chronic migration of threads across
processes. The drawback is an
algorithm closely coupled to a particular machine; on another machine
it could behave so poorly as to finish never at all (because of an
unrelated real time task that
blocks one of your threads forever).
Create 12 threads and split the tasks evenly. Have each thread
downgrade its own priority once it is past 40% and again once it is
past 80% of its load. This will improve load balancing inside your
process, but it will behave poorly if your application is competing
with other CPU-bound processes.
100ms/task - pile 'em on as they are - pool overhead will be insignificant.
OTOH..
1E8 tasks # 0.1s/task = 10,000,000 seconds
= 2777.7r hours
= 115.7 days
That's much more than the interval between patch Tuesday reboots.
Even if you run this on Linux, you should batch up the output and flush it to disk in such a manner that the job is restartable.
Is there a database involved? If so, you should have told us!
Each working thread may have its own small task queue with the capacity of no more than one or two memory pages. When the queue size becomes low (a half of capacity) it should send a signal to some manager thread to populate it with more tasks. If queue is organized in batches then working threads do not need to enter critical sections as long as current batch is not empty. Avoiding critical sections will give you extra cycles for actual job. Two batches per queue are enough, and in this case one batch can take one memory page, and so queue takes two.
The point of memory pages is that thread does not have to jump all over the memory to fetch data. If all data are in one place (one memory page) you avoid cache misses.