Increasing cpu.shares cgroup granularity - cgroups

By default this value is set to 1024. I'm assuming that means that if I set the shares for my group to 1024, it'll be 100% of the available computing power.
Is there any way for me to increase the granularity for this? For example, rather than maxing out at 1024, I want to max out at 2048 or 8192.

IIRC, cpu.shares is already incredibly granular.
cpu.shares: The weight of each group living in the same hierarchy, that translates into the amount of CPU it is expected to get. Upon cgroup creation, each group gets assigned a default of 1024. The percentage of CPU assigned to the cgroup is the value of shares divided by the sum of all shares in all cgroups in the same level.
The value of cpu.shares is used to compare to other cgroups. If you have a single process running under cgroups, no matter the value of cpu.shares it will always get as much CPU as the host can spare. Multiple processes running under cgroups assigned the same value for cpu.shares would split the available CPU evenly.
When you have multiple processes under cgroups, their cpu.shares values are compared together, and those with a higher number get a higher percent of the available host CPU. If you have cpu.shares=1024 for two cgrouped processes, then assign cpu.shares=2048 for a third process, the first two processes would get equal amounts of the available CPU while the third would get twice as much of the available CPU (roughly 25%, 25%, 50%).
The actual values can be anything you want (now that's granular); only the ratio between them matters. The results from the above example would be the same if you used cpu.shares=5 and cpu.shares=10, or cpu.shares=50000 and cpu.shares=100000.
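To make the arithmetic concrete, here is a minimal C sketch (assuming cgroup v1 mounted at /sys/fs/cgroup/cpu, and two hypothetical, already-created groups named groupA and groupB) that sets two shares values and prints the CPU fraction each group can expect when both are fully busy:

    /* Minimal sketch: assumes cgroup v1 at /sys/fs/cgroup/cpu and that the
     * groups "groupA" and "groupB" already exist (names are hypothetical). */
    #include <stdio.h>

    static int set_shares(const char *group, unsigned long shares)
    {
        char path[256];
        snprintf(path, sizeof path, "/sys/fs/cgroup/cpu/%s/cpu.shares", group);

        FILE *f = fopen(path, "w");
        if (!f)
            return -1;                 /* needs root, or the path differs */
        fprintf(f, "%lu\n", shares);
        fclose(f);
        return 0;
    }

    int main(void)
    {
        /* Any ratio works: 5:10, 1024:2048 and 50000:100000 all mean
         * "B gets twice the CPU of A" when both groups are busy. */
        unsigned long a = 1024, b = 2048;

        set_shares("groupA", a);
        set_shares("groupB", b);

        /* Expected share = own value / sum of all sibling groups' values. */
        printf("groupA: %.1f%%\n", 100.0 * a / (a + b));   /* ~33.3% */
        printf("groupB: %.1f%%\n", 100.0 * b / (a + b));   /* ~66.7% */
        return 0;
    }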

Related

How to enqueue as many kernels as there are threads available on OpenCL 1.2

I'm trying to do some calculations that start off with roughly 10-20 objects, but the calculations on those objects create 20-40 more, and so on; so it's slightly recursive, but not forever: eventually the number of calculations reaches zero. I have considered using a different tool, but it's kind of too late for that for me. It's kind of an odd request, which is probably why no results came up.
In short, I'm wondering how it is possible to set the global work size to as many threads as are available. For example, if the GPU can run X different processes in parallel, it would set the global work size to X.
Edit: it would also work if I could launch more kernels from within the GPU, but that doesn't look possible on version 1.2.
There is not really a limit to the global work size (only above 2^32 work-items do you have to use 64-bit ulong indices to avoid integer overflow), and the hard limit at 2^64 work-items is so large that you can never possibly come even close to it.
If you need a billion threads, then set the global work size to a billion threads. The GPU scheduler and hardware will handle that just fine, even if the GPU only has a few thousand physical cores. In fact, you should always launch many more threads than there are cores on the GPU; otherwise the hardware won't be fully saturated and you lose performance.
The only issue could be running out of GPU memory.
Launching kernels from within kernels is only possible in OpenCL 2.0-2.2, on AMD or Intel GPUs.
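For illustration, here is a hedged C host-code sketch of launching a huge 1-D range (it assumes a cl_command_queue and cl_kernel have already been created; that boilerplate is omitted):

    #include <CL/cl.h>
    #include <stdio.h>

    /* Enqueue one work-item per element, far more than the number of
     * physical cores; that is exactly what keeps the GPU saturated. */
    int launch_big(cl_command_queue queue, cl_kernel kernel)
    {
        size_t global_size = 1000000000UL;    /* one billion work-items    */
        size_t local_size  = 256;             /* a typical work-group size */

        /* Round the global size up to a multiple of the work-group size;
         * the surplus work-items should exit early inside the kernel. */
        global_size = ((global_size + local_size - 1) / local_size) * local_size;

        cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                            &global_size, &local_size,
                                            0, NULL, NULL);
        if (err != CL_SUCCESS) {
            fprintf(stderr, "clEnqueueNDRangeKernel failed: %d\n", err);
            return -1;
        }
        return 0;
    }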
It sounds like each iteration depends on the result of the previous one. In that case, your limiting factor is not the number of available threads. You cannot cause some work-items to "wait" for others submitted by the same kernel enqueueing API call (except to a limited extent within a work group).
If you have an OpenCL 2.0+ implementation at your disposal, you can queue subsequent iterations dynamically from within the kernel. If not, and you have established that your bottleneck is checking whether another iteration is required and the subsequent kernel submission, you could try the following:
Assuming a work-item can trivially determine how many threads are actually needed for an iteration based on the output of the previous iteration, you could speculatively enqueue multiple batches of the kernel, each of which depends on the completion event of the previous batch. Inside the kernel, you can exit early if the thread ID is greater than or equal to the number of threads required in that iteration.
This only works if you either have a hard upper bound or can make a reasonable guess that will yield sensible results (with acceptable perf characteristics if the guess is wrong) for:
The maximum number of iterations.
The number of work-items required on each iteration.
Submitting, say, UINT32_MAX work-items for each iteration will likely not make any sense in terms of performance, as the number of work-items that fail the check for whether they are needed will dominate.
You can work around incorrect guesses for the latter number by surrounding the calculation with a loop, so that work-item N will calculate both item N and item M+N if the number of items in an iteration exceeds M, where M is the enqueued work size for that iteration.
Incorrect guesses for the number of iterations would need to be detected on the host, and more iterations enqueued.
So it becomes a case of performing a large number of runs with different guesses and gathering statistics on how good the guesses are and what overall performance they yielded.
I can't say whether this will yield acceptable performance in general - it really depends on the calculations you are performing and whether they are a good fit for GPU-style parallelism, and whether the overhead of the early-out for a potentially large number of work items becomes a problem.
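A rough C sketch of the speculative batching described above (hypothetical names; it assumes the queue and kernel already exist, and that the kernel itself maintains a buffer holding the number of work-items genuinely needed for the next iteration):

    #include <CL/cl.h>

    #define MAX_ITERS  32           /* guessed upper bound on iterations */
    #define GUESS_SIZE (64 * 1024)  /* guessed work-items per iteration  */

    int enqueue_speculative(cl_command_queue queue, cl_kernel kernel)
    {
        size_t global = GUESS_SIZE;
        cl_event prev = NULL, ev;

        for (int i = 0; i < MAX_ITERS; ++i) {
            /* Each batch waits on the previous batch's completion event,
             * so the iterations stay ordered even on an out-of-order queue. */
            cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                                &global, NULL,
                                                prev ? 1 : 0,
                                                prev ? &prev : NULL,
                                                &ev);
            if (err != CL_SUCCESS) {
                if (prev)
                    clReleaseEvent(prev);
                return -1;
            }
            if (prev)
                clReleaseEvent(prev);
            prev = ev;
        }
        clReleaseEvent(prev);
        return 0;
    }

    /* Kernel side (OpenCL C), roughly:
     *
     *   __kernel void step(__global const uint *needed, ...) {
     *       if (get_global_id(0) >= needed[0])
     *           return;   // speculative work-item not needed this round
     *       ...
     *   }
     */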

What's the right parallelism-factor for the Akka thread pool executor running in Docker

I have 15 apps running in containers on a single host. My apps use the default thread pool size based on the number of CPUs they detect, which is what the host exposes (16). However, I'm allocating 1 CPU per app (using Mesos). I know these are only cpu-shares and not full CPUs, but I don't think my apps should be left at the default values for CPU-related settings (I'm already defining max memory per JVM).
What are the right values for parallelism-factor and parallelism-max in the Akka thread pool executor?
Thanks
parallelism-factor is capped by parallelism-max (and floored by parallelism-min); the pool size is essentially max(parallelism-min, min(parallelism-max, cores * parallelism-factor)), so if you want to limit the pool downwards you only have to set parallelism-max to a low value.
It sounds like a low value would fit better, given the single logical core. With a single core, more than one thread will essentially just compete against the others; on the other hand, if there is a little bit of blocking somewhere unexpected, it is good to have some extra threads. I'd go with four or possibly eight and benchmark the application a bit.
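To make the clamping concrete, here is a tiny C sketch of that formula (purely illustrative numbers, not Akka's actual source or defaults):

    #include <math.h>
    #include <stdio.h>

    /* pool = max(parallelism-min, min(parallelism-max, ceil(cores * factor))) */
    static int pool_size(int cores, double factor, int par_min, int par_max)
    {
        int scaled = (int)ceil(cores * factor);
        if (scaled > par_max) scaled = par_max;
        if (scaled < par_min) scaled = par_min;
        return scaled;
    }

    int main(void)
    {
        /* Illustrative settings on a 16-core host. */
        printf("%d\n", pool_size(16, 3.0, 8, 64)); /* 48 threads */
        /* Capping parallelism-max pins the pool regardless of core count. */
        printf("%d\n", pool_size(16, 3.0, 1, 4));  /* 4 threads  */
        return 0;
    }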

Does each Map have its own thread?

Does each Map have its own thread? So, when we do splitting, should we split the task into as many Map functions as we have processors available? Or is there some other way, besides threads, to run map functions in parallel?
I assume you're speaking about the Hadoop MapReduce implementation, and about how the work maps onto CPU cores.
As an intro, the number of map tasks for a given job is derived from the number of input data splits. Those tasks are then scheduled onto task nodes, where mappers are started, up to mapred.tasktracker.map.tasks.maximum per node. This configuration parameter may differ between nodes, for example when they have different computational power.
By default, each mapper runs in a separate JVM, and there can be multiple JVMs running at any given moment on a node, up to mapred.tasktracker.map.tasks.maximum. Those JVMs are either created fresh for each map task or reused for several consecutive runs. I won't dig into the details, but this setting can also affect performance due to the trade-off between memory fragmentation and JVM instantiation overhead.
Coming back to your question: which cores the running JVMs occupy is controlled by the underlying OS, which balances the load. You can expect different JVMs to be scheduled onto different cores when possible, and in the general case you can expect performance degradation if the number of mappers exceeds the number of cores. I have seen skewed use cases where the latter is not true.
An example:
Say you have a job split into 100 map tasks, to be run on 2 task nodes with 2 CPUs each and mapred.tasktracker.map.tasks.maximum equal to 2. Then, most of the time (except while waiting for mappers to start), your 100 tasks will be executed 4 at a time, resulting on average in 50 tasks completed by each node.
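A back-of-the-envelope C sketch of that slot arithmetic (same hypothetical numbers as the example above):

    #include <stdio.h>

    int main(void)
    {
        int map_tasks      = 100;
        int nodes          = 2;
        int slots_per_node = 2;  /* mapred.tasktracker.map.tasks.maximum */

        int concurrent = nodes * slots_per_node;                /*  4 */
        int waves = (map_tasks + concurrent - 1) / concurrent;  /* 25 */

        printf("%d tasks run %d at a time in %d waves, ~%d per node\n",
               map_tasks, concurrent, waves, map_tasks / nodes);
        return 0;
    }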
And last but not least: for map tasks it is common for the bottleneck to be IO rather than CPU. In that case, it's not uncommon to get better results with many machines that are modest on CPU than with a few CPU-heavy servers.

Cuda block or thread preference

The algorithm that I'm implementing has a number of things that need to be done in parallel. My question is: if I'm not going to use shared memory, should I prefer more blocks with fewer threads per block, or more threads per block with fewer blocks, so that the total thread count adds up to the number of parallel things I need to do?
I assume the "set number of things" is a small number or you wouldn't be asking this question. Attempting to expose more parallelism might be time well spent.
CUDA GPUs group execution activity and the resultant memory accesses into warps of 32 threads. So at a minimum, you'll want to start by creating at least one warp per threadblock.
You'll then want to create at least as many threadblocks as you have SMs in your GPU. If you have 4 SMs, then your next scaling increment above 32 would be to create 4 threadblocks of 32 threads each.
If you have more than 128 "number of things" in this hypothetical example, then you will probably want to increase both the warps per threadblock and the number of threadblocks. You might start with threadblocks until you get to some number, perhaps around 16 or so, that would allow your code to scale up on GPUs larger than your hypothetical 4-SM GPU. But there are limits to the number of threadblocks that can be resident on a single SM, so pretty quickly after 16 or so threadblocks you'll also want to increase the number of warps per threadblock beyond 1 (i.e. beyond 32 threads).
These strategies for small problems will allow you to take advantage of all the hardware on the GPU as quickly as possible as your problem scales up, while still allowing opportunities for latency hiding if your problem is large enough (e.g. more than one warp per threadblock, or more than one threadblock resident per SM).
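As a rough illustration, here is the sizing heuristic above written as plain C arithmetic (no CUDA API calls; the 4-SM count and the "around 16 threadblocks" threshold are taken from the hypothetical example, and in real code the SM count would come from the device properties):

    #include <stdio.h>

    #define WARP_SIZE     32
    #define TARGET_BLOCKS 16   /* the "around 16 or so" threadblocks above */

    static void choose_launch(int n, int num_sms, int *blocks, int *threads)
    {
        *threads = WARP_SIZE;                     /* start with one warp per block */
        *blocks  = (n + *threads - 1) / *threads; /* enough blocks to cover n      */

        if (*blocks < num_sms)                    /* at least one block per SM     */
            *blocks = num_sms;

        /* Past roughly 16 threadblocks, grow the block size instead. */
        while (*blocks > TARGET_BLOCKS && *threads < 1024) {
            *threads *= 2;
            *blocks = (n + *threads - 1) / *threads;
        }
    }

    int main(void)
    {
        int blocks, threads;
        choose_launch(100, 4, &blocks, &threads);     /* small problem  */
        printf("%d blocks x %d threads\n", blocks, threads);
        choose_launch(100000, 4, &blocks, &threads);  /* larger problem */
        printf("%d blocks x %d threads\n", blocks, threads);
        return 0;
    }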

How can I measure how my multithreaded code scales (speedup)?

What would be the best way to measure the speedup of my program assuming I only have 4 cores? Obviously I could measure it up to 4, however it would be nice to know for 8, 16, and so on.
Ideally I'd like to know the amount of speedup per number of threads, as in a typical speedup-vs-thread-count graph.
Is there any way I can do this? Perhaps a method of simulating multiple cores?
I'm sorry, but in my opinion the only reliable measurement is to actually get an 8-core, 16-core or larger machine and test on that.
Memory bandwidth saturation, number of CPU functional units and other hardware bottlenecks can have a huge impact on scalability. I know from personal experience that if a program scales on 2 cores and on 4 cores, it might dramatically slow down when run on 8 cores, simply because it's not enough to have 8 cores to be able to scale 8x.
You could try to predict what will happen, but there are a lot of factors that need to be taken into account:
caches - size, number of layers, shared / non-shared
memory bandwidth
number of cores vs. number of processors i.e. is it an 8-core machine or a dual-quad-core machine
interconnection between cores - a lower number of cores (2, 4) can still work reasonably well with a bus, but for 8 or more cores a more sophisticated interconnection is needed.
memory access - again, a lower number of cores works well with the SMP (symmetric multiprocessing) model, while a higher number of cores needs a NUMA (non-uniform memory access) model.
I don't think there is a real way to do this either, but one thing that comes to mind is that you could use a virtual machine to simulate more cores. In VirtualBox, for example, you can select up to 16 cores from the standard menu, and I am fairly confident there are hacks that can go beyond that; other virtual machines such as VMware might even support more out of the box.
bamboon and doron are correct that many variables are at play, but if you have a tunable input size n, you can figure out the strong scaling and weak scaling of your code.
Strong scaling refers to fixing the problem size (e.g. n = 1M) and varying the number of threads available for computation. Weak scaling refers to fixing the problem size per thread (n = 10k/thread) and varying the number of threads available for computation.
It's true that there are a lot of variables at work in any program; however, if you have some basic input size n, it's possible to get some semblance of scaling. On an n-body simulator I developed a few years back, I varied the number of threads for a fixed size and the input size per thread, and was able to calculate a reasonable rough measure of how well the multithreaded code scaled.
Since you only have 4 cores, you can only feasibly compute the scaling up to 4 threads. This severely limits your ability to see how well it scales to largely threaded loads. But this may not be an issue if your application is only used on machines where there are small core counts.
You really need to ask yourself the question: Is this going to be used on 10, 20, 40+ threads? If it is, the only way to accurately determine scaling to those regimes is to actually benchmark it on a platform where you have that hardware available.
Side note: Depending on your application, it may not matter that you only have 4 cores. Some workloads scale with increasing threads regardless of the real number of cores available, if many of those threads spend time "waiting" for something to happen (e.g. web servers). If you're doing pure computation, though, this won't be the case.
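A minimal strong-scaling harness sketch in C with OpenMP, limited here to the 4 threads you can test locally (the loop body is only a placeholder for your real computation; build with something like gcc -O2 -fopenmp):

    #include <omp.h>
    #include <stdio.h>

    #define N           (1L << 24)  /* fixed problem size (strong scaling) */
    #define MAX_THREADS 4

    /* Time one run of the workload with the given number of threads. */
    static double run_once(int nthreads, double *result)
    {
        double sum = 0.0, t0 = omp_get_wtime();
        #pragma omp parallel for num_threads(nthreads) reduction(+:sum)
        for (long i = 0; i < N; ++i)
            sum += (double)i * 1e-9;    /* stand-in for real work       */
        *result = sum;                  /* keep the result observable   */
        return omp_get_wtime() - t0;
    }

    int main(void)
    {
        double sum;
        double t1 = run_once(1, &sum);  /* single-threaded baseline */
        for (int p = 1; p <= MAX_THREADS; ++p) {
            double tp = run_once(p, &sum);
            printf("threads=%d  time=%.3fs  speedup=%.2fx\n", p, tp, t1 / tp);
        }
        return 0;
    }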
I don't believe this is possible, since there are too many variables to be able to accurately extrapolate performance, even assuming you are 100% parallel. There are other factors, like bus speed and cache misses, that might limit your performance, not to mention peripheral performance. How all of these factors affect your code can only be determined through measuring on your specific hardware platform.
I take it you are asking about measurement, so I won't address the issue of predicting the effect on higher numbers of cores.
This question can be viewed another way: how busy can you keep each thread, and what does that total up to? So six threads, each running at say 50% utilization, gives you 3 equivalent processors running. Dividing that by, say, four processors means that your methods are achieving 75% utilization. Comparing that utilization against the clock-time of actual speedup tells you how much of your utilization is new overhead, and how much is real speedup. Isn't that what you are really interested in?
The processor utilization can be computed in real time in a couple of different ways. Threads can independently ask the system for their thread times, compute ratios and maintain global totals. If you have total control over your blocking states, you don't even need the system calls, because you can just keep track of the ratio of blocking to non-blocking machine cycles to compute utilization. A real-time multithreading instrumentation package I developed uses such methods, and they work well. The CPU clock counter in newer CPUs can be read in under 20 machine cycles.
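For what it's worth, here is one way to get that kind of measurement on Linux, comparing a thread's CPU time against wall-clock time (a sketch only; the mixed compute/sleep workload is just a stand-in):

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static double ts_sec(struct timespec t)
    {
        return t.tv_sec + t.tv_nsec * 1e-9;
    }

    int main(void)
    {
        struct timespec cpu0, wall0, cpu1, wall1;
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu0);
        clock_gettime(CLOCK_MONOTONIC, &wall0);

        /* Mixed workload: some computation, then some blocking. */
        volatile double x = 0.0;
        for (long i = 0; i < 50000000L; ++i)
            x += i * 1e-9;
        usleep(200000);                       /* 200 ms of "waiting" */

        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu1);
        clock_gettime(CLOCK_MONOTONIC, &wall1);

        double cpu  = ts_sec(cpu1)  - ts_sec(cpu0);
        double wall = ts_sec(wall1) - ts_sec(wall0);
        printf("cpu=%.3fs wall=%.3fs utilization=%.0f%%\n",
               cpu, wall, 100.0 * cpu / wall);
        return 0;
    }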