Linux TC eBPF and concurrency

Is there a limit to how many instances of an eBPF program the kernel can run simultaneously on several CPUs (similar to the Python GIL problem)?
In particular, can eBPF tc programs work on multiple CPUs simultaneously?
How is locking of kernel data structures done by eBPF when it is running the same code on multiple CPUs?

In particular, can eBPF tc programs work on multiple CPUs simultaneously?
Yes (see details below).
How is locking of kernel data structures done by eBPF when it is running the same code on multiple CPUs?
Concurrent accesses to maps in BPF are protected by the RCU mechanism. However, there is currently no way to protect concurrent code in BPF programs themselves. So, for example, a BPF program running on a first core may update a value between the lookup and update calls of the same program running on a second core.
In some cases, to improve performance, you can use per-CPU maps (e.g., per-CPU arrays and per-CPU hashmaps). In that case, the API for lookups, updates, and deletes stays the same, but each core actually has its own copy of the map's values. This means that, for example, if you are incrementing a counter in a map, each core will see its own counter and you'll have to aggregate their values in userspace to get the total counter. Of course, this might not always fit your use case.

Related

OpenCL: Running parallel tasks on a data-parallel kernel

I'm currently reading up on the OpenCL framework for my thesis work. What I've come across so far is that you can run kernels either data-parallel or task-parallel. Now I have a question whose answer I can't manage to find.
Q: Say you have a vector that you want to sum up. You can do that in OpenCL by writing a kernel for a data-parallel process and just running it. Fairly simple.
However, now say that you have 10+ different vectors that also need to be summed up. Is it possible to run these 10+ vector sums task-parallel, while still using a kernel that processes each of them in a data-parallel way?
So you would basically be parallelizing tasks that are themselves run in parallel? Because what I've come to understand is that you can EITHER run tasks in parallel, OR run one task itself in parallel.
The whole task-parallel/data-parallel distinction in OpenCL was a mistake. We deprecated clEnqueueTask in OpenCL 2.0 because it had no meaning.
All enqueued entities in OpenCL can be viewed as tasks. Those tasks may be run concurrently, they may be run in parallel, they may be serialized. You may need multiple queues to run them concurrently, or a single out-of-order queue, this is all implementation-defined to be fully flexible.
Those tasks may be data-parallel, if they are made of multiple work-items working on different data elements within the same task. They may not be, consisting of only one work-item. This last definition is what clEnqueueTask used to provide - however, because it had no meaning whatsoever compared with clEnqueueNDRangeKernel with a global size of (1,1,1), and it was not checked against anything in the kernel code, deprecating it was the safer option.
So yes, if you enqueue multiple NDRanges, you can have multiple tasks in parallel, each one of which is data-parallel.
You can also copy all of those vectors at once inside one data-parallel kernel, if you are careful with the way you pass them in. One option would be to launch a range of work-groups, each of which iterates through a single vector copying it (that might well be the fastest way on a CPU, for cache-prefetching reasons). You could have each work-item copy one element using some complex lookup to see which vector to copy from, but that would likely have high overhead. Or you can just launch multiple parallel kernels, one per vector, and let the runtime decide whether it can run them together.
If your 10+ different vectors are close to the same size, it becomes a data parallel problem.
The task parallel nature of OpenCL is more suited for CPU implementations. GPUs are more suited for data parallel work. Some high-end GPUs can have a handful of kernels in-flight at once, but their real efficiency is in large data parallel jobs.

Am I disturbing other programs with OpenMP?

I'm using OpenMP for a loop like this:
#pragma omp parallel for
for (int out = 1; out <= matrix.rows; out++)
{
...
}
I'm doing a lot of computations on a machine with 64 CPUs. This works quite well, but my question is:
Am I disturbing other users on this machine? Usually they only run single-threaded programs. Will those still run at 100%? Obviously I will disturb other multithreaded programs, but will I disturb single-threaded programs?
If yes, can I prevent this? I think I can set the maximum number of threads with omp_set_num_threads. I could set this to 60, but I don't think that is the best solution.
The ideal solution would disturb no other single-threaded programs but take as many CPUs as possible.
Every multitasking OS has a component called a process scheduler, which decides where and when to run each process. Schedulers are usually quite stubborn in the decisions they make, but those can often be influenced by various user-supplied policies and hints. The default configuration for almost any scheduler is to try to spread the load over all available CPUs, which often results in processes migrating from one CPU to another.

Fortunately, any modern OS except "the most advanced desktop OS" (a.k.a. OS X) supports something called processor affinity. Every process has a set of processors on which it is allowed to execute - the so-called CPU affinity set of that process. By configuring disjoint affinity sets for various processes, those can be made to execute concurrently without stealing CPU time from each other. Explicit CPU affinity is supported on Linux, FreeBSD (with the ULE scheduler), Windows NT (this also includes all desktop versions since Windows XP), and possibly other OSes (but not OS X).

Every such OS provides a set of kernel calls to manipulate the affinity, as well as tools for doing so without writing a special program. On Linux this is done with the sched_setaffinity(2) system call and the taskset command-line tool; affinity can also be controlled by creating a cpuset instance. On Windows one uses SetProcessAffinityMask() and/or SetThreadAffinityMask(); affinities can also be set in Task Manager from the context menu of a given process, or by passing the desired affinity mask to the START shell command when starting new processes.
What all this has to do with OpenMP is that most OpenMP runtimes for the listed OSes support, in one form or another, ways to specify the desired CPU affinity for each OpenMP thread. The simplest control is the OMP_PROC_BIND environment variable - a simple switch that, when set to TRUE, instructs the OpenMP runtime to "bind" each thread, i.e. to give it an affinity set that includes a single CPU only. The actual placement of threads on CPUs is implementation-dependent, and each implementation provides its own way to control it. For example, the GNU OpenMP runtime (libgomp) reads the GOMP_CPU_AFFINITY environment variable, while the Intel OpenMP runtime (open-sourced not long ago) reads the KMP_AFFINITY environment variable.
The rationale here is that you can limit your program's affinity so that it only uses a subset of all the available CPUs. The remaining processes would then predominantly get scheduled to the rest of the CPUs, though this is only guaranteed if you set their affinity manually (which is only doable if you have root/Administrator access, since otherwise you can modify the affinity only of processes that you own).
It is worth mentioning that it often (but not always) makes no sense to run with more threads than the number of CPUs in the affinity set. For example, if you limit your program to run on 60 CPUs, then using 64 threads would result in some CPUs being oversubscribed and in timesharing between threads, which will make some threads run slower than others. The default schedule for most OpenMP runtimes is schedule(static), so the total execution time of the parallel region is determined by the execution time of the slowest thread. If one thread timeshares with another, both will execute slower than the threads that do not timeshare, and the whole parallel region is delayed. Not only does this reduce parallel performance, it also wastes cycles, since the faster threads simply wait doing nothing (possibly busy-looping at the implicit barrier at the end of the parallel region). The solution is to use dynamic scheduling, i.e.:
#pragma omp parallel for schedule(dynamic,chunk_size)
for (int out = 1; out <= matrix.rows; out++)
{
...
}
where chunk_size is the size of the iteration chunk that each thread gets. The whole iteration space is divided into chunks of chunk_size iterations, which are handed to the worker threads on a first-come-first-served basis. The chunk size is an important parameter: if it is too low (the default is 1), there could be huge overhead from the OpenMP runtime managing the dynamic scheduling; if it is too high, there might not be enough work available for each thread. It makes no sense to have a chunk size bigger than matrix.rows / #threads.
Dynamic scheduling allows your program to adapt to the available CPU resources, even if they are not uniform, e.g. if there are other processes running and timesharing with the current one. But it comes with a catch: big systems like your 64-core one are usually ccNUMA (cache-coherent non-uniform memory access) systems, which means that each CPU has its own memory block, and access to the memory block(s) of other CPU(s) is costly (i.e. takes more time and/or provides less bandwidth). Dynamic scheduling tends to destroy data locality, since one cannot be sure that a block of memory residing on one NUMA node won't be used by a thread running on another NUMA node. This is especially important when data sets are large and do not fit in the CPU caches. Therefore, YMMV.
Put your process on low priority within the operating system. Use as many resources as you like. If someone else needs those resources, the OS will make sure to provide them, because those processes are at a higher (i.e. normal) priority. If there are no other users, you will get all the resources.

Hyper-threading - By which test can I check if it is enabled or disabled?

Is there any simple performance test to detect whether HT is enabled or not?
For example, I need it in the case where the max CPU number is limited by the Linux kernel (NR_CPUS) and there is no access to the BIOS.
So could you advise any code to detect whether HT is enabled?
I glanced here and here, but those aren't the answers.
Thanks.
There is another way - the /sys/ file system, which is supposed to be more orderly than /proc (whose /proc/cpuinfo output varies between kernel versions):
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings
gives you the list of hardware threads that run together with core cpu0.
https://www.kernel.org/doc/Documentation/cputopology.txt
4) /sys/devices/system/cpu/cpuX/topology/thread_siblings:
internal kernel map of cpuX's hardware threads within the same
core as cpuX
On Linux I think you can read /proc/cpuinfo, but after that you have to do a bit of thinking to see whether you have a multicore CPU, an HT-enabled CPU, etc.
First, flags will give you the supported features, and ht there will indicate hyper-threading support.
Then you have to check whether the sibling count matches the core count on each CPU, so look for the cpu id and deduce from there. (If the sibling count matches the core count, there is no HT.)
More information can be found here: http://richweb.com/cpu_info
Checking the flags would give you a clear answer, whereas a performance test (particularly one that checks the result programmatically) carries some uncertainty. What performance characteristic would serve as the hyper-threading (HT) signature we test for? HT provides better performance when the threads are doing different work, where "different" is defined by the microarchitecture. In contrast, separate cores show little performance correlation due to the code executing on each core (some coupling still exists through factors like memory bandwidth or shared caches).
There are a variety of combinations you could test; I will sketch one possible approach here. Assume the system has at least two cores that may also have HT enabled, presenting 4 logical processors (LPs) on which threads can be scheduled. Craft a single-threaded program that stresses one core's resources. Now duplicate that work, so that you have two threads that can run independently. To test performance, set the scheduling affinity of the threads to different pairs of LPs in the system, then measure the performance of each pairing. An HT pair will give different performance than a pairing of separate cores.
In writing the performance test, you have the usual concerns with measuring performance. Does the measuring mechanism have the requisite granularity? Is the variable you are testing (HT versus core) the only one changing? For example, is the cache in the same state before each test? Do some cores share caches, so that pairing them would give different performance from other pairs? If you get all of this right, you should observe different performance results depending on which pair of LPs you scheduled your work on.

Message passing interface on shared memory systems performance

As far as I know, there are two ways to do parallel processing: message passing (MPI) and multithreading. Multithreading cannot be used for distributed-memory systems without a message-passing interface, but MPI can be used on both shared-memory and distributed-memory systems. My question is about the performance of code that is parallelized with MPI and run on a shared-memory system. Is the performance of such code in the same range as that of code parallelized with multithreading?
Update:
My job is such that the processes need to communicate with each other repeatedly, and the communicated array can be a 200x200 matrix.
The answer is: it depends. MPI processes are predominantly separate OS processes, and communication between them occurs via some sort of shared-memory IPC technique when the communicating processes run on the same shared-memory node. Being separate OS processes, MPI processes in general do not share data, and sometimes data has to be replicated in each process, which leads to less than optimal memory usage. On the other hand, threads can share lots of data and can benefit from cache reuse, especially on multicore CPUs with large shared last-level caches (e.g., the L3 cache on current-generation x86 CPUs). Cache reuse, combined with more lightweight methods of data exchange between threads (usually just synchronisation, since the work data is already shared), can lead to better performance than is achievable by separate processes.
But once again - it depends.
Let's assume we only consider MPI and OpenMP, since they are the two major representatives of the two parallel programming families you mention. For distributed systems, MPI is the only option between different nodes. Within a single node, however, as you well say, you can still use MPI and use OpenMP too. Which one will perform better really depends on the application you are running, and specifically in its computation/communication ratio. Here you can see a comparison of MPI and OpenMP for a multicore processor, where they confirm the same observation.
You can go a step further and use a hybrid approach. Use MPI between the nodes and then use OpenMP within nodes. This is called hybrid MPI+OpenMP parallel programming. You can also apply this within a node that contains a hybrid CMP+SMT processor.
You can check some information here and here. Moreover this paper compares an MPI approach vs a hybrid MPI+OpenMP one.
In my opinion, they're simply better at different jobs. The Actor model is great at asynchronously performing many different tasks at different times, whereas the OpenMP/TBB/PPL model is great for performing one task in parallel very simply and reliably.

Executing C++ program on multiple processor machine

I developed a program in C++ for research purposes. It takes several days to complete.
Now I am executing it on our lab's 8-core server machine to get results more quickly, but I see that the machine assigns only one processor to my program and it stays at around 13% processor usage (even though I set the process priority to high and the affinity to all 8 cores).
(It is a simple object-oriented program without any parallelism or multithreading.)
How can I get true benefit from the powerful server machine?
Thanks in advance.
Partition your code into chunks you can execute in parallel.
You need to go read about data parallelism and task parallelism.
Then you can use OpenMP or MPI to break up your program.
(It is a simple object-oriented program without any parallelism or multithreading.)
How can I get true benefit from the powerful server machine?
By using more threads. No matter how powerful the computer is, it cannot spread a thread across more than one processor. Find independent portions of your program and run them in parallel.
C++0x threads
Boost threads
OpenMP
I personally consider OpenMP a toy. You should probably go with one of the other two.
You have to exploit multiparallelism explicitly by splitting your code into multiple tasks that can be executed independently and then either use thread primitives directly or a higher level parallelization framework, such as OpenMP.
If you don't want to make your program itself use multithreaded libraries or techniques, you might be able to try breaking your work up into several independent chunks. Then run multiple copies of your program...each being assigned to a different chunk, specified by getting different command-line parameters.
As for generally improving a program's performance: there are profiling tools that can help you find the bottlenecks in memory usage, I/O, or CPU:
https://stackoverflow.com/questions/tagged/c%2b%2b%20profiling
That won't split your work across cores, but an 8x algorithmic speedup might help more than multithreading on 8 cores would. Just something else to consider.