Multiple instances of program on multi-core machine - c++

I am assuming a dual-core (2 cores per processors) machine with 2 processors for the questions that follow; so a total of 4 "cores". So some natural questions arose:
Suppose I wrote a simple serial program and built it in, say, Visual Studio.. and ran the same program twice, say, with distinct input data in each run. Would they be running on the same processor? Or distinct processors? How much RAM memory would be assigned to each? Would it be the RAM memory on 1 processor (2 cores) or the total RAM? I believe the two programs would run on distinct processors and should each have RAM memory of 1 processor (2 cores); but I am not 100% certain. Would the behavior be any different on Linux?
Now suppose my program was written using a distributed memory parallel interface such as MPI and that I ran it once with 2 processors in the np argument (say). Would the program use both processors (and in effect all 4 cores)? Is this the optimal value for the argument -np? In other words, if I did the same with -np 3 or -np 4; is it correct to assume there would be no added advantage? Again, I think so, but I am not 100% certain. I assume also that I could go higher than 4 (-np 5, -np 6, etc). In such cases, how do the processes compete for memory at values of np > 4? Would the performance get worse for np > 4. I think yes, and perhaps this partly depends on problem size, but again not 100% sure.
Next, suppose I ran two instances of my MPI-built parallel program, both with -np 2, each with, say, different input data. First off, is this possible? I assume it is and that they each run on both processors? How are the two programs synchronized and how do they individually compete for memory sequentially? This should atleast in part, be based on the order of launching the programs, presumably?
Lastly, suppose my program was written using a shared memory parallel interface such as OpenMP and that I ran it once. How many "threads" can I run it on to make full use of shared memory parallelism - is it 2 or 4? (since I have 2 processors with 2 cores each). My guess is it is 4; since all 4 cores are part of the a single shared memory unit? Is that correct? If the answer is 4; does it make sense to run on greater than 4 threads? I am not sure this even works (unlike MPI, where I believe we can do -np 5, -np 6 and so on).
Finally, suppose I run 2 instances of the shared memory parallel program, each with, say, different input data. I assume this is possible and that the individual processes would somehow compete for memory, presumably in the order the programs were launched?

Which processor they run on is entirely up to the OS and depends on many factors, including whatever else is happening on the same machine. The common case, though, is that they will tend to sit on one core each, occasionally swapping to different cores ("occasionally" may mean several times a second or even more frequently).
Çores don't have their own RAM on normal PC hardware, and the processes will be given however much RAM they ask for.
For MPI processes, yes, your parallelism should match the core count (assuming a CPU-heavy workload). If two MPI processes run with -np 2, they will simply consume all four cores. Increase anything and they'll start to contend. As explained above, RAM has nothing to do with any of this, though cache will suffer in the presence of contention.
This "question" is way too long, so I'm going to stop now.

#Marcelo is absolutely right and I'd like to just expand on his answer a little bit.
The OS will determine where and when the threads the comprise the application execution depending on what else is going on in the system and the available resources. Each application will run in it's own process and that process can have hundereds or thousands of threads. The OS (Windows, Linux, Mac whatever) will switch the execution context of the processing cores to ensure that all applications and services get a slice of the pie.
As for I/O access to such things as RAM that is physically controlled by the NorthBridge Controller that sits on your motherboard. Each process (not processor!) will have an allocated amount of RAM that it can deal with that can expand or contract over the lifetime of the application... this of course is limited to the amount of resources available on the system, and also worth noting the OS will take care of swapping RAM requests beyond it's physically availability to disk (i.e. Virtual RAM).
On the other hand though you will need to coordinate access to memory within your application through the use of critical sections and other thread synchronising mechanisms.
OpenMP is a library that helps you write multithreaded parellel applications and makes the syntax of keeping threads in sync easier.... I would comment more, but it's been quite a while since I've used it and I'm sure someone could give a better explaination.

I see you are using windows, so I will summarize by saying that you can set process affinities (which core or cores a process can run on) in the task manager. There's also a winapi call but the name escapes me
a) for a single threaded program, they will not launch on the same cpu (assuming its cpu bound). You can guarantee it by changing the affinity. in linux there's a call sched_setaffinity and a userspace program taskset
b) depends on the MPI library; the machinery is library-specific.
c) it depends on the specific application and data pattern. For small data accesses but lots of messaging passing, you may actually find limiting to 1 CPU to be the most efficient pattern.

Related

openMP: Running with all threads in parallel leads to out-of-memory-exceptions

I want to shorten the runtime of an lengthy image processing algorithm, which is applied to multiple images by using parallel processing with openMP.
The algorithm works fine with single or limited number (=2) of threads.
But: The parallel processing with openMP requires lots of memory, leading to out-of-memory-exceptions, when running with the maximum number of possible threads.
To resolve the issue, I replaced the "throwing of exceptions" with a "waiting for free memory" in case of low memory, leading to many (<= all) threads just waiting for free memory...
Is there any solution/tool/approach to dynamically maintain the memory or start threads depending on available memory?
Try compiling your program 64-bit. 32-bit programs can only have up to 2^32 = about 4GB of memory. 64-bit programs can use significantly more (2^64 which is 18 exabytes). It's very easy to hit 4GB of memory these days.
Note that if you are using more RAM than you have available, your OS will have to page some memory to disk. This can hurt performance a lot. If you get to this point (where you are using a significant portion of RAM) and still have extra cores, you would have to go deeper into the algorithm to find a more granular section to parallelize.
If you for some reason can't switch to 64-bit, you can do multiprocessing (running multiple instances of a program) so each process will have up to 4GB. You will need to launch and coordinate the processes somehow. Depending on your needs, this could mean using simple command-line arguments or complicated inter-process communication (IPC). OpenMP doesn't do IPC, but Open MPI does. Open MPI is generally used for communication between many nodes on a network, but it can be set up to run concurrent instances on one machine.

Performance of threads and processes in linux

I have two scenarios in Linux that I've been working for some time in the same machine. The machine has two xeon processors each with 8 cores and 16 threads.
I have one code in c++ that is parallelized with openmp. In this scenario, if I use all threads (32 in total according to the Linux kernel) do I have any penalties in terms of concurrence between the threads ? I mean, setting 32 threads is the optimal configuration for this scenario ?
I run a given number of processes (all single threaded) using the same binary. Basically I have a script that spawn the same binary with different input files. In this scenario, what is the best way to launch these processes and not exhaust the machine ? I think that if I run 32 processes at the same time I will harm the performance of the machine.
The optimal one will generally be something between 16 and 32 for CPU-bound tasks (hyperthreaded cores compete for the same resources); for memory-bound or even IO-bound tasks it can be even lower.
Still, in most cases using as many threads as cores can be a good starting point.
Why should it be harmful? In Linux, threads are just processes that happen to share the virtual address space (and most other OS resources). If you have enough RAM to keep them running without paging¹ and each process is single thread, 32 is as ok as per the thread case.
notice that the situation would be pretty much the same for an equivalent multithreaded program, as the program code is shared between the various instances of the application.

Multithreading efficiency in C++

I am trying to learn threading in C++, and just had a few questions about it (more specifically <thread>.
Let's say the machine this code will run on has 4 cores, should I split up an operation into 4 threads? If I were to create 8 threads instead of 4, would this run slower on a 4 core machine? What if the processor has hyperthreading, should I try and make the threads match the number of physical cores or logical cores?
Should I just not worry about the number of cores a machine has, and try to create as many threads as possible?
I apologize if these questions have been already answered; I've been looking for information about threading with <thread>, which was introduced in c11 so I haven't been able to find too much about it.
The program in question is going to run many independent simulations.
If anybody has any insight about <thread> or just multithreading in general, I would be glad to hear it.
If you are performing pure calculations with no I/O - and those calculations are freestanding and not relying on results from other calculations happening in another thread, the maximum number of such threads should be the number of cores (possibly one or two less if the system is also loaded with other tasks).
If you are doing network I/O or similar, more threads are certainly a possibility.
If you are doing disk-I/O, a single thread reading from the disk is often best, because disk reads from multiple threads leads to moving the read/write head around on the disk, which just makes things slower.
If you're using threads for to make the code simpler, then the number of threads will probably depend on what you are doing.
It also depends on how "freestanding" each thread is. If they need to share data in complex ways, the sharing/waiting for other thread/etc, may well make it slower with more threads.
And as others have said, try to make your framework for this flexible and test different options. Preferably on multiple machines (unless you only have one kind of machine that you will ever run your code on).
There is no such thing as <threads.h>, you mean <thread>, the thread support library introduced in C++11.
The only answer to your question is "test and see". You can make your code flexible enough, so that it can be run by passing an N parameter (where N is the desired number of threads).
If you are CPU-bound, the answer will be very different from the case when you are IO bound.
So, test and see! For your reference, this link can be helpful. And if you are serious, then go ahead and get this book. Multithreading, concurrency, and the like are hairy topics.
Let's say the machine this code will run on has 4 cores, should I split up an operation into 4 threads?
If some portions of your code can be run in parallel, then yes it can be made to go faster, but this is very tricky to do since loading threads and switching data between them takes a ton of time.
If I were to create 8 threads instead of 4, would this run slower on a 4 core machine?
It depends on the context switching it has to do. Sometimes the execution will switch between threads very often and sometimes it will not but this is very difficult to control. It will not in any case run faster than 4 threads doing the same work.
What if the processor has hyperthreading, should I try and make the threads match the number of physical cores or logical cores?
Hyperthreading works nearly the same as having more cores. When you will notice the differences between a real core and an execution core, you will have enough knowledge to work around the caveats.
Should I just not worry about the number of cores a machine has, and try to create as many threads as possible?
NO, threads are hard to manage, avoid them as much as you can.
The program in question is going to run many independent simulations.
You should look into openmp. It is a library in C made to parallelize computation when your program can be split up. Do not confuse parallel with concurrent. Concurrent is simply multiple threads working together while parallel is made specifically to speed up your application. Maybe openmp is overkill for your thing, but it is a good thing to know when you are approaching parallel computing
Don't think of the number of threads you need as in comparison to the machine you're running on. Threading is valuablue any time you have a process that:
A: There is some very slow operation, that the rest of the process need not wait for.
B: Certain functions can run faster than one another and don't need to be executed inline.
C: There is a lot of non-order dependant I/O going on(web servers).
These are just a few of the obvious examples when launching a thread makes sense. The number of threads you launch is therefore more dependant on the number of these scenarios that pop up in your code, than the architecture you expect to run on. In fact unless you're running a process that really really needs to be optimized, it is likely that you can only eek out a few percentage points of additional performance by benchmarking for your architecture in comparison to the number of threads that you launch, and in modern computers this number shouldn't vary much at all.
Let's take the I/O example, as it is the scenario that will see the most benefit. Let's assume that some program needs to interract with 200 users over the network. Network I/O is very very slow. Thousands of times slower than the CPU. If we were to handle each user in turn we would waste thousands of processor cycles just waiting for data to come from the first user. Could we not have been processing information from more than one user at a time? In this case since we have roughly 200 users, and the data that we're waiting for we know to be 1000s of times slower than what we can handle(assuming we have a minimal amount of processing to do on this data), we should launch as many threads as the operating system allows. A web server that takes advantage of threading can serve hundreds of more people per second than one that does not.
Now, let's consider a less I/O intensive example, where say we have several functions that execute in turn, but are independant of one another and some of them might run faster, say because there is disk I/O in one, and no disk I/O in another. In this case, our I/O is still fairly fast, but we will certainly waste processing time waiting for the disk to catch up. As such we can launch a few threads, just to take advantage of our processing power, and minimize wasted cycles. However, if we launch as many threads as the operating system allows we are likely to cuase memory management issues for branch predictors, etc... and launching too many threads in this case is actually sub optimal and can slow the program down. Note that in this, I never mentioned how many cores the machine has! NOt that optimizing for different architectures isn't valuable, but if you optimize for one architecture you are likely very close to optimal for most. Assuming, again, that you're dealing with all reasonably modern processors.
I think most people would say that large scale threading projects are better supported by languages other than c++ (go, scala,cuda). Task parallelism as opposed to data parallelism works better in c++. I would say that you should create as many threads as you have tasks to dole out but if data parallelism is more related to your problem consider maybe using cuda and linking to the rest of your project at a later time
NOTE: if you look at some sort of system monitor you will notice that there are likely far more than 8 threads running, I looked at my computer and it had hundreds of threads running at once so don't worry too much about the overhead. The main reason I choose to mention the other languages is that managing many threads in c++ or c tends to be very difficult and error prone, I did not mention it because the c++ program will run slower(which unless you use cuda it probably won't)
In regards to hyper-threading let me comment on what I have found from experience.
In large dense matrix multiplication hyper-threading actually gives worse performance. For example Eigen and MKL both use OpenMP (at least the way I have used them) and get better results on my system which has four cores and hyper-threading using only four threads instead of eight. Also, in my own GEMM code which gets better performance than Eigen I also get better results using four threads instead of eight.
However, in my Mandelbrot drawing code I get a big performance increase using hyper-threading with OpenMP (eight threads instead of four). The general trend (so far) seems to be that if the code works well using schedule(static) in OpenMP then hyper-threading does not help and may even be worse. If the code works better using schedule(dynamic) then hyper-threading may help.
In other words, my observation so far is that if the run time of each thread can vary a lot hyper-threading can help. If the run time of each thread is constant then it may even make performance worse. But YOU have to test and see for each case.

What does taskset in linux exactly do?

I have a program running on a 32 core system using Intel TBB.
The problem I have is when I set the program to use 32 threads, the performance doesn't gain enough compared to 16 threads (only 50% boost). However, when I use:
taskset 0xFFFFFFFF ./foo
which would lock the process to 32 cores, the performance is much better.
I have the two following questions:
Why? By Default, the OS would use all 32 cores for the 32 thread program anyway.
I'm assuming that even with taskset, the OS is allowed (would) to exchange the virtual threads and the physical threads, i.e. threads are not pinned. am I right?
Thanks.
The operating system may choose to use less cores for cache purposes. Imagine if the application uses the same set of memory then each write causes a cache invalidate. Forcing the lock is essentially you telling the OS the cache overhead for concurrency is not worth it, go ahead and use all the cores.
You must also remember there are other processes to run (like kthreads from the kernel, and background processes.) and migrating threads between cores is costly and may cause imbalances if your threads are not doing an even amount of work.
Also remember that the OS tries to evenly distribute work on the cores across ALL processes not just yours. This means that the load balancer may choose to not place your process on all 32 cores as there are other processes currently running and migration costs could be high or spreading your process evenly could cause load imbalance among the cpu cores. The OS strives for best system performance not necessarily best per application performance.

Utilizing All 4 cores in c++/c

I have a main processes that will create 4 threads. If i simply run all 4 threads would the kernel utilize all 4 cores or will the program be multithreaded on a single core? if not then how would synchronization be handled on a multicore. I have a 4core intel cpu and my program is in c++
Im running this on a linux in a Virtual machine.
You don't really know.
For one thing, the C++03 Standard doesn't know anything about threads, cores or any of that kind of stuff. So this is all platform-dependant anyway.
But even from a platform point-of-view, you often still don't really know. The operating system schedules threads and jobs. The OS might -- or might not -- give you the means to specify a "processor affinity" for a particular thread, but this typically takes some hoop-jumping-through to utilize.
One of the things you also should keep in mind is that if your goal is to keep each core 100% utilized, you'll often need more than n threads (where n is number of cores). Threads spend a lot of time sleeping, waiting on disk, and generally not doing anything on the core. The exact number of threads you'll need depends on your actual application and platform, but experimentation can help guide you towards fine tuning this.