Executing C++ program on multiple processor machine - c++

I developed a program in C++ for research purpose. It takes several days to complete.
Now i executing it on our lab 8core server machine to get results quickly, but i see machine assigns only one processor to my program and it remains at 13% processor usage(even i set process priority at high level and affinity for 8 cores).
(It is a simple object oriented program without any parallelism or multi threading)
How i can get true benefit from the powerful server machine?
Thanks in advance.

Partition your code into chunks you can execute in parallel.
You need to go read about data parallelism
and task parallelism.
Then you can use OpenMP or
MPI
to break up your program.

(It is a simple object oriented program without any parallelism or
multi threading)
How i can get true benefit from the powerful server machine?
By using more threads. No matter how powerful the computer is, it cannot spread a thread across more than one processor. Find independent portions of your program and run them in parallel.
C++0x threads
Boost threads
OpenMP
I personally consider OpenMP a toy. You should probably go with one of the other two.

You have to exploit multiparallelism explicitly by splitting your code into multiple tasks that can be executed independently and then either use thread primitives directly or a higher level parallelization framework, such as OpenMP.

If you don't want to make your program itself use multithreaded libraries or techniques, you might be able to try breaking your work up into several independent chunks. Then run multiple copies of your program...each being assigned to a different chunk, specified by getting different command-line parameters.
As for just generally improving a program's performance...there are profiling tools that can help you speed up or find the bottlenecks in memory usage, I/O, CPU:
https://stackoverflow.com/questions/tagged/c%2b%2b%20profiling
Won't help split your work across cores, but if you can get an 8x speedup in an algorithm that might be able to help more than multithreading would on 8 cores. Just something else to consider.

Related

How to use multi-core instead of multi-tread for programming?

I'm working on a project (Hardware: RaspberryPI 3B+), which has lots of computation and parallel processing. At present, I'm noticing some sort of lag in the code performance. Therefore, I'm constantly looking for efficient ways to improve my code and its performance.
Currently, I'm using C-language (because I can access and manipulate lower-level drivers easily) and developing my own set of functions, libraries and the drivers, which runs faster than any other pre-defined or readymade libraries or plugins.
Now, instead of the software-based muti-treading (Pthread), I wanted to use the separate cores for performing the corresponding task. So, any suggestion or guideline how I can use the different cores of the RaspberryPI?
Moreover, how can I check the CPU utilization to choose the best core to perform a certain task?
Thanking with regards,
Aatif Shaikh
At the C/C++ level you do not have access to which CPU core will run which thread. Just use the C++ 11 standard threads and let the OS scheduler to decide which thread runs where.
That said, Linux has the taskset tool to check thread affinity and there 's also sched_setaffinity() function.

Ways to process data which is coming at double speed than my processing speed

I was asked this question my someone and bit confused on same.
Q: how will you process the data which is coming at double speed than my processing speed?
I think of following:
using queue to handle this. But if I use simply queue then size of
queue required will be indefinetly large and i will still lag
behind. As every t time i will have half more data that I can
process. and I will keep laging exponentially.
I use one thread for reading data and two more for processing. But
suppose my data has to be processed serially then what happens.
Am still confused and any help on similar problems will be welcomed. I know there might be a standard solution for this but am unaware of same.
I would like to implement in c/c++
Short answer: you'll need some kind of parallel processing. It's not easy.
Long answer: Depending on your workload requirements, and whether the bottleneck is in IO or in CPU, it might simply be multithreading on a single core, or on a multicore processor, or on a shared memory multiprocessor or even distributed between multiple nodes. It can be just a matter of distributing and balancing your work between the worker units, if the problem is simple enough (embarrasingly parallel) or you'll need to explicitly do some parallel programming. There are fundamentally two parallel programming models: OpenMP, for multithreading in multicore systems with shared memory (either symmetric or non-uniform access); and MPI, for distributed processing in a low-latency high-bandwidth network. To complicate even further, OpenMP and MPI might perfectly run together, in a hybrid parallel programming runtime environment: OpenMP distributes and coordinates the parallel compute load between the cores on each node, and MPI does it between the nodes. Be aware, it is very tough work.

Multithreading efficiency in C++

I am trying to learn threading in C++, and just had a few questions about it (more specifically <thread>.
Let's say the machine this code will run on has 4 cores, should I split up an operation into 4 threads? If I were to create 8 threads instead of 4, would this run slower on a 4 core machine? What if the processor has hyperthreading, should I try and make the threads match the number of physical cores or logical cores?
Should I just not worry about the number of cores a machine has, and try to create as many threads as possible?
I apologize if these questions have been already answered; I've been looking for information about threading with <thread>, which was introduced in c11 so I haven't been able to find too much about it.
The program in question is going to run many independent simulations.
If anybody has any insight about <thread> or just multithreading in general, I would be glad to hear it.
If you are performing pure calculations with no I/O - and those calculations are freestanding and not relying on results from other calculations happening in another thread, the maximum number of such threads should be the number of cores (possibly one or two less if the system is also loaded with other tasks).
If you are doing network I/O or similar, more threads are certainly a possibility.
If you are doing disk-I/O, a single thread reading from the disk is often best, because disk reads from multiple threads leads to moving the read/write head around on the disk, which just makes things slower.
If you're using threads for to make the code simpler, then the number of threads will probably depend on what you are doing.
It also depends on how "freestanding" each thread is. If they need to share data in complex ways, the sharing/waiting for other thread/etc, may well make it slower with more threads.
And as others have said, try to make your framework for this flexible and test different options. Preferably on multiple machines (unless you only have one kind of machine that you will ever run your code on).
There is no such thing as <threads.h>, you mean <thread>, the thread support library introduced in C++11.
The only answer to your question is "test and see". You can make your code flexible enough, so that it can be run by passing an N parameter (where N is the desired number of threads).
If you are CPU-bound, the answer will be very different from the case when you are IO bound.
So, test and see! For your reference, this link can be helpful. And if you are serious, then go ahead and get this book. Multithreading, concurrency, and the like are hairy topics.
Let's say the machine this code will run on has 4 cores, should I split up an operation into 4 threads?
If some portions of your code can be run in parallel, then yes it can be made to go faster, but this is very tricky to do since loading threads and switching data between them takes a ton of time.
If I were to create 8 threads instead of 4, would this run slower on a 4 core machine?
It depends on the context switching it has to do. Sometimes the execution will switch between threads very often and sometimes it will not but this is very difficult to control. It will not in any case run faster than 4 threads doing the same work.
What if the processor has hyperthreading, should I try and make the threads match the number of physical cores or logical cores?
Hyperthreading works nearly the same as having more cores. When you will notice the differences between a real core and an execution core, you will have enough knowledge to work around the caveats.
Should I just not worry about the number of cores a machine has, and try to create as many threads as possible?
NO, threads are hard to manage, avoid them as much as you can.
The program in question is going to run many independent simulations.
You should look into openmp. It is a library in C made to parallelize computation when your program can be split up. Do not confuse parallel with concurrent. Concurrent is simply multiple threads working together while parallel is made specifically to speed up your application. Maybe openmp is overkill for your thing, but it is a good thing to know when you are approaching parallel computing
Don't think of the number of threads you need as in comparison to the machine you're running on. Threading is valuablue any time you have a process that:
A: There is some very slow operation, that the rest of the process need not wait for.
B: Certain functions can run faster than one another and don't need to be executed inline.
C: There is a lot of non-order dependant I/O going on(web servers).
These are just a few of the obvious examples when launching a thread makes sense. The number of threads you launch is therefore more dependant on the number of these scenarios that pop up in your code, than the architecture you expect to run on. In fact unless you're running a process that really really needs to be optimized, it is likely that you can only eek out a few percentage points of additional performance by benchmarking for your architecture in comparison to the number of threads that you launch, and in modern computers this number shouldn't vary much at all.
Let's take the I/O example, as it is the scenario that will see the most benefit. Let's assume that some program needs to interract with 200 users over the network. Network I/O is very very slow. Thousands of times slower than the CPU. If we were to handle each user in turn we would waste thousands of processor cycles just waiting for data to come from the first user. Could we not have been processing information from more than one user at a time? In this case since we have roughly 200 users, and the data that we're waiting for we know to be 1000s of times slower than what we can handle(assuming we have a minimal amount of processing to do on this data), we should launch as many threads as the operating system allows. A web server that takes advantage of threading can serve hundreds of more people per second than one that does not.
Now, let's consider a less I/O intensive example, where say we have several functions that execute in turn, but are independant of one another and some of them might run faster, say because there is disk I/O in one, and no disk I/O in another. In this case, our I/O is still fairly fast, but we will certainly waste processing time waiting for the disk to catch up. As such we can launch a few threads, just to take advantage of our processing power, and minimize wasted cycles. However, if we launch as many threads as the operating system allows we are likely to cuase memory management issues for branch predictors, etc... and launching too many threads in this case is actually sub optimal and can slow the program down. Note that in this, I never mentioned how many cores the machine has! NOt that optimizing for different architectures isn't valuable, but if you optimize for one architecture you are likely very close to optimal for most. Assuming, again, that you're dealing with all reasonably modern processors.
I think most people would say that large scale threading projects are better supported by languages other than c++ (go, scala,cuda). Task parallelism as opposed to data parallelism works better in c++. I would say that you should create as many threads as you have tasks to dole out but if data parallelism is more related to your problem consider maybe using cuda and linking to the rest of your project at a later time
NOTE: if you look at some sort of system monitor you will notice that there are likely far more than 8 threads running, I looked at my computer and it had hundreds of threads running at once so don't worry too much about the overhead. The main reason I choose to mention the other languages is that managing many threads in c++ or c tends to be very difficult and error prone, I did not mention it because the c++ program will run slower(which unless you use cuda it probably won't)
In regards to hyper-threading let me comment on what I have found from experience.
In large dense matrix multiplication hyper-threading actually gives worse performance. For example Eigen and MKL both use OpenMP (at least the way I have used them) and get better results on my system which has four cores and hyper-threading using only four threads instead of eight. Also, in my own GEMM code which gets better performance than Eigen I also get better results using four threads instead of eight.
However, in my Mandelbrot drawing code I get a big performance increase using hyper-threading with OpenMP (eight threads instead of four). The general trend (so far) seems to be that if the code works well using schedule(static) in OpenMP then hyper-threading does not help and may even be worse. If the code works better using schedule(dynamic) then hyper-threading may help.
In other words, my observation so far is that if the run time of each thread can vary a lot hyper-threading can help. If the run time of each thread is constant then it may even make performance worse. But YOU have to test and see for each case.

how to use quad core CPU in application

For using all the cores of a quad core processor what do I need to change in my code is it about adding support of multi threading or is it which is taken care by OS itself. I am having FreeBSD and language I am using is C++. I want to give complete CPU cycles to my application at least 90%.
You need some form of parallelism. Multi-threading or multi-processing would be fine.
Usually, multiple threads are easier to handle (since they can access shared data) than multiple processes. However, usually, multiple threads are harder to handle (since they access shared data) than multiple processes.
And, yes, I wrote this deliberately.
If you have a SIMD scenario, Ninefingers' suggestion to look at OpenMP is also very good. (If you don't know what SIMD means, see Ninefingers' helpful comment below.)
For multi-threaded applications in C++ may I suggest Boost.Thread which should help you access the full potential of your quad-core machine.
As for changing your code, you might want to consider making things as immutable as possible. State transitions between threads are much more difficult to debug. There a plethora of things that could potentially happen in unexpected ways. See this SO thread.
Another option not mentioned here, threading aside, is the use of OpenMP available via the -fopenmp and the libgomp library, both of which I have installed on my FreeBSD 8 system.
These give you #pragma directives to parallelise certain loops, while statements etc i.e. the bits you can parallelise. It takes care of threading and cpu association for you. Note it is a general solution and therefore might not be the optimum way to parallelise, but it will allow you to parallelise certain routines.
Take a look at this: https://computing.llnl.gov/tutorials/openMP/
As for using threads/processes themselves, certain routines and ways of working lend themselves to it. Can you break tasks out into such a way? Does it make sense to fork() your process or create a thread? If so, do so, but if not, don't try to force your application to be multi-threaded just because. An example I usually give is the greatest common divisor algorithm - it relies on the step before all the time in the traditional implementation therefore is difficult to make parallel.
Also note it is well known that for certain algorithms, parallelisation is actually slower for small values of whatever you are doing in parallel, because although the jobs complete more quickly, the associated time cost of forking and joining (be that threads or processes) actually pushes the time above that of a serial implementation.
I think your only option is to run several threads. If your application is single-threaded, then it will only run on one of the cores (at a time), but if you have more threads, they can run simultaneously.
You need to add support to your application for parallelism through the use of Threading.
Once you have support for parallelism, it's up to the OS to assign your threads to CPU cores.
The first thing I think you should look at is whether your application and its algorithms are suited to be executed in parellel (or possibly as a set of serial tasks that can be processed independently). If this is not the case, it will be difficult to multithread it or break it up into parallel processes, and you may need to look into modifying the way it works.
Once you have established that you will be able to benefit from parallel processing you have the option to either use several processes or threads. The choice depends a lot on the nature of your application and how independent the parallel processes can be. It is easier to coordinate and share data between threads since they are in the same process, but also quite a bit more challenging to develop and debug.
Boost.Thread is a good library if you decide to go down the multi-threaded route.
I want to give complete CPU cycles to my application at least 90%.
Why? Your chip's not hot enough?
Seriously, it takes world experts dozens if not hundreds of hours to parallelize and load-balance an application so that it uses 90% of all four cores. Your CPU is already paid for and it costs the same whether you use it or not. (Actually, it costs slightly less to run, electrically speaking, if you don't use it.) How much is your time worth? How many hours are you willing to invest in order to make more effective use of a resource that may have cost you $300 and is probably sitting idle most of the time anyway?
It's possible to get speedups through parallelism, but it's expensive in human time. You need a good reason to justify it. (Learning how is a good enough reason.)
All the good books I know on parallel programming are for languages other than C++, and for good reason. If you want interesting stuff on parallelism check out Implicit Parallel Programmin in pH or Concurrent Programming in ML or the Fortress Project.

How to structure a C++ application to use a multicore processor

I am building an application that will do some object tracking from a video camera feed and use information from that to run a particle system in OpenGL. The code to process the video feed is somewhat slow, 200 - 300 milliseconds per frame right now. The system that this will be running on has a dual core processor. To maximize performance I want to offload the camera processing stuff to one processor and just communicate relevant data back to the main application as it is available, while leaving the main application kicking on the other processor.
What do I need to do to offload the camera work to the other processor and how do I handle communication with the main application?
Edit:
I am running Windows 7 64-bit.
Basically, you need to multithread your application. Each thread of execution can only saturate one core. Separate threads tend to be run on separate cores. If you are insistent that each thread ALWAYS execute on a specific core, then each operating system has its own way of specifying this (affinity masks & such)... but I wouldn't recommend it.
OpenMP is great, but it's a tad fat in the ass, especially when joining back up from a parallelization. YMMV. It's easy to use, but not at all the best performing option. It also requires compiler support.
If you're on Mac OS X 10.6 (Snow Leopard), you can use Grand Central Dispatch. It's interesting to read about, even if you don't use it, as its design implements some best practices. It also isn't optimal, but it's better than OpenMP, even though it also requires compiler support.
If you can wrap your head around breaking up your application into "tasks" or "jobs," you can shove these jobs down as many pipes as you have cores. Think of batching your processing as atomic units of work. If you can segment it properly, you can run your camera processing on both cores, and your main thread at the same time.
If communication is minimized for each unit of work, then your need for mutexes and other locking primitives will be minimized. Course grained threading is much easier than fine grained. And, you can always use a library or framework to ease the burden. Consider Boost's Thread library if you take the manual approach. It provides portable wrappers and a nice abstraction.
It depends on how many cores you have. If you have only 2 cores (cpu, processors, hyperthreads, you know what i mean), then OpenMP cannot give such a tremendous increase in performance, but will help. The maximum gain you can have is divide your time by the number of processors so it will still take 100 - 150 ms per frame.
The equation is
parallel time = (([total time to perform a task] - [code that cannot be parallelized]) / [number of cpus]) + [code that cannot be parallelized]
Basically, OpenMP rocks at parallel loops processing. Its rather easy to use
#pragma omp parallel for
for (i = 0; i < N; i++)
a[i] = 2 * i;
and bang, your for is parallelized. It does not work for every case, not every algorithm can be parallelized this way but many can be rewritten (hacked) to be compatible. The key principle is Single Instruction, Multiple Data (SIMD), applying the same convolution code to multiple pixels for example.
But simply applying this cookbook receipe goes against the rules of optimization.
1-Benchmark your code
2-Find the REAL bottlenecks with "scientific" evidence (numbers) instead of simply guessing where you think there is a bottleneck
3-If it is really processing loops, then OpenMP is for you
Maybe simple optimizations on your existing code can give better results, who knows?
Another road would be to run opengl in a thread and data processing on another thread. This will help a lot if opengl or your particle rendering system takes a lot of power, but remember that threading can lead to other kind of synchronization bottlenecks.
I would recommend against OpenMP, OpenMP is more for numerical codes rather than consumer/producer model that you seem to have.
I think you can do something simple using boost threads to spawn worker thread, common segment of memory (for communication of acquired data), and some notification mechanism to tell on your data is available (look into boost thread interrupts).
I do not know what kind of processing you do, but you may want to take a look at the Intel thread building blocks and Intel integrated primitives, they have several functions for video processing which may be faster (assuming they have your functionality)
You need some kind of framework for handling multicores. OpenMP seems a fairly simple choice.
Like what Pestilence said, you just need your app to be multithreaded. Lots of frameworks like OpenMP have been mentioned, so here's another one:
Intel Thread Building Blocks
I've never used it before, but I hear great things about it.
Hope this helps!