I writed a c++ native dll with heavy maths computation, and then put it into Unity engine to run.
The problem is that:
When I used OMP in c++, the OMP did improve the c++'s performance, which I measured by logging out the time. But OMP would slow down the Unity. Unity would run faster if I removed the OMP.
So, how could OMP boosted the dll and slowed down the Unity at the meantime?
Here is what the omp does:
DLLEXPORT void UpdateTreeQuick(DbvtWrapper* wrapper, Vector3* prePositions, Vector3* positions, Triangle* triangles,
int triangleCount, float margin)
{
bool needPropagate = false;
double d1 = omp_get_wtime();
#pragma omp parallel for schedule(static)
for (int i = 0; i < triangleCount; i++)
{
Vector3 sixPos[6];
sixPos[0] = prePositions[triangles[i].A];
sixPos[1] = prePositions[triangles[i].B];
sixPos[2] = prePositions[triangles[i].C];
sixPos[3] = positions[triangles[i].A];
sixPos[4] = positions[triangles[i].B];
sixPos[5] = positions[triangles[i].C];
DbvtVolume vol = DbvtVolume::FromPoints(sixPos, 6);
if (wrapper->m_dbvt->refit(wrapper->m_leaves[i], vol, margin))
needPropagate = true;
}
double d2 = omp_get_wtime();
if (triangleCount == 10222)
Debug::Log(d2 - d1);
}
Here is how I call this native code in Unity:
private void Update()
{
NativeAPI.UpdateTreeQuick(nativeDvbtWrapper, (Vector4*)nativePrePositionsWorld.GetUnsafePtr<Vector4>(),
(Vector4*)nativePositionsWorld.GetUnsafePtr<Vector4>(), (Triangle*)nativeTriangles.GetUnsafePtr<Triangle>(),
m_mesh.triangles.Length / 3, m_aabbMargin);
}
Wit OMP, 2 threads: the c++ code run with a time cost of 7-05 second, the Unity 125-130FPS;
Without OMP: c++ cost 0.0002008 seconds, BUT the Unity run at 138 FPS!
So,Again, how could OMP boosted the dll while slowed down the Unity at the meantime?
So, how could OMP boosted the dll and slowed down the Unity at the meantime?
More details would be great here, but:
In doubt, this can depend on many aspects. Besides the ones mentioned by AlexGeorg:
What are these OMP routines doing exactly? Which kinds of OMP-patterns are used? Which OpenMP version is used? Which kind of data is relevant for OpenMP in your runtime context? How "local" are the data sets you operate on?
Common OMP usage doesn't ensure a nice "main thread" discharge. Even if you strictly separate master from OMP slave work, that doesn't ensure a fluid main core behavior of your CPU a priori. It further depends on aspects like thread/CPU affinity in doubt.
Typical performance droppers for OpenMP-usage cases are cache(!) and sometimes pipeline bottlenecks. Especially if there is a lot of interference with parts of Unity, this might arise.
Maybe this has nothing to do with the FPS problem but it might be questionable that you forward pointers (the vectors) to your OMP loop. In doubt this can lead to hidden bottlenecks or even harder problems if not analyzed well enough since you hide the shared state of actual values a bit.
What's this refit method doing and is it a static/ const method? I'm not that familiar with Unity. Is there a chance for blocking GPU calls (CUDA)?
What you could try further is to measure
general OpenMP thread pool creation time in the Unity working context (you could use a quite easier task for that). How often is your routine called?
You could further look for main thread issues in removing the master thread (id 0) from the work.
If nothing helps, try to compare to another parallelization approach via std::thread simply or intel threading building blocks.
Related
I have been working on a quantum simulation. Each time step a potential function is calculated, one step of the solver is iterated, and then a series of measurements are conducted. These three processes are easily parallelizable, and I've already made sure they don't interfere with each other. Additionally there is some stuff that is fairly simple, but should not be done in parallel. An outline of the setup is shown below.
omp_set_num_threads(3);
#pragma omp parallel
{
while (notDone) {
#pragma omp sections
{
#pragma omp section
{
createPotential();
}
#pragma omp section
{
iterateWaveFunction();
}
#pragma omp section
{
takeMeasurements();
}
}
#pragma omp single
{
doSimpleThings();
}
}
}
The code works just fine! I see a speed increase, mostly associated with the measurements running alongside the TDSE solver (about 30% speed increase). However, the program goes from using about 10% CPU (about one thread) to 35% (about three threads). This would make sense if the potential function, TDSE iterator, and measurements took equally as long, but they do not. Based on the speed increase, I would expect something on the order of 15% CPU usage.
I have a feeling this has to do with the overhead of running these three threads within the while loop. Replacing
#pragma omp sections
with
#pragma omp parallel sections
(and omitting the two lines just before the loop) changes nothing. Is there a more efficient way to run this setup? I'm not sure if the threads are constantly being recreated, or if the thread is holding up an entire core while it waits for the others to be done. If I increase the number of threads from 3 to any other number, the program uses as much resources as it wants (which could be all of the CPU) and gets no performance gain.
I've tried many options, including using tasks instead of sections (with the same results), switching compilers, etc. As suggested by Qubit, I also tried to use std::async. This was the solution! The CPU usage dropped from about 50% to 30% (this is on a different computer from the original post, so the numbers are different -- it's a 1.5x performance gain for 1.6x CPU usage basically). This is much closer to what I expected for this computer.
For reference, here is the new code outline:
void SimulationManager::runParallel(){
auto rV = &SimulationManager::createPotential();
auto rS = &SimulationManager::iterateWaveFunction();
auto rM = &SimulationManager::takeMeasurements();
std::future<int> f1, f2, f3;
while(notDone){
f1 = std::async(rV, this);
f2 = std::async(rS, this);
f3 = std::async(rM, this);
f1.get(); f2.get(); f3.get();
doSimpleThings();
}
}
The three original functions are called using std::async, and then I use the future variables f1, f2, and f3 to collect everything back to a single thread and avoid access issues.
I am novice is parallel programming. I running a my own Gibbs sampler written in C++. The overview of program look some thing like this.
for(int iter=0; iter <=itermax; iter++){ //loop1
#pragma omp parallel for schedule(dynamic)
for(int jobs= 0; jobs<=1000; jobs++){ // loop2
small_job();
#pragma omp critical(dataupdate){
data_updates()
}
}
jobs_that_cannot_be_parallelized();
}
I am running in a machine with 64 cores. Since small_job are of variable length and small I was assigning omp_get_max_threads = 128. The number of cores used seems to be correct (see fig load last hour).. Each of peaks belongs to loop2.
However when I look to the actual cpu usage (see fig it seems lot of of cpu is used by system and only 20% is used by user. Is it because I am spawning lots of threads at loop2. What are best practices to decide on omp_get_max_threads? I know I have not given enough information but I will really appreciate any other recommendation to make the program faster.
I wrote classic game "Life" with 4-sided neighbors. When I run it in debug, it says:
Consecutive version: 4.2s
Parallel version: 1.5s
Okey, it's good. But if I run it in release, it says:
Consecutive version: 0.46s
Parallel version: 1.23s
Why? I run it on the computer with 4 kernels. I run 4 threads in parallel section. Answer is correct. But somethere is leak and I don't know that place. Can anybody help me?
I try to run it in Visual Studio 2008 and 2012. The results are same. OMP is enabled in the project settings.
To repeat my problem, you can find defined constant PARALLEL and set it to 1 or 0 to enable and disable OMP correspondingly. Answer will be in the out.txt (out.txt - right answer example). The input must be in in.txt (my input - in.txt). There are some russian symbols, you don't need to understand them, but the first number in in.txt means number of threads to run in parallel section (it's 4 in the example).
The main part is placed in the StartSimulation function. If you run the program, you will see some russian text with running time in the console.
The program code is big enough, so I add it with file hosting - main.cpp (l2 means "lab 2" for me)
Some comments about StartSimulation function. I cuts 2D surface with cells into small rectangles. It is done by AdjustKernelsParameters function.
I do not find the ratio so strange. Having multiple threads co-operate is a complex business and has overheads.
Access to shared memory needs to be serialized which normally involves some form of locking mechanism and contention between threads where they have to wait for the lock to be released.
Such shared variables need to be synchronized between the processor cores which can give significant slowdowns. Also the compiler needs to treat these critical areas differently as a "sequence point".
All this reduces the scope for per thread optimization both in the processor hardware and the compiler for each thread when it is working with the shared variable.
It seems that in this case the overheads of parallelization outweigh the optimization possibilities for the single threaded case.
If there were more work for each thread to do independently before needed to access a shared variable then these overheads would be less significant.
You are using guided loop schedule. This is a very bad choice given that you are dealing with a regular problem where each task can easily do exactly the same amount of work as any other if the domain is simply divided into chunks of equal size.
Replace schedule(guided) with schedule(static). Also employ sum reduction over livingCount instead of using locked increments:
#if PARALLEL == 1
#pragma omp parallel for schedule(static) num_threads(kernelsCount) \
reduction(+:livingCount)
#endif
for (int offsetI = 0; offsetI < n; offsetI += kernelPartSizeN)
{
for (int offsetJ = 0; offsetJ < m; offsetJ += kernelPartSizeM)
{
int boundsN = min(kernelPartSizeN, n - offsetI),
boundsM = min(kernelPartSizeM, m - offsetJ);
for (int kernelOffsetI = 0; kernelOffsetI < boundsN; ++kernelOffsetI)
{
for (int kernelOffsetJ = 0; kernelOffsetJ < boundsM; ++kernelOffsetJ)
{
if(BirthCell(offsetI + kernelOffsetI, offsetJ + kernelOffsetJ))
{
++livingCount;
}
}
}
}
}
I have a c++ program with multiple For loops; each one runs about 5 million iterations. Is there any command I can use with g++ to make the resulting .exe will use multiple cores; i.e. make the first For loop run on the first core and the second For loop run on the second core at the same time? I've tried -O3 and -O3 -ftree-vectorize, but in both cases, my cpu usage still only hovers at around 25%.
EDIT:
Here is my code, in case in helps. I'm basically just making a program to test the speed capabilities of my computer.
#include <iostream>
using namespace std;
#include <math.h>
int main()
{
float *bob = new float[50102133];
float *jim = new float[50102133];
float *joe = new float[50102133];
int i,j,k,l;
//cout << "Starting test...";
for (i=0;i<50102133;i++)
bob[i] = sin(i);
for (j=0;j<50102133;j++)
bob[j] = sin(j*j);
for (k=0;k<50102133;k++)
bob[k] = sin(sqrt(k));
for (l=0;l<50102133;l++)
bob[l] = cos(l*l);
cout << "finished test.";
cout << "the 100120 element is," << bob[1001200];
return 0;
}
The most obviously choice would be to use OpenMP. Assuming your loop is one that's really easy to execute multiple iterations in parallel, you might be able to just add:
#pragma openmp parallel for
...immediately before the loop, and get it to execute in parallel. You'll also have to add -fopenmp when you compile.
Depending on the content of the loop, that may give anywhere from a nearly-linear speedup to slowing the code down somewhat. In the latter cases (slowdown or minimal speedup) there may be other things you can do with OpenMP to help speed it up, but without knowing at least a little about the code itself, it's hard to guess what to do or what improvement you may be able to expect at maximum.
The other advice you're getting ("Use threads") may be suitable. OpenMP is basically an automated way of putting threads to use for specific types of parallel code. For a situation such as you describe (executing multiple iterations of a loop in parallel) OpenMP is generally preferred--it's quite a bit simpler to implement, and may well give better performance unless you know multithreading quite well and/or expend a great deal of effort on parallelizing the code.
Edit:
The code you gave in the question probably won't benefit from multiple threads. The problem is that it does very little computation on each data item before writing the result out to memory. Even a single core can probably do the computation fast enough that the overall speed will be limited by the bandwidth to memory.
To stand a decent chance of getting some real benefit from multiple threads, you probably want to write some code that does more computation and less just reading and writing memory. For example, if we collapse your computations together, and do all of them on a single item, then sum the results:
double total = 0;
for (int i = 0; i < size; i++)
total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
By adding a pragma:
#pragma omp parallel for reduction(+:total)
...just before the for loop, we stand a good chance of seeing a substantial improvement in execution speed. Without OpenMP, I get a time like this:
Real 16.0399
User 15.9589
Sys 0.0156001
...but with the #pragma and OpenMP enabled when I compile, I get a time like this:
Real 8.96051
User 17.5033
Sys 0.0468003
So, on my (dual core) processor, time has dropped from 16 to 9 seconds--not quite twice as fast, but pretty close. Of course, a lot of the improvement you get will depend on exactly how many cores you have available. For example, on my other computer (with an Intel i7 CPU), I get a rather larger improvement because it has more cores.
Without OpenMP:
Real 15.339
User 15.3281
Sys 0.015625
...and with OpenMP:
Real 3.09105
User 23.7813
Sys 0.171875
For completeness, here's the final code I used:
#include <math.h>
#include <iostream>
static const int size = 1024 * 1024 * 128;
int main(){
double total = 0;
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < size; i++)
total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
std::cout << total << "\n";
}
The compiler has no way to tell if your code inside the loop can be safely executed on multiple cores. If you want to use all your cores, use threads.
Use Threads or Processes, you may want to look to OpenMp
C++11 got support for threading but c++ compilers won't/can't do any threading on their own.
As others have pointed out, you can manually use threads to achieve this. You might look at libraries such as libdispatch (aka. GCD) or Intel's TBB to help you do this with the least pain.
The -ftree-vectorize option you mention is for targeting SIMD vector processor units on CPUs such as ARM's NEON or Intel's SSE. The code produced is not thread-parallel, but rather operation parallel using a single thread.
The code example posted above is highly amenable to parallelism on SIMD systems as the body of each loop very obviously has no dependancies on the previous iteration, and the operations in the loop are linear.
On some ARM Cortex A series systems at least, you may need to accept slightly reduced accuracy to get the full benefits.
What is the easiest way to profile a C++ program parallelized with OpenMP, on a machine on which one has no sudo rights?
I would recommend using Intel VTune Amplifier XE profiler.
The Basic Hotspots analysis doesn't require the root privileges and you can even install it without being in sudoers.
For OpenMP analysis it's best to compile with Intel OpenMP implementation and set environment variable KMP_FORKJOIN_FRAMES to 1 before running the profile session. This will enable the tool to visualize time regions from fork point to join point for each parallel region. This gives a good idea about where you had sufficient parallelism and where you did not. By using grid grouping like Frame Domain / Frame Type / Function you can also correlate the parallel regions with what was happening on CPUs which allows finding functions that didn't scale.
For example, imagine a simple code like below that runs some balanced work, then some serial work and then some imbalanced work calling delay() function for all of these making sure delay() doesn't inline. This imitates a real workload where all kinds of unfamiliar functions may be invoked from parallel regions making it harder to analyze whether the parallism was good or bad by looking into just hot-functions profile:
void __attribute__ ((noinline)) balanced_work() {
printf("Starting ideal parallel\n");
#pragma omp parallel
delay(3000000);
}
void __attribute__ ((noinline)) serial_work() {
printf("Starting serial work\n");
delay(3000000);
}
void __attribute__ ((noinline)) imbalanced_work() {
printf("Starting parallel with imbalance\n");
#pragma omp parallel
{
int mythread = omp_get_thread_num();
int nthreads = omp_get_num_threads();
delay(1000000);
printf("First barrier %d\n", mythread);
#pragma omp barrier
delay(mythread * 25000 + 200000);
printf("Second barrier %d\n", mythread);
#pragma omp barrier
delay((nthreads - 1 - mythread) * 25000 + 200000);
printf("Join barrier %d\n", mythread);
}
}
int
main(int argc, char **argv)
{
setvbuf(stdout, NULL, _IONBF, 0);
calibrate();
balanced_work();
serial_work();
imbalanced_work();
printf("Bye bye\n");
}
For this code a typical function profile will show most of the time spent in the delay() function. On the other hand, viewing the data with frame grouping and CPU usage information in VTune will give an idea about what is serial, what is imbalanced and what is balanced. Here is what you might see with VTune:
Here one can see that:
There were 13.671 of elapsed time when we were executing an imbalanced region. One can see the imbalance from CPU Usage breakdown.
There were 3.652 of elapsed time that were pretty well balanced. There is some red time here, that’s likely some system effects - worth investigating in a real-world case.
And then I also have about 4 seconds of serial time. Figuring out that it’s 4 seconds is currently a bit tricky - you have to take elapsed time from summary (21.276 in my case) and subtract 13.671 and 3.652 from it yielding four. But easy enough.
Hope this helps.