How to tell if OpenMP is working? (C++)

I am trying to run LIBSVM in parallel mode, but my question is about OpenMP in general. Following the LIBSVM FAQ, I have modified the code with #pragma omp directives to use OpenMP. I also modified the Makefile (for un*x) by adding the -fopenmp flag, so it becomes:
CFLAGS = -Wall -Wconversion -O3 -fPIC -fopenmp
The code compiles fine. Since it's not my PC, I check whether OpenMP is installed with:
/sbin/ldconfig -p | grep gomp
and see that it probably is installed:
libgomp.so.1 (libc6,x86-64) => /usr/lib64/libgomp.so.1
libgomp.so.1 (libc6) => /usr/lib/libgomp.so.1
Now, when I run the program, I don't see any speed improvement. Also, when I check with "top", the process is using at most 100% CPU (there are 8 cores), and there is no CPU bottleneck (only one other user at 100% CPU usage). I was expecting to see more than 100%, or some other indicator that the process is using multiple cores.
Is there a way to check that it is actually using multiple cores?

You can use the function omp_get_num_threads(). It returns the number of threads that are being used by your program.

With omp_get_max_threads() you get the maximum number of threads available to your program. It is also the maximum of all possible return values of omp_get_num_threads(). You can explicitly set the number of threads to be used by your program with the environment variable OMP_NUM_THREADS, e.g. in bash via
export OMP_NUM_THREADS=8; your_program
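If you want to verify from inside the program itself, a minimal sketch like the following (my own example, assuming GCC with the -fopenmp flag; the file name check_omp.cpp is just illustrative) prints the size of a parallel region's thread team:

#include <cstdio>
#include <omp.h>

int main() {
    // Maximum number of threads OpenMP may use (respects OMP_NUM_THREADS).
    std::printf("max threads: %d\n", omp_get_max_threads());
    #pragma omp parallel
    {
        // Inside a parallel region this reports the actual team size.
        #pragma omp single
        std::printf("threads in parallel region: %d\n", omp_get_num_threads());
    }
    return 0;
}

Compile with g++ -fopenmp check_omp.cpp -o check_omp and run it; if the second line prints 1 even with OMP_NUM_THREADS=8 set, the parallel region is not actually running multi-threaded (for example because -fopenmp was missing when the relevant files were compiled or linked).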

Related

Why does a simple C++ program generate so many branch commands? Using perf on Linux [duplicate]

This question already has answers here: "Number of executed Instructions different for Hello World program Nasm Assembly and C" and "How do I determine the number of x86 machine instructions executed in a C program?" (Closed 2 years ago.)
I must be doing something stupid or using perf incorrectly?
#include <iostream>
int main()
{
    return 0;
}
Compile command (using g++ 9.2.1):
g++ -std=c++17 -Wall -Wextra -pedantic -O3 Source.cpp -o prog
Following the tutorial (perf's help describes stat as "Run a command and gather performance counter statistics"),
I attempted
perf stat ./prog
And in the output
560,957 branches # 303.607 M/sec
16,181 branch-misses # 2.88% of all branches
The question is: why? Should I "clean" the registers before running this command? Is this normal?
About 80% of the branching comes from dynamic linking. Files need to be opened and then the dynamic libraries need to be parsed. This requires a lot of decision making as the contents of the file have to be tested to see what their format is, what sections they have, and so on.
Most of the remaining 20% is precisely that same kind of logic operating on the executable. It has a complex format and code has to parse that format to figure out what sections it has, find the endings of each section, and decide how to lay them out in memory before the program begins executing.
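One quick way to check this for yourself (my own suggestion, not part of the original answer) is to build the same source statically, so the dynamic loader is out of the picture, and compare the counts:

g++ -std=c++17 -Wall -Wextra -pedantic -O3 -static Source.cpp -o prog_static
perf stat ./prog_static

With static linking the branch count should drop noticeably, since the work of locating and parsing shared libraries at start-up is gone; what remains is mostly the C runtime start-up code and the kernel's program loading.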

Determine the number of cores at compile time in C/C++

Is there a way to determine how many physical cores a target machine has at compile time in C/C++ in Linux under GCC?
I am aware of other methods like std::thread::hardware_concurrency() in C++11 or sysconf(_SC_NPROCESSORS_ONLN), but I am curious to know whether there is actually a way to obtain this information at compile time.
You can query the information during the build process and pass it into the program as a preprocessor definition.
Example
g++ main.cpp -D PROC_COUNT=$(grep -c ^processor /proc/cpuinfo)
where main.cpp is
#include <iostream>
int main() {
    std::cout << PROC_COUNT << std::endl;
    return 0;
}
Edit
As pointed out in the comments, if the target machine differs from the build machine, you'll need to replace grep -c ^processor /proc/cpuinfo with something that queries the number of processors on the target machine. The details depend on what form of access you have to the target machine during the build.
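A small variation (my own sketch, not from the answer above) keeps the compile-time value optional and falls back to the runtime query mentioned in the question when the macro is not supplied:

#include <iostream>
#include <thread>

int main() {
#ifdef PROC_COUNT
    // Value baked in at build time, e.g. via -D PROC_COUNT=$(nproc) on the build machine.
    std::cout << "compile-time core count: " << PROC_COUNT << std::endl;
#else
    // Fallback: runtime query (C++11); returns 0 if the count cannot be determined.
    std::cout << "runtime core count: " << std::thread::hardware_concurrency() << std::endl;
#endif
    return 0;
}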

Does gcc automatically use -j4? Is there anything I can do to optimize my compilation?

Hello, I am a beginner on the Linux platform, so I am not familiar with the terminal commands.
I am writing an application in C++ and I expect it to consume a lot of processing power, so I want to make sure I am using all available cores on my device (it has 4 cores).
I am using the following to create an executable file:
gcc -o blink -l rt blink.c -l bcm2835
where bcm2835 is the library I use for I/O. So my question is: is this command using all available cores, or is there anything I can do to optimize it? I am willing to use everything available, to throw the kitchen sink at it, if it will make this code run faster.
The -j jobs option is for make, not gcc.
When used with make, it causes multiple "recipes" to be executed in parallel. In this context, your gcc line is a single recipe, so -j does not help here.
AFTER QUESTION REVISION
If you want your code to use multiple cores, you will need to use threads or processes. Look into pthreads.
Since you're using C++, you have a nice-enough, cross-platform-enough thread library built into the standard library for you (C++11 and later): std::thread.
Just make sure to add -std=c++11 so that
gcc -o blink -l rt blink.c -l bcm2835
becomes
gcc -std=c++11 -o blink -l rt blink.c -l bcm2835
Docs and basic examples at http://www.cplusplus.com/reference/thread/thread/
Nicer looking docs at http://en.cppreference.com/w/cpp/thread/thread
You still have to program what to thread on your own, though.
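As a rough illustration of what that looks like (my own minimal sketch, not code from the question; note that std::thread requires compiling the source as C++, e.g. with g++, and linking with -pthread on Linux):

#include <iostream>
#include <thread>
#include <vector>

// Hypothetical work function; in the real program this would be one
// core's share of the actual processing.
void do_work(unsigned id) {
    std::cout << "worker " << id << " running\n";
}

int main() {
    unsigned n = std::thread::hardware_concurrency(); // 4 on the poster's device
    if (n == 0) n = 4;                                // fallback if the count is unknown

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back(do_work, i);

    for (auto& t : workers)
        t.join(); // wait for all workers to finish
    return 0;
}

Built with something like g++ -std=c++11 -pthread work.cpp -o work; the OS can then schedule each std::thread on a different core.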

Threaded DGEMM on ifort

I've implemented a working program (FORTRAN, compiler: Intel ifort 11.x) that makes calls to DGEMM. I've read that there's a quick way to parallelize this by compiling with:
ifort -mkl=parallel -O3 myprog.f -o myprog
I have a quad-core processor, so I run the program with (via bash):
export OMP_NUM_THREADS=4
./myprog
My assumption was that DGEMM would automatically summon 4 threads, resulting in faster matrix multiplication. That doesn't seem to be happening. Am I missing something? Any help would be appreciated.
I think -mkl=parallel is the default choice of the Intel compiler, so you don't have to set this flag explicitly. Try -mkl=sequential instead and see if the calculations slow down.
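One quick way to test this suggestion (my own sketch of the experiment, reusing the compile line from the question) is to time both variants side by side:

ifort -mkl=sequential -O3 myprog.f -o myprog_seq
ifort -mkl=parallel -O3 myprog.f -o myprog_par
export OMP_NUM_THREADS=4
time ./myprog_seq
time ./myprog_par

If the parallel build is not noticeably faster, the matrices may simply be too small for MKL to thread the call, or the threaded MKL library may not actually be linked in; setting MKL_NUM_THREADS=4 explicitly can help rule out environment issues.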

analysis of core file

I'm using Linux Red Hat 3. Can someone explain how it is possible that I am able to analyze, with gdb, a core dump generated on Linux Red Hat 5?
Not that I'm complaining :) but I need to be sure this will always work...?
EDIT: the shared libraries are the same version, so no worries about that; they are placed on shared storage so they can be accessed from both Linux 5 and Linux 3.
thanks.
You can try the following GDB commands to open a core file:
gdb
(gdb) exec-file <path to executable>
(gdb) set solib-absolute-prefix <path to shared library>
(gdb) core-file <path to core file>
The reason you can't rely on this is that every process uses libc and other system shared libraries, which will certainly have changed between Red Hat 3 and Red Hat 5. So the instruction addresses and the number of instructions inside library functions will differ, and that is where the debugger gets confused and can show you wrong data to analyze. It is therefore always better to analyze the core on the same platform, or to copy all the required shared libraries to the other machine and set the path through set solib-absolute-prefix.
In my experience, analysing a core file generated on another system does not work, because the standard library (and the other libraries your program uses) will typically be different, so the addresses of the functions differ and you cannot even get a sensible backtrace.
Don't do it: even if it works sometimes, you cannot rely on it.
You can always run gdb -c /path/to/corefile /path/to/program_that_crashed. However, if program_that_crashed has no debug info (i.e. was not compiled and linked with the -g gcc/ld flag), the coredump is not that useful unless you're a hard-core debugging expert ;-)
Note that the generation of corefiles can be disabled (and it's very likely that it is disabled by default on most distros). See man ulimit. Run ulimit -c to see the size limit for core files; "0" means disabled. Try ulimit -c unlimited in that case. If a size limit is imposed, the coredump will not exceed it, which may cut off valuable information.
Also, the path where a coredump is generated depends on /proc/sys/kernel/core_pattern. Use cat /proc/sys/kernel/core_pattern to query the current pattern. It is actually a path, and if it doesn't start with /, the file will be generated in the current working directory of the process. If cat /proc/sys/kernel/core_uses_pid returns "1", the coredump will have the PID of the crashed process as its file extension. You can also set both values; e.g. echo -n /tmp/core > /proc/sys/kernel/core_pattern will force all coredumps to be generated in /tmp.
I understand the question as: "how is it possible that I am able to analyse a core that was produced under one version of an OS under another version of that OS?"
Just because you are lucky (even that is questionable). There are a lot of things that can go wrong when trying to do so:
- the tool chains (gcc, gdb etc.) will be of different versions
- the shared libraries will be of different versions
so no, you shouldn't rely on that.
You have asked a similar question before and accepted an answer (your own, in fact) here: Analyzing core file of shared object
Once you load the core file you can get the stack trace, find the last function call, and check the code for the cause of the crash.
There is a small tutorial here to get started with.
EDIT:
I'm assuming you want to know how to analyse a core file using gdb on Linux, as your question is a little unclear.
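For completeness, a typical first look at a core file (generic gdb commands, not taken from the linked tutorial) goes something like:

gdb /path/to/program_that_crashed /path/to/corefile
(gdb) bt
(gdb) frame 2
(gdb) info locals
(gdb) info registers

bt prints the backtrace of the crashing thread, frame N selects one of its frames, and info locals / info registers show the state at that point; with debug info (-g) this is usually enough to locate the crash site.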