Threaded DGEMM on ifort - fortran

I've implemented a working program (FORTRAN, compiler: Intel ifort 11.x) that makes calls to DGEMM. I've read that there's a quick way to parallelize this by compiling with:
ifort -mkl=parallel -O3 myprog.f -o myprog
I have a quad-core processor, so I run the program with (via bash):
export OMP_NUM_THREADS=4
./myprog
My assumption was that DGEMM would automatically spawn 4 threads, resulting in faster matrix multiplication. That doesn't seem to be happening. Am I missing something? Any help would be appreciated.

I think -mkl=parallel is the default when you pass -mkl to the Intel compiler, so you don't have to set that flag explicitly. Try -mkl=sequential instead and see whether the calculations slow down.
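One more thing worth checking, assuming the threaded MKL layer really is linked in: MKL reads its own MKL_NUM_THREADS variable, which takes precedence over OMP_NUM_THREADS for MKL routines, and MKL_DYNAMIC may silently lower the thread count at runtime. A quick experiment from bash:
# pin MKL to 4 threads and disable dynamic thread adjustment
export MKL_NUM_THREADS=4
export MKL_DYNAMIC=FALSE
./myprog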


How can I validate a newly created instruction using LLVM?
I am new to LLVM and computer architecture.
I created a new bfloat16 arithmetic instruction targeting the 32-bit RISC-V architecture.
I want to check whether the output of this arithmetic instruction is correct, and to verify that the value stored in the float register is in IEEE 754 bfloat16 format.
clang-14 -c -g -v --target=riscv32-unknown-elf -march=rv32izfh0p1 -menable-experimental-extensions -I/usr/include -o main.o main.c
riscv32-unknown-elf-gcc -g -o main main.o
I compiled it with the commands above and confirmed that the expected assembly code was emitted.
Then I tried to run the compiled executable with qemu-riscv32 and debug it using gdb, but an illegal instruction error occurred.
Question
I think the illegal instruction error occurred because QEMU and gdb have no information about the new instruction I created. Other than modifying QEMU and GDB, is there any way to validate the newly created instruction?
If you extend a CPU architecture by adding new instructions, you need to add support for them not just in the toolchain (which will then output code using the new instructions) but also in the CPU implementation itself, so that it can execute the code your compiler now generates. You can do that either on a real hardware CPU, which in practice starts with changes to its RTL that can be tested in simulation, or on an emulated version of the CPU such as the one QEMU provides. But if you have no CPU implementation at all that includes your new instructions, there is no point in having the compiler emit them, because nothing can run the code the compiler produces.
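For checking the numeric results themselves once something can execute the instruction, a host-side reference model can help: bfloat16 is just the upper 16 bits of an IEEE 754 float32, so you can compute the bit pattern you expect the float register to hold and compare. A minimal sketch (this uses truncation; a real implementation might round to nearest even instead):
#include <cstdint>
#include <cstdio>
#include <cstring>

// Reference model: bfloat16 keeps the upper 16 bits of an IEEE 754 float32.
static uint16_t float_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // safe type-pun
    return (uint16_t)(bits >> 16);        // drop the low 16 mantissa bits
}

int main() {
    float a = 1.5f, b = 2.25f;
    std::printf("expected bf16(a+b) bit pattern: 0x%04x\n", float_to_bf16(a + b));
    return 0;
}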

Does gcc automatically use -j4? Is there anything I can do to optimize my compilation?

Hello, I am a beginner on the Linux platform, so I am not familiar with the terminal commands.
I am writing an application in C++ and I expect it to consume a lot of processing power. So I want to make sure I am using all available cores on my device (it has 4 cores).
I am using the following to create an executable file:
gcc -o blink -l rt blink.c -l bcm2835
where bcm2835 is the library I use for I/O. So my question is: is this command using all available cores, or is there anything I can do to optimize it? I am willing to use everything available, to throw in the kitchen sink if it will make this code run faster.
The -j (jobs) option is for make, not gcc.
When used with make it causes multiple "recipes" to be executed in parallel; in this context, your gcc line is one recipe.
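To be clear, that only speeds up the build itself, not the resulting program. For completeness, a parallel build of a multi-file project looks like:
# let make run up to 4 recipes (e.g. compile steps) at once
make -j4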
AFTER QUESTION REVISION
If you want your code to use multiple cores, you will need to use threads or processes. Look into pthreads.
Since you're using C++, you have a nice-enough, cross-platform-enough thread library built in (C++11 and later).
Just make sure to compile as C++ with -std=c++11 and link with -pthread, so that
gcc -o blink -l rt blink.c -l bcm2835
becomes
g++ -std=c++11 -pthread -o blink blink.c -lrt -lbcm2835
(g++ compiles a .c file as C++; putting the libraries after the source file also avoids link-order problems.)
Docs and basic examples at http://www.cplusplus.com/reference/thread/thread/
Nicer looking docs at http://en.cppreference.com/w/cpp/thread/thread
You still have to program what to thread on your own though.
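A minimal sketch of what that looks like (do_chunk here is a placeholder for whatever work blink actually does):
#include <cstdio>
#include <thread>
#include <vector>

// Placeholder: split your real workload into independent chunks.
static void do_chunk(int id) {
    std::printf("worker %d running\n", id);
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)           // one thread per core
        workers.emplace_back(do_chunk, i);
    for (auto &t : workers)
        t.join();                         // wait for all workers to finish
    return 0;
}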

TI DM3730 (Design reference: beagleboard) computes wrong floating point operation results

The Situation
We have a board with a TI DM3730 processor (also known from the Beagleboard) with a Cortex A8 core (r3p2) in use with the following parameters:
Beagleboard Reference Design: Beagleboard-xM Rev-C
Kernel version: 3.2.8
Open CV library: 2.4.6
U-Boot: uboot-2013.04
Toolchain: Sourcery CodeBench ARM 2011.03
Buildroot: 2012.02
The setup is derived from this blog
Now we have written a program (in C++, compiled with GCC 4.5.2) which uses the OpenCV library (to calculate some scores using support vector machines) and which behaves in a strange way:
The program runs on the board in its own process using defined test data: It produces repeatedly correct results.
The program runs in two or more processes (with the same defined test data): The results become wrong in each process, and processes die with segfaults. The last remaining process then runs correctly again.
The program runs in its own process (with the same defined test data again). Additionally, another process changes some exposure settings of an attached camera: The program starts to produce wrong results.
So we assume this is a very low-level floating-point problem.
What we tried
The complete system (all libraries, kernel, boot loader, etc.) has been compiled with the compiler flags suggested on pandorawiki.org regarding Floating_Point_Optimization:
-O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp
-ffast-math -fsingle-precision-constant
We tried to enable L1NEON in the Cortex-A8 aux ctrl register according to the BeagleBoard FAQ and tried the other options mentioned there as well, but unfortunately to no avail.
All three different behaviors are reproducible, but not in the form of a minimal working example.
The same program source and the first and second scenario run correctly on Windows (using Visual Studio) and on a desktop running Linux (GCC), so it's probably not something our code does.
So the questions are now:
Are there any other known bugs with this setup and floating point operations which we are not aware of?
Are there any known compiler options which should be set or omitted which can lead to the observed results?
If a MWE would be helpful, we will look into providing one.
Any clues are welcome.
OK, we now use an up-to-date buildroot (2014.08) with the included toolchain (arm-buildroot-linux-uclibcgnueabi-), Linux kernel 3.9.11, boost 1.55, Qt 4.8.6, and still OpenCV 2.4.6.
When compiling, we optimize for size (-Os), and for target optimization we only use -pipe.
The following compiler-flags are currently not used anymore:
-mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp -ffast-math -fsingle-precision-constant
Unfortunately, we still don't know the exact reason for the original problem, but we are quite happy that the problem went away with this setup.
So maybe this answer helps some poor soul in the future... ;)

How to compile a C++ program as 64-bit on 64-bit machine?

Perhaps a very trivial question:
I need to compile a program as 64-bit (the makefile was written earlier to compile it as 32-bit).
I saw the option -m32 appearing in the command line with each file compiled. So I modified the makefile to get rid of -m32 in OPTFLAG, but when the program compiles I still see -m32 showing up, and the binaries are still 32-bit. Does this -m32 come from somewhere else as well?
-m32 can only be coming from somewhere in your makefiles; you'll have to track it down (use a recursive grep) and remove it.
When I am able to force -m64, I get "CPU you selected does not support x86-64 instruction set". Any clues? uname -a gives x86_64.
That error means there is an option like -march=i686 in the makefiles, which is not valid for 64-bit compilation, try removing that too.
If you can't remove it (try harder!) then adding -march=x86-64 after it on the command line will specify a generic 64-bit CPU type.
If the software you are trying to build is autotools-based, this should do the trick:
./configure "CFLAGS=-m64" "CXXFLAGS=-m64" "LDFLAGS=-m64" && make
Or, for just a plain Makefile:
env CFLAGS=-m64 CXXFLAGS=-m64 LDFLAGS=-m64 make
If you are using CMake, you can add the -m64 compile option like this:
add_compile_options(-m64)
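Whichever route you take, you can verify the result with the file utility (myprog here is just a stand-in for your binary's name):
file myprog
The output should say something like "ELF 64-bit LSB executable, x86-64" rather than "ELF 32-bit".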

How to tell if OpenMP is working?

I am trying to run LIBSVM in parallel mode, however my question is in OpenMP in general. According to LIBSVM FAQ, I have modified the code with #pragma calls to use OpenMP. I also modified the Makefile (for un*x) by adding a -fopenmp argument so it becomes:
CFLAGS = -Wall -Wconversion -O3 -fPIC -fopenmp
The code compiles well. I check (since it's not my PC) whether OpenMP is installed with:
/sbin/ldconfig -p | grep gomp
and see that it is -probably- installed:
libgomp.so.1 (libc6,x86-64) => /usr/lib64/libgomp.so.1
libgomp.so.1 (libc6) => /usr/lib/libgomp.so.1
Now, when I run the program, I don't see any speed improvement. Also, when I check with "top", the process is using at most 100% CPU (there are 8 cores), and there is no CPU bottleneck (only one other user at 100% CPU usage). I was expecting to see more than 100% (or some other indicator) showing that the process is using multiple cores.
Is there a way to check that it is using multiple cores?
You can use the function omp_get_num_threads(). Called inside a parallel region, it returns the number of threads your program is actually using (outside one it returns 1).
With omp_get_max_threads() you get the maximum number of threads available to your program. It is also the maximum of all possible return values of omp_get_num_threads(). You can explicitly set the number of threads to be used by your program with the environment variable OMP_NUM_THREADS, e.g. in bash via
export OMP_NUM_THREADS=8; ./your_program
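A tiny self-check along those lines (remember to compile with -fopenmp; without it the pragmas are ignored and you will always see 1 thread):
#include <cstdio>
#include <omp.h>

int main() {
    std::printf("max threads: %d\n", omp_get_max_threads());
    #pragma omp parallel
    {
        #pragma omp single  // print once, from inside the parallel region
        std::printf("threads in parallel region: %d\n", omp_get_num_threads());
    }
    return 0;
}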