There is an interesting article about AWS's ARMv8.1 Graviton 2 offering.
The article includes CPU core-to-core latency (coherency) tests that I am trying to reproduce.
There is a C++ repo on GitHub named core-latency that uses the Nonius micro-benchmarking framework.
I managed to replicate the first test (without atomic instructions), compiling with the command below:
$ g++ -std=c++11 -Wall -pthread -O3 -Iinclude -o core-latency main.cpp -march=armv8-a
The article claims that on ARMv8.1 the benchmark uses atomic CAS instructions and achieves much better latency, and it provides test results showing the improvement.
I tried to repeat this by compiling for ARMv8.1, ARMv8.2, and ARMv8.3. Sample compile commands are below:
$ g++ -std=c++11 -Wall -pthread -O3 -Iinclude -o core-latency main.cpp -march=armv8.1-a+lse
$ g++ -std=c++11 -Wall -pthread -O3 -Iinclude -o core-latency main.cpp -march=armv8.2-a+lse
$ g++ -std=c++11 -Wall -pthread -O3 -Iinclude -o core-latency main.cpp -march=armv8.3-a+lse
None of these improved the performance, so I generated the assembly with these commands:
$ g++ -std=c++11 -Wall -pthread -O3 -Iinclude -S main.cpp -march=armv8.1-a+lse
$ g++ -std=c++11 -Wall -pthread -O3 -Iinclude -S main.cpp -march=armv8.2-a+lse
$ g++ -std=c++11 -Wall -pthread -O3 -Iinclude -S main.cpp -march=armv8.3-a+lse
I searched the generated assembly and could not find any CAS instructions.
I also tried compilation variations with and without "+lse" and "-moutline-atomics".
I am not a C++ expert and have only a basic understanding of the language.
My guess is that the code needs some changes to use the atomic instructions.
Tests were executed on an m6g.16xlarge EC2 instance in AWS, running Ubuntu 20.04.
If someone could check the core-latency code and give some insight into making it compile to CAS instructions, that would be a great help.
After doing some more experiments, I found the problem.
The code snippet below performs two separate steps:
making a plain comparison first (if state equals Ping), then
calling the class method set to do an atomic store operation.
Code snippet from core-latency:
if (state == Ping)        // plain load and compare
    sync.set(Pong);       // separate atomic store
...
void set(State new_state)
{
    state.store(new_state);   // a store, not a compare-and-swap
}
Code written this way never compiles to a CAS instruction. If you want an atomic compare-and-swap, you need to use the relevant method of std::atomic.
I wrote the sample code below for experimenting:
#include <atomic>
#include <cstdio>

int main()
{
    int expected = 0;
    int desired = 1;
    std::atomic<int> current;
    current.store(expected);
    printf("Before %d\n", current.load());
    // Retry until the CAS succeeds; compare_exchange_weak may fail spuriously.
    while (!current.compare_exchange_weak(expected, desired));
    printf("After %d\n", current.load());
}
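Compiling and running it (the file name cas_test.cpp is my choice) shows the expected values:

$ g++ -std=c++11 -O3 -march=armv8.1-a -o cas_test cas_test.cpp
$ ./cas_test
Before 0
After 1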
Compiling it for ARMv8.1, I can see that it uses a CAS instruction.
Compiling it for ARMv8.0, I can see that it does not (which is expected, as CAS is not supported in that version).
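A quick way to check is to emit assembly and search for the LSE compare-and-swap mnemonics (cas, casa, casal, casl); the file name is again my choice:

$ g++ -std=c++11 -O3 -march=armv8.1-a -S -o cas_test.s cas_test.cpp
$ grep -E '\bcas' cas_test.s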
So to get CAS instructions, I need to use std::atomic::compare_exchange_weak or std::atomic::compare_exchange_strong; otherwise the compiler will not emit CAS but will compile the comparison and the store separately.
In summary, I can rewrite the benchmark with std::atomic::compare_exchange_weak and see what results I get. A sketch of the idea follows.
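Here is a minimal sketch of a CAS-based ping-pong between two threads; the Ping/Pong names come from the snippet above, but the enum, thread setup, and iteration count are my illustration, not the actual repo code:

#include <atomic>
#include <thread>
#include <cstdio>

enum State { Ping, Pong };

int main()
{
    std::atomic<State> state{Ping};
    const int iterations = 100000;

    std::thread responder([&] {
        for (int i = 0; i < iterations; ++i) {
            State expected = Ping;
            // Wait for Ping and flip it to Pong in one atomic operation.
            while (!state.compare_exchange_weak(expected, Pong))
                expected = Ping;  // CAS overwrote expected on failure; reset it
        }
    });

    for (int i = 0; i < iterations; ++i) {
        State expected = Pong;
        while (!state.compare_exchange_weak(expected, Ping))
            expected = Pong;
    }

    responder.join();
    std::puts("done");
}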
Update, April 30
I have created a new version of the code with atomic compare-and-swap support.
It is available here: https://github.com/fuatu/core-latency-atomic
Here are the test results on an m6g.16xlarge (ARM) instance:
Without CAS: average latency 245 ns
With CAS: average latency 39 ns
Related
I'm trying to profile a C++ shared library on Windows 10, in order to find which lines the program spends the most time on. (The code happens to form part of an R package.)
I've previously used AMD µprof and Very Sleepy. However, I'm now having trouble: all these profilers show is which DLL is being used, rather than which function / line.
I suspect that the problem relates to debugging symbol tables being missing. Per
Enabling debug symbols in shared library using GCC, I've ensured that a -g flag is applied when compiling each file, and that there is no -s flag at the linker stage. What else do I need to do to allow µprof / Very Sleepy to tell me which lines of the code are proving a bottleneck?
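One way to check whether the DLL actually contains DWARF debug information (my suggestion, using standard binutils, and assuming the DLL name from the link command below):

$ objdump -h PackageName.dll | grep debug

If the -g flag made it through to the final binary, this should list sections such as .debug_info and .debug_line.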
Detailed compilation notes
I'm using RBuildTools MinGW-w64 v3 g++ 8.3.0 to compile the code on 64-bit Windows 10.
Here are some sample compile commands, generated by R from its Makevars / Makeconf templates.
g++ -std=gnu++14 -I"<<include paths>>" -DNDEBUG -g -O2 -Wall \
    -mfpmath=sse -msse2 -mstackrealign \
    -c source_file.cpp -o source_file.o

g++ -shared -static-libgcc -g -Og \
    -o PackageName.dll tmp.def source_file.o <<Other files>> \
    -L<<Library paths>>
I've also tried replacing -g with -gdwarf-2 -g3, and adding -fno-omit-frame-pointer, per Very Sleepy doesn't see function names when capturing MinGW compiled file.
Running without shared library
ssbssa suggested running against a simple executable.
I tried:
#include <chrono>
#include <thread>
#include <iostream>

long sumto(long n) {
    if (n > 0) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        return n + sumto(n - 1);
    }
    return 1;
}

int main() {
    std::cout << sumto(1000) << std::endl;
    return 0;
}
>"C:/RBuildTools/4.0/mingw64/bin/"g++ -std=gnu++14 -gdwarf-2 -g3 -Og -c test.cpp
>"C:/RBuildTools/4.0/mingw64/bin/"g++ -std=gnu++14 -gdwarf-2 -g3 -Og -o test test.o
test.exe runs as expected. When I profile test.exe, AMD µprof states "The raw file has no data!", whereas Very Sleepy does detect activity in sumto and displays the associated source code.
I am trying to compile a C++ project and have been asked to use gcc. It compiles without any problems, but the code generated by gcc runs around ten times slower than that produced by g++, even with full optimizations.
Is this because gcc is meant for C programs, or can I do something to make the compiler optimize more?
This is the content in my makefile:
CFLAGS = -std=c++11 -lstdc++ -O3
SRC_PATH=Loki
FILES=bench.cpp bitboard.cpp evaluation.cpp magics.cpp main.cpp misc.cpp move.cpp \
    perft.cpp position.cpp psqt.cpp search.cpp thread.cpp transposition.cpp uci.cpp
SOURCES=$(FILES:%.cpp=$(SRC_PATH)/%.cpp)
OUTFILE=Loki.exe
all:
gcc ${SOURCES} -o $(OUTFILE) ${CFLAGS}
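For comparison, here is a minimal sketch of the same makefile driven by g++ (the CXX/CXXFLAGS variable names are conventional choices of mine, not from the original). Since g++ links libstdc++ automatically, -lstdc++ can be dropped:

CXX = g++
CXXFLAGS = -std=c++11 -O3
SRC_PATH = Loki
FILES = bench.cpp bitboard.cpp evaluation.cpp magics.cpp main.cpp misc.cpp move.cpp \
        perft.cpp position.cpp psqt.cpp search.cpp thread.cpp transposition.cpp uci.cpp
SOURCES = $(FILES:%.cpp=$(SRC_PATH)/%.cpp)
OUTFILE = Loki.exe

all:
	$(CXX) $(CXXFLAGS) $(SOURCES) -o $(OUTFILE)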
I have a simulation code, written in C++ using the standard library together with the Eigen library for linear algebra and matrices, and I have observed the following effect:
if I run it on my personal laptop, a MacBook running OS X, I get different results than if I run it on my office PC running Ubuntu.
The difference in the average result is approximately 20%.
I am sure it is the same code, because both machines run the up-to-date version of the same repository.
The code is too complex to provide here and has a lot of complicated mechanisms inside.
Remark: running on the same device always gives a similar average, so my results are reproducible as long as they run on the same hardware.
How should I interpret this? Is my code erroneous? Or are such effects normal?
Thanks in advance!
Edit1: I am using qmake; when I run make, this is an example line compiling my main.cpp on the Ubuntu PC:
g++ -c -pipe -O2 -std=gnu++11 -Wall -W -D_REENTRANT -fPIC -DQT_DEPRECATED_WARNINGS -DQT_NO_DEBUG -DQT_CORE_LIB -I. -ICommon -INetwork -IEigen -I../../../Qt/5.10.0/gcc_64/include -I../../../Qt/5.10.0/gcc_64/include/QtCore -Imocs -I../../../Qt/5.10.0/gcc_64/mkspecs/linux-g++ -o objs/main.o main.cpp
The same line looks like the following on my OS X laptop:
/Library/Developer/CommandLineTools/usr/bin/clang++ -c -pipe -stdlib=libc++ -g -std=gnu++11 -arch x86_64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -mmacosx-version-min=10.13 -Wall -Wextra -fPIC -DQT_DEPRECATED_WARNINGS -DQT_CORE_LIB -I. -ICommon -INetwork -IEigen -I../../../opt/Qt5.14/lib/QtCore.framework/Headers -Imocs -I../../../opt/Qt5.14/mkspecs/macx-clang -F/Users/XXX/opt/Qt5.14/lib -o objs/main.o main.cpp
Edit2: I have an array of doubles which is calculated offline using the Eigen libraries. Offline means that the elements of the array have deterministic values, i.e., it is a lookup table. I have compared the contents on both platforms, and the (printed) contents are completely identical.
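Such effects can be normal for floating-point code: the two build lines above use different compilers (g++ vs clang++), different optimization levels (-O2 vs -g), and different standard libraries, any of which may evaluate the same expressions in a different order or with fused operations. A minimal illustration (not from the original code) of how evaluation order alone changes a double result:

#include <cstdio>

int main()
{
    // The same three doubles, summed in two different orders:
    double a = 1e16, b = -1e16, c = 1.0;
    std::printf("%.17g\n", (a + b) + c);  // prints 1
    std::printf("%.17g\n", a + (b + c));  // prints 0 (b + c rounds back to -1e16)
}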
I would like to profile some C++ code using gprof. I compile the program exactly as normal but with -pg added at the end, i.e. something like:
g++ prog.cpp $(OBJECTS) -lgmp -lgmpxx -lmpfr -lmpc -msse2 -std=c++11 -O2 -o prog_P -pg
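For reference, the standard gprof workflow after building with -pg (generic usage, not specific to this project):

$ ./prog_P                      # running the instrumented binary writes gmon.out
$ gprof prog_P gmon.out > analysis.txt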
However, when I run the resulting executable, I get a bunch of errors that are not normally there. Specifically, they come from the zkcm multiprecision library:
Warning: in zkcm_gauss::gauss, partial pivoting failed.
This is bad news for my LU decomposition. Any ideas?
EDIT: I use Cygwin.
I have been writing a C++ program on Ubuntu and Windows 8 using Armadillo. Under Windows 8 the program compiles without problems.
The program just uses the linear-systems solver.
Under Ubuntu, the linker reports:
undefined reference to `wrapper_dgels_'
The compiler line I use is:
mpic++ -O2 -std=c++11 -Wall -fexceptions -O2 -larmadillo -llapack -lblas program.o
However, right before the error I see:
g++ module_of_the_error.o
which is something I haven't set.
I am using Code::Blocks on Ubuntu, and I compiled Armadillo with all the libraries that CMake asked for (BLAS, LAPACK, OpenBLAS, HDF5, ARPACK, etc.).
I have no clue what might be causing the problem, since the exact same code compiles in Visual Studio. I have tried the suggested compiler-line modifications, but they do not seem to work.
Any help is appreciated.
This is a trap I fell into myself once, and you will not like the likely cause of your error.
The order of the arguments to the linker matters.
Instead of
mpic++ -O2 -std=c++11 -Wall -fexceptions -O2 -larmadillo -llapack -lblas program.o
try:
mpic++ -O2 -std=c++11 -Wall -fexceptions -O2 program.o -larmadillo -llapack -lblas
I.e., put the object files to be linked into the executable before the libraries.
By the way, at this stage you are only linking files that have already been compiled. It is not necessary to repeat command line options that are only relevant for compiling. So this will be equivalent:
mpic++ program.o -larmadillo -llapack -lblas
Moreover, depending on how you installed Armadillo, you are adding either one or two superfluous libraries in that line. One of the following should be enough:
mpic++ program.o -larmadillo
or
mpic++ program.o -llapack -lblas
EDIT: as the answer by rerx states, the problem is probably just the ordering of the switches/arguments supplied to g++. The -l switches need to come after the source and object files being linked. For example:
g++ prog.cpp -o prog -O3 -larmadillo
original answer:
Looks like your compiler can't find the Armadillo run-time library. The proper solution is to specify the path of the Armadillo run-time library using the -L switch. For example: g++ -O2 blah.cpp -o blah -L /usr/local/lib/ -larmadillo
Another possible solution is to define ARMA_DONT_USE_WRAPPER before including the armadillo header, and then directly link with LAPACK and BLAS. For example:
#define ARMA_DONT_USE_WRAPPER
#include <armadillo>
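With the wrapper disabled, the link line must then name LAPACK and BLAS directly; a minimal sketch (the program name is illustrative):

g++ -O2 prog.cpp -o prog -llapack -lblas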
More details are available at the Armadillo frequently asked questions page.