Why does my Xeon E5-2640 v3 2.60GHz run twice as fast as my Xeon E5-4627 v3 2.6GHz? - xeon-phi

I am struggling with the performance of a newly bought machine with a Xeon E5-4627 CPU. My experiment code runs almost twice as slowly on the Xeon E5-4627 v3 as the same code runs on a Xeon E5-2640 v3.
I did a simple benchmark on these two machines. Here is the code:
#include <cmath>
int main() {
    long N = 1 << 24;
    double s = 0;
    for (long i = 0; i < N; i++)
        s += std::cos(i % 360);
    return (int)s; // keep s live so -O3 cannot drop the loop
}
I compiled the code on the machine with the E5-4627 as g++ -std=c++11 -O3 -o test test.cpp and ran the same binary on both machines.
The E5-4627 took about 0.95 s while the E5-2640 took about 0.36 s.
Both machines run Ubuntu, and g++ 4.8.5 was used on both.
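One thing worth ruling out first (a hedged diagnostic sketch, not taken from the post): the two CPUs may simply run at different effective clock speeds under load, e.g. because of turbo or frequency-scaling settings. Something like this shows the clocks while the benchmark runs:
# Watch the effective core clocks while ./test runs in another terminal.
watch -n1 "grep 'cpu MHz' /proc/cpuinfo"
# If the cpufreq sysfs interface is present, the governor is worth checking too:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor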

Related

GCC compilation times slower on Debian Stretch/Buster than Wheezy/Jessie

In my company we build programs for different versions of Debian. We use Jenkins build chains with virtual machines on ESXi.
The programs compile with GCC. Based on some tests, we found that compilation on Stretch/Buster is 50% slower than on Wheezy/Jessie.
For example, a simple Hello World program:
jessie
------
real 0m0.099s
user 0m0.076s
sys 0m0.012s
buster
------
real 0m0.201s
user 0m0.168s
sys 0m0.032s
For small programs it's not really important, but for bigger projects the time difference is really visible (even with -O3 flags):
jessie
------
real 0m29.996s
user 0m26.636s
sys 0m1.688s
buster
------
real 0m59.051s
user 0m53.226s
sys 0m5.164s
Our biggest project takes 25 min to compile on Jessie versus 45 min on Stretch.
Note this is done on two different virtual machines, but on the same physical machine. The CPU model is: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz.
I think one reason might be the Meltdown and Spectre patches applied to the kernel, but I don't know whether these patches are enabled on Stretch.
Do you have any idea about the possible reasons for this performance difference? How can I check it? And how can I fix it, if possible?
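One way to check the mitigation status (a sketch; it assumes a kernel new enough to expose the sysfs vulnerabilities interface, which updated Debian kernels do):
# Each file reports the kernel's mitigation state for one CPU vulnerability.
grep . /sys/devices/system/cpu/vulnerabilities/*
# For an A/B test, individual mitigations can be switched off at boot,
# e.g. by adding 'nopti' or 'nospectre_v2' to the kernel command line.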
Regards.

Speed up extremely slow MinGW-w64 compilation/linking?

How can I speed up MinGW-w64's extremely slow C++ compilation/linking?
Compiling a trivial "Hello World" program:
#include <iostream>
int main()
{
    std::cout << "hello world" << std::endl;
}
...takes 3 minutes(!) on this otherwise-unloaded Windows 10 box (i7-6700, 32GB of RAM, decent SATA SSD):
> ptime.exe g++ main.cpp
ptime 1.0 for Win32, Freeware - http://www.pc-tools.net/
Copyright(C) 2002, Jem Berkes <jberkes@pc-tools.net>
=== g++ main.cpp ===
Execution time: 180.488 s
Process Explorer shows the g++ process tree bottoming out in ld.exe, which doesn't use any appreciable CPU or I/O for the duration.
Running the g++ process tree through API Monitor shows three unusually long syscalls in ld.exe: two NtCreateFile()s and a NtOpenFile(), each operating on a.exe and each taking 60 seconds.
The slowness only happens when using the default a.exe output; g++ -o foo.exe main.cpp takes 2 seconds, tops.
"Well don't use a.exe as an output name then!" isn't really a solution since this behavior causes CMake to take ages doing compiler feature detection.
GCC toolchain versions:
>g++ --version
g++ (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0
>ld --version
GNU ld (GNU Binutils) 2.30
The fact that I couldn't repro the problem in a clean Windows 10 VM, together with the dependence on the output filename, led me down the path of anti-virus/anti-malware interference.
fltmc instances listed several possible filesystem filter drivers; guess-n-check narrowed it down to two of Carbon Black's: carbonblackk & ParityDriver.
Using Regedit to disable them, by setting Start to 0x4 ("Disabled"; 0x2 == Automatic, 0x3 == Manual) in these two registry keys, followed by a reboot, fixed the slowness:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\carbonblackk
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ParityDriver
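For reference, the same Start values can be set from an elevated command prompt instead of Regedit (a sketch using the stock reg.exe tool; a reboot is still required afterwards):
reg add "HKLM\SYSTEM\CurrentControlSet\Services\carbonblackk" /v Start /t REG_DWORD /d 4 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Services\ParityDriver" /v Start /t REG_DWORD /d 4 /f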

Are C++17 Parallel Algorithms implemented already?

I was trying to play around with the new parallel library features proposed in the C++17 standard, but I couldn't get it to work. I tried compiling with the up-to-date versions of g++ 8.1.1 and clang++-6.0 and -std=c++17, but neither seemed to support #include <execution>, std::execution::par or anything similar.
Looking at the cppreference page for parallel algorithms, there is a long list of algorithms, with the claim:
Technical specification provides parallelized versions of the following 69 algorithms from algorithm, numeric and memory: ( ... long list ...)
which sounds like the algorithms are ready 'on paper', but not ready to use yet?
In this SO question from over a year ago the answers claim these features hadn't been implemented yet. But by now I would have expected to see some kind of implementation. Is there anything we can use already?
GCC 9 has them, but you have to install TBB separately
In Ubuntu 19.10, all components have finally aligned:
GCC 9 is the default one, and it is the minimum GCC version required for <execution>
TBB (Intel Thread Building Blocks) is at 2019~U8-1, so it meets the minimum 2018 requirement
so you can simply do:
sudo apt install gcc libtbb-dev
g++ -ggdb3 -O3 -std=c++17 -Wall -Wextra -pedantic -o main.out main.cpp -ltbb
./main.out
and use it as:
#include <execution>
#include <algorithm>
std::sort(std::execution::par_unseq, input.begin(), input.end());
See also the full runnable benchmark below.
GCC 9 and TBB 2018 are the first versions that work, as mentioned in the release notes: https://gcc.gnu.org/gcc-9/changes.html
Parallel algorithms and <execution> (requires Thread Building Blocks 2018 or newer).
Related threads:
How to install TBB from source on Linux and make it work
trouble linking INTEL tbb library
Ubuntu 18.04 installation
Ubuntu 18.04 is a bit more involved:
GCC 9 can be obtained from a trustworthy PPA, so it is not so bad
TBB is at version 2017, which does not work, and I could not find a trustworthy PPA for it. Compiling from source is easy, but there is no install target which is annoying...
Here are fully automated tested commands for Ubuntu 18.04:
# Install GCC 9
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install gcc-9 g++-9
# Compile libtbb from source.
sudo apt-get build-dep libtbb-dev
git clone https://github.com/intel/tbb
cd tbb
git checkout 2019_U9
make -j `nproc`
TBB="$(pwd)"
TBB_RELEASE="${TBB}/build/linux_intel64_gcc_cc7.4.0_libc2.27_kernel4.15.0_release"
# Use them to compile our test program.
g++-9 -ggdb3 -O3 -std=c++17 -Wall -Wextra -pedantic -I "${TBB}/include" \
  -L "${TBB_RELEASE}" -Wl,-rpath,"${TBB_RELEASE}" -o main.out main.cpp -ltbb
./main.out
Test program analysis
I tested with the following program, which compares parallel and serial sorting speed.
main.cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <execution>
#include <iostream>
#include <random>
#include <string>
#include <vector>

int main(int argc, char **argv) {
    using clk = std::chrono::high_resolution_clock;
    decltype(clk::now()) start, end;
    std::vector<unsigned long long> input_parallel, input_serial;
    unsigned int seed;
    unsigned long long n;
    // CLI arguments.
    std::uniform_int_distribution<uint64_t> zero_ull_max(0);
    if (argc > 1) {
        n = std::strtoll(argv[1], NULL, 0);
    } else {
        n = 10;
    }
    if (argc > 2) {
        seed = std::stoi(argv[2]);
    } else {
        seed = std::random_device()();
    }
    std::mt19937 prng(seed);
    for (unsigned long long i = 0; i < n; ++i) {
        input_parallel.push_back(zero_ull_max(prng));
    }
    input_serial = input_parallel;
    // Sort and time parallel.
    start = clk::now();
    std::sort(std::execution::par_unseq, input_parallel.begin(), input_parallel.end());
    end = clk::now();
    std::cout << "parallel " << std::chrono::duration<float>(end - start).count() << " s" << std::endl;
    // Sort and time serial.
    start = clk::now();
    std::sort(std::execution::seq, input_serial.begin(), input_serial.end());
    end = clk::now();
    std::cout << "serial " << std::chrono::duration<float>(end - start).count() << " s" << std::endl;
    assert(input_parallel == input_serial);
}
On Ubuntu 19.10, on a Lenovo ThinkPad P51 laptop (CPU: Intel Core i7-7820HQ, 4 cores / 8 threads, 2.90 GHz base, 8 MB cache; RAM: 2x Samsung M471A2K43BB1-CRC, 2x 16 GiB, 2400 Mbps), a typical output for an input of 100 million numbers to be sorted with
./main.out 100000000
was:
parallel 2.00886 s
serial 9.37583 s
so the parallel version was about 4.5 times faster! See also: What do the terms "CPU bound" and "I/O bound" mean?
We can confirm that the process is spawning threads with strace:
strace -f -s999 -v ./main.out 100000000 |& grep -E 'clone'
which shows several lines of this type:
[pid 25774] clone(strace: Process 25788 attached
[pid 25774] <... clone resumed> child_stack=0x7fd8c57f4fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fd8c57f59d0, tls=0x7fd8c57f5700, child_tidptr=0x7fd8c57f59d0) = 25788
Also, if I comment out the serial version and run with:
time ./main.out 100000000
I get:
real 0m5.135s
user 0m17.824s
sys 0m0.902s
which confirms again that the algorithm was parallelized since real < user, and gives an idea of how effectively it can be parallelized in my system (about 3.5x for 8 cores).
Error messages
Hey, Google, index this please.
If you don't have tbb installed, the error is:
In file included from /usr/include/c++/9/pstl/parallel_backend.h:14,
from /usr/include/c++/9/pstl/algorithm_impl.h:25,
from /usr/include/c++/9/pstl/glue_execution_defs.h:52,
from /usr/include/c++/9/execution:32,
from parallel_sort.cpp:4:
/usr/include/c++/9/pstl/parallel_backend_tbb.h:19:10: fatal error: tbb/blocked_range.h: No such file or directory
19 | #include <tbb/blocked_range.h>
| ^~~~~~~~~~~~~~~~~~~~~
compilation terminated.
so we see that <execution> depends on an uninstalled TBB component.
If TBB is too old, e.g. the default Ubuntu 18.04 one, it fails with:
#error Intel(R) Threading Building Blocks 2018 is required; older versions are not supported.
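A quick way to see which TBB is installed (a sketch; it assumes the classic TBB headers live under /usr/include/tbb):
# TBB encodes its release in a macro; the 2018-series releases report 10000 and up.
grep TBB_INTERFACE_VERSION /usr/include/tbb/tbb_stddef.h
# Or ask the package manager:
dpkg -s libtbb-dev | grep -i '^version'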
You can refer to https://en.cppreference.com/w/cpp/compiler_support to check the implementation status of all C++ features. For your case, search for "Standardization of Parallelism TS", and you will find that only the MSVC and Intel C++ compilers support this feature right now.
Intel has released a Parallel STL library which follows the C++17 standard:
https://github.com/intel/parallelstl
It is being merged into GCC.
GCC does not yet implement the Parallelism TS (see https://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.2017).
However, libstdc++ (shipped with GCC) has an experimental mode offering some equivalent parallel algorithms. See https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html
Getting it to work:
Any use of parallel functionality requires additional compiler and
runtime support, in particular support for OpenMP. Adding this support
is not difficult: just compile your application with the compiler flag
-fopenmp. This will link in libgomp, the GNU Offloading and Multi Processing Runtime Library, whose presence is mandatory.
Code example
#include <vector>
#include <parallel/algorithm>

int main()
{
    std::vector<int> v(100);
    // ...
    // Explicitly force a call to parallel sort.
    __gnu_parallel::sort(v.begin(), v.end());
    return 0;
}
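A minimal compile line for the example above, following the quoted documentation (the output name parsort is arbitrary):
# -fopenmp links in libgomp, which parallel mode requires.
g++ -O3 -fopenmp -o parsort parsort.cpp
# Alternatively, -D_GLIBCXX_PARALLEL makes eligible std:: algorithm calls
# dispatch to their parallel-mode counterparts without source changes:
g++ -O3 -fopenmp -D_GLIBCXX_PARALLEL -o parsort parsort.cpp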
GCC now supports the <execution> header, but the standard Clang builds from https://apt.llvm.org do not.

working with VexCL "compiling binaries"

I want to make a program which will be distributed to customers, so I want to protect my kernel code from hackers. Someone told me that the AMD driver somehow puts the kernel source inside the binary, so a hacker can log the kernel with an AMD device.
As I'm not experienced with VexCL yet, what is the proper compile line to distribute just binaries?
For example, with CUDA I can type: nvcc -gencode arch=compute_10,code=sm_10 myfile.cu -o myexec
What is the equivalent in VexCL?
Also, does VexCL work on Mac OS? Which IDE? (This is a future task, as I have no experience with Mac OS.)
My previous experience with OpenCL was with the STDCL library, but it is buggy on Windows and has no Mac support.
I am the developer of VexCL, and I have also replied to your question here.
VexCL generates OpenCL/CUDA kernels for the expressions you use in your code at runtime. Moreover, it allows the user to dump the generated kernel sources to the standard output stream. For example, if you save the following to a hello.cpp file:
#include <vexcl/vexcl.hpp>

int main() {
    vex::Context ctx(vex::Filter::Env);
    vex::vector<double> x(ctx, 1024);
    vex::vector<double> y(ctx, 1024);
    y = 2 * sin(M_PI * x) + 1;
}
then compile it with
g++ -o hello hello.cpp -std=c++11 -I/path/to/vexcl -lOpenCL -lboost_system
then set VEXCL_SHOW_KERNELS=1 and run the compiled binary:
$ export VEXCL_SHOW_KERNELS=1
$ ./hello
you will see the kernel that was generated for the expression y = 2 * sin(M_PI * x) + 1:
#if defined(cl_khr_fp64)
#  pragma OPENCL EXTENSION cl_khr_fp64: enable
#elif defined(cl_amd_fp64)
#  pragma OPENCL EXTENSION cl_amd_fp64: enable
#endif
kernel void vexcl_vector_kernel
(
    ulong n,
    global double * prm_1,
    int prm_2,
    double prm_3,
    global double * prm_4,
    int prm_5
)
{
    for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0))
    {
        prm_1[idx] = ( ( prm_2 * sin( ( prm_3 * prm_4[idx] ) ) ) + prm_5 );
    }
}
VexCL can also cache the compiled binaries (in the $HOME/.vexcl folder by default), and it saves the kernel source code together with the cache.
On the one hand, the sources you see, being automatically generated, are not very human-friendly. On the other hand, they are still more convenient to read than, e.g., a disassembled binary. I am afraid there is nothing you can do to keep the sources away from 'hackers', except perhaps modify the VexCL source code to suit your needs. The MIT license allows you to do that, and, if you are ready to do this, I could provide you with some guidance.
Mind you, the NVIDIA OpenCL driver does its own caching, and it also stores the kernel sources together with the cached binaries (in the $HOME/.nv/ComputeCache folder). I don't know if it is possible to alter this behavior, so 'hackers' could still get the kernel sources from there. I don't know if AMD does a similar thing, but maybe that is what your source meant by "log the kernel with AMD device".
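One possibly relevant knob, untested here: NVIDIA documents an environment variable that disables its compute cache, which may keep freshly generated sources out of $HOME/.nv/ComputeCache:
# Documented for the CUDA JIT cache; whether it also covers the OpenCL
# compile cache is an assumption worth verifying.
export CUDA_CACHE_DISABLE=1
./hello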
Regarding MacOS compatibility, I don't have a MacOS machine to do my own testing, but I have had reports that VexCL does work there. I am not sure which IDE was used.

Threaded DGEMM on ifort

I've implemented a working program (FORTRAN, compiler: Intel ifort 11.x) that makes calls to DGEMM. I've read that there's a quick way to parallelize this by compiling with:
ifort -mkl=parallel -O3 myprog.f -o myprog
I have a quad-core processor, so I run the program with (via bash):
export OMP_NUM_THREADS=4
./myprog
My assumption was that DGEMM would automatically summon 4 threads, resulting in faster matrix multiplication. That doesn't seem to be happening. Am I missing something? Any help would be appreciated.
I think -mkl=parallel is the default choice of the Intel compiler, so you don't have to set this flag explicitly. Try -mkl=sequential instead and see whether the calculations slow down.
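One quick experiment (a sketch; MKL_NUM_THREADS is MKL's own override and takes precedence over OMP_NUM_THREADS):
# Compare wall-clock times with MKL forced to 1 thread vs. 4 threads.
export MKL_NUM_THREADS=1
time ./myprog
export MKL_NUM_THREADS=4
time ./myprog
If both runs take about the same time, MKL's threading is not engaging at all, e.g. because the matrices are too small or the sequential MKL got linked in.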