Strange behaviour in multithreaded analysis - C++

For a university project we are implementing an algorithm capable of brute-forcing an AES key that we assume is partially known.
We have implemented several versions, including one that exploits the multithreading mechanism in C++.
The implementation allocates a variable number of threads, passed as input at launch, and divides the key space equally among them; each thread then cycles through its range, attempting each key. The implementation does work, as it succeeds in finding the key for any combination of #bitsToHack/#threads, but it returns strange timing results.
//Structs for threads and respective data
pthread_t threads[num_of_threads];
struct bf_data td[num_of_threads];
int rc;

//Space division
uintmax_t index = pow(BASE_NUMBER, num_bits_to_hack);
uintmax_t step = index / num_of_threads;

if (sem_init(&s, 1, 0) != 0) {
    printf("Error during semaphore initialization\n");
    return -1;
}

for (int i = 0; i < num_of_threads; i++) {
    //Structure initialization
    td[i].ciphertext = ciphertext;
    td[i].hacked_key = hacked_key;
    td[i].iv_aes = iv_aes;
    td[i].key = key_aes;
    td[i].num_bits_to_hack = num_bits_to_hack;
    td[i].plaintext = plaintext;
    td[i].starting_point = step * i;
    td[i].step = step;
    td[i].num_of_threads = num_of_threads;

    if (DEBUG)
        printf("Starting point for thread %d is: %lu, using step: %lu\n", i, td[i].starting_point, td[i].step);

    rc = pthread_create(&threads[i], NULL, decryption_brute_force, (void*)&td[i]);
    if (rc) {
        cout << "Error:unable to create thread," << rc << endl;
        exit(-1);
    }
}

sem_wait(&s);

for (int i = 0; i < num_of_threads; i++) {
    pthread_join(threads[i], NULL);
}
The decryption_brute_force function (the body of each thread) is, in pseudocode:
void* decryption_brute_force(void* data) {
    // Copy data into thread-local memory
    // Build the key to begin the search from starting_point
    // For each key from starting_point to starting_point + step:
    //     Try decryption
    //     If the obtained plaintext corresponds to the expected one:
    //         Print results, wake up the main thread and terminate
    //     Else:
    //         Increment the key and continue
}
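In C++, a minimal sketch of that thread body could look like the following. The helpers build_key_from_index() and try_decrypt(), as well as KEY_SIZE, are hypothetical placeholders (they are not part of the posted code); everything else mirrors the bf_data fields initialized above.

void* decryption_brute_force(void* data) {
    // Copy the shared descriptor into thread-local storage
    struct bf_data local = *(struct bf_data*)data;

    for (uintmax_t k = local.starting_point; k < local.starting_point + local.step; k++) {
        // Hypothetical helper: overwrite the unknown low bits of the key with candidate k
        build_key_from_index(local.key, local.num_bits_to_hack, k);

        // Hypothetical helper: AES-decrypt and compare against the expected plaintext
        if (try_decrypt(local.ciphertext, local.key, local.iv_aes, local.plaintext)) {
            memcpy(local.hacked_key, local.key, KEY_SIZE);  // KEY_SIZE is assumed here
            printf("Key found by a worker thread\n");
            sem_post(&s);          // wake up the main thread waiting on sem_wait(&s)
            pthread_exit(NULL);
        }
    }
    return NULL;
}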
To conclude the project we intended to conduct a study of the optimal number of threads, expecting an increase in performance as the number of threads increased up to a threshold, after which the system would no longer benefit from additional threads.
At the end of the analysis (a simulation lasting about 9 hours) we obtained the results shown in the plot below.
[Plot: measured runtimes for different numbers of threads]
We cannot understand why 8 threads perform better than 16. Could it be due to the CPU architecture? Could it schedule 32 and 8 threads better than 16?

From the comments: I think the linear-search pattern in each thread yields different results for different numbers of threads. When you double the threads, the point to be found may shift to a position further into its thread's range; once you double again, it cannot shift much further because there are already many threads. You said you always use the same encrypted data. Did you try different inputs?
(The step variable is an integer, so the distribution may not be exact.)

8 threads & step = 7 (56 units of work in total), target at index 16 (0-based):

01234567 89abcdef 01234567 89abcdef
| | |. | ...

500 seconds, as it is hit in the first loop iteration of its thread.

16 threads & step = 3 (56 units of work in total), target at index 16 again, but now only reached in the second iteration:

012 345 678 9ab cde f01 234 567 8
| | | | | | . | | | ...

1000 seconds, as it is found only after the second iteration in its thread.
Another example, with 2 threads versus 3 threads, where x is found at the 51st element of 100 elements of work:

2 threads
| |x(1st iteration) |

3 threads
| |........x | |

5x slower than with 2 threads.
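A quick way to check this hypothesis is to compute, for a fixed target position, how many keys the owning thread has to try before it hits the target, for different thread counts; that count is what dominates the wall-clock time. A small self-contained sketch (the numbers reuse the 100-element example above and are purely illustrative):

#include <cstdio>

int main() {
    const unsigned total  = 100;  // total amount of work (keys)
    const unsigned target = 50;   // 0-based index of the matching key (the 51st element)

    for (unsigned threads : {2, 3, 4, 8}) {
        unsigned step   = total / threads;        // integer division, as in the question
        unsigned owner  = target / step;          // thread whose range contains the target
        unsigned offset = target - owner * step;  // attempts that thread makes before the hit
        std::printf("%u threads: thread %u hits the target after %u attempts\n",
                    threads, owner, offset);
    }
    return 0;
}

The number of attempts is not monotonic in the number of threads, which is why doubling the threads can make a particular input slower even though the per-thread ranges shrink.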


Why is adding two std::vectors slower than raw arrays from new[]?

I'm looking at OpenMP, partially because my program needs to add very large vectors (millions of elements). However, I see quite a large difference depending on whether I use std::vector or raw arrays, which I cannot explain. I insist that the difference is only in the loop, not in the initialisation, of course.
The time difference I refer to comes from timing only the addition, specifically so as not to take into account any initialization difference between vectors, arrays, etc. I'm really talking only about the sum part. The size of the vectors is not known at compile time.
I use g++ 5.x on Ubuntu 16.04.
Edit: I tested what @Shadow said, and it got me thinking: is there something going on with optimization? If I compile with -O2, then, using initialized raw arrays, I get loop times that scale with the number of threads again. But with -O3 or -funroll-loops, it is as if the compiler kicks in early and optimizes before the pragma is seen.
I came up with the following, simple test:
#include <vector>
#include <iostream>
#include <omp.h>

#define SIZE 10000000
#define TRIES 200

int main(){
    std::vector<double> a, b, c;
    a.resize(SIZE);
    b.resize(SIZE);
    c.resize(SIZE);

    double start = omp_get_wtime();
    unsigned long int i, t;
    #pragma omp parallel shared(a,b,c) private(i,t)
    {
        for (t = 0; t < TRIES; t++) {
            #pragma omp for
            for (i = 0; i < SIZE; i++) {
                c[i] = a[i] + b[i];
            }
        }
    }
    std::cout << "finished in " << omp_get_wtime() - start << std::endl;
    return 0;
}
I compile with
g++ -O3 -fopenmp -std=c++11 main.cpp
And I get, for one thread:
>time ./a.out
finished in 2.5638
./a.out 2.58s user 0.04s system 99% cpu 2.619 total.
For two threads, the loop takes 1.2s, for 1.23s total.
Now if I use raw arrays:
int main(){
    double *a, *b, *c;
    a = new double[SIZE];
    b = new double[SIZE];
    c = new double[SIZE];

    double start = omp_get_wtime();
    unsigned long int i, t;
    #pragma omp parallel shared(a,b,c) private(i,t)
    {
        for (t = 0; t < TRIES; t++)
        {
            #pragma omp for
            for (i = 0; i < SIZE; i++)
            {
                c[i] = a[i] + b[i];
            }
        }
    }
    std::cout << "finished in " << omp_get_wtime() - start << std::endl;

    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}
And I get (1 thread):
>time ./a.out
finished in 1.92901
./a.out 1.92s user 0.01s system 99% cpu 1.939 total
std::vector is 33% slower!
For two threads:
>time ./a.out
finished in 1.20061
./a.out 2.39s user 0.02s system 198% cpu 1.208 total
As a comparison, with Eigen or Armadillo for exactly the same operation (using the c = a + b overload on vector objects), I get a total real time of ~2.8s. They are not multi-threaded for vector additions.
Now, I thought std::vector had almost no overhead? What is happening here? I'd like to use nice standard library objects.
I cannot find any reference anywhere on a simple example like this.
Meaningful benchmarking is hard
The answer from Xirema has already outlined the difference in the code in detail: std::vector::resize initializes the data to zero, whereas new double[size] does not. Note that you can use new double[size]() to force initialization.
However, your measurement doesn't include the initialization, and the number of repetitions is so high that the loop costs should outweigh the small initialization even in Xirema's example. So why do the very same instructions in the loop take more time when the data is initialized?
Minimal example
Let's dig to the core of this with a code that dynamically determines whether memory is initialized or not (Based on Xirema's, but only timing the loop itself).
#include <vector>
#include <chrono>
#include <iostream>
#include <memory>
#include <iomanip>
#include <cstring>
#include <string>
#include <sys/types.h>
#include <unistd.h>

constexpr size_t size = 10'000'000;

auto time_pointer(size_t reps, bool initialize, double init_value) {
    double * a = new double[size];
    double * b = new double[size];
    double * c = new double[size];

    if (initialize) {
        for (size_t i = 0; i < size; i++) {
            a[i] = b[i] = c[i] = init_value;
        }
    }

    auto start = std::chrono::steady_clock::now();
    for (size_t t = 0; t < reps; t++) {
        for (size_t i = 0; i < size; i++) {
            c[i] = a[i] + b[i];
        }
    }
    auto end = std::chrono::steady_clock::now();

    delete[] a;
    delete[] b;
    delete[] c;
    return end - start;
}

int main(int argc, char* argv[]) {
    bool initialize = (argc == 3);
    double init_value = 0;
    if (initialize) {
        init_value = std::stod(argv[2]);
    }
    auto reps = std::stoll(argv[1]);

    std::cout << "pid: " << getpid() << "\n";
    auto t = time_pointer(reps, initialize, init_value);
    std::cout << std::setw(12)
              << std::chrono::duration_cast<std::chrono::milliseconds>(t).count()
              << "ms" << std::endl;
    return 0;
}
Results are consistent:
./a.out 50 # no initialization
657ms
./a.out 50 0. # with initialization
1005ms
First glance at performance counters
Using the excellent Linux perf tool:
$ perf stat -e LLC-loads -e dTLB-misses ./a.out 50
pid: 12481
626ms
Performance counter stats for './a.out 50':
101.589.231 LLC-loads
105.415 dTLB-misses
0,629369979 seconds time elapsed
$ perf stat -e LLC-loads -e dTLB-misses ./a.out 50 0.
pid: 12499
1008ms
Performance counter stats for './a.out 50 0.':
145.218.903 LLC-loads
1.889.286 dTLB-misses
1,096923077 seconds time elapsed
Linear scaling with an increasing number of repetitions also tells us that the difference comes from within the loop. But why would initializing the memory cause more last-level cache loads and data TLB misses?
Memory is complex
To understand that, we need to understand how memory is allocated. Just because a malloc / new returns some pointer to virtual memory doesn't mean that there is physical memory behind it. The virtual memory can be in a page that is not backed by physical memory - and the physical memory is only assigned on the first page fault. Now here is where page-types (from linux/tools/vm) and the pid we show as output come in handy. Looking at the page statistics during a long execution of our little benchmark:
With initialization
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000804 1 0 __R________M______________________________ referenced,mmap
0x000000000004082c 392 1 __RU_l_____M______u_______________________ referenced,uptodate,lru,mmap,unevictable
0x000000000000086c 335 1 __RU_lA____M______________________________ referenced,uptodate,lru,active,mmap
0x0000000000401800 56721 221 ___________Ma_________t___________________ mmap,anonymous,thp
0x0000000000005868 1807 7 ___U_lA____Ma_b___________________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x0000000000405868 111 0 ___U_lA____Ma_b_______t___________________ uptodate,lru,active,mmap,anonymous,swapbacked,thp
0x000000000000586c 1 0 __RU_lA____Ma_b___________________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 59368 231
Most of the virtual memory is in a normal mmap,anonymous region - something that is mapped to a physical address.
Without initialization
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000001000000 1174 4 ________________________z_________________ zero_page
0x0000000001400000 37888 148 ______________________t_z_________________ thp,zero_page
0x0000000000000800 1 0 ___________M______________________________ mmap
0x000000000004082c 388 1 __RU_l_____M______u_______________________ referenced,uptodate,lru,mmap,unevictable
0x000000000000086c 347 1 __RU_lA____M______________________________ referenced,uptodate,lru,active,mmap
0x0000000000401800 18907 73 ___________Ma_________t___________________ mmap,anonymous,thp
0x0000000000005868 633 2 ___U_lA____Ma_b___________________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x0000000000405868 37 0 ___U_lA____Ma_b_______t___________________ uptodate,lru,active,mmap,anonymous,swapbacked,thp
0x000000000000586c 1 0 __RU_lA____Ma_b___________________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 59376 231
Now here, only 1/3 of the memory is backed by dedicated physical memory, and 2/3 are mapped to a zero page. The data behind a and b is all backed by a single read-only 4 KiB page filled with zeros. c (and a, b in the other test) have already been written to, so it has to have its own memory.
0 != 0
Now it may look weird: everything here is zero [1] - why does it matter how it became zero? Whether you memset(0), a[i] = 0., or std::vector::resize - everything causes explicit writes to memory, hence a page fault if you do it on a zero page. I don't think you can/should prevent physical page allocation at that point. The only thing you could do for the memset / resize is to use calloc to explicitly request zero'd memory, which is probably backed by a zero_page, but I doubt it is done (or makes a lot of sense). Remember that for new double[size]; or malloc there is no guarantee what kind of memory you get, but that includes the possibility of zero-memory.
[1]: Remember that the double 0.0 has all bits set to zero.
In the end the performance difference really comes only from the loop, but is caused by initialization. std::vector carries no overhead for the loop. In the benchmark code, raw arrays just benefit from optimization of an abnormal case of uninitialized data.
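For reference, a minimal sketch of the calloc variant mentioned above. Whether the allocation really stays backed by the zero page until written depends on the allocator and the kernel, so treat this as an illustration, not a guaranteed optimization:

#include <cstdlib>

int main() {
    const size_t size = 10'000'000;

    // calloc returns memory that is guaranteed to read as zero; for large blocks
    // glibc typically serves it from freshly mmap'ed pages, so no explicit memset runs
    double* a = static_cast<double*>(std::calloc(size, sizeof(double)));
    if (a == nullptr) return 1;

    // ... use a ...

    std::free(a);
    return 0;
}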
The observed behaviour is not OpenMP-specific and has to do with the way modern operating systems manage memory. Memory is virtual, meaning that each process has its own virtual address (VA) space and a special translation mechanism is used to map pages of that VA space to frames of physical memory. Consequently, memory allocation is performed in two stages:
reservation of a region within the VA space - this is what operator new[] does when the allocation is big enough (smaller allocations are handled differently for reasons of efficiency)
actually backing the region with physical memory upon access to some part of the region
The process is split in two parts since in many cases applications do not really use all the memory they reserve at once, and backing the entire reservation with physical memory might be wasteful (and unlike virtual memory, physical memory is a very limited resource). Therefore, backing reservations with physical memory is performed on demand, the very first time the process writes to a region of the allocated memory space. The process is known as faulting the memory region, since on most architectures it involves a soft page fault that triggers the mapping within the OS kernel. Every time your code writes for the first time to a region of memory that is not yet backed by physical memory, a soft page fault is triggered and the OS tries to map a physical page. The process is slow, as it involves finding a free page and a modification of the process page table. The typical granularity of that process is 4 KiB, unless some kind of large pages mechanism is in place, e.g. the Transparent Huge Pages mechanism on Linux.
What happens if you read for the first time from a page that has never been written to? Again, a soft page fault occurs, but instead of mapping a frame of physical memory, the Linux kernel maps a special "zero page". The page is mapped in CoW (copy-on-write) mode, which means that when you try to write it, the mapping to the zero page will be replaced by a mapping to a fresh frame of physical memory.
Now, take a look at the size of the arrays. Each of a, b, and c occupies 80 MB, which exceeds the cache size of most modern CPUs. One execution of the parallel loop thus has to bring 160 MB of data from the main memory and write back 80 MB. Because of how a write-allocate cache works, writing to c actually reads it first, unless non-temporal (cache-bypassing) stores are used; therefore 240 MB of data is read and 80 MB of data gets written. Multiplied by 200 outer iterations, this gives 48 GB of data read and 16 GB of data written in total.
The above is not the case when a and b are not initialised, i.e. the case when a and b are simply allocated using operator new[]. Since reads in that case result in access to the zero page, and there is physically only one zero page that easily fits in the CPU cache, no real data has to be brought in from the main memory. Therefore, only 16 GB of data has to be read in and then written back. If non-temporal stores are used, no memory is read at all.
This could be easily proven using LIKWID (or any other tool able to read the CPU hardware counters):
std::vector<double> version:
$ likwid-perfctr -C 0 -g HA a.out
...
+-----------------------------------+------------+
| Metric | Core 0 |
+-----------------------------------+------------+
| Runtime (RDTSC) [s] | 4.4796 |
| Runtime unhalted [s] | 5.5242 |
| Clock [MHz] | 2850.7207 |
| CPI | 1.7292 |
| Memory read bandwidth [MBytes/s] | 10753.4669 |
| Memory read data volume [GBytes] | 48.1715 | <---
| Memory write bandwidth [MBytes/s] | 3633.8159 |
| Memory write data volume [GBytes] | 16.2781 |
| Memory bandwidth [MBytes/s] | 14387.2828 |
| Memory data volume [GBytes] | 64.4496 | <---
+-----------------------------------+------------+
Version with uninitialised arrays:
+-----------------------------------+------------+
| Metric | Core 0 |
+-----------------------------------+------------+
| Runtime (RDTSC) [s] | 2.8081 |
| Runtime unhalted [s] | 3.4226 |
| Clock [MHz] | 2797.2306 |
| CPI | 1.0753 |
| Memory read bandwidth [MBytes/s] | 5696.4294 |
| Memory read data volume [GBytes] | 15.9961 | <---
| Memory write bandwidth [MBytes/s] | 5703.4571 |
| Memory write data volume [GBytes] | 16.0158 |
| Memory bandwidth [MBytes/s] | 11399.8865 |
| Memory data volume [GBytes] | 32.0119 | <---
+-----------------------------------+------------+
Version with uninitialised array and non-temporal stores (using Intel's #pragma vector nontemporal):
+-----------------------------------+------------+
| Metric | Core 0 |
+-----------------------------------+------------+
| Runtime (RDTSC) [s] | 1.5889 |
| Runtime unhalted [s] | 1.7397 |
| Clock [MHz] | 2530.1640 |
| CPI | 0.5465 |
| Memory read bandwidth [MBytes/s] | 123.4196 |
| Memory read data volume [GBytes] | 0.1961 | <---
| Memory write bandwidth [MBytes/s] | 10331.2416 |
| Memory write data volume [GBytes] | 16.4152 |
| Memory bandwidth [MBytes/s] | 10454.6612 |
| Memory data volume [GBytes] | 16.6113 | <---
+-----------------------------------+------------+
The disassembly of the two versions provided in your question when using GCC 5.3 shows that the two loops are translated to exactly the same sequence of assembly instructions, apart from the different code addresses. The sole reason for the difference in the execution time is the memory access, as explained above. Resizing the vectors initialises them with zeros, which results in a and b being backed by their own physical memory pages. Not initialising a and b when operator new[] is used results in their being backed by the zero page.
Edit: It took me so long to write this that in the mean time Zulan has written a way more technical explanation.
I have a good hypothesis.
I've written three versions of the code: one using raw double *, one using std::unique_ptr<double[]> objects, and one using std::vector<double>, and compared the runtimes of each of these versions of the code. For my purposes, I've used a single-threaded version of the code to try to simplify the case.
Total Code:
#include <vector>
#include <chrono>
#include <iostream>
#include <memory>
#include <iomanip>

constexpr size_t size = 10'000'000;
constexpr size_t reps = 50;

auto time_vector() {
    auto start = std::chrono::steady_clock::now();
    {
        std::vector<double> a(size);
        std::vector<double> b(size);
        std::vector<double> c(size);
        for (size_t t = 0; t < reps; t++) {
            for (size_t i = 0; i < size; i++) {
                c[i] = a[i] + b[i];
            }
        }
    }
    auto end = std::chrono::steady_clock::now();
    return end - start;
}

auto time_pointer() {
    auto start = std::chrono::steady_clock::now();
    {
        double * a = new double[size];
        double * b = new double[size];
        double * c = new double[size];
        for (size_t t = 0; t < reps; t++) {
            for (size_t i = 0; i < size; i++) {
                c[i] = a[i] + b[i];
            }
        }
        delete[] a;
        delete[] b;
        delete[] c;
    }
    auto end = std::chrono::steady_clock::now();
    return end - start;
}

auto time_unique_ptr() {
    auto start = std::chrono::steady_clock::now();
    {
        std::unique_ptr<double[]> a = std::make_unique<double[]>(size);
        std::unique_ptr<double[]> b = std::make_unique<double[]>(size);
        std::unique_ptr<double[]> c = std::make_unique<double[]>(size);
        for (size_t t = 0; t < reps; t++) {
            for (size_t i = 0; i < size; i++) {
                c[i] = a[i] + b[i];
            }
        }
    }
    auto end = std::chrono::steady_clock::now();
    return end - start;
}

int main() {
    std::cout << "Vector took " << std::setw(12) << time_vector().count() << "ns" << std::endl;
    std::cout << "Pointer took " << std::setw(12) << time_pointer().count() << "ns" << std::endl;
    std::cout << "Unique Pointer took " << std::setw(12) << time_unique_ptr().count() << "ns" << std::endl;
    return 0;
}
Test Results:
Vector took          1442575273ns  // Note: the first one executed, regardless of which function
                                   // it is, is always slower than expected. I'll talk about that later.
Pointer took          542265103ns
Unique Pointer took  1280087558ns
So all of the STL objects are demonstrably slower than the raw version. Why might this be?
Let's go to the Assembly! (compiled using Godbolt.com, using the snapshot version of GCC 8.x)
There are a few things we can observe to start with. For starters, the std::unique_ptr and std::vector code generate virtually identical assembly; std::unique_ptr<double[]> merely swaps out new and delete for new[] and delete[]. Since their runtimes are within the margin of error, we'll focus on the std::unique_ptr<double[]> version and compare it to double *.
Starting with .L5 and .L22, the code seems to be identical. The only major differences are some extra pointer arithmetic before the delete[] calls are made in the double * version, and some extra stack cleanup code at the end in .L34 (the std::unique_ptr<double[]> version), which doesn't exist for the double * version. Neither of these seems likely to have a strong impact on the code speed, so we're going to ignore them for now.
The code that's identical appears to be the code directly responsible for the loop. You'll notice that the code which is different (which I'll get to momentarily) doesn't contain any jump statements, which are integral to loops.
So all of the major differences appear to be specific to the initial allocation of the objects in question. This is between time_unique_ptr(): and .L32 for the std::unique_ptr<double[]> version, and between time_pointer(): and .L22 for the double * version.
So what's the difference? Well, they're almost doing the same thing. Except for a few lines of code that show up in the std::unique_ptr<double[]> version that don't show up in the double * version:
std::unique_ptr<double[]>:
mov edi, 80000000
mov r12, rax
call operator new[](unsigned long)
mov edx, 80000000
mov rdi, rax
xor esi, esi //Sets register to 0, which is probably used in...
mov rbx, rax
call memset //!!!
mov edi, 80000000
call operator new[](unsigned long)
mov rdi, rax
mov edx, 80000000
xor esi, esi //Sets register to 0, which is probably used in...
mov rbp, rax
call memset //!!!
mov edi, 80000000
call operator new[](unsigned long)
mov r14, rbx
xor esi, esi //Sets register to 0, which is probably used in...
mov rdi, rax
shr r14, 3
mov edx, 80000000
mov r13d, 10000000
and r14d, 1
call memset //!!!
double *:
mov edi, 80000000
mov rbp, rax
call operator new[](unsigned long)
mov rbx, rax
mov edi, 80000000
mov r14, rbx
shr r14, 3
call operator new[](unsigned long)
and r14d, 1
mov edi, 80000000
mov r12, rax
sub r13, r14
call operator new[](unsigned long)
Well would you look at that! Some unexpected calls to memset that aren't part of the double * code! It's quite clear that std::vector<T> and std::unique_ptr<T[]> are contracted to "initialize" the memory they allocate, whereas double * has no such contract.
So this is basically a very, very round-about way of verifying what Shadow observed: When you make no attempt to "zero-fill" the arrays, the compiler will
Do nothing for double * (saving precious CPU cycles), and
Do the initialization without prompting for std::vector<double> and std::unique_ptr<double[]> (costing time initializing everything).
But when you do add zero-fill, the compiler recognizes that it's about to "repeat itself", optimizes out the second zero-fill for std::vector<double> and std::unique_ptr<double[]> (which results in the code not changing) and adds it to the double * version, making it the same as the other two versions. You can confirm this by comparing the new version of the assembly where I've made the following change to the double * version:
double * a = new double[size];
for(size_t i = 0; i < size; i++) a[i] = 0;
double * b = new double[size];
for(size_t i = 0; i < size; i++) b[i] = 0;
double * c = new double[size];
for(size_t i = 0; i < size; i++) c[i] = 0;
And sure enough, the assembly now has those loops optimized into memset calls, the same as the std::unique_ptr<double[]> version! And the runtime is now comparable.
(Note: the runtime of the pointer is now slower than the other two! I observed that the first function called, regardless of which one, is always about 200ms-400ms slower. I'm blaming branch prediction. Either way, the speed should be identical in all three code paths now).
So that's the lesson: std::vector and std::unique_ptr are making your code slightly safer by preventing that Undefined Behavior you were invoking in your code that used raw pointers. The consequence is that it's also making your code slower.
I tested it and found out the following: The vector case had a runtime about 1.8 times longer than the raw array case. But this was only the case when I did not initialize the raw array. After adding a simple loop before the time measurement to initialize all entries with 0.0 the raw array case took as long as the vector case.
I took a closer look and did the following:
I did not initialize the raw arrays like
for (size_t i{0}; i < SIZE; ++i)
    a[i] = 0.0;
but did it this way:
for (size_t i{0}; i < SIZE; ++i)
    if (a[i] != 0.0)
    {
        std::cout << "a was set at position " << i << std::endl;
        a[i] = 0.0;
    }
(the other arrays accordingly).
The result was that I got no console output from initializing the arrays, and it was again as fast as without initializing at all, that is, about 1.8 times faster than with the vectors.
When I initialized, for example, only a normally (with a plain assignment) and the other two arrays with the if clause, I measured a time between the vector runtime and the runtime with all arrays "fake-initialized" with the if clause.
Well... that's strange...
Now, I thought std::vector had almost no overhead? What is happening here? I'd like to use nice STL objects...
Although I cannot explain this behavior to you, I can tell you that there is not really any overhead for std::vector if you use it "normally". This is just a very artificial case.
EDIT:
As qPCR4vir and the OP Napseis pointed out, this might have to do with optimization. As soon as I turned on optimization, the "real init" case was slower by the already mentioned factor of about 1.8. But without it, it was still about 1.1 times slower.
So I looked at the assembler code, but I did not see any difference in the 'for' loops...
The major thing to notice here is the fact that
The array version has undefined behavior
dcl.init #12 states:
If an indeterminate value is produced by an evaluation, the behavior is undefined
And this is exactly what happens in that line:
c[i] = a[i] + b[i];
Both a[i] and b[i] are indeterminate values since the arrays are default-initialized.
The UB perfectly explains the measuring results (whatever they are).
UPD: In light of @HristoIliev's and @Zulan's answers, I'd like to emphasize the language point of view once more.
For the compiler, the UB of reading uninitialized memory essentially means that it can always assume the memory is initialized, so whatever the OS does is fine with C++, even if the OS has some specific behavior for that case.
Well it turns out that it does - your code is not reading the physical memory and your measurements correspond to that.
One could say that the resulting program does not compute the sum of two arrays - it computes the sum of two more easily accessible mocks, and it is fine with C++ exactly because of the UB. If it did something else, it would still be perfectly fine.
So in the end you have two programs: one adds up two vectors and the other just does something undefined (from C++ point of view) or something unrelated (from OS point of view). What is the point of measuring their timings and comparing the results?
Fixing the UB solves the whole problem, but more importantly it validates your measurements and allows you to meaningfully compare the results.
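Concretely, the fix in the raw-array benchmark can be as small as value-initializing the allocations; a memset or an explicit loop works just as well. A sketch:

// new double[SIZE]() value-initializes, i.e. zero-fills, the array,
// so reading a[i] and b[i] in the timed loop is no longer undefined behaviour
double *a = new double[SIZE]();
double *b = new double[SIZE]();
double *c = new double[SIZE]();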
In this case, I think the culprit is -funroll-loops, from what I just tested at -O2 with and without this option.
https://gcc.gnu.org/onlinedocs/gcc-5.4.0/gcc/Optimize-Options.html#Optimize-Options
-funroll-loops: Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop. It also turns on complete loop peeling (i.e. complete removal of loops with small constant number of iterations). This option makes code larger, and may or may not make it run faster.

Why can sin(Vector) on all cores be as fast as sin(V) on one core?

I have a simple piece of C++ code that runs the default sin function across a vector of values.
#include <vector>
#include <random>
#include <cmath>
#include <numeric>
#include <algorithm>
#include <functional>
#include <iostream>

using namespace std;

static void BM_sin() {
    int data_size = 100000000;
    double lower_bound = 0;
    double upper_bound = 1;
    random_device device;
    mt19937 engine(device());
    uniform_real_distribution<double> distribution(lower_bound, upper_bound);
    auto generator = bind(distribution, engine);
    vector<double> data(data_size);
    generate(begin(data), end(data), generator);

    #pragma omp parallel for
    for (int i = 0; i < data_size; ++i) {
        data[i] = sin(data[i]);
    }
    cout << accumulate(data.begin(), data.end(), 0) << endl;
}
I get the same time when I run this function with OMP_NUM_THREADS exported as 1 and as 8, on a machine with 8 cores. Commenting out the #pragma omp parallel for line does not change anything either. So I wonder why sin applied to a vector from all threads is as fast as applied from one thread?
(I compile with -Ofast -fopenmp on gcc-4.8)
Simple answer is simple:
Not all things scale well. I don't know fast_sin, but it's possible it's mainly memory-bandwidth limited. In that case, you'll win nothing by distributing the workload across cores.
Also, I doubt your measuring method. If your generator is the mt19937, it's a lot more complex than your sine, so parallelizing your sine doesn't do much, because most of the time is spent generating random numbers.
You are measuring something incorrectly. The generator loop is slow, but not so slow that it completely overshadows the sine loop. Here are the results of measuring the execution speed of several code parts on two different Intel architectures:
Code part | WM (x64) | WM (x86) | SB (x64) | SB (x86)
-----------------------+----------+----------+----------+----------
generate() | 1,45 s | 2,44 s | 1,28 s | 2,18 s
sine loop (serial) | 2,17 s | 2,88 s | 1,80 s | 2,91 s
sine loop (6 threads) | 0,37 s | 0,51 s | 0,31 s | 0,52 s
accumulate() | 0,31 s | 0,70 s | 0,33 s | 0,67 s
-----------------------+----------+----------+----------+----------
speed-up: overall | 1,85x | 1,65x | 1,78x | 1,71x
speed-up: sine loop | 5,86x | 5,65x | 5,81x | 5,60x
speed-up: Amdahl | 2,23x | 1,92x | 2,12x | 2,02x
In the above table, WM stands for Intel X5675, a Westmere CPU, while SB stands for Intel E5-2650, a Sandy Bridge CPU. x64 stands for 64-bit mode and x86 - for 32-bit mode. GCC 4.8.5 was used with -Ofast -fopenmp -mtune=native (-m32 for 32-bit mode). Both systems are running CentOS 7.2. The execution times are only approximate, as I haven't done proper timing by taking the average of multiple executions. Timing was done using the portable omp_get_wtime() timer routine.
As you can see, the overall speed-up with 6 threads ranges from 1,65x to 1,85x, while the speed-up for the sine loop alone ranges from 5,60x to 5,86x. Both the generator loop and the accumulator loop are performed in serial, which caps the parallel speed-up (see Amdahl's law).
Two things to note here. First one, the generator loop could be a tad faster if the memory for the vector is pre-faulted. It basically means sweeping over the vector and touching every memory page that backs it. Running the generator loop twice and only timing the second invocation will also do the trick. On my systems that brings no noticeable advantage (the savings are on the same order as the measurement error), most likely since CentOS's kernel has transparent huge pages turned on by default.
The second thing is the last parameter to accumulate() is an integer 0, therefore the algorithm is forced to perform an integer conversion every time, which slows it down considerably and gives the wrong result at the end (0). accumulate(data.begin(), data.end(), 0.0) executes ten times faster and also produces the correct result.
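A sketch of how per-part timings like the ones in the table can be taken, reusing the variables from the code in the question (the exact code used for the measurements above may differ):

double t0 = omp_get_wtime();
generate(begin(data), end(data), generator);
double t1 = omp_get_wtime();

#pragma omp parallel for
for (int i = 0; i < data_size; ++i) {
    data[i] = sin(data[i]);
}
double t2 = omp_get_wtime();

// 0.0 keeps the accumulation in double instead of truncating every element to int
double sum = accumulate(data.begin(), data.end(), 0.0);
double t3 = omp_get_wtime();

cout << "generate():   " << t1 - t0 << " s\n"
     << "sine loop:    " << t2 - t1 << " s\n"
     << "accumulate(): " << t3 - t2 << " s, sum = " << sum << endl;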

Filter strange C++ multimap values

I have this multimap in my code:
multimap<long, Note> noteList;

// notes are added with this method. measureNumber is minimum `1` and doesn't go very high
void Track::addNote(Note &note) {
    long key = note.measureNumber * 1000000 + note.startTime;
    this->noteList.insert(make_pair(key, note));
}
I'm encountering problems when I try to read the notes from the last measure. In this case the song has only 8 measures and it's measure number 8 that causes problems. If I go up to 16 measures it's measure 16 that causes the problem and so on.
// (when adding notes I use as key the measureNumber * 1000000. This searches for notes within the same measure)
for (noteIT = trackIT->noteList.lower_bound(this->curMsr * 1000000);
     noteIT->first < (this->curMsr + 1) * 1000000; noteIT++) {
    if (this->curMsr == 8) {
        cout << "_______________________________________________________" << endl;
        cout << "ID:" << noteIT->first << endl;
        noteIT->second.toString();
        int blah = 0;
    }
    // code left out here that processes the notes
}
I have only added one note to the 8th measure and yet this is the result I'm getting in console:
_______________________________________________________
ID:8000001
note toString()
Duration: 8
Start Time: 1
Frequency: 880
_______________________________________________________
ID:1
note toString()
Duration: 112103488
Start Time: 44
Frequency: 0
_______________________________________________________
ID:8000001
note toString()
Duration: 8
Start Time: 1
Frequency: 880
_______________________________________________________
ID:1
note toString()
Duration: 112103488
Start Time: 44
Frequency: 0
This keeps repeating. The first result is a correct note which I've added myself but I have no idea where the note with ID: 1 is coming from.
Any ideas how to avoid this? This loop gets stuck repeating the same two results and I can't get out of it. Even if there are several notes within measure 8 (that means several values within the multimap whose keys start with 8xxxxxx), it only repeats the first note and the non-existent one.
You aren't checking for the end of your loop correctly. Specifically, there is no guarantee that noteIT does not equal trackIT->noteList.end(). Try this instead:
for (noteIT = trackIT->noteList.lower_bound(this->curMsr * 1000000);
     noteIT != trackIT->noteList.end() &&
     noteIT->first < (this->curMsr + 1) * 1000000;
     ++noteIT)
{
By the look of it, it might be better to use a call to upper_bound (or lower_bound of the next measure's base key) as the limit of your loop, as sketched below. That would handle the end case automatically.
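A sketch of that idea, using the same key scheme as in the question; here lower_bound of the next measure's base key plays the role of the upper limit (upper_bound of the last key in the measure would work the same way):

auto first = trackIT->noteList.lower_bound(this->curMsr * 1000000L);
auto last  = trackIT->noteList.lower_bound((this->curMsr + 1) * 1000000L);

for (noteIT = first; noteIT != last; ++noteIT) {
    // only notes whose key lies inside the current measure are visited,
    // and the loop terminates correctly even for the last measure
}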

C++ and MPI how to write part of code as parallel?

I've been writing some code using the PETSc library and now I'm going to change a part of it to run in parallel. Most of what I want to parallelize is the matrix initialization and the parts where I generate and calculate a large amount of values. Anyway, my problem is the following: if I run the code with more than 1 core, for some reason all parts of the code are run as many times as the number of cores I use.
This is just simple sample code with which I tested PETSc and MPI:
int main(int argc, char** argv)
{
    time_t rawtime;
    time(&rawtime);
    string sta = ctime(&rawtime);
    cout << "Solving began..." << endl;

    PetscInitialize(&argc, &argv, 0, 0);

    Mat A;                  /* linear system matrix */
    PetscInt i,j,Ii,J,Istart,Iend,m = 120000,n = 3,its;
    PetscErrorCode ierr;
    PetscBool flg = PETSC_FALSE;
    PetscScalar v;
#if defined(PETSC_USE_LOG)
    PetscLogStage stage;
#endif

    /* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
          Compute the matrix and right-hand-side vector that define
          the linear system, Ax = b.
       - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
    /*
       Create parallel matrix, specifying only its global dimensions.
       When using MatCreate(), the matrix format can be specified at
       runtime. Also, the parallel partitioning of the matrix is
       determined by PETSc at runtime.
       Performance tuning note: For problems of substantial size,
       preallocation of matrix memory is crucial for attaining good
       performance. See the matrix chapter of the users manual for details.
    */
    ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
    ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,m,n);CHKERRQ(ierr);
    ierr = MatSetFromOptions(A);CHKERRQ(ierr);
    ierr = MatMPIAIJSetPreallocation(A,5,PETSC_NULL,5,PETSC_NULL);CHKERRQ(ierr);
    ierr = MatSeqAIJSetPreallocation(A,5,PETSC_NULL);CHKERRQ(ierr);
    ierr = MatSetUp(A);CHKERRQ(ierr);

    /*
       Currently, all PETSc parallel matrix formats are partitioned by
       contiguous chunks of rows across the processors. Determine which
       rows of the matrix are locally owned.
    */
    ierr = MatGetOwnershipRange(A,&Istart,&Iend);CHKERRQ(ierr);

    /*
       Set matrix elements for the 2-D, five-point stencil in parallel.
        - Each processor needs to insert only elements that it owns
          locally (but any non-local elements will be sent to the
          appropriate processor during matrix assembly).
        - Always specify global rows and columns of matrix entries.
       Note: this uses the less common natural ordering that orders first
       all the unknowns for x = h then for x = 2h etc; Hence you see J = Ii +- n
       instead of J = I +- m as you might expect. The more standard ordering
       would first do all variables for y = h, then y = 2h etc.
    */
    PetscMPIInt rank;   // processor rank
    PetscMPIInt size;   // size of communicator
    MPI_Comm_rank(PETSC_COMM_WORLD,&rank);
    MPI_Comm_size(PETSC_COMM_WORLD,&size);

    cout << "Rank = " << rank << endl;
    cout << "Size = " << size << endl;
    cout << "Generating 2D-Array" << endl;

    double temp2D[120000][3];
    for (Ii=Istart; Ii<Iend; Ii++) {
        for (J=0; J<n; J++) {
            temp2D[Ii][J] = 1;
        }
    }
    cout << "Processor " << rank << " set values : " << Istart << " - " << Iend << " into 2D-Array" << endl;

    v = -1.0;
    for (Ii=Istart; Ii<Iend; Ii++) {
        for (J=0; J<n; J++) {
            ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
        }
    }
    cout << "Ii = " << Ii << " processor " << rank << " and it owns: " << Istart << " - " << Iend << endl;

    /*
       Assemble matrix, using the 2-step process:
         MatAssemblyBegin(), MatAssemblyEnd()
       Computations can be done while messages are in transition
       by placing code between these two statements.
    */
    ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

    MPI_Finalize();
    cout << "No more MPI" << endl;
    return 0;
}
And my real program has a couple of different .cpp files. I initialize MPI in the main program, which calls a function in another .cpp file where I implemented the same kind of matrix filling, but all the couts the program does before filling the matrices are printed as many times as the number of my cores.
I can run my test program as mpiexec -n 4 test and it runs successfully, but for some reason I have to run my real program as mpiexec -n 4 ./myprog
The output of my test program is the following:
Solving began...
Solving began...
Solving began...
Solving began...
Rank = 0
Size = 4
Generating 2D-Array
Processor 0 set values : 0 - 30000 into 2D-Array
Rank = 2
Size = 4
Generating 2D-Array
Processor 2 set values : 60000 - 90000 into 2D-Array
Rank = 3
Size = 4
Generating 2D-Array
Processor 3 set values : 90000 - 120000 into 2D-Array
Rank = 1
Size = 4
Generating 2D-Array
Processor 1 set values : 30000 - 60000 into 2D-Array
Ii = 30000 processor 0 and it owns: 0 - 30000
Ii = 90000 processor 2 and it owns: 60000 - 90000
Ii = 120000 processor 3 and it owns: 90000 - 120000
Ii = 60000 processor 1 and it owns: 30000 - 60000
no more MPI
no more MPI
no more MPI
no more MPI
Edit after two comments:
So my goal is to run this on a small cluster which has 20 nodes, each with 2 cores. Later on this should run on a supercomputer, so MPI is definitely the way I need to go. I'm currently testing this on two different machines: one of them has 1 processor / 4 cores and the second has 4 processors / 16 cores.
MPI is an implementation of the SPMD/MPMD model (single program multiple data / multiple programs multiple data). An MPI job consists of concurrently running processes that exchange messages between each other in order to cooperate on solving a problem. You cannot run only part of the code in parallel. You can only have parts of the code that do not communicate with each other but still execute concurrently. And you ought to use mpirun or mpiexec to start your application in parallel mode.
If you'd like to make only parts of your code parallel and can live with the limitation that you can only run the code on a single machine, then what you need is OpenMP and not MPI. Or you can also use low-level POSIX threads programming since, according to the PETSc web site, it supports pthreads. OpenMP is usually built on top of pthreads, so using PETSc with OpenMP might be possible.
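For illustration, a value-generation loop like the one in the question would look roughly like this with OpenMP (a generic sketch, not PETSc-specific; fill_values is a made-up name):

#include <omp.h>

void fill_values(double* values, int rows, int cols) {
    // All threads belong to one process and share one address space,
    // so each thread simply works on its share of the rows
    #pragma omp parallel for
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            values[i * cols + j] = 1.0;
}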
To add to Hristo's answer, MPI is built to run in a distributed fashion, i.e. completely separate processes. They have to be separate, because they are supposed to be on different physical machines. You can run multiple MPI processes on one machine, for example one per core. That's perfectly OK, but MPI does not have any tools to take advantage of that shared memory context. In other words, you cannot have some MPI ranks (processes) do work on a matrix that is owned by another MPI process because you have no way to share the matrix.
When you start x MPI processes, you get x copies of the exact same program running. You need code like:
if (rank == 0)
    do something
else
    do something else
to have the different processes do different things. The processes can communicate with each other by sending messages, but they all run the exact same binary.
If you don't make the code diverge, then you'll just get x copies of the same program giving the same result x times.
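A minimal, self-contained sketch of that pattern with plain MPI (independent of PETSc):

#include <mpi.h>
#include <iostream>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        // printed once, no matter how many processes are started with mpiexec
        std::cout << "Solving began with " << size << " processes" << std::endl;
    }

    // ... work that every rank performs on its own share of the problem ...

    MPI_Finalize();
    return 0;
}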

What's the difference between "static" and "dynamic" schedule in OpenMP?

I started working with OpenMP using C++.
I have two questions:
What is #pragma omp for schedule?
What is the difference between dynamic and static?
Please, explain with examples.
Others have since answered most of the question, but I would like to point to some specific cases where a particular scheduling type is more suited than the others. Schedule controls how loop iterations are divided among threads. Choosing the right schedule can have a great impact on the speed of the application.
static schedule means that iteration blocks are mapped statically to the execution threads in a round-robin fashion. The nice thing with static scheduling is that the OpenMP run-time guarantees that if you have two separate loops with the same number of iterations and execute them with the same number of threads using static scheduling, then each thread will receive exactly the same iteration range(s) in both parallel regions. This is quite important on NUMA systems: if you touch some memory in the first loop, it will reside on the NUMA node where the executing thread was. Then in the second loop the same thread can access the same memory location faster since it will reside on the same NUMA node.
Imagine there are two NUMA nodes: node 0 and node 1, e.g. a two-socket Intel Nehalem board with 4-core CPUs in both sockets. Then threads 0, 1, 2, and 3 will reside on node 0 and threads 4, 5, 6, and 7 will reside on node 1:
| | core 0 | thread 0 |
| socket 0 | core 1 | thread 1 |
| NUMA node 0 | core 2 | thread 2 |
| | core 3 | thread 3 |
| | core 4 | thread 4 |
| socket 1 | core 5 | thread 5 |
| NUMA node 1 | core 6 | thread 6 |
| | core 7 | thread 7 |
Each core can access memory from each NUMA node, but remote access is slower (1.5x - 1.9x slower on Intel) than local node access. You run something like this:
char *a = (char *)malloc(8*4096);

#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
    memset(&a[i*4096], 0, 4096);
4096 bytes in this case is the standard size of one memory page on Linux on x86 if huge pages are not used. This code will zero the whole 32 KiB array a. The malloc() call just reserves virtual address space but does not actually "touch" the physical memory (this is the default behaviour unless some other version of malloc is used, e.g. one that zeroes the memory like calloc() does). Now this array is contiguous but only in virtual memory. In physical memory half of it would lie in the memory attached to socket 0 and half in the memory attached to socket 1. This is so because different parts are zeroed by different threads and those threads reside on different cores and there is something called first touch NUMA policy which means that memory pages are allocated on the NUMA node on which the thread that first "touched" the memory page resides.
| | core 0 | thread 0 | a[0] ... a[4095]
| socket 0 | core 1 | thread 1 | a[4096] ... a[8191]
| NUMA node 0 | core 2 | thread 2 | a[8192] ... a[12287]
| | core 3 | thread 3 | a[12288] ... a[16383]
| | core 4 | thread 4 | a[16384] ... a[20479]
| socket 1 | core 5 | thread 5 | a[20480] ... a[24575]
| NUMA node 1 | core 6 | thread 6 | a[24576] ... a[28671]
| | core 7 | thread 7 | a[28672] ... a[32767]
Now lets run another loop like this:
#pragma omp parallel for schedule(static,1) num_threads(8)
for (i = 0; i < 8; i++)
    memset(&a[i*4096], 1, 4096);
Each thread will access the already mapped physical memory and it will have the same mapping of thread to memory region as the one during the first loop. It means that threads will only access memory located in their local memory blocks which will be fast.
Now imagine that another scheduling scheme is used for the second loop: schedule(static,2). This will "chop" iteration space into blocks of two iterations and there will be 4 such blocks in total. What will happen is that we will have the following thread to memory location mapping (through the iteration number):
| | core 0 | thread 0 | a[0] ... a[8191] <- OK, same memory node
| socket 0 | core 1 | thread 1 | a[8192] ... a[16383] <- OK, same memory node
| NUMA node 0 | core 2 | thread 2 | a[16384] ... a[24575] <- Not OK, remote memory
| | core 3 | thread 3 | a[24576] ... a[32767] <- Not OK, remote memory
| | core 4 | thread 4 | <idle>
| socket 1 | core 5 | thread 5 | <idle>
| NUMA node 1 | core 6 | thread 6 | <idle>
| | core 7 | thread 7 | <idle>
Two bad things happen here:
threads 4 to 7 remain idle and half of the compute capability is lost;
threads 2 and 3 access non-local memory and it will take them about twice as much time to finish during which time threads 0 and 1 will remain idle.
So one of the advantages of using static scheduling is that it improves locality in memory access. The disadvantage is that a bad choice of scheduling parameters can ruin the performance.
dynamic scheduling works on a "first come, first served" basis. Two runs with the same number of threads might (and most likely would) produce completely different "iteration space" -> "threads" mappings as one can easily verify:
$ cat dyn.c
#include <stdio.h>
#include <omp.h>

int main (void)
{
    int i;

    #pragma omp parallel num_threads(8)
    {
        #pragma omp for schedule(dynamic,1)
        for (i = 0; i < 8; i++)
            printf("[1] iter %0d, tid %0d\n", i, omp_get_thread_num());

        #pragma omp for schedule(dynamic,1)
        for (i = 0; i < 8; i++)
            printf("[2] iter %0d, tid %0d\n", i, omp_get_thread_num());
    }
    return 0;
}
$ icc -openmp -o dyn.x dyn.c
$ OMP_NUM_THREADS=8 ./dyn.x | sort
[1] iter 0, tid 2
[1] iter 1, tid 0
[1] iter 2, tid 7
[1] iter 3, tid 3
[1] iter 4, tid 4
[1] iter 5, tid 1
[1] iter 6, tid 6
[1] iter 7, tid 5
[2] iter 0, tid 0
[2] iter 1, tid 2
[2] iter 2, tid 7
[2] iter 3, tid 3
[2] iter 4, tid 6
[2] iter 5, tid 1
[2] iter 6, tid 5
[2] iter 7, tid 4
(same behaviour is observed when gcc is used instead)
If the sample code from the static section were run with dynamic scheduling instead, there would be only a 1/70 (1.4%) chance that the original locality is preserved and a 69/70 (98.6%) chance that remote access occurs. This fact is often overlooked, and hence suboptimal performance is achieved.
There is another reason to choose between static and dynamic scheduling - workload balancing. If each iteration takes a time that differs vastly from the mean, then a high work imbalance might occur in the static case. Take as an example the case where the time to complete an iteration grows linearly with the iteration number. If the iteration space is divided statically between two threads, the second one will have three times more work than the first one, and hence for 2/3 of the compute time the first thread will be idle, as sketched below. Dynamic scheduling introduces some additional overhead but in that particular case will lead to a much better workload distribution. A special kind of dynamic scheduling is guided, where smaller and smaller iteration blocks are given to each task as the work progresses.
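A small sketch of that situation: the cost of an iteration grows linearly with its index, so with two threads and schedule(static) the thread that gets the upper half of the iteration space has roughly three times the work of the other, while schedule(dynamic) hands out iterations as threads become free (the constants are arbitrary and only meant to make the imbalance visible):

#include <omp.h>
#include <cmath>
#include <cstdio>

int main() {
    const int n = 2000;
    double sink = 0.0;

    double t0 = omp_get_wtime();
    // swap schedule(static) for schedule(dynamic) and compare the run times
    #pragma omp parallel for schedule(static) reduction(+:sink) num_threads(2)
    for (int i = 0; i < n; i++) {
        // per-iteration cost grows linearly with the iteration number
        for (int k = 0; k < i * 100; k++)
            sink += std::sin(k * 1e-6);
    }
    std::printf("time: %f s (sink = %g)\n", omp_get_wtime() - t0, sink);
    return 0;
}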
Since precompiled code could be run on various platforms, it would be nice if the end user could control the scheduling. That's why OpenMP provides the special schedule(runtime) clause. With runtime scheduling, the type is taken from the content of the environment variable OMP_SCHEDULE. This allows testing different scheduling types without recompiling the application and also allows the end user to fine-tune for his or her platform.
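For illustration, this is all it takes in the code; do_work and n are placeholders, and the schedule is then picked per run through the environment:

#pragma omp parallel for schedule(runtime)
for (int i = 0; i < n; i++)
    do_work(i);   // placeholder for the real per-iteration work

// chosen at run time, without recompiling:
//   OMP_SCHEDULE="static,16" ./a.out
//   OMP_SCHEDULE="dynamic,4" ./a.out
//   OMP_SCHEDULE="guided"    ./a.out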
I think the misunderstanding comes from the fact that you are missing the point of OpenMP.
In a sentence, OpenMP allows you to execute your program faster by enabling parallelism.
In a program, parallelism can be enabled in many ways, and one of them is by using threads.
Suppose you have an array:
[1,2,3,4,5,6,7,8,9,10]
and you want to increment all elements by 1 in this array.
If you are going to use
#pragma omp for schedule(static, 5)
it means that each of the threads will be assigned 5 contiguous iterations. In this case the first thread will take 5 numbers, the second one will take another 5, and so on until there is no more data to process or the maximum number of threads is reached (typically equal to the number of cores). The division of the workload is determined up front, before the loop executes.
In case of
#pragma omp for schedule(dynamic, 5)
The work will be shared among the threads, but this procedure will occur at runtime, thus involving more overhead. The second parameter specifies the size of the chunk of data.
Not being very familiar with OpenMP, I would venture that the dynamic type is more appropriate when the compiled code is going to run on a system with a different configuration from the one on which the code was compiled.
I would recommend the page below, which discusses the techniques used for parallelizing code, their preconditions and limitations:
https://computing.llnl.gov/tutorials/parallel_comp/
Additional links:
http://en.wikipedia.org/wiki/OpenMP
Difference between static and dynamic schedule in openMP in C
http://openmp.blogspot.se/
The loop partitioning scheme is different. The static scheduler would divide a loop over N elements into M subsets, and each subset would then contain strictly N/M elements.
The dynamic approach calculates the size of the subsets on the fly, which can be useful if the subsets' computation times vary.
The static approach should be used if computation times do not vary much.