Nested vectors consume too much memory in C++ [duplicate] - c++

This question already has answers here:
Nested STL vector using way too much memory
(6 answers)
Closed 8 years ago.
I'm trying to figure out why my application consumes too much memory. Here it is:
#include <iostream>
#include <sstream>
#include <string>
#include <exception>
#include <algorithm>
#include <vector>
#include <utility>
#include <assert.h>
#include <limits.h>
#include <time.h>
#include <tchar.h>
#include <random>
typedef unsigned __int32 uint;
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
    vector<vector<uint>> arr(65536 * 16, vector<uint>());
    mt19937 mt;
    mt.seed(time(NULL));
    uniform_int<uint> generator(0, arr.size() - 1);
    for (uint i = 0; i < 10000000; i++)
    {
        for (uint j = 0; j < 16; j++)
        {
            uint bucketIndex = generator(mt);
            arr[bucketIndex].push_back(i);
        }
    }
    uint cap = 0;
    for (uint i = 0; i < arr.size(); i++)
    {
        cap += sizeof(uint) * arr[i].capacity() + sizeof(arr[i]);
    }
    cap += sizeof(vector<uint>) * arr.capacity() + sizeof(vector<vector<uint>>);
    cout << "Total bytes: " << cap << endl;
    cout << "Press any key..." << endl;
    cin.get();
}
I use Windows 7 64-bit and Visual Studio 2010, code is also compiled as 64-bit.
Code outputs the following in Debug and Release
Total bytes: 914591424
Looks correct (you can check it by hand), but the memory manager shows that the application consumes ~1.4 GB of RAM.
Where did those extra ~500 MB come from? Could you please give me an idea how to sort this out?
UPDATE
The problem is caused by memory fragmentation. It can be solved by compacting memory from time to time.

This is because each vector contains three pointers (or their moral equivalents): begin, begin + size, and begin + capacity. So when you have a vector containing tons of other small vectors, each inner vector wastes three words (24 bytes on a 64-bit system).
And since each inner vector's begin() points to a separate allocation, you pay the allocator's per-allocation overhead N times. That can add several more bytes per vector.
Instead, you probably want to allocate a single large region and treat it like a 2D array. Or use one of the many libraries that offer such functionality. That won't work if your inner vectors are of different sizes, but often they are all one size, so you really want a 2D "rectangle" anyway, rather than a vector of vectors.

I've compared with Boost Container's vector. And added shrink_to_fit. The difference:
Total bytes: 690331672 // boost::container::vector::shrink_to_fit()
Total bytes: 1120033816 // std::vector
(Note also that boost containers never dynamically allocate on default construction.)
Here's the code (not much change, there):
#include <iostream>
#include <exception>
#include <algorithm>
#include <vector>
#include <utility>
#include <cassert>
#include <cstdint>
#include <random>
#include <ctime>   // time()
#include <boost/optional.hpp>
#include <boost/container/vector.hpp>
using boost::container::vector;
using boost::optional;
int main()
{
    vector<vector<uint32_t>> arr(1<<20);
    std::mt19937 mt;
    mt.seed(time(NULL));
    std::uniform_int_distribution<uint32_t> generator(0, arr.size() - 1);
    for (uint32_t i = 0; i < 10000000; i++)
    {
        for (uint32_t j = 0; j < 16; j++)
        {
            auto& bucket = arr[generator(mt)];
            //if (!bucket) bucket = vector<uint32_t>();
            bucket.push_back(i);
        }
    }
    for (auto& i : arr)
        i.shrink_to_fit();
    uint32_t cap = 0;
    for (uint32_t i = 0; i < arr.size(); i++)
    {
        cap += sizeof(uint32_t) * arr[i].capacity() + sizeof(arr[i]);
    }
    cap += sizeof(vector<uint32_t>) * arr.capacity() + sizeof(arr);
    std::cout << "Total bytes: " << cap << std::endl;
    std::cout << "Press any key..." << std::endl;
    std::cin.get();
}
Update: memory profile run (Massif / ms_print)
--------------------------------------------------------------------------------
Command: ./test
Massif arguments: (none)
ms_print arguments: massif.out.4193
--------------------------------------------------------------------------------
MB
822.7^ #
| ###
| #####:
| #######:
| #########:
| :############:
| :::::##:############:
| ##:: ::# :############:
| #####:: ::# :############:
| ### ###:: ::# :############:
| :::#### ###:: ::# :############:
| ###:::: #### ###:: ::# :############:
| ##### :::: #### ###:: ::# :############:
| #### ### :::: #### ###:: ::# :############:
| ###:#### ### :::: #### ###:: ::# :############:
| ###### :#### ### :::: #### ###:: ::# :############:
| #### ### :#### ### :::: #### ###:: ::# :############:
| ######## ### :#### ### :::: #### ###:: ::# :############:
| :::::::## ##### ### :#### ### :::: #### ###:: ::# :############:
| ::#:#:::: ::: ## ##### ### :#### ### :::: #### ###:: ::# :############:
0 +----------------------------------------------------------------------->Gi
0 69.85

The problem is that you don't know the exact sizes of your arrays; otherwise you could set the vector capacities with reserve before actually filling them, and this way you could avoid fragmentation. Try the following:
1. Generate the random seed (time(NULL)) and save it for later use.
2. Create an std::vector<uint> of size 65536 * 16 and initialize all counters in it to zero; let's name this vector vec_sizes. We will use it to find out the final size of each of the vectors we will later create and fill.
3. Initialize a random generator with the seed saved in step #1.
4. Run your algorithm (the nested for loops), but instead of storing an item into the 2D vector with arr[bucketIndex].push_back(i), just increment the vec_sizes[bucketIndex] counter.
5. Now we know the sizes of all vectors.
6. Create your arr vector.
7. For each subvector in arr, call the vector's reserve method with the corresponding size found in vec_sizes. This preallocates the vectors exactly, so you can avoid reallocations.
8. Initialize a random generator with the same seed we stored in step #1.
9. Run your algorithm. Now pushing data into the vectors doesn't reallocate, as their storage has already been allocated by your reserve calls.
Here we exploited the fact that you are using a pseudo-random generator that gives the very same series of numbers if you run it twice starting from the same seed.
Note: When memory efficiency is the goal, the solution is often to do the work twice: first calculate the dimensions of the final data, then allocate space compactly, and only then fill up the allocated storage. Usually you have to sacrifice something.

Related

Argmax of 2d vector on C++

I am working on python/pytorch and I have an example like
2d vector a:
            -----> dim-1
    |   [[-1.7739,  0.8073,  0.0472, -0.4084],
    |    [ 0.6378,  0.6575, -1.2970, -0.0625],
    v    [ 1.7970, -1.3463,  0.9011, -0.8704],
  dim-0  [ 1.5639,  0.7123,  0.0385,  1.8410]]
Then, the argmax with the index of 1 will be
# argmax (indices where max values are present) along dimension-1
In [215]: torch.argmax(a, dim=1)
Out[215]: tensor([1, 1, 0, 3])
My question is: given the 2d vector a as above, how could I implement an argmax function in C++ to give me the same output as above? Thanks for reading.
This is what I did
vector<vector<float>> a_vect
{
    {-1.7739, 0.8073, 0.0472, -0.4084},
    {0.6378, 0.6575, -1.2970, -0.0625},
    {1.7970, -1.3463, 0.9011, -0.8704},
    {1.5639, 0.7123, 0.0385, 1.8410}
};
std::vector<int>::iterator max = max_element(a_vect.begin() , a_vect.end()-a_vect.begin());
You can use std::max_element to find the index in each sub vector
#include <algorithm>
#include <iostream>
#include <vector>
using std::vector;
int main()
{
    vector<vector<float>> a_vect =
    {
        {-1.7739, 0.8073, 0.0472, -0.4084},
        {0.6378, 0.6575, -1.2970, -0.0625},
        {1.7970, -1.3463, 0.9011, -0.8704},
        {1.5639, 0.7123, 0.0385, 1.8410}
    };
    vector<int> max_index;
    for (auto& v : a_vect)
        max_index.push_back(std::max_element(v.begin(), v.end()) - v.begin());
    for (auto i : max_index)
        std::cout << i << ' '; // 1 1 0 3
}

C++ Need help sorting a 2D string array

I'm a little stuck with sorting a string Table[X][Y]. As tagged, I'm using C++ and have to use standard libraries and make it work for all C++ (not only C++11).
The size of the Table is fixed (I get the X by reading how many lines a file has, and the Y is fixed because that's the number of different "attributes" each line has).
When I create the Table, each part of it is obtained as Table[X][Y] = stringX.data(); from things previously read from a file and stored in strings. I have numbers in the first column (the one I'm going to use as the sorting criterion), and names, addresses, etc. in the others.
The part where the Table is created is:
Table[i][0] = string1.data();
Table[i][1] = string2.data();
Table[i][2] = string3.data();
Table[i][3] = string4.data();
Table[i][4] = string5.data();
Where "i" is the current "iteration" of a while(fgets) that reads one line at a time from a file, does some operations, and stores in those strings the "final values" of each part of the line read.
I have to sort that Table using the first column as the criterion, in decreasing order.
Let's imagine the Table is like this: Table[4][3]
20 | Jhon | 14th July
2 | Mary | 9th June
44 | Mark | 10th December
1 | Chris | 4th Feb
And i need the output to be this:
44 | Mark | 10th December
20 | Jhon | 14th July
2 | Mary | 9th June
1 | Chris | 4th Feb
I have been reading several questions and pages, and they sort int/char arrays or convert the array into a vector and then work with that. I'm trying to sort the string Table I have without converting anything (I don't know if that's possible).
I don't know if I managed to explain the issue and my situation clearly enough. I'm not posting all the code I have because, apart from the declaration of the string Table and the strings that are then placed as string.data() in the Table, the rest of the code has nothing to do with the Table and the sorting process. The code opens the file, reads it line by line, filters the info I need using some separators and special characters, places each of the "ranking criteria" into a string, then assigns a "ranking" after evaluating each of the criteria and giving a total score (which is then stored in "string1").
After all this is done, I create the string Table[x][y] and place the filtered and processed information in that Table one row at a time (because I assign this while reading each line from the file).
The only thing that remains is sorting the table from the best score to the last and then creating a file with the top 10.
I appreciate and thank in advance the time you took reading this and any tip, information, code, or source you could provide from which I can learn this.
First, as mentioned in the comments, a variable length array is accomplished in C++ by using std::vector. The current syntax you're using
std::string Table[X][Y]
where either X or Y are runtime variables, is not legal C++. Given your example, a standard C++ declaration would be this:
std::vector<std::array<std::string, 3>> Table;
So let's assume that this is what you are going to use.
The next step is to sort the data based on the first column. That can be accomplished by utilizing the std::sort algorithm function, along with the appropriate predicate indicating that you are using the first column as the sorting criteria.
Here is a short example, using your data, of how this is all accomplished:
#include <vector>
#include <array>
#include <iostream>
#include <algorithm>
#include <string>
int main()
{
    std::vector<std::array<std::string, 3>> Table;

    // Set up the test data
    Table.push_back({"20", "Jhon", "14th July"});
    Table.push_back({"2", "Mary", "9th June"});
    Table.push_back({"44", "Mark", "10th December"});
    Table.push_back({"1", "Chris", "4th Feb"});

    std::cout << "Before sort:\n\n";
    for (auto& s : Table)
        std::cout << s[0] << " | " << s[1] << " | " << s[2] << "\n";

    std::cout << "\n\nAfter sort:\n\n";
    // Sort the data using the first column of each `std::array` as the criteria
    std::sort(Table.begin(), Table.end(), [&](auto& a1, auto& a2)
              { return std::stoi(a1[0]) > std::stoi(a2[0]); });

    // Output the results:
    for (auto& s : Table)
        std::cout << s[0] << " | " << s[1] << " | " << s[2] << "\n";
}
Here is the final output:
Before sort:
20 | Jhon | 14th July
2 | Mary | 9th June
44 | Mark | 10th December
1 | Chris | 4th Feb
After sort:
44 | Mark | 10th December
20 | Jhon | 14th July
2 | Mary | 9th June
1 | Chris | 4th Feb
The output needs a little bit of formatting, but that's not important.
Remember, it does not matter where the data comes from, whether it is read from a file or hardcoded as the example above shows. However you populate the Table is up to you; the goal is to show how to sort the data once it is populated.
The first thing we did was create the Table and fill it in with the test data. Note that the vector has a push_back function to add entries to the vector.
Then the call to std::sort has a predicate function (the lambda), where the predicate is given two items, in this case it would be two std::array's by reference. Then the goal is to return if the first std::array (in this case, a1) should be placed before the second std::array (a2).
Note that we only care about the first column, so we only need to consider array[0] of each of those arrays, and compare them.
Also note that since array[0] is a std::string, we simply can't compare it lexicographically -- we need to convert the string to an int and compare the int value. That's the reason for the std::stoi call to convert to an integer.
The final thing about the sort predicate is that we want a descending sort. Thus the comparison operator to use is > instead of the "traditional" < (which would have sorted in an ascending manner).
Hopefully this explains what the code is doing.
Edit:
Since you are attempting to get this code to work in C++98, the easiest way to do that is to:
- change to std::vector<std::vector<std::string> > (note the space in "> >", required in C++98) instead of std::vector<std::array<std::string, 3>>,
- not use the brace-initialization that C++11 offers, and
- use a comparison function instead of a lambda.
Given that, here is the code for C++98:
#include <vector>
#include <iostream>
#include <algorithm>
#include <string>
#include <cstdlib>   // atoi

bool SortFirstColumn(const std::vector<std::string>& a1,
                     const std::vector<std::string>& a2)
{
    return atoi(a1[0].c_str()) > atoi(a2[0].c_str());
}

int main()
{
    std::vector<std::vector<std::string> > Table;  // "> >" for C++98

    // Set up the test data
    std::vector<std::string> vect(3);
    vect[0] = "20";
    vect[1] = "Jhon";
    vect[2] = "14th July";
    Table.push_back(vect);
    vect[0] = "2";
    vect[1] = "Mary";
    vect[2] = "9th June";
    Table.push_back(vect);
    vect[0] = "44";
    vect[1] = "Mark";
    vect[2] = "10th December";
    Table.push_back(vect);
    vect[0] = "1";
    vect[1] = "Chris";
    vect[2] = "4th Feb";
    Table.push_back(vect);

    std::cout << "Before sort:\n\n";
    for (size_t i = 0; i < Table.size(); ++i)
        std::cout << Table[i][0] << " | " << Table[i][1] << " | " << Table[i][2] << "\n";

    std::cout << "\n\nAfter sort:\n\n";
    // Sort the data using the first column of each row as the criterion
    std::sort(Table.begin(), Table.end(), SortFirstColumn);

    // Output the results:
    for (size_t i = 0; i < Table.size(); ++i)
        std::cout << Table[i][0] << " | " << Table[i][1] << " | " << Table[i][2] << "\n";
}

Why is (n += 2 * i * i) faster than (n+= i) in C++?

This C++11 program takes on average between 7.42s and 7.79s to run.
#include <iostream>
#include <chrono>
using namespace std;
using c = chrono::system_clock;
using s = chrono::duration<double>;
void func(){
    int n = 0;
    const auto before = c::now();
    for (int i = 0; i < 2000000000; i++){
        n += i;
    }
    const s duration = c::now() - before;
    cout << duration.count();
}
If I replace n += i with n += 2 * i * i, it takes between 5.80s and 5.96s. How come?
I ran each version of the program 20 times, alternating between the two. Here are the results:
n += i | n += 2 * i * i
---------+----------------
7.77047 | 5.87978
7.69226 | 5.83551
7.77375 | 5.84888
7.73748 | 5.84629
7.72988 | 5.84356
7.69736 | 5.83784
7.72597 | 5.84246
7.72722 | 5.81678
7.73291 | 5.81237
7.71871 | 5.81016
7.7478 | 5.80119
7.64906 | 5.80058
7.7253 | 5.9078
7.42734 | 5.96399
7.72573 | 5.84733
7.65591 | 5.81793
7.76619 | 5.83116
7.76963 | 5.84424
7.79928 | 5.87078
7.79274 | 5.84689
I compiled it with GCC 9.1.1 20190503 (Red Hat 9.1.1-1), with no optimization flags:
g++ -std=c++11
We know that the maximum integer is ~ 2 billion. So, when i ~ 32000, can we say that the compiler predicts that the calculation will overflow?
https://godbolt.org/z/B3zIsv
You'll notice that with -O2, the code used to calculate 'n' is removed completely. So the real questions should be:
Why are you profiling code without -O2?
Why are you profiling code that has no observable side effects? ('n' can be removed completely - e.g. printing the value of 'n' at the end would be more useful here)
Why are you not profiling code in a profiler?
The timing results you have result from a deeply flawed methodology.

Why is adding two std::vectors slower than raw arrays from new[]?

I'm looking at OpenMP, partially because my program needs to make additions of very large vectors (millions of elements). However, I see a quite large difference depending on whether I use std::vector or a raw array, which I cannot explain. I insist that the difference is only in the loop, not in the initialisation, of course.
The difference in time I refer to comes from timing only the addition, specifically not taking into account any initialization difference between vectors, arrays, etc. I'm really talking only about the sum part. The size of the vectors is not known at compile time.
I use g++ 5.x on Ubuntu 16.04.
edit: I tested what @Shadow said, and it got me thinking: is there something going on with optimization? If I compile with -O2, then, using raw arrays initialized, I get back for-loop scaling with the number of threads. But with -O3 or -funroll-loops, it is as if the compiler kicks in early and optimizes before the pragma is seen.
I came up with the following, simple test:
#include <iostream>
#include <vector>
#include <omp.h>

#define SIZE 10000000
#define TRIES 200

int main(){
    std::vector<double> a, b, c;
    a.resize(SIZE);
    b.resize(SIZE);
    c.resize(SIZE);
    double start = omp_get_wtime();
    unsigned long int i, t;
    #pragma omp parallel shared(a,b,c) private(i,t)
    {
        for (t = 0; t < TRIES; t++){
            #pragma omp for
            for (i = 0; i < SIZE; i++){
                c[i] = a[i] + b[i];
            }
        }
    }
    std::cout << "finished in " << omp_get_wtime() - start << std::endl;
    return 0;
}
I compile with
g++ -O3 -fopenmp -std=c++11 main.cpp
And get for one threads
>time ./a.out
finished in 2.5638
./a.out 2.58s user 0.04s system 99% cpu 2.619 total.
For two threads, the loop takes 1.2s, 1.23s total.
Now if I use raw arrays:
int main(){
    double *a, *b, *c;
    a = new double[SIZE];
    b = new double[SIZE];
    c = new double[SIZE];
    double start = omp_get_wtime();
    unsigned long int i, t;
    #pragma omp parallel shared(a,b,c) private(i,t)
    {
        for (t = 0; t < TRIES; t++)
        {
            #pragma omp for
            for (i = 0; i < SIZE; i++)
            {
                c[i] = a[i] + b[i];
            }
        }
    }
    std::cout << "finished in " << omp_get_wtime() - start << std::endl;
    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}
And i get (1 thread):
>time ./a.out
finished in 1.92901
./a.out 1.92s user 0.01s system 99% cpu 1.939 total
std::vector is 33% slower!
For two threads:
>time ./a.out
finished in 1.20061
./a.out 2.39s user 0.02s system 198% cpu 1.208 total
As a comparison, with Eigen or Armadillo for exactly the same operation (using c = a+b overload with vector object), I get for total real time ~2.8s. They are not multi-threaded for vector additions.
Now, I thought std::vector has almost no overhead? What is happening here? I'd like to use the nice standard library objects.
I cannot find any reference anywhere on a simple example like this.
Meaningful benchmarking is hard
The answer from Xirema has already outlined in detail the difference in the code: std::vector::resize initializes the data to zero, whereas new double[size] does not. Note that you can use new double[size]() to force initialization.
However, your measurement doesn't include initialization, and the number of repetitions is so high that the loop cost should outweigh the small initialization cost even in Xirema's example. So why do the very same instructions in the loop take more time when the data is initialized?
Minimal example
Let's dig to the core of this with code that dynamically determines whether memory is initialized or not (based on Xirema's, but only timing the loop itself).
#include <vector>
#include <chrono>
#include <iostream>
#include <memory>
#include <iomanip>
#include <cstring>
#include <string>
#include <sys/types.h>
#include <unistd.h>
constexpr size_t size = 10'000'000;
auto time_pointer(size_t reps, bool initialize, double init_value) {
    double * a = new double[size];
    double * b = new double[size];
    double * c = new double[size];
    if (initialize) {
        for (size_t i = 0; i < size; i++) {
            a[i] = b[i] = c[i] = init_value;
        }
    }
    auto start = std::chrono::steady_clock::now();
    for (size_t t = 0; t < reps; t++) {
        for (size_t i = 0; i < size; i++) {
            c[i] = a[i] + b[i];
        }
    }
    auto end = std::chrono::steady_clock::now();
    delete[] a;
    delete[] b;
    delete[] c;
    return end - start;
}

int main(int argc, char* argv[]) {
    bool initialize = (argc == 3);
    double init_value = 0;
    if (initialize) {
        init_value = std::stod(argv[2]);
    }
    auto reps = std::stoll(argv[1]);
    std::cout << "pid: " << getpid() << "\n";
    auto t = time_pointer(reps, initialize, init_value);
    std::cout << std::setw(12) << std::chrono::duration_cast<std::chrono::milliseconds>(t).count() << "ms" << std::endl;
    return 0;
}
Results are consistent:
./a.out 50 # no initialization
657ms
./a.out 50 0. # with initialization
1005ms
First glance at performance counters
Using the excellent Linux perf tool:
$ perf stat -e LLC-loads -e dTLB-misses ./a.out 50
pid: 12481
626ms
Performance counter stats for './a.out 50':
101.589.231 LLC-loads
105.415 dTLB-misses
0,629369979 seconds time elapsed
$ perf stat -e LLC-loads -e dTLB-misses ./a.out 50 0.
pid: 12499
1008ms
Performance counter stats for './a.out 50 0.':
145.218.903 LLC-loads
1.889.286 dTLB-misses
1,096923077 seconds time elapsed
Linear scaling with an increasing number of repetitions also tells us that the difference comes from within the loop. But why would initializing the memory cause more last-level cache loads and data TLB misses?
Memory is complex
To understand that, we need to understand how memory is allocated. Just because malloc / new returns some pointer to virtual memory doesn't mean that there is physical memory behind it. The virtual memory can be in a page that is not backed by physical memory - physical memory is only assigned on the first page fault. Now here is where page-types (from linux/tools/vm) - and the pid we show as output - comes in handy. Looking at the page statistics during a long execution of our little benchmark:
With initialization
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000804 1 0 __R________M______________________________ referenced,mmap
0x000000000004082c 392 1 __RU_l_____M______u_______________________ referenced,uptodate,lru,mmap,unevictable
0x000000000000086c 335 1 __RU_lA____M______________________________ referenced,uptodate,lru,active,mmap
0x0000000000401800 56721 221 ___________Ma_________t___________________ mmap,anonymous,thp
0x0000000000005868 1807 7 ___U_lA____Ma_b___________________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x0000000000405868 111 0 ___U_lA____Ma_b_______t___________________ uptodate,lru,active,mmap,anonymous,swapbacked,thp
0x000000000000586c 1 0 __RU_lA____Ma_b___________________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 59368 231
Most of the virtual memory is in a normal mmap,anonymous region - something that is mapped to a physical address.
Without initialization
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000001000000 1174 4 ________________________z_________________ zero_page
0x0000000001400000 37888 148 ______________________t_z_________________ thp,zero_page
0x0000000000000800 1 0 ___________M______________________________ mmap
0x000000000004082c 388 1 __RU_l_____M______u_______________________ referenced,uptodate,lru,mmap,unevictable
0x000000000000086c 347 1 __RU_lA____M______________________________ referenced,uptodate,lru,active,mmap
0x0000000000401800 18907 73 ___________Ma_________t___________________ mmap,anonymous,thp
0x0000000000005868 633 2 ___U_lA____Ma_b___________________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x0000000000405868 37 0 ___U_lA____Ma_b_______t___________________ uptodate,lru,active,mmap,anonymous,swapbacked,thp
0x000000000000586c 1 0 __RU_lA____Ma_b___________________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 59376 231
Now here, only 1/3 of the memory is backed by dedicated physical memory, and 2/3 are mapped to a zero page. The data behind a and b is all backed by a single read-only 4 KiB page filled with zeros. c (and a, b in the other test) have already been written to, so they have to have their own memory.
0 != 0
Now it may look weird: everything here is zero1 - why does it matter how it became zero? Whether you memset(0), write a[i] = 0., or use std::vector::resize - everything causes explicit writes to memory, hence a page fault if you do it on a zero page. I don't think you can/should prevent physical page allocation at that point. The only thing you could do for the memset / resize is to use calloc to explicitly request zero'd memory, which is probably backed by a zero_page, but I doubt it is done (or makes a lot of sense). Remember that for new double[size]; or malloc there is no guarantee about what kind of memory you get, but that includes the possibility of zero-memory.
1: Remember that the double 0.0 has all bits set to zero.
In the end the performance difference really comes only from the loop, but is caused by initialization. std::vector carries no overhead for the loop. In the benchmark code, raw arrays just benefit from optimization of an abnormal case of uninitialized data.
The observed behaviour is not OpenMP-specific and has to do with the way modern operating systems manage memory. Memory is virtual, meaning that each process has its own virtual address (VA) space and a special translation mechanism is used to map pages of that VA space to frames of physical memory. Consequently, memory allocation is performed in two stages:
- reservation of a region within the VA space - this is what operator new[] does when the allocation is big enough (smaller allocations are handled differently for reasons of efficiency);
- actually backing the region with physical memory upon access to some part of the region.
The process is split in two parts since in many cases applications do not really use at once all the memory they reserve and backing the entire reservation with physical memory might lead to waste (and unlike virtual memory, physical one is a very limited resource). Therefore, backing reservations with physical memory is performed on-demand the very first time the process writes to a region of the allocated memory space. The process is known as faulting the memory region since on most architectures it involves a soft page-fault that triggers the mapping within the OS kernel. Every time your code writes for the first time to a region of memory that is still not backed by physical memory, a soft page-fault is triggered and the OS tries to map a physical page. The process is slow as it involves finding a free page and modification on the process page table. The typical granularity of that process is 4 KiB unless some kind of large pages mechanism is in place, e.g., the Transparent Huge Pages mechanism on Linux.
What happens if you read for the first time from a page that has never been written to? Again, a soft page fault occurs, but instead of mapping a frame of physical memory, the Linux kernel maps a special "zero page". The page is mapped in CoW (copy-on-write) mode, which means that when you try to write it, the mapping to the zero page will be replaced by a mapping to a fresh frame of physical memory.
Now, take a look at the size of the arrays. Each of a, b, and c occupies 80 MB, which exceeds the cache size of most modern CPUs. One execution of the parallel loop thus has to bring 160 MB of data from the main memory and write back 80 MB. Because of how system cache works, writing to c actually reads it once, unless non-temporal (cache-bypassing) stores are used, therefore 240 MB of data is read and 80 MB of data gets written. Multiplied by 200 outer iterations, this gives 48 GB of data read and 16 GB of data written in total.
The above is not the case when a and b are not initialised, i.e. the case when a and b are simply allocated using operator new[]. Since reads in those case result in access to the zero page, and there is physically only one zero page that easily fits in the CPU cache, no real data has to be brought in from the main memory. Therefore, only 16 GB of data has to be read in and then written back. If non-temporal stores are used, no memory is read at all.
This could be easily proven using LIKWID (or any other tool able to read the CPU hardware counters):
std::vector<double> version:
$ likwid-perfctr -C 0 -g HA a.out
...
+-----------------------------------+------------+
| Metric | Core 0 |
+-----------------------------------+------------+
| Runtime (RDTSC) [s] | 4.4796 |
| Runtime unhalted [s] | 5.5242 |
| Clock [MHz] | 2850.7207 |
| CPI | 1.7292 |
| Memory read bandwidth [MBytes/s] | 10753.4669 |
| Memory read data volume [GBytes] | 48.1715 | <---
| Memory write bandwidth [MBytes/s] | 3633.8159 |
| Memory write data volume [GBytes] | 16.2781 |
| Memory bandwidth [MBytes/s] | 14387.2828 |
| Memory data volume [GBytes] | 64.4496 | <---
+-----------------------------------+------------+
Version with uninitialised arrays:
+-----------------------------------+------------+
| Metric | Core 0 |
+-----------------------------------+------------+
| Runtime (RDTSC) [s] | 2.8081 |
| Runtime unhalted [s] | 3.4226 |
| Clock [MHz] | 2797.2306 |
| CPI | 1.0753 |
| Memory read bandwidth [MBytes/s] | 5696.4294 |
| Memory read data volume [GBytes] | 15.9961 | <---
| Memory write bandwidth [MBytes/s] | 5703.4571 |
| Memory write data volume [GBytes] | 16.0158 |
| Memory bandwidth [MBytes/s] | 11399.8865 |
| Memory data volume [GBytes] | 32.0119 | <---
+-----------------------------------+------------+
Version with uninitialised array and non-temporal stores (using Intel's #pragma vector nontemporal):
+-----------------------------------+------------+
| Metric | Core 0 |
+-----------------------------------+------------+
| Runtime (RDTSC) [s] | 1.5889 |
| Runtime unhalted [s] | 1.7397 |
| Clock [MHz] | 2530.1640 |
| CPI | 0.5465 |
| Memory read bandwidth [MBytes/s] | 123.4196 |
| Memory read data volume [GBytes] | 0.1961 | <---
| Memory write bandwidth [MBytes/s] | 10331.2416 |
| Memory write data volume [GBytes] | 16.4152 |
| Memory bandwidth [MBytes/s] | 10454.6612 |
| Memory data volume [GBytes] | 16.6113 | <---
+-----------------------------------+------------+
The disassembly of the two versions provided in your question when using GCC 5.3 shows that the two loops are translated to exactly the same sequence of assembly instructions sans the different code address. The sole reason for the difference in the execution time is the memory access as explained above. Resizing the vectors initialises them with zeros, which results in a and b being backed up by their own physical memory pages. Not initialising a and b when operator new[] is used results in their backing by the zero page.
Edit: It took me so long to write this that in the mean time Zulan has written a way more technical explanation.
I have a good hypothesis.
I've written three versions of the code: one using raw double *, one using std::unique_ptr<double[]> objects, and one using std::vector<double>, and compared the runtimes of each of these versions of the code. For my purposes, I've used a single-threaded version of the code to try to simplify the case.
Total Code:
#include <vector>
#include <chrono>
#include <iostream>
#include <memory>
#include <iomanip>
constexpr size_t size = 10'000'000;
constexpr size_t reps = 50;
auto time_vector() {
auto start = std::chrono::steady_clock::now();
{
std::vector<double> a(size);
std::vector<double> b(size);
std::vector<double> c(size);
for (size_t t = 0; t < reps; t++) {
for (size_t i = 0; i < size; i++) {
c[i] = a[i] + b[i];
}
}
}
auto end = std::chrono::steady_clock::now();
return end - start;
}
auto time_pointer() {
auto start = std::chrono::steady_clock::now();
{
double * a = new double[size];
double * b = new double[size];
double * c = new double[size];
for (size_t t = 0; t < reps; t++) {
for (size_t i = 0; i < size; i++) {
c[i] = a[i] + b[i];
}
}
delete[] a;
delete[] b;
delete[] c;
}
auto end = std::chrono::steady_clock::now();
return end - start;
}
auto time_unique_ptr() {
auto start = std::chrono::steady_clock::now();
{
std::unique_ptr<double[]> a = std::make_unique<double[]>(size);
std::unique_ptr<double[]> b = std::make_unique<double[]>(size);
std::unique_ptr<double[]> c = std::make_unique<double[]>(size);
for (size_t t = 0; t < reps; t++) {
for (size_t i = 0; i < size; i++) {
c[i] = a[i] + b[i];
}
}
}
auto end = std::chrono::steady_clock::now();
return end - start;
}
int main() {
std::cout << "Vector took " << std::setw(12) << time_vector().count() << "ns" << std::endl;
std::cout << "Pointer took " << std::setw(12) << time_pointer().count() << "ns" << std::endl;
std::cout << "Unique Pointer took " << std::setw(12) << time_unique_ptr().count() << "ns" << std::endl;
return 0;
}
Test Results:
Vector took 1442575273ns //Note: the first one executed, regardless of
//which function it is, is always slower than expected. I'll talk about that later.
Pointer took 542265103ns
Unique Pointer took 1280087558ns
So both of the STL versions are demonstrably slower than the raw-pointer version. Why might this be?
Let's go to the Assembly! (compiled using Godbolt.com, using the snapshot version of GCC 8.x)
There are a few things we can observe to start with. For starters, the std::unique_ptr and std::vector versions generate virtually identical assembly code; the std::unique_ptr<double[]> version merely swaps operator new and operator delete for operator new[] and operator delete[]. Since their runtimes are within margin of error, we'll focus on the std::unique_ptr<double[]> version and compare that to double *.
Starting with .L5 and .L22, the code seems to be identical. The only major differences are some extra pointer arithmetic before the delete[] calls in the double * version, and some extra stack cleanup code at the end in .L34 (the std::unique_ptr<double[]> version), which doesn't exist for the double * version. Neither of these seems likely to have a strong impact on the code's speed, so we're going to ignore them for now.
The code that's identical appears to be the code directly responsible for the loop. You'll notice that the code which is different (which I'll get to momentarily) doesn't contain any jump statements, which are integral to loops.
So all of the major differences appear to be specific to the initial allocation of the objects in question. This is between time_unique_ptr(): and .L32 for the std::unique_ptr<double[]> version, and between time_pointer(): and .L22 for the double * version.
So what's the difference? Well, they're almost doing the same thing. Except for a few lines of code that show up in the std::unique_ptr<double[]> version that don't show up in the double * version:
std::unique_ptr<double[]>:
mov edi, 80000000
mov r12, rax
call operator new[](unsigned long)
mov edx, 80000000
mov rdi, rax
xor esi, esi //Sets register to 0, which is probably used in...
mov rbx, rax
call memset //!!!
mov edi, 80000000
call operator new[](unsigned long)
mov rdi, rax
mov edx, 80000000
xor esi, esi //Sets register to 0, which is probably used in...
mov rbp, rax
call memset //!!!
mov edi, 80000000
call operator new[](unsigned long)
mov r14, rbx
xor esi, esi //Sets register to 0, which is probably used in...
mov rdi, rax
shr r14, 3
mov edx, 80000000
mov r13d, 10000000
and r14d, 1
call memset //!!!
double *:
mov edi, 80000000
mov rbp, rax
call operator new[](unsigned long)
mov rbx, rax
mov edi, 80000000
mov r14, rbx
shr r14, 3
call operator new[](unsigned long)
and r14d, 1
mov edi, 80000000
mov r12, rax
sub r13, r14
call operator new[](unsigned long)
Well, would you look at that! Some unexpected calls to memset that aren't part of the double * code! It's quite clear that std::vector<T> and std::make_unique<T[]> are contracted to value-initialize the memory they allocate, whereas a plain new double[size] has no such contract.
So this is basically a very, very roundabout way of verifying what Shadow observed: when you make no attempt to zero-fill the arrays, the compiler will
Do nothing for double * (saving precious CPU cycles), and
Do the initialization without prompting for std::vector<double> and std::unique_ptr<double[]> (costing time initializing everything).
But when you do add the zero-fill yourself, the compiler recognizes that it's about to repeat itself: it optimizes out the explicit zero-fill for std::vector<double> and std::unique_ptr<double[]> (so their code doesn't change) and emits it for the double * version, making it the same as the other two. You can confirm this by comparing the new version of the assembly after making the following change to the double * version:
double * a = new double[size];
for(size_t i = 0; i < size; i++) a[i] = 0;
double * b = new double[size];
for(size_t i = 0; i < size; i++) b[i] = 0;
double * c = new double[size];
for(size_t i = 0; i < size; i++) c[i] = 0;
And sure enough, the assembly now has those loops optimized into memset calls, the same as the std::unique_ptr<double[]> version! And the runtime is now comparable.
(Note: the pointer version is now slower than the other two! I observed that the first function called, regardless of which one it is, is always about 200-400 ms slower than the rest. I'm blaming branch prediction. Either way, the speed should now be identical in all three code paths.)
So that's the lesson: std::vector and std::unique_ptr make your code slightly safer by preventing the undefined behavior you were invoking in the raw-pointer version. The consequence is that they also make your code slower.
I tested it and found out the following: The vector case had a runtime about 1.8 times longer than the raw array case. But this was only the case when I did not initialize the raw array. After adding a simple loop before the time measurement to initialize all entries with 0.0 the raw array case took as long as the vector case.
I took a closer look and did the following:
I did not initialize the raw arrays like
for (size_t i{0}; i < SIZE; ++i)
a[i] = 0.0;
but did it this way:
for (size_t i{0}; i < SIZE; ++i)
if (a[i] != 0.0)
{
std::cout << "a was set at position " << i << std::endl;
a[i] = 0.0;
}
(the other arrays accordingly).
The result was that I got no console output from initializing the arrays, and it was again as fast as without any initialization at all, that is, about 1.8 times faster than with the vectors.
When I initialized, for example, only a the normal way and the other two arrays with the if clause, I measured a runtime between the vector runtime and the runtime with all three arrays "fake-initialized" via the if clause.
Well... that's strange...
Now, I thought std::vector had almost no overhead? What is happening here? I'd like to use the nice STL objects...
Although I cannot explain this behavior to you, I can tell you that there is not really any overhead for std::vector if you use it normally. This is just a very artificial case.
EDIT:
As qPCR4vir and the OP Napseis pointed out, this might have to do with optimization. As soon as I turned on optimization, the "real init" case was slower by the already mentioned factor of about 1.8. Without optimization it was still about 1.1 times slower.
So I looked at the assembly, but I did not see any difference in the for loops...
The major thing to notice here is the fact that
The array version has undefined behavior
[dcl.init]/12 states:
If an indeterminate value is produced by an evaluation, the behavior is undefined
And this is exactly what happens in that line:
c[i] = a[i] + b[i];
Both a[i] and b[i] have indeterminate values, since the arrays are default-initialized.
The UB perfectly explains the measuring results (whatever they are).
UPD: In the light of #HristoIliev and #Zulan answers I'd like to emphasize language POV once more.
For the compiler, the UB of reading uninitialized memory essentially means that it can always assume the memory is initialized, so whatever the OS does is fine with C++, even if the OS has some specific behavior for that case.
Well, it turns out that it does: your code is not actually reading distinct physical memory, and your measurements reflect that.
One could say that the resulting program does not compute the sum of two arrays - it computes the sum of two more easily accessible mocks, and it is fine with C++ exactly because of the UB. If it did something else, it would still be perfectly fine.
So in the end you have two programs: one adds up two vectors, and the other just does something undefined (from the C++ point of view) or something unrelated (from the OS point of view). What is the point of measuring their timings and comparing the results?
Fixing the UB solves the whole problem, but more importantly it validates your measurements and allows you to meaningfully compare the results.
In this case, I think the culprit is -funroll-loops, from what I just tested at -O2 with and without this option.
https://gcc.gnu.org/onlinedocs/gcc-5.4.0/gcc/Optimize-Options.html#Optimize-Options
-funroll-loops: Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop. It also turns on complete loop peeling (i.e. complete removal of loops with small constant number of iterations). This option makes code larger, and may or may not make it run faster.

Running std::normal_distribution with user-defined random generator

I am about to generate an array of normally distributed pseudo-random numbers. As I know the std library offers the following code for that:
std::random_device rd;
std::mt19937 gen(rd());
std::normal_distribution<> d(mean,std);
...
double number = d(gen);
The problem is that I want to use a Sobol' quasi-random sequence instead of the Mersenne Twister pseudo-random generator. So, my question is:
Is it possible to run the std::normal_distribution with a user-defined random generator (with a Sobol' quasi-random sequence generator in my case)?
More details: I have a class called RandomGenerators, which is used to generate a Sobol' quasi-random numbers:
RandomGenerator randgen;
double number = randgen.sobol(0,1);
Yes, it is possible. Just make it comply with the requirements of a uniform random number generator (§26.5.1.3 paragraphs 2 and 3):
2 A class G satisfies the requirements of a uniform random number
generator if the expressions shown in Table 116 are valid and have the
indicated semantics, and if G also satisfies all other requirements
of this section. In that Table and throughout this section:
a) T is the type named by G's associated `result_type`, and
b) g is a value of G.
Table 116 — Uniform random number generator requirements
Expression | Return type | Pre/post-condition | Complexity
----------------------------------------------------------------------
G::result_type | T | T is an unsigned integer | compile-time
| | type (§3.9.1). |
----------------------------------------------------------------------
g() | T | Returns a value in the | amortized constant
| | closed interval |
| | [G::min(), G::max()]. |
----------------------------------------------------------------------
G::min() | T | Denotes the least value | compile-time
| | potentially returned by |
| | operator(). |
----------------------------------------------------------------------
G::max() | T | Denotes the greatest value | compile-time
| | potentially returned by |
| | operator(). |
3 The following relation shall hold: G::min() < G::max().
A word of caution here - I came across a big gotcha when I implemented this. It seems that if the return types of max()/min()/operator() are not 64-bit, then the distribution will resample. My (unsigned) 32-bit Sobol implementation was getting sampled twice per deviate, thus destroying the properties of the numbers. This code reproduces the issue:
#include <random>
#include <limits>
#include <iostream>
#include <cstdint>
typedef uint32_t rng_int_t;
int requested = 0;
int sampled = 0;
struct Quasi
{
rng_int_t operator()()
{
++sampled;
return 0;
}
rng_int_t min() const
{
return 0;
}
rng_int_t max() const
{
return std::numeric_limits<rng_int_t>::max();
}
};
int main()
{
std::uniform_real_distribution<double> dist(0.0,1.0);
Quasi q;
double total = 0.0;
for (size_t i = 0; i < 10; ++i)
{
dist(q);
++requested;
}
std::cout << "requested: " << requested << std::endl;
std::cout << "sampled: " << sampled << std::endl;
}
Output (using g++ 5.4):
requested: 10
sampled: 20
and even when compiled with -m32. If you change rng_int_t to a 64-bit type, the problem goes away. My workaround is to stick the 32-bit value into the most significant bits of the return value, e.g.
return uint64_t(val) << 32;
You can now generate Sobol sequences directly with Boost. See boost/random/sobol.hpp.