parallel push_back for vector of vector - c++

I have a large text file, around 13 GB, whose content is the edge list of a graph. Each line has two integers u and v representing the endpoints of an edge. I want to read it into a vector of vectors as the adjacency list of the graph.
That leads to the following code.
#include <cstdio>
#include <vector>
using namespace std;

const int N = 3328557;
vector<vector<int> > adj(N);

int main() {
    FILE *pFile = fopen("path/to/edge/list", "r");
    int u, v;
    while (fscanf(pFile, "%d%d", &u, &v) == 2) {
        adj[u].push_back(v);
        adj[v].push_back(u);
    }
    fclose(pFile);
}
It takes about 7 minutes. After some profiling, I found that adj[u].push_back(v) and adj[v].push_back(u) consume most of the time because of the random access pattern.
Then I used a two-dimensional array as a cache. Once a node's row is filled, I copy all its values to the vector and clear the row.
#include <cstdio>
#include <vector>
using namespace std;

const int N = 3328557;
const int threshold = 100;
vector<vector<int> > adj(N);
int ln[N];
int cache[N][threshold];

void write2vec(int node) {
    for (int i = 0; i < ln[node]; i++)
        adj[node].push_back(cache[node][i]);
    ln[node] = 0;
}

int main() {
    FILE *pFile = fopen("path/to/edge/list", "r");
    int u, v;
    while (fscanf(pFile, "%d%d", &u, &v) == 2) {
        cache[u][ln[u]++] = v;
        if (ln[u] == threshold)
            write2vec(u);
        cache[v][ln[v]++] = u;
        if (ln[v] == threshold)
            write2vec(v);
    }
    // flush the remaining cached values; start at 0 in case node IDs are 0-based
    for (int i = 0; i < N; i++)
        write2vec(i);
    fclose(pFile);
}
This time it takes 5.5 minutes, which is still too long. Then I thought the two push_back calls in the first version could be parallelized, but I don't know how to do it. Does anyone have another idea?
Thanks.
Edit:
I think the reason my second approach is faster is that addressing into a vector of vectors is slow. The storage of a vector of vectors is not contiguous, so accessing adj[u] needs two memory operations: first loading adj's internal pointer, then indexing to adj[u].
So I want to know whether I can use multiprocessing to make the addressing parallel.

"I think the two push_back in the first code can be parallelized."
It's likely that your CPU will agree. Given the data size, this is likely to be bottlenecked between the L3 cache and main memory. Modern CPU cores are capable of out-of-order execution, and here the CPU will happily start on instructions belonging to the second push_back while the first one is waiting for main memory. That's exactly why out-of-order execution is such a common feature.
The chief problem is reallocation: you didn't reserve capacity, and reallocation is not a simple CPU operation since it requires access to a global heap. I would suggest reserving 128/sizeof(int) elements per inner vector. That's one or two cache lines on common CPUs, so you don't have vectors sharing cache lines.
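A minimal sketch of that suggestion applied to the question's first version (same N and file path as above; the reserve amount assumes 4-byte int and 64-byte cache lines):

#include <cstdio>
#include <vector>

const int N = 3328557;
std::vector<std::vector<int>> adj(N);

int main() {
    // Pre-allocate 128 / sizeof(int) == 32 elements, i.e. two 64-byte cache
    // lines per inner vector, so early push_backs don't trigger reallocation.
    for (auto &nbrs : adj)
        nbrs.reserve(128 / sizeof(int));

    FILE *pFile = fopen("path/to/edge/list", "r");
    int u, v;
    while (fscanf(pFile, "%d%d", &u, &v) == 2) {
        adj[u].push_back(v);
        adj[v].push_back(u);
    }
    fclose(pFile);
}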

Related

Fast copying contiguous array of arrays

I am trying to copy from an array of arrays to another one, while leaving a space between arrays in the target.
They are both contiguous. Each vector's size is between 5000 and 52000 floats, output_jump is the vector size times eight, and vector_count varies in my tests.
I did the best I could with what I learned here https://stackoverflow.com/a/34450588/1238848 and here https://stackoverflow.com/a/16658555/1238848, but it still seems slow.
#include <cstring>

void copyToTarget(const float *input, float *output, int vector_count, int vector_size, int output_jump)
{
    int left_to_do;
    constexpr int block = 2048;
    constexpr int blockInBytes = block * sizeof(float);
    float temp[block];
    for (int i = 0; i < vector_count; ++i)
    {
        left_to_do = vector_size;
        while (left_to_do > block)
        {
            memcpy(temp, input, blockInBytes);
            memcpy(output, temp, blockInBytes);
            left_to_do -= block;
            input += block;
            output += block;
        }
        if (left_to_do)
        {
            memcpy(temp, input, left_to_do * sizeof(float));
            memcpy(output, temp, left_to_do * sizeof(float));
            input += left_to_do;
            output += left_to_do;
        }
        output += output_jump;
    }
}
I'm skeptical of the answer you linked, which encourages avoiding a function call to memcpy. The implementation of memcpy is very well optimized, probably hand-written in assembly, and therefore hard to beat. Moreover, for large copies the function call overhead is negligible compared to memory access latency. So simply calling memcpy is likely the fastest way to copy contiguous bytes around in memory.
If output_jump were zero, a single memcpy call could copy input directly to output (and that would be hard to beat). For nonzero output_jump, the copy needs to be divided over the contiguous vectors: use one memcpy per vector, without the temp buffer, copying directly from input + i * vector_size to output + i * (vector_size + output_jump).
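A minimal sketch of that simplification, keeping the question's signature (and assuming, as there, that output has room for the jumps):

#include <cstring>

void copyToTarget(const float *input, float *output,
                  int vector_count, int vector_size, int output_jump)
{
    for (int i = 0; i < vector_count; ++i)
    {
        // One memcpy per contiguous vector, no intermediate buffer:
        // read at offset i * vector_size, write at offset i * (vector_size + output_jump).
        std::memcpy(output + (size_t)i * (vector_size + output_jump),
                    input + (size_t)i * vector_size,
                    (size_t)vector_size * sizeof(float));
    }
}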
But better yet, like the top answer on that thread suggests, try if possible to find a way to avoid copying data in the first place.

How to get the memory used by a multidimensional vector

I am currently writing some code to create a neural network, and I am trying to make it as optimised as possible. I want to be able to get the amount of memory consumed by an object of type Network, since memory usage is very important in order to avoid cache misses. I tried using sizeof(), but this does not work since, I assume, vectors store their values on the heap, so sizeof() just tells me the size of the pointers. Here is my code so far.
#include <iostream>
#include <vector>
#include <random>
#include <cstdlib>  // std::rand, RAND_MAX
#include <chrono>
class Timer
{
private:
    std::chrono::time_point<std::chrono::high_resolution_clock> start_time;
public:
    Timer(bool auto_start = true)
    {
        if (auto_start)
        {
            start();
        }
    }
    void start()
    {
        start_time = std::chrono::high_resolution_clock::now();
    }
    float get_duration()
    {
        std::chrono::duration<float> duration = std::chrono::high_resolution_clock::now() - start_time;
        return duration.count();
    }
};

class Network
{
public:
    std::vector<std::vector<std::vector<float>>> weights;
    std::vector<std::vector<std::vector<float>>> deriv_weights;
    std::vector<std::vector<float>> biases;
    std::vector<std::vector<float>> deriv_biases;
    std::vector<std::vector<float>> activations;
    std::vector<std::vector<float>> deriv_activations;
};

Network create_network(std::vector<int> layers)
{
    Network network;
    network.weights.reserve(layers.size() - 1);
    int nodes_in_prev_layer = layers[0];
    for (unsigned int i = 0; i < layers.size() - 1; ++i)
    {
        int nodes_in_layer = layers[i + 1];
        network.weights.push_back(std::vector<std::vector<float>>());
        network.weights[i].reserve(nodes_in_layer);
        for (int j = 0; j < nodes_in_layer; ++j)
        {
            network.weights[i].push_back(std::vector<float>());
            network.weights[i][j].reserve(nodes_in_prev_layer);
            for (int k = 0; k < nodes_in_prev_layer; ++k)
            {
                float input_weight = float(std::rand()) / RAND_MAX;
                network.weights[i][j].push_back(input_weight);
            }
        }
        nodes_in_prev_layer = nodes_in_layer;
    }
    return network;
}

int main()
{
    Timer timer;
    Network network = create_network({784, 800, 16, 10});
    std::cout << timer.get_duration() << std::endl;
    std::cout << sizeof(network) << std::endl;
    std::cin.get();
}
I've recently updated our production neural network code to AVX-512, so this is real-world production experience. A key part of our optimizations is that each matrix is not a std::vector, but a 1D AVX-aligned array. Even without AVX alignment, we see a huge benefit in moving to a one-dimensional array backing each matrix: memory access becomes fully sequential, which is much faster, and the size is then simply (rows*cols)*sizeof(float).
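As a rough sketch of that layout (the struct name and members here are illustrative, not our production code):

#include <cstddef>
#include <vector>

// One contiguous, row-major buffer per matrix instead of nested vectors.
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data; // rows * cols floats

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}

    float &at(std::size_t r, std::size_t c) { return data[r * cols + c]; }

    std::size_t bytes() const { return data.size() * sizeof(float); }
};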
We store the bias as the first full row. Commonly that's implemented by prefixing the input with a 1.0 element, but for our AVX code we use the bias as the starting value for the FMA (Fused Multiply-Add) operations, i.e. in pseudo-code: result = bias; for (input : inputs) result += input * weight. This keeps the input AVX-aligned as well.
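In scalar form the idea looks like the sketch below; the real code uses AVX FMA intrinsics, so this only shows the shape of the loop:

#include <cstddef>

// Start the accumulation from the bias instead of prefixing a 1.0 input.
float dot_with_bias(const float *weights, const float *inputs,
                    std::size_t n, float bias)
{
    float result = bias;
    for (std::size_t i = 0; i < n; ++i)
        result += inputs[i] * weights[i]; // one multiply-add per element
    return result;
}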
Since each matrix is used in turn, you can safely have a std::vector<Matrix> layers.
To quote from https://stackoverflow.com/a/17254518/7588455:
Vector stores its elements in an internally-allocated memory array. You can do this:
sizeof(std::vector<int>) + (sizeof(int) * MyVector.size())
This will give you the size of the vector structure itself plus the size of all the ints in it, but it may not include whatever small overhead your memory allocator may impose. I'm not sure there's a platform-independent way to include that.
In your case only the internally-allocated memory array really matters, since that is what you're actually accessing. Also be aware of how you're accessing the memory.
In order to write cache-friendly code I highly recommend reading through this SO post: https://stackoverflow.com/a/16699282/7588455
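Applying that formula recursively to the question's nested vectors could look like the following sketch (the helper name is made up, and allocator overhead is still not included):

#include <cstddef>
#include <vector>

// Base case: a vector of a flat element type.
template <typename T>
std::size_t memory_of(const std::vector<T> &v) {
    return sizeof(v) + v.capacity() * sizeof(T);
}

// Recursive case: a vector of vectors. This overload is more specialized,
// so it wins overload resolution for nested vectors.
template <typename T>
std::size_t memory_of(const std::vector<std::vector<T>> &v) {
    std::size_t total = sizeof(v) + v.capacity() * sizeof(std::vector<T>);
    for (const auto &inner : v)
        total += memory_of(inner) - sizeof(inner); // inner header counted above
    return total;
}

// Usage: memory_of(network.weights) + memory_of(network.biases) + ...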

Fastest way to create a vector of indices from distance matrix in C++

I have a distance matrix D of size n by n and a constant L as input. I need to create a vector v containing all entries in D whose value is at most L. Here v must be in a specific order v = [v1 v2 .. vn], where vi contains the entries of the ith row of D with value at most L. The order of entries within each vi is not important.
I wonder whether there is a fast way to create v using a vector, an array, or any other data structure plus parallelization. What I did is use for loops, and it is very slow for large n.
vector<int> v;
for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
        if (D(i,j) <= L) v.push_back(j);
    }
}
The best way depends heavily on the context. If you are looking for GPU parallelization you should take a look at OpenCL.
For CPU-based parallelization the standard C++ <thread> library is probably your best bet, but you need to be careful:
Threads take time to create, so if n is relatively small (<1000 or so) they will slow you down.
D(i,j) has to be readable by multiple threads at the same time.
v has to be writable by multiple threads; a standard vector alone won't cut it.
v may be a 2D vector with the vi as its subvectors, but these have to be initialized before the parallelization:
std::vector<std::vector<int>> v;
v.reserve(n);
for (size_t i = 0; i < n; i++)
{
    v.push_back(std::vector<int>());
}
You need to decide how many threads to use. If this is for one machine only, hardcoding is a valid option. There is a function in the thread library that returns the number of supported concurrent threads, but it is more of a hint than a trustworthy value.
size_t threadAmount = std::thread::hardware_concurrency(); // how many threads should run; this is a hint, not necessarily optimal
std::vector<std::thread> t;  // to store the threads in
t.reserve(threadAmount - 1); // threadAmount-1 extra threads (we already have the main thread)
To start a thread you need a function for it to execute; in this case, one that reads through part of your matrix.
void CheckPart(size_t start, size_t amount, int L, std::vector<std::vector<int>>& vec)
{
    // n and D(i,j) are assumed accessible here, as in the question.
    for (size_t i = start; i < amount + start; i++)
    {
        for (size_t j = 0; j < n; j++)
        {
            if (D(i,j) <= L)
            {
                vec[i].push_back(j);
            }
        }
    }
}
Now you need to split your matrix into parts of about n/threadAmount rows each and start the threads. The thread constructor needs a function and its parameters, but it will always try to copy the parameters, even if the function expects a reference. To prevent this, you need to force a reference with std::ref():
int i = 0;
int rows;
for (size_t a = 0; a < threadAmount - 1; a++)
{
    rows = n / threadAmount + ((n % threadAmount > a) ? 1 : 0);
    t.push_back(std::thread(CheckPart, i, rows, L, std::ref(v)));
    i += rows;
}
The threads are now running, and all that is left is to run the final block on the main thread (the remaining rows are n - i; the function is the CheckPart defined above):
CheckPart(i, n - i, L, v);
After that you need to wait for the threads to finish and clean them up:
for (unsigned int a = 0; a < threadAmount - 1; a++)
{
    if (t[a].joinable())
    {
        t[a].join();
    }
}
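Put together, a minimal end-to-end sketch of the scheme might look like this (the wrapper function name is made up; n, L and D(i,j) stand in for your data and are assumed accessible, as in the snippets above):

#include <functional>
#include <thread>
#include <vector>

extern int n;                  // matrix dimension, set elsewhere
extern int L;                  // threshold, set elsewhere
int D(size_t i, size_t j);     // read-only accessor for the distance matrix

void CheckPart(size_t start, size_t amount, int L,
               std::vector<std::vector<int>> &vec); // as defined above

std::vector<std::vector<int>> buildIndexVector()
{
    std::vector<std::vector<int>> v(n); // one pre-initialized subvector per row

    size_t threadAmount = std::thread::hardware_concurrency();
    if (threadAmount == 0) threadAmount = 4; // hardware_concurrency() may return 0

    std::vector<std::thread> t;
    t.reserve(threadAmount - 1);

    size_t i = 0;
    for (size_t a = 0; a < threadAmount - 1; a++)
    {
        size_t rows = n / threadAmount + ((n % threadAmount > a) ? 1 : 0);
        t.emplace_back(CheckPart, i, rows, L, std::ref(v));
        i += rows;
    }
    CheckPart(i, n - i, L, v); // the main thread handles the final block

    for (auto &th : t)
        if (th.joinable())
            th.join();
    return v;
}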
Please note that this is just a quick and dirty example. Different problems might need different implementations, and since I can't guess the context, the help I can give is rather limited.
In consideration of the comments, I made the appropriate corrections (in emphasis).
Have you looked for tips on writing high-performance code, threading, asm instructions (in case the generated assembly is not what you want) and OpenCL for parallel processing? If not, I strongly recommend it!
In some cases, declaring all for-loop variables outside the loops (to avoid declaring them many times) will make things faster, but not in this case (comment from our friend Paddy).
Also, using new instead of vector can be faster, as we see here: Using arrays or std::vectors in C++, what's the performance gap? - I tested it, and with vector it took 6 seconds versus only 1 second with new. I guess the safety and ease-of-management guarantees that come with std::vector are not desired when someone is chasing performance, especially since using new is not that difficult: just avoid heap overflow in your index calculations and remember to use delete[].
user4581301 is correct here, and the following statement is untrue: "Finally, if you build D in an array instead of a matrix (or maybe if you copy D into a constant array), it will be much more cache-friendly and will save one for-loop statement."

Why are C++ STL vectors 1000x slower when doing many reserves?

I've run into a strange situation.
In my program I have a loop that combines a bunch of data together in a giant vector. I was trying to figure out why it was running so slowly, even though it seemed like I was doing everything right to allocate memory efficiently as I went.
In my program it is difficult to determine how big the final vector of combined data should be, but the size of each piece of data is known as it is processed. So instead of reserving and resizing the combined-data vector in one go, I reserved enough space for each data chunk as it was added to the larger vector. That's when I ran into this issue, which is reproducible with the simple snippet below:
std::vector<float> arr1;
std::vector<float> arr2;
std::vector<float> arr3;
std::vector<float> arr4;
int numLoops = 10000;
int numSubloops = 50;

{
    // Test 1
    // Naive test where no pre-allocation occurs
    for (int q = 0; q < numLoops; q++)
    {
        for (int g = 0; g < numSubloops; g++)
        {
            arr1.push_back(q * g);
        }
    }
}

{
    // Test 2
    // Ideal situation where total amount of data is reserved beforehand
    arr2.reserve(numLoops * numSubloops);
    for (int q = 0; q < numLoops; q++)
    {
        for (int g = 0; g < numSubloops; g++)
        {
            arr2.push_back(q * g);
        }
    }
}

{
    // Test 3
    // Total data is not known beforehand, so allocations are made for each
    // data chunk as it is processed, using the 'resize' method
    int arrInx = 0;
    for (int q = 0; q < numLoops; q++)
    {
        arr3.resize(arr3.size() + numSubloops);
        for (int g = 0; g < numSubloops; g++)
        {
            arr3[arrInx++] = q * g;
        }
    }
}

{
    // Test 4
    // Total data is not known beforehand, so allocations are made for each
    // data chunk as it is processed, using the 'reserve' method
    for (int q = 0; q < numLoops; q++)
    {
        arr4.reserve(arr4.size() + numSubloops);
        for (int g = 0; g < numSubloops; g++)
        {
            arr4.push_back(q * g);
        }
    }
}
The results of this test, after compilation in Visual Studio 2017, are as follows:
Test 1: 7 ms
Test 2: 3 ms
Test 3: 4 ms
Test 4: 4000 ms
Why is there such a huge discrepancy in running times?
Why does calling reserve a bunch of times, followed by push_back, take 1000x longer than calling resize a bunch of times, followed by direct index access?
How does it make any sense that it could take 500x longer than the naive approach, which includes no pre-allocations at all?

"How does it make any sense that it could take 500x longer than the naive approach which includes no pre-allocations at all?"
That's where you're mistaken. The 'naive' approach you speak of does do pre-allocations. They're just done behind the scenes, and infrequently, in the call to push_back. It doesn't allocate room for just one more element every time you call push_back; it allocates some amount that is a factor (usually between 1.5x and 2x) of the current capacity, and then it doesn't need to allocate again until that capacity runs out. This is much more efficient than your loop, which performs an allocation every time 50 elements are added, with no regard for the current capacity.
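A tiny sketch to observe that growth policy yourself (the exact sequence is implementation-defined, so the output varies by standard library):

#include <iostream>
#include <vector>

int main() {
    std::vector<float> v;
    auto cap = v.capacity();
    for (int i = 0; i < 100000; ++i) {
        v.push_back(i);
        if (v.capacity() != cap) { // a reallocation just happened
            cap = v.capacity();
            std::cout << "size " << v.size() << " -> capacity " << cap << '\n';
        }
    }
}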
@Benjamin Lindley's answer explains the capacity behavior of std::vector. However, exactly why the 4th test case is that slow is in fact an implementation detail of the standard library.
[vector.capacity]
void reserve(size_type n);
...
Effects: A directive that informs a vector of a planned change in size, so that it can manage the storage allocation accordingly. After reserve(), capacity() is greater or equal to the argument of reserve if reallocation happens; and equal to the previous value of capacity() otherwise. Reallocation happens at this point if and only if the current capacity is less than the argument of reserve().
Thus the C++ standard does not guarantee that after a reserve() for a larger capacity, the actual capacity is exactly the requested one; only that it is at least that large. Personally I think it would not be unreasonable for an implementation to apply some growth policy when such a request is received. However, I also tested on my machine, and it seems the STL just does the simplest thing: it allocates exactly the requested amount, so every reserve(size() + 50) in Test 4 triggers a fresh reallocation and a full copy of the elements, which makes the loop quadratic.
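A quick probe of that detail: if reserve() allocates exactly what is requested, the capacity climbs by exactly 50 per iteration (meaning every call reallocates), whereas a geometric policy would show larger jumps:

#include <iostream>
#include <vector>

int main() {
    std::vector<float> v;
    for (int q = 0; q < 5; ++q) {
        v.reserve(v.size() + 50);
        std::cout << "capacity after reserve: " << v.capacity() << '\n';
        for (int g = 0; g < 50; ++g)
            v.push_back(q * g);
    }
}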

c++ read file and store integers in vectors. ends up taking around 5 times more resident memory than actual file size

I need to read in a few input files (each contains a 2D matrix of integers) and store them in a vector of 2D vectors. Below is the code I wrote:
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main(int argc, char *argv[]) {
    /*
    int my_rank;
    int p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    */
    std::vector<std::vector<std::vector<int> > > matrices;
    matrices.reserve(argc - 1);
    for (int i = 1; i < argc; ++i) {
        std::string line;
        std::ifstream fp(argv[i]);
        std::vector<std::vector<int> > matrix;
        if (fp.is_open()) {
            while (getline(fp, line)) {
                if (line != "") {
                    // add a new row to the matrix
                    std::vector<int> newRow;
                    // parse the row via a string buffer
                    std::stringstream buff(line);
                    // buffValue is each number in a row
                    int buffValue;
                    while (buff >> buffValue) {
                        newRow.push_back(buffValue);
                    }
                    matrix.push_back(newRow);
                }
            }
        }
        else {
            std::cout << "Failed to read files" << std::endl;
        }
        matrices.push_back(matrix);
    }
    //MPI_Finalize();
    return 0;
}
I have two questions here:
When I read in one single file of 175 MB, the program ends up taking 900 MB of resident memory. This is a problem because I usually need to read in 4 files with a few hundred MB each, which will eventually take multiple GB of memory. Is this because of the way I read/store the integers?
If I uncomment the lines involving MPI, the resident memory usage goes up to 1.7 GB. Is this normal, or am I doing something wrong here? I'm using MPICH.
Vector-of-vector-of-vector is not an efficient structure to use. You pay the memory overhead of the vector objects themselves, plus the standard growth behaviour of push_back.
A vector grows its memory exponentially when it needs to resize after push_back, in order to meet its time-complexity requirements. If your vector's capacity is currently 10 values and you add an 11th, it will most likely resize its capacity to 20 values.
A side effect of this growth is potential memory fragmentation. Vector memory is defined to be contiguous, and the standard allocators do not have a realloc ability as in C. So they must allocate more memory elsewhere, move the data, and free the old storage. This can leave holes in memory that your program can't use for anything else, not to mention reduce the cache locality of your data, leading to poor performance.
You would be better off creating a more memory-efficient 2D structure for your matrices, and then pushing them onto a deque instead of a vector. Here's one I prepared earlier ;). At the very least, if you must use vector-of-vector for the matrix, pre-allocate it using vector::reserve.
If memory is more important to you than I/O, then it's not out of the question to read the file twice: the first time around, you obtain information about matrix sizes and row lengths; then you pre-allocate all your structures and read the file again.
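Here is a minimal sketch of that two-pass idea, assuming your file format (blank-line-separated matrices of whitespace-separated ints; the helper name is made up):

#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Pass 1: record how many values each row of each matrix holds.
std::vector<std::vector<std::size_t>> scanShapes(const char *path)
{
    std::vector<std::vector<std::size_t>> shapes(1);
    std::ifstream fp(path);
    std::string line;
    while (std::getline(fp, line)) {
        if (line.empty()) {
            if (!shapes.back().empty()) shapes.emplace_back(); // next matrix
            continue;
        }
        std::istringstream iss(line);
        std::size_t count = 0;
        for (int val; iss >> val; ) ++count;
        shapes.back().push_back(count);
    }
    if (shapes.back().empty()) shapes.pop_back();
    return shapes;
}
// Pass 2 would then size every vector up front (resize/reserve from these
// counts) and re-read the file, assigning values by index instead of push_back.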
Otherwise, using some kind of temporary pool to store the values for a matrix would be acceptable:
std::deque< std::vector< std::vector< int > > > matrices;
std::vector< size_t > columns;   // number of columns, indexed by row
std::vector< int > values;       // all values in the current matrix
columns.reserve( 1000 );         // guess a reasonable row count to begin with
values.reserve( 1000000 );       // guess a reasonable value count to begin with

while( getline(fp, line) ) {
    if( line.empty() ) {
        AddMatrix( matrices, columns, values );
    } else {
        std::istringstream iss( line );
        size_t count = 0;
        for( int val; iss >> val; ) {
            values.push_back( val );
            count++;
        }
        columns.push_back( count );
    }
}
// In case the last line in the file was not empty, add the last matrix.
AddMatrix( matrices, columns, values );
And add the matrix with something like this:
void AddMatrix( std::deque< std::vector< std::vector< int > > > & matrices,
                std::vector< size_t > & columns,
                std::vector< int > & values )
{
    if( columns.empty() ) return;

    // Reserve matrix rows
    size_t num_rows = columns.size();
    std::vector< std::vector< int > > matrix;
    matrix.reserve( num_rows );

    // Copy rows into the matrix
    auto val_it = values.begin();
    for( size_t num_cols : columns )
    {
        std::vector< int > row;
        row.reserve( num_cols );
        std::copy_n( val_it, num_cols, std::back_inserter( row ) );
        matrix.emplace_back( std::move( row ) );
        val_it += num_cols;
    }

    // Store the completed matrix.
    matrices.push_back( std::move( matrix ) );

    // Clear the column and value pools for re-use.
    columns.clear();
    values.clear();
}
Finally, I recommend you choose an appropriate integer type from <cstdint> rather than leaving it up to the compiler. If you only need 32-bit integers, use int_least32_t. If your data range fits in 16-bit integers, you'll save a lot of memory by using int_least16_t.
I guess you are seeing the combination of two effects: the different size of an int in memory versus on disk, plus the extra memory held by each vector.
An int takes 4 bytes of memory on most platforms (implementations are allowed to make it 8 bytes, though I have not seen one that does). The text representation, on the other hand, takes only 1 byte per digit plus 1 byte for the space. So if the file holds lots of small integers, the internal representation will be larger than the file; with lots of large numbers, it will be smaller.
Also check whether you are comparing against the size of the file or its size on disk, as some filesystems support compression!
The next effect you will most likely notice is the capacity of the vectors; since you most likely have many of them, this can add quite some overhead.
In order not to have to reallocate on every insertion, the std::vector class keeps a capacity: the amount of storage it has actually allocated, which it fills with the objects you add.
Depending on the implementation, the capacity can grow, for example by doubling every time you exceed it: if you start with a capacity of 10 and reach a size of 11, the capacity can go to 20; if you reach a size of 21, the capacity can go to 40... (Note: this is also why reserving is important, as it directly gives you the right size.)
So if you check the capacity and the size of every individual vector, they can differ. If this is really dramatic for you, you can call shrink_to_fit on the vector to reallocate down to the actually stored size (formally a non-binding request, but common implementations honor it).
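A tiny sketch of the size/capacity gap and shrink_to_fit (the exact capacities depend on the implementation's growth policy):

#include <iostream>
#include <vector>

int main() {
    std::vector<int> row;
    for (int i = 0; i < 1000; ++i)
        row.push_back(i);
    std::cout << row.size() << " / " << row.capacity() << '\n'; // e.g. 1000 / 1024
    row.shrink_to_fit(); // non-binding request to drop the excess capacity
    std::cout << row.size() << " / " << row.capacity() << '\n'; // e.g. 1000 / 1000
}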
Finally, the size of your program itself also influences the measurement. I don't think it is the dominant factor here, but if you happen to link a lot of shared objects that are all loaded during startup, some memory measurements can include those shared objects as part of your program's memory.