I have a graph ( adjacency_list<listS, vecS, bidirectionalS, VertexVal> ) from which I need to delete 100,000+ nodes. Each node contains a structure of two 64-bit integers plus another 64-bit integer. The guid check in the code below tests the first integer of that structure.
On my laptop ( i7 2.7GHz, 16GB RAM ) it takes about 88 seconds according to VTune.
Following is how I delete the nodes:
vertex_iterator vi, vi_end;
boost::tie(vi, vi_end) = boost::vertices(m_graph);
while (vi != vi_end) {
    if (m_graph[*vi].guid.part1 == 0) {
        boost::remove_vertex(*vi, m_graph);
        boost::tie(vi, vi_end) = boost::vertices(m_graph);
    } else {
        ++vi;
    }
}
VTune shows that the boost::remove_vertex() call takes 88.145 seconds. Is there a more efficient way to delete these vertices?
In your removal branch you re-tie() the iterators:
boost::tie(vi, vi_end) = boost::vertices(m_graph);
This restarts the scan from the beginning of the vertex range every time you remove a vertex. This is exactly Schlemiel the Painter's algorithm.
I'll find out whether remove_vertex can be trusted not to trigger a reallocation. If so, it's easily fixed. Otherwise, you'd want an index-based loop instead of an iterator-based one, or you might be able to work on the raw container (though it's a private member, as I remember).
Update: Using vecS as the container for vertices is going to cause bad performance here; the documentation says:
If the VertexList template parameter of the adjacency_list was vecS, then all vertex descriptors, edge descriptors, and iterators for the graph are invalidated by this operation. <...> If you need to make frequent use of the remove_vertex() function the listS selector is a much better choice for the VertexList template parameter.
This small benchmark test.cpp compares:
with -DSTABLE_IT (listS)
$ ./stable
Generated 100000 vertices and 5000 edges in 14954ms
The graph has a cycle? false
starting selective removal...
Done in 0ms
After: 99032 vertices and 4916 edges
without -DSTABLE_IT (vecS)
$ ./unstable
Generated 100000 vertices and 5000 edges in 76ms
The graph has a cycle? false
starting selective removal...
Done in 396ms
After: 99032 vertices and 4916 edges
using filtered_graph (thanks @cv_and_he in the comments)
Generated 100000 vertices and 5000 edges in 15ms
The graph has a cycle? false
starting selective removal...
Done in 0ms
After: 99032 vertices and 4916 edges
Done in 13ms
You can clearly see that removal is much faster with listS, but generating the graph is much slower.
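For reference, here is a minimal sketch of what the removal loop can look like once the VertexList selector is listS (this is not the asker's actual code; the VertexVal bundle and the guid field are assumed from the question). Because remove_vertex then only invalidates iterators that refer to the removed vertex, you can advance the iterator before erasing and never need to re-tie the range:

#include <boost/graph/adjacency_list.hpp>
#include <boost/tuple/tuple.hpp>

struct Guid { unsigned long long part1, part2; };          // assumed layout
struct VertexVal { Guid guid; unsigned long long other; };

typedef boost::adjacency_list<boost::listS, boost::listS,  // listS vertex list
                              boost::bidirectionalS, VertexVal> Graph;

void removeMarkedVertices(Graph& g) {
    boost::graph_traits<Graph>::vertex_iterator vi, vi_end, next;
    boost::tie(vi, vi_end) = boost::vertices(g);
    for (next = vi; vi != vi_end; vi = next) {
        ++next;                           // advance before erasing
        if (g[*vi].guid.part1 == 0) {
            boost::clear_vertex(*vi, g);  // drop incident edges first
            boost::remove_vertex(*vi, g);
        }
    }
}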
I was able to serialize the graph into a string using the Boost serialization routines, parse the string to remove the nodes I didn't need, and then de-serialize the modified string. For a graph of 200,000 nodes with 100,000 to be deleted, the whole operation finished in less than 2 seconds.
For my particular use case each vertex holds three 64-bit integers. When a vertex needs to be deleted, I set two of those integers to 0; a valid vertex never has a 0 there. When it is time to clean up the graph, i.e. to delete the "deleted" vertices, I follow the logic above.
In the code below, removeDeletedNodes() does the string parsing, removes the vertices, and remaps the edge numbers.
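The removeDeletedNodes() code is not reproduced here. For comparison, this is a rough sketch of the filtered_graph/copy_graph route mentioned in the benchmark above, which sidesteps remove_vertex entirely by copying only the surviving vertices into a fresh graph (type and field names are assumed from the question):

#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/filtered_graph.hpp>
#include <boost/graph/copy.hpp>

struct Guid { unsigned long long part1, part2; };          // assumed layout
struct VertexVal { Guid guid; unsigned long long other; };

typedef boost::adjacency_list<boost::listS, boost::vecS,
                              boost::bidirectionalS, VertexVal> Graph;

// vertex predicate: keep vertices that are not marked as deleted
struct NotDeleted {
    const Graph* g;
    bool operator()(Graph::vertex_descriptor v) const {
        return (*g)[v].guid.part1 != 0;
    }
};

Graph withoutDeletedNodes(const Graph& g) {
    NotDeleted keep = { &g };
    boost::filtered_graph<Graph, boost::keep_all, NotDeleted>
        view(g, boost::keep_all(), keep);
    Graph out;
    boost::copy_graph(view, out);   // uses the vecS vertex_index map
    return out;
}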
It would be interesting to see more of the VTune data.
My experience has been that the default Microsoft allocator can be a big bottleneck when deleting tens of thousands of small objects. Does your VTune profile show a lot of time in delete or free?
If so, consider switching to a third-party allocator. Nedmalloc is said to be good: http://www.nedprod.com/programs/portable/nedmalloc/
Google has one, tcmalloc, which is very well regarded and much faster than the built-in allocators on almost every platform: https://code.google.com/p/gperftools/. Note, however, that tcmalloc is not a drop-in replacement on Windows.
Related
I am trying to parallelise a biological model in C++ with boost::mpi. It is my first attempt, and I am entirely new to the boost library (I have started from the Boost C++ Libraries book by Schaling). The model consists of grid cells and cohorts of individuals living within each grid cell. The classes are nested, such that a vector of Cohorts* belongs to a GridCell. The model runs for 1000 years, and at each time step, there is dispersal such that the cohorts of individuals move randomly between grid cells. I want to parallelise the content of the for loop, but not the loop itself as each time step depends on the state of the previous time.
I use world.send() and world.recv() to send the necessary information from one rank to another. Because sometimes there is nothing to send between ranks, I use mpi::status and world.iprobe() to make sure the code does not hang waiting for a message that was never sent (I followed this tutorial).
The first part of my code seems to work fine, but I am having trouble making sure that all the sent messages have been received before moving on to the next step of the for loop. In fact, I noticed that some ranks move on to the following time step before the other ranks have had time to send their messages (or at least that is what it looks like from the output).
I am not posting the full code because it consists of several classes and is quite long; if you are interested, it is on GitHub. Here is rough pseudocode. I hope this will be enough to understand the problem.
int main()
{
    // initialise the GridCells and the Cohorts living in them.
    // Depending on the number of cores requested, split the grid cells
    // processed by each core evenly, and store the relevant grid cells
    // in a vector of GridCell*

    // start looping through the time steps
    for (int k = 0; k < (burnIn + simTime); k++)
    {
        // calculate the survival and reproduction probabilities
        // for each Cohort and the dispersal probability

        // the dispersing Cohorts are sorted based on the rank of
        // the destination and stored in multiple vector<Cohort*>

        // I send the vector<Cohort*> with
        world.send(…)

        // the receiving rank gets the vector of Cohorts with:
        mpi::status statuses[world.size()];
        for (int st = 0; st < world.size(); st++)
        {
            ....
            if (world.iprobe(st, tagrec))
                statuses[st] = world.recv(st, tagrec, toreceive[st]);
            // world.iprobe ensures that the code doesn't hang when
            // there are no dispersers
        }

        // do some extra calculations here

        // wait until all the messages have been received, and then the time step ends.
        // This is the bit where I am stuck.
        // I've seen examples with wait_all for the non-blocking isend/irecv,
        // but I don't think it is applicable in my case.
        // The problem is that I noticed that some ranks proceed to the next
        // time step before all the other ranks have sent their messages.
    }
}
I compile with
mpic++ -I/$HOME/boost_1_61_0/boost/mpi -std=c++11 -Llibdir -lboost_mpi -lboost_serialization -lboost_locale -o out
and execute with mpirun -np 5 out, but I would like to be able to execute with a higher number of cores on an HPC cluster later on (the model will be run at the global scale, and the number of cells might depend on the grid cell size chosen by the user).
The installed compilers are g++ 7.3.0 (Ubuntu 7.3.0-27ubuntu1~18.04) and Open MPI 2.1.1.
The fact that you have nothing to send is an important piece of information in your scenario. You cannot deduce that fact from the absence of a message alone: the absence of a message only means nothing has been sent yet.
Simply sending a zero-sized vector and skipping the probing is the easiest way out.
Otherwise you would probably have to change your approach radically or implement a very complex speculative execution / rollback mechanism.
Also note that the linked tutorial uses probe in a very different fashion.
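As a rough illustration of the "always send, even if empty" idea (this is not the poster's model; Cohort is reduced to a stub and the dispersal logic is omitted): every rank posts exactly one non-blocking send to and one receive from every other rank per time step, so no probing is needed and no rank can run ahead with unmatched messages.

#include <boost/mpi.hpp>
#include <boost/serialization/vector.hpp>
#include <vector>

namespace mpi = boost::mpi;

struct Cohort {
    int size = 0;
    template <class Archive>
    void serialize(Archive& ar, unsigned) { ar & size; }
};

int main() {
    mpi::environment env;
    mpi::communicator world;

    for (int step = 0; step < 10; ++step) {
        const int tag = step;                       // step-specific tag
        std::vector<std::vector<Cohort>> outgoing(world.size());
        std::vector<std::vector<Cohort>> incoming(world.size());
        // ... fill outgoing[r] with the cohorts dispersing to rank r
        //     (possibly leaving it empty) ...

        std::vector<mpi::request> reqs;
        for (int r = 0; r < world.size(); ++r) {
            if (r == world.rank()) continue;
            reqs.push_back(world.isend(r, tag, outgoing[r]));  // may be empty
            reqs.push_back(world.irecv(r, tag, incoming[r]));
        }
        mpi::wait_all(reqs.begin(), reqs.end());    // everything is matched here

        // ... integrate the incoming cohorts, then the time step ends ...
    }
}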
I'm pretty new to MPI programming and got stuck in the middle of my project.
I want to write MPI code for the following problem, and I am not sure which MPI functions are appropriate.
Here is the problem:
Processor 0 has a 2D vector (or array) of edges, Edges = {(0,4), (1,5)}. It needs to get some information from other processors; which processors exactly is not fixed, it depends on the set Edges. Therefore I need a for loop along these lines:
if (my_rank == 0)
{
    for (all pairs (i,j) in Edges)
    {
        send i (or j) to processor r     // r depends on the index i
        receive L_r from processor r
        create (L_i, L_j, min(L_i, L_j)) // want to broadcast to all later
    }
}
Now, I am not sure how to write the matching code on processor r - should it also be a for loop?
Note that I cannot hard-code it in an if statement, since I don't know in advance which processor it will be; writing one if branch per processor doesn't seem like the right way. I might have many processors, each holding some part of a matrix.
I should also point out that I cannot restrict communication to a fixed subgroup (sub-communicator), since it all depends on the indices: basically I want the labels, e.g. for the pair (0,4) I need to communicate with P4, which holds that index.
Any ideas are appreciated.
I would do it as follows:
1) Proc 0 constructs a list of every process it has to communicate with.
2) Proc 0 broadcasts this list to all processes (or only to the ones it has to communicate with, but that is more complicated and can be done once you have a working version).
3) You perform your communication, roughly as in the sketch below:
if (rank == 0) { ... }
else if (rank is in the list) { ... }
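A sketch of those three steps in plain MPI could look like this (not a complete solution; the peer list is hard-coded and the rank-owns-index mapping is assumed from the question's example, where the pair (0,4) means talking to P4):

#include <mpi.h>
#include <vector>
#include <algorithm>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // 1) rank 0 builds the list of ranks it has to communicate with,
    //    derived from Edges (hard-coded here for illustration)
    std::vector<int> peers;
    if (rank == 0) peers = { 4, 5 };      // e.g. from Edges = {(0,4),(1,5)}

    // 2) broadcast the list: first its length, then its contents
    int npeers = (int)peers.size();
    MPI_Bcast(&npeers, 1, MPI_INT, 0, MPI_COMM_WORLD);
    peers.resize(npeers);
    MPI_Bcast(peers.data(), npeers, MPI_INT, 0, MPI_COMM_WORLD);

    // 3) perform the communication
    if (rank == 0) {
        for (int r : peers) {
            int index = r;                // index i is assumed to live on rank i
            int L_r;
            MPI_Send(&index, 1, MPI_INT, r, 0, MPI_COMM_WORLD);
            MPI_Recv(&L_r, 1, MPI_INT, r, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            // ... build (L_i, L_j, min(L_i, L_j)) here and broadcast later
        }
    } else if (std::find(peers.begin(), peers.end(), rank) != peers.end()) {
        int index, L_r;
        MPI_Recv(&index, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        L_r = rank;                       // stand-in for the local label L_index
        MPI_Send(&L_r, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}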
Hi, I'm trying to fine-tune VGG on my problem, but when I try to train the net I get this error:
OOM when allocating tensor with shape[25088,4096]
The net has this structure:
I took this TensorFlow pretrained VGG implementation code from this site.
I only added the following procedure to train the net:
with tf.name_scope('joint_loss'):
    joint_loss = ya_loss+yb_loss+yc_loss+yd_loss+ye_loss+yf_loss+yg_loss+yh_loss+yi_loss+yl_loss+ym_loss+yn_loss
    # Loss with weight decay
    l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()])
    self.joint_loss = joint_loss + self.weights_decay * l2_loss
    self.optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate).minimize(joint_loss)
I tried reducing the batch size to 2, but it doesn't work; I get the same error. The error is due to a big tensor that cannot be allocated in memory. I only get this error during training: if I feed a value without running minimize, the net works. How can I avoid this error? How can I save memory on the graphics card (an Nvidia GeForce GTX 970)?
UPDATE: if I use GradientDescentOptimizer the training process starts, whereas with AdamOptimizer I get the memory error; it seems that GradientDescentOptimizer uses less memory.
Without a backward pass ("feed a value without minimizing"), TensorFlow can immediately de-allocate intermediate activations. With a backward pass, the graph has a giant U-shape, where activations from the forward pass need to be kept in memory for the backward pass. There are some tricks (such as swapping to host memory), but in general backprop means that memory usage will be higher.
Adam does keep some extra bookkeeping variables around, so it will increase memory usage proportional to the amount of memory your weight variables are already using. If your training steps take quite a while (in which case having the variable updates on the GPU isn't important), you could instead locate the optimization ops in host memory.
If you need a larger batch size and can't reduce image resolution or model size, combining gradients from multiple workers/GPUs using something like SyncReplicasOptimizer can be a good option. Looking at the paper associated with this model, it looks like they were training on 4 GPUs each with 12GB of memory.
I'm working on real-time audio processing software in C++ with Qt, and I need the resource requirements to be kept to a minimum.
I define a temporary buffer of 40 ms; with the device running at a sampling frequency of Fs = 8000 Hz, a function called DataProcessing() is called every 320 samples.
The idea is to have a global buffer that stores the last 10 s recorded, i.e. 80,000 samples.
On each iteration this buffer drops the oldest 320 samples and appends the 320 new samples at the end. The buffer is thus kept up to date and the user can observe a real-time graphical representation of the recorded signal.
At first I thought of using QVector (Qt's equivalent of std::vector), which reduces the process to a few lines of code:
int NUM_POINTS = 320;
DatosTemporales.erase(DatosTemporales.begin(), DatosTemporales.begin() + NUM_POINTS);
DatosTemporales += DatosNuevos; // DatosNuevos has a size of NUM_POINTS
But on each iteration this effectively rebuilds a vector of 80,000 samples and frees some positions, which costs processing time. The alternative I considered was using a raw double* and shifting the samples with a loop on each iteration:
for (int i = 0; i < 80000; i++) {
    if (i < 80000 - NUM_POINTS) {
        aux = DatosTemporales[i];
        DatosTemporales[i + NUM_POINTS] = aux;
    } else {
        DatosTemporales[i] = DatosNuevos[i - NUN_POINTS];
    }
}
This fails. I think the best way is to use dynamic memory and implement this process with pointers. Could anyone give me some idea of how to implement it?
It sounds like what you are looking for is a circular buffer.
https://www.google.com/search?q=qcircularbuffer
https://qt.gitorious.org/qt/qtbase/merge_requests/60
And it looks like you only need the header file and you should be good to go.
A similar tool that already ships with Qt can be found here:
http://doc.qt.io/qt-5/qcontiguouscache.html#details
The advantage of structures like these is that they don't need to reallocate memory; they just move the head and tail pointers.
Hope that helps.
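For example, here is a minimal sketch built around QContiguousCache, which the link above documents (the class and names here are hypothetical, not your code): appending past the capacity silently discards the oldest sample, so there is no erase/copy step at all.

#include <QContiguousCache>

class SignalWindow {
public:
    explicit SignalWindow(int capacity = 80000)   // 10 s at 8000 Hz
        : m_samples(capacity) {}

    // called once per 40 ms block of 320 samples
    void appendBlock(const double* block, int n) {
        for (int i = 0; i < n; ++i)
            m_samples.append(block[i]);           // overwrites the oldest sample when full
    }

    // oldest-to-newest access for plotting
    double sampleAt(int i) const {
        return m_samples.at(m_samples.firstIndex() + i);
    }
    int size() const { return m_samples.count(); }

private:
    QContiguousCache<double> m_samples;
};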
I need to write a function that can receive fragments of different messages and piece them together. The fragments arrive as instances of a class msg, which holds the following information:
int message_id
int no_of_fragments
int fragment_id
string msg_fragment
The function needs to do the following:
Check the received message - if no_of_fragments == 1, the message has not been fragmented and the function can stop here
If no_of_fragments > 1, the message is fragmented
get message_id and fragment_id
collect all fragments, e.g. for message_id=111 with no_of_fragments=6 the system should ensure that fragment_ids 1-6 have been collected
piece the fragments together
What is the best way for doing this? I thought a map might be useful (with the message_id serving as key, pointing to a container that would hold the fragments) but would appreciate any suggestions.
Thank you!
I would use a map of vectors. Each time you receive a new message ID, use it as your map key, and allocate a vector to hold the fragments based on the number of fragments given in the first fragment received (which doesn't have to arrive in order). You'll also want to keep a count, so it's easy to tell when you've received the last fragment - so probably a map from message_id to a struct holding the count and the vector of fragments.
My C++ is rusty:
struct message_parts {
    int fragments_expected;             // init to no_of_fragments
    int fragments_received;             // init to 0 (bump it as soon as you add a fragment to the vector)
    std::vector<fragment *> fragments;  // initialize its size to no_of_fragments
};

std::map<int, message_parts> partial_messages;
When you insert a fragment, put it directly into position fragment_id - 1 of the fragments vector (fragment IDs start at 1, but the vector is zero-indexed). This way you'll always have the fragments in the right order, no matter what order they arrive in.
After you add a fragment, check whether fragments_received == fragments_expected; if so, you can piece the message together and deal with the data.
This gives constant time first-fragment detection and allocation, constant time fragment insertion, constant time complete-message-received detection, and linear time message reconstruction (can't do any better than this).
This solution requires no special casing for non-fragmented data.
Don't forget to delete the fragments once you've reassembled them into the complete message.
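A minimal sketch of that scheme could look like the following (the msg fields are taken from the question; add_fragment is a hypothetical helper, it stores the fragment payloads by value rather than as pointers so there is nothing to delete afterwards, and it assumes each fragment arrives at most once):

#include <map>
#include <optional>
#include <string>
#include <vector>

struct msg {
    int message_id;
    int no_of_fragments;
    int fragment_id;        // 1-based, per the question
    std::string msg_fragment;
};

struct message_parts {
    int fragments_expected = 0;
    int fragments_received = 0;
    std::vector<std::string> fragments;
};

std::map<int, message_parts> partial_messages;

// Returns the complete message when the last fragment arrives.
std::optional<std::string> add_fragment(const msg& m) {
    if (m.no_of_fragments == 1)
        return m.msg_fragment;                    // never fragmented

    message_parts& parts = partial_messages[m.message_id];
    if (parts.fragments.empty()) {                // first fragment of this message
        parts.fragments_expected = m.no_of_fragments;
        parts.fragments.resize(m.no_of_fragments);
    }
    parts.fragments[m.fragment_id - 1] = m.msg_fragment;
    ++parts.fragments_received;

    if (parts.fragments_received < parts.fragments_expected)
        return std::nullopt;                      // still waiting for more

    std::string whole;
    for (const std::string& piece : parts.fragments)
        whole += piece;                           // fragments are already in order
    partial_messages.erase(m.message_id);
    return whole;
}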