Best datastructure for iterating over and moving elements to front - c++

I'm implementing the relabel-to-front algorithm as part of a solution to a larger maximum flow problem, and I've hit a performance bottleneck that I didn't expect.
The general structure for storing the graph data is as follows:
struct edge{
int destination;
int capacity;
};
struct vertex{
int e_flow;
int h;
vector<edge> edges;
};
The specifics of the algorithm are not that important to the question. In the main loop of the solution I'm looping over all vertices except the source and the sink. If at some point a change is made to a vertex, then that vertex is put at the front of the list and the iteration starts again from the start. We terminate once the end of the list is reached. This part currently looks as follows:
//nodes are 0..nodeCount-1 with source=0 and sink=nodeCount-1
vector<int> toDischarge(nodeCount-2,0);
for(int i=1;i<sink;i++){
toDischarge[i-1]=i;
}//skip over source and sink
//custom pointer to the entry of toDischarge we are currently accessing
int point = 0;
while(point != nodeCount-2){
int val = toDischarge[point];
int oldHeight = graph[val].h;
discharge(val, graph, graph[val].e_flow);
if(graph[val].h != oldHeight){
rotate(toDischarge.begin(), toDischarge.begin()+point, toDischarge.begin()+point+1);
//if the value of the vertex has changed move it to the front and reset pointer
point = 0;
}
point++;
}
I tried using a std::list before the vector solution, but that was even slower. Conceptually that didn't make sense to me, since (re)moving elements in a list should be cheap. After some research I found out that it probably performed so badly because of the poor cache locality of list.
Even with the vector solution though I did some basic benchmarking using valgrind and have the following results.
If I understand this correctly then over 30% of my execution time is just spent doing vector element accesses.
Another solution I've tried is copying the vertex needed for that iteration into a local variable, since it is accessed multiple times, but performance was even worse, I think because that also copies the whole edge list.
What data structure would improve the general performance of these operations? I'm also interested in other data structures for storing the graph data if that would help.

It seems to me that this is what std::deque<> is for. Think of it as a 'non-contiguous vector', or some vector-like blocks tied together. You can use much the same interface as vector, except that you cannot assume that adding an index to a pointer to the first element yields the given element (that is undefined behaviour); you need to use [] or iterators for indexing. Also, dq.insert(it, elem) is quick if it is std::begin(dq) or std::end(dq).
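As a rough sketch of the move-to-front step on a std::deque, assuming the list holds vertex ids as int (the helper name moveToFront is hypothetical and not part of the question's code). Note that erasing from the middle of a deque is still linear in the distance to the nearer end, so this should be measured against the std::rotate version rather than assumed to be faster.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical helper: moves the element at position `index` to the front.
// push_front is cheap; erase in the middle costs O(distance to the nearer end).
void moveToFront(std::deque<int>& dq, std::size_t index) {
    if (index == 0 || index >= dq.size()) return;  // already in front or out of range
    int value = dq[index];
    dq.erase(dq.begin() + static_cast<std::ptrdiff_t>(index));
    dq.push_front(value);
}
```

In the main loop this would replace the rotate call, e.g. moveToFront(toDischarge, point).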

Related

Which container is most efficient for multiple insertions / deletions in C++?

I was set a homework challenge as part of an application process (I was rejected, by the way; I wouldn't be writing this otherwise) in which I was to implement the following functions:
// Store a collection of integers
class IntegerCollection {
public:
// Insert one entry with value x
void Insert(int x);
// Erase one entry with value x, if one exists
void Erase(int x);
// Erase all entries, x, from <= x < to
void Erase(int from, int to);
// Return the count of all entries, x, from <= x < to
size_t Count(int from, int to) const;
The functions were then put through a bunch of tests, most of which were trivial. The final test was the real challenge as it performed 500,000 single insertions, 500,000 calls to count and 500,000 single deletions.
The member variables of IntegerCollection were not specified and so I had to choose how to store the integers. Naturally, an STL container seemed like a good idea and keeping it sorted seemed an easy way to keep things efficient.
Here is my code for the four functions using a vector:
// Previous bit of code shown goes here
private:
std::vector<int> integerCollection;
};
void IntegerCollection::Insert(int x) {
/* using lower_bound to find the right place for x to be inserted
keeps the vector sorted and makes life much easier */
auto it = std::lower_bound(integerCollection.begin(), integerCollection.end(), x);
integerCollection.insert(it, x);
}
void IntegerCollection::Erase(int x) {
// find the location of the first element containing x and delete if it exists
auto it = std::find(integerCollection.begin(), integerCollection.end(), x);
if (it != integerCollection.end()) {
integerCollection.erase(it);
}
}
void IntegerCollection::Erase(int from, int to) {
if (integerCollection.empty()) return;
// lower_bound points to the first element of integerCollection >= from/to
auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);
/* std::vector::erase deletes entries between the two iterators
fromBound (included) and toBound (not included) */
integerCollection.erase(fromBound, toBound);
}
size_t IntegerCollection::Count(int from, int to) const {
if (integerCollection.empty()) return 0;
int count = 0;
// lower_bound points to the first element of integerCollection >= from/to
auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);
// increment pointer until fromBound == toBound (we don't count elements of value = to)
while (fromBound != toBound) {
++count; ++fromBound;
}
return count;
}
The company got back to me saying that they wouldn't be moving forward because my choice of container meant the runtime complexity was too high. I also tried using list and deque and compared the runtime. As I expected, I found that list was dreadful and that vector took the edge over deque. So as far as I was concerned I had made the best of a bad situation, but apparently not!
I would like to know what the correct container to use in this situation is? deque only makes sense if I can guarantee insertion or deletion to the ends of the container and list hogs memory. Is there something else that I'm completely overlooking?
We cannot know what would make the company happy. If they reject std::vector without concise reasoning, I wouldn't want to work for them anyway. Moreover, we don't really know the precise requirements. Were you asked to provide one reasonably well-performing implementation? Or did they expect you to squeeze the last percent out of the provided benchmark by profiling a bunch of different implementations?
The latter is probably too much for a homework challenge as part of an application process. If it is the former, you can either:
Roll your own. It is unlikely that the interface you were given can be implemented more efficiently than one of the std containers does... unless your requirements are so specific that you can write something that performs well under that specific benchmark.
Use std::vector for data locality. Bjarne Stroustrup himself has advocated std::vector over linked lists.
Use std::set for ease of implementation. It seems like you want the container sorted, and the interface you have to implement fits that of std::set quite well.
Let's compare only insertion and erasure, assuming the container needs to stay sorted:
operation   std::set    std::vector
insert      O(log N)    O(N)
erase       O(log N)    O(N)
Note that the log(N) for the binary_search to find the position to insert/erase in the vector can be neglected compared to the N.
Now you have to consider that the asymptotic complexity listed above completely neglects the non-linearity of memory access. In reality data can be far apart in memory (std::set), leading to many cache misses, or it can be local as with std::vector. The log(N) only wins for huge N. To get an idea of the difference (using the base-2 logarithm): 500000/log2(500000) is roughly 26410, while 1000/log2(1000) is only ~100.
I would expect std::vector to outperform std::set for fairly small container sizes, but at some point the log(N) wins over cache effects. The exact location of this turning point depends on many factors and can only be reliably determined by profiling and measuring.
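To make the vector column concrete, here is a minimal sketch of a sorted-vector variant in which single erasure and counting both rely on binary search, so Count becomes an iterator subtraction instead of a loop. The class name SortedIntVector is hypothetical; insertion and erasure still shift elements and remain O(N).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: same interface as the question's class, backed by a sorted vector.
class SortedIntVector {
public:
    void Insert(int x) {
        // O(log N) search + O(N) shift
        data_.insert(std::upper_bound(data_.begin(), data_.end(), x), x);
    }
    void Erase(int x) {
        // binary search instead of std::find
        auto it = std::lower_bound(data_.begin(), data_.end(), x);
        if (it != data_.end() && *it == x) data_.erase(it);
    }
    void Erase(int from, int to) {
        if (from >= to) return;  // empty range
        auto first = std::lower_bound(data_.begin(), data_.end(), from);
        auto last = std::lower_bound(first, data_.end(), to);
        data_.erase(first, last);
    }
    std::size_t Count(int from, int to) const {
        if (from >= to) return 0;
        auto first = std::lower_bound(data_.begin(), data_.end(), from);
        auto last = std::lower_bound(first, data_.end(), to);
        return static_cast<std::size_t>(last - first);  // O(1) for random-access iterators
    }
private:
    std::vector<int> data_;  // always kept sorted ascending
};
```

Whether this beats std::multiset on the 500,000-operation test is exactly the kind of question that only measuring can answer.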
Nobody knows which container is MOST efficient for multiple insertions / deletions. That is like asking what the most fuel-efficient car engine design possible is. People are always innovating on car engines and make more efficient ones all the time. However, I would recommend a splay tree. The time required for an insertion or deletion in a splay tree is not constant: some insertions take a long time and some take only a very short time. However, the amortized time per insertion/deletion is guaranteed to be O(log n), where n is the number of items stored in the splay tree. Logarithmic time is extremely efficient. It should be good enough for your purposes.
The first thing that comes to mind is to hash the integer value so single look ups can be done in constant time.
The integer value can be hashed to compute an index in to an array of bools or bits, used to tell if the integer value is in the container or not.
Counting and deleting large ranges could be sped up from there by using multiple hash tables for specific integer ranges.
If you had 0x10000 hash tables, that each stored ints from 0 to 0xFFFF and were using 32 bit integers you could then mask and shift the upper half of the int value and use that as an index to find the correct hash table to insert / delete values from.
IntHashTable containers[0x10000];
// upper 16 bits select the table, lower 16 bits are the value stored in it
uint32_t hashIndex = (uint32_t)value / 0x10000;
uint32_t valueInTable = (uint32_t)value - (hashIndex * 0x10000);
containers[hashIndex].insert(valueInTable);
Count, for example, could be implemented as follows, if each hash table kept count of the number of elements it contained:
int indexStart = startRange / 0x10000;
int indexEnd = endRange / 0x10000;
int countTotal = 0;
// note: this sums whole tables, so the first and last table need extra
// handling when the range starts or ends inside them
for (int i = indexStart; i<=indexEnd; ++i) {
countTotal += containers[i].count();
}
I'm not sure whether sorting really is a requirement for removing the range. It might be based on position. Anyway, just FYI, this question has some hints on which STL container to use:
In which scenario do I use a particular STL container?
Vector may be a good choice, but it does a lot of reallocation, as you know. I prefer deque instead, as it doesn't require one big chunk of memory to hold all items. For requirements such as yours, list probably fits better.
A basic solution for this problem might be std::map<int, int>,
where the key is the integer you are storing and the value is the number of occurrences.
The problem with this is that you cannot quickly remove/count ranges. In other words, the complexity is linear.
For a quick count you would need to implement your own complete binary tree, where you can know the number of nodes between two nodes (the upper and lower bound nodes) because you know the size of the tree and how many left and right turns you took to reach the upper and lower bound nodes. Note that we are talking about a complete binary tree; in a general binary tree you cannot make this calculation fast.
For quick range removal I do not know how to make it faster than linear.
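A minimal sketch of the std::map<int, int> idea, with the key as the stored integer and the value as its number of occurrences. The class name CountedIntMap is hypothetical. As noted above, Count and the range Erase remain linear in the number of distinct keys inside the range.

```cpp
#include <cstddef>
#include <map>

// Hypothetical sketch: multiset of ints stored as value -> occurrence count.
class CountedIntMap {
public:
    void Insert(int x) { ++counts_[x]; }  // O(log N)
    void Erase(int x) {
        auto it = counts_.find(x);
        if (it == counts_.end()) return;
        if (--it->second == 0) counts_.erase(it);  // drop the key once its count hits zero
    }
    void Erase(int from, int to) {
        if (from >= to) return;  // empty range
        counts_.erase(counts_.lower_bound(from), counts_.lower_bound(to));
    }
    std::size_t Count(int from, int to) const {
        if (from >= to) return 0;
        std::size_t total = 0;
        auto end = counts_.lower_bound(to);
        // linear in the number of distinct keys in [from, to)
        for (auto it = counts_.lower_bound(from); it != end; ++it)
            total += static_cast<std::size_t>(it->second);
        return total;
    }
private:
    std::map<int, int> counts_;
};
```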

How to construct a new vector/set of pointer from another vector/set of object?

Background
I wanted to manipulate a copy of a vector; however, copying each of its elements is normally an expensive operation.
There is a concept called shallow copy, which I read somewhere is the default copy constructor behavior. However, I'm not sure why it doesn't work here: when I copied the vector, the result looked like a deep copy.
struct Vertex{
int label;
Vertex(int label):label(label){ }
};
int main(){
vector<Vertex> vertices { Vertex(0), Vertex(1) };
// I Couldn't force this to be vector<Vertex*>
vector<Vertex> myvertices(vertices);
myvertices[1].label = 123;
std::cout << vertices[1].label << endl;
// OUTPUT: 1 (meaning object is deeply copied)
return 0;
}
Naive Solution: for pointer copy.
int main(){
vector<Vertex> vertices { Vertex(0), Vertex(1) };
vector<Vertex*> myvertices;
for (auto it = vertices.begin(); it != vertices.end(); ++it){
myvertices.push_back(&*it);
}
myvertices[1]->label = 123;
std::cout << vertices[1].label << endl;
// OUTPUT: 123 (meaning object is not copied, just the pointer)
return 0;
}
Improvement
Is there any other better approach or std::vector API to construct a new vector containing just the pointer of each of the elements in the original vector?
One way to transform a vector of elements into a vector of pointers to the elements of the original vector is std::transform. Compared to your example it is more efficient, because it preallocates the buffer of the vector of pointers, and IMHO it is also more elegant:
std::vector<Vertex*> myvertices(vertices.size());
std::transform(vertices.begin(), vertices.end(), myvertices.begin(), [](Vertex &v) { return &v; });
Or if you don't want to use a lambda for the unary operator:
std::vector<Vertex*> myvertices(vertices.size());
std::transform(vertices.begin(), vertices.end(), myvertices.begin(), std::addressof<Vertex>);
Caution: If you alter the original vector then you invalidate the pointers in the pointers' vector.
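If the original vector may grow, one alternative worth considering is to store positions instead of pointers. The sketch below is only an illustration under that assumption; makeIndexView is a hypothetical helper, and indices still become stale if elements are erased or reordered.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

struct Vertex {
    int label;
    explicit Vertex(int label) : label(label) {}
};

// Hypothetical helper: returns the positions 0..n-1 of the elements of `vertices`.
// Unlike pointers, these stay meaningful when the vector reallocates on growth.
std::vector<std::size_t> makeIndexView(const std::vector<Vertex>& vertices) {
    std::vector<std::size_t> view(vertices.size());
    std::iota(view.begin(), view.end(), std::size_t{0});
    return view;
}
```

An element is then reached as vertices[view[i]], at the cost of one extra indexing step.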
Thanks to @kfsone for pointing out the main problem: it is very uncommon to want to keep track of pointers into another vector of objects without thinking about the underlying purpose. He provided an alternative approach that solves a similar problem using bit masking. It was not obvious to me until he mentioned it.
When we try to store just the pointers into another vector, we most probably want to do some tracking or housekeeping of other objects, which is later performed on the pointers themselves without touching the original data. In my case, I'm solving a minimum vertex cover problem with a brute-force approach. I need to generate all subsets of vertices (e.g. 20 vertices generate 2**20, i.e. over a million, subsets), then trim down the irrelevant ones by iterating over the vertices in each candidate cover and removing the edges that are covered by them. My first intuition was to copy all pointers for efficiency, so that I could later remove the pointers one by one.
However, another way of looking at this problem is not to use a vector/set at all, but rather to keep track of those pointers as a bit pattern. I won't go into detail, but feel free to learn from others.
The performance difference is very significant: with bitwise operations you can test or update membership in O(1) constant time without much trouble, whereas with a container you tend to have to iterate over the elements, which bounds that step to O(n). To make it worse, if you are brute-forcing an NP-hard problem, you need to keep the constant factor as low as possible, and going from O(1) to O(N) is a huge difference in such a scenario.
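A minimal sketch of the bit-pattern idea for brute-force vertex cover, assuming fewer than 32 vertices and edges given as index pairs. The function names are hypothetical and this is not the code referred to above, just one possible illustration: each candidate subset is a single integer, and testing whether a vertex is in the subset is a shift and an AND.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Counts the set bits of x (portable fallback, no compiler intrinsics).
int popcount32(std::uint32_t x) {
    int c = 0;
    while (x) { x &= x - 1; ++c; }  // clears the lowest set bit each round
    return c;
}

// Brute force over all 2^n subsets; assumes 0 <= n < 32 and valid vertex indices.
int minVertexCoverSize(int n, const std::vector<std::pair<int, int>>& edges) {
    int best = n;
    for (std::uint32_t mask = 0; mask < (std::uint32_t{1} << n); ++mask) {
        bool covers = true;
        for (const auto& e : edges) {
            // an edge is covered if at least one endpoint is in the subset
            bool hasFirst = ((mask >> e.first) & 1u) != 0;
            bool hasSecond = ((mask >> e.second) & 1u) != 0;
            if (!hasFirst && !hasSecond) { covers = false; break; }
        }
        if (covers) best = std::min(best, popcount32(mask));
    }
    return best;
}
```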

Boost: Big graphs & Multithreading

I need to create a directed graph that can be quite large from a big dataset. I know these things for sure:
Each node has at most K outgoing edges
I have a list (unordered_map) of N >> K nodes
The graph is built by comparing all nodes with each other (yes, O(N^2) unfortunately)
Thinking about it, I would parallelize the graph creation using std::thread, and I was wondering if this could be done via Boost Graph Library.
If I use the adjacency matrix, it should be possible to preallocate the matrix (K*N elements), and hence it would be thread-safe to insert all adjacent nodes.
I've read that BGL could be thread-unsafe, but the posts I've found are three years old.
Do you know if it's possible to do what I'm thinking? Do you recommend doing otherwise?
Cheers!
Almost any graph algorithm in BGL needs a mapping: vertex -> int which assigns to each vertex a unique integer within the range [0, num_vertices(g) ). This mapping is known as "vertex_index" and is usually accessible as property_map.
Having said that, I assume your vertices are already integers or are associated with some integers (e.g. your unordered_map has some extra field in "mapped_type"). Even better (for performance and memory), if your input vertices are stored in a contiguous, tightly packed array, e.g. std::vector, then indexing is natural.
If vertices are [associated with] integers, your best choice for memory-tight graph is "Compressed Sparse Row Graph". The graph is immutable, so you need to populate edges container before you generate a graph.
As ravenspoint explained, your best choice is to equip each thread with its own local container of results and lock the central container only when merging the local result into the final one. Such strategy is implemented lock-less by TBB template tbb::parallel_reduce. So your full code for graph building can look roughly as below:
#include <cstddef>
#include <utility>
#include <vector>
#include "tbb/blocked_range2d.h"
#include "tbb/parallel_reduce.h"
#include "boost/graph/compressed_sparse_row_graph.hpp"
typedef something vertex; //e.g. something is an integer giving the index of the real data
class EdgeBuilder
{
public:
typedef std::pair<int,int> edge;
typedef std::vector<edge> Edges;
typedef ActualStorage Input; //placeholder for your own container type
EdgeBuilder(const Input & input):_input(input){} //OPTIONAL: reserve some space in _edges
EdgeBuilder( EdgeBuilder& parent, tbb::split ): _input(parent._input){} // reserve something
void operator()( const tbb::blocked_range2d<size_t>& r )
{
for( size_t i=r.rows().begin(); i!=r.rows().end(); ++i ){
for( size_t j=r.cols().begin(); j!=r.cols().end(); ++j ) {
//I assume you provide some function to compute existence
if (my_func_edge_exist(_input,i, j))
_edges.push_back(edge(i,j));
}
}
}
//merges local results from two TBB threads
void join( EdgeBuilder& rhs )
{
_edges.insert( _edges.end(), rhs._edges.begin(), rhs._edges.end() );
}
Edges _edges; //for a given interval of vertices
const Input & _input;
};
//full flow:
boost::compressed_sparse_row_graph<>* build_graph( const EdgeBuilder::Input & vertices)
{
EdgeBuilder builder(vertices);
tbb::blocked_range2d<size_t> range(0,vertices.size(), 100, //row grain size
0,vertices.size(), 100); //col grain size
tbb::parallel_reduce(range, builder);
//the caller owns the returned graph
boost::compressed_sparse_row_graph<>*
theGraph = new boost::compressed_sparse_row_graph<>
(boost::edges_are_unsorted_multi_pass,
builder._edges.begin(), builder._edges.end(),
vertices.size() );
return theGraph;
}
I think you should break your goal down into two separate sub-goals.
1. Create the links between nodes by doing the N * ( N - 1 ) tests of pairs of nodes. You appear to have an idea of how to break this up into independent threads. Store the results in a data structure that you know is thread safe, without worrying about the mysteries of boost::graph.
2. Create the boost::graph from your nodes and ( just created ) links.
A note about storing the links created in each thread: it is not so easy to find a suitable thread-safe data structure. If you use a dynamically allocated STL structure, then you have to worry about making a thread-safe allocator, which is a challenge. If you pre-allocate, then there is a lot of messy code to handle the allocations. So I would suggest storing the links created by each thread in a separate data structure, so that they do not have to be thread safe. When the links are all created, you can loop over the links created by each thread one by one.
A slightly more efficient design could be imagined, but will require a lot of arcane knowledge about thread safety. The design I propose can be implemented without arcane knowledge or tricky code and will therefore be implemented more quickly and more robustly and will be easier to maintain.
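The proposed design could be sketched roughly as follows, using plain std::thread. Everything here is an assumption for illustration: buildEdges is a hypothetical name, `linked` stands in for your own pair test and must be safe to call concurrently, and rows are simply interleaved across threads. Each thread appends only to its own vector, and the merge happens after all threads have been joined.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

using Edge = std::pair<std::size_t, std::size_t>;

// Tests all ordered pairs (i, j) with i != j and returns those for which linked(i, j) holds.
std::vector<Edge> buildEdges(std::size_t n, unsigned threadCount,
                             const std::function<bool(std::size_t, std::size_t)>& linked) {
    if (threadCount == 0) threadCount = 1;
    std::vector<std::vector<Edge>> local(threadCount);  // one private container per thread
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < n; i += threadCount)  // thread t takes rows t, t+T, ...
                for (std::size_t j = 0; j < n; ++j)
                    if (i != j && linked(i, j)) local[t].emplace_back(i, j);
        });
    }
    for (auto& w : workers) w.join();
    // Single-threaded merge once every worker has finished.
    std::vector<Edge> all;
    for (const auto& part : local) all.insert(all.end(), part.begin(), part.end());
    return all;
}
```

The merged edge list can then be handed to whichever boost::graph constructor you choose in the second step.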

Using an array of iterators for reading sparse information

I am currently writing C++ code to work with spike trains for a problem in theoretical neuroscience. The actual neuroscience, however, is fairly irrelevant to my question. Basically, I have a long timeframe, and I want to store every time the neuron "fires" during this time. Since "firing" is a discrete event, this can be done by simply recording the time of each event in a C++ vector, thereby creating a much sparser representation than storing information about every point in time. What makes this difficult is that I want to deal with several neurons at once. My solution to this problem has been to create a class that includes a map from each neuron's identifier (an integer) to that neuron's vector:
using namespace std;
typedef pair<int,vector<int> > Pair;
typedef map<int,vector<int> > Map;
class SpikeTrain{
public:
Map * train;//Spike train
double * dt;//timestep
int * t_now;//current timestep (index)
vector<int>::iterator * spikeIt;//Array of iterators for traversal.
//Methods, etc;
};
The map part of this works fine. The problem comes when I try to ask: how many events occur at any given time step? This is an easier question to ask than to answer because, if you remember, only the times at which events occur on each neuron are stored. I therefore turn to the strategy of initializing an array of iterators:
void SpikeTrain::beginIterator(){
spikeIt= new vector<int>::iterator[N()];
t_now = new int(0);
int n=N();
for(int i = 0;i<n;i++){
if((*train)[i].size()>0){
spikeIt[i] = (*train)[i].begin();
}
}
}
Here the iterator corresponding to each individual neuron points to the time of that neuron's first event, that is, the first entry in its vector of spikes [N() is simply the number of neurons, i.e. vectors, that I am counting over]. I then attempt to traverse my sparse pseudo-matrix by looking at each time, counting the number of neurons that spike at that time and, if a neuron does spike, moving the corresponding iterator in my array to the next entry in its vector:
bool* SpikeTrain::spikingNow(){
bool * spikingNeurons = new bool[N()];
int n = N();
for (int i = 0;i<n;i++){
if(*(spikeIt[i]) ==(*t_now)){
spikingNeurons[i] =true;
spikeIt[i]++;
}
}
(*t_now)++;
return spikingNeurons;
}
My problem, then, comes in attempting to access each iterator in the array to compare to the current time. I get a
EXC_BAD_ACCESS(code = 1,address = 0x0)
at:
if(*(spikeIt[i]) ==(*t_now))
I am new to C++, and to non-MATLAB programming in general, so I apologize if there are any heinous faux pas in this post. That being said, I am having a great deal of difficulty navigating this complex structure. Thanks!
If any vector in *train is empty, the corresponding iterator in spikeIt is never initialized - but you are dereferencing and incrementing it anyhow. This exhibits undefined behavior.
Further, there is no attempt to prevent iterators from incrementing past the end of their vectors.
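One way to address both points is to store, for every neuron, the current iterator together with its end iterator, and to check them before dereferencing. The sketch below is a hypothetical rewrite (SpikeCursor is not the poster's class); it assumes spike times are sorted ascending without duplicates and that the map outlives the cursor.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical sketch: walks all spike vectors in lockstep, one time step per call.
class SpikeCursor {
    struct Range {
        std::vector<int>::const_iterator current;
        std::vector<int>::const_iterator end;
    };
    std::vector<Range> cursors_;  // one entry per neuron, in map order
    int t_now_ = 0;               // current time step (index)

public:
    explicit SpikeCursor(const std::map<int, std::vector<int>>& train) {
        // An empty vector simply yields current == end, so it is never dereferenced.
        for (const auto& entry : train)
            cursors_.push_back({entry.second.begin(), entry.second.end()});
    }

    // Returns one flag per neuron: true if it spikes at the current step. Then advances time.
    std::vector<bool> spikingNow() {
        std::vector<bool> spiking(cursors_.size(), false);
        for (std::size_t i = 0; i < cursors_.size(); ++i) {
            Range& r = cursors_[i];
            if (r.current != r.end && *r.current == t_now_) {  // end check before dereference
                spiking[i] = true;
                ++r.current;
            }
        }
        ++t_now_;
        return spiking;
    }
};
```

Returning a std::vector<bool> by value also avoids the caller having to delete[] a raw bool array.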

Sorting 1000-2000 elements with many cache misses

I have an array of 1000-2000 elements which are pointers to objects. I want to keep my array sorted, and obviously I want to do this as quickly as possible. The elements are sorted by a member and are not allocated contiguously, so assume a cache miss whenever I access the sort-by member.
Currently I'm sorting on-demand rather than on-add, but because of the cache misses and [presumably] non-inlining of the member access the inner loop of my quick sort is slow.
I'm running tests and trying things now (to see what the actual bottleneck is), but can anyone recommend a good alternative for speeding this up?
Should I do an insertion sort instead of quicksorting on demand, or should I try to change my model to make the elements contiguous and reduce cache misses?
OR, is there a sort algorithm I've not come across which is good for data that is going to cache miss?
Edit: Maybe I worded this wrong :). I don't actually need my array sorted all the time (I'm not iterating through the elements sequentially for anything); I just need it sorted when I'm doing a binary chop to find a matching object. Doing that quicksort at that time (when I want to search) is currently my bottleneck, because of the cache misses and jumps (I'm using a < operator on my object, but I'm hoping that gets inlined in release builds).
Simple approach: insertion sort on every insert. Since your elements are not aligned in memory I'm guessing linked list. If so, then you could transform it into a linked list with jumps to the 10th element, the 100th and so on. This is kind of similar to the next suggestion.
Or you reorganize your container structure into a binary tree (or whatever tree you like: B, B*, red-black, ...) and insert elements like you would insert them into a search tree.
Running a quicksort on each insertion is enormously inefficient. Doing a binary search and insert operation would likely be orders of magnitude faster. Using a binary search tree instead of a linear array would reduce the insert cost.
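A minimal sketch of the binary-search-and-insert approach on an array of pointers, assuming the sort key is an int member. Item and insertSorted are hypothetical names. Finding the position touches only about log2(N) objects; the insert itself shifts pointers, which are stored contiguously.

```cpp
#include <algorithm>
#include <vector>

struct Item {
    int key;  // the sort-by member
};

// Inserts `item` into `sorted` (ascending by key) so that it stays sorted.
// Elements with equal keys keep their insertion order.
void insertSorted(std::vector<Item*>& sorted, Item* item) {
    auto pos = std::upper_bound(sorted.begin(), sorted.end(), item,
                                [](const Item* a, const Item* b) { return a->key < b->key; });
    sorted.insert(pos, item);
}
```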
Edit: I missed that you were doing sort on extraction, not insert. Regardless, keeping things sorted amortizes sorting time over each insert, which almost has to be a win, unless you have a lot of inserts for each extraction.
If you want to keep the sort on-extract methodology, then maybe switch to merge sort, or another sort that has good performance for mostly-sorted data.
I think the best approach in your case would be changing your data structure to something logarithmic and rethinking your architecture. The real bottleneck of your application is not the sort itself but the question of why you have to sort everything on each insert and then try to compensate for that with an on-demand sort.
Another thing you could try (based on your current implementation) is an external mapping table or function from pointers to keys, sorting those secondary keys instead, but I actually doubt it would help in this case.
Instead of the array of the pointers you may consider an array of structs which consist of both a pointer to your object and the sort criteria. That is:
Instead of
struct MyType {
// ...
int m_SomeField; // this is the sort criteria
};
std::vector<MyType*> arr;
You may do this:
struct ArrayElement {
MyType* m_pObj; // the actual object
int m_SortCriteria; // should be always equal to the m_pObj->m_SomeField
};
std::vector<ArrayElement> arr;
You may also remove the m_SomeField field from your struct, if you only access your object via this array.
This way, sorting your array does not need to dereference m_pObj on every comparison, so you make much better use of the cache.
Of course you must keep the m_SortCriteria always synchronized with m_SomeField of the object (in case you're editing it).
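Put together, the cached-key layout might be used like this. buildSortedIndex and findByKey are hypothetical helper names, and the sketch assumes the keys do not change while the index is in use. Sorting and searching then touch only the contiguous ArrayElement array; the objects themselves are dereferenced once each while building it.

```cpp
#include <algorithm>
#include <vector>

struct MyType {
    int m_SomeField;  // the sort criteria
};

struct ArrayElement {
    MyType* m_pObj;      // the actual object
    int m_SortCriteria;  // cached copy of m_pObj->m_SomeField
};

// Builds the cached-key array and sorts it by the cached key.
std::vector<ArrayElement> buildSortedIndex(const std::vector<MyType*>& objects) {
    std::vector<ArrayElement> arr;
    arr.reserve(objects.size());
    for (MyType* p : objects) arr.push_back({p, p->m_SomeField});  // one dereference per object
    std::sort(arr.begin(), arr.end(), [](const ArrayElement& a, const ArrayElement& b) {
        return a.m_SortCriteria < b.m_SortCriteria;
    });
    return arr;
}

// Binary search on the cached keys; returns nullptr if no element has this key.
MyType* findByKey(const std::vector<ArrayElement>& arr, int key) {
    auto it = std::lower_bound(arr.begin(), arr.end(), key,
                               [](const ArrayElement& e, int k) { return e.m_SortCriteria < k; });
    return (it != arr.end() && it->m_SortCriteria == key) ? it->m_pObj : nullptr;
}
```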
As you mention, you're going to have to do some profiling to determine if this is a bottleneck and if other approaches provide any relief.
Alternatives to using an array are std::set or std::multiset which are normally implemented as R-B binary trees, and so have good performance for most applications. You're going to have to weigh using them against the frequency of the sort-when-searched pattern you implemented.
In either case, I wouldn't recommend rolling-your-own sort or search unless you're interested in learning more about how it's done.
I would think that sorting on insertion would be better. A binary search needs O(log N) comparisons, so say ceil(log2(N)) + 1 retrievals of the data to sort with.
For 2000 elements, that amounts to about 12.
What's great about this is that you can buffer the data of the element to be inserted, which is how you end up with only about 12 function calls to actually insert.
You may wish to look at some inlining, but do profile before you're sure THIS is the tight spot.
Nowadays you could use a set: either a std::set, if the values in your structure member are unique, or a std::multiset, if your structure member can hold duplicate values.
One side note: using raw pointers like this is, in general, not advisable.
STL containers (if used correctly) nearly always give you well-optimized performance.
Anyway. Please see some example code:
#include <iostream>
#include <array>
#include <algorithm>
#include <set>
#include <iterator>
// Demo data structure, whatever
struct Data {
int i{};
};
// -----------------------------------------------------------------------------------------
// Everything in the section below is evaluated at compile time, not at run time.
// It creates an array of a few thousand pointers
constexpr std::size_t DemoSize = 4000u;
using DemoPtrData = std::array<const Data*, DemoSize>;
using DemoData = std::array<Data, DemoSize>;
consteval DemoData createDemoData() {
DemoData dd{};
int k{};
for (Data& d : dd)
d.i = k++*2;
return dd;
}
constexpr DemoData demoData = createDemoData();
consteval DemoPtrData createDemoPtrData(const DemoData& dd) {
DemoPtrData dpd{};
for (std::size_t k{}; k < dpd.size(); ++k)
dpd[k] = &dd[k];
return dpd;
}
constexpr DemoPtrData dpd = createDemoPtrData(demoData);
// -----------------------------------------------------------------------------------------
struct Comp {bool operator () (const Data* d1, const Data* d2) const { return d1->i < d2->i; }};
using MySet = std::multiset<const Data*, Comp>;
int main() {
// Add some thousand pointers. Will be sorted according to struct member
MySet mySet{ dpd.begin(), dpd.end() };
// Extract a range of data. integer values between 42 and 52
const Data* p42 = dpd[21];
const Data* p52 = dpd[26];
// Show result
for (auto iptr = mySet.lower_bound(p42); iptr != mySet.upper_bound(p52); ++iptr)
std::cout << (*iptr)->i << '\n';
// Insert a new element
Data d1{ 47 };
mySet.insert(&d1);
// Show again
std::cout << "\n\n";
for (auto iptr = mySet.lower_bound(p42); iptr != mySet.upper_bound(p52); ++iptr)
std::cout << (*iptr)->i << '\n';
}