How to efficiently clear vector<Mat> in opencv c++ - c++

I have a vector<Mat> my_vect in which each Mat holds floats and is 90x90 in size. I load 16,000 matrices from disk into that vector. After I finish working with those matrices, I clear the vector. Here is my code for loading and clearing it:
Mat mat1(90,90,CV_32F);
load_vector_of_matrices("filename",my_vect); //this loads 16K elements
//do something
for (size_t i = 0; i < my_vect.size(); ++i)
correlate(mat1, my_vect.at(i));
my_vect.clear();
I'm loading all 16K elements together for the sake of efficiency.
My question: reading all these matrices takes 3-4 seconds, and my_vect.clear() takes approximately the same amount of time, which is a lot.
According to this answer, clearing should take O(n) time, which I assume is because Mat doesn't have a trivial destructor.
Why does clearing take so much time? Does the Mat destructor overwrite every element of the matrix? Is there a way to reduce the time needed to clear the vector?
EDIT:
I'm using Visual Studio 2010 and the level of optimization is Maximize Speed(/O2).

First, a streaming loader. Provide it with a function that, given a max, returns a vector of data (aka loader<T>). It can store internal state, but it will be copied, so store that internal state in a std::shared_ptr. I guarantee that only one copy of it will be invoked.
You are not responsible for returning all max data from your loader, but as written you must return at least 1 element. Returning more is gravy, and may reduce threading overhead.
You then call streaming_loader<T>( your_loader, count ).
It returns a std::shared_ptr< std::vector< std::future< T > > >. You can wait on these futures, but you must wait on them in order (the 2nd one is not guaranteed to be ready to be waited on until the first one has provided data).
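#include <cstddef>
#include <functional>
#include <future>
#include <iostream>
#include <memory>
#include <vector>
template<class T>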
using loader = std::function< std::vector<T>(size_t max) >;
template<class T>
using stream_data = std::shared_ptr< std::vector< std::future<T> > >;
namespace details {
template<class T>
T streaming_load_some( loader<T> l, size_t start, stream_data<T> data ) {
auto loaded = l(data->size()-start);
// populate the stuff after start first, so they are ready:
for( size_t i = 1; i < loaded.size(); ++i ) {
std::promise<T> promise;
promise.set_value( std::move(loaded[i]) );
(*data)[start+i] = promise.get_future();
}
if (start+loaded.size() < data->size()) {
// recurse:
std::size_t new_start = start+loaded.size();
(*data)[new_start] = std::async(
std::launch::async,
[l, new_start, data]{return streaming_load_some<T>( l, new_start, data );}
);
}
// populate the future:
return std::move(loaded.front());
}
}
template<class T>
stream_data< T >
streaming_loader( loader<T> l, size_t n ) {
auto retval = std::make_shared<std::vector< std::future<T> >>(n);
if (retval->empty()) return retval;
retval->front() = std::async(
std::launch::async,
[retval, l]()->T{return details::streaming_load_some<T>( l, 0, retval );
});
return retval;
}
For use, you take the stream_data<T> (aka a shared pointer to vector of future data), iterate over it, and .get() each in turn. Then do your processing. If you need a block of 50 of them, call .get() on each in turn until you get to 50 -- do not skip to number 50.
Here is a completely toy loader and test harness:
struct loader_state {
int index = 0;
};
struct test_loader {
std::shared_ptr<loader_state> state; // current loading state stored here
std::vector<int> operator()( std::size_t max ) const {
std::size_t amt = max/2+1;// really, really stupid way to decide how much to load
std::vector<int> retval;
retval.reserve(amt);
for (size_t i = 0; i < amt; ++i) {
retval.push_back( -(int)(state->index + i) ); // populate the return value
}
state->index += amt;
return retval;
}
// in real code, make this constructor do something:
test_loader():state(std::make_shared<loader_state>()) {}
};
int main() {
auto data = streaming_loader<int>( test_loader{}, 1024 );
std::size_t count = 0;
for( std::future<int>& x : *data ) {
++count;
int value = x.get(); // get data
// process. In this case, print out 100 in blocks of 10:
if (count * 100 / data->size() > (count-1) * 100 / data->size())
std::cout << value << ", ";
if (count * 10 / data->size() > (count-1) * 10 / data->size())
std::cout << "\n";
}
std::cout << std::endl;
// your code goes here
return 0;
}
count may or may not be worthless. The internal state of the loader above is pretty darn worthless, I just use it to demonstrate how to store some state.
You can do something similar to destroy a pile of objects without waiting for their destructors to complete. Or, you can rely on the fact that destroying your data can happen while you are working on it and waiting for the next data to load.
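The "destroy without waiting" idea can be sketched roughly as below. This is a minimal sketch, not code from the answer: the helper name destroy_in_background is hypothetical, it needs C++14 for the init-capture, and it assumes the element destructors are safe to run on another thread.

```cpp
#include <cassert>
#include <future>
#include <utility>
#include <vector>

// Hypothetical helper: empties `v` right away and runs the element
// destructors on a worker thread.
// Assumption: destroying T on a different thread is safe.
template <class T>
std::future<void> destroy_in_background(std::vector<T>& v) {
    // Swapping with an empty vector is O(1) and leaves `v` empty.
    std::vector<T> doomed;
    doomed.swap(v);
    assert(v.empty());
    return std::async(std::launch::async,
                      [d = std::move(doomed)]() mutable {
                          d.clear();  // the expensive destructors run here
                      });
}
```

Keep the returned future alive while you continue working: a future obtained from std::async blocks in its destructor until the task has finished, so discarding it immediately would bring the wait back.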
In an industrial strength solution, you'd need to include ways to abort all this stuff, among other things. Exceptions might be one way. Also, feedback to the loader about how far behind the processing code is can be helpful (if it is at your heels, return smaller chunks -- if it is way behind, return larger chunks). In theory, that can be arranged via a back channel in loader<T>.
Now that I have played with the above for a bit, a probably better fit is:
#include <iostream>
#include <future>
#include <functional>
#include <vector>
#include <memory>
// if it returns empty, there is nothing more to load:
template<class T>
using loader = std::function< std::vector<T>() >;
template<class T>
struct next_data;
template<class T>
struct streamer {
std::vector<T> data;
std::unique_ptr<next_data<T>> next;
};
template<class T>
struct next_data:std::future<streamer<T>> {
using parent = std::future<streamer<T>>;
using parent::parent;
next_data( parent&& o ):parent(std::move(o)){}
};
This version requires some infrastructure to populate that very first streamer<T>, but the code is simpler, and the strange requirement (knowing how much data there is, and only calling .get() in order starting from the first element) goes away.
template<class T>
streamer<T> stream_step( loader<T> l ) {
streamer<T> retval;
retval.data = l();
if (retval.data.empty())
return retval;
retval.next.reset( new next_data<T>(std::async( std::launch::async, [l](){ return stream_step(l); })));
return retval;
}
template<class T>
streamer<T> start_stream( loader<T> l ) {
streamer<T> retval;
retval.next.reset( new next_data<T>(std::async( std::launch::async, [l](){ return stream_step(l); })));
return retval;
}
A downside is that writing a range-based iterator becomes a bit trickier.
Here is a sample use of the second implementation:
struct counter {
std::size_t max;
std::size_t current = 0;
counter( std::size_t m ):max(m) {}
std::vector<int> operator()() {
std::vector<int> retval;
std::size_t do_at_most = 100;
while( current < max && (do_at_most-->0)) {
retval.push_back( int(current) );
++current;
}
return retval;
}
};
int main() {
streamer<int> s = start_stream<int>( counter(1024) );
while(true) {
for (int x : s.data) {
std::cout << x << ",";
}
std::cout << std::endl;
if (!s.next)
break;
s = std::move(s.next->get());
}
// your code goes here
return 0;
}
where counter is a trivial loader (an object that reads data into a std::vector<T> in whatever sized chunks it feels like). The processing of the data is in the main code, where we just print them out in get-sized chunks.
The loading happens in a different thread, and will continue asynchronously whatever the main thread does. The main thread just gets delivered std::vector<T> to do with as they will. In your case, you'd make T a Mat.

Mat objects are complex objects that own internally allocated memory. When you clear the vector, it has to iterate over every contained Mat and run its destructor, which is itself a non-trivial operation.
Also remember that free-store memory operations are non-trivial: depending on your heap implementation, the heap may decide to merge cells, etc.
If this is a problem, run your clear through a profiler and find out where the bottleneck is.
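Before reaching for a full profiler, a rough wall-clock measurement can confirm that the time really goes into clear(). The sketch below is only an illustration: FakeMat and time_clear_ms are hypothetical names, and a std::vector<float> stands in for cv::Mat (which is reference counted, so real numbers may differ).

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Stand-in for cv::Mat (an assumption for illustration): each element owns
// its own heap buffer, so clear() runs one destructor and one deallocation
// per element.
using FakeMat = std::vector<float>;

// Returns the wall-clock time spent inside clear(), in milliseconds.
double time_clear_ms(std::vector<FakeMat>& mats) {
    auto t0 = std::chrono::steady_clock::now();
    mats.clear();
    auto t1 = std::chrono::steady_clock::now();
    assert(mats.empty());
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling it on something like std::vector<FakeMat>(16000, FakeMat(90 * 90)) gives a baseline for 16K separate deallocations on your heap implementation.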

Be careful using optimization; it can drive the debugger crazy.
What if you did this in a function and simply let the vector go out of scope?
Since the elements are not pointers, I think this will work.

Related

Optimizing huge graph traversal with OpenMP

I am trying to optimize this function, which according to the perf tool is the bottleneck that keeps me from achieving close to linear scaling. Performance gets worse as the number of threads goes up. When I drill down into the assembly annotated by perf, most of the time is spent checking whether vertices have been visited. I've done a ton of Google searches to improve the performance, to no avail. Is there a way to improve the performance of this function? Or is there a thread-safe way of implementing it? Thanks for your help in advance!
typedef uint32_t vidType;
template<typename T, typename U, typename V>
bool compare_and_swap(T &x, U old_val, V new_val) {
return __sync_bool_compare_and_swap(&x, old_val, new_val);
}
template<bool map_vertices, bool map_edges>
VertexSet GraphT<map_vertices, map_edges>::N(vidType vid) const {
assert(vid >= 0);
assert(vid < n_vertices);
eidType begin = vertices[vid], end = vertices[vid+1];
if (begin > end || end > n_edges) {
fprintf(stderr, "vertex %u bounds error: [%lu, %lu)\n", vid, begin, end);
exit(1);
}
assert(end <= n_edges);
return VertexSet(edges + begin, end - begin, vid);
}
void bfs_step(Graph &g, vidType *depth, SlidingQueue<vidType> &queue) {
#pragma omp parallel
{
QueueBuffer<vidType> lqueue(queue);
#pragma omp for
for (auto q_iter = queue.begin(); q_iter < queue.end(); q_iter++) {
auto src = *q_iter;
for (auto dst : g.N(src)) {
//int curr_val = parent[dst];
auto curr_val = depth[dst];
if (curr_val == MYINFINITY) { // not visited
//if (compare_and_swap(parent[dst], curr_val, src)) {
if (compare_and_swap(depth[dst], curr_val, depth[src] + 1)) {
lqueue.push_back(dst);
}
}
}
}
lqueue.flush();
}
}
First of all, you're using a very traditional formulation of graph algorithms. Good for textbooks, not for computation. If you write this as a generalized matrix-vector product with the adjacency matrix you lose all those fiddly queues and the parallelism becomes quite obvious.
In your formulation, the problem is with the push_back function on the queue. That is hard to parallelize. The solution is to let each thread have its own queue, and then use a reduction. This works if you define the plus operator on your queue object to effect a merge of the local queues.

Boost Fibonacci Heap Access Violation during pop()

Context
I'm currently implementing some form of A* algorithm. I decided to use boost's fibonacci heap as underlying priority queue.
My Graph is being built while the algorithm runs. As Vertex object I'm using:
class Vertex {
public:
Vertex(double, double);
double distance = std::numeric_limits<double>::max();
double heuristic = 0;
HeapData* fib;
Vertex* predecessor = nullptr;
std::vector<Edge*> adj;
double euclideanDistanceTo(Vertex* v);
};
My Edge looks like:
class Edge {
public:
Edge(Vertex*, double);
Vertex* vertex = nullptr;
double weight = 1;
};
In order to use boosts fibonacci heap, I've read that one should create a heap data object, which I did like that:
struct HeapData {
Vertex* v;
boost::heap::fibonacci_heap<HeapData>::handle_type handle;
HeapData(Vertex* u) {
v = u;
}
bool operator<(HeapData const& rhs) const {
return rhs.v->distance + rhs.v->heuristic < v->distance + v->heuristic;
}
};
Note, that I included the heuristic and the actual distance in the comparator to get the A* behaviour, I want.
My actual A* implementation looks like that:
boost::heap::fibonacci_heap<HeapData> heap;
HeapData fibs(startPoint);
startPoint->distance = 0;
startPoint->heuristic = getHeuristic(startPoint);
auto handles = heap.push(fibs);
(*handles).handle = handles;
while (!heap.empty()) {
HeapData u = heap.top();
heap.pop();
if (u.v->equals(endPoint)) {
return;
}
doSomeGraphCreationStuff(u.v); // this only creates vertices and edges
for (Edge* e : u.v->adj) {
double newDistance = e->weight + u.v->distance;
if (e->vertex->distance > newDistance) {
e->vertex->distance = newDistance;
e->vertex->predecessor = u.v;
if (!e->vertex->fib) {
if (!u.v->equals(endPoint)) {
e->vertex->heuristic = getHeuristic(e->vertex);
}
e->vertex->fib = new HeapData(e->vertex);
e->vertex->fib->handle = heap.push(*(e->vertex->fib));
}
else {
heap.increase(e->vertex->fib->handle);
}
}
}
}
Problem
The algorithm runs just fine if I use a very small heuristic (which degenerates A* to Dijkstra). If I introduce a stronger heuristic, however, the program throws an exception stating:
0xC0000005: Access violation writing location 0x0000000000000000.
in the unlink method of boosts circular_list_algorithm.hpp. For some reason, next and prev are null. This is a direct consequence of calling heap.pop().
Note that heap.pop() works fine for several times and does not crash immediately.
Question
What causes this problem and how can I fix it?
What I have tried
My first thought was that I accidentally called increase() even though distance + heuristic got bigger instead of smaller (according to the documentation, this can break stuff). This is not possible in my implementation, however, because I can only change a node if the distance got smaller. The heuristic stays constant. I tried to use update() instead of increase() anyway, without success.
I tried to set several break points to get a more detailed view, but my data set is huge and I fail to reproduce it with smaller sets.
Additional Information
Boost Version: 1.76.0
C++14
The increase function is indeed right (instead of a decrease function), since all Boost heaps are implemented as max-heaps. We get a min-heap by reversing the comparator and swapping the roles of increase and decrease.
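The comparator reversal can be illustrated without Boost. The sketch below uses std::priority_queue as a stand-in (it is also a max-heap by default); the aliases and the pop_order helper are made up for illustration, and the standard container has no increase/decrease operations, so only the ordering part carries over.

```cpp
#include <functional>
#include <queue>
#include <vector>

// Default comparator (std::less): top() is the largest element.
using MaxHeap = std::priority_queue<double>;

// Reversed comparator (std::greater): top() is the smallest element.
using MinHeap =
    std::priority_queue<double, std::vector<double>, std::greater<double>>;

// Pushes the costs into the reversed heap and returns them in pop order.
std::vector<double> pop_order(const std::vector<double>& costs) {
    MinHeap heap(std::greater<double>(), costs);
    std::vector<double> order;
    while (!heap.empty()) {
        order.push_back(heap.top());
        heap.pop();
    }
    return order;
}
```

With the reversed comparator, a lower cost means a higher priority, which is why lowering a vertex's cost corresponds to calling increase on a Boost heap.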
Okay, prepare for a ride.
First I found a bug
Next, I fully reviewed, refactored and simplified the code
When the dust settled, I noticed a behaviour change that looked like a potential logic error in the code
1. The Bug
Like I commented at the question, the code complexity is high due to over-reliance on raw pointers without clear semantics.
While I was reviewing and refactoring the code, I found that this has, indeed, led to a bug:
e->vertex->fib = new HeapData(e->vertex);
e->vertex->fib->handle = heap.push(*(e->vertex->fib));
In the first line you create a HeapData object. You make the fib member point to that object.
The second line inserts a copy of that object (meaning, it's a new object, with a different object identity, or practically speaking: a different address).
So, now
e->vertex->fib points to a (leaked) HeapData object that does not exist in the queue, and
the actual queued HeapData copy has a default-constructed handle member, which means that the handle wraps a null pointer. (Check boost::heap::detail::node_handle<> in detail/stable_heap.hpp to verify this).
This would handsomely explain the symptom you are seeing.
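The copy-versus-identity problem can be reproduced in miniature without Boost. In this sketch std::list stands in for the heap and a plain int stands in for the handle (with -1 meaning "no handle"); Item, Outcome and push_then_set_handle are hypothetical names chosen for illustration.

```cpp
#include <list>
#include <memory>

struct Item {
    int key;
    int handle;  // -1 means "not linked to a queue slot"
    explicit Item(int k) : key(k), handle(-1) {}
};

struct Outcome {
    bool same_object;    // is the queued element the object we allocated?
    int queued_handle;   // handle stored inside the queued element
    int outside_handle;  // handle stored in the object we still point to
};

Outcome push_then_set_handle() {
    std::list<Item> queue;
    std::unique_ptr<Item> outside(new Item(42));
    queue.push_back(*outside);  // stores a COPY of *outside
    outside->handle = 0;        // updates only the object outside the queue
    return Outcome{&queue.back() == outside.get(), queue.back().handle,
                   outside->handle};
}
```

The queued element is a different object and never sees the handle assignment, which mirrors what happens when heap.push(*(e->vertex->fib)) copies the HeapData before its handle member is set.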
2. Refactor, Simplify
So, after understanding the code I have come to the conclusion that
HeapData and Vertex should be merged: HeapData only served to link a handle to a Vertex, but you can already make the Vertex contain a Handle directly.
As a consequence of this merge
your vertex queue now actually contains vertices, expressing intent of the code
you reduce all of the vertex access by one level of indirection (reducing Law Of Demeter violations)
you can write the push operation in one natural line, removing the room for your bug to crop up. Before:
target->fib = new HeapData(target);
target->fib->handle = heap.push(*(target->fib));
After:
target->fibhandle = heap.push(target);
Your Edge class doesn't actually model an edge, but rather an "adjacency" - the target
part of the edge, with the weight attribute.
I renamed it OutEdge for clarity and also changed the vector to contain values instead of
dynamically allocated OutEdge instances.
I can't tell from the code shown, but I can almost guarantee these were
being leaked.
Also, OutEdge is only 16 bytes on most platforms, so copying them will be fine, and adjacencies are by definition owned by their source vertex (because including/moving it to another source vertex would change the meaning of the adjacency).
In fact, if you're serious about performance, you may want to use a boost::container::small_vector with a suitably chosen capacity if you know that e.g. the median number of edges is "small"
Your comparison can be "outsourced" to a function object
using Node = Vertex*;
struct PrioCompare {
bool operator()(Node a, Node b) const;
};
After which the heap can be typed as:
namespace bh = boost::heap;
using Heap = bh::fibonacci_heap<Node, bh::compare<PrioCompare>>;
using Handle = Heap::handle_type;
Your cost function violated more Law-Of-Demeter, which was easily fixed by adding a Literate-Code accessor:
Cost cost() const { return distance + heuristic; }
From quick inspection I think it would be more accurate to use infinity() over max() as the initial distance. Also, use a constant for readability:
static constexpr auto INF = std::numeric_limits<Cost>::infinity();
Cost distance = INF;
You had a repeated check for xyz->equals(endPoint) to avoid updating the heuristic for a vertex. I suggest moving the update till after vertex dequeue, so the repetition can be gone (of both the check and the getHeuristic(...) call).
Like you said, we need to tread carefully around the increase/update fixup methods. As I read your code, the priority of a node is inversely related to the "cost" (cumulative edge-weight and heuristic values).
Because Boost Heap heaps are max heaps this implies that increasing the priority should match decreasing cost. We can just assert this to detect any programmer error in debug builds:
assert(target->cost() < previous_cost);
heap.increase(target->fibhandle);
With these changes in place, the code can read a lot quieter:
Cost AStarSearch(Node start, Node destination) {
Heap heap;
start->distance = 0;
start->fibhandle = heap.push(start);
while (!heap.empty()) {
Node u = heap.top();
heap.pop();
if (u->equals(destination)) {
return u->cost();
}
u->heuristic = getHeuristic(u);
doSomeGraphCreationStuff(u);
for (auto& [target, weight] : u->adj) {
auto curDistance = weight + u->distance;
// if cheaper route, queue or update queued
if (curDistance < target->distance) {
auto cost_prior = target->cost();
target->distance = curDistance;
target->predecessor = u;
if (target->fibhandle == NOHANDLE) {
target->fibhandle = heap.push(target);
} else {
assert(target->cost() < cost_prior);
heap.update(target->fibhandle);
}
}
}
}
return INF;
}
2(b) Live Demo
Adding some test data:
Live On Coliru
#include <boost/heap/fibonacci_heap.hpp>
#include <cassert>
#include <iostream>
#include <limits>
#include <vector>
using Cost = double;
struct Vertex;
Cost getHeuristic(Vertex const*) { return 0; }
void doSomeGraphCreationStuff(Vertex const*) {
// this only creates vertices and edges
}
struct OutEdge { // adjacency from implied source vertex
Vertex* target = nullptr;
Cost weight = 1;
};
namespace bh = boost::heap;
using Node = Vertex*;
struct PrioCompare {
bool operator()(Node a, Node b) const;
};
using Heap = bh::fibonacci_heap<Node, bh::compare<PrioCompare>>;
using Handle = Heap::handle_type;
static const Handle NOHANDLE{}; // for expressive comparisons
static constexpr auto INF = std::numeric_limits<Cost>::infinity();
struct Vertex {
Vertex(Cost d = INF, Cost h = 0) : distance(d), heuristic(h) {}
Cost distance = INF;
Cost heuristic = 0;
Handle fibhandle{};
Vertex* predecessor = nullptr;
std::vector<OutEdge> adj;
Cost cost() const { return distance + heuristic; }
Cost euclideanDistanceTo(Vertex* v);
bool equals(Vertex const* u) const { return this == u; }
};
// Now Vertex is a complete type, implement comparison
bool PrioCompare::operator()(Node a, Node b) const {
return a->cost() > b->cost();
}
Cost AStarSearch(Node start, Node destination) {
Heap heap;
start->distance = 0;
start->fibhandle = heap.push(start);
while (!heap.empty()) {
Node u = heap.top();
heap.pop();
if (u->equals(destination)) {
return u->cost();
}
u->heuristic = getHeuristic(u);
doSomeGraphCreationStuff(u);
for (auto& [target, weight] : u->adj) {
auto curDistance = weight + u->distance;
// if cheaper route, queue or update queued
if (curDistance < target->distance) {
auto cost_prior = target->cost();
target->distance = curDistance;
target->predecessor = u;
if (target->fibhandle == NOHANDLE) {
target->fibhandle = heap.push(target);
} else {
assert(target->cost() < cost_prior);
heap.update(target->fibhandle);
}
}
}
}
return INF;
}
int main() {
// a very very simple graph data structure with minimal helpers:
std::vector<Vertex> graph(10);
auto node = [&graph](int id) { return &graph.at(id); };
auto id = [&graph](Vertex const* node) { return node - graph.data(); };
// defining 6 edges
graph[0].adj = {{node(2), 1.5}, {node(3), 15}};
graph[2].adj = {{node(4), 2.5}, {node(1), 5}};
graph[1].adj = {{node(7), 0.5}};
graph[7].adj = {{node(3), 0.5}};
// do a search
Node startPoint = node(0);
Node endPoint = node(7);
Cost cost = AStarSearch(startPoint, endPoint);
std::cout << "Overall cost: " << cost << ", reverse path: \n";
for (Node node = endPoint; node != nullptr; node = node->predecessor) {
std::cout << " - " << id(node) << " distance " << node->distance
<< "\n";
}
}
Prints
Overall cost: 7, reverse path:
- 7 distance 7
- 1 distance 6.5
- 2 distance 1.5
- 0 distance 0
3. The Plot Twist: Lurking Logic Errors?
I felt uneasy about moving the getHeuristic() update around. I wondered
whether I might have changed the meaning of the code, even though the control
flow seemed to check out.
And then I realized that indeed the behaviour changed. It is subtle. At first I thought
the old behaviour was just problematic. So, let's analyze:
The source of the risk is an inconsistency in node visitation vs. queue prioritization.
When visiting nodes, the condition to see whether the target became "less
distant" is expressed in terms of distance only.
However, the queue priority will be based on cost, which is different
from distance in that it includes any heuristics.
The problem lurking there is that it is possible to write code where the
fact that distance decreases NEED NOT guarantee that cost decreases.
Going back to the code, we can see that this is narrowly avoided, because the
getHeuristic update is only executed in the non-update path of the code.
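A small numeric illustration of that risk, with made-up numbers: if the heuristic were re-evaluated on the update path and came back larger, a shorter distance could still produce a higher cost. Candidate and cost_decreases are hypothetical names used only for this sketch.

```cpp
struct Candidate {
    double distance;
    double heuristic;
    double cost() const { return distance + heuristic; }
};

// True if applying a shorter distance together with a (possibly re-evaluated)
// heuristic still lowers the queue key, i.e. the cost.
bool cost_decreases(Candidate before, double new_distance,
                    double new_heuristic) {
    Candidate after{new_distance, new_heuristic};
    return after.cost() < before.cost();
}
```

Starting from distance 10 and heuristic 2 (cost 12), a new distance of 9 with the heuristic unchanged gives cost 11, which is safe to pass to increase. A new distance of 9 with the heuristic re-evaluated to 5 gives cost 14, which would violate the precondition.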
In Conclusion
Understanding this made me realize that
the Vertex::heuristic field is intended merely as a "cached" version of the getHeuristic() function call
implying that that function is treated as if it is idempotent
that my version did change behaviour in that getHeuristic was now
potentially executed more than once for the same vertex (if visited again
via a cheaper path)
I would suggest to fix this by
renaming the heuristic field to cachedHeuristic
making an enqueue function to encapsulate the three steps for enqueuing a vertex:
simply omitting the endpoint check because it can at MOST eliminate a single invocation of getHeuristic for that node, probably not worth the added complexity
add a comment pointing out the subtlety of that code path
UPDATE: as discovered in the comments, we also need the inverse
operation (dequeue) to symmetrically update handle so it reflects that
the node is no longer in the queue...
It also drives home the usefulness of having the precondition assert added before invoking Heap::increase.
Final Listing
With the above changes
encapsulated into a Graph object, that
also reads the graph from input like:
0 2 1.5
0 3 15
2 4 2.5
2 1 5
1 7 0.5
7 3 0.5
Where each line contains (source, target, weight).
A separate file can contain heuristic values for vertices index [0, ...),
optionally newline-separated, e.g. "7 11 99 33 44 55"
and now returning the arrived-at node instead of its cost only
Live On Coliru
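#include <boost/heap/fibonacci_heap.hpp>
#include <algorithm>
#include <cassert>
#include <deque>
#include <fstream>
#include <iostream>
#include <iterator>
#include <limits>
#include <stdexcept>
#include <string>
#include <vector>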
using Cost = double;
struct Vertex;
struct OutEdge { // adjacency from implied source vertex
Vertex* target = nullptr;
Cost weight = 1;
};
namespace bh = boost::heap;
using Node = Vertex*;
struct PrioCompare {
bool operator()(Node a, Node b) const;
};
using MutableQueue = bh::fibonacci_heap<Node, bh::compare<PrioCompare>>;
using Handle = MutableQueue::handle_type;
static const Handle NOHANDLE{}; // for expressive comparisons
static constexpr auto INF = std::numeric_limits<Cost>::infinity();
struct Vertex {
Vertex(Cost d = INF, Cost h = 0) : distance(d), cachedHeuristic(h) {}
Cost distance = INF;
Cost cachedHeuristic = 0;
Handle handle{};
Vertex* predecessor = nullptr;
std::vector<OutEdge> adj;
Cost cost() const { return distance + cachedHeuristic; }
Cost euclideanDistanceTo(Vertex* v);
};
// Now Vertex is a complete type, implement comparison
bool PrioCompare::operator()(Node a, Node b) const {
return a->cost() > b->cost();
}
class Graph {
std::vector<Cost> _heuristics;
Cost getHeuristic(Vertex* v) {
size_t n = id(v);
return n < _heuristics.size() ? _heuristics[n] : 0;
}
void doSomeGraphCreationStuff(Vertex const*) {
// this only creates vertices and edges
}
public:
Graph(std::string edgeFile, std::string heurFile) {
{
std::ifstream stream(heurFile);
_heuristics.assign(std::istream_iterator<Cost>(stream), {});
if (!stream.eof())
throw std::runtime_error("Unexpected heuristics");
}
std::ifstream stream(edgeFile);
size_t src, tgt;
double weight;
while (stream >> src >> tgt >> weight) {
_nodes.resize(std::max({_nodes.size(), src + 1, tgt + 1}));
_nodes[src].adj.push_back({node(tgt), weight});
}
if (!stream.eof())
throw std::runtime_error("Unexpected input");
}
Node search(size_t from, size_t to) {
assert(from < _nodes.size());
assert(to < _nodes.size());
return AStar(node(from), node(to));
}
size_t id(Node node) const {
// ugh, this is just for "pretty output"...
for (size_t i = 0; i < _nodes.size(); ++i) {
if (node == &_nodes[i])
return i;
}
throw std::out_of_range("id");
};
Node node(int id) { return &_nodes.at(id); };
private:
// simple graph data structure with minimal helpers:
std::deque<Vertex> _nodes; // reference stable when growing at the back
// search state
MutableQueue _queue;
void enqueue(Node n) {
assert(n && (n->handle == NOHANDLE));
// get heuristic before insertion!
n->cachedHeuristic = getHeuristic(n);
n->handle = _queue.push(n);
}
Node dequeue() {
Node node = _queue.top();
node->handle = NOHANDLE;
_queue.pop();
return node;
}
Node AStar(Node start, Node destination) {
_queue.clear();
start->distance = 0;
enqueue(start);
while (!_queue.empty()) {
Node u = dequeue();
if (u == destination) {
return u;
}
doSomeGraphCreationStuff(u);
for (auto& [target, weight] : u->adj) {
auto curDistance = u->distance + weight;
// if cheaper route, queue or update queued
if (curDistance < target->distance) {
auto cost_prior = target->cost();
target->distance = curDistance;
target->predecessor = u;
if (target->handle == NOHANDLE) {
// also caches heuristic
enqueue(target);
} else {
// NOTE: avoid updating heuristic here, because it
// breaks the queue invariant if heuristic increased
// more than decrease in distance
assert(target->cost() < cost_prior);
_queue.increase(target->handle);
}
}
}
}
return nullptr;
}
};
int main() {
Graph graph("input.txt", "heur.txt");
Node arrival = graph.search(0, 7);
std::cout << "reverse path: \n";
for (Node n = arrival; n != nullptr; n = n->predecessor) {
std::cout << " - " << graph.id(n) << " cost " << n->cost() << "\n";
}
}
Again, printing the expected
reverse path:
- 7 cost 7
- 1 cost 17.5
- 2 cost 100.5
- 0 cost 7
Note how the heuristics changed the cost, but not the optimal path in this case.

Inserting multiple values into a vector at specific positions

Say I have a vector of integers like this std::vector<int> _data;
I know that if I want to remove multiple items from _data, then I can simply call
_data.erase( std::remove_if( _data.begin(), _data.end(), [condition] ), _data.end() );
This is much faster than erasing the elements one at a time, as less movement of data is required within the vector. I'm wondering if there's something similar for insertions.
For example, if I have the following pairs
auto pair1 = std::make_pair( _data.begin() + 5, 5 );
auto pair2 = std::make_pair( _data.begin() + 12, 12 );
Can I insert both of these in one iteration using some existing std function? I know I can do something like:
_data.insert( pair2.first, pair2.second );
_data.insert( pair1.first, pair1.second );
But this is (very) slow for large vectors (talking 100,000+ elements).
EDIT: Basically, I have a custom set (and map) which use a vector as the underlying containers. I know I can just use std::set or std::map, but the number of traversals I do far outweighs the insertion/removals. Switching from a set and map to this custom set/map already cut 20% of run-time off. Currently though, insertions take approximately 10% of the remaining run time, so reducing that is important.
The order is also required, unfortunately. As much as possible, I use the unordered_ versions, but in some places the order does matter.
One way is to create another vector with capacity equal to the original size plus the number of the elements being inserted and then do an insert loop with no reallocations, O(N) complexity:
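#include <cstddef>
#include <initializer_list>
#include <iostream>
#include <utility>
#include <vector>
template<class T>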
std::vector<T> insert_elements(std::vector<T> const& v, std::initializer_list<std::pair<std::size_t, T>> new_elements) {
std::vector<T> u;
u.reserve(v.size() + new_elements.size());
auto src = v.begin();
size_t copied = 0;
for(auto const& element : new_elements) {
auto to_copy = element.first - copied;
auto src_end = src + to_copy;
u.insert(u.end(), src, src_end);
src = src_end;
copied += to_copy;
u.push_back(element.second);
}
u.insert(u.end(), src, v.end());
return u;
}
int main() {
std::vector<int> v{1, 3, 5};
for(auto e : insert_elements(v, {{1,2}, {2,4}}))
std::cout << e << ' ';
std::cout << '\n';
}
Output:
1 2 3 4 5
Ok, we need some assumptions. Let old_end be a reverse iterator to the last element of your vector. Assume that your _data has been resized to exactly fit both its current content and what you want to insert. Assume that inp is a container of std::pair containing your data to be inserted that is ordered reversely (so first the element that is to be inserted at the hindmost position and so on). Then we can do:
// Sketch only: std::merge requires that the output range not overlap the inputs.
std::merge(old_end, _data.rend(), inp.begin(), inp.end(), _data.rbegin(), [&_data, i = inp.size()-1](const T& t, const std::pair<Iter, T>& p) mutable {
if( std::distance(_data.begin(), p.first) == i ) {
--i;
return false;
}
return true;
});
But I think that is no clearer than using a good old for loop. The problem with the STL algorithms is that the predicates work on values and not on iterators, which is a bit annoying for this problem.
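For comparison, the "good old for" version of the same idea might look like the sketch below. It is not the answerer's code: the function name is made up, and it assumes the insertion indices are pre-insertion positions in [0, data.size()] sorted in ascending order, and that T is default-constructible and movable.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Inserts (index, value) pairs into `data` in a single backward pass.
// Assumptions: indices refer to positions in the ORIGINAL vector and are
// sorted ascending; T is default-constructible (needed by resize).
template <class T>
void insert_sorted_positions(
    std::vector<T>& data,
    const std::vector<std::pair<std::size_t, T>>& inp) {
    std::size_t read = data.size();         // one past the last old element
    data.resize(data.size() + inp.size());  // grow once, up front
    std::size_t write = data.size();        // one past the last free slot
    for (std::size_t k = inp.size(); k-- > 0;) {
        assert(inp[k].first <= read);
        // Shift the old elements located at or after the insertion index.
        while (read > inp[k].first) {
            --write;
            --read;
            data[write] = std::move(data[read]);
        }
        --write;
        data[write] = inp[k].second;  // place the new value
    }
    // Elements before the first insertion index are already in position.
}
```

Every old element is moved at most once, so the whole operation is linear in the final size, and no iterators are held across the resize.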
Here's my take:
template<class Key, class Value>
class LinearSet
{
public:
using Node = std::pair<Key, Value>;
template<class F>
void insert_at_multiple(F&& f)
{
std::queue<Node> queue;
std::size_t index = 0;
for (auto it = _kvps.begin(); it != _kvps.end(); ++it)
{
// The container size is left untouched here, no iterator invalidation.
if (std::optional<Node> toInsert = f(index))
{
queue.push(*it);
*it = std::move(*toInsert);
}
else
{
++index;
// Replace current node with queued one.
if (!queue.empty())
{
queue.push(std::move(*it));
*it = std::move(queue.front());
queue.pop();
}
}
}
// We now have as many displaced items in the queue as were inserted,
// add them to the end.
while (!queue.empty())
{
_kvps.emplace_back(std::move(queue.front()));
queue.pop();
}
}
private:
std::vector<Node> _kvps;
};
https://godbolt.org/z/EStKgQ
This is a linear time algorithm that doesn't need to know the number of inserted elements a priori. For each index, it asks for an element to insert there. If it gets one, it pushes the corresponding existing vector element to a queue and replaces it with the new one. Otherwise, it extracts the current item to the back of the queue and puts the item at the front of the queue into the current position (noop if no elements were inserted yet). Note that the vector size is left untouched during all this. Only at the end do we push back all items still in the queue.
Note that the indices we use for determining inserted item locations here are all pre-insertion. I find this a point of potential confusion (and it is a limitation - you can't add an element at the very end with this algorithm. Could be remedied by calling f during the second loop too, working on that...).
Here's a version that allows inserting arbitrarily many elements at the end (and everywhere else). It passes post-insertion indices to the functor!
template<class F>
void insert_at_multiple(F&& f)
{
std::queue<Node> queue;
std::size_t index = 0;
for (auto it = _kvps.begin(); it != _kvps.end(); ++it)
{
if (std::optional<Node> toInsert = f(index))
queue.push(std::move(*toInsert));
if (!queue.empty())
{
queue.push(std::move(*it));
*it = std::move(queue.front());
queue.pop();
}
++index;
}
// We now have as many displaced items in the queue as were inserted,
// add them to the end.
while (!queue.empty())
{
if (std::optional<Node> toInsert = f(index))
{
queue.push(std::move(*toInsert));
}
_kvps.emplace_back(std::move(queue.front()));
queue.pop();
++index;
}
}
https://godbolt.org/z/DMuCtJ
Again, this leaves potential for confusion over what it means to insert at indices 0 and 1 (do you end up with an original element in between the two? In the first snippet you would, in the second you wouldn't). Can you insert at the same index multiple times? With pre-insertion indices that makes sense, with post-insertion indices it doesn't. You could also write this in terms of passing the current *it (i.e. key value pair) to the functor, but that alone seems not too useful...
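To make the pre-insertion semantics concrete, here is a minimal sketch of the first snippet rewritten as a free function on a plain std::vector, so it can be run without the LinearSet wrapper. The names insert_before_original and demo_pre_insertion_indices are made up for this illustration, and the functor is assumed to return a value only once per index.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <queue>
#include <utility>
#include <vector>

// Free-function variant of the first snippet, operating on a plain vector.
// f(index) is called with PRE-insertion indices and returns the value to
// insert before the original element at that index, or std::nullopt.
template <class T, class F>
void insert_before_original(std::vector<T>& v, F&& f)
{
    std::queue<T> displaced;
    std::size_t index = 0;
    for (auto it = v.begin(); it != v.end(); ++it)
    {
        if (std::optional<T> toInsert = f(index))
        {
            // Park the current element and put the new one in its slot.
            displaced.push(std::move(*it));
            *it = std::move(*toInsert);
        }
        else
        {
            ++index;
            if (!displaced.empty())
            {
                // Rotate: current element goes to the back of the queue,
                // the oldest displaced element takes its slot.
                displaced.push(std::move(*it));
                *it = std::move(displaced.front());
                displaced.pop();
            }
        }
    }
    // Whatever is still displaced belongs at the end, in order.
    while (!displaced.empty())
    {
        v.push_back(std::move(displaced.front()));
        displaced.pop();
    }
}

// Hypothetical demo: insert 100 before original index 0 and 200 before
// original index 1. Each index yields a value only once.
std::vector<int> demo_pre_insertion_indices()
{
    std::vector<int> v{10, 20, 30, 40};
    bool used0 = false, used1 = false;
    insert_before_original(v, [&](std::size_t i) -> std::optional<int> {
        if (i == 0 && !used0) { used0 = true; return 100; }
        if (i == 1 && !used1) { used1 = true; return 200; }
        return std::nullopt;
    });
    assert(v.size() == 6); // four originals plus two insertions
    return v;              // {100, 10, 200, 20, 30, 40}
}
```

With pre-insertion indices the original element 10 ends up between the two inserted values, which is the behaviour described for the first snippet.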
This is an attempt I made, which inserts in reverse order. I did get rid of the iterators/indices for this.
template<class T>
void insert( std::vector<T> &vector, const std::vector<T> &values ) {
size_t last_index = vector.size() - 1;
vector.resize( vector.size() + values.size() ); // relies on T being default constructable
size_t move_position = vector.size() - 1;
size_t last_value_index = values.size() - 1;
size_t values_size = values.size();
bool isLastIndex = false;
while ( !isLastIndex && values_size ) {
if ( values[last_value_index] > vector[last_index] ) {
vector[move_position] = std::move( values[last_value_index--] );
--values_size;
} else {
isLastIndex = last_index == 0;
vector[move_position] = std::move( vector[last_index--] );
}
--move_position;
}
if ( isLastIndex && values_size ) {
while ( values_size ) {
vector[move_position--] = std::move( values[last_value_index--] );
--values_size;
}
}
}
Tried with ICC, Clang, and GCC on Godbolt, and vector's insert was faster (for 5 numbers inserted). On my machine, MSVC, same result but less severe. I also compared with Maxim's version from his answer. I realize using Godbolt isn't a good method for comparison, but I don't have access to the 3 other compilers on my current machine.
https://godbolt.org/z/vjV2wA
Results from my machine:
My insert: 659us
Maxim insert: 712us
Vector insert: 315us
Godbolt's ICC
My insert: 470us
Maxim insert: 139us
Vector insert: 127us
Godbolt's GCC
My insert: 815us
Maxim insert: 97us
Vector insert: 97us
Godbolt's Clang:
My insert: 477us
Maxim insert: 188us
Vector insert: 96us

performing vector intersection in C++

I have a vector of vectors of unsigned. I need to find the intersection of all of these vectors, so I wrote the following code:
int func()
{
vector<vector<unsigned> > t;
vector<unsigned> intersectedValues;
bool firstIntersection=true;
for(int i=0;i<(t).size();i++)
{
if(firstIntersection)
{
intersectedValues=t[0];
firstIntersection=false;
}else{
vector<unsigned> tempIntersectedSubjects;
set_intersection(t[i].begin(),
t[i].end(), intersectedValues.begin(),
intersectedValues.end(),
std::inserter(tempIntersectedSubjects, tempIntersectedSubjects.begin()));
intersectedValues=tempIntersectedSubjects;
}
if(intersectedValues.size()==0)
break;
}
}
Each individual vector has 9000 elements and there are many such vectors in "t". When I profiled my code I found that set_intersection takes the maximum amount of time and hence makes the code slow when there are many invocations of func(). Can someone please suggest as to how can I make the code more efficient.
I am using: gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)
EDIT: Individual vectors in vector "t" are sorted.
I don't have a framework to profile the operations but I'd certainly change the code to reuse the readily allocated vector. In addition, I'd hoist the initial intersection out of the loop. Also, std::back_inserter() should make sure that elements are added in the correct location rather than in the beginning:
int func()
{
vector<vector<unsigned> > t = some_initialization();
if (t.empty()) {
return -1; // nothing to intersect
}
vector<unsigned> intersectedValues(t[0]);
vector<unsigned> tempIntersectedSubjects;
for (std::vector<std::vector<unsigned>>::size_type i(1u);
i < t.size() && !intersectedValues.empty(); ++i) {
std::set_intersection(t[i].begin(), t[i].end(),
intersectedValues.begin(), intersectedValues.end(),
std::back_inserter(tempIntersectedSubjects));
std::swap(intersectedValues, tempIntersectedSubjects);
tempIntersectedSubjects.clear();
}
return 0;
}
I think this code has a fair chance of being faster. It may also be reasonable to intersect the sets differently: instead of keeping one set and intersecting with that, you could create a new intersection for each pair of adjacent sets and then intersect the resulting sets with their respective adjacent ones:
std::vector<std::vector<unsigned>> intersections(
std::vector<std::vector<unsigned>> const& t) {
std::vector<std::vector<unsigned>> r;
std::vector<std::vector<unsigned>>::size_type i(0);
for (; i + 1 < t.size(); i += 2) {
r.push_back(intersect(t[i], t[i + 1]));
}
if (i < t.size()) {
r.push_back(t[i]);
}
return r;
}
std::vector<unsigned> func(std::vector<std::vector<unsigned>> const& t) {
if (t.empty()) { /* deal with t being empty... */ }
std::vector<std::vector<unsigned>> r(intersections(t));
return r.size() == 1? r[0]: func(r);
}
Of course, you wouldn't really implement it like this: you'd use Stepanov's binary counter to keep the intermediate sets. This approach assumes that the result is most likely non-empty. If the expectation is that the result will be empty that may not be an improvement.
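The snippet above calls an intersect() helper that is not shown. A minimal self-contained sketch of the pairwise idea, assuming every input vector is sorted, could look like this. Both intersect and intersect_all are illustrative names, and the loop is an iterative stand-in for the recursion, not Stepanov's binary counter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical helper: intersection of two sorted vectors.
std::vector<unsigned> intersect(const std::vector<unsigned>& a,
                                const std::vector<unsigned>& b)
{
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    std::vector<unsigned> out;
    out.reserve(std::min(a.size(), b.size()));
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// Intersect adjacent pairs repeatedly until a single vector remains.
// Returns an empty vector when there is no input at all.
std::vector<unsigned> intersect_all(std::vector<std::vector<unsigned>> t)
{
    if (t.empty()) return {};
    while (t.size() > 1) {
        std::vector<std::vector<unsigned>> next;
        next.reserve((t.size() + 1) / 2);
        std::size_t i = 0;
        for (; i + 1 < t.size(); i += 2) {
            next.push_back(intersect(t[i], t[i + 1]));
        }
        if (i < t.size()) {
            next.push_back(std::move(t[i])); // odd one out is carried over
        }
        t = std::move(next);
    }
    return std::move(t.front());
}
```

Whether this beats the running-intersection loop depends on the data: if the running result shrinks quickly, the simple loop does less work, so it is worth profiling both.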
I can't test this but maybe something like this would be faster?
int func()
{
vector<vector<unsigned> > t;
vector<unsigned> intersectedValues;
// remove if() branching from loop
if(t.empty())
return -1;
intersectedValues = t[0];
// now start from 1
for(size_t i = 1; i < t.size(); ++i)
{
vector<unsigned> tempIntersectedSubjects;
tempIntersectedSubjects.reserve(intersectedValues.size()); // pre-allocate
// insert at end() not begin()
set_intersection(t[i].begin(),
t[i].end(), intersectedValues.begin(),
intersectedValues.end(),
std::inserter(tempIntersectedSubjects, tempIntersectedSubjects.end()));
// as these are not used again you can move them rather than copy
intersectedValues = std::move(tempIntersectedSubjects);
if(intersectedValues.empty())
break;
}
return 0;
}
Another possibility:
Thinking about it using swap() could optimize the exchange of data and remove the need to re-allocate. Also then the temp constructor can be moved out of the loop.
int func()
{
vector<vector<unsigned> > t;
vector<unsigned> intersectedValues;
// remove if() branching from loop
if(t.empty())
return -1;
intersectedValues = t[0];
// no need to construct this every loop
vector<unsigned> tempIntersectedSubjects;
// now start from 1
for(size_t i = 1; i < t.size(); ++i)
{
// should already be the correct size from previous loop
// but just in case this should be cheap
// (profile removing this line)
tempIntersectedSubjects.reserve(intersectedValues.size());
// insert at end() not begin()
set_intersection(t[i].begin(),
t[i].end(), intersectedValues.begin(),
intersectedValues.end(),
std::inserter(tempIntersectedSubjects, tempIntersectedSubjects.end()));
// swap should leave tempIntersectedSubjects preallocated to the
// correct size
intersectedValues.swap(tempIntersectedSubjects);
tempIntersectedSubjects.clear(); // will not deallocate
if(intersectedValues.empty())
break;
}
return 0;
}
You can make std::set_intersection as well as a bunch of other standard library algorithms run in parallel by defining _GLIBCXX_PARALLEL during compilation. That probably has the best work-gain ratio. For documentation see this.
Obligatory pitfall warning:
Note that the _GLIBCXX_PARALLEL define may change the sizes and behavior of standard class templates such as std::search, and therefore one can only link code compiled with parallel mode and code compiled without parallel mode if no instantiation of a container is passed between the two translation units. Parallel mode functionality has distinct linkage, and cannot be confused with normal mode symbols.
from here.
Another simple, though probably insignificantly small, optimization would be reserving enough space before filling your vectors.
Also, try to find out whether inserting the values at the back instead of the front and then reversing the vector helps. (Although I even think that your code is wrong right now and your intersectedValues is sorted the wrong way. If I'm not mistaken, you should use std::back_inserter instead of std::inserter(...,begin) and then not reverse.) While shifting stuff through memory is pretty fast, not shifting should be even faster.
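On the ordering concern: as far as I know, std::inserter(c, c.begin()) does not reverse anything, because std::insert_iterator advances past each element it inserts. The sketch below (function names are made up for the comparison) checks that both output iterators produce the same sorted result, so any difference between them should be about cost rather than order.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Intersection written through std::inserter positioned at begin().
std::vector<unsigned> intersect_with_inserter(const std::vector<unsigned>& a,
                                              const std::vector<unsigned>& b)
{
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    std::vector<unsigned> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.begin()));
    return out;
}

// Same intersection written through std::back_inserter.
std::vector<unsigned> intersect_with_back_inserter(const std::vector<unsigned>& a,
                                                   const std::vector<unsigned>& b)
{
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    std::vector<unsigned> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```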
Copying the elements across in a for loop with emplace_back() may save you some time. There is also no need for a flag if the loop starts at index 1: the first vector is copied before the loop, so the per-iteration condition check disappears.
void func()
{
vector<vector<unsigned > > t;
if(t.empty())
return;
// start from the first vector, then intersect with the rest
vector<unsigned int > intersectedValues=t[0];
for(unsigned int i=1;i<t.size();i++)
{
vector<unsigned > tempIntersectedSubjects;
set_intersection(t[i].begin(),
t[i].end(), intersectedValues.begin(),
intersectedValues.end(),
std::back_inserter(tempIntersectedSubjects));
// replace the running result with the new intersection
intersectedValues.clear();
for(auto &ele: tempIntersectedSubjects)
intersectedValues.emplace_back(ele);
if( intersectedValues.empty())
break;
}
}
std::set_intersection can be rather slow for large vectors. It's possible to create a similar function that uses lower_bound. Something like this:
template<typename Iterator1, typename Iterator2, typename Function>
void lower_bound_intersection(Iterator1 begin_1, Iterator1 end_1, Iterator2 begin_2, Iterator2 end_2, Function func)
{
for (; begin_1 != end_1 && begin_2 != end_2;)
{
if (*begin_1 < *begin_2)
{
// iterators have no lower_bound member; use the free algorithm
begin_1 = std::lower_bound(begin_1, end_1, *begin_2);
//++begin_1;
}
else if (*begin_2 < *begin_1)
{
begin_2 = std::lower_bound(begin_2, end_2, *begin_1);
//++begin_2;
}
else // equivalent
{
func(*begin_1);
++begin_1;
++begin_2;
}
}
}
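Here is a compact, compilable sketch of the same idea together with a small caller, assuming both ranges are sorted and use random-access iterators (so each std::lower_bound call is logarithmic). The wrapper intersect_sorted is an illustrative name, not part of the answer above.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Walk both sorted ranges, skipping ahead with binary search whenever the
// current elements differ, and report every common element through func.
template <typename Iterator1, typename Iterator2, typename Function>
void lower_bound_intersection(Iterator1 begin_1, Iterator1 end_1,
                              Iterator2 begin_2, Iterator2 end_2, Function func)
{
    while (begin_1 != end_1 && begin_2 != end_2)
    {
        if (*begin_1 < *begin_2)
            begin_1 = std::lower_bound(begin_1, end_1, *begin_2);
        else if (*begin_2 < *begin_1)
            begin_2 = std::lower_bound(begin_2, end_2, *begin_1);
        else // equivalent
        {
            func(*begin_1);
            ++begin_1;
            ++begin_2;
        }
    }
}

// Hypothetical convenience wrapper that collects the result in a vector.
std::vector<unsigned> intersect_sorted(const std::vector<unsigned>& a,
                                       const std::vector<unsigned>& b)
{
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    std::vector<unsigned> out;
    lower_bound_intersection(a.begin(), a.end(), b.begin(), b.end(),
                             [&out](unsigned x) { out.push_back(x); });
    return out;
}
```

The binary-search skipping tends to pay off when one range is much sparser than the other; for two dense ranges of similar size, the plain linear merge in std::set_intersection may well be just as fast.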

Reorder vector using a vector of indices [duplicate]

I'd like to reorder the items in a vector, using another vector to specify the order:
char A[] = { 'a', 'b', 'c' };
size_t ORDER[] = { 1, 0, 2 };
vector<char> vA(A, A + sizeof(A) / sizeof(*A));
vector<size_t> vOrder(ORDER, ORDER + sizeof(ORDER) / sizeof(*ORDER));
reorder_naive(vA, vOrder);
// A is now { 'b', 'a', 'c' }
The following is an inefficient implementation that requires copying the vector:
void reorder_naive(vector<char>& vA, const vector<size_t>& vOrder)
{
assert(vA.size() == vOrder.size());
vector<char> vCopy = vA; // Can we avoid this?
for(int i = 0; i < vOrder.size(); ++i)
vA[i] = vCopy[ vOrder[i] ];
}
Is there a more efficient way, for example, that uses swap()?
This algorithm is based on chmike's, but the vector of reorder indices is const. This function agrees with his for all 11! permutations of [0..10]. The complexity is O(N^2), taking N as the size of the input, or more precisely, the size of the largest orbit.
See below for an optimized O(N) solution which modifies the input.
template< class T >
void reorder(vector<T> &v, vector<size_t> const &order ) {
for ( int s = 1, d; s < order.size(); ++ s ) {
for ( d = order[s]; d < s; d = order[d] ) ;
if ( d == s ) while ( d = order[d], d != s ) swap( v[s], v[d] );
}
}
Here's an STL style version which I put a bit more effort into. It's about 47% faster (that is, almost twice as fast over [0..10]!) because it does all the swaps as early as possible and then returns. The reorder vector consists of a number of orbits, and each orbit is reordered upon reaching its first member. It's faster when the last few elements do not contain an orbit.
template< typename order_iterator, typename value_iterator >
void reorder( order_iterator order_begin, order_iterator order_end, value_iterator v ) {
typedef typename std::iterator_traits< value_iterator >::value_type value_t;
typedef typename std::iterator_traits< order_iterator >::value_type index_t;
typedef typename std::iterator_traits< order_iterator >::difference_type diff_t;
diff_t remaining = order_end - 1 - order_begin;
for ( index_t s = index_t(), d; remaining > 0; ++ s ) {
for ( d = order_begin[s]; d > s; d = order_begin[d] ) ;
if ( d == s ) {
-- remaining;
value_t temp = v[s];
while ( d = order_begin[d], d != s ) {
swap( temp, v[d] );
-- remaining;
}
v[s] = temp;
}
}
}
And finally, just to answer the question once and for all, a variant which does destroy the reorder vector (filling it with -1's). For permutations of [0..10], It's about 16% faster than the preceding version. Because overwriting the input enables dynamic programming, it is O(N), asymptotically faster for some cases with longer sequences.
template< typename order_iterator, typename value_iterator >
void reorder_destructive( order_iterator order_begin, order_iterator order_end, value_iterator v ) {
typedef typename std::iterator_traits< value_iterator >::value_type value_t;
typedef typename std::iterator_traits< order_iterator >::value_type index_t;
typedef typename std::iterator_traits< order_iterator >::difference_type diff_t;
diff_t remaining = order_end - 1 - order_begin;
for ( index_t s = index_t(); remaining > 0; ++ s ) {
index_t d = order_begin[s];
if ( d == (diff_t) -1 ) continue;
-- remaining;
value_t temp = v[s];
for ( index_t d2; d != s; d = d2 ) {
swap( temp, v[d] );
swap( order_begin[d], d2 = (diff_t) -1 );
-- remaining;
}
v[s] = temp;
}
}
In-place reordering of vector
Warning: the meaning of the ordering indices is ambiguous. Both interpretations are answered here.
move elements of vector to the position of the indices
Interactive version here.
#include <iostream>
#include <vector>
#include <assert.h>
using namespace std;
void REORDER(vector<double>& vA, vector<size_t>& vOrder)
{
assert(vA.size() == vOrder.size());
// for all elements to put in place
for( int i = 0; i < vA.size() - 1; ++i )
{
// while the element i is not yet in place
while( i != vOrder[i] )
{
// swap it with the element at its final place
int alt = vOrder[i];
swap( vA[i], vA[alt] );
swap( vOrder[i], vOrder[alt] );
}
}
}
int main()
{
std::vector<double> vec {7, 5, 9, 6};
std::vector<size_t> inds {1, 3, 0, 2};
REORDER(vec, inds);
for (size_t vv = 0; vv < vec.size(); ++vv)
{
std::cout << vec[vv] << std::endl;
}
return 0;
}
output
9
7
6
5
note that you can save one test because if n-1 elements are in place the last nth element is certainly in place.
On exit vA and vOrder are properly ordered.
This algorithm performs at most n-1 swaps because each swap moves an element to its final position. And we'll have to do at most 2N tests on vOrder.
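To check the swap bound on the example above, here is a small instrumented sketch of the same loop. The function name reorder_counting_swaps and the returned counter are additions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Same "move element i to position vOrder[i]" loop as above, but it also
// returns how many swaps were performed.
std::size_t reorder_counting_swaps(std::vector<double>& vA,
                                   std::vector<std::size_t>& vOrder)
{
    assert(vA.size() == vOrder.size());
    std::size_t swaps = 0;
    // i + 1 < size() avoids unsigned underflow for an empty vector.
    for (std::size_t i = 0; i + 1 < vA.size(); ++i)
    {
        while (vOrder[i] != i)
        {
            const std::size_t alt = vOrder[i];
            std::swap(vA[i], vA[alt]);
            std::swap(vOrder[i], vOrder[alt]);
            ++swaps;
        }
    }
    return swaps;
}
```

For the input {7, 5, 9, 6} with indices {1, 3, 0, 2} this performs three swaps, which matches the n-1 bound for a single cycle of length four.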
draw the elements of vector from the position of the indices
Try it interactively here.
#include <iostream>
#include <vector>
#include <assert.h>
template<typename T>
void reorder(std::vector<T>& vec, std::vector<size_t> vOrder)
{
assert(vec.size() == vOrder.size());
for( size_t vv = 0; vv < vec.size() - 1; ++vv )
{
if (vOrder[vv] == vv)
{
continue;
}
size_t oo;
for(oo = vv + 1; oo < vOrder.size(); ++oo)
{
if (vOrder[oo] == vv)
{
break;
}
}
std::swap( vec[vv], vec[vOrder[vv]] );
std::swap( vOrder[vv], vOrder[oo] );
}
}
int main()
{
std::vector<double> vec {7, 5, 9, 6};
std::vector<size_t> inds {1, 3, 0, 2};
reorder(vec, inds);
for (size_t vv = 0; vv < vec.size(); ++vv)
{
std::cout << vec[vv] << std::endl;
}
return 0;
}
Output
5
6
7
9
It appears to me that vOrder contains a set of indexes in the desired order (for example the output of sorting by index). The code example here follows the "cycles" in vOrder, where following a sub-set (could be all of vOrder) of indexes will cycle through the sub-set, ending back at the first index of the sub-set.
Wiki article on "cycles"
https://en.wikipedia.org/wiki/Cyclic_permutation
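To see what those cycles look like for a concrete vOrder, here is a minimal sketch that splits a permutation into its cycles. The function name cycles_of is made up, and it assumes vOrder really is a permutation of 0..n-1.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Decompose a permutation into cycles by following i -> order[i] until we
// return to an index that has already been seen.
std::vector<std::vector<std::size_t>> cycles_of(const std::vector<std::size_t>& order)
{
    std::vector<bool> seen(order.size(), false);
    std::vector<std::vector<std::size_t>> cycles;
    for (std::size_t start = 0; start < order.size(); ++start)
    {
        if (seen[start]) continue; // already part of an earlier cycle
        std::vector<std::size_t> cycle;
        for (std::size_t i = start; !seen[i]; i = order[i])
        {
            assert(order[i] < order.size()); // must be a valid index
            seen[i] = true;
            cycle.push_back(i);
        }
        cycles.push_back(std::move(cycle));
    }
    return cycles;
}
```

For {1, 3, 0, 2} everything lies on one cycle (0, 1, 3, 2), while {1, 0, 2} splits into the cycle (0, 1) and the fixed point (2).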
In the following example, every swap places at least one element in its proper place. This code example effectively reorders vA according to vOrder, while "unordering" or "unpermuting" vOrder back to its original state (0 :: n-1). If vA contained the values 0 through n-1 in order, then after reorder, vA would end up where vOrder started.
template <class T>
void reorder(vector<T>& vA, vector<size_t>& vOrder)
{
assert(vA.size() == vOrder.size());
// for all elements to put in place
for( size_t i = 0; i < vA.size(); ++i )
{
// while vOrder[i] is not yet in place
// every swap places at least one element in its proper place
while( vOrder[i] != vOrder[vOrder[i]] )
{
swap( vA[vOrder[i]], vA[vOrder[vOrder[i]]] );
swap( vOrder[i], vOrder[vOrder[i]] );
}
}
}
This can also be implemented a bit more efficiently using moves instead of swaps. A temp object is needed to hold an element during the moves. Example C code, which reorders A[] according to the indexes in I[] and also sorts I[]:
void reorder(int *A, int *I, int n)
{
int i, j, k;
int tA;
/* reorder A according to I */
/* every move puts an element into place */
/* time complexity is O(n) */
for(i = 0; i < n; i++){
if(i != I[i]){
tA = A[i];
j = i;
while(i != (k = I[j])){
A[j] = A[k];
I[j] = j;
j = k;
}
A[j] = tA;
I[j] = j;
}
}
}
If it is ok to modify the ORDER array then an implementation that sorts the ORDER vector and at each sorting operation also swaps the corresponding values vector elements could do the trick, I think.
A survey of existing answers
You ask if there is "a more efficient way". But what do you mean by efficient and what are your requirements?
Potatoswatter's answer works in O(N²) time with O(1) additional space and doesn't mutate the reordering vector.
chmike and rcgldr give answers which use O(N) time with O(1) additional space, but they achieve this by mutating the reordering vector.
Your original answer allocates new space and then copies data into it while Tim MB suggests using move semantics. However, moving still requires a place to move things to and an object like an std::string has both a length variable and a pointer. In other words, a move-based solution requires O(N) allocations for any objects and O(1) allocations for the new vector itself. I explain why this is important below.
Preserving the reordering vector
We might want that reordering vector! Sorting costs O(N log N). But, if you know you'll be sorting several vectors in the same way, such as in a Structure of Arrays (SoA) context, you can sort once and then reuse the results. This can save a lot of time.
You might also want to sort and then unsort data. Having the reordering vector allows you to do this. A use case here is for performing genomic sequencing on GPUs where maximal speed efficiency is obtained by having sequences of similar lengths processed in batches. We cannot rely on the user providing sequences in this order so we sort and then unsort.
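As a rough illustration of the sort-once, reuse-many-times idea in a Structure of Arrays setting, the sketch below computes a sorting permutation from one key array and applies it to any number of parallel arrays. The names sort_permutation and apply_permutation are hypothetical, and the copy-based apply is chosen for brevity rather than speed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Indices that would sort `keys` in ascending order: order[i] is the index
// of the element that belongs at position i.
template <class T>
std::vector<std::size_t> sort_permutation(const std::vector<T>& keys)
{
    std::vector<std::size_t> order(keys.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&keys](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });
    return order;
}

// Apply the same ordering to any parallel array (simple copying version).
template <class T>
std::vector<T> apply_permutation(const std::vector<T>& values,
                                 const std::vector<std::size_t>& order)
{
    assert(values.size() == order.size());
    std::vector<T> out;
    out.reserve(values.size());
    for (std::size_t i : order) out.push_back(values[i]);
    return out;
}
```

The O(N log N) sort is paid once; every further array only costs the O(N) apply step, which is why keeping the ordering vector intact matters.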
So, what if we want the best of all worlds: O(N) processing without the costs of additional allocation but also without mutating our ordering vector (which we might, after all, want to reuse)? To find that world, we need to ask:
Why is extra space bad?
There are two reasons you might not want to allocate additional space.
The first is that you don't have much space to work with. This can occur in two situations: you're on an embedded device with limited memory. Usually this means you're working with small datasets, so the O(N²) solution is probably fine here. But it can also happen when you are working with really large datasets. In this case O(N²) is unacceptable and you have to use one of the O(N) mutating solutions.
The other reason extra space is bad is because allocation is expensive. For smaller datasets it can cost more than the actual computation. Thus, one way to achieve efficiency is to eliminate allocation.
Outline
When we mutate the ordering vector we are doing so as a way to indicate whether elements are in their permuted positions. Rather than doing this, we could use a bit-vector to indicate that same information. However, if we allocate the bit vector each time that would be expensive.
Instead, we could clear the bit vector each time by resetting it to zero. However, that incurs an additional O(N) cost per function use.
Rather, we can store a "version" value in a vector and increment this on each function use. This gives us O(1) access, O(1) clear, and an amortized allocation cost. This works similarly to a persistent data structure. The downside is that if we use an ordering function too often the version counter needs to be reset, though the O(N) cost of doing so is amortized.
This raises the question: what is the optimal data type for the version vector? A bit-vector maximizes cache utilization but requires a full O(N) reset after each use. A 64-bit data type probably never needs to be reset, but has poor cache utilization. Experimenting is the best way to figure this out.
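The version trick can be sketched in isolation as a small flag set with an O(1) clear. This is only an illustration: the class name VersionedFlags is made up, and the deliberately tiny 8-bit stamp type is an assumption chosen so the occasional O(N) reset is easy to observe.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// A set of boolean flags where "clear all" is usually a single increment.
// A flag counts as set only if its stamp equals the current version.
class VersionedFlags {
public:
    explicit VersionedFlags(std::size_t n) : stamps_(n, 0), version_(1) {}

    bool test(std::size_t i) const
    {
        assert(i < stamps_.size());
        return stamps_[i] == version_;
    }

    void set(std::size_t i)
    {
        assert(i < stamps_.size());
        stamps_[i] = version_;
    }

    // O(1) in the common case; O(N) only when the counter would overflow.
    void clear()
    {
        if (version_ == std::numeric_limits<std::uint8_t>::max()) {
            std::fill(stamps_.begin(), stamps_.end(), std::uint8_t{0});
            version_ = 1;
        } else {
            ++version_;
        }
    }

private:
    std::vector<std::uint8_t> stamps_; // 0 means "never set"
    std::uint8_t version_;             // always >= 1
};
```

Swapping std::uint8_t for a wider type trades cache footprint against how often the full reset happens, which is the tuning question raised above.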
Two types of permutations
We can view an ordering vector as having two senses: forward and backward. In the forward sense, the vector tell us where elements go to. In the backward sense, the vector tells us where elements are coming from. Since the ordering vector is implicitly a linked list, the backward sense requires O(N) additional space, but, again, we can amortize the allocation cost. Applying the two senses sequentially brings us back to our original ordering.
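The two senses are easiest to see in a copying version, before worrying about doing it in place. In this sketch (function names are illustrative) the forward sense scatters each element to ordering[i], the backward sense gathers each element from ordering[i], and applying one after the other restores the input.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Forward sense: element i of the input goes TO position ordering[i].
template <class T>
std::vector<T> forward_copy(const std::vector<T>& in,
                            const std::vector<std::size_t>& ordering)
{
    assert(in.size() == ordering.size());
    std::vector<T> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[ordering[i]] = in[i];
    return out;
}

// Backward sense: position i of the output comes FROM position ordering[i].
template <class T>
std::vector<T> backward_copy(const std::vector<T>& in,
                             const std::vector<std::size_t>& ordering)
{
    assert(in.size() == ordering.size());
    std::vector<T> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[ordering[i]];
    return out;
}
```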
Performance
Running single-threaded on my "Intel(R) Xeon(R) E-2176M CPU @ 2.70GHz", the following code takes about 0.81ms per reordering for sequences 32,767 elements long.
Code
Fully commented code for both senses with tests:
#include <algorithm>
#include <cassert>
#include <limits>
#include <random>
#include <stack>
#include <stdexcept>
#include <vector>
///@brief Reorder a vector by moving its elements to indices indicated by another
/// vector. Takes O(N) time and O(N) space. Allocations are amortized.
///
///@param[in,out] values Vector to be reordered
///@param[in] ordering A permutation of the vector
///@param[in,out] visited A black-box vector to be reused between calls and
/// shared with `backward_reorder()`
template<class ValueType, class OrderingType, class ProgressType>
void forward_reorder(
std::vector<ValueType> &values,
const std::vector<OrderingType> &ordering,
std::vector<ProgressType> &visited
){
if(ordering.size()!=values.size()){
throw std::runtime_error("ordering and values must be the same size!");
}
//Size the visited vector appropriately. Since vectors don't shrink, this will
//shortly become large enough to handle most of the inputs. The vector is 1
//larger than necessary because the first element is special.
if(visited.empty() || visited.size()-1<values.size())
visited.resize(values.size()+1);
//If the visitation indicator becomes too large, we reset everything. This is
//O(N) expensive, but unlikely to occur in most use cases if an appropriate
//data type is chosen for the visited vector. For instance, an unsigned 32-bit
//integer provides ~4B uses before it needs to be reset. We subtract one below
//to avoid having to think too much about off-by-one errors. Note that
//choosing the biggest data type possible is not necessarily a good idea!
//Smaller data types will have better cache utilization.
if(visited.at(0)==std::numeric_limits<ProgressType>::max()-1)
std::fill(visited.begin(), visited.end(), 0);
//We increment the stored visited indicator and make a note of the result. Any
//value in the visited vector less than `visited_indicator` has not been
//visited.
const auto visited_indicator = ++visited.at(0);
//For doing an early exit if we get everything in place
auto remaining = values.size();
//For all elements that need to be placed
for(size_t s=0;s<ordering.size() && remaining>0;s++){
assert(visited[s+1]<=visited_indicator);
//Ignore already-visited elements
if(visited[s+1]==visited_indicator)
continue;
//Don't rearrange if we don't have to
if(s==(size_t)ordering[s])
continue;
//Follow this cycle, putting elements in their places until we get back
//around. Use move semantics for speed.
auto temp = std::move(values[s]);
auto i = s;
for(;s!=(size_t)ordering[i];i=ordering[i],--remaining){
std::swap(temp, values[ordering[i]]);
visited[i+1] = visited_indicator;
}
std::swap(temp, values[s]);
visited[i+1] = visited_indicator;
}
}
///@brief Reorder a vector by moving its elements to indices indicated by another
/// vector. Takes O(2N) time and O(2N) space. Allocations are amortized.
///
///@param[in,out] values Vector to be reordered
///@param[in] ordering A permutation of the vector
///@param[in,out] visited A black-box vector to be reused between calls and
/// shared with `forward_reorder()`
template<class ValueType, class OrderingType, class ProgressType>
void backward_reorder(
std::vector<ValueType> &values,
const std::vector<OrderingType> &ordering,
std::vector<ProgressType> &visited
){
//The orderings form a linked list. We need O(N) memory to reverse a linked
//list. We use `thread_local` so that the function is reentrant.
thread_local std::stack<OrderingType> stack;
if(ordering.size()!=values.size()){
throw std::runtime_error("ordering and values must be the same size!");
}
//Size the visited vector appropriately. Since vectors don't shrink, this will
//shortly become large enough to handle most of the inputs. The vector is 1
//larger than necessary because the first element is special.
if(visited.empty() || visited.size()-1<values.size())
visited.resize(values.size()+1);
//If the visitation indicator becomes too large, we reset everything. This is
//O(N) expensive, but unlikely to occur in most use cases if an appropriate
//data type is chosen for the visited vector. For instance, an unsigned 32-bit
//integer provides ~4B uses before it needs to be reset. We subtract one below
//to avoid having to think too much about off-by-one errors. Note that
//choosing the biggest data type possible is not necessarily a good idea!
//Smaller data types will have better cache utilization.
if(visited.at(0)==std::numeric_limits<ProgressType>::max()-1)
std::fill(visited.begin(), visited.end(), 0);
//We increment the stored visited indicator and make a note of the result. Any
//value in the visited vector less than `visited_indicator` has not been
//visited.
const auto visited_indicator = ++visited.at(0);
//For doing an early exit if we get everything in place
auto remaining = values.size();
//For all elements that need to be placed
for(size_t s=0;s<ordering.size() && remaining>0;s++){
assert(visited[s+1]<=visited_indicator);
//Ignore already-visited elements
if(visited[s+1]==visited_indicator)
continue;
//Don't rearrange if we don't have to
if(s==(size_t)ordering[s])
continue;
//The orderings form a linked list. We need to follow that list to its end
//in order to reverse it.
stack.emplace(s);
for(auto i=s;s!=(size_t)ordering[i];i=ordering[i]){
stack.emplace(ordering[i]);
}
//Now we follow the linked list in reverse to its beginning, putting
//elements in their places. Use move semantics for speed.
auto temp = std::move(values[s]);
while(!stack.empty()){
std::swap(temp, values[stack.top()]);
visited[stack.top()+1] = visited_indicator;
stack.pop();
--remaining;
}
visited[s+1] = visited_indicator;
}
}
int main(){
std::mt19937 gen;
std::uniform_int_distribution<short> value_dist(0,std::numeric_limits<short>::max());
std::uniform_int_distribution<short> len_dist (0,std::numeric_limits<short>::max());
std::vector<short> data;
std::vector<short> ordering;
std::vector<short> original;
std::vector<size_t> progress;
for(int i=0;i<1000;i++){
const int len = len_dist(gen);
data.clear();
ordering.clear();
for(int i=0;i<len;i++){
data.push_back(value_dist(gen));
ordering.push_back(i);
}
original = data;
std::shuffle(ordering.begin(), ordering.end(), gen);
forward_reorder(data, ordering, progress);
assert(original!=data);
backward_reorder(data, ordering, progress);
assert(original==data);
}
}
Never prematurely optimize. Measure and then determine where you need to optimize and what. You can end up with complex code that is hard to maintain and bug-prone in many places where performance is not an issue.
With that being said, do not early pessimize. Without changing the code you can remove half of your copies:
template <typename T>
void reorder( std::vector<T> & data, std::vector<std::size_t> const & order )
{
std::vector<T> tmp; // create an empty vector
tmp.reserve( data.size() ); // ensure memory and avoid moves in the vector
for ( std::size_t i = 0; i < order.size(); ++i ) {
tmp.push_back( data[order[i]] );
}
data.swap( tmp ); // swap vector contents
}
This code creates an empty (big enough) vector in which a single copy is performed in-order. At the end, the ordered and original vectors are swapped. This will reduce the copies, but still requires extra memory.
If you want to perform the moves in-place, a simple algorithm could be:
template <typename T>
void reorder( std::vector<T> & data, std::vector<std::size_t> const & order )
{
for ( std::size_t i = 0; i < order.size(); ++i ) {
std::size_t original = order[i];
while ( i < original ) {
original = order[original];
}
std::swap( data[i], data[original] );
}
}
This code should be checked and debugged. In plain words the algorithm in each step positions the element at the i-th position. First we determine where the original element for that position is now placed in the data vector. If the original position has already been touched by the algorithm (it is before the i-th position) then the original element was swapped to order[original] position. Then again, that element can already have been moved...
This algorithm is roughly O(N^2) in the number of integer operations and is thus theoretically slower than the initial O(N) algorithm. But it can compensate if the N^2 swap operations (worst case) cost less than the N copy operations, or if you are really constrained by memory footprint.
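For anyone who wants a checked starting point, here is a minimal sketch that follows the prose description literally: positions before i have already been touched, so the lookup keeps following order[] while the candidate index is smaller than i. Note that this is the opposite comparison to the loop condition in the snippet above, and the name reorder_in_place is made up. It implements the "draw element i from position order[i]" sense and leaves order untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// In-place reorder so that afterwards data[i] holds the element that was
// originally at data[order[i]]. O(1) extra space, order is not modified.
template <typename T>
void reorder_in_place(std::vector<T>& data, const std::vector<std::size_t>& order)
{
    assert(data.size() == order.size());
    for (std::size_t i = 0; i < order.size(); ++i)
    {
        std::size_t original = order[i];
        // Positions before i were already filled; the element that used to
        // live there was swapped onwards, so keep following the chain.
        while (original < i)
        {
            original = order[original];
        }
        std::swap(data[i], data[original]);
    }
}
```

The inner while loop is what makes this quadratic in the worst case, in exchange for not needing a copy of the data or a mutable order vector.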
It's an interesting intellectual exercise to do the reorder with O(1) space requirement but in 99.9% of the cases the simpler answer will perform to your needs:
template<typename T>
void permute(vector<T>& values, const vector<size_t>& indices)
{
vector<T> out;
out.reserve(indices.size());
for(size_t index: indices)
{
assert(0 <= index && index < values.size());
out.push_back(std::move(values[index]));
}
values = std::move(out);
}
Beyond memory requirements, the only way I can think of this being slower would be due to the memory of out being in a different cache page than that of values and indices.
You could do it recursively, I guess - something like this (unchecked, but it gives the idea):
// Recursive function
template<typename T>
void REORDER(int oldPosition, vector<T>& vA,
const vector<int>& vecNewOrder, vector<bool>& vecVisited)
{
// Keep a record of the value currently in that position,
// as well as the position we're moving it to.
// But don't move it yet, or we'll overwrite whatever's at the next
// position. Instead, we first move what's at the next position.
// To guard against loops, we look at vecVisited, and set it to true
// once we've visited a position.
T oldVal = vA[oldPosition];
int newPos = vecNewOrder[oldPosition];
if (vecVisited[oldPosition])
{
// We've hit a loop. Set it and return.
vA[newPos] = oldVal;
return;
}
// Guard against loops:
vecVisited[oldPosition] = true;
// Recursively re-order the next item in the sequence.
REORDER(newPos, vA, vecNewOrder, vecVisited);
// And, after we've set this new value,
vA[newPos] = oldVal;
}
// The "main" function
template<typename T>
void REORDER(vector<T>& vA, const vector<int>& newOrder)
{
// Initialise vecVisited with false values
vector<bool> vecVisited(vA.size(), false);
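for (int x = 0; x < static_cast<int>(vA.size()); x++)
{
// Skip positions already placed while following an earlier cycle;
// revisiting them would overwrite elements that are already correct.
if (!vecVisited[x])
REORDER(x, vA, newOrder, vecVisited);
}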
}
Of course, you do have the overhead of vecVisited. Thoughts on this approach, anyone?
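For what it's worth, the same cycle-following idea can be written without recursion, which avoids deep call stacks on long cycles. This is only a sketch: the name reorder_iterative is made up here, and it assumes newOrder is a valid permutation (every index exactly once) with the same meaning as above, i.e. the element at position i ends up at position newOrder[i].

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: moves the element at position i to position newOrder[i].
// Assumes newOrder is a valid permutation of 0..n-1; otherwise the inner loop
// may never return to its starting position.
template <typename T>
void reorder_iterative(std::vector<T>& vA, const std::vector<std::size_t>& newOrder)
{
    assert(vA.size() == newOrder.size());
    std::vector<bool> visited(vA.size(), false);
    for (std::size_t start = 0; start < vA.size(); ++start) {
        if (visited[start]) continue;       // already placed by an earlier cycle
        T carried = std::move(vA[start]);   // value waiting to be dropped off
        std::size_t pos = start;
        do {
            const std::size_t next = newOrder[pos];
            assert(next < vA.size());
            std::swap(carried, vA[next]);   // drop off, pick up the displaced value
            visited[next] = true;
            pos = next;
        } while (pos != start);             // the cycle closes back at start
    }
}
```

It still needs the visited flags, so the extra memory is the same as vecVisited; only the recursion is gone.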
Iterating through the vector is an O(n) operation. It's sorta hard to beat that.
Your code is broken. You cannot assign to vA and you need to use template parameters.
vector<char> REORDER(const vector<char>& vA, const vector<size_t>& vOrder)
{
assert(vA.size() == vOrder.size());
vector<char> vCopy(vA.size());
for(size_t i = 0; i < vOrder.size(); ++i)
vCopy[i] = vA[ vOrder[i] ];
return vCopy; // return the reordered copy, not the untouched input
}
The above is slightly more efficient.
It is not clear by the title and the question if the vector should be ordered with the same steps it takes to order vOrder or if vOrder already contains the indexes of the desired order.
The first interpretation has already a satisfying answer (see chmike and Potatoswatter), I add some thoughts about the latter.
If the creation and/or copy cost of object T is relevant:
template <typename T>
void reorder( std::vector<T> & data, std::vector<std::size_t> & order )
{
std::size_t i,j,k;
for(i = 0; i + 1 < order.size(); ++i) { // i + 1 avoids unsigned wrap-around when order is empty
j = order[i];
if(j != i) {
for(k = i + 1; order[k] != i; ++k);
std::swap(order[i],order[k]);
std::swap(data[i],data[j]);
}
}
}
If the creation cost of your object is small and memory is not a concern (see dribeas):
template <typename T>
void reorder( std::vector<T> & data, std::vector<std::size_t> const & order )
{
std::vector<T> tmp; // create an empty vector
tmp.reserve( data.size() ); // ensure memory and avoid moves in the vector
for ( std::size_t i = 0; i < order.size(); ++i ) {
tmp.push_back( data[order[i]] );
}
data.swap( tmp ); // swap vector contents
}
Note that the two pieces of code in dribeas's answer do different things.
I was trying to use @Potatoswatter's solution to sort multiple vectors by a third one and got really confused by the output when I fed those functions a vector of indices produced by Armadillo's sort_index. To convert the output of sort_index (the arma_inds vector below) into one that can be used with @Potatoswatter's solution (new_inds below), you can do the following:
vector<int> new_inds(arma_inds.size());
for (int i = 0; i < new_inds.size(); i++) new_inds[arma_inds[i]] = i;
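The loop above builds the inverse of a permutation. As a rough illustration of why that helps: an index vector can be read either as "position i takes its value from idx[i]" (gather) or as "the value at position i goes to idx[i]" (scatter), and inverting the indices switches between the two readings. The names invert_permutation, gather and scatter below are made up for this sketch, and it assumes the index vector is a valid permutation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: inv[p[i]] = i, the same loop as above with size_t indices.
std::vector<std::size_t> invert_permutation(const std::vector<std::size_t>& p)
{
    std::vector<std::size_t> inv(p.size());
    for (std::size_t i = 0; i < p.size(); ++i) {
        assert(p[i] < p.size());
        inv[p[i]] = i;
    }
    return inv;
}

// Gather reading: out[i] = data[idx[i]].
template <typename T>
std::vector<T> gather(const std::vector<T>& data, const std::vector<std::size_t>& idx)
{
    assert(data.size() == idx.size());
    std::vector<T> out(data.size());
    for (std::size_t i = 0; i < idx.size(); ++i) out[i] = data[idx[i]];
    return out;
}

// Scatter reading: out[idx[i]] = data[i].
template <typename T>
std::vector<T> scatter(const std::vector<T>& data, const std::vector<std::size_t>& idx)
{
    assert(data.size() == idx.size());
    std::vector<T> out(data.size());
    for (std::size_t i = 0; i < idx.size(); ++i) out[idx[i]] = data[i];
    return out;
}
```

With data {a, b, c} and indices {1, 2, 0}, gather gives {b, c, a} while scatter gives {c, a, b}; scattering with the inverted indices {2, 0, 1} gives {b, c, a} again.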
I came up with this solution, which has a space complexity of O(max_val - min_val + 1) but can be integrated with std::sort and so benefits from its O(n log n) time complexity.
std::vector<int32_t> dense_vec = {1, 2, 3};
std::vector<int32_t> order = {1, 0, 2};
int32_t max_val = *std::max_element(dense_vec.begin(), dense_vec.end());
std::vector<int32_t> sparse_vec(max_val + 1);
int32_t i = 0;
for(int32_t j: dense_vec)
{
sparse_vec[j] = order[i];
i++;
}
std::sort(dense_vec.begin(), dense_vec.end(),
[&sparse_vec](int32_t i1, int32_t i2) {return sparse_vec[i1] < sparse_vec[i2];});
The following assumptions were made while writing this code:
Vector values start from zero.
Vector does not contain repeated values.
We have enough memory to sacrifice in order to use std::sort
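If the first assumption is a problem (negative values, or a very large max_val), one possible variation is to keep the ranks in a std::unordered_map instead of a dense lookup table. This is a sketch, not a drop-in replacement: sort_by_rank is a made-up name, order[i] is taken to be the target rank of values[i] as in the code above, and the values still have to be unique.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical helper: sorts values so that values[i] ends up at rank order[i].
// Assumes the values are unique and order is a permutation of 0..n-1.
void sort_by_rank(std::vector<std::int32_t>& values,
                  const std::vector<std::int32_t>& order)
{
    assert(values.size() == order.size());
    std::unordered_map<std::int32_t, std::int32_t> rank_of;
    rank_of.reserve(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        const bool inserted = rank_of.emplace(values[i], order[i]).second;
        assert(inserted && "values must be unique");
        (void)inserted; // silence unused-variable warnings when NDEBUG is set
    }
    // Compare elements by their target rank rather than by value.
    std::sort(values.begin(), values.end(),
              [&rank_of](std::int32_t a, std::int32_t b) {
                  return rank_of.at(a) < rank_of.at(b);
              });
}
```

The memory used is then proportional to the number of elements rather than to the value range, at the cost of a hash lookup per comparison.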
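This should avoid copying the vector, but note that it only works for some permutations (for example, vOrder = {2, 0, 1} does not come out in the requested order):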
void REORDER(vector<char>& vA, const vector<size_t>& vOrder)
{
assert(vA.size() == vOrder.size());
for(int i = 0; i < vOrder.size(); ++i)
if (i < vOrder[i])
swap(vA[i], vA[vOrder[i]]);
}
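A swap-based reorder that copes with arbitrary permutations generally has to follow each cycle to its end. A minimal sketch, assuming vOrder[i] names the index whose element should end up at position i and that vOrder is a valid permutation; reorder_in_place is a made-up name, and the index vector is taken by value so it can double as the "already done" marker (the elements themselves are only swapped, never copied).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: afterwards vA[i] holds the element that was at vOrder[i].
// vOrder is passed by value on purpose; the local copy is overwritten with
// the identity permutation as positions get finalised.
template <typename T>
void reorder_in_place(std::vector<T>& vA, std::vector<std::size_t> vOrder)
{
    assert(vA.size() == vOrder.size());
    for (std::size_t i = 0; i < vA.size(); ++i) {
        std::size_t cur = i;
        // Walk the cycle that starts at i until it points back to i.
        while (vOrder[cur] != i) {
            const std::size_t next = vOrder[cur];
            assert(next < vA.size());
            std::swap(vA[cur], vA[next]); // cur now holds its final element
            vOrder[cur] = cur;            // mark cur as done
            cur = next;
        }
        vOrder[cur] = cur;                // last position of the cycle is done too
    }
}
```

Each element is swapped at most once per cycle step, so the whole thing does O(n) swaps with O(n) extra indices and no extra copies of the elements.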