how to vectorize this program - c++

The program below (specifically, the lines after "from here") is a construct I have to use a lot.
I was wondering whether it is possible (possibly using functions from the Eigen library)
to vectorize or otherwise make this program run faster.
Essentially, given a vector of float x, this construct has to recover the indexes
of the sorted elements of x in an int vector SIndex. For example, if the first
entry of SIndex is 10, it means that x[10] was the smallest element
of x.
#include <algorithm>
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <vector>
using std::vector;
using namespace std;
typedef pair<int, float> sortData;
bool sortDataLess(const sortData& left, const sortData& right){
return left.second<right.second;
}
int main(){
int n=20,i;
float LO=-1.0,HI=1.0;
srand (time(NULL));
vector<float> x(n);
vector<float> y(n);
vector<int> SIndex(n);
vector<sortData> foo(n);
for(i=0;i<n;i++) x[i]=LO+(float)rand()/((float)RAND_MAX/(HI-LO));
//from here:
for(i=0;i<n;i++) foo[i]=sortData(i,x[i]);
sort(foo.begin(),foo.end(),sortDataLess);
for(i=0;i<n;i++){
sortData bar=foo[i];
y[i]=x[bar.first];
SIndex[i]=bar.first;
}
for(i=0;i<n;i++) std::cout << SIndex[i] << std::endl;
return 0;
}

There's no getting around the fact that this is a sorting problem, and vectorization doesn't necessarily improve sorts very much. For example, the partition step of quicksort can do the comparison in parallel, but it then needs to select and store the 0–n values that passed the comparison. This can absolutely be done, but it starts throwing out the advantages you get from vectorization—you need to convert from a comparison mask to a shuffle mask, which is probably a lookup table (bad), and you need a variable-sized store, which means no alignment (bad, although maybe not that bad). Mergesort needs to merge two sorted lists, which in some cases could be improved by vectorization, but in the worst case (I think) needs the same number of steps as the scalar case.
And, of course, there's a decent chance that any major speed boost you get from vectorization will have already been done inside your standard library's std::sort implementation. To get it, though, you'd need to be sorting primitive types with the default comparison operator.
If you're worried about performance, you can easily avoid the last loop, though. Just sort a list of indices using your float array as a comparison:
struct IndirectLess {
template <typename T>
IndirectLess(T iter) : values(&*iter) {}
bool operator()(int left, int right)
{
return values[left] < values[right];
}
float const* values;
};
int main() {
// ...
std::vector<int> SIndex;
SIndex.reserve(n);
for (int i = 0; i < n; ++i)
SIndex.push_back(i); // fill with 0..n-1, the indices to be sorted
std::sort(SIndex.begin(), SIndex.end(), IndirectLess(x.begin()));
// ...
}
Now you've only produced your list of sorted indices. You have the potential to lose some cache locality, so for really big lists it might be slower. At that point it might be possible to vectorize your last loop, depending on the architecture. It's just data manipulation, though—read four values, store 1st and 3rd in one place and 2nd and 4th in another—so I wouldn't expect Eigen to help much at that point.
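The index-sorting idea above can also be written with a lambda instead of a hand-written functor. This is only a minimal sketch assuming C++11 or later; sortedIndices and gather are hypothetical helper names, not part of the original code.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Returns the permutation that sorts `values` in ascending order:
// result[0] is the index of the smallest element, and so on.
std::vector<int> sortedIndices(const std::vector<float>& values) {
    std::vector<int> indices(values.size());
    std::iota(indices.begin(), indices.end(), 0);  // 0, 1, 2, ...
    std::sort(indices.begin(), indices.end(),
              [&values](int left, int right) { return values[left] < values[right]; });
    return indices;
}

// Gathers values[indices[i]] into a new vector (the last loop from the question).
std::vector<float> gather(const std::vector<float>& values, const std::vector<int>& indices) {
    std::vector<float> out;
    out.reserve(indices.size());
    for (int idx : indices) out.push_back(values[idx]);
    return out;
}
```

Calling sortedIndices(x) would play the role of SIndex, and gather(x, SIndex) would produce y only when the sorted values are actually needed.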

Related

std::bad_alloc when I forgot to increment the iterator

I have two questions about the following piece of code. First, I initially forgot to increment the loop iterator, and as a result I got std::bad_alloc when I ran the code. After debugging I couldn't quite understand why the mistake leads to that error.
My second question is whether there exists a more efficient way to store objects of type pcl::PointXYZ than a vector. Can I avoid copying them?
#include <unordered_set>
#include <random>
#include <algorithm>
#include <vector>
// Sample without replacement over a range using Bob Floyd's algorithm
std::unordered_set<int> sampleWithoutReplacement(int sampleSize, int rangeUpperBound)
{
std::unordered_set<int> sample;
std::default_random_engine generator;
for(int d = rangeUpperBound - sampleSize; d < rangeUpperBound; d++)
{
int t = std::uniform_int_distribution<>(0, d)(generator);
if (sample.find(t) == sample.end() )
sample.insert(t);
else
sample.insert(d);
}
return sample;
}
unsigned maxIterations {100};
while(maxIterations--)
{
std::unordered_set<int> inliers;
std::unordered_set<int> sampleIndices = sampleWithoutReplacement(sampleSize, cloudSize);
std::vector<pcl::PointXYZ> samplePoints {};
for (auto it { sampleIndices.begin() }; it != sampleIndices.end(); ++it)
{
samplePoints.push_back(cloud->points.at(*it));
}
// some other code that uses samplePoints.
}
As for your first question, you can look at cppreference and see that std::bad_alloc is thrown when there is a failure to allocate. Without the increment, the loop condition never becomes false, so push_back keeps appending the same point until the vector can no longer allocate. In essence, you've run out of memory by continuously pushing to a vector.
As for total memory overhead, you won't really see a noticeable difference on modern systems. If we're being technical, and you know exactly how many elements you want to store, an array would be slightly more memory-efficient. If you're concerned about how long it takes to find an element, a keyed lookup in std::map is O(log n), whereas a linear search through an unsorted std::vector is O(n).
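On the question of avoiding copies, one option is to keep pointers to the selected points rather than the points themselves. The following is a rough sketch under stated assumptions: Point is a hypothetical stand-in for pcl::PointXYZ so that it compiles without PCL, and selectPoints is an invented helper name.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for pcl::PointXYZ so the sketch compiles without PCL.
struct Point {
    float x, y, z;
};

// Collects pointers to the sampled points instead of copying them.
// The pointers are only valid while `points` is neither resized nor destroyed.
std::vector<const Point*> selectPoints(const std::vector<Point>& points,
                                       const std::unordered_set<int>& sampleIndices) {
    std::vector<const Point*> selected;
    selected.reserve(sampleIndices.size());  // one allocation up front
    for (int index : sampleIndices) {        // a range-for has no increment to forget
        selected.push_back(&points.at(static_cast<std::size_t>(index)));
    }
    return selected;
}
```

Whether this pays off depends on the point size: for a small struct, copying into a vector that was reserved up front may be just as fast and keeps the data contiguous.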

Passing vector by value

I want to write a program that sorts a vector repeatedly for testing purposes, so that I can benchmark the average CPU time of the sort over a certain number of runs. The original vector needs to remain unchanged, so each run sorts a second vector that is a copy of it.
So what I have done is...
#include <iostream>
#include <vector>
#include <random>
#include <chrono>
using namespace std;
typedef vector<int> intv;
int main(){
intv vi;
// Stuff to create my vector with certain characteristics...
intv vii=vi;
cout << "Size: \n";
cin >> tt ;
for(i=0; i<tt; ++i){
tb=sort(t,vii);
m=m+tb;
vii=vi;
}
m=m/tt;
cout << "BS" << m << "\n";
}
So I pass the vector by reference, and make a copy for each sorting so that I can sort it again. How can I do this in a better way? Is it better to pass it by value? In that case, could someone provide a minimal example of the best way to do this?
sort is a basic bubble sorting function:
double sort(int t, intv &vii){
vii.reserve(t);
bool swapped=true;
int a;
auto t0 =chrono::high_resolution_clock::now();
while (swapped==true){
swapped=false; // reset once per pass, not once per comparison
for (int i=1; i<t; ++i){
if (vii[i-1]>vii[i]){
a=vii[i];
vii[i]=vii[i-1];
vii[i-1]=a;
swapped=true;
}
}
t=t-1;
}
auto t1 = chrono::high_resolution_clock::now();
double T = chrono::duration_cast<chrono::nanoseconds>(t1-t0).count();
return T;
}
Once you have sorted, you have to do something that is equivalent to:
vii=vi;
I think assigning vi to vii will be the most efficient method of copying the contents of vi to vii. You can try:
size_t index = 0;
for ( auto const& val : vi )
{
vii[index++] = val;
}
However, I will be really surprised if the second method is more efficient than the first.
Nothing wrong with sorting in-place, and making a copy of the vector. The code you have should work, though it is not clear from where your parameter t is coming.
Note that the statement vii.reserve(t) is not doing anything useful in your sort routine: either t is less than or equal to the size of vii, in which case the reserve call does nothing, or it is greater than the size of vii, in which case you are accessing values outside the range of the vector. Better to check t against the vector size and throw an error or similar if it is too big.
Passing by value is straightforward: just declare your sort routine as double sort(int t, intv vii). When the function is called, vii will be copied from whichever vector you pass in as the second argument.
From a design point of view though, it is better to make a copy and then pass a reference. Sorting should change the thing being sorted; passing by value in the context of your code would mean that nothing would be able to inspect the sorted result.
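The copy-then-pass-by-reference design might look like the sketch below. It is only an illustration under assumptions: timedSort and averageSortTime are hypothetical names, and std::sort stands in for the bubble sort from the question.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Sorts `work` in place and returns the elapsed time in nanoseconds.
// std::sort stands in for the bubble sort from the question.
double timedSort(std::vector<int>& work) {
    auto t0 = std::chrono::steady_clock::now();
    std::sort(work.begin(), work.end());
    auto t1 = std::chrono::steady_clock::now();
    return static_cast<double>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
}

// Averages the sort time over `runs` repetitions. The copy is made outside the
// timed region, and `work` can reuse its allocation on every assignment.
double averageSortTime(const std::vector<int>& original, int runs) {
    std::vector<int> work;
    double total = 0.0;
    for (int i = 0; i < runs; ++i) {
        work = original;  // fresh unsorted copy; not timed
        total += timedSort(work);
    }
    return runs > 0 ? total / runs : 0.0;
}
```

Because the caller owns the working copy, the sorted result stays available for inspection after each run, and the original vector is never modified.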

Sort an array of std::pair vs. struct: which one is faster?

I was wondering whether sorting an array of std::pair is faster, or an array of struct?
Here are my code segments:
Code #1: sorting std::pair array (by first element):
#include <algorithm>
pair <int,int> client[100000];
sort(client,client+100000);
Code #2: sort struct (by A):
#include <algorithm>
struct cl{
int A,B;
};
bool cmp(cl x,cl y){
return x.A < y.A;
}
cl clients[100000];
sort(clients,clients+100000,cmp);
code #3: sort struct (by A and internal operator <):
#include <algorithm>
struct cl{
int A,B;
bool operator<(cl x){
return A < x.A;
}
};
cl clients[100000];
sort(clients,clients+100000);
Update: I used this code to solve a problem on an online judge. Code #1 exceeded the 2-second time limit, while codes #2 and #3 were accepted (they ran in 62 milliseconds). Why does code #1 take so much longer than the others? Where is the difference?
You know what std::pair is? It's a struct (or class, which is the same thing in C++ for our purposes). So if you want to know what's faster, the usual advice applies: you have to test it and find out for yourself on your platform. But the best bet is that if you implement the equivalent sorting logic to std::pair, you will have equivalent performance, because the compiler does not care whether your data type's name is std::pair or something else.
But note that the code you posted is not equivalent in functionality to the operator < provided for std::pair. Specifically, you only compare the first member, not both. Obviously this may result in some speed gain (but probably not enough to notice in any real program).
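To make the difference concrete, here is a small sketch of the two orderings side by side. Client, lessLikePair and lessByFirst are hypothetical names; std::tie is used to get the same lexicographic comparison that std::pair's operator< performs.

```cpp
#include <tuple>
#include <utility>

struct Client {  // hypothetical name, mirrors the question's `cl`
    int A, B;
};

// Same ordering as operator< on std::pair<int,int>: compare A first, then B on ties.
bool lessLikePair(const Client& x, const Client& y) {
    return std::tie(x.A, x.B) < std::tie(y.A, y.B);
}

// Ordering used in the question: only A is compared, ties are left unordered.
bool lessByFirst(const Client& x, const Client& y) {
    return x.A < y.A;
}
```

With lessByFirst, two clients that share the same A compare as equivalent, so the sort does less work per tie but gives no guarantee about their relative order.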
I would estimate that there isn't much difference at all between these two solutions.
But like ALL performance related queries, rather than rely on someone on the internet telling you they are the same, or one is better than the other, make your own measurements. Sometimes, subtle differences in implementation will make a lot of difference to the actual results.
Having said that, the implementation of std::pair is a struct (or class) with two members, first and second, so I have a hard time imagining that there is any real difference here - you are just implementing your own pair with your own compare function that does exactly the same things that the already existing pair does... Whether it's in an internal function in the class or as a standalone function is unlikely to make much of a difference.
Edit: I made the following "mash the code together":
#include <algorithm>
#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <ctime>
using namespace std;
const int size=100000000;
pair <int,int> clients1[size];
struct cl1{
int first,second;
};
cl1 clients2[size];
struct cl2{
int first,second;
bool operator<(const cl2 x) const {
return first < x.first;
}
};
cl2 clients3[size];
template<typename T>
void fill(T& t)
{
srand(471117); // Use the same seed each time, so every array gets the same data.
for(size_t i = 0; i < sizeof(t) / sizeof(t[0]); i++)
{
t[i].first = rand();
t[i].second = -t[i].first;
}
}
void func1()
{
sort(clients1,clients1+size);
}
bool cmp(cl1 x, cl1 y){
return x.first < y.first;
}
void func2()
{
sort(clients2,clients2+size,cmp);
}
void func3()
{
sort(clients3,clients3+size);
}
void benchmark(void (*f)(), const char *name)
{
cout << "running " << name << endl;
clock_t time = clock();
f();
time = clock() - time;
cout << "Time taken = " << (double)time / CLOCKS_PER_SEC << endl;
}
#define bm(x) benchmark(x, #x)
int main()
{
fill(clients1);
fill(clients2);
fill(clients3);
bm(func1);
bm(func2);
bm(func3);
}
The results are as follows:
running func1
Time taken = 10.39
running func2
Time taken = 14.09
running func3
Time taken = 10.06
I ran the benchmark three times, and they are all within ~0.1s of the above results.
Edit2:
And looking at the code generated, it's quite clear that the "middle" function takes quite a bit longer, since the comparison is made inline for pair and struct cl2, but can't be made inline for struct cl1 - so every compare literally makes a function call, rather than a few instructions inside the functions. This is a large overhead.
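If the function-pointer call is the overhead, one common way to give the compiler a chance to inline the comparison is to pass a function object or a lambda instead. This is a sketch only, with hypothetical names (Record, LessByFirst, sortWithFunctor, sortWithLambda); whether inlining actually happens depends on the compiler and optimization level, so it should be measured.

```cpp
#include <algorithm>
#include <vector>

struct Record {  // hypothetical name, same layout as cl1 in the benchmark
    int first, second;
};

// A function object has its own type, so std::sort is instantiated specifically
// for it and the compiler can usually inline the call to operator().
struct LessByFirst {
    bool operator()(const Record& x, const Record& y) const { return x.first < y.first; }
};

void sortWithFunctor(std::vector<Record>& records) {
    std::sort(records.begin(), records.end(), LessByFirst());
}

// A lambda behaves the same way: each lambda is a distinct closure type.
void sortWithLambda(std::vector<Record>& records) {
    std::sort(records.begin(), records.end(),
              [](const Record& x, const Record& y) { return x.first < y.first; });
}
```

Both variants also take their arguments by const reference, which avoids copying the struct on every comparison.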

Sorting a range (with no duplicates) in C++, is std::vector and std::sort faster than std::set?

I have a sequence of double (with no duplicates) and I need to sort them. Is filling a vector and then sorting it faster than inserting the values in a set?
Is this question answerable without a knowledge of the implementation of the standard library (and without a knowledge of the hardware on which the program will run) but just with the information provided by the C++ standard?
#include <vector>
#include <set>
#include <algorithm>
#include <random>
#include <iostream>
std::uniform_real_distribution<double> unif(0,10000);
std::default_random_engine re;
int main()
{
std::vector< double > v;
std::set< double > s;
std::vector< double > r;
size_t sz = 10;
for(size_t i = 0; i < sz; i++) {
r.push_back( unif(re) );
}
for(size_t i = 0; i < sz; i++) {
v.push_back(r[i]);
}
std::sort(v.begin(),v.end());
for(size_t i = 0; i < sz; i++) {
s.insert(r[i]);
}
return 0;
}
From the C++ standard, all we can say is that they both have the same asymptotic complexity (O(n*log(n))).
The set may be faster for large objects that can't be efficiently moved or swapped, since the objects don't need to be moved more than once. The vector may be faster for small objects, since sorting it involves no pointer updates and less indirection.
Which is faster in any given situation can only be determined by measuring (or a thorough knowledge of both the implementation and the target platform).
The use of vector may be faster because of data cache factors as the data operated upon will be in a more coherent memory region (probably).
The vector will also have less memory overhead per-value.
If you can, reserve the vector size before inserting data to minimize effort during filling the vector with values.
In terms of complexity both should be the same, i.e. O(n log n).
The answer is not trivial. If your software has two main phases, setup first and then lookup, and lookup is used more than setup, the sorted vector could be faster, for two reasons:
the lower_bound function from <algorithm> is faster than the usual tree implementation of <set>,
std::vector memory is spread over fewer heap pages, so there will be fewer page faults while you are looking for an element.
If the usage is mixed, or lookup is not used more than setup, then <set> will be faster. More info: Scott Meyers: Effective STL, Item 23.
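The setup-then-lookup pattern with a sorted vector could be sketched as follows. buildSorted and contains are hypothetical helper names, and the exact equality test on double is only appropriate when the keys are the very same values that were inserted.

```cpp
#include <algorithm>
#include <vector>

// Setup phase: take a copy of the values and sort it once.
std::vector<double> buildSorted(std::vector<double> values) {
    std::sort(values.begin(), values.end());
    return values;
}

// Lookup phase: binary search over contiguous memory.
bool contains(const std::vector<double>& sorted, double key) {
    auto it = std::lower_bound(sorted.begin(), sorted.end(), key);
    return it != sorted.end() && *it == key;
}
```

Each lookup is O(log n), like a lookup in std::set, but it walks a contiguous array instead of chasing tree node pointers.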
Since you said sorting in a range, you could use partial_sort instead of sorting the entire collection.
If we don't want to disturb the existing collection and want to have a new collection with sorted data and no duplicates, then std::set gives us a straight forward solution.
#include <vector>
#include <set>
#include <algorithm>
#include <iostream>
using namespace std;
int main()
{
int arr[] = { 1, 3, 4, 1, 6, 7, 9, 6 , 3, 4, 9 };
vector<int> ints ( arr, end(arr));
const int ulimit = 5;
auto last = ints.begin();
advance(last, ulimit);
set<int> sortedset;
sortedset.insert(ints.begin() , last);
for_each(sortedset.begin(), sortedset.end(), [](int x) { cout << x << "\n"; });
}
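For comparison, the same result (a new, sorted, duplicate-free collection that leaves the original untouched) can be obtained with a vector and the sort/unique/erase idiom. A minimal sketch, where sortedUniquePrefix is a hypothetical name and count plays the role of ulimit above:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns a sorted, duplicate-free copy of the first `count` elements of `input`.
std::vector<int> sortedUniquePrefix(const std::vector<int>& input, std::size_t count) {
    count = std::min(count, input.size());  // clamp instead of reading past the end
    std::vector<int> result(input.begin(),
                            input.begin() + static_cast<std::ptrdiff_t>(count));
    std::sort(result.begin(), result.end());
    // std::unique moves duplicates to the tail; erase then drops that tail.
    result.erase(std::unique(result.begin(), result.end()), result.end());
    return result;
}
```

Which of the two versions is faster for a given input is, as the other answers say, something to measure rather than assume.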

A proper way to create a matrix in c++

I want to create an adjacency matrix for a graph. Since I read it is not safe to use arrays of the form matrix[x][y] because they don't check for range, I decided to use the vector template class of the STL. All I need to store in the matrix are boolean values. So my question is whether using std::vector<std::vector<bool>* >* produces too much overhead, whether there is a simpler way to build a matrix, and how I can properly initialize it.
EDIT: Thanks a lot for the quick answers. I just realized, that of course I don't need any pointers. The size of the matrix will be initialized right in the beginning and won't change until the end of the program. It is for a school project, so it would be good if I write "nice" code, although technically performance isn't too important. Using the STL is fine. Using something like boost, is probably not appreciated.
Note that also you can use boost.ublas for matrix creation and manipulation and also boost.graph to represent and manipulate graphs in a number of ways, as well as using algorithms on them, etc.
Edit: Anyway, doing a range-check version of a vector for your purposes is not a hard thing:
template <typename T>
class BoundsMatrix
{
std::vector<T> inner_;
unsigned int dimx_, dimy_;
public:
BoundsMatrix (unsigned int dimx, unsigned int dimy)
: dimx_ (dimx), dimy_ (dimy)
{
inner_.resize (dimx_*dimy_);
}
T& operator()(unsigned int x, unsigned int y)
{
if (x >= dimx_ || y>= dimy_)
throw std::out_of_range("matrix indices out of range"); // ouch
return inner_[dimx_*y + x];
}
};
Note that you would also need to add the const version of the operators, and/or iterators, and you may want to revisit the use of exceptions, but you get the idea.
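A variant with the const overload added might look like the sketch below. CheckedMatrix is a hypothetical name for the same flat layout. One assumption to be aware of: with T = bool, std::vector<bool> hands out proxy objects rather than bool&, so this sketch is meant for element types such as char or int.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

template <typename T>
class CheckedMatrix {  // hypothetical name; same flat layout as the class above
    std::vector<T> inner_;
    std::size_t dimx_, dimy_;

    // Shared range check, so both overloads stay one-liners.
    std::size_t index(std::size_t x, std::size_t y) const {
        if (x >= dimx_ || y >= dimy_)
            throw std::out_of_range("matrix indices out of range");
        return dimx_ * y + x;
    }

public:
    CheckedMatrix(std::size_t dimx, std::size_t dimy)
        : inner_(dimx * dimy), dimx_(dimx), dimy_(dimy) {}

    T& operator()(std::size_t x, std::size_t y) { return inner_[index(x, y)]; }
    const T& operator()(std::size_t x, std::size_t y) const { return inner_[index(x, y)]; }
};
```

For an adjacency matrix, CheckedMatrix<char> m(n, n) followed by m(u, v) = 1 would mark an edge, and a const reference to m can still be read through the second overload.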
Best way:
Make your own matrix class, that way you control every last aspect of it, including range checking.
eg. If you like the "[x][y]" notation, do this:
class my_matrix {
std::vector<std::vector<bool> >m;
public:
my_matrix(unsigned int x, unsigned int y) {
m.resize(x, std::vector<bool>(y,false));
}
class matrix_row {
std::vector<bool>& row;
public:
matrix_row(std::vector<bool>& r) : row(r) {
}
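std::vector<bool>::reference operator[](unsigned int y) {
return row.at(y); // vector<bool> yields a proxy object, not a bool&
}
};
matrix_row operator[](unsigned int x) {
return matrix_row(m.at(x)); // return by value: a reference to the temporary would dangle
}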
};
// Example usage
my_matrix mm(100,100);
mm[10][10] = true;
nb. If you program like this then C++ is just as safe as all those other "safe" languages.
The standard vector does NOT do range checking by default.
i.e. The operator[] does not do a range check.
The method at() is similar to [] but does do a range check.
It will throw an exception on out of range.
std::vector::at()
std::vector::operator[]()
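The difference between the two accessors can be seen in a tiny sketch. atThrows is a hypothetical helper; it only calls at(), because reading out of range through operator[] is undefined behaviour and cannot be demonstrated safely.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns true if reading v.at(i) throws std::out_of_range.
bool atThrows(const std::vector<int>& v, std::size_t i) {
    try {
        (void)v.at(i);  // checked access
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

For a vector of three elements, atThrows(v, 2) is false and atThrows(v, 3) is true, whereas v[3] would compile and silently read past the end.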
Other notes:
Why a vector<Pointers> ?
You can quite easily have a vector<Object>. Now there is no need to worry about memory management (i.e. leaks).
std::vector<std::vector<bool> > m;
Note: vector<bool> is a specialization and not very efficient (i.e. this structure was optimized for size, not speed). It is now recognized by the standards committee as probably a mistake.
If you know the size of the matrix at compile time you could use std::bitset?
std::vector<std::bitset<5> > m;
or if it is runtime defined use boost::dynamic_bitset
std::vector<boost::dynamic_bitset<> > m;
All of the above will allow you to do:
m[6][3] = true;
If you want 'C' array performance, but with added safety and STL-like semantics (iterators, begin() & end() etc), use boost::array.
Basically it's a templated wrapper for 'C'-arrays with some NDEBUG-disable-able range checking asserts (and also some std::range_error exception-throwing accessors).
I use stuff like
boost::array<boost::array<float,4>,4> m;
instead of
float m[4][4];
all the time and it works great (with appropriate typedefs to keep the verbosity down, anyway).
UPDATE: Following some discussion in the comments here of the relative performance of boost::array vs boost::multi_array, I'd point out that this code, compiled with g++ -O3 -DNDEBUG on Debian/Lenny amd64 on a Q9450 with 1333MHz DDR3 RAM takes 3.3s for boost::multi_array vs 0.6s for boost::array.
#include <iostream>
#include <time.h>
#include <memory>
#include "boost/array.hpp"
#include "boost/multi_array.hpp"
using namespace boost;
enum {N=1024};
typedef multi_array<char,3> M;
typedef array<array<array<char,N>,N>,N> C;
// Forward declare to avoid being optimised away
static void clear(M& m);
static void clear(C& c);
int main(int,char**)
{
const clock_t t0=clock();
{
M m(extents[N][N][N]);
clear(m);
}
const clock_t t1=clock();
{
std::auto_ptr<C> c(new C);
clear(*c);
}
const clock_t t2=clock();
std::cout
<< "multi_array: " << (t1-t0)/static_cast<float>(CLOCKS_PER_SEC) << "s\n"
<< "array : " << (t2-t1)/static_cast<float>(CLOCKS_PER_SEC) << "s\n";
return 0;
}
void clear(M& m)
{
for (M::index i=0;i<N;i++)
for (M::index j=0;j<N;j++)
for (M::index k=0;k<N;k++)
m[i][j][k]=1;
}
void clear(C& c)
{
for (int i=0;i<N;i++)
for (int j=0;j<N;j++)
for (int k=0;k<N;k++)
c[i][j][k]=1;
}
What I would do is create my own class for dealing with matrices (probably as an array[x*y] because I'm more used to C (and I'd have my own bounds checking), but you could use vectors or any other sub-structure in that class).
Get your stuff functional first then worry about how fast it runs. If you design the class properly, you can pull out your array[x*y] implementation and replace it with vectors or bitmasks or whatever you want without changing the rest of the code.
I'm not totally sure, but I think that's what classes were meant for, the ability to abstract the implementation well out of sight and provide only the interface :-)
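As a rough illustration of hiding the storage behind an interface, the sketch below keeps a flat n*n array private and exposes only edge operations. AdjacencyMatrix, addEdge and hasEdge are hypothetical names, and std::vector<char> is just one possible backing store.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

class AdjacencyMatrix {  // hypothetical name
    std::size_t n_;
    std::vector<char> cells_;  // flat n*n storage; could be swapped for a bitset later

    // All bounds checking lives here, out of sight of the callers.
    std::size_t index(std::size_t from, std::size_t to) const {
        if (from >= n_ || to >= n_) throw std::out_of_range("vertex out of range");
        return from * n_ + to;
    }

public:
    explicit AdjacencyMatrix(std::size_t n) : n_(n), cells_(n * n, 0) {}

    void addEdge(std::size_t from, std::size_t to) { cells_[index(from, to)] = 1; }
    bool hasEdge(std::size_t from, std::size_t to) const { return cells_[index(from, to)] != 0; }
    std::size_t size() const { return n_; }
};
```

Callers never see cells_, so replacing it with nested vectors or a bitmask later would not change any code that uses addEdge and hasEdge.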
In addition to all the answers that have been posted so far, you might do well to check out the C++ FAQ Lite. Questions 13.10 - 13.12 and 16.16 - 16.19 cover several topics related to rolling your own matrix class. You'll see a couple of different ways to store the data and suggestions on how to best write the subscript operators.
Also, if your graph is sufficiently sparse, you may not need a matrix at all. You could use std::multimap to map each vertex to those it connects.
My favourite way to store a graph is vector<set<int>>; n elements in vector (nodes 0..n-1), >=0 elements in each set (edges). Just do not forget to add a reverse copy of every bidirectional edge.
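A minimal sketch of that representation, assuming C++11 and using hypothetical helper names (Graph, addUndirectedEdge, hasEdge):

```cpp
#include <cstddef>
#include <set>
#include <vector>

using Graph = std::vector<std::set<int>>;  // graph[u] holds the neighbours of u

// Adds an undirected edge by storing it in both directions.
// at() throws std::out_of_range for an invalid vertex before anything is modified.
void addUndirectedEdge(Graph& graph, int u, int v) {
    std::set<int>& fromU = graph.at(static_cast<std::size_t>(u));
    std::set<int>& fromV = graph.at(static_cast<std::size_t>(v));
    fromU.insert(v);
    fromV.insert(u);
}

bool hasEdge(const Graph& graph, int u, int v) {
    return graph.at(static_cast<std::size_t>(u)).count(v) > 0;
}
```

A graph with n nodes would be created as Graph g(n); memory then grows with the number of edges rather than with n*n, which is the appeal for sparse graphs.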
Consider also how big is your graph/matrix, does performance matter a lot? Is the graph static, or can it grow over time, e.g. by adding new edges?
Probably, not relevant as this is an old question, but you can use the Armadillo library, which provides many linear algebra oriented data types and functions.
Below is an example for your specific problem:
// In C++11
Mat<bool> matrix = {
{ true, true},
{ false, false},
};
// In C++98
Mat<bool> matrix;
matrix << true << true << endr
<< false << false << endr;
Mind you std::vector doesn't do range checking either.