I'm computing the Hamming weight of a vector. What I do is count all the 1s in the vector in a linear pass; is there any more efficient way?
int HammingWeight(vector<int> a){
    int HG=0;
    for(int i=0; i<a.size(); i++){
        if(a[i] == 1)
            HG++;
    }
    return HG;
}
To calculate the Hamming weight you need to visit each element, so O(n) is the best you can do, making your loop as efficient as it gets when discounting micro-optimizations.
However, your function call itself is extremely inefficient: you pass the vector by value, resulting in a copy of all its contents. This copy can easily be more expensive than the rest of the function combined. Furthermore, there is nothing at all in your function which actually needs that copy. So changing the signature to int HammingWeight(const std::vector<int>& a) should make your function much more efficient.
Another (possible) optimization comes to mind, assuming your vector only contains ones and zeros (otherwise I don't see how your code makes sense). In that case you can just add the corresponding vector element to HG, getting rid of the if (addition is typically much faster than branching):
for(size_t i=0; i<a.size(); ++i)
    HG += a[i];
I would assume this is likely to be faster, but whether or not it actually is depends on how the compiler optimizes.
If you actually needed to, you could of course apply common micro-optimizations (loop unrolling, vectorization, ...), but that would be premature unless you have good reason to. Besides, in that case the first thing to do (again assuming the vector only contains zeros and ones) would be to use a more compact (read: more efficient) representation of the data.
Also note that both approaches (the direct summation and the if version) could also be expressed using the standard library:
int HG=std::count(a.begin(), a.end(), 1); does basically the same thing as your code
int HG=std::accumulate(a.begin(), a.end(), 0); would be equivalent to the loop I mentioned above
Now this is unlikely to help performance, but using less code to achieve the same effect is typically considered a good thing.
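Putting both suggestions together, a minimal sketch (my combination of the above, not code from the question) could look like this:

#include <algorithm>
#include <vector>

int HammingWeight(const std::vector<int>& a) {
    // Const reference avoids the copy; std::count does the linear pass.
    return std::count(a.begin(), a.end(), 1);
}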
Related
I have an algorithm which requires applying set union many times to growing sets of integers. For efficiency I represent the sets as sorted vectors, so that their union can be obtained by merging them.
A classical way to merge two sorted vectors is this:
void inmerge(vector<int> &a, const vector<int> &b) {
    a.reserve(a.size() + b.size());
    std::copy(b.begin(), b.end(), std::back_inserter(a));
    std::inplace_merge(a.begin(), a.end() - b.size(), a.end());
}
Unfortunately, std::inplace_merge appears to be much slower than std::sort in this case, because of the allocation overhead. The fastest way is to use std::merge directly to output into one of the vectors. In order not to write a value before reading it, we have to proceed from the ends, like this:
void inmerge(vector<int> &a, const vector<int> &b) {
    a.resize(a.size() + b.size());
    auto orig_a_rbegin = a.rbegin() + b.size();
    std::merge(orig_a_rbegin, a.rend(), b.rbegin(), b.rend(), a.rbegin(), [](int x, int y) { return x > y; });
}
Surely an implementation of merge will never write more output elements than it has read from the inputs, so this should be a safe thing to do. Unfortunately, the C++ standard (even the C++17 draft) forbids this:
The resulting range shall not overlap with either of the original
ranges.
Is it okay to ignore this restriction if I know what I'm doing?
No, ignoring a mandate of the standard (or any other documentation of some library you're using) is never ok. You may know what you are doing, but are you sure you know what the library is doing - or might be doing in the next version?
For example, the merge algorithm could detect that at least two of your ranges are reverse ranges, unwrap them (and unwrap or reverse the third), and do the merge in the other direction. No observable difference as long as the preconditions are kept, but possibly a tiny bit faster since the overhead of the reverse iterators is gone. But it would really screw with your code.
To state it simply: No.
A bit longer: If you ignore a mandate by the standard you end up in Undefined Behaviour land and your compiler is free to do whatever it wants.
This includes doing exactly what you expect, doing nothing at all, crashing the program, deleting all your files or summoning nasal demons. That's not a place you want to be.
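For completeness, a minimal sketch of one standard-compliant alternative (my own suggestion, not part of the original answer): merge into a scratch buffer, so the output range overlaps neither input, then swap.

#include <algorithm>
#include <iterator>
#include <vector>

void inmerge(std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> tmp;
    tmp.reserve(a.size() + b.size());
    // The output goes to tmp, which overlaps neither input range.
    std::merge(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(tmp));
    a.swap(tmp);
}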
I was wondering if there is any STL algorithm which produces the same result as the following code:
std::vector<int> data;
std::vector<int> counter(N); // I know in advance that all values in data
                             // are between 0 and N-1
for(int i=0; i<data.size(); ++i)
    counter[data[i]]++;
This code simply outputs the histogram of my integer data, with pre-defined bin size equal to one.
I know that I should avoid loops as much as I can, as the equivalents using STL algorithms are usually better optimized than what the majority of C++ programmers may come up with.
Any suggestions?
Thank you in advance, Giuseppe
Well, you can certainly at least clean up the loop a bit:
for (auto i : data)
++count[i];
You could (for example) use std::for_each instead:
std::for_each(data.begin(), data.end(), [&count](int i) { ++count[i]; });
...but that doesn't really look like much (if any) of an improvement to me.
I don't think there's a more efficient way of doing this. You're right about avoiding loops and preferring the STL in most cases, but that mainly applies to bigger, overly complicated loops which are harder to write and maintain, and therefore more likely to be suboptimal.
Looking at the problem at the assembly level, the only way to compute this is essentially the way you have it in your example. Since C/C++ loops translate to assembly very efficiently with zero unnecessary overhead, this leaves me believing that no STL function could perform this faster than your algorithm.
There is an STL function called count, but its complexity is linear ( O(n) ), just like your solution's.
If you really want to squeeze the maximum out of every CPU cycle, then consider using C-style arrays and a separate counter variable. The overhead introduced by vectors is barely even measurable, but if there is any, that's the only opportunity I see for optimization here. Not that I would suggest it, but I'm afraid that's the only way you can get a hair more speed out of this.
If you think about it, in order to count the occurrences of elements in a vector, each element would have to be "visited" at least once, there's no avoiding it.
A simple loop like this is already the most efficient. You can try to unroll it, but that's probably the best you can do. STL or not, I doubt if there's a better algorithm.
You can use for_each and one lambda function. Check this example:
#include <algorithm>
#include <vector>
#include <cstdlib>
#include <ctime>
#include <iostream>

const int N = 10;
using namespace std;

int main()
{
    srand(time(0));
    std::vector<int> counter(N);
    std::vector<int> data(N);
    generate(data.begin(), data.end(), []{ return rand() % N; });

    for (int i = 0; i < N; i++)
        cout << data[i] << endl;
    cout << endl;

    for_each(data.begin(), data.end(), [&counter](int i){ ++counter[i]; });

    for (int i = 0; i < N; i++)
        cout << counter[i] << endl;
}
I would like to create a vector (arma::uvec) of integers. I do not know the size of the vector in advance, and I could not find an appropriate function in the Armadillo documentation; moreover, I was not successful in creating the vector with a loop. I think the issue is in initializing the vector or in keeping track of its length.
arma::uvec foo(arma::vec x){
    arma::uvec vect;
    int nn = x.size();
    vect(0) = 1;
    int ind = 0;
    for (int i = 0; i < nn; i++){
        if (x(i) > 0){
            ind = ind + 1;
            vect(ind) = i;
        }
    }
    return vect;
}
The error message is: Error: Mat::operator(): index out of bounds.
I would not want to assign 1 to the first element of the vector, but could live with that if necessary.
PS: I would really like to know how to obtain the vector of unknown length by appending, so that I could use it even in more general cases.
Repeatedly appending elements to a vector is a really bad idea from a performance point of view, as it can cause repeated memory reallocations and copies.
There are two main solutions to that.
Set the size of the vector to the theoretical maximum length of your operation (nn in this case), and then use a loop to set some of the values in the vector. You will need to keep a separate counter for the number of set elements in the vector so far. After the loop, take a subvector of the vector, using the .head() function. The advantage here is that there will be only one copy.
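A minimal sketch of this first approach (my own illustration, assuming the goal is to collect the indices i where x(i) > 0, as in the question):

#include <armadillo>

arma::uvec positive_indices(const arma::vec& x) {
    arma::uvec vect(x.n_elem);     // theoretical maximum length
    arma::uword n_found = 0;       // separate counter for elements set so far
    for (arma::uword i = 0; i < x.n_elem; ++i) {
        if (x(i) > 0) {
            vect(n_found) = i;
            ++n_found;
        }
    }
    return vect.head(n_found);     // take a subvector of the filled part: one copy
}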
An alternative solution is to use two loops, to reduce memory usage. In the first loop work out the final length of the vector. Then set the size of the vector to the final length. In the second loop set the elements in the vector. Obviously using two loops is less efficient than one loop, but it's likely that this is still going to be much faster than appending.
If you still want to be a lazy coder and inefficiently append elements, use the .insert_rows() function.
As a sidenote, your foo(arma::vec x) is already making an unnecessary copy of the input vector. Arguments in C++ are passed by value by default, which basically means C++ will make a copy of x before running your function. To avoid this unnecessary copy, change your function to foo(const arma::vec& x), which means it takes a constant reference to x. The & is critical here.
In addition to mtall's answer, which I agree with, for a case in which performance wasn't needed I used this:
void uvec_push(arma::uvec & v, unsigned int value) {
    arma::uvec av(1);
    av.at(0) = value;
    v.insert_rows(v.n_rows, av.row(0));
}
I'm intersecting some sets of numbers, and doing this by storing a count of each time I see a number in a map.
I'm finding the performance to be very slow.
Details:
- One of the sets has 150,000 numbers in it
- The intersection of that set and another set takes about 300ms the first time, and about 5000ms the second time
- I haven't done any profiling yet, but every time I break into the debugger while doing the intersection, it's in malloc.c!
So, how can I improve this performance? Switch to a different data structure? Somehow improve the memory allocation performance of map?
Update:
Is there any way to ask std::map or boost::unordered_map to pre-allocate some space?
Or, are there any tips for using these efficiently?
Update2:
See Fast C++ container like the C# HashSet<T> and Dictionary<K,V>?
Update3:
I benchmarked set_intersection and got horrible results:
(set_intersection) Found 313 values in the intersection, in 11345ms
(set_intersection) Found 309 values in the intersection, in 12332ms
Code:
int runIntersectionTestAlgo()
{
    set<int> set1;
    set<int> set2;
    set<int> intersection;

    // Create 100,000 values for set1
    for ( int i = 0; i < 100000; i++ )
    {
        int value = 1000000000 + i;
        set1.insert(value);
    }

    // Create 1,000 values for set2
    for ( int i = 0; i < 1000; i++ )
    {
        int random = rand() % 200000 + 1;
        random *= 10;
        int value = 1000000000 + random;
        set2.insert(value);
    }

    set_intersection(set1.begin(), set1.end(), set2.begin(), set2.end(), inserter(intersection, intersection.end()));

    return intersection.size();
}
You should definitely be using preallocated vectors which are way faster. The problem with doing set intersection with stl sets is that each time you move to the next element you're chasing a dynamically allocated pointer, which could easily not be in your CPU caches. With a vector the next element will often be in your cache because it's physically close to the previous element.
The trick with vectors, is that if you don't preallocate the memory for a task like this, it'll perform EVEN WORSE because it'll go on reallocating memory as it resizes itself during your initialization step.
Try something like this instead - it'll be WAY faster.
int runIntersectionTestAlgo() {
    vector<int> vector1; vector1.reserve(100000);
    vector<int> vector2; vector2.reserve(1000);

    // Create 100,000 values for vector1
    for ( int i = 0; i < 100000; i++ ) {
        int value = 1000000000 + i;
        vector1.push_back(value);
    }
    sort(vector1.begin(), vector1.end());

    // Create 1,000 values for vector2
    for ( int i = 0; i < 1000; i++ ) {
        int random = rand() % 200000 + 1;
        random *= 10;
        int value = 1000000000 + random;
        vector2.push_back(value);
    }
    sort(vector2.begin(), vector2.end());

    // Reserve at most 1,000 spots for the intersection
    vector<int> intersection; intersection.reserve(min(vector1.size(), vector2.size()));
    set_intersection(vector1.begin(), vector1.end(), vector2.begin(), vector2.end(), back_inserter(intersection));

    return intersection.size();
}
Without knowing any more about your problem, "check with a good profiler" is the best general advice I can give. Beyond that...
If memory allocation is your problem, switch to some sort of pooled allocator that reduces calls to malloc. Boost has a number of custom allocators that should be compatible with std::allocator<T>. In fact, you may even try this before profiling, if you've already noticed debug-break samples always ending up in malloc.
If your number-space is known to be dense, you can switch to using a vector- or bitset-based implementation, using your numbers as indexes in the vector.
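For illustration, a rough sketch of the dense-number-space idea (my own example; kMaxValue is a hypothetical bound, and it assumes the numbers can be used directly as indexes, e.g. after subtracting a common offset):

#include <cstddef>
#include <vector>

const std::size_t kMaxValue = 2000000;  // hypothetical upper bound on the numbers

std::vector<bool> make_membership(const std::vector<int>& values) {
    std::vector<bool> present(kMaxValue, false);
    for (int v : values)
        present[v] = true;              // the number itself is the index
    return present;
}

std::size_t intersection_size(const std::vector<bool>& a, const std::vector<bool>& b) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] && b[i])               // present in both sets
            ++n;
    return n;
}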
If your number-space is mostly sparse but has some natural clustering (this is a big if), you may switch to a map-of-vectors. Use higher-order bits for map indexing, and lower-order bits for vector indexing. This is functionally very similar to simply using a pooled allocator, but it is likely to give you better caching behavior. This makes sense, since you are providing more information to the machine (clustering is explicit and cache-friendly, rather than a random distribution you'd expect from pool allocation).
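A rough sketch of that map-of-vectors layout (my own illustration; the 12-bit split is an arbitrary choice):

#include <cstdint>
#include <map>
#include <vector>

const int kLowBits = 12;                               // 4096-entry blocks

void count_value(std::map<std::uint32_t, std::vector<std::uint32_t>>& buckets,
                 std::uint32_t value) {
    std::uint32_t hi = value >> kLowBits;              // higher-order bits pick the bucket
    std::uint32_t lo = value & ((1u << kLowBits) - 1); // lower-order bits index into it
    std::vector<std::uint32_t>& block = buckets[hi];
    if (block.empty())
        block.resize(1u << kLowBits, 0);               // allocate the whole cluster once
    ++block[lo];                                       // count occurrences, as in the question
}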
I would second the suggestion to sort them. There are already STL set algorithms that operate on sorted ranges (like set_intersection, set_union, etc):
set_intersection
I don't understand why you have to use a map to do intersection. Like people have said, you could put the sets in std::set's, and then use std::set_intersection().
Or you can put them into hash_set's. But then you would have to implement intersection manually: technically you only need to put one of the sets into a hash_set, and then loop through the other one, and test if each element is contained in the hash_set.
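A minimal sketch of that approach using std::unordered_set (the standard replacement for the pre-standard hash_set):

#include <unordered_set>
#include <vector>

std::vector<int> intersect(const std::unordered_set<int>& big,
                           const std::vector<int>& other) {
    std::vector<int> result;
    for (int v : other)
        if (big.count(v) != 0)      // average O(1) membership test
            result.push_back(v);
    return result;
}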
Intersection with maps is slow; try a hash_map (however, this is not provided in all STL implementations).
Alternatively, sort both maps and do it in a merge-sort-like way.
What is your intersection algorithm? Maybe there are some improvements to be made?
Here is an alternate method
I do not know it to be faster or slower, but it could be something to try. Before doing so, I also recommend using a profiler to ensure you really are working on the hotspot. Change the sets of numbers you are intersecting to use std::set<int> instead. Then iterate through the smallest one looking at each value you find. For each value in the smallest set, use the find method to see if the number is present in each of the other sets (for performance, search from smallest to largest).
This is optimised in the case that the number is not found in all of the sets, so if the intersection is relatively small, it may be fast.
Then, store the intersection in std::vector<int> instead - insertion using push_back is also very fast.
Here is another alternate method
Change the sets of numbers to std::vector<int> and use std::sort to sort from smallest to largest. Then use std::binary_search to find the values, using roughly the same method as above. This may be faster than searching a std::set since the array is more tightly packed in memory. Actually, never mind that, you can then just iterate through the values in lock-step, looking at the ones with the same value. Increment only the iterators which are less than the minimum value you saw at the previous step (if the values were different).
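For illustration, a minimal sketch of that lock-step walk over two sorted vectors (my own example, assuming both inputs have already been sorted with std::sort):

#include <cstddef>
#include <vector>

std::vector<int> intersect_sorted(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j])
            ++i;                    // advance the iterator holding the smaller value
        else if (b[j] < a[i])
            ++j;
        else {
            out.push_back(a[i]);    // equal values belong to the intersection
            ++i;
            ++j;
        }
    }
    return out;
}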
Might be your algorithm. As I understand it, you are spinning over each set (which I'm hoping is a standard set), and throwing them into yet another map. This is doing a lot of work you don't need to do, since the keys of a standard set are in sorted order already. Instead, take a "merge-sort"-like approach. Spin over each iterator, dereferencing to find the minimum. Count how many sets have that minimum, and increment their iterators. If the count was N, add the value to the intersection. Repeat until the first map hits its end (if you compare the sizes before starting, you won't have to check every map's end each time).
Responding to the update: there do exist facilities to speed up memory allocation by pre-reserving space, like boost::pool_alloc. Something like:
std::map<int, int, std::less<int>, boost::pool_allocator< std::pair<int const, int> > > m;
But honestly, malloc is pretty good at what it does; I'd profile before doing anything too extreme.
Look at your algorithms, then choose the proper data type. If you're going to have set-like behaviour, and want to do intersections and the like, std::set is the container to use.
Since its elements are stored in a sorted way, insertion may cost you O(log N), but intersection with another (sorted!) std::set can be done in linear time.
I figured something out: if I attach the debugger to either RELEASE or DEBUG builds (e.g. hit F5 in the IDE), then I get horrible times.
I have a class containing a number of double values. This is stored in a vector where the indices for the classes are important (they are referenced from elsewhere). The class looks something like this:
Vector of classes
class A
{
public:
    double count;
    double val;
    double sumA;
    double sumB;
    vector<double> sumVectorC;
    vector<double> sumVectorD;
};
vector<A> classes(10000);
The code that needs to run as fast as possible is something like this:
vector<double> result(classes.size());
for(int i = 0; i < classes.size(); i++)
{
    result[i] += classes[i].sumA;
    vector<double>::iterator it = find(classes[i].sumVectorC.begin(), classes[i].sumVectorC.end(), testval);
    if(it != classes[i].sumVectorC.end())
        result[i] += *it;
}
The alternative is instead of one giant loop, split the computation into two separate loops such as:
for(int i = 0; i < classes.size(); i++)
{
    result[i] += classes[i].sumA;
}
for(int i = 0; i < classes.size(); i++)
{
    vector<double>::iterator it = find(classes[i].sumVectorC.begin(), classes[i].sumVectorC.end(), testval);
    if(it != classes[i].sumVectorC.end())
        result[i] += *it;
}
or to store each member of the class in a vector like so:
Class of vectors
vector<double> classCounts;
vector<double> classVal;
...
vector<vector<double> > classSumVectorC;
...
and then operate as:
for(int i = 0; i < classes.size(); i++)
{
    result[i] += classCounts[i];
    ...
}
Which way would usually be faster (across x86/x64 platforms and compilers)? Are look-ahead and cache lines the most important things to think about here?
Update
The reason I'm doing a linear search (i.e. find) here and not a hash map or binary search is because the sumVectors are very short, around 4 or 5 elements. Profiling showed a hash map was slower and a binary search was slightly slower.
As the implementation of both variants seems easy enough I would build both versions and profile them to find the fastest one.
Empirical data usually beats speculation.
As a side issue: Currently, the find() in your innermost loop does a linear scan through all elements of classes[i].sumVectorC until it finds a matching value. If that vector contains many values, and you have no reason to believe that testVal appears near the start of the vector, then this will be slow -- consider using a container type with faster lookup instead (e.g. std::map or one of the nonstandard but commonly implemented hash_map types).
As a general guideline: consider algorithmic improvements before low-level implementation optimisation.
As lothar says, you really should test it out. But to answer your last question, yes, cache misses will be a major concern here.
Also, it seems that your first implementation would run into load-hit-store stalls as coded, but I'm not sure how much of a problem that is on x86 (it's a big problem on XBox 360 and PS3).
It looks like optimizing the find() would be a big win (profile to know for sure). Depending on the various sizes, in addition to replacing the vector with another container, you could try sorting sumVectorC and using a binary search in the form of lower_bound. This will turn your linear search O(n) into O(log n).
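A minimal sketch of that lower_bound lookup (my own illustration, assuming sumVectorC is kept sorted):

#include <algorithm>
#include <vector>

bool contains(const std::vector<double>& sorted_vec, double testval) {
    // lower_bound returns the first element that is not less than testval.
    std::vector<double>::const_iterator it =
        std::lower_bound(sorted_vec.begin(), sorted_vec.end(), testval);
    return it != sorted_vec.end() && *it == testval;
}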
If you can guarantee that std::numeric_limits<double>::infinity() is not a possible value, you can keep the arrays sorted with a dummy infinite entry at the end and then hand-code the find so that the loop condition is a single test:
array[i] < test_val
followed by a single equality test.
Then you know that the average number of values looked at is (size()+1)/2 in the not-found case. Of course, if the search array changes very frequently, keeping it sorted becomes an issue.
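A minimal sketch of that sentinel-based search (my own illustration, assuming the vector is sorted and padded with one infinity entry at the end):

#include <cstddef>
#include <vector>

bool contains_sentinel(const std::vector<double>& sorted_with_sentinel, double test_val) {
    std::size_t i = 0;
    while (sorted_with_sentinel[i] < test_val)  // single loop condition; the sentinel guarantees termination
        ++i;
    return sorted_with_sentinel[i] == test_val; // one equality test at the end
}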
Of course you don't tell us much about sumVectorC, or about the rest of A for that matter, so it is hard to be certain and give really good advice. For example, if sumVectorC is never updated, it is probably possible to find an extremely cheap hash (e.g. a cast to unsigned long long plus bit extraction) that is perfect on the sumVectorC values and fits into a double[8]. Then the overhead is a bit extraction and 1 comparison, versus 3 or 6.
Also, if you have a reasonable bound on sumVectorC.size() (you mentioned 4 or 5, so this assumption seems not bad), you could consider using an aggregated array, or even just a boost::array<double> with your own dynamic size, e.g.:
#include <boost/array.hpp>
#include <cstddef>

// Fixed-capacity "vector": the elements live inline in the boost::array,
// and the size is tracked manually.
class AggregatedArray : public boost::array<double, 8> {
    std::size_t _size;
public:
    AggregatedArray() : _size(0) {}
    std::size_t size() const { return _size; }
    void push_back(double v) { (*this)[_size++] = v; }
    void pop() { --_size; }
    void resize(std::size_t n) { _size = n; }
};
This gets rid of the extra cache-line access to the separately allocated array data for sumVectorC.
If sumVectorC updates very infrequently, and finding a perfect hash (out of your class of hash algorithms) is relatively cheap, then you can incur that cost profitably whenever sumVectorC changes. These small lookups can be problematic, and algorithmic complexity is frequently irrelevant - it is the constants that dominate. It is an engineering problem and not a theoretical one.
Unless you can guarantee that the small maps are in cache, you can almost be guaranteed that using a std::map will yield approximately 130% worse performance, as pretty much each node in the tree will be in a separate cache line.
Thus, instead of accessing (4*1 + 1*2)/5 = 1.2 cache lines per search (the first 4 entries share the first cache line, the 5th is in the second), with a std::map you touch the root node, its 2 children and 2 grandchildren at increasing depths, plus the tree object itself, for roughly 2.8 cache lines per search.
So I would predict using a std::map to take 2.8/1.2 = 233% as long for a sumVectorC having 5 entries.
This what I meant when I said: "It is an engineering problem and not a theoretical one."