What is the performance of STL bitset::count() method?

What is the performance of STL bitset::count() method? - c++

I searched around and could not find the performance time specifications for bitset::count(). Does anybody know what it is (O(n) or better) and where to find it?
EDIT By STL I refer only to the Standard Template Library.

I read this file (C:\cygwin\lib\gcc\i686-pc-cygwin\3.4.4\include\c++\bitset) on my computer.
See these
/// Returns the number of bits which are set.
size_t
count() const { return this->_M_do_count(); }
size_t
_M_do_count() const
{
size_t __result = 0;
for (size_t __i = 0; __i < _Nw; __i++)
__result += __builtin_popcountl(_M_w[__i]);
return __result;
}
BTW, this is where _Nw is specified:
template<size_t _Nw>
struct _Base_bitset
Thus it's O(n) in gcc implementation. We conclude the specification doesn't require it better than O(n). And nobody in their right mind will implement it in a way worse than that. We can then safely assume that it's at worst O(n). Possibly better but you can never count on that.

I can't be sure what you really mean by "STL" here, due to a prevailing misuse of the term in the C++ community.
The C++ Standard (2003) makes no mandate for the performance of std::bitset::count() (or, in fact, any members of std::bitset as far as I can see).
I can't find any reference suggesting a mandate for the performance of STL's bitset::count() either.
I think any sane implementation will provide this in constant (or at worst linear) time, though. However, this is merely a feeling. Check yours to find out what you'll actually get.

"SGI's reference implementation runs
in linear time with respect to the
number of bytes needed to store the
bits. It does this by creating a
static array of 256 integers. The
value stored at ith index in the array
is the number of bits set in the value
i."
http://www.cplusplus.com/forum/general/12486/

I'm not sure you're going to find a specification for that, since the STL doesn't typically require a certain level of performance. I've seen hints that it's "fast", around 1 cycle per bit in the set's size. You can of course read your particular implementation's code to find out what to expect.

The Algorithm that we follow is to count all the bits that are set to 1.
Now if we want to count through that bitset for a number n, we would go through log(n)+1 digits.
For example: for the number 13, we get 1101 as the bitset.
Natural log of 13 = 2.564 (approximately) 3
Number of bits = 3+1 = 4
For any number n(decimal) we loop log(n)+1 times.
Another approach would be the following:
int count_set_bits_fast(int n) {
int count = 0;
while (n > 0) {
n=(n&(n-1));
count++
}
return count;
}
If you analyse the functional line n=(n&(n-1)); you shall find that it essentially reduces the number of bits from right to left.
The Order would therefore be number of total set bits.
For example: 13 = 1101
1101&1100 = 1100
1100&1011 = 1000
1000&0111 = 0
O(number of set bits), O(Log(n)+1) Worst case

Related

CodeSignal: Execution time limit exceeded with c++

I am trying to solve the programming problem firstDuplicate on codesignal. The problem is "Given an array a that contains only numbers in the range 1 to a.length, find the first duplicate number for which the second occurrence has minimal index".
Example: For a = [2, 1, 3, 5, 3, 2] the output should be firstDuplicate(a) = 3
There are 2 duplicates: numbers 2 and 3. The second occurrence of 3 has a smaller index than the second occurrence of 2 does, so the answer is 3.
With this code I pass 21/23 tests, but then it tells me that the program exceeded the execution time limit on test 22. How would I go about making it faster so that it passes the remaining two tests?
#include <algorithm>
int firstDuplicate(vector<int> a) {
vector<int> seen;
for (size_t i = 0; i < a.size(); ++i){
if (std::find(seen.begin(), seen.end(), a[i]) != seen.end()){
return a[i];
}else{
seen.push_back(a[i]);
}
}
if (seen == a){
return -1;
}
}

Anytime you get asked a question about "find the duplicate", "find the missing element", or "find the thing that should be there", your first instinct should be use a hash table. In C++, there are the unordered_map and unordered_set classes that are for such types of coding exercises. The unordered_set is effectively a map of keys to bools.
Also, pass you vector by reference, not value. Passing by value incurs the overhead of copying the entire vector.
Also, that comparison seems costly and unnecessary at the end.
This is probably closer to what you want:
#include <unordered_set>
int firstDuplicate(const vector<int>& a) {
std::unordered_set<int> seen;
for (int i : a) {
auto result_pair = seen.insert(i);
bool duplicate = (result_pair.second == false);
if (duplicate) {
return (i);
}
}
return -1;
}

std::find is linear time complexity in terms of distance between first and last element (or until the number is found) in the container, thus having a worst-case complexity of O(N), so your algorithm would be O(N^2).
Instead of storing your numbers in a vector and searching for it every time, Yyu should do something like hashing with std::map to store the numbers encountered and return a number if while iterating, it is already present in the map.
std::map<int, int> hash;
for(const auto &i: a) {
if(hash[i])
return i;
else
hash[i] = 1;
}
Edit: std::unordered_map is even more efficient if the order of keys doesn't matter, since insertion time complexity is constant in average case as compared to logarithmic insertion complexity for std::map.

It's probably an unnecessary optimization, but I think I'd try to take slightly better advantage of the specification. A hash table is intended primarily for cases where you have a fairly sparse conversion from possible keys to actual keys--that is, only a small percentage of possible keys are ever used. For example, if your keys are strings of length up to 20 characters, the theoretical maximum number of keys is 25620. With that many possible keys, it's clear no practical program is going to store any more than a minuscule percentage, so a hash table makes sense.
In this case, however, we're told that the input is: "an array a that contains only numbers in the range 1 to a.length". So, even if half the numbers are duplicates, we're using 50% of the possible keys.
Under the circumstances, instead of a hash table, even though it's often maligned, I'd use an std::vector<bool>, and expect to get considerably better performance in the vast majority of cases.
int firstDuplicate(std::vector<int> const &input) {
std::vector<bool> seen(input.size()+1);
for (auto i : input) {
if (seen[i])
return i;
seen[i] = true;
}
return -1;
}
The advantage here is fairly simple: at least in a typical case, std::vector<bool> uses a specialization to store bools in only one bit apiece. This way we're storing only one bit for each number of input, which increases storage density, so we can expect excellent use of the cache. In particular, as long as the number of bytes in the cache is at least a little more than 1/8th the number of elements in the input array, we can expect all of seen to be in the cache most of the time.
Now make no mistake: if you look around, you'll find quite a few articles pointing out that vector<bool> has problems--and for some cases, that's entirely true. There are places and times that vector<bool> should be avoided. But none of its limitations applies to the way we're using it here--and it really does give an advantage in storage density that can be quite useful, especially for cases like this one.
We could also write some custom code to implement a bitmap that would give still faster code than vector<bool>. But using vector<bool> is easy, and writing our own replacement that's more efficient is quite a bit of extra work...

What is the performance of std::bitset?

I recently asked a question on Programmers regarding reasons to use manual bit manipulation of primitive types over std::bitset.
From that discussion I have concluded that the main reason is its comparatively poorer performance, although I'm not aware of any measured basis for this opinion. So next question is:
what is the performance hit, if any, likely to be incurred by using std::bitset over bit-manipulation of a primitive?
The question is intentionally broad, because after looking online I haven't been able to find anything, so I'll take what I can get. Basically I'm after a resource that provides some profiling of std::bitset vs 'pre-bitset' alternatives to the same problems on some common machine architecture using GCC, Clang and/or VC++. There is a very comprehensive paper which attempts to answer this question for bit vectors:
http://www.cs.up.ac.za/cs/vpieterse/pub/PieterseEtAl_SAICSIT2010.pdf
Unfortunately, it either predates or considered out of scope std::bitset, so it focuses on vectors/dynamic array implementations instead.
I really just want to know whether std::bitset is better than the alternatives for the use cases it is intended to solve. I already know that it is easier and clearer than bit-fiddling on an integer, but is it as fast?

Update
It's been ages since I posted this one, but:
I already know that it is easier and clearer than bit-fiddling on an
integer, but is it as fast?
If you are using bitset in a way that does actually make it clearer and cleaner than bit-fiddling, like checking for one bit at a time instead of using a bit mask, then inevitably you lose all those benefits that bitwise operations provide, like being able to check to see if 64 bits are set at one time against a mask, or using FFS instructions to quickly determine which bit is set among 64-bits.
I'm not sure that bitset incurs a penalty to use in all ways possible (ex: using its bitwise operator&), but if you use it like a fixed-size boolean array which is pretty much the way I always see people using it, then you generally lose all those benefits described above. We unfortunately can't get that level of expressiveness of just accessing one bit at a time with operator[] and have the optimizer figure out all the bitwise manipulations and FFS and FFZ and so forth going on for us, at least not since the last time I checked (otherwise bitset would be one of my favorite structures).
Now if you are going to use bitset<N> bits interchangeably with like, say, uint64_t bits[N/64] as in accessing both the same way using bitwise operations, it might be on par (haven't checked since this ancient post). But then you lose many of the benefits of using bitset in the first place.
for_each method
In the past I got into some misunderstandings, I think, when I proposed a for_each method to iterate through things like vector<bool>, deque, and bitset. The point of such a method is to utilize the internal knowledge of the container to iterate through elements more efficiently while invoking a functor, just as some associative containers offer a find method of their own instead of using std::find to do a better than linear-time search.
For example, you can iterate through all set bits of a vector<bool> or bitset if you had internal knowledge of these containers by checking for 64 elements at a time using a 64-bit mask when 64 contiguous indices are occupied, and likewise use FFS instructions when that's not the case.
But an iterator design having to do this type of scalar logic in operator++ would inevitably have to do something considerably more expensive, just by the nature in which iterators are designed in these peculiar cases. bitset lacks iterators outright and that often makes people wanting to use it to avoid dealing with bitwise logic to use operator[] to check each bit individually in a sequential loop that just wants to find out which bits are set. That too is not nearly as efficient as what a for_each method implementation could do.
Double/Nested Iterators
Another alternative to the for_each container-specific method proposed above would be to use double/nested iterators: that is, an outer iterator which points to a sub-range of a different type of iterator. Client code example:
for (auto outer_it = bitset.nbegin(); outer_it != bitset.nend(); ++outer_it)
{
for (auto inner_it = outer_it->first; inner_it != outer_it->last; ++inner_it)
// do something with *inner_it (bit index)
}
While not conforming to the flat type of iterator design available now in standard containers, this can allow some very interesting optimizations. As an example, imagine a case like this:
bitset<64> bits = 0x1fbf; // 0b1111110111111;
In that case, the outer iterator can, with just a few bitwise iterations ((FFZ/or/complement), deduce that the first range of bits to process would be bits [0, 6), at which point we can iterate through that sub-range very cheaply through the inner/nested iterator (it would just increment an integer, making ++inner_it equivalent to just ++int). Then when we increment the outer iterator, it can then very quickly, and again with a few bitwise instructions, determine that the next range would be [7, 13). After we iterate through that sub-range, we're done. Take this as another example:
bitset<16> bits = 0xffff;
In such a case, the first and last sub-range would be [0, 16), and the bitset could determine that with a single bitwise instruction at which point we can iterate through all set bits and then we're done.
This type of nested iterator design would map particularly well to vector<bool>, deque, and bitset as well as other data structures people might create like unrolled lists.
I say that in a way that goes beyond just armchair speculation, since I have a set of data structures which resemble the likes of deque which are actually on par with sequential iteration of vector (still noticeably slower for random-access, especially if we're just storing a bunch of primitives and doing trivial processing). However, to achieve the comparable times to vector for sequential iteration, I had to use these types of techniques (for_each method and double/nested iterators) to reduce the amount of processing and branching going on in each iteration. I could not rival the times otherwise using just the flat iterator design and/or operator[]. And I'm certainly not smarter than the standard library implementers but came up with a deque-like container which can be sequentially iterated much faster, and that strongly suggests to me that it's an issue with the standard interface design of iterators in this case which come with some overhead in these peculiar cases that the optimizer cannot optimize away.
Old Answer
I'm one of those who would give you a similar performance answer, but I'll try to give you something a bit more in-depth than "just because". It is something I came across through actual profiling and timing, not merely distrust and paranoia.
One of the biggest problems with bitset and vector<bool> is that their interface design is "too convenient" if you want to use them like an array of booleans. Optimizers are great at obliterating all that structure you establish to provide safety, reduce maintenance cost, make changes less intrusive, etc. They do an especially fine job with selecting instructions and allocating the minimal number of registers to make such code run as fast as the not-so-safe, not-so-easy-to-maintain/change alternatives.
The part that makes the bitset interface "too convenient" at the cost of efficiency is the random-access operator[] as well as the iterator design for vector<bool>. When you access one of these at index n, the code has to first figure out which byte the nth bit belongs to, and then the sub-index to the bit within that. That first phase typically involves a division/rshifts against an lvalue along with modulo/bitwise and which is more costly than the actual bit operation you're trying to perform.
The iterator design for vector<bool> faces a similar awkward dilemma where it either has to branch into different code every 8+ times you iterate through it or pay that kind of indexing cost described above. If the former is done, it makes the logic asymmetrical across iterations, and iterator designs tend to take a performance hit in those rare cases. To exemplify, if vector had a for_each method of its own, you could iterate through, say, a range of 64 elements at once by just masking the bits against a 64-bit mask for vector<bool> if all the bits are set without checking each bit individually. It could even use FFS to figure out the range all at once. An iterator design would tend to inevitably have to do it in a scalar fashion or store more state which has to be redundantly checked every iteration.
For random access, optimizers can't seem to optimize away this indexing overhead to figure out which byte and relative bit to access (perhaps a bit too runtime-dependent) when it's not needed, and you tend to see significant performance gains with that more manual code processing bits sequentially with advanced knowledge of which byte/word/dword/qword it's working on. It's somewhat of an unfair comparison, but the difficulty with std::bitset is that there's no way to make a fair comparison in such cases where the code knows what byte it wants to access in advance, and more often than not, you tend to have this info in advance. It's an apples to orange comparison in the random-access case, but you often only need oranges.
Perhaps that wouldn't be the case if the interface design involved a bitset where operator[] returned a proxy, requiring a two-index access pattern to use. For example, in such a case, you would access bit 8 by writing bitset[0][6] = true; bitset[0][7] = true; with a template parameter to indicate the size of the proxy (64-bits, e.g.). A good optimizer may be able to take such a design and make it rival the manual, old school kind of way of doing the bit manipulation by hand by translating that into: bitset |= 0x60;
Another design that might help is if bitsets provided a for_each_bit kind of method, passing a bit proxy to the functor you provide. That might actually be able to rival the manual method.
std::deque has a similar interface problem. Its performance shouldn't be that much slower than std::vector for sequential access. Yet unfortunately we access it sequentially using operator[] which is designed for random access or through an iterator, and the internal rep of deques simply don't map very efficiently to an iterator-based design. If deque provided a for_each kind of method of its own, then there it could potentially start to get a lot closer to std::vector's sequential access performance. These are some of the rare cases where that Sequence interface design comes with some efficiency overhead that optimizers often can't obliterate. Often good optimizers can make convenience come free of runtime cost in a production build, but unfortunately not in all cases.
Sorry!
Also sorry, in retrospect I wandered a bit with this post talking about vector<bool> and deque in addition to bitset. It's because we had a codebase where the use of these three, and particularly iterating through them or using them with random-access, were often hotspots.
Apples to Oranges
As emphasized in the old answer, comparing straightforward usage of bitset to primitive types with low-level bitwise logic is comparing apples to oranges. It's not like bitset is implemented very inefficiently for what it does. If you genuinely need to access a bunch of bits with a random access pattern which, for some reason or other, needs to check and set just one bit a time, then it might be ideally implemented for such a purpose. But my point is that almost all use cases I've encountered didn't require that, and when it's not required, the old school way involving bitwise operations tends to be significantly more efficient.

Did a short test profiling std::bitset vs bool arrays for sequential and random access - you can too:
#include <iostream>
#include <bitset>
#include <cstdlib> // rand
#include <ctime> // timer
inline unsigned long get_time_in_ms()
{
return (unsigned long)((double(clock()) / CLOCKS_PER_SEC) * 1000);
}
void one_sec_delay()
{
unsigned long end_time = get_time_in_ms() + 1000;
while(get_time_in_ms() < end_time)
{
}
}
int main(int argc, char **argv)
{
srand(get_time_in_ms());
using namespace std;
bitset<5000000> bits;
bool *bools = new bool[5000000];
unsigned long current_time, difference1, difference2;
double total;
one_sec_delay();
total = 0;
current_time = get_time_in_ms();
for (unsigned int num = 0; num != 200000000; ++num)
{
bools[rand() % 5000000] = rand() % 2;
}
difference1 = get_time_in_ms() - current_time;
current_time = get_time_in_ms();
for (unsigned int num2 = 0; num2 != 100; ++num2)
{
for (unsigned int num = 0; num != 5000000; ++num)
{
total += bools[num];
}
}
difference2 = get_time_in_ms() - current_time;
cout << "Bool:" << endl << "sum total = " << total << ", random access time = " << difference1 << ", sequential access time = " << difference2 << endl << endl;
one_sec_delay();
total = 0;
current_time = get_time_in_ms();
for (unsigned int num = 0; num != 200000000; ++num)
{
bits[rand() % 5000000] = rand() % 2;
}
difference1 = get_time_in_ms() - current_time;
current_time = get_time_in_ms();
for (unsigned int num2 = 0; num2 != 100; ++num2)
{
for (unsigned int num = 0; num != 5000000; ++num)
{
total += bits[num];
}
}
difference2 = get_time_in_ms() - current_time;
cout << "Bitset:" << endl << "sum total = " << total << ", random access time = " << difference1 << ", sequential access time = " << difference2 << endl << endl;
delete [] bools;
cin.get();
return 0;
}
Please note: the outputting of the sum total is necessary so the compiler doesn't optimise out the for loop - which some do if the result of the loop isn't used.
Under GCC x64 with the following flags: -O2;-Wall;-march=native;-fomit-frame-pointer;-std=c++11;
I get the following results:
Bool array:
random access time = 4695, sequential access time = 390
Bitset:
random access time = 5382, sequential access time = 749

Not a great answer here, but rather a related anecdote:
A few years ago I was working on real-time software and we ran into scheduling problems. There was a module which was way over time-budget, and this was very surprising because the module was only responsible for some mapping and packing/unpacking of bits into/from 32-bit words.
It turned out that the module was using std::bitset. We replaced this with manual operations and the execution time decreased from 3 milliseconds to 25 microseconds. That was a significant performance issue and a significant improvement.
The point is, the performance issues caused by this class can be very real.

In addition to what the other answers said about the performance of access, there may also be a significant space overhead: Typical bitset<> implementations simply use the longest integer type to back their bits. Thus, the following code
#include <bitset>
#include <stdio.h>
struct Bitfield {
unsigned char a:1, b:1, c:1, d:1, e:1, f:1, g:1, h:1;
};
struct Bitset {
std::bitset<8> bits;
};
int main() {
printf("sizeof(Bitfield) = %zd\n", sizeof(Bitfield));
printf("sizeof(Bitset) = %zd\n", sizeof(Bitset));
printf("sizeof(std::bitset<1>) = %zd\n", sizeof(std::bitset<1>));
}
produces the following output on my machine:
sizeof(Bitfield) = 1
sizeof(Bitset) = 8
sizeof(std::bitset<1>) = 8
As you see, my compiler allocates a whopping 64 bits to store a single one, with the bitfield approach, I only need to round up to eight bits.
This factor eight in space usage can become important if you have a lot of small bitsets.

Rhetorical question: Why std::bitset is written in that inefficacy way?
Answer: It is not.
Another rhetorical question: What is difference between:
std::bitset<128> a = src;
a[i] = true;
a = a << 64;
and
std::bitset<129> a = src;
a[i] = true;
a = a << 63;
Answer: 50 times difference in performance http://quick-bench.com/iRokweQ6JqF2Il-T-9JSmR0bdyw
You need be very careful what you ask for, bitset support lot of things but each have it own cost. With correct handling you will have exactly same behavior as raw code:
void f(std::bitset<64>& b, int i)
{
b |= 1L << i;
b = b << 15;
}
void f(unsigned long& b, int i)
{
b |= 1L << i;
b = b << 15;
}
Both generate same assembly: https://godbolt.org/g/PUUUyd (64 bit GCC)
Another thing is that bitset is more portable but this have cost too:
void h(std::bitset<64>& b, unsigned i)
{
b = b << i;
}
void h(unsigned long& b, unsigned i)
{
b = b << i;
}
If i > 64 then bit set will be zero and in case of unsigned we have UB.
void h(std::bitset<64>& b, unsigned i)
{
if (i < 64) b = b << i;
}
void h(unsigned long& b, unsigned i)
{
if (i < 64) b = b << i;
}
With check preventing UB both generate same code.
Another place is set and [], first one is safe and mean you will never get UB but this will cost you a branch. [] have UB if you use wrong value but is fast as using var |= 1L<< i;. Of corse if std::bitset do not need have more bits than biggest int available on system because other wise you need split value to get correct element in internal table. This mean for std::bitset<N> size N is very important for performance. If is bigger or smaller than optimal one you will pay cost of it.
Overall I find that best way is use something like that:
constexpr size_t minBitSet = sizeof(std::bitset<1>)*8;
template<size_t N>
using fasterBitSet = std::bitset<minBitSet * ((N + minBitSet - 1) / minBitSet)>;
This will remove cost of trimming exceeding bits: http://quick-bench.com/Di1tE0vyhFNQERvucAHLaOgucAY

Which is the fastest way of generating a vector of random bits?

I have the following question. Suppose I want to generate a std::vector<bool>, or std::vector<unsigned int>, or even a C-style array that contains only 0's and 1's, uniformly distributed (i.e., at each run should have something like 0110 or 1011, you get the idea). There are 2 approaches:
generate each element using an std::uniform_int_distribution(0,1)
generate a random integer between 0 and 2^n-1 (where n is the number of bits), then use a bitset<n>
I am talking here about large vectors, and did some timing but didn't get any clear insight.
Does anyone know which method is more efficient? In my opinion the second approach should be better, but I am not convinced.

If performance is a concern and your output set needs to be large, I'd skip vector<bool>. While compact, they are lower performance to access an individual member.
Either use a std::vector<unsigned int> and call .resize(N) on it to make sure it's large enough, or allocate a C style array via malloc. Either of these will have performant access if you use [index] to access.
Secondly, get your own RNG, many system ones suck. e.g. MSVC uses an LCG, albeit shifted right by 16 bits. Even so, http://en.wikipedia.org/wiki/Linear_congruential_generator notes that they should be avoided if you really care. My recommendation is an Xor-Shift RNG because it's (A) blindingly fast, and (B) designed by George Marsaglia, who was a bit of an RNG expert.
The other advantage is that you can get your own to RNG return a full 32 bit unsigned int per call, versus MSVC rand() and possibly others which only provide 15 bits per call. Therefore you'll make over twice as many calls to rand() as you will to your own generator.
To actually fill the array, you'd do something like:
void fillarray(unsigned int *data, unsigned int count)
{
unsigned int randomValue;
for (int index = 0; index < count; index++)
{
if ((index & 31) == 0)
{
randomValue = nextRandom();
}
data[index] = randomValue & 1;
randomValue >>= 1;
}
}

order of an operation when upper bound is fixed

I recently had an interview and was asked to find number of bits in integer supplied. I had something like this:
#include <iostream>
using namespace std;
int givemCountOnes (unsigned int X) {
int count =0;
while (X != 0 ) {
if(X & 1)
count++;
X= X>>1;
}
return count;
}
int main() {
cout << givemCountOnes (4);
return 0;
}
I know there are better approaches but that is not the question here.
Question is, What is the complexity of this program?
Since it goes for number of bits in the input, people say this is O(n) where n is the number of bits in input.
However I feel that since the upper bound is sizeof(unsigned int) i.e. say 64 bits, I should say order is o(1).
Am I wrong?

The complexity is O(N). The complexity rises linearly with the size of the type used (unsigned int).
The upper bound does not matter as it can be extended any time in the future. It also does not matter because there is always an upper bound (memory size, number of atoms in the universe) and then everything could be considered O(1).

I will just add a better solution to above problem.
Use the following step in the Loop
x = x & (x-1);
This will remove the right most ON bit one at a time.
So your loop will at max run as long as there is an ON bit. Terminate when the number approaches 0.
Hence the complexity improves from O(number of bits in int) to O(number of on bits).

The O notation is used to tell what the difference is between different values of n. In this case, n would be the number of bits, as (in your case) the number of bits will change the (relative) time it takes to perform the calculation. So O(n) is correct - a one bit integer will take 1 unit of time, a 32-bit integer will take 32 units of time, and a 64-bit integer will take 64 units of time.
Actually, your algorithm is not dependent on the actual number of bits in the number, but the number of the highest bit set in the number, but that's a different matter. However, since we're typically talking about O as the "worst case", it's still O(n), where n is the number of bits in the integer.
And I can't really think of any method that is sufficiently better than that in terms of O - I can think of methods that improve the number of iterations in the loop (e.g using a 256 entry table, and dealing with 8 bits at a time), but it's still "bigger data -> longer time". Since O(n) and O(n/2) or O(n/8) are all the same (it's just that the overall time is 1/8 in the latter case than in the first case).

Big O notation describes count of algorithm steps in worst case scenario. Which is in this case, when there is a 1 in the last bit. So there will be n iterations/steps when you pass n bit number as input.
Imagine a similar algorithm which searches count of 1's in a list. It's complexity is O(n), where n is a list length. By your assumption, if you always pass fixed size lists as input, then algorithm complexity will become O(1) which is incorrect.
However if you fix bit length in algorithm: i.e. something like for (int i = 0; i < 64; ++i) ... then it will have O(1) complexity, since it doing O(1) operation 64 times, you can ignore constant here. Otherwise O(c*n) is O(n), O(c) is O(1), where c is constant.
Hope all these examples helped. BTW, there is O(1) solution for this, I'll post it when I remember :)

There's one thing should be cleared: the complexity of operation on your integer. It is not clear in this example, as you work on int, which is natural word size on your machine its complexity seem to be just 1.
But O-notation is about large amount of data and large tasks, say you have n bit integer, where n is about 4096 or so. In this case complexity addition, subtraction and shift are of O(n) complexity at least, so your algorithm then applied to such integer would be O(n²) complexity (n operations of O(n) complexity applied).
Direct count algorithm without shifting of whole number (in assumption that one bit test is O(1)) gives O(n log(n)) complexity (it involves up to n additions on log(n) sized integer).
But for fixed length data (which is C's int) big O analysis is simply meaningless, because it based on input data of variable length, say more, data of virtually any length upto infinity.

C++: building iterator from bits

I have a bitmap and would like to return an iterator of positions of set bits. Right now I just walk the whole bitmap and if bit is set, then I provide next position. I believe this could be done more effectively: for example build statically array for each combination of bits in single byte and return vector of positions. This can't be done for a whole int, because array would be too big. But maybe there are some better solutions? Do you know any smart algorithms for this?

I can suggest several ideas.
Turns out modern CPUs have dedicated instructions for finding the next set bit in a 32- or 64-bit word.
I like very much your idea of constructing the iterator for the whole bitmap from prepared efficient per-byte mini-iterators, this is really cool and I'm surprised I've never seen it before!
If your bitmap is very sparse, you may represent it in some other form, such as a balanced tree, where iteration algorithms are quite well-known.
If your bitmap is sparse but with dense regions (that sounds exotic, but I've encountered situations where this was exactly the case), use a balanced tree of small (32-bit or 64-bit) bitmaps and use a combined iteration algorithm for a tree and for bits of a word.
To avoid the memory overhead of an explicit tree, use an implicit one, like in the canonical heapsort algorithm. After your bitset is ready and will not be mutated, build a "pyramid" on top of it where level(N+1)[i] = level(N)[2*i] | level(N)[2*i+1]. This will allow you to rapidly skip uninhabited regions of the bitset, and iteration will be done in a fashion similar to iterating over a regular binary tree. You might as well build a pyramid of inhabitance, starting from bytes, etc.: it all depends on how sparse your bitset it.
There are well-known bit tricks for finding the number of leading zeros in a word; see, for example, the code of java's standard libraries:
You might gain a lot of performance by using a passive iterator instead of an active one, t.i. instead of begin() and operator++(), provide a foreach(F) function for your bitset where F has operator(). If you need passive iteration with premature termination, make F's operator() return a boolean that denotes whether termination is requested.
EDIT: I couldn't resist trying out your approach with preparing iterators for bytes. I wrote a code generator into C#2.0 that produces code of the following form:
IEnumerable<int> bits(byte[] bytes) {
for(int i=0; i<bytes.Length; ++i) {
int oi=8*i;
switch(bytes[i]) {
....
case 74: yield return oi+1; yield return oi+4; yield return oi+6; break;
....
}
}
}
I compared its performance for counting bits of a random 50%-filled byte array (10Mb) with the performance of code that does not use iterators at all and consists of two loops:
for (int i = 0; i < bytes.Length; ++i) {
byte b = bytes[i];
for (int j = 7; j >= 0; --j) {
if (((int)b & (1 << j)) != 0) s++;
}
}
The second code snippet is just some 1.66 times faster than the first one (~1.5s vs ~2.5s). I think that sparser bit arrays might even make the first code outperform the second one.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js