I want to find out as quickly and efficiently as possible whether two memory buffers - holding arbitrarily defined values - are identical in a bitwise comparison.
I'm not interested in anything but the Boolean "is identical", and I want the method to return as quickly as possible, i.e. at the first difference found.
What is the best way to achieve this?
I'm currently comparing the overall size first - which I know - and, if the sizes match, using memcmp:
memcmp( buf1_ptr, buf2_ptr, sizeof(buf1) )
Is this the most efficient approach I can take? Should I split the comparison into chunks in a for loop?
In general, memcmp will have been written in assembler by experts. It is very, very unlikely that you can do any better than them at the general-purpose problem it solves.
If you can promise that the pointers will always be (e.g.) aligned on a 16-byte boundary, and that the length will always be a multiple of 16 bytes, you might be able to do a little better with a vectorized solution like SSE. (memcmp will probably end up using SSE too under those circumstances, but it has to run some tests first to make sure - and you can save the cost of those tests.)
Otherwise - just use memcmp.
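For reference, a minimal sketch of that size-check-then-memcmp pattern (the function name and parameters are placeholders):

#include <cstddef>
#include <cstring>

// Returns true only if both buffers have the same size and identical bytes.
// memcmp may stop at the first differing byte, so the early-out the question
// asks for comes built in.
bool buffers_identical(const void* a, std::size_t a_size,
                       const void* b, std::size_t b_size)
{
    if (a_size != b_size)
        return false;              // cheap rejection before touching memory
    return std::memcmp(a, b, a_size) == 0;
}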
Related
I have three large, arbitrary, sparse boolean vectors, all of the same size - say: pool1, pool2, intersection_of_other_pools. I'm interested in performing a bitwise operation on them. It'd be great if I could just do intersection_of_other_pools |= pool1 | pool2, but that doesn't seem to be an option - as far as I could find.
Since the size of all these vectors are very large, and pool1 and pool2 are very sparse, I'd be interested in a way to perform a bitwise operation on these vectors without looping. I understand that the under-the-hood implementation of std::vector<bool> is just an array of bits, which led me to believe it's possible to do this without looping.
I'm open to strange bitwise hacky solutions in the name of speed.
Of course, if the fastest way (or the only way) to do this is just looping, then I'll happily accept that as an answer as well.
I've checked out valarray as a potential alternative to vector, but I couldn't tell whether it loops or does some magical bitwise operation. Ideally, though, I don't want to change the existing codebase.
Don't use std::vector<bool> or similar for a sparse array.
A truly sparse array should have the ability to skip over large sections.
Encode your data as block headers, which state how long a region is in bytes. Use an all-1s length to say "a length field twice as long follows", recursively.
So 0xFF0100 states that there is a block of length 512 following. (you can do a bit better by not permitting 0 nor 1-254, but that is rounding error).
Alternate blocks of "all 0s" with blocks of mixed 1s and 0s.
Don't read the block headers directly; use memcpy into aligned storage.
Once you have this, your | or & operation is more of a stitch than a bitwise operation. Only in the rare case where both have a non-zero block do you actually do bitwise work.
After doing & you might want to check if any of the non-0 regions are actually all 0.
This assumes an extremely sparse bitfield - say, 1 bit set in every few 10,000 being a typical case. If by sparse you mean "1 in 10", then just use a vector of uint64_t or something.
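To make the stitching concrete, here is a much-simplified sketch that drops the variable-length headers and just keeps the non-zero 64-bit words with their positions (all names are invented for illustration):

#include <cstddef>
#include <cstdint>
#include <vector>

// One non-zero 64-bit word of an otherwise all-zero bitfield.
struct Chunk { uint64_t index; uint64_t bits; };

// Sparse bitfield: chunks sorted by index; all-zero words are simply absent.
using SparseBits = std::vector<Chunk>;

// OR is a merge of two sorted chunk lists; bitwise work happens only in the
// rare case where both inputs have a non-zero word at the same position.
SparseBits sparse_or(const SparseBits& a, const SparseBits& b)
{
    SparseBits out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].index < b[j].index)      out.push_back(a[i++]);
        else if (b[j].index < a[i].index) out.push_back(b[j++]);
        else {                            // both non-zero: the actual bitwise work
            out.push_back({a[i].index, a[i].bits | b[j].bits});
            ++i; ++j;
        }
    }
    out.insert(out.end(), a.begin() + i, a.end());
    out.insert(out.end(), b.begin() + j, b.end());
    return out;
}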
Implement it as std::vector<uint64_t>; your CPU will probably be quite fast at doing the bitwise OR on these. They will be memory-aligned, and therefore cache friendly.
The loop is not as bad as you think, as there would be a hidden implicit loop anyway on any other data structure.
If it's extremely sparse (<< 1 in 1000), then just store the indices of the set bits in a (sorted) vector and use std::set_intersection to do the matching. Both variants are sketched below.
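A sketch of both (the helper names are made up; the dense version assumes both vectors have the same length):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Dense variant: word-wise OR over std::vector<uint64_t>.
void or_into(std::vector<uint64_t>& dst, const std::vector<uint64_t>& src)
{
    for (std::size_t i = 0; i < dst.size(); ++i)
        dst[i] |= src[i];              // 64 bits per iteration
}

// Extremely sparse variant: each pool is a sorted vector of set-bit indices.
std::vector<std::size_t> intersect(const std::vector<std::size_t>& p1,
                                   const std::vector<std::size_t>& p2)
{
    std::vector<std::size_t> out;
    std::set_intersection(p1.begin(), p1.end(),
                          p2.begin(), p2.end(),
                          std::back_inserter(out));
    return out;
}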
I'm implementing a struct with a pointer to some manually managed memory. It all works great with DMD, but when I test it with GDC, it fails on the opEquals operator overload. I've narrowed it down to memcmp. In opEquals I compare the pointed to memory with memcmp, which behaves as I expect in DMD but fails with GDC.
If I go back and write the opEquals method by comparing each value stored in the manually managed memory one at a time, using == on the built-in types, it works in both compilers. I prefer the memcmp route because it was shorter to write and seems like it should be faster (less indirection, iteration, etc.).
Why? Is this a bug?
(My experience with C was 10 years ago, been using python/java since, I never had this kind of problem in C, but I didn't use it that much.)
Edit:
The memory I'm comparing represents a 2-D array of 'real' values; I just wanted it allocated in one chunk so I didn't have to deal with jagged arrays. I'll be using the structs a lot in tight loops. Basically, I'm rolling my own matrix struct that will (eventually) cache some frequently used values (trace, determinant) and offer an alternate read-only view into the transpose that doesn't require copying it. I plan to work with matrices from about 10x10 to about 1000x1000 (though not always square).
I also plan on implementing a version that allocates memory with the GC via a ubyte[] and profiling the two implementations.
Edit 2:
Ok, I tried a couple of things. I also have some parallel loops, and I had a hunch that might be the problem. So I added some version statements to make a parallel and non-parallel version. In order to get it to work with GDC, I had to use the non-parallel version AND change real to double.
All cases compiled under GDC. But the unit tests failed, not always consistently on the same line, but consistently at an opEquals call when I used real or parallel. In DMD all cases compiled and ran no problem.
Thanks,
real has a bit of a strange size: it is 80 bits of data, but if you check real.sizeof, you'll see it is bigger than that (at least on Linux; I think it is 10 bytes on Windows, and I betcha you wouldn't see this bug there). The reason is to make sure it is aligned on a word boundary - a multiple of four bytes - so the processor can load array elements more efficiently.
The bytes between each data element are called padding, and their content is not always defined. I haven't confirmed this myself, but @jpf's comment on the question said the same thing my gut does, so I'm posting it as an answer now.
The is operator in D does the same as memcmp(&data1, &data2, data.sizeof), so @jpf's comment and your memcmp amount to the same thing. It checks the data AND the padding, whereas == only checks the data (and does something a bit special for floating-point types, by the way, because it also compares for NaN, so the exact bit pattern matters for those checks; actually, my first gut feeling when I saw the question title was that it was NaN-related! but that's not the case here).
Anyway, apparently dmd initializes the padding bytes as well, whereas gdc doesn't, leaving them as garbage which doesn't always match.
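The same pitfall is easy to reproduce in C++, for readers more familiar with it (the struct is hypothetical; the point is that the compiler-inserted padding has indeterminate content):

#include <cstring>

struct S {
    char c;     // 1 byte of data...
                // ...typically followed by 7 padding bytes the compiler
    double d;   // inserts so that d is suitably aligned
};

bool same_memcmp(const S& a, const S& b)
{
    // Compares data AND padding; two logically equal values can still
    // compare unequal because the padding bytes are indeterminate.
    return std::memcmp(&a, &b, sizeof(S)) == 0;
}

bool same_fields(const S& a, const S& b)
{
    // Member-wise comparison looks only at the data, like D's ==.
    return a.c == b.c && a.d == b.d;
}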
I'm currently evaluating whether I should use a single large bitset or many 64-bit unsigned integers (uint64_t) to store a large amount of bitmap information. In this case, the bitmap represents the current status of a few GB of memory pages (dirty / not dirty) and has thousands of entries.
The work which I am performing requires that I be able to query and update the dirty pages, including performing OR operations between two dirty page bitmaps.
To be clear, I will be performing the following:
Importing a bitmap from a file, and performing a bitwise OR operation with the existing bitmap
Computing the Hamming weight (counting the number of bits set to 1, which represents the number of dirty pages)
Resetting / clearing a bit, to mark it as updated / clean
Checking the current status of a bit, to determine if it is clean
It looks like it is easy to perform bitwise operations on a C++ bitset, and easy to compute the Hamming weight. However, I imagine there is no magic here - the CPU can only perform bitwise operations on as many bytes as it can store in a register - so the routine utilized by the bitset is likely the same one I would implement myself. This is probably also true for the Hamming weight.
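For concreteness, this is the sort of word-at-a-time routine I mean (a sketch over raw uint64_t words; std::popcount is C++20, and older compilers have intrinsics such as __builtin_popcountll):

#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hamming weight of the whole bitmap: sum of per-word popcounts.
std::size_t count_dirty(const std::vector<uint64_t>& bitmap)
{
    std::size_t total = 0;
    for (uint64_t word : bitmap)
        total += std::popcount(word);
    return total;
}

// Query and clear a single page's bit.
bool is_dirty(const std::vector<uint64_t>& bm, std::size_t page)
{
    return (bm[page / 64] >> (page % 64)) & 1u;
}

void mark_clean(std::vector<uint64_t>& bm, std::size_t page)
{
    bm[page / 64] &= ~(uint64_t(1) << (page % 64));
}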
In addition, importing the bitmap data from the file into the bitset looks ugly - I need to perform bit shifts multiple times, as shown here. I imagine that, given the size of the bitsets I would be working with, this would have a negative performance impact. Of course, I imagine I could just use many small bitsets instead, but there may be no advantage to this (other than perhaps ease of implementation).
Any advice is appreciated, as always. Thanks!
Sounds like you have a very specific single-use application. Personally, I've never used a bitset, but from what I can tell its advantages are in being accessible as if it were an array of bools, as well as being able to grow dynamically like a vector.
From what I can gather, you don't really have a need for either of those. If that's the case and if populating the bitset is a drama, I would tend towards doing it myself, given that it really is quite simple to allocate a whole bunch of integers and do bit operations on them.
Given that you have very specific requirements, you will probably benefit from making your own optimizations. Having access to the raw bit data is kinda crucial for this (for example, using pre-calculated tables of Hamming weights for a single byte, or even two bytes if you have memory to spare).
I don't generally advocate reinventing the wheel... But if you have special optimization requirements, it might be best to tailor your solution towards those. In this case, the functionality you are implementing is pretty simple.
Thousands of bits does not sound like a lot. But maybe you have millions.
I suggest you write your code as if you had the ideal implementation, by abstracting it (to begin with, use whatever implementation is easiest to code, ignoring any performance and memory problems), then try several alternative specific implementations and verify, by measuring, which performs best.
One solution that you did not even consider is to use Judy arrays (specifically Judy1 arrays).
I think if I were you I would probably just save myself the hassle of any DIY and use boost::dynamic_bitset. They've got all the bases covered in terms of functionality, including stream operator overloads which you could use for file IO (or just read your data in as unsigned ints and use their conversions; see their examples), and a count method for your Hamming weight. Boost is very highly regarded, at least by Sutter and Alexandrescu, and everything is done in the header files - no linking, just #include the appropriate files. In addition, unlike the Standard Library bitset, you can wait until runtime to specify the size of the bitset.
Edit: Boost does seem to allow for the fast input reading that you need. dynamic_bitset supplies the following constructor:
template <typename BlockInputIterator>
dynamic_bitset(BlockInputIterator first, BlockInputIterator last,
               const Allocator& alloc = Allocator());
The underlying storage is a std::vector (or something almost identical to it) of Blocks, e.g. uint64s. So if you read in your bitmap as a std::vector of uint64s, this constructor will write them directly into memory without any bitshifting.
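A sketch of how that might look (the file name and block count are made up; this assumes the file holds raw 64-bit blocks):

#include <boost/dynamic_bitset.hpp>
#include <cstdint>
#include <fstream>
#include <vector>

int main()
{
    // Read the bitmap file as raw 64-bit blocks.
    std::ifstream in("dirty_pages.bin", std::ios::binary);
    std::vector<uint64_t> blocks(1024);            // hypothetical block count
    in.read(reinterpret_cast<char*>(blocks.data()),
            static_cast<std::streamsize>(blocks.size() * sizeof(uint64_t)));

    // Construct the bitset directly from the blocks: no bit shifting needed.
    boost::dynamic_bitset<uint64_t> bits(blocks.begin(), blocks.end());

    // The Hamming weight comes for free.
    auto dirty_pages = bits.count();
    (void)dirty_pages;
}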
I got some code that I'd like to improve. It's a simple app for one of the variations of 2DBPP and you can take a look at the source at https://gist.github.com/892951
Here's an outline of things that I use chars for (I'd like to switch to binary values instead). Initialize a block of memory with zero bytes:
...
char* bin;
bin = new (nothrow) char[area];
memset(bin, '\0', area);
sometimes I check particular values:
if (!bin[j*height+k]) {...}
or blocks:
if (memchr(bin+i*height+pos.y, '\1', pos.height)) {...}
or set values to '1's:
memset(bin+i*height+best.y,'\1',best.height);
I don't know of any standard types or methods for working with binary values. How do I get to use bits instead of bytes?
There's a related question that you might be interested in -
C++ performance: checking a block of memory for having specific values in specific cells
Thank you!
Edit: There's still a bigger question - would it be an improvement? I'm only concerned with time.
For starters, you can refer to this post:
How do you set, clear, and toggle a single bit?
Also, try looking into the C++ std::bitset, or bit fields.
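Adapted to the layout in your question, a sketch might look like this (BitGrid and the helper names are invented; it packs 32 cells per word):

#include <cstddef>
#include <cstdint>
#include <vector>

using BitGrid = std::vector<uint32_t>;

// Replaces  new (nothrow) char[area]  plus the memset: all bits start at 0.
BitGrid make_grid(std::size_t area) { return BitGrid((area + 31) / 32, 0); }

void set_bit(BitGrid& g, std::size_t i)   { g[i / 32] |=  uint32_t(1) << (i % 32); }
void clear_bit(BitGrid& g, std::size_t i) { g[i / 32] &= ~(uint32_t(1) << (i % 32)); }
bool test_bit(const BitGrid& g, std::size_t i) { return (g[i / 32] >> (i % 32)) & 1u; }

// Your check  if (!bin[j*height+k])  then becomes:
//   if (!test_bit(grid, j*height + k)) { ... }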
I recommend reading up on boost.dynamic_bitset, which is a runtime-sized version of std::bitset.
Alternatively, if you don't want to use Boost for some reason, consider using a std::vector<bool>. Quoting cppreference.com:
Note that a boolean vector (std::vector<bool>) is a specialization of the vector template that is designed to use less memory. A normal boolean variable usually uses 1-4 bytes of memory, but a boolean vector uses only one bit per boolean value.
Unless memory space is an issue, I would stay away from bit twiddling. You may save some memory, but at the cost of run time: packing and unpacking bits takes time, and extra code.
Make the code robust and correct before attempting bit twiddling. Play with different (high-level) designs that can improve performance and memory usage.
If you are going to the bit level, study up on boolean arithmetic and logic. Redesign your data to be easier to manipulate at the bit level.
What I need to do is open a text file with 0s and 1s to find patterns between the columns in the file.
So my first thought was to parse each column into a big array of bools, and then do the logic between the columns (now in arrays) - until I found that the size of a bool is actually a byte, not a bit, so I would be wasting 7/8 of the memory by assigning each value to a bool.
Is it even relevant in a grid of 800x800 values? What would be the best way to handle this?
I would appreciate a code snippet in case it's a complicated answer.
You could use std::bitset or Boost's dynamic_bitset, which provide various methods to help you manage your bits.
For example, they support constructors that create bitsets from built-in types like int or char. You can also export a bitset into an unsigned long or into a string (which could then be turned into a bitset again, etc.).
I once asked about concatenating those, which turned out not to be possible efficiently. But perhaps you could use the info in that question too.
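For instance, with std::bitset (a small sketch; the values are arbitrary):

#include <bitset>
#include <string>

int main()
{
    std::bitset<8> a(0xA5u);                    // from an unsigned value
    std::bitset<8> b(std::string("11110000"));  // from a string

    std::bitset<8> c = a & b;           // bitwise logic between two columns
    std::string s   = c.to_string();    // export as a string...
    unsigned long v = c.to_ulong();     // ...or as an unsigned long
    (void)s; (void)v;
}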
You can use std::vector<bool>, which is a specialization of vector that uses compact storage for booleans: 1 bit, not 8 bits.
I think it was Knuth who said "premature optimization is the root of all evil." Let's find out a little bit more about the problem. Your array is 800 x 800 = 640,000 bytes, which is no big deal on anything more powerful than a digital watch.
Storing it as bytes may seem wasteful - as you say, 7/8ths of the memory is redundant - but on the other hand, most machines don't do bit operations as efficiently as byte operations; by saving the memory, you might waste so much effort masking and testing that you would have been better off with the byte model.
On the other hand, if what you want to do with it is look for larger patterns, you might want to use a bitwise representation because you can do things with 8 bits at a time.
The real point here is that there are several possibilities, but no one can tell you the "right" representation without knowing what the problem is.
For a grid of that size, your array of bools would be about 640 KB; whether that is a problem depends on how much memory you have. It would probably be the simplest option for the logic-analysis code.
By grouping the bits and storing them in an array of int, you could drop the memory requirement to 80 KB, but the logic code would be more complicated, as you'd always be isolating the bits you wanted to check.
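The isolation itself is only a line or two; a sketch assuming 32-bit words (the grid size is taken from the question):

#include <cstddef>
#include <cstdint>
#include <vector>

// 800x800 grid packed into 32-bit words: ~80 KB instead of ~640 KB.
std::vector<uint32_t> grid(800 * 800 / 32, 0);

bool cell(std::size_t row, std::size_t col)
{
    std::size_t i = row * 800 + col;          // flat bit index
    return (grid[i / 32] >> (i % 32)) & 1u;   // isolate the one bit
}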