I was wondering if there is a way to mask a list of int values using bitwise operators, and then use that mask to tell whether an int value is one of the values in the mask.
E.g. if I have the values 129 and 17, how can I calculate a mask that tells me whether an int value matches the mask (i.e. whether the int value is either 129 or 17)?
I expect my problem will be better understood with the following pseudocode.
EDIT:
I want to pack, mask, or "compress" an int array into a single value (the mask), and then accept only values that are in the list of masked values (the array).
Is it possible? Thanks in advance.
valuesToMask = [17, 129, ...]
mask = getmask(valuesToMask)
lstValues = [0,1, 10, ..., 17, 18, 19, ..., 129, ...]
foreach(int value, in lstValues) {
if(check(mask,value))
printf("\nValue %d is in the mask", value);
else
printf("\nValue %d is not in the mask", value);
}
Thanks in advance. I Really appreciate your help and your time.
(Sorry for my english)
You can do this for certain sets of values, but not necessarily in general. For example, if you want to determine whether a value is 4, 5, 6, or 7, then you can do:
if ((value & ~3) == 4) ...
This creates a mask with all bits 1 except the least significant two bits. The & operator effectively sets the least significant two bits to 0. The comparison then checks to see whether the pattern of bits matches the value you are looking for. In binary representation, this looks like the following (assume value is an 8-bit value):
value masked
00000011 00000000 = 0
00000100 00000100 = 4
00000101 00000100 = 4
00000110 00000100 = 4
00000111 00000100 = 4
00001000 00001000 = 8
This technique would not work if for example you wanted to check for just "4, 5, or 7".
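As a minimal sketch of the check above, here is a small C helper. The name `in_aligned_block` and its parameters are hypothetical, and it assumes the accepted values form a block of 2^k consecutive integers starting at a multiple of 2^k (with k smaller than the width of `unsigned`):

```c
#include <stdbool.h>

/* Hypothetical helper: true when value lies in the block of 2^k consecutive
   integers that starts at base (base is assumed to be a multiple of 2^k). */
static bool in_aligned_block(unsigned value, unsigned base, unsigned k)
{
    unsigned low_bits = (1u << k) - 1u;   /* k = 2 gives 3, i.e. binary 11 */
    return (value & ~low_bits) == base;   /* clear the low bits, then compare */
}
```

With base 4 and k = 2 this accepts exactly 4, 5, 6 and 7, matching the table above. A set such as "4, 5, or 7" cannot be expressed this way because it is not a full aligned block.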
You can partially solve your problem with Bloom Filters. The way this works is that in order to test for membership in an N-set of items, you define K hash functions to map each item to an M-bit key. For insertion of an element a, set the filter's bits at positions h1(a) ... hk(a) equal to 1. For lookup of an element b, if you detect a zero bit at any of the h1(b) ... hk(b), then b is guaranteed not to be in the set. Depending on the values for N, M and K, there is however a small probability that you get a false positive (i.e. you detect no zeros from the hash functions, but b was not previously stored in the filter).
In pseudo-code:
const int M = 256;
typedef std::bitset<M> Mask;
int listValues[N] = { v1, ... , vN };
typedef unsigned char (*HashFunction)(int); // maps int to 0...255
HashFunction hash[K] = { h1, ..., hK };
Mask make_mask(int x)
{
Mask m(0);
for (int i = 0; i < K; ++i) {
m[(hash[i])(x)] = 1; // update mask with item's hash
}
return(m);
}
// initialize
Mask BloomFilter(0);
for (int i = 0; i < N; ++i) {
BloomFilter |= make_mask(listValues[i]);
}
// probe
bool is_not_in_filter(const Mask& F, int x)
{
// if a zero-bit in F matches a 1-bit in make_mask(x), then x is not in F
return (~F & make_mask(x)).any();
}
// call
int x = ...;
bool in_set = !is_not_in_filter(BloomFilter, x); // true means "possibly in the set"
Effectively, this expands each item to an M-bit key, and the filter is the aggregate bitwise OR over all items. Testing for set-membership then becomes a simple (though probabilistic) bitwise AND between the NEGATED filter with the M-bit expanded item to be tested.
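The idea can also be sketched in plain C. Everything here is an assumption made for illustration: the 256-bit filter size, the two hash functions `hash1` and `hash2` (K = 2), and all helper names are hypothetical and not taken from any library.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FILTER_BITS 256u                      /* M in the text above */

typedef struct { uint8_t bits[FILTER_BITS / 8]; } bloom_t;

/* Two illustrative hash functions mapping a value to 0..255. */
static unsigned hash1(uint32_t x) { return (uint32_t)(x * 2654435761u) >> 24; }
static unsigned hash2(uint32_t x)
{
    x ^= x >> 15;
    x *= 0x85EBCA6Bu;
    x ^= x >> 13;
    return x & 0xFFu;
}

static void bloom_set(bloom_t *f, unsigned pos)
{
    f->bits[pos / 8] |= (uint8_t)(1u << (pos % 8));
}

static bool bloom_get(const bloom_t *f, unsigned pos)
{
    return ((f->bits[pos / 8] >> (pos % 8)) & 1u) != 0;
}

/* Insert x: set the bit selected by each hash function. */
static void bloom_add(bloom_t *f, uint32_t x)
{
    bloom_set(f, hash1(x));
    bloom_set(f, hash2(x));
}

/* false: x was definitely never added; true: x was probably added. */
static bool bloom_maybe_contains(const bloom_t *f, uint32_t x)
{
    return bloom_get(f, hash1(x)) && bloom_get(f, hash2(x));
}
```

A filter must be zeroed (for example with `memset`) before the first `bloom_add`. Values that were added always test positive; other values usually test negative, but a false positive is possible.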
UPDATE:
The above code is pseudo-code to explain how it works. To get an actual library, see e.g. the experimental Boost.Bloomfilters or bloom
I think you're asking how you can check if a number is 129 or 17.
int[] lstValues = [0,1, 10, 17, 18, 19, 129];
foreach(int value in lstValues) {
if(value == 129 || value == 17)
printf("\nValue is in the mask");
else
printf("\nValue is not in the mask");
}
I've created a function that enables you to get a range of bits in a byte, counting bit 0 for the least significant bit (LSb) on the right and 7 for the most significant bit (MSb) on the left for code in C or C++. However, it's not so straightforward to set bits. This can be generalized for short, int, long etc., but for now I'm sticking with bytes.
Given the following:
#include <stdio.h>
typedef unsigned char BYTE;
BYTE getByteBits(BYTE n, BYTE b) {
return (n < 8) ? b & ((0x01 << (n + 1)) - 1) : 0;
}
where the bits we want to extract are between bits 0 and n, and b is the full byte. If n is 0 only 0 or 1 is returned for this LSb, if n is 1, values between 0 and 3 can be returned, etc., up to n is 7 for the full byte. Of course in the latter case the code is redundant.
This can be called from main() by using something like:
BYTE num = getByteBits(4, myByte);
which will return the numerical value of the 5 lowest bits (bits 0 to 4). However, I've discovered that this can be generalized to:
BYTE num = getByteBits(n - m, myByte >> m);
which will extract the value returned by the bits from m to n, such that 0 <= m <= n <= 7. This is done by shifting myByte by m bits to the right, which is equivalent to shifting the mask to the left.
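To make the m-to-n extraction concrete, here is a short C sketch. `getByteBits` is the function from above with the result cast back to `BYTE`; the wrapper name `getBitRange` is hypothetical and only packages the call pattern just described.

```c
typedef unsigned char BYTE;

/* Value of bits 0..n of b (n counted from the LSb); 0 if n is out of range. */
static BYTE getByteBits(BYTE n, BYTE b)
{
    return (n < 8) ? (BYTE)(b & ((0x01 << (n + 1)) - 1)) : 0;
}

/* Hypothetical wrapper: value of bits m..n of b, assuming 0 <= m <= n <= 7. */
static BYTE getBitRange(BYTE m, BYTE n, BYTE b)
{
    return getByteBits((BYTE)(n - m), (BYTE)(b >> m));
}
```

For example, with b = 0xB4 (binary 10110100), bits 2 to 5 are 1101, so `getBitRange(2, 5, 0xB4)` yields 0x0D.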
However, so far I've been unable to do something similar to set bits using a function with three arguments. The best I can do is to create the function:
BYTE setByteBits(BYTE n, BYTE m, BYTE b, BYTE c) {
BYTE mask = ((0x01 << (n + 1)) - 1) << m;
return (b & ~mask) | (c & mask);
}
where n and m are the same as before, b is the value of the original byte, and c holds the new values of the bits we want to change. The updated byte is returned. This would be called as:
BYTE num = setByteBits(n - m, m, myByte, c << m);
but 4 rather than 3 arguments are needed, and rather than shifting myByte by m bits to the right, the mask is shifted to the left together with its complement in the function, as given by the 2nd argument m. Although as far as I can see this works correctly, so far all attempts to do something similar to getByteBits() with 3 arguments have failed.
Does anybody have any idea about this? Also I would appreciate if any bugs can be found in the code. Incidentally, for setByteBits() I made use of the link:
How do you set only certain bits of a byte in C without affecting the rest?
Many thanks.
so far I've been unable to do something similar to set bits using a function with three arguments
I do not understand your code, because you use so many one-letter variables (m, n, c, b). Maybe try to be more descriptive? You don't have to write short code; there is no performance gain in that.
When n is the stopping bit position inside the byte, 1 << (n + 1) will give you a mask that is too big. You have to shift 1 by the length of the range of bits, subtract 1, and then shift that mask to the left by the start position. I mixed up n and m so many times in your code that I do not know which is which.
The following works, at least for tests I tried:
#include <stdio.h>
#include <stdint.h>
#include <assert.h>
#include <limits.h>
typedef unsigned char byte;
#define BYTE_BITS CHAR_BIT
typedef uint_fast8_t bitpos;
/**
* Set bits in the byte `thebyte`
* between the range of bit `rangestart` inclusive to `rangestop` exclusive
* counting from 0 from LSB
* to the range of these bits inside `masktoset`.
*/
byte setByteBits(bitpos rangestart,
bitpos rangestop,
byte thebyte,
byte masktoset) {
assert(rangestop <= BYTE_BITS);
assert(rangestart < rangestop);
const bitpos rangelen = rangestop - rangestart;
const byte rangemask = (1u << rangelen) - 1u;
const byte mask = rangemask << rangestart;
return (thebyte & ~mask) | (masktoset & mask);
}
int main() {
const int tests[][5] = {
{ 4,6,0,0xff,0b00110000 },
{ 4,6,0,0b01010101,0b00010000 },
{ 4,6,0,0b01100101,0b00100000 },
{ 4,6,0xff,0b01000101,0b11001111 },
{ 0,8,0,0xff,0xff },
{ 0,8,0,0xab,0xab },
{ 0,4,0xfa,0xab,0xfb },
{ 0,4,0xab,0xcd,0xad },
{ 4,8,0xab,0xcd,0xcb },
{ 4,8,0xef,0xab,0xaf },
};
for (size_t i = 0; i <sizeof(tests)/sizeof(*tests); ++i) {
const int *const t = tests[i];
const byte r = setByteBits(t[0],t[1],t[2],t[3]);
fprintf(stderr, "%zu (%#02x,%#02x,%#02x,%#02x)->%#02x ?= %#02x\n",
i,t[0],t[1],t[2],t[3],r,t[4]);
assert(r == t[4]);
}
}
Here is the definition of the function:
inline uint32_t CountLeadingZeros(uint32_t Val)
{
// Use BSR to return the log2 of the integer
unsigned long Log2;
if (_BitScanReverse(&Log2, Val) != 0)
{
return 31 - Log2;
}
return 32;
}
inline uint32_t CeilLog2(uint32_t Val)
{
int BitMask = ((int)(CountLeadingZeros(Val) << 26)) >> 31;
return (32 - CountLeadingZeros(Val - 1)) & (~BitMask);
}
Here is my hypothesis:
The range of the return value of the function CountLeadingZeros is [0, 32]. When the input Val is equal to 0, CountLeadingZeros(Val) << 26 should be 1000,0000,....,0000,0000.
Since the left-hand side of operator >> is a signed number, the result of >> 31 would be 1111,1111,....,1111,1111. When Val is not equal to 0, the BitMask would always be 0000,0000,....,0000,0000.
So I guess that the purpose of the variable BitMask is to let the function return 0 when the input Val is zero.
But the question is that when I pass -1 to this function, it is cast to 4294967295, and the output becomes 32.
Is my hypothesis right?
I have seen this implementation many times in the RayTracing renderer on the github.
What is actual effect of BitMask here? Confused :(
Since the left-hand side of operator >> is a signed number, the result of >> 31 would be 1111,1111,....,1111,1111. When Val is not equal to 0, the BitMask would always be 0000,0000,....,0000,0000.
Your analysis is absolutely correct: BitMask is all ones when Val is zero; otherwise it is all zeros. You can eliminate BitMask with a simple conditional:
return Val ? (32 - CountLeadingZeros(Val - 1)) : 0;
This does not create new branching, because the conditional replaces the if of CountLeadingZeros.
But the question is that when I pass -1 to this function, it is cast to 4294967295, and the output becomes 32.
The function takes an unsigned number, so it is clearer to pass 0xFFFFFFFF rather than -1 (the conversion of -1 to uint32_t is well defined and yields the same value, 0xFFFFFFFF). In this case the return value should be 32, the correct value of log2 ceiling for this value.
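For experimenting with the mask outside MSVC, here is a portable C sketch. It is a variation, not the original code: `count_leading_zeros` replaces `_BitScanReverse` with a plain loop, and the mask is built from unsigned arithmetic instead of a signed shift. The names are hypothetical.

```c
#include <stdint.h>

/* Portable stand-in for the _BitScanReverse-based version; 32 for val == 0. */
static uint32_t count_leading_zeros(uint32_t val)
{
    uint32_t n = 0;
    if (val == 0)
        return 32;
    while ((val & 0x80000000u) == 0) {
        val <<= 1;
        ++n;
    }
    return n;
}

static uint32_t ceil_log2(uint32_t val)
{
    /* The count is 32 only for val == 0, so bit 5 of the count flags a zero
       input; 0 - 1 wraps to all ones, which plays the role of BitMask. */
    uint32_t zero_mask = 0u - (count_leading_zeros(val) >> 5);
    return (32u - count_leading_zeros(val - 1u)) & ~zero_mask;
}
```

With this sketch, an input of 0 gives 0 because the mask clears the result, while 0xFFFFFFFF gives 32 as discussed above.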
I have binary matrices in C++ that I represent with a vector of 8-bit values.
For example, the following matrix:
1 0 1 0 1 0 1
0 1 1 0 0 1 1
0 0 0 1 1 1 1
is represented as:
const uint8_t matrix[] = {
0b01010101,
0b00110011,
0b00001111,
};
The reason why I'm doing it this way is because then computing the product of such a matrix and a 8-bit vector becomes really simple and efficient (just one bitwise AND and a parity computation, per row), which is much better than calculating each bit individually.
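A rough C sketch of that product, to show the AND-plus-parity step. The function names are hypothetical, and two things are assumptions: the matrix has at most 8 rows, and row i of the matrix produces bit i of the result.

```c
#include <stddef.h>
#include <stdint.h>

/* Parity of an 8-bit value: 1 if an odd number of bits are set, else 0. */
static unsigned parity8(uint8_t x)
{
    unsigned v = x;
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1u;
}

/* Matrix-vector product over GF(2): bit i of the result is the dot product
   of row i with vec. Assumes nrows <= 8 so the result fits in one byte. */
static uint8_t matvec_gf2(const uint8_t *rows, size_t nrows, uint8_t vec)
{
    uint8_t result = 0;
    for (size_t i = 0; i < nrows; ++i)
        result |= (uint8_t)(parity8((uint8_t)(rows[i] & vec)) << i);
    return result;
}
```

For the example matrix above and vec = 0b00000100, rows 0 and 2 have that bit set and row 1 does not, so the result is 0b101.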
I'm now looking for an efficient way to transpose such a matrix, but I haven't been able to figure out how to do it without having to manually calculate each bit.
Just to clarify, for the above example, I'd like to get the following result from the transposition:
const uint8_t transposed[] = {
0b00000000,
0b00000100,
0b00000010,
0b00000110,
0b00000001,
0b00000101,
0b00000011,
0b00000111,
};
NOTE: I would prefer an algorithm that can calculate this with arbitrary-sized matrices but am also interested in algorithms that can only handle certain sizes.
I've spent more time looking for a solution, and I've found some good ones.
The SSE2 way
On a modern x86 CPU, transposing a binary matrix can be done very efficiently with SSE2 instructions. Using such instructions it is possible to process a 16×8 matrix.
This solution is inspired by this blog post by mischasan and is vastly superior to every suggestion I've got so far to this question.
The idea is simple:
#include <emmintrin.h>
Pack 16 uint8_t variables into an __m128i
Use _mm_movemask_epi8 to get the MSBs of each byte, producing an uint16_t
Use _mm_slli_epi64 to shift the 128-bit register by one
Repeat until you've got all 8 uint16_ts
A generic 32-bit solution
Unfortunately, I also need to make this work on ARM. After implementing the SSE2 version, it would be easy to just find the NEON equivalents, but the Cortex-M CPU (contrary to the Cortex-A) does not have SIMD capabilities, so NEON isn't too useful for me at the moment.
NOTE: Because the Cortex-M doesn't have native 64-bit arithmetic, I could not use the ideas in any answers that suggest treating an 8x8 block as a uint64_t. Most microcontrollers that have a Cortex-M CPU also don't have much memory, so I prefer to do all this without a lookup table.
After some thinking, I found that the same algorithm can be implemented using plain 32-bit arithmetic and some clever coding. This way, I can work with 4×8 blocks at a time. It was suggested by a colleague, and the magic lies in the way 32-bit multiplication works: you can find a 32-bit number to multiply with so that the MSBs of the bytes end up next to each other in the upper 32 bits of the result.
Pack 4 uint8_ts in a 32-bit variable
Mask the 1st bit of each byte (using 0x80808080)
Multiply it with 0x02040810
Take the 4 LSBs of the upper 32 bits of the multiplication
Generally, you can mask the Nth bit in each byte (shift the mask right by N bits) and multiply with the magic number, shifted left by N bits. The advantage here is that if your compiler is smart enough to unroll the loop, both the mask and the 'magic number' become compile-time constants so shifting them does not incur any performance penalty whatsoever. There's some trouble with the last series of 4 bits, because then one LSB is lost, so in that case I needed to shift the input left by 8 bits and use the same method as the first series of 4-bits.
If you do this with two 4×8 blocks, then you can get an 8x8 block done and arrange the resulting bits so that everything goes into the right place.
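The first step of the list above (mask with 0x80808080, multiply by 0x02040810, keep the 4 LSBs of the upper word) can be sketched in C as follows. The function name `gather_msbs` is hypothetical, and the sketch writes the widening multiply as a `uint64_t` product for clarity; whether that maps to a single 32x32 to 64-bit multiply depends on the target and compiler.

```c
#include <stdint.h>

/* Collect the MSB of each of the 4 packed bytes into the low 4 bits of the
   result: bit 0 comes from the lowest byte, bit 3 from the highest byte. */
static uint32_t gather_msbs(uint32_t packed)
{
    /* Input bits 7, 15, 23, 31 times multiplier bits 25, 18, 11, 4 land on
       product bits 32, 33, 34, 35; no two partial products share a bit
       position, so no carries disturb them. */
    uint64_t product = (uint64_t)(packed & 0x80808080u) * 0x02040810u;
    return (uint32_t)(product >> 32) & 0xFu;
}
```

Calling this once per bit position (after shifting the input left so the wanted bit becomes each byte's MSB) yields one 4-bit slice of the transposed block per call.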
My suggestion is that, you don't do the transposition, rather you add one bit information to your matrix data, indicating whether the matrix is transposed or not.
Now, if you want to multiply a transposed matrix with a vector, it will be the same as multiplying the matrix on the left by the vector (and then transposing). This is easy: just some xor operations of your 8-bit numbers.
This however makes some other operations complicated (e.g. adding two matrices). But in the comment you say that multiplication is exactly what you want to optimize.
Here is the text of Jay Foad's email to me regarding fast Boolean matrix transpose:
The heart of the Boolean transpose algorithm is a function I'll call transpose8x8 which transposes an 8x8 Boolean matrix packed in a 64-bit word (in row major order from MSB to LSB). To transpose any rectangular matrix whose width and height are multiples of 8, break it down into 8x8 blocks, transpose each one individually and store them at the appropriate place in the output. To load an 8x8 block you have to load 8 individual bytes and shift and OR them into a 64-bit word. Same kinda thing for storing.
A plain C implementation of transpose8x8 relies on the fact that all the bits on any diagonal line parallel to the leading diagonal move the same distance up/down and left/right. For example, all the bits just above the leading diagonal have to move one place left and one place down, i.e. 7 bits to the right in the packed 64-bit word. This leads to an algorithm like this:
transpose8x8(word) {
return
(word & 0x0100000000000000) >> 49 // top right corner
| (word & 0x0201000000000000) >> 42
| ...
| (word & 0x4020100804020100) >> 7 // just above diagonal
| (word & 0x8040201008040201) // leading diagonal
| (word & 0x0080402010080402) << 7 // just below diagonal
| ...
| (word & 0x0000000000008040) << 42
| (word & 0x0000000000000080) << 49; // bottom left corner
}
This runs about 10x faster than the previous implementation, which copied each bit individually from the source byte in memory and merged it into the destination byte in memory.
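Since the listing above elides most of its terms, here is one way to write the same diagonal idea out in full in C. This is a sketch, not the code from the email: it assumes the layout described above (row-major, MSB first, so row r and column c sit at bit 63 - (8r + c)) and generates the 15 diagonal masks in a loop by shifting the leading-diagonal mask instead of listing them.

```c
#include <stdint.h>

/* Transpose an 8x8 bit matrix packed row-major from MSB to LSB.
   An element k places above the leading diagonal moves 7*k bits to the
   right; an element k places below it moves 7*k bits to the left. */
static uint64_t transpose8x8(uint64_t word)
{
    const uint64_t diag = 0x8040201008040201ULL;   /* leading diagonal */
    uint64_t out = word & diag;
    for (int k = 1; k < 8; ++k) {
        out |= (word & (diag << (8 * k))) >> (7 * k);   /* k-th diagonal above */
        out |= (word & (diag >> (8 * k))) << (7 * k);   /* k-th diagonal below */
    }
    return out;
}
```

For k = 1 the two masks are 0x4020100804020100 and 0x0080402010080402, and for k = 7 they are the two corner bits, which matches the terms shown in the listing. A compiler can unroll the loop so that every mask becomes a constant.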
Alternatively, if you have PDEP and PEXT instructions you can implement a perfect shuffle, and use that to do the transpose as mentioned in Hacker's Delight. This is significantly faster (but I don't have timings handy):
shuffle(word) {
return pdep(word >> 32, 0xaaaaaaaaaaaaaaaa) | pdep(word, 0x5555555555555555);
} // outer perfect shuffle
transpose8x8(word) { return shuffle(shuffle(shuffle(word))); }
POWER's vgbbd instruction effectively implements the whole of transpose8x8 in a single instruction (and since it's a 128-bit vector instruction it does it twice, independently, on the low 64 bits and the high 64 bits). This gave about 15% speed-up over the plain C implementation. (Only 15% because, although the bit twiddling is much faster, the overall run time is now dominated by the time it takes to load 8 bytes and assemble them into the argument to transpose8x8, and to take the result and store it as 8 separate bytes.)
My suggestion would be to use a lookup table to speed up the processing.
Another thing to note is with the current definition of your matrix the maximum size will be 8x8 bits. This fits into a uint64_t so we can use this to our advantage especially when using a 64-bit platform.
I have worked out a simple example using a lookup table which you can find below and run using: http://www.tutorialspoint.com/compile_cpp11_online.php online compiler.
Example code
#include <iostream>
#include <bitset>
#include <stdint.h>
#include <assert.h>
using std::cout;
using std::endl;
using std::bitset;
/* Static lookup table */
static uint64_t lut[256];
/* Helper function to print array */
template<int N>
void print_arr(const uint8_t (&arr)[N]){
for(int i=0; i < N; ++i){
cout << bitset<8>(arr[i]) << endl;
}
}
/* Transpose function */
template<int N>
void transpose_bitmatrix(const uint8_t (&matrix)[N], uint8_t (&transposed)[8]){
assert(N <= 8);
uint64_t value = 0;
for(int i=0; i < N; ++i){
value = (value << 1) + lut[matrix[i]];
}
/* Ensure safe copy to prevent misalignment issues */
/* Can be removed if input array can be treated as uint64_t directly */
for(int i=0; i < 8; ++i){
transposed[i] = (value >> (i * 8)) & 0xFF;
}
}
/* Calculate lookup table */
void calculate_lut(void){
/* For all byte values */
for(uint64_t i = 0; i < 256; ++i){
auto b = std::bitset<8>(i);
auto v = std::bitset<64>(0);
/* For all bits in current byte */
for(int bit=0; bit < 8; ++bit){
if(b.test(bit)){
v.set((7 - bit) * 8);
}
}
lut[i] = v.to_ullong();
}
}
int main()
{
calculate_lut();
const uint8_t matrix[] = {
0b01010101,
0b00110011,
0b00001111,
};
uint8_t transposed[8];
transpose_bitmatrix(matrix, transposed);
print_arr(transposed);
return 0;
}
How it works
Your 3x8 matrix will be transposed to an 8x3 matrix, represented in an 8x8 array.
The issue is that you want to convert bits from your "horizontal" representation to a vertical one, divided over several bytes.
As I mentioned above, we can take advantage of the fact that the output (8x8) will always fit into a uint64_t. This helps because we can use a uint64_t to write the 8-byte array, and we can also add, xor, etc. on it, since basic arithmetic operations work on a 64-bit integer.
Each entry in your 3x8 matrix (input) is 8 bits wide. To optimize processing, we first generate a 256-entry lookup table (one entry for each byte value). The entry itself is a uint64_t and will contain a rotated version of the bits.
example:
byte = 0b01001111 = 0x4F
lut[0x4F] = 0x0101010100000100 = (uint8_t[]){ 0, 1, 0, 0, 1, 1, 1, 1 } (bytes listed from least to most significant)
Now for the calculation:
For the calculations we use the uint64_t, but keep in mind that under the hood it represents a uint8_t[8] array. We simply shift the current value (starting with 0), look up our first byte, and add it to the current value.
The 'magic' here is that each byte of the uint64_t in the lookup table is either 1 or 0, so it only sets the least significant bit of each byte. Shifting the uint64_t shifts each byte; as long as we do not do this more than 8 times, we can operate on each byte individually.
Issues
As someone noted in the comments: Translate(Translate(M)) != M so if you need this you need some additional work.
Performance can be improved by directly mapping uint64_t's instead of uint8_t[8] arrays, since that omits the "safe copy" used to prevent alignment issues.
I have added a new answer instead of editing my original one to make this more visible (no comment rights, unfortunately).
In your own answer you add an additional requirement not present in the first one: it has to work on ARM Cortex-M.
I did come up with an alternative solution for ARM in my original answer but omitted it, as it was not part of the question and seemed off topic (mostly because of the C++ tag).
ARM Specific solution Cortex-M:
Some or most Cortex-M 3/4 have a bit banding region which can be used for exactly what you need, it expands bits into 32-bit fields, this region can be used to perform atomic bit operations.
If you put your array in a bitbanded region it will have an 'exploded' mirror in the bitband region where you can just use move operations on the bits itself. If you make a loop the compiler will surely be able to unroll and optimize to just move operations.
If you really want to, you can even setup a DMA controller to process an entire batch of transpose operations with a bit of effort and offload it entirely from the cpu :)
Perhaps this might still help you.
This is a bit late, but I just stumbled across this interchange today.
If you look at Hacker's Delight, 2nd Edition, there are several algorithms for efficiently transposing Boolean arrays, starting on page 141.
They are quite efficient: a colleague of mine obtained about a 10X speedup compared to naive coding, on an x86.
Here's what I posted on GitHub (mischasan/sse2/ssebmx.src)
Changing INP() and OUT() to use induction vars saves an IMUL each.
AVX256 does it twice as fast.
AVX512 is not an option, because there is no _mm512_movemask_epi8().
#include <stdint.h>
#include <emmintrin.h>
#define INP(x,y) inp[(x)*ncols/8 + (y)/8]
#define OUT(x,y) out[(y)*nrows/8 + (x)/8]
void ssebmx(char const *inp, char *out, int nrows, int ncols)
{
int rr, cc, i, h;
union { __m128i x; uint8_t b[16]; } tmp;
// Do the main body in [16 x 8] blocks:
for (rr = 0; rr <= nrows - 16; rr += 16)
for (cc = 0; cc < ncols; cc += 8) {
for (i = 0; i < 16; ++i)
tmp.b[i] = INP(rr + i, cc);
for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
*(uint16_t*)&OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
}
if (rr == nrows) return;
// The remainder is a row of [8 x 16]* [8 x 8]?
// Do the [8 x 16] blocks:
for (cc = 0; cc <= ncols - 16; cc += 16) {
for (i = 8; i--;)
tmp.b[i] = h = *(uint16_t const*)&INP(rr + i, cc),
tmp.b[i + 8] = h >> 8;
for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
OUT(rr, cc + i) = h = _mm_movemask_epi8(tmp.x),
OUT(rr, cc + i + 8) = h >> 8;
}
if (cc == ncols) return;
// Do the remaining [8 x 8] block:
for (i = 8; i--;)
tmp.b[i] = INP(rr + i, cc);
for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
}
HTH.
Inspired by Robert's answer, polynomial multiplication in Arm Neon can be utilised to scatter the bits --
inline poly8x16_t mull_lo(poly8x16_t a) {
auto b = vget_low_p8(a);
return vreinterpretq_p8_p16(vmull_p8(b,b));
}
inline poly8x16_t mull_hi(poly8x16_t a) {
auto b = vget_high_p8(a);
return vreinterpretq_p8_p16(vmull_p8(b,b));
}
auto a = mull_lo(word);
auto b = mull_lo(a), c = mull_hi(a);
auto d = mull_lo(b), e = mull_hi(b);
auto f = mull_lo(c), g = mull_hi(c);
Then the vsli can be used to combine the bits pairwise.
auto ab = vsli_p8(vget_high_p8(d), vget_low_p8(d), 1);
auto cd = vsli_p8(vget_high_p8(e), vget_low_p8(e), 1);
auto ef = vsli_p8(vget_high_p8(f), vget_low_p8(f), 1);
auto gh = vsli_p8(vget_high_p8(g), vget_low_p8(g), 1);
auto abcd = vsli_p8(ab, cd, 2);
auto efgh = vsli_p8(ef, gh, 2);
return vsli_p8(abcd, efgh, 4);
Clang optimizes this code to avoid vmull2 instructions, making heavy use of ext q0,q0,8 for vget_high_p8.
An iterative approach would possibly be not only faster, but would also use fewer registers and vectorize for 2x or more throughput.
// transpose bits in 2x2 blocks, first 4 rows
// x = a b|c d|e f|g h a i|c k|e m|g o | byte 0
// i j|k l|m n|o p b j|d l|f n|h p | byte 1
// q r|s t|u v|w x q A|s C|u E|w G | byte 2
// A B|C D|E F|G H r B|t D|v F|x H | byte 3 ...
// ----------------------
auto a = (x & 0x00aa00aa00aa00aaull);
auto b = (x & 0x5500550055005500ull);
auto c = (x & 0xaa55aa55aa55aa55ull) | (a << 7) | (b >> 7);
// transpose 2x2 blocks (first 4 rows shown)
// aa bb cc dd aa ii cc kk
// ee ff gg hh -> ee mm gg oo
// ii jj kk ll bb jj dd ll
// mm nn oo pp ff nn hh pp
auto d = (c & 0x0000cccc0000ccccull);
auto e = (c & 0x3333000033330000ull);
auto f = (c & 0xcccc3333cccc3333ull) | (d << 14) | (e >> 14);
// Final transpose of 4x4 bit blocks
auto g = (f & 0x00000000f0f0f0f0ull);
auto h = (f & 0x0f0f0f0f00000000ull);
x = (f & 0xf0f0f0f00f0f0f0full) | (g << 28) | (h >> 28);
In ARM each step can now be composed with 3 instructions:
auto tmp = vrev16_u8(x);
tmp = vshl_u8(tmp, plus_minus_1); // 0xff01ff01ff01ff01ull
x = vbsl_u8(mask_1, x, tmp); // 0xaa55aa55aa55aa55ull
tmp = vrev32_u16(x);
tmp = vshl_u16(tmp, plus_minus_2); // 0xfefe0202fefe0202ull
x = vbsl_u8(mask_2, x, tmp); // 0xcccc3333cccc3333ull
tmp = vrev64_u32(x);
tmp = vshl_u32(tmp, plus_minus_4); // 0xfcfcfcfc04040404ull
x = vbsl_u8(mask_4, x, tmp); // 0xf0f0f0f00f0f0f0full
Consider this program
#include <iostream>
#include <bitset>
#include <cstdint>
#include <cstdlib>
typedef uint8_t Tnum;
template <typename T>
void printBits(T a)
{
std::cout << std::bitset<(sizeof(a) * 8)>(a).to_string() << '\n';
}
int main()
{
printBits(Tnum(15));
printBits(Tnum(17));
return EXIT_SUCCESS;
}
it prints
00001111
00010001
Now consider this 2 guys from the previous output
00001111
^
00010001
^
I would like to know how, given a signed or unsigned integer type and a value of that type, I can get the location of the leading 1 in the bit pattern. Counting from 0, the result I expect is 3 for the first row and 4 for the second one. The total number of positions involved is also acceptable to me, i.e. 4 for the first row and 5 for the second one.
I don't have Hacker's Delight or a similar text available at the moment, and I can't find any quick bit twiddling.
This is kinda it but it's error prone and it will never pass a conversion test or set of warning flags about conversions, at least in my case. Plus it's probably a non-optimal choice.
Please no lookup tables; I'm willing to accept anything that doesn't cause conversion issues and doesn't use a LUT. For C89/99 and C++11.
If this is X86 and you can use assembly, there's the bit scan reverse instruction. Depending on the compiler, there may be an intrinsic for this.
bit scan reverse
Why don't you have access to hacker's delight ? Proxy limitation ?
Here is the solution from http://www.hackersdelight.org/hdcodetxt/nlz.c.txt
int nlz1(unsigned x) {
int n;
if (x == 0) return(32);
n = 0;
if (x <= 0x0000FFFF) {n = n +16; x = x <<16;}
if (x <= 0x00FFFFFF) {n = n + 8; x = x << 8;}
if (x <= 0x0FFFFFFF) {n = n + 4; x = x << 4;}
if (x <= 0x3FFFFFFF) {n = n + 2; x = x << 2;}
if (x <= 0x7FFFFFFF) {n = n + 1;}
return n;
}
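To get from a leading-zero count to the position asked for in the question, subtract it from 31. A small C sketch, assuming `unsigned` is 32 bits wide as nlz1 itself does; `msb_index` is a hypothetical name, and nlz1 is repeated here only to keep the block self-contained:

```c
/* nlz1 as quoted above: number of leading zeros of a 32-bit value. */
static int nlz1(unsigned x)
{
    int n;
    if (x == 0) return 32;
    n = 0;
    if (x <= 0x0000FFFF) { n = n + 16; x = x << 16; }
    if (x <= 0x00FFFFFF) { n = n + 8;  x = x << 8; }
    if (x <= 0x0FFFFFFF) { n = n + 4;  x = x << 4; }
    if (x <= 0x3FFFFFFF) { n = n + 2;  x = x << 2; }
    if (x <= 0x7FFFFFFF) { n = n + 1; }
    return n;
}

/* Zero-based position of the leading 1 bit, or -1 when x == 0. */
static int msb_index(unsigned x)
{
    return 31 - nlz1(x);
}
```

For the two values in the question, 15 gives 3 and 17 gives 4; adding 1 gives the "total amount of positions" variant.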
As others have stated, x86-64 processors have an instruction that finds the most significant set bit (bit scan reverse), which can be accessed through compilers using either inline assembly or compiler intrinsics. The Microsoft C compiler has the _BitScanReverse intrinsic for 32 bits.
An example of how to insert inline assembly code for the gcc compiler may be found here: https://www.biicode.com/pablodev/pablodev/bitscan/master/25/bitboard.h
In case you are not interested in this type of solution, an O(log(N)) solution with a good compromise between table size and speed, using a de Bruijn magic number, is:
uint32_t v;
int r;
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
};
v |= v >> 1;
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
r = MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27];
Basically the first shifts round the input number to one less than a power of 2 and then de Bruijn multiplication and lookup does the rest. The shifts are not necessary when it is known that the input number is a power of 2 (the magic number is different though). All information is available here.
You can use the _BitScanReverse intrinsic if you're using Visual Studio.
I have a char array A that I use to store hex data:
A = "0A F5 6D 02" size=11
The binary representation of this char array is:
00001010 11110101 01101101 00000010
Is there any function that can randomly flip bits?
That is:
if the parameter is 5
00001010 11110101 01101101 00000010
-->
10001110 11110001 01101001 00100010
it will randomly choose 5 bits to flip.
I am trying to convert this hex data to binary data and use a bitmask to achieve this, then convert it back to hex. Is there any method to do this job more quickly?
Sorry, my question description was not clear enough. Put simply, I have some hex data, and I want to simulate bit errors in it. For example, if I have 5 bytes of hex data:
"FF00FF00FF"
binary representation is
"1111111100000000111111110000000011111111"
If the bit error rate is 10%, then I want 4 of these 40 bits to be in error. One extreme random result: the errors happen in the first 4 bits:
"0000111100000000111111110000000011111111"
First of all, find out which char the bit represents:
param is your bit to flip...
char *byteToWrite = &A[sizeof(A) - (param / 8) - 1];
So that will give you a pointer to the char at that array offset (-1 for 0 array offset vs size)
Then get modulus (or more bit shifting if you're feeling adventurous) to find out which bit in here to flip:
*byteToWrite ^= (1u << param % 8);
So for a param of 5, that should result in the byte at A[10] having its 5th bit toggled.
store the values of 2^n in an array
generate a random number seed
loop x times (in this case 5) and do data ^= stored_values[random_num]
Alternatively to storing the 2^n values in an array, you could do some bit shifting to a random power of 2 like:
data ^= (1 << (random % 8))
Reflecting the first comment, you really could just write out that line 5 times in your function and avoid the overhead of a for loop entirely.
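You have a 32-bit number. You can treat the bits as parts of that number and just XOR the number with some random value that has exactly 5 bits set.
#include <cstdio>
#include <cstdlib>
#include <cstdint>
// count the set bits of a 32-bit value (parallel bit count)
int count_1s(uint32_t foo)
{
uint32_t m = 0x55555555;
uint32_t r = (foo&m) + ((foo>>1)&m);
m = 0x33333333;
r = (r&m) + ((r>>2)&m);
m = 0x0F0F0F0F;
r = (r&m) + ((r>>4)&m);
m = 0x00FF00FF;
r = (r&m) + ((r>>8)&m);
m = 0x0000FFFF;
return (r&m) + ((r>>16)&m);
}
int main()
{
const char input[] = "0A F5 6D 02";
unsigned int data[4] = {};
// parse the four hex bytes from the string
sscanf(input, "%2x %2x %2x %2x", &data[0], &data[1], &data[2], &data[3]);
uint32_t x = (data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3];
uint32_t y = rand();
while(count_1s(y) != 5)
{
y = rand(); // let's have this more random; note rand() may not cover all 32 bits
}
x ^= y;
printf("%02X %02X %02X %02X\n", (unsigned)(x >> 24), (unsigned)((x >> 16) & 0xFF), (unsigned)((x >> 8) & 0xFF), (unsigned)(x & 0xFF));
return 0;
}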
I see no reason to convert the entire string back and forth from and to hex notation. Just pick a random character out of the hex string, convert this to a digit, change it a bit, convert back to hex character.
In plain C:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main (void)
{
char *hexToDec_lookup = "0123456789ABCDEF";
char hexstr[] = "0A F5 6D 02";
/* 0. make sure we're fairly random */
srand(time(0));
/* 1. loop 5 times .. */
int i;
for (i=0; i<5; i++)
{
/* 2. pick a random hex digit
we know it's one out of 8, grouped per 2 */
int hexdigit = rand() & 7;
hexdigit += (hexdigit>>1);
/* 3. convert the digit to binary */
int hexvalue = hexstr[hexdigit] > '9' ? hexstr[hexdigit] - 'A'+10 : hexstr[hexdigit]-'0';
/* 4. flip a random bit */
hexvalue ^= 1 << (rand() & 3);
/* 5. write it back into position */
hexstr[hexdigit] = hexToDec_lookup[hexvalue];
printf ("[%s]\n", hexstr);
}
return 0;
}
It might even be possible to omit the convert-to-and-from-ASCII steps -- flip a bit in the character string, check if it's still a valid hex digit and if necessary, adjust.
First randomly choose x positions (each position consists of an array index and a bit position).
Now, to flip the ith bit from the right (counting from 0) of a non-negative number n, split n into quotient and remainder by 2^i:
code:
int divisor = 1 << i; // 2^i
int remainder = n % divisor; // the bits below position i, left unchanged
int quotient = n / divisor; // the i th bit is now the lowest bit of quotient
quotient += (quotient % 2 == 0) ? 1 : -1; // flip the i th bit from right.
n = divisor * quotient + remainder;
Take the input mod 8 (e.g. 5 % 8).
Shift 0x80 to the right by that value (e.g. 5).
XOR this value with the (input/8)th element of your character array.
code:
void flip_bit(int bit)
{
Array[bit/8] ^= (0x80>>(bit%8));
}
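Building on the single-bit flip above, here is a sketch in C for the bit-error simulation from the question: flip k distinct, randomly chosen bits in a raw byte buffer. The names `flip_random_bits` and `popcount8` are hypothetical. It assumes the hex string has already been converted to bytes, that k is at most 8 * nbytes, and that nbytes is at most 32; `rand()` is used for brevity and is slightly biased by the modulo.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Flip k distinct, randomly chosen bits in buf (nbytes bytes long).
   Assumes k <= 8 * nbytes and nbytes <= 32. */
static void flip_random_bits(uint8_t *buf, size_t nbytes, size_t k)
{
    uint8_t flipped[32] = {0};                 /* marks bits already flipped */
    size_t nbits = 8 * nbytes;
    while (k > 0) {
        size_t bit = (size_t)rand() % nbits;
        uint8_t mask = (uint8_t)(0x80u >> (bit % 8));  /* bit 0 = MSB of byte 0 */
        if (flipped[bit / 8] & mask)
            continue;                          /* already flipped, pick again */
        flipped[bit / 8] |= mask;
        buf[bit / 8] ^= mask;
        --k;
    }
}

/* Number of set bits in a byte, handy for counting how many bits differ. */
static unsigned popcount8(uint8_t x)
{
    unsigned n = 0;
    while (x) {
        n += x & 1u;
        x >>= 1;
    }
    return n;
}
```

For the 5-byte example with a 10% error rate, one would call `flip_random_bits(data, 5, 4)`. Tracking the already-flipped positions matters: without it, the same bit could be picked twice and flipped back, giving fewer errors than requested.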