How to simplify this "clear multiple bits at once" function? - bit-manipulation

I finally figured out through trial and error how to clear multiple bits on an integer:
const getNumberOfBitsInUint8 = function(i8) {
let i = 0
while (i8) {
i++
i8 >>= 1
}
return i
}
const write = function(n, i, x) {
let o = 0xff // 0b11111111
let c = getNumberOfBitsInUint8(x)
let j = 8 - i // right side start
let k = j - c // right side remaining
let h = c + i
let a = x << k // set bits
let b = a ^ o // set bits flip
let d = o >> h // mask right
let q = d ^ b //
let m = o >> j // mask left
let s = m << j
let t = s ^ q // clear bits!
let w = n | a // set the set bits
let z = w & ~t // perform some magic https://stackoverflow.com/q/8965521/169992
return z
}
The write function takes an integer n, the index i to write bits into, and the bits value x.
Is there any way to simplify this function down and remove some steps? (Without just combining multiple operations on a single line)?

One possibility is to first clear the relevant part and then copy the bits into it:
return (n & ~((0xff << (8 - c)) >> i)) | (x << (8 - c - i))
assuming the left shift is restricted to 8 bits so the top bits disappear (JavaScript shifts work on 32-bit integers, so there the shifted value needs an extra & 0xff). Another is to use xor to find the bits to be changed:
return n ^ ((((n >> (8 - c - i)) ^ x) << (8 - c)) >> i)
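The "clear, then copy" idea can be sketched in C, where a cast to uint8_t performs the 8-bit truncation the expression relies on. This is a minimal sketch, not the asker's function: the name write_bits is hypothetical, and the field width c is passed explicitly (an assumption) rather than derived from x, so that values with leading zero bits can be written too. It assumes i + c <= 8.

```c
#include <stdint.h>

/* Overwrite c bits of n, starting i bits from the most significant end,
 * with the low c bits of x. Hypothetical helper; requires i + c <= 8. */
static uint8_t write_bits(uint8_t n, unsigned i, uint8_t x, unsigned c)
{
    /* c ones, left-aligned in the byte, then moved right by i positions. */
    uint8_t mask = (uint8_t)((uint8_t)(0xFFu << (8u - c)) >> i);
    /* The new bits, moved to the same position as the mask. */
    uint8_t field = (uint8_t)((unsigned)x << (8u - c - i));
    /* Clear the target bits, then copy the new ones in. */
    return (uint8_t)((n & (uint8_t)~mask) | (field & mask));
}
```

For example, write_bits(0xFF, 2, 0x0, 3) clears three bits starting at index 2 and gives 0xC7.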

Related

Code to find X such that the product (A ^ X) * (B ^ X) is maximised

Find X such that (A ^ X) * (B ^ X) is maximum
Given A, B, and N (X < 2^N)
Return the maximum product modulus 10^9+7.
Example:
A = 4
B = 6
N = 3
We can choose X = 3 and (A ^ X) = 7 and (B ^ X) = 5.
The product will be 35 which is the maximum.
Here is my code:
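int limit = (1 << N) - 1;
int MOD = 1_000_000_007;
long maxProd = 0; // compare the raw products; reduce modulo MOD only once at the end
for (int i = 0; i <= limit; i++) { // X = 0 is a valid choice too
long x1 = A ^ i;
long x2 = B ^ i;
maxProd = Math.max(maxProd, x1 * x2); // long arithmetic avoids int overflow
}
return (int) (maxProd % MOD);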
1. For bits at position N and above, X is zero, so A^X and B^X keep the bits of A and B there.
2. Find the bits in positions 0 to N-1 where A and B agree. Where both are set, X is 0; where both are zero, X is 1. Either way, both results get a 1 in that position.
3. For the bits where A and B differ, X can be either 0 or 1: exactly one of the two results gets that bit.
4. Steps 1 and 2 fix part of each result; call these values a and b. They are known constants.
5. Step 3 leaves a set of powers of two, such as 2^3, 2^1, …; call their sum tot. It is also a known constant.
6. The question becomes: maximize (a+tot-sth)*(b+sth), where sth is the sum of some subset of the powers of two from step 3.
7. The product is largest when (a+tot-sth) and (b+sth) are as close to each other as possible.
8. If a==b, give the most significant bit from step 3 to one of them and all the remaining bits to the other.
9. If a!=b, give all the bits from step 3 to the smaller one.
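The greedy assignment described above can be sketched in C as follows. This is only an illustration under stated assumptions: the function name max_xor_product is hypothetical, A and B are taken as 32-bit unsigned values, and N is at most 32, so the product fits in 64 bits before the final reduction.

```c
#include <stdint.h>

/* Hypothetical helper: maximum of (A ^ X) * (B ^ X) over X < 2^N, modulo 1e9+7. */
static uint32_t max_xor_product(uint32_t A, uint32_t B, unsigned N)
{
    const uint64_t MOD = 1000000007ULL;
    uint32_t low  = (N >= 32) ? 0xFFFFFFFFu : ((1u << N) - 1u); /* bits X may touch */
    uint32_t diff = (A ^ B) & low;   /* low bits where A and B differ */
    uint32_t same = low & ~diff;     /* low bits where they agree: set in both results */
    uint32_t a = (A & ~low) | same;
    uint32_t b = (B & ~low) | same;

    if (diff != 0) {
        if (a == b) {
            /* Isolate the highest differing bit by clearing the lower ones. */
            uint32_t top = diff;
            while (top & (top - 1u))
                top &= top - 1u;
            a |= top;            /* one side gets the top bit */
            b |= diff ^ top;     /* the other side gets all the rest */
        } else if (a < b) {
            a |= diff;           /* the smaller side gets every differing bit */
        } else {
            b |= diff;
        }
    }
    return (uint32_t)(((uint64_t)a * b) % MOD);
}
```

With the example from the question, max_xor_product(4, 6, 3) yields 35.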

Shift masked bits to the lsb

When you AND some data with a mask, you get a result of the same size as the data/mask.
What I want to do is take the masked bits in the result (where there was a 1 in the mask) and shift them to the right so they are next to each other and I can perform a CTZ (Count Trailing Zeroes) on them.
I didn't know how to name such a procedure, so Google has failed me. The operation should preferably not be a loop solution; it has to be as fast as possible.
And here is an incredible image made in MS Paint.
This operation is known as compress right. It is implemented as part of BMI2 as the PEXT instruction, in Intel processors as of Haswell.
Unfortunately, without hardware support it is quite an annoying operation. Of course there is an obvious solution, just moving the bits one by one in a loop; here is the one given by Hacker's Delight:
unsigned compress(unsigned x, unsigned m) {
unsigned r, s, b; // Result, shift, mask bit.
r = 0;
s = 0;
do {
b = m & 1;
r = r | ((x & b) << s);
s = s + b;
x = x >> 1;
m = m >> 1;
} while (m != 0);
return r;
}
But there is another way, also given by Hacker's Delight, which does less looping (the number of iterations is logarithmic in the number of bits) but more work per iteration:
unsigned compress(unsigned x, unsigned m) {
unsigned mk, mp, mv, t;
int i;
x = x & m; // Clear irrelevant bits.
mk = ~m << 1; // We will count 0's to right.
for (i = 0; i < 5; i++) {
mp = mk ^ (mk << 1); // Parallel prefix.
mp = mp ^ (mp << 2);
mp = mp ^ (mp << 4);
mp = mp ^ (mp << 8);
mp = mp ^ (mp << 16);
mv = mp & m; // Bits to move.
m = m ^ mv | (mv >> (1 << i)); // Compress m.
t = x & mv;
x = x ^ t | (t >> (1 << i)); // Compress x.
mk = mk & ~mp;
}
return x;
}
Notice that a lot of the values there depend only on m. Since you only have 512 different masks, you could precompute those and simplify the code to something like this (not tested)
unsigned compress(unsigned x, int maskindex) {
unsigned t;
int i;
x = x & masks[maskindex][0];
for (i = 0; i < 5; i++) {
t = x & masks[maskindex][i + 1];
x = x ^ t | (t >> (1 << i));
}
return x;
}
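The masks table used above is not defined in the answer. One way it might be filled is to run the mask-only part of the Hacker's Delight loop once per mask and store the five "bits to move" values. The sketch below assumes that layout; the names compress_plan, make_plan and compress_with_plan are hypothetical and use a small struct in place of the masks[maskindex] rows.

```c
#include <stdint.h>

/* Everything that depends only on the mask m (hypothetical layout). */
typedef struct {
    uint32_t keep;   /* the mask itself, used to clear irrelevant bits */
    uint32_t mv[5];  /* bits to move right by 1, 2, 4, 8 and 16 */
} compress_plan;

static compress_plan make_plan(uint32_t m)
{
    compress_plan p;
    uint32_t mk = ~m << 1;                 /* we will count 0's to the right */
    p.keep = m;
    for (int i = 0; i < 5; i++) {
        uint32_t mp = mk ^ (mk << 1);      /* parallel prefix */
        mp ^= mp << 2;
        mp ^= mp << 4;
        mp ^= mp << 8;
        mp ^= mp << 16;
        uint32_t mv = mp & m;              /* bits to move in this round */
        p.mv[i] = mv;
        m = (m ^ mv) | (mv >> (1u << i));  /* compress m */
        mk &= ~mp;
    }
    return p;
}

static uint32_t compress_with_plan(uint32_t x, const compress_plan *p)
{
    x &= p->keep;
    for (int i = 0; i < 5; i++) {
        uint32_t t = x & p->mv[i];
        x = (x ^ t) | (t >> (1u << i));    /* compress x */
    }
    return x;
}
```

A plan would be built once per mask, for example make_plan(0xA9), and then reused for every data word.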
Of course all of these can be turned into "not a loop" by unrolling; the second and third ways are probably more suitable for that. That's a bit of a cheat, however.
You can use the pack-by-multiplication technique similar to the one described here. This way you don't need any loop and can mix the bits in any order.
For example, with the mask 0b10101001 == 0xA9 like above and 8-bit data abcdefgh (where a-h are the 8 bits) you can use the expression below to get 0000aceh
uint8_t compress_maskA9(uint8_t x)
{
const uint8_t mask1 = 0xA9 & 0xF0;
const uint8_t mask2 = 0xA9 & 0x0F;
return (((x & mask1)*0x03000000u >> 28) & 0x0C) | ((x & mask2)*0x50000000u >> 30); // unsigned constants, so the product wraps instead of overflowing a signed int
}
In this specific case there are some overlaps of the 4 bits while adding (which incur unexpected carry) during the multiplication step, so I've split them into 2 parts, the first one extracts bit a and c, then e and h will be extracted in the latter part. There are other ways to split the bits as well, like a & h then c & e. You can see the results compared to Harold's function live on ideone
An alternate way with only one multiplication
const uint32_t X = (x << 8) | x;
return (X & 0x8821)*0x12050000 >> 28;
I got this by duplicating the bits so that they're spaced out farther, leaving enough space to avoid the carry. This is often better than splitting into 2 multiplications
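A quick way to gain confidence in such magic numbers is to compare the multiplication against a plain bit-by-bit loop for every 8-bit input. The sketch below does that for the single-multiplication variant with mask 0xA9; the function names are hypothetical, and it assumes a 32-bit unsigned int so that the product wraps at 32 bits.

```c
#include <stdint.h>

/* Reference: move the bits of x selected by m to the low end, one at a time. */
static unsigned compress_ref(unsigned x, unsigned m)
{
    unsigned r = 0, s = 0;
    for (; m != 0; x >>= 1, m >>= 1) {
        unsigned b = m & 1u;
        r |= (x & b) << s;
        s += b;
    }
    return r;
}

/* Single-multiplication variant for the fixed mask 0xA9. */
static uint8_t compress_a9_mul(uint8_t x)
{
    const uint32_t X = ((uint32_t)x << 8) | x;   /* duplicate the byte */
    return (uint8_t)(((X & 0x8821u) * 0x12050000u) >> 28);
}

/* Returns 1 if both versions agree on all 256 inputs, 0 otherwise. */
static int compress_a9_agrees(void)
{
    for (unsigned x = 0; x < 256; x++)
        if (compress_a9_mul((uint8_t)x) != compress_ref(x, 0xA9u))
            return 0;
    return 1;
}
```

The same harness can be reused for other masks by swapping in the corresponding AND constant and multiplier.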
If you want the result's bits reversed (i.e. 0000heca) you can easily change the magic numbers accordingly
// result: he00 | 00ca;
return (((x & 0x09)*0x88000000u >> 28) & 0x0C) | (((x & 0xA0)*0x04800000u) >> 30);
or you can also extract the 3 bits e, c and a at the same time, leaving h separately (as I mentioned above, there are often multiple solutions) and you need only one multiplication
return ((x & 0xA8)*0x12400000u >> 29) | (x & 0x01) << 3; // result: 0eca | h000
But there might be a better alternative like the above second snippet
const uint32_t X = (x << 8) | x;
return (X & 0x2881)*0x80290000 >> 28;
Correctness check: http://ideone.com/PYUkty
For a larger number of masks you can precompute the magic numbers corresponding to those masks and store them in an array so that you can look them up immediately. I calculated those magic numbers by hand, but you can do that automatically.
Explanation
We have abcdefgh & mask1 = a0c00000. Multiply it with magic1
........................a0c00000
× 00000011000000000000000000000000 (magic1 = 0x03000000)
────────────────────────────────
a0c00000........................
+ a0c00000......................... (the leading "a" bit is outside int's range
──────────────────────────────── so it'll be truncated)
r1 = acc.............................
=> (r1 >> 28) & 0x0C = 0000ac00
Similarly we multiply abcdefgh & mask2 = 0000e00h with magic2
........................0000e00h
× 01010000000000000000000000000000 (magic2 = 0x50000000)
────────────────────────────────
e00h............................
+ 0h..............................
────────────────────────────────
r2 = eh..............................
=> (r2 >> 30) = 000000eh
Combine them together we have the expected result
((r1 >> 28) & 0x0C) | (r2 >> 30) = 0000aceh
And here's the demo for the second snippet
abcdefghabcdefgh
& 1000100000100001 (0x8821)
────────────────────────────────
a000e00000c0000h
× 00010010000001010000000000000000 (0x12050000)
────────────────────────────────
000h
00e00000c0000h
+ 0c0000h
a000e00000c0000h
────────────────────────────────
= acehe0h0c0c00h0h
& 11110000000000000000000000000000
────────────────────────────────
= aceh
For the reversed order case:
abcdefghabcdefgh
& 0010100010000001 (0x2881)
────────────────────────────────
00c0e000a000000h
x 10000000001010010000000000000000 (0x80290000)
────────────────────────────────
000a000000h
00c0e000a000000h
+ 0e000a000000h
h
────────────────────────────────
hecaea00a0h0h00h
& 11110000000000000000000000000000
────────────────────────────────
= heca
Related:
How to create a byte out of 8 bool values (and vice versa)?
Redistribute least significant bits from a 4-byte array to a nibble

Reach A Target number only using other two numbers

I have two numbers L and R; L means left and R means right.
I have to reach a certain number (F) using L and R.
Every time I have to start from zero.
Example :
L : 1
R : 2
F : 3
So the minimum number of steps needed to get to F is 3.
Ans: first R, second R, third L.
In this way I need to find the minimum number of steps.
My approach:
Quo = F/R;
Remain = F%R;
x*R - Y*L = Remain
==> (x*R - Remain)/L = Y
The search stops when (x*R - Remain)%L = 0, which gives x and y from the equation above.
So the final number of steps is Quo + x (no. of right steps) + y (no. of left steps).
For Above Example :
Quo = 3/2 = 1;
Remain = 3%2 =1;
Y = (x*2 -1)/1
(x*2 -1)%1 is zero for x=1;
Now increase x from zero,
So x is 1, y is 1
Final Ans = Quo (1) + x (1) + y(1) = 3.
My code :
#include <iostream>
using namespace std;
int main()
{
int F,R,L;
cin >> F;
cin >> R;
cin >> L;
int remain = F%R;
int quo = F/R;
int Right = 0;
int left = 0;
int mode = 1;
while( mode !=0)
{
Right++;
mode = (R*Right - remain)%L;
left = (R*Right - remain)/L;
}
int final = quo + Right + left;
cout << final;
}
But I don't think this is a good approach, as I am searching for x in a loop, which can be pretty costly.
Can you please suggest a better approach for this problem?
In the given below equation
x*R - Remain = 0modL
where R, L and Remain are fixed.
It can be written as
((x*R)mod L - Remain mod L) mod L = 0
If Remain mod L = 0, then x*R must be a multiple of L, which x = 0 (or any multiple of L) satisfies.
More generally, x*R mod L depends only on x mod L, so x and x + n*L (n an integer) behave the same.
So, simply, you can try x between 0 and L-1 to find x.
So your loop can run from 0 to L-1, which keeps it finite.
Please note that this mod is different from %. -1 mod L = L-1 whereas -1%L = -1
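A minimal sketch of that bounded search in C, assuming R and L are positive. The function name smallest_x is hypothetical. The ((v % L) + L) % L idiom implements the mathematical mod mentioned above, so a negative Remain is handled too.

```c
/* Smallest x in [0, L) with (x*R - remain) divisible by L, or -1 if there is none.
 * Hypothetical helper; assumes R > 0 and L > 0. */
static long smallest_x(long R, long L, long remain)
{
    long target = ((remain % L) + L) % L;   /* mathematical mod, never negative */
    for (long x = 0; x < L; x++) {
        if ((x * (R % L)) % L == target)
            return x;
    }
    return -1;   /* no x in 0..L-1 works, so no integer x works */
}
```

For instance, smallest_x(3, 7, 5) finds x = 4, since 4*3 = 12 leaves remainder 5 when divided by 7.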
There is another approach.
x*R mod L - Remain mod L = 0 mod L
leads to
x*R mod L = Remain mod L
(x* (R mod L)) mod L = (Remain mod L)
You can compute the inverse of R (say Rinv) modulo L, which exists exactly when gcd(R, L) = 1, and compute x = (Remain*Rinv) mod L.
If the inverse does not exist, the equation can be satisfied only when gcd(R, L) divides Remain.
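The inverse-based approach might be sketched with the extended Euclidean algorithm, as below. The names ext_gcd and solve_congruence are hypothetical, and the sketch assumes L > 0 and values small enough that the products do not overflow long. It also covers the case gcd(R, L) > 1, where a solution exists exactly when the gcd divides Remain.

```c
/* Extended Euclid: returns g = gcd(a, b) and sets *s, *t so that a*s + b*t == g. */
static long ext_gcd(long a, long b, long *s, long *t)
{
    if (b == 0) {
        *s = 1;
        *t = 0;
        return a;
    }
    long s1, t1;
    long g = ext_gcd(b, a % b, &s1, &t1);
    *s = t1;
    *t = s1 - (a / b) * t1;
    return g;
}

/* Smallest x >= 0 with x*R congruent to remain modulo L, or -1 if there is none. */
static long solve_congruence(long R, long L, long remain)
{
    long r   = ((R % L) + L) % L;           /* mathematical mod */
    long rem = ((remain % L) + L) % L;
    long s, t;
    long g = ext_gcd(r, L, &s, &t);         /* r*s + L*t == g */
    if (rem % g != 0)
        return -1;                          /* gcd does not divide remain */
    long Lg = L / g;                        /* s is the inverse of r/g modulo L/g */
    long x = ((s % Lg) * ((rem / g) % Lg)) % Lg;
    return (x + Lg) % Lg;                   /* normalise into 0..Lg-1 */
}
```

Unlike the loop over 0..L-1, this takes a number of steps logarithmic in L.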
Note: I am not mathematical expert. So, please give your opinion if anything is wrong.
See: https://www.cs.cmu.edu/~adamchik/21-127/lectures/congruences_print.pdf

Fastest way to transpose 4x4 byte matrix

I have a 4x4 block of bytes that I'd like to transpose using general purpose hardware. In other words, for bytes A-P, I'm looking for the most efficient (in terms of number of instructions) way to go from
A B C D
E F G H
I J K L
M N O P
to
A E I M
B F J N
C G K O
D H L P
We can assume that I have valid pointers pointing to A, E, I, and M in memory (such that reading 32-bits from A will get me the integer containing bytes ABCD).
This is not a duplicate of this question because of the restrictions on both size and data type. Each row of my matrix can fit into a 32-bit integer, and I'm looking for answers that can perform a transpose quickly using general purpose hardware, similar to the implementation of the SSE macro _MM_TRANSPOSE4_PS.
You want portability and efficiency. Well, you can't have it both ways. You said you want to do this with the fewest number of instructions. Well, it's possible to do this with only one instruction with SSSE3 using the pshufb instruction (see below) from the x86 instruction set.
Maybe ARM Neon has something equivalent. If you want efficiency (and are sure that you need it) then learn the hardware.
The SSE equivalent of _MM_TRANSPOSE4_PS for bytes is to use _mm_shuffle_epi8 (the intrinsic for pshufb) with a mask. Define the mask outside of your main loop.
//use -mssse3 with GCC or /arch:SSE2 with MSVC
#include <stdio.h>
#include <tmmintrin.h> //SSSE3
int main() {
char x[] = {0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,14,15};
__m128i mask = _mm_setr_epi8(0x0,0x04,0x08,0x0c, 0x01,0x05,0x09,0x0d, 0x02,0x06,0x0a,0x0e, 0x03,0x07,0x0b,0x0f);
__m128i v = _mm_loadu_si128((__m128i*)x);
v = _mm_shuffle_epi8(v,mask);
_mm_storeu_si128((__m128i*)x,v);
for(int i=0; i<16; i++) printf("%d ", x[i]); printf("\n");
//output: 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15
}
Let me rephrase your question: you're asking for a C- or C++-only solution that is portable. Then:
void transpose(uint32_t const in[4], uint32_t out[4]) {
// A B C D A E I M
// E F G H B F J N
// I J K L C G K O
// M N O P D H L P
out[0] = in[0] & 0xFF000000U; // A . . .
out[1] = in[1] & 0x00FF0000U; // . F . .
out[2] = in[2] & 0x0000FF00U; // . . K .
out[3] = in[3] & 0x000000FFU; // . . . P
out[1] |= (in[0] << 8) & 0xFF000000U; // B F . .
out[2] |= (in[0] << 16) & 0xFF000000U; // C . K .
out[3] |= (in[0] << 24); // D . . P
out[0] |= (in[1] >> 8) & 0x00FF0000U; // A E . .
out[2] |= (in[1] << 8) & 0x00FF0000U; // C G K .
out[3] |= (in[1] << 16) & 0x00FF0000U; // D H . P
out[0] |= (in[2] >> 16) & 0x0000FF00U; // A E I .
out[1] |= (in[2] >> 8) & 0x0000FF00U; // B F J .
out[3] |= (in[2] << 8) & 0x0000FF00U; // D H L P
out[0] |= (in[3] >> 24); // A E I M
out[1] |= (in[3] >> 16) & 0x000000FFU; // B F J N
out[2] |= (in[3] >> 8) & 0x000000FFU; // C G K O
}
I don't see how it could be answered any other way, since then you'd be depending on a particular compiler compiling it in a particular way, etc.
Of course if those manipulations themselves can be somehow simplified, it'd help. So that's the only avenue of further pursuit here. Nothing stands out so far, but then it's been a long day for me.
So far, the cost is 12 shifts, 12 ORs and 14 ANDs. If the compiler and platform are any good, it can be done in 9 32-bit registers.
If the compiler is very sad, or the platform doesn't have a barrel shifter, then some casting could help expose the fact that the shifts and masks are just byte extractions:
void transpose(uint8_t const in[16], uint8_t out[16]) {
// A B C D A E I M
// E F G H B F J N
// I J K L C G K O
// M N O P D H L P
out[0] = in[0]; // A . . .
out[1] = in[4]; // A E . .
out[2] = in[8]; // A E I .
out[3] = in[12]; // A E I M
out[4] = in[1]; // B . . .
out[5] = in[5]; // B F . .
out[6] = in[9]; // B F J .
out[7] = in[13]; // B F J N
out[8] = in[2]; // C . . .
out[9] = in[6]; // C G . .
out[10] = in[10]; // C G K .
out[11] = in[14]; // C G K O
out[12] = in[3]; // D . . .
out[13] = in[7]; // D H . .
out[14] = in[11]; // D H L .
out[15] = in[15]; // D H L P
}
If you really want to shuffle it in-place, then the following would do.
void transpose(uint8_t m[16]) {
std::swap(m[1], m[4]);
std::swap(m[2], m[8]);
std::swap(m[3], m[12]);
std::swap(m[6], m[9]);
std::swap(m[7], m[13]);
std::swap(m[11], m[14]);
}
The byte-oriented versions may well produce worse code on modern platforms. Only a benchmark can tell.
Not sure about the speed but these are okay.
template<typename T, std::size_t Size>
void Transpose(T (&Data)[Size][Size])
{
for (int I = 0; I < Size; ++I)
{
for (int J = 0; J < I; ++J)
{
std::swap(Data[I][J], Data[J][I]);
}
}
}
// Size cannot be deduced from Size * Size, so call this overload as Transpose<T, 4>(data).
template<typename T, std::size_t Size>
void Transpose(T (&Data)[Size * Size])
{
for (int I = 0; I < Size; ++I)
{
for (int J = 0; J < I; ++J)
{
std::swap(Data[I * Size + J], Data[J * Size + I]);
}
}
}
An efficient solution is possible on a 64-bit machine, if you accept that.
First shift the 32-bit integers by (0,) 1, 2 and 3 bytes respectively [3 shifts]. Then mask out the unwanted bits and perform logical ORs [12 ANDs with a constant, 12 ORs]. Finally, shift back to 32 bits [3 shifts] and read out the 32 bits.
ABCD
EFGH
IJKL
MNOP
ABCD
 EFGH
  IJKL
   MNOP
A---
 E---
  I---
   MNOP
=======
AEIMNOP
AEIM
AB--
 -F--
  -J--
   -NOP
=======
ABFJNOP
 BFJN
ABC-
 --G-
  --K-
   --OP
=======
ABCGKOP
  CGKO
ABCD
 ---H
  ---L
   ---P
=======
ABCDHLP
   DHLP
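The staggered layout in the diagram can be sketched in C as below. This is a simplified illustration, not the exact operation count from the answer: it masks every byte individually (16 ANDs) instead of keeping the harmless extra bytes. The function name transpose4x4_u64 is hypothetical, and each row is assumed to be a 32-bit integer with its first byte (A, E, I, M) in the most significant position.

```c
#include <stdint.h>

/* Hypothetical helper: transpose a 4x4 byte matrix whose rows are 32-bit integers. */
static void transpose4x4_u64(const uint32_t in[4], uint32_t out[4])
{
    uint64_t s[4];

    /* Row j is placed j bytes further to the right inside a 64-bit word. */
    for (int j = 0; j < 4; j++)
        s[j] = (uint64_t)in[j] << (32 - 8 * j);

    for (int k = 0; k < 4; k++) {
        uint64_t acc = 0;
        for (int j = 0; j < 4; j++) {
            /* Column k of row j now sits at byte position k + j, counted from the top. */
            uint64_t byte_mask = 0xFFULL << (56 - 8 * (k + j));
            acc |= s[j] & byte_mask;
        }
        /* The four collected bytes are adjacent; shift them back down to 32 bits. */
        out[k] = (uint32_t)(acc >> (32 - 8 * k));
    }
}
```

With rows "ABCD", "EFGH", "IJKL", "MNOP" given as ASCII bytes, out[0] holds "AEIM" and out[3] holds "DHLP". A compiler is free to unroll both loops since all shift counts are constants.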
I posted an answer for this same problem a while back for SSE here.
The only things that need to be added are vectorized load/store operations.
This answer is similar to Z boson's answer to this question. Examples for load/store can be seen there. This answer differs because in addition to the SSSE3 implementation there is an SSE2 implementation which is guaranteed to run on any x64 processor.
It's worth noting that both of these solutions assume that the entire matrix is row major in memory, but OP's question states that each row could have its own pointer, which implies the array could be fragmented.

Bitwise operations based on two numbers

I got an assignment today at my faculty (Mathematics Faculty of Belgrade, Serbia) which says:
1) Write a program that for two given integers x and y, inverts in integer x those bits that match the corresponding bits in y, while the rest of the bits remain the same.
For example:
x = 1001110110101
y = 1100010100011
x' = 0011101011100
I managed to write a program that does that, but I am a little insecure about the quality of my solution. Please, if you have time, check out the code and tell me how I could improve it.
int x, y, bitnum;
int z = 0;
unsigned int mask;
bitnum = sizeof(int) * 8;
mask = 1 << bitnum - 1;
printf("Unesi x i y: ");
scanf("%d%d", &x, &y);
while (mask > 1) {
if ( (((x & mask) == 0) && ((y & mask) == 0)) ||
((x & mask) && ((y & mask) == 0)) )
z += 1;
z <<= 1;
mask >>= 1;
} /* <-- THAT'S HOW STUPID PEOPLE SOLVE PROBLEMS... WITH HAMMER! */
z = ~y; /* <-- THAT'S HOW SMART PEOPLE SOLVE PROBLEMS... WITH ONE LINE */
Everything works correctly, for x = 423 and y = 324 for example, I get z = -344, which is correct. Also, bit prints match. I would just like to know if there is a better way to do this.
Thanks.
If you take a look at your x/y/x' example, it must strike you that x' is a complement to y. And indeed it's like that.
x y x'
--------
1 1 0
0 0 1
1 0 1
0 1 0
Spoiler:
For bits that match, you invert the bit in x, but as it is the same as the bit in y, that's the same as inverting the bit in y. When they do not match, you keep the bit from x, which is already the inverse of the bit in y. I hope you can see the one-line solution already yourself: x' = ~y;
//Try with the next code:
unsigned int mask1, mask2, mask3, answ;
mask1 = x & y; // identify bits with 1 that match
mask2 = ~x & ~y; // identify bits with 0 that match
mask3 = mask1 | mask2; // identify bits with 0 or 1 that match
answ = x ^ mask3; // Change identified bits
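Both answers can be checked against each other with a small sketch: the mask of matching bits is ~(x ^ y), and flipping those bits in x gives the same value as ~y. The function name invert_matching is hypothetical, and unsigned values are used to keep the bit operations well defined.

```c
/* Invert in x exactly those bits that match the corresponding bits of y. */
static unsigned invert_matching(unsigned x, unsigned y)
{
    unsigned same = ~(x ^ y);   /* 1 wherever the bits of x and y agree */
    return x ^ same;            /* flip those bits of x; equals ~y */
}
```

For the 13-bit example from the question, x = 1001110110101 and y = 1100010100011 give 0011101011100 in the low 13 bits.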