Fast way to "down-scale" a three-dimensional tensor index - c++

This is a bit twiddling question for C or C++. I am running GCC 4.6.3 under Ubuntu 12.04.2.
I have a memory access index p for a three-dimensional tensor which has the form:
p = (i<<(2*N)) + (j<<N) + k
Here 0 <= i,j,k < (1<<N) and N is some positive integer.
Now I want to compute a "down-scaled" memory access index for i>>S, j>>S, k>>S with 0 < S < N, which would be:
q = ((i>>S)<<(2*(N-S))) + ((j>>S)<<(N-S)) + (k>>S)
What is the fastest way to compute q from p (without knowing i, j, k beforehand)? We can assume that 0 < N <= 10 (i.e. p is a 32-bit integer). I would be especially interested in a fast approach for N = 8 (i.e. i, j, k are 8-bit integers). N and S are both compile-time constants.
An example for N=8 and S=4:
unsigned int p = 240407; // this is (3<<16) + (171<<8) + 23;
unsigned int q = 161; // this is (0<<8) + (10<<4) + 1

Straightforward way, 8 operations (others are operations on constants):
M = (1<<(N-S)) - 1; // A mask with the N-S lowest bits set.
q = ( ((p & (M<<(2*N+S))) >> (3*S))  // Mask the top bits of 'i', shift to new position.
    + ((p & (M<<(  N+S))) >> (2*S))  // Likewise for 'j'.
    + ((p & (M<<      S)) >>    S)); // Likewise for 'k'.
Looks complicated, but really isn't, just not easy (to me at least) to get all the constants correct.
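For concreteness, here is a minimal compilable check of the straightforward formula against the example above (N = 8, S = 4; the values are the ones from the question):

#include <cassert>
#include <cstdint>

int main()
{
    const unsigned N = 8, S = 4;
    const uint32_t M = (1u << (N-S)) - 1;      // mask with the N-S lowest bits set
    uint32_t p = 240407;                       // (3<<16) + (171<<8) + 23
    uint32_t q = ((p & (M << (2*N+S))) >> (3*S))
               + ((p & (M << (  N+S))) >> (2*S))
               + ((p & (M <<      S))  >>    S);
    assert(q == 161);                          // (0<<8) + (10<<4) + 1
}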
To create a formula with fewer operations, we observe that shifting a number left by U bits is the same as multiplying by 1<<U. Thus, by the distributivity of multiplication, multiplying by ((1<<U1) + (1<<U2) + ...) is the same as shifting left by U1, U2, ... and adding everything together.
So, we could try to mask the needed portions of i, j and k, "shift" them all to the correct positions relative to each other with one multiplication, and then shift the result to the right, to its final destination. This gives us three operations to compute q from p.
Unfortunately, there are limitations, especially in the case where we try to get all three at once. When we add numbers together (indirectly, by adding together several multiplier terms), we have to make sure that each bit can be set in only one of the numbers, or else we'll get a wrong result. If we try to add (indirectly) three properly shifted numbers at once, we have this:
iiiii.....jjjjj.....kkkkk.....
[N-S][ S ][N-S][ S ][N-S]
     jjjjj.....kkkkk..........
[N-S][N-S][ S ][N-S]
          kkkkk...............
[N-S][N-S][N-S]
Note that farther to the left in the second and third numbers there are bits of i and j, but we ignore them. To do this, we assume that multiplication works as on x86: multiplying two values of type T gives a result of type T, keeping only the lowest bits of the full result (equal to the full result if there is no overflow).
So, to make sure that the k bits from the third number do not overlap with the j bits from the first, we need 3*(N-S) <= N, i.e. S >= 2*N/3, which for N = 8 limits us to S >= 6 (just one or two bits per component after shifting; I don't know if you'd ever use precision that low).
However, if S >= 2*N/3, we can use just 3 operations:
// Constant multiplier to perform three shifts at once.
F = (1<<(32-3*N)) + (1<<(32-3*N+S)) + (1<<(32-3*N+2*S));
// Mask, shift/combine with multiplier, right shift to destination.
q = (((p & ((M<<(2*N+S)) + (M<<(N+S)) + (M<<S))) * F)
>> (32-3*(N-S)));
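As a sanity check, here is a small compilable test of the three-operation formula for N = 8 and S = 6, which satisfies S >= 2*N/3 (the test values i, j, k below are my own):

#include <cassert>
#include <cstdint>

int main()
{
    const unsigned N = 8, S = 6;
    const uint32_t M = (1u << (N-S)) - 1;
    const uint32_t F = (1u << (32-3*N)) + (1u << (32-3*N+S)) + (1u << (32-3*N+2*S));
    uint32_t i = 192, j = 64, k = 128;
    uint32_t p = (i << (2*N)) + (j << N) + k;
    uint32_t q = ((p & ((M << (2*N+S)) + (M << (N+S)) + (M << S))) * F)
                 >> (32 - 3*(N-S));
    assert(q == ((i>>S) << (2*(N-S))) + ((j>>S) << (N-S)) + (k>>S)); // 54
}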
If the constraint for S is too strict (which it probably is), we can combine the first and second formula: compute i and k with the second approach, then add j from the first formula. Here we need that bits don't overlap in the following numbers:
iiiii...............kkkkk.......
[N-S][ S ][N-S][ S ][N-S]
          kkkkk.................
[N-S][N-S][N-S]
I.e. 3*(N-S) <= 2*N, which gives S >= N/3, or, for N = 8, the much less strict S >= 3. The formula is as follows:
// Constant multiplier to perform two shifts at once.
F = (1<<(32-3*N)) + (1<<(32-3*N+2*S));
// Mask, shift/combine with multiplier, right shift to destination
// and then add 'j' from the straightforward formula.
q = ((((p & ((M<<(2*N+S)) + (M<<S))) * F) >> (32-3*(N-S)))
+ ((p & (M<<(N+S))) >> (2*S)));
This formula also works for your example where S = 4.
Whether this is faster than the straightforward approach depends on the architecture. Also, unsigned arithmetic in C++ is defined to wrap modulo 2^32 for 32-bit types, which is exactly the multiplication overflow behavior assumed above, so make sure the values are unsigned and exactly 32 bits wide (e.g. uint32_t) for the formulas to work.
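And once more for the combined formula, checked against the question's example (N = 8, S = 4, so S >= N/3 holds):

#include <cassert>
#include <cstdint>

int main()
{
    const unsigned N = 8, S = 4;
    const uint32_t M = (1u << (N-S)) - 1;
    const uint32_t F = (1u << (32-3*N)) + (1u << (32-3*N+2*S));
    uint32_t p = 240407;
    uint32_t q = (((p & ((M << (2*N+S)) + (M << S))) * F) >> (32 - 3*(N-S)))
               + ((p & (M << (N+S))) >> (2*S));
    assert(q == 161);
}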

If you don't care about portability, for N = 8 you can get i, j, k like this:
int p = ....
unsigned char *bytes = (unsigned char *)&p;
Now k is bytes[0], j is bytes[1] and i is bytes[2] (my machine is little-endian). But I think the better way is something like this (where N_MASK = 2^N - 1):
int q;
q = ( p & N_MASK ) >> S;                       // k component
p >>= N;
q |= ( ( p & N_MASK ) >> S ) << (N - S);       // j component
p >>= N;
q |= ( ( p & N_MASK ) >> S ) << (2*(N - S));   // i component
(Note the target shifts are multiples of N-S, not S; the two coincide only when S = N/2, as in the N = 8, S = 4 example.)

Does it meet your requirements?
#include <cstdint>
#include <iostream>

uint32_t to_q_from_p(uint32_t p, uint32_t N, uint32_t S)
{
    uint32_t mask = ~(~0u << N); // low N bits set (unsigned to avoid UB)
    uint32_t k = p & mask;
    uint32_t j = (p >> N) & mask;
    uint32_t i = (p >> 2*N) & mask;
    return ((i>>S)<<(2*(N-S))) + ((j>>S)<<(N-S)) + (k>>S);
}

int main()
{
    uint32_t p = 240407;
    uint32_t q = to_q_from_p(p, 8, 4);
    std::cout << q << '\n';
}
If you assume that N is always 8 and integers are little-endian, then it can be
uint32_t to_q_from_p(uint32_t p, uint32_t S)
{
    auto ptr = reinterpret_cast<uint8_t*>(&p);
    return ((ptr[2]>>S)<<(2*(8-S))) + ((ptr[1]>>S)<<(8-S)) + (ptr[0]>>S);
}

Related

How does the shift operator work when finding the number of differing bits in two integers?

I was trying to find the number of differing bits in two numbers. I found a solution here but couldn't understand how it works: it right-shifts by i and then ANDs with 1. What is actually happening behind this? And why does it loop through 32?
#include <iostream>
using namespace std;

void solve(int A, int B)
{
    int count = 0;
    // since the numbers are less than 2^31,
    // run the loop from '0' to '31' only
    for (int i = 0; i < 32; i++) {
        // right shift both the numbers by 'i' and
        // check if the bit at the 0th position is different
        if (((A >> i) & 1) != ((B >> i) & 1)) {
            count++;
        }
    }
    cout << "Number of different bits : " << count << endl;
}
The loop runs from 0 up to and including 31 (not through 32) because these are all of the possible bits that comprise a 32-bit integer and we need to check them all.
Inside the loop, the code
    if (((A >> i) & 1) != ((B >> i) & 1)) {
        count++;
    }
works by shifting each of the two integers rightward by i (discarding the low bits when i > 0), extracting the rightmost bit after the shift (& 1), and checking whether the two bits differ; if they do, the count is incremented.
Let's walk through an example: solve(243, 2182). In binary:
 243      =     11110011
2182      = 100010000110
diff bits = ^    ^^^ ^ ^

int bits  = 00000000000000000000000000000000
        i = 31                             0
                               <-- loop direction
The indices of i that yield differences are 0, 2, 4, 5, 6 and 11 (we check from right to left; in the first iteration, i = 0 and nothing gets shifted, so & 1 gives us the rightmost bit, and so on). The padding to the left of each number is all 0s in the example above.
Also, note that there are better ways to do this without a loop: take the XOR of the two numbers and run a popcount on the result (count the bits that are set):
__builtin_popcount(243 ^ 2182); // => 6
Or, more portably:
std::bitset<CHAR_BIT * sizeof(int)>(243 ^ 2182).count()
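Both loop-free alternatives in one small self-contained program (my own illustration; the expected output of 6 matches the example above):

#include <bitset>
#include <climits>
#include <iostream>

int main()
{
    std::cout << __builtin_popcount(243 ^ 2182) << '\n';                          // 6 (GCC/Clang builtin)
    std::cout << std::bitset<CHAR_BIT * sizeof(int)>(243 ^ 2182).count() << '\n'; // 6 (portable)
}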
Another note: best to avoid using namespace std;, return a value instead of producing a print side effect and give the method a clearer name than solve, for example bit_diff (I realize this is from geeksforgeeks).

Carry bits in incidents of overflow

/*
 * isLessOrEqual - if x <= y then return 1, else return 0
 * Example: isLessOrEqual(4,5) = 1.
 * Legal ops: ! ~ & ^ | + << >>
 * Max ops: 24
 * Rating: 3
 */
int isLessOrEqual(int x, int y)
{
    int msbX = x>>31;
    int msbY = y>>31;
    int sum_xy = (y+(~x+1));
    int twoPosAndNegative = (!msbX & !msbY) & sum_xy; // isLessOrEqual is FALSE.
    // if = true, twoPosAndNegative = 1; Overflow true
    // twoPos = Negative means y < x which means that this
    int twoNegAndPositive = (msbX & msbY) & !sum_xy; // isLessOrEqual is FALSE
    // We started with two negative numbers, and subtracted X, resulting in positive. Therefore, x is bigger.
    int isEqual = (!x^!y); // isLessOrEqual is TRUE
    return (twoPosAndNegative | twoNegAndPositive | isEqual);
}
Currently, I am trying to work through how to carry bits in this operator.
The purpose of this function is to identify whether or not int y >= int x.
This is part of a class assignment, so there are restrictions on casting and which operators I can use.
I'm trying to account for a carried bit by applying a mask of the complement of the MSB, to try and remove the most significant bit from the equation, so that they may overflow without causing an issue.
I am under the impression that, ignoring cases of overflow, the returned expression would work.
EDIT: Here is my adjusted code, still not working. But, I think this is progress? I feel like I'm chasing my own tail.
int isLessOrEqual(int x, int y)
{
    int msbX = x >> 31;
    int msbY = y >> 31;
    int sign_xy_sum = (y + (~x + 1)) >> 31;
    return ((!msbY & msbX) | (!sign_xy_sum & (!msbY | msbX)));
}
I figured it out with the assistance of one of my peers, alongside the commentators here on StackOverflow.
The solution is as seen above.
The asker has self-answered their question (a class assignment), so providing alternative solutions seems appropriate at this time. The question clearly assumes that integers are represented as two's complement numbers.
One approach is to consider how CPUs compute predicates for conditional branching by means of a compare instruction. "signed less than" as expressed in processor condition codes is SF ≠ OF. SF is the sign flag, a copy of the sign-bit, or most significant bit (MSB) of the result. OF is the overflow flag which indicates overflow in signed integer operations. This is computed as the XOR of the carry-in and the carry-out of the sign-bit or MSB. With two's complement arithmetic, a - b = a + ~b + 1, and therefore a < b = a + ~b < 0. It remains to separate computation on the sign bit (MSB) sufficiently from the lower order bits. This leads to the following code:
#include <climits> // for CHAR_BIT

int isLessOrEqual (int a, int b)
{
    int nb = ~b;
    int ma = a & ((1U << (sizeof(a) * CHAR_BIT - 1)) - 1);
    int mb = nb & ((1U << (sizeof(b) * CHAR_BIT - 1)) - 1);
    // for the following, only the MSB is of interest, other bits are don't care
    int cyin = ma + mb;
    int ovfl = (a ^ cyin) & (a ^ b);
    int sign = (a ^ nb ^ cyin);
    int lteq = sign ^ ovfl;
    // desired predicate is now in the MSB (sign bit) of lteq, extract it
    return (int)((unsigned int)lteq >> (sizeof(lteq) * CHAR_BIT - 1));
}
The casting to unsigned int prior to the final right shift is necessary because right-shifting signed integers with negative values is implementation-defined, per the ISO-C++ standard, section 5.8. The asker has pointed out that casts are not allowed. When right shifting signed integers, C++ compilers will generate either a logical right shift instruction or an arithmetic right shift instruction. As we are only interested in extracting the MSB, we can isolate ourselves from the choice by shifting and then masking out all bits besides the LSB, at the cost of one additional operation:
return (lteq >> (sizeof(lteq) * CHAR_BIT - 1)) & 1;
The above solution requires a total of eleven or twelve basic operations. A significantly more efficient solution is based on the 1972 MIT HAKMEM memo, which contains the following observation:
ITEM 23 (Schroeppel): (A AND B) + (A OR B) = A + B = (A XOR B) + 2 (A AND B).
This is straightforward, as A AND B represent the carry bits, and A XOR B represent the sum bits. In a newsgroup posting to comp.arch.arithmetic on February 11, 2000, Peter L. Montgomery provided the following extension:
If XOR is available, then this can be used to average
two unsigned variables A and B when the sum might overflow:
(A+B)/2 = (A AND B) + (A XOR B)/2
In the context of this question, this allows us to compute (a + ~b) / 2 without overflow, then inspect the sign bit to see if the result is less than zero. While Montgomery only referred to unsigned integers, the extension to signed integers is straightforward by use of an arithmetic right shift, keeping in mind that right shifting is an integer division which rounds towards negative infinity, rather than towards zero as regular integer division.
int isLessOrEqual (int a, int b)
{
    int nb = ~b;
    // compute avg(a,~b) without overflow, rounding towards -INF; lteq(a,b) = SF
    int lteq = (a & nb) + arithmetic_right_shift (a ^ nb, 1);
    return (int)((unsigned int)lteq >> (sizeof(lteq) * CHAR_BIT - 1));
}
Unfortunately, C++ itself provides no portable way to code an arithmetic right shift, but we can emulate it fairly efficiently using this answer:
int arithmetic_right_shift (int a, int s)
{
    unsigned int mask_msb = 1U << (sizeof(mask_msb) * CHAR_BIT - 1);
    unsigned int ua = a;
    ua = ua >> s;
    mask_msb = mask_msb >> s;
    return (int)((ua ^ mask_msb) - mask_msb);
}
When inlined, this adds just a couple of instructions to the code when the shift count is a compile-time constant. If the compiler documentation indicates that the implementation-defined handling of signed integers of negative value is accomplished via arithmetic right shift instruction, it is safe to simplify to this six-operation solution:
int isLessOrEqual (int a, int b)
{
    int nb = ~b;
    // compute avg(a,~b) without overflow, rounding towards -INF; lteq(a,b) = SF
    int lteq = (a & nb) + ((a ^ nb) >> 1);
    return (int)((unsigned int)lteq >> (sizeof(lteq) * CHAR_BIT - 1));
}
The previously made comments regarding use of a cast when converting the sign bit into a predicate apply here as well.
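As a quick sanity check (my own addition, not part of the original answer), any of the versions above can be validated against the built-in operator on a set of boundary values:

#include <cassert>
#include <climits>

int isLessOrEqual(int a, int b); // any one of the implementations above

int main()
{
    const int samples[] = { INT_MIN, INT_MIN + 1, -2, -1, 0, 1, 2, INT_MAX - 1, INT_MAX };
    for (int a : samples)
        for (int b : samples)
            assert(isLessOrEqual(a, b) == (a <= b));
}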

How would you transpose a binary matrix?

I have binary matrices in C++ that I represent with a vector of 8-bit values.
For example, the following matrix:
1 0 1 0 1 0 1
0 1 1 0 0 1 1
0 0 0 1 1 1 1
is represented as:
const uint8_t matrix[] = {
    0b01010101,
    0b00110011,
    0b00001111,
};
The reason I'm doing it this way is that computing the product of such a matrix and an 8-bit vector becomes really simple and efficient (just one bitwise AND and a parity computation per row), which is much better than calculating each bit individually.
I'm now looking for an efficient way to transpose such a matrix, but I haven't been able to figure out how to do it without having to manually calculate each bit.
Just to clarify, for the above example, I'd like to get the following result from the transposition:
const uint8_t transposed[] = {
    0b00000000,
    0b00000100,
    0b00000010,
    0b00000110,
    0b00000001,
    0b00000101,
    0b00000011,
    0b00000111,
};
};
NOTE: I would prefer an algorithm that can calculate this with arbitrary-sized matrices but am also interested in algorithms that can only handle certain sizes.
I've spent more time looking for a solution, and I've found some good ones.
The SSE2 way
On a modern x86 CPU, transposing a binary matrix can be done very efficiently with SSE2 instructions. Using such instructions it is possible to process a 16×8 matrix.
This solution is inspired by this blog post by mischasan and is vastly superior to every suggestion I've got so far to this question.
The idea is simple:
#include <emmintrin.h>
Pack 16 uint8_t variables into an __m128i
Use _mm_movemask_epi8 to get the MSBs of each byte, producing an uint16_t
Use _mm_slli_epi64 to shift the 128-bit register by one
Repeat until you've got all 8 uint16_ts
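A minimal sketch of those steps (my own code and naming, assuming 16 input rows of 8 bits each, with column index 7 stored in the MSB of every byte):

#include <emmintrin.h>
#include <stdint.h>

void transpose16x8(const uint8_t rows[16], uint16_t cols[8])
{
    __m128i x = _mm_loadu_si128((const __m128i*)rows);
    for (int bit = 7; bit >= 0; --bit) {
        // _mm_movemask_epi8 gathers the MSB of each of the 16 bytes
        cols[bit] = (uint16_t)_mm_movemask_epi8(x);
        x = _mm_slli_epi64(x, 1); // bring the next column into the MSBs
    }
}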
A generic 32-bit solution
Unfortunately, I also need to make this work on ARM. After implementing the SSE2 version, it would be easy to just find the NEON equivalents, but the Cortex-M CPUs (contrary to the Cortex-A) do not have SIMD capabilities, so NEON isn't too useful for me at the moment.
NOTE: Because the Cortex-M doesn't have native 64-bit arithmetic, I could not use the ideas in any answers that suggest doing it by treating an 8x8 block as an uint64_t. Most microcontrollers that have a Cortex-M CPU also don't have much memory, so I prefer to do all this without a lookup table.
After some thinking, the same algorithm can be implemented using plain 32-bit arithmetic and some clever coding. This way, I can work with 4×8 blocks at a time. It was suggested by a colleague, and the magic lies in the way 32-bit multiplication works: you can find a 32-bit number to multiply with, and then the MSB of each byte ends up next to the others in the upper 32 bits of the result.
Pack 4 uint8_ts in a 32-bit variable
Mask the 1st bit of each byte (using 0x80808080)
Multiply it with 0x02040810
Take the 4 LSBs of the upper 32 bits of the multiplication
Generally, you can mask the Nth bit in each byte (shift the mask right by N bits) and multiply with the magic number, shifted left by N bits. The advantage here is that if your compiler is smart enough to unroll the loop, both the mask and the 'magic number' become compile-time constants so shifting them does not incur any performance penalty whatsoever. There's some trouble with the last series of 4 bits, because then one LSB is lost, so in that case I needed to shift the input left by 8 bits and use the same method as the first series of 4-bits.
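A sketch of the basic step for the first series (my own code; the mask and magic constant are the ones given above, and the 64-bit intermediate stands in for the 32x32->64 multiply, e.g. UMULL on a Cortex-M3/M4):

#include <stdint.h>

// Gather bit 7 of each of the four packed bytes into the low 4 bits
// of the result (byte 0's bit ends up in the LSB).
uint32_t msb_column(uint32_t x)
{
    uint64_t t = (uint64_t)(x & 0x80808080u) * 0x02040810u;
    return (uint32_t)(t >> 32) & 0xFu;
}
// For bit n of each byte, mask with 0x80808080 >> n and multiply by
// 0x02040810 << n, as described above.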
If you do this with two 4×8 blocks, then you can get an 8x8 block done and arrange the resulting bits so that everything goes into the right place.
My suggestion is that you don't do the transposition; rather, you add one bit of information to your matrix data indicating whether the matrix is transposed or not.
Now, if you want to multiply a transposed matrix with a vector, it will be the same as multiplying the vector by the matrix on the left (and then transposing the result). This is easy: just some XOR operations on your 8-bit numbers.
This however makes some other operations complicated (e.g. adding two matrices). But in the comments you say that multiplication is exactly what you want to optimize.
Here is the text of Jay Foad's email to me regarding fast Boolean matrix transpose:
The heart of the Boolean transpose algorithm is a function I'll call transpose8x8 which transposes an 8x8 Boolean matrix packed in a 64-bit word (in row major order from MSB to LSB). To transpose any rectangular matrix whose width and height are multiples of 8, break it down into 8x8 blocks, transpose each one individually and store them at the appropriate place in the output. To load an 8x8 block you have to load 8 individual bytes and shift and OR them into a 64-bit word. Same kinda thing for storing.
A plain C implementation of transpose8x8 relies on the fact that all the bits on any diagonal line parallel to the leading diagonal move the same distance up/down and left/right. For example, all the bits just above the leading diagonal have to move one place left and one place down, i.e. 7 bits to the right in the packed 64-bit word. This leads to an algorithm like this:
transpose8x8(word) {
    return
          (word & 0x0100000000000000) >> 49 // top right corner
        | (word & 0x0201000000000000) >> 42
        | ...
        | (word & 0x4020100804020100) >> 7  // just above diagonal
        | (word & 0x8040201008040201)       // leading diagonal
        | (word & 0x0080402010080402) << 7  // just below diagonal
        | ...
        | (word & 0x0000000000008040) << 42
        | (word & 0x0000000000000080) << 49; // bottom left corner
}
This runs about 10x faster than the previous implementation, which copied each bit individually from the source byte in memory and merged it into the destination byte in memory.
Alternatively, if you have PDEP and PEXT instructions you can implement a perfect shuffle, and use that to do the transpose as mentioned in Hacker's Delight. This is significantly faster (but I don't have timings handy):
shuffle(word) {
    return pdep(word >> 32, 0xaaaaaaaaaaaaaaaa) | pdep(word, 0x5555555555555555);
} // outer perfect shuffle

transpose8x8(word) { return shuffle(shuffle(shuffle(word))); }
POWER's vgbbd instruction effectively implements the whole of transpose8x8 in a single instruction (and since it's a 128-bit vector instruction it does it twice, independently, on the low 64 bits and the high 64 bits). This gave about 15% speed-up over the plain C implementation. (Only 15% because, although the bit twiddling is much faster, the overall run time is now dominated by the time it takes to load 8 bytes and assemble them into the argument to transpose8x8, and to take the result and store it as 8 separate bytes.)
My suggestion would be to use a lookup table to speed up the processing.
Another thing to note is with the current definition of your matrix the maximum size will be 8x8 bits. This fits into a uint64_t so we can use this to our advantage especially when using a 64-bit platform.
I have worked out a simple example using a lookup table, which you can find below and run in the http://www.tutorialspoint.com/compile_cpp11_online.php online compiler.
Example code
#include <iostream>
#include <bitset>
#include <stdint.h>
#include <assert.h>

using std::cout;
using std::endl;
using std::bitset;

/* Static lookup table */
static uint64_t lut[256];

/* Helper function to print array */
template<int N>
void print_arr(const uint8_t (&arr)[N]){
    for(int i=0; i < N; ++i){
        cout << bitset<8>(arr[i]) << endl;
    }
}

/* Transpose function */
template<int N>
void transpose_bitmatrix(const uint8_t (&matrix)[N], uint8_t (&transposed)[8]){
    assert(N <= 8);
    uint64_t value = 0;
    for(int i=0; i < N; ++i){
        value = (value << 1) + lut[matrix[i]];
    }
    /* Ensure safe copy to prevent misalignment issues */
    /* Can be removed if input array can be treated as uint64_t directly */
    for(int i=0; i < 8; ++i){
        transposed[i] = (value >> (i * 8)) & 0xFF;
    }
}

/* Calculate lookup table */
void calculate_lut(void){
    /* For all byte values */
    for(uint64_t i = 0; i < 256; ++i){
        auto b = std::bitset<8>(i);
        auto v = std::bitset<64>(0);
        /* For all bits in current byte */
        for(int bit=0; bit < 8; ++bit){
            if(b.test(bit)){
                v.set((7 - bit) * 8);
            }
        }
        lut[i] = v.to_ullong();
    }
}

int main()
{
    calculate_lut();
    const uint8_t matrix[] = {
        0b01010101,
        0b00110011,
        0b00001111,
    };
    uint8_t transposed[8];
    transpose_bitmatrix(matrix, transposed);
    print_arr(transposed);
    return 0;
}
How it works
Your 3x8 matrix will be transposed to an 8x3 matrix, represented in an 8x8 array.
The issue is that you want to convert bits from your "horizontal" representation to a vertical one, divided over several bytes.
As I mentioned above, we can take advantage of the fact that the output (8x8) will always fit into a uint64_t. We will use this to our advantage because now we can use a uint64_t to write the 8-byte array, but we can also use it to add, xor, etc., because we can perform basic arithmetic operations on a 64-bit integer.
Each entry in your 3x8 matrix (the input) is 8 bits wide; to optimize processing, we first generate a 256-entry lookup table (one entry for each byte value). Each entry is a uint64_t and contains a rotated version of the bits.
example:
byte = 0b01001111 = 0x4F
lut[0x4F] = 0x0101010100000100 = (uint8_t[]){ 0, 1, 0, 0, 1, 1, 1, 1 }
Now for the calculation:
For the calculations we use the uint64_t, but keep in mind that under the hood it represents a uint8_t[8] array. We simply shift the current value (starting with 0), look up the next byte and add it to the current value.
The 'magic' here is that each byte of the uint64_t in the lookup table will either be 1 or 0, so it only sets the least significant bit (of each byte). Shifting the uint64_t will shift each byte, as long as we make sure we do not do this more than 8 times! This way we can do operations on each byte individually.
Issues
As someone noted in the comments: Transpose(Transpose(M)) != M, so if you need that, you need some additional work.
Performance can be improved by directly mapping uint64_ts instead of uint8_t[8] arrays, since that omits the "safe copy" guarding against alignment issues.
I have added a new answer instead of editing my original one to make this more visible (no comment rights, unfortunately).
In your own answer you add an additional requirement not present in the first one: it has to work on ARM Cortex-M.
I did come up with an alternative solution for ARM in my original answer but omitted it as it was not part of the question and seemed off topic (mostly because of the C++ tag).
ARM Specific solution Cortex-M:
Some or most Cortex-M3/M4 parts have a bit-banding region which can be used for exactly what you need: it expands bits into 32-bit fields, and this region can be used to perform atomic bit operations.
If you put your array in a bitbanded region it will have an 'exploded' mirror in the bitband region where you can just use move operations on the bits itself. If you make a loop the compiler will surely be able to unroll and optimize to just move operations.
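For illustration, here is a hedged sketch of the alias computation (specific to Cortex-M3/M4; the base addresses are the standard SRAM bit-band mapping, and the helper name is mine):

#include <stdint.h>

// Each bit of SRAM in 0x20000000..0x200FFFFF is mirrored as a full 32-bit
// word at 0x22000000 + byte_offset*32 + bit_number*4; reading or writing
// that word reads or writes the single bit atomically.
volatile uint32_t *bitband_alias(volatile void *addr, unsigned bit)
{
    uintptr_t byte_offset = (uintptr_t)addr - 0x20000000u;
    return (volatile uint32_t *)(0x22000000u + byte_offset * 32u + bit * 4u);
}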
If you really want to, you can even setup a DMA controller to process an entire batch of transpose operations with a bit of effort and offload it entirely from the cpu :)
Perhaps this might still help you.
This is a bit late, but I just stumbled across this interchange today.
If you look at Hacker's Delight, 2nd Edition, there are several algorithms for efficiently transposing Boolean arrays, starting on page 141.
They are quite efficient: a colleague of mine obtained a factor of about 10X speedup compared to naive coding, on an X86.
Here's what I posted on GitHub (mischasan/sse2/ssebmx.src):
Changing INP() and OUT() to use induction vars saves an IMUL each.
AVX256 does it twice as fast.
AVX512 is not an option, because there is no _mm512_movemask_epi8().
#include <stdint.h>
#include <emmintrin.h>

#define INP(x,y) inp[(x)*ncols/8 + (y)/8]
#define OUT(x,y) out[(y)*nrows/8 + (x)/8]

void ssebmx(char const *inp, char *out, int nrows, int ncols)
{
    int rr, cc, i, h;
    union { __m128i x; uint8_t b[16]; } tmp;

    // Do the main body in [16 x 8] blocks:
    for (rr = 0; rr <= nrows - 16; rr += 16)
        for (cc = 0; cc < ncols; cc += 8) {
            for (i = 0; i < 16; ++i)
                tmp.b[i] = INP(rr + i, cc);
            for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
                *(uint16_t*)&OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
        }
    if (rr == nrows) return;

    // The remainder is a row of [8 x 16]* [8 x 8]?
    // Do the [8 x 16] blocks:
    for (cc = 0; cc <= ncols - 16; cc += 16) {
        for (i = 8; i--;)
            tmp.b[i] = h = *(uint16_t const*)&INP(rr + i, cc),
            tmp.b[i + 8] = h >> 8;
        for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
            OUT(rr, cc + i) = h = _mm_movemask_epi8(tmp.x),
            OUT(rr, cc + i + 8) = h >> 8;
    }
    if (cc == ncols) return;

    // Do the remaining [8 x 8] block:
    for (i = 8; i--;)
        tmp.b[i] = INP(rr + i, cc);
    for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
        OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
}
HTH.
Inspired by Robert's answer, polynomial multiplication in ARM NEON can be utilised to scatter the bits:
inline poly8x16_t mull_lo(poly8x16_t a) {
    auto b = vget_low_p8(a);
    return vreinterpretq_p8_p16(vmull_p8(b,b));
}
inline poly8x16_t mull_hi(poly8x16_t a) {
    auto b = vget_high_p8(a);
    return vreinterpretq_p8_p16(vmull_p8(b,b));
}
auto a = mull_lo(word);
auto b = mull_lo(a), c = mull_hi(a);
auto d = mull_lo(b), e = mull_hi(b);
auto f = mull_lo(c), g = mull_hi(c);
Then the vsli can be used to combine the bits pairwise.
auto ab = vsli_p8(vget_high_p8(d), vget_low_p8(d), 1);
auto cd = vsli_p8(vget_high_p8(e), vget_low_p8(e), 1);
auto ef = vsli_p8(vget_high_p8(f), vget_low_p8(f), 1);
auto gh = vsli_p8(vget_high_p8(g), vget_low_p8(g), 1);
auto abcd = vsli_p8(ab, cd, 2);
auto efgh = vsli_p8(ef, gh, 2);
return vsli_p8(abcd, efgh, 4);
Clang optimizes this code to avoid the vmull2 instructions, instead making heavy use of ext q0,q0,8 for vget_high_p8.
An iterative approach would possibly be not only faster, but would also use fewer registers, and it also SIMDifies for 2x or more throughput.
// transpose bits in 2x2 blocks, first 4 rows
// x = a b|c d|e f|g h        a i|c k|e m|g o   | byte 0
//     i j|k l|m n|o p        b j|d l|f n|h p   | byte 1
//     q r|s t|u v|w x   ->   q A|s C|u E|w G   | byte 2
//     A B|C D|E F|G H        r B|t D|v F|x H   | byte 3 ...
// ----------------------
auto a = (x & 0x00aa00aa00aa00aaull);
auto b = (x & 0x5500550055005500ull);
auto c = (x & 0xaa55aa55aa55aa55ull) | (a << 7) | (b >> 7);
// transpose 2x2 blocks (first 4 rows shown)
// aa bb cc dd        aa ii cc kk
// ee ff gg hh   ->   ee mm gg oo
// ii jj kk ll        bb jj dd ll
// mm nn oo pp        ff nn hh pp
auto d = (c & 0x0000cccc0000ccccull);
auto e = (c & 0x3333000033330000ull);
auto f = (c & 0xcccc3333cccc3333ull) | (d << 14) | (e >> 14);
// Final transpose of 4x4 bit blocks
auto g = (f & 0x00000000f0f0f0f0ull);
auto h = (f & 0x0f0f0f0f00000000ull);
x = (f & 0xf0f0f0f00f0f0f0full) | (g << 28) | (h >> 28);
In ARM each step can now be composed with 3 instructions:
auto tmp = vrev16_u8(x);
tmp = vshl_u8(tmp, plus_minus_1); // 0xff01ff01ff01ff01ull
x = vbsl_u8(mask_1, x, tmp); // 0xaa55aa55aa55aa55ull
tmp = vrev32_u16(x);
tmp = vshl_u16(tmp, plus_minus_2); // 0xfefe0202fefe0202ull
x = vbsl_u8(mask_2, x, tmp); // 0xcccc3333cccc3333ull
tmp = vrev64_u32(x);
tmp = vshl_u32(tmp, plus_minus_4); // 0xfcfcfcfc04040404ull
x = vbsl_u8(mask_4, x, tmp); // 0xf0f0f0f00f0f0f0full

(n*2-1)%p: avoiding the use of 64 bits when n and p are 32 bits

Consider the following function:
inline unsigned int f(unsigned int n, unsigned int p)
{
    return (n*2-1)%p;
}
Now suppose that n (and p) are greater than std::numeric_limits<int>::max().
For example f(4294967295U, 4294967291U).
The mathematical result is 7 but the function will return 2, because n*2 will overflow.
Then the solution is simple: we just have to use 64-bit integers instead. Assuming that the declaration of the function has to stay the same:
inline unsigned int f(unsigned int n, unsigned int p)
{
    return (static_cast<unsigned long long int>(n)*2-1)%p;
}
Everything is fine. At least in principle. The problem is that this function will be called millions of times in my code (I mean the overflowing version), and a 64-bit modulus is way slower than the 32-bit version (see here for example).
The question is the following: is there any trick (mathematical or algorithmic) to avoid executing a 64-bit modulus operation? And what would a new version of f using this trick look like (keeping the same declaration)?
Note 1: n > 0
Note 2: p > 2
Note 3: n can be lower than p: n=4294967289U, p=4294967291U
Note 4: the fewer modulus operations used, the better (three 32-bit moduli are too many, two would be interesting, and one will surely outperform).
Note 5: of course the result will be processor dependent. Assume use on the latest supercomputers with the latest Xeon available.
We know that p is less than max, so n % p is less than max. They are both unsigned, which means that n % p is non-negative and smaller than p. Unsigned overflow is well-defined, so if n % p * 2 exceeds p, we can compute it as n % p - p + n % p, which will not overflow. Together it looks like this:
unsigned m = n % p;
unsigned r;
if (p - m < m) // m * 2 > p
    r = m - p + m;
else // m * 2 <= p
    r = m * 2;
// subtract 1, account for the fact that r can be 0
if (r == 0) r = p - 1;
else r = r - 1;
return r % p;
Note that you can avoid the last modulus, because we know that r doesn't exceed p * 2 (it is at most m * 2, and m doesn't exceed p), so the last line can be rewritten as
return r >= p ? r - p : r
Which brings the number of modulus operations to 1.
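Putting the pieces together, here is the whole thing as a single function (a sketch of the logic above, keeping the original declaration):

inline unsigned int f(unsigned int n, unsigned int p)
{
    unsigned int m = n % p;                    // the only modulus operation
    unsigned int r = (p - m < m) ? m - p + m   // m * 2 > p: reduce once
                                 : m * 2;      // m * 2 <= p: no overflow
    r = (r == 0) ? p - 1 : r - 1;              // subtract 1, wrapping 0 to p - 1
    return r >= p ? r - p : r;                 // conditional final reduction
}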
Even though I dislike dealing with AT&T syntax and GCC's "extended asm constraints", I think this works (it worked in my, admittedly limited, tests)
uint32_t f(uint32_t n, uint32_t p)
{
    uint32_t res;
    asm (
        "xorl %%edx, %%edx\n\t"
        "addl %%eax, %%eax\n\t"
        "adcl %%edx, %%edx\n\t"
        "subl $1, %%eax\n\t"
        "sbbl $0, %%edx\n\t"
        "divl %1"
        : "=d"(res)
        : "S"(p), "a"(n)
        :
    );
    return res;
}
The constraints may be unnecessarily strict or wrong, I don't know. It seemed to work.
The idea here is to do a regular 32bit division, which actually takes a 64bit dividend. It only works if the quotient will fit in 32 bits (otherwise overflow is signaled), which is always true under the circumstances (p at least 2, n not zero). The stuff before the division handles the times 2 (with overflow into edx, the "high half"), then the "subtract 1" with potential borrow. The "=d" output thing makes it take the remainder as result. "a"(n) puts n in eax (letting it choose an other register doesn't help, the division will take an input in edx:eax anyway). "S"(p) could probably be "r"(p) (seems to work) but I'm not sure enough to trust it.
FWIW, this version seems to avoid any overflows:
std::uint32_t f(std::uint32_t n, std::uint32_t p)
{
    auto m = n%p;
    if (m <= p/2) {
        return (m==0)*p+2*m-1;
    }
    return p-2*(p-m)-1;
}
Demo. The idea is that if an overflow would occur in 2*m-1, we can work with p-2*(p-m)-1, which avoids this by multiplying 2 with the modular additive inverse instead.

Shifting big numbers

X = 712360810625491574981234007851998 is represented using a linked list and each node is an unsigned int
Is there a fast way to do X << 8 or X << 591 other than computing X * 2^8 or X * 2^591?
Bit shifting is very easy across an arbitrary number of bits. Just remember to carry the overflowed bits into the next element. That's all.
Below is a left-shift-by-3 example
uint32_t i1, i2, i3, o1, o2, o3; // {o3, o2, o1} = {i3, i2, i1} << 3;
o3 = i3 << 3 | i2 >> (32 - 3);
o2 = i2 << 3 | i1 >> (32 - 3);
o1 = i1 << 3;
Similar for shifting right, just iterate in the reverse direction.
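For example, the mirrored right-shift-by-3 would look like this (same caveats as above):

o1 = i1 >> 3 | i2 << (32 - 3);
o2 = i2 >> 3 | i3 << (32 - 3);
o3 = i3 >> 3;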
Edit:
It seems that you're using base 10^9 for your large number, so binary shifting does not apply here. "Shifting" left/right N digits in base B is equivalent to multiplying the number by B^N or B^(-N) respectively. You can't do a binary shift in decimal and vice versa.
If you don't change your base then you have only one solution: multiplying the number by 2^591. If you want to shift like in binary you must change to a base that is a power of 2, like base 2^32 or base 2^64.
A general solution to shift would be like this, with the limbs ("digits", or each small word unit in a big integer, in arbitrary-precision arithmetic terms) stored in little-endian order and each digit in base 2^(CHAR_BIT*sizeof(T)):
#include <climits>
#include <cstddef>
#include <stdexcept>
#include <type_traits>
#include <vector>

template<typename T,
         class = typename std::enable_if<std::is_unsigned<T>::value>::type>
void rshift(std::vector<T>& x, std::size_t shf_amount) // x >>= shf_amount
{
    // The number of bits in each limb/digit
    constexpr std::size_t width = CHAR_BIT * sizeof(T);
    if (shf_amount >= width * x.size())
        throw std::out_of_range("shift amount"); // or zero out the whole vector for a saturating shift
    // Number of whole limbs to shift
    const std::size_t limbshift = shf_amount / width;
    // Number of bits to shift within each limb
    const std::size_t shift = shf_amount % width;
    std::size_t i = 0;
    // Shift the least significant limbs; shift == 0 must be special-cased,
    // because shifting by a full limb width (width - 0) is undefined behavior
    for (; i < x.size() - limbshift - 1; ++i)
        x[i] = (x[i + limbshift] >> shift) |
               (shift ? (x[i + 1 + limbshift] << (width - shift)) : T(0));
    x[i] = x[i + limbshift] >> shift;
    ++i;
    // Zero out the most significant limbs
    for (; i < x.size(); ++i)
        x[i] = 0;
}
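Example usage (my own, assuming the rshift above): the value 2^40 stored in two 32-bit limbs, shifted right by 33 bits:

#include <cstdint>
#include <iostream>

int main()
{
    std::vector<uint32_t> x = { 0u, 0x100u }; // little-endian limbs: 2^40
    rshift(x, 33);                            // 2^40 >> 33 = 2^7
    std::cout << x[0] << '\n';                // prints 128
}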
Moreover, from the tag you're likely using a linked list for storing the limbs, which is not cache-friendly because the elements are scattered all around the memory space, and which also wastes a lot of memory on the next pointers. Actually the memory used by pointers and allocation overhead is even larger than the memory for storing the data bits in this case. In fact, you shouldn't use linked lists in most real-life problems.
Bjarne Stroustrup says we must avoid linked lists:
Number crunching: Why you should never, ever, EVER use linked-list in your code again
Bjarne Stroustrup: Why you should avoid Linked Lists
Are lists evil?—Bjarne Stroustrup
I think, if you use 64-bit integers, the code should be
o3 = i3 << 3 | i2 >> (64 - 3);
...