Fastest way to transpose 4x4 byte matrix

I have a 4x4 block of bytes that I'd like to transpose using general purpose hardware. In other words, for bytes A-P, I'm looking for the most efficient (in terms of number of instructions) way to go from
A B C D
E F G H
I J K L
M N O P
to
A E I M
B F J N
C G K O
D H L P
We can assume that I have valid pointers pointing to A, E, I, and M in memory (such that reading 32-bits from A will get me the integer containing bytes ABCD).
This is not a duplicate of this question because of the restrictions on both size and data type. Each row of my matrix can fit into a 32-bit integer, and I'm looking for answers that can perform a transpose quickly using general purpose hardware, similar to the implementation of the SSE macro _MM_TRANSPOSE4_PS.

You want portability and efficiency. Well, you can't have it both ways. You said you want to do this with the fewest instructions. It's possible to do this with only one instruction with SSSE3, using the pshufb instruction (see below) from the x86 instruction set.
Maybe ARM Neon has something equivalent. If you want efficiency (and are sure that you need it) then learn the hardware.
The SSE equivalent of _MM_TRANSPOSE4_PS for bytes is to use _mm_shuffle_epi8 (the intrinsic for pshufb) with a mask. Define the mask outside of your main loop.
//use -mssse3 with GCC or /arch:SSE2 with MSVC
#include <stdio.h>
#include <tmmintrin.h> //SSSE3
int main() {
    char x[] = {0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,14,15};
    // byte i of the result is taken from byte mask[i] of the source
    __m128i mask = _mm_setr_epi8(0x0,0x04,0x08,0x0c, 0x01,0x05,0x09,0x0d, 0x02,0x06,0x0a,0x0e, 0x03,0x07,0x0b,0x0f);
    __m128i v = _mm_loadu_si128((__m128i*)x);
    v = _mm_shuffle_epi8(v,mask);
    _mm_storeu_si128((__m128i*)x,v);
    for(int i=0; i<16; i++) printf("%d ", x[i]);
    printf("\n");
    //output: 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15
}

Let me rephrase your question: you're asking for a C- or C++-only solution that is portable. Then:
void transpose(uint32_t const in[4], uint32_t out[4]) {
// A B C D A E I M
// E F G H B F J N
// I J K L C G K O
// M N O P D H L P
out[0] = in[0] & 0xFF000000U; // A . . .
out[1] = in[1] & 0x00FF0000U; // . F . .
out[2] = in[2] & 0x0000FF00U; // . . K .
out[3] = in[3] & 0x000000FFU; // . . . P
out[1] |= (in[0] << 8) & 0xFF000000U; // B F . .
out[2] |= (in[0] << 16) & 0xFF000000U; // C . K .
out[3] |= (in[0] << 24); // D . . P
out[0] |= (in[1] >> 8) & 0x00FF0000U; // A E . .
out[2] |= (in[1] << 8) & 0x00FF0000U; // C G K .
out[3] |= (in[1] << 16) & 0x00FF0000U; // D H . P
out[0] |= (in[2] >> 16) & 0x0000FF00U; // A E I .
out[1] |= (in[2] >> 8) & 0x0000FF00U; // B F J .
out[3] |= (in[2] << 8) & 0x0000FF00U; // D H L P
out[0] |= (in[3] >> 24); // A E I M
out[1] |= (in[3] >> 16) & 0x000000FFU; // B F J N
out[2] |= (in[3] >> 8) & 0x000000FFU; // C G K O
}
I don't see how it could be answered any other way, since then you'd be depending on a particular compiler compiling it in a particular way, etc.
Of course if those manipulations themselves can be somehow simplified, it'd help. So that's the only avenue of further pursuit here. Nothing stands out so far, but then it's been a long day for me.
So far, the cost is 12 shifts, 12 ORs, 14 ANDs. If the compiler and platform are any good, it can be done in 9 32-bit registers.
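The shift-and-mask code above treats A as the most significant byte of in[0], which is what a 32-bit load gives only on a big-endian host. The sketch below shows one way to keep that convention on any host when the rows sit behind four separate pointers, as in the question. The helper names load_be32, store_be32 and transpose_rows are made up for this illustration, and the inner loop is a compact restatement of the same byte moves, not a tuned implementation.

```cpp
#include <cstdint>

// Hypothetical helper: read four bytes so that p[0] ends up in the most
// significant byte, regardless of the host's byte order.
inline std::uint32_t load_be32(const std::uint8_t* p) {
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
}

// Hypothetical helper: inverse of load_be32.
inline void store_be32(std::uint8_t* p, std::uint32_t v) {
    p[0] = static_cast<std::uint8_t>(v >> 24);
    p[1] = static_cast<std::uint8_t>(v >> 16);
    p[2] = static_cast<std::uint8_t>(v >> 8);
    p[3] = static_cast<std::uint8_t>(v);
}

// Transposes, in place, the 4x4 block whose rows start at rows[0..3].
inline void transpose_rows(std::uint8_t* const rows[4]) {
    std::uint32_t in[4];
    std::uint32_t out[4] = {0, 0, 0, 0};
    for (int r = 0; r < 4; ++r) in[r] = load_be32(rows[r]);
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            // Byte c of input row r becomes byte r of output row c.
            out[c] |= ((in[r] >> (8 * (3 - c))) & 0xFFu) << (8 * (3 - r));
    for (int r = 0; r < 4; ++r) store_be32(rows[r], out[r]);
}
```

Compilers commonly turn such byte-wise loads into a single load plus a byte swap, but that is worth checking in the generated code for the target at hand.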
If the compiler is very sad, or the platform doesn't have a barrel shifter, then some casting could help make it explicit that the shifts and masks are just byte extractions:
void transpose(uint8_t const in[16], uint8_t out[16]) {
// A B C D A E I M
// E F G H B F J N
// I J K L C G K O
// M N O P D H L P
out[0] = in[0]; // A . . .
out[1] = in[4]; // A E . .
out[2] = in[8]; // A E I .
out[3] = in[12]; // A E I M
out[4] = in[1]; // B . . .
out[5] = in[5]; // B F . .
out[6] = in[9]; // B F J .
out[7] = in[13]; // B F J N
out[8] = in[2]; // C . . .
out[9] = in[6]; // C G . .
out[10] = in[10]; // C G K .
out[11] = in[14]; // C G K O
out[12] = in[3]; // D . . .
out[13] = in[7]; // D H . .
out[14] = in[11]; // D H L .
out[15] = in[15]; // D H L P
}
If you really want to shuffle it in-place, then the following would do.
void transpose(uint8_t m[16]) {
std::swap(m[1], m[4]);
std::swap(m[2], m[8]);
std::swap(m[3], m[12]);
std::swap(m[6], m[9]);
std::swap(m[7], m[13]);
std::swap(m[11], m[14]);
}
The byte-oriented versions may well produce worse code on modern platforms. Only a benchmark can tell.

Not sure about the speed, but these work (they need <cstddef> and <utility>).
template<typename T, std::size_t Size>
void Transpose(T (&Data)[Size][Size])
{
    for (std::size_t I = 0; I < Size; ++I)
    {
        for (std::size_t J = 0; J < I; ++J)
        {
            std::swap(Data[I][J], Data[J][I]);
        }
    }
}
// Size cannot be deduced from Size * Size, so pass it explicitly,
// e.g. Transpose<uint8_t, 4>(m) for uint8_t m[16].
template<typename T, std::size_t Size>
void Transpose(T (&Data)[Size * Size])
{
    for (std::size_t I = 0; I < Size; ++I)
    {
        for (std::size_t J = 0; J < I; ++J)
        {
            std::swap(Data[I * Size + J], Data[J * Size + I]);
        }
    }
}

An efficient solution is possible on a 64-bit machine, if you accept that.
First shift the 32-bit integers by 0, 1, 2 and 3 bytes respectively [3 shifts]. Then mask out the unwanted bits and perform logical ORs [12 ANDs with a constant, 12 ORs]. Finally, shift back to 32 bits [3 shifts] and read out the 32 bits.
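The four rows as loaded:
ABCD
EFGH
IJKL
MNOP
Shifted by 0, 1, 2 and 3 bytes:
ABCD
 EFGH
  IJKL
   MNOP
Row 1 of the result:
A---
 E---
  I---
   MNOP
=======
AEIMNOP
AEIM
Row 2 of the result:
AB--
 -F--
  -J--
   -NOP
=======
ABFJNOP
 BFJN
Row 3 of the result:
ABC-
 --G-
  --K-
   --OP
=======
ABCGKOP
  CGKO
Row 4 of the result:
ABCD
 ---H
  ---L
   ---P
=======
ABCDHLP
   DHLP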
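A minimal sketch of this staircase idea, assuming each row is held in a 32-bit word with its first byte (A, E, I, M) in the most significant position. The function name transpose64 is made up here, and for simplicity the sketch masks exactly one byte per row instead of the staircase-shaped masks of the diagram; the final shift and truncation to 32 bits are the same.

```cpp
#include <cstdint>

// Hypothetical helper: in[r] holds row r with its first byte in the top byte.
void transpose64(const std::uint32_t in[4], std::uint32_t out[4]) {
    // Staircase: row r starts r bytes further to the right in a 7-byte field.
    const std::uint64_t s[4] = {
        static_cast<std::uint64_t>(in[0]) << 24,
        static_cast<std::uint64_t>(in[1]) << 16,
        static_cast<std::uint64_t>(in[2]) << 8,
        static_cast<std::uint64_t>(in[3])
    };
    for (int k = 0; k < 4; ++k) {
        std::uint64_t acc = 0;
        for (int r = 0; r < 4; ++r) {
            // Byte index (counted from the right) of column k of row r.
            const int pos = 6 - (k + r);
            acc |= s[r] & (static_cast<std::uint64_t>(0xFF) << (8 * pos));
        }
        // The four selected bytes are now adjacent; shift them down to 32 bits.
        out[k] = static_cast<std::uint32_t>(acc >> (8 * (3 - k)));
    }
}
```

With constant masks and the loops unrolled, each output row is a handful of ANDs and ORs on 64-bit registers followed by one shift.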

I posted an answer to this same problem for SSE a while back here.
The only things that need to be added are vectorized load/store operations.
This answer is similar to Z boson's answer to this question; examples of the load/store can be seen there. It differs in that, in addition to the SSSE3 implementation, there is an SSE2 implementation, which is guaranteed to run on any x64 processor.
It's worth noting that both of these solutions assume that the entire matrix is stored row-major in contiguous memory, but the OP's question states that each row could have its own pointer, which implies the array could be fragmented.

Related

Shift matrix columns/rows in all directions c++

I have a 2d vector that looks like this:
a b c
d f g // actual size is 32 x 32
h i j
And I want to shift the rows/columns:
d f g            h i j              b c a              c a b
h i j <-- up     a b c <-- down     f g d <-- left     g d f <-- right
a b c            d f g              i j h              j h i
In python I can accomplish all of those in nifty one liners, such as matrix = [row[-1:] + row[:-1] for row in matrix] to move the columns to the right. But, c++ doesn't use the handy : in list indexes or negative indexes. Well, not negative indexes like in python at least.
Anyway, I'm looking for a good way to do this. I've seen lots of other SO questions about swapping rows, or rotating, but none I have seen have solved my problem.
Here's my first take on moving columns to the right:
vector<vector<string>> matrix{{"a","b","c"}, {"d","e","f"}, {"g","h","i"}};
for (int i = 0; i < matrix.size(); i++)
{
vector<string> col{matrix[i][2], matrix[i][0], matrix[i][1]};
matrix[i] = col;
}
This works, but will be very long once I write all 32 indexes. I was hoping someone could point me to something shorter and more flexible. Thanks!
EDIT: For future viewers (taken from G. Sliepen's answer) :
vector<vector<string>> matrix{{"a","b","c"}, {"d","e","f"}, {"g","h","i"}}; // or can be ints
rotate(matrix.begin(), matrix.begin() + 1, matrix.end()); // move rows up
rotate(matrix.begin(), matrix.begin() + matrix.size() - 1, matrix.end()); // move rows down
for (auto &row: matrix) // move columns to the left
{
rotate(row.begin(), row.begin() + 1, row.end());
}
for (auto &row: matrix) // move columns to the right
{
rotate(row.begin(), row.begin() + row.size() - 1, row.end());
}

There is std::rotate() that can do this for you. To rotate the contents of each row:
vector<vector<string>> matrix{{"a","b","c"}, {"d","e","f"}, {"g","h","i"}};
for (auto &row: matrix)
{
std::rotate(row.begin(), row.begin() + 1, row.end());
}
To rotate the contents of the columns, you just rotate the outer vector:
std::rotate(matrix.begin(), matrix.begin() + 1, matrix.end());
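The two std::rotate calls above can be wrapped so that all four directions take an arbitrary shift count. This is only a sketch: the helper names (rotate_left, rotate_right, shift_up and so on) are invented for the example, and rotating through reverse iterators is one of several ways to shift towards the end.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

using Row = std::vector<std::string>;
using Matrix = std::vector<Row>;

// Hypothetical helpers; k may exceed the length, it is reduced modulo the size.
template <typename Seq>
void rotate_left(Seq& s, std::size_t k) {
    if (s.empty()) return;
    const auto d = static_cast<std::ptrdiff_t>(k % s.size());
    std::rotate(s.begin(), s.begin() + d, s.end());
}

template <typename Seq>
void rotate_right(Seq& s, std::size_t k) {
    if (s.empty()) return;
    const auto d = static_cast<std::ptrdiff_t>(k % s.size());
    // Rotating the reversed view moves elements towards the end.
    std::rotate(s.rbegin(), s.rbegin() + d, s.rend());
}

void shift_up(Matrix& m, std::size_t k)    { rotate_left(m, k); }
void shift_down(Matrix& m, std::size_t k)  { rotate_right(m, k); }
void shift_left(Matrix& m, std::size_t k)  { for (auto& row : m) rotate_left(row, k); }
void shift_right(Matrix& m, std::size_t k) { for (auto& row : m) rotate_right(row, k); }
```

The same helpers work for a 32 x 32 matrix of ints if Row is changed accordingly.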

How to simplify this "clear multiple bits at once" function?

I finally figured out through trial and error how to clear multiple bits on an integer:
const getNumberOfBitsInUint8 = function(i8) {
let i = 0
while (i8) {
i++
i8 >>= 1
}
return i
}
const write = function(n, i, x) {
let o = 0xff // 0b11111111
let c = getNumberOfBitsInUint8(x)
let j = 8 - i // right side start
let k = j - c // right side remaining
let h = c + i
let a = x << k // set bits
let b = a ^ o // set bits flip
let d = o >> h // mask right
let q = d ^ b //
let m = o >> j // mask left
let s = m << j
let t = s ^ q // clear bits!
let w = n | a // set the set bits
let z = w & ~t // perform some magic https://stackoverflow.com/q/8965521/169992
return z
}
The write function takes an integer n, the index i to write bits into, and the bits value x.
Is there any way to simplify this function down and remove some steps? (Without just combining multiple operations on a single line)?

One possibility is to first clear the relevant part and then copy the bits into it:
return (n & ~((0xff << (8 - c)) >> i)) | (x << (8 - c - i))
assuming the left shift is restricted to 8 bits so the top bits disappear. Another is to use xor to find the bits to be changed:
return n ^ ((((n >> (8 - c - i)) ^ x) << (8 - c)) >> i)
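Both one-liners rely on an 8-bit left shift, which JavaScript does not provide by itself. A minimal C++ sketch of the same two expressions follows, with the truncation made explicit through casts to an 8-bit type. The function names and the choice to pass the field width c as a parameter (instead of deriving it from x) are assumptions of this example.

```cpp
#include <cstdint>

// Hypothetical signature. n: byte to modify, i: start index counted from the
// most significant bit, x: value to store, c: field width in bits.
// Preconditions: i + c <= 8 and x < 2^c.
std::uint8_t write_bits(std::uint8_t n, unsigned i, std::uint8_t x, unsigned c) {
    // c one-bits at the top of the byte; the cast drops anything above bit 7.
    const std::uint8_t field = static_cast<std::uint8_t>(0xFFu << (8 - c));
    const std::uint8_t mask = static_cast<std::uint8_t>(field >> i);
    // Clear the field, then copy x into it.
    return static_cast<std::uint8_t>((n & ~mask) | (x << (8 - c - i)));
}

// XOR variant: compute which bits of the field differ from x and flip them.
std::uint8_t write_bits_xor(std::uint8_t n, unsigned i, std::uint8_t x, unsigned c) {
    const std::uint8_t diff =
        static_cast<std::uint8_t>(((n >> (8 - c - i)) ^ x) << (8 - c));
    return static_cast<std::uint8_t>(n ^ (diff >> i));
}
```

For example, writing the three bits 101 at index 2 into 11111111 clears the middle bit of the field and leaves everything else untouched, giving 11101111.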

Determining cell value using bitwise operation with adjacent cell values

An r*c grid contains only 0s and 1s. In each iteration, if any adjacent cell (up, down, left, right) has the same value as the current cell, the current cell is flipped. How can I come up with a bitwise formula for this? It can be done with a simple if condition, but I want a bitwise operation so that a whole row can be updated at once.
I am talking about this problem. I saw a solution using this concept here, but I couldn't understand how these XOR operations determine the cell value.
ans[i] ^= ((l ^ r) | (r ^ u) | (u ^ d)) | (~s[i] ^ l);
ans[i] &= prefix;
Any help would be appreciated :D

For the start, consider s[i], l, r, u, and d to be single bits, that is, boolean variables.
s[i] (abbreviated as s in this answer) is the old color of the cell to be updated.
l, r, u, and d are the colors of the adjacent cells left, right, above (up), and below (down) of the cell to be updated.
ans[i] (abbreviated as ans in this answer) is the new color of the cell after the update.
We initialize ans = s and update it only if needed.
Recall the rules from the game for a single cell C:
If all cells adjacent to C have the opposite color of C, then C retains its color.
Otherwise (if a cell adjacent to C has the same color as C), C changes its color.
Are there various adjacent colors?
For the first condition you can use a fail-fast approach. No matter the color of C, if the adjacent cells have various colors (some are 0 and some are 1) then C changes its color. To check whether the adjacent cells l, r, u, and d have various colors you only need three checks ✱:
various_adjacent_colors = (l != r) || (r != u) || (u != d)
In bit-wise notation this is
various_adjacent_colors = (l ^ r) | (r ^ u) | (u ^ d)
✱ The "missing" checks like r != d are not necessary. Think about it the other way: If all three checks fail, then we know (l == r) && (r == u) && (u == d). In that case, from transitivity of == follows that (l == u), and (l == d), and (r == d). Therefore, all colors are the same.
Fail-Fast for various adjacent colors
If we find various adjacent colors, then we change s:
if (various_adjacent_colors)
ans = !s
In bit-wise notation this is
ans ^= various_adjacent_colors
Are all colors equal?
If we did not fail-fast, we know that all adjacent colors are equal to each other but not if they are equal to s. If s == all_adjacent_colors then we change s and if s != all_adjacent_colors then we retain s.
if (!various_adjacent_colors && s == l) // l can be replaced by either r, u, or d
ans = !s
In bit-wise notation this is
ans ^= ~various_adjacent_colors & ~(s ^ l) or
ans ^= ~various_adjacent_colors & (~s ^ l)
Putting everything together
Now let's inline (and slightly simplify) all the bit-wise notations:
vari = (l ^ r) | (r ^ u) | (u ^ d); ans ^= vari; ans ^= ~vari & (~s ^ l) is the same as
vari = (l ^ r) | (r ^ u) | (u ^ d); ans ^= vari | (~s ^ l) is the same as
ans ^= ((l ^ r) | (r ^ u) | (u ^ d)) | (~s ^ l)
Seems familiar, right? :)
From single bits to bit-vectors
So far, we only considered single bits. The linked solution uses bit-vectors instead to simultaneously update all bits/cells in a row of the 2D game board. This approach only fails at the borders of the game board:
From r = s[i] << 1 the game board might end up bigger than it should be. ans[i] &= prefix fixes the size by masking overhanging bits.
At the top and bottom row the update does not work because u = s[i-1] and d = s[i+1] do not exist. The author updates these rows "manually" in a for loop.
The update for the leftmost and rightmost cell in each row might be wrong since r = s[i] << 1 and l = s[i] >> 1 shift in "adjacent" cells of color 0 which are not actually in the game. The author updates these cells "manually" in another for loop.
By the way: A (better?) alternative to the mentioned "manual" border updates is to slightly enlarge the game board with an additional virtual row/column at each border. Before each game iteration, the virtual rows/columns are initialized such that they don't affect the update. Then the update of the actual game board is done as usual. The virtual rows/columns don't have to be stored, instead use ...
// define once for the game
bitset<N> maskMsb, maskLsb;
maskMsb[m-1] = 1;
maskLsb[0] = 1;
// define for each row when updating the game board
bitset<N> l = (s[i] >> 1) | (~s[i] & maskMsb);
bitset<N> r = (s[i] << 1) | (~s[i] & maskLsb);
bitset<N> u = i-1 >= 0 ? s[i-1] : ~s[0];
bitset<N> d = i+1 <= n-1 ? s[i+1] : ~s[n-1];
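To see the formula and the virtual border working together, here is a small sketch that uses one 32-bit word per row instead of bitset<N>. The function name step_grid, the limit of 32 columns and the convention that bit j of s[i] is the cell in row i, column j are assumptions made for this example. It also assumes that the unused high bits of every row are zero.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: one update of an n x m grid with 1 <= m <= 32.
std::vector<std::uint32_t> step_grid(const std::vector<std::uint32_t>& s, int m) {
    const int n = static_cast<int>(s.size());
    const std::uint32_t prefix = (m == 32) ? 0xFFFFFFFFu : ((1u << m) - 1u);
    const std::uint32_t msb = 1u << (m - 1);
    const std::uint32_t lsb = 1u;
    std::vector<std::uint32_t> ans(s);
    for (int i = 0; i < n; ++i) {
        // Virtual border cells take the opposite colour of the cell they
        // touch, so they can never be the reason for a flip.
        const std::uint32_t l = (s[i] >> 1) | (~s[i] & msb);
        const std::uint32_t r = (s[i] << 1) | (~s[i] & lsb);
        const std::uint32_t u = (i > 0) ? s[i - 1] : ~s[i];
        const std::uint32_t d = (i + 1 < n) ? s[i + 1] : ~s[i];
        // Flip if the neighbours differ among themselves or equal the cell.
        ans[i] ^= ((l ^ r) | (r ^ u) | (u ^ d)) | (~s[i] ^ l);
        ans[i] &= prefix;  // drop the overhanging bits
    }
    return ans;
}
```

On a 2 x 2 checkerboard every neighbour differs from its cell, so the grid is returned unchanged, while an all-zero 2 x 2 grid flips completely.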

Range Update - Range Query using Fenwick Tree

http://ayazdzulfikar.blogspot.in/2014/12/penggunaan-fenwick-tree-bit.html?showComment=1434865697025#c5391178275473818224
Suppose the function f(i) at index i is i^k, for some k >= 0 that stays fixed for the whole problem. We are given queries of the following kinds:
Add v to array[i] for all a <= i <= b.
Determine the sum of array[i] * f(i) over all a <= i <= b (remember the definition of f above).
To solve this, the answer can be written as Query(x) = m * g(x) - c, where g(x) is f(1) + f(2) + ... + f(x).
To do so we need to know the values of m and c, and for that we need 2 separate BITs. Consider each update of the form a b v. Calculating m is virtually identical to Range Update - Point Query. For each index i we get:
i < a, m = 0
a <= i <= b, m = v
b < i, m = 0
From this observation it is clear that Range Update - Point Query can be used on one BIT. To calculate c, we again consider each possible i:
i < a, then c = 0
a <= i <= b, then c = v * g(a - 1)
b < i, then c = -v * (g(b) - g(a - 1))
Again we need Range Update - Point Query, but on a different BIT.
As a small aid, here are the values of g(x) for k <= 3:
k = 0 -> x
k = 1 -> x * (x + 1) / 2
k = 2 -> x * (x + 1) * (2x + 1) / 6
k = 3 -> (x * (x + 1) / 2) ^ 2
An example problem is SPOJ - Horrible Queries, which matches the description above with k = 0. Note that some problems are more extreme: the function is not a single power k but a polynomial, e.g. LA - Alien Abduction Again. For such a problem, keep a separate BIT for the m counter of each power; the c counters can be combined into a single BIT.
How can we use this concept if:
Given an array of integers A1,A2,…AN.
Given x,y: Add 1×2 to Ax, add 2×3 to Ax+1, add 3×4 to Ax+2, add 4×5 to
Ax+3, and so on until Ay.
Then return Sum of the range [Ax,Ay].
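For reference, the quoted two-BIT scheme can be sketched for the plain k = 0 case, where g(x) = x and the query is an ordinary range sum. The struct names Fenwick and RangeAddRangeSum are invented for this sketch, and it does not by itself answer the 1×2, 2×3, ... variation asked about here. The c tree stores v * g(a - 1) from index a on and drops by v * g(b) after b, so that Query(x) = m * g(x) - c holds in all three index ranges.

```cpp
#include <cstdint>
#include <vector>

// Minimal Fenwick tree (BIT): point add and prefix sum, 1-based indices.
struct Fenwick {
    std::vector<std::int64_t> t;
    explicit Fenwick(int n) : t(static_cast<std::size_t>(n) + 1, 0) {}
    void add(int i, std::int64_t v) {
        for (; i < static_cast<int>(t.size()); i += i & -i) t[i] += v;
    }
    std::int64_t sum(int i) const {
        std::int64_t s = 0;
        for (; i > 0; i -= i & -i) s += t[i];
        return s;
    }
};

// Hypothetical wrapper for k = 0, i.e. f(i) = 1 and g(x) = x.
struct RangeAddRangeSum {
    Fenwick m, c;
    // One spare slot so that index b + 1 is valid when b == n.
    explicit RangeAddRangeSum(int n) : m(n + 1), c(n + 1) {}
    static std::int64_t g(std::int64_t x) { return x; }
    // Add v to array[i] for all a <= i <= b.
    void update(int a, int b, std::int64_t v) {
        m.add(a, v);
        m.add(b + 1, -v);
        c.add(a, v * g(a - 1));
        c.add(b + 1, -v * g(b));
    }
    // Sum over array[1..x], following Query(x) = m * g(x) - c.
    std::int64_t prefix(int x) const { return m.sum(x) * g(x) - c.sum(x); }
    // Sum over array[a..b].
    std::int64_t query(int a, int b) const { return prefix(b) - prefix(a - 1); }
};
```

Swapping g for one of the other closed forms listed above should give the weighted sum for that k, following the same derivation.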

Shift masked bits to the lsb

When you AND some data with a mask, you get a result of the same size as the data/mask.
What I want to do, is to take the masked bits in the result (where there was 1 in the mask) and shift them to the right so they are next to each other and I can perform a CTZ (Count Trailing Zeroes) on them.
I didn't know how to name such a procedure so Google has failed me. The operation should preferably not be a loop solution, this has to be as fast operation as possible.
And here is an incredible image made in MS Paint.

This operation is known as compress right. It is implemented as part of BMI2 as the PEXT instruction, in Intel processors as of Haswell.
Unfortunately, without hardware support it is quite an annoying operation. Of course there is an obvious solution, just moving the bits one by one in a loop; here is the one given by Hacker's Delight:
unsigned compress(unsigned x, unsigned m) {
unsigned r, s, b; // Result, shift, mask bit.
r = 0;
s = 0;
do {
b = m & 1;
r = r | ((x & b) << s);
s = s + b;
x = x >> 1;
m = m >> 1;
} while (m != 0);
return r;
}
But there is another way, also given by Hacker's Delight, which loops less (the number of iterations is logarithmic in the number of bits) but does more per iteration:
unsigned compress(unsigned x, unsigned m) {
unsigned mk, mp, mv, t;
int i;
x = x & m; // Clear irrelevant bits.
mk = ~m << 1; // We will count 0's to right.
for (i = 0; i < 5; i++) {
mp = mk ^ (mk << 1); // Parallel prefix.
mp = mp ^ (mp << 2);
mp = mp ^ (mp << 4);
mp = mp ^ (mp << 8);
mp = mp ^ (mp << 16);
mv = mp & m; // Bits to move.
m = m ^ mv | (mv >> (1 << i)); // Compress m.
t = x & mv;
x = x ^ t | (t >> (1 << i)); // Compress x.
mk = mk & ~mp;
}
return x;
}
Notice that a lot of the values there depend only on m. Since you only have 512 different masks, you could precompute those and simplify the code to something like this (not tested)
unsigned compress(unsigned x, int maskindex) {
unsigned t;
int i;
x = x & masks[maskindex][0];
for (i = 0; i < 5; i++) {
t = x & masks[maskindex][i + 1];
x = x ^ t | (t >> (1 << i));
}
return x;
}
Of course all of these can be turned into "not a loop" by unrolling; the second and third ways are probably more suitable for that. That's a bit of a cheat, however.
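The third snippet assumes a precomputed masks table without showing how to fill it. One possible sketch is below: it runs the mask-only part of the logarithmic algorithm once per mask and records the "bits to move" value of each round. The names MoveTable and make_move_table are made up for this example, and the layout (element 0 is the mask, elements 1 to 5 are the per-round values) simply mirrors the indexing used in that snippet.

```cpp
#include <array>
#include <cstdint>

// Hypothetical table entry: [0] is the mask, [1..5] are the mv values of the
// five rounds of the logarithmic compress.
using MoveTable = std::array<std::uint32_t, 6>;

MoveTable make_move_table(std::uint32_t m) {
    MoveTable t{};
    t[0] = m;
    std::uint32_t mk = ~m << 1;  // we will count 0's to the right
    for (int i = 0; i < 5; ++i) {
        std::uint32_t mp = mk ^ (mk << 1);  // parallel prefix
        mp ^= mp << 2;
        mp ^= mp << 4;
        mp ^= mp << 8;
        mp ^= mp << 16;
        const std::uint32_t mv = mp & m;  // bits to move in this round
        t[i + 1] = mv;
        m = (m ^ mv) | (mv >> (1u << i));  // compress m
        mk &= ~mp;
    }
    return t;
}

// Compress x with a table built by make_move_table.
std::uint32_t compress(std::uint32_t x, const MoveTable& t) {
    x &= t[0];
    for (int i = 0; i < 5; ++i) {
        const std::uint32_t mv = x & t[i + 1];
        x = (x ^ mv) | (mv >> (1u << i));
    }
    return x;
}
```

With the mask 0xA9 the selected bits 7, 5, 3 and 0 end up in bits 3, 2, 1 and 0 respectively.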

You can use the pack-by-multiplication technique similar to the one described here. This way you don't need any loop and can mix the bits in any order.
For example, with the mask 0b10101001 == 0xA9 like above and 8-bit data abcdefgh (where a-h are the 8 bits), you can use the expression below to get 0000aceh
uint8_t compress_maskA9(uint8_t x)
{
const uint8_t mask1 = 0xA9 & 0xF0;
const uint8_t mask2 = 0xA9 & 0x0F;
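// the U suffixes keep the multiplications unsigned, so the overflow wraps instead of being undefined
return (((x & mask1)*0x03000000U >> 28) & 0x0C) | ((x & mask2)*0x50000000U >> 30);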
}
In this specific case the 4 bits overlap while being added during the multiplication step (which incurs an unexpected carry), so I've split them into 2 parts: the first one extracts bits a and c, then e and h are extracted in the latter part. There are other ways to split the bits as well, like a & h then c & e. You can see the results compared to Harold's function live on ideone
An alternate way with only one multiplication
const uint32_t X = (x << 8) | x;
return (X & 0x8821)*0x12050000 >> 28;
I got this by duplicating the bits so that they're spaced out farther, leaving enough space to avoid the carry. This is often better than splitting into 2 multiplications.
If you want the result's bits reversed (i.e. 0000heca) you can easily change the magic numbers accordingly
// result: he00 | 00ca;
return (((x & 0x09)*0x88000000U >> 28) & 0x0C) | (((x & 0xA0)*0x04800000U) >> 30);
or you can also extract the 3 bits e, c and a at the same time, leaving h separate (as I mentioned above, there are often multiple solutions), so that you need only one multiplication
return ((x & 0xA8)*0x12400000U >> 29) | (x & 0x01) << 3; // result: 0eca | h000
But there might be a better alternative, like the second snippet above
const uint32_t X = (x << 8) | x;
return (X & 0x2881)*0x80290000 >> 28;
Correctness check: http://ideone.com/PYUkty
For a larger number of masks you can precompute the magic numbers corresponding to those masks and store them in an array so that you can look them up immediately for use. I calculated those masks by hand but you can do that automatically
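Since the magic numbers are easy to get wrong, it helps to check them exhaustively against the bit-by-bit loop; there are only 256 inputs. The sketch below restates the two single-multiplication variants with unsigned constants next to a reference loop. The function names compress_ref, compress_a9 and compress_a9_reversed are made up for this example.

```cpp
#include <cstdint>

// Reference: move the bits of x selected by m to the low end, one per step.
std::uint32_t compress_ref(std::uint32_t x, std::uint32_t m) {
    std::uint32_t r = 0, s = 0;
    for (; m != 0; x >>= 1, m >>= 1) {
        const std::uint32_t b = m & 1u;
        r |= (x & b) << s;
        s += b;
    }
    return r;
}

// Single multiplication for the fixed mask 0xA9: abcdefgh -> 0000aceh.
std::uint32_t compress_a9(std::uint8_t x) {
    const std::uint32_t X = (static_cast<std::uint32_t>(x) << 8) | x;
    return ((X & 0x8821u) * 0x12050000u) >> 28;
}

// Same bits in reversed order: abcdefgh -> 0000heca.
std::uint32_t compress_a9_reversed(std::uint8_t x) {
    const std::uint32_t X = (static_cast<std::uint32_t>(x) << 8) | x;
    return ((X & 0x2881u) * 0x80290000u) >> 28;
}
```

Looping x from 0 to 255 and comparing compress_a9(x) with compress_ref(x, 0xA9) covers every case.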
Explanation
We have abcdefgh & mask1 = a0c00000. Multiply it with magic1
     ........................a0c00000
×    00000011000000000000000000000000 (magic1 = 0x03000000)
     ────────────────────────────────
     a0c00000........................
+   a0c00000......................... (the leading "a" bit is outside int's range
     ────────────────────────────────  so it'll be truncated)
r1 = acc.............................
=> (r1 >> 28) & 0x0C = 0000ac00
Similarly we multiply abcdefgh & mask2 = 0000e00h with magic2
     ........................0000e00h
×    01010000000000000000000000000000 (magic2 = 0x50000000)
     ────────────────────────────────
     e00h............................
+    0h..............................
     ────────────────────────────────
r2 = eh..............................
=> (r2 >> 30) = 000000eh
Combining them gives the expected result
((r1 >> 28) & 0x0C) | (r2 >> 30) = 0000aceh
And here's the demo for the second snippet
                     abcdefghabcdefgh
&                    1000100000100001 (0x8821)
     ────────────────────────────────
                     a000e00000c0000h
×    00010010000001010000000000000000 (0x12050000)
     ────────────────────────────────
     000h
     00e00000c0000h
+    0c0000h
     a000e00000c0000h
     ────────────────────────────────
=    acehe0h0c0c00h0h
&    11110000000000000000000000000000
     ────────────────────────────────
=    aceh
For the reversed order case:
                     abcdefghabcdefgh
&                    0010100010000001 (0x2881)
     ────────────────────────────────
                     00c0e000a000000h
×    10000000001010010000000000000000 (0x80290000)
     ────────────────────────────────
     000a000000h
     00c0e000a000000h
+    0e000a000000h
     h
     ────────────────────────────────
=    hecaea00a0h0h00h
&    11110000000000000000000000000000
     ────────────────────────────────
=    heca
Related:
How to create a byte out of 8 bool values (and vice versa)?
Redistribute least significant bits from a 4-byte array to a nibble
