Is there an AVX or other instruction-set instruction that can extract a specific bit, given an index, from multiple integers in parallel? - bit-manipulation

Example: a=11010001 , b=0001001, c=11010000, d = 11111111
extract(a,b,c,d,2) == 0001

There are two cases: 1. The position of interest is a compile-time constant, and 2.
The position of interest is not a compile-time constant. Both cases are answered
in the code below.
Note that if a, b, c, d are consecutive in memory, then you can simply load them into an xmm
register with x = _mm_load_si128((__m128i*) &d);, which is much more efficient than the
_mm_set_epi32() used here.
The code:
/* gcc -O3 -m64 -Wall -march=broadwell extract_2nd_bit.c */
#include <immintrin.h>
#include <stdio.h>
/* If position i = 2 (for example) is known at compile time: */
int extract_2nd_bit(int a, int b, int c, int d){
__m128i x = _mm_set_epi32(a, b, c, d);
x = _mm_slli_epi32(x, 31 - 2); /* shift bit 2 to the highest position */
return _mm_movemask_ps(_mm_castsi128_ps(x)); /* extract the MSB of the 4 elements */
}
/* If position i is unknown at compile time: */
int extract_var(int a, int b, int c, int d, int i){
__m128i x = _mm_set_epi32(a, b, c, d);
x = _mm_sll_epi32(x, _mm_cvtsi32_si128(31 - i)); /* shift bit i to the highest position */
return _mm_movemask_ps(_mm_castsi128_ps(x)); /* extract the MSB of the 4 elements */
}
int print_32_bin(unsigned int x);
int main(){
int a = 0b11010001;
int b = 0b0001001;
int c = 0b11010000;
int d = 0b11111111;
int pos = 2;
print_32_bin(extract_2nd_bit(a, b, c, d));
print_32_bin(extract_var(a, b, c, d, pos));
return 0;
}
int print_32_bin(unsigned int x){
for (int i=31;i>=0;i--){
printf("%1u",((x>>i)&1));
}
printf("\n");
return 0;
}
The output is:
$ ./a.out
00000000000000000000000000000001
00000000000000000000000000000001
By the way, why didn't you set the avx or sse tag in the question?

Try using the
unsigned __int64 _pext_u64 (unsigned __int64 a, unsigned __int64 mask)
intrinsic, though it operates on a single integer rather than on multiple integers.
There are other ways using ANDs and variable SHIFTs (and other instructions).
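For the four-byte example from the question, a minimal sketch of that idea (assuming a BMI2-capable x86-64 CPU and 8-bit inputs; names are illustrative):
#include <immintrin.h>
#include <stdint.h>
/* Pack the four bytes into one word, then let PEXT pull bit i out of every byte at once. */
int extract_bit_pext(uint8_t a, uint8_t b, uint8_t c, uint8_t d, int i) /* i in 0..7 */
{
    uint64_t packed = ((uint64_t)a << 24) | ((uint64_t)b << 16) | ((uint64_t)c << 8) | d;
    uint64_t mask   = 0x01010101ULL << i;   /* bit i of each of the four bytes */
    return (int)_pext_u64(packed, mask);    /* result: a in bit 3, b in bit 2, c in bit 1, d in bit 0 */
}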

This algorithm is not optimal, because the filling of the 32-bit register is done serially. But you should get the gist. It is the PEXT instruction from the BMI2 instruction set that can do this efficiently.
This is a solution in MASM x86 assembly (a, b, c, d are BYTE values in memory):
mov ah, a
mov al, b
shl eax, 16
mov ah, c
mov al, d
; Now EAX = aaaaaaaabbbbbbbbccccccccdddddddd
mov ecx, 00000100000001000000010000000100b ; MASK value: bit 2 of each byte
pext eax, eax, ecx
; Now EAX = 00000000000000000000000000000001 ; result
For practical use, optimize the filling of the 32-bit source register (here: EAX).
Now the lowest 4 bits of EAX should contain 0001.

Related

Convert 16 bits mask to 16 bytes mask

Is there any way to convert the following code:
int mask16 = 0b1010101010101010; // int or short, signed or unsigned, it does not matter
to
__uint128_t mask128 = ((__uint128_t)0x0100010001000100 << 64) | 0x0100010001000100;
So to be extra clear something like:
int mask16 = 0b1010101010101010;
__uint128_t mask128 = intrinsic_bits_to_bytes(mask16);
or by applying directly the mask:
int mask16 = 0b1010101010101010;
__uint128_t v = ((__uint128_t)0x2828282828282828 << 64) | 0x2828282828282828;
__uint128_t w = intrinsic_bits_to_bytes_mask(v, mask16); // w = ((__uint128_t)0x2928292829282928 << 64) | 0x2928292829282928;
Bit/byte order: Unless noted, these follow the question, putting the LSB of the uint16_t in the least significant byte of the __uint128_t (lowest memory address on little-endian x86). This is what you want for an ASCII dump of a bitmap for example, but it's opposite of place-value printing order for the base-2 representation of a single 16-bit number.
The discussion of efficiently getting values (back) into RDX:RAX integer registers has no relevance for most normal use-cases since you'd just store to memory from vector registers, whether that's 0/1 byte integers or ASCII '0'/'1' digits (which you can get most efficiently without ever having 0/1 integers in a __m128i, let alone in an unsigned __int128).
Table of contents:
SSE2 / SSSE3 version: good if you want the result in a vector, e.g. for storing a char array.
(SSE2 NASM version, shuffling into MSB-first printing order and converting to ASCII.)
BMI2 pdep: good for scalar unsigned __int128 on Intel CPUs with BMI2, if you're going to make use of the result in scalar registers. Slow on AMD.
Pure C++ with a multiply bithack: pretty reasonable for scalar
AVX-512: AVX-512 has masking as a first-class operation using scalar bitmaps. Possibly not as good as BMI2 pdep if you're using the result as scalar halves, otherwise even better than SSSE3.
AVX2 printing order (MSB at lowest address) dump of a 32-bit integer.
See also is there an inverse instruction to the movemask instruction in intel avx2? for other variations on element size and mask width. (SSE2 and multiply bithack were adapted from answers linked from that collection.)
With SSE2 (preferably SSSE3)
See @aqrit's How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD answer
Adapting that to work with 16 bits -> 16 bytes, we need a shuffle that replicates the first byte of the mask to the first 8 bytes of the vector, and the 2nd mask byte to the high 8 vector bytes. That's doable with one SSSE3 pshufb, or with punpcklbw same,same + punpcklwd same,same + punpckldq same,same to finally duplicate things up to two 64-bit qwords.
typedef unsigned __int128 u128;
u128 mask_to_u128_SSSE3(unsigned bitmap)
{
const __m128i shuffle = _mm_setr_epi32(0,0, 0x01010101, 0x01010101);
__m128i v = _mm_shuffle_epi8(_mm_cvtsi32_si128(bitmap), shuffle); // SSSE3 pshufb
const __m128i bitselect = _mm_setr_epi8(
1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1U<<7,
1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1U<<7 );
v = _mm_and_si128(v, bitselect);
v = _mm_min_epu8(v, _mm_set1_epi8(1)); // non-zero -> 1 : 0 -> 0
// return v; // if you want a SIMD vector result
alignas(16) u128 tmp;
_mm_store_si128((__m128i*)&tmp, v);
return tmp; // optimizes to movq / pextrq (with SSE4)
}
(To get 0 / 0xFF instead of 0 / 1, replace _mm_min_epu8 with v= _mm_cmpeq_epi8(v, bitselect). If you want a string of ASCII '0' / '1' characters, do cmpeq and _mm_sub_epi8(_mm_set1_epi8('0'), v). That avoids the set1(1) vector constant.)
Godbolt including test-cases. (For this and other non-AVX-512 versions.)
# clang -O3 for Skylake
mask_to_u128_SSSE3(unsigned int):
vmovd xmm0, edi # _mm_cvtsi32_si128
vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI2_0] # xmm0 = xmm0[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI2_1] # 1<<0, 1<<1, etc.
vpminub xmm0, xmm0, xmmword ptr [rip + .LCPI2_2] # set1_epi8(1)
# done here if you return __m128i v or store the u128 to memory
vmovq rax, xmm0
vpextrq rdx, xmm0, 1
ret
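For targets without SSSE3, the punpck-based broadcast mentioned above might look like this (a sketch; only the byte-duplication step changes, the AND / min / store steps stay the same as in the SSSE3 version):
#include <immintrin.h>
__m128i dup_mask_bytes_sse2(unsigned bitmap)    // replaces the pshufb step
{
    __m128i v = _mm_cvtsi32_si128(bitmap);
    v = _mm_unpacklo_epi8(v, v);    // b0 b0 b1 b1 ...
    v = _mm_unpacklo_epi16(v, v);   // b0 b0 b0 b0 b1 b1 b1 b1 ...
    v = _mm_unpacklo_epi32(v, v);   // b0 x8 | b1 x8
    return v;
}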
BMI2 pdep: good on Intel, bad on AMD
BMI2 pdep is fast on Intel CPUs that have it (since Haswell), but very slow on AMD (over a dozen uops, high latency.)
typedef unsigned __int128 u128;
inline u128 assemble_halves(uint64_t lo, uint64_t hi) {
return ((u128)hi << 64) | lo; }
// could replace this with __m128i using _mm_set_epi64x(hi, lo) to see how that compiles
#ifdef __BMI2__
#include <immintrin.h>
auto mask_to_u128_bmi2(unsigned bitmap) {
// fast on Intel, slow on AMD
uint64_t tobytes = 0x0101010101010101ULL;
uint64_t lo = _pdep_u64(bitmap, tobytes);
uint64_t hi = _pdep_u64(bitmap>>8, tobytes);
return assemble_halves(lo, hi);
}
#endif
Good if you want the result in scalar registers (not one vector) otherwise probably prefer the SSSE3 way.
# clang -O3
mask_to_u128_bmi2(unsigned int):
movabs rcx, 72340172838076673 # 0x0101010101010101
pdep rax, rdi, rcx
shr edi, 8
pdep rdx, rdi, rcx
ret
# returns in RDX:RAX
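If you'd rather have the two pdep halves in a vector register (the _mm_set_epi64x alternative mentioned in the comment above), a sketch:
#ifdef __BMI2__
#include <immintrin.h>
#include <stdint.h>
__m128i mask_to_vec128_bmi2(unsigned bitmap) {
    const uint64_t tobytes = 0x0101010101010101ULL;
    return _mm_set_epi64x(_pdep_u64(bitmap >> 8, tobytes),   // high 8 bytes
                          _pdep_u64(bitmap, tobytes));       // low 8 bytes
}
#endif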
Portable C++ with a magic multiply bithack
Not bad on x86-64; AMD since Zen has fast 64-bit multiply, and Intel's had that since Nehalem. Some low-power CPUs still have slowish imul r64, r64.
This version may be optimal for __uint128_t results, at least for latency on Intel without BMI2, and on AMD, since it avoids a round-trip to XMM registers. But for throughput it's quite a few instructions.
See @phuclv's answer on How to create a byte out of 8 bool values (and vice versa)? for an explanation of the multiply, and for the reverse direction. Use the algorithm from unpack8bools once for each 8-bit half of your mask.
//#include <endian.h> // glibc / BSD
auto mask_to_u128_magic_mul(uint32_t bitmap) {
//uint64_t MAGIC = htobe64(0x0102040810204080ULL); // For MSB-first printing order in a char array after memcpy. 0x8040201008040201ULL on little-endian.
uint64_t MAGIC = 0x0102040810204080ULL; // LSB -> LSB of the u128, regardless of memory order
uint64_t MASK = 0x0101010101010101ULL;
uint64_t lo = ((MAGIC*(uint8_t)bitmap) ) >> 7;
uint64_t hi = ((MAGIC*(bitmap>>8)) ) >> 7;
return assemble_halves(lo & MASK, hi & MASK);
}
If you're going to store the __uint128_t to memory with memcpy, you might want to control for host endianness by using htole64(0x0102040810204080ULL); (from GNU / BSD <endian.h>) or equivalent to always map the low bit of input to the lowest byte of output, i.e. to the first element of a char or bool array. Or htobe64 for the other order, e.g. for printing. Using that function on a constant instead of the variable data allows constant-propagation at compile time.
Otherwise, if you truly want a 128-bit integer whose low bit matches the low bit of the u16 input, the multiplier constant is independent of host endianness; there's no byte access to wider types.
clang 12.0 -O3 for x86-64:
mask_to_u128_magic_mul(unsigned int):
movzx eax, dil
movabs rdx, 72624976668147840 # 0x0102040810204080
imul rax, rdx
shr rax, 7
shr edi, 8
imul rdx, rdi
shr rdx, 7
movabs rcx, 72340172838076673 # 0x0101010101010101
and rax, rcx
and rdx, rcx
ret
AVX-512
This is easy with AVX-512BW; you can use the mask for a zero-masked load from a repeated 0x01 constant.
__m128i bits_to_bytes_avx512bw(unsigned mask16) {
return _mm_maskz_mov_epi8(mask16, _mm_set1_epi8(1));
// alignas(16) unsigned __int128 tmp;
// _mm_store_si128((__m128i*)&tmp, v); // should optimize into vmovq / vpextrq
// return tmp;
}
Or avoid a memory constant (because compilers can do set1(-1) with just a vpcmpeqd xmm0,xmm0): Do a zero-masked absolute-value of -1. The constant setup can be hoisted, same as with set1(1).
__m128i bits_to_bytes_avx512bw_noconst(unsigned mask16) {
__m128i ones = _mm_set1_epi8(-1); // extra instruction *off* the critical path
return _mm_maskz_abs_epi8(mask16, ones);
}
But note that if doing further vector stuff, the result of maskz_mov might be able to optimize into other operations. For example vec += maskz_mov could optimize into a merge-masked add. But if not, vmovdqu8 xmm{k}{z}, xmm needs an ALU port like vpabsb xmm{k}{z}, xmm, but vpabsb can't run on port 5 on Skylake/Ice Lake. (A zero-masked vpsubb from a zeroed register would avoid that possible throughput problem, but then you'd be setting up 2 registers just to avoid loading a constant. In hand-written asm, you'd just materialize set1(1) using vpcmpeqd / vpabsb yourself if you wanted to avoid a 4-byte broadcast-load of a constant.)
(Godbolt compiler explorer with gcc and clang -O3 -march=skylake-avx512. Clang sees through the masked vpabsb and compiles it the same as the first version, with a memory constant.)
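For reference, the zero-masked subtract variant discussed above could be written like this (a sketch, assuming AVX-512BW + AVX-512VL): 0 - (-1) = 1 in the selected lanes, 0 elsewhere, and both inputs can be materialized without loading a constant from memory.
#include <immintrin.h>
__m128i bits_to_bytes_avx512bw_sub(unsigned mask16) {
    return _mm_maskz_sub_epi8((__mmask16)mask16, _mm_setzero_si128(), _mm_set1_epi8(-1));
}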
Even better if you can use a vector 0 / -1 instead of 0 / 1: use return _mm_movm_epi8(mask16). Compiles to just kmovd k0, edi / vpmovm2b xmm0, k0
If you want a vector of ASCII characters like '0' or '1', you could use _mm_mask_blend_epi8(mask, ones, zeroes). (That should be more efficient than a merge-masked add into a vector of set1(1) which would require an extra register copy, and also better than sub between set1('0') and _mm_movm_epi8(mask16) which would require 2 instructions: one to turn the mask into a vector, and a separate vpsubb.)
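A sketch of that ASCII variant (assuming AVX-512BW + AVX-512VL, and that the intended output is '1' where a mask bit is set):
#include <immintrin.h>
__m128i bits_to_ascii_avx512bw(unsigned mask16) {
    return _mm_mask_blend_epi8((__mmask16)mask16,
                               _mm_set1_epi8('0'),    // taken where the mask bit is 0
                               _mm_set1_epi8('1'));   // taken where the mask bit is 1
}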
AVX2 with bits in printing order (MSB at lowest address), bytes in mem order, as ASCII '0' / '1'
With [] delimiters and \t tabs like this output format, from this codereview Q&A:
[01000000] [01000010] [00001111] [00000000]
Obviously if you want all 16 or 32 ASCII digits contiguous, that's easier and doesn't require shuffling the output to store each 8-byte chunk separately. Most of the reason for posting this here is that it has the shuffle and mask constants in the right order for printing, and to show a version optimized for ASCII output after it turned out that's what the question really wanted.
Using How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?, basically a 256-bit version of the SSSE3 code.
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>
#include <string.h>
// https://stackoverflow.com/questions/21622212/how-to-perform-the-inverse-of-mm256-movemask-epi8-vpmovmskb
void binary_dump_4B_avx2(const void *input)
{
char buf[CHAR_BIT*4 + 2*4 + 3 + 1 + 1]; // bits, 4x [], 3x \t, \n, 0
buf[0] = '[';
for (int i=9 ; i<sizeof(buf) - 8; i+=11){ // GCC strangely doesn't unroll this loop
memcpy(&buf[i], "]\t[", 4); // 4-byte store as a single; we overlap the 0 later
}
__m256i v = _mm256_castps_si256(_mm256_broadcast_ss(input)); // aliasing-safe load; use _mm256_set1_epi32 if you know you have an int
const __m256i shuffle = _mm256_setr_epi64x(0x0000000000000000, // low byte first, bytes in little-endian memory order
0x0101010101010101, 0x0202020202020202, 0x0303030303030303);
v = _mm256_shuffle_epi8(v, shuffle);
// __m256i bit_mask = _mm256_set1_epi64x(0x8040201008040201); // low bits to low bytes
__m256i bit_mask = _mm256_set1_epi64x(0x0102040810204080); // MSB to lowest byte; printing order
v = _mm256_and_si256(v, bit_mask); // x & mask == mask
// v = _mm256_cmpeq_epi8(v, _mm256_setzero_si256()); // -1 / 0 bytes
// v = _mm256_add_epi8(v, _mm256_set1_epi8('1')); // '0' / '1' bytes
v = _mm256_cmpeq_epi8(v, bit_mask); // 0 / -1 bytes
v = _mm256_sub_epi8(_mm256_set1_epi8('0'), v); // '0' / '1' bytes
__m128i lo = _mm256_castsi256_si128(v);
_mm_storeu_si64(buf+1, lo);
_mm_storeh_pi((__m64*)&buf[1+8+3], _mm_castsi128_ps(lo));
// TODO?: shuffle first and last bytes into the high lane initially to allow 16-byte vextracti128 stores, with later stores overlapping to replace garbage.
__m128i hi = _mm256_extracti128_si256(v, 1);
_mm_storeu_si64(buf+1+11*2, hi);
_mm_storeh_pi((__m64*)&buf[1+11*3], _mm_castsi128_ps(hi));
// buf[32 + 2*4 + 3] = '\n';
// buf[32 + 2*4 + 3 + 1] = '\0';
// fputs
memcpy(&buf[32 + 2*4 + 2], "]", 2); // including '\0'
puts(buf); // appends a newline
// appending our own newline and using fputs or fwrite is probably more efficient.
}
void binary_dump(const void *input, size_t bytecount) {
}
// not shown: portable version, see Godbolt, or my or #chux's answer on the codereview question
int main(void)
{
int t = 1000000;
binary_dump_4B_avx2(&t);
binary_dump(&t, sizeof(t));
t++;
binary_dump_4B_avx2(&t);
binary_dump(&t, sizeof(t));
}
Runnable Godbolt demo with gcc -O3 -march=haswell.
Note that GCC10.3 and earlier are dumb and duplicate the AND/CMPEQ vector constant, once as bytes and once as qwords. (In that case, comparing against zero would be better, or using OR with an inverted mask and comparing against all-ones). GCC11.1 fixes that with a .set .LC1,.LC2, but still loads it twice, as memory operands instead of loading once into a register. Clang doesn't have either of these problems.
Fun fact: clang -march=icelake-client manages to turn the 2nd part of this into an AVX-512 masked blend between '0' and '1' vectors, but instead of just kmov it uses a broadcast-load, vpermb byte shuffle, then test-into-mask with the bitmask.
For each bit in the mask, you want to move a bit at position n to the low-order bit of the byte at position n, i.e. bit position 8 * n. You can do this with a loop:
__uint128_t intrinsic_bits_to_bytes(uint16_t mask)
{
int i;
__uint128_t result = 0;
for (i=0; i<16; i++) {
result |= (__uint128_t )((mask >> i) & 1) << (8 * i);
}
return result;
}
If you can use AVX512, you can do it in one instruction, no loop:
#include <immintrin.h>
__m128i intrinsic_bits_to_bytes(uint16_t mask16) {
const __m128i zeroes = _mm_setzero_si128();
const __m128i ones = _mm_set1_epi8(1);
return _mm_mask_blend_epi8(mask16, zeroes, ones); // where a mask bit is set, take the byte from 'ones'
}
For building with gcc, I use:
g++ -std=c++11 -march=native -O3 src.cpp -pthread
This will build OK, but if your processor doesn't support AVX512, it will raise an illegal-instruction fault at run time.
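A sketch of one way to guard against that (GCC/Clang-specific; function names are hypothetical): compile only the AVX-512 function with a target attribute and pick the implementation at run time with __builtin_cpu_supports, falling back to the portable loop from above.
#include <immintrin.h>
#include <stdint.h>

__attribute__((target("avx512bw,avx512vl")))
static __m128i bits_to_bytes_avx512(uint16_t mask16) {
    return _mm_maskz_mov_epi8(mask16, _mm_set1_epi8(1));
}

static __uint128_t bits_to_bytes_scalar(uint16_t mask) {   // the portable loop from above
    __uint128_t result = 0;
    for (int i = 0; i < 16; i++)
        result |= (__uint128_t)((mask >> i) & 1) << (8 * i);
    return result;
}

__uint128_t bits_to_bytes(uint16_t mask) {
    if (__builtin_cpu_supports("avx512bw") && __builtin_cpu_supports("avx512vl")) {
        alignas(16) __uint128_t tmp;
        _mm_store_si128((__m128i*)&tmp, bits_to_bytes_avx512(mask));
        return tmp;
    }
    return bits_to_bytes_scalar(mask);
}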

Count array elements where there are significant 0 bits (below the highest 1 bit)

An array of 3 bytes is specified. Count the number of bytes where there is a zero after any one, i.e. where the bits below the most-significant 1 are not all 1.
{00000100, 00000011, 00001000} - for this array the answer is 2.
My code gives 1, but it is incorrect; how to fix that?
#include <iostream>
#include <bitset>
using namespace std;
int main() {
int res = 0, res1 = 0;
_int8 arr[3] = { 4, 3, 8 };
__asm {
mov ecx, 3
mov esi, 0
start_outer:
mov bx, 8
mov al, arr[esi]
start_inner :
shl al, 1
jnb zero
jc one
one :
dec bx
test bx, bx
jnz start_inner
jmp end_
zero :
dec bx
test bx, bx
jz end_
inc res
shl al, 1
jnb was_zero
jc start_inner
was_zero :
dec bx
dec res
jmp start_inner
end_ :
inc esi
loop start_outer
}
cout << res << endl;
system("pause");
}
Next try.
Please try to explain better next time; many people did not understand your question. Anyway, I hope that I have understood it now.
I will explain the algorithm used for one byte. Later in the program, we will simply run an outer loop 3 times to work on all values. And, I will of course show the result in assembler. This is one of many possible solutions.
We can observe the following:
Your statement "Count the number of bytes where there's a zero after any one." means that you want to count the number of transitions of a bit from 1 to 0 within one byte, looking at the bits from the MSB to the LSB, i.e. from left to right.
If we formulate this the other way around, then we can also count the number of transitions from 0 to 1, if we go from right to left.
A transition from 0 to 1 can always be detected by ANDing the new value with the negated old value. Example:
OldValue  NewValue  NotOldValue  And
   0         0           1        0
   0         1           1        1   --> Rising edge
   1         0           0        0
   1         1           0        0
In words: if the old (previous) value was not set and the new value is set, then we have a rising edge.
We can look at the bits of a byte one after the other by shifting the byte to the right; the new value is then the lowest bit (the LSB). We remember the previous bit, do the test, set old = new, read the next new value, do the test again, and so on. We do this for all bits.
In C++ this could look like this:
#include <iostream>
#include <bitset>
using byte = unsigned char;
byte countForByte(byte b) {
// Initialize counter variable to 0
byte counter{};
// Get the first old value. The lowest bit of the orignal array entry
byte oldValue = b & 1;
// Check all 8 bits
for (int i=0; i<8; ++i) {
// Calculate a new value. First shift to right
b = b >> 1;
// Then mask out lowest bit
byte newValue = b & 1;
// Now apply our algorithm. The result will always be 0 or 1. Add it to the counter
counter += (newValue & !oldValue);
// The next old value is the current value from this time
oldValue = newValue;
}
return counter;
}
int main() {
unsigned int x;
std::cin >> x;
std::cout << std::bitset<8>(x).to_string() << "\n";
byte s = countForByte(x);
std::cout << static_cast<int>(s) << '\n';
return 0;
}
So, for whatever reason, you want a solution in assembler. Here too, you need to tell people why you want it, what compiler you use and what target microprocessor you use. Otherwise, how can people give the correct answer?
Anyway, here is the solution for the x86 architecture, tested with MS VS2019.
#include <iostream>
int main() {
int res = 0;
unsigned char arr[3] = { 139, 139, 139 };
__asm {
mov esi, 0; index in array
mov ecx, 3; We will work with 3 array values
DoArray:
mov ah, arr[esi]; Load array value at index
mov bl, ah; Old Value
and bl, 1; Get lowest bit of old value
push ecx; Save loop Counter for outer loop
mov ecx, 7; Inner loop runs 7 times to get the result for one byte
DoTest:
shr ah, 1; This was the original given byte
mov al, ah; Get the lowest bit from the new shifted value
and al, 1; This is now new value
not bl; Invert the old value
and bl, al; Check for rising edge
movzx edi, bl
add res, edi; Calculate new result
mov bl, al; Old value = new value
loop DoTest
inc esi; Next index in array
pop ecx; Get outer loop counter
loop DoArray; Outer loop
}
std::cout << res << '\n';
return 0;
}
And for this work, I want 100 upvotes and an accepted answer . . .
Basically, user @Michael already gave the correct answer, so all credits go to him.
You can find a lot of bit-fiddling posts here on Stack Overflow. But a very good description of this kind of technique can be found in the book "Hacker’s Delight" by Henry S. Warren, Jr. (I have the 2nd edition here.)
The solution is presented in chapter 2, "Basics", section "2–1 Manipulating Rightmost Bits".
And if you manually check which values do NOT fulfill your condition, then you will find that these are
0,1,3,7,15,31,63,127,255,
or, in binary
0b0000'0000, 0b0000'0001, 0b0000'0011, 0b0000'0111, 0b0000'1111, 0b0001'1111, 0b0011'1111, 0b0111'1111, 0b1111'1111,
And we see that these values correspond to 2^n - 1. So, following "Hacker’s Delight", we can detect values that are not of this form with the simple formula
(x & (x + 1)) != 0
So, we can translate that to the following code:
#include <iostream>
int main() {
unsigned char arr[3];
unsigned int x, y, z;
std::cin >> x >> y >> z;
arr[0] = static_cast<unsigned char>(x);
arr[1] = static_cast<unsigned char>(y);
arr[2] = static_cast<unsigned char>(z);
unsigned char res = ((arr[0] & (arr[0] + 1)) != 0) + ((arr[1] & (arr[1] + 1)) != 0) + ((arr[2] & (arr[2] + 1)) != 0);
std::cout << static_cast<unsigned int>(res) << '\n';
return 0;
}
Very important: you do not need assembler code. An optimizing compiler will nearly always outperform your handwritten code.
You can check many different versions on Compiler Explorer. There you can see that your code example with static values would be optimized away completely: the compiler simply calculates everything at compile time and shows 2 as the result. So, caveat. Compiler Explorer shows you the assembly language generated by different compilers for selected hardware. You can take that if you want.
Please additionally note: the algorithm sketched above does not need any branch, except if you want to iterate over an array/vector. For this, you could write a small lambda and use algorithms from the C++ standard library.
C++ solution
#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>
#include <iterator>
int main() {
// Define Lambda to check conditions
auto add = [](const size_t& sum, const unsigned char& x) -> size_t {
return sum + static_cast<size_t>(((x & (x + 1)) == 0) ? 0U : 1U); };
// Vector with any number of test values
std::vector<unsigned char> test{ 4, 3, 8 };
// Calculate and show result
std::cout << std::accumulate(test.begin(), test.end(), 0U, add) << '\n';
return 0;
}

Why is A / <constant-int> faster when A is unsigned vs signed? [duplicate]

This question already has answers here:
performance of unsigned vs signed integers
(12 answers)
Closed 4 years ago.
I have been reading through the Optimizing C++ wikibook. In the faster operations chapter, one piece of advice is as follows:
Integer division by a constant
When you divide an integer (that is known to be positive or zero) by a
constant, convert the integer to unsigned.
If s is a signed integer, u is an unsigned integer, and C is a
constant integer expression (positive or negative), the operation s /
C is slower than u / C, and s % C is slower than u % C. This is most
significant when C is a power of two, but in all cases, the sign must
be taken into account during division.
The conversion from signed to unsigned, however, is free of charge, as
it is only a reinterpretation of the same bits. Therefore, if s is a
signed integer that you know to be positive or zero, you can speed up
its division using the following (equivalent) expressions: (unsigned)s
/ C and (unsigned)s % C.
I tested this statement with gcc, and the u / C expression seems to perform consistently better than s / C.
The example I used is provided below:
#include <iostream>
#include <chrono>
#include <cstdlib>
#include <vector>
#include <numeric>
using namespace std;
int main(int argc, char *argv[])
{
constexpr int vsize = 1e6;
std::vector<int> x(vsize);
std::iota(std::begin(x), std::end(x), 0); //0 is the starting number
constexpr int a = 5;
auto start_signed = std::chrono::system_clock::now();
int sum_signed = 0;
for ([[gnu::unused]] auto i : x)
{
// signed is by default
int v = rand() % 30 + 1985; // v in the range 1985-2014
sum_signed += v / a;
}
auto end_signed = std::chrono::system_clock::now();
auto start_unsigned = std::chrono::system_clock::now();
int sum_unsigned = 0;
for ([[gnu::unused]] auto i : x)
{
int v = rand() % 30 + 1985; // v in the range 1985-2014
sum_unsigned += static_cast<unsigned int>(v) / a;
}
auto end_unsigned = std::chrono::system_clock::now();
// signed
std::chrono::duration<double> diff_signed = end_signed - start_signed;
std::cout << "sum_signed: " << sum_signed << std::endl;
std::cout << "Time it took SIGNED: " << diff_signed.count() * 1000 << "ms" << std::endl;
// unsigned
std::chrono::duration<double> diff_unsigned = end_unsigned - start_unsigned;
std::cout << "sum_unsigned: " << sum_unsigned << std::endl;
std::cout << "Time it took UNSIGNED: " << diff_unsigned.count() * 1000 << "ms" << std::endl;
return 0;
}
You can compile and run the example here: http://cpp.sh/8kie3
Why is this happening?
After some toying around, I believe I've tracked down the source of the problem to be the guarantee by the standard that negative integer divisions are rounded towards zero since C++11. For the simplest case, which is division by two, check out the following code and the corresponding assembly (godbolt link).
constexpr int c = 2;
int signed_div(int in){
return in/c;
}
int unsigned_div(unsigned in){
return in/c;
}
Assembly:
signed_div(int):
mov eax, edi
shr eax, 31
add eax, edi
sar eax
ret
unsigned_div(unsigned int):
mov eax, edi
shr eax
ret
What do these extra instructions accomplish? shr eax, 31 (logical right shift by 31) just isolates the sign bit, meaning that if the input is non-negative, eax == 0, otherwise eax == 1. Then the input is added to eax. In other words, these two instructions translate to "if the input is negative, add 1 to it". The implications of the addition are the following (only for negative input).
If the input is even, its least significant bit is 0; the addition sets it to 1, but the shift discards it, so the output is not affected.
If the input is odd, its least significant bit was already 1, so the addition causes a carry to propagate into the rest of the digits. When the right shift occurs, the least significant bit is discarded and the output is greater by one than the output we'd have without the added sign bit. Because an arithmetic right shift in two's complement rounds towards negative infinity, the output is now the result of the same division, but rounded towards zero.
In short, even negative numbers aren't affected, and odd negative numbers are now rounded towards zero instead of towards negative infinity.
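A small worked example of that bias trick (a sketch, assuming 32-bit int and an arithmetic right shift of negative values, as in the code above): dividing -7 by 2.
#include <cassert>
int main() {
    int in = -7;
    int bias   = (unsigned)in >> 31;   // shr eax, 31 -> 1, because the input is negative
    int biased = in + bias;            // add eax, edi -> -6 (the carry ripples up for odd inputs)
    int result = biased >> 1;          // sar eax -> -3, i.e. -7 / 2 truncated toward zero
    assert(result == in / 2);
    assert((in >> 1) == -4);           // a plain arithmetic shift would round toward -infinity
    return 0;
}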
For non-power-of-2 constants it gets a bit more complicated. Not all constants give the same output, but for a lot of them it looks similar to the following (godbolt link).
constexpr int c = 3;
int signed_div(int in){
return in/c;
}
int unsigned_div(unsigned in){
return in/c;
}
Assembly:
signed_div(int):
mov eax, edi
mov edx, 1431655766
sar edi, 31
imul edx
mov eax, edx
sub eax, edi
ret
unsigned_div(unsigned int):
mov eax, edi
mov edx, -1431655765
mul edx
mov eax, edx
shr eax
ret
We don't care about the change of the constant in the assembly output, because it does not affect execution time. Assuming that mul and imul take the same amount of time (which I don't know for sure but hopefully someone more knowledgeable than me can find a source on it), the signed version once again takes longer because it has extra instructions to handle the sign bit for negative operands.
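If it helps to see why those constants work, here is a small verification sketch (constants copied from the assembly above; it assumes an arithmetic right shift of negative values, which is what these compilers generate):
#include <cassert>
#include <cstdint>
static uint32_t udiv3(uint32_t x) {                              // mul + shr path
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);        // 0xAAAAAAAB == -1431655765
}
static int32_t sdiv3(int32_t x) {                                // imul + sub path
    int32_t hi = (int32_t)(((int64_t)x * 0x55555556LL) >> 32);   // high half of the product (1431655766)
    return hi - (x >> 31);                                       // subtract the sign: adds 1 for negative x
}
int main() {
    for (uint32_t x = 0; x < 1000000; ++x) assert(udiv3(x) == x / 3);
    for (int32_t x = -1000000; x < 1000000; ++x) assert(sdiv3(x) == x / 3);
    return 0;
}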
Notes
Compilation was done on Godbolt using x86-64 GCC 7.3 with the -O2 flag.
The round-towards-zero behavior has been mandated by the standard since C++11; before that, it was implementation-defined, according to this cppreference page.

C++ external assembly: where is the error in my code?

main.cpp
// Calls the external LongRandom function, written in
// assembly language, that returns an unsigned 32-bit
// random integer. Compile in the Large memory model.
// Procedure called LongRandomArray that fills an array with 32-bit unsigned
// random integers
#include <iostream.h>
#include <conio.h>
extern "C" {
unsigned long LongRandom();
void LongRandomArray(unsigned long * buffer, unsigned count);
}
const int ARRAY_SIZE = 20;
int main()
{
// Allocate array storage and fill with 32-bit
// unsigned random integers.
unsigned long * rArray = new unsigned long[ARRAY_SIZE];
LongRandomArray(rArray,ARRAY_SIZE);
for(unsigned i = 0; i < 20; i++)
{
cout << rArray[i] << ',';
}
cout << endl;
getch();
return 0;
}
LongRandom & LongRandomArray procedure module (longrand.asm)
.model large
.386
Public _LongRandom
Public _LongRandomArray
.data
seed dd 12345678h
; Return an unsigned pseudo-random 32-bit integer
; in DX:AX,in the range 0 - FFFFFFFFh.
.code
_LongRandom proc far, C
mov eax, 214013
mul seed
xor edx,edx
add eax, 2531011
mov seed, eax ; save the seed for the next call
shld edx,eax,16 ; copy upper 16 bits of EAX to DX
ret
_LongRandom endp
_LongRandomArray proc far, C
ARG bufferPtr:DWORD, count:WORD
; fill random array
mov edi,bufferPtr
mov cx, count
L1:
call _LongRandom
mov word ptr [edi],dx
add edi,2
mov word ptr [edi],ax
add edi,2
loop L1
ret
_LongRandomArray endp
end
This code is based on a 16-bit example for MS-DOS from Kip Irvine's assembly book (6th ed.) and was explicitly written for Borland C++ 5.01 and TASM 4.0 (see chapter 13.4 "Linking to C/C++ in Real-Address Mode").
Pointers in 16-bit mode consist of a segment and an offset, usually written as segment:offset. This is not the real memory address, which is calculated by the processor from both parts. You cannot load segment:offset into a 32-bit register (EDI) and use it to store a value to memory. So
...
mov edi,bufferPtr
...
mov word ptr [edi],dx
...
is wrong. You have to load the segment part of the pointer into a segment register, e.g. ES, the offset part into an appropriate general-purpose 16-bit register, e.g. DI, and possibly use a segment override:
...
push es
les di,bufferPtr ; bufferPtr => ES:DI
...
mov word ptr es:[di],dx
...
pop es
...
ARG replaces the name of each variable with the appropriate [bp+x] operand. For that you need a prologue (and an epilogue). TASM inserts the right instructions if the PROC header is written properly, which is not the case here. Take a look at the following working function:
_LongRandomArray PROC C FAR
ARG bufferPtr:DWORD, count:WORD
push es
les di,bufferPtr
mov cx, count
L1:
call _LongRandom
mov word ptr es:[di],dx
add di,2
mov word ptr es:[di],ax
add di,2
loop L1
pop es
ret
_LongRandomArray ENDP
Compile your code with BCC (not BCC32):
BCC -ml main.cpp longrand.asm

Accessing three static arrays is quicker than one static array containing 3x data?

I have 700 items and I loop through them; for each item I obtain its three attributes and perform some basic calculations. I have implemented this using two techniques:
1) Three 700-element arrays, one array for each of the three attributes. So:
item0.a = array1[0]
item0.b = array2[0]
item0.e = array3[0]
2) One 2100-element array containing data for the three attributes consecutively. So:
item0.a = array[(0*3)+0]
item0.b = array[(0*3)+1]
item0.e = array[(0*3)+2]
Now the three item attributes a, b and e are used together within the loop- therefore it would make sense that if you store them in one array the performance should be better than if you use the three-array technique (due to spatial locality). However:
Three 700-element arrays = 3300 CPU cycles on average for the whole loop
One 2100-element array = 3500 CPU cycles on average for the whole loop
Here is the code for the 2100-array technique:
unsigned int x;
unsigned int y;
double c = 0;
double d = 0;
bool data_for_all_items = true;
unsigned long long start = 0;
unsigned long long finish = 0;
unsigned int array[2100];
//I have left out code for simplicity. You can assume by now the array is populated.
start = __rdtscp(&x);
for(int i=0; i < 700; i++){
unsigned short j = i * 3;
unsigned int a = array[j + 0];
unsigned int b = array[j + 1];
data_for_all_items = data_for_all_items & (a!= -1 & b != -1);
unsigned int e = array[j + 2];
c += (a * e);
d += (b * e);
}
finish = __rdtscp(&y);
and here is the code for the three 700-element arrays technique:
unsigned int x;
unsigned int y;
double c = 0;
double d = 0;
bool data_for_all_items = true;
unsigned long long start = 0;
unsigned long long finish = 0;
unsigned int array1[700];
unsigned int array2[700];
unsigned int array3[700];
//I have left out code for simplicity. You can assume by now the arrays are populated.
start = __rdtscp(&x);
for(int i=0; i < 700; i++){
unsigned int a= array1[i]; //Array 1
unsigned int b= array2[i]; //Array 2
data_for_all_items = data_for_all_items & (a!= -1 & b != -1);
unsigned int e = array3[i]; //Array 3
c += (a * e);
d += (b * e);
}
finish = __rdtscp(&y);
Why isn't the technique using one 2100-element array faster? It should be, because the three attributes are used together for each of the 700 items.
I used MSVC 2012, Win 7 64
Assembly for 3x 700-element array technique:
start = __rdtscp(&x);
rdtscp
shl rdx,20h
lea r8,[this]
or rax,rdx
mov dword ptr [r8],ecx
mov r8d,8ch
mov r9,rax
lea rdx,[rbx+0Ch]
for(int i=0; i < 700; i++){
sub rdi,rbx
unsigned int a = array1[i];
unsigned int b = array2[i];
data_for_all_items = data_for_all_items & (a != -1 & b != -1);
cmp dword ptr [rdi+rdx-0Ch],0FFFFFFFFh
lea rdx,[rdx+14h]
setne cl
cmp dword ptr [rdi+rdx-1Ch],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdi+rdx-18h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdi+rdx-10h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdi+rdx-14h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-20h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-1Ch],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-18h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-10h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-14h],0FFFFFFFFh
setne al
and cl,al
and r15b,cl
dec r8
jne 013F26DA53h
unsigned int e = array3[i];
c += (a * e);
d += (b * e);
}
finish = __rdtscp(&y);
rdtscp
shl rdx,20h
lea r8,[y]
or rax,rdx
mov dword ptr [r8],ecx
Assembler for the 2100-element array technique:
start = __rdtscp(&x);
rdtscp
lea r8,[this]
shl rdx,20h
or rax,rdx
mov dword ptr [r8],ecx
for(int i=0; i < 700; i++){
xor r8d,r8d
mov r10,rax
unsigned short j = i*3;
movzx ecx,r8w
add cx,cx
lea edx,[rcx+r8]
unsigned int a = array[j + 0];
unsigned int b = array[j + 1];
data_for_all_items = data_for_all_items & (best_ask != -1 & best_bid != -1);
movzx ecx,dx
cmp dword ptr [r9+rcx*4+4],0FFFFFFFFh
setne dl
cmp dword ptr [r9+rcx*4],0FFFFFFFFh
setne al
inc r8d
and dl,al
and r14b,dl
cmp r8d,2BCh
jl 013F05DA10h
unsigned int e = array[pos + 2];
c += (a * e);
d += (b * e);
}
finish = __rdtscp(&y);
rdtscp
shl rdx,20h
lea r8,[y]
or rax,rdx
mov dword ptr [r8],ecx
Edit: Given your assembly code, the loop over the three separate 700-element arrays is unrolled five times. The unrolled version could run faster on an out-of-order execution CPU such as any modern x86/x86-64 CPU.
The three-separate-arrays code is vectorisable - two elements of each array could be loaded at each iteration into one XMM register each. Since modern CPUs use SSE for both scalar and vector FP arithmetic, this cuts the number of cycles roughly in half. With an AVX-capable CPU, four doubles could be loaded into a YMM register, and therefore the number of cycles should be cut by four.
The interleaved single-array loop is not vectorisable along i, since the value of a in iteration i+1 comes from a location 3 elements after the one where the value of a in iteration i comes from. In that case vectorisation requires gathered vector loads, and those are only supported in the AVX2 instruction set.
Using proper data structures is crucial when programming CPUs with vector capabilities. Converting code like your interleaved loop into something like your separate-arrays loop is 90% of the job that one has to do in order to get good performance on an Intel Xeon Phi, which has very wide vector registers but an awfully slow in-order execution engine.
The simple answer is that version 1 is SIMD friendly and version 2 is not. However, it's possible to make version 2, the 2100-element array, SIMD friendly. You need to use a Hybrid Struct of Arrays, aka an Array of Struct of Arrays (AoSoA). You arrange the array like this: aaaa bbbb eeee aaaa bbbb eeee ....
Below is code using GCC's vector extensions to do this. Note that now the 2100-element array code looks almost the same as the 700-element array code, but it uses one array instead of three. And instead of having 700 elements between a, b and e, there are only 12 elements between them.
I did not find an easy solution to convert uint4 to double4 with the GCC vector extensions, and I don't want to spend the time writing intrinsics to do this right now, so I made c and v unsigned int; but for performance I would not want to be converting uint4 to double4 in a loop anyway.
typedef unsigned int uint4 __attribute__ ((vector_size (16)));
//typedef double double4 __attribute__ ((vector_size (32)));
uint4 zero = {};
unsigned int array[2100];
uint4 test = -1 + zero;
//double4 cv = {};
//double4 dv = {};
uint4 cv = {};
uint4 dv = {};
uint4* av = (uint4*)&array[0];
uint4* bv = (uint4*)&array[4];
uint4* ev = (uint4*)&array[8];
for(int i=0; i < 525; i+=3) { //525 = 2100/4 = 700/4*3
test = test & ((av[i]!= -1) & (bv[i] != -1));
cv += (av[i] * ev[i]);
dv += (bv[i] * ev[i]);
}
double c = cv[0] + cv[1] + cv[2] + cv[3];
double v = dv[0] + dv[1] + dv[2] + dv[3];
bool data_for_all_items = test[0] & test[1] & test[2] & test[3];
The concept of 'spatial locality' is throwing you off a little bit. Chances are that with both solutions, your processor is doing its best to cache the arrays.
Unfortunately, the version of your code that uses one array also has some extra math being performed. This is probably where your extra cycles are being spent.
Spatial locality is indeed useful, but it's actually helping you on the second case (3 distinct arrays) much more.
The cache line size is 64 Bytes (note that it doesn't divide in 3), so a single access to a 4 or 8 byte value is effectively prefetching the next elements. In addition, keep in mind that the CPU HW prefetcher is likely to go on and prefetch ahead even further elements.
However, when a, b and e are packed together, you're "wasting" this valuable prefetching on elements of the same iteration. When you access a, there's no point in prefetching b and e - the next loads are already going there (and would likely just merge in the CPU with the first load or wait for it to retrieve the data). In fact, when the arrays are merged, you fetch a new memory line only once per 64/(3*4) = ~5.3 iterations. The bad alignment even means that on some iterations you'll have a and maybe b long before you get e; this imbalance is usually bad news.
In reality, since the iterations are independent, your CPU would go ahead and start the second iteration relatively fast, thanks to the combination of loop unrolling (in case it was done) and out-of-order execution (calculating the index for the next set of iterations is simple and has no dependencies on the loads sent by the last ones). However, you would have to run ahead pretty far in order to issue the next load every time, and eventually the finite size of the CPU instruction queues will block you, maybe before reaching the full potential memory bandwidth (number of parallel outstanding loads).
The alternative option, on the other hand, where you have 3 distinct arrays, uses the spatial locality / HW prefetching solely across iterations. On each iteration you'll issue 3 loads, which would fetch a full line once every 64/4 = 16 iterations. The overall data fetched is the same (well, it's the same data), but the timeliness is much better because you fetch ahead for the next 16 iterations instead of the 5. The difference becomes even bigger when HW prefetching is involved, because you have 3 streams instead of one, meaning you can issue more prefetches (and look even further ahead).