This seems a fairly large topic. For example, if you try to cast (convert) a negative float to an unsigned int, it doesn't work. So I am now reading about two's complement, promotion and bit patterns, and how you convert between negative and positive floats/integers. For example, the value stays -1 in the example below on VS 2010.
float x = -1;
unsigned int y = (unsigned int)x;
printf("y:%u", y);
So how exactly are negative integers stored in memory in terms of bit patterns? What options are there in C++ for converting them? Can you do this via bit shifting? What is the best way to do this?
So how exactly are negative integers stored in memory in terms of bit patterns
To get some better understanding of the representation of negative integer values, use the following code to play with it:
#include <iostream>
#include <bitset>
#include <cstdint>
void printBitWise(std::ostream& os, uint8_t* data, size_t size) {
for(size_t i = 0; i < size; ++i) {
for(uint8_t j = 0; j < 8; ++j) {
if((data[i] >> j) & 1) {
os << '1';
}
else {
os << '0';
}
}
}
}
int main() {
int x = -1;
std::bitset<sizeof(int) * 8> bitwise1(x);
std::cout << bitwise1.to_string() << std::endl;
int y = -2;
std::bitset<sizeof(int) * 8> bitwise2(y);
std::cout << bitwise2.to_string() << std::endl;
float a = -1;
printBitWise(std::cout,reinterpret_cast<uint8_t*>(&a),sizeof(float));
std::cout << std::endl;
double b = -1;
printBitWise(std::cout,reinterpret_cast<uint8_t*>(&b),sizeof(double));
std::cout << std::endl;
float c = -2;
printBitWise(std::cout,reinterpret_cast<uint8_t*>(&c),sizeof(float));
std::cout << std::endl;
double d = -2;
printBitWise(std::cout,reinterpret_cast<uint8_t*>(&d),sizeof(double));
std::cout << std::endl;
return 0;
}
Output:
11111111111111111111111111111111
11111111111111111111111111111110
00000000000000000000000111111101
0000000000000000000000000000000000000000000000000000111111111101
00000000000000000000000000000011
0000000000000000000000000000000000000000000000000000000000000011
The bit format of float and double values is a different story. It's described with the IEEE floating point format, and may be compiler implementation specific regarding specific behaviors (e.g. 'rounding rules' or 'operations').
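To make the float layout concrete, here is a minimal sketch that splits a float into its sign, exponent, and mantissa fields. It assumes float is IEEE-754 binary32 (1 sign bit, 8 exponent bits, 23 mantissa bits); the names FloatFields and decodeFloat are made up for illustration.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the three fields of an IEEE-754 binary32 value.
struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t mantissa;  // 23 bits, without the implicit leading 1
};

// Assumption: float is IEEE-754 binary32 on this platform.
FloatFields decodeFloat(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // well-defined way to read the bit pattern
    return FloatFields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}
```

For -1.0f this yields sign 1, exponent 127, mantissa 0; for -2.0f only the exponent changes, to 128.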
In your program, the variable x is of type float, so the machine needs to convert it to an integer type. On Intel processors, the instruction is "cvttss2si". Please check http://en.wikipedia.org/wiki/Single-precision_floating-point_format to see how a float is represented in binary format.
For the code snippet that you gave, I tested with g++ and VS 2013. Both work as expected and print "y:-1".
#include <cstdio>
int main()
{
float x = -1;
unsigned int y;
y = (unsigned int)x;
printf("y:%d", y);
return 0;
}
However, in this program, the compiler does the float to integer conversion for us.
movl $-1, %eax
movl %eax, -12(%rbp)
movl -12(%rbp), %esi
movb $0, %al
callq _printf
The following sample program can reveal how the machine does the float to integer conversion:
#include <cstdio>
int main()
{
float x ;
scanf("%f", &x);
unsigned int y;
y = (unsigned int)x;
printf("y:%d", y);
return 0;
}
Here is the assembly, which shows that cvttss2si does the float-to-integer conversion (http://www.jaist.ac.jp/iscenter-new/mpc/altix/altixdata/opt/intel/vtune/doc/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc68.htm).
cvttss2si -8(%rbp), %rsi
movl %esi, %ecx
movl %ecx, -12(%rbp)
movl -12(%rbp), %esi
movq -24(%rbp), %rdi ## 8-byte Reload
movl %eax, -28(%rbp) ## 4-byte Spill
movb $0, %al
callq _printf
On many platforms, the sign of a number is indicated by a reserved bit.
With two's complement integers, the Most Significant Bit (MSB) indicates the sign, when set the value is negative, when clear, the value is positive. However, setting the bit may not correctly convert the value from positive to negative.
In many floating point formats, there is a bit reserved to indicate the sign of the number. You'll have to research the various floating point standard formats, especially the ones used by your platform and compiler.
The best and most portable method to convert from negative numbers to positive is to use the abs family of functions. Remember, this is with signed data types.
To convert from positive to negative, multiply by -1 or -1.0.
Negative numbers are not defined for the unsigned types.
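As a small sketch of the advice above (the function names are illustrative, not from any library): converting a negative signed integer to an unsigned type is well defined and wraps modulo 2^N, whereas converting a float of -1 or below directly to an unsigned type is undefined behavior in C++, so take the magnitude first.

```cpp
#include <cmath>
#include <cstdint>

// Signed -> unsigned conversion wraps modulo 2^32, so -1 becomes 4294967295.
std::uint32_t wrapToUnsigned(std::int32_t v) {
    return static_cast<std::uint32_t>(v);
}

// Take the magnitude before converting.
// Assumption: the truncated value of |x| fits in a uint32_t.
std::uint32_t magnitudeToUnsigned(float x) {
    return static_cast<std::uint32_t>(std::fabs(x));
}
```

Which of the two you want depends on whether the bit pattern or the magnitude is the thing that matters to the caller.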
When an integer is converted to floating-point, and the value cannot be directly represented by the destination type, the nearest value is usually selected (required by IEEE-754).
I would like to convert an integer to floating-point with rounding towards zero in case the integer value cannot be directly represented by the floating-point type.
Example:
int i = 2147483647;
float nearest = static_cast<float>(i); // 2147483648 (likely)
float towards_zero = convert(i); // 2147483520
Since C++11, one can use fesetround(), the floating-point environment rounding direction manager. There are four standard rounding directions and an implementation is permitted to add additional rounding directions.
#include <cfenv> // for fesetround() and FE_* macros
#include <iostream> // for cout and endl
#include <iomanip> // for setprecision()
#pragma STDC FENV_ACCESS ON
int main(){
int i = 2147483647;
std::cout << std::setprecision(10);
std::fesetround(FE_DOWNWARD);
std::cout << "round down " << i << " : " << static_cast<float>(i) << std::endl;
std::cout << "round down " << -i << " : " << static_cast<float>(-i) << std::endl;
std::fesetround(FE_TONEAREST);
std::cout << "round to nearest " << i << " : " << static_cast<float>(i) << std::endl;
std::cout << "round to nearest " << -i << " : " << static_cast<float>(-i) << std::endl;
std::fesetround(FE_TOWARDZERO);
std::cout << "round toward zero " << i << " : " << static_cast<float>(i) << std::endl;
std::cout << "round toward zero " << -i << " : " << static_cast<float>(-i) << std::endl;
std::fesetround(FE_UPWARD);
std::cout << "round up " << i << " : " << static_cast<float>(i) << std::endl;
std::cout << "round up " << -i << " : " << static_cast<float>(-i) << std::endl;
return(0);
}
Compiled under g++ 7.5.0, the resulting executable outputs
round down 2147483647 : 2147483520
round down -2147483647 : -2147483648
round to nearest 2147483647 : 2147483648
round to nearest -2147483647 : -2147483648
round toward zero 2147483647 : 2147483520
round toward zero -2147483647 : -2147483520
round up 2147483647 : 2147483648
round up -2147483647 : -2147483520
Omitting the #pragma doesn't seem to change anything under g++.
@chux comments correctly that the standard doesn't explicitly state that fesetround() affects rounding in static_cast<float>(i). For a guarantee that the set rounding direction affects the conversion, use std::nearbyint and its -f and -l variants. See also std::rint and its many type-specific variants.
I probably should have looked up the format specifier to use a space for positive integers and floats, rather than stuffing it into the preceding string constants.
(I haven't tested the following snippet.) Your convert() function would be something like
float convert(int i, int direction = FE_TOWARDZERO){
float retVal = 0.;
int prevdirection = std::fegetround();
std::fesetround(direction);
retVal = static_cast<float>(i);
std::fesetround(prevdirection);
return(retVal);
}
You can use std::nextafter.
int i = 2147483647;
float nearest = static_cast<float>(i); // 2147483648 (likely)
float towards_zero = std::nextafter(nearest, 0.f); // 2147483520
But you have to check whether static_cast<float>(i) is exact; if it is, nextafter would go one step towards 0, which you probably don't want.
Your convert function might look like this:
float convert(int x){
if(std::abs(long(static_cast<float>(x))) <= std::abs(long(x)))
return static_cast<float>(x);
return std::nextafter(static_cast<float>(x), 0.f);
}
It may be that sizeof(int)==sizeof(long) or even sizeof(int)==sizeof(long long); in that case long(...) may have undefined behavior when static_cast<float>(x) exceeds the representable values. Depending on the compiler, it might still work in these cases.
I understand the question to be restricted to platforms that use IEEE-754 binary floating-point arithmetic, and where float maps to IEEE-754 (2008) binary32. This answer assumes this to be the case.
As other answers have pointed out, if the tool chain and the platform supports this, use the facilities supplied by fenv.h to set the rounding mode for the conversion as desired.
Where those are not available, or slow, it is not difficult to emulate the truncation during int to float conversion. Basically, normalize the integer until the most significant bit is set, recording the required shift count. Now, shift the normalized integer into place to form the mantissa, compute the exponent based on the normalization shift count, and add in the sign bit based on the sign of the original integer. The process of normalization can be sped up significantly if a clz (count leading zeros) primitive is available, maybe as an intrinsic.
The exhaustively tested code below demonstrates this approach for 32-bit integers, see function int32_to_float_rz. I successfully built it as both C and C++ code with the Intel compiler version 13.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <fenv.h>
float int32_to_float_rz (int32_t a)
{
uint32_t i = (uint32_t)a;
int shift = 0;
float r;
// take absolute value of integer
if (a < 0) i = 0 - i;
// normalize integer so MSB is set
if (!(i > 0x0000ffffU)) { i <<= 16; shift += 16; }
if (!(i > 0x00ffffffU)) { i <<= 8; shift += 8; }
if (!(i > 0x0fffffffU)) { i <<= 4; shift += 4; }
if (!(i > 0x3fffffffU)) { i <<= 2; shift += 2; }
if (!(i > 0x7fffffffU)) { i <<= 1; shift += 1; }
// form mantissa with explicit integer bit
i = i >> 8;
// add in exponent, taking into account integer bit of mantissa
if (a != 0) i += (127 + 31 - 1 - shift) << 23;
// add in sign bit
if (a < 0) i |= 0x80000000;
// reinterpret bit pattern as 'float'
memcpy (&r, &i, sizeof r);
return r;
}
#pragma STDC FENV_ACCESS ON
float int32_to_float_rz_ref (int32_t a)
{
float r;
int orig_mode = fegetround ();
fesetround (FE_TOWARDZERO);
r = (float)a;
fesetround (orig_mode);
return r;
}
int main (void)
{
int32_t arg;
float res, ref;
arg = 0;
do {
res = int32_to_float_rz (arg);
ref = int32_to_float_rz_ref (arg);
if (res != ref) {
printf ("error @ %08x: res=% 14.6a ref=% 14.6a\n", arg, res, ref);
return EXIT_FAILURE;
}
arg++;
} while (arg);
return EXIT_SUCCESS;
}
A C implementation-dependent solution that I am confident has a C++ counterpart.
Temporarily change the rounding mode; the conversion uses that to determine which way to go in inexact cases.
the nearest value is usually selected (required by IEEE-754).
Is not entirely accurate. The inexact case is rounding mode dependent.
C does not specify this behavior. C allows this behavior, as it is implementation-defined.
If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner.
#include <fenv.h>
float convert(int i) {
#pragma STDC FENV_ACCESS ON
int save_round = fegetround();
fesetround(FE_TOWARDZERO);
float f = (float) i;
fesetround(save_round);
return f;
}
A specified approach.
"the nearest value is usually selected (required by IEEE-754)" implies OP expects IEEE-754 is involved. Many C/C++ implementation do follow much of IEEE-754, yet adherence to that spec is not required. The following relies on C specifications.
Conversion of an integer type to a floating point type is specified as below. Notice conversion is not specified to depend on rounding mode.
When a value of integer type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged. If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. C17dr § 6.3.1.4 2
When the result is not exact, is the converted value the nearest higher or the nearest lower?
A round trip int --> float --> int is warranted.
Round tripping needs to watch out for convert(near_INT_MAX) converting to outside the int range.
Rather than rely on long or long long having a wider range than int (C does not specify this property), let code compare on the negative side as INT_MIN (with 2's complement) can be expected to convert exactly to a float.
float convert(int i) {
int n = (i < 0) ? i : -i; // n <= 0
float f = (float) n;
int rt_n = (int) f; // Overflow not expected on the negative side
// If f rounded away from 0.0 ...
if (rt_n < n) {
f = nextafterf(f, 0.0); // Move toward 0.0
}
return (i < 0) ? f : -f;
}
Changing the rounding mode is somewhat expensive, although I think some modern x86 CPUs do rename MXCSR so it doesn't have to drain the out-of-order execution back-end.
If you care about performance, benchmarking njuffa's pure integer version (using shift = __builtin_clz(i); i<<=shift;) against the rounding-mode-changing version would make sense. (Make sure to test in the context you want to use it in; it's so small that it matters how well it overlaps with surrounding code.)
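A sketch of what that clz-based variant could look like, for use in such a benchmark. It assumes GCC or Clang (for __builtin_clz), a 32-bit unsigned int, and float being IEEE-754 binary32; the name int32_to_float_rz_clz is made up here, and the body simply replaces the five-step normalization of int32_to_float_rz with one count-leading-zeros call.

```cpp
#include <cstdint>
#include <cstring>

// Truncating (round-toward-zero) int32 -> float conversion.
// Assumptions: GCC/Clang builtin, 32-bit unsigned int, IEEE-754 binary32 float.
float int32_to_float_rz_clz(std::int32_t a) {
    if (a == 0) return 0.0f;                 // __builtin_clz(0) is undefined
    std::uint32_t i = static_cast<std::uint32_t>(a);
    if (a < 0) i = 0u - i;                   // absolute value, safe for INT32_MIN
    int shift = __builtin_clz(i);            // number of leading zero bits
    i <<= shift;                             // normalize so the MSB is set
    i >>= 8;                                 // keep 24 bits; low bits are truncated
    // add in exponent, taking into account the integer bit of the mantissa
    i += static_cast<std::uint32_t>(127 + 31 - 1 - shift) << 23;
    if (a < 0) i |= 0x80000000u;             // sign bit
    float r;
    std::memcpy(&r, &i, sizeof r);           // reinterpret bit pattern as float
    return r;
}
```

For 2147483647 this produces 2147483520, the same value the FE_TOWARDZERO conversion prints above.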
AVX-512 can use rounding-mode overrides on a per-instruction basis, letting you use a custom rounding mode for the conversion at basically the same cost as a normal int->float. (Only available on Intel Skylake-server and Ice Lake CPUs so far, unfortunately.)
#include <immintrin.h>
float int_to_float_trunc_avx512f(int a) {
const __m128 zero = _mm_setzero_ps(); // SSE scalar int->float are badly designed to merge into another vector, instead of zero-extend. Short-sighted Pentium-3 decision never changed for AVX or AVX512
__m128 v = _mm_cvt_roundsi32_ss (zero, a, _MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC);
return _mm_cvtss_f32(v); // the low element of a vector already is a scalar float so this is free.
}
_mm_cvt_roundi32_ss is a synonym, IDK why Intel defined both i and si names, or if some compilers might only have one.
This compiles efficiently with all 4 mainstream x86 compilers (GCC/clang/MSVC/ICC) on the Godbolt compiler explorer.
# gcc10.2 -O3 -march=skylake-avx512
int_to_float_trunc_avx512f:
vxorps xmm0, xmm0, xmm0
vcvtsi2ss xmm0, xmm0, {rz-sae}, edi
ret
int_to_float_plain:
vxorps xmm0, xmm0, xmm0 # GCC is always cautious about false dependencies, spending an extra instruction to break it, like we did with setzero()
vcvtsi2ss xmm0, xmm0, edi
ret
In a loop, the same zeroed register can be reused as a merge target, allowing the vxorps zeroing to be hoisted out of a loop.
Using _mm_undefined_ps() instead of _mm_setzero_ps(), we can get ICC to skip zeroing XMM0 before converting into it, like clang does for plain (float)i in this case. But ironically, clang which is normally cavalier and reckless about false dependencies compiles _mm_undefined_ps() the same as setzero in this case.
The performance in practice of vcvtsi2ss (scalar integer to scalar single-precision float) is the same whether you use a rounding-mode override or not (2 uops on Ice Lake, same latency: https://uops.info/). The AVX-512 EVEX encoding is 2 bytes longer than the AVX1.
Rounding mode overrides also suppress FP exceptions (like "inexact"), so you couldn't check the FP environment to later detect if the conversion happened to be exact (no rounding). But in this case, converting back to int and comparing would be fine. (You can do that without risk of overflow because of the rounding towards 0).
A simple solution is to use a higher precision floating point for comparison. As long as the high precision floating point can exactly represent all integers, we can accurately compare whether the float result was greater.
double should be sufficient with 32 bit integers, and long double is sufficient for 64 bit on most systems, but it's good practice to verify it.
float convert(int x) {
static_assert(std::numeric_limits<double>::digits
>= sizeof(int) * CHAR_BIT);
float f = x;
double d = x;
return std::abs(f) > std::abs(d)
? std::nextafter(f, 0.f)
: f;
}
Shift the integer right by an arithmetic shift until the number of bits agrees with the precision of the floating point arithmetic. Count the shifts.
Convert the integer to float. The result is now precise.
Multiply the resulting float by a power of two corresponding to the number of shifts.
For nonnegative values, this can be done by taking the integer value and shifting right until the highest set bit is less than 24 bits (i.e. the precision of IEEE single) from the right, then shifting back.
For negative values, an arithmetic right shift rounds toward negative infinity rather than toward zero (for example, -16777217 >> 1 is -8388609, which shifts back to -16777218), so apply the same truncation to the magnitude instead: take the absolute value as an unsigned integer, shift right and back as above, convert to float, and negate the result. Working in unsigned also avoids the undefined behavior of left-shifting a negative value.
Note also that this assumes float is IEEE754 and int is 32-bit two's complement.
float round_to_zero(int x)
{
// work on the magnitude so truncation is toward zero for both signs
unsigned int mag = (x < 0) ? 0u - (unsigned)x : (unsigned)x;
int cnt = 0;
while (mag != (mag & 0xffffff)) {
mag >>= 1;
cnt++;
}
// at most 24 significant bits remain, so this conversion is exact
float f = (float)(mag << cnt);
return (x < 0) ? -f : f;
}
I'm pondering at how to speed up bit testing in the following routine:
void histSubtractFromBits(uint64* cursor, uint16* hist){
//traverse each bit of the 256-bit-long bitstring by splitting up into 4 bitsets
std::bitset<64> a(*cursor);
std::bitset<64> b(*(cursor+1));
std::bitset<64> c(*(cursor+2));
std::bitset<64> d(*(cursor+3));
for(int bit = 0; bit < 64; bit++){
hist[bit] -= a.test(bit);
}
for(int bit = 0; bit < 64; bit++){
hist[bit+64] -= b.test(bit);
}
for(int bit = 0; bit < 64; bit++){
hist[bit+128] -= c.test(bit);
}
for(int bit = 0; bit < 64; bit++){
hist[bit+192] -= d.test(bit);
}
}
The actual gcc implementation does a range-check for the bit argument, then &-s with a bitmask. I could do it without the bitsets and with my own bit-shifting / masking, but I'm fairly certain that won't yield any significant speedup (tell me if I'm wrong and why).
I'm not really familiar with the x86-64 assembly, but I am aware of a certain bit test instruction, and I am aware that it's theoretically possible to do inline assembly with gcc.
1) Do you think it at all worthwhile to write an inline-assembly analogue for the above code?
2) If yes, then how would I go about doing it, i.e. could you show me some basic starter code / samples to point me in the right direction?
As far as I can tell, you basically iterate over each bit. As such, I'd imagine simply shifting and masking off the LSB every time should provide good performance. Something like:
uint64_t a = *cursor;
for(int bit = 0; a != 0; bit++, a >>= 1) {
hist[bit] -= (a & 1);
}
Alternatively, if you expect only very few bits to be set and are happy with gcc specific stuff, you could use __builtin_ffsll
uint64_t a = *cursor;
int next;
for(int bit = 0; (next = __builtin_ffsll(a)) != 0; ) {
bit += next;
hist[bit - 1] -= 1;
a >>= next;
}
The idea should be fine, but no warranty for the actual code :)
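One caveat with the __builtin_ffsll loop above: when the only remaining set bit is bit 63, next is 64 and a >>= 64 is undefined behavior. A variant that avoids shifting altogether is sketched below; it clears the lowest set bit with a &= a - 1 and uses the GCC/Clang builtin __builtin_ctzll to find its index. The function name is hypothetical.

```cpp
#include <cstdint>

// Subtract 1 from hist[bit] for every set bit in a (hist must have 64 entries).
// Assumption: GCC or Clang, for __builtin_ctzll.
void histSubtractSetBits(std::uint64_t a, std::uint16_t* hist) {
    while (a != 0) {
        int bit = __builtin_ctzll(a);  // index of the lowest set bit (a != 0 here)
        hist[bit] -= 1;
        a &= a - 1;                    // clear that bit
    }
}
```

Like the original, this only pays off when few bits are set; for dense bit strings the plain shift-and-mask loop does less work per bit.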
Update: code using vector extensions:
typedef short v8hi __attribute__ ((vector_size (16)));
static v8hi table[256];
void histSubtractFromBits(uint64_t* cursor, uint16_t* hist)
{
uint8_t* cursor_tmp = (uint8_t*)cursor;
v8hi* hist_tmp = (v8hi*)hist;
for(int i = 0; i < 32; i++, cursor_tmp++, hist_tmp++)
{
*hist_tmp -= table[*cursor_tmp];
}
}
void setup_table()
{
for(int i = 0; i < 256; i++)
{
for(int j = 0; j < 8; j++)
{
table[i][j] = (i >> j) & 1;
}
}
}
This will be compiled to SSE instructions if available, for example I get:
leaq 32(%rdi), %rdx
.p2align 4,,10
.p2align 3
.L2:
movzbl (%rdi), %eax
addq $1, %rdi
movdqa (%rsi), %xmm0
salq $4, %rax
psubw table(%rax), %xmm0
movdqa %xmm0, (%rsi)
addq $16, %rsi
cmpq %rdx, %rdi
jne .L2
Of course this approach relies on the table being in cache.
Another suggestion is to combine data caching, registers and loop unrolling:
// Assuming your processor has 64-bit words
void histSubtractFromBits(uint64_t const * cursor, uint16* hist)
{
register uint64_t a = *cursor++;
register uint64_t b = *cursor++;
register uint64_t c = *cursor++;
register uint64_t d = *cursor++;
register unsigned int i = 0;
for (i = 0; i < sizeof(*cursor) * CHAR_BIT; ++i)
{
hist[i + 0] += a & 1;
hist[i + 64] += b & 1;
hist[i + 128] += c & 1;
hist[i + 192] += d & 1;
a >>= 1;
b >>= 1;
c >>= 1;
d >>= 1;
}
}
I'm not sure if you gain any more performance by reordering the instructions like this:
hist[i + 0] += a & 1;
a >>= 1;
You could try both ways and compare the assembly language for both.
One of the ideas here is to maximize the register usage. The values to test are loaded into registers and then the testing begins.
In some microbenchmarks, the compiler's static analysis is smart enough to elide multiple function calls with the same argument values, rendering the measurement useless. Benchmarking function f with code like this:
long s = 0;
...
for (int i = 0; i < N; ++i) {
startTimer();
s += f(M);
stopTimer();
}
...
cout << s;
can be defeated by the optimizer. I wonder whether current or near-future optimizer technology is smart enough to defeat this version:
long s = 0;
...
for (int i = 0; i < N; ++i) {
long m = lround(pow(sqrt(i), 2))/i*M;
startTimer();
s += f(m);
stopTimer();
}
...
cout << s;
Answering your title question:
Is any C++ compiler able to optimize lround(pow(sqrt(i), 2)) replacing it with i, now or in the near future?
yes, for statically known arguments: see it Live On Godbolt
All of the code in that sample program got compiled down to a single constant value! And, best of all, that's with optimizations disabled: g++-4.8 -O0 :)
#include <cmath>
constexpr int N = 100;
constexpr double M = 1.0;
constexpr int i = 4;
static constexpr double foo1(int i) { return sqrt(i); }
static constexpr auto f1 = foo1(4);
static constexpr double foo2(int i) { return pow(sqrt(i), 2); }
static constexpr auto f2 = foo2(4);
static constexpr double foo3(int i) { return pow(sqrt(i), 2)/i*M; }
static constexpr auto f3 = foo3(4);
static constexpr long foo4(int i) { return pow(sqrt(i), 2)/i*M; }
static constexpr auto f4 = foo4(4);
#include <cstdio>
int main()
{
printf("f1 + f2 + f3 + f4: %f\n", f1 + f2 + f2 + f3);
}
Gets compiled into a single, statically known constant:
.LC1:
.string "f1 + f2 + f3 + f4: %f\n"
.text
.globl main
.type main, @function
main:
.LFB225:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movabsq $4622382067542392832, %rax
vmovd %rax, %xmm0
movl $.LC1, %edi
movl $1, %eax
call printf
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
Voila. That's because the GNU standard library has constexpr versions of the math functions (except the lround) in C++11 mode.
It's entirely thinkable that the compiler unrolls a loop like
for (int i = 1; i<5; ++i)
s += foo(i);
into
s += foo(1);
s += foo(2);
s += foo(3);
s += foo(4);
Though I haven't checked that yet.
It is possible, but the optimiser must be taught the semantics of library functions, which is hard and time consuming.
Then again IEEE754 math is tricky.
What about declaring volatile long m= M; instead ?
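A minimal sketch of that volatile idea, under stated assumptions: f below is a stand-in for the real function under test, benchmarkSum is a made-up name, and std::chrono::steady_clock replaces the startTimer()/stopTimer() pair from the question. Because m is volatile, the compiler has to re-read it on every iteration and cannot assume all calls receive the same argument; it can still inline f.

```cpp
#include <chrono>

// Hypothetical function under test; stands in for the real f.
long f(long m) { return m * m + 1; }

// Calls f N times with an argument the optimizer cannot treat as a constant.
// Returns the sum so the calls have an observable effect.
long benchmarkSum(int N, long M, std::chrono::nanoseconds& elapsed) {
    volatile long m = M;  // every read of m must actually be performed
    long s = 0;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        s += f(m);        // argument is re-read from the volatile each time
    }
    auto stop = std::chrono::steady_clock::now();
    elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    return s;
}
```

Timing the whole loop rather than each call also keeps timer overhead out of the measurement, which matters when f is small.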
Profiling suggests that this function here is a real bottleneck for my application:
static inline int countEqualChars(const char* string1, const char* string2, int size) {
int r = 0;
for (int j = 0; j < size; ++j) {
if (string1[j] == string2[j]) {
++r;
}
}
return r;
}
Even with -O3 and -march=native, G++ 4.7.2 does not vectorize this function (I checked the assembler output). Now, I'm not an expert with SSE and friends, but I think that comparing more than one character at once should be faster. Any ideas on how to speed things up? Target architecture is x86-64.
Of course it can.
pcmpeqb compares two vectors of 16 bytes and produces a vector with zeros where they differed, and -1 where they match. Use this to compare 16 bytes at a time, adding the result to an accumulator vector (make sure to accumulate the results of at most 255 vector compares to avoid overflow). When you're done, there are 16 results in the accumulator. Sum them and negate to get the number of equal elements.
If the lengths are very short, it will be hard to get a significant speedup from this approach. If the lengths are long, then it will be worth pursuing.
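The pcmpeqb approach described above can be sketched with SSE2 intrinsics as follows. This is an illustration rather than a tuned implementation: the function name is made up, it uses unaligned loads so the inputs need no special alignment, it handles the last size % 16 bytes with a scalar loop, and it assumes the total count fits in 32 bits.

```cpp
#include <cstddef>
#include <emmintrin.h>  // SSE2

// Counts positions j in [0, size) where s1[j] == s2[j].
std::size_t countEqualBytesSse2(const char* s1, const char* s2, std::size_t size) {
    const __m128i zero = _mm_setzero_si128();
    __m128i total = zero;  // two 64-bit partial sums
    std::size_t j = 0;
    while (size - j >= 16) {
        __m128i acc = zero;  // sixteen 8-bit counters
        // At most 255 vector compares per batch, so no 8-bit counter can overflow.
        for (int n = 0; n < 255 && size - j >= 16; ++n, j += 16) {
            __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s1 + j));
            __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s2 + j));
            // pcmpeqb yields -1 per matching byte; subtracting -1 adds 1.
            acc = _mm_sub_epi8(acc, _mm_cmpeq_epi8(a, b));
        }
        // psadbw against zero sums each group of 8 counters into a 64-bit lane.
        total = _mm_add_epi64(total, _mm_sad_epu8(acc, zero));
    }
    // Add the high 64-bit lane onto the low one.
    total = _mm_add_epi64(total, _mm_srli_si128(total, 8));
    // Assumption: the count fits in 32 bits.
    std::size_t r = static_cast<unsigned int>(_mm_cvtsi128_si32(total));
    for (; j < size; ++j) r += (s1[j] == s2[j]);  // scalar tail
    return r;
}
```

Here the counters are subtracted from rather than added to, which folds the final negation into the loop; the result is the same.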
Compiler flags for vectorization:
-ftree-vectorize
-ftree-vectorize -march=<your_architecture> (use all instruction-set extensions available on your computer, not just the baseline, such as SSE2 for x86-64). Use -march=native to optimize for the machine the compiler is running on. -march=<foo> also sets -mtune=<foo>, which is also a good thing.
Using SSEx intrinsics:
Pad and align the buffer to 16 bytes (according to the vector size you're actually going to use)
Create an accumulator countU8 with _mm_set1_epi8(0)
For all n/16 input (sub) vectors, do:
Load 16 chars from both strings with _mm_load_si128 or _mm_loadu_si128 (for unaligned loads)
Compare the octets in parallel with _mm_cmpeq_epi8. Each match yields 0xFF (-1), 0x00 otherwise.
Subtract the above result vector from countU8 using _mm_sub_epi8 (minus -1 -> +1)
After at most 255 iterations, the 16 8-bit counters must be extracted into a larger integer type to prevent overflow. See unpack and horizontal add in this nice answer for how to do that: https://stackoverflow.com/a/10930706/1175253
Code:
#include <iostream>
#include <vector>
#include <cassert>
#include <cstdint>
#include <climits>
#include <cstring>
#include <emmintrin.h>
#ifdef __SSE2__
#if !defined(UINTPTR_MAX) || !defined(UINT64_MAX) || !defined(UINT32_MAX)
# error "Limit macros are not defined"
#endif
#if UINTPTR_MAX == UINT64_MAX
#define PTR_64
#elif UINTPTR_MAX == UINT32_MAX
#define PTR_32
#else
# error "Current UINTPTR_MAX is not supported"
#endif
template<typename T>
void print_vector(std::ostream& out,const __m128i& vec)
{
static_assert(sizeof(vec) % sizeof(T) == 0,"Invalid element size");
std::cout << '{';
const T* const end = reinterpret_cast<const T*>(&vec)-1;
const T* const upper = end+(sizeof(vec)/sizeof(T));
for(const T* elem = upper;
elem != end;
--elem
)
{
if(elem != upper)
std::cout << ',';
std::cout << +(*elem);
}
std::cout << '}' << std::endl;
}
#define PRINT_VECTOR(_TYPE,_VEC) do{ std::cout << #_VEC << " : "; print_vector<_TYPE>(std::cout,_VEC); } while(0)
///@note SSE2 required (macro: __SSE2__)
///@warning Not tested!
size_t counteq_epi8(const __m128i* a_in,const __m128i* b_in,size_t count)
{
assert(a_in != nullptr && (uintptr_t(a_in) % 16) == 0);
assert(b_in != nullptr && (uintptr_t(b_in) % 16) == 0);
//assert(count > 0);
/*
//maybe not so good with all that branching and additional loop variables
__m128i accumulatorU8 = _mm_set1_epi8(0);
__m128i sum2xU64 = _mm_set1_epi8(0);
for(size_t i = 0;i < count;++i)
{
//this operation could also be unrolled, where multiple result registers would be accumulated
accumulatorU8 = _mm_sub_epi8(accumulatorU8,_mm_cmpeq_epi8(*a_in++,*b_in++));
if(i % 255 == 0)
{
//before overflow of uint8, the counter will be extracted
__m128i sum2xU16 = _mm_sad_epu8(accumulatorU8,_mm_set1_epi8(0));
sum2xU64 = _mm_add_epi64(sum2xU64,sum2xU16);
//reset accumulatorU8
accumulatorU8 = _mm_set1_epi8(0);
}
}
//blindly accumulate remaining values
__m128i sum2xU16 = _mm_sad_epu8(accumulatorU8,_mm_set1_epi8(0));
sum2xU64 = _mm_add_epi64(sum2xU64,sum2xU16);
//do a horizontal addition of the two counter values
sum2xU64 = _mm_add_epi64(sum2xU64,_mm_srli_si128(sum2xU64,64/8));
#if defined PTR_64
return _mm_cvtsi128_si64(sum2xU64);
#elif defined PTR_32
return _mm_cvtsi128_si32(sum2xU64);
#else
# error "macro PTR_(32|64) is not set"
#endif
*/
__m128i sum2xU64 = _mm_set1_epi32(0);
while(count--)
{
__m128i matches = _mm_sub_epi8(_mm_set1_epi32(0),_mm_cmpeq_epi8(*a_in++,*b_in++));
__m128i sum2xU16 = _mm_sad_epu8(matches,_mm_set1_epi32(0));
sum2xU64 = _mm_add_epi64(sum2xU64,sum2xU16);
#ifndef NDEBUG
PRINT_VECTOR(uint16_t,sum2xU64);
#endif
}
//do a horizontal addition of the two counter values
sum2xU64 = _mm_add_epi64(sum2xU64,_mm_srli_si128(sum2xU64,64/8));
#ifndef NDEBUG
std::cout << "----------------------------------------" << std::endl;
PRINT_VECTOR(uint16_t,sum2xU64);
#endif
#if !defined(UINTPTR_MAX) || !defined(UINT64_MAX) || !defined(UINT32_MAX)
# error "Limit macros are not defined"
#endif
#if defined PTR_64
return _mm_cvtsi128_si64(sum2xU64);
#elif defined PTR_32
return _mm_cvtsi128_si32(sum2xU64);
#else
# error "macro PTR_(32|64) is not set"
#endif
}
#endif
int main(int argc, char* argv[])
{
std::vector<__m128i> a(64); // * 16 bytes
std::vector<__m128i> b(a.size());
const size_t nBytes = a.size() * sizeof(std::vector<__m128i>::value_type);
char* const a_out = reinterpret_cast<char*>(a.data());
char* const b_out = reinterpret_cast<char*>(b.data());
memset(a_out,0,nBytes);
memset(b_out,0,nBytes);
a_out[1023] = 1;
b_out[1023] = 1;
size_t equalBytes = counteq_epi8(a.data(),b.data(),a.size());
std::cout << "equalBytes = " << equalBytes << std::endl;
return 0;
}
The fastest SSE implementation I got for large and small arrays:
size_t counteq_epi8(const __m128i* a_in,const __m128i* b_in,size_t count)
{
assert((count > 0 ? a_in != nullptr : true) && (uintptr_t(a_in) % sizeof(__m128i)) == 0);
assert((count > 0 ? b_in != nullptr : true) && (uintptr_t(b_in) % sizeof(__m128i)) == 0);
//assert(count > 0);
const size_t maxInnerLoops = 255;
const size_t nNestedLoops = count / maxInnerLoops;
const size_t nRemainderLoops = count % maxInnerLoops;
const __m128i zero = _mm_setzero_si128();
__m128i sum16xU8 = zero;
__m128i sum2xU64 = zero;
for(size_t i = 0;i < nNestedLoops;++i)
{
for(size_t j = 0;j < maxInnerLoops;++j)
{
sum16xU8 = _mm_sub_epi8(sum16xU8,_mm_cmpeq_epi8(*a_in++,*b_in++));
}
sum2xU64 = _mm_add_epi64(sum2xU64,_mm_sad_epu8(sum16xU8,zero));
sum16xU8 = zero;
}
for(size_t j = 0;j < nRemainderLoops;++j)
{
sum16xU8 = _mm_sub_epi8(sum16xU8,_mm_cmpeq_epi8(*a_in++,*b_in++));
}
sum2xU64 = _mm_add_epi64(sum2xU64,_mm_sad_epu8(sum16xU8,zero));
sum2xU64 = _mm_add_epi64(sum2xU64,_mm_srli_si128(sum2xU64,64/8));
#if UINTPTR_MAX == UINT64_MAX
return _mm_cvtsi128_si64(sum2xU64);
#elif UINTPTR_MAX == UINT32_MAX
return _mm_cvtsi128_si32(sum2xU64);
#else
# error "macro PTR_(32|64) is not set"
#endif
}
Auto-vectorization in current gcc is a matter of helping the compiler understand that the code is easy to vectorize. In your case, it will vectorize the loop if you remove the conditional and rewrite the code in a more imperative way:
static inline int count(const char* string1, const char* string2, int size) {
int r = 0;
bool b;
for (int j = 0; j < size; ++j) {
b = (string1[j] == string2[j]);
r += b;
}
return r;
}
In this case:
movdqa 16(%rsp), %xmm1
movl $.LC2, %esi
pxor %xmm2, %xmm2
movzbl 416(%rsp), %edx
movdqa .LC1(%rip), %xmm3
pcmpeqb 224(%rsp), %xmm1
cmpb %dl, 208(%rsp)
movzbl 417(%rsp), %eax
movl $1, %edi
pand %xmm3, %xmm1
movdqa %xmm1, %xmm5
sete %dl
movdqa %xmm1, %xmm4
movzbl %dl, %edx
punpcklbw %xmm2, %xmm5
punpckhbw %xmm2, %xmm4
pxor %xmm1, %xmm1
movdqa %xmm5, %xmm6
movdqa %xmm5, %xmm0
movdqa %xmm4, %xmm5
punpcklwd %xmm1, %xmm6
(etc.)
I have some critical branching code inside a loop that's run about 2^26 times. Branch prediction is not optimal because m is random. How would I remove the branching, possibly using bitwise operators?
bool m;
unsigned int a;
const unsigned int k = ...; // k >= 7
if(a == 0)
a = (m ? (a+1) : (k));
else if(a == k)
a = (m ? 0 : (a-1));
else
a = (m ? (a+1) : (a-1));
And here is the relevant assembly generated by gcc -O3:
.cfi_startproc
movl 4(%esp), %edx
movb 8(%esp), %cl
movl (%edx), %eax
testl %eax, %eax
jne L15
cmpb $1, %cl
sbbl %eax, %eax
andl $638, %eax
incl %eax
movl %eax, (%edx)
ret
L15:
cmpl $639, %eax
je L23
testb %cl, %cl
jne L24
decl %eax
movl %eax, (%edx)
ret
L23:
cmpb $1, %cl
sbbl %eax, %eax
andl $638, %eax
movl %eax, (%edx)
ret
L24:
incl %eax
movl %eax, (%edx)
ret
.cfi_endproc
The branch-free division-free modulo could have been useful, but testing shows that in practice, it isn't.
const unsigned int k = 639;
void f(bool m, unsigned int &a)
{
a += m * 2 - 1;
if (a == -1u)
a = k;
else if (a == k + 1)
a = 0;
}
Testcase:
unsigned a = 0;
f(false, a);
assert(a == 639);
f(false, a);
assert(a == 638);
f(true, a);
assert(a == 639);
f(true, a);
assert(a == 0);
f(true, a);
assert(a == 1);
f(false, a);
assert(a == 0);
Actually timing this, using a test program:
int main()
{
for (int i = 0; i != 10000; i++)
{
unsigned int a = k / 2;
while (a != 0) f(rand() & 1, a);
}
}
(Note: there's no srand, so results are deterministic.)
My original answer: 5.3s
The code in the question: 4.8s
Lookup table: 4.5s (static unsigned lookup[2][k+1];)
Lookup table: 4.3s (static unsigned lookup[k+1][2];)
Eric's answer: 4.2s
This version: 4.0s
The fastest I've found is now the table implementation
Timings I got (UPDATED for new measurement code)
HVD's most recent: 9.2s
Table version: 7.4s (with k=693)
Table creation code:
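unsigned int table[2*(k+1)]; // one entry per (a, m) pair, with a in 0..k
table_ptr = table;
for(unsigned int i = 0; i <= k; i++){
unsigned int a = i;
f(0, a);
table[i<<1] = a;
a = i;
f(1, a);
table[(i<<1) + 1] = a; // parentheses needed: + binds tighter than <<
}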
Table runtime loop:
void f(bool m, unsigned int &a){
a = table_ptr[a<<1 | m];
}
With HVD's measurement code, I saw the cost of the rand() dominating the runtime, so that the runtime for a branchless version was about the same range as these solutions. I changed the measurement code to this (UPDATED to keep random branch order, and pre-computing random values to prevent rand(), etc. from trashing the cache)
int main(){
unsigned int a = k / 2;
int m[100000];
for(int i = 0; i < 100000; i++){
m[i] = rand() & 1;
}
for (int i = 0; i != 10000; i++)
{
for(int j = 0; j != 100000; j++){
f(m[j], a);
}
}
}
I don't think you can remove the branches entirely, but you can reduce the number by branching on m first.
if (m){
if (a==k) {a = 0;} else {++a;}
}
else {
if (a==0) {a = k;} else {--a;}
}
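If the goal is to avoid data-dependent branches entirely, the same update can be written with comparisons used as 0/1 values and a mask select. This is only a sketch (stepBranchless is a made-up name, and k is passed as a parameter here for testability); whether it beats the branching versions has to be measured, since compilers often turn the simple if/else form into conditional moves already.

```cpp
// One step of the counter: m == true increments with wrap k -> 0,
// m == false decrements with wrap 0 -> k. Assumes 0 <= a <= k.
unsigned int stepBranchless(bool m, unsigned int a, unsigned int k) {
    // a + 1, forced to 0 when a == k (mask is all ones unless a == k)
    unsigned int up = (a + 1u) & (0u - static_cast<unsigned int>(a != k));
    // a - 1, plus (k + 1) when a == 0 so the unsigned wraparound lands on k
    unsigned int down = a - 1u + (k + 1u) * static_cast<unsigned int>(a == 0u);
    // all ones if m is true, all zeros otherwise
    unsigned int mask = 0u - static_cast<unsigned int>(m);
    return (up & mask) | (down & ~mask);
}
```

Both candidates are computed every time and one is discarded, which trades a little extra arithmetic for a predictable instruction stream.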
Adding to Antimony's rewrite:
if (a==k) {a = 0;} else {++a;}
looks like an increase with wraparound. You can write this as
a=(a+1)%(k+1);
which, of course, only makes sense if divisions are actually faster than branches.
Not sure about the other one; too lazy to think about what the (~0)%k will be.
This has no branches. Because K is constant, the compiler might be able to optimize the modulo depending on its value. And if K is 'small' then a full lookup table solution would probably be even faster.
bool m;
unsigned int a;
const unsigned int k = ...; // k >= 7
const unsigned int inc[2] = {k, 1}; // adding k is the same as -1 modulo (k+1)
a = (a + inc[m]) % (k+1);
If k isn't large enough to cause overflow, you could do something like this:
int a; // Note: not unsigned int
int plusMinus = 2 * m - 1;
a += plusMinus;
if(a == -1)
a = k;
else if (a == k+1)
a = 0;
Still branches, but the branch prediction should be better, since the edge conditions are rarer than m-related conditions.