The arithmetic mean of two unsigned integers is defined as:
mean = (a+b)/2
Directly implementing this in C/C++ may overflow and produce a wrong result. A correct implementation would avoid this. One way of coding it could be:
mean = a/2 + b/2 + (a%2 + b%2)/2
But this produces rather a lot of code with typical compilers. In assembler, this can usually be done much more efficiently. For example, x86 can do it in the following way (assembler pseudo-code, I hope you get the point):
ADD a,b ; addition, leaving the overflow condition in the carry bit
RCR a,1 ; rotate right through carry, effectively a division by 2
After those two instructions, the result is in a, and the remainder of the division is in the carry bit. If correct rounding is desired, a third ADC instruction would have to add the carry into the result.
Note that the RCR instruction is used, which rotates a register through the carry. In our case, it is a rotate by one position, so that the previous carry becomes the most significant bit in the register, and the new carry holds the previous LSB from the register. It seems that MSVC doesn't even offer an intrinsic for this instruction.
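For illustration, here is a hedged sketch of that two-instruction sequence expressed with GCC/Clang extended inline assembly (x86 only and compiler-specific, so exactly the kind of non-portable code we would like to avoid; the function name is made up):
#include <stdint.h>

static inline uint32_t mean_rcr(uint32_t a, uint32_t b)
{
    /* ADD leaves the carry of a+b in CF; RCR shifts it back in as the MSB,
       giving (a+b)/2 truncated, without losing the 33rd bit. */
    __asm__("addl %1, %0\n\t"
            "rcrl $1, %0"
            : "+r"(a)
            : "r"(b)
            : "cc");
    return a;
}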
Is there a known C/C++ pattern that can be expected to be recognized by an optimizing compiler so that it produces such efficient code? Or, more generally, is there a rational way how to program in C/C++ source level so that the carry bit is being used by the compiler to optimize the generated code?
EDIT:
A 1-hour lecture about std::midpoint: https://www.youtube.com/watch?v=sBtAGxBh-XI
Wow!
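For reference, a minimal sketch of the std::midpoint approach mentioned above (assumes a C++20 compiler; the wrapper function name is mine):
#include <cstdint>
#include <numeric>

// std::midpoint(a, b) computes the midpoint without overflow,
// rounding towards the first argument.
std::uint32_t mean(std::uint32_t a, std::uint32_t b)
{
    return std::midpoint(a, b);
}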
EDIT2: Great discussion on Microsoft blog
The following method avoids overflow and should result in fairly efficient assembly (example) without depending on non-standard features:
mean = (a&b) + (a^b)/2;
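A minimal self-contained sketch of why this works, using the identity a + b = 2*(a & b) + (a ^ b) (the function name is an assumption, not from the answer):
#include <stdint.h>

uint32_t mean_no_overflow(uint32_t a, uint32_t b)
{
    /* a & b holds the bits set in both operands (counted twice in a+b),
       a ^ b holds the bits set in only one of them (counted once),
       so (a & b) + (a ^ b)/2 equals (a + b)/2 and never overflows. */
    return (a & b) + ((a ^ b) >> 1);
}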
There are three typical methods to compute the average without overflow; the second is shown in two variants (depending on whether the relative order of the operands is known), and the last one is limited to uint32_t on 64-bit architectures.
// average "SWAR" / Montgomery
uint32_t avg(uint32_t a, uint32_t b) {
    return (a & b) + ((a ^ b) >> 1);
}

// in case the relative magnitudes are known
uint32_t avg2(uint32_t min, uint32_t max) {
    return min + (max - min) / 2;
}

// in case the relative magnitudes are not known
uint32_t avg2_constrained(uint32_t a, uint32_t b) {
    return a + (int32_t)(b - a) / 2;
}

// average by increasing width (not applicable to uint64_t)
uint32_t avg3(uint32_t a, uint32_t b) {
    return ((uint64_t)a + b) >> 1;
}
The corresponding assembler sequences from clang on two architectures (x86-64 and AArch64) are:
avg(unsigned int, unsigned int)
mov eax, esi
and eax, edi
xor esi, edi
shr esi
add eax, esi
avg2(unsigned int, unsigned int)
sub esi, edi
shr esi
lea eax, [rsi + rdi]
avg3(unsigned int, unsigned int)
mov ecx, edi
mov eax, esi
add rax, rcx
shr rax
versus, on AArch64:
avg(unsigned int, unsigned int)
and w8, w1, w0
eor w9, w1, w0
add w0, w8, w9, lsr #1
ret
avg2(unsigned int, unsigned int)
sub w8, w1, w0
add w0, w0, w8, lsr #1
ret
avg3(unsigned int, unsigned int):
mov w8, w1
add x8, x8, w0, uxtw
lsr x0, x8, #1
ret
Out of these three versions, avg2 would perform on ARM64 as well as the optimal sequence using the carry flag. It is also likely that avg3 would perform just as well, noticing that the mov w8, w1 is only there to clear the top 32 bits, which may be unnecessary given that the compiler knows they are already cleared by whatever previous instruction produced the value.
A similar statement can be made about the Intel version of avg3, which in the optimal case would compile to just the two meaningful instructions:
add rax, rcx
shr rax
See https://godbolt.org/z/5TMd3zr81 for online comparison.
The "SWAR"/Montgomery version is typically only justified when trying to compute multiple averages packed into a single (large) integer, in which case the full formula contains masking with the bit positions of the highest bits: return (a & b) + (((a ^ b) >> 1) & ~kH);.
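As a hedged illustration of that packed case (four 8-bit lanes in one 32-bit word; the function name and the concrete kH value are assumptions):
#include <stdint.h>

uint32_t avg_packed_u8(uint32_t a, uint32_t b)
{
    const uint32_t kH = 0x80808080u;   /* highest bit of each 8-bit lane */
    /* Masking with ~kH keeps the bit shifted out of one lane
       from leaking into the lane below it. */
    return (a & b) + (((a ^ b) >> 1) & ~kH);
}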
Below are the C++ and ARM code for the same program. Can you tell me whether this ARM code is optimized, and how many cycles the loop requires? (The size of the array, n, is large and a multiple of 64 elements; each element is exclusive-ORed bit-wise with the 8-bit mask k to produce the output array outArr.) What should I do to optimize the code using loop unrolling (processing 4 elements at a time)?
C++ code:
// Gray scale image pixel inversion
void invert(unsigned char *outArr, unsigned char *inArr,
            unsigned char k, int n)
{
    for (int i = 0; i < n; i++)
        *outArr++ = *inArr++ ^ k; // ^ is bitwise xor
}
ARM CODE:
invert:
cmp r3, #0
bxle lr
add ip, r0, r3
.L3:
ldrb r3, [r1], #1 # zero_extendqisi2
eor r3, r3, r2
strb r3, [r0], #1
cmp ip, r0
bne .L3
bx lr
I have no idea what 'code preload' means. There is data preloading with the pld instruction. It would make sense in the context of the sample code.
Here is the basic 'C' version given the assumptions,
The size of the array n is large and a multiple of 64 elements; each element is exclusive-ORed bit-wise with the 8-bit mask to produce an output array outArr.
The code is probably not perfect, but meant to illustrate.
#include <assert.h>

// Gray scale image pixel inversion
void invert(unsigned char *outArr, unsigned char *inArr,
            unsigned char k, int n)
{
    unsigned int *out = (void*)outArr;
    unsigned int *in = (void*)inArr;
    unsigned int mask = k<<24 | k<<16 | k<<8 | k;

    /* Check arguments */
    if (n % 64 != 0) return;
    if ((int)outArr & 3) return;
    if ((int)inArr & 3) return;
    assert(sizeof(int) == 4);

    for (int i = 0; i < n/sizeof(int); i += 64/sizeof(int)) {
        /* 16 transfers per loop 64/4 */
        *out++ = *in++ ^ mask; // 1
        *out++ = *in++ ^ mask;
        *out++ = *in++ ^ mask;
        *out++ = *in++ ^ mask;
        *out++ = *in++ ^ mask; // 5
        *out++ = *in++ ^ mask;
        *out++ = *in++ ^ mask;
        *out++ = *in++ ^ mask;
        *out++ = *in++ ^ mask; // 9
        *out++ = *in++ ^ mask;
        *out++ = *in++ ^ mask;
        *out++ = *in++ ^ mask;
        *out++ = *in++ ^ mask; // 13
        *out++ = *in++ ^ mask;
        *out++ = *in++ ^ mask;
        *out++ = *in++ ^ mask;
    }
}
You can view the output on godbolt.
The ldm and stm instructions can be used to load/store a block of consecutive memory locations to/from multiple registers. We cannot use all 16 ARM registers, so the core of the loop in assembler would look like this,
ldmia r1!, {r4,r5,r6,r7,r8,r9,r10,r11}  # r1 is inArr
eor r4,r4,r2                            # r2 is expanded k
eor r5,r5,r2
eor r6,r6,r2
eor r7,r7,r2
eor r8,r8,r2
eor r9,r9,r2
eor r10,r10,r2
eor r11,r11,r2
stmia r0!, {r4,r5,r6,r7,r8,r9,r10,r11}  # r0 is outArr
This is repeated twice per 64-byte block, and R0 or R1 can be checked against the array limit stored in R3. You need to save all of the callee-saved registers if you want to be EABI compliant. The register set r4-r11 can generally be used, but it will depend on the system. You can also use lr, fp, etc. if you save them and do not need to be exception safe.
From the comments,
I am trying to find that how many cycles does this subroutine take per
array element when it is optimized and when it isn't.
Cycle counts are extremely difficult on modern CPUs. However you have five instructions in the core with a simple loop,
.L3:
ldrb r3, [r1], #1 # zero_extendqisi2
eor r3, r3, r2
strb r3, [r0], #1
cmp ip, r0
bne .L3
To do 32 bytes, this is 32 * 5 (160) instructions. With 32 * 2 memory accesses.
The expanded version is just one 32-byte memory read and one 32-byte write. These will complete with the lowest-addressed value available first, followed by a single EOR per word. So it is just 10 instructions versus 160. On modern processors the memory will be the limiting factor. Because of memory stalls, it may be better to process only four words at a time, such as,
ldmia r1!, {r4,r5,r6,r7}    # r1 is inArr
eor r4,r4,r2                # r2 is expanded k
eor r5,r5,r2
eor r6,r6,r2
eor r7,r7,r2
ldmia r1!, {r8,r9,r10,r11}  # r1 is inArr
stmia r0!, {r4,r5,r6,r7}    # r0 is outArr
...
This (or some permutation) will allow the load/store unit and the eor instructions to avoid blocking each other, but the benefit will depend on the particular CPU type. This topic is called instruction scheduling; it is more powerful than pld or data preloading. As well, you can use NEON or ARM64 instructions so that the body of the loop can do more eor operations before a load/store.
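As a hedged sketch of the NEON suggestion (the function name is mine; assumes a NEON-capable core and n a multiple of 16):
#include <arm_neon.h>

void invert_neon(unsigned char *outArr, unsigned char *inArr,
                 unsigned char k, int n)
{
    uint8x16_t vk = vdupq_n_u8(k);   /* broadcast the 8-bit mask to 16 lanes */
    for (int i = 0; i < n; i += 16) {
        uint8x16_t v = vld1q_u8((const uint8_t *)(inArr + i));   /* load 16 input bytes */
        vst1q_u8((uint8_t *)(outArr + i), veorq_u8(v, vk));      /* XOR and store 16 bytes */
    }
}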
These days, this is done like this:
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <execution>

void invert(unsigned char* const outArr, unsigned char const* const inArr,
            unsigned char const k, std::size_t const n) noexcept
{
    std::transform(std::execution::unseq, inArr, inArr + n, outArr,
        [k](auto const i) noexcept { return i ^ k; });
}
You set -Ofast, cross your fingers and hope that good code will be generated.
EDIT: You can also try this:
void invert(unsigned char* const outArr, unsigned char const* const inArr,
            unsigned char const k, std::size_t const n) noexcept
{
    std::transform(std::execution::unseq,
        reinterpret_cast<std::uint32_t const*>(inArr),
        reinterpret_cast<std::uint32_t const*>(inArr) + n/4,
        reinterpret_cast<std::uint32_t*>(outArr),
        [k = std::uint32_t(k<<24 | k<<16 | k<<8 | k)](auto const i) noexcept { return i ^ k; });
}
I'm writing a C++ template wrapper for GPIOs. For the STM32 I'm using the HAL and LL code as basis. The GPIO initialization comes down to a series of read register to temp variable -> Mask pin specific bits in temp -> shift and write pin specific bits in temp -> write temp back to register. The registers are declared volatile.
Would it make sense (in terms of reducing overhead / improving performance) to first do all the reads to the volatiles, then all the updates, then all the writes to the volatiles, instead of sequentially, as it is now (in ST's code, for example)? The writes would still be in-order, of course.
So from scenario A:
uint32_t temp;
temp = struct->reg1;
temp |= ...
temp &= ...
struct->reg1 = temp;
temp = struct->reg2;
temp |= ...
temp &= ...
struct->reg2 = temp;
to scenario B:
uint32_t temp1, temp2;
temp1 = struct->reg1;
temp2 = struct->reg2;
temp1 |= ...
temp1 &= ...
temp2 |= ...
temp2 &= ...
struct->reg1 = temp1;
struct->reg2 = temp2;
Scenario B might use a bit (or 4) more memory, but I'd expect it doesn't have to interrupt the main program flow as often. Can the code be optimised more in scenario B, for example by combining reads or writes?
It will not make any difference. The code will be exactly as efficient:
void zoo(uint32_t val1, uint32_t val2)
{
    uint32_t moder = GPIOA->MODER;
    uint32_t otyper = GPIOA->OTYPER;

    moder &= val1;
    moder |= val2;
    otyper &= val1;
    otyper |= val2;

    GPIOA->MODER = moder;
    GPIOA->OTYPER = otyper;
}

void boo(uint32_t val1, uint32_t val2)
{
    uint32_t val = GPIOA->MODER;
    val &= val1;
    val |= val2;
    GPIOA->MODER = val;

    val = GPIOA->OTYPER;
    val &= val1;
    val |= val2;
    GPIOA->OTYPER = val;
}
And this is not a real problem, as you access more than one GPIO register only during initialization. The pin configuration is usually set only at program startup, and sometimes when entering and exiting low-power modes (for example, we set pins to analogue mode to draw as little current as possible). Performance is not the first priority at this stage.
Normally you will access only one register:
BSRR - to set and reset pins (but this register is write-only)
ODR - to set pins and read back what we have set
IDR - the actual pin levels (read-only)
In some STM32 micros the set/reset functionality is split across two registers, BSRR and BRR, but they are also write-only.
IMO you are trying to micro-optimize something which does not require it at all.
https://godbolt.org/z/xWqWo9
Would it make sense (in terms of reducing overhead / improving performance) to first do all the reads to the volatiles, then all the updates, then all the writes to the volatiles, instead of sequentially, as it is now (in ST's code, for example)?
So nothing more to do than check it! The following code:
// based on code from https://github.com/ARM-software/CMSIS
#include <stdint.h>
#define __IO volatile
typedef struct
{
    __IO uint32_t CR;
    __IO uint32_t CSR;
} PWR_TypeDef;
#define PERIPH_BASE ((uint32_t)0x40000000) /*!< Peripheral base address in the alias region */
#define APB1PERIPH_BASE PERIPH_BASE
#define PWR_BASE (APB1PERIPH_BASE + 0x7000)
#define PWR ((PWR_TypeDef *) PWR_BASE)
#define PWR_CR_LPDS ((uint16_t)0x0001) /*!< Low-Power Deepsleep */
#define PWR_CR_PDDS ((uint16_t)0x0002) /*!< Power Down Deepsleep */
#define PWR_CR_CWUF ((uint16_t)0x0004) /*!< Clear Wakeup Flag */
#define PWR_CR_CSBF ((uint16_t)0x0008) /*!< Clear Standby Flag */
#define PWR_CR_PVDE ((uint16_t)0x0010) /*!< Power Voltage Detector Enable */
#define PWR_CSR_WUF ((uint16_t)0x0001) /*!< Wakeup Flag */
#define PWR_CSR_SBF ((uint16_t)0x0002) /*!< Standby Flag */
#define PWR_CSR_PVDO ((uint16_t)0x0004) /*!< PVD Output */
#define PWR_CSR_EWUP ((uint16_t)0x0100) /*!< Enable WKUP pin */
void func_separate() {
    // just a meaningless example for testing
    uint32_t temp;
    temp = PWR->CR;
    temp &= PWR_CR_LPDS | PWR_CR_PDDS | PWR_CR_CWUF;
    temp |= PWR_CR_CWUF;
    PWR->CR = temp;

    temp = PWR->CSR;
    temp &= PWR_CSR_WUF | PWR_CSR_SBF;
    temp |= PWR_CSR_PVDO | PWR_CSR_EWUP;
    PWR->CSR = temp;
}

void func_together() {
    uint32_t temp1, temp2;
    temp1 = PWR->CR;
    temp2 = PWR->CSR;

    temp1 &= PWR_CR_LPDS | PWR_CR_PDDS | PWR_CR_CWUF;
    temp1 |= PWR_CR_CWUF;
    temp2 &= PWR_CSR_WUF | PWR_CSR_SBF;
    temp2 |= PWR_CSR_PVDO | PWR_CSR_EWUP;

    PWR->CR = temp1;
    PWR->CSR = temp2;
}
outputs on godbolt with gcc ARM 8.2 -O3 -mlittle-endian -mthumb -mcpu=cortex-m3:
func_separate:
ldr r2, .L3
ldr r3, [r2]
and r3, r3, #7
orr r3, r3, #4
str r3, [r2]
ldr r3, [r2, #4]
and r3, r3, #3
orr r3, r3, #260
str r3, [r2, #4]
bx lr
.L3:
.word 1073770496
func_together:
ldr r1, .L6
ldr r2, [r1]
ldr r3, [r1, #4]
and r2, r2, #7
and r3, r3, #3
orr r2, r2, #4
orr r3, r3, #260
str r2, [r1]
str r3, [r1, #4]
bx lr
.L6:
.word 1073770496
The only difference is the order of the instructions. There is no difference in terms of performance. So would it make sense (in terms of reducing overhead / improving performance)? No.
But it would make sense to prefer the first version in terms of readability.
In this specific case, it doesn't matter.
Generally speaking though, it is recommended practice not to access individual hardware registers in several lines when it can be avoided. It is good practice to write everything in temporary RAM variables and only read and write to the register once.
This doesn't have so much to do with execution time, but rather that reading & writing hardware registers can come with many side-effects such as clearing flags or affect real time.
Furthermore, stuff like temp1 |= ... temp1 &= ... on temporary variables can easily get optimized by the compiler, which is very likely to use a CPU register for them rather than stack allocation.
Another thing worth mentioning is that read/writes to hardware registers cannot get optimized or re-sequenced, since they are volatile qualified. For this reason you'll want to minimize register accesses to save a tiny bit of execution time, but also to allow the compiler to more efficiently optimize the surrounding code.
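As a hedged sketch of that advice (the register name and address below are made up purely for illustration and are not a real device definition):
#include <stdint.h>

#define MODER_REG (*(volatile uint32_t *)0x40020000u)  /* hypothetical GPIO mode register */

void set_pin5_output(void)
{
    /* One volatile read and one volatile write: the masking and OR-ing
       happen on a temporary in a CPU register, not on the live hardware register. */
    uint32_t tmp = MODER_REG;
    tmp &= ~(3u << 10);   /* clear the two mode bits of pin 5 */
    tmp |=  (1u << 10);   /* select output mode */
    MODER_REG = tmp;

    /* By contrast, MODER_REG &= ~(3u << 10); MODER_REG |= (1u << 10);
       would be two separate read-modify-write accesses to the volatile register. */
}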
I have two tables (tabs) of floats. I need to multiply elements from the first tab by the corresponding elements from the second tab and store the result in a third tab.
I would like to use NEON to parallelize floats multiplications: four float multiplications simultaneously instead of one.
I have expected significant acceleration but I achieved only about 20% execution time reduction. This is my code:
#include <stdlib.h>
#include <iostream>
#include <arm_neon.h>
const int n = 100; // table size
/* fill a tab with random floats */
void rand_tab(float *t) {
    for (int i = 0; i < n; i++)
        t[i] = (float)rand() / (float)RAND_MAX;
}

/* Multiply elements of two tabs and store results in third tab
   - STANDARD processing. */
void mul_tab_standard(float *t1, float *t2, float *tr) {
    for (int i = 0; i < n; i++)
        tr[i] = t1[i] * t2[i];
}

/* Multiply elements of two tabs and store results in third tab
   - NEON processing. */
void mul_tab_neon(float *t1, float *t2, float *tr) {
    for (int i = 0; i < n; i += 4)
        vst1q_f32(tr + i, vmulq_f32(vld1q_f32(t1 + i), vld1q_f32(t2 + i)));
}

int main() {
    float t1[n], t2[n], tr[n];

    /* fill tables with random values */
    srand(1);
    rand_tab(t1);
    rand_tab(t2);

    // I repeat table multiplication function 1000000 times for measuring purposes:
    for (int k = 0; k < 1000000; k++)
        mul_tab_standard(t1, t2, tr); // switch to next line for comparison:
        //mul_tab_neon(t1, t2, tr);

    return 1;
}
I run the following command to compile:
g++ -mfpu=neon -ffast-math neon_test.cpp
My CPU: ARMv7 Processor rev 0 (v7l)
Do you have any ideas how I can achieve more significant speed-up?
Cortex-A8 and Cortex-A9 can do only two SP FP multiplications per cycle, so you may at most double the performance on those (most popular) CPUs. In practice, ARM CPUs have very low IPC, so it is preferable to unroll the loops as much as possible. If you want ultimate performance, write in assembly: gcc's code generator for ARM is nowhere near as good as for x86.
I also recommend using CPU-specific optimization options: "-O3 -mcpu=cortex-a9 -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mthumb" for Cortex-A9; for Cortex-A15, Cortex-A8 and Cortex-A5, replace -mcpu and -mtune with cortex-a15/a8/a5 accordingly. gcc does not have optimizations for Qualcomm CPUs, so for Qualcomm Scorpion use the Cortex-A8 parameters (and also unroll even more than you usually do), and for Qualcomm Krait try the Cortex-A15 parameters (you will need a recent version of gcc which supports it).
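To make the unrolling advice concrete, here is a hedged sketch (assumes arm_neon.h, n a multiple of 16, and a made-up function name):
#include <arm_neon.h>

void mul_tab_neon_unrolled(float *t1, float *t2, float *tr, int n)
{
    /* Process 16 floats (four 128-bit vectors) per iteration so that
       several independent multiplies are in flight at once. */
    for (int i = 0; i < n; i += 16) {
        float32x4_t a0 = vld1q_f32(t1 + i),      b0 = vld1q_f32(t2 + i);
        float32x4_t a1 = vld1q_f32(t1 + i + 4),  b1 = vld1q_f32(t2 + i + 4);
        float32x4_t a2 = vld1q_f32(t1 + i + 8),  b2 = vld1q_f32(t2 + i + 8);
        float32x4_t a3 = vld1q_f32(t1 + i + 12), b3 = vld1q_f32(t2 + i + 12);
        vst1q_f32(tr + i,      vmulq_f32(a0, b0));
        vst1q_f32(tr + i + 4,  vmulq_f32(a1, b1));
        vst1q_f32(tr + i + 8,  vmulq_f32(a2, b2));
        vst1q_f32(tr + i + 12, vmulq_f32(a3, b3));
    }
}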
One shortcoming with NEON intrinsics is that you can't use auto-increment on loads, which shows up as extra instructions in your NEON implementation.
Compiled with gcc version 4.4.3 and options -c -std=c99 -mfpu=neon -O3 and dumped with objdump, this is the loop part of mul_tab_neon:
000000a4 <mul_tab_neon>:
ac: e0805003 add r5, r0, r3
b0: e0814003 add r4, r1, r3
b4: e082c003 add ip, r2, r3
b8: e2833010 add r3, r3, #16
bc: f4650a8f vld1.32 {d16-d17}, [r5]
c0: f4642a8f vld1.32 {d18-d19}, [r4]
c4: e3530e19 cmp r3, #400 ; 0x190
c8: f3400df2 vmul.f32 q8, q8, q9
cc: f44c0a8f vst1.32 {d16-d17}, [ip]
d0: 1afffff5 bne ac <mul_tab_neon+0x8>
and this is loop part of mul_tab_standard
00000000 <mul_tab_standard>:
58: ecf01b02 vldmia r0!, {d17}
5c: ecf10b02 vldmia r1!, {d16}
60: f3410db0 vmul.f32 d16, d17, d16
64: ece20b02 vstmia r2!, {d16}
68: e1520003 cmp r2, r3
6c: 1afffff9 bne 58 <mul_tab_standard+0x58>
As you can see, in the standard case the compiler creates a much tighter loop.
float f = 0.7;

if (f == 0.7)
    printf("equal");
else
    printf("not equal");
Why is the output not equal?
Why does this happen?
This happens because in your statement
if(f == 0.7)
the 0.7 is treated as a double. Try 0.7f to ensure the value is treated as a float:
if(f == 0.7f)
But as Michael suggested in the comments below, you should never test for exact equality of floating-point values.
This answer is to complement the existing ones: note that 0.7 is not exactly representable either as a float or as a double. If it were represented exactly, then there would be no loss of information when converting to float and then back to double, and you wouldn't have this problem.
It could even be argued that there should be a compiler warning for literal floating-point constants that cannot be represented exactly, especially when the standard is so fuzzy regarding whether the rounding will be done at run-time, in the rounding mode that has been set at that time, or at compile-time, in a possibly different rounding mode.
All non-integer numbers that can be represented exactly have 5 as their last decimal digit. Unfortunately, the converse is not true: some numbers have 5 as their last decimal digit and cannot be represented exactly. Small integers can all be represented exactly, and division by a power of 2 transforms a number that can be represented into another that can be represented, as long as you do not enter the realm of denormalized numbers.
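A small hedged illustration of this point: 0.75 (= 3/4, an integer divided by a power of 2) survives the float-to-double round trip, while 0.7 does not.
#include <stdio.h>

int main(void)
{
    float a = 0.75f;   /* exactly representable: 3/4 */
    float b = 0.7f;    /* not exactly representable */
    printf("%d %d\n", a == 0.75, b == 0.7);   /* expected output: 1 0 */
    return 0;
}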
First of all, let's look inside a float. I take 0.1f; it is 4 bytes long (binary32), and in hex it is
3D CC CC CD.
Per the IEEE 754 standard, to convert it to decimal we proceed like this:
In binary, 3D CC CC CD is
0 01111011 10011001100110011001101
The first digit is the sign bit. 0 means (-1)^0, so our number is positive.
The next 8 bits are the exponent. In binary it is 01111011, in decimal 123. But the real exponent is 123-127 (the bias is always 127) = -4, which means we need to multiply the number we get by 2^(-4).
The last 23 bits are the significand. The first bit is weighted 1/(2^1) (0.5), the second 1/(2^2) (0.25), and so on. We add up all those powers of two and then add 1 (the leading 1 is always implied by the standard). That gives
1.60000002384185791015625
Now let's multiply this number by 2^(-4) from the exponent; we just divide the number above by 2 four times:
0.100000001490116119384765625
I used MS Calculator
Now the second part: converting from decimal to binary.
I take the number 0.1.
It is easy because there is no integer part. The sign bit is 0.
I will now calculate the exponent and the significand. The logic is: multiply the fractional part by 2 (0.1*2 = 0.2); if the result reaches 1, write a 1 bit and subtract 1, otherwise write a 0 bit; then continue.
The number comes out as .00011001100110011001100110011..., and the standard says we must shift left until we get 1.(something). As you can see we need 4 shifts, and from that we calculate the exponent (127-4 = 123). The significand is now 10011001100110011001100 (and some bits are lost).
Now the whole number: sign bit 0, exponent 123 (01111011), significand 10011001100110011001100, and altogether it is
00111101110011001100110011001100
Let's compare it with the one we got in the previous chapter:
00111101110011001100110011001101
As you can see the last bits are not equal. That is because I truncated the number; the CPU and compiler know that there is something beyond what the significand can hold, and set the last bit to 1.
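To check the 3D CC CC CD pattern above yourself, here is a hedged little sketch that prints the raw bits of 0.1f (names are mine):
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 0.1f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* copy the raw encoding without type-punning tricks */
    printf("0x%08X\n", (unsigned)bits);      /* expected output: 0x3DCCCCCD */
    return 0;
}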
Another nearly identical question was linked to this one, hence the years-late answer. I don't think the above answers are complete.
int fun1 ( void )
{
    float x=0.7;
    if(x==0.7) return(1);
    else return(0);
}

int fun2 ( void )
{
    float x=1.1;
    if(x==1.1) return(1);
    else return(0);
}

int fun3 ( void )
{
    float x=1.0;
    if(x==1.0) return(1);
    else return(0);
}

int fun4 ( void )
{
    float x=0.0;
    if(x==0.0) return(1);
    else return(0);
}

int fun5 ( void )
{
    float x=0.7;
    if(x==0.7f) return(1);
    else return(0);
}

float fun10 ( void )
{
    return(0.7);
}

double fun11 ( void )
{
    return(0.7);
}

float fun12 ( void )
{
    return(1.0);
}

double fun13 ( void )
{
    return(1.0);
}
Disassembly of section .text:
00000000 <fun1>:
0: e3a00000 mov r0, #0
4: e12fff1e bx lr
00000008 <fun2>:
8: e3a00000 mov r0, #0
c: e12fff1e bx lr
00000010 <fun3>:
10: e3a00001 mov r0, #1
14: e12fff1e bx lr
00000018 <fun4>:
18: e3a00001 mov r0, #1
1c: e12fff1e bx lr
00000020 <fun5>:
20: e3a00001 mov r0, #1
24: e12fff1e bx lr
00000028 <fun10>:
28: e59f0000 ldr r0, [pc] ; 30 <fun10+0x8>
2c: e12fff1e bx lr
30: 3f333333 svccc 0x00333333
00000034 <fun11>:
34: e28f1004 add r1, pc, #4
38: e8910003 ldm r1, {r0, r1}
3c: e12fff1e bx lr
40: 66666666 strbtvs r6, [r6], -r6, ror #12
44: 3fe66666 svccc 0x00e66666
00000048 <fun12>:
48: e3a005fe mov r0, #1065353216 ; 0x3f800000
4c: e12fff1e bx lr
00000050 <fun13>:
50: e3a00000 mov r0, #0
54: e59f1000 ldr r1, [pc] ; 5c <fun13+0xc>
58: e12fff1e bx lr
5c: 3ff00000 svccc 0x00f00000 ; IMB
Why did fun3 and fun4 return one and not the others? Why does fun5 work?
It is about the language. The language says that 0.7 is a double unless you use the 0.7f syntax, in which case it is a single (float). So
float x=0.7;
the double 0.7 is converted to a single and stored in x.
if(x==0.7) return(1);
The language says we have to promote to the higher precision so the single in x is converted to a double and compared with the double 0.7.
00000028 <fun10>:
28: e59f0000 ldr r0, [pc] ; 30 <fun10+0x8>
2c: e12fff1e bx lr
30: 3f333333 svccc 0x00333333
00000034 <fun11>:
34: e28f1004 add r1, pc, #4
38: e8910003 ldm r1, {r0, r1}
3c: e12fff1e bx lr
40: 66666666 strbtvs r6, [r6], -r6, ror #12
44: 3fe66666 svccc 0x00e66666
single 3f333333
double 3fe6666666666666
As Alexandr pointed out (if that answer remains), per IEEE 754 a single is
seeeeeeeefffffffffffffffffffffff
And double is
seeeeeeeeeeeffffffffffffffffffffffffffffffffffffffffffffffffffff
with 52 bits of fraction rather than the 23 that single has.
00111111001100110011... single
001111111110011001100110... double
0 01111110 01100110011... single
0 01111111110 01100110011... double
Just like 1/3rd in base 10 is 0.3333333... forever. We have a repeating pattern here 0110
01100110011001100110011 single, 23 bits
01100110011001100110011001100110.... double 52 bits.
And here is the answer.
if(x==0.7) return(1);
x contains 01100110011001100110011 as its fraction; when that gets converted back to double, the fraction is
01100110011001100110011000000000....
which is not equal to
01100110011001100110011001100110...
but here
if(x==0.7f) return(1);
That promotion doesn't happen; the same bit patterns are compared with each other.
Why does 1.0 work?
00000048 <fun12>:
48: e3a005fe mov r0, #1065353216 ; 0x3f800000
4c: e12fff1e bx lr
00000050 <fun13>:
50: e3a00000 mov r0, #0
54: e59f1000 ldr r1, [pc] ; 5c <fun13+0xc>
58: e12fff1e bx lr
5c: 3ff00000 svccc 0x00f00000 ; IMB
0011111110000000...
0011111111110000000...
0 01111111 0000000...
0 01111111111 0000000...
In both cases the fraction is all zeros. So converting from double to single to double there is no loss of precision. It converts from single to double exactly and the bit comparison of the two values works.
The highest-voted and accepted answer by halfdan is the correct answer: this is a case of mixed precision, AND you should never do an equals comparison.
The why wasn't shown in that answer: 0.7 fails while 1.0 works, but why 0.7 fails wasn't shown. (In a duplicate question, 1.1 fails as well.)
Edit
The equals can be taken out of the problem here, it is a different question that has already been answered, but it is the same problem and also has the "what the ..." initial shock.
int fun1 ( void )
{
    float x=0.7;
    if(x<0.7) return(1);
    else return(0);
}

int fun2 ( void )
{
    float x=0.6;
    if(x<0.6) return(1);
    else return(0);
}
Disassembly of section .text:
00000000 <fun1>:
0: e3a00001 mov r0, #1
4: e12fff1e bx lr
00000008 <fun2>:
8: e3a00000 mov r0, #0
c: e12fff1e bx lr
Why does one show as less than and the other not, when they should be equal?
From above we know the 0.7 story.
01100110011001100110011 single, 23 bits
01100110011001100110011001100110.... double 52 bits.
01100110011001100110011000000000....
is less than.
01100110011001100110011001100110...
0.6 is a different repeating pattern, 0011 rather than 0110, but it behaves differently when converted from a double to a single (or, in general, when represented as an IEEE 754 single):
00110011001100110011001100110011.... double 52 bits.
00110011001100110011001 is NOT the fraction for single
00110011001100110011010 IS the fraction for single
IEEE 754 defines several rounding modes (to nearest, toward +infinity, toward -infinity, toward zero); the default is round to nearest. If you remember rounding in grade school: for 12345678, if I wanted to round to the 3rd digit from the top it would be 12300000, but rounding to the next digit gives 12350000, because if the digit after the cut is 5 or greater you round up. 5 is half of 10, the base (decimal); in binary, 1 is half of the base, so if the digit after the position we want to round to is 1 we round up, otherwise we don't. So for 0.7 we didn't round up; for 0.6 we do round up.
And now it is easy to see that
00110011001100110011010
converted to a double because of (x<0.6)
00110011001100110011010000000000....
is greater than
00110011001100110011001100110011....
So without having to talk about using equals, the issue still presents itself: 0.7 is a double, 0.7f is a single, and the operands are promoted to the higher precision when they differ.
The problem you're facing is, as other commenters have noted, that it's generally unsafe to test for exact equivalency between floats, as initialization errors or rounding errors in calculations can introduce minor differences that will cause the == operator to return false.
A better practice is to do something like
float f = 0.7;

if (fabs(f - 0.7) < FLT_EPSILON)
    printf("equal");
else
    printf("not equal");
Assuming that FLT_EPSILON has been defined as an appropriately small float value for your platform.
Since the rounding or initialization errors will be unlikely to exceed the value of FLT_EPSILON, this will give you the reliable equivalency test you're looking for.
A lot of the answers around the web make the mistake of looking at the absolute difference between floating-point numbers; this is only valid for special cases. The robust way is to look at the relative difference, as below:
#include <algorithm>
#include <cmath>

// Floating point comparison:
bool CheckFP32Equal(float referenceValue, float value)
{
    const float fp32_epsilon = float(1E-7);
    float abs_diff = std::abs(referenceValue - value);

    // Both identical zero is a special case
    if (referenceValue == 0.0f && value == 0.0f)
        return true;

    float rel_diff = abs_diff / std::max(std::abs(referenceValue), std::abs(value));

    if (rel_diff < fp32_epsilon)
        return true;
    else
        return false;
}
Consider this:
int main()
{
    float a = 0.7;

    if (0.7 > a)
        printf("Hi\n");
    else
        printf("Hello\n");

    return 0;
}
In if (0.7 > a), a is a float variable and 0.7 is a double constant. The double constant 0.7 is greater than the float variable a, hence the if condition is satisfied and it prints 'Hi'.
Example:
int main()
{
    float a = 0.7;
    printf("%.10f %.10f\n", 0.7, a);
    return 0;
}
Output:
0.7000000000 0.6999999881
The floating-point value saved in the variable and the constant do not have the same data type; the difference is in the precision of the data types.
If you change the data type of the variable f to double, it will print 'equal'. This is because floating-point constants are double by default, and double's precision is higher than float's. It becomes completely clear if you look at how floating-point numbers are converted to binary.
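A hedged sketch of that last point: with f declared as double, no float-to-double conversion happens and the comparison succeeds.
#include <stdio.h>

int main(void)
{
    double f = 0.7;            /* the literal 0.7 is already a double */
    if (f == 0.7)
        printf("equal\n");     /* this branch is taken */
    else
        printf("not equal\n");
    return 0;
}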