conversion from long double to long long int - c++

I have a long double sine and a long long int amp. I am using <cmath> and have code as follows:
sine = sin(point);
amp = round(sine * 2^31);
Here the variable point is incrementing in 0.009375 intervals. The first line here works fine but on the second I receive this error message:
error: invalid operands of types 'long double' and 'int' to binary 'operator^'
I'm unsure what this error means and the main request here is 'How can I get around this to get an output integer into the variable amp?'

In C++ the ^ operator means exclusive or, not exponentiation. You probably meant (1ULL << 31).

The reason for the error is that * is multiplication, and ^ is the bitwise xor operator which can only be applied to integral types.
Multiplication (*) has higher precedence than ^. So the compiler interprets amp = round(sine * 2^31); as amp = round( (sine *2)^31);. sine (presumably) has type long double, so the result of sine*2 is also of type long double. long double is not an integral type, so cannot be an operand of ^. Hence the error.
Your mistake is assuming that ^ represents exponentiation, which it does not.
You can fix the problem by either
amp = round (sine * pow(2.0, 31)); // all floating point
or
amp = round (sine * (1UL << 31));
The second computes 1 left-shifted by 31 bits as an unsigned long (which is guaranteed to be able to represent the result, unlike unsigned or int, for which there is no such guarantee). Then, in doing the multiplication, it converts that value to long double.
If you are doing predominantly floating point operations, the first is more understandable to people who will maintain such code. The second is probably rather cryptic to someone who writes numeric code but is not well acquainted with bit fiddling operations - as, ironically, you have demonstrated in your belief that ^ is exponentiation.
You would need to test to determine which option offers greater performance (given the need to convert unsigned long to long double in the second, and the potential for std::pow() in the first to be optimised for some special cases). In other words, there is potential for the compiler optimiser to get aggressive in both cases, or for the implementation of pow() to be lovingly hand-crafted, or both.

Related

Narrowing conversion on a (int64_t * static_cast<float>(double))

I'm fixing some lint errors in a codebase I am working with, and I see the following line
// note offset is int64_t, scale is int64_t, x is double
int64_t y = offset + lrintf(scale * static_cast<float>(x));
the linter complains and says
narrowing conversion from 'int64_t' (aka 'long') to 'float'.
It seems the scale * static_cast<float>(x) is what's causing issues, and I'm wondering what is the best way to handle this error? should I cast scale to a double, and then call lrintd instead? Or is there a better approach for this case?
You're right. The static_cast discards some bits from the mantissa and exponent. This also limits the number of bits in the mantissa of float(scale).
Note that a typical IEEE754 double also has insufficient bits in the mantissa to fully express every int64_t value. That's logical: both are 8 byte types. But float is only 4 bytes in IEEE754.
Seems more logical to use
int64_t y = offset + llrint(scale * x);
scale * x will be double and should be rounded to long long (AKA int64_t).
Note anyway that this might modify the program behavior in some cases (truncating the double to float or not could result in different y).

long long value in Visual Studio

We know that -2*4^31 + 1 = -9.223.372.036.854.775.807, the lowest value you can store in long long, as being said here: What range of values can integer types store in C++.
So I have this operation:
#include <iostream>
unsigned long long pow(unsigned a, unsigned b) {
unsigned long long p = 1;
for (unsigned i = 0; i < b; i++)
p *= a;
return p;
}
int main()
{
long long nr = -pow(4, 31) + 5 -pow(4,31);
std::cout << nr << std::endl;
}
Why does it show -9.223.372.036.854.775.808 instead of -9.223.372.036.854.775.803? I'm using Visual Studio 2015.
This is a really nasty little problem which has three(!) causes.
Firstly there is a problem that floating point arithmetic is approximate. If the compiler picks a pow function returning float or double, then 4**31 is so large that 5 is less than 1ULP (unit of least precision), so adding it will do nothing (in other words, 4.0**31+5 == 4.0**31). Multiplying by -2 can be done without loss, and the result can be stored in a long long without loss as the wrong answer: -9.223.372.036.854.775.808.
Secondly, a standard header may include other standard headers, but is not required to. Evidently, Visual Studio's version of <iostream> includes <math.h> (which declares pow in the global namespace), but Code::Blocks' version doesn't.
Thirdly, the OP's pow function is not selected because he passes arguments 4, and 31, which are both of type int, and the declared function has arguments of type unsigned. Since C++11, there are lots of overloads (or a function template) of std::pow. These all return float or double (unless one of the arguments is of type long double - which doesn't apply here).
Thus an overload of std::pow will be a better match ... with a double return value, and we get floating point rounding.
Moral of the story: Don't write functions with the same name as standard library functions, unless you really know what you are doing!
Visual Studio has defined pow(double, int), which only requires a conversion of one argument, whereas your pow(unsigned, unsigned) requires conversion of both arguments unless you use pow(4U, 31U). Overload resolution in C++ is based on the arguments, not the result type.
The lowest long long value can be obtained through numeric_limits. For long long it is:
auto lowest_ll = std::numeric_limits<long long>::lowest();
which results in:
-9223372036854775808
The pow() function that gets called is not yours hence the observed results. Change the name of the function.
The only possible explanation for the -9.223.372.036.854.775.808 result is the use of the pow function from the standard library returning a double value. In that case, the 5 will be below the precision of the double computation, and the result will be exactly -2^63, which converted to a long long gives 0x8000000000000000 or -9.223.372.036.854.775.808.
If you use your function returning an unsigned long long, you get a warning saying that you apply unary minus to an unsigned type and still get an ULL. So the whole operation should be executed as unsigned long long and should give, without overflow, 0x8000000000000005 as the unsigned value. When you convert it to a signed value, the result is implementation-defined before C++20 (C++20 defines it as wrap-around), but all compilers I know simply use the signed integer with the same representation, which is -9.223.372.036.854.775.803.
But it would be simple to make the computation as signed long long without any warning by just using:
long long nr = -1 * pow(4, 31) + 5 - pow(4,31);
In addition, you have neither an undefined cast nor an overflow here, so the result is perfectly defined per the standard, provided unsigned long long is at least 64 bits.
Your first call to pow is using the C standard library's function, which operates on floating points. Try giving your pow function a unique name:
unsigned long long my_pow(unsigned a, unsigned b) {
unsigned long long p = 1;
for (unsigned i = 0; i < b; i++)
p *= a;
return p;
}
int main()
{
long long nr = -my_pow(4, 31) + 5 - my_pow(4, 31);
std::cout << nr << std::endl;
}
This code reports an error: "unary minus operator applied to unsigned type, result still unsigned". So, essentially, your original code called a floating point function, negated the value, and applied some integer arithmetic to it, for which it did not have enough precision to give the answer you were looking for (at 19 digits of precision!). To get the answer you're looking for, change the signature to:
long long my_pow(unsigned a, unsigned b);
This worked for me in MSVC++ 2013. As stated in other answers, you're getting the floating-point pow because your function expects unsigned, and receives signed integer constants. Adding U to your integers invokes your version of pow.

C++ pow unusual type conversion

When I directly output std::pow(10,2), I get 100, while doing (long)(pow(10,2)) gives 99. Can someone explain this, please?
cout<<pow(10,2)<<endl;
cout<<(long)(pow(10,2))<<endl;
The code is basically this in the main function.
The compiler is mingw32-g++.exe -std=c++11 using CodeBlocks
Windows 8.1 if that helps
Floating point numbers are approximations. Occasionally you get a number that can be exactly represented, but don't count on it. 100 should be representable, but in this case it isn't. Something injected an approximation and ruined it for everybody.
When converting from a floating point type to an integer, the integer cannot hold any fractional values so they are unceremoniously dropped. There is no implicit rounding off, the fraction is discarded. 99.9 converts to 99. 99 with a million 9s after it is 99.
So before converting from a floating point type to an integer, round the number, then convert. Unless discarding the fraction is what you want to do.
cout, and most output routines, politely and silently round floating point values before printing, so if there is a bit of an approximation the user isn't bothered with it.
This inexactness is also why you shouldn't directly compare floating point values. X probably isn't exactly pi, but it might be close enough for your computations, so you perform the comparison with an epsilon, a fudge factor, to tell if you are close enough.
What I find amusing, and burned a lot of time trying to sort out, is that OP would not have even seen this problem if not for using namespace std;.
(long)pow(10,2) provides the expected result of 100. (long)std::pow(10,2) does not. Some difference in the path from 10,2 to 100 taken by pow and std::pow results in slightly different results. By pulling the entire std namespace into their file, OP accidentally shot themselves in the foot.
Why is that?
Up at the top of the file we have using namespace std; this means the compiler is not just considering double pow(double, double) when looking for pow overloads, it can also call std::pow and std::pow is a nifty little template making sure that when called with datatypes other than float and double the right conversions are taking place and everything is the same type.
(long)(pow(10,2))
Does not match
double pow(double, double)
as well as it matches a template instantiation of
double std::pow(int, int)
Which, near as I can tell resolves down to
return pow(double(10), double(2));
after some template voodoo.
What the difference between
pow(double(10), double(2))
and
pow(10, 2)
with an implied conversion from int to double on the call to pow is, I do not know. Call in the language lawyers because it's something subtle.
If this is purely a rounding issue then
auto tempa = std::pow(10, 2);
should be vulnerable because tempa should be exactly what std::pow returns
cout << tempa << endl;
cout << (long) tempa << endl;
and the output should be
100
99
I get
100
100
So immediately casting the return of std::pow(10, 2) into a long is different from storing and then casting. Weird. auto tempa is not exactly what std::pow returns or there is something else going on that is too deep for me.
These are the std::pow overloads:
float pow( float base, float exp );
double pow( double base, double exp );
long double pow( long double base, long double exp );
float pow( float base, int iexp );//(until C++11)
double pow( double base, int iexp );//(until C++11)
long double pow( long double base, int iexp ); //(until C++11)
Promoted pow( Arithmetic1 base, Arithmetic2 exp ); //(since C++11)
But your strange behaviour is MinGW's weirdness about double storage and how the Windows run-time doesn't like it. I'm assuming Windows is seeing something like 99.9999, and when that is cast to an integral type it takes the floor.
int a = 3/2; // a is = 1
mingw uses the Microsoft C run-time libraries and their implementation of printf does not support the 'long double' type. As a work-around, you could cast to 'double' and pass that to printf instead.
Therefore, you need double double:
On the x86 architecture, most C compilers implement long double as the 80-bit extended precision type supported by x86 hardware (sometimes stored as 12 or 16 bytes to maintain data structure alignment), as specified in the C99 / C11 standards (IEC 60559 floating-point arithmetic (Annex F)). An exception is Microsoft Visual C++ for x86, which makes long double a synonym for double.[2] The Intel C++ compiler on Microsoft Windows supports extended precision, but requires the /Qlong‑double switch for long double to correspond to the hardware's extended precision format.[3]

Confusion about float data type declaration in C++

a complete newbie here. For my school homework, I was given to write a program that displays -
s= 1 + 1/2 + 1/3 + 1/4 ..... + 1/n
Here's what I did -
#include<iostream.h>
#include<conio.h>
void main()
{
clrscr();
int a;
float s=0, n;
cin>>a;
for(n=1;n<=a;n++)
{
s+=1/n;
}
cout<<s;
getch();
}
It perfectly displays what it should. However, in the past I have only written programs which uses int data type. To my understanding, int data type does not contain any decimal place whereas float does. So I don't know much about float yet. Later that night, I was watching some video on YouTube in which he was writing the exact same program but in a little different way. The video was in some foreign language so I couldn't understand it. What he did was declared 'n' as an integer.
int a, n;
float s=0;
instead of
int a;
float s=0, n;
But this was not displaying the desired result. So he went ahead and showed two ways to correct it. He made changes in the for loop body -
s+=1.0f/n;
and
s+=1/(float)n;
To my understanding, he declared 'n' a float data type later in the program(Am I right?). So, my question is, both display the same result but is there any difference between the two? As we are declaring 'n' a float, why he has written 1.0f instead of n.f or f.n. I tried it but it gives error. And in the second method, why we can't write 1(float)/n instead of 1/(float)n? As in the first method we have added float suffix with 1. Also, is there a difference between 1.f and 1.0f?
I tried to google my question but couldn't find any answer. Also, another confusion that came to my mind after a few hours is - Why are we even declaring 'n' a float? As per the program, the sum should come out as a real number. So, shouldn't we declare only 's' a float. The more I think the more I confuse my brain. Please help!
Thank You.
The reason is that integer division behaves different than floating point division.
4 / 3 gives you the integer 1. 10 / 3 gives you the integer 3.
However, 4.0f / 3 gives you the float 1.3333..., 10.0f / 3 gives you the float 3.3333...
So if you have:
float f = 4 / 3;
4 / 3 will give you the integer 1, which will then be stored into the float f as 1.0f.
You instead have to make sure either the divisor or the dividend is a float:
float f = 4.0f / 3;
float f = 4 / 3.0f;
If you have two integer variables, then you have to convert one of them to a float first:
int a = ..., b = ...;
float f = (float)a / b;
float f = a / (float)b;
The first is equivalent to something like:
float tmp = a;
float f = tmp / b;
Since n will only ever have an integer value, it makes sense to define it as an int. However, doing so means that this won't work as you might expect:
s+=1/n;
In the division operation both operands are integer types, so it performs integer division which means it takes the integer part of the result and throws away any fractional component. So 1/2 would evaluate to 0 because dividing 1 by 2 results in 0.5, and throwing away the fraction results in 0.
This is in contrast to floating point division, which keeps the fractional component. C will perform floating point division if either operand is a floating point type.
In the case of the above expression, we can force floating point division by performing a typecast on either operand:
s += (float)1/n
Or:
s += 1/(float)n
You can also specify the constant 1 as a floating point constant by giving a decimal component:
s += 1.0/n
Or appending the f suffix:
s += 1.0f/n
The f suffix (as well as the U, L, and LL suffixes) can only be applied to numerical constants, not variables.
What he is doing is something called casting. I'm sure your school will mention it in later lectures. Basically, n is set as an integer for the entire program. But since integer and float are similar (both are numbers), the C/C++ language allows you to use one as the other as long as you tell the compiler what you want to use it as. You do this by adding parentheses and the data type, i.e.
(float) n
he declared 'n' a float data type later in the program(Am I right?)
No, he defined (thereby also declared) n an int and later he explicitly converted (casted) it into a float. Both are very different.
both display the same result but is there any difference between the two?
Nope. They're the same in this context. When an arithmetic operator has int and float operands, the former is implicitly converted into the latter and thereby the result will also be a float. He's just shown you two ways to do it. When both the operands are integers, you'd get an integer value as a result, which may be incorrect when proper mathematical division would give you a non-integer quotient. To avoid this, usually one of the operands is made into a floating-point number so that the actual result is closer to the expected result.
why he has written 1.0f instead of n.f or f.n. I tried it but it gives error. [...] Also, is there a difference between 1.f and 1.0f?
This is because the language syntax is defined thus. When you're writing a floating-point literal, the f suffix goes on the number itself. So 5 would be an int while 5.0f or 5.f is a float; there's no difference when you omit any trailing 0s. However, n.f is a syntax error since n is an identifier (variable) name and not a numeric literal.
And in the second method, why we can't write 1(float)/n instead of 1/(float)n?
(float)n is a valid C-style cast of the int variable n, while 1(float) is just a syntax error.
s+=1.0f/n;
and
s+=1/(float)n;
... So, my question is, both display the same result but is there any difference between the two?
Yes.
In both C and C++, when a calculation involves expressions of different types, one or more of those expressions will be "promoted" to the type with greater precision or range. So if you have an expression with signed and unsigned operands, the signed operand will be "promoted" to unsigned. If you have an expression with float and double operands, the float operand will be promoted to double.
Remember that division with two integer operands gives an integer result - 1/2 yields 0, not 0.5. To get a floating point result, at least one of the operands must have a floating point type.
In the case of 1.0f/n, the expression 1.0f has type float (see note 1 below), so the n will be "promoted" from type int to type float.
In the case of 1/(float) n, the expression n is being explicitly cast to type float, so the expression 1 is promoted from type int to float.
Nitpicks:
Unless your compiler documentation explicitly lists void main() as a legal signature for the main function, use int main() instead. From the online C++ standard:
3.6.1 Main function
...
2 An implementation shall not predefine the main function. This function shall not be overloaded. It shall have a declared return type of type int, but otherwise its type is implementation-defined...
Secondly, please format your code - it makes it easier for others to read and debug. Whitespace and indentation are your friends - use them.
1. The constant expression 1.0 with no suffix has type double. The f suffix tells the compiler to treat it as float. 1.0/n would result in a value of type double.

C++ calculation with type "long"

I have an inline function that does a frequency to period conversion. The calculation has to be done using type long, not type double; otherwise, it may cause some rounding errors. The function then converts the result back to double. I was wondering which line in the code below would keep the calculation in type long, no matter whether the parameter bar is 100, 100.0 or 33.3333.
double foo(long bar)
{
return 1000000/bar;
return 1000000.0/bar;
return (long)1000000/bar;
return (long)1000000.0/bar;
}
I tried it myself, and the 4th line works. But just wondering the concept of type conversion in this case.
EDIT:
One of the errors is 1000000/37038 = 26, not 26.9993.
return 1000000/bar;
This will do the math as a long.
return 1000000.0/bar;
This will do the math as a double.
return (long)1000000.0/bar;
This is equivalent to the first -- 1000000.0 is a double, but then you cast it to long before the division, so the division will be done on longs.
This problem, as you posed it, doesn't make sense.
bar is of an integral type, so 1000000/bar will surely be no larger in magnitude than 1000000, which can be represented exactly by a double (see the note on DBL_DIG below), so there's no way in which performing the calculation all in integral arithmetic can give better precision. Actually, you will get integer division, which in this case is less precise for any value of bar, since it will truncate the decimal part. The only way you can have a problem in a long to double conversion here is in the conversion of bar to double, but if it exceeds the range of double the final result of the division will be 0, as it would be anyway in integer arithmetic.
Still:
1000000/bar
performs a division between longs: 1000000 is an int or a long, depending on the platform, bar is a long; the first operand gets promoted to a long if necessary and then an integer division is performed.
1000000.0/bar
performs a division between doubles: 1000000.0 is a double literal, so bar gets promoted to double before the division.
(long)1000000/bar
is equivalent to the first one: the cast has precedence over the division, and forces 1000000 (which is either a long or an int) to be a long; bar is a long, division between longs is performed.
(long)1000000.0/bar
is equivalent to the previous one: 1000000.0 is a double, but you cast it to a long and then integer division is performed.
The C standard, to which the C++ standard delegates the matter, asks for a minimum of 10 decimal digits for the mantissa of doubles (DBL_DIG) and at least 10**37 as representable power of ten before going out of range (DBL_MAX_10_EXP) (C99, annex E, ¶4).
The first line (and the third, more verbosely) will do the math as long (which in C++ always truncates the quotient toward zero) and then return the integral value as a double. I don't understand what you're saying in your question about bar being 33.3333 because that's not a possible long value.