casting an integer variable to float - casting

#FRob's answer to my recent question (to_float() and dividing errors) led me to analyze the float_pkg_c.vhdl, particularly the to_float method.
When trying the following operation:
variable denum : integer;
variable num : integer;
variable dividend : float (4 downto -27);
begin
dividend := to_float(num, 4, 27) / to_float(denum, 4, 27);
...
I keep getting this error: "Error (10454): VHDL syntax error at float_pkg_c.vhdl(3840): right bound of range must be a constant"
Now, at the mentioned line:
for I in fract'high downto maximum (fract'high - shift + 1, 0) loop
The variable fract is calculated based on the parameter fraction_width, which is 27 in my case, therefore a constant.
However, the shift variable is calculated based on the arg parameter (basically, a log2 of the absolute value of arg), which is the num variable in my case, therefore not a constant.
So, the error is clear, but my question is: How can I cast an integer variable to float?
This is the definition of to_float:
function to_float (
arg : INTEGER;
constant exponent_width : NATURAL := float_exponent_width; -- length of FP output exponent
constant fraction_width : NATURAL := float_fraction_width; -- length of FP output fraction
constant round_style : round_type := float_round_style) -- rounding option
What is even more confusing to me is that arg in the above definition is not required to be a constant.

After spending a few hours reading up on synthesizing loops and trying to translate the to_float with integer arg I had a thought:
library ieee;
library ieee_proposed;
use ieee_proposed.float_pkg.all;
use ieee.numeric_std.all;
entity SM is
end entity;
architecture foo of SM is
-- From float_pkg_c.vhdl line 391/3927 (package float_pkg):
-- -- to_float (signed)
-- function to_float (
-- arg : SIGNED;
-- constant exponent_width : NATURAL := float_exponent_width; -- length of FP output exponent
-- constant fraction_width : NATURAL := float_fraction_width; -- length of FP output fraction
-- constant round_style : round_type := float_round_style) -- rounding option
-- return UNRESOLVED_float is
begin
UNLABELLED:
process
variable denum : integer;
variable num : integer;
variable dividend : float (4 downto -27);
begin
denum := 42;
num := 21;
dividend := to_float(TO_SIGNED(num,32), 4, 27) / to_float(TO_SIGNED(denum,32), 4, 27);
assert dividend /= 0.5
report "dividend = " & to_string(dividend)
severity NOTE;
wait;
end process;
end architecture;
I don't think you really want to synthesize the integer version of to_float. Unrolling the loop gives you a bunch of ripple adds for decrementing shiftr and adjusting arg_int. Trying to get rid of those operations leads you to a bit-array style representation of an integer.
Note that there is no loop in the to_float whose arg type is signed. It's likely that the TO_SIGNED calls are simply seen as defining the number of bits used to represent the integers rather than implying additional hardware. You end up with something that converts bit fields, normalizes, clamps to infinity, and so on.

You cast to float using the to_float function overload you are already using.
Your variables num and denum are uninitialized and default to integer'left which is -2**31. The to_float function tries to convert negative numbers to positive to stay within the natural range of arg_int but integer'high is limited to 2**31-1 and can't represent -integer'low. Set them to an initial value other than the default and see what happens.
From float_pkg_c.vhdl:
if arg < 0 then
result (exponent_width) := '1';
arg_int := -arg; -- Make it positive.
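The range problem described above is not specific to VHDL. As a rough analogy in C++ (not VHDL, and assuming the 32-bit two's-complement range the answer describes), the most negative value has no positive counterpart, so negating it safely needs a wider type. The function names here are made up for illustration:

```cpp
#include <cstdint>
#include <limits>

// Returns true when -x is representable as a 32-bit signed integer.
// Only the most negative value fails: its magnitude is max() + 1.
bool negation_fits(std::int32_t x) {
    return x != std::numeric_limits<std::int32_t>::min();
}

// Computes |x| in a 64-bit type so that even the most negative
// 32-bit input has a representable result.
std::int64_t wide_abs(std::int32_t x) {
    const std::int64_t wide = x;
    return wide < 0 ? -wide : wide;
}
```

With these, negation_fits(-21) is true, while negation_fits applied to the minimum 32-bit value is false, which is the case the uninitialized variables run into.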

Related

Efficient way to sum all the bits values of a variable in Fortran

Say I have an integer variable a with bit representation 101010. I need to sum all the bit values (the 1s and 0s) together, which results in 3 in this case. Is there a more efficient way to do this than the following naive code?
sum = 0
do i=0, bit_size(a) - 1
sum = sum + ibits(a, i, 1)
end do
From https://gcc.gnu.org/onlinedocs/gfortran/POPCNT.html :
Description:
POPCNT(I) returns the number of bits set (’1’ bits) in the binary representation of I.
Standard:
Fortran 2008 and later
Class:
Elemental function
Syntax:
RESULT = POPCNT(I)
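For comparison with the bit-by-bit loop in the question, here is a minimal C++ sketch (not Fortran) of a well-known software technique that loops once per set bit instead of once per bit position. It only illustrates the idea of a population count; it is not a claim about how POPCNT is implemented, and the function name is made up. C++20 also offers std::popcount in <bit>.

```cpp
#include <cstdint>

// Counts the 1 bits in value. Each iteration clears the lowest set bit,
// so the loop body runs once per set bit rather than once per bit position.
int count_set_bits(std::uint32_t value) {
    int count = 0;
    while (value != 0) {
        value &= value - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}
```

For the value from the question, 101010 in binary (42 in decimal), count_set_bits(42) returns 3.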

sprintf %g specifier gives too few digits after point

I'm trying to write floating-point variables into my ini file and I encountered a problem with format specifiers.
I have a float value, let it be 101.9716. Now I want to write it to my ini file, but the problem is that I have other float values with less precision (such as 15.85), and those values are written to the ini file in the same loop.
So I do:
sprintf(valLineY, "%g", grade[i].yArr[j]);
All my other variables become nice strings like "20" (if it was 20.00000), "13.85" (if it was 13.850000) and so on. But 101.9716 becomes "101.972" for some reason.
Can you please tell me why this happens and how to make it "101.9716" without ruining my approach (which is about removing trailing zeroes and unneeded precision)?
Thanks for any help.
Why does this happen?
I tested:
double f = 101.9716;
printf("%f\n", f);
printf("%e\n", f);
printf("%g\n", f);
And it output:
101.971600
1.019716e+02 // Notice the exponent +02
101.972
Here's what C standard (N1570 7.21.6.1) says about conversion specifier g:
A double argument representing a floating-point number is converted in style f or e (or in style F or E in the case of a G conversion specifier), depending on the value converted and the precision. Let P equal the precision if nonzero, 6 if the precision is omitted, or 1 if the precision is zero. Then, if a conversion with style E would have an exponent of X:
- if P > X ≥ −4, the conversion is with style f (or F) and precision P − (X + 1).
- otherwise, the conversion is with style e (or E) and precision P − 1.
So given above, P will equal 6, because precision is not specified, and X will equal 2, because it's the exponent on style e.
Formula 6 > 2 >= -4 is thus true, and style f is selected. And precision will then be 6 - (2 + 1) = 3.
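The selection rule can be written out as a small C++ sketch. This only models the quoted rule; it is not how any particular C library implements %g. The names GChoice and choose_g_style are made up, and a negative precision argument is used here to mean "precision omitted".

```cpp
// Result of applying the %g rule: which style is used and with what precision.
struct GChoice {
    char style;     // 'f' or 'e'
    int precision;  // precision passed on to that style
};

// precision: the precision given with %g, or a negative number if omitted
//            (an assumption of this sketch).
// exponent:  the exponent X that a style-e conversion would produce.
GChoice choose_g_style(int precision, int exponent) {
    // P is the precision if nonzero, 6 if omitted, 1 if zero.
    const int p = precision < 0 ? 6 : (precision == 0 ? 1 : precision);
    if (p > exponent && exponent >= -4) {
        return {'f', p - (exponent + 1)};
    }
    return {'e', p - 1};
}
```

For 101.9716 with no precision given, choose_g_style(-1, 2) yields style 'f' with precision 3, matching the calculation above.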
How to fix?
Unlike f, style g will strip unnecessary zeroes, even when precision is set.
Finally, unless the # flag is used, any trailing zeros are removed from the fractional portion of the result and the decimal-point character is removed if there is no fractional portion remaining.
So set high enough precision:
printf("%.8g\n", f);
prints:
101.9716
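If the number of significant digits needs to be chosen at run time, the precision can be passed through the `*` field of the format. A minimal sketch, where format_g is a made-up helper name and the 64-byte buffer is an assumption that is ample for ordinary values:

```cpp
#include <cstdio>
#include <string>

// Formats value with %g using the given number of significant digits.
// Trailing zeros are removed by %g itself, as described above.
std::string format_g(double value, int significant_digits) {
    char buffer[64];
    // snprintf never writes past the buffer; overly long output is truncated.
    std::snprintf(buffer, sizeof buffer, "%.*g", significant_digits, value);
    return std::string(buffer);
}
```

With this helper, format_g(101.9716, 6) gives "101.972", while format_g(101.9716, 8) gives "101.9716" and format_g(20.0, 8) still gives "20".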

DWORD division Delphi / C++

So in C++ I can do something like:
DWORD count;
count = 3 / 1.699999;
cout << count;
which will result in:
1
Delphi, however, complains about a Cardinal and Extended type mismatch.
var
count: DWORD;
begin
count := 3 / 1.6;
Writeln(inttostr(count));
So I either have to round, count := round(3 / 1.6), which results in:
2
or truncate, count := trunc(3 / 1.6), which results in:
1
Is trunc really the way to go?
Is there maybe a compiler switch I would have to toggle?
You would think it's easy to google something like that, but trust me, it isn't.
Thanks for your time!
C/C++ only has one arithmetic division operator - / - but its behavior depends on the type of operands you pass to it. It can perform both integer division and floating point division.
Delphi has two arithmetic division operations - div for integer division, and / for floating point division.
Your C++ code is performing floating point division, and then assigning the result to a DWORD, which is not a floating point type, so the assignment truncates the fractional part:
3 / 1.699999 is 1.764706920415836, which truncates to 1.
In Delphi, the / operator returns an Extended, which is a floating-point type. Unlike C/C++, Delphi does not allow a floating-point type to be assigned directly to an integral type. You have to use Round() or Trunc().
In this case, the Delphi equivalent of your C++ code is to use Trunc():
var
count: DWORD;
begin
count := Trunc(3 / 1.699999);
Write(IntToStr(count));
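To make the C++ side of the comparison explicit, here is a small sketch in which the implicit truncation is spelled out as a cast and set next to a rounding variant. The function names are made up, and std::uint32_t stands in for DWORD. Note that std::lround rounds halfway cases away from zero, which may not match Delphi's Round in tie cases.

```cpp
#include <cmath>
#include <cstdint>

// Floating-point division followed by truncation toward zero, which is what
// the implicit double-to-DWORD conversion does in the C++ code above.
// Precondition: the quotient is non-negative and fits in 32 bits.
std::uint32_t truncating_divide(double numerator, double denominator) {
    return static_cast<std::uint32_t>(numerator / denominator);
}

// Floating-point division followed by rounding to the nearest integer.
// Same precondition as above.
std::uint32_t rounding_divide(double numerator, double denominator) {
    return static_cast<std::uint32_t>(std::lround(numerator / denominator));
}
```

Here truncating_divide(3, 1.699999) and truncating_divide(3, 1.6) both give 1, while rounding_divide(3, 1.6) gives 2, the same results as Trunc and Round in the question.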
The easiest way is to use trunc(3 / 1.699999).
Another way is to multiply before dividing, so that only integer arithmetic is involved.
var
count: DWORD;
begin
count := 3;
count := (count*1000000) div 1699999;
Writeln(inttostr(count));
Of course, to avoid overflow, count should be < maxInt div 1000000.
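The same multiply-then-divide idea can be sketched in C++. Doing the multiplication in a 64-bit type removes the overflow concern for any 32-bit count and scale. The function and parameter names are made up for illustration:

```cpp
#include <cstdint>

// Returns floor(count * scale / scaled_divisor) using integer arithmetic only.
// The product of two 32-bit values always fits in 64 bits, so it cannot overflow.
// Precondition: scaled_divisor != 0.
std::uint64_t scaled_divide(std::uint32_t count,
                            std::uint32_t scale,
                            std::uint32_t scaled_divisor) {
    const std::uint64_t product = static_cast<std::uint64_t>(count) * scale;
    return product / scaled_divisor;
}
```

For the example above, scaled_divide(3, 1000000, 1699999) returns 1, the same as truncating 3 / 1.699999.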

What does overflowing mean in this case?

I have found an algorithm for modular exponentiation. The following pseudocode is taken from Wikipedia, page "Modular exponentiation", section "Right-to-left binary method".
The full pseudocode is
function modular_pow(base, exponent, modulus)
Assert :: (modulus - 1) * (modulus - 1) does not overflow base
result := 1
base := base mod modulus
while exponent > 0
if (exponent mod 2 == 1):
result := (result * base) mod modulus
exponent := exponent >> 1
base := (base * base) mod modulus
return result
I don't understand what this line of pseudocode means Assert :: (modulus - 1) * (modulus - 1) does not overflow base; what does this line mean and how would it be best programmed in C++?
In most computer programming languages numbers can only be stored with a limited precision or over a certain range.
For example, a C++ int will often be a 32-bit signed integer, capable of storing at most 2^31 - 1 as a value.
If you try to multiply two numbers together and the result would be greater than 2^31 - 1, you do not get the result you were expecting: it has overflowed.
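As a concrete illustration, one way to tell whether a 32-bit multiplication would overflow is to carry it out in a 64-bit type and compare against the 32-bit limits. This is a sketch with a made-up function name, assuming the operands are std::int32_t:

```cpp
#include <cstdint>
#include <limits>

// Returns true when a * b does not fit in a 32-bit signed integer.
// The product is computed in 64 bits, where it always fits.
bool product_overflows_int32(std::int32_t a, std::int32_t b) {
    const std::int64_t wide = static_cast<std::int64_t>(a) * b;
    return wide > std::numeric_limits<std::int32_t>::max() ||
           wide < std::numeric_limits<std::int32_t>::min();
}
```

For example, 46340 * 46340 = 2147395600 still fits, whereas 46341 * 46341 = 2147488281 exceeds 2^31 - 1, so product_overflows_int32 returns false for the first pair and true for the second.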
Assert is (roughly speaking) a way to check preconditions; "this must be true to proceed". In C++ you'll code it using the assert macro, or your own hand-rolled assertion system.
'does not overflow' means that the result of the expression must not be too large to fit into whatever integer type base has; since it is a multiplication, that is quite possible. Signed integer overflow in C++ is undefined behaviour, so it's wise to guard against it! There are plenty of resources out there that explain integer overflow, such as this article on Wikipedia.
To do the check in C++, a nice simple approach is to store the intermediate result in a larger integer type and check that it's small enough to fit in the destination type. For example, if base is int32_t, use int64_t and check that the result is no greater than std::numeric_limits<int32_t>::max() (this needs the <cassert>, <cstdint> and <limits> headers):
const int64_t intermediate = (static_cast<int64_t>(modulus) - 1) * (static_cast<int64_t>(modulus) - 1);
assert(intermediate <= static_cast<int64_t>(std::numeric_limits<int32_t>::max()));
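Putting the pieces together, the pseudocode from the question might be written in C++ roughly as follows. This is a sketch under stated assumptions rather than a definitive implementation: it uses std::uint64_t for all values, so the precondition becomes "modulus - 1 fits in 32 bits", which guarantees that (modulus - 1) * (modulus - 1) fits in 64 bits. The early return for modulus == 1 is an addition that is not in the quoted pseudocode.

```cpp
#include <cassert>
#include <cstdint>

// Right-to-left binary modular exponentiation: (base ^ exponent) mod modulus.
std::uint64_t modular_pow(std::uint64_t base,
                          std::uint64_t exponent,
                          std::uint64_t modulus) {
    assert(modulus != 0);
    // (modulus - 1) * (modulus - 1) must not overflow 64 bits.
    assert(modulus - 1 <= 0xFFFFFFFFull);
    if (modulus == 1) {
        return 0;  // everything is congruent to 0 modulo 1
    }
    std::uint64_t result = 1;
    base %= modulus;
    while (exponent > 0) {
        if (exponent & 1) {
            result = (result * base) % modulus;  // both factors are < modulus
        }
        exponent >>= 1;
        base = (base * base) % modulus;
    }
    return result;
}
```

As a quick check, modular_pow(4, 13, 497) returns 445, since 4^13 = 67108864 and 67108864 mod 497 = 445.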

Splitting a 64-bit value to fit in an argument of type double

I have a function whose signature I cannot change; say it is a library function that I am calling:
void schedule(double _val);
void caller() {
uint64_t value = 0xFFFFFFFFFFFFFFF;
schedule(value);
}
As the function schedule accepts a double as its argument, I lose precision whenever the value needs more than 52 bits (considering that a double stores the mantissa as a 52-bit value).
What I intend to do is this: if the value is greater than the largest value a double can hold exactly, I loop over the remaining value, so that in the end the calls sum up to the correct value.
void caller() {
uint64_t value = 0xFFFFFFFFFFFFFFF;
for(count = 0; count < X ; count++) {
schedule(Y);
}
}
I need to extract X and Y from the variable 'value'.
How can this be achieved?
My objective is not to lose precision because of the type conversion.
If your problem is only losing precision in caller and not in schedule, then no loop is needed:
void caller() {
uint64_t value = 0xFFFFFFFFFFFFFFF;
uint64_t modulus = (uint64_t) 1 << 53;
schedule(value - value % modulus);
schedule(value % modulus);
}
In value - value % modulus, only the high 11 bits are significant, because the low 53 have been cleared. So, when it is converted to double, there is no error, and the exact value is passed to schedule. Similarly, value % modulus has only 53 bits and is converted to double exactly.
(The encoding of the significand of an IEEE-754 64-bit binary floating-point object has 52 bits, but the actual significand has 53 bits, due to the implicit leading bit.)
Note: The above may result in schedule being called with an argument of zero, which we have not established is permitted. If that is a problem, such a call should be skipped.
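To see that both parts really do survive the conversion, the split can be isolated in a helper that returns the two doubles, which can then be converted back and compared. The names ExactParts and split_for_double are made up for this sketch, and schedule itself is left out so that the block stands on its own:

```cpp
#include <cstdint>

// The two values that would be passed to schedule().
struct ExactParts {
    double high;  // value with its low 53 bits cleared
    double low;   // the low 53 bits of value
};

// Splits a 64-bit value into two parts that each convert to double exactly.
ExactParts split_for_double(std::uint64_t value) {
    const std::uint64_t modulus = std::uint64_t{1} << 53;
    const std::uint64_t low = value % modulus;  // at most 53 significant bits
    const std::uint64_t high = value - low;     // at most 11 significant bits
    return {static_cast<double>(high), static_cast<double>(low)};
}
```

Converting both parts back to std::uint64_t and adding them reproduces the original value, including for 0xFFFFFFFFFFFFFFF from the question. A caller that must not pass zero to schedule can check each part before the call.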
If N is the max integral value your double can represent precisely, then obviously you can use
Y = N
and
X = amount / Y
(assuming integral division, with amount being the 64-bit value to be scheduled). Once you have finished iterating over X, you still have to schedule the remainder
R = amount % Y
Just keep in mind that all integral calculations have to be performed within the domain of uint64_t type, i.e. you have to add proper suffix to the constants (UL or ULL), or use type casts to uint64_t or use intermediate variables of type uint64_t.
Of course, if your program doesn't really care how many times schedule is called as long as the total is correct, then you can use virtually any value for N, as long as it can be represented precisely. For example, you can simply set N = 10000.
On the other hand, if you want to minimize the number of schedule calls, then it is worth noting that, due to the "implicit 1" rule, the largest integer that fits in the significand (52 stored bits plus the implicit leading bit) is (1 << 53) - 1.
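The X, Y and R described in this answer can be computed as in the following sketch. ChunkPlan and plan_chunks are made-up names, Y is fixed to (1 << 53) - 1 as an assumption taken from the paragraph above, and all arithmetic is kept in std::uint64_t:

```cpp
#include <cstdint>

// How to split a value into X calls of size Y plus one call with remainder R.
struct ChunkPlan {
    std::uint64_t chunk;        // Y: the amount passed in each full call
    std::uint64_t full_chunks;  // X: how many full calls are needed
    std::uint64_t remainder;    // R: the amount for the final call (may be 0)
};

ChunkPlan plan_chunks(std::uint64_t value) {
    // Largest 53-bit integer; it converts to double exactly.
    const std::uint64_t chunk = (std::uint64_t{1} << 53) - 1;
    return {chunk, value / chunk, value % chunk};
}
```

For value = 0xFFFFFFFFFFFFFFF (2^60 - 1) this gives X = 128 and R = 127, since 128 * (2^53 - 1) = 2^60 - 128. The caller would then invoke schedule X times with Y and once more with R, skipping the last call if R is zero and a zero argument is not allowed.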