How to know how my compiler encodes floating-point data? - c++

How to know how floating-point data are stored in a C++ program?
If I assign the number 1.23 to a double object for example, how can I know how this number is encoded?

The official way to know how floating-point data are encoded is to read the documentation of the C++ implementation, because the 2017 C++ standard says, in 6.9.1 “Fundamental types” [basic.fundamental], paragraph 8, draft N4659:
… The value representation of floating-point types is implementation-defined…
“Implementation-defined” means the implementation must document it (3.12 “implementation-defined behavior” [defns.impl.defined]).
The C++ standard appears to be incomplete in this regard, as it says “… the value representation is a set of bits in the object representation that determines a value…” (6.9 “Types” [basic.types] 4) and “The object representation of an object of type T is the sequence of N unsigned char objects taken up by the object of type T,…” (ibid), but I do not see that it says the implementation must define which of the bits in the object representation are the value representation, or in which order/mapping. Nonetheless, the responsibility of informing you about the characteristics of the C++ implementation lies with the implementation and the implementors, because no other party can do it. (That is, the implementors create the implementation, and they can do so in arbitrary ways, so they are the ones who determine what the characteristics are, so they are the source of that information.)
The C standard defines some mathematical characteristics of floating-point types and requires implementations to describe them in <float.h>. C++ inherits these in <cfloat> (C++ 20.5.1.2 “Header” [headers] 3-4). C 2011 5.2.4.2.2 “Characteristics of floating types <float.h>” defines a model in which a floating-point number x equals s · b^e · Σ (f_k · b^−k for k = 1 to p), where s is a sign (±1), b is the base or radix, e is an exponent between e_min and e_max, inclusive, p is the precision (number of base-b digits in the significand), and the f_k are base-b digits of the significand (nonnegative integers less than b). The floating-point type may also contain infinities and Not-a-Number (NaN) “values”, and some values are distinguished as normal or subnormal. <float.h> then relates the parameters of this model:
FLT_RADIX provides the base, b.
FLT_MANT_DIG, DBL_MANT_DIG, and LDBL_MANT_DIG provide the number of significand digits, also known as the precision, p, for the float, double, and long double types, respectively.
FLT_MIN_EXP, DBL_MIN_EXP, LDBL_MIN_EXP, FLT_MAX_EXP, DBL_MAX_EXP, and LDBL_MAX_EXP provide the minimum and maximum exponents, e_min and e_max.
In addition to providing these in <cfloat>, C++ provides them in the numeric_limits template defined in the <limits> header (21.3.4.1 “numeric_limits members” [numeric.limits.members]) as radix (b), digits (p), min_exponent (e_min), and max_exponent (e_max). For example, std::numeric_limits<double>::digits gives the number of digits in the significand of the double type. The template includes other members that describe the floating-point type, such as whether it supports infinities, NaNs, and subnormal values.
These provide a complete description of the mathematical properties of the floating-point format. However, as stated above, C++ appears to fail to specify that the implementation should document how the value bits that represent a type appear in the object bits.
Many C++ implementations use the IEEE-754 basic 32-bit binary format for float and the 64-bit format for double, and the value bits are mapped to the object bits in the same way as for integers of the corresponding width. If so, for normal numbers, the sign s is encoded in the most significant bit (0 or 1 for +1 or −1, respectively), the exponent e is encoded using the biased value e+126 (float) or e+1022 (double) in the next 8 (float) or 11 (double) bits, and the remaining bits contain the digits f_k for k from 2 to p. The first digit, f_1, is 1 for normal numbers. For subnormal numbers, the exponent field is zero, and f_1 is 0. (The biases here are 126 and 1022 instead of the 127 and 1023 used in IEEE-754 because the C model expresses the significand using b^−k rather than the b^(1−k) used in IEEE-754.) Infinities are encoded with all ones in the exponent field and all zeros in the significand field. NaNs are encoded with all ones in the exponent field and a significand field that is not all zeros.

The compiler will use the encoding used by the CPU architecture you are compiling for. (Unless that architecture doesn't support floating point, in which case the compiler would probably choose the encoding used by the software library that emulates it.)
The vendor that designed the CPU architecture should document the encoding that the CPU uses. You can find out what the documentation says by reading it.
The IEEE 754 standard is fairly ubiquitous.

Related

Do floats, doubles, and long doubles have a guaranteed minimum precision?

From my previous question "Is floating point precision mutable or invariant?" I received a response which said,
C provides DBL_DIG, DBL_DECIMAL_DIG, and their float and long double
counterparts. DBL_DIG indicates the minimum relative decimal
precision. DBL_DECIMAL_DIG can be thought of as the maximum relative
decimal precision.
I looked these macros up. They are found in the header <cfloat>. The cplusplus.com reference page lists macros for float, double, and long double.
Here are the macros for minimum precision values.
FLT_DIG 6 or greater
DBL_DIG 10 or greater
LDBL_DIG 10 or greater
If I took these macros at face value, I would assume that a float has a minimum decimal precision of 6, while a double and long double have a minimum decimal precision of 10. However, being a big boy, I know that some things may be too good to be true.
Therefore, I would like to know. Do floats, doubles, and long doubles have guaranteed minimum decimal precision, and is this minimum decimal precision the values of the macros given above?
If not, why?
Note: Assume we are using programming language C++.
If std::numeric_limits<F>::is_iec559 is true, then the guarantees of the IEEE 754 standard apply to floating point type F.
Otherwise (and anyway), minimum permitted values of symbols such as DBL_DIG are specified by the C standard, which, indisputably for the library, “is incorporated into [the C++] International Standard by reference”, as quoted from C++11 §17.5.1.5/1.
Edit:
As noted by TC in a comment here,
“<climits> and <cfloat> are normatively incorporated by §18.3.3 [c.limits]; the minimum values are specified in turn in §5.2.4.2.2 of the C standard.”
Unfortunately for the formal view, first of all that quote from C++11 is from section 17.5, which is only informative, not normative. And secondly, the wording in the C standard saying that the values specified there are minimums is also in a section (the C99 standard's Annex E) that is informative, not normative. So while it can be regarded as an in-practice guarantee, it is not a formal guarantee.
One strong indication that the in-practice minimum precision for float is 6 decimal digits, i.e. that no implementation will give less, is that output operations default to precision 6, and this is normative text.
Disclaimer: It may be that there is additional wording that provides guarantees that I didn't notice. Not very likely, but possible.
Do floats, doubles, and long doubles have guaranteed minimum decimal precision, and is this minimum decimal precision the values of the macros given above?
I can't find any place in the standard that guarantees any minimal values for decimal precision.
The following quote from http://en.cppreference.com/w/cpp/types/numeric_limits/digits10 might be useful:
Example
An 8-bit binary type can represent any two-digit decimal number exactly, but 3-digit decimal numbers 256..999 cannot be represented. The value of digits10 for an 8-bit type is 2 (8 * std::log10(2) is 2.41)
The standard 32-bit IEEE 754 floating-point type has a 24 bit fractional part (23 bits written, one implied), which may suggest that it can represent 7 digit decimals (24 * std::log10(2) is 7.22), but relative rounding errors are non-uniform and some floating-point values with 7 decimal digits do not survive conversion to 32-bit float and back: the smallest positive example is 8.589973e9, which becomes 8.589974e9 after the roundtrip. These rounding errors cannot exceed one bit in the representation, and digits10 is calculated as (24-1)*std::log10(2), which is 6.92. Rounding down results in the value 6.
However, the C standard specifies the minimum values that need to be supported.
From the C Standard:
5.2.4.2.2 Characteristics of floating types
...
9 The values given in the following list shall be replaced by constant expressions with implementation-defined values that are greater or equal in magnitude (absolute value) to those shown, with the same sign
...
-- number of decimal digits, q, such that any floating-point number with q decimal digits can be rounded into a floating-point number with p radix b digits and back again without change to the q decimal digits,
...
FLT_DIG 6
DBL_DIG 10
LDBL_DIG 10
To be more specific: since my compiler uses the IEEE 754 standard, the decimal precision is guaranteed to be 6 to 9 significant decimal digits for float and 15 to 17 significant decimal digits for double. Also, since a long double on my compiler is the same size as a double, it too has 15 to 17 significant decimal digits.
These ranges can be verified from IEEE 754 single-precision binary floating-point format: binary32 and IEEE 754 double-precision binary floating-point format: binary64 respectively.
The C++ Standard says nothing specific about limits on floating point types. You may interpret the incorporation of the C Standard "by reference" as you wish, but if you take the limits as specified there (N1570), section 5.2.4.2.2 subpoint 15:
EXAMPLE 1
The following describes an artificial floating-point representation that meets the minimum requirements of this International Standard, and the appropriate values in a header for type
float:
FLT_RADIX 16
FLT_MANT_DIG 6
FLT_EPSILON 9.53674316E-07F
FLT_DECIMAL_DIG 9
FLT_DIG 6
FLT_MIN_EXP -31
FLT_MIN 2.93873588E-39F
FLT_MIN_10_EXP -38
FLT_MAX_EXP +32
FLT_MAX 3.40282347E+38F
FLT_MAX_10_EXP +38
By this section, float, double, and long double have these properties at the least.

Is `double` guaranteed by C++03 to represent small integers exactly?

Does the C++03 standard guarantee that sufficiently small non-zero integers are represented exactly in double? If not, what about C++11? Note, I am not assuming IEEE compliance here.
I suspect that the answer is no, but I would love to be proved wrong.
When I say sufficiently small, I mean, bounded by some value that can be derived from the guarantees of C++03, and maybe even be calculated from values made available via std::numeric_limits<double>.
EDIT:
It is clear (now that I have checked) that std::numeric_limits<double>::digits is the same thing as DBL_MANT_DIG, and std::numeric_limits<double>::digits10 is the same thing as DBL_DIG, and this is true for both C++03 and C++11.
Further more, C++03 defers to C90, and C++11 defers to C99 with respect to the meaning of DBL_MANT_DIG and DBL_DIG.
Both C90 and C99 state that the minimum allowable value for DBL_DIG is 10, i.e., 10 decimal digits.
The question then is, what does that mean? Does it mean that integers of up to 10 decimal digits are guaranteed to be represented exactly in double?
In that case, what is then the purpose of DECIMAL_DIG in C99, and the following remark in C99 §5.2.4.2.2 / 12?
Conversion from (at least) double to decimal with DECIMAL_DIG digits and back
should be the identity function.
Here is what C99 §5.2.4.2.2 / 9 has to say about DBL_DIG:
Number of decimal digits, 'q', such that any floating-point
number with 'q' decimal digits can be rounded into a
floating-point number with 'p' radix 'b' digits and back again
without change to the q decimal digits,
{ p * log10(b) if 'b' is a power of 10
{
{ floor((p-1) * log10(b)) otherwise
FLT_DIG 6
DBL_DIG 10
LDBL_DIG 10
I'll be happy if someone can help me unpack this.
Well, 3.9.1 [basic.fundamental] paragraph 8 states
... The value representation of floating-point types is implementation-defined. ...
At least, the implementation has to define what representation it uses.
On the other hand, std::numeric_limits<F> defines a couple of members which seem to imply that the representation is of the form significand × radix^exponent:
std::numeric_limits<F>::radix: the radix of the exponent
std::numeric_limits<F>::digits: the number of radix digits
I think these statements imply that you can represent integers in the range 0 ... radix^digits − 1 exactly.
From the C standard, "Characteristics of floating types <float.h>", which is normative for C++, I would assume that you can combine FLT_RADIX and FLT_MANT_DIG into useful information: The number of digits in the mantissa and the base in which they are expressed.
For example, for a single-precision IEEE 754 float, these would be 2 and 24 respectively, so you should be able to store integers of absolute value up to 2^24.

std::numeric_limits::is_exact ... what is a usable definition?

As I interpret it, MSDN's definition of numeric_limits::is_exact is almost always false:
[all] calculations done on [this] type are free of rounding errors.
And IBM's definition is almost always true: (Or a circular definition, depending on how you read it)
a type that has exact representations for all its values
What I'm certain of is that I could store a 2 in both a double and a long and they would both be represented exactly.
I could then divide them both by 10 and neither would hold the mathematical result exactly.
Given any numeric data type T, what is the correct way to define std::numeric_limits<T>::is_exact?
Edit:
I've posted what I think is an accurate answer to this question from details supplied in many answers. This answer is not a contender for the bounty.
The definition in the standard (see NPE's answer) isn't very exact, is it? Instead, it's circular and vague.
Given that the IEC floating point standard has a concept of "inexact" numbers (and an inexact exception when a computation yields an inexact number), I suspect that this is the origin of the name is_exact. Note that of the standard types, is_exact is false only for float, double, and long double.
The intent is to indicate whether the type exactly represents all of the numbers of the underlying mathematical type. For integral types, the underlying mathematical type is some finite subset of the integers. Since each integral type exactly represents each and every one of the members of the subset of the integers targeted by that type, is_exact is true for all of the integral types. For floating point types, the underlying mathematical type is some finite range subset of the real numbers. (An example of a finite range subset is "all real numbers between 0 and 1".) There's no way to represent even a finite range subset of the reals exactly; almost all are uncomputable. The IEC/IEEE format makes matters even worse. With that format, computers can't even represent a finite range subset of the rational numbers exactly (let alone a finite range subset of the computable numbers).
I suspect that the origin of the term is_exact is the long-standing concept of "inexact" numbers in various floating point representation models. Perhaps a better name would have been is_complete.
Addendum
The numeric types defined by the language aren't the be-all and end-all of representations of "numbers". A fixed point representation is essentially the integers, so they too would be exact (no holes in the representation). Representing the rationals as a pair of standard integral types (e.g., int/int) would not be exact, but a class that represented the rationals as a Bignum pair would, at least theoretically, be "exact".
What about the reals? There's no way to represent the reals exactly because almost all of the reals are not computable. The best we could possibly do with computers is the computable numbers. That would require representing a number as some algorithm. While this might be useful theoretically, from a practical standpoint, it's not that useful at all.
Second Addendum
The place to start is with the standard. Both C++03 and C++11 define is_exact as being
True if the type uses an exact representation.
That is both vague and circular. It's meaningless. Not quite so meaningless is that integer types (char, short, int, long, etc.) are "exact" by fiat:
All integer types are exact, ...
What about other arithmetic types? The first thing to note is that the only other arithmetic types are the floating point types float, double, and long double (3.9.1/8):
There are three floating point types: float, double, and long double. ... The value representation of floating-point types is implementation-defined. Integral and floating types are collectively called arithmetic types.
The meaning of the floating point types in C++ is markedly murky. Compare with Fortran:
A real datum is a processor approximation to the value of a real number.
Compare with ISO/IEC 10967-1, Language independent arithmetic (which the C++ standards reference in footnotes, but never as a normative reference):
A floating point type F shall be a finite subset of ℝ.
C++, on the other hand, is silent with regard to what the floating point types are supposed to represent. As far as I can tell, an implementation could get away with making float a synonym for int, double a synonym for long, and long double a synonym for long long.
Once more from the standards on is_exact:
... but not all exact types are integer. For example, rational and fixed-exponent representations are exact but not integer.
This obviously doesn't apply to user-developed extensions for the simple reason that users are not allowed to define std::whatever<MyType>. Do that and you're invoking undefined behavior. This final clause can only pertain to implementations that
Define float, double, and long double in some peculiar way, or
Provide some non-standard rational or fixed point type as an arithmetic type and decide to provide a std::numeric_limits<non_standard_type> for these non-standard extensions.
I suggest that is_exact is true iff all literals of that type have their exact value. So is_exact is false for the floating types because the value of literal 0.1 is not exactly 0.1.
Per Christian Rau's comment, we can instead define is_exact to be true when the results of the four arithmetic operations between any two values of the type are either out of range or can be represented exactly, using the definitions of the operations for that type (i.e., truncating integer division, unsigned wraparound). With this definition you can cavil that floating-point operations are defined to produce the nearest representable value. Don't :-)
The problem of exactness is not restricted to C, so let's look further.
Leaving aside discussion about the drafting of standards, inexactness has to apply to mathematical operations that require rounding to represent the result in the same type. For example, Scheme has this kind of definition of exactness/inexactness by means of exact operations and exact literal constants; see R5RS §6, standard procedures, from http://www.schemers.org/Documents/Standards/R5RS/HTML
For the case of double x = 0.1 we either consider that 0.1 is a well-defined double literal, or, as in Scheme, that the literal is an inexact constant formed by an inexact compile-time operation (rounding to the nearest double the result of the operation 1/10, which is well defined in ℚ). So we always end up at operations.
Let's concentrate on +; the others can be defined mathematically by means of + and group properties.
A possible definition of inexactness could then be:
If there exists any pair of values (a,b) of a type such that a+b-a-b != 0,
then this type is inexact (in the sense that the + operation is inexact).
For every floating-point representation we know of (leaving aside the trivial cases of NaN and infinity), such a pair obviously exists, so we can say that float (operations) are inexact.
For the well-defined unsigned arithmetic model, + is exact.
For signed int, we have the problem of UB in case of overflow, so no guarantee of exactness... unless we refine the rule to cope with this broken arithmetic model:
If there exists any pair (a,b) such that (a+b) is well defined
and a+b-a-b != 0,
then the + operation is inexact.
The well-definedness above could help us extend this to other operations as well, but it's not really necessary.
We would then have to consider the case of / as false polymorphism rather than inexactness
(/ being defined as the quotient of Euclidean division for int).
Of course, this is not an official rule; the validity of this answer is limited to the effort of rational thinking.
The definition given in the C++ standard seems fairly unambiguous:
static constexpr bool is_exact;
True if the type uses an exact representation. All integer types are exact, but not all exact types are
integer. For example, rational and fixed-exponent representations are exact but not integer.
Meaningful for all specializations.
In C++ the int type is used to represent a mathematical integer type (i.e. one of the set of {..., -1, 0, 1, ...}). Due to the practical limitation of implementation, the language defines the minimum range of values that should be held by that type, and all valid values in that range must be represented without ambiguity on all known architectures.
The standard also defines types that are used to hold floating point numbers, each with their own range of valid values. What you won't find is the list of valid floating point numbers. Again, due to practical limitations the standard allows for approximations of these types. Many people try to say that only numbers that can be represented by the IEEE floating point standard are exact values for those types, but that's not part of the standard. Though it is true that the implementation of the language on binary computers has a standard for how double and float are represented, there is nothing in the language that says it has to be implemented on a binary computer. In other words float isn't defined by the IEEE standard, the IEEE standard is just an acceptable implementation. As such, if there were an implementation that could hold any value in the range of values that define double and float without rounding rules or estimation, you could say that is_exact is true for that platform.
Strictly speaking, T can't be your only argument to tell whether a type "is_exact", but we can infer some of the other arguments. Because you're probably using a binary computer with standard hardware and any publicly available C++ compiler, when you assign a double the value of .1 (which is in the acceptable range for the floating point types), that's not the number the computer will use in calculations with that variable. It uses the closest approximation as defined by the IEEE standard. Granted, if you compare a literal with itself your compiler should return true, because the IEEE standard is pretty explicit. We know that computers don't have infinite precision and therefore calculations that we expect to have a value of .1 won't necessarily end up with the same approximate representation that the literal value has. Enter the dreaded epsilon comparison.
To practically answer your question, I would say that for any type which requires an epsilon comparison to test for approximate equality, is_exact should return false. If strict comparison is sufficient for that type, it should return true.
std::numeric_limits<T>::is_exact should be false if and only if T's definition allows values that may be unstorable.
C++ considers any floating point literal to be a valid value for its type. And implementations are allowed to decide which values have exact stored representation.
So for every real number in the allowed range (such as 2.0 or 0.2), C++ always promises that the number is a valid double and never promises that the value can be stored exactly.
This means that two assumptions made in the question - while true for the ubiquitous IEEE floating point standard - are incorrect for the C++ definition:
I'm certain that I could store a 2 in a double exactly.
I could then divide [it] by 10 and [the double would not] hold the
mathematical result exactly.

At what point do doubles begin to lose precision?

My application needs to perform some operations: >, <, ==, !=, +, -, ++, etc. (but without division) on some numbers. Those numbers are sometimes integer, and more rarely floats.
If I internally use the "double" type (as defined by IEEE 754) even for integers, up to what point can I be safe to use them as if they were ints, without running into strange rounding errors (for example, n == 5 and n == 6 both being true because they round to the same number)?
Obviously the second operand of the various operations (+, -, etc.) is always an integer, and I know that with 0.000[..]01 I'll have trouble from the start.
As a bonus answer, the same question but for float.
The number of bits in an IEEE-754 double mantissa is 52, and there's an extra implied bit that is always 1. This means the maximum value that can be contained exactly is 2^53, or 9007199254740992.
A float mantissa is 23 bits, again with an implied bit. The maximum integer that can be exactly represented is 2^24, or 16777216.
If your intent is to hold integer values only, there's usually a 64-bit integer type that would be more appropriate than a double.
Edit: originally I had 2^53-1 and 2^24-1, but I realized there's no need to subtract 1 - an even number can take advantage of an implied 0 bit to the right of the mantissa.
Regarding C#: do be aware that the range of the decimal type is smaller than that of a double. That is, double can hold a larger value, but it does so by losing precision. Or, as stated on MSDN:
The decimal keyword denotes a 128-bit
data type. Compared to floating-point
types, the decimal type has a greater
precision and a smaller range, which
makes it suitable for financial and
monetary calculations. The approximate
range and precision for the decimal
type are shown in the following table.
The primary difference between decimal and double is that decimal is fixed-point and double is floating-point. That means that decimal stores an exact value, while double represents a value as a binary fraction and is less precise. A decimal is 128 bits, so it takes double the space to store. Calculations on decimal are also slower (measure!).
If you need even larger precision, then BigInteger can be used from .NET 4. (You will need to handle decimal points yourself.) Here you should be aware that BigInteger is immutable, so any arithmetic operation on it will create a new instance; if numbers are large, this might be crippling for performance.
I suggest you look into exactly how much precision you need. Perhaps your algorithm can work with normalized values that can be smaller? If performance is an issue, one of the built-in floating-point types is likely to be faster.

Some questions about floating points

I'm wondering if a number is represented one way in a floating point representation, is it going to be represented in the same way in a representation that has a larger size.
That is, if a number has a particular representation as a float, will it have the same representation if that float is cast to a double and then still the same when cast to a long double.
I'm wondering because I'm writing a BigInteger implementation, and any floating-point number that is passed in I am sending to a function that accepts a long double to convert it. Which leads me to my next question. Obviously floating-point numbers do not always have exact representations, so in my BigInteger class what should I be attempting to represent when given a float? Is it reasonable to try to represent the same number as given by std::cout << std::fixed << someFloat; even if that is not the same as the number passed in? Is that the most accurate representation I will be able to get? If so, ...
What's the best way to extract that value (in base some power of 10)? At the moment I'm just grabbing it as a string and passing it to my string constructor. This will work, but I can't help feeling there's a better way; certainly taking the remainder when dividing by my base is not accurate with floats.
Finally, I wonder if there is a floating-point equivalent of uintmax_t, that is, a typename that will always be the largest floating-point type on a system; or is there no point because long double will always be the largest (even if it's the same as a double)?
Thanks, T.
If by "same representation" you mean "exactly the same binary representation in memory except for padding", then no. Double-precision has more bits of both exponent and mantissa, and also has a different exponent bias. But I believe that any single-precision value is exactly representable in double-precision (except possibly denormalised values).
I'm not sure what you mean when you say "floating points do not always have exact representations". Certainly, not all decimal floating-point values have exact binary floating-point values (and vice versa), but I'm not sure that's a problem here. So long as your floating-point input has no fractional part, then a suitably large "BigInteger" format should be able to represent it exactly.
Conversion via a base-10 representation is not the way to go. In theory, all you need is a bit-array of length ~1024, initialise it all to zero, and then shift the mantissa bits in by the exponent value. But without knowing more about your implementation, there's not a lot more I can suggest!
double includes all values of float; long double includes all values of double. So you're not losing any value information by conversion to long double. However, you're losing information about the original type, which is relevant (see below).
In order to follow common C++ semantics, conversion of a floating point value to integer should truncate the value, not round.
The main problem is with large values that are not exact. You can use the frexp function to find the base 2 exponent of the floating point value. You can use std::numeric_limits<T>::digits to check if that's within the integer range that can be exactly represented.
My personal design choice would be an assert that the fp value is within the range that can be exactly represented, i.e. a restriction on the range of any actual argument.
To do that properly you need overloads taking float and double arguments, since the range that can be represented exactly depends on the actual argument's type.
When you have an fp value that is within the allowed range, you can use floor and fmod to extract digits in any numeral system you want.
Yes, going from IEEE float to double to extended, the bits carry over from the smaller format to the larger format; for example:
single
S EEEEEEEE MMMMMMM.....
double
S EEEEEEEEEEE MMMMM....
6.5 single
0 10000001 101000...
6.5 double
0 10000000001 101000...
13 single
0 10000010 101000...
13 double
0 10000000010 101000...
The mantissa is left-justified and then padded with zeros.
The exponent is right-justified: copy the msbit, and fill the new bits in between with the complement of the msbit (in effect, sign-extending the inverted msbit).
An exponent of -2, for example: take -2 and subtract 1, which is -3. -3 in two's complement is 0xFD or 0b11111101, but the exponent bits in the format are 0b01111101, the msbit inverted. And for double, a -2 exponent: -2 - 1 = -3, or 0b1111...1101, and that becomes 0b0111...1101, the msbit inverted. (Exponent bits = twos_complement(exponent - 1) with the msbit inverted.)
As above, for an exponent of 3: 3 - 1 = 2 = 0b000...010; invert the upper bit: 0b100...010.
So yes, you can take the bits from a single-precision number and copy them to the proper locations in the double-precision number. I don't have an extended-float reference handy, but I'm pretty sure it works the same way.