Reading binary file into SAS

Reading binary file into SAS - sas

I have a binary file of fixed length records created in MS SSIS which I need to read into SAS 9.4 64bit. Currently the file is read within a data step using this code:
data outputdata.(EOC=no
compress = yes
keep = a b c);
length a $4.;
length b 4.;
infile "&inputfile." obs= 999999999 lrecl=308 recfm=F;
input #5 a $4.
#9 b ib4.
#13 c rb4.
;
...
...
...
All variables are read correctly into the output dataset except c. c is a floating point number with 2dp, minimum value 0.00 and maximum value 99.99. In case it's useful, c starts its life off as a VB.Net Single value which is converted to binary using VB.Net's BitConverter.GetBytes(Single) which returns a 4-byte array. This array is then written to the binary record.
From what I can tell from my research on the subject rb4. is the correct way to read a 4-byte floating point ('real'?) value from a binary record in SAS so presumably the issue lies in how to then format that value so that it appears correctly in the output dataset. I've tried the following:
format c rb2.2;
format c 2.2;
format c 4.;
along with variations on the values of the formats statements (e.g. format c 5.; etc). None of the formats I've tried have resulted in anything close to the correct values; most result in numbers in scientific form such as 17E9.
c is a new addition to the binary file and is the only 'real' variable contained within it so I don't have an example to work from. I'm new to SAS and have inherited this project so there's a good chance the issue is something fairly fundamental!
Any guidance appreciated. Thanks

Repeating my comment as an answer...
You should use FLOAT4. to read a value that was written by the VB.NET BitConverter.GetBytes(Single) function. The RB4. informat reads four input bytes as if they are a truncated double-precision floating-point value, but the output of the VB.NET function is a single-precision floating-point value, aka a 'float', which is not the same thing.
The note on SAS's documentation page for the FLOAT format explains:
The FLOATw.d informat is useful in operating environments where a float value is not the same as a truncated double.
On the IBM mainframe systems, a four-byte floating-point number is the same as a truncated eight-byte floating-point number. However, in operating environments that use the IEEE floating-point standard, such as the IBM PC-based operating environments and most UNIX platforms, a four-byte floating-point number is not the same as a truncated double. Therefore, the RB4. informat does not produce the same results as FLOAT4. Floating-point representations other than IEEE might have this same characteristic. Values read with FLOAT4. typically come from some other external program that is running in your operating environment.

Related

Long Numeric to Character in SAS

When I try to convert a 10 digit number which is stored as 8. into a character using put(,8.) it gives me the character value as something like 1.2345e10. I want the character string to be as 1234567890. Can someone please help ?

8. is the width (number of characters) you'd like to use. So of course it is 1.2345e9; that's what it can fit in 8 characters.
x_char = put(x,10.);
That asks for it to be put in 10 characters. Keep extending it if you want it more than 10. Just be aware you may need to use the optional alignment option (put(x,10. -l)) if you aren't happy with the default alignment of right aligned for shorter-than-maximum-length values.
Do note that when you originally describe the number as 8., I suspect you're actually referring to length; length is not the display size but the storage size (bytes in memory set aside). For character variables in a SBCS system they're identical, but for numeric variables length and format are entirely unrelated.

Unless very sure of your inputs, I find it best to use best.:
data have;
x=1234567890;output;
x=12345.67890;output;
x=.1234567890;output;
x=12345678901234;output;
run;
data want;
set have;
length ten best $32;
ten=put(x,10.);
best=put(x,best32.);
run;
Note that using 10. here would wipe out any decimals, as well as larger numbers:

SAS stores numbers as IEEE 64bit floating point numbers. So when you see that the LENGTH of the variable is 8 that just means you are storing the full 8 bytes of the floating point representation. If the length is less than 8 then you will lose some of the ability to store more digits of precision, but when it pulls the data from the dataset to be used by a data step or proc it will always convert it back to the full 64 bit IEEE format.
The LENGTH that you use to store the data has little to do with how many characters you should use to display the value for humans to read. How to display the value is controlled by the FORMAT that you use. What format to use will depend on the type of values your variable will contain.
So if you know your values are always integers and the maximum value is 9,999,999,999 then you can use the format 10. (also known as F10.).
charvar= put(numvar,F10.);

SAS LENGTH statement: what is a byte?

How would you explain to someone how much a "byte" is in the LENGTH statement? I always thought 1 byte equaled 1 character or 1 number, but that doesn't seem to be the case. Also, why is the syntax for it different than the syntax for the FORMAT statement? i.e.:
/*FORMAT Statement Syntax*/
FORMAT variable_name $8.;
/*LENGTH Statement*/
LENGTH variable_name $ 8

The syntax is different because they do different things. The LENGTH statement defines the type of the variable and how much room it takes to store the variable in the dataset. The FORMAT statement defines which FORMAT you want to attach to the variable so that SAS knows how to transform the variable when writing the value out to the log or output window.
The $ in the length statement means you are defining a character variable. The $ in a format statement is just part of the name of the format that you are attaching to the variable. Formats that can be used with character variables start with a $ and numeric formats do not. Formats need to have a period so that SAS can distinguish them from variable names. But the lengths used in a LENGTH statement are integers and so periods are not needed (although SAS will ignore them if you add them after the integer value).
I see a lot of confusion in SAS code where the FORMAT statement is used as if it is intended to define variables. This only works because SAS will guess at how to define a variable the first time it appears in the data step. So it will use the details of the format you are attaching to guess at what type of variable you mean. So if you first reference X in an assignment statement x=2+3 then SAS will guess that X should numeric and give it the default length of 8. But if the first place it sees X is in a format statement like format x $10. then it will guess that you wanted to make X a character variable with length 10 to match the width of the format.
As to how characters are represented and stored it depends on what encoding you are using. If you are only using simple 7-bit ASCII codes then there is a 1-1 relationship between characters and how many bytes it takes to store them. But if you are using UTF-8 it can take up to 4 bytes to store a single character.
For numeric variables SAS uses the IEEE 64 bit format so the relationship between the LENGTH used to store the variable and the width of a format used to display it is much more complex. It is best to just define all numeric variables as length 8. SAS will allow you to define numeric variables with length less than 8 bytes, but that just means it throws away those extra bits of precision when writing the values to the SAS dataset. When storing integers you can do this without loss of precision as long as there are enough bits left to store the largest number you expect. For floating point values you will lose precision.

NumPy: Is it safe to store an int64 value in an np.array with dtype float64 and later convert it back to integer?

I am wondering if I am causing problems because I am assigning and converting data types incorrectly to and from numpy-arrays in Python2.7.
What I am doing is reading a hdf5 64-bit integer value to an numpy.zeros() array from type numpy.float64! Then writing these values to another hdf5 assigning 64-bit unsigned integer!
two example of some original values which are actually ID numbers (so it is crucial that they do not change due to data type conversion):
12028545243
12004994169
Question 1: Will that unsigned integer in the second hdf5-file be the same as in the original?
I checked this with a small subsample but I cannot control if that is true for all of them (there are millions)!
Question 2: If I am reading the 64-bit value from the original file to the numpy-array with data type=float64 and then doing something like:
value=int(value)
value.astype(int64)
will that be exactly the original value or does it change due to the transformation?
Question 3: Will Python interpret the values as I assumed as (a), (b), (c), and (d)? Will there be an issue with formatting the values too, like using scientific notations 'e+10'? Or does Python recognise them as the same value (since it is only a different way to display them ...)?
1.20285452e+10 == 12028545243.0 == 12028545243 == 12028545243
1.20049942e+10 == 12004994169.0 == 12004994169 == 12004994169
(a) (b) (c) (d)
(a) listed value printing one column of array named data:
print data[:,0] <type 'numpy.ndarray'>
(b) printing a single element in data
print data[0,0] <type 'numpy.float64'>
(c) after doing the conversion
print int(data[0,0]) <type int>
(d) same as (a) but using astype() to convert!
print data[:,0].astype(numpy.int64) <type 'numpy.ndarray'>
You may ask why I am not assigning a int64 type to the numpy-array to be safe? Yes I will do that, but there is data which is already stored wrongly and I need to know if I can still trust this data ...
I am using: Python2.7, Pythonbrew, Ubuntu 14.04 LTS 64-bit on Lenovo T410

Generally, it is NOT save to store a 64 bit integer in a 64 bit float. You can easily see that for example by looking at:
import numpy as np
print(np.int64(2**63-1))
print(np.int64(np.float64(2**63-1))
While the first will give you the correct result (9223372036854775807) the second has a round-off error which results in an integer overflow (-9223372036854775808).
To understand this you have to look at how these numbers are stored. While an integer is basically only storing its absolute value in binary (plus one bit used for the sign of the number) this does not hold for a floating point number.
A floating point stores a number in three parts. One being the sign bit, the next being the significant/mantissa and the last being the exponent. The number is then given as sign times mantissa times 2^exponent. These three have to share the bits available (in your case 64). As specified in numpy's documentation for a np.float64 52 bits are used for the significant and 11 bits are used for the exponent. Therefore, only for integers up to 52 bits you will definitively get the right result if you convert them to a np.float64 and back.
So to answer your first and second question: No you cannot be sure that the numbers are the same if there are any numbers bigger than 2**52-1 in your data set.
Concerning your third question: The formatting is done only when printing the values. When comparing numbers internally the numbers do not have any formatting such that all those values will be considered equal as long as they have exactly the same value.
Btw, if you want to learn more about floating point arithmetic, a very good read is the paper "What every computer scientist should know about floating-point arithmetic" by David Goldberg.

It depends on whether Numpy converts your int64 values into float64 and then back into ints or just store the int-data in the memory reserved for float64. I assume the first option is true.
Even without inspecting float64 interna (witch is something one should do anyhow). It's clear that float64 can't have a unique representation for all 2**64 different integers, if it has itself only 2**64 different codes and needing some for 0.1 and so on as well. Float64 uses 52 bit to store a 53 bit long normalized mantissa (the most significant bit is a implicit 1) so if your int has non zero bits more them 52 bits after the the first one like with:
5764607523034234887
= 0x5000000000000007
= 0b0101000000000000000000000000000000000000000000000000000000000111
(witch is a perfectly fine 64 bit integer)
the 0b111 part in the end will just get rounded away and lost after converting it to double in order to fit the number into the mantissa. This information will then be lost for ever. This will likely happen with some of your IDs since they are usually rather big numbers.
So try adjusting your array to int64 instead.

Fortran - want to round to one decimal point

In fortran I have to round latitude and longitude to one digit after decimal point.
I am using gfortran compiler and the nint function but the following does not work:
print *, nint( 1.40 * 10. ) / 10. ! prints 1.39999998
print *, nint( 1.49 * 10. ) / 10. ! prints 1.50000000
Looking for both general and specific solutions here. For example:
How can we display numbers rounded to one decimal place?
How can we store such rounded numbers in fortran. It's not possible in a float variable, but are there other ways?
How can we write such numbers to NetCDF?
How can we write such numbers to a CSV or text file?

As others have said, the issue is the use of floating point representation in the NetCDF file. Using nco utilities, you can change the latitude/longitude to short integers with scale_factor and add_offset. Like this:
ncap2 -s 'latitude=pack(latitude, 0.1, 0); longitude=pack(longitude, 0.1, 0);' old.nc new.nc

There is no way to do what you are asking. The underlying problem is that the rounded values you desire are not necessarily able to be represented using floating point.
For example, if you had a value 10.58, this is represented exactly as 1.3225000 x 2^3 = 10.580000 in IEEE754 float32.
When you round this to value to one decimal point (however you choose to do so), the result would be 10.6, however 10.6 does not have an exact representation. The nearest representation is 1.3249999 x 2^3 = 10.599999 in float32. So no matter how you deal with the rounding, there is no way to store 10.6 exactly in a float32 value, and no way to write it as a floating point value into a netCDF file.

YES, IT CAN BE DONE! The "accepted" answer above is correct in its limited range, but is wrong about what you can actually accomplish in Fortran (or various other HGL's).
The only question is what price are you willing to pay, if the something like a Write with F(6.1) fails?
From one perspective, your problem is a particularly trivial variation on the subject of "Arbitrary Precision" computing. How do you imagine cryptography is handled when you need to store, manipulate, and perform "math" with, say, 1024 bit numbers, with exact precision?
A simple strategy in this case would be to separate each number into its constituent "LHSofD" (Left Hand Side of Decimal), and "RHSofD" values. For example, you might have an RLon(i,j) = 105.591, and would like to print 105.6 (or any manner of rounding) to your netCDF (or any normal) file. Split this into RLonLHS(i,j) = 105, and RLonRHS(i,j) = 591.
... at this point you have choices that increase generality, but at some expense. To save "money" the RHS might be retained as 0.591 (but loose generality if you need to do fancier things).
For simplicity, assume the "cheap and cheerful" second strategy.
The LHS is easy (Int()).
Now, for the RHS, multiply by 10 (if, you wish to round to 1 DEC), e.g. to arrive at RLonRHS(i,j) = 5.91, and then apply Fortran "round to nearest Int" NInt() intrinsic ... leaving you with RLonRHS(i,j) = 6.0.
... and Bob's your uncle:
Now you print the LHS and RHS to your netCDF using a suitable Write statement concatenating the "duals", and will created an EXACT representation as per the required objectives in the OP.
... of course later reading-in those values returns to the same issues as illustrated above, unless the read-in also is ArbPrec aware.
... we wrote our own ArbPrec lib, but there are several about, also in VBA and other HGL's ... but be warned a full ArbPrec bit of machinery is a non-trivial matter ... lucky you problem is so simple.

There are several aspects one can consider in relation to "rounding to one decimal place". These relate to: internal storage and manipulation; display and interchange.
Display and interchange
The simplest aspects cover how we report stored value, regardless of the internal representation used. As covered in depth in other answers and elsewhere we can use a numeric edit descriptor with a single fractional digit:
print '(F0.1,2X,F0.1)', 10.3, 10.17
end
How the output is rounded is a changeable mode:
print '(RU,F0.1,2X,RD,F0.1)', 10.17, 10.17
end
In this example we've chosen to round up and then down, but we could also round to zero or round to nearest (or let the compiler choose for us).
For any formatted output, whether to screen or file, such edit descriptors are available. A G edit descriptor, such as one may use to write CSV files, will also do this rounding.
For unformatted output this concept of rounding is not applicable as the internal representation is referenced. Equally for an interchange format such as NetCDF and HDF5 we do not have this rounding.
For NetCDF your attribute convention may specify something like FORTRAN_format which gives an appropriate format for ultimate display of the (default) real, non-rounded, variable .
Internal storage
Other answers and the question itself mention the impossibility of accurately representing (and working with) decimal digits. However, nothing in the Fortran language requires this to be impossible:
integer, parameter :: rk = SELECTED_REAL_KIND(radix=10)
real(rk) x
x = 0.1_rk
print *, x
end
is a Fortran program which has a radix-10 variable and literal constant. See also IEEE_SELECTED_REAL_KIND(radix=10).
Now, you are exceptionally likely to see that selected_real_kind(radix=10) gives you the value -5, but if you want something positive that can be used as a type parameter you just need to find someone offering you such a system.
If you aren't able to find such a thing then you will need to work accounting for errors. There are two parts to consider here.
The intrinsic real numerical types in Fortran are floating point ones. To use a fixed point numeric type, or a system like binary-coded decimal, you will need to resort to non-intrinsic types. Such a topic is beyond the scope of this answer, but pointers are made in that direction by DrOli.
These efforts will not be computationally/programmer-time cheap. You will also need to take care of managing these types in your output and interchange.
Depending on the requirements of your work, you may find simply scaling by (powers of) ten and working on integers suits. In such cases, you will also want to find the corresponding NetCDF attribute in your convention, such as scale_factor.
Relating to our internal representation concerns we have similar rounding issues to output. For example, if my input data has a longitude of 10.17... but I want to round it in my internal representation to (the nearest representable value to) a single decimal digit (say 10.2/10.1999998) and then work through with that, how do I manage that?
We've seen how nint(10.17*10)/10. gives us this, but we've also learned something about how numeric edit descriptors do this nicely for output, including controlling the rounding mode:
character(10) :: intermediate
real :: rounded
write(intermediate, '(RN,F0.1)') 10.17
read(intermediate, *) rounded
print *, rounded ! This may look not "exact"
end
We can track the accumulation of errors here if this is desired.

The `round_x = nint(x*10d0)/10d0' operator rounds x (for abs(x) < 2**31/10, for large numbers use dnint()) and assigns the rounded value to the round_x variable for further calculations.
As mentioned in the answers above, not all numbers with one significant digit after the decimal point have an exact representation, for example, 0.3 does not.
print *, 0.3d0
Output:
0.29999999999999999
To output a rounded value to a file, to the screen, or to convert it to a string with a single significant digit after the decimal point, use edit descriptor 'Fw.1' (w - width w characters, 0 - variable width). For example:
print '(5(1x, f0.1))', 1.30, 1.31, 1.35, 1.39, 345.46
Output:
1.3 1.3 1.4 1.4 345.5
#JohnE, using 'G10.2' is incorrect, it rounds the result to two significant digits, not to one digit after the decimal point. Eg:
print '(g10.2)', 345.46
Output:
0.35E+03
P.S.
For NetCDF, rounding should be handled by NetCDF viewer, however, you can output variables as NC_STRING type:
write(NetCDF_out_string, '(F0.1)') 1.49
Or, alternatively, get "beautiful" NC_FLOAT/NC_DOUBLE numbers:
beautiful_float_x = nint(x*10.)/10. + epsilon(1.)*nint(x*10.)/10./2.
beautiful_double_x = dnint(x*10d0)/10d0 + epsilon(1d0)*dnint(x*10d0)/10d0/2d0
P.P.S. #JohnE
The preferred solution is not to round intermediate results in memory or in files. Rounding is performed only when the final output of human-readable data is issued;
Use print with edit descriptor ‘Fw.1’, see above;
There are no simple and reliable ways to accurately store rounded numbers (numbers with a decimal fixed point):
2.1. Theoretically, some Fortran implementations can support decimal arithmetic, but I am not aware of implementations that in which ‘selected_real_kind(4, 4, 10)’ returns a value other than -5;
2.2. It is possible to store rounded numbers as strings;
2.3. You can use the Fortran binding of GIMP library. Functions with the mpq_ prefix are designed to work with rational numbers;
There are no simple and reliable ways to write rounded numbers in a netCDF file while preserving their properties for the reader of this file:
3.1. netCDF supports 'Packed Data Values‘, i.e. you can set an integer type with the attributes’ scale_factor‘,’ add_offset' and save arrays of integers. But, in the file ‘scale_factor’ will be stored as a floating number of single or double precision, i.e. the value will differ from 0.1. Accordingly, when reading, when calculating by the netCDF library unpacked_data_value = packed_data_value*scale_factor + add_offset, there will be a rounding error. (You can set scale_factor=0.1*(1.+epsilon(1.)) or scale_factor=0.1d0*(1d0+epsilon(1d0)) to exclude a large number of digits '9'.);
3.2. There are C_format and FORTRAN_format attributes. But it is quite difficult to predict which reader will use which attribute and whether they will use them at all;
3.3. You can store rounded numbers as strings or user-defined types;
Use write() with edit descriptor ‘Fw.1’, see above.

Precision using VariantCopyInd

I am using VariantCopyInd . The source contains 1111.199999999. However after VariantCopyInd the value gets rounded off in the destination as 1111.200000. I would like to retain the original value . how can this be achieved ?

This has nothing to do with VariantCopyInd, but merely the fact that the literal as it exists in the code, has not exact representation in the floating point format used internally by COM Variants.
Therefore, there is no way to achieve what you want, except to use the CURRENCY type of variant. It will have limited precision, see MSDN:
http://msdn.microsoft.com/en-us/library/e305240e-9e11-4006-98cc-26f4932d2118(VS.85)
CURRENCY types use a decimal representation internally, just like the code literal. You will still have to provide an indirect initialization (from string, not a float/double literal) in code, to prevent any unwanted representation effects.
MSDN on CURRENCY:
A currency number stored as an 8-byte, two's complement integer, scaled by 10,000 to give a fixed-point number with 15 digits to the left of the decimal point and 4 digits to the right. This IDispatch::GetTypeInforesentation provides a range of 922337203685477.5807 to -922337203685477.5808.
The CURRENCY data type is useful for calculations involving money, or for any fixed-point calculation where accuracy is particularly important.

I found a very good link from msdn
enter link description here
The link clearly indicates any number whose length is greater than 15 will evaluate into incorrect results .
Take 2 cases
1) 101126.199999999 will store a correct value , since the length is 15 . No conversion or precision loss
2) 111.12345678912345 will store incorrect value since the length is 17 . Conversion will be done

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js