It's commonly said that "SAS missing values equal minus infinity". But there is a problem with that statement, since there can be 27 or 28 "flavors" of missing values (the default . and .a to .z and ._), each having a predefined sort order.
Since it can't be that some infinities are larger than others, I came to understand that:
Missing values are treated like minus infinity when compared to valid numerical data, and that
When compared to other missing values, they are ranked with another set of predefined rules.
So my question is: at the lowest level, how does SAS store numerical data in a way that it can distinguish the missing from the non-missing numerical values? Is there a "missingness bit" like there is a "sign bit"?
SAS stores numbers as floating point values using the 64-bit IEEE format. They picked 28 specific bit combinations and use them to represent ., ._, and .a to .z. By convention they are ordered ._ to . to .a to .z. I am not sure if the values were picked to make it easier to test that ordering, or if the ordering was an accident of the particular bit patterns they used.
You can look at the bit patterns used by peeking into the values that are stored.
data _null_;
length i 8 str $8 ;
do i=._,.,.a,.z,constant('small'),0,1,constant('big');
str=peekclong(addrlong(i));
str=reverse(str);
put i best12. @15 i hex16. @35 str $hex16. ;
end;
run;
Result:
_ _ FFFFFF0000000000
. . FFFFFE0000000000
A A FFFFFD0000000000
Z Z FFFFE40000000000
2.22507E-308 0010000000000000 0010000000000000
0 0000000000000000 0000000000000000
1 3FF0000000000000 3FF0000000000000
1.797693E308 7FEFFFFFFFFFFFFF 7FEFFFFFFFFFFFFF
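To tie this back to the "missingness bit" question, the patterns above can be decoded as ordinary IEEE 754 doubles. The following is a minimal Python sketch (Python is used only for illustration, it is not SAS), assuming the hex strings printed above are the big-endian form of the stored bytes. The helper names `decode` and `fields` are made up for this example.

```python
import math
import struct

# Bit patterns copied from the PUT output above (big-endian hex).
PATTERNS = {
    "._": "FFFFFF0000000000",
    ".":  "FFFFFE0000000000",
    ".A": "FFFFFD0000000000",
    ".Z": "FFFFE40000000000",
}

def decode(hex_bits):
    """Interpret 16 hex digits as a big-endian IEEE 754 double."""
    return struct.unpack(">d", bytes.fromhex(hex_bits))[0]

def fields(hex_bits):
    """Split the 64 bits into (sign, exponent, mantissa)."""
    bits = int(hex_bits, 16)
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

for name, hex_bits in PATTERNS.items():
    sign, exponent, mantissa = fields(hex_bits)
    # An all-ones exponent (0x7ff) with a non-zero mantissa is a NaN.
    print(name, sign, hex(exponent), hex(mantissa), math.isnan(decode(hex_bits)))
```

All four patterns have the sign bit set, an all-ones exponent and a non-zero mantissa, which is how IEEE 754 encodes NaN. So, at least on the platform that produced this output, there does not appear to be a separate missingness bit: the missing values are particular NaN bit patterns, told apart by the third byte.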
When I try to convert a 10-digit number, which is stored as 8., into a character value using put(,8.), it gives me something like 1.2345e10. I want the character string to be 1234567890. Can someone please help?
8. is the width (number of characters) you'd like to use. So of course it is 1.2345e9; that's what it can fit in 8 characters.
x_char = put(x,10.);
That asks for it to be put in 10 characters. Use a larger width if you need more than 10. Just be aware you may need to use the optional alignment option (put(x,10. -l)) if you aren't happy with the default right alignment for shorter-than-maximum-length values.
Do note that when you originally describe the number as 8., I suspect you're actually referring to length; length is not the display size but the storage size (bytes in memory set aside). For character variables in a SBCS system they're identical, but for numeric variables length and format are entirely unrelated.
Unless very sure of your inputs, I find it best to use best.:
data have;
x=1234567890;output;
x=12345.67890;output;
x=.1234567890;output;
x=12345678901234;output;
run;
data want;
set have;
length ten best $32;
ten=put(x,10.);
best=put(x,best32.);
run;
Note that using 10. here would wipe out any decimals, and would not show the larger numbers in full.
SAS stores numbers as IEEE 64bit floating point numbers. So when you see that the LENGTH of the variable is 8 that just means you are storing the full 8 bytes of the floating point representation. If the length is less than 8 then you will lose some of the ability to store more digits of precision, but when it pulls the data from the dataset to be used by a data step or proc it will always convert it back to the full 64 bit IEEE format.
The LENGTH that you use to store the data has little to do with how many characters you should use to display the value for humans to read. How to display the value is controlled by the FORMAT that you use. What format to use will depend on the type of values your variable will contain.
So if you know your values are always integers and the maximum value is 9,999,999,999 then you can use the format 10. (also known as F10.).
charvar= put(numvar,F10.);
I have a binary file of fixed length records created in MS SSIS which I need to read into SAS 9.4 64bit. Currently the file is read within a data step using this code:
data outputdata.(EOC=no
compress = yes
keep = a b c);
length a $4.;
length b 4.;
infile "&inputfile." obs= 999999999 lrecl=308 recfm=F;
input @5 a $4.
@9 b ib4.
@13 c rb4.
;
...
...
...
All variables are read correctly into the output dataset except c. c is a floating point number with 2dp, minimum value 0.00 and maximum value 99.99. In case it's useful, c starts its life off as a VB.Net Single value which is converted to binary using VB.Net's BitConverter.GetBytes(Single) which returns a 4-byte array. This array is then written to the binary record.
From what I can tell from my research on the subject rb4. is the correct way to read a 4-byte floating point ('real'?) value from a binary record in SAS so presumably the issue lies in how to then format that value so that it appears correctly in the output dataset. I've tried the following:
format c rb2.2;
format c 2.2;
format c 4.;
along with variations on the values of the formats statements (e.g. format c 5.; etc). None of the formats I've tried have resulted in anything close to the correct values; most result in numbers in scientific form such as 17E9.
c is a new addition to the binary file and is the only 'real' variable contained within it so I don't have an example to work from. I'm new to SAS and have inherited this project so there's a good chance the issue is something fairly fundamental!
Any guidance appreciated. Thanks
Repeating my comment as an answer...
You should use FLOAT4. to read a value that was written by the VB.NET BitConverter.GetBytes(Single) function. The RB4. informat reads four input bytes as if they are a truncated double-precision floating-point value, but the output of the VB.NET function is a single-precision floating-point value, aka a 'float', which is not the same thing.
The note on SAS's documentation page for the FLOAT informat explains:
The FLOATw.d informat is useful in operating environments where a float value is not the same as a truncated double.
On the IBM mainframe systems, a four-byte floating-point number is the same as a truncated eight-byte floating-point number. However, in operating environments that use the IEEE floating-point standard, such as the IBM PC-based operating environments and most UNIX platforms, a four-byte floating-point number is not the same as a truncated double. Therefore, the RB4. informat does not produce the same results as FLOAT4. Floating-point representations other than IEEE might have this same characteristic. Values read with FLOAT4. typically come from some other external program that is running in your operating environment.
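The difference between a float and a truncated double can be seen by decoding the same four bytes both ways. This is a rough Python sketch of the idea, not SAS code: it uses big-endian byte order purely for readability, and the function names `as_float32` and `as_truncated_double` are invented for the example.

```python
import struct

def as_float32(raw4):
    """Decode 4 bytes as an IEEE 754 single-precision value."""
    return struct.unpack(">f", raw4)[0]

def as_truncated_double(raw4):
    """Decode 4 bytes as the high-order half of an IEEE 754 double,
    filling the missing low-order bytes with zeros."""
    return struct.unpack(">d", raw4 + b"\x00" * 4)[0]

# Bytes of a single-precision 12.34, similar to what a float writer would emit.
single_bytes = struct.pack(">f", 12.34)
print(as_float32(single_bytes))           # close to 12.34
print(as_truncated_double(single_bytes))  # a very different, much larger number

# Bytes of a double-precision 12.34, cut down to its first 4 bytes.
double_bytes = struct.pack(">d", 12.34)[:4]
print(as_truncated_double(double_bytes))  # close to 12.34 again
```

The two layouts split the bits differently (8 exponent bits in a float, 11 in a double), so the same bytes only make sense under the interpretation that matches how they were written.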
How would you explain to someone how much a "byte" is in the LENGTH statement? I always thought 1 byte equaled 1 character or 1 number, but that doesn't seem to be the case. Also, why is the syntax for it different than the syntax for the FORMAT statement? i.e.:
/*FORMAT Statement Syntax*/
FORMAT variable_name $8.;
/*LENGTH Statement*/
LENGTH variable_name $ 8;
The syntax is different because they do different things. The LENGTH statement defines the type of the variable and how much room it takes to store the variable in the dataset. The FORMAT statement defines which FORMAT you want to attach to the variable so that SAS knows how to transform the variable when writing the value out to the log or output window.
The $ in the length statement means you are defining a character variable. The $ in a format statement is just part of the name of the format that you are attaching to the variable. Formats that can be used with character variables start with a $ and numeric formats do not. Formats need to have a period so that SAS can distinguish them from variable names. But the lengths used in a LENGTH statement are integers and so periods are not needed (although SAS will ignore them if you add them after the integer value).
I see a lot of confusion in SAS code where the FORMAT statement is used as if it is intended to define variables. This only works because SAS will guess at how to define a variable the first time it appears in the data step. So it will use the details of the format you are attaching to guess at what type of variable you mean. So if you first reference X in an assignment statement x=2+3 then SAS will guess that X should be numeric and give it the default length of 8. But if the first place it sees X is in a format statement like format x $10. then it will guess that you wanted to make X a character variable with length 10 to match the width of the format.
As to how characters are represented and stored it depends on what encoding you are using. If you are only using simple 7-bit ASCII codes then there is a 1-1 relationship between characters and how many bytes it takes to store them. But if you are using UTF-8 it can take up to 4 bytes to store a single character.
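As a quick illustration of the bytes-versus-characters point, here is a small Python sketch (not SAS) that counts the UTF-8 bytes needed for a few sample characters. The sample characters are arbitrary choices for the example.

```python
# Sample characters: ASCII letter, accented letter, euro sign, emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
byte_counts = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_counts.items():
    # ascii() keeps the output printable on any console.
    print(ascii(ch), n)
```

Under UTF-8 these take 1, 2, 3 and 4 bytes respectively, so a character variable whose length is given in bytes may hold fewer characters than its length suggests.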
For numeric variables SAS uses the IEEE 64 bit format so the relationship between the LENGTH used to store the variable and the width of a format used to display it is much more complex. It is best to just define all numeric variables as length 8. SAS will allow you to define numeric variables with length less than 8 bytes, but that just means it throws away those extra bits of precision when writing the values to the SAS dataset. When storing integers you can do this without loss of precision as long as there are enough bits left to store the largest number you expect. For floating point values you will lose precision.
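The "enough bits left" rule for integers can be worked out from the IEEE layout. The sketch below is in Python rather than SAS and assumes an IEEE platform where a shortened numeric keeps the high-order bytes of the double (1 sign bit, 11 exponent bits, then mantissa). The function name `largest_exact_integer` is made up for the example.

```python
def largest_exact_integer(length_bytes):
    """Largest n such that every integer from 0 to n survives storage
    in the first `length_bytes` bytes of an IEEE 754 double."""
    if not 3 <= length_bytes <= 8:
        raise ValueError("this sketch covers lengths 3 through 8")
    # Bits left for the mantissa after 1 sign bit and 11 exponent bits.
    stored_mantissa_bits = 8 * length_bytes - 12
    # One extra bit of precision comes from the implicit leading 1.
    return 2 ** (stored_mantissa_bits + 1)

for n in range(3, 9):
    print(n, largest_exact_integer(n))
```

Under these assumptions length 3 gives 8192 and length 8 gives 9007199254740992 (2 to the power 53).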
I tried with different length for numeric variables. I referred below link
http://support.sas.com/documentation/cdl/en/hostwin/63285/HTML/default/viewer.htm#numvar.htm
where it is given that largest integer that can be represented with length 3 is 8192.
I tried the sample program below. I have declared a variable num with length 3. And tried storing different values which exceeds 8192.
data numeric_values;
input num;
length num 3;
datalines;
8194
8192
8193
9000
10000
10008
;
run;
I am not getting any error after executing this program.
Dataset numeric_values got created with all the values
8194
8192
8192
9000
10000
10008
Can someone please explain to me the concept of length for the numeric data type? Please correct me if my understanding is wrong.
SAS stores numbers as floating points. The largest integer that can safely be held in length 3 may be 8192, but larger values can also be stored, with a loss of precision. In your example, you can see that 8193 is actually corrupted to 8192. Your other example numbers are even, which happen to be safe up to a higher threshold, but if you picked 10009 as an example, you'd see it gets corrupted too, into 10008.
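The effect can be imitated outside SAS by keeping only the first three bytes of each double. This Python sketch assumes the extra bytes are simply dropped (truncation rather than rounding) and that the high-order bytes are the ones kept. The function name `store_with_length` is invented for the example.

```python
import struct

def store_with_length(value, length_bytes):
    """Keep the first `length_bytes` bytes of the big-endian double
    and zero the rest, then decode the result as a double again."""
    raw = struct.pack(">d", float(value))
    kept = raw[:length_bytes] + b"\x00" * (8 - length_bytes)
    return struct.unpack(">d", kept)[0]

values = [8194, 8192, 8193, 9000, 10000, 10008, 10009]
results = {v: store_with_length(v, 3) for v in values}

for original, stored in results.items():
    print(original, "->", int(stored))
```

With these assumptions 8193 comes back as 8192 and 10009 as 10008, while the even values in the list survive unchanged.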
It is interesting that SAS doesn't offer any warnings or notes when this happens. I guess they've decided the burden is on the programmer to be aware of tricks of floating point notation.
[Edited answer to refer specifically to integers, in light of DWal's important comment.]
I am using VariantCopyInd. The source contains 1111.199999999. However, after VariantCopyInd the value gets rounded off in the destination to 1111.200000. I would like to retain the original value. How can this be achieved?
This has nothing to do with VariantCopyInd, but merely with the fact that the literal, as it exists in the code, has no exact representation in the floating point format used internally by COM Variants.
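To see what "no exact representation" means in practice, here is a small Python sketch, assuming the variant holds an IEEE 754 double. It prints the exact value the double actually stores and how that value looks at 6 and 9 decimal places. Whether the 1111.200000 in the question is only a six-decimal display of the stored value is a guess on my part.

```python
from decimal import Decimal

literal = "1111.199999999"
stored = float(literal)            # nearest IEEE 754 double to the literal

exact_stored = Decimal(stored)     # exact decimal expansion of that double
is_exact = exact_stored == Decimal(literal)

six_places = format(stored, ".6f")   # what a 6-decimal display shows
nine_places = format(stored, ".9f")  # what a 9-decimal display shows

print(exact_stored)
print(is_exact, six_places, nine_places)
```

The stored double differs slightly from the literal, but printing it with nine decimals still gives 1111.199999999, whereas printing it with six decimals gives 1111.200000.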
Therefore, there is no way to achieve what you want, except to use the CURRENCY type of variant. It will have limited precision, see MSDN:
http://msdn.microsoft.com/en-us/library/e305240e-9e11-4006-98cc-26f4932d2118(VS.85)
CURRENCY types use a decimal representation internally, just like the code literal. You will still have to provide an indirect initialization (from string, not a float/double literal) in code, to prevent any unwanted representation effects.
MSDN on CURRENCY:
A currency number stored as an 8-byte, two's complement integer, scaled by 10,000 to give a fixed-point number with 15 digits to the left of the decimal point and 4 digits to the right. This representation provides a range of 922337203685477.5807 to -922337203685477.5808.
The CURRENCY data type is useful for calculations involving money, or for any fixed-point calculation where accuracy is particularly important.
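The scaled-integer idea in that description can be sketched in a few lines. This is a Python illustration of the arithmetic only, not the COM API. The names `to_currency_units` and `from_currency_units` are hypothetical, and the input is taken from a string, as suggested above.

```python
from decimal import Decimal

SCALE = 10_000                                  # four implied decimal places
INT64_MIN, INT64_MAX = -(2 ** 63), 2 ** 63 - 1  # 8-byte two's complement range

def to_currency_units(text):
    """Parse a decimal string into the scaled 64-bit integer.
    Rejects values that need more than 4 decimals or do not fit."""
    scaled = Decimal(text) * SCALE
    if scaled != scaled.to_integral_value():
        raise ValueError("more than 4 decimal places")
    units = int(scaled)
    if not INT64_MIN <= units <= INT64_MAX:
        raise OverflowError("outside the 8-byte range")
    return units

def from_currency_units(units):
    """Convert the scaled integer back to an exact decimal value."""
    return Decimal(units) / SCALE

print(to_currency_units("1111.1999"))
print(from_currency_units(INT64_MAX), from_currency_units(INT64_MIN))
```

Note that in this sketch a value such as 1111.199999999 is rejected, because it has nine decimal places and the format only keeps four.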
I found a very good link from msdn
The link clearly indicates that any number with more than 15 digits will evaluate to incorrect results.
Take 2 cases
1) 101126.199999999 will store a correct value, since its length is 15. No conversion or precision loss.
2) 111.12345678912345 will store an incorrect value, since its length is 17. Conversion will be done.
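The 15-digit limit can be checked with a short Python sketch, assuming the value is held as an IEEE 754 double. The helper `round_trips` is made up for the example. It converts a decimal string to a double and back, using the same number of significant digits.

```python
import sys

def round_trips(text, digits):
    """True if text -> double -> text (with `digits` significant digits)
    gives back the original string."""
    return format(float(text), f".{digits}g") == text

fifteen = "101126.199999999"      # 15 significant digits
seventeen = "111.12345678912345"  # 17 significant digits

print(sys.float_info.dig)         # digits a double is guaranteed to preserve
print(round_trips(fifteen, 15))
# For 17 digits there is no guarantee; inspect what actually comes back.
print(format(float(seventeen), ".17g"))
```

A double is guaranteed to preserve any decimal with up to 15 significant digits, so case 1 always comes back unchanged. Case 2 is beyond that guarantee, so its digits may or may not survive.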