SAS LENGTH statement: what is a byte? - sas

How would you explain to someone how much a "byte" is in the LENGTH statement? I always thought 1 byte equaled 1 character or 1 number, but that doesn't seem to be the case. Also, why is the syntax for it different than the syntax for the FORMAT statement? i.e.:
/*FORMAT Statement Syntax*/
FORMAT variable_name $8.;
/*LENGTH Statement*/
LENGTH variable_name $ 8;

The syntax is different because they do different things. The LENGTH statement defines the type of the variable and how much room it takes to store the variable in the dataset. The FORMAT statement defines which FORMAT you want to attach to the variable so that SAS knows how to transform the variable when writing the value out to the log or output window.
The $ in the length statement means you are defining a character variable. The $ in a format statement is just part of the name of the format that you are attaching to the variable. Formats that can be used with character variables start with a $ and numeric formats do not. Formats need to have a period so that SAS can distinguish them from variable names. But the lengths used in a LENGTH statement are integers and so periods are not needed (although SAS will ignore them if you add them after the integer value).
I see a lot of confusion in SAS code where the FORMAT statement is used as if it is intended to define variables. This only works because SAS will guess at how to define a variable the first time it appears in the data step. So it will use the details of the format you are attaching to guess at what type of variable you mean. So if you first reference X in an assignment statement x=2+3 then SAS will guess that X should be numeric and give it the default length of 8. But if the first place it sees X is in a format statement like format x $10. then it will guess that you wanted to make X a character variable with length 10 to match the width of the format.
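As a rough illustration of the guessing described above, the sketch below defines one variable through a FORMAT statement, one through a LENGTH statement, and one through an assignment. The dataset name work.define_demo and the variable names are made up for the example.

```sas
/* Sketch: three ways a variable's type and length get decided.  */
/* work.define_demo, x, y and z are hypothetical names.          */
data work.define_demo;
  format x $10.;   /* first reference is a FORMAT: SAS infers character, length 10 */
  length y $ 10;   /* explicit definition: character, 10 bytes, no format attached */
  z = 2 + 3;       /* first reference is an assignment: numeric, default length 8  */
  x = 'abc';
  y = 'abc';
run;

/* The Type, Len and Format columns show what SAS decided. */
proc contents data=work.define_demo;
run;
```

In the PROC CONTENTS listing, X and Y should both appear as character variables of length 10, but only X carries the $10. format.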
As to how characters are represented and stored it depends on what encoding you are using. If you are only using simple 7-bit ASCII codes then there is a 1-1 relationship between characters and how many bytes it takes to store them. But if you are using UTF-8 it can take up to 4 bytes to store a single character.
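One way to see the byte versus character distinction is to compare the LENGTH function, which counts bytes, with KLENGTH, which counts characters. This is only a sketch: the result depends on the session encoding and on the encoding of the program file, and the variable names are made up.

```sas
/* Sketch: bytes vs. characters for a non-ASCII value.            */
/* Assumes the program text itself is saved in the session        */
/* encoding; s, bytes and chars are hypothetical names.           */
%put Session encoding: &sysencoding;

data _null_;
  length s $ 8;         /* 8 bytes of storage, not necessarily 8 characters */
  s = 'é';
  bytes = length(s);    /* length in bytes, trailing blanks ignored */
  chars = klength(s);   /* length in characters */
  put bytes= chars=;
run;
```

In a single-byte session such as WLATIN1 both counts should be 1. In a UTF-8 session the byte count should be larger than the character count.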
For numeric variables SAS uses the IEEE 64 bit format so the relationship between the LENGTH used to store the variable and the width of a format used to display it is much more complex. It is best to just define all numeric variables as length 8. SAS will allow you to define numeric variables with length less than 8 bytes, but that just means it throws away those extra bits of precision when writing the values to the SAS dataset. When storing integers you can do this without loss of precision as long as there are enough bits left to store the largest number you expect. For floating point values you will lose precision.

Related

What is duration and how is it different to length? And why when a char variable with duration=1 is transformed to numeric, the duration becomes 8

What is the duration of a variable and how is it different to the length?
And why when you use the input function to transform a char variable with duration=1 to numeric, the duration of the new numeric variable is 8?
Ex.
A string variable that contains numbers from 0 to 9 (only 1 digit), so that it has duration=1.
When converted to numeric, the numeric variable contains numbers from 0 to 9, but duration=8
Not sure where you are seeing duration used in connection with variable definitions. Duration is a measure of time. Perhaps you meant width?
When you talk about a width for a variable you are talking about how many characters does it take to display the variable as a character string. When you specify a format or an informat you include the width you want to use after the format name and before the period. If you are reading a single digit number from a text file then you would use an informat with a width of 1. Or to write an integer between 0 and 9 you can use a format with a width of 1. But the width used in a format or an informat is independent of the length of the variable.
The length of a variable is the number of bytes that SAS will use to store the variable in a dataset. SAS only has two types of variables, floating point numbers and fixed length character strings.
For numbers SAS uses 64 bit floating point numbers so they take 8 bytes. So you cannot define a number with a length larger than 8. If you set the length for a numeric variable to less than 8 then SAS will store truncated values by discarding some of the bits from the mantissa so you lose some of the precision of the value.
For character variables the length is the number of bytes it will store. With single byte encodings (like WLATIN1) each character will take only one byte. But if you use UTF-8 encoding then each individual character could take between 1 and 4 bytes of storage.
For example the DATE9. format has a width of 9 and is used to print date values using 9 characters. But since dates are numbers the length needed to store the variable will be 8, not 9.
Or take your example of a character variable of length one that contains a single digit. You could convert it to a number using an informat like F1. that has a width of just one. But it will still take 8 bytes to represent the number as a floating point value. And SAS will force you to use a length of at least 3 to store it into a dataset. (Note on IBM mainframes the minimum length for numeric variables is 2 instead of 3 because they use a different floating point representation.)
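The single-digit case can be sketched as follows. VLENGTH reports the length a variable has in the data step; the variable names are made up for the example.

```sas
/* Sketch: informat width 1, but numeric length 8. */
/* c, n, len_c and len_n are hypothetical names.   */
data _null_;
  length c $ 1;
  c = '7';
  n = input(c, 1.);      /* the informat width of 1 only controls how many characters are read */
  len_c = vlength(c);    /* storage length of the character variable */
  len_n = vlength(n);    /* storage length of the new numeric variable */
  put c= n= len_c= len_n=;
run;
```

The log line should show len_c=1 and len_n=8: the width of the informat has no effect on the length of the numeric variable it creates.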

Long Numeric to Character in SAS

When I try to convert a 10 digit number which is stored as 8. into a character using put(,8.) it gives me the character value as something like 1.2345e10. I want the character string to be as 1234567890. Can someone please help ?
8. is the width (number of characters) you'd like to use. So of course it is 1.2345e9; that's what it can fit in 8 characters.
x_char = put(x,10.);
That asks for it to be put in 10 characters. Keep extending it if you want it more than 10. Just be aware you may need to use the optional alignment option (put(x,10. -l)) if you aren't happy with the default alignment of right aligned for shorter-than-maximum-length values.
Do note that when you originally describe the number as 8., I suspect you're actually referring to length; length is not the display size but the storage size (bytes in memory set aside). For character variables in a SBCS system they're identical, but for numeric variables length and format are entirely unrelated.
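A small sketch of the width and alignment points above. The variable names are made up, and the brackets are only there to make leading blanks visible in the log.

```sas
/* Sketch: format width and alignment in PUT(). */
/* x, y, a, b and c are hypothetical names.     */
data _null_;
  x = 1234567890;
  y = 42;
  length a b c $ 10;
  a = put(x, 8.);       /* 8 characters cannot hold 10 digits, so SAS falls back to a shorter notation */
  b = put(y, 10.);      /* right-aligned by default: leading blanks */
  c = put(y, 10. -l);   /* left-aligned: trailing blanks instead */
  put '[' a $char10. ']';
  put '[' b $char10. ']';
  put '[' c $char10. ']';
run;
```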
Unless very sure of your inputs, I find it best to use best.:
data have;
x=1234567890;output;
x=12345.67890;output;
x=.1234567890;output;
x=12345678901234;output;
run;
data want;
set have;
length ten best $32;
ten=put(x,10.);
best=put(x,best32.);
run;
Note that using 10. here would wipe out any decimals, and it cannot fit numbers with more than 10 digits.
SAS stores numbers as IEEE 64bit floating point numbers. So when you see that the LENGTH of the variable is 8 that just means you are storing the full 8 bytes of the floating point representation. If the length is less than 8 then you will lose some of the ability to store more digits of precision, but when it pulls the data from the dataset to be used by a data step or proc it will always convert it back to the full 64 bit IEEE format.
The LENGTH that you use to store the data has little to do with how many characters you should use to display the value for humans to read. How to display the value is controlled by the FORMAT that you use. What format to use will depend on the type of values your variable will contain.
So if you know your values are always integers and the maximum value is 9,999,999,999 then you can use the format 10. (also known as F10.).
charvar= put(numvar,F10.);

Reading binary file into SAS

I have a binary file of fixed length records created in MS SSIS which I need to read into SAS 9.4 64bit. Currently the file is read within a data step using this code:
data outputdata.(EOC=no
compress = yes
keep = a b c);
length a $4.;
length b 4.;
infile "&inputfile." obs= 999999999 lrecl=308 recfm=F;
input #5 a $4.
#9 b ib4.
#13 c rb4.
;
...
...
...
All variables are read correctly into the output dataset except c. c is a floating point number with 2dp, minimum value 0.00 and maximum value 99.99. In case it's useful, c starts its life off as a VB.Net Single value which is converted to binary using VB.Net's BitConverter.GetBytes(Single) which returns a 4-byte array. This array is then written to the binary record.
From what I can tell from my research on the subject rb4. is the correct way to read a 4-byte floating point ('real'?) value from a binary record in SAS so presumably the issue lies in how to then format that value so that it appears correctly in the output dataset. I've tried the following:
format c rb2.2;
format c 2.2;
format c 4.;
along with variations on the values of the formats statements (e.g. format c 5.; etc). None of the formats I've tried have resulted in anything close to the correct values; most result in numbers in scientific form such as 17E9.
c is a new addition to the binary file and is the only 'real' variable contained within it so I don't have an example to work from. I'm new to SAS and have inherited this project so there's a good chance the issue is something fairly fundamental!
Any guidance appreciated. Thanks
Repeating my comment as an answer...
You should use FLOAT4. to read a value that was written by the VB.NET BitConverter.GetBytes(Single) function. The RB4. informat reads four input bytes as if they are a truncated double-precision floating-point value, but the output of the VB.NET function is a single-precision floating-point value, aka a 'float', which is not the same thing.
The note on SAS's documentation page for the FLOAT format explains:
The FLOATw.d informat is useful in operating environments where a float value is not the same as a truncated double.
On the IBM mainframe systems, a four-byte floating-point number is the same as a truncated eight-byte floating-point number. However, in operating environments that use the IEEE floating-point standard, such as the IBM PC-based operating environments and most UNIX platforms, a four-byte floating-point number is not the same as a truncated double. Therefore, the RB4. informat does not produce the same results as FLOAT4. Floating-point representations other than IEEE might have this same characteristic. Values read with FLOAT4. typically come from some other external program that is running in your operating environment.
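The difference can be checked with a round trip on the same machine. The sketch below writes one single-precision value with the FLOAT4. format and reads the same four bytes back with both informats. The temporary fileref tmp and the variable names are made up, and it assumes the bytes are read on the same platform that wrote them (same byte order).

```sas
/* Sketch: FLOAT4. vs RB4. on the same four bytes.          */
/* tmp, x, as_float and as_rb are hypothetical names.       */
filename tmp temp;

data _null_;
  file tmp recfm=F lrecl=4;
  x = 12.34;
  put x float4.;         /* writes a native 4-byte single-precision value */
run;

data _null_;
  infile tmp recfm=F lrecl=4;
  input as_float float4.   /* interprets the bytes as a 4-byte float */
        @1 as_rb rb4.;     /* interprets the same bytes as a truncated double */
  put as_float= 8.2 as_rb= best12.;
run;
```

On an IEEE platform as_float should come back close to 12.34, while as_rb should be an unrelated value, which is the symptom described in the question.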

Length of numeric values in SAS

I tried different lengths for numeric variables. I referred to the link below
http://support.sas.com/documentation/cdl/en/hostwin/63285/HTML/default/viewer.htm#numvar.htm
where it is given that largest integer that can be represented with length 3 is 8192.
I tried the sample program below. I have declared a variable num with length 3. And tried storing different values which exceeds 8192.
data numeric_values;
input num;
length num 3;
datalines;
8194
8192
8193
9000
10000
10008
;
run;
I am not getting any error after executing this program.
Dataset numeric_values got created with all the values
8194
8192
8192
9000
10000
10008
Can someone please explain the concept of length in the numeric data type to me? Please correct me if my understanding is wrong.
SAS stores numbers as floating point values. The largest integer that can safely be held in length 3 may be 8192, but larger values can also be stored, with a loss of precision. In your example, you can see that 8193 is actually corrupted to 8192. Your other example numbers are even, which happen to be safe up to a higher threshold, but if you picked 10009 as an example, you'd see it gets corrupted too, into 10008.
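A minimal sketch for checking this side by side. The truncation happens when the value is written to the dataset, so the values have to be read back in a second step to see it. The dataset and variable names are made up.

```sas
/* Sketch: the same values stored with length 3 and length 8. */
/* work.short, n3 and n8 are hypothetical names.              */
data work.short;
  length n3 3 n8 8;
  do n8 = 8191, 8192, 8193, 10008, 10009;
    n3 = n8;     /* still exact here: inside the step every number is a full 8-byte value */
    output;      /* n3 is cut down to 3 bytes when the observation is written */
  end;
run;

data _null_;
  set work.short;   /* reading back expands n3 to 8 bytes, but the lost bits stay lost */
  put n8= n3=;
run;
```

Going by the behaviour described above, the odd values 8193 and 10009 are the ones where n3 and n8 should disagree.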
It is interesting that SAS doesn't offer any warnings or notes when this happens. I guess they've decided the burden is on the programmer to be aware of tricks of floating point notation.
[Edited answer to refer specifically to integers, in light of DWal's important comment.]

SAS why invalid data length

data temp;
length a 1 b 3 x;
infile '';
input a b x;
run;
The answer said "The data set TEMP is not created because variable A has an invalid length".
Why is it invalid in this small program?
It's invalid because SAS doesn't let you create numeric variables with a length of less than 3 or greater than 8.
Length for numeric variables is not related to the display width (that is controlled solely by format); it is the storage used to hold the variable. For character variables it can be used in that manner, because in a single-byte encoding each character takes up 1 byte, so a length of $7 corresponds directly to the $7. format. If you want to limit how a number is represented on the screen, use the format statement to control that (format a 1.;). If you want to tell SAS how many characters to read into a number, use informat (informat a 1.;).
However, for numeric variables, there is not the same relationship. Most numbers are 8 bytes, which stores the binary representation of the number as a double precision floating point number. So, a number with format 1. still typically takes up those 8 bytes, just as a number with format 16.3.
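As a sketch, a variant of the program from the question that does create the dataset could look like this. The in-stream data values are made up, and the lengths are just one valid choice.

```sas
/* Sketch: valid numeric lengths, with display width set by a format. */
/* The datalines values are hypothetical.                              */
data temp;
  length a 3 b 3 x 8;   /* storage: numeric lengths must be between 3 and 8 here */
  format a 1.;          /* display: show A using one character */
  input a b x;
  datalines;
1 22 3.5
;
run;

proc print data=temp;
run;
```

Here the 1. format handles what LENGTH a 1 was presumably meant to do, while the storage length stays within the allowed range.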
Now, you could limit the length somewhat if you wanted to, subject to some considerations. If you limit the length of a numeric variable, you risk losing some precision. In a 1. format number, the odds are that's not a concern; you can store up to 8192 (as an integer) precisely in a three byte numeric (3 digits of precision), so one digit is safe.
In general, unless dealing with very large amounts of data where storage is very costly, it is safer not to manipulate the length of numbers, as you may encounter problems with calculation accuracy (for example, division will very likely cause problems). The limit is not the integer size, but the precision: a 3 byte numeric keeps 13 significant bits. So while every integer up to 8192 is storable in a 3 byte number, 8191.5 is not, because it needs 14 significant bits. Likewise 9/8 (1.125) is storable exactly because it needs only 4 bits, but 1/3 or 0.1 is not storable precisely and loses far more digits in 3 bytes than it does in 8.
You can read this article for more details on SAS numeric precision in Windows.
Numeric length can be 3 to 8. SAS uses nearly all of the first two bytes to store the sign and the exponent (the first bit is the sign and the next 11 bits are the exponent), so a 2 byte numeric would only have 5 bits of precision. While some languages have a type this small, SAS chooses not to.
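The bit layout can be inspected with the HEX16. format, which prints the 8-byte floating-point representation of a number as 16 hex digits. This is a sketch that assumes an IEEE platform such as Windows or UNIX; the variable name is made up.

```sas
/* Sketch: look at the bytes behind a few integers. */
/* Assumes an IEEE host; x is a hypothetical name.  */
data _null_;
  do x = 8192, 8193, 10009;
    put x= x hex16.;   /* first 3 hex digits: sign + exponent; the rest: mantissa */
  end;
run;
```

On an IEEE host 8193 should display as 40C0008000000000. A length of 3 keeps only the first three bytes (40 C0 00), which is exactly the representation of 8192; the bit that distinguishes 8193 sits in the fourth byte and is dropped.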