data temp;
length a 1 b 3 x;
infile '';
input a b x;
run;
The answer said "The data set TEMP is not created because variable A has an invalid length".
Why is it invalid in this small program?
It's invalid because SAS doesn't let you create numeric variables with a length of less than 3 or greater than 8.
Length for numeric variables is not related to the display width (that is controlled solely by format); it is the storage used to hold the variable. In character variables it can be used in that manner, because characters take up 1 byte each, so $7 length is equivalent to $7. format directly. If you want to limit how a number is represented on the screen, use the format statement to control that (format a 1.;). If you want to tell SAS how many characters to input into a number, use informat (informat a 1.;).
However, for numeric variables, there is not the same relationship. Most numbers are 8 bytes, which stores the binary representation of the number as a double precision floating point number. So, a number with format 1. still typically takes up those 8 bytes, just as a number with format 16.3.
Now, you could limit the length somewhat if you wanted to, subject to some considerations. If you limit the length of a numeric variable, you risk losing some precision. In a 1. format number, the odds are that's not a concern; you can store up to 8192 (as an integer) precisely in a three byte numeric (3 digits of precision), so one digit is safe.
In general, unless dealing with very large amounts of data where storage is very costly, it is safer not to manipulate the length of numbers, as you may encounter problems with calculation accuracy (for example, division will very likely cause problems). The limit is not the integer size, but the precision; so for example, while every integer up to 8192 is storable exactly in a 3 byte number, 8191.5 is not storable in 3 bytes. A 3 byte numeric keeps 12 mantissa bits (13 significant bits counting the implied leading 1), so short binary fractions such as 9/8 or 11/8 are stored exactly, while 8191.5, which needs 14 significant bits, is not.
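A rough way to check this limit outside SAS is to emulate the truncation in Python. The sketch below assumes that a shortened numeric is simply the leading bytes of the big-endian IEEE 754 double, with the dropped bytes read back as zero; the helper name `truncate_double` is made up for illustration and is not a SAS function.

```python
import struct

def truncate_double(value: float, length: int) -> float:
    """Keep only the first `length` bytes of the big-endian IEEE 754 double.

    Assumption: this mimics storing a numeric with LENGTH `length` (3..8)
    on a platform that uses IEEE doubles; dropped bytes read back as zero.
    """
    if not 3 <= length <= 8:
        raise ValueError("length must be between 3 and 8")
    raw = struct.pack(">d", value)                # 8 bytes: sign, exponent, mantissa
    kept = raw[:length] + b"\x00" * (8 - length)  # discard the trailing mantissa bytes
    return struct.unpack(">d", kept)[0]

print(truncate_double(8192.0, 3))      # 8192.0 (exact)
print(truncate_double(8191.5, 3))      # 8191.0 (the .5 needs a 14th significant bit)
print(truncate_double(1.375, 3))       # 1.375 (short binary fraction, exact)
print(truncate_double(0.1, 3) == 0.1)  # False (most of the mantissa was dropped)
```

With `length=8` the value always comes back unchanged, which matches the idea that only shortened lengths cost precision.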
You can read this article for more details on SAS numeric precision in Windows.
Numeric length can be 3 to 8. SAS uses nearly all of the first two bytes to store the sign and the exponent (the first bit is the sign and the next 11 bits are the exponent), so a 2 byte numeric would only have 5 bits of precision. While some languages have a type this small, SAS chooses not to.
I went through many documents and I still don't understand the size used by data types, in C especially, though this may be the case with all programming languages.
Say we have declared a variable like "int a;". We all know the data type "int" has a size of 4 bytes (forget the computer architecture for a moment, this is for simplicity), so it can hold values from -2^31 to 2^31 - 1, or [0 to 4,294,967,295] when unsigned.
Now the question is: each character, letter, or digit is stored as 8 bits (one byte), so if int has only 4 bytes as its size, how can one store a number like 200,000 or even the max value 4,294,967,295?
There is no concrete document that explains what is exactly the size of the data type.
A 4 byte character string can indeed only store numbers up to 9999 or only 999 if you leave space for the minus sign. This is very inefficient though, only storing around 13-bit numbers in 32-bits of data.
Integer data types are instead stored as binary two's complement for signed numbers or simply in binary for unsigned numbers. This uses all the bits allowing storage of -2^(n-1) to 2^(n-1)-1 for signed numbers or 0 to 2^n-1 for unsigned numbers where n is the number of bits in the number.
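To make the binary storage concrete, here is a small Python sketch (Python rather than C, purely for illustration) that prints the 4-byte pattern of a value. The helper name `to_bits` is hypothetical; it relies only on the built-in `int.to_bytes`.

```python
def to_bits(value: int, n_bytes: int = 4, signed: bool = True) -> str:
    """Return the n_bytes-wide bit pattern of value (two's complement if signed)."""
    raw = value.to_bytes(n_bytes, byteorder="big", signed=signed)
    return " ".join(f"{byte:08b}" for byte in raw)

print(to_bits(200_000))                      # 00000000 00000011 00001101 01000000
print(to_bits(-1))                           # 11111111 11111111 11111111 11111111
print(to_bits(4_294_967_295, signed=False))  # 11111111 11111111 11111111 11111111
```

The last two lines show the same 32 bits read either as -1 (signed) or as 4,294,967,295 (unsigned), which is why the two ranges have the same size but different endpoints.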
What is the duration of a variable and how is it different to the length?
And why, when you use the input function to transform a char variable with duration=1 to numeric, is the duration of the new numeric variable 8?
Ex.
A string variable that contains numbers from 0 to 9 (only 1 digit), so that it has duration=1.
When converted to numeric, the numeric variable contains numbers from 0 to 9, but duration=8
Not sure where you are seeing duration used in connection with variable definitions. Duration is a measure of time. Perhaps you meant width?
When you talk about a width for a variable you are talking about how many characters it takes to display the variable as a character string. When you specify a format or an informat you include the width you want to use after the format name and before the period. If you are reading a single digit number from a text file then you would use an informat with a width of 1. Or to write an integer between 0 and 9 you can use a format with a width of 1. But the width used in a format or an informat is independent of the length of the variable.
The length of a variable is the number of bytes that SAS will use to store the variable in a dataset. SAS only has two types of variables, floating point numbers and fixed length character strings.
For numbers SAS uses 64 bit floating point numbers so they take 8 bytes. So you cannot define a number with a length larger than 8. If you set the length for a numeric variable to less than 8 then SAS will store truncated values by discarding some of the bits from the mantissa so you lose some of the precision of the value.
For character variables the length is the number of bytes it will store. With single byte encodings (like WLATIN1) each character will take only one byte. But if you use UTF-8 encoding then each individual character could take between 1 and 4 bytes of storage.
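The difference between characters and bytes is easy to see outside SAS as well. A minimal Python sketch, assuming Python's `cp1252` codec is a fair stand-in for WLATIN1; the helper name `byte_length` is made up for illustration:

```python
def byte_length(text: str, encoding: str) -> int:
    """Number of bytes needed to store `text` in the given encoding."""
    return len(text.encode(encoding))

sample = "café €"  # 6 characters
for encoding in ("cp1252", "utf-8"):
    print(encoding, len(sample), "characters,", byte_length(sample, encoding), "bytes")
# cp1252 6 characters, 6 bytes
# utf-8 6 characters, 9 bytes
```

So a character variable sized for 6 single-byte characters would be too short for the same text in UTF-8.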
For example the DATE9. format has a width of 9 and is used to print date values using 9 characters. But since dates are numbers the length needed to store the variable will be 8, not 9.
Or take your example of a character variable of length one that contains a single digit. You could convert it to a number using an informat like F1. that has a width of just one. But it will still take 8 bytes to represent the number as a floating point value. And SAS will force you to use a length of at least 3 to store it into a dataset. (Note that on IBM mainframes the minimum length for numeric variables is 2 instead of 3 because they use a different floating point representation.)
When I try to convert a 10 digit number which is stored as 8. into a character using put(,8.) it gives me the character value as something like 1.2345e10. I want the character string to be as 1234567890. Can someone please help ?
8. is the width (number of characters) you'd like to use. So of course it is 1.2345e9; that's what it can fit in 8 characters.
x_char = put(x,10.);
That asks for it to be put in 10 characters. Keep extending it if you want it more than 10. Just be aware you may need to use the optional alignment option (put(x,10. -l)) if you aren't happy with the default alignment of right aligned for shorter-than-maximum-length values.
Do note that when you originally describe the number as 8., I suspect you're actually referring to length; length is not the display size but the storage size (bytes in memory set aside). For character variables in a SBCS system they're identical, but for numeric variables length and format are entirely unrelated.
Unless very sure of your inputs, I find it best to use best.:
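data have;
  x=1234567890; output;
  x=12345.67890; output;
  x=.1234567890; output;
  x=12345678901234; output;
run;
data want;
  set have;
  length ten best $32;
  ten=put(x,10.);       /* fixed width of 10 characters, no decimals */
  best=put(x,best32.);  /* BEST. chooses a notation that fits the width */
run;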
Note that using 10. here would wipe out any decimals, and numbers too large to fit in 10 characters would not be written out in full.
SAS stores numbers as IEEE 64-bit floating point numbers. So when you see that the LENGTH of the variable is 8 that just means you are storing the full 8 bytes of the floating point representation. If the length is less than 8 then you will lose some of the ability to store more digits of precision, but when it pulls the data from the dataset to be used by a data step or proc it will always convert it back to the full 64-bit IEEE format.
The LENGTH that you use to store the data has little to do with how many characters you should use to display the value for humans to read. How to display the value is controlled by the FORMAT that you use. What format to use will depend on the type of values your variable will contain.
So if you know your values are always integers and the maximum value is 9,999,999,999 then you can use the format 10. (also known as F10.).
charvar= put(numvar,F10.);
I have a list of numbers (ints and doubles) which I need to export to a buffer as strings. The buffer has to be reserved beforehand. For speed and size reasons I do not want to create the strings, measure their size and then create them again in the buffer. And no, the system in use does not allow creating the whole string and copying it afterwards.
For integers, you'll need floor(log10(number)) + 1 decimal digits (adjusted for 0 and sign as necessary).
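As a sketch of that digit count (in Python for brevity; the helper name `decimal_width` is made up), the version below gets the same result as floor(log10(number)) + 1 by repeated integer division, which sidesteps floating point rounding in log10 for values just below a power of ten:

```python
def decimal_width(number: int) -> int:
    """Characters needed to print `number` in base 10, including a minus sign."""
    width = 1 if number < 0 else 0  # room for '-'
    magnitude = abs(number)
    if magnitude == 0:
        return width + 1            # "0" still needs one digit
    while magnitude:
        magnitude //= 10            # drop one decimal digit per iteration
        width += 1
    return width

print(decimal_width(0))              # 1
print(decimal_width(-42))            # 3
print(decimal_width(4_294_967_295))  # 10
```

Summing `decimal_width` over the list (plus separators) gives the buffer size without building any string first.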
For doubles, the situation is a bit more complicated - it really depends on how you want to represent them. Most importantly, do you mind trailing 0s after the decimal point? Is scientific notation an option?
One way to approach this would be: you need 17 significant decimal digits to represent an IEEE double in a string so that it can be reconstructed unambiguously. So a simple upper bound is to always reserve 17 digits after the decimal point, plus the period, and use the integer formula above for the integral part.
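If scientific notation is acceptable, a fixed bound per double is possible. The sketch below is an alternative to the fixed-notation estimate, not the same method: it assumes C-style `%.17g` formatting, for which 24 characters should be the worst case for finite values. The constant and function names are made up for illustration.

```python
import sys

# sign + 1 digit + '.' + 16 digits + 'e' + exponent sign + 3 exponent digits
MAX_G17_WIDTH = 24

def format_double(value: float) -> str:
    """Format with 17 significant digits so that float(text) == value."""
    return "%.17g" % value

for value in (0.1, 1234567890.0, -sys.float_info.max, 5e-324):
    text = format_double(value)
    print(len(text), text, float(text) == value)
```

Reserving `MAX_G17_WIDTH` characters per double trades some wasted space for never having to measure the string first.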
Or, maybe, what I don't get is unary coding:
In Golomb, or Rice, coding, you split a number N into two parts by dividing it by another number M and then code the integer result of that division in unary and the remainder in binary.
In the Wikipedia example, they use 42 as N and 10 as M, so we end up with a quotient q of 4 (in unary: 11110) and a remainder r of 2 (in binary 010), so that the resulting message is 11110,010, or 8 bits (the comma can be skipped). The simple binary representation of 42 is 101010, or 6 bits.
To me, this seems due to the unary representation of q which always has to be more bits than binary.
Clearly, I'm missing some important point here. What is it?
The important point is that Golomb codes are not meant to be shorter than the shortest binary encoding for one particular number. Rather, by providing a specific kind of variable-length encoding, they reduce the average length per encoded value compared to fixed-width encoding, if the encoded values are from a large range, but the most common values are generally small (and hence are using only a small fraction of that range most of the time).
As an example, if you were to transmit integers in the range from 0 to 1000, but a large majority of the actual values were in the range between 0 and 10, in a fixed-width encoding, most of the transmitted codes would have leading 0s that contain no information:
To cover all values between 0 and 1000, you need a 10-bit wide encoding in fixed-width binary. Now, as most of your values would be below 10, at least the first 6 bits of most numbers would be 0 and would carry little information.
To rectify this with Golomb codes, you split the numbers by dividing them by 10 and encoding the quotient and the remainder separately. For most values, all that would have to be transmitted is the remainder which can be encoded using 4 bits at most (if you use truncated binary for the remainder it can be less). The quotient is then transmitted in unary, which encodes as a single 0 bit for all values below 10, as 10 for 10..19, 110 for 20..29 etc.
Now, for most of your values, you have reduced the message size to 5 bits max, but you are still able to transmit all values unambiguously without separators.
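The scheme described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the quotient is written as q ones followed by a zero and the remainder uses truncated binary; the function name `golomb_encode` is made up, and the output is a string of '0'/'1' characters rather than packed bits.

```python
def golomb_encode(n: int, m: int) -> str:
    """Golomb code of non-negative n with divisor m, as a '0'/'1' string."""
    if n < 0 or m < 1:
        raise ValueError("n must be >= 0 and m must be >= 1")
    q, r = divmod(n, m)
    unary = "1" * q + "0"        # quotient: q ones terminated by a zero
    b = (m - 1).bit_length()     # ceil(log2(m)) bits cover every remainder
    if b == 0:                   # m == 1: the remainder is always 0
        return unary
    cutoff = (1 << b) - m        # this many remainders fit in b - 1 bits
    if r < cutoff:
        return unary + format(r, f"0{b - 1}b")
    return unary + format(r + cutoff, f"0{b}b")

print(golomb_encode(7, 10))         # 01101 (5 bits)
print(golomb_encode(42, 10))        # 11110010 (8 bits)
print(len(golomb_encode(995, 10)))  # 103 bits: 100 for the quotient, 3 for the remainder
```

Small values stay at 4 or 5 bits while large ones grow linearly, which is the trade-off the answer describes.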
This comes at a rather high cost for the larger values (for example, values in the range 990..999 need 100 bits for the quotient), which is why the coding is optimal for 2-sided geometric distributions.
The long runs of 1 bits in the quotients of larger values can be addressed with subsequent run-length encoding. However, if the quotients consume too much space in the resulting message, this could indicate that other codes might be more appropriate than Golomb/Rice.
One difference between the Golomb coding and binary code is that binary code is not a prefix code, which is a no-go for coding strings of arbitrarily large numbers (you cannot decide if 1010101010101010 is a concatenation of 10101010 and 10101010 or something else). Hence, they are not that easily comparable.
Second, the Golomb code is optimal for a geometric distribution, in this case with parameter 2^(-1/10). The probability of 42 is some 0.3 %, so you get the idea of how important this is for the length of the output string.
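That figure can be checked quickly, assuming the distribution is parametrised as P(N = n) = (1 - p) * p^n with p = 2^(-1/10) (the answer does not spell out the parametrisation, so this is a guess); the helper name `geometric_pmf` is made up.

```python
def geometric_pmf(n: int, p: float) -> float:
    """P(N = n) = (1 - p) * p**n for n = 0, 1, 2, ..."""
    return (1.0 - p) * p ** n

p = 2 ** (-1 / 10)
print(f"{geometric_pmf(42, p):.2%}")  # roughly 0.36%
print(f"{geometric_pmf(0, p):.2%}")   # small values are far more likely
```

Under this assumption a value like 42 is rare enough that spending 8 bits on it hardly affects the average code length.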