Length of numeric values in SAS - sas

I tried with different length for numeric variables. I referred below link
http://support.sas.com/documentation/cdl/en/hostwin/63285/HTML/default/viewer.htm#numvar.htm
where it is given that largest integer that can be represented with length 3 is 8192.
I tried the sample program below. I have declared a variable num with length 3. And tried storing different values which exceeds 8192.
data numeric_values;
input num;
length num 3;
datalines;
8194
8192
8193
9000
10000
10008
;
run;
I am not getting any error after executing this program.
Dataset numeric_values got created with all the values
8194
8192
8192
9000
10000
10008
Can someone please explain me the concept of length in numeric data type. Please correct me if my understanding is wrong

SAS stores numbers as floating points. The largest integer that can safely be held in length 3 may be 8192, but larger values can also be stored, with a loss of precision. In your example, you can see that 8193 is actually be corrupted to 8192. Your other example numbers are even, which happen to be safe up to a higher threshold, but if you picked 10009 as an example, you'd see it gets corrupted too, into 10008.
It is interesting that SAS doesn't offer any warnings or notes when this happens. I guess they've decided the burden is on the programmer to be aware of tricks of floating point notation.
[Edited answer to refer specifically to integers, in light of DWal's important comment.]

Related

Long Numeric to Character in SAS

When I try to convert a 10 digit number which is stored as 8. into a character using put(,8.) it gives me the character value as something like 1.2345e10. I want the character string to be as 1234567890. Can someone please help ?
8. is the width (number of characters) you'd like to use. So of course it is 1.2345e9; that's what it can fit in 8 characters.
x_char = put(x,10.);
That asks for it to be put in 10 characters. Keep extending it if you want it more than 10. Just be aware you may need to use the optional alignment option (put(x,10. -l)) if you aren't happy with the default alignment of right aligned for shorter-than-maximum-length values.
Do note that when you originally describe the number as 8., I suspect you're actually referring to length; length is not the display size but the storage size (bytes in memory set aside). For character variables in a SBCS system they're identical, but for numeric variables length and format are entirely unrelated.
Unless very sure of your inputs, I find it best to use best.:
data have;
x=1234567890;output;
x=12345.67890;output;
x=.1234567890;output;
x=12345678901234;output;
run;
data want;
set have;
length ten best $32;
ten=put(x,10.);
best=put(x,best32.);
run;
Note that using 10. here would wipe out any decimals, as well as larger numbers:
SAS stores numbers as IEEE 64bit floating point numbers. So when you see that the LENGTH of the variable is 8 that just means you are storing the full 8 bytes of the floating point representation. If the length is less than 8 then you will lose some of the ability to store more digits of precision, but when it pulls the data from the dataset to be used by a data step or proc it will always convert it back to the full 64 bit IEEE format.
The LENGTH that you use to store the data has little to do with how many characters you should use to display the value for humans to read. How to display the value is controlled by the FORMAT that you use. What format to use will depend on the type of values your variable will contain.
So if you know your values are always integers and the maximum value is 9,999,999,999 then you can use the format 10. (also known as F10.).
charvar= put(numvar,F10.);

SAS why invalid data length

data temp;
length a 1 b 3 x;
infile '';
input a b x;
run;
The answer said "The data set TEMP is not created because variable A has an invalid length".
Why it's invalid in this small program?
It's invalid because SAS doesn't let you create numeric variables with a length of less then 3 or greater then 8.
Length for numeric variables is not related to the display width (that is controlled solely by format); it is the storage used to hold the variable. In character variables it can be used in that manner, because characters take up 1 byte each, so $7 length is equivalent to $7. format directly. If you want to limit how a number is represented on the screen, use the format statement to control that (format a 1.;). If you want to tell SAS how many characters to input into a number, use informat (informat a 1.;).
However, for numeric variables, there is not the same relationship. Most numbers are 8 bytes, which stores the binary representation of the number as a double precision floating point number. So, a number with format 1. still typically takes up those 8 bytes, just as a number with format 16.3.
Now, you could limit the length somewhat if you wanted to, subject to some considerations. If you limit the length of a numeric variable, you risk losing some precision. In a 1. format number, the odds are that's not a concern; you can store up to 8192 (as an integer) precisely in a three byte numeric (3 digits of precision), so one digit is safe.
In general, unless dealing with very large amounts of data where storage is very costly, it is safer not to manipulate the length of numbers, as you may encounter problems with calculation accuracy (for example, division will very likely cause problems). The limit is not the integer size, but the precision; so for example, while 8192 is the maximum integer storable in a 3 byte number, 8191.5 is not storable in 3 bytes. In fact, 9/8 is, but 11/8 is not storable precisely - 8.192 is the maximum with 3 digits after the decimal, so 8.125 is storable but 8.375 is not.
You can read this article for more details on SAS numeric precision in Windows.
Numeric length can be 3 to 8. SAS uses nearly all of the first two bytes to store the sign and the exponent (the first bit is the sign and the next 11 bits are the exponent), so a 2 byte numeric would only have 5 bits of precision. While some languages have a type this small, SAS chooses not to.

Protect C++ Variables from Getting OverFlow ? if value is less UpperBound of any DataType

I want to protect my variable from storing overflow values .
I am Calculating Loss at every level in a tree and at some stages.
it gives values like 4.94567e+302 ;Is this value Correct .If i compare it(Like Minimum ,Maximum etc) with any other values .Will it give the right answer?
Some Times it gives negative answer but Formula can not give negative values so surely this kind of value is wrong
I want to do Following thing in my c++ code .
ForExample:
long double loss; //8 Bytes Floating Number
loss=calculate_loss();
if(loss value is greater than Capacity)
do
store 8 bytes in loss abd neglect remaining;
end if
if your capacity should be limited to the maximum or minimum capacity of the double (or float) datatype you can use floating point exceptions (not to be mistaken with C++ exceptions). Their signalling needs to be enabled in the compiler options and the you can map those to C++ exceptions detecting an overflow of the datatype.
here's an msdn page that describes the FP exceptions pretty good. At the bottom of the page you will find examples how to map that to C++ exceptions.
http://msdn.microsoft.com/en-us/library/aa289157%28v=vs.71%29.aspx
There is a limits.h in C++, but this
if(loss value is greater than Capacity)
cannot work by definition. The value in loss can not be greater than its own data type maximum.
Your example value of 5 times 10^302 indeed is suspiciously large. Coupled with your statement that you sometimes get unexpected negative values, I suggest you take a good long look at your calculations.
A reasonable guess: you are tinkering with pointers and mix up pointers to integers and floating point numbers. But noone can tell without seeing the code.

PdhGetFormattedCounterValue and correct Format

Im trying to understand the performance API but I have a problem to understand the PdhGetFormattedCounterValue function and the dwFormat parameter.
How do I know which format to choose when calling this function?
I found the PDH_COUNTER_INFO structure on MSDN and saw that this structure has a dwType member but I still not understand how to use this structure to get information about the counter format to successfully call the PdhGetFormattedCounterValue function.
You get to choose.
Many counters are calculated as fractions or they are scaled or both, so they have fractional parts. Which makes PDH_FMT_DOUBLE a good choice.
Some counter types never have fractional parts. You could read all the documentation and work out which, and then add two code paths to handle "counters that might be fractional" and "counters that won't be fractional", but this would be a lot of work for very little gain or none.
Just use PDH_FMT_DOUBLE.
Update
For most counters the precision is only a concern during the calculation of the final value, which happens within the PDH library. But, as you say, the precision of bulk totals (such as disk free space) could be an issue. Happily, it isn't.
A double has 52 bits of significand. For a counter reported in bytes this has room for 4 petabytes without any loss of precision.
But disk free space is reported in megabytes, not bytes, so a double can report up to 272 bytes or 4 zettabytes with no loss of precision. And I'm pretty sure you can't have a single volume bigger than 264 bytes.
Memory use (of processes and so) is reported in bytes but the values are always multiples of the page size (4096 bytes), so the bottom 12 bits of any reported value are zero. Hence a double can represent a memory size of up to 264 bytes - i.e. the entire 64-bit memory space - without loss of precision.
Precision isn't a problem. Just use PDH_FMT_DOUBLE.

Shannon's entropy formula. Help my confusion

my understanding of the entropy formula is that it's used to compute the minimum number of bits required to represent some data. It's usually worded differently when defined, but the previous understanding is what I relied on until now.
Here's my problem. Suppose I have a sequence of 100 '1' followed by 100 '0' = 200 bits. The alphabet is {0,1}, base of entropy is 2. Probability of symbol "0" is 0.5 and "1" is 0.5. So the entropy is 1 or 1 bit to represent 1 bit.
However you can run-length encode it with something like 100 / 1 / 100 / 0 where it's number of bits to output followed by the bit. It seems like I have a representation smaller than the data. Especially if you increase the 100 to much larger number.
I'm using: http://en.wikipedia.org/wiki/Information_entropy as reference at the moment.
Where did I go wrong? Is it the probability assigned to symbols? I don't think it's wrong. Or did I get the connection between compression and entropy wrong? Anything else?
Thanks.
Edit
Following some of the answers my followup are: would you apply the entropy formula to a particular instance of a message to try to find out its information content? Would it be valid to take the message "aaab" and say the entropy is ~0.811. If yes then what's the entropy of 1...10....0 where 1s and 0s are repeated n times using the entropy formula. Is the answer 1?
Yes I understand that you are creating a random variable of your input symbols and guessing at the probability mass function based on your message. What I'm trying to confirm is the entropy formula does not take into account the position of the symbols in the message.
Or did I get the connection between compression and entropy wrong?
You're pretty close, but this last question is where the mistake was. If you're able to compress something into a form that was smaller than its original representation, it means that the original representation had at least some redundancy. Each bit in the message really wasn't conveying 1 bit of information.
Because redundant data does not contribute to the information content of a message, it also does not increase its entropy. Imagine, for example, a "random bit generator" that only returns the value "0". This conveys no information at all! (Actually, it conveys an undefined amount of information, because any binary message consisting of only one kind of symbol requires a division by zero in the entropy formula.)
By contrast, had you simulated a large number of random coin flips, it would be very hard to reduce the size of this message by much. Each bit would be contributing close to 1 bit of entropy.
When you compress data, you extract that redundancy. In exchange, you pay a one-time entropy price by having to devise a scheme that knows how to compress and decompress this data; that itself takes some information.
However you can run-length encode it with something like 100 / 1 / 100 / 0 where it's number of bits to output followed by the bit. It seems like I have a representation smaller than the data. Especially if you increase the 100 to much larger number.
To summarize, the fact that you could devise a scheme to make the encoding of the data smaller than the original data tells you something important. Namely, it says that your original data contained very little information.
Further reading
For a more thorough treatment of this, including exactly how you'd calculate the entropy for any arbitrary sequence of digits with a few examples, check out this short whitepaper.
Have a look at Kolmogorov complexity
The minimum number of bits into which a string can be compressed without losing information. This is defined with respect to a fixed, but universal decompression scheme, given by a universal Turing machine.
And in your particular case, don't restrict yourself to alphabet {0,1}. For your example use {0...0, 1...1} (hundred of 0's and hundred of 1's)
Your encoding works in this example, but it is possible to conceive an equally valid case: 010101010101... which would be encoded as 1 / 0 / 1 / 1 / ...
Entropy is measured across all possible messages that can be constructed in the given alphabet, and not just pathological examples!
John Feminella got it right, but I think there is more to say.
Shannon entropy is based on probability, and probability is always in the eye of the beholder.
You said that 1 and 0 were equally likely (0.5). If that is so, then the string of 100 1s followed by 100 0s has a probability of 0.5^200, of which -log(base 2) is 200 bits, as you expect. However, the entropy of that string (in Shannon terms) is its information content times its probability, or 200 * 0.5^200, still a really small number.
This is important because if you do run-length coding to compress the string, in the case of this string it will get a small length, but averaged over all 2^200 strings, it will not do well. With luck, it will average out to about 200, but not less.
On the other hand, if you look at your original string and say it is so striking that whoever generated it is likely to generate more like it, then you are really saying its probability is larger than 0.5^200, so you are making a different assumptions about the original probability structure of the generator of the string, namely that it has lower entropy than 200 bits.
Personally, I find this subject really interesting, especially when you look into Kolmogorov (Algorithmic) information. In that case, you define the information content of a string as the length of the smallest program that could generate it. This leads to all sorts of insights into software engineering and language design.
I hope that helps, and thanks for your question.