I'm trying to determine how SAS is reading the length statement and then the informat statement. I could be misunderstanding, but I'm under the impression that the informat statement for numeric variables worked like this:
informat number 5.;
This would give the variable number the informat 5., allowing 5 digits to fill it, e.g. 12345.
However, the program below reads a 9-digit number, 987654321, into a variable whose length, 6, is large enough to hold it (a length of 6 can represent all integers up to 137,438,953,472).
Q: is length statement 'overriding' the informat statement and allowing all 9 digits to fill the variable number? How are all 9 digits able to fit in the variable number with an informat of 5.?
data tst;
input number;
length number 6;
informat number 5.;
datalines;
987654321
;
run;
proc print data=tst;
run;
Based on this SAS documentation:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000199348.htm
w specifies the width of the input field. Range: 1-32
It would seem that the informat w.d would work as I first described and not allow all 9 digits to fill number
Because you are using list mode input. In that situation SAS reads the next word, however long it is. Essentially, in list mode input (including when using the : modifier before an informat specified in the input statement) the width on an informat is ignored.
Other than for creating metadata in the SAS dataset there is not much value in attaching informats like 5. or $10. to variables.
SAS does not need them to understand how to convert text into values, unlike informats like date..
In list mode it ignores the width part.
And in formatted input, where the width matters, you have to specify the informat in the INPUT statement itself.
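A minimal sketch of the three cases described above, reading the same text in simple list mode, in modified list mode (the : modifier), and in formatted mode. The variable names a, b and c are only for illustration.

```sas
data _null_;
  input a          /* simple list input: reads the whole word */
        b :5.      /* modified list input: the width 5 is ignored */
        @1 c 5.;   /* formatted input from column 1: exactly 5 columns */
  put a= b= c=;
datalines;
987654321 987654321
;
run;
```

The log should show a=987654321 b=987654321 c=98765; only the formatted read stops after five columns.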
First off: length is not overriding, or having any impact on, the informat or the read-in. length solely describes how many bytes are used to store the number, nothing more.
For numeric variables, informats don't work quite the intuitive way. I'm not sure why - but they don't.
See this quotation from the list input documentation:
For a character variable, this format modifier reads the value from the next non-blank column until the pointer reaches the next blank column, the defined length of the variable, or the end of the data line, whichever comes first. For a numeric variable, this format modifier reads the value from the next non-blank column until the pointer reaches the next blank column or the end of the data line, whichever comes first.
They do listen to the informat to some extent - add a .2 there and you'll get a forced decimal - but they don't listen to it as to how long of a value to read in. I'm not sure why; it seems intuitive that they should, but they don't.
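As a small sketch of that "forced decimal" behaviour (assuming the informat is attached with an INFORMAT statement and the value is read with simple list input):

```sas
data _null_;
  informat number 5.2;   /* w (5) is ignored, d (2) is still applied */
  input number;
  put number=;
datalines;
987654321
12.5
;
run;
```

The first value has no decimal point, so it should come back as 9876543.21 (divided by 10**2); the second already contains a decimal point and should be read unchanged as 12.5.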
Here's it with character variables - they respect the length but also ignore the informat:
data tst;
length number $9;
informat number $5.;
input number;
datalines;
987654321
;
run;
proc print data=tst;
run;
Though you do need to put the informat before the input statement (and the length for numeric variables).
More detail is available on the documentation page for INFORMAT:
How SAS Treats Variables When You Assign Informats with the INFORMAT Statement
Informats that are associated with variables by using the INFORMAT statement behave like informats that are used with modified list input. SAS reads the variables by using the scanning feature of list input, but applies the informat.
In modified list input, SAS
• does not use the value of w in an informat to specify column positions or input field widths in an external file
• uses the value of w in an informat to specify the length of previously undefined character variables
• ignores the value of w in numeric informats
• uses the value of d in an informat in the same way it usually does for numeric informats
• treats blanks that are embedded as input data as delimiters unless you change their status with a DLM= or DLMSTR= option specification in an INFILE statement.
That is much more explicit about the fact that SAS ignores the value of w.
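For contrast with the earlier character example, here is a sketch of the second point in that list: when the character variable has not been defined yet, the w of the informat sets its length. The variable name c is hypothetical.

```sas
data _null_;
  informat c $5.;   /* c is not defined yet, so w gives it length 5 */
  input c;          /* list input still scans the whole word ... */
  put c=;           /* ... but only 5 characters can be stored */
datalines;
987654321
;
run;
```

This should print c=98765. The width is not limiting how much is scanned; the value is cut off because the variable itself is only 5 bytes long.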
The length of a variable defines the amount of space the value occupies when stored to disk. NOTE: While a DATA step is running, all numerics are double precision; the truncation to a length < 8 occurs only when the value is written to the output data set.
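A rough sketch of that point, assuming a platform where 3 is a valid numeric length (it is the minimum on Windows and Unix). The data set name short and the variable x are made up for the illustration.

```sas
data short;
  length x 3;                         /* only 3 bytes are kept on disk */
  x = 123456789;
  put 'In the DATA step: ' x=;        /* still full double precision here */
run;

data _null_;
  set short;
  put 'Read back from disk: ' x=;     /* precision was lost on output */
run;
```

The first PUT should show the exact value; the second should show a nearby but different number, because the low-order bytes were dropped when the observation was written.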
The informat is a separate concept from the length. The informat defines how incoming value representations are to be interpreted for storage as a SAS numeric value. Incoming value representations would be whatever text has to be processed, be it an INPUT statement reading a file, a VIEWTABLE field edit processing a typed-in value, an EG grid cell edit, etc.
The format is a similarly separate concept that defines how SAS renders a numeric value for output, be it a PUT statement, a VIEWTABLE row render, a placement in a PROC's output, an EG grid cell, etc.
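To make the informat/format split concrete, a small sketch using the INPUT function and a PUT statement (the variable name d and the sample date are only for illustration):

```sas
data _null_;
  /* informat: text in, numeric value (a SAS date) out */
  d = input('30JUN2018', date9.);
  /* formats: the same stored number rendered three different ways */
  put d= best12. d= date9. d= yymmn6.;
run;
```

The stored value is a single number; only its rendering changes with the format.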
Explanation
Now that that is out of the way, the informat is honored when explicitly stated in an INPUT statement:
data _null_;
attrib number length=6 informat=5.;
input number 5.;
put 'NOTE: ' number=;
datalines;
987654321
run;
===== LOG =====
NOTE: number=98765
And, as you question, the variable's associated informat width is not applied when an explicit numeric informat is not stated in the INPUT statement:
data _null_;
attrib number length=6 informat=5.;
input number;
put 'NOTE: ' number=;
datalines;
987654321
run;
===== LOG =====
NOTE: number=987654321
So the first is formatted input (the informat is specified in the INPUT statement) and the second is simple list input (because no informat is specified).
Simple list input will accept some absurdly large data, and the resultant value, while not tail-end precise, will be at the correct exponential level.
data _null_;
attrib number length=6 informat=5.;
input number;
put 'NOTE: ' number= ;
datalines;
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
run;
===== LOG =====
NOTE: number=1.2345679E89
What do the docs for INPUT Statement, List say? Certainly nothing about using the variable's declared informat when none is indicated in the INPUT statement.
Simple List Input
Simple list input places several restrictions on the type of data that
the INPUT statement can read:
• By default, at least one blank must separate the input values. Use
the DLM= or DLMSTR= option or the DSD option in the INFILE statement
to specify a delimiter other than a blank.
• Represent each missing value with a period, not a blank, or two
adjacent delimiters.
• Character input values cannot be longer than 8 bytes unless the
variable is given a longer length in an earlier LENGTH, ATTRIB, or
INFORMAT statement.
• Character values cannot contain embedded blanks unless you change
the delimiter.
• Data must be in standard numeric or character format. (footnote 1)
FOOTNOTE 1: See SAS Language Reference: Concepts for the information about standard and nonstandard data values. (my LOL)
The concepts for "SAS Variable Attributes" states
informat
refers to the instructions that SAS uses when reading data values. If
no informat is specified, the default informat is w.d for a numeric
variable, and $w. for a character variable. You can assign SAS
informats to a variable in the INFORMAT or ATTRIB statement. You can
use the FORMAT procedure to create your own informat for a variable.
(my bold)
Apparently there is no explicit default such as 32. or best32. because values with more than 32 digits will be inputted without error.
So does the documentation explain things ? Yea, well, sorta. What are the take aways:
The human intuition of a numeric variable inheriting its informat during simple list input does not align with the actual implemented behavior.
Tectonic amounts of existing SAS code means a change to implement this intuition is highly unlikely
Simple statements can involve a lot of concepts with wide ranging documentation
Possible change is that the documentation will be updated to be more explicit about the simple list input caveats
Related
My dataset has a column with a wide range of values in it, such as the one below:
Value
3223145.306
1.044303129
345.556033
17693.00837
8.03E-06
NaN
1.97E-04
2.29E-04
8.01E-04
7.46E-04
18345.82237
47.78282804
4.14E-06
When I read this column in SAS, observations are read as character. Once I convert this to numeric the observations with E-04, E-05, E-06, etc. are being converted to 1.9736273 instead of 0.00019736273.
How do I account for E-04, E-05, E-06, etc.?
code for character to numeric:
Value=input(Value, best12.);
You have to make a NEW variable if you want it to have a different type.
The INPUT function does not care if the width used on the informat is larger than the length of the string being read. So just use the maximum width that the informat supports. Also, BEST is the name of a FORMAT, not an INFORMAT. If you use it as the name of an informat then SAS will just default to using the normal numeric informat. So just go ahead and say that from the start instead of confusing format names for informat names.
The normal numeric informat can read those strings as numbers. So this code will work to create a new numeric variable named NUMBER from the existing character variable named VALUE.
number = input(VALUE,32.);
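A sketch of why the width matters here, assuming the real strings carry more digits than the sample shows (the question mentions 0.00019736273). With a width of 12, a 13-character string loses its last character, which is part of the exponent.

```sas
data _null_;
  value  = '1.9736273E-04';        /* 13 characters */
  narrow = input(value, 12.);      /* sees only '1.9736273E-0' */
  wide   = input(value, 32.);      /* sees the whole string */
  put narrow= wide=;
run;
```

The narrow read would explain the 1.9736273 reported in the question; the wide read should give the intended value.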
The only string in your list that will cause any issues is the string 'NaN'. SAS will not know how to translate that so you will just get a missing value as the result. Which is basically what systems that use that "not a number" symbol mean by it anyway. To prevent the notes in the log you can either test for it explicitly.
if upcase(value) not in ('NA','N/A','NAN') then number=input(value,32.);
Or just suppress the error messages by adding the ?? modifier.
number=input(value,??32.);
But then you will not get any message if there is other gibberish in the value variable.
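Put together as a runnable sketch on a few of the sample values (the data set name want and the $20 length are assumptions, not from the question):

```sas
data want;
  input value :$20.;
  /* ?? suppresses the invalid-data note and keeps _ERROR_ at 0 */
  number = input(value, ?? 32.);
  put value= number=;
datalines;
3223145.306
8.03E-06
NaN
1.97E-04
;
run;
```

The NaN row should come through with number missing and no note in the log.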
I am trying to convert simple date to date9. format.
%let snapshot_date=201806;
%let dt0=%sysfunc(intnx(month,%sysfunc(inputn(&snapshot_date.,yymmn6.)),0,b),yymmn6.);
data new;
set sample;
format cutoff_date date9.;
cutoff_date=input(&dt0.,anydtdte11.);
run;
I am getting cutoff_date as 28jun2020 instead of 30jun2018. Am I doing anything wrong here?
So the macro statements start with a YYYYMM string. Convert it to the first day of the month using the INPUTN() function. Then convert it from that date back to the exact same date using the INTNX() function with an interval of zero. (Perhaps in your real problem the interval is not zero?) Then convert it back to a new YYYYMM string.
The SAS code you are generating is :
cutoff_date=input(201806,anydtdte11.);
That is trying to convert the number 201,806 into a date using the ANYDTDTE11. informat. Since the INPUT() function needs a string and not a number as its input SAS will convert the number 201,806 into a string using the BEST12. format. So it runs this code:
cutoff_date=input("      201806",anydtdte11.);
BEST12. right-aligns the value, so the string is six blanks followed by 201806. Because the informat width is 11, only the first 11 of those 12 characters are read: the blanks and 20180, with the trailing 6 cut off. ANYDTDTE then appears to treat the five digits 20180 as a Julian date (yyddd), that is, day 180 of 2020, which is 28JUN2020, the value you are seeing.
Instead use the same informat in the SAS code that was used in the macro code. So to convert the string 201806 in SAS code you would use either of these statements:
cutoff_date=input("201806",yymmn6.);
cutoff_date=inputn("201806","yymmn6.");
To generate that from your macro variable you need to add the quotes. So use:
cutoff_date=input("&dt0.",yymmn6.);
SAS interprets dates as the number of days since Jan 1st 1960, and you are supplying a number to the input function which is designed to convert characters to numbers. anydtdte. is interpreting it incorrectly as a result. Put quotes around &dt0 and use the yymmn6. informat instead so that SAS converts it to a date correctly.
data new;
format cutoff_date date9.;
cutoff_date=input("&dt0.",yymmn6.);
run;
Output:
cutoff_date
01JUN2018
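The question expected 30jun2018 rather than the first of the month. One way to get there, sketched here as an assumption about what was wanted, is to push the date to the end of the month with INTNX and the 'e' alignment (the names first_day and last_day are hypothetical):

```sas
%let dt0 = 201806;

data _null_;
  first_day = input("&dt0.", yymmn6.);            /* 01JUN2018 */
  last_day  = intnx('month', first_day, 0, 'e');  /* end of the same month */
  put first_day= date9. last_day= date9.;
run;
```

This should print first_day=01JUN2018 last_day=30JUN2018.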
anydtdte. will not work here since yymmn6. is not in the list of formats it tries to read. A list of the date types it will read is located here:
https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/leforinforref/n04jh1fkv5c8zan14fhqcby7jsu4.htm?homeOnFail
I have an input file with a lot of dollar amounts given like this:
$433.5B   $41.1B   $331.1B   $407.4B
$110.8B     $19B $2,265.8B   $170.1B
where the 'B' character stands for "billions". I do not have other suffixes like M or k. I need to read in this file using an INPUT statement in SAS inside a DATA step, and these figures should be numerics. There are several challenges to overcome, as well as a couple of features of the data to note:
There are dollar signs everywhere.
Some of the numbers have decimal points, and some don't, so we're dealing with variable-length data.
There are commas inside the numbers, such as $2,265.8B.
The most pesky aspect of this data are the B's after each amount.
The B's are always in the same columns.
What informat should I use to read in this numerical data?
I thought of using something along the lines of :DOLLAR4.1, like this:
Data bigcompanies;
Infile 'path\bigcompanies.dat' MISSOVER;
Input (sales profits assets market_value) (:DOLLAR4.1);
Run;
but it gives me nothing (as in, I get periods for those numbers). I don't know how to handle the B, which is, I think, the crux of the problem. The SAS documentation on the DOLLAR informat is rather sparse, unfortunately.
Many thanks for your help!
If the data is in fixed columns then just skip the columns where the B appears.
data test;
input sales dollar6. +1 profits dollar8. +1 assets dollar9. +1 market_value dollar9. +1 ;
*---+---10----+---20----+---30----+---40 ;
cards;
$433.5B   $41.1B   $331.1B   $407.4B
$110.8B     $19B $2,265.8B   $170.1B
;
proc print;
run;
Results
                                      market_
Obs    sales    profits    assets      value
  1    433.5      41.1      331.1      407.4
  2    110.8      19.0     2265.8      170.1
Note that you normally never want to add a decimal part to an informat. That is telling SAS where to place the decimal point when it does not appear in the source text. So "integers" will be divided by that power of 10.
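A short sketch of that effect using the INPUT function, which may also explain why DOLLAR4.1 was a poor fit in the question (the variable names are only for illustration):

```sas
data _null_;
  a = input('$19',   dollar4.1);   /* no decimal point in the text */
  b = input('$41.1', dollar6.1);   /* decimal point present */
  c = input('$19',   dollar4.);    /* no d on the informat */
  put a= b= c=;
run;
```

This should print a=1.9 b=41.1 c=19: the d only kicks in when the text has no decimal point of its own.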
I was wondering if there's a difference between, for example, using:
LENGTH var_1 $12.;
INPUT var_1 $;
vs
INPUT var_1 : $12.;
when reading in standard input from datalines or an external file?
They are the same as long as the LENGTH or the INPUT statement is the first place that the SAS compiler sees VAR_1 referenced and needs to decide what type and length to assign to it. Both will cause VAR_1 to be defined as a character variable of length 12. The LENGTH statement will do it explicitly and the INPUT statement will do it as a side effect. SAS assumes that you wanted the type to be character since you used a character informat. It also assumes that you want the length to be the same as the width on the informat. (Note that you could reference the variable in a RETAIN statement beforehand and SAS will not make the decision as to the type and length at that time.)
Both INPUT statements will read VAR_1 in list mode: the first because no informat is given, the second because it includes the : modifier before the informat specification. So SAS will read the next word it sees (which depends on the settings of the DSD and TRUNCOVER options and whether the & modifier is used) into VAR_1, even if the next word is longer than 12 characters. When you read data using list mode instead of formatted mode, SAS ignores the width of the informat and reads the whole next word. So if the next word is longer than 12 characters, the extra characters are read but then dropped when the value is stored in the 12-character variable.
Note that if you have already defined VAR_1 as being a character variable then you do not need to add the $ after it in the INPUT statement in your first case.
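A side-by-side sketch of the two forms from the question, fed a word that is longer than 12 characters:

```sas
data _null_;
  length var_1 $12;
  input var_1 $;
  put 'LENGTH + simple list: ' var_1=;
datalines;
abcdefghijklmnopqrstuvwxyz
;
run;

data _null_;
  input var_1 : $12.;
  put 'Modified list:        ' var_1=;
datalines;
abcdefghijklmnopqrstuvwxyz
;
run;
```

Both steps should print var_1=abcdefghijkl: the whole word is scanned, and only the first 12 characters fit in the variable.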
Both do the same job. @Tom has a detailed and nice answer.
I am trying to read the following data from a Notepad (text) file into a SAS data set:
name1,124325,08/10/2003,1250.03
name2,114565,08/11/2003,11115.11
name3,000007,08/11/2003,12500.02
When I use this SAS code:
data new;
filename tfile '~\transact2.txt';
infile tfile dsd;
input name $ id date mmddyy10. cost 8.2;
run;
I get this, where cost is all missing:
However, if I just replace dsd with dlm=',', then the cost variable is read in correctly. Why does dsd cause the cost variable to be read in incorrectly?
dsd does not say "use a delimiter". It tells SAS how to use that delimiter (mostly, saying anything in quotes is treated as one field, and modifying how consecutive delimiters are treated). dlm=',' is necessary to read this in correctly. I'm a bit surprised you got as close to correct as you did. (Fortunately, SAS makes some assumptions here that end up making it work correctly, more-or-less).
Also, you're mixing two styles of input, which isn't allowed.
When you use delimited input, you are using list input, not column or formatted input. You can only indicate character/not character, and cannot use informats directly. If you want to embed the informats like you do for the date, you need to use modified list input (the : modifier):
data new;
filename tfile '~\transact2.txt';
infile tfile dsd;
input name $ id date :mmddyy10. cost;
run;
Also note that reading in cost with 8.2 is incorrect. The decimal in an informat is only for reading in 12345678 as 123456.78 (back in the day when you had 80 column cards and didn't want to spend one on the decimal). In general in "modern" SAS you should not be using decimal portion of informat ever. SAS will see the decimal and work it out properly.
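For a version that can be run without the external file, here is the same idea with the three sample rows as in-stream data. It relies on DSD using a comma as its default delimiter; the FORMAT statement is an addition so the date prints readably.

```sas
data new;
  infile datalines dsd;
  input name $ id date :mmddyy10. cost;   /* list input throughout */
  format date mmddyy10.;
  put name= id= date= cost=;
datalines;
name1,124325,08/10/2003,1250.03
name2,114565,08/11/2003,11115.11
name3,000007,08/11/2003,12500.02
;
run;
```

With no decimal part on any informat, cost should be read as 1250.03, 11115.11 and 12500.02.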