SAS Input: Dollar Amounts in Billions - sas

I have an input file with a lot of dollar amounts given like this:
$433.5B   $41.1B   $331.1B   $407.4B
$110.8B     $19B $2,265.8B   $170.1B
where the 'B' character stands for "billions". I do not have other suffixes like M or k. I need to read in this file using an INPUT statement in SAS inside a DATA step, and these figures should be numerics. There are several challenges to overcome, as well as a couple of features of the data to note:
There are dollar signs everywhere.
Some of the numbers have decimal points, and some don't, so we're dealing with variable-length data.
There are commas inside the numbers, such as $2,265.8B.
The peskiest aspect of this data is the B after each amount.
The B's are always in the same columns.
What informat should I use to read in this numerical data?
I thought of using something along the lines of :DOLLAR4.1, like this:
Data bigcompanies;
Infile 'path\bigcompanies.dat' MISSOVER;
Input (sales profits assets market_value) (:DOLLAR4.1);
Run;
but it gives me nothing (as in, I get periods for those numbers). I don't know how to handle the B, which is, I think, the crux of the problem. The SAS documentation on the DOLLAR informat is rather sparse, unfortunately.
Many thanks for your help!

If the data is in fixed columns then just skip the columns where the B appears.
data test;
input sales dollar6. +1 profits dollar8. +1 assets dollar9. +1 market_value dollar9. +1 ;
*---+---10----+---20----+---30----+---40 ;
cards;
$433.5B   $41.1B   $331.1B   $407.4B
$110.8B     $19B $2,265.8B   $170.1B
;
proc print;
run;
Results
                                      market_
Obs    sales    profits    assets      value
 1     433.5      41.1      331.1      407.4
 2     110.8      19.0     2265.8      170.1
Note that you normally never want to add a decimal part to an informat. That is telling SAS where to place the decimal point when it does not appear in the source text. So "integers" will be divided by that power of 10.
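If the amounts were not in fixed columns, one alternative (not part of the answer above, just a sketch) would be to read each amount as a character token with list input, remove the trailing B, and convert the rest with the DOLLAR informat and no decimal part. The names raw1-raw4 and the width 16 are assumptions chosen for illustration.

```sas
data bigcompanies;
  /* read each amount as a character token (raw1-raw4 are hypothetical names) */
  input (raw1-raw4) (:$16.);
  array raw[4] raw1-raw4;
  array amt[4] sales profits assets market_value;
  do i = 1 to 4;
    /* drop the trailing B, then let DOLLARw. strip the $ and the commas */
    amt[i] = input(compress(raw[i], 'B'), dollar16.);
  end;
  drop i raw1-raw4;
  datalines;
$433.5B $41.1B $331.1B $407.4B
$110.8B $19B $2,265.8B $170.1B
;
run;
```

The values stay in billions, as in the source file; multiply by 1e9 in the loop if you need plain dollars.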

Related

In SAS, how would I remove a decimal from a value?

I have a few million rows where a particular column's values are showing as, e.g., ###.## and I'd like them to show as #####.
How can I modify this in the INFILE statement?
Thanks.
If you made the mistake of including a decimal width in an INFORMAT, then that might be the cause of what you are seeing. The decimal width on an informat tells SAS where the implied decimal point should be placed. You only want to use it when you know that your source strings were purposely generated without a period marking the decimal place, to save one character.
Example:
data have;
input #1 right 10. #1 wrong 10.3 ;
cards;
1.2
1234
;
Result:
Obs     right    wrong
 1        1.2    1.200
 2     1234.0    1.234

SAS, converting numbers, from character format to numeric format, keeping all leading zeros, but length of numbers is NOT uniform

I'm working in SAS EG and I'm trying to convert a column that's in character format to numeric format, EXACTLY as they appear in their character format. The numbers vary in length and some have one or two leading zeros.
If I do it one way, it gets rid of all leading zeros. Another way I tried, it adds leading zeros to the point that it's as long as the longest number in the column, e.g., a 9-digit number with one leading zero now has four leading zeros because the longest number in the column is 12 digits. (I hope this description makes sense).
I'm working in SAS EG. When I run proc contents, it tells me my existing variable is a character variable of length 26. It is blank for both 'format' and 'informat.'
I need to convert it so that a new column is a numeric variable, with length 8, and 'F12.' for 'format' and 'BEST12.' for 'informat,' as I plan to use it to match two data sets.
I created the following test data set in 'regular' SAS, but I'm not sure if it fully recreates the issue I'm working on in SAS EG:
data have;
input mrn $1-12;
cards;
118283586928
003875807
038087875
0385709873
0038576830
;
run;
As you can see, I have one number that's 12 digits long (no leading zeros); two that are 9 digits (with one or two leading zeros); and two that are 10 digits (with one or two leading zeros).
Any help would be greatly appreciated.
Thanks
You cannot store 26 digit strings exactly as a number in SAS. SAS stores numbers as floating point values. You can use the CONSTANT() function to see the end of the contiguous integers that can be stored exactly.
data _null_;
  x=constant('exactint');
  put x= comma30.;
run;
x=9,007,199,254,740,992
So if you actually have values longer than 15 digits in the character variable you will not be able to convert them to numbers.
But if they are only 12 digits long then just convert the strings into numbers and compare the numbers.
proc sql;
create table want as
select *
from a, b
where a.mrn = input(b.mrn_string,32.)
;
quit;
It's not possible to have different formats in the same column in SAS. The only way to keep them looking exactly as they do while in the same column is to keep them as text. If you need to do calculations on them I'd suggest just creating a 2nd column with their numeric values.
Leading zeros can be added to numbers using the z. format.
https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000205244.htm
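As a rough sketch of the "second column" idea, the step below converts the question's test data set `have` to numbers while keeping the original text. The variable names `mrn_num` and `may_be_inexact` are made up for illustration; the 15-digit threshold is the one mentioned in the answer above.

```sas
data want;
  set have;                        /* test data set from the question */
  mrn_num = input(mrn, 32.);       /* numeric copy; leading zeros carry no value */
  format mrn_num 12.;
  /* hypothetical flag: strings longer than 15 digits may not convert exactly */
  may_be_inexact = (lengthn(mrn) > 15);
run;
```

The character variable `mrn` is kept unchanged, so the leading zeros remain available for display, while `mrn_num` can be used for matching.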

SAS Numeric Informat vs Length

I'm trying to determine how SAS is reading the length statement and then the informat statement. I could be misunderstanding, but I'm under the impression that the informat statement for numeric variables worked like this:
informat number 5.;
This would give the variable number the informat 5., allowing 5 digits to fill it, e.g. 12345.
However, when I run the program below, I read a number that has 9 digits, 987654321. The variable has a length of 6, which is enough to fit those digits, since a length of 6 can represent all integers up to 137,438,953,472.
Q: is length statement 'overriding' the informat statement and allowing all 9 digits to fill the variable number? How are all 9 digits able to fit in the variable number with an informat of 5.?
data tst;
input number;
length number 6;
informat number 5.;
datalines;
987654321
;
run;
proc print data=tst;
run;
Based on this SAS documentation:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000199348.htm
w specifies the width of the input field. Range: 1-32
It would seem that the informat w.d would work as I first described and not allow all 9 digits to fill number.
It is because you are using list mode input. In that situation SAS reads the next word, however long it is. Essentially, in list mode input (including when you use the : modifier before an informat specified in the INPUT statement) the width on an informat is ignored.
Other than for creating metadata in the SAS dataset, there is not much value in attaching informats like 5. or $10. to variables.
SAS does not need them to understand how to convert text into values, unlike informats such as date.
In list mode it ignores the width part.
And in formatted input, where the width matters, you have to specify the informat in the INPUT statement itself.
First off: length is not overriding, or having any impact on, the informat or the read-in. length solely describes how many bytes are used to store the number, nothing more.
For numeric variables, informats don't work quite the intuitive way. I'm not sure why - but they don't.
See this quotation from the list input documentation:
For a character variable, this format modifier reads the value from the next non-blank column until the pointer reaches the next blank column, the defined length of the variable, or the end of the data line, whichever comes first. For a numeric variable, this format modifier reads the value from the next non-blank column until the pointer reaches the next blank column or the end of the data line, whichever comes first.
They do listen to the informat to some extent - add a .2 there and you'll get a forced decimal - but they don't listen to it as to how long of a value to read in. I'm not sure why; it seems intuitive that they should, but they don't.
Here it is with character variables, which respect the length but also ignore the informat width:
data tst;
length number $9;
informat number $5.;
input number;
datalines;
987654321
;
run;
proc print data=tst;
run;
Though you do need to put the informat before the input statement (and the length for numeric variables).
More detail is available on the documentation page for INFORMAT:
How SAS Treats Variables When You Assign Informats with the INFORMAT Statement
Informats that are associated with variables by using the INFORMAT statement behave like informats that are used with modified list input. SAS reads the variables by using the scanning feature of list input, but applies the informat.
In modified list input, SAS
• does not use the value of w in an informat to specify column positions or input field widths in an external file
• uses the value of w in an informat to specify the length of previously undefined character variables
• ignores the value of w in numeric informats
• uses the value of d in an informat in the same way it usually does for numeric informats
• treats blanks that are embedded as input data as delimiters unless you change their status with a DLM= or DLMSTR= option specification in an INFILE statement.
That is much more explicit about the fact that SAS ignores the value of w.
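A small sketch of the rules quoted above, assuming an informat of 5.2 attached with the INFORMAT statement (the .2 is my addition, not from the question). According to those rules the width 5 should be ignored, so all nine digits are read, while the d value should still apply to a value that has no decimal point and leave a value with an explicit decimal point alone.

```sas
data _null_;
  informat number 5.2;   /* w=5 is ignored in list input; d=2 is still used */
  input number;          /* list input: reads the whole blank-delimited token */
  put number=;
  datalines;
987654321
12.5
;
run;
```

Comparing the log for the two rows shows that only the value without a decimal point gets rescaled.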
The length of a variable defines the amount of space the value occupies when stored to disk. NOTE: During a running DATA step all numerics are double precision; truncation to a length < 8 occurs only when the value is written to the output data set.
The informat is a separate concept from the length. An informat defines how incoming value representations are to be interpreted for storage as a SAS numeric value. Incoming value representations are whatever text has to be processed, be it an INPUT statement reading a file, a VIEWTABLE field edit processing a typed-in value, an EG grid cell edit, etc.
The format is a similarly separate concept that defines how SAS renders a numeric value for output, be it a PUT statement, a VIEWTABLE row render, a placement in a PROC's output, an EG grid cell, etc.
Explanation
Now that that is out of the way, the informat is honored when explicitly stated in an INPUT statement:
data _null_;
attrib number length=6 informat=5.;
input number 5.;
put 'NOTE: ' number=;
datalines;
987654321
run;
===== LOG =====
NOTE: number=98765
And, as you observed, the width of the variable's associated informat is not applied when no explicit numeric informat is stated in the INPUT statement:
data _null_;
attrib number length=6 informat=5.;
input number;
put 'NOTE: ' number=;
datalines;
987654321
run;
===== LOG =====
NOTE: number=987654321
So the first is formatted input (an informat is specified in the INPUT statement) and the second is simple list input (because no informat is specified there).
Simple list input will accept some absurdly large data, and the resultant value, while not tail-end precise, will be at the correct exponential level.
data _null_;
attrib number length=6 informat=5.;
input number;
put 'NOTE: ' number= ;
datalines;
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
run;
===== LOG =====
NOTE: number=1.2345679E89
What do the docs for "INPUT Statement, List" say? Certainly nothing about using the variable's declared informat when none is indicated in the INPUT statement.
Simple List Input
Simple list input places several restrictions on the type of data that
the INPUT statement can read:
• By default, at least one blank must separate the input values. Use
the DLM= or DLMSTR= option or the DSD option in the INFILE statement
to specify a delimiter other than a blank.
• Represent each missing value with a period, not a blank, or two
adjacent delimiters.
• Character input values cannot be longer than 8 bytes unless the
variable is given a longer length in an earlier LENGTH, ATTRIB, or
INFORMAT statement.
• Character values cannot contain embedded blanks unless you change
the delimiter.
• Data must be in standard numeric or character format. (footnote 1)
FOOTNOTE 1: See SAS Language Reference: Concepts for the information about standard and nonstandard data values. (my LOL)
The concepts for "SAS Variable Attributes" states
informat
refers to the instructions that SAS uses when reading data values. If
no informat is specified, the default informat is w.d for a numeric
variable, and $w. for a character variable. You can assign SAS
informats to a variable in the INFORMAT or ATTRIB statement. You can
use the FORMAT procedure to create your own informat for a variable.
(my bold)
Apparently there is no explicit default such as 32. or best32. because values with more than 32 digits will be inputted without error.
So does the documentation explain things? Yea, well, sorta. What are the takeaways?
• The human intuition of a numeric variable inheriting its informat during simple list input does not align with the actual implemented behavior.
• Tectonic amounts of existing SAS code mean a change to implement this intuition is highly unlikely.
• Simple statements can involve a lot of concepts with wide-ranging documentation.
• A possible change is that the documentation will be updated to be more explicit about the simple list input caveats.

Why does comma9.2 not work?

Can anyone tell me why comma9.2 is not working in my sas codes?
data have;
input x $16.;
y = input(x, comma9.2);
z = input(x, comma9.);
put x= y= z= ;
cards;
1,740.32
5200
520
52
7,425
9,000.00
36,000.00
;
run;
To expand on Reeza's answer:
Informat decimal places do not quite work the way Format decimal places do. In almost all cases, you will not want to or need to specify the d in the informat. Comma9. is almost always correct, no matter how many decimal places you expect - even if you expect always two.
The only use informat decimal places serve is when you have a number like 12345600, which has no decimal in it, but it ought to (the last two zeros are after the decimal).
data _null_;
input numval 8.2;
put numval=;
datalines;
12345600
12345605
99999989
1857.145
;;;;
run;
This was something that was common once upon a time in the age of punch cards, particularly for accounting; since everything was in dollars and cents, you could save a column by leaving out the decimal, and just read everything in with two decimals. It is no longer common in most fields (at least in my experience), but SAS is always backwards compatible.
SAS will ignore the .d specification if it encounters a decimal point in the data (and will then use the location of that decimal to read in the value correctly), but if there are no decimal points in the data it may read it in incorrectly if you specify the .d. Notice in my example the final row has a decimal point followed by three decimal places, and is read in correctly.
You can read SAS Documentation for more information.
Comma9.2 assumes that values will always have 2 decimal places.
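Putting the answers together, a minimal sketch of the fix is to drop the d value from the informat. The width of 16 is my choice, picked only to match the length of x. An explicit decimal point in the text is still honored, and values without one are no longer divided by 100.

```sas
data have;
  input x $16.;
  /* no d value: explicit decimal points are honored,
     and values without one are left unscaled */
  z = input(x, comma16.);
  put x= z=;
  cards;
1,740.32
5200
7,425
36,000.00
;
run;
```

With the original comma9.2, a row such as 5200 would be read as 52, because the implied decimal point is placed two digits from the right.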

long digit reading in sas

I have a long ID number (say, 12184447992012111111). When I read it from a CSV file using PROC IMPORT, the number is shortened and an 'E' appears among the digits (1.2184448E19, with format best12. and informat best32.). Browsing here, I learned that the CSV file itself may already hold the shortened value, so this may have nothing to do with SAS. So I copied about 5 of the numbers into a DATALINES statement, but the result is the same. It would be helpful if anyone could suggest which format I need to use. With the best32. format I do not get the original number, most probably because it formats the already altered value; it gives me 12184447992012111872, which is not the number I want.
Because your ID variable is really an identifier rather than a "real" number, you need to read it in as a character string. The value you show as an example is too large to be represented as an integer, so since SAS stores all numerics as floating point, you are losing "precision".
Since you mention using PROC IMPORT, copy the SAS program it generates and change the FORMAT and INFORMAT specifications from "21." and "best32." to "$32." (or whatever value matches your data).
Best of course would be if you had SAS Access to PC File formats, in which case you could format the column as "text" in Excel and let SAS read it directly.
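For the DATALINES test mentioned in the question, a minimal sketch of the same idea is to read the ID with a character informat. The data set name `ids` and the variable name `id` are placeholders; $32. is the width suggested above.

```sas
data ids;
  /* read the identifier as text so every digit is preserved */
  input id :$32.;
  put id=;
  datalines;
12184447992012111111
;
run;
```

Because `id` is a character variable, no floating point conversion takes place and the value is stored exactly as typed.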
I'm not sure about the csv changing the value (they are just plain text files) - unless you are saving an excel spreadsheet as a csv file. If you are using excel just set the column to number format, no decimal places.
It might be easier to treat the column as text when importing it to SAS - unless you need to perform mathematical operations on it! If you really need to keep it as a number the format 32. should force it to be a 32 digit number - best is fairly sensibly changing it into scientific notation (though I suspect the data is there in the background and just displayed unhelpfully).
There is a SAS informat for reading exponential notation - Ew.d where w is the width and d the number of decimal places. In your case, it probably won't help because you will "lose" the complete number - and the value stored in case you read with this informat will be 1.2184448 * (10^19). The only way in your case is to ensure that the program which produces the CSV file outputs it in the right way. If you are creating the data from an Excel worksheet, then format the number in the Excel worksheet to display all the digits correctly.