Use of : when reading multiple records in SAS

I am studying SAS programming and one thing is puzzling me. I tried to look up what colons (:) do in the textbook I am using, but I could not find anything.
The following program was one of the questions. With the colons the program reads the in-stream data correctly, but without them it reads it strangely.
I suspect that the length of ABRAMS is less than 12 and that is why it is read incorrectly, but with the colon it is somehow read fine.
I appreciate your help.
data a;
input #1 Lname $ Fname $ /
Department : $12. Salary : comma10.;
cards;
ABRAMS THOMAS
SALES $25,209.03
;
run;
proc print;
run;

Have a look at the documentation for the input statement. There is admittedly quite a lot of it, so here's a link to the specific page that deals with this:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000144370.htm
Relevant quote:
:
enables you to specify an informat that the INPUT statement uses to
read the variable value. For a character variable, this format
modifier reads the value from the next non-blank column until the
pointer reaches the next blank column, the defined length of the
variable, or the end of the data line, whichever comes first. For a
numeric variable, this format modifier reads the value from the next
non-blank column until the pointer reaches the next blank column or
the end of the data line, whichever comes first.
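To see the effect on the program from the question, here is a minimal sketch that reads the same two records both ways. It assumes the Salary informat is meant to be comma10., and the data set names with_colon and no_colon are made up for illustration.

```sas
* With the colon: modified list input, each value ends at the next blank;
data with_colon;
  input #1 Lname $ Fname $ /
        Department : $12. Salary : comma10.;
  datalines;
ABRAMS THOMAS
SALES $25,209.03
;
run;

* Without the colon: formatted input, $12. always consumes 12 columns;
data no_colon;
  input #1 Lname $ Fname $ /
        Department $12. Salary comma10.;
  datalines;
ABRAMS THOMAS
SALES $25,209.03
;
run;

proc print data=with_colon; run;
proc print data=no_colon; run;
```

In the first data set Department should be SALES and Salary 25209.03. In the second, Department takes the first 12 columns of the record (SALES $25,20), so Salary is read only from what is left of the line.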

Related

Is there any function in SAS where we can read the exact value from the variable

Suppose I have a column called ABC, and that variable has data like:
123_112233_66778_1122 or
123_112233_1122_11232 or
1122_112233_66778_123
I want to generate the desired variable in the next column as "1122". I have a long list of values that I need to check against the column ABC; if an exact match is found, the value should be generated. However, I don't want a value like 112233 to count as a match, because it is not the value I am looking for.
For example, in all three lines given above, the only match I want to pick up is "1122".
I really have no clue how to overcome this problem. I tried wildcards but did not have much success. Any help would be much appreciated.
It is hard to tell from your description, but from the values you show it looks like you want the INDEXW() function. That will let you search a string for matching words, with an option to specify which characters are to be considered as the separators between the words. The result is the location where the word starts within the longer string. When the word is not found the result is zero.
Let's create a simple example to demonstrate.
data have;
input abc $30. ;
cards;
123_112233_66778_1122
123_112233_1122_11232
1122_112233_66778_123
;
data want;
set have ;
location = indexw(trim(abc),'1122','_');
run;
Note that SAS will consider any value other than zero (or missing) as TRUE so you can just use the INDEXW() function call in a WHERE statement.
data want;
set have;
where indexw(trim(abc),'1122','_');
run;
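Since the question mentions checking a long list of values, one possible extension is to loop over the candidates and keep each one that matches. This is only a sketch: the array name targets, its three values, and the variable match are made up for illustration, and it assumes the data set have created above.

```sas
data want;
  set have;
  length match $10;
  * hypothetical lookup values, replace with the real list;
  array targets{3} $10 _temporary_ ('1122' '66778' '999');
  do i = 1 to dim(targets);
    * a hit requires the whole word between underscores to match;
    if indexw(trim(abc), trim(targets{i}), '_') then do;
      match = targets{i};
      output;
    end;
  end;
  drop i;
run;
```

A row is written once per matching value, so a string that contains both 1122 and 66778 as whole words appears twice. For a genuinely long list, keeping the values in their own data set and joining may be easier to maintain than a hard-coded array.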

SAS Numeric Informat vs Length

I'm trying to determine how SAS is reading the length statement and then the informat statement. I could be misunderstanding, but I'm under the impression that the informat statement for numeric variables worked like this:
informat number 5.;
This would give the variable number the informat 5., allowing 5 digits to fill it, e.g. 12345.
However, when I run the program below, I have a number with 9 digits, 987654321, and a length of 6, which is enough to fit the digits (it can represent all integers up to 137,438,953,472).
Q: Is the length statement 'overriding' the informat statement and allowing all 9 digits to fill the variable number? How are all 9 digits able to fit in the variable number with an informat of 5.?
data tst;
input number;
length number 6;
informat number 5.;
datalines;
987654321
;
run;
proc print data=tst;
run;
Based on this SAS documentation:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000199348.htm
w specifies the width of the input field. Range: 1-32
It would seem that the informat w.d would work as I first described and not allow all 9 digits to fill number.
It is because you are using list mode input. In that mode SAS reads the next word, however long it is. Essentially, in list mode input (including when using the : modifier before an informat specified in the INPUT statement) the width on an informat is ignored.
Other than for creating metadata in the SAS dataset there is not much value in attaching informats like 5. or $10. to variables.
SAS does not need them to understand how to convert text into values, unlike informats such as date.
In list mode it ignores the width part.
And in formatted input, where the width matters, you have to specify the informat in the INPUT statement itself.
First off: length is not overriding, or having any impact on, the informat or the read-in. length solely describes how many bytes are used to store the number, nothing more.
For numeric variables, informats don't work quite the intuitive way. I'm not sure why - but they don't.
See this quotation from the list input documentation:
For a character variable, this format modifier reads the value from the next non-blank column until the pointer reaches the next blank column, the defined length of the variable, or the end of the data line, whichever comes first. For a numeric variable, this format modifier reads the value from the next non-blank column until the pointer reaches the next blank column or the end of the data line, whichever comes first.
They do listen to the informat to some extent - add a .2 there and you'll get a forced decimal - but they don't listen to it as to how long of a value to read in. I'm not sure why; it seems intuitive that they should, but they don't.
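A small sketch of that point, reusing the data from the question with a hypothetical 5.2 informat (the data set name tst2 is also made up): the width is ignored, but the decimal part is applied because the input has no decimal point of its own.

```sas
data tst2;
  * the 5 is ignored in list input, the .2 is still applied;
  informat number 5.2;
  input number;
  put 'NOTE: ' number=;
  datalines;
987654321
;
run;
```

The log should show number=9876543.21: all nine digits are read, then the implied two decimal places divide the value by 100.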
Here it is with character variables. They respect the length but also ignore the informat width:
data tst;
length number $9;
informat number $5.;
input number;
datalines;
987654321
;
run;
proc print data=tst;
run;
Though you do need to put the informat before the input statement (and the length for numeric variables).
More detail is available on the documentation page for INFORMAT:
How SAS Treats Variables When You Assign Informats with the INFORMAT Statement
Informats that are associated with variables by using the INFORMAT statement behave like informats that are used with modified list input. SAS reads the variables by using the scanning feature of list input, but applies the informat.
In modified list input, SAS
• does not use the value of w in an informat to specify column positions or input field widths in an external file
• uses the value of w in an informat to specify the length of previously undefined character variables
• ignores the value of w in numeric informats
• uses the value of d in an informat in the same way it usually does for numeric informats
• treats blanks that are embedded as input data as delimiters unless you change their status with a DLM= or DLMSTR= option specification in an INFILE statement.
That is much more explicit about the fact that SAS ignores the value of w.
The length of a variable defines the amount of space the value occupies when stored to disk. NOTE: During a running DATA step all numerics are double precision; the truncation to a length < 8 only occurs when the value is written to the output data set.
The informat is a separate concept from the length. An informat defines how incoming value representations are to be interpreted for storage as a SAS numeric value. Incoming value representations are whatever text has to be processed, be it an INPUT statement reading a file, a VIEWTABLE field edit processing a typed-in value, an EG grid cell edit, etc.
The format is a similarly separate concept that defines how SAS renders a numeric value for output, be it a PUT statement, a VIEWTABLE row render, a placement in a PROC's output, an EG grid cell, etc.
Explanation
Now that that is out of the way: the informat is honored when explicitly stated in an INPUT statement:
data _null_;
attrib number length=6 informat=5.;
input number 5.;
put 'NOTE: ' number=;
datalines;
987654321
run;
===== LOG =====
NOTE: number=98765
And, as you ask, the variable's associated informat is not applied when an explicit numeric informat is not stated in the INPUT statement:
data _null_;
attrib number length=6 informat=5.;
input number;
put 'NOTE: ' number=;
datalines;
987654321
run;
===== LOG =====
NOTE: number=987654321
So the first is formatted input (the informat is specified in the INPUT statement) and the second is simple list input (because no informat is specified there).
Simple list input will accept some absurdly large data, and the resultant value, while not tail-end precise, will be at the correct exponential level.
data _null_;
attrib number length=6 informat=5.;
input number;
put 'NOTE: ' number= ;
datalines;
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
run;
===== LOG =====
NOTE: number=1.2345679E89
What do the docs for INPUT Statement, List say? Certainly nothing about using the variable's declared informat when none is indicated:
Simple List Input
Simple list input places several restrictions on the type of data that
the INPUT statement can read:
• By default, at least one blank must separate the input values. Use
the DLM= or DLMSTR= option or the DSD option in the INFILE statement
to specify a delimiter other than a blank.
• Represent each missing value with a period, not a blank, or two
adjacent delimiters.
• Character input values cannot be longer than 8 bytes unless the
variable is given a longer length in an earlier LENGTH, ATTRIB, or
INFORMAT statement.
• Character values cannot contain embedded blanks unless you change
the delimiter.
• Data must be in standard numeric or character format. (footnote 1)
FOOTNOTE 1: See SAS Language Reference: Concepts for the information about standard and nonstandard data values. (my LOL)
The concepts for "SAS Variable Attributes" states
informat
refers to the instructions that SAS uses when reading data values. If
no informat is specified, the default informat is w.d for a numeric
variable, and $w. for a character variable. You can assign SAS
informats to a variable in the INFORMAT or ATTRIB statement. You can
use the FORMAT procedure to create your own informat for a variable.
(my bold)
Apparently there is no explicit default such as 32. or best32. because values with more than 32 digits will be inputted without error.
So does the documentation explain things? Yeah, well, sort of. What are the takeaways?
• The human intuition of a numeric variable inheriting its informat during simple list input does not align with the actual implemented behavior.
• The tectonic amount of existing SAS code means a change to implement this intuition is highly unlikely.
• Simple statements can involve a lot of concepts with wide-ranging documentation.
• One possible change is that the documentation will be updated to be more explicit about the simple list input caveats.

SAS LENGTH vs INPUT for defining variables

I was wondering if there's a difference between, for example, using:
LENGTH var_1 $12.;
INPUT var_1 $;
vs
INPUT var_1 : $12.;
when reading in standard input from datalines or an external file?
They are the same as long as the LENGTH or the INPUT statement is the first place that the SAS compiler sees VAR_1 referenced and needs to decide what type and length to assign to it. Both will cause VAR_1 to be defined as a character variable of length 12. The LENGTH statement will do it explicitly and the INPUT statement will do it as a side effect. SAS assumes that you wanted the type to be character since you used a character informat. It also assumes that you want the length to be the same as the width on the informat. (Note that you could reference the variable in a RETAIN statement beforehand and SAS would not make the decision as to the type and length at that time.)
Both INPUT statements will read VAR_1 in list mode; the second one does so because it includes the : modifier before the informat specification. So SAS will read the next word it sees (what counts as a word depends on the DSD and TRUNCOVER options and on whether the & modifier is used) into VAR_1, even if the next word is longer than 12 characters. When you read data in list mode instead of formatted mode, SAS ignores the width of the informat and reads the whole next word. So if the next word is longer than 12 characters, the extra characters are read past and discarded rather than stored.
Note that if you have already defined VAR_1 as a character variable then you do not need to add the $ after it in the INPUT statement in your first case.
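A quick way to check this is to read the same line both ways and compare the results. The following is a sketch only: the data set names a and b, the second variable var_2, and the sample line are made up for illustration.

```sas
* Variant 1: LENGTH defines the variable, INPUT reads it in list mode;
data a;
  length var_1 $12;
  input var_1 $ var_2 $;
  datalines;
ABCDEFGHIJKLMNOP next
;
run;

* Variant 2: the informat width defines the length as a side effect;
data b;
  input var_1 : $12. var_2 $;
  datalines;
ABCDEFGHIJKLMNOP next
;
run;

proc contents data=a; run;
proc contents data=b; run;
proc print data=a; run;
proc print data=b; run;
```

In both data sets var_1 should have length 12 and hold ABCDEFGHIJKL, while var_2 reads next, since the rest of the long word is skipped rather than left for the following variable.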
Both do the same job. @Tom has a detailed and nice answer.

SAS - Text Parsing for Case Sensitive Characters

I'm having some problems cleaning up free-form text strings from a series of notations. The last part of this task involves identifying any names and removing them from the string. Luckily, all names are upper-cased (always), and the relevant information is placed before the name (always).
My first thought was to use the FIND function to isolate where the name starts, then just output all characters before the starting position...but I could not determine how to use a "wild card" like option to grab the starting position of ANY capital letter. Sample and attempts included below -
DATA SAMPLE;
INPUT TXT $;
CARDS;
firsT
Second
thIrd
foUrth
;
RUN;
Attempt1:
DATA TEST;
SET SAMPLE;
ID = FIND(TXT,'A'-'Z');
RUN;
Attempt2:
DATA TEST;
SET SAMPLE;
ID = FIND(TXT,'A-Z');
RUN;
Clearly both attempts above are not too far from one another, but I could not find (or think) of another approach. Hoping that some mysterious function will come to rescue here...
Assuming I understand what you want to do, you're close - just not doing things the 'SAS' way.
FIND has two siblings, FINDC and FINDW. FINDC finds any single character from a list of characters, which sounds like what you want to do. It has a lot of options for adding lists of characters; you can't just give it A-Z as that would add those three characters, but you can give it a U option to add uppercase characters.
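DATA TEST;
SET SAMPLE;
* position of the first uppercase letter, 0 if there is none;
_endpos = FINDC(TXT,,'u');
* keep everything before it, guarding against SUBSTR lengths below 1;
IF _endpos = 0 THEN ID = TXT;
ELSE IF _endpos > 1 THEN ID = SUBSTR(TXT,1,_endpos-1);
RUN;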

input an array in SAS

I need to read multiple raw text files into a SAS dataset. Each file consists of several ingredients, as shown in the example files below. Each file (a dish) lists all the ingredients on one line, separated by commas. The number of ingredients is variable. Some example files (dishes):
Example file 1 (dish1.csv):
Tomate, Cheese, Ham, Bread
Example file 2 (dish2.csv):
Sugar, Apple
Example file 3 (dish3.csv):
Milk, Sugar, Cacao
Because I have about 250 files (dishes) I created a macro program to read those files. That way I can execute this macro in another macro to read all the dishes I need. The program looks like this:
%macro readDish(dishNumber);
data newDish;
* Find and read the csv-file;
infile "my_file_location/dish&dishNumber..csv" dlm=";" missover;
* Read up to 25 ingredients;
input ingredient1-ingredient25 : $25.;
* Put all ingredients in an array;
array ingredients{25} ingredient1-ingredient25;
* Loop through all the ingredients and output;
do i=1 to dim(ingredients);
dishNumber = &dishNumber;
ingredient = ingredients{i};
output;
end;
run;
%mend;
Is it possible to create a SAS (macro) program that is able to read all dishes, no matter how many ingredients I have? The SAS table should look like this:
1 Tomate
1 Cheese
1 Ham
1 Bread
Seems straightforward to me: read the data in vertically, then if you need it horizontal, add a transpose step afterwards. You don't have to read in a whole line in one step - the @@ operator tells SAS to keep the line pointer on that line, so you just read in one value at a time.
data dishes;
length _file $1024
ingredient $128;
infile "c:\temp\dish*.csv" dlm=',' filename=_file lrecl=32767; *or whatever your LRECL needs to be;
input ingredient $ @@;
dishnumber = input(compress(scan(_file,-2,'\.'),,'kd'),12.);
output;
run;
Here I use a wildcard to read them all in - you can of course use a macro with similar code if you need to, though a wildcard or a concatenated filename is probably easier. The way I get dishnumber might not always work depending on the filename construction, but some form of that should be usable.
To expand on why this works: The way the datastep works in SAS is that it is a constant loop, looping over the code repeatedly until it encounters an "end condition". End conditions are, most commonly, the stop keyword, and then any attempt to read from a SET or INFILE where no further read is possible (i.e., you read a 100 line SAS dataset, and it tries to read row 101 in, fails, so ends the data step). However, other than that, it will keep doing the same code until it gets there. It just does some cleanup at the "run" point to make sure it is not infinitely looping.
In the case of input from infiles, usually SAS reads a line, then at the RUN, it will skip forward to the next EOL (end of line, usually a carriage return and linefeed in Windows) if it's not already at one. Sometimes that is useful - perhaps, usually. But, in some cases you'd rather ask SAS to keep reading the same line.
In comes the @@ operator. A double trailing @@ says "do not advance to EOL even if you hit RUN". (A single trailing @ says "do not advance to EOL except when you hit RUN" - normally input itself causes SAS to read until EOL.) Thus, when you perform the next data step iteration, the input pointer will be in the same exact place you left it - right after the previous field you read in.
This was highly useful in the 60s and 70s, when punchcards were the trendy new thing and input lines were often filled without regard to any line organization. In particular, if you put just one variable per row, at 8 columns per input variable, you would waste 72 columns of each punchcard. So you had input just like your ingredients: many pieces of data per row on the input, which then needed to be translated into one piece of data per row in memory. While it is not as common nowadays to store data this way, it is certainly possible, as your data exemplify.
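The same idea can be tried without any external files. The sketch below is an illustration only: it uses in-stream data in place of the dish files, so there is no dish number, and the LEFT() call is a defensive assumption in case the blank after each comma is kept in the value.

```sas
data dishes;
  infile datalines dlm=',';
  length ingredient $25;
  * the trailing @@ holds the line across iterations of the data step;
  input ingredient $ @@;
  * defensive: left-align in case the blank after a comma is kept;
  ingredient = left(ingredient);
  datalines;
Tomate, Cheese, Ham, Bread
Sugar, Apple
;
run;

proc print data=dishes; run;
```

This should produce one observation per ingredient, six for the two sample lines, with the data step moving to the next line on its own once the current one is exhausted.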