I need to read multiple raw text files into a SAS dataset. Each file consists of several ingredients, as shown in the example files below. Each file (a dish) lists all its ingredients on one line, separated by commas. The number of ingredients varies. Some example files (dishes):
Example file 1 (dish1.csv):
Tomate, Cheese, Ham, Bread
Example file 2 (dish2.csv):
Sugar, Apple
Example file 3 (dish3.csv):
Milk, Sugar, Cacao
Because I have about 250 files (dishes) I created a macro program to read those files. That way I can execute this macro in another macro to read all the dishes I need. The program looks like this:
%macro readDish(dishNumber);
data newDish;
* Find and read the csv-file;
infile "my_file_location/dish&dishNumber..csv" dlm="," missover;
* Read up to 25 ingredients;
input ingredient1-ingredient25 : $25.;
* Put all ingredients in an array;
array ingredients{25} ingredient1-ingredient25;
* Loop through all the ingredients and output;
do i=1 to dim(ingredients);
dishNumber = &dishNumber;
ingredient = ingredients{i};
output;
end;
run;
%mend;
Is it possible to create a SAS (macro) program that is able to read all dishes, no matter how many ingredients I have? The SAS table should look like this:
1 Tomate
1 Cheese
1 Ham
1 Bread
Seems straightforward to me: read the data in vertically, then if you need it horizontal, add a transpose step afterwards. You don't have to read a whole line in one step: the trailing @@ tells SAS to keep the line pointer on that line, so you read just one value at a time.
data dishes;
length _file $1024
ingredient $128;
infile "c:\temp\dish*.csv" dlm=',' filename=_file lrecl=32767; *or whatever your LRECL needs to be;
input ingredient $ @@;
dishnumber = input(compress(scan(_file,-2,'\.'),,'kd'),12.);
output;
run;
Here I use a wildcard to read them all in. You can of course use a macro with similar code if you need to, though a wildcard or a concatenated filename is probably easier. The way I get dishnumber might not always work, depending on how the filenames are constructed, but some form of it should be usable.
To expand on why this works: The way the datastep works in SAS is that it is a constant loop, looping over the code repeatedly until it encounters an "end condition". End conditions are, most commonly, the stop keyword, and then any attempt to read from a SET or INFILE where no further read is possible (i.e., you read a 100 line SAS dataset, and it tries to read row 101 in, fails, so ends the data step). However, other than that, it will keep doing the same code until it gets there. It just does some cleanup at the "run" point to make sure it is not infinitely looping.
In the case of input from infiles, usually SAS reads a line, then at the RUN, it will skip forward to the next EOL (end of line, usually a carriage return and linefeed in Windows) if it's not already at one. Sometimes that is useful - perhaps, usually. But, in some cases you'd rather ask SAS to keep reading the same line.
In comes the @@ operator. @@ says "do not advance to EOL even if you hit RUN". (@ says "do not advance to EOL except when you hit RUN"; normally INPUT itself causes SAS to read until EOL.) Thus, when you perform the next data step iteration, the input pointer will be in the exact same place you left it: right after the previous field you read in.
This was highly useful in the 60s and 70s, when punchcards were the trendy new thing, and you would put lines of input often without regard to any line organization - in particular, if you input just one variable per row, at 8 columns per input variable, you're not wasting 72 blocks from one punchcard - so, you have input just like your ingredients: many pieces of data per row on the input, which then want to be translated into one piece of data per row in memory. While it's not as common nowadays to store data this way, this is certainly possible - as your data exemplify.
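As a minimal sketch of the trailing @@ at work, the step below reads two of the example dishes from in-stream data rather than from files. The dataset name ingredients_demo and the use of DATALINES are assumptions made only for illustration. Each iteration reads one comma-delimited value and writes one row, so the result should contain six rows.

```sas
data ingredients_demo;            /* hypothetical dataset name */
   infile datalines dlm=',';
   length ingredient $25;
   /* The trailing @@ holds the current line across DATA step iterations, */
   /* so each iteration picks up the next value on the same line.         */
   input ingredient $ @@;
   datalines;
Tomate, Cheese, Ham, Bread
Sugar, Apple
;
run;

proc print data=ingredients_demo;
run;
```

When a line is exhausted, SAS moves on to the next one by itself, which is why no explicit loop over the values is needed.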
I am studying SAS programming and there is one thing that is puzzling me. I tried to look up what colons (:) do in the text book I am using but I could not find anything.
The following program was one of the questions, and with the colon the program does read the instream data but without the colons it reads funny.
I suspect that the length of ABRAMS is less than 12 and that is why it is read incorrectly, but with the colon, for some reason, it is read fine.
I appreciate your help.
data a;
input #1 Lname $ Fname $ /
Department : $12. Salary : comma10.;
cards;
ABRAMS THOMAS
SALES $25,209.03
;
run;
proc print;
run;
Have a look at the documentation for the input statement. There is admittedly quite a lot of it, so here's a link to the specific page that deals with this:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000144370.htm
Relevant quote:
:
enables you to specify an informat that the INPUT statement uses to
read the variable value. For a character variable, this format
modifier reads the value from the next non-blank column until the
pointer reaches the next blank column, the defined length of the
variable, or the end of the data line, whichever comes first. For a
numeric variable, this format modifier reads the value from the next
non-blank column until the pointer reaches the next blank column or
the end of the data line, whichever comes first.
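To make the quoted behavior concrete, here is a small sketch that reads the same two data lines with and without the colon modifier. The dataset names with_colon and without_colon are made up for this comparison, and the informat is written as comma10. on the assumption that this is what the question intended.

```sas
/* With the colon: each value ends at the next blank. */
data with_colon;
   input Lname $ Fname $ /
         Department : $12. Salary : comma10.;
   datalines;
ABRAMS THOMAS
SALES $25,209.03
;
run;

/* Without the colon: $12. always consumes exactly 12 columns, */
/* so Department swallows the start of the salary field.       */
data without_colon;
   input Lname $ Fname $ /
         Department $12. Salary comma10.;
   datalines;
ABRAMS THOMAS
SALES $25,209.03
;
run;

proc print data=with_colon;
run;
proc print data=without_colon;
run;
```

In the first dataset Department should be SALES and Salary 25209.03. In the second, Department should come out as "SALES $25,20" and Salary as 9.03, because the formatted read does not stop at the blank.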
I'm having some problems cleaning up free-form text strings from a series of notations. The last part of this task involves identifying any names and removing them from the string. Luckily, all names are always in uppercase, and the relevant information is always placed before the name.
My first thought was to use the FIND function to isolate where the name starts, then just output all characters before the starting position...but I could not determine how to use a "wild card" like option to grab the starting position of ANY capital letter. Sample and attempts included below -
DATA SAMPLE;
INPUT TXT $;
CARDS;
firsT
Second
thIrd
foUrth
;
RUN;
Attempt1:
DATA TEST;
SET SAMPLE;
ID = FIND(TXT,'A'-'Z');
RUN;
Attempt2:
DATA TEST;
SET SAMPLE;
ID = FIND(TXT,'A-Z');
RUN;
Clearly both attempts above are not too far from one another, but I could not find (or think) of another approach. Hoping that some mysterious function will come to rescue here...
Assuming I understand what you want to do, you're close - just not doing things the 'SAS' way.
FIND has two siblings, FINDC and FINDW. FINDC finds any single character from a list of characters, which sounds like what you want to do. It has a lot of options for adding lists of characters; you can't just give it A-Z, as that would add those three characters, but you can give it the U modifier to add uppercase characters.
DATA TEST;
SET SAMPLE;
_endpos= FINDC(TXT,,'u');
ID = substr(TXT,1,_endpos-1);
RUN;
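One thing the step above does not handle is a string with no uppercase letter, or one that starts with an uppercase letter; in both cases the length passed to SUBSTR is not positive. The sketch below guards against that. It swaps in ANYUPPER, a different function from the FINDC call used above, which returns the position of the first uppercase letter or 0 if there is none. The length of id and the handling of the two edge cases are assumptions about what is wanted.

```sas
data sample;
   input txt $;
   datalines;
firsT
Second
thIrd
foUrth
;
run;

data test;
   set sample;
   length id $8;
   /* Position of the first uppercase letter, or 0 if there is none */
   _endpos = anyupper(txt);
   if _endpos > 1 then id = substr(txt, 1, _endpos - 1);
   else if _endpos = 0 then id = txt;   /* no uppercase letter: keep everything */
   else id = ' ';                       /* uppercase in position 1: nothing precedes it */
run;
```

For the sample data this should give firs, a blank value, th, and fo.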
The following is the simple SAS program:
data mydata;
do group = 'placebo', 'active';
do subj = 1 to 5;
input score @;
output;
end;
end;
datalines;
250 222 230 210 199
166 183 123 129 234
;
I am learning SAS by myself. So I was thinking to make sure what happens here. For my understanding, the first line of the 5 entries belongs to the group placebo and the second line belongs to the group active. At first, the input buffer contains the first line of the 5 numbers, and the do subj=1 to 5 prints them out one by one, until the end of the current data step iteration. Then, the data step continues with the second iteration. Is this understanding correct? Many thanks for your time and attention.
PS. I just want to make sure I know when the current input buffer is released. After checking online, I found that the purpose of the @ is as follows:
holds an input record for the execution of the next INPUT statement within the same iteration of the DATA step. This line-hold specifier is called trailing @.
So, it means the input buffer is released if one of the following two conditions is met:
(1): A new input statement is met without any @ or @@.
(2): The end of the current data step iteration.
Any comments are greatly appreciated.
I like Tom's answer, but want to expand a bit on the meaning of data step iteration. You wrote:
At first, the input buffer contains the first line of the 5 numbers, and the do subj=1 to 5 prints them out one by one, until the end of the current data step iteration. Then, the data step continues with the second iteration. Is this understanding correct?
The DATA step is an implied iterative loop, from the top (DATA statement) to the bottom (typically the RUN statement; in this case, I think, the DATALINES statement). If you want to see what happens on each iteration of the loop, you can write values to the log with the PUT statement. You can also write _N_ to the log, which is a counter of the DATA step iteration number. So you might change your code to:
do group = 'placebo', 'active';
do subj = 1 to 5;
input score @;
put _n_= score= ;
output;
end;
end;
If you do that you should see that all of the data (all 10 values from both rows) are processed on the first iteration of the DATA step. You should only see _n_=1 in the log. As @Tom explained, this is because in the explicit looping you wrote, SAS moves forward to the second line of data when it can't find a sixth value to read on the first line. I think most people would consider the NOTE SAS throws about moving to the next line as a warning or even an error.
If you want to have two iterations of the DATA step loop, you could change to something like:
if _n_=1 then group = 'placebo';
else if _n_=2 then group= 'active';
do subj = 1 to 5;
input score @;
put _n_= score= ;
output;
end;
(Not suggesting that two iterations is better, or that the above code is better, point is just to show what data step iteration means).
Your code should work fine, but you should see a note that SAS went to a new line in your LOG.
When GROUP='placebo' the inner loop (DO SUBJ ...) will read 5 numbers and leave the pointer at the end of the first line. Then the outer loop will execute again with GROUP='active'. When it tries to read the SCORE for SUBJ=1 there is nothing left on the first line. So SAS will skip to the next line and read the first SCORE from there. Then the other four values are read from that line.
Finally at the end of the data step it will "release" the line so the pointer will be at the beginning of line three (if there was a line three).
Then the whole data step will loop one more time and set GROUP='placebo' and SUBJ=1, but when it tries to read the SCORE it reads past the end of the file and stops the data step.
Note that your program would work fine as long as you have 10 values spaced over as many lines as you want.
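As a quick way to check that last point, the sketch below keeps the same loops but spreads the ten values over three lines of uneven length. The line breaks are arbitrary and chosen only for illustration.

```sas
data mydata;
   do group = 'placebo', 'active';
      do subj = 1 to 5;
         /* The trailing @ holds the line; when it runs out of values, */
         /* SAS moves to the next line and notes this in the log.      */
         input score @;
         output;
      end;
   end;
   datalines;
250 222 230
210 199 166 183
123 129 234
;
run;

proc print data=mydata;
run;
```

The result should still be ten rows, five per group, in the same order as with the original two-line layout.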
Hi everybody, I've run into a problem reading unformatted character strings from a simple file. Once the first / is encountered, everything after it is lost.
This is the example of the text I would like to read: after the first 18 character blocks that are fixed (from #Mod to Flow[kW]), there is a list of chemical species' names, that are variables (in this case 5) within the program I'm writing.
#Mod ID Mod Name Type C. #Coll MF[kg/s] Pres.[Pa] Pres.[bar] Temp.[K] Temp.[C] Ent[kJ/kg K] Power[kW] RPM[rad/s] Heat Flow[kW] METHANE ETHANE PROPANE NITROGEN H2O
I would like to skip, after some formal checks, the first 18 blocks, then read the chemical species. To do the former, I created a character array with dimension of 18, each with a length of 20.
character(20), dimension(18) :: chapp
Then I would like to associate the 18 blocks to the character array
read(1,*) (chapp(i),i=1,18)
...but this is the result: chapp(1) through chapp(7) hold the first 7 strings correctly, but this is chapp(8):
chapp(8) = 'MF[kg '
and from here on, everything is left blank!
How could I overcome this reading problem?
The problem is due to your using list-directed input (the * as the format). List-directed input is useful for quick and dirty input, but it has its limitations and quirks.
You stumbled across a quirk: A slash (/) in the input terminates assignment of values to the input list for the READ statement. This is exactly the behavior that you described above.
This is not a choice made by the compiler writer; it is mandated by all relevant Fortran standards.
The solution is to use formatted input. There are several options for this:
If you know that your labels will always be in the same columns, you can use a format string like '(1X,A4,2X,A2,1X,A3,2X)' (this is not complete) to read in the individual labels. This is error-prone, and it also breaks if the program that writes out the data changes format for some reason, or if the labels are edited by hand.
If you can control the program that writes the labels, you can use tab characters to separate the individual labels (and also, later, the data). Read in the whole line, split it into tab-separated substrings using INDEX, and read in the individual fields using an (A) format. Don't use list-directed format, or you will get hit by the / quirk mentioned above. This has the advantage that your labels can also include spaces, and that the data can be imported from/to Excel rather easily. This is what I usually do in such cases.
Otherwise, you can read in the whole line and split on multiple spaces. A bit more complicated than splitting on single tab characters, but it may be the best option if you cannot control the data source. You cannot have labels containing spaces then.
I'm trying to replace embedded spaces in one of my variables (QPR) with a new character. Here is my (abbreviated) code:
data sas2;
input QPR $ & 1-9;
QPR=tranwrd(strip(QPR)," ","0");
run;
proc print data=sas2;
run;
The tranwrd function seems to work for observations with one embedded blank; however, it does not work when there are two blanks in a row.
For example, 234 2345 (one blank) becomes 23402345, but 234  345 (two blanks) becomes 234 (i.e., the rest gets cut off, I assume because of strip). Instead, I want 23400345.
I also tried tranwrd without the strip function, but then 234  345 becomes 23400000 instead. Translate does the same thing.
Any ideas on why this won't work and how to fix it? Alternatively, are there easier/better ways to do this in the data step?
The "&" symbol in your input statement causes SAS to stop reading the data after two spaces. After SAS stops reading the data, it pads the rest of the string with spaces up to a total length of 9 chars. This is why you had a bunch of zeros at the end of the string when you didn't use strip. Removing the "&" should fix it.
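A minimal sketch of that fix, using the two values from the question as in-stream data (the DATALINES block is an assumption, since the original code was abbreviated). With plain column input, all nine columns are read regardless of how many blanks they contain.

```sas
data sas2;
   /* Column input without the & modifier: columns 1-9 are read in full */
   input QPR $ 1-9;
   /* strip() removes the trailing pad so that only embedded blanks are replaced */
   QPR = tranwrd(strip(QPR), " ", "0");
   datalines;
234 2345
234  345
;
run;

proc print data=sas2;
run;
```

This should give 23402345 for the first value and 23400345 for the second.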