I have a file in which the first line is a header line containing some meta-data information.
How can I get the current observation number (say, 1 for the first observation) that SAS is processing, so that I can use an IF clause to handle this special data line?
Follow-up: I want to process the first line and keep one of its column values in a local variable for further processing. I don't want to keep this line in my final output. Is this possible?
The automatic variable _N_ returns the current iteration number of the SAS data step loop. For a traditional data step, i.e.:
data something;
set something;
(code);
run;
_N_ is equivalent to the row number (since one row is retrieved for each iteration of the data step loop).
So if you wanted to only do something once, on the first iteration, this would accomplish that:
data something;
set something;
if _n_ = 1 then do;
(code);
end;
(more code);
run;
For your follow up, you want something like this:
data want;
set have;
retain _temp;
if _n_ = 1 then do;
_temp = x;
end;
... more code ...
drop _temp;
run;
DROP and RETAIN statements can appear anywhere in the code and have the same effect; I placed them in their human-logical locations. RETAIN tells SAS not to reset the variable to missing each time through the data step loop, so you can access it further down.
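Note that the step above still writes the header row to want. If the header row should not appear in the output, one possible variation is sketched below. The inline data, the variable names batch and line, and the use of DATALINES in place of a real file are assumptions made for illustration.

```sas
/* Sketch: keep a value from the header line, but drop the header row itself. */
data want;
    infile datalines truncover;
    length batch $8 line $20;
    retain batch;              /* carry the header value across iterations */
    input line $20.;
    if _n_ = 1 then do;
        batch = line;          /* remember the value from the header line */
        delete;                /* skip output for the header row */
    end;
datalines;
B2024
first record
second record
;
run;
```

DELETE returns to the top of the data step without writing the current observation, so only the two data rows reach want, each carrying the retained batch value.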
If you are reading a particularly large text file, you may want to avoid evaluating the if _n_=1 condition on every iteration. You can do this by reading the file twice: once to extract the header row, and again to read in the data, as follows:
data _null_; /* create dummy file for demo purposes */
file "c:\myfile.txt";
put 'blah'; output;
put 'blah blah blah 666'; output;
data _null_; /* read in header info */
infile "c:\myfile.txt";
input myvar:$10.; /* or wherever the info is that you need */
call symput('myvar',myvar);/* create macro variable with relevant info */
stop; /* no further processing at this point */
data test; /* read in data FROM SECOND LINE */
infile "c:\myfile.txt" firstobs=2 ; /* note the FIRSTOBS option */
input my $ regular $ input $ statement ;
remember="&myvar";
run;
For short, simple cases, though, Joe's answer is better as it's more readable (and may be more efficient for small files).
I'm relatively new to SAS (using SAS EG if it matters).
I have several previously written programs that read new data files. For each type of data file there's a separate program with its specific INPUT column statement.
For example, one program would have:
DATA data1;
INFILE 'D:\file.txt' noprint missover;
INPUT
ID 1 - 8
NAME $ 9 - 20;
run;
whereas another program would have other definitions, for example:
INPUT
ID 1 - 5
NAME $ 6 - 20
Each data file contains hundreds of variables, so the INPUT column statement in each program is very long. However, the rest of these programs are completely identical.
My intention is to combine these programs into one.
I have two questions:
Is it possible to combine these programs with a conditional INPUT column statement?
Is it possible to read the definition of each file type columns from a variable? (Thus enabling me to define it elsewhere in the workflow or even to read it from an external file)
It seems like you use text files with a fixed-width layout. For each of these you can specify a format file with lines of the form
column, type, start, end
and then read that file first in order to build the INPUT statement. Here column is the column name, type is either n (numeric) or c (character), and start and end are the first and last positions of the column.
You would wrap this into a MACRO like this:
%macro readFile(file, output);
%local input_statement;
/* First, read the format file that contains the column details. */
data _null_;
infile "&file..fmt" dlm="," end=eof;
input column $ type $ start end;
length input_statement $ 32767;
retain input_statement "input";
if type = "c" then type = "$";
else type = "";
input_statement = catx(" ", input_statement, column, type, start, "-", end);
if eof then call symputx("input_statement", input_statement);
run;
/* Read the actual file. */
data &output.;
infile "&file.";
&input_statement.;
run;
%mend;
For a file file.txt the macro needs the format file to be named file.txt.fmt in the same path. Call the macro as
%readFile(%str(D:\file.txt), data1);
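As an illustration, for the first layout in the question (ID in columns 1 - 8, NAME in columns 9 - 20), the format file could be created as in the sketch below. The two field rows are taken from the question; writing the format file from a data step is only one convenient way to create it, and the sketch assumes the readFile macro above has already been compiled and that D:\file.txt exists.

```sas
/* Sketch: create the format file expected by %readFile for D:\file.txt. */
data _null_;
    file "D:\file.txt.fmt";
    put "ID,n,1,8";        /* numeric column in positions 1-8    */
    put "NAME,c,9,20";     /* character column in positions 9-20 */
run;

/* With these two rows the macro should build the statement:  */
/*   input ID 1 - 8 NAME $ 9 - 20                             */
%readFile(%str(D:\file.txt), data1);
```

A second file type then only needs its own .fmt file, not a second program.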
I am new to SAS and I need some help here
The question below:
So far, I have done this:
data Purchase;
infile ‘c:\temp\PurchaseRecords.dat’ dlm=’,’ DSD;
input id $8 visit_no @ unitpurchased @;
keep id unitpurchased;
run;
What do I need to add in my statement to make those orders look like this?
just an example.
Thank you.
You can use the INFILE option column= together with the held-input modifier @ on the INPUT statement to detect when held input has run past a trailing comma, which here indicates a missing value that should be treated as zero units_purchased. The automatic variable _infile_ is used to check whether the next read would start beyond the end of the data line.
data want;
infile datalines dsd dlm=',' column=p;
attrib id length=$8 units_purchased length=8 ;
input id @; * held input record;
* loop over held input record;
do while (p <= length(_infile_)+1); * +1 for dealing with trailing comma;
input units_purchased @; * continue to hold the record;
if missing(units_purchased) then units_purchased = 0;
output;
end;
datalines;
C005,3,15,,39
D2356,4,11,,5
A323,3,10,15,20
F123,1,
run;
The sometimes easier to use @@ modifier wouldn't be used in this case, because a missing value is to be considered valid input and thus can't be used to assert a 'no more data' condition.
Since the data includes the number of values, use that to control a DO loop to read the values. I am not sure why you would want to lose the information on the order of the values, so I have commented out the KEEP statement. To convert the missing values to zeros I used a sum statement. You could use an IF/THEN statement, a COALESCE() function call, or other methods to convert the missing values to zeros.
data Purchase;
infile 'c:\temp\PurchaseRecords.dat' dsd truncover ;
length id $8 ;
input id visit_no @ ;
do visit=1 to visit_no ;
input unitpurchased @;
unitpurchased+0;
output;
end;
* keep id unitpurchased;
run;
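The COALESCE() alternative mentioned above could look like the following sketch. It uses inline DATALINES with two of the sample records shown earlier in this thread instead of the external file, purely so that it can be run on its own.

```sas
/* Sketch: same loop as above, with COALESCE replacing the sum statement. */
data Purchase;
    infile datalines dsd truncover;
    length id $8;
    input id visit_no @;                              /* hold the line */
    do visit = 1 to visit_no;
        input unitpurchased @;                        /* next value, line still held */
        unitpurchased = coalesce(unitpurchased, 0);   /* missing becomes 0 */
        output;
    end;
datalines;
C005,3,15,,39
F123,1,
;
run;
```

Unlike the sum statement, COALESCE does not implicitly retain unitpurchased, which some readers may find easier to reason about.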
Your original program had a few errors:
Wrong quote characters. Use normal ASCII single or double quote characters.
It is reading value of ID from only column 8. I find it better to use LENGTH statement to define the variables instead of forcing SAS to guess at how to define the variables.
The input statement improperly tries to use the column pointer motion command, @nnn. Plus, the variable naming the location to move the pointer to, unitpurchased, has not yet been given a value.
No attempt was made to read more than one value from the line.
You did not include truncover (or even the older missover) option on your infile statement.
I have something similar to the code below. I want to create every two-character combination within my strings, count the occurrences of each, and store the counts in a table. I will be changing the substr statement to a do loop to iterate through the whole string, but for now I just want to get the first character pair to work:
data temp;
input cat $50.;
call symput ('regex', substr(cat,1,2));
&regex = count(cat,substr(cat,1,2));
datalines;
bvbvbsbvbvbvbvblb
dvdvdvlxvdvdgd
cdcdcdcdvdcdcdvcdcded
udvdvdvdevdvdvdvdvdvdvevdedvdv
dvdkdkdvdkdkdkudvkdkd
kdkvdkdkvdkdkvudkdkdukdvdkdkdkdv
dvkvwduvwdedkd
;
run;
Expected results:
cat   bv  dv  cd  ud  kd
####   6
####       4
####           8
####               1
####       3
####                   9
####       1
I'd prefer not to use proc transpose, as I can't loop through the string to create all the character pairs; I'd have to create them manually, and I have up to 500 characters per string. I would also like to search for 3- and 4-character patterns.
You can't do what you're asking to directly. You will either have to use the macro language, or use PROC TRANSPOSE. SAS doesn't let you reference data in the way you're trying to, because it has to have already constructed the variable names and such before it reads anything in.
I'll post a different solution that uses the macro language, but I suspect TRANSPOSE is the ultimate solution here; there's no practical reason this shouldn't work with your actual problem, and if you're having trouble with that it should be possible to help - post the do loop and what you're wanting, and we can of course help. Likely you just need to put the OUTPUT in the do loop.
data temp;
input cat $50.;
cat_val = substr(cat,1,2);
_var_ = count(cat,substr(cat,1,2));
output;
datalines;
bvbvbsbvbvbvbvblb
dvdvdvlxvdvdgd
cdcdcdcdvdcdcdvcdcded
udvdvdvdevdvdvdvdvdvdvevdedvdv
dvdkdkdvdkdkdkudvkdkd
kdkvdkdkvdkdkvudkdkdukdvdkdkdkdv
dvkvwduvwdedkd
;
run;
proc transpose data=temp out=temp_T(drop=_name_);
by cat notsorted; *or by some ID variable more likely;
id cat_val;
var _var_;
run;
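To extend this from the first pair to every two-character pair, the OUTPUT can be moved inside a DO loop, as suggested above. The following is a sketch under stated assumptions, not a tested production solution: the data set names pairs and pairs_t and the loop variable _i are made up for illustration, and only two of the sample strings are included to keep it short. Because the same pair can occur several times in one string, duplicates are removed before PROC TRANSPOSE, which otherwise rejects repeated ID values within a BY group.

```sas
/* Sketch: one output row per two-character pair, then pivot to columns. */
data pairs;
    input cat $50.;
    length cat_val $2;
    do _i = 1 to length(cat) - 1;
        cat_val = substr(cat, _i, 2);   /* pair starting at position _i  */
        _var_ = count(cat, cat_val);    /* occurrences of that pair      */
        output;                         /* one row per pair position     */
    end;
    drop _i;
datalines;
bvbvbsbvbvbvbvblb
dvdvdvlxvdvdgd
;
run;

/* Keep one row per string and pair; this also sorts for the BY statement. */
proc sort data=pairs nodupkey;
    by cat cat_val;
run;

proc transpose data=pairs out=pairs_t(drop=_name_);
    by cat;
    id cat_val;
    var _var_;
run;
```

For 3- or 4-character patterns, the same shape should apply with the pattern length changed in the LENGTH statement, the SUBSTR call, and the loop bound.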
Here's a solution that uses CALL EXECUTE rather than the macro language, as I decided that was actually a better solution. I wouldn't use this in production, but it hopefully shows the concept (in particular, I would not run a PROC DATASETS for each variable separately - I would concat all the renames into one string then run that at the end. I thought this better for showing how the process might work.)
This takes advantage of timing - namely, CALL EXECUTE happens after the data step terminates, so by that point you do know what variable maps to what data point. It does have to pass the data twice in order to drop the spurious variables, though if you either know the actual number of variables you want to have, or if you're okay with the excess variables hanging around, it would be okay to skip that, and PROC DATASETS doesn't actually open the whole dataset, so it would be quite fast (even the above with five calls is quite fast).
data temp;
input cat $50.;
array _catvars[50]; *arbitrary 50 chosen here - pick one big enough for your data;
array _catvarnames[50] $ _temporary_;
cat_val = substr(cat,1,2);
_iternum = whichc(cat_val, of _catvarnames[*]);
if _iternum=0 then do;
_iternum = whichc(' ',of _catvarnames[*]);
_catvarnames[_iternum]=cat_val;
call execute('proc datasets lib=work; modify temp; rename '||vname(_catvars[_iternum])||' = '||cat_val||'; quit;');
end;
_catvars[_iternum]= count(cat,substr(cat,1,2));
if _n_=7 then do; *this needs to actually be a test for end-of-file (so add `end=eof` to the set statement or infile), but you cannot do that in DATALINES so I hardcode the example.;
call execute('data temp; set temp; drop _catvars'||put(whichc(' ',of _catvarnames[*]),2. -l)||'-_catvars50;run;');
end;
datalines;
bvbvbsbvbvbvbvblb
dvdvdvlxvdvdgd
cdcdcdcdvdcdcdvcdcded
udvdvdvdevdvdvdvdvdvdvevdedvdv
dvdkdkdvdkdkdkudvkdkd
kdkvdkdkvdkdkvudkdkdukdvdkdkdkdv
dvkvwduvwdedkd
;
run;
Is it possible to loop through the records of a table to populate an html email without repeating the beginning and the end of the email?
With this example I get a mail with 5 tables of 1 row (because WORK.MyEmailTable is table of 5 records and set creates a loop in the data step):
data _null_;
file mymail;
set WORK.MyEmailTable;
put '<html><body><table>';
***loop through all records;
put '<tr>';
put %sysfunc(cats('<td>',var1,'</td>'));
put %sysfunc(cats('<td>',var2,'</td>'));
put %sysfunc(cats('<td>',var3,'</td>'));
put '</tr>';
put '</table></body></html>';
run;
And I'm looking to have 1 table of 5 rows.
I don't know if there is a way to avoid repeating the beginning and the end of the mail when you use set in the data step.
(Let me know if it's not clear I'll update.)
Thank you,
You can use the _n_ automatic datastep variable to let you know when you are on the first observation, and the set statement option end= to know that you are on the last observation:
data _null_;
file mymail;
set WORK.MyEmailTable end=eof;
if _n_ eq 1 then do;
put '<html><body><table>';
end;
/*loop through all records*/
put '<tr>';
put '<td>_n_=' _n_ ' eof=' eof ' ' var1 '</td>';
put '<td>_n_=' _n_ ' eof=' eof ' ' var2 '</td>';
put '<td>_n_=' _n_ ' eof=' eof ' ' var3 '</td>';
put '</tr>';
if eof then do;
put '</table></body></html>';
end;
run;
I've added the values _n_ and eof to the output so you can see clearly how they work.
Rob's method is pretty much the standard, but there is another option if you prefer scripting an explicit loop (which can be more comfortable for non-SAS programmers to read). This will function exactly like Rob's answer, and may well compile to the same machine code even.
data _null_;
file mymail;
put '<html><body><table>';
do _n_ = 1 by 1 until (eof);
/*loop through all records*/
set WORK.MyEmailTable end=eof;
put '<tr>';
put %sysfunc(cats('<td>',var1,'</td>'));
put %sysfunc(cats('<td>',var2,'</td>'));
put %sysfunc(cats('<td>',var3,'</td>'));
put '</tr>';
end;
put '</table></body></html>';
stop;
run;
_n_ here doesn't have any special meaning (like it does in Rob's answer); it's used by convention since this way it does effectively have the same meaning as it does normally.
You need to use the end=eof to create a variable eof which is true on the last record of the dataset; otherwise the data step will terminate prematurely (before actually hitting your final statement). You also need the stop to tell it to not go back to the start - otherwise it will, and will put a new starting section, then terminate instantly when it hits the set. (Try it and see.)
do _n_=1 by 1 until (eof); is a SAS-specific way of writing an incremental loop. It is similar to the C/C++ for (_n_=1; !eof; _n_++), except that UNTIL is evaluated at the bottom of each iteration, so the body always runs at least once. It gives you an auto-incremented do loop with a separate, unrelated stopping criterion.
data &state.&sheet.;
set di;
retain &header.;
infile in filevar= path end=done missover;
do until(done);
if _N_ =1 then
input &headerlength.;
input &allvar.;
output;
end;
run;
variable path is in di data set.
I want to read multiple txt files into one SAS data set. In each txt file the first row is a header, and I want to retain this header for each observation, so I used if _N_ = 1 to input the header and then input the remaining rows of variables for analysis.
The output is very strange: only the first row contains the header, and the other rows are not correct observations.
Could someone help me a little bit? Thank you so much.
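I like Shenglin Chen's answer, but here's another option: reset the row counter to 1 each time the data step starts importing a new file. In the original code _N_ counts iterations of the data step (one per row of di), not lines within a file, so the test is true for every line of the first file and never for later files.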
data &state.&sheet.;
set di;
retain &header.;
infile in filevar= path end=done missover;
do _N_ = 1 by 1 until(done);
if _N_ = 1 then input &headerlength.;
input &allvar.;
output;
end;
run;
This generalises more easily in case you ever want to do something different with every nth row within each file.
Try:
data &state.&sheet.;
set di;
retain &header.;
infile in filevar= path end=done missover dlm='09'x;
input &headerlength.;
do until(done);
input &allvar.;
output;
end;
run;
You should use WHILE (NOT DONE) instead of UNTIL (DONE) to prevent reading past the end of the file, and thereby stopping the data step, when the file is empty or, for some of the answers, when the file has only the header row.