When reading an input file where one line contains more than one observation, we can use either a trailing '@' or '@@'.
When should we use one over the other?
Use the double trailing @@ when you want the pointer to remain in the same place for the next iteration of the data step. If you just want the pointer to remain in place for the next INPUT statement in the current iteration of the data step, a single trailing @ is enough.
Example reading one line with multiple iterations of the data step.
data want;
id+1;
input score @@;
cards;
10 20 30 45
;
Example reading from one line multiple times in the same iteration of the data step.
data want;
infile cards truncover ;
input id score @;
do rep=1 by 1 until (score=.);
output;
input score @;
end;
cards;
1 10 20 30 45
2 15 32
3 5 6 8 12 13 56
;
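As a further sketch of the double trailing @@, assume a hypothetical layout where each line carries several name/score pairs (the dataset name `pairs` and the variable names are made up for illustration):

```sas
data pairs;
  /* @@ holds the current line across iterations of the data step, */
  /* so each iteration picks up the next name/score pair.          */
  input name $ score @@;
  datalines;
Ann 10 Bob 20 Cy 30
Dee 45
;
run;
```

Each iteration reads one pair, and SAS only moves to the next line once the current one is used up, so the two data lines should give four observations.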
I understand how to use pointer control to search for a phrase in the raw data and then read the value into a SAS variable. I need to know how to tell SAS to stop reading the raw data when it encounters a particular phrase.
For example, in the code below I want to read the data only between the phrases Start and Stop, so Jelly should not be part of the output.
data work.two;
input @"1" Name :$32.;
datalines;
Start 1 Frank 6
1 Joan 2
3 Sui Stop
1 Jelly 4
5 Jose 3
;
run;
You cannot really combine those into a single pass through the file. The problem is that the @'1' will skip past the line with STOP in it, so there is no way your data step will see it.
Pre-process the file.
filename copy temp;
data _null_;
file copy ;
retain start 0 ;
input ;
if index(_infile_,'Start') then start=1;
if start then put _infile_;
if index(_infile_,'Stop') then stop;
datalines;
Start 1 Frank 6
1 Joan 2
3 Sui Stop
1 Jelly 4
5 Jose 3
;
data work.two;
infile copy ;
input @"1" Name :$32. @@;
run;
You can make the logic to detect what parts of the source file to include as complex as you need.
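As one possible refinement of the detection logic, here is a sketch that matches the markers as whole words and ignores case. The fileref `subset` is a made-up name, and using INDEXW with UPCASE is my own choice rather than something the answer prescribes:

```sas
filename subset temp;

data _null_;
  file subset;
  retain start 0;
  input;                                   /* load the raw line into _INFILE_ */
  /* INDEXW only matches whole words, so "Starting" would not count as Start */
  if indexw(upcase(_infile_), 'START') then start = 1;
  if start then put _infile_;              /* copy lines once Start was seen  */
  if indexw(upcase(_infile_), 'STOP') then stop;
  datalines;
Start 1 Frank 6
1 Joan 2
3 Sui Stop
1 Jelly 4
5 Jose 3
;
run;
```

The temporary file can then be read with a second data step, in the same way as the `copy` fileref above.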
All names are the second word from the right of each row, so the name can be extracted with the SCAN function; if 'Stop' appears in the row, stop the loop.
data work.two;
input @@;
Name=scan(_infile_,-2);
if indexw(_infile_,'Stop')>0 then stop;
input;
datalines;
Start 1 Frank 6
1 Joan 2
3 Sui Stop
1 Jelly 4
5 Jose 3
;
run;
I am trying to find a quick way to replace missing values with the average of the two nearest non-missing values. Example:
Id Amount
1 10
2 .
3 20
4 30
5 .
6 .
7 40
Desired output
Id Amount
1 10
2 **15**
3 20
4 30
5 **35**
6 **35**
7 40
Any suggestions? I tried using the retain function, but I can only figure out how to retain last non-missing value.
I think what you are looking for might be more like interpolation. While this is not the mean of the two closest values, it might be useful.
There is a nifty little tool for interpolating in datasets called proc expand. (It should do extrapolation as well, but I haven't tried that yet.) It's very handy when making series of dates and cumulative calculations.
data have;
input Id Amount;
datalines;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;
run;
proc expand data=have out=Expanded;
convert amount=amount_expanded / method=join;
id id; /*second is column name */
run;
For more on the proc expand see documentation: https://support.sas.com/documentation/onlinedoc/ets/132/expand.pdf
This works:
data have;
input id amount;
cards;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;
run;
proc sort data=have out=reversed;
by descending id;
run;
data retain_non_missing;
set reversed;
retain next_non_missing;
if amount ne . then next_non_missing = amount;
run;
proc sort data=retain_non_missing out=ordered;
by id;
run;
data final;
set ordered;
retain last_non_missing;
if amount ne . then last_non_missing = amount;
if amount = . then amount = (last_non_missing + next_non_missing) / 2;
run;
but as ever, will need extra error checking etc for production use.
The key idea is to sort the data into reverse order, allowing it to use RETAIN to carry the next_non_missing value back up the data set. When sorted back into the correct order, you then have enough information to interpolate the missing values.
There may well be a PROC to do this in a more controlled way (I don't know anything about PROC STANDARDIZE, mentioned in Reeza's comment) but this works as a data step solution.
Here's an alternative requiring no sorting. It does require IDs to be sequential, though that can be worked around if they're not.
What it does is uses two set statements, one that gets the main (and previous) amounts, and one that sets until the next amount is found. Here I use the sequence of id variables to guarantee it will be the right record, but you could write this differently if needed (keeping track of what loop you're on) if the id variables aren't sequential or in an order of any sort.
I use the first.amount check to make sure we don't try to execute the second set statement more than we should (which would terminate early).
You need to do two things differently if you want first/last rows treated differently. Here I assume prev_amount is 0 if it's the first row, and I assume next_amount is missing at the end, meaning the last row just gets the last prev_amount repeated, while the first row is averaged between 0 and the next_amount. You can treat either one differently if you choose; I don't know your data.
data have;
input Id Amount;
datalines;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;;;;
run;
data want;
set have;
by amount notsorted; *so we can tell if we have consecutive missings;
retain prev_amount; *next_amount is auto-retained;
if not missing(amount ) then prev_amount=amount;
else if _n_=1 then prev_amount=0; *or whatever you want to treat the first row as;
else if first.amount then do;
do until ((next_id > id and not missing(next_amount)) or (eof));
set have(rename=(id=next_id amount=next_amount)) end=eof;
end;
amount = mean(prev_amount,next_amount);
end;
else amount = mean(prev_amount,next_amount);
run;
For example I have test1.txt like this,
One Two Three
1 2 3
4 5 6
Test2.txt like this,
One Two Three
7 8 9
10 11 12
Test3.txt like this,
One Two Three
13 14 15
16 17 18
What's the best way to import them into a table in sas and create something like this,
One Two Three
1 2 3
4 5 6
7 8 9
10 11 12
13 14 15
16 17 18
Here is my original code,
data want;
infile "test*.txt" delimiter=" " firstobs=2;
input One Two Three;
run;
Only the header line in the first file is skipped by the FIRSTOBS= option. I could handle each file with repeated code, but that's obviously not good. I've also tried the EOV= variable to detect the start of a file, but I can't make it work. What's the best way to do it? Thanks in advance!
Use the EOV option on the INFILE statement. Here is one way.
data want;
infile "test*.txt" dsd dlm='09'X truncover eov=eov firstobs=2;
input @;
if eov then input;
input One Two Three;
eov=0;
run;
Use the FIRSTOBS=2 infile option to skip the header on the first file and then use the conditional input to skip the header for the others. You need to make sure the EOV flag gets properly set by pre-reading the line before testing EOV flag. You need to reset the EOV flag at the bottom of the data step.
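An alternative sketch, not taken from the answer above, tracks the FILENAME= variable instead of EOV= and drops the first line of every file. It assumes space-delimited files matching test*.txt in the current directory; `fname` and `prev` are names made up for this example:

```sas
data want;
  length fname prev $256;
  retain prev;
  /* FILENAME= puts the name of the file currently being read into fname */
  infile "test*.txt" truncover filename=fname;
  input @;                       /* load the line and hold it             */
  if fname ne prev then do;      /* first line of a new file: the header  */
    prev = fname;
    delete;                      /* skip it; the held line is released    */
  end;
  input One Two Three;           /* parse the held data line              */
  drop prev;
run;
```

Because every file is handled the same way, this version needs neither FIRSTOBS=2 nor a flag reset at the bottom of the step.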
I have the data as follows
id^number^obs
123^2^a~b
124^3^c~d~e
125^4^f~g~h~i
the first number is a unique id, the second number is the # of observations for the id, the rest of the line is the observations.
for the first line, the unique id is 123, it has 2 observations: they are a and b
I want read the data into SAS as
id number obs
123 2 a
123 2 b
124 3 c
124 3 d
124 3 e
125 4 f
125 4 g
125 4 h
125 4 i
My question is how I can do that in SAS?
Thanks a lot!
I'm assuming this is a question regarding reading in data from a flat-file and storing it in a SAS dataset. The following code will do that for you:
/* Insert filename */
filename myfile "";
/* This writes out a dataset called mydataset from the flat-file */
data mydataset;
infile myfile dlm='^' dsd firstobs=2;
input id number _obs $;
_i=1;
do until (scan(_obs,_i,'~') = '');
obs=scan(_obs,_i,'~');
_i+1;
drop _:; /* Remove this line to see all variables in final dataset */
output;
end;
run;
Explanation
The data-step reads in records from the flat-file, but before outputting to the dataset, it uses the scan function to separate the obs variable by '~', outputting a separate observation for each value.
As mentioned in the comment, you can remove the drop statement to further understand how the code is working.
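For comparison, here is a sketch of the same split with a counted loop instead of the DO UNTIL test. It reads the sample rows from in-stream data rather than a file, and the lengths given to `_obs` and `obs` are assumptions about how long the values can get:

```sas
data mydataset;
  infile datalines dlm='^' dsd firstobs=2 truncover;
  length _obs $200 obs $20;          /* assumed maximum lengths          */
  input id number _obs $;
  /* COUNTW gives the number of '~'-separated values in _obs */
  do _i = 1 to countw(_obs, '~');
    obs = scan(_obs, _i, '~');
    output;                          /* one observation per value        */
  end;
  drop _:;
  datalines;
id^number^obs
123^2^a~b
124^3^c~d~e
125^4^f~g~h~i
;
run;
```

Declaring the length of `_obs` up front avoids the default of 8 characters for list input, which would cut off longer lists of values.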
To my disappointment, the following code, which sums up 'value' by week from 'master' for weeks which appear in 'transaction' does not work -
data master;
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input change_week ;
datalines;
1
3
;
run;
data _null_;
set transaction;
do until(done);
set master end=done;
where week=change_week;
sum = sum(value, sum);
end;
file print;
put week= sum=;
run;
SAS complains, rightly, because it doesn't see 'change_week' in master and does not know how to operate on it.
Surely there must be a way of doing some operation on a subset of a master set (of course, suitably indexed), given a transaction dataset... Does any one know?
I believe this is the closest answer to what the asker has requested.
This method uses an index on week on the large dataset, allowing for the possibility of invalid week values in the transaction dataset, and without requiring either dataset to be sorted in any particular order. Performance will probably be better if the master dataset is in week order.
For small transaction datasets, this should perform quite a lot better than the other solutions as it only retrieves the required observations from the master dataset. If you're dealing with > ~30% of the records in the master dataset in a single transaction dataset, Quentin's method may sometimes perform better due to the overhead of using the index.
data master(index = (week));
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input week ;
datalines;
1
3
4
;
run;
data _null_;
set transaction;
file print;
do until(done);
set master key = week end=done;
/*Prevent implicit retain from previous row if the key isn't found,
or we've read past the last record for the current key*/
if _IORC_ ne 0 then do;
_ERROR_ = 0;
call missing(value);
end;
else sum = sum(value, sum);
end;
put week= sum=;
run;
N.B. for this to work, the indexed variable in the master dataset must have exactly the same name and type as the variable in the transaction dataset. Also, the index must be of the non-unique variety in order to accommodate multiple rows with the same key value.
Also, it is possible to replace the set master... statement with an equivalent modify master... statement if you want to apply transactional changes directly, i.e. without SAS making a massive temp file and replacing the original.
You are correct, there are many ways to do this in SAS. Your example is inefficient because (once we got it working) it would still require a full read of "master" for every line of "transaction".
(The reason you got the error was because you used where instead of if. In SAS, the sub-setting where in a data step is only aware of columns already existing within the data set it's sub-setting. SAS keeps both options because where is faster when it's usable.)
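A minimal sketch of how the original step could be made to run with a sub-setting if, assuming the master and transaction datasets from the question. Using POINT= and NOBS= is my own choice here, so that master can be re-read for every transaction row; `_p` and `n_master` are made-up names:

```sas
data _null_;
  set transaction;
  /* NOBS= is filled in at compile time, so n_master is usable here */
  do _p = 1 to n_master;
    set master point=_p nobs=n_master;   /* direct access by row number */
    if week = change_week then sum = sum(value, sum);
  end;
  file print;
  put change_week= sum=;
run;
```

With the sample data this should print sums of 60 for week 1 and 75 for week 3, but it still reads all of master once per transaction row, which is exactly the inefficiency described above.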
An alternative solution would be use proc sql. Hopefully this example is self-explanatory:
proc sql;
select
a.change_week,
sum(b.value) as value
from
transaction as a,
master as b
where a.change_week = b.week
group by change_week;
quit;
I don't suggest the solution below (I would like @Jeff's SQL solution or even a hash better). But just for playing with data step logic, I think the approach below would work, if you trust that every key in transaction will exist in master. It relies on the fact that both datasets are sorted, so it only makes one pass of each dataset.
On first iteration of the DATA step, it reads the first record from the transaction dataset, then keeps reading through the master dataset until it finds all the matching records for that key, then the DATA step loop iterates and it does it again for the next transaction record.
data _null_;
  set transaction;
  by change_week;

  do until(last.week and _found);
    set master;
    by week;

    if week=change_week then do;
      sum = sum(value, sum);
      _found=1;
    end;
  end;

  *file print;
  put week= sum= ;
run;
week=1 sum=60
week=3 sum=75