I am trying to compute a column in SAS that depends on itself. For example, I have the following initial values:
ID Var_X Var_Y Var_Z
1 2 3 .
2 . 2 .
3 . . .
4 . . .
5 . . .
6 . . .
7 . . .
I need to fill up the blank spaces. The formulae are as follows:
Var_Z = 0.1 + 4*Var_x + 5*Var_Y
Var_X = lag1(Var_Z)
Var_Y = lag2(Var_Z)
As we can see, the values of Var_X, Var_Y and Var_Z are interdependent, so the computation needs to follow a specific order.
First we compute when ID = 1, Var_Z = 0.1 + 4*2 + 5*3 = 23.1
Next, when ID = 2, Var_X = lag1(Var_Z) = 23.1
Var_Y does not need computation at ID = 2 as we already have the initial value here. So, we have
ID Var_X Var_Y Var_Z
1 2 3 23.1
2 23.1 2 102.5 (= 0.1 + 4*23.1 +5*2)
3 . . .
4 . . .
5 . . .
6 . . .
7 . . .
We keep repeating this procedure until all values are calculated.
Is there a way SAS can handle this? I tried a DO loop, but I guess I did not code it right. It just stops after ID = 2.
I am new to SAS, so I don't know whether there is an easy way to handle this. I will wait for your suggestions.
You don't need to use LAG or RETAIN, if you're just doing this in a single data step. DO loop by itself will handle things nicely. RETAIN would only be needed if we were doing something involving a pre-existing data set, but there's really no reason to use one.
I'm using a shortcut here - while you describe VAR_Y in terms of VAR_Z, you really mean that after one iteration, VAR_Z moves to VAR_X and VAR_X moves to VAR_Y, so I do that (in the proper order to not mix things up).
data test_data;
if _n_ = 1 then do;
var_x=2;
var_y=3;
end;
do _iter = 1 to 7;
var_z = 0.1+4*var_x+5*var_y;
output;
var_y=var_x;
var_x=var_z;
end;
run;
proc print data=test_data;
run;
I believe you can do this within a DO loop - the key is making SAS remember the last values of your variables. My suggestion is to start from a simple "counter" program, something like:
data counter;
count = 0;
do i = 1 to 100;
count = count + 1;
end;
run;
and build up from there. I suspect your problem is that you're not using the RETAIN statement to carry values from one observation to the next. Check the SAS documentation for that and see if it fixes your problem.
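A minimal sketch of the RETAIN idea, assuming the initial values from the question are stored in a dataset called have (a hypothetical name) with columns ID, Var_X, Var_Y and Var_Z. The helper variables prev_z and prev_z2 are also made up for illustration; they carry the last two values of Var_Z forward, which plays the role of lag1 and lag2 in the question's formulas.

```sas
data want;
  set have;                       /* hypothetical input: ID, Var_X, Var_Y, Var_Z */
  retain prev_z prev_z2;          /* keep these across observations */
  /* fill in only the values that were not supplied as initial values */
  if missing(Var_X) then Var_X = prev_z;    /* lag1 of Var_Z */
  if missing(Var_Y) then Var_Y = prev_z2;   /* lag2 of Var_Z */
  Var_Z = 0.1 + 4*Var_X + 5*Var_Y;
  /* shift the history for the next observation */
  prev_z2 = prev_z;
  prev_z  = Var_Z;
  drop prev_z prev_z2;
run;
```

With the question's data this should give Var_Z = 23.1 at ID 1 and 102.5 at ID 2, matching the hand calculation above.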
Here is the sample data.
data faminc;
input famid faminc1-faminc12;
cards;
1 3281 3413 3114 2500 2700 . 3114 3319 3514 1282 2434 2818
2 4042 . . . . . 1531 2914 3819 4124 4274 4471
3 6015 . . . . . . . . . . .
;
run;
I would like to create an indicator variable called fam_indicator. If variables faminc2-faminc12 are all missing, then fam_indicator=1. Otherwise fam_indicator=0.
I tried the code below but it didn't work.
data fam;
set faminc;
if missing(faminc2-faminc12) then fam_indicator=1;
else fam_indicator=0;
run;
You can do this a bunch of different ways. If the variables are all numeric, then n will do it for you.
data fam;
set faminc;
if n(of faminc2-faminc12) eq 0 then fam_indicator=1;
else fam_indicator=0;
run;
cmiss and nmiss also could work; cmiss is generic regardless of type, while nmiss is only for numerics. They would count the number of missings, so you'd want if cmiss(of faminc2-faminc12) eq 11 or similar.
The other thing you needed was the of. n(faminc2-faminc12) would just subtract the one from the other. of says "the next thing here is a variable list" and it will then expand the list out.
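As a sketch of the cmiss variant mentioned above, using the faminc data from the question: faminc2-faminc12 spans 11 variables, so a count of 11 missing values means every one of them is missing. The comparison in parentheses evaluates to 1 or 0 in SAS, which gives the indicator directly.

```sas
data fam;
  set faminc;
  /* 11 variables in faminc2-faminc12; all 11 missing => indicator is 1 */
  fam_indicator = (cmiss(of faminc2-faminc12) = 11);
run;
```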
The nmiss function could be used directly. The sum function is another option, since the sum of values that are all missing is itself missing.
fam_indicator=ifn(sum(of faminc2-faminc12)=.,1,0);
In my dataset there are several observations (IDs) with all or too many variables missing. I want to know which IDs have no data (all variables are missing). I used PROC FREQ, but it only gives me the frequency of each variable, which does not serve my purpose. PROC MEANS with NMISS also gives me just the total number missing. I want to know exactly which IDs have missing variables. I searched online but couldn't find a solution to my problem. Help would be appreciated. Below is the sample data:
ID a b c d e
1 . 3 1 2 2
2 . . . . .
3 . . . . .
4 3 . 5 . .
I want the result to show the data for IDs whose information is completely missing, like:
ID a b c d e
2 . . . . .
3 . . . . .
Thanks
Thanks in advance
Use the nmiss function instead, which counts the number of missing values in the row for a specified list of variables. If you're looking at 3 variables, for example:
If nmiss(var1, var2, var3) =3;
Keep ID;
This will keep only records with all three variables missing.
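Applied to the sample data in the question, a minimal sketch might look like the following. The dataset names have and no_data are assumptions, and the KEEP statement is left out here so that all columns are shown, as in the requested output.

```sas
data no_data;
  set have;                      /* assumed input dataset name */
  /* a through e are the five variables; keep rows where all five are missing */
  if nmiss(a, b, c, d, e) = 5;
run;

proc print data=no_data;
run;
```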
The n function returns the number of non-missing numeric values in a list. This means you could use a variable list and not worry about counting the variables:
if n(of _numeric_) = 0 then output;
or
if n(of a--e) = 0 then output;
If you're checking character variables, there is no corresponding c function, but you could use the coalescec function to do something similar. The coalesce functions return the first non-missing value from a list of values. To select rows with all character values missing, use something like:
if missing(coalescec(of _character_)) then output;
I have a SAS dataset which looks like this:
Month Col1 Col2 Col3 Col4
200801 11 2 3 20
200802 5 9 4 10
. . . . .
. . . . .
. . . . .
201212 3 34 1 0
I want to create a dataset by shifting each row's Col1-Col4 values to the right, one more column per row, so that the result looks diagonally shifted.
Month Col1 Col2 Col3 Col4 Col5 Col6 Col7 . . . . . . . Coln
200801 11 2 3 20
200802 . 5 9 4 10
. . . . .
. . . . .
. . . . .
201212 . . . . . . . . . 3 34 1 0
Can someone suggest how I can do it?
Thanks!
First off, if you can avoid doing so, do. This is a pretty sparse way to store data, and will involve large datasets (definitely use OPTIONS COMPRESS at least), and usually can be worked around with good use of CLASS variables.
If you really must do this, PROC TRANSPOSE is your friend. While this is possible in the data step, it's less messy and more flexible in PROC TRANSPOSE.
First, make a totally vertical dataset (month+colname+colvalue):
data pre_t;
set have;
array cols col1-col4;
do _t = 1 to dim(cols);
colname = cats("col",((_N_-1) + _t)); *shifting here, edit this logic as needed;
value = cols[_t];
output;
end;
keep colname value month;
run;
In that data step, you are creating the eventual column name in colname and setting it up for transpose. If your data is not identical to the above (in particular, if it is grouped by something else), _N_ may not work and you may need some other logic (such as computing the difference from 200801) to calculate the col#.
Then, proc transpose:
proc transpose data=pre_t out=want;
by month;
id colname;
var value;
run;
And voilà, you should have what you were looking for. Make sure it's sorted properly in order to get the output in the expected order.
Consider the following example:
/* Create two not too interesting datasets: */
Data ones (keep = A);
Do i = 1 to 3;
A = 1;
output;
End;
run;
Data numbers;
Do B = 1 to 5;
output;
End;
Run;
/* The interesting step: */
Data together;
Set ones numbers;
if B = 2 then A = 2;
run;
So dataset ones contains one variable A with 3 observations, all ones and dataset numbers contains one variable (B) with 5 observations: the numbers 1 to 5.
I expect the resulting dataset together to have two columns (A and B) and the A column to read (vertically) 1, 1, 1, . , 2, . , . , .
However, when executing the code I find that column A reads 1, 1, 1, . , 2, 2, 2 , 2
Apparently the 2 created in the fifth observation is retained all the way down for no apparent reason. What is going on here?
(For the sake of completeness: when I split the last data step into two as below:
Data together;
set ones numbers;
run;
Data together;
set together;
if B = 2 then A = 2;
run;
it does do what I expect.)
Yes, any variable that is defined in a SET, MERGE, or UPDATE statement is automatically retained (not set to missing at the top of the data step loop). You can effectively ignore that with
output;
call missing(of <list of variables to clear out>);
run;
at the end of your data step.
This is how MERGE works for many-to-one merges, by the way, and the reason that many-to-many merges don't usually work the way you want them to.
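For the example in the question, a sketch of that pattern could look like this. Listing A and B explicitly in CALL MISSING is a choice made here for illustration; any list of the variables you want cleared would do.

```sas
data together;
  set ones numbers;
  if B = 2 then A = 2;
  output;               /* write the row before clearing anything */
  call missing(A, B);   /* reset so values are not carried into the next iteration */
run;
```

With this change, column A should read 1, 1, 1, ., 2, ., ., . as originally expected, because the 2 assigned when B = 2 is cleared before the next row is read.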
The difference between the one-step and the two-step versions is that in the one-step version, SET reads two data sets with different variables, while in the two-step version the second step reads a single data set that already contains both. If you are running this in interactive mode, i.e. SAS Program Editor or Enhanced Editor (not EG or batch mode), you can use the data step debugger to see this a little more clearly. For the one-step version you would see the following:
At the end of the last row of the ones dataset:
i A B
3 1 .
Notice B exists, but is missing. Then it goes back to the top of the data step loop. All three variables are left alone since they're all from the data sets. Then it attempts to read from ones one more time, which generates:
i A B
. . .
Then it realizes it cannot read from ones, and starts to read from numbers. At the end of the first row of the numbers dataset:
i A B
. . 1
Then it goes to the top, again changes nothing; then it reads in a 2 for B.
i A B
. . 2
Then it sets A to 2, per your program:
i A B
. 2 2
Then it returns to the start of the data step loop again.
i A B
. 2 2
Then it reads in B=3:
i A B
. 2 3
Then it continues looping, for B=4, 5.
Now, compare that to the single dataset. It will be nearly the same (with a small difference at the switch between datasets that does not yield a different result). Now we go to the step where A=2 B=2:
i A B
. 2 2
Now when the data step reads in the next row, it has all three variables on it. So it yields:
i A B
. . 3
Since it read in A=. from the row, it is setting it to missing. In the one-data-step version, it didn't have a value for A to read in, so it didn't replace the 2 with missing.
I know there are similar questions regarding serial numbers but my case is a little different.
I need to assign serial numbers based on the group variable. My data is already sorted by the group variable. The following data is just a part of the whole dataset. Basically, I want to create a "serial_num" variable that assigns serial numbers by group as shown below.
For example, when group = 1, each row has its own unique serial number. When group = 2, each serial number is shared by two rows. I hope you can see the pattern from the data below.
Thanks in advance.
serial_num group
----------------
1 1
2 1
. .
. .
. .
7 2
7 2
8 2
8 2
. .
. .
. .
10 3
10 3
10 3
11 3
11 3
11 3
. .
. .
. .
An odd requirement, but here's a solution using plain old data step.
data output;
set input;
by group;
if first.group or c = group then do;
c = 0;
serial_num + 1;
end;
c + 1;
drop c;
run;
A rough solution using IML. This is mainly to check with you whether it fits the pattern you want; if necessary, I can then expand it to accept data set input or improve it.
Note: y is the generated serial number vector.
proc iml;
x={1,1,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4};
y=j(nrow(x),1,.);
y[1,1]=1;
j=1;
do i=2 to nrow(y);
if y[i-x[i,1],1]=j then do;
j=j+1;
y[i,1]=j;
end;
else if x[i,1]^=x[i-1,1] then y[i,1]=y[i-1,1]+1;
else y[i,1]=y[i-1,1];
end;
print y;
quit;