SAS: assign serial number based on the group

I know there are similar questions about serial numbers, but my case is a little different.
I need to assign a serial number based on a group variable. My data is already sorted by the group variable, and the data below is just part of the whole dataset. Basically, I want to create a "serial_num" variable that assigns serial numbers according to the group, as shown below.
For example, when group = 1, each row has its own unique serial number. When group = 2, each serial number is shared by two rows, and when group = 3, by three rows. I hope the pattern is clear from the data below.
Thanks in advance.
serial_num group
----------------
1 1
2 1
. .
. .
. .
7 2
7 2
8 2
8 2
. .
. .
. .
10 3
10 3
10 3
11 3
11 3
11 3
. .
. .
. .

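An odd requirement, but here's a solution using a plain old data step.
data output;
  set input;
  by group;
  /* Start a new serial number at the first row of each group, or once
     the current serial number has been used GROUP times. */
  if first.group or c = group then do;
    c = 0;
    serial_num + 1;  /* sum statement: serial_num is retained across rows */
  end;
  c + 1;             /* rows seen so far under the current serial number */
  drop c;
run;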
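To check the data step above against the pattern in the question, here is a small self-contained sketch. The data set names have and want and the twelve sample rows (two rows of group 1, four of group 2, six of group 3) are made up for illustration; the logic is unchanged.

```sas
/* Hypothetical sample input, already sorted by group */
data have;
  input group @@;
  datalines;
1 1 2 2 2 2 3 3 3 3 3 3
;
run;

data want;
  set have;
  by group;
  if first.group or c = group then do;
    c = 0;
    serial_num + 1;
  end;
  c + 1;
  drop c;
run;

proc print data=want noobs;
run;
```

With this input, serial_num should come out as 1 2 3 3 4 4 5 5 5 6 6 6, which matches the shape shown in the question.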

A rough solution using IML. This is mainly to check with you whether it fits the pattern you want; if necessary, I can expand it to accept data set input or otherwise improve it.
Note: y is the generated serial number vector.
proc iml;
x={1,1,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4};
y=j(nrow(x),1,.);
y[1,1]=1;
j=1;
do i=2 to nrow(y);
if y[i-x[i,1],1]=j then do;
j=j+1;
y[i,1]=j;
end;
else if x[i,1]^=x[i-1,1] then y[i,1]=y[i-1,1]+1;
else y[i,1]=y[i-1,1];
end;
print y;
quit;

Related

SAS: Replacing missing value with average of nearest neighbors

I am trying to find a quick way to replace missing values with the average of the two nearest non-missing values. Example:
Id Amount
1 10
2 .
3 20
4 30
5 .
6 .
7 40
Desired output
Id Amount
1 10
2 **15**
3 20
4 30
5 **35**
6 **35**
7 40
Any suggestions? I tried using the retain statement, but I can only figure out how to retain the last non-missing value.
I think what you are looking for might be more like interpolation. While this is not the mean of the two closest values, it might be useful.
There is a nifty little tool for interpolating in datasets called proc expand. (It should do extrapolation as well, but I haven't tried that yet.) It's very handy when making series of dates and cumulative calculations.
data have;
input Id Amount;
datalines;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;
run;
proc expand data=have out=Expanded;
convert amount=amount_expanded / method=join;
id id; /*second is column name */
run;
For more on proc expand, see the documentation: https://support.sas.com/documentation/onlinedoc/ets/132/expand.pdf
This works:
data have;
input id amount;
cards;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;
run;
proc sort data=have out=reversed;
by descending id;
run;
data retain_non_missing;
set reversed;
retain next_non_missing;
if amount ne . then next_non_missing = amount;
run;
proc sort data=retain_non_missing out=ordered;
by id;
run;
data final;
set ordered;
retain last_non_missing;
if amount ne . then last_non_missing = amount;
if amount = . then amount = (last_non_missing + next_non_missing) / 2;
run;
As ever, though, it will need extra error checking etc. for production use.
The key idea is to sort the data into reverse order, allowing it to use RETAIN to carry the next_non_missing value back up the data set. When sorted back into the correct order, you then have enough information to interpolate the missing values.
There may well be a PROC to do this in a more controlled way (I don't know anything about PROC STANDARDIZE, mentioned in Reeza's comment) but this works as a data step solution.
Here's an alternative requiring no sorting. It does require IDs to be sequential, though that can be worked around if they're not.
It uses two set statements: one that reads the main (and previous) amounts, and one that reads ahead until the next non-missing amount is found. Here I use the sequence of id values to guarantee it will be the right record, but you could write this differently if needed (keeping track of what loop you're on) if the ids aren't sequential or in any particular order.
I use the first.amount check to make sure we don't execute the second set statement more often than we should (which would terminate the data step early).
You need to do two things differently if you want the first/last rows treated differently. Here I assume prev_amount is 0 if the first row is missing, and I assume next_amount is missing once the end of the data is reached, meaning a missing last row just gets the last prev_amount repeated, while a missing first row is averaged between 0 and the next_amount. You can treat either one differently if you choose; I don't know your data.
data have;
input Id Amount;
datalines;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;;;;
run;
data want;
  set have;
  by amount notsorted; *so we can tell if we have consecutive missings;
  retain prev_amount; *next_amount is auto-retained;
  if not missing(amount) then prev_amount=amount;
  else do;
    if _n_=1 then prev_amount=0; *or whatever you want to treat the first row as;
    if first.amount then do;
      *read ahead to the next non-missing amount (or the end of the data);
      do until ((next_id > id and not missing(next_amount)) or (eof));
        set have(rename=(id=next_id amount=next_amount)) end=eof;
      end;
    end;
    amount = mean(prev_amount,next_amount);
  end;
run;

How to know ids with missing variable in SAS

In my dataset there are several observations (IDs) with all or too many variables missing. I want to know which IDs have no data (all variables are missing). I used proc freq, but it gives me only the frequency of variables, which does not serve my purpose. Proc means with nmiss also gives me just the total number missing. I want to know exactly which IDs have missing variables. I searched online but couldn't find a solution to my problem. Help would be appreciated. Below is the sample data:
ID a b c d e
1 . 3 1 2 2
2 . . . . .
3 . . . . .
4 3 . 5 . .
I want the result to show only the IDs whose information is completely missing, like this:
ID a b c d e
2 . . . . .
3 . . . . .
Thanks in advance
Use the nmiss function instead, which counts the number of missing values in the row for a specified list of variables. If you're looking at 3 variables, for example:
If nmiss(var1, var2, var3) =3;
Keep ID;
This will keep only records with all three variables missing.
The n function returns the number of non-missing numeric values in a list. This means you could use a variable list and not worry about counting the variables:
if n(of _numeric_) = 0 then output;
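Note that _numeric_ covers every numeric variable, so if ID is numeric (and never missing) that condition is never true. In that case, name the range of data variables instead: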
if n(of a--e) = 0 then output;
If you're checking character variables, there is no corresponding c function, but you could use the coalescec function to do something similar. The coalesce functions return the first non-missing value from a list of values. To select rows with all character values missing, use something like:
if missing(coalescec(of _character_)) then output;
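Putting the pieces together on the sample data from the question, as a sketch: the data set names have and all_missing are assumptions, and a--e relies on a through e being adjacent in the data set, as they are here.

```sas
data have;
  input ID a b c d e;
  datalines;
1 . 3 1 2 2
2 . . . . .
3 . . . . .
4 3 . 5 . .
;
run;

data all_missing;
  set have;
  /* keep only rows where none of a..e holds a non-missing value */
  if n(of a--e) = 0;
run;

proc print data=all_missing noobs;
run;
```

This should leave only IDs 2 and 3. ID is deliberately kept out of the variable list because it is never missing.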

SAS Merge with column filters

I have the two datasets below and need the third one as output.
ONE     TWO
---     ----------------
ID      ID   TAG   VALUE
1       1    Y     1000
2       2    N     2000
3

OUTPUT
----------------
ID   TAG   VALUE
1    Y     1000
2    .     .
3    .     .
The merge should happen only if TAG = 'Y' in the TWO dataset.
I also need all the rows from the ONE dataset.
Can this be done using SAS MERGE?
data output;
merge one (in=a)
two (in=b where=(tag = 'Y'));
by id;
if a;
run;
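For anyone who wants to try this, here is a self-contained sketch that builds the two sample datasets from the question and runs the same merge. The where= data set option filters TWO before the merge, and the in= flag on ONE keeps every row of ONE. Both inputs are assumed to be sorted by id.

```sas
data one;
  input id;
  datalines;
1
2
3
;
run;

data two;
  input id tag $ value;
  datalines;
1 Y 1000
2 N 2000
;
run;

data output;
  merge one (in=a)
        two (where=(tag = 'Y'));  /* only TAG = 'Y' rows take part */
  by id;
  if a;                           /* keep every id present in ONE */
run;

proc print data=output noobs;
run;
```

Only id 1 should pick up a tag and value. Since tag is a character variable, its missing value prints as a blank instead of a dot.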

Generating Interdependent Data in SAS

I am trying to compute a column in SAS, that has dependency on itself. For example, I have the following list of initial values
ID Var_X Var_Y Var_Z
1 2 3 .
2 . 2 .
3 . . .
4 . . .
5 . . .
6 . . .
7 . . .
I need to fill up the blank spaces. The formulae are as follows:
Var_Z = 0.1 + 4*Var_x + 5*Var_Y
Var_X = lag1(Var_Z)
Var_Y = lag2(Var_Z)
As we can see, the values of Var_X, Var_Y and Var_Z are interdependent, so the computation needs to follow a specific order.
First we compute when ID = 1, Var_Z = 0.1 + 4*2 + 5*3 = 23.1
Next, when ID = 2, Var_X = lag1(Var_Z) = 23.1
Var_Y does not need computation at ID = 2 as we already have the initial value here. So, we have
ID Var_X Var_Y Var_Z
1 2 3 23.1
2 23.1 2 102.5 (= 0.1 + 4*23.1 +5*2)
3 . . .
4 . . .
5 . . .
6 . . .
7 . . .
We keep repeating this procedure until all values are calculated.
Is there a way, SAS can handle this? I tried DO loop, but I guess I did not do a good job coding it right. It just stops after ID = 2.
I am new at SAS so not familiar if there is a way SAS can handle this easily. Will wait for your suggestions.
You don't need to use LAG or RETAIN, if you're just doing this in a single data step. DO loop by itself will handle things nicely. RETAIN would only be needed if we were doing something involving a pre-existing data set, but there's really no reason to use one.
I'm using a shortcut here - while you describe VAR_Y in terms of VAR_Z, you really mean that after one iteration, VAR_Z moves to VAR_X and VAR_X moves to VAR_Y, so I do that (in the proper order to not mix things up).
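data test_data;
  /* initial values for the first row */
  if _n_ = 1 then do;
    var_x=2;
    var_y=3;
  end;
  do _iter = 1 to 7;
    var_z = 0.1+4*var_x+5*var_y;
    output;
    /* shift for the next row: old X becomes Y, then the new Z becomes X */
    var_y=var_x;
    var_x=var_z;
  end;
run;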
proc print data=test_data;
run;
I believe you can do this within a DO loop; the key is making SAS remember the last values of your variables. My suggestion is to poke around a bit with a simple "counter" program, something like:
data counter;
  count = 0;
  do i = 1 to 100;
    count = count + 1; * running total carried from one iteration to the next;
  end;
run;
and see how the values carry over. I suspect your problem is that you're not using the retain statement to carry values from one row to the next. Check the SAS documentation for that and see if it fixes your problem.
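If the initial values already sit in a data set, as in the question, a RETAIN-based version might look like the sketch below. The data set names init and filled and the helper variables z1 and z2 are hypothetical; z1 and z2 simply carry Var_Z from one and two rows back, which is what lag1 and lag2 describe in the question.

```sas
data init;
  input ID Var_X Var_Y;
  datalines;
1 2 3
2 . 2
3 . .
4 . .
5 . .
6 . .
7 . .
;
run;

data filled;
  set init;
  retain z1 z2;   /* z1 = Var_Z one row back, z2 = Var_Z two rows back */
  /* only fill in values that were not supplied as initial values */
  if missing(Var_X) then Var_X = z1;
  if missing(Var_Y) then Var_Y = z2;
  Var_Z = 0.1 + 4*Var_X + 5*Var_Y;
  /* shift the history for the next row */
  z2 = z1;
  z1 = Var_Z;
  drop z1 z2;
run;

proc print data=filled noobs;
run;
```

The first two rows should give Var_Z = 23.1 and 102.5, matching the hand calculation in the question.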

how to solve the problem of selecting multiple rows

I have data in this format (it is just an example, with n=2):
X Y info
2 1 good
2 4 bad
3 2 good
4 1 bad
4 4 good
6 2 good
6 3 good
The data above is sorted (7 rows in total). I need to form groups of 2, 3 or 4 rows, handling each group size separately, and generate a graph. In the data above, I formed groups of 2 rows. The third row is left alone because there is no other row with the same X value to form a group with. A group can be formed only from rows with the same X value, not across different X values.
Next, I check whether both rows have “good” in the info column. If both rows have “good”, the group is good; otherwise it is bad. In the example above, the third (last) group is a “good” group and the rest are bad groups. Once I’m done with all the rows, I calculate the number of good groups divided by the total number of groups.
In the example above, the output will be: Total no. of good groups/Total no. of groups => 1/3.
This is the case of n=2 (the size of the group).
For n=3 we make groups of 3 rows, and for n=4 groups of 4 rows, and find the good/bad groups in the same way. If all the rows in a group have “good”, the group is good; otherwise it is bad.
Example: n= 3
2 1 good
2 4 bad
2 6 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
In the above case, I left the 4th row and last 2 rows as I can’t make group of 3 rows with them. The first group result is “bad” and last group result is “good”.
Output: 1/ 2
For n= 4:
2 1 good
2 4 good
2 6 good
2 7 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
6 4 good
6 5 good
In this case, I make groups of 4 and find the result. The 5th, 6th, 7th and 8th rows are left behind (ignored). I made 2 groups of 4 rows and both are “good” groups.
Output: 2/2
So, after getting the 3 output values for n=2, n=3 and n=4, I will plot a graph of these values.
Below is code that I think does what you are looking for. It assumes that the data you described is stored in three separate datasets named data_2, data_3, and data_4. Each dataset is processed by the %FIND_GOOD_GROUPS macro, which determines which groups of X have all "GOOD" values in INFO and then appends this summary as a new row to the BASE dataset. I didn't add the code, but you could calculate the ratio of GOOD_COUNT to _FREQ_ in a separate data step, then use a procedure to plot the N value against the ratio. Hope this gets close to what you're trying to accomplish.
%******************************************************************************;
%macro main;
%find_good_groups(dsn=data_2, n=2);
%find_good_groups(dsn=data_3, n=3);
%find_good_groups(dsn=data_4, n=4);
proc print data=base uniform noobs;
run;
%mend main;
%******************************************************************************;
%******************************************************************************;
%macro find_good_groups(dsn=,n=);
%***************************************************************************;
%* Sort data by X and Y so that you can use FIRST.X variable in Data step. *;
%***************************************************************************;
proc sort data=&dsn;
by x y;
run;
%***************************************************************************;
%* TEMP dataset uses the FIRST.X variable to reset COUNT and GOOD_COUNT to *;
%* initial values for each row where X changes. Each row in the X groups *;
%* adds 1 to COUNT and sets GOOD_COUNT to 0 (zero) if INFO is ever "BAD". *;
%* A record is output if COUNT is equal to the macro parameter &N. *;
%***************************************************************************;
data temp;
keep good_count n;
retain count 0 good_count 1 n &n;
set &dsn;
by x y;
if first.x then do;
count = 0;
good_count = 1;
end;
count = count + 1;
if good_count eq 1 then do;
if trim(left(upcase(info))) eq "BAD" then do;
good_count = 0;
end;
end;
if count eq &n then output;
run;
%***************************************************************************;
%* Summarize the TEMP data to find the number of times that all of the *;
%* rows had "GOOD" in the INFO column for each value of X. *;
%***************************************************************************;
proc summary data=temp;
id n;
var good_count;
output out=n_&n (drop=_type_) sum=;
run;
%***************************************************************************;
%* Append to BASE dataset to retain the sums and frequencies from all of *;
%* the datasets. BASE can be used to plot the N / number of Good records. *;
%***************************************************************************;
proc append data=n_&n base=base force; run;
%mend find_good_groups;
%******************************************************************************;
%main
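The ratio and plot step that the answer leaves out could be sketched as follows. To keep the example self-contained, BASE is recreated here by hand from the counts worked out in the question (1/3, 1/2 and 2/2); in practice you would skip that first step and use the BASE produced by the macro. The data set name ratios and the choice of PROC SGPLOT are assumptions, not part of the original answer.

```sas
/* Stand-in for the BASE data set built by %find_good_groups */
data base;
  input n _freq_ good_count;
  datalines;
2 3 1
3 2 1
4 2 2
;
run;

data ratios;
  set base;
  /* share of groups in which every row was "good" */
  ratio = good_count / _freq_;
run;

proc sgplot data=ratios;
  series x=n y=ratio / markers;
run;
```

Note that PROC APPEND keeps adding rows to BASE on every run of the macro, so you may want to delete BASE before rerunning.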