Generate new variable in one dataset using observation from another dataset in SAS - sas

I have two datasets, one with one observation and two variables. Other dataset with 10 observations, four variables.
Dataset 1
Final Result
X Fail
Dataset 2
A B C D Output
1 1 2 Pass
2 1 2 Pass
3 1 2 Pass
4 1 2 Fail
5 1 2 Pass
6 1 2 Fail
7 1 2 Pass
8 1 2 Fail
9 1 2 Pass
10 1 2 Pass
I would like to generate a fifth variable (output) in the second dataset depending on the value of the second variable in the first dataset.
If Result in first dataset equal to fail, generate a new variable output in the second dataset as fail. If Result in first dataset equal to pass, then generate a new variable output in the second dataset which will be equal to the value in column D of the second dataset.

Just use some simple IF/THEN logic. Since you know DATASET1 only has one observation then only read one observation from it.
data want;
if _n_=1 then set dataset1 ;
set dataset2 ;
length OUTPUT $4 ;
if RESULT='FAIL' then OUTPUT=RESULT;
else OUTPUT=D ;
run;

Related

SAS comparing data in a column

I'm very new to SAS and i'm trying to figure out my way around using it. I'm trying to figure out how to use the Compare procedure. Basically what I want to do is to see if the values in one column match the values in another column multiplied by 2 and count the number of mistakes. So if I have this data set:
a b
2 4
1 2
3 5
It should check whether b = 2 * a and tell me how many errors they are. I've been reading through the documentation for the compare procedure but like i said i'm very new and i can't seem to figure out how to check for this.
You could do if with PROC COMPARE but you still need to compute 2*a and you can't do that with PROC COMPARE. I would create a FLAG and summarize the FLAG. IFN function returns 1 for values that are NOT equal. PROC MEANS counts the 1's where mean is percent and sum is count of non-matching.
data comp;
input a b;
flag = ifn(b NE 2*a,1,0);
cards;
2 4
1 2
3 5
;;;;
run;
proc means n mean sum;
var flag;
run;
Proc compare compares values in two different datasets, whereas your variables are both in one dataset. The following may be simplest:
data matches errors;
set temp;
if b = 2 * a then output matches;
else output errors;
run;

Setting *most* variables to missing, while preserving the contents of a select few

I have a dataset like this (but with several hundred vars):
id q1 g7 q3 b2 zz gl az tre
1 1 2 1 1 1 2 1 1
2 2 3 3 2 2 2 1 1
3 1 2 3 3 2 1 3 3
4 3 1 2 2 3 2 1 1
5 2 1 2 2 1 2 3 3
6 3 1 1 2 2 1 3 3
I'd like to keep id, b2, and tre, but set everything else to missing. In a dataset this small, I can easily use call missing (q1, g7, q3, zz, gl, az) - but in a set with many more variables, I would effectively like to say call missing (of _ALL_ *except ID, b2, tre*).
Obviously, SAS can't read my mind. I've considered workarounds that involve another data step or proc sql where I copy the original variables to a new ds and merge them back on post, but I'm trying to find a more elegant solution.
This technique uses an un-executed set statement (compile time function only) to define all variables in the original data set. Keeps the order and all variable attributes type, labels, format etc. Basically setting all the variables to missing. The next SET statement which will execute brings in only the variables the are NOT to be set to missing. It doesn't explicitly set variables to missing but achieves the same result.
data nomiss;
input id q1 g7 q3 b2 zz gl az tre;
cards;
1 1 2 1 1 1 2 1 1
2 2 3 3 2 2 2 1 1
3 1 2 3 3 2 1 3 3
4 3 1 2 2 3 2 1 1
5 2 1 2 2 1 2 3 3
6 3 1 1 2 2 1 3 3
;;;;
run;
proc print;
run;
data manymiss;
if 0 then set nomiss;
set nomiss(keep=id b2 tre:);
run;
proc print;
run;
Another fairly simple option is to set them missing using a macro, and basic code writing techniques.
For example, let's say we have a macro:
%call_missing(var=);
call missing(&var.);
%mend call_missing;
Now we can write a query that uses dictionary.columns to identify the variables we want set to missing:
proc sql;
select name
from dictionary.columns
where libname='WORK' and memname='HAVE'
and not (name in ('ID','B2','TRE')); *note UPCASE for all these;
quit;
Now, we can combine these two things to get a macro variable containing code we want, and use that:
proc sql;
select cats('%call_missing(var=',name ,')')
into :misslist separated by ' '
from dictionary.columns
where libname='WORK' and memname='HAVE'
and not (name in ('ID','B2','TRE')); *note UPCASE for all these;
quit;
data want;
set have;
&misslist.;
run;
This has the advantage that it doesn't care about the variable types, nor the order. It has the disadvantage that it's somewhat more code, but it shouldn't be particularly long.
If the variables are all of the same type (numeric or character) then you could use an array.
data want ;
set have;
array _all_ _numeric_ ;
do over _all_;
if upcase(vname(_all_)) not in ('ID','B2') then _all_=.;
end;
run;
If you don't care about the order then just drop the variables and add them back on with 0 observations.
data want;
set have (keep=ID B2 TRE:) have (obs=0 drop=ID B2 TRE:);
run;

Sas value counting

I am wanting to count the number of time a certain value appears in a particular column in sas. For example in the following dataset the value 1 appears 3 times
value 2 appears twice, value 3 appears once, value 4 appears 4 times and value 5 appears four times.
Game_ball
1
1
1
2
2
3
4
4
4
5
5
5
5
5
I want the dataset to represented like the following:
Game_ball Count
1 3
2 2
3 1
4 4
5 4
. .
. .
. .
Thanks in advance
As per #Dwal, proc freq is the easiest solution.
Using your sample data,
proc freq data=sample;
table game_ball/out=output;
run;
Or do it in one-pass data step
proc sort data = sample;by game_ball;run;
data output;
set sample;
retain count;
if first.game_ball then count = 0;
count + 1;
if last.game_ball then output;
by game_ball;
run;
Or in SQL
proc sql;
create table output as
select game_ball, count(*) as count
from sample
group by game_ball;
quit;

Use of Lag /Lead function

Kindly refer the sample data. I have Month, Region and Values in my data set. I need an Ouput Column as mentioned below. Basically I need on the basis of Month by Values moved ahead. Kindly help.
Month Region Values Output
1 R1 2 3
1 R2 4 5
2 R1 3 4
2 R2 5 7
3 R1 4 6
3 R2 7 5
4 R1 6
4 R2 5
Thanks,
Gauraw
If I got it right, you want to assign as OUTCOME the value from the next month within each region. If so, then you can use two SET-statements, the second of which will add the same dataset, but shifted by one record (FIRSTOBS=2).
proc sort data=yourdata; by region month; run;
data result;
set yourdata;
by region;
do until(eof);
set yourdata(firstobs=2 keep=values rename=(values=outcome)) end=eof;
end;
if LAST.region then call missing(outcome);
run;
And we need to wrap SET into DO UNTIL loop, because otherwise we'll loose the last record of the dataset - the end of the second instance of the same dataset will be reached one record earlier and DATA step will stop.

how to solve the problem of selecting multiple rows

I have the data in this format- it is just an
example: n=2
X Y info
2 1 good
2 4 bad
3 2 good
4 1 bad
4 4 good
6 2 good
6 3 good
Now, the above data is in sorted manner (total 7 rows). I need to make a group of 2 , 3 or 4 rows separately and generate a graph. In the above data, I made a group of 2 rows. The third row is left alone as there is no other column in 3rd row to form a group. A group can be formed only within the same row. NOT with other rows.
Now, I will check if both the rows have “good” in the info column or not. If both rows have “good” – the group formed is also good , otherwise bad. In the above example, 3rd /last group is “good” group. Rest are all bad group. Once I’m done with all the rows, I will calculate the total no. of Good groups formed/Total no. of groups.
In the above example, the output will be: Total no. of good groups/Total no. of groups => 1/3.
This is the case of n=2(size of group)
Now, for n=3, we make group of 3 rows and for n=4, we make a group of 4 rows and find the good /bad groups in a similar way. If all the rows in a group has “good” block—the result is good block, otherwise bad.
Example: n= 3
2 1 good
2 4 bad
2 6 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
In the above case, I left the 4th row and last 2 rows as I can’t make group of 3 rows with them. The first group result is “bad” and last group result is “good”.
Output: 1/ 2
For n= 4:
2 1 good
2 4 good
2 6 good
2 7 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
6 4 good
6 5 good
In this case, I make a group of 4 and finds the result. The 5th,6th,7th,8th row are left behind or ignored. I made 2 groups of 4 rows and both are “good” blocks.
Output: 2/2
So, After getting 3 output values for n=2 , n-3, and n=4 I will plot a graph of these values.
Below is code that I think is getting what you are looking for. It assumes that the data that you described is stored separately in the three datasets named data_2, data_3, and data_4. Each of these datasets is processed by the %FIND_GOOD_GROUPS macro that determines which groups of X have all "GOOD" values in INFO, then this summary information is appended as a new row to the BASE dataset. I didn't add the code, but you could calculate the ratio of GOOD_COUNT to FREQ in a separate data step, then use a procedure to plot the N value and the ratio. Hope this gets close to what you're trying to accomplish.
%******************************************************************************;
%macro main;
%find_good_groups(dsn=data_2, n=2);
%find_good_groups(dsn=data_3, n=3);
%find_good_groups(dsn=data_4, n=4);
proc print data=base uniform noobs;
%mend main;
%******************************************************************************;
%******************************************************************************;
%macro find_good_groups(dsn=,n=);
%***************************************************************************;
%* Sort data by X and Y so that you can use FIRST.X variable in Data step. *;
%***************************************************************************;
proc sort data=&dsn;
by x y;
run;
%***************************************************************************;
%* TEMP dataset uses the FIRST.X variable to reset COUNT and GOOD_COUNT to *;
%* initial values for each row where X changes. Each row in the X groups *;
%* adds 1 to COUNT and sets GOOD_COUNT to 0 (zero) if INFO is ever "BAD". *;
%* A record is output if COUNT is equal to the macro parameter &N. *;
%***************************************************************************;
data temp;
keep good_count n;
retain count 0 good_count 1 n &n;
set &dsn;
by x y;
if first.x then do;
count = 0;
good_count = 1;
end;
count = count + 1;
if good_count eq 1 then do;
if trim(left(upcase(info))) eq "BAD" then do;
good_count = 0;
end;
end;
if count eq &n then output;
run;
%***************************************************************************;
%* Summarize the TEMP data to find the number of times that all of the *;
%* rows had "GOOD" in the INFO column for each value of X. *;
%***************************************************************************;
proc summary data=temp;
id n;
var good_count;
output out=n_&n (drop=_type_) sum=;
run;
%***************************************************************************;
%* Append to BASE dataset to retain the sums and frequencies from all of *;
%* the datasets. BASE can be used to plot the N / number of Good records. *;
%***************************************************************************;
proc append data=n_&n base=base force; run;
%mend find_good_groups;
%******************************************************************************;
%main