I have a dataset where some SAS DATA step logic is
needed to populate columns that are missing or that must be derived from existing columns.
The dataset looks like this:
mpi v1 v2 v3......v9 v10 v11.....v50
001 a 1.324
002 c 0.876
003 f 11.9
004 r 5.7
005 b 3.3
. . .
. . .
n t 0.4
I actually developed the program below:
/*a*/
IF v2 IN ('a') AND 0 <= v11 <= 2 THEN DO;
v13 = 1;
v14 =20;
END;
IF v2 IN ('a') AND 2 < v11 <= 3.1 THEN DO;
v13 = 2;
v14 =40;
END;
IF v2 IN ('a') AND 3.1 < v11<= 5.3 THEN DO;
v13 = 3;
v14 =60;
END;
IF v2 IN ('a') AND 5.3 < v11 <= 11.5 THEN DO;
v13 = 4;
v14 =80;
END;
IF v2 IN ('a') AND v11 > 11.5 THEN DO;
v13 = 5;
v14 =100;
END;
I need to write the same program to populate v13 and v14 when v2 is c, f, t, r, etc., but with different bounds on v11 for each category (separate bounds for c, e, g, ...), while the v13 and v14 values stay the same across categories.
I would like to use a SAS macro to do this and avoid repeating the program. Can you help with this?
The best way to do this is to create a dataset with the values of v2,v11,v13,v14, and merge it on or otherwise combine it with your dataset.
Doing that is a little more complicated when you have a range for a value, but by no means impossible.
Let's say you have a dataset, with v2, v11min, v11max, v13, and v14.
data mergeon;
input v2 $ v11min v11max v13 v14;
datalines;
a 0 2 1 20
a 2 3.1 2 40
a 3.1 5.3 3 60
a 5.3 11.5 4 80
a 11.5 9999 5 100
c 0 4 1 20
c 4 8.1 2 40
c 8.1 9.6 3 60
c 9.6 13.5 4 80
c 13.5 9999 5 100
;;;;
run;
data have;
input mpi v2 $ v11 v13 v14;
datalines;
1 a 2 0 0
2 a 4 0 0
3 c 1 0 0
4 c 7 0 0
5 c 9 0 0
6 a 22 0 0
7 a 10 0 0
;;;;
run;
proc sql;
create table want as
select H.mpi, H.v2, H.v11, coalesce(M.v13,H.v13) as v13, coalesce(M.v14,H.v14) as v14
from have H
left join mergeon M
on H.v2=M.v2
and M.v11min < H.v11 <= M.v11max
;
quit;
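COALESCE returns the first nonmissing value, so H.v13 is kept only when M.v13 is missing (that is, when the join finds no matching record in the mergeon table). Note that the join condition uses a strict lower bound, so a v11 of exactly 0 matches no range here, unlike the 0 <= v11 <= 2 test in the original code; give the first range of each category a v11min below 0 if zero must be included.
If you aren't comfortable with SQL, there are other options; a hash table is probably the easiest, though you may also be able to use an UPDATE statement (I'm not as familiar with those myself).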
In this data, I need to select a subset in which each group contributes a certain percentage.
For example,
Obs Group Score
1 A 1
2 A 2
3 B 1
4 B 1
5 C 3
6 C 1
7 C 1
8 A 1
9 A 3
10 A 1
11 A 2
12 B 3
13 C 2
I need to select 10 observations.
The sample must include every group, and a score of 1 takes higher priority.
Each group is given a certain percentage.
Let's say 50% for A, 20% for B and 30% for C.
I tried using proc surveyselect but it failed: the number of alloc values is not the same as the number of strata.
proc surveyselect data=example out=test sampsize=10;
strata group score/alloc=(0.5 0.2 0.3);
run;
I don't know proc surveyselect well, so here is a data step version.
data have;
input Obs Group$ Score;
cards;
1 A 1
2 A 2
3 B 1
4 B 1
5 C 3
6 C 1
7 C 1
8 A 1
9 A 3
10 A 1
11 A 2
12 B 3
13 C 2
;
run;
proc sort data=have;
by Group Score;
run;
data want;
array _Dist_[3]$ _temporary_('A','B','C');
array _Upper_[3] _temporary_(5,2,3); /* 50%, 20% and 30% of 10 obs */
array _Count_[3] _temporary_;
do i = 1 to rec;
set have nobs=rec point=i;
do j = 1 to dim(_Dist_);
_Count_[j] + (Group=_Dist_[j]);
if _Count_[j] <= _Upper_[j] and Group = _Dist_[j] then output;
end;
end;
stop;
drop j;
run;
I am new to SAS and I'm trying to make a scatter plot of x versus the residual, but when I run the code this error appears:
ERROR: Procedure SQPLOT not found.
This is my code:
data EC
input x e;
datalines;
2 3.2
3 2.9
4 -1.7
5 -2.0
6 -2.3
7 -1.2
8 -0.9
9 0.8
10 0.7
11 0.5
;
run;
proc sqplot data = EC;
scatter x = x y=residual;
run;
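Could you help me find what is wrong?
There is no procedure named SQPLOT. You probably want to use SGPLOT. The DATA statement also needs a terminating semicolon, and the y variable in your dataset is named e, not residual.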
data EC;
input x e;
datalines;
2 3.2
3 2.9
4 -1.7
5 -2.0
6 -2.3
7 -1.2
8 -0.9
9 0.8
10 0.7
11 0.5
;
run;
proc sgplot data=EC;
scatter x = x y=e;
run;
In situations where your code tries to use a procedure that is not licensed (or not installed), the log will show a similar ERROR: message.
I have a dataset that looks like this
data test;
input id1$ id2$ score1 score2 score3 total;
datalines;
A D 9 36 6 51
A D 9 8 6 23
A E 5 3 2 10
B D 5 3 3 11
B E 7 4 7 18
B E 5 3 3 11
C D 8 7 9 24
C E 8 52 6 66
C D 4 5 3 12
;
run;
I want to add a column that gives each row's total as a fraction of the summed total within its id1 and id2 combination.
What I mean is this: id1 has a value of A. Within A, there are two id2 values, D and E. There are two rows with D and one with E. The two total values for D are 51 and 23, which sum to 74. The one total value for E is 10, which sums to 10. The column I'd like to create would hold the values .69 (51/74), .31 (23/74), and 1 (10/10) in rows 1, 2, and 3 respectively.
I need to perform this calculation for the rest of the id1 values and their corresponding id2 values. When complete, I want a table that looks like this:
id1 id2 score1 score2 score3 total percent_of_total
A D 9 36 6 51 0.689189189
A D 9 8 6 23 0.310810811
A E 5 3 2 10 1
B D 5 3 3 11 1
B E 7 4 7 18 0.620689655
B E 5 3 3 11 0.379310345
C D 8 7 9 24 0.666666667
C E 8 52 6 66 1
C D 4 5 3 12 0.333333333
I realize a loop might be able to solve the problem I've given, but I'm dealing with EIGHT levels of stratification, with as many as 98 sublevels within those levels. A loop is not practical. I'm thinking of something along the lines of PROC SUMMARY, but I'm not too familiar with that procedure.
Thank you.
It is easy to do with a data step. Make sure the records are sorted.
You can find the grand total for the ID1*ID2 combination and then use it to calculate the percentage.
proc sort data=test;
by id1 id2;
run;
data want ;
do until (last.id2);
set test ;
by id1 id2 ;
grand = sum(grand,total);
end;
do until (last.id2);
set test ;
by id1 id2 ;
percent_of_total = total/grand ;
output;
end;
run;
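The same calculation can also be sketched in PROC SQL, which avoids the explicit sort. This relies on SAS SQL remerging the group sum back onto the detail rows (the log prints a note about remerging summary statistics). The output name want_sql is made up for illustration, and the row order within each id1/id2 group is not guaranteed.

```sas
proc sql;
  /* sum(total) is computed per id1/id2 group and remerged onto every row */
  create table want_sql as
  select *,
         total / sum(total) as percent_of_total
  from test
  group by id1, id2;
quit;
```

Because the grouping is just a GROUP BY list, extending it to more stratification levels only means adding columns to that list.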
I need to find who has grades in the order A-B-C. Please see the table below for an example:
id term grade subj num
10 2002 D 332 1
10 2002 A 333 2
11 2005 C 232 1
11 2005 A 232 2
11 2005 B 232 3
11 2005 C 232 4
15 2010 A 130 1
15 2010 B 130 2
15 2010 C 130 3
20 2000 B 500 1
20 2000 A 500 2
20 2000 C 500 3
What I need from this table is ids 11 and 15.
The output should look like this:
id term subj
11 2005 232
15 2010 130
So I need to list the ids that had a grade of 'A', which then changed to 'B', and then changed to 'C'.
Num gives the order. The sequence doesn't have to start at num 1; it could start at 1, 2, 3, etc. But the grades must be in the order A, then B, then C.
I don't need to see id 20 because, in num order, its grades are not in the A-B-C sequence.
If all you are looking for is a simple 'A'-'B'-'C' sequence, then the LAG() function is sufficient. That is what I show in the example below. If you are looking for more sequences (e.g. 'A'-'B', 'B'-'C', 'A'-'B'-'C'-'D'), a slightly more complex solution is needed. If so, I'll edit the answer accordingly.
Below is a test program showing the implementation:
DATA d1;
INPUT
id :8.
term :8.
grade :$2.
subj :8.
num :8.
;
DATALINES;
10 2002 D 332 1
10 2002 A 333 2
11 2005 C 232 1
11 2005 A 232 2
11 2005 B 232 3
11 2005 C 232 4
15 2010 A 130 1
15 2010 B 130 2
15 2010 C 130 3
;
RUN;
DATA d2 (
KEEP = id term subj
);
SET d1;
grade_previous_1 = LAG1(grade);
grade_previous_2 = LAG2(grade);
IF (grade = 'C' AND grade_previous_1 = 'B' AND grade_previous_2 = 'A');
RUN;
Note that the LAG functions must be evaluated on their own lines and stored in variables, as shown above - don't fold them into the IF conditions or they won't always get executed. That is, don't say:
IF (grade = 'C' AND LAG1(grade) = 'B' AND LAG2(grade) = 'A');
That actually works in this example but in general it's better to get into the habit of calling LAG() outside of IF conditions and storing results in temporary variables.
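One caveat with this approach: LAG looks at the previous rows of the dataset regardless of id, so an A-B at the end of one id followed by a C at the start of the next id would be reported as a match. Here is a minimal sketch of one way to guard against that, assuming d1 is sorted by id and num. The names d2_by_id and id_previous_2 are made up for illustration.

```sas
DATA d2_by_id (KEEP = id term subj);
    SET d1;
    /* Call LAG unconditionally on every row so the queues stay in step */
    grade_previous_1 = LAG1(grade);
    grade_previous_2 = LAG2(grade);
    id_previous_2    = LAG2(id);
    /* With rows grouped by id, matching the id from two rows back
       means all three rows of the sequence belong to the same id */
    IF grade = 'C'
       AND grade_previous_1 = 'B'
       AND grade_previous_2 = 'A'
       AND id = id_previous_2;
RUN;
```

If term or subj must also be constant across the three rows, the same LAG2 comparison can be added for those variables.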
I have a data set that has a person's name and how many times they scored a 1-10. For example, Bob scored 7 1s, 8 2s, and 7 4s, but did not receive any other scores.
Name 1 2 3 4 5 6 7 8 9 10
Bob 7 8 7 0 0 0 0 0 0 0
Hal 9 3 1 0 0 0 0 0 0 0
I want a data set that has a row for Bob that looks like this
Bob 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 4 4 4 4 4 4 4
Hal 1 1 1 1 1 1 1 1 1 2 2 2 3
I'm doing this in SAS by the way.
I know I can write a macro to create variables named score1, score2, ..., scoreN.
I am having trouble populating the cells. Any help would be appreciated. Thanks.
Changing the structure of a dataset like this is sometimes easier to do with PROC TRANSPOSE:
data have;
input Name $ v1 v2 v3 v4 v5 v6 v7 v8 v9 v10;
datalines;
Bob 7 8 7 0 0 0 0 0 0 0
;
run;
/*convert original wide dataset into long one*/
proc transpose data=have out=have_long;
var v:;
by Name;
run;
data want;
set have_long;
substr(_NAME_,1,1)=""; *to get rid of first 'v' in variables' names;
do i=1 to COL1;
new_var=_NAME_;
output;
end;
drop _NAME_ COL1 i;
run;
/*convert back to wide dataset*/
proc transpose data=want out=want(drop=_NAME_);
var new_var;
by Name;
run;
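For comparison, here is a rough single-pass sketch using arrays instead of two transposes. It assumes the input columns are named v1-v10 as in the have dataset above, and that no person has more than 30 scores in total. That limit, the output name want_array and the variable names score1-score30 are assumptions made for illustration; a larger total would cause an array subscript error.

```sas
data want_array;
  set have;
  array v[10] v1-v10;   /* counts of each score value 1..10 */
  array score[30];      /* creates score1-score30; 30 is an assumed upper bound */
  k = 0;                /* position of the next output cell for this row */
  do s = 1 to dim(v);
    do n = 1 to v[s];   /* repeat the score value s as many times as it was counted */
      k = k + 1;
      score[k] = s;
    end;
  end;
  drop v1-v10 s n k;
run;
```

Unused cells at the end of each row are left missing, so rows with fewer scores simply have trailing missing values.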