I have a dataset similar to
data NATR332;
input Y1 Y2;
datalines;
146 141
141 143
135 139
142 139
140 140
143 141
138 138
137 140
142 142
136 138
;
run;
I used proc sql to compute the difference between Y1 and Y2 and to drop the rows where the difference is 0, using this code:
proc SQL;
/*create table temp as*/
select *,
Y1 - Y2 as Difference
from NATR332
where (Y1-Y2 ^= 0)
;
I now want to create a new column called rank that ranks the absolute values of the differences. I tried to use
rank () over partition in proc sql
and didn't have any luck, so I think I may have to use proc rank instead. How would I go about creating this column? I am much more familiar with SQL than with SAS, so I try to do most of my work in proc sql when using SAS.
Thank you in advance.
I would do the following:
data diffs;
set NATR332;
difference = abs(Y1-Y2);
if difference ne 0;
run;
proc rank data=diffs descending out=diffs_ranked;
var difference;
ranks ranking;
run;
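You now have a dataset called diffs_ranked with a variable called ranking that holds the ranks, ordered largest to smallest because of the descending option. Note that difference holds the absolute value here, so the sign of the original difference is lost.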
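If the sign of each difference matters later (for example, for a signed-rank calculation), a variant of the steps above could keep the signed difference and rank a separate absolute-value column. This is only a sketch: abs_diff is a name introduced here, ties=mean is spelled out even though it is the PROC RANK default, and the descending option is dropped so that the smallest absolute difference gets rank 1.

```sas
data diffs;
  set NATR332;
  difference = Y1 - Y2;          /* signed difference is kept */
  abs_diff   = abs(difference);  /* abs_diff is a hypothetical helper column */
  if difference ne 0;            /* drop rows where Y1 equals Y2 */
run;

proc rank data=diffs out=diffs_ranked ties=mean;
  var abs_diff;                  /* rank on the absolute value */
  ranks ranking;                 /* tied values share the mean of their ranks */
run;
```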
I have a dataset with some volumes in a column and I want to create a second column that contains the average of the previous three observations. Is this possible?
e.g.
data have;
input Vol Avg_pre_4;
datalines;
228 .
141 .
125 .
101 164.66
116 122.33
107 114
74 108
118 99
127 99.67
123 106.33
;
run;
The LAG function maintains a built-in queue of previous values: lagN(x) returns the value x had N calls earlier. In a data step that reads have:
VOL_AVG_OF_PRIOR3 = MEAN ( lag(Vol), lag2(Vol), lag3(Vol) );
if _n_ < 4 then VOL_AVG_OF_PRIOR3 = .;
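Put together as a complete step, that could look like the sketch below. It assumes the have dataset from the question; want is a made-up output name. The LAG calls run on every observation rather than inside an IF block, because each call updates its queue only when it actually executes.

```sas
data want;
  set have;
  /* lag1-lag3 return Vol from the previous 1, 2 and 3 observations */
  vol_avg_of_prior3 = mean(lag1(Vol), lag2(Vol), lag3(Vol));
  /* the first three rows do not have three prior values yet */
  if _n_ < 4 then vol_avg_of_prior3 = .;
run;
```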
I have a data set that has State, Corn, and Cotton. I want to create a new variable, Corn_Pct, in SAS (each state's corn output as a percentage of the country's total corn output), and the same for Cotton_Pct.
sample of data: (numbers are not real)
State Corn Cotton
TX 135 500
AK 120 350
...
Can anyone help?
You can do this with a simple PROC SQL step, which remerges the column totals back onto each row (SAS notes this in the log). Let the dataset be "Test":
Proc sql ;
create table test_percent as
select *,
Corn/sum(corn) as Corn_Pct format=percent7.1,
Cotton/sum(Cotton) as Cotton_Pct format=percent7.1
from test
;
quit;
If you have many columns, you can use arrays and DO loops in a data step to generate the percentages automatically.
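A minimal sketch of that array approach, assuming the same Test dataset: the totals dataset and the array names crop, total and pct are made up for illustration, and with more columns you would extend the three variable lists in matching order.

```sas
/* One-row dataset holding the column totals */
proc means data=test noprint;
  var corn cotton;
  output out=totals(keep=corn_sum cotton_sum) sum=corn_sum cotton_sum;
run;

data test_percent;
  if _n_ = 1 then set totals;    /* totals are read once and retained */
  set test;
  array crop{*}  corn cotton;
  array total{*} corn_sum cotton_sum;
  array pct{*}   corn_pct cotton_pct;
  do i = 1 to dim(crop);
    pct{i} = crop{i} / total{i}; /* share of the column total */
  end;
  format corn_pct cotton_pct percent7.1;
  drop i corn_sum cotton_sum;
run;
```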
I calculated the column totals in an inner query and then used those totals in the outer query via a cross join.
Try this:
/*My Dataset */
Data Test;
input State $ Corn Cotton ;
cards;
TK 135 500
AK 120 350
CK 100 250
FG 200 300
run;
/*Code*/
Proc sql;
create table test_percent as
Select a.*, (corn * 100/sm_corn) as Corn_pct, (Cotton * 100/sm_cotton) as Cotton_pct
from test a
cross join
(
select sum(corn) as sm_corn ,
sum(Cotton) as sm_cotton
from test
) b ;
quit;
/*My Output*/
State  Corn  Cotton  Corn_pct     Cotton_pct
TK     135   500     24.32432432  35.71428571
AK     120   350     21.62162162  25
CK     100   250     18.01801802  17.85714286
FG     200   300     36.03603604  21.42857143
Here is an alternative using PROC MEANS and a data step:
proc means data=test sum noprint;
var corn cotton;
output out=test2(keep=corn cotton) sum=corn cotton;
run;
data test_percent (drop=corn_sum cotton_sum);
set test2(rename=(corn=corn_sum cotton=cotton_sum) in=in1) test(in=in2);
if (in1=1) then do;
call symput('corn_sum',corn_sum);
call symput('cotton_sum',cotton_sum);
end;
else do;
Corn_pct = corn/symget('corn_sum');
Cotton_pct = cotton/symget('cotton_sum');
output;
end;
run;
I don't know where to start with this. I've tried listing the columns in every possible order but they are always listed horizontally. The dataset is:
data job2;
input year apply_count interviewed_count hired_count interviewed_mean hired_mean;
datalines;
2012 349 52 12 0.149 0.23077
2013 338 69 20 0.20414 0.28986
2014 354 70 18 0.19774 0.25714
;
run;
Here's an example of the proc report code for just one analysis variable:
proc report data = job2;
columns apply_count year;
define year / across " ";
define apply_count / analysis "Applied" format = comma8.;
run;
Ideally the final report would look like this:
         2012  2013  2014
Applied   349   338   354
Interv.    52    69    70
Hired      12    20    18
Inter %   15%   20%   20%
Hired %   23%   29%   26%
I don't know if this is the best way to do this.
data job2;
input year apply_count interviewed_count hired_count interviewed_mean hired_mean;
datalines;
2012 349 52 12 0.149 0.23077
2013 338 69 20 0.20414 0.28986
2014 354 70 18 0.19774 0.25714
;;;;
run;
proc transpose data=job2 out=job3;
by year;
run;
data job3;
set job3;
length y atype $8;
y = propcase(scan(_name_,1,'_'));
atype = scan(_name_,-1,'_');
if atype eq 'mean' then substr(y,8,1)='%';
run;
proc print;
run;
proc report data=job3 list;
columns atype y year, col1 dummy;
define atype / group noprint;
define y / group order=data ' ';
define year / across ' ';
define dummy / noprint;
define col1 / format=12. ' ';
compute before atype;
xatype = atype;
endcomp;
compute after atype;
line ' ';
endcomp;
compute col1;
if xatype eq 'mean' then do;
call define('_C3_','format','percent12.');
call define('_C4_','format','percent12.');
call define('_C5_','format','percent12.');
end;
endcomp;
run;
I am having trouble with how to compare two data sets in SAS, but one data set might have extra observations. I want to get rid of these extra observations and just compare the rest of the two data sets as they are. Let me give an example:
Data Set 1
ID Value1 Value2
105 1 A
105 2 B
105 3 C
*105 4 D
106 10 E
106 20 F
106 30 G
107 50 H
107 60 I
Data Set 2
ID Value1 Value2
105 1 A
105 2 B
105 3 C
106 10 E
106 20 F
106 30 G
107 50 H
107 60 I
Both data sets are equal except for the observation with ID=105, Value1=4 (marked with an asterisk for visual convenience) that is in Data Set 1, but not in Data Set 2.
I need to compare both data sets with these types of observations gone from my first data set and check if those observations are equal for ID and Value1. And yes, the ID value is repeated for some observations. They are not duplicates though as they have different "Value1" values associated with them.
Is there an easy way to do this?
data a1;
input ID value1 value2$;
datalines;
105 1 A
105 2 B
105 3 C
105 4 D
106 10 E
106 20 F
106 30 G
107 50 H
107 60 I
run;
data b1;
input ID value1 value2$;
datalines;
105 1 A
105 2 B
105 3 C
106 10 E
106 20 F
106 30 G
107 50 H
107 60 I
run;
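/* Build a combined key; the delimiter keeps e.g. ID=10, value1=51 distinct from ID=105, value1=1 */
data a2(rename=(value1=value1_a value2=value2_a));
set a1;
newID=catx('_', ID, value1);
run;
data b2(rename=(value1=value1_b value2=value2_b));
set b1;
newID=catx('_', ID, value1);
run;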
proc sort data=a2;
by newID;
run;
proc sort data=b2;
by newid;
run;
data c1;
merge a2(in=a) b2(in=b);
by newID;
from_a=a;
from_b=b;
run;
/**check out unmatched data records**/
data unmatched;;
set c1;
where from_a^=1 or from_b^=1;
run;
proc print data=unmatched;
run;
And here is the code for the matched records:
data matched;;
set c1;
where from_a=1 and from_b=1;
run;
proc print data=matched;
run;
Use PROC COMPARE with a BY or ID statement:
proc sort data=data1;
by id value1 value2;
run;
proc sort data=data2;
by id value1 value2;
run;
proc compare base=data1 compare=data2;
id id value1;
run;
This is documented under Comparing datasets with an ID variable:
http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#n14cxqy1h9hof4n1cq4xmhv2atgs.htm
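To drop the extra observations before comparing, as the question asks, one possible sketch is to keep only the ID/Value1 pairs present in both datasets and compare what remains. The names data1_sorted, data2_sorted and data1_common are made up here, and the sketch assumes each ID/Value1 pair occurs at most once per dataset.

```sas
proc sort data=data1 out=data1_sorted;
  by id value1;
run;

proc sort data=data2 out=data2_sorted;
  by id value1;
run;

/* Keep only rows of data1 whose (id, value1) key also exists in data2 */
data data1_common;
  merge data1_sorted(in=in1) data2_sorted(in=in2 keep=id value1);
  by id value1;
  if in1 and in2;
run;

proc compare base=data1_common compare=data2_sorted;
  id id value1;
run;
```

Only the key variables are read from data2 in the merge, so Value2 in data1_common still comes from data1 and any differences in it show up in the comparison.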
data have;
input ID Herpes;
datalines;
111 1
111 .
111 1
111 1
111 1
111 .
111 .
254 0
254 0
254 1
254 .
254 1
331 1
331 1
331 1
331 0
331 1
331 1
;
Where 1=Positive, 0=Negative, .=Missing/Not Indicated
Observations are sorted by ID (random numbers, no meaning) and date of visit (not included because not needed from here forward). Once you have Herpes, you always have Herpes. How do I adjust the Herpes variable (or create a new one) so that once a Positive is indicated (Herpes=1), all following obs will show Herpes=1 for that ID?
I want the resulting set to look like this:
111 1
111 1 (missing changed to 1)
111 1
111 1
111 1
111 1 (missing changed to 1)
111 1 (missing changed to 1)
254 0
254 0
254 1
254 1 (missing changed to 1 following positive at prior visit)
254 1
331 1
331 1
331 1
331 1 (patient-indicated negative/0 changed to 1 because of prior + visit)
331 1
331 1
The code below should do the trick. The key is to use by-group processing in conjunction with the retain statement.
proc sort data=have;
by id;
run;
data want;
set have;
by id;
retain uh_oh .;
if first.id then do;
uh_oh = .;
end;
if herpes then do;
uh_oh = 1;
end;
if uh_oh then do;
herpes = 1;
end;
drop uh_oh;
run;
You could create a new variable that sums the herpes flag within ID:
proc sort data=have;
by id;
run;
data have_too;
set have;
by id;
if first.id then sum_herpes_in_id = 0;
sum_herpes_in_id ++ herpes;
run;
That way it's always positive from the first time herpes=1 within id. You can access these observations in other datasteps / procs with where sum_herpes_in_id;.
And for free, you also have the total number of herpes flags per id (if that's of any use).
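As a usage sketch, the running sum can then be turned back into the adjusted flag. This assumes the have_too dataset built above; want is a made-up output name.

```sas
data want;
  set have_too;
  /* any positive running total means a prior or current positive visit */
  if sum_herpes_in_id then herpes = 1;
  drop sum_herpes_in_id;
run;
```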
This can also be done in SQL. Here is an example using UPDATE to update the table in place, where dtvar stands for the visit-date variable that is not shown in the sample data. (This could also be done in base SAS with MODIFY.)
proc sql undopolicy=none;
update have H
set herpes=1 where exists (
select 1 from have V
where h.id=v.id
and h.dtvar ge v.dtvar
and v.herpes=1
);
quit;
Here is the SAS version using MODIFY. BY doesn't work in a one-dataset MODIFY for some reason, so you have to do your own version of first.id.
data have;
modify have;
drop _:;
retain _t _i;
if _i ne id then _t=.;
_i=id;
_t = _t or herpes;
if _t then herpes=1;
run;