Say I have two data sets A and B that have identical variables and want to rank values in B based on values in A, not B itself (as "PROC RANK data=B" does.)
Here's a simplified example of data sets A, B and want (the desired output):
A:
obs_A VAR1 VAR2 VAR3
1 10 100 2000
2 20 300 1000
3 30 200 4000
4 40 500 3000
5 50 400 5000
B:
obs_B VAR1 VAR2 VAR3
1 15 150 2234
2 14 352 1555
3 36 251 1000
4 41 350 2011
5 60 553 5012
want:
obs VAR1 VAR2 VAR3
1 2 2 3
2 2 4 2
3 4 3 1
4 5 4 3
5 6 6 6
I come up with a macro loop that involves PROC RANK and PROC APPEND like below:
%macro MyRank(A,B);
data AB; set &A &B; run;
%do i=1 %to 5;
proc rank data=AB(where=(obs_A ne . OR obs_B=&i) out=tmp;
var VAR1-3;
run;
proc append base=want data=tmp(where=(obs_B=&i) rename=(obs_B=obs)); run;
%end;
%mend;
This is ok when the number of observations in B is small. But when it comes to very large number, it takes so long and thus wouldn't be a good solution.
Thanks in advance for suggestions.
I would create formats to do this. What you're really doing is defining ranges via A that you want to apply to B. Formats are very fast - here assuming "A" is relatively small, "B" can be as big as you like and it's always going to take just as long as it takes to read and write out the B dataset once, plus a couple read/writes of A.
First, reading in the A dataset:
data ranking_vals;
input obs_A VAR1 VAR2 VAR3;
datalines;
1 10 100 2000
2 20 300 1000
3 30 200 4000
4 40 500 3000
5 50 400 5000
;;;;
run;
Then transposing it to vertical, as this will be the easiest way to rank them (just plain old sorting, no need for proc rank).
data for_ranking;
set ranking_vals;
array var[3];
do _i = 1 to dim(var);
var_name = vname(var[_i]);
var_value = var[_i];
output;
end;
run;
proc sort data=for_ranking;
by var_name var_value;
run;
Then we create a format input dataset, and use the rank as the label. The range is (previous value -> current value), and label is the rank. I leave it to you how you want to handle ties.
data for_fmt;
set for_ranking;
by var_name var_value;
retain prev_value;
if first.var_name then do; *initialize things for a new varname;
rank=0;
prev_value=.;
hlo='l'; *first record has 'minimum' as starting point;
end;
rank+1;
fmtname=cats(var_name,'F');
start=prev_value;
end=var_value;
label=rank;
output;
if last.var_name then do; *For last record, some special stuff;
start=var_value;
end=.;
hlo='h';
label=rank+1;
output; * Output that 'high' record;
start=.;
end=.;
label=.;
hlo='o';
output; * And a "invalid" record, though this should never happen;
end;
prev_value=var_value; * Store the value for next row.;
run;
proc format cntlin=for_fmt;
quit;
And then we test it out.
data test_b;
input obs_B VAR1 VAR2 VAR3;
var1r=put(var1,var1f.);
var2r=put(var2,var2f.);
var3r=put(var3,var3f.);
datalines;
1 15 150 2234
2 14 352 1555
3 36 251 1000
4 41 350 2011
5 60 553 5012
;;;;
run;
One way that you can rank by a variable from a separate dataset is by using proc sql's correlated subqueries. Essentially you counts the number of lower values in the lookup dataset for each value in the data to be ranked.
proc sql;
create table want as
select
B.obs_B,
(
select count(distinct A.Var1) + 1
from A
where A.var1 <= B.var1.
) as var1
from B;
quit;
Which can be wrapped in a macro. Below, a macro loop is used to write each of the subqueries. It looks through the list of variable and parametrises the subquery as required.
%macro rankBy(
inScore /*Dataset containing data to be ranked*/,
inLookup /*Dataset containing data against which to rank*/,
varID /*Variable by which to identify an observation*/,
varsRank /*Space separated list of variable names to be ranked*/,
outData /*Output dataset name*/);
/* Rank variables in one dataset by identically named variables in another */
proc sql;
create table &outData. as
select
scr.&varID.
/* Loop through each variable to be ranked */
%do i = 1 %to %sysfunc(countw(&varsRank., %str( )));
/* Store the variable name in a macro variable */
%let var = %scan(&varsRank., &i., %str( ));
/* Rank: count all the rows with lower value in lookup */
, (
select count(distinct lkp&i..&var.) + 1
from &inLookup. as lkp&i.
where lkp&i..&var. <= scr.&var.
) as &var.
%end;
from &inScore. as scr;
quit;
%mend rankBy;
%rankBy(
inScore = B,
inLookup = A,
varID = obs_B,
varsRank = VAR1 VAR2 VAR3,
outData = want);
Regarding speed, this will be slow if your A is large, but should be okay for large B and small A.
In rough testing on a slow PC I saw:
A: 1e1 B: 1e6 time: ~1s
A: 1e2 B: 1e6 time: ~2s
A: 1e3 B: 1e6 time: ~5s
A: 1e1 B: 1e7 time: ~10s
A: 1e2 B: 1e7 time: ~12s
A: 1e4 B: 1e6 time: ~30s
Edit:
As Joe points out below the length of time the query takes depends not just on the number of observations in the dataset, but how many unique values exist within the data. Apparently SAS performs optimisations to reduce the comparisons to only the distinct values in B, thereby reducing the number of times the elements in A need to be counted. This means that if the dataset B contains a large number of unique values (in the ranking variables) the process will take significantly longer then the times shown. This is more likely to happen if your data is not integers as Joe demonstrates.
Edit:
Runtime test rig:
data A;
input obs_A VAR1 VAR2 VAR3;
datalines;
1 10 100 2000
2 20 300 1000
3 30 200 4000
4 40 500 3000
5 50 400 5000
;
run;
data B;
do obs_B = 1 to 1e7;
VAR1 = ceil(rand("uniform")* 60);
VAR2 = ceil(rand("uniform")* 500);
VAR3 = ceil(rand("uniform")* 6000);
output;
end;
run;
%let start = %sysfunc(time());
%rankBy(
inScore = B,
inLookup = A,
varID = obs_B,
varsRank = VAR1 VAR2 VAR3,
outData = want);
%let time = %sysfunc(putn(%sysevalf(%sysfunc(time()) - &start.), time12.2));
%put &time.;
Output:
0:00:12.41
Related
I need to perform the same operation on many different periods. In my sample data for two periods: 402 and 403.
I cannot understand the concept of how I can make a loop that will do it for me.
At the end, I'd like to have final1 for period 402, final2 for period 403 etc.
Sample data that I use for testing:
data one;
input period $ a $ b $ c $ d e;
cards;
402 a . a 1 3
402 . b . 2 4
402 a a a . 5
402 . . b 3 5
403 a a a . 6
403 a a a . 7
403 a a a 2 8
;
run;
This is how I manually choose one period of one data:
data new;
set one;
where period='402';
run;
This is how I calculate different things for the given period e.g. number of missing data, non-missing, total:
1 - For numeric variables:
proc iml;
use new;
read all var _NUM_ into x[colname=nNames];
n = countn(x,"col");
nmiss = countmiss(x,"col");
ntotal = n + nmiss;
2 - and similarly for char variables:
read all var _CHAR_ into x[colname=cNames];
close nww;
c = countn(x,"col");
cmiss = countmiss(x,"col");
ctotal = c + cmiss;
Save numeric and char results:
create cnt1Data var {nNames n nmiss ntotal};
append;
close cnt1Data;
create cnt2Data var {cNames c cmiss ctotal};
append;
close cnt2Data;
Rename columns to be the same:
data cnt1Datatemp;
set cnt1Data;
rename nNames = Name n = nonMissing nmiss = missing ntotal = total;
run;
data cnt2Datatemp;
set cnt2Data;
rename cNames = Name c = nonMissing cmiss = missing ctotal = total;
run;
and merge data into the final set:
data final;
set cnt1Datatemp cnt2Datatemp;
run;
Final data for period 402 should look like:
a b c d e
2 2 1 1 0 - missing
2 2 3 3 4 - non-missing
4 4 4 4 4 - total
and respectively for period 403:
a b c d e
0 0 0 2 0 - missing
3 3 3 1 3 - non-missing
3 3 3 3 3 - total
You can make something similar with simple SQL query.
create table miss_count as select period
, sum(missing(A)) as A
, sum(missing(B)) as B
...
from have
group by period
;
Results:
period a b c d e
402 2 2 1 1 0
403 0 0 0 2 0
It you add in
, count(*) as nobs
then you have all the information you need to calculate all of the counts you wanted.
If the number of variables is short enough you can even generate the code into a macro variable (limit of 64K bytes in a macro variable)
proc sql noprint;
select catx(' ','sum(missing(',nliteral(name),')) as',nliteral(name))
into :varlist separated by ','
from dictionary.columns
where libname='WORK' and memname='ONE' and lowcase(name) ne 'period'
;
create table miss_count as select period,count(*) as nobs,&varlist
from one
group by period
;
quit;
Results:
period nobs a b c d e
402 4 2 2 1 1 0
403 3 0 0 0 2 0
It is much easier to find this information in sql;
proc sql;
select sum(a is not missing) as fil_a
, sum(a is missing) as mis_a
, count(*) as tot_a
from one
where period eq 402;
quit;
You can even 0handle all periods at once using group by.
There are a few ways to make this work for all variables in a dataset (except for some group by variables). For instance:
%macro count_missing();
proc sql;
select count(*), name
into :no_var, :var_list separated by ' '
from sasHelp.vcolumn
where libName eq 'WORK' and memName eq 'ONE' and upcase(name) ne 'PERIOD';
create view count_missing as
select count(*) as total
%do var_nr = 1 %to &no_var;
%let var = %scan(&var_list, &var_nr);
, sum(&var is missing) as mis_&var
%end;
from work.one
group by period;
quit;
data report_missing;
set count_missing;
format count_of $32.;
count_of = 'missing';
%do var_nr = 1 %to &no_var;
%let var = %scan(&var_list, &var_nr);
&var = mis_&var;
%end;
output;
count_of = 'non missing';
%do var_nr = 1 %to &no_var;
%let var = %scan(&var_list, &var_nr);
&var = total - mis_&var;
%end;
output;
count_of = 'total';
%do var_nr = 1 %to &no_var;
%let var = %scan(&var_list, &var_nr);
&var = total;
%end;
output;
end;
%mend;
%count_missing();
You don't need iml to summarize data over observations. You can do that with a retain statement too. Moreover, using by processing with first and last, you can process all periods in one go.
data final;
set one;
by period;
if first.period then do;
mis_a = 0;
total = 0;
end;
retain mis_a;
if missing(a) then mis_a +=1; else fil_a += 1;
total += 1;
if last.period;
fil_a = total - mis_a;
end;
This is by far the fastest way to handle a big dataset if the data is sorted by period.
To make it work for a set of variables not known upfront, you can apply the same techniques as in my other solution.
I have a dataset with tree variables, three binary variables.
I wrote a proc tabulate
proc tabulate data=mydata;
class country var1 var2;
table Country, var1 var2;
run;
Var1 Var2
0 1 0 1
USA 40 50 40 50
AUS 50 20 50 20
IRE 60 40 60 40
DUB 70 50 70 50
Here I get the table with the totals of both var 1 var 2 for 0s and 1s.
However I want only the totals of 1s in this cross table. How can I do that.
If I use a where caluse as below, it shows only 1ns both..
proc tabulate data=mydata;
class country var1 var2;
table Country, var1 var2;
where var1=1 and var2=2;
run;
When I use the above it brings out only the 1s present in both at the sametime.
Which is not I am looking for.
So the dataset I want is as below.
Var1 Var2
1 1
USA 50 50
AUS 20 20
IRE 40 40
DUB 50 50
Is there any other way of doing this?
Change and to or.
Truth table for
Var1=1, Var2=1
Include?
Var1 Var2 AND OR
0 0 N N
0 1 N Y
1 0 N Y
1 1 Y Y
Since your variables are coded 0,1 you can ask for the SUM statistic to get the "count" of the number of ones.
proc tabulate data=mydata;
class country;
var var1 var2;
table Country, var1*sum var2*sum;
run;
My initial Dataset has 14000 STID variable with 10^5 observation for each.
I would like to make some procedures BY each stid, output the modification into data by STID and then set all STID together under each other into one big dataset WITHOUT a need to output all temporary STID-datsets.
I start writing a MACRO:
data HAVE;
input stid $ NumVar1 NumVar2;
datalines;
a 5 45
b 6 2
c 5 3
r 2 5
f 4 4
j 7 3
t 89 2
e 6 1
c 3 8
kl 1 6
h 2 3
f 5 41
vc 58 4
j 5 9
ude 7 3
fc 9 11
h 6 3
kl 3 65
b 1 4
g 4 4
;
run;
/* to save all distinct values of THE VARIABLE stid into macro variables
where &N_VAR - total number of distinct variable values */
proc sql;
select count(distinct stid)
into :N_VAR
from HAVE;
select distinct stid
into :stid1 - :stid%left(&N_VAR)
from HAVE;
quit;
%macro expand_by_stid;
/*STEP 1: create datasets by STID*/
%do i=1 %to &N_VAR.;
data stid&i;
set HAVE;
if stid="&&stid&i";
run;
/*STEP 2: from here data modifications for each STID-data (with procs and data steps, e.g.)*/
data modified_stid&i;
set stid&i;
NumVar1_trans=NumVar1**2;
NumVar2_trans=NumVar1*NumVar2;
run;
%end;
/*STEP 3: from here should be some code lines that set together all created datsets under one another and delete them afterwards*/
data total;
set %do n=1 %to &N_VAR.;
modified_stid&n;
%end;
run;
proc datasets library=usclim;
delete <ALL DATA SETS by SPID>;
run;
%mend expand_by_stid;
%expand_by_stid;
But the last step does not work. How can I do it?
You're very close - all you need to do is remove the semicolon in the macro loop and put it after the %end in step 3, as below:
data total;
set
%do n=1 %to &N_VAR.;
modified_stid&n
%end;;
run;
This then produces the statement you were after:
set modified_stid1 modified_stid2 .... ;
instead of what your macro was originally generating:
set modified_stid1; modified_stid2; ...;
Finally, you can delete all the temporary datasets using stid: in the delete statement:
proc datasets library=usclim;
delete stid: ;
run;
I have the following data where people in households are sorted by age (oldest to youngest):
data houses;
input HouseID PersonID Age;
datalines;
1 1 25
1 2 20
2 1 32
2 2 16
2 3 14
2 4 12
3 1 44
3 2 42
3 3 10
3 4 5
;
run;
I would like to calculate for each household the maximum age difference between consecutively aged people. So this example would give values of 5 (=25-20), 16 (=32-16) and 32 (=42-10) for households 1, 2 and 3 consecutively.
I could do this using lots of merges (i.e. extract person 1, merge with extract of person 2, and so on), but as there can be upto 20+ people in a household I'm looking for a much more direct method.
Here's a two pass solution. Same first step as the two solutions above, sort by age. In the second step keep track of max_diff per row, at the last record of HouseID output the results. This results in only two passes through the data.
proc sort data=houses; by houseid age;run;
data want;
set houses;
by houseID;
retain max_diff 0;
diff = dif1(age)*-1;
if first.HouseID then do;
diff = .; max_diff=.;
end;
if diff>max_diff then max_diff=diff;
if last.houseID then output;
keep houseID max_diff;
run;
proc sort data=houses; by houseid personid age;run;
data _t1;
set houses;
diff = dif1(age) * (-1);
if personid = 1 then diff = .;
run;
proc sql;
create table want as
select houseid, max(diff) as Max_Diff
from _t1
group by houseid;
proc sort data = house;
by houseid descending age;
run;
data house;
set house;
by houseid;
lag_age = lag1(age);
if first.houseid then age_diff = 0;
age_diff = lag_age - age;
run;
proc sql;
select houseid,max(age_diff) as max_age_diff
from house
group by houseid;
quit;
Working:
First sort the data set using houseid and descending Age.
Second data step will calculate difference between current age value (in PDV) and previous age value in PDV. Then, using sql procedure, we can get the max age difference for each houseid.
Just throwing one more into the mix. This one is a condensed version of Reeza's response.
/* No need to sort by PersonID as age is the only concern */
proc sort data = houses;
by HouseID Age;
run;
data want;
set houses;
by HouseID;
/* Keep the diff when a new row is loaded */
retain diff;
/* Only replace the diff if it is larger than previous */
diff = max(diff, abs(dif(Age)));
/* Reset diff for each new house */
if first.HouseID then diff = 0;
/* Only output the final diff for each house */
if last.HouseID;
keep HouseID diff;
run;
Here is an example using FIRST. and LAST. with one pass (after sort) through the data.
data houses;
input HouseID PersonID Age;
datalines;
1 1 25
1 2 20
2 1 32
2 2 16
2 3 14
2 4 12
3 1 44
3 2 42
3 3 10
3 4 5
;
run;
Proc sort data=HOUSES;
by houseid descending age ;
run;
Data WANT(keep=houseid max_diff);
format houseid max_diff;
retain max_diff age1 age2;
Set HOUSES;
by houseid descending age ;
if first.houseid and last.houseid then do;
max_diff=0;
output;
end;
else if first.houseid then do;
call missing(max_diff,age1,age2);
age1=age;
end;
else if not(first.houseid or last.houseid) then do;
age2=age;
temp=age1-age2;
if temp>max_diff then max_diff=temp;
age1=age;
end;
else if last.houseid then do;
age2=age;
temp=age1-age2;
if temp>max_diff then max_diff=temp;
output;
end;
Run;
Suppose a data are as follows:
A B C
1 3 2
1 4 9
2 6 0
2 7 3
where A B and C are the variable names.
Is there a way to transform the table to
A 1
A 1
A 2
A 2
B 3
B 4
B 6
B 7
C 2
C 9
C 0
C 3
Expanding on the advice from #donPablo, here's how you would code it. Create an array to read across the data, then output each iteration of that array so you end up with the number of rows being the rows * columns from the original dataset. The VNAME function enables you to store the variable name (A, B, C) as a value in a separate variable.
data have;
input A B C;
datalines;
1 3 2
1 4 9
2 6 0
2 7 3
;
run;
data want;
set have;
length var1 $10;
array vars{*} _numeric_;
do i=1 to dim(vars);
var1=vname(vars{i});
var2=vars{i};
keep var1 var2;
output;
end;
run;
proc sort data=want;
by var1;
run;
The least amount of (expensive) development time might be --
Read and store the first row
For each subsequent row
Read the row
Create three records
Until end
Sort
How many times will this be run? Per day/ per year?
What number of rows are there?
Might we save 1 hr / month? 1 min / year? Something will need to read the entire file. Optomize last. Make it work first.
tkx
It should work correctly:
DATA A(keep A);
new_var = 'A';
SET your_data;
RUN;
DATA B(keep B);
new_var = 'B';
SET your_data;
RUN;
DATA C(keep C);
new_var = 'C';
SET your_data;
RUN;
PROC APPEND base=A data=B FORCE;
RUN;
PROC APPEND base=A data=C FORCE;
RUN;
Data A is a result data set.