I'm working with some SAS data and am trying to figure out how to find a record's sort position in a data step while using as few steps as possible.
Here's an example --
data Places;
infile datalines delimiter=',';
input state $ city $40. ;
datalines;
WA,Seattle
OR,Portland
OR,Salem
OR,Tillamook
WA,Vancouver
;
Proc Sort data=WORK.PLACES;
by STATE CITY;
run;
data WORK.PLACES;
set WORK.PLACES;
by STATE CITY;
ST_CITY_RNK = _N_;
run;
Proc Sort data=WORK.PLACES;
by CITY;
run;
data WORK.PLACES;
set WORK.PLACES;
by CITY;
CITY_RNK = _N_;
run;
In this example, is there a way to calculate ST_CITY_RNK and CITY_RNK without sorting multiple times? It feels like this should be possible with ordered hash tables, but I'm not sure how to go about doing it.
Thank you!
A hash table would be doable. Temporary arrays would have roughly the same effect and might be a bit easier.
The major limitation of either approach is what to do with non-unique city names: Salem, Oregon and Salem, Massachusetts, for example. Obviously in the state-city rank that's fine, though you may find states with more than one Lincoln or similar, who knows; but in just city you'll certainly find several Columbias, Lincolns, Charlestons, etc. My solution gives the same sort rank to all of them (but would then skip forward 6 or whatever to the next one). The data step solution you post above would give them unique ranks. The hash iterator could probably do either one. You could tweak this with some effort to give unique ranks, but it would be work.
data Places;
infile datalines delimiter=',';
input state $ city $40. ;
datalines;
WA,Seattle
OR,Portland
OR,Salem
OR,Tillamook
WA,Vancouver
;
run;
data sortrank;
*Init two pairs of arrays - one of each pair keeps the original values, the other gets mangled by sorting;
array states[32767] $ _temporary_;
array states_cities_sorted[32767] $43. _temporary_ (32767*'ZZZZZ'); *$43 fits a 2-char state, the comma, and a 40-char city;
array cities[32767] $40. _temporary_;
array cities_sorted[32767] $40. _temporary_ (32767*'ZZZZZ');
*Iterate over the dataset, load into arrays;
do _n_ = 1 by 1 until (Eof);
set places end=eof;
states[_n_] = state;
states_cities_sorted[_n_] = catx(',',state,city);
cities[_n_] = city;
cities_sorted[_n_] = city;
end;
*Sort the to-be-sorted arrays;
call sortc(of states_cities_sorted[*]);
call sortc(of cities_sorted[*]);
do _i = 1 to _n_;
*For each array element, look up the rank using `whichc`, looking for the value of the unsorted element in the sorted list;
city_rank = whichc(cities[_i],of cities_sorted[*]);
state_cities_rank = whichc(catx(',',states[_i],cities[_i]),of states_cities_sorted[*]);
*And put the array elements back in their proper variables;
city = cities[_i];
state= states[_i];
*And finally make a row output;
output;
end;
drop _i;
run;
For reference, here's a hash approach:
data Places;
infile datalines delimiter=',';
input state $ city $40. ;
datalines;
WA,Seattle
OR,Portland
OR,Salem
OR,Tillamook
WA,Vancouver
;
run;
data places;
set places;
if _n_ = 1 then do;
declare hash h1(ordered:'a',dataset:'places');
rc = h1.definekey('city');
rc = h1.definedata('city');
rc = h1.definedone();
declare hiter hi1('h1');
declare hash h2(ordered:'a',dataset:'places');
rc = h2.definekey('state','city');
rc = h2.definedata('state','city');
rc = h2.definedone();
declare hiter hi2('h2');
end;
t_city = city;
t_state = state;
*walk each ordered hash from the top - the number of items visited is the rank;
rc = hi1.first();
city_rank = 1;
do while (t_city ne city);
rc = hi1.next();
city_rank + 1;
end;
rc = hi2.first();
state_city_rank = 1;
do while (not (t_city = city and t_state = state));
rc = hi2.next();
state_city_rank + 1;
end;
state = t_state;
city = t_city;
drop t_: rc;
run;
I can't find a way to summarize the same variable using different weights.
I'll try to explain it with an example (of 3 records):
data pippo;
a=10;
wgt1=0.5;
wgt2=1;
wgt3=0;
output;
a=3;
wgt1=0;
wgt2=0;
wgt3=1;
output;
a=8.9;
wgt1=1.2;
wgt2=0.3;
wgt3=0.1;
output;
run;
I tried the following:
proc summary data=pippo missing nway;
var a /weight=wgt1;
var a /weight=wgt2;
var a /weight=wgt3;
output out=pluto (drop=_freq_ _type_) sum()=;
run;
Obviously it gives me a warning because I used the same variable "a" (I can't rename it!).
I have to save a huge amount of data with not much physical space, and I would otherwise have to construct something like 120 fields (a0-a6, b0-b6, etc.) that are the same variables, just each with a fixed weight applied (wgt0-wgt5).
I want to store a dataset with 20 columns (a, b, c, ...) and 6 weights (wgt0-wgt5) and, on demand, run a "summary" without an intermediate data step that obliges me to create 120 fields.
Due to the huge amount of data (more or less 55 GB every month) I'd also like to avoid a PROC SQL statement like:
proc sql;
create table pluto
as select sum(db.a * wgt0) as a0, sum(db.a * wgt1) as a1, etc.
quit;
Is there a "super PROC SUMMARY" that can summarize the same field with different weights?
Thanks in advance,
Paolo
I think there are a few options. One is the data step view that data_null_ mentions. Another is just running PROC SUMMARY however many times you have weights, and either using ODS OUTPUT with the PERSIST=PROC option, or 20 output datasets that you then set together; a sketch of that option follows below.
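For what it's worth, here's a minimal sketch of the run-per-weight option using your example data (pluto1/pluto2 and the renamed a1/a2 are placeholder names; repeat for the remaining weights):
proc summary data=pippo missing nway;
var a / weight=wgt1;
output out=pluto1(drop=_type_ _freq_ rename=(a=a1)) sum()=;
run;
proc summary data=pippo missing nway;
var a / weight=wgt2;
output out=pluto2(drop=_type_ _freq_ rename=(a=a2)) sum()=;
run;
*with no CLASS statement each output has a single row, so a side-by-side merge works;
data pluto;
merge pluto1 pluto2;
run;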
A third option, though, is to roll your own summarization. This is advantageous in that it only sees the data once - so it's faster. It's disadvantageous in that there's a bit of work involved and it's more complicated.
Here's an example of doing this with sashelp.baseball. In your actual case you'll want to use code to generate the array reference for the variables (see the sketch after the example), and possibly for the weights, if they're not easily creatable using a variable list or similar. This assumes you have no CLASS variable, but it's easy to add that into the key if you do have a single (set of) class variable(s) that you want only NWAY combinations of.
data test;
set sashelp.baseball;
array w[5];
do _i = 1 to dim(w);
w[_i] = rand('Uniform')*100+50;
end;
output;
run;
data want;
set test end=eof;
i = .;
length varname $32;
sumval = 0 ;
sum=0;
if _n_ eq 1 then do;
declare hash h_summary(suminc:'sumval',keysum:'sum',ordered:'a');
h_summary.defineKey('i','varname'); *also would use any CLASS variables in the key;
h_summary.defineData('i','varname'); *also would include any CLASS variables in the data;
h_summary.defineDone();
end;
array w[5]; *if weights are not named in easy fashion like this generate this with code;
array vars[*] nHits nHome nRuns; *generate this with code for the real dataset;
do i = 1 to dim(w);
do j = 1 to dim(vars);
varname = vname(vars[j]);
sumval = vars[j]*w[i];
rc = h_summary.ref();
if i=1 then put varname= sumval= vars[j]= w[i]=;
end;
end;
if eof then do;
rc = h_summary.output(dataset:'summary_output');
end;
run;
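As an aside, here's a hedged sketch of generating that array reference with code via DICTIONARY.COLUMNS (the WHERE clause is an assumption; here it grabs every numeric column of WORK.TEST except the weight variables w1-w5):
proc sql noprint;
select name into :numvars separated by ' '
from dictionary.columns
where libname='WORK' and memname='TEST'
and type='num' and upcase(name) not like 'W%';
quit;
The hard-coded list then becomes array vars[*] &numvars;.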
One other thing to mention, though... if you're doing this because you're doing something like jackknife variance estimation, or anything else that uses replicate weights, consider PROC SURVEYMEANS, which can handle replicate weights for you.
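A hedged illustration with the fake weights from above, treating w1 as the full-sample weight and w2-w5 as jackknife replicates (which almost certainly does not match a real replicate-weight design):
proc surveymeans data=test varmethod=jackknife mean sum;
weight w1;
repweights w2-w5;
var nHits nHome nRuns;
run;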
You can SCORE your data set using a customized SCORE= data set that you can generate with a data step.
options center=0;
data pippo;
retain a 10 b 1.75 c 5 d 3 e 32;
run;
data score;
if 0 then set pippo;
array v[*] _numeric_;
retain _TYPE_ 'SCORE';
length _name_ $32;
array wt[3] _temporary_ (.5 1 .333);
do i = 1 to dim(v);
call missing(of v[*]);
do j = 1 to dim(wt);
_name_ = catx('_',vname(v[i]),'WGT',j);
v[i] = wt[j];
output;
end;
end;
drop i j;
run;
proc print;
run;
proc score data=pippo score=score;
id a--e;
var a--e;
run;
proc print;
run;
proc means stackods sum;
ods exclude summary;
ods output summary=summary;
run;
proc print;
run;
I have a dataset (LRG_DS) with about 74,000,000 observations. The dataset has been indexed by a variable (I_VAR1) that has about 7,500 unique values; I discovered this by running PROC CONTENTS on the dataset.
I'd like to create a dataset (TEMP) that contains just those 7,500 unique values of the index variable.
I've tried the following:
data TEMP;
set LRG_DS (keep = I_VAR1);
by I_VAR1;
if first.I_VAR1;
run;
and
proc sort data = LRG_DS nodupkey out = TEMP (keep = I_VAR1);
by I_VAR1;
run;
The first approach takes about 46 seconds and the second takes about 55 seconds.
I've read that the sas7bndx file is not intended to be examined in isolation, but rather is a file to speed up some of the procedures performed using the index variable.
Any help is much appreciated!
YMMV, but populating an empty hash table with the unique key values may perform better than a sort.
Create some example data:
data x;
do cnt=1 to 10*100000;
var=round(rand('uniform'),0.001);
do cnt2=1 to 10;
output;
end;
drop cnt2;
end;
run;
Test speed with a proc sort:
proc sort data=x(keep=var) out=sorted nodupkey;
by var;
run;
Compare with the hash table version:
data _null_;
set x(keep=var) end=eof;
if _n_ eq 1 then do;
declare hash ht ();
rc = ht.DefineKey ('var');
rc = ht.DefineDone ();
end;
if ht.check() ne 0 then do;
rc = ht.add();
end;
if eof then do;
ht.output(dataset:"ids");
end;
run;
From my very brief tests, I found that the hash table version starts to perform worse as the number of unique values grows. It may be possible to offset this by dimensioning the hash appropriately beforehand but I didn't test.
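If you want to experiment with pre-sizing, the relevant knob is the HASHEXP constructor argument. A sketch (16 is an arbitrary choice here, giving 2**16 internal buckets instead of the default 2**8):
if _n_ eq 1 then do;
declare hash ht (hashexp:16);
rc = ht.DefineKey ('var');
rc = ht.DefineDone ();
end;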
I have an SQL table like this one:
id   | payment   | date
-----|-----------|------------------
obs1 | -20,10,13 | 21184,22765,22704
And so on (1M+ observations). I prepared all the data for using FINANCE() in SQL, so in SAS I just need to take it and pass it to the function. I am confident that the data I prepared will return the right answer.
The problem is that I can't find a proper way to calculate the function over the entire data. Right now I am going row by row in a loop and passing data to macro variables through PROC SQL, BUT I can't get a string larger than 1000 characters, so my program isn't working.
I am running the following function:
finance('XIRR', payment, date, 0.15);
Can you help me please? Thanks
The code I had before the answer. It ran unacceptably long!
%macro eir (input_data, cash_var, dt_var, output_data);
data rawdata;
set &input_data(dbmax_text=32000);
run;
proc sql noprint;
select count(*) into :n from rawdata ;
quit;
%let n = 100;
%do j=1 %to &n;
data x;
set rawdata(firstobs = &j obs= &j);
run;
proc sql noprint;
select &cash_var into: cf from x;
select &dt_var into: dt from x;
quit;
data x;
set x;
r= finance('xirr', &cf, &dt, 0.15);
drop &cash_var &dt_var;
run;
data out;
set %if &j>1 %then %do; out %end; x;
run;
%end;
proc append base = &output_data data=out;
run;
proc datasets nolist;
delete x out rawdata;
run;
%mend eir;
%eir(input_data = have, cash_var = pmt, dt_var = dt, output_data = ggg);
Took 20 minutes to calculate 50,000 rows
and now it's just
data want;
set have(dbmax_text=32000);
eir = input(resolve(catx(',','%sysfunc(finance(XIRR',pmt,dt,'0.15),hex16)')),hex16.);
run;
Took 6 minutes to calculate 1,400,000 rows
Tom just saved our project =)
The FINANCE() function wants a list of values, not a character string. You could parse the string and convert the text back into numbers and pass those to the function. But if the lengths of the lists vary from observation to observation, that will cause issues.
You could use the macro processor to help you. You can generate a call to %sysfunc(finance()) and read the generated string back into a numeric variable.
It also might work to pad the short lists with zero payments on the last recorded date.
Let's make some test data.
data have ;
infile cards dsd dlm='|' ;
length id $20 payment date $100 ;
input id payment date;
cards;
obs1 | -20,10,13 | 21184,22765,22704
obs2 | -20,10 | 21184,22765
;
Now let's try converting it two ways: one by creating numeric variables to pass to the FINANCE() function call, and the other by generating a %sysfunc(finance()) call so that we can make sure the %sysfunc() approach is working properly.
data want;
set have ;
array v (3) _temporary_;
array d (3) _temporary_;
do i=1 to dim(v);
*pad short payment lists with zero payments dated the same as the prior payment;
v(i)=coalesce(input(scan(payment,i,','),32.),0);
d(i)=input(scan(date,i,','),32.);
if missing(d(i)) and i>1 then d(i)=d(i-1);
end;
drop i;
*method 1: pass the parsed numeric arrays directly to FINANCE();
value1=finance('XIRR',of v(*),of d(*),0.15);
*method 2: generate a %sysfunc(finance()) call as text, execute it with RESOLVE(), and read the result back through the HEX16. informat to avoid precision loss;
value2=input(resolve(catx(',','%sysfunc(finance(XIRR',payment,date,'0.15),hex16)')),hex16.);
run;
Here's my best guess based on the limited details you've provided. I think you need to split out each date and payment into separate variables before you can call the finance function, e.g.:
data have;
infile datalines dlm='|';
input id :$8. amount :$20. date :$20.;
datalines;
obs1 | -20,10,13 | 21184,22765,22704
;
run;
data want;
set have;
array dates[3] d1-d3;
array amounts[3] a1-a3;
do i = 1 to 3;
amounts[i] = input(scan(amount, i, ','), 8.);
dates[i] = input(scan(date, i, ','), 8.);
end;
XIRR = finance('XIRR', of a1-a3, of d1-d3, 0.15);
run;
I suspect this will only work if you have the same number of dates and payments in every row; otherwise you will run into array out-of-bounds issues or problems with the IRR calculation.
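If the number of entries does vary by row, a hedged tweak is to parse only as many entries as the shorter list actually contains (whether FINANCE() tolerates the leftover missing values is something you would want to test):
data want;
set have;
array dates[3] d1-d3;
array amounts[3] a1-a3;
*parse only as many entries as actually appear, capped at the array size;
n = min(countw(amount, ','), countw(date, ','), 3);
do i = 1 to n;
amounts[i] = input(scan(amount, i, ','), 8.);
dates[i] = input(scan(date, i, ','), 8.);
end;
XIRR = finance('XIRR', of a1-a3, of d1-d3, 0.15);
drop i n;
run;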
Is there any more elegant way than the one presented below for the following task:
create indicator variables (below, MAX_X1 and MAX_X2) within each group (below, key1) of multiple observations (below, key2), with value 1 if the observation corresponds to the maximum value of the variable in its group and 0 otherwise.
data have;
call streaminit(4321);
do key1=1 to 10;
do key2=1 to 5;
do x1=rand("uniform");
x2=rand("Normal");
output;
end;
end;
end;
run;
proc means data=have noprint;
by key1;
var x1 x2;
output out=max
max= / autoname;
run;
data want;
merge have max;
by key1;
drop _:;
run;
proc sql;
title "MAX";
select name into :MAXvars separated by ' '
from dictionary.columns
WHERE LIBNAME="WORK" AND MEMNAME="WANT" AND NAME like "%_Max"
order by name;
quit;
title;
data want; set want;
array MAX (*) &MAXvars;
array XVars (*) x1 x2;
array Indicators (*) MAX_X1 MAX_X2;
do i=1 to dim(MAX);
if XVars[i]=MAX[i] then Indicators[i]=1; else Indicators[i]=0;
end;
drop i;
run;
Thanks for any optimization suggestions.
PROC SQL can be used with a GROUP BY clause to apply summary functions across the values of a variable. SAS's SQL automatically remerges the group-level MAX back onto every detail row (the log notes the remerge), which is why no subquery is needed here.
data have;
call streaminit(4321);
do key1=1 to 10;
do key2=1 to 5;
do x1=rand("uniform");
x2=rand("Normal");
output;
end;
end;
end;
run;
proc sql;
create table want
as select
key1,
key2,
x1,
x2,
case
when x1 = max(x1) then 1
else 0 end as max_x1,
case
when x2 = max(x2) then 1
else 0 end as max_x2
from have
group by key1
order by key1, key2;
quit;
It is also possible to do this in a single data step, provided that you read the input dataset twice - this is an example of a double DOW-loop.
data have;
call streaminit(4321);
do key1=1 to 10;
do key2=1 to 5;
do x1=rand("uniform");
x2=rand("Normal");
output;
end;
end;
end;
run;
/*Sort by key1 (or generate index) if not already sorted*/
proc sort data = have;
by key1;
run;
data want;
if 0 then set have;
array xvars[3,2] x1 x2 x1_max_flag x2_max_flag t_x1_max t_x2_max;
/*1st DOW-loop*/
do _n_ = 1 by 1 until(last.key1);
set have;
by key1;
do i = 1 to 2;
xvars[3,i] = max(xvars[1,i],xvars[3,i]);
end;
end;
/*2nd DOW-loop*/
do _n_ = 1 to _n_;
set have;
do i = 1 to 2;
xvars[2,i] = (xvars[1,i] = xvars[3,i]);
end;
output;
end;
drop i t_:;
run;
This may be a bit complicated to understand, so here's a rough explanation of how it flows:
Read one by group with the first DOW-loop, updating rolling max variables as each row is read in. Don't output anything yet.
Now read the same by-group again using the second DOW-loop, checking to see whether each row is equal to the rolling max and outputting each row.
Go back to first DOW-loop, read the next by-group and repeat.
I have a question about transposing data without using PROC TRANSPOSE. Here's my input data:
   a    b    c
1  dog  cat  camel
2  9    7    2534
Without using PROC TRANSPOSE, how can I get a resulting dataset of:
   Animals  Weight
1  dog      9
2  cat      7
3  camel    2534
This is a bit of a curious request. This example code is hard coded for your 3 variables. You will have to generalize this if needed.
data temp;
input a $ b $ c $;
datalines;
dog cat camel
9 7 2534
;
run;
data animal_weight;
set temp end=last;
format animal animals1-animals3 $8.;
format weight weights1-weights3 best. ;
retain animals: weights:;
array animals[3];
array weights[3];
if _n_ = 1 then do;
animals[1] = a;
animals[2] = b;
animals[3] = c;
end;
else if _n_ = 2 then do;
weights[1] = input(a,best.);
weights[2] = input(b,best.);
weights[3] = input(c,best.);
end;
if last then do;
do i=1 to 3;
animal = animals[i];
weight = weights[i];
output;
end;
end;
drop i animals: weights: a b c;
run;
Read the values into 2 arrays, converting the weights from strings into numbers. Use the _N_ variable to figure out which array to populate. At the end of the data set, output the values in the arrays.
I wouldn't give this as an answer to a homework problem that I actually wanted to get a good grade on (it's far too advanced, so it would be obvious you asked for help), but the hash solution is almost certainly the most flexible, and it's what I'd hope someone doing this in the real world would do (assuming there is a real-world 'don't use proc transpose' reason, such as available resources). The problem is somewhat underdefined, so this is only moderately fault-tolerant.
data have;
input a $ b $ c $;
datalines;
dog cat camel
9 7 2534
;;;;
run;
data _null_;
set have end=eof;
array charvars _character_;
if _n_ = 1 then do;
length animal $15 weight 8;
declare hash h();
h.defineKey('row');
h.defineData('animal','weight');
h.defineDone();
end;
animal=' ';
weight=.;
do row = 1 to dim(charvars);
rc_f = h.find();
if rc_f ne 0 then do;
animal=charvars[row];
rc_a = h.add();
animal=' ';
end;
else if rc_f eq 0 then do;
weight=input(charvars[row],best12.);
rc_r = h.replace();
end;
end;
if eof then rc_o = h.output(dataset:'want');
run;
Do you always have just two rows, or are the numbers of rows and columns dynamic?
If you have a dynamic number of rows and columns, the ideal way is to use the OPEN function to get the number of columns into a macro variable; that will be the number of rows in your new dataset. Then take the number of rows in your original dataset, which will be the number of columns in your new dataset. This must happen before the actual transpose step; a sketch follows below. After that, you can read the data into an array and, using the macro variables as the dimensions, output the values into the new dataset.
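A minimal sketch of that metadata step, assuming the original dataset is WORK.TEMP:
%let dsid = %sysfunc(open(work.temp));
%let ncols = %sysfunc(attrn(&dsid, nvars)); /* becomes the row count of the new dataset */
%let nrows = %sysfunc(attrn(&dsid, nobs)); /* becomes the column count of the new dataset */
%let rc = %sysfunc(close(&dsid));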
Having said all this, why would you want to reinvent the wheel when you already have the ready-made PROC TRANSPOSE that SAS provides?