This might be a stupid question, but I'm having a hard time with this issue.
I have data that looks something like this:
animal firstcharacter
mouse m
dog d
cat c
monkey m
donkey d
I want to divide this "original" data into several datasets based on firstcharacter.
In this example, I should have 3 groups (c, d, m).
This is easy if I do this one by one:
data new_c; set original; if firstcharacter = "c"; run;
data new_d; set original; if firstcharacter = "d"; run;
data new_m; set original; if firstcharacter = "m"; run;
The problem is, I have hundreds of these groups in the actual data.
Is there a simple way (using either do loop or macro variable) to do this?
Thanks.
This is pretty easy to do with hash tables. This is the 'easy' version, which requires a sort but doesn't require a hash of hashes or any real management.
data have;
input animal $ firstcharacter $;
datalines;
mouse m
dog d
cat c
monkey m
donkey d
;;;;
run;
proc sort data=have;
by firstcharacter;
run;
data _null_;
set have;
by firstcharacter;
if _n_=1 then do;
declare hash h;
end;
if first.firstcharacter then do;
h = _new_ hash();
h.defineKey('animal');
h.defineData('animal','firstcharacter');
h.defineDone();
end;
rc = h.add();
if last.firstcharacter then do;
rc = h.output(dataset:cats('new_',firstcharacter));
end;
run;
More complex methods exist using a hash of hashes (search on that if you want to know more).
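Since the question asks about macro variables, here is a minimal sketch of that route as an alternative to the hash approach above. It assumes the `have` dataset from this answer, that every value of firstcharacter is valid as part of a dataset name, and that the generated text fits in a macro variable. The macro variable names `dsnames` and `whens` are made up for illustration.

```sas
/* Build the list of output datasets and one WHEN clause per group. */
proc sql noprint;
  select distinct cats('new_', firstcharacter)
    into :dsnames separated by ' '
    from have;
  select distinct catx(' ', 'when (', quote(trim(firstcharacter)),
                       ') output', cats('new_', firstcharacter), ';')
    into :whens separated by ' '
    from have;
quit;

/* One pass over the data; each row is routed to its own dataset. */
data &dsnames;
  set have;
  select (firstcharacter);
    &whens
    otherwise;   /* hypothetical safety net: unmatched rows are dropped */
  end;
run;
```

This reads `have` once for the split and needs no sort, at the cost of generating code from the data values.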
I can't find a way to summarize the same variable using different weights.
I try to explain it with an example (of 3 records):
data pippo;
a=10;
wgt1=0.5;
wgt2=1;
wgt3=0;
output;
a=3;
wgt1=0;
wgt2=0;
wgt3=1;
output;
a=8.9;
wgt1=1.2;
wgt2=0.3;
wgt3=0.1;
output;
run;
I tried the following:
proc summary data=pippo missing nway;
var a /weight=wgt1;
var a /weight=wgt2;
var a /weight=wgt3;
output out=pluto (drop=_freq_ _type_) sum()=;
run;
Obviously it gives me a warning because I used the same variable "a" (I can't rename it!).
I have to store a huge amount of data with limited physical space, and otherwise I would have to construct about 120 fields (a0-a6, b0-b6, etc.) that are the same variables with fixed weights (wgt0-wgt5) applied.
I want to store a dataset with 20 columns (a, b, c, ...) and 6 weights (wgt0-wgt5) and, on demand, run a "summary" without an intermediate data step that obliges me to create 120 fields.
Because of the huge amount of data (roughly 55 GB every month), I'd also like to avoid a proc sql statement such as:
proc sql;
create table pluto
as select sum(db.a * wgt1) as a0, sum(db.a * wgt1) as a1 , etc.
quit;
Is there a "super proc summary" that can summarize the same field with different weights?
Thanks in advance,
Paolo
I think there are a few options. One is the data step view that data_null_ mentions. Another is just running the proc summary however many times you have weights, and either using ods output with the persist=proc or 20 output datasets and then setting them together.
A third option, though, is to roll your own summarization. This is advantageous in that it only sees the data once - so it's faster. It's disadvantageous in that there's a bit of work involved and it's more complicated.
Here's an example of doing this with sashelp.baseball. In your actual case you'll want to use code to generate the array reference for the variables, and possibly for the weights, if they're not easily creatable using a variable list or similar. This assumes you have no CLASS variable, but it's easy to add that into the key if you do have a single (set of) class variable(s) that you want NWAY combinations of only.
data test;
set sashelp.baseball;
array w[5];
do _i = 1 to dim(w);
w[_i] = rand('Uniform')*100+50;
end;
output;
run;
data want;
set test end=eof;
i = .;
length varname $32;
sumval = 0 ;
sum=0;
if _n_ eq 1 then do;
declare hash h_summary(suminc:'sumval',keysum:'sum',ordered:'a');
h_summary.defineKey('i','varname'); *also would use any CLASS variable in the key;
h_summary.defineData('i','varname'); *also would include any CLASS variable in the data;
h_summary.defineDone();
end;
array w[5]; *if weights are not named in easy fashion like this generate this with code;
array vars[*] nHits nHome nRuns; *generate this with code for the real dataset;
do i = 1 to dim(w);
do j = 1 to dim(vars);
varname = vname(vars[j]);
sumval = vars[j]*w[i];
rc = h_summary.ref();
if i=1 then put varname= sumval= vars[j]= w[i]=;
end;
end;
if eof then do;
rc = h_summary.output(dataset:'summary_output');
end;
run;
One other thing to mention though... if you're doing this because you're doing something like jackknife variance estimation or that sort of thing, or anything that uses replicate weights, consider using PROC SURVEYMEANS which can handle replicate weights for you.
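The data step view option mentioned at the start of this answer can be sketched as follows. This is only an illustration using the `pippo` example from the question; the view name `weighted` and the variables `a_w1-a_w3` are made up here. Because it is a view, the weighted columns are computed while PROC SUMMARY reads the data and are never written to disk.

```sas
/* A view: nothing is stored, the products are computed on the fly. */
data weighted / view=weighted;
  set pippo;
  array wgt[3] wgt1-wgt3;
  array a_w[3] a_w1-a_w3;      /* hypothetical names for a*wgt1 ... a*wgt3 */
  do k = 1 to dim(wgt);
    a_w[k] = a * wgt[k];
  end;
  keep a_w1-a_w3;
run;

/* A plain (unweighted) sum of a*wgt is the weighted sum of a. */
proc summary data=weighted;
  var a_w1-a_w3;
  output out=pluto(drop=_type_ _freq_) sum=;
run;
```

For the real data the two arrays would cover all 20 variables and 6 weights, and that array code would probably be generated rather than typed.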
You can SCORE your data set using a customized SCORE data set that you can generate with a data step.
options center=0;
data pippo;
retain a 10 b 1.75 c 5 d 3 e 32;
run;
data score;
if 0 then set pippo;
array v[*] _numeric_;
retain _TYPE_ 'SCORE';
length _name_ $32;
array wt[3] _temporary_ (.5 1 .333);
do i = 1 to dim(v);
call missing(of v[*]);
do j = 1 to dim(wt);
_name_ = catx('_',vname(v[i]),'WGT',j);
v[i] = wt[j];
output;
end;
end;
drop i j;
run;
proc print;
run;
proc score data=pippo score=score;
id a--e;
var a--e;
run;
proc print;
run;
proc means stackods sum;
ods exclude summary;
ods output summary=summary;
run;
proc print;
run;
I have a database with several variables, including one, RIF, that has an x^2 shape relative to another variable, Y.
I want to obtain two separate databases, split according to whether the observation is on the decreasing or the increasing part of the curve.
I thought I had something by using the lag function, but my code does not work.
proc sort data=have; by y; run;
data want;
set have;
do while (rif<=lag(rif));
Part=1;
end;
if Part ne 1 then Part=2;
run;
I would then separate the data by Part, but the code seems to create an infinite loop.
Is there a mistake in my code, or is there a better way of doing this?
data have;
do x = -10 to 10 by 1;
y = x**2;
output;
end;
run;
data want;
set have;
lag_y = lag(y);
if _n_ = 1 then Part=.;
else if y <= lag_y then Part=1;
else Part=2;
drop lag_y;
run;
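The loop in the question never ends because nothing inside `do while` changes its condition: the data step is still on the same row, so the comparison gives the same answer every time. The code above avoids this by calling LAG once per row, unconditionally, and comparing with a plain IF. If the goal is two separate datasets rather than a flag, a minimal sketch along the same lines could write them directly. The dataset names `decreasing` and `increasing` are made up here, and the first row, which has no predecessor, goes to neither.

```sas
data decreasing increasing;
  set have;
  lag_y = lag(y);                     /* call LAG on every row, outside any IF */
  if _n_ > 1 then do;
    if y <= lag_y then output decreasing;
    else output increasing;
  end;
  drop lag_y;
run;
```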
I am very new to SAS but keen to learn. I have two data sets, dt1 and dt2; dt1 has column a, and dt2 has columns b and c:
a     b     c
2014  2008  2
      2009  3
      2014  4
      2015  5
I am trying to get the value in the nth row of column c, where n is the row in which column b equals a(1), the first value of a.
Here that is c=4.
I wrote a code below.
DATA dt1;
set dt1;
data dt2;
set dt2;
i=1;
do while (b ne a);
i=i+1;
end;
call symput('ROW_NUMBER',i);
run;
proc print data = dt2(keep = c obs = &ROW_NUMBER firstobs = &ROW_NUMBER);
run;
But this code enters an infinite loop, and I could not find a solution. I would appreciate any help with this issue.
Thanks
I think you should learn the basic syntax of the data step before trying to use macro variables. A lot of what you're doing makes little sense. Here is an explanation of how the data step works. You will do yourself a huge favor if you study that.
Here's how to do an inner join in proc sql, which seems to be more in line with your goal here. This simply selects the values of c where dt1.a is equal to dt2.b:
proc sql;
select c
from dt1 inner join dt2 on dt1.a = dt2.b;
quit;
If you were to use a data step, you'd do something like the following.
data out(keep=c);
set dt1;
do until (a=b or eof);
set dt2 end=eof;
if a=b then output;
end;
run;
proc print data=out noobs;
run;
Use the end= option to create temporary variable eof which allows you to end the loop after the last row of dt2 is read.
This is a simple MERGE. You just need to rename the variables to match. This assumes they're both sorted by the value (a/b). You can then set the macro variable in that data step or do whatever you want.
data want;
merge dt1(in=_a rename=(a=b)) dt2(in=_b);
by b;
if _a and _b;
call symput("ROW_NUMBER",c);
run;
If you want to define macro variables:
data _null_;
set dt2;
if _n_=1 then set dt1;
if a=b then do;
call symput('c_val',c);
call symput('row_num',_n_);
end;
run;
%put &row_num &c_val;
I have a SAS dataset as follow :
Key A B C D E
001 1 . 1 . 1
002 . 1 . 1 .
Keeping the existing variables, I want to replace each value with the variable name: if variable A has value 1 then the new variable should have the value A, otherwise blank.
Currently I am hardcoding the values. Does anyone have a better solution?
The following should do the trick (the first data step sets up the example):-
data test_data;
length key A B C D E 3;
format key z3.; ** Force leading zeroes for KEY;
key=001; A=1; B=.; C=1; D=.; E=1; output;
key=002; A=.; B=1; C=.; D=1; E=.; output;
proc sort;
by key;
run;
data results(drop = _: i);
set test_data(rename=(A=_A B=_B C=_C D=_D E=_E));
array from_vars[*] _:;
array to_vars[*] $1 A B C D E;
do i=1 to dim(from_vars);
to_vars[i] = ifc( from_vars[i], substr(vname(from_vars[i]),2), '');
end;
run;
It all looks a little awkward as we have to rename the original (assumed numeric) variables to then create same-named character variables that can hold values 'A', 'B', etc.
If your 'real' data has many more variables, the renaming can be laborious so you might find a double proc transpose more useful:-
proc transpose data = test_data out = test_data_tran;
by key;
proc transpose data = test_data_tran out = results2(drop = _:);
by key;
var _name_;
id _name_;
where col1;
run;
However, your variables will be in the wrong order on the output dataset and will be of length $8 rather than $1, which can be a waste of space. If either point is important (they seldom are), both can be remedied by following up with a length statement in a subsequent data step:-
option varlenchk = nowarn;
data results2;
length A B C D E $1;
set results2;
run;
option varlenchk = warn;
This organises the variables in the right order and minimises their length. Still, you're now hard-coding your variable names which means you might as well have just stuck with the original array approach.
I have a question about transposing data without using PROC Transpose.
0 a b c
1 dog cat camel
2 9 7 2534
Without using PROC TRANSPOSE, how can I get a resulting dataset of:
Animals Weight
1 dog 9
2 cat 7
3 camel 2534
This is a bit of a curious request. This example code is hard coded for your 3 variables. You will have to generalize this if needed.
data temp;
input a $ b $ c $;
datalines;
dog cat camel
9 7 2534
;
run;
data animal_weight;
set temp end=last;
format animal animals1-animals3 $8.;
format weight weights1-weights3 best. ;
retain animals: weights:;
array animals[3];
array weights[3];
if _n_ = 1 then do;
animals[1] = a;
animals[2] = b;
animals[3] = c;
end;
else if _n_ = 2 then do;
weights[1] = input(a,best.);
weights[2] = input(b,best.);
weights[3] = input(c,best.);
end;
if last then do;
do i=1 to 3;
animal = animals[i];
weight = weights[i];
output;
end;
end;
drop i animals: weights: a b c;
run;
Read the values into 2 arrays, converting the weights from strings into numbers. Use the _N_ variable to figure out which array to populate. At the end of the data set, output the values in the arrays.
I wouldn't give this as an answer to a homework problem that I actually wanted to get a good grade on (because it's far too advanced, so it's obvious you asked for help); but the hash solution is almost certainly the most flexible and what I'd hope someone doing this in the real world would do (assuming there is a 'don't use proc transpose' real world reason, such as available resources). The problem is somewhat undefined, so this is only moderately fault-tolerant.
data have;
input a $ b $ c $;
datalines;
dog cat camel
9 7 2534
;;;;
run;
data _null_;
set have end=eof;
array charvars _character_;
if _n_ = 1 then do;
length animal $15 weight 8;
declare hash h();
h.defineKey('row');
h.defineData('animal','weight');
h.defineDone();
end;
animal=' ';
weight=.;
do row = 1 to dim(charvars);
rc_f = h.find();
if rc_f ne 0 then do;
animal=charvars[row];
rc_a = h.add();
animal=' ';
end;
else if rc_f eq 0 then do;
weight=input(charvars[row],best12.);
rc_r = h.replace();
end;
end;
if eof then rc_o = h.output(dataset:'want');
run;
Do you always have just two rows, or are both the number of columns and the number of rows dynamic?
If the number of rows and columns is dynamic, then the ideal way is to use the open function and get the number of columns into a macro variable. This will be the number of rows in your new dataset. Then take the number of rows in your original dataset, which will be the number of columns in your new dataset. This must happen before the actual transpose step. After this you can read the data into an array and, using the macro variables as the dimensions, output the values to the new dataset.
Having said all this, why would you want to re-invent the wheel when SAS already provides a ready-made PROC TRANSPOSE?
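The steps described above can be sketched roughly as follows. This is an illustration rather than a tested general solution: it assumes the `have` dataset from the earlier answers, that every column is character, and that the whole table fits in memory. The macro variable names `ncols` and `nrows`, the array names, and the `col` prefix are made up here.

```sas
/* Get the dimensions of the source dataset into macro variables. */
%let dsid  = %sysfunc(open(work.have));
%let ncols = %sysfunc(attrn(&dsid, nvars));   /* becomes the number of output rows    */
%let nrows = %sysfunc(attrn(&dsid, nlobs));   /* becomes the number of output columns */
%let rc    = %sysfunc(close(&dsid));

data want;
  set have end=eof;
  array src[*] _character_;                     /* the source columns           */
  array cell[&ncols, &nrows] $32 _temporary_;   /* whole table, kept in memory  */
  array col[&nrows] $32;                        /* new columns col1, col2, ...  */
  /* Store the current row as one column of the in-memory table. */
  do j = 1 to dim(src);
    cell[j, _n_] = src[j];
  end;
  /* After the last row, write one output row per source column. */
  if eof then do j = 1 to &ncols;
    do i = 1 to &nrows;
      col[i] = cell[j, i];
    end;
    output;
  end;
  keep col:;
run;
```

The values stay character here; converting the weight column to numeric, as the earlier answers do with input(), would be a separate step.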