I have two groups, A and B, and two numeric variables, X and Y. I want to create two new variables, new1 and new2, based on the values of X and Y (respectively) for group B (i.e., IF group = B THEN new1 = X, new2 = Y). I want to take those newly created variables, append them to group A, and then delete group B. In the end, there should be one row for group A containing X, Y, new1, and new2. I'm uncertain how to accomplish this.
I've looked into using PROC TRANSPOSE, but I'm unsure if that's the right starting point. My internet searches are lacking because I'm not even sure what to call what I'm attempting to do, though I'm betting this is a common procedure requiring a common solution.
EXAMPLE
Not sure how to generalize the problem, but for the given problem this will work:
/* Just reversing the records */
proc sort data = have;
by descending group;
run;
data want;
set have;
retain new1 new2;
if _N_ = 1 then do;
new1 = x;
new2 = y;
end;
else output;
run;
This sounds like a case of 1 to 1 merging (merge with out BY).
data have; input
group $1. x y; datalines;
A 3 4
B 2 6
run;
data want;
merge
have(where=( group='A'))
have(where=(Bgroup='B') rename=(x=Bx y=By group=Bgroup))
;
drop Bgroup;
run;
Related
Ive got 50 columns of data, with 4 different measurements in each, as well as designation tags (groups C, D, and E). Ive averaged the 4 measurements... So every data point now has an average. Now, I am supposed to take the average of all the data points averages of each specific group.
So I want all the data in group C to be averaged, and so on for D and E.... and I dont know how to do that.
avg1=(MEAS1+MEAS2+MEAS3+MEAS4)/4;
avg_score=round(avg1, .1);
run;
proc print;
run;
This is what I have so far.
There are several procedures, and SQL that can average values over a group.
I'll guess you meant to say 50 rows of data.
Example:
Proc MEANS
data have;
call streaminit(314159);
do _n_ = 1 to 50;
group = substr('CDE', rand('integer',3),1);
array v meas1-meas4;
do _i_ = 1 to dim(v);
num + 2;
v(_i_) = num;
end;
output;
end;
drop num;
run;
data rowwise_means;
set have;
avg_meas = mean (of meas:);
run;
* group wise means of row means;
proc means noprint data=rowwise_means nway;
class group;
var avg_meas;
output out=want mean=meas_grandmean;
run;
rowwise_means
want (grandmean, or mean of means)
I have a 2 column dataset - accounts and attributes, where there are 6 types of attributes.
I am trying to use PROC TRANSPOSE in order to set the 6 different attributes as 6 new columns and set 1 where the column has that attribute and 0 where it doesn't
This answer shows two approaches:
Proc TRANSPOSE, and
array based transposition using index lookup via hash.
For the case that all of the accounts missing the same attribute, there would be no way for the data itself to exhibit all the attributes -- ideally the allowed or expected attributes should be listed in a separate table as part of your data reshaping.
Proc TRANSPOSE
When working with a table of only account and attribute you will need to construct a view adding a numeric variable that can be transposed. After TRANSPOSE the result data will have to be further massaged, replacing missing values (.) with 0.
Example:
data have;
call streaminit(123);
do account = 1 to 10;
do attribute = 'a','b','c','d','e','f';
if rand('uniform') < 0.75 then output;
end;
end;
run;
data stage / view=stage;
set have;
num = 1;
run;
proc transpose data=stage out=want;
by account;
id attribute;
var num;
run;
data want;
set want;
array attrs _numeric_;
do index = 1 to dim(attrs);
if missing(attrs(index)) then attrs(index) = 0;
end;
drop index;
run;
proc sql;
drop view stage;
From
To
Advanced technique - Array and Hash mapping
In some cases the Proc TRANSPOSE is deemed unusable by the coder or operator, perhaps very many by groups and very many attributes. An alternate way to transpose attribute values into like named flag variables is to code:
Two scans
Scan 1 determine attribute values that will be encountered and used as column names
Store list of values in a macro variable
Scan 2
Arrayify the attribute values as variable names
Map values to array index using hash (or custom informat per #Joe)
Process each group. Set arrayed variable corresponding to each encountered attribute value to 1. Array index obtained via lookup through hash map.
Example:
* pass #1, determine attribute values present in data, the values will become column names;
proc sql noprint;
select distinct attribute into :attrs separated by ' ' from have;
* or make list of attributes from table of attributes (if such a table exists outside of 'have');
* select distinct attribute into :attrs separated by ' ' from attributes;
%put NOTE: &=attrs;
* pass #2, perform array based tranposformation;
data want2(drop=attribute);
* prep pdv, promulgate by group variable attributes;
if 0 then set have(keep=account);
array attrs &attrs.;
format &attrs. 4.;
if _n_=1 then do;
declare hash attrmap();
attrmap.defineKey('attribute');
attrmap.defineData('_n_');
attrmap.defineDone();
do _n_ = 1 to dim(attrs);
attrmap.add(key:vname(attrs(_n_)), data: _n_);
end;
end;
* preset all flags to zero;
do _n_ = 1 to dim(attrs);
attrs(_n_) = 0;
end;
* DOW loop over by group;
do until (last.account);
set have;
by account;
attrmap.find(); * lookup array index for attribute as column;
attrs(_n_) = 1; * set flag for attribute (as column);
end;
* implicit output one row per by group;
run;
One other option for doing this not using PROC TRANSPOSE is the data step array technique.
Here, I have a dataset that hopefully matches yours approximately. ID is probably your account, Product is your attribute.
data have;
call streaminit(2007);
do id = 1 to 4;
do prodnum = 1 to 6;
if rand('Uniform') > 0.5 then do;
product = byte(96+prodnum);
output;
end;
end;
end;
run;
Now, here we transpose it. We make an array with the six variables that could occur in HAVE. Then we iterate through the array to see if that variable is there. You can add a few additional lines to the if first.id block to set all of the variables to 0 instead of missing initially (I think missing is better, but YMMV).
data want;
set have;
by id;
array vars[6] a b c d e f;
retain a b c d e f;
if first.id then call missing(of vars[*]);
do _i = 1 to dim(vars);
if lowcase(vname(vars[_i])) = product then
vars[_i] = 1;
end;
if last.id then output;
run;
We could do it a lot faster if we knew how the dataset was constructed, of course.
data want;
set have;
by id;
array vars[6] a b c d e f;
if first.id then call missing(of vars[*]);
retain a b c d e f;
vars[rank(product)-96]=1;
if last.id then output;
run;
While your data doesn't really work that way, you could make an informat though that did this.
*First we build an informat relating the product to its number in the array order;
proc format;
invalue arrayi
'a'=1
'b'=2
'c'=3
'd'=4
'e'=5
'f'=6
;
quit;
*Now we can use that!;
data want;
set have;
by id;
array vars[6] a b c d e f;
if first.id then call missing(of vars[*]);
retain a b c d e f;
vars[input(product,arrayi.)]=1;
if last.id then output;
run;
This last one is probably the absolute fastest option - most likely much faster than PROC TRANSPOSE, which tends to be one of the slower procs in my book, but at the cost of having to know ahead of time what variables you're going to have in that array.
My data is more than 70,000. I have more than 50 variables. (Var1 to Var50). In each variable, there are about about 30 groups (I'll use a to z). I am trying to get a selection of data using if statements. I'd like to select every data with the same group. Eg data in var 1 to 30 with a, data with var 1 to 30 in b.
I seem to be writing
If (Var1="a" and Var2="a" and Var3="a" and Var4="a" and all the way to var50=
"a") or (Var1="b" and Var2="a" and Var3="b" and Var4="b" and all the way to var50=
"b")...
How do I consolidate? I tried using an array but it didnt work and i was not sure if arrays work in the IF and then statement.
IF (VAR2="A" or VAR2="B" or VAR2="C" or VAR2="D"
or VAR3="A" or VAR3="B" or VAR3="C" or VAR3="D"
or VAR4="A" or VAR4="B" or VAR4="C" or VAR4="D"
or VAR5="A" or VAR5="B" or VAR5="C" or VAR5="D"
or VAR6="A" or VAR6="B" or VAR6="C" or VAR6="D"
or VAR7="A" or VAR7="B" or VAR7="C" or VAR7="D"
or VAR8="A" or VAR8="B" or VAR8="C" or VAR8="C"
or VAR9="A" or VAR9="B" or VAR9="C" or VAR9="D"
or VAR10="A" or VAR10="B" or I10_D10="C" or VAR10="D"
or VAR12="A" or VAR12="B" or VAR12="C" or VAR12="D"
or VAR13="A" or VAR13="B" or VAR13="C" or VAR13="D"
or VAR14="A" or VAR14="B" or VAR14="C" or VAR14="D"
or VAR15="A" or VAR15="B" or VAR15="C" or VAR15="D"
or VAR6="A" or VAR16="B" or VAR16="C" or VAR16="D"
or VAR17="A" or VAR17="B" or VAR17="C" or VAR17="D"
or VAR18="A" or VAR18="B" or VAR18="C" or VAR18="C"
or VAR19="A" or VAR19="B" or VAR19="C" or I10_D19="D"
or VAR20="A" or VAR20="B" or I10_D20="C" or VAR20="D"
or VAR21="D" or VAR22="A" or VAR22="B" or VAR22="C" or VAR22="D"
or VAR23="A" or VAR23="B" or VAR23="C" or VAR23="D"
or VAR24="A" or VAR24="B" or VAR24="C" or VAR24="D"
or VAR25="A" or VAR25="B" or VAR25="C" or VAR25="D"
or VAR26="A" or VAR26="B" or VAR26="C" or VAR26="D"
or VAR27="A" or VAR27="B" or VAR27="C" or VAR27="D"
or VAR28="A" or VAR28="B" or VAR28="C" or VAR28="C"
or VAR29="A" or VAR29="B" or VAR29="C" or VAR29="D"
or VAR30="A" or VAR30="B" or I10_D30="C" or VAR30="D")
then Group=1; else Group=0;
You probably don't need a macro, however a macro might be faster.
%let value=a;
data want;
set have;
array var[50];
keepit=1;
do i=1 to 50;
keepit = keepit and (var[i]="&value");
if ^keepit then
leave;
end;
if keepit;
drop i keepit;
run;
I create a signal variable and update it's value, it will be false if any value in the var[] array is not the &value. I leave the loop early if we find 1 non-matching value, to make it more efficient.
It's not exactly clear what you want. If you want to avoid checking all variables you can use WHICHC to find if any in a list are A.
X = whichc('a', of var1-var30);
If you want to see what different groups you have across all the variables, I think a big proc freq is what you want:
proc freq data=have noprint;
table var1*var2*var3*var4....*var30*gender*age / list out=table_counts;
run;
And then check the table_counts data set to see if that has what you want.
If neither of these are what you want, you need to add more details to your question. A sample of data and expected output would be perfect.
When I need to search several variables for a particular value what I will do is - combine all variables into one string and then search that string. Like this:
*** CREATE TEST DATA ***;
data have;
infile cards;
input VAR1 $ VAR2 $ VAR3 $ VAR4 $ VAR5 $;
cards;
J J K A M
S U I O P
D D D D D
l m n o a
Q U J C S
;
run;
data want;
set have;
*** USE CATS FUNCTION TO CONCATENATE ALL VAR# INTO ONE VARIABLE ***;
allvar = cats(var1, var2, var3, var4, var5);
*** IF NEEDED, APPLY UPCASE TO CONCATENATED VARIABLE ***;
*allvar = upcase(allvar);
*** USE INDEXC FUNCTION TO SEARCH NEW VARIABLE ***;
if indexc(allvar, "ABCD") > 0 then group = 1;
else group = 0;
run;
I'm not sure if this is exactly what you need, but hopefully this is something you can modify for your particular task.
The code as posted is testing if ANY of a list of variables match ANY of a list of values.
Let's make a simple test dataset.
data have ;
input id (var1-var5) ($);
cards;
1 E F G H I
2 A B C D E
;;;;
Make one array of the values you want to find and one array of the variables you want to check. Loop over the list of variables until you either find one that contains one of the values or you run out of variables to test.
data want ;
set have;
array values (4) $8 _temporary_ ('A' 'B' 'C' 'D');
array vars var1-var5 ;
group=0;
do i=1 to dim(vars) until (group=1);
if vars(i) in values then group=1;
end;
drop i;
run;
You could avoid the array for the list of values if you want.
if vars(i) in ('A' 'B' 'C' 'D') then group=1;
But using the array will allow you to make the loop run over the list of values instead of the list of variables.
do i=1 to dim(values) until (group=1);
if values(i) in vars then group=1;
end;
Which might be important if you wanted to keep the variable i to indicate which value (or variable) was first matched.
I have a database with serveral variables, including one, RIF, that hase an x^2 shape relative to another variable, Y.
I want to obtain two seperate databases, separated based on whether the observation is on the decreasing or the increasing part of the curve.
I thought I had something by using the lag function, but my code does not work.
proc sort data=have; by y; run;
data want;
set have;
do while (rif<=lag(rif));
Part=1;
end;
if Part ne 1 then Part=2
run;
And the separating given Part, but it seems to create infintite loop.
Is there a mistake in my code / is there a better way of doing this
data have;
do x = -10 to 10 by 1;
y = x**2;
output;
end;
run;
data want;
set have;
lag_y = lag(y);
if _n_ = 1 then Part=.;
else if y <= lag_y then Part=1;
else Part=2;
drop lag_y;
run;
Consider the following data set test:
x y
1 A
2 A
. A
4 A
. B
. B
7 B
8 B
Basically I want missing values in the group to be replaced by the previous value. But if the previous value is from another group then don't use that value to replace the current value.
Consider this code:
proc sort data=testout=Sorted;
by y;
run;
data Out2(drop=_x);
set Sorted;
by y;
retain _x;
if first.y then do;
_x=x;
end; else do;
if missing(x) then do;
x = _x;
end; else do;
_x = x;
end;
end;
run;
What does the underscore behind the x mean (_x)?
Using an underscore as a prefix is a fairly common naming convention for temporary variables. As far as SAS is concerned, _x is just another variable - the _ doesn't have any special effect. However, if all your temporary variables (and only those ones) start with an underscore, it's a bit less work to tidy up at the end of your data step, because you can use a : wildcard to drop them all in one go. e.g.
data example;
set sashelp.class;
_age = age;
_sex = sex;
drop _: ;
run;
_x is simply a new variable. It could as easily be z or last_x. However, it may be more clear to someone reading the code that it is related to x; while I'm not aware of any convention similar to "_ means previous value of ", it's not an unreasonable one.