My data is more than 70,000. I have more than 50 variables. (Var1 to Var50). In each variable, there are about about 30 groups (I'll use a to z). I am trying to get a selection of data using if statements. I'd like to select every data with the same group. Eg data in var 1 to 30 with a, data with var 1 to 30 in b.
I seem to be writing
If (Var1="a" and Var2="a" and Var3="a" and Var4="a" and all the way to var50=
"a") or (Var1="b" and Var2="a" and Var3="b" and Var4="b" and all the way to var50=
"b")...
How do I consolidate? I tried using an array but it didnt work and i was not sure if arrays work in the IF and then statement.
IF (VAR2="A" or VAR2="B" or VAR2="C" or VAR2="D"
or VAR3="A" or VAR3="B" or VAR3="C" or VAR3="D"
or VAR4="A" or VAR4="B" or VAR4="C" or VAR4="D"
or VAR5="A" or VAR5="B" or VAR5="C" or VAR5="D"
or VAR6="A" or VAR6="B" or VAR6="C" or VAR6="D"
or VAR7="A" or VAR7="B" or VAR7="C" or VAR7="D"
or VAR8="A" or VAR8="B" or VAR8="C" or VAR8="C"
or VAR9="A" or VAR9="B" or VAR9="C" or VAR9="D"
or VAR10="A" or VAR10="B" or I10_D10="C" or VAR10="D"
or VAR12="A" or VAR12="B" or VAR12="C" or VAR12="D"
or VAR13="A" or VAR13="B" or VAR13="C" or VAR13="D"
or VAR14="A" or VAR14="B" or VAR14="C" or VAR14="D"
or VAR15="A" or VAR15="B" or VAR15="C" or VAR15="D"
or VAR6="A" or VAR16="B" or VAR16="C" or VAR16="D"
or VAR17="A" or VAR17="B" or VAR17="C" or VAR17="D"
or VAR18="A" or VAR18="B" or VAR18="C" or VAR18="C"
or VAR19="A" or VAR19="B" or VAR19="C" or I10_D19="D"
or VAR20="A" or VAR20="B" or I10_D20="C" or VAR20="D"
or VAR21="D" or VAR22="A" or VAR22="B" or VAR22="C" or VAR22="D"
or VAR23="A" or VAR23="B" or VAR23="C" or VAR23="D"
or VAR24="A" or VAR24="B" or VAR24="C" or VAR24="D"
or VAR25="A" or VAR25="B" or VAR25="C" or VAR25="D"
or VAR26="A" or VAR26="B" or VAR26="C" or VAR26="D"
or VAR27="A" or VAR27="B" or VAR27="C" or VAR27="D"
or VAR28="A" or VAR28="B" or VAR28="C" or VAR28="C"
or VAR29="A" or VAR29="B" or VAR29="C" or VAR29="D"
or VAR30="A" or VAR30="B" or I10_D30="C" or VAR30="D")
then Group=1; else Group=0;
You probably don't need a macro, however a macro might be faster.
%let value=a;
data want;
set have;
array var[50];
keepit=1;
do i=1 to 50;
keepit = keepit and (var[i]="&value");
if ^keepit then
leave;
end;
if keepit;
drop i keepit;
run;
I create a signal variable and update it's value, it will be false if any value in the var[] array is not the &value. I leave the loop early if we find 1 non-matching value, to make it more efficient.
It's not exactly clear what you want. If you want to avoid checking all variables you can use WHICHC to find if any in a list are A.
X = whichc('a', of var1-var30);
If you want to see what different groups you have across all the variables, I think a big proc freq is what you want:
proc freq data=have noprint;
table var1*var2*var3*var4....*var30*gender*age / list out=table_counts;
run;
And then check the table_counts data set to see if that has what you want.
If neither of these are what you want, you need to add more details to your question. A sample of data and expected output would be perfect.
When I need to search several variables for a particular value what I will do is - combine all variables into one string and then search that string. Like this:
*** CREATE TEST DATA ***;
data have;
infile cards;
input VAR1 $ VAR2 $ VAR3 $ VAR4 $ VAR5 $;
cards;
J J K A M
S U I O P
D D D D D
l m n o a
Q U J C S
;
run;
data want;
set have;
*** USE CATS FUNCTION TO CONCATENATE ALL VAR# INTO ONE VARIABLE ***;
allvar = cats(var1, var2, var3, var4, var5);
*** IF NEEDED, APPLY UPCASE TO CONCATENATED VARIABLE ***;
*allvar = upcase(allvar);
*** USE INDEXC FUNCTION TO SEARCH NEW VARIABLE ***;
if indexc(allvar, "ABCD") > 0 then group = 1;
else group = 0;
run;
I'm not sure if this is exactly what you need, but hopefully this is something you can modify for your particular task.
The code as posted is testing if ANY of a list of variables match ANY of a list of values.
Let's make a simple test dataset.
data have ;
input id (var1-var5) ($);
cards;
1 E F G H I
2 A B C D E
;;;;
Make one array of the values you want to find and one array of the variables you want to check. Loop over the list of variables until you either find one that contains one of the values or you run out of variables to test.
data want ;
set have;
array values (4) $8 _temporary_ ('A' 'B' 'C' 'D');
array vars var1-var5 ;
group=0;
do i=1 to dim(vars) until (group=1);
if vars(i) in values then group=1;
end;
drop i;
run;
You could avoid the array for the list of values if you want.
if vars(i) in ('A' 'B' 'C' 'D') then group=1;
But using the array will allow you to make the loop run over the list of values instead of the list of variables.
do i=1 to dim(values) until (group=1);
if values(i) in vars then group=1;
end;
Which might be important if you wanted to keep the variable i to indicate which value (or variable) was first matched.
Related
I have a dataset test1, I want to generate a key which is the combination of any of the specified variables. For example, the key in ideal_1, or the key in ideal_2. I need to write a macro for this, but the challenges for me is that the number of the vars are not fixed, as you can see in ideal1, it is the combination of 2, and in ideal3 it is the combination of 3.
data test1;
input var1$ var2$ var3$ var4$ var5$ var6$;
datalines;
1 a a b e
2 a f b e
3 a a a a
1 b a a a
2 a f b e
;
run;
data ideal_1;
set test1;
key=strip(var1)||strip(var2);
run;
data ideal_2;
set test1;
key=strip(var1)||strip(var2)||strip(var5);
run;
Just use a variable list. You could store the list into a macro variable to make it easier to edit.
%let keylist=var1 var2 var5 ;
Then you can use the macro variable where ever you need it.
data ideal_2;
set test1;
key=cats(of &keylist);
run;
If the variables have a naming convention as in your example you can use something like the following, which uses the colon operator to concatenate all of the variables that start with the prefix VAR.
key = catt(of var:);
I am trying to develop a recursive program to in missing string values using flat probabilities (for instance if a variable had three possible values and one observation was missing, the missing observation would have a 33% of being replace with any value).
Note: The purpose of this post is not to discuss the merit of imputation techniques.
DATA have;
INPUT id gender $ b $ c $ x;
CARDS;
1 M Y . 5
2 F N . 4
3 N Tall 4
4 M Short 2
5 F Y Tall 1
;
/* Counts number of categories i.e. 2 */
proc sql;
SELECT COUNT(Unique(gender)) into :rescats
FROM have
WHERE Gender ~= " " ;
Quit;
%let rescats = &rescats;
%put &rescats; /*internal check */
/* Collects response categories separated by commas i.e. F,M */
proc sql;
SELECT UNIQUE gender into :genders separated by ","
FROM have
WHERE Gender ~= " "
GROUP BY Gender;
QUIT;
%let genders = &genders;
%put &genders; /*internal check */
/* Counts entries to be evaluated. In this case observations 1 - 5 */
/* Note CustomerKey is an ID variable */
proc sql;
SELECT COUNT (UNIQUE(customerKey)) into :ID
FROM have
WHERE customerkey < 6;
QUIT;
%let ID = &ID;
%put &ID; /*internal check */
data want;
SET have;
DO i = 1 to &ID; /* Control works from 1 to 5 */
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 and 2 */
RandGender = (ROUND(u*(&rescats - 1)) + 1)*1;
/* PROBLEM Should if gender is missing set string value of M or F */
IF gender = ' ' THEN gender = SCAN(&genders, RandGender, ',');
END;
RUN;
I the SCAN function does not create a F or M observation within gender. It also appears to create a new M and F variable. Additionally the DO Loop creates addition entry under within CustomerKey. Is there any way to get rid of these?
I would prefer to use loops and macros to solve this. I'm not yet proficient with arrays.
Here is my attempt at tidying this up a little:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
INPUT id gender $ b $ c $ x;
infile cards dlm=',';
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
/*Consolidated into 1 proc, addded noprint and removed unnecessary group by*/
proc sql noprint;
/* Counts number of categories i.e. 2 */
SELECT COUNT(unique(gender)) into :rescats
FROM have
WHERE not(missing(Gender));
/* Collects response categories separated by commas i.e. F,M */
SELECT unique gender into :genders separated by ","
FROM have
WHERE not(missing(Gender))
;
Quit;
/*Removed redundant %let statements*/
%put rescats = &rescats; /*internal check */
%put genders = &genders; /*internal check */
/*Removed ID list code as it wasn't making any difference to the imputation in this example*/
data want;
SET have;
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 or 2 */
RandGender = ROUND(u*(&rescats - 1)) + 1;
IF missing(gender) THEN gender = SCAN("&genders", RandGender, ','); /*Added quotes around &genders to prevent SAS interpreting M and F as variable names*/
RUN;
Halo8:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
INPUT id gender $ b $ c $ x;
infile cards dlm=',';
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
run;
Tip: You can use a dot (.) to mean a missing value for a character variable during INPUT.
Tip: DATALINES is the modern alternative to CARDS.
Tip: Data values don't have to line up, but it helps humans.
Thus this works as well:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
INPUT id gender $ b $ c $ x;
DATALINES;
1 M Y . 5
2 F N . 4
3 . N Tall 4
4 M . Short 2
5 F Y Tall 1
;
run;
Tip: Your technique requires two passes over the data.
One to determine the distinct values.
A second to apply your imputation.
Most approaches require two passes per variable processed. A hash approach can do only two passes but requires more memory.
There are many ways to deteremine distinct values: SORTING+FIRST., Proc FREQ, DATA Step HASH, SQL, and more.
Tip: Solutions that move data to code back to data are sometimes needed, but can be troublesome. Often the cleanest way is to let data remain data.
For example: INTO will be the wrong approach if the concatenated distinct values would require more than 64K
Tip: Data to Code is especially troublesome for continuous values and other values that are not represented exactly the same when they become code.
For example: high precision numeric values, strings with control-characters, strings with embedded quotes, etc...
This is one approach using SQL. As mentioned before, Proc SURVEYSELECT is far better for real applications.
Proc SQL;
Create table REPLACEMENTS as select distinct gender from have where gender is NOT NULL;
%let REPLACEMENT_COUNT = &SQLOBS; %* Tip: Take advantage of automatic macro variable SQLOBS;
data REPLACEMENTS;
set REPLACEMENTS;
rownum+1; * rownum needed for RANUNI matching;
run;
Proc SQL;
* Perform replacement of missing values;
Update have
set gender =
(
select gender
from REPLACEMENTS
where rownum = ceil(&REPLACEMENT_COUNT * ranuni(1234))
)
where gender is NULL
;
%let SYSLAST = have;
DM 'viewtable have' viewtable;
You don't have to be concerned about columns not having a missing value because no replacement would occur in those. For columns having a missing the list of candidate REPLACEMENTS excludes the missing and the REPLACEMENT_COUNT is correct for computing the uniform probability of replacement, 1/COUNT, coded as rownum = ceil (random)
I have a question to create a new variable.I have several variables named A,B,C,D,E,F,G.All variables are 0/1 binary variable.So I want to create a new variable which shows any 3 or more those variables equal to 1.
For example,
new_variable =0;
if ANY 3 or more variables(A,B,C,D,E,F,G) =1 then new_variable =1;
There's no way sort of a way to do the syntax like you have, but since you're smart and have 0/1 binaries, there's a very easy way if you think about it a sec, to see if 3 or more are 1.
if sum(of a b c d e f g) >= 3 then new_Variable=1;
Actually a bit simpler:
new_Variable = (sum(of a b c d e f g) GE 3);
as true=1 false=0 when you evaluate a boolean expression.
If your data are in an array or with a common prefix, there is a way to do that more easily:
new_variable = (sum(of arrayname[*]) GE 3);
or
new_variable = (sum(of varprefix:) GE 3);
where arrayname is your array or varprefix is the common prefix your variables (and only your variables) share.
Edit: There is, sort of, a way to do this in a similar kind of syntax. Using countc:
data have;
call streaminit(7);
array vars[7] a b c d e f g;
do _n_ = 1 to 20;
do _i = 1 to dim(vars);
vars[_i] = rand('Binomial',.2,1);
end;
output;
end;
run;
data want;
set have;
if countc(cats(of a--g),'1') ge 3;
run;
If you had something other than 1/0, you could use catx to delimit them with a space or something, and then countw to look for the complete value; here, 11 will look like two 1s not eleven, if that were possible in the data.
There are a lot of other solutions, by the way; maybe some others will come and mention them. CALL SORTN and then look for the first instance of 1, for example.
I would like to create a variable called DATFL that would have the following values for the last obseration :
DATFL
gender/scan
Here is the code :
data mix_ ;
input id $ name $ gender $ scan $;
datalines;
1 jon M F
2 jill F L
3 james F M
4 jonas M M
;
run;
data mix_3; set mix_;
length datfl datfl_ $ 50;
array m4(*) id name gender scan;
retain datfl;
do i=1 to dim(m4);
if index(m4(i) ,'M') then do;
datfl_=vname(m4(i)) ;
if missing(datfl) then datfl=datfl_;
else datfl=strip(datfl)||"/"||datfl_;
end;
end;
run;
Unfortunately, the value I get for 'DATFL' at the last observation is 'gender/scan/gender/scan'.Obviously because of the retain statement that I used for 'DATFL' I ended up with duplicates. At the end of this data step, I was planning to use a CALL SYMPUT statement to load the last value into macro variable but I won't do it until I fix my issue...Can anyone provide me with a guidance on how to prevent 'DATFL' to have duplicates value at the end of the dataset ? Cheers
sas_kappel
Don't retain DATFL, Instead, retain DATFL_.
data mix_3; set mix_;
length datfl datfl_ $ 50;
array m4(*) id name gender scan;
retain datfl_;
do i=1 to dim(m4);
if index(m4(i) ,'M') then do;
datfl_=vname(m4(i)) ;
if missing(datfl) then datfl=datfl_;
else datfl=strip(datfl)||"/"||datfl_;
end;
end;
if missing(datfl) then datfl = datfl_;
run;
It doesn't work...Let me change the dataset (mix_) and you can see that RETAIN DATFLl_, is not working in this scenario.
data mix_ ;
input id $ name $ gender $ scan $;
datalines;
1 jon M M
2 Marc F L
3 james F M
4 jonas H M
;
run;
To resume, what I want is to have the DISTINCT value of DATFL, into a macro variable. The code that I proposed does,for each records,a search for variables having the letter M, if it true then DATFL receives the variable name of the array variable. If there are multiple variable names then they will be separated by '/'. For the next records, do the same, BUT add only variable names satisfying the condition AND the variables that were not already kept in DATFL. Currently, if you run my program I have for DATFL at observation 4, DATFL=gender/scan/name/scan/scan but I would like to have DATFL=gender/scan/name , because those one are the distinct values. Ultimatlly, I will then write the following code;
if eof then CALL SYMPUT('DATFL',datfl);
sas_kappel
Your revised data makes it much clearer what you're looking for. Here is some code that should give the correct result.
I've used the CALL CATX function to add new values to DATFL, separated by a /. It first checks that the relevant variable name doesn't already exist in the string.
data mix_ ;
input id $ name $ gender $ scan $;
datalines;
1 jon M M
2 Marc F L
3 james F M
4 jonas H M
;
run;
data _null_;
set mix_ end=eof;
length datfl $100; /*or whatever*/
retain datfl;
array m4{*} $ id name gender scan;
do i = 1 to dim(m4);
if index(m4{i},'M') and not index(datfl,vname(m4{i})) then call catx('/',datfl,vname(m4{i}));
end;
if eof then call symput('DATFL', datfl);
run;
%put datfl = &DATFL.;
I have a SAS dataset as follow :
Key A B C D E
001 1 . 1 . 1
002 . 1 . 1 .
Other than keeping the existing varaibales, I want to replace variable value with the variable name if variable A has value 1 then new variable should have value A else blank.
Currently I am hardcoding the values, does anyone has a better solution?
The following should do the trick (the first dstep sets up the example):-
data test_data;
length key A B C D E 3;
format key z3.; ** Force leading zeroes for KEY;
key=001; A=1; B=.; C=1; D=.; E=1; output;
key=002; A=.; B=1; C=.; D=1; E=.; output;
proc sort;
by key;
run;
data results(drop = _: i);
set test_data(rename=(A=_A B=_B C=_C D=_D E=_E));
array from_vars[*] _:;
array to_vars[*] $1 A B C D E;
do i=1 to dim(from_vars);
to_vars[i] = ifc( from_vars[i], substr(vname(from_vars[i]),2), '');
end;
run;
It all looks a little awkward as we have to rename the original (assumed numeric) variables to then create same-named character variables that can hold values 'A', 'B', etc.
If your 'real' data has many more variables, the renaming can be laborious so you might find a double proc transpose more useful:-
proc transpose data = test_data out = test_data_tran;
by key;
proc transpose data = test_data_tran out = results2(drop = _:);
by key;
var _name_;
id _name_;
where col1;
run;
However, your variables will be in the wrong order on the output dataset and will be of length $8 rather than $1 which can be a waste of space. If either points are important (they rsldom are) and both can be remedied by following up with a length statement in a subsequent datastep:-
option varlenchk = nowarn;
data results2;
length A B C D E $1;
set results2;
run;
option varlenchk = warn;
This organises the variables in the right order and minimises their length. Still, you're now hard-coding your variable names which means you might as well have just stuck with the original array approach.