How to recode a qualitative variable with over 100 dummies into several levels as a quantitative variable in SAS - sas

I am working with SAS and want to recode a variable that has more than 50 distinct qualitative values, for example the U.S. states.
I just want to collapse them into 4 or 5 levels stored as a quantitative variable.
I have several ideas, for example using if/else statements. The problem is that I have to write out each area name in SAS, which makes the code very heavy.
Is there another way to do this without redundant code, or without writing out each specific value of the variable, in SAS?
Any ideas are appreciated!

Method 1:
Use IN, but you still have to list the values. You can also do it via a format, but you have to define the format first anyway.
if state in ('AL', 'AK', 'AZ' ... etc) then state_group = 1;
else if state in ( .... ) then state_group = 2;
Method 2:
For a format, you create format using PROC FORMAT and then apply it.
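proc format;
value $state_grp_fmt
'AL', 'AK', 'AZ' = '1'
'DC', 'NC' = '2' ;
run;
And then you can use it with the PUT function. Note the $ prefix and the trailing period in the format reference. PUT returns a character value, so wrap it in INPUT if you need a numeric variable.
State_Group = input(put(state, $state_grp_fmt.), 8.);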
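If the state-to-group mapping already lives in a dataset, the format can be built from it instead of being typed into PROC FORMAT. The following is a minimal sketch using the CNTLIN= option. The dataset names state_lookup, fmt_input, have and want, and the group assignments, are hypothetical and only serve as illustration.

```sas
/* Hypothetical lookup table: one row per state with its group number */
data state_lookup;
  input state $ group;
  datalines;
AL 1
AK 1
AZ 1
DC 2
NC 2
;
run;

/* Hypothetical input data with a character STATE variable */
data have;
  input state $;
  datalines;
AL
NC
AZ
;
run;

/* Reshape the lookup into the layout PROC FORMAT expects for CNTLIN= */
data fmt_input;
  set state_lookup;
  retain fmtname 'state_grp_fmt' type 'C';  /* type 'C' marks a character format */
  start = state;
  label = strip(put(group, 8.));            /* format labels are character */
  keep fmtname type start label;
run;

proc format cntlin=fmt_input;
run;

data want;
  set have;
  /* PUT applies the format; INPUT converts the label to a number.   */
  /* ?? suppresses notes for states that are missing from the lookup */
  state_group = input(put(state, $state_grp_fmt.), ?? 8.);
run;
```

With this layout the list of states is maintained as data rather than code, so adding a state means adding a row to the lookup table.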

Related

Searching for Easy Way in SAS to Vertically Stack Different Variable Names to One

I am faced with a situation where the vendor supplied differently named count variables when one variable would actually suffice. Imagine the following very simplified version of my SAS code. There are five variables of interest, and I OUTPUT the rows to 2 tables that can be stacked once I RENAME the key variable in each file.
data
ABA ADDI;
set zach.COMMERCIAL_A12;
keep
PRODUCT_DESC
ABA_NUMERCNT
ADD_INITIATION_NUMERCNT;
if ABA_DENOMCNT = 1 then output ABA;
if ADD_INITIATION_DENOMCNT = 1 then output ADDI;
run;
Right now the program creates the two new OUTPUT files = ABA and ADDI. Each of the files has the same three variables from my KEEP. Later on I will stack them. So for ABA I wish to keep only PRODUCT_DESC & ABA_NUMERCNT and for ADDI I wish to keep PRODUCT_DESC & ADD_INITIATION_NUMERCNT. But before stacking them I would like to sort of automate it so that ABA_NUMERCNT becomes VarTemp and ADD_INITIATION_NUMERCNT again becomes VarTemp before they are stacked.
Is there an easy way to do this?
Looks like you want to use dataset options on your output datasets. It is a little hard to follow the details of your question but it looks like you want something like:
data
ABA (rename=(ABA_NUMERCNT=VarTemp) drop=ADD_INITIATION_NUMERCNT)
ADDI (rename=(ADD_INITIATION_NUMERCNT=VarTemp) drop=ABA_NUMERCNT)
;
set zach.COMMERCIAL_A12;
keep PRODUCT_DESC ABA_NUMERCNT ADD_INITIATION_NUMERCNT;
if ABA_DENOMCNT = 1 then output ABA;
if ADD_INITIATION_DENOMCNT = 1 then output ADDI;
run;
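Once both output datasets share the same variable names, stacking them is a single SET statement. A minimal sketch, assuming ABA and ADDI each contain PRODUCT_DESC and a common VarTemp variable; the names stacked, src and source are hypothetical:

```sas
/* Assumes ABA and ADDI each hold PRODUCT_DESC and a common VarTemp column */
data stacked;
  length source $32;
  set ABA ADDI indsname=src;   /* src holds the two-level name, e.g. WORK.ABA */
  source = scan(src, 2, '.');  /* keep just the member name to mark the origin */
run;
```

The source column is optional. It only records which dataset each row came from after the two are combined.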

Delete all observations starting with a list of values from database (SAS)

I am trying to find the best way to do this:
I want to delete all observations where a character variable STARTS with any of several possible strings, such as:
"Subtotal" "Including:"
So if the value starts with any of these strings (or many others that I didn't write here), the observation should be deleted from the dataset.
The best solution would be a macro variable containing all the values, but I don't know how to handle it. (I tried %let list = Subtotal Including: but SAS treats them as variables while they are values.)
I did this :
data a ; set b ;
if findw(product,"Subtotal") then delete ;
if findw(product,"Including:") then delete;
...
...
I would appreciate any suggestions! Thanks
First figure out what SAS code you want. Then you can begin to worry about how to use macro logic or macro variables.
Do you just want to exclude the strings that start with the values?
data want ;
set have ;
where product not in: ("Subtotal" "Including");
run;
Or do you want to subset based on the first "word" in the string variable?
where scan(product,1) not in ("Subtotal" "Including");
Or perhaps case insensitive?
where lowcase(scan(product,1)) not in ("subtotal" "including");
Now if the list of values is small enough (less than 64K bytes) then you could put the list into a macro variable.
%let list="Subtotal" "Including";
And then later use the macro variable to generate the WHERE statement.
where product not in: (&list);
You could even generate the macro variable from a dataset of prefix values.
proc sql noprint;
select quote(trim(prefix)) into :list separated by ' '
from prefixes
;
quit;
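Putting the pieces together, the whole flow might look like the sketch below. The datasets have and prefixes and their contents are made-up sample data; only the PROC SQL step and the WHERE statement come from the answer above.

```sas
/* Hypothetical sample data */
data have;
  infile datalines truncover;
  input product $40.;
  datalines;
Widget A
Subtotal East
Including: freight
Widget B
;
run;

/* Hypothetical dataset of prefixes to exclude */
data prefixes;
  length prefix $20;
  input prefix $;
  datalines;
Subtotal
Including
;
run;

/* Build a quoted, space-separated list such as "Subtotal" "Including" */
proc sql noprint;
  select quote(trim(prefix)) into :list separated by ' '
  from prefixes
  ;
quit;

/* The colon modifier compares only as many characters as the shorter value has */
data want;
  set have;
  where product not in: (&list);
run;
```

With this sample data, only the two Widget rows should remain in want.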

Is it possible to filter a data step on a newly computed variable?

In a basic data step I'm creating a new variable and I need to filter the dataset based on this new variable.
data want;
set have;
newVariable = 'aaa';
*lots of computations that change newVariable ;
*if xxx then newVariable = 'bbb';
*if yyy AND not zzz then newVariable = 'ccc';
*etc.;
where newVariable ne 'aaa';
run;
ERROR: Variable newVariable is not on file WORK.have.
I usually do this in 2 steps, but I'm wondering if there is a better way.
(Of course you could always write a complex where statement based on variables present in WORK.have. But in this case the computation of newVariable is too complex, and it is more efficient to do the filter in a 2nd data step.)
I couldn't find any info on this. I apologize if the answer is in the documentation and I missed it. I'll remove the question if needed.
Thanks!
Use a subsetting if statement:
if newVariable ne 'aaa';
In general, if <condition>; is equivalent to if not(<condition>) then delete;. The delete statement tells SAS to abandon this iteration of the data step and go back to the start for the next iteration. Unless you have used an explicit output statement before your subsetting if statement, this will prevent a row from being output.
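As a concrete sketch of the subsetting if in a complete step, the example below swaps in the sashelp.class sample dataset and an arbitrary age rule. Both are assumptions for illustration, not part of the original question.

```sas
data want;
  set sashelp.class;           /* sample dataset shipped with SAS */
  length newVariable $3;
  newVariable = 'aaa';
  /* stand-in for the real computations that change newVariable */
  if age >= 14 then newVariable = 'bbb';
  /* subsetting IF: rows still equal to 'aaa' are not written to WANT */
  if newVariable ne 'aaa';
run;
```

Unlike WHERE, which is evaluated against the input dataset before the row is read, the subsetting if runs at the point where it appears in the step, so it can see variables computed earlier in the same iteration.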

What is the Stata-equivalent of this SAS macro?

I will present the simplified version of what I want to do. I know how to do it easily in SAS but not in Stata.
Let's say I am trying to create a "poor" binary variable = 1 if an observation is classified as poor and 0 otherwise. I want to have two classifications, one is based on real income, and another based on real consumption (these are variables in the dataset).
The SAS macro would be
%MACRO poverty_bin(type=, measure=);
DATA dataset;
SET dataset;
IF &measure. <= poverty_line THEN poor_&type. = 1;
ELSE poor_&type. = 0;
RUN;
%MEND poverty_bin;
%poverty_bin(type=con, measure=real_consumption);
%poverty_bin(type=inc, measure=real_income);
which should create two binary variables poor_con and poor_inc.
I have no idea how to do this in Stata. I tried doing something like this just to see if nested foreach is what I'm looking for:
foreach x of newlist con inc {
foreach y of newlist real_income real_consumption{
display "`x' and `y'"
}
}
But it gives an error message saying "variable real_income already defined"
The error message you cite implies that earlier code you do not show us created a variable real_income.
I do not know SAS but I can tell you that given a numeric variable x
gen y = x <= 42
will create a variable y with value 1 if x <= 42 and 0 otherwise.
For another such variable, use another similar statement. In Stata and perhaps any other language, setting up a nested loop or defining a program instead of making two statements directly seems overkill. For a number of new variables much larger than 2, that might not be true.
foreach v in x y {
gen new`v' = `v' <= 42
}
For completely arbitrary existing names, new names and thresholds it is likely to be easier to write out statements individually.
This is documented. See for example 13.2.2 in [U] or this FAQ.

giving a string variable values conditional on another variable

I am using Stata 14. I have US states and their corresponding regions stored as integers.
I want to create a string variable that represents the region for each observation.
Currently my code is
gen div_name = "A"
replace div_name = "New England" if div_no == 1
replace div_name = "Middle Atlantic" if div_no == 2
.
.
replace div_name = "Pacific" if div_no == 9
So the code is really long.
I was wondering if there is a shorter way to do this where I can automate assigning values rather than manually hard coding them.
You can define value labels in one line with label define and then use decode to create the string variable. See the help for those commands.
If the correspondence were defined in a separate dataset, you could use merge. See e.g. this FAQ.
There can't be a short-cut here other than typing all the names at some point or exploiting the fact that someone else typed them earlier into a file.
With nine or so labels, typing them yourself is quickest.
Note that you type one statement more than you need, even doing it the long way, as you could start
gen div_name = "New England" if div_no == 1