What is the Stata equivalent of this SAS macro?

I will present the simplified version of what I want to do. I know how to do it easily in SAS but not in Stata.
Let's say I am trying to create a "poor" binary variable = 1 if an observation is classified as poor and 0 otherwise. I want to have two classifications, one is based on real income, and another based on real consumption (these are variables in the dataset).
The SAS macro would be
%MACRO poverty_bin(type=, measure=);
DATA dataset;
SET dataset;
IF &measure. <= poverty_line THEN poor_&type. = 1;
ELSE poor_&type. = 0;
RUN;
%MEND poverty_bin;
%poverty_bin(type=con, measure=real_consumption);
%poverty_bin(type=inc, measure=real_income);
which should create two binary variables poor_con and poor_inc.
I have no idea how to do this in Stata. I tried doing something like this just to see if nested foreach is what I'm looking for:
foreach x of newlist con inc {
foreach y of newlist real_income real_consumption{
display "`x' and `y'"
}
}
But it gives an error message saying "variable real_income already defined"

The error message you cite arises because foreach ... of newlist expects names for variables that do not yet exist, and real_income is already a variable in your dataset.
I do not know SAS but I can tell you that given a numeric variable x
gen y = x <= 42
will create a variable y with value 1 if x <= 42 and 0 otherwise.
For another such variable, use another similar statement. In Stata, and perhaps in any other language, setting up a nested loop or defining a program instead of writing two statements directly seems like overkill. With many more than two new variables that may no longer be true, and a loop such as this one helps:
foreach v in x y {
gen new`v' = `v' <= 42
}
For completely arbitrary existing names, new names and thresholds it is likely to be easier to write out statements individually.
This is documented. See for example 13.2.2 in [U] or this FAQ.
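For readers who do want a reusable analogue of the SAS macro, here is a minimal sketch. It assumes the variables real_consumption, real_income and poverty_line from the question; the program name poverty_bin simply mirrors the SAS macro, and the toy data are made up for illustration. Unlike the one-line gen above, it leaves the new variable missing when the measure is missing, because Stata treats numeric missing as larger than any number and the comparison alone would return 0.

```stata
* Toy data (made up for illustration)
clear
input real_consumption real_income poverty_line
 80 120 100
150  90 100
  .  70 100
end

* Hypothetical analogue of the SAS macro poverty_bin
capture program drop poverty_bin
program define poverty_bin
    * type() supplies the suffix of the new variable; measure() must be an existing numeric variable
    syntax , type(name) measure(varname numeric)
    * 1 if at or below the line, 0 if above, missing if the measure is missing
    generate byte poor_`type' = `measure' <= poverty_line if !missing(`measure')
end

poverty_bin, type(con) measure(real_consumption)
poverty_bin, type(inc) measure(real_income)
list
```

The existing names are passed with foreach ... in or as a varname option rather than of newlist, since newlist is reserved for names that are not yet in the dataset. Note also that SAS sorts missing values below any number, so the SAS macro would classify a missing measure as poor; whether that is wanted is a judgment call the sketch does not make for you.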

Related

SAS: How do I sum observations based on the variable's name for each observation of y to create a new variable?

I'm running SAS code on a data set that is thousands of rows (typical). I need to create 2 new variables in a data step, each holding a row sum for every observation: one over the variables whose names start with X and one over those whose names start with Z. Obviously I cannot write out each variable I need the sum of because that would be impossible in my actual data set. I think the answer is a loop of sorts, but I am not having any luck finding a solution online where I don't need to list all of the variables.
A much smaller example data set is listed below of what I need the data to look like at the end.
So far I tried doing something like this but I KNOW this is so far off, I just am really stuck on how to get it to recognize the variables name and stop when it hits the last X or last Z.
DATA sample1 (drop = i);
set data;
do i = i to 10;
answer = sum(i);
end;
run
You can use a variable list shortcut with the colon (:).
of X: means sum every variable whose name starts with X.
data want;
set have;
sumx = sum(of X:);
sumZ = sum(of Z:);
*if you know the end of the series;
sumx = sum(of X1-X4);
sumZ = sum(of Z1-Z5);
run;
Different ways of specifying the variable list are illustrated here

How to recode a qualitative variable with over 100 dummies into several levels as a quantitative variable in SAS

I am working with SAS and want to recode a variable that has 50+ different qualitative categories, for example the states of the U.S.
In this case, I just want to reduce them to 4 or 5 levels in a quantitative variable.
I have several ideas, for example using if/else statements. The problem is that I have to write down and specify each area name in SAS, and the code gets very heavy.
Is there any other way to do that in SAS without redundant code, or without writing out each specific value of the variable?
Any ideas are appreciated!!
Method 1:
Use IN, but you still have to list the values. You can also do it via a format, but you have to define the format first anyway.
if state in ('AL', 'AK', 'AZ' ... etc) then state_group = 1;
else if state in ( .... ) then state_group = 2;
Method 2:
For a format, you create the format using PROC FORMAT and then apply it.
proc format;
value $state_grp_fmt
'AL', 'AK', 'AZ' = '1'
'DC', 'NC' = '2' ;
run;
And then you can use it with the PUT function, which returns a character value.
State_Group = put(state, $state_grp_fmt.);

Stata macro list uniq extended function (remove duplicates from a macro var list)

This question has been edited to add sample data and clean-up (hopefully) some unnecessary steps per feedback.
I am starting with longitudinal data in wide format. I need to subset, reshape, and perform summary steps for multiple different chunks of data. I want to create macro variables with varlists needed for reshaping and other repetitive steps in wide and long format. The variables being reshaped follow a consistent naming pattern of (prefix)_(name)_#. There are also variables following the same pattern that do not need to be reshaped, and variables that are time-invariant and follow other naming conventions. To generate sample data:
set obs 1
foreach t in 0 6 15 18 21 {
foreach w in score postint postintc constime starttime {
gen p_`w'_`t' = 1
}
}
gen p_miles_0 = 1
gen p_hea_0 = 1
gen cons_age = 1
ds
I want to create two macro vars 1) wide_varlist for wide format data where the variables end in a number and 2) uniquestubs for long format data where the macro list contains just the stubs. I am having trouble using the macro list extended function "uniq" to generate #2 here. Here is my code so far. My full varlists are actually much longer.
Steps to create macro with wide format varlist:
/* create varlist for wide format data a time point 0,6,15,18,21 */
ds p_score_* p_postint_* p_postintc_* p_constime_* p_starttime_*
di "`r(varlist)'"
global wide_varlist `r(varlist)'
Start steps to create macro with long format varlist:
/*copy in wide format varlist*/
global stubs "$wide_varlist"
/*remove # - this results in a macro with 5 dups of same stub*/
foreach mo of numlist 0,6,15,18,21{
global stubs : subinstr global stubs "`mo'" "", all
}
/*keep unique stubs*/
global uniquestubs : list uniq stubs
Everything above works as I intend until global uniquestubs : list uniq stubs, which doesn't create the macro uniquestubs at all.
My situation seems similar to this question but the same solution didn't work for me.
Any thoughts? Appreciate the help.
It's a bit difficult to follow what you are trying to do (a) without a reproducible example (b) because much of your code is just copying the same varlist to different places, which is a distraction.
We can fix (a) by creating a toy dataset:
clear
set obs 1
foreach t in 0 6 15 18 21 {
foreach w in score postint postintc constime starttime {
gen p_`w'_`t' = 1
}
}
ds
p_score_0 p_score_6 p_score_15 p_score_18 p_score_21
p_postint_0 p_postint_6 p_postint_15 p_postint_18 p_postint_21
p_postintc_0 p_postintc_6 p_postintc~5 p_postintc~8 p_postintc~1
p_constime_0 p_constime_6 p_constim~15 p_constim~18 p_constim~21
p_starttim~0 p_starttim~6 p_startti~15 p_startti~18 p_startti~21
Now the main difficulty seems to be that you want stubs for a reshape long. This code suffices for the toy dataset. There is no need to scan yet more variable names with the same information. If you don't have all variables for all time points, you may need more complicated code.
unab stubs: p_*_0
local stubs : subinstr local stubs "0" "", all
di "`stubs'"
p_score_ p_postint_ p_postintc_ p_constime_ p_starttime_
I don't understand the enthusiasm for globals here, but, programming taste aside, you can put the last result in a global quite easily.
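As a minimal sketch of that last step, the code below copies the local into a global. It also shows one likely reason the original list uniq line produced nothing: macro list functions read their arguments as local macro names unless the name is wrapped in global(), so list uniq stubs looks at a local called stubs, which was empty in the question. The toy variables follow the question's naming pattern; the global names dupstubs and uniquestubs2 are made up for illustration.

```stata
* Toy data: two stubs at two time points (names follow the question's pattern)
clear
set obs 1
foreach t in 0 6 {
    foreach w in score postint {
        generate p_`w'_`t' = 1
    }
}

* Stubs from the time-0 variables, then copied into a global
unab stubs : p_*_0
local stubs : subinstr local stubs "_0" "_", all
global uniquestubs `stubs'
display "$uniquestubs"

* Macro list functions treat bare names as local macros,
* so a global has to be referenced as global(name)
global dupstubs p_score_ p_score_ p_postint_
global uniquestubs2 : list uniq global(dupstubs)
display "$uniquestubs2"
```

Replacing "_0" with "_" rather than "0" with "" is a small precaution in case a stub itself contains a zero.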

SAS function that will only use non-missing values for a variable?

I am trying to create a new variable that is the sum of other variables. Should be simple enough, however if one of the variables that is being used in the calculation of the new variable has a missing value, then the new variable has a missing value as well, when I want it to just sum across the remaining non-missing variables. For example, the data may look like:
a b c d e
1 . 3 2 6
The new variable is calculated as
newvar=a+b+c+d+e
For the above row, SAS returns a missing value for newvar because b is missing, when I would like it to return
newvar=a+c+d+e
as the answer. Is there a simple way to get SAS to do this?
Sure thing: just use the SUM function:
data _null_;
a=1;
b=.;
c=3;
d=2;
e=6;
newvar = sum(a,b,c,d,e);
put newvar=;
run;

Perform Fisher Exact Test from aggregated data using Stata

I have a set of data like below:
A B C D
1 2 3 4
2 3 4 5
They are aggregated data in which A, B, C and D constitute a 2x2 table. I need to do a Fisher exact test on each row and add a new column for the p-value of the Fisher exact test for that row.
I can use fisher.test and a loop to do it in R, but I can't find a command in Stata for the Fisher exact test.
You are thinking in R terms, and that is often fruitless in Stata (just as it is impossible for a Stata guy to figure out how to do by ... : regress in R; every package has its own paradigm and its own strengths).
There are no objects to add columns to. Maybe you could say a little bit more as to what you need to do, eventually, with your p-values, so as to find an appropriate solution that your Stata collaborators would sympathize with.
If you really want to add a new column (generate a new variable, speaking Stata), then you might want to look at tabulate and its returned values:
clear
input x y f1 f2
0 0 5 10
0 1 7 12
1 0 3 8
1 1 9 5
end
I assume that your A B C D stand for two binary variables, and the numbers are frequencies in the data. You have to clear the memory, as Stata thinks about one data set at a time.
Then you could tabulate the results and generate new variables containing p-values, although that would be a major waste of memory to create variables that contain a constant value:
tabulate x y [fw=f1], exact
return list
generate p1 = r(p_exact)
tabulate x y [fw=f2], exact
generate p2 = r(p_exact)
Here, [fw=variable] is a way to specify frequency weights; I typed return list to find out what kind of information Stata stores as the result of the procedure. THAT'S the object-like thing Stata works with. R would return the test results in the fisher.test()$p.value component, and Stata creates returned values, r(component) for simple commands and e(component) for estimation commands.
If you want a loop solution (if you have many sets), you can do this:
forvalues k=1/2 {
tabulate x y [fw=f`k'], exact
generate p`k' = r(p_exact)
}
That's the scripting capacity in which Stata, IMHO, is way stronger than R (although it can be argued that this is an extremely dirty programming trick). The local macro k takes values from 1 to 2, and this macro is substituted as `k' everywhere in the curly bracketed piece of code.
Alternatively, you can keep the results in Stata short term memory as scalars:
tabulate x y [fw=f1], exact
scalar p1 = r(p_exact)
tabulate x y [fw=f2], exact
scalar p2 = r(p_exact)
However, the scalars are not associated with the data set, so you cannot save them with the data.
The immediate commands like cci suggested here would also have returned values that you can similarly retrieve.
HTH, Stas
Have a look at the cci command with the exact option:
cci 10 15 30 10, exact
It is part of the so-called "immediate" commands. They allow you to do computations directly from the arguments rather than from data stored in memory. Have a look at help immediate.
Each observation in the poster's original question apparently consisted of the four counts in one traditional 2 x 2 table. Stas's code applied to data of individual observations. Nick pointed out that -cci- can analyze a b c d data. Here's code that applies -cci- to each table and, like Stas's code, adds the p-values to the data set. The forvalues i = 1/`=_N' statement tells Stata to run the loop from the first to the last observation. a[`i'] refers to the value of the variable a in the i-th observation.
clear
input a b c d
10 2 8 4
5 8 2 1
end
gen exactp1 = .
gen exactp2 =.
label var exactp1 "1-sided exact p"
label var exactp2 "2-sided exact p"
forvalues i = 1/`=_N'{
local a = a[`i']
local b = b[`i']
local c = c[`i']
local d = d[`i']
qui cci `a' `b' `c' `d', exact
replace exactp1 = r(p1_exact) in `i'
replace exactp2 = r(p_exact) in `i'
}
list
Note that there is no problem in giving a local macro the same name as a variable.