SAS N Function in Stata

Is there a function in Stata equivalent to the SAS N() function?
For example, in SAS,
N(of a1-a10) should result in the count of variables of a1 to a10 with nonmissing values.

The egen functions count() and rownonmiss() produce counts of non-missing values in new variables, the first working column-wise (i.e. down the observations of a variable) and the second operating row-wise (across variables within observations).
Many commands report on missings in various ways, e.g. codebook, inspect and missings (SSC), on one or several variables at a time. On the last, see (e.g.) this forum post. For the others, see help and manual entries as usual, which are also visible over the internet, e.g. the help for codebook.
How to find this out: Note that search missing would have pointed to egen (and much else too, which can't easily be helped).
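To make the column-wise versus row-wise distinction concrete, here is a small language-neutral sketch in Python rather than Stata. The toy data and the helper names count_down and count_across are invented for illustration, and None stands in for a missing value.

```python
# Toy dataset: each dict is one observation; None stands in for a missing value.
rows = [
    {"a1": 1.0, "a2": None, "a3": 3.0},
    {"a1": None, "a2": None, "a3": 2.5},
    {"a1": 4.0, "a2": 5.0, "a3": 6.0},
]

def count_down(rows, var):
    """Nonmissing values of one variable over all observations (column-wise)."""
    return sum(1 for r in rows if r[var] is not None)

def count_across(row, variables):
    """Nonmissing values among several variables within one observation (row-wise)."""
    return sum(1 for v in variables if row[v] is not None)

variables = ["a1", "a2", "a3"]
per_variable = {v: count_down(rows, v) for v in variables}
per_row = [count_across(r, variables) for r in rows]
print(per_variable)  # {'a1': 2, 'a2': 1, 'a3': 3}
print(per_row)       # [2, 1, 3]
```

The per_row result is the analogue of what N(of a1-a3) returns for each observation in SAS, which is the role rownonmiss() plays here.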

Related

Is there a way to generate a new variable only if it meets certain criteria?

I am trying to replicate the following Stata code in R:
gen UAPDL_1=sqrt((((Sanchez_1-Iglesias_1)^2)+((Casado_1-Iglesias_1)^2)+((Rivera_1-Iglesias_1)^2))/3) if maxIglesias_1==1
replace UAPDL_1=sqrt((((Sanchez_1-Rivera_1)^2)+((Casado_1-Rivera_1)^2)+((Iglesias_1-Rivera_1)^2))/3) if maxRivera_1==1
In other words, I am trying to make different calculations and generate a new variable with different values depending on certain conditions (in this case, whether the rows have value 1 in another variable). I managed to create the variables that define the conditions for the calculation (maxIglesias==1 and maxRivera==1), but I am stuck on generating the UAPDL variable. I tried case_when and ifelse, but these only seem to let you assign a fixed value. Is there a way with mutate or dplyr (or any other package) to achieve this goal?

Welcome to SO!
Let me try to 'parse' your question for the sake of clarity.
You want to generate a variable UAPDL depending on the value of two distinct variables (maxIglesias_1 and maxRivera_1, which let's say correspond to values f(I) and f(R), respectively). Here I note that, according to the snippet of code you posted, there is no guarantee that the two conditions are mutually exclusive, i.e., you may have records with maxIglesias_1 == 1 AND maxRivera_1 == 1. In those cases, the order in which you run the commands matters: such records all end up with the value f(R), or f(I) if you swap the two commands.
However, in order to replicate the Stata commands that you posted (issue with the ordering included!) you should run
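UAPDL_1 <- rep(NA_real_, length(maxIglesias_1)) # start all-missing, as Stata's gen ... if leaves unmatched rows missing
UAPDL_1[maxIglesias_1 == 1] <- f(I) # f(I): the Iglesias formula evaluated on the matching rows
UAPDL_1[maxRivera_1 == 1] <- f(R) # runs second, so it overwrites rows where both flags are 1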
where I assume that maxIglesias_1 and maxRivera_1 are two R objects with as many elements as the original Stata dataset has observations.
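The ordering issue is easy to see in a small stand-alone sketch. It is written in Python rather than R or Stata, the records are invented, and the helper name spread is hypothetical; only the column names and the formula come from the question.

```python
import math

def spread(center, others):
    """Root mean squared distance of the other scores from the chosen one."""
    return math.sqrt(sum((o - center) ** 2 for o in others) / len(others))

# Invented records; column names follow the question.
records = [
    {"Sanchez_1": 5, "Casado_1": 7, "Rivera_1": 6, "Iglesias_1": 2,
     "maxIglesias_1": 1, "maxRivera_1": 0},
    {"Sanchez_1": 4, "Casado_1": 8, "Rivera_1": 9, "Iglesias_1": 3,
     "maxIglesias_1": 0, "maxRivera_1": 1},
    {"Sanchez_1": 2, "Casado_1": 2, "Rivera_1": 2, "Iglesias_1": 5,
     "maxIglesias_1": 1, "maxRivera_1": 1},
    {"Sanchez_1": 3, "Casado_1": 2, "Rivera_1": 5, "Iglesias_1": 4,
     "maxIglesias_1": 0, "maxRivera_1": 0},
]

for r in records:
    r["UAPDL_1"] = None  # like gen ... if: unmatched rows stay missing
    if r["maxIglesias_1"] == 1:
        r["UAPDL_1"] = spread(
            r["Iglesias_1"], [r["Sanchez_1"], r["Casado_1"], r["Rivera_1"]])
    if r["maxRivera_1"] == 1:  # runs second, so it wins when both flags are 1
        r["UAPDL_1"] = spread(
            r["Rivera_1"], [r["Sanchez_1"], r["Casado_1"], r["Iglesias_1"]])

for r in records:
    print(r["UAPDL_1"])
```

The third record has both flags set, so it ends up with the Rivera-based value; the fourth matches neither condition and stays missing.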
Good luck!

What does this block of SAS code do?

I have SAS code that I need to partially convert to C++ code, but I am struggling to understand what it does. I have no experience with SAS, and after a few hours of various tutorials and examples I have made very little progress. I don't have access to any of the input data or any corresponding output either. The code has the following format, but I've changed the variable names:
data data1;
set data2;
output;
if type='ABCD' and zone=1 then do;
type='BCDE'; spec='CDE'; sub='ABCD DEF'; output;
type='EFGH'; spec='FGH'; output;
type='ABCD'; spec='DEF';
end;
The code then continues on, however I only need to understand the logic of this if statement. In the actual code there are many of these statements but they all follow the same structure, understanding one should help me to understand them all. The variable values are only important insofar as type and uniqueness, if variables here share a value then that is true in the original code as well, otherwise they are different.
I know that the program is designed to take combinations of type/spec/zone and convert them into other type/spec combinations but I can't seem to grasp the logic.

The DATA and SET statements define the target and source, respectively.
The first OUTPUT statement will ensure that the target has at least one copy of every record read from the source data.
The code inside the DO END block of the IF/THEN statement will cause two additional records to be written when it runs. They will have different values for the TYPE, SPEC and SUB variables as the assignment statements indicate. At the end of the DO block the values of TYPE, SPEC and SUB will have been set to 'ABCD','DEF' and 'ABCD DEF', respectively.
So if your input is
TYPE,SPEC,SUB,ZONE
ABCD,UNK,UNK,0
ABCD,XX,YY,1
UNK,UNK,UNK,0
The values written by the part of the code you posted would be:
TYPE,SPEC,SUB,ZONE
ABCD,UNK,UNK,0
ABCD,XX,YY,1
BCDE,CDE,ABCD DEF,1
EFGH,FGH,ABCD DEF,1
UNK,UNK,UNK,0
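The same logic can be sketched outside SAS. The following Python version is only an approximation: the function name expand is made up, and it ignores SAS details such as fixed character lengths and whatever the rest of the original step does with the final assignments.

```python
def expand(record):
    """Return the rows one input record contributes, mimicking the posted step."""
    rec = dict(record)          # working copy, like the step's current record
    out = [dict(rec)]           # first OUTPUT: every input row is written as-is
    if rec["type"] == "ABCD" and rec["zone"] == 1:
        rec.update(type="BCDE", spec="CDE", sub="ABCD DEF")
        out.append(dict(rec))   # second OUTPUT
        rec.update(type="EFGH", spec="FGH")
        out.append(dict(rec))   # third OUTPUT; sub is still 'ABCD DEF'
        rec.update(type="ABCD", spec="DEF")  # set, but not written by the posted code
    return out

data2 = [
    {"type": "ABCD", "spec": "UNK", "sub": "UNK", "zone": 0},
    {"type": "ABCD", "spec": "XX", "sub": "YY", "zone": 1},
    {"type": "UNK", "spec": "UNK", "sub": "UNK", "zone": 0},
]
data1 = [row for rec in data2 for row in expand(rec)]
for row in data1:
    print(",".join(str(row[k]) for k in ("type", "spec", "sub", "zone")))
```

Run on the three sample rows, this prints the same five rows as listed above.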

SAS - How to determine the number of variables in a used range?

I imagine what I'm asking is pretty basic, but I'm not entirely certain how to do it in SAS.
Let's say that I have a range of variables, or an array, x1-xn. I want to be able to run a program that uses the number of variables within that range as part of its calculation. But I want to write it in such a way that, if I add variables to that range, it will still function.
Essentially, I want to be able to create a variable that if I have x1-x6, the variable value is '6', but if I have x1-x7, the value is '7'.
I know that:
var1=n(of x1-x6)
will return the number of non-missing numeric variables, but I want this to work even if there are missing values.
I hope I explained that clearly and that it makes sense.

Couple of things.
First off, when you put a range like you did:
x1-x7
That will always evaluate to seven items, whether or not those variables exist. That simply evaluates to
x1 x2 x3 x4 x5 x6 x7
So it's not very interesting to ask how many items are in that, unless you're generating that through a macro (and if you are, you probably can have that macro indicate how many items are in it).
But the range x1--x7 or x: both are more interesting problems, so we'll continue.
The easiest way to do this, if the variables are all of a single type (but an unknown type), is to create an array and then use the dim function.
data _null_;
x3='ABC';
array _temp x1-x7;
count = dim(_temp);
put count=;
run;
That doesn't work, though, if there are multiple types (numeric and character) at hand. If there are, then you need to do something more complex.
The next easiest solution is to combine nmiss and n. This works if they're all numeric, or if you're tolerant of the log messages this will create.
data _null_;
x3='ABC';
count = nmiss(of x1-x7) + n(of x1-x7);
put count=;
run;
nmiss gives the number of missing values and n gives the number of nonmissing numeric values, so their sum is the total number of variables. Here x3 is counted with the nmiss group.
Unfortunately, there is not a c version of n, or we'd have an easier time with this (combining c and cmiss). You could potentially do this in a macro function, but that would get a bit messy.
Fortunately, there is a third option that is tolerant of character variables: combining countw with catq.
data _null_;
x3='ABC';
x4=' ';
count = countw(catq('dm','|',of x1-x7),'|','q');
put count=;
run;
This will count all variables, numeric or character, with no conversion notes.
What you're doing here is concatenating all of the variables together with a delimiter between them, so [x1]|[x2]|[x3]..., and then counting the number of "words" in that string, defining a word as anything delimited by "|". Even missing values will create something, so .|.|ABC|.|.|.|. will have 7 "words".
The 'm' argument to CATQ tells it to even include missing values (spaces) in the concatenation. The 'q' argument to COUNTW tells it to ignore delimiters inside quotes (which CATQ adds by default).
If you use a version from before CATQ was available (I believe it was added sometime in 9.2), you can use CATX instead, but you lose the modifiers, so you have more trouble with empty strings and embedded delimiters.
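The join-then-count idea, and why the quoting matters, can be shown with a rough Python analogue. This is not SAS: the csv module only stands in for CATQ and COUNTW, and the values (including one with an embedded delimiter) are invented.

```python
import csv
import io

# Stand-ins for x1-x7: '.' for missing numerics, 'A|BC' with an embedded
# delimiter, and ' ' for a blank character value.
values = [".", ".", "A|BC", " ", ".", ".", "."]

# Join with '|' and quote every field (roughly the role CATQ plays above).
buffer = io.StringIO()
csv.writer(buffer, delimiter="|", quoting=csv.QUOTE_ALL).writerow(values)
joined = buffer.getvalue().rstrip("\r\n")

# Count fields while honouring quotes (roughly COUNTW with the 'q' modifier).
quoted_count = len(next(csv.reader([joined], delimiter="|")))

# A plain split is fooled by the delimiter inside 'A|BC'.
naive_count = len(joined.split("|"))

print(joined)
print(quoted_count, naive_count)  # 7 8
```

The quote-aware count gives 7, one per variable, while the naive split miscounts because of the embedded delimiter. This is the kind of trouble the CATX fallback runs into.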

How to run/not run SAS or SQL code based on conditional output?

I have a SAS program with a macro that will output a different list of variables based on the input criteria. For example, with %MACRO(OPTION1), I get three variables, but with %MACRO(OPTION2), I get four variables. The name of all of the variables is fixed, but it's just a matter of if they are created or not (based on the option).
How can I adjust the macro so that any option inputted by the user will still allow the macro to run? In other words, how can I tell it to ignore some variables if they don't exist.
Fortunately, I am not restricted to any specific procedure, but it would probably have to be either in a DATA step (macro language) or a PROC SQL statement (where clause or some other conditional statement).

This is answerable in general, as an approach to programming.
The first rule:
Use macro parameters explicitly when the amount of code is small.
This means, if you want to (say) do a PROC MEANS on something, but the variable differed, you could do:
%macro run_means(var=);
proc means data=sashelp.class;
var &var.;
run;
%mend run_means;
%run_means(var=height);
%run_means(var=weight);
etc. Don't put conditional logic in the macro; keep it external. This includes lists of variables: make the whole list of variables a parameter rather than writing it into your macro. If it's a long list, make it a macro variable in your main program, and pass that macro variable. Your macro itself should strive to accept what's given; today you have two sets of variables, tomorrow you might have three, or a slightly different set of one or the other. It's easier to change what you pass to the macro than to change the macro.
This concept will feel comfortable to folks used to object oriented programming, in particular the modular approach, although the separation of data is a bit different.
The second rule:
When substantial parts of a macro vary based on a parameter, separate that code into multiple macros.
In this case, let's say you have two things you want to do: run a PROC MEANS, or run a PROC FREQ, depending on if it's a character or numeric variable. Here, I suggest a general rule of not putting all of that into one macro. It's possible, but it's generally a bad idea. Adding to the previous macro, if you wanted to do this for sashelp.class, I'd do it like this:
%macro run_freq(var=);
proc freq data=sashelp.class;
tables &var.;
run;
%mend run_freq;
%run_means(var=height);
%run_means(var=weight);
%run_freq (var=sex);
How you create these may be programmatic. A lot depends on what you're doing and how you're generating the code; and sometimes in the middle of your macro, you generate the value that determines which of the two things you do. I would still write the portion that varies as a separate macro, though; you can then add logic to call the appropriate macro, and allow it to be more legible.
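The two rules carry over to most languages. As an illustration only, here is a Python sketch with invented data and a hypothetical dispatcher named summarize: the workers stay small and take the variable as a parameter, and the choice between them lives in one separate place.

```python
from collections import Counter
from statistics import mean

# Invented rows loosely modelled on sashelp.class.
data = [
    {"sex": "F", "height": 56.5, "weight": 84.0},
    {"sex": "M", "height": 69.0, "weight": 112.5},
    {"sex": "F", "height": 62.8, "weight": 102.5},
]

def run_means(data, var):
    """Small, single-purpose routine: mean of one numeric variable."""
    return mean(row[var] for row in data)

def run_freq(data, var):
    """Separate routine for the other case: frequency table of one variable."""
    return Counter(row[var] for row in data)

def summarize(data, variables):
    """The caller passes the variable list; dispatch stays outside the workers."""
    results = {}
    for var in variables:
        is_numeric = all(isinstance(row[var], (int, float)) for row in data)
        results[var] = run_means(data, var) if is_numeric else run_freq(data, var)
    return results

summary = summarize(data, ["height", "sex"])
print(summary)
```

Adding a third kind of summary then means writing one more small routine and one more branch in the dispatcher, without touching the existing workers.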

can't evaluate if statement with variables

I've got experience in a lot of other programming languages, but I'm having a lot of difficulty with Stata syntax. I've got a statement that evaluates with no problem if I put in values, but I can't figure out why it's not evaluating variables like I expect it to.
gen j=5
forvalues i = 1(1)5 {
replace TrustBusiness_local=`i' if TrustBusiness_local2==`j'
replace j=`j'-1
}
If I replace i and j with 1 and 5 respectively, like I'm expecting to happen from the code above, then it works fine, but I get an if not found error otherwise, which hasn't produced meaningful results when Googled. Does anyone see what I don't see? I hate to brute-force something that could so simply be done with a loop.

Easy to understand once you approach it the right way!
Problem 1. You never defined local macro j. That in itself is not an error, but it often leads to errors. Macros that don't exist are equivalent to empty strings, so Stata sees in this example the code
if TrustBusiness_local2==`j'
as
if TrustBusiness_local2==
which is illegal; hence the error message.
Problem 2. There is no connection in principle between a variable you called j and a local macro called j that is referenced using single quotes. A variable in Stata is a variable (namely, a column) in your dataset; it is not a variable in the sense used in other programming languages. Variables in that sense, holding single values, can be kept in Stata in scalars or in macros. Putting a constant into a variable, in the Stata sense, is legal, but usually bad style. If you have millions of observations, for example, you now have a column j with millions of values of 5 within it.
Problem 3. You could, legally, go
local j "j"
so that now the local macro j contains the text "j", which depending on how you use it could be interpreted as a variable name. It's hard to see why you would want to do that here, but it would be legal.
Problem 4. Your whole example doesn't even need a loop as it appears to mean
replace TrustBusiness_local= 6 - TrustBusiness_local2 if inlist(TrustBusiness_local2, 1,2,3,4,5)
and, depending on your data, the if qualifier could be redundant. Flipping 5(1)1 to 1(1)5 is just a matter of subtracting from 6.
Problem 5. Your example written as a loop in Stata style could be
local j = 5
forvalues i = 1/5 {
replace TrustBusiness_local=`i' if TrustBusiness_local2==`j'
local j=`j'-1
}
and it could be made more concise, but given Problem 4 that no loop is needed, I will leave it there.
Problem 6. What you are talking about are, incidentally, not if statements so far as Stata is concerned, as the if qualifier used in your examples is not the same as the if command.
The problem of translating one language's jargon into another can be challenging. See my comments at http://www.stata.com/statalist/archive/2008-08/msg01258.html. After experience in other languages, the macro manipulations of Stata seemed at first strange to me too; they are perhaps best understood as equivalent to shell programming.
I wouldn't try to learn Stata by Googling. Read [U] from beginning to end. (A similar point was made in the reply to your previous question, "use value label in if command in Stata", but you don't want to believe it!)
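For readers coming from other languages, the equivalence claimed under Problem 4 can be checked with a short sketch. This is Python, not Stata; the function names are invented, and None here means "no replacement made", since replace ... if leaves other observations untouched.

```python
def recode_loop(old):
    """Transcription of the loop in Problem 5: i runs 1..5 while j runs 5..1."""
    new = None
    j = 5
    for i in range(1, 6):
        if old == j:
            new = i
        j -= 1
    return new

def recode_direct(old):
    """The one-line form from Problem 4: subtract from 6 for values 1 to 5."""
    return 6 - old if old in (1, 2, 3, 4, 5) else None

# The two agree on in-range and out-of-range values alike.
assert all(recode_loop(v) == recode_direct(v) for v in range(0, 8))
print([recode_direct(v) for v in range(1, 6)])  # [5, 4, 3, 2, 1]
```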
