SAS WHERE statement with several conditions - sas

In SAS PROC FREQ, using a WHERE statement with multiple conditions, I would like to understand why adding a condition causes a frequency to increase.
The first instance:
PROC FREQ;
WHERE X=1 AND Y=1;
TABLE YEARS;
RUN;
Outputs N=100 for a particular year.
But:
PROC FREQ;
WHERE (X=1 AND Y=1) AND A=2 OR B=2;
TABLE YEARS;
RUN;
Outputs larger N than the previous WHERE for the same year, e.g., N=200.
In the second FREQ and WHERE statement I think the condition in parentheses should be evaluated first, before the AND...OR, and should select the same N=100 as the first WHERE statement. The remaining criteria in the line, AND A=2 OR B=2, should then select the subset of those N=100 that have either A=2 or B=2. Consequently, the selected N should be less than or equal to 100, not greater than 100.
This is what I want--the subset of (X=1 AND Y=1)
that also has either A=2 OR B=2--but it does not seem to be what I am getting. Suggestions?
Is this the correct statement for what I want?
WHERE (X=1 AND Y=1 AND A=2) OR (X=1 AND Y=1 AND B=2);
Thank you.

Adding an un-nested OR to a logical expression will always cause the result set to remain the same or become larger.
You need parentheses to change the order of evaluation. When there are no parentheses, all the AND expressions are evaluated first, then the OR expressions.
From the documentation Combining Expressions By Using Logical Operators
Processing Compound Expressions
When SAS encounters a compound WHERE
expression (multiple conditions), the software follows rules to
determine the order in which to evaluate each expression. When WHERE
expressions are combined, SAS processes the conditions in a specific
order:
The NOT expression is processed first.
Then the expressions joined by AND are processed.
Finally, the expressions joined by OR are processed.
Using Parentheses to Control Order of Evaluation
Even though SAS evaluates logical operators in a specific order, you can
control the order of evaluation by nesting expressions in parentheses.
That is, an expression enclosed in parentheses is processed before one
not enclosed. The expression within the innermost set of parentheses
is processed first, followed by the next deepest, moving outward until
all parentheses have been processed.
For example, suppose you want a
list of all the Canadian sites that have both SAS/GRAPH and SAS/STAT
software, so you issue the following expression:
where product='GRAPH' or product='STAT' and country='Canada';
The result, however, includes all sites that license SAS/GRAPH software along with the Canadian
sites that license SAS/STAT software. To obtain the correct results,
you can use parentheses, which causes SAS to evaluate the comparisons
within the parentheses first, providing a list of sites with either
product licenses, then the result is used for the remaining condition:
where (product='GRAPH' or product='STAT') and country='Canada';
So your
WHERE (X=1 AND Y=1) AND A=2 OR B=2;
is the same as
WHERE (X=1 AND Y=1 AND A=2) OR B=2;
What you describe in the question as "this is what I want" is
WHERE (X=1 AND Y=1) AND (A=2 OR B=2);
which is the same (by the distributive law of logic) as
WHERE (X=1 AND Y=1 AND A=2) OR (X=1 AND Y=1 AND B=2);
No matter how you state the expression, adding an OR always has the possibility of increasing the number of items that meet it. An un-nested OR can select more items than a nested (parenthesized) OR.
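The effect of the parentheses can be checked on a small made-up table. The sketch below is in Python rather than SAS, on the assumption that Python's `and`/`or` follow the same precedence (AND before OR); the data and the `count` helper are hypothetical and only illustrate the logic.

```python
from itertools import product

# Hypothetical data: every combination of X, Y, A, B taking the values 1 or 2.
rows = [dict(X=x, Y=y, A=a, B=b) for x, y, a, b in product((1, 2), repeat=4)]

def count(predicate):
    """Number of rows for which the WHERE-style predicate is true."""
    return sum(1 for r in rows if predicate(r))

# WHERE X=1 AND Y=1
base = count(lambda r: r["X"] == 1 and r["Y"] == 1)

# WHERE (X=1 AND Y=1) AND A=2 OR B=2   -> the OR is applied last
unnested = count(
    lambda r: (r["X"] == 1 and r["Y"] == 1) and r["A"] == 2 or r["B"] == 2
)

# WHERE (X=1 AND Y=1) AND (A=2 OR B=2) -> a subset of the base rows
nested = count(
    lambda r: (r["X"] == 1 and r["Y"] == 1) and (r["A"] == 2 or r["B"] == 2)
)

# WHERE (X=1 AND Y=1 AND A=2) OR (X=1 AND Y=1 AND B=2) -> distributed form
distributed = count(
    lambda r: (r["X"] == 1 and r["Y"] == 1 and r["A"] == 2)
    or (r["X"] == 1 and r["Y"] == 1 and r["B"] == 2)
)

print(base, unnested, nested, distributed)  # 4 9 3 3
```

Only the un-nested form exceeds the base count, because every row with B=2 qualifies regardless of X and Y.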


Looping over macro of macros

I've defined a macro of macros:
local my_macros "`macro1' `macro2' `macro3'"
Each of the individual macros has a list of covariates, e.g.
local macro1 "cov1 cov2 cov3"
local macro2 "cov4 cov5 cov6"
local macro3 "cov7 cov8 cov9"
When I loop over my_macros, I want to extract each individual macro. So for example, if I have
for each m in my_macros{
di `m'
}
then it would ideally print the three macros, something like
`macro1'
`macro2'
`macro3'
or
cov1 cov2 cov3
cov4 cov5 cov6
cov7 cov8 cov9
This is because the actual loop I'm running is a regression, and each macro is a list of covariates I want to run. However, the output instead looks like
for each m in my_macros{
di `m'
}
0
0
0
0
0
0
0
0
0
0
So in the full regression loop, only one covariate is being included in a regression at a time. Does anyone know what's going on and how to get each macro as a line of output when I print `my_macros'?
Solution: What you want can be done by nesting macro references.
local macro1 "cov1 cov2 cov3"
local macro2 "cov4 cov5 cov6"
local macro3 "cov7 cov8 cov9"
That's fine. But now the crucial step to loop over such macros could be
forval j = 1/3 {
... `macro`j'' ...
}
where the dots indicate whatever else is needed. Evaluation of macros is exactly like evaluation in elementary algebra or arithmetic whenever parentheses, brackets or braces are used: innermost references are evaluated first, so a reference to macro j is evaluated first.
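As a rough analogy only (this is Python, not Stata), the inside-out evaluation of `macro`j'' can be sketched as building the name first and then looking it up. The `macros` dictionary and the `covariate_lists` function are hypothetical stand-ins for Stata's local macros.

```python
# Hypothetical stand-in for Stata's local macros: names mapped to their text.
macros = {
    "macro1": "cov1 cov2 cov3",
    "macro2": "cov4 cov5 cov6",
    "macro3": "cov7 cov8 cov9",
}

def covariate_lists(n):
    """Return the contents of macro1 .. macro<n>, one covariate list each."""
    lists = []
    for j in range(1, n + 1):
        name = f"macro{j}"          # inner reference: j is substituted first
        lists.append(macros[name])  # outer reference: macro1, macro2, ...
    return lists

for covariates in covariate_lists(3):
    print(covariates)
```

Each pass through the loop yields one whole covariate list, which is what a regression loop over the three sets needs.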
Misunderstandings: The question contains various small and large misunderstandings.
M1. for each is a repeated typo for foreach.
M2. in my_macros is written where only of local my_macros makes sense.
M3. Once you define a macro from three macros each containing three words, the original macros no longer have any identity as three separate entities. The levels are the new macro; its constituent words (here variable names); and the individual characters (not relevant here). To retain such identities you would need to introduce punctuation, say commas, and parse the contents using that punctuation. But here it is easier to use nested references, and not to define a wider macro at all.
M4. Assuming that you really defined my_macros in two steps so that it eventually contained nine variable names, then a loop like
foreach m of local my_macros {
di `m'
}
would be issuing in turn nine commands like
di cov1
Each such command displays the value of each variable in the first observation (it's not obvious that Stata does that, but it's true). That is,
di `m'
(where local macro m contains a variable name) is exactly equivalent to
di `m'[1]
To see the name, i.e. the text inside the macro, here a variable name, and not the value, you would need the statement inside the loop to be
di "`m'"
Hence the double quotes " " insist on the name, not the value, being displayed. Although you don't give a data example or reproducible code, a series of nine (not ten) zeros would be displayed if and only if all those nine variables contain zeros in the first observation.
The same confusion between name and value occurred in your previous thread Stata type mismatch with local macro?

MARIE Assembly if else

Struggle with MARIE Assembly.
I need to write code that has x=3 and y=5: if x>y then it needs to output 1, if x<y it needs to output one.
I have the start but don't know how to do if else statements in MARIE
LOAD X
SUBT Y
SKIPCOND 800
JUMP ELSE
OUTPUT
HALT
Structured statements have a pattern, and each one has an equivalent pattern in assembly language.
The if-then-else statement, for example, has the following pattern:
if ( <condition> )
<then-part>
else
<else-part>
// some statement after if-then-else
Assembly language uses an if-goto-label style.  An if-goto is a conditional test & branch, and a goto alone is an unconditional branch.  These forms alter the flow of control and can be composed to do the same job as structured statements.
The equivalent pattern for the if-then-else in assembly (but written in pseudo code) is as follows:
if ( <condition> is false ) goto if1Else;
<then-part>
goto if1Done;
if1Else:
<else-part>
if1Done:
// some statement after if-then-else
You will note that the first conditional branch (if-goto) needs to branch on condition false.  For example, let's say that the condition is x < 10, then the if-goto should read if ( x >= 10 ) goto if1Else;, which branches on x < 10 being false.  The point of the conditional branch is to skip the then-part (to skip ahead to the else-part) when the condition is false — and when the condition is true, to simply allow the processor to run the then-part, by not branching ahead.
We cannot allow both the then-part and the else-part to execute for the same if-statement's execution.  The then-part, once completed, should make the processor move on to the next statement after the if-then-else, and in particular, to avoid the else-part, since the then-part just fired.  This is done using an unconditional branch (goto without if), to skip ahead around the else-part — if the then-part just fired, then we want the processor to unconditionally skip the else-part.
The assembly pattern for if-then-else statement ends with a label, here if1Done:, which is the logical end of the if-then-else pattern in the if-goto-label style.  Many prefer to name labels after what comes next, but these labels are logically part of the if-then-else, so I choose to name them after the structured statement patterns rather than about subsequent code.  Hopefully, you follow the assembly pattern and see that whether the if-then-else runs the then-part or the else-part, the flow of control comes back together to run the next line of code after the if-then-else, whatever that is (there must be a statement after the if-then-else, because a single statement alone is just a snippet: an incomplete fragment of code that would need to be completed to actually run).
When there are multiple structured statements, like if-statements, each pattern translation must use its own set of labels, hence the numbering of the labels.
(Labels can sometimes be shared between two structured statements, but doing that does not optimize the code in any way, and makes it harder to change.  Sometimes nested statements can result in branches to unconditional branches.  Since these are actual machine code instructions and have runtime costs, they can be optimized, but such optimizations make the code harder to rework, so they should probably be held off until the code is working.)
When two or more if-statements are nested, the pattern is simply applied multiple times.  We can transform the outer if-statement first, or the inner one first.  As long as the pattern is properly applied, the flow of control will work the same in assembly as in the structured statement.
In summary, first compose a larger if-then-else statement:
if ( x < y )
Output(1)
else
Output(one)
(I'm not sure this is what you need, but it is what you said.)
Then apply the pattern transformation into if-goto-label: since, in the abstract, this is the first if-then-else, let's call it if #1, so we'll have two labels if1Done and if1Else.  Place the code found in the structured pattern into the equivalent locations of the if-goto-label pattern, and it will work the same.
MARIE uses SkipCond to form the if-goto statement.  It is typical of machine code to have separate compare and branch instructions, because on many instruction set architectures there are too many operands to encode an if-goto in a single instruction (if x >= y goto Label; has x, y, >=, and Label as operands/parameters).  MARIE uses a subtract followed by a branch relative to 0 (the SkipCond).  There are other write-ups on the specific ways to use it, so I won't go into that here, though you have a good start on that already.
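To make the pattern concrete, here is a minimal sketch in Python of a toy interpreter for a small MARIE-like subset (Load, Subt, Skipcond, Jump, Output, Halt), running the if-goto-label translation of if ( x < y ). The interpreter, the tuple-based program layout, and the choice to output 1 in the then-part and 0 in the else-part are all assumptions made for illustration; the question does not pin down the two outputs. Skipcond 000 is taken to skip the next instruction when AC < 0, 400 when AC = 0, and 800 when AC > 0.

```python
def run(program, memory):
    """Run a tiny MARIE-like subset; return the list of values sent to Output."""
    labels = {label: i for i, (label, _, _) in enumerate(program) if label}
    ac, pc, out = 0, 0, []
    while True:
        _, op, arg = program[pc]
        pc += 1
        if op == "Load":
            ac = memory[arg]
        elif op == "Subt":
            ac -= memory[arg]
        elif op == "Skipcond":
            # 000: skip if AC < 0, 400: skip if AC == 0, 800: skip if AC > 0
            if (arg == "000" and ac < 0) or (arg == "400" and ac == 0) \
                    or (arg == "800" and ac > 0):
                pc += 1
        elif op == "Jump":
            pc = labels[arg]
        elif op == "Output":
            out.append(ac)
        elif op == "Halt":
            return out
        else:
            raise ValueError(f"unsupported instruction: {op}")

# if ( x < y ) Output(1) else Output(0), in if-goto-label form.
PROGRAM = [
    (None, "Load", "X"),
    (None, "Subt", "Y"),          # AC = x - y
    (None, "Skipcond", "000"),    # x < y is true (AC < 0): skip the Jump
    (None, "Jump", "if1Else"),    # condition false: goto if1Else
    (None, "Load", "One"),        # then-part
    (None, "Output", None),
    (None, "Jump", "if1Done"),    # skip around the else-part
    ("if1Else", "Load", "Zero"),  # else-part
    (None, "Output", None),
    ("if1Done", "Halt", None),    # statement after the if-then-else
]

def if_then_else(x, y):
    return run(PROGRAM, {"X": x, "Y": y, "One": 1, "Zero": 0})

print(if_then_else(3, 5))  # [1]
```

With x=3 and y=5 the subtraction leaves a negative AC, so the Jump to if1Else is skipped and only the then-part runs.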

Use of the '<' operator in SAS

I have to convert some SAS code. In other programming languages I am used to < being used in comparisons e.g. in pseudo-code: If x < y then z
In SAS, what is the < operator achieving here:
intck(month,startdate,enddate)-(day(enddate)<day(startdate))
I have been able to understand the functions using the reference documentation but I can't see anything relating to how '<' is being used here.
Just to go into a little more detail about what the code you have there is doing, it's an old school method to determine the number of months from one date to the next (possibly to calculate a birthday, for example).
Originally, SAS functions intck and intnx only calculated the number of "firsts of the month" in between two dates (or similar for other intervals). So INTCK('month','31OCT2020'd, '01NOV2020'd) = 1, while INTCK('month','01OCT2020'd,'30NOV2020'd) = 1. Not ideal! So you'd add in this particular bit of code, -(day(enddate)<day(startdate)), which says "if it has not been a full month yet, subtract one". It's equivalent to this:
if day(enddate) < day(startdate) then diff = intck('month',startdate,enddate) - 1;
else diff = intck('month',startdate,enddate);
There's now a better way to do this (yay!). intck and intnx handle it a bit differently, but it's the same idea. For intck it is the optional method argument, where 'c' for "continuous" compares the same point in the month. For intnx it is the alignment argument, where 's' means "same" (so, move to the same point in the month).
So your code now should be:
intck('month',startdate,enddate,'c')
The symbol < is an operator in that expression. It is not a function call, like INTCK() is in your expression.
SAS evaluates boolean expressions (like the less than test in your example) to 1 for TRUE and 0 for FALSE.
So your expression is subtracting 1 when the day of month of ENDDATE is smaller than the day of month of STARTDATE.
Note: You can also do the reverse, treat a number as a boolean expression. For example in a statement like:
if (BASELINE) then PERCENT_CHANGE = (VALUE-BASELINE) / BASELINE ;
A missing value or a value of zero in BASELINE will be treated as FALSE and so in those cases the assignment statement does not run.
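If you are porting this to another language, the same trick works wherever a comparison can be used as a number. A minimal sketch in Python, where `True`/`False` behave as 1/0 in arithmetic; the helper names `months_crossed` and `full_months` are made up here, and this only mirrors the old-school expression, not every end-of-month rule of the SAS 'c' method:

```python
from datetime import date

def months_crossed(start, end):
    """Count month boundaries between two dates (the discrete-style count)."""
    return (end.year - start.year) * 12 + (end.month - start.month)

def full_months(start, end):
    # The comparison yields True/False, which Python treats as 1/0 in
    # arithmetic, mirroring how the SAS expression subtracts the result of <.
    return months_crossed(start, end) - (end.day < start.day)

print(full_months(date(2020, 10, 31), date(2020, 11, 1)))   # 0
print(full_months(date(2020, 10, 1), date(2020, 11, 30)))   # 1
```

The first pair crosses one month boundary but has not completed a full month, so the comparison subtracts 1.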

SAS N Function in Stata

Is there a function in Stata equivalent to the SAS N() function?
For example, in SAS,
N(of a1-a10) should result in the count of variables of a1 to a10 with nonmissing values.
The egen functions count() and rownonmiss() produce counts of non-missing values in new variables, the first working column-wise (down the observations of a variable) and the second operating row-wise (across variables within observations). The row-wise one matches N(of a1-a10), e.g. egen n = rownonmiss(a1-a10).
Many commands report on missings in various ways, e.g. codebook, inspect and missings (SSC), on one or several variables at a time. On the last, see (e.g.) this forum post. For the others, see help and manual entries as usual, which are also visible over the internet, e.g. the help for codebook.
How to find this out: Note that search missing would have pointed to egen (and much else too, which can't easily be helped).
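The column-wise versus row-wise distinction can be sketched outside Stata as well. The Python below is only an illustration: the data, with `None` standing in for a missing value, and the helper names `row_nonmiss` and `col_count` are hypothetical.

```python
# Hypothetical observations; None stands in for a missing value.
rows = [
    {"a1": 4, "a2": None, "a3": 7},
    {"a1": None, "a2": None, "a3": 2},
    {"a1": 1, "a2": 5, "a3": 9},
]

def row_nonmiss(row, names):
    """Row-wise: how many of the named variables are nonmissing in one row."""
    return sum(row[name] is not None for name in names)

def col_count(data, name):
    """Column-wise: how many observations have a nonmissing value of name."""
    return sum(row[name] is not None for row in data)

names = ["a1", "a2", "a3"]
print([row_nonmiss(row, names) for row in rows])    # [2, 1, 3]
print({name: col_count(rows, name) for name in names})
```

The first print corresponds to one count per observation, which is what the SAS N() function returns for a list of variables.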

SAS: Difference between IF-THEN and IF-THEN-DO Statements?

I am new to SAS and would like to know what the difference is between "IF-THEN" and "IF-THEN-DO" statements in SAS.
Put simply, IF-THEN is for a single statement, and IF-THEN-DO is for a block of statements. An IF without THEN in a DATA step is a subsetting IF: observations that fail the condition are not processed further and are not written to the output data set.
Example:
data x;
set y;
if a = 1 then /* one statement follows */
b=2;
if a = 1 then do; /* a block of statements follows, up to the end statement, similar to brackets in other programming languages */
b=2;
c=3;
end;
if a = 1; /*only when a = 1 data will be written to x*/
run;
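For readers coming from other languages, the three forms can be sketched as a loop over observations. This Python version is only an analogy under stated assumptions: `data_step` is a made-up function, and the input `y` is a list of dictionaries standing in for the data set y.

```python
def data_step(y):
    """Rough analogy of the DATA step above: returns the rows written to x."""
    x = []
    for obs in y:
        obs = dict(obs)          # work on a copy, like the program data vector
        if obs["a"] == 1:        # IF-THEN: exactly one statement
            obs["b"] = 2
        if obs["a"] == 1:        # IF-THEN-DO ... END: a block of statements
            obs["b"] = 2
            obs["c"] = 3
        if not obs["a"] == 1:    # subsetting IF: stop processing this row
            continue
        x.append(obs)            # implicit output at the end of the step
    return x

print(data_step([{"a": 1}, {"a": 0}]))  # [{'a': 1, 'b': 2, 'c': 3}]
```

Only the observation with a = 1 reaches the output, and it carries both assignments from the DO block.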
SAS evaluates the expression in an IF-THEN statement to produce a result that is either non-zero, zero, or missing. A non-zero and nonmissing result causes the expression to be true; a result of zero or missing causes the expression to be false.
If the conditions that are specified in the IF clause are met, the IF-THEN statement executes a SAS statement for observations that are read from a SAS data set, for records in an external file, or for computed values. An optional ELSE statement gives an alternative action if the THEN clause is not executed. The ELSE statement, if used, must immediately follow the IF-THEN statement.
Using IF-THEN statements without the ELSE statement causes SAS to evaluate all IF-THEN statements. Using IF-THEN statements with the ELSE statement causes SAS to execute IF-THEN statements until it encounters the first true statement. Subsequent IF-THEN statements are not evaluated. (Source: support.sas.com)
The DO statement is the simplest form of DO group processing. The statements between the DO and END statements are called a DO group. You can nest DO statements within DO groups.
A simple DO statement is often used within IF-THEN/ELSE statements to designate a group of statements to be executed depending on whether the IF condition is true or false. (Source: support.sas.com)
Regards,
Vasilij