Problems in creating dummy variables - stata

The treatment boroughs are five boroughs with ID 1, 2, 3, 6, 14. The "Operation Theseus" policy lasts from week 80 to week 85.
ocu: borough ID
I tried creating dummies for treated and time, but they are zero for all observations:
gen treated =0 if missing(ocu)==0
replace treated =1 if ocu==1/2/3/6/14
gen time = (week==80-85) & !missing(week)

ocu == 1/2/3/6/14 is a legal expression, but is likely to be a long way from what you want: / is division, so the right-hand side evaluates to 1/504, which no borough ID equals.
ocu == 1 | ocu == 2 | ocu == 3 | ocu == 6 | ocu == 14
is legal but long-winded, and
inlist(ocu, 1, 2, 3, 6, 14)
is legal and likely to be more appealing as an expression for: does ocu take on any of the values specified?
Although Stata supports | as an "or" operator (and not / for that purpose), note that
ocu == 1 | 2 | 3 | 6 | 14
is legal but almost never what anyone would want, as it is parsed as
(ocu == 1) | 2 | 3 | 6 | 14
and will always evaluate to 1 (true), regardless of the value of ocu, because any one of the other arguments 2, 3, 6, 14 being non-zero makes the entire expression 1 (true).
The expression week==80-85 is also legal, but it does not mean week between 80 and 85. Stata evaluates week == 80-85 by applying the subtraction first, and so tests for equality with -5. See the precedence rules as documented in help operators:
The order of evaluation (from first to last) of all operators is ! (or ~), ^, - (negation), /, *, - (subtraction), +, != (or ~=), >, <, <=, >=, ==, &, and |.
Subtraction comes before testing for equality.
You may want week >= 80 & week <= 85 or inrange(week, 80, 85).
If week is between 80 and 85, then it can't be missing. That test is redundant (but harmless).

strfmt range in x++

What's wrong with this range?
rangeTransDate = strFmt('(("%1.%2" <= "%3" && "%3" == "%5") || ("%1.%4" > "%3"))',tableStr(CustomTable),fieldStr(CustomTable,TransDate), date2str(dateTo,321,2,0,2,0,4),fieldStr(CustomTable,SettlementDate),SysQuery::valueEmptyString());
I'm getting this error:
Query extended range error: Right parenthesis expected next to position 72.
This page of AX 2012 documentation is still relevant (I cannot find an AX365 version). Highlighting the important bit gives:
The rules for creating query range value expressions are:
Enclose the whole expression in parentheses.
Enclose all subexpressions in parentheses.
Use the relational and logical operators available in X++.
Only use field names from the range's data source.
Use the dataSource.field notation for fields from other data sources in the query.
This means that X++ expects parentheses around every comparison (a.k.a. "subexpression" in the documentation). You are missing some.
Also, use the date2strxpp() function to properly handle all date to string conversions. This function can handle empty date values (dateNull()) by translating these to 1900-01-01. I doubt that putting an empty string (SysQuery::valueEmptyString()) in there will work.
So try this, the commented subexpression levels show the bracket groupings:
// subexpressions lvl 2: 4                                                       4
// subexpressions lvl 1: |1               1    2            2    3              3|
//                       ||               |    |            |    |              ||
rangeTransDate = strFmt('(("%1.%2" <= "%3") && ("%3" == "%5") || ("%1.%4" > "%3"))',
                        tableStr(CustomTable),
                        fieldStr(CustomTable,TransDate),
                        date2strxpp(dateTo),
                        fieldStr(CustomTable,SettlementDate),
                        date2strxpp(dateNull()));
If you still get a similar error at runtime, add even more brackets to group every subexpression in pairs:
// subexpressions lvl 3: 5                                                         5
// subexpressions lvl 2: |4                                   4    3              3|
// subexpressions lvl 1: ||1               1    2            2|    3              3|
//                       |||               |    |            ||    |              ||
rangeTransDate = strFmt('((("%1.%2" <= "%3") && ("%3" == "%5")) || ("%1.%4" > "%3"))',
                        tableStr(CustomTable),
                        fieldStr(CustomTable,TransDate),
                        date2strxpp(dateTo),
                        fieldStr(CustomTable,SettlementDate),
                        date2strxpp(dateNull()));

From natural language to C++ expression

Assignment:
Translate the following natural language expressions to C++ expressions. Assume that all the variables are non-negative numbers or boolean (of value true or false).
Natural Language:
Either a and b are both false or c is true, but not both.
My solution:
(a==0 && b==0)xor(c==1)
Professor's solution:
(!a && !b) != c
Questions:
I think I partly understand the first bracket: by saying "not-a" and "not-b", a and b must both be false, provided a and b are assumed to be non-zero to begin with. Right?
But what about the part that says "unequal to c"?
I don't understand the professor's solution. Can anyone break it down for me?
Thank you for the help!
I'll assume that a, b and c are bool.
Let's draw some truth tables:
| a | !a | a==1 | a==0 |
| 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 |
As you can see, a and a==1 are equivalent, and !a and a==0 are also equivalent, so we can rewrite (a==0 && b==0)xor(c==1) as (!a && !b) xor c.
Now some more truth tables:
| a | b | a xor b | a != b |
| 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 0 |
So a!=b is equivalent to a xor b, so we can rewrite (!a && !b) xor c to (!a && !b)!=c. As you see, your solutions are fully equivalent, just written with different 'signs'.
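The equivalence read off the truth tables can also be checked exhaustively. The following is a minimal sketch, assuming a, b and c are bool as stated above; the function names student, professor and agree_for_all_bools are made up for this illustration.

```cpp
// Student's expression, with a, b, c taken as bool (assumption of this sketch).
constexpr bool student(bool a, bool b, bool c) {
    return (a == 0 && b == 0) xor (c == 1);
}

// Professor's expression.
constexpr bool professor(bool a, bool b, bool c) {
    return (!a && !b) != c;
}

// Walk through all 8 combinations of a, b, c and compare the two results.
constexpr bool agree_for_all_bools() {
    for (int i = 0; i < 8; ++i) {
        const bool a = (i & 4) != 0;
        const bool b = (i & 2) != 0;
        const bool c = (i & 1) != 0;
        if (student(a, b, c) != professor(a, b, c)) {
            return false;
        }
    }
    return true;
}

// Checked at compile time (needs C++14 or later for the loop in a constexpr function).
static_assert(agree_for_all_bools(), "the two expressions differ for some bool input");
```

The check only covers bool inputs; numeric inputs other than 0 and 1 are a separate matter.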
UPD: I forgot to mention that there are reasons why the professor's solution looks exactly the way it does.
The professor's solution is more idiomatic. While your solution is technically correct, it's not idiomatic C++ code.
The first small issue is the use of types. Your solution relies on conversions between int and bool when you compare a boolean value to a number or use xor, which is the bitwise exclusive-or operator (an alternative spelling of ^) and acts on ints too. In modern C++ it is preferred to use values of the correct types and not to rely on such conversions, as they are sometimes unclear and hard to reason about. For bool such values are true and false instead of 1 and 0 respectively. Also, != is more appropriate than xor because, while bools are technically stored as numbers, semantically you have no numbers here, just logical values.
The second issue is also about idiom. It lies here: a == 0. It is not considered good practice to compare boolean expressions to boolean constants. As you already know, a == true is fully equivalent to just a, and a == false is just !a or not a (I prefer the latter). To see why such comparisons aren't good, compare two code snippets and decide which is clearer:
if (str.empty() == false) { ... }
vs
if (not str.empty()) { ... }
Think booleans, not bits
In summary, your professor's solution is better (but still wrong, strictly speaking, see further down) because it uses boolean operators instead of bitwise operators and does not treat booleans as integers. The expression c==1 to represent "c is true" is incorrect because, if c may be a number (according to the stated assignment), then any non-zero value of c is to be regarded as representing true.
See this question on why it's better not to compare booleans with 0 or 1, even when it's safe to do so.
One very good reason not to use xor is that it is the bitwise exclusive-or operation. It happens to work in your example because both the left-hand side and the right-hand side are boolean expressions that convert to 1 or 0 (see again the question linked above).
The boolean exclusive-or is in fact !=.
Breaking down the expression
To understand your professor's solution better, it's easiest to replace the boolean operators with their "alternative token" equivalents, which turns it into more readable (imho) and completely equivalent C++ code:
Using 'not' for '!' and 'and' for '&&' you get
(not a and not b) != c
Unfortunately, there is no logical exclusive_or operator other than not_eq, which isn't helpful in this case.
If we break down the natural language expression:
Either a and b are both false or c is true, but not both.
first into a sentence about boolean propositions A and B:
Either A or B, but not both.
this translates into A != B (only for booleans, not for any type A and B).
Then proposition A was
a and b are both false
which can be stated as
a is false and b is false
which translates into (not a and not b), and finally
c is true
Which simply translates into c.
Combining them you get again (not a and not b) != c.
For further explanation how this expression then works, I defer to the truth tables that others have given in their answers.
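The breakdown above can be written out with one named boolean per proposition. This is only a sketch for bool inputs; the function name either_but_not_both and the locals A and B are chosen here to mirror the propositions and are not part of the assignment.

```cpp
// Sketch: one named boolean per proposition from the breakdown.
constexpr bool either_but_not_both(bool a, bool b, bool c) {
    const bool A = not a and not b;  // "a and b are both false"
    const bool B = c;                // "c is true"
    return A != B;                   // "either A or B, but not both"
}

// Spot checks, evaluated at compile time.
static_assert(either_but_not_both(false, false, false), "A holds, B does not");
static_assert(!either_but_not_both(false, false, true), "both hold, so the result is false");
static_assert(either_but_not_both(true, false, true), "B holds, A does not");
static_assert(!either_but_not_both(true, true, false), "neither holds");
```

Naming the two propositions keeps the final line a direct transcription of "either A or B, but not both".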
You're both wrong
And if I may nitpick: The original assignment stated that a, b and c can be non-negative numbers, but did not unambiguously state that if they were numbers, they should be limited to the values 0 and 1. If any number that is not 0 represents true, as is customary, then the following code would yield a surprising answer:
auto c = 2; // "true" in some way
auto a = 0; // "false"
auto b = 0; // "false"
std::cout << ((!a && !b) != c);
// this will output: 1 (!)
// fix by making sure that != compares booleans:
std::cout << ((!a && !b) != (bool)c);
I will try to explain with some more words: numbers can be implicitly converted to boolean values:
The value zero (for integral, floating-point, and unscoped enumeration) and the null pointer and the null pointer-to-member values become false. All other values become true.
Source on cppreference
This leads to the following conclusions:
a == 0 is the same as !a, because a is converted to a boolean and then inverted, which equals !(a != 0). The same goes for b.
c==1 is true only when c equals 1. Using the conversion (bool)c would yield true when c != 0, not just when c == 1. So it can work, because one usually uses the value 1 to represent true, but it's not guaranteed.
a != b is the same as a xor b when a and b are boolean expressions. It's true when one value or the other is true, but not both. In this case the left-hand side (a==0 && b==0) is boolean, so if c is boolean too (or restricted to 0 and 1), != is the same as xor. Note that a numeric c is not converted to boolean by this comparison: the boolean side is promoted to int instead, so the equivalence relies on c being 0 or 1.
You can check all of this yourself with the truthtables that the other answers provided.
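To see where numeric inputs make a difference, the three variants can be put side by side. This is a minimal sketch assuming int inputs where any non-zero value means true; the function names student_int, professor_int and normalized are hypothetical labels for this illustration.

```cpp
// All inputs are ints here; any non-zero value is meant as "true" (assumption).

// Student's expression: c == 1 only recognises the value 1 as true.
constexpr bool student_int(int a, int b, int c) {
    return (a == 0 && b == 0) xor (c == 1);
}

// Professor's expression: the bool on the left is promoted to int (0 or 1)
// and then compared with c as a number.
constexpr bool professor_int(int a, int b, int c) {
    return (!a && !b) != c;
}

// Both sides are turned into bool before comparing.
constexpr bool normalized(int a, int b, int c) {
    return (!a && !b) != (c != 0);
}

// With c restricted to 0 or 1, all three agree.
static_assert(!student_int(0, 0, 1) && !professor_int(0, 0, 1) && !normalized(0, 0, 1),
              "both propositions hold, so the answer is false");

// With c == 2 ("true"), only the normalized version gives the intended false.
static_assert(student_int(0, 0, 2), "c == 1 is false for c == 2");
static_assert(professor_int(0, 0, 2), "1 != 2 is true");
static_assert(!normalized(0, 0, 2), "true != true is false");
```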
As we can see from the truth tables:
!(not) and ==0 give the same results.
!= and xor give the same results.
c==1 is the same as just c (when c is 0 or 1)
Writing one under the other shows why these two expressions give the same result:
(a==0 && b==0) xor (c==1)
(!a && !b) != c
Truth tables :
Not
| | ! |
| 0 | 1 |
| 1 | 0 |
==0
| |==0|
| 0 | 1 |
| 1 | 0 |
==1
| |==1|
| 0 | 0 |
| 1 | 1 |
And
| a | b | && |
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
Not equal
| a | b | != |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
XOR
| a | b |xor|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

Let a variable equal multiple values in an if-statement [duplicate]

I am doing data clean-up in Stata and I need to recode a variable to equal 1 if a whole set of other variables are equal to 1, 6, or 7.
I can do this using the code below:
replace anyadl = 1 if diffdress==1 | diffdress==6 | diffdress==7 | ///
diffwalk==1 | diffwalk==6 | diffwalk==7 | ///
diffbath==1 | diffbath==6 | diffbath==7 | ///
diffeat==1 | diffeat==6 | diffeat==7 | ///
diffbed==1 | diffbed==6 | diffbed==7 | ///
difftoi==1 | difftoi==6 | difftoi==7
However, this is very inefficient to type out and it is easy to make errors.
Is there a simpler way to do this?
For example, something along the following lines:
replace anyadl = 1 if diff* == (1 | 6 | 7)
Your fantasy syntax wouldn't do what you want even if it were legal, as for example 1|6|7 would be evaluated as 1. That is, in Stata 1 OR 6 OR 7 is in effect true OR true OR true, so true, and thus 1, given the rules that non-zero is true as input and true is 1 as output. The expression 1|6|7 is legal; it's the wildcard in an equality or inequality that isn't.
Stepping back, your code is producing an indicator (some people say dummy) variable with values 1 or missing. In practice such a variable is much more useful if created with values 0 and 1 (and in some instances missing too).
generate anyad1 = 0
foreach v in dress walk bath eat bed toi {
    replace anyad1 = 1 if inlist(diff`v', 1, 6, 7)
}
is one approach. In general, note both inlist(foo, 1, 6, 7) and inlist(1, foo, bar, bazz) as useful constructs.
Reading:
This paper on generating indicators
This one on useful functions
This one on inlist() and inrange()
FAQ on true and false in Stata

Coding dichotomous variables in Stata

I have a set of dichotomous variables for firm size:
emp1_2 (i.e. firm with 1 or 2 employed people, including the owner), emp3_9, emp10_19, emp20_49, emp50_99, emp100_249, emp250_499, emp500, plus I do not have information on 27 firms size but I have an educated guess that they are large firms.
I want to create a dichotomous variable for a firm being a "small firm"; therefore, this variable equals 1 when emp1_2==1 | emp3_9==1 | emp10_19==1 equals 1, and 0 otherwise.
To my understanding of Stata, of which I am only a basic user, the following two methods of constructing dichotomous variables should be equivalent.
Method 1)
gen lar_firm = 0
replace lar_firm = 1 if emp1_2==1 | emp3_9==1 | emp10_19==1
Method 2)
gen lar_firm = (emp1_2 | emp3_9 | emp10_19)
Instead I have found that with method 2) lar_firm equals 1 both for firms for which emp1_2 | emp3_9 | emp10_19 holds and for firms that do not fall into any of the categories (i.e. emp1_2, emp3_9, emp10_19, emp20_49, emp50_99, emp100_249, emp250_499, emp500) but for which I have an educated guess that they are large firms.
I am wondering whether there is some subtle difference between the two methods. I thought they should lead to equal outcomes.
When you do
gen lar_firm = emp1_2 | emp3_9 | emp10_19
you're testing if
(emp1_2 != 0) | (emp3_9 != 0) |(emp10_19 != 0)
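In particular, missing values (.) are different from 0 (in fact, Stata treats them as greater than any number), so they count as true. If the uncategorised firms have missing values in these variables, they therefore get lar_firm equal to 1 under method 2, whereas under method 1 the test ==1 is false for a missing value.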
For more information:
http://www.stata.com/support/faqs/data-management/logical-expressions-and-missing-values/

Stata: Counting number of consecutive occurrences of a pre-defined length

Observations in my data set contain the history of moves for each player. I would like to count the number of consecutive series of moves of some pre-defined length (2, 3 and more than 3 moves) in the first and the second halves of the game. The sequences cannot overlap, i.e. the sequence 1111 should be considered as a sequence of the length 4, not 2 sequences of length 2. That is, for an observation like this:
+-------+-------+-------+-------+-------+-------+-------+-------+
| Move1 | Move2 | Move3 | Move4 | Move5 | Move6 | Move7 | Move8 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 1 | 1 | 1 | 1 | . | . | 1 | 1 |
+-------+-------+-------+-------+-------+-------+-------+-------+
…the following variables should be generated:
Number of sequences of 2 in the first half =0
Number of sequences of 2 in the second half =1
Number of sequences of 3 in the first half =0
Number of sequences of 3 in the second half =0
Number of sequences of >3 in the first half =1
Number of sequences of >3 in the second half = 0
I have two potential options of how to proceed with this task but neither of those leads to the final solution:
Option 1: Elaborating on Nick’s tactical suggestion to use strings (Stata: Maximum number of consecutive occurrences of the same value across variables), I have concatenated all “move*” variables and tried to identify the starting position of a substring:
egen test1 = concat(move*)
gen test2 = subinstr(test1,"11","X",.) // find all consecutive series of length 2
There are several problems with Option 1:
(1) it does not account for cases with overlapping sequences (“1111” is recognized as 2 sequences of 2)
(2) it shortens the resulting string test2 so that positions of X no longer correspond to the starting positions in test1
(3) it does not account for variable length of substring if I need to check for sequences of the length greater than 3.
Option 2: Create an auxiliary set of variables to identify the starting positions of the consecutive set (sets) of the 1s of some fixed predefined length. Building on the earlier example, in order to count sequences of length 2, what I am trying to get is an auxiliary set of variables that will be equal to 1 if the sequence of started at a given move, and zero otherwise:
+-------+-------+-------+-------+-------+-------+-------+-------+
| Move1 | Move2 | Move3 | Move4 | Move5 | Move6 | Move7 | Move8 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
+-------+-------+-------+-------+-------+-------+-------+-------+
My code looks as follows but it breaks when I am trying to restart counting consecutive occurrences:
quietly forval i = 1/42 {
    gen temprow`i' =.
    egen rowsum = rownonmiss(seq1-seq`i') //count number of occurrences
    replace temprow`i'=rowsum
    mvdecode seq1-seq`i',mv(1) if rowsum==2
    drop rowsum
}
Does anyone know a way of solving the task?
Assume a string variable all that concatenates all moves (the name test1 is hardly evocative).
FIRST TRY: TAKING YOUR EXAMPLE LITERALLY
From your example with 8 moves, the first half of the game is moves 1-4 and the second half moves 5-8. Thus there is for each half only one way to have >3 moves, namely that there are 4 moves. In that case each substring will be "1111" and counting reduces to testing for the one possibility:
gen count_1_4 = substr(all, 1, 4) == "1111"
gen count_2_4 = substr(all, 5, 4) == "1111"
Extending this approach, there are only two ways to have 3 moves in sequence:
gen count_1_3 = inlist(substr(all, 1, 4), "111.", ".111")
gen count_2_3 = inlist(substr(all, 5, 4), "111.", ".111")
In similar style, there can't be two instances of 2 moves in sequence in each half of the game, as that would qualify as 4 moves. So there is at most one instance of 2 moves in sequence in each half. That instance must match either of two patterns, "11." or ".11". The pattern ".11." is allowed too and is matched by both. We must also exclude any false match with a sequence of 3 moves, as just mentioned.
gen count_1_2 = (strpos(substr(all, 1, 4), "11.") | strpos(substr(all, 1, 4), ".11") ) & !count_1_3
gen count_2_2 = (strpos(substr(all, 5, 4), "11.") | strpos(substr(all, 5, 4), ".11") ) & !count_2_3
The result of each strpos() evaluation will be positive if a match is found and (arg1 | arg2) will be true (1) if either argument is positive. (For Stata, non-zero is true in logical evaluations.)
That's very much tailored to your particular problem, but none the worse for that.
P.S. I didn't try hard to understand your code. You seem to be confusing subinstr() with strpos(). If you want to know positions, subinstr() cannot help.
SECOND TRY
Your last code segment implies that your example is quite misleading: if there can be 42 moves, the approach above can not be extended without pain. You need a different approach.
Let's suppose that the string variable all can be 42 characters long. I will set aside the distinction between first and second halves, which can be tackled by modifying this approach. At its simplest, just split the history into two variables, one for the first half and one for the second and repeat the approach twice.
You can clone the history by
clonevar work = all
gen length1 = .
gen length2 = .
and set up your count variables. Here count_4 will hold counts of 4 or more.
gen count_4 = 0
gen count_3 = 0
gen count_2 = 0
First we look for move sequences of length 42, ..., 2. Every time we find one, we blank it out and bump up the count.
qui forval j = 42(-1)2 {
    // length before deleting runs of length j
    replace length1 = length(work)
    local pattern : di _dup(`j') "1"
    replace work = subinstr(work, "`pattern'", "", .)
    // length after deleting them
    replace length2 = length(work)
    if `j' >= 4 {
        replace count_4 = count_4 + (length1 - length2) / `j'
    }
    else if `j' == 3 {
        replace count_3 = count_3 + (length1 - length2) / 3
    }
    else if `j' == 2 {
        replace count_2 = count_2 + (length1 - length2) / 2
    }
}
The important details here are:
If we delete (repeated instances of) a pattern and measure the change in length, we have just deleted (change in length) / (length of pattern) instances of that pattern. So, if I look for "11" and find that the length decreased by 4, I have just found two instances.
Working downwards and deleting what we found ensures that we don't find false positives, e.g. if "1111111" is deleted, we don't find later "111111", "11111", ..., "11" which are included within it.
Deletion implies that we should work on a clone in order not to destroy what is of interest.
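For readers who want to check the bookkeeping outside Stata, the same delete-and-measure logic can be sketched in C++. This is an illustration under assumptions, not a replacement for the Stata code: the function name count_runs, the max_len parameter and the layout of the returned array are invented for this sketch, and the history is taken as one string of "1" and "." characters like the variable all.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical helper. Returns {runs of exactly 2, runs of exactly 3, runs of 4 or more}.
// The string is passed by value, so the caller's copy is untouched
// (the counterpart of working on a clone).
std::array<int, 3> count_runs(std::string work, int max_len = 42) {
    std::array<int, 3> counts{0, 0, 0};
    // Work downwards so that a long run is removed before shorter patterns are tried.
    for (int j = max_len; j >= 2; --j) {
        const std::string pattern(static_cast<std::size_t>(j), '1');
        const std::size_t before = work.size();
        // Delete every non-overlapping occurrence of the pattern, left to right.
        for (std::size_t pos = work.find(pattern); pos != std::string::npos;
             pos = work.find(pattern, pos)) {
            work.erase(pos, pattern.size());
        }
        // (change in length) / (length of pattern) = number of instances deleted.
        const int found = static_cast<int>((before - work.size()) / pattern.size());
        counts[j >= 4 ? 2 : j - 2] += found;
    }
    return counts;
}
```

For the eight-move example, count_runs("1111..11", 8) reports one run of length 2 and one run of 4 or more over the whole game; splitting the string into halves first, as suggested above, would give the per-half counts.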