Let a variable equal multiple values in an if-statement [duplicate]

I am doing data clean-up in Stata and I need to recode a variable to equal 1 if any of a set of other variables is equal to 1, 6, or 7.
I can do this using the code below:
replace anyadl = 1 if diffdress==1 | diffdress==6 | diffdress==7 | ///
diffwalk==1 | diffwalk==6 | diffwalk==7 | ///
diffbath==1 | diffbath==6 | diffbath==7 | ///
diffeat==1 | diffeat==6 | diffeat==7 | ///
diffbed==1 | diffbed==6 | diffbed==7 | ///
difftoi==1 | difftoi==6 | difftoi==7
However, this is very inefficient to type out and it is easy to make errors.
Is there a simpler way to do this?
For example, something along the following lines:
replace anyadl = 1 if diff* == (1 | 6 | 7)

Your fantasy syntax wouldn't do what you want even if it were legal, because 1|6|7 would be evaluated as 1. That is, in Stata 1 OR 6 OR 7 is in effect true OR true OR true, so true, and thus 1, given the rules that non-zero is true as input and true is 1 as output. The expression 1|6|7 is legal; it's the wildcard in an equality or inequality that isn't.
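As a quick check of that evaluation rule, the following sketch can be typed at the Stata prompt. It only uses display and inlist(); nothing here depends on the question's data.

```stata
* non-zero is true on input, true is 1 on output
display 1 | 6 | 7

* so comparing a value with (1 | 6 | 7) is just comparing it with 1
display 6 == (1 | 6 | 7)

* inlist() asks the question that was actually intended
display inlist(6, 1, 6, 7)
```

The first and third lines should show 1 and the second 0, which is why the parenthesised form cannot stand in for a membership test.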
Stepping back, your code is producing an indicator (some people say dummy) variable with values 1 or missing. In practice such a variable is much more useful if created with values 0 and 1 (and in some instances missing too).
generate anyad1 = 0
foreach v in dress walk bath eat bed toi {
    replace anyad1 = 1 if inlist(diff`v', 1, 6, 7)
}
is one approach. In general, note both inlist(foo, 1, 6, 7) and inlist(1, foo, bar, bazz) as useful constructs.
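A minimal sketch comparing the loop with egen's anymatch() function, which tests whether any variable in a list equals any of a set of integer values. The toy data and the name anyadl2 are made up for illustration; only three of the question's variables are used.

```stata
clear
input diffdress diffwalk diffbath
1 2 2
2 2 2
3 6 .
. . .
end

* loop version: 0/1 indicator, missing values never match
generate byte anyadl = 0
foreach v of varlist diffdress diffwalk diffbath {
    replace anyadl = 1 if inlist(`v', 1, 6, 7)
}

* egen version: 1 if any listed variable equals 1, 6 or 7, else 0
egen anyadl2 = anymatch(diffdress diffwalk diffbath), values(1 6 7)

list
```

Listing the variables explicitly, rather than writing diff*, avoids picking up unrelated variables that happen to share the prefix.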
Reading:
This paper on generating indicators
This one on useful functions
This one on inlist() and inrange()
FAQ on true and false in Stata

Related

strfmt range in x++

What's wrong with this range?
rangeTransDate = strFmt('(("%1.%2" <= "%3" && "%3" == "%5") || ("%1.%4" > "%3"))',tableStr(CustomTable),fieldStr(CustomTable,TransDate), date2str(dateTo,321,2,0,2,0,4),fieldStr(CustomTable,SettlementDate),SysQuery::valueEmptyString());
I'm getting this error:
Query extended range error: Right parenthesis expected next to position 72.
This page of AX 2012 documentation is still relevant (I cannot find an AX365 version). Highlighting the important bit gives:
The rules for creating query range value expressions are:
Enclose the whole expression in parentheses.
Enclose all subexpressions in parentheses.
Use the relational and logical operators available in X++.
Only use field names from the range's data source.
Use the dataSource.field notation for fields from other data sources in the query.
This means that X++ expects parentheses around every comparison (a.k.a. "subexpression" in the documentation). You are missing some.
Also, use the date2strxpp() function to properly handle all date to string conversions. This function can handle empty date values (dateNull()) by translating these to 1900-01-01. I doubt that putting an empty string (SysQuery::valueEmptyString()) in there will work.
So try this, the commented subexpression levels show the bracket groupings:
// subexpressions lvl 2: 4                                                       4
// subexpressions lvl 1: |1               1    2            2    3              3|
//                       ||               |    |            |    |              ||
rangeTransDate = strFmt('(("%1.%2" <= "%3") && ("%3" == "%5") || ("%1.%4" > "%3"))',
    tableStr(CustomTable),
    fieldStr(CustomTable,TransDate),
    date2strxpp(dateTo),
    fieldStr(CustomTable,SettlementDate),
    date2strxpp(dateNull()));
If you still get a similar error at runtime, add even more brackets to group every subexpression in pairs:
// subexpressions lvl 3: 5                                                         5
// subexpressions lvl 2: |4                                   4    3              3|
// subexpressions lvl 1: ||1               1    2            2|    3              3|
//                       |||               |    |            ||    |              ||
rangeTransDate = strFmt('((("%1.%2" <= "%3") && ("%3" == "%5")) || ("%1.%4" > "%3"))',
    tableStr(CustomTable),
    fieldStr(CustomTable,TransDate),
    date2strxpp(dateTo),
    fieldStr(CustomTable,SettlementDate),
    date2strxpp(dateNull()));

Problems in creating dummy variables

The treatment boroughs are five boroughs with IDs 1, 2, 3, 6, and 14. The "Operation Theseus" policy lasts from week 80 to week 85.
ocu: borough ID
I tried creating dummies for treated and time, but they are zero for all observations:
gen treated =0 if missing(ocu)==0
replace treated =1 if ocu==1/2/3/6/14
gen time = (week==80-85) & !missing(week)
ocu == 1/2/3/6/14 is a legal expression, but is likely to be a long way from what you want: / is division, so Stata computes 1/2/3/6/14 (about 0.002) and tests ocu for equality with that number.
ocu == 1 | ocu == 2 | ocu == 3 | ocu == 6 | ocu == 14
is legal and long-winded and
inlist(ocu, 1, 2, 3, 6, 14)
is legal and likely to be appealing as an expression for: does ocu take on any of the values specified?
Although Stata supports | as an "or" operator (and not / for that purpose) note that
ocu == 1 | 2 | 3 | 6 | 14
is legal but almost never what anyone would want, as it is parsed
(ocu == 1) | 2 | 3 | 6 | 14
and will always be evaluated as 1 (true), regardless of the value of ocu, as just one of the other arguments 2 3 6 14 being non-zero means that the entire expression evaluates to 1 (true).
The expression week==80-85 is also legal, but wrong if you want it to mean week between 80 and 85. Stata will evaluate week == 80-85 by applying the subtraction first and so test for equality with -5. See precedence rules as documented in help operators.
The order of evaluation (from first to last) of all operators is ! (or ~), ^, - (negation), /, *, - (subtraction), +, != (or ~=), >, <, <=, >=, ==, &, and |.
Subtraction comes before testing for equality.
You may want week >= 80 & week <= 85 or inrange(week, 80, 85).
If week is between 80 and 85, then it can't be missing. That test is redundant (but harmless).
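Putting the pieces together, a minimal sketch using the question's variable names ocu and week on made-up data. The if qualifiers are one possible way to keep missing inputs missing in the result rather than coding them as 0; whether that is wanted depends on the analysis.

```stata
clear
input ocu week
1 79
2 80
5 85
14 86
. 82
end

* what the original expressions actually compare against
display 1/2/3/6/14
display 80-85

* 1 for treated boroughs, 0 otherwise, missing if ocu is missing
generate byte treated = inlist(ocu, 1, 2, 3, 6, 14) if !missing(ocu)

* 1 for weeks 80 to 85 inclusive, 0 otherwise, missing if week is missing
generate byte time = inrange(week, 80, 85) if !missing(week)

list
```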

Coding dichotomous variables in Stata

I have a set of dichotomous variables for firm size:
emp1_2 (i.e. firm with 1 or 2 employed people, including the owner), emp3_9, emp10_19, emp20_49, emp50_99, emp100_249, emp250_499, emp500. In addition, I have no size information for 27 firms, but I have an educated guess that they are large firms.
I want to create a dichotomous variable for a firm being a "small firm"; that is, a variable that equals 1 when emp1_2==1 | emp3_9==1 | emp10_19==1 is true, and 0 otherwise.
To my understanding of Stata, of which I am only a basic user, the following two methods of constructing dichotomous variables should be equivalent.
Method 1)
gen lar_firm = 0
replace lar_firm = 1 if emp1_2==1 | emp3_9==1 | emp10_19==1
Method 2)
gen lar_firm = (emp1_2 | emp3_9 | emp10_19)
Instead I have found that with method 2) lar_firm equals 1 both for firms for which emp1_2 | emp3_9 | emp10_19 is true and for firms that do not fall into any of the categories (i.e. emp1_2, emp3_9, emp10_19, emp20_49, emp50_99, emp100_249, emp250_499, emp500) but for which I have an educated guess that they are large firms.
I am wondering whether there is some subtle difference between the two methods. I thought they should lead to the same outcome.
When you do
gen lar_firm = emp1_2 | emp3_9 | emp10_19
you're testing if
(emp1_2 != 0) | (emp3_9 != 0) |(emp10_19 != 0)
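In particular, missing values (.) are different from 0: they are in fact greater than any number. So an observation with missing values in these variables counts as true and gets 1, which is presumably what happens with the 27 firms without size information.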
For more information:
http://www.stata.com/support/faqs/data-management/logical-expressions-and-missing-values/
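A small sketch of the difference on made-up data, where the third observation stands in for a firm with no size information. The names m1, m2 and m3 are hypothetical.

```stata
clear
input emp1_2 emp3_9 emp10_19
1 0 0
0 0 0
. . .
end

* Method 1: explicit comparisons with 1; missing never equals 1
generate byte m1 = 0
replace m1 = 1 if emp1_2 == 1 | emp3_9 == 1 | emp10_19 == 1

* Method 2: bare variables; missing is non-zero, hence true
generate byte m2 = (emp1_2 | emp3_9 | emp10_19)

* One-line version that agrees with Method 1
generate byte m3 = (emp1_2 == 1 | emp3_9 == 1 | emp10_19 == 1)

list
```

Only m2 is 1 in the all-missing observation; m1 and m3 are 0 there.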

How to populate missing values for string variable in a column based on fixed criteria

To populate missing data with a fixed range of values
I would like to check how to populate column aktype with a range of values (the range of values for the same pidlink are always fixed at 11 types of values listed below) for those cells with missing values. I have about 17,000+ observations that are missing.
The range of values are as follows:
A
B
C
D
E
G
H
I
J
K
L
I have tried the following command but it does not work:
foreach x of varlist aktype=1/11 {
replace aktype = "A" in 1 if aktype==""
replace aktype = "B" in 2 if aktype==""
replace aktype = "C" in 3 if aktype==""
replace aktype = "D" in 4 if aktype==""
replace aktype = "E" in 5 if aktype==""
replace aktype = "G" in 6 if aktype==""
replace aktype = "H" in 7 if aktype==""
replace aktype = "I" in 8 if aktype==""
replace aktype = "J" in 9 if aktype==""
replace aktype = "K" in 10 if aktype==""
replace aktype = "L" in 11 if aktype==""
}
Would appreciate it if you could advise on the right command to use. Many thanks!
I would generate a variable AK that has letters A-K in positions 1-11 (and 12-22, and 23-33, and so on). Then replace missing values with the value of this variable AK.
* generate data
clear
set obs 20
generate aktype = ""
replace aktype = "foo" in 1/1
replace aktype = "bar" in 10/12
* generate variable with letters A-K
generate AK = char(65 + mod(_n - 1, 11))
* fill missing values
replace aktype = AK if missing(aktype)
list
This yields the following.
. list
+-------------+
| aktype AK |
|-------------|
1. | foo A |
2. | B B |
3. | C C |
4. | D D |
5. | E E |
|-------------|
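Note that the list in the question skips F and ends at L, whereas char(65 + ...) runs through A to K including F. A possible variant, assuming the eleven values should simply cycle in observation order as in the answer above, picks the letter with the word() function instead:

```stata
* generate data
clear
set obs 13
generate aktype = ""
replace aktype = "foo" in 1

* the eleven values from the question (no F)
local letters "A B C D E G H I J K L"

* word(s, n) returns the n-th word of s; cycle through 1..11
generate AK = word("`letters'", mod(_n - 1, 11) + 1)

* fill missing values
replace aktype = AK if missing(aktype)
list
```

If the cycle should restart within each pidlink, the same expression could be used under a by prefix, but the question does not say enough to be sure.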
This first addresses the comment "it does not work".
Generally, in this kind of forum you should always be specific and say exactly what happens, namely where the code breaks down and what the result is (e.g. what error message you get). If necessary, add why that is not what is wanted.
Specifically, in this case Stata would get no further than
foreach x of varlist aktype=1/11
which is illegal (as well as unclear to Stata programmers).
You can loop over a varlist. In this case looping over a single variable aktype is legal. (It is usually pointless, but that's style, not syntax.) So this is legal:
foreach x of varlist aktype
By the way, you define x as the loop argument, but never refer to it inside the loop. That isn't illegal, but it is unusual.
You can also loop over a numlist, e.g.
foreach x of numlist 1/11
although
forval x = 1/11
is a more direct way of doing that. All this follows from the syntax diagrams for the commands concerned, where whatever is not explicitly allowed is forbidden.
On occasions when you need to loop over a varlist and a numlist you will need to use different syntax, but what is best depends on the precise problem.
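For instance, one common pattern for pairing a number with an item from a list is to loop over the numbers and pull the matching word out of a local macro. This is only a sketch of what the original loop might have been aiming at, on an empty made-up dataset:

```stata
clear
set obs 11
generate aktype = ""

local letters A B C D E G H I J K L
forvalues i = 1/11 {
    * extract the i-th word of the list
    local letter : word `i' of `letters'
    replace aktype = "`letter'" if aktype == "" in `i'
}
list
```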
Now second to the question: I can't see any kind of rule in the question for which values get assigned A through L, so can't advise positively.

Stata: Counting number of consecutive occurrences of a pre-defined length

Observations in my data set contain the history of moves for each player. I would like to count the number of consecutive series of moves of some pre-defined length (2, 3 and more than 3 moves) in the first and the second halves of the game. The sequences cannot overlap, i.e. the sequence 1111 should be considered as a sequence of the length 4, not 2 sequences of length 2. That is, for an observation like this:
+-------+-------+-------+-------+-------+-------+-------+-------+
| Move1 | Move2 | Move3 | Move4 | Move5 | Move6 | Move7 | Move8 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 1 | 1 | 1 | 1 | . | . | 1 | 1 |
+-------+-------+-------+-------+-------+-------+-------+-------+
…the following variables should be generated:
Number of sequences of 2 in the first half = 0
Number of sequences of 2 in the second half = 1
Number of sequences of 3 in the first half = 0
Number of sequences of 3 in the second half = 0
Number of sequences of >3 in the first half = 1
Number of sequences of >3 in the second half = 0
I have two potential options of how to proceed with this task but neither of those leads to the final solution:
Option 1: Elaborating on Nick’s tactical suggestion to use strings (Stata: Maximum number of consecutive occurrences of the same value across variables), I have concatenated all “move*” variables and tried to identify the starting position of a substring:
egen test1 = concat(move*)
gen test2 = subinstr(test1,"11","X",.) // find all consecutive series of length 2
There are several problems with Option 1:
(1) it does not account for cases with overlapping sequences (“1111” is recognized as 2 sequences of 2)
(2) it shortens the resulting string test2 so that positions of X no longer correspond to the starting positions in test1
(3) it does not account for variable length of substring if I need to check for sequences of the length greater than 3.
Option 2: Create an auxiliary set of variables to identify the starting positions of the consecutive set (sets) of 1s of some fixed predefined length. Building on the earlier example, in order to count sequences of length 2, what I am trying to get is an auxiliary set of variables that will be equal to 1 if a sequence started at a given move, and zero otherwise:
+-------+-------+-------+-------+-------+-------+-------+-------+
| Move1 | Move2 | Move3 | Move4 | Move5 | Move6 | Move7 | Move8 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
+-------+-------+-------+-------+-------+-------+-------+-------+
My code looks as follows but it breaks when I am trying to restart counting consecutive occurrences:
quietly forval i = 1/42 {
gen temprow`i' =.
egen rowsum = rownonmiss(seq1-seq`i') //count number of occurrences
replace temprow`i'=rowsum
mvdecode seq1-seq`i',mv(1) if rowsum==2
drop rowsum
}
Does anyone know a way of solving the task?
Assume a string variable all concatenating all moves (the name test1 is hardly evocative).
FIRST TRY: TAKING YOUR EXAMPLE LITERALLY
From your example with 8 moves, the first half of the game is moves 1-4 and the second half moves 5-8. Thus there is for each half only one way to have >3 moves, namely that there are 4 moves. In that case each substring will be "1111" and counting reduces to testing for the one possibility:
gen count_1_4 = substr(all, 1, 4) == "1111"
gen count_2_4 = substr(all, 5, 4) == "1111"
Extending this approach, there are only two ways to have 3 moves in sequence:
gen count_1_3 = inlist(substr(all, 1, 4), "111.", ".111")
gen count_2_3 = inlist(substr(all, 5, 4), "111.", ".111")
In similar style, there can't be two instances of 2 moves in sequence in each half of the game as that would qualify as 4 moves. So, at most there is one instance of 2 moves in sequence in each half. That instance must match either of two patterns, "11." or ".11". ".11." is allowed and matches both patterns. We must also exclude any false match with a sequence of 3 moves, as just mentioned.
gen count_1_2 = (strpos(substr(all, 1, 4), "11.") | strpos(substr(all, 1, 4), ".11") ) & !count_1_3
gen count_2_2 = (strpos(substr(all, 5, 4), "11.") | strpos(substr(all, 5, 4), ".11") ) & !count_2_3
The result of each strpos() evaluation will be positive if a match is found and (arg1 | arg2) will be true (1) if either argument is positive. (For Stata, non-zero is true in logical evaluations.)
That's very much tailored to your particular problem, but not much worse for that.
P.S. I didn't try hard to understand your code. You seem to be confusing subinstr() with strpos(). If you want to know positions, subinstr() cannot help.
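As a check, here is a sketch that applies the first-try expressions to the question's example history, written as the string "1111..11", plus one made-up history. It assumes missing moves appear as "." in the concatenated string.

```stata
clear
input str8 all
"1111..11"
"11.1.111"
end

* exactly 4 in a row within each half
gen count_1_4 = substr(all, 1, 4) == "1111"
gen count_2_4 = substr(all, 5, 4) == "1111"

* exactly 3 in a row within each half
gen count_1_3 = inlist(substr(all, 1, 4), "111.", ".111")
gen count_2_3 = inlist(substr(all, 5, 4), "111.", ".111")

* exactly 2 in a row, excluding runs already counted as 3
gen count_1_2 = (strpos(substr(all, 1, 4), "11.") | strpos(substr(all, 1, 4), ".11")) & !count_1_3
gen count_2_2 = (strpos(substr(all, 5, 4), "11.") | strpos(substr(all, 5, 4), ".11")) & !count_2_3

list
```

For the first history this should reproduce the counts listed in the question: one run of more than 3 in the first half and one run of 2 in the second.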
SECOND TRY
Your last code segment implies that your example is quite misleading: if there can be 42 moves, the approach above cannot be extended without pain. You need a different approach.
Let's suppose that the string variable all can be 42 characters long. I will set aside the distinction between first and second halves, which can be tackled by modifying this approach. At its simplest, just split the history into two variables, one for the first half and one for the second and repeat the approach twice.
You can clone the history by
clonevar work = all
gen length1 = .
gen length2 = .
and set up your count variables. Here count_4 will hold counts of 4 or more.
gen count_4 = 0
gen count_3 = 0
gen count_2 = 0
First we look for move sequences of length 42, ..., 2. Every time we find one, we blank it out and bump up the count.
qui forval j = 42(-1)2 {
    replace length1 = length(work)
    * a run of `j' consecutive "1"s
    local pattern : di _dup(`j') "1"
    replace work = subinstr(work, "`pattern'", "", .)
    replace length2 = length(work)
    if `j' >= 4 {
        replace count_4 = count_4 + (length1 - length2) / `j'
    }
    else if `j' == 3 {
        replace count_3 = count_3 + (length1 - length2) / 3
    }
    else if `j' == 2 {
        replace count_2 = count_2 + (length1 - length2) / 2
    }
}
The important details here are:
If we delete (repeated instances of) a pattern and measure the change in length, we have just deleted (change in length) / (length of pattern) instances of that pattern. So, if I look for "11" and find that the length decreased by 4, I have just found two instances.
Working downwards and deleting what we found ensures that we don't find false positives, e.g. if "1111111" is deleted, we don't find later "111111", "11111", ..., "11" which are included within it.
Deletion implies that we should work on a clone in order not to destroy what is of interest.
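A compact sketch of the same delete-and-count idea applied separately to each half, using 8-move histories so the result can be checked by hand. The variable names count_h_k (half h, run length k, with k = 4 meaning 4 or more) are hypothetical, and a run that straddles the midpoint is cut in two by the split, which may or may not be what is wanted.

```stata
clear
input str8 all
"1111..11"
".111.11."
end

local half = 4
generate work = ""

forvalues h = 1/2 {
    * work on a copy of this half only
    quietly replace work = substr(all, (`h' - 1) * `half' + 1, `half')
    forvalues k = 2/4 {
        generate count_`h'_`k' = 0
    }
    * longest runs first, deleting each run once it has been counted
    forvalues j = `half'(-1)2 {
        local pattern : display _dup(`j') "1"
        local k = min(`j', 4)
        quietly replace count_`h'_`k' = count_`h'_`k' + ///
            (strlen(work) - strlen(subinstr(work, "`pattern'", "", .))) / `j'
        quietly replace work = subinstr(work, "`pattern'", "", .)
    }
}

list all count_*
```

For longer histories only the local half and the input data would need to change.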