Stata: Permutations of string variables - stata

I have three string variables of the length 2 and I need to get (a) all possible permutations of the three variables (keeping the order of strings within each variable fixed), (b) all possible variable pairs. Small number of variables allows me to do it manually, but I was wondering if there is a more elegant and concise way of solving this.
It is currently coded as:
egen perm1 = concat(x1 x5 x9)
egen perm2 = concat(x1 x9 x5)
egen perm3 = concat(x5 x1 x9)
egen perm4 = concat(x5 x9 x1)
egen perm5 = concat(x9 x5 x1)
egen perm6 = concat(x9 x1 x5)
gen tuple1 = substr(perm1,1,4)
gen tuple2 = substr(perm2,3,4)
gen tuple3 = substr(perm3,1,4)
gen tuple4 = substr(perm4,3,4)...
An abstract from a resulting table illustrates the desired outcome:
+----+----+----+--------+--------+--------+--------+--------+--------+--------+--------+
| x1 | x5 | x9 | perm1 | perm2 | perm3 | perm4 | perm5 | perm6 | tuple1 | tuple2 |
+----+----+----+--------+--------+--------+--------+--------+--------+--------+--------+
| 01 | 05 | 09 | 010509 | 010905 | 050109 | 050901 | 090501 | 090105 | 0105 | 0509 |
+----+----+----+--------+--------+--------+--------+--------+--------+--------+--------+

Neat question. I don't know if there's a "built in" way to do permutations, but the following should do it.
You want to loop over all your variables, but make sure that don't get duplicates. As the dimensions increase this gets tricky. What I do it loop over the same list and each time remove the current counter from counter space of the nested loop.
Unfortunately, this still requires you to write each loop structure, but this should be easy enough to cut-paste-find-replace.
clear
set obs 100
generate x1 = "01"
generate x5 = "05"
generate x9 = "09"
local vars x1 x5 x9
local i = 0
foreach a of varlist `vars' {
local bs : list vars - a
foreach b of varlist `bs' {
local cs : list bs - b
foreach c of varlist `cs' {
local ++i
egen perm`i' = concat(`a' `b' `c')
}
}
}
Edit: Re-reading the question, I'm not clear on what you want (since row1_1 isn't one of your concated variables. Note that if you really want the "drop one" permutations, then just remove one variable from the concat call. This is because "n permute n" is the same as "n permute n-1". That is, there are 6 3-item permutations of 3 items. There are also 6 2-item permutations of 3 items. So
egen perm`i' = concat(`a' `b')

Related

How to repeatedly insert arguments from a list into a function until the list is empty?

Using R, I am working with simulating the outcome from an experiment where participants choose between two options (A or B) defined by their outcomes (x) and probabilities of winning the outcome (p). I have a function "f" that collects its arguments in a matrix with the columns "x" (outcome) and "p" (probability):
f <- function(x, p) {
t <- matrix(c(x,p), ncol=2)
colnames(t) <- c("x", "p")
t
}
I want to use this function to compile a big list of all the trials in the experiment. One way to do this is:
t1 <- list(1A=f(x=c(10), p=c(0.8)),
1B=f(x=c(5), p=c(1)))
t2 <- list(2A=f(x=c(11), p=c(0.8)),
2B=f(x=c(7), p=c(1)))
.
.
.
tn <- list(nA=f(x=c(3), p=c(0.8)),
nB=f(x=c(2), p=c(1)))
Big_list <- list(t1=t1, t2=t2, ... tn=tn)
rm(t1, t2, ... tn)
However, I have very many trials, which may change in future simulations, why repeating myself in this way is intractable. I have my trials in an excel document with the following structure:
| Option | x | p |
|---- |------| -----|
| A | 10 | 0.8 |
| B | 7 | 1 |
| A | 9 | 0.8 |
| B | 5 | 1 |
|... |...| ...|
I am trying to do some kind of loop which takes "x" and "p" from each "A" and "B" and inserts them into the function f, while skipping two rows ahead after each iteration (so that each option is only inserted once). This way, I want to get a set of lists t1 to tn while not having to hardcode everything. This is my best (but still not very good) attempt to explain it in pseudocode:
TRIALS <- read.excel(file_with_trials)
for n=1 to n=(nrows(TRIALS)-1) {
t(*PRINT 'n' HERE*) <- list(
(*PRINT 'n' HERE*)A=
f(x=c(*INSERT COLUMN 1, ROW n FROM "TRIALS"*),
p=c(*INSERT COLUMN 2, ROW n FROM "TRIALS"*)),
(*PRINT 'Z' HERE*)B=
f(x=c(*INSERT COLUMN 1, ROW n+1 FROM "TRIALS"*),
p=c(*INSERT COLUMN 2, ROW n+1 FROM "TRIALS"*)))
}
Big_list <- list(t1=t1, t2=t2, ... tn=tn)
That is, I want the code to create a numbered set of lists by drawing x and p from each pair of rows until my excel file is empty.
Any help (and feedback on how to improve this question) is greatly appreciated!

Using egen functions with replace

I created a new variable from the mean of another variable using egen:
egen afd_lr2 = mean(afd_lire2w) if ost == 0
Now I would like to replace the values with the mean of another variable if ost == 1:
replace afd_lr2 = mean(afd_lireo) if ost ==1
This is not possible, as the mean function cannot be used with the replace command.
How can I achieve my goal?
The following works for me:
sysuse auto, clear
generate price2 = price + 5345
egen a_price = mean(price) if foreign == 0
egen b_price = mean(price2) if foreign == 1
replace a_price = b_price if foreign == 1
This should work
egen afd_lr2 = mean(cond(ost == 0, afd_lire2w, cond(ost == 1, afd_lireo, .))), by(ost)
Here is a test:
clear
input float(group y1 y2)
1 42 .
1 42 .
2 . 999
2 . 999
end
egen mean = mean(cond(group == 1, y1, cond(group == 2, y2, .))), by(group)
tabdisp group, c(mean)
----------------------
group | mean
----------+-----------
1 | 42
2 | 999
----------------------
The key is that the mean() function of egen feeds on an expression, which can be more complicated than a single variable name. That said, this is trickier than I would generally advise, as
generate work = afd_lire2w if ost == 0
replace work = afd_lireo if ost == 1
egen mean = mean(work), by(ost)
is easier to understand and should occur to a programmer any way.

Ignore missing values when generating new variable

I want to create a new variable in Stata, that is a function of 3 different variables, X, Y and Z, like:
gen new_var = (((X)*3) + ((Y)*2) + ((Z)*4))/7
All observations have missing values for one or two of the variables.
When I run the aforementioned command, all it generates are missing values, because no observation has values for all 3 of the variables. I would like Stata to complete the function ignoring the missing variables.
I tried the following commands without success:
gen new_var= (cond(missing(X*3),., X) + cond(missing(Y*2),., Y))/7
gen new_var= (!missing(X*3+Y*2+Z*4)/7)
gen new_var= (max(X , Y, Z)/7) if missing(X , Y, Z)
The egen command does not allow complicated functions; otherwise rowtotal() could work.
EDIT:
To clarify, "ignoring missing variables" means that even if any one of the component variables is not missing, then apply the function to only that variable and produce a value for the new variable. The new variable should have missing values only when all three component variables are missing.
I am going to guess that "ignoring missing values" means "treating them as zeros". If you have some other idea, you should make it explicit.
That could be
gen new_var = (cond(missing(X), 0, 3 * X) ///
+ cond(missing(Y), 0, 2 * Y) ///
+ cond(missing(Z), 0, 4 * Z)) / 7
Let's look at your solutions and explain why they are all wrong either in general or usually.
(cond(missing(X*3),., X) + cond(missing(Y*2),., Y))/7
It is sufficient is note that if it's true that X is missing, then cond() yields missing, as then X * 3 is missing too. The same kind of remark applies to terms involving Y and Z. So you're replacing any missing values by missing values, which is no gain.
!missing(X*3+Y*2+Z*4)/7
Given the information that at least one of X Y Z is always missing, then this always evaluates to 0/7 or 0. Even if X Y Z were all non-missing, then it would evaluate to 1/7. That is a long way from the sum you want. missing() always yields 1 or 0, and its negation thus 0 or 1.
(max(X, Y, Z)/7) if missing(X , Y, Z)
The maximum of X, Y, Z will be the right answer if and only if one of the values is not missing and the other two are missing. max() ignores missings to the extent possible (even though in other contexts missings are treated as if arbitrarily large positive numbers).
If you just want to "ignore missing values" without "treating them as zeros", the following will work:
clear
set obs 10
generate X = rnormal(5, 2)
generate Y = rnormal(10, 5)
generate Z = rnormal(1, 10)
replace X = . in 2
replace Y = . in 5
replace Z = . in 9
generate new_var = (((X)*3) + ((Y)*2) + ((Z)*4)) / 7 if X != . | Y != . | Z != .
list
+---------------------------------------------+
| X Y Z new_var |
|---------------------------------------------|
1. | 3.651024 3.48609 -24.1695 -11.25039 |
2. | . 14.14995 8.232919 . |
3. | 3.689442 9.812483 1.154064 5.044221 |
4. | 2.500493 13.02909 5.25539 7.797317 |
5. | 4.19431 . 6.584174 . |
6. | 7.221717 13.92533 5.045283 9.956708 |
7. | 5.746871 14.26329 3.828253 8.725744 |
8. | 1.396223 16.2358 19.01479 16.10277 |
9. | 4.633088 13.95751 . . |
10. | 2.521546 4.490258 -3.396854 .422534 |
+---------------------------------------------+
Alternatively, you could also use the inlist() function:
generate new_var = (((X)*3) + ((Y)*2) + ((Z)*4)) / 7 if !inlist(., X, Y, Z)

How to extract components of a disorganized string variable in Stata?

I have a text variable showing patient prescription that looks quite messy like this:
PatientRx
ACETAZOLAMIDE 250MG TABLET- 100
ADAPALENE + BENZOYL 0.1% + 2.5% GEL-..
ADRENALINE/EPIPEN 300MCG/0.3ML INJ..
ALENDRONATE + COLECA 70MG + 140MCG TA..
ALLOPURINOL 100MG TABLET- 100
ALUM HYDROX + MAG HY 250+120+120MG/5M..
AMILORIDE + HYDROCHL 5MG + 50MG HCL T..
While I haven't looked through all these values, some patterns may arise:
Often times there are more than one drugs and they are separated, for example by space and forward slash.
Drugs are also be separated with plus sign. But plus sign is also used between doses.
The rule related to space is very arbitrary, both at the beginning and in the middle of entry.
How can I extract only the names of the drugs into new variables? New variables should look like this:
Newvar1 Newvar2
ACETAZOLAMIDE
ADAPALENE BENZOYL
ADRENALINE EPIPEN
ALENDRONATE COLECA
and so on.
Some would reach first for regular expressions, which you might indeed need for the full problem. In addition note moss as installed by ssc install moss.
But it seems easiest, given the information in the example here, which is all we have to go on, to look for the position of the first numeric digit 0 to 9 and then parse what goes before. I don't know whether drug names ever contain numeric digits.
clear
input str40 sandbox
" ACETAZOLAMIDE 250MG TABLET- 100"
"ADAPALENE + BENZOYL 0.1% + 2.5% GEL-"
" ADRENALINE/EPIPEN 300MCG/0.3ML INJ"
"ALENDRONATE + COLECA 70MG + 140MCG TA"
" ALLOPURINOL 100MG TABLET- 100"
"ALUM HYDROX + MAG HY 250+120+120MG/5M"
" AMILORIDE + HYDROCHL 5MG + 50MG HCL T"
end
gen wherenum = .
quietly forval j = 0/9 {
replace wherenum = min(wherenum, strpos(sandbox, "`j'")) if strpos(sandbox, "`j'")
}
gen drug = substr(sandbox, 1, wherenum - 1)
split drug, parse(+ /)
l drug?, sep(0)
+---------------------------+
| drug1 drug2 |
|---------------------------|
1. | ACETAZOLAMIDE |
2. | ADAPALENE BENZOYL |
3. | ADRENALINE EPIPEN |
4. | ALENDRONATE COLECA |
5. | ALLOPURINOL |
6. | ALUM HYDROX MAG HY |
7. | AMILORIDE HYDROCHL |
+---------------------------+

How to change order of string based on dates

I received data with a string variable that looks something like:
var_name
25-DEC-99: A11, B14, C89; 28-FEB-94: A27, B94, C30
01-APR-11: A25, B82, C65
04-JUL-09: A21, B55, C26; 12-MAR-03: A11, B72, C68; 08-JUN-11: A62, B47, C82
12-JUN-00: A77, B19, C73; 03-JUL-12: A99, B04, C54
27-OCT-15: A22, B95, C08
And so on. My goal is to split these strings up into different variable names. The variable names would be v1_date, v1_A, v1_B, v1_C, v2_date, v2_A, v2_B, v2_C, v3_date, v3_A, v3_B, v3_C.
I can use split var_name, p(";"), rename to be v1, v2, and v3, and then split again to do this. But the problem is that I want v1, v2, and v3 to be in chronological order based on the date and the data is not currently arranged in that fashion. How can I make it so that the date of v1 comes before v2 and the date of v2 comes before the date of v3? For example in the first observation, I want 25-DEC-99: A11, B14, C89 to be associated with v2 and 28-FEB-94: A27, B94, C30 to be associated with v1.
The following gets you close, I believe. It uses both split and reshape.
clear
set more off
input ///
str100 myvar
"25-DEC-99: A11, B14, C89; 28-FEB-94: A27, B94, C30"
"01-APR-11: A25, B82, C65"
"04-JUL-09: A21, B55, C26; 12-MAR-03: A11, B72, C68; 08-JUN-11: A62, B47, C82"
"12-JUN-00: A77, B19, C73; 03-JUL-12: A99, B04, C54"
"27-OCT-15: A22, B95, C08"
end
split myvar, p(;)
drop myvar
gen obs = _n
reshape long myvar, i(obs)
drop if missing(myvar)
split myvar, p(:)
drop myvar
gen myvar11 = date(myvar1, "DMY", 2020)
format %td myvar11
drop myvar1
rename (myvar11 myvar2) (mydate mycells)
order mydate, before(mycells)
bysort obs (mydate) : gen neworder = _n
drop _j
reshape wide mydate mycells, i(obs) j(neworder)
list
You can loop over the mycells variables if you need to further split them.
In general, please consider using dataex (SSC) to create easy data examples.
You don't give all the (not trivial) code you used to split the variables. As it happens, I don't think your variable names are easy to work with, so I re-created the split in my own fashion. If you reshape long the split data, then sorting by date is easy, but I have pulled up short of the reverse reshape wide, as I suspect the long structure is much easier to work with.
clear
input str80 data
"25-DEC-99: A11, B14, C89; 28-FEB-94: A27, B94, C30"
"01-APR-11: A25, B82, C65"
"04-JUL-09: A21, B55, C26; 12-MAR-03: A11, B72, C68; 08-JUN-11: A62, B47, C82"
"12-JUN-00: A77, B19, C73; 03-JUL-12: A99, B04, C54"
"27-OCT-15: A22, B95, C08"
end
split data, p(;) gen(x)
local j = 1
gen work = ""
foreach x of var x* {
replace work = substr(`x', 1, strpos(`x', ":") - 1)
gen date`j' = daily(work, "DMY", 2050)
replace work = substr(`x', strpos(`x', ":") + 1, .)
split work, p(,)
rename (work1 work2 work3) (vA`j' vB`j' vC`j')
local ++j
}
drop work
drop x*
drop data
gen id = _n
edit
reshape long date vA vB vC, i(id) j(which)
drop if missing(date)
bysort id (date): replace which = _n
list, sepby(id)
+----------------------------------------+
| id which date vA vB vC |
|----------------------------------------|
1. | 1 1 12477 A27 B94 C30 |
2. | 1 2 14603 A11 B14 C89 |
|----------------------------------------|
3. | 2 1 18718 A25 B82 C65 |
|----------------------------------------|
4. | 3 1 15776 A11 B72 C68 |
5. | 3 2 18082 A21 B55 C26 |
6. | 3 3 18786 A62 B47 C82 |
|----------------------------------------|
7. | 4 1 14773 A77 B19 C73 |
8. | 4 2 19177 A99 B04 C54 |
|----------------------------------------|
9. | 5 1 20388 A22 B95 C08 |
+----------------------------------------+