Retrieve row number according to value of another variable in index - stata

This problem is very simple in R, but I can't seem to get it to work in Stata.
I want to use the square brackets index, but with an expression in it that involves another variable, i.e. for a variable with unique values cumul I want:
replace country = country[cumul==20] in 12
cumul == 20 corresponds to row number 638 in the dataset, so the above should replace in line 12 the country variable with the value of that same variable in line 638. The above expression is clearly not the right way to do it: it just replaces the country variable in line 12 with a missing value.

Stata's row indexing does not work in that way. What you can do, however, is a simple two-line solution:
levelsof country if cumul==20
replace country = "`r(levels)'" in 12
If you want to be sure that cumul==20 uniquely identifies just a single value of country, add:
assert `:word count `r(levels)''==1
between the two lines.
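If you would rather stay with subscripts, another route is to find the observation number first and then use it as a literal subscript. What follows is only a sketch on made-up data: the three countries, the helper variable obsno and the local macro target are assumptions for illustration, and the target row is 3 rather than 12 because the toy dataset is tiny.

```stata
* Toy data: variable names follow the question, values are invented
clear
input str10 country float cumul
"Austria" 5
"Belgium" 20
"Chile"   35
end

* obsno is the observation number where the condition holds, missing elsewhere
generate long obsno = _n if cumul == 20
summarize obsno, meanonly
assert r(N) == 1                  // cumul == 20 must pick out exactly one row
local target = r(min)

* a constant subscript now does what the question intended
replace country = country[`target'] in 3
drop obsno
list
```

The assert guards against cumul == 20 matching no observation or several.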

It's probably worth explaining why the construct in the question doesn't work as you wish, beyond "Stata is not R!".
Given a variable x: in a reference like x[1] the [1] is referred to as a subscript, despite nothing being written below the line. The subscript is the observation number, the number being always that in the dataset as currently held in memory.
Stata allows expressions within subscripts; they are evaluated observation by observation and the result is then used to look up values in variables. Consider this sandbox:
clear
input float y
1
2
3
4
5
end
. gen foo = y[mod(_n, 2)]
(2 missing values generated)
. gen x = 3
. gen bar = y[y == x]
(4 missing values generated)
. list
     +-------------------+
     | y   foo   x   bar |
     |-------------------|
  1. | 1     1   3     . |
  2. | 2     .   3     . |
  3. | 3     1   3     1 |
  4. | 4     .   3     . |
  5. | 5     1   3     . |
     +-------------------+
mod(_n, 2) is the remainder on dividing the observation number _n by 2: that is 1 for odd observation numbers and 0 for even numbers. Observation 0 is not in the dataset (Stata starts indexing at 1). It's not an error to refer to values in that observation, but the result is returned as missing (numeric missing here, and empty strings "" if the variable is string). Hence foo is y[1], namely 1, for odd observation numbers and missing for even numbers.
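A quick way to convince yourself about out-of-range subscripts is a throwaway check like the one below. It is only a sketch: the string variable s and the new variable names are invented for the illustration.

```stata
clear
input float y str1 s
1 "a"
2 "b"
3 "c"
end

generate ynum    = y[0]         // no observation 0: numeric missing everywhere
generate sstr    = s[0]         // string variable: empty string everywhere
generate ybeyond = y[_N + 1]    // beyond the last observation: also missing

assert missing(ynum)
assert sstr == ""
assert missing(ybeyond)
```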
True or false expressions are evaluated as 1 if true and 0 if false. Thus y == x is true only in observation 3, and so bar is the value of y[1] there and missing everywhere else. Stata doesn't have the special (and useful) twist in R that it is the subscripts for which a true or false expression is true that are used to select zero or more values.
There are ways of using subscripts to get special effects. This example shows one. (It's much easier to get the same kind of result in Mata.)
. gen random = runiform()
. sort random
. gen obs = _n
. sort y
. gen randomsorted = random[obs]
. l
     +-----------------------------------------------+
     | y   foo   x   bar     random   obs   random~d |
     |-----------------------------------------------|
  1. | 1     1   3     .   .3488717     4   .0285569 |
  2. | 2     .   3     .   .2668857     3   .1366463 |
  3. | 3     1   3     1   .1366463     2   .2668857 |
  4. | 4     .   3     .   .0285569     1   .3488717 |
  5. | 5     1   3     .   .8689333     5   .8689333 |
     +-----------------------------------------------+
This answer doesn't cover matrices in Stata or Mata.

Related

pick the last record in a list of variables/ columns

I have a dataset in wide format in Stata and I would like to pick the last observation of each variable. In the example below, I would like to generate a new variable based on the last observation of the list of variables.
I tried the code below and it doesn't work. My thought was to pick one variable at a time, e.g. v1==1
id  v1  v2  v3  new_variable
1   1   2       2
2   1   2   3   3
3   1           1
4   1   4       4
gen new_variable=.
foreach v of varlist v* {
    replace new_variable=1 if `v'==1
    replace new_variable=2 if `v'==2
    replace new_variable=3 if `v'==3
}
You want the last non-missing value in each observation (row, record, case) over a series of variables (columns, fields). Terminology in your question is confused.
I first interpret the blanks in your data example as numeric missing values. That being so, what you want is given by the egen function rowlast(). It can also be obtained by looping as follows:
1. Initialise with the first variable.
2. Looping over the other variables, replace if each variable is not missing.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte(v1 v2 v3) float wanted
1 2 . 2
1 2 3 3
1 . . 1
1 4 . 4
end
egen WANTED = rowlast(v1 v2 v3)
gen wAnTeD = v1
forval j = 2/3 {
    replace wAnTeD = v`j' if !missing(v`j')
}
list
     +-----------------------------------------+
     | v1   v2   v3   wanted   WANTED   wAnTeD |
     |-----------------------------------------|
  1. |  1    2    .        2        2        2 |
  2. |  1    2    3        3        3        3 |
  3. |  1    .    .        1        1        1 |
  4. |  1    4    .        4        4        4 |
     +-----------------------------------------+
I next interpret the data as string variables. The egen solution doesn't work but the loop idea does work. Note that missing means empty strings "": spaces must be removed or ignored.
* Example generated by -dataex-. For more info, type help dataex
clear
input str1(v1 v2 v3 wanted)
"1" "2" "" "2"
"1" "2" "3" "3"
"1" "" "" "1"
"1" "4" "" "4"
end
gen WANTED = v1
forval j = 2/3 {
    replace WANTED = v`j' if !missing(v`j')
}
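On the point about spaces: if some cells hold blanks rather than empty strings, one way to ignore them is to test the trimmed value. This is a minimal sketch, assuming that a cell holding only spaces should count as missing; the data and the variable name last are invented for the illustration.

```stata
clear
input str2(v1 v2 v3)
"1" "2" " "
"1" "2" "3"
"1" " " ""
"1" "4" ""
end

generate last = v1
forvalues j = 2/3 {
    * trim() strips leading and trailing blanks, so " " is treated like ""
    replace last = v`j' if !missing(trim(v`j'))
}
list
```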

Capturing non-missing values row wise and storing it in new variables

My dataset contains multiple variables called avar_1 to bvar_10 referring to the history of an individual. For some reason, the history is not always complete and there are some "gaps" (e.g. avar_1 and avar_4 are non-missing, but avar_2 and avar_3 are missing). For each individual, I want to store the first non-missing value in a new variable called var1, the second non-missing in var2, etc., so that I have a history without missing values.
I've tried the following code
local x=1
foreach wave in a b {
    forval i=1/10 {
        capture drop var`x'
        generate var`x'=.
        capture replace var`x'=`wave'var`i' if !mi(`wave'`var'`i')
        if (!mi(var`x')) {
            local x=1+`x'
        }
    }
}
var1 is generated properly but var2 only contains missings and following variables are not generated. However, I set trace on and saw that the var2 is actually replaced for all variables from avar_1 to bvar_10.
My guess is that the local x is not correctly updated, as its value changes for the whole dataset but should be different for each observation.
Is that the problem and if so, how can I avoid it?
A concise concrete data example is worth more than a long explanation. Your description seems consistent with an example like this:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 id float(avar_1 avar_2 avar_3 bvar_1 bvar_2)
"A" 1 . 6 8 10
"B" 2 4 . 9 .
"C" 3 5 7 . 11
end
* 4 is specific to this example.
rename (bvar_*) (avar_#), renumber(4)
reshape long avar_, i(id) j(which)
(note: j = 1 2 3 4 5)
Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        3   ->      15
Number of variables                   6   ->       3
j variable (5 values)                     ->   which
xij variables:
               avar_1 avar_2 ... avar_5   ->   avar_
-----------------------------------------------------------------------------
drop if missing(avar_)
bysort id (which) : replace which = _n
list, sepby(id)
     +--------------------+
     | id   which   avar_ |
     |--------------------|
  1. |  A       1       1 |
  2. |  A       2       6 |
  3. |  A       3       8 |
  4. |  A       4      10 |
     |--------------------|
  5. |  B       1       2 |
  6. |  B       2       4 |
  7. |  B       3       9 |
     |--------------------|
  8. |  C       1       3 |
  9. |  C       2       5 |
 10. |  C       3       7 |
 11. |  C       4      11 |
     +--------------------+
Positive points:
Your data layout cries out for some structure given by a rename and especially by a reshape long. I don't give here code for a reshape wide as for the great majority of Stata purposes, you'd be better off with this layout.
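Should the wide layout with var1, var2, and so on still be needed, the return trip might look like the sketch below. It repeats the steps on the same example data; the stub name var is taken from the question, and calling reshape clear first is a cautious assumption so that settings left over from the reshape long do not get in the way.

```stata
clear
input str1 id float(avar_1 avar_2 avar_3 bvar_1 bvar_2)
"A" 1 . 6 8 10
"B" 2 4 . 9 .
"C" 3 5 7 . 11
end

rename (bvar_*) (avar_#), renumber(4)
reshape long avar_, i(id) j(which)
drop if missing(avar_)
bysort id (which) : replace which = _n

* back to wide: var1 is the first non-missing value, var2 the second, ...
reshape clear
rename avar_ var
reshape wide var, i(id) j(which)
list
```

Individuals with fewer non-missing values simply end up with missing values in the last variables.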
Negative points:
!mi(var`x')
returns whether the first value of a variable is not missing. If foo were a variable in the dataset, !mi(foo) is evaluated as !mi(foo[1]). That is not what you want here. See https://www.stata.com/support/faqs/programming/if-command-versus-if-qualifier/ for the full story.
I'd recommend more evocative variable names.
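The difference between the if command and the if qualifier is easy to see on a toy dataset. This is just a sketch: the variable foo and the local macro hits are made up for the illustration.

```stata
clear
input float foo
.
2
3
end

* if command: looks only at foo[1], which is missing here
local hits = 0
if !mi(foo) {
    local hits = 1
}
display "if command result: `hits'"

* if qualifier: the condition is checked in every observation
count if !mi(foo)
display "if qualifier count: " r(N)
```

The if command leaves hits at 0 because foo[1] is missing, while the if qualifier counts the two non-missing observations.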

Splitting values in Stata and save them in a new variable

I have a numeric variable with values similar to the following system
1
2
12
21
2
I would like to split the values which have length > 1 and put the second half of the value
in another variable.
So the second variable would have the values:
.
.
2
1
.
Theoretically I would just use a simple replace statement, but I am looking for a code/loop, which would
recognize the double digit values and split them automatically and save them in the second variable. Because with time, there will be more observations added and I cannot do this task manually for >10k cases.
Here's one approach:
clear
input foo
1
2
12
21
2
end
generate foo1 = floor(foo/10)
generate foo2 = mod(foo, 10)
list
     +-------------------+
     | foo   foo1   foo2 |
     |-------------------|
  1. |   1      0      1 |
  2. |   2      0      2 |
  3. |  12      1      2 |
  4. |  21      2      1 |
  5. |   2      0      2 |
     +-------------------+
For more on these functions, see help floor() and help mod().
If zeros for the first part should be missing, then
replace foo1 = . if foo1 == 0
or (to do it in one)
generate foo1 = floor(foo/10) if foo >= 10
The code is also good for any arguments with three digits or more.
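To see what that means for longer values, and to leave both new variables missing for single-digit values as in the question's wanted output, a small sketch follows. The names head and tail and the test values are invented for the illustration.

```stata
clear
input foo
7
45
123
end

* everything except the last digit, and the last digit itself
generate head = floor(foo/10) if foo >= 10
generate tail = mod(foo, 10)  if foo >= 10
list
```

With three digits the split is 12 and 3, so only the last digit is peeled off; that may or may not be the split you want.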

Fill in missing values of one variable using match with another variable

Imagine the following Stata data structure:
input x y
1 3
1 .
1 .
2 3
2 .
2 .
. 3
end
I want to fill the missing values using the corresponding match of pairs for other observations. However, if there is ambiguity (in the example, 3 corresponding to both 1 and 2), the code should not copy. In my example, the final data structure should look like this:
1 3
1 3
1 3
2 3
2 3
2 3
. 3
Note that both 1 and 2 are filled, as they are unambiguously 3.
My data is only numeric, and the number of unique values of variables x and y is large, so I am looking for a general rule that works in every case.
I am thinking of using the user-written command carryforward, running something like
bysort x: carryforward y if x != . , replace dynamic_condition(x[_n-1] == x[_n]) strict
bysort y: carryforward x if y != . , replace dynamic_condition(y[_n-1] == y[_n]) strict
Yet, this does not work when there are double matches.
UPDATE: the solution proposed by Nick does not work for every example. I updated the example to reflect this. The reason why the proposed solution does not work is because the function tag puts a 1 only at one instance of each value. Thus, when a value (3) is related to two values (1, 2), the tag will appear only in one of them. Hence, the copying occurs for one. In the example above, Nick's code and results are:
egen tagy = tag(y) if !missing(y)
egen tagx = tag(x) if !missing(x)
egen ny = total(tagy), by(x)
egen nx = total(tagx), by(y)
bysort x (y) : replace y = y[1] if ny == 1
bysort y (x) : replace x = x[1] if nx == 1
list, sep(0)
     +-------------------------------+
     | x   y   tagy   tagx   ny   nx |
     |-------------------------------|
  1. | 1   3      0      0    1    0 |
  2. | 1   3      0      0    1    0 |
  3. | 1   3      1      1    1    2 |
  4. | 2   3      0      1    0    2 |
  5. | .   3      0      0    0    2 |
  6. | 2   .      0      0    0    0 |
  7. | 2   .      0      0    0    0 |
     +-------------------------------+
As seen, the code works for filling x=1 and not filling y=3 (line 5). Yet, it does not fill lines 6 and 7 because tagy=1 only appears once (x=1).
This is a bit clunky, but it should work:
bysort x: egen temp=sd(x) if x!=.
bysort x (y): replace y=y[1] if temp==0
drop temp
Since the standard deviation of a constant is zero, temp=0 if non-missing x's are all the same.
sort x y
replace y = y[_n-1] if missing(y) & x[_n-1] == x[_n]
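If the no-ambiguity requirement needs to be explicit, one possibility is to count the distinct non-missing values of y within each x and copy only when there is exactly one. A minimal sketch on the question's data, filling y only; first and ndistinct are names made up here, and the same idea with x and y swapped would handle the other direction.

```stata
clear
input x y
1 3
1 .
1 .
2 3
2 .
2 .
. 3
end

* flag the first occurrence of each distinct non-missing y within x
bysort x y : generate byte first = (_n == 1) & !missing(y)
bysort x : egen ndistinct = total(first)

* copy only when x is known and maps to exactly one value of y
bysort x (y) : replace y = y[1] if ndistinct == 1 & !missing(x)
drop first ndistinct
list, sep(0)
```

The observation with missing x is left alone, since y == 3 goes with both x == 1 and x == 2.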

Count observations within dynamic range

Consider the following example:
input group day month year number treatment NUM
1 1 2 2000 1 1 2
1 1 6 2000 2 0 .
1 1 9 2000 3 0 .
1 1 5 2001 4 0 .
1 1 1 2010 5 1 1
1 1 5 2010 6 0 .
2 1 1 2001 1 1 0
2 1 3 2002 2 1 0
end
gen date = mdy(month,day,year)
format date %td
drop day month year
For each group, I have a varying number of observations. Each observation refers to an event that is specified with a date. Variable number is the numbering within each group.
Now, I want to count the number of observations that occur one year starting from the date of each treatment observation (excluding itself) within this group. This means, I want to create the variable NUM that I have already put into my example above. I do not care about the number of observations with treatment = 0.
EDIT Begin: The following information was found to be missing but necessary to tackle this problem: The treatment variable will have a value of 1 if there is no observation within the same group in the last year. Thus it is also not possible that the variable NUM will have to consider observations with treatment = 1. In principle, it is possible that there are two observations within a group that have identical dates. EDIT End
I have looked into Stata tip 51: Events in intervals. It seems to work. However, my dataset is huge (> 1 million observations), so that approach is really inefficient, especially because I do not care about all the treatment = 0 observations.
I was wondering if there is any alternative. My approach was to look for the observation with the latest date within each group that is still in the range of 1 year (and maybe store it in variable latestDate). Then I would simply subtract the value in variable number of the observation found from the value in count of the treatment = 0 variable.
Note: My "inefficient" code looks as follows
gsort -treatment
gen treatment_id = _n
replace treatment_id = . if treatment==0
gen count=.
sum treatment_id, meanonly
qui forval i = 1/`r(max)' {
    count if inrange(date-date[`i'],1,365) & group == group[`i']
    replace count = r(N) in `i'
}
sort group date
I am assuming that treatment can't occur within 1 year of the previous treatment (in the group). This is true in your example data, but may not be true in general. But, assuming that it is the case, then this should work. I'm using carryforward which is on SSC (ssc install carryforward). Like your latestDate thought, I determine one year after the most recent treatment and count the number of observations in that window.
sort group date
gen yrafter = (date + 365) if treatment == 1
by group: carryforward yrafter, replace
format yrafter %td
gen in_window = date <= yrafter & treatment == 0
egen answer = total(in_window), by(group yrafter)
replace answer = . if treatment == 0
I can't promise this will be faster than a loop but I suspect that it will be.
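For comparison, here is a sketch of a loop that needs no gsort and does the counting only for treated observations, checked against the NUM values given in the question. It uses nothing beyond official Stata; it still visits every observation to test treatment, so it is a baseline for checking results rather than a claim about speed.

```stata
clear
input group day month year number treatment NUM
1 1 2 2000 1 1 2
1 1 6 2000 2 0 .
1 1 9 2000 3 0 .
1 1 5 2001 4 0 .
1 1 1 2010 5 1 1
1 1 5 2010 6 0 .
2 1 1 2001 1 1 0
2 1 3 2002 2 1 0
end
gen date = mdy(month, day, year)
format date %td

gen count = .
forvalues i = 1/`=_N' {
    * explicit subscript: test observation `i', not observation 1
    if treatment[`i'] == 1 {
        quietly count if inrange(date - date[`i'], 1, 365) & group == group[`i']
        quietly replace count = r(N) in `i'
    }
}

* missing equals missing in Stata, so untreated rows compare equal too
assert count == NUM
```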
The question is not completely clear.
Consider the following data with two different results, num2 and num3:
+-----------------------------------------+
|     date2   group   treat   num2   num3 |
|-----------------------------------------|
| 01feb2000       1       1      3      2 |
| 01jun2000       1       0      .      . |
| 01sep2000       1       0      .      . |
| 01nov2000       1       1      0      0 |
| 01may2002       1       0      .      . |
| 01jan2010       1       1      1      1 |
| 01may2010       1       0      .      . |
|-----------------------------------------|
| 01jan2001       2       1      0      0 |
| 01mar2002       2       1      0      0 |
+-----------------------------------------+
The variable num2 is computed assuming you are interested in counting all observations that are within a one-year period after a treated observation (treat == 1), be those observations equal to 0 or 1 for treat. For example, after 01feb2000, there are three observations that comply with the time span condition; two have treat==0 and one has treat == 1, and they are all counted.
The variable num3 is also counting observations that are within a one-year period after a treated observation, but only the cases for which treat == 0.
num2 is computed with code in the spirit of the article you have cited. The use of in makes the run more efficient and there is no gsort (as in your code), which is quite slow. I have assumed that in each group there are no repeated dates:
clear
set more off
input ///
group str15 date count treat num
1 01.02.2000 1 1 2
1 01.06.2000 2 0 .
1 01.09.2000 3 0 .
1 01.11.2000 3 1 .
1 01.05.2002 4 0 .
1 01.01.2010 5 1 1
1 01.05.2010 6 0 .
2 01.01.2001 1 1 0
2 01.03.2002 2 1 0
end
list
gen date2 = date(date,"DMY")
format date2 %td
drop date count num
order date2
list, sepby(group)
*----- what you want -----
gen num2 = .
isid group date2, sort
forvalues j = 1/`=_N' {
    count in `j'/L if inrange(date2 - date2[`j'], 1, 365) & group == group[`j']
    replace num2 = r(N) in `j'
}
replace num2 = . if !treat
list, sepby(group)
num3 is computed with code similar in spirit (and results) to that posted by @jfeigenbaum:
<snip>
*----- what you want -----
isid group date2, sort
by group: gen indicat = sum(treat)
sort group indicat, stable
by group indicat: egen num3 = total(inrange(date2 - date2[1], 1, 365))
replace num3 = . if !treat
list, sepby(group)
Even more than two interpretations are possible for your problem, but I'll leave it at that.
(Note that I have changed your example data to include cases that probably make the problem more realistic.)