pick the last record in a list of variables/ columns - stata

I have a dataset in wide format in Stata and I would like to pick the last observation of each variable. In the example below, I would like to generate a new variable based on the last observation of the list of variables.
I tried the code below and it doesn't work. My thought was to pick one variable at a time, e.g. v1==1
id  v1  v2  v3  new_variable
1   1   2       2
2   1   2   3   3
3   1           1
4   1   4       4
gen new_variable=.
foreach v of varlist v*{
replace new_variable=1 if `v'==1
replace new_variable=2 if `v'==2
replace new_variable=3 if `v'==3
}

You want the last non-missing value in each observation (row, record, case) over a series of variables (columns, fields). Terminology in your question is confused.
I first interpret the blanks in your data example as numeric missing values. That being so, what you want is given by the egen function rowlast(). It can also be obtained by looping, as follows:
1. Initialise with the first variable.
2. Loop over the other variables, replacing whenever the variable in question is not missing.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte(v1 v2 v3) float wanted
1 2 . 2
1 2 3 3
1 . . 1
1 4 . 4
end
egen WANTED = rowlast(v1 v2 v3)
gen wAnTeD = v1
forval j = 2/3 {
replace wAnTeD = v`j' if !missing(v`j')
}
list
+-----------------------------------------+
| v1 v2 v3 wanted WANTED wAnTeD |
|-----------------------------------------|
1. | 1 2 . 2 2 2 |
2. | 1 2 3 3 3 3 |
3. | 1 . . 1 1 1 |
4. | 1 4 . 4 4 4 |
+-----------------------------------------+
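The loop in the question can also be repaired directly. What follows is a minimal sketch, assuming the variables are numeric and stored in the order v1-v3; the name last is made up for illustration. It starts from missing and overwrites with each non-missing value in turn, so the last non-missing value is what survives.

```stata
clear
input byte(v1 v2 v3)
1 2 .
1 2 3
1 . .
1 4 .
end
* Start from missing, then overwrite with each non-missing value in turn,
* so that the last non-missing value in each observation survives.
gen last = .
foreach v of varlist v1-v3 {
    replace last = `v' if !missing(`v')
}
list
```

Unlike the loop in the question, this does not need one replace statement for each distinct value.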
I next interpret the data as string variables. The egen solution doesn't work but the loop idea does work. Note that missing means empty strings "": spaces must be removed or ignored.
* Example generated by -dataex-. For more info, type help dataex
clear
input str1(v1 v2 v3 wanted)
"1" "2" "" "2"
"1" "2" "3" "3"
"1" "" "" "1"
"1" "4" "" "4"
end
gen WANTED = v1
forval j = 2/3 {
replace WANTED = v`j' if !missing(v`j')
}
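If the string variables might hold spaces rather than empty strings, one way to ignore them is to test the trimmed value. This is only a sketch under that assumption: the str2 storage type and the " " entries are invented here to show the problem, and last is a made-up name.

```stata
clear
input str2(v1 v2 v3)
"1" "2" " "
"1" "2" "3"
"1" ""  ""
"1" "4" " "
end
gen last = v1
forval j = 2/3 {
    * trim() strips leading and trailing spaces, so " " counts as missing
    replace last = v`j' if !missing(trim(v`j'))
}
list
```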

Related

Capturing non-missing values row wise and storing it in new variables

My dataset contains multiple variables called avar_1 to bvar_10 referring to the history of an individual. For some reason, the history is not always complete and there are some "gaps" (e.g. avar_1 and avar_4 are non-missing, but avar_2 and avar_3 are missing). For each individual, I want to store the first non-missing value in a new variable called var1, the second non-missing value in var2, etc., so that I have a history without missing values.
I've tried the following code
local x=1
foreach wave in a b {
forval i=1/10 {
capture drop var`x'
generate var`x'=.
capture replace var`x'=`wave'var`i' if !mi(`wave'`var'`i')
if (!mi(var`x')) {
local x=1+`x'
}
}
}
var1 is generated properly but var2 only contains missings and following variables are not generated. However, I set trace on and saw that the var2 is actually replaced for all variables from avar_1 to bvar_10.
My guess is that the local x is not correctly updated, as its value changes for the whole dataset but should be different for each observation.
Is that the problem and if so, how can I avoid it?
A concise concrete data example is worth more than a long explanation. Your description seems consistent with an example like this:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 id float(avar_1 avar_2 avar_3 bvar_1 bvar_2)
"A" 1 . 6 8 10
"B" 2 4 . 9 .
"C" 3 5 7 . 11
end
* 4 is specific to this example.
rename (bvar_*) (avar_#), renumber(4)
reshape long avar_, i(id) j(which)
(note: j = 1 2 3 4 5)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 3 -> 15
Number of variables 6 -> 3
j variable (5 values) -> which
xij variables:
avar_1 avar_2 ... avar_5 -> avar_
-----------------------------------------------------------------------------
drop if missing(avar_)
bysort id (which) : replace which = _n
list, sepby(id)
+--------------------+
| id which avar_ |
|--------------------|
1. | A 1 1 |
2. | A 2 6 |
3. | A 3 8 |
4. | A 4 10 |
|--------------------|
5. | B 1 2 |
6. | B 2 4 |
7. | B 3 9 |
|--------------------|
8. | C 1 3 |
9. | C 2 5 |
10. | C 3 7 |
11. | C 4 11 |
+--------------------+
Positive points:
Your data layout cries out for some structure given by a rename and especially by a reshape long. I don't give here code for a reshape wide as for the great majority of Stata purposes, you'd be better off with this layout.
Negative points:
!mi(var`x') inside an if command tests only whether the first value of the variable is not missing. If foo were a variable in the dataset, if !mi(foo) would be evaluated as if !mi(foo[1]). That is not what you want here, as the test needs to be made separately in each observation. See https://www.stata.com/support/faqs/programming/if-command-versus-if-qualifier/ for the full story.
I'd recommend more evocative variable names.
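The difference between the if command and the if qualifier can be seen in a small sandbox. The variable foo and its values are invented purely for illustration.

```stata
clear
input foo
.
2
3
end
* if command: evaluated once, using foo[1], which is missing here
if !mi(foo) {
    display "branch taken"
}
else {
    display "branch not taken: foo[1] is missing"
}
* if qualifier: evaluated separately in every observation
count if !mi(foo)
```

The if command takes the else branch even though two of the three values are not missing, while the if qualifier counts those two observations.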

Split IDs in categorical variables

I have a variable with IDs:
clear
input ID
1
.
2
1
.
3
4
5
4
4
6
end
How can I create separate categorical variables with ID as a name and values of 1 and 2 (the latter if the generated variable matches the ID)?
For example, variable _ID_1 should look as follows:
2
.
1
2
.
1
1
1
1
1
1
Any ideas?
Another way to do it:
clear
input ID
1
.
2
1
.
3
4
5
4
4
6
end
forvalues j = 1/6 {
generate ID_`j' = 1 + (ID == `j') if ID != .
}
list
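If the distinct values of ID are not known in advance, the loop limits need not be hard-coded. A sketch, assuming ID holds non-negative integers so that each value makes a legal variable name suffix: levelsof collects the distinct non-missing values into a local macro (called levels here, an arbitrary name).

```stata
clear
input ID
1
.
2
1
.
3
4
5
4
4
6
end
* levelsof ignores missing values by default
levelsof ID, local(levels)
foreach j of local levels {
    generate ID_`j' = 1 + (ID == `j') if !missing(ID)
}
list
```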

Combine overlapping categorical variables

I am trying to "combine" two categorical variables in Stata (say var1 and var2) into a new (also categorical) variable (say res).
The example below illustrates what I am trying to achieve:
var1 var2 res
1 1 A
1 2 A
2 1 A
3 3 B
4 2 A
5 4 D
What this example does is to combine all categories of var1 and var2 that "overlap".
Here, the pair var1 == 1 and var2 == 1 initially form a group (res== A). All other pairs containing var1 == 1 or var2 == 1 should belong to the same group (hence res== A in rows 2 and 3). Because in row 2 we have var2==2, any pair with containing var2==2 should belong to the same group. That's why in row 4 res== A.
Another way to look at this problem is using the following matrix:
| 1 2 3 4
-----------------------
1 | 1 1
2 | 1
3 | 1
4 | 1
5 | 1
Because the element [1,1] is not empty (or zero), all elements in row 1 and column 1 must belong to the same group. Because [1,2] is not empty, the same is true for row 1, column 2. And so on and so forth. It does not matter which row/column you decide to start from.
egen group alone doesn't cut it.
Any ideas?
Sounds like you want to further group var1 if the values of var2 are the same. If that's the case, then you can use a program I wrote called group_id that's available from SSC. To install it, type in Stata's Command window:
ssc install group_id
Here's an example of how you would use it:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(var1 var2) str1 res
1 1 "A"
1 2 "A"
2 1 "A"
3 3 "B"
4 2 "A"
5 4 "D"
end
gen long wanted = var1
group_id wanted, matchby(var2)
list, sep(0)
and the results:
. list, sep(0)
+----------------------------+
| var1 var2 res wanted |
|----------------------------|
1. | 1 1 A 1 |
2. | 1 2 A 1 |
3. | 2 1 A 1 |
4. | 3 3 B 3 |
5. | 4 2 A 1 |
6. | 5 4 D 5 |
+----------------------------+

Retrieve row number according to value of another variable in index

This problem is very simple in R, but I can't seem to get it to work in Stata.
I want to use the square brackets index, but with an expression in it that involves another variable, i.e. for a variable with unique values cumul I want:
replace country = country[cumul==20] in 12
cumul == 20 corresponds to row number 638 in the dataset, so the above should replace in line 12 the country variable with the value of that same variable in line 638. The above expression is clearly not the right way to do it: it just replaces the country variable in line 12 with a missing value.
Stata's row indexing does not work in that way. What you can do, however, is a simple two-line solution:
levelsof country if cumul==20
replace country = "`r(levels)'" in 12
If you want to be sure that cumul==20 uniquely identifies just a single value of country, add:
assert `:word count `r(levels)''==1
between the two lines.
It's probably worth explaining why the construct in the question doesn't work as you wish, beyond "Stata is not R!".
Given a variable x: in a reference like x[1] the [1] is referred to as a subscript, despite nothing being written below the line. The subscript is the observation number, the number being always that in the dataset as currently held in memory.
Stata allows expressions within subscripts; they are evaluated observation by observation and the result is then used to look-up values in variables. Consider this sandbox:
clear
input float y
1
2
3
4
5
end
. gen foo = y[mod(_n, 2)]
(2 missing values generated)
. gen x = 3
. gen bar = y[y == x]
(4 missing values generated)
. list
+-------------------+
| y foo x bar |
|-------------------|
1. | 1 1 3 . |
2. | 2 . 3 . |
3. | 3 1 3 1 |
4. | 4 . 3 . |
5. | 5 1 3 . |
+-------------------+
mod(_n, 2) is the remainder on dividing the observation number _n by 2: that is 1 for odd observation numbers and 0 for even numbers. Observation 0 is not in the dataset (Stata starts indexing at 1). It's not an error to refer to values in that observation, but the result is returned as missing (numeric missing here, and empty strings "" if the variable is string). Hence foo is y[1] or 1 for odd observation numbers and missing for even numbers.
True or false expressions are evaluated as 1 if true and 0 if false. Thus y == x is true only in observation 3, and so bar is the value of y[1] there and missing everywhere else. Stata doesn't have the special (and useful) twist in R that it is the subscripts for which a true or false expression is true that are used to select zero or more values.
There are ways of using subscripts to get special effects. This example shows one. (It's much easier to get the same kind of result in Mata.)
. gen random = runiform()
. sort random
. gen obs = _n
. sort y
. gen randomsorted = random[obs]
. l
+-----------------------------------------------+
| y foo x bar random obs random~d |
|-----------------------------------------------|
1. | 1 1 3 . .3488717 4 .0285569 |
2. | 2 . 3 . .2668857 3 .1366463 |
3. | 3 1 3 1 .1366463 2 .2668857 |
4. | 4 . 3 . .0285569 1 .3488717 |
5. | 5 1 3 . .8689333 5 .8689333 |
+-----------------------------------------------+
This answer doesn't cover matrices in Stata or Mata.
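Subscripts can still do the job in the question if the observation number is found first. This is a sketch only: the data are invented, obsno and target are made-up names, and the target observation is 3 rather than 12 because the toy dataset is tiny. It assumes that cumul == 20 identifies exactly one observation.

```stata
clear
input str10 country cumul
"Austria" 5
"Belgium" 20
"Chile"   35
end
gen long obsno = _n
* find the observation number in which cumul == 20
summarize obsno if cumul == 20, meanonly
local n = r(N)
local target = r(min)
* stop if cumul == 20 does not pick out exactly one observation
assert `n' == 1
replace country = country[`target'] in 3
list
```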

Fill in missing values of one variable using match with another variable

Imagine the following Stata data structure:
input x y
1 3
1 .
1 .
2 3
2 .
2 .
. 3
end
I want to fill the missing values using the corresponding match of pairs for other observations. However, if there is ambiguity (in the example, 3 corresponding to both 1 and 2), the code should not copy. In my example, the final data structure should look like this:
1 3
1 3
1 3
2 3
2 3
2 3
. 3
Note that both 1 and 2 are filled, as they are unambiguously 3.
My data is only numeric, and the number of unique values of variables x and y is large, so I am looking for a general rule that works in every case.
I am thinking on using the user-written command carryforward, running something like
bysort x: carryforward y if x != . , replace dynamic_condition(x[_n-1] == x[_n]) strict
bysort y: carryforward x if y != . , replace dynamic_condition(y[_n-1] == y[_n]) strict
Yet, this does not work when there are double matches.
UPDATE: the solution proposed by Nick does not work for every example. I updated the example to reflect this. The proposed solution fails because the egen function tag() puts a 1 on only one instance of each value. Thus, when a value (3) is related to two values (1, 2), the tag appears for only one of them, and so the copying occurs for only one. For the example above, Nick's code and results are:
egen tagy = tag(y) if !missing(y)
egen tagx = tag(x) if !missing(x)
egen ny = total(tagy), by(x)
egen nx = total(tagx), by(y)
bysort x (y) : replace y = y[1] if ny == 1
bysort y (x) : replace x = x[1] if nx == 1
list, sep(0)
+-------------------------------+
| x y tagy tagx ny nx |
|-------------------------------|
1. | 1 3 0 0 1 0 |
2. | 1 3 0 0 1 0 |
3. | 1 3 1 1 1 2 |
4. | 2 3 0 1 0 2 |
5. | . 3 0 0 0 2 |
6. | 2 . 0 0 0 0 |
7. | 2 . 0 0 0 0 |
+-------------------------------+
As seen, the code works for filling x=1 and not filling y=3 (line 5). Yet, it does not fill lines 6 and 7 because tagy=1 only appears once (x=1).
This is a bit clunky, but it should work:
bysort x: egen temp=sd(x) if x!=.
bysort x (y): replace y=y[1] if temp==0
drop temp
Since the standard deviation of a constant is zero, temp=0 if non-missing x's are all the same.
sort x y
replace y = y[_n-1] if missing(y) & x[_n-1] == x[_n]
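Another possibility, sketched here rather than offered as a tested general solution, is to check ambiguity directly: within each group, the non-missing values all agree if and only if their minimum equals their maximum. The names ymin, ymax, xmin and xmax are arbitrary.

```stata
clear
input x y
1 3
1 .
1 .
2 3
2 .
2 .
. 3
end
* fill y within x only if all non-missing y in that group agree
egen ymin = min(y), by(x)
egen ymax = max(y), by(x)
replace y = ymin if missing(y) & !missing(x) & ymin == ymax
* fill x within y only if all non-missing x in that group agree
egen xmin = min(x), by(y)
egen xmax = max(x), by(y)
replace x = xmin if missing(x) & !missing(y) & xmin == xmax
drop ymin ymax xmin xmax
list, sep(0)
```

Here every missing y is filled with 3, while the missing x stays missing because y == 3 goes with both x == 1 and x == 2.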