I am studying Stata programming with the book An Introduction to Stata Programming, Second Edition.
In chapter 4 there is code to generate a variable that counts how many of some other variables satisfy a logical condition. The code is like this:
foreach v of varlist child1-child12{
local n_school "`n_school' + inrange(`v', 1, 5)"
}
gen n_school = `n_school'
When I change this code to suit my own data,
foreach v of varlist qp605_s_1-qp605_s_5 {
local n_med "`n_med' + inrange(`v', 1, 5)"
}
gen n_med = `n_med'
where qp605_s_1's values range from 1 to 17, then Stata returns:
. foreach v of varlist qp605_s_1-qp605_s_5 {
2. local n_med "`n_med' + inrange(`v', 1, 5)"
3. }
. gen n_med = `n_med'
unknown function +inrange()
r(133);
Any ideas what is wrong with this code?
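I see where I went wrong.
The local macro n_med is empty before the loop, so the expression it builds begins with a bare +, which generate cannot parse. So I initialise it to 0: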
local n_med 0
foreach v of varlist qp605_s_1-qp605_s_5{
local n_med "`n_med' + inrange(`v', 1, 5)"
}
gen n_med = `n_med',after(qp605_s_5)
and it works!
BTW, according to An Introduction to Stata Programming, this method is faster than first generating a variable that is all zeros and then replacing it in a loop, because the replace command is slower than generate; so it is better to avoid replace where you can.
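The fix can be checked on a small sandbox. This is only a sketch: the toy values and the names child1-child3 and n_school2 are made up for illustration. The last line uses the egen function anycount() as a cross-check, which applies here only because the values are integers.

```stata
clear
input byte(child1 child2 child3)
1 6 3
7 2 9
5 5 5
end

* Start from 0 so that the expression never begins with a bare +
local n_school 0
foreach v of varlist child1-child3 {
    local n_school "`n_school' + inrange(`v', 1, 5)"
}

* Show the expression that was built, then evaluate it in one pass
display "`n_school'"
generate n_school = `n_school'

* Cross-check for integer data: count variables taking any of the values 1 to 5
egen n_school2 = anycount(child1-child3), values(1/5)
list
```

Displaying the macro before using it is a cheap way to see exactly what generate will be asked to evaluate.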
Here is another approach.
* Example generated by -dataex-. For more info, type help dataex
clear
input float(var1 var2)
1 5
2 6
3 7
4 8
end
gen wanted = .
mata :
data = st_data(., "var*")
st_store(., "wanted", rowsum(data :>= 1 :& data :<= 5))
end
list
+----------------------+
| var1 var2 wanted |
|----------------------|
1. | 1 5 2 |
2. | 2 6 1 |
3. | 3 7 1 |
4. | 4 8 1 |
+----------------------+
I have a dataset in wide format in Stata and I would like to pick the last observation of each variable. In the example below, I would like to generate a new variable based on the last observation of the list of variables.
I tried the code below and it doesn't work. My thought was to pick one variable at a time, e.g. v1==1
id  v1  v2  v3  new_variable
1   1   2       2
2   1   2   3   3
3   1           1
4   1   4       4
gen new_variable=.
foreach v of varlist v*{
replace new_variable=1 if `v'==1
replace new_variable=2 if `v'==2
replace new_variable=3 if `v'==3
}
You want the last non-missing value in each observation (row, record, case) over a series of variables (columns, fields). Terminology in your question is confused.
I first interpret the blanks in your data example as numeric missing values. That being so, what you want is given by the egen function rowlast(). It can also be obtained by looping as follows:
Initialise with the first variable.
Looping over the other variables, replace with each variable wherever it is not missing.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte(v1 v2 v3) float wanted
1 2 . 2
1 2 3 3
1 . . 1
1 4 . 4
end
egen WANTED = rowlast(v1 v2 v3)
gen wAnTeD = v1
forval j = 2/3 {
replace wAnTeD = v`j' if !missing(v`j')
}
list
+-----------------------------------------+
| v1 v2 v3 wanted WANTED wAnTeD |
|-----------------------------------------|
1. | 1 2 . 2 2 2 |
2. | 1 2 3 3 3 3 |
3. | 1 . . 1 1 1 |
4. | 1 4 . 4 4 4 |
+-----------------------------------------+
I next interpret the data as string variables. The egen solution doesn't work but the loop idea does work. Note that missing means empty strings "": spaces must be removed or ignored.
* Example generated by -dataex-. For more info, type help dataex
clear
input str1(v1 v2 v3 wanted)
"1" "2" "" "2"
"1" "2" "3" "3"
"1" "" "" "1"
"1" "4" "" "4"
end
gen WANTED = v1
forval j = 2/3 {
replace WANTED = v`j' if !missing(v`j')
}
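The remark about spaces matters if a cell holds a blank rather than an empty string, because missing() treats " " as non-missing. A minimal sketch, assuming a toy dataset in which one cell is a single space; the variable name last is made up for illustration:

```stata
clear
input str1(v1 v2 v3)
"1" "2" " "
"1" "2" "3"
"1" ""  ""
"1" "4" ""
end

* trim() strips leading and trailing spaces, so " " is treated like ""
gen last = v1
forval j = 2/3 {
    replace last = v`j' if trim(v`j') != ""
}
list
```

With !missing(v`j') as the condition, the first observation would end up holding the blank instead of "2".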
My dataset contains multiple variables called avar_1 to bvar_10 referring to the history of an individual. For some reason, the history is not always complete and there are some "gaps" (e.g. avar_1 and avar_4 are non-missing, but avar_2 and avar_3 are missing). For each individual, I want to store the first non-missing value in a new variable called var1, the second non-missing value in var2, etc., so that I have a history without missing values.
I've tried the following code
local x=1
foreach wave in a b {
forval i=1/10 {
capture drop var`x'
generate var`x'=.
capture replace var`x'=`wave'var`i' if !mi(`wave'`var'`i')
if (!mi(var`x')) {
local x=1+`x'
}
}
}
var1 is generated properly, but var2 contains only missing values and the following variables are not generated. However, I set trace on and saw that var2 is actually replaced for all variables from avar_1 to bvar_10.
My guess is that the local x is not updated correctly: its value changes for the whole dataset, but it should be different for each observation.
Is that the problem and if so, how can I avoid it?
A concise concrete data example is worth more than a long explanation. Your description seems consistent with an example like this:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 id float(avar_1 avar_2 avar_3 bvar_1 bvar_2)
"A" 1 . 6 8 10
"B" 2 4 . 9 .
"C" 3 5 7 . 11
end
* 4 is specific to this example.
rename (bvar_*) (avar_#), renumber(4)
reshape long avar_, i(id) j(which)
(note: j = 1 2 3 4 5)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 3 -> 15
Number of variables 6 -> 3
j variable (5 values) -> which
xij variables:
avar_1 avar_2 ... avar_5 -> avar_
-----------------------------------------------------------------------------
drop if missing(avar_)
bysort id (which) : replace which = _n
list, sepby(id)
+--------------------+
| id which avar_ |
|--------------------|
1. | A 1 1 |
2. | A 2 6 |
3. | A 3 8 |
4. | A 4 10 |
|--------------------|
5. | B 1 2 |
6. | B 2 4 |
7. | B 3 9 |
|--------------------|
8. | C 1 3 |
9. | C 2 5 |
10. | C 3 7 |
11. | C 4 11 |
+--------------------+
Positive points:
Your data layout cries out for some structure given by a rename and especially by a reshape long. I don't give here code for a reshape wide as for the great majority of Stata purposes, you'd be better off with this layout.
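If the wide layout asked for in the question really is needed, a reshape wide after the steps above gets there. This is a sketch only, reusing the example data; the stub var comes from the question, and reshape clear is added as a precaution to forget the settings of the earlier reshape.

```stata
clear
input str1 id float(avar_1 avar_2 avar_3 bvar_1 bvar_2)
"A" 1 . 6 8 10
"B" 2 4 . 9 .
"C" 3 5 7 . 11
end

* 4 is specific to this example.
rename (bvar_*) (avar_#), renumber(4)
reshape long avar_, i(id) j(which)
drop if missing(avar_)
bysort id (which) : replace which = _n

* Back to wide: var1 is the first non-missing value, var2 the second, and so on
reshape clear
rename avar_ var
reshape wide var, i(id) j(which)
list
```

Individuals with shorter histories simply have missing values in the last of the new variables.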
Negative points:
if (!mi(var`x'))
tests only whether the first value of var`x' is not missing. In the condition of an if command, a reference to a variable is evaluated for the first observation: if foo were a variable in the dataset, if !mi(foo) would be evaluated as if !mi(foo[1]). That is not what you want here. See https://www.stata.com/support/faqs/programming/if-command-versus-if-qualifier/ for the full story.
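The difference is easy to see in a sandbox. A minimal sketch, assuming a toy variable foo whose first value is missing:

```stata
clear
input float foo
.
2
3
end

* if command: the condition is evaluated once, using foo[1]
if !missing(foo) {
    display "if command: condition is true"
}
else {
    display "if command: condition is false, because foo[1] is missing"
}

* if qualifier: the condition is evaluated for every observation
count if !missing(foo)
```

The if command takes the else branch even though two of the three values are non-missing, while count with the if qualifier reports 2.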
I'd recommend more evocative variable names.
Suppose I have 100 variables named ID, var1, var2, ..., var99. I have 1000 rows. I want to browse all the rows and columns that contain a 0.
I wanted to just do this:
browse ID, var* if var* == 0
but it doesn't work. I don't want to hardcode all 99 variables obviously.
I wanted to essentially write an if like this:
gen has0 = 0
forvalues n = 1/99 {
if var`n' does not contain 0 {
drop v
} // pseudocode I know doesn't work
has0 = has0 | var`n' == 0
}
browse if has0 == 1
but obviously that doesn't work.
Do I just need to reshape the data so it has 2 columns ID, var with 100,000 rows total?
My dear colleague @NickCox forces me to reply to this (duplicate) question because he is claiming that downloading, installing and running a new command is better than using built-in ones when you "need to select from 99 variables".
Consider the following toy example:
clear
input var1 var2 var3 var4 var5
1 4 9 5 0
1 8 6 3 7
0 6 5 6 8
4 5 1 8 3
2 1 0 2 1
4 6 7 1 9
end
list
+----------------------------------+
| var1 var2 var3 var4 var5 |
|----------------------------------|
1. | 1 4 9 5 0 |
2. | 1 8 6 3 7 |
3. | 0 6 5 6 8 |
4. | 4 5 1 8 3 |
5. | 2 1 0 2 1 |
6. | 4 6 7 1 9 |
+----------------------------------+
Actually you don't have to download anything:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
list var obsno value if value == 0, noobs
+----------------------+
| var obsno value |
|----------------------|
| var5 1 0 |
| var1 3 0 |
| var3 5 0 |
+----------------------+
levelsof var if value == 0, local(selectedvars) clean
display "`selectedvars'"
var1 var3 var5
restore
This is the approach I recommended in the linked question for identifying negative values. With levelsof, a built-in command, one can do the same thing as with findname.
This solution can also be adapted for browse:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
browse var obsno value if value == 0
levelsof var if value == 0, local(selectedvars) clean
display "`selectedvars'"
pause
restore
Although I do not see why one would want to browse the results when one can simply list them.
EDIT:
Here's an example more closely resembling the OP's dataset:
clear
set seed 12345
set obs 1000
generate id = int((_n - 1) / 300) + 1
forvalues i = 1 / 100 {
generate var`i' = rnormal(0, 150)
}
ds var*
foreach var in `r(varlist)' {
generate rr = runiform()
replace `var' = 0 if rr < 0.0001
drop rr
}
Applying the above solution yields:
display "`selectedvars'"
var13 var19 var35 var36 var42 var86 var88 var90
list id var obsno value if value == 0, noobs sepby(id)
+----------------------------+
| id var obsno value |
|----------------------------|
| 1 var86 18 0 |
| 1 var19 167 0 |
| 1 var13 226 0 |
|----------------------------|
| 2 var88 351 0 |
| 2 var36 361 0 |
| 2 var35 401 0 |
|----------------------------|
| 3 var42 628 0 |
| 3 var90 643 0 |
+----------------------------+
Short answer: wildcards for bunches of variables can't be inserted in if qualifiers. (The if command is different from the if qualifier.)
Your question is contradictory on what you want. At one point your pseudocode has you dropping variables! drop has a clear, destructive meaning to Stata programmers: it doesn't mean "ignore".
But let's stick to the emphasis on browse.
findname, any(# == 0)
finds variables for which any value is 0. search findname, sj to find the latest downloadable version.
Note also that
findname, type(numeric)
will return the numeric variables in r(varlist) (and also a local macro if you so specify).
Then several egen functions compete for finding 0s in each observation for a specified varlist: the command findname evidently helps you identify which varlist.
Let's create a small sandbox to show technique:
clear
set obs 5
gen ID = _n
forval j = 1/5 {
gen var`j' = 1
}
replace var2 = 0 in 2
replace var3 = 0 in 3
list
findname var*, any(# == 0) local(which)
egen zero = anymatch(`which'), value(0)
list `which' if zero
+-------------+
| var2 var3 |
|-------------|
2. | 0 1 |
3. | 1 0 |
+-------------+
So, the problem is split into two: finding the variables with any zeros and finding the observations with any zeros, and then putting the information together.
Naturally, the use of findname is dispensable as you can just write your own loop to identify the variables of interest:
local wanted
quietly foreach v of var var* {
count if `v' == 0
if r(N) > 0 local wanted `wanted' `v'
}
Equally naturally, you can browse as well as list: the difference is just in the command name.
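Put together, the loop can stand in for findname in the sandbox. This is a sketch under the same toy data; note that egen's anymatch() would complain if the macro wanted were empty, that is, if no variable contained a zero.

```stata
clear
set obs 5
gen ID = _n
forval j = 1/5 {
    gen var`j' = 1
}
replace var2 = 0 in 2
replace var3 = 0 in 3

* Step 1: which variables contain any zero?
local wanted
quietly foreach v of varlist var* {
    count if `v' == 0
    if r(N) > 0 local wanted `wanted' `v'
}

* Step 2: which observations contain any zero in those variables?
egen zero = anymatch(`wanted'), values(0)
list ID `wanted' if zero
```

Replacing list with browse in the last line opens the same selection in the Data Browser.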
This problem is very simple in R, but I can't seem to get it to work in Stata.
I want to use the square brackets index, but with an expression in it that involves another variable, i.e. for a variable with unique values cumul I want:
replace country = country[cumul==20] in 12
cumul == 20 corresponds to row number 638 in the dataset, so the above should replace in line 12 the country variable with the value of that same variable in line 638. The above expression is clearly not the right way to do it: it just replaces the country variable in line 12 with a missing value.
Stata's row indexing does not work in that way. What you can do, however, is a simple two-line solution:
levelsof country if cumul==20
replace country = "`r(levels)'" in 12
If you want to be sure that cumul==20 uniquely identifies just a single value of country, add:
assert `:word count `r(levels)''==1
between the two lines.
It's probably worth explaining why the construct in the question doesn't work as you wish, beyond "Stata is not R!".
Given a variable x: in a reference like x[1] the [1] is referred to as a subscript, despite nothing being written below the line. The subscript is the observation number, the number being always that in the dataset as currently held in memory.
Stata allows expressions within subscripts; they are evaluated observation by observation and the result is then used to look-up values in variables. Consider this sandbox:
clear
input float y
1
2
3
4
5
end
. gen foo = y[mod(_n, 2)]
(2 missing values generated)
. gen x = 3
. gen bar = y[y == x]
(4 missing values generated)
. list
+-------------------+
| y foo x bar |
|-------------------|
1. | 1 1 3 . |
2. | 2 . 3 . |
3. | 3 1 3 1 |
4. | 4 . 3 . |
5. | 5 1 3 . |
+-------------------+
mod(_n, 2) is the remainder on dividing the observation number _n by 2: that is 1 for odd observation numbers and 0 for even numbers. Observation 0 is not in the dataset (Stata starts indexing at 1). It's not an error to refer to values in that observation, but the result is returned as missing (numeric missing here, and empty strings "" if the variable is string). Hence foo is y[1] or 1 for odd observation numbers and missing for even numbers.
True or false expressions are evaluated as 1 if true and 0 if false. Thus y == x is true only in observation 3, and so bar is the value of y[1] there and missing everywhere else. Stata doesn't have the special (and useful) twist in R that it is the subscripts for which a true or false expression is true that are used to select zero or more values.
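Because a subscript must evaluate to an observation number, one way to get the lookup asked for in the question is to find that number first and then use it. A minimal sketch, assuming a toy dataset; obsno and the local macro where are names made up for illustration:

```stata
clear
input str10 country float cumul
"Austria" 5
"Belgium" 20
"Chile"   30
end

* Record observation numbers, then find the one where cumul == 20
gen long obsno = _n
summarize obsno if cumul == 20, meanonly

* Insist that exactly one observation qualifies
assert r(N) == 1
local where = r(min)

* Now the subscript is an observation number, as Stata expects
replace country = country[`where'] in 3
list
```

Here observation 3 receives the value of country from the observation in which cumul is 20.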
There are ways of using subscripts to get special effects. This example shows one. (It's much easier to get the same kind of result in Mata.)
. gen random = runiform()
. sort random
. gen obs = _n
. sort y
. gen randomsorted = random[obs]
. l
+-----------------------------------------------+
| y foo x bar random obs random~d |
|-----------------------------------------------|
1. | 1 1 3 . .3488717 4 .0285569 |
2. | 2 . 3 . .2668857 3 .1366463 |
3. | 3 1 3 1 .1366463 2 .2668857 |
4. | 4 . 3 . .0285569 1 .3488717 |
5. | 5 1 3 . .8689333 5 .8689333 |
+-----------------------------------------------+
This answer doesn't cover matrices in Stata or Mata.
Sorry if the title of my question is unclear, but it's hard to summarize it in one line. I have a panel data set (code to generate it is at the bottom):
. xtset id year
panel variable: id (strongly balanced)
time variable: year, 1 to 3
delta: 1 unit
. l, sep(3)
+-----------------+
| id year x |
|-----------------|
1. | 1 1 1.1 |
2. | 1 2 1.2 |
3. | 1 3 1.3 |
|-----------------|
4. | 2 1 2.1 |
5. | 2 2 2.2 |
6. | 2 3 2.3 |
+-----------------+
I want to create variables x_1, x_2 and x_3, where x_j has the value of x in year j for each id. I can achieve it as follows (with no elegance pursued):
. forv k=1/3 {
2. capture drop tmp
3. gen tmp = x if year==`k'
4. by id: egen x_`k' = mean(tmp)
5. }
(4 missing values generated)
(4 missing values generated)
(4 missing values generated)
. drop tmp
. l, sep(3)
+-----------------------------------+
| id year x x_1 x_2 x_3 |
|-----------------------------------|
1. | 1 1 1.1 1.1 1.2 1.3 |
2. | 1 2 1.2 1.1 1.2 1.3 |
3. | 1 3 1.3 1.1 1.2 1.3 |
|-----------------------------------|
4. | 2 1 2.1 2.1 2.2 2.3 |
5. | 2 2 2.2 2.1 2.2 2.3 |
6. | 2 3 2.3 2.1 2.2 2.3 |
+-----------------------------------+
Is there a way without using a loop? I know I can write a program or an ado file (determining the variable names automatically), but I wonder if there are some builtin commands for my purpose.
The full commands are here.
clear all
set obs 6
gen id = floor((_n-1)/3)+1
by id, sort: gen year = _n
xtset id year
gen x = id+year/10
l, sep(3)
forv k=1/3 {
capture drop tmp
gen tmp = x if year==`k'
by id: egen x_`k' = mean(tmp)
}
drop tmp
l, sep(3)
Loops are good. What I can do for you is shorten your loop:
clear all
set obs 6
gen id = floor((_n-1)/3)+1
by id, sort: gen year = _n
xtset id year
gen x = id+year/10
l, sep(3)
forv k=1/3 {
by id: gen x_`k' = x[`k']
}
l, sep(3)
That loop assumes a balanced panel, so that within each id the k-th observation is the one for year k. The following loop makes no such assumption, but you need to loop over the observed years:
forv year = 1/3 {
by id: egen X_`year' = total(x / (year == `year'))
}
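The trick is that x / (year == `year') is x divided by 1 in the year wanted and x divided by 0, hence missing, in every other year; total() then ignores the missings. A sketch on an unbalanced toy panel (id 2 has no observation for year 2), with made-up values:

```stata
clear
input float(id year x)
1 1 1.1
1 2 1.2
1 3 1.3
2 1 2.1
2 3 2.3
end

forvalues year = 1/3 {
    * Division by 0 yields missing, so only the wanted year contributes
    bysort id: egen X_`year' = total(x / (year == `year'))
}
list, sepby(id)
```

Without further options, total() returns 0 when everything it sums is missing, so X_2 is 0 rather than missing for id 2. Adding the missing option to total() returns missing in that case instead.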
See also this discussion, especially Sections 9 and 10.
You may also be interested in separate, which avoids an explicit loop, but only gets you part of the way to where you want to be.
All that said, it's hard to believe that you need these variables at all. The mechanism of time series operators solves many problems, while tools such as rangestat (SSC) fill in many gaps.
A late entry, but you could avoid loops if you wanted by using reshape and merge:
clear *
input float(id year x)
1 1 1.1
1 2 1.2
1 3 1.3
2 1 2.1
2 2 2.2
2 3 2.3
end
tempfile master
save `master'
reshape wide x, i(id) j(year)
tempfile using
save `using'
use `master', clear
merge m:1 id using `using', nogen
This "answer", which I post because it is too long for a comment, contains results from trying out Nick Cox's answer. All credit goes to him.
Method 1: Use egen and total, missing.
levelsof year, local(yearlevels)
foreach v of varlist x {
foreach year of local yearlevels {
by id: egen `v'_`year' = total(`v' / (year==`year')), missing
}
}
The missing option handles unbalanced panels.
Method 2: Use separate and then copy the values.
foreach v of varlist x {
separate `v', by(year) gen(`v'_)
local newvars `r(varlist)'
foreach w of local newvars {
by id: egen f_`w' = total(`w'), missing
}
drop `newvars'
}
This also handles unbalanced panels, but the new variable names are f_x_1, etc. The first method needs the levels of year, while the second needs a set of intermediate variables. I personally slightly prefer the first. It would be wonderful if Method 2 could be shortened.