Find _n of Observation(s) that have a certain value - stata

I want to find the observation numbers that correspond to the observations that have a particular value, say 29. I would then like to save these observation numbers in a macro.
Is there a better way to do so than the following clunky and inefficient forvalues loop?
sysuse auto, clear
local n
forvalues i=1/`=_N' {
if mpg[`i']==29 local n `n' `i'
}
display "`n'"

gen long obsno = _n
levelsof obsno if mpg == 29
is less typing for you. Why do you want this?
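Since the question asks for the numbers in a macro, here is a minimal sketch using the local() option of levelsof. The macro name n is taken from the question; obsno is the helper variable from the lines above.

```stata
sysuse auto, clear
gen long obsno = _n
* local(n) stores the matching observation numbers in the local macro n
levelsof obsno if mpg == 29, local(n)
display "`n'"
```

The macro is empty if no observation matches, so check for that before looping over it.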

Is there a way to flip the order of observations in Stata?

Is it possible to create a backwards counting variable in Stata (like the command _n, just numbering observations backwards)? Or a command to flip the data set, so that the observation with the most recent date is the first one? I would like to make a scatter plot with AfD on the y-axis and the date (row_id) on the x-axis. When I make the plot however, the weeks are ordered backwards. How can I change the order?
This is the code:
generate row_id=_n
twoway scatter AfD row_id || lfit AfD row_id
Here are the data set and the plot:
Your date variable is a string variable, which is unlikely to get you the desired result if you sort on that variable.
You can create a Stata internal form date variable from your string variable:
gen date_num = daily(date, "MDY")
format date_num %td
The values of this new variable will represent the number of days since 1 Jan 1960.
If you create a scatter plot with this date variable on the x-axis, by default it will be sorted from min to max. To let it run from max to min you can specify option xscale(reverse).
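A small sketch of the conversion and the reversed axis, using made-up toy data. The four rows and their values are hypothetical, and the dates are assumed to be month/day/year strings as in the daily() call above.

```stata
* Toy data: hypothetical weekly values with string dates (month/day/year)
clear
input str10 date AfD
"01/05/2020" 14
"01/12/2020" 13.5
"01/19/2020" 14.5
"01/26/2020" 13
end
* Convert the string to a Stata daily date and give it a readable format
gen date_num = daily(date, "MDY")
format date_num %td
* By default the axis runs from oldest to newest; xscale(reverse) flips it
twoway scatter AfD date_num || lfit AfD date_num, xscale(reverse)
```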
If you still want to create an id variable by yourself you can choose one of these options (ascending and descending):
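* Option 1: ascending (oldest date gets id 1)
sort date_num
gen id = _n
* Option 2: descending (newest date gets id 1); use instead of option 1, not in addition
gsort -date_num
gen id = _n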
For your problem, plotting in terms of a daily date variable and, if for some reason that is a good idea, using xscale(reverse) are likely to be what you need, as well explained by @Wouter.
In general something like
gen long newid = _N - _n + 1
sort newid
will reverse a dataset.

Looping with distance matching

I want to match treated firms to control firms by industry and year considering firms that are the closest in terms of profitability (roa). I want a 1:1 match. I am using a distance measure (mahalanobis).
I have approximately 530,000 firm-year observations in my sample: about 267,000 treated observations and 263,000 control observations. Here is my code:
gen neighbor1 = .
gen idobs = .
levelsof industry
local a = r(levels)
levelsof year
local b = r(levels)
foreach i in `a' {
foreach j in `b'{
capture noisily psmatch2 treat if industry == `i' & year == `j', mahalanobis(roa)
capture noisily replace neighbor1 = _n1 if industry == `i' & year == `j'
capture noisily replace idobs = _id if industry == `i' & year == `j'
drop _treated _support _weight _id _n1 _nn
}
}
Treat is my treatment variable. It takes the value of 1 for treated observations and 0 for non-treated observations.
The command psmatch2 creates the variable _n1 and _id among others. _n1 is the id number of the matched observation (closest neighbor) and _id is an id number (1 - 530,000) that is unique to each observation.
The code 'works', i.e. I get no error message. My variable neighbor1 has 290,724 non-missing observations.
However, these 290,724 observations vary between 1 and 933 which is odd. The variable neighbor1 should provide me the observation id number of the matched observation, which can vary between 1 and 530,000.
It seems that the code erases or ignores the result of the matching process in different subgroups. What am I doing wrong?
Edit:
I found a public dataset and adapted my previous code so that you can run my code with this dataset and see more clearly what the problem could be.
I am using the Vella and Verbeek (1998) panel data on 545 men who worked every year from 1980 to 1987, from this website: https://www.stata.com/texts/eacsap/
Let's say that I want to match treated observations, i.e. people, to control observations by marriage status (married) and year considering people that worked a similar number of hours (hours), i.e. the shortest distance.
I create a random treatment variable (treat) for the sake of this example.
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta
gen treat = round(runiform())
gen neighbor1 = .
gen idobs = .
levelsof married
local a = r(levels)
levelsof year
local b = r(levels)
foreach i in `a' {
foreach j in `b'{
capture noisily psmatch2 treat if married == `i' & year == `j', mahalanobis(hours)
capture noisily replace neighbor1 = _n1 if married == `i' & year == `j'
capture noisily replace idobs = _id if married == `i' & year == `j'
drop _treated _support _weight _id _n1 _nn
}
}
What this code should do is to look at each subgroup of observations: 444 observations in 1980 that are not married, 101 observations in 1980 that are married, ..., and 335 observations in 1987 that are married. In each of these subgroups, I would like to match a treated observation to a control observation considering the shortest distance in the number of hours worked.
There are two problems that I see after running the code.
First, the variable idobs should take a unique value between 1 and 4360, because there are 4360 observations in this dataset; it is just an ID number. That is not the case: several observations share the ID 1, several share the ID 2, and so on.
Second, neighbor1 varies between 1 and 204 meaning that the matched observations have only ID numbers varying from 1 to 204.
What is the problem with my code?
Here is a solution using the command iematch, installed through the package ietoolkit -> ssc install ietoolkit. For disclosure, I wrote this command. psmatch2 is great if you want the ATT. But if all you want is to match observations across two groups using nearest neighbor, then iematch is cleaner.
In both commands you need to make each industry-year match in a subset, then combine that information. In both commands the matched group ID will restart from 1 in each subset.
Using your example data, this creates one matchID var for each subset, then you will have to find a way to combine these to a single matchID without conflicts across the data set.
* use data set and keep only vars required for simplicity
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta, clear
keep year married hours
* Set seed for replicability. NEVER use the 123456 seed in production, randomize a new seed
set seed 123456
*Generate mock treatment
gen treat = round(runiform())
*generate vars to store results
gen matchResult = .
gen matchDiff = .
gen matchCount = .
*Create locals for loops
levelsof married
local married_statuses = r(levels)
levelsof year
local years = r(levels)
*Loop over each subgroup
foreach year of local years {
foreach married_status of local married_statuses {
*This command is similar to psmatch2, but a simplified version for
* when you are not looking for the ATT.
* This command is only about matching.
iematch if married == `married_status' & year == `year', grp(treat) match(hours) seedok m1 maxmatch(1)
*These variables list meta info about the match. See helpfile for docs.
*The lines below copy the info from each subset in this loop to single vars
*for the full data set. Then the loop-specific vars are dropped
replace matchResult = _matchResult if married == `married_status' & year == `year'
replace matchDiff = _matchDiff if married == `married_status' & year == `year'
replace matchCount = _matchCount if married == `married_status' & year == `year'
drop _matchResult _matchDiff _matchCount
*The match ID restarts at 1 in each iteration of the loop.
*Therefore we save it in a separate var per iteration and combine them afterwards.
rename _matchID matchID_`married_status'_`year'
}
}
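One possible way to do the final combining step, shown on toy data rather than on the loop output. This is only a sketch: it assumes the per-subset IDs have already been copied into a single variable (hypothetically named subID here) next to the variable or variables that define the subset (here one hypothetical variable, subgroup; in the example above it would be married and year).

```stata
* Toy stand-in for the loop output: two subgroups, pair IDs restarting at 1 in each
clear
set obs 8
gen subgroup = cond(_n <= 4, 1, 2)
bysort subgroup: gen subID = ceil(_n / 2)   // 1 1 2 2 within each subgroup
* group() assigns one number to every distinct (subgroup, subID) combination
egen long matchID = group(subgroup subID)
list, sepby(subgroup)
```

By default egen's group() returns missing when any of its arguments is missing, so observations without a match ID would stay missing in the combined variable.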

Unable to display statistics in the graph note() parameter

I want to display the total count of the data in the graph note().
I tried the following:
note(count)
However, this just displays the literal word "count".
I also tried to create a local variable but I am having difficulty just initializing it.
While I can do the following:
. local N = 100
. di `N'
100
I can't seem to do:
. local N = count
count not found
The total number of observations is stored in _N.
sysuse auto, clear
display _N
74
So the following works for me:
local N = _N
twoway scatter mpg price, note(Total no of observations: `N')
The total number of observations is kept in _N but it is not necessarily the number of observations used in a graph.
The command count displays a result and leaves a saved result, the number counted, in its wake as r(N). This is documented both in the help for count and in the manual entry.
Hence you can verify that this sequence leaves the note "74 observations" in the resulting graph.
. sysuse auto, clear
(1978 Automobile Data)
. count if mpg < .
74
. histogram mpg, note(`r(N)' observations)
(bin=8, start=12, width=3.625)
Note that no r-class command should intervene here between count and your use of its result. r-class saved results, like any other saved results, are overwritten easily. In many circumstances you are well advised, as you did, to store the result in a local macro, say by
. local N = r(N)
immediately after the count command and then refer to that later in the note().
This is a more general method because count by itself returns the number of observations and so can be used when this is directly what you want.
Combining the other answers, I ultimately did:
count
local N = r(N)
count if male
local N_male = r(N)
count if !male
local N_female = r(N)
...
note("N = `N'" " `N_male' (Male)" " `N_female' (Female)")
But I still can't get the commas to render at the thousands and millions places.
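On the commas: one approach that may help is to format the count as a string with a comma display format before putting it in the note. This is a sketch, not taken from the answers above; the macro names N_fmt and big_fmt are made up for the illustration.

```stata
sysuse auto, clear
count
* %12.0fc adds thousands separators; trim() drops any leading padding
local N_fmt = trim(string(r(N), "%12.0fc"))
* A larger number, only to show the separators
local big_fmt = trim(string(1234567, "%12.0fc"))
display "`N_fmt' and `big_fmt'"
* Quote the note text so a comma in it is not read as the start of suboptions
twoway scatter mpg price, note("N = `N_fmt'")
```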

Simulating AR(1) in Stata from last observation to the first

I want to simulate an AR(1) process, but start from the end. But my code does not work as expected:
clear
set obs 100
gen et=rnormal(0,1)
quietly gen yt= et in L
quietly replace yt=0.5*yt[_n+1]+et in 1/L-1
Your help is really appreciated.
Just do it the normal way and then reverse order:
clear
set obs 100
gen obs = -_n
gen et=rnormal(0,1)
quietly gen yt = et in 1
quietly replace yt = 0.5*yt[_n-1] + et in 2/L
sort obs
The key is that Stata works in order of the observations. So, this code works as you would want in cascade, value for observation 2 depending on observation 1, 3 on 2, and so forth.
You won't get a cascade going the other direction.
Also, set seed for reproducibility.
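Putting the pieces together, a sketch with a seed and a check that the reversed series really depends on the following observation. The seed value is arbitrary, and double storage is used only so that the check is exact.

```stata
clear
set obs 100
set seed 20240101              // arbitrary seed, chosen only for this illustration
gen long obs = -_n             // sorting on this later flips the order
gen double et = rnormal(0, 1)
gen double yt = et in 1
replace yt = 0.5*yt[_n-1] + et in 2/L
sort obs
* After the flip, every value except the last depends on the one that follows it
assert abs(yt - (0.5*yt[_n+1] + et)) < 1e-12 in 1/-2
```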

How to efficiently create lag variable using Stata

I have panel data (time: date, name: ticker). I want to create 10 lags for variables x and y. Now I create each lag variable one by one using the following code:
by ticker: gen lag1 = x[_n-1]
However, this looks messy.
Can anyone tell me how can I create lag variables more efficiently, please?
Shall I use a loop or does Stata have a more efficient way of handling this kind of problem?
@Robert has shown you the streamlined way of doing it. For completeness, here is the "traditional", boring way:
clear
set more off
*----- example data -----
set obs 2
gen id = _n
expand 20
bysort id: gen time = _n
tsset id time
set seed 12345
gen x = runiform()
gen y = 10 * runiform()
list, sepby(id)
*----- what you want -----
// "traditional" loop
forvalues i = 1/10 {
gen x_`i' = L`i'.x
gen y_`i' = L`i'.y
}
list, sepby(id)
And a combination:
// a combination
foreach v in x y {
tsrevar L(1/10).`v'
rename (`r(varlist)') `v'_#, addnumber
}
If the purpose is to create lagged variables to use them in some estimation, know you can use time-series operators within many estimation commands, directly; that is, no need to create the lagged variables in the first place. See help tsvarlist.
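For instance, a minimal sketch with the same toy panel as above. The choice of regress and of three lags is arbitrary, purely for illustration.

```stata
clear
set obs 2
gen id = _n
expand 20
bysort id: gen time = _n
tsset id time
set seed 12345
gen x = runiform()
gen y = 10 * runiform()
* Lags 1 to 3 of x enter the model directly; no lag variables are created
regress y L(1/3).x
```

The first three periods of each panel drop out of the estimation sample because their lags are missing.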
You can loop to do this but you can also take advantage of tsrevar to generate temporary lagged variables. If you need permanent variables, you can use rename group to rename them.
clear
set obs 2
gen id = _n
expand 20
bysort id: gen time = _n
tsset id time
set seed 12345
gen x = runiform()
gen y = 10 * runiform()
tsrevar L(1/10).x
rename (`r(varlist)') x_#, addnumber
tsrevar L(1/10).y
rename (`r(varlist)') y_#, addnumber
Note that if you are doing this to calculate a statistic on a rolling window, check out tsegen (from SSC)