Generating dummies in Stata

Generating dummies in Stata - stata

I have a dataset in Stata of the following form
id | year
a | 1950
b | 1950
c | 1950
d | 1950
.
.
.
y | 1950
-----
a | 1951
b | 1951
c | 1951
d | 1951
.
.
.
y | 1951
-----
...
I'm looking for a quick way to rewrite the following code
gen dummya=1 if id=="a"
gen dummyb=1 if id=="b"
gen dummyc=1 if id=="c"
...
gen dummyy=1 if id=="y"
and
gen dummy50=1 if year==1950
gen dummy51=1 if year==1951
...

Note that all your dummies would be created as 1 or missing. It is almost always more useful to create them directly as 1 or 0. Indeed, that is the usual definition of dummies.
In general, it's a loop over the possibilities using forvalues or foreach, but the shortcut is too easy not to be preferred in this case. Consider this reproducible example:
. sysuse auto, clear
(1978 Automobile Data)
. tab rep78, gen(rep78)
Repair |
Record 1978 | Freq. Percent Cum.
------------+-----------------------------------
1 | 2 2.90 2.90
2 | 8 11.59 14.49
3 | 30 43.48 57.97
4 | 18 26.09 84.06
5 | 11 15.94 100.00
------------+-----------------------------------
Total | 69 100.00
. d rep78?
storage display value
variable name type format label variable label
------------------------------------------------------------------------------
rep781 byte %8.0g rep78== 1.0000
rep782 byte %8.0g rep78== 2.0000
rep783 byte %8.0g rep78== 3.0000
rep784 byte %8.0g rep78== 4.0000
rep785 byte %8.0g rep78== 5.0000
That's all the dummies (some prefer to say "indicators") in one fell swoop through an option of tabulate.
For completeness, consider an example doing it the loop way. We imagine that years 1950-2015 are represented:
forval y = 1950/2015 {
gen byte dummy`y' = year == `y'
}
Two digit identifiers dummy50 to dummy15 would be unambiguous in this example, so here they are as a bonus:
forval y = 1950/2015 {
local Y : di %02.0f mod(`y', 100)
gen byte dummy`y' = year == `y'
}
Here byte is dispensable unless memory is very short, but it's good practice any way.
If anyone was determined to write a loop to create indicators for the distinct values of a string variable, that can be done too. Here are two possibilities. Absent an easily reproducible example in the original post, let's create a sandbox. The first method is to encode first, then loop over distinct numeric values. The second method is find the distinct string values directly and then loop over them.
clear
set obs 3
gen mystring = word("frog toad newt", _n)
* Method 1
encode mystring, gen(mynumber)
su mynumber, meanonly
forval j = 1/`r(max)' {
gen dummy`j' = mynumber == `j'
label var dummy`j' "mystring == `: label (mynumber) `j''"
}
* Method 2
levelsof mystring
local j = 1
foreach level in `r(levels)' {
gen dummy2`j' = mystring == `"`level'"'
label var dummy2`j' `"mystring == `level'"'
local ++j
}
describe
Contains data
obs: 3
vars: 8
size: 96
------------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------------
mystring str4 %9s
mynumber long %8.0g mynumber
dummy1 float %9.0g mystring == frog
dummy2 float %9.0g mystring == newt
dummy3 float %9.0g mystring == toad
dummy21 float %9.0g mystring == frog
dummy22 float %9.0g mystring == newt
dummy23 float %9.0g mystring == toad
------------------------------------------------------------------------------
Sorted by:

Use i.<dummy_variable_name>
For example, in your case, you can use following command for regression:
reg y i.year

I also recommend using
egen year_dum = group(year)
reg y i.year_dum
This can be generalized arbitrarily, and you can easily create, e.g., year-by-state fixed effects this way:
egen year_state_dum = group(year state)
reg y i.year_state_dum

Related

retaining the variable label in reshape

I would like to retain the variable labels after reshaping the dataset from long to wide. I have a problem in inputting the dataset (I keyed them in Excel then imported).
clear
set obs 7
input id a1 a2 a3
"s001" "John" 23 "Primary"
"s002" "Mary" 32 "Secondary"
"s002" "Anna" 23 "Tertiary"
"s003" "Joseph" 34 "Secondary"
"s003" "Oganyo" 23 "Primary"
"s004" "Manyoya" 34 "Tertiary"
"s005" "Makbuti" 45 "Primary"
end
*======= Label the variables
label var a1 "partners name"
label var a2 "partners age"
label var a3 "partners education"
foreach variable of varlist a*{
local varlabel : variable label `variable'
di "`varlabel'"
bys id: gen index = _n
renvars a* , postfix(_)
reshape wide a*, i(id) j(index)
label var `variable'* "`varlabel'"
}

The code here is very confused and just won't work well without major surgery.
At the outset your input statement fails to declare string variables when needed.
The set obs 7 is compatible with an end to the input.
A big problem is that you are looping over a bunch of commands including reshape, but there is just one reshape to carry out.
Your code for saving variable labels needs to save then separately.
renvars is from the Stata Journal (2005), and doesn't seem needed here any way. As from Stata 12 (2011!) it should still work but is essentially superseded by extensions to rename.
Here's my best guess at the example you should have given and the code you need.
clear
input str4 id str7 a1 a2 str9 a3
"s001" "John" 23 "Primary"
"s002" "Mary" 32 "Secondary"
"s002" "Anna" 23 "Tertiary"
"s003" "Joseph" 34 "Secondary"
"s003" "Oganyo" 23 "Primary"
"s004" "Manyoya" 34 "Tertiary"
"s005" "Makbuti" 45 "Primary"
end
label var a1 "partner's name"
label var a2 "partner's age"
label var a3 "partner's education"
local j = 0
foreach variable of varlist a* {
local ++j
local varlabel`j' : variable label `variable'
di "`varlabel`j''"
}
bys id: gen index = _n
reshape wide a*, i(id) j(index)
forval j = 1/3 {
foreach v of var a`j'* {
label var `v' "`varlabel`j''"
}
}
list
describe
Here's the output of the last two commands.
. list
+------------------------------------------------------------+
| id a11 a21 a31 a12 a22 a32 |
|------------------------------------------------------------|
1. | s001 John 23 Primary . |
2. | s002 Mary 32 Secondary Anna 23 Tertiary |
3. | s003 Joseph 34 Secondary Oganyo 23 Primary |
4. | s004 Manyoya 34 Tertiary . |
5. | s005 Makbuti 45 Primary . |
+------------------------------------------------------------+
.
. describe
Contains data
Observations: 5
Variables: 7
------------------------------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------------------------------
id str4 %9s
a11 str7 %9s partner's name
a21 float %9.0g partner's age
a31 str9 %9s partner's education
a12 str7 %9s partner's name
a22 float %9.0g partner's age
a32 str9 %9s partner's education
------------------------------------------------------------------------------------------------------------------
Sorted by: id
Note: reshape wide makes most things more difficult in Stata.
Note: Please use dataex to create reproducible data examples in Stata.

Computing running sum with moving time-window

My data
I am working on a spell dataset in the following format:
cls
clear all
set more off
input id spellnr str7 bdate_str str7 edate_str employed
1 1 2008m1 2008m9 1
1 2 2008m12 2009m8 0
1 3 2009m11 2010m9 1
1 4 2010m10 2011m9 0
///
2 1 2007m4 2009m12 1
2 2 2010m4 2011m4 1
2 3 2011m6 2011m8 0
end
* translate to Stata monthly dates
gen bdate = monthly(bdate_str,"YM")
gen edate = monthly(edate_str,"YM")
drop *_str
format %tm bdate edate
list, sepby(id)
Corresponding to:
+---------------------------------------------+
| id spellnr employed bdate edate |
|---------------------------------------------|
1. | 1 1 1 2008m1 2008m9 |
2. | 1 2 0 2008m12 2009m8 |
3. | 1 3 1 2009m11 2010m9 |
4. | 1 4 0 2010m10 2011m9 |
|---------------------------------------------|
5. | 2 1 1 2007m4 2009m12 |
6. | 2 2 1 2010m4 2011m4 |
7. | 2 3 0 2011m6 2011m8 |
+---------------------------------------------+
Here a given person (id) can have multiple spells (spellnr) of two types (unempl: 1 for unemployment; 0 for employment). the start-end dates of each spell are definied by bdate and edate, respectively.
Imagine the data was already cleaned, and is such that no spells overlap with each other.
There might be "missing" periods in between any two spells though.
This is captured by the dummy dataset above.
My question:
For each unemployment spell, I need to compute the number of months spent in employment in the last 6 months, 12 months, and 24 months.
Note that, importantly, each id can go in and out from employment, and all past employment spells should be taken into account (not just the last one).
In my example, this would lead to the following desired output:
+--------------------------------------------------------------+
| id spellnr employed bdate edate m6 m24 m48 |
|--------------------------------------------------------------|
1. | 1 1 1 2008m1 2008m9 . . . |
2. | 1 2 0 2008m12 2009m8 4 9 9 |
3. | 1 3 1 2009m11 2010m9 . . . |
4. | 1 4 0 2010m10 2011m9 6 11 20 |
|--------------------------------------------------------------|
5. | 2 1 1 2007m4 2009m12 . . . |
6. | 2 2 1 2010m4 2011m4 . . . |
7. | 2 3 0 2011m6 2011m8 5 20 44 |
+--------------------------------------------------------------+
My (working) attempt:
The following code returns the desired result.
* expand each spell to one observation per time unit (here "months"; works also for days)
expand edate-bdate+1
bysort id spellnr: gen spell_date = bdate + _n - 1
format %tm spell_date
list, sepby(id spellnr)
* fill-in empty months (not covered by spells)
xtset id spell_date, monthly
tsfill
* compute cumulative time spent in employment and lagged values
bysort id (spell_date): gen cum_empl = sum(employed) if employed==1
bysort id (spell_date): replace cum_empl = cum_empl[_n-1] if cum_empl==.
bysort id (spell_date): gen lag_7 = L7.cum_empl if employed==0
bysort id (spell_date): gen lag_24 = L25.cum_empl if employed==0
bysort id (spell_date): gen lag_48 = L49.cum_empl if employed==0
qui replace lag_7=0 if lag_7==. & employed==0 // fix computation for first spell of each "id" (if not enough time to go back with "L.")
qui replace lag_24=0 if lag_24==. & employed==0
qui replace lag_48=0 if lag_48==. & employed==0
* compute time spent in employment in the last 6, 24, 48 months, at the beginning of each unemployment spell
bysort id (spell_date): gen m6 = cum_empl - lag_7 if employed==0
bysort id (spell_date): gen m24 = cum_empl - lag_24 if employed==0
bysort id (spell_date): gen m48 = cum_empl - lag_48 if employed==0
qui drop if (spellnr==.)
qui bysort id spellnr (spell_date): keep if _n == 1
drop spell_date cum_empl lag_*
list
This works fine, but becomes quite inefficient when using (several millions of) daily data. Can you suggest any alternative approach that does not involve expanding the dataset?
In words what I do above is:
I expand data to have one row per month;
I fill-in the "gaps" in between the spells with -tsfill-
I Compute the running time spent in employment, and use lag operators to get the three quantities of interest.
This is in the vein of what done here, in a past question that I posted. However the working example there was unnecessarily complicated and with some mistakes.
SOLUTIONS PERFORMANCE
I tried different approaches suggested in the accepted answer below (including using joinby as suggested in an earlier version of the answer). In order to create a larger dataset I used:
expand 500000
bysort id spellnr: gen new_id = _n
drop id
rename new_id id
which creates a dataset with 500,000 id's (for a total of 3,500,000 spells).
The first solution largely dominates the ones that use joinby or rangejoin (see also the comments to the accepted answer below).

Below code might save some running time.
bys id (employed): gen tag = _n if !employed
sum tag, meanonly
local maxtag = `r(max)'
foreach i in 6 24 48 {
gen m`i' = .
forval d = 1/`maxtag' {
by id: gen x = 1 + min(bdate[`d'],edate) - max(bdate[`d']-`i',bdate) if employed
egen y = total(x*(x>0)), by(id)
replace m`i' = y if tag == `d'
drop x y
}
}
sort id bdate
The same logic, along with -rangejoin- (ssc) should also deserve a try. Please kindly provide some feedback after testing with your (large) actual data.
preserve
keep if employed
replace employed = 0
tempfile em
save `em'
restore
foreach i in 6 24 48 {
gen _bd = bdate - `i'
rangejoin edate _bd bdate using `em', by(id employed) p(_)
egen m`i' = total(_edate - max(_bd,_bdate)+1) if !employed, by(id bdate)
bys id bdate: keep if _n==1
drop _*
}

Browse all the rows and columns that contain a zero

Suppose I have 100 variables named ID, var1, var2, ..., var99. I have 1000 rows. I want to browse all the rows and columns that contain a 0.
I wanted to just do this:
browse ID, var* if var* == 0
but it doesn't work. I don't want to hardcode all 99 variables obviously.
I wanted to essentially write an if like this:
gen has0 = 0
forvalues n = 1/99 {
if var`n' does not contain 0 {
drop v
} // pseudocode I know doesn't work
has0 = has0 | var`n' == 0
}
browse if has0 == 1
but obviously that doesn't work.
Do I just need to reshape the data so it has 2 columns ID, var with 100,000 rows total?

My dear colleague #NickCox forces me to reply to this (duplicate) question because he is claiming that downloading, installing and running a new command is better than using built-in ones when you "need to select from 99 variables".
Consider the following toy example:
clear
input var1 var2 var3 var4 var5
1 4 9 5 0
1 8 6 3 7
0 6 5 6 8
4 5 1 8 3
2 1 0 2 1
4 6 7 1 9
end
list
+----------------------------------+
| var1 var2 var3 var4 var5 |
|----------------------------------|
1. | 1 4 9 5 0 |
2. | 1 8 6 3 7 |
3. | 0 6 5 6 8 |
4. | 4 5 1 8 3 |
5. | 2 1 0 2 1 |
6. | 4 6 7 1 9 |
+----------------------------------+
Actually you don't have to download anything:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
list var obsno value if value == 0, noobs
+----------------------+
| var obsno value |
|----------------------|
| var5 1 0 |
| var1 3 0 |
| var3 5 0 |
+----------------------+
levelsof var if value == 0, local(selectedvars) clean
display "`selectedvars'"
var1 var3 var5
restore
This is the approach i recommended in the linked question for identifying negative values. Using levelsof one can do the same thing with findname using a built-in command.
This solution can also be adapted for browse:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
browse var obsno value if value == 0
levelsof var if value == 0, local(selectedvars) clean
display "`selectedvars'"
pause
restore
Although i do not see why one would want to browse the results when can simply list them.
EDIT:
Here's an example more closely resembling the OP's dataset:
clear
set seed 12345
set obs 1000
generate id = int((_n - 1) / 300) + 1
forvalues i = 1 / 100 {
generate var`i' = rnormal(0, 150)
}
ds var*
foreach var in `r(varlist)' {
generate rr = runiform()
replace `var' = 0 if rr < 0.0001
drop rr
}
Applying the above solution yields:
display "`selectedvars'"
var13 var19 var35 var36 var42 var86 var88 var90
list id var obsno value if value == 0, noobs sepby(id)
+----------------------------+
| id var obsno value |
|----------------------------|
| 1 var86 18 0 |
| 1 var19 167 0 |
| 1 var13 226 0 |
|----------------------------|
| 2 var88 351 0 |
| 2 var36 361 0 |
| 2 var35 401 0 |
|----------------------------|
| 3 var42 628 0 |
| 3 var90 643 0 |
+----------------------------+

Short answer: wildcards for bunches of variables can't be inserted in if qualifiers. (The if command is different from the if qualifier.)
Your question is contradictory on what you want. At one point your pseudocode has you dropping variables! drop has a clear, destructive meaning to Stata programmers: it doesn't mean "ignore".
But let's stick to the emphasis on browse.
findname, any(# == 0)
finds variables for which any value is 0. search findname, sj to find the latest downloadable version.
Note also that
findname, type(numeric)
will return the numeric variables in r(varlist) (and also a local macro if you so specify).
Then several egen functions compete for finding 0s in each observation for a specified varlist: the command findname evidently helps you identify which varlist.
Let's create a small sandbox to show technique:
clear
set obs 5
gen ID = _n
forval j = 1/5 {
gen var`j' = 1
}
replace var2 = 0 in 2
replace var3 = 0 in 3
list
findname var*, any(# == 0) local(which)
egen zero = anymatch(`which'), value(0)
list `which' if zero
+-------------+
| var2 var3 |
|-------------|
2. | 0 1 |
3. | 1 0 |
+-------------+
So, the problem is split into two: finding the observations with any zeros and finding the observations with any zeros, and then putting the information together.
Naturally, the use of findname is dispensable as you can just write your own loop to identify the variables of interest:
local wanted
quietly foreach v of var var* {
count if `v' == 0
if r(N) > 0 local wanted `wanted' `v'
}
Equally naturally, you can browse as well as list: the difference is just in the command name.

Group by with percentages and raw numbers

I have a dataset that looks like this:
I would like to create a table that groups by area and shows the total amount for the area both as a percentage of total amount and as a raw number, as well as the percent of the total number of records/observations per area and total number of records/observations as a raw number.
The code below works to generate a table of raw numbers but does not the show percent of total:
tabstat amount, by(county) stat(sum count)

There isn't a canned command for doing what you want. You will have to program the table yourself.
Here's a quick example using auto.dta:
. sysuse auto, clear
(1978 Automobile Data)
. tabstat price, by(foreign) stat(sum count)
Summary for variables: price
by categories of: foreign (Car type)
foreign | sum N
---------+--------------------
Domestic | 315766 52
Foreign | 140463 22
---------+--------------------
Total | 456229 74
------------------------------
You can do the calculations and save the raw numbers in variables as follows:
. generate total_obs = _N
. display total_obs
74
. count if foreign == 0
52
. generate total_domestic_obs = r(N)
. count if foreign == 1
22
. generate total_foreign_obs = r(N)
. egen total_domestic_price = total(price) if foreign == 0
. sort total_domestic_price
. local tdp = total_domestic_price
. display total_domestic_price
315766
. egen total_foreign_price = total(price) if foreign == 1
. sort total_foreign_price
. local tfp = total_foreign_price
. display total_foreign_price
140463
. generate total_price = `tdp' + `tfp'
. display total_price
456229
And for the percentages:
. generate pct_domestic_price = (`tdp' / total_price) * 100
. display pct_domestic_price
69.212173
. generate pct_foreign_price = (`tfp' / total_price) * 100
. display pct_foreign_price
30.787828
EDIT:
Here's a more automated way to do the above without having to specify individual values:
program define foo
syntax varlist(min=1 max=1), by(string)
generate total_obs = _N
display total_obs
quietly levelsof `by', local(nlevels)
foreach x of local nlevels {
count if `by' == `x'
quietly generate total_`by'`x'_obs = r(N)
quietly egen total_`by'`x'_`varlist' = total(`varlist') if `by' == `x'
sort total_`by'`x'_`varlist'
local tvar`x' = total_`by'`x'_`varlist'
local tvarall `tvarall' `tvar`x'' +
display total_`by'`x'_`varlist'
}
quietly generate total_`varlist' = `tvarall' 0
display total_`varlist'
foreach x of local nlevels {
quietly generate pct_`by'`x'_`varlist' = (`tvar`x'' / total_`varlist') * 100
display pct_`by'`x'_`varlist'
}
end
The results are identical:
. foo price, by(foreign)
74
52
315766
22
140463
456229
69.212173
30.787828
You will obviously need to format the results in a table of your liking.

Here's another approach. I stole #Pearly Spencer's example. It could be generalised to a command. The main message I want to convey is that list is useful for tabulations and other reports, with just usually some obligation to calculate what you want to show beforehand.
. sysuse auto, clear
(1978 Automobile Data)
. preserve
. collapse (sum) total=price (count) obs=price, by(foreign)
. egen pc2 = pc(total)
. egen pc1 = pc(obs)
. char pc2[varname] "%"
. char pc1[varname] "%"
. format pc* %2.1f
. list foreign obs pc1 total pc2 , subvarname noobs sum(obs pc1 total pc2)
+-----------------------------------------+
| foreign obs % total % |
|-----------------------------------------|
| Domestic 52 70.3 315766 69.2 |
| Foreign 22 29.7 140463 30.8 |
|-----------------------------------------|
Sum | 74 100.0 456229 100.0 |
+-----------------------------------------+
. restore
EDIT Here's an essay in egen with similar flavour but leaving the original data in place and new variables also available for export or graphics.
. sysuse auto, clear
(1978 Automobile Data)
. egen total = sum(price), by(foreign)
. egen obs = count(price), by(total)
. egen tag = tag(foreign)
. egen pc2 = pc(total) if tag
(72 missing values generated)
. egen pc1 = pc(obs) if tag
(72 missing values generated)
. char pc2[varname] "%"
. char pc1[varname] "%"
. format pc* %2.1f
. list foreign obs pc1 total pc2 if tag, subvarname noobs sum(obs pc1 total pc2)
+-----------------------------------------+
| foreign obs % total % |
|-----------------------------------------|
| Domestic 52 70.3 315766 69.2 |
| Foreign 22 29.7 140463 30.8 |
|-----------------------------------------|
Sum | 74 100.0 456229 100.0 |
+-----------------------------------------+

Stata - how to create T variables that have values for each t in panel data

Sorry if the title of my question is unclear, but it's hard to summarize it on one line. I have a panel data set (codes to generate it are at the bottom):
. xtset id year
panel variable: id (strongly balanced)
time variable: year, 1 to 3
delta: 1 unit
. l, sep(3)
+-----------------+
| id year x |
|-----------------|
1. | 1 1 1.1 |
2. | 1 2 1.2 |
3. | 1 3 1.3 |
|-----------------|
4. | 2 1 2.1 |
5. | 2 2 2.2 |
6. | 2 3 2.3 |
+-----------------+
I want to create variables x_1, x_2 and x_3, where x_j has the value of x in year j for each id. I can achieve it as follows (with no elegance pursued):
. forv k=1/3 {
2. capture drop tmp
3. gen tmp = x if year==`k'
4. by id: egen x_`k' = mean(tmp)
5. }
(4 missing values generated)
(4 missing values generated)
(4 missing values generated)
. drop tmp
. l, sep(3)
+-----------------------------------+
| id year x x_1 x_2 x_3 |
|-----------------------------------|
1. | 1 1 1.1 1.1 1.2 1.3 |
2. | 1 2 1.2 1.1 1.2 1.3 |
3. | 1 3 1.3 1.1 1.2 1.3 |
|-----------------------------------|
4. | 2 1 2.1 2.1 2.2 2.3 |
5. | 2 2 2.2 2.1 2.2 2.3 |
6. | 2 3 2.3 2.1 2.2 2.3 |
+-----------------------------------+
Is there a way without using a loop? I know I can write a program or an ado file (determining the variable names automatically), but I wonder if there are some builtin commands for my purpose.
The full commands are here.
clear all
set obs 6
gen id = floor((_n-1)/3)+1
by id, sort: gen year = _n
xtset id year
gen x = id+year/10
l, sep(3)
forv k=1/3 {
capture drop tmp
gen tmp = x if year==`k'
by id: egen x_`k' = mean(tmp)
}
drop tmp
l, sep(3)

Loops are good. What I can do for you is shorten your loop:
clear all
set obs 6
gen id = floor((_n-1)/3)+1
by id, sort: gen year = _n
xtset id year
gen x = id+year/10
l, sep(3)
forv k=1/3 {
by id: gen x_`k' = x[`k']
}
l, sep(3)
There is a decency assumption in there of a balanced panel. This loop makes no such assumption, but you need to loop over the observed years:
forv year = 1/3 {
by id: egen X_`year' = total(x / (year == `year'))
}
See also this discussion, especially Sections 9 and 10.
You may also be interested in separate, which avoids an explicit loop, but only gets you part of the way to where you want to be.
All that said, it's hard to believe that you need these variables at all. The mechanism of time series operators solves many problems, while tools such as rangestat (SSC) fill in many gaps.

A late entry, but you could avoid loops if you wanted by using reshape and merge:
clear *
input float(id year x)
1 1 1.1
1 2 1.2
1 3 1.3
2 1 2.1
2 2 2.2
2 3 2.3
end
tempfile master
save `master'
reshape wide x, i(id) j(year)
tempfile using
save `using'
use `master', clear
merge m:1 id using `using', nogen

This "answer", which I post because it is too long as a comment, contains results from practices following Nick Cox's answer. All credits go to him.
Method 1: Use egen and total, missing.
levelsof year, local(yearlevels)
foreach v of varlist x {
foreach year of local yearlevels {
by id: egen `v'_`year' = total(`v' / (year==`year')), missing
}
}
The missing option handles unbalanced panels.
Method 2: Use separate and then copy the values.
foreach v of varlist x {
separate `v', by(year) gen(`v'_)
local newvars = r(varlist)
foreach w of local newvars {
by id: egen f_`w' = total(`w'), missing
}
drop `newvars'
}
This also handles unbalanced panels, but the new variable names are f_x_1, etc. The first method needs the levels of year, while the second needs creating a set of intermediate variables. I personally slightly prefer the first. It would be wonderful if Method 2 can be shortened.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Generating dummies in Stata - stata

Use i.<dummy_variable_name> For example, in your case, you can use following command for regression: reg y i.year

I also recommend using egen year_dum = group(year) reg y i.year_dum This can be generalized arbitrarily, and you can easily create, e.g., year-by-state fixed effects this way: egen year_state_dum = group(year state) reg y i.year_state_dum

Related

retaining the variable label in reshape

Computing running sum with moving time-window

Browse all the rows and columns that contain a zero

Group by with percentages and raw numbers

Stata - how to create T variables that have values for each t in panel data

Categories

Resources