In Stata, I'm trying to add columns to my dataset and name them year_2005, ..., year_2017.
Here is my code:
gen a=.
forvalues i=2005(1)2015 {
replace a=(b>i)
rename a "year"+`i'
}
b is a numeric variable in my dataset.
Here's one way to do this:
clear
set obs 1
forvalues i = 1 / 15 {
* Reset the padding on every pass so that only single digits get a leading 0
local d = cond(`i' < 10, "0", "")
generate year_20`d'`i' = runiform()
}
Or alternatively (as per @NickCox's comment; see Stata tip 85):
clear
set obs 1
forvalues i = 1 / 15 {
generate year_20`: display %02.0f `i'' = runiform()
}
Or using your example:
clear
set obs 1
forvalues i = 2005(1)2015 {
generate a = .
replace a = runiform()
rename a year_`i'
}
Related
I'm using Stata. I have a dataset with approximately 1800 observations and 1050 variables. Most of them are categorical variables with a few categories. It looks something like this:
------------------------------------------------------
| id | fh_1 | fh_1a | fh_2 | fh_2a | fh_3 | fh_3a |...
------------------------------------------------------
|1111| 1 |closed | 2 | 4 | 1 | open |...
------------------------------------------------------
|1112| 2 | open | 1 | 2 | 3 | closed|...
------------------------------------------------------
.
.
.
I need to export to an Excel sheet the list of all variables in this dataset with all unique values for each variable. It should look something like this:
--------------------------
|variable | unique_values|
--------------------------
| fh | 1 2 3 4 5 |
--------------------------
|fh_1a | closed open |
--------------------------
.
.
.
I think I need a loop with the command levelsof but I'm not sure how to build it. Any suggestions?
foreach v of var * {
levelsof `v'
}
would be a start, but I haven't directly addressed how to make that output Excel-friendly.
One possibility is to put all the output in string variables given that the number of observations exceeds the number of variables.
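* Capture the variable list first, so the two result variables are not looped over
unab allvars : *
gen varname = ""
gen levels = ""
local i = 1
foreach v of local allvars {
    levelsof `v'
    replace varname = "`v'" in `i'
    replace levels = `"`r(levels)'"' in `i'
    local ++i
}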
Here is one way to solve it. You might run into issues if you have string variables where some observations hold values made up of more than one word: in the exported list there is then no way to tell whether two words came from one observation containing both or from two observations with one word each.
The values are sorted alphabetically, so you might be able to work it out anyway, but it could be ambiguous.
sysuse auto,clear
* Get a list of all vars apart from whatever var we do not want to include
ds make, not
local all_vars_but_id `r(varlist)'
* Get the number of vars, represents the number of rows in the dataset to be exported
local num_vars : word count `all_vars_but_id'
* Get the values for each var and store in local with same name as var
foreach var of local all_vars_but_id {
levelsof `var'
local `var' `r(levels)'
}
*Preserve the original data
preserve
* Remove the data and set up the data set to be exported
clear
set obs `num_vars'
gen var = ""
gen values = ""
* Copy the value of the locals created above to one row per variable
local counter 1
foreach var of local all_vars_but_id {
replace var = "`var'" if _n == `counter'
replace values = "``var''" if _n == `counter'
local counter = `counter' + 1
}
* Export to Excel
export excel using "C:/path/to/file/unique_values.xls"
*Restore the original data
restore
Another option using levelsof
input id str6(var1 var2 var3)
1 "open" "2" "3"
2 "closed" "1" "2"
3 "open" "1" "1"
end
reshape long var, i(id)
rename var values
rename _j var
gen unique_values = ""
forvalues i = 1/3 {
levelsof values if var == `i'
replace unique_values = r(levels) if var == `i'
}
replace unique_values = subinstr(unique_values,"`","",.)
replace unique_values = subinstr(unique_values,`"""',"",.)
replace unique_values = subinstr(unique_values,"'","",.)
contract var unique_values
drop _freq
list, noobs
I have a balanced panel with a set of dummies for 'countries' and observations for several years. I want to generate a new set of variables that assigns a number in the sequence 1:n for each year observation of country i, and 0 for any other observation that is not from country i.
As an example, suppose I have two countries and two years. Below on the left is an example of my database. I want a new set of variables as shown on the right:
*  Example of Database         Example of Desired Output
*  country1 country2 year      output1 output2
*      1        0      1          1       0
*      1        0      2          2       0
*      0        1      1          0       1
*      0        1      2          0       2
How can I get the desired output? Intuitively I need to multiply 'country*' by 'year' to get 'output*', but I have been unable to make it work in Stata.
Below is what I tried.
gen output = year * country
* country is ambiguous
gen output = year * country*
* invalid syntax
foreach var in country*{
gen output_`var' = year * `var'
}
* invalid name
Your last attempt was almost there. To use the wildcards * and ? in foreach, you need to tell Stata that you are passing a varlist, which is what foreach ... of varlist does. With foreach ... in, the text country* is taken literally as a single list element, hence the invalid name error. To use a wildcard in foreach, do this:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte(country1 country2 year)
1 0 1
1 0 2
0 1 1
0 1 2
end
foreach var of varlist country* {
gen `var'_year = year * `var'
}
The full name (country1, country2, etc.) is stored in `var', so I took the liberty of naming the result variables country1_year, country2_year, etc. rather than output_country1, output_country2, etc.
Note that this solution gives the desired output only if the country* variables take only the values 1 and 0, no observation has a missing value in any country* variable, and no observation has the value 1 in more than one country* variable.
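The conditions in that note can be checked before creating the new variables. The following is only a sketch on the same example data; the helper variable name n_on is made up for illustration. Each assert stops with an error if its condition fails in any observation.
clear
input byte(country1 country2 year)
1 0 1
1 0 2
0 1 1
0 1 2
end
* Every dummy is 0 or 1 (a missing value would also fail this check)
foreach var of varlist country* {
    assert inlist(`var', 0, 1)
}
* Exactly one dummy is switched on in each observation
egen byte n_on = rowtotal(country*)
assert n_on == 1
drop n_on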
How can I generate panel data in Stata?
I would like each individual to be affected by unobserved heterogeneity.
For example, I want the DGP (data generating process) to be something like:
Wages_{it}= \beta (Labor market experience_{it}) + \alpha_{i} + \epsilon_{it},
where \alpha_{i} is the unobserved heterogeneity and where \epsilon_{it} is the error term which is normally distributed.
Finally, I would like (Labor market experience_{it}) to be an AR(1) process, e.g.:
Labor market experience_{it}= 0.8 * (Labor market experience_{i,t-1}) + v_{it},
where v_{it} is the error term which is normally distributed.
You can do something like this by using subscripting combined with bysort:
clear
set seed 10011979
set obs 4 // Set the number of panels (N)
gen id = _n
gen alpha = rnormal(0,1)
expand 3 // Set the number of periods (T)
bys id: gen t=_n
xtset id t
bysort id (t): gen lme = rnormal(0,1) + rnormal(0,1) if _n==1
bysort id (t): replace lme = .8 * lme[_n-1] + rnormal(0,1) if _n!=1
gen w = 3 * lme + alpha + rnormal(0,1)
drop alpha
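As a rough check that data simulated this way behave as intended, one can draw a larger panel and fit a fixed-effects regression. This is only a sketch: the panel dimensions (500 individuals, 10 periods) are assumptions chosen for illustration, and everything else follows the code above. The coefficient on lme should then come out close to 3, the value of beta hard-coded in the line that generates w.
clear
set seed 10011979
set obs 500 // assumed number of panels (N)
gen id = _n
gen alpha = rnormal(0,1)
expand 10 // assumed number of periods (T)
bysort id: gen t = _n
xtset id t
bysort id (t): gen lme = rnormal(0,1) + rnormal(0,1) if _n == 1
bysort id (t): replace lme = .8 * lme[_n-1] + rnormal(0,1) if _n != 1
gen w = 3 * lme + alpha + rnormal(0,1)
* Within estimator: alpha is swept out, so it need not be observed
xtreg w lme, fe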
How can I create a new variable by dividing an existing variable by its IQR? I have done it the long way, as follows.
Sample data and code is the following:
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
foreach var of varlist read-socst {
egen `var'75 = pctile(`var'), p(75)
egen `var'25 = pctile(`var'), p(25)
gen `var'q =`var'75 - `var'25
drop `var'75 `var'25
}
gen readI = read/readq
gen sciI = science/scienceq
The simplest way is just to use summarize results directly:
sysuse auto, clear
quietly foreach v of var price-foreign {
su `v', detail
gen `v'q = `v' / (r(p75) - r(p25))
}
The egen route is overkill if it means creating new variables for each original variable, just to hold the quartiles or the IQR as repeated constants. But egen comes into its own when you want to do this by groups:
bysort foreign: egen mpg_upq = pctile(mpg), p(75)
by foreign: egen mpg_loq = pctile(mpg), p(25)
gen mpg_Q = mpg / (mpg_upq - mpg_loq)
Note that the IQR can be 0, and will often be 0 for indicator variables.
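One way to guard against that is to create the scaled variable only when the IQR is positive. This is a sketch of that idea and not part of the answer above; the message text is made up for illustration.
sysuse auto, clear
foreach v of varlist price-foreign {
    quietly summarize `v', detail
    if r(p75) > r(p25) {
        * IQR is positive, so the division is safe
        generate `v'q = `v' / (r(p75) - r(p25))
    }
    else {
        display as text "`v': IQR is 0 or undefined, no scaled variable created"
    }
}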
In my dataset, I have observations for football matches. One of my variables is hometeam. Now I want to get the average number of observations per hometeam. How do I do that in Stata?
I know that I could tab hometeam, but since there are over 500 distinct hometeams, I don't want to do the calculation manually.
bysort hometeam : gen n = _N
bysort hometeam : gen tag = _n == 1
su n if tag
EDIT: Another way to do it more concisely:
bysort hometeam : gen n = _N if _n == 1
su n
Why the tagging then? It is often useful to have a tag variable when you are moving back and forth between individual and group level. egen, tag() does the same thing.
Why if _n == 1? You need this value just once for each group, and there are two ways of doing that which always work, even for groups as small as one observation: use the first or the last observation in the group. In a group of one they are the same observation, but that doesn't matter. So if _n == _N is another way to do it.
bysort hometeam : gen n = _N if _n == _N
The code needs to be changed if observations with missing values on some variable should not be counted:
bysort hometeam : gen n = sum(!missing(myvar))
by hometeam : replace n = . if _n < _N
egen, count() is similar, but not identical.
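A minimal sketch of the egen route, on a tiny made-up dataset (the values of hometeam and myvar are assumptions for illustration only). One visible difference from the code above is that egen's count() stores the group count in every observation of the group, not just the last one, so the tag is still needed when summarizing.
clear
input str1 hometeam myvar
"A" 1
"A" .
"B" 3
"B" 4
"B" 5
end
* Tag one observation per team
egen tag = tag(hometeam)
* Number of nonmissing values of myvar per team, stored in every row
egen n_nonmiss = count(myvar), by(hometeam)
* Number of observations per team, missing or not
bysort hometeam : gen n_all = _N
list, sepby(hometeam)
summarize n_all if tag
summarize n_nonmiss if tag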
I assume you can identify the different hometeams with some id variable.
If you want the average number of observations per id this is one way:
clear all
set more off
input id hometeam
1 .
1 5
1 0
3 6
3 2
3 1
3 9
2 7
2 7
end
list, sepby(id)
bysort id: egen c = count(hometeam)
by id: keep if _n == 1
summarize c, meanonly
disp r(mean)
Note that observations with missings are not counted by count. If you did want to count the missings, then you could do:
bysort id: gen c = _n
by id: keep if _n == _N
summarize c, meanonly
disp r(mean)
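Option 2: Using the data from @Roberto's answer
collapse (count) hometeam, by(id)
* meanonly suppresses the output, so display the stored mean explicitly
summarize hometeam, meanonly
display r(mean)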