I am trying to construct a balanced dataset in Stata using the xtbalance command with the range() option. I have data for the years 2010-2013 and 2016; the survey was not conducted in 2014 and 2015. Running xtbalance, range(2010 2016) fails because xtbalance does not recognize that 2014 and 2015 are absent for every id, and essentially no observations are left in the resulting panel dataset.
How should I implement this? Is it possible to construct a balanced dataset for the years 2010-2013 and 2016? What command in Stata would allow me to do this?
This is more statistics than Stata, but here goes.
If you don't have data for 2014 and 2015, how is Stata expected to find them or estimate them? Only by imputation or interpolation, which would be a massive stretch.
You can get a balanced dataset with tsfill, but it's hard to see any point to that.
You could relabel years 2010-13 and 2016 as say "waves" 1 2 3 4 5, but that ignores the crucial gap.
Best to see and state the problem for what it is, a gap beyond your control.
xtbalance is from SSC, as you are asked to explain. (I see that it includes without permission or documentation code that I wrote, but that's a small deal.)
* Example generated by -dataex-. For more info, type help dataex
clear
input float(id year whatever)
1 2010 1
1 2011 2
1 2012 3
1 2013 4
1 2016 5
2 2010 7
2 2011 8
2 2012 9
2 2013 10
2 2016 12
end
tsset id year
tsfill
list, sepby(id)
+----------------------+
| id year whatever |
|----------------------|
1. | 1 2010 1 |
2. | 1 2011 2 |
3. | 1 2012 3 |
4. | 1 2013 4 |
5. | 1 2014 . |
6. | 1 2015 . |
7. | 1 2016 5 |
|----------------------|
8. | 2 2010 7 |
9. | 2 2011 8 |
10. | 2 2012 9 |
11. | 2 2013 10 |
12. | 2 2014 . |
13. | 2 2015 . |
14. | 2 2016 12 |
+----------------------+
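The relabelling into waves mentioned above can be sketched as follows, with the same caveat that it hides the 2014-2015 gap. This is only a minimal sketch under stated assumptions, not a description of what xtbalance does internally: the variable names wave and nyears are made up for illustration, id 3 is a hypothetical addition that misses 2012, and there is assumed to be at most one observation per id and year. It keeps only ids observed in every survey year and then declares the panel on the wave number.

```stata
clear
input float(id year whatever)
1 2010 1
1 2011 2
1 2012 3
1 2013 4
1 2016 5
2 2010 7
2 2011 8
2 2012 9
2 2013 10
2 2016 12
3 2010 4
3 2011 5
3 2013 6
3 2016 7
end

* consecutive wave numbers 1, 2, ... for the survey years actually present
egen wave = group(year)
summarize wave, meanonly
local nwaves = r(max)

* keep ids observed in every survey year (assumes one observation per id-year)
bysort id : generate nyears = _N
keep if nyears == `nwaves'

* declare the panel on wave; the step from wave 4 to wave 5 is really three years
xtset id wave
list, sepby(id)
```

Any lag or difference operator then treats 2013 to 2016 as a single step, which is exactly the distortion to state openly in any write-up.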
Say I have a bi-yearly panel only with observations at odd years, such as
input id year var
1 2011 23
1 2013 12
1 2015 11
2 2011 44
2 2013 42
2 2015 13
end
and I would like to fill in the missing even years. Here years 2012 and 2014 are missing for all ids.
input id year var
1 2011 23
1 2012 .
1 2013 12
1 2014 .
1 2015 11
2 2011 44
2 2012 .
2 2013 42
2 2014 .
2 2015 13
end
I had a look at help expand but I am unsure that's what I need, since it does not take the by prefix.
As background, I need to fill in the even years to be able to merge with another panel dataset collected in even years only.
You can set the panel id as id and the time variable as year and use tsfill:
clear
input id year var
1 2011 23
1 2013 12
1 2015 11
2 2011 44
2 2013 42
2 2015 13
end
xtset id year
tsfill
If the minimum and maximum years are not the same across panels, look at the full option of tsfill.
. list
+-----------------+
| id year var |
|-----------------|
1. | 1 2011 23 |
2. | 1 2012 . |
3. | 1 2013 12 |
4. | 1 2014 . |
5. | 1 2015 11 |
|-----------------|
6. | 2 2011 44 |
7. | 2 2012 . |
8. | 2 2013 42 |
9. | 2 2014 . |
10. | 2 2015 13 |
+-----------------+
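A small sketch of the full option, using hypothetical data in which the two panels cover different spans. Without full, tsfill only fills gaps between each panel's own first and last year; with full, every panel is extended to the overall first and last year.

```stata
clear
input id year var
1 2011 23
1 2013 12
2 2013 42
2 2015 13
end

xtset id year

* every id now runs from 2011 to 2015, with var missing in the added years
tsfill, full
list, sepby(id)
```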
I have a panel dataset from 2006 to 2012. I generated a new variable, entry, that takes the value 1 in the year a firm enters a country. For instance, if a firm has a missing value (.) for its sales at time t, entry is 0; if at t+1 it enters the country, in other words has a value for its sales, entry is 1. The command that worked for this is as follows:
egen firm_id=group(firm country)
by firm_id (year), sort: gen byte entry = ///
sum(inrange(sales, 0,.)) == 1 & sum(inrange(sales[_n - 1],0,.)) == 0
Since my data start from 2006 I excluded the observations for this year with the command:
bysort firm (year) : replace entry = 0 if year == 2006
However what I want is instead of having 0 values,
to have missing values for the subsequent years after its entry (e.g. at t+2 or t+3).
The same I applied for the exit but I changed the sort order of year:
gen nyear = -year
by firm_id (nyear), sort: gen byte exit = ///
sum(inrange(sales, 0,.)) == 1 & sum(inrange(sales[_n - 1],0,.)) == 0
since the last observation year in my data is 2012 I excluded those observations:
bysort firm (year) : replace exit = 0 if year == 2012
Again here what I want is instead of having 0 values,
to have missing values for the subsequent years after its exit (e.g. at t+2 or t+3).
As I understand it, the variable sales is missing when there are no sales and positive otherwise.
You want indicators for a year being the first and last years of sales for a firm in a country.
I think this gets you most of the way. First, we need example data!
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(firm_id year sales)
1 2006 .
1 2007 .
1 2008 42
1 2009 42
1 2010 42
1 2011 .
1 2012 .
2 2006 .
2 2007 666
2 2008 666
2 2009 .
2 2010 .
2 2011 .
2 2012 .
end
The first and last dates are the minimum and maximum dates, conditional on there being sales.
egen first = min(cond(sales < ., year, .)), by(firm_id)
egen last = max(cond(sales < ., year, .)), by(firm_id)
For discussion of technique, see section 9 of this paper. Then (1, .) indicators follow directly
generate isfirst = cond(year == first, 1, .)
generate islast = cond(year == last, 1, .)
list, sepby(firm_id)
+----------------------------------------------------------+
| firm_id year sales first last isfirst islast |
|----------------------------------------------------------|
1. | 1 2006 . 2008 2010 . . |
2. | 1 2007 . 2008 2010 . . |
3. | 1 2008 42 2008 2010 1 . |
4. | 1 2009 42 2008 2010 . . |
5. | 1 2010 42 2008 2010 . 1 |
6. | 1 2011 . 2008 2010 . . |
7. | 1 2012 . 2008 2010 . . |
|----------------------------------------------------------|
8. | 2 2006 . 2007 2008 . . |
9. | 2 2007 666 2007 2008 1 . |
10. | 2 2008 666 2007 2008 . 1 |
11. | 2 2009 . 2007 2008 . . |
12. | 2 2010 . 2007 2008 . . |
13. | 2 2011 . 2007 2008 . . |
14. | 2 2012 . 2007 2008 . . |
+----------------------------------------------------------+
I have not done anything different for 2006 or 2012. You could just build special rules into the cond() syntax.
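One way such special rules for the boundary years might look, as a sketch only. The data here are hypothetical: firm 1 already has sales in 2006 and firm 2 still has sales in 2012, so neither the entry of firm 1 nor the exit of firm 2 can be observed. The years 2006 and 2012 are hard-coded as the first and last sample years, as in the question.

```stata
clear
input float(firm_id year sales)
1 2006 5
1 2007 5
1 2008 5
1 2009 .
1 2010 .
1 2011 .
1 2012 .
2 2006 .
2 2007 .
2 2008 .
2 2009 .
2 2010 9
2 2011 9
2 2012 9
end

egen first = min(cond(sales < ., year, .)), by(firm_id)
egen last = max(cond(sales < ., year, .)), by(firm_id)

* entry cannot be observed in 2006, nor exit in 2012, so those years never get a 1
generate isfirst = cond(year == first & year > 2006, 1, .)
generate islast = cond(year == last & year < 2012, 1, .)

list, sepby(firm_id)
```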
How can I delete duplicates which occur in column x but not in column y?
My dataset is as follows:
+-------+---+---+
| year | x | y |
+-------+---+---+
| 2001 | 1 | 2 |
| 2001 | 2 | 3 |
| 2001 | 2 | 3 |
| 2001 | 4 | 6 |
| 2001 | 5 | 9 |
| 2001 | 4 | 2 |
| 2001 | 4 | 9 |
+-------+---+---+
What I want is to remove the entries which occur in column y from the ones in column x.
My result would be: 1,4,5
I am currently learning Stata and would love to know a good source for all possible commands, if one exists, so that I can learn better on my own. Currently I have trouble finding good sources.
In Stata what you call columns are always called variables.
See http://www.statalist.org/forums/help#stata for general advice on how to present data examples in Stata questions. (The comments on CODE delimiters don't apply here.)
This may help. I didn't understand the role of year in your problem.
clear
input year x y
2001 1 2
2001 2 3
2001 2 3
2001 4 6
2001 5 9
2001 4 2
2001 4 9
end
rename x Datax
rename y Datay
gen long obs = _n
reshape long Data, i(obs) j(which) string
bysort Data (which) : drop if which[_N] == "y"
list
+---------------------------+
| obs which year Data |
|---------------------------|
1. | 1 x 2001 1 |
2. | 4 x 2001 4 |
3. | 7 x 2001 4 |
4. | 6 x 2001 4 |
5. | 5 x 2001 5 |
+---------------------------+
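The listing above still shows the value 4 three times. If only the distinct values 1, 4 and 5 are wanted, one possible follow-up (a sketch that repeats the same steps and then drops repeated values) is:

```stata
clear
input year x y
2001 1 2
2001 2 3
2001 2 3
2001 4 6
2001 5 9
2001 4 2
2001 4 9
end

rename x Datax
rename y Datay
gen long obs = _n
reshape long Data, i(obs) j(which) string

* drop every value that occurs in y at least once
bysort Data (which) : drop if which[_N] == "y"

* keep one observation per surviving value
duplicates drop Data, force
list Data
```

Alternatively, levelsof Data after the drop displays the distinct values without changing the data.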
All possible commands aren't documented in a single place. Someone could write new commands all the time and they would not be documented anywhere except their help files. Did you mean that? Nor are all existing commands documented in one place: many are user-written and most of those are just documented by their help files.
Most of the official commands in Stata as supplied by StataCorp are documented in the manuals. Literally, there are also undocumented commands (I am not inventing this: see help undocumented) and there are also nondocumented commands that exist, known about because StataCorp mention them in talks or emails. To be as positive as possible: start with the manuals, bundled with your copy of Stata as .pdf files.
I'm using Stata. I have a dataset of multiple firms and their banks within a given year for multiple years. Since firms often have more than one bank, there are multiple observations for a firm-year. I have a variable "bank_exityear" which contains the last year a bank is in the sample. I would like to create a variable that, for each firm within a given year, contains the minimum of "bank_exityear" from the previous year (for the same firm).
An example data-set is attached here:
The variable I'd like to create is the bold "want". The data starts in 2008.
What would be the best way to create this variable?
Here's a solution using rangestat (from SSC). To install it, type in Stata's Command window:
ssc install rangestat
For the problem at hand, this requires finding the minimum bank_exityear across all observations of the same firmid whose year is one less than the year of the current observation:
clear
input year firmid bankid bank_exityear want
2008 1 1 2008 .
2008 1 2 2015 .
2009 1 2 2015 2008
2009 1 3 2015 2008
2010 1 2 2015 2015
2010 1 3 2015 2015
end
rangestat (min) bank_exityear, interval(year -1 -1) by(firmid)
list
and the results:
. list, sepby(firmid)
+-----------------------------------------------------+
| year firmid bankid bank_e~r want bank_e~n |
|-----------------------------------------------------|
1. | 2008 1 1 2008 . . |
2. | 2008 1 2 2015 . . |
3. | 2009 1 2 2015 2008 2008 |
4. | 2009 1 3 2015 2008 2008 |
5. | 2010 1 2 2015 2015 2015 |
6. | 2010 1 3 2015 2015 2015 |
+-----------------------------------------------------+
This sort of strategy might do the trick:
clear
input year firmid bankid bank_exityear want
2008 1 1 2008 .
2008 1 2 2015 .
2009 1 2 2015 2008
2009 1 3 2015 2008
2010 1 2 2015 2015
2010 1 3 2015 2015
end
tempfile min_year
preserve
collapse (min) want2 = bank_exityear, by(firmid year)
save `min_year'
restore
replace year = year - 1
merge m:1 firmid year using "`min_year'", nogen keep(master match)
replace year = year + 1
This assumes that there are no gaps in year.
Your question is a little unclear, but I believe some combination of
bysort bank_id (year) : gen lag_exit = bank_exit_year[_n-1]
bysort bank_id : egen min_var = min(lag_exit )
should work
I am attempting to make the data balanced for my sample. My data currently looks like:
id year y
1 2000 2
1 2002 4
1 2003 5
2 2001 2
2 2002 3
....
And I would like it to look like:
id year y
1 2000 2
1 2001 .
1 2002 4
1 2003 5
2 2000 .
2 2001 2
2 2002 3
....
I have tried creating a .dta of just the year and merging it to the data; however, I can't get it to work. Essentially I would like to add rows of missing data to the panel. I realize I could just drop ids with unbalanced data, but this is not an option for my methodology.
You need to skim the Data-Management Reference Manual [D] when looking for basic data management functionality. In this case fillin does what you seem to be asking.
clear
input id year y
1 2000 2
1 2002 4
1 2003 5
2 2001 2
2 2002 3
end
fillin id year
list, sepby(id)
+-------------------------+
| id year y _fillin |
|-------------------------|
1. | 1 2000 2 0 |
2. | 1 2001 . 1 |
3. | 1 2002 4 0 |
4. | 1 2003 5 0 |
|-------------------------|
5. | 2 2000 . 1 |
6. | 2 2001 2 0 |
7. | 2 2002 3 0 |
8. | 2 2003 . 1 |
+-------------------------+
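One caveat worth a small sketch: fillin only forms combinations of values that already occur somewhere in the data, so a year that is absent for every id is not created. With the hypothetical data below, no id has 2001, and tsfill with the full option is used afterwards to add it.

```stata
clear
input id year y
1 2000 2
1 2003 5
2 2000 1
2 2002 3
end

* all id-year combinations of the years present: 2000, 2002, 2003
fillin id year
count
drop _fillin

* tsfill, full also adds 2001, which no id had
xtset id year
tsfill, full
count
list, sepby(id)
```

The first count reports 6 observations and the second 8.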