Attrition in panel data - Stata


I am constructing a panel dataset based on the survey data for the years 2010-2013 (four consecutive years). As is usually the case with household survey data, there is an issue of attrition, i.e. some households drop out from the survey from year to year. I need to figure out whether these households are missing at random.
My idea is to come up with a dummy equal to 1 in 2011 if a household present in 2010 is missing in 2011 (and 0 otherwise), and so on for the years 2012, 2013. Then I want to run the logit/probit regression on this dummy with a set of covariates that I would like to control for in my study. The variable for household id is "hhid" and I have of course the time dimension variable "year".
Does anyone have a precise idea how this should be properly coded in Stata? I know it is not complicated, but I just cannot wrap my head around it and figure this out....

Here is an example of how to create dummies in panel data, collapse them to the parent unit of observation (so that each dummy is 1 if it was 1 for that unit in any time period), and then merge the parent-level data back onto the panel.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte hhid int year
1 2010
1 2011
1 2012
1 2013
2 2010
2 2011
2 2013
3 2010
3 2011
end
*Create one dummy per year, flagging the hh-year observations from that year
local year_dummies ""
forvalues year = 2010/2013 {
    gen dummy`year' = (year==`year')
    local year_dummies "`year_dummies' dummy`year'"
}
*Collapse the data set to hh level, where each dummy is 1 if any hh-year observation was 1
preserve
collapse (max) `year_dummies' , by(hhid)
tempfile year_dummy_hhlevel
save `year_dummy_hhlevel'
restore
*Rename the original dummies so that the merge does not overwrite them
rename dummy???? org_dummy????
*Merge the hh-level dummies back onto each hh-year observation
merge m:1 hhid using `year_dummy_hhlevel', nogen
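
The code above records which years each household appears in, but it does not yet build the attrition dummy described in the question. A minimal sketch of one way to do that, assuming the same toy data, uses fillin to add a row for every missing hhid-year combination. The variable name attrit is hypothetical, and treating a household that was already absent in the previous year as missing (rather than 0) is an assumption you may want to change.

```stata
* Sketch only: attrit is a hypothetical name; hhid and year are from the question
clear
input byte hhid int year
1 2010
1 2011
1 2012
1 2013
2 2010
2 2011
2 2013
3 2010
3 2011
end
* Add a row for every hhid-year combination not in the data;
* fillin creates _fillin, which is 1 for the added rows and 0 otherwise
fillin hhid year
* attrit = 1 if the household was observed last year but not this year,
*          0 if observed in both years,
*          missing in the first year or if it was already absent last year
bysort hhid (year): gen byte attrit = _fillin if _n > 1 & _fillin[_n-1] == 0
list hhid year _fillin attrit, sepby(hhid)
```

Covariates are missing in the filled-in rows, so a logit or probit of attrit would have to use values from the previous year, for example lagged covariates after xtset hhid year.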

Your question is whether the households you do not observe in year X differ from those you do observe in year X. There is no perfect way to answer this, since you, by definition, did not observe those households.
You did, however, observe all households in your study in year 0 (2010 in your case). As you imply yourself, you can use the year 0 observations as a proxy to check whether those households are different in year X. I can show you how to code this, but StackOverflow is not the appropriate forum to discuss whether this is statistically valid given your data, how it was collected and what analysis you intend to use.
One way to code this is to use iebaltab in the package called ietoolkit available from SSC (disclosure, I wrote that command).
You can create an attrition dummy indicating attrition and use iebaltab like this: iebaltab balancevars, grpvar(attrition) where balancevars is a list of household characteristics that you want to check were similar in year 0. You can use the option ftest to include a test across all balance variables, the way you are suggesting.
Note that this command generates statistics, but it is up to you to decide whether this is valid; the validity of balance tests is hotly debated. Those debates are not about coding, though, which is what StackOverflow is about.

Related

Panel data - drop observations in stata

I have panel data containing 4 waves. I need help to keep only individuals who have participated in all waves. I saw this post, drop observations, but I ended up deleting everything.
Can anyone help me?

The details depend on details you haven't given. But for example suppose you have variables id and wave where wave runs 1 2 3 4. Conditions for selecting complete panels only might be
bysort id : keep if _N == 4
or
egen total = total(wave), by(id)
keep if total == 10
These commands won't help if you have a wave variable always present but the problem is missing values on other variables.
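
For the case just mentioned, where every wave is present but values on another variable are missing, a minimal sketch on made-up data is below. The variable x and the helper names n_waves and n_x are hypothetical; the idea is to count nonmissing values of x within each id alongside the number of waves.

```stata
* Sketch only: x, n_waves and n_x are hypothetical names
clear
input byte id byte wave double x
1 1 5
1 2 6
1 3 7
1 4 8
2 1 3
2 2 .
2 3 4
2 4 5
3 1 9
3 2 8
end
* Stop with an error if any id-wave pair is duplicated,
* since duplicates would make _N == 4 misleading
isid id wave
* Number of waves present for each id
bysort id: gen byte n_waves = _N
* Number of nonmissing values of x for each id
egen byte n_x = count(x), by(id)
* Keep ids with all 4 waves and x observed in every wave
keep if n_waves == 4 & n_x == 4
list, sepby(id)
```

Here id 2 is dropped because x is missing in wave 2, and id 3 because it has only two waves.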

longitudinal dataset categorical variables

I have a longitudinal dataset which contains variables on individuals from 2 waves, from Feb and June, which measure economic activity across these individuals. The variables from the Feb and May waves are categorical, and I am running the proportion command in Stata to get the individual change in economic activity. For example, I am looking for changes in hours worked across the 2 waves. I run proportion but cannot figure out the if condition, as I only want individuals who responded in both Feb and June. I want to drop all those who responded in Feb but not in May, or vice versa.

Let's suppose you have an identifier variable id and a time-like variable, wave that takes values 1 and 2. If so, you are looking for individuals that satisfy
bysort id (wave) : gen wanted = wave[1] == 1 & wave[2] == 2
So wanted is an indicator that is 1 for individuals present in both waves and 0 otherwise, and if wanted is then an if condition that selects those people.
There are many variations on this, depending on: your variable names; your data layout; how the information on waves is held (could be also, say, a string variable containing values like "Feb", "May" or "June", or a numeric variable holding dates).
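
As one illustration of such a variation, here is a minimal sketch assuming that wave is a string variable holding "Feb" or "June" and that hours is a categorical variable. The data and the names n_feb, n_june and hours are made up for the example.

```stata
* Sketch only: toy data; hours, n_feb and n_june are hypothetical names
clear
input byte id str4 wave byte hours
1 "Feb"  1
1 "June" 2
2 "Feb"  3
3 "June" 2
end
* Count how often each individual appears in each wave
egen byte n_feb  = total(wave == "Feb"),  by(id)
egen byte n_june = total(wave == "June"), by(id)
* 1 if the individual responded in both waves, 0 otherwise
gen byte wanted = n_feb > 0 & n_june > 0
proportion hours if wanted
```

Only id 1 appears in both waves here, so only its two observations enter the proportion calculation.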
You gave a broad-brush description sketching the problem but almost no precise information on the data. The stata tag wiki gives much detailed advice on how to post a question and flags the importance of giving a concrete data example.

Clarification on tabstat use after bysort in Stata

I have a rather simple question regarding the output of tabstat command in Stata.
To be more specific, I have a large panel dataset containing several hundred thousands of observations, over a 9 year period.
The context:
bysort year industry: egen total_expenses=total(expenses)
This line should create total expenses by year and industry (or sum of all expenses by all id's in one particular year for one particular industry).
Then I'm using:
tabstat total_expenses, by(country)
As far as I understand, tabstat should show in a table format the means of expenses. Please do note that ids are different from countries.
In this case, does tabstat calculate the means over all 9 years and all industries for a particular country, or is it just the mean of one year and one industry for each country in my panel data?
What would happen if this command is used in the following context:
bysort year industry: egen mean_expenses=mean(expenses)
tabstat mean_expenses, by(country)
Does tabstat create means of means? This is a little bit confusing.

I don't know what is confusing you about what tabstat does, but you need to be clear about what calculating means implies. Your dataset is far too big to post here, but for your sake as well as ours creating a tiny sandbox dataset would help you see what is going on. You should experiment with examples where the correct answer (what you want) is obvious or at least easy to calculate.
As a detail, your explanation that ids are different from countries is itself confusing. My guess is that your data are on firms and the identifier concerned identifies the firm. Then you have aggregations by industry and by country and separately by year.
bysort year industry: egen total_expenses = total(expenses)
This does calculate totals and assigns them to every observation. Thus if there are 123 observations for industry A and 2013, there will be 123 identical values of the total in the new variable.
tabstat total_expenses, by(country)
The important detail is that tabstat by default calculates and shows a mean. It just works on all the observations available, unless you specify otherwise. Stata has no memory or understanding of how total_expenses was just calculated. The mean will take no account of different numbers in each (industry, year) combination. There is no selection of individual values for (industry, year) combinations.
Your final question really has the same flavour. What your command asks for is a brute force calculation using all available data. In effect your calculations are weighted by the numbers of observations in whatever combinations of industry, country and year are being aggregated.
I suspect that you need to learn about two commands: (1) collapse and (2) egen, specifically its tag() function. If you are using Stata 16 or later, frames may also be useful to you.
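
A tiny sandbox of the kind suggested above might look like the following sketch. The data are made up, and the variable first is a hypothetical name. It compares the brute-force mean over all observations with a mean that uses one observation per (country, industry, year) combination, picked out with egen's tag() function.

```stata
* Sketch only: toy data; first is a hypothetical name
clear
input str1 country str1 industry int year double expenses
"X" "A" 2013 10
"X" "A" 2013 20
"X" "B" 2013 5
"Y" "A" 2013 30
end
* Industry A in 2013 totals 60; industry B in 2013 totals 5
bysort year industry: egen total_expenses = total(expenses)
* Brute force: country X averages 60, 60 and 5, so 125/3
tabstat total_expenses, by(country)
* Tag one observation per (country, industry, year) combination
egen byte first = tag(country industry year)
* One value per combination: country X averages 60 and 5, so 32.5
tabstat total_expenses if first, by(country)
```

Which of the two means is wanted depends on the question being asked; the sandbox only makes the difference visible.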

Deleting all observations given one observation for each variable's type

I have a table with firm identifiers, fiscal year, quarter and market_capital. I want to delete all firm observations that had a specific market capital at a specific quarter of a specific year. That is, I want to delete all observations for a firm if its market capital for 2006, quarter 2 was below 50.
My table is in the form:
(picture of the data table)

If I understand correctly, you have a Stata dataset containing four variables which I will call firm, year, quarter, and mc (since "Capital Market" shown in the picture of your data is not a valid Stata variable name).
The following code might start you in the right direction, but it is untested since my copy of Stata cannot read the picture of your data, and "I want to retype data from a picture of data" said nobody, ever.
Added in edit: the untested code had an error, so I removed it.

Having a quarterly date variable -- rather than separate year and quarter variables -- will be needed sooner or later.
That could be
gen qdate = yq(year, quarter)
format qdate %tq
Then your code for dropping is
egen todrop = total(capital < 50 & qdate == yq(2006, 2)), by(firm)
drop if todrop
as the variable todrop will be 1 if and only if you want to drop a firm and 0 otherwise.
See this paper for a review of related technique.
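
The dropping code can be tried out on a small made-up dataset, as in this sketch. The values are invented, and the variable names follow the answer above.

```stata
* Sketch only: toy data, variable names as in the answer above
clear
input byte firm int year byte quarter double capital
1 2006 1 60
1 2006 2 40
1 2006 3 70
2 2006 1 45
2 2006 2 55
end
gen qdate = yq(year, quarter)
format qdate %tq
* The condition is 1 only in the 2006q2 observation with capital below 50,
* so the total by firm is 1 for firms to be dropped and 0 otherwise
egen todrop = total(capital < 50 & qdate == yq(2006, 2)), by(firm)
drop if todrop
list, sepby(firm)
```

Firm 1 is dropped entirely because its 2006q2 value is 40. Firm 2 is kept even though its 2006q1 value is below 50, since only 2006q2 matters. A missing capital value in 2006q2 would not trigger a drop, because Stata treats numeric missing values as larger than any number.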

Single sample t-Test in Stata without hardcoded null hypothesis

I am pretty new to Stata and I am having difficulty doing something which I would guess is not that unusual of a thing to try to do. I am working with a panel data set (countries and times). Each observation consists of a country, a year, and a variable, call it x. The data is sorted by country year (i.e. all observations corresponding to a given country are consecutive and sorted by year).
Each country has 54 years of data corresponding to 1960 to 2013 inclusive. I would like to run a t-test something like in the following way:
by country: ttest x = x[54] if year != 2013
But I get an error ("weights not allowed") which I don't know how to interpret. I could do it by hardcoding it in and using the usual syntax
by country: ttest x = # if year != 2013
but I want to avoid hard-coding since there are >100 countries and I want to be able to flexibly add / remove countries (and this is just poor form in general).
My first thought was to define a macro using something like
levelsof country, local(levels)
foreach c of local levels {
local y x if year == 2013
ttest x = y if year != 2013
// some code to store the value that I haven't figured out yet
}
but you can't use "if" when declaring a local macro. I am pretty lost and would appreciate any help you all can give. Thank you!

Student's t tests here make little sense without adjustment for time and space dependence structure, unless you have grounds for treating your data as equivalent to independent draws from the same distribution. You can do the tests, but standard errors and P-values are dubious if not bogus. That is, your individual tests on time series face one problem; and collectively your tests face another problem. For a good account, see either edition of Box, Hunter, Hunter, Statistics for experimenters. John Wiley.
That large point aside, Stata is choking on the [] which are being misread as an attempt to specify weights. My guess is that
by country: ttest x = `=x[54]' if year != 2013
would be acceptable syntax to Stata, although still dubious statistics. The detail here is the macro-like syntax
`= '
which has the effect that the expression given will be evaluated by Stata before the line is passed to ttest. So the result, a numeric value, will be what the ttest command sees.
This is naturally similar in spirit to what you were imagining, although your code is some way from being legal and correct.
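
One caveat: the inline `= ' expression is evaluated once, when the command line is read, so x[54] there refers to observation 54 of the whole dataset rather than to the 54th observation within each country. A loop closer to the one sketched in the question is one way around that. The sketch below runs on simulated data and assumes that country is numeric; with a string country variable the comparisons would need quotes. The local macro name target is hypothetical, and the statistical reservations above still apply.

```stata
* Sketch only: simulated data with a numeric country identifier
clear
set seed 12345
set obs 162                        // 3 countries x 54 years
gen country = ceil(_n / 54)
bysort country: gen year = 1959 + _n   // 1960 to 2013 within each country
gen x = rnormal() + country
levelsof country, local(levels)
foreach c of local levels {
    * Fetch this country's 2013 value of x (a single observation)
    summarize x if country == `c' & year == 2013, meanonly
    local target = r(mean)
    display "country `c': testing against " %9.4f `target'
    * One-sample t test of the other years against that value
    ttest x == `target' if country == `c' & year != 2013
}
```

Results such as r(t) and r(p) are left behind by each ttest call and could be stored inside the loop.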
UPDATE This calculation may also be helpful:
egen mean = mean(x / (year != 2013)), by(country)
egen sd = sd(x / (year != 2013)), by(country)
gen z = (x - mean) / sd if year == 2013
list country x z if year == 2013
