I'm working on a panel dataset that has missing values in four variables (at the start, at the end, and in the middle of panels). I would like to drop entirely any panel that has missing values in any of these variables.
This is the code I have tried to use so far:
bysort BvD_ID YEAR: drop if sum(!missing(REV_LAY,EMP_LAY,FX_ASSET_LAY,MATCOST_LAY))==0
This code removes every observation with a missing value in any of the four variables, but it keeps the remaining observations of those panels instead of dropping the panel as a whole.
Example data:
Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
001 2001 80 25 120
001 2002 75 . 122
001 2003 82 32 128
002 2001 40 15 45
002 2002 42 18 48
002 2003 45 20 50
In the above sample data, I want to drop panel Firm_ID = 001 completely.
You can do something like:
clear
input Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
001 2001 80 25 120
001 2002 75 . 122
001 2003 82 32 128
002 2001 40 15 45
002 2002 42 18 48
002 2003 45 20 50
end
* preserve the original observation order within each panel
generate index = _n
* running count of observations with a missing value in any of the listed variables
bysort Firm_ID (index): generate todrop = sum(missing(REV_LAY, EMP_LAY, FX_ASSET_LAY))
* the running count in the last observation is positive if the panel has any missing value
by Firm_ID: drop if todrop[_N]
list Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
+-----------------------------------------------+
| Firm_ID Year REV_LAY EMP_LAY FX_ASS~Y |
|-----------------------------------------------|
1. | 2 2001 40 15 45 |
2. | 2 2002 42 18 48 |
3. | 2 2003 45 20 50 |
+-----------------------------------------------+
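An equivalent route, if you prefer to avoid the running sum, is to flag every panel that contains a missing value with egen's max() and then drop the flagged panels. A minimal sketch on the example data (for the full dataset you would add MATCOST_LAY and group by BvD_ID; the variable name anymiss is just illustrative):
* flag panels that have at least one missing value in any of the listed variables
egen byte anymiss = max(missing(REV_LAY, EMP_LAY, FX_ASSET_LAY)), by(Firm_ID)
drop if anymiss
drop anymiss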
I am trying to keep only the last available observation for each variable. However, the issue is that different variables were measured in different years for each country. My data currently looks like this:
iso3c year Gini variable1 variable2 variable3
AND 2000 . 1.279314 33 22
AND 2001 22 2.571869 . .
AND 2002 . 3.492054 . .
AND 2003 44 3.89996 . .
This is my code:
gsort + iso3c - year
drop if Gini==. & variable1==. & variable2==. & variable3==.
bysort iso3c: keep if _n==1
drop year
I tried this with one variable, as below, and then ran the other lines, and it worked well.
drop if Gini==.
However, because I have different variables measured in different years per country, Stata ends up only keeping the following:
iso3c Gini variable1 variable2 variable3
AND 44 3.89996 . .
However, I want something like this, where the last available observation for variables 2 and 3 (from the year 2000) is also kept, even though those variables were not measured in later years.
iso3c Gini variable1 variable2 variable3
AND 44 3.89996 33 22
Note that collapse (lastnm) Gini variable*, by(iso3c) is a one-line solution to this.
Let's show as well how to get something similar from first principles.
The last non-missing value in each panel is accessible once you sort the non-missings to the end of the panel (temporarily). If no non-missing value is available, necessarily a missing value will be returned instead.
clear
input str3 iso3c year Gini variable1 variable2 variable3
AND 2000 . 1.279314 33 22
AND 2001 22 2.571869 . .
AND 2002 . 3.492054 . .
AND 2003 44 3.89996 . .
end
gen OK = .
foreach v in Gini variable1 variable2 variable3 {
    replace OK = !missing(`v')
    bysort iso3c (OK year) : gen `v'_lnm = `v'[_N]
}
sort iso3c year
list iso3c year *lnm
+----------------------------------------------------------+
| iso3c year Gini_lnm va~1_lnm va~2_lnm va~3_lnm |
|----------------------------------------------------------|
1. | AND 2000 44 3.89996 33 22 |
2. | AND 2001 44 3.89996 33 22 |
3. | AND 2002 44 3.89996 33 22 |
4. | AND 2003 44 3.89996 33 22 |
+----------------------------------------------------------+
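For comparison, the collapse one-liner mentioned above does the whole job in one step. A sketch, assuming you start again from the input block above (collapse replaces the data in memory, so preserve first if you want to keep the long layout):
* keep one row per country, taking the last non-missing value of each variable
collapse (lastnm) Gini variable*, by(iso3c)
list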
I need help restructuring the data. My Table looks like this
NameHead Department Per_test Per_Delta Per_DB Per_Vul
Nancy Health 55 33.2 33 63
Jim Air 25 22.8 23 11
Shu Water 26 88.3 44 12
Dick Electricity 77 55.9 66 10
Elena General 88 22 67 9
Nancy Internet 66 12 44 79
And I want my table to look like this
NameHead Nancy Jim Shu Dick Elena Nancy
Department Health Air Water Electricity General Internet
Per_test 55 25 26 77 88 66
Per_Delta 33.2 22.8 88.3 55.9 22 12
Per_DB 33 23 44 66 67 44
Per_Vul 63 11 12 10 9 79
I tried proc transpose but couldn't get the desired result. Please help!
Thanks!
PROC TRANSPOSE does exactly what you want. You must include a VAR statement if you want to include the character variables.
proc transpose data=have out=want;
var _all_;
run;
Note that the variables in the output dataset cannot be unnamed, so the transposed columns are named COL1-COL6 and the original variable names are stored in _NAME_. Here is what the output dataset looks like.
Obs _NAME_ COL1 COL2 COL3 COL4 COL5 COL6
1 NameHead Nancy Jim Shu Dick Elena Nancy
2 Department Health Air Water Electricity General Internet
3 Per_test 55 25 26 77 88 66
4 Per_Delta 33.2 22.8 88.3 55.9 22 12
5 Per_DB 33 23 44 66 67 44
6 Per_Vul 63 11 12 10 9 79
I have the following data structure, where each row represents a household and the variable "group1" identifies the classroom of child 1 in the household, "group2" the classroom of child 2, and so on. It is worth noting that there are around 3000 groups in total, as these are households all over the country.
I need to categorize households as belonging to the same group if at least one of the "group" variables have the same value (i.e., if at least one of their children go to the same class). This can happen if, for two households, "group1" = "group1", but also if "group1" = "group2", or "group3", etc.
I have experimented with inlist and with looping through all the group variables, but haven't gotten anywhere.
I will be extremely grateful for any help you can offer.
This is easier to do with data in a long layout, with one observation per child. You can then group households with children in the same class (as identified in the group variable) using group_id (from SSC):
* Example generated by -dataex-. To install: ssc install dataex
clear
input long household float(group1 group2 group3 group4)
101 15 16 . .
102 13 14 15 17
103 11 17 . .
104 33 34 35 .
105 34 37 . .
end
reshape long group, i(household) j(child)
drop if mi(group)
clonevar hhgroup = household
group_id hhgroup, matchby(group)
reshape wide group, i(household) j(child)
list
and the results
. list
+--------------------------------------------------------+
| househ~d group1 group2 group3 group4 hhgroup |
|--------------------------------------------------------|
1. | 101 15 16 . . 101 |
2. | 102 13 14 15 17 101 |
3. | 103 11 17 . . 101 |
4. | 104 33 34 35 . 104 |
5. | 105 34 37 . . 104 |
+--------------------------------------------------------+
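Note that group_id is a user-written command, so it needs to be installed once from SSC if it is not already present, for example:
* install group_id from SSC if it is not already installed
capture which group_id
if _rc ssc install group_id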
I am trying to reshape some data. The issue is that data is usually either long or wide, but this dataset seems to be set up in a way that I cannot figure out how to reshape. The data looks as follows:
year australia canada denmark ...
1999 10 15 20
2000 12 16 25
2001 14 18 40
And I would like to get it into a panel format like the following
year country gdppc
1999 australia 10
2000 australia 12
2001 australia 14
1999 canada 15
2000 canada 16
The problem is just in the variable names. See e.g. this FAQ for the advice that you may need to rename first before you can reshape.
For more complicated variants of this problem with similar data, see e.g. this paper.
clear
input year australia canada denmark
1999 10 15 20
2000 12 16 25
2001 14 18 40
end
rename (australia-denmark) gdppc=
reshape long gdppc , i(year) string j(country)
sort country year
list, sepby(country)
+--------------------------+
| year country gdppc |
|--------------------------|
1. | 1999 australia 10 |
2. | 2000 australia 12 |
3. | 2001 australia 14 |
|--------------------------|
4. | 1999 canada 15 |
5. | 2000 canada 16 |
6. | 2001 canada 18 |
|--------------------------|
7. | 1999 denmark 20 |
8. | 2000 denmark 25 |
9. | 2001 denmark 40 |
+--------------------------+
First, have a look at some variables of my dataset:
firm_id year dyrstr Lack total_workers
2432 2002 1980 29
2432 2003 1980 23
2432 2005 1980 1 283
2432 2006 1980 56
2432 2007 1980 21
2433 2004 2001 42
2433 2006 2001 1 29
2433 2008 2001 1 100
2434 2002 2002 21
2434 2003 2002 55
2434 2004 2002 22
2434 2005 2002 24
2434 2006 2002 17
2434 2007 2002 40
2434 2008 2002 110
2434 2009 2002 158
2434 2010 2002 38
2435 2002 2002 80
2435 2003 2002 86
2435 2004 2002 877
2435 2005 2002 254
2435 2006 2002 71
2435 2007 2002 116
2435 2008 2002 118
2435 2009 2002 1165
2435 2010 2002 67
2436 2002 1992 24
2436 2003 1992 25
2436 2004 1992 22
2436 2005 1992 23
2436 2006 1992 21
2436 2007 1992 100
2436 2008 1992 73
2436 2009 1992 23
2436 2010 1992 40
2437 2002 2002 30
2437 2003 2002 31
2437 2004 2002 21
2437 2006 2002 1 56
2437 2007 2002 20
The variables:
firm_id is an identifier for firms
year is the year of the observation
dyrstr is the founding year of a firm
Lack equals 1 if there is a missing observation in the year before (e.g. in line three of the dataset, Lack equals 1 because for the firm with ID 2432, there is no observation in the year 2004)
total_workers is the number of workers
I'd like to fill in the gaps, i.e. create new observations as shown below (considering only the firm with ID 2432):
firm_id year dyrstr Lack total_workers
2432 2002 1980 29
2432 2003 1980 23
*2432* *2004* *1980* *153*
2432 2005 1980 1 283
2432 2006 1980 56
2432 2007 1980 21
The line where the values are shown in asterisks is the newly created observation. It should be a copy of the previous observation, but with some modifications:
firm_id should stay the same as in the line before
year should be the year from the previous line plus one
dyrstr should stay the same as in the line before
Lack: here it doesn't matter which value this variable has
total_workers equals 0.5*(value of the previous observation + value of consecutive observation)
all other variables of my dataset (which I didn't list here) should stay the same as in the line before
I read something about the command expand, but help expand doesn't help me much. Hopefully one of you can help me!
My suggestions hinge on using expand, which in turn just requires information on the number of observations to be added. I ignore your variable Lack, as Stata itself can work out where the gaps are. My procedure for imputing total_workers is based on using the inbuilt command ipolate and thus would work over gaps longer than 1 year, which don't appear in your example. The number of workers so estimated is not necessarily an integer.
For other interpolation procedures, check out cipolate, csipolate, pchipolate, all accessible via ssc desc cipolate (or equivalent).
This kind of operation depends on getting sort order exactly right, which I don't think is trivial, even with experience, so in getting the code right for similar problems, be prepared for false starts; pepper your trial code with list statements; and work on a good toy example dataset (as you kindly provided here).
. clear
. input firm_id year dyrstr total_workers
firm_id year dyrstr total_w~s
1. 2432 2002 1980 29
2. 2432 2003 1980 23
3. 2432 2005 1980 283
4. 2432 2006 1980 56
5. 2432 2007 1980 21
6. 2433 2004 2001 42
7. 2433 2006 2001 29
8. 2433 2008 2001 100
9. 2434 2002 2002 21
10. 2434 2003 2002 55
11. 2434 2004 2002 22
12. 2434 2005 2002 24
13. 2434 2006 2002 17
14. 2434 2007 2002 40
15. 2434 2008 2002 110
16. 2434 2009 2002 158
17. 2434 2010 2002 38
18. 2435 2002 2002 80
19. 2435 2003 2002 86
20. 2435 2004 2002 877
21. 2435 2005 2002 254
22. 2435 2006 2002 71
23. 2435 2007 2002 116
24. 2435 2008 2002 118
25. 2435 2009 2002 1165
26. 2435 2010 2002 67
27. 2436 2002 1992 24
28. 2436 2003 1992 25
29. 2436 2004 1992 22
30. 2436 2005 1992 23
31. 2436 2006 1992 21
32. 2436 2007 1992 100
33. 2436 2008 1992 73
34. 2436 2009 1992 23
35. 2436 2010 1992 40
36. 2437 2002 2002 30
37. 2437 2003 2002 31
38. 2437 2004 2002 21
39. 2437 2006 2002 56
40. 2437 2007 2002 20
41. end
. scalar N = _N
. bysort firm_id (year) : gen gap = year - year[_n-1]
(6 missing values generated)
. expand gap
(6 missing counts ignored; observations not deleted)
(4 observations created)
. gen orig = _n <= scalar(N)
. bysort firm_id (year) : replace total_workers = . if !orig
(4 real changes made, 4 to missing)
. bysort firm_id (year orig) : replace year = year[_n-1] + 1 if _n > 1 & year != year[_n-1] + 1
(4 real changes made)
. bysort firm_id (year): ipolate total_workers year , gen(total_workers2)
. list, sepby(firm_id)
+------------------------------------------------------------+
| firm_id year dyrstr total_~s gap orig total_~2 |
|------------------------------------------------------------|
1. | 2432 2002 1980 29 . 1 29 |
2. | 2432 2003 1980 23 1 1 23 |
3. | 2432 2004 1980 . 2 0 153 |
4. | 2432 2005 1980 283 2 1 283 |
5. | 2432 2006 1980 56 1 1 56 |
6. | 2432 2007 1980 21 1 1 21 |
|------------------------------------------------------------|
7. | 2433 2004 2001 42 . 1 42 |
8. | 2433 2005 2001 . 2 0 35.5 |
9. | 2433 2006 2001 29 2 1 29 |
10. | 2433 2007 2001 . 2 0 64.5 |
11. | 2433 2008 2001 100 2 1 100 |
|------------------------------------------------------------|
12. | 2434 2002 2002 21 . 1 21 |
13. | 2434 2003 2002 55 1 1 55 |
14. | 2434 2004 2002 22 1 1 22 |
15. | 2434 2005 2002 24 1 1 24 |
16. | 2434 2006 2002 17 1 1 17 |
17. | 2434 2007 2002 40 1 1 40 |
18. | 2434 2008 2002 110 1 1 110 |
19. | 2434 2009 2002 158 1 1 158 |
20. | 2434 2010 2002 38 1 1 38 |
|------------------------------------------------------------|
21. | 2435 2002 2002 80 . 1 80 |
22. | 2435 2003 2002 86 1 1 86 |
23. | 2435 2004 2002 877 1 1 877 |
24. | 2435 2005 2002 254 1 1 254 |
25. | 2435 2006 2002 71 1 1 71 |
26. | 2435 2007 2002 116 1 1 116 |
27. | 2435 2008 2002 118 1 1 118 |
28. | 2435 2009 2002 1165 1 1 1165 |
29. | 2435 2010 2002 67 1 1 67 |
|------------------------------------------------------------|
30. | 2436 2002 1992 24 . 1 24 |
31. | 2436 2003 1992 25 1 1 25 |
32. | 2436 2004 1992 22 1 1 22 |
33. | 2436 2005 1992 23 1 1 23 |
34. | 2436 2006 1992 21 1 1 21 |
35. | 2436 2007 1992 100 1 1 100 |
36. | 2436 2008 1992 73 1 1 73 |
37. | 2436 2009 1992 23 1 1 23 |
38. | 2436 2010 1992 40 1 1 40 |
|------------------------------------------------------------|
39. | 2437 2002 2002 30 . 1 30 |
40. | 2437 2003 2002 31 1 1 31 |
41. | 2437 2004 2002 21 1 1 21 |
42. | 2437 2005 2002 . 2 0 38.5 |
43. | 2437 2006 2002 56 2 1 56 |
44. | 2437 2007 2002 20 1 1 20 |
+------------------------------------------------------------+
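Since the interpolated number of workers is not necessarily an integer, a final rounding step is one option if whole numbers are wanted. A sketch (total_workers3 is just an illustrative name; whether rounding is appropriate depends on your application):
* optional: round the interpolated worker counts to whole numbers
generate total_workers3 = round(total_workers2)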
The following works if, as in your example dataset, you don't have consecutive missing years for any given firm. I also assume that the variable Lack is numeric and that the final result should remain an unbalanced panel (you were not specific about this point in your question).
* Expand database
expand 2 if Lack == 1, gen(x)
gsort firm_id year -x
* Substitution rules
replace year = year - 1 if x == 1
replace total_workers = (total_workers[_n-1] + total_workers[_n+1])/2 if x == 1
list, sepby(firm_id)
The expand line could be rewritten as expand Lack + 1, gen(x), but the explicit if condition is perhaps clearer.
For the more general case in which you do have consecutive years missing, the following should get you started under the assumption that Lack specifies the number of consecutive years missing. For example, if there is a jump from 2006 to 2009 for a given firm, then Lack = 2 for the 2009 observation.
* Expand database
expand Lack + 1, gen(x)
gsort firm_id year -x
* Substitution rules
replace year = year[_n-1] + 1 if x == 1
Now you just need to come up with an imputation rule for your total_workers:
replace total_workers = ...
If Lack is a string, convert it to numeric using the real() function.
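As for the imputation rule left open above, one possibility in the spirit of the ipolate approach in the earlier answer is linear interpolation within firm. A sketch, assuming the new observations are still flagged by x == 1 and that total_workers_i is an acceptable name for the result:
* blank out the copied values on the new observations, then interpolate within each firm
replace total_workers = . if x == 1
bysort firm_id (year): ipolate total_workers year, gen(total_workers_i)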
You've already accepted an answer, but I have had to do something similar before and always use the cross command, as follows. Say your dataset is already in memory, and continue with the following code:
tempfile master year
save `master'
preserve
keep year
duplicates drop
save `year'
restore
//next two lines set me up to correct for different year ranges by firm; if year ranges were standard, this would be omitted
bys firm_id: egen minyear=min(year)
bys firm_id: egen maxyear=max(year)
keep firm_id minyear maxyear
duplicates drop
cross using `year'
merge m:1 firm_id year using `master', assert(1 3) nogen
drop if year<minyear | year>maxyear //this adjusts for years outside the earliest and latest years observed by firm; if year ranges standard, again omitted
Then, from here, use the ipolate command in the spirit of Nick Cox's answer above.
I'm particularly interested in any pros and cons of using expand versus cross (beyond the fact that my approach here hinges on at least one record being observed for each year in order to construct the crossed dataset, a limitation that could be removed by creating the `year' tempfile differently).