First, have a look at some variables of my dataset:
firm_id  year  dyrstr  Lack  total_workers
2432     2002  1980    .     29
2432     2003  1980    .     23
2432     2005  1980    1     283
2432     2006  1980    .     56
2432     2007  1980    .     21
2433     2004  2001    .     42
2433     2006  2001    1     29
2433     2008  2001    1     100
2434     2002  2002    .     21
2434     2003  2002    .     55
2434     2004  2002    .     22
2434     2005  2002    .     24
2434     2006  2002    .     17
2434     2007  2002    .     40
2434     2008  2002    .     110
2434     2009  2002    .     158
2434     2010  2002    .     38
2435     2002  2002    .     80
2435     2003  2002    .     86
2435     2004  2002    .     877
2435     2005  2002    .     254
2435     2006  2002    .     71
2435     2007  2002    .     116
2435     2008  2002    .     118
2435     2009  2002    .     1165
2435     2010  2002    .     67
2436     2002  1992    .     24
2436     2003  1992    .     25
2436     2004  1992    .     22
2436     2005  1992    .     23
2436     2006  1992    .     21
2436     2007  1992    .     100
2436     2008  1992    .     73
2436     2009  1992    .     23
2436     2010  1992    .     40
2437     2002  2002    .     30
2437     2003  2002    .     31
2437     2004  2002    .     21
2437     2006  2002    1     56
2437     2007  2002    .     20
The variables:
firm_id is an identifier for firms
year is the year of the observation
dyrstr is the founding year of a firm
Lack equals 1 if there is a missing observation in the year before (e.g. in line three of the dataset, Lack equals 1 because for the firm with ID 2432, there is no observation in the year 2004)
total_workers is the number of workers
I'd like to fill in the gaps, namely I'd like to create new observations as I show you in the following (only considering the firm with ID 2432):
firm_id  year    dyrstr  Lack  total_workers
2432     2002    1980    .     29
2432     2003    1980    .     23
*2432*   *2004*  *1980*  .     *153*
2432     2005    1980    1     283
2432     2006    1980    .     56
2432     2007    1980    .     21
The line where I've put the values of the variables in asterisks is the newly created observation. This observation should be a copy of the previous observation but with some modification.
firm_id should stay the same as in the line before
year should be the year from the previous line plus one
dyrstr should stay the same as in the line before
Lack: here it doesn't matter which value this variable has
total_workers equals 0.5*(value of the previous observation + value of consecutive observation)
all other variables of my dataset (which I didn't list here) should stay the same as in the line before
I read something about the command expand, but help expand doesn't help me much. Hopefully one of you can help me!
My suggestions hinge on using expand, which in turn just requires information on the number of observations to be added. I ignore your variable Lack, as Stata itself can work out where the gaps are. My procedure for imputing total_workers is based on using the inbuilt command ipolate and thus would work over gaps longer than 1 year, which don't appear in your example. The number of workers so estimated is not necessarily an integer.
For other interpolation procedures, check out cipolate, csipolate, pchipolate, all accessible via ssc desc cipolate (or equivalent).
This kind of operation depends on getting sort order exactly right, which I don't think is trivial, even with experience, so in getting the code right for similar problems, be prepared for false starts; pepper your trial code with list statements; and work on a good toy example dataset (as you kindly provided here).
. clear
. input firm_id year dyrstr total_workers
firm_id year dyrstr total_w~s
1. 2432 2002 1980 29
2. 2432 2003 1980 23
3. 2432 2005 1980 283
4. 2432 2006 1980 56
5. 2432 2007 1980 21
6. 2433 2004 2001 42
7. 2433 2006 2001 29
8. 2433 2008 2001 100
9. 2434 2002 2002 21
10. 2434 2003 2002 55
11. 2434 2004 2002 22
12. 2434 2005 2002 24
13. 2434 2006 2002 17
14. 2434 2007 2002 40
15. 2434 2008 2002 110
16. 2434 2009 2002 158
17. 2434 2010 2002 38
18. 2435 2002 2002 80
19. 2435 2003 2002 86
20. 2435 2004 2002 877
21. 2435 2005 2002 254
22. 2435 2006 2002 71
23. 2435 2007 2002 116
24. 2435 2008 2002 118
25. 2435 2009 2002 1165
26. 2435 2010 2002 67
27. 2436 2002 1992 24
28. 2436 2003 1992 25
29. 2436 2004 1992 22
30. 2436 2005 1992 23
31. 2436 2006 1992 21
32. 2436 2007 1992 100
33. 2436 2008 1992 73
34. 2436 2009 1992 23
35. 2436 2010 1992 40
36. 2437 2002 2002 30
37. 2437 2003 2002 31
38. 2437 2004 2002 21
39. 2437 2006 2002 56
40. 2437 2007 2002 20
41. end
. scalar N = _N
. bysort firm_id (year) : gen gap = year - year[_n-1]
(6 missing values generated)
. expand gap
(6 missing counts ignored; observations not deleted)
(4 observations created)
. gen orig = _n <= scalar(N)
. bysort firm_id (year) : replace total_workers = . if !orig
(4 real changes made, 4 to missing)
. bysort firm_id (year orig) : replace year = year[_n-1] + 1 if _n > 1 & year != year[_n-1] + 1
(4 real changes made)
. bysort firm_id (year): ipolate total_workers year , gen(total_workers2)
. list, sepby(firm_id)
+------------------------------------------------------------+
| firm_id year dyrstr total_~s gap orig total_~2 |
|------------------------------------------------------------|
1. | 2432 2002 1980 29 . 1 29 |
2. | 2432 2003 1980 23 1 1 23 |
3. | 2432 2004 1980 . 2 0 153 |
4. | 2432 2005 1980 283 2 1 283 |
5. | 2432 2006 1980 56 1 1 56 |
6. | 2432 2007 1980 21 1 1 21 |
|------------------------------------------------------------|
7. | 2433 2004 2001 42 . 1 42 |
8. | 2433 2005 2001 . 2 0 35.5 |
9. | 2433 2006 2001 29 2 1 29 |
10. | 2433 2007 2001 . 2 0 64.5 |
11. | 2433 2008 2001 100 2 1 100 |
|------------------------------------------------------------|
12. | 2434 2002 2002 21 . 1 21 |
13. | 2434 2003 2002 55 1 1 55 |
14. | 2434 2004 2002 22 1 1 22 |
15. | 2434 2005 2002 24 1 1 24 |
16. | 2434 2006 2002 17 1 1 17 |
17. | 2434 2007 2002 40 1 1 40 |
18. | 2434 2008 2002 110 1 1 110 |
19. | 2434 2009 2002 158 1 1 158 |
20. | 2434 2010 2002 38 1 1 38 |
|------------------------------------------------------------|
21. | 2435 2002 2002 80 . 1 80 |
22. | 2435 2003 2002 86 1 1 86 |
23. | 2435 2004 2002 877 1 1 877 |
24. | 2435 2005 2002 254 1 1 254 |
25. | 2435 2006 2002 71 1 1 71 |
26. | 2435 2007 2002 116 1 1 116 |
27. | 2435 2008 2002 118 1 1 118 |
28. | 2435 2009 2002 1165 1 1 1165 |
29. | 2435 2010 2002 67 1 1 67 |
|------------------------------------------------------------|
30. | 2436 2002 1992 24 . 1 24 |
31. | 2436 2003 1992 25 1 1 25 |
32. | 2436 2004 1992 22 1 1 22 |
33. | 2436 2005 1992 23 1 1 23 |
34. | 2436 2006 1992 21 1 1 21 |
35. | 2436 2007 1992 100 1 1 100 |
36. | 2436 2008 1992 73 1 1 73 |
37. | 2436 2009 1992 23 1 1 23 |
38. | 2436 2010 1992 40 1 1 40 |
|------------------------------------------------------------|
39. | 2437 2002 2002 30 . 1 30 |
40. | 2437 2003 2002 31 1 1 31 |
41. | 2437 2004 2002 21 1 1 21 |
42. | 2437 2005 2002 . 2 0 38.5 |
43. | 2437 2006 2002 56 2 1 56 |
44. | 2437 2007 2002 20 1 1 20 |
+------------------------------------------------------------+
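The remark above that this approach carries over to gaps longer than one year can be checked on a small made-up panel. The sketch below reuses the commands from the session on invented data (one firm observed only in 2006 and 2009); the data values are assumptions chosen purely for illustration.

```stata
clear
input firm_id year total_workers
1 2006 10
1 2009 40
end
scalar N = _N
* gap is 3 for the 2009 observation, so expand adds 2 copies of it
bysort firm_id (year) : gen gap = year - year[_n-1]
expand gap
* expand appends the copies at the end, so the first N rows are the originals
gen orig = _n <= scalar(N)
replace total_workers = . if !orig
* the copies sort before the original 2009 row and become 2007 and 2008 in turn
bysort firm_id (year orig) : replace year = year[_n-1] + 1 if _n > 1 & year != year[_n-1] + 1
bysort firm_id (year) : ipolate total_workers year, gen(total_workers2)
list
```

Linear interpolation between 10 in 2006 and 40 in 2009 should give 20 for 2007 and 30 for 2008.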
The following works if, as in your example dataset, you don't have consecutive years missing for any given firm. I also assume that the variable Lack is numeric and that an unbalanced panel is acceptable as the final result (you were not specific about this point in your question).
* Expand database
expand 2 if Lack == 1, gen(x)
gsort firm_id year -x
* Substitution rules
replace year = year - 1 if x == 1
replace total_workers = (total_workers[_n-1] + total_workers[_n+1])/2 if x == 1
list, sepby(firm_id)
The expand line could be rewritten as expand Lack + 1, gen(x), but the version above is perhaps clearer.
For the more general case in which you do have consecutive years missing, the following should get you started under the assumption that Lack specifies the number of consecutive years missing. For example, if there is a jump from 2006 to 2009 for a given firm, then Lack = 2 for the 2009 observation.
* Expand database
expand Lack + 1, gen(x)
gsort firm_id year -x
* Substitution rules
replace year = year[_n-1] + 1 if x == 1
Now you just need to come up with an imputation rule for your total_workers:
replace total_workers = ...
If Lack is a string, convert to numeric using real.
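One possible way to finish the general case, offered as a sketch rather than as the rule this answer had in mind: set total_workers to missing in the added rows and let ipolate (used in the other answer) interpolate linearly. The toy data and the variable name total_workers2 are assumptions.

```stata
clear
input firm_id year Lack total_workers
1 2005 . 10
1 2006 . 20
1 2009 2 50
end
* Lack = 2 on the 2009 row yields two extra copies, flagged by x == 1
expand Lack + 1, gen(x)
gsort firm_id year -x
replace year = year[_n-1] + 1 if x == 1
* the copies inherited 50 from the 2009 row; blank it before interpolating
replace total_workers = . if x == 1
bysort firm_id (year) : ipolate total_workers year, gen(total_workers2)
list
```

With these numbers the added rows are 2007 and 2008, and linear interpolation between 20 and 50 should give 30 and 40. Any other rule (for example, carrying the last value forward) could replace the ipolate line.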
You've already awarded the answer, but I have had to do similar before and always use the cross command as follows. Say I am using your dataset already & continue with the following code:
tempfile master year
save `master'
preserve
keep year
duplicates drop
save `year'
restore
//next two lines set me up to correct for different year ranges by firm; if year ranges were standard, this would be omitted
bys firm_id: egen minyear=min(year)
bys firm_id: egen maxyear=max(year)
keep firm_id minyear maxyear
duplicates drop
cross using `year'
merge m:1 firm_id year using `master', assert(1 3) nogen
drop if year<minyear | year>maxyear //this adjusts for years outside the earliest and latest years observed by firm; if year ranges standard, again omitted
Then from here, use the ipolate command in the spirit of @NickCox.
I'm particularly interested in any pros/cons regarding the use of expand and cross. (Beyond the fact that my use here specifically hinges on >0 records for each year being observed in order to construct the crossed dataset, which could be eliminated if I create the `year' tempfile differently.)
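As a sketch of the "create the `year' tempfile differently" idea mentioned above, the year list can be built from the overall minimum and maximum instead of from the years actually observed. The toy data are an assumption: 2003 and 2004 are deliberately not observed for any firm.

```stata
clear
input firm_id year total_workers
1 2002 10
1 2005 40
2 2002 7
2 2006 9
end
tempfile master year
save `master'
* build one row per calendar year between the overall first and last year
summarize year, meanonly
local first = r(min)
local nyears = r(max) - r(min) + 1
preserve
clear
set obs `nyears'
gen year = `first' + _n - 1
save `year'
restore
bysort firm_id : egen minyear = min(year)
bysort firm_id : egen maxyear = max(year)
keep firm_id minyear maxyear
duplicates drop
cross using `year'
merge 1:1 firm_id year using `master', assert(1 3) nogen
* trim years outside each firm's own observed range
drop if year < minyear | year > maxyear
sort firm_id year
list, sepby(firm_id)
```

The rest of the code is unchanged from the answer, apart from merge 1:1, which is used here because each firm-year pair is unique on both sides in this toy example.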
Related
My dataset looks like the following:
identification number  year  indicator  Data
1112000                2000  JKL_ADS    511
1112001                2001  JKL_ADS    517
1112002                2002  JKL_ADS    721
1112003                2003  JKL_ADS    925
1112004                2004  JKL_ADS    1092
1112000                2000  KLS_DSAK   351
1112001                2001  KLS_DSAK   631
1112002                2002  KLS_DSAK   732
1112003                2003  KLS_DSAK   823
1112004                2004  KLS_DSAK   1092
I want to reshape wide so it looks like this instead:
identification number  year  JKL_ADS  KLS_DSAK
1112000                2000  511      351
1112001                2001  517      631
1112002                2002  721      732
1112003                2003  925      823
1112004                2004  1092     1092
This is a fairly standard application. You didn't give example data in recommended form, so the details here may need modification by you.
Contrary to the question, indicator serves as an argument to j().
* Example generated by -dataex-. For more info, type help dataex
clear
input long identificationnumber int year str8 indicator int data
1112000 2000 "JKL_ADS" 511
1112001 2001 "JKL_ADS" 517
1112002 2002 "JKL_ADS" 721
1112003 2003 "JKL_ADS" 925
1112004 2004 "JKL_ADS" 1092
1112000 2000 "KLS_DSAK" 351
1112001 2001 "KLS_DSAK" 631
1112002 2002 "KLS_DSAK" 732
1112003 2003 "KLS_DSAK" 823
1112004 2004 "KLS_DSAK" 1092
end
. reshape wide data , i(id year) j(indicator) string
(j = JKL_ADS KLS_DSAK)
Data Long -> Wide
-----------------------------------------------------------------------------
Number of observations 10 -> 5
Number of variables 4 -> 4
j variable (2 values) indicator -> (dropped)
xij variables:
data -> dataJKL_ADS dataKLS_DSAK
-----------------------------------------------------------------------------
. rename (data*) (*)
. l
+--------------------------------------+
| identi~r year JKL_ADS KLS_DSAK |
|--------------------------------------|
1. | 1112000 2000 511 351 |
2. | 1112001 2001 517 631 |
3. | 1112002 2002 721 732 |
4. | 1112003 2003 925 823 |
5. | 1112004 2004 1092 1092 |
+--------------------------------------+
I'm using Stata 13 and have to clean a data set in a panel format with different ids for a given period from 2000 to 2003. My data looks like:
id year ln_wage
1 2000 2.30
1 2001 2.31
1 2002 2.31
2 2001 1.89
2 2002 1.89
2 2003 2.10
3 2002 1.60
4 2002 2.46
4 2003 2.47
5 2000 2.10
5 2001 2.10
5 2003 2.12
I would like to keep, for each year, only the individuals that also appear in year t-1. This way, the first year of my sample (2000) will be dropped. I'm looking for output like:
2001:
id year ln_wage
1 2001 2.31
5 2001 2.10
2002:
id year ln_wage
1 2002 2.31
2 2002 1.89
2003:
id year ln_wage
2 2003 2.10
4 2003 2.47
Regards,
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id int year float ln_wage
1 2000 2.3
1 2001 2.31
1 2002 2.31
2 2001 1.89
2 2002 1.89
2 2003 2.1
3 2002 1.6
4 2002 2.46
4 2003 2.47
5 2000 2.1
5 2001 2.1
5 2003 2.12
end
xtset id year
drop if missing(L.ln_wage)
sort year id
list, noobs sepby(year)
+---------------------+
| id year ln_wage |
|---------------------|
| 1 2001 2.31 |
| 5 2001 2.1 |
|---------------------|
| 1 2002 2.31 |
| 2 2002 1.89 |
|---------------------|
| 2 2003 2.1 |
| 4 2003 2.47 |
+---------------------+
// Alternatively, assuming no duplicate years within id exist
bysort id (year): gen todrop = year[_n-1] != year - 1
drop if todrop
I'm working on a panel dataset which has missing values for four variables (at the start, at the end and in the middle of panels). I would like to remove every panel that has any missing values.
This is the code I have tried to use so far:
bysort BvD_ID YEAR: drop if sum(!missing(REV_LAY,EMP_LAY,FX_ASSET_LAY,MATCOST_LAY))==0
This piece of code successfully removes all observations with missing values in any of the four variables, but it keeps the remaining, non-missing observations of the same panel.
Example data:
Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
001 2001 80 25 120
001 2002 75 . 122
001 2003 82 32 128
002 2001 40 15 45
002 2002 42 18 48
002 2003 45 20 50
In the above sample data, I want to drop panel Firm_ID = 001 completely.
You can do something like:
clear
input Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
001 2001 80 25 120
001 2002 75 . 122
001 2003 82 32 128
002 2001 40 15 45
002 2002 42 18 48
002 2003 45 20 50
end
generate index = _n
bysort Firm_ID (index): generate todrop = sum(missing(REV_LAY, EMP_LAY, FX_ASSET_LAY))
by Firm_ID: drop if todrop[_N]
list Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
+-----------------------------------------------+
| Firm_ID Year REV_LAY EMP_LAY FX_ASS~Y |
|-----------------------------------------------|
1. | 2 2001 40 15 45 |
2. | 2 2002 42 18 48 |
3. | 2 2003 45 20 50 |
+-----------------------------------------------+
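An alternative sketch, assuming the same example data: flag each firm with egen rather than a running sum, then drop the flagged firms. The variable name anymiss is made up for illustration.

```stata
clear
input Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
1 2001 80 25 120
1 2002 75 . 122
1 2003 82 32 128
2 2001 40 15 45
2 2002 42 18 48
2 2003 45 20 50
end
* anymiss is 1 in every row of a firm that has at least one incomplete row
bysort Firm_ID : egen anymiss = max(missing(REV_LAY, EMP_LAY, FX_ASSET_LAY))
drop if anymiss
drop anymiss
list
```

Here missing() returns 1 for a row with any missing argument, so the within-firm maximum is 1 exactly when the firm has at least one such row.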
I am trying to reshape some data. The issue is that usually data is either long or wide but this seems to be set up in a way that I cannot figure out how to reshape. The data looks as follows:
year australia canada denmark ...
1999 10 15 20
2000 12 16 25
2001 14 18 40
And I would like to get it into a panel format like the following
year country gdppc
1999 australia 10
2000 australia 12
2001 australia 14
1999 canada 15
2000 canada 16
The problem is just in the variable names. See e.g. this FAQ for the advice that you may need to rename first before you can reshape.
For more complicated variants of this problem with similar data, see e.g. this paper.
clear
input year australia canada denmark
1999 10 15 20
2000 12 16 25
2001 14 18 40
end
rename (australia-denmark) gdppc=
reshape long gdppc , i(year) string j(country)
sort country year
list, sepby(country)
+--------------------------+
| year country gdppc |
|--------------------------|
1. | 1999 australia 10 |
2. | 2000 australia 12 |
3. | 2001 australia 14 |
|--------------------------|
4. | 1999 canada 15 |
5. | 2000 canada 16 |
6. | 2001 canada 18 |
|--------------------------|
7. | 1999 denmark 20 |
8. | 2000 denmark 25 |
9. | 2001 denmark 40 |
+--------------------------+
I have the following records:
62
STARTHERE 1.1 vol. 84 no. 1 1996 01.1 A 0 1 1996 04 24 0
STARTHERE 1.2 vol. 84 no. 2 1996 01.2 A 0 1 1996 05 23 0
STARTHERE 1.3 vol. 84 no. 3 1996 01.3 A 1 1 1996 08 13 0
STARTHERE 1.4 vol. 84 no. 4 1996 01.4 A 0 1 1996 10 15 0
STARTHERE 1.5 vol. 84 no. 5 1996 01.5 A 0 1 1997 01 22 0
STARTHERE 1.6 vol. 84 no. 6 1996 01.6 A 0 1 1997 02 10 0
63
STARTHERE 1.1 95:1 Feb 2002 1.1 A 0 1 2002 06 03 0
STARTHERE 1.2 95:2 Apr 2002 1.2 A 0 1 2002 06 17 0
STARTHERE 1.3 95:3 Jun 2002 1.3 A 0 1 2002 07 18 0
STARTHERE 1.4 95:4 Aug 2002 1.4 A 0 1 2003 02 24 0
STARTHERE 1.5 95:5 Oct 2002 1.5 A 0 1 2003 02 24 0
64
65
STARTHERE 1.1 34:1 Mar 1996 1.1 A 0 1 1996 07 16 0
STARTHERE 1.2 34:2 Jun 1996 1.2 A 0 1 1996 09 19 0
STARTHERE 1.3 34:3 Sep 1996 1.3 A 0 1 1996 12 17 0
I don't know if this is possible in Excel, VBA in Excel, or even through regex. I want to fill each numerical value (e.g. 62) down, replacing the "STARTHERE" values in the rows below it, up until the next numerical value (63). Right now this is done manually, but I wonder whether there is a way of doing it mechanically through an Excel formula, VBA, or regex, as these are what I'm familiar with. The result should look like the following; it's fine if the rows holding only a number (e.g. the bare 62) are stripped, but I don't mind if they stay:
62
62 1.1 vol. 84 no. 1 1996 01.1 A 0 1 1996 04 24 0
62 1.2 vol. 84 no. 2 1996 01.2 A 0 1 1996 05 23 0
62 1.3 vol. 84 no. 3 1996 01.3 A 1 1 1996 08 13 0
62 1.4 vol. 84 no. 4 1996 01.4 A 0 1 1996 10 15 0
62 1.5 vol. 84 no. 5 1996 01.5 A 0 1 1997 01 22 0
62 1.6 vol. 84 no. 6 1996 01.6 A 0 1 1997 02 10 0
63
63 1.1 95:1 Feb 2002 1.1 A 0 1 2002 06 03 0
63 1.2 95:2 Apr 2002 1.2 A 0 1 2002 06 17 0
63 1.3 95:3 Jun 2002 1.3 A 0 1 2002 07 18 0
63 1.4 95:4 Aug 2002 1.4 A 0 1 2003 02 24 0
63 1.5 95:5 Oct 2002 1.5 A 0 1 2003 02 24 0
64
65
65 1.1 34:1 Mar 1996 1.1 A 0 1 1996 07 16 0
65 1.2 34:2 Jun 1996 1.2 A 0 1 1996 09 19 0
65 1.3 34:3 Sep 1996 1.3 A 0 1 1996 12 17 0
Many thanks!
I assume this data is from an Excel spreadsheet, with both the numerical values and the value "STARTHERE" in the first column (column A) and the other data in columns B, C, etc.
Basically, I loop through the first column from the top row to the bottom row. If the value in the current cell is not a number, it is set to the value of the cell right above it; if it is a number, we skip to the next cell.
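Sub help()
    Dim i As Long
    ' Display the IDs in column A as whole numbers
    ActiveSheet.Columns(1).NumberFormat = "0"
    ' Start at row 2: row 1 has no cell above it to copy from
    For i = 2 To ActiveSheet.UsedRange.Rows.Count
        ' A non-numeric cell (e.g. "STARTHERE") takes the value of the cell above
        If Not IsNumeric(Cells(i, 1).Value) Then Cells(i, 1).Value = Cells(i - 1, 1).Value
    Next i
End Sub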