I am trying to only keep the last available observation for each variable, however, the issue is that different variables per country were measured in different years. My data currently looks like this:
iso3c year Gini variable1 variable2 variable3
AND 2000 . 1.279314 33 22
AND 2001 22 2.571869 . .
AND 2002 . 3.492054 . .
AND 2003 44 3.89996 .
This is my code:
gsort + iso3c - year
drop if Gini==. & variable1==. & variable2==. & variable3==.
bysort iso3c: keep if _n==1
drop year
I tried this with one variable as in below, and then ran the other lines and it worked well.
drop if Gini==.
However, because I have different variables measured in different years per country, Stata ends up only keeping the following:
iso3c Gini variable1 variable2 variable3
AND 44 3.89996 . .
However, I want something like this, where the last available observation for variables 2 and 3 are also kept from the year 2000 even though the variables were not measured in 2004.
iso3c Gini variable1 variable2 variable3
AND 44 3.89996 33 22
Note that collapse (lastnm) Gini variable*, by(iso3c) is a one-line solution to this.
Let's show as well how to get something similar from first principles.
The last non-missing value in each panel is accessible once you sort the non-missings to the end of the panel (temporarily). If no non-missing value is available, necessarily a missing value will be returned instead.
clear
input str3 iso3c year Gini variable1 variable2 variable3
AND 2000 . 1.279314 33 22
AND 2001 22 2.571869 . .
AND 2002 . 3.492054 . .
AND 2003 44 3.89996 . .
end
gen OK = .
foreach v in Gini variable1 variable2 variable3 {
replace OK = !missing(`v')
bysort iso3c (OK year) : gen `v'_lnm = `v'[_N]
}
sort iso3c year
list iso3c year *lnm
+----------------------------------------------------------+
| iso3c year Gini_lnm va~1_lnm va~2_lnm va~3_lnm |
|----------------------------------------------------------|
1. | AND 2000 44 3.89996 33 22 |
2. | AND 2001 44 3.89996 33 22 |
3. | AND 2002 44 3.89996 33 22 |
4. | AND 2003 44 3.89996 33 22 |
+----------------------------------------------------------+
Related
I'm trying to scale one variable by another lagged variable.
(IB) scaled by the lagged total assets(AT) = ROA
I've tried this two methods below from here.
xtset companyid fyear, year
gen at1 = l.at
gen roa=ib/at1
and
xtset gvkey year
gen roa=(ib)/(at[_n-1])
The first one came back with all zeros for 1.ta
The second one seems to generate values on the previous entry, even if it's a different company. I think this is true because only the first row has a missing value. I would assume there should be a missing value for the first year of each company.
Additionally I've tried this code below but it said invalid syntax.
xtset gvkey year
foreach gvkey {
gen roa = (ib)/(at[_n-1]) }
I'm using compustat so it's similar to below:
gvkey|Year |Ticker | at | ib |
-------|-----|--------|------|------|
001111| 2006| abc |1000 |50 |
001111| 2007| abc |1100 |60 |
001111| 2008| abc |1200 |70 |
001111| 2009| abc |1300 |80 |
001112| 2008| www |28777 |1300 |
001112| 2009| www |26123 |870 |
001113| 2009| ttt |550 |-1000 |
001114| 2010| vvv |551 |-990 |
This is hard to follow. 1.ta may, or may not, be a typo for L.at.
Is gvkey string? At the Stata tag, there is really detailed advice about how to give Stata data examples, which you are not following.
In principle, your first approach is correct, so it is hard to know what went wrong, except that
The second one seems to generate values on the previous entry, even if
it's a different company.
That's exactly correct. The previous observation is the previous observation, and nothing in that command refers or alludes to the panel structure or xtset or tsset information.
Your foreach statement is just wild guessing and nothing to do with any form supported by foreach. foreach isn't needed here at all: the lag operator implies working within panels automatically.
I did this, which may help.
clear
input str6 gvkey Year str3 Ticker at ib
001111 2006 abc 1000 50
001111 2007 abc 1100 60
001111 2008 abc 1200 70
001111 2009 abc 1300 80
001112 2008 www 28777 1300
001112 2009 www 26123 870
001113 2009 ttt 550 -1000
001114 2010 vvv 551 -990
end
egen id = group(gvkey), label
xtset id Year
gen wanted = at/L.ib
list, sepby(gvkey)
+------------------------------------------------------------+
| gvkey Year Ticker at ib id wanted |
|------------------------------------------------------------|
1. | 001111 2006 abc 1000 50 001111 . |
2. | 001111 2007 abc 1100 60 001111 22 |
3. | 001111 2008 abc 1200 70 001111 20 |
4. | 001111 2009 abc 1300 80 001111 18.57143 |
|------------------------------------------------------------|
5. | 001112 2008 www 28777 1300 001112 . |
6. | 001112 2009 www 26123 870 001112 20.09462 |
|------------------------------------------------------------|
7. | 001113 2009 ttt 550 -1000 001113 . |
|------------------------------------------------------------|
8. | 001114 2010 vvv 551 -990 001114 . |
+------------------------------------------------------------+
I'm working on a panel dataset, which has missing values for four variables (at the start, end and in-between of panels). I would like to remove the entire panel which has missing values.
This is the code I have tried to use so far:
bysort BvD_ID YEAR: drop if sum(!missing(REV_LAY,EMP_LAY,FX_ASSET_LAY,MATCOST_LAY))==0
This piece of code successfully removes all observations with missing values in any of the four variables but it retains observations with non-missing values.
Example data:
Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
001 2001 80 25 120
001 2002 75 . 122
001 2003 82 32 128
002 2001 40 15 45
002 2002 42 18 48
002 2003 45 20 50
In the above sample data, I want to drop panel Firm_ID = 001 completely.
You can do something like:
clear
input Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
001 2001 80 25 120
001 2002 75 . 122
001 2003 82 32 128
002 2001 40 15 45
002 2002 42 18 48
002 2003 45 20 50
end
generate index = _n
bysort Firm_ID (index): generate todrop = sum(missing(REV_LAY, EMP_LAY, FX_ASSET_LAY))
by Firm_ID: drop if todrop[_N]
list Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
+-----------------------------------------------+
| Firm_ID Year REV_LAY EMP_LAY FX_ASS~Y |
|-----------------------------------------------|
1. | 2 2001 40 15 45 |
2. | 2 2002 42 18 48 |
3. | 2 2003 45 20 50 |
+-----------------------------------------------+
I need to figure out how to tabulate all possible combinations of data in a dataset. I have a dataset where each person has 2 rows, one row for an activity score and one row for a total score on a test. There are variables for the score at each visit. A person may have anywhere between 1 to 5 visits. I am looking for all possible combinations of the scores for a given person for each score.
For example, here is code to generate the sample data structure.
data example;
input name $ type $ visit1-visit5;
datalines;
Bob activity 10 13 16 . .
Bob total 13 19 17 . .
John activity 11 20 25 20 21
John total 13 15 17 19 22
Steve activity 6 . . . .
Steve total 9 . . . . .
;
run;
I would like to have a dataset that would give me a structure as follows:
Bob activity 10 13
Bob activity 10 16
Bob activity 13 16
Bob total 13 19
Bob total 13 17
Bob total 19 17
John (rows for all possible combinations)
Steve - would have no rows, since he only has one visit (no combinations possible)
Any suggestions?
For N choose 2 and the output structure you want a couple of nested DO's will suffice.
data example;
input name $ type $ visit1-visit5;
datalines;
Bob activity 10 13 16 . .
Bob total 13 19 17 . .
John activity 11 20 25 20 21
John total 13 15 17 19 22
Steve activity 6 . . . .
Steve total 9 . . . . .
;;;;
run;
data by2;
set example;
array v[*] visit:;
n=n(of v[*]);
do i = 1 to n;
col1 = v[i];
do j = i + 1 to n;
col2 = v[j];
output;
end;
end;
drop i j visit:;
run;
proc print;
run;
I am trying to reshape some data. The issue is that usually data is either long or wide but this seems to be set up in a way that I cannot figure out how to reshape. The data looks as follows:
year australia canada denmark ...
1999 10 15 20
2000 12 16 25
2001 14 18 40
And I would like to get it into a panel format like the following
year country gdppc
1999 australia 10
2000 australia 12
2001 australia 14
1999 canada 16
2000 canada 18
The problem is just in the variable names. See e.g. this FAQ for the advice that you may need rename first before you can reshape.
For more complicated variants of this problem with similar data, see e.g. this paper.
clear
input year australia canada denmark
1999 10 15 20
2000 12 16 25
2001 14 18 40
end
rename (australia-denmark) gdppc=
reshape long gdppc , i(year) string j(country)
sort country year
list, sepby(country)
+--------------------------+
| year country gdppc |
|--------------------------|
1. | 1999 australia 10 |
2. | 2000 australia 12 |
3. | 2001 australia 14 |
|--------------------------|
4. | 1999 canada 15 |
5. | 2000 canada 16 |
6. | 2001 canada 18 |
|--------------------------|
7. | 1999 denmark 20 |
8. | 2000 denmark 25 |
9. | 2001 denmark 40 |
+--------------------------+
I have a dataset that has a date variable with missing dates.
var1
15sep2014
15sep2014
17sep2014
18sep2014
22sep2014
22sep2014
22sep2014
29sep2014
06oct2014
I aggregated the data using this command.
gen week = week(var1)
and the results look like this
var 1 week
15sep2014 37
15sep2014 37
17sep2014 38
18sep2014 38
22sep2014 38
I was wondering whether it would be possible to get the month name and year in the week variable.
In general, week() is part of the solution if and only if you define your weeks according to Stata's rules for weeks. They are
Week 1 of the year starts on January 1, regardless.
Week 2 of the year starts on January 8, regardless.
And so on, except that week 52 of the year includes 8 or 9 days, depending on
whether the year is leap or not.
Do you use these rules? I guess not. Then the simplest practice is to define a week by whichever day starts the week. If your weeks start on Sundays, then use the rule (dailydate - dow(dailydate)). If your weeks start on Mondays, ..., Saturdays, adjust the definition.
. clear
. input str9 svar1
svar1
1. "15sep2014"
2. "15sep2014"
3. "17sep2014"
4. "18sep2014"
5. "22sep2014"
6. "22sep2014"
7. "22sep2014"
8. "29sep2014"
9. "06oct2014"
10. end
. gen var1 = daily(svar1, "DMY")
. gen week = var1 - dow(var1)
. format week var1 %td
. list
+-----------------------------------+
| svar1 var1 week |
|-----------------------------------|
1. | 15sep2014 15sep2014 14sep2014 |
2. | 15sep2014 15sep2014 14sep2014 |
3. | 17sep2014 17sep2014 14sep2014 |
4. | 18sep2014 18sep2014 14sep2014 |
5. | 22sep2014 22sep2014 21sep2014 |
|-----------------------------------|
6. | 22sep2014 22sep2014 21sep2014 |
7. | 22sep2014 22sep2014 21sep2014 |
8. | 29sep2014 29sep2014 28sep2014 |
9. | 06oct2014 06oct2014 05oct2014 |
+-----------------------------------+
Much more discussion here, here and here, although the first should be sufficient.
Instead of using the week() function, I would probably use the wofd() function to transform your %td daily date into a %tw weekly date. Then you can just play with the datetime display formats to decide exactly how to format the date. For example:
gen date_weekly = wofd(var1)
format date_weekly %twww:_Mon_ccYY
That code should give you this:
var1 date_weekly
15sep2014 37: Sep 2014
15sep2014 37: Sep 2014
17sep2014 38: Sep 2014
18sep2014 38: Sep 2014
22sep2014 38: Sep 2014
This help file will be useful:
help datetime display formats
And if you want to brush up on the difference between %tw and %td dates, you might refresh yourself here:
help datetime