Stata: Scale x variable by lagged y variable

I'm trying to scale one variable by another lagged variable.
(IB) scaled by the lagged total assets(AT) = ROA
I've tried the two methods below, which I found here.
xtset companyid fyear, year
gen at1 = l.at
gen roa=ib/at1
and
xtset gvkey year
gen roa=(ib)/(at[_n-1])
The first one came back with all zeros for 1.ta
The second one seems to generate values on the previous entry, even if it's a different company. I think this is true because only the first row has a missing value. I would assume there should be a missing value for the first year of each company.
Additionally I've tried this code below but it said invalid syntax.
xtset gvkey year
foreach gvkey {
gen roa = (ib)/(at[_n-1]) }
I'm using compustat so it's similar to below:
gvkey|Year |Ticker | at | ib |
-------|-----|--------|------|------|
001111| 2006| abc |1000 |50 |
001111| 2007| abc |1100 |60 |
001111| 2008| abc |1200 |70 |
001111| 2009| abc |1300 |80 |
001112| 2008| www |28777 |1300 |
001112| 2009| www |26123 |870 |
001113| 2009| ttt |550 |-1000 |
001114| 2010| vvv |551 |-990 |

This is hard to follow. 1.ta may, or may not, be a typo for L.at.
Is gvkey string? At the Stata tag, there is really detailed advice about how to give Stata data examples, which you are not following.
In principle, your first approach is correct, so it is hard to know what went wrong, except that
The second one seems to generate values on the previous entry, even if
it's a different company.
That's exactly correct. The previous observation is the previous observation, and nothing in that command refers or alludes to the panel structure or xtset or tsset information.
Your foreach statement is just wild guessing and nothing to do with any form supported by foreach. foreach isn't needed here at all: the lag operator implies working within panels automatically.
I did this, which may help.
clear
input str6 gvkey Year str3 Ticker at ib
001111 2006 abc 1000 50
001111 2007 abc 1100 60
001111 2008 abc 1200 70
001111 2009 abc 1300 80
001112 2008 www 28777 1300
001112 2009 www 26123 870
001113 2009 ttt 550 -1000
001114 2010 vvv 551 -990
end
egen id = group(gvkey), label
xtset id Year
gen wanted = at/L.ib
list, sepby(gvkey)
+------------------------------------------------------------+
| gvkey Year Ticker at ib id wanted |
|------------------------------------------------------------|
1. | 001111 2006 abc 1000 50 001111 . |
2. | 001111 2007 abc 1100 60 001111 22 |
3. | 001111 2008 abc 1200 70 001111 20 |
4. | 001111 2009 abc 1300 80 001111 18.57143 |
|------------------------------------------------------------|
5. | 001112 2008 www 28777 1300 001112 . |
6. | 001112 2009 www 26123 870 001112 20.09462 |
|------------------------------------------------------------|
7. | 001113 2009 ttt 550 -1000 001113 . |
|------------------------------------------------------------|
8. | 001114 2010 vvv 551 -990 001114 . |
+------------------------------------------------------------+
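Note that the example above divides at by lagged ib, whereas the question asks for ib scaled by lagged at. A minimal sketch of the ratio as asked, using a shortened copy of the example data; the names roa and roa2 are just illustrative. The second command shows the by-group subscript form of what the question attempted, which restarts within each company but, unlike L., takes no account of gaps in Year.

```stata
clear
input str6 gvkey Year at ib
"001111" 2006 1000 50
"001111" 2007 1100 60
"001112" 2008 28777 1300
"001112" 2009 26123 870
end
egen id = group(gvkey)
xtset id Year
* ROA as asked: ib over total assets (at) from the previous year, within company
gen roa = ib / L.at
* by-group subscripting: previous observation of the same company,
* whatever its year (so it ignores gaps, which L. does not)
bysort gvkey (Year): gen roa2 = ib / at[_n-1]
list, sepby(gvkey)
```

Both variables are missing in the first year of each company, which is what the question expected to see.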

Related

Trimming my panel dataset - filtering out observations meeting criterion if preceding ID meets the complementary criterion

I am working with a dataset that includes 118,979 observations over 9 wide variables in Stata 16.0. The most prominent variable is whether a company-observation over multiple dates reports either "GPS" or "EPS". These companies can report both a "GPS" observation in a datapoint, as well as an "EPS" observation in the following datapoint. Please refer to the data overview below for further visualisation.
Datasample:
clear
input str8 cusip8 str16 cname str4 measure double actual long anndats_act float(fyear tanalyst meanforcast UE)
"87482X10" "TALMER BANCORP" "EPS" 1.21 20118 2014 29 .8686207 .3930131
"87482X10" "TALMER BANCORP" "GPS" 1.02 20479 2015 34 .8576471 .1893004
I need to drop the GPS observations (over multiple dates) once an identifier (being cusip8 in the table above) has reported an EPS over multiple dates. That is, if a company has reported GPS as well as EPS in e.g. January 1st, 2010, I want to drop the GPS observation such that the EPS is kept.
If a company only reports a GPS, and does not report an EPS during a given date, I want to keep the GPS observation in my dataset.
The following works for me (adjust your variable names as required):
. clear
. input str10(company_id measure) month day year
company_id measure month day year
1. "Company A" "EPS" 1 1 2010
2. "Company A" "GPS" 1 1 2010
3. "Company A" "GPS" 1 1 2010
4. "Company A" "GPS" 1 2 2010
5. "Company B" "EPS" 1 2 2010
6. "Company B" "GPS" 1 1 2010
7. "Company C" "GPS" 1 4 2010
8. "Company C" "EPS" 1 4 2010
9. end
.
. gen date = mdy(month,day,year)
. format date %d
. drop month day year
.
. sort company_id date measure
.
. gen both = 0
. by company_id date: replace both = 1 if measure[1] == "EPS" & measure[2] == "GPS"
(5 real changes made)
.
. list, sepby(company_id)
+----------------------------------------+
| company~d measure date both |
|----------------------------------------|
1. | Company A EPS 01jan2010 1 |
2. | Company A GPS 01jan2010 1 |
3. | Company A GPS 01jan2010 1 |
4. | Company A GPS 02jan2010 0 |
|----------------------------------------|
5. | Company B GPS 01jan2010 0 |
6. | Company B EPS 02jan2010 0 |
|----------------------------------------|
7. | Company C EPS 04jan2010 1 |
8. | Company C GPS 04jan2010 1 |
+----------------------------------------+
.
. drop if measure == "GPS" & both == 1
(3 observations deleted)
.
. list, sepby(company_id)
+----------------------------------------+
| company~d measure date both |
|----------------------------------------|
1. | Company A EPS 01jan2010 1 |
2. | Company A GPS 02jan2010 0 |
|----------------------------------------|
3. | Company B GPS 01jan2010 0 |
4. | Company B EPS 02jan2010 0 |
|----------------------------------------|
5. | Company C EPS 04jan2010 1 |
+----------------------------------------+
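The approach above depends on EPS sorting before GPS within each company and date, so that measure[1] and measure[2] pick up the right observations. A possible alternative, sketched here on a shortened copy of the example data, flags every company-date that contains any EPS observation; has_eps is a hypothetical variable name.

```stata
clear
input str10(company_id measure) month day year
"Company A" "EPS" 1 1 2010
"Company A" "GPS" 1 1 2010
"Company A" "GPS" 1 2 2010
"Company B" "GPS" 1 1 2010
"Company B" "EPS" 1 2 2010
end
gen date = mdy(month, day, year)
format date %td
* has_eps is 1 for every observation of a company-date with at least one EPS
egen has_eps = max(measure == "EPS"), by(company_id date)
* drop GPS only where an EPS exists for the same company and date
drop if measure == "GPS" & has_eps
list, sepby(company_id)
```

This does not rely on the sort order or on how many GPS observations share a date.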

Update results in a column from multiple columns with different names

Based on the image, I would like to loop through the columns to find where the text mo appears, and then fill a variable mo with the corresponding result rather than with the text mo itself. The challenge has been selecting the result, which sits in the column next to the one where mo appears.
Your answer to my comment above suggests to me that the question you ask reflects the wrong approach to the larger problem. Your description suggests that you have observations with a varying number of testname/testvalue pairs, such as
+----------------------------------------+
| id day test1 val1 test2 val2 |
|----------------------------------------|
| A 1 mo 11 . |
| A 2 mo 12 df 98.2 |
|----------------------------------------|
| B 1 df 98.3 mo 23 |
| B 2 mo 14 . |
+----------------------------------------+
and your objective is to produce observations that look like this
+----------------------+
| id day df mo |
|----------------------|
| A 1 . 11 |
| A 2 98.2 12 |
|----------------------|
| B 1 98.3 23 |
| B 2 . 14 |
+----------------------+
If that is the case, here is a reproducible example that you can copy, paste into Stata's Do-file Editor window, execute it, and examine the output to see how the technique avoids all the complexity you introduce by trying to use loops to accomplish the task. The reshape command is one of Stata's most powerful data management tools and it will benefit you to learn how to use it.
clear
input str8 id int day str8 test1 float val1 str8 test2 float val2
A 1 "mo" 11 "" .
A 2 "mo" 12 "df" 98.2
B 1 "df" 98.3 "mo" 23
B 2 "mo" 14 "" .
end
list, sepby(id) noobs
reshape long test val, i(id day) j(num)
drop if missing(test)
drop num
list, sepby(id) noobs
reshape wide val, i(id day) j(test) str
rename val* *
list, sepby(id) noobs

Reshape from wide to long without Identifier

I have problems in reshaping data from wide to long format:
I have no identifier variable for the wide variables.
My dataset is quite wide. I do have about 7000 variables.
The number of variables per ID is not constant, so for some IDs I have 5 and for others I have 10 variables.
I was hoping that this Stata FAQ could help me, but unfortunately this does not work properly (see following code snippets).
So I do have data that looks like the following example:
clear
input str45 Year
"2010"
"2011"
"2012"
"2014"
end
input str45 A101Meas0010
"1.50"
"1.70"
"1.71"
"1.71"
input str45 A101Meas0020
"50"
"60"
"65"
"64"
input str45 A101Meas0020A
"51"
"62"
"64"
"68"
input str45 FE123Meas0010
"1.60"
"1.75"
"1.92"
"1.94"
input str45 FE123Meas0020
"60"
"72"
"88"
"92"
list
+-------------------------------------------------------------+
| Year A10~0010 A10~0020 A1~0020A FE1~0010 FE1~0020 |
|-------------------------------------------------------------|
1. | 2010 1.50 50 51 1.60 60 |
2. | 2011 1.70 60 62 1.75 72 |
3. | 2012 1.71 65 64 1.92 88 |
4. | 2014 1.71 64 68 1.94 92 |
+-------------------------------------------------------------+
The final table I want to achieve would look something like this:
+--------------------------------------------------+
| Year ID Meas0010 Meas0020 Meas0020A |
|--------------------------------------------------|
1. | 2010 A101 1.50 50 . |
2. | 2010 FE123 1.60 51 60 |
3. | 2011 A101 1.70 60 . |
4. | 2011 FE123 1.75 62 72 |
5. | 2012 A101 1.71 65 . |
6. | 2012 FE123 1.92 64 88 |
7. | 2014 A101 1.71 64 . |
8. | 2014 FE123 1.94 68 92 |
+--------------------------------------------------+
I tried following code snippet close to the example from the Stata FAQ, but this throws an error:
unab vars : *Meas*
local stubs : subinstr local vars "Meas0010" "", all
local stubs : subinstr local stubs "Meas0020" "", all
local stubs : subinstr local stubs "Meas0020A" "", all
reshape long "`stubs'", i(Year) j(Measurement) string
(note: j = Meas0010 Meas0020 Meas0020A)
(note: A101AMeas0010 not found)
variable A101Meas0010 not found
r(111);
Any ideas how to reshape this? I never had to reshape such an odd structure before.
Additional Question: In the example above I did have to specify the Measurement-Names Meas0010, Meas0020 and Meas0020A. Is it possible to automate this as well? All measurement names start with the keyword Meas, so the variable names are always of the structure _ID+MeasName, e.g. A101Meas0020A stands for ID A101 and Measurement Meas0020A.
The annoying thing is: I do know how to do this in MATLAB, but I am forced to use Stata here.
Your variable name structure is a little awkward, but there is a syntax to match. It's better covered in the help for reshape, and is only barely mentioned in the FAQ you cite (which I wrote, so I can be emphatic that it's intended as a supplement to the help, not the first line of documentation).
Your example yields to
clear
input str4 (Year A101Meas0010 A101Meas0020 A101Meas0020A FE123Meas0010 FE123Meas0020)
"2010" "1.50" "50" "51" "1.60" "50"
"2011" "1.70" "60" "62" "1.75" "60"
"2012" "1.71" "65" "64" "1.92" "65"
"2014" "1.71" "64" "68" "1.94" "64"
end
reshape long @Meas0010 @Meas0020 @Meas0020A, i(Year) j(ID) string
destring, replace
sort Year ID
list, sepby(Year)
+-----------------------------------------------+
| Year ID Meas0010 Meas0020 Me~0020A |
|-----------------------------------------------|
1. | 2010 A101 1.5 50 51 |
2. | 2010 FE123 1.6 50 . |
|-----------------------------------------------|
3. | 2011 A101 1.7 60 62 |
4. | 2011 FE123 1.75 60 . |
|-----------------------------------------------|
5. | 2012 A101 1.71 65 64 |
6. | 2012 FE123 1.92 65 . |
|-----------------------------------------------|
7. | 2014 A101 1.71 64 68 |
8. | 2014 FE123 1.94 64 . |
+-----------------------------------------------+
It seems bizarre that your example enters everything as string: note the destring in my code.
Without access to your dataset, I'd say that you should be able to find the more general syntax without automation. You know that there are at most about 10 measurements in the fullest case. In any event you are already showing the syntax tricks needed to remove strings you don't need.
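If automation is wanted after all, here is one rough sketch of building the stub list from the variable names. It assumes that every variable to be reshaped contains the text Meas exactly once and that whatever precedes it is the ID; the local macro names are illustrative.

```stata
clear
input str4(Year A101Meas0010 A101Meas0020 A101Meas0020A FE123Meas0010 FE123Meas0020)
"2010" "1.50" "50" "51" "1.60" "50"
"2011" "1.70" "60" "62" "1.75" "60"
end
* collect the distinct measurement names, each prefixed with @
unab vars : *Meas*
local stubs
foreach v of local vars {
    * keep the part of the name from "Meas" onwards
    local name = substr("`v'", strpos("`v'", "Meas"), .)
    local stub "@`name'"
    * macro list union: append stub only if it is not already present
    local stubs : list stubs | stub
}
display "`stubs'"
reshape long `stubs', i(Year) j(ID) string
destring, replace
list, sepby(Year)
```

The @ marks where the ID sits in each variable name, so the text before Meas ends up in the string variable ID.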

Stata: Aggregating by week

I have a dataset that has a date variable with missing dates.
var1
15sep2014
15sep2014
17sep2014
18sep2014
22sep2014
22sep2014
22sep2014
29sep2014
06oct2014
I aggregated the data using this command.
gen week = week(var1)
and the results look like this
var1 week
15sep2014 37
15sep2014 37
17sep2014 38
18sep2014 38
22sep2014 38
I was wondering whether it would be possible to get the month name and year in the week variable.
In general, week() is part of the solution if and only if you define your weeks according to Stata's rules for weeks. They are
Week 1 of the year starts on January 1, regardless.
Week 2 of the year starts on January 8, regardless.
And so on, except that week 52 of the year includes 8 or 9 days, depending on
whether the year is leap or not.
Do you use these rules? I guess not. Then the simplest practice is to define a week by whichever day starts the week. If your weeks start on Sundays, then use the rule (dailydate - dow(dailydate)). If your weeks start on Mondays, ..., Saturdays, adjust the definition.
. clear
. input str9 svar1
svar1
1. "15sep2014"
2. "15sep2014"
3. "17sep2014"
4. "18sep2014"
5. "22sep2014"
6. "22sep2014"
7. "22sep2014"
8. "29sep2014"
9. "06oct2014"
10. end
. gen var1 = daily(svar1, "DMY")
. gen week = var1 - dow(var1)
. format week var1 %td
. list
+-----------------------------------+
| svar1 var1 week |
|-----------------------------------|
1. | 15sep2014 15sep2014 14sep2014 |
2. | 15sep2014 15sep2014 14sep2014 |
3. | 17sep2014 17sep2014 14sep2014 |
4. | 18sep2014 18sep2014 14sep2014 |
5. | 22sep2014 22sep2014 21sep2014 |
|-----------------------------------|
6. | 22sep2014 22sep2014 21sep2014 |
7. | 22sep2014 22sep2014 21sep2014 |
8. | 29sep2014 29sep2014 28sep2014 |
9. | 06oct2014 06oct2014 05oct2014 |
+-----------------------------------+
Much more discussion here, here and here, although the first should be sufficient.
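For weeks that start on Mondays, one way to adjust the rule is sketched below. It uses the fact that dow() returns 0 for Sunday through 6 for Saturday; the variable name week_mon and the extra date 21sep2014 (a Sunday) are added purely for illustration.

```stata
clear
input str9 svar1
"15sep2014"
"17sep2014"
"21sep2014"
"22sep2014"
end
gen var1 = daily(svar1, "DMY")
* mod(dow + 6, 7) is the number of days elapsed since the latest Monday:
* 0 for Monday, 1 for Tuesday, ..., 6 for Sunday
gen week_mon = var1 - mod(dow(var1) + 6, 7)
format var1 week_mon %td
list
```

The same pattern should work for any other starting day by changing the constant added inside mod().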
Instead of using the week() function, I would probably use the wofd() function to transform your %td daily date into a %tw weekly date. Then you can just play with the datetime display formats to decide exactly how to format the date. For example:
gen date_weekly = wofd(var1)
format date_weekly %twww:_Mon_ccYY
That code should give you this:
var1 date_weekly
15sep2014 37: Sep 2014
15sep2014 37: Sep 2014
17sep2014 38: Sep 2014
18sep2014 38: Sep 2014
22sep2014 38: Sep 2014
This help file will be useful:
help datetime display formats
And if you want to brush up on the difference between %tw and %td dates, you might refresh yourself here:
help datetime

Stata - Dynamically define variable names in loop

I am pretty new to Stata programming.
My question: I need to reorder/reshape a dataset through (I guess) a macro.
I have a dataset of individuals, with a variable birthyear (year of birth) and variables each containing weight at a given CALENDAR year: e.g.
BIRTHYEAR | W_1990 | W_1991 | W_1992 | ... | w_2000
1989 | 7.2 | 9.3 | 10.2 | ... | 35.2
1981 | 33.2 | 35.3 | ...
I would like to obtain new variables containing weight at different ages, e.g. Weight_age_1, Weight_age_2, etc.: this means take for instance first obs of example, leave Weight_age_1 blank, put 7.2 in Weight_age_2, and so on.
I have tried something like...
forvalues i = 1/10{
capture drop weight_age_`i'
capture drop birth`i
gen birth_`i'=birthyear-1+`i'
tostring birth_`i', replace
gen weight_age_`i'= w_birth_`i'
}
.. but it doesn't work.
Can you please help me?
Experienced Stata users wouldn't try to write a self-contained program here: they would see that the heart of the problem is a reshape.
clear
input birthyear w_1990 w_1991 w_1992
1989 7.2 9.3 10.2
1981 33.2 35.3 37.6
end
gen id = _n
reshape long w_, i(id)
rename _j year
gen age = year - birthyear
l, sepby(id)
+-----------------------------------+
| id year birthy~r w_ age |
|-----------------------------------|
1. | 1 1990 1989 7.2 1 |
2. | 1 1991 1989 9.3 2 |
3. | 1 1992 1989 10.2 3 |
|-----------------------------------|
4. | 2 1990 1981 33.2 9 |
5. | 2 1991 1981 35.3 10 |
6. | 2 1992 1981 37.6 11 |
+-----------------------------------+
To get the variables you say you want, you could reshape wide, but this long structure is by far the more convenient way to store these data for future Stata work.
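A sketch of that reshape wide step, in case the wide layout really is needed. It repeats the data setup so that it runs on its own, and the prefix weight_age_ is taken from the question.

```stata
clear
input birthyear w_1990 w_1991 w_1992
1989 7.2 9.3 10.2
1981 33.2 35.3 37.6
end
gen id = _n
reshape long w_, i(id) j(year)
gen age = year - birthyear
* year is no longer needed once age is known, and must go
* because it varies within id
drop year
* back to wide, this time indexed by age rather than calendar year
reshape wide w_, i(id) j(age)
rename w_* weight_age_*
list
```

Only ages that actually occur in the data get a variable, so the result has gaps wherever no individual was observed at a given age.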
P.S. The heart of your programming problem is that you are getting confused between the names of variables and their contents.
But this is a "look-up" approach made to work:
clear
input birthyear w_1990 w_1991 w_1992
1989 7.2 9.3 10.2
1981 33.2 35.3 37.6
end
quietly forval j = 1/10 {
gen weight_`j' = .
forval k = 1990/1992 {
replace weight_`j' = w_`k' if (`k' - birthyear) == `j'
}
}
The essential trick is to do name manipulation using local macros. In Stata, variables are mainly for holding data; single-valued constants are better held in local macros and scalars. (Your sense of the word "macro" as meaning script or program is not how the term is used in Stata.)
As above: this is the data structure you ask for, but it is likely to be more problematic than that produced by reshape long.