I have a dataset that has a date variable with missing dates.
var1
15sep2014
15sep2014
17sep2014
18sep2014
22sep2014
22sep2014
22sep2014
29sep2014
06oct2014
I aggregated the data using this command.
gen week = week(var1)
and the results look like this:
var1 week
15sep2014 37
15sep2014 37
17sep2014 38
18sep2014 38
22sep2014 38
I was wondering whether it would be possible to get the month name and year in the week variable.
In general, week() is part of the solution if and only if you define your weeks according to Stata's rules for weeks. They are:
Week 1 of the year starts on January 1, regardless.
Week 2 of the year starts on January 8, regardless.
And so on, except that week 52 of the year includes 8 or 9 days, depending on whether the year is leap or not.
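These rules can be checked directly. A quick sketch, with 2014 (not a leap year) chosen arbitrarily for illustration:

```stata
* Sketch: week() under Stata's own rules; the dates are illustrative only
* week 1 starts on 1 January
display week(td(01jan2014))
* week 2 starts on 8 January
display week(td(08jan2014))
* 23 December is day 357 of 2014, the last day of week 51
display week(td(23dec2014))
* week 52 starts on 24 December and runs 8 days to 31 December
display week(td(24dec2014))
display week(td(31dec2014))
```

In a leap year the same week 52 would absorb one more day.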
Do you use these rules? I guess not. Then the simplest practice is to define a week by whichever day starts the week. If your weeks start on Sundays, then use the rule (dailydate - dow(dailydate)). If your weeks start on Mondays, ..., Saturdays, adjust the definition.
. clear
. input str9 svar1
svar1
1. "15sep2014"
2. "15sep2014"
3. "17sep2014"
4. "18sep2014"
5. "22sep2014"
6. "22sep2014"
7. "22sep2014"
8. "29sep2014"
9. "06oct2014"
10. end
. gen var1 = daily(svar1, "DMY")
. gen week = var1 - dow(var1)
. format week var1 %td
. list
+-----------------------------------+
| svar1 var1 week |
|-----------------------------------|
1. | 15sep2014 15sep2014 14sep2014 |
2. | 15sep2014 15sep2014 14sep2014 |
3. | 17sep2014 17sep2014 14sep2014 |
4. | 18sep2014 18sep2014 14sep2014 |
5. | 22sep2014 22sep2014 21sep2014 |
|-----------------------------------|
6. | 22sep2014 22sep2014 21sep2014 |
7. | 22sep2014 22sep2014 21sep2014 |
8. | 29sep2014 29sep2014 28sep2014 |
9. | 06oct2014 06oct2014 05oct2014 |
+-----------------------------------+
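For weeks that start on some other day, one possible adjustment is to subtract the number of days elapsed since the chosen start day. The sketch below assumes a hypothetical local macro start (0 = Sunday, 1 = Monday, ..., 6 = Saturday) and a few made-up dates; it is one way to write the rule, not the only one.

```stata
* Sketch: index each week by the day that starts it
* start is a hypothetical local: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
clear
input str9 svar1
"15sep2014"
"17sep2014"
"21sep2014"
"22sep2014"
end
gen var1 = daily(svar1, "DMY")
local start 1
* mod() gives the days elapsed since the most recent start day
gen week = var1 - mod(dow(var1) - `start' + 7, 7)
format var1 week %td
list
```

With start set to 0 this reduces to var1 - dow(var1), the Sunday rule used above.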
Much more discussion here, here and here, although the first should be sufficient.
Instead of using the week() function, I would probably use the wofd() function to transform your %td daily date into a %tw weekly date. Then you can just play with the datetime display formats to decide exactly how to format the date. For example:
gen date_weekly = wofd(var1)
format date_weekly %twww:_Mon_ccYY
That code should give you this:
var1 date_weekly
15sep2014 37: Sep 2014
15sep2014 37: Sep 2014
17sep2014 38: Sep 2014
18sep2014 38: Sep 2014
22sep2014 38: Sep 2014
This help file will be useful:
help datetime display formats
And if you want to brush up on the difference between %tw and %td dates, you might refresh yourself here:
help datetime
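If a text label is all that is needed, another possibility is to build a string from the week number and the formatted daily date. This is only a sketch: weeklabel is a hypothetical variable name, and the result is a string for display, not a date that can be sorted chronologically or used with tsset.

```stata
* Sketch: week number plus month and year as a display string
clear
input str9 svar1
"15sep2014"
"17sep2014"
"22sep2014"
end
gen var1 = daily(svar1, "DMY")
format var1 %td
* week() follows Stata's own week rules; string() applies a display format
gen weeklabel = string(week(var1)) + ": " + string(var1, "%tdMon_CCYY")
list
```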
I have daily data and want to convert them to weekly, using the following definition. Every Monday denotes the beginning of week i, and Sunday denotes the end of week i.
My date variable is called day and already has %td format. I have a feeling that I should use the dow() function, combined with egen, group(), but I struggle to get it quite right.
If your data are once a week and you have data for Mondays only, then your date variable is fine and all you need to do is declare delta(7) if you use tsset or xtset.
If your data are for two or more days a week and you wish to collapse or contract to weekly data, then you can convert to a suitable time basis like this:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float date
22067
22068
22069
22070
22071
22072
22073
22074
22075
22076
22077
22078
22079
22080
end
format %td date
gen dow = dow(date)
gen wdate = cond(dow(date) == 1, date, cond(dow(date) == 0, date - 6, date - dow(date) + 1))
format wdate %td
list, sepby(wdate)
+-----------------------------+
| date dow wdate |
|-----------------------------|
1. | 01jun2020 1 01jun2020 |
2. | 02jun2020 2 01jun2020 |
3. | 03jun2020 3 01jun2020 |
4. | 04jun2020 4 01jun2020 |
5. | 05jun2020 5 01jun2020 |
6. | 06jun2020 6 01jun2020 |
7. | 07jun2020 0 01jun2020 |
|-----------------------------|
8. | 08jun2020 1 08jun2020 |
9. | 09jun2020 2 08jun2020 |
10. | 10jun2020 3 08jun2020 |
11. | 11jun2020 4 08jun2020 |
12. | 12jun2020 5 08jun2020 |
13. | 13jun2020 6 08jun2020 |
14. | 14jun2020 0 08jun2020 |
+-----------------------------+
In short, index weeks by the Mondays that start them. Now collapse or contract your dataset. Naturally if you have panel or longitudinal data some identifier may be involved too. delta(7) remains essential for anything depending on tsset or xtset.
There is no harm in using egen to map to successive integers, but no advantage in that either.
A theme underlying this is that Stata's own weeks are idiosyncratic, always starting week 1 on 1 January and always having 8 or 9 days in week 52. For more on weeks in Stata, see the papers here and here, which include the advice given in this answer, and much more.
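The collapse step might look like the sketch below. The variable y and its values are made up for illustration, and taking a sum is an assumption; a mean or another statistic may suit the data better.

```stata
* Sketch: daily data collapsed to weeks that start on Mondays
clear
set obs 14
* 14 consecutive days starting Monday 1 June 2020
gen date = td(01jun2020) + _n - 1
format date %td
* y is a hypothetical measurement
gen y = _n
gen wdate = cond(dow(date) == 1, date, cond(dow(date) == 0, date - 6, date - dow(date) + 1))
format wdate %td
collapse (sum) y, by(wdate)
* weekly observations are 7 days apart on a daily scale
tsset wdate, delta(7)
list
```

For panel data, the identifier would go inside by() as well, and xtset would replace tsset.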
I'm trying to scale one variable by another lagged variable.
(IB) scaled by the lagged total assets(AT) = ROA
I've tried the two methods below, from here.
xtset companyid fyear, year
gen at1 = l.at
gen roa=ib/at1
and
xtset gvkey year
gen roa=(ib)/(at[_n-1])
The first one came back with all zeros for 1.ta
The second one seems to generate values on the previous entry, even if it's a different company. I think this is true because only the first row has a missing value. I would assume there should be a missing value for the first year of each company.
Additionally I've tried this code below but it said invalid syntax.
xtset gvkey year
foreach gvkey {
gen roa = (ib)/(at[_n-1]) }
I'm using compustat so it's similar to below:
gvkey|Year |Ticker | at | ib |
-------|-----|--------|------|------|
001111| 2006| abc |1000 |50 |
001111| 2007| abc |1100 |60 |
001111| 2008| abc |1200 |70 |
001111| 2009| abc |1300 |80 |
001112| 2008| www |28777 |1300 |
001112| 2009| www |26123 |870 |
001113| 2009| ttt |550 |-1000 |
001114| 2010| vvv |551 |-990 |
This is hard to follow. 1.ta may, or may not, be a typo for L.at.
Is gvkey string? At the Stata tag, there is really detailed advice about how to give Stata data examples, which you are not following.
In principle, your first approach is correct, so it is hard to know what went wrong, except that
The second one seems to generate values on the previous entry, even if
it's a different company.
That's exactly correct. The previous observation is the previous observation, and nothing in that command refers or alludes to the panel structure or xtset or tsset information.
Your foreach statement is just wild guessing and nothing to do with any form supported by foreach. foreach isn't needed here at all: the lag operator implies working within panels automatically.
I did this, which may help.
clear
input str6 gvkey Year str3 Ticker at ib
001111 2006 abc 1000 50
001111 2007 abc 1100 60
001111 2008 abc 1200 70
001111 2009 abc 1300 80
001112 2008 www 28777 1300
001112 2009 www 26123 870
001113 2009 ttt 550 -1000
001114 2010 vvv 551 -990
end
egen id = group(gvkey), label
xtset id Year
gen wanted = ib/L.at
list, sepby(gvkey)
+------------------------------------------------------------+
| gvkey Year Ticker at ib id wanted |
|------------------------------------------------------------|
1. | 001111 2006 abc 1000 50 001111 . |
2. | 001111 2007 abc 1100 60 001111 .06 |
3. | 001111 2008 abc 1200 70 001111 .0636364 |
4. | 001111 2009 abc 1300 80 001111 .0666667 |
|------------------------------------------------------------|
5. | 001112 2008 www 28777 1300 001112 . |
6. | 001112 2009 www 26123 870 001112 .0302325 |
|------------------------------------------------------------|
7. | 001113 2009 ttt 550 -1000 001113 . |
|------------------------------------------------------------|
8. | 001114 2010 vvv 551 -990 001114 . |
+------------------------------------------------------------+
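The lag operator and a subscript such as [_n-1] can disagree even when the subscript is applied within panels: L. looks for the previous value of the time variable, while [_n-1] takes the previous observation. A minimal sketch, using made-up data with a gap year and the hypothetical names roa and roa_by:

```stata
* Sketch: L.at versus at[_n-1] within panels when a year is missing
clear
input str6 gvkey year at ib
"001111" 2006 1000 50
"001111" 2007 1100 60
"001111" 2009 1300 80
"001112" 2008 28777 1300
"001112" 2009 26123 870
end
egen id = group(gvkey)
xtset id year
* missing whenever the previous year is absent from the panel
gen roa = ib / L.at
* uses the previous observation in the firm, whatever its year
bysort gvkey (year): gen roa_by = ib / at[_n-1]
list, sepby(gvkey)
```

For firm 001111 in 2009, roa is missing because 2008 is absent, while roa_by divides by the 2007 value. Which behaviour is wanted depends on the analysis.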
I have a data set that has data sorted by months and years. I want to destring the month variable so that I can ultimately create one date variable, but as they are all labeled as January, February, etc. how do I destring the variable?
You don't. That's a job for date functions. All are documented, e.g. via help datetime.
destring is for numbers that happen to be read as string variables so that typical entries might be "42" and "666". Import as string usually arises when the variable includes metadata (e.g. header lines), or non-Stata flags for missings (e.g. "NA"), or some other non-numeric characters, often in as few as one observation. Import from MS Excel is a common cause, as spreadsheet users tend to be loose on sprinkling text in numeric data columns.
A variable with values such as "January" doesn't qualify. It's in your mind that month names map on to month numbers, but destring doesn't share that knowledge.
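For contrast, here is a sketch of the kind of case destring is meant for. The variable names and values are invented, and using force to turn "NA" into missing is a choice that should only be made after checking what the nonnumeric entries are.

```stata
* Sketch: numbers read in as strings, with a non-Stata missing flag
clear
input str3 x
"42"
"666"
"NA"
end
* force converts nonnumeric entries such as "NA" to missing
destring x, generate(xnum) force
list
```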
Date functions have this job:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 month float year
"January" 2017
"February" 1942
end
gen mdate = monthly(month + string(year), "MY")
list
+-------------------------+
| month year mdate |
|-------------------------|
1. | January 2017 684 |
2. | February 1942 -215 |
+-------------------------+
format mdate %tm
list
+--------------------------+
| month year mdate |
|--------------------------|
1. | January 2017 2017m1 |
2. | February 1942 1942m2 |
+--------------------------+
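If a numeric month (1 to 12) is wanted as well, it can be extracted from the monthly date. A small sketch, with mnum as a hypothetical variable name:

```stata
* Sketch: numeric month from a monthly date
clear
input str8 month float year
"January" 2017
"February" 1942
end
gen mdate = monthly(month + string(year), "MY")
format mdate %tm
* dofm() gives the first day of the month as a daily date; month() returns 1 to 12
gen mnum = month(dofm(mdate))
list
```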
(Declaration of interest: original author of destring.)
See also this thread.
I am pretty new to Stata programming.
My question: I need to reorder/reshape a dataset through (I guess) a macro.
I have a dataset of individuals, with a variable birthyear (year of birth) and variables each containing weight at a given CALENDAR year: e.g.
birthyear | w_1990 | w_1991 | w_1992 | ... | w_2000
1989 | 7.2 | 9.3 | 10.2 | ... | 35.2
1981 | 33.2 | 35.3 | ...
I would like to obtain new variables containing weight at different ages, e.g. Weight_age_1, Weight_age_2, etc.: this means take for instance first obs of example, leave Weight_age_1 blank, put 7.2 in Weight_age_2, and so on.
I have tried something like...
forvalues i = 1/10{
capture drop weight_age_`i'
capture drop birth_`i'
gen birth_`i'=birthyear-1+`i'
tostring birth_`i', replace
gen weight_age_`i'= w_birth_`i'
}
.. but it doesn't work.
Can you please help me?
Experienced Stata users wouldn't try to write a self-contained program here: they would see that the heart of the problem is a reshape.
clear
input birthyear w_1990 w_1991 w_1992
1989 7.2 9.3 10.2
1981 33.2 35.3 37.6
end
gen id = _n
reshape long w_, i(id)
rename _j year
gen age = year - birthyear
l, sepby(id)
+-----------------------------------+
| id year birthy~r w_ age |
|-----------------------------------|
1. | 1 1990 1989 7.2 1 |
2. | 1 1991 1989 9.3 2 |
3. | 1 1992 1989 10.2 3 |
|-----------------------------------|
4. | 2 1990 1981 33.2 9 |
5. | 2 1991 1981 35.3 10 |
6. | 2 1992 1981 37.6 11 |
+-----------------------------------+
To get the variables you say you want, you could reshape wide, but this long structure is by far the more convenient way to store these data for future Stata work.
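A sketch of that reshape wide step follows. The stub name weight_age_ is chosen to echo the question and is otherwise arbitrary; year is dropped first because it varies within id and would block the reshape.

```stata
* Sketch: long layout by year, then wide layout by age
clear
input birthyear w_1990 w_1991 w_1992
1989 7.2 9.3 10.2
1981 33.2 35.3 37.6
end
gen id = _n
reshape long w_, i(id) j(year)
gen age = year - birthyear
* year is not constant within id, so remove it before going wide
drop year
rename w_ weight_age_
reshape wide weight_age_, i(id) j(age)
list
```

Only ages that occur in the data get a variable, so here the result holds weight_age_1 to weight_age_3 and weight_age_9 to weight_age_11, with missing values where a person was not observed at that age.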
P.S. The heart of your programming problem is that you are getting confused between the names of variables and their contents.
But this is a "look-up" approach made to work:
clear
input birthyear w_1990 w_1991 w_1992
1989 7.2 9.3 10.2
1981 33.2 35.3 37.6
end
quietly forval j = 1/10 {
gen weight_`j' = .
forval k = 1990/1992 {
replace weight_`j' = w_`k' if (`k' - birthyear) == `j'
}
}
The essential trick is to do name manipulation using local macros. In Stata, variables are mainly for holding data; single-valued constants are better held in local macros and scalars. (Your sense of the word "macro" as meaning script or program is not how the term is used in Stata.)
As above: this is the data structure you ask for, but it is likely to be more problematic than that produced by reshape long.
I have the following goal regarding my data structure
group; month; year; next_year
1; February; 2014; 0
1; March; 2006; 0
1; November; 2013; 1
2; January; 2014; 0
3; January; 2004; 0
I do have group, month and year, however the column next_year needs to be generated from the first three: For each observation, I want to check if there is another observation within the same group that has a date-value which falls into the period of next year. If so, I want to set the value of next_year to 1, otherwise to 0 (see example).
I started by converting the date into a format that Stata can interpret - using ym(month, year) - such that I can make comparisons. However, I am not sure how to iterate over all observations within the group in order to determine if that is the case or not.
I would know how to do it in e.g. Java, but I don't for Stata. I suppose I should not start with loops as there are probably some implemented commands for such problems.
If you want to check if there is a following observation within the next 12 months, you can try:
clear
set more off
*----- example data -----
input group str8 month year
1 March 2006
1 March 2013
1 November 2013
1 January 2013
2 January 2014
3 January 2004
end
*----- what you want -----
gen dat = monthly(month + string(year), "MY")
format dat %tm
bysort group (dat): gen next = dat[_n+1] - dat <= 12
list, sepby(group)
Make sure you understand the difference between Nick's code and mine. They work under different assumptions. You can check the differences by running both pieces of code with the data I have provided (originally Nick's, but with one observation changed to get the point across; by chance, if you use Nick's data without the modification, the results will be the same).
You are correct in avoiding an explicit loop. This kind of problem is soluble using by:.
I modified your example to have two observations for one group in one year.
clear
input group str8 month year
1 February 2014
1 March 2006
1 March 2013
1 November 2013
2 January 2014
3 January 2004
end
bysort group (year) : gen next_year = year[_n+1] == year + 1
bysort group year (next_year) : replace next_year = next_year[_N]
list, sepby(group)
+------------------------------------+
| group month year next_y~r |
|------------------------------------|
1. | 1 March 2006 0 |
2. | 1 November 2013 1 |
3. | 1 March 2013 1 |
4. | 1 February 2014 0 |
|------------------------------------|
5. | 2 January 2014 0 |
|------------------------------------|
6. | 3 January 2004 0 |
+------------------------------------+
Getting an explicit sort order is essential. Within group, we look ahead to see if the next year is the current year plus 1, assigning 1 if true and 0 if false. That will at most be true for the last observation for a given group and year. If there is more than one observation for each group and year, any occurrence of 1 must be spread to all such observations.
For a tutorial on by:, see Speaking Stata: How to move step by: step.
The assumption here is that you mean in the next calendar year, not in the next 12 months. Making your dates into Stata monthly dates will be needed for most other problems, but doesn't make this one easier. Here is one way to do that in your situation, assuming that month is string and year is numeric:
gen mdate = monthly(month + string(year), "MY")
format mdate %tm
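To see how the two assumptions differ (within the next 12 months versus in the next calendar year), both rules can be run on the same data. This is a sketch that reuses the example data with January 2013 in group 1; the variable names next and next_year follow the two answers.

```stata
* Sketch: "within 12 months" and "next calendar year" side by side
clear
input group str8 month year
1 March 2006
1 March 2013
1 November 2013
1 January 2013
2 January 2014
3 January 2004
end
gen dat = monthly(month + string(year), "MY")
format dat %tm
* rule 1: is the following observation at most 12 months later?
bysort group (dat): gen next = dat[_n+1] - dat <= 12
* rule 2: is there any observation in the following calendar year?
bysort group (year): gen next_year = year[_n+1] == year + 1
bysort group year (next_year): replace next_year = next_year[_N]
sort group dat
list group month year next next_year, sepby(group)
```

In group 1, January and March 2013 are each followed by another observation within 12 months, so next is 1, but nothing falls in 2014, so next_year is 0.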