Destringing variables - stata

I have a data set that has data sorted by months and years. I want to destring the month variable so that I can ultimately create one date variable, but the months are all stored as names (January, February, etc.). How do I destring the variable?

You don't. That's a job for date functions. All are documented, e.g. via help datetime.
destring is for numbers that happen to be read as string variables so that typical entries might be "42" and "666". Import as string usually arises when the variable includes metadata (e.g. header lines), or non-Stata flags for missings (e.g. "NA"), or some other non-numeric characters, often in as few as one observation. Import from MS Excel is a common cause, as spreadsheet users tend to be loose on sprinkling text in numeric data columns.
A variable with values such as "January" doesn't qualify. It's in your mind that month names map on to month numbers, but destring doesn't share that knowledge.
Date functions have this job:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 month float year
"January" 2017
"February" 1942
end
gen mdate = monthly(month + string(year), "MY")
list
     +-------------------------+
     |    month   year   mdate |
     |-------------------------|
  1. |  January   2017     684 |
  2. | February   1942    -215 |
     +-------------------------+
format mdate %tm
list
     +--------------------------+
     |    month   year    mdate |
     |--------------------------|
  1. |  January   2017   2017m1 |
  2. | February   1942   1942m2 |
     +--------------------------+
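For contrast, here is a minimal sketch of the kind of variable destring is meant for. The variable names sx and x and the three values are made up for illustration; the force option turns any entry that is not a number into missing, so use it only when you are sure that is what you want.

```stata
clear
input str3 sx
"42"
"666"
"NA"
end
* hypothetical example: numbers read as strings, with "NA" as a non-Stata missing flag
* force converts non-numeric entries such as "NA" to numeric missing
destring sx, generate(x) force
list
```

Applied to a variable holding "January" and "February", the same command would return nothing but missing values.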
(Declaration of interest: original author of destring.)
See also this thread.

Related

Calculating days passed

I have a variable date like this:
I want to calculate how many days have passed since, say, Jan 1 of 1960.
However, this is tedious. Also in some years, February has 28 days.
What I've been trying is basically to look up every single calendar, calculate how many days there are in each year, recognize a string like jan as month 1, and so on.
Is there any short and efficient way to do this?
You need to use the daily() or date() function:
display date("1/1/2012", "DMY") - date("1/1/1960", "DMY")
18993
More generally, if you have a string variable with dates:
clear
input str10 date1
"01/01/2012"
"01/01/2011"
"01/01/2014"
"19/12/2014"
end
generate date2 = date(date1, "DMY") - date("1/1/1960", "DMY")
list
     +--------------------+
     |      date1   date2 |
     |--------------------|
  1. | 01/01/2012   18993 |
  2. | 01/01/2011   18628 |
  3. | 01/01/2014   19724 |
  4. | 19/12/2014   20076 |
     +--------------------+
If the variable containing the dates is numeric:
clear
input date1
18993
18628
19724
20076
end
format %tdDD/NN/CCYY date1
generate date2 = date1 - date("1/1/1960", "DMY")
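Stata daily dates are themselves counts of days since 01jan1960, so the subtraction above only matters when you pick a different origin. As a small sketch, assuming you have the date components as numbers rather than as a string, mdy() builds the same daily dates:

```stata
* mdy(month, day, year) returns a Stata daily date
display mdy(1, 1, 1960)
display mdy(1, 1, 2012)
* days between two arbitrary dates: just subtract
display mdy(12, 19, 2014) - mdy(1, 1, 2012)
```

The second line should agree with the 18993 shown earlier for 01/01/2012.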

How do I convert (daily) date to month date?

In Stata, how do I convert date in the form of:
09mar2005 00:00:00
to a month-year variable?
If it matters, the date format is %tc.
What I have in mind is to plot monthly averages (instead of the daily average I have) of variables across time.
To get where you are now, you or somebody else may have done something like this:
clear
set obs 1
gen earlier = "09mar2005 00:00:00"
gen double nowhave = clock(earlier, "DMY hms")
format nowhave %tc
list
     +-----------------------------------------+
     |            earlier              nowhave |
     |-----------------------------------------|
  1. | 09mar2005 00:00:00   09mar2005 00:00:00 |
     +-----------------------------------------+
Note that a string date and a numeric date-time variable with appropriate date-time format %tc just look the same when you list them, but they are quite different beasts.
To get where you want to be -- with a monthly date -- you convert from clock (date-time) to daily to monthly:
gen mdate = mofd(dofc(nowhave))
format mdate %tm
list
     +--------------------------------------------------+
     |            earlier              nowhave    mdate |
     |--------------------------------------------------|
  1. | 09mar2005 00:00:00   09mar2005 00:00:00   2005m3 |
     +--------------------------------------------------+
All is documented at help datetime. The function names stand for month of daily date and daily date of clock.
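Once the monthly date exists, monthly averages are one collapse away. The following is only a sketch: the variables when and y and the 60 simulated days are invented for illustration, and collapse replaces the data in memory, so work on a copy or use preserve first.

```stata
clear
set obs 60
* hypothetical %tc timestamps, one per day from 09mar2005
gen double when = clock("09mar2005 00:00:00", "DMY hms") + (_n - 1) * msofhours(24)
format when %tc
* hypothetical outcome to be averaged
gen y = _n
* clock -> daily -> monthly, as above
gen mdate = mofd(dofc(when))
format mdate %tm
* one observation per month, holding the mean of y
collapse (mean) y, by(mdate)
list
```

The collapsed data can then be plotted against mdate, for example with line y mdate.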

Convert wide-like data to a long one in Stata?

I have a dataset like
year CNMubiBeijing CNMubiTianjing CNMubiShanghai ··· ··· Wulumuqi
1998 . . . .
1999 . . . .
····
2013 . . . .
As you can see, the first row is a list of city names in China, like Beijing, Shanghai and so on, combined with a prefix "CNMubi" (which is redundant). The first column corresponds to the year, and the observations are of another variable (like local government's tax revenue). It's similar to "wide" data and I want to convert it to long panel data like
city year tax_rev
Beijing 1998
···
Beijing 2013
Shanghai 1998
···
Shanghai 2013
Two immediate solutions come to mind. One is to use the reshape command directly, like reshape long CNMubi,i(year) j(city_eng), but it turns out to give me a column of missing values (the city_eng column).
The second possible solution is to use a loop, like
foreach var of varlist _all {
replace city_eng="`var'"
}
It doesn't work either (in fact, the newly generated city_eng equals the last variable in the varlist). I need to "expand" the data from a mn to a mnm matrix. How can I achieve my goal? Thank you.
This works:
clear
set more off
*----- example data -----
input ///
year CNMubiBeijing CNMubiTianjing
1998 . .
1999 . .
2000
2001
2002
2003
end
set seed 259376
replace CNMubiBeijing = runiform()
replace CNMubiTianjing = runiform()
*----- what you want -----
reshape long CNMubi, i(year) j(city) string
sort city year
list, sepby(city)
Notice the string option, since j() contains string values.
The result is:
. sort city year
. list, sepby(city)
     +----------------------------+
     | year       city     CNMubi |
     |----------------------------|
  1. | 1998    Beijing    .658855 |
  2. | 1999    Beijing    .494634 |
     |----------------------------|
  3. | 1998   Tianjing   .0204465 |
  4. | 1999   Tianjing   .0454614 |
     +----------------------------+
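To end up with the city, year, tax_rev layout asked for in the question, the stub can be renamed after reshaping. This is a sketch with made-up numbers; tax_rev is simply the name used in the question's target layout.

```stata
clear
* hypothetical values standing in for tax revenue
input year CNMubiBeijing CNMubiShanghai
1998 1 3
1999 2 4
end
* string tells reshape that the j() suffixes are text, not numbers
reshape long CNMubi, i(year) j(city) string
rename CNMubi tax_rev
order city year tax_rev
sort city year
list, sepby(city)
```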

Stata: Aggregating by week

I have a dataset that has a date variable with missing dates.
var1
15sep2014
15sep2014
17sep2014
18sep2014
22sep2014
22sep2014
22sep2014
29sep2014
06oct2014
I aggregated the data using this command.
gen week = week(var1)
and the results look like this
var1 week
15sep2014 37
15sep2014 37
17sep2014 38
18sep2014 38
22sep2014 38
I was wondering whether it would be possible to get the month name and year in the week variable.
In general, week() is part of the solution if and only if you define your weeks according to Stata's rules for weeks. They are
Week 1 of the year starts on January 1, regardless.
Week 2 of the year starts on January 8, regardless.
And so on, except that week 52 of the year includes 8 or 9 days, depending on whether the year is leap or not.
Do you use these rules? I guess not. Then the simplest practice is to define a week by whichever day starts the week. If your weeks start on Sundays, then use the rule (dailydate - dow(dailydate)). If your weeks start on Mondays, ..., Saturdays, adjust the definition.
. clear
. input str9 svar1
svar1
1. "15sep2014"
2. "15sep2014"
3. "17sep2014"
4. "18sep2014"
5. "22sep2014"
6. "22sep2014"
7. "22sep2014"
8. "29sep2014"
9. "06oct2014"
10. end
. gen var1 = daily(svar1, "DMY")
. gen week = var1 - dow(var1)
. format week var1 %td
. list
     +-----------------------------------+
     |     svar1        var1        week |
     |-----------------------------------|
  1. | 15sep2014   15sep2014   14sep2014 |
  2. | 15sep2014   15sep2014   14sep2014 |
  3. | 17sep2014   17sep2014   14sep2014 |
  4. | 18sep2014   18sep2014   14sep2014 |
  5. | 22sep2014   22sep2014   21sep2014 |
     |-----------------------------------|
  6. | 22sep2014   22sep2014   21sep2014 |
  7. | 22sep2014   22sep2014   21sep2014 |
  8. | 29sep2014   29sep2014   28sep2014 |
  9. | 06oct2014   06oct2014   05oct2014 |
     +-----------------------------------+
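If your weeks start on Mondays instead, one possible adjustment is sketched below. The variable name weekmon and the three dates are chosen for illustration. Because dow() returns 0 for Sunday through 6 for Saturday, mod(dow() + 6, 7) counts the days elapsed since the most recent Monday.

```stata
clear
input str9 svar1
"14sep2014"
"15sep2014"
"17sep2014"
end
gen var1 = daily(svar1, "DMY")
* subtract the number of days since the most recent Monday
gen weekmon = var1 - mod(dow(var1) + 6, 7)
format var1 weekmon %td
list
```

Here the Sunday, 14sep2014, belongs to the week starting Monday 08sep2014, while the other two dates belong to the week starting 15sep2014.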
Much more discussion here, here and here, although the first should be sufficient.
Instead of using the week() function, I would probably use the wofd() function to transform your %td daily date into a %tw weekly date. Then you can just play with the datetime display formats to decide exactly how to format the date. For example:
gen date_weekly = wofd(var1)
format date_weekly %twww:_Mon_ccYY
That code should give you this:
var1 date_weekly
15sep2014 37: Sep 2014
15sep2014 37: Sep 2014
17sep2014 38: Sep 2014
18sep2014 38: Sep 2014
22sep2014 38: Sep 2014
This help file will be useful:
help datetime display formats
And if you want to brush up on the difference between %tw and %td dates, you might refresh yourself here:
help datetime

Compare each obs with rest of its sub-group

I have the following goal regarding my data structure
group; month; year; next_year
1; February; 2014; 0
1; March; 2006; 0
1; November; 2013; 1
2; January; 2014; 0
3; January; 2004; 0
I do have group, month and year, however the column next_year needs to be generated from the first three: For each observation, I want to check if there is another observation within the same group that has a date-value which falls into the period of next year. If so, I want to set the value of next_year to 1, otherwise to 0 (see example).
I started by converting the date into a format that Stata can interpret - using ym(year, month) - such that I can make comparisons. However, I am not sure how to iterate over all observations within the group in order to determine if that is the case or not.
I would know how to do it in e.g. Java, but I don't for Stata. I suppose I should not start with loops as there are probably some implemented commands for such problems.
If you want to check if there is a following observation within the next 12 months, you can try:
clear
set more off
*----- example data -----
input group str8 month year
1 March 2006
1 March 2013
1 November 2013
1 January 2013
2 January 2014
3 January 2004
end
*----- what you want -----
gen dat = monthly(month + string(year), "MY")
format dat %tm
bysort group (dat): gen next = dat[_n+1] - dat <= 12
list, sepby(group)
Make sure you understand the difference between Nick's code and mine. They work under different assumptions. You can check the differences running both pieces of code with the data I have provided (originally Nick's but with one observation deleted to get the point across; by chance, if you use Nick's data without the modification, the results will be the same).
You are correct in avoiding an explicit loop. This kind of problem is soluble using by:.
I modified your example to have two observations for one group in one year.
clear
input group str8 month year
1 February 2014
1 March 2006
1 March 2013
1 November 2013
2 January 2014
3 January 2004
end
bysort group (year) : gen next_year = year[_n+1] == year + 1
bysort group year (next_year) : replace next_year = next_year[_N]
list, sepby(group)
     +------------------------------------+
     | group      month   year   next_y~r |
     |------------------------------------|
  1. |     1      March   2006          0 |
  2. |     1   November   2013          1 |
  3. |     1      March   2013          1 |
  4. |     1   February   2014          0 |
     |------------------------------------|
  5. |     2    January   2014          0 |
     |------------------------------------|
  6. |     3    January   2004          0 |
     +------------------------------------+
Getting an explicit sort order is essential. Within group, we look ahead to see if the next year is the current year plus 1, assigning 1 if true and 0 if false. That will at most be true for the last observation for a given group and year. If there is more than one observation for each group and year, any occurrence of 1 must be spread to all such observations.
For a tutorial on by:, see Speaking Stata: How to move step by: step.
The assumption here is that you mean in the next calendar year, not in the next 12 months. Making your dates into Stata monthly dates will be needed for most other problems, but doesn't make this one easier. Here is one way to do that in your situation, assuming that month is string and year is numeric:
gen mdate = monthly(month + string(year), "MY")
format mdate %tm
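To see how the two readings of the question differ, here is a sketch that applies both rules to one small dataset. The data and the variable name next12 are made up for illustration: group 1 has observations 13 months apart but in consecutive calendar years, and group 2 has observations 11 months apart in the same calendar year.

```stata
clear
input group str8 month year
1 November 2013
1 December 2014
2 January 2013
2 December 2013
end
gen mdate = monthly(month + string(year), "MY")
format mdate %tm
* rule A: is there a later observation within the following 12 months?
bysort group (mdate): gen next12 = mdate[_n+1] - mdate <= 12
* rule B: is there an observation in the next calendar year?
bysort group (year): gen next_year = year[_n+1] == year + 1
bysort group year (next_year): replace next_year = next_year[_N]
list, sepby(group)
```

Under these assumptions rule A flags only the January 2013 observation in group 2, while rule B flags only the November 2013 observation in group 1.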