Invalid format for date labels in graph - stata

I'm only starting with Stata, but I have already gone through a lot of the available pages to find an answer to this.
I'm using a simple dataset with two variables, aa and bb:
aa is formatted as %td
bb is formatted as %8.0g
The graph command I've been using is as follows:
graph twoway tsline bb, ///
title("Numbers by Day", size(medsmall)) ///
ytitle("Value", size(small)) ///
xtitle("Date", size(small)) ///
ysize(2) ///
xsize(4) ///
tlabel(#15, labsize(vsmall), format(%tcD_m_CY)) ///
ylabel(#10, labsize(vsmall))
I am trying to format the dates as something other than 21jan2016, but whatever I put in the format() function I get an "Invalid Date" error.
%tcD_m_CY is just an example from the Stata forum: I tried double quotes and other things and it all fails.
(I did use tsset first to define the date axis.)

Your question lacks a minimal, complete, verifiable example in so far as (1) it lacks data we can read in and (2) several details in your example are irrelevant to your problem. See https://stackoverflow.com/help/mcve
Your example shows an argument fed to the format() suboption of tlabel() (a suboption, not a function) which starts with %tc: this insists to Stata that values which have been input as daily dates (and counted as the number of days from an origin of 0 = 1 January 1960) are to be interpreted as date-times (and counted as the number of milliseconds from an origin of 0 = 01jan1960 00:00:00).
So, by your instruction, a daily date such as 25 July 2016 (which is held as 20660 given that origin) is to be displayed as if it were a date-time. Such a date-time is only about 21 seconds after the start of 1 January 1960; and the rest of your display format, D_m_CY, says "just show me day, month and year", and the day, month and year are, as said, 1 January 1960 according to this instruction.
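If it helps, a quick check along these lines (a sketch; 20660 is the daily-date value for 25 July 2016 mentioned above) shows both readings:
display %td 20660
* 20660 read as days since 01jan1960: 25jul2016
display %tcD_m_CY 20660
* 20660 read as milliseconds since 01jan1960 00:00:00, so the day, month and year shown are 1 January 1960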
I see nothing invalid in your date format so far as Stata is concerned; the problem is human, that it is not at all what you want. Naturally, I can't explain exactly what was wrong with whatever other code you tried and don't show us.
The fake data and code below illustrate some technique. For daily dates, labelling every day is usually a bad idea with more than about a week's worth of data, as you just don't have enough space; similarly showing the same year again and again is usually unnecessary and a poor use of space. An axis title such as "Date" is superfluous so long as dates are clearly given. These points apply whatever software you are using.
clear
set obs 15
gen aa = daily("30 Jun 2016", "DMY") + _n
format aa %td
mat bb = (12, 14, 10, 8, 6, 8, 9, 11, 13, 15, 17, 19, 21, 23, 25)
gen bb = bb[1, _n]
tsset aa
graph twoway tsline bb, ///
title("Numbers by Day") ytitle("Value") xtitle("") ///
tlabel(#7, format(%tdd_M))
It's your graph, but the bottom line is simple: daily dates will need some kind of %td format, and %tc format is utterly wrong, on a par with confusing cents and millions of dollars as units.
You don't say exactly what you read, but this is well documented: help datetime in Stata and whatever it points to are all you need to study.
Note also http://www.statalist.org/forums/help#spelling

Related

How do I format text column with year to date format in Power BI

See Attached Image. I imported data with a date column (red area). I realized I needed to change it to date format so I created a new column and entered the formula (blue area) to change the column to a date format, however if you look at the green area, the year is not lining up starting from 2013. How can I fix this? Also my data points are annual so I don't need the "day", "month" format, I just want the year if possible.
This formula is trying to parse a textual representation of a date, while here you have only the year part. Also, parsing is unnecessarily complicated for your case. Just use the DATE function to construct a date from its parts, like this:
dateFormatted = DATE([Calendar Year], 1, 1)
This should give you a January 1st date in each year (which I guess is the desired result).

Calculate the number of firms at a given month

I'm working on a dataset in Stata
The first column is the name of the firm, the second column is the start date of the firm, and the third column is its expiration date. If the expiration date is missing, the firm is still in business. I want to create a variable that records the number of firms in existence at a given time (preferably as a monthly variable).
I'm really lost here. Please help!
Next time, try using dataex (ssc install dataex) rather than a screenshot; this is recommended in the Stata tag wiki and will help others help you!
Here is an example of how to count the number of firms that are alive in each period (I'll use years; a sketch at the end shows how to switch to months). This example borrows from Nick Cox's Stata Journal article on this topic.
First, load the data:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long(firmID dt_start dt_end)
3923155 20080123 99991231
2913168 20070630 99991231
3079566 20000601 20030212
3103920 20020805 20070422
3357723 20041201 20170407
4536020 20120201 20170407
2365954 20070630 20190630
4334271 20110721 20191130
4334338 20110721 20170829
4334431 20110721 20190429
end
Note that in my example data the dates are not in Stata date format, so I'll convert them here:
tostring dt_start, replace
generate startdate=date(dt_start, "YMD")
tostring dt_end, replace
generate enddate=date(dt_end, "YMD")
format startdate enddate %td
Next make a variable with the time interval you'd like to count within:
generate startyear = year(startdate)
generate endyear = year(enddate)
In my dataset the missing end dates begin with '9999', while you have them as '.'. I'll set these to the current year, the assumption being that the dataset is current. You'll have to decide whether this is appropriate in your data.
replace endyear = year(date("$S_DATE","DMY")) if endyear == 9999
Next create an observation for the first and last years (or months) that the firm is alive:
expand 2
by firmID, sort: generate year = cond(_n == 1, startyear, endyear)
keep firmID year
duplicates drop // keeps one observation for firms that die in the period they were born
Now expand the dataset to have an observation for every period between the start and end date. For this I use tsfill.
xtset firmID year
tsfill
Now I have one observation per existing firm in each period. All that remains is to count the observations by year:
egen entities = count(firmID), by(year)
drop firmID
duplicates drop
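To switch to months, as mentioned at the start, the same steps work with monthly dates. A rough sketch (run from the point where startdate and enddate were created, making the same assumption about end years of 9999; mofd() converts daily dates to monthly dates):
generate startmonth = mofd(startdate)
generate endmonth = mofd(enddate)
replace endmonth = mofd(date("$S_DATE", "DMY")) if year(enddate) == 9999
format startmonth endmonth %tm
expand 2
by firmID, sort: generate month = cond(_n == 1, startmonth, endmonth)
keep firmID month
duplicates drop
xtset firmID month
tsfill
egen entities = count(firmID), by(month)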

xline option when date is formatted %th?

I'm doing a connected twoway plot with x-axis as dates formatted as %th with values 2011h1 to 2017h2. I want to put a vertical line at 2016h2 but nothing I've tried has worked.
xline(2016h2)
xline("2016h2")
xline(date==2016h2)
xline(date=="2016h2")
I'm thinking it might be because I formatted dates with
gen date = yh(year, half)
format date %th
I think this is a MWE:
age1820 date
10.42 2011h1
10.33 2011h2
11.66 2012h1
11.01 2012h2
14.29 2013h1
10.95 2013h2
12.42 2014h1
7.04 2014h2
7.07 2015h1
6.95 2015h2
4 2016h1
8.07 2016h2
5.98 2017h1
3.19 2017h2
graph twoway connected age1820 date, xline(2016h2)
Your example will not really work as written without some additional work. I think in future posts you may want to shoot for a fully working example to maximize the chance that you get a good answer quickly. This is why I made up some fake data below.
Try something like this:
clear
set obs 20
gen date = _n + 100
format date %th
gen age = _n*2
display %th 116
display %th 117
tw connected age date, xline(116 `=th(2018h2)') tline(2019h1)
The crux of the matter is that Stata stores dates as integers, and the format command only controls how they are displayed (it is a display format, not a value label). For example, with %th, 0 corresponds to 1960h1. In other words, you need to do one of the following:
1. tell xline() the number that corresponds to the date you want;
2. use th() to figure out what that number is and force the evaluation inside xline(); or
3. use tline(), which is smart enough to understand dates.
I think the third is the best option.
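Applied to the variables in the question (a sketch, assuming age1820 and date as defined there), either of these should do it:
graph twoway connected age1820 date, tline(2016h2)
graph twoway connected age1820 date, xline(`=th(2016h2)')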

How to destring a date in Stata containing just the year?

I have a string variable in Stata called YEAR with format "aaaa" (e.g. 2011). I want to replace "aaaa" with "31decaaaa" and destring the obtained variable.
My feeling is that the best way to proceed could be first to destring the variable YEAR and then add "31dec". To destring the variable YEAR I have tried the command date, but it does not seem to work. Any suggestions?
It would be best to describe your eventual goal here, as use of destring just appears to be what you have in mind as the next step.
If your goal is, given a string variable year, to produce a daily date variable for 31 December in each year, then destring is not necessary. Here are three ways to do it:
gen date = daily("31 Dec" + year, "DMY")
gen date = date("31 Dec" + year, "DMY")
gen date = mdy(12, 31, real(year))
Incidentally, there is no likely gain for Stata use in daily dates 365 or 366 days apart, as they just create a time series that is mostly implicit gaps.
If your data are yearly, but just associated with the end of each calendar year, keep them as yearly and use a display format to show "31 Dec", or the equivalent, in output.
. di %ty!3!1_!D!e!c_CCYY 2015
31 Dec 2015
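To carry that into the data rather than just display it, a minimal sketch (assuming the string variable year as above; the name nyear is arbitrary) is to make the year numeric and attach such a display format:
generate nyear = real(year)
format nyear %ty!3!1_!D!e!c_CCYY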
Detail. date() is a function, not a command, in Stata. We can't comment on "does not seem to work" as no details are given of what you tried or what happened. daily() is just a synonym for date().

Filter data to get only first day of the month rows

I have a dataset of daily data. I need to get only the data for the first day of each month in the dataset (the data run from 1972 to 2013). So, for example, I would need index 20, date 2013-12-02, value 0.1555 to be extracted.
The problem I have is that the first day of each month differs, so I cannot use a step such as relativedelta(months=1). How would I go about extracting these values from my dataset?
Is there a command similar to the one I found in another post for R?
R - XTS: Get the first dates and values for each month from a daily time series with missing rows
17 2013-12-05 0.1621
18 2013-12-04 0.1698
19 2013-12-03 0.1516
20 2013-12-02 0.1555
21 2013-11-29 0.1480
22 2013-11-27 0.1487
23 2013-11-26 0.1648
I would groupby the month and then get the zeroth (nth) row of each group.
First set as index (I think this is necessary):
In [11]: df1 = df.set_index('date')
In [12]: df1
Out[12]:
n val
date
2013-12-05 17 0.1621
2013-12-04 18 0.1698
2013-12-03 19 0.1516
2013-12-02 20 0.1555
2013-11-29 21 0.1480
2013-11-27 22 0.1487
2013-11-26 23 0.1648
Next sort, so that the first element is the first date of that month (Note: this doesn't appear to be necessary for nth, but I think that's actually a bug!):
In [13]: df1.sort_index(inplace=True)
In [14]: df1.groupby(pd.TimeGrouper('M')).nth(0)
Out[14]:
n val
date
2013-11-26 23 0.1648
2013-12-02 20 0.1555
Another option is to resample and take the first entry:
In [15]: df1.resample('M', 'first')
Out[15]:
n val
date
2013-11-30 23 0.1648
2013-12-31 20 0.1555
Thinking about this, you can do this much more simply by extracting the month and then grouping by that:
In [21]: pd.DatetimeIndex(df.date).to_period('M')
Out[21]:
<class 'pandas.tseries.period.PeriodIndex'>
[2013-12, ..., 2013-11]
Length: 7, Freq: M
In [22]: df.groupby(pd.DatetimeIndex(df.date).to_period('M')).nth(0)
Out[22]:
n date val
0 17 2013-12-05 0.1621
4 21 2013-11-29 0.1480
This time the sortedness of df.date is (correctly) relevant; if you know it's in descending date order you can use nth(-1):
In [23]: df.groupby(pd.DatetimeIndex(df.date).to_period('M')).nth(-1)
Out[23]:
n date val
3 20 2013-12-02 0.1555
6 23 2013-11-26 0.1648
If this isn't guaranteed then sort by the date column first: df.sort('date').
One way is to add a column for the year, month and day:
df['year'] = df.SomeDatetimeColumn.map(lambda x: x.year)
df['month'] = df.SomeDatetimeColumn.map(lambda x: x.month)
df['day'] = df.SomeDatetimeColumn.map(lambda x: x.day)
Then group by the year and month, order by day, and take only the first entry (which will be the minimum day entry).
df.groupby(
['year', 'month']
).apply(lambda x: x.sort_values('day', ascending=True).head(1))
The use of the lambda expressions makes this less than ideal for large data sets. You may not wish to grow the size of the data by keeping separately stored year, month, and day values. However, for these kinds of ad hoc date alignment problems, sooner or later having these values separated is very helpful.
Another approach is to group directly by a function of the datetime column:
dfrm.groupby(
by=dfrm.dt.map(lambda x: (x.year, x.month))
).apply(lambda x: x.sort_values('dt', ascending=True).head(1))
Normally these problems arise because of a dysfunctional database or data storage schema that exists one level prior to the Python/pandas layer.
For example, in this situation, it should be commonplace to rely on the existence of a calendar database table or calendar data set which contains (or makes it easy to query for) the earliest active date in a month relative to the given data set (such as the first trading day, the first weekday, the first business day, the first holiday, or whatever).
If a companion database table exists with this data, it should be easy to combine it with the dataset you already have loaded (say, by joining on the date column you already have) and then it's just a matter of applying a logical filter on the calendar data columns.
This becomes especially important once you need to use date lags: for example, lining up a company's 1-month-ago market capitalization with the company's current-month stock return, to calculate a total return realized over that 1-month period.
This can be done by lagging the columns in pandas with shift, or trying to do a complicated self-join that is likely very bug prone and creates the problem of perpetuating the particular date convention to every place downstream that uses data from that code.
Much better to simply demand (or do it yourself) that the data must have properly normalized date features in its raw format (database, flat files, whatever) and to stop what you are doing, fix that date problem first, and only then get back to carrying out some analysis with the date data.
import pandas as pd
dates = pd.date_range('2014-02-05', '2014-03-15', freq='D')
df = pd.DataFrame({'vals': range(len(dates))}, index=dates)
g = df.groupby(lambda x: x.strftime('%Y-%m'))
g.apply(lambda x: x.index.min())
# Or, depending on whether you want the index or the vals:
g.apply(lambda x: x.loc[x.index.min()])
The above didn't work for me because I needed more than one row per month where the number of rows every month could change. This is what I did:
dates_month = pd.bdate_range(df['date'].min(), df['date'].max(), freq='1M')
df_mth = df[df['date'].isin(dates_month)]