Conversion of date-time data - Stata

Managed to answer the question, though not by using 'help datetime' (already did that) or by reading N.Cox's 'Speaking Stata: On numbers and strings'.
Solution:
gen dob_ymd_nn = date(dob_ymd,"DMYhm")
format dob_ymd_nn %td
Thank you
My Stata variable dob_dmy holds each participant's date of birth. The database unfortunately appended a time (all read 00:00). It is currently a string variable (str16). When I sort, it sorts on the day first rather than on the actual date. See below:
63. | 01/01/1975 00:00 |
64. | 01/01/1985 00:00 |
65. | 01/02/2010 00:00 |
I would like to drop the time and change to a format that will allow me to sort by actual date.

@Stan indicated the main idea, that you must convert from a string to a numeric date variable. @Roberto Ferrer underlined that this is all documented prominently within Stata itself. No internet search is needed.
Using your data as a sandbox (you can easily create one yourself in future questions using dataex (SSC)) and taking the hint in the variable name that the dates run day, month, year, we can just ignore the useless time of day with substr() and pass the useful part to daily(). Add a date format for readability and then sorting works as desired.
. clear
. input str16 sdate
sdate
1. "01/02/2010 00:00"
2. "01/01/1985 00:00"
3. "01/01/1975 00:00"
4. end
. gen ddate = daily(substr(sdate, 1, 10), "DMY")
. format ddate %td
. sort ddate
. list
     +------------------------------+
     |            sdate       ddate |
     |------------------------------|
  1. | 01/01/1975 00:00   01jan1975 |
  2. | 01/01/1985 00:00   01jan1985 |
  3. | 01/02/2010 00:00   01feb2010 |
     +------------------------------+
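For readers who want to see the logic outside Stata, here is a minimal Python sketch (not Stata code) of the same idea: comparing the raw strings orders day-first dates by day, whereas parsing the first 10 characters as day/month/year gives chronological order. The names to_date, as_strings and as_dates are illustrative, and the extra value "02/01/1960 00:00" is a made-up row added only to make the difference visible.

```python
from datetime import datetime

# The three values of sdate from the example, plus one hypothetical row
# (02/01/1960) that exposes the problem with string sorting.
sdate = [
    "01/02/2010 00:00",
    "01/01/1985 00:00",
    "01/01/1975 00:00",
    "02/01/1960 00:00",  # hypothetical, not in the original data
]

def to_date(s):
    """Keep the first 10 characters and parse them as day/month/year."""
    return datetime.strptime(s[:10], "%d/%m/%Y").date()

# Raw strings are compared character by character, so the day decides first.
as_strings = sorted(sdate)

# Sorting on the parsed date gives true chronological order.
as_dates = sorted(sdate, key=to_date)

print(as_strings)
print(as_dates)
```

In the string sort the 1960 row lands last because its day is "02"; in the date sort it comes first.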

If you're storing dates as strings in MM/DD/YYYY format, you won't be able to sort them, except by month, then day, then year (which isn't very helpful). You need to convert them to dates, and THEN sort them.
From the following link:
// clock() returns milliseconds, so store the result as a double to avoid rounding
gen double date_obs = clock(datetime_obs, "MD20Yhm") // Obviously you have 4-digit years, so you would change this to "MDYhm" ("DMYhm" if the day comes first, as in the question)
format date_obs %tc
http://www.stata.com/statalist/archive/2013-08/msg01434.html

Related

Power BI calculate sum only last value of duplicate ID

I'm struggling to create a Measure that sums a column and have it filter out duplicate IDs while taking only the latest row.
For example, there is a table as such:
UID | Quantity | Status | StatusDate
aaa | 3 | Shipped | 11/1/2020
aaa | 3 | Delivered | 11/5/2020
bbb | 5 | Ordered | 10/29/2020
ccc | 8 | Shipped | 11/4/2020
So the idea would be to sum the quantity, but I would only want to count quantity for id "aaa" once and only count towards the latest status ("Delivered" in this case). I would make a visual that shows the quantities with status as its axis. I also need to add a date Slicer so I could go back in time. So, when I go before 11/5/2020, instead of "Delivered," it would switch back to "Shipped."
I tried several methods:
1. SUMMARIZE to a table, filtering on the "MAX" date value where the UID is the same. I found this doesn't work with the date slicer, since it does not seem to recalculate the filtering and just slices away rows outside of the dates. It seems to be the same whether the SUMMARIZE is set as a new table or as a VAR in the Measure.
2. CALCULATE seems promising, but I can't figure out a syntax that filters the way I need. Example of one that doesn't work (I also tried SUMX instead of SUM but that doesn't work either):
CALCULATE(
SUM(Table1[Quantity]),
FILTER(Table1, [StatusDate]=MAXX(FILTER(Table1,[UID]=EARLIER([UID])),[StatusDate])
)
3. I also tried adding a column that states whether the row is "old", as well as a numerical "rank" for the different statuses. But once again, I run into the issue where the date slicer does not trigger a recalculation of those columns. For example, if the date slicer is set to 11/3/2020, it should add "3" to "Shipped" instead of "Delivered." Instead, it just removes the row, which tells me that it is not actually recalculating the columns (like #1).
Any help would be appreciated :-) Thank you!
You can try something like this:
Measure =
VAR d = LASTDATE(Table1[StatusDate])
VAR tb =
    SUMMARIZE(
        FILTER(Table1, Table1[StatusDate] <= d),
        Table1[UID],
        "last", LASTDATE(Table1[StatusDate])
    )
RETURN
    CALCULATE(SUM(Table1[Quantity]), TREATAS(tb, Table1[UID], Table1[StatusDate]))
The tb variable contains a table which has the latest date per UID. You then use that to filter your main table with the TREATAS function.
One other alternative is to create a table with the RANK function ordered by date and then do a SUM over that table, where Rank = 1.
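To make the intended result concrete, here is a small Python sketch (not DAX, and not a description of how the measure above is evaluated) of the logic the question asks for: keep only each UID's latest row on or before a cutoff date, then sum Quantity per Status. The function name quantity_by_status and the cutoff argument standing in for the date slicer are assumptions made for illustration.

```python
from datetime import date

# Rows from the question: (UID, Quantity, Status, StatusDate).
rows = [
    ("aaa", 3, "Shipped", date(2020, 11, 1)),
    ("aaa", 3, "Delivered", date(2020, 11, 5)),
    ("bbb", 5, "Ordered", date(2020, 10, 29)),
    ("ccc", 8, "Shipped", date(2020, 11, 4)),
]

def quantity_by_status(rows, cutoff):
    """Sum Quantity per Status, using only each UID's latest row on or before cutoff."""
    latest = {}
    for uid, qty, status, when in rows:
        if when > cutoff:
            continue  # rows after the cutoff are hidden, as a date slicer would do
        if uid not in latest or when > latest[uid][2]:
            latest[uid] = (qty, status, when)
    totals = {}
    for qty, status, _ in latest.values():
        totals[status] = totals.get(status, 0) + qty
    return totals

print(quantity_by_status(rows, date(2020, 11, 5)))
print(quantity_by_status(rows, date(2020, 11, 3)))
```

With the cutoff at 11/5/2020 the quantity for "aaa" counts toward "Delivered"; with the cutoff at 11/3/2020 it counts toward "Shipped", and "ccc" (first seen on 11/4/2020) is not counted at all.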

Split data into categories in the same row in Power BI

I have a table that contains multiple columns whose names have either the suffix _EXPECTED or _ACTUAL. For example, I'm looking at my sold items from my SoldItems table and I have the following columns: APPLES_EXPECTED, BANANAS_EXPECTED, KIWIS_EXPECTED, APPLES_ACTUAL, BANANAS_ACTUAL, KIWIS_ACTUAL (the identifier of the table is the date, so we have results per date). I want to show that data in table form, something like this (for a date selected in the filters):
+------------+----------+--------+
| Sold items | Expected | Actual |
+------------+----------+--------+
| Apples | 10 | 15 |
| Bananas | 8 | 5 |
| Kiwis | 2 | 1 |
+------------+----------+--------+
How can I manage something like this in Power BI? I tried playing with the matrix/table visualization; however, I can't figure out a way to merge all the expected and actual columns together.
It looks like the easiest option for you would be to mould the data a bit differently using Power Query. You can UNPIVOT your data so that all the expected and actual values become rows instead of columns. For example, take the following sample:
Date Apples_Expected Apples_Actual
1/1/2019 1 2
Once you unpivot this it will become:
Date Fruit Count
1/1/2019 Apples_Expected 1
1/1/2019 Apples_Actual 2
Once you unpivot, it should be fairly straightforward to get the view you are looking for. The following link should walk you through the steps to unpivot:
https://support.office.com/en-us/article/unpivot-columns-power-query-0f7bad4b-9ea1-49c1-9d95-f588221c7098
Hope this helps.
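As a rough illustration of what unpivoting does to the shape of the data, here is a Python sketch (not Power Query M). It also splits each column name on its last underscore to recover the item and the Expected/Actual label, which is a step the answer does not spell out and is an assumption here. The function names unpivot and pivot_kind are hypothetical; the date comes from the answer's sample and the numbers from the table in the question.

```python
# One wide row per date. Column names follow the question; values are the
# ones shown in the question's desired table.
wide = {
    "Date": "1/1/2019",
    "APPLES_EXPECTED": 10, "BANANAS_EXPECTED": 8, "KIWIS_EXPECTED": 2,
    "APPLES_ACTUAL": 15, "BANANAS_ACTUAL": 5, "KIWIS_ACTUAL": 1,
}

def unpivot(row, id_col="Date"):
    """Turn every non-id column into a (date, item, kind, value) record."""
    records = []
    for name, value in row.items():
        if name == id_col:
            continue
        # "APPLES_EXPECTED" -> ("APPLES", "EXPECTED")
        item, kind = name.rsplit("_", 1)
        records.append((row[id_col], item.capitalize(), kind.capitalize(), value))
    return records

def pivot_kind(records):
    """Regroup long records into {item: {"Expected": ..., "Actual": ...}}."""
    table = {}
    for _, item, kind, value in records:
        table.setdefault(item, {})[kind] = value
    return table

long_rows = unpivot(wide)
table = pivot_kind(long_rows)
for item, cells in table.items():
    print(item, cells["Expected"], cells["Actual"])
```

The long form has one record per original column, and regrouping on the item name yields one row per fruit with an Expected and an Actual value.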

xline option when date is formatted %th?

I'm doing a connected twoway plot with x-axis as dates formatted as %th with values 2011h1 to 2017h2. I want to put a vertical line at 2016h2 but nothing I've tried has worked.
xline(2016h2)
xline("2016h2")
xline(date==2016h2)
xline(date=="2016h2")
I'm thinking it might be because I formatted dates with
gen date = yh(year, half)
format date %th
I think this is a MWE:
age1820 date
10.42 2011h1
10.33 2011h2
11.66 2012h1
11.01 2012h2
14.29 2013h1
10.95 2013h2
12.42 2014h1
7.04 2014h2
7.07 2015h1
6.95 2015h2
4 2016h1
8.07 2016h2
5.98 2017h1
3.19 2017h2
graph twoway connected age1820 date, xline(2016h2)
Your example will not really work as written without some additional work. I think in future posts you may want to shoot for a fully working example to maximize the chance that you get a good answer quickly. This is why I made up some fake data below.
Try something like this:
clear
set obs 20
gen date = _n + 100
format date %th
gen age = _n*2
display %th 116
display %th 117
tw connected age date, xline(116 `=th(2018h2)') tline(2019h1)
The crux of the matter is that Stata stores dates as integers, and the format command only attaches a display format to them (this is not a value label). For example, 0 corresponds to 1960h1. In other words, you need to do one of the following:
tell xline() the number that corresponds to the date you want;
use th() to figure out what that number is and force the evaluation inside xline(); or
use tline(), which is smart enough to understand dates.
I think the third is the best option.
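The integer behind a half-yearly date can be worked out by hand: it is the number of half-years elapsed since 1960h1. The Python sketch below (not Stata code) mirrors that arithmetic; the function names yh and fmt_th are stand-ins chosen to echo the Stata names and are not Stata's own implementations.

```python
def yh(year, half):
    """Half-years elapsed since 1960h1, so that 1960h1 maps to 0."""
    if half not in (1, 2):
        raise ValueError("half must be 1 or 2")
    return (year - 1960) * 2 + (half - 1)

def fmt_th(n):
    """Inverse mapping: turn the integer back into a label such as '2016h2'."""
    years_since_1960, half_index = divmod(n, 2)
    return f"{1960 + years_since_1960}h{half_index + 1}"

# The date asked about in the question.
print(yh(2016, 2))

# The two integers displayed in the answer's example.
print(fmt_th(116), fmt_th(117))
```

Under this arithmetic 2016h2 is 113, which is the number xline() would need for the plot in the question, and 116 and 117 correspond to 2018h1 and 2018h2.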

Stata: Gaps between dates

I have a situation where I need to order several dates to see if there is a gap in coverage. My data set looks like this, where id is the panel id and start and end are dates.
id start end
a 01.01.15 02.01.15
a 02.01.15 03.01.15
b 05.01.15 06.01.15
b 07.01.15 08.01.15
b 06.01.15 07.01.15
I need to identify any cases where there is a gap in coverage, meaning cases where the 2nd start date for an id is greater than the first end date for the same id. It should also be noted that the same id can have an undetermined number of observations, and they might not be in any particular order. I wrote the code below for a case where there are only two observations per id.
bys id: gen y=1 if end < start[_n+1]
However, this code does not produce the desired results. I'm thinking that there should be another way to approach this problem.
Your approach seems sound in essence, assuming that your date variables are really Stata daily date variables formatted suitably. You don't explain at all what "does not produce the desired results" means to you.
The code below creates a sandbox similar to your example, but with string variables converted to daily dates.
Key details include:
Observations must be sorted by date within panel.
The start date for the observation after the last in each panel, start[_n+1], would always be returned as missing, and so as greater than any known date. The code here returns the corresponding indicator as missing.
clear
input str1 id str8 (s_start s_end)
a "01.01.15" "02.01.15"
a "02.01.15" "03.01.15"
b "05.01.15" "06.01.15"
b "07.01.15" "08.01.15"
b "06.01.15" "07.01.15"
b "10.01.15" "12.01.15"
end
foreach v in start end {
gen `v' = daily(s_`v', "DMY", 2050)
format `v' %td
}
// the important line here
bysort id (start) : gen first = end < start[_n+1] if _n < _N
list , sepby(id)
     +----------------------------------------------------------+
     | id    s_start      s_end       start         end   first |
     |----------------------------------------------------------|
  1. |  a   01.01.15   02.01.15   01jan2015   02jan2015       0 |
  2. |  a   02.01.15   03.01.15   02jan2015   03jan2015       . |
     |----------------------------------------------------------|
  3. |  b   05.01.15   06.01.15   05jan2015   06jan2015       0 |
  4. |  b   06.01.15   07.01.15   06jan2015   07jan2015       0 |
  5. |  b   07.01.15   08.01.15   07jan2015   08jan2015       1 |
  6. |  b   10.01.15   12.01.15   10jan2015   12jan2015       . |
     +----------------------------------------------------------+
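The same gap check can be written out step by step in Python (a sketch of the logic, not Stata code): parse the strings as dates, sort by start date within each id, and compare each end date with the next start date. The function names parse and flag_gaps are illustrative, and None plays the role of Stata's missing value for the last row of each id.

```python
from datetime import datetime
from itertools import groupby

# Same rows as the sandbox: (id, start, end) as day.month.year strings.
raw = [
    ("a", "01.01.15", "02.01.15"),
    ("a", "02.01.15", "03.01.15"),
    ("b", "05.01.15", "06.01.15"),
    ("b", "07.01.15", "08.01.15"),
    ("b", "06.01.15", "07.01.15"),
    ("b", "10.01.15", "12.01.15"),
]

def parse(s):
    # %y maps two-digit years 00-68 to 2000-2068, so "15" becomes 2015.
    return datetime.strptime(s, "%d.%m.%y").date()

def flag_gaps(raw):
    """Return (id, start, end, gap) rows; gap is None for the last row of each id."""
    # Sorting the tuples orders by id first, then by start date within id.
    rows = sorted((i, parse(s), parse(e)) for i, s, e in raw)
    out = []
    for _, group in groupby(rows, key=lambda r: r[0]):
        group = list(group)
        for k, (i, start, end) in enumerate(group):
            if k + 1 < len(group):
                # 1 when the next spell starts after this one ends
                gap = int(end < group[k + 1][1])
            else:
                gap = None  # no following row to compare with
            out.append((i, start, end, gap))
    return out

for row in flag_gaps(raw):
    print(row)
```

The flags come out in the same pattern as the listing: the only gap is in id b, between the spell ending 08jan2015 and the one starting 10jan2015.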

Sort observations in a custom order

I have a dataset that results from the joins between a few results from a proc univariate.
After some more joins, I have a final dataset with a variable called "Measure", which has the name of certain measures, like 'mean' and 'standard deviation', for example, and other variables each with values for these measures, representing a month in a certain year.
I'd like to sort these measures in a particular order and, for now, I'm doing a proc transpose, doing a retain to establish the order I want, and doing another transpose. The problem is that this is a really naive solution and I feel it just takes longer than it should.
Is there a simpler/more effective way to do this sort?
An example of what I want to do, with random values:
What I have:
Measures | 2013/01 | 2013/02 | 2013/03
Mean | 10 | 9 | 11
Std Devi.| 1 | 1 | 1
Median | 3 | 5 | 4
What I want:
Measures | 2013/01 | 2013/02 | 2013/03
Std Devi.| 1 | 1 | 1
Median | 3 | 5 | 4
Mean | 10 | 9 | 11
I hope I was clear enough.
Thanks in advance
A couple of straightforward solutions. First, you could simply add a variable that you sort by and then drop. You don't need to transpose; just do it in the data step or PROC SQL after the join: if measures='Mean' then sortorder=3; else if measures='Median' then sortorder=2; ... Then sort by sortorder and drop it in the PROC SORT step. Note that the comparison is case-sensitive, so the literal must match the stored value exactly.
Second, if you're using entirely numeric values, you can use PROC MEANS to do the sorting for you, with a custom format that defines the order (using NOTSORTED and order=data on the class statement) and idgroup functionality in PROC MEANS to do the sorting and output the right values. This is overkill in most cases, but if the dataset is huge it might be appropriate.
Third, if you're doing the joins in SQL, you can ORDER BY a variable whose values you assign in the order you want. I can explain that in more detail if you find that the most useful.
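The first suggestion (a helper sort key that is used once and then discarded) can be sketched in Python as follows. This is not SAS code; the dictionary sort_order is an illustrative stand-in for the sortorder variable, and the rows are the ones from the "What I have" table.

```python
# Rows from the "What I have" table: (measure, 2013/01, 2013/02, 2013/03).
rows = [
    ("Mean", 10, 9, 11),
    ("Std Devi.", 1, 1, 1),
    ("Median", 3, 5, 4),
]

# Desired position of each measure, as in the "What I want" table.
sort_order = {"Std Devi.": 1, "Median": 2, "Mean": 3}

# Measures missing from sort_order are placed last instead of raising KeyError.
ordered = sorted(rows, key=lambda r: sort_order.get(r[0], len(sort_order) + 1))

for row in ordered:
    print(row)
```

The key is computed on the fly, so nothing has to be dropped afterwards; in SAS the equivalent is the temporary sortorder variable described above.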