Stata: Gaps between dates

I have a situation where I need to order several dates to see if there is a gap in coverage. My data set looks like this, where id is the panel id and start and end are dates.
id start end
a 01.01.15 02.01.15
a 02.01.15 03.01.15
b 05.01.15 06.01.15
b 07.01.15 08.01.15
b 06.01.15 07.01.15
I need to identify any cases where there is a gap in coverage, meaning when the second start date for an id is later than the first end date for the same id. It should also be noted that the same id can have an undetermined number of observations, and they might not be in any particular order. I wrote the code below for a case where there are only two observations per id.
bys id: gen y=1 if end < start[_n+1]
However, this code does not produce the desired results. I'm thinking that there should be another way to approach this problem.

Your approach seems sound in essence, assuming that your date variables are really Stata daily date variables formatted suitably. You don't explain at all what "does not produce the desired results" means to you.
The code below creates a sandbox similar to your example, but with string variables converted to daily dates.
Key details include:
Observations must be sorted by date within panel.
For the last observation in each panel, start[_n+1] refers to a nonexistent observation and so is returned as missing, which counts as greater than any known date. The code here therefore returns the corresponding indicator as missing rather than as 0 or 1.
clear
input str1 id str8 (s_start s_end)
a "01.01.15" "02.01.15"
a "02.01.15" "03.01.15"
b "05.01.15" "06.01.15"
b "07.01.15" "08.01.15"
b "06.01.15" "07.01.15"
b "10.01.15" "12.01.15"
end
foreach v in start end {
    gen `v' = daily(s_`v', "DMY", 2050)
    format `v' %td
}
// the important line here
bysort id (start) : gen first = end < start[_n+1] if _n < _N
list , sepby(id)
+----------------------------------------------------------+
| id s_start s_end start end first |
|----------------------------------------------------------|
1. | a 01.01.15 02.01.15 01jan2015 02jan2015 0 |
2. | a 02.01.15 03.01.15 02jan2015 03jan2015 . |
|----------------------------------------------------------|
3. | b 05.01.15 06.01.15 05jan2015 06jan2015 0 |
4. | b 06.01.15 07.01.15 06jan2015 07jan2015 0 |
5. | b 07.01.15 08.01.15 07jan2015 08jan2015 1 |
6. | b 10.01.15 12.01.15 10jan2015 12jan2015 . |
+----------------------------------------------------------+
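If you want a single flag for each panel rather than a row-by-row indicator, here is a minimal sketch building on the variables above (the variable name anygap is just for illustration):
* flag panels containing at least one gap; egen's max() ignores the missings in first
egen anygap = max(first), by(id)
list id anygap, sepby(id)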


Create a new variable using observations and labels

I have a variable whose tabulation of values against frequencies looks like this:
value | Freq.
1 | 70,105
2 | 36,377
3 | 8,840
4 | 3,845
I want a new variable that multiplies each value (label) by its frequency, so for example the first row would be 1*70,105 = 70,105, the second 2*36,377 = 72,754, and so on. How can I do this?
On the face of it you have at least 119,167 observations. The "at least" refers to the possibility of missing values, which are not tabulated by default.
You don't say whether you want these values in the same observations or in a much reduced new dataset. If the former, then consider this (noting that 3845 * 4 = 15380).
clear
input apple freq
1 70105
2 36377
3 8840
4 3845
end
expand freq
tab apple
bysort apple : gen new = apple * _N
tabdisp apple, c(new)
----------------------
apple | new
----------+-----------
1 | 70105
2 | 72754
3 | 26520
4 | 15380
----------------------
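If instead you want the much reduced new dataset, there is no need to expand: with one observation per value, the product can be computed directly. A minimal sketch on the same sandbox:
clear
input apple freq
1 70105
2 36377
3 8840
4 3845
end
* one observation per value: multiply the value by its frequency directly
gen new = apple * freq
list, clean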

Stata: Keep the first observation by group

I have a data set that looks like this:
id firm earnings A
1 A 100 0
1 A 200 0
2 B 50 1
2 B 70 1
3 C 900 0
Within each combination of id and firm, I want to keep only the first observation if A==0 and keep all the observations if A==1.
I've tried the following code:
if A==0{
bys id firm: keep if _n==1
}
However, this code drops all the _n>1 observations no matter what the A value is.
The if (condition) {do something} syntax is a control-flow construct, not a way to select observations: the condition is evaluated once, using only the first observation, rather than row by row. As you have it now, Stata only tests whether A==0 in the first row. Instead, combine the conditions with and (&) or or (|) in the keep statement itself. Try this:
bys id firm: keep if (_n==1 & A==0) | A==1
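A minimal sketch reproducing the example data above to verify the result:
clear
input id str1 firm earnings A
1 "A" 100 0
1 "A" 200 0
2 "B" 50 1
2 "B" 70 1
3 "C" 900 0
end
* keep the first observation where A==0, and every observation where A==1
bys id firm: keep if (_n==1 & A==0) | A==1
list, sepby(id)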

Power BI calculate sum only last value of duplicate ID

I'm struggling to create a Measure that sums a column while filtering out duplicate IDs, taking only the latest row for each ID.
For example, there is a table as such:
UID | Quantity | Status | StatusDate
aaa | 3 | Shipped | 11/1/2020
aaa | 3 | Delivered | 11/5/2020
bbb | 5 | Ordered | 10/29/2020
ccc | 8 | Shipped | 11/4/2020
So the idea would be to sum the quantity, but I only want to count the quantity for ID "aaa" once, and only towards its latest status ("Delivered" in this case). I would make a visual that shows the quantities with status as its axis. I also need to add a date slicer so I can go back in time, so that when I go before 11/5/2020, instead of "Delivered" it would switch back to "Shipped".
I tried several methods:
SUMMARIZE to a table, filtering for the "MAX" date value when UIDs are the same. I found this doesn't work with the date slicer, since it seems it is not actually recalculating the filter but just slicing away rows outside the dates. This seems to be the same whether the SUMMARIZE is set as a new table or as a VAR in the Measure.
CALCULATE seems promising, but I can't figure out the syntax for the filter that I need. Here is an example of one that doesn't work (I also tried SUMX instead of SUM, but that doesn't work either):
CALCULATE(
    SUM(Table1[Quantity]),
    FILTER(Table1,
        [StatusDate] = MAXX(FILTER(Table1, [UID] = EARLIER([UID])), [StatusDate])
    )
)
I also tried adding a column that states whether the row is "old", as well as a numerical "rank" for the different statuses. But once again, I run into the issue where the date slicer does not recalculate those columns. For example, if the date slicer is set to 11/3/2020, it should add "3" to "Shipped" instead of "Delivered". Instead, it just removes the row, which tells me it is not actually recalculating the columns (as in #1).
Any help would be appreciated :-) Thank you!
You can try something like this:
Measure =
VAR d = LASTDATE(Table1[StatusDate])
VAR tb =
    SUMMARIZE(
        FILTER(Table1, Table1[StatusDate] <= d),
        Table1[UID],
        "last", LASTDATE(Table1[StatusDate])
    )
RETURN
    CALCULATE(SUM(Table1[Quantity]), TREATAS(tb, Table1[UID], Table1[StatusDate]))
The tb variable contains a table which has the latest date per UID. You then use that to filter your main table with the TREATAS function.
One other alternative is to create a table with the RANK function ordered by date and then do a SUM over that table where Rank = 1.

Looping with distance matching

I want to match treated firms to control firms by industry and year considering firms that are the closest in terms of profitability (roa). I want a 1:1 match. I am using a distance measure (mahalanobis).
I have 530,000 firm-year observations in my sample, approximately 267,000 treated observations and 263,000 control observations. Here is my code:
gen neighbor1 = .
gen idobs = .
levelsof industry
local a = r(levels)
levelsof year
local b = r(levels)
foreach i in `a' {
    foreach j in `b' {
        capture noisily psmatch2 treat if industry == `i' & year == `j', mahalanobis(roa)
        capture noisily replace neighbor1 = _n1 if industry == `i' & year == `j'
        capture noisily replace idobs = _id if industry == `i' & year == `j'
        drop _treated _support _weight _id _n1 _nn
    }
}
Treat is my treatment variable. It takes the value of 1 for treated observations and 0 for non-treated observations.
The command psmatch2 creates the variable _n1 and _id among others. _n1 is the id number of the matched observation (closest neighbor) and _id is an id number (1 - 530,000) that is unique to each observation.
The code 'works', i.e. I get no error message. My variable neighbor1 has 290,724 non-missing observations.
However, these 290,724 observations vary between 1 and 933, which is odd. The variable neighbor1 should give me the observation ID number of the matched observation, which can vary between 1 and 530,000.
It seems that the code erases or ignores the result of the matching process in different subgroups. What am I doing wrong?
Edit:
I found a public dataset and adapted my previous code so that you can run my code with this dataset and see more clearly what the problem could be.
I am using Vella and Verbeek (1998) panel data on 545 men who worked every year from 1980-1987, from this website: https://www.stata.com/texts/eacsap/
Let's say that I want to match treated observations, i.e. people, to control observations by marriage status (married) and year considering people that worked a similar number of hours (hours), i.e. the shortest distance.
I create a random treatment variable (treat) for the sake of this example.
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta
gen treat = round(runiform())
gen neighbor1 = .
gen idobs = .
levelsof married
local a = r(levels)
levelsof year
local b = r(levels)
foreach i in `a' {
    foreach j in `b' {
        capture noisily psmatch2 treat if married == `i' & year == `j', mahalanobis(hours)
        capture noisily replace neighbor1 = _n1 if married == `i' & year == `j'
        capture noisily replace idobs = _id if married == `i' & year == `j'
        drop _treated _support _weight _id _n1 _nn
    }
}
What this code should do is to look at each subgroup of observations: 444 observations in 1980 that are not married, 101 observations in 1980 that are married, ..., and 335 observations in 1987 that are married. In each of these subgroups, I would like to match a treated observation to a control observation considering the shortest distance in the number of hours worked.
There are two problems that I see after running the code.
First, the variable idobs should take a unique number between 1 and 4360, because there are 4360 observations in this dataset; it is just an ID number. That is not the case: several observations can share the ID numbers 1, 2, and so on.
Second, neighbor1 varies between 1 and 204, meaning that the matched observations only have ID numbers varying from 1 to 204.
What is the problem with my code?
Here is a solution using the command iematch, installed through the package ietoolkit (ssc install ietoolkit). For disclosure, I wrote this command. psmatch2 is great if you want the ATT, but if all you want is to match observations across two groups using the nearest neighbor, then iematch is cleaner.
In both commands you need to make each industry-year match in a subset, then combine that information. In both commands the matched group ID will restart from 1 in each subset.
Using your example data, this code creates one matchID variable for each subset; you will then have to combine these into a single matchID without conflicts across the data set (see the sketch after the code).
* use data set and keep only vars required for simplicity
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta, clear
keep year married hours
* Set seed for replicability. NEVER use the 123456 seed in production, randomize a new seed
set seed 123456
*Generate mock treatment
gen treat = round(runiform())
*generate vars to store results
gen matchResult = .
gen matchDiff = .
gen matchCount = .
*Create locals for loops
levelsof married
local married_statuses = r(levels)
levelsof year
local years = r(levels)
*Loop over each subgroup
foreach year of local years {
    foreach married_status of local married_statuses {

        * This command is similar to psmatch2, but is a simplified version for
        * when you are not looking for the ATT. This command is only about matching.
        iematch if married == `married_status' & year == `year', grp(treat) match(hours) seedok m1 maxmatch(1)

        * These variables list meta info about the match. See the help file for
        * docs, but this copies info from each subset in this loop to single
        * vars for the full data set. The loop-specific vars are then dropped.
        replace matchResult = _matchResult if married == `married_status' & year == `year'
        replace matchDiff = _matchDiff if married == `married_status' & year == `year'
        replace matchCount = _matchCount if married == `married_status' & year == `year'
        drop _matchResult _matchDiff _matchCount

        * Each loop produces a match ID restarting at 1, so save it in one
        * var per loop and combine the vars afterwards.
        rename _matchID matchID_`married_status'_`year'
    }
}
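The matchID_* variables can then be combined into a single non-conflicting ID. A minimal sketch, assuming only the matchID_* variables created by the loop above (each observation has at most one of them non-missing; the names matchKey and matchID are just for illustration):
* build a string key that is unique across subsets, then number it
gen matchKey = ""
foreach v of varlist matchID_* {
    replace matchKey = "`v'_" + string(`v') if !missing(`v')
}
egen matchID = group(matchKey)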

Conversion of date-time data

Managed to answer the question, though not by using 'help datetime' (I had already done that) or by reading N. Cox's 'Speaking Stata: On numbers and strings'.
Solution:
gen dob_ymd_nn = date(dob_ymd,"DMYhm")
format dob_ymd_nn %td
Thank you
My Stata variable dob_dmy shows each participant's date of birth. The database unfortunately added a time (all read 00:00). It is currently a string variable (str16). When I sort, it sorts not on the full date but first on the day. See below:
63. | 01/01/1975 00:00 |
64. | 01/01/1985 00:00 |
65. | 01/02/2010 00:00 |
I would like to drop the time and change to a format that will allow me to sort by actual date.
@Stan indicated the main idea, that you must convert from a string to a numeric date variable. @Roberto Ferrer underlined that this is all documented prominently within Stata itself; no internet search is needed.
Using your data as a sandbox (you can create one yourself easily in future questions using dataex (SSC)) and taking the hint in the variable name that the dates run day, month, year, we can just ignore the useless time of day with substr() and pass the useful part to daily(). Add a date format for readability, and then sorting works as desired.
. clear
. input str16 sdate
sdate
1. "01/02/2010 00:00"
2. "01/01/1985 00:00"
3. "01/01/1975 00:00"
4. end
. gen ddate = daily(substr(sdate, 1, 10), "DMY")
. format ddate %td
. sort ddate
. list
+------------------------------+
| sdate ddate |
|------------------------------|
1. | 01/01/1975 00:00 01jan1975 |
2. | 01/01/1985 00:00 01jan1985 |
3. | 01/02/2010 00:00 01feb2010 |
+------------------------------+
If you're storing dates as strings in MM/DD/YYYY format, you won't be able to sort them, except by month, then day, then year (which isn't very helpful). You need to convert them to dates, and THEN sort them.
From the following link:
gen date_obs = clock(datetime_obs, "MD20Yhm") // Obviously you have 4-digit years, so you would change this to "MDYhm"
format date_obs %tc
http://www.stata.com/statalist/archive/2013-08/msg01434.html
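If you take the clock() route, you can still recover a sortable daily date with dofc(). A minimal sketch, assuming the datetime_obs string variable from the snippet above (the names dt_obs and d_obs are just for illustration):
* clock() returns milliseconds since 1960, which needs a double
gen double dt_obs = clock(datetime_obs, "MDYhm")
format dt_obs %tc
* dofc() converts a datetime to a daily date, dropping the time of day
gen d_obs = dofc(dt_obs)
format d_obs %td
sort d_obs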