Repeated occurrences of value before creating flag - grouping

This question is related to Identify unique levels of categorical variable
I have a dataset as follows:
clear
input int(id date) str8 druggroup
1001 18401 "loop"
1001 18414 "loop"
1001 18428 "loop"
1001 18462 "loop"
1001 18428 "CCB"
1001 18462 "arb"
1002 18401 "arb"
1002 18473 "arb"
1002 18414 "thiazide"
1002 18428 "thiazide"
1002 18428 "CCB"
1002 18466 "CCB"
end
format %td date
I want to create a new variable which contains the earliest date for which I have evidence of the use of three separate druggroups for each id.
The rule for defining "evidence of three" is that I want repeat evidence of druggroup 1 occurring again and in addition an occurrence of druggroups 2 and 3. In other words, druggroup 1 will obviously occur once, in the first row, but I want it to occur again. Druggroups 2 and 3 do not need to be repeat occurrences but they both must occur.
The code I have written so far does not take into account that the first occurring drug needs to occur once more after its first occurrence to count as evidence of repeat use.
Here is the code I have so far:
bysort id druggroup (date) : gen firstdate = date[1]
format firstdate %td
list
egen group = group(id firstdate druggroup)
bysort id (group date druggroup): gen count_1 = sum(group != group[_n-1])
replace firstdate=date[2] if count_1==1
list
by id: gen start_date=firstdate if count_1==3
format start_date %td
by id : egen start_d=max(start_date)
format start_d %td
list
Here is what I actually want:
clear
input int(id date) str8 druggroup float(firstdate group count_1 start_date start_d)
1001 18401 "loop" 18414 1 1 . 18462
1001 18414 "loop" 18414 1 1 . 18462
1001 18428 "CCB" 18428 2 2 . 18462
1001 18428 "loop" 18414 1 1 . 18462
1001 18462 "loop" 18414 1 1 . 18462
1001 18462 "arb" 18462 3 3 18462 18462
1002 18401 "arb" 18414 4 1 . 18473
1002 18414 "thiazide" 18414 5 2 . 18473
1002 18428 "CCB" 18428 6 3 . 18473
1002 18428 "thiazide" 18414 5 2 . 18473
1002 18466 "CCB" 18428 6 3 . 18473
1002 18473 "arb" 18414 4 1 18473 18473
end
format %td date
format %td firstdate
format %td start_date
format %td start_d

Here is a solution I find nice because it is based on prime numbers. However, it only works if you have exactly 3 drugs per id.
bysort id druggroup (date) : gen firstdate = date[1]
egen group = group(id firstdate druggroup)
bysort id (group date druggroup): gen count = sum(group != group[_n-1])
sort id date
replace count = 5 if count == 3
replace count = 3 if count == 2
replace count = 2 if count == 1
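We compute the cumulative product of these codes within each id, skipping the first record so that the first occurrence of the first drug is not counted. Once this product is a multiple of 2*3*5, i.e. 30, the first drug has occurred again and the other two drugs have each occurred at least once.
bysort id (date) : gen temp_prod = sum(ln(count)) if _n != 1
* round() rather than int(), since exp() of a sum of logs can land just below the integer
by id (date) : replace temp_prod = round(exp(temp_prod))
gen temp_mod = mod(temp_prod, 30)
* keep the earliest date at which the product is a multiple of 30
bysort id (temp_mod date) : gen start_date = date if _n == 1 & temp_mod == 0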
sort id date
drop temp* first count
format %td start_date
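To see why the running product has to reach a multiple of 2*3*5 = 30, here is a minimal Python sketch of the same prime-coding idea. It is an illustration only, not a translation of the Stata code: the function name start_date_by_primes and the (date, druggroup) tuple layout are assumptions made for this example, and it assumes exactly three drug groups per id. Python integers are exact, so no log/exp round trip is needed.

```python
PRIMES = (2, 3, 5)  # codes for the first, second and third drug to appear


def start_date_by_primes(records):
    """records: (date, druggroup) pairs for ONE id (hypothetical layout)."""
    records = sorted(records)  # by date, then by drug name
    first_seen = {}
    for day, drug in records:
        first_seen.setdefault(drug, day)
    # Rank the drugs by the date on which they first appear.
    ranked = sorted(first_seen, key=lambda d: (first_seen[d], d))
    if len(ranked) != 3:
        return None  # the trick assumes exactly three drug groups
    code = dict(zip(ranked, PRIMES))
    product = 1
    for day, drug in records[1:]:  # skip the first record of the first drug
        product *= code[drug]
        if product % 30 == 0:  # factors 2, 3 and 5 have all been seen
            return day
    return None


id_1001 = [(18401, "loop"), (18414, "loop"), (18428, "loop"),
           (18462, "loop"), (18428, "CCB"), (18462, "arb")]
id_1002 = [(18401, "arb"), (18473, "arb"), (18414, "thiazide"),
           (18428, "thiazide"), (18428, "CCB"), (18466, "CCB")]

print(start_date_by_primes(id_1001))
print(start_date_by_primes(id_1002))
```

For id 1002 the product reaches 45 on day 18428, which is divisible by 15 but not by 30: the second and third drugs have both appeared, but the first drug has not yet occurred again.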

We have to find the first date for each drug and each patient, and then the second date for the first drug used. There could be a problem if two or more drugs were dispensed on the first day. The implication seems to be that this doesn't happen.
I have to say that I usually need several tries to get the commands exactly right for this kind of problem.
The egen technique used here is explained in Section 10 of the paper at http://www.stata-journal.com/sjpdf.html?articlenum=dm0055
clear
input int(id date) str8 druggroup
1001 18401 "loop"
1001 18414 "loop"
1001 18428 "loop"
1001 18462 "loop"
1001 18428 "CCB"
1001 18462 "arb"
1002 18401 "arb"
1002 18473 "arb"
1002 18414 "thiazide"
1002 18428 "thiazide"
1002 18428 "CCB"
1002 18466 "CCB"
end
format %td date
local d druggroup
bysort id `d' (date): gen first = date[1] if _n == 1
bysort id (date `d') : gen counter = sum(first < .)
bysort id `d' (date) : replace first = first[1]
bysort id (first date druggroup) : gen date1 = date[2] if `d'[2] == `d'[1]
by id: egen date2 = min(date / (counter == 2))
by id: egen date3 = min(date / (counter == 3))
gen when = max(date1, date2, date3) if !missing(date1, date2, date3)
sort id date
format first date? when %td
l id date `d' first when, sepby(id)
     +-----------------------------------------------------+
     |   id        date   druggr~p       first        when |
     |-----------------------------------------------------|
  1. | 1001   19may2010       loop   19may2010   19jul2010 |
  2. | 1001   01jun2010       loop   19may2010   19jul2010 |
  3. | 1001   15jun2010       loop   19may2010   19jul2010 |
  4. | 1001   15jun2010        CCB   15jun2010   19jul2010 |
  5. | 1001   19jul2010        arb   19jul2010   19jul2010 |
  6. | 1001   19jul2010       loop   19may2010   19jul2010 |
     |-----------------------------------------------------|
  7. | 1002   19may2010        arb   19may2010   30jul2010 |
  8. | 1002   01jun2010   thiazide   01jun2010   30jul2010 |
  9. | 1002   15jun2010   thiazide   01jun2010   30jul2010 |
 10. | 1002   15jun2010        CCB   15jun2010   30jul2010 |
 11. | 1002   23jul2010        CCB   15jun2010   30jul2010 |
 12. | 1002   30jul2010        arb   19may2010   30jul2010 |
     +-----------------------------------------------------+
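The logic of this answer (first date of each drug, second date of the first drug, then the latest of the three) can be sketched in Python as follows. This is only a rough illustration under stated assumptions: the names earliest_evidence and to_calendar and the (date, druggroup) tuple layout are made up for the example, and, like the Stata code, it assumes that only one drug is dispensed on the first day.

```python
from datetime import date, timedelta


def earliest_evidence(records):
    """records: (date, druggroup) pairs for ONE id (hypothetical layout)."""
    dates = {}
    for day, drug in sorted(records):
        dates.setdefault(drug, []).append(day)
    # Order the drugs by their first date.
    ranked = sorted(dates, key=lambda d: (dates[d][0], d))
    if len(ranked) < 3 or len(dates[ranked[0]]) < 2:
        return None  # fewer than three drugs, or the first drug never repeats
    repeat_of_first = dates[ranked[0]][1]  # second record of the first drug
    first_of_second = dates[ranked[1]][0]
    first_of_third = dates[ranked[2]][0]
    return max(repeat_of_first, first_of_second, first_of_third)


def to_calendar(stata_day):
    """Stata daily dates count days from 01jan1960."""
    return date(1960, 1, 1) + timedelta(days=stata_day)


id_1001 = [(18401, "loop"), (18414, "loop"), (18428, "loop"),
           (18462, "loop"), (18428, "CCB"), (18462, "arb")]
id_1002 = [(18401, "arb"), (18473, "arb"), (18414, "thiazide"),
           (18428, "thiazide"), (18428, "CCB"), (18466, "CCB")]

print(to_calendar(earliest_evidence(id_1001)))
print(to_calendar(earliest_evidence(id_1002)))
```

The two results correspond to the when values in the listing, 19jul2010 and 30jul2010.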

I think I've come up with a slightly more straightforward way (as always very happy to be corrected!). But there is one glitch in my method that I would appreciate help with please.
bysort id druggroup (date) : gen firstdate = date[1]
format firstdate %td
egen group2 = group(id firstdate druggroup)
bysort id (group2 druggroup date): gen count_1 = sum(group2 != group2[_n-1])
by id: replace firstdate=date[2] if count_1==1 //be careful of ordering here
by id : egen s_d=max(firstdate)
format s_d %td
The problem is the egen group() step. The order of the groups gets messed up because Stata falls back on alphabetical order when two druggroups occur on the same date. I don't want alphabetical order; I want Stata to preserve the order in which I have arranged the data. Is there a way to tell group() to stop ordering alphabetically when I have two drug groups on the same date?
EDIT
I haven't managed to figure out how to break ties here yet. This is my workaround; it is not perfect, but it deals with the problem of egen's group() imposing its own alphabetical order on druggroups that occur on the same date.
In my workaround, I have taken the second drug that occurs on the same date and changed its date to date+1. This allows me to preserve the ordering and still seems to achieve the correct result.
The objective here is to make a new date variable; the date should be the earliest date that I have evidence of 3 medicines after the occurrence of the first medicine (so need the first drug to occur again after its first occurrence, but two others to occur).
Code and new sample data below.
clear
input int(id date) str8 druggroup byte tag
1001 18401 "loop" 1
1001 18414 "loop" 2
1001 18428 "loop" 2
1001 18428 "CCB" 2
1001 18462 "loop" 2
1001 18462 "arb" 2
2002 18401 "thiazide" 1
2002 18401 "arb" 2
2002 18428 "CCB" 2
2002 18428 "thiazide" 2
2002 18466 "CCB" 2
2002 18473 "arb" 2
3003 18401 "BB" 1
3003 18401 "arb" 2
3003 18428 "BB" 2
3003 18428 "CCB" 2
3003 18466 "CCB" 2
3003 18473 "arb" 2
end
format %td date
*make date_copy var
gen date_copy= date
replace date_copy=date+1 if date==date[_n-1] & tag[_n-1]==1
format date_copy %td
bysort id druggroup (date_copy) : gen firstdate = date_copy[1]
format firstdate %td
list
sort id date tag
list
*getting groups and new count
egen group = group(id firstdate druggroup)
bysort id (group date druggroup): gen count_1 = sum(group != group[_n-1])
list
by id : replace firstdate=date[2] if count_1==1
list
by id : egen s_d=max(firstdate)
format s_d %td
list

Related

Sas base: one-to-one reading by biggest table or getting data from next row

I'm new to Base SAS and need help.
I have two tables with different data that I need to merge.
In one step, though, I also need data from the next row.
Example of what I need:
ID Fdate Tdate NFdate NTdate
id1 date1 date1 date2 date2
id2 date2 date2 date3 date3
....
I did it by 2 merges:
data result;
merge table1 table2 by ...;
merge table1(firstobs=2) table2(firstobs=2) by...;
run;
I expected 10 rows but got 9, because one-to-one reading stopped at the last row of the smaller table (the second merge). How can I get the last row as well, that is, do one-to-one reading driven by the larger table?
Most simple data steps stop not at the bottom of the step but in the middle when they read past the end of the input. The reason you are getting N-1 observations is because the second input has one fewer records. So you need to do something to stop that.
One simple way is to not execute the second read when you are processing the last observation read by the first one. You can use the END= option to create a boolean variable that will let you know when that happens.
Here is a simple example using SASHELP.CLASS.
data test;
set sashelp.class end=eof;
if not eof then set sashelp.class(firstobs=2 keep=name rename=(name=next_name));
else call missing(next_name);
run;
Results:
Obs Name Sex Age Height Weight next_name
1 Alfred M 14 69.0 112.5 Alice
2 Alice F 13 56.5 84.0 Barbara
3 Barbara F 13 65.3 98.0 Carol
4 Carol F 14 62.8 102.5 Henry
5 Henry M 14 63.5 102.5 James
6 James M 12 57.3 83.0 Jane
7 Jane F 12 59.8 84.5 Janet
8 Janet F 15 62.5 112.5 Jeffrey
9 Jeffrey M 13 62.5 84.0 John
10 John M 12 59.0 99.5 Joyce
11 Joyce F 11 51.3 50.5 Judy
12 Judy F 14 64.3 90.0 Louise
13 Louise F 12 56.3 77.0 Mary
14 Mary F 15 66.5 112.0 Philip
15 Philip M 16 72.0 150.0 Robert
16 Robert M 12 64.8 128.0 Ronald
17 Ronald M 15 67.0 133.0 Thomas
18 Thomas M 11 57.5 85.0 William
19 William M 15 66.5 112.0
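The same look-ahead pattern can be illustrated outside SAS. The following Python sketch is only an analogy, with a shortened name list taken from the output above: a plain zip stops at the shorter input (the N-1 problem from the question), while zip_longest keeps the last row and fills its missing partner with None, much like the END= guard plus call missing does.

```python
from itertools import zip_longest

names = ["Alfred", "Alice", "Barbara", "Carol"]

# Pairing each row with the next one stops at the shorter input,
# so the last row is lost.
short = list(zip(names, names[1:]))

# zip_longest is driven by the longer input: the last row is kept
# and its "next" value is None.
with_next = list(zip_longest(names, names[1:]))

for name, next_name in with_next:
    print(name, next_name)
```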

How to produce result based on date with multiple conditions on power bi?

I'm trying to create a validation based on the date and some filters.
My input table is:
Status Type Date PolicyNo
PS T607 01-01-2020 1002
PS T608 01-01-2020 1002
CF T646 01-01-2020 1002
PS T607 04-01-2020 1003
My conditions are:
1) How do I apply multiple conditions to the rows of a single day?
Example: on 01-01-2020, policy 1002 has Type T607 together with at least one of T608/T646, with Status PS or CF, so the output value should be 0; otherwise it should be 1.
2) My expected output is:
Status Type Date PolicyNo Accept
PS T607 01-01-2020 1002 0
PS T608 01-01-2020 1002 0
CF T646 01-01-2020 1002 0
PS T607 04-01-2020 1003 1
EDIT:
Date
01-01-2020
01-01-2020
01-01-2020
PolicyNo
1002
1002
1002
Type : T607 with (T608 or T646)
T607 - compulsory so (&&)
T608 - Optional so (||)
T646 - Optional so
(and)
Status : PS or CF
PS - Optional so (||)
CF - Optional
Concluding condition: same Date (e.g. 01-01-2020) and same PolicyNo (e.g. 1002) with (Type: T607 with (T608 or T646)) and (Status: PS or CF)
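One reading of the rule above, sketched in Python rather than M or DAX purely to pin down the logic: rows are grouped by Date and PolicyNo, and a group gets Accept = 0 when, among its rows with Status PS or CF, Type T607 is present together with at least one of T608 or T646. The function name add_accept and the dictionary layout are assumptions for this example, and the interpretation of the Status filter is a guess.

```python
from collections import defaultdict

rows = [
    {"Status": "PS", "Type": "T607", "Date": "01-01-2020", "PolicyNo": 1002},
    {"Status": "PS", "Type": "T608", "Date": "01-01-2020", "PolicyNo": 1002},
    {"Status": "CF", "Type": "T646", "Date": "01-01-2020", "PolicyNo": 1002},
    {"Status": "PS", "Type": "T607", "Date": "04-01-2020", "PolicyNo": 1003},
]


def add_accept(rows):
    # Collect the Types seen per (Date, PolicyNo), counting only PS/CF rows.
    types_by_key = defaultdict(set)
    for r in rows:
        if r["Status"] in {"PS", "CF"}:
            types_by_key[(r["Date"], r["PolicyNo"])].add(r["Type"])
    result = []
    for r in rows:
        types = types_by_key[(r["Date"], r["PolicyNo"])]
        # T607 is compulsory; T608 and T646 are alternatives.
        matched = "T607" in types and bool(types & {"T608", "T646"})
        result.append({**r, "Accept": 0 if matched else 1})
    return result


result = add_accept(rows)
for r in result:
    print(r["Type"], r["Date"], r["PolicyNo"], r["Accept"])
```

On the sample table this reproduces the expected Accept column: 0 for the three rows of policy 1002 and 1 for policy 1003.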
Multiple conditions in M (Power Query) for a custom column:
= if [Date] = Date.From(DateTime.LocalNow()) and [Type] = "T607" and [PolicyNo] = 1003 then 1 else 0
And so on...
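Note: The keywords (if, then, else, and, or) have to be lower case, because M is case sensitive.
You can also nest the ifs or chain them with else if, and you can use an or condition.
You can do the same in DAX, though, with IF() as a new column. Note that DAX's OR() function accepts only two arguments, so chain more than two conditions with the || operator (use && for and):
= IF([Date] = TODAY() || [Type] = "T607" || [PolicyNo] = 1003, 1, 0)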
EDIT
To your 4th comment: this logic works just fine (simplified sample):

Converting daily data in to weekly in Pandas

I have a dataframe as given below:
Index Date Country Occurence
0 2013-12-30 US 1
1 2013-12-30 India 3
2 2014-01-10 US 1
3 2014-01-15 India 1
4 2014-02-05 UK 5
I want to convert the daily data into weekly data, grouped by Country and aggregated by sum.
I tried resampling, but the output was a MultiIndex DataFrame from which I was not able to access the "Country" and "Date" columns (please refer to the table above).
The desired output is given below:
Date Country Occurence
Week1 India 4
Week2
Week1 US 2
Week2
Week5 Germany 5
You can group by Country and resample by week:
In [63]: df
Out[63]:
Date Country Occurence
0 2013-12-30 US 1
1 2013-12-30 India 3
2 2014-01-10 US 1
3 2014-01-15 India 1
4 2014-02-05 UK 5
In [64]: df.set_index('Date').groupby('Country').resample('W', how='sum')
Out[64]:
Occurence
Country Date
India 2014-01-05 3
2014-01-12 NaN
2014-01-19 1
UK 2014-02-09 5
US 2014-01-05 1
2014-01-12 1
And you can use reset_index() to turn the index levels back into ordinary columns:
In [65]: df.set_index('Date').groupby('Country').resample('W', how='sum').reset_index()
Out[65]:
Country Date Occurence
0 India 2014-01-05 3
1 India 2014-01-12 NaN
2 India 2014-01-19 1
3 UK 2014-02-09 5
4 US 2014-01-05 1
5 US 2014-01-12 1
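The how= keyword used above is no longer accepted by current pandas releases. A minimal sketch of one way to get the same weekly sums today, assuming the Date column is parsed as datetime, is to group by Country together with pd.Grouper. Unlike the output above, this form returns only the weeks that actually contain data, so there is no NaN row for empty weeks.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2013-12-30", "2013-12-30", "2014-01-10",
                            "2014-01-15", "2014-02-05"]),
    "Country": ["US", "India", "US", "India", "UK"],
    "Occurence": [1, 3, 1, 1, 5],
})

# Group by country and by calendar week (weeks end on Sunday with freq="W"),
# sum the occurrences, and turn the group keys back into columns.
weekly = (
    df.groupby(["Country", pd.Grouper(key="Date", freq="W")])["Occurence"]
      .sum()
      .reset_index()
)

print(weekly)
```

Because reset_index() is applied at the end, Country and Date are ordinary columns again and can be accessed as weekly["Country"] and weekly["Date"].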

Stata: Generate new variable with all values (e.g. not just max or min) for a group based on other variable in the group

I want to create new variables for the group country (iso_o/iso_d) with characteristics of the variable indepdate.
So far I have been typing:
gen include=1 if heg_o != 1
egen iso_o_indepdate1=min(indepdate * include), by(iso_o)
egen iso_o_indepdate2=max(indepdate * include), by(iso_o)
replace iso_o_indepdate2=. if iso_o_indepdate1==iso_o_indepdate2
drop include
*
gen include=1 if heg_d !=1
egen iso_d_indepdate1=min(indepdate * include), by(iso_d)
egen iso_d_indepdate2=max(indepdate * include), by(iso_d)
replace iso_d_indepdate2=. if iso_d_indepdate1==iso_d_indepdate2
drop include
The problem is that combining min() and max() gives me only two new variables for the values within indepdate; if a group has three or more distinct values, I haven't been able to find a solution. Here is a small table.
iso_o group indepdate new1 new2 new3
FRA 1 1960 1960 1980 1999
FRA 1 1980 1960 1980 1999
FRA 1 1999 1960 1980 1999
FRA 1 . 1960 1980 1999
USA 2 1955 1955 . .
USA 2 . 1955 . .
USA 2 . 1955 . .
So for this small example I could try work with intervals, however the dataset is very large and therefore I cannot tell for sure how many values are in one interval.
Any hint on another approach for this?
You can reshape and then merge:
clear all
set more off
*----- example data ---
input ///
str3 iso_o group indepdate new1 new2 new3
FRA 1 1960 1960 1980 1999
FRA 1 1980 1960 1980 1999
FRA 1 1999 1960 1980 1999
FRA 1 . 1960 1980 1999
USA 2 1955 1955 . .
USA 2 . 1955 . .
USA 2 . 1955 . .
end
drop new*
list, sepby(group)
tempfile orig
save "`orig'"
*----- what you want -----
bysort group (indepdate) : gen j = _n
reshape wide indepdate, i(group) j(j)
keep group indepdate*
merge 1:m group using "`orig'", assert(match) nogenerate
// list
sort group indepdate
order iso_o group indepdate indepdate*
list, sepby(group)
See help dropmiss to drop variables that have only missing values.
But the bigger question is why do you want to do this?
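For readers who want to see the target layout independently of Stata, here is a small Python sketch of the same idea: collect the distinct non-missing values per group and attach all of them to every row of that group, padding shorter groups with None. The function name spread_values and the tuple layout are assumptions for this example; unlike the reshape above, it deduplicates values before spreading them.

```python
def spread_values(rows):
    """rows: (group, value) pairs, with None standing for a missing value."""
    distinct = {}
    for group, value in rows:
        seen = distinct.setdefault(group, [])
        if value is not None and value not in seen:
            seen.append(value)
    # Every row gets as many extra columns as the largest group needs.
    width = max((len(v) for v in distinct.values()), default=0)
    result = []
    for group, value in rows:
        values = sorted(distinct[group])
        padding = [None] * (width - len(values))
        result.append((group, value, *values, *padding))
    return result


rows = [("FRA", 1960), ("FRA", 1980), ("FRA", 1999), ("FRA", None),
        ("USA", 1955), ("USA", None), ("USA", None)]

wide = spread_values(rows)
for row in wide:
    print(row)
```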

Filling in the months and years in date range?

I am new to SAS and wondered how to most efficiently list the months and years that fall between a starting date and ending date, in addition to the starting and ending date themselves. I've read about the INTCK and INTNX functions, the EXPAND function for time series data, and even CALENDAR FILL, but I'm not sure how to use them for this particular purpose. This task is easy to accomplish manually with a small dataset in Excel thanks to the drag-down autofill feature, but I need to find a way to do this in SAS due to the size of the dataset. Any suggestions would be greatly appreciated. Thank you!
The dataset is in a large text file organized like this now:
ID Start End
1000 08/01/2012 12/31/2012
1001 07/01/2010 05/31/2011
1002 04/01/1990 10/31/1991
But the output should look like this in the end:
ID MonthYear
1000 08/12
1000 09/12
1000 10/12
1000 11/12
1000 12/12
1001 07/10
1001 08/10
1001 09/10
1001 10/10
1001 11/10
1001 12/10
1001 01/11
1001 02/11
1001 03/11
1001 04/11
1001 05/11
1002 04/90
1002 05/90
1002 06/90
1002 07/90
1002 08/90
1002 09/90
1002 10/90
1002 11/90
1002 12/90
1002 01/91
1002 02/91
1002 03/91
1002 04/91
1002 05/91
1002 06/91
1002 07/91
1002 08/91
1002 09/91
1002 10/91
data want2;
set have;
do i = 0 to intck('month',start,end);
monthyear=intnx('month',start,i,'b');
output;
end;
format monthyear monyy.;
keep id monthyear;
run;
This will do the trick. PROC EXPAND may be more efficient, though I think it requires a number of desired observations rather than a start/end combination (though you could get that, I suppose).
data have;
informat start end MMDDYY10.;
input ID Start End;
datalines;
1000 08/01/2012 12/31/2012
1001 07/01/2010 05/31/2011
1002 04/01/1990 10/31/1991
;;;;
run;
data want;
set have;
format monthyear MMYYS5.; *formats the numeric monthyear variable with your desired format;
monthyear=start; *start with the initial observation;
output; *output it;
do _t = 1 to intck('month',start,end); *one pass per month boundary between start and end, so ranges longer than a year do not stop early;
monthyear = intnx('month',monthyear,1,'b'); *go to the next start of month;
output; *output it;
end;
run;
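As a cross-check of the expected row counts, here is a minimal Python sketch of the same month expansion. The function name months_between is an assumption for this example; the count n plays the role of INTCK('month', start, end), i.e. the number of month boundaries between the two dates.

```python
from datetime import date


def months_between(start, end):
    """First day of every month from start's month through end's month."""
    n = (end.year - start.year) * 12 + (end.month - start.month)
    months = []
    for i in range(n + 1):
        years, month_index = divmod(start.month - 1 + i, 12)
        months.append(date(start.year + years, month_index + 1, 1))
    return months


have = {
    1000: (date(2012, 8, 1), date(2012, 12, 31)),
    1001: (date(2010, 7, 1), date(2011, 5, 31)),
    1002: (date(1990, 4, 1), date(1991, 10, 31)),
}

for ident, (start, end) in have.items():
    for month in months_between(start, end):
        print(ident, month.strftime("%m/%y"))
```

ID 1002 spans more than a year, which makes it a useful test case: a loop that only compares month numbers would stop at 10/90 instead of 10/91.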