I have two datasets:
1) Let's call the first data set "Provider". It contains a year's worth of provider shifts (over 3900 shifts/records): the provider, the date on which they worked the shift, and the shift type. Note that some shift types span midnight.
Date Provider Shift
1/8/2019 Bob ED A/B 11p-7a (ED A/B)
1/10/2019 Bob ED C/D 11p-7a (ED C/D)
1/16/2019 Bob ED C 3p-12a (ED C)
1/9/2019 Sue UMC 5p-2a (UMC)
1/11/2019 Bob ED C/D 11p-7a (ED C/D)
1/13/2019 Bob ED PH/night 10p-4a (ED PH/night)
2) The second data set is "Patients". For every patient seen at our location over the year, it contains the FIN, the date/time when they saw a provider, and the name of that provider.
FIN Date Provider Name
1 1/8/2019 23:40 Bob
2 1/9/2019 01:46 Timbo
3 1/9/2019 01:30 Bob
4 1/9/2019 05:06 Patty
5 1/9/2019 02:50 Bob
6 1/9/2019 17:23 Sue
7 1/9/2019 06:45 Mike
8 1/10/2019 01:35 Sue
I'm looking to create a new data set that contains the number of patients seen during a given shift.
So for example the data set would look like this:
Shift date Shift FIN Provider
1/8/2019 ED A/B 11p-7a (ED A/B) 1 Bob
1/8/2019 ED A/B 11p-7a (ED A/B) 3 Bob
1/8/2019 ED A/B 11p-7a (ED A/B) 5 Bob
1/9/2019 UMC 5p-2a (UMC) 6 Sue
1/9/2019 UMC 5p-2a (UMC) 8 Sue
I could very easily create this data set by merging the two data sets then matching based on date and provider name; however, as I mentioned before, some of the shifts span past midnight so I am unable to match by date.
There are roughly 20 different shift types I'm interested in gathering data for, of which 6 span midnight. I would need to structure my data so that if, say, a provider worked the ED A/B 11p-7a (ED A/B) shift on 1/8/2019, any patient he/she saw before 7am on 1/9/2019 is also counted. If possible, I would then need to create some sort of macro (I think).
Hope this makes sense - thanks for help!
You will need to process the Provider data to compute shift start and end datetimes. This requires locating the ##p-##a text portion in Shift and, presumably, also the ##a-##a, ##a-##p and ##p-##p variants.
After the shift datetimes are computed the data can be joined in this manner:
patients
join
provider
on
patients.date between provider.shift_start and provider.shift_end
& patients.provider = provider.provider
Example
data provider;
attrib
date informat=mmddyy10. format=mmddyy10.
provider length=$10
shift length=$60
;
input date& provider& shift&; datalines;
1/8/2019  Bob  ED A/B 11p-7a (ED A/B)
1/10/2019  Bob  ED C/D 11p-7a (ED C/D)
1/16/2019  Bob  ED C 3p-12a (ED C)
1/9/2019  Sue  UMC 5p-2a (UMC)
1/11/2019  Bob  ED C/D 11p-7a (ED C/D)
1/13/2019  Bob  ED PH/night 10p-4a (ED PH/night)
1/15/2019  Bob  ED PH/night 10p-9p (ED PH/night)
1/17/2019  Bob  ED PH/night 2-11a (ED PH/night)
;
data patients;
attrib
fin length=8
service_dt length=8 format=datetime20. informat=anydtdtm20.
provider length=$10
;
input FIN& service_dt& Provider&; datalines;
1  1/8/2019 23:40  Bob
2  1/9/2019 01:46  Timbo
3  1/9/2019 01:30  Bob
4  1/9/2019 05:06  Patty
5  1/9/2019 02:50  Bob
6  1/9/2019 17:23  Sue
7  1/9/2019 06:45  Mike
8  1/10/2019 01:35  Sue
;
* compute shift start and end datetimes;
* presume the shift time ranges are valid;
* 12a is treated as midnight (hour 0) and 12p as noon (hour 12);
data provider_range;
  set provider;
  rxid = prxparse('/(\d{1,2})(a|p)-(\d{1,2})(a|p)/');
  if prxmatch(rxid,shift) then do;
    length t1 $2 p1 $1 t2 $2 p2 $1;
    t1 = prxposn(rxid,1,shift); t1n = input(t1,2.);
    p1 = prxposn(rxid,2,shift);
    t2 = prxposn(rxid,3,shift); t2n = input(t2,2.);
    p2 = prxposn(rxid,4,shift);
    * convert the 12-hour clock values to 24-hour clock hours;
    t1h = mod(t1n,12) + 12*(p1='p');
    t2h = mod(t2n,12) + 12*(p2='p');
    * an end at or before the start means the shift ends on the next day;
    if t2h <= t1h then t2h = t2h + 24;
    shift_start = dhms(date, t1h, 0, 0);
    shift_end = dhms(date, t2h, 0, 0);
  end;
  else do;
    put 'ERROR: Invalid shift, ' shift ;
    delete;
  end;
  format shift_start shift_end datetime20.;
  drop rxid t1: p1: t2: p2:;
run;
* this join does not use SAS SQL BETWEEN, the join criteria
* uses explicit construct a <= b and b <= c instead;
proc sql;
create table want as
select
provider.date as shift_date,
provider.shift,
patients.service_dt,
patients.fin,
patients.provider
from patients
join provider_range as provider
on patients.provider = provider.provider and
provider.shift_start <= patients.service_dt and
provider.shift_end >= patients.service_dt
order by
fin
;
quit;
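As a cross-check of the interval logic, here is a minimal sketch in Python (not SAS), using only the standard library. The names `to_24h`, `shift_window` and `match_patients` are made up for this illustration, and it assumes that an end hour at or before the start hour means the shift ends on the following day.

```python
import re
from datetime import datetime, timedelta

# Same pattern as the SAS example: start hour, a/p, end hour, a/p
SHIFT_RE = re.compile(r"(\d{1,2})(a|p)-(\d{1,2})(a|p)")

def to_24h(hour, meridiem):
    """Convert a 12-hour clock hour to 0-23 (12a -> 0, 12p -> 12)."""
    return hour % 12 + (12 if meridiem == "p" else 0)

def shift_window(shift_date, shift_text):
    """Return (start, end) datetimes, or None when no time range is found."""
    m = SHIFT_RE.search(shift_text)
    if m is None:
        return None
    h1 = to_24h(int(m.group(1)), m.group(2))
    h2 = to_24h(int(m.group(3)), m.group(4))
    if h2 <= h1:  # assumption: the shift runs past midnight
        h2 += 24
    day = datetime.strptime(shift_date, "%m/%d/%Y")
    return day + timedelta(hours=h1), day + timedelta(hours=h2)

def match_patients(shifts, patients):
    """Pair each patient with the shift of the same provider that covers the visit."""
    rows = []
    for fin, seen_text, provider in patients:
        seen = datetime.strptime(seen_text, "%m/%d/%Y %H:%M")
        for shift_date, shift_provider, shift_text in shifts:
            window = shift_window(shift_date, shift_text)
            if window and provider == shift_provider and window[0] <= seen <= window[1]:
                rows.append((shift_date, shift_text, fin, provider))
    return rows

shifts = [
    ("1/8/2019", "Bob", "ED A/B 11p-7a (ED A/B)"),
    ("1/9/2019", "Sue", "UMC 5p-2a (UMC)"),
]
patients = [
    (1, "1/8/2019 23:40", "Bob"),
    (2, "1/9/2019 01:46", "Timbo"),
    (3, "1/9/2019 01:30", "Bob"),
    (5, "1/9/2019 02:50", "Bob"),
    (6, "1/9/2019 17:23", "Sue"),
    (8, "1/10/2019 01:35", "Sue"),
]
matched = match_patients(shifts, patients)
for row in matched:
    print(row)
```

The nested loop is fine for a sketch; with thousands of shifts you would index the shifts by provider first, which is roughly what the SQL join condition on provider achieves.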
Related
I have a large Stata dataset that contains the following variables: year, state, household_id, individual_id, partner_id, and race. Here is an example of my data:
year state household_id individual_id partner_id race
1980 CA 23 2 1 3
1980 CA 23 1 2 1
1990 NY 43 4 2 1
1990 NY 43 2 4 1
Note that, in the above table, the individuals in rows 1 and 2 are married to each other.
I want to create a variable that is one if the person is in an interracial marriage.
As a first step, I used the following code
by household_id year: gen inter=0 if race==race[partner_id]
replace inter=1 if inter==.
This code worked well but gave the wrong result in a few cases. As an alternative, I created a string variable identifying each user and its partner, using
gen id_user=string(household_id)+"."+string(individual_id)+string(year)
gen id_partner=string(household_id)+"."+string(partner_id)+string(year)
What I want to do now is something like what vlookup does in Excel: for each row, take the id_partner, find it among the id_user values, look up that person's race, and compare it with the race of the original user.
I guess it should be something like this?
gen inter2==1 if (find race[idpartner]) == (race[iduser])
The expected output should be like this
year state household_id individual_id partner_id race inter2
1980 CA 23 2 1 3 1
1980 CA 23 1 2 1 1
1990 NY 43 4 2 1 0
1990 NY 43 2 4 1 0
I don't think you need anything so general. As you realise, the information on identifiers suffices to find couples, and that in turn allows comparison of race for the people in each couple.
In the code below _N == 2 is meant to catch data errors, such as one partner but not the other being an observation in the dataset or repetitions of one partner or both.
clear
input year str2 state household_id individual_id partner_id race
1980 CA 23 2 1 3
1980 CA 23 1 2 1
1990 NY 43 4 2 1
1990 NY 43 2 4 1
end
generate couple_id = cond(individual_id < partner_id, string(individual_id) + ///
" " + string(partner_id), string(partner_id) + ///
" " + string(individual_id))
bysort state year household_id couple_id : generate mixed = race[1] != race[2] if _N == 2
list, sepby(household_id) abbreviate(15)
+-------------------------------------------------------------------------------------+
| year state household_id individual_id partner_id race couple_id mixed |
|-------------------------------------------------------------------------------------|
1. | 1980 CA 23 2 1 3 1 2 1 |
2. | 1980 CA 23 1 2 1 1 2 1 |
|-------------------------------------------------------------------------------------|
3. | 1990 NY 43 4 2 1 2 4 0 |
4. | 1990 NY 43 2 4 1 2 4 0 |
+-------------------------------------------------------------------------------------+
This idea is documented in this article. The link gives free access to a pdf file.
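The couple-identifier idea can also be sketched outside Stata. The following Python illustration (standard library only; the function name `flag_mixed` is hypothetical) builds an order-independent pair key per household and year, then compares race within each pair. As with `_N == 2` above, groups that do not contain exactly two people get a missing result (`None`).

```python
from collections import defaultdict

def flag_mixed(rows):
    """Return one flag per row: 1 if the couple's races differ, 0 if equal,
    None when the couple group does not contain exactly two people."""
    groups = defaultdict(list)
    keys = []
    for row in rows:
        # Sorting the two ids gives both partners the same couple key
        pair = tuple(sorted((row["individual_id"], row["partner_id"])))
        key = (row["state"], row["year"], row["household_id"], pair)
        groups[key].append(row["race"])
        keys.append(key)
    flags = []
    for key in keys:
        races = groups[key]
        flags.append(int(races[0] != races[1]) if len(races) == 2 else None)
    return flags

rows = [
    {"year": 1980, "state": "CA", "household_id": 23, "individual_id": 2, "partner_id": 1, "race": 3},
    {"year": 1980, "state": "CA", "household_id": 23, "individual_id": 1, "partner_id": 2, "race": 1},
    {"year": 1990, "state": "NY", "household_id": 43, "individual_id": 4, "partner_id": 2, "race": 1},
    {"year": 1990, "state": "NY", "household_id": 43, "individual_id": 2, "partner_id": 4, "race": 1},
]
print(flag_mixed(rows))
```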
In a PROC COMPARE with an ID variable, how can I output only the differences and the new records,
but not the old records that are no longer present?
Example, suppose I have two tables:
mybase:
key other
1 Ann
3 Ann
4 Charlie
5 Emily
and mycompare:
key other
2 Bill
3 Charlie
4 Charlie
running:
proc compare data=mybase
compare=mycompare
outnoequal
outdif
out=myoutput
listvar
outcomp
outbase
method = absolute
criterion = 0.0001
;
id key;
run;
I get a table "myoutput" like this:
type obs key other
base 1 1 Ann
compare 1 2 Bill
base 2 3 Ann
compare 2 3 Charlie
dif 2 3 XXXXXXX
base 4 5 Emily
I would like to have this:
type obs key other
compare 1 2 Bill
base 2 3 Ann
compare 2 3 Charlie
dif 2 3 XXXXXXX
This works for your example. I think you want to output records that are not matched in base and any records that match and have differences.
data mybase;
input key other $;
cards;
1 Ann
3 Ann
4 Charlie
5 Emily
;;;;
data mycompare;
input key other $;
cards;
2 Bill
3 Charlie
4 Charlie
;;;;
proc compare data=mybase
compare=mycompare
outnoequal
outdif
out=myoutput
listvar
outcomp
outbase
method = absolute
criterion = 0.0001
;
id key;
run;
proc print;
run;
data test;
set myoutput;
by key;
if (first.key and last.key) and _type_ eq 'BASE' then delete;
run;
proc print;
run;
Obs _TYPE_ _OBS_ key other
1 COMPARE 1 2 Bill
2 BASE 2 3 Ann
3 COMPARE 2 3 Charlie
4 DIF 1 3 XXXXXXX.
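The selection rule itself (keep new records and changed records, drop records that exist only in base) can be written down independently of PROC COMPARE. Below is a small Python sketch of that rule; `changed_or_new` is a made-up name, the two tables are modelled as plain dictionaries keyed by `key`, and the DIF rows that PROC COMPARE produces are not reproduced.

```python
def changed_or_new(base, compare):
    """Return (type, key, value) rows for new keys and for keys whose values differ."""
    rows = []
    for key in sorted(set(base) | set(compare)):
        in_base, in_compare = key in base, key in compare
        if in_compare and not in_base:
            # New record: present only in the compare table
            rows.append(("COMPARE", key, compare[key]))
        elif in_base and in_compare and base[key] != compare[key]:
            # Changed record: show both versions
            rows.append(("BASE", key, base[key]))
            rows.append(("COMPARE", key, compare[key]))
        # Keys found only in base, and keys with equal values, are skipped
    return rows

mybase = {1: "Ann", 3: "Ann", 4: "Charlie", 5: "Emily"}
mycompare = {2: "Bill", 3: "Charlie", 4: "Charlie"}
result = changed_or_new(mybase, mycompare)
for row in result:
    print(row)
```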
I have a dataset that can be simplified in the following format:
clear
input str9 Date ID VarA VarB
"12jan2010" 5 21 42
"12jan2010" 6 47 21
"15jan2010" 10 7 68
"17jan2010" 6 -5 -3
"19jan2010" 6 -1 -1
end
In the dataset, there is Date, ID, VarA, and VarB. Each ID represents a unique set of transactions. I want to collapse (sum) VarA VarB, by(Date) in Stata. However, I want to keep the date of the first observation for each ID number.
Essentially, I want the above dataset to become the following:
+--------------------------------+
| Date ID VarA VarB |
|--------------------------------|
| 12jan2010 5 21 42 |
| 12jan2010 6 41 17 |
| 15jan2010 10 7 68 |
+--------------------------------+
The observations dated 12jan2010, 17jan2010 and 19jan2010 that share ID 6 should be collapsed, summing VarA and VarB across the three. I want to keep the date 12jan2010 because it is the date of the first observation for that ID. The other two observations are dropped.
I know it might be possible to collapse by ID first and then merge with the original dataset and then subset. I was wondering if there is an easier way to make this work. Thanks!
collapse allows you to calculate a variety of statistics, so you can convert your string date into a numerical date, then take the minimum of the numerical date to get the first occurrence.
clear
input str9 Date ID VarA VarB
"12jan2010" 5 21 42
"12jan2010" 6 47 21
"15jan2010" 10 7 68
"17jan2010" 6 -5 -3
"19jan2010" 6 -1 -1
end
gen Date2 = date(Date, "DMY")
format Date2 %td
collapse (sum) VarA VarB (min) Date2 , by(ID)
order Date2, first
li
yielding
+------------------------------+
| Date2 ID VarA VarB |
|------------------------------|
1. | 12jan2010 5 21 42 |
2. | 12jan2010 6 41 17 |
3. | 15jan2010 10 7 68 |
+------------------------------+
In response to the comment: You can generate the formatted date for only observations where VarA is > 0 (and not missing). (Assuming that, per your comment, VarA & VarB always have the same sign.)
// now assume ID 6 has an earliest date of 17jan2005 (obs.4)
// but you want to return your 'first date' as the
// first date where varA & varB are both positive
clear
input str9 Date ID VarA VarB
"12jan2010" 5 21 42
"12jan2010" 6 47 21
"15jan2010" 10 7 68
"17jan2005" 6 -5 -3
"19jan2010" 6 -1 -1
end
gen Date2 = date(Date, "DMY") if VarA > 0 & !missing(VarA)
format Date2 %td
collapse (sum) VarA VarB (min) Date2 , by(ID)
order Date2, first
li
yielding
+------------------------------+
| Date2 ID VarA VarB |
|------------------------------|
1. | 12jan2010 5 21 42 |
2. | 12jan2010 6 41 17 |
3. | 15jan2010 10 7 68 |
+------------------------------+
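For readers who want to check the logic outside Stata, here is a rough Python equivalent of both variants, using only the standard library. The function name `collapse_by_id` and its `positive_only` flag are invented for this sketch, and parsing month abbreviations with `%b` assumes an English locale.

```python
from datetime import date, datetime

def collapse_by_id(records, positive_only=False):
    """Sum VarA and VarB per ID and keep the earliest date per ID.
    With positive_only=True, only rows with VarA > 0 contribute a date."""
    out = {}
    for date_text, id_, var_a, var_b in records:
        day = datetime.strptime(date_text, "%d%b%Y").date()
        eligible = var_a > 0 or not positive_only
        first, sum_a, sum_b = out.get(id_, (None, 0, 0))
        if eligible and (first is None or day < first):
            first = day
        out[id_] = (first, sum_a + var_a, sum_b + var_b)
    return out

records = [
    ("12jan2010", 5, 21, 42),
    ("12jan2010", 6, 47, 21),
    ("15jan2010", 10, 7, 68),
    ("17jan2005", 6, -5, -3),
    ("19jan2010", 6, -1, -1),
]
print(collapse_by_id(records)[6])
print(collapse_by_id(records, positive_only=True)[6])
```

An ID with no positive rows ends up with a date of `None` when `positive_only=True`, which mirrors a missing value in the collapsed Stata data.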
Consider the following data set test :
Drug Quantity State Year
A
B
C
. . . .
How would I sum up the quantities of each drug grouped by state and year? Would it be something like:
data test;
by Drug State Year;
Total = sum(Quantity)
run;
You need something like this:
data test;
input Drug $ Quantity State $ Year;
datalines;
A 10 NY 2013
A 20 NY 2014
B 110 NY 2013
B 210 NY 2014
A 50 OH 2013
A 60 OH 2014
B 150 OH 2013
B 260 OH 2014
A 22 NY 2014
B 100 OH 2013
;
RUN;
proc means data= test SUM MAXDEC=0;
class Drug State Year;
var Quantity;
RUN;
Mucio's answer is good, but if you are after a SAS SQL version, here it is:
data test;
input Drug $ Quantity State $ Year;
datalines;
A 10 NY 2013
A 20 NY 2014
B 110 NY 2013
B 210 NY 2014
A 50 OH 2013
A 60 OH 2014
B 150 OH 2013
B 260 OH 2014
A 22 NY 2014
B 100 OH 2013
;
RUN;
PROC SQL;
CREATE TABLE EGTASK.QUERY_FOR_TEST AS
SELECT t1.Drug,
t1.State,
t1.Year,
/* SUM_of_Quantity */
(SUM(t1.Quantity)) AS SUM_of_Quantity
FROM WORK.TEST t1
GROUP BY t1.Drug,
t1.State,
t1.Year;
QUIT;
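To sanity-check the totals, the same grouping can be reproduced in a few lines of Python (an illustration only, not part of either SAS answer; `total_by_group` is a hypothetical helper name).

```python
from collections import defaultdict

def total_by_group(rows):
    """Sum Quantity for each (Drug, State, Year) combination."""
    totals = defaultdict(int)
    for drug, quantity, state, year in rows:
        totals[(drug, state, year)] += quantity
    return dict(totals)

rows = [
    ("A", 10, "NY", 2013), ("A", 20, "NY", 2014),
    ("B", 110, "NY", 2013), ("B", 210, "NY", 2014),
    ("A", 50, "OH", 2013), ("A", 60, "OH", 2014),
    ("B", 150, "OH", 2013), ("B", 260, "OH", 2014),
    ("A", 22, "NY", 2014), ("B", 100, "OH", 2013),
]
totals = total_by_group(rows)
for group in sorted(totals):
    print(group, totals[group])
```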
I tried various date formats, but the output does not show any date. What could be the issue?
data c;
input age gender income color$ doj$;
format doj date9.;
datalines;
19 1 14000 W 14/07/1988
45 2 45000 b 15/09/1956
34 2 56000 y 14/09/1967
33 1 45000 b 14/02/1956
;
run;
You are mixing things up a bit.
The date formats are to be applied on numeric data, not on text data.
So you should not read in doj as $ (text), but as a date (so a date informat).
Try DDMMYY10. for doj on your input statement:
data c;
input age gender income color$ doj ddmmyy10.;
format doj date9.;
datalines;
19 1 14000 W 14/07/1988
45 2 45000 b 15/09/1956
34 2 56000 y 14/09/1967
33 1 45000 b 14/02/1956
;
run;
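The distinction between reading a date and displaying it can be illustrated with a short Python sketch (not SAS). The helper names are made up; the only SAS fact relied on is that a SAS date value is the number of days since 1 January 1960. `show_date9` imitates the look of the DATE9. format and assumes an English locale for the month abbreviation.

```python
from datetime import date, datetime

SAS_EPOCH = date(1960, 1, 1)

def read_ddmmyy(text):
    """Parse dd/mm/yyyy text into a date (the role of the DDMMYY10. informat)."""
    return datetime.strptime(text, "%d/%m/%Y").date()

def sas_date_value(day):
    """Number of days since 1 January 1960, which is how SAS stores a date."""
    return (day - SAS_EPOCH).days

def show_date9(day):
    """Render a date in the style of the DATE9. format, e.g. 14JUL1988."""
    return day.strftime("%d%b%Y").upper()

doj = read_ddmmyy("14/07/1988")
print(sas_date_value(doj), show_date9(doj))
```

The text "14/07/1988" only becomes something a date format can act on after it has been converted to a number, which is why reading doj with `$` left nothing for `date9.` to format.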