I need to create a (simple?) error report from a dataset.
The dataset looks like this:
Contract DrvrNum LicNum
------- ------- ---------
2212621 2 8241323
2212621 2 65256129
6385371 1 973385261
6385371 3 973385261
2366922 1 B931151BA
2366922 2 B931151BA
1007922 1 60916004
1007922 2 60916004
The first 2 observations indicate that I have two different license numbers for the same driver, whereas the next three pairs of observations indicate that a license number has been duplicated for two or more drivers.
My output needs to look like this:
Contract DrvrNum LicNum ErrorReason
------- ------- --------- -----------
2212621 2 8241323
2212621 2 65256129 Multiple License Numbers for Same Driver
6385371 1 973385261
6385371 3 973385261 Duplicate License Number
2366922 1 B931151BA
2366922 2 B931151BA Duplicate License Number
1007922 1 60916004
1007922 2 60916004 Duplicate License Number
I have tried using the LAG() function in a data step combined with first.Contract = 0, but because every other observation is FALSE, the lag values got all out of whack, keeping me from doing something like:
if LicNum = lag(LicNum) then ErrorReason = 'Duplicate License Number';
else ErrorReason = 'Multiple License Numbers for Same Driver';
If anyone can provide some assistance, I would be grateful. I'm at my wit's end with this one.
Thanks!
The queue of values for LAG() is updated each time the LAG() function executes; it has nothing to do with the observations in your dataset. So in general you do not want to execute LAG() conditionally. Instead, assign its value unconditionally to a variable, and then test that variable conditionally.
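To make the queue behaviour concrete, here is a rough Python model of a single LAG() call site. The `Lag` class is a hypothetical toy written for this illustration, not part of SAS or any library; it only assumes that each call returns whatever the previous call stored.

```python
class Lag:
    """Toy model of one LAG() call site: a one-slot queue (hypothetical helper)."""

    def __init__(self):
        self.slot = None  # stands in for the initial missing value

    def __call__(self, value):
        # Return what the previous CALL stored, then store the current value.
        previous, self.slot = self.slot, value
        return previous


rows = ["A", "B", "C", "D"]

# Called on every row: returns the previous row's value.
lag_every_row = Lag()
unconditional = [lag_every_row(v) for v in rows]

# Called only on every other row: returns the value from the previous call,
# which here is two rows back rather than one.
lag_some_rows = Lag()
conditional = [lag_some_rows(v) if i % 2 == 1 else None
               for i, v in enumerate(rows)]

print(unconditional)  # [None, 'A', 'B', 'C']
print(conditional)    # [None, None, None, 'B']
```

On the last row the conditional version returns 'B' instead of 'C', which is the "out of whack" effect described in the question.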
But LAG() will not solve your problem if there are more than two observations per contract.
Try something like this that will keep track of all of the license numbers under the contract.
data have ;
input Contract :$10. DrvrNum LicNum :$10. ;
cards;
2212621 2 8241323
2212621 2 65256129
6385371 1 973385261
6385371 3 97338526x
6385371 3 973385261
2366922 1 B931151BA
2366922 2 B931151BA
1007922 1 60916004
1007922 2 60916004
;
data want ;
  set have ;
  by contract drvrnum licnum notsorted;
  length ErrorReason $100 LicenseList $200 ;
  retain licenselist ;
  * Start a new list of license numbers for each contract ;
  if first.contract then licenselist=LicNum;
  else do;
    * License number already seen under this contract ;
    if indexw(licenselist,LicNum,' ') then
      ErrorReason = catx(' ',ErrorReason,'Duplicate License Number.');
    else licenselist=catx(' ',licenselist,licnum);
    * Same driver as the previous row but a different license number ;
    if first.licnum and not first.drvrnum then
      ErrorReason = catx(' ',ErrorReason,'Multiple License Numbers for Same Driver.');
  end;
run;
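For readers who find the BY-group flags hard to follow, the same bookkeeping can be sketched in Python. This is only an illustration of the logic above, assuming the rows arrive grouped by contract; the function name `flag_license_errors` is made up for this sketch.

```python
def flag_license_errors(rows):
    """rows: (contract, driver, license) tuples, already grouped by contract."""
    out = []
    seen = set()   # license numbers seen so far under the current contract
    prev = None    # previous row, used to mimic the first.* flags
    for contract, driver, lic in rows:
        reasons = []
        if prev is None or prev[0] != contract:
            seen = {lic}                      # like first.contract
        else:
            if lic in seen:
                reasons.append("Duplicate License Number.")
            else:
                seen.add(lic)
            # Same driver as the previous row, but a different license number.
            if prev[1] == driver and prev[2] != lic:
                reasons.append("Multiple License Numbers for Same Driver.")
        out.append((contract, driver, lic, " ".join(reasons)))
        prev = (contract, driver, lic)
    return out


rows = [
    ("2212621", 2, "8241323"), ("2212621", 2, "65256129"),
    ("6385371", 1, "973385261"), ("6385371", 3, "97338526x"),
    ("6385371", 3, "973385261"),
    ("2366922", 1, "B931151BA"), ("2366922", 2, "B931151BA"),
    ("1007922", 1, "60916004"), ("1007922", 2, "60916004"),
]
report = flag_license_errors(rows)
for row in report:
    print(row)
```

Note that a row can receive both messages, as the fifth row of the sample data does.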
I have a dataset on multiple outcomes for individuals in two groups that were treated (or not treated) by an intervention at two time points. However, not every individual has complete data for each measure at each time point.
id outcome outcome_value group time
-- ---------- ------------- ----- ----
1 depression 10 1 1
1 depression 8 1 2
2 depression 10 2 1
2 depression . 2 2
1 anxiety 12 1 1
1 anxiety 8 1 2
2 anxiety 12 2 1
2 anxiety 6 2 2
How do I exclude IDs that do not have an outcome in both periods? I only want to see how outcomes changed between groups over time for observations that have data in all periods. I am using the mixed command in Stata to conduct this analysis.
First drop the missing rows
keep if !missing(outcome_value)
Then, keep the ID/outcome combinations that have _N==2
bysort id outcome: keep if _N==2
Output:
id outcome outco~ue group time ct
1 anxiety 8 1 2 2
1 anxiety 12 1 1 2
1 depression 10 1 1 2
1 depression 8 1 2 2
2 anxiety 6 2 2 2
2 anxiety 12 2 1 2
As @NickCox has pointed out in the comments, while we cannot directly combine these two, there is still a one-line approach:
bysort id outcome (time) : keep if !missing(outcome_value[1], outcome_value[2])
Of note, we cannot do this:
bysort id outcome : keep if !missing(outcome_value) & _N==2
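because _N is the number of rows in the group before anything is dropped, so it still counts the rows with a missing outcome. The group size only shrinks once those rows have actually been removed.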
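The difference between the two-step and the combined one-step filter can be checked with a small Python sketch. It is an illustration only, using the question's data with `None` standing in for Stata's missing value; the variable names are made up here.

```python
from collections import Counter

# (id, outcome, outcome_value, group, time); None marks a missing outcome
rows = [
    (1, "depression", 10, 1, 1), (1, "depression", 8, 1, 2),
    (2, "depression", 10, 2, 1), (2, "depression", None, 2, 2),
    (1, "anxiety", 12, 1, 1), (1, "anxiety", 8, 1, 2),
    (2, "anxiety", 12, 2, 1), (2, "anxiety", 6, 2, 2),
]

# Two-step version: drop missing rows first, then count rows per id/outcome.
non_missing = [r for r in rows if r[2] is not None]
counts_after = Counter((r[0], r[1]) for r in non_missing)
complete = [r for r in non_missing if counts_after[(r[0], r[1])] == 2]

# Combined version: the group size is counted BEFORE missing rows are dropped,
# so id 2's remaining depression row slips through.
counts_before = Counter((r[0], r[1]) for r in rows)
naive = [r for r in rows
         if r[2] is not None and counts_before[(r[0], r[1])] == 2]

print(len(complete))  # 6
print(len(naive))     # 7
```

The extra row in `naive` is id 2's time-1 depression score, which has no time-2 partner.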
I would like to add a summary record after each group of records belonging to a given shop. So, I have this:
Shop_id Trans_id Count
1 1 10
1 2 23
1 3 12
2 1 8
2 2 15
And want to have this:
Shop_id Trans_id Count
1 1 10
1 2 23
1 3 12
. . 45
2 1 8
2 2 15
. . 23
I have done this using PROC SQL but I would like to do this using PROC REPORT as I have read that PROC REPORT should handle such cases.
Try this:
data have;
input shop_id Trans_id Count;
cards;
1 1 10
1 2 23
1 3 12
2 1 8
2 2 15
;
proc report data=have out=want(drop=_:);
define shop_id/group;
define trans_id/order;
define count/sum;
break after shop_id/summarize;
compute after shop_id;
if _break_='shop_id' then shop_id=.;
endcomp;
run;
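As a point of comparison, the "summary row after each group" layout can be sketched in plain Python. This is not what PROC REPORT does internally; it is only a small illustration of the desired result, assuming the rows are already sorted by shop, with `None` standing in for the missing values shown as dots. The function name is hypothetical.

```python
from itertools import groupby

rows = [(1, 1, 10), (1, 2, 23), (1, 3, 12), (2, 1, 8), (2, 2, 15)]


def with_subtotals(rows):
    """Append a (None, None, total) row after each run of equal shop_id."""
    out = []
    for _shop_id, group in groupby(rows, key=lambda r: r[0]):
        group = list(group)
        out.extend(group)
        out.append((None, None, sum(r[2] for r in group)))
    return out


result = with_subtotals(rows)
for row in result:
    print(row)
```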
I need to create a new variable that repeats the earliest date for an ID visit. If the date is missing, the new variable should be missing too; after a missing value, it should hold the first date that follows the missing one (as in the example). I tried the LAG function and it didn't work; I also tried the keep function, but that just repeated 25NOV2015 for all records. The final result ("what I need") is in the last column.
Thanks
Example
You need to use the RETAIN statement. RETAIN means the variable is not reset to missing at the start of each iteration of the data step, so on the next iteration it still remembers its value from the previous observation.
Sample data
data a;
input date;
format date ddmmyy10.;
datalines;
.
5
6
7
.
1
2
.
9
;
run;
Solution
data b;
set a;
retain new_date;
format new_date ddmmyy10.;
if date = . then
new_date = .;
if new_date = . then
new_date = date;
run;
Since you didn't post any data I will make up some. Also since the fact that your variable is a date doesn't really impact the answer I will just use some integers as they are easier to type.
data have ;
input id value @@ ;
cards;
1 . 1 2 1 3 1 . 1 5 1 6 1 . 1 8
2 1 2 2 2 3 2 . 2 5 2 6
;;;;
Basically your algorithm says that you want to store the value when either the current value is missing or stored value is missing. With multiple BY groups you would also want to set it when you start a new group.
data want ;
set have ;
by id ;
retain new_value ;
if first.id or missing(new_value) or missing(value)
then new_value=value;
run;
Results:
new_
Obs id value value
1 1 . .
2 1 2 2
3 1 3 2
4 1 . .
5 1 5 5
6 1 6 5
7 1 . .
8 1 8 8
9 2 1 1
10 2 2 1
11 2 3 1
12 2 . .
13 2 5 5
14 2 6 5
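The same rule can be traced in Python to see why the results come out this way. This is a sketch of the logic only, with `None` standing in for a SAS missing value; the function name `carry_first_value` is made up for the illustration.

```python
def carry_first_value(rows):
    """rows: (id, value) pairs grouped by id; value may be None (missing)."""
    out = []
    new_value = None   # plays the role of the retained variable
    prev_id = None
    for id_, value in rows:
        first_in_group = id_ != prev_id
        # Store the current value at the start of a group, or when either
        # the stored value or the current value is missing.
        if first_in_group or new_value is None or value is None:
            new_value = value
        out.append((id_, value, new_value))
        prev_id = id_
    return out


have = [(1, None), (1, 2), (1, 3), (1, None), (1, 5), (1, 6), (1, None), (1, 8),
        (2, 1), (2, 2), (2, 3), (2, None), (2, 5), (2, 6)]
want = carry_first_value(have)
print([new for _, _, new in want])
```

The printed list matches the new_value column in the results above.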
I've tried to Google and read around this problem, but I can't seem to find an adequate solution. I'm hoping someone here can help me. I'm sorry if it's too simple but I would appreciate any advice or help.
I'm working with a longitudinal dataset and I would like to assign an encounter number for each person (ID) who may have had one or more interactions with our laboratory (accession). The dataset looks something like this, and I would like to create a new variable (encounter) that numbers each unique encounter for each individual sequentially.
ID accession encounter
----------------------------------
1 1234 1
1 1234 1
1 1235 2
1 1236 3
1 1236 3
2 1000 1
2 1001 2
2 1001 2
3 1111 1
3 1112 2
4 1001 1
4 1001 1
I've tried using first.variable statements such as:
data new; set old;
by id accession;
if first.id & first.accession then encounter=1;
else encounter+1;
run;
I haven't been successful because it won't retain the same encounter number if both the id and accession number remain the same.
Thank you in advance for helping to point me in the right direction.
You're close. At the first row of each ID you want to reset the counter to 0, and at the first row of each accession you want to increment it.
data new; set old;
by id accession;
Retain encounter;
if first.id then encounter=0;
If first.accession then encounter+1;
run;
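The reset-then-increment idea can be checked against the sample data with a short Python sketch. It is an illustration of the logic rather than a translation of the data step, and it assumes the rows are sorted by ID and accession; the function name is hypothetical.

```python
def number_encounters(rows):
    """rows: (id, accession) pairs sorted by id, then accession."""
    out = []
    encounter = 0
    prev = None
    for id_, accession in rows:
        if prev is None or id_ != prev[0]:
            encounter = 0          # like first.id: reset the counter
        if prev is None or (id_, accession) != prev:
            encounter += 1         # like first.accession: new encounter
        out.append((id_, accession, encounter))
        prev = (id_, accession)
    return out


old = [(1, 1234), (1, 1234), (1, 1235), (1, 1236), (1, 1236),
       (2, 1000), (2, 1001), (2, 1001),
       (3, 1111), (3, 1112),
       (4, 1001), (4, 1001)]
new = number_encounters(old)
print([enc for _, _, enc in new])
```

Because the reset happens before the increment, the first row of every ID always ends up with encounter 1, and repeated accessions keep the same number.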
I have a dataframe as given below:
Index Date Country Occurence
0 2013-12-30 US 1
1 2013-12-30 India 3
2 2014-01-10 US 1
3 2014-01-15 India 1
4 2014-02-05 UK 5
I want to convert the daily data into weekly data, grouped by Country, using sum as the aggregation method.
I tried resampling, but the output was a MultiIndex data frame from which I was not able to access the "Country" and "Date" columns (see above).
The desired output is given below:
Date Country Occurence
Week1 India 4
Week2
Week1 US 2
Week2
Week5 Germany 5
You can group by Country and resample by week:
In [63]: df
Out[63]:
Date Country Occurence
0 2013-12-30 US 1
1 2013-12-30 India 3
2 2014-01-10 US 1
3 2014-01-15 India 1
4 2014-02-05 UK 5
In [64]: df.set_index('Date').groupby('Country').resample('W', how='sum')
Out[64]:
Occurence
Country Date
India 2014-01-05 3
2014-01-12 NaN
2014-01-19 1
UK 2014-02-09 5
US 2014-01-05 1
2014-01-12 1
You can then use reset_index() to turn Country and Date back into regular columns:
In [65]: df.set_index('Date').groupby('Country').resample('W', how='sum').reset_index()
Out[65]:
Country Date Occurence
0 India 2014-01-05 3
1 India 2014-01-12 NaN
2 India 2014-01-19 1
3 UK 2014-02-09 5
4 US 2014-01-05 1
5 US 2014-01-12 1
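The `how=` keyword used in the session above belongs to an older pandas; it was later deprecated and removed in favour of calling `.sum()` on the resampler. To show what the weekly bins mean independently of the pandas version, here is a plain-Python sketch. It assumes that 'W' bins are labelled by the Sunday that ends each week, which is consistent with the dates in the output above; the helper `week_ending_sunday` is made up for this illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_ending_sunday(d):
    """Return the Sunday on or after d (Monday is weekday 0, Sunday is 6)."""
    return d + timedelta(days=(6 - d.weekday()) % 7)


rows = [
    (date(2013, 12, 30), "US", 1),
    (date(2013, 12, 30), "India", 3),
    (date(2014, 1, 10), "US", 1),
    (date(2014, 1, 15), "India", 1),
    (date(2014, 2, 5), "UK", 5),
]

# Sum Occurence per (Country, week-ending date).
totals = defaultdict(int)
for day, country, occurence in rows:
    totals[(country, week_ending_sunday(day))] += occurence

weekly = sorted((country, week, n) for (country, week), n in totals.items())
for country, week, n in weekly:
    print(country, week.isoformat(), n)
```

Unlike resample, this sketch does not emit rows for empty weeks, so there is no India entry for 2014-01-12.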