SAS count and summarize if conditions on variables are true

I have a data set with 3 columns: Name, System, UserID. I want to count how many times an individual appears in the report, but not count them if they're different people with the same name. Distinction is made by the UserID field and only within a single system. If a single name has multiple lines with the same system and different User ID, then all the observations with that name would be marked for review. For this data set, I would want to see the output below.
Name       System  UserID
John Doe   Sys1    [blank]
John Doe   Sys1    AB1234
John Doe   Sys2    AB2345
Jane Doe   Sys1    AA2345
Jane Doe   Sys1    AA23456
Jane Doe   Sys2    AA2345
Joe Smith  Sys1    JS963
Joe Smith  Sys2    JS741

Name       Count  System                      Follow-up
John Doe   1      Sys1 -                      Yes
John Doe   1      Sys1 - AB1234               Yes
John Doe   1      Sys2 - AB2345               Yes
Jane Doe   1      Sys1 - AA2345               Yes
Jane Doe   1      Sys1 - AA23456              Yes
Jane Doe   1      Sys2 - AA2345               Yes
Joe Smith  2      Sys1 - JS963, Sys2 - JS741  No
Any help would be greatly appreciated!
The code I have is below. It currently just counts names in general, and I don't know how to add the conditions.
PROC SQL;
CREATE TABLE Sorted_Master_Original AS
SELECT Name,
COUNT(Name) AS Total,
System,
UserID,
CATX(' - ',System,UserID) AS SystemID
FROM Master_Original
WHERE Name <> ""
GROUP BY Name;
QUIT;
DATA TESTDATA.Final_Listing;
LENGTH SystemsAccessed $200.;
DO UNTIL (last.Name);
SET Sorted_Master_Original;
BY Name NOTSORTED;
SystemsAccessed=CATX(', ',SystemsAccessed,SystemID);
END;
DROP System SystemID;
RUN;

Determining a signal over a group and then applying it to each member of the group can be done with two DOW loops in sequence. The first loop is as you have coded it: the SET and BY statements sit inside the loop, which ends on a last. test. The second loop iterates the same number of times over the same group, reading it through a separate SET buffer, and at that point the signal can be applied to each row.
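A minimal sketch of the two-loop pattern on its own, using a hypothetical toy data set (the names scores, id, x and total are illustrative and not part of the question). The first loop computes a group total, and the second loop attaches it to every row of the group:

```sas
/* Hypothetical toy data, already sorted by id */
data scores;
  input id $ x;
  datalines;
A 1
A 2
A 3
B 10
;

data scores_with_total;
  /* first loop: read one BY group and accumulate its total */
  do _n_ = 1 by 1 until (last.id);
    set scores;
    by id;
    total = sum(total, x);
  end;
  /* second loop: reread the same rows through a separate SET buffer */
  do _n_ = 1 to _n_;
    set scores;
    output;   /* each row now carries the total of its group */
  end;
run;
```

Because total is not retained, it starts as missing for each new group, so no explicit reset is needed.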
Data
data have;
  length Name System UserID $20;
  * the & modifier allows single blanks inside a value, so fields are separated by two or more blanks;
  input Name & System & UserID;
  datalines;
John Doe  Sys1  .
John Doe  Sys1  AB1234
John Doe  Sys2  AB2345
Jane Doe  Sys1  AA2345
Jane Doe  Sys1  AA23456
Jane Doe  Sys2  AA2345
Joe Smith  Sys1  JS963
Joe Smith  Sys2  JS741
Bob Smith  Sys3  MS13
run;
Order for by group processing
proc sort data=have;
by Name System UserId;
run;
DATA Step with sequential DOW loops
data want(keep=name count system_userid_list followup);
  length system_userid_list $200;

  * first loop: read one name group and look for a signal;
  do _n_ = 1 by 1 until (last.name);
    set have;
    by name system userid;
    * more than one row for a system within the name group marks the group for review;
    if not (first.system and last.system) then
      count = 1;
  end;

  * no signal found, so count is the number of rows in the group;
  if not count then count = _n_;
  followup = ifc(count=1 and _n_>1, 'Yes', 'No');

  * the followup signal is now available to each row of the group;
  * note that the output mixes single rows and aggregated rows;

  * second loop: reiterate over the group in a second SET buffer;
  do _n_ = 1 to _n_;
    set have;
    item = catx(' - ', system, userid);
    if count > 1 then
      system_userid_list = catx(', ', system_userid_list, item);
    else do;
      * flagged groups and single-row groups are output one row at a time;
      system_userid_list = item;
      output;
    end;
  end;

  * unflagged multi-row groups are output once with the aggregated list;
  if count > 1 then output;
run;
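As a cross-check on which names get flagged, the same condition (more than one row for a name within a single system) can be sketched in PROC SQL. This is only an illustration on a hypothetical four-row sample; demo and review_names are made-up names. It uses count(*) rather than count(distinct UserID) because the latter ignores blank user IDs, and a blank ID next to a real one should still trigger a review:

```sas
/* Hypothetical sample with the same three columns as the question */
data demo;
  length Name System UserID $20;
  input Name & System & UserID;
  datalines;
John Doe  Sys1  .
John Doe  Sys1  AB1234
Joe Smith  Sys1  JS963
Joe Smith  Sys2  JS741
;

proc sql;
  /* names that have more than one row in any single system */
  create table review_names as
  select distinct Name
  from demo
  group by Name, System
  having count(*) > 1;
quit;
```

On this sample only John Doe has two rows in the same system, so review_names should hold that one name.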

Related

Number of Records Per Hour by Category

I'm struggling to conceptualize code that would output the average number of patients seen by provider. Here is a snippet of my dataset, which spans 3 years of data. It has three variables: the patient ID, the provider name, and the time at which the provider saw the patient, stored in a date/time format:
patient_fin first_Md_seen Provider_Seen_Date_Time
1 Bob 5/1/2018 4:19:00 AM
2 Bob 5/1/2018 4:29:00 AM
3 Bob 5/1/2018 4:30:00 PM
4 Sally 5/1/2018 7:39:00 AM
5 Sally 5/1/2018 7:49:00 AM
6 Sally 5/1/2018 8:55:00 PM
7 Bubba 5/3/2018 12:19:00 AM
8 Bob 5/3/2018 4:10:00 AM
....
To calculate the number of patients seen by a provider, I wrote the following code:
data ED_TAT3;
SET ED_TAT2;
if patient_fin ne . then Patient_fin_count=1;
run;
proc means data = ED_TAT3;
class first_Md_seen;
var Patient_fin_count;
run;
Now, I need to figure out how many hours a provider worked so I can divide the number of patients seen by the number of hours worked.
I think I can use the Provider_Seen_Date_Time variable as a proxy, after extracting the hour with 'hour = hour (datepart(Provider_Seen_Date_Time))'.
Would code like this give me the correct number of hours a provider worked?
data new1;
set new;
hour = hour (datepart(Provider_Seen_Date_Time));
if Provider_Name = 'Bob' and hour ne . then hour_worked = 1;
run;
Is there:
1) a more accurate or efficient way (there are hundreds of different providers) to figure out the total number of hours worked per provider?
OR
2) a better way to simply compute the number of patients per hour a provider saw?
Desired output:
Provider Avg Patients Seen per Hour
Bob 5
Sally 4
Bubba 6
Thanks in advance!
Based on what is given, you can try the following code. However, I still have concerns about the data.
data ed_tat2;
input patient_fin first_Md_seen$ Provider_Seen_Date_Time mdyampm25.2;
format Provider_Seen_Date_Time mdyampm25.;
hour = hour (Provider_Seen_Date_Time);
date_seen=datepart(Provider_Seen_Date_Time);
format date_seen date9.;
datalines;
1 Bob 5/1/2018 4:19:00 AM
2 Bob 5/1/2018 4:30:00 PM
3 Sally 5/1/2018 7:39:00 AM
4 Sally 5/1/2018 7:59:00 PM
5 Bubba 5/3/2018 12:19:00 AM
6 Bob 5/3/2018 4:10:00 AM
7 Bob 5/3/2018 4:30:00 AM
8 Bob 5/3/2018 5:10:00 AM
run;
proc sort data=ed_tat2; by first_Md_seen date_seen hour; run;
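data ed_tat3;
  set ed_tat2;
  by first_Md_seen date_seen hour;
  * hour becomes 1 on the first row of each provider/date/hour group and 0 otherwise;
  * first.hour gives the same result as the lag() comparison without calling lag() inside a condition;
  hour = first.hour;
run;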
proc sql;
select first_Md_seen, date_seen, count(patient_fin) as number_of_patients_seen, sum(hour) as number_of_hours, count(patient_fin)/sum(hour) as patients_seen_per_hour
from ed_tat3
where hour ne .
group by first_Md_seen, date_seen;
select first_Md_seen, count(patient_fin) as number_of_patients_seen, sum(hour) as number_of_hours, count(patient_fin)/sum(hour) as patients_seen_per_hour
from ed_tat3
where hour ne .
group by first_Md_seen;
quit;
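The same per-provider ratio can also be sketched in a single query by counting distinct clock hours directly from the datetime. This is an illustration under assumptions: visits, provider and seen are hypothetical names, and an hour counts as "worked" only if at least one patient was seen in it, as in the code above. Note that hour() should be applied to the datetime itself, not to datepart() of it, since datepart() returns a date with no time component.

```sas
/* Hypothetical sample: provider name and the datetime the patient was seen */
data visits;
  input provider $ seen :datetime20.;
  format seen datetime20.;
  datalines;
Bob 01MAY2018:04:19:00
Bob 01MAY2018:04:29:00
Bob 01MAY2018:16:30:00
Sally 01MAY2018:07:39:00
;

proc sql;
  select provider,
         count(*) as patients_seen,
         /* intnx with increment 0 truncates each datetime to the start of its hour */
         count(distinct intnx('hour', seen, 0)) as hours_worked,
         count(*) / count(distinct intnx('hour', seen, 0)) as patients_per_hour
  from visits
  group by provider;
quit;
```

In this sample Bob has three patients spread over two distinct hours and Sally has one patient in one hour.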
You can do this easily with two PROC FREQ steps.
The first calculates the number of patients seen by each doctor per hour, and the second uses the output of the first to calculate the number of hours worked per doctor, per day. You can easily adapt these by modifying the TABLE statements.
data ed_tat2;
input patient_fin first_Md_seen $ Provider_Seen_Date_Time mdyampm25.2;
format Provider_Seen_Date_Time mdyampm25.;
hour=hour (Provider_Seen_Date_Time);
date_seen=datepart(Provider_Seen_Date_Time);
format date_seen date9.;
datalines;
1 Bob 5/1/2018 4:19:00 AM
2 Bob 5/1/2018 4:30:00 PM
3 Sally 5/1/2018 7:39:00 AM
4 Sally 5/1/2018 7:59:00 PM
5 Bubba 5/3/2018 12:19:00 AM
6 Bob 5/3/2018 4:10:00 AM
7 Bob 5/3/2018 4:30:00 AM
8 Bob 5/3/2018 5:10:00 AM
;
run;
*counts per hour;
proc freq data=ed_tat2 noprint;
table first_Md_seen*date_seen*hour / out=provider_counts;
run;
*hours worked per doctor;
proc freq data=provider_counts noprint;
table first_Md_seen*date_seen / out=provider_hours;
run;
title 'Number of patients seen';
proc print data=provider_counts label;
  label count='# of patients per hour';
run;
* the run statement above ends the step, so the next title does not replace the first one;
title 'Number of hours worked';
proc print data=provider_hours label;
  label count='# of hours worked in a day';
run;

Need a suggestion on how to merge datasets with similar but not identical joint id

So what I have is a data set that has each city and state in one column. The other data set also has city and state in one column BUT some of the cities are combined. For example:
Data set one will have:
CITY STATE POPULATION
Cape Coral Fl 1000000
Fort Myers FL 2000000
Gainesville FL 100000
Data set two will have:
CITY STATE EMPLOYMENT
Cape Coral - Fort Myers FL 900
Gainesville FL 1000
I thought about doing a "fuzzy" match, but then for the hyphenated cities I won't get the full population. I could try to break up the hyphenated cities and then divide the employment in half, but I don't know how to do that.
I am hoping there is an easier solution out there that I haven't thought of. I went ahead and did a traditional merge by CITY STATE, but it only matched half of my data set.
Thanks in advance!
The second data set can be broken into more rows if you make some assumptions, such as that component cities are separated by a dash (-) and that the state is always the last piece.
data two;
  length city_state $100;
  * the & modifier requires two or more blanks before the next field;
  input CITY_STATE & EMPLOYMENT;
  datalines;
Cape Coral - Fort Myers FL  900
Gainesville FL  1000
run;
data two_b;
length city_state_item $100;
set two;
state = scan (city_state, -1, ' ');
p = find (city_state, trim(state), -101);
city_state_base = substr(city_state,1,p-1);
do _n_ = 1 by 1 while (scan(city_state_base,_n_,'-') ne '');
city_state_item = catx (' ', scan(city_state_base,_n_,'-'), state);
OUTPUT;
employment = 0;
end;
drop p city_state_base state;
run;
After splitting you will have to match ONE.city_state to TWO_B.city_state_item and deal with how employment will be split or not split depending on how the matched data get re-aggregated or used to compute some employment to population ratio.
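A minimal sketch of that matching step, assuming the split table has the shape produced above. The tables one and two_b are rebuilt here from hypothetical rows so the example stands alone, and matched is an illustrative name. The comparison uses upcase() because the question's data spells the state both as "Fl" and "FL":

```sas
/* Population table from the question (fields separated by two blanks) */
data one;
  length city_state $100;
  input city_state & population;
  datalines;
Cape Coral Fl  1000000
Fort Myers FL  2000000
Gainesville FL  100000
;

/* Hypothetical result of the split: first component city keeps the employment */
data two_b;
  length city_state_item $100;
  input city_state_item & employment;
  datalines;
Cape Coral FL  900
Fort Myers FL  0
Gainesville FL  1000
;

proc sql;
  create table matched as
  select a.city_state, a.population, b.employment
  from one as a
  left join two_b as b
    on upcase(a.city_state) = upcase(b.city_state_item);
quit;
```

A left join keeps every city from the population table, so cities with no employment row show up with a missing value instead of being dropped.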
Making some assumptions, this solution could work:
data a;
  length city_state $100;
  * the & modifier requires two or more blanks before the next field;
  input CITY_STATE & POPULATION;
  datalines;
Cape Coral Fl  1000000
Fort Myers FL  2000000
Gainesville FL  100000
run;
data b;
  length city_state $100;
  input CITY_STATE & EMPLOYMENT;
  datalines;
Cape Coral - Fort Myers FL  900
Gainesville FL  1000
Run;
Proc sql;
select a.city_state, b.city_state, a.population, case when b.city_state contains '-' then b.EMPLOYMENT /2 else b.EMPLOYMENT End as EMPLOYMENT from a
inner join b
on b.city_state contains substr(a.city_state,1,length(a.city_state)-length(scan(a.city_state,-1,' ')));
quit;
result:
city_state | city_state |POPULATION |EMPLOYMENT
------------------------------------------------------------------------
Cape Coral Fl | Cape Coral - Fort Myers FL | 1000000 | 450
Fort Myers FL | Cape Coral - Fort Myers FL | 2000000 | 450
Gainesville FL | Gainesville FL | 100000 | 1000
Assuming every city_state with a - contains exactly two cities, you can halve the employment:
case when b.city_state contains '-' then b.EMPLOYMENT /2 else b.EMPLOYMENT End as EMPLOYMENT
Assuming every city_state ends with the state abbreviation, you can remove the state and match with CONTAINS:
b.city_state contains substr(a.city_state,1,length(a.city_state)-length(scan(a.city_state,-1,' ')));

How to read missing data when reading txt file in SAS?

I am new to SAS.
I want to read a txt file in SAS. The thing is the file has missing data in some lines.
For example the data is:
James Monroe Monroe Hall Virginia 58 4/28/1758 1816
John Quincy Adams Braintree Massachusetts 57 7/11/1767 1824 113,142 30.92%
Andrew Jackson Waxhaws Region South/North Carolina 61 3/15/1767 1828 642,806 55.93%
Martin Van Buren Kinderhook New York 54 12/5/1782 1836 763,291 50.79%
William Henry Harrison Charles City County Virginia 68 2/9/1773 1840 1,275,583 52.87%
The columns I want are 'FullName', 'City', 'State', 'Age', 'DOB', 'Year', 'Number' and 'Percentage'.
My code is:
infile 'C:\sasfiles\testfile.txt' missover dlm='09'x dsd
input FullName City State Age DOB Year Number PercVote;
Run;
But I get the error
9 CHAR Rutherford B. Hayes.Delaware ..Ohio..54.10/4/1822.1876.4,034,142.47.92% 73
ZONE 5776676676242246767046667676222004666003303323233330333303233323330332332
NUMR 2548526F2402E081953945C1712500099F89F9954910F4F18229187694C034C142947E925
FullName=. City=. State=. Age=. DOB=. Year=54 Number=. PercVote=1876 ERROR=1 N=19
NOTE: Invalid data for FullName in line 20 1-17.
NOTE: Invalid data for City in line 20 19-32.
NOTE: Invalid data for State in line 20 37-40.
NOTE: Invalid data for Number in line 20 46-55.
NOTE: Invalid data for PercVote in line 20 62-70.
You need to provide data in a format that can be properly parsed.
When using delimited data, use adjacent delimiters to indicate that a value is missing. It is best to define the variables before you use them. If you define them in order then your input statement can be very simple.
data want ;
  infile cards dsd dlm='|' firstobs=2 truncover ;
  * name is $17 so that "John Quincy Adams" is not truncated;
  length name $17 city $11 state $15 age 8 dob 8 yod 8 ;
  input name -- yod ;
  informat dob mmddyy10.;
  format dob yymmdd10.;
cards;
----+----0----+----0----+----0----+----0----+----0----+----0----+----0
James Monroe|Monroe Hall|Virginia||4/28/1758|1816
John Quincy Adams|Braintree|Massachusetts|57|7/11/1767|1824
;
Or force the data into columns and use column input.
data want ;
  infile cards firstobs=2 truncover ;
  * name spans columns 1-17 so that "John Quincy Adams" is not truncated;
  * @50 moves the column pointer (#50 would move the line pointer instead);
  input name $ 1-17 city $ 19-29 state $ 31-45 age 47-48 @50 dob mmddyy10. yod 61-64;
  format dob yymmdd10.;
cards;
----+----0----+----0----+----0----+----0----+----0----+----0----+----0
James Monroe      Monroe Hall Virginia           4/28/1758  1816
John Quincy Adams Braintree   Massachusetts   57 7/11/1767  1824
;
Try this:
data want;
infile cards dlm='09'x missover;
input (FullName City State) (:$32.) Age :8. DOB :mmddyy9. Year :4. Number :comma8. PercVote :percent8.;
format DOB mmddyy10. number comma16. percvote percent8.2;
cards;
James Monroe Monroe Hall Virginia 58 4/28/1758 1816
John Quincy Adams Braintree Massachusetts 57 7/11/1767 1824 113,142 30.92%
Andrew Jackson Waxhaws Region South/North Carolina 61 3/15/1767 1828 642,806 55.93%
Martin Van Buren Kinderhook New York 54 12/5/1782 1836 763,291 50.79%
William Henry Harrison Charles City County Virginia 68 2/9/1773 1840 1,275,583 52.87%
;
run;
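Applied to the original file, the same idea might look like the sketch below. It is untested against the real file and rests on assumptions: the path is the one from the question, presidents is a made-up data set name, and $32 is a guessed length. The "Invalid data" notes for FullName, City and State arise because those variables were read as numeric, so the sketch declares them as character first. DSD is left out because the log line in the question shows runs of two tabs between some fields, which are assumed here to be padding rather than empty fields.

```sas
data presidents;
  /* tab-delimited; truncover leaves trailing fields missing on short lines */
  infile 'C:\sasfiles\testfile.txt' dlm='09'x truncover;
  /* declare the text columns as character before they appear in INPUT */
  length FullName City State $32;
  /* informats tell list input how to read the date, the comma number and the percentage */
  informat DOB mmddyy10. Number comma12. PercVote percent8.;
  format DOB mmddyy10. Number comma12. PercVote percent8.2;
  input FullName City State Age DOB Year Number PercVote;
run;
```

Because blanks are no longer delimiters once dlm='09'x is set, names such as "James Monroe" are read as a single value.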

SAS transpose using values as column names and summarize

I'm trying to transpose a data set, using values as variable names, and summarize numeric data by group. I tried PROC TRANSPOSE and PROC REPORT (across) but couldn't get it to work. The only way I know to do this is a DATA step with if/else and sum, but then the changes aren't dynamic.
For example I have this data set:
school name subject picked saving expenses
raget John math 10 10500 3500
raget John spanish 5 1200 2000
raget Ruby nosubject 10 5000 1000
raget Ruby nosubject 2 3000 0
raget Ruby math 3 2000 500
raget peter geography 2 1000 0
raget noname nosubject 0 0 1200
and I need this in one line: the sum of 'picked' by student name, then the sum of 'picked' by subject, and in the last 3 columns the totals for picked, saving and expenses:
school john ruby peter noname math spanish geography nosubject picked saving expenses
raget 15 15 2 0 13 5 2 12 32 22700 8200
Can this be made dynamic, so that it adapts if a new student or subject is added to the school?
It's a little difficult because you're summarising at more than one level, so I've used PROC SUMMARY and chosen different _TYPE_ values. See below:
data have;
infile datalines;
input school $ name $ subject : $10. picked saving expenses;
datalines;
raget John math 10 10500 3500
raget John spanish 5 1200 2000
raget Ruby nosubject 10 5000 1000
raget Ruby nosubject 2 3000 0
raget Ruby math 3 2000 500
raget peter geography 2 1000 0
raget noname nosubject 0 0 1200
;
run;
proc summary data=have;
class school name subject;
var picked saving expenses;
output out=want1 sum(picked)=picked sum(saving)=saving sum(expenses)=expenses;
run;
proc transpose data=want1 (where=(_type_=5)) out=subs (where=(_NAME_='picked'));
by school;
id subject;
run;
proc transpose data=want1 (where=(_type_=6)) out=names (where=(_NAME_='picked'));
by school;
id name;
run;
proc sql;
create table want (drop=_TYPE_ _FREQ_ name subject) as
select
n.*,
s.*,
w.*
from want1 (where=(_TYPE_=4)) w,
names (drop=_NAME_) n,
subs (drop=_NAME_) s
where w.school = n.school
and w.school = s.school;
quit;
I've also tested this code by adding new schools, names and subjects and they do appear in the final table. You'll note that I haven't hardcoded anything (e.g. no reference to math or John), so the code is dynamic enough.
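If only those three levels are needed, a TYPES statement can ask PROC SUMMARY for just them rather than every combination of the class variables. A small sketch on a hypothetical three-row sample (demo and demo_sums are illustrative names); with the class order school, name, subject, the _TYPE_ values are 4 for school, 6 for school*name and 5 for school*subject, which matches the filters used above:

```sas
/* Hypothetical three-row sample in the same layout as the question's data */
data demo;
  input school $ name $ subject :$10. picked saving expenses;
  datalines;
raget John math 10 10500 3500
raget John spanish 5 1200 2000
raget Ruby math 3 2000 500
;

proc summary data=demo;
  class school name subject;
  /* request only the three levels that are used later */
  types school school*name school*subject;
  var picked saving expenses;
  /* sum= with no names keeps the original variable names */
  output out=demo_sums sum=;
run;

proc print data=demo_sums;
run;
```

Printing demo_sums is a quick way to see which rows each _TYPE_ filter will pick up before transposing.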
PROC REPORT is an interesting alternative, particularly if you want the printed output rather than as a dataset. You can use ODS OUTPUT to get the output dataset, but it's messy as the variable names aren't defined for some reason (they're "C2" etc.). The printed output of this one is a little messy also as the header rows don't line up, but that can be fixed with some finagling if that's desired.
data have;
input school $ name $ subject :$10. picked saving expenses;
datalines;
raget John math 10 10500 3500
raget John spanish 5 1200 2000
raget Ruby nosubject 10 5000 1000
raget Ruby nosubject 2 3000 0
raget Ruby math 3 2000 500
raget peter geography 2 1000 0
raget noname nosubject 0 0 1200
;;;;
run;
ods output report=want;
proc report nowd data=have;
columns school (name subject),(picked) picked=picked2 saving expenses;
define picked/analysis sum ' ';
define picked2/analysis sum;
define saving/analysis sum ;
define expenses/analysis sum;
define name/across;
define subject/across;
define school/group;
run;

Setting up negative to positive counts for time

Suppose a data set has monthly observations, and each person started a job in a different month. For example:
person date date_started date_count
Tim 1/1/2000 3/1/2000 -2
Tim 2/1/2000 3/1/2000 -1
Tim 3/1/2000 3/1/2000 0
John 1/1/2000 7/1/2000 -6
John 2/1/2000 7/1/2000 -5
John 3/1/2000 7/1/2000 -4
John 4/1/2000 7/1/2000 -3
John 5/1/2000 7/1/2000 -2
John 6/1/2000 7/1/2000 -1
John 7/1/2000 7/1/2000 0
John 8/1/2000 7/1/2000 1
John 9/1/2000 7/1/2000 2
John 10/1/2000 7/1/2000 3
Mary 3/1/2000 3/1/2000 0
Mary 4/1/2000 3/1/2000 1
What is the most efficient way to get the date_count column? I also have a column that is 1 in your first month and 0 otherwise. I would rather use that in making the date_count.
I don't understand what the difficulty is here. The question seems poorly explained to me.
You mention months, but your example shows daily dates, so the role of months in the problem is a mystery.
The variable you want is just the difference between two daily dates. So long as you have two daily date variables (Dimitriy explains how to get those from string dates), it is just a subtraction.
(Added later) My uncertainty shows what happens when one assumes on an international list that local conventions are universal. There are two conventions easily confused, showing dates as day/month/year and showing dates as month/day/year. Evidently you are using the second convention. If so, the problem is to convert from daily dates to monthly dates using mofd(); then as said it is a subtraction.
I don't know if this is the optimal way, but I think it should work:
/* convert your dates to Stata's date format from strings */
gen date2 = daily(date,"MDY")
gen date_started2 = daily(date_started,"MDY")
format date2 date_started2 %td
/* this is the main code */
gen before = date_started2 > date2
bys person before: egen date_count2 = rank(abs(date_started2 - date2))
replace date_count2 = date_count2 - 1 if before==0
replace date_count2 = -date_count2 if before==1
drop before
Edit:
Mea culpa. I completely misunderstood your question to mean that you wanted a countdown to start date for each person-observation event. You actually want something much simpler:
gen date_count2 = mofd(daily(date,"MDY")) - mofd(daily(date_started,"MDY"))
This assumes that date and date_started are stored as string variables. daily() converts them to Stata's daily date format, and mofd() converts those to monthly dates. Then it's just the difference.