I have a dataset with patient diagnosis codes, and I need to use wildcard characters to categorize their diagnoses.
patientID diagnosis cancer age gender
1 250.0 0 65 M
1 250.00 1 65 M
2 250.01 1 23 M
2 250.02 0 23 M
3 250.11 0 50 F
3 250.12 0 50 F
4. 513.01. 1 34 M
Diagnoses with the 5th character as 0 or 2 need to be classified as type 2 diabetes, and those ending in 1 and 3 need to be classified as type 1 diabetes. However, 250.0 only has 4 characters and needs to be classified as type 2.
This doesn't work in the DATA step:
if diagnosis_code ='250.%0' then t2dm = 1;
if diagnosis_code ='250.%1' then t1dm = 1;
No need for wildcards for that test. Use the colon modifier to test the prefix of the code and the substr() function to test the 6th character (5th digit).
if diagnosis_code='250.0' or
(diagnosis_code=:'250.' and substr(diagnosis_code,6,1) in ('0','2')) then t2dm = 1;
if diagnosis_code=:'250.' and substr(diagnosis_code,6,1) in ('1','3') then t1dm = 1;
Wildcard matches in DATA step if statements can be done using the PRXMATCH function. PRX means Perl regular expression.
PRXMATCH (regular-expression-pattern,text-to-evaluate)
PRXMATCH Function documentation
Sample data
data have;
input patientID diagnosis_code $ cancer age gender $;
datalines;
1 250.0 0 65 M
1 250.00 1 65 M
2 250.01 1 23 M
2 250.02 0 23 M
3 250.11 0 50 F
3 250.12 0 50 F
4. 513.01. 1 34 M
run;
Example code
data want;
set have;
t2dm = prxmatch('/^250\.\d*0$/', trim(diagnosis_code)) > 0;
t1dm = prxmatch('/^250\.\d*1$/', trim(diagnosis_code)) > 0;
run;
Notes for the sample code
/ bounds a regex pattern
^ match at the beginning
250 match 250
\. match an actual period
\d match a digit
\d* match zero or more digits
0 1 match a 0 or 1
0$ 1$ match the 0 or 1 at the end
trim() trim the text to evaluate so the match at the end works
> 0 a match will return position p in text or 0 if no match, p > 0 will logically evaluate to 0 or 1 and be assigned to the flag variable
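For readers who want to check the two patterns outside SAS, here is a minimal sketch in Python. It reuses the regular expressions from the example above unchanged; the names classify, T2DM, and T1DM are made up for illustration and are not part of any SAS API.

```python
import re

# Patterns copied from the PRXMATCH example. Python strings carry no
# trailing blanks, so strip() plays the role of trim().
T2DM = re.compile(r"^250\.\d*0$")
T1DM = re.compile(r"^250\.\d*1$")

def classify(code):
    """Return (t2dm, t1dm) as 0/1 flags, mirroring the SAS flag variables."""
    code = code.strip()
    return int(bool(T2DM.search(code))), int(bool(T1DM.search(code)))

# Diagnosis codes from the sample data (hypothetical cleaned-up list).
codes = ["250.0", "250.00", "250.01", "250.02", "250.11", "250.12", "513.01"]
flags = {code: classify(code) for code in codes}
```

Note that 250.02 and 250.12 stay unflagged, because the patterns only test for a final 0 or 1. Covering the rule as stated in the question would presumably need character classes such as [02] and [13] in place of the single digits.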
I'm trying to set any5 = 'Yes' if there is a number 5 in any of the columns Q1 to Q5. However, my code below only reflects the last column.
data survey;
infile datalines firstobs=2;
input ID 3. Q1-Q5;
array score{5} _temporary_ (5,5,5,5,5);
array Ques{5} Q1-Q5;
do i =1 to 5;
if Ques{i} = score{i} then any5='Yes';
else any5='No';
end;
drop i;
datalines;
ID Q1 Q2 Q3 Q4 Q5
535 1 3 5 4 2
12 5 5 4 4 3
723 2 1 2 1 1
7 3 5 1 4 2
;
run;
Keep it simple :-)
data survey;
infile datalines;
input ID 3. Q1-Q5;
array Ques{*} Q1 - Q5;
any5 = ifc(5 in Ques, 'Yes', 'No');
datalines;
535 1 3 5 4 2
12 5 5 4 4 3
723 2 1 2 1 1
7 3 5 1 4 2
;
Use the COUNTC function to count the number of times 5 appears in your Q1-Q5 columns, then use the IFC function to return a character value based on whether the expression is true, false, or missing.
data survey;
infile datalines firstobs=2;
input ID 3. Q1-Q5;
any5 = ifc(countc(cats(of Q:),'5')>0,'Yes','No');
datalines;
ID Q1 Q2 Q3 Q4 Q5
535 1 3 5 4 2
12 5 5 4 4 3
723 2 1 2 1 1
7 3 5 1 4 2
;
run;
Result:
535 1 3 5 4 2 Yes
12 5 5 4 4 3 Yes
723 2 1 2 1 1 No
7 3 5 1 4 2 Yes
Use the WHICHN function to determine the index of the target value in a list of values.
In your case assign the test for any index matching
any5 = whichn (5, of ques(*)) > 0;
From the documentation:
WHICHN Function
Searches for a numeric value that is equal to the first argument, and
returns the index of the first matching value.
Syntax
WHICHN(argument, value-1 <, value-2, ...>)
It is a simple mistake in your logic. You are setting ANY5 to YES or NO on every pass through the loop. Since you continue going through the loop even after a match is found, you overwrite the results from the previous passes, so only the result of the last test survives.
Here is one way. Set the answer to NO before the loop and remove the ELSE clause.
any5='No ';
do i =1 to 5;
if Ques{i} = 5 then any5='Yes';
end;
Or stop when you have your answer.
do i =1 to 5 until(any5='Yes');
if Ques{i} = score{i} then any5='Yes';
else any5='No';
end;
Or skip the looping altogether.
if whichn(5, of Q1-Q5) then any5='Yes';
else any5='No';
Or even easier create any5 as numeric instead of character. SAS will return 1 for TRUE and 0 for FALSE as the result of a boolean expression.
any5 = ( 0 < whichn(5, of Q1-Q5) );
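The overwrite problem is not specific to SAS. As a rough illustration, the Python sketch below reproduces the original loop and the "set NO before the loop" fix side by side; the function names and the rows dictionary are invented for the example.

```python
def any5_overwritten(answers):
    """Reproduces the bug: the flag is reassigned on every pass,
    so only the last answer decides the result."""
    flag = None
    for a in answers:
        flag = "Yes" if a == 5 else "No"
    return flag

def any5_fixed(answers):
    """Default to 'No' before the loop and only ever switch to 'Yes'."""
    flag = "No"
    for a in answers:
        if a == 5:
            flag = "Yes"
    return flag

# Survey rows from the question, keyed by ID.
rows = {
    535: [1, 3, 5, 4, 2],
    12: [5, 5, 4, 4, 3],
    723: [2, 1, 2, 1, 1],
    7: [3, 5, 1, 4, 2],
}
```

For ID 535 the first version returns 'No' because Q5 is 2, while the second returns 'Yes' because Q3 is 5.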
I've got the code below, which works beautifully for comparing rows in a group when the first row doesn't matter.
data want_Find_Change;
set WORK.IA;
by ID;
array var[*] $ RATING;
array lagvar[*] $ zRATING;
array changeflag[*] RATING_UPDATE;
do i = 1 to dim(var);
lagvar[i] = lag(var[i]);
end;
do i = 1 to dim(var) ;
changeflag[i] = (var[i] NE lagvar[i] AND NOT first.ID);
end;
drop i;
run;
Unfortunately, when I use a dataset that has two rows per group, I get incorrect results, I assume because the first row has to be used in the comparison. How can I compare the only two rows and return a flag only on the second row? This did not work:
data Change;
set WORK.Two;
by ID;
changeflag = last.RATING NE first.RATING;
run;
Example of the data I have and want
Group Name Sport DogName Eligibility
1 Tom BBALL Toto Yes
1 Tom golf spot Yes
2 Nancy vllyball Jimmy yes
2 Nancy vllyball rover no
want
Group Name Sport DogName Eligibility N_change S_change D_Change E_change
1 Tom BBall Toto Yes 0 0 0 0
1 Tom golf spot Yes 0 1 1 0
2 Nancy vllyball Jimmy yes 0 0 0 0
2 Nancy vllyball rover no 0 0 1 1
If you want only the first row to not be flagged, you first need to create a variable enumerating the rows within each group. You can do so with:
data temp;
set have;
count + 1;
by Group;
if first.Group then count = 1;
run;
In a second step, you can run a proc sql with a subquery, count distinct by groups, and case when:
proc sql;
create table want as
select
Group, Name, Sport, DogName, Eligibility,
case when count_name > 1 and count > 1 then 1 else 0 end as N_change,
case when count_sport > 1 and count > 1 then 1 else 0 end as S_change,
case when count_dog > 1 and count > 1 then 1 else 0 end as D_change,
case when count_E > 1 and count > 1 then 1 else 0 end as E_change
from (select *,
count(distinct(Name)) as count_name,
count(distinct(Sport)) as count_sport,
count(distinct(DogName)) as count_dog,
count(distinct(Eligibility)) as count_E
from temp
group by Group);
quit;
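As a cross-check on the wanted output, here is a small Python sketch of a different technique from the count-distinct query above: it compares each row with the previous row of the same group and leaves the first row of every group at zero. It assumes the rows are already sorted by Group; flag_changes, COLUMNS, and the flag names are hypothetical.

```python
from itertools import groupby

COLUMNS = ("Name", "Sport", "DogName", "Eligibility")

def flag_changes(rows):
    """Append one 0/1 flag per column; the first row of each group gets zeros.
    Assumes rows are sorted by Group."""
    out = []
    for _, group in groupby(rows, key=lambda r: r["Group"]):
        previous = None
        for row in group:
            flags = {
                f"{col}_change": int(previous is not None and row[col] != previous[col])
                for col in COLUMNS
            }
            out.append({**row, **flags})
            previous = row
    return out

have = [
    {"Group": 1, "Name": "Tom", "Sport": "BBALL", "DogName": "Toto", "Eligibility": "Yes"},
    {"Group": 1, "Name": "Tom", "Sport": "golf", "DogName": "spot", "Eligibility": "Yes"},
    {"Group": 2, "Name": "Nancy", "Sport": "vllyball", "DogName": "Jimmy", "Eligibility": "yes"},
    {"Group": 2, "Name": "Nancy", "Sport": "vllyball", "DogName": "rover", "Eligibility": "no"},
]
want = flag_changes(have)
```

With exactly two rows per group both approaches should agree; with longer groups they can differ, since count distinct flags every later row once any value in the group differs.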
I have this data
data have;
input cust_id $ pmt months;
datalines;
AA 100 0
AA 50 1
AA 200 2
AA 350 3
AA 150 4
AA 700 5
BB 500 0
BB 300 1
BB 1000 2
BB 800 3
run;
and I'd like to generate an output that looks like this
data want;
input cust_id $ pmt months i;
datalines;
AA 100 0 0
AA 50 0 1
AA 200 0 2
AA 350 0 3
AA 150 0 4
AA 700 0 5
AA 50 1 0
AA 200 1 1
AA 350 1 2
AA 150 1 3
AA 700 1 4
AA 200 2 0
AA 350 2 1
AA 150 2 2
AA 700 2 3
AA 350 3 0
AA 150 3 1
AA 700 3 2
AA 150 4 0
AA 700 4 1
AA 700 5 0
BB 500 0 0
BB 300 0 1
BB 1000 0 2
BB 800 0 3
BB 300 1 0
BB 1000 1 1
BB 800 1 2
BB 1000 2 0
BB 800 2 1
BB 800 3 0
run;
There are a few thousand rows with different cust_id values and different numbers of months. I tried joining tables, but that couldn't get me the sequence 100 50 200 350 150 700 (for cust_id AA). I could only replicate 100 if months is 0, 50 if months is 1, and so on. I created maxval, which is the maximum months value. My code is something like this:
data temp1;
set have;
do i = 0 to maxval;
if (months <=maxval) then output;
end;
I thought of creating a unique key to join my have data and temp1 data, but it could only give me
AA 100 0 0
AA 50 0 1
AA 200 0 2
AA 350 0 3
AA 150 0 4
AA 700 0 5
AA 100 1 0
AA 50 1 1
AA 200 1 2
AA 350 1 3
AA 150 1 4
AA 100 2 0
AA 50 2 1
AA 200 2 2
AA 350 2 3
AA 100 3 0
AA 50 3 1
AA 200 3 2
AA 100 4 0
AA 50 4 1
AA 100 5 0
Any thoughts or different approach on how to generate my want table? Thank you!
This problem is a little tricky because you have things going in three directions:
The number of group repetitions descends from group count. Within each repetition:
The payments item start index ascends and terminates at group count
The months (as I) item start index is 1 and termination descends from group count
SQL
One SQL approach is a three-way reflexive join within each group. The months values act as a within-group index and must increase by 1 from 0 for this to work.
proc sql;
create table want as
select X.cust_id, Z.pmt, X.months, Y.months as i
from have as X
join have as Y on X.cust_id = Y.cust_id
join have as Z on Y.cust_id = Z.cust_id
where
X.months + Y.months = Z.months
order by
X.cust_id, X.months, Z.months
;
quit;
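To see why the join condition X.months + Y.months = Z.months produces the wanted rows, the logic can be mirrored in a few lines of Python. This is only a sketch under the same assumption as the SQL (months runs 0, 1, 2, ... within each customer); expand and the tuple layout are made up for the illustration.

```python
from collections import defaultdict

def expand(rows):
    """rows: (cust_id, pmt, months) tuples, months running 0, 1, 2, ... per customer.
    Returns (cust_id, pmt, months, i) tuples in the wanted order."""
    by_cust = defaultdict(dict)
    for cust_id, pmt, months in rows:
        by_cust[cust_id][months] = pmt
    out = []
    for cust_id in sorted(by_cust):
        pmts = by_cust[cust_id]
        for x in sorted(pmts):        # X.months, kept as months
            for y in sorted(pmts):    # Y.months, output as i
                z = x + y             # Z.months must equal X.months + Y.months
                if z in pmts:
                    out.append((cust_id, pmts[z], x, y))
    return out

have = [
    ("AA", 100, 0), ("AA", 50, 1), ("AA", 200, 2),
    ("AA", 350, 3), ("AA", 150, 4), ("AA", 700, 5),
    ("BB", 500, 0), ("BB", 300, 1), ("BB", 1000, 2), ("BB", 800, 3),
]
want = expand(have)
```

Each starting month x contributes one row fewer than the previous one, which gives 6+5+4+3+2+1 rows for AA and 4+3+2+1 rows for BB.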
DATA Step
A DOW loop is used to count the group size. 2-deep looping crosses the combinations and three point= values are computed (finagled) to retrieve the relevant values.
data want2;
if 0 then set have; * prep pdv to match have;
retain point_end ;
point_start = sum(point_end,0);
do group_count = 1 by 1 until (last.cust_id);
set have(keep=cust_id);
by cust_id;
end;
do index1 = 1 to group_count;
point1 = point_start + index1;
set have (keep=months) point = point1;
do index2 = 0 to group_count - index1 ;
point2 = point_start + index1 + index2;
set have (keep=pmt) point=point2;
point3 = point_start + index2 + 1;
set have (keep=months rename=months=i) point=point3;
output;
end;
end;
point_end = point1;
keep cust_id pmt months i;
run;
Try the following:
data want(drop = start_obs limit j);
retain start_obs 1;
/* read by cust_id group */
do until(last.cust_id);
set have end = last_obs;
by cust_id;
end;
limit = months;
do j = 0 to limit;
i = 0;
do obs_num = start_obs + j to start_obs + limit;
/* read specific observations using direct access */
set have point = obs_num;
months = j;
output;
i = i + 1;
end;
end;
/* prepare for next direct access read */
/* advance past this group; limit + 2 alone is only right after the first group */
start_obs = start_obs + limit + 1;
if last_obs then
stop;
run;
Consider the following example:
input group day month year number treatment NUM
1 1 2 2000 1 1 2
1 1 6 2000 2 0 .
1 1 9 2000 3 0 .
1 1 5 2001 4 0 .
1 1 1 2010 5 1 1
1 1 5 2010 6 0 .
2 1 1 2001 1 1 0
2 1 3 2002 2 1 0
end
gen date = mdy(month,day,year)
format date %td
drop day month year
For each group, I have a varying number of observations. Each observations refers to an event that is specified with a date. Variable number is the numbering within each group.
Now, I want to count the number of observations that occur one year starting from the date of each treatment observation (excluding itself) within this group. This means, I want to create the variable NUM that I have already put into my example above. I do not care about the number of observations with treatment = 0.
EDIT Begin: The following information was missing but is necessary to tackle this problem: The treatment variable will have a value of 1 if there is no observation within the same group in the previous year. Thus NUM will never have to count observations with treatment = 1. In principle, it is possible that two observations within a group have identical dates. EDIT End
I have looked into Stata tip 51: Events in intervals. It seems to work; however, my dataset is huge (> 1 million observations), so it is really inefficient, especially because I do not care about the treatment = 0 observations.
I was wondering if there is any alternative. My approach was to look for the observation with the latest date within each group that is still in the range of 1 year (and maybe store it in variable latestDate). Then I would simply subtract the value in variable number of the observation found from the value in count of the treatment = 0 variable.
Note: My "inefficient" code looks as follows
gsort -treatment
gen treatment_id = _n
replace treatment_id = . if treatment==0
gen count=.
sum treatment_id, meanonly
qui forval i = 1/`r(max)'{
count if inrange(date-date[`i'],1,365) & group == group[`i']
replace count = r(N) in `i'
}
sort group date
I am assuming that treatment can't occur within 1 year of the previous treatment (in the group). This is true in your example data, but may not be true in general. But, assuming that it is the case, then this should work. I'm using carryforward which is on SSC (ssc install carryforward). Like your latestDate thought, I determine one year after the most recent treatment and count the number of observations in that window.
sort group date
gen yrafter = (date + 365) if treatment == 1
by group: carryforward yrafter, replace
format yrafter %td
gen in_window = date <= yrafter & treatment == 0
egen answer = sum(in_window), by(group yrafter)
replace answer = . if treatment == 0
I can't promise this will be faster than a loop but I suspect that it will be.
The question is not completely clear.
Consider the following data with two different results, num2 and num3:
+-----------------------------------------+
| date2 group treat num2 num3 |
|-----------------------------------------|
| 01feb2000 1 1 3 2 |
| 01jun2000 1 0 . . |
| 01sep2000 1 0 . . |
| 01nov2000 1 1 0 0 |
| 01may2002 1 0 . . |
| 01jan2010 1 1 1 1 |
| 01may2010 1 0 . . |
|-----------------------------------------|
| 01jan2001 2 1 0 0 |
| 01mar2002 2 1 0 0 |
+-----------------------------------------+
The variable num2 is computed assuming you are interested in counting all observations that are within a one-year period after a treated observation (treat == 1), be those observations equal to 0 or 1 for treat. For example, after 01feb2000, there are three observations that comply with the time span condition; two have treat==0 and one has treat == 1, and they are all counted.
The variable num3 is also counting observations that are within a one-year period after a treated observation, but only the cases for which treat == 0.
num2 is computed with code in the spirit of the article you have cited. The use of in makes the run more efficient and there is no gsort (as in your code), which is quite slow. I have assumed that in each group there are no repeated dates:
clear
set more off
input ///
group str15 date count treat num
1 01.02.2000 1 1 2
1 01.06.2000 2 0 .
1 01.09.2000 3 0 .
1 01.11.2000 3 1 .
1 01.05.2002 4 0 .
1 01.01.2010 5 1 1
1 01.05.2010 6 0 .
2 01.01.2001 1 1 0
2 01.03.2002 2 1 0
end
list
gen date2 = date(date,"DMY")
format date2 %td
drop date count num
order date
list, sepby(group)
*----- what you want -----
gen num2 = .
isid group date, sort
forvalues j = 1/`=_N' {
count in `j'/L if inrange(date2 - date2[`j'], 1, 365) & group == group[`j']
replace num2 = r(N) in `j'
}
replace num2 = . if !treat
list, sepby(group)
num3 is computed with code similar in spirit (and results) to that posted by @jfeigenbaum:
<snip>
*----- what you want -----
isid group date, sort
by group: gen indicat = sum(treat)
sort group indicat, stable
by group indicat: egen num3 = total(inrange(date2 - date2[1], 1, 365))
replace num3 = . if !treat
list, sepby(group)
Even more than two interpretations are possible for your problem, but I'll leave it at that.
(Note that I have changed your example data to include cases that probably make the problem more realistic.)
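The num2 interpretation (count every observation in the same group that falls 1 to 365 days after a treated observation) can also be sketched without a loop over all pairs, by sorting the dates once per group and using binary search. The Python below is only an illustration of that idea, not a Stata solution; count_following_year and the tuple layout are invented names.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date

def count_following_year(events):
    """events: (group, date, treat) tuples.
    Returns {(group, date): count} for treated rows only."""
    days = defaultdict(list)
    for group, when, _ in events:
        days[group].append(when.toordinal())
    for ordinals in days.values():
        ordinals.sort()
    counts = {}
    for group, when, treat in events:
        if treat:
            start = when.toordinal()
            ordinals = days[group]
            # observations with 1 <= gap <= 365 days, i.e. in (start, start + 365]
            counts[(group, when)] = (
                bisect_right(ordinals, start + 365) - bisect_right(ordinals, start)
            )
    return counts

events = [
    (1, date(2000, 2, 1), 1), (1, date(2000, 6, 1), 0), (1, date(2000, 9, 1), 0),
    (1, date(2000, 11, 1), 1), (1, date(2002, 5, 1), 0), (1, date(2010, 1, 1), 1),
    (1, date(2010, 5, 1), 0), (2, date(2001, 1, 1), 1), (2, date(2002, 3, 1), 1),
]
num2 = count_following_year(events)
```

Observations on the same date as the treatment have a gap of 0 and are excluded, which matches inrange(..., 1, 365) in the Stata code.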
I was wondering if anybody has a quick and easy way to collapse transactional data into one observation per customer for easier modelling.
For example, let's say we look at a negotiation with a customer, every record is a quote for a certain car model with options A, B and C (all nominal indicators). The last record indicates a sale.
DATA TEMPSET;
INPUT CUST_ID $ A $ B $ C $;
DATALINES;
01 1 0 3
01 1 1 0
01 1 1 3
01 0 1 3
02 0 0 2
02 1 0 2
02 1 1 2
02 1 2 2
02 0 2 2
;
RUN;
To make things easier, I would love the resulting dataset to look like:
CUST_ID A B C A-1 B-1 C-1 A-2 B-2 C-2 A-3 B-3 C-3 A-4 B-4 C-4
01 1 0 3 1 1 0 1 1 3 0 1 3 . . .
02 0 0 2 1 0 2 1 1 2 1 2 2 0 2 2
My approach was a two-dimensional array to create the variables. But then I could not combine it with a DO loop to assign the values, since each customer has multiple observations. I also tried using macro variables with SYMPUT/SYMGET and then LAST.CUST_ID = 1 to trigger the output, but the quote history does not always have the same length and each variable requires hardcoding, which is practical for three variables but not when the number increases. Any suggestions are welcome. Is this perhaps possible with PROC SQL in a much simpler fashion?
Thanks!
PROC TRANSPOSE is your friend here. Make a vertical dataset with the wanted name and value, and transpose.
DATA TEMPSET;
INPUT CUST_ID $ A $ B $ C $;
DATALINES;
01 1 0 3
01 1 1 0
01 1 1 3
01 0 1 3
02 0 0 2
02 1 0 2
02 1 1 2
02 1 2 2
02 0 2 2
;
RUN;
data tempset_i;
set tempset;
by cust_id;
if first.cust_id then row=0;
row+1;
array vars a b c;
do _i = 1 to dim(vars);
varname = cats(vname(vars[_i]),row);
value = vars[_i];
output;
end;
keep cust_id varname value;
run;
proc transpose data=tempset_i out=tempset_t(drop=_name_);
by cust_ID;
id varname;
var value;
run;
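The two steps above (number the rows within each customer, build a name such as A1 or B2, then pivot) can be mirrored in a short Python sketch for anyone checking the expected shape of the result. The function collapse and its arguments are hypothetical names, and the naming follows the varname built in the DATA step (A1, B1, ...) rather than the A-1 style in the question.

```python
def collapse(rows, names=("A", "B", "C")):
    """rows: (cust_id, a, b, c) tuples in quote order.
    Returns one dict per customer with keys like A1, B1, C1, A2, ..."""
    wide = {}
    counters = {}
    for cust_id, *values in rows:
        row_no = counters.get(cust_id, 0) + 1   # same role as row+1 in the DATA step
        counters[cust_id] = row_no
        record = wide.setdefault(cust_id, {"CUST_ID": cust_id})
        for name, value in zip(names, values):
            record[f"{name}{row_no}"] = value
    return list(wide.values())

tempset = [
    ("01", "1", "0", "3"), ("01", "1", "1", "0"), ("01", "1", "1", "3"),
    ("01", "0", "1", "3"),
    ("02", "0", "0", "2"), ("02", "1", "0", "2"), ("02", "1", "1", "2"),
    ("02", "1", "2", "2"), ("02", "0", "2", "2"),
]
result = collapse(tempset)
```

Customer 01 has only four quotes, so its record simply lacks A5, B5, and C5, which corresponds to the missing values PROC TRANSPOSE leaves in the wide dataset.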