How to compose new string variable using bysort - stata

I want Stata code that generates a new string variable, suspect, listing the urid values within each ssuid whose corresponding score is less than or equal to 60 or missing.
Input:
primkey ssuid sup urid score
10312551 1255 601 122 60
10312552 1255 601 122 80
10312553 1255 601 123 90
10312554 1255 601 124 66
10312561 1256 601 122 40
10312562 1256 601 123 30
10312563 1256 601 124 .
10312564 1256 601 125 66
10312581 1258 602 126 80
10312582 1258 602 127 95
10312583 1258 602 127 100
10312584 1258 602 128 .
Output:
ssuid sup suspect
1255 601 122
1256 601 122,123,124
1258 602 128
The primkey and score variables are not required in the output.
The following is the code that I have already tried:
sort state ssuid urid sup
gen x=_n
gen suspect=.
replace suspect=urid if (score <=60 | score==.) & urid=urid[x-1]
drop x
sort state ssuid suspect
gen x=_n
tostring suspect, replace
replace suspect=suspect[x-1]+","+suspect if suspect!="." & ssuid==ssuid[x-1]
drop x
gen x=strlen(ssususpect_sup)
gsort state ssuid -x
drop x
gen x=_n
replace suspect=suspect_sup[x-1] if ssuid==ssuid[x-1]
drop x
bys ssuid:gen x=_n
keep if x==1
drop x
However, this is not producing the desired result.

Your example doesn't make complete sense.
For ssuid 1255 the lowest score is 60. So, no score is less than 60.
For ssuid 1258 the lowest score is 80.
Stata's rule is that missing is arbitrarily large positive, so never less than 60.
It seems that your rule is less than or equal to 60 or missing.
With those fixes this works for me:
clear
input primkey ssuid sup urid score
10312551 1255 601 122 60
10312552 1255 601 122 80
10312553 1255 601 123 90
10312554 1255 601 124 66
10312561 1256 601 122 40
10312562 1256 601 123 30
10312563 1256 601 124 .
10312564 1256 601 125 66
10312581 1258 602 126 80
10312582 1258 602 127 95
10312583 1258 602 127 100
10312584 1258 602 128 .
end
egen tokeep = max(score <= 60 | missing(score)), by(ssuid)
keep if tokeep
drop tokeep
drop if inrange(score, 61, .)
bysort ssuid (primkey) : gen suspect = string(urid) if _n == 1
by ssuid: replace suspect = suspect[_n-1] + ///
"," + string(urid) if _n > 1 & urid != urid[_n-1]
by ssuid: keep if _n == _N
keep ssuid sup suspect
list
+---------------------------+
| ssuid sup suspect |
|---------------------------|
1. | 1255 601 122 |
2. | 1256 601 122,123,124 |
3. | 1258 602 128 |
+---------------------------+
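To cross-check the grouping rule outside Stata, here is a minimal Python sketch of the same logic: flag rows whose score is at most 60 or missing, then join the distinct urid values within each ssuid. The function name `suspects_by_ssuid` and the use of `None` for a missing score are assumptions made for illustration; they are not part of the Stata answer above.

```python
# Illustrative sketch only: mirrors the rule "score <= 60 or missing".
# None stands in for Stata's missing value (.); that is an assumption here.
rows = [
    (10312551, 1255, 601, 122, 60),
    (10312552, 1255, 601, 122, 80),
    (10312553, 1255, 601, 123, 90),
    (10312554, 1255, 601, 124, 66),
    (10312561, 1256, 601, 122, 40),
    (10312562, 1256, 601, 123, 30),
    (10312563, 1256, 601, 124, None),
    (10312564, 1256, 601, 125, 66),
    (10312581, 1258, 602, 126, 80),
    (10312582, 1258, 602, 127, 95),
    (10312583, 1258, 602, 127, 100),
    (10312584, 1258, 602, 128, None),
]


def suspects_by_ssuid(rows):
    """Return {(ssuid, sup): "urid,urid,..."} for flagged urids, in primkey order."""
    flagged = {}
    # primkey is unique, so sorting the tuples orders rows by primkey
    for primkey, ssuid, sup, urid, score in sorted(rows, key=lambda r: r[0]):
        if score is None or score <= 60:
            urids = flagged.setdefault((ssuid, sup), [])
            if urid not in urids:  # list each urid once per group
                urids.append(urid)
    return {key: ",".join(str(u) for u in urids) for key, urids in flagged.items()}


for (ssuid, sup), suspect in suspects_by_ssuid(rows).items():
    print(ssuid, sup, suspect)
```

Unlike the Stata code, this sketch checks each urid against the whole list built so far rather than only the previous observation, so repeated urids that are not adjacent are still listed once.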
EDIT Let's look at the original code. I have made some comments and found some code that can be slimmed and one clear bug, but I bailed out after that. Problems must be reproducible!
sort state ssuid urid sup
gen x=_n
gen suspect=.
replace suspect=urid if (score <=60 | score==.) & urid=urid[x-1]
drop x
sort state ssuid suspect
gen x=_n
tostring suspect, replace
replace suspect=suspect[x-1]+","+suspect if suspect!="." & ssuid==ssuid[x-1]
drop x
gen x=strlen(ssususpect_sup)
gsort state ssuid -x
drop x
gen x=_n
replace suspect=suspect_sup[x-1] if ssuid==ssuid[x-1]
drop x
bys ssuid:gen x=_n
keep if x==1
drop x
*1 There is nothing in the data example about a variable -state-.
* What you don't show, we cannot try to replicate
*2 Your use of x for _n isn't needed. You can use _n directly.
* That saves 8 lines
sort ssuid urid sup
gen suspect=.
replace suspect=urid if (score <=60 | score==.) & urid=urid[_n-1]
sort ssuid suspect
tostring suspect, replace
replace suspect=suspect[_n-1]+","+suspect if suspect!="." & ssuid==ssuid[_n-1]
gen x=strlen(ssususpect_sup)
gsort ssuid -x
drop x
replace suspect=suspect_sup[_n-1] if ssuid==ssuid[_n-1]
bys ssuid: keep if _n == 1
*3 Your first three lines can be slimmed to one
bysort ssuid urid (sup) : gen suspect = urid if (score <=60 | score==.) & urid=urid[_n-1]
sort ssuid suspect
tostring suspect, replace
replace suspect=suspect[_n-1]+","+suspect if suspect!="." & ssuid==ssuid[_n-1]
gen x=strlen(ssususpect_sup)
gsort ssuid -x
drop x
replace suspect=suspect_sup[_n-1] if ssuid==ssuid[_n-1]
bys ssuid: keep if _n == 1
*4 It's simpler to create -suspect- as string in the first place
* so cut the -tostring- line
bysort ssuid urid (sup) : gen suspect = string(urid) if (score <=60 | score==.) & urid=urid[_n-1]
sort ssuid suspect
replace suspect=suspect[_n-1]+","+suspect if suspect!="." & ssuid==ssuid[_n-1]
gen x=strlen(ssususpect_sup)
gsort ssuid -x
drop x
replace suspect=suspect_sup[_n-1] if ssuid==ssuid[_n-1]
bys ssuid: keep if _n == 1
*5 bug: testing for equality needs == not =
bysort ssuid urid (sup) : gen suspect = string(urid) if (score <=60 | score==.) & urid==urid[_n-1]
sort ssuid suspect
replace suspect=suspect[_n-1]+","+suspect if suspect!="." & ssuid==ssuid[_n-1]
gen x=strlen(ssususpect_sup)
gsort ssuid -x
drop x
replace suspect=suspect_sup[_n-1] if ssuid==ssuid[_n-1]
bys ssuid: keep if _n == 1
*6 You refer to variables -ssususpect_sup- and -suspect_sup-, which you haven't supplied as
* data or created in your code. At this point I bail out.

Related

How to collapse numbers with same identifier but different date, but preserve the date of first observation for each identifier

I have a dataset that can be simplified in the following format:
clear
input str9 Date ID VarA VarB
"12jan2010" 5 21 42
"12jan2010" 6 47 21
"15jan2010" 10 7 68
"17jan2010" 6 -5 -3
"19jan2010" 6 -1 -1
end
In the dataset, there is Date, ID, VarA, and VarB. Each ID represents a unique set of transactions. I want to collapse (sum) VarA VarB, by(Date) in Stata. However, I want to keep the date of the first observation for each ID number.
Essentially, I want the above dataset to become the following:
+--------------------------------+
| Date ID VarA VarB |
|--------------------------------|
| 12jan2010 5 21 42 |
| 12jan2010 6 41 17 |
| 15jan2010 10 7 68 |
+--------------------------------+
The observations dated 12jan2010, 17jan2010, and 19jan2010 share the same ID (6), so I want to collapse (sum) VarA VarB over these three observations. I want to keep the date 12jan2010 because it is the date of the first observation. The other two observations are dropped.
I know it might be possible to collapse by ID first and then merge with the original dataset and then subset. I was wondering if there is an easier way to make this work. Thanks!
collapse allows you to calculate a variety of statistics, so you can convert your string date into a numerical date, then take the minimum of the numerical date to get the first occurrence.
clear
input str9 Date ID VarA VarB
"12jan2010" 5 21 42
"12jan2010" 6 47 21
"15jan2010" 10 7 68
"17jan2010" 6 -5 -3
"19jan2010" 6 -1 -1
end
gen Date2 = date(Date, "DMY")
format Date2 %td
collapse (sum) VarA VarB (min) Date2 , by(ID)
order Date2, first
li
yielding
+------------------------------+
| Date2 ID VarA VarB |
|------------------------------|
1. | 12jan2010 5 21 42 |
2. | 12jan2010 6 41 17 |
3. | 15jan2010 10 7 68 |
+------------------------------+
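The idea behind the collapse call (sum the values within each ID, keep the minimum date) can be sketched in Python as a cross-check. The helper names `parse_date` and `collapse_by_id` are hypothetical and exist only for this illustration.

```python
from datetime import date

# Month lookup for dates written like "12jan2010" (avoids locale-dependent parsing)
MONTHS = {m: i + 1 for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"])}

rows = [
    ("12jan2010", 5, 21, 42),
    ("12jan2010", 6, 47, 21),
    ("15jan2010", 10, 7, 68),
    ("17jan2010", 6, -5, -3),
    ("19jan2010", 6, -1, -1),
]


def parse_date(text):
    """Convert a ddmonyyyy string to a date object."""
    return date(int(text[5:]), MONTHS[text[2:5].lower()], int(text[:2]))


def collapse_by_id(rows):
    """Return {ID: (earliest date, sum of VarA, sum of VarB)}."""
    out = {}
    for date_text, id_, var_a, var_b in rows:
        d = parse_date(date_text)
        if id_ not in out:
            out[id_] = [d, var_a, var_b]
        else:
            rec = out[id_]
            rec[0] = min(rec[0], d)  # like (min) Date2
            rec[1] += var_a          # like (sum) VarA
            rec[2] += var_b          # like (sum) VarB
    return {k: tuple(v) for k, v in sorted(out.items())}


for id_, (first_date, var_a, var_b) in collapse_by_id(rows).items():
    print(first_date, id_, var_a, var_b)
```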
In response to the comment: You can generate the formatted date for only observations where VarA is > 0 (and not missing). (Assuming that, per your comment, VarA & VarB always have the same sign.)
// now assume ID 6 has an earliest date of 17jan2005 (obs.4)
// but you want to return your 'first date' as the
// first date where varA & varB are both positive
clear
input str9 Date ID VarA VarB
"12jan2010" 5 21 42
"12jan2010" 6 47 21
"15jan2010" 10 7 68
"17jan2005" 6 -5 -3
"19jan2010" 6 -1 -1
end
gen Date2 = date(Date, "DMY") if VarA > 0 & !missing(VarA)
format Date2 %td
collapse (sum) VarA VarB (min) Date2 , by(ID)
order Date2, first
li
yielding
+------------------------------+
| Date2 ID VarA VarB |
|------------------------------|
1. | 12jan2010 5 21 42 |
2. | 12jan2010 6 41 17 |
3. | 15jan2010 10 7 68 |
+------------------------------+

Rolling sum with unbalanced panel with non-even times in Stata

I have an unbalanced daily panel where entries occur at uneven times. I would like to generate the rolling sum of some variable x over the past 365 days. I can think of two ways to do this, but the first is memory hungry and the second is processor hungry. Is there a third alternative that avoids these problems?
Here are my two solutions:
clear
set obs 200
set seed 2001
/* panel variables */
generate id = 1 + int(2*runiform())
generate time = mdy(1, 1, 2000) + int(10*365*runiform())
format time %td
duplicates drop
xtset id time
/* data */
generate x = runiform()
/* first approach is to fill the panel with `tsfill` */
/* then remove "seasonality" with `s.` */
tsfill
generate sx = sum(x)
generate ssx = s365.sx
/* second approach without `tsfill` */
/* but nested loop is fairly slow */
drop if missing(x)
generate double ssx_alt = 0
forvalues i = 1/`= _N' {
local j = `i'
local delta = time[`i'] - time[`j']
while ((`j' > 0) & (`delta' < 365) & (id[`i'] == id[`j'])) {
local x = cond(missing(x[`j']), 0, x[`j'])
replace ssx_alt = ssx_alt + `x' in `i'
local j = `j' - 1
local delta = time[`i'] - time[`j']
}
}
The sum over the last # days is the difference between two cumulative sums, the cumulative sum to now and the cumulative sum to # days ago. The extension to panel data is easy, but not shown here. I don't think gaps disturb this principle once you have applied tsfill.
. set obs 20
obs was 0, now 20
. gen t = _n
. gen y = 100 + _n
. gen sumy = sum(y)
. tsset t
time variable: t, 1 to 20
delta: 1 unit
. gen diff = sumy - L10.sumy
(10 missing values generated)
. l
+------------------------+
| t y sumy diff |
|------------------------|
1. | 1 101 101 . |
2. | 2 102 203 . |
3. | 3 103 306 . |
4. | 4 104 410 . |
5. | 5 105 515 . |
|------------------------|
6. | 6 106 621 . |
7. | 7 107 728 . |
8. | 8 108 836 . |
9. | 9 109 945 . |
10. | 10 110 1055 . |
|------------------------|
11. | 11 111 1166 1065 |
12. | 12 112 1278 1075 |
13. | 13 113 1391 1085 |
14. | 14 114 1505 1095 |
15. | 15 115 1620 1105 |
|------------------------|
16. | 16 116 1736 1115 |
17. | 17 117 1853 1125 |
18. | 18 118 1971 1135 |
19. | 19 119 2090 1145 |
20. | 20 120 2210 1155 |
+------------------------+
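The same principle can be checked in a few lines of Python: the sum over the last 10 periods equals the running total now minus the running total 10 periods ago. This is only a sketch on the toy series from the log above (y = 101, ..., 120); the variable names are chosen to match the Stata session, and `None` stands in for Stata's missing value.

```python
from itertools import accumulate

y = [100 + t for t in range(1, 21)]  # same y as in the Stata log
sumy = list(accumulate(y))           # running total, like gen sumy = sum(y)

window = 10
# Like sumy - L10.sumy: undefined (None) where the 10-period lag does not exist
diff = [
    sumy[i] - sumy[i - window] if i >= window else None
    for i in range(len(y))
]

# Direct check: each defined diff equals the sum of the last `window` values
for i in range(window, len(y)):
    assert diff[i] == sum(y[i - window + 1 : i + 1])

print(diff)
```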

Percentage calculation based on multiple columns SAS

I have data with a patient ID column and a second column, illness, which can hold more than one category separated by commas. I need to find the total number of patients in each illness category and the percentage of patients per category. I tried the usual way; it gives the correct frequency but not the correct percent.
The data looks like this.
ID Type_of_illness
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
where the empty spaces represent no illness. I first separated the illnesses into separate columns, but then got stuck, not knowing how to proceed to find the percent.
The code I wrote is as below:
Data new;
set old;
array P(3) $ L1 L2 L3;
do i = 1 to dim(p);
p(i)=scan(type_of_illness,i,',');
end;
run;
Then I created a new column to copy all the illnesses to it so I thought it would give me the correct frequency, but it did not give me the correct percent.
data new;
set new;
L=L1;output;
L=L2;output;
L=L3;output;
run;
proc freq data=new;
tables L;run;
I have to create a table something like
*Total number of patients Percent*
.......................................
lf5
lf7
lf6
lf11
lf12
lf13
Please help.
You're trying to output percentages on non-mutually exclusive groups (each illness). It isn't obvious in SAS how to do this.
The following takes Joe's input code but takes an alternative route in determining percentages from event data (a 'long' dataset, if you will). I prefer this to creating a binary variable for each illness at the patient level (a 'wide' dataset) as, for me, this soon gets unwieldy. That said, if you then go on to do some modelling then a 'wide' dataset is usually more useful.
The following code produces output as follows:-
| | Pats | Pats | | | Mean | | |
| | with 0 |with 1+ | % with | Num | events | | |
| |records | record | record | Events |per pat |Std Dev | Median |
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf11 | 24| 2| 8| 2| 1.0| 0.00| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf12 | 19| 7| 27| 7| 1.0| 0.00| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf13 | 24| 2| 8| 2| 1.0| 0.00| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf16 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf17 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf5 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf6 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf7 | 24| 2| 8| 2| 1.0| 0.00| 1|
---------------------------------------------------------------------------------------|
Note that patient 5 is repeated in your data for illness lf5. My code counts this record only once. This is fine if the illness is chronic but not if it is acute. Also, my code includes patients who do not have an event in the denominator.
Finally, you can see another example of this code using dates - with test data - here at the mycodestock.com code sharing site => https://mycodestock.com/public/snippet/11251
Here's the code for the table above:-
options nodate nonumber nocenter pageno=1 obs=max nofmterr ps=52 ls=100 formchar="|----||---|-/\<>*";
data have;
format type_of_illness $30.;
infile datalines truncover;
input ID Type_of_illness $;
datalines;
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
;;;;
proc sort;
by id;
run;
** Create patient level data;
proc sort data = have(keep = id) out = pat_data nodupkey;
by id;
run;
** Create event table (1 row per patient*event);
** NOTE: Patients without events are dropped (as is usual in events data);
data events(drop = i type_of_illness);
set have;
attrib grp length = $5 label = 'Illness';
do i = 1 to countc(type_of_illness, ',') + 1;
grp = scan(type_of_illness, i, ',');
if grp ne '' then output;
end;
run;
** Count the number of events each patient had for each grp;
** NOTE: The NODUPKEY in the PROC SORT removes duplicate records (within PAT & GRP);
** NOTE: The use of CLASSDATA and COMPLETETYPES ensures zero counts for all patients and grps;
proc sort data = events out = perc2_summ_grp_pat nodupkey;
by grp id;
proc summary data = perc2_summ_grp_pat nway missing classdata = pat_data completetypes;
by grp;
class id;
output out = perc2_summ_grp_pat(rename=(_freq_ = num_events) drop=_type_);
run;
** Add a denominator variable - value '1' for each row.;
** Ensure when num_events = 0 the value is set to missing;
** Create a flag variable - set to 1 - if a patient has a record (no matter how many);
data perc2_summ_grp_pat;
set perc2_summ_grp_pat;
denom = 1;
if num_events = 0 then num_events = .;
flg_scripts = ifn(num_events, 1, .);
run;
proc tabulate data = perc2_summ_grp_pat format=comma8.;
title1 bold "Table 1: N, % and basic statistics of events within non-mutually exclusive groups";
title2 "Units: Patients - within each group level";
title3 "The statistics summarises the number of events (not whether a patient had at least 1 event)";
title4 "This means, for the statistics, only patients with 1+ record are included in the denominator";
class grp;
var denom flg_scripts num_events;
table grp='', flg_scripts=''*(nmiss='Pats with 0 records' n='Pats with 1+ record' pctsum<denom>='% with record')
num_events=''*(sum='Num Events' mean='Mean events per pat'*f=8.1 stddev='Std Dev'*f=8.2 p50='Median');
run; title; footnote;
You're going about this the right way, but you need to choose a different percent. Normally percent is 'percent of whole dataset', which means that it is going to triplicate your base. You want the percent based on each illness. This means you need a 1/0 variable for each illness.
The one downside is you have the 0's in your automatic tables; you would have to output the table to a dataset and remove them, then proc print/report the resulting dataset to get the 1's only - or use PROC SQL to generate the table.
data have;
format type_of_illness $30.;
infile datalines truncover;
input ID Type_of_illness $;
datalines;
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
;;;;
run;
data want;
set have;
array L[8] lf5-lf7 lf11-lf13 lf16 lf17;
do _t = 1 to dim(L);
if find(type_of_illness,trim(vname(L[_t]))) then L[_t]=1;
else L[_t]=0;
end;
run;
proc tabulate data=want;
class lf:;
tables lf:,n pctn;
run;
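As a cross-check of the 1/0 indicator idea, here is a minimal Python sketch that counts, for each illness, the rows that mention it and divides by the total number of rows, so a row with several illnesses counts once per illness but only once in the base. The function name `illness_percentages` is hypothetical. The sketch splits on commas instead of searching for substrings, which is a design choice of this illustration, and it treats each of the 27 input rows as one unit (patient 5 appears twice).

```python
records = [
    (4, "lf13"), (5, "lf5,lf11"), (63, ""), (13, "lf12"), (85, ""),
    (80, ""), (15, ""), (20, ""), (131, "lf6,lf7,lf12"), (22, ""),
    (24, ""), (55, "lf12"), (150, "lf12"), (34, "lf12"), (49, "lf12"),
    (151, "lf12"), (60, ""), (74, ""), (88, ""), (64, ""),
    (82, "lf13"), (5, "lf5,lf7"), (112, ""), (87, "lf17"), (78, ""),
    (79, "lf16"), (83, "lf11"),
]


def illness_percentages(records):
    """Return {illness: (rows with that illness, percent of all rows)}."""
    n_rows = len(records)
    counts = {}
    for _id, illnesses in records:
        # A set makes each illness count at most once per row
        for name in {part for part in illnesses.split(",") if part}:
            counts[name] = counts.get(name, 0) + 1
    return {name: (c, 100 * c / n_rows) for name, c in sorted(counts.items())}


for name, (count, pct) in illness_percentages(records).items():
    print(f"{name:5s} {count:3d} {pct:6.1f}%")
```

The percentages deliberately do not add up to 100, because the illness groups are not mutually exclusive.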
The multilabel format solution is interesting, so I present it separately.
Using the same have, we create a format that takes every combination of illnesses and outputs a row for each illness in it; i.e., if you have "1,2,3", it outputs the rows
1,2,3 = 1
1,2,3 = 2
1,2,3 = 3
Enabling multilabel formats and using a class-enabled proc like proc tabulate, you can then use this to allow each respondent to count in each of the label values, but not be counted more than once against the total.
data for_procformat;
set have;
start=type_of_illness; *start is the input to the format;
hlo=' m'; *m means multilabel, adding a space
here to leave room for the o later;
type='c'; *character format - n is numeric;
fmtname='$ILLF'; *whatever name you like;
do _t = 1 to countw(type_of_illness,','); *for each 'word' do this once;
label=scan(type_of_illness,_t,','); *label is the 'result' of the format;
if not missing(label) then output;
end;
if _n_=1 then do; *this block adds a row to deal with values;
hlo='om'; *not defined (in this case, just missings);
label='No Illness'; *the o means 'other';
output;
end;
run;
proc sort data=for_procformat nodupkey; *remove duplicates (which there will be many);
by start label;
run;
proc format cntlin=for_procformat; *import the formats;
quit;
proc tabulate data=have;
class type_of_illness/mlf missing ; *mlf means multilabel formats;
format type_of_illness $ILLF.; *apply said format;
tables type_of_illness,n pctn; *and your table;
run;

How to output R square and RMSE in SAS

I regress Y on X. The regression is grouped by DATE.
proc reg data=A noprint;
model Y=X;
by DATE;
run;
In the regression result for each DATE, there is an R-square value and an RMSE value.
How do I output a table of R square and RMSE values as below. Thank you!
|------------|-----------|------------|
| Date | R Square | RMSE |
|------------|-----------|------------|
| 1/1/2009 | R1 | RMSE1 |
| 2/1/2009 | R2 | RMSE2 |
| 3/1/2009 | R3 | RMSE3 |
| .... | .... | .... |
| 10/1/2010 | R22 | RMSE22 |
| 11/1/2010 | R23 | RMSE23 |
| 12/1/2010 | R24 | RMSE24 |
|------------|-----------|------------|
If you look at one of the examples of the SAS PROC REG, this is pretty easy to do.
The example data:
data htwt;
input sex $ age :3.1 height weight @@;
datalines;
f 143 56.3 85.0 f 155 62.3 105.0 f 153 63.3 108.0 f 161 59.0 92.0
f 191 62.5 112.5 f 171 62.5 112.0 f 185 59.0 104.0 f 142 56.5 69.0
f 160 62.0 94.5 f 140 53.8 68.5 f 139 61.5 104.0 f 178 61.5 103.5
f 157 64.5 123.5 f 149 58.3 93.0 f 143 51.3 50.5 f 145 58.8 89.0
m 164 66.5 112.0 m 189 65.0 114.0 m 164 61.5 140.0 m 167 62.0 107.5
m 151 59.3 87.0
;
run;
Now, the PROC REG. Pay particular attention to the OUTEST= option; it writes a data set est1 that will contain what you want.
proc reg outest=est1 outsscp=sscp1 rsquare;
by sex;
eq1: model weight=height;
eq2: model weight=height age;
proc print data=sscp1;
title2 'SSCP type data set';
proc print data=est1;
title2 'EST type data set';
run;
quit;
So now, the last PROC PRINT of est1 can be easily routed to contain only what you want:
proc print data=est1;
var sex _RMSE_ _RSQ_;
run;
You can use this code:
proc reg data=A; /* NOPRINT dropped: it also suppresses ODS tables, so ODS OUTPUT would save nothing */
model Y=X;
by DATE;
ods output FitStatistics=final_table;
run;
data final_table; set final_table;
rename cValue1=RMSE;
rename cValue2=RSquared;
proc print data=final_table;
where Label1='Root MSE';
var DATE RSquared RMSE;
run;
I simply turned the trace on and took the name of the table I was interested in and saved it as the output of proc reg.
ODS Trace on;
proc reg data=A; /* NOPRINT dropped: with it, no output tables are produced for ODS TRACE to list */
model Y=X;
by DATE;
run;
ODS Trace off;
This prints the names of all the output tables to the log.

In the following SAS statement, what do the parameters "noobs" and "label" stand for?

In the following SAS statement, what do the parameters "noobs" and "label" stand for?
proc print data=sasuser.schedule noobs label;
Per the SAS 9.2 documentation on PROC PRINT:
"NOOBS - Suppress the column in the output that identifies each observation by number"
"LABEL - Use variables' labels as column headings"
noobs suppresses the column of observation numbers (1, 2, 3, 4, 5, ...):
results without noobs
my first title
Obs name sex group height weight
1 mike m a 21 150
2 henry m b 30 140
3 norian f b 18 130
4 nadine f b 32 135
5 dianne f a 23 135
results with noobs
my first title
name sex group height weight
mike m a 21 150
henry m b 30 140
norian f b 18 130
nadine f b 32 135
dianne f a 23 135