I have a dataset with Time and Interval variables, shown below. In SAS, I would like to add a sequential ID (Indicator) that starts counting when Interval is greater than 0.1, as follows:
Time     Interval Indicator
11:40:38 0.05     .
11:40:41 0.05     .
11:40:44 0.05     .
11:40:47 0.05     .
11:40:50 0.05     .
11:42:50 2        1
11:42:53 0.05     2
11:42:56 0.05     3
11:42:59 0.05     4
11:43:02 0.05     5
11:43:05 0.05     6
11:43:08 0.05     7
11:43:18 0.16667  1
11:43:21 0.05     2
11:43:24 0.05     3
11:43:27 0.05     4
11:43:30 0.05     5
11:43:33 0.05     6
If I use the code
`data out1; set out ;
by Time;
retain indicator;
if Interval > 0.1 then indicator=1;
indicator+1;
run;`
Indicator is not missing for the first five observations. I would like that it starts counting only when the condition is met (Interval > 0.1).
Thanks!
You can do it with a little modification:
data out1;
set out ;
retain indicator .; /* start as missing; otherwise the sum statement below initializes it to 0 */
if Interval>0.1 then indicator=0;
if indicator^=. then indicator+1;
run;
The summation starts only after the condition Interval>0.1 has been met: until then indicator is missing, so indicator+1 is not executed.
Note that you need to reset indicator to 0, not 1, when the condition is met. Once indicator is 0, indicator^=. is true and indicator+1 brings it to 1 on that same row.
For yucks, here is a one-liner of #WhyMath logic.
data want;
set have;
retain seq;
seq = ifn(interval > 0.1, 1, ifn(seq, sum(seq,1), seq));
run;
If you want to retain INDICATOR it cannot be on the input dataset, otherwise the SET statement will overwrite the retained value with the value read from the existing dataset.
If you want INDICATOR to start as missing when using the SUM statement then you need to explicitly say so in the RETAIN statement. Otherwise the SUM statement will cause the variable to be initialized to zero.
It looks like you only want to increment once the new variable has already been assigned at least one value.
data want;
set have;
retain new .;
if interval>0.1 then new=1;
else if new > 0 then new+1;
run;
Results:
OBS Time Interval Indicator new
1 11:40:38 0.05000 . .
2 11:40:41 0.05000 . .
3 11:40:44 0.05000 . .
4 11:40:47 0.05000 . .
5 11:40:50 0.05000 . .
6 11:42:50 2.00000 1 1
7 11:42:53 0.05000 2 2
8 11:42:56 0.05000 3 3
9 11:42:59 0.05000 4 4
10 11:43:02 0.05000 5 5
11 11:43:05 0.05000 6 6
12 11:43:08 0.05000 7 7
13 11:43:18 0.16667 1 1
14 11:43:21 0.05000 2 2
15 11:43:24 0.05000 3 3
16 11:43:27 0.05000 4 4
17 11:43:30 0.05000 5 5
18 11:43:33 0.05000 6 6
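If it helps to check the rule outside SAS, here is a minimal Python sketch of the same logic. The function name add_indicator and the use of None in place of a SAS missing value are assumptions made for illustration; the interval values are taken from the question.

```python
def add_indicator(intervals, threshold=0.1):
    """Return one counter value per row; None plays the role of SAS missing (.)."""
    indicator = None              # retained across rows, starts as missing
    out = []
    for interval in intervals:
        if interval > threshold:
            indicator = 1         # condition met: restart the sequence at 1
        elif indicator is not None:
            indicator += 1        # only count once a sequence has started
        out.append(indicator)
    return out


# Interval column from the question, in row order.
intervals = [0.05] * 5 + [2] + [0.05] * 6 + [0.16667] + [0.05] * 5
result = add_indicator(intervals)
print(result)
```

The first five rows stay missing because nothing increments the counter until the threshold has been crossed once.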
I would like to plot a dataset and get the output described below.
1. Plot the scatter so that the points are shaded red, from light red to dark red, according to the ratio on a 0-1 scale (0 = light red, 1 = dark red).
2. Show a legend with the same red colour scale for the ratio 0-1 (as in point 1).
Data explanation:
area - city (shortcut)
id - user id
var - variable
time - datetime
exit - consumer left
ratio - proportion (between 0-1)
Data sample and attempt plotting (obviously not correct):
data data;
input area $ id $ var $ time $ exit $ ratio $;
datalines;
A 1 1 1 0 0.18
A 1 1 2 0 0.11
A 2 1 1 1 0.14
A 2 1 2 0 0.15
A 2 1 3 0 0.14
A 3 1 1 0 0.17
A 3 1 2 0 0.19
A 3 1 3 1 0.21
A 3 1 4 0 0.14
B 4 2 1 0 0.14
B 4 2 2 1 0.15
B 5 2 1 0 0.17
B 5 2 2 0 0.25
B 5 2 3 0 0.31
A 1 3 1 0 0.22
A 1 3 2 0 0.13
A 2 3 1 1 0.16
A 2 3 2 0 0.11
A 2 3 3 0 0.22
A 3 3 1 0 0.27
A 3 3 2 0 0.29
A 3 3 3 1 0.31
A 3 3 4 0 0.24
B 4 4 1 0 0.24
B 4 4 2 1 0.35
B 5 4 1 0 0.47
B 5 4 2 0 0.15
B 5 4 3 0 0.21
;;
run;
data attrs;
input id $ risk $ fillcolor $;
datalines;
ratio 0.05 Verylightred
ratio 0.15 Lightred
ratio 0.20 Red
ratio 0.25 Darkred
ratio 0.30 Verydarkred
ratio 0.35 Verydarkstrongred
;
run;
proc sgpanel data=data dattrmap=attrs;
panelby area exit;
scatter y=id x=var / markerattrs = (symbol = squarefilled) group=ratio attrid=ratio;
run;
This will get you closer.
Ratio should be numeric to be graphed.
Ratio is continuous, so how should it be used to group?
In the data attribute map, the colour variable is too short (the default character length is 8, so longer colour names are truncated) and risk should be numeric.
I don't know exactly how to specify the ranges you'd like for the colours you'd like, but this gets you closer using the automatic legend.
One way to get at this is to add a grouping variable to the data set; you can then control the colour of each group with the data attribute map. This would mean adding a column to the 'data' data set, called ratio_group, which maps to the values in the data attribute map table. Use that variable as the group.
data data;
input area $ id $ var $ time $ exit $ ratio ;
datalines;
A 1 1 1 0 0.18
A 1 1 2 0 0.11
A 2 1 1 1 0.14
A 2 1 2 0 0.15
A 2 1 3 0 0.14
A 3 1 1 0 0.17
A 3 1 2 0 0.19
A 3 1 3 1 0.21
A 3 1 4 0 0.14
B 4 2 1 0 0.14
B 4 2 2 1 0.15
B 5 2 1 0 0.17
B 5 2 2 0 0.25
B 5 2 3 0 0.31
A 1 3 1 0 0.22
A 1 3 2 0 0.13
A 2 3 1 1 0.16
A 2 3 2 0 0.11
A 2 3 3 0 0.22
A 3 3 1 0 0.27
A 3 3 2 0 0.29
A 3 3 3 1 0.31
A 3 3 4 0 0.24
B 4 4 1 0 0.24
B 4 4 2 1 0.35
B 5 4 1 0 0.47
B 5 4 2 0 0.15
B 5 4 3 0 0.21
;;
run;
proc sgpanel data=data ;
panelby area exit;
scatter y=id x=var / markerattrs = (symbol = squarefilled size=10)
colorresponse=ratio
colormodel=(verylightred lightred red darkred verydarkred verydarkstrongred);
colaxis grid minorgrid;
rowaxis grid minorgrid;
run;
For marker size look at the SIZE option under the MARKERATTRS option.
For grids, look at the GRID/MINORGRID options under the COLAXIS and ROWAXIS statements.
COLAXIS documentation
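The ratio_group idea mentioned above (binning the continuous ratio so that each bin can get its own colour) can be sketched in Python as follows. This is only an illustration under assumptions: the thresholds and colour names are copied from the question's attrs table, each threshold is treated as the inclusive upper bound of its bin, and anything above the last threshold falls into the last bin. The question does not say how the ranges should really be defined.

```python
from bisect import bisect_left

# Assumed upper bounds and colour names, copied from the question's attrs table.
UPPER_BOUNDS = [0.05, 0.15, 0.20, 0.25, 0.30, 0.35]
COLOURS = ["Verylightred", "Lightred", "Red",
           "Darkred", "Verydarkred", "Verydarkstrongred"]


def ratio_group(ratio):
    """Return the colour name of the first bin whose upper bound is >= ratio."""
    idx = bisect_left(UPPER_BOUNDS, ratio)
    # Ratios above the last bound (e.g. 0.47) are clamped into the last bin.
    return COLOURS[min(idx, len(COLOURS) - 1)]


for r in (0.11, 0.18, 0.25, 0.47):
    print(r, ratio_group(r))
```

In SAS the same mapping would be a new character column on the data set, which is then used as the GROUP= variable together with the attribute map.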
In SAS, I would like to create a flag that checks the previous sell indicator: if the sell indicator of the previous time period differs from the current one (1 then 0, or 0 then 1, meaning that it has changed), then I assign the value 1 to the Ind variable.
The dataset looks like:
Customer Time Sell_Ind
1 2 1
1 3 0
1 4 0
2 23 0
2 24 0
2 30 0
5 12 1
5 11 0
And so on.
My expected output would be
Customer Time Sell_Ind Ind
1 2 1 0
1 3 0 1
1 4 0 0
2 23 0 0
2 24 0 0
2 30 0 0
5 12 1 0
5 11 0 1
The previous/current check should be done within each customer.
I have tried as follows
data mydata;
set original;
By customer;
Lag_sell_ind=lag(sell_ind);
If first.customer then Lag_sell_ind=.;
Run;
But it does not return the expected output.
In sql I would probably use partition by customer over time but I do not know how to do the same in SAS.
You were halfway through, you only need to add one if statement to achieve the desired output.
data want;
set have;
by customer;
lag=lag(sell_ind);
if first.customer then lag=.;
if sell_ind ne lag and lag ne . then ind = 1;
else ind = 0;
drop lag;
run;
You can simplify this using the IFN Function like below.
data have;
input Customer Time Sell_Ind;
datalines;
1 2 1
1 3 0
1 4 0
2 23 0
2 24 0
2 30 0
5 12 1
5 11 0
;
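data want;
set have;
by customer;
/* IFN evaluates all of its arguments on every row, so LAG() is called unconditionally and its queue stays in step */
ind = ifn(first.customer, 0, sell_ind ne lag(sell_ind));
run;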
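To double-check the expected Ind column outside SAS, here is a small Python sketch of the same per-customer comparison. The function name change_flags is hypothetical, and the rows are assumed to be processed in the order they appear in the data set (as the BY customer step does).

```python
def change_flags(rows):
    """rows: (customer, time, sell_ind) tuples in data-set order; returns one 0/1 flag per row."""
    flags = []
    prev_customer = None
    prev_sell = None
    for customer, _time, sell_ind in rows:
        if customer != prev_customer:
            flags.append(0)                           # first row of a customer: nothing to compare
        else:
            flags.append(int(sell_ind != prev_sell))  # 1 when the indicator changed
        prev_customer, prev_sell = customer, sell_ind
    return flags


rows = [(1, 2, 1), (1, 3, 0), (1, 4, 0),
        (2, 23, 0), (2, 24, 0), (2, 30, 0),
        (5, 12, 1), (5, 11, 0)]
flags = change_flags(rows)
print(flags)
```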
I have the following data that has been prepared with stset. The resulting variables signify cohort entry and exit times along with event status. In addition, a numerical variable - prob has been calculated based on the riskset size.
For those subjects that are not cases (where _d == 0), I need to sum all values of the prob variable where _t falls within that subject's follow-up time.
For example, subject 8 enters the cohort at _t0 == 0 and exits at _t == 8. Between these times, there are three prob values 0.9, 0.875 and 0.875 - giving the desired answer for subject 8 as 2.65.
* Example generated by -dataex-. To install: ssc install dataex
clear
input long id byte(_t0 _t _d) float prob
1 0 1 0 .
2 0 2 0 .
3 1 3 1 .9
4 0 4 0 .
5 0 5 1 .875
6 0 6 1 .875
7 5 7 0 .
8 0 8 0 .
9 0 9 1 .8333333
10 0 10 1 .8
11 0 11 0 .
12 8 12 1 .6666667
13 0 13 0 .
14 0 14 0 .
15 0 15 0 .
end
The desired output would return all of the data with an additional variable signifying the summed values of prob.
Thanks so much in advance.
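To make the requested calculation concrete, here is a minimal Python sketch of it using the example data above. It is not Stata code, only an illustration of the intended result. Two things are assumptions: an event time counts as inside a subject's follow-up when _t0 < t <= _t (the example for subject 8 gives the same answer under other boundary conventions), and None stands for Stata's missing value.

```python
# (id, _t0, _t, _d, prob) copied from the dataex listing; None = missing.
data = [
    (1, 0, 1, 0, None), (2, 0, 2, 0, None), (3, 1, 3, 1, 0.9),
    (4, 0, 4, 0, None), (5, 0, 5, 1, 0.875), (6, 0, 6, 1, 0.875),
    (7, 5, 7, 0, None), (8, 0, 8, 0, None), (9, 0, 9, 1, 0.8333333),
    (10, 0, 10, 1, 0.8), (11, 0, 11, 0, None), (12, 8, 12, 1, 0.6666667),
    (13, 0, 13, 0, None), (14, 0, 14, 0, None), (15, 0, 15, 0, None),
]


def summed_prob(rows):
    """For each non-case, sum prob over event times that fall in (_t0, _t]."""
    events = [(t, prob) for _id, _t0, t, d, prob in rows
              if d == 1 and prob is not None]
    result = {}
    for subject, t0, t, d, _prob in rows:
        if d == 0:
            result[subject] = sum(p for event_t, p in events if t0 < event_t <= t)
        else:
            result[subject] = None   # cases are left missing
    return result


sums = summed_prob(data)
print(sums[8])
```

For subject 8 this adds the prob values at _t = 3, 5 and 6, matching the 2.65 in the example.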
I have a tricky question about a conditional sum in SAS. It is hard for me to explain in words, so I will show an example:
A B
5 3
7 2
8 6
6 4
9 5
8 2
3 1
4 3
As you can see, I have a data set with two columns. First, I calculated the conditional cumulative sum of column A (I can do this part myself, so no help is needed for this step):
A B CA
5 3 5
7 2 12
8 6 18
6 4 8 ((12+8)-18)+6
9 5 17
8 2 18
3 1 10 ((17+8)-18)+3
4 3 14
So my condition value is 18. If the cumulative sum exceeds 18, it is set to 18, and the next value is the amount by which the sum exceeded 18 plus the next value of A. (As I said, I can do this myself.)
So the tricky part is I have to calculate the cumulative sum of column B according to column A:
A B CA CB
5 3 5 3
7 2 12 5
8 6 18 9.5 (5+(6*((18-12)/8)))
6 4 8 5.5 ((5+6)-9.5)+4
9 5 17 10.5 (5.5+5)
8 2 18 10.75 (10.5+(2*((18-17)/8)))
3 1 10 2.75 ((10.5+2)-10.75)+1
4 3 14 5.75 (2.75+3)
As you can see from the example, the cumulative sum of column B is very specific. When column CA reaches the condition value (18), we calculate the proportion of the last A value that was needed to reach 18, and then apply this proportion to B when computing the cumulative sum of column B.
Looks like when the sum of A reaches 18 or more you want to split the values of A and B between the current and the next record. One way is to remember the left over values for A and B and carry them forward in your new cumulative variables. Just make sure to output the observation before resetting those variables.
data want ;
set have ;
ca+a;
cb+b;
if ca >= 18 then do;
extra_a=ca - 18;
extra_b=b - b*((a - extra_a)/a) ;
ca=18;
cb=cb-extra_b ;
end;
output;
if ca=18 then do;
ca=extra_a;
cb=extra_b;
end;
drop extra_a extra_b ;
run;
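As a cross-check of the carry-forward logic, here is a Python sketch that mirrors the data step above and reproduces the CA and CB columns from the question. The function name capped_cumsums is hypothetical, and it assumes every value of A is positive and that a single A never overshoots the cap by more than one full cap.

```python
def capped_cumsums(rows, cap=18):
    """rows: (a, b) pairs in order; returns the (ca, cb) pair written for each row."""
    ca = cb = 0.0
    out = []
    for a, b in rows:
        ca += a
        cb += b
        extra_a = extra_b = 0.0
        if ca >= cap:
            extra_a = ca - cap            # part of A that spills into the next block
            extra_b = b * extra_a / a     # same share of B spills over
            ca = cap
            cb -= extra_b
        out.append((ca, cb))              # "output" happens before the reset
        if ca == cap:
            ca, cb = extra_a, extra_b     # start the next block with the leftovers
    return out


rows = [(5, 3), (7, 2), (8, 6), (6, 4), (9, 5), (8, 2), (3, 1), (4, 3)]
result = capped_cumsums(rows)
for ca, cb in result:
    print(ca, cb)
```

The key point is the same as in the SAS version: the row is written out before the running totals are reset to the leftover amounts.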
data have;
input patient level timepoint;
datalines;
1 0 1
1 0 2
1 0 3
1 3 4
1 0 5
1 0 6
2 0 1
2 4 2
2 0 3
2 3 4
2 0 5
2 0 6
2 0 7
2 2 8
2 0 9
2 0 10
3 3 1
3 0 2
3 0 3
4 0 1
4 0 2
4 0 3
4 0 4
4 1 5
4 0 6
4 0 7
4 0 8
4 0 9
4 0 10
;;
proc print; run;
/*
Condition 1: If there is one non-zero numeric value, in level, sorted by timepoint for a patient, set level to 2.5 for the record that is immediately prior to this time point; and set level = 1.5 for the next prior time point; set level to 2.5 for the record that is immediate post this time point; and set level to 1.5 for the next post record. The levels by timepoint should look like, ... 1.5, 2.5, non-zero numeric value, 2.5, 1.5 ... (Note: ... are kept as 0s).
Condition 2: If there are two or more non-zero numeric values, in level, sorted by timepoint for a patient, find the FIRST non-zero numeric value, and set level to 2.5 for the record that is immediate prior this time point; and set level to 1.5 for the next prior time point; then find the LAST non-zero numeric value record, set level to 2.5 for the record that is immediate post this last non-zero numeric value, and set level to 1.5 for the next post record; Set all zero values (i.e. level=0) to level = 2.5 for records between the first and last non-zero numeric values; The levels by timepoint should look like: ... 1.5, 2.5, FIRST Non-zero Numeric value, 2.5, Non-zero Numeric value, 2.5, LAST Non-zero Numeric value, 2.5, 1.5 ....
*/
I've tried data steps using N-1, N-2, N+1, N+2, and arrays/do loops (my first thought was to use multiple arrays so that I could use the index i to go to the previous/next records at i-1/i+1 or i-2/i+2, but it was hard to grasp how to even code it). All of this has to be done BY Patient, so there may be instances where there is only one record before the first non-zero value and not two. The same could be true for the post records as well. I searched through many different examples and help pages, but none covered my needs. Thanks in advance for any help.
This is how I want the data to look like:
data want;
input patient level timepoint;
datalines;
1 0 1
1 1.5 2
1 2.5 3
1 3 4
1 2.5 5
1 1.5 6
2 2.5 1
2 4 2
2 2.5 3
2 3 4
2 2.5 5
2 2.5 6
2 2.5 7
2 2 8
2 2.5 9
2 1.5 10
3 3 1
3 2.5 2
3 1.5 3
4 0 1
4 0 2
4 1.5 3
4 2.5 4
4 1 5
4 2.5 6
4 1.5 7
4 0 8
4 0 9
4 0 10
;;
proc print; run;
I approached this by first finding the timepoints of the first and last non-zero levels. Then I merged those into the original set, and changed levels based on the rules you mentioned.
proc sort data = have;
by patient timepoint;
run;
data have2;
retain first 0 last 0;
set have;
by patient timepoint;
if level ne 0 and first = 0 then first = timepoint;
if level ne 0 then last = timepoint;
if last.patient then do;
output;
first = 0;
last = 0;
end;
keep patient first last;
run;
proc sort data=have2;
by patient;
run;
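data merged;
merge have have2;
by patient;
/* first = 0 means the patient has no non-zero level, so those rows are left unchanged */
/* the differences below assume consecutive integer timepoints within a patient */
if level = 0 and first ne 0 then do;
if first-timepoint = 1 then level = 2.5;
if first-timepoint = 2 then level = 1.5;
if last-timepoint = -1 then level = 2.5;
if last-timepoint = -2 then level = 1.5;
if first < timepoint < last then level = 2.5;
end;
drop first last;
run;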
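For comparison, here is a Python sketch of the two rules from the question. Unlike the SAS steps above, it works on record positions within each patient rather than on timepoint differences, so it does not rely on timepoints being consecutive; that choice, and the function name fill_levels, are assumptions made for illustration.

```python
from collections import defaultdict


def fill_levels(rows):
    """rows: (patient, level, timepoint) tuples; returns them with zero levels adjusted."""
    by_patient = defaultdict(list)
    for patient, level, timepoint in rows:
        by_patient[patient].append((timepoint, level))

    out = []
    for patient in sorted(by_patient):
        recs = sorted(by_patient[patient])                 # order by timepoint
        nonzero = [i for i, (_t, lvl) in enumerate(recs) if lvl != 0]
        for i, (t, lvl) in enumerate(recs):
            if lvl == 0 and nonzero:
                first, last = nonzero[0], nonzero[-1]
                if first < i < last or i in (first - 1, last + 1):
                    lvl = 2.5                              # adjacent to, or between, non-zero records
                elif i in (first - 2, last + 2):
                    lvl = 1.5                              # two records away from the first/last non-zero
            out.append((patient, lvl, t))
    return out


# Patient 1 from the question: a single non-zero level at timepoint 4.
have = [(1, 0, 1), (1, 0, 2), (1, 0, 3), (1, 3, 4), (1, 0, 5), (1, 0, 6)]
want = fill_levels(have)
print(want)
```

Patients with fewer than two records before or after the non-zero value are handled naturally, because positions that do not exist are simply never visited.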