I've got some data with lots of columns of information and dates, plus two columns of numbers and a flag column (i.e. it's either 1 or 0). Each row holds information on one individual for a particular month.
For the two number columns I want to create two new columns holding the cumulative totals for each individual over time. For the flag, I want it to be 1 on all future dates for an individual once it has first become 1 for that individual.
I'm struggling to word this (and so also to google what I want to do!), so I've put what I have and what I want below. In this example, A1, B1, C1 is one individual and A1, B2, C3 is another.
I've got this:
Col1  Col2  Col3  Date       Value_1  Value_2  Flag
A1    B1    C1    01Jan2021  0        100      0
A1    B1    C1    01Feb2021  0        0        0
A1    B1    C1    01Mar2021  10       100      0
A1    B1    C1    01Apr2021  50       0        0
A1    B1    C1    01May2021  0        10       1
A1    B1    C1    01Jun2021  10       0        0
A1    B1    C1    01Jul2021  0        0        0
A1    B2    C3    01Jan2021  0        0        0
A1    B2    C3    01Feb2021  0        20       1
A1    B2    C3    01Mar2021  10       20       0
A1    B2    C3    01Apr2021  40       20       0
A1    B2    C3    01May2021  0        0        0
A1    B2    C3    01Jun2021  30       0        0
A1    B2    C3    01Jul2021  0        0        0
And I want this:
Col1  Col2  Col3  Date       Value_1_full  Value_2_full  Flag
A1    B1    C1    01Jan2021  0             100           0
A1    B1    C1    01Feb2021  0             100           0
A1    B1    C1    01Mar2021  10            200           0
A1    B1    C1    01Apr2021  60            200           0
A1    B1    C1    01May2021  60            210           1
A1    B1    C1    01Jun2021  70            210           1
A1    B1    C1    01Jul2021  70            210           1
A1    B2    C3    01Jan2021  0             0             0
A1    B2    C3    01Feb2021  0             20            1
A1    B2    C3    01Mar2021  10            40            1
A1    B2    C3    01Apr2021  50            60            1
A1    B2    C3    01May2021  50            60            1
A1    B2    C3    01Jun2021  80            60            1
A1    B2    C3    01Jul2021  80            60            1
I could do this if the only data I had was for a single individual, but there's lots of them. The code I've written is just giving me the total cumulative of the column - I can't figure out how to calculate them separately for each individual. I'm also struggling to write the code for the flag column for a similar reason. I've put the code below and would be very appreciative of any help/advice.
Note: I'm really new to SAS and to write this question I've struggled to get the date field in correctly by just typing out the data for this example (I've used this "Ignore" bit of the code below as a work around to get it into SAS) so if you could let me know what I've done wrong here that would also be greatly appreciated for the future!
data data_1;
input Col1 $ Col2 $ Col3 $ Date date8. Ignore Value_1 Value_2 Flag;
format Date date8.;
datalines;
A1 B1 C1 "'01Jan2021'd" 0 100 0
A1 B1 C1 "'01Feb2021'd" 0 0 0
A1 B1 C1 "'01Mar2021'd" 10 100 0
A1 B1 C1 "'01Apr2021'd" 50 0 0
A1 B1 C1 "'01May2021'd" 0 10 1
A1 B1 C1 "'01Jun2021'd" 10 0 0
A1 B1 C1 "'01Jul2021'd" 0 0 0
A1 B2 C3 "'01Jan2021'd" 0 0 0
A1 B2 C3 "'01Feb2021'd" 0 20 1
A1 B2 C3 "'01Mar2021'd" 10 20 0
A1 B2 C3 "'01Apr2021'd" 40 20 0
A1 B2 C3 "'01May2021'd" 0 0 0
A1 B2 C3 "'01Jun2021'd" 30 0 0
A1 B2 C3 "'01Jul2021'd" 0 0 0
;
run;
Data data_2;
set data_1;
drop Ignore;
run;
proc sort data=data_2
out=data_3;
by Col1 Col2 Col3 Date;
run;
data data_4;
set data_3;
by Col1 Col2 Col3 Date;
retain Col1 Col2 Col3 Date Value_1 Value_2 Flag Value_1_full Value_2_full;
if first.Col1 AND first.Col2 AND first.Col3 AND first.Date then Value_1_full = Value_1;
else Value_1_full = Value_1_full + Value_1;
run;
So you're pretty close! I think this gets there...
proc sort data=data_1(drop=ignore)
out=data_3;
by Col1 Col2 Col3 Date;
run;
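data data_4;
set data_3(rename=(Flag=Flag_early)); /* keep the incoming flag under a new name */
by Col1 Col2 Col3 Date;
retain Value_1_full Value_2_full Flag; /* new variables carried across rows */
if first.Col3 then do; /* first row of an individual: restart totals and flag */
    Value_1_full = Value_1;
    Value_2_full = Value_2;
    Flag = 0;
end;
else do;
    Value_1_full = Value_1_full + Value_1;
    Value_2_full = Value_2_full + Value_2;
end;
Flag = max(Flag, Flag_early); /* once 1, stays 1 for this individual */
drop Flag_early;
run;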
Only a few small changes. I removed one pointless data step (the drop can be done anywhere else you use the data) and changed the first. test to just first.Col3.
You don't need to test Col1 and Col2: first.Col3 is what you care about, because a change in either of the other two also sets first.Col3 to true.
You also don't want first.Date there. first.Date is true every time the date changes (or any variable before it in the BY statement changes), and here that happens on every row, so it is always true.
Finally, for the flag you need to make a new variable. Variables read from a data set are in fact always retained, but they are also overwritten with new values on every iteration. So rename the incoming variable to flag_early (or whatever you like) and use the max function to set flag to 1 whenever flag_early is 1, or to keep the 1 that flag already holds from an earlier row, again resetting it every time first.Col3 is true. Value_2_full accumulates the same way as Value_1_full.
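For readers who find the by-group logic easier to follow outside SAS, here is a minimal Python sketch of the same idea: reset the running values at the first row of each individual, then carry them forward. It is only an illustration of the logic, not SAS code; the function name `running_totals` and the ISO date strings are assumptions made for the example.

```python
from itertools import groupby

# Rows are (Col1, Col2, Col3, Date, Value_1, Value_2, Flag), already sorted
# by individual and then by date, as PROC SORT would leave them.
rows = [
    ("A1", "B1", "C1", "2021-01-01", 0, 100, 0),
    ("A1", "B1", "C1", "2021-02-01", 0, 0, 0),
    ("A1", "B1", "C1", "2021-03-01", 10, 100, 0),
    ("A1", "B1", "C1", "2021-04-01", 50, 0, 0),
    ("A1", "B1", "C1", "2021-05-01", 0, 10, 1),
    ("A1", "B1", "C1", "2021-06-01", 10, 0, 0),
    ("A1", "B1", "C1", "2021-07-01", 0, 0, 0),
    ("A1", "B2", "C3", "2021-01-01", 0, 0, 0),
    ("A1", "B2", "C3", "2021-02-01", 0, 20, 1),
    ("A1", "B2", "C3", "2021-03-01", 10, 20, 0),
    ("A1", "B2", "C3", "2021-04-01", 40, 20, 0),
    ("A1", "B2", "C3", "2021-05-01", 0, 0, 0),
    ("A1", "B2", "C3", "2021-06-01", 30, 0, 0),
    ("A1", "B2", "C3", "2021-07-01", 0, 0, 0),
]


def running_totals(sorted_rows):
    """Return rows with per-individual running sums and a sticky flag."""
    out = []
    # One group per individual, identified by (Col1, Col2, Col3).
    for _, group in groupby(sorted_rows, key=lambda r: r[:3]):
        total_1 = total_2 = flag = 0          # reset, like "if first.Col3"
        for col1, col2, col3, date, v1, v2, f in group:
            total_1 += v1                     # carried across rows, like RETAIN
            total_2 += v2
            flag = max(flag, f)               # once 1, it stays 1
            out.append((col1, col2, col3, date, total_1, total_2, flag))
    return out


result = running_totals(rows)
for row in result:
    print(row)
```

The reset happens once per group and everything else is carried forward, which is the same division of labour as `first.Col3` and `retain` in the data step.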
Hope you can help with a solution, either SQL or a data step.
I need to combine multiple rows when the customer id is the same, and append some codes to the values too.
I have the following static variable containers:
%let FirstColSuffix=<SomeCode1>;
%let SecondColSuffix=#<SomeCode2>;
%let ThirdColSuffix=#<SomeCode3>;
Data have;
Customerid Firstcol Secondcol Thirdcol
1 A1 A2 A3
2 B1 B2 B3
2 C1 C2 C3
2 D1 D2 D3
3 E1 E2 E3
3 F1 F2 F3
3 G1 G2 G3
3 H1 H2 H3
Data want;
Customerid Firstcol Secondcol Thirdcol Result
1 A1 A2 A3 A1<SomeCode1>A2#<SomeCode2>A3#<SomeCode3>
2 B1 B2 B3 B1<SomeCode1>B2#<SomeCode2>B3#<SomeCode3>
2 C1 C2 C3 B1<SomeCode1>B2#<SomeCode2>B3#<SomeCode3>C1<SomeCode1>C2#<SomeCode2>C3#<SomeCode3>
2 D1 D2 D3 B1<SomeCode1>B2#<SomeCode2>B3#<SomeCode3>C1<SomeCode1>C2#<SomeCode2>C3#<SomeCode3>D1<SomeCode1>D2#<SomeCode2>D3#<SomeCode3>
3 E1 E2 E3 E1<SomeCode1>E2#<SomeCode2>E3#<SomeCode3>
3 F1 F2 F3 E1<SomeCode1>E2#<SomeCode2>E3#<SomeCode3>F1<SomeCode1>F2#<SomeCode2>F3#<SomeCode3>
3 G1 G2 G3 E1<SomeCode1>E2#<SomeCode2>E3#<SomeCode3>F1<SomeCode1>F2#<SomeCode2>F3#<SomeCode3>G1<SomeCode1>G2#<SomeCode2>G3#<SomeCode3>
3 H1 H2 H3 E1<SomeCode1>E2#<SomeCode2>E3#<SomeCode3>F1<SomeCode1>F2#<SomeCode2>F3#<SomeCode3>G1<SomeCode1>G2#<SomeCode2>G3#<SomeCode3>H1<SomeCode1>H2#<SomeCode2>H3#<SomeCode3>
I only need the output row for the last record of each customer id, but with the data from all matching rows concatenated in the "Result" column.
So in this example I need lines 1, 4 and 8.
Can anyone help? :-)
Use retain and by-group processing. We'll continually concatenate result to itself for each row we read and carry that value forward. At the last customer ID, we'll output. At the first customer ID, result is reset.
data want;
set have;
by Customerid;
length Result $500.;
retain Result;
if(first.Customerid) then call missing(Result);
Result = cats(Result, FirstCol, "&FirstColSuffix", SecondCol, "&SecondColSuffix", ThirdCol, "&ThirdColSuffix");
if(last.Customerid);
run;
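The same carry-forward-and-output-on-last pattern can be sketched in Python to make the control flow explicit. This is an illustration only: the three suffix constants stand in for the macro variables from the question, and `combine` is a hypothetical helper name.

```python
from itertools import groupby

# Stand-ins for &FirstColSuffix, &SecondColSuffix and &ThirdColSuffix.
FIRST_SUFFIX = "<SomeCode1>"
SECOND_SUFFIX = "#<SomeCode2>"
THIRD_SUFFIX = "#<SomeCode3>"

# (Customerid, Firstcol, Secondcol, Thirdcol), sorted by Customerid.
have = [
    (1, "A1", "A2", "A3"),
    (2, "B1", "B2", "B3"),
    (2, "C1", "C2", "C3"),
    (2, "D1", "D2", "D3"),
]


def combine(rows):
    """Keep the last row per customer, with all rows concatenated in Result."""
    want = []
    for customer_id, group in groupby(rows, key=lambda r: r[0]):
        result = ""                            # reset at first.Customerid
        for _, first, second, third in group:
            # Append this row's values and suffixes to the running string.
            result += (first + FIRST_SUFFIX + second + SECOND_SUFFIX
                       + third + THIRD_SUFFIX)
            last = (customer_id, first, second, third)
        want.append(last + (result,))          # output at last.Customerid
    return want


want = combine(have)
for row in want:
    print(row)
```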
I have this data
data have;
input cust_id $ pmt months;
datalines;
AA 100 0
AA 50 1
AA 200 2
AA 350 3
AA 150 4
AA 700 5
BB 500 0
BB 300 1
BB 1000 2
BB 800 3
run;
and I'd like to generate an output that looks like this
data want;
input cust_id $ pmt months i;
datalines;
AA 100 0 0
AA 50 0 1
AA 200 0 2
AA 350 0 3
AA 150 0 4
AA 700 0 5
AA 50 1 0
AA 200 1 1
AA 350 1 2
AA 150 1 3
AA 700 1 4
AA 200 2 0
AA 350 2 1
AA 150 2 2
AA 700 2 3
AA 350 3 0
AA 150 3 1
AA 700 3 2
AA 150 4 0
AA 700 4 1
AA 700 5 0
BB 500 0 0
BB 300 0 1
BB 1000 0 2
BB 800 0 3
BB 300 1 0
BB 1000 1 1
BB 800 1 2
BB 1000 2 0
BB 800 2 1
BB 800 3 0
run;
There are a few thousand rows with different cust_id values and different numbers of months. I tried joining tables, but I couldn't get the sequence 100 50 200 350 150 700 (for cust_id AA). I could only replicate 100 where months is 0, 50 where months is 1, and so on. I created maxval, which is the maximum months value. My code is something like this:
data temp1;
set have;
do i = 0 to maxval;
if (months <=maxval) then output;
end;
I thought of creating a unique key to join my have data and temp1 data, but it could only give me
AA 100 0 0
AA 50 0 1
AA 200 0 2
AA 350 0 3
AA 150 0 4
AA 700 0 5
AA 100 1 0
AA 50 1 1
AA 200 1 2
AA 350 1 3
AA 150 1 4
AA 100 2 0
AA 50 2 1
AA 200 2 2
AA 350 2 3
AA 100 3 0
AA 50 3 1
AA 200 3 2
AA 100 4 0
AA 50 4 1
AA 100 5 0
Any thoughts or different approach on how to generate my want table? Thank you!
This problem is a little tricky because you have things going in three directions:
The number of repetitions of a group descends from the group count. Within each repetition:
The payments item start index ascends and terminates at the group count
The months (as i) item start index is 1 and its termination descends from the group count
SQL
One SQL approach is a three-way reflexive join within each group. The months values act as a within-group index, so for this to work they must run 0, 1, 2, ... with no gaps in every group.
proc sql;
create table want as
select X.cust_id, Z.pmt, X.months, Y.months as i
from have as X
join have as Y on X.cust_id = Y.cust_id
join have as Z on Y.cust_id = Z.cust_id
where
X.months + Y.months = Z.months
order by
X.cust_id, X.months, Z.months
;
quit;
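To see why the join condition `X.months + Y.months = Z.months` produces the wanted rows, here is a brute-force Python sketch of the same three-way self-join on a shortened sample. It is an illustration of the condition only, not an efficient implementation; the name `expand` and the reduced data are assumptions for the example.

```python
# (cust_id, pmt, months); months runs 0, 1, 2, ... within each customer.
have = [
    ("AA", 100, 0), ("AA", 50, 1), ("AA", 200, 2),
    ("BB", 500, 0), ("BB", 300, 1),
]


def expand(rows):
    """Three-way self-join: X supplies months, Y supplies i, Z supplies pmt."""
    out = []
    for x_id, _, x_months in rows:
        for y_id, _, y_months in rows:
            for z_id, z_pmt, z_months in rows:
                same_customer = x_id == y_id == z_id
                if same_customer and x_months + y_months == z_months:
                    out.append((x_id, z_pmt, x_months, y_months))
    # Within one (cust_id, months) pair, ordering by i equals ordering by Z.months.
    return sorted(out, key=lambda r: (r[0], r[2], r[3]))


want = expand(have)
for row in want:
    print(row)
```

Each output row pairs a starting month (from X) with an offset i (from Y) and picks up the payment whose month is their sum (from Z), which is why gaps in months would drop rows.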
DATA Step
A DOW loop is used to count the group size. 2-deep looping crosses the combinations and three point= values are computed (finagled) to retrieve the relevant values.
data want2;
if 0 then set have; * prep pdv to match have;
retain point_end ;
point_start = sum(point_end,0);
do group_count = 1 by 1 until (last.cust_id);
set have(keep=cust_id);
by cust_id;
end;
do index1 = 1 to group_count;
point1 = point_start + index1;
set have (keep=months) point = point1;
do index2 = 0 to group_count - index1 ;
point2 = point_start + index1 + index2;
set have (keep=pmt) point=point2;
point3 = point_start + index2 + 1;
set have (keep=months rename=months=i) point=point3;
output;
end;
end;
point_end = point1;
keep cust_id pmt months i;
run;
Try the following:
data want(drop = start_obs limit j);
retain start_obs 1;
/* read by cust_id group */
do until(last.cust_id);
set have end = last_obs;
by cust_id;
end;
limit = months;
do j = 0 to limit;
i = 0;
do obs_num = start_obs + j to start_obs + limit;
/* read specific observations using direct access */
set have point = obs_num;
months = j;
output;
i = i + 1;
end;
end;
/* prepare for next direct access read */
start_obs = start_obs + limit + 1; /* advance past the limit + 1 rows of this group */
if last_obs then
stop;
run;
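The direct-access approach comes down to index arithmetic: remember where the current group starts, then read rows at `start + months + i`. A small Python sketch of that bookkeeping follows, using 0-based list positions instead of SAS observation numbers. The names are hypothetical and the third customer is added only to show that the start position has to advance by each group's size.

```python
from itertools import groupby

# (cust_id, pmt, months), sorted by cust_id then months.
have = [
    ("AA", 100, 0), ("AA", 50, 1), ("AA", 200, 2),
    ("BB", 500, 0), ("BB", 300, 1),
    ("CC", 7, 0),
]


def expand_by_position(rows):
    """Build the expanded table by reading rows at computed positions."""
    out = []
    start = 0                                  # position of the group's first row
    for cust_id, group in groupby(rows, key=lambda r: r[0]):
        size = len(list(group))                # group count
        for months in range(size):             # one repetition per start month
            for i in range(size - months):     # repetitions get shorter each time
                pmt = rows[start + months + i][1]
                out.append((cust_id, pmt, months, i))
        start += size                          # next group starts after this one
    return out


want = expand_by_position(have)
for row in want:
    print(row)
```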
I have done a groupby which resulted in a dataframe similar to the below example.
df = pd.DataFrame({'a': ['A', 'A','A', 'B', 'B','B'], 'b': ['A1', 'A2','A3' ,'B1', 'B2','B3'], 'c': ['2','3','4','5','6','1'] })
>>> df
a b c
0 A A1 2
1 A A2 3
2 A A3 4
3 B B1 5
4 B B2 6
5 B B3 1
desired output
>>> df
a b c
4 B B2 6
3 B B1 5
5 B B3 1
2 A A3 4
1 A A2 3
0 A A1 2
As you can see it is a double ranking based on column a then column b. We first start with the highest which is B and within B we also start with the highest which is B2.
How can I do that in Python, please?
You can first find the maximum in each group, then sort your DataFrame descending by that group maximum and by column c:
In [49]: (df.assign(x=df.groupby('a')['c'].transform('max'))
            .sort_values(['x', 'c'], ascending=[False, False])
            .drop(columns='x'))
Out[49]:
a b c
4 B B2 6
3 B B1 5
5 B B3 1
2 A A3 4
1 A A2 3
0 A A1 2
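The two-level ordering (groups by their maximum, rows within a group by value) can also be written without pandas, which makes the sort key explicit. This is a sketch under one assumption worth flagging: column c holds strings in the question, so the values are converted to integers here to avoid lexicographic comparison. The function name `sort_by_group_max` is made up for the example.

```python
# (a, b, c) rows in the same order as the question's DataFrame.
rows = [
    ("A", "A1", "2"), ("A", "A2", "3"), ("A", "A3", "4"),
    ("B", "B1", "5"), ("B", "B2", "6"), ("B", "B3", "1"),
]


def sort_by_group_max(data):
    """Order groups by their largest c, then rows within a group by c."""
    group_max = {}
    for a, _, c in data:
        value = int(c)                         # assumption: c is numeric text
        group_max[a] = max(group_max.get(a, value), value)
    # The group label is a tiebreaker so groups with equal maxima stay together.
    return sorted(data, key=lambda r: (-group_max[r[0]], r[0], -int(r[2])))


ordered = sort_by_group_max(rows)
for row in ordered:
    print(row)
```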
Use
In [1072]: df.sort_values(by=['a', 'c'], ascending=[False, False])
Out[1072]:
a b c
4 B B2 6
3 B B1 5
5 B B3 1
2 A A3 4
1 A A2 3
0 A A1 2
I think you first need to get the max values by aggregating, then create an ordered Categorical whose categories follow the order of those maxima, and finally sort_values works as you need:
c = df.groupby('a')['c'].max().sort_values(ascending=False)
print (c)
a
B 6
A 4
Name: c, dtype: object
df['a'] = pd.Categorical(df['a'], categories=c.index, ordered=True)
df = df.sort_values(by=['a', 'c'], ascending=[True, False])
print (df)
a b c
4 B B2 6
3 B B1 5
5 B B3 1
2 A A3 4
1 A A2 3
0 A A1 2
I need to find, for every row, the usage over the last 3 hours (usage is one of the columns in the dataset), grouped by User and ID_option.
Every row represents one record (within a 3 min time interval). For example (including the desired column sum_usage_3hr):
User ID_option time usage sum_usage_3hr
1 a1 12OCT2017:11:20:32 3 10
1 a1 12OCT2017:10:23:24 7 14
1 b1 12OCT2017:09:34:55 12 12
2 b1 12OCT2017:08:55:06 4 6
1 a1 12OCT2017:07:59:53 7 7
2 b1 12OCT2017:06:59:12 2 2
I have used code below for hash table:
data want;
if _n_=1 then do;
if 0 then set have(rename=(usage=_usage));
declare hash h(dataset:'have(rename=(usage=_usage))',hashexp:20);
h.definekey('user','id_option','time');
h.definedata('_usage');
h.definedone();
end;
set have;
sum_usage_3hr=0;
do i=time-3*3600 to time ;
if h.find(key:user,key:id_option,key:i)=0 then sum_usage_3hr+_usage;
end;
drop _usage i;
run;
But I got an error: Invalid DO loop control information, either the INITIAL or TO expression is missing or the BY expression is missing, zero, or invalid. If I add:
output;
end:
just above the "run;" it gives me an error: 'No matching DO/Select statement'.
Does anybody know what causes the problem?
I also have a version that sorts the table first, and it gives me the same error.
Thank you
After implementing the answer:
User ID_option time usage sum_usage_3hr col_i_got
1 a1 12OCT2017:11:22:32 3 12 3
1 a1 12OCT2017:11:20:24 0 9 3
1 a1 12OCT2017:10:34:55 2 9 2
1 a1 12OCT2017:09:55:06 0 7 2
1 a1 12OCT2017:09:43:45 0 7 0
1 a1 12OCT2017:08:59:53 7 7 7
1 a1 12OCT2017:06:59:12 0 0 7
Try this out:
Problem 1:
Input:
data have;
input User ID_option $ time usage ;
informat time datetime18.;
format time datetime18.;
cards;
1 a1 12OCT2017:11:20:32 3
1 a1 12OCT2017:10:23:24 7
1 b1 12OCT2017:09:34:55 12
2 b1 12OCT2017:08:55:06 4
1 a1 12OCT2017:07:59:53 7
2 b1 12OCT2017:06:59:12 2
;
run;
Code:
proc sort data=have out=have1;
by user id_option time;
run;
data have2;
set have1;
by user id_option;
format previous_time datetime18.;
previous_time = lag(time);
previous_usage = lag(usage);
if first.ID_option then previous_time=.;
if previous_time ~= . and intnx("hour",time,-3,"s") <= previous_time <= time then sum_usage_3hr=usage+previous_usage;
else sum_usage_3hr = usage;
drop previous_time previous_usage;
run;
proc sort data=have2 out=want;
by descending time ;
run;
Output:
User ID_option time usage sum_usage_3hr
1 a1 12Oct2017 11:20:32 3 10
1 a1 12Oct2017 10:23:24 7 14
1 b1 12Oct2017 9:34:55 12 12
2 b1 12Oct2017 8:55:06 4 6
1 a1 12Oct2017 7:59:53 7 7
2 b1 12Oct2017 6:59:12 2 2
Problem 2:
Input:
data have;
input user1 ID_option $ time usage ;
informat time datetime18.;
format time datetime18.;
cards;
1 a1 12OCT2017:11:22:32 3
1 a1 12OCT2017:11:20:24 0
1 a1 12OCT2017:10:34:55 2
1 a1 12OCT2017:09:55:06 0
1 a1 12OCT2017:09:43:45 0
1 a1 12OCT2017:08:59:53 7
1 a1 12OCT2017:06:59:12 0
;
run;
Code:
proc sql;
create table want as
select user1,id_option,time,min(usage) as usage,sum(usage1) as sum_usage_3hr
from
(
select a.*,b.time as time1 ,b.usage as usage1
from
have a
left join
have b
on a.user1 = b.user1 and a.id_option = b.id_option and b.time <= a.time
where intck("hour",a.time ,b.time) >= -3
)
group by 1,2,3
order by time desc;
quit;
Output:
user1 ID_option time usage sum_usage_3hr
1 a1 12Oct2017 11:22:32 3 12
1 a1 12Oct2017 11:20:24 0 9
1 a1 12Oct2017 10:34:55 2 9
1 a1 12Oct2017 9:55:06 0 7
1 a1 12Oct2017 9:43:45 0 7
1 a1 12Oct2017 8:59:53 7 7
1 a1 12Oct2017 6:59:12 0 0
Let me know in case of any queries.
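One point to keep in mind with the join condition: INTCK with its default method counts the hour boundaries crossed rather than elapsed time, so two records a little more than three hours apart can still qualify. The Python sketch below shows the same self-join using exact elapsed time instead. It is an illustration only; the name `usage_last_3hr` and the ISO timestamp strings are assumptions for the example.

```python
from datetime import datetime, timedelta

# (user, id_option, time, usage), as in the Problem 2 input.
have = [
    (1, "a1", "2017-10-12 11:22:32", 3),
    (1, "a1", "2017-10-12 11:20:24", 0),
    (1, "a1", "2017-10-12 10:34:55", 2),
    (1, "a1", "2017-10-12 09:55:06", 0),
    (1, "a1", "2017-10-12 09:43:45", 0),
    (1, "a1", "2017-10-12 08:59:53", 7),
    (1, "a1", "2017-10-12 06:59:12", 0),
]


def usage_last_3hr(rows, window=timedelta(hours=3)):
    """For each row, sum usage of same-key rows in the preceding window."""
    parsed = [(u, o, datetime.strptime(t, "%Y-%m-%d %H:%M:%S"), n)
              for u, o, t, n in rows]
    totals = []
    for user, option, time, _ in parsed:
        totals.append(sum(
            usage for u, o, t, usage in parsed
            # Same user and option, and no more than 3 hours before this row.
            if (u, o) == (user, option) and time - window <= t <= time
        ))
    return totals


sums = usage_last_3hr(have)
print(sums)
```

On this sample the exact-time window gives the same totals as the output shown above; the two conditions only diverge when a record sits between three and four hours back.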
I have a list comprising sub-lists with different numbers of entries, as follows:
x <- list(
c("a1", "a2", "a3", "a4", "a5", "a6", "a7"),
c("b1","b2","b3","b4"),
c("c1","c2","c3"),
c("d1")
)
I want to convert this list to a data frame with three columns: the 1st column is the sequence number of the sub-list (i.e. 1 to 4), the 2nd column is the entries, and the 3rd is my stop code, for which I used 1 on every line. The final result is as follows:
1 a1 1
1 a2 1
1 a3 1
1 a4 1
1 a5 1
1 a6 1
1 a7 1
2 b1 1
2 b2 1
2 b3 1
2 b4 1
3 c1 1
3 c2 1
3 c3 1
4 d1 1
I tried to use cbind; however, it seems to only work for sub-lists with the same number of entries. Is there a smarter way of doing this?
Here is an example:
data.frame(
x=rep(1:length(x), sapply(x, length)),
y=unlist(x),
z=1
)
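The `rep`/`unlist` combination amounts to "repeat each sub-list's position once per entry, then flatten". For comparison, the same reshaping can be sketched in Python with a nested comprehension. This is only an illustration of the idea, not R code; the variable names are made up.

```python
# The nested list from the question.
x = [
    ["a1", "a2", "a3", "a4", "a5", "a6", "a7"],
    ["b1", "b2", "b3", "b4"],
    ["c1", "c2", "c3"],
    ["d1"],
]

# One (sub-list number, entry, stop code) tuple per entry; numbering starts at 1.
rows = [(n, entry, 1) for n, sub in enumerate(x, start=1) for entry in sub]

for row in rows:
    print(*row)
```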
library(reshape2)
x <- melt(x) ## Done...
## Trivial...
x$stop <- 1
x <- x[c(2,1,3)]
One option is to use the split, apply, combine functionality in package plyr. In this case you need ldply which will take a list and combine the elements into data.frame:
library(plyr)
ldply(seq_along(x), function(i)data.frame(n=i, x=x[[i]], stop=1))
n x stop
1 1 a1 1
2 1 a2 1
3 1 a3 1
4 1 a4 1
5 1 a5 1
6 1 a6 1
7 1 a7 1
8 2 b1 1
9 2 b2 1
10 2 b3 1
11 2 b4 1
12 3 c1 1
13 3 c2 1
14 3 c3 1
15 4 d1 1