SAS code - sum of last N rows for every row - sas

I have a dataset like this for each ID;
Months      ID  Number
2018-07-01  1   0
2018-08-01  1   0
2018-09-01  1   1
2018-10-01  1   3
2018-11-01  1   1
2018-12-01  1   2
2019-01-01  1   0
2019-02-01  1   0
2019-03-01  1   1
2019-04-01  1   0
2019-05-01  1   0
2019-06-01  1   0
2019-07-01  1   1
2019-08-01  1   0
2019-09-01  1   0
2019-10-01  1   2
2019-11-01  1   0
2019-12-01  1   0
2020-01-01  1   0
2020-02-01  1   0
2020-03-01  1   0
2020-04-01  1   0
2020-05-01  1   0
2020-06-01  1   0
2020-07-01  1   0
2020-08-01  1   1
2020-09-01  1   0
2020-10-01  1   0
2020-11-01  1   1
2020-12-01  1   0
2021-01-01  1   0
2021-02-01  1   1
2021-03-01  1   1
2021-04-01  1   0
2018-07-01  2   0
.......
(Similar values for each ID)
I want a dataset like this;
Months      ID  Number  Sum_Next_6Number
2018-07-01  1   0       7
2018-08-01  1   0       7
2018-09-01  1   1       7
2018-10-01  1   3       4
2018-11-01  1   1       3
2018-12-01  1   2       1
2019-01-01  1   0       2
2019-02-01  1   0       2
2019-03-01  1   1       1
2019-04-01  1   0       3
2019-05-01  1   0       3
2019-06-01  1   0       3
2019-07-01  1   1       2
2019-08-01  1   0       2
2019-09-01  1   0       2
2019-10-01  1   2       0
2019-11-01  1   0       0
2019-12-01  1   0       0
2020-01-01  1   0       0
2020-02-01  1   0       1
2020-03-01  1   0       1
2020-04-01  1   0       1
2020-05-01  1   0       2
2020-06-01  1   0       2
2020-07-01  1   0       2
2020-08-01  1   1       2
2020-09-01  1   0       3
2020-10-01  1   0       3
2020-11-01  1   1       Nan
2020-12-01  1   0       Nan
2021-01-01  1   0       Nan
2021-02-01  1   1       Nan
2021-03-01  1   1       Nan
2021-04-01  1   0       Nan
2018-07-01  2   0       0
.......
If fewer than 6 months remain for an ID, the value should be Nan.
Is there a way to do this? Thank you in advance.

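/* Assumes have is sorted by ID and Months. */
data want(drop = i n);
    /* c = number of the observation just read, nobs = total observations */
    set have curobs = c nobs = nobs;
    Sum_Next_6Numbers = 0;
    /* read the next six observations by direct access */
    do p = c + 1 to c + 6;
        /* fewer than six observations left in the data set */
        if p > nobs then do;
            Sum_Next_6Numbers = .;
            leave;
        end;
        set have(keep = Number ID rename = (Number = n ID = i)) point = p;
        /* the look-ahead row belongs to another ID */
        if ID ne i then do;
            Sum_Next_6Numbers = .;
            leave;
        end;
        Sum_Next_6Numbers + n;
    end;
run;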
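
For readers who want to check the look-ahead logic outside SAS, here is a minimal Python sketch of the same idea. It is not a translation of the data step above: the function name `sum_next_n` and the `(id, number)` row layout are assumptions made for illustration, and the rows are assumed to be sorted by ID and month already. The sample uses the first 18 values of ID 1 from the question, followed by three made-up rows for ID 2 to exercise the ID boundary.

```python
def sum_next_n(rows, n=6):
    """For each (id, number) row, sum the numbers of the next n rows.

    Returns None when fewer than n rows remain or when the window
    would run into a different id.
    """
    totals = []
    for i, (id_, _number) in enumerate(rows):
        window = rows[i + 1 : i + 1 + n]
        complete = len(window) == n and all(r[0] == id_ for r in window)
        totals.append(sum(r[1] for r in window) if complete else None)
    return totals


# First 18 months of ID 1 from the question, then three hypothetical rows for ID 2.
numbers = [0, 0, 1, 3, 1, 2, 0, 0, 1, 0, 0, 0, 1, 0, 0, 2, 0, 0]
rows = [(1, v) for v in numbers] + [(2, 0), (2, 0), (2, 0)]

totals = sum_next_n(rows)
print(totals)
```

Because the sample is cut off after 18 months, the last six rows of ID 1 come out as None here, whereas in the full data set they still have six following months.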

Related

Count the number of unique ids for every subset of variables

I want to find the number of unique ids for every subset combination of the variables. For example
data have;
input id var1 var2 var3;
datalines;
5 1 0 0
5 1 1 1
5 1 0 1
5 0 0 0
6 1 0 0
7 1 1 1
8 1 0 1
9 0 0 0
10 1 0 0
11 1 0 0
12 1 . 1
13 0 0 1
;
run;
I want the result to be
var1 var2 var3 count
. . 0 5
. . 1 5
. 0 . 7
. 0 0 5
. 0 1 3
. 1 . 2
. 1 1 2
0 . . 3
0 . 0 2
0 . 1 1
0 0 . 3
0 0 0 2
0 0 1 1
1 . . 7
1 . 0 4
1 . 1 4
1 0 . 5
1 0 0 4
1 0 1 2
1 1 . 2
1 1 1 2
which is the result of appending all the possible proc sql; group bys (var1 is shown below)
proc sql;
create table sub1 as
select var1, count(distinct id) as count
from have
where not missing(var1)
group by var1
;
quit;
I don't care about the case where all variables are missing or when any of the variables in the group by are missing. Is there a more efficient way of doing this?
You can use Proc SUMMARY to compute the combinations of var1-var3 values for each id by group. From the SUMMARY output a SQL query can count the distinct ids per combination.
Example:
data have;
input id var1 var2 var3;
datalines;
5 1 0 0
5 1 1 1
5 1 0 1
5 0 0 0
6 1 0 0
7 1 1 1
8 1 0 1
9 0 0 0
10 1 0 0
11 1 0 0
12 1 . 1
13 0 0 1
;
/* MISSING keeps observations with a missing class value (e.g. id 12) */
proc summary noprint missing data=have;
    by id;
    class var1-var3;
    output out=combos;
run;
proc sql;
    create table want as
    select var1, var2, var3, count(distinct id) as count
    from combos
    group by var1, var2, var3
    ;
quit;
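
As a cross-check of the expected counts, the same computation can be sketched in plain Python. This is only an illustration under assumptions: the function name `count_ids_by_subset` is made up, missing values are represented by None, and, as the question asks, a row is skipped for any subset in which one of the grouping variables is missing.

```python
from itertools import combinations

# Sample data from the question: (id, var1, var2, var3); None marks a missing value.
rows = [
    (5, 1, 0, 0), (5, 1, 1, 1), (5, 1, 0, 1), (5, 0, 0, 0),
    (6, 1, 0, 0), (7, 1, 1, 1), (8, 1, 0, 1), (9, 0, 0, 0),
    (10, 1, 0, 0), (11, 1, 0, 0), (12, 1, None, 1), (13, 0, 0, 1),
]


def count_ids_by_subset(rows, n_vars=3):
    """Count distinct ids for every non-empty subset of the variables."""
    ids_by_key = {}
    for id_, *values in rows:
        for size in range(1, n_vars + 1):
            for cols in combinations(range(n_vars), size):
                # Skip the row if a grouping variable in this subset is missing.
                if any(values[c] is None for c in cols):
                    continue
                # Variables outside the subset are shown as None (the "." in SAS).
                key = tuple(values[c] if c in cols else None for c in range(n_vars))
                ids_by_key.setdefault(key, set()).add(id_)
    return {key: len(ids) for key, ids in ids_by_key.items()}


counts = count_ids_by_subset(rows)
for key in sorted(counts, key=lambda k: tuple(-1 if v is None else v for v in k)):
    print(key, counts[key])
```

The sketch yields 21 combinations, the same number of rows as the desired result listed in the question.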

Convert word Python Pandas Data Frame into Zero One Data Frame

Input
userID col1 col2 col3 col4 col5 col6 col7 col8 col9
1 Java c c++ php python perl html hadoop nodejs
2 nodejs c# c++ oops css html angular java php
3 php python html java angular hadoop c nodejs c#
4 python php css perl hadoop c nodejs c# html
5 perl css python hadoop c nodejs c# java php
6 Java python css perl nodejs c# java php hadoop
7 javascript java perl nodejs angular php mysql hadoop html
8 angular mysql mongodb cs hadoop angular oops html perl
9 nodejs hadoop mysql mongodb angular oops html python java
Desired Output
userID Java C C++ php python perl html hadoop nodejs oops mysql mongo
1 1 1 1 1 1 1 1 1 1 0 0 0
2 1 0 1 1 0 0 1 0 1 0 0 0
3 1 1 0 1 1 1 1 1 1 0 0 0
4 0 0 0 0 1 1 1 0 1 1 1 1
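Use get_dummies, then group the resulting columns by name and aggregate with max (groupby with axis=1 is deprecated from pandas 2.1, where df.T.groupby(level=0).max().T is the equivalent):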
df = pd.get_dummies(df.set_index('userID'), prefix='', prefix_sep='')
df = df.groupby(level=0, axis=1).max().reset_index()
print (df)
userID Java angular c c# c++ cs css hadoop html java javascript \
0 1 1 0 1 0 1 0 0 1 1 0 0
1 2 0 1 0 1 1 0 1 0 1 1 0
2 3 0 1 1 1 0 0 0 1 1 1 0
3 4 0 0 1 1 0 0 1 1 1 0 0
4 5 0 0 1 1 0 0 1 1 0 1 0
5 6 1 0 0 1 0 0 1 1 0 1 0
6 7 0 1 0 0 0 0 0 1 1 1 1
7 8 0 1 0 0 0 1 0 1 1 0 0
8 9 0 1 0 0 0 0 0 1 1 1 0
mongodb mysql nodejs oops perl php python
0 0 0 1 0 1 1 1
1 0 0 1 1 0 1 0
2 0 0 1 0 0 1 1
3 0 0 1 0 1 1 1
4 0 0 1 0 1 1 1
5 0 0 1 0 1 1 1
6 0 1 1 0 1 1 0
7 1 1 0 1 1 0 0
8 1 1 1 1 0 0 1

Convert this Word DataFrame into Zero One Matrix Format DataFrame in Python Pandas

I want to convert a DataFrame of user_Id and skills into a zero-one DataFrame with one row per user and one column per skill.
Input DataFrame
user_Id skills
0 user1 [java, hdfs, hadoop]
1 user2 [python, c++, c]
2 user3 [hadoop, java, hdfs]
3 user4 [html, java, php]
4 user5 [hadoop, php, hdfs]
Desired Output DataFrame
user_Id java c c++ hadoop hdfs python html php
user1 1 0 0 1 1 0 0 0
user2 0 1 1 0 0 1 0 0
user3 1 0 0 1 1 0 0 0
user4 1 0 0 0 0 0 1 1
user5 0 0 0 1 1 0 0 1
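If the skills column holds lists, convert it to str with astype (omit this step if the values are already strings), remove the surrounding [] with strip, build the indicator columns with str.get_dummies, and join the result to user_Id: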
df = df[['user_Id']].join(df['skills'].astype(str).str.strip('[]').str.get_dummies(', '))
print (df)
user_Id c c++ hadoop hdfs html java php python
0 user1 0 0 1 1 0 1 0 0
1 user2 1 1 0 0 0 0 0 1
2 user3 0 0 1 1 0 1 0 0
3 user4 0 0 0 0 1 1 1 0
4 user5 0 0 1 1 0 0 1 0
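If the new column names come out wrapped in quotes, build the indicator columns separately and strip the quotes before concatenating:
df1 = df['skills'].astype(str).str.strip('[]').str.get_dummies(', ')
# if necessary, remove ' from the column names
df1.columns = df1.columns.str.strip("'")
df = pd.concat([df['user_Id'], df1], axis=1)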
print (df)
user_Id c c++ hadoop hdfs html java php python
0 user1 0 0 1 1 0 1 0 0
1 user2 1 1 0 0 0 0 0 1
2 user3 0 0 1 1 0 1 0 0
3 user4 0 0 0 0 1 1 1 0
4 user5 0 0 1 1 0 0 1 0
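
To make explicit what the indicator matrix contains, here is a small pandas-free sketch of the same transformation. It is an illustration only: `to_indicator_rows` is a hypothetical helper, and the skills are assumed to be real Python lists rather than strings.

```python
def to_indicator_rows(records):
    """Turn (user_id, [skill, ...]) records into a header and 0/1 rows."""
    # One column per distinct skill, in sorted order.
    columns = sorted({skill for _, skills in records for skill in skills})
    rows = []
    for user_id, skills in records:
        present = set(skills)
        rows.append([user_id] + [1 if col in present else 0 for col in columns])
    return ["user_Id"] + columns, rows


records = [
    ("user1", ["java", "hdfs", "hadoop"]),
    ("user2", ["python", "c++", "c"]),
    ("user3", ["hadoop", "java", "hdfs"]),
    ("user4", ["html", "java", "php"]),
    ("user5", ["hadoop", "php", "hdfs"]),
]

header, rows = to_indicator_rows(records)
print(header)
for row in rows:
    print(row)
```

The columns come out in sorted order (c, c++, hadoop, hdfs, html, java, php, python), which is also the order of the get_dummies output shown above.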

Using first function

I need to create a new variable WHLDR based on the conditions below, and I'm not sure the last else-if branch is correct. For multi > 1 and ref_1 = 0, the first observation of an id that meets the condition (rel = 0 and ref_1 = 1) should get whldr = 1, every other observation should get whldr = 0, and the same applies to each following id. My code and sample data are below.
data temp_all;
    merge temp_1 (in=inA)
          temp_2 (in=inB)
          temp_3 (in=inC)
    ;
    by id;
    firstid=first.id;
    if multi = 1 then do;
        if rel = 0 then whldr=1;
        else whldr = 0;
    end;
    else if multi > 1 and ref_1 >= 1 then do;
        if rel =0 and ref_1=1 then whldr=1;
        else whldr = 0;
    end;
    else if multi > 1 and ref_1 = 0 then do;
        if rel =0 and ref_1=1 then do;
            if rel =0 and ref_0 ne '0' then do;
                if first.id=1 then whldr=1 ;
                else whldr=0;
            end;
        end;
    end;
run;
Here is sample data:
data have ;
input id a rel b multi ;
cards;
105 . 0 0 1
110 1 0 1 1
110 0 1 1 1
110 . 2 1 1
113 1 0 1 1
113 2 1 1 1
113 0 2 1 1
113 0 2 1 1
135 1 0 1 1
135 0 1 1 1
176 1 0 1 1
176 0 1 1 1
189 1 0 1 1
189 2 1 1 1
189 0 4 1 1
189 0 4 1 1
;
If you have a variable named WHLDR and you want the first observation where it has the value 1 then you can run a data step like this.
data want ;
set have (obs=1);
where whldr=1 ;
run;
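
If the goal is instead to flag the first observation within each id that meets a condition, the logic can be sketched as follows. This is a Python illustration under stated assumptions, not the asker's rule: the helper `flag_first_match`, the sample rows, and the placeholder condition rel = 0 are all made up for the example.

```python
def flag_first_match(rows, condition):
    """Set whldr = 1 on the first row of each id that satisfies condition, else 0."""
    seen = set()  # ids that already received their flag
    flagged = []
    for row in rows:
        is_first = condition(row) and row["id"] not in seen
        if is_first:
            seen.add(row["id"])
        flagged.append({**row, "whldr": 1 if is_first else 0})
    return flagged


# Hypothetical rows; the condition rel == 0 is only a placeholder.
rows = [
    {"id": 110, "rel": 0},
    {"id": 110, "rel": 1},
    {"id": 110, "rel": 0},
    {"id": 113, "rel": 2},
    {"id": 113, "rel": 0},
]

result = flag_first_match(rows, lambda r: r["rel"] == 0)
print([r["whldr"] for r in result])
```

Tracking the ids that have already been flagged plays the role that first.id cannot play on its own, since the first matching row need not be the first row of the id.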

find the count of num column changing by id

Can anyone please help me with the following problem?
I have following dummy data:
id num
1 1
1 2
1 1
1 2
1 1
1 2
2 1
2 15
2 1
2 1
2 1
2 15
2 1
2 15
How can I count the number of times the num column changes for each id?
The expected results, with the new column no_of_times, are shown below:
id number no_of_times
1 1 1
1 2 1
1 1 1
1 2 2
1 1 1
1 2 3
2 1 1
2 15 1
2 1 1
2 1 1
2 1 1
2 15 2
2 1 1
2 15 3
I hope the expected results make the requirement clear.
The following hash approach works for the test data provided with the question:
data have;
input id number no_of_times_target;
cards;
1 1 1
1 2 1
1 1 1
1 2 2
1 1 1
1 2 3
2 1 1
2 15 1
2 1 1
2 1 1
2 1 1
2 15 2
2 1 1
2 15 3
;
run;
data want;
    set have;
    by id;
    if _n_ = 1 then do;
        length prev_number no_of_times 8;
        /* hash keyed on (number, prev_number) that stores the running count */
        declare hash h();
        rc = h.definekey('number','prev_number');
        rc = h.definedata('no_of_times');
        rc = h.definedone();
    end;
    prev_number = lag(number);
    /* an increase within the same id: look up its count and add one */
    if number > prev_number and not(first.id) then do;
        rc = h.find();
        no_of_times = sum(no_of_times,1);
        rc = h.replace();
    end;
    else no_of_times = 1;
    /* start counting afresh for the next id */
    if last.id then rc = h.clear();
    drop rc prev_number;
run;
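
The counting rule used by the hash approach can be sketched in Python as well. This is a rough illustration rather than a port: `count_repeated_rises` is a hypothetical name, and the rule it encodes (count how often the same increase from one value to the next has occurred within an id, and give every other row a 1) is inferred from the expected output in the question.

```python
def count_repeated_rises(rows):
    """rows: list of (id, number). Returns a list of (id, number, no_of_times)."""
    out = []
    counts = {}  # (previous number, number) -> times this increase was seen in the id
    prev_id = prev_num = None
    for id_, num in rows:
        if id_ != prev_id:
            counts = {}  # new id: forget the previous id's increases
            times = 1
        elif num > prev_num:
            key = (prev_num, num)
            counts[key] = counts.get(key, 0) + 1
            times = counts[key]
        else:
            times = 1
        out.append((id_, num, times))
        prev_id, prev_num = id_, num
    return out


rows = [
    (1, 1), (1, 2), (1, 1), (1, 2), (1, 1), (1, 2),
    (2, 1), (2, 15), (2, 1), (2, 1), (2, 1), (2, 15), (2, 1), (2, 15),
]

result = count_repeated_rises(rows)
print([times for _, _, times in result])
```

On the sample data this reproduces the no_of_times column requested in the question.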