I have a question on SAS Programming. It is about conditional sum. But it is very specific for me. Therefore, I want to ask as an example. I have the following dataset:
Group A Quantity
1 10 7
1 8 4
1 7 3
1 10 5
2 11 6
2 13 8
2 9 7
2 13 9
I want to add two more columns to this dataset. The new dataset should be:
Group A Quantity B NewColumn
1 10 7 10 12 (7+5)
1 8 4 10 12
1 7 3 10 12
1 10 5 10 12
2 13 6 13 15 (6+9)
2 10 8 13 15
2 9 7 13 15
2 13 9 13 15
So, the column B should be equal tha maximum value of each group and it is the same for all observations of each group. In this example, Group number 1 has 4 values. They are 10, 8, 7, 10. The maximum among these values is 10. Therefore, the values of the observations of the B column for the first group are all equal to 10. Maximum number for group number 2 is 13. Therefore, the values of the observations of the B column for the second group are all equal to 13.
The column C is more complicated. Its value depends on the all columns. Similiar to B column, it will be the same within group. More detailed, it is the sum of the specific observations of QUANTITIES column. These specific observations should belong to the observations that have the maximum value in each group. In our example, it is 12 for the first group. The reason is, the maximum number of first group is 10. and the quantities belong to 10 are 7 and 5. So, the sum of these is 12. For the second group it is 15. because the maximum value of the second group is 13 and the quantities belong to 13 are 6 and 9. So the sum is 15.
I hope. I can explain it. Many thanks in advance.
You can do this with proc sql:
proc sql;
select t.*, max_a as b,
(select sum(t2.quantity)
from t t2
where t2.group = t.group and t.a = max_a
) as c
from t join
(select group, max(a) as max_a
from t
group by group
) g
on t.group = g.group;
run;
If the data is coming from an underlying database, most databases support window functions which make this easier.
This is untested (I'm away from sas) and will probably have mistakes, but a triple DoW loop should work. One pass to get the max per group, second pass to get the sum, third pass to output the records. Something like:
data want ;
do until(last.group) ;
by group ;
set have ;
B=max(A,B) ;
end ;
do until(last.group) ;
set have ;
by group ;
if A = B then NewColumn = sum(NewColumn, Quantity) ;
end;
do until(last.group);
set have ;
by group;
output ;
end ;
run;
Related
Partner
UserID
Marks
Group
A
1
4
AM
A
2
7
AM
A
1
4
AM
B
3
5
CM
C
4
6
TM
B
3
5
CM
I want to calculate sum of 'Marks' for each partner excluding double rows.
I've tried (sum(maxOver(Marks, [UserID, Partner], PRE_AGG))). But it's giving me a table like :
Partner
Marks
A
15
B
10
C
6
Whereas, I want a table as below :
Partner
Marks
A
11
B
5
C
6
Thank you for your help, cheers!
You can create a calculated field with a countOver() function to detect the duplicate rows, and then use it as a filter in a sumIf() function.
Example:
sumIf({Marks},countOver({Marks,[{Partner},{UserID},{Marks},{Group}],PRE_AGG)=1)
I begin to use Power BI, and I don't know how to group lines.
I have this kind of data :
api user 01/07/21 02/07/21 03/07/21 ...
a 25 null 3 4
b 25 1 null 2
c 25 1 4 5
a 30 4 3 5
b 30 3 2 2
c 30 1 1 3
And I would like to have the sum of the values per user, not by api and user
user 01/07/21 02/07/21 03/07/21 ...
25 2 7 11
30 8 6 10
Do you know how to do it please ?
I created a table with your sample data (make sure your values are treated as numbers):
Then create a Matrix visual, with "user" in Rows and your desired columns in the Values section:
My data has some problem. The survey is conducted on housing unit. So the two rows with the same person ID might not actually indicate the same person.
I want to assign different ID for actually different person.
Let's say I have this data.
id yearmonth age
1 200001 12
1 200002 12
1 200003 14
1 200004 14
1 200005 14
3rd row is definitely different person. Its age increase by 2.
So I want to change ID like
id yearmonth age
1 200001 12
1 200002 12
10 200003 14
10 200004 14
10 200005 14
How can I do this? I think I can change the ID of 3rd row by writing
bysort id (yearmonth): replace id=id*10 if age[_n-1]>age+1 | age[_n-1]+1<age
(where I multiply by 10 because all IDs have the same number of numbers, so that multiplying by 10 won't give any duplicate)
But how can I change all subsequent rows?
Building on what you have, something like this might do what you want.
bysort id (yearmonth): generate idchange = age[_n-1]>age+1 | age[_n-1]+1<age
bysort id (yearmonth): generate numchange = sum(idchange)
replace id = 10*id + (idchange-1) if idchange>0
Note that this will handle the case where one original id has two or more changes detected. For up to 10 changes, anyhow.
id yearmonth age
2 200001 12
2 200002 14
2 200003 15
2 200004 18
2 200005 18
A need to create a new variable to repeat the earliest date for a ID visit and if it missing it should type missing, after a missing it should keep the earliest date since it was missing(like in the example). I've tried the LAG function and it didn't work; I also try the keep function but just repeat the 25NOV2015 for all records. The final result/"what I need" is in the last column.
Thanks
Example
You need to use retain statement. Retain means your value in each observation won't be reinitialized to a missing. So in the next iteration of data step your variable remembers its value.
Sample data
data a;
input date;
format date ddmmyy10.;
datalines;
.
5
6
7
.
1
2
.
9
;
run;
Solution
data b;
set a;
retain new_date;
format new_date ddmmyy10.;
if date = . then
new_date = .;
if new_date = . then
new_date = date;
run;
Since you didn't post any data I will make up some. Also since the fact that your variable is a date doesn't really impact the answer I will just use some integers as they are easier to type.
data have ;
input id value ## ;
cards;
1 . 1 2 1 3 1 . 1 5 1 6 1 . 1 8
2 1 2 2 2 3 2 . 2 5 2 6
;;;;
Basically your algorithm says that you want to store the value when either the current value is missing or stored value is missing. With multiple BY groups you would also want to set it when you start a new group.
data want ;
set have ;
by id ;
retain new_value ;
if first.id or missing(new_value) or missing(value)
then new_value=value;
run;
Results:
new_
Obs id value value
1 1 . .
2 1 2 2
3 1 3 2
4 1 . .
5 1 5 5
6 1 6 5
7 1 . .
8 1 8 8
9 2 1 1
10 2 2 1
11 2 3 1
12 2 . .
13 2 5 5
14 2 6 5
I would like to label how many unique clusters of data are in a longitudinal dataset and have each member of the cluster carry the cluster count. Distinct clusters are those sharing a set of dates within an id. The order of those distinct cluster relative to previous (earlier) clusters creates the desired result. This coding is necessary to address the problem of event ordering required for a time-dependent covariate analysis.
input id date
1 28jan2015
1 28jan2015
2 26nov2015
3 19oct2015
4 26dec2015
5 23dec2015
6 22may2015
6 23sep2015
6 23sep2015
7 14jan2015
7 27feb2015
7 30may2015
8 16apr2015
8 16apr2015
8 16apr2015
8 16apr2015
8 16apr2015
9 17jul2015
9 03oct2015
9 03oct2015
10 27jul2015
end
I have attempted:
bys id (date): gen count_obs = [_n]
bys id date: gen count_interval_obs = [_n]
egen n_interval = group(id date)
resulting in accurate counts of the total number of observations per id and enumeration of the number of observations within a date. However, the egen function group() results in identifying each unique set of dates, but numbers the groups without regard to id, giving:
id wrong_cluster correct_cluster
1 28jan2015 1 1
1 28jan2015 1 1
2 26nov2015 2 1
3 19oct2015 3 1
4 26dec2015 4 1
5 23dec2015 5 1
6 22may2015 6 1
6 23sep2015 7 2
6 23sep2015 7 2
etc.
egen, group() cannot be used with the by: prefix.
Any assistance would be appreciated.
Todd
Edit: Added an explanation of why the cluster identification is necessary. Clarified what rule defines a cluster.
#Roberto Ferrer has given a direct approach. It follows from the logic he uses that there is also a route using egen's group() function:
egen group = group(id date2)
bysort id (group): gen clust2 = sum(group != group[_n-1])
For each id, when the date is different than the preceding observation, add 1 to the running sum. The 1 is realized when the condition inside sum() is met.
clear
set more off
input id str15 date
1 28jan2015
1 28jan2015
2 26nov2015
3 19oct2015
4 26dec2015
5 23dec2015
6 22may2015
6 23sep2015
6 23sep2015
7 14jan2015
7 27feb2015
7 30may2015
8 16apr2015
8 16apr2015
8 16apr2015
8 16apr2015
8 16apr2015
9 17jul2015
9 03oct2015
9 03oct2015
10 27jul2015
end
gen date2 = date(date, "DMY")
format %td date2
drop date
list, sepby(id)
*----- what you want -----
bysort id (date2) : gen clust = sum(date2 != date2[_n-1])
list, sepby(id)