Pandas: Combining and summing rows based on values from other rows - python-2.7

In a pandas DataFrame, I'd like to combine all 'other' rows from col_2 into one row for each value of col_1, assigning col_3 the sum of all the corresponding values.
EDIT - Clarification: In total, I have about 20 columns (the values in those columns are unique for each col_1), and there are 80,000 other fields; however, only three columns affect my question.
Current dataframe df:
col_1 col_2 col_3
1 a 30
1 b 25
1 other 1
1 other 5
2 a 321
2 b 1
2 other 45
2 other 52
2 other 17
2 other 8
Desired result:
col_1 col_2 col_3
1 a 30
1 b 25
1 other 6
2 a 321
2 b 1
2 other 122
How can I do this in Pandas?

You can groupby on col_1 and col_2 and call sum and then reset_index:
In [188]:
df.groupby(['col_1','col_2']).sum().reset_index()
Out[188]:
col_1 col_2 col_3
0 1 a 30
1 1 b 25
2 1 other 6
3 2 a 321
4 2 b 1
5 2 other 122
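Because the question mentions roughly 20 other columns, it may be safer to sum only col_3 explicitly. The following is a minimal sketch, assuming the sample data from the question and that col_3 is the only column that needs aggregating. Selecting ['col_3'] and passing as_index=False are choices made here, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col_1": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "col_2": ["a", "b", "other", "other", "a", "b",
              "other", "other", "other", "other"],
    "col_3": [30, 25, 1, 5, 321, 1, 45, 52, 17, 8],
})

# as_index=False keeps the group keys as ordinary columns,
# so no reset_index() call is needed afterwards.
result = df.groupby(["col_1", "col_2"], as_index=False)["col_3"].sum()
print(result)
```

Any additional columns that are constant per col_1 would have to be added to the group keys or merged back afterwards; which is appropriate depends on the real data.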

Related

Power BI DAX - Grouping rows when a value is found in row

I have the table below. I need to group the rows based on product and increment the group number after each row where Set = 1, but the group number goes back to 1 when a new product starts on the next line. I have already created an index.
Index Product Set
1 Table 0
2 Table 0
3 Table 1
4 Table 0
5 Table 0
6 Table 1
7 Table 0
8 Table 1
9 Chair 0
10 Chair 0
11 Chair 0
12 Chair 1
13 Chair 0
14 Chair 0
15 Chair 1
Here's the result I'm after:
Index Product Set Group
1 Table 0 1
2 Table 0 1
3 Table 1 1
4 Table 0 2
5 Table 0 2
6 Table 1 2
7 Table 0 3
8 Table 1 3
9 Chair 0 1
10 Chair 0 1
11 Chair 0 1
12 Chair 1 1
13 Chair 0 2
14 Chair 0 2
15 Chair 1 2
With this
Grouping =
RANKX (
    FILTER (
        'fact',
        'fact'[Set] <> 0
            && EARLIER ( 'fact'[Product] ) = 'fact'[Product]
    ),
    'fact'[Index],
    ,
    ASC
)
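The desired Group column can be read as "one plus the number of Set = 1 rows that appear earlier within the same product". The Python sketch below illustrates that rule on the question's data. It is not DAX, and the function name assign_groups is hypothetical.

```python
# (Index, Product, Set) rows copied from the question.
rows = [
    (1, "Table", 0), (2, "Table", 0), (3, "Table", 1), (4, "Table", 0),
    (5, "Table", 0), (6, "Table", 1), (7, "Table", 0), (8, "Table", 1),
    (9, "Chair", 0), (10, "Chair", 0), (11, "Chair", 0), (12, "Chair", 1),
    (13, "Chair", 0), (14, "Chair", 0), (15, "Chair", 1),
]

def assign_groups(rows):
    """Return (Index, Product, Set, Group) tuples in Index order."""
    sets_seen = {}  # product -> number of Set = 1 rows already passed
    out = []
    for index, product, set_flag in sorted(rows):
        # The row that carries Set = 1 still belongs to the current group.
        group = sets_seen.get(product, 0) + 1
        out.append((index, product, set_flag, group))
        if set_flag == 1:
            # The next row of this product starts a new group.
            sets_seen[product] = sets_seen.get(product, 0) + 1
    return out

groups = [g for _, _, _, g in assign_groups(rows)]
print(groups)
```

Counting per product is what makes the group number restart at 1 when the product changes.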

Left join PROC SQL using threshold date

I am hoping you can help me! Please help!!!!
I am in SAS using PROC SQL and I have datasets A and B with different measurements (relating to patient's health) as follows:
Dataset A
ID Date measurement_a
1 20JUN2013 52.3
1 12JUL2013 65.6
1 28NOV2014 37.4
1 02DEC2014 61.3
1 22SEP2015 40.5
1 15OCT2015 60.5
2 03JUN2011 46.5
2 19JUL2011 54.1
2 29OCT2012 53.6
...
Dataset B
ID Date measurement_b
1 21MAR2007 43
1 13JUL2007 45
1 07APR2009 47
1 14MAY2009 46
1 16FEB2012 42
1 27AUG2012 53
1 12DEC2012 58
1 20JUN2013 56
1 10DEC2013 53
1 23MAY2014 49
1 17SEP2014 44
1 23SEP2015 40
2 16DEC2011 58
2 22AUG2012 54
2 20FEB2013 56
2 29MAY2013 53
...
What I am looking for: if the date in Dataset B is within 6 months of the date in Dataset A, the two measurements should be matched, and a new variable called "time" should number the rows 1, 2, 3, etc. The result should have only as many rows as measurement_a (in other words, I do not need to retain values of measurement_b if they do not match a date in Dataset A). Here is an example of what I mean:
Desired result/dataset:
ID Time measurement_a measurement_b
1 1 52.3 56 (Dataset B Date = 20JUN2013 - Matched exactly)
1 2 65.6 53 (Dataset B date = 10DEC2013 - Within six months of 12JUL2013 [Dataset A Date])
1 3 37.4 44 (Dataset B date = 17SEP2014 - Within six months of 28NOV2014 [Dataset A Date])
1 4 61.3 . (because 17SEP2014 [Dataset B] is closest to 28NOV2014 [Dataset A])
1 5 40.5 40 (because 23SEP2015 [Dataset B] is closest to 22SEP2015 [Dataset A])
1 6 60.5 . (No date in Dataset B that is within 6 months of Date in Dataset A [15OCT2015])
2 1 46.5 . (See below)
2 2 54.1 58 (because 03JUL2011 [Dataset B] is closest to 19JUL2011 [Dataset A])
2 3 53.6 54 (Dataset B date = 22AUG2012 - Within 6 months of Dataset A date = 29OCT2012)
...
I have joined on ID, but matching on the dates is proving difficult. I think it could be done with the difference in months in the "where" statement of the following code:
PROC SQL;
CREATE TABLE join_test as
SELECT * FROM data_a as a
LEFT_JOIN data_b as b
ON a.id = b.id
WHERE days(a.Date - b.Date) <= 180 ;
QUIT;
But this does not do the trick.
Can someone please help me?
I really appreciate it. Thanks in advance.
In the join criteria add the use of the SAS function INTCK to compute the number of month intervals between the two date values. Proc SQL does not have a way to introduce a serial count value, so you will have to add that in a subsequent step. A LEFT JOIN will create a result set with every id/date in table A.
Example:
The columns a.date, b_date and c_months_apart were added to show how the join works. You can safely remove them from the select.
proc sql;
    create table stage1 as
    select
        a.id
      , a.date
      , a.measurement_a
      , b.measurement_b
      , b.date as b_date
      , intck('month', a.date, b.date, 'C') as c_months_apart
    from
        a left join b
        on a.id = b.id
        and intck('month', a.date, b.date, 'C') between 0 and 6
    order by a.id, a.date, b.date
    ;
quit;

data want;
    set stage1;
    by id;
    if first.id then time=1; else time+1;
run;
Output (want)
ID Date measurement_a measurement_b b_date c_months_apart time
1 20JUN2013 52.3 56 20JUN2013 0 1
1 20JUN2013 52.3 53 10DEC2013 5 2
1 12JUL2013 65.6 53 10DEC2013 4 3
1 28NOV2014 37.4 . . . 4
1 02DEC2014 61.3 . . . 5
1 22SEP2015 40.5 40 23SEP2015 0 6
1 15OCT2015 60.5 . . . 7
2 03JUN2011 46.5 58 16DEC2011 6 1
2 19JUL2011 54.1 58 16DEC2011 4 2
2 29OCT2012 53.6 56 20FEB2013 3 3
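The two-step pattern above (a left join with a date condition, then a per-ID counter) can be sketched in plain Python. This illustrates the logic only and is not SAS: it assumes a symmetric 180-day window (the threshold from the question's WHERE clause) instead of the INTCK month count, so its matches differ from the output above. The names left_join_within and add_time are hypothetical.

```python
from datetime import date

# A few rows copied from Dataset A and Dataset B in the question.
data_a = [
    (1, date(2013, 6, 20), 52.3),
    (1, date(2013, 7, 12), 65.6),
    (2, date(2011, 6, 3), 46.5),
    (2, date(2011, 7, 19), 54.1),
]
data_b = [
    (1, date(2013, 6, 20), 56),
    (1, date(2013, 12, 10), 53),
    (2, date(2011, 12, 16), 58),
]

def left_join_within(a_rows, b_rows, max_days=180):
    """Left join A to B on id, keeping B rows within max_days of the A date."""
    joined = []
    for a_id, a_date, m_a in sorted(a_rows):
        matches = sorted(
            (b_date, m_b)
            for b_id, b_date, m_b in b_rows
            if b_id == a_id and abs((b_date - a_date).days) <= max_days
        )
        # Left join: an unmatched A row still yields one row, with missing B values.
        for b_date, m_b in matches or [(None, None)]:
            joined.append({"id": a_id, "date": a_date, "measurement_a": m_a,
                           "b_date": b_date, "measurement_b": m_b})
    return joined

def add_time(rows):
    """Number rows 1, 2, 3, ... within each id (rows must be sorted by id)."""
    prev_id, counter = None, 0
    for row in rows:
        counter = 1 if row["id"] != prev_id else counter + 1
        row["time"] = counter
        prev_id = row["id"]
    return rows

want = add_time(left_join_within(data_a, data_b))
for row in want:
    print(row["id"], row["date"], row["measurement_a"], row["measurement_b"], row["time"])
```

As in the SAS output above, an A row that matches several B rows appears once per match; keeping only the closest B date per A row would need an extra step.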

Return all values using LOOKUPVALUE, not just matches

I have two tables with related fields. I am trying to return the enrollment# from table A into a column in table B.
table A
Serial# Enrollment#
A 1
B 2
C 3
D 4
E 5
table B
Serial# Enrollment#
A 1
B 20
C 3
D 4
E 50
I want this calculated column in table B
Serial# Enrollment# tableAEnrollment#
A 1 1
B 20 2
C 3 3
D 4 4
E 50 5
however this is what I am getting:
Serial# Enrollment# tableAEnrollment#
A 1 1
B 20
C 3 3
D 4 4
E 50
my function is:
tableAEnrollment# = LOOKUPVALUE(A[Enrollment #], A[Serial #], B[Serial #])
It's only bringing back values where the enrollment numbers match. What am I doing wrong?
Thanks in advance!
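For reference, the intended behaviour is a lookup keyed on Serial# alone, independent of whether the two Enrollment# values agree. The Python sketch below illustrates that expectation using the sample tables. It is not DAX and does not diagnose the LOOKUPVALUE call; the variable names are hypothetical.

```python
# Table A as a mapping from Serial# to Enrollment#.
table_a = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

# Table B as (Serial#, Enrollment#) rows.
table_b = [("A", 1), ("B", 20), ("C", 3), ("D", 4), ("E", 50)]

# Look up table A's enrollment by serial only; table B's own
# enrollment value plays no part in the match.
with_lookup = [
    (serial, enrollment, table_a.get(serial))
    for serial, enrollment in table_b
]
for row in with_lookup:
    print(*row)
```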

Running Total in Matrix Rows

I have incremental data elements that I want to summarize. I'm pulling the incremental data into a matrix object just fine, but I need to summarize it by accumulating across columns (within each row).
What I'm seeing:
Column: 1 2 3 4 5
Row |-----------------------------------------
1 | 10 15 5 4 1
2 | 12 12 3 1
3 | 10 9 6
4 | 9 15
5 | 11
What I want to see:
Column: 1 2 3 4 5
Row |-----------------------------------------
1 | 10 25 30 34 35
2 | 12 24 27 28
3 | 10 19 25
4 | 9 24
5 | 11
Here is what I've tried; it just returns the incremental data (as if I had simply pointed it to [INC_AMT]):
Cum_Loss = CALCULATE(
SUM('Table1'[INC_AMT]),
FILTER(All (Table1[ColNum]), Table1[ColNum] <= max(Table1[Column])))
Give this measure a try:
PERIODIC INCREMENTAL SUM =
CALCULATE (
    SUM ( 'TestData'[INC_AMT] ),
    FILTER (
        ALLSELECTED ( TestData ),
        AND (
            TestData[ColNum] <= MAX ( TestData[ColNum] ),
            TestData[RowNum] = MAX ( TestData[RowNum] )
        )
    )
)
I found it helpful to not think about the measures in a matrix perspective. Transform it to a table and you see that one way to think about it is that it's just a cumulative sum where 'Row Number' is also the same. So, add that requirement to your filter and... presto.
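The "think of it as a table" view can be sketched in Python: each cell is the sum of all records with the same row number and a column number less than or equal to the current one. This only illustrates the filter logic, not DAX; the record layout (RowNum, ColNum, INC_AMT) and the function name cumulative are assumptions for the example.

```python
# Incremental amounts from the question, keyed by row number.
incremental = {
    1: [10, 15, 5, 4, 1],
    2: [12, 12, 3, 1],
    3: [10, 9, 6],
    4: [9, 15],
    5: [11],
}

# Flatten the matrix into table-style records: (RowNum, ColNum, INC_AMT).
records = [
    (row_num, col_num, amt)
    for row_num, amounts in incremental.items()
    for col_num, amt in enumerate(amounts, start=1)
]

def cumulative(records, row_num, col_num):
    """Sum INC_AMT over the same row and all columns up to col_num."""
    return sum(amt for r, c, amt in records if r == row_num and c <= col_num)

running = {
    row_num: [cumulative(records, row_num, c) for c in range(1, len(amounts) + 1)]
    for row_num, amounts in incremental.items()
}
for row_num, values in running.items():
    print(row_num, values)
```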

subset of dataset using first and last in sas

Hi, I am trying to subset a dataset that has the following:
ID sal count
1 10 1
1 10 2
1 10 3
1 10 4
2 20 1
2 20 2
2 20 3
3 30 1
3 30 2
3 30 3
3 30 4
I want to keep only those IDs that are recorded 4 times.
I wrote like
data AN; set BU
if last.count gt 4 and last.count lt 4 then delete;
run;
But there is something wrong.
EDIT - Thanks for clarifying. Based on your needs, PROC SQL will be more direct:
proc sql;
CREATE TABLE AN as
SELECT * FROM BU
GROUP BY ID
HAVING MAX(COUNT) = 4
;quit;
For posterity, here is how you could do it with only a data step:
In order to use first. and last., you need to use a by clause, which requires sorting:
proc sort data=BU;
by ID DESCENDING count;
run;
When using a SET statement BY ID, first.ID will be equal to 1 (TRUE) on the first instance of a given ID, 0 (FALSE) for all other records.
data AN;
    set BU;
    by ID;
    retain keepMe;
    if first.ID then do;
        if count = 4 then keepMe = 1;
        else keepMe = 0;
    end;
    if keepMe = 0 then delete;
run;
During the datastep BY ID, your data will look like:
ID sal count keepMe first.ID
1 10 4 1 1
1 10 3 1 0
1 10 2 1 0
1 10 1 1 0
2 20 3 0 1
2 20 2 0 0
2 20 1 0 0
3 30 4 1 1
3 30 3 1 0
3 30 2 1 0
3 30 1 1 0
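The same first.ID / retain idea can be sketched in Python: sort by ID with count descending, decide on the first row of each ID whether to keep the group, and carry that decision forward. This is an illustration of the logic under the question's sample data, not SAS; the variable names are hypothetical.

```python
# (ID, sal, count) rows copied from the question.
bu = [
    (1, 10, 1), (1, 10, 2), (1, 10, 3), (1, 10, 4),
    (2, 20, 1), (2, 20, 2), (2, 20, 3),
    (3, 30, 1), (3, 30, 2), (3, 30, 3), (3, 30, 4),
]

# Equivalent of: proc sort; by ID descending count;
bu_sorted = sorted(bu, key=lambda r: (r[0], -r[2]))

an = []
prev_id = None
keep_me = False  # plays the role of the retained keepMe variable
for id_, sal, count in bu_sorted:
    if id_ != prev_id:           # first.ID
        keep_me = (count == 4)   # the largest count comes first after sorting
    if keep_me:
        an.append((id_, sal, count))
    prev_id = id_

print(an)
```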
If I understand correctly, you are trying to extract all observations that are repeated 4 times or more. If so, your use of last.count and first.count is wrong. last.var is a boolean that indicates which observation is the last in its group. Have a look at Tim's suggestion.
In order to extract all observations that are repeated four times or more, I would suggest using the following PROC SQL:
PROC SQL;
CREATE TABLE WORK.WANT AS
SELECT /* COUNT_of_ID */
(COUNT(t1.ID)) AS COUNT_of_ID,
t1.ID,
t1.SAL,
t1.count
FROM WORK.HAVE t1
GROUP BY t1.ID
HAVING (CALCULATED COUNT_of_ID) ge 4
ORDER BY t1.ID,
t1.SAL,
t1.count;
QUIT;
Result:
1 10 1
1 10 2
1 10 3
1 10 4
3 30 1
3 30 2
3 30 3
3 30 4
Slight variation on Tim's answer, assuming you don't necessarily have the count variable.
proc sql;
CREATE TABLE AN as
SELECT * FROM BU
GROUP BY ID
HAVING Count(ID) >= 4;
quit;
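The GROUP BY ... HAVING COUNT(ID) >= 4 pattern amounts to counting rows per ID and keeping the rows whose ID reaches the threshold. The Python sketch below shows this on the question's sample data; it is an illustration only, and the variable names are hypothetical.

```python
from collections import Counter

# (ID, sal, count) rows copied from the question.
bu = [
    (1, 10, 1), (1, 10, 2), (1, 10, 3), (1, 10, 4),
    (2, 20, 1), (2, 20, 2), (2, 20, 3),
    (3, 30, 1), (3, 30, 2), (3, 30, 3), (3, 30, 4),
]

# Count how many rows each ID has (the GROUP BY / COUNT step).
rows_per_id = Counter(id_ for id_, _, _ in bu)

# Keep every row whose ID occurs at least 4 times (the HAVING step).
an = [row for row in bu if rows_per_id[row[0]] >= 4]
print(an)
```

This version does not rely on the count column, so it also works when no running counter is stored in the data.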