pandas new column on condition - python-2.7

I have a dataframe like this:
Name one two
John A 20
John P 30
Alex B 40
David C 50
Harry A 60
Harry P 40
I want to combine the rows where A and P both occur for the same name, like this:
Name one two
John A+P 50
Alex B 40
David C 50
Harry A+P 100
I tried a row-wise sum in pandas but didn't get the output in the form I need. Kindly help me out!

Use DataFrameGroupBy.agg with join and sum:
df = df.groupby('Name', sort=False, as_index=False).agg({'one':'+'.join, 'two':'sum'})
print (df)
Name one two
0 John A+P 50
1 Alex B 40
2 David C 50
3 Harry A+P 100
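On pandas 0.25 or newer, the same aggregation can also be written with named aggregation; a sketch reconstructing the example data above:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "John", "Alex", "David", "Harry", "Harry"],
    "one":  ["A", "P", "B", "C", "A", "P"],
    "two":  [20, 30, 40, 50, 60, 40],
})

# Named aggregation: join the 'one' letters with '+' and sum 'two' per Name.
out = df.groupby("Name", sort=False, as_index=False).agg(
    one=("one", "+".join),
    two=("two", "sum"),
)
print(out)
#     Name  one  two
# 0   John  A+P   50
# 1   Alex    B   40
# 2  David    C   50
# 3  Harry  A+P  100
```

This form makes it easy to rename output columns at the same time, e.g. `combined=("one", "+".join)`.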


AWS Quicksight - question about creating a calculated field using if else and custom aggregation

I have data that looks like this:
Date        Name  SurveyID  Score  Error
2022-02-17  Jack  10        95     Name
2022-02-17  Jack  10        95     Address
2022-02-16  Tom   9         100
2022-02-16  Carl  8         93     Zip
2022-02-16  Carl  8         93     Email
2022-02-15  Dan   7         72     Zip
2022-02-15  Dan   7         72     Email
2022-02-15  Dan   7         72     Name
2022-02-15  Dan   6         90     Phone
2022-02-14  Tom   5         98     Gender
I want to segment the data using the average score per individual.
Segment
A: 98%-100%
B: 95%-97%
C: 90%-94%
D: 80%-89%
E: 0% -79%
I wrote an ifelse formula:
ifelse({Score} >= 98,'A',ifelse({Score} >= 95,'B',ifelse({Score} >= 90,'C',ifelse({Score} >= 80,'D','E'))))
This is now the output of what I did:
Date        Name  SurveyID  Score  Error    Segment
2022-02-17  Jack  10        95     Name     B
2022-02-17  Jack  10        95     Address  B
2022-02-16  Tom   9         100             A
2022-02-16  Carl  8         93     Zip      C
2022-02-16  Carl  8         93     Email    C
2022-02-15  Dan   7         72     Zip      E
2022-02-15  Dan   7         72     Email    E
2022-02-15  Dan   7         72     Name     E
2022-02-15  Dan   6         90     Phone    C
2022-02-14  Tom   5         98     Gender   A
I realized that my calculation only applies to each row's score. I was expecting output like this:
Name  Average Score  Total Survey  Segment
Jack  95             1             B
Tom   99             2             A
Carl  93             1             C
Dan   81             2             D
I have tried to create another calculated field for Average Score which is:
avgOver({Score}, [Name], PRE_AGG)
I believe I am missing a distinct count of survey IDs in that formula, but I don't know where to place it. As for the segmentation calculation, I cannot for the life of me figure that part out without getting aggregation errors in Quicksight. Please help, thank you.
Got the answer from the Quicksight Community; pasting it here.
For segmentation, you can reuse the calculated field you created for the average score:
avg_score = avgOver(Score,[Name],PRE_AGG)
Segment
ifelse
(
{avg_score}>= 98,'A',
{avg_score}>= 95,'B',
{avg_score}>= 90,'C',
{avg_score}>= 80,'D',
'E'
)
The survey id can be used to get the distinct count per individual.
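Quicksight aside, the expected per-name table can be sanity-checked with a rough pandas equivalent. The dedup step and the helper function below are illustrative assumptions, not Quicksight syntax:

```python
import pandas as pd

df = pd.DataFrame({
    "Name":     ["Jack", "Jack", "Tom", "Carl", "Carl", "Dan", "Dan", "Dan", "Dan", "Tom"],
    "SurveyID": [10, 10, 9, 8, 8, 7, 7, 7, 6, 5],
    "Score":    [95, 95, 100, 93, 93, 72, 72, 72, 90, 98],
})

# One row per survey (error rows repeat the same survey/score),
# then average the scores and count distinct surveys per person.
per_survey = df.drop_duplicates(subset=["Name", "SurveyID"])
summary = per_survey.groupby("Name").agg(
    avg_score=("Score", "mean"),
    total_surveys=("SurveyID", "nunique"),
).reset_index()

def segment(score):
    # Same thresholds as the Quicksight ifelse.
    if score >= 98: return "A"
    if score >= 95: return "B"
    if score >= 90: return "C"
    if score >= 80: return "D"
    return "E"

summary["Segment"] = summary["avg_score"].apply(segment)
print(summary)
# Carl: 93, 1, C | Dan: 81, 2, D | Jack: 95, 1, B | Tom: 99, 2, A
```

The deduplication before the mean matters: Dan's survey 7 appears three times (one row per error), and averaging the raw rows would weight it three times over.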

SAS Base: one-to-one reading driven by the biggest table, or getting data from the next row

I'm new to SAS Base and need help.
I have 2 tables with different data and I need to merge them.
But in one step I need data from the next row.
Example of what I need:
ID Fdate Tdate NFdate NTdate
id1 date1 date1 date2 date2
id2 date2 date2 date3 date3
....
I did it with 2 merges:
data result;
merge table1 table2; by ...;
merge table1(firstobs=2) table2(firstobs=2); by ...;
run;
I expected 10 rows but got 9, because the one-to-one reading stopped at the last row of the smallest input (the second merge). How can I get the last row (i.e., drive the one-to-one reading by the biggest table)?
Most simple data steps stop not at the bottom of the step but in the middle, when they read past the end of the input. The reason you are getting N-1 observations is that the second input has one fewer record. So you need to do something to stop that.
One simple way is to not execute the second read when you are processing the last observation read by the first one. You can use the END= option to create a boolean variable that lets you know when that happens.
Here is simple example using SASHELP.CLASS.
data test;
set sashelp.class end=eof;
if not eof then set sashelp.class(firstobs=2 keep=name rename=(name=next_name));
else call missing(next_name);
run;
Results:
next_
Obs Name Sex Age Height Weight name
1 Alfred M 14 69.0 112.5 Alice
2 Alice F 13 56.5 84.0 Barbara
3 Barbara F 13 65.3 98.0 Carol
4 Carol F 14 62.8 102.5 Henry
5 Henry M 14 63.5 102.5 James
6 James M 12 57.3 83.0 Jane
7 Jane F 12 59.8 84.5 Janet
8 Janet F 15 62.5 112.5 Jeffrey
9 Jeffrey M 13 62.5 84.0 John
10 John M 12 59.0 99.5 Joyce
11 Joyce F 11 51.3 50.5 Judy
12 Judy F 14 64.3 90.0 Louise
13 Louise F 12 56.3 77.0 Mary
14 Mary F 15 66.5 112.0 Philip
15 Philip M 16 72.0 150.0 Robert
16 Robert M 12 64.8 128.0 Ronald
17 Ronald M 15 67.0 133.0 Thomas
18 Thomas M 11 57.5 85.0 William
19 William M 15 66.5 112.0
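For comparison, the same "value from the next row" look-ahead can be sketched in pandas with shift(-1); this is an analogy to the SAS step above, not SAS itself:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alfred", "Alice", "Barbara"], "age": [14, 13, 13]})

# shift(-1) aligns each row with the following row's value; the last row
# gets NaN, which mirrors the CALL MISSING branch of the SAS data step.
df["next_name"] = df["name"].shift(-1)
print(df)
#       name  age next_name
# 0   Alfred   14     Alice
# 1    Alice   13   Barbara
# 2  Barbara   13       NaN
```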

How to group values and sum in pandas [duplicate]

I am using this dataframe:
Fruit Date Name Number
Apples 10/6/2016 Bob 7
Apples 10/6/2016 Bob 8
Apples 10/6/2016 Mike 9
Apples 10/7/2016 Steve 10
Apples 10/7/2016 Bob 1
Oranges 10/7/2016 Bob 2
Oranges 10/6/2016 Tom 15
Oranges 10/6/2016 Mike 57
Oranges 10/6/2016 Bob 65
Oranges 10/7/2016 Tony 1
Grapes 10/7/2016 Bob 1
Grapes 10/7/2016 Tom 87
Grapes 10/7/2016 Bob 22
Grapes 10/7/2016 Bob 12
Grapes 10/7/2016 Tony 15
I would like to aggregate this by Name and then by Fruit to get a total number of Fruit per Name. For example:
Bob,Apples,16
I tried grouping by Name and Fruit but how do I get the total number of Fruit?
Use GroupBy.sum:
df.groupby(['Fruit','Name']).sum()
Out[31]:
Number
Fruit Name
Apples Bob 16
Mike 9
Steve 10
Grapes Bob 35
Tom 87
Tony 15
Oranges Bob 67
Mike 57
Tom 15
Tony 1
To specify the column to sum, use this: df.groupby(['Name', 'Fruit'])['Number'].sum()
Also you can use agg function,
df.groupby(['Name', 'Fruit'])['Number'].agg('sum')
If you want to keep the original columns Fruit and Name, use reset_index(). Otherwise Fruit and Name will become part of the index.
df.groupby(['Fruit','Name'])['Number'].sum().reset_index()
Fruit Name Number
Apples Bob 16
Apples Mike 9
Apples Steve 10
Grapes Bob 35
Grapes Tom 87
Grapes Tony 15
Oranges Bob 67
Oranges Mike 57
Oranges Tom 15
Oranges Tony 1
As seen in the other answers:
df.groupby(['Fruit','Name'])['Number'].sum()
Number
Fruit Name
Apples Bob 16
Mike 9
Steve 10
Grapes Bob 35
Tom 87
Tony 15
Oranges Bob 67
Mike 57
Tom 15
Tony 1
Both of the other answers accomplish what you want.
You can use the pivot functionality to arrange the data in a nice table (pivot's keyword arguments are required in pandas 2.0+):
df.groupby(['Fruit','Name'], as_index=False).sum().pivot(index='Fruit', columns='Name').fillna(0)
Name Bob Mike Steve Tom Tony
Fruit
Apples 16.0 9.0 10.0 0.0 0.0
Grapes 35.0 0.0 0.0 87.0 15.0
Oranges 67.0 57.0 0.0 15.0 1.0
df.groupby(['Fruit','Name'])['Number'].sum()
You can select a different column to sum by changing the column selection in the brackets.
A variation on the .agg() function; it provides the ability to (1) keep the result as a DataFrame, (2) apply averages, counts, sums, etc., and (3) group by multiple columns while maintaining legibility.
df.groupby(['att1', 'att2']).agg({'att1': "count", 'att3': "sum",'att4': 'mean'})
using your values...
df.groupby(['Name', 'Fruit']).agg({'Number': "sum"})
You can set the groupby columns as the index and then sum by index level. (sum(level=...) was deprecated in pandas 1.3 and removed in 2.0; grouping by the index levels is the current equivalent.)
df.set_index(['Fruit','Name'])[['Number']].groupby(level=[0,1], sort=False).sum()
Out[175]:
Number
Fruit Name
Apples Bob 16
Mike 9
Steve 10
Oranges Bob 67
Tom 15
Mike 57
Tony 1
Grapes Bob 35
Tom 87
Tony 15
You could also use transform() on the Number column after the group by. This operation computes each group's total with sum and returns a series with the same index as the original dataframe.
df['Number'] = df.groupby(['Fruit', 'Name'])['Number'].transform('sum')
df = df.drop_duplicates(subset=['Fruit', 'Name']).drop(columns='Date')
Then you can drop the duplicate rows on the Fruit and Name columns, and drop the Date column with drop(columns=...) (the bare positional axis argument was removed in pandas 2.0).
# print(df)
Fruit Name Number
0 Apples Bob 16
2 Apples Mike 9
3 Apples Steve 10
5 Oranges Bob 67
6 Oranges Tom 15
7 Oranges Mike 57
9 Oranges Tony 1
10 Grapes Bob 35
11 Grapes Tom 87
14 Grapes Tony 15
# You could achieve the same result with functions discussed by others:
# print(df.groupby(['Fruit', 'Name'], as_index=False)['Number'].sum())
# print(df.groupby(['Fruit', 'Name'], as_index=False)['Number'].agg('sum'))
There is an official tutorial Group by: split-apply-combine talking about what you can do after group by.
If you want the aggregated column to have a custom name such as Total Number or Total (all the solutions here result in a dataframe where the aggregated column is named Number), use named aggregation:
df.groupby(['Fruit', 'Name'], as_index=False).agg(**{'Total Number': ('Number', 'sum')})
or (if the custom name doesn't need to have a white space in it):
df.groupby(['Fruit', 'Name'], as_index=False).agg(Total=('Number', 'sum'))
this is equivalent to SQL query:
SELECT Fruit, Name, sum(Number) AS Total
FROM df
GROUP BY Fruit, Name
Speaking of SQL, there's the pandasql module, which allows you to query pandas DataFrames in the local environment using SQL syntax. It's not part of pandas, so it has to be installed separately.
#! pip install pandasql
from pandasql import sqldf
sqldf("""
SELECT Fruit, Name, sum(Number) AS Total
FROM df
GROUP BY Fruit, Name
""")
You can use dfsql
for your problem, it will look something like:
df.sql('SELECT fruit, sum(number) GROUP BY fruit')
https://github.com/mindsdb/dfsql
here is an article about it:
https://medium.com/riselab/why-every-data-scientist-using-pandas-needs-modin-bringing-sql-to-dataframes-3b216b29a7c0
You can use reset_index() to reset the index after the sum
df.groupby(['Fruit','Name'])['Number'].sum().reset_index()
or
df.groupby(['Fruit','Name'], as_index=False)['Number'].sum()

Effective way to store list of list of dict to csv

I've got a dataframe like this:
Name Nationality Tall Age
John USA 190 24
Thomas French 194 25
Anton Malaysia 180 23
Chris Argentina 190 26
So let's say I get incoming data structured like this, each element representing the data of one row:
data = [{
'food':{'lunch':'Apple',
'breakfast':'Milk',
'dinner':'Meatball'},
'drink':{'favourite':'coke',
'dislike':'juice'}
},
    # ...and 3 other records
]
'data' is a variable holding the food and drink predicted by my machine learning model. There are more records (about 400k rows), but I process them in batches (currently 2k rows per iteration). The expected result looks like:
Name Nationality Tall Age Lunch Breakfast Dinner Favourite Dislike
John USA 190 24 Apple Milk Meatball Coke Juice
Thomas French 194 25 ....
Anton Malaysia 180 23 ....
Chris Argentina 190 26 ....
Is there an effective way to achieve that dataframe? So far I've iterated over the data variable and extracted the value of each predicted label, but that process takes a lot of time.
You need to flatten the dictionaries first, then create a DataFrame and join it to the original:
data = [{
'a':{'lunch':'Apple',
'breakfast':'Milk',
'dinner':'Meatball'},
'b':{'favourite':'coke',
'dislike':'juice'}
},
{
'a':{'lunch':'Apple1',
'breakfast':'Milk1',
'dinner':'Meatball2'},
'b':{'favourite':'coke2',
'dislike':'juice3'}
},
{
'a':{'lunch':'Apple4',
'breakfast':'Milk5',
'dinner':'Meatball4'},
'b':{'favourite':'coke2',
'dislike':'juice4'}
},
{
'a':{'lunch':'Apple3',
'breakfast':'Milk8',
'dinner':'Meatball7'},
'b':{'favourite':'coke4',
'dislike':'juice1'}
}
]
# flatten each record's nested dicts into one flat dict per row
L = [{k: v for x in d.values() for k, v in x.items()} for d in data]
df1 = pd.DataFrame(L)
print (df1)
breakfast dinner dislike favourite lunch
0 Milk Meatball juice coke Apple
1 Milk1 Meatball2 juice3 coke2 Apple1
2 Milk5 Meatball4 juice4 coke2 Apple4
3 Milk8 Meatball7 juice1 coke4 Apple3
df2 = df.join(df1)
print (df2)
Name Nationality Tall Age breakfast dinner dislike favourite \
0 John USA 190 24 Milk Meatball juice coke
1 Thomas French 194 25 Milk1 Meatball2 juice3 coke2
2 Anton Malaysia 180 23 Milk5 Meatball4 juice4 coke2
3 Chris Argentina 190 26 Milk8 Meatball7 juice1 coke4
lunch
0 Apple
1 Apple1
2 Apple4
3 Apple3

Creating All Possible Combinations in a Table Using SAS

I have a table with four variables and I want a table with all combinations of their values. Below is a smaller example showing only three columns.
NAME AMOUNT COUNT
RAJ 90 1
RAVI 20 4
JOHN 30 5
JOSEPH 40 3
The following output shows values only for RAJ; the actual output should include all names.
NAME AMOUNT COUNT
RAJ 90 1
RAJ 90 4
RAJ 90 5
RAJ 90 3
RAJ 20 1
RAJ 20 4
RAJ 20 5
RAJ 20 3
RAJ 30 1
RAJ 30 4
RAJ 30 5
RAJ 30 3
RAJ 40 1
RAJ 40 4
RAJ 40 5
RAJ 40 3
.
.
.
.
There are a couple of useful options in SAS to do this; both create a table with all possible combinations of variables, and then you can just drop the summary data that you don't need. Given your initial dataset:
data have;
input NAME $ AMOUNT COUNT;
datalines;
RAJ 90 1
RAVI 20 4
JOHN 30 5
JOSEPH 40 3
;;;;
run;
There is PROC FREQ with SPARSE.
proc freq data=have noprint;
tables name*amount*count/sparse out=want(drop=percent);
run;
There is also PROC TABULATE.
proc tabulate data=have out=want(keep=name amount count);
class name amount count;
tables name*amount,count /printmiss;
run;
This has the advantage of not conflicting with the name of the COUNT variable (PROC FREQ's out= dataset adds its own COUNT column).
Try
PROC SQL;
CREATE TABLE tbl_out AS
SELECT a.name AS name
,b.amount AS amount
,c.count AS count
FROM tbl_in AS a, tbl_in AS b, tbl_in AS c
;
QUIT;
This performs a double self-join (a Cartesian product of the table with itself, twice) and should have the desired effect.
Here's a variation on JustinJDavies's answer, using an explicit CROSS JOIN clause:
data have;
input NAME $ AMOUNT COUNT;
datalines;
RAJ 90 1
RAVI 20 4
JOHN 30 5
JOSEPH 40 3
;
run;
PROC SQL;
create table combs as
select *
from have(keep=NAME)
cross join have(keep=AMOUNT)
cross join have(keep=COUNT)
order by name, amount, count;
QUIT;
Results:
NAME AMOUNT COUNT
JOHN 20 1
JOHN 20 3
JOHN 20 4
JOHN 20 5
JOHN 30 1
JOHN 30 3
JOHN 30 4
JOHN 30 5
...
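The same Cartesian product can be sketched in pandas with merge(how='cross'), available since pandas 1.2; a rough cross-check of the SQL answers above, using the example data:

```python
import pandas as pd

df = pd.DataFrame({
    "NAME":   ["RAJ", "RAVI", "JOHN", "JOSEPH"],
    "AMOUNT": [90, 20, 30, 40],
    "COUNT":  [1, 4, 5, 3],
})

# Cross-join each column against the others: 4 * 4 * 4 = 64 combinations,
# mirroring the PROC SQL CROSS JOIN of the three one-column views.
combs = (
    df[["NAME"]]
    .merge(df[["AMOUNT"]], how="cross")
    .merge(df[["COUNT"]], how="cross")
    .sort_values(["NAME", "AMOUNT", "COUNT"])
    .reset_index(drop=True)
)
print(len(combs))   # 64
print(combs.head(4))
#      NAME  AMOUNT  COUNT
# 0    JOHN      20      1
# 1    JOHN      20      3
# 2    JOHN      20      4
# 3    JOHN      20      5
```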