Pandas groupby dictionary - list

New to pandas, sorry if the solution is quite obvious.
I have a dataframe (see below) with different movie scenes and the environment for each scene:
import pandas as pd
data = [{'movie' : 'movie_X', 'scene' : '1', 'environment' : 'home'},
        {'movie' : 'movie_X', 'scene' : '2', 'environment' : 'car'},
        {'movie' : 'movie_X', 'scene' : '3', 'environment' : 'home'},
        {'movie' : 'movie_Y', 'scene' : '1', 'environment' : 'home'},
        {'movie' : 'movie_Y', 'scene' : '2', 'environment' : 'office'},
        {'movie' : 'movie_Z', 'scene' : '1', 'environment' : 'boat'},
        {'movie' : 'movie_Z', 'scene' : '2', 'environment' : 'beach'},
        {'movie' : 'movie_Z', 'scene' : '3', 'environment' : 'home'}]
myDF = pd.DataFrame(data)
In this case, the movies belong to multiple genres. I have a dictionary (below) describing which genres each movie belongs to:
genreDict = {'movie_X' : ['romance', 'action'],
             'movie_Y' : ['comedy', 'romance', 'action'],
             'movie_Z' : ['horror', 'thriller', 'romance']}
I want to group myDF by this dictionary, specifically to be able to tell the number of times a specific environment turned up in a particular genre (for example, in the genre horror, 'boat' was counted once, 'beach' once, and 'home' once). What would be the best and most efficient way to go about this? I have tried mapping the dictionary onto the dataframe and then grouping by the list:
myDF['genres'] = myDF['movie'].map(genreDict)
Which returns:
movie scene environment genres
0 movie_X 1 home [romance, action]
1 movie_X 2 car [romance, action]
2 movie_X 3 home [romance, action]
3 movie_Y 1 home [comedy, romance, action]
4 movie_Y 2 office [comedy, romance, action]
5 movie_Z 1 boat [horror, thriller, romance]
6 movie_Z 2 beach [horror, thriller, romance]
7 movie_Z 3 home [horror, thriller, romance]
However, when I try to group by the genres column I get an error saying lists are unhashable. Hopefully you all can help :)

Non-scalar objects generally cause problems in pandas. Beyond that, you need to tidy up your data so that your next steps are easier (the main operations on tabular structures are generally defined on tidy data sets). You need a data set where you don't list all the genres in one row; instead, each genre gets its own row.
Here's one of the possible ways to achieve that:
genre_df = pd.DataFrame(myDF['movie'].map(genreDict).tolist())
df = myDF.join(genre_df.stack().rename('genre').reset_index(level=1, drop=True))
df
Out:
environment movie scene genre
0 home movie_X 1 romance
0 home movie_X 1 action
1 car movie_X 2 romance
1 car movie_X 2 action
2 home movie_X 3 romance
2 home movie_X 3 action
3 home movie_Y 1 comedy
3 home movie_Y 1 romance
3 home movie_Y 1 action
4 office movie_Y 2 comedy
4 office movie_Y 2 romance
4 office movie_Y 2 action
5 boat movie_Z 1 horror
5 boat movie_Z 1 thriller
5 boat movie_Z 1 romance
6 beach movie_Z 2 horror
6 beach movie_Z 2 thriller
6 beach movie_Z 2 romance
7 home movie_Z 3 horror
7 home movie_Z 3 thriller
7 home movie_Z 3 romance
Once you have a structure like this, it is much easier to group or cross tabulate your data:
df.groupby('genre').size()
Out:
genre
action 5
comedy 2
horror 3
romance 8
thriller 3
dtype: int64
pd.crosstab(df['genre'], df['environment'])
Out:
environment beach boat car home office
genre
action 0 0 1 3 1
comedy 0 0 0 1 1
horror 1 1 0 1 0
romance 1 1 1 4 1
thriller 1 1 0 1 0
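To read your original example straight off the crosstab, pull out a single genre's row, e.g. for horror:
pd.crosstab(df['genre'], df['environment']).loc['horror']
Out:
environment
beach     1
boat      1
car       0
home      1
office    0
Name: horror, dtype: int64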
Here's a great read by Hadley Wickham: Tidy Data.

If the DataFrame is larger, it is faster to use numpy to repeat the rows by the list lengths, with numpy.repeat, numpy.concatenate and Index.values:
import numpy as np

#get length of lists in column genres
l = myDF['genres'].str.len()
#convert column to numpy array
vals = myDF['genres'].values
#repeat index by lengths
idx = np.repeat(myDF.index, l)
#expand rows by duplicated index values
myDF = myDF.loc[idx]
#flatten the lists column
myDF['genres'] = np.concatenate(vals)
#restore a default monotonic index (0,1,2...)
myDF = myDF.reset_index(drop=True)
print(myDF)
environment movie scene genres
0 home movie_X 1 romance
1 home movie_X 1 action
2 car movie_X 2 romance
3 car movie_X 2 action
4 home movie_X 3 romance
5 home movie_X 3 action
6 home movie_Y 1 comedy
7 home movie_Y 1 romance
8 home movie_Y 1 action
9 office movie_Y 2 comedy
10 office movie_Y 2 romance
11 office movie_Y 2 action
12 boat movie_Z 1 horror
13 boat movie_Z 1 thriller
14 boat movie_Z 1 romance
15 beach movie_Z 2 horror
16 beach movie_Z 2 thriller
17 beach movie_Z 2 romance
18 home movie_Z 3 horror
19 home movie_Z 3 thriller
20 home movie_Z 3 romance
Then use groupby and aggregate size:
df1 = myDF.groupby(['genres','environment']).size().reset_index(name='count')
print(df1)
genres environment count
0 action car 1
1 action home 3
2 action office 1
3 comedy home 1
4 comedy office 1
5 horror beach 1
6 horror boat 1
7 horror home 1
8 romance beach 1
9 romance boat 1
10 romance car 1
11 romance home 4
12 romance office 1
13 thriller beach 1
14 thriller boat 1
15 thriller home 1
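If you prefer the wide, crosstab-like layout, the same counts can be pivoted with unstack (a small optional step on top of the answer above):
print(myDF.groupby(['genres','environment']).size().unstack(fill_value=0))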

Related

How can I add data into a SAS table of different groups?

The primary key is car, model, and date. I have to fill in the empty fields with the previous data, but within groups keyed by car and model.
Example:
Row Car Model Date Sec Door Colour
1 Ford Focus 2002 1 5 blue
2 Ford Focus 2002 2 5 blue
3 Ford Focus 2002 3 5 blue
4 Ford Focus 2002 4 5 blue
5 Ford kuga 2004 5 5 blue
6 Ford kuga 2004 1 5
7 Ford kuga 2004 2 5
8 Ford Mondeo 2004 3 5 red
9 Ford Mondeo 2004 4 4 red
10 Ford Mondeo 2004 5 red
11 Ford Mondeo 2004 6 red
12 Ford Mondeo 2004 7 4 red
13 Mercedes Benz 2010 1 3
14 Mercedes Benz 2010 1 3 white
15 Mercedes Benz 2010 1 5 Yellow
16 Mercedes 190E 2011 1 red
17 Mercedes 190E 2012 1 6
And the final output of the table is ...
Output:
Row Car Model Date Sec Door Colour
1 Ford Focus 2002 1 5 blue
2 Ford Focus 2002 2 5 blue
3 Ford Focus 2002 3 5 blue
4 Ford Focus 2002 4 5 blue
5 Ford kuga 2004 5 5 blue
6 Ford kuga 2004 1 5 blue
7 Ford kuga 2004 2 5 blue
8 Ford Mondeo 2004 3 5 red
9 Ford Mondeo 2004 4 4 red
10 Ford Mondeo 2004 5 4 red
11 Ford Mondeo 2004 6 4 red
12 Ford Mondeo 2004 7 4 red
13 Mercedes Benz 2010 1 3 red
14 Mercedes Benz 2010 1 3 white
15 Mercedes Benz 2010 1 5 Yellow
16 Mercedes 190E 2011 1 5 red
17 Mercedes 190E 2012 1 6 red
How is it done? Thank you
The UPDATE trick will work to produce the output you show.
data cars;
retain dummyby 1;
infile cards firstobs=2;
input row car $ model $ date sec door colour $;
cards;
Row Car Model Date Sec Door Colour
1 Ford Focus 2002 1 5 blue
2 Ford Focus 2002 2 5 blue
3 Ford Focus 2002 3 5 blue
4 Ford Focus 2002 4 5 blue
5 Ford kuga 2004 5 5 blue
6 Ford kuga 2004 1 5 .
7 Ford kuga 2004 2 5 .
8 Ford Mondeo 2004 3 5 red
9 Ford Mondeo 2004 4 4 red
10 Ford Mondeo 2004 5 . red
11 Ford Mondeo 2004 6 . red
12 Ford Mondeo 2004 7 4 red
13 Mercedes Benz 2010 1 3 .
14 Mercedes Benz 2010 1 3 white
15 Mercedes Benz 2010 1 5 Yellow
16 Mercedes 190E 2011 1 . red
17 Mercedes 190E 2012 1 6 .
;;;;
run;
data locf;
update cars(obs=0) cars;
by dummyby; * use "BY car" instead to LOCF within each car;
output;
drop dummyby;
run;
proc print;
run;
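For comparison, the same last-observation-carried-forward idea in pandas is a forward fill. Here is a minimal sketch, assuming the table is loaded into a DataFrame named cars with NaN standing in for the blanks:
import pandas as pd
import numpy as np

# a few rows of the example; NaN marks the blank fields
cars = pd.DataFrame({
    'Car':    ['Ford', 'Ford', 'Ford', 'Mercedes'],
    'Model':  ['kuga', 'kuga', 'Mondeo', 'Benz'],
    'Door':   [5, 5, np.nan, 3],
    'Colour': ['blue', np.nan, 'red', np.nan],
})

# whole-table LOCF, like UPDATE with the dummy BY variable
cars[['Door', 'Colour']] = cars[['Door', 'Colour']].ffill()

# per-car LOCF would mirror the "BY car" variant:
# cars.groupby('Car')[['Door', 'Colour']].ffill()
print(cars)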

cumulative average powerbi by month

I have below dataset.
Math Literature Biology date student
4 2 5 2019-08-25 A
4 5 4 2019-08-08 A
5 4 5 2019-08-23 A
5 5 5 2019-08-15 A
5 5 5 2019-07-19 A
5 5 5 2019-07-15 A
5 5 5 2019-07-03 A
5 5 5 2019-06-26 A
1 1 2 2019-06-18 A
2 3 3 2019-06-14 A
5 5 5 2019-05-01 A
2 1 3 2019-04-26 A
I need to develop a solution in Power BI so that the output shows the cumulative average per subject per month.
For example
April May June July August
Math | 2 3.5 3 3.75 4
Literature | 1 3 3 3.75 3.83
Biology | 3 4 3.6 4.125 4.33
Can you help?
You can use a matrix visualization for this.
Create a month-year variable and use it in the columns.
Use the average of Math, Literature and Biology in Values.
Under the Format pane --> Values --> Show on rows --> select this.
This should give the view you are looking for. You can edit the value headers to your requirements.
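For reference, here is a minimal pandas sketch of the underlying arithmetic (cumulative sum of scores divided by cumulative count of records, per month); the column names are taken from the question:
import pandas as pd

# a few rows from the question, enough to make the sketch runnable
df = pd.DataFrame({
    'Math':       [2, 5, 2, 1, 5],
    'Literature': [1, 5, 3, 1, 5],
    'Biology':    [3, 5, 3, 2, 5],
    'date': pd.to_datetime(['2019-04-26', '2019-05-01',
                            '2019-06-14', '2019-06-18', '2019-06-26']),
})

df['month'] = df['date'].dt.to_period('M')
subjects = ['Math', 'Literature', 'Biology']

# cumulative average = running sum of scores / running count of records
cum_avg = (df.groupby('month')[subjects].sum().cumsum()
           / df.groupby('month')[subjects].count().cumsum()).T
print(cum_avg)  # Math: 2.0, 3.5, 3.0 for April, May, June, matching the expected output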

Sum 5 rows at a time in an ordered SAS table with no unique identifier using proc sql

I'm working with a SAS table where I have ordered data that I need to sum in intervals of 5. I don't have a unique ID I can use for the group by statement and I'm struggling to find a solution.
Say I have this table
Number Name X Y
1 Susan 2 1
2 Susan 3 3
3 Susan 3 3
4 Susan 4 1
5 Susan 1 2
6 Susan 1 1
7 Susan 1 1
8 Susan 2 4
9 Susan 1 5
10 Susan 4 2
1 Steve 2 4
2 Steve 2 3
3 Steve 1 2
4 Steve 3 5
5 Steve 1 1
6 Steve 1 3
7 Steve 2 3
8 Steve 2 4
9 Steve 1 1
10 Steve 1 1
I'd want the output to look like
Number Name X Y
1-5 Susan 13 10
6-10 Susan 9 13
1-5 Steve 9 15
6-10 Steve 7 12
Is there an easy way to get output like this using proc sql? Thanks!
Try this:
proc sql;
    select ceil(Number/5) as Grouping, Name, sum(X) as X, sum(Y) as Y
    from have
    group by Name, Grouping;
quit;
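The same bucketing logic in pandas, as a rough sketch (assuming the table is loaded into a DataFrame named have):
import numpy as np
import pandas as pd

# Susan's and Steve's rows from the question
have = pd.DataFrame({
    'Number': list(range(1, 11)) * 2,
    'Name':   ['Susan'] * 10 + ['Steve'] * 10,
    'X': [2, 3, 3, 4, 1, 1, 1, 2, 1, 4, 2, 2, 1, 3, 1, 1, 2, 2, 1, 1],
    'Y': [1, 3, 3, 1, 2, 1, 1, 4, 5, 2, 4, 3, 2, 5, 1, 3, 3, 4, 1, 1],
})

# ceil(Number/5) puts rows 1-5 in bucket 1 and rows 6-10 in bucket 2
bucket = np.ceil(have['Number'] / 5).astype(int)
print(have.groupby(['Name', bucket])[['X', 'Y']].sum())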

add 'id' in pandas dataframe

I have a dataframe where DOCUMENT_ID is an id that can be associated with multiple words from the WORD column. I need to add ids for each word within its document.
My data looks like this:
DOCUMENT_ID WORD COUNT
0 262056708396949504 4
1 262056708396949504 DVD 1
2 262056708396949504 Girls 1
3 262056708396949504 Gone 1
4 262056708396949504 Gras 1
5 262056708396949504 Hurricane 1
6 262056708396949504 Katrina 1
7 262056708396949504 Mardi 1
8 262056708396949504 Wild 1
10 262056708396949504 donated 1
11 262056708396949504 generated 1
13 262056708396949504 revenues 1
15 262056708396949504 themed 1
17 262056708396949504 torwhore 1
18 262056708396949504 victims 1
20 262167541718319104 18
21 262167541718319104 CCUFoodMan 1
22 262167541718319104 CCUinvolved 1
23 262167541718319104 Congrats 1
24 262167541718319104 Having 1
25 262167541718319104 K 1
29 262167541718319104 blast 1
30 262167541718319104 blasty 1
31 262167541718319104 carebrighton 1
32 262167541718319104 hurricane 1
34 262167541718319104 started 1
37 262197573421502464 21
My expected outcome:
DOCUMENT_ID WORD COUNT WORD_ID
0 262056708396949504 4 1
1 262056708396949504 DVD 1 2
2 262056708396949504 Girls 1 3
3 262056708396949504 Gone 1
4 262056708396949504 Gras 1
.........
20 262167541718319104 18 1
21 262167541718319104 CCUFoodMan 1 2
22 262167541718319104 CCUinvolved 1 3
I have kept the rows with empty WORD cells, but they can be ignored.
Answer
df['WORD_ID'] = df.groupby(['DOCUMENT_ID']).cumcount()+1
Explanation
Let's build a DataFrame.
import pandas as pd
df = pd.DataFrame({'DOCUMENT_ID' : [262056708396949504, 262056708396949504, 262056708396949504, 262056708396949504, 262167541718319104, 262167541718319104, 262167541718319104], 'WORD' : ['DVD', 'Girls', 'Gras', 'Gone', 'DVD', 'Girls', "Gone"]})
df
DOCUMENT_ID WORD
0 262056708396949504 DVD
1 262056708396949504 Girls
2 262056708396949504 Gras
3 262056708396949504 Gone
4 262167541718319104 DVD
5 262167541718319104 Girls
6 262167541718319104 Gone
Given that your words are nested within each unique DOCUMENT_ID, we need a group by operation. cumcount() numbers the rows within each group starting at 0, which is why we add 1.
df['WORD_ID'] = df.groupby(['DOCUMENT_ID']).cumcount()+1
Output:
DOCUMENT_ID WORD WORD_ID
0 262056708396949504 DVD 1
1 262056708396949504 Girls 2
2 262056708396949504 Gras 3
3 262056708396949504 Gone 4
4 262167541718319104 DVD 1
5 262167541718319104 Girls 2
6 262167541718319104 Gone 3
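If you'd rather drop the rows with an empty WORD before numbering (the question says they can be ignored), a small optional step, assuming the blanks are NaN:
df = df.dropna(subset=['WORD'])
df['WORD_ID'] = df.groupby('DOCUMENT_ID').cumcount() + 1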

split a dataframe column by regular expression on characters separated by a "."

In R, I have the following dataframe:
Name Category
1 Beans 1.12.5
2 Pears 5.7.9
3 Eggs 10.6.5
What I would like to have is the following:
Name Cat1 Cat2 Cat3
1 Beans 1 12 5
2 Pears 5 7 9
3 Eggs 10 6 5
Ideally some expression built inside plyr would be nice...
I will investigate on my side, but since searching for this might take me a lot of time, I was wondering if some of you have hints on how to perform this...
I've written a function concat.split (a "family" of functions, actually) as part of my splitstackshape package for dealing with these types of problems:
# install.packages("splitstackshape")
library(splitstackshape)
concat.split(mydf, "Category", ".", drop=TRUE)
# Name Category_1 Category_2 Category_3
# 1 Beans 1 12 5
# 2 Pears 5 7 9
# 3 Eggs 10 6 5
It also works nicely on "unbalanced" data.
dat <- data.frame(Name = c("Beans", "Pears", "Eggs"),
Category = c("1.12.5", "5.7.9.8", "10.6.5.7.7"))
concat.split(dat, "Category", ".", drop = TRUE)
# Name Category_1 Category_2 Category_3 Category_4 Category_5
# 1 Beans 1 12 5 NA NA
# 2 Pears 5 7 9 8 NA
# 3 Eggs 10 6 5 7 7
Because "long" or "molten" data are often required in these types of situations, the concat.split.multiple function also takes a direction = "long" argument:
concat.split.multiple(dat, "Category", ".", direction = "long")
# Name time Category
# 1 Beans 1 1
# 2 Pears 1 5
# 3 Eggs 1 10
# 4 Beans 2 12
# 5 Pears 2 7
# 6 Eggs 2 6
# 7 Beans 3 5
# 8 Pears 3 9
# 9 Eggs 3 5
# 10 Beans 4 NA
# 11 Pears 4 8
# 12 Eggs 4 7
# 13 Beans 5 NA
# 14 Pears 5 NA
# 15 Eggs 5 7
The qdap package has the colsplit2df function for just this sort of situation:
#recreate your data first:
dat <- data.frame(Name = c("Beans", "Pears", "Eggs"), Category = c("1.12.5",
"5.7.9", "10.6.5"),stringsAsFactors=FALSE)
library(qdap)
colsplit2df(dat, 2, paste0("cat", 1:3))
## > colsplit2df(dat, 2, paste0("cat", 1:3))
## Name cat1 cat2 cat3
## 1 Beans 1 12 5
## 2 Pears 5 7 9
## 3 Eggs 10 6 5
If you have a consistent number of categories, then this will work:
#recreate your data first:
dat <- data.frame(Name = c("Beans", "Pears", "Eggs"), Category = c("1.12.5",
"5.7.9", "10.6.5"),stringsAsFactors=FALSE)
spl <- strsplit(dat$Category, "\\.")   # split each string on literal dots
len <- sapply(spl, length)             # number of pieces per row
dat[paste0("cat", 1:max(len))] <- t(sapply(spl, as.numeric))   # numeric matrix -> new columns
Result:
dat
Name Category cat1 cat2 cat3
1 Beans 1.12.5 1 12 5
2 Pears 5.7.9 5 7 9
3 Eggs 10.6.5 10 6 5
If you have differing numbers of separated values, then this should account for it:
#example unbalanced data
dat <- data.frame(Name = c("Beans", "Pears", "Eggs"), Category = c("1.12.5",
"5.7.9", "10.6.5"),stringsAsFactors=FALSE)
dat$Category[2] <- "5.7"
spl <- strsplit(dat$Category, "\\.")
len <- sapply(spl, length)
# pad the shorter splits with NA so every row has max(len) pieces
spl <- Map(function(x, y) c(x, rep(NA, max(len) - y)), spl, len)
dat[paste0("cat", 1:max(len))] <- t(sapply(spl, as.numeric))
Result:
Name Category cat1 cat2 cat3
1 Beans 1.12.5 1 12 5
2 Pears 5.7 5 7 NA
3 Eggs 10.6.5 10 6 5
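Since most of this thread is pandas-based, here is the equivalent split in pandas as a rough sketch (str.split with expand=True; the cat1/cat2/cat3 names are just for illustration):
import pandas as pd

dat = pd.DataFrame({'Name': ['Beans', 'Pears', 'Eggs'],
                    'Category': ['1.12.5', '5.7', '10.6.5']})

# one column per dot-separated piece; rows with fewer pieces get NaN
cats = dat['Category'].str.split('.', expand=True).astype(float)
cats.columns = ['cat' + str(i + 1) for i in range(cats.shape[1])]
print(dat.join(cats))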