Pandas data frame: plotting average for comma separated strings - python-2.7

In my dataset I have a Topics column whose values are strings of topics separated by commas.
df = pd.DataFrame({'Stats': [3377, 1843, 15234], 'Topics': ["A, B, C, D", "A, B", "C, D"]})
What I need is to plot the average Stats per Topic (A, B, C, D).
Could anyone suggest a smart way of doing it?

I'm not sure what your desired output is, but this should hopefully get you going in the right direction. The key point is to split out the topics; then you can do whatever analytics you want.
df2 = pd.DataFrame([(row.Stats, topic.strip())
                    for _, row in df.iterrows()
                    for topic in row.Topics.split(',')],
                   columns=['Stats', 'Topic'])
>>> df2.groupby('Topic').Stats.mean()
Topic
A 2610.0
B 2610.0
C 9305.5
D 9305.5
Name: Stats, dtype: float64
>>> df2.head()
Stats Topic
0 3377 A
1 3377 B
2 3377 C
3 3377 D
4 1843 A
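If the end goal is a chart of those averages, a minimal plotting sketch (assuming matplotlib is installed and df2 is built as above) could be:
import matplotlib.pyplot as plt

# bar chart of the per-topic means computed above
df2.groupby('Topic').Stats.mean().plot(kind='bar')
plt.ylabel('Average Stats')
plt.show()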

Related

How to add multiple Sentences (which are stored in a list) into a pandas dataframe

I would like to create an aspect analysis from user reviews. The reviews contain various aspects, so they need to be separated into sentences. I store the data in a pandas dataframe and split the reviews into sentences with the nltk library.
I put the separate sentences into a list that I want to turn into a dataframe and join to the original dataframe. However, I get an error: instead of one extra column, I get 19 new columns (the individual sentences are not stored in a single cell; I think every single sentence gets its own column). I also tried itertools, but that gave a wrong result as well.
Can someone help me get the right format?
I would like to have a new dataframe which looks like that:
U_REVIEW                                                                                        | SENTENCES
------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------
Im a Sentence. Iam another Sentence in a Row.                                                   | [u'Im a Sentence', u'Iam another Sentence in a Row.']
Here we go, next Sentence. Blub, more blubs.                                                    | [u'Here we go, next Sentence.', u'Blub, more blubs.']
Once again, more Sentence. And some other information. The Restaurant was ok, but not awesome. | [u'Once again, more Sentence.', u'And some other information.', u'The Restaurant was ok, but not awesome.']
This is what my code looks like:
ta = ta[['U_REVIEW']]
Output:
U_REVIEW
Im a Sentence. Iam another Sentence in a Row.
Here we go, next Sentence. Blub, more blubs.
Once again, more Sentence. And some other information. The Restaurant was ok, but not awesome.
# the empty lists
sentences = []
ss = []

for sentence in ta['U_REVIEW']:
    # separate the review into sentences
    sentence = sent_tokenize(sentence)
    sentences.append(sentence)

test = itertools.chain(sentences)

# new dataframe to add the sentences to
df2 = pd.DataFrame(sentences)

# create the column
cols2 = ['REVIEW_SENTENCES']

# bring the two dataframes together
df2 = pd.DataFrame(sentences, columns=cols2)
Output of sentences:
[[u'Im a Sentence', u'Iam another Sentence in a Row.'], [u'Here we go, next Sentence.', u'Blub, more blubs.'], [u'Once again, more Sentence.', u'And some other information.', u'The Restaurant was ok, but not awesome.']]
Output of test:
<itertools.chain object at 0x000000001316DC18>
Output and Information of the new Dataframe df2:
AssertionError: 1 columns passed, passed data had 19 columns
U_REVIEW                                                                                        | 0                          | 1                              | 2 ...
------------------------------------------------------------------------------------------------|----------------------------|--------------------------------|----------------------------------------
Im a Sentence. Iam another Sentence in a Row.                                                   | Im a Sentence              | Iam another Sentence in a Row. |
Here we go, next Sentence. Blub, more blubs.                                                    | Here we go, next Sentence. | Blub, more blubs.              |
Once again, more Sentence. And some other information. The Restaurant was ok, but not awesome. | Once again, more Sentence. | And some other information.    | The Restaurant was ok, but not awesome.
Here is a test dataframe:
import pandas as pd
ta = pd.DataFrame(['Im a Sentence. Iam another Sentence in a Row',
                   'Here we go, next Sentence. Blub, more blubs.',
                   'Once again, more Sentence. And some other information. The Restaurant was ok, but not awsome.'])
ta.columns = ['U_REVIEW']
Try this. I have done it in Python 3.5, but I think it should work for 2.7 as well:
In [44]: import numpy as np

In [45]: df = pd.DataFrame(ta.U_REVIEW.str.split('.', expand=True).replace('', np.nan).fillna(np.nan).values.flatten()).dropna()
In [46]: df
Out[46]:
0
0 Im a Sentence
1 Iam another Sentence in a Row
4 Here we go, next Sentence
5 Blub, more blubs
8 Once again, more Sentence
9 And some other information
10 The Restaurant was ok, but not awsome
Or is this what you want:
ta.U_REVIEW.str.split('.', expand=True)
Out[50]:
                           0                               1                                      2     3
0              Im a Sentence   Iam another Sentence in a Row                                   None  None
1  Here we go, next Sentence                Blub, more blubs                                         None
2  Once again, more Sentence     And some other information  The Restaurant was ok, but not awsome
or
In [52]: ta.U_REVIEW.str.split('.').apply(list)
Out[52]:
0 [Im a Sentence, Iam another Sentence in a Row]
1 [Here we go, next Sentence, Blub, more blubs, ]
2 [Once again, more Sentence, And some other in...
Name: U_REVIEW, dtype: object
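If the originally desired layout is still wanted, with one list of sentences per review next to U_REVIEW, here is a hedged sketch using nltk's sent_tokenize with apply (it assumes nltk and its punkt tokenizer data are installed):
import pandas as pd
from nltk.tokenize import sent_tokenize  # requires the nltk 'punkt' data to be downloaded

ta = pd.DataFrame(['Im a Sentence. Iam another Sentence in a Row',
                   'Here we go, next Sentence. Blub, more blubs.',
                   'Once again, more Sentence. And some other information. The Restaurant was ok, but not awsome.'],
                  columns=['U_REVIEW'])

# apply returns one list of sentences per row, stored in a single new column
ta['SENTENCES'] = ta['U_REVIEW'].apply(sent_tokenize)
print(ta)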

Superhuman Level - Pandas DataFrame Reshaping because of Duplicates

Do you like puzzles that only superhumans can solve? This is the final test to prove such an ability.
A single company might get different levels of funding (seed, a) from multiple banks, possibly at different times.
Let's look at the data then the story to get a better picture.
import pandas as pd
data = {'id': [1, 2, 2, 3, 4], 'company': ['alpha', 'beta', 'beta', 'alpha', 'alpha'],
        'bank': ['z', 'x', 'y', 'z', 'j'],
        'round': ['seed', 'seed', 'seed', 'a', 'a'], 'funding': [100, 200, 200, 300, 50],
        'date': ['2006-12-01', '2004-09-01', '2004-09-01', '2007-05-01', '2007-09-01']}
df = pd.DataFrame(data, columns=['id', 'company', 'round', 'bank', 'funding', 'date'])
df
Yields:
   id company round bank  funding        date
0   1   alpha  seed    z      100  2006-12-01
1   2    beta  seed    x      200  2004-09-01
2   2    beta  seed    y      200  2004-09-01
3   3   alpha     a    z      300  2007-05-01
4   4   alpha     a    j       50  2007-09-01
Desired Output:
company bank_seed funding_seed date_seed bank_a funding_a date_a
0 alpha z 100 2006-12-01 [z,j] 350 2007-09-01
1 beta [x,y] 200 2004-09-01 None None None
As you can see, I am not a superhuman but shall try to explain my thought process.
Let's look at company alpha
Company alpha first got seed money for $100 from bank z in late 2006. A few months later, their investors were very happy with their progress so bank z gave them money ($300 more!). However, Company alpha needed a little more cash but had to go to some random Swiss bank j to stay alive. Bank j reluctantly gave $50 more. Yay! They now have $350 from their updated 'a' round ending in September 2007.
Company beta is pretty new. They got funding totaling $200 from two different banks. But wait... there's nothing in here about their round 'a'. That's okay; we'll put None for now and check back with them later.
The issue is that Company alpha sucks and got money from the Swiss...
This is my non-working code; it had worked on a subset of my data, but it won't work here.
import itertools

unique_company = df.company.unique()
df_indexed = df.set_index(['company', 'round'])
index = pd.MultiIndex.from_tuples(list(itertools.product(unique_company, list(df['round'].unique()))))
reindexed = df_indexed.reindex(index, fill_value=0)
reindexed = reindexed.unstack().applymap(lambda cell: 0 if '1970-01-01' in str(cell) else cell)
working_df = pd.DataFrame(reindexed.iloc[:,
    reindexed.columns.get_level_values(0).isin(['company', 'funding'])].to_records())
If you know how to solve part of this problem, go ahead and put it below. Thank you in advance for taking the time to look at this! :)
Lastly, if you want to see how my code does work, do this first, but then you lose so much valuable info...
df = df.drop_duplicates(subset='id')
df = df.drop_duplicates(subset='round')
Take a pre-processing step to spread out the funding across records with the same 'id' and 'date'
df.funding /= df.groupby(['id', 'date']).funding.transform('count')
Then process
d1 = df.groupby(['company', 'round']).agg(
    dict(bank=lambda x: tuple(x), funding='sum', date='last')
).unstack().sort_index(axis=1, level=1)
d1
            bank  funding        date    bank  funding        date
round          a        a           a    seed     seed        seed
company
alpha     (z, j)    350.0  2007-09-01    (z,)    100.0  2006-12-01
beta        None      NaN         NaT  (x, y)    200.0  2004-09-01
Then flatten the column names:
d1.columns = d1.columns.to_series().map('{0[0]}_{0[1]}'.format)
Groupby, aggregate and unstack will get you close to what you want:
df = df.groupby(['company', 'round']).agg({'bank': lambda x: ','.join(x), 'funding': 'sum', 'date': 'max'}).unstack().reset_index()
df.columns = ['_'.join(col).strip() for col in df.columns.values]
You get
company_ bank_a bank_seed funding_a funding_seed date_a date_seed
0 alpha z,j z 350.0 100.0 2007-09-01 2006-12-01
1 beta None x,y NaN 400.0 None 2004-09-01
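Putting the two answers together, here is a hedged end-to-end sketch of one possible path to the desired output; the aggregation choices (list for bank, sum for funding, max for date) are illustrative assumptions, not the only sensible ones:
import pandas as pd

data = {'id': [1, 2, 2, 3, 4],
        'company': ['alpha', 'beta', 'beta', 'alpha', 'alpha'],
        'bank': ['z', 'x', 'y', 'z', 'j'],
        'round': ['seed', 'seed', 'seed', 'a', 'a'],
        'funding': [100, 200, 200, 300, 50],
        'date': ['2006-12-01', '2004-09-01', '2004-09-01', '2007-05-01', '2007-09-01']}
df = pd.DataFrame(data, columns=['id', 'company', 'round', 'bank', 'funding', 'date'])

# spread duplicated funding rows (same id and date) so the sums are not double-counted
df.funding /= df.groupby(['id', 'date']).funding.transform('count')

# aggregate per company/round, pivot the rounds into columns, then flatten the column names
out = df.groupby(['company', 'round']).agg(
    {'bank': lambda x: list(x), 'funding': 'sum', 'date': 'max'}
).unstack().reset_index()
out.columns = ['_'.join(col).strip('_') for col in out.columns.values]
print(out)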

Reducing the Sparsity of a One-Hot Encoded dataset

I'm trying to run some feature selection algorithms on the UCI Adult data set and I'm running into a problem with univariate feature selection. I'm one-hot encoding all the categorical data to make it numerical, but that gives me a lot of F scores.
How can I avoid this? What should I do to make this code better?
# Encode
adult['Gender'] = adult['sex'].map({'Female': 0, 'Male': 1}).astype(int)
adult = adult.drop(['sex'], axis=1)
adult['Earnings'] = adult['income'].map({'<=50K': 0, '>50K': 1}).astype(int)
adult = adult.drop(['income'], axis=1)

# One-hot encode
adult = pd.get_dummies(adult, columns=["race"])

target = adult["Earnings"]
data = adult.drop(["Earnings"], axis=1)

selector = SelectKBest(f_classif, k=5)
selector.fit_transform(data, target)

for n, s in zip(data.head(0), selector.scores_):
    print "F Score ", s, "for feature ", n
EDIT:
Partial results of current code:
F Score 26.1375747945 for feature race_Amer-Indian-Eskimo
F Score 3.91592196913 for feature race_Asian-Pac-Islander
F Score 237.173133254 for feature race_Black
F Score 31.117798305 for feature race_Other
F Score 218.117092671 for feature race_White
Expected Results:
F Score "f_score" for feature "race"
Because of the one-hot encoding, the feature above is split into many sub-features; I would just like to generalize it back to race (see Expected Results), if that is possible.
One way to reduce the number of features while still encoding your categories in a non-ordinal manner is binary encoding. One-hot encoding grows linearly: a categorical feature with n categories produces n columns. Binary encoding grows as log_2(n). In other words, doubling the number of categories adds a single column with binary encoding, whereas it doubles the number of columns with one-hot encoding.
Binary encoding is easy to use in Python via the category_encoders package. The package is pip installable and works seamlessly with sklearn and pandas. Here is an example:
import pandas as pd
import category_encoders as ce
df = pd.DataFrame({'cat1':['A','N','K','P'], 'cat2':['C','S','T','B']})
enc_bin = ce.binary_encoding.BinaryEncoding(cols=['cat1']) # cols=None, all string columns encoded
df_trans = enc_bin.fit_transform(df)
print(df_trans)
Out[1]:
cat1_0 cat1_1 cat2
0 1 1 C
1 0 1 S
2 1 0 T
3 0 0 B
Here is the code from a previous answer of mine, using the same variables as above but with one-hot encoding. Let's compare how the two outputs look.
import pandas as pd
import category_encoders as ce
df = pd.DataFrame({'cat1':['A','N','K','P'], 'cat2':['C','S','T','B']})
enc_ohe = ce.one_hot.OneHotEncoder(cols=['cat1']) # cols=None, all string columns encoded
df_trans = enc_ohe.fit_transform(df)
print(df_trans)
Out[2]:
cat1_0 cat1_1 cat1_2 cat1_3 cat2
0 0 0 1 0 C
1 0 0 0 1 S
2 1 0 0 0 T
3 0 1 0 0 B
See how binary encoding uses half as many columns to uniquely describe each category within the category cat1.
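As a hedged follow-up tying this back to the question, the binary-encoded frame can be fed to the same SelectKBest step. The sketch below reuses the BinaryEncoding call from above (newer category_encoders releases expose it as ce.BinaryEncoder) and assumes, as in the asker's snippet, that the remaining columns of adult are already numeric:
import category_encoders as ce
from sklearn.feature_selection import SelectKBest, f_classif

# binary-encode 'race' instead of one-hot encoding it, giving fewer columns
enc_bin = ce.binary_encoding.BinaryEncoding(cols=['race'])
data = enc_bin.fit_transform(adult.drop(['Earnings'], axis=1))
target = adult['Earnings']

selector = SelectKBest(f_classif, k=5)
selector.fit_transform(data, target)

for n, s in zip(data.columns, selector.scores_):
    print("F Score {} for feature {}".format(s, n))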

Obtaining a pandas dataframe from a dict with tuples as keys

I am new to python and have been struggling with this problem for quite a while. I have a dict like this:
dict1 = {('a', 'a'): 5, ('a', 'b'): 10, ('a', 'c'): 11, ('b', 'a'): 4, ('b', 'b'): 8, ('b', 'c'): 3, ...}
What I would like to do is convert this into a pandas dataframe that looks like this:
a b c
a 5 10 11
b 4 8 3
c .. .. ..
After that I would like to create a multiple bar plot in the jupyter notebook. I know you can display the data as a pandas series to show the following:
dataset = pd.Series(dict1)
print dataset
a a 5
b 10
c 11
b a 4
b 8
c 3
c a ..
b ..
c ..
However, I was not able to create a multiple bar plot from that.
You're almost there, just need to unstack:
dataset.unstack()
I prefer to use this page for reference, rather than the official documentation.
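For the multiple bar plot itself, a minimal sketch (assuming matplotlib is installed and dataset is the Series built above) could be:
import matplotlib.pyplot as plt

# rows of the unstacked frame become groups on the x axis, columns become the bars within each group
dataset.unstack().plot(kind='bar')
plt.show()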

Appending Tuples to Pandas DataFrame

I am trying to join (vertically) some tuples, or rather insert these tuples into the dataframe, but I have been unable to do so. The problem is that they end up being added horizontally rather than vertically.
import pandas

data_frame = pandas.DataFrame(columns=("A", "B", "C", "D"))
str1 = "Doodles are the logo-incorporating works of art that Google regularly features on its homepage. They began in 1998 with a stick figure by Google co-founders Larry Page and Sergey Brin -- to indicate they were attending the Burning Man festival. Since then the doodles have become works of art -- some of them high-tech and complex -- created by a team of doodlers. Stay tuned here for more of this year's doodles"
aa = str1.split()
bb = zip(aa[0:4])
data_frame.append(bb,ignore_index=True,verify_integrity=False)
Is it possible, or do I have to iterate over each word in the tuple and use insert?
You could do this
In [8]: index = list('ABCD')

In [9]: df = pd.DataFrame(columns=index)

In [11]: df.append(pd.Series(aa[0:4], index=index), ignore_index=True)
Out[11]:
A B C D
0 Doodles are the logo-incorporating
Alternatively, if you have many of these rows to append, just build them up in a list and
construct the DataFrame from that list at the end (see the sketch after this answer), e.g.:
In [13]: pd.DataFrame([aa[0:4], aa[5:8]], columns=list('ABCD'))
Out[13]:
A B C D
0 Doodles are the logo-incorporating
1 of art that None
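Here is a hedged sketch of the "build a list of Series, then construct the DataFrame once at the end" pattern mentioned above (the slices of aa are purely illustrative):
import pandas as pd

aa = ("Doodles are the logo-incorporating works of art that "
      "Google regularly features on its homepage").split()

rows = []
for chunk in (aa[0:4], aa[4:8]):  # illustrative four-word slices
    rows.append(pd.Series(chunk, index=list('ABCD')))

# one DataFrame construction at the end instead of repeated append calls
result = pd.DataFrame(rows)
print(result)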