Convert data frame column to float and perform operation in Pandas - python-2.7

I have a data frame that contains the following that are imported as strings
df3 = pd.DataFrame(data = {
'Column1':['10/1','9/5','7/4','12/3','18/7','14/2']})
I tried to convert to float and do the division, but the following did not work well:
for i, v in enumerate(df3.Column1):
    df3['Column2'] = float(v[:-2]) / float(v[-1])
print df3.Column2
This is the output that I am trying to achieve
df3 = pd.DataFrame(data = {
'Column1':['10/1','9/5','7/4','12/3','18/7','14/2'],
'Column2':['10.0','1.8','1.75','4.0','2.57142857143','7.0']})
df3

The following works: define a function that casts to float and performs the division, then assign the result of apply to your new column:
In [10]:
df3 = pd.DataFrame(data = {
'Column1':['10/1','9/5','7/4','12/3','18/7','14/2']})
def func(x):
    return float(x[:-2]) / float(x[-1])
df3['Column2'] = df3['Column1'].apply(func)
df3
Out[10]:
Column1 Column2
0 10/1 10.000000
1 9/5 1.800000
2 7/4 1.750000
3 12/3 4.000000
4 18/7 2.571429
5 14/2 7.000000
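For larger frames, a vectorized variant is possible (a sketch, assuming your pandas version supports str.split with expand=True): split each string on '/' and divide the two resulting columns.
parts = df3['Column1'].str.split('/', expand=True).astype(float)
df3['Column2'] = parts[0] / parts[1]  # numerator / denominator, row-wise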

If, and ONLY IF, your data does not come from an untrusted source, here's a shortcut:
In [46]: df3
Out[46]:
Column1
0 10/1
1 9/5
2 7/4
3 12/3
4 18/7
5 14/2
In [47]: df3.Column1.map(eval)
Out[47]:
0 10.000000
1 1.800000
2 1.750000
3 4.000000
4 2.571429
5 7.000000
Name: Column1, dtype: float64
But seriously... be careful with eval. (Note also that under Python 2's default integer division, eval('9/5') returns 1; the float output above assumes from __future__ import division.)
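If you want to avoid eval entirely, the standard library's fractions.Fraction parses 'num/den' strings directly, so no arbitrary code is executed. A minimal sketch:
from fractions import Fraction
df3['Column2'] = df3.Column1.map(lambda s: float(Fraction(s)))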

Related

numpy array to pandas pivot table

I'm new to pandas and am trying to create a pivot table from a numpy array.
variable npArray is just that, a numpy array:
>>> npArray
array([(1, 3), (4, 3), (1, 3), ..., (1, 4), (1, 12), (1, 12)],
dtype=[('MATERIAL', '<i4'), ('DIVISION', '<i4')])
I'd like to count occurrences of each material by division, with division as rows and material as columns.
What I have:
#numpy array to pandas data frame
pandaDf = pandas.DataFrame(npArray)
#pivot table - guessing here
pandas.pivot_table(pandaDf, index="DIVISION",
                   columns="MATERIAL",
                   aggfunc=numpy.sum)  # <--- want count, not sum
Results:
Empty DataFrame
Columns: []
Index: []
Sample of pandaDf:
>>> print pandaDf
MATERIAL DIVISION
0 1 3
1 4 3
2 1 3
3 1 3
4 1 3
5 1 3
6 1 3
7 1 3
8 1 3
9 1 3
10 1 3
11 1 3
12 4 3
... ... ...
3845291 1 4
3845292 1 4
3845293 1 4
3845294 1 12
3845295 1 12
[3845296 rows x 2 columns]
Any help would be appreciated.
Something similar has already been asked: https://stackoverflow.com/a/12862196/9754169
Bottom line, just use a counting aggregator, e.g. aggfunc=lambda x: len(x) (note that pivot_table still needs a values column to count, as the answer below shows).
@GerardoFlores is correct. Another solution I found was adding a column for frequency.
#numpy array to pandas data frame
pandaDf = pandas.DataFrame(npArray)
print "adding frequency column"
pandaDf["FREQ"] = 1
#pivot table
pivot = pandas.pivot_table(pandaDf, values="FREQ",
                           index="DIVISION", columns="MATERIAL",
                           aggfunc="count")
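For what it's worth, pandas.crosstab builds the same count table in one call, with no helper column needed (a sketch, assuming pandaDf as above):
#count occurrences of each MATERIAL per DIVISION
pivot = pandas.crosstab(pandaDf["DIVISION"], pandaDf["MATERIAL"])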

Concatenate pandas dataframe with result of apply(lambda) where lambda returns another dataframe

A dataframe stores some values in columns; passing those values to a function, I get another dataframe. I'd like to concatenate the returned dataframe's columns to the original dataframe.
I tried to do something like
i = pd.concat([i, i[['cid', 'id']].apply(lambda x: xy(*x), axis=1)], axis=1)
but it failed with the following error:
ValueError: cannot copy sequence with size 2 to array axis with dimension 1
So I did like this:
def xy(x, y):
    return pd.DataFrame({'x': [x*2], 'y': [y*2]})

df1 = pd.DataFrame({'cid': [4, 4], 'id': [6, 10]})
print('df1:\n{}'.format(df1))
df2 = pd.DataFrame()
for _, row in df1.iterrows():
    nr = xy(row['cid'], row['id'])
    nr['cid'] = row['cid']
    nr['id'] = row['id']
    df2 = df2.append(nr, ignore_index=True)
print('df2:\n{}'.format(df2))
Output:
df1:
cid id
0 4 6
1 4 10
df2:
x y cid id
0 8 12 4 6
1 8 20 4 10
The code does not look nice and is probably slow.
Is there a pandas/pythonic way to do this properly and efficiently?
python 2.7
Option 0
Most directly with pd.DataFrame.assign. Not very generalizable.
df1.assign(x=df1.cid * 2, y=df1.id * 2)
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 1
Use pd.DataFrame.join to add new columns
This shows how to adjoin new columns after having used apply with a lambda
df1.join(df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 2
Use pd.DataFrame.assign to add new columns
This shows how to adjoin new columns after having used apply with a lambda
df1.assign(**df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 3
However, if your function really is just multiplying by 2
df1.join(df1.mul(2).rename(columns=dict(cid='x', id='y')))
Or
df1.assign(**df1.mul(2).rename(columns=dict(cid='x', id='y')))
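If the row-wise function genuinely has to return a DataFrame (as xy does in the question), one general pattern is to collect the per-row frames, concatenate them, and join back onto the original. A sketch, assuming df1 and xy as defined above and a default 0..n-1 index on df1:
parts = [xy(row['cid'], row['id']) for _, row in df1.iterrows()]
extra = pd.concat(parts, ignore_index=True)  # rows line up with df1's 0..n-1 index
result = df1.join(extra)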

scikit-learn: One hot encoding of column with list values [duplicate]

This question already has answers here:
How to one hot encode variant length features?
(2 answers)
Closed 5 years ago.
I am trying to encode a dataframe like below:
A B C
2 'Hello' ['we', 'are', 'good']
1 'All' ['hello', 'world']
As you can see, I can label-encode the string values of the second column, but I cannot figure out how to encode the third column, which holds lists of strings of varying lengths. Even if I one-hot encode it, I get an array that I don't know how to merge with the encoded arrays of the other columns. Please suggest a good technique.
Assuming we have the following DF:
In [31]: df
Out[31]:
A B C
0 2 Hello [we, are, good]
1 1 All [hello, world]
Let's use sklearn.feature_extraction.text.CountVectorizer
In [32]: from sklearn.feature_extraction.text import CountVectorizer
In [33]: vect = CountVectorizer()
In [34]: X = vect.fit_transform(df.C.str.join(' '))
In [35]: df = df.join(pd.DataFrame(X.toarray(), columns=vect.get_feature_names()))
In [36]: df
Out[36]:
A B C are good hello we world
0 2 Hello [we, are, good] 1 1 0 1 0
1 1 All [hello, world] 0 0 1 0 1
Alternatively, you can use sklearn.preprocessing.MultiLabelBinarizer, as @VivekKumar suggested in this comment:
In [56]: from sklearn.preprocessing import MultiLabelBinarizer
In [57]: mlb = MultiLabelBinarizer()
In [58]: X = mlb.fit_transform(df.C)
In [59]: df = df.join(pd.DataFrame(X, columns=mlb.classes_))
In [60]: df
Out[60]:
A B C are good hello we world
0 2 Hello [we, are, good] 1 1 0 1 0
1 1 All [hello, world] 0 0 1 0 1
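If you'd rather stay in pandas (a sketch, assuming df still has just the original three columns): build an indicator Series per list and let apply stack them into dummy columns, filling the gaps with 0.
# 1 for each word present in the row's list; apply stacks the Series into columns
dummies = df.C.apply(lambda words: pd.Series(1, index=words)).fillna(0).astype(int)
df = df.join(dummies)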

Efficiently walking through pandas dataframe index

import pandas as pd
from numpy.random import randn
oldn = pd.DataFrame(randn(10, 4), columns=['A', 'B', 'C', 'D'])
I want to make a new DataFrame with rows 0..9 and a single column "avg", whose value for row N is the average of oldn[N]['A'], oldn[N]['B'], ..., oldn[N]['D'].
I'm not very familiar with pandas, so all my ideas for doing this involve ugly for-loops. What is the efficient way to create and populate the new table?
Call mean on your df and pass axis=1 to calculate the mean row-wise; you can then pass this as the data argument to the DataFrame constructor:
In [128]:
new_df = pd.DataFrame(data = oldn.mean(axis=1), columns=['avg'])
new_df
Out[128]:
avg
0 0.541550
1 0.525518
2 -0.492634
3 0.163784
4 0.012363
5 0.514676
6 -0.468888
7 0.334473
8 0.669139
9 0.736748
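Equivalently (a sketch), Series.to_frame names the column in a single chain:
new_df = oldn.mean(axis=1).to_frame('avg')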
If you want the average of specific columns, use the following; otherwise use the answer provided by @EdChum.
oldn['Avg'] = oldn.apply(lambda v: ((v['A']+v['B']+v['C']+v['D']) / 4.), axis=1)
or
oldn['Avg'] = oldn.apply(lambda v: ((v[['A','B','C','D']]).sum() / 4.), axis=1)
print oldn
A B C D Avg
0 -0.201468 -0.832845 0.100299 0.044853 -0.222290
1 1.510688 -0.955329 0.239836 0.767431 0.390657
2 0.780910 0.335267 0.423232 -0.678401 0.215252
3 0.780518 2.876386 -0.797032 -0.523407 0.584116
4 0.438313 -1.952162 0.909568 -0.465147 -0.267357
5 0.145152 -0.836300 0.352706 -0.794815 -0.283314
6 -0.375432 -1.354249 0.920052 -1.002142 -0.452943
7 0.663149 -0.064227 0.321164 0.779981 0.425017
8 -1.279022 -2.206743 0.534943 0.794929 -0.538973
9 -0.339976 0.636516 -0.530445 -0.832413 -0.266579
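Note that a row-wise apply is comparatively slow; the same column-subset average can be computed vectorized (a sketch):
oldn['Avg'] = oldn[['A', 'B', 'C', 'D']].mean(axis=1)  # row-wise mean of the four columns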

Strange behaviour when adding columns

I'm using Python 2.7.8 | Anaconda 2.1.0, and I'm wondering why the strange behavior below occurs.
I create a pandas dataframe with two columns, then add a third column by summing the first two columns
x = pd.DataFrame(np.random.randn(5, 2), columns = ['a', 'b'])
x['c'] = x[['a', 'b']].sum(axis=1)  # or x['c'] = x['a'] + x['b']
Out[7]:
a b c
0 -1.644246 0.851602 -0.792644
1 -0.129092 0.237140 0.108049
2 0.623160 0.105494 0.728654
3 0.737803 -1.612189 -0.874386
4 0.340671 -0.113334 0.227337
All good so far. Now I want to set the values of column 'c' to zero where they are negative:
x[x['c']<0] = 0
Out[9]:
a b c
0 0.000000 0.000000 0.000000
1 -0.129092 0.237140 0.108049
2 0.623160 0.105494 0.728654
3 0.000000 0.000000 0.000000
4 0.340671 -0.113334 0.227337
This gives the desired result in column 'c', but for some reason columns 'a' and 'b' have been modified as well, which I don't want. Why is this happening, and how can I fix it?
You have to specify you only want the 'c' column:
x.loc[x['c']<0, 'c'] = 0
When you just index with a boolean array/series, this will select full rows, as you can see in this example:
In [46]: x['c']<0
Out[46]:
0 True
1 False
2 False
3 True
4 False
Name: c, dtype: bool
In [47]: x[x['c']<0]
Out[47]:
a b c
0 -0.444493 -0.592318 -1.036811
3 -1.363727 -1.572558 -2.936285
Because you are setting all the columns to zero for those rows; you should set only column 'c':
x['c'][x['c']<0] = 0
(Note that this chained assignment can raise a SettingWithCopyWarning; the .loc form above is the preferred way.)
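Two more options that touch only column 'c' (a sketch using Series.where/Series.mask):
x['c'] = x['c'].where(x['c'] >= 0, 0)  # keep non-negative values, zero the rest
# or equivalently
x['c'] = x['c'].mask(x['c'] < 0, 0)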