How to pass variable as a column name pandas - python-2.7

I'm using Python 2.7
I try do create new column based on variable form a list
tickers=['BAC','JPM','WFC','C','MS']
returns=pd.DataFrame
for tick in tickers:
returns[tick]=bank_stocks[tick][]1'Close'].pct_change()**
But I get this error
TypeError Traceback (most recent call last)
in ()
2 returns=pd.DataFrame
3 for tick in tickers:
----> 4 returns[tick]=bank_stocks[tick]['Close'].pct_change()
5
TypeError: 'type' object does not support item assignment

IIUC you need:
np.random.seed(100)
mux = pd.MultiIndex.from_product([['BAC','JPM','WFC','C','MS', 'Other'], ['Close', 'Open']])
df = pd.DataFrame(np.random.rand(10,12), columns=mux)
print (df)
BAC JPM WFC C \
Close Open Close Open Close Open Close
0 0.543405 0.278369 0.424518 0.844776 0.004719 0.121569 0.670749
1 0.185328 0.108377 0.219697 0.978624 0.811683 0.171941 0.816225
2 0.175410 0.372832 0.005689 0.252426 0.795663 0.015255 0.598843
3 0.980921 0.059942 0.890546 0.576901 0.742480 0.630184 0.581842
4 0.285896 0.852395 0.975006 0.884853 0.359508 0.598859 0.354796
5 0.376252 0.592805 0.629942 0.142600 0.933841 0.946380 0.602297
6 0.173608 0.966610 0.957013 0.597974 0.731301 0.340385 0.092056
7 0.395036 0.335596 0.805451 0.754349 0.313066 0.634037 0.540405
8 0.254258 0.641101 0.200124 0.657625 0.778289 0.779598 0.610328
9 0.976500 0.166694 0.023178 0.160745 0.923497 0.953550 0.210978
MS Other
Open Close Open Close Open
0 0.825853 0.136707 0.575093 0.891322 0.209202
1 0.274074 0.431704 0.940030 0.817649 0.336112
2 0.603805 0.105148 0.381943 0.036476 0.890412
3 0.020439 0.210027 0.544685 0.769115 0.250695
4 0.340190 0.178081 0.237694 0.044862 0.505431
5 0.387766 0.363188 0.204345 0.276765 0.246536
6 0.463498 0.508699 0.088460 0.528035 0.992158
7 0.296794 0.110788 0.312640 0.456979 0.658940
8 0.309000 0.697735 0.859618 0.625324 0.982408
9 0.360525 0.549375 0.271831 0.460602 0.696162
First select columns by slicers, then call pct_change and last remove second level of MultiIndex in column by droplevel:
tickers=['BAC','JPM','WFC','C','MS']
idx = pd.IndexSlice
df = df.sort_index(axis=1)
returns = df.loc[:, idx[tickers,'Close']].pct_change()
returns.columns = returns.columns.droplevel(-1)
print (returns)
BAC C JPM MS WFC
0 NaN NaN NaN NaN NaN
1 -0.658950 0.216885 -0.482477 2.157889 171.008452
2 -0.053515 -0.266325 -0.974108 -0.756436 -0.019738
3 4.592146 -0.028390 155.551779 0.997444 -0.066841
4 -0.708544 -0.390220 0.094841 -0.152103 -0.515801
5 0.316048 0.697588 -0.353910 1.039454 1.597555
6 -0.538586 -0.847159 0.519208 0.400649 -0.216890
7 1.275448 4.870415 -0.158370 -0.782213 -0.571905
8 -0.356369 0.129391 -0.751538 5.297934 1.486019
9 2.840595 -0.654320 -0.884181 -0.212630 0.186573

Your code is correct except the line in In[73] where you must call dataframe(i.e., pd.DataFrame()) you have created an object by not using '()' after DataFrame. Thats why the error is type object doesnot support assignment.

Related

Drop rows based on one column values

I've a dataframe which looks like this:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
As it's clearly evident in the above table that some of the values in the column mad and median are very big(outliers). So i want to remove the rows which have these very big values.
For example in row3 the value of mad is 30.408377 which very big so i want to drop this row. I know that i can use one line
to remove these values from the columns but it doesn't removes the complete row
df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]
But i want to remove the complete row.
How can i do that?
Predicates like what you've given will remove entire rows. But none of your data is outside of 3 standard deviations. If you tone it down to just one standard deviation, rows are removed with your example data.
Here's an example using your data:
import pandas as pd
import numpy as np
columns = ["wave", "mean", "median", "mad"]
data = [
[4050.32, -0.016182, -0.011940, 0.008885],
[4208.98, 0.023707, 0.007189, 0.032585],
[4508.28, 3.662293, 0.001414, 7.193139],
[4531.62, -15.459313, -0.001523, 30.408377],
[4551.65, 0.009028, 0.007581, 0.005247],
[4554.46, 0.001861, 0.010692, 0.027969],
[6828.60, -10.604568, -0.000590, 21.084799],
[6839.84, -0.003466, -0.001870, 0.010169],
[6842.04, -32.751551, -0.002514, 65.118329],
[6842.69, 18.293519, -0.002158, 36.385884],
[6843.66, 0.006386, -0.002468, 0.034995],
[6855.72, 0.020803, 0.000886, 0.040529],
]
df = pd.DataFrame(np.array(data), columns=columns)
print("ORIGINAL: ")
print(df)
print()
res = df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())]
print("REMOVED: ")
print(res)
this outputs:
ORIGINAL:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
REMOVED:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
Observe that rows indexed 8 and 9 are now gone.
Be sure you're reassigning the output of df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())] as shown above. The operation is not done in place.
Doing df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())] will not change the dataframe.
But assign it back to df, so that:
df = df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]

for loop in pandas to search dataframe and update list stuck

I want to count areas of interest in my dataframe column 'which_AOI' (ranging from 0 -9). I would like to have a new column with the results added to a dataframe depending on a variable 'marker' (ranging from 0 - x) which tells me when one 'picture' is done and the next begins (one marker can go on for a variable length of rows). This is my code so far but it seems to be stuck and runs on without giving output. I tried reconstructing it from the beginning once but as soon as i get to 'if df.marker == num' it doesn't stop. What am I missing?
(example dataframe below)
## AOI count of spec. type function (in progress):
import numpy as np
import pandas as pd
path_i = "/Users/Desktop/Pilot/results/gazedata_filename.csv"
df = pd.read_csv(path_i, sep =",")
#create a new dataframe for AOIs:
d = {'marker': []}
df_aoi = pd.DataFrame(data=d)
### Creating an Aoi list
item = df.which_AOI
aoi = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] #list for search
aoi_array = [0, 0 , 0, 0, 0, 0, 0, 0, 0, 0] #list for filling
num = 0
for i in range (0, len (df.marker)): #loop through the dataframe
if df.marker == num: ## if marker = num its one picture
for index, item in enumerate(aoi): #look for item (being a number in which_AOI) in aoi list
if (item == aoi[index]):
aoi_array[index] += 1
print (aoi)
print (aoi_array)
se = pd.Series(aoi_array) # make list into a series to attach to dataframe
df_aoi['new_col'] = se.values #add list to dataframe
aoi_array.clear() #clears list before next picture
else:
num +=1
index pos_time pos_x pos_y pup_time pup_diameter marker which_AOI fixation Picname shock
1 16300 168.608779907227 -136.360855102539 16300 2.935715675354 0 7 18 5 save
2 16318 144.97673034668 -157.495513916016 16318 3.08838820457459 0 8 33 5 save
3 16351 152.92560577392598 -156.64172363281298 16351 3.0895299911499 0 7 17 5 save
4 16368 152.132453918457 -157.989685058594 16368 3.111008644104 0 7 18 5 save
5 16386 151.59835815429702 -157.55587768554702 16386 3.09514689445496 0 7 18 5 save
6 16404 150.88092803955098 -152.69479370117202 16404 3.10009074211121 1 7 37 5 save
7 16441 152.76554107666 -142.06188964843798 16441 3.0821495056152304 1 7 33 5 save
Not 100% clear based on your question but it sounds like you want to count the number of rows for each which_AOI value in each marker.
You can accomplish this using groupby
df_aoi = df.groupby(['marker','which_AOI']).size().unstack('which_AOI',fill_value=0)
In:
pos_time pos_x pos_y pup_time pup_diameter marker \
0 16300 168.608780 -136.360855 16300 2.935716 0
1 16318 144.976730 -157.495514 16318 3.088388 0
2 16351 152.925606 -156.641724 16351 3.089530 0
3 16368 152.132454 -157.989685 16368 3.111009 0
4 16386 151.598358 -157.555878 16386 3.095147 0
5 16404 150.880928 -152.694794 16404 3.100091 1
6 16441 152.765541 -142.061890 16441 3.082150 1
which_AOI fixation Picname shock
0 7 18 5 save
1 8 33 5 save
2 7 17 5 save
3 7 18 5 save
4 7 18 5 save
5 7 37 5 save
6 7 33 5 save
Out:
which_AOI 7 8
marker
0 4 1
1 2 0

Concatenate pandas dataframe with result of apply(lambda) where lambda returns another dataframe

A dataframe stores some values in columns, passing those values to a function I get another dataframe. I'd like to concatenate the returned dataframe's columns to the original dataframe.
I tried to do something like
i = pd.concat([i, i[['cid', 'id']].apply(lambda x: xy(*x), axis=1)], axis=1)
but it did not work with error:
ValueError: cannot copy sequence with size 2 to array axis with dimension 1
So I did like this:
def xy(x, y):
return pd.DataFrame({'x': [x*2], 'y': [y*2]})
df1 = pd.DataFrame({'cid': [4, 4], 'id': [6, 10]})
print('df1:\n{}'.format(df1))
df2 = pd.DataFrame()
for _, row in df1.iterrows():
nr = xy(row['cid'], row['id'])
nr['cid'] = row['cid']
nr['id'] = row['id']
df2 = df2.append(nr, ignore_index=True)
print('df2:\n{}'.format(df2))
Output:
df1:
cid id
0 4 6
1 4 10
df2:
x y cid id
0 8 12 4 6
1 8 20 4 10
The code does not look nice and should work slowly.
Is there pandas/pythonic way to do it properly and fast working?
python 2.7
Option 0
Most directly with pd.DataFrame.assign. Not very generalizable.
df1.assign(x=df1.cid * 2, y=df1.id * 2)
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 1
Use pd.DataFrame.join to add new columns
This shows how to adjoin new columns after having used apply with a lambda
df1.join(df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 2
Use pd.DataFrame.assign to add new columns
This shows how to adjoin new columns after having used apply with a lambda
df1.assign(**df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 3
However, if your function really is just multiplying by 2
df1.join(df1.mul(2).rename(columns=dict(cid='x', id='y')))
Or
df1.assign(**df1.mul(2).rename(columns=dict(cid='x', id='y')))

How to split the data in the data frame in python?

I used the below code:
import pandas as pd
pandas_bigram = pd.DataFrame(bigram_data)
print pandas_bigram
I got output as below
0
0 ashoka -**0
1 - wikipedia,**1
2 wikipedia, the**2
3 the free**2
4 free encyclopedia**2
5 encyclopedia ashoka**1
6 ashoka from**2
7 from wikipedia,**1
8 wikipedia, the**2
9 the free**2
10 free encyclopedia**2
My question is How to split this data frame. So, that i will get data in two rows. the data here is separated by "**".
import pandas as pd
df= [" ashoka -**0","- wikipedia,**1","wikipedia, the**2"]
df=pd.DataFrame(df)
print(df)
0
0 ashoka -**0
1 - wikipedia,**1
2 wikipedia, the**2
Use split function: The method split() returns a list of all the words in the string, using str as the separator (splits on all whitespace if left unspecified), optionally limiting the number of splits to num.
df1 = pd.DataFrame(df[0].str.split('*',1).tolist(),
columns = ['0','1'])
print(df1)
0 1
0 ashoka - *0
1 - wikipedia, *1
2 wikipedia, the *2

Removing some particular rows in pandas

I want to delete some rows in pandas dataframe.
ID Value
2012XY000 1
2012XY001 1
.
.
.
2015AB000 4
2015PQ001 5
.
.
.
2016DF00G 2
I want to delete rows whose ID does not start with 2015.
How should I do that?
Use startswith with boolean indexing:
print (df.ID.str.startswith('2015'))
0 False
1 False
2 True
3 True
4 False
Name: ID, dtype: bool
print (df[df.ID.str.startswith('2015')])
ID Value
2 2015AB000 4
3 2015PQ001 5
EDIT by comment:
print (df)
ID Value
0 2012XY000 1
1 2012XY001 1
2 2015AB000 4
3 2015PQ001 5
4 2015XQ001 5
5 2016DF00G 2
print ((df.ID.str.startswith('2015')) & (df.ID.str[4] != 'X'))
0 False
1 False
2 True
3 True
4 False
5 False
Name: ID, dtype: bool
print (df[(df.ID.str.startswith('2015')) & (df.ID.str[4] != 'X')])
ID Value
2 2015AB000 4
3 2015PQ001 5
Use str.match with regex string r'^2015':
df[df.ID.str.match(r'^2015')]
To exclude those that have an X afterwards.
df[df.ID.str.match(r'^2015[^X]')]
The regex r'^2015[^X]' translates into
^2015 - must start with 2015
[^X] - character after 2015 must not be X
consider the df
then
df[df.ID.str.match(r'^2015[^X]')]