I'm using Python 2.7.
I'm trying to create a new column for each ticker in a list:
tickers=['BAC','JPM','WFC','C','MS']
returns=pd.DataFrame
for tick in tickers:
    returns[tick]=bank_stocks[tick]['Close'].pct_change()
But I get this error:
TypeError                                 Traceback (most recent call last)
<ipython-input-73-...> in <module>()
      2 returns=pd.DataFrame
      3 for tick in tickers:
----> 4     returns[tick]=bank_stocks[tick]['Close'].pct_change()
      5
TypeError: 'type' object does not support item assignment
IIUC you need:
np.random.seed(100)
mux = pd.MultiIndex.from_product([['BAC','JPM','WFC','C','MS', 'Other'], ['Close', 'Open']])
df = pd.DataFrame(np.random.rand(10,12), columns=mux)
print (df)
BAC JPM WFC C \
Close Open Close Open Close Open Close
0 0.543405 0.278369 0.424518 0.844776 0.004719 0.121569 0.670749
1 0.185328 0.108377 0.219697 0.978624 0.811683 0.171941 0.816225
2 0.175410 0.372832 0.005689 0.252426 0.795663 0.015255 0.598843
3 0.980921 0.059942 0.890546 0.576901 0.742480 0.630184 0.581842
4 0.285896 0.852395 0.975006 0.884853 0.359508 0.598859 0.354796
5 0.376252 0.592805 0.629942 0.142600 0.933841 0.946380 0.602297
6 0.173608 0.966610 0.957013 0.597974 0.731301 0.340385 0.092056
7 0.395036 0.335596 0.805451 0.754349 0.313066 0.634037 0.540405
8 0.254258 0.641101 0.200124 0.657625 0.778289 0.779598 0.610328
9 0.976500 0.166694 0.023178 0.160745 0.923497 0.953550 0.210978
MS Other
Open Close Open Close Open
0 0.825853 0.136707 0.575093 0.891322 0.209202
1 0.274074 0.431704 0.940030 0.817649 0.336112
2 0.603805 0.105148 0.381943 0.036476 0.890412
3 0.020439 0.210027 0.544685 0.769115 0.250695
4 0.340190 0.178081 0.237694 0.044862 0.505431
5 0.387766 0.363188 0.204345 0.276765 0.246536
6 0.463498 0.508699 0.088460 0.528035 0.992158
7 0.296794 0.110788 0.312640 0.456979 0.658940
8 0.309000 0.697735 0.859618 0.625324 0.982408
9 0.360525 0.549375 0.271831 0.460602 0.696162
First select the columns with an IndexSlice, then call pct_change, and finally remove the second level of the column MultiIndex with droplevel:
tickers=['BAC','JPM','WFC','C','MS']
idx = pd.IndexSlice
df = df.sort_index(axis=1)
returns = df.loc[:, idx[tickers,'Close']].pct_change()
returns.columns = returns.columns.droplevel(-1)
print (returns)
BAC C JPM MS WFC
0 NaN NaN NaN NaN NaN
1 -0.658950 0.216885 -0.482477 2.157889 171.008452
2 -0.053515 -0.266325 -0.974108 -0.756436 -0.019738
3 4.592146 -0.028390 155.551779 0.997444 -0.066841
4 -0.708544 -0.390220 0.094841 -0.152103 -0.515801
5 0.316048 0.697588 -0.353910 1.039454 1.597555
6 -0.538586 -0.847159 0.519208 0.400649 -0.216890
7 1.275448 4.870415 -0.158370 -0.782213 -0.571905
8 -0.356369 0.129391 -0.751538 5.297934 1.486019
9 2.840595 -0.654320 -0.884181 -0.212630 0.186573
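An alternative sketch with the same values (using the sorted df above): xs selects the 'Close' level directly and drops the second column level, so droplevel isn't needed; the column order then follows the tickers list rather than sorted order:
returns = df.xs('Close', axis=1, level=1)[tickers].pct_change()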
Your code is correct except for the line in In[73] where you must instantiate the DataFrame, i.e. pd.DataFrame(); by omitting the '()' after DataFrame you assigned the class itself, not an instance. That's why the error is 'type' object does not support item assignment.
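For reference, a minimal sketch of the fixed loop (bank_stocks is assumed to be your frame of per-ticker data, as in the question):
tickers = ['BAC','JPM','WFC','C','MS']
returns = pd.DataFrame()  # the () instantiates the class; bare pd.DataFrame is the type object
for tick in tickers:
    # bank_stocks[tick]['Close'] is assumed to exist, per your traceback
    returns[tick] = bank_stocks[tick]['Close'].pct_change()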
I have a dataframe which looks like this:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
As is clearly evident in the table above, some of the values in the columns mad and median are very big (outliers), so I want to remove the rows that contain these very big values.
For example, in row 3 the value of mad is 30.408377, which is very big, so I want to drop this row. I know that I can use the one-liner
df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]
to filter these values out of the column, but it doesn't remove the complete row.
But I want to remove the complete row.
How can I do that?
A boolean predicate like the one you've given does remove entire rows. But none of your data is more than 3 standard deviations from the mean of mad, so nothing gets filtered out. If you tone the threshold down to just one standard deviation, rows are removed from your example data.
Here's an example using your data:
import pandas as pd
import numpy as np
columns = ["wave", "mean", "median", "mad"]
data = [
[4050.32, -0.016182, -0.011940, 0.008885],
[4208.98, 0.023707, 0.007189, 0.032585],
[4508.28, 3.662293, 0.001414, 7.193139],
[4531.62, -15.459313, -0.001523, 30.408377],
[4551.65, 0.009028, 0.007581, 0.005247],
[4554.46, 0.001861, 0.010692, 0.027969],
[6828.60, -10.604568, -0.000590, 21.084799],
[6839.84, -0.003466, -0.001870, 0.010169],
[6842.04, -32.751551, -0.002514, 65.118329],
[6842.69, 18.293519, -0.002158, 36.385884],
[6843.66, 0.006386, -0.002468, 0.034995],
[6855.72, 0.020803, 0.000886, 0.040529],
]
df = pd.DataFrame(np.array(data), columns=columns)
print("ORIGINAL: ")
print(df)
print()
res = df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())]
print("REMOVED: ")
print(res)
This outputs:
ORIGINAL:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
REMOVED:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
Observe that rows indexed 8 and 9 are now gone.
Be sure you're reassigning the output of df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())] as shown above; the operation is not done in place. Also note that df.mad resolves to the DataFrame.mad method rather than the column, so use bracket notation df['mad']. Evaluating the expression on its own will not change the dataframe; assign it back to df:
df = df[np.abs(df['mad']-df['mad'].mean()) <= (3*df['mad'].std())]
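If you do want an in-place variant, a sketch using drop with the inverted mask would be:
# drop the index labels where the deviation exceeds 3 standard deviations
outliers = df.index[np.abs(df['mad'] - df['mad'].mean()) > 3 * df['mad'].std()]
df.drop(outliers, inplace=True)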
I want to count areas of interest in my dataframe column 'which_AOI' (ranging from 0-9). I would like to add a new column with the results to a dataframe, grouped by a variable 'marker' (ranging from 0-x) which tells me when one 'picture' ends and the next begins (one marker can span a variable number of rows). This is my code so far, but it seems to get stuck and runs on without producing output. I tried reconstructing it from the beginning, but as soon as I get to 'if df.marker == num' it doesn't stop. What am I missing?
(example dataframe below)
## AOI count of spec. type function (in progress):
import numpy as np
import pandas as pd
path_i = "/Users/Desktop/Pilot/results/gazedata_filename.csv"
df = pd.read_csv(path_i, sep =",")
#create a new dataframe for AOIs:
d = {'marker': []}
df_aoi = pd.DataFrame(data=d)
### Creating an Aoi list
item = df.which_AOI
aoi = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] #list for search
aoi_array = [0, 0 , 0, 0, 0, 0, 0, 0, 0, 0] #list for filling
num = 0
for i in range(0, len(df.marker)): #loop through the dataframe
    if df.marker == num: ## if marker == num it's one picture
        for index, item in enumerate(aoi): #look for item (being a number in which_AOI) in the aoi list
            if item == aoi[index]:
                aoi_array[index] += 1
        print(aoi)
        print(aoi_array)
        se = pd.Series(aoi_array) # make the list into a Series to attach to the dataframe
        df_aoi['new_col'] = se.values #add the list to the dataframe
        aoi_array.clear() #clears the list before the next picture
    else:
        num += 1
index pos_time pos_x pos_y pup_time pup_diameter marker which_AOI fixation Picname shock
1 16300 168.608779907227 -136.360855102539 16300 2.935715675354 0 7 18 5 save
2 16318 144.97673034668 -157.495513916016 16318 3.08838820457459 0 8 33 5 save
3 16351 152.92560577392598 -156.64172363281298 16351 3.0895299911499 0 7 17 5 save
4 16368 152.132453918457 -157.989685058594 16368 3.111008644104 0 7 18 5 save
5 16386 151.59835815429702 -157.55587768554702 16386 3.09514689445496 0 7 18 5 save
6 16404 150.88092803955098 -152.69479370117202 16404 3.10009074211121 1 7 37 5 save
7 16441 152.76554107666 -142.06188964843798 16441 3.0821495056152304 1 7 33 5 save
Not 100% clear from your question, but it sounds like you want to count the number of rows for each which_AOI value within each marker.
You can accomplish this using groupby:
df_aoi = df.groupby(['marker','which_AOI']).size().unstack('which_AOI',fill_value=0)
In:
pos_time pos_x pos_y pup_time pup_diameter marker \
0 16300 168.608780 -136.360855 16300 2.935716 0
1 16318 144.976730 -157.495514 16318 3.088388 0
2 16351 152.925606 -156.641724 16351 3.089530 0
3 16368 152.132454 -157.989685 16368 3.111009 0
4 16386 151.598358 -157.555878 16386 3.095147 0
5 16404 150.880928 -152.694794 16404 3.100091 1
6 16441 152.765541 -142.061890 16441 3.082150 1
which_AOI fixation Picname shock
0 7 18 5 save
1 8 33 5 save
2 7 17 5 save
3 7 18 5 save
4 7 18 5 save
5 7 37 5 save
6 7 33 5 save
Out:
which_AOI 7 8
marker
0 4 1
1 2 0
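An equivalent one-liner (a sketch) uses pd.crosstab, which cross-tabulates the two columns directly:
df_aoi = pd.crosstab(df['marker'], df['which_AOI'])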
A dataframe stores some values in columns; passing those values to a function, I get another dataframe. I'd like to concatenate the returned dataframe's columns to the original dataframe.
I tried to do something like
i = pd.concat([i, i[['cid', 'id']].apply(lambda x: xy(*x), axis=1)], axis=1)
but it failed with this error:
ValueError: cannot copy sequence with size 2 to array axis with dimension 1
So I did like this:
def xy(x, y):
return pd.DataFrame({'x': [x*2], 'y': [y*2]})
df1 = pd.DataFrame({'cid': [4, 4], 'id': [6, 10]})
print('df1:\n{}'.format(df1))
df2 = pd.DataFrame()
for _, row in df1.iterrows():
nr = xy(row['cid'], row['id'])
nr['cid'] = row['cid']
nr['id'] = row['id']
df2 = df2.append(nr, ignore_index=True)
print('df2:\n{}'.format(df2))
Output:
df1:
cid id
0 4 6
1 4 10
df2:
x y cid id
0 8 12 4 6
1 8 20 4 10
The code does not look nice and is probably slow.
Is there a pandas/pythonic way to do it properly and fast?
python 2.7
Option 0
Most directly with pd.DataFrame.assign. Not very generalizable.
df1.assign(x=df1.cid * 2, y=df1.id * 2)
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 1
Use pd.DataFrame.join to add new columns
This shows how to adjoin new columns after having used apply with a lambda
df1.join(df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 2
Use pd.DataFrame.assign to add new columns
This shows how to adjoin new columns after having used apply with a lambda
df1.assign(**df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 3
However, if your function really is just multiplying by 2
df1.join(df1.mul(2).rename(columns=dict(cid='x', id='y')))
Or
df1.assign(**df1.mul(2).rename(columns=dict(cid='x', id='y')))
I used the below code:
import pandas as pd
pandas_bigram = pd.DataFrame(bigram_data)
print pandas_bigram
I got the output below:
0
0 ashoka -**0
1 - wikipedia,**1
2 wikipedia, the**2
3 the free**2
4 free encyclopedia**2
5 encyclopedia ashoka**1
6 ashoka from**2
7 from wikipedia,**1
8 wikipedia, the**2
9 the free**2
10 free encyclopedia**2
My question is: how do I split this data frame so that I get the data in two columns? The data here is separated by "**".
import pandas as pd
df= [" ashoka -**0","- wikipedia,**1","wikipedia, the**2"]
df=pd.DataFrame(df)
print(df)
0
0 ashoka -**0
1 - wikipedia,**1
2 wikipedia, the**2
Use the split function: the method split() returns a list of all the words in the string, using str as the separator (it splits on all whitespace if left unspecified), optionally limiting the number of splits to num.
df1 = pd.DataFrame(df[0].str.split('*', 1).tolist(), columns=['0', '1'])
print(df1)
0 1
0 ashoka - *0
1 - wikipedia, *1
2 wikipedia, the *2
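A variant without the intermediate list (a sketch; the asterisks are escaped because a multi-character pattern is treated as a regex here, and expand=True returns a DataFrame directly, with no stray '*' left in the second column):
df1 = df[0].str.split(r'\*\*', n=1, expand=True)
df1.columns = ['0', '1']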
I want to delete some rows in pandas dataframe.
ID Value
2012XY000 1
2012XY001 1
.
.
.
2015AB000 4
2015PQ001 5
.
.
.
2016DF00G 2
I want to delete rows whose ID does not start with 2015.
How should I do that?
Use startswith with boolean indexing:
print (df.ID.str.startswith('2015'))
0 False
1 False
2 True
3 True
4 False
Name: ID, dtype: bool
print (df[df.ID.str.startswith('2015')])
ID Value
2 2015AB000 4
3 2015PQ001 5
EDIT by comment:
print (df)
ID Value
0 2012XY000 1
1 2012XY001 1
2 2015AB000 4
3 2015PQ001 5
4 2015XQ001 5
5 2016DF00G 2
print ((df.ID.str.startswith('2015')) & (df.ID.str[4] != 'X'))
0 False
1 False
2 True
3 True
4 False
5 False
Name: ID, dtype: bool
print (df[(df.ID.str.startswith('2015')) & (df.ID.str[4] != 'X')])
ID Value
2 2015AB000 4
3 2015PQ001 5
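One caveat (assuming your real data may be messier than the sample): if ID contains missing values, str.startswith returns NaN for them and the boolean indexing raises an error; passing na=False treats missing IDs as non-matches:
print (df[df.ID.str.startswith('2015', na=False)])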
Use str.match with regex string r'^2015':
df[df.ID.str.match(r'^2015')]
To exclude those that have an 'X' immediately afterwards:
df[df.ID.str.match(r'^2015[^X]')]
The regex r'^2015[^X]' translates into
^2015 - must start with 2015
[^X] - character after 2015 must not be X
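A self-contained run against the sample data from the question (a sketch):
import pandas as pd

df = pd.DataFrame({'ID': ['2012XY000', '2012XY001', '2015AB000',
                          '2015PQ001', '2015XQ001', '2016DF00G'],
                   'Value': [1, 1, 4, 5, 5, 2]})
print (df[df.ID.str.match(r'^2015[^X]')])
          ID  Value
2  2015AB000      4
3  2015PQ001      5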