Getting index values from pd mean() and std() functions - python-2.7

I'm trying to get the index values from a pd std().
My final objective is to match the index with another df and insert the corresponding values (standard deviations).
(in): df_std.index
(out): Index([u'AAPL US Equity', u'QQQ US Equity', u'BRABCBACNPR4 BZ Equity'...dtype='object')
However, I've been unable to add the indexes to the "right" of df_std because of the types: df_std is actually a Series (which is what std() returns), not a DataFrame. When I try to do it, a row is added instead of a column:
(in): df_std['index'] = df_std.index
(out):
BRSTNCLF1R25 Govt 64.0864
BRITUBACNPR1 BZ Equity 2.67762
BRSTNCNTB4O9 Govt 48.2419
BRSTNCLF1R74 Govt 64.901
PBR US Equity 0.770755
BRBBASACNOR3 BZ Equity 2.93335
BRSTNCLF1R82 Govt 65.0979
index Index([u'AAPL US Equity', u'QQQ US Equity', u'...
dtype: object
I've already tried converting df_std.index to a tuple and to a dataframe.
Thanks!
Edit:
I'm trying to match df_std['index'] with df_final['bloomberg_ticker'] and bring the std values to df_final['std']:
(in): print df_final
(out):
serie tipo tp_cnpjfundo valor id bloomberg_ticker \
0 NaN caixa NaN NaN 0 NaN
1 NaN titpublicos NaN NaN 1 BRSTNCLF1R17 Govt
2 NaN titpublicos NaN NaN 2 BRSTNCLF1R17 Govt
3 NaN titpublicos NaN NaN 3 BRSTNCLF1R25 Govt
(the column 'id' will be deleted later)

Use .reset_index() rather than assigning if what you have is a dataframe, i.e.
df_std = df_std.reset_index()
Example :
df = pd.DataFrame([0,1,2,3], index=['a','b','c','d'])
df = df.reset_index()
Output :
index 0
0 a 0
1 b 1
2 c 2
3 d 3
In case what you have is a series, convert it to a dataframe and then reset_index, i.e. if df_std is the series you have then
df_std = df_std.to_frame().reset_index()
I think what you are trying to do is map the values of a series to a specific column, so you can use
df = pd.DataFrame({'col':['a','b','c','d','e'],'values':[5,1,2,4,5]})
s = pd.Series([1,2,3],index=['a','b','c'])
df['new'] = df['col'].map(s)
Output :
col values new
0 a 5 1.0
1 b 1 2.0
2 c 2 3.0
3 d 4 NaN
4 e 5 NaN
In your case you can use df_final['bloomberg_ticker'].map(df_std)
To check whether the index of the series is present in a column of the dataframe, you can use .isin, i.e.
df['col'].isin(s.index) # Returns the boolean mask
df[df['col'].isin(s.index)] # Returns the rows of the dataframe whose values match the series index
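Applied to the question (a sketch, assuming df_std is still the Series returned by std() and is indexed by ticker):
df_final['std'] = df_final['bloomberg_ticker'].map(df_std)  # tickers without a match get NaN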

Drop rows based on one column values

I've a dataframe which looks like this:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
As is clearly evident in the table above, some of the values in the mad and median columns are very big (outliers), so I want to remove the rows which have these very big values.
For example, in row 3 the value of mad is 30.408377, which is very big, so I want to drop this row. I know that I can use the following one-liner to remove these values from the column, but it doesn't remove the complete row:
df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]
But I want to remove the complete row.
How can I do that?
Predicates like what you've given will remove entire rows. But none of your data is outside of 3 standard deviations. If you tone it down to just one standard deviation, rows are removed with your example data.
Here's an example using your data:
import pandas as pd
import numpy as np
columns = ["wave", "mean", "median", "mad"]
data = [
[4050.32, -0.016182, -0.011940, 0.008885],
[4208.98, 0.023707, 0.007189, 0.032585],
[4508.28, 3.662293, 0.001414, 7.193139],
[4531.62, -15.459313, -0.001523, 30.408377],
[4551.65, 0.009028, 0.007581, 0.005247],
[4554.46, 0.001861, 0.010692, 0.027969],
[6828.60, -10.604568, -0.000590, 21.084799],
[6839.84, -0.003466, -0.001870, 0.010169],
[6842.04, -32.751551, -0.002514, 65.118329],
[6842.69, 18.293519, -0.002158, 36.385884],
[6843.66, 0.006386, -0.002468, 0.034995],
[6855.72, 0.020803, 0.000886, 0.040529],
]
df = pd.DataFrame(np.array(data), columns=columns)
print("ORIGINAL: ")
print(df)
print()
res = df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())]
print("REMOVED: ")
print(res)
this outputs:
ORIGINAL:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
REMOVED:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
Observe that rows indexed 8 and 9 are now gone.
Be sure you're reassigning the output of df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())] as shown above. The operation is not done in place.
Doing df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())] will not change the dataframe.
Assign it back to df instead, so that:
df = df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]

Keeping some data from duplicates and adding to existing python dataframe

I have an issue with keeping some data from duplicates and adding that information to a new column in the dataframe.
import pandas as pd
data = {'id':[1,1,2,2,3],'key':[1,1,2,2,1],'value0':['a', 'b', 'x', 'y', 'a']}
frame = pd.DataFrame(data, columns = ['id','key','value0'])
print frame
Yields:
id key value0
0 1 1 a
1 1 1 b
2 2 2 x
3 2 2 y
4 3 1 a
Desired Output:
key value0_0 value0_1 value1_0
0 1 a b a
1 2 x y None
The "id" column isn't important to keep but could help with iteration and grouping.
I think this could be adapted to other projects where you don't know how many values exist for a set of keys.
Use set_index with a groupby cumcount, then unstack:
frame.set_index(
['key', frame.groupby('key').cumcount()]
).value0.unstack().add_prefix('value0_').reset_index()
key value0_0 value0_1 value0_2
0 1 a b a
1 2 x y None
I'm questioning your column labeling, but here is an approach using binary column labels:
frame.set_index(
['key', frame.groupby('key').cumcount()]
).value0.unstack().rename(
columns='{:02b}'.format
).add_prefix('value_').reset_index()
key value_00 value_01 value_10
0 1 a b a
1 2 x y None
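For reference, here is a self-contained version of the first approach run end-to-end on the example frame from the question (a sketch; the output column names use the value0_ prefix from the snippet above):
import pandas as pd

data = {'id': [1, 1, 2, 2, 3], 'key': [1, 1, 2, 2, 1], 'value0': ['a', 'b', 'x', 'y', 'a']}
frame = pd.DataFrame(data, columns=['id', 'key', 'value0'])

# Number each repeated key with cumcount, move key/count into the index,
# then unstack the count level into columns.
out = frame.set_index(
    ['key', frame.groupby('key').cumcount()]
).value0.unstack().add_prefix('value0_').reset_index()
print(out)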

How to compare two columns and create new dataframe

My table:
A B C D E F G H I
0 0.292090 0.806958 0.255845 0.855154 0.590744 0.937458 0.192190 0.548974 0.703214
1 0.094978 NaN NaN NaN 0.350109 0.635469 0.025525 0.108062 0.510891
2 0.918005 0.568802 0.041519 NaN NaN 0.882552 0.086663 0.908168 0.221058
3 0.882920 0.230281 0.172843 0.948232 0.560853 NaN NaN 0.664388 0.393678
4 0.086579 0.819807 0.712273 0.769890 0.448730 0.853134 0.508932 0.630004 0.579961
Output:
A B&C D&E F&G H&I
0.292090 Present Present Present Present
0.094978 Not There Not There Present Present
0.918005 Present Not There Present Present
0.882920 Present Present Not There Present
0.086579 Present Present Present Present
If both B and C are not there, then show Not There, else Present.
If any one of D and E is not there, then show Not There, else Present.
If any one of F and G is not equal to 0, show Present, else Not There.
If the sum of H and I is greater than 2, then show Not There, else Present.
I want to write if-functions or lambdas, whatever is fast in pandas, and generate a new dataframe like the output I have given above. But I am not able to understand how I should write the following statements in pandas.
if (B & C):
    df.at[0, 'B&C'] = 'Present'
else:
    df.at[0, 'B&C'] = 'Not There'
if (D | E):
    df.at[0, 'D&E'] = 'Present'
else:
    df.at[0, 'D&E'] = 'Not There'
So is there any way in pandas with which I can build this new dataframe?
You can use isnull to determine which entries are NaN:
In [3]: df
Out[3]:
A B C D E
0 -0.600684 -0.112947 -0.081186 -0.012543 1.951430
1 -1.198891 NaN NaN NaN 1.196819
2 -0.342050 0.971968 -1.097107 NaN NaN
3 -0.908169 0.095141 -1.029277 1.533454 0.171399
In [4]: df.B.isnull()
Out[4]:
0 False
1 True
2 False
3 False
Name: B, dtype: bool
Use the & and | operators to combine two boolean Series, and use the where function from numpy to select between two values based on a Series of booleans. It returns a numpy array, but you can assign it to a column of a DataFrame:
In [5]: df['B&C'] = np.where(df.B.isnull() & df.C.isnull(), 'Not There', 'Present')
In [6]: df['B&C']
Out[6]:
0 Present
1 Not There
2 Present
3 Present
Name: B&C, dtype: object
Here you need to take two columns and check whether the corresponding entries are NaN or not. In pandas there is a one-stop solution for all kinds of indexing and selecting from a dataframe:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
I have explained this using iloc, but you can do it in many other ways. I have not run the code below, but I hope the logic is clear.
def tmp(col1, col2):
    df = data[[col1, col2]].copy()
    df['status'] = 'Present'  # third column to hold the result (name chosen here for illustration)
    for i in range(df.shape[0]):
        # NaN != NaN, so use pd.isnull rather than == np.nan
        if pd.isnull(df.iloc[i, 0]) or pd.isnull(df.iloc[i, 1]):
            df.iloc[i, 2] = 'Not Present'
    return df
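Extending the np.where approach from the first answer to all four output columns in the question (a sketch, assuming the table lives in a dataframe df and one reading of the stated rules; adjust the conditions if the intent differs):
import numpy as np
import pandas as pd

out = pd.DataFrame({'A': df['A']})
# Both B and C missing -> Not There
out['B&C'] = np.where(df['B'].isnull() & df['C'].isnull(), 'Not There', 'Present')
# Either D or E missing -> Not There
out['D&E'] = np.where(df['D'].isnull() | df['E'].isnull(), 'Not There', 'Present')
# Either F or G non-zero -> Present (NaN treated as 0 here)
out['F&G'] = np.where(df[['F', 'G']].fillna(0).ne(0).any(axis=1), 'Present', 'Not There')
# Sum of H and I greater than 2 -> Not There
out['H&I'] = np.where(df['H'] + df['I'] > 2, 'Not There', 'Present')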

How to remove rows from multiindex dataframe with string indices

I have a dataframe with a multiindex, from which I want to delete rows according to some index-based pattern. For example, I would like to remove frames 1-4 where the annotator is "Peter Test xx" and the label is "empty" in the dataframe below
print df
boundingbox x1 boundingbox y1 \
frame annotator label
0 Peter Test xx empty NaN NaN
1 Peter Test xx empty NaN NaN
2 Peter Test xx empty NaN NaN
3 Peter Test xx empty NaN NaN
Petaa yea NaN NaN
4 Peter Test xx empty NaN NaN
5 P empty frame 494 64
Peter Test xx empty NaN NaN
6 P empty frame 494 64
Peter Test xx empty NaN NaN
7 P empty frame 494 64
Peter Test xx empty NaN NaN
8 P empty frame 494 64
Peter Test xx empty NaN NaN
I can select rows by doing something like
indexer = [slice(None)]*len(df.index.names)
indexer[df.index.names.index('frame')] = range(1,4)
indexer[df.index.names.index('annotator')] = ['Peter Test xx']
indexer[df.index.names.index('label')] = ['empty']
return df.loc[tuple(indexer),:]
If I want to delete these rows, ideally I would like to do something like
del df.loc[tuple(indexer),:]
But this does not work (why?). All the solutions I found online were based on integer indices, but since I am working with strings as indices, I cannot simply slice in the same way.
Something I tried as well was:
def filterFunc(x, frames, annotator, label):
if x[0] in frames\
and x[1] == annotator\
and x[2] == label:
return 1
else:
return 0
mask = df.index.map(lambda x: filterFunc(x, frames, annotator, label))
return df[~mask,:]
Which gives me:
TypeError: unhashable type: 'numpy.ndarray'
Any advice?
Trying to solve another problem I figured out that one can use the index of a selected part of a dataframe in drop:
indexer = [slice(None)]*len(df.index.names)
indexer[df.index.names.index('frame')] = range(1,4)
indexer[df.index.names.index('annotator')] = ['Peter Test xx']
indexer[df.index.names.index('label')] = ['empty']
selection = df.loc[tuple(indexer),:]
df.drop(selection.index)
Is that how it is supposed to be done?
You have to use loc, iloc or ix when doing more complicated slicing:
df[msk] # works
df.iloc[msk, ] # works
df.iloc[msk, :] # works
but
df[msk, ]
TypeError: unhashable type: 'numpy.ndarray'
See different choices for indexing in the docs.
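For the mask approach from the question, one way to make it work (a sketch, assuming filterFunc, frames, annotator and label as defined above) is to convert the 0/1 Index returned by index.map into a boolean array before negating and indexing:
import numpy as np

mask = np.asarray(
    df.index.map(lambda x: filterFunc(x, frames, annotator, label)),
    dtype=bool,
)
result = df.loc[~mask]  # keep only the rows that do not match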

Find empty or NaN entry in Pandas Dataframe

I am trying to search through a Pandas Dataframe to find where it has a missing entry or a NaN entry.
Here is a dataframe that I am working with:
cl_id a c d e A1 A2 A3
0 1 -0.419279 0.843832 -0.530827 text76 1.537177 -0.271042
1 2 0.581566 2.257544 0.440485 dafN_6 0.144228 2.362259
2 3 -1.259333 1.074986 1.834653 system 1.100353
3 4 -1.279785 0.272977 0.197011 Fifty -0.031721 1.434273
4 5 0.578348 0.595515 0.553483 channel 0.640708 0.649132
5 6 -1.549588 -0.198588 0.373476 audio -0.508501
6 7 0.172863 1.874987 1.405923 Twenty NaN NaN
7 8 -0.149630 -0.502117 0.315323 file_max NaN NaN
NOTE: The blank entries are empty strings - this is because there was no alphanumeric content in the file that the dataframe came from.
If I have this dataframe, how can I find a list with the indexes where the NaN or blank entry occurs?
np.where(pd.isnull(df)) returns the row and column indices where the value is NaN:
In [152]: import numpy as np
In [153]: import pandas as pd
In [154]: np.where(pd.isnull(df))
Out[154]: (array([2, 5, 6, 6, 7, 7]), array([7, 7, 6, 7, 6, 7]))
In [155]: df.iloc[2,7]
Out[155]: nan
In [160]: [df.iloc[i,j] for i,j in zip(*np.where(pd.isnull(df)))]
Out[160]: [nan, nan, nan, nan, nan, nan]
Finding values which are empty strings could be done with applymap:
In [182]: np.where(df.applymap(lambda x: x == ''))
Out[182]: (array([5]), array([7]))
Note that using applymap requires calling a Python function once for each cell of the DataFrame. That could be slow for a large DataFrame, so it would be better if you could arrange for all the blank cells to contain NaN instead so you could use pd.isnull.
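For instance (a sketch, assuming the blanks really are empty strings as the question states), you could normalize the blanks once and then reuse the pd.isnull check from above:
df = df.replace('', np.nan)           # turn empty strings into real NaN
rows, cols = np.where(pd.isnull(df))  # one check now finds both kinds of gaps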
Try this:
df[df['column_name'] == ''].index
and for NaNs you can try:
pd.isna(df['column_name'])
Check if the columns contain NaN using .isnull() and check for empty strings using .eq(''), then join the two together using the bitwise OR operator |.
Sum along axis 0 to find columns with missing data, then sum along axis 1 to find the index locations of rows with missing data.
missing_cols, missing_rows = (
(df2.isnull().sum(x) | df2.eq('').sum(x))
.loc[lambda x: x.gt(0)].index
for x in (0, 1)
)
>>> df2.loc[missing_rows, missing_cols]
A2 A3
2 1.10035
5 -0.508501
6 NaN NaN
7 NaN NaN
I've resorted to
df[ (df[column_name].notnull()) & (df[column_name]!=u'') ].index
lately. That filters out both null and empty-string cells in one go; the missing entries are the rows that are not in that index.
In my opinion, don't waste time and just replace with NaN! Then search for all entries with NaN. (This is correct because empty values are missing values anyway.)
import numpy as np # to use np.nan
import pandas as pd # to use replace
df = df.replace('', np.nan) # turn empty strings into NaN
nan_values = df[df.isna().any(axis=1)] # to get all rows with Na
nan_values # view df with NaN rows only
Partial solution: for a single string column
tmp = df['A1'].fillna(''); isEmpty = tmp==''
gives a boolean Series that is True where there are empty strings or NaN values.
You can also do something like:
text_empty = df['column name'].str.len() == 0
df.loc[text_empty].index
The result is the index of the rows whose string is empty.
Another option, covering cases where there might be several spaces, is to use Python's isspace() string method.
df[df.col_name.apply(lambda x: not str(x).isspace())] # only keeps rows that are not pure whitespace
Also excluding NaN values:
df[df.col_name.apply(lambda x: not str(x).isspace()) & (~df.col_name.isna())]
To obtain all the rows that contain an empty cell in a particular column:
DF_new_row = DF_raw.loc[DF_raw['columnname'] == '']
This gives the subset of DF_raw that satisfies the checking condition.
You can use string methods with regex to find cells with empty strings:
df[~df.column_name.str.contains(r'\w')].column_name.count()
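Note that .str.contains() returns NaN for NaN cells; passing na=False treats them as non-matches, so the negated mask selects both blank and NaN cells (a sketch, reusing the column_name column from above):
# count() skips NaN, so use len() to count blank and NaN cells together
n_missing = len(df[~df['column_name'].str.contains(r'\w', na=False)])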