How to remove rows from multiindex dataframe with string indices - python-2.7

I have a dataframe with a MultiIndex, from which I want to delete rows according to an index-based pattern. For example, I would like to remove frames 1-4 where the annotator is "Peter Test xx" and the label is "empty" in the dataframe below:
print df
boundingbox x1 boundingbox y1 \
frame annotator label
0 Peter Test xx empty NaN NaN
1 Peter Test xx empty NaN NaN
2 Peter Test xx empty NaN NaN
3 Peter Test xx empty NaN NaN
Petaa yea NaN NaN
4 Peter Test xx empty NaN NaN
5 P empty frame 494 64
Peter Test xx empty NaN NaN
6 P empty frame 494 64
Peter Test xx empty NaN NaN
7 P empty frame 494 64
Peter Test xx empty NaN NaN
8 P empty frame 494 64
Peter Test xx empty NaN NaN
I can select rows by doing something like
indexer = [slice(None)]*len(df.index.names)
indexer[df.index.names.index('frame')] = range(1,4)
indexer[df.index.names.index('annotator')] = ['Peter Test xx']
indexer[df.index.names.index('label')] = ['empty']
return df.loc[tuple(indexer),:]
If I want to delete these rows, ideally I would like to do something like
del df.loc[tuple(indexer),:]
But this does not work (why?). All the solutions I found online relied on integer indices, but with strings as index values I cannot simply slice.
Something I tried as well was:
def filterFunc(x, frames, annotator, label):
    if x[0] in frames\
    and x[1] == annotator\
    and x[2] == label:
        return 1
    else:
        return 0
mask = df.index.map(lambda x: filterFunc(x, frames, annotator, label))
return df[~mask,:]
Which gives me:
TypeError: unhashable type: 'numpy.ndarray'
Any advice?

Trying to solve another problem I figured out that one can use the index of a selected part of a dataframe in drop:
indexer = [slice(None)]*len(df.index.names)
indexer[df.index.names.index('frame')] = range(1,4)
indexer[df.index.names.index('annotator')] = ['Peter Test xx']
indexer[df.index.names.index('label')] = ['empty']
selection = df.loc[tuple(indexer),:]
df.drop(selection.index)
Is that how it is supposed to be done?
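For what it's worth, here is a minimal, self-contained sketch of that select-then-drop idea. The toy frame, the column name x1 and the choice of frames 1 to 4 are assumptions for illustration only. The index is sorted first, on the assumption that label-based selection on a MultiIndex behaves most predictably on a sorted index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mirrors the index layout of the question.
index = pd.MultiIndex.from_tuples(
    [
        (0, "Peter Test xx", "empty"),
        (1, "Peter Test xx", "empty"),
        (2, "Peter Test xx", "empty"),
        (3, "Peter Test xx", "empty"),
        (3, "Petaa", "yea"),
        (4, "Peter Test xx", "empty"),
        (5, "P", "empty frame"),
    ],
    names=["frame", "annotator", "label"],
)
df = pd.DataFrame({"x1": np.arange(7.0)}, index=index).sort_index()

# Select the rows to remove: one list of labels per index level.
selection = df.loc[([1, 2, 3, 4], ["Peter Test xx"], ["empty"]), :]

# drop() returns a new frame without those index entries.
remaining = df.drop(selection.index)
print(remaining)
```

Note that drop does not modify df in place unless you ask it to, so the result has to be assigned.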

You have to use loc or iloc (or ix on old pandas versions; it was removed in pandas 1.0) when doing more complicated slicing:
df[msk] # works
df.iloc[msk, ] # works
df.iloc[msk, :] # works
but
df[msk, ]
TypeError: unhashable type: 'numpy.ndarray'
See different choices for indexing in the docs.
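As a rough sketch of the boolean-mask route, assuming the same three index levels as in the question (the toy data and the column name x1 are made up): build the mask from the index levels with get_level_values and pass it to plain [] without a trailing comma.

```python
import pandas as pd

# Hypothetical toy frame with the question's index level names.
index = pd.MultiIndex.from_tuples(
    [
        (0, "Peter Test xx", "empty"),
        (1, "Peter Test xx", "empty"),
        (3, "Petaa", "yea"),
        (5, "P", "empty frame"),
    ],
    names=["frame", "annotator", "label"],
)
df = pd.DataFrame({"x1": [0.0, 1.0, 2.0, 3.0]}, index=index)

frames = df.index.get_level_values("frame")
annotators = df.index.get_level_values("annotator")
labels = df.index.get_level_values("label")

# True for every row that should be removed.
mask = (
    frames.isin([1, 2, 3, 4])
    & (annotators == "Peter Test xx")
    & (labels == "empty")
)

# Keep everything else; note df[~mask], not df[~mask, :].
kept = df[~mask]
print(kept)
```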

Related

Getting index values from pd mean() and std() functions

I'm trying to get the index values from a pd std().
My final objective is to match the index with another df and insert the corresponding values (standard deviations).
(in): df_std['index'] = df_std.index
(out): Index([u'AAPL US Equity', u'QQQ US Equity', u'BRABCBACNPR4 BZ Equity'...dtype='object')
However, I've been unable to add the indexes to the "right" of df_std because of the types: df_std.index is a series while df_std is a df. When I try to do it, a line is added instead of a column:
(in): df_std['index'] = df_std.index
(out):
BRSTNCLF1R25 Govt 64.0864
BRITUBACNPR1 BZ Equity 2.67762
BRSTNCNTB4O9 Govt 48.2419
BRSTNCLF1R74 Govt 64.901
PBR US Equity 0.770755
BRBBASACNOR3 BZ Equity 2.93335
BRSTNCLF1R82 Govt 65.0979
index Index([u'AAPL US Equity', u'QQQ US Equity', u'...
dtype: object
I've already tried converting df_std.index to a tuple and to a dataframe.
Thanks!
Edit:
I'm trying to match df_std['index'] with df_final['bloomberg_ticker'] and bring the std values to df_final['std']:
(in): print df_final
(out):
serie tipo tp_cnpjfundo valor id bloomberg_ticker \
0 NaN caixa NaN NaN 0 NaN
1 NaN titpublicos NaN NaN 1 BRSTNCLF1R17 Govt
2 NaN titpublicos NaN NaN 2 BRSTNCLF1R17 Govt
3 NaN titpublicos NaN NaN 3 BRSTNCLF1R25 Govt
(the column 'id' will be deleted later)
Use .reset_index() rather than assigning the index, if what you have is a dataframe, i.e.
df_std = df_std.reset_index()
Example :
df = pd.DataFrame([0,1,2,3], index=['a','b','c','d'])
df = df.reset_index()
Output :
index 0
0 a 0
1 b 1
2 c 2
3 d 3
In case what you have is a series, convert it to a dataframe and then call reset_index, i.e. if df_std is the series you have, then
df_std = df_std.to_frame().reset_index()
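A small sketch of the series case, assuming df_std is a Series of standard deviations indexed by ticker (the two values are taken from the question's output; the column names bloomberg_ticker and std are my choice). Naming the index and the values avoids ending up with columns called index and 0.

```python
import pandas as pd

# Stand-in for the Series returned by something like DataFrame.std().
df_std = pd.Series(
    [64.0864, 2.67762],
    index=["BRSTNCLF1R25 Govt", "BRITUBACNPR1 BZ Equity"],
)

# Name the index, then turn it into a regular column next to the values.
tidy = df_std.rename_axis("bloomberg_ticker").reset_index(name="std")
print(tidy)
```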
I think what you are trying to do is map the values of a series to a specific column, so you can use
df = pd.DataFrame({'col':['a','b','c','d','e'],'vales':[5,1,2,4,5]})
s = pd.Series([1,2,3],index=['a','b','c'])
df['new'] = df['col'].map(s)
Output :
col vales new
0 a 5 1.0
1 b 1 2.0
2 c 2 3.0
3 d 4 NaN
4 e 5 NaN
In your case you can use df_final['std'] = df_final['bloomberg_ticker'].map(df_std)
To check whether the values of a dataframe column are present in the index of the series, you can use .isin, i.e.
df['col'].isin(s.index) # Returns the boolean mask
df[df['col'].isin(s.index)] # Returns the rows whose 'col' value is in the index
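Putting it together for the question, as a sketch under assumptions: the price data and the tickers AAA, BBB and CCC are invented, and df_final only keeps the bloomberg_ticker column. Tickers with no match (or missing tickers) simply get NaN.

```python
import pandas as pd

# Hypothetical price history, one column per ticker.
prices = pd.DataFrame({"AAA": [1.0, 2.0, 3.0], "BBB": [2.0, 2.0, 2.0]})

# std() returns a Series indexed by the column (ticker) names.
df_std = prices.std()

df_final = pd.DataFrame({"bloomberg_ticker": ["BBB", None, "AAA", "CCC"]})

# Look each ticker up in the index of df_std.
df_final["std"] = df_final["bloomberg_ticker"].map(df_std)
print(df_final)
```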

how to make list of lists from pandas dataframe, skipping nan values

I have a pandas dataframe that looks roughly like
foo foo2 foo3 foo4
a NY WA AZ NaN
b DC NaN NaN NaN
c MA CA NaN NaN
I'd like to make a nested list of the observations of this dataframe, but omit the NaN values, so I have something like [['NY','WA','AZ'],['DC'],['MA','CA']].
There is a pattern in this dataframe, if that makes a difference, such that if fooX is empty, the subsequent column fooY will also be empty.
I originally had something like this code below. I'm sure there's a nicer way to do this
A = [[i] for i in subset_label['label'].tolist()]
B = [i for i in subset_label['label2'].tolist()]
C = [i for i in subset_label['label3'].tolist()]
D = [i for i in subset_label['label4'].tolist()]
out_list = []
for index, row in subset_label.iterrows():
out_list.append([row.label, row.label2, row.label3, row.label4])
out_list
Option 1
pd.DataFrame.stack drops na by default.
df.stack().groupby(level=0).apply(list).tolist()
[['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
Option 2
Fun alternative, because I think summing lists within pandas objects is fun.
df.applymap(lambda x: [x] if pd.notnull(x) else []).sum(1).tolist()
[['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
Option 3
numpy experiment
nn = df.notnull().values
sliced = df.values.ravel()[nn.ravel()]
splits = nn.sum(1)[:-1].cumsum()
[s.tolist() for s in np.split(sliced, splits)]
[['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
Try this:
In [77]: df.T.apply(lambda x: x.dropna().tolist()).tolist()
Out[77]: [['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
Here's a vectorized version!
original = pd.DataFrame(data={
'foo': ['NY', 'DC', 'MA'],
'foo2': ['WA', np.nan, 'CA'],
'foo3': ['AZ', np.nan, np.nan],
'foo4': [np.nan] * 3,
})
out = original.copy().fillna('NAN')
# Build up mapping such that each non-nan entry is mapped to [entry]
# and nan entries are mapped to []
unique_entries = np.unique(out.values)
mapping = {e: [e] for e in unique_entries}
mapping['NAN'] = []
# Apply mapping
for c in original.columns:
out[c] = out[c].map(mapping)
# Concatenate the lists along axis 1
out.sum(axis=1)
You should get something like
0 [NY, WA, AZ]
1 [DC]
2 [MA, CA]
dtype: object
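If you don't need to stay inside pandas, a plain list comprehension is another way to do it. This is only a sketch that rebuilds the frame from the question; it is not one of the options above, just ordinary Python over the row values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "foo": ["NY", "DC", "MA"],
        "foo2": ["WA", np.nan, "CA"],
        "foo3": ["AZ", np.nan, np.nan],
        "foo4": [np.nan] * 3,
    },
    index=list("abc"),
)

# One inner list per row, keeping only the non-missing cells.
nested = [[v for v in row if pd.notnull(v)] for row in df.values.tolist()]
print(nested)
```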

How to compare two columns and create new dataframe

My table:
A B C D E F G H I
0 0.292090 0.806958 0.255845 0.855154 0.590744 0.937458 0.192190 0.548974 0.703214
1 0.094978 NaN NaN NaN 0.350109 0.635469 0.025525 0.108062 0.510891
2 0.918005 0.568802 0.041519 NaN NaN 0.882552 0.086663 0.908168 0.221058
3 0.882920 0.230281 0.172843 0.948232 0.560853 NaN NaN 0.664388 0.393678
4 0.086579 0.819807 0.712273 0.769890 0.448730 0.853134 0.508932 0.630004 0.579961
Output:
A B&C D&E F&G H&I
0.292090 Present Present Present Present
0.094978 Not There Not There Present Present
0.918005 Present Not There Present Present
0.882920 Present Present Not There Present
0.086579 Present Present Present Present
If both B and C are missing, show "Not There", else "Present".
If either D or E is missing, show "Not There", else "Present".
If either F or G is not equal to 0, show "Present", else "Not There".
If the sum of H and I is greater than 2, show "Not There", else "Present".
I want to write if statements or a lambda, whichever is faster in pandas, and generate a new dataframe like the output above. But I don't understand how to write statements like the following in pandas.
if (B & C):
    df.at[0, 'B&C'] = 'Present'
else:
    df.at[0, 'B&C'] = 'Not there'
if (D | E):
    df.at[0, 'D&E'] = 'Present'
else:
    df.at[0, 'D&E'] = 'Not there'
So is there any way in pandas to build this new dataframe?
You can use isnull to determine which entries are NaN:
In [3]: df
Out[3]:
A B C D E
0 -0.600684 -0.112947 -0.081186 -0.012543 1.951430
1 -1.198891 NaN NaN NaN 1.196819
2 -0.342050 0.971968 -1.097107 NaN NaN
3 -0.908169 0.095141 -1.029277 1.533454 0.171399
In [4]: df.B.isnull()
Out[4]:
0 False
1 True
2 False
3 False
Name: B, dtype: bool
Use the & and | operators to combine two boolean Series, and use the where function from numpy to select between two values based on a Series of booleans. It returns a numpy array, but you can assign it to a column of a DataFrame:
In [5]: df['B&C'] = np.where(df.B.isnull() & df.C.isnull(), 'Not There', 'Present')
In [6]: df['B&C']
Out[6]:
0 Present
1 Not There
2 Present
3 Present
Name: B&C, dtype: object
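Here is a sketch of how the first two rules from the question could be built into a new dataframe with the same np.where pattern. The data is made up, and I only cover the B&C and D&E rules, reading them as "both missing" and "either missing" respectively.

```python
import numpy as np
import pandas as pd

# Hypothetical data with NaN in different combinations.
df = pd.DataFrame(
    {
        "A": [0.1, 0.2, 0.3],
        "B": [1.0, np.nan, 2.0],
        "C": [1.0, np.nan, np.nan],
        "D": [1.0, np.nan, 1.0],
        "E": [1.0, 1.0, np.nan],
    }
)

out = pd.DataFrame({"A": df["A"]})

# "Not There" only when both B and C are missing.
out["B&C"] = np.where(df["B"].isnull() & df["C"].isnull(), "Not There", "Present")

# "Not There" when either D or E is missing.
out["D&E"] = np.where(df["D"].isnull() | df["E"].isnull(), "Not There", "Present")
print(out)
```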
Here you need to take two columns and check whether the corresponding entries are NaN. The pandas indexing documentation covers all the ways of indexing and selecting from a data frame:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
I have explained this using iloc but you can do it in many other ways. I have not run the below code, but I hope the logic will be clear.
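def tmp(col1, col2):
    # Label each row of the global `data` by whether either column is NaN.
    # `x == np.nan` is always False, so test with pd.isnull instead.
    df = data[[col1, col2]]
    labels = []
    for i in range(df.shape[0]):
        if pd.isnull(df.iloc[i, 0]) or pd.isnull(df.iloc[i, 1]):
            labels.append("Not Present")
        else:
            labels.append("Present")
    return pd.Series(labels, index=df.index)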

Find empty or NaN entry in Pandas Dataframe

I am trying to search through a Pandas Dataframe to find where it has a missing entry or a NaN entry.
Here is a dataframe that I am working with:
cl_id a c d e A1 A2 A3
0 1 -0.419279 0.843832 -0.530827 text76 1.537177 -0.271042
1 2 0.581566 2.257544 0.440485 dafN_6 0.144228 2.362259
2 3 -1.259333 1.074986 1.834653 system 1.100353
3 4 -1.279785 0.272977 0.197011 Fifty -0.031721 1.434273
4 5 0.578348 0.595515 0.553483 channel 0.640708 0.649132
5 6 -1.549588 -0.198588 0.373476 audio -0.508501
6 7 0.172863 1.874987 1.405923 Twenty NaN NaN
7 8 -0.149630 -0.502117 0.315323 file_max NaN NaN
NOTE: The blank entries are empty strings - this is because there was no alphanumeric content in the file that the dataframe came from.
If I have this dataframe, how can I find a list with the indexes where the NaN or blank entry occurs?
np.where(pd.isnull(df)) returns the row and column indices where the value is NaN:
In [152]: import numpy as np
In [153]: import pandas as pd
In [154]: np.where(pd.isnull(df))
Out[154]: (array([2, 5, 6, 6, 7, 7]), array([7, 7, 6, 7, 6, 7]))
In [155]: df.iloc[2,7]
Out[155]: nan
In [160]: [df.iloc[i,j] for i,j in zip(*np.where(pd.isnull(df)))]
Out[160]: [nan, nan, nan, nan, nan, nan]
Finding values which are empty strings could be done with applymap:
In [182]: np.where(df.applymap(lambda x: x == ''))
Out[182]: (array([5]), array([7]))
Note that using applymap requires calling a Python function once for each cell of the DataFrame. That could be slow for a large DataFrame, so it would be better if you could arrange for all the blank cells to contain NaN instead so you could use pd.isnull.
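If you would rather not touch the data, one possible way to avoid the per-cell Python call is to combine isnull() with an element-wise eq(''). A sketch on a made-up two-column frame (the column names are borrowed from the question, the values are not):

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing text, numbers, NaN and empty strings.
df = pd.DataFrame(
    {
        "e": ["text76", "", "audio"],
        "A2": [1.5, np.nan, ""],
    }
)

# True wherever a cell is NaN or an empty string.
mask = df.isnull() | df.eq("")

# Positional (row, column) pairs of the missing cells.
rows, cols = np.where(mask.values)
positions = list(zip(rows.tolist(), cols.tolist()))

# Index labels of the rows with at least one missing cell.
bad_rows = df.index[mask.any(axis=1)].tolist()
print(positions, bad_rows)
```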
Try this:
df[df['column_name'] == ''].index
and for NaNs you can try:
pd.isna(df['column_name'])
Check whether the columns contain NaN using .isnull() and check for empty strings using .eq(''), then join the two together using the bitwise OR operator |.
Sum along axis 0 to find the columns with missing data, then sum along axis 1 to find the index locations of the rows with missing data.
missing_cols, missing_rows = (
(df2.isnull().sum(x) | df2.eq('').sum(x))
.loc[lambda x: x.gt(0)].index
for x in (0, 1)
)
>>> df2.loc[missing_rows, missing_cols]
A2 A3
2 1.10035
5 -0.508501
6 NaN NaN
7 NaN NaN
I've resorted to
df[ (df[column_name].notnull()) & (df[column_name]!=u'') ].index
lately. Note that this keeps the rows that are neither null nor empty-string in one go; negate the combined condition to get the missing ones instead.
In my opinion, don't waste time: just replace the empty values with NaN, then search for all entries that are NaN. (This is valid because empty values are missing values anyway.)
import numpy as np # to use np.nan
import pandas as pd # to use replace
df = df.replace(r'^\s*$', np.nan, regex=True) # turn empty or whitespace-only strings into NaN
nan_values = df[df.isna().any(axis=1)] # to get all rows with Na
nan_values # view df with NaN rows only
Partial solution: for a single string column
tmp = df['A1'].fillna(''); isEmpty = tmp==''
gives boolean Series of True where there are empty strings or NaN values.
You can also do something like this:
text_empty = df['column name'].str.len() == 0
df.loc[text_empty].index
The result is the index of the rows where that column holds an empty string.
Another option, covering cases where a cell may contain several spaces, is Python's str.isspace() method.
df[df.col_name.apply(lambda x:x.isspace() == False)] # returns only the rows that are not whitespace-only
Also excluding NaN values (str(x) keeps isspace() from failing on NaN, which is a float):
df[df.col_name.apply(lambda x: not str(x).isspace()) & (~df.col_name.isna())]
To obtain all the rows that contain an empty cell in a particular column:
DF_new_row=DF_raw.loc[DF_raw['columnname']=='']
This will give the subset of DF_raw which satisfies the condition.
You can use string methods with regex to find cells with empty strings:
df[~df.column_name.str.contains('\w')].column_name.count()
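To catch NaN, empty strings and whitespace-only strings in a single column in one pass, something along these lines may work. It is a sketch on a made-up Series, not a tested recipe for every dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical column with the three kinds of "missing" values.
s = pd.Series(["text", "", "   ", np.nan, "audio"])

# NaN is handled by isna(); empty and whitespace-only strings
# become '' after strip().
mask = s.isna() | s.astype(str).str.strip().eq("")

missing_index = s.index[mask].tolist()
print(missing_index)
```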

Convert 2D numpy.ndarray to pandas.DataFrame

I have a pretty big numpy.ndarray. It's basically an array of arrays. I want to convert it to a pandas.DataFrame. What I want to do is shown in the code below:
from pandas import DataFrame
cache1 = DataFrame([{'id1': 'ABC1234'}, {'id1': 'NCMN7838'}])
cache2 = DataFrame([{'id2': 3276827}, {'id2': 98567498}, {'id2': 38472837}])
ndarr = [[4.3, 5.6, 6.7], [3.2, 4.5, 2.1]]
arr = []
for idx, i in enumerate(ndarr):
    id1 = cache1.iloc[idx].id1  # .ix was removed in pandas 1.0; .iloc is positional
    for idx2, val in enumerate(i):
        id2 = cache2.iloc[idx2].id2
        if val > 0:
            arr.append(dict(id1=id1, id2=id2, value=val))
df = DataFrame(arr)
print(df.head())
I am mapping the index of the outer array and the inner array to index of two DataFrames to get certain IDs.
cache1 and cache2 are pandas.DataFrame. Each has ~100k rows.
This takes really really long, like a few hours to complete.
Is there some way I can speed it up?
I suspect your ndarr, if expressed as a 2d np.array, always has the shape (n, m), where n is the length of cache1.id1 and m is the length of cache2.id2. And the last entry in cache2 should be {'id2': 38472837} instead of {'id': 38472837}. If so, the following simple solution may be all that is needed:
In [30]:
df=pd.DataFrame(np.array(ndarr).ravel(),
index=pd.MultiIndex.from_product([cache1.id1.values, cache2.id2.values],names=['idx1', 'idx2']),
columns=['val'])
In [33]:
print df.reset_index()
idx1 idx2 val
0 ABC1234 3276827 4.3
1 ABC1234 98567498 5.6
2 ABC1234 38472837 6.7
3 NCMN7838 3276827 3.2
4 NCMN7838 98567498 4.5
5 NCMN7838 38472837 2.1
[6 rows x 3 columns]
Actually, I also think that keeping the MultiIndex may be a better idea.
Something like this should work:
ndarr = np.asarray(ndarr) # if ndarr is actually an array, skip this
fast_df = pd.DataFrame({"value": ndarr.ravel()})
i1, i2 = [i.ravel() for i in np.indices(ndarr.shape)]
fast_df["id1"] = cache1["id1"].loc[i1].values
fast_df["id2"] = cache2["id2"].loc[i2].values
which gives
>>> fast_df
value id1 id2
0 4.3 ABC1234 3276827
1 5.6 ABC1234 98567498
2 6.7 ABC1234 NaN
3 3.2 NCMN7838 3276827
4 4.5 NCMN7838 98567498
5 2.1 NCMN7838 NaN
And then if you really want to drop the zero values, you can keep only the nonzero ones using fast_df = fast_df[fast_df['value'] != 0].
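Another way to sketch the same idea uses np.repeat and np.tile to line the ids up with the flattened values. This assumes, as suggested above, that ndarr has shape (len(cache1), len(cache2)); the filter uses val > 0 as in the question's loop.

```python
import numpy as np
import pandas as pd

cache1 = pd.DataFrame([{"id1": "ABC1234"}, {"id1": "NCMN7838"}])
cache2 = pd.DataFrame([{"id2": 3276827}, {"id2": 98567498}, {"id2": 38472837}])
ndarr = [[4.3, 5.6, 6.7], [3.2, 4.5, 2.1]]

values = np.asarray(ndarr)
n_rows, n_cols = values.shape

fast_df = pd.DataFrame(
    {
        # Each id1 is repeated once per column of its row.
        "id1": np.repeat(cache1["id1"].to_numpy(), n_cols),
        # The id2 sequence is repeated once per row.
        "id2": np.tile(cache2["id2"].to_numpy(), n_rows),
        # ravel() flattens row by row, matching the two columns above.
        "value": values.ravel(),
    }
)

# Same filter as the original loop.
fast_df = fast_df[fast_df["value"] > 0].reset_index(drop=True)
print(fast_df)
```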