I have data like the data below. I would like to only return the columns from the dataframe that contain at least one non-zero value. So in the example below it would be column ALF. Returning non-zero rows doesn’t seem that tricky but selecting the column and records is giving me a little trouble.
print df
Data:
Type ADR ALE ALF AME
Seg0 0.0 0.0 0.0 0.0
Seg1 0.0 0.0 0.5 0.0
When I try something like the link below:
Pandas: How to select columns with non-zero value in a sparse table
m1 = (df['Type'] == 'Seg0')
m2 = (df[m1] != 0).all()
print (df.loc[m1,m2])
I get a key error for 'Type'
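In my opinion you get the KeyError because the first column, Type, is the index rather than a regular column.
As a solution, compare with 0 and use DataFrame.any to build a mask that is True for each column with at least one non-zero value, then filter the mask's index by its True values: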
m2 = (df != 0).any()
a = m2.index[m2]
print (a)
Index(['ALF'], dtype='object')
Or, if you need a list:
a = m2.index[m2].tolist()
print (a)
['ALF']
A similar solution is to filter the column names:
a = df.columns[m2]
Detail:
print (m2)
ADR False
ALE False
ALF True
AME False
dtype: bool
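If you want the columns themselves (and optionally only the rows with a non-zero value) rather than just their names, the same mask can be passed to loc. This is a minimal sketch that rebuilds the sample data with Type as the index, which is an assumption based on the KeyError; the variable names are only illustrative.

```python
import pandas as pd

# Sample data from the question, with 'Type' as the index (assumed).
df = pd.DataFrame(
    {"ADR": [0.0, 0.0], "ALE": [0.0, 0.0], "ALF": [0.0, 0.5], "AME": [0.0, 0.0]},
    index=pd.Index(["Seg0", "Seg1"], name="Type"),
)

col_mask = (df != 0).any()        # True for columns with at least one non-zero value
row_mask = (df != 0).any(axis=1)  # True for rows with at least one non-zero value

nonzero_cols = df.loc[:, col_mask]                 # all rows, only the non-zero columns
nonzero_rows_and_cols = df.loc[row_mask, col_mask]  # restrict the rows as well

print(nonzero_cols)
print(nonzero_rows_and_cols)
```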
I'm pulling data from Impala using impyla and converting it to a dataframe using as_pandas. I'm using Pandas 0.18.0 and Python 2.7.9.
I'm trying to calculate the sum of every column in a dataframe and select the columns whose sum is greater than a threshold.
self.data = self.data.loc[:,self.data.sum(axis=0) > 15]
But when I run this I get the error below:
pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
Then I tried the following:
print 'length : ',len(self.data.sum(axis = 0)),' all columns : ',len(self.data.columns)
Then I get two different lengths, i.e.
length : 78 all columns : 83
And I get the warning below:
C:\Python27\lib\decimal.py:1150: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
To achieve my goal I then tried another way:
for column in self.data.columns:
    sum = self.data[column].sum()
    if( sum < 15 ):
        self.data = self.data.drop(column,1)
Now I get other errors:
TypeError: unsupported operand type(s) for +: 'Decimal' and 'float'
C:\Python27\lib\decimal.py:1150: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
Then I printed the data type of each column:
print 'dtypes : ', self.data.dtypes
Every column in the result is one of int64, object or float64.
Then I thought of converting the object columns like this:
self.data.convert_objects(convert_numeric=True)
I still get the same errors. Please help me solve this.
Note: None of the columns contain strings (i.e. characters), missing values or empty cells. I checked this using self.data.to_csv.
I'm new to pandas and Python, so please don't mind if this is a silly question. I just want to learn.
The simple example below reproduces the error and shows where it comes from.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random([3,3]))
df.iloc[0,0] = np.nan
print df
print df.sum(axis=0) > 1.5
print df.loc[:, df.sum(axis=0) > 1.5]
df.iloc[0,0] = 'string'
print df
print df.sum(axis=0) > 1.5
print df.loc[:, df.sum(axis=0) > 1.5]
0 1 2
0 NaN 0.336250 0.801349
1 0.930947 0.803907 0.139484
2 0.826946 0.229269 0.367627
0 True
1 False
2 False
dtype: bool
0
0 NaN
1 0.930947
2 0.826946
0 1 2
0 string 0.336250 0.801349
1 0.930947 0.803907 0.139484
2 0.826946 0.229269 0.367627
1 False
2 False
dtype: bool
Traceback (most recent call last):
...
pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
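In short, your data needs additional preprocessing. As the output above shows, once a column holds a non-numeric value, sum() leaves that column out of the result, so the boolean Series is shorter than the list of columns and cannot be aligned with it. To find the offending columns, select the object-dtype ones:
df.select_dtypes(include=['object'])
If they hold numbers stored as strings (or other convertible objects), convert them with df.astype(); otherwise remove them.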
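The TypeError in the question mentions Decimal, so the object columns probably hold decimal.Decimal values. Here is a minimal sketch of the preprocessing under that assumption: the frame, its column names and the threshold are made up for illustration, and astype(float) is one way to do the conversion, not necessarily the only one.

```python
from decimal import Decimal

import pandas as pd

# Hypothetical frame: column 'a' holds Decimal objects, so its dtype is object.
df = pd.DataFrame({
    "a": [Decimal("10.5"), Decimal("7.25")],
    "b": [1.0, 2.0],
    "c": [20.0, 1.0],
})

# Find the object-dtype columns and convert each of them to float.
obj_cols = df.select_dtypes(include=["object"]).columns
df = df.astype({col: float for col in obj_cols})

# Every column is numeric now, so the mask has one entry per column.
selected = df.loc[:, df.sum(axis=0) > 15]
print(selected)
```

After the conversion, len(df.sum(axis=0)) and len(df.columns) agree, which is the condition the boolean indexing needs.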
My table:
A B C D E F G H I
0 0.292090 0.806958 0.255845 0.855154 0.590744 0.937458 0.192190 0.548974 0.703214
1 0.094978 NaN NaN NaN 0.350109 0.635469 0.025525 0.108062 0.510891
2 0.918005 0.568802 0.041519 NaN NaN 0.882552 0.086663 0.908168 0.221058
3 0.882920 0.230281 0.172843 0.948232 0.560853 NaN NaN 0.664388 0.393678
4 0.086579 0.819807 0.712273 0.769890 0.448730 0.853134 0.508932 0.630004 0.579961
Output:
A B&C D&E F&G H&I
0.292090 Present Present Present Present
0.094978 Not There Not There Present Present
0.918005 Present Not There Present Present
0.882920 Present Present Not There Present
0.086579 Present Present Present Present
If both B and C are missing, show "Not There", otherwise "Present".
If either D or E is missing, show "Not There", otherwise "Present".
If either F or G is not equal to 0, show "Present", otherwise "Not There".
If the sum of H and I is greater than 2, show "Not There", otherwise "Present".
I want to write these rules as if statements or lambdas, whichever is faster in pandas, and generate a new dataframe like the output above. But I can't work out how to write the following statements in pandas.
if (B & C):
    df.at[0, 'B&C'] = 'Present'
else:
    df.at[0, 'B&C'] = 'Not there'
if (D | E):
    df.at[0, 'D&E'] = 'Present'
else:
    df.at[0, 'D&E'] = 'Not there'
So is there any way in pandas to build this new dataframe?
You can use isnull to determine which entries are NaN:
In [3]: df
Out[3]:
A B C D E
0 -0.600684 -0.112947 -0.081186 -0.012543 1.951430
1 -1.198891 NaN NaN NaN 1.196819
2 -0.342050 0.971968 -1.097107 NaN NaN
3 -0.908169 0.095141 -1.029277 1.533454 0.171399
In [4]: df.B.isnull()
Out[4]:
0 False
1 True
2 False
3 False
Name: B, dtype: bool
Use the & and | operators to combine two boolean Series, and use the where function from numpy to select between two values based on a Series of booleans. It returns a numpy array, but you can assign it to a column of a DataFrame:
In [5]: df['B&C'] = np.where(df.B.isnull() & df.C.isnull(), 'Not There', 'Present')
In [6]: df['B&C']
Out[6]:
0 Present
1 Not There
2 Present
3 Present
Name: B&C, dtype: object
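The same pattern extends to the other rules in the question. The sketch below is one possible reading of them, on a small made-up frame: the helper label is hypothetical, and NaN is treated as 0 for the F/G rule, which is an assumption chosen because the expected output shows "Not There" when both are missing.

```python
import numpy as np
import pandas as pd

# Small made-up frame with the same column layout as the question.
df = pd.DataFrame({
    "A": [0.29, 0.09, 0.92],
    "B": [0.81, np.nan, 0.57],
    "C": [0.26, np.nan, 0.04],
    "D": [0.86, np.nan, np.nan],
    "E": [0.59, 0.35, np.nan],
    "F": [0.94, 0.64, np.nan],
    "G": [0.19, 0.03, np.nan],
    "H": [0.55, 0.11, 1.50],
    "I": [0.70, 0.51, 0.90],
})


def label(present):
    """Hypothetical helper: map a boolean Series to the two labels."""
    return np.where(present, "Present", "Not There")


out = pd.DataFrame({"A": df["A"]})
# B&C: "Not There" only when both are missing.
out["B&C"] = label(~(df["B"].isnull() & df["C"].isnull()))
# D&E: "Not There" when either one is missing.
out["D&E"] = label(~(df["D"].isnull() | df["E"].isnull()))
# F&G: "Present" when either is non-zero (NaN treated as 0, an assumption).
out["F&G"] = label((df["F"].fillna(0) != 0) | (df["G"].fillna(0) != 0))
# H&I: "Not There" when the sum exceeds 2.
out["H&I"] = label(~((df["H"] + df["I"]) > 2))
print(out)
```

Each rule is evaluated on whole columns at once, so no per-row if statements are needed.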
Here you need to take two columns and check whether the corresponding entries are NaN. The pandas indexing documentation covers all the ways of indexing into and selecting from a data frame:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
I have explained this using iloc but you can do it in many other ways. I have not run the below code, but I hope the logic will be clear.
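def tmp(col1, col2):
    # Work on a copy of the two columns; 'status' is an illustrative name for the result column.
    df = data[[col1, col2]].copy()
    df['status'] = ''
    for i in range(df.shape[0]):
        # NaN never compares equal to anything, so test with pd.isnull instead of == np.nan.
        if pd.isnull(df.iloc[i, 0]) or pd.isnull(df.iloc[i, 1]):
            df.iloc[i, 2] = "Not Present"
        else:
            df.iloc[i, 2] = "Present"
    return df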
My Table:
A Country Code1 Code2
626349 US 640AD1237 407223
702747 NaN IO1062123 407255
824316 US NaN NaN
712947 US 00220221 870262123
278147 Canada 721AC31234 109123
278144 Canada NaN 7214234321
278142 Canada 72142QW134 109123AS12
Here in the above table I need to check country and code.
I want a 5th column with correct or wrong, pseudocode:
If 'Country' == 'US' and (length(Code1) OR length(Code2) == 9):
    Add values to 5th column as correct.
else:
    Add values to 5th column as incorrect.
If 'Country' == 'Canada' and (length(Code1) OR length(Code2) == 10):
    Add values to 5th column as correct.
else:
    Add values to 5th column as incorrect.
If there are no values in either the Country or the Code columns, then mark the row as insufficient information.
I can't work out how to do this in pandas. Please help. Thanks.
I first tried to find the lengths of Code1 and Code2 for each row and store them in a different df, but after that I couldn't work out how to compare the two sets of data the way I need to.
Len1 = df.Code1.map(len)
Len2 = df.Code2.map(len)
LengthCode = pd.DataFrame({'Len_Code1': Len1,'Len_Code2': Len2})
Please tell me a better way to do this in a single dataframe, if possible.
I tried this
df[(df.Country == 'US') & ((df.Code1.str.len() == 9)|(df.Code2.str.len() == 9))|(df.Country == 'Canada') & ((df.Code1.str.len() == 10)|(df.Code2.str.len() == 10))]
But it is getting long and I will not be able to write for many countries.
This will give you a 'is_correct' boolean column:
code_lengths = {'US':9, 'Canada':10}
df['correct_code_length'] = df.Country.replace(code_lengths)
df['is_correct'] = (df.Code1.apply(lambda x: len(str(x))) == df.correct_code_length) | (df.Code2.apply(lambda x: len(str(x))) == df.correct_code_length)
You will need to populate the code_lengths dictionary with more countries as necessary.
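If you also want the three labels from the question instead of a boolean, something along these lines might work. It is only a sketch: the status column name is made up, the codes are assumed to be stored as strings, and "insufficient information" is read as "Country is missing, or both codes are missing", which is one interpretation of the question.

```python
import numpy as np
import pandas as pd

# A few rows in the shape of the question's table (codes stored as strings).
df = pd.DataFrame({
    "Country": ["US", np.nan, "US", "Canada", "Canada", "Canada"],
    "Code1": ["640AD1237", "IO1062123", np.nan, "721AC31234", np.nan, "72142QW13"],
    "Code2": ["407223", "407255", np.nan, "109123", "7214234321", "109123"],
})

code_lengths = {"US": 9, "Canada": 10}

# Expected length per row; NaN where the country is missing or unknown.
expected = df["Country"].map(code_lengths)

# .str.len() returns NaN for missing codes, so they never match.
matches = (df["Code1"].str.len() == expected) | (df["Code2"].str.len() == expected)

# Assumed meaning of "insufficient information".
insufficient = df["Country"].isnull() | (df["Code1"].isnull() & df["Code2"].isnull())

# np.select picks the label of the first condition that is True for each row.
df["status"] = np.select(
    [insufficient, matches],
    ["insufficient information", "correct"],
    default="incorrect",
)
print(df)
```

Using .str.len() instead of len(str(x)) avoids counting the three characters of 'nan' for missing codes.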
I am trying to search through a Pandas Dataframe to find where it has a missing entry or a NaN entry.
Here is a dataframe that I am working with:
cl_id a c d e A1 A2 A3
0 1 -0.419279 0.843832 -0.530827 text76 1.537177 -0.271042
1 2 0.581566 2.257544 0.440485 dafN_6 0.144228 2.362259
2 3 -1.259333 1.074986 1.834653 system 1.100353
3 4 -1.279785 0.272977 0.197011 Fifty -0.031721 1.434273
4 5 0.578348 0.595515 0.553483 channel 0.640708 0.649132
5 6 -1.549588 -0.198588 0.373476 audio -0.508501
6 7 0.172863 1.874987 1.405923 Twenty NaN NaN
7 8 -0.149630 -0.502117 0.315323 file_max NaN NaN
NOTE: The blank entries are empty strings - this is because there was no alphanumeric content in the file that the dataframe came from.
If I have this dataframe, how can I find a list with the indexes where the NaN or blank entry occurs?
np.where(pd.isnull(df)) returns the row and column indices where the value is NaN:
In [152]: import numpy as np
In [153]: import pandas as pd
In [154]: np.where(pd.isnull(df))
Out[154]: (array([2, 5, 6, 6, 7, 7]), array([7, 7, 6, 7, 6, 7]))
In [155]: df.iloc[2,7]
Out[155]: nan
In [160]: [df.iloc[i,j] for i,j in zip(*np.where(pd.isnull(df)))]
Out[160]: [nan, nan, nan, nan, nan, nan]
Finding values which are empty strings could be done with applymap:
In [182]: np.where(df.applymap(lambda x: x == ''))
Out[182]: (array([5]), array([7]))
Note that using applymap requires calling a Python function once for each cell of the DataFrame. That could be slow for a large DataFrame, so it would be better if you could arrange for all the blank cells to contain NaN instead so you could use pd.isnull.
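If converting the blanks is not an option, a vectorized comparison may be a reasonable alternative to applymap. This is a sketch on a small made-up frame, not the asker's data; it combines isnull with eq('') and reads off the positions with np.where.

```python
import numpy as np
import pandas as pd

# Made-up frame mixing numbers, empty strings and NaN.
df = pd.DataFrame({
    "e": ["text76", "system", "audio", "Twenty"],
    "A1": [1.5, 1.1, "", np.nan],
    "A2": [-0.27, "", -0.5, np.nan],
})

# True where a cell is NaN or an empty string; no per-cell Python function needed.
missing = df.isnull() | df.eq("")

# Row and column positions of the missing cells.
rows, cols = np.where(missing.to_numpy())
positions = list(zip(rows.tolist(), cols.tolist()))
print(positions)
```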
Try this:
df[df['column_name'] == ''].index
and for NaNs you can try:
pd.isna(df['column_name'])
Check whether the columns contain NaN using .isnull(), check for empty strings using .eq(''), and combine the two with the bitwise OR operator |.
Sum along axis 0 to find the columns with missing data, then sum along axis 1 to find the index labels of the rows with missing data.
missing_cols, missing_rows = (
    (df2.isnull().sum(x) | df2.eq('').sum(x))
    .loc[lambda x: x.gt(0)].index
    for x in (0, 1)
)
>>> df2.loc[missing_rows, missing_cols]
A2 A3
2 1.10035
5 -0.508501
6 NaN NaN
7 NaN NaN
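Lately I've resorted to
df[ (df[column_name].notnull()) & (df[column_name]!=u'') ].index
That gives the index of the rows that are neither null nor an empty string; negate the whole condition with ~ to get the rows where the cell is null or empty.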
In my opinion, don't waste time: just replace the blanks with NaN, then search for all NaN entries. (This is valid because empty values are missing values anyway.)
import numpy as np # to use np.nan
import pandas as pd # to use replace
df = df.replace(r'^\s*$', np.nan, regex=True) # turn empty or whitespace-only strings into NaN
nan_values = df[df.isna().any(axis=1)] # to get all rows with NaN
nan_values # view df with NaN rows only
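To get the list of index labels the question asks for, the replace-then-isna idea can be sketched as below. The frame is made up, and the pattern r"^\s*$" (empty or whitespace-only strings) is my assumption about what counts as blank.

```python
import numpy as np
import pandas as pd

# Made-up frame: one empty string, one whitespace-only string, one real NaN.
df = pd.DataFrame({
    "cl_id": [1, 2, 3, 4],
    "e": ["text76", "system", "audio", "Twenty"],
    "A2": ["1.537177", "", "   ", np.nan],
})

# Turn empty or whitespace-only strings into NaN.
cleaned = df.replace(r"^\s*$", np.nan, regex=True)

# Index labels of the rows that contain at least one missing value.
rows_with_missing = cleaned.index[cleaned.isna().any(axis=1)].tolist()
print(rows_with_missing)
```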
Partial solution: for a single string column
tmp = df['A1'].fillna(''); isEmpty = tmp==''
gives boolean Series of True where there are empty strings or NaN values.
You can also test the string length:
text_empty = df['column name'].str.len() == 0
df.loc[text_empty].index
The result is the index of the rows whose string is empty.
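Another option, covering cases where a cell might contain several spaces, is the Python string method isspace().
df[df.col_name.apply(lambda x:x.isspace() == False)] # will only return cases without empty spaces
To also exclude NaN values (NaN is a float and has no isspace method, so the lambda has to skip non-strings):
df[df.col_name.apply(lambda x: isinstance(x, str) and not x.isspace())]
Note that ''.isspace() is False, so a truly empty string is not caught by this test.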
To obtain all the rows that contain an empty cell in a particular column:
DF_new_row=DF_raw.loc[DF_raw['columnname']=='']
This gives the subset of DF_raw that satisfies the condition.
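You can use string methods with a regex to count the cells that contain no word characters (empty or whitespace-only strings); na=True keeps NaN cells out of the mask:
df[~df.column_name.str.contains(r'\w', na=True)].column_name.count()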
I have time series data in two separate DataFrame columns which refer to the same parameter but are of differing lengths.
On dates where data only exist in one column, I'd like this value to be placed in my new column. On dates where there are entries for both columns, I'd like to have the mean value. (I'd like to join using the index, which is a datetime value)
Could somebody suggest a way that I could combine my two columns? Thanks.
Edit2: I've written some code which should merge the data from both of my columns, but I get a KeyError when I try to set the new values using the index generated from the rows where my first column has values but my second doesn't. Here's the code:
def merge_func(df):
    null_index = df[(df['DOC_mg/L'].isnull() == False) & (df['TOC_mg/L'].isnull() == True)].index
    df['TOC_mg/L'][null_index] = df[null_index]['DOC_mg/L']
    notnull_index = df[(df['DOC_mg/L'].isnull() == True) & (df['TOC_mg/L'].isnull() == False)].index
    df['DOC_mg/L'][notnull_index] = df[notnull_index]['TOC_mg/L']
    df.insert(len(df.columns), 'Mean_mg/L', 0.0)
    df['Mean_mg/L'] = (df['DOC_mg/L'] + df['TOC_mg/L']) / 2
    return df
merge_func(sve)
And here's the error:
KeyError: "['2004-01-14T01:00:00.000000000+0100' '2004-03-04T01:00:00.000000000+0100'\n '2004-03-30T02:00:00.000000000+0200' '2004-04-12T02:00:00.000000000+0200'\n '2004-04-15T02:00:00.000000000+0200' '2004-04-17T02:00:00.000000000+0200'\n '2004-04-19T02:00:00.000000000+0200' '2004-04-20T02:00:00.000000000+0200'\n '2004-04-22T02:00:00.000000000+0200' '2004-04-26T02:00:00.000000000+0200'\n '2004-04-28T02:00:00.000000000+0200' '2004-04-30T02:00:00.000000000+0200'\n '2004-05-05T02:00:00.000000000+0200' '2004-05-07T02:00:00.000000000+0200'\n '2004-05-10T02:00:00.000000000+0200' '2004-05-13T02:00:00.000000000+0200'\n '2004-05-17T02:00:00.000000000+0200' '2004-05-20T02:00:00.000000000+0200'\n '2004-05-24T02:00:00.000000000+0200' '2004-05-28T02:00:00.000000000+0200'\n '2004-06-04T02:00:00.000000000+0200' '2004-06-10T02:00:00.000000000+0200'\n '2004-08-27T02:00:00.000000000+0200' '2004-10-06T02:00:00.000000000+0200'\n '2004-11-02T01:00:00.000000000+0100' '2004-12-08T01:00:00.000000000+0100'\n '2011-02-21T01:00:00.000000000+0100' '2011-03-21T01:00:00.000000000+0100'\n '2011-04-04T02:00:00.000000000+0200' '2011-04-11T02:00:00.000000000+0200'\n '2011-04-14T02:00:00.000000000+0200' '2011-04-18T02:00:00.000000000+0200'\n '2011-04-21T02:00:00.000000000+0200' '2011-04-25T02:00:00.000000000+0200'\n '2011-05-02T02:00:00.000000000+0200' '2011-05-09T02:00:00.000000000+0200'\n '2011-05-23T02:00:00.000000000+0200' '2011-06-07T02:00:00.000000000+0200'\n '2011-06-21T02:00:00.000000000+0200' '2011-07-04T02:00:00.000000000+0200'\n '2011-07-18T02:00:00.000000000+0200' '2011-08-31T02:00:00.000000000+0200'\n '2011-09-13T02:00:00.000000000+0200' '2011-09-28T02:00:00.000000000+0200'\n '2011-10-10T02:00:00.000000000+0200' '2011-10-25T02:00:00.000000000+0200'\n '2011-11-08T01:00:00.000000000+0100' '2011-11-28T01:00:00.000000000+0100'\n '2011-12-20T01:00:00.000000000+0100' '2012-01-19T01:00:00.000000000+0100'\n '2012-02-14T01:00:00.000000000+0100' '2012-03-13T01:00:00.000000000+0100'\n 
'2012-03-27T02:00:00.000000000+0200' '2012-04-02T02:00:00.000000000+0200'\n '2012-04-10T02:00:00.000000000+0200' '2012-04-17T02:00:00.000000000+0200'\n '2012-04-26T02:00:00.000000000+0200' '2012-04-30T02:00:00.000000000+0200'\n '2012-05-03T02:00:00.000000000+0200' '2012-05-07T02:00:00.000000000+0200'\n '2012-05-10T02:00:00.000000000+0200' '2012-05-14T02:00:00.000000000+0200'\n '2012-05-22T02:00:00.000000000+0200' '2012-06-05T02:00:00.000000000+0200'\n '2012-06-19T02:00:00.000000000+0200' '2012-07-03T02:00:00.000000000+0200'\n '2012-07-17T02:00:00.000000000+0200' '2012-07-31T02:00:00.000000000+0200'\n '2012-08-14T02:00:00.000000000+0200' '2012-08-28T02:00:00.000000000+0200'\n '2012-09-11T02:00:00.000000000+0200' '2012-09-25T02:00:00.000000000+0200'\n '2012-10-10T02:00:00.000000000+0200' '2012-10-24T02:00:00.000000000+0200'\n '2012-11-21T01:00:00.000000000+0100' '2012-12-18T01:00:00.000000000+0100'] not in index"
You are close. You don't need to iterate over the rows when using isnull(); the expression
df[(df['DOC_mg/L'].isnull() == False) & (df['TOC_mg/L'].isnull() == True)].index
returns just the index of the rows where DOC_mg/L is not null and TOC_mg/L is null. The KeyError comes from df[null_index], which looks those index values up among the column names.
Now you can do something like this to set the values for TOC_mg/L:
null_index = df[(df['DOC_mg/L'].isnull() == False) & \
(df['TOC_mg/L'].isnull() == True)].index
df['TOC_mg/L'][null_index] = df['DOC_mg/L'][null_index] # EDIT To switch the index position.
This will use the index of the rows where TOC_mg/L is null and DOC_mg/L is not null, and set the values for TOC_mg/L to those found in DOC_mg/L in the same rows.
Note: This chained form is not the recommended way to set values using an index (pandas documents df.loc[index, 'col_name'] = ... for that, and chained assignment may not write through to the original in newer versions), but it is how I've been doing it for some time. Just make sure that when setting values, the left side of the assignment is df['col_name'][index]. If col_name and index are switched, you set the values on a copy that is never written back to the original.
Now, to set the mean, create a new column, which we'll call Mean_mg/L, with an initial value of 0.0. Then set this new column to the mean of both columns:
# Insert a new col at the end of the dataframe columns name 'Mean_mg/L'
# with default value 0.0
df.insert(len(df.columns), 'Mean_mg/L', 0.0)
# Set this columns value to the average of DOC_mg/L and TOC_mg/L
df['Mean_mg/L'] = (df['DOC_mg/L'] + df['TOC_mg/L']) / 2
In the rows where we filled a null value with the value from the other column, the average is simply that value.
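As an alternative sketch, not part of the answer above: the row-wise mean already skips NaN by default, so the new column can be computed directly without filling either column first. The dates and values below are made up, and the .loc line shows the documented way to do the fill if you still want it.

```python
import numpy as np
import pandas as pd

# Made-up data with a datetime index and gaps in both columns.
idx = pd.to_datetime(["2004-01-14", "2004-03-04", "2004-03-30"])
df = pd.DataFrame(
    {"DOC_mg/L": [2.0, np.nan, 4.0], "TOC_mg/L": [np.nan, 3.0, 6.0]},
    index=idx,
)

# mean(axis=1) ignores NaN, so rows with one value keep that value
# and rows with two values get their average.
df["Mean_mg/L"] = df[["DOC_mg/L", "TOC_mg/L"]].mean(axis=1)

# Optional: fill TOC from DOC with .loc instead of chained indexing.
filled = df.copy()
only_doc = filled["DOC_mg/L"].notnull() & filled["TOC_mg/L"].isnull()
filled.loc[only_doc, "TOC_mg/L"] = filled.loc[only_doc, "DOC_mg/L"]
print(df)
```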