Pandas Series - print columns and rows - python-2.7

For now I am not so worried about the most performant way to get at my data in a Series; let's say that my Series is as follows:
A 1
B 2
C 3
D 4
If I am using a for loop to iterate this, for example:
for row in seriesObj:
    print row
The code above will print the values down the right-hand side, but let's say I want to get at the left column (the indexes); how might I do that?
All help greatly appreciated; I am very new to pandas and am having some teething problems.
Thanks.

Try Series.iteritems.
import pandas as pd
s = pd.Series([1, 2, 3, 4], index=iter('ABCD'))
for ind, val in s.iteritems():
    print ind, val
Prints:
A 1
B 2
C 3
D 4
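As a side note, on Python 3 and newer pandas versions (where iteritems is deprecated in favour of items), roughly the same loop would look like this:
# items() yields (index, value) pairs, just like iteritems()
for ind, val in s.items():
    print(ind, val)

# alternatively, iterate over the index and look values up by label
for ind in s.index:
    print(ind, s[ind])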

Related

Use regular expression to extract elements from a pandas data frame

From the following data frame:
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
My ultimate goal is to extract the letters a, b or c (as strings) into a pandas Series. For that I am using the .str.findall() method, as shown below:
# import the module
import re
# define the patterns
pat = 'a|b|c'
# extract the patterns from the elements in the specified column
df['col1'].str.findall(pat)
The problem is that the output, i.e. the letters a, b or c, ends up in each row as a single-element list, as shown below:
Out[301]:
0 [a]
1 [b]
2 [c]
3 [a]
While I would like to have the letters a, b or c as strings, as shown below:
0 a
1 b
2 c
3 a
I know that if I combine re.search() with .group() I can get a string, but if I do:
df['col1'].str.search(pat).group()
I will get the following error message:
AttributeError: 'StringMethods' object has no attribute 'search'
Using .str.split() won't do the job because, in my original dataframe, I want to capture strings that might contain the delimiter (e.g. I might want to capture a-b).
Does anyone know a simple solution for that, perhaps avoiding iterative operations such as a for loop or list comprehension?
Use str.extract with a capturing group:
import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
result = df['col1'].str.extract('(a|b|c)')
print(result)
Output
0
0 a
1 b
2 c
3 a
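If you would rather get a Series back than a one-column DataFrame, recent pandas versions let you pass expand=False when the pattern has a single capture group; a minimal sketch:
# expand=False with a single capture group returns a Series instead of a DataFrame
result = df['col1'].str.extract('(a|b|c)', expand=False)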
Fix your code
pat = 'a|b|c'
df['col1'].str.findall(pat).str[0]
Out[309]:
0 a
1 b
2 c
3 a
Name: col1, dtype: object
Simply try str.split(), like this: df["col1"].str.split("-", n=1, expand=True)
import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
df['col1'] = df["col1"].str.split("-", n=1, expand=True)[0]
print(df.head())
Output:
col1
0 a
1 b
2 c
3 a
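Note that the split-based answer keeps only the text before the first "-", so it would not handle tokens that themselves contain the delimiter (the a-b case mentioned in the question). A sketch using str.extract on hypothetical data containing an 'a-b' token, listing the longer token first so the regex alternation prefers it:
# hypothetical data: one token ('a-b') contains the delimiter itself
s = pd.Series(['a-b-123', 'b-1515', 'c-584854'])
# list longer tokens first so the alternation matches them before the shorter ones
print(s.str.extract('(a-b|a|b|c)', expand=False))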

Drop rows based on one column values

I've a dataframe which looks like this:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
As is clearly evident in the table above, some of the values in the mad and median columns are very big (outliers), so I want to remove the rows which have these very big values.
For example, in row 3 the value of mad is 30.408377, which is very big, so I want to drop this row. I know that I can use one line like the following to remove these values from the columns, but it doesn't remove the complete row:
df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]
But I want to remove the complete row.
How can I do that?
A predicate like the one you've given does remove entire rows; the problem is that none of your data is more than 3 standard deviations from the mean. If you tone it down to just one standard deviation, rows are removed from your example data.
Here's an example using your data:
import pandas as pd
import numpy as np
columns = ["wave", "mean", "median", "mad"]
data = [
    [4050.32, -0.016182, -0.011940, 0.008885],
    [4208.98, 0.023707, 0.007189, 0.032585],
    [4508.28, 3.662293, 0.001414, 7.193139],
    [4531.62, -15.459313, -0.001523, 30.408377],
    [4551.65, 0.009028, 0.007581, 0.005247],
    [4554.46, 0.001861, 0.010692, 0.027969],
    [6828.60, -10.604568, -0.000590, 21.084799],
    [6839.84, -0.003466, -0.001870, 0.010169],
    [6842.04, -32.751551, -0.002514, 65.118329],
    [6842.69, 18.293519, -0.002158, 36.385884],
    [6843.66, 0.006386, -0.002468, 0.034995],
    [6855.72, 0.020803, 0.000886, 0.040529],
]
df = pd.DataFrame(np.array(data), columns=columns)
print("ORIGINAL: ")
print(df)
print()
res = df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())]
print("REMOVED: ")
print(res)
This outputs:
ORIGINAL:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
REMOVED:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
Observe that rows indexed 8 and 9 are now gone.
Be sure you're reassigning the output of df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())] as shown above. The operation is not done in place.
Doing df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())] will not change the dataframe.
Assign it back to df instead:
df = df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]
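If you want to drop a row whenever any of several columns (say mad and median) holds an outlier, one possible sketch is to build the mask across those columns and require all of them to be within range (adjust the threshold to taste; the answer above used one standard deviation):
# keep only rows where both 'mad' and 'median' are within 3 standard
# deviations of their respective column means
cols = ['mad', 'median']
mask = (np.abs(df[cols] - df[cols].mean()) <= 3 * df[cols].std()).all(axis=1)
df = df[mask]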

Using Pandas to subset data from a dataframe based on multiple columns?

I am new to Python. I have to extract a subset from a pandas dataframe based on 2 lists corresponding to 2 columns in that dataframe. The values from the two lists should match the dataframe as pairs, i.e. taken at the same list position. I have tried the "isin" function, but obviously it doesn't work with combinations.
from pandas import *
d = {'A' : ['a', 'a', 'c', 'a','b'] ,'B' : [1, 2, 1, 4,1]}
df = DataFrame(d)
list1 = ['a','b']
list2 = [1,2]
print df
A B
0 a 1
1 a 2
2 c 1
3 a 4
4 b 1
### Using isin function
df[(df.A.isin(list1)) & (df.B.isin(list2)) ]
A B
0 a 1
1 a 2
4 b 1
###Desired outcome
d2 = {'A' : ['a'], 'B':[1]}
DataFrame(d2)
A B
0 a 1
Please let me know if this can be done without using loops and if there is a way to do it in a single step.
A quick and dirty way to do this is using zip:
df['C'] = list(zip(df['A'], df['B']))
list3 = list(zip(list1, list2))
d2 = df[df['C'].isin(list3)]
print(d2)
A B C
0 a 1 (a, 1)
You can of course drop the newly created column after you're done filtering on it.
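Another option (a sketch, not the only way) is an inner merge against a small lookup frame built from the two lists, which avoids the helper column entirely:
# build a lookup frame of the allowed (A, B) pairs and inner-join on both columns
lookup = DataFrame({'A': list1, 'B': list2})
d2 = df.merge(lookup, on=['A', 'B'])
print(d2)
#    A  B
# 0  a  1
Note that merge resets the row index; if the original index labels matter, the zip/isin approach above preserves them.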

Efficiently walking through pandas dataframe index

import pandas as pd
from numpy.random import randn
oldn = pd.DataFrame(randn(10, 4), columns=['A', 'B', 'C', 'D'])
I want to make a new DataFrame with rows 0..9 and one column "avg", whose value for row N = average(old[N]['A'], old[N]['B'], ..., old[N]['D']).
I'm not very familiar with pandas, so all my ideas for how to do this are gross for-loops and things. What is the efficient way to create and populate the new table?
Call mean on your df and pass axis=1 to calculate the mean row-wise; you can then pass this as data to the DataFrame constructor:
In [128]:
new_df = pd.DataFrame(data = oldn.mean(axis=1), columns=['avg'])
new_df
Out[128]:
avg
0 0.541550
1 0.525518
2 -0.492634
3 0.163784
4 0.012363
5 0.514676
6 -0.468888
7 0.334473
8 0.669139
9 0.736748
If you want the average for specific columns, use the following. Otherwise you can use the answer provided by @EdChum.
oldn['Avg'] = oldn.apply(lambda v: ((v['A']+v['B']+v['C']+v['D']) / 4.), axis=1)
or
oldn['Avg'] = oldn.apply(lambda v: ((v[['A','B','C','D']]).sum() / 4.), axis=1)
print oldn
A B C D Avg
0 -0.201468 -0.832845 0.100299 0.044853 -0.222290
1 1.510688 -0.955329 0.239836 0.767431 0.390657
2 0.780910 0.335267 0.423232 -0.678401 0.215252
3 0.780518 2.876386 -0.797032 -0.523407 0.584116
4 0.438313 -1.952162 0.909568 -0.465147 -0.267357
5 0.145152 -0.836300 0.352706 -0.794815 -0.283314
6 -0.375432 -1.354249 0.920052 -1.002142 -0.452943
7 0.663149 -0.064227 0.321164 0.779981 0.425017
8 -1.279022 -2.206743 0.534943 0.794929 -0.538973
9 -0.339976 0.636516 -0.530445 -0.832413 -0.266579
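For the specific-columns case, a vectorized alternative to the row-wise apply (usually much faster on larger frames) would be something like:
# select the columns of interest and take the row-wise mean in one vectorized call
oldn['Avg'] = oldn[['A', 'B', 'C', 'D']].mean(axis=1)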

Find empty or NaN entry in Pandas Dataframe

I am trying to search through a Pandas Dataframe to find where it has a missing entry or a NaN entry.
Here is a dataframe that I am working with:
cl_id a c d e A1 A2 A3
0 1 -0.419279 0.843832 -0.530827 text76 1.537177 -0.271042
1 2 0.581566 2.257544 0.440485 dafN_6 0.144228 2.362259
2 3 -1.259333 1.074986 1.834653 system 1.100353
3 4 -1.279785 0.272977 0.197011 Fifty -0.031721 1.434273
4 5 0.578348 0.595515 0.553483 channel 0.640708 0.649132
5 6 -1.549588 -0.198588 0.373476 audio -0.508501
6 7 0.172863 1.874987 1.405923 Twenty NaN NaN
7 8 -0.149630 -0.502117 0.315323 file_max NaN NaN
NOTE: The blank entries are empty strings - this is because there was no alphanumeric content in the file that the dataframe came from.
If I have this dataframe, how can I find a list with the indexes where the NaN or blank entry occurs?
np.where(pd.isnull(df)) returns the row and column indices where the value is NaN:
In [152]: import numpy as np
In [153]: import pandas as pd
In [154]: np.where(pd.isnull(df))
Out[154]: (array([2, 5, 6, 6, 7, 7]), array([7, 7, 6, 7, 6, 7]))
In [155]: df.iloc[2,7]
Out[155]: nan
In [160]: [df.iloc[i,j] for i,j in zip(*np.where(pd.isnull(df)))]
Out[160]: [nan, nan, nan, nan, nan, nan]
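If you want index labels and column names rather than integer positions, the positional arrays from np.where can be translated through df.index and df.columns; a small sketch building on the arrays above:
# map positional (row, column) pairs back to index labels and column names
rows, cols = np.where(pd.isnull(df))
print(list(zip(df.index[rows], df.columns[cols])))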
Finding values which are empty strings could be done with applymap:
In [182]: np.where(df.applymap(lambda x: x == ''))
Out[182]: (array([5]), array([7]))
Note that using applymap requires calling a Python function once for each cell of the DataFrame. That could be slow for a large DataFrame, so it would be better if you could arrange for all the blank cells to contain NaN instead so you could use pd.isnull.
Try this:
df[df['column_name'] == ''].index
and for NaNs you can try:
pd.isna(df['column_name'])
Check if the columns contain NaN using .isnull() and check for empty strings using .eq(''), then join the two together using the bitwise OR operator |.
Sum along axis 0 to find columns with missing data, then sum along axis 1 to find the index locations of rows with missing data.
missing_cols, missing_rows = (
    (df2.isnull().sum(x) | df2.eq('').sum(x))
    .loc[lambda x: x.gt(0)].index
    for x in (0, 1)
)
>>> df2.loc[missing_rows, missing_cols]
A2 A3
2 1.10035
5 -0.508501
6 NaN NaN
7 NaN NaN
Lately I've resorted to:
df[(df[column_name].notnull()) & (df[column_name] != u'')].index
That filters out both null and empty-string cells in one go.
In my opinion, don't waste time and just replace with NaN! Then search for all entries with NaN. (This is correct because empty values are missing values anyway.)
import numpy as np # to use np.nan
import pandas as pd # to use replace
df = df.replace(' ', np.nan) # to get rid of empty values
nan_values = df[df.isna().any(axis=1)] # to get all rows with Na
nan_values # view df with NaN rows only
Partial solution: for a single string column
tmp = df['A1'].fillna(''); isEmpty = tmp == ''
gives a boolean Series that is True where there are empty strings or NaN values.
You can also do something like this:
text_empty = df['column name'].str.len() == 0
df.loc[text_empty].index
The result will be the rows that are empty, along with their index numbers.
Another option, covering cases where there might be several spaces, is to use the isspace() Python string method.
df[df.col_name.apply(lambda x: x.isspace() == False)]  # will only return rows whose value is not just whitespace
Adding NaN values:
df[(df.col_name.apply(lambda x: x.isspace() == False)) & (~df.col_name.isna())]
To obtain all the rows that contain an empty cell in a particular column:
DF_new_row = DF_raw.loc[DF_raw['columnname'] == '']
This will give the subset of DF_raw which satisfies the checking condition.
You can use string methods with regex to find cells with empty strings:
df[~df.column_name.str.contains(r'\w')].column_name.count()
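One caveat with the str.contains approach: NaN cells propagate as NaN in the resulting mask, which breaks boolean indexing. Passing na=False treats missing cells as "no word character", so the same expression also picks up NaN entries; counting with len avoids count() silently skipping the NaN rows:
# na=False makes NaN cells evaluate to False in the contains() mask,
# so ~mask selects both empty-string and NaN cells
len(df[~df.column_name.str.contains(r'\w', na=False)])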