Having problems converting strings in a pandas series to lowercase - python-2.7

I was able to do this in the DataFrame using a lambda function with map(lambda x: x.lower()). I tried to use a lambda function with pd.Series.apply(), but that didn't work. Also, when I try to isolate the column with something like series['A'], should it return the index? (I guess this makes sense.) I ask because I get a float error even though the values I want to apply the lower method to are strings. Any help would be appreciated.

You can use the Series vectorised string methods, which include lower:
In [11]: df = pd.DataFrame([['A', 'B'], ['C', 4]], columns=['X', 'Y'])
In [12]: df
Out[12]:
X Y
0 A B
1 C 4
In [13]: df.X.str.lower()
Out[13]:
0 a
1 c
Name: X, dtype: object
In [14]: df.Y.str.lower()
Out[14]:
0 b
1 NaN
Name: Y, dtype: object
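A minimal sketch of why the float error appears and how the .str accessor avoids it. The column contents below are illustrative assumptions, not the asker's data: a plain lambda calls .lower() on every value, so a single float in the column raises, whereas .str.lower() returns NaN for non-string values.

```python
import pandas as pd

# Illustrative data: column "Y" mixes a string with a float.
df = pd.DataFrame({"X": ["A", "C"], "Y": ["B", 4.0]})

# A plain lambda fails on the float, because floats have no .lower() method.
try:
    df["Y"].apply(lambda x: x.lower())
    failed = False
except AttributeError:
    failed = True

# The .str accessor lowercases strings and yields NaN for non-string values.
lowered = df["Y"].str.lower()

# To pass non-string values through unchanged, guard the call explicitly.
guarded = df["Y"].apply(lambda x: x.lower() if isinstance(x, str) else x)
```

Which of the two behaviours is preferable (NaN or pass-through) depends on what the non-string values mean in the data.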

Related

how to make list of lists from pandas dataframe, skipping nan values

I have a pandas dataframe that looks roughly like
foo foo2 foo3 foo4
a NY WA AZ NaN
b DC NaN NaN NaN
c MA CA NaN NaN
I'd like to make a nested list of the observations of this dataframe, but omit the NaN values, so I have something like [['NY','WA','AZ'],['DC'],['MA','CA']].
There is a pattern in this dataframe, if that makes a difference, such that if fooX is empty, the subsequent column fooY will also be empty.
I originally had something like this code below. I'm sure there's a nicer way to do this
A = [[i] for i in subset_label['label'].tolist()]
B = [i for i in subset_label['label2'].tolist()]
C = [i for i in subset_label['label3'].tolist()]
D = [i for i in subset_label['label4'].tolist()]
out_list = []
for index, row in subset_label.iterrows():
    out_list.append([row.label, row.label2, row.label3, row.label4])
out_list
Option 1
pd.DataFrame.stack drops na by default.
df.stack().groupby(level=0).apply(list).tolist()
[['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
Option 2
Fun alternative, because I think summing lists within pandas objects is fun.
df.applymap(lambda x: [x] if pd.notnull(x) else []).sum(1).tolist()
[['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
Option 3
numpy experiment
nn = df.notnull().values
sliced = df.values.ravel()[nn.ravel()]
splits = nn.sum(1)[:-1].cumsum()
[s.tolist() for s in np.split(sliced, splits)]
[['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
Try this:
In [77]: df.T.apply(lambda x: x.dropna().tolist()).tolist()
Out[77]: [['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
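The same row-wise idea can also be written as a plain comprehension, which may be easier to read than the transposed apply. This is a sketch that rebuilds the question's frame by hand; it loops in Python, so it is not the fastest option for large frames.

```python
import numpy as np
import pandas as pd

# The frame from the question, rebuilt by hand.
df = pd.DataFrame(
    {
        "foo": ["NY", "DC", "MA"],
        "foo2": ["WA", np.nan, "CA"],
        "foo3": ["AZ", np.nan, np.nan],
        "foo4": [np.nan, np.nan, np.nan],
    },
    index=["a", "b", "c"],
)

# Drop the missing values from each row, then turn what is left into a list.
nested = [row.dropna().tolist() for _, row in df.iterrows()]
print(nested)
```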
Here's a vectorized version!
original = pd.DataFrame(data={
    'foo': ['NY', 'DC', 'MA'],
    'foo2': ['WA', np.nan, 'CA'],
    'foo3': ['AZ', np.nan, np.nan],
    'foo4': [np.nan] * 3,
})
out = original.copy().fillna('NAN')
# Build up mapping such that each non-nan entry is mapped to [entry]
# and nan entries are mapped to []
unique_entries = np.unique(out.values)
mapping = {e: [e] for e in unique_entries}
mapping['NAN'] = []
# Apply mapping
for c in original.columns:
    out[c] = out[c].map(mapping)
# Concatenate the lists along axis 1
out.sum(axis=1)
You should get something like
0 [NY, WA, AZ]
1 [DC]
2 [MA, CA]
dtype: object

How to modify a Pandas Column filled text

I am trying to edit a Pandas dataframe column filled with text. Basically applying some editing functions(slicing, extraction and so on).
I am writing the function and applying it to the column with map to accomplish that.
df["Time taken"] = df["details"].map(somefunc)
However, it seems I can't edit the text, as Pandas stores the datatype as "object", not "string".
I tried using astype(str), but it still stays "object".
How do I accomplish this task?
You can perform string operations on Pandas series by appending .str to the series name. Here are some examples:
>>> df = pd.DataFrame([{'A': 'Label1', 'B': '$12.00'},
... {'A': 'Label2', 'B': '$14.00'},
... {'A': 'Label1', 'B': '$9.00'},
... {'A': 'Label2', 'B': '$8.00'}])
>>> df.B.str.replace('$', '', regex=False)
0 12.00
1 14.00
2 9.00
3 8.00
Name: B, dtype: object
>>> df.A.str[-1:]
0 1
1 2
2 1
3 2
Name: A, dtype: object
>>> df.A.str[1:]
0 abel1
1 abel2
2 abel1
3 abel2
Name: A, dtype: object
>>> df.B.str.len()
0 6
1 6
2 5
3 5
Name: B, dtype: int64
Pandas documentation: Working with Text Data
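As a hedged illustration of the point above: in the pandas versions this thread dates from, a text column reports dtype object, and that does not block string operations. The sketch below uses a shortened, illustrative version of the example frame, strips the dollar sign, and converts the result to numbers.

```python
import pandas as pd

# Shortened, illustrative version of the example frame.
df = pd.DataFrame({"A": ["Label1", "Label2"], "B": ["$12.00", "$9.00"]})

# regex=False treats "$" as a literal character rather than an end-of-string anchor.
prices = df["B"].str.replace("$", "", regex=False).astype(float)

# Slicing through .str works element by element, like slicing each string.
suffix = df["A"].str[-1:]
```

The cleaned column can then be assigned back, for example to a new column, in the same way the question assigns the result of map.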

Using Pandas to subset data from a dataframe based on multiple columns?

I am new to python. I have to extract a subset from pandas dataframe based on 2 lists corresponding to 2 columns in that dataframe. Both the values in list should match with that of dataframe at index level. I have tried with "isin" function but obviously it doesn't work with combinations.
from pandas import *
d = {'A' : ['a', 'a', 'c', 'a','b'] ,'B' : [1, 2, 1, 4,1]}
df = DataFrame(d)
list1 = ['a','b']
list2 = [1,2]
print df
A B
0 a 1
1 a 2
2 c 1
3 a 4
4 b 1
### Using isin function
df[(df.A.isin(list1)) & (df.B.isin(list2)) ]
A B
0 a 1
1 a 2
4 b 1
###Desired outcome
d2 = {'A' : ['a'], 'B':[1]}
DataFrame(d2)
A B
0 a 1
Please let me know if this can be done without using loops and if there is a way to do it in a single step.
A quick and dirty way to do this is using zip:
df['C'] = list(zip(df['A'], df['B']))
list3 = list(zip(list1, list2))
d2 = df[df['C'].isin(list3)]
print(d2)
A B C
0 a 1 (a, 1)
You can of course drop the newly created column after you're done filtering on it.
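An alternative that avoids the helper column is an inner merge against a small frame of the wanted pairs. This is a different technique from the zip approach above, sketched on the question's data; the names wanted and subset are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "c", "a", "b"], "B": [1, 2, 1, 4, 1]})
list1 = ["a", "b"]
list2 = [1, 2]

# Each row of `wanted` is one allowed (A, B) pair: ("a", 1) and ("b", 2).
wanted = pd.DataFrame({"A": list1, "B": list2})

# An inner merge on both columns keeps only rows whose pair appears in `wanted`.
subset = df.merge(wanted, on=["A", "B"], how="inner")
print(subset)
```

Note that merge returns a fresh integer index, so the original row labels are not preserved.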

How to compare two columns and create new dataframe

My table:
A B C D E F G H I
0 0.292090 0.806958 0.255845 0.855154 0.590744 0.937458 0.192190 0.548974 0.703214
1 0.094978 NaN NaN NaN 0.350109 0.635469 0.025525 0.108062 0.510891
2 0.918005 0.568802 0.041519 NaN NaN 0.882552 0.086663 0.908168 0.221058
3 0.882920 0.230281 0.172843 0.948232 0.560853 NaN NaN 0.664388 0.393678
4 0.086579 0.819807 0.712273 0.769890 0.448730 0.853134 0.508932 0.630004 0.579961
Output:
A B&C D&E F&G H&I
0.292090 Present Present Present Present
0.094978 Not There Not There Present Present
0.918005 Present Not There Present Present
0.882920 Present Present Not There Present
0.086579 Present Present Present Present
If both B and C are missing, show "Not There"; otherwise show "Present".
If either D or E is missing, show "Not There"; otherwise show "Present".
If either F or G is not equal to 0, show "Present"; otherwise show "Not There".
If the sum of H and I is greater than 2, show "Not There"; otherwise show "Present".
I want to write if statements or lambdas, whichever is faster in pandas, and generate a new dataframe like the output above. But I can't work out how to write statements like the following in pandas.
if (B & C):
    df.at[0, 'B&C'] = 'Present'
else:
    df.at[0, 'B&C'] = 'Not there'
if (D | E):
    df.at[0, 'D&E'] = 'Present'
else:
    df.at[0, 'D&E'] = 'Not there'
Is there any way in pandas to build this new dataframe?
You can use isnull to determine which entries are NaN:
In [3]: df
Out[3]:
A B C D E
0 -0.600684 -0.112947 -0.081186 -0.012543 1.951430
1 -1.198891 NaN NaN NaN 1.196819
2 -0.342050 0.971968 -1.097107 NaN NaN
3 -0.908169 0.095141 -1.029277 1.533454 0.171399
In [4]: df.B.isnull()
Out[4]:
0 False
1 True
2 False
3 False
Name: B, dtype: bool
Use the & and | operators to combine two boolean Series, and use the where function from numpy to select between two values based on a Series of booleans. It returns a numpy array, but you can assign it to a column of a DataFrame:
In [5]: df['B&C'] = np.where(df.B.isnull() & df.C.isnull(), 'Not There', 'Present')
In [6]: df['B&C']
Out[6]:
0 Present
1 Not There
2 Present
3 Present
Name: B&C, dtype: object
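The same pattern extends to the other rules in the question. The sketch below covers three of them (B&C, D&E and H&I) on a few rows copied from the question's table; the F&G rule is left out because its wording is ambiguous, and the frame name out is an assumption made here.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Three rows taken from the question's table (columns B, C, D, E, H, I only).
df = pd.DataFrame(
    {
        "B": [0.806958, nan, 0.568802],
        "C": [0.255845, nan, 0.041519],
        "D": [0.855154, nan, nan],
        "E": [0.590744, 0.350109, nan],
        "H": [0.548974, 0.108062, 0.908168],
        "I": [0.703214, 0.510891, 0.221058],
    }
)

out = pd.DataFrame(
    {
        # "Not There" only when both B and C are missing.
        "B&C": np.where(df["B"].isnull() & df["C"].isnull(), "Not There", "Present"),
        # "Not There" when either D or E is missing.
        "D&E": np.where(df["D"].isnull() | df["E"].isnull(), "Not There", "Present"),
        # "Not There" when H + I exceeds 2; a NaN sum compares False, so it stays "Present".
        "H&I": np.where(df["H"] + df["I"] > 2, "Not There", "Present"),
    }
)
print(out)
```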
Here you need to take two columns and check whether the corresponding entries are NaN. Pandas has a one-stop guide to all kinds of indexing and selecting from a data frame:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
I have explained this using iloc but you can do it in many other ways. I have not run the below code, but I hope the logic will be clear.
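def tmp(col1, col2):
    df = data[[col1, col2]]
    labels = []
    for i in range(df.shape[0]):
        # NaN never compares equal to anything, so test with pd.isnull instead of ==
        if pd.isnull(df.iloc[i, 0]) or pd.isnull(df.iloc[i, 1]):
            labels.append("Not Present")
        else:
            labels.append("Present")
    return labels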

Find empty or NaN entry in Pandas Dataframe

I am trying to search through a Pandas Dataframe to find where it has a missing entry or a NaN entry.
Here is a dataframe that I am working with:
cl_id a c d e A1 A2 A3
0 1 -0.419279 0.843832 -0.530827 text76 1.537177 -0.271042
1 2 0.581566 2.257544 0.440485 dafN_6 0.144228 2.362259
2 3 -1.259333 1.074986 1.834653 system 1.100353
3 4 -1.279785 0.272977 0.197011 Fifty -0.031721 1.434273
4 5 0.578348 0.595515 0.553483 channel 0.640708 0.649132
5 6 -1.549588 -0.198588 0.373476 audio -0.508501
6 7 0.172863 1.874987 1.405923 Twenty NaN NaN
7 8 -0.149630 -0.502117 0.315323 file_max NaN NaN
NOTE: The blank entries are empty strings - this is because there was no alphanumeric content in the file that the dataframe came from.
If I have this dataframe, how can I find a list with the indexes where the NaN or blank entry occurs?
np.where(pd.isnull(df)) returns the row and column indices where the value is NaN:
In [152]: import numpy as np
In [153]: import pandas as pd
In [154]: np.where(pd.isnull(df))
Out[154]: (array([2, 5, 6, 6, 7, 7]), array([7, 7, 6, 7, 6, 7]))
In [155]: df.iloc[2,7]
Out[155]: nan
In [160]: [df.iloc[i,j] for i,j in zip(*np.where(pd.isnull(df)))]
Out[160]: [nan, nan, nan, nan, nan, nan]
Finding values which are empty strings could be done with applymap:
In [182]: np.where(df.applymap(lambda x: x == ''))
Out[182]: (array([5]), array([7]))
Note that using applymap requires calling a Python function once for each cell of the DataFrame. That could be slow for a large DataFrame, so it would be better if you could arrange for all the blank cells to contain NaN instead so you could use pd.isnull.
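One way to avoid the per-cell Python call is to build a single boolean mask that is True for NaN cells and for empty strings, then pass it to np.where. This is a sketch on a small illustrative frame, not the asker's data; the names missing and positions are introduced here.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with both empty strings and NaN.
df = pd.DataFrame(
    {
        "e": ["system", "audio", "Twenty"],
        "A2": [1.100353, "", np.nan],
        "A3": ["", -0.508501, np.nan],
    }
)

# True where a cell is NaN or an empty string.
missing = df.isnull() | df.eq("")

# Row and column positions of every missing cell, in row-major order.
rows, cols = np.where(missing.to_numpy())
positions = list(zip(rows.tolist(), cols.tolist()))
print(positions)
```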
Try this:
df[df['column_name'] == ''].index
and for NaNs you can try:
pd.isna(df['column_name'])
Check whether the columns contain NaN using .isnull() and check for empty strings using .eq(''), then join the two together using the bitwise OR operator |.
Sum along axis 0 to find the columns with missing data, then sum along axis 1 to find the index locations of the rows with missing data.
missing_cols, missing_rows = (
    (df2.isnull().sum(x) | df2.eq('').sum(x))
    .loc[lambda x: x.gt(0)].index
    for x in (0, 1)
)
>>> df2.loc[missing_rows, missing_cols]
A2 A3
2 1.10035
5 -0.508501
6 NaN NaN
7 NaN NaN
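I've resorted to
df[ (df[column_name].notnull()) & (df[column_name]!=u'') ].index
lately. That filters out both null and empty-string cells in one go, leaving the index of the populated rows; negate the condition to get the missing ones instead.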
In my opinion, don't waste time: just replace the blanks with NaN, then search for every NaN entry. (This is valid because empty values are missing values anyway.)
import numpy as np # to use np.nan
import pandas as pd # to use replace
df = df.replace(r'^\s*$', np.nan, regex=True) # turn empty or whitespace-only strings into NaN
nan_values = df[df.isna().any(axis=1)] # to get all rows containing a NaN
nan_values # view df with NaN rows only
Partial solution: for a single string column
tmp = df['A1'].fillna(''); isEmpty = tmp==''
gives boolean Series of True where there are empty strings or NaN values.
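To turn that boolean Series into the list of index labels the question asks for, it can be used to index df.index. A small sketch on illustrative values; the name missing_index is an assumption.

```python
import numpy as np
import pandas as pd

# Illustrative column mixing numbers, an empty string and NaN.
df = pd.DataFrame({"A1": [1.537177, "", np.nan, 0.640708]})

# NaN becomes "", so one comparison catches both kinds of missing entry.
is_empty = df["A1"].fillna("").eq("")

# Index labels of the rows where the entry is empty or NaN.
missing_index = df.index[is_empty].tolist()
print(missing_index)
```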
You can also test the string length:
text_empty = df['column name'].str.len() == 0
df.loc[text_empty].index
The result is the index of the rows whose value in that column is an empty string.
Another option, covering cases where a cell might contain several spaces, is Python's isspace() string method.
df[df.col_name.apply(lambda x: not str(x).isspace())] # returns only the rows that are not whitespace-only
To exclude NaN values as well:
df[df.col_name.apply(lambda x: not str(x).isspace()) & ~df.col_name.isna()]
To obtain all the rows that contain an empty cell in a particular column:
DF_new_row=DF_raw.loc[DF_raw['columnname']=='']
This will give the subset of DF_raw, which satisfy the checking condition.
You can use string methods with regex to find cells with empty strings:
df[~df.column_name.str.contains(r'\w')].column_name.count()