Using Pandas to subset data from a dataframe based on multiple columns? - python-2.7

I am new to python. I have to extract a subset from pandas dataframe based on 2 lists corresponding to 2 columns in that dataframe. Both the values in list should match with that of dataframe at index level. I have tried with "isin" function but obviously it doesn't work with combinations.
from pandas import *
d = {'A' : ['a', 'a', 'c', 'a','b'] ,'B' : [1, 2, 1, 4,1]}
df = DataFrame(d)
list1 = ['a','b']
list2 = [1,2]
print df
A B
0 a 1
1 a 2
2 c 1
3 a 4
4 b 1
### Using isin function
df[(df.A.isin(list1)) & (df.B.isin(list2)) ]
A B
0 a 1
1 a 2
4 b 1
###Desired outcome
d2 = {'A' : ['a'], 'B':[1]}
DataFrame(d2)
A B
0 a 1
Please let me know if this can be done without using loops and if there is a way to do it in a single step.

A quick and dirty way to do this is using zip:
df['C'] = zip(df['A'], df['B'])
list3 = zip(list1, list2)
d2 = df[df['C'].isin(list3)
print(df2)
A B C
0 a 1 (a, 1)
You can of course drop the newly created column after you're done filtering on it.

Related

Use regular expression to extract elements from a pandas data frame

From the following data frame:
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
My ultimate goal is to extract the letters a, b or c (as string) in a pandas series. For that I am using the .findall() method from the re module, as shown below:
# import the module
import re
# define the patterns
pat = 'a|b|c'
# extract the patterns from the elements in the specified column
df['col1'].str.findall(pat)
The problem is that the output i.e. the letters a, b or c, in each row, will be present in a list (of a single element), as shown below:
Out[301]:
0 [a]
1 [b]
2 [c]
3 [a]
While I would like to have the letters a, b or c as string, as shown below:
0 a
1 b
2 c
3 a
I know that if I combine re.search() with .group() I can get a string, but if I do:
df['col1'].str.search(pat).group()
I will get the following error message:
AttributeError: 'StringMethods' object has no attribute 'search'
Using .str.split() won't do the job because, in my original dataframe, I want to capture strings that might contain the delimiter (e.g. I might want to capture a-b)
Does anyone know a simple solution for that, perhaps avoiding iterative operations such as a for loop or list comprehension?
Use extract with capturing groups:
import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
result = df['col1'].str.extract('(a|b|c)')
print(result)
Output
0
0 a
1 b
2 c
3 a
Fix your code
pat = 'a|b|c'
df['col1'].str.findall(pat).str[0]
Out[309]:
0 a
1 b
2 c
3 a
Name: col1, dtype: object
Simply try with str.split() like this- df["col1"].str.split("-", n = 1, expand = True)
import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
df['col1'] = df["col1"].str.split("-", n = 1, expand = True)
print(df.head())
Output:
col1
0 a
1 b
2 c
3 a

how to make list of lists from pandas dataframe, skipping nan values

I have a pandas dataframe that looks roughly like
foo foo2 foo3 foo4
a NY WA AZ NaN
b DC NaN NaN NaN
c MA CA NaN NaN
I'd like to make a nested list of the observations of this dataframe, but omit the NaN values, so I have something like [['NY','WA','AZ'],['DC'],['MA',CA'].
There is a pattern in this dataframe, if that makes a difference, such that if fooX is empty, the subsequent column fooY will also be empty.
I originally had something like this code below. I'm sure there's a nicer way to do this
A = [[i] for i in subset_label['label'].tolist()]
B = [i for i in subset_label['label2'].tolist()]
C = [i for i in subset_label['label3'].tolist()]
D = [i for i in subset_label['label4'].tolist()]
out_list = []
for index, row in subset_label.iterrows():
out_list.append([row.label, row.label2, row.label3, row.label4])
out_list
Option 1
pd.DataFrame.stack drops na by default.
df.stack().groupby(level=0).apply(list).tolist()
[['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
​___
Option 2
Fun alternative, because I think summing lists within pandas objects is fun.
df.applymap(lambda x: [x] if pd.notnull(x) else []).sum(1).tolist()
[['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
Option 3
numpy experiment
nn = df.notnull().values
sliced = df.values.ravel()[nn.ravel()]
splits = nn.sum(1)[:-1].cumsum()
[s.tolist() for s in np.split(sliced, splits)]
[['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
Try this:
In [77]: df.T.apply(lambda x: x.dropna().tolist()).tolist()
Out[77]: [['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
Here's a vectorized version!
original = pd.DataFrame(data={
'foo': ['NY', 'DC', 'MA'],
'foo2': ['WA', np.nan, 'CA'],
'foo3': ['AZ', np.nan, np.nan],
'foo4': [np.nan] * 3,
})
out = original.copy().fillna('NAN')
# Build up mapping such that each non-nan entry is mapped to [entry]
# and nan entries are mapped to []
unique_entries = np.unique(out.values)
mapping = {e: [e] for e in unique_entries}
mapping['NAN'] = []
# Apply mapping
for c in original.columns:
out[c] = out[c].map(mapping)
# Concatenate the lists along axis 1
out.sum(axis=1)
You should get something like
0 [NY, WA, AZ]
1 [DC]
2 [MA, CA]
dtype: object

Keeping some data from duplicates and adding to existing python dataframe

I have an issue with keeping some data from duplicates and wanting to add valuable information to a new column in the dataframe.
import pandas as pd
data = {'id':[1,1,2,2,3],'key':[1,1,2,2,1],'value0':['a', 'b', 'x', 'y', 'a']}
frame = pd.DataFrame(data, columns = ['id','key','value0'])
print frame
Yields:
id key value0
0 1 1 a
1 1 1 b
2 2 2 x
3 2 2 y
4 3 1 a
Desired Output:
key value0_0 value0_1 value1_0
0 1 a b a
1 2 x y None
The "id" column isn't important to keep but could help with iteration and grouping.
I think this could be adapted to other projects where you don't know how many values exist for a set of keys.
set_index including a cumcount and unstack
frame.set_index(
['key', frame.groupby('key').cumcount()]
).value0.unstack().add_prefix('value0_').reset_index()
key value0_0 value0_1 value0_2
0 1 a b a
1 2 x y None
I'm questioning your column labeling but here is an approach using binary
frame.set_index(
['key', frame.groupby('key').cumcount()]
).value0.unstack().rename(
columns='{:02b}'.format
).add_prefix('value_').reset_index()
key value_00 value_01 value_10
0 1 a b a
1 2 x y None

scikit-learn: One hot encoding of column with list values [duplicate]

This question already has answers here:
How to one hot encode variant length features?
(2 answers)
Closed 5 years ago.
I am trying to encode a dataframe like below:
A B C
2 'Hello' ['we', are', 'good']
1 'All' ['hello', 'world']
Now as you can see I can labelencod string values of second column, but I am not able to figure out how to go about encode the third column which is having list of string values and length of the lists are different. Even if i onehotencode this i will get an array which i dont know how to merge with array elements of other columns after encoding. Please suggest some good technique
Assuming we have the following DF:
In [31]: df
Out[31]:
A B C
0 2 Hello [we, are, good]
1 1 All [hello, world]
Let's use sklearn.feature_extraction.text.CountVectorizer
In [32]: from sklearn.feature_extraction.text import CountVectorizer
In [33]: vect = CountVectorizer()
In [34]: X = vect.fit_transform(df.C.str.join(' '))
In [35]: df = df.join(pd.DataFrame(X.toarray(), columns=vect.get_feature_names()))
In [36]: df
Out[36]:
A B C are good hello we world
0 2 Hello [we, are, good] 1 1 0 1 0
1 1 All [hello, world] 0 0 1 0 1
alternatively you can use sklearn.preprocessing.MultiLabelBinarizer as #VivekKumar suggested in this comment
In [56]: from sklearn.preprocessing import MultiLabelBinarizer
In [57]: mlb = MultiLabelBinarizer()
In [58]: X = mlb.fit_transform(df.C)
In [59]: df = df.join(pd.DataFrame(X, columns=mlb.classes_))
In [60]: df
Out[60]:
A B C are good hello we world
0 2 Hello [we, are, good] 1 1 0 1 0
1 1 All [hello, world] 0 0 1 0 1

Efficiently walking through pandas dataframe index

import pandas as pd
from numpy.random import randn
oldn = pd.DataFrame(randn(10, 4), columns=['A', 'B', 'C', 'D'])
I want to make a new DataFrame that is 0..9 rows long, and has one column "avg", whose value for row N = average(old[N]['A'], old[N]['B']..old[N]['D'])
I'm not very familiar with pandas, so all my ideas how to do this are gross for- loops and things. What is the efficient way to create and populate the new table?
Call mean on your df and pass param axis=1 to calculate the mean row-wise, you can then pass this as data to the DataFrame ctor:
In [128]:
new_df = pd.DataFrame(data = oldn.mean(axis=1), columns=['avg'])
new_df
Out[128]:
avg
0 0.541550
1 0.525518
2 -0.492634
3 0.163784
4 0.012363
5 0.514676
6 -0.468888
7 0.334473
8 0.669139
9 0.736748
If you want average for specific columns use the following. Else you can use the answer provided by #EdChum
oldn['Avg'] = oldn.apply(lambda v: ((v['A']+v['B']+v['C']+v['D']) / 4.), axis=1)
or
old['Avg'] = oldn.apply(lambda v: ((v[['A','B','C','D']]).sum() / 4.), axis=1)
print oldn
A B C D Avg
0 -0.201468 -0.832845 0.100299 0.044853 -0.222290
1 1.510688 -0.955329 0.239836 0.767431 0.390657
2 0.780910 0.335267 0.423232 -0.678401 0.215252
3 0.780518 2.876386 -0.797032 -0.523407 0.584116
4 0.438313 -1.952162 0.909568 -0.465147 -0.267357
5 0.145152 -0.836300 0.352706 -0.794815 -0.283314
6 -0.375432 -1.354249 0.920052 -1.002142 -0.452943
7 0.663149 -0.064227 0.321164 0.779981 0.425017
8 -1.279022 -2.206743 0.534943 0.794929 -0.538973
9 -0.339976 0.636516 -0.530445 -0.832413 -0.266579