Pandas Dataframe Wildcard Values in List - regex

How can I filter a dataframe to rows with values that are contained within a list? Specifically, the values in the dataframe will only ever be partial matches of the list entries, never exact matches.
I've tried using pandas.DataFrame.isin but this only works if the values in the dataframe are the same as in the list.
list = ["123 MAIN STREET", "456 BLUE ROAD", "789 SKY DRIVE"]
df =
address
0 123 MAIN
1 456 BLUE
2 987 PANDA
target_df = df[df["address"].isin(list)]
Ideally the result should be
target_df =
address
0 123 MAIN
1 456 BLUE

Use str.contains and a simple regex using | to connect the terms.
f = '|'.join
mask = f(map(f, map(str.split, list)))  # -> "123|MAIN|STREET|456|BLUE|ROAD|789|SKY|DRIVE"
df[df.address.str.contains(mask)]
address
0 123 MAIN
1 456 BLUE
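For reference, a self-contained sketch of this approach; the sample values are taken from the question, and the names addresses and pattern are mine:
import pandas as pd

addresses = ["123 MAIN STREET", "456 BLUE ROAD", "789 SKY DRIVE"]
df = pd.DataFrame({"address": ["123 MAIN", "456 BLUE", "987 PANDA"]})

# every word of every list entry becomes one alternative in the pattern:
# "123|MAIN|STREET|456|BLUE|ROAD|789|SKY|DRIVE"
pattern = "|".join(word for entry in addresses for word in entry.split())

print(df[df["address"].str.contains(pattern)])
#     address
# 0  123 MAIN
# 1  456 BLUE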

I ended up using a list comprehension:
df[[any(x in y for y in list) for x in df.address]]
Out[257]:
address
0 123 MAIN
1 456 BLUE
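A self-contained version of that comprehension (the address list from the question, renamed to addresses here to avoid shadowing the built-in list):
import pandas as pd

addresses = ["123 MAIN STREET", "456 BLUE ROAD", "789 SKY DRIVE"]
df = pd.DataFrame({"address": ["123 MAIN", "456 BLUE", "987 PANDA"]})

# keep a row when its partial address is a substring of any full address
target_df = df[[any(x in y for y in addresses) for x in df.address]]
print(target_df)
#     address
# 0  123 MAIN
# 1  456 BLUE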

Related

Remove all words containing '#' from list in DataFrame

I have a DataFrame in which one column contains lists of words.
>> dataset.head()
>> contain
0 ["name", "Place", "ect#gtr", "nick"]
1 ["gf#e", "nobel", "play", "hi"]
I want to remove all the words which contain '#'. In the above example, I want to remove "ect#gtr" and "gf#e".
Try this one:
import numpy as np

ab = np.column_stack([~df[col].str.contains(r"#") for col in df])
new_df = df.loc[ab.any(axis=1)]
print(new_df)
Use a list comprehension with filtering; regex is not necessary here:
df = pd.DataFrame({'contain':[['name', 'Place', 'ect#gtr', 'nick'],
                              ['gf#e', 'nobel', 'play', 'hi']]})
print (df)
contain
0 [name, Place, ect#gtr, nick]
1 [gf#e, nobel, play, hi]
df.contain = df.contain.apply(lambda x: [y for y in x if '#' not in y])
Or:
df.contain = [[y for y in x if '#' not in y] for x in df.contain]
print (df)
contain
0 [name, Place, nick]
1 [nobel, play, hi]
EDIT: To remove the values inside strings, add split with join:
df = pd.DataFrame({'contain':['name Place ect#gtr nick',"gf#e nobel play hi"]})
print (df)
contain
0 name Place ect#gtr nick
1 gf#e nobel play hi
df.contain = df.contain.apply(lambda x: ' '.join([y for y in x.split() if '#' not in y]))
print (df)
contain
0 name Place nick
1 nobel play hi

Finding occurrences of substrings within pandas dataframe -- Python

I have a list of 'words' that I want to count, shown below.
word_list = ['one','two','three']
And I have a column within a pandas dataframe with the text below.
TEXT
-----
"Perhaps she'll be the one for me."
"Is it two or one?"
"Mayhaps it be three afterall..."
"Three times and it's a charm."
"One fish, two fish, red fish, blue fish."
"There's only one cat in the hat."
"One does not simply code into pandas."
"Two nights later..."
"Quoth the Raven... nevermore."
The desired output that I would like is the following below, where I want to count the number of times the substrings defined in word_list appear in the strings of each row in the dataframe.
Word | Count
one 5
two 3
three 2
Is there a way to do this in Python 2.7?
I would do this with vanilla Python; first join the strings:
In [11]: long_string = "".join(df[0]).lower()
In [12]: long_string[:50] # all the words glued up
Out[12]: "perhaps she'll be the one for me.is it two or one?"
In [13]: for w in word_list:
    ...:     print(w, long_string.count(w))
    ...:
one 5
two 3
three 2
If you want to return a Series, you could use a dict comprehension:
In [14]: pd.Series({w: long_string.count(w) for w in word_list})
Out[14]:
one 5
three 2
two 3
dtype: int64
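Putting those pieces together, a minimal runnable sketch (the column name TEXT and the sample sentences are from the question):
import pandas as pd

word_list = ['one', 'two', 'three']
df = pd.DataFrame({'TEXT': [
    "Perhaps she'll be the one for me.",
    "Is it two or one?",
    "Mayhaps it be three afterall...",
    "Three times and it's a charm.",
    "One fish, two fish, red fish, blue fish.",
    "There's only one cat in the hat.",
    "One does not simply code into pandas.",
    "Two nights later...",
    "Quoth the Raven... nevermore.",
]})

# glue all rows into one lowercase string and count each word as a substring
long_string = "".join(df['TEXT']).lower()
counts = pd.Series({w: long_string.count(w) for w in word_list})
print(counts.sort_values(ascending=False))
# one      5
# two      3
# three    2
# dtype: int64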
Use str.extractall + value_counts:
df
text
0 "Perhaps she'll be the one for me."
1 "Is it two or one?"
2 "Mayhaps it be three afterall..."
3 "Three times and it's a charm."
4 "One fish, two fish, red fish, blue fish."
5 "There's only one cat in the hat."
6 "One does not simply code into pandas."
7 "Two nights later..."
8 "Quoth the Raven... nevermore."
rgx = '({})'.format('|'.join(word_list))
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
one 5
two 3
three 2
Name: 0, dtype: int64
Details
rgx
'(one|two|three)'
df.text.str.lower().str.extractall(rgx).iloc[:, 0]
match
0 0 one
1 0 two
1 one
2 0 three
3 0 three
4 0 one
1 two
5 0 one
6 0 one
7 0 two
Name: 0, dtype: object
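Note that, like the substring counts above, this pattern also matches inside longer words (e.g. 'someone' would count towards 'one'). If whole-word matches were wanted instead, one variant (not part of the original answer) would be to wrap the alternation in word boundaries:
rgx_words = r'\b({})\b'.format('|'.join(word_list))
df['text'].str.lower().str.extractall(rgx_words).iloc[:, 0].value_counts()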
Performance
Small
# Zero's code
%%timeit
pd.Series({w: df.text.str.count(w, flags=re.IGNORECASE).sum() for w in word_list}).sort_values(ascending=False)
1000 loops, best of 3: 1.55 ms per loop
# Andy's code
%%timeit
long_string = "".join(df.iloc[:, 0]).lower()
for w in word_list:
    long_string.count(w)
10000 loops, best of 3: 132 µs per loop
%%timeit
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
100 loops, best of 3: 2.53 ms per loop
Large
df = pd.concat([df] * 100000)
%%timeit
pd.Series({w: df.text.str.count(w, flags=re.IGNORECASE).sum() for w in word_list}).sort_values(ascending=False)
1 loop, best of 3: 4.34 s per loop
%%timeit
long_string = "".join(df.iloc[:, 0]).lower()
for w in word_list:
    long_string.count(w)
10 loops, best of 3: 151 ms per loop
%%timeit
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
1 loop, best of 3: 4.12 s per loop
Use
In [52]: pd.Series({w: df.TEXT.str.contains(w, case=False).sum() for w in word_list})
Out[52]:
one 5
three 2
two 3
dtype: int64
Or, to count multiple instances in each row (this uses re.IGNORECASE, so it needs import re):
In [53]: pd.Series({w: df.TEXT.str.count(w, flags=re.IGNORECASE).sum() for w in word_list})
Out[53]:
one 5
three 2
two 3
dtype: int64
Use sort_values
In [55]: s = pd.Series({w: df.TEXT.str.count(w, flags=re.IGNORECASE).sum() for w in word_list})
In [56]: s.sort_values(ascending=False)
Out[56]:
one 5
two 3
three 2
dtype: int64

Load multiple CSV files into a DataFrame: column names issue

I have multiple CSV files with the same format (14 rows, 4 columns).
I tried to load all of them into a single DataFrame, and to use the file name to rename the values of the first column (1-14):
1 500 0 0
2 350 0 1
3 500 1 0
.............
13 600 0 0
14 800 0 0
I tried the following code but I am not getting what I am expecting:
import os
import pandas as pd

filenames = os.listdir('Threshold/')
Y = pd.DataFrame()  # empty df
# file names are in the following format: "subx_ICA_thre.csv"
# need to get x (the subject number, used later for renaming column values)
Sub_list = []
for filename in filenames:
    s = int(''.join(filter(str.isdigit, filename)))
    Sub_list.append(int(s))
S_Sub_list = sorted(Sub_list)
for x in S_Sub_list:  # get the file according to the subject number
    temp = pd.read_csv('sub' + str(x) + '_ICA_thre.csv')
    df = pd.concat([Y, temp])  # concat the obtained frame with the empty frame
    df.columns = ['id', 'data', 'isEB', 'isEM']
    # replace the column values using the subject id
    for sub in range(1, 15):
        df['id'].replace(sub, 'sub' + str(x) + '_ICA_' + str(sub), inplace=True)
    print(df)
output:
id data isEB isEM
0 sub1_ICA_2 200 0 0
1 sub1_ICA_3 275 0 0
2 sub1_ICA_4 500 1 0
................................
11 sub1_ICA_13 275 0 0
12 sub1_ICA_14 300 0 0
id data isEB isEM
0 sub2_ICA_2 275 0 0
1 sub2_ICA_3 500 0 0
2 sub2_ICA_4 400 0 0
.................................
11 sub2_ICA_13 300 0 0
12 sub2_ICA_14 450 0 0
First, it seems that the code creates separate DataFrames rather than a single one. Second, the first row is removed (sub1_ICA_1 is missing; it may have been consumed as the column names).
I couldn't find the problem in the loop that I am using.
I think you need to create a list of DataFrames first, then concat with the parameter keys so that the values from the range form a MultiIndex level, then modify the column id, and finally remove the MultiIndex with reset_index.
The names parameter was also passed to read_csv for custom column names.
Y = []
for x in S_Sub_list:
    n = ['id', 'data', 'isEB', 'isEM']
    temp = pd.read_csv('sub' + str(x) + '_ICA_thre.csv', names=n)
    Y.append(temp)

# list comprehension alternative
# n = ['id', 'data', 'isEB', 'isEM']
# Y = [pd.read_csv('sub' + str(x) + '_ICA_thre.csv', names=n) for x in S_Sub_list]
df = pd.concat(Y, keys=range(1,len(S_Sub_list) + 1))
df['id'] = 'sub' + df.index.get_level_values(0).astype(str) +'_ICA_'+ df['id'].astype(str)
df = df.reset_index(drop=True)
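For reference, a minimal end-to-end sketch combining the filename parsing from the question with this approach; it assumes the files are read from the Threshold/ directory and that the subject numbers are consecutive, as in the question:
import os
import pandas as pd

n = ['id', 'data', 'isEB', 'isEM']
filenames = os.listdir('Threshold/')
# file names look like "subx_ICA_thre.csv"; pull out the subject number x
s_sub_list = sorted(int(''.join(filter(str.isdigit, f))) for f in filenames)

# one DataFrame per file, concatenated with the subject number as a MultiIndex level
Y = [pd.read_csv(os.path.join('Threshold', 'sub' + str(x) + '_ICA_thre.csv'), names=n)
     for x in s_sub_list]
df = pd.concat(Y, keys=range(1, len(s_sub_list) + 1))
df['id'] = 'sub' + df.index.get_level_values(0).astype(str) + '_ICA_' + df['id'].astype(str)
df = df.reset_index(drop=True)
print(df.head())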

R - How do I document the number of grepl matches based on another data frame?

This is a rather tricky question indeed. It would be awesome if someone might be able to help me out.
What I'm trying to do is the following. I have a data frame in R containing every locality in a given state, scraped from Wikipedia. It looks something like this (top 10 rows); let's call it NewHampshire.df:
Municipality County Population
1 Acworth Sullivan 891
2 Albany Carroll 735
3 Alexandria Grafton 1613
4 Allenstown Merrimack 4322
5 Alstead Cheshire 1937
6 Alton Belknap 5250
7 Amherst Hillsborough 11201
8 Andover Merrimack 2371
9 Antrim Hillsborough 2637
10 Ashland Grafton 2076
I've further compiled a new variable called grep_term, which combines the values from Municipality and County into a new variable that functions as an or-statement, something like this:
Municipality County Population grep_term
1 Acworth Sullivan 891 "Acworth|Sullivan"
2 Albany Carroll 735 "Albany|Carroll"
and so on. Furthermore, I have another dataset, containing self-disclosed locations of 2000 Twitter users. I call it location.df, and it looks a bit like this:
[1] "London" "Orleans village VT USA" "The World"
[4] "D M V Towson " "Playa del Sol Solidaridad" "Beautiful Downtown Burbank"
[7] NA "US" "Gaithersburg Md"
[10] NA "California " "Indy"
[13] "Florida" "exsnaveen com" "Houston TX"
I want to do two things:
1: Grepl through every observation in the location.df dataset, and save a TRUE or FALSE into a new variable depending on whether the self-disclosed location is part of the list in the first dataset.
2: Save the number of matches for a particular line in the NewHampshire.df dataset to a new variable. I.e., if there are 4 matches for Acworth in the twitter location dataset, there should be a value "4" for observation 1 in the NewHampshire.df on the newly created "matches" variable
What I've done so far: I've solved task 1, as follows:
for(i in 1:234){
  location.df$isRelevant <- sapply(location.df$location, function(s) grepl(NH_Places[i], s, ignore.case = TRUE))
}
How can I solve task 2, ideally in the same for loop?
Thanks in advance, any help would be greatly appreciated!
With regard to task one, you could also use:
# location vector to be matched against
loc.vec <- c("Acworth","Hillsborough","California","Amherst","Grafton","Ashland","London")
location.df <- data.frame(location=loc.vec)
# create a 'grep-vector'
places <- paste(paste(NewHampshire$Municipality, NewHampshire$County,
                      sep = "|"),
                collapse = "|")
# match them against the available locations
location.df$isRelevant <- sapply(location.df$location,
                                 function(s) grepl(places, s, ignore.case = TRUE))
which gives:
> location.df
location isRelevant
1 Acworth TRUE
2 Hillsborough TRUE
3 California FALSE
4 Amherst TRUE
5 Grafton TRUE
6 Ashland TRUE
7 London FALSE
To get the number of matches in the location.df with the grep_term column, you can use:
NewHampshire$n.matches <- sapply(NewHampshire$grep_term, function(x) sum(grepl(x, loc.vec)))
which gives:
> NewHampshire
Municipality County Population grep_term n.matches
1 Acworth Sullivan 891 Acworth|Sullivan 1
2 Albany Carroll 735 Albany|Carroll 0
3 Alexandria Grafton 1613 Alexandria|Grafton 1
4 Allenstown Merrimack 4322 Allenstown|Merrimack 0
5 Alstead Cheshire 1937 Alstead|Cheshire 0
6 Alton Belknap 5250 Alton|Belknap 0
7 Amherst Hillsborough 11201 Amherst|Hillsborough 2
8 Andover Merrimack 2371 Andover|Merrimack 0
9 Antrim Hillsborough 2637 Antrim|Hillsborough 1
10 Ashland Grafton 2076 Ashland|Grafton 2

How to remove rows from a MultiIndex DataFrame with string indices

I have a dataframe with a MultiIndex, from which I want to delete rows according to some index-based pattern. For example, I would like to remove frames 1-4 where the annotator is "Peter Test xx" and the label is "empty" in the dataframe below.
print df
boundingbox x1 boundingbox y1 \
frame annotator label
0 Peter Test xx empty NaN NaN
1 Peter Test xx empty NaN NaN
2 Peter Test xx empty NaN NaN
3 Peter Test xx empty NaN NaN
Petaa yea NaN NaN
4 Peter Test xx empty NaN NaN
5 P empty frame 494 64
Peter Test xx empty NaN NaN
6 P empty frame 494 64
Peter Test xx empty NaN NaN
7 P empty frame 494 64
Peter Test xx empty NaN NaN
8 P empty frame 494 64
Peter Test xx empty NaN NaN
I can select rows by doing something like
indexer = [slice(None)]*len(df.index.names)
indexer[df.index.names.index('frame')] = range(1,4)
indexer[df.index.names.index('annotator')] = ['Peter Test xx']
indexer[df.index.names.index('label')] = ['empty']
return df.loc[tuple(indexer),:]
If I want to delete these rows, ideally I would like to do something like
del df.loc[tuple(indexer),:]
But this does not work (why?). All solutions I found online were based on int based indices. But if I am working with strings as indices, I cannot simply slice or such things.
Something I tried as well was:
def filterFunc(x, frames, annotator, label):
    if x[0] in frames \
            and x[1] == annotator \
            and x[2] == label:
        return 1
    else:
        return 0

mask = df.index.map(lambda x: filterFunc(x, frames, annotator, label))
return df[~mask,:]
Which gives me:
TypeError: unhashable type: 'numpy.ndarray'
Any advice?
While trying to solve another problem, I figured out that one can use the index of a selected part of a dataframe in drop:
indexer = [slice(None)]*len(df.index.names)
indexer[df.index.names.index('frame')] = range(1,4)
indexer[df.index.names.index('annotator')] = ['Peter Test xx']
indexer[df.index.names.index('label')] = ['empty']
selection = df.loc[tuple(indexer),:]
df.drop(selection.index)
Is that how it is supposed to be done?
You have to use loc, iloc or ix when doing more complicated slicing:
df[msk] # works
df.iloc[msk, ] # works
df.iloc[msk, :] # works
but
df[msk, ]
TypeError: unhashable type: 'numpy.ndarray'
See different choices for indexing in the docs.
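As a follow-up to the question's filterFunc attempt, here is a minimal sketch (assuming the frames, annotator and label variables from the question) that builds the mask as a boolean NumPy array and indexes with df[~mask] instead of df[~mask, :], the tuple indexing that triggers the TypeError:
import numpy as np

# True for the rows that should be dropped; df.index yields (frame, annotator, label) tuples
mask = np.array([frame in frames and ann == annotator and lab == label
                 for frame, ann, lab in df.index])
df_kept = df[~mask]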