How to use re.search on a list? - regex

I have tried changing re.search to re.match and so on, but it still shows "No match result" no matter what I type.
I think there could be a problem in the code, since I wrote it without fully comprehending the concept behind it.
Basically, I am trying to build a "search engine" that finds all the matching names when a word is given and it matches one of the words in the names. Can someone tell me what is wrong?
import re
searchlist=[ *insert name here* ]
word_s = input("Search : ")
search_list = re.compile(r'\b(?:%s)\b' % '|'.join(searchlist), re.I|re.M)
result = re.search(search_list, word_s)
if result:
    print("Match Result: ", result.group())
else:
    print("No match result.")

Your last comment shows the problem:
In your code, searchlist is a list of the search terms (the things the regex searches for), not the list of strings to be searched.
For example:
searchlist = ["Fundamentals", "Engineering"]
search_list = re.compile(r'\b(?:%s)\b' % '|'.join(searchlist), re.I|re.M)
Now search_list is \b(?:Fundamentals|Engineering)\b, so it can be used as a regex that finds whether any of those terms appears in word_s:
result = re.search(search_list, word_s)
You want to do the exact opposite:
books = ["Fundamentals of Organic Chemistry, International Edition", "Engineering Mechanics: Statics In SI Units"]
word_s = input("Search for: ")
word_re = re.compile(r"\b{}\b".format(re.escape(word_s)), re.I)  # escape the input so it is matched literally
for book in books:
    if re.search(word_re, book):
        print("First Match Result: ", book)
        break  # Abort search after first match
else:  # Only executed if the for loop was exhausted
    print("No match result.")
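The loop above stops at the first hit. Since the question asks for all matching names, a variant that collects every match might look roughly like this. It is only a sketch: the helper name find_books is made up for illustration, and re.escape is added on the assumption that the typed word should be matched literally.

```python
import re

def find_books(word, books):
    """Return every title that contains `word` as a whole word (case-insensitive)."""
    # re.escape keeps characters such as '+' or '(' in the input from acting as regex syntax
    word_re = re.compile(r"\b{}\b".format(re.escape(word)), re.I)
    return [book for book in books if word_re.search(book)]

books = ["Fundamentals of Organic Chemistry, International Edition",
         "Engineering Mechanics: Statics In SI Units"]
print(find_books("mechanics", books) or "No match result.")
```

An empty list is falsy, so the `or` falls back to the "No match result." message when nothing matches.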

Related

regex search nested dictionary and stop on first match (python)

I'm using a nested dictionary, which contains various vertebrate types. I can currently read the nested dictionary in and search a simple sentence for a keyword (e.g., tiger).
I would like to stop the dictionary search (loop), once the first match is found.
How do I accomplish this?
Example code:
vertebrates = {'dict1':{'frog':'amphibian', 'toad':'amphibian', 'salamander':'amphibian','newt':'amphibian'},
               'dict2':{'bear':'mammal','cheetah':'mammal','fox':'mammal', 'mongoose':'mammal','tiger':'mammal'},
               'dict3': {'anteater': 'mammal', 'tiger': 'mammal'}}
sentence = 'I am a tiger'
for dictionaries, values in vertebrates.items():
    for pattern, value in values.items():
        animal = re.compile(r'\b{}\b'.format(pattern), re.IGNORECASE|re.MULTILINE)
        match = re.search(animal, sentence)
        if match:
            print (value)
            print (match.group(0))
import re
vertebrates = {'dict1':{'frog':'amphibian', 'toad':'amphibian', 'salamander':'amphibian','newt':'amphibian'},
               'dict2':{'bear':'mammal','cheetah':'mammal','fox':'mammal', 'mongoose':'mammal','tiger':'mammal'},
               'dict3': {'anteater': 'mammal', 'tiger': 'mammal'}}
sentence = 'I am a tiger'
found = False # Initialized found flag as False (match not found)
for dictionaries, values in vertebrates.items():
    for pattern, value in values.items():
        animal = re.compile(r'\b{}\b'.format(pattern), re.IGNORECASE|re.MULTILINE)
        match = re.search(animal, sentence)
        if match is not None:
            print (value)
            print (match.group(0))
            found = True # Set found flag as True if you found a match
            break # exit the inner loop since match is found
    if found: # If match is found then break the outer loop as well
        break
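An alternative to the flag is to move the nested loops into a function and return as soon as something matches, since a return leaves both loops at once. This is only a sketch: the function name first_animal is hypothetical, and the dictionary is trimmed from the one in the question.

```python
import re

def first_animal(sentence, groups):
    """Return (name, kind) for the first key found in `sentence`, or None."""
    for animals in groups.values():
        for name, kind in animals.items():
            if re.search(r'\b{}\b'.format(re.escape(name)), sentence, re.IGNORECASE):
                return name, kind  # returning exits the inner and outer loop together
    return None  # nothing matched

vertebrates = {'dict1': {'frog': 'amphibian', 'toad': 'amphibian'},
               'dict2': {'bear': 'mammal', 'tiger': 'mammal'},
               'dict3': {'anteater': 'mammal', 'tiger': 'mammal'}}
print(first_animal('I am a tiger', vertebrates))
```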

Filter out sentences in a list that don't contain particular words

Let's say I have this list:
sentences = ['the cat slept', 'the dog jumped', 'the bird flew']
I want to filter out any sentences that contain terms from the following list:
terms = ['clock', 'dog']
I should get:
['the cat slept', 'the bird flew']
I tried this solution, but it doesn't work
empty = []
if any(x not in terms for x in sentences):
    empty.append(x)
What's the best way to tackle this?
I'd go with a solution like this for readability rather than reducing to a one liner:
for sentence in sentences:
    if all(term not in sentence for term in terms):
        empty.append(sentence)
Simple brute-force O(m*n) approach using list comprehension:
For each sentence - check if any of not allowed terms are found in this sentence and allow sentence if there was no match.
[s for s in sentences if not any(t in s for t in terms)]
# ['the cat slept', 'the bird flew']
Obviously, you can also invert the condition and do something like:
[s for s in sentences if all(t not in s for t in terms)]
Similar to the above two answers but using filter, perhaps being closer to the problem specification:
filter(lambda x: all([el not in terms for el in x.split(' ')]), sentences)
Binary search is more efficient when the sentences and term lists are very long.
from bisect import bisect
def binary_search(a, x, lo=0, hi=None):
    if hi is None:
        hi = len(a) # search the whole list by default
    i = bisect(a, x, lo, hi)
    if i == 0:
        return -1
    elif a[i-1] == x:
        return i-1
    else:
        return -1
sentences = ['the cat slept', 'the dog jumped', 'the bird flew', 'the a']
terms = ['clock', 'dog']
sentences_with_sorted = [(sentence, sorted(sentence.split()))
                         for sentence in sentences] # sort them for binary search
valid_sentences = []
for sentence in sentences_with_sorted:
    list_of_word = sentence[1] # get sorted word list
    if all([1 if binary_search(list_of_word, word)<0 else 0
            for word in terms]): # true when none of the terms is found
        valid_sentences.append(sentence[0]) # append them
print(valid_sentences)
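If whole-word matching is what is wanted, as in the sorted-list version above, a set gives the same kind of membership test without sorting anything. A small sketch using the example data from the question; turning terms into a set is my own choice here, not something from the answers:

```python
sentences = ['the cat slept', 'the dog jumped', 'the bird flew']
terms = {'clock', 'dog'}  # a set, so each word lookup is a hash lookup

# keep a sentence only if none of its words is one of the unwanted terms
valid_sentences = [s for s in sentences if terms.isdisjoint(s.split())]
print(valid_sentences)
```

Note that this compares whole words, so 'dog' would not filter out a sentence containing 'dogs', whereas the substring checks in the earlier answers would.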

Replacing a word with another in a string if a condition is met

I am trying to get some help with a function on replacing two words in a string with another word if a condition is true.
The condition is: if the word 'poor' follows 'not', then replace the whole string 'not ... poor' with 'rich'. The problem is that I don't know how to write the function; more specifically, how to check whether the word poor follows not, and then what I have to write to make the replacement. I am pretty new to Python, so maybe it is a stupid question, but I hope someone will help me.
I want the function to do something like this:
string = 'I am not that poor'
new_string = 'I am rich'
Doubtless the regular expression pattern could be improved, but a quick and dirty way to do this is with Python's re module:
import re
patt = r'not\s+(.+\s)?poor'  # raw string so \s reaches the regex engine unchanged
s = 'I am not that poor'
sub_s = re.sub(patt, 'rich', s)
print(s, '->', sub_s)
s2 = 'I am not poor'
sub_s2 = re.sub(patt, 'rich', s2)
print(s2, '->', sub_s2)
s3 = 'I am poor not'
sub_s3 = re.sub(patt, 'rich', s3)
print(s3, '->', sub_s3)
Output:
I am not that poor -> I am rich
I am not poor -> I am rich
I am poor not -> I am poor not
The regular expression pattern patt matches the word not, followed by whitespace, then (optionally) other characters ending in a whitespace character, and then the word poor.
Step One: Determine where the 'not' and 'poor' are inside your string (check out https://docs.python.org/2.7/library/stdtypes.html#string-methods)
Step Two: Compare the locations of 'not' and 'poor' that you just found. Does 'poor' come after 'not'? How could you tell? Are there any extra edge cases you should account for?
Step Three: If your conditions are not met, do nothing. If they are, everything between and including 'not' and 'poor' must be replaced by 'rich'. I'll leave you to decide how to do that, given the above documentation link.
Good luck, and happy coding!
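For reference, the three steps above could be sketched with str.find as follows. This is just one possible reading of them, and the function name not_poor_to_rich is made up. It works on substrings, so a word like 'nothing' would also count as 'not'; that is one of the edge cases left open.

```python
def not_poor_to_rich(text):
    """Replace the span from 'not' through the next 'poor' with 'rich'."""
    not_pos = text.find('not')  # step one: locate 'not' (-1 if absent)
    if not_pos == -1:
        return text
    # step two: only a 'poor' that starts after 'not' counts
    poor_pos = text.find('poor', not_pos + len('not'))
    if poor_pos == -1:
        return text  # step three: condition not met, leave the string alone
    # step three: everything from 'not' through 'poor' becomes 'rich'
    return text[:not_pos] + 'rich' + text[poor_pos + len('poor'):]

print(not_poor_to_rich('I am not that poor'))
```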
This is something I came up with. It works for your example, but will need tweaks (for instance, if there is more than one word between not and poor).
my_string = 'I am not that poor'
print(my_string)
my_list = my_string.split(' ')
poor_pos = my_list.index('poor')
# 'not' must be one or two words before 'poor'
if 'not' in my_list[max(poor_pos - 2, 0):poor_pos]:
    not_pos = my_list.index('not')
    del my_list[not_pos:poor_pos+1]
    my_list.insert(not_pos, 'rich')  # put 'rich' where 'not ... poor' was
print(" ".join(my_list))
Output:
I am not that poor
I am rich

Python Regex List into Another List

I currently have a piece of code that runs mostly as I would expect, except that it prints out both the original list and the one that has been filtered. Essentially, what I am trying to do is read URLs from a webpage and store them in a list (called match; this part works fine) and then filter that list into a new list (called fltrmtch), because the original contains all of the extra href tags etc.
For example, at the moment it would print out A and B, but I'm only after B:
A Core Development',
B'http://docs.python.org/devguide/'),
Here's the code:
url = "URL WOULD BE IN HERE BUT NOT ALLOWED TO POST MULTIPLE LINKS" #Name of the url being searched
webpage = urllib.urlopen(url)
content = webpage.read() #places the read url contents into variable content
import re # Imports the re module which allows searching for matches.
import pprint # This import allows all list items to be printed on separate lines.
match = re.findall(r'\<a.*href\=.*http\:.+', content)#matches any content that begins with a href and ends in >
def filterPick(list, filter):
    return [( l, m.group(1) ) for l in match for m in (filter(l),) if m]
regex=re.compile(r'\"(.+?)\"').search
fltrmtch = filterPick(match, regex)
try:
    if match: # defines that if there is a match the below is ran.
        print "The number of URL's found is:" , len(match)
        match.sort()
        print "\nAnd here are the URL's found: "
        pprint.pprint(fltrmtch)
except:
    print "No URL matches have been found, please try again!"
Any help would be much appreciated.
Thank you in advance.
UPDATE: Thank you for the answer issued however I managed to find the flaw
return [( l, m.group(1) ) for l in match for m in (filter(l),) if m]
I simply had to remove the 1, from [(1, m.group(1)) ). Thanks again.
It appears that the bottom portion of your code is mostly catching errors from the top portion, and that the regex you provided has no capturing groups. Here is a revised example:
import re
import urllib
url = "www.site.com" # place real web address here
# read web page into string (Python 2; on Python 3 use urllib.request.urlopen)
page = urllib.urlopen(url).read()
# use regex to extract URLs from <a href=""> patterns
matches = re.findall(r'''\<a\s[^\>]*?\bhref\=(['"])(.+?)\1[^\>]*?\>''', page, re.IGNORECASE)
# findall returns (quote, url) tuples here, so keep only the second group
matches = sorted(match[1] for match in matches)
# print matches if they exist
if matches:
    print("The number of URL's found is:" + str(len(matches)))
    print("\nAnd here are the URL's found:")
    # print each match
    print('\n'.join(matches))
else:
    print('No URL matches have been found, please try again!')
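To see what re.findall hands back when the pattern has two capturing groups, here is a small offline sketch that runs the same kind of pattern against a hard-coded snippet instead of a downloaded page. The sample HTML is invented for the example.

```python
import re

# made-up sample markup, standing in for the downloaded page
html = ('<a class="x" href="http://docs.python.org/devguide/">Core Development</a>'
        "<a href='/about'>About</a>")
link_re = re.compile(r'''<a\s[^>]*?\bhref=(['"])(.+?)\1[^>]*?>''', re.IGNORECASE)

# with two groups in the pattern, findall yields (quote, url) tuples, not match objects
pairs = link_re.findall(html)
urls = sorted(url for _quote, url in pairs)
print(urls)
```

Because each item is a plain tuple, the URL is picked out by position rather than with .group().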

Take first successful match from a batch of regexes

I'm trying to extract set of data from a string that can match one of three patterns. I have a list of compiled regexes. I want to run through them (in order) and go with the first match.
regexes = [
    compiled_regex_1,
    compiled_regex_2,
    compiled_regex_3,
]
m = None
for reg in regexes:
    m = reg.match(name)
    if m: break
if not m:
    print 'ARGL NOTHING MATCHES THIS!!!'
This should work (haven't tested yet) but it's pretty fugly. Is there a better way of boiling down a loop that breaks when it succeeds or explodes when it doesn't?
There might be something specific to re that I don't know about that allows you to test multiple patterns too.
You can use the else clause of the for loop:
for reg in regexes:
    m = reg.match(name)
    if m: break
else:
    print 'ARGL NOTHING MATCHES THIS!!!'
If you just want to know if any of the regex match then you could use the builtin any function:
if any(reg.match(name) for reg in regexes):
    ....
however this will not tell you which regex matched.
Alternatively you can combine multiple patterns into a single regex with |:
regex = re.compile(r"(regex1)|(regex2)|...")
This will not tell you directly which regex matched, but you will have a match object that you can use for further information. For example, you can find out which of the regexes succeeded from the group that is not None:
>>> match = re.match("(a)|(b)|(c)|(d)", "c")
>>> match.groups()
(None, None, 'c', None)
However, this can get complicated if any of the sub-regexes contain groups of their own, since the group numbering will change.
This is probably faster than matching each regex individually since the regex engine has more scope for optimising the regex.
Since you have a finite set in this case, you could use short-circuit evaluation:
m = (compiled_regex_1.match(name) or
     compiled_regex_2.match(name) or
     compiled_regex_3.match(name) or
     print("ARGHHHH!"))  # print() returns None, so m is None when nothing matches
In Python 2.6 or better:
import itertools as it
m = next(it.ifilter(None, (r.match(name) for r in regexes)), None)  # ifilter exists in Python 2 only
The ifilter call could be made into a genexp, but only a bit awkwardly, i.e., with the usual trick for name binding in a genexp (aka the "phantom nested for clause idiom"):
m = next((m for r in regexes for m in (r.match(name),) if m), None)
but itertools is generally preferable where applicable.
The bit needing 2.6 is the next built-in, which lets you specify a default value if the iterator is exhausted. If you have to simulate it in 2.5 or earlier,
def next(itr, deft):
    try: return itr.next()
    except StopIteration: return deft
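On Python 3, where itertools.ifilter no longer exists, the built-in filter is lazy and can take its place. A minimal sketch of the same idea; the three patterns and the name first_match are placeholders of mine, not anything from the question:

```python
import re

# placeholder patterns, only here to make the example runnable
regexes = [re.compile(r'\d+'), re.compile(r'[a-z]+'), re.compile(r'\W+')]

def first_match(name):
    # filter(None, ...) lazily skips failed matches (None); next() takes the first
    # successful one, or the default None if every pattern fails
    return next(filter(None, (r.match(name) for r in regexes)), None)

m = first_match('abc123')
print(m.group() if m else 'nothing matched')
```

Because both the generator expression and filter are lazy, patterns after the first successful one are never tried.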
I use something like Dave Kirby suggested, but add named groups to the regexps, so that I know which one matched.
regexps = {
    'first': r'...',
    'second': r'...',
}
compiled = re.compile('|'.join('(?P<%s>%s)' % item for item in regexps.items()))
match = compiled.match(my_string)
print(match.lastgroup)
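With concrete patterns filled in, the named-group trick might look like the sketch below. The two patterns and their names are hypothetical, and a list of pairs is used instead of a dict so that the order of the alternatives is explicit.

```python
import re

# hypothetical patterns, chosen only to make the example runnable
regexps = [('number', r'\d+'), ('word', r'[A-Za-z]+')]
compiled = re.compile('|'.join('(?P<%s>%s)' % item for item in regexps))

match = compiled.match('hello42')
# lastgroup is the name of the alternative that matched
print(match.lastgroup, match.group())
```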
Eric is on a better track in looking at the bigger picture of what the OP is aiming for, though I would use if/else. I also think that using the print function in an or expression is a little questionable. +1 to Nathon for correcting the OP to use a proper else clause.
Then my alternative:
# alternative to any builtin that returns useful result,
# the first considered True value
def first(seq):
    for item in seq:
        if item: return item
regexes = [
    compiled_regex_1,
    compiled_regex_2,
    compiled_regex_3,
]
m = first(reg.match(name) for reg in regexes)
print(m if m else 'ARGL NOTHING MATCHES THIS!!!')