Regex search a nested dictionary and stop on first match (Python)

I'm using a nested dictionary that contains various vertebrate types. I can currently read the nested dictionary and search a simple sentence for a keyword (e.g., tiger).
I would like to stop the dictionary search (the loop) once the first match is found.
How do I accomplish this?
Example code:
import re

vertebrates = {'dict1':{'frog':'amphibian', 'toad':'amphibian', 'salamander':'amphibian','newt':'amphibian'},
               'dict2':{'bear':'mammal','cheetah':'mammal','fox':'mammal', 'mongoose':'mammal','tiger':'mammal'},
               'dict3': {'anteater': 'mammal', 'tiger': 'mammal'}}
sentence = 'I am a tiger'
for dictionaries, values in vertebrates.items():
    for pattern, value in values.items():
        animal = re.compile(r'\b{}\b'.format(pattern), re.IGNORECASE|re.MULTILINE)
        match = re.search(animal, sentence)
        if match:
            print (value)
            print (match.group(0))

import re

vertebrates = {'dict1':{'frog':'amphibian', 'toad':'amphibian', 'salamander':'amphibian','newt':'amphibian'},
               'dict2':{'bear':'mammal','cheetah':'mammal','fox':'mammal', 'mongoose':'mammal','tiger':'mammal'},
               'dict3': {'anteater': 'mammal', 'tiger': 'mammal'}}
sentence = 'I am a tiger'
found = False # Initialize the found flag as False (no match yet)
for dictionaries, values in vertebrates.items():
    for pattern, value in values.items():
        animal = re.compile(r'\b{}\b'.format(pattern), re.IGNORECASE|re.MULTILINE)
        match = re.search(animal, sentence)
        if match is not None:
            print (value)
            print (match.group(0))
            found = True # Set the found flag once a match is found
            break # Exit the inner loop since a match is found
    if found: # If a match was found, leave the outer loop as well
        break
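An alternative to the flag, sketched here under the assumption that the search can live in its own function: a `return` leaves both loops at once, so no `found` variable is needed. The function name `first_match`, the shortened dictionary, and the use of `re.escape` are illustrative choices, not part of the original code.

```python
import re

# Shortened version of the dictionary from the question
vertebrates = {
    'dict1': {'frog': 'amphibian', 'toad': 'amphibian'},
    'dict2': {'bear': 'mammal', 'tiger': 'mammal'},
    'dict3': {'anteater': 'mammal', 'tiger': 'mammal'},
}

def first_match(sentence, nested):
    """Return (matched_text, value) for the first keyword found, else None."""
    for inner in nested.values():
        for keyword, value in inner.items():
            # re.escape guards against keywords containing regex metacharacters
            pattern = r'\b{}\b'.format(re.escape(keyword))
            match = re.search(pattern, sentence, re.IGNORECASE)
            if match:
                return match.group(0), value  # return exits both loops at once
    return None

print(first_match('I am a tiger', vertebrates))  # ('tiger', 'mammal')
```

Because dictionaries preserve insertion order in Python 3.7+, the match reported is the one from `dict2`, and `dict3` is never visited.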

Related

How to make a list of the indexes of capital letters in a word

I'm trying to solve a challenge that I found online. It gives an input word, and the expected output is a list of the indexes of all the capital letters. My program works unless there are duplicate capital letters, and I can't figure out how to deal with that. Here's my code right now:
def capital_indexes(string):
    string = list(string)
    print(string)
    output = []
    for i in string:
        if i.isupper():
            output.append(string.index(i))
    return output
Like I said, it works for words like "HeLlO" but not for words like "TesT".
Try this one and compare it with your original:
You don't have to use the index() method to search for the character again. Just use enumerate to get the (index, char) tuple in one step, and check whether the character is uppercase.
def capital_indexes(string):
    #string = list(string) # string is an iterable!
    #print(string)
    output = []
    for i, ch in enumerate(string): # get index, char
        if ch.isupper():
            output.append(i)
    return output
print(capital_indexes('TesT')) # [0, 3]
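To see why the original version fails, note that `index()` always reports the first occurrence of a value. A short sketch of that behavior, together with a list-comprehension form of the same `enumerate` approach (a stylistic variant, not a different algorithm):

```python
def capital_indexes(string):
    # enumerate yields (index, char) pairs, so a repeated letter keeps its own index
    return [i for i, ch in enumerate(string) if ch.isupper()]

print('TesT'.index('T'))         # 0: index() only ever finds the first 'T'
print(capital_indexes('TesT'))   # [0, 3]
print(capital_indexes('HeLlO'))  # [0, 2, 4]
```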

Conditionally extracting the beginning of a regex pattern

I have a list of strings containing the names of actors in a movie that I want to extract. In some cases, the actor's character name is also included which must be ignored.
Here are a couple of examples:
# example 1
input = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
expected_output = ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']
# example 2
input = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'
expected_output = ['Yoosuf Shafeeu', 'Ahmed Saeed', 'Mohamed Manik']
Here is what I've tried to no avail:
import re
output = re.findall(r'(?:\\n)?([\w ]+)(?= as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?: as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?:(?= as )|(?! as ))', input)
The \n in the input string are new line characters. We can make use of this fact in our regex.
Essentially, each line always begins with the actor's name. After the actor's name, there is either the word as or the end of the line.
Using this info, we can write the regex like this:
^(?:[\w ]+?)(?:(?= as )|$)
First, we assert that we must be at the start of the line ^. Then we match some word characters and spaces lazily [\w ]+?, until we see (?:(?= as )|$), either as or the end of the line.
In code,
output = re.findall(r'^(?:[\w ]+?)(?:(?= as )|$)', input, re.MULTILINE)
Remember to use the multiline option. That is what makes ^ and $ mean "start/end of line".
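A minimal sketch that runs the pattern on both example inputs and shows what happens when the flag is left out. The names `ACTOR`, `cast1`, and `cast2` are introduced here for illustration (and avoid shadowing the built-in `input`).

```python
import re

# Same pattern as above, compiled once with the multiline flag
ACTOR = re.compile(r'^(?:[\w ]+?)(?:(?= as )|$)', re.MULTILINE)

cast1 = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
cast2 = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'

print(ACTOR.findall(cast1))  # ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']
print(ACTOR.findall(cast2))  # ['Yoosuf Shafeeu', 'Ahmed Saeed', 'Mohamed Manik']

# Without re.MULTILINE, ^ only matches at the very start of the string,
# so only the first line can produce a match.
print(re.findall(r'^(?:[\w ]+?)(?:(?= as )|$)', cast1))  # ['Levan Gelbakhiani']
```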
You can do this without using regular expressions as well.
Here is the code:
output = [x.split(' as')[0] for x in input.split('\n')]
I guess you can combine the values obtained from two regex alternatives:
re.findall(r'(?:\n)?(.+)(?:\W[a][s].*?)|(?:\n)?(.+)$', input)
gives
[('Levan Gelbakhiani', ''), ('Ana Javakishvili', ''), ('', 'Anano Makharadze')]
from which you filter the empty strings out
output = list(map(lambda x : list(filter(len, x))[0], output))
gives
['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']

How to use re.search on a list?

I have tried changing re.search to re.match and so on, but it still shows "No match result" no matter what I type.
I think there could be a problem in the code, since I wrote it without fully comprehending the concept behind it.
Basically, I am trying to build a "search engine" that, given a word, finds all the names in which that word appears. Can someone tell me what is wrong?
import re
searchlist=[ *insert name here* ]
word_s = input("Search : ")
search_list = re.compile(r'\b(?:%s)\b' % '|'.join(searchlist), re.I|re.M)
result = re.search(search_list, word_s)
if result:
    print("Match Result: ", result.group())
else:
    print("No match result.")
Your last comment shows the problem:
In your code, searchlist is a list of the search terms (the things the regex searches for), not the list of strings to be searched.
For example:
searchlist = ["Fundamentals", "Engineering"]
search_list = re.compile(r'\b(?:%s)\b' % '|'.join(searchlist), re.I|re.M)
Now search_list is \b(?:Fundamentals|Engineering)\b, so it can be used as a regex that checks whether any of those terms appears in word_s:
result = re.search(search_list, word_s)
You want to do the exact opposite:
books = ["Fundamentals of Organic Chemistry, International Edition", "Engineering Mechanics: Statics In SI Units"]
word_s = input("Search for: ")
word_re = re.compile(r"\b{}\b".format(word_s), re.I)
for book in books:
    if re.search(word_re, book):
        print("First Match Result: ", book)
        break # Abort search after first match
else: # Only executed if the for loop was exhausted
    print("No match result.")
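Since the question asks for all matching names rather than just the first, here is a hedged sketch of that variant. The helper name `find_titles` is hypothetical, and `re.escape` is added on the assumption that the user's input should be treated as literal text rather than as a regex.

```python
import re

def find_titles(word, titles):
    """Return every title that contains `word` as a whole word (case-insensitive)."""
    # re.escape makes characters such as '+' or '(' in the input match literally
    word_re = re.compile(r'\b{}\b'.format(re.escape(word)), re.IGNORECASE)
    return [title for title in titles if word_re.search(title)]

books = [
    "Fundamentals of Organic Chemistry, International Edition",
    "Engineering Mechanics: Statics In SI Units",
]

print(find_titles("mechanics", books))  # ['Engineering Mechanics: Statics In SI Units']
print(find_titles("physics", books))    # []
```

Because of the `\b` word boundaries, searching for "in" matches the standalone "In" in the second title but not the start of "International" in the first.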

Entire text is matched but not able to group in named groups

I have the following example text:
my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10
I want to extract sub-fields from it in the following way:
appName = my_app,
[
{key = key1, value = value1},
{key = user_id, value = testuser},
{key = ip_address, value = 10.10.10.10}
]
I have written the following regex for doing this:
(?<appName>\w+)\|(((?<key>\w+)?(?<equals>=)(?<value>[^\|]+))\|?)+
It matches the entire text but does not populate the named groups correctly.
I tried testing it on https://regex101.com/
What am I missing here?
I think the main problem you have is trying to write a regex that matches ALL the key=value pairs. That's not the way to do it. The correct way is to write a pattern that matches ONLY ONE key=value pair and apply it with a function that finds all occurrences of the pattern. Most languages supply such a function. Here's the code in Python, for example:
import re
txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pairs = re.findall(r'(\w+)=([^|]+)', txt)
print(pairs)
This gives:
[('key1', 'value1'), ('user_id', 'testuser'), ('ip_address', '10.10.10.10')]
The pattern matches a key consisting of alphanumeric characters, (\w+), followed by = and a value. The value is matched by ([^|]+), that is, everything except a vertical bar, because the value can contain non-alphanumeric characters, such as the dots in the IP address.
Mind the findall function. There's a search function to catch a pattern once, and there's a findall function to catch all occurrences of the pattern within the text.
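To make the search/findall distinction concrete, here is a small sketch using only the standard re module. Pulling the application name out with a separate `re.match` call and building the list of `key`/`value` dicts are assumptions about the desired output shape, based on the format shown in the question.

```python
import re

txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pair_re = re.compile(r'(\w+)=([^|]+)')

# re.match anchors at the start of the string: this picks up the application name
app_name = re.match(r'\w+', txt).group(0)

# search stops at the first key=value pair it finds
first = pair_re.search(txt).groups()

# findall returns every non-overlapping pair
pairs = [{'key': k, 'value': v} for k, v in pair_re.findall(txt)]

print(app_name)  # my_app
print(first)     # ('key1', 'value1')
print(pairs)
```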
I tested it on regex101 and it worked.
I must comment, though, that the specific text pattern you are working with doesn't require regex. All high-level languages supply a split function. You can split by the vertical bar, and then split each slice you get (except the first one) again by the equals sign.
Use the PyPI regex module with the following code:
import regex
s = "my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10"
rx = r"(?<appName>\w+)(?:\|(?<key>\w+)=(?<value>[^|]+))+"
print( [(m.group("appName"), dict(zip(m.captures("key"),m.captures("value")))) for m in regex.finditer(rx, s)] )
# => [('my_app', {'key1': 'value1', 'user_id': 'testuser', 'ip_address': '10.10.10.10'})]
See the Python demo online.
The .captures() method returns all the values captured by a group across all of its iterations.
Not sure, but a regular expression might be unnecessary here. Splitting, similar to
data='my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
x= data.split('|')
appName = []
for index,item in enumerate(x):
    if index>0:
        element = item.split('=')
        temp = {"key":element[0],"value":element[1]}
        appName.append(temp)
appName = str(x[0] + ',' + str(appName))
print(appName)
might return an output similar to the desired output:
my_app,[{'key': 'key1', 'value': 'value1'}, {'key': 'user_id', 'value': 'testuser'}, {'key': 'ip_address', 'value': '10.10.10.10'}]
Here each pair is stored in a dict:
temp = {"key":element[0],"value":element[1]}
temp can be changed to any other data structure you would like to have.

Using python regex to find repeated values after a header

If I have a string that looks something like:
s = """
...
Random Stuff
...
HEADER
a 1
a 3
# random amount of rows
a 17
RANDOM_NEW_HEADER
a 200
a 300
...
More random stuff
...
"""
Is there a clean way to use regex (in Python) to find all instances of a \d* after HEADER, but before the pattern is broken by SOMETHING_TOTALLY_DIFFERENT? I thought about something like:
import re
pattern = r'HEADER(?:\na \d*)*\na (\d*)'
print(re.findall(pattern, s))
Unfortunately, regex doesn't find overlapping matches. If there's no sensible way to do this with regex, I'm okay with anything faster than writing my own for loop to extract this data.
(TL;DR -- There's a distinct header, followed by a pattern that repeats. I want to catch each instance of that pattern, as long as there isn't a break in the repetition.)
EDIT:
To clarify, I don't necessarily know what SOMETHING_TOTALLY_DIFFERENT will be, only that it won't match a \d+. I want to collect all consecutive instances of \na \d+ that follow HEADER\n.
How about a simple loop?
import re
e = re.compile(r'(a\s+\d+)')
header = 'whatever your header field is'
breaker = 'something_different'
breaker_reached = False
header_reached = False
results = []
with open('yourfile.txt') as f:
    for line in f:
        line = line.rstrip('\n')  # lines read from a file keep their trailing newline
        if line == header:
            # skip processing lines unless we reach the header
            header_reached = True
            continue
        if header_reached:
            i = e.match(line)
            if i and not breaker_reached:
                results.append(i.groups()[0])
            else:
                # There was no match, check if we reached the breaker
                if line == breaker:
                    breaker_reached = True
Not completely sure where you want the regex to stop; please clarify.
'((a \d*)\s){1,}'
import re
sentinel_begin = 'HEADER'
sentinel_end = 'SOMETHING_TOTALLY_DIFFERENT'
re.findall(r'(a \d*)', s[s.find(sentinel_begin): s.find(sentinel_end)])
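When the terminating line is not known in advance, one possible approach is to use two regex passes: first capture the unbroken run of `a <digits>` lines directly after the header, then extract the numbers from that run only. This is a sketch under the assumption that the header sits on a line of its own; the function name `values_after_header` and the shortened sample string are illustrative.

```python
import re

s = """
Random Stuff
HEADER
a 1
a 3
a 17
RANDOM_NEW_HEADER
a 200
a 300
More random stuff
"""

def values_after_header(text, header='HEADER'):
    # Pass 1: the header line followed by one or more consecutive "a <digits>" lines.
    # The repetition stops at the first line that breaks the pattern.
    block = re.search(r'^{}$((?:\na \d+)+)'.format(re.escape(header)),
                      text, re.MULTILINE)
    if block is None:
        return []
    # Pass 2: pull the numbers out of the captured run only
    return re.findall(r'^a (\d+)$', block.group(1), re.MULTILINE)

print(values_after_header(s))  # ['1', '3', '17']
```

This sidesteps the overlapping-match problem because the repeated group is captured as one block, and the individual values are extracted from that block afterwards.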