Python regex search for elimination - regex

I have the following case:
import re
target_regex = '^(?!P\-[5678]).*'
pattern = re.compile(target_regex, re.IGNORECASE)
mylists=['p-1.1', 'P-5']
target_object_is_found = pattern.findall(''.join(mylists))
print("target_object_is_found:", target_object_is_found)
This gives:
target_object_is_found: ['p-1.1P-5']
But what I need from my regex is p-1.1 alone, with P-5 eliminated.

You joined the items in mylists, so P-5 is no longer at the start of the string.
You may use
import re
target_regex = 'P-[5-8]'
pattern = re.compile(target_regex, re.IGNORECASE)
mylists=['p-1.1', 'P-5']
target_object_is_found = [x for x in mylists if not pattern.match(x)]
print("target_object_is_found: {}".format(target_object_is_found))
# => target_object_is_found: ['p-1.1']
See the Python demo.
Here, the P-[5-8] pattern is compiled with the re.IGNORECASE flag and is used to check each item inside mylists (see the [...] list comprehension) with the regex_object.match method, which looks for a match at the start of the string only. The match result is negated; see the not after if.
So, all items that do not start with the (?i)P-[5-8] pattern are returned.
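As a minimal sketch of the difference, assuming the same sample data as above: the asker's original negative-lookahead pattern also works once it is applied to each item separately instead of to the joined string. The names joined_result and per_item_result are introduced here for illustration only.

```python
import re

# The asker's original negative-lookahead pattern, unchanged in meaning.
pattern = re.compile(r'^(?!P-[5678]).*', re.IGNORECASE)
mylists = ['p-1.1', 'P-5']

# Joined: the lookahead only inspects the start of the single long string.
joined_result = pattern.findall(''.join(mylists))

# Per item: the lookahead inspects the start of every item.
per_item_result = [x for x in mylists if pattern.match(x)]

print(joined_result)    # ['p-1.1P-5']
print(per_item_result)  # ['p-1.1']
```

Whether to keep the lookahead or to negate a plain match in Python is largely a matter of taste; the important part is testing the items one by one.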

Related

Replace Words proceeding it with Regex

I have two strings like this:
word=list()
word.append('The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3')
word.append('Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG')
I want to remove the words starting from VHSDVDRIP and DVDRIP onward. So from The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3 to The.Eternal.Evil.of.Asia.1995. and Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG to Guzoo.1986.
I tried the following but it doesn't work:
re.findall(r"\b\." + 'DVDRIP' + r"\b\.", word)
You could use re.split for that (regex101):
s = 'The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3'
import re
print( re.split(r'(\.[^.]*dvdrip\.)', s, 1, flags=re.I)[0] )
Prints:
The.Eternal.Evil.of.Asia.1995
Some test cases:
lst = ['The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3',
'Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG']
import re
for item in lst:
    print( re.split(r'(\.[^.]*dvdrip\.)', item, 1, flags=re.I)[0] )
Prints:
The.Eternal.Evil.of.Asia.1995
Guzoo.1986
If you wish to replace those instances with an empty string (that is my guess at the intent), this expression with the i flag may work:
import re
regex = r"(?i)(.*)(?:\w+)?dvdrip\W(.*)"
test_str = """
The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3
Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG
"""
subst = "\\1\\2"
print(re.sub(regex, subst, test_str))
Output
The.Eternal.Evil.of.Asia.1995.x264.AC3
Guzoo.1986.VHSx264.AC3.HS.ES-SHAG
The expression is explained on the top right panel of regex101.com, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs, if you like.
Consider re.sub:
import re
films = ["The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3", "Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG"]
for film in films:
    print(re.sub(r'(.*)VHSDVDRiP.*|DVDRip.*', r'\1', film))
Output:
The.Eternal.Evil.of.Asia.1995.
Guzoo.1986.
Note: this leaves the trailing period, as requested.
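If the capitalisation of the marker varies (DVDRip, DVDRiP, dvdrip), a case-insensitive variant might look like the sketch below. It is not taken from the answers above: the pattern and the helper name strip_release_tags are assumptions for illustration. It removes everything from the dot-delimited token that contains dvdrip through the end of the string, which keeps the trailing period.

```python
import re

# Hypothetical pattern: the token containing "dvdrip" (any case) and everything after it.
TAG = re.compile(r'[^.]*dvdrip.*', re.IGNORECASE)

def strip_release_tags(name):
    """Cut the name at the first dot-delimited token that contains 'dvdrip'."""
    return TAG.sub('', name)

films = ["The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3",
         "Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG"]
cleaned = [strip_release_tags(film) for film in films]
print(cleaned)  # ['The.Eternal.Evil.of.Asia.1995.', 'Guzoo.1986.']
```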

Entire text is matched but not able to group in named groups

I have the following example text:
my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10
I want to extract sub-fields from it in the following way:
appName = my_app,
[
{key = key1, value = value1},
{key = user_id, value = testuser},
{key = ip_address, value = 10.10.10.10}
]
I have written the following regex for doing this:
(?<appName>\w+)\|(((?<key>\w+)?(?<equals>=)(?<value>[^\|]+))\|?)+
It matches the entire text but is not able to group it correctly in named groups.
Tried testing it on https://regex101.com/
What am I missing here?
I think the main problem is that you are trying to write a regex that matches ALL the key=value pairs. That's not the way to do it. The correct way is to write a pattern that matches ONLY ONE key=value pair and to apply it with a function that finds all occurrences of the pattern. Every language supplies such a function. Here's the code in Python, for example:
import re
txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pairs = re.findall(r'(\w+)=([^|]+)', txt)
print(pairs)
This gives:
[('key1', 'value1'), ('user_id', 'testuser'), ('ip_address', '10.10.10.10')]
The pattern matches a key consisting of alphanumeric characters, (\w+), followed by = and a value. The value is matched by ([^|]+), that is, everything except a vertical bar, because the value can contain non-alphanumeric characters, such as the dots in the IP address.
Mind the findall function. There's a search function to catch a pattern once, and there's a findall function to catch all the patterns within the text.
I tested it on regex101 and it worked.
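To make the search/findall distinction concrete, here is a small sketch on the same sample text. The extra step that pulls out the application name with re.match is an assumption about what the asker wants, not part of the answer above.

```python
import re

txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pair = re.compile(r'(\w+)=([^|]+)')

# search stops at the first key=value pair it finds.
first = pair.search(txt)
print(first.groups())  # ('key1', 'value1')

# findall returns every non-overlapping key=value pair.
all_pairs = pair.findall(txt)
print(all_pairs)

# Assumed extra step: the app name is the leading run of word characters.
app_name = re.match(r'\w+', txt).group()
print(app_name)  # my_app
```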
I must comment, though, that the specific text format you are working with doesn't require regex. All high-level languages supply a split function. You can split on the vertical bar, and then split each slice you get (except the first one) again on the equals sign.
Use the PyPI regex module with the following code:
import regex
s = "my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10"
rx = r"(?<appName>\w+)(?:\|(?<key>\w+)=(?<value>[^|]+))+"
print( [(m.group("appName"), dict(zip(m.captures("key"),m.captures("value")))) for m in regex.finditer(rx, s)] )
# => [('my_app', {'key1': 'value1', 'user_id': 'testuser', 'ip_address': '10.10.10.10'})]
See the Python demo online.
The .captures method returns all the values captured by a group across all of its iterations.
Not sure, but a regular expression might be unnecessary here; splitting, similar to
data='my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
x= data.split('|')
appName = []
for index,item in enumerate(x):
    if index>0:
        element = item.split('=')
        temp = {"key":element[0],"value":element[1]}
        appName.append(temp)
appName = str(x[0] + ',' + str(appName))
print(appName)
might return an output similar to the desired output:
my_app,[{'key': 'key1', 'value': 'value1'}, {'key': 'user_id', 'value': 'testuser'}, {'key': 'ip_address', 'value': '10.10.10.10'}]
This uses a dict:
temp = {"key":element[0],"value":element[1]}
temp can be changed to any other data structure you would like to have.

Regular expression: Matching multiple occurrences in the same line

I have a string that I need to match using regex. It works perfectly fine when I have a single occurrence in a single line, however, when there are multiple occurrences of the same string in a single line I'm not getting any matches. Can you please help?
Sample strings:
MS17010314 MS00030208 IL00171198 IH09850115 IH99400409 IH99410409
IL01771010 IL01791002 IL01930907 IL02360907 CM00010904 IH09520115
MS00201285 MS19050708 MS00370489 MS19011285T
Regex that I tried:
(([A-Z]{2}[0-9]{8,9}[A-Z]{1})|([A-Z]{2}[0-9]{8,9}))
This seems to work fine:
a = '''MS17010314 MS00030208 IL00171198 IH09850115 IH99400409 IH99410409
IL01771010 IL01791002 IL01930907 IL02360907 CM00010904 IH09520115
MS00201285 MS19050708 MS00370489 MS19011285T'''
import re
patterns = ['[A-Z]{2}[0-9]{8,9}[A-Z]{1}','[A-Z]{2}[0-9]{8,9}']
pattern = '({})'.format(')|('.join(patterns))
matches = re.findall(pattern, a)
print([match for sub in matches for match in sub if match])
#['MS17010314', 'MS00030208', 'IL00171198', 'IH09850115', 'IH99400409',
# 'IH99410409', 'IL01771010', 'IL01791002', 'IL01930907', 'IL02360907',
# 'CM00010904', 'IH09520115', 'MS00201285', 'MS19050708', 'MS00370489',
# 'MS19011285T']
I've added a way to combine all patterns.
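Since the two alternatives differ only in the trailing letter, a single pattern with an optional letter may be enough. The sketch below is an assumption about the intended format (two capitals, eight or nine digits, an optional capital), with word boundaries added so that a code is never cut out of a longer token.

```python
import re

a = '''MS17010314 MS00030208 IL00171198 IH09850115 IH99400409 IH99410409
IL01771010 IL01791002 IL01930907 IL02360907 CM00010904 IH09520115
MS00201285 MS19050708 MS00370489 MS19011285T'''

# [A-Z]? makes the trailing letter optional; \b anchors both ends of each code.
codes = re.findall(r'\b[A-Z]{2}[0-9]{8,9}[A-Z]?\b', a)
print(len(codes))  # 16
print(codes[-1])   # MS19011285T
```

Because the pattern has no capturing groups, findall returns the full matches directly and no flattening step is needed.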
I tried this in Python and the following code worked:
import re
s='''MS17010314 MS00030208 IL00171198 IH09850115 IH99400409 IH99410409
IL01771010 IL01791002 IL01930907 IL02360907 CM00010904 IH09520115
MS00201285 MS19050708 MS00370489 MS19011285T'''
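# a and b were undefined here; assumed to be the two patterns from the question
lst_of_regex = [r'[A-Z]{2}[0-9]{8,9}[A-Z]', r'[A-Z]{2}[0-9]{8,9}']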
pattern = '|'.join(lst_of_regex)
print(re.findall(pattern,s))

Using python regex to find repeated values after a header

If I have a string that looks something like:
s = """
...
Random Stuff
...
HEADER
a 1
a 3
# random amount of rows
a 17
RANDOM_NEW_HEADER
a 200
a 300
...
More random stuff
...
"""
Is there a clean way to use regex (in Python) to find all instances of a \d* after HEADER, but before the pattern is broken by SOMETHING_TOTALLY_DIFFERENT? I thought about something like:
import re
pattern = r'HEADER(?:\na \d*)*\na (\d*)'
print(re.findall(pattern, s))
Unfortunately, regex doesn't find overlapping matches. If there's no sensible way to do this with regex, I'm okay with anything faster than writing my own for loop to extract this data.
(TL;DR -- There's a distinct header, followed by a pattern that repeats. I want to catch each instance of that pattern, as long as there isn't a break in the repetition.)
EDIT:
To clarify, I don't necessarily know what SOMETHING_TOTALLY_DIFFERENT will be, only that it won't match a \d+. I want to collect all consecutive instances of \na \d+ that follow HEADER\n.
How about a simple loop?
import re
e = re.compile(r'(a\s+\d+)')
header = 'whatever your header field is'
breaker = 'something_different'
breaker_reached = False
header_reached = False
results = []
with open('yourfile.txt') as f:
    for line in f:
        # lines read from a file keep their trailing newline; drop it before comparing
        line = line.rstrip('\n')
        if line == header:
            # skip processing lines unless we reach the header
            header_reached = True
            continue
        if header_reached:
            i = e.match(line)
            if i and not breaker_reached:
                results.append(i.groups()[0])
            else:
                # There was no match, check if we reached the breaker
                if line == breaker:
                    breaker_reached = True
Not completely sure where you want the regex to stop; please clarify.
'((a \d*)\s){1,}'
import re
sentinel_begin = 'HEADER'
sentinel_end = 'SOMETHING_TOTALLY_DIFFERENT'
re.findall(r'(a \d*)', s[s.find(sentinel_begin): s.find(sentinel_end)])
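When the terminating line is not known in advance, a two-step approach might work: first capture the whole run of consecutive "a <number>" rows that directly follows HEADER, then extract the numbers from that run. This is a sketch under the assumptions stated in the question (rows look exactly like "a 17" and HEADER sits alone on its line); the shortened sample string and the variable names are illustrative.

```python
import re

s = """
Random Stuff
HEADER
a 1
a 3
a 17
RANDOM_NEW_HEADER
a 200
a 300
More random stuff
"""

# Step 1: HEADER at the start of a line, followed by one or more "a <digits>" rows.
# The repetition stops at the first line that breaks the pattern.
block = re.search(r'^HEADER((?:\na \d+)+)', s, re.MULTILINE)

# Step 2: pull the numbers out of the captured run only.
values = re.findall(r'a (\d+)', block.group(1)) if block else []
print(values)  # ['1', '3', '17']
```

Anchoring with ^ and re.MULTILINE keeps the HEADER inside RANDOM_NEW_HEADER from being treated as a header.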

Take first successful match from a batch of regexes

I'm trying to extract set of data from a string that can match one of three patterns. I have a list of compiled regexes. I want to run through them (in order) and go with the first match.
regexes = [
    compiled_regex_1,
    compiled_regex_2,
    compiled_regex_3,
]
m = None
for reg in regexes:
    m = reg.match(name)
    if m: break
if not m:
    print('ARGL NOTHING MATCHES THIS!!!')
This should work (haven't tested yet) but it's pretty fugly. Is there a better way of boiling down a loop that breaks when it succeeds or explodes when it doesn't?
There might be something specific to re that I don't know about that allows you to test multiple patterns too.
You can use the else clause of the for loop:
for reg in regexes:
    m = reg.match(name)
    if m: break
else:
    print('ARGL NOTHING MATCHES THIS!!!')
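A self-contained sketch of the for/else idiom follows. The two patterns and the function name describe are hypothetical stand-ins for the asker's compiled regexes; the else branch runs only when the loop finishes without hitting break.

```python
import re

# Hypothetical patterns standing in for compiled_regex_1, compiled_regex_2, ...
regexes = [re.compile(r'\d{4}-\d{2}'), re.compile(r'\d+')]

def describe(name):
    for reg in regexes:
        m = reg.match(name)
        if m:
            break            # first successful pattern wins
    else:
        return 'no match'    # reached only if no break happened
    return m.group()

print(describe('2024-05'))  # 2024-05
print(describe('123abc'))   # 123
print(describe('abc'))      # no match
```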
If you just want to know if any of the regex match then you could use the builtin any function:
if any(reg.match(name) for reg in regexes):
    ....
However, this will not tell you which regex matched.
Alternatively you can combine multiple patterns into a single regex with |:
regex = re.compile(r"(regex1)|(regex2)|...")
Again this will not tell you which regex matched, but you will have a match object that you can use for further information. For example you can find out which of the regex succeeded from the group that is not None:
>>> match = re.match("(a)|(b)|(c)|(d)", "c")
>>> match.groups()
(None, None, 'c', None)
However, this can get complicated if any of the sub-regexes have groups in them as well, since the group numbering will change.
This is probably faster than matching each regex individually since the regex engine has more scope for optimising the regex.
Since you have a finite set in this case, you could use short-circuit evaluation:
m = (compiled_regex_1.match(name) or
     compiled_regex_2.match(name) or
     compiled_regex_3.match(name) or
     print("ARGHHHH!"))
In Python 2.6 or better (on Python 3, itertools.ifilter is gone; the built-in filter is the lazy equivalent):
import itertools as it
m = next(it.ifilter(None, (r.match(name) for r in regexes)), None)
The ifilter call could be made into a genexp, but only a bit awkwardly, i.e., with the usual trick for name binding in a genexp (aka the "phantom nested for clause idiom"):
m = next((m for r in regexes for m in (r.match(name),) if m), None)
but itertools is generally preferable where applicable.
The bit needing 2.6 is the next built-in, which lets you specify a default value if the iterator is exhausted. If you have to simulate it in 2.5 or earlier,
def next(itr, deft):
    try: return itr.next()
    except StopIteration: return deft
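On Python 3 the same idea can be written with the built-in filter and next. The sketch below uses hypothetical patterns and a hypothetical name value purely for illustration.

```python
import re

# Hypothetical patterns, tried in order.
regexes = [re.compile(r'\d+'), re.compile(r'[a-z]+'), re.compile(r'\w+')]
name = 'abc123'

# filter(None, ...) lazily drops failed matches (None); next takes the first
# remaining match object, or the default None if every pattern failed.
m = next(filter(None, (r.match(name) for r in regexes)), None)
print(m.re.pattern, m.group())  # [a-z]+ abc

nothing = next(filter(None, (r.match('!!!') for r in regexes)), None)
print(nothing)  # None
```

Because both the generator expression and filter are lazy, patterns after the first successful one are never tried.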
I use something like Dave Kirby suggested, but add named groups to the regexps, so that I know which one matched.
regexps = {
    'first': r'...',
    'second': r'...',
}
compiled = re.compile('|'.join('(?P<%s>%s)' % item for item in regexps.items()))
match = compiled.match(my_string)
print(match.lastgroup)
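A runnable version of this idea might look as follows. The group names number and word and their patterns are hypothetical placeholders for the elided ones above; lastgroup reports the name of the alternative that matched.

```python
import re

# Hypothetical named alternatives, for illustration only.
regexps = {
    'number': r'\d+',
    'word': r'[a-z]+',
}
compiled = re.compile('|'.join('(?P<%s>%s)' % item for item in regexps.items()))

word_match = compiled.match('hello')
number_match = compiled.match('42')
print(word_match.lastgroup)    # word
print(number_match.lastgroup)  # number
```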
Eric is on a better track in taking in the bigger picture of what the OP is aiming for; I would use if/else, though. I also think that using the print function in an or expression is a little questionable. +1 to Nathon for correcting the OP to use a proper else clause.
Then my alternative:
# alternative to any builtin that returns useful result,
# the first considered True value
def first(seq):
    for item in seq:
        if item: return item
regexes = [
    compiled_regex_1,
    compiled_regex_2,
    compiled_regex_3,
]
m = first(reg.match(name) for reg in regexes)
print(m if m else 'ARGL NOTHING MATCHES THIS!!!')