Why doesn't my regex match what I expect? - regex

I tried to implement the following code, but it prints something I don't expect:
import re
def regex_search(txt):
    lst = re.findall(r'(\d{1,3}\.){3}', txt)
    return lst
print(regex_search("123.45.67.89"))
It prints ['67.'] when I expected ['123.', '45.', '67.']. Where am I wrong? Help please.
Thanks in advance.

There is no need to even use regex here:
s = "123.45.67.89"  # avoid "input" as a name: it shadows the built-in
parts = s.split(".")
parts = [p + "." for p in parts]
parts = parts[:-1]  # drop the trailing "89."
print(parts)
['123.', '45.', '67.']

You should remove the quantifier {3} and also get rid of the extra grouping, and write this code:
import re
def regex_search(txt):
    lst = re.findall(r'\d{1,3}\.', txt)
    return lst
print(regex_search("123.45.67.89"))
This prints your expected output:
['123.', '45.', '67.']
Also, with (\d{1,3}\.){3} the engine matches \d{1,3}\. repeated exactly three times, so the whole match is 123.45.67., but group 1 keeps only its last repetition, which is 67. in your case. Since re.findall returns captured groups rather than whole matches, only that value appears in the list.
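If you want the whole repeated run rather than the last sub-group, one common fix (a sketch, not from the original answer) is to make the inner group non-capturing and wrap the repetition in an outer capturing group:

```python
import re

# The inner (?:...) group no longer captures; the outer (...) group
# captures the full "digits-dot repeated three times" run instead.
print(re.findall(r'((?:\d{1,3}\.){3})', "123.45.67.89"))  # ['123.45.67.']
```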

I changed your regex slightly, and now it should work.
import re
def regex_search(txt):
    lst = re.findall(r'(\d{1,3}\.)', txt)
    return lst
print(regex_search("123.45.67.89"))
The output will now be:
['123.', '45.', '67.']
You can also do it without regex, by splitting on . and ignoring the last element:
def regex_search(txt):
    items = txt.split('.')[:-1]
    return [item + '.' for item in items]
print(regex_search("123.45.67.89"))
The output will be:
['123.', '45.', '67.']

Related

python3 regular expression max length limit [duplicate]

I am trying to compile a big pattern with re.compile in Python 3.
The pattern is composed of 500 small words (I want to remove them from a text). The problem is that the pattern seems to stop after about 18 words.
Python doesn't raise any error.
What I do is:
stoplist = map(lambda s: "\\b" + s + "\\b", stoplist)
stopstring = '|'.join(stoplist)
stopword_pattern = re.compile(stopstring)
The stopstring is OK (all the words are in), but the pattern is much shorter. It even stops in the middle of a word!
Is there a max length for the regex pattern?
Consider this example:
import re
stop_list = map(lambda s: "\\b" + str(s) + "\\b", range(1000, 2000))
stopstring = "|".join(stop_list)
stopword_pattern = re.compile(stopstring)
If you try to print the pattern, you'll see something like
>>> print(stopword_pattern)
re.compile('\\b1000\\b|\\b1001\\b|\\b1002\\b|\\b1003\\b|\\b1004\\b|\\b1005\\b|\\b1006\\b|\\b1007\\b|\\b1008\\b|\\b1009\\b|\\b1010\\b|\\b1011\\b|\\b1012\\b|\\b1013\\b|\\b1014\\b|\\b1015\\b|\\b1016\\b|\\b1017\\b|\)
which seems to indicate that the pattern is incomplete. However, this just seems to be a limitation of the __repr__ and/or __str__ methods for re.compile objects. If you try to perform a match against the "missing" part of the pattern, you'll see that it still succeeds:
>>> stopword_pattern.match("1999")
<_sre.SRE_Match object; span=(0, 4), match='1999'>
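A quick sanity check along the same lines (a sketch, rebuilding the alternation from the answer) confirms that alternatives well past the truncated repr still match:

```python
import re

# Build the same 1000-word alternation; only the repr is truncated,
# the compiled pattern itself is complete.
pattern = re.compile("|".join(r"\b%d\b" % n for n in range(1000, 2000)))
print(pattern.match("1999") is not None)  # True: the last alternative works
print(pattern.match("2500"))              # None: not in the alternation
```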

Replace words following a marker with regex

I have two strings like this:
word=list()
word.append('The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3')
word.append('Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG')
I want to remove everything from VHSDVDRIP or DVDRIP onward: The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3 should become The.Eternal.Evil.of.Asia.1995. and Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG should become Guzoo.1986.
I tried the following but it doesn't work:
re.findall(r"\b\." + 'DVDRIP' + r"\b\.", word)
You could use re.split for that (regex101):
s = 'The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3'
import re
print(re.split(r'(\.[^.]*dvdrip\.)', s, maxsplit=1, flags=re.I)[0])
Prints:
The.Eternal.Evil.of.Asia.1995
Some test cases:
lst = ['The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3',
'Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG']
import re
for item in lst:
    print(re.split(r'(\.[^.]*dvdrip\.)', item, maxsplit=1, flags=re.I)[0])
Prints:
The.Eternal.Evil.of.Asia.1995
Guzoo.1986
If you wish to replace those instances with an empty string (which is what I'm guessing), this expression with the i (case-insensitive) flag should work:
import re
regex = r"(?i)(.*)(?:\w+)?dvdrip\W(.*)"
test_str = """
The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3
Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG
"""
subst = "\\1\\2"
print(re.sub(regex, subst, test_str))
Output
The.Eternal.Evil.of.Asia.1995.x264.AC3
Guzoo.1986.VHSx264.AC3.HS.ES-SHAG
The expression is explained on the top right panel of regex101.com, if you wish to explore, simplify, or modify it, and you can also watch there how it matches against some sample inputs.
Consider re.sub:
import re
films = ["The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3", "Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG"]
for film in films:
    print(re.sub(r'(.*)VHSDVDRiP.*|DVDRip.*', r'\1', film))
Output:
The.Eternal.Evil.of.Asia.1995.
Guzoo.1986.
Note: this leaves the trailing period, as requested.
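If the trailing period is unwanted after all (an assumption about the desired output, not stated in the question), a small variation can consume the dot as part of the match:

```python
import re

films = ["The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3",
         "Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG"]
# Match the dot, an optional VHS prefix, DVDRip, and everything after,
# case-insensitively, and delete it all.
for film in films:
    print(re.sub(r'\.(?:VHS)?DVDRip.*', '', film, flags=re.I))
# The.Eternal.Evil.of.Asia.1995
# Guzoo.1986
```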

How to get numbers from repeated groups using the re module in Python 3

Assume there is a string like s = 'add/10/20/30/4/3/9/' or s = 'add/10/20/30/', which starts with 'add' and is followed by numbers (not sure how many; I only know they repeat at least 3 times).
I want to get them as: ['10', '20', ...]
I tried to use re: r = re.compile(r"add/(?:(\d+)/){3,}")
However, only the last number is matched and returned.
>>> r.findall(s)
['9']
So what's the problem, and how do I fix it? Thanks in advance.
Is regex a must? The string split method should be faster here if you have such simple patterns:
s = "add/10/20/30/4/3/9/"
nums = [num for num in s.split('/')[1:] if num]
The regex pattern would simply be:
re.findall(r'\d+', s)
This returns all the numerical sequences in string s.
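If the 'add' prefix and the at-least-three-numbers constraint actually matter, one possible sketch (the helper name is hypothetical) validates first and then extracts:

```python
import re

def nums_after_add(s):
    # Require the "add/" prefix and at least three "<digits>/" groups,
    # then pull the individual numbers out of the captured run.
    m = re.match(r'add/((?:\d+/){3,})$', s)
    return re.findall(r'\d+', m.group(1)) if m else []

print(nums_after_add('add/10/20/30/4/3/9/'))  # ['10', '20', '30', '4', '3', '9']
print(nums_after_add('add/10/20/'))           # []: only two numbers
```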
re.findall(r"[0-9]+", s)

Python: get the string between two capitals

I'd like your opinion, as you are probably more experienced with Python than I am.
I came from C++ and I'm still not used to the Pythonic way of doing things.
I want to loop over a string between 2 capital letters. For example, I could do it this way:
i = 0
str = "PythonIsFun"
for i, z in enumerate(str):
    if z.isupper():
        small = ''
        x = i + 1
        while not str[x].isupper():
            small += str[x]
I wrote this on my phone, so I don't know if this even works, but you get the idea, I presume.
I need your help to get the best result here: not just easy on the CPU, but clean code too. Thank you very much.
This is one of those times when regexes are the best bet.
(And don't call a string str, by the way: it shadows the built-in.)
import re

s = 'PythonIsFun'
result = re.search('[A-Z]([a-z]+)[A-Z]', s)
if result is not None:
    print(result.groups()[0])
You could use regular expressions:
import re
txt = 'PythonIsFun'
re.findall(r'[A-Z]([^A-Z]+)[A-Z]', txt)
outputs ['ython'], and
re.findall(r'(?=[A-Z]([^A-Z]+)[A-Z])', txt)
outputs ['ython', 's']; and if you just need the first match,
re.search(r'[A-Z]([^A-Z]+)[A-Z]', txt).group(1)
You can use a list comprehension to do this easily.
>>> s = "PythonIsFun"
>>> u = [i for i,x in enumerate(s) if x.isupper()]
>>> s[u[0]+1:u[1]]
'ython'
If you can't guarantee that there are two upper case characters you can check the length of u to make sure it is at least 2. This does iterate over the entire string, which could be a problem if the two upper case characters occur at the start of a lengthy string.
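To address that caveat, one possible variation (a sketch with a hypothetical helper name) stops scanning as soon as the second capital is found:

```python
from itertools import islice

def between_first_two_capitals(s):
    # Lazily find capital positions and stop after the first two,
    # so the rest of a long string is never scanned.
    caps = (i for i, ch in enumerate(s) if ch.isupper())
    idx = list(islice(caps, 2))
    return s[idx[0] + 1:idx[1]] if len(idx) == 2 else None

print(between_first_two_capitals("PythonIsFun"))  # ython
```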
There are many ways to tackle this, but I'd use regular expressions.
This example will take "PythonIsFun" and return "ythonsun"
import re
text = "PythonIsFun"
pattern = re.compile(r'[a-z]')       # look for all lower-case characters
matches = re.findall(pattern, text)  # returns a list of lower-case characters
lower_string = ''.join(matches)      # turns the list into a string
print(lower_string)
outputs:
ythonsun

Take first successful match from a batch of regexes

I'm trying to extract a set of data from a string that can match one of three patterns. I have a list of compiled regexes. I want to run through them (in order) and go with the first match.
regexes = [
compiled_regex_1,
compiled_regex_2,
compiled_regex_3,
]
m = None
for reg in regexes:
    m = reg.match(name)
    if m: break
if not m:
    print('ARGL NOTHING MATCHES THIS!!!')
This should work (I haven't tested it yet), but it's pretty ugly. Is there a better way of boiling down a loop that breaks when it succeeds or explodes when it doesn't?
There might be something specific to re that I don't know about that allows you to test multiple patterns.
You can use the else clause of the for loop:
for reg in regexes:
    m = reg.match(name)
    if m: break
else:
    print('ARGL NOTHING MATCHES THIS!!!')
If you just want to know whether any of the regexes match, then you could use the built-in any function:
if any(reg.match(name) for reg in regexes):
    ...
however this will not tell you which regex matched.
Alternatively you can combine multiple patterns into a single regex with |:
regex = re.compile(r"(regex1)|(regex2)|...")
Again this will not tell you which regex matched, but you will have a match object that you can use for further information. For example you can find out which of the regex succeeded from the group that is not None:
>>> match = re.match("(a)|(b)|(c)|(d)", "c")
>>> match.groups()
(None, None, 'c', None)
However, this can get complicated if any of the sub-regexes have groups in them as well, since the numbering will change.
This is probably faster than matching each regex individually since the regex engine has more scope for optimising the regex.
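When the branches contain no nested groups, Match.lastindex offers a quick way to see which alternative fired; a small sketch:

```python
import re

combined = re.compile(r"(a+)|(b+)|(c+)")
m = combined.match("ccc")
# lastindex is the 1-based number of the last group that matched;
# here only group 3 participated in the match.
print(m.lastindex)           # 3
print(m.group(m.lastindex))  # ccc
```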
Since you have a finite set in this case, you could use short-circuit evaluation (note the parentheses, which are needed to continue the expression across lines):
m = (compiled_regex_1.match(name)
     or compiled_regex_2.match(name)
     or compiled_regex_3.match(name)
     or print("ARGHHHH!"))
In Python 2.6 or better:
import itertools as it
m = next(it.ifilter(None, (r.match(name) for r in regexes)), None)
The ifilter call could be made into a genexp, but only a bit awkwardly, i.e., with the usual trick for name binding in a genexp (aka the "phantom nested for clause idiom"):
m = next((m for r in regexes for m in (r.match(name),) if m), None)
but itertools is generally preferable where applicable.
The bit needing 2.6 is the next built-in, which lets you specify a default value if the iterator is exhausted. If you have to simulate it in 2.5 or earlier,
def next(itr, deft):
    try: return itr.next()
    except StopIteration: return deft
I use something like Dave Kirby suggested, but add named groups to the regexes, so that I know which one matched.
regexps = {
    'first': r'...',
    'second': r'...',
}
compiled = re.compile('|'.join('(?P<%s>%s)' % item for item in regexps.items()))
match = compiled.match(my_string)
print(match.lastgroup)
Eric is on a better track in taking the bigger picture of what the OP is aiming for, though I would use if/else. I also think that using the print function in an or expression is a little questionable. +1 to Nathon for correcting the OP to use a proper else statement.
Then my alternative:
# alternative to the any builtin that returns a useful result:
# the first value considered True
def first(seq):
    for item in seq:
        if item: return item

regexes = [
    compiled_regex_1,
    compiled_regex_2,
    compiled_regex_3,
]

m = first(reg.match(name) for reg in regexes)
print(m if m else 'ARGL NOTHING MATCHES THIS!!!')