regular expression: can ".*" match the same string as the ".*?" do? - regex

From Dive into Python3,
re.findall(' s.*? s', "The sixth sick sheikh's sixth sheep's sick.")
It explains that :
The regular expression looks for a space, an s, and then the shortest possible series of any character (.*?), then a space, then another s.
My question is : can .* match the same string as .*? do?

Yes. If the greedy match is identical to the lazy match.
>>> re.findall(' s.*? s', "The sixth sheik") == re.findall(' s.* s', "The sixth sheik")
True
But if greedy match is longer, you will get different results.
>>> re.findall(' s.*? s', "The sixth sick sheik") == re.findall(' s.* s', "The sixth sick sheik")
False
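As a rough illustration, here is a small sketch that runs both patterns over the full sentence quoted from Dive into Python 3. The variable names are made up for this example, and the expected lists in the comments apply to this exact sentence only.

```python
import re

text = "The sixth sick sheikh's sixth sheep's sick."

# Lazy: each match stops at the first " s" that completes the pattern.
lazy = re.findall(r' s.*? s', text)

# Greedy: .* runs to the end of the string, then backtracks to the last " s".
greedy = re.findall(r' s.* s', text)

print(lazy)    # [' sixth s', " sheikh's s", " sheep's s"]
print(greedy)  # [" sixth sick sheikh's sixth sheep's s"]
```

The two results differ because the sentence offers several possible end points for the match; with only one candidate end point they would coincide.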

My question is : can .* match the same string as .*? do?
Yes, if the string contains only one substring of the form ' s<anything> s', that is, if exactly one match is possible, so the greedy and lazy matches coincide.
Example:
>>> import re
>>> s = 'foo sgh s'
>>> re.findall(r' s.*? s', s)
[' sgh s']
>>> re.findall(r' s.* s', s)
[' sgh s']

No
check it over here
When remove Question mark:

Related

RegEx not recognized although it should be

I'm trying to split texts like these:
§1Hello§fman, §0this §8is §2a §blittle §dtest :)
by delimiter "§[a-z|A-Z
My first approach was the following:
^[§]{1}[a-fA-F]|[0-9]$
But pythex.org won't find any occurrences in my example text by using this regex.
Do you know why?
The ^[§]{1}[a-fA-F]|[0-9]$ pattern matches a string starting with § and then having a letter from a-f and A-F ranges, or a digit at the end of the string.
Note the ^ matches the start of the string, and $ matches the end of the string positions.
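To make the effect of the anchors and the alternation concrete, here is a minimal sketch that runs the original pattern and an unanchored variant over the sample text from the question. The variable names are only for illustration.

```python
import re

s = "§1Hello§fman, §0this §8is §2a §blittle §dtest :)"

# Original pattern: parsed as (^§[a-fA-F]) OR ([0-9]$).
# The string starts with "§1" (a digit, not a letter) and ends with ")",
# so neither branch can match.
anchored = re.findall(r'^[§]{1}[a-fA-F]|[0-9]$', s)

# Without anchors, and with the digits inside the same character class.
unanchored = re.findall(r'§[a-fA-F0-9]', s)

print(anchored)    # []
print(unanchored)  # ['§1', '§f', '§0', '§8', '§2', '§b', '§d']
```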
To extract those words after § and a hex char after it you may use
re.findall(r'§[A-Fa-f0-9]([^\W\d_]+)', s)
# => ['Hello', 'man', 'this', 'is', 'a', 'little', 'test']
To remove them, you may use re.sub:
re.sub(r'\s*§[A-Fa-f0-9]', ' ', s).strip()
# => Hello man, this is a little test :)
To just get a string of those delimiters you may use
"".join(re.findall(r'§[A-Fa-f0-9]', s))
# => §1§f§0§8§2§b§d
Details
§ - a § symbol
[A-Fa-f0-9] - 1 digit or ASCII letter from a-f and A-F ranges (hex char)
([^\W\d_]+) - Group 1 (this value will be extracted by re.findall): one or more letters (to include digits, remove \d)
Your regex uses anchors to assert the start and the end of the string ^$.
You could update your regex to §[a-fA-F0-9]
Example using split:
import re
s = "§1Hello§fman, §0this §8is §2a §blittle §dtest :)"
result = [r.strip() for r in re.split('[§]+[a-fA-F0-9]', s) if r.strip()]
print(result)

regex last repetition

I have a string, 0220110000AL0091, and I would like to replace the last 000 with three spaces.
So for 0220110000AL0091, I want to get 0220110   AL0091.
I don't know how to apply the regex between the 7th and 11th characters!
Thanks
You're looking for a negative lookahead.
You will look for a sequence of three zeros, not followed by a zero.
Here is how you could do it in Python (see the re module documentation on (?!...)):
>>> import re
>>> s = "0220110000AL0091"
>>> re.sub("000(?!0)", "   ", s)
'0220110   AL0091'
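The same idea can be wrapped in a small helper. This is only a sketch: the function name is hypothetical, and limiting the substitution with count=1 is an assumption that only the first such run should be replaced.

```python
import re

def blank_last_three_zeros(code):
    """Replace a run of three zeros that is not followed by another zero."""
    # count=1 is an assumption of this sketch: stop after the first such run.
    return re.sub(r"000(?!0)", "   ", code, count=1)

original = "0220110000AL0091"
updated = blank_last_three_zeros(original)
print(repr(updated))  # '0220110   AL0091'
```

Because three characters are replaced by three spaces, the string keeps its original length.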

python3: regex need to character to match but dont want in output

I have a string named
Set-Cookie: BIGipServerApp_Pool_SSL=839518730.47873.0000; path=/
I am trying to extract 839518730.47873.0000 from it. My regex works for this exact string, but if there is any digit before the first =, it all goes wrong.
No Digit
>>> m=re.search('[0-9.]+','Set-Cookie: BIGipServerApp_Pool_SSL=839518730.47873.0000; path=/')
>>> m.group()
'839518730.47873.0000'
With Digit
>>> m=re.search('[0-9.]+','Set-Cookie: BIGipServerApp_Pool_SSL2=839518730.47873.0000; path=/')
>>> m.group()
'2'
Is there any way I can extract only '839518730.47873.0000', no matter what else is in the string?
I tried
>>> m=re.search('=[0-9.]+','Set-Cookie: BIGipServerApp_Pool_SSL=839518730.47873.0000; path=/')
>>> m.group()
'=839518730.47873.0000'
As well, but then the output starts with '=' and I don't want that.
Any ideas.
Thank you.
If your substring always comes after the first =, you can just use a capture group with the =([\d.]+) pattern:
import re
result = ""
m = re.search(r'=([0-9.]+)','Set-Cookie: BIGipServerApp_Pool_SSL2=839518730.47873.0000; path=/')
if m:
    result = m.group(1)  # Get Group 1 value only
print(result)
The main point is that you match the part you do not need, and match and capture (with unescaped parentheses) the part of the pattern you do need. The value you need is in Group 1.
You can use word boundaries:
\b[\d.]+
Or to make match more targeted use lookahead for next semi-colon after your matched text:
\b[\d.]+(?=\s*;)
Update:
>>> m = re.search(r'\b[\d.]+', 'Set-Cookie: BIGipServerApp_Pool_SSL2=839518730.47873.0000; path=/')
>>> m.group(0)
'839518730.47873.0000'
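For comparison, here is a short sketch that tries the three patterns suggested in the answers (capture group, word boundary, word boundary plus lookahead) on the header that contains the extra digit. The variable names are made up for this example.

```python
import re

header = 'Set-Cookie: BIGipServerApp_Pool_SSL2=839518730.47873.0000; path=/'

# Capture group: "=" is matched but only the parenthesized part is returned.
by_capture = re.search(r'=([0-9.]+)', header).group(1)

# Word boundary: there is no boundary between "L" and "2" in "SSL2",
# so the match cannot start at the stray digit.
by_boundary = re.search(r'\b[\d.]+', header).group(0)

# Lookahead: additionally require a following semicolon without consuming it.
by_lookahead = re.search(r'\b[\d.]+(?=\s*;)', header).group(0)

print(by_capture, by_boundary, by_lookahead)
```

All three should give the same value for this header; which one is more robust depends on what other headers look like, which the question does not specify.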

Regex - How do you match everything except four digits in a row?

Using Regex, how do you match everything except four digits in a row? Here is a sample text that I might be using:
foo1234bar
baz 1111bat
asdf 0000 fdsa
a123b
Matches might look something like the following:
"foo", "bar", "baz ", "bat", "asdf ", " fdsa", "a123b"
Here are some regular expressions I've come up with on my own that have failed to capture everything I need:
[^\d]+ (this one includes a123b)
^.*(?=[\d]{4}) (this one does not include the line after the 4 digits)
^.*(?=[\d]{4}).* (this one includes the numbers)
Any ideas on how to get matches before and after a four digit sequence?
You haven't specified your app language, but practically every app language has a split function, and you'll get what you want if you split on \d{4}.
e.g. in Java:
String[] stuffToKeep = input.split("\\d{4}");
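A rough Python equivalent of the split approach, assuming the sample text from the question and that each line should be handled separately (the line-by-line handling and the dropping of empty pieces are assumptions of this sketch):

```python
import re

sample = """foo1234bar
baz 1111bat
asdf 0000 fdsa
a123b"""

# Split every line on a run of exactly four consecutive digits
# and keep the non-empty pieces.
pieces = [
    piece
    for line in sample.splitlines()
    for piece in re.split(r'\d{4}', line)
    if piece
]

print(pieces)
# ['foo', 'bar', 'baz ', 'bat', 'asdf ', ' fdsa', 'a123b']
```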
You can use a negative lookahead:
(?!\b\d{4}\b)(\b\w+\b)
In Python the following is very close to what you want:
In [1]: import re
In [2]: sample = '''foo1234bar
...: baz 1111bat
...: asdf 0000 fdsa
...: a123b'''
In [3]: re.findall(r"([^\d\n]+\d{0,3}[^\d\n]+)", sample)
Out[3]: ['foo', 'bar', 'baz ', 'bat', 'asdf ', ' fdsa', 'a123b']

Regex to catch a string without () in 3 patterns like abc(ef) ,(ef)abc and (ef)abc(gh)

I have tested this Regex
(?<=\))(.+?)(?=\()|(?<=\))(.+?)\b|(.+?)(?=\()
but it doesn't work for strings like this pattern (ef)abc(gh).
I got a result like this "(ef)abc".
But these 3 regexes (?<=\))(.+?)(?=\() , (?<=\))(.+?)\b, (.+?)(?=\()
do work separately for "(ef)abc(gh)", "(ef)abc" ,"abc(ef)" .
can anyone tell me where the problem is or how can I get the expected result?
Assuming you are looking to match the text from between the elements in parenthesis, try this:
^(?:\(\w*\))?([\w]*)(?:\(\w*\))?$
^ - beginning of string
(?:\(\w*\))? - optional non-capturing group: 0 or more word characters (letters, digits, underscore) within parens
([\w]*) - capturing group: 0 or more word characters
(?:\(\w*\))? - optional non-capturing group: 0 or more word characters within parens
$ - end of string
You haven't specified what language you might be using, but here is an example in Python:
>>> import re
>>> string = "(ef)abc(gh)"
>>> string2 = "(ef)abc"
>>> string3 = "abc(gh)"
>>> p = re.compile(r'^(?:\(\w*\))?([\w]*)(?:\(\w*\))?$')
>>> m = re.search(p, string)
>>> m2 = re.search(p, string2)
>>> m3 = re.search(p, string3)
>>> m.groups()[0]
'abc'
>>> m2.groups()[0]
'abc'
>>> m3.groups()[0]
'abc'
\([^)]+\)|([^()\n]+)
Try this. Just grab the capture group. See the demo:
https://regex101.com/r/tX2bH4/6
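In Python, "grabbing the capture group" might look like the sketch below. The helper name is hypothetical; it relies on re.findall returning an empty string for group 1 whenever the parenthesized alternative is the one that matched.

```python
import re

pattern = re.compile(r'\([^)]+\)|([^()\n]+)')

def outside_parens(text):
    # findall returns group 1 for each match; it is '' when the
    # "(...)" alternative matched, so drop the empty strings.
    return [g for g in pattern.findall(text) if g]

for text in ("(ef)abc(gh)", "(ef)abc", "abc(ef)"):
    print(text, outside_parens(text))
```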
Your problem is that (.+?)(?=\() matches "(ef)abc" in "(ef)abc(gh)".
The easiest solution to this problem is to be more explicit about what you are looking for, in this case by replacing "any character" (.) with "any character that is not a parenthesis" ([^\(\)]).
(?<=\))([^\(\)]+?)(?=\()|(?<=\))([^\(\)]+?)\b|([^\(\)]+?)(?=\()
A cleaner regexp would be
(?:(?<=^)|(?<=\)))([^\(\)]+)(?:(?=\()|(?=$))
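As a quick check of the three-branch pattern with the explicit character class, here is a sketch in Python (the language is an assumption, since the question does not name one). It uses re.search and takes the whole match, so it does not matter which branch succeeded.

```python
import re

# The three-branch pattern, with "." replaced by "[^\(\)]" in every branch.
pattern = re.compile(
    r'(?<=\))([^\(\)]+?)(?=\()'   # text between ")" and "("
    r'|(?<=\))([^\(\)]+?)\b'      # text after ")" up to a word boundary
    r'|([^\(\)]+?)(?=\()'         # text before "("
)

results = {
    text: pattern.search(text).group(0)
    for text in ("(ef)abc(gh)", "(ef)abc", "abc(ef)")
}
print(results)
```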