Regexp matching a string - positive lookahead - regex

Regexp: (?=(\d+))\w+\1
String: 456x56
Hi,
I don't understand how this regex matches "56x56" in the string "456x56".
The lookahead, (?=(\d+)), captures 456 and puts it into \1, for (\d+)
The word characters, \w+, match the whole string ("456x56")
\1, which is 456, must follow \w+
After backtracking through the string, it should not find a match, as there is no "456" preceded by a word character
However the regexp matches 56x56.

5) The regex engine concludes that it cannot find a match if it starts searching from 4, so it skips one character and searches again. This time, it captures two digits into \1 and ends up matching 56x56
If you want to match only whole strings, use ^(?=(\d+))\w+\1$
^ matches beginning of string
$ matches end of string
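As a quick check of the difference, here is a minimal Python sketch (Python's re module is an assumption here; the question does not name a regex flavor). It runs the unanchored and the anchored pattern against the same input.

```python
import re

text = "456x56"

# Unanchored: the engine is free to start the match at any position.
unanchored = re.search(r"(?=(\d+))\w+\1", text)
print(unanchored.group(0), unanchored.group(1))  # 56x56 56

# Anchored: the match has to cover the whole string, so it fails here.
anchored = re.search(r"^(?=(\d+))\w+\1$", text)
print(anchored)  # None

# The anchored pattern still accepts a string that has the right shape.
whole = re.search(r"^(?=(\d+))\w+\1$", "56x56")
print(whole.group(0))  # 56x56
```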

You don't anchor your regex, as has been said. Another problem is that \w also matches digits... Now look at how the regex engine proceeds to match with your input:
# begin
regex: |(?=(\d+))\w+\1
input: |456x56
# lookahead (first group = '456')
regex: (?=(\d+))|\w+\1
input: |456x56
# \w+
regex: (?=(\d+))\w+|\1
input: 456x56|
# \1 cannot be satisfied: backtrack on \w+
regex: (?=(\d+))\w+|\1
input: 456x5|6
# And again, and again... Until the beginning of the input: \1 cannot match
# Regex engine therefore decides to start from the next character:
regex: |(?=(\d+))\w+\1
input: 4|56x56
# lookahead (first group = '56')
regex: (?=(\d+))|\w+\1
input: 4|56x56
# \w+
regex: (?=(\d+))\w+|\1
input: 456x56|
# \1 cannot be satisfied: backtrack
regex: (?=(\d+))\w+|\1
input: 456x5|6
# \1 cannot be satisfied: backtrack
regex: (?=(\d+))\w+|\1
input: 456x|56
# \1 satisfied: match
regex: (?=(\d+))\w+\1|
input: 4<56x56>
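The trace above can be reproduced roughly in Python (an assumption, since the question does not name an engine). Pattern.match(string, pos) only attempts a match that begins exactly at pos, so looping over every start position shows which attempts fail and which succeed.

```python
import re

pattern = re.compile(r"(?=(\d+))\w+\1")
text = "456x56"

attempts = []
for start in range(len(text)):
    # Try a match that must begin exactly at this position.
    m = pattern.match(text, start)
    attempts.append((start,
                     m.group(0) if m else None,
                     m.group(1) if m else None))

for start, matched, group1 in attempts:
    print(start, matched, group1)
# 0 None None
# 1 56x56 56
# 2 6x56 6
# 3 None None
# 4 None None
# 5 None None
```

A search reports the leftmost successful start position, which is why the overall result is 56x56 and not 6x56.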

The points you listed are almost entirely, but not quite, wrong!
1) The group (?=(\d+)) matches a sequence of one or more digits, not necessarily 456
2) \w matches word characters: letters, digits and the underscore, not letters only
3) \1 is a back reference to the match in the group
So the whole expression means: find a position where a sequence of digits starts, then match a sequence of word characters (which may include those digits) that is followed by the same sequence of digits that the lookahead captured. Hence the match 56x56.

Well that's what makes it a positive lookahead
(?=(\d+))\w+\1
You are correct when you say the first \d+ will match 456, so \1 must also be 456, but if that's the case: the expression won't match the string.
The only characters common to both sides of the x are 56, and that is what the engine settles on to get a positive match.

The operator + is greedy and backtracks as necessary. The lookahead (?=(\d+)) will match 456, then 56 if the regex fails, then 6 if it fails again. First attempt: 456. It matches, and group 1 contains 456. Then we have \w+, which is greedy and takes 456x56; there is nothing left, but we still have to match \1, i.e. 456. Thus: failure. Then \w+ backtracks one step at a time until we get back to the beginning of the regex. And it still fails.
We consume a character from the string. The next attempt tries the lookahead at the substring 56x56: it matches, and group 1 contains 56. \w+ matches until the end of the string and gets 56x56, and then we try to match 56: failure. So \w+ backtracks until 56 is left in the string, and then we have a global match and regex success.
You should try it with RegexBuddy's debug mode.

Related

Using regex to duplicate a selection and replacing some characters

Probably a terrible title.
I am trying to take the following:
Joe Dane
Bob Sagget
Whitney Houston
Some
Other
Test
And trying to produce:
JOE_DANE("Joe Dane"),
BOB_SAGGET("Bob Sagget"),
WHITNEY_HOUSTON("Whitney Houston"),
SOME("Some"),
OTHER("Other"),
TEST("Test"),
I'm using Notepad++ and am close but not good enough at regex to figure out the remaining expression. So far, this is what I have:
Find what: (^.*)
Replace with: \1 \(\"\1\"\),
Produces: Joe Dane("Joe Dane"),
I've tried replacing with: \U$1 \(\"\1\"\), but this also upper-cases the second instance of \1. It also does not replace the whitespace with an underscore _.
This can be done in a single step.
If you don't have more than 2 words in a line:
Ctrl+H
Find what: ^(\S+)(?: (\S+))?$
Replace with: \U$1(?2_$2)\E\("$0"\),
CHECK Wrap around
CHECK Regular expression
Replace all
Explanation:
^ # beginning of line
(\S+) # group 1, 1 or more non space
(?: (\S+))? # non capture group, a space, group 2, 1 or more non space, optional
$ # end of line
Replacement:
\U # uppercased
$1 # group 1
(?2_$2) # if group 2 exists, add an underscore before it
\E # end uppercase
\("$0"\), # the whole match with parens and quote
If you have more than 2 words (up to 5), use:
Find ^(\S+)(?: (\S+))?(?: (\S+))?(?: (\S+))?(?: (\S+))?
Replace: \U$1(?2_$2)(?3_$3)(?4_$4)(?5_$5)\E\("$0"\),
If you have more than five words, add as many (?: (\S+))? as needed.
You might do it in 2 steps, first matching any char 1 or more times from the start of the string.
Find what
^.+
For the first replacement you can use \E to end the activation of \U and use the full match $0
Replace with
\U$0\E\("$0"\),
For the second step, to replace the spaces with underscores, you could skip over the text between parentheses, and match spaces between uppercase chars.
Find what
\(".*?"\)(*SKIP)(*F)|[A-Z]+\K\h+(?=[A-Z])
\(".*?"\) Match from (" till ")
(*SKIP)(*F)| Skip this part of the match
[A-Z]+\K Match uppercase chars and use \K to clear the current match buffer (forget what was matched so far)
\h+(?=[A-Z]) Match 1+ horizontal whitespace chars and assert an uppercase char to the right
Replace with _
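For comparison, the same transformation can be sketched outside Notepad++. The Python below is only an illustration: the helper name to_enum_line is hypothetical, and it does not use the \U, \E, \K or (*SKIP)(*F) constructs from the answers above, which Python's re module does not provide. It handles any number of words per line.

```python
import re

names = ["Joe Dane", "Bob Sagget", "Whitney Houston", "Some", "Other", "Test"]

def to_enum_line(name):
    # Upper-case the name and turn each run of whitespace into one underscore.
    constant = re.sub(r"\s+", "_", name.strip().upper())
    # Keep the original spelling inside the quoted argument.
    return f'{constant}("{name}"),'

lines = [to_enum_line(n) for n in names]
print("\n".join(lines))
```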

Regex: Match partial string

I need some help - my skills here fall short :) (and I don't know if it is possible with pure regex)
Case: I have some text inputs in the form of:
input1: "abc,clutter,01;xyz,clutter,02;" (should match)
input2: "abc,clutter,02;zyz,clutter,01;" (no match)
input3: "abc,clutter,02;abc,txt,txt,01;xyz,clutter,01" (should match)
The match should be:
Starts with abc (anywhere in the input)
Everything in between - unless ,02; is in-between
Ends with ,01;
So something like:
abc(.*)(?!,02;),01;
.. but this also matches input2, and that was not the intention :)
You might use for example a repeating pattern matching all chars except , and ;
\babc(?:,(?!02,)[^,;\n]+)*,01;
\babc A word boundary, match abc
(?: Non capture group
,(?!02,)[^,;\n]+ Negative lookahead, assert not 02, and match any char except , ; or a newline
)* Close the group and optionally repeat
,01; Match literally
If abc should only be matched once, you can also add that to the negative lookahead
\babc(?:,(?!(?:02|abc),)[^,;\n]+)*,01;
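To see how the field-by-field pattern behaves on the three inputs from the question, here is a small Python sketch (Python is an assumption; the variable names naive and fieldwise are made up for this example). It compares the question's original attempt with the first pattern suggested above.

```python
import re

inputs = [
    "abc,clutter,01;xyz,clutter,02;",
    "abc,clutter,02;zyz,clutter,01;",
    "abc,clutter,02;abc,txt,txt,01;xyz,clutter,01",
]

# The attempt from the question: .* can run across ';' into the next record.
naive = re.compile(r"abc(.*)(?!,02;),01;")
# The suggested pattern: each repetition consumes one comma-separated field
# and can never cross a ';'.
fieldwise = re.compile(r"\babc(?:,(?!02,)[^,;\n]+)*,01;")

results = []
for text in inputs:
    n = naive.search(text)
    f = fieldwise.search(text)
    results.append((n.group(0) if n else None, f.group(0) if f else None))

for naive_match, fieldwise_match in results:
    print(naive_match, "|", fieldwise_match)
```

The second input is matched in full by the naive pattern but not at all by the field-by-field one, and the third input yields abc,txt,txt,01; from the second abc.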

How to exclude non-numeric character in regex

I have a string which goes like this
Section 78(1) of the blabla
This is my regex
\b\s(?!\b(\d{1,3}|\d{1,2}[a-zA-Z]|\d{5,})\b)\b\S*
Expected output is: of the blabla
This regex works but it does not exclude "of" because of the (). Can anyone help me? Thank you
Try this pattern: .+\d\)?
Explanation:
.+ - match any character one or more times
\d - match digit
\)? - match ) zero or one time
Because + is greedy, it will match up to the last digit; if that digit is inside brackets, the closing bracket after it is matched as well.
Alternatively use \d+(?:\(\d+\))?(.+)
Then desired output is in first capturing group.
It seems all you need to change is to remove the \b before \S* and replace the \S* with .+ or .* (if the match can be an empty string).
\s(?!\b(?:\d{1,3}|\d{1,2}[a-zA-Z]|\d{5,})\b)(.+)
See the regex demo, grab Group 1 value. Note I turned the first group matching digits in the negative lookahead into a non-capturing group to avoid clutter in the resulting match list.
VB.NET demo:
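Imports System.Text.RegularExpressions

Dim r As New Regex("\s(?!\b(?:\d{1,3}|\d{1,2}[a-zA-Z]|\d{5,})\b)(.+)")
Dim s As String
s = "Section 78(1) of the blabla"
' Print group 1 of every match: the text after the section number
For Each m As Match In r.Matches(s)
    Console.WriteLine(m.Groups(1).Value)
Next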
Result: of the blabla.
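The same check can be sketched in Python for readers without a .NET setup (an assumption; the question itself uses VB.NET). It also runs the alternative pattern from the earlier answer, whose group 1 keeps the leading space.

```python
import re

s = "Section 78(1) of the blabla"

# Pattern from the VB.NET demo: skip a whitespace that is followed by a
# section number, then capture the rest of the line.
m = re.search(r"\s(?!\b(?:\d{1,3}|\d{1,2}[a-zA-Z]|\d{5,})\b)(.+)", s)
tail = m.group(1)
print(repr(tail))  # 'of the blabla'

# Alternative pattern: the capture starts right after "78(1)".
alt = re.search(r"\d+(?:\(\d+\))?(.+)", s).group(1)
print(repr(alt))  # ' of the blabla'
```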

Why is this regex selecting this text

I am using the regex
(.*)\d.txt
on the expression
MyFile23.txt
Now the online tester says that the string is matched by this regex. My understanding is that it should not match, because there are two digits, 2 and 3, while the regex contains only one digit token, \d. To match, the regex should have been \d+. As I read my current expression: zero or more of any character, followed by one digit, followed by .txt. My question is: why does the string pass this regex?
This regex (.*)\d.txt will still match MyFile23.txt because of .* which will match 0 or more of any character (including a digit).
So for the given input: MyFile23.txt here is the breakup:
.* # matches MyFile2
\d # matches 3
. # matches a dot (though it can match anything here due to unescaped dot)
txt # will match literal txt
To make sure it only matches MyFile2.txt you can use:
^\D*\d\.txt$
Where ^ and $ are anchors to match start and end. \D* will match 0 or more non-digit.
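A short Python sketch (Python is an assumption; the question only mentions an online tester) makes the breakup visible and shows the effect of the unescaped dot. The names loose and strict are made up for this example.

```python
import re

loose = re.compile(r"(.*)\d.txt")       # pattern from the question
strict = re.compile(r"^\D*\d\.txt$")    # suggested fix

m = loose.fullmatch("MyFile23.txt")
print(m.group(1))  # MyFile2  (.* swallowed the first digit)

# The unescaped dot accepts any character before "txt".
print(loose.fullmatch("MyFile2Xtxt") is not None)  # True

# The strict pattern allows exactly one digit and a literal dot.
print(strict.search("MyFile23.txt"))               # None
print(strict.search("MyFile2.txt") is not None)    # True
```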
The pattern you have has one group, (.*), which with your example matches MyFile2, because the . allows any character.
Furthermore, the . in the pattern after this group is not escaped, which allows another character of any kind in that position.
To avoid this use:
(\D*)\d+\.txt
The group (\D*) would now match only non-digit characters.
Here is the explanation of why your "MyFile23.txt" matches the regex pattern:
A literal period . should always be escaped as \. else it will match "any character".
And finally, (.*) matches all the string from the beginning to the last digit (MyFile2). Have a look at the "MATCH INFORMATION" area on the right at this page.
So, I'd suggest the following fix:
^\D*\d\.txt$ = beginning of a line/string, non-digit character, any number of repetitions, a digit, a literal period, a literal txt, and the end of the string/line (depending on the m switch, which depends on the input string, whether you have a list of words on separate lines, or just a separate file name).

How can I detect last digits in python string

I need to detect the last digits in a string, as they are indexes for my strings. They may be as large as 2^64, so it is not convenient to check only the last character of the string, then the second to last, and so on.
A string may look like asdgaf1_hsg534, i.e. there may be other digits in the string too, but they are somewhere in the middle and not adjacent to the index I want to get.
Here is a method using re.sub:
import re
strings = ['asdgaf1_hsg534', 'asdfh23_hsjd12', 'dgshg_jhfsd86']
for s in strings:
    # Replace the whole string with just the captured trailing digits.
    print(re.sub(r'.*?([0-9]*)$', r'\1', s))
Output:
534
12
86
Explanation:
The function takes a regular expression, a replacement string, and the string you want to do the replacement on: re.sub(regex, replace, string)
The regex '.*?([0-9]*)$' matches the whole string and captures the number that precedes the end of the string. Parentheses are used to capture the parts of the match we are interested in; \1 refers to the first capture group, \2 to the second, etc.
.*? # Matches anything (non-greedy)
([0-9]*) # Up to zero or more digits (captured)
$ # Followed by the end-of-string anchor
So we are replacing the whole string with just the captured number we are interested in. In Python we need to use raw strings for this: r'\1'. If the string doesn't end with digits then a blank string will be returned.
twosixfour = "get_the_numb3r_2_^_64__18446744073709551615"
print(re.sub(r'.*?([0-9]*)$', r'\1', twosixfour))
# prints 18446744073709551615
A simple regex can detect digits at the end of the string:
'\d+$'
$ matches the end of the string. \d+ matches one or more digits. The + operator is greedy by default, meaning it matches as many digits as possible. So this will match all of the digits at the end of the string.
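A minimal sketch of this approach with re.search follows. The helper name trailing_index is hypothetical, and converting the digits with int() is an assumption about how the index will be used; Python integers have no fixed width, so a value near 2^64 is not a problem.

```python
import re

def trailing_index(s):
    # Find the run of digits that touches the end of the string, if any.
    m = re.search(r"\d+$", s)
    return int(m.group(0)) if m else None

samples = [
    "asdgaf1_hsg534",
    "get_the_numb3r_2_^_64__18446744073709551615",
    "test",
]
indexes = [trailing_index(s) for s in samples]
print(indexes)  # [534, 18446744073709551615, None]
```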
If you want to use re.sub and make sure that at least one digit is present at the end of the line, use the quantifier + to match 1 or more digits (\d+). That way the line is left unchanged, rather than emptied, when it has no digits at the end.
^.*?(\d+)$
^ Start of line
.*? Match any char except a newline, as few as possible (non-greedy)
(\d+) Capture group 1, match 1+ digits
$ End of line
Or using a negative lookbehind
^.*(?<!\d)(\d+)$
^ Start of line
.* Match any char except a newline as much as possible
(?<!\d)(\d+) Assert no digits directly to the left, then capture 1+ digits in group 1
$ End of line
When using re.match, you can omit the ^ anchor, and you might also use \A and \Z to assert the start and the end of the string.
import re
strings = ['asdgaf1_hsg534', 'asdfh23_hsjd12', 'dgshg_jhfsd86', 'test']
for s in strings:
    # Strings without trailing digits do not match and are left unchanged.
    print(re.sub(r".*?(\d+)$", r'\1', s))
Output
534
12
86
test
If a non-digit must be present before the digits (as asked in a comment), you could use a negated character class with a single capture group.
^.*[^\d\r\n](\d+)
^ Start of line
.* Match any char except a newline as much as possible
[^\d\r\n] Negated character class, match any char except a digit or a newline
(\d+) Capture group 1, match 1+ digits
To get the last digits in the string (not necessarily at the end of the string)
^.*?(\d+)[^\r\n\d]*$
^ Start of line
.*? Match any char except a newline, as few as possible (non-greedy)
(\d+) Capture group 1, match 1+ digits
[^\r\n\d]* Negated character class, match 0+ times any char except a newline or digit
$ End of line
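As an illustration of this last pattern, the sketch below runs it on a few made-up sample strings (the samples other than asdgaf1_hsg534 are assumptions, not from the question). Group 1 holds the final run of digits even when non-digit text follows it.

```python
import re

last_run = re.compile(r"^.*?(\d+)[^\r\n\d]*$")

samples = ["asdgaf1_hsg534", "abc12def345ghi", "no digits"]
found = []
for s in samples:
    m = last_run.search(s)
    # Group 1 is the last run of digits on the line, or None if there is none.
    found.append(m.group(1) if m else None)

print(found)  # ['534', '345', None]
```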