Understanding behavior of regex when using consecutive \d and \w [duplicate] - regex

This question already has answers here:
Greedy vs. Reluctant vs. Possessive Qualifiers
(7 answers)
Reference - What does this regex mean?
(1 answer)
Closed 2 years ago.
I'm trying to understand the behavior of regex when using \d and \w consecutively to match words and numbers in a sentence. I searched for similar questions but I couldn't find a good match (please let me know if this is somehow duplicate).
# Example sentence
"Adam has 100 friends. Bill has 23 friends. Cindy has 5 friends."
When I use regex [A-Za-z]+\s\w+\s\d+\w, it returns matches for:
Adam has 100
Bill has 23
BUT NOT FOR
Cindy has 5
I would have expected no matches at all since the greedily searched digits (\d+) are not followed by any word character (\w); they are followed by a white space instead. I think, somehow \w is matching digits following the first occurrence of any digit. I thought \d+ would have exhausted the stretch of digits in the search. Can you help me understand what is going on here?
Thanks

I thought \d+ would have exhausted the stretch of digits in the search
No, that is not the case. \d+ first matches as many digits as it can, but the following \w also matches digits (it is equivalent to [a-zA-Z_0-9]). Since \w still needs one character, the regex engine backtracks \d+ by one position so that \w can match the last digit.
If you don't want this backtracking to happen, use the possessive quantifier ++ (in engines that support it):
[A-Za-z]+\s\w+\s\d++\w
However, note that the \d++\w pattern will then fail for all 3 cases, because \d++ won't backtrack and \w will never be able to match a digit.
The pattern will succeed only if a non-digit word character follows the digits, as in Chapter is 23A.
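The backtracking described above can be observed with a small Python sketch. The extra capture groups are added only for illustration, to show which part of the digits each token ends up with. Possessive quantifiers need Python 3.11+ in the re module, so the second pattern emulates \d++ with a lookahead plus backreference, which is a workaround and an assumption of this sketch, not the pattern from the answer.

```python
import re

text = "Adam has 100 friends. Bill has 23 friends. Cindy has 5 friends."

# Groups added for illustration: group 1 is what \d+ keeps, group 2 is what \w takes.
backtracking = re.compile(r"[A-Za-z]+\s\w+\s(\d+)(\w)")
splits = [(m.group(0), m.group(1), m.group(2)) for m in backtracking.finditer(text)]
for whole, digits, word_char in splits:
    print(whole, "| digits kept:", digits, "| taken by the word token:", word_char)

# Emulation of a possessive \d++: the lookahead captures the full digit run
# and cannot be re-entered, so \1 always consumes every digit.
no_backtrack = re.compile(r"[A-Za-z]+\s\w+\s(?=(\d+))\1\w")
strict_on_sentence = no_backtrack.findall(text)
strict_on_chapter = no_backtrack.search("Chapter is 23A")
print(strict_on_sentence)
print(strict_on_chapter.group(0))
```

With the plain pattern, "Cindy has 5" fails because \d+ must keep at least one digit and nothing is left for \w.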

Related

Regex that finds an expression that starts and ends with the same word, separated by 10 characters

My question is about a regex that finds an expression that starts and ends with the same word, with the two occurrences separated by up to 10 characters.
I've tried this regex
\b(\w).{1,10}(\s\1)\b
to solve my problem but it doesn't work.
Thanks in advance for your answers
You can use
\b(\w+)\b.{0,10}\b\1\b
Mind that in many programming environments, backslashes are used as part of string escape sequences and thus need doubling in regular string literals.
Details
\b - a word boundary
(\w+) - Capturing group #1: one or more word chars
\b - a word boundary
.{0,10} - zero to ten chars other than line break chars, as many as possible
\b\1\b - same whole word as in Group 1.
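As a quick check of the pattern explained above, here is a minimal Python sketch. The sample sentences and the helper name find_repeat are made up for illustration.

```python
import re

# Raw string, so the backslashes do not need doubling.
pattern = re.compile(r"\b(\w+)\b.{0,10}\b\1\b")

def find_repeat(text):
    """Return the matched span, or None if no word repeats within 10 chars."""
    m = pattern.search(text)
    return m.group(0) if m else None

near = find_repeat("the cat and the dog")       # 9 chars between the two "the"
far = find_repeat("the elephant and the dog")   # 14 chars between them
print(near)
print(far)
```

The second sentence returns None because the gap between the two occurrences is longer than the 10 characters allowed by .{0,10}.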

Sublime regular expression to match wx.data.xxx but not wx.data.yyy [duplicate]

This question already has answers here:
Regular expression to match a line that doesn't contain a word
(34 answers)
Closed 3 years ago.
I want to find all instances where
wx.data.xxx
but not
wx.data.update OR wx.data.get
I tried (wx.data.)[^update][^get] but it doesn't work.
Use negative lookahead instead of a negative character set:
wx\.data\.(?!update|get)[^.\n]+
https://regex101.com/r/chy79T/1
You could use a negative lookahead, (?!...), asserting that what is directly to the right is not get or update.
\S+ matches a non-whitespace char 1+ times. If you don't want to match a dot either, you could use [^.\s]+ instead.
\bwx\.data\.(?!get|update)\S+
If, for example, updates or gets should be allowed, you could add a word boundary \b inside the lookahead:
\bwx\.data\.(?!(?:get|update)\b)\S+
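The difference between the two lookahead variants can be seen with a short Python sketch. The sample string is invented for illustration.

```python
import re

text = "wx.data.foo wx.data.update wx.data.get wx.data.updates"

# Without \b in the lookahead, anything that merely starts with get/update is rejected.
without_boundary = re.findall(r"\bwx\.data\.(?!get|update)\S+", text)

# With \b, only the exact words get and update are rejected.
with_boundary = re.findall(r"\bwx\.data\.(?!(?:get|update)\b)\S+", text)

print(without_boundary)
print(with_boundary)
```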

How do I force a string to have both digits and alpha chars with regexp [duplicate]

This question already has answers here:
Regex pattern to match at least 1 number and 1 character in a string
(10 answers)
Closed 6 years ago.
I am trying to write a regexp to force/validate a string to have both digits (numbers) and alpha chars (a-zA-Z).
Using [a-zA-Z0-9] will allow any combination (including only digits or only letters).
The digits and letters can appear in any order.
I do not know how to force "stuff" in such a case.
You will need to use lookaheads for this. Consider this regex:
^(?=.*[0-9])(?=.*[a-zA-Z])[a-zA-Z0-9]+$
It has 2 lookaheads that force the presence of a digit and a letter.
(?=.*[0-9]) # assert that there is at least one digit ahead
(?=.*[a-zA-Z]) # assert that there is at least one letter ahead
[a-zA-Z0-9]+$ # will match only alphanumerics
Regex Lookaround Reference
Note that if your regex tool/language doesn't support lookaheads then you will have to use alternation:
^[a-zA-Z0-9]*?([0-9][a-zA-Z0-9]*[a-zA-Z]|[a-zA-Z][a-zA-Z0-9]*[0-9])[a-zA-Z0-9]*$
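Both variants can be compared side by side with a small Python sketch. The sample strings and the names LOOKAHEAD and ALTERNATION are chosen for illustration only.

```python
import re

LOOKAHEAD = re.compile(r"^(?=.*[0-9])(?=.*[a-zA-Z])[a-zA-Z0-9]+$")
ALTERNATION = re.compile(
    r"^[a-zA-Z0-9]*?([0-9][a-zA-Z0-9]*[a-zA-Z]|[a-zA-Z][a-zA-Z0-9]*[0-9])[a-zA-Z0-9]*$"
)

samples = ["abc123", "1a", "abc", "123", "abc-123"]

# For each sample: (lookahead result, alternation result)
results = {
    s: (bool(LOOKAHEAD.match(s)), bool(ALTERNATION.match(s)))
    for s in samples
}
for s, outcome in results.items():
    print(s, outcome)
```

On these samples the two patterns agree: only strings that mix at least one digit and one letter, and contain nothing else, are accepted.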

Regex Matching Behaviour Of \w

I noticed some interesting behaviour with some regex work I am doing, and I'd like some insight.
From what I understand, the word character, \w should match the following [a-zA-Z_0-9]
Given this input,
0000000060399301+0000000042456971+0000000
What should this regex
(\d+)\w
Capture?
I would expect it to capture 0000000060399301 but it actually captures 000000006039930
Is there something I am missing? Why is the 1 dropped from the end?
I noticed if I changed the regex to
(\d+\w)
It captures correctly i.e. including the 1
Anyone care to explain? Thanks
You require the regex to match a trailing word character - that would be the 1.
It cannot be another character, because
+ is not a word class character
+ is not a digit
matching is greedy
\d+ - matches one or more digit characters.
\w - matches one word character, [A-Za-z\d_].
So with the string 0000000060399301+, the \d+ in (\d+)\w first matches all the digits (including the 1 before +). The following \w must still match something, and + is not a word character, so the regex engine backtracks one character to the left and lets \w match the digit before +. Now the captured group contains 000000006039930 and the last 1 is matched by \w.
The 1 is being dropped because \w isn't in the capture group.
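A short Python sketch makes the difference between the overall match and the capture group visible, using the input from the question.

```python
import re

s = "0000000060399301+0000000042456971+0000000"

outside = re.search(r"(\d+)\w", s)   # \w sits outside the capture group
inside = re.search(r"(\d+\w)", s)    # \w sits inside the capture group

print(outside.group(0))  # whole match, includes the digit taken by \w
print(outside.group(1))  # capture group only
print(inside.group(1))
```

The whole match is the same in both cases; only what the group captures differs.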

Not sure if I understand the regex: (\b\w+) \1\b? [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 8 years ago.
I get what it does on higher level... it detects duplicate words. What I am having trouble understanding is the logic of how this works. I am hoping you will correct me if my understanding is off. Other detail is assume that I am using grep on a Linux machine.
\b will detect the first character.
\w+ will scan the letters.
Now for the part I am confused about.
The brackets will "store" the letters up to the first space or really the second \b
then the \1 will repeat steps 1 to 3 and then compare and if they match... display.
I would appreciate layman's terms if possible.
(\b\w+) \1\b detects repeated words. For example, abc abc or aaa aaa or x123_ x123_.
A word is a sequence of word characters as defined below.
A word character, depending on the mode (ASCII, Locale or Unicode), matches letters (can be locale dependent), digits (can be locale dependent) and underscore.
\b detects a word boundary, which is a position that has a word character before it or after it (but not both).
There is a slight flaw in the regex above. If the word is repeated 3 times or more, it will only remove half of the repeated words, when replacing with capturing group 1.
Pattern explanation:
( group and capture to \1:
\b the word boundary
\w+ word characters (a-z, A-Z, 0-9, _) (1 or more times)
) end of \1
' '
\1 what was matched by capture \1
\b the word boundary
\w matches a-z, A-Z, 0-9 and _. The first \b is still needed, though: without it, \w+ could start in the middle of a word, so that at would match through the suffix at.
\1 is a backreference that matches the same text as the first group.
Here parentheses (...) are used for making groups.
(\b\w+) \1\b
First Group ------^^^^^^ ^-------- Match First Group again
\1 is a backreference, which means it matches the same text that was matched by the first capture group.
In this case, \b\w+ is the capture group, so \1 matches whatever that group captured.
More on the backreference can be found here
http://www.regular-expressions.info/backref.html
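To experiment with the points made in these answers, here is a minimal sketch. It uses Python's re module rather than grep, which is an assumption made for illustration, and the sample strings are invented.

```python
import re

dup = re.compile(r"(\b\w+) \1\b")

# Replacing each match with group 1 collapses a doubled word.
once = dup.sub(r"\1", "abc abc")

# With three repetitions only the first pair is collapsed: after the first
# match is consumed, the remaining single word has nothing left to pair with.
triple = dup.sub(r"\1", "go go go")

# Role of the leading \b: without it the group may start mid-word.
no_boundary = re.search(r"(\w+) \1\b", "that at")
with_boundary = dup.search("that at")

print(once)
print(triple)
print(no_boundary.group(0) if no_boundary else None)
print(with_boundary)
```

Running the substitution repeatedly until the string stops changing is one way to collapse longer runs of the same word.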