Not sure if I understand the regex: (\b\w+) \1\b? [duplicate] - regex

I get what it does at a high level: it detects duplicate words. What I am having trouble understanding is the logic of how it works, and I am hoping you will correct me if my understanding is off. One other detail: assume that I am using grep on a Linux machine.
\b will detect the first character.
\w+ will scan the letters.
Now for the part I am confused about.
The brackets will "store" the letters up to the first space, or really the second \b,
then the \1 will repeat steps 1 to 3 and then compare, and if they match... display.
I would appreciate layman's terms if possible.

(\b\w+) \1\b detects repeated words. For example, abc abc or aaa aaa or x123_ x123_.
A word is a sequence of word characters, as defined below.
Depending on the mode (ASCII, locale or Unicode), a word character is a letter (possibly locale-dependent), a digit (possibly locale-dependent) or an underscore.
\b detects a word boundary, which is a position that has a word character before it or after it (but not both).
There is a slight flaw in the regex above. If a word is repeated 3 times or more, replacing each match with capturing group 1 removes only about half of the repeats, because each match consumes exactly two copies of the word. For example, a a a a becomes a a.
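To make the flaw concrete, here is a small sketch using Python's re module rather than grep. The sample text and the second pattern (`run`) are assumptions added for illustration; they are not part of the answer above.

```python
import re

# Pattern from the question: one word, a space, then the same word again.
pair = re.compile(r'(\b\w+) \1\b')

# Hypothetical variant (not from the answer): allow any number of repeats.
run = re.compile(r'\b(\w+)(?: \1\b)+')

text = 'the the the the cat'

# Each match of `pair` consumes exactly two copies, so four copies become two.
halved = pair.sub(r'\1', text)

# `run` consumes the whole run of copies in a single match.
collapsed = run.sub(r'\1', text)

print(halved)     # the the cat
print(collapsed)  # the cat
```

If a single pass has to collapse every repeat, a pattern along the lines of `run` is one option; running the original substitution repeatedly until nothing changes is another.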

Pattern explanation:
(       group and capture to \1:
  \b    a word boundary
  \w+   word characters (a-z, A-Z, 0-9, _), one or more times
)       end of \1
' '     a literal space
\1      what was matched by capture \1
\b      a word boundary
\w matches a-z, A-Z, 0-9 and _. The first \b is still needed, though: without it, (\w+) \1\b would also match the is is inside this is, because \w+ is free to start in the middle of a word.
\1 is a backreference that matches the same text as the first group.
Parentheses (...) are used here to make groups.
(\b\w+) \1\b
^^^^^^^ ^^
|       +-- matches the first group's text again
+---------- first group
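A quick way to see what the leading \b contributes is to run the pattern with and without it. This is only a sketch in Python's re module; the sample strings are made up for illustration.

```python
import re

with_b = re.compile(r'(\b\w+) \1\b')
without_b = re.compile(r'(\w+) \1\b')

# Both patterns find a genuinely repeated word.
print(with_b.search('say hello hello').group(0))     # hello hello
print(without_b.search('say hello hello').group(0))  # hello hello

# Without the leading \b, the group may start mid-word,
# so the tail of "this" pairs up with the following "is".
print(with_b.search('this is'))                      # None
print(without_b.search('this is').group(0))          # is is
```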

\1 is a backreference: it matches the same text that the first capture group most recently matched.
In this case, (\b\w+) is the capture group, so \1 matches whatever word that group captured.
More on the backreference can be found here
http://www.regular-expressions.info/backref.html

Related

Regular expression that matches at least 4 words starting with the same letter?

I've been trying to solve this problem for a few hours but with no luck. The task is to write a regular expression that matches at least four words starting with the same letter. But! These words do not have to be one after another.
This regex should be able to match a line like this:
cat color coral chat
but also one like this:
cat take boom candle creepy drum cheek
Thank you!
So far I have got this regex but it only matches words when they are in order.
(\w)\w+\s+\1\w+\s+\1\w+\s+\1
If you have only words in the line that can be matched with \w:
\b(\w)\w*(?:(?:\s+\w+)*?\s+\1\w*){3}
Explanation
\b A word boundary to prevent a partial word match
(\w)\w* Capture a single word character in group 1 followed by matching optional word characters
(?: Non capture group to repeat as a whole part
(?:\s+\w+)*? Lazily match zero or more intervening words (1+ whitespace chars followed by 1+ word chars), to skip words that do not start with the character captured in group 1
\s+\1\w* Match 1+ whitespace chars, a backreference to the same captured character and optional word characters
){3} Close the non capture group and repeat 3 times
See a regex demo
Note that \s can also match a newline.
If the words that start with the same character should be at least 2 characters long (as (\w)\w+ matches 2 or more characters):
\b(\w)\w+(?:(?:\s+\w+)*?\s+\1\w+){3}
See another regex demo.
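The first pattern can be checked against the two lines from the question. This is a sketch in Python's re module; the third test line is an assumption added as a negative case.

```python
import re

# Pattern from the answer: words are separated by whitespace only.
four_words = re.compile(r'\b(\w)\w*(?:(?:\s+\w+)*?\s+\1\w*){3}')

# The asker's original attempt, which needs the four words to be adjacent.
consecutive_only = re.compile(r'(\w)\w+\s+\1\w+\s+\1\w+\s+\1')

lines = [
    'cat color coral chat',
    'cat take boom candle creepy drum cheek',
    'cat dog cow bird',  # hypothetical line with only two c-words
]

for line in lines:
    m = four_words.search(line)
    letter = m.group(1) if m else None
    adjacent = consecutive_only.search(line) is not None
    print(f'{line!r}: letter={letter!r}, adjacent-only pattern matches={adjacent}')
```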
Another idea to match lines with at least 4 words starting with the same letter:
\b(\w)(?:.*?\b\1){3}
See this demo at regex101
This is not very accurate: it just checks that, to the right of the character captured by \b(\w) in group 1, there are three more \b word boundaries each followed by \1, with .*? matching any characters in between.
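Because .*? accepts any characters between the words, this looser pattern also accepts separators other than whitespace. A short sketch in Python's re module; the comma-separated line is an assumption added for illustration.

```python
import re

loose = re.compile(r'\b(\w)(?:.*?\b\1){3}')
strict = re.compile(r'\b(\w)\w*(?:(?:\s+\w+)*?\s+\1\w*){3}')

mixed = 'cat take boom candle creepy drum cheek'
punctuated = 'cat, cow, cup, cod'  # hypothetical input with commas

print(loose.search(mixed) is not None)        # True
print(strict.search(mixed) is not None)       # True

# The whitespace-only pattern cannot step over the commas.
print(loose.search(punctuated) is not None)   # True
print(strict.search(punctuated) is not None)  # False
```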

Regex Capturing alternating letters and numbers only when it does not begin the string

I'm trying to capture alternating letters and numbers (letters come first) and ultimately remove them, unless they start the string.
So in the below example, yellow is what I'm trying to capture:
While I'm identifying the correct rows, I'm having a hard time capturing just the yellow-highlighted part...
^(?!([A-Z]+\d+\w*))(?:(.+))[A-Z]+\d+\w*
https://regexr.com/673hl
Any help greatly appreciated.
You can use
(?!^)\b[A-Z]+\d+\w*
See the regex demo. Details:
(?!^) - a negative lookahead that matches a position that is NOT at the start of string
\b - match a word boundary, the preceding char must be a non-word char (or start of string, but the lookahead above already ruled that position out)
[A-Z]+ - one or more uppercase ASCII letters
\d+ - one or more digits
\w* - zero or more letters, digits or underscores.
If you want to match any kind of alphanumeric strings add an alternative:
(?!^)\b(?:[A-Z]+\d|\d+[A-Z])\w*
And to make it case insensitive:
(?!^)\b(?:[A-Za-z]+\d|\d+[A-Za-z])\w*
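As a rough check of the first pattern, here is a sketch in Python's re module. The sample string is invented, since the highlighted example from the question is not available.

```python
import re

# Letters, then digits, then any word characters, but not at the start of the string.
pattern = re.compile(r'(?!^)\b[A-Z]+\d+\w*')

sample = 'AB12 start CD34 mid X9y_z end'  # hypothetical input

found = pattern.findall(sample)
cleaned = pattern.sub('', sample)

print(found)          # ['CD34', 'X9y_z']
print(repr(cleaned))  # 'AB12 start  mid  end'
```

Removing the tokens leaves the surrounding spaces behind, so a follow-up step to collapse repeated spaces may be wanted.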

Strange behaviour of word boundary `\b` and `\B` with special characters [duplicate]

For regex (456)\b and input 123456 xyz it works as expected and the output is 456. Case 1.
For almost the same regex (456)#\b and input 123456# xyz I expected the output to be 456#. Because \b should still match the end of the line after matching #.
But the regex engine failed to find a match. Case 2.
Strangely, it works for the regex (456)#\B. Notice the non-word boundary \B in this regex. Case 3. What does \B match here?
I went through this answer for understanding \b and \B, and it seems like my understanding is right.
So why is it strange? What am I missing here? Why does \B work while \b doesn't in case 2 and case 3?
A word boundary asserts a position that the following regex describes: (^\w|\w$|\W\w|\w\W). A word character here is anything in [a-zA-Z0-9_].
So in your case, matching the regex (456)#\b against the string 123456# xyz fails, because # and the space after it are BOTH non-word characters (a boundary needs one word character and one non-word character), so none of the alternatives above applies.
Amusingly, if you add a word character after the # in the string, say 123456#b xyz, it will match.
A word character is a character from a-z, A-Z, 0-9, or the _ (underscore) character.
The # is not a word character, and neither is the space that follows it, so there is no word boundary after the #.
A word boundary \b is defined as the point between a word and a non-word character. Assuming the standard C locale, # and space are both non-word characters, so there is no word boundary between them.
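The cases from the question can be reproduced in a few lines. This sketch uses Python's re module, on the assumption that its \b and \B behave like the engine used in the question for these ASCII inputs.

```python
import re

# Case 1: 6 (word char) followed by a space (non-word char) is a boundary.
case1 = re.search(r'(456)\b', '123456 xyz')
print(case1.group(0))  # 456

# Case 2: # and the space are both non-word chars, so \b cannot match there.
case2 = re.search(r'(456)#\b', '123456# xyz')
print(case2)           # None

# Case 3: \B matches exactly where \b does not, i.e. between # and the space.
case3 = re.search(r'(456)#\B', '123456# xyz')
print(case3.group(0))  # 456#

# Adding a word char after # creates a boundary between # and b.
case4 = re.search(r'(456)#\b', '123456#b xyz')
print(case4.group(0))  # 456#
```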

How can I exclude a character from a regex capturing group? [duplicate]

I have a regex capture, and I would like to exclude a character (a space, in this particular case) from the middle of the captured string. Can this be done in one step, by modifying the regex?
(Quick and dirty) example:
Text: Key name = value
My regex: (.*) = (.*)
Output: \1 = "Key name" and \2 = "value"
Desired output: \1 = "Keyname" and \2 = "value"
Update: I'm not sure what regex engine will run this regex, since it's part of a larger software product. If you have a solution, please specify which engines it will run on, and on which it will not.
Update2: The aforementioned product takes a regex as an input, and then uses the matched values further, which is the reason for which a one-step solution is asked for. There is no opportunity to insert an intermediate processing step in the pipeline.
This is a possible theoretical pure-regex implementation using the end-of-previous-match \G anchor:
/(?:\G(\w+)\h*(?:(?:=\h*)(\w+))?)+/g
Legend
(?:        # Non-capturing group 1
  \G       # Asserts the position where the previous match ended (or the start of the string)
  (\w+)    # Capture group 1: a regex word of 1+ chars
  \h*      # Zero or more horizontal spaces (spaces, tabs)
  (?:      # Non-capturing group 2
    =\h*   # Literal '=' followed by zero or more horizontal spaces
    (\w+)  # Capture group 2: a regex word of 1+ chars
  )?       # Make non-capturing group 2 optional
)+         # Repeat non-capturing group 1 one or more times
In the substitution section of the demo:
\1 actually contains Keyname (the 2 terms are separated by a fake space)
\2 is value
NOTE: I don't recommend using this unless actually needed (why?).
There are multiple possible approaches in 2 steps. As has surely been stated already, the simplest is to strip the spaces from the first capturing group of the OP's regex.
I would come up with something like:
(?<key>[\w]+)\s*=\s*(?<value>.+)
# look for one or more word characters and capture them in a group called "key"
# followed by zero or more whitespace characters (\s)
# followed by an equals sign
# followed by zero or more whitespace characters (\s)
# capture the rest in a group called "value"
... and process the captured output afterwards. But with the \w character class no whitespace will be matched (do you have keys with whitespace in them?).
See a working demo here. But as mentioned in the comments, it depends on your programming language.
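Where a second processing step is available, the two-step approach could look roughly like this in Python. Note the assumptions: Python spells named groups `(?P<name>...)`, the key group is widened from `[\w]+` to a lazy `.*?` so that it can contain the space, and the function name `parse_pair` is made up for this sketch.

```python
import re

# Step 1: capture the key (which may contain spaces) and the value.
PAIR = re.compile(r'(?P<key>.*?)\s*=\s*(?P<value>.+)')


def parse_pair(text):
    """Return (key, value) with spaces removed from the key, or None if no match."""
    m = PAIR.match(text)
    if m is None:
        return None
    # Step 2: strip the unwanted character from the captured key.
    key = m.group('key').replace(' ', '')
    return key, m.group('value')


print(parse_pair('Key name = value'))  # ('Keyname', 'value')
```

This does not meet the one-step requirement from the question; it only illustrates the two-step alternative the answers mention.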

Regex Matching Behaviour Of \w

I noticed some interesting behaviour with some regex work I am doing, and I'd like some insight.
From what I understand, the word character, \w should match the following [a-zA-Z_0-9]
Given this input,
0000000060399301+0000000042456971+0000000
What should this regex
(\d+)\w
Capture?
I would expect it to capture 0000000060399301 but it actually captures 000000006039930
Is there something I am missing? Why is the 1 dropped from the end?
I noticed if I changed the regex to
(\d+\w)
It captures correctly i.e. including the 1
Anyone care to explain? Thanks
You require the regex to match a trailing word character - that would be the 1.
It cannot be another character, because
+ is not a word class character
+ is not a digit
matching is greedy
\d+ - matches one or more digit characters.
\w - matches exactly one word character, [A-Za-z\d_].
So with the string 0000000060399301+, the \d+ in (\d+)\w first matches all the digits, including the 1 before the +. The pattern then requires a \w, but the next character is +, which is not a word character. The regex engine therefore backtracks one character to the left, so that \d+ gives up the final 1 and \w matches it. Now the captured group contains 000000006039930 and the last 1 is matched by \w.
The 1 is being dropped because \w isn't in the capture group.
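The backtracking described above can be observed directly by comparing the whole match with the captured group. A small sketch in Python's re module, using the input from the question:

```python
import re

s = '0000000060399301+0000000042456971+0000000'

# \w sits outside the group: the group gives up its last digit to \w.
m1 = re.search(r'(\d+)\w', s)
print(m1.group(0))  # 0000000060399301  (whole match, including the digit taken by \w)
print(m1.group(1))  # 000000006039930   (capture group only)

# \w sits inside the group: the final digit is part of the capture.
m2 = re.search(r'(\d+\w)', s)
print(m2.group(1))  # 0000000060399301
```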