?: Notation in Regular Expression [duplicate] - regex

This question already has answers here:
What is a non-capturing group in regular expressions?
(18 answers)
Closed 6 years ago.
For one of my classes I have to describe the following regular expression:
\b4[0-9]{12}(?:[0-9]{3})\b
I understand that it selects a number that: begins with 4, is followed by 12 digits (each between 0-9), and is followed by another 3 digits.
What I don't understand is the question mark with the colon, (?:...). I've tried looking online to find out what this means, but the links I've found were somewhat confusing; I was hoping someone could give me a quick, basic idea of what the question mark does in this example.

This is going to be a short answer.
When you use (?:...), the group is matched but not captured for back-referencing, i.e. it is a non-capturing group. Its match is not stored in memory to be referenced later on.
For example:
(34)5\1
This regex looks for 34 followed by 5 and then 34 again. You could of course write it as 34534, but sometimes the captured group is a complex pattern whose match you cannot predict beforehand.
So whatever the capturing group matched must appear again.
Regex101 demo for back-referencing
Back-referencing is also used in replacements.
For example:
([A-Z]+)[0-9]+
This regex looks for one or more upper-case letters followed by one or more digits. Suppose I want to replace the whole match with just the upper-case letters that were found.
Then I would use \1 in the replacement, which refers back to the first captured group.
Regex101 demo for replacement
If you change the pattern to (?:[A-Z]+)[0-9]+, the letters are no longer captured and hence cannot be referenced back.
Regex101 demo for non-capturing group
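The three cases above can also be tried out with Python's built-in re module. This is only a sketch; the sample strings are made up for illustration.

```python
import re

# Capturing group: \1 must repeat whatever (34) matched.
backref = re.compile(r"(34)5\1")
print(bool(backref.fullmatch("34534")))  # True
print(bool(backref.fullmatch("34535")))  # False

# Replacement: keep only the upper-case letters captured by group 1.
print(re.sub(r"([A-Z]+)[0-9]+", r"\1", "ABC123 DEF45"))  # ABC DEF

# Non-capturing group: the pattern still matches, but no group 1 exists.
non_capturing = re.compile(r"(?:[A-Z]+)[0-9]+")
print(non_capturing.groups)  # 0
```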
A live answer.

It's called a 'non-capturing group', which means the regex does not create a group from the match inside the parentheses as it otherwise would (normally, a pair of parentheses creates a group).
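Applied to the expression from the question, the difference can be checked in Python. This is a rough sketch; the 16-digit sample number is invented for the test.

```python
import re

# Made-up sample: a 4 followed by fifteen 1s (16 digits in total).
number = "4" + "1" * 15

non_capturing = re.search(r"\b4[0-9]{12}(?:[0-9]{3})\b", number)
capturing = re.search(r"\b4[0-9]{12}([0-9]{3})\b", number)

# Both variants match exactly the same text...
print(non_capturing.group(0) == capturing.group(0))  # True

# ...but only the capturing variant stores the last three digits.
print(non_capturing.groups())  # ()
print(capturing.groups())      # ('111',)
```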

Related

How to create a short regular expression [closed]

How can I create a short regular expression that only matches words in which the same character never appears twice in a row?
Only the following syntax elements are allowed:
. * + ? | ()
And the alphabet is {a, b}.
Example:
Is matching: abababab
Not matching: abbab
Thank you :)
Well, your exercise is not very clear (which regex engine are you using? etc),
but I managed to do something:
(?<=^|\P{L})(?:(\p{L})(?!\1))+(?=\P{L}|$)
https://regex101.com/r/R2t2ik/1
Explanation
We are looking for a letter from any language, not just [a-z] and not
just \w for a word character, because letters such as àéêï would
typically not match. So instead, use \p{L}, a Unicode category that
matches any letter.
More details here:
https://www.regular-expressions.info/unicode.html#category
We will capture this char with a capturing group: (\p{L})
This creates group number 1. Group 0 is the match of the entire regular
expression. Capturing groups are numbered from left to right, so each one
creates a new numbered group. In our case we can then refer to the
captured letter with the \1 back-reference.
To check that two consecutive characters are not identical, we will use a
negative lookahead, meaning that the match fails at that position
if the pattern inside the lookahead succeeds.
The regex becomes: (\p{L})(?!\1)
This means: "Find a letter of any language that is not followed by itself."
Now, a word is made of one or more characters, so it could be matched with
\w+ but as explained before, this would only work in English. So for any
language, it would become (\p{L})+. The + could also be applied to \p{L}
directly, but we keep the group around it because the \1 back-reference
needs it.
Okay, that's good, but it's not exactly what we want. We only want to find
characters that are not followed by themselves. So we have to use the
negative-lookahead pattern from above, (\p{L})(?!\1).
This becomes: (?:(\p{L})(?!\1))+
You may ask why we have this (?: and ) around all of it.
We could simply use ( and )+ but in that case it would create a new
capturing group, which we don't need. So to create a non-capturing
group, you add ?: right after the opening parenthesis.
Capturing group = (abc) vs non-capturing group = (?:abc)
To finish, we want to capture word beginnings and ends with the help of
a positive lookbehind and a positive lookahead. I started with the usual
\b for word boundary but it did not work. Don't ask me why. I expect
that it's related to the use of the Unicode classes or perhaps the way the
selector is written. Someone may find an explanation, I'm not a specialist.
Well, I had to solve that by matching either the beginning of the string
with the ^ anchor or, with the \P{L} Unicode class, a char which is not
a letter. I did the same for the end by using the $ anchor.
So at the beginning, I added a positive lookbehind meaning "start with or
has a non-letter char before" done with this (?<=^|\P{L}) rule.
And at the end, I added a positive lookahead meaning "finish with or has
a non-letter char after" done with this (?=\P{L}|$) rule.
Putting everything together:
(?<=^|\P{L}) + (?:(\p{L})(?!\1))+ + (?=\P{L}|$)
results in:
(?<=^|\P{L})(?:(\p{L})(?!\1))+(?=\P{L}|$)
I hope it's what you were looking for and that it's not too complicated to
understand.
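For readers who want to try the idea in Python, here is a hedged adaptation rather than the exact pattern above. Python's built-in re module supports neither \p{L} nor the variable-width lookbehind (?<=^|\P{L}), so this sketch swaps in [^\W\d_] as an approximate "letter" class and uses negative lookarounds for the word edges. The names LETTER, NO_DOUBLES and words_without_doubles are made up for the example.

```python
import re

# Assumption: [^\W\d_] (a word char that is neither digit nor underscore)
# is used as a rough stand-in for \p{L}.
LETTER = r"[^\W\d_]"

NO_DOUBLES = re.compile(
    rf"(?<!{LETTER})"          # word start: not preceded by a letter
    rf"(?:({LETTER})(?!\1))+"  # letters, each not followed by itself
    rf"(?!{LETTER})"           # word end: not followed by a letter
)

def words_without_doubles(text: str) -> list[str]:
    """Return the words in which no letter appears twice in a row."""
    return [m.group(0) for m in NO_DOUBLES.finditer(text)]

# "\u00e9t\u00e9" is the French word "été", written with escapes.
print(words_without_doubles("abababab abbab \u00e9t\u00e9"))
```

With the sample text, "abababab" and "été" are returned, while "abbab" is skipped because of the doubled b.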

How to extract characters from a string with optional string afterwards using Regex?

I am in the process of learning Regex and have been stuck on this case. I have a URL that can be in one of two forms. EXAMPLE 1:
spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA
OR EXAMPLE 2:
spotify.com/track/1HYcYZCOpaLjg51qUg8ilA
I need to extract the 1HYcYZCOpaLjg51qUg8ilA ID
So far I am using this: (?<=track\/)(.*)(?=\?)? which works well for Example 2 but it includes the ?si=Nf5w1q9MTKu3zG_CJ83RWA when matching with Example 1.
BUT if I remove the ? at the end of the expression then it works for Example 1 but not Example 2! Doesn't that mean that last group (?=\?) is optional and should match?
Where am I going wrong?
Thanks!
I searched a handful of "Questions that may already have your answer" suggestions from SO, and didn't find this case, so I hope asking this is okay!
The capturing group in your regular expression is trying to match anything (.) as much as possible due to the greediness of the quantifier (*).
When you use:
(?<=track\/)(.*)(?=\?)
only 1HYcYZCOpaLjg51qUg8ilA from the first example is captured; the second example does not match at all, as it contains no question mark.
When using:
(?<=track\/)(.*)(?=\??)
You are effectively making the positive lookahead optional, so the greedy capturing group matches as much as possible (including the question mark and everything after it). As a result, 1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA and 1HYcYZCOpaLjg51qUg8ilA are matched, which is not the desired output.
Rather than matching anything, it is perhaps more appropriate for you to match alphanumerical characters \w only.
(?<=track\/)(\w*)(?=\??)
Alternatively, if you are expecting other characters, let's say a hyphen - in addition to letters, digits and the underscore _, you may use a character class.
(?<=track\/)([a-zA-Z0-9_-]*)(?=\??)
Or you might want to capture everything except a question mark ? with a negated character class.
(?<=track\/)([^?]*)(?=\??)
As pointed out by gaganso, neither the look-behind nor the lookahead is necessary in this situation; however, it is a good idea to start playing around with them. Look-around assertions do not consume characters in the string, so the full match in both examples consists only of what is captured by the capture group.
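The difference between the greedy .* and the negated character class can be checked with a short Python sketch. The two URLs are the ones from the question; the variable names are only illustrative.

```python
import re

urls = [
    "spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA",
    "spotify.com/track/1HYcYZCOpaLjg51qUg8ilA",
]

# Greedy dot with an optional "?" in the lookahead: runs to the end.
greedy = re.compile(r"(?<=track/)(.*)(?=\??)")
# Negated class: stops in front of the first "?" (or at the end).
negated = re.compile(r"(?<=track/)([^?]*)(?=\??)")

greedy_ids = [greedy.search(url).group(1) for url in urls]
negated_ids = [negated.search(url).group(1) for url in urls]

print(greedy_ids)   # first entry still contains "?si=..."
print(negated_ids)  # both entries are the bare ID
```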
This should work:
track\/(\w+)
Please see here.
Since track is part of both the strings, and the ID is formed from alphanumeric characters, the above regex which matches the string "track/" and captures the alphanumeric characters after that string, should provide the required ID.
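As a minimal sketch, the same pattern could be wrapped in a small Python helper. The function name extract_track_id is hypothetical, and returning None for a URL without a track segment is an assumption of this example.

```python
import re
from typing import Optional

def extract_track_id(url: str) -> Optional[str]:
    """Return the word characters that follow "track/", or None."""
    match = re.search(r"track/(\w+)", url)
    return match.group(1) if match else None

print(extract_track_id("spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA"))
print(extract_track_id("spotify.com/track/1HYcYZCOpaLjg51qUg8ilA"))
```

Both calls print 1HYcYZCOpaLjg51qUg8ilA, because \w+ stops at the question mark.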
Regex: (\w+(?=\?))|(\w+$)
See the demo for the regex, https://regexr.com/3s4gv .
This will first try to find a word that has '?' right after it and, if that's unsuccessful, it will fetch the last word.

How to find a sequence of formatted digits in Apache Nifi using a regular expression?

Using Apache Nifi, I want to find this kind of text in a CSV with lots of text:
nnnn?nn
where n is a digit between 0 and 9, and ? is a literal question mark.
A real example is:
8764?23
It always has 4 digits before ? and 2 digits after.
How can this be done?
Starting off simple:
\d{4}\?\d{2}
But this would also match 8764?23 within a longer string such as 98764?23 or 8764?234.
If you need to find exact matches as individual values within the CSV, a more complex regular expression is needed:
(?:^|,)\s*(\d{4}\?\d{2})\s*(?:,|$)
This may look a bit strange at first sight so let's break it down:
(?:^|,) uses the (something|something else) syntax to allow a choice of two different things - here it is allowing either the very start of the string ^ or a comma ,. The ?: at the start excludes this expression from being included as a capturing group.
\s* allows any amount of whitespace (i.e. zero or more spaces, tabs etc.) to appear before the matched expression.
(\d{4}\?\d{2}) specifies exactly 4 digits \d{4} followed by a question mark \? (which needs to be escaped to distinguish it from the regex ? meaning 0 or 1 occurrences), followed by 2 more digits \d{2}. The surrounding brackets () are used to specify this as a capturing group.
\s* allows more whitespace after the matched expression.
(?:,|$) allows either a comma , or the end of the string $ and ?: excludes this from being a capturing group.
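To see the difference between the simple and the stricter expression, here is a small Python sketch. Python is used only for illustration (Nifi itself runs on Java regular expressions), and the CSV rows are made up.

```python
import re

# Made-up CSV content with one value per row in the second column.
csv_text = "\n".join([
    "id,code,note",
    "1, 8764?23 ,ok",
    "2,98764?23,too long",
    "3,8764?234,too long",
    "4,1234?56,fine",
])

simple = re.compile(r"\d{4}\?\d{2}")
# MULTILINE lets ^ and $ match at the start and end of every line.
strict = re.compile(r"(?:^|,)\s*(\d{4}\?\d{2})\s*(?:,|$)", re.MULTILINE)

print(simple.findall(csv_text))  # also hits inside 98764?23 and 8764?234
print(strict.findall(csv_text))  # ['8764?23', '1234?56']
```

One caveat of this sketch: each match also consumes the trailing comma, so two matching values in directly adjacent fields would not both be found by a plain findall scan.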
Demo
https://regex101.com/r/X0Ic4v/1
Usage
The above can be used with Nifi's ExtractText to get the first capturing group for each match. Since it is only the capturing group that is of interest and not the rest of the match, "Include Capture Group 0" can be set to false. Presumably both "Enable Multiline Mode" and "Enable repeating capture group" should be set to true.
Further considerations
The above assumes that 8764?23 appears exactly like that as a value in a CSV string. But maybe you need to allow "8764?23"? Or possibly others such as '8764?23', _8764?23_ or even ABC8764?23DEF? There are too many possible variants here for a one-size-fits-all pattern, so please reply in the comments to state the requirements if the above doesn't fit your needs.
Here is your regular expression: \d\d\d\d\?\d\d
This is the Regex required for your needs.
(\d{4}\?\d\d)

A short way to capture/back-reference every digit of a number individually

So basically I want to reformat a 10 digit number like so:
1234567890 --> (123) 456-7890
A long way to do this would be to have each number be its own capture group and then back-reference each one individually:
'([0-9])([0-9])...([0-9])' --> (\1\2\3) \4\5\6-\7\8\9\10
This seems unnecessary and verbose, but when I try the following
'([0-9]){10}'
There appears to be only one back-reference and it's of the last digit in the number.
Is there a more elegant way to reference each character as its own capture group?
Thanks!
The following pattern will do the job: ^(\d{3})(\d{3})(\d{4})$
^(\d{3}): beginning of the string, then exactly 3 digits
(\d{3}): exactly 3 digits
(\d{4})$: exactly 4 digits, then end of the string.
Then replace by: (\1) \2-\3
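In Python, for example, this pattern and replacement could be used as follows. This is a sketch; the helper name format_phone is made up, and leaving non-matching input unchanged is simply how re.sub behaves.

```python
import re

def format_phone(digits: str) -> str:
    """Reformat exactly ten digits as (XXX) XXX-XXXX."""
    # Three groups instead of ten: \1, \2 and \3 each hold several digits.
    return re.sub(r"^(\d{3})(\d{3})(\d{4})$", r"(\1) \2-\3", digits)

print(format_phone("1234567890"))  # (123) 456-7890
print(format_phone("12345"))       # 12345 (no match, returned as-is)
```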
Although the other answer with its example regex patterns hopefully shed light on the correct application of capture groups, it does not directly answer the question. If you fail to understand how regular expressions work (capture groups in particular), you may find yourself wanting to do the same thing with a different pattern in the future.
Is there a more elegant way to reference each character as its own
capture group?
The initial answer is "No", there is no way to reference an individual capture of a single capture group using traditional replacement syntax - regardless of whether it is a single digit or any other capture group. Consider that you indicate a precise number of matches with {10} and it seems perfectly reasonable to be able to access each capture. But what if you had indicated a variable number of matches with + or {,3}? There would be no well-defined way of knowing how many possible captures occurred. If the same regex pattern had had more capture groups following the "repeated" capture group, there would be no way of correctly referencing the later groups. Example: Given the pattern ([a-z])+(\d){3}, the first capture group could match 4 letters one time, then the next time match 11 letters. If you wanted to refer to the captured digits, how would you do that? You could not, since \1, \2, \3, ... would all be reserved for possible capture instances of the first group.
But the inability of basic regular expression syntax to do what you want does not remove the validity of your question, nor does it necessarily place the solution outside the realm of many regular expression implementations. Various regex implementations (i.e. language syntax and regex libraries) resolve this limitation by providing objects for accessing repeated captures. (C# with the .NET regex library is one example, as in match.Groups[1].Captures[3].) So even though you can't use basic replacement patterns to get what you want, the answer is often "Yes", depending on the specific implementation.
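Python's built-in re module shows the same limitation: a repeated group keeps only its last capture. The sketch below demonstrates that and then uses one possible workaround, collecting the digits with findall instead of through group numbers. The workaround is this example's choice, not something taken from the answer above.

```python
import re

number = "1234567890"

# A repeated group only remembers its final repetition.
match = re.fullmatch(r"([0-9]){10}", number)
print(match.group(1))  # 0

# Workaround: collect every digit separately, then index into the list.
digits = re.findall(r"[0-9]", number)
formatted = "({}{}{}) {}{}{}-{}{}{}{}".format(*digits)
print(formatted)  # (123) 456-7890
```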

Regexp to start matching after a specific character [duplicate]

This question already has an answer here:
Regex to match text after a given character excluding the character itself
(1 answer)
Closed 6 years ago.
In someXstring it's easy to find everything after and including 'X'.
What I need is to find everything after, but EXCLUDING 'X'.
... just to match string in it.
Try using a lookbehind assertion.
(?<=X)\w+
If your regex engine doesn't support lookbehind assertions, you can work around that using capturing groups.
X(\w+)
In the above regex, string would be accessed referencing \1.
NOTE: this uses \w to capture word characters. If you literally mean that you want to capture everything then use the dot, ., metacharacter instead...
(?<=X).+$
You can use lookbehind if available
(?<=X).*$
If not, you can use groups. Grab group 1.
X(.*$)
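Both approaches can be compared in Python, whose re module supports fixed-width lookbehind. This is a quick sketch; the second sample string is made up to show the difference between \w+ and the dot.

```python
import re

text = "someXstring"

# Lookbehind: X must precede the match but is not part of it.
with_lookbehind = re.search(r"(?<=X)\w+", text).group(0)

# Capturing group: X is part of the match, group 1 holds the rest.
with_group = re.search(r"X(\w+)", text).group(1)

print(with_lookbehind, with_group)  # string string

# The dot variants take everything up to the end of the line.
rest = re.search(r"(?<=X).+$", "someXa string!").group(0)
print(rest)  # a string!
```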