Capturing regex group with optional suffix [duplicate] - regex

I have addresses in two formats:
SomeHouse,
Holbrook,
Belper,
Derbyshire,
DE56 0RR
and
SomeHouse,
Holbrook,
Belper,
Derbyshire,
DE56 0RR(123123123123)
The number only ever appears right at the end, is always in brackets and always 12 digits.
I am trying to get a regex to match two groups ... the address and the number (if it is there).
It is a head banger (for my inregexperienced self) since I can't get my expression to work on both types of address.
I have
(?<address>.*)(?<bracketsandnum>\((?<num>[0-9]{12})\))$
which also uses a group to match the brackets. I'm not sure I need that bit :) and certainly not as a named group.
Please advise!
Cheers,
James.
Update
I have used the answers provided by Martinho and Qtax. Many thanks to them.
Now that I understand a bit more, I see my question is similar to the following:
Ignoring an optional suffix with a greedy regex

Make the second group optional with ?, and use a non-greedy match in the first group (by modifying * with ?). Something like this:
^(?<address>.*?)(?:\((?<num>\d{12})\))?$
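A minimal sketch of the same idea in Python, for anyone who wants to try it quickly. The assumptions here are mine, not the answer's: Python spells named groups (?P<name>...), the address is taken to be one multi-line string (hence re.DOTALL), and split_address is a hypothetical helper name.

```python
import re

# Lazy address group followed by an optional, non-capturing bracket wrapper.
# re.DOTALL is an assumption: it lets "." cross the line breaks in the address.
ADDRESS_RE = re.compile(
    r"^(?P<address>.*?)(?:\((?P<num>\d{12})\))?$",
    re.DOTALL,
)


def split_address(text):
    """Return (address, num); num is None when there is no bracketed suffix."""
    m = ADDRESS_RE.match(text)
    return m.group("address"), m.group("num")


plain = "SomeHouse,\nHolbrook,\nBelper,\nDerbyshire,\nDE56 0RR"
with_num = split_address(plain + "(123123123123)")
without_num = split_address(plain)
```

Because the first group is lazy, it stops as soon as the optional suffix plus the end anchor can match, so the brackets never end up inside the address.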

Related

How to extract characters from a string with optional string afterwards using Regex?

I am in the process of learning Regex and have been stuck on this case. I have a url that can be in two states EXAMPLE 1:
spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA
OR EXAMPLE 2:
spotify.com/track/1HYcYZCOpaLjg51qUg8ilA
I need to extract the 1HYcYZCOpaLjg51qUg8ilA ID
So far I am using this: (?<=track\/)(.*)(?=\?)? which works well for Example 2 but it includes the ?si=Nf5w1q9MTKu3zG_CJ83RWA when matching with Example 1.
BUT if I remove the ? at the end of the expression then it works for Example 1 but not Example 2! Doesn't that mean that last group (?=\?) is optional and should match?
Where am I going wrong?
Thanks!
I searched a handful of "Questions that may already have your answer" suggestions from SO, and didn't find this case, so I hope asking this is okay!
The capturing group in your regular expression is trying to match anything (.) as much as possible due to the greediness of the quantifier (*).
When you use:
(?<=track\/)(.*)(?=\?)
only 1HYcYZCOpaLjg51qUg8ilA from the first example is captured, as there is no question mark in your second example.
When using:
(?<=track\/)(.*)(?=\??)
You are effectively making the positive lookahead optional, so the capturing group will try to match as much as possible (including the question mark), so that 1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA and 1HYcYZCOpaLjg51qUg8ilA are matched, which is not the desired output.
Rather than matching anything, it is perhaps more appropriate to match only word characters \w (letters, digits and the underscore).
(?<=track\/)(\w*)(?=\??)
Alternatively, if you want to spell out exactly which characters are allowed, say letters, digits, an underscore _ and a hyphen -, you may use a character class.
(?<=track\/)([a-zA-Z0-9_-]*)(?=\??)
Or you might want to capture everything except a question mark ? with a negated character class.
(?<=track\/)([^?]*)(?=\??)
As pointed out by gaganso, a look-behind is not necessary in this situation (nor, indeed, is the lookahead); however, it is a good idea to start playing around with them. Look-around assertions do not consume characters in the string, so the full match in both examples consists only of what the capture group captured.
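The difference between the greedy .* and the negated character class can be checked with a short Python sketch. Python is my choice of language here, since the question does not say which engine is in use, and the variable names are only illustrative.

```python
import re

urls = [
    "spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA",
    "spotify.com/track/1HYcYZCOpaLjg51qUg8ilA",
]

# Greedy ".*" swallows the query string; the optional lookahead cannot stop it.
greedy = re.compile(r"(?<=track/)(.*)(?=\??)")
# "[^?]*" cannot step over a question mark, so it stops at the ID.
negated = re.compile(r"(?<=track/)([^?]*)(?=\??)")

greedy_ids = [greedy.search(u).group(1) for u in urls]
negated_ids = [negated.search(u).group(1) for u in urls]
```

Here greedy_ids still carries the ?si=... part for the first URL, while negated_ids holds the bare ID for both.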
This should work:
track\/(\w+)
Since track is part of both the strings, and the ID is formed from alphanumeric characters, the above regex which matches the string "track/" and captures the alphanumeric characters after that string, should provide the required ID.
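As a rough Python sketch of this simpler approach (track_id is a hypothetical helper name, and returning None for URLs without a track segment is my assumption about the desired behaviour):

```python
import re
from typing import Optional

# No look-arounds: match the literal "track/" and capture the word characters after it.
TRACK_ID = re.compile(r"track/(\w+)")


def track_id(url: str) -> Optional[str]:
    """Return the ID after "track/", or None if the URL has no such segment."""
    m = TRACK_ID.search(url)
    return m.group(1) if m else None


with_query = track_id("spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA")
bare = track_id("spotify.com/track/1HYcYZCOpaLjg51qUg8ilA")
```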
Regex : (\w+(?=\?))|(\w+$)
See the demo for the regex, https://regexr.com/3s4gv .
This first tries to find a word that has '?' immediately after it; if that is unsuccessful, it fetches the last word.

A short way to capture/back-reference every digit of a number individually

So basically I want to reformat a 10 digit number like so:
1234567890 --> (123) 456-7890
A long way to do this would be to have each number be its own capture group and then back-reference each one individually:
'([0-9])([0-9])...([0-9])' --> (\1\2\3) \4\5\6-\7\8\9\10
This seems unnecessary and verbose, but when I try the following
'([0-9]){10}'
There appears to be only one back-reference, and it is of the last digit in the number.
Is there a more elegant way to reference each character as its own capture group?
Thanks!
The following pattern will do the job: ^(\d{3})(\d{3})(\d{4})$
^(\d{3}): beginning of the string, then exactly 3 digits
(\d{3}): exactly 3 digits
(\d{4})$: exactly 4 digits, then end of the string.
Then replace by: (\1) \2-\3
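For reference, the same pattern and replacement in Python might look like the sketch below. The question does not name a tool, so the language and the format_phone helper name are assumptions.

```python
import re


def format_phone(digits: str) -> str:
    """Reformat exactly ten digits as (123) 456-7890; other input is returned unchanged."""
    return re.sub(r"^(\d{3})(\d{3})(\d{4})$", r"(\1) \2-\3", digits)


formatted = format_phone("1234567890")
untouched = format_phone("12345")
```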
Although the other answer with its example regex patterns hopefully shed light on the correct application of capture groups, it does not directly answer the question. If you fail to understand how regular expressions work (capture groups in particular), you may find yourself wanting to do the same thing with a different pattern in the future.
Is there a more elegant way to reference each character as its own capture group?
The initial answer is "No", there is no way to reference an individual capture of a single capture group using traditional replacement syntax - regardless of whether it is a single digit or any other capture group. Consider that you indicate a precise number of matches with {10} and it seems perfectly reasonable to be able to access each capture. But what if you had indicated a variable number of matches with + or {,3}? There would be no well-defined way of knowing how many possible captures occurred. If the same regex pattern had had more capture groups following the "repeated" capture group, there would be no way of correctly referencing the later groups. Example: Given the pattern ([a-z])+(\d){3}, the first capture group could match 4 letters one time, then the next time match 11 letters. If you wanted to refer to the captured digits, how would you do that? You could not, since \1, \2, \3, ... would all be reserved for possible capture instances of the first group.
But the inability of basic regular expression syntax to do what you want does not remove the validity of your question, nor does it necessarily place the solution outside the realm of many regular expression implementations. Various regex implementations (i.e. language syntax and regex libraries) resolve this limitation by providing objects for accessing repeated captures. (C# with the .NET regex library is one example, as in match.Groups[1].Captures[3].) So even though you can't use basic replacement patterns to get what you want, the answer is often "Yes", depending on the specific implementation.
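To illustrate the point about repeated groups, here is a small sketch using Python's standard re module, which (like plain replacement syntax) keeps only the last capture of a repeated group. The workaround shown, collecting the digits with findall, is just one option and not something the answer above prescribes.

```python
import re

number = "1234567890"

# A repeated group is overwritten on each iteration, so only the last digit survives.
m = re.fullmatch(r"([0-9]){10}", number)
last_only = m.group(1)

# One workaround: collect every digit separately, then slice the list.
digits = re.findall(r"[0-9]", number)
formatted = "({}) {}-{}".format(
    "".join(digits[0:3]), "".join(digits[3:6]), "".join(digits[6:10])
)
```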

RegEx to match acronyms

I am trying to write a regular expression that will match values such as U.S., D.C., U.S.A., etc.
Here is what I have so far -
\b([a-zA-Z]\.){2,}+
Note how this expression matches but does not include the last letter in the acronym.
Can anyone help explain what I am missing here?
SOLUTION
I'm posting the solution here in case this helps anyone.
\b(?:[a-zA-Z]\.){2,}
It seems as if a non-capturing group is required here.
Try (?:[a-zA-Z]\.){2,}
?: (non-capturing group) is there because you want to omit capturing the last iteration of the repeated group.
For example, without ?:, 'U.S.A.' will yield a group match 'A.', which you are not interested in.
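The tool used in the question is not stated, but the effect is easy to reproduce with Python's re.findall, which returns the group contents rather than the whole match when the pattern contains a capturing group. The sample sentence below is made up for illustration.

```python
import re

text = "The U.S.A. and D.C. offices"

# With a capturing group, findall reports only the last iteration of the group.
captured = re.findall(r"\b([a-zA-Z]\.){2,}", text)

# With a non-capturing group, findall reports the full match.
whole = re.findall(r"\b(?:[a-zA-Z]\.){2,}", text)
```

Here captured holds only the trailing 'A.' and 'C.', while whole holds the complete acronyms.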
None of these proposed solutions do what yours does - make sure that there are at least 2 letters in the acronym. Also, yours works on http://rubular.com/ . This is probably some issue with the regex implementation - to be fair, all of the matches that you got were valid acronyms. To fix this, you could either:
Make sure there's a space or EOF succeeding your expression ((?=\s|$) in ruby at least)
Surround your regex with ^ and $ to make sure it catches the whole string. You'd have to split the whole string on spaces to get matches with this though.
I prefer the former solution - to do this you'd have:
\b([a-zA-Z]\.){2,}(?=\s|$)
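A quick way to see what the lookahead buys you, sketched in Python with an invented sample sentence. I have used a non-capturing group so that findall returns whole matches; the expression is otherwise the same.

```python
import re

text = "See x.y.z and U.S.A. now"

# Without the lookahead, the first two letters of "x.y.z" look like an acronym.
loose = re.findall(r"\b(?:[a-zA-Z]\.){2,}", text)

# Requiring whitespace or end of string after the match rejects that fragment.
strict = re.findall(r"\b(?:[a-zA-Z]\.){2,}(?=\s|$)", text)
```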
Edit: I've realized this doesn't actually work with other punctuation in the string, and a couple of other edge cases. This is super ugly, but I think it should be good enough:
(?<=\s|^)((?:[a-zA-Z]\.){2,})(?=[[:punct:]]?(?:\s|$))
This assumes that you've got this [[:punct:]] character class, and allows for 0-1 punctuation marks after an acronym that won't be captured. I've also fixed it up so that there's a single capture group that gets the whole acronym. Check out validation at http://rubular.com/r/lmr0qERLDh
Bonus: you now get to make this super confusing to anyone reading it.
This should work:
/([a-zA-Z]\.)+/g
I have slightly modified the solution above:
\b(?:[a-zA-Z]+\.){2,}
to enable capturing acronyms containing more than one letter between the dots, like in 'GHQ.AFP.X.Y'

Mod Rewrite RegEx To Match Only If Previous Subset Matched

I am trying to make what I think is a simple regex for use with mod_rewrite.
I've tried various expressions, many of which I thought were promising, but all of which ultimately failed for one reason or another. They all also seem to fail once I add start/end string delimiters.
For example, ^user/(\d{1,10})(?=/)$ was one I tried, but among other things, it seems to group the trailing slash, and I only want to group the digits. I think I need to use a positive lookbehind, but I'm having difficulty because it's looking behind at a group.
What I am trying to match is strings that 1) begin with "user/" and 2) possibly end with (\d{1,10})/ (1 to 10 digits followed by a single slash)
Should Match:
user/
user/123/
user/1234567890/
Should not match:
user
user//
user/-4/
user/35.5/
user/123
user/123//
user/123/5/
user/12345678901/
Edit: Sorry about the formatting; I do not understand how to format anything via this markdown. Those examples are preceded by 4 spaces which I thought should make a code block, but obviously I thought wrong.
^user/(?:([0-9]{1,10})/)?$ should work just fine.
This: ^user(?=/)(/\d{1,10})?/$
Edit: if you want to group only the digits, use ^user(?=/)(?:/(\d{1,10}))?/$
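Both suggestions can be checked against the lists from the question with a short script. This sketch uses Python's re module as a stand-in for the regex engine behind mod_rewrite, which is an assumption; the two engines agree on the constructs used here, but it is not a test of the rewrite rule itself. The names RULE, ALT and captured_digits are hypothetical.

```python
import re

RULE = re.compile(r"^user/(?:([0-9]{1,10})/)?$")
ALT = re.compile(r"^user(?=/)(?:/(\d{1,10}))?/$")

should_match = ["user/", "user/123/", "user/1234567890/"]
should_not_match = [
    "user", "user//", "user/-4/", "user/35.5/",
    "user/123", "user/123//", "user/123/5/", "user/12345678901/",
]


def captured_digits(pattern, path):
    """Return (matched, digits); digits is None when the optional part is absent."""
    m = pattern.match(path)
    return (m is not None, m.group(1) if m else None)


results = {
    name: (
        all(p.match(s) for s in should_match),
        not any(p.match(s) for s in should_not_match),
    )
    for name, p in (("RULE", RULE), ("ALT", ALT))
}
```

In both patterns only the digits land in group 1; the slashes stay outside the capturing group.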