Regex: Comma Delimiting large integers (e.g. 2903 -> 2,903) - regex

Here is the text:
1234567890
The regular expression:
s/(\d)((\d\d\d)+\b)/\1,\2/g
The expected result:
1,234,567,890
The actual result:
1,234567890
This example, from Mastering Regular Expressions, is meant to insert a comma after every 3 digits, counting from right to left. Here is the book's explanation:
This is because the digits matched by (\d\d\d)+ are now actually part of the final match, and so are not left "unmatched" and available to the next iteration of the regex via the /g.
But I still don't understand it. Could anybody explain in detail what is going on? Thanks in advance.

Prerequisite
The regex engine matches characters from left to right, and the characters it matches are consumed. That is, once a character has been consumed by a match, a later match cannot consume it again.
How does the match occur for (\d)((\d\d\d)+\b)?
1234567890
|
(\d)
1234567890
|||
(\d\d\d)+
1234567890
|
\b #cannot be matched, hence it goes for another `(\d\d\d)+`
1234567890
|||
(\d\d\d)+
1234567890
|
\b #cannot be matched, hence it goes for another `(\d\d\d)+`
1234567890
|||
(\d\d\d)+
1234567890
|
\b #matched here for the first time.
Now here the magic happens. The engine has consumed all the characters, and the pointer has reached the end of the input with a successful match. The substitution \1,\2 occurs. Now there is no way to move the pointer back to
1234567890
|
(\d)
in order to obtain the expected result.
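The walkthrough above can be reproduced with a minimal sketch. Python's re module is an assumption here, since the question does not name a language, and the function name add_commas_consuming is purely illustrative.

```python
import re

def add_commas_consuming(text: str) -> str:
    # Same pattern as in the question. Group 2 consumes every remaining
    # digit, so after the first substitution there is nothing left for a
    # second match to start from.
    return re.sub(r'(\d)((\d\d\d)+\b)', r'\1,\2', text)

print(add_commas_consuming('1234567890'))  # 1,234567890
```

Only one comma appears because the whole number was consumed by a single match.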
Solution
You haven't mentioned which language you are using. Assuming that the language supports PCRE, lookaheads will be of great use here.
s/(\d)(?=(\d\d\d)+\b)/\1,/g
Here (?=(\d\d\d)+\b) is a lookahead: it does not consume any characters, it only checks whether the characters that follow can be matched.
Regex Demo
OR
Using look arounds as
s/(?<=\d)(?=(\d\d\d)+\b)/,/g
Here
(?<=\d) is a lookbehind. It checks that the position is preceded by a digit.
(?=(\d\d\d)+\b) is a lookahead. It checks that the position is followed by one or more groups of 3 digits ending at a word boundary.
Regex Demo
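Both lookaround variants can be tried with a short sketch. Python's re module is assumed here (the zero-width substitution in the second function relies on Python 3.7 or later), and the function names are illustrative only.

```python
import re

def add_commas_lookahead(text: str) -> str:
    # Consume one digit, then only *look* at the groups of three after it.
    return re.sub(r'(\d)(?=(\d\d\d)+\b)', r'\1,', text)

def add_commas_lookaround(text: str) -> str:
    # Match the empty position between a digit and a run of 3n digits.
    return re.sub(r'(?<=\d)(?=(\d\d\d)+\b)', ',', text)

print(add_commas_lookahead('1234567890'))   # 1,234,567,890
print(add_commas_lookaround('1234567890'))  # 1,234,567,890
```

Because the digits inside the lookahead are never consumed, the next match can start right after the inserted comma.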
Note on look arounds

Related

Regex to find numbers from String with different format

I've got the following text:
instance=hostname1, topic="AB_CD_EF_12345_ZY_XW_001_000001"
instance=hostname2, topic="AB_CD_EF_1345_ZY_XW_001_00001"
instance=hostname1, topic="AB_CD_EF_1235_ZY_XW_001_000001"
instance=hostname2, topic="AB_CD_EF_GH_4567_ZY_XW_01_000001"
instance=hostname1, topic="AB_CD_EF_35678_ZY_XW_001_00001"
instance=hostname2, topic="AB_CD_EF_56789_ZY_XW_001_000001"
I would like to capture numbers from the sample above. I've tried to do so with the regular expressions below and they work well as separate queries:
Regex: .*topic="AB_CD_EF_([^_]+).*
Matches: 12345 1345 1235
Regex: .*topic="AB_CD_EF_GH_([^_]+).*
Matches: 4567 35678 56789
But I need a regex which can give me all numbers, ie:
12345 1345 1235 4567 35678 56789
Make GH_ optional:
.*topic="AB_CD_EF_(GH_)?([^_]+).*
which matches all your target numbers.
See live demo.
You could be more general by allowing any number of "letter letter underscore" sequences using:
.*topic="(?:[A-Z]{2}_)+([^_]+).*
See live demo.
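As a quick check of the more general pattern, here is a minimal sketch that runs it over the six sample lines from the question. Python's re module and the names LINES, PATTERN and topic_numbers are assumptions made for illustration.

```python
import re

LINES = [
    'instance=hostname1, topic="AB_CD_EF_12345_ZY_XW_001_000001"',
    'instance=hostname2, topic="AB_CD_EF_1345_ZY_XW_001_00001"',
    'instance=hostname1, topic="AB_CD_EF_1235_ZY_XW_001_000001"',
    'instance=hostname2, topic="AB_CD_EF_GH_4567_ZY_XW_01_000001"',
    'instance=hostname1, topic="AB_CD_EF_35678_ZY_XW_001_00001"',
    'instance=hostname2, topic="AB_CD_EF_56789_ZY_XW_001_000001"',
]

# Any number of "two letters plus underscore" prefixes, then the number.
PATTERN = re.compile(r'.*topic="(?:[A-Z]{2}_)+([^_]+).*')

def topic_numbers(lines):
    numbers = []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            numbers.append(match.group(1))
    return numbers

print(topic_numbers(LINES))
# ['12345', '1345', '1235', '4567', '35678', '56789']
```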
Another option would be an expression similar to:
topic=".*?[A-Z]_([0-9]+)_.*?"
and our desired digits are in this capturing group ([0-9]+).
Please see the demo for additional explanation.
From the examples and conditions you've given I think you're going to need a very restrictive regex, but this may depend on how you want to adapt it. Take a look at the following regex and read the breakdown for more information on what it does. Use the first group (there is only one in this regex) as a substitution to retrieve the numbers you are looking for.
Regex
^instance\=hostname[0-9]+\,\s*topic\=\"[A-Z_]+([0-9]+)_[A-Z_]+[0-9_]+\"$
Try it out in this DEMO.
Breakdown
^ # Asserts position at start of the line
hostname[0-9]+ # Matches any and all hostname numbers
\s* # Matches whitespace characters (between 0 and unlimited times)
[A-Z_]+ # Matches any upper-case letter or underscore (between 1 and unlimited times)
([0-9]+) # This captures the number you want
$ # Asserts position at end of the line
Although this does answer the question you have asked, I fear it might not be exactly what you're looking for, but without further information this is the best I can give you. In any case, after you've studied the breakdown and played around with the demo a bit, it should prove to be of some help.
The regex worked for me :
/.*topic="(?:[AB_CD_EF_(GH_)]{2,3}_)+([^_]+).*/

RegExp checking for sign only if there is text afterwards

I have some cases, which I need to filter with a regex. The values which need to be filtered are listed below:
// These should be catched
123456_Test.pdf
123456 Test.pdf
123456.pdf
// These shouldn't be catched
123456Abcasd.pdf
123456-Abcasd.pdf
123456_.pdf
The current regEx looks like this:
(\d{6,7})((\_| ){0,1})(.*)\..*
The problem here is that the latter 3 are also matched. To give you a short overview of what's wrong with the wrongly matched strings:
The 1st capture group has to consist of 6-7 digits (and this capture group is needed in the end). If there are letters after these numbers, there has to be a whitespace or underscore between them. The 1st of the entries that should not be matched shows this: it is invalid, since there are letters after 123456 without the needed sign.
The last entry isn't really important, it is just there for convenience.
What am I missing? How do I adjust my regex so that it checks for the sign only if letters follow the digits?
You may use
^(\d{6,7})([_ ][A-Za-z].*)?\..*$
See the regex demo
Details
^ - start of a string
(\d{6,7}) - Group 1: 6 or 7 digits
([_ ][A-Za-z].*)? - an optional capturing group #2: a _ or space followed by a letter and then any 0+ chars, as many as possible, up to the last . on the line
\. - a literal .
.* - the rest of the line
$ - end of string.
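The pattern can be checked against all six file names from the question with a small sketch. Python's re module is an assumption, and the names FILENAME and leading_number are illustrative.

```python
import re

FILENAME = re.compile(r'^(\d{6,7})([_ ][A-Za-z].*)?\..*$')

def leading_number(name: str):
    # Returns the 6-7 digit prefix (group 1) or None if the name is rejected.
    match = FILENAME.match(name)
    return match.group(1) if match else None

for name in ['123456_Test.pdf', '123456 Test.pdf', '123456.pdf',
             '123456Abcasd.pdf', '123456-Abcasd.pdf', '123456_.pdf']:
    print(name, '->', leading_number(name))
```

The first three names yield 123456 and the last three yield None.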
Check if this perl solution works for you.
> cat regex_catch.dat
123456_Test.pdf
123456 Test.pdf
123456.pdf
123456Abcasd.pdf
123456-Abcasd.pdf
123456_.pdf
> perl -ne ' print if m/\d+(([ _])[a-zA-Z]+| [a-zA-Z]*)?\.pdf/ ' regex_catch.dat
123456_Test.pdf
123456 Test.pdf
123456.pdf
>

Regex- Ignore a constant string that matches a pattern

I have this regular expression:
\b[A-Z]{1}[A-Z]{0,7}[0-9]?\b|\b[0-9]{2,3}\b
The desired output is as highlighted:
JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match.
Observed output:
JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match.
The regex works partly but I don't want two constant strings- JOHN and I to match as part of this regex. Please help.
You can use a negative lookahead to exclude those matches. Also, your pattern seems rather "redundant", you may shorten it considerably using grouping and removing unnecessary subpatterns:
\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b
  ^^^^^^^^^^^^^^^^
See the regex demo
The (?!(?:JOHN|I)\b) is the negative lookahead that fails the match if the word matched is equal to I or JOHN.
Note that {1} can always be omitted as any unquantified pattern is matched once. [A-Z]{1}[A-Z]{0,7} is actually equal to [A-Z]{1,8}.
Pattern details:
\b - word boundary
(?!(?:JOHN|I)\b) - the word matched cannot be equal to JOHN or I
(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3}) - one of the two alternatives:
[A-Z]{1,8}[0-9]? - 1 to 8 uppercase ASCII letters followed with an optional (1 or 0) digit
| - or
[0-9]{2,3} - 2 to 3 digits
\b - trailing word boundary
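To see the negative lookahead at work, here is a minimal sketch that applies the pattern to the sample sentences from the question. Python's re module and the names TEXT, PATTERN and highlighted are assumptions for illustration.

```python
import re

TEXT = """JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match."""

PATTERN = re.compile(r'\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b')

def highlighted(text: str):
    # Every token the pattern would highlight, in order of appearance.
    return PATTERN.findall(text)

print(highlighted(TEXT))  # ['LONDON', 'PUB', 'CHICAGO', 'THIS1', '70']
```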

Regular Expression find space delimited numbers

I have a string that comes from user input through a messaging system. It can contain a series of 4-digit numbers, but as users are likely to type things in wrong, it needs to be a little bit flexible.
Therefore I want to allow them to type in the numbers, or pepper their message with any string of characters, and then just take the numbers that match the formats
=nnnn or nnnn
For this I have the Regular Expression:
(^|=|\s)\d{4}(\s|$)
Which almost works; however, because each group of 4 digits must start with an =, a space, or the start of the string, it misses every other set of numbers.
I tried this:
(^|=|\s*)\d{4}(\s|$)
But that means that any four digits followed by a space get matched - which is incorrect.
How can I match groups of numbers when a single space is both the end of one group and the beginning of the next? To clarify, this string:
Ack 9876 3456 3467 4578 4567
Should produce the matches:
9876
3456
3467
4578
4567
Here you need to use lookarounds which won't consume any characters.
(?:^|[=\s])\K\d{4}(?=\s|$)
OR
(?:^|[=\s])(\d{4})(?=\s|$)
DEMO
Your regex (^|=|\s)\d{4}(\s|$) fails because it first matches " 9876 " (including both spaces) and then looks for another space, equals sign, or start of line. The space before 3456 was already consumed by the first match, so 3456 is skipped and the next match is " 3467 ". To let adjacent matches share a character, put the shared part inside a lookaround. When the trailing (\s|$) is placed inside a lookahead, it no longer consumes the space; it only asserts that the match is followed by a space or the end of the line.
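The difference between the consuming and the lookahead version can be seen in a short sketch. Python's re module is an assumption here; it does not support \K, so only the capturing-group variant is shown, and the consuming pattern has a capture group added around \d{4} so that both return just the digits.

```python
import re

MESSAGE = 'Ack 9876 3456 3467 4578 4567'

# Trailing space is consumed, so every other number loses its leading space.
CONSUMING = re.compile(r'(?:^|=|\s)(\d{4})(?:\s|$)')
# Trailing space is only asserted, so it is still available to the next match.
LOOKAHEAD = re.compile(r'(?:^|[=\s])(\d{4})(?=\s|$)')

consumed = CONSUMING.findall(MESSAGE)
shared = LOOKAHEAD.findall(MESSAGE)

print(consumed)  # ['9876', '3467', '4567']
print(shared)    # ['9876', '3456', '3467', '4578', '4567']
```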
\b\d+\b
\b asserts position at a word boundary (^\w|\w$|\W\w|\w\W). It is a 0-width anchor, much like ^ and $. It doesn't consume any characters.
Demo
or
(?:^|(?<=[=\s]))\d{4}\b
Demo

About this regular expression (?<=\d)\d{4}

I use (?<=\d)\d{4} to match 1234567890, the result is 2345 6789.
Why it's not 2345 7890?
In the second match, it starts from 6 and 6 is matched by (?<=\d), so I think the result is 7890 rather than 6789.
Besides, how about using ((?<=\d)\d{3})+ match 1234567890?
Look behinds are non consuming, so the 5 is being "reused" in the second match (even though the first match consumed it).
If you want to start at 6, consume but don't capture:
\d(\d{4})
And use group 1, or if your regex engine supports it, use a negative lookahead for \G, which is the end of the previous match:
(?!\G)(?<=\d)\d{4}
See a live demo.
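The first two behaviours can be compared with a minimal sketch. Python's re module is an assumption here; the \G variant is left out because the built-in re module does not support \G.

```python
import re

DIGITS = '1234567890'

# The lookbehind only inspects the previous digit, so 5 is "reused".
lookbehind_matches = re.findall(r'(?<=\d)\d{4}', DIGITS)
# Consuming the leading digit instead shifts the second match by one.
consuming_matches = re.findall(r'\d(\d{4})', DIGITS)

print(lookbehind_matches)  # ['2345', '6789']
print(consuming_matches)   # ['2345', '7890']
```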
(?<=\d) is a zero-length assertion. Assertions do not consume characters in the string; they only assert whether a match is possible or not.
It matches this way as the first match finishes at 5 so the next group can be matched from 6. (?<=\d) matches 5 in this case and the match is on 6789, starting with 6.
(?<=\d) doesn't belong to the match, it doesn't consume a character, it's just asserting what is in front of the match.
(?<=\d)\d{4}
?<= Lookbehind. Makes sure a digit precedes the text to be matched.
What text are we matching? \d{4}. So the meaning is: match 4 digits that are preceded by a digit.
In 1234567890 the first such match is 2345, as it is preceded by 1. The search then resumes after that match, again looking for a group of 4 digits preceded by a digit. Since 2345 has already been matched, the next successful match is 6789, which is preceded by 5 and so satisfies the regex.
Coming to (?<=\d)\d{3}: it does the same thing as before, only it matches groups of 3. To get the regex you mentioned, we wrap the whole thing in a capture group, ((?<=\d)\d{3}), and ask for one or more of it: ((?<=\d)\d{3})+. A repeated capturing group only captures the last iteration.
So the overall match is 234567890, and 890 is what the capture group returns.
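The repeated-group behaviour can be confirmed with a short sketch, again assuming Python's re module; the variable names are illustrative.

```python
import re

match = re.search(r'((?<=\d)\d{3})+', '1234567890')

whole_match = match.group(0)     # everything the repeated group walked over
last_iteration = match.group(1)  # only the final repetition is kept

print(whole_match)     # 234567890
print(last_iteration)  # 890
```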