Regex to find numbers from String with different format - regex

I've got the following text:
instance=hostname1, topic="AB_CD_EF_12345_ZY_XW_001_000001"
instance=hostname2, topic="AB_CD_EF_1345_ZY_XW_001_00001"
instance=hostname1, topic="AB_CD_EF_1235_ZY_XW_001_000001"
instance=hostname2, topic="AB_CD_EF_GH_4567_ZY_XW_01_000001"
instance=hostname1, topic="AB_CD_EF_35678_ZY_XW_001_00001"
instance=hostname2, topic="AB_CD_EF_56789_ZY_XW_001_000001"
I would like to capture numbers from the sample above. I've tried to do so with the regular expressions below and they work well as separate queries:
Regex: .*topic="AB_CD_EF_([^_]+).*
Matches: 12345 1345 1235
Regex: .*topic="AB_CD_EF_GH_([^_]+).*
Matches: 4567 35678 56789
But I need a single regex which can give me all the numbers, i.e.:
12345 1345 1235 4567 35678 56789

Make GH_ optional:
.*topic="AB_CD_EF_(GH_)?([^_]+).*
which matches all your target numbers; because of the optional (GH_)? group, the number itself is captured in group 2.
See live demo.
You could be more general by allowing any number of "letter letter underscore" sequences using:
.*topic="(?:[A-Z]{2}_)+([^_]+).*
See live demo.
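As a quick check, here is a minimal Python sketch that runs both patterns over the sample lines from the question. The names LINES, OPTIONAL_GH and GENERAL are made up for this illustration, and Python's re module is only one possible engine.

```python
import re

# Sample lines copied from the question.
LINES = [
    'instance=hostname1, topic="AB_CD_EF_12345_ZY_XW_001_000001"',
    'instance=hostname2, topic="AB_CD_EF_1345_ZY_XW_001_00001"',
    'instance=hostname1, topic="AB_CD_EF_1235_ZY_XW_001_000001"',
    'instance=hostname2, topic="AB_CD_EF_GH_4567_ZY_XW_01_000001"',
    'instance=hostname1, topic="AB_CD_EF_35678_ZY_XW_001_00001"',
    'instance=hostname2, topic="AB_CD_EF_56789_ZY_XW_001_000001"',
]

# Pattern with the optional GH_ part: the number lands in group 2.
OPTIONAL_GH = re.compile(r'.*topic="AB_CD_EF_(GH_)?([^_]+).*')
# More general pattern: any run of "two letters + underscore", number in group 1.
GENERAL = re.compile(r'.*topic="(?:[A-Z]{2}_)+([^_]+).*')

numbers_optional = [OPTIONAL_GH.match(line).group(2) for line in LINES]
numbers_general = [GENERAL.match(line).group(1) for line in LINES]

print(numbers_optional)
print(numbers_general)
```

Both lists should contain the same six numbers, in the order the lines appear.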

Another option would be an expression similar to:
topic=".*?[A-Z]_([0-9]+)_.*?"
and our desired digits are in this capturing group ([0-9]+).
Please see the demo for additional explanation.

From the examples and conditions you've given I think you're going to need a very restrictive regex, but this may depend on how you want to adapt it. Take a look at the following regex and read the breakdown for more information on what it does. Use the first group (there is only one in this regex) as a substitution to retrieve the numbers you are looking for.
Regex
^instance\=hostname[0-9]+\,\s*topic\=\"[A-Z_]+([0-9]+)_[A-Z_]+[0-9_]+\"$
Try it out in this DEMO.
Breakdown
^ # Asserts position at start of the line
hostname[0-9]+ # Matches any and all hostname numbers
\s* # Matches whitespace characters (between 0 and unlimited times)
[A-Z_]+ # Matches any upper-case letter or underscore (between 1 and unlimited times)
([0-9]+) # This captures the number you want
$ # Asserts position at end of the line
Although this does answer the question you have asked, I fear it might not be exactly what you're looking for, but without further information this is the best I can give you. In any case, once you've studied the breakdown and played around with the demo a bit, it should prove to be of some help.
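To try the restrictive approach outside a demo page, here is a small Python sketch. It assumes plain double quotes as in the sample data, drops the backslashes that Python does not need before =, , and ", and uses re.MULTILINE so that ^ and $ apply to each line. TEXT, STRICT and captured are illustrative names only.

```python
import re

# Two lines from the question plus one line that should be rejected.
TEXT = """instance=hostname1, topic="AB_CD_EF_12345_ZY_XW_001_000001"
instance=hostname2, topic="AB_CD_EF_GH_4567_ZY_XW_01_000001"
not a matching line"""

# Whole-line pattern; the single capture group holds the wanted number.
STRICT = re.compile(
    r'^instance=hostname[0-9]+,\s*topic="[A-Z_]+([0-9]+)_[A-Z_]+[0-9_]+"$',
    re.MULTILINE,
)

# With exactly one group, findall returns the group contents.
captured = STRICT.findall(TEXT)
print(captured)
```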

This regex worked for me:
/.*topic="(?:[AB_CD_EF_(GH_)]{2,3}_)+([^_]+).*/

Related

Having issue identify a number pattern

I am new to RegEx and I am having a difficult time trying to detect a pattern.
I want to identify a number that is between 4000-4999 but at the same time must NOT be preceded or followed by another number with an optional character of either space or hyphen "-".
For example:
4567 (match)
I have 4999 roses (match)
1234567 days are gone (no match)
My water supply account is 123 4567 89 (no match)
Howdy, my cell number is 123-4567-89 (no match)
I tried the pattern below
(?<!(\d))\b4\d{3}\b(?!(\d))
but it still gives me a match for 123 4567 - I guess there is something special about \b?
Any advice will be highly appreciated.
Thanks,
Eric
You may use
(?<!\d[\s-]|\d)4\d{3}(?![\s-]?\d)
In .NET, JavaScript ECMAScript 2018 compliant environments, or the PyPI regex module, where lookbehind patterns can contain ?, *, + and {min,} quantifiers, you may shorten it to
(?<!\d[\s-]?)4\d{3}(?![\s-]?\d)
Or, in case alternation with different length is not supported (as in Boost or Python), use
(?<!\d[\s-])(?<!\d)4\d{3}(?![\s-]?\d)
See the regex demo and regex demo 2 (and a .NET regex demo).
Details
(?<!\d[\s-]|\d) / (?<!\d[\s-]?) / (?<!\d[\s-])(?<!\d) - immediately to the left of the current position there must be neither a digit followed by a whitespace/-, nor a digit on its own
4\d{3} - 4 and any 3 digits
(?![\s-]?\d) - immediately to the right, there must be no digit, whether or not it is preceded by a single whitespace/-.
NOTE The solutions above do not rely on word boundaries and may even match in between underscores and when glued to words. If you really want to avoid that, then you need to use word boundaries by all means, e.g. (?<!\d[\s-]|\d)\b4\d{3}\b(?![\s-]?\d).
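For a quick sanity check, here is a minimal Python sketch that runs the two-lookbehind variant (the one that works with Python's built-in re module) against the five examples from the question. FOUR_THOUSANDS, SAMPLES and results are names invented for this illustration.

```python
import re

# Variant with two fixed-width lookbehinds, which Python's re accepts.
FOUR_THOUSANDS = re.compile(r"(?<!\d[\s-])(?<!\d)4\d{3}(?![\s-]?\d)")

# Example strings taken from the question.
SAMPLES = [
    "4567",
    "I have 4999 roses",
    "1234567 days are gone",
    "My water supply account is 123 4567 89",
    "Howdy, my cell number is 123-4567-89",
]

# Map each sample to the list of numbers found in it.
results = {sample: FOUR_THOUSANDS.findall(sample) for sample in SAMPLES}

for sample, found in results.items():
    print(f"{sample!r}: {'match ' + found[0] if found else 'no match'}")
```

Only the first two samples should report a match.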
How about using Positive Lookahead and Positive Lookbehind along with [^ ]? I think it can get you the desired results.
Pattern:
(?<=^|[^\d]{2})4[0-9]{3}(?=$|[^\d]{2})
Example: https://regex101.com/r/PYPeCk/2/

Regex Giftcard number pattern

I am trying to come up with a regex for a giftcard number pattern in an application. I have this so far and it works fine:
(?:5049\d{12}|6219\d{12}) = 5049123456789012
What I need to account for though is numbers that are separated by dashes or spaces like so:
5049-1234-5678-9012
5049 1234 5678 9012
Can I chain these patterns together or do I need to make a separate one for each type?
The simplest regex could be:
(?:(5049|6219)([ -]?\d{4}){3})
Explanation:
(5049|6219) - Will check for the '5049' or '6219' start
(x){3} - Will repeat the (x) 3 times
[ -]? - Will look for " " or "-", ? accepts it once or 0 times
\d{4} - Will look for a digit 4 times
A more detailed explanation and example can be found here: https://regex101.com/r/A46GJp/1/
Use (?:5049|6219)(?:[ -]?\d{4}){3}
First, match one of the two leads. Then match 3 groups of 4 digits each, each group optionally preceded by space or dash.
See regex101 for a demo, which also explains the pattern in more detail.
The above regex will also match if separators are mixed, e.g. 5049 1234-5678 9012. If you don't want that, use
(?:5049|6219)([ -]?)\d{4}(?:\1\d{4}){2} regex101
This captures the first separator, if any, and specifies that the following 2 groups must use that same separator.
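The difference between the two patterns can be seen in a short Python sketch. It uses re.fullmatch so that the whole string has to be a card number; LOOSE, STRICT and the is_card helper are hypothetical names for this example.

```python
import re

# Accepts any mix of separators between the groups of four digits.
LOOSE = re.compile(r"(?:5049|6219)(?:[ -]?\d{4}){3}")
# Captures the first separator and requires the same one afterwards.
STRICT = re.compile(r"(?:5049|6219)([ -]?)\d{4}(?:\1\d{4}){2}")


def is_card(pattern: re.Pattern, text: str) -> bool:
    """Return True if the entire text matches the given pattern."""
    return pattern.fullmatch(text) is not None


for candidate in [
    "5049123456789012",
    "5049-1234-5678-9012",
    "5049 1234 5678 9012",
    "5049 1234-5678 9012",  # mixed separators
]:
    print(candidate, is_card(LOOSE, candidate), is_card(STRICT, candidate))
```

Only the last candidate, with mixed separators, should be accepted by LOOSE and rejected by STRICT.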
Try this :
(?:(504|621)9(\d{12}|(\-\d{4}){3}|(\s\d{4}){3}))
https://regex101.com/r/SyjaT5/6

Regex to match 10 digit exactly with specific pattern

Say I give a pattern 123* or 1234*; I would like to match any 10 digit number that starts with that pattern. It should have exactly 10 digits.
Example:
Pattern : 123 should match 1234567890 but not 12345678
I tried this regex : (^(123)(\d{0,10}))(?(1)\d{10}).. obviously it didn't work. I tried to group the pattern and remaining digits as two different groups. It matches 10 digits after the captured group (https://regex101.com/). How do I check that the captured group is exactly 10 digits? Or is there a good trick for this? Please guide me.
Sounds like a case for the positive lookahead:
(?=123)\d{10}
This will match a run of 10 digits, but only if that run starts with 123. Test it here.
Similarly for prefix 1234:
(?=1234)\d{10}
Of course, if you know the prefix length upfront, you can use 123\d{7}, but then you'll have to change range limits with each prefix change (for example: 1234\d{6}).
Additionally, to ensure only isolated groups of 10 digits are captured, you might want to anchor the above expression with a (zero-length) word boundary \b:
\b(?=123)\d{10}\b
or, if your sequence can appear inside of a word, you might want to use negative lookbehind and lookahead on \d (as suggested in comments by @Wiktor):
(?<!\d)(?=123)\d{10}(?!\d)
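Since the prefix changes while the total length stays at 10, the pattern can be built from the prefix. The following Python sketch assumes the digit-lookaround variant above; ten_digit_pattern is a hypothetical helper name, and re.escape is used only so that an unexpected character in the prefix cannot change the regex.

```python
import re


def ten_digit_pattern(prefix: str) -> re.Pattern:
    """Build a regex for an isolated 10-digit number starting with prefix."""
    # (?<!\d) and (?!\d) keep the match from being part of a longer digit run;
    # the lookahead checks the prefix without consuming any characters.
    return re.compile(rf"(?<!\d)(?={re.escape(prefix)})\d{{10}}(?!\d)")


starts_with_123 = ten_digit_pattern("123")
print(starts_with_123.findall("id 1234567890 ok"))  # 10 digits
print(starts_with_123.findall("12345678"))          # only 8 digits
print(starts_with_123.findall("12345678901"))       # 11 digits
```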
I would keep it simple:
import re

text = "1234567890"
# Each alternative is anchored and adds up to exactly 10 digits
match = re.search(r"^123\d{7}$|^1111\d{6}$", text)
if match:
    print("matched")
Just throw your 2 patterns in as such and it should be good to go! Note that 123* would catch 1234* so I'm using 1111\d{6} as an example

Regex : Match digits with hyphens and white spaces only

I'm trying to match digits with at least 5 characters (for the whole string) connected by a hyphen or space (like a bank account number).
e.g
"12345-62436-223434"
"12345 6789 123232"
I should also be able to match
"123-4567-890"
The current pattern I'm using is
(\d[\s-]*){5,}[\W]
But I'm getting these problems.
When I do this, I match all the white spaces after matching digits with at least 5 digit-characters
I'm going to replace this so I only want to match digits, not the white-spaces and hyphens.
When I get the match what I want to do is to mask it like the one below.
from "12345-67890-11121" to "*****-*****-*****"
or
from "12345 67890 11121" to "***** ***** *****"
My only problem is that I don't get to match it like what I want to.
Thanks!
This one might work for you (probably some false-positives, though):
\d[ \d-]{3,}\d
See a demo on regex101.com.
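To get from the match to the masked output described in the question, one option is to replace only the digits inside each match and leave the separators alone. This is a minimal Python sketch under that assumption; ACCOUNT_LIKE and mask_digits are names invented here, and the pattern can still produce the false positives mentioned above.

```python
import re

# Digit, then at least three digits/spaces/hyphens, then a closing digit.
ACCOUNT_LIKE = re.compile(r"\d[ \d-]{3,}\d")


def mask_digits(text: str) -> str:
    """Replace every digit inside an account-like run with '*'."""
    # The callback receives each match and masks only its digits,
    # so spaces and hyphens inside the match are kept as they are.
    return ACCOUNT_LIKE.sub(lambda m: re.sub(r"\d", "*", m.group()), text)


print(mask_digits("12345-67890-11121"))
print(mask_digits("12345 67890 11121"))
print(mask_digits("call 123-4567-890 now"))
```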
Maybe you want something like this:
(\d{5,})(?:-|\s)(\d{5,})(?:-|\s)(\d{5,})
Demo
EDIT:
(\d+)(?:-|\s)(\d+)(?:-|\s)(\d+)
Demo
One option here is to take your existing pattern, and then add a positive lookahead which asserts that there are seven or more characters in the pattern. Assuming that there are two spaces or dashes in the account number, this will guarantee that there are five or more digits.
You can try using the following regex:
^(?=.{7,}$)((\\d+ \\d+ \\d+)|(\\d+-\\d+-\\d+))$
Test code:
String input = "123-4567-890";
boolean match = input.matches("^(?=.{7,}$)((\\d+ \\d+ \\d+)|(\\d+-\\d+-\\d+))$");
if (match) {
    System.out.println("Match!");
}
If you need to first fish out the account numbers from a larger document/source, then do so and afterwards you can apply the regex logic above.

Regex: Comma Delimiting large integers (e.g. 2903 -> 2,903)

Here is the text:
1234567890
The regular expression:
s/(\d)((\d\d\d)+\b)/\1,\2/g
The expected result:
1,234,567,890
The actual result:
1,234567890
This is an example from Mastering Regular Expressions, used to add a comma after every 3 digits from right to left. Here is the explanation:
This is because the digits matched by (\d\d\d)+ are now actually part of the final match, and so are not left "unmatched" and available to the next iteration of the regex via the /g.
But I still don't understand it, and I hope somebody can help me figure it out in detail. Thanks in advance.
Prerequisite
The regex engine matches characters from left to right, and the matched characters are consumed by the engine. That is, once characters have been consumed by a successful match, the next iteration cannot go back and consume them again.
How does the match occur for (\d)((\d\d\d)+\b)
1234567890
|
(\d)
1234567890
 |||
 (\d\d\d)+
1234567890
    |
    \b #cannot be matched, hence it goes for another `(\d\d\d)+`
1234567890
    |||
    (\d\d\d)+
1234567890
       |
       \b #cannot be matched, hence it goes for another `(\d\d\d)+`
1234567890
       |||
       (\d\d\d)+
1234567890
          |
          \b #matched here for the first time.
Now here the magic happens. The engine has consumed all the characters, and the pointer has reached the end of the input with a successful match. The substitution \1,\2 occurs. Now there is no way to move the pointer back to
1234567890
|
(\d)
in order to obtain the expected result.
Solution
You haven't mentioned which language you are using, so I am assuming that it supports PCRE.
The look aheads will be of great use here.
s/(\d)(?=(\d\d\d)+\b)/\1,/g
Here the second group (?=(\d\d\d)+\b) is a look ahead and does not consume any characters, but checks if the characters can be matched or not
Regex Demo
OR
Using look arounds as
s/(?<=\d)(?=(\d\d\d)+\b)/,/g
Here
(?<=\d) look behind. Checks if preceded by a digit
(?=(\d\d\d)+\b) look ahead. Checks if followed by one or more groups of 3 digits that end at a word boundary.
Regex Demo
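The same three substitutions can be compared side by side in Python, whose re module also supports these lookarounds. This is only a sketch for checking the behaviour; the variable names are made up, and re.sub stands in for the s///g syntax used above.

```python
import re

number = "1234567890"

# Original attempt: the whole digit run is consumed by the first match,
# so only one comma is inserted.
consuming = re.sub(r"(\d)((\d\d\d)+\b)", r"\1,\2", number)

# Lookahead version: only one digit is consumed per match.
lookahead = re.sub(r"(\d)(?=(\d\d\d)+\b)", r"\1,", number)

# Lookbehind plus lookahead: nothing is consumed, a comma is inserted
# at every position that qualifies.
lookaround = re.sub(r"(?<=\d)(?=(\d\d\d)+\b)", ",", number)

print(consuming)
print(lookahead)
print(lookaround)
```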
Note on look arounds