Conditional regex in Ruby - regex

I've got the following string:
'USD 100'
Based on this post I'm trying to capture 100 if USD is contained in the string or the individual (currency) characters if USD is not contained in the string.
For example:
'USD 100' # => '100'
'YEN 300' # => ['Y', 'E', 'N']
So far I've got up to this but it's not working:
https://rubular.com/r/cK8Hn2mzrheHXZ
Interestingly if I place the USD after the amount it seems to work. Ideally I'd like to have the same behaviour regardless of the position of the currency characters.

Your regex (?=.*(USD))(?(1)\d+|[a-zA-Z]) does not work because
(?=.*(USD)) - a positive lookahead that is tried at every position in the string (when scan is used). It succeeds only if USD can be found after any 0 or more chars other than line break chars, as many as possible; in other words, there is a match only if USD occurs somewhere to the right of the current position on the same line
(?(1)\d+|[a-zA-Z]) - a conditional construct that matches 1+ digits if Group 1 matched (if there is USD), or, an ASCII letter will be tried. However, the second alternative pattern will never be tried, because you required USD to be present in the string for a match to occur.
Look at the USD 100 regex debugger, it shows exactly what happens when the (?=.*(USD))(?(1)\d+|[a-zA-Z]) regex tries to find a match:
Step 1 to 22: The lookahead pattern is tried first. The point here is that the match will fail immediately if the positive lookahead pattern does not find a match. In this case, USD is found at the start of the string (since the first time the pattern is tried, the regex index is at the string start position). The lookahead found a match.
Step 23-25: since a lookahead is a non-consuming pattern, the regex index is still at the string start position. The lookahead says "go ahead", and the conditional construct is entered. The (?(1) condition is met because Group 1, USD, was matched, so the first ("then") part is triggered. \d+ does not find any digits, since there is a U letter at the start. The regex match fails at the string start position, but there are more positions in the string to test, since there is no \A or ^ anchor that would restrict the match to the start of the string/line.
Step 26: The regex engine index is advanced one char to the right, now, it is right before the letter S.
Step 27-40: The regex engine wants to find 0+ chars and then USD immediately to the right of the current location, but fails (U is already "behind" the index).
Then, the execution is just the same as described above: the regex fails to match USD anywhere to the right of the current location and eventually fails.
If the USD is somewhere to the right of 100, then you'd get a match.
So, the lookahead does not set any search range, it simply allows matching the rest of the patterns (if its pattern matches) or not (if its pattern is not found).
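To see this behaviour directly, here is a minimal Ruby sketch. The helper name first_match is hypothetical; it just returns the first overall match of the original pattern, or nil when there is none.

```ruby
# Hypothetical helper: first overall match of the original pattern, or nil.
def first_match(str)
  str[/(?=.*(USD))(?(1)\d+|[a-zA-Z])/]
end

p first_match('USD 100') # => nil   (\d+ is tried at "U"; later positions have no USD ahead)
p first_match('100 USD') # => "100" (USD is ahead of position 0, where \d+ can match)
p first_match('YEN 300') # => nil   (the lookahead never succeeds, so the else branch never runs)
```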
You may use
.scan(/^USD.*?\K(\d+)|([a-zA-Z])/).flatten.compact
Pattern details
^USD.*?\K(\d+) - either USD at the start of the string, then any 0 or more chars other than line break chars as few as possible, and then the text matched is dropped and 1+ digits are captured into Group 1
| - or
([a-zA-Z]) - any ASCII letter captured into Group 2.
See Ruby demo:
p "USD 100".scan(/^USD.*?\K(\d+)|([a-zA-Z])/).flatten.compact
# => ["100"]
p "YEN 100".scan(/^USD.*?\K(\d+)|([a-zA-Z])/).flatten.compact
# => ["Y", "E", "N"]

Anatomy of your pattern
(?=.*(USD))(?(1)\d+|[a-zA-Z])
(?=...) Positive lookahead
(USD) Capture group 1
(?(1)...|...) If clause: test for group 1
\d+ If group 1 exists, match 1+ digits
[a-zA-Z] Else match a single char a-zA-Z
About the pattern you tried
The positive lookahead is not anchored and will be tried on each position. It will continue the match if it returns true, else the match stops and the engine will move to the next position.
Why does the pattern not match?
On the first position the lookahead is true as it can find USD on the right.
It tries to match 1+ digits, but the first char is U which it can not match.
USD 100
⎸
First position
From the second position till the end, the lookahead is false because it can not find USD on the right.
USD 100
 ⎸
Second position
Eventually, the if clause is only tried once, where it could not match 1+ digits. The else clause is never tried and overall there is no match.
For the YEN 300 part, the if clause is never tried as the lookahead will never find USD at the right and overall there is no match.
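The position-by-position behaviour can be checked with a small Ruby sketch. The helper lookahead_hits is a hypothetical name; it slices the string at each index, which is equivalent here because a lookahead only inspects text to the right of the current position.

```ruby
# Hypothetical helper: for each start position, report whether the
# lookahead (?=.*USD) would succeed there.
def lookahead_hits(str)
  (0...str.length).map { |i| str[i..-1].match?(/\A(?=.*USD)/) }
end

p lookahead_hits('USD 100') # true only at the first position
p lookahead_hits('YEN 300') # false everywhere
p lookahead_hits('100 USD') # true for every position up to and including the U
```

Only positions where the lookahead is true ever reach the conditional, and in 'USD 100' that single position starts with U, not a digit.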
Useful resources about conditionals can be found, for example, at rexegg.com and regular-expressions.info
If you want the separate matches, you might use:
\bUSD \K\d+|[A-Z](?=[A-Z]* \d+\b)
Explanation
\bUSD Match USD and a space
\K\d+ Forget what is matched using \K and match 1+ digits
| Or
[A-Z] Match a char A-Z
(?=[A-Z]* \d+\b) Assert what is on the right is optional chars A-Z and 1+ digits
Or using capturing groups:
\bUSD \K(\d+)|([A-Z])(?=[A-Z]* \d+\b)
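A short Ruby sketch of the non-capturing variant follows. The method name usd_or_letters is made up for illustration; since the pattern has no groups, scan returns the full matches, and \K drops the "USD " prefix from them.

```ruby
# Hypothetical wrapper around the \K pattern; scan returns the full matches.
def usd_or_letters(str)
  str.scan(/\bUSD \K\d+|[A-Z](?=[A-Z]* \d+\b)/)
end

p usd_or_letters('USD 100') # => ["100"]
p usd_or_letters('YEN 300') # => ["Y", "E", "N"]
```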

The following pattern seems to work:
\b(?:USD (\d+)|(?!USD\b)(\w+) \d+)\b
This works with the caveat that it has just a single capture group for the non-USD currency symbol, so the symbol is captured as a whole rather than as individual characters. One part of the regex might merit explanation:
(?!USD\b)(\w+)
This uses a negative lookahead to assert that the currency symbol is not USD. If so, then it captures that currency symbol.
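Since the currency symbol is captured as one string, splitting it into characters has to happen in code. A minimal Ruby sketch, assuming a hypothetical helper named amount_or_letters:

```ruby
# Hypothetical helper: the amount for USD, otherwise the currency letters.
def amount_or_letters(str)
  m = str.match(/\b(?:USD (\d+)|(?!USD\b)(\w+) \d+)\b/)
  return nil unless m

  # Group 1 is set only for USD; group 2 holds any other currency symbol.
  m[1] || m[2].chars
end

p amount_or_letters('USD 100') # => "100"
p amount_or_letters('YEN 300') # => ["Y", "E", "N"]
```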

I suggest the information desired be extracted as follows.
R = /\b([A-Z]{3}) +(\d+)\b/
def doit(str)
  str.scan(R).each_with_object({}) do |(cc,val),h|
    h[cc] = (cc == 'USD') ? val : cc.split('')
  end
end
doit 'USD 100'
#=> {"USD"=>"100"}
doit 'YEN 300'
#=> {"YEN"=>["Y", "E", "N"]}
doit 'I had USD 6000 to spend'
#=> {"USD"=>"6000"}
doit 'I had YEN 25779 to spend'
#=> {"YEN"=>["Y", "E", "N"]}
doit 'I had USD 60 and CDN 80 to spend'
#=> {"USD"=>"60", "CDN"=>["C", "D", "N"]}
doit 'USD -100'
#=> {}
doit 'YENS 4000'
#=> {}
Ruby's regex engine performs the following operations.
\b : assert a word boundary
([A-Z]{3}) : match 3 uppercase letters in capture group 1
\ + : match 1+ spaces
(\d+) : match 1+ digits in capture group 2
\b : assert a word boundary

TLDR;
An excellent working solution can be found in Wiktor's answer and the rest of the posts.
Long answer:
Since I wasn't perfectly satisfied with Wiktor's explanation of why my solution wasn't working, I decided to dig into it a bit more myself and this is my take on it:
Given the string USD 100, the following regex
(?=.*(USD))(?(1)\d+|[a-zA-Z])
simply won't work. The juice of this whole thing is to figure out why.
It turns out that the lookahead (?=.*(USD)) only checks that USD occurs somewhere at or to the right of the current position; it does not move the pointer. The pattern inside the conditional ((?(1)\d+|[a-zA-Z])) is then applied at that same position, which in this case yields nothing, since the text at the start of the string is U rather than a digit.
If we break it down in steps here's an outline of what -I think- is happening:
The pointer is set at the very beginning. The lookahead (?=.*(USD)) is parsed and executed.
USD is found but since the expression is a lookahead the pointer remains at the beginning of the string and is not consumed.
The conditional ((?(1)\d+|[a-zA-Z])) is parsed and executed.
Group 1 is set (since USD has been found), so \d+ is tried. It fails because the pointer is still at the beginning of the string, where the next character is U rather than a digit. After all, that's exactly why it's called a lookahead: it only inspects the text ahead and never moves the pointer, so the digits further to the right are out of reach from this position.
The engine then retries from each later position, but from there USD is no longer ahead, so the lookahead itself fails and the regex returns no results. And as Wiktor correctly pointed out:
the second alternative pattern will never be tried, because you required USD to be present in the string for a match to occur.
which basically says that the "else" branch can never run: it is only reached when Group 1 is unset, but the lookahead has already required USD (and therefore Group 1) to match.
As a counter example, if the same regex is tested on a string where the digits come before USD, it will work, because the lookahead succeeds at the start of the string and \d+ can match right there:
'100 USD'
Hope this helps someone in the future.

Related

How to make negative lookbehind in regex work with following meta-sequence? [duplicate]

I'm having trouble understanding negative lookbehind in regular expressions.
For a simple example, say I want to match all Gmail addresses that don't start with 'test'.
I have created an example on regex101 here.
My regular expression is:
(?<!test)\w+?\.?\w+#gmail\.com
So it matches things like:
hagrid#gmail.com
harry.potter#gmail.com
But it also matches things like
test#gmail.com
where the original string was
test#gmail.com
I thought the (?<!test) should exclude that match?
(?<!test)\w+?\.?\w+#gmail\.com works by looking behind each character before moving forward with the match.
test#gmail.com
^
At the point marked by the ^ (before the 0th character), the engine looks behind and doesn't see "test", so it can happily march forward and match "test#gmail.com", which is legal per what remains of the pattern \w+?\.?\w+#gmail\.com.
Using a negative lookahead with a word boundary fixes the problem:
\b(?!test)\w+?\.?\w+#gmail\.com
Consider our target again on the updated regex:
test#gmail.com
^
At this point, the engine is at a word boundary \b, looks ahead and sees "test" and cannot accept the string.
You may wonder if the \b boundary is necessary. It is, because removing it matches "est#gmail.com" from "test#gmail.com".
test#gmail.com
 ^
The engine's cursor failed to match "test#gmail.com" from the 0th character, but after it steps forward, it matches "est#gmail.com" without problem, but that's not the intent of the programmer.
Demo of rejecting any email otherwise matching your format that begins with "test":
const s = `this is a short example hagrid#gmail.com of what I'm
trying to do with negative lookbehind test#gmail.com
harry.potter#gmail.com testasdf#gmail.com #gmail.com
a#gmail.com asdftest#gmail.com`;
console.log([...s.matchAll(/\b(?!test)\w+?\.?\w+#gmail\.com/g)]);
Note that \w+?\.?\w+ enforces that if there is a period, it must be between \w+ substrings, but this approach rejects a (probably) valid email like "a#gmail.com" because it's only one letter. You might want \b(?!test)(?:\w+?\.?\w+|\w)#gmail\.com to rectify this.
As the name suggests, the (?<! sequence is a negative lookbehind. So, the rest of the pattern would match only if it's not preceded by the look behind. This is determined by where the matching starts from.
Let's start simple - we define a regex .cde. and try to match it against some input:
First nine letters are abcdefgeh
.cde. start: at "b"
.cde. end: at "f"
So now we can see that the match is bcdef and is preceded by (among other characters) a. So, if we use that as a negative lookbehind (?<!a).cde. we will not get a match:
First nine letters are abcdefgeh
(?<!a): looks at the character just before "b", which is "a"
.cde. start: at "b"
.cde. end: at "f"
We could match the .cde. pattern but it's preceded by a which we don't want.
However, what happens if we define the negative lookbehind differently - as (?<!b).cde.:
First nine letters are abcdefgeh
.cde. start: at "b"
.cde. end: at "f"
We get a match for bcdef because there is no b before this match. Therefore, it works fine. Yes, b is the first character of the match but doesn't show up before it. And this is the core of the lookarounds (lookbehinds and lookaheads) - they are not included in the main match. In fact they fall under zero-length matches, since they will be checked but won't appear as part of the match. In effect, they only work starting from some position but check the part of the input that will not go in the final match.
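The three cases above can be reproduced with a few lines of Ruby. This is only an illustrative sketch; lookbehind_demo is a hypothetical name.

```ruby
# Hypothetical helper: first match of each variant against the same text.
def lookbehind_demo(text)
  {
    plain: text[/.cde./],              # matches "bcdef"
    not_after_a: text[/(?<!a).cde./],  # nil: the match would be preceded by "a"
    not_after_b: text[/(?<!b).cde./]   # "bcdef": "b" is inside the match, not before it
  }
end

p lookbehind_demo('First nine letters are abcdefgeh')
```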
Now, if we return to your regex - (?<!test)\w+?\.?\w+#gmail\.com here is where each match starts:
test#gmail.com
\w+? starts at "t" (position 0)
\w+ starts at "e" (position 1)
#gmail\.com starts at "#" (position 4)
(yes, it's slightly weird but both \w+? and \w+ both produce matches)
The negative lookbehind is for test and since it doesn't appear before the match, the pattern is satisfied.
You might wonder why something like testfoo#gmail.com still produces a match - it has test and then other letters, right?
testfoo#gmail.com
\w+? starts at "t" (position 0)
\w+ starts at "e" (position 1)
#gmail\.com starts at "#" (position 7)
Same result again. The problem is that \w+ will include all letters in a match, so even if the actual string test appears, it will be in the match, not before it.
To be able to differentiate the two, you have to avoid overlaps between the lookbehind pattern and the actual matching pattern.
You can decide to define the matching pattern differently (?<!test)h\w+?\.?\w+#gmail\.com, so the match has to start with an h. In that case there is no overlap and the matching pattern will not "hide" the lookbehind and make it irrelevant. Thus the pattern will match correctly against harry.potter#gmail.com, hagrid#gmail.com but will not match testhermione#gmail.com:
testhermione#gmail.com
(?<!test) looks at "test" (positions 0-3), just before "h"
h would start at "h" (position 4)
\w+? would start at "e" (position 5)
\w+ would start at "r" (position 6)
#gmail\.com would start at "#" (position 12)
Alternatively, you can define a lookbehind that doesn't overlap with the start of the matching pattern. But beware. Remember that regexes (like most things with computers) do what you tell them, not exactly what you mean. If we use the regular expression (?<!test-)\w+?\.?\w+#gmail\.com (the negative lookbehind is test- now) and test it against test-hermione#gmail.com, we get a match for ermione#gmail.com:
test-hermione#gmail.com
(?<!test-) rejects a start at "h" (position 5), which is preceded by test-
\w+? starts at "e" (position 6)
\w+ starts at "r" (position 7)
#gmail\.com starts at "#" (position 13)
The regex says that we don't want anything preceded by test-, so the regex engine obliges - there is a test- before the h, so the regular expression engine discards it and the rest of the string works to fit the pattern.
So, bottom line
avoid having the match overlap with the lookbehind, or it's not actually a lookbehind any more - it's part of the match.
be careful - the regex engine will satisfy the lookbehind but in the most literal way possible with the least effort possible.
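Both points can be checked in Ruby with the strings used above. The sketch below is illustrative only; lookbehind_results is a hypothetical helper that returns the first match for each case (nil when there is none).

```ruby
# Hypothetical helper: first match of each lookbehind variant discussed above.
def lookbehind_results
  [
    'test#gmail.com'[/(?<!test)\w+?\.?\w+#gmail\.com/],          # overlap: "test" is inside the match
    'testhermione#gmail.com'[/(?<!test)h\w+?\.?\w+#gmail\.com/],  # no overlap: rejected
    'harry.potter#gmail.com'[/(?<!test)h\w+?\.?\w+#gmail\.com/],  # nothing before h: accepted
    'test-hermione#gmail.com'[/(?<!test-)\w+?\.?\w+#gmail\.com/]  # engine starts one char later
  ]
end

p lookbehind_results
# => ["test#gmail.com", nil, "harry.potter#gmail.com", "ermione#gmail.com"]
```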
In order for this to work properly you need to both:
Use a negative lookahead (as opposed to a lookbehind, like your example)
Anchor the match (to prevent partial matches. Several anchors are possible, but in your case the best is probably \b, for word boundaries)
This is the result:
\b(?!test)\w+?\.?\w+#gmail\.com
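As a quick check in Ruby, here is a sketch that filters a small list with this pattern. The helper name without_test_prefix and the sample list are assumptions made for illustration.

```ruby
# Hypothetical helper: keep addresses accepted by the anchored negative lookahead.
def without_test_prefix(addresses)
  addresses.select { |a| a.match?(/\b(?!test)\w+?\.?\w+#gmail\.com/) }
end

sample = %w[hagrid#gmail.com test#gmail.com harry.potter#gmail.com testasdf#gmail.com asdftest#gmail.com]
p without_test_prefix(sample)
# => ["hagrid#gmail.com", "harry.potter#gmail.com", "asdftest#gmail.com"]

# Without the \b anchor the engine simply starts one character later.
p 'test#gmail.com'[/(?!test)\w+?\.?\w+#gmail\.com/] # => "est#gmail.com"
```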

Regex to get the word after specific match words

I am trying to pull the dollar amount from some invoices. I need the match to be on the word directly after the word "TOTAL". Also, the word total may sometimes appear with a colon after it (ie Total:). An example text sample is shown below:
4 Discover Credit Purchase - c REF#: 02353R TOTAL: 40.00 AID: 1523Q1Q TC: mzQm 40.00 CHANGE 0.00 TOTAL NUMBER OF ITEMS SOLD = 0 12/23/17 Ql:38piii 414 9 76 1G6 THANK YOU FOR SHOPPING KR08ER Now Hiring - Apply Today!
In the case of the sample above, the match should be "40.00".
The Regex statement that I wrote:
(?<=total)([^\n\r]*)
pulls EVERYTHING after the word "total". I only want the very next word.
This (unlike other answers so far) matches only the total amount (ie without needing to examine groups):
((?<=\bTOTAL\b )|(?<=\bTOTAL\b: ))[\d.]+
See live demo matching when input has, and doesn’t have, the colon after TOTAL.
The reason 2 look behinds (which don’t capture input) are needed is they can’t have variable length. The optional colon is handled by using an alternation (a regex OR via ...|...) of 2 look behinds, one with and one without the colon.
If TOTAL can be in any case, add (?i) (the ignore case flag) to the start of the regex.
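A minimal Ruby sketch of the two-lookbehind idea is shown below. It is a simplified variant, not the exact pattern above: the \b assertions are left out and the outer group is non-capturing, and total_amount is a hypothetical name.

```ruby
# Hypothetical helper: the amount is the full match, no groups needed.
# Simplified from the pattern above (no \b, non-capturing outer group).
def total_amount(text)
  text[/(?:(?<=TOTAL )|(?<=TOTAL: ))[\d.]+/]
end

receipt = 'REF#: 02353R TOTAL: 40.00 AID: 1523Q1Q TC: mzQm 40.00 CHANGE 0.00 TOTAL NUMBER OF ITEMS SOLD = 0'
p total_amount(receipt)       # => "40.00"
p total_amount('TOTAL 12.50') # => "12.50"
```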
What you could do is match total followed by an optional colon :? and zero or more times a whitespace character \s* and capture in a group one or more digits followed by an optional part that matches a dot and one or more digits.
To match an upper or lowercase variant of total you could make the match case insensitive, for example by adding the inline modifier (?i) or by using a case insensitive flag.
\btotal:?\s*(\d+(?:\.\d+)?)
The value 40.00 will be in group 1.
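In Ruby, for instance, group 1 can be read directly with String#[]. The helper name total_from_group is hypothetical, and the /i flag stands in for the (?i) modifier.

```ruby
# Hypothetical helper: return capture group 1 of the pattern above, or nil.
def total_from_group(text)
  text[/\btotal:?\s*(\d+(?:\.\d+)?)/i, 1]
end

p total_from_group('REF#: 02353R TOTAL: 40.00 AID: 1523Q1Q') # => "40.00"
p total_from_group('Total 7')                                # => "7"
p total_from_group('TOTAL NUMBER OF ITEMS SOLD = 0')         # => nil
```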
Explanations are in the regex pattern.
using System;
using System.Text.RegularExpressions;

string str = "4 Discover Credit Purchase - c REF#: 02353R TOTAL: 40.00 AID: 1523Q1Q";
string pattern = @"(?ix) # 'i' means case-insensitive search, 'x' allows whitespace and comments
    \b           # Word boundary
    total        # 'TOTAL' or 'total' or any other combination of cases
    :?           # Matches colon if it exists
    \s+          # One or more whitespace characters
    (\d+\.\d+)   # Sought number saved into group
    \s           # One whitespace character";
// The number is in the first group: Groups[1]
Console.WriteLine(Regex.Match(str, pattern).Groups[1].Value);
You can use the regex below to get the amount after TOTAL:
\bTOTAL\b:?\s*([\d.]+)
It will capture the amount in first group.
Link : https://regex101.com/r/tzze8J/1/
Try this pattern (with the dot escaped so that it only matches a literal decimal point): TOTAL:? ?(\d+\.\d+)[^\d]?

Regex to check only if the group is present

I have String which may have values like below.
854METHYLDOPA
041ALDOMET /00000101/
133IODETO DE SODIO [I 131]
Here I need to get the text starting from index 4 until one of these patterns is found: /00000101/ or [I 131]
Expected Output:
METHYLDOPA
ALDOMET
IODETO DE SODIO
I have tried the below RegEx for the same
(?:^.{3})(.*)(?:[[/][A-Z0-9\s]+[]/\s+])
But this RegEx only works if the string contains [ or /; it doesn't work for case 1, where these patterns don't exist.
I have tried adding ? at the end; then it works for case 1 but doesn't work for cases 2 and 3.
Could anyone please help me get the regex to work?
Your logic is difficult to phrase. My interpretation is that you always want to capture from the 4th character onwards. What else gets captured depends on the remainder of the input. Should either /00000101/ or [I 131] occur, then you want to capture up until that point. Otherwise, you want to capture the entire string. Putting this all together yields this regex:
^.{3}(?:(.*)(?=/00000101/|\[I 131\])|(.*))
You may try this:
^.{3}(.*?)($|(?:\s*\/00000101\/)|(?:\s*\[I\s+131\])).*$
and replace by this to get the exact output you want.
\1
Explanation:
^ --> start of the string
.{3} --> followed by 3 characters
(.*?) --> followed by anything; the ? makes it lazy, so it stops as soon as what follows can match and won't go beyond that. It is also captured as group 1 --> \1
($|(?:\s*\/00000101\/)|(?:\s*\[I\s+131\])) --> one of the following:
$ --> the end of the string, which means that neither of the patterns you mentioned is present
| or
(?:\s*\/00000101\/) --> one of your patterns, with \s* added to cover zero or more whitespace characters
| or
(?:\s*\[I\s+131\]) --> your other pattern, with \s+ added, which means 1 or more whitespace characters. ?: indicates that we will not capture it.
.*$ --> .* just matches anything that follows and $ declares the end of the string.
So the whole string is matched but only group 1 is used, and replacing everything with group 1 gives your target output.
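The replace step can be sketched in Ruby with String#sub, where '\1' refers to group 1. The method name strip_code_and_suffix is an assumption for illustration.

```ruby
# Hypothetical helper: replace the whole line with capture group 1.
def strip_code_and_suffix(line)
  line.sub(/^.{3}(.*?)($|(?:\s*\/00000101\/)|(?:\s*\[I\s+131\])).*$/, '\1')
end

p strip_code_and_suffix('854METHYLDOPA')              # => "METHYLDOPA"
p strip_code_and_suffix('041ALDOMET /00000101/')      # => "ALDOMET"
p strip_code_and_suffix('133IODETO DE SODIO [I 131]') # => "IODETO DE SODIO"
```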
You could get the values you are looking for in group 1:
^.{3}(.+?)(?=$| ?\[I 131\]| ?\/00000101\/)
Explanation
From the beginning of the string ^
Match the first 3 characters .{3}
Match in a capturing group (where your values will be) any character one or more times non greedy (.+?)
A positive lookahead (?=
To assert what follow is either the end of the string $
or |
an optional space ? followed by [I 131] \[I 131\]
or |
an optional space ? followed by /00000101/ \/00000101\/
If your regex engine supports \K, you could try it like this and the values you are looking for are not in a group but the full match:
^.{3}\K.+?(?=$| ?\[I 131\]| ?\/00000101\/)
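Ruby's engine supports \K, so the full-match variant can be tried as follows. This is only a sketch; drug_name is a hypothetical helper name.

```ruby
# Hypothetical helper: the wanted text is the full match thanks to \K.
def drug_name(line)
  line[/^.{3}\K.+?(?=$| ?\[I 131\]| ?\/00000101\/)/]
end

p drug_name('854METHYLDOPA')              # => "METHYLDOPA"
p drug_name('041ALDOMET /00000101/')      # => "ALDOMET"
p drug_name('133IODETO DE SODIO [I 131]') # => "IODETO DE SODIO"
```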

Simple trouble with regular expression

I have this string:
I have an eraser and 2 pencils.
Jane has a ruler and a stapler.
I need to get all the items that I have (lines starting with I have). I have tried these expressions:
(?:I have|and)\h+((?:a|an|\d+)\h+(?:\w+))
# returns some of the items that Jane has.
(I have )(?(1)((?:a|an|\d+) \w+))
# returns only the word closest to the beginning of the string.
I'm looking for a way to match a given string/expression at the beginning of the line or somewhere before the capturing group. Thanks in advance.
Note: I'm working with PCRE
It's still tricky to have a variable number of groups, but you can try this:
I have (?:an |a )?(\d? ?\w+)(?: and (?:an |a )?(\d? ?\w+))?(?: and (?:an |a )?(\d? ?\w+))?
Below are some sample results:
"I have an eraser and a pencil and an item" -> ["eraser", "pencil", "item"]
"She has a turtle and a car" -> []
"I have 3 bricks and 4 knees and a tie" -> ["3 bricks", "4 knees", "tie"]
"I have a motorcycle and a bag" -> ["motorcycle", "bag"]
"I have a journal" -> ["journal"]
"I have wires and tires" -> ["wires", "tires"]
"I must say I have a train and a bicycle" -> ["train", "bicycle"]
For each line, it will capture a maximum number of 3 items.
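A Ruby sketch of this fixed-groups approach follows. It assumes the pattern is written with one mandatory and two optional item groups, which matches the stated maximum of 3 items; items_i_have is a hypothetical name.

```ruby
# Hypothetical helper: collect up to three captured items, dropping unused groups.
def items_i_have(sentence)
  m = sentence.match(/I have (?:an |a )?(\d? ?\w+)(?: and (?:an |a )?(\d? ?\w+))?(?: and (?:an |a )?(\d? ?\w+))?/)
  m ? m.captures.compact : []
end

p items_i_have('I have an eraser and a pencil and an item') # => ["eraser", "pencil", "item"]
p items_i_have('She has a turtle and a car')                # => []
p items_i_have('I have 3 bricks and 4 knees and a tie')     # => ["3 bricks", "4 knees", "tie"]
```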
This is a typical case for anchoring at the end of previous match with \G.
We're trying to match some text followed by an unknown number of tokens, and it needs to capture each token individually. The regex engine is totally capable of repeating a construct to match repeating token, but each backreference must be defined on its own. Therefore, repeating a capturing group ends up overwriting its stored value and returning only the last matched value. This task may be achieved by 2 different strategies: either capturing all tokens with 1 pattern and then using a second pattern match to split them, or performing one full match for each token.
Instead of trying to get all the items "I have" in the same match, we're going to attempt to match once per item. This approach was also tried with some of the patterns proposed in the comments. However, as you may have realized, the regex engine also matches from the middle of the string, and thus matching unwanted cases like:
She has >>a turtle<< ...
This is where we can use an anchor like \G. Our strategy will be:
1. Match ^I have and capture 1 item (the match ends here).
2. In the consecutive match, start at the end of the previous match, and match 1 item.
3. Repeat (2) for successive matches.
Now, this can be translated to regex:
^I have an? + the token
Literal text at the beginning of the line.
an or a.
And we'll cover the token construct later.
\G(?!^)(?: and)? an? + the token
\G matches a zero-width position at the end of previous match. This is how the regex engine won't attempt a match anywhere in the string.
However, \G also matches at the beginning of the string, and we don't want to match the string "an item...". There's a trick: we used the negative lookahead (?!^) to specify "it's not followed by the start of the text". Therefore, it's guaranteed to match where it left off from the previous match in (1).
(?: and)? is optional, so it may or may not be there.
an? matches the article (an/a).
Do you see that both end up with the same construct? if we join the 2 options together:
(?:^I have:?|(?!^)\G(?: and)?) an? <<the token>>
Let's talk about the token. If it were only one word, we'd use \w+. That's not the case. Neither is .* because it shouldn't match the whole string. And we can't consume part of the following token because otherwise it wouldn't be returned in the next match.
I have a new eraser and a pencil
                   ^
                   |
                   How does it stop here, right after "eraser"?!
How do we define a condition not to allow a match beyond that position?
It's not followed by a/an/and !!!
This can be achieved by a negative lookahead, to guarantee it's not followed by a/an/and before we match a character: (?! a | an | and ).. As you can imagine, that construct will be repeated to match every one of the characters in a token.
This pattern matches what we want: (?:(?! and | an? ).)+
And one more thing, we'll use a capturing group around it to be able to extract the text.
the token = ((?:(?! and | an? ).)+)
First version:
We now have the first working version of the regex. Put together:
(?:^I have:?|(?!^)\G(?: and)?) an? ((?:(?! and | an? ).)+)
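The question targets PCRE, but Ruby's engine also supports \G, so the first version can be tried with String#scan as a rough check. The helper name items_via_g is hypothetical.

```ruby
# Hypothetical helper: one match per item, each anchored to the previous match by \G.
def items_via_g(line)
  line.scan(/(?:^I have:?|(?!^)\G(?: and)?) an? ((?:(?! and | an? ).)+)/).flatten
end

p items_via_g('I have a new eraser and a pencil')
p items_via_g('She has a turtle and a car')
```

The first call should yield the two tokens "new eraser" and "pencil", and the second an empty list, because neither alternative can start a match on that line. Note that this version requires an article before each item.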
A few more tricks:
Following the same principle, this approach allows us to include more conditions to the match. For instance,
Not anchored to the start of line.
Without capturing groups, returning each token as the value of the full match.
Items can be separated with commas.
"I have" could be followed by any word, not necessarily an article, using exceptions.
etc.
What to choose depends on the subject text, and it should be tested with several examples and corrected until it works as desired.
Solution:
This is the pattern I'd personally use in this case:
(?: # SUBPATTERN 1
\bI have:? # "I have"
(?![ ](?:to|been|\w+?[en]d)\b) # not followed by to|been|\w+[en]d
| # or
(?!\A)\G[ ] # anchored to previous match
?,?(?:[ ]?and)? # optional comma or "and"
) #
#
[ ](?:(?:an?|some)[ ])? # ARTICLE: a|an|some
#
\K # \K (reset match)
#
(?: # SUBPATTERN 2
(?! # Negative lookahead (exceptions)
[ ]*, # a. Comma to list another item
| # b. Article (a|an), some
[ ](?:a(?:nd?)?|some)\b # or and
) #
. # MATCH each character in a token
)+ # REPEAT Subpattern 2
One-liner:
(?:\bI have:?(?! (?:to|been|\w+?[en]d)\b)|(?!\A)\G ?,?(?: ?and)?) (?:(?:an?|some) )?\K(?:(?! *,| (?:a(?:nd?)?|some)\b).)+
However, it should be tested to identify exceptions and use cases. This is how it behaves with the examples discussed in this post.
Matching the subject text:
Each match has been marked.
I have an eraser, a pencil and an item
She has a turtle and a car
I have an awesome motorcycle tatoo and a bag
I have to say I have a train and a bicycle
I have 3 bricks and 4 knees and a tie
Notice these are full matches, and not the value returned by a group. Simply add a group to enclose the "subpattern 2" to capture the tokens.
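As a rough check outside PCRE, the one-liner can be run in Ruby, whose engine also supports \G and \K. The method name things_i_have is an assumption; scan returns the full matches because the pattern has no capturing groups.

```ruby
# Hypothetical helper: each full match is one token (the \K drops the prefix).
def things_i_have(line)
  line.scan(/(?:\bI have:?(?! (?:to|been|\w+?[en]d)\b)|(?!\A)\G ?,?(?: ?and)?) (?:(?:an?|some) )?\K(?:(?! *,| (?:a(?:nd?)?|some)\b).)+/)
end

[
  'I have an eraser, a pencil and an item',
  'She has a turtle and a car',
  'I have to say I have a train and a bicycle',
  'I have 3 bricks and 4 knees and a tie'
].each { |line| p things_i_have(line) }
```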

Simple regex validation

I want to implement the following validation: match at least 5 digits, with some other characters allowed in between (for example letters and slashes). For example 12345, 1A/2345, B22226, 21113C are all valid combinations, but 1234 and AA1234 are not. I know that {5,} gives the minimum number of occurrences, but I don't know how to cope with the other characters; [0-9A-Z/]{5,} won't work. I just don't know where to put the other characters in the regex.
Thanks in advance!
Best regards,
Petar
Using the simplest regex features since you haven't specified which engine you're using, you can try:
.*([0-9].*){5}
.* --> zero or more of any character
( --> begin group
[0-9] --> any digit
.* --> zero or more of any character
) --> end group
{5} --> exactly five occurrences of the group
This gives you any number (including zero) of characters, followed by a group consisting of a single digit and any number of characters again. That group is repeated exactly five times.
That'll match any string with five or more digits in it, along with anything else.
If you want to limit what the other characters can be, use something other than .. For example, alphas only would be:
[A-Za-z]*([0-9][A-Za-z]*){5}
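Checked against the examples from the question, a Ruby sketch of the first pattern could look like this (five_digits? is a hypothetical helper name):

```ruby
# Hypothetical helper: true when the string contains at least five digits.
def five_digits?(str)
  str.match?(/.*([0-9].*){5}/)
end

valid   = %w[12345 1A/2345 B22226 21113C]
invalid = %w[1234 AA1234]
p valid.map { |s| five_digits?(s) }   # => [true, true, true, true]
p invalid.map { |s| five_digits?(s) } # => [false, false]
```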
EDIT: I'm picking up your suggestion from a comment to paxdiablo's answer: This regex now implements an upper bound of five for the number of "other" characters:
^(?=(?:[A-Z/]*\d){5})(?!(?:\d*[A-Z/]){6})[\dA-Z/]*$
will match and return a string that has at least five digits and zero to five of the "other" allowed characters A-Z or /. No other characters are allowed.
Explanation:
^ # Start of string
(?= # Assert that it's possible to match the following:
(?: # Match this group:
[A-Z/]* # zero or more non-digits, but allowed characters
\d # exactly one digit
){5} # five times
) # End of lookahead assertion.
(?! # Now assert that it's impossible to match the following:
(?: # Match this group:
\d* # zero or more digits
[A-Z/] # exactly one "other" character
){6} # six times (change this number to "upper bound + 1")
) # End of assertion.
[\dA-Z/]* # Now match the actual string, allowing only these characters.
$ # Anchor the match at the end of the string.
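A small Ruby sketch of this validation follows; valid_code? is a hypothetical name, and the slashes are escaped only because the pattern is written as a /.../ literal.

```ruby
# Hypothetical helper: at least five digits, at most five "other" characters.
def valid_code?(str)
  str.match?(/^(?=(?:[A-Z\/]*\d){5})(?!(?:\d*[A-Z\/]){6})[\dA-Z\/]*$/)
end

p valid_code?('1A/2345')     # => true  (five digits, two other characters)
p valid_code?('AA1234')      # => false (only four digits)
p valid_code?('ABCDE12345')  # => true  (exactly five other characters)
p valid_code?('ABCDEF12345') # => false (six other characters)
```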
You may want to try counting the digits instead. I feel it's much cleaner than writing a complex regex.
>> "ABC12345".gsub(/[^0-9]/,"").size >= 5
=> true
The above removes everything that is not a digit and then checks the length of what remains. You can do the same thing in your own choice of language. The most basic way would be to iterate over the string, counting each character that is a digit until the count reaches 5 (or the string ends), and act accordingly.
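In Ruby the counting can also be done without any substitution, using String#count with a character range. A minimal sketch, with enough_digits? as a hypothetical name and the minimum as an assumed parameter:

```ruby
# Hypothetical helper: count digit characters directly.
def enough_digits?(str, minimum = 5)
  str.count('0-9') >= minimum
end

p enough_digits?('ABC12345') # => true
p enough_digits?('AA1234')   # => false
```

Like the gsub version, this only counts digits; it does not restrict which other characters may appear.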