There's a long natural number that can be grouped to smaller numbers by the 0 (zero) delimiter.
Example: 4201100370880
This would divide to Group1: 42, Group2: 110, Group3: 370880
There are 3 groups, groups never start with 0 and are at least 1 char long. Also the last groups is "as is", meaning it's not terminated by a tailing 0.
This is what I came up with, but it only works for certain inputs (like 420110037880):
(\d+)0([1-9][0-9]{1,2})0([1-9]\d+)
This shows I'm attempting to declare the 2nd group's length to min2 max3, but I'm thinking the correct solution should not care about it. If the delimiter was non-numeric I could probably tackle it, but I'm stumped.
All right, factoring in comment information, try splitting on a regex (this may vary based on what language you're using - .split(/.../) in JavaScript, preg_split in PHP, etc.)
The regex you want to split on is: 0(?!0). This translates to "a zero that is not followed by a zero". I believe this will solve your splitting problem.
If your language allows a limit parameter (PHP does), set it to 3. If not, you will need to do something like this (JavaScript):
result = input.split(/0(?!0)/);
result = result.slice(0,2).concat(result.slice(2).join("0"));
The following one should suit your needs:
^(.*?)0(?!0)(.*?)0(?!0)(.*)$
Visualization by Debuggex
The following regex works:
(\d+?)0(?!0) with the g modifier
Demo: http://regex101.com/r/rS4dE5
For only three matches, you can do:
(\d+?)0(?!0)(\d+?)0(?!0)(.*)
Related
I have a filename like this:
0296005_PH3843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS.
I needed to break down the name into groups which are separated by a underscore. Which I did like this:
(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
So far so go.
Now I need to extract characters from one of the group for example in group 2 I need the first 3 and 8 decimal ( keep mind they could be characters too ).
So I had try something like this :
(.*?)_([38]{2})(.*?) _(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
It didn’t work but if I do this:
(.*?)_([PH]{2})(.*?) _(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
It will pull the PH into a group but not the 38 ? So I’m lost at this point.
Any help would be great
Try the below Regex to match any first 3 char/decimal and one decimal
(.?)_([A-Z0-9]{3}[0-9]{1})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
Try the below Regex to match any first 3 char/decimal and one decimal/char
(.?)_([A-Z0-9]{3}[A-Z0-9]{1})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
It will match any 3 letters/digits followed by 1 letter/digit.
If your first two letter is a constant like "PH" then try the below
(.?)_([PH]+[0-9A-Z]{2})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
I am assuming that you are trying to match group2 starting with numbers. If that is the case then you have change the source string such as
0296005_383843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS.
It works, check it out at https://regex101.com/r/zem3vt/1
Using [^_]* performs much better in your case than .*? since it doesn't backtrack. So changing your original regex from:
(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
to:
([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
reduces the number of steps from 114 to 42 for your given string.
The best method might be to actually split your string on _ and then test the second element to see if it contains 38. Since you haven't specified a language, I can't help to show how in your language, but most languages employ a contains or indexOf method that can be used to determine whether or not a substring exists in a string.
Using regex alone, however, this can be accomplished using the following regular expression.
See regex in use here
Ensuring 38 exists in the second part:
([^_]*)_([^_]*38[^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
Capturing the 38 in the second part:
([^_]*)_([^_]*)(38)([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
I came across the regular expression not containing 101 as follows:
0∗1∗0∗+(1+00+000)∗+(0+1+0+)∗
I was unable to understand how the author come up with this regex. So I just thought of string which did not contain 101:
01000100
I seems that above string will not be matched by above regex. But I was unsure. So tried translating to equivalent pcre regex on regex101.com, but failed there too (as it can be seen my regex does not even matches string containing single 1.
Whats wrong with my translation? Is above regex indeed correct? If not what will be the correct regex?
Here is a bit shorter expression ^0*(1|00+)*0*$
https://www.regex101.com/r/gG3wP5/1
Explanation:
(1|00+)* we can mix zeroes and ones as long as zeroes occur in groups
^0*...0*$ we can have as many zeroes as we want in prefix/suffix
Direct translation of the original regexp would be like
^(0*1*0*|(1|00|000)*|(0+1+0+)*)$
Update
This seems like artificially complicated version of the above regexp:
(1|00|000)* is the same as (1|00+)*
it is almost the solution, but it does not match strings 0, 01.., and ..10
0*1*0* doesn't match strings with 101 inside, but matches 0 and some of 01.., and ..10
we still need to match those of 01.., and ..10 which have 0 & 1 mixed inside, e.g. 01001.. or ..10010
(0+1+0+)* matches some of the remaining cases but there are still some valid strings unmatched
e.g. 10010 is the shortest string that is not matched by all of the cases.
So, this solution is overly complicated and not complete.
read the explanation in the right side tab in regex101 it tells you what your regex does( I think you misunderstood what list operator does) , inside a list operator ( [ ) , the other characters such as ( won't be metacharacters anymore so the expression [(0*1*0*)[1(00)(000)] will be equivalent to [01()*[] which means it matches 0 or 1 or ( or ) or [
The correct translation of the regular expression 0∗1∗0∗+(1+00+000)∗+(0+1+0+)∗
will be as follows:
^((?:0*1*0*)|(?:1|00|000)*|(?:0+1+0+)*)$
regex101
Debuggex Demo
What your regex [(0*1*0*)[1(00)(000)]*(0+1+0+)*] does:
[(0*1*0*)[1(00)(000)]* -> matches any of characters 0,(,),*,[ zero or more times followed by
(0+1+0+)* --> matches the pattern 0+1+0+ 0 or more times followed by
] --> matches the character ]
so you expression is equivalent to
[([)01](0+1+0+)*] which is not a regular expression to match strings that do not contain 101
0* 1* ( (00+000)* 1*)* (ε+0)
i think this expression covers all cases because --
any number apart from 1 can be broken into constituent 2's and 3's i.e. any number n=2*i+3*j. So there can be any number of 0's between 2 consecutive 1's apart from one 0.Hence, 101 cannot be obtained.
ε+0 for expressions ending in one 0.
The RE for language not containing 101 as sub-string can also be written as (0*1*00)*.0*.1*.0*
This may me a smaller one then what you are using. Try to make use of this.
Regular Expression I got (0+10)1. (looks simple :P)
I just considered all cases to make this.
you consider two 1's we have to end up with continuous 1's
case 1: 11111111111111...
case 2: 0000000011111111111111...(once we take two 1's we cant accept 0's so one and only chance is to continue with 1's)
if you consider only one 1 which was followed by 0 So, no issue and after one 1 we can have any number of 0's.
case 3: 00000000 10100100010000100000100000 1111111111
=>(0*+10*)1
final answer (0+10)1.
Thanks for your patience.
Trying to put together regex that can match minimum 4 digits, maximum 16 digits, and those digits can be separated by characters: ()- x+ (but should not be part of the min/max count).
ie. "555-123-4567" would return true, "1-234" is true, "+44(55)123-3333" is true, "abcd1" is false, "1-()-4++++-()-6" is false.
Any way to do that with purely regex? Trying a couple expressions but not working.
what you need to do, is to match any number of the allowed characters, followed by a digit, followed by any number of the allowed characters, and match that same sequence between 4 an 16 times.
like this
^([()\- x+]*\d[()\- x+]*){4,16}$
http://rubular.com/r/6VhALkFPQZ
This:
/^[(]{0,1}[0-9]{3}[)]{0,1}[-\s\.]{0,1}[0-9]{3}[-\s\.]{0,1}[0-9]{4}$/
Works with these formats:
123-456-7890
(123) 456-7890
1234567890
123.456.7890
TL;DR
The OP has already accepted a regex solution. Below I present an alternative way of looking at the problem. Hopefully it helps the OP, but it's really aimed more at future visitors and followers of regex.
Don't Validate Logic with Regexps
Regular expressions work best for matching or extracting patterns, rather than for complex data validation. For example, the OP gives the following rules:
Trying to put together regex that can match minimum 4 digits, maximum 16 digits, and those digits can be separated by characters: ()- x+ (but should not be part of the min/max count).
but then says that 1-()-4++++4-()-66 should be false. However, it meets the rules for truth as originally defined by the OP. (NB: This example was later changed in the OP's question, but the point I'm making remains valid.)
Example: Using Code to Simplify the Regex Pattern Match
Logic should be encapsulated in short, testable pieces of code, not in complex regular expressions. For example, consider the following Ruby code:
numbers = [
'555-123-4567',
'1-234',
'+44(55)123-3333',
'abcd1',
'1-()-4++++4-()-66'
]
numbers.map { |num| num.delete '- x+()' }.grep /\A\d{4,16}\z/
#=> ["5551234567", "1234", "44551233333", "14466"]
Even if you aren't a Rubyist, the code should be easy to follow. This code strips out the characters that are irrelevant to our match, then checks that each string contains nothing but 4-16 digits anchored to the beginning and end of the string. Instead of validating a complex pattern, you're now just validating a simple pattern (e.g. all numbers) with a well-defined interval from 4 to 16. Furthermore, you can break this kind of logic up into smaller steps rather than simply calling long method chains, making this inherently more testable.
Example: Avoiding Regexp Validation Altogether
You could even go further by avoiding the regex for any sort of validation, and making your Boolean expressions more explicit. Consider the following:
numbers = [
'555-123-4567',
'1-234',
'+44(55)123-3333',
'abcd1',
'1-()-4++++4-()-66'
]
numbers.map do |num|
digits = num.scan /\d/
valid = digits.count >= 4 and digits.count <= 16
puts "#{num}: #{valid}"
end
This will print:
555-123-4567: true
1-234: true
+44(55)123-3333: true
abcd1: false
1-()-4++++4-()-66: true
To me, this seems like a much more robust and flexible way of solving the "phone number validation" question, which gets asked here on Stack Overflow in one form or another with amazing regularity. Your mileage may vary.
I have thousands of article descriptions containing numbers.
they look like:
ca.2760h3x1000.5DIN345x1500e34
the resulting numbers should be:
2760
1000.5
1500
h3 or 3 shall not be a result of the parsing, since h3 is a tolerance only
same for e34
DIN345 is a norm an needs to be excluded (every number with a trailing DIN or BN)
My current REGEX is:
[^hHeE]([-+]?([0-9]+\.[0-9]+|[0-9]+))
This solves everything BUT the norm. How can I get this "DIN" and "BN" treated the same way as a single character ?
Thanx, TomE
Try using this regular expression:
(?<=x)[+-]?0*[0-9]+(?:\.[0-9]+)?|[+-]?0*[0-9]+(?:\.[0-9]+)?(?=h|e)
It looks like every number in your testcase you want to match exept the first number is starting with x.This is what the first part of the regex matches. (?<=x)[+-]?0*[0-9]+(?:\.[0-9]+)?The second part of the regex matches the number until h or e. [+-]?0*[0-9]+(?:\.[0-9]+)?(?=h|e)
The two parts [+-]?0*[0-9]+(?:\.[0-9]+)? in the regex is to match the number.
If we can assume that the numbers are always going to be four digits long, you can use the regex:
(\d{4}\.\d+|\d{4})
DEMO
Depending on the language you might need to replace \d with [0-9].
I want to create a regex that will match any of these values
7-5
6-6 ((0-99) - (0-99))
6-4
6-3
6-2
6-1
6-0
0-6
1-6
2-6
3-6
4-6
the 6-6 example is a special case, here are some examples of values:
6-6 (23-8)
6-6 (4-25)
6-6 (56-34)
Is it possible to make one regex that can do this?
If so, is it possible to further extend that regex for the 6-6 special case such that the the difference between the two numbers within the parentheses is equal to 2 or -2?
I could easily write this with procedural code, but i'm really curious if someone can devise a regex for this.
Lastly, if it could be further extended such that the individual digits were in their own match groups I'd be amazed. An example would be for 7-5, i could have a match group that just had the value 7, and another that had the value 5. However for 6-6 (24-26) I'd like a match group that had the first six, a match group for the second 6, a match group for the 24 and a match group for the 26.
This may be impossible, but some of you can probably get this part of the way there.
Good luck, and thanks for the help.
NO. The answer is "We can't," and the reason is because you're trying to use a hammer to dig a hole.
The problem with writing one long "clever" (this word causes a knee-jerk reaction in many people who are far more anti-regex than I) regex is that, six months from now, you'll have forgotten those clever regex features that you used so heavily, and you'll have written six months worth of code related to something else, and you'll get back to your impressive regex and have to tweak one detail, and you'll say, "WTF?"
This is what (I understand) you want, in Perl:
# data is in $_
if(/7-5|6-[0-4]|[0-4]-6|6-6 \((\d{1,2})-(\d{1,2})\)/) {
if($1 and $2 and abs($1 - $2) == 2) {
# we have the right difference
}
}
Some might say that the given regex is a bit much, but I don't think it's too bad. If the \d{1,2} bit is a little too obscure you could use \d\d? (which is what I used at first, but didn't like the repetition).
You can do it like this:
7-5|6-[0-4]|[0-5]-6|6-6 \(\d\d?-\d\d?\)
Just add parens to get your match groups.
Off the top of my head (there may be some errors but the principle should be good):
\d-\d|6-6 (\d+-\d+)
And like with any regexp, you can surround what you want to extract with parentheses for match groups:
(\d)-(\d)|(6)-(6) ((\d)+-(\d+))
In the 6-6 case, the first two parentheses should get the sixes, and the second two should get the multi-digit values that come afterwards.
Here is one that will match only the numbers you want and let you get each digit by name:
p = r'(?P<a>[0-4]|6|7)-(?P<b>[0-4]|6|5) *(\((?P<c>\d{1,2})-(?P<d>\d{1,2})\))?'
To get each digit you could use:
values = re.search(p, string).group('a', 'b', 'c', 'd')
Which will return a four element tuple with the values you are looking for (or None if no match was found).
One problem with this pattern is that it will patch the stuff in the parenthesis whether or not there was a match to '6-6'. This one will only match the final parenthesis if 6-6 is matched:
p = r'(?P<a>[0-4]|(?P<tmp_a>6)|7)-(?P<b>(?(tmp_a)(?P<tmp_b>6)|([0-4]|5)))(?(tmp_b) *(\((?P<c>\d{1,2})-(?P<d>\d{1,2})\))?)'
I don't know of any way to look for a difference between the numbers in the parenthesis; regex only knows about strings, not numerical values . . .
(I am assuming python syntax here; the perl syntax is slightly different, though perl supports the python way of doing things.)