Regex for decimal within 0.21 to 1 range [duplicate] - regex

I'm trying to use the range pattern [01-12] in regex to match two digit mm, but this doesn't work as expected.

You seem to have misunderstood how character classes definition works in regex.
To match any of the strings 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, or 12, something like this works:
0[1-9]|1[0-2]
References
regular-expressions.info/Character Classes
Numeric Ranges (have many examples on matching strings interpreted as numeric ranges)
Explanation
A character class, by itself, attempts to match one and exactly one character from the input string. [01-12] actually defines [012], a character class that matches one character from the input against any of the 3 characters 0, 1, or 2.
The - range definition goes from 1 to 1, which includes just 1. On the other hand, something like [1-9] includes 1, 2, 3, 4, 5, 6, 7, 8, 9.
Beginners often make the mistake of writing things like [this|that]. This doesn't "work". This character class is equivalent to [this|a], i.e. it matches one character from the input against any of the 6 characters t, h, i, s, | or a. More than likely (this|that) is what is intended.
References
regular-expressions.info/Brackets for Grouping and Alternation with the vertical bar
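A quick way to see the difference is to run both patterns against a few inputs. The sketch below is only an illustration: the helper name wholeStringMatches is made up, and the patterns are wrapped in ^ and $ so that only whole-string matches count.

```javascript
// Hypothetical helper: true only if the whole string matches the pattern.
function wholeStringMatches(pattern, text) {
  return new RegExp("^(?:" + pattern + ")$").test(text);
}

const monthInputs = ["00", "01", "09", "10", "12", "13", "2"];

// [01-12] is just the character class [012]: exactly one character.
const classResult = monthInputs.filter((m) => wholeStringMatches("[01-12]", m));

// 0[1-9]|1[0-2] spells out the two-digit strings 01 through 12.
const altResult = monthInputs.filter((m) => wholeStringMatches("0[1-9]|1[0-2]", m));

console.log(classResult); // [ '2' ]
console.log(altResult);   // [ '01', '09', '10', '12' ]
```

The character class only ever accepts the one-character input, while the alternation accepts exactly the two-digit months.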
How ranges are defined
So it's obvious now that a pattern like between [24-48] hours doesn't "work". The character class in this case is equivalent to [248].
That is, - in a character class definition doesn't define a numeric range in the pattern. Regex engines don't really "understand" numbers in the pattern, with the exception of finite repetition syntax (e.g. a{3,5} matches between 3 and 5 repetitions of a).
Range definition instead uses the ASCII/Unicode encoding of the characters to define ranges. The character 0 is encoded in ASCII as decimal 48; 9 is 57. Thus, the character class [0-9] includes all characters whose values are between decimal 48 and 57 in the encoding. Rather sensibly, by design these are the characters 0, 1, ..., 9.
See also
Wikipedia/ASCII
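The encoding-based expansion described above can be imitated by hand. This is a rough sketch, not how any particular engine is implemented; expandRange is a hypothetical helper that lists the characters between two endpoints by code value.

```javascript
// List every character whose code lies between the two endpoints,
// which is how a range such as 0-9 inside [...] is interpreted.
function expandRange(first, last) {
  const chars = [];
  for (let code = first.charCodeAt(0); code <= last.charCodeAt(0); code++) {
    chars.push(String.fromCharCode(code));
  }
  return chars.join("");
}

console.log("0".charCodeAt(0), "9".charCodeAt(0)); // 48 57
console.log(expandRange("0", "9")); // 0123456789
console.log(expandRange("1", "1")); // 1
```

The last call shows why the 1-1 inside [01-12] contributes only the single character 1.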
Another example: A to Z
Let's take a look at another common character class definition [a-zA-Z]
In ASCII:
A = 65, Z = 90
a = 97, z = 122
This means that:
[a-zA-Z] and [A-Za-z] are equivalent
In most flavors, [a-Z] is likely to be an illegal character range
because a (97) is "greater than" Z (90)
[A-z] is legal, but also includes these six characters:
[ (91), \ (92), ] (93), ^ (94), _ (95), ` (96)
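Both points can be checked in JavaScript, used here only as one example flavor; other engines may word the error differently. The variable names are made up for this sketch.

```javascript
// Characters between Z (90) and a (97) that [A-z] also accepts.
const extraChars = [];
for (let code = 91; code <= 96; code++) {
  const ch = String.fromCharCode(code);
  if (/^[A-z]$/.test(ch)) extraChars.push(ch);
}
console.log(extraChars.join(" ")); // [ \ ] ^ _ `

// Building [a-Z] at run time lets the error be caught instead of
// stopping the whole script when it is parsed.
let reversedRangeError = null;
try {
  new RegExp("[a-Z]");
} catch (err) {
  reversedRangeError = err;
}
console.log(reversedRangeError instanceof SyntaxError); // true
```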
Related questions
is the regex [a-Z] valid and if yes then is it the same as [a-zA-Z]

A character class in regular expressions, denoted by the [...] syntax, specifies the rules to match a single character in the input. As such, everything you write between the brackets specifies how to match a single character.
Your pattern, [01-12] is thus broken down as follows:
0 - match the single digit 0
or, 1-1, match a single digit in the range of 1 through 1
or, 2, match a single digit 2
So basically all you're matching is 0, 1 or 2.
In order to do the matching you want, matching two digits, ranging from 01-12 as numbers, you need to think about how they will look as text.
You have:
01-09 (ie. first digit is 0, second digit is 1-9)
10-12 (ie. first digit is 1, second digit is 0-2)
You will then have to write a regular expression for that, which can look like this:
  +-- a 0 followed by 1-9
  |
  |      +-- a 1 followed by 0-2
  |      |
<-+--> <-+-->
0[1-9]|1[0-2]
      ^
      |
      +-- vertical bar, this roughly means "OR" in this context
Note that trying to combine them in order to get a shorter expression will fail, by giving false positive matches for invalid input.
For instance, the pattern [0-1][0-9] would basically match the numbers 00-19, which is a bit more than what you want.
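To make the false positives concrete, the following sketch lists every two-digit string from 00 to 19 that the shorter pattern accepts but the exact one rejects. Both patterns are anchored here so that only whole strings are compared; the variable names are illustrative.

```javascript
const shortPattern = /^[0-1][0-9]$/;
const exactPattern = /^(?:0[1-9]|1[0-2])$/;

// Every two-digit string from 00 to 19.
const twoDigitStrings = [];
for (let n = 0; n < 20; n++) twoDigitStrings.push(String(n).padStart(2, "0"));

// Strings the short pattern accepts but the exact pattern rejects.
const falsePositives = twoDigitStrings.filter(
  (s) => shortPattern.test(s) && !exactPattern.test(s)
);
console.log(falsePositives.join(" ")); // 00 13 14 15 16 17 18 19
```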
I tried finding a definite source for more information about character classes, but for now all I can give you is this Google Query for Regex Character Classes. Hopefully you'll be able to find some more information there to help you.

This also works if months are written without a leading zero:
^([1-9]|1[0-2])$
[1-9] matches single digits between 1 and 9
1[0-2] matches double digits between 10 and 12
There are some good examples here

The []s in a regex denote a character class. If no ranges are specified, it implicitly ors every character within it together. Thus, [abcde] is the same as (a|b|c|d|e), except that it doesn't capture anything; it will match any one of a, b, c, d, or e. All a range indicates is a set of characters; [ac-eg] says "match any one of: a; any character between c and e; or g". Thus, your match says "match any one of: 0; any character between 1 and 1 (i.e., just 1); or 2".
Your goal is evidently to specify a number range: any number between 01 and 12 written with two digits. In this specific case, you can match it with 0[1-9]|1[0-2]: either a 0 followed by any digit between 1 and 9, or a 1 followed by any digit between 0 and 2. In general, you can transform any number range into a valid regex in a similar manner. There may be a better option than regular expressions, however, or an existing function or module which can construct the regex for you. It depends on your language.
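As one example of a non-regex alternative, the range check can be done numerically after a simple shape check. This is a minimal sketch in JavaScript; isTwoDigitMonth is a hypothetical helper, not a library function.

```javascript
// Hypothetical helper: accept exactly two digits, then check the range
// numerically instead of encoding it in the pattern.
function isTwoDigitMonth(text) {
  if (!/^[0-9]{2}$/.test(text)) return false;
  const value = Number(text);
  return value >= 1 && value <= 12;
}

console.log(["01", "12", "00", "13", "7"].map(isTwoDigitMonth));
// [ true, true, false, false, false ]
```

Changing the allowed range then means editing two numbers rather than rewriting the pattern.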

Use this:
0?[1-9]|1[012]
07: valid
7: valid
0: not match
00 : not match
13 : not match
21 : not match
To test a pattern as 07/2018 use this:
/^(0?[1-9]|1[012])\/([2-9][0-9]{3})$/
(Date range from 01/2000 to 12/9999)
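The two capture groups in that pattern can also be used to pull out the month and year. A small sketch, assuming JavaScript; parseMonthYear is a made-up helper name.

```javascript
const monthYearPattern = /^(0?[1-9]|1[012])\/([2-9][0-9]{3})$/;

// Hypothetical helper: returns { month, year } as numbers, or null.
function parseMonthYear(text) {
  const m = monthYearPattern.exec(text);
  return m ? { month: Number(m[1]), year: Number(m[2]) } : null;
}

console.log(parseMonthYear("07/2018")); // { month: 7, year: 2018 }
console.log(parseMonthYear("7/2018"));  // { month: 7, year: 2018 }
console.log(parseMonthYear("13/2018")); // null
console.log(parseMonthYear("07/1999")); // null
```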

As polygenelubricants says yours would look for 0|1-1|2 rather than what you wish for, due to the fact that character classes (things in []) match characters rather than strings.

My solution to keep mm-yyyy is ^0*([1-9]|1[0-2])-(20[2-4][0-9])$

Related

Regular Expression - String that contains at most one pair of consecutive 1's

How to express a string that contains at most one pair of consecutive 1's in UNIX regex?
Some examples of accepted strings: 0, 1, 11, 12, 22, 11221212, 22112121, 23456 etc.
Not accepted ones: 111, 11311, 311311 etc.
This should work:
^(1?[^1])*1?1?([^1]1?)*$
Explanation:
(1?[^1])*
This will match a prefix that doesn't contain consecutive 1 and doesn't end with a 1, because every 1 in it is immediately followed by something else.
1?1?
This next part will match nothing, a single 1, or the one allowed pair of consecutive 1.
([^1]1?)*
The final part mirrors the first one: every 1 in it is immediately preceded by something else, so it can't extend the pair or start a second one, but it allows the string to end with a single 1.
This pattern would be much simpler if you could use a richer regex dialect (like the Perl dialect). In the standard unix regexes, you can't use lookaround expressions. Here's a Perl pattern:
^(?!.*111)(?!.*11.*11).*$
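The lookahead pattern can be checked against the examples from the question. JavaScript is used here only because its regex flavor supports lookahead; the variable names are illustrative.

```javascript
const atMostOnePair = /^(?!.*111)(?!.*11.*11).*$/;

// Examples taken from the question.
const acceptedExamples = ["0", "1", "11", "12", "22", "11221212", "22112121", "23456"];
const rejectedExamples = ["111", "11311", "311311"];

console.log(acceptedExamples.every((s) => atMostOnePair.test(s))); // true
console.log(rejectedExamples.some((s) => atMostOnePair.test(s)));  // false
```

The first lookahead rules out three 1s in a row, and the second rules out two separate pairs.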

Decoding a regex... I know what its function is but I want to understand exactly what is happening

I have a regular expression that I'm going to be using to verify that an inputted number is in standard U.S. telephone format (i.e (###) ###-####). I am new to regex and still having some trouble figuring out the exact function of each character. If someone would go through this piece by piece/verify that I am understanding I would really appreciate it. Also if the regex is wrong I would obviously like to know that.
\D*?(\d\D*?){10}
What I think is happening:
\D*?( indicates an escape sequence for the parenthesis metacharacter... not sure why the \D*? is necessary
\d indicating digits
\D*? indicating there is a non-digit character (-) followed by the closing parenthesis.
{10} for the 10 digits
I feel very unsure explaining this, like my understanding is very vague in terms of why the regex is in the order that it is etc. Thanks in advance for help/explanations.
EDIT
It seems like this is not the best regex for what I want. Another possibility was [(][0-9]{3}[)] [0-9]{3}-[0-9]{4}, but I was told this would fail. I suppose I'll have to do a little more work with regular expressions to figure this out.
\D matches any non-digit character.
* means that the previous item is repeated 0 or more times.
*? also means that the previous item is repeated 0 or more times, but lazily: it matches as few characters as possible while still letting the rest of the regex match. It is perhaps a bit difficult at the start, but in your regex, the next item is \d, meaning \D*? will match the fewest non-digit characters needed to reach the next digit.
( ... ) is a capture group, and is also used to group things. For instance {10} means that the previous character or group is repeated exactly 10 times.
Now, \D*?(\d\D*?){10} will match exactly 10 digits, with or without leading non-digit characters, and with non-digit characters in between the digits if they are present.
[(][0-9]{3}[)] [0-9]{3}-[0-9]{4}
This regex is a bit better since it doesn't just accept anything (like the first regex does) and will match the format (###) ###-#### (notice the space is a character in regex!).
The new things introduced here are the square brackets. These represent character classes. [0-9] means any character between 0 and 9 inclusive, which means it will match 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9. Adding {3} after it repeats the character class 3 times, and since this character class contains only digits, it will match exactly 3 digits.
A character class can be used to escape certain characters, such as ( or ) (note I mentioned earlier they are for capturing groups, or grouping) and thus, [(] and [)] are literal ( and ) instead of being used for capturing/grouping.
You can also use backslashes (\) to escape characters. Thus:
\([0-9]{3}\) [0-9]{3}-[0-9]{4}
Will also work. I would also recommend the use of line anchors ^ and $ if you're only trying to see if a phone number matches the above format. This ensures that the string has only the phone number, and nothing else. ^ matches the beginning of a line and $ matches the end of a line. Thus, the regex will become:
^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$
However, I don't know all the combinations of the different formats of phone numbers in the US, so this regex might need some tweaking if you have different phone number formats.
\D is "not a digit"; \d is "digit". With that in mind:
This matches zero or more non-digits, then it matches a digit followed by any number of non-digit characters, 10 times. This won't actually verify that the number is formatted properly, just that the input contains at least 10 digits. I suspect that the regex isn't what you want in the first place.
For example, the following will match your regex:
this is some bad text 1 and some more 2 and more 34567890
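The difference between the loose pattern and an anchored, format-specific one shows up when both are run on the same inputs. A small sketch in JavaScript; the sample strings other than the one quoted above are made up.

```javascript
const looseDigits = /\D*?(\d\D*?){10}/;
const strictPhone = /^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$/;

const phoneSamples = [
  "(555) 123-4567",
  "this is some bad text 1 and some more 2 and more 34567890",
  "555-123-4567",
];

for (const s of phoneSamples) {
  console.log(looseDigits.test(s), strictPhone.test(s), s);
}
// true true (555) 123-4567
// true false this is some bad text 1 and some more 2 and more 34567890
// true false 555-123-4567
```

The loose pattern accepts anything containing ten digits, while the anchored one accepts only the (###) ###-#### layout.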
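\D matches a character that is not a digit
* repeats the previous item 0 or more times
? after * makes the repetition lazy, so it matches as few characters as possible
\d matches a digit
so your group matches a digit followed by optional non-digits, and {10} repeats that group 10 times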

Why is the regex to match 1 to 10 written as [1-9]|10 and not [1-10]?

Why is the regex to match numbers from 1 to 10 commonly written as follows?
[1-9]|10
Instead of:
[1-10]
Or this:
[1-(10)]
Sometimes a good drawing is worth 1000 words...
Here are the three propositions in your question and the way a regex flavour would understand them:
[1-9]|10
Either one character in the range 1 to 9, or the character 1 followed by the character 0.
[1-10]
One character that is either 1 or 0.
[1-(10)]
Invalid regexp!
This regex is invalid because the range opened with a digit (1-) is closed with ( rather than with another digit, and ( comes before 1 in the character encoding.
A range is usually bounded by digits on both sides or by letters on both sides.
Images generated with Debuggex
That is because regexes work with characters, not with numbers. [1-9] is equivalent to (?:1|2|3|4|5|6|7|8|9) while [1-10] would be (?:1|0) (because it's the range 1–1 and the digit 0).
Simply put, ranges in character classes always refer to contiguous ranges of characters, despite how they look. Even if they're digits that doesn't mean there is any kind of numeric range.
[1-9]|10
In this:
[1-9] accepts any character from 1 through 9;
| performs an "or" operation;
10 accepts the 10 literally.
[1-10]
This accepts:
any single character between 1 and 1,
or 0.
No matter what pattern is inside [...] (character class), it only matches a single character.
The way the range operator (-) inside character class works is it takes a single character as left operand, and a single character as right operand, then expand it to a list of characters.
So, looking at the ranges in your examples
1-9 (1 to 9) in [1-9]|10 (equivalent to [123456789]|10)
1-1 (1 to 1) in [1-10] (equivalent to [10] which is the same as [01])
1-( (1 to opening parenthesis) in [1-(10)]
I actually get an error with this in Perl because the range 1 to ( doesn't really make sense.
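The same three cases can be tried in JavaScript, shown here as one more flavor next to Perl; the exact error message differs between engines, so the sketch only checks the error type. Variable names are illustrative.

```javascript
// [1-10] is the class containing 1 and 0: one character, never "10" as a whole.
const oneOrZero = /^[1-10]$/;
console.log(["1", "0", "10", "5"].map((s) => oneOrZero.test(s)));
// [ true, true, false, false ]

// [1-9]|10 covers the strings 1 through 10.
const oneToTen = /^(?:[1-9]|10)$/;
console.log(["1", "9", "10", "0", "11"].map((s) => oneToTen.test(s)));
// [ true, true, true, false, false ]

// "(" is code 40 and "1" is code 49, so the range 1-( runs backwards.
let badRangeError = null;
try {
  new RegExp("[1-(10)]");
} catch (err) {
  badRangeError = err;
}
console.log(badRangeError instanceof SyntaxError); // true
```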
It is about the character matching. When you say [1-9] it means it matches any individual characters from 1 to 9. Number 10 would be treated as 2 separate characters.
The [] indicates a single character match
for example [ab] would match either a or b
so [1-9] which is effectively shorthand for [123456789] would match a single character that is one of the digits from 1 to 9
Your example of [1-10] would expand the 1-1 to mean all characters in the range 1 to 1 (i.e 1) so the actual regex would expand to be [10] (i.e. either the character 1 or the character 0)
That's the basic definition of a character class. [1-10] means "match any character in the range 1 through 1, or 0". Character classes are evaluated character by character (except for escape sequences and -); they don't understand numbers.
That is because the [] symbols represent a character set, e.g. [0-5] matches any one of the digits 0 to 5. However, 10 has two digits, and therefore [0-9] will not produce an exact match (it will only match the first digit, '1', of '10').
The pipe symbol | can be seen as a "or" operator.
\[([1-9][0-9]|[0-9])\]
Used in a search-and-replace with an empty replacement, this will remove Wikipedia's reference markers ([0] through [99]) when you copy something for your project.
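A short sketch of that search-and-replace in JavaScript; the sample sentence is invented, and the g flag is added so every marker is replaced.

```javascript
const refMarker = /\[([1-9][0-9]|[0-9])\]/g;
const copiedText = "Regular expressions[1] are widely used[23] in text editors.[104]";

const cleanedText = copiedText.replace(refMarker, "");
console.log(cleanedText);
// Regular expressions are widely used in text editors.[104]
```

The three-digit marker survives because the pattern only allows one or two digits between the brackets.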

How to validate numeric values which may contain dots or commas?

I need a regular expression for validation two or one numbers then , or . and again two or one numbers.
So, these are valid inputs:
11,11
11.11
1.1
1,1
\d{1,2}[\,\.]{1}\d{1,2}
EDIT: update to meet the new requirements (comments) ;)
EDIT: remove unnecessary quantifier as per Bryan
^[0-9]{1,2}([,.][0-9]{1,2})?$
In order to represent a single digit in the form of a regular expression you can use either:
[0-9] or \d
In order to specify how many times the number appears you would add
[0-9]*: the star means there are zero or more digits
[0-9]{2}: {N} means N digits
[0-9]{0,2}: {N,M} N digits to M digits
Let's say I want to represent a number with one or two digits (0 to 99), I would express it as such:
[0-9]{1,2} or \d{1,2}
Or let's say we were working with a binary display of a byte: each digit must be 0 or 1 and the length must be 8, so we would represent it as follows:
[0-1]{8} representation of a binary byte
Then if you want to add a , or a . symbol you would use:
\, or \. or you can use [.] or [,]
You can also state a selection between possible values as such
[.,] means either a dot or a comma symbol
And you just need to concatenate the pieces together, so in the case where you want to represent a 1 or 2 digit number followed by either a comma or a period and followed by two more digits you would express it as follows:
[0-9]{1,2}[.,]\d{1,2}
Also note that regular expression strings inside C++ strings must be double-back-slashed so every \ becomes \\
\d means a digit in most languages. You can also use [0-9] in all languages. For the "period or comma" use [\.,]. Depending on your language you may need more backslashes based on how you quote the expression. Ultimately, the regular expression engine needs to see a single backslash.
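The quoting issue is easy to demonstrate in JavaScript, which has both regex literals and a string-based constructor. This is only an illustration of the general point; other languages have their own quoting rules.

```javascript
const fromLiteral = /^\d{1,2}[.,]\d{1,2}$/;
// In a string literal each backslash is doubled so the engine receives \d.
const fromString = new RegExp("^\\d{1,2}[.,]\\d{1,2}$");
// With single backslashes the string literal turns "\d" into a plain "d".
const lostEscape = new RegExp("^\d{1,2}[.,]\d{1,2}$");

console.log(fromLiteral.source === fromString.source); // true
console.log(fromString.test("11,11")); // true
console.log(lostEscape.test("11,11")); // false
console.log(lostEscape.test("d,d"));   // true
```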
* means "zero-or-more", so \d* and [0-9]* mean "zero or more digits". ? means "zero-or-one". Neither of those quantifiers means exactly one. Most languages also let you use {m,n} to mean "between m and n" (i.e. {1,2} means "between 1 and 2").
Since the dot or comma and additional numbers are optional, you can put them in a group and use the ? quantifier to mean "zero-or-one" of that group.
Putting that all together you can use:
\d{1,2}([\.,]\d{1,2})?
Meaning, one or two digits \d{1,2}, followed by zero-or-one of a group (...)? consisting of a dot or comma followed by one or two digits [\.,]\d{1,2}
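A quick check of the optional-group idea, sketched in JavaScript. The ^ and $ anchors are an addition here, so that the whole input has to match; the sample inputs are made up.

```javascript
const optionalFraction = /^\d{1,2}([.,]\d{1,2})?$/;

const fractionInputs = ["11", "1.1", "11,11", "11.", "11.111", ".5"];
console.log(fractionInputs.map((s) => optionalFraction.test(s)));
// [ true, true, true, false, false, false ]
```

The group and its ? let a bare "11" through, while a trailing separator with no digits after it is rejected.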
\d{1,2}[,.]\d{1,2}
\d means a digit, the {1,2} part means 1 or 2 of the previous character (\d in this case) and the [,.] part means either a comma or dot.
Shortest regexp I know (16 char)
^\d\d?[,.]\d\d?$
The ^ and $ mean the beginning and end of the input string (without them, the 23.45 part of a string like 123.45 would be matched). \d means a digit, \d? means an optional digit, and [,.] means a dot or comma.
[ // tests
'11,11', // valid
'11.11',
'1.1',
'1,1',
'111,1', // nonvalid
'11.111',
'11-11',
',11',
'11.',
'a.11',
'11,a',
].forEach(n=> console.log(`${n}\t valid: ${ /^\d\d?[,.]\d\d?$/.test(n) }`))
If you want to be very permissive and require only that the input ends with a comma or dot followed by two digits:
^([,.\d]+)([,.]\d{2})$
