Regex matching numbers and decimals - regex

I need a regex expression that will match the following:
.5
0.5
1.5
1234
but NOT
0.5.5
absnd (any letter character or space)
I have this that satisfies all but 0.5.5
^[.?\d]+$

This is a fairly common task. The simplest way I know of to deal with it is this:
^[+-]?(\d*\.)?\d+$
There are also other complications, such as whether you want to allow leading zeroes or commas or things like that. This can be as complicated as you want it to be. For example, if you want to allow the 1,234,567.89 format, you can go with this:
^[+-]?(\d*|\d{1,3}(,\d{3})*)(\.\d+)?\b$
That \b there is a word break, but I'm using it as a sneaky way to require at least one numeral at the end of the string. This way, an empty string or a single + won't match.
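For anyone who wants to poke at the thousands-separator pattern, here is a minimal JavaScript sketch. The sample strings and the names groupedNumber and groupedSamples are my own illustrative assumptions, not part of the answer above.

```javascript
// Pattern copied from the answer above; the \b before $ forces the string to end in a digit.
const groupedNumber = /^[+-]?(\d*|\d{1,3}(,\d{3})*)(\.\d+)?\b$/;

// Hypothetical sample inputs chosen to exercise each branch of the pattern.
const groupedSamples = ["1,234,567.89", "1234", ".5", "-0.5", "", "+", "12,34", "1."];

for (const sample of groupedSamples) {
  // Prints each input next to true/false.
  console.log(JSON.stringify(sample), groupedNumber.test(sample));
}
```

The first four samples should be accepted; the empty string, a lone +, the badly grouped 12,34 and the trailing-dot 1. should be rejected.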
However, be advised that regexes are not the ideal way to parse numeric strings. All modern programming languages I know of have fast, simple, built-in methods for doing that.
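As a rough illustration of the built-in route in JavaScript, here is a sketch using Number(). The helper name parseNumericString is hypothetical. Note that Number() is more permissive than the regexes in this thread: it also accepts exponent notation such as 1e3 and hex literals such as 0x10, so it only fits if that is acceptable.

```javascript
// Hypothetical helper: returns a finite number, or null when the text is not numeric.
function parseNumericString(text) {
  const trimmed = text.trim();
  // Number("") evaluates to 0, so blank input has to be rejected explicitly.
  if (trimmed === "") return null;
  const value = Number(trimmed);
  // NaN and +/-Infinity are both treated as "not a usable number".
  return Number.isFinite(value) ? value : null;
}

console.log(parseNumericString(".5"));
console.log(parseNumericString("0.5.5"));
console.log(parseNumericString("absnd"));
```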

Here's a much simpler solution that doesn't use any look-aheads or look-behinds:
^\d*\.?\d+$
To clearly understand why this works, read it from right to left:
At least one digit is required at the end.
7 works
77 works
.77 works
0.77 works
0. doesn't work
empty string doesn't work
A single period preceding the digit is optional.
.77 works
77 works
..77 doesn't work
Any number of digits preceding the (optional) period.
.77 works
0.77 works
0077.77 works
0077 works
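The cases listed above can be checked mechanically. A small JavaScript sketch, where the variable names are just illustrative:

```javascript
// Pattern from the answer above.
const plainDecimal = /^\d*\.?\d+$/;

// Each pair is [input, whether the walkthrough says it should match].
const walkthroughCases = [
  ["7", true], ["77", true], [".77", true], ["0.77", true],
  ["0.", false], ["", false], ["..77", false],
  ["0077.77", true], ["0077", true],
];

for (const [input, expected] of walkthroughCases) {
  const verdict = plainDecimal.test(input) === expected ? "as described" : "differs";
  console.log(JSON.stringify(input), verdict);
}
```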
Not using look-aheads and look-behinds has the added benefit of not having to worry about RegEx-based DOS attacks.
HTH

Nobody seems to be accounting for negative numbers. Also, some are creating a capture group which is unnecessary. This is the most thorough solution IMO.
^[+-]?(?:\d*\.)?\d+$

The following should work:
^(?!.*\..*\.)[.\d]+$
This uses a negative lookahead to make sure that there are fewer than two . characters in the string.
http://www.rubular.com/r/N3jl1ifJDX
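A quick JavaScript sketch of the lookahead version against the inputs from the question (variable names are mine). One thing worth knowing before using it: because the character class allows a dot anywhere, this pattern also accepts a lone . and a trailing dot such as 5., which may or may not be what you want.

```javascript
// Pattern from the answer above: the lookahead rejects any string containing two dots.
const atMostOneDot = /^(?!.*\..*\.)[.\d]+$/;

// Inputs from the question, plus two edge cases added for illustration.
const lookaheadSamples = [".5", "0.5", "1.5", "1234", "0.5.5", "absnd", ".", "5."];

for (const sample of lookaheadSamples) {
  console.log(JSON.stringify(sample), atMostOneDot.test(sample));
}
```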

This could work:
^(?:\d*\.)?\d+$

Related

Regex to match hexadecimal and integer numbers [duplicate]

In a regular expression, I need to know how to match one thing or another, or both (in order). But at least one of the things needs to be there.
For example, the following regular expression
/^([0-9]+|\.[0-9]+)$/
will match
234
and
.56
but not
234.56
While the following regular expression
/^([0-9]+)?(\.[0-9]+)?$/
will match all three of the strings above, but it will also match the empty string, which we do not want.
I need something that will match all three of the strings above, but not the empty string. Is there an easy way to do that?
UPDATE:
Both Andrew's and Justin's below work for the simplified example I provided, but they don't (unless I'm mistaken) work for the actual use case that I was hoping to solve, so I should probably put that in now. Here's the actual regexp I'm using:
/^\s*-?0*(?:[0-9]+|[0-9]{1,3}(?:,[0-9]{3})+)(?:\.[0-9]*)?(\s*|[A-Za-z_]*)*$/
This will match
45
45.988
45,689
34,569,098,233
567,900.90
-9
-34 banana fries
0.56 points
but it WON'T match
.56
and I need it to do this.
The fully general method, given regexes /^A$/ and /^B$/ is:
/^(A|B|AB)$/
i.e.
/^([0-9]+|\.[0-9]+|[0-9]+\.[0-9]+)$/
Note the others have used the structure of your example to make a simplification. Specifically, they (implicitly) factorised it, to pull out the common [0-9]* and [0-9]+ factors on the left and right.
The working for this is:
all the elements of the alternation end in [0-9]+, so pull that out: /^(|\.|[0-9]+\.)[0-9]+$/
Now we have the possibility of the empty string in the alternation, so rewrite it using ? (i.e. use the equivalence (|a|b) = (a|b)?): /^(\.|[0-9]+\.)?[0-9]+$/
Again, an alternation with a common suffix (\. this time): /^((|[0-9]+)\.)?[0-9]+$/
the pattern (|a+) is the same as a*, so, finally: /^([0-9]*\.)?[0-9]+$/
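If the factorisation feels like a leap of faith, one way to gain some confidence is to run the first and last forms against the same inputs. This JavaScript sketch does that for a handful of strings I picked by hand; it is a spot check, not a proof of equivalence.

```javascript
// Starting point (A|B|AB) and the fully factorised result from the steps above.
const expandedForm = /^([0-9]+|\.[0-9]+|[0-9]+\.[0-9]+)$/;
const factorisedForm = /^([0-9]*\.)?[0-9]+$/;

// Hypothetical spot-check inputs, including the empty string and malformed numbers.
const spotChecks = ["234", ".56", "234.56", "", ".", "234.", "1.2.3", "abc"];

// Collect any input on which the two patterns disagree.
const disagreeing = spotChecks.filter(
  (s) => expandedForm.test(s) !== factorisedForm.test(s)
);
console.log(disagreeing);
```

An empty array means the two patterns agreed on every sample.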
Nice answer by huon (and a bit of brain-twister to follow it along to the end). For anyone looking for a quick and simple answer to the title of this question, 'In a regular expression, match one thing or another, or both', it's worth mentioning that even (A|B|AB) can be simplified to:
A|A?B
Handy if B is a bit more complex.
Now, as c0d3rman's observed, this, in itself, will never match AB. It will only match A and B. (A|B|AB has the same issue.) What I left out was the all-important context of the original question, where the start and end of the string are also being matched. Here it is, written out fully:
^(A|A?B)$
Better still, just switch the order as c0d3rman recommended, and you can use it anywhere:
A?B|A
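To make the ordering issue concrete, here is a JavaScript sketch with A and B swapped for simple stand-ins (A is an integer part, B is a fractional part). Those stand-ins and the variable names are my assumption for illustration only.

```javascript
// A = \d+ and B = \.\d+ are illustrative stand-ins for the abstract A and B.
const aFirst = /\d+|(?:\d+)?\.\d+/;        // A|A?B, unanchored
const bFirst = /(?:\d+)?\.\d+|\d+/;        // A?B|A, unanchored
const aFirstAnchored = /^(?:\d+|(?:\d+)?\.\d+)$/; // ^(A|A?B)$

const orderingSample = "total: 234.56";

// Unanchored, the first alternative that succeeds wins, so A|A?B stops at "234".
console.log(orderingSample.match(aFirst)[0]);
// With the longer alternative first, the whole "234.56" is matched.
console.log(orderingSample.match(bFirst)[0]);
// With anchors, the engine is forced to backtrack into the second alternative.
console.log(aFirstAnchored.test("234.56"));
```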
Yes, you can match all of these with such an expression:
/^[0-9]*\.?[0-9]+$/
Note, it also doesn't match the empty string (your last condition).
Sure. You want the optional quantifier, ?.
/^(?=.)([0-9]+)?(\.[0-9]+)?$/
The above is slightly awkward-looking, but I wanted to show you your exact pattern with some ?s thrown in. In this version, (?=.) makes sure it doesn't accept an empty string, since I've made both clauses optional. A simpler version would be this:
/^\d*\.?\d+$/
This satisfies your requirements, including preventing an empty string.
Note that there are many ways to express this. Some are long and some are very terse, but they become more complex depending on what you're trying to allow/disallow.
Edit:
If you want to match this inside a larger string, I recommend splitting on spaces and testing the results with /^\d*\.?\d+$/. Otherwise, you'll risk either matching stuff like aaa.123.456.bbb or missing matches (trust me, you will. JavaScript's lack of lookbehind support ensures that it will be possible to break any pattern I can think of).
If you know for a fact that you won't get strings like the above, you can use word breaks instead of ^$ anchors, but it will get complicated because there's no word break between . and (a space).
/(\b\d+|\B\.)?\d*\b/g
That ought to do it. It will block stuff like aaa123.456bbb, but it will allow 123, 456, or 123.456. It will allow aaa.123.456.bbb, but as I've said, you'll need two steps if you want to comprehensively handle that.
Edit 2: Your use case
If you want to allow whitespace at the beginning, negative/positive marks, and words at the end, those are actually fairly strict rules. That's a good thing. You can just add them on to the simplest pattern above:
/^\s*[-+]?\d*\.?\d+[a-z_\s]*$/i
Allowing thousands groups complicates things greatly, and I suggest you take a look at the answer I linked to. Here's the resulting pattern:
/^\s*[-+]?(\d+|\d{1,3}(,\d{3})*)?(\.\d+)?\b(\s[a-z_\s]*)?$/i
The \b ensures that the numeric part ends with a digit, and is followed by at least one whitespace.
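Here is a JavaScript sketch that runs both patterns from this edit against the strings listed in the question. The variable names are mine, and the extra negative cases at the end are assumptions added for illustration.

```javascript
// The two patterns from the answer above.
const withoutGroups = /^\s*[-+]?\d*\.?\d+[a-z_\s]*$/i;
const withGroups = /^\s*[-+]?(\d+|\d{1,3}(,\d{3})*)?(\.\d+)?\b(\s[a-z_\s]*)?$/i;

// Strings from the question that should all be accepted once thousands groups are allowed.
const useCaseSamples = [
  "45", "45.988", "45,689", "34,569,098,233", "567,900.90",
  "-9", "-34 banana fries", "0.56 points", ".56",
];

for (const sample of useCaseSamples) {
  console.log(JSON.stringify(sample), withoutGroups.test(sample), withGroups.test(sample));
}

// Extra inputs (my own) that should be rejected by the second pattern.
console.log(withGroups.test(""), withGroups.test("banana"));
```

The simpler pattern is expected to reject only the comma-separated samples, while the second one should accept the whole list.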
Maybe this helps (to give you the general idea):
(?:((?(digits).^|[A-Za-z]+)|(?<digits>\d+))){1,2}
This pattern matches characters, digits, or digits following characters, but not characters following digits.
The pattern matches aa, aa11, and 11, but not 11aa, aa11aa, or the empty string.
Don't be puzzled by the ".^", which means "a character followed by line start". It is intended to prevent any match at all.
Be warned that this does not work with all flavors of regex, your version of regex must support (?(named group)true|false).

Most succinct regular expression for integers from 0-100 inclusive

I'm trying to find the most succinct way to write a regular expression for integers from 0-100 inclusive. This is what I have so far, is there a better form?
^[0-9][0-9]?$|^100$
Regex is a very powerful tool for certain tasks, but it can quickly get out of hand when applied to things it's not designed for. It's hard to say without knowing why you need this particular regex, but in most cases I would prefer capturing the number you want and then using your programming language to evaluate whether the captured value is in the desired range. This just seems like a case where regex is going to needlessly complicate your code.
That said, if you're committed to using a regex and don't want leading zeros, you probably want ^[1-9]?\d$|^100$.
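A minimal JavaScript sketch of the "capture, then compare in code" approach described above. The function name isZeroToHundred is hypothetical, and this version deliberately tolerates leading zeros such as 007; tighten the regex if you don't want that.

```javascript
// Hypothetical helper: the regex only checks the shape (digits only);
// the range check is ordinary arithmetic.
function isZeroToHundred(text) {
  if (!/^\d+$/.test(text)) return false;
  return Number(text) <= 100;
}

console.log(isZeroToHundred("0"));
console.log(isZeroToHundred("100"));
console.log(isZeroToHundred("101"));
console.log(isZeroToHundred("-1"));
```

Changing the range later means editing one number instead of rewriting a pattern.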
I'd recommend against doing this, but to answer your question...I'd argue that this regular expression is the most succinct/pure version:
^(?:100|[1-9]?[0-9])$
Demo
Notes
I used a (non-capturing) group so the ^ and $ anchors only are used once.
I put 100 first in the alternation since 99% (arbitrary estimation) of the time it will be more efficient...200 will fail right away rather than matching 20 and then failing.
I elected not to use \d, since it isn't the same as [0-9] in every flavor (in some, \d also matches non-ASCII Unicode digits).
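A range pattern like this is small enough to verify by brute force. Here is a JavaScript sketch (names are illustrative) that compares the regex against a plain numeric check for every integer from -5 to 1000.

```javascript
// Pattern from the answer above.
const zeroToHundred = /^(?:100|[1-9]?[0-9])$/;

// Collect every integer where the regex and the arithmetic check disagree.
const rangeMismatches = [];
for (let n = -5; n <= 1000; n++) {
  const shouldMatch = n >= 0 && n <= 100;
  if (zeroToHundred.test(String(n)) !== shouldMatch) rangeMismatches.push(n);
}
console.log(rangeMismatches);

// Leading zeros are rejected by this pattern.
console.log(zeroToHundred.test("007"));
```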
Handling every case, like 001 or 00001, makes it more complex; this is the best I can think of (written in basic, grep-style syntax). For sure you can use \d to make it look shorter.
^0*\(100\|[0-9]\?[0-9]\)$

What do we need Lookahead/Lookbehind Zero Width Assertions for?

I've just learned about these two concepts in more detail. I've always been good with RegEx, and it seems I've never seen the need for these 2 zero width assertions.
I'm pretty sure I'm wrong, but I do not see why these constructs are needed. Consider this example:
Match a 'q' which is not followed by a 'u'.
2 strings will be the input:
Iraq
quit
With negative lookahead, the regex looks like this:
q(?!u)
Without it, it looks like this:
q[^u]
For the given input, both of these regex give the same results (i.e. matching Iraq but not quit) (tested with perl). The same idea applies to lookbehinds.
Am I missing a crucial feature that makes these assertions more valuable than the classic syntax?
Why your test probably worked (and why it shouldn't)
The reason you were able to match Iraq in your test might be that your string contained a \n at the end (for instance, if you read it from the shell). If you have a string that ends in q, then q[^u] cannot match it as the others said, because [^u] matches a non-u character - but the point is there has to be a character.
What do we actually need lookarounds for?
Obviously in the above case, lookaheads are not vital. You could work around this by using q(?:[^u]|$). So we match only if q is followed by a non-u character or the end of the string. There are much more sophisticated uses for lookaheads though, which become a pain if you do them without lookaheads.
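To see the trailing-newline effect and the workaround side by side, here is a JavaScript sketch. The variable names are mine; the three patterns are the ones discussed above.

```javascript
// The three variants discussed above.
const negatedClass = /q[^u]/;
const negativeLookahead = /q(?!u)/;
const classOrEnd = /q(?:[^u]|$)/;

// "Iraq\n" simulates a line read with its trailing newline still attached.
const qInputs = ["Iraq", "Iraq\n", "quit"];

for (const input of qInputs) {
  console.log(
    JSON.stringify(input),
    negatedClass.test(input),
    negativeLookahead.test(input),
    classOrEnd.test(input)
  );
}
```

Only the negated class should fail on plain Iraq; once the newline is present it has a character to consume and matches.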
This answer tries to give an overview of some important standard situations which are best solved with lookarounds.
Let's start with looking at quoted strings. The usual way to match them is with something like "[^"]*" (not with ".*?"). After the opening ", we simply repeat as many non-quote characters as possible and then match the closing quote. Again, a negated character class is perfectly fine. But there are cases, where a negated character class doesn't cut it:
Multi-character delimiters
Now what if we don't have double-quotes to delimit our substring of interest, but a multi-character delimiter. For instance, we are looking for ---sometext---, where single and double - are allowed within sometext. Now you can't just use [^-]*, because that would forbid single -. The standard technique is to use a negative lookahead at every position, and only consume the next character, if it is not the beginning of ---. Like so:
---(?:(?!---).)*---
This might look a bit complicated if you haven't seen it before, but it's certainly nicer (and usually more efficient) than the alternatives.
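A short JavaScript sketch of this pattern in action. The sample string is made up for illustration.

```javascript
// Pattern from the text: consume one character at a time, but only while
// the next three characters are not the closing "---".
const dashDelimited = /---(?:(?!---).)*---/;

// Hypothetical input with single and double dashes inside the delimiters.
const dashSample = "x ---a-b--c--- y";
const dashMatch = dashSample.match(dashDelimited);

console.log(dashMatch ? dashMatch[0] : null);
```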
Different delimiters
You get a similar case, where your delimiter is only one character but could be one of two (or more) different characters. For instance, say in our initial example, we want to allow for both single- and double-quoted strings. Of course, you could use '[^']*'|"[^"]*", but it would be nice to treat both cases without an alternative. The surrounding quotes can easily be taken care of with a backreference: (['"])[^'"]*\1. This makes sure that the match ends with the same character it began with. But now we're too restrictive - we'd like to allow " in single-quoted and ' in double-quoted strings. Something like [^\1] doesn't work, because a backreference will in general contain more than one character. So we use the same technique as above:
(['"])(?:(?!\1).)*\1
That is after the opening quote, before consuming each character we make sure that it is not the same as the opening character. We do that as long as possible, and then match the opening character again.
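The same idea in a runnable JavaScript sketch, using a made-up sentence that mixes both quote styles:

```javascript
// Pattern from the text, with the g flag added so all quoted strings are returned.
const quotedString = /(['"])(?:(?!\1).)*\1/g;

// Hypothetical input: a double-quoted string containing ', and a single-quoted one containing ".
const quoteSample = `say "it's" and 'a "b" c'`;
const quotedParts = quoteSample.match(quotedString);

console.log(quotedParts);
```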
Overlapping matches
This is a (completely different) problem that can usually not be solved at all without lookarounds. If you search for a match globally (or want to regex-replace something globally), you may have noticed that matches can never overlap. I.e. if you search for ... in abcdefghi you get abc, def, ghi and not bcd, cde and so on. This can be a problem if you want to make sure that your match is preceded (or surrounded) by something else.
Say you have a CSV file like
aaa,111,bbb,222,333,ccc
and you want to extract only fields that are entirely numerical. For simplicity, I'll assume that there is no leading or trailing whitespace anywhere. Without lookarounds, we might go with capturing and try:
(?:^|,)(\d+)(?:,|$)
So we make sure that we have the start of a field (start of string or ,), then only digits, and then the end of a field (, or end of string). Between that we capture the digits into group 1. Unfortunately, this will not give us 333 in the above example, because the , that precedes it was already part of the match ,222, - and matches cannot overlap. Lookarounds solve the problem:
(?<=^|,)\d+(?=,|$)
Or if you prefer double negation over alternation, this is equivalent to
(?<![^,])\d+(?![^,])
In addition to being able to get all matches, we get rid of the capturing which can generally improve performance. (Thanks to Adrian Pronk for this example.)
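Here is the CSV example as a JavaScript sketch. It assumes an engine with lookbehind support (ES2018 or later, so any reasonably recent Node or browser); the variable names are mine.

```javascript
const csvLine = "aaa,111,bbb,222,333,ccc";

// Capturing version: the delimiters are consumed, so adjacent numeric fields are skipped.
const viaCapture = [...csvLine.matchAll(/(?:^|,)(\d+)(?:,|$)/g)].map((m) => m[1]);

// Lookaround version: the delimiters are only asserted, never consumed.
const viaLookaround = csvLine.match(/(?<=^|,)\d+(?=,|$)/g);

console.log(viaCapture);
console.log(viaLookaround);
```

The capturing version should miss 333 because its leading comma was already consumed by the match for 222.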
Multiple independent conditions
Another very classic example of when to use lookarounds (in particular lookaheads) is when we want to check multiple conditions on an input at the same time. Say we want to write a single regex that makes sure our input contains a digit, a lower case letter, an upper case letter, a character that is none of those, and no whitespace (say, for password security). Without lookarounds you'd have to consider all permutations of digit, lower case/upper case letter, and symbol. Like:
\S*\d\S*[a-z]\S*[A-Z]\S*[^0-9a-zA-Z]\S*|\S*\d\S*[A-Z]\S*[a-z]\S*[^0-9a-zA-Z]\S*|...
Those are only two of the 24 necessary permutations. If you also want to ensure a minimum string length in the same regex, you'd have to distribute those in all possible combinations of the \S* - it simply becomes impossible to do in a single regex.
Lookahead to the rescue! We can simply use several lookaheads at the beginning of the string to check all of these conditions:
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)
Because the lookaheads don't actually consume anything, after checking each condition the engine resets to the beginning of the string and can start looking at the next one. If we wanted to add a minimum string length (say 8), we could simply append (?=.{8}). Much simpler, much more readable, much more maintainable.
Important note: This is not the best general approach to check these conditions in any real setting. If you are making the check programmatically, it's usually better to have one regex for each condition, and check them separately - this lets you return a much more useful error message. However, the above is sometimes necessary, if you have some fixed framework that lets you do validation only by supplying a single regex. In addition, it's worth knowing the general technique, if you ever have independent criteria for a string to match.
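A rough JavaScript sketch contrasting the two approaches: the single lookahead regex (with the minimum length of 8 appended) and one check per condition. The rule messages, the passwordRules table and the passwordProblems helper are hypothetical names for illustration, not a recommended password policy.

```javascript
// Single-regex version from the text, with (?=.{8}) appended for the length rule.
const combinedCheck = /^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)(?=.{8})/;

// One small check per condition, so each failure can be reported by name.
const passwordRules = [
  { message: "needs a digit", ok: (s) => /\d/.test(s) },
  { message: "needs a lower case letter", ok: (s) => /[a-z]/.test(s) },
  { message: "needs an upper case letter", ok: (s) => /[A-Z]/.test(s) },
  { message: "needs a character that is not a letter or digit", ok: (s) => /[^0-9a-zA-Z]/.test(s) },
  { message: "must not contain whitespace", ok: (s) => !/\s/.test(s) },
  { message: "needs at least 8 characters", ok: (s) => s.length >= 8 },
];

// Hypothetical helper: returns the messages of all rules the input fails.
function passwordProblems(candidate) {
  return passwordRules.filter((rule) => !rule.ok(candidate)).map((rule) => rule.message);
}

console.log(combinedCheck.test("abcdef1!"), passwordProblems("abcdef1!"));
console.log(combinedCheck.test("Abcdef1!"), passwordProblems("Abcdef1!"));
```

The single regex can only say yes or no, while the per-rule version tells the user which condition failed.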
I hope these examples give you a better idea of why people would like to use lookarounds. There are a lot more applications (another classic is inserting commas into numbers), but it's important that you realise that there is a difference between (?!u) and [^u] and that there are cases where negated character classes are not powerful enough at all.
q[^u] will not match "Iraq" because it will look for another symbol.
q(?!u) however, will match "Iraq":
regex = /q[^u]/
/q[^u]/
regex.test("Iraq")
false
regex.test("Iraqf")
true
regex = /q(?!u)/
/q(?!u)/
regex.test("Iraq")
true
Well, another thing, along with what others mentioned: with a negative lookahead you can negate a sequence of consecutive characters (e.g. ui). With [^...] you cannot negate ui, only either u or i, and if you try [^ui]{2}, you will also negate uu, ii and iu.
The whole point is to not "consume" the next character(s), so that it can be e.g. captured by another expression that comes afterwards.
If they're the last expression in the regex, then what you've shown are equivalent.
But e.g. q(?!u)([a-z]) would let the non-u character be part of the next group.
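A small JavaScript sketch of that difference, using Iraqi as a made-up input:

```javascript
// With the lookahead, the character after q is left for the capture group.
const lookaheadResult = "Iraqi".match(/q(?!u)([a-z])/);

// With the negated class, that character is consumed, so the group needs one more letter.
const negatedClassResult = "Iraqi".match(/q[^u]([a-z])/);

console.log(lookaheadResult ? lookaheadResult[1] : null);
console.log(negatedClassResult);
```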

Is there a simple regex to compare numbers to x?

I want a regex that will match if a number is greater than or equal to an arbitrary number. This seems monstrously complex for such a simple task... it seems like you need to reinvent 'counting' in an explicit regex hand-crafted for the x.
For example, intuitively to do this for numbers greater than 25, I get
(\d{3,}|[3-9]\d|2[6-9]\d)
What if the number was 512345? Is there a simpler way?
It seems that there is no simpler way. Regex is not the right tool for comparing numbers.
You may try this one:
[1-9]\d{6,}|
[6-9]\d{5}|
5[2-9]\d{4}|
51[3-9]\d{3}|
512[4-9]\d{2}|
5123[5-9]\d|
51234[6-9]
(newlines for clarity)
What if the number was 512345? Is there a simpler way?
No, a regex to match a number in a certain range will be a horrible looking thing (especially large numbers ranges).
Regex is simply not meant for such tasks. The better solution would be to "freely" match the digits, like \d+, and then compare them with the language's relational operators (<, >, ...).
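In JavaScript, the "match freely, then compare" approach might look like the sketch below. The helper name numbersAtLeast and the sample text are assumptions for illustration; for digit strings beyond the safe integer range you would want BigInt instead of Number.

```javascript
// Hypothetical helper: pull out every run of digits, then filter numerically.
function numbersAtLeast(text, threshold) {
  const digitRuns = text.match(/\d+/g) || []; // match() returns null when nothing is found
  return digitRuns.map(Number).filter((n) => n >= threshold);
}

console.log(numbersAtLeast("ids: 512344, 512345, 600000, 42", 512345));
```

Changing the threshold is a one-argument change, with no pattern to regenerate.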
In Perl you can use the conditional regexp construct (?(condition)yes-pattern) where the (condition) is (?{CODE}) to run arbitrary Perl code. If you make the yes-pattern be (*FAIL) then you have a regexp fragment which succeeds only when CODE returns false. Thus:
use feature 'say';  # 'say' needs Perl 5.10 or later
foreach (0 .. 50) {
    if (/\A(\d+)(?(?{$1 <= 25})(*FAIL))\z/) {
        say "$_ matches";
    }
    else {
        say "$_ does not match";
    }
}
The code-evaluation feature used to be marked as experimental but the latest 'perlre' manual page (http://perldoc.perl.org/perlre.html) seems to now imply it is a core language feature.
Technically, what you have is no longer a 'regular expression' of course, but some hybrid of regexp and external code.
I've never heard of a regex flavor that can do that. Writing a Perl module to generate the appropriate regex (as you mentioned in your comment) sounds like a good idea to me. In fact, I'd be surprised if it hasn't been done already. Check CPAN first.
By the way, your regex contains a few more errors besides the excess pipes Yuriy pointed out.
First, the "three or more digits" portion will match invalid numbers like 024 and 00000007. You can solve that by requiring the first digit to be greater than zero. If you want to allow for leading zeroes, you can match them separately.
The third part, 2[6-9]\d, only matches numbers >= 260. Perhaps you meant to make the third digit optional (i.e. 2[6-9]\d?), but that would be redundant.
You should anchor the regex somehow to make sure you aren't matching part of a longer number or a "word" with digits in it. I don't know the best way to do that in your particular situation, but word boundaries (i.e. \b) will probably be all you need.
End result:
\b0*([1-9]\d{2,}|[3-9]\d|2[6-9])\b
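Hand-built range patterns are easy to get subtly wrong, so a brute-force comparison is a cheap sanity check. A JavaScript sketch with illustrative names:

```javascript
// End-result pattern from the answer above (numbers greater than 25).
const greaterThan25 = /\b0*([1-9]\d{2,}|[3-9]\d|2[6-9])\b/;

// Collect every integer in 0..2000 where the regex disagrees with n > 25.
const thresholdMismatches = [];
for (let n = 0; n <= 2000; n++) {
  if (greaterThan25.test(String(n)) !== (n > 25)) thresholdMismatches.push(n);
}
console.log(thresholdMismatches);

// Leading zeros are tolerated: 026 counts as 26, 007 as 7.
console.log(greaterThan25.test("026"), greaterThan25.test("007"));
```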

A Question of Greedy vs. Negated Character Classes in Regex

I have a very large file that looks like this (see below). I have two basic choices of regex to use on it (I know there may be others but I'm really trying to compare Greedy and Negated Char Class) methods.
ftp: [^\D]{1,}
ftp: (\d)+
ftp: \d+
Note: what if I took off the parentheses around the \d?
Now + is greedy, which forces backtracking, but the Negated Char Class requires a char-by-char comparison. Which is more efficient? Assume the file is very, very large, so minute differences in processor usage will become exaggerated due to the length of the file.
Now that you've answered that, What if my Negated Char Class was very large, say 18 different characters? Would that change your answer?
Thanks.
ftp: 1117 bytes
ftp: 5696 bytes
ftp: 3207 bytes
ftp: 5696 bytes
ftp: 7200 bytes
[^\D]{1,} and \d+ are exactly the same. The regex parser will compile [^\D] and \d into character classes with equal content, and + is just short for {1,}.
If you want lazy repetition you can add a ? at the end.
\d+?
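The question is about Perl, but the equivalence is easy to see in any flavor. Here is a JavaScript sketch (variable names are mine) that runs both forms over the sample lines, and shows what the lazy quantifier does when nothing follows it.

```javascript
// Sample lines taken from the question.
const ftpLines = ["ftp: 1117 bytes", "ftp: 5696 bytes", "ftp: 3207 bytes"];

// Capture the digits with each form and compare the results.
const sameCaptures = ftpLines.every((line) => {
  const viaNegatedClass = line.match(/ftp: ([^\D]{1,})/)[1];
  const viaDigitClass = line.match(/ftp: (\d+)/)[1];
  return viaNegatedClass === viaDigitClass;
});
console.log(sameCaptures);

// Lazy repetition with nothing after it stops at the first digit.
const lazyMatch = "ftp: 1117 bytes".match(/ftp: \d+?/)[0];
console.log(lazyMatch);
```

This only shows the two forms match the same text; it says nothing about relative speed.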
The character classes are usually compiled into bitmaps for ASCII-characters. For Unicode (>=256) it is implementation dependent. One way could be to create a list of ranges, and use binary search on it.
For ASCII the lookup time is constant over the size. For Unicode it is logarithmic or linear.
Both your expressions have the same greediness. As others have said here, except for the capturing group they will execute in the same way.
Also, in this case greediness won't matter much for execution speed, since you don't have anything following \d*. The expression will simply process all the digits it can find and stop when the space is encountered. No backtracking should occur with these expressions.
To make it more explicit, backtracking should occur if you have an expression like this:
\d*123
In this case the parser will engulf all the digits, then backtrack to match the three following digits.
Yeah, I agree with MizardX... these two expressions are semantically equivalent. Although the grouping could require additional resources. That's not what you were asking about.
My initial tests show that [^\D]{1,} is a bit slower than \d+: on a 184M file the former takes 9.6 seconds while the latter takes 8.2 seconds.
Without capturing (the ()'s) both are about 1 second faster, but the difference between the two is about the same.
I also did a more extensive test where the captured value is printed to /dev/null, with a third version splitting on the space, results:
([^\D]{1,}): ~18s
(\d+): ~17s
(split / /)[1]: ~17s
Edit: split version improved and time decreased to be the same or lower than (\d+)
Fastest version so far (can anyone improve?):
while (<>)
{
    if ($foo = (split / /)[1])
    {
        print $foo . "\n";
    }
}
This is kind of a trick question as written because (\d)+ takes slightly longer due to the overhead of the capturing parentheses. If you change it to \d+ they take the same amount of time in my Perl / system.
Not a direct answer to the question, but why not a different approach altogether, since you know the format of the lines already? For example, you could use a regex on the whitespace between the fields, or avoid regex altogether and split() on the whitespace, which is generally going to be faster than any regular expression, depending on the language you're using.
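For comparison, the split idea in JavaScript might look like this sketch. The helper name byteCountField is hypothetical, and whether it beats a regex depends on the engine and the data, so measure before relying on it.

```javascript
// Hypothetical helper: the lines have the fixed shape "ftp: <number> bytes",
// so the number is simply the second space-separated field.
function byteCountField(line) {
  const fields = line.split(" ");
  return fields.length > 1 ? fields[1] : null;
}

console.log(byteCountField("ftp: 1117 bytes"));
console.log(byteCountField("ftp:"));
```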