Regex for All Digits but Not Exclusively Zeros - regex

I'm trying to find a regex that matches inputs that:
are non-empty AND
are not exclusively zeros (although leading zeros are fine) AND
have no non-digits
Put another way, the string must consist entirely of digits, but not exclusively zeros.

You could use a couple lookaheads:
^(?!0*$)(?=\d+$)
https://regex101.com/r/fTgsWE/3/

Or a faster version that avoids lookarounds altogether:
^\d*[1-9]\d*$
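To see how the two patterns above behave, here is a minimal sketch in Python. The sample inputs are made up for illustration; they are not from the question.

```python
import re

# Both patterns are copied from the answer above.
LOOKAHEAD = re.compile(r"^(?!0*$)(?=\d+$)")
NO_LOOKAROUND = re.compile(r"^\d*[1-9]\d*$")

# Hypothetical sample inputs, chosen to cover each requirement.
samples = ["123", "007", "0", "000", "", "12a"]

for s in samples:
    by_lookahead = LOOKAHEAD.search(s) is not None
    by_plain = NO_LOOKAROUND.search(s) is not None
    print(f"{s!r:8} lookahead={by_lookahead} plain={by_plain}")
```

Note that the lookahead version only asserts; the match itself is empty, so it is useful for validation but not for extracting the digits.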

Related

Regex with constraints, look-ahead function

I'm trying to write a regex that matches a number of 10 to 12 digits. There may be up to five optional leading zeros ({0,5}), followed by a numeric string of 10-12 digits. Regardless of the number of leading zeros (0 to 5), I want 10-12 digits after them.
Example:
0000012345 should not pass
0012345678 should not pass, as there are only 8 digits after the leading zeros
I've tried:
^(0{0,5}(?=\d{10,12}$)^\d{1,2}?\s?(\d{10})$
I think
^0{0,5}[1-9]\d{9,11}$
should be what you need. It enforces not counting the leading zeroes as one of the later digits by requiring it be non-zero. Then there can be 9-11 other digits (including 0).
If you need to include an optional space at any point (as suggested by your RegEx), the RegEx would grow a lot, and it might be easier to do this with some additional code. However, if you give the exact requirements, I will edit the answer accordingly.
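A quick way to check the ^0{0,5}[1-9]\d{9,11}$ pattern against the examples from the question is a small Python sketch like the one below. The helper name is_valid is made up for illustration, and the inputs beyond the two from the question are assumptions.

```python
import re

# Pattern copied from the answer above.
PATTERN = re.compile(r"^0{0,5}[1-9]\d{9,11}$")

def is_valid(s: str) -> bool:
    """Return True if s is 0-5 leading zeros followed by 10-12 digits."""
    return PATTERN.search(s) is not None

for s in ["0000012345", "0012345678", "1234567890", "000001234567890"]:
    print(s, is_valid(s))
```

The first two inputs are the ones the question says should fail; the last two are a bare 10-digit number and the same number with five leading zeros.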
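^0{0,5}+\d{0,2}\s?\d{10}$
You didn't specify the language. You need a "possessive quantifier" here (the + right after {0,5}), so that the leading zeros, once matched, are not given back to the digits that follow.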
See demo:
https://regex101.com/r/m5sOAJ/1
Or if your regex does not support possessive quantifiers:
^(?=(0{0,5}))\1\d{0,2}\s?\d{10}$
See demo:
https://regex101.com/r/m5sOAJ/3
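To illustrate why the possessive behaviour matters, here is a sketch in Python comparing a plain greedy version with the lookahead-plus-backreference version above. The plain greedy pattern is my own addition for comparison, not something from the answer. As far as I know, Python 3.11 and later also accept the {0,5}+ syntax directly, but the emulation runs on older versions too.

```python
import re

# Plain greedy version (added for comparison): the engine may hand zeros
# back to the later \d parts when it backtracks.
GREEDY = re.compile(r"^0{0,5}\d{0,2}\s?\d{10}$")

# Emulation from the answer above: the lookahead captures the leading zeros
# once, and \1 then consumes exactly that text with no way to shrink it.
ATOMIC = re.compile(r"^(?=(0{0,5}))\1\d{0,2}\s?\d{10}$")

for s in ["0000012345", "0012345678", "001234567890"]:
    print(s, bool(GREEDY.search(s)), bool(ATOMIC.search(s)))
```

The greedy version accepts the first two inputs because it can treat the zeros as ordinary digits; the emulated possessive version rejects them.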

how to find integer with comma and zeros after that (regex)?

I'm trying to create regex(es) to extract all integers. An integer can be 6 or -12, but also +6.000 or -5,0. I also need another regex to extract real numbers which are not integers, for example 3.14 or -6,26, but not 5.0.
For finding integers I tried "^[+-]?([0-9]+)(\\[.,]0{1,})?$" but it doesn't work on -6.00. And I have no idea how to create the second regex (how to exclude integers with commas or dots followed by zeros). Any help appreciated.
The problem with your integer regex appears to be the backslash(es). I don't know any regex engine in which you would need to escape the opening bracket of a character class, and you certainly don't want to match a literal backslash. Also, to a regex engine that understands it at all, the quantifier {1,} is an uglier, more complex way of saying +.
This should do your integer matching:
"^[+-]?[0-9]+([.,]0+)?$"
And this variation should do your non-integer matching:
"^[+-]?[0-9]+[.,]0*[1-9][0-9]*$"
In both cases I omitted parentheses not needed for expressing a correct pattern, but if you need to capture parts of the match then you will want to add some back in. You might also want to convert the grouping parentheses into non-capturing form if you are using a regex engine that supports it.
Also, the real number pattern requires at least one digit before the fraction separator character, per your examples. It would be easy to convert the pattern to also match strings of the form .1 or -.17. Similarly, the integer pattern requires at least one zero in the fraction part if there is a fraction separator, and that restriction could be removed, too.
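For anyone who wants to try the two patterns, here is a minimal Python sketch. The classify helper and its return labels are hypothetical names for illustration; the inputs are the examples from the question plus a few edge cases of my own.

```python
import re

# Patterns copied from the answer above.
INTEGER = re.compile(r"^[+-]?[0-9]+([.,]0+)?$")
NON_INTEGER = re.compile(r"^[+-]?[0-9]+[.,]0*[1-9][0-9]*$")

def classify(s: str) -> str:
    """Label s as 'integer', 'real' (non-integer), or 'neither'."""
    if INTEGER.search(s):
        return "integer"
    if NON_INTEGER.search(s):
        return "real"
    return "neither"

for s in ["6", "-12", "+6.000", "-5,0", "-6.00", "3.14", "-6,26", "5.0", "5."]:
    print(s, classify(s))
```

The two patterns are mutually exclusive by construction: the fraction part is either all zeros or contains at least one non-zero digit.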

Regex no two consecutive characters are the same

How do I write a regular expression where x is a string whose characters are either a, b, c but no two consecutive characters are the same
For example
abcacb is true
acbaac is false
^(?!.*(.)\1)[abc]+$ works if you follow the original question exactly. However, it does not handle multiple "words" made of the characters a/b/c, e.g. "abc cba".
The way it works is it asserts that any character is not followed by itself by utilizing a capture group inside a lookahead and that the entire string consists only of characters "a", "b", or "c".
Since the number of chars is limited, you can get away without a back reference in the look ahead:
^(?!.*(aa|bb|cc))[abc]*$
But I like tenub's answer better :)
Using negative lookbehind: ^([abc]([abc](?<!(aa|bb|cc)))*)?$
Using negative lookahead: ^(((?!(aa|bb|cc))[abc])*[abc])?$
Prefer either (both do the same job but differently) if you are going to use this regex as a part of some bigger regex that you might be creating.
In short, this is reusable. Copy & paste and it will do its work without disturbing any regex that is present around it.
In my humble opinion, the regexes provided by @tenub and @Bohemian are not reusable in this way, which can cause bugs.
Note: empty string ("") will pass these 2 regexes. If you don't want it to, remove ? from regex.
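Here is a small Python sketch comparing the backreference pattern with the negative-lookahead variant on the examples from the question. The extra inputs (the empty string and "abd") are my own additions.

```python
import re

# Backreference inside a lookahead (first answer above).
BACKREF = re.compile(r"^(?!.*(.)\1)[abc]+$")
# Negative-lookahead variant that checks each position as it consumes it.
PER_CHAR = re.compile(r"^(((?!(aa|bb|cc))[abc])*[abc])?$")

for s in ["abcacb", "acbaac", "abd", ""]:
    print(repr(s), bool(BACKREF.search(s)), bool(PER_CHAR.search(s)))
```

The only input on which the two disagree here is the empty string, which matches the note above about the trailing ?.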

What do we need Lookahead/Lookbehind Zero Width Assertions for?

I've just learned about these two concepts in more detail. I've always been good with RegEx, and it seems I've never seen the need for these 2 zero width assertions.
I'm pretty sure I'm wrong, but I do not see why these constructs are needed. Consider this example:
Match a 'q' which is not followed by a 'u'.
2 strings will be the input:
Iraq
quit
With negative lookahead, the regex looks like this:
q(?!u)
Without it, it looks like this:
q[^u]
For the given input, both of these regex give the same results (i.e. matching Iraq but not quit) (tested with perl). The same idea applies to lookbehinds.
Am I missing a crucial feature that makes these assertions more valuable than the classic syntax?
Why your test probably worked (and why it shouldn't)
The reason you were able to match Iraq in your test might be that your string contained a \n at the end (for instance, if you read it from the shell). If you have a string that ends in q, then q[^u] cannot match it as the others said, because [^u] matches a non-u character - but the point is there has to be a character.
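The trailing-newline effect is easy to reproduce. This is a sketch in Python rather than Perl (the question used Perl), on the assumption that the behaviour of a negated character class is the same in both: [^u] happily matches a newline.

```python
import re

CLASS = re.compile(r"q[^u]")       # needs some character after q
LOOKAHEAD = re.compile(r"q(?!u)")  # only needs "no u after q"

print(CLASS.search("Iraq"))        # no character after q, so no match
print(CLASS.search("Iraq\n"))      # the newline is consumed as the non-u character
print(LOOKAHEAD.search("Iraq"))    # matches just the q
print(LOOKAHEAD.search("quit"))    # q is followed by u, so no match
```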
What do we actually need lookarounds for?
Obviously in the above case, lookaheads are not vital. You could work around this by using q(?:[^u]|$). So we match only if q is followed by a non-u character or the end of the string. There are much more sophisticated uses for lookaheads though, which become a pain if you do them without lookaheads.
This answer tries to give an overview of some important standard situations which are best solved with lookarounds.
Let's start with looking at quoted strings. The usual way to match them is with something like "[^"]*" (not with ".*?"). After the opening ", we simply repeat as many non-quote characters as possible and then match the closing quote. Again, a negated character class is perfectly fine. But there are cases, where a negated character class doesn't cut it:
Multi-character delimiters
Now what if we don't have double-quotes to delimit our substring of interest, but a multi-character delimiter? For instance, we are looking for ---sometext---, where single and double - are allowed within sometext. Now you can't just use [^-]*, because that would forbid single -. The standard technique is to use a negative lookahead at every position, and only consume the next character if it is not the beginning of ---. Like so:
---(?:(?!---).)*---
This might look a bit complicated if you haven't seen it before, but it's certainly nicer (and usually more efficient) than the alternatives.
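As a quick illustration of the pattern above, here is a Python sketch. The sample text is invented; it contains single and double hyphens inside the delimited parts.

```python
import re

# Pattern copied from above: consume one character at a time, but only
# while the next three characters are not the closing delimiter.
DELIMITED = re.compile(r"---(?:(?!---).)*---")

# Hypothetical input with single and double hyphens inside the payload.
text = "x ---some-text--here--- y ---second--- z"

print(DELIMITED.findall(text))
```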
Different delimiters
You get a similar case, where your delimiter is only one character but could be one of two (or more) different characters. For instance, say in our initial example, we want to allow for both single- and double-quoted strings. Of course, you could use '[^']*'|"[^"]*", but it would be nice to treat both cases without an alternative. The surrounding quotes can easily be taken care of with a backreference: (['"])[^'"]*\1. This makes sure that the match ends with the same character it began with. But now we're too restrictive - we'd like to allow " in single-quoted and ' in double-quoted strings. Something like [^\1] doesn't work, because a backreference will in general contain more than one character. So we use the same technique as above:
(['"])(?:(?!\1).)*\1
That is after the opening quote, before consuming each character we make sure that it is not the same as the opening character. We do that as long as possible, and then match the opening character again.
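The same idea can be tried out in Python. This is only a sketch with a made-up sentence; finditer is used instead of findall because the pattern has a capturing group and findall would return only the quote characters.

```python
import re

# Pattern copied from above: \1 is whichever quote opened the string.
QUOTED = re.compile(r"""(['"])(?:(?!\1).)*\1""")

# Hypothetical input: an apostrophe inside double quotes, and double
# quotes inside single quotes.
text = "He said \"it's fine\" and 'a \"quoted\" word'."

matches = [m.group(0) for m in QUOTED.finditer(text)]
print(matches)
```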
Overlapping matches
This is a (completely different) problem that can usually not be solved at all without lookarounds. If you search for a match globally (or want to regex-replace something globally), you may have noticed that matches can never overlap. I.e. if you search for ... in abcdefghi you get abc, def, ghi and not bcd, cde and so on. This can be a problem if you want to make sure that your match is preceded (or surrounded) by something else.
Say you have a CSV file like
aaa,111,bbb,222,333,ccc
and you want to extract only fields that are entirely numerical. For simplicity, I'll assume that there is no leading or trailing whitespace anywhere. Without lookarounds, we might go with capturing and try:
(?:^|,)(\d+)(?:,|$)
So we make sure that we have the start of a field (start of string or ,), then only digits, and then the end of a field (, or end of string). Between that we capture the digits into group 1. Unfortunately, this will not give us 333 in the above example, because the , that precedes it was already part of the match ,222, - and matches cannot overlap. Lookarounds solve the problem:
(?<=^|,)\d+(?=,|$)
Or if you prefer double negation over alternation, this is equivalent to
(?<![^,])\d+(?![^,])
In addition to being able to get all matches, we get rid of the capturing which can generally improve performance. (Thanks to Adrian Pronk for this example.)
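The difference is easy to see in Python. One caveat, stated as my understanding rather than a guarantee: Python's built-in re module requires fixed-width lookbehinds and rejects (?<=^|,), so this sketch uses the double-negation form, which it does accept.

```python
import re

line = "aaa,111,bbb,222,333,ccc"

# Capturing version: the commas are consumed, so adjacent numeric
# fields cannot both be matched.
CAPTURING = re.compile(r"(?:^|,)(\d+)(?:,|$)")

# Lookaround version: the commas are only inspected, never consumed.
LOOKAROUND = re.compile(r"(?<![^,])\d+(?![^,])")

consumed = CAPTURING.findall(line)
inspected = LOOKAROUND.findall(line)
print(consumed)
print(inspected)
```

The capturing version misses 333 because the comma before it was already used up by the match for 222.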
Multiple independent conditions
Another very classic example of when to use lookarounds (in particular lookaheads) is when we want to check multiple conditions on an input at the same time. Say we want to write a single regex that makes sure our input contains a digit, a lower case letter, an upper case letter, a character that is none of those, and no whitespace (say, for password security). Without lookarounds you'd have to consider all permutations of digit, lower case/upper case letter, and symbol. Like:
\S*\d\S*[a-z]\S*[A-Z]\S*[^0-9a-zA-Z]\S*|\S*\d\S*[A-Z]\S*[a-z]\S*[^0-9a-zA-Z]\S*|...
Those are only two of the 24 necessary permutations. If you also want to ensure a minimum string length in the same regex, you'd have to distribute those in all possible combinations of the \S* - it simply becomes impossible to do in a single regex.
Lookahead to the rescue! We can simply use several lookaheads at the beginning of the string to check all of these conditions:
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)
Because the lookaheads don't actually consume anything, after checking each condition the engine resets to the beginning of the string and can start looking at the next one. If we wanted to add a minimum string length (say 8), we could simply append (?=.{8}). Much simpler, much more readable, much more maintainable.
Important note: This is not the best general approach to check these conditions in any real setting. If you are making the check programmatically, it's usually better to have one regex for each condition, and check them separately - this lets you return a much more useful error message. However, the above is sometimes necessary, if you have some fixed framework that lets you do validation only by supplying a single regex. In addition, it's worth knowing the general technique, if you ever have independent criteria for a string to match.
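To make both approaches concrete, here is a sketch in Python. The single regex is the one from above with the (?=.{8}) length check appended. The CHECKS table, the problems helper, and the error messages are hypothetical names and wording of my own, showing one way the "one regex per condition" advice could look.

```python
import re

# Single-regex version: every condition is a lookahead anchored at the start.
SINGLE = re.compile(
    r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)(?=.{8})"
)

# Per-condition version: (pattern, must_match, message). Names and messages
# are illustrative only.
CHECKS = [
    (re.compile(r"\d"), True, "needs a digit"),
    (re.compile(r"[a-z]"), True, "needs a lower case letter"),
    (re.compile(r"[A-Z]"), True, "needs an upper case letter"),
    (re.compile(r"[^0-9a-zA-Z]"), True, "needs a character that is not a letter or digit"),
    (re.compile(r"\s"), False, "must not contain whitespace"),
    (re.compile(r".{8}"), True, "needs at least 8 characters"),
]

def problems(password: str) -> list[str]:
    """Return a message for every condition the password violates."""
    return [msg for rx, wanted, msg in CHECKS
            if bool(rx.search(password)) != wanted]

for pw in ["Passw0rd!", "password", "Ab1!"]:
    print(pw, bool(SINGLE.search(pw)), problems(pw))
```

The single regex can only say yes or no, while the per-condition version can tell the user exactly what is missing.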
I hope these examples give you a better idea of why people would like to use lookarounds. There are a lot more applications (another classic is inserting commas into numbers), but it's important that you realise that there is a difference between (?!u) and [^u] and that there are cases where negated character classes are not powerful enough at all.
q[^u] will not match "Iraq" because it will look for another symbol.
q(?!u) however, will match "Iraq":
regex = /q[^u]/
/q[^u]/
regex.test("Iraq")
false
regex.test("Iraqf")
true
regex = /q(?!u)/
/q(?!u)/
regex.test("Iraq")
true
Another thing, along with what others mentioned: with a negative lookahead you can negate a sequence of consecutive characters. For example, you can negate ui with (?!ui), whereas with [^...] you cannot negate ui as a unit, only u or i individually, and if you try [^ui]{2}, you will also negate uu, ii and iu.
The whole point is to not "consume" the next character(s), so that it can be e.g. captured by another expression that comes afterwards.
If they're the last expression in the regex, then what you've shown are equivalent.
But e.g. q(?!u)([a-z]) would let the non-u character be part of the next group.
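A short Python sketch of that last point, with made-up inputs: because the lookahead consumes nothing, the character after q is still available for the capture group, whereas the character class has already eaten it.

```python
import re

LOOKAHEAD = re.compile(r"q(?!u)([a-z])")
CLASS = re.compile(r"q[^u]([a-z])")

m = LOOKAHEAD.search("Iraqi")
print(m.group(0), m.group(1))      # the i right after q lands in group 1

print(CLASS.search("Iraqi"))       # [^u] consumes the i, nothing left for the group

m2 = CLASS.search("Iraqis")
print(m2.group(0), m2.group(1))    # here group 1 is the letter after the i
```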

Regular expression for optional commas with a 6 character limit

I need a regex that allows numbers and optional commas, but the entire length cannot be greater than 6.
^[0-9]+([,]*[0-9]+)*$ allows numbers and optional commas.
^([0-9]+([,]*[0-9]+)*){0,6}$ does not limit the total length to 6.
If your regex engine supports lookahead assertions — most do — then you can write:
^(?=[0-9,]{1,6}$)[0-9]+(,[0-9]+)*$
The (?=[0-9,]{1,6}$) part is a "positive lookahead assertion", and means "looking forward from this point in the string, I see [0-9,]{1,6}$". So, in essence, the above regex is a combination of these two:
^[0-9,]{1,6}$
^[0-9]+(,[0-9]+)*$
and enforces them both.
(That said, it's likely to be clearer if you simply enforce the length restriction as a separate step, rather than incorporating the above into a single regex.)
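Both options can be sketched in Python as follows. The function names are hypothetical and the sample inputs are my own; the two regexes are the ones from the answer above.

```python
import re

# Single regex: the lookahead enforces the length, the rest enforces the shape.
COMBINED = re.compile(r"^(?=[0-9,]{1,6}$)[0-9]+(,[0-9]+)*$")
# Shape-only regex for the two-step variant.
SHAPE = re.compile(r"^[0-9]+(,[0-9]+)*$")

def valid_single_regex(s: str) -> bool:
    return COMBINED.search(s) is not None

def valid_two_steps(s: str) -> bool:
    # Length is checked in plain code, the regex only checks the shape.
    return len(s) <= 6 and SHAPE.search(s) is not None

samples = ["1,234", "123456", "12,345", "1234567", "123,456", "1,2,3,", ",123", ""]
for s in samples:
    print(repr(s), valid_single_regex(s), valid_two_steps(s))
```

On these inputs the two functions agree; the two-step version is arguably easier to read and to change later.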
^([\,0-9]{0,6})$
This regex simply allows any of the characters (comma, zero through nine) zero through six times.
If you require that the input start with a number, use this:
^([0-9]{1}[\,0-9]{0,5})$
Some additional ways -
^(?=.{1,6}$)\d+(?:,?\d)*$
^(?=.{1,6}$)\d(?:[,\d]*\d)?$