Regex to check if input (alphanumeric) contains a number smaller than X - regex

This will probably be easy for regex magicians, however I can't seem to figure out a way with my limited knowledge.
I need a regex that would check if an alphanumeric string contains a number smaller than a number (16539065 in my case).
For example the following should be matched:
alpha16000000beta
foo300bar
And the following should not be matched:
foo16539066bar
Help please.
EDIT: I'm aware that it's inefficient, however I'm doing it in a cPanel Account Level filter, which only accepts regex. Unless I figure out a way for it to trigger a script instead, this would definitely need to be done with regex. :(

Your best option for this kind of operation is to use a capture group to get the number and then use whatever language you are using to do the comparison. If you absolutely have to use a regex to do this, it will be extremely inefficient. To do so, you will need to combine a lot of similar expressions:
\d{1,7} will find any numbers with 1 to 7 digits, which will always be less than 16539065
1653906[0-4] will catch the highest accepted values (16539060 to 16539064)
165390[0-5]\d will catch the next range of acceptable values
1653[0-8]\d{3} will continue down the acceptable range (the digit after 16539 is 0 in the limit, so that position contributes nothing)
Repeat the above until you reach 1[0-5]\d{6}
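Once you have all of those expressions, they can be combined using the 'or' operator (|). The combined pattern also must not be allowed to match just part of a longer run of digits, otherwise \d{1,7} will happily match the first seven digits of 16539066; if the filter's regex flavor supports lookarounds, wrapping the alternation in (?<!\d) and (?!\d) is one way to do that. Keep in mind that using regular expressions in this manner is considered bad practice and creates hard-to-read code.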
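If you do go the pure-regex route, the alternatives don't have to be typed by hand. Here is a rough Python sketch that derives them from the limit. The function name less_than_pattern is made up for illustration, and it assumes the target flavor supports lookbehind and lookahead, which may or may not be true for the cPanel filter.

```python
import re

def less_than_pattern(limit: int) -> str:
    """Return a regex matching a run of digits whose value is below `limit`."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    digits = str(limit)
    n = len(digits)
    parts = []
    if n > 1:
        # Anything with fewer digits than the limit is automatically smaller.
        parts.append(r"\d{1,%d}" % (n - 1))
    for i, ch in enumerate(digits):
        top = int(ch) - 1                      # largest digit allowed at position i
        low = 1 if (i == 0 and n > 1) else 0   # leading zeros are consumed by 0* below
        if top < low:
            continue                           # no smaller digit exists at this position
        digit_class = str(low) if low == top else "[%d-%d]" % (low, top)
        rest = r"\d{%d}" % (n - i - 1) if i < n - 1 else ""
        # Same prefix as the limit, one smaller digit, then anything.
        parts.append(digits[:i] + digit_class + rest)
    # The lookarounds stop the pattern from matching part of a longer number.
    return r"(?<!\d)0*(?:%s)(?!\d)" % "|".join(parts)

pattern = re.compile(less_than_pattern(16539065))
for text in ["alpha16000000beta", "foo300bar", "foo16539066bar"]:
    print(text, bool(pattern.search(text)))
```

With the three inputs from the question, the first two are reported as matching and the last one is not. The generated string can then be pasted into whatever tool only accepts a regex.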

Bad Karma might kill me, but here is a working solution for your cases (letters then numbers then letters). It will not work for e.g. ab12cd34de.
There is not really a way to shorten this; it has to be done the long way. I'm using a negative lookahead to check that the number is not greater than or equal to 16539065.
^\D*(?!0*(?:[1-9]\d{8}|[2-9]\d{7}|1[7-9]\d{6}|16[6-9]\d{5}|165[4-9]\d{4}|16539[1-9]\d{2}|165390[7-9]\d|1653906[5-9]))\d+\D*$
It checks for the general format ^\D*\d+\D*$ and then breaks 16539065 down into its parts.
Here's a little demo to play around: https://regex101.com/r/aV6yQ9/1
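For anyone who wants to poke at the lookahead approach outside regex101, here is a small Python sketch. It is only a sanity check, not a definitive implementation: the names TOO_BIG, PATTERN and is_small are invented here, and the first two alternatives are written as [1-9]\d{8} and [2-9]\d{7} on the assumption that every value from 16539065 upwards should be rejected while leading zeros are tolerated.

```python
import re

# Assumption: digit runs whose value is >= 16539065, optionally preceded by zeros.
TOO_BIG = (
    r"0*(?:[1-9]\d{8}|[2-9]\d{7}|1[7-9]\d{6}|16[6-9]\d{5}"
    r"|165[4-9]\d{4}|16539[1-9]\d{2}|165390[7-9]\d|1653906[5-9])"
)

# letters, then one number that is not "too big", then letters
PATTERN = re.compile(r"^\D*(?!" + TOO_BIG + r")\d+\D*$")

def is_small(text: str) -> bool:
    """True if text is non-digits, one number below 16539065, non-digits."""
    return PATTERN.search(text) is not None

for sample in ["alpha16000000beta", "foo300bar", "foo16539066bar", "ab12cd34de"]:
    print(sample, is_small(sample))
```

The last sample has two separate numbers, so it is rejected by the overall ^\D*\d+\D*$ shape rather than by the lookahead.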

Related

Get value from a column with regex

I have lines of text similar to this:
value1|value2|value3 etc.
The values themselves are not interesting. The type of delimiter is also unimportant, just like the number of fields. There could be 100 columns, there could be only 5.
I would like to know what is the usual way to write a regexp which will put any given column's value into a capture group.
For example if I would like to get the content of the third field:
[^\|]+?\|[^\|]+?\|(?<capture_group>[^\|]+?)\|
Maybe a little bit nicer version:
(?:[^\|]+?\|){2}(?<capture_group>[^\|]+?)\|
But this could be the 7th, the 100th, the 1000th, it doesn't matter.
My problem is, that after a while I run into catastrophic backtracking or simply extremely low running times.
What is the usual way to solve a problem like this?
Edit:
For further clarification: this is a use case where further string operations are simply not permitted. Workarounds are not possible. I would like to know if there's a way simply based on regex or not.
As you stated:
My problem is, that after a while I run into catastrophic backtracking
or simply extremely low running times.
What is the usual way to solve a problem like this?
IMHO, you should prefer string operations when the string has a predefined structure (in your case the | character is used as a separator), because string operations are faster than a regex, which is designed to find patterns. A regex becomes necessary when, for example, the separators may vary and have to be identified before you can split on them.
e.g.,
value1|value2;value3-value4
For your case, you can simply split the string on the separator character and access the respective index of the resulting array.
EDIT:
If Regex is your only option then try using this regex:
^((.+?)\|){200}
Here 200 is the index of the element I wish to access; this seems a bit less time-consuming than yours.
For example if I would like to get the content of the third field:
[^\|]+?\|[^\|]+?\|(?<capture_group>[^\|]+?)\|
Maybe a little bit nicer version:
(?:[^\|]+?\|){2}(?<capture_group>[^\|]+?)\|
But this could be the 7th, the 100th, the 1000th, it doesn't matter.
In terms of "steps", using capture groups will cost more steps.
However, using capture groups will allow you to condense your pattern and use a curly bracketed quantifier.
In your first pattern above, you can get away with "greedy" negated character classes (remove the ?) because they will halt at the next |, and you don't need to escape the pipe inside the square brackets.
When you want to access a "much later" positioned substring in the input string, not using a quantifier is going to require a horrifically long pattern and make it very, very difficult to comprehend the exact point that will be matched. In these cases, it would be pretty silly not to use a capture group and a quantifier.
I agree with Toto's comment; accessing an array of split results is going to be a very sensible solution if possible.
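To make the "greedy negated character class plus quantifier" advice concrete, here is a minimal Python sketch. The helper name field_pattern is hypothetical, the group is written with Python's (?P<field>...) syntax, and [^|]* is used instead of [^|]+ on the assumption that empty fields are allowed. Because [^|] can never cross a pipe and the pattern is anchored at the start, there is essentially nothing for the engine to backtrack into.

```python
import re

def field_pattern(index: int) -> "re.Pattern[str]":
    """Build a pattern capturing the index-th (1-based) pipe-separated field."""
    if index < 1:
        raise ValueError("index is 1-based")
    # Skip index-1 complete fields, then capture the next one greedily.
    return re.compile(r"^(?:[^|]*\|){%d}(?P<field>[^|]*)" % (index - 1))

line = "|".join("value%d" % i for i in range(1, 2001))
match = field_pattern(1000).match(line)
print(match.group("field") if match else None)
```

If the line has fewer fields than requested, match is simply None.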

Most succinct regular expression for integers from 0-100 inclusive

I'm trying to find the most succinct way to write a regular expression for integers from 0-100 inclusive. This is what I have so far, is there a better form?
^[0-9][0-9]?$|^100$
Regex is a very powerful tool for certain tasks, but it can quickly get out of hand when applied to things it's not designed for. It's hard to say without knowing why you need this particular regex, but in most cases I would prefer capturing the number you want and then using your programming language to evaluate whether the captured value is in the desired range. This just seems like a case where regex is going to needlessly complicate your code.
That said, if you're committed to using a regex and don't want leading zeros, you probably want ^[1-9]?\d$|^100$.
I'd recommend against doing this, but to answer your question...I'd argue that this regular expression is the most succinct/pure version:
^(?:100|[1-9]?[0-9])$
Notes
I used a (non-capturing) group so the ^ and $ anchors only are used once.
I put 100 first in the alternation since 99% (arbitrary estimation) of the time it will be more efficient...200 will fail right away rather than matching 20 and then failing.
I elected not to use \d, since it isn't the same as [0-9]: in some regex flavors \d also matches non-ASCII Unicode digits.
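A quick way to convince yourself that the pattern accepts exactly 0 to 100 is to brute-force it. The sketch below does that in Python (the variable names are just for illustration) and also shows the \d versus [0-9] difference as it behaves in Python 3, where \d on str patterns matches any Unicode decimal digit.

```python
import re

ZERO_TO_100 = re.compile(r"^(?:100|[1-9]?[0-9])$")

# Every integer in a generous range that the pattern accepts.
accepted = [n for n in range(-5, 1000) if ZERO_TO_100.search(str(n))]
print(accepted == list(range(0, 101)))

# U+0663 is ARABIC-INDIC DIGIT THREE: \d accepts it, [0-9] does not.
arabic_three = "\u0663"
print(bool(re.fullmatch(r"\d", arabic_three)), bool(re.fullmatch(r"[0-9]", arabic_three)))
```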
Handling every case, like 001 or 00001, makes it more complex; this is the best I can think of. You can of course use \d to make it look shorter.
^0*\(100\|[0-9]\?[0-9]\)$

Is there a simple regex to compare numbers to x?

I want a regex that will match if a number is greater than or equal to an arbitrary number. This seems monstrously complex for such a simple task... it seems like you need to reinvent 'counting' in an explicit regex hand-crafted for the x.
For example, intuitively to do this for numbers greater than 25, I get
(\d{3,}|[3-9]\d|2[6-9]\d)
What if the number was 512345? Is there a simpler way?
It seems that there is no simpler way. Regex is not really meant for comparing numbers.
You may try this one:
[1-9]\d{6,}|
[6-9]\d{5}|
5[2-9]\d{4}|
51[3-9]\d{3}|
512[4-9]\d{2}|
5123[5-9]\d|
51234[6-9]
(newlines for clarity)
What if the number was 512345? Is there a simpler way?
No, a regex to match a number in a certain range will be a horrible-looking thing (especially for large number ranges).
Regex is simply not meant for such tasks. The better solution would be to "freely" match the digits, like \d+, and then compare them with the language's relational operators (<, >, ...).
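As an illustration of that "match freely, then compare" approach, here is a minimal Python sketch. The function name numbers_at_least is made up, and it assumes plain non-negative decimal integers embedded in text.

```python
import re

def numbers_at_least(text: str, threshold: int) -> list:
    """Return every integer found in text that is >= threshold."""
    # \d+ grabs each run of digits; int() and >= do the actual comparison.
    return [int(run) for run in re.findall(r"\d+", text) if int(run) >= threshold]

print(numbers_at_least("a 7 b 512345 c 600000 d 0512346", 512345))
```

Changing the threshold is then a one-argument change rather than a rewrite of the pattern.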
In Perl you can use the conditional regexp construct (?(condition)yes-pattern) where the (condition) is (?{CODE}) to run arbitrary Perl code. If you make the yes-pattern be (*FAIL) then you have a regexp fragment which succeeds only when CODE returns false. Thus:
use feature 'say';

foreach (0 .. 50) {
    if (/\A(\d+)(?(?{$1 <= 25})(*FAIL))\z/) {
        say "$_ matches";
    }
    else {
        say "$_ does not match";
    }
}
The code-evaluation feature used to be marked as experimental but the latest 'perlre' manual page (http://perldoc.perl.org/perlre.html) seems to now imply it is a core language feature.
Technically, what you have is no longer a 'regular expression' of course, but some hybrid of regexp and external code.
I've never heard of a regex flavor that can do that. Writing a Perl module to generate the appropriate regex (as you mentioned in your comment) sounds like a good idea to me. In fact, I'd be surprised if it hasn't been done already. Check CPAN first.
By the way, your regex contains a few more errors besides the excess pipes Yuriy pointed out.
First, the "three or more digits" portion will match invalid numbers like 024 and 00000007. You can solve that by requiring the first digit to be greater than zero. If you want to allow for leading zeroes, you can match them separately.
The third part, 2[6-9]\d, only matches numbers >= 260. Perhaps you meant to make the third digit optional (i.e. 2[6-9]\d?), but that would be redundant.
You should anchor the regex somehow to make sure you aren't matching part of a longer number or a "word" with digits in it. I don't know the best way to do that in your particular situation, but word boundaries (i.e. \b) will probably be all you need.
End result:
\b0*([1-9]\d{2,}|[3-9]\d|2[6-9])\b
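Hand-built range patterns are easy to get subtly wrong, so it is worth brute-forcing them against the real comparison. Here is a small Python sketch of such a check for the end result above; the names GREATER_THAN_25 and mismatches are just for illustration.

```python
import re

GREATER_THAN_25 = re.compile(r"\b0*([1-9]\d{2,}|[3-9]\d|2[6-9])\b")

# Compare the regex verdict with the arithmetic one for a range of integers.
mismatches = [
    n for n in range(0, 2000)
    if (GREATER_THAN_25.fullmatch(str(n)) is not None) != (n > 25)
]
print(mismatches)
```

An empty list means the pattern and the comparison n > 25 agree on every value tested.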

Is it feasible to write a regex that can validate simple math?

I’m using a commercial application that has an option to use RegEx to validate field formatting. Normally this works quite well. However, today I’m faced with validating the following strings: quoted alphanumeric codes with simple arithmetic operators (+-/*). Apparently the issue is sometimes users add additional spaces (e.g. “ FLR01” instead of “FLR01”) or have other typos such as mismatched parenthesis that cause issues with downstream processing.
The first examples all had 5 codes being added:
"FLR01"+"FLR02"+"FLR03"+"FMD01"+"FMR05"
So I started going down the road of matching 5 alphanumeric characters enclosed in quotes:
"[0-9a-zA-Z]{5}"[+-*/]
However, the formulas quickly got harder and I don’t know how to get around the following complications:
I need to test for one of the four simple math operators (+-*/) between each code, but not after the last one.
There can be any number of codes being added together, not just five as in the example above.
Enclosed parentheses are okay: ("X"+"Y")/"2"
Mismatched parentheses are not okay.
No formula (e.g. a blank) is okay.
Valid:
"FLR01"+"FLR02"+"FLR03"+"FMD01"+"FMR05"
"0XT"+"1SEAL"+"1XT"+"23LSL"+"23NBL"
("LS400"+"LT400")*"LC430"/("EL414"+"EL414R"+"LC407"+"LC407R"+"LC410"+"LC410R"+"LC420"+"LC420R")
Invalid:
" FLR01" +"FLR02"
"FLR01"J"FLR02"
("FLR01"+"FLR02"
Is this not something you can easily do with RegExp? Based on Jeff’s answer to 230517, I suspect I’m failing at least the ‘matched pairing’ issue. Even a partial solution to the problem (e.g. flagging extra spaces, invalid operators) would likely be better than nothing, even if I can't solve the parenthesis issue. Suggestions welcomed!
Thanks,
Stephen
As you are aware you can't check for matching parentheses with regular expressions. You need something more powerful since regexes have no way of remembering state and counting the nested parentheses.
This is a simple enough syntax that you could hand code a simple parser which counts the parentheses, incrementing and decrementing a counter as it goes. You'd simply have to make sure the counter never goes negative and that it is back at zero when you reach the end.
As for the rest, how about this?
("[0-9a-zA-Z]+"([+\-*/]"[0-9a-zA-Z]+")*)?
You could also use this regular expression to check the parentheses. It wouldn't verify that they're nested properly but it would verify that the open and close parentheses show up in the right places. Add in the counter described above and you'd have a proper validator.
(\(*"[0-9a-zA-Z]+"\)*([+\-*/]\(*"[0-9a-zA-Z]+"\)*)*)?
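Putting the two pieces together might look roughly like this in Python. It is a sketch under the assumption that the application can call out to code at all (the question suggests it may only accept a regex), and the names SHAPE and is_valid_formula are invented for the example.

```python
import re

# Token-order check: quoted codes, operators between them, parentheses in
# plausible places. Nesting is NOT verified by this pattern.
SHAPE = re.compile(r'(\(*"[0-9a-zA-Z]+"\)*([+\-*/]\(*"[0-9a-zA-Z]+"\)*)*)?')

def is_valid_formula(formula: str) -> bool:
    """True if the formula has the right shape and balanced parentheses."""
    if SHAPE.fullmatch(formula) is None:
        return False
    depth = 0
    for ch in formula:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a ")" appeared before its "("
                return False
    return depth == 0          # every "(" must have been closed

for formula in ['"FLR01"+"FLR02"', '" FLR01" +"FLR02"', '("FLR01"+"FLR02"']:
    print(formula, is_valid_formula(formula))
```

The codes themselves are alphanumeric only, so counting parentheses over the whole string is safe here.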
You can easily use regexes to match your tokens (numbers, operators, etc.), but you cannot match balanced parentheses. This isn't too big of a problem though, as you just need to create a state machine that operates on the tokens you match. If you're not familiar with these, think of it as a flow chart within your program where you keep track of where you are, and where you can go. You can also have a look at the Wikipedia page.

Distance between regular expression

Can we compute a sort of distance between regular expressions?
The idea is to measure how similar two regular expressions are.
You can build deterministic finite-state machines for both regular expressions and compare the transitions. The difference of both transitions can then be used to measure the distance of these regular expressions.
There are a few metrics you could use:
The length of a valid match. Some regexes have a fixed size, some an upper limit and some a lower limit. Compare how similar their lengths or possible lengths are.
The characters that match. Any regex will have a set of characters a match can contain (maybe all characters). Compare the set of included characters.
Use a large document and see how many matches each regex makes and how many of those are identical.
Are you looking for strict equivalence?
I suppose you could compute a Levenshtein Distance between the actual Regular Expression strings. That's certainly one way of measuring a "distance" between two different Regular Expression strings.
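For reference, a plain dynamic-programming Levenshtein distance is only a few lines. The Python sketch below is the textbook algorithm, not anything specific to regexes: it treats the two patterns purely as strings, so two very different-looking patterns that match the same language will still come out as far apart.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))     # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                      # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,           # delete ca
                current[j - 1] + 1,        # insert cb
                previous[j - 1] + cost,    # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

print(levenshtein("a[0-9]*", "a[0-7]*"))
```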
Of course, I think it's possible that regular expressions are not required here at all, and computing the Levenshtein Distance of the actual "value" strings that the Regular Expressions would otherwise be applied to, may yield a better result.
If you have two regular expressions and have a set of example inputs you could try matching every input against each regex. For each input:
If they both match or both don't match, score 0.
If one matches and the other doesn't, score 1.
Sum this score over all inputs, and this will give you a 'distance' between the regular expressions. This will give you an idea of how often two regular expressions will differ for typical input. It will be very slow to calculate if your sample input set is large. It won't work at all if both regexes fail to match for almost all random strings and your expected input is entirely random. For example the regex 'sgjlkwren' and the regex 'ueuenwbkaalf' would probably both never match anything if tested on random input, so this metric would say the distance between them is zero. That might or might not be what you want (probably not).
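The scoring scheme above is short enough to sketch in Python. The function name sample_distance is hypothetical, and "matches" is taken here to mean a full match of the whole input, which is an assumption; swap in search() if a partial match should count.

```python
import re
from typing import Iterable

def sample_distance(pattern_a: str, pattern_b: str, samples: Iterable[str]) -> int:
    """Count the sample inputs on which the two patterns disagree."""
    regex_a = re.compile(pattern_a)
    regex_b = re.compile(pattern_b)
    return sum(
        # Score 1 when exactly one of the two patterns matches, otherwise 0.
        (regex_a.fullmatch(text) is None) != (regex_b.fullmatch(text) is None)
        for text in samples
    )

samples = ["a1", "a8", "a9", "b1", "a"]
print(sample_distance(r"a[0-9]*", r"a[0-7]*", samples))
```

Dividing the result by the number of samples gives a value between 0 and 1, which is easier to compare across sample sets of different sizes.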
You might be able to analyze the structure of the regex and use biased random sampling to deliberately hit strings that match more frequently than in completely random input. For example, if both regex require that the string starts with 'foo', you could make sure that your test inputs also always start with foo, to avoid wasting time testing strings that you know will fail for both.
So in conclusion: unless you have a very specific situation with a restricted input set and/or restricted regular expression language, I'd say it's not possible. If you do have some restrictions on your input and on the regular expression, it might be possible. Please specify what these restrictions are and maybe I can come up with something better.
There's an answer hidden in an earlier question here on SO: Generating strings from regexes. You can calculate an (asymmetric) distance measure by generating strings using one regex and checking how many of those match the other regex.
This can be optimized by stripping out shared prefixes/suffixes. E.g. a[0-9]* and a[0-7]* share the a prefix, so you can calculate the distance between [0-9]* and [0-7]* instead.
I think first you need to understand for yourself how you see a "difference" between two expressions. Basically, define a distance metric.
In the general case, it would be quite difficult to define. Depending on what you need to do, you may see allowing one different character in some place as a big difference. In another case, allowing any number of consecutive identical characters may not make much of a difference.
I'd like to emphasize as well that normally when they talk about distance functions, they apply them to..., well, let's call them, tokens. In our case, character sequences. What you are willing to do, is to apply this method not to those tokens, but to the rules a multitude of tokens will match. I'm not quite sure it even makes sense.
Still, I believe we could think of something, but not in general, but for one particular and quite restricted case. Do you have some sort of example to show us?