Is it feasible to write a regex that can validate simple math? - regex

I’m using a commercial application that has an option to use RegEx to validate field formatting. Normally this works quite well. However, today I’m faced with validating the following strings: quoted alphanumeric codes with simple arithmetic operators (+-/*). Apparently the issue is sometimes users add additional spaces (e.g. “ FLR01” instead of “FLR01”) or have other typos such as mismatched parenthesis that cause issues with downstream processing.
The first examples all had 5 codes being added:
"FLR01"+"FLR02"+"FLR03"+"FMD01"+"FMR05"
So I started going down the road of matching 5 alphanumeric characters quoted by strings:
"[0-9a-zA-Z]{5}"[+-*/]
However, the formulas quickly got harder and I don’t know how to get around the following complications:
I need to test for one of the four simple math operators (+-*/) between each code, but not after the last one.
There can be any number of codes being added together, not just five as in the example above.
Enclosed parenthesis are okay (“X”+”Y”)/”2”
Mismatched parenthesis are not okay.
No formula (e.g. a blank) is okay.
Valid:
"FLR01"+"FLR02"+"FLR03"+"FMD01"+"FMR05"
"0XT"+"1SEAL"+"1XT"+"23LSL"+"23NBL"
("LS400"+"LT400")*"LC430"/("EL414"+"EL414R"+"LC407"+"LC407R"+"LC410"+"LC410R"+"LC420"+"LC420R")
Invalid:
" FLR01" +"FLR02"
"FLR01"J"FLR02"
("FLR01"+"FLR02"
Is this not something you can easily do with RegExp? Based on Jeff’s answer to 230517, I suspect I’m failing at least the ‘matched pairing’ issue. Even a partial solution to the problem (e.g. flagging extra spaces, invalid operators) would likely be better than nothing, even if I can't solve the parenthesis issue. Suggestions welcomed!
Thanks,
Stephen

As you are aware you can't check for matching parentheses with regular expressions. You need something more powerful since regexes have no way of remembering state and counting the nested parentheses.
This is a simple enough syntax that you could hand code a simple parser which counts the parentheses, incrementing and decrementing a counter as it goes. You'd simply have to make sure the counter never goes negative.
As for the rest, how about this?
("[0-9a-zA-Z]+"([+\-*/]"[0-9a-zA-Z]+")*)?
You could also use this regular expression to check the parentheses. It wouldn't verify that they're nested properly but it would verify that the open and close parentheses show up in the right places. Add in the counter described above and you'd have a proper validator.
(\(*"[0-9a-zA-Z]+"\)*([+\-*/]\(*"[0-9a-zA-Z]+"\)*)*)?

You can easily use regex's to match your tokens (numbers, operators, etc), but you cannot match balanced parenthesis. This isn't too big of a problem though, as you just need to create a state machine that operates on the tokens you match. If you're not familiar with these, think of it as a flow chart within your program where you keep track of where you are, and where you can go. You can also have a look at the Wikipedia page.

Related

Regex that matches a list of comma separated items in any order

I have three "Clue texts" that say:
SomeClue=someText
AnotherClue=somethingElse
YetAnotherClue=moreText
I need to parse a string and see if it contains exactly these 3 texts, separated by a comma. No Clue Text contains any comma.
The problem is, they can be in any order and they must be the only clues in the string.
Matches:
SomeClue=someText,AnotherClue=somethingElse,YetAnotherClue=moreText
SomeClue=someText,YetAnotherClue=moreText,AnotherClue=somethingElse
AnotherClue=somethingElse,SomeClue=someText,YetAnotherClue=moreText
YetAnotherClue=moreText,SomeClue=someText,AnotherClue=somethingElse
Non-Matches:
SomeClue=someText,AnotherClue=somethingElse,YetAnotherClue=moreText,
SomeClue=someText,YetAnotherClue=moreText,,AnotherClue=somethingElse
,AnotherClue=somethingElse,SomeClue=someText,YetAnotherClue=moreText
YetAnotherClue=moreText,SomeClue=someText,AnotherClue=somethingElse,UselessText
YetAnotherClue=moreText,SomeClue=someText,AnotherClue=somethingElse,AClueThatIDontWant=wrongwrongwrong
Putting togheter what I found on other posts, I have:
(?=.*SomeClue=someText($|,))(?=.*AnotherClue=somethingElse($|,))(?=.*YetAnotherClue=moreText($|,))
This works as far as Clues and their order are concerned.
Unfortunately, I can't find a way to avoid adding a comma and then some stupid text at the end.
My real case has somewhat more complicated Clue Texts, because each of them is a small regex, but I am pretty sure once I know how to handle commas, the rest will be easy.
I think you'd be better off with a stronger tool than regexes (and I genuinely love regular expressions). Regexes aren't good with needing supplementary memory, which is what you have here: you need exactly these 3, but they can come in any order.
In principle, you could write a regex for each of the 6 permutations. But that would never scale. You ought to use something with parsing power.
I suggest writing a verification function in your favorite scripting language, made up of underlying string functions.
In basic Python, you could do (for instance)
ref = set(['SomeClue=someText', 'AnotherClue=somethingElse', 'YetAnotherClue=moreText'])
def ismatch(myline):
splt = myline.split(',')
return ref == set(splt)
You can tweak that as necessary, of course. Note that this nearly-complete solution is not really longer, and much more readable, than any regex would be.

Regex to check if input (alphanumeric) contains a number smaller than X

This will probably be easy for regex magicians, however I can't seem to figure out a way with my limited knowledge.
I need a regex that would check if an alphanumeric string contains a number smaller than a number (16539065 in my case).
For example the following should be matched:
alpha16000000beta
foo300bar
And the following should not be matched:
foo16539066bar
Help please.
EDIT: I'm aware that it's inefficient, however I'm doing it in a cPanel Account Level filter, which only accepts regex. Unless I figure out a way for it to trigger a script instead, this would definitely need to be done with regex. :(
Your best option for this kind of operation is to use a capture group to get the number and then use whatever language you are using to do the comparison. If you absolutely have to use a regex to do this, it will be extremely inefficient. To do so, you will need to combine a lot of similar expressions:
\d{1,7} will find any numbers with 1 to 7 digits, which will always be less than 16539065
1653906[1-4] will catch the absolute maximum values accepted
165390[1-5]\d will catch the next range of acceptable values
1653[1-8]\d{3} will continue on the acceptable range
Repeat the above until you reach 1[1-5]\d{6}
Once you have all of those expressions, they can be combined using the 'or' operator. Keep in mind that using regular expressions in this manner is considered to be bad practice and creates hard to read code.
Bad Karma might kill me, but here is a working solution for your cases (letters then numbers then letters). It will not work for e.g. ab12cd34de.
There is not really a way to shortcode anything, just the long way. I'm using a negative lookahead to check, that the number is not bigger or equal to 16539065.
^\D*(?!0*(?:\d{9}|2\d{7}|1[7-9]\d{6}|16[6-9]\d{5}|165[4-9]\d{4}|16539[1-9]\d{2}|165390[7-9]\d|1653906[5-9]))\d+\D*$
It checks for the general format ^\D*\d+\D*$ and then rolls 16539065 down to it's parts.
Here's a little demo to play around: https://regex101.com/r/aV6yQ9/1

PCRE Regex Syntax

I guess this is more or less a two-part question, but here's the basics first: I am writing some PHP to use preg_match_all to look in a variable for strings book-ended by {}. It then iterates through each string returned, replaces the strings it found with data from a MySQL query.
The first question is this: Any good sites out there to really learn the ins and outs of PCRE expressions? I've done a lot of searching on Google, but the best one I've been able to find so far is http://www.regular-expressions.info/. In my opinion, the information there is not well-organized and since I'd rather not get hung up having to ask for help whenever I need to write a complex regex, please point me at a couple sites (or a couple books!) that will help me not have to bother you folks in the future.
The second question is this: I have this regex
"/{.*(_){1}(.*(_){1}[a-z]{1}|.*)}/"
and I need it to catch instances such as {first_name}, {last_name}, {email}, etc. I have three problems with this regex.
The first is that it sees "{first_name} {last_name}" as one string, when it should see it as two. I've been able to solve this by checking for the existence of the space, then exploding on the space. Messy, but it works.
The second problem is that it includes punctuation as part of the captured string. So, if you have "{first_name} {last_name},", then it returns the comma as part of the string. I've been able to partially solve this by simply using preg_replace to delete periods, commas, and semi-colons. While it works for those punctuation items, my logic is unable to handle exclamation points, question marks, and everything else.
The third problem I have with this regex is that it is not seeing instances of {email} at all.
Now, if you can, are willing, and have time to simply hand me the solution to this problem, thank you as that will solve my immediate problem. However, even if you can do this, please please provide an lmgfty that provides good web sites as references and/or a book or two that would provide a good education on this subject. Sites would be preferable as money is tight, but if a book is the solution, I'll find the money (assuming my local library system is unable to procure said volume).
Back then I found PHP's own PCRE syntax reference quite good: http://uk.php.net/manual/en/reference.pcre.pattern.syntax.php
Let's talk about your expression. It's quite a bit more verbose than necessary; I'm going to simplify it while we go through this.
A rather simpler way of looking at what you're trying to match: "find a {, then any number of letters or underscores, then a }". A regular expression for that is (in PHP's string-y syntax): '/\{[a-z_]+\}/'
This will match all of your examples but also some wilder ones like {__a_b}. If that's not an option, we can go with a somewhat more complex description: "find a {, then a bunch of letters, then (as often as possible) an underscore followed by a bunch of letters, then a }". In a regular expression: /\{([a-z]+(_[a-z]+)*\}/
This second one maybe needs a bit more explanation. Since we want to repeat the thing that matches _foo segments, we need to put it in parentheses. Then we say: try finding this as often as possible, but it's also okay if you don't find it at all (that's the meaning of *).
So now that we have something to compare your attempt to, let's have a look at what caused your problems:
Your expression matches any characters inside the {}, including } and { and a whole bunch of other things. In other words, {abcde{_fgh} would be accepted by your regex, as would {abcde} fg_h {ijkl}.
You've got a mandatory _ in there, right after the first .*. The (_){1} (which means exactly the same as _) says: whatever happens, explode if this ain't here! Clearly you don't actually want that, because it'll never match {email}.
Here's a complete description in plain language of what your regex matches:
Match a {.
Match a _.
Match absolutely anything as long as you can match all the remaining rules right after that anything.
Match a _.
Match a single letter.
Instead of that _ and the single letter, absolutely anything is okay, too.
Match a }.
This is probably pretty far from what you wanted. Don't worry, though. Regular expressions take a while to get used to. I think it's very helpful if you think of it in terms of instructions, i.e. when building a regular expression, try to build it in your head as a "find this, then find that", etc. Then figure out the right syntax to achieve exactly that.
This is hard mainly because not all instructions you might come up with in your head easily translate into a piece of a regular expression... but that's where experience comes in. I promise you that you'll have it down in no time at all... if you are fairly methodical about making your regular expressions at first.
Good luck! :)
For PCRE, I simply digested the PCRE manpages, but then my brain works that way anyway...
As for matching delimited stuff, you generally have 2 approaches:
Match the first delimiter, match anything that is not the closing delimiter, match the closing delimiter.
Match the first delimiter, match anything ungreedily, match the closing delimiter.
E.g. for your case:
\{([^}]+)\}
\{(.+?)\} - Note the ? after the +
I added a group around the content you'd likely want to extract too.
Note also that in the case of #1 in particular but also for #2 if "dot matches anything" is in effect (dotall, singleline or whatever your favourite regex flavour calls it), that they would also match linebreaks within - you'd need to manually exclude that and anything else you don't want if that would be a problem; see the above answer for if you want something more like a whitelist approach.
Here's a good regex site.
Here's a PCRE regex that will work: \{\w+\}
Here's how it works:
It's basically looking for { followed by one ore more word characters followed by }. The interesting part is that the word character class actually includes an underscore as well. \w is essentially shorthand for [A-Za-z0-9_]
So it will basically match any combination of those characters within braces and because of the plus sign will only match braces that are not empty.

Regex for password requirements

I want to require the following:
Is greater than seven characters.
Contains at least two digits.
Contains at least two special (non-alphanumeric) characters.
...and I came up with this to do it:
(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})
Now, I'd also like to make sure that no two sequential characters are the same. I'm having a heck of a time getting that to work though. Here's what I got that works by itself:
(\S)\1+
...but if I try to combine the two together, it fails.
I'm operating within the constraints of the application. It's default requirement is 1 character length, no regex, and no nonstandard characters.
Anyway...
Using this test harness, I would expect y90e5$ to match but y90e5$$ to not.
What an i missing?
This is a bad place for a regex. You're better off using simple validation.
Sometimes we cannot influence specifications and have to write the implementation regardless, i.e., when some ancient backoffice system has to be interfaced through the web but has certain restrictions on input, or just because your boss is asking you to.
EDIT: removed the regex that was based on the original regex of the asker.
altered original code to fit your description, as it didn't seem to really work:
EDIT: the q. was then updated to reflect another version. There are differences which I explain below:
My version: the two or more \W and \d can be repeated by each other, but cannot appear next to each other (this was my incorrect assumption), i fixed it for length>7 which is slightly more efficient to place as a typical "grab all" expression.
^(?!.*((\S)\1|\s))(?=.*(\d.+){2,})(?=.*(\W.+){2,}).{8,}
New version in original question: the two or more \W and the \d are allowed to appear next to each other. This version currently support length>=6, not length>7 as is explained in the text.
The current answer, corrected, should be something like this, which takes the updated q., my comments on length>7 and optimizations, then it looks like: ^(?!.*((\S)\1|\s))(?=(.*\d){2,})(?=(.*\W){2,}).{8,}.
Update: your original code doesn't seem to work, so I changed it a bit
Update: updated answer to reflect changes in question, spaces not allowed anymore
This may not be the most efficient but appears to work.
^(?!.*(\S)\1)(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})
Test strings:
ad2f#we1$ //match valid.
adfwwe12#$ //No Match repeated ww.
y90e5$$ //No Match repeated $$.
y90e5$ //No Match too Short and only 1 \W class value.
One of the comments pointed out that the above regex allows spaces which are typically not used for password fields. While this doesn't appear to be a requirement of the original post, as pointed out a simple change will disallow spaces as well.
^(?!.*(\S)\1|.*\s)(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})
Your regex engine may parse (?!.*(\S)\1|.*\s) differently. Just be aware and adjust accordingly.
All previous test results the same.
Test string with whitespace:
ad2f #we1$ //No match space in string.
If the rule was that passwords had to be two digits followed by three letters or some such, or course a regular expression would work very nicely. But I don't think regexes are really designed for the sort of rule you actually have. Even if you get it to work, it would be pretty cryptic to the poor sucker who has to maintain it later -- possibly you. I think it would be a lot simpler to just write a quick function that loops through the characters and counts how many total and how many of each type. Then at the end check the counts.
Just because you know how to use regexes doesn't mean you have to use them for everything. I have a cool cordless drill but I don't use it to put in nails.

Is stringing together multiple regular expressions with "or" safe?

We have a configuration file that lists a series of regular expressions used to exclude files for a tool we are building (it scans .class files). The developer has appended all of the individual regular expressions into a single one using the OR "|" operator like this:
rx1|rx2|rx3|rx4
My gut reaction is that there will be an expression that will screw this up and give us the wrong answer. He claims no; they are ORed together. I cannot come up with case to break this but still fee uneasy about the implementation.
Is this safe to do?
Not only is it safe, it's likely to yield better performance than separate regex matching.
Take the individual regex patterns and test them. If they work as expected then OR them together and each one will still get matched. Thus, you've increased the coverage using one regex rather than multiple regex patterns that have to be matched individually.
As long as they are valid regexes, it should be safe. Unclosed parentheses, brackets, braces, etc would be a problem. You could try to parse each piece before adding it to the main regex to verify they are complete.
Also, some engines have escapes that can toggle regex flags within the expression (like case sensitivity). I don't have enough experience to say if this carries over into the second part of the OR or not. Being a state machine, I'd think it wouldn't.
It's as safe as anything else in regular expressions!
As far as regexes go , Google code search provides regexes for searches so ... it's possible to have safe regexes
I don't see any possible problem too.
I guess by saying 'Safe' you mean that it will match as you needed (because I've never heard of RegEx security hole). Safe or not, we can't tell from this. You need to give us more detail like what the full regex is. Do you wrap it with group and allow multiple? Do you wrap it with start and end anchor?
If you want to match a few class file name make sure you use start and end anchor to be sure the matching is done from start til end. Like this "^(file1|file2)\.class$". Without start and end anchor, you may end up matching 'my_file1.class too'
The answer is that yes this is safe, and the reason why this is safe is that the '|' has the lowest precedence in regular expressions.
That is:
regexpa|regexpb|regexpc
is equivalent to
(regexpa)|(regexpb)|(regexpc)
with the obvious exception that the second would end up with positional matches whereas the first would not, however the two would match exactly the same input. Or to put it another way, using the Java parlance:
String.matches("regexpa|regexpb|regexpc");
is equivalent to
String.matches("regexpa") | String.matches("regexpb") | String.matches("regexpc");