Help Shortening a Regular Expression with repeated subpatterns

Help Shortening a Regular Expression with repeated subpatterns - regex

I have generated the following regular expression in a project I am working on, and it works fine, but out of professional curiosity I was wondering If it can be "compressed/shortened":
/[(]PRD[)].+;.+;.*;.+;.+;.*;.*;.*;/
Regexes have always seemed like voodoo to me...

For starters, the single-character blocks can just go away:
/\(PRD\).+;.+;.*;.+;.+;.*;.*;.*;/
Next, you can group the related items together:
/\(PRD\)(.+;){2}.*;(.+;){2}(.*;){3}/
This actually makes it textually longer, though.

/\(PRD\)(.+;.+;.*;){2}(.*;){2}/
is shorter than
/\(PRD\)((.+;){2}.*;){2}(.*;){2}/
but arguably less awesome. Both are successfully shorter than
/[(]PRD[)].+;.+;.*;.+;.+;.*;.*;.*;/
though (if only because of the character class change).
Or you could even go with
/\(PRD\)(.+;.+;.*;){2}.*;.*;/
which may be the shortest you can get with the same rules.

/\(PRD\).+;.+;.*;.+;.+;(.*;){3}/
I don't think you will gain much and arrive at the same exact rules. If you didn't care to make all the text between the ";" optional, then you could:
/\(PRD\)(.*;){8}/

Related

Regular Expression for whole world

First of all, I use C# 4.0 to parse the code of a VB6 application.
I have some old VB6 code and about 500+ copies of it. And I use a regular expression to grab all kinds of global variables from the code. The code is described as "Yuck" and some poor victim still has to support this. So I'm hoping to help this poor sucker a bit by generating overviews of specific constants. (And yes, it should be rewritten but it ain't broke, so...)
This is a sample of a code line I need to match, in this case all boolean constants:
Public Const gDemo = False 'Is this a demo version
And this is the regular expression I use at this moment:
Public\s+Const\s+g(?'Name'[a-zA-Z][a-zA-Z0-9]*)\s+=\s+(?'Value'[0-9]*)
And I think it too is yuckie, since the * at the end of the boolean group. But if I don't use it, it will only return 'T' or 'F'. I want the whole word.
Is this the proper RegEx to use as solution or is there an even nicer-looking option?
FYI, I use similar regexs to find all string constants and all numeric constants. Those work just fine. And basically the same .BAS file is used for all 50 copies but with different values for all these variables. By parsing all files, we have a good overview of how every version is configured.
And again, yes, we need to rebuild the whole project from scratch since it becomes harder to maintain these days. But it works and we need the manpower for other tasks. It just needs the occasional tweaks...

You can use: Public\s+Const\s+g(?<Name>[a-zA-Z][a-zA-Z0-9]*)\s+=\s+(?<Value>False|True)
demo

Effeciently match optional characters that must be insequence

Well, I got it working, but somehow it looks slow and inefficient (or maybe not).
What I've got is a sequence of characters, for simplicity sake let's just say it's
123456789
What I want to do is to make sure the input begins the same way, and is in the same sequence, but doesn't need to be the complete sequence.
What I've got is this:
^1(2(3(4(5(6(7(8(9)?)?)?)?)?)?)?)?
This looks pretty horrid, but is there a better way to do this?
Edit Added the ^ that was in the original code and I forgot to include here.

A ? quantifier is is like a spare part. Think of the engine that runs fine without it. It will try to ingore it if possible.
Sure x?x?x?x?x? looks pretty bad. But, its almost meaningless unless used with some context around it.
Asuming your groupings are just to denote options, you could factor out the last inner-group using this 1(2(3(4(5(6(7(89?)?)?)?)?)?)?)?.
Example:
1(2(3(4(5(6(7(8(9)?)?)?)?)?)?)?)? will globally match this
987654321 1111111111111112121211112121121212312111 multiple times.
So, its all relative.

Fixing regex to work around ICU/RegexKitLite bug

I'm using RegexKitLite, which in turn uses ICU as its engine. Despite the documentation, a regex like /x*/ when searching against "xxxxxxxxxxx" will match empty string. It is behaving like /x*?/ should. I would like to route around this bug when it's present, and I'm considering rewriting any unescaped * as + when a regex match returns a 0-length result. My naïve guess is that the regex with +s in placeof *s will always return a subset of the correct results. What are the unexpected consequences of this? Am I going the right way?
FWIW, ICU also offers a *+ operator, but it doesn't work either.
EDIT: I should have been clearer: this is for the search field of an interactive app. I have no control over the regex that the user enters. The broken * support appears to be a bug in ICU. I sure wish I didn't need to include that POS in my code, but it's the only game in town.

If you simply change every * quantifier to a +, the regex will fail to work in those instances where the * should have matched zero occurrences. In other words, the problem will have morphed from always matching zero to never matching zero. If you ask me, it's useless either way.
However, you might be able to handle the zero-occurrences case separately, with a negative lookahead. For example, x* could be rewritten as (?:(?!x)|x+). It's hideous I know, but it's the most self-contained fix I can envision at the moment. You would have to do this for possessive stars as well (*+), but not reluctant stars (*?).
Here it is in table form:
BEFORE AFTER
x* (?:(?!x)|x+)
x*+ (?:(?!x)|x++)
x*? x*? More complex atoms would need to have their own parentheses preserved:
(?:xyz)* (?:(?!(?:xyz))|(?:xyz)+) You could probably drop them inside the lookahead, but they don't hurt anything except readability, and that's a lost cause anyway. :D If the {min,} and {min,max} forms are affected too, they would get the same treatment (with the same modifications for possessive variants):
x{0,} same as x*
x{0,n} (?:(?!x)|x{1,n})
It occurs to me that conditionals--(?(condition)yes-pattern|no-pattern)--would be a perfect fit here; unfortunately, ICU doesn't seem to support them.

I can't say where things may have gone wrong with the code in question, but I can say with confidence that this specific bug is not in the ICU library. (I'm the author of the ICU regular expression package.)
I agree with the sentiment expressed above, the thing to do is not to try to hack around the problem by tweaking the regexp pattern, but to understand what the underlying problem is. There's probably some simple mistake being made that isn't clear from the original question as posed.

Both \* and [*] are literal asterisks, so a naive replacement mightn't work.
In fact, don't do dynamic rewriting, it's too complicated. Try to tweak your regexes statically first.
x* is equivalent to x{0,} and (?:x+)?.

Yeah, use that strategy:
(pseudo code)
if ($str =~ /x*/ && $str =~ /(x+)/) {
print "'$1'\n";
}
But the real problem is the BUG as you say. Why on earth is the basic construct of quantifiers screwed up? This is not a module you should include in your code.

Is it feasible to write a regex that can validate simple math?

I’m using a commercial application that has an option to use RegEx to validate field formatting. Normally this works quite well. However, today I’m faced with validating the following strings: quoted alphanumeric codes with simple arithmetic operators (+-/*). Apparently the issue is sometimes users add additional spaces (e.g. “ FLR01” instead of “FLR01”) or have other typos such as mismatched parenthesis that cause issues with downstream processing.
The first examples all had 5 codes being added:
"FLR01"+"FLR02"+"FLR03"+"FMD01"+"FMR05"
So I started going down the road of matching 5 alphanumeric characters quoted by strings:
"[0-9a-zA-Z]{5}"[+-*/]
However, the formulas quickly got harder and I don’t know how to get around the following complications:
I need to test for one of the four simple math operators (+-*/) between each code, but not after the last one.
There can be any number of codes being added together, not just five as in the example above.
Enclosed parenthesis are okay (“X”+”Y”)/”2”
Mismatched parenthesis are not okay.
No formula (e.g. a blank) is okay.
Valid:
"FLR01"+"FLR02"+"FLR03"+"FMD01"+"FMR05"
"0XT"+"1SEAL"+"1XT"+"23LSL"+"23NBL"
("LS400"+"LT400")*"LC430"/("EL414"+"EL414R"+"LC407"+"LC407R"+"LC410"+"LC410R"+"LC420"+"LC420R")
Invalid:
" FLR01" +"FLR02"
"FLR01"J"FLR02"
("FLR01"+"FLR02"
Is this not something you can easily do with RegExp? Based on Jeff’s answer to 230517, I suspect I’m failing at least the ‘matched pairing’ issue. Even a partial solution to the problem (e.g. flagging extra spaces, invalid operators) would likely be better than nothing, even if I can't solve the parenthesis issue. Suggestions welcomed!
Thanks,
Stephen

As you are aware you can't check for matching parentheses with regular expressions. You need something more powerful since regexes have no way of remembering state and counting the nested parentheses.
This is a simple enough syntax that you could hand code a simple parser which counts the parentheses, incrementing and decrementing a counter as it goes. You'd simply have to make sure the counter never goes negative.
As for the rest, how about this?
("[0-9a-zA-Z]+"([+\-*/]"[0-9a-zA-Z]+")*)?
You could also use this regular expression to check the parentheses. It wouldn't verify that they're nested properly but it would verify that the open and close parentheses show up in the right places. Add in the counter described above and you'd have a proper validator.
(\(*"[0-9a-zA-Z]+"\)*([+\-*/]\(*"[0-9a-zA-Z]+"\)*)*)?

You can easily use regex's to match your tokens (numbers, operators, etc), but you cannot match balanced parenthesis. This isn't too big of a problem though, as you just need to create a state machine that operates on the tokens you match. If you're not familiar with these, think of it as a flow chart within your program where you keep track of where you are, and where you can go. You can also have a look at the Wikipedia page.

Regex for password requirements

I want to require the following:
Is greater than seven characters.
Contains at least two digits.
Contains at least two special (non-alphanumeric) characters.
...and I came up with this to do it:
(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})
Now, I'd also like to make sure that no two sequential characters are the same. I'm having a heck of a time getting that to work though. Here's what I got that works by itself:
(\S)\1+
...but if I try to combine the two together, it fails.
I'm operating within the constraints of the application. It's default requirement is 1 character length, no regex, and no nonstandard characters.
Anyway...
Using this test harness, I would expect y90e5$ to match but y90e5$$ to not.
What an i missing?

This is a bad place for a regex. You're better off using simple validation.

Sometimes we cannot influence specifications and have to write the implementation regardless, i.e., when some ancient backoffice system has to be interfaced through the web but has certain restrictions on input, or just because your boss is asking you to.
EDIT: removed the regex that was based on the original regex of the asker.
altered original code to fit your description, as it didn't seem to really work:
EDIT: the q. was then updated to reflect another version. There are differences which I explain below:
My version: the two or more \W and \d can be repeated by each other, but cannot appear next to each other (this was my incorrect assumption), i fixed it for length>7 which is slightly more efficient to place as a typical "grab all" expression.
^(?!.*((\S)\1|\s))(?=.*(\d.+){2,})(?=.*(\W.+){2,}).{8,}
New version in original question: the two or more \W and the \d are allowed to appear next to each other. This version currently support length>=6, not length>7 as is explained in the text.
The current answer, corrected, should be something like this, which takes the updated q., my comments on length>7 and optimizations, then it looks like: ^(?!.*((\S)\1|\s))(?=(.*\d){2,})(?=(.*\W){2,}).{8,}.
Update: your original code doesn't seem to work, so I changed it a bit
Update: updated answer to reflect changes in question, spaces not allowed anymore

This may not be the most efficient but appears to work.
^(?!.*(\S)\1)(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})
Test strings:
ad2f#we1$ //match valid.
adfwwe12#$ //No Match repeated ww.
y90e5$$ //No Match repeated $$.
y90e5$ //No Match too Short and only 1 \W class value.
One of the comments pointed out that the above regex allows spaces which are typically not used for password fields. While this doesn't appear to be a requirement of the original post, as pointed out a simple change will disallow spaces as well.
^(?!.*(\S)\1|.*\s)(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})
Your regex engine may parse (?!.*(\S)\1|.*\s) differently. Just be aware and adjust accordingly.
All previous test results the same.
Test string with whitespace:
ad2f #we1$ //No match space in string.

If the rule was that passwords had to be two digits followed by three letters or some such, or course a regular expression would work very nicely. But I don't think regexes are really designed for the sort of rule you actually have. Even if you get it to work, it would be pretty cryptic to the poor sucker who has to maintain it later -- possibly you. I think it would be a lot simpler to just write a quick function that loops through the characters and counts how many total and how many of each type. Then at the end check the counts.
Just because you know how to use regexes doesn't mean you have to use them for everything. I have a cool cordless drill but I don't use it to put in nails.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js