Efficiently match optional characters that must be in sequence

Well, I got it working, but somehow it looks slow and inefficient (or maybe not).
What I've got is a sequence of characters; for simplicity's sake let's just say it's
123456789
What I want to do is to make sure the input begins the same way, and is in the same sequence, but doesn't need to be the complete sequence.
What I've got is this:
^1(2(3(4(5(6(7(8(9)?)?)?)?)?)?)?)?
This looks pretty horrid, but is there a better way to do this?
Edit: Added the ^ that was in the original code, which I forgot to include here.

A ? quantifier is like a spare part. Think of an engine that runs fine without it: a greedy ? tries to match the optional item first and simply skips it when it cannot match.
Sure, x?x?x?x?x? looks pretty bad. But it's almost meaningless unless used with some context around it.
Assuming your groupings are just to denote options, you could factor out the last inner group using this: 1(2(3(4(5(6(7(89?)?)?)?)?)?)?)?.
Example:
1(2(3(4(5(6(7(8(9)?)?)?)?)?)?)?)? will globally match this
987654321 1111111111111112121211112121121212312111 multiple times.
So, it's all relative.
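One way to avoid writing the nested groups by hand is to generate them. This is a minimal sketch in Python, assuming the expected sequence is known up front. The names prefix_pattern and matched_prefix are illustrative and not from the original post, and non-capturing groups are used because the groups only denote options.

```python
import re

def prefix_pattern(sequence: str) -> str:
    """Build an anchored pattern such as ^1(?:2(?:3)?)? for "123"."""
    head, *rest = sequence
    pattern = ""
    # Wrap from the innermost (last) character outwards.
    for ch in reversed(rest):
        pattern = f"(?:{re.escape(ch)}{pattern})?"
    return "^" + re.escape(head) + pattern

PATTERN = re.compile(prefix_pattern("123456789"))

def matched_prefix(text: str):
    """Return the leading part of text that follows the sequence, or None."""
    m = PATTERN.match(text)
    return m.group(0) if m else None

print(prefix_pattern("123"))    # ^1(?:2(?:3)?)?
print(matched_prefix("1234x"))  # 1234
print(matched_prefix("2345"))   # None
```

If the whole input must be a prefix of the sequence, a plain string check such as "123456789".startswith(text) may be simpler than any regex.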

Related

Rules of regex engines. Greediness, eagerness and laziness of regexes

As we all know, regex engines use two rules when they go about their work:
Rule 1: The Match That Begins Earliest Wins or regular expressions
are eager.
Rule 2: Regular expressions are greedy.
These lines appear in the tutorial:
The two of these rules go hand in hand.
It's eager to give you a result, so what it does is it tries to just
keep letting that first one do all the work.
While we're already in the middle of it, let's keep going, get to the
end of the string and then when it doesn't work out, then it will
backtrack and try another one.
It doesn't backtrack back to the beginning; it doesn't try all sorts
of other combinations.
It's still eager to get you a result, so it says, what if I just gave
back one?
Would that allow me to give a result back?
If it does, great, it's done. It's able to just finish there.
It doesn't have to keep backtracking further in the string, looking
for some kind of a better match or match that's further along.
I don't quite understand these lines, especially the second sentence ("While we're...") and the last one ("It doesn't have to keep backtracking").
And these lines about lazy mode.
It still defers to the overall match just like the greedy one does
clearly.
I don't understand the following analogy:
It's not necessarily any faster or slower to choose a lazy strategy or
a greedy strategy, but it will probably match different things.
Now as far as is faster or slower, it's a little bit like saying, if
you've lost your car keys and your sunglasses inside your house, is it
better to start looking in the kitchen or to start looking in the
living room?
You don't know which one's going to yield the best result, and you
don't know which one's going to find the sunglasses first or the keys
first; it's just about different strategies of starting the search.
So you will likely get different results depending on where you start,
but it's not necessarily faster to start in one place or the other.
What does 'faster or slower' mean?
I'm going to draw a diagram of how it works (in both cases), so I will keep thinking about these questions until I find out what's going on here.
I need to understand it exactly and unambiguously.
Thanks.

Let's try with an example.
Take the input this is input for test input on regex and a regex like /this.*input/
The match will be this is input for test input
What happens is:
The engine starts examining the string and finds this at the very beginning.
Now it is in the middle of the string, and the greedy .* keeps consuming characters to see how much more it can take (this is the "While we're already in the middle of it, let's keep going" part).
It runs past this is input and this is input for test input and continues to the end of the string.
At the end, the trailing text (on regex) cannot be followed by input, so the engine backtracks, giving characters back one at a time, until input matches again. That happens at the last input, and it stops there.
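The walkthrough above can be checked with a short Python sketch. The lazy variant this.*?input is added here only for contrast and is not part of the original answer.

```python
import re

text = "this is input for test input on regex"

# Greedy: .* runs to the end of the string, then backtracks to the last "input".
greedy = re.search(r"this.*input", text).group(0)

# Lazy: .*? grows one character at a time and stops at the first "input".
lazy = re.search(r"this.*?input", text).group(0)

print(greedy)  # this is input for test input
print(lazy)    # this is input
```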
For the last part, it's more about alternation (ORed regexes).
Consider the input string cdacdgabcdef and the regex (ab|a).*
A common mistake is to think it will return the more precise match (in this case 'abcdef'), but it will return 'acdgabcdef' because the match that begins earliest wins: at the first a, the ab alternative fails but the a alternative succeeds.
What happens here is: something matches this part, so let's continue to the next part of the pattern and forget about the other options in this part.
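A small Python sketch of the alternation example, assuming Python's re module behaves like the backtracking engine described here:

```python
import re

m = re.search(r"(ab|a).*", "cdacdgabcdef")

print(m.start())   # 2, the first "a" in the input
print(m.group(1))  # a, because "ab" fails at index 2 and "a" succeeds
print(m.group(0))  # acdgabcdef
```

The later ab at index 6 is never tried as a starting point, because a match was already found at index 2.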
For the lazy and greedy questions, the link from @AvinashRaj is clear enough; I won't repeat it here.

Regex match up to some character

Conditions updated
There is often a situation where you want to extract a substring up to (immediately before) certain characters. For example, suppose you have a text that:
Does not start with a semicolon or a period,
Contains several sentences,
Does not contain any "\n", and
Ends with a period,
and you want to extract the sequence from the start up to the closest semicolon or period. Two strategies come to mind:
/[^;.]*/
/.*?[;.]/
I do either of these quite randomly, with slight preference to the second strategy, and also see both ways in other people's code. Which is the better way? Is there a clear reason to prefer one over the other, or are there better ways? I personally feel, efficiency aside, that negating something (as with [^]) is conceptually more complex than not doing it. But efficiency may also be a good reason to choose one over the other.

I came up with my answer. The two regexes in my question were actually not expressing the same thing. And the better approach depends on what you want.
If you want a match up to and including a certain character, then using
/.*?[;.]/
is simpler.
If you want a match up to right before (excluding) a certain character, then you should use:
/[^;.]*/
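The difference between the two patterns is easy to see on a sample string. This is a Python sketch; the sample text is made up for illustration.

```python
import re

text = "First clause; second clause. The end."

# Negated class: stops right before the first ';' or '.'.
up_to_excluding = re.match(r"[^;.]*", text).group(0)

# Lazy dot plus class: stops right after the first ';' or '.'.
up_to_including = re.match(r".*?[;.]", text).group(0)

print(repr(up_to_excluding))  # 'First clause'
print(repr(up_to_including))  # 'First clause;'
```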

Well, the first way is probably more efficient, not that it's likely to matter. By the way, \z in a character class does not mean "end of input"--in fact, it's a syntax error in every flavor I know of. /[^;.]*/ is all you need anyway.

I personally prefer the first one because it does exactly as you would expect. Get all characters except ...
But it's mostly a matter of preference. There are nearly always multiple ways to write a regular expression and it's mostly style that matters.
For example... do you prefer [0-9], [:digit:] or \d? They all do exactly* the same thing.
* In the case of Unicode, the [:digit:] and \d classes match some other characters too.

You left out one other strategy: string split?
"my sentence; blahblah".split(/[;.]/,2)[0]
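For comparison, a rough Python equivalent of the split approach, assuming the same input string. first_part is an illustrative name.

```python
import re

sentence = "my sentence; blahblah"

# Split once on the first ';' or '.', then keep the part before it.
first_part = re.split(r"[;.]", sentence, maxsplit=1)[0]

print(first_part)  # my sentence
```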

I think that it is mostly a matter of opinion as to which regular expression you use. On the note of efficiency, though, I think that adding \A to the beginning of a regular expression in this case would make the process faster because well-designed regular expression engines should only try to match once in that case. For example:
/\A[^.;]*/m
Note the m option; it indicates that newline characters can also be matched. This is just a technicality I would add for generic examples, but may not apply to you.
Although adding more to the solution might be viewed as increasing complexity, it can also serve to clarify meaning.

Regex for password requirements

I want to require the following:
Is greater than seven characters.
Contains at least two digits.
Contains at least two special (non-alphanumeric) characters.
...and I came up with this to do it:
(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})
Now, I'd also like to make sure that no two sequential characters are the same. I'm having a heck of a time getting that to work though. Here's what I got that works by itself:
(\S)\1+
...but if I try to combine the two together, it fails.
I'm operating within the constraints of the application. Its default requirement is 1 character length, no regex, and no nonstandard characters.
Anyway...
Using this test harness, I would expect y90e5$ to match but y90e5$$ to not.
What am I missing?

This is a bad place for a regex. You're better off using simple validation.

Sometimes we cannot influence specifications and have to write the implementation regardless, e.g., when some ancient backoffice system has to be interfaced through the web but has certain restrictions on input, or just because your boss is asking you to.
EDIT: removed the regex that was based on the original regex of the asker.
altered original code to fit your description, as it didn't seem to really work:
EDIT: the q. was then updated to reflect another version. There are differences which I explain below:
My version: the two or more \W and \d can be repeated by each other, but cannot appear next to each other (this was my incorrect assumption). I fixed it for length>7, which is slightly more efficient to place as a typical "grab all" expression.
^(?!.*((\S)\2|\s))(?=.*(\d.+){2,})(?=.*(\W.+){2,}).{8,}
New version in original question: the two or more \W and the \d are allowed to appear next to each other. This version currently supports length>=6, not length>7 as is explained in the text.
The current answer, corrected, should be something like this, which takes the updated q. and my comments on length>7 and optimizations: ^(?!.*((\S)\2|\s))(?=(.*\d){2,})(?=(.*\W){2,}).{8,}.
Update: your original code doesn't seem to work, so I changed it a bit
Update: updated answer to reflect changes in question, spaces not allowed anymore

This may not be the most efficient but appears to work.
^(?!.*(\S)\1)(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})
Test strings:
ad2f#we1$ //match valid.
adfwwe12#$ //No Match repeated ww.
y90e5$$ //No Match repeated $$.
y90e5$ //No Match too Short and only 1 \W class value.
One of the comments pointed out that the above regex allows spaces which are typically not used for password fields. While this doesn't appear to be a requirement of the original post, as pointed out a simple change will disallow spaces as well.
^(?!.*(\S)\1|.*\s)(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})
Your regex engine may parse (?!.*(\S)\1|.*\s) differently. Just be aware and adjust accordingly.
All previous test results the same.
Test string with whitespace:
ad2f #we1$ //No match space in string.
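The test strings above can be run against the whitespace-rejecting pattern with a short Python sketch. It assumes Python's re module treats the lookaheads the same way as the answer's engine, and check is an illustrative helper name.

```python
import re

# The pattern from the answer, unchanged.
PASSWORD_RE = re.compile(r"^(?!.*(\S)\1|.*\s)(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})")

def check(candidate: str) -> bool:
    """True when the candidate satisfies every lookahead."""
    return PASSWORD_RE.match(candidate) is not None

for candidate in ["ad2f#we1$", "adfwwe12#$", "y90e5$$", "y90e5$", "ad2f #we1$"]:
    print(candidate, check(candidate))
```

Only the first string passes; the others fail for the reasons listed next to each test string.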

If the rule was that passwords had to be two digits followed by three letters or some such, of course a regular expression would work very nicely. But I don't think regexes are really designed for the sort of rule you actually have. Even if you get it to work, it would be pretty cryptic to the poor sucker who has to maintain it later -- possibly you. I think it would be a lot simpler to just write a quick function that loops through the characters and counts how many total and how many of each type. Then at the end check the counts.
Just because you know how to use regexes doesn't mean you have to use them for everything. I have a cool cordless drill but I don't use it to put in nails.
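The counting approach described above could look roughly like this in Python. is_valid_password is a hypothetical name. "Special" is assumed to mean any character for which str.isalnum() is false, so unlike \W it also counts the underscore.

```python
def is_valid_password(password: str) -> bool:
    """Check length, digit count, special count, and adjacent repeats."""
    digits = 0
    specials = 0
    previous = None
    for ch in password:
        if ch == previous:
            return False  # two identical characters in a row
        if ch.isdigit():
            digits += 1
        elif not ch.isalnum():
            specials += 1  # assumption: non-alphanumeric means "special"
        previous = ch
    # "Greater than seven characters" is taken literally as len > 7.
    return len(password) > 7 and digits >= 2 and specials >= 2

print(is_valid_password("ad2f#we1$"))  # True
print(is_valid_password("y90e5$$"))    # False
```

Each rule is a separate line, so changing a threshold does not mean rewriting a lookahead.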

Is stringing together multiple regular expressions with "or" safe?

We have a configuration file that lists a series of regular expressions used to exclude files for a tool we are building (it scans .class files). The developer has appended all of the individual regular expressions into a single one using the OR "|" operator like this:
rx1|rx2|rx3|rx4
My gut reaction is that there will be an expression that will screw this up and give us the wrong answer. He claims no; they are ORed together. I cannot come up with a case to break this but still feel uneasy about the implementation.
Is this safe to do?

Not only is it safe, it's likely to yield better performance than separate regex matching.
Take the individual regex patterns and test them. If they work as expected then OR them together and each one will still get matched. Thus, you've increased the coverage using one regex rather than multiple regex patterns that have to be matched individually.

As long as they are valid regexes, it should be safe. Unclosed parentheses, brackets, braces, etc would be a problem. You could try to parse each piece before adding it to the main regex to verify they are complete.
Also, some engines have escapes that can toggle regex flags within the expression (like case sensitivity). I don't have enough experience to say if this carries over into the second part of the OR or not. Being a state machine, I'd think it wouldn't.
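The idea of parsing each piece before adding it to the main regex might be sketched like this in Python. combine_patterns and the sample exclusion patterns are hypothetical. Wrapping each piece in a non-capturing group is an extra precaution added here, not something the answer asks for.

```python
import re

def combine_patterns(patterns):
    """Validate every pattern on its own, then OR them together."""
    for pattern in patterns:
        # Raises re.error for an incomplete piece, e.g. an unclosed parenthesis.
        re.compile(pattern)
    return re.compile("|".join(f"(?:{p})" for p in patterns))

# Hypothetical exclusion list for .class files.
excludes = combine_patterns([r"Test\.class$", r"^generated/"])

print(bool(excludes.search("FooTest.class")))        # True
print(bool(excludes.search("generated/Bar.class")))  # True
print(bool(excludes.search("Foo.class")))            # False
```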

It's as safe as anything else in regular expressions!

As far as regexes go, Google Code Search provides regexes for searches, so it's possible to have safe regexes.

I don't see any possible problem either.
I guess by 'safe' you mean that it will match as you need (because I've never heard of a regex security hole). Safe or not, we can't tell from this. You need to give us more detail, like what the full regex is. Do you wrap it in a group and allow multiple? Do you wrap it with start and end anchors?
If you want to match a few class file names, make sure you use start and end anchors so the matching is done from start to end, like this: "^(file1|file2)\.class$". Without start and end anchors, you may end up matching 'my_file1.class' too.
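The effect of the anchors can be seen in a short Python sketch; the file names are made up for illustration.

```python
import re

unanchored = re.compile(r"(file1|file2)\.class")
anchored = re.compile(r"^(file1|file2)\.class$")

for name in ["file1.class", "my_file1.class", "file2.class.bak"]:
    # Columns: name, unanchored result, anchored result.
    print(name, bool(unanchored.search(name)), bool(anchored.search(name)))
```

The unanchored pattern finds a match in all three names, while the anchored one accepts only file1.class.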

The answer is that yes this is safe, and the reason why this is safe is that the '|' has the lowest precedence in regular expressions.
That is:
regexpa|regexpb|regexpc
is equivalent to
(regexpa)|(regexpb)|(regexpc)
with the obvious exception that the second would end up with positional matches whereas the first would not, however the two would match exactly the same input. Or to put it another way, using the Java parlance:
String.matches("regexpa|regexpb|regexpc");
is equivalent to
String.matches("regexpa") | String.matches("regexpb") | String.matches("regexpc");
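The same equivalence can be sketched in Python, where re.fullmatch plays the role of Java's whole-string matches. The function names are illustrative. This assumes the individual patterns do not use numbered backreferences, since group numbers shift once patterns with groups are joined.

```python
import re

patterns = ["regexpa", "regexpb", "regexpc"]
combined = re.compile("|".join(patterns))

def matches_combined(text: str) -> bool:
    """Whole-string match against the single ORed pattern."""
    return combined.fullmatch(text) is not None

def matches_any(text: str) -> bool:
    """Whole-string match against each pattern separately."""
    return any(re.fullmatch(p, text) for p in patterns)

for text in ["regexpa", "regexpc", "regexpd", "regexpab"]:
    print(text, matches_combined(text), matches_any(text))
```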

Help Shortening a Regular Expression with repeated subpatterns

I have generated the following regular expression in a project I am working on, and it works fine, but out of professional curiosity I was wondering If it can be "compressed/shortened":
/[(]PRD[)].+;.+;.*;.+;.+;.*;.*;.*;/
Regexes have always seemed like voodoo to me...

For starters, the single-character classes can be replaced with escaped characters:
/\(PRD\).+;.+;.*;.+;.+;.*;.*;.*;/
Next, you can group the related items together:
/\(PRD\)(.+;){2}.*;(.+;){2}(.*;){3}/
This actually makes it textually longer, though.

/\(PRD\)(.+;.+;.*;){2}(.*;){2}/
is shorter than
/\(PRD\)((.+;){2}.*;){2}(.*;){2}/
but arguably less awesome. Both are successfully shorter than
/[(]PRD[)].+;.+;.*;.+;.+;.*;.*;.*;/
though (if only because of the character class change).
Or you could even go with
/\(PRD\)(.+;.+;.*;){2}.*;.*;/
which may be the shortest you can get with the same rules.

/\(PRD\).+;.+;.*;.+;.+;(.*;){3}/
I don't think you will gain much while keeping exactly the same rules. If you don't mind making all the text between the ";" optional, then you could use:
/\(PRD\)(.*;){8}/
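A quick way to gain confidence that the shortened forms keep the same rules is to run them all against a few samples. This is a Python sketch with made-up sample strings; the helper names are illustrative.

```python
import re

# The original pattern and the rewrites proposed in the answers.
STRICT_VARIANTS = [
    r"[(]PRD[)].+;.+;.*;.+;.+;.*;.*;.*;",
    r"\(PRD\).+;.+;.*;.+;.+;.*;.*;.*;",
    r"\(PRD\)(.+;){2}.*;(.+;){2}(.*;){3}",
    r"\(PRD\)(.+;.+;.*;){2}(.*;){2}",
    r"\(PRD\)((.+;){2}.*;){2}(.*;){2}",
    r"\(PRD\)(.+;.+;.*;){2}.*;.*;",
    r"\(PRD\).+;.+;.*;.+;.+;(.*;){3}",
]
# The looser form that makes every field optional.
LOOSE = r"\(PRD\)(.*;){8}"

def strict_results(sample: str):
    """One boolean per strict variant for the given sample."""
    return [bool(re.search(p, sample)) for p in STRICT_VARIANTS]

def loose_result(sample: str) -> bool:
    return bool(re.search(LOOSE, sample))

for sample in ["(PRD)a;b;;c;d;;;;", "(PRD)a;b;c;", "(PRD);;;;;;;;"]:
    print(sample, strict_results(sample), loose_result(sample))
```

On the last sample every field is empty, so only the loose pattern accepts it; that is where the rules differ.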
