What does ?: do in regex?

I have a regex that looks like this
/^(?:\w+\s)*(\w+)$*/
What is the ?:?

It indicates that the subpattern is a non-capturing subpattern. That means whatever is matched by (?:\w+\s), even though it is enclosed in (), won't appear in the list of captured groups; only (\w+) will.
You're still looking for a specific pattern (in this case, at least one word character followed by a single whitespace character), but you don't need to keep what it actually matched.

It means: group only, but do not remember the grouped part.
By default, ( ) tells the regex engine to remember the part of the string that matches the pattern inside it. But at times we just want to group a pattern without triggering the regex memory; to do that, we use (?: in place of (.

Further to the excellent answers provided, its usefulness is also to simplify the code required to extract groups from the matched results. For example, your (\w+) group is known as group 1 without having to be concerned about any groups that appear before it. This may improve the maintainability of your code.
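The effect on group numbering can be seen in a small JavaScript sketch. The sample string and variable names are made up for illustration, and the pattern is the one from the question without the trailing `*`.

```javascript
// Hypothetical sample input, chosen only for illustration.
const sample = "hello big world";

// Capturing version: the repeated prefix group becomes group 1
// (holding its last repetition), so the final word is pushed to group 2.
const withCapture = sample.match(/^(\w+\s)*(\w+)$/);

// Non-capturing version: the prefix is still matched but not stored,
// so the final word is group 1.
const withoutCapture = sample.match(/^(?:\w+\s)*(\w+)$/);

console.log(withCapture.slice(1));    // [ 'big ', 'world' ]
console.log(withoutCapture.slice(1)); // [ 'world' ]
```

Both patterns match the same text; only the list of remembered groups differs.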

Let's understand this with an example.
Say we are given the string s = "a eeee".
With your regex /^(?:\w+\s)(\w+)$/, matching starts at the beginning of the string, and (?:\w+\s) matches 'a' together with the whitespace character that follows it. If you had not included ?:, that is, if you had written (\w+\s), then 'a ' (a with the whitespace character) would have been returned as the first captured group.
If you don't want that in your results, you write (?:\w+\s) instead. The 'a ' is still matched, whitespace included, but it is not captured, so the only group returned is (\w+), which is 'eeee' here.
PS: I am a beginner in regex. This is what I have understood about ?:. Feel free to point out anything I have explained wrongly.

Related

Why can "a*a+" and "(a{2,3})*a{2,3}" match "aaaa" while "(a{2,3})*" cannot?

My understanding of * is that it consumes as many characters as possible (greedily) but "gives back" when necessary. Therefore, in a*a+, a* would give one (or maybe more?) character back to a+ so it can match.
However, in (a{2,3})*, why doesn't the first "instance" of a{2,3} give a character to the second "instance" so the second one can match?
Also, in (a{2,3})*a{2,3} the first part does seem to give a character to the second part.
A simple workaround for your question is to match aaaa with the regex ^(a{2,3})*$.
Your problem is that:
In the case of (a{2,3})*, regex doesn't seem to consume as many characters as possible.
I suggest not thinking in terms of giving back characters. Instead, the key is acceptance.
Once the regex accepts your string, the matching is over. The pattern a{2,3} only matches aa or aaa. So when matching aaaa with (a{2,3})*, the greedy engine first matches aaa. It then can't match another a{2,3} because only one a remains. The engine could backtrack and match a{2,3} twice (aa followed by aa), but it doesn't: aaa is already accepted by the regex, so the engine skips the expensive backtracking.
If you add a $ to the end of the regex, it tells the engine that a partial match is unacceptable. The (a{2,3})*a{2,3} case can be explained the same way, with accepting and backtracking.
The main problem is this:
My understanding of * is that it consumes as many characters as possible (greedily) but "gives back" when necessary
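The catch is what "necessary" means. A greedy quantifier takes the longest repetition it can, and the engine makes it give characters back only when the rest of the pattern would otherwise fail to match. It never gives anything back just to make the overall match longer.
Once you interpret the expressions with this understanding everything makes sense.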
a*a+ - zero or more a followed by one or more a
(a{2,3})*a{2,3} - zero or more of either two or three a followed by either two or three a (note: the KEY THING to remember is "zero or more", the first part not matching any character is considered a match)
(a{2,3})* - zero or more of either two or three a (this means that after matching three a's, the single a that is left cannot match)
Backtracking is done only if the match fails, and here aaa is a valid match. A negative lookahead (?!a) can be used to prevent the match from being followed by an a.
Compare
(aaa?)*
and
(aaa?)*(?!a)
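To check the patterns discussed above against aaaa, here is a minimal JavaScript sketch. The helper name `firstMatch` is hypothetical; it just returns the text of the first match.

```javascript
const input = "aaaa";

// Hypothetical helper: text of the first match of `re` in the input.
const firstMatch = (re) => input.match(re)[0];

// Accepts "aaa" and stops: nothing forces the engine to backtrack.
console.log(firstMatch(/(a{2,3})*/));       // aaa

// The anchor rejects "aaa", so the engine backtracks to "aa" + "aa".
console.log(firstMatch(/^(a{2,3})*$/));     // aaaa

// The trailing a{2,3} forces the starred group to settle for "aa".
console.log(firstMatch(/(a{2,3})*a{2,3}/)); // aaaa

// a* first takes all four, then hands one back so a+ can match.
console.log(firstMatch(/a*a+/));            // aaaa

// The lookahead rejects "aaa" (an a follows), forcing "aa" + "aa".
console.log(firstMatch(/(aaa?)*(?!a)/));    // aaaa
```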

Regex (.*) without matching the second case

Given the following sample input text:
{{A1|def|ghi|jkl}}hello world. {{A2|mno}}bye world.
How can I create a regex pattern that matches only the first instance of {{ ... }} (i.e. only {{A1|def|ghi|jkl}})? A1 and A2 are fixed inputs and def, ghi, jkl, and mno could be anything.
I've tried this:
\{\{A1\|(.*)\|(.*)\|(.*)\}\}
But that returns everything ({{A1|def|ghi|jkl}}hello world. {{A2|mno}}).
Note that def or ghi or jkl or mno could be numbers, English letters or other languages (e.g. Chinese/Japanese/Korean).
It's a little unclear what you are trying to accomplish. At first, I thought that your problem was just that you were getting the entire thing when all you really wanted was the A1 or A2 part. If so, here's the answer:
Since you didn't specify which flavor of regex you are using, it's hard to say for sure. If you are using a version which supports look-arounds, you could do something like this:
(?<={{)\w+(?=(\|[^|}]*)+}})
Here's the meaning of the pattern:
(?<={{) - This is a positive look-behind expression which means that it asserts that any match must be preceded by certain characters. In this case, the characters are {{.
\w+ - This is the actual part that we are matching. In this case, it's one or more word characters. \w is a special character class. This varies, though, depending on which regex engine you are using. Something like [A-Z][0-9] may be more appropriate, depending on your needs.
(?=(\|[^|}]*)+}}) - This is a positive look-ahead expression. That means that it asserts that any match must be followed by some particular pattern of characters. In this case, it's looking for matches to be followed by (\|[^|}]*)+}}.
However, if look-arounds are not possible, then you can match it with a capturing group, like this:
{{(\w+)(\|[^|}]*)+}}
If you do it that way, you'll need to read the value of the first group for each match.
As far as only finding the first match goes, that really depends on which tool or language you are using. Most regex engines only find the first match by default and only find additional matches when a global modifier is specified (often /g at the end).
However, now that you have edited your question and I have tried to better understand what you meant, I think that your real problem is greediness. Repetitions such as * are greedy by default in regex. That means they match as much text as they possibly can while still allowing the overall pattern to match. Here you don't want the longest possible match; you want the shortest. You can get that by making the repetitions lazy (i.e. non-greedy): simply add a ? after the *. For instance:
\{\{A1\|(.*?)\|(.*?)\|(.*?)\}\}
However, that's not very efficient. If this pattern is going to be used often or on large inputs it would be better to use a more restrictive character class, such as [^}|] instead of ., so that the lazy modifier is unnecessary. For example:
\{\{A1\|([^}|]*)\|([^}|]*)\|([^}|]*)\}\}
Or, more simply:
{{A1(\|([^}|]*)){3}}}
The problem with your pattern is simply that you've made all of the * quantifiers greedy. They're matching as much of the string as they can (while still allowing the whole pattern to match). Just make them non-greedy *?:
\{\{A1\|(.*?)\|(.*?)\|(.*?)\}\}
https://regex101.com/r/pK4gE7/1
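The difference between the greedy, lazy, and negated-class patterns can be checked with a short JavaScript sketch against the sample input from the question. The variable names are only illustrative.

```javascript
// Sample input from the question.
const wikiText = "{{A1|def|ghi|jkl}}hello world. {{A2|mno}}bye world.";

const greedy = /\{\{A1\|(.*)\|(.*)\|(.*)\}\}/;
const lazy = /\{\{A1\|(.*?)\|(.*?)\|(.*?)\}\}/;
const negated = /\{\{A1\|([^}|]*)\|([^}|]*)\|([^}|]*)\}\}/;

// Greedy: runs on to the last "}}" in the input.
console.log(wikiText.match(greedy)[0]);        // {{A1|def|ghi|jkl}}hello world. {{A2|mno}}

// Lazy: stops at the first "}}" that lets the pattern succeed.
console.log(wikiText.match(lazy)[0]);          // {{A1|def|ghi|jkl}}

// Negated class: cannot cross "|" or "}" at all, so no laziness is needed.
console.log(wikiText.match(negated).slice(1)); // [ 'def', 'ghi', 'jkl' ]
```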

Regex query efficient?

I came up with the regex below to look for terms like Password, Passphrase, Pass001, etc. and the word following them. Is it efficient, or can it be made better? Thanks for the help.
"([Pp][aA][sS][Ss]([wW][oO][rR][dD][sS]?|[Pp][hH][rR][aA][sS][eE])?|[Pp]([aA][sS]([sS])?)?[wW][Dd])[0-9]?[0-9]?[0-9]?[\s\:\-\=\_\/\#\&\'\[\(\+\*\r\n\)\]]+\S*"
I will be using it to scan files up to 300K for these terms. Right now, scanning a whole C: drive with this expression takes 5 hours or, in the worst case I have encountered, 5 days.
You may use the following enhancement:
(?i)p(?:ass(?:words?|phrase)?|(?:ass?)?wd)[0-9]{0,3}[-\s:=_\/#&'\]\[()+*\r\n]\S*
See the regex demo
Instead of [sS], you may make the regex case insensitive by adding (?i) case insensitive modifier. Use corresponding option in your software if it does not work like this.
Make sure your alternations do not match at the same location in the string. It is not quite easy here, but having p at the start of each alternative in the first group decreases the regex efficiency. So, move it outside (e.g. (?:pass|port) => p(?:ass|ort)).
Use non-capturing groups rather than capturing ones if you are not going to access submatches, that also has a slight impact on performance.
Use limiting quantifiers instead of repeating ? quantified patterns. Instead of a?a?a?, use a{0,3}.
Do not overescape chars inside the character class. I only left \/, \] and \[ as I am not sure what regex flavor you are using, it might appear you can avoid escaping at all.
Note that a performance penalty is big if you have consecutive non-fixed width patterns that may match the same type of chars. You have [\s\:\-\=\_\/\#\&\'\[\(\+\*\r\n\)\]]+\S*: [\s\:\-\=\_\/\#\&\'\[\(\+\*\r\n\)\]]+ matches 1 or more special chars and \S* matches 0 or more chars other than whitespace that also matches some chars matched by the preceding pattern. Remove the + from the preceding subpattern.
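As a rough check of the enhanced pattern, here is a JavaScript sketch. It assumes a JavaScript engine, so the (?i) inline modifier is written as the /i flag instead; the helper `firstHit` and the sample lines are hypothetical.

```javascript
// The (?i) modifier from the answer becomes the /i flag here, because
// JavaScript regex literals take their flags after the closing slash.
const credential = /p(?:ass(?:words?|phrase)?|(?:ass?)?wd)[0-9]{0,3}[-\s:=_\/#&'\]\[()+*\r\n]\S*/i;

// Hypothetical helper: text of the first hit in a line, or null.
const firstHit = (line) => {
  const m = line.match(credential);
  return m ? m[0] : null;
};

console.log(firstHit("passwd=secret"));     // passwd=secret
console.log(firstHit("Pass001 abc"));       // Pass001 abc
console.log(firstHit("Passphrase-xyz"));    // Passphrase-xyz
console.log(firstHit("no secrets here"));   // null
console.log(firstHit("Password: hunter2")); // Password:
```

The last line shows one consequence of dropping the +: only a single separator character is consumed, so when two separators follow the keyword (a colon and a space here) the following word is no longer part of the match. Whether that trade-off is acceptable depends on what the scan needs to report.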

What do we need Lookahead/Lookbehind Zero Width Assertions for?

I've just learned about these two concepts in more detail. I've always been good with RegEx, and it seems I've never seen the need for these 2 zero width assertions.
I'm pretty sure I'm wrong, but I do not see why these constructs are needed. Consider this example:
Match a 'q' which is not followed by a 'u'.
2 strings will be the input:
Iraq
quit
With negative lookahead, the regex looks like this:
q(?!u)
Without it, it looks like this:
q[^u]
For the given input, both of these regex give the same results (i.e. matching Iraq but not quit) (tested with perl). The same idea applies to lookbehinds.
Am I missing a crucial feature that makes these assertions more valuable than the classic syntax?
Why your test probably worked (and why it shouldn't)
The reason you were able to match Iraq in your test might be that your string contained a \n at the end (for instance, if you read it from the shell). If you have a string that ends in q, then q[^u] cannot match it as the others said, because [^u] matches a non-u character - but the point is there has to be a character.
What do we actually need lookarounds for?
Obviously in the above case, lookaheads are not vital. You could work around this by using q(?:[^u]|$). So we match only if q is followed by a non-u character or the end of the string. There are much more sophisticated uses for lookaheads though, which become a pain if you do them without lookaheads.
This answer tries to give an overview of some important standard situations which are best solved with lookarounds.
Let's start with looking at quoted strings. The usual way to match them is with something like "[^"]*" (not with ".*?"). After the opening ", we simply repeat as many non-quote characters as possible and then match the closing quote. Again, a negated character class is perfectly fine. But there are cases, where a negated character class doesn't cut it:
Multi-character delimiters
Now what if we don't have double-quotes to delimit our substring of interest, but a multi-character delimiter. For instance, we are looking for ---sometext---, where single and double - are allowed within sometext. Now you can't just use [^-]*, because that would forbid single -. The standard technique is to use a negative lookahead at every position, and only consume the next character, if it is not the beginning of ---. Like so:
---(?:(?!---).)*---
This might look a bit complicated if you haven't seen it before, but it's certainly nicer (and usually more efficient) than the alternatives.
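A small JavaScript sketch of this technique, using a made-up input string in which single and double dashes appear inside the delimited text:

```javascript
// Hypothetical input: single and double dashes occur inside the first item.
const dashed = "x ---some-text--with-dashes--- y ---b---";

// Before consuming each character, check that it does not start "---".
const delimited = dashed.match(/---(?:(?!---).)*---/g);

console.log(delimited); // [ '---some-text--with-dashes---', '---b---' ]
```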
Different delimiters
You get a similar case, where your delimiter is only one character but could be one of two (or more) different characters. For instance, say in our initial example, we want to allow for both single- and double-quoted strings. Of course, you could use '[^']*'|"[^"]*", but it would be nice to treat both cases without an alternative. The surrounding quotes can easily be taken care of with a backreference: (['"])[^'"]*\1. This makes sure that the match ends with the same character it began with. But now we're too restrictive - we'd like to allow " in single-quoted and ' in double-quoted strings. Something like [^\1] doesn't work, because a backreference will in general contain more than one character. So we use the same technique as above:
(['"])(?:(?!\1).)*\1
That is after the opening quote, before consuming each character we make sure that it is not the same as the opening character. We do that as long as possible, and then match the opening character again.
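The same idea, sketched in JavaScript with a hypothetical input that mixes both quote styles:

```javascript
// Hypothetical input: each quoted string contains the other quote character.
const quotedText = `say "it's fine" or 'the "best" one'`;

// \1 refers to whichever quote character opened the string.
const quoted = quotedText.match(/(['"])(?:(?!\1).)*\1/g);

console.log(quoted); // [ `"it's fine"`, `'the "best" one'` ]
```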
Overlapping matches
This is a (completely different) problem that can usually not be solved at all without lookarounds. If you search for a match globally (or want to regex-replace something globally), you may have noticed that matches can never overlap. I.e. if you search for ... in abcdefghi you get abc, def, ghi and not bcd, cde and so on. This can be a problem if you want to make sure that your match is preceded (or surrounded) by something else.
Say you have a CSV file like
aaa,111,bbb,222,333,ccc
and you want to extract only fields that are entirely numerical. For simplicity, I'll assume that there is no leading or trailing whitespace anywhere. Without lookarounds, we might go with capturing and try:
(?:^|,)(\d+)(?:,|$)
So we make sure that we have the start of a field (start of string or ,), then only digits, and then the end of a field (, or end of string). Between that we capture the digits into group 1. Unfortunately, this will not give us 333 in the above example, because the , that precedes it was already part of the match ,222, - and matches cannot overlap. Lookarounds solve the problem:
(?<=^|,)\d+(?=,|$)
Or if you prefer double negation over alternation, this is equivalent to
(?<![^,])\d+(?![^,])
In addition to being able to get all matches, we get rid of the capturing which can generally improve performance. (Thanks to Adrian Pronk for this example.)
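The CSV example can be reproduced with the following JavaScript sketch. It assumes an engine with lookbehind support (ES2018 or later), and the variable names are only illustrative.

```javascript
const row = "aaa,111,bbb,222,333,ccc";

// Consuming version: the comma after 222 is used up by that match,
// so it is no longer available to introduce 333.
const consuming = Array.from(
  row.matchAll(/(?:^|,)(\d+)(?:,|$)/g),
  (m) => m[1]
);

// Lookaround version: the commas are only inspected, never consumed.
const lookaround = row.match(/(?<![^,])\d+(?![^,])/g);

console.log(consuming);  // [ '111', '222' ]
console.log(lookaround); // [ '111', '222', '333' ]
```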
Multiple independent conditions
Another very classic example of when to use lookarounds (in particular lookaheads) is when we want to check multiple conditions on an input at the same time. Say we want to write a single regex that makes sure our input contains a digit, a lower case letter, an upper case letter, a character that is none of those, and no whitespace (say, for password security). Without lookarounds you'd have to consider all permutations of digit, lower case/upper case letter, and symbol. Like:
\S*\d\S*[a-z]\S*[A-Z]\S*[^0-9a-zA-Z]\S*|\S*\d\S*[A-Z]\S*[a-z]\S*[^0-9a-zA-Z]\S*|...
Those are only two of the 24 necessary permutations. If you also want to ensure a minimum string length in the same regex, you'd have to distribute those in all possible combinations of the \S* - it simply becomes impossible to do in a single regex.
Lookahead to the rescue! We can simply use several lookaheads at the beginning of the string to check all of these conditions:
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)
Because the lookaheads don't actually consume anything, after checking each condition the engine resets to the beginning of the string and can start looking at the next one. If we wanted to add a minimum string length (say 8), we could simply append (?=.{8}). Much simpler, much more readable, much more maintainable.
Important note: This is not the best general approach to check these conditions in any real setting. If you are making the check programmatically, it's usually better to have one regex for each condition, and check them separately - this lets you return a much more useful error message. However, the above is sometimes necessary, if you have some fixed framework that lets you do validation only by supplying a single regex. In addition, it's worth knowing the general technique, if you ever have independent criteria for a string to match.
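Both approaches can be compared in a short JavaScript sketch: the single regex with stacked lookaheads (including the minimum length of 8), and a per-condition check that can report what is missing. The rule table, its labels, and the `failedRules` helper are hypothetical, and the length rule uses the string length rather than a regex.

```javascript
// Single regex: every condition is a lookahead anchored at the start.
const allAtOnce = /^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)(?=.{8})/;

// Hypothetical rule table: one small check per condition, with a label
// that can be turned into an error message.
const rules = [
  ["a digit", (s) => /\d/.test(s)],
  ["a lower case letter", (s) => /[a-z]/.test(s)],
  ["an upper case letter", (s) => /[A-Z]/.test(s)],
  ["a symbol", (s) => /[^0-9a-zA-Z]/.test(s)],
  ["no whitespace", (s) => !/\s/.test(s)],
  ["at least 8 characters", (s) => s.length >= 8],
];

// Labels of all rules the input does not satisfy.
const failedRules = (s) =>
  rules.filter(([, ok]) => !ok(s)).map(([label]) => label);

console.log(allAtOnce.test("Passw0rd!"));  // true
console.log(allAtOnce.test("password1!")); // false
console.log(failedRules("password1!"));    // [ 'an upper case letter' ]
console.log(failedRules("Pass w0rd!"));    // [ 'no whitespace' ]
```

The single regex only answers yes or no, while the rule table says which condition failed.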
I hope these examples give you a better idea of why people would like to use lookarounds. There are a lot more applications (another classic is inserting commas into numbers), but it's important that you realise that there is a difference between (?!u) and [^u] and that there are cases where negated character classes are not powerful enough at all.
q[^u] will not match "Iraq" because it will look for another symbol.
q(?!u) however, will match "Iraq":
let regex = /q[^u]/;
regex.test("Iraq");   // false
regex.test("Iraqf");  // true
regex = /q(?!u)/;
regex.test("Iraq");   // true
Another thing, along with what others mentioned: with a negative lookahead you can negate a sequence of consecutive characters (e.g. ui). With [^...] you cannot negate ui, only u or i individually, and if you try [^ui]{2}, you will also negate uu, ii and iu.
The whole point is to not "consume" the next character(s), so that it can be e.g. captured by another expression that comes afterwards.
If they're the last expression in the regex, then what you've shown are equivalent.
But e.g. q(?!u)([a-z]) would let the non-u character be part of the next group.
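The last two points can be illustrated with a JavaScript sketch. The input words are chosen only for illustration.

```javascript
// The lookahead leaves the character after q available for the group.
const viaLookahead = "Iraqi".match(/q(?!u)([a-z])/);
console.log(viaLookahead[0]); // qi
console.log(viaLookahead[1]); // i

// The negated class consumes that character, so the group finds nothing.
const viaNegatedClass = "Iraqi".match(/q[^u]([a-z])/);
console.log(viaNegatedClass); // null

// Negating a two-character sequence, which a character class cannot do.
console.log(/q(?!ui)/.test("quit"));  // false
console.log(/q(?!ui)/.test("quote")); // true
```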

Regex to Match All Except a String

Given the string beginend where begin and end are both optional, I want to match the whole string and back-reference only begin. Begin is unknown but alpha-numeric; end is literally end. How would I go about doing this?
In case it matters, I'd be using this in a Textpad macro to replace "beginend" with something else including "begin".
To match a string of "alpha-numeric" characters that does not contain "end" you can use something like:
(?:(?!end)[A-Za-z\d])+
An expression like this would do what you ask:
^((?:(?!end)[A-Za-z0-9])+)(?:end)?\z
EDITED (see after blockquote)
I don't have commenting privileges, so I can't comment on his
solution, but Qtax's solution will not work because it assumes that
begin will never contain the substring "end", e.g., it wouldn't
match the string "sendingend".
My solution:
^([A-Za-z0-9]*)(?:end)?$
Of course, it also depends on what you mean by alphanumeric. My
example has the strictest definition, i.e., just upper- and lower-case
letters plus digits. You'd need to add in other characters if you want
them. If you want to include the underscore as well as those
characters, you can replace the whole bulky [A-Za-z0-9] with \w
(equivalent to [A-Za-z0-9_]). Add \s if you want whitespace.
Since you said your regex knowledge is limited, I'll explain the rest
of the solution to you and whoever else comes along.
^ and $ match the beginning and the end of the string, respectively. By including the $ in particular, you're
guaranteeing that the last "end" you encounter is really at the end.
For example, without them, it would still match the string
"sendingsending" and the rest of your program would think it's found
that "end" at the end. With these, it's still going to match
"sendingsending" because any characters are allowed (see below), but
other steps in your script will recognize the presence of
"end". It actually doesn't matter much for this current
string, because the ([A-Za-z0-9]*) will capture the entire string if
"end" is not present. However, you therefore need another regex to
ensure the presence or absence of "end"...so you'd do something like
(end)$ to locate it.
([A-Za-z0-9]*): the square brackets contain the specific characters that are allowed (you should definitely read up on this if
you don't know). The * means it will match one of those characters 0
or more times, so this allows for no string (i.e., just "end") as well
as super-long strings. The parentheses are capturing that pattern so
you can back-reference it.
(?:end)?: the last ? makes it match this pattern 0 or 1 times (i.e., makes it optional). The (?:string) structure allows you to
group characters together as you would with parentheses but the ?:
makes it not save that pattern, so it uses less memory. In your
case, that memory would be negligible, but it's nice to know for
future use.
If you need more help, try Googling 'regex'. There's tons of good
references. You can also test them out. My personal favourite tester
is called My Regex Tester.
Good luck!
I just tried looking up TextPad macros, and you might run into a problem. As I've explained above, to verify the presence of "end" at the end of the string, you'll need something separate. I was envisioning some kind of conditional, something like IF (end)$ THEN replace with ^([A-Za-z0-9]*)(?:end)?$ ELSE use the whole string. However, I don't know if you can do that with these macros...it's hard to say, because I'm not a TextPad user and there's next to no documentation. If you can't, then I think you're going to have to put some restrictions on it. One idea is to not allow "end" to be anywhere in the begin substring (which is how Qtax's solution did it). But now I'm wondering...if "end" is going to be optional, and if conditionals aren't allowed, what's the point of having it at all? ...perhaps I'm overthinking things. I await your reply.
Try using a positive look-ahead. This is a zero-width assertion so won't be included in the match. It also allows for the substring end to be present within the alpha-numeric string
([a-z0-9]*)(?=end)
What this is saying is: Match an alpha-numeric string only if it is immediately followed by end
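To compare how the suggested patterns behave, here is a JavaScript sketch. It assumes a JavaScript engine, so $ stands in for \z, and the TextPad flavor may differ. The helper `beginOf` is hypothetical, and the lazy variant is not taken from any of the answers; it is added here only as a possible way to capture begin without a second regex.

```javascript
// Pattern that forbids "end" inside begin ($ used in place of \z).
const excludeEnd = /^((?:(?!end)[A-Za-z0-9])+)(?:end)?$/;

// Pattern with a greedy capture group.
const greedyAll = /^([A-Za-z0-9]*)(?:end)?$/;

// Assumption, not from the answers: the same pattern with a lazy group.
const lazyBegin = /^([A-Za-z0-9]*?)(?:end)?$/;

// Hypothetical helper: contents of group 1, or null when nothing matches.
const beginOf = (re, s) => {
  const m = s.match(re);
  return m ? m[1] : null;
};

console.log(beginOf(excludeEnd, "beginend"));   // begin
console.log(beginOf(excludeEnd, "sendingend")); // null
console.log(beginOf(greedyAll, "beginend"));    // beginend
console.log(beginOf(lazyBegin, "beginend"));    // begin
console.log(beginOf(lazyBegin, "sendingend"));  // sending
console.log(beginOf(lazyBegin, "begin"));       // begin
```

The greedy group swallows the trailing end because the optional (?:end)? is happy to match nothing, which is why that answer needs a separate check. The lazy group gives the optional end a chance to match first.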