Meaning of regular expression - regex

I have the next regular expression:
(?=.*\d)(?=.*[A-Z]).*$
1) Contains one digit.
2) Contains UPPER case letters.
Example:
"asaZ1h" -> Correct
"asaZaksa" -> Incorrect
My question is what's the meaning of "?=" in this expression ?

The meaning of "?=" signifies a lookahead. This means that it'll assert that in a particular string a condition is true, but it won't consume any characters, thus the match that follows will start at the cursor position before the lookaheads. Good if you want to start matching one thing conditionally on another expression.
This might help.

If you're working in javascript (and probably other languages) it means that the match you're trying to make in the parentheses is required.
See this answer for a more thorough - and useful - response.

Related

Matching Word() when word is not (some word)

Specifically, I want to match functions in my Javascript code that are not in a set of common standard Javascript functions. In other words, I want to match user defined functions. I'm working with vim's flavour of regexp, but I don't mind seeing solutions for other flavours.
As I understand it, regexp crawls through a string character by character, so thinking in terms of sets of characters can be problematic even when a problem seems simple. I've tried negative lookahead, and as you might expect all the does is prevent the first character of the functions I don't want from being matched (ie, onsole.log instead of console.log).
(?(?!(if)|(console\.log)|(function))\w+)\(.*\)
function(meep, boop, doo,do)
JSON.parse(localStorage["beards"])
console.log("sldkfjls" + dododo);
if (beepboop) {
BLAH.blah.somefunc(arge, arg,arg);
https://regexr.com/
I would like to be able to crawl through a function and see where it is calling other usermade functions. Will I need to do post-processing (ie mapping with another regexp) on the matches to reject matches I don't want, or is there a way to do this in one regexp?
The basic recipe for a regular expression that matches all words except foo (in Vim's regular expression syntax) is:
/\<\%(foo\>\)\#!\k\+\>/
Note how the negative lookahead (\#!) needs an end assertion (here: \>) on its own, to avoid that it also excludes anything that just starts with the expression!
Applied to your examples (excluding if (potentially with whitespace), console.log, and function, ending with (), that gives:
\<\%(\%(if *\|console\.log\|function\)(\)\#!\(\k\|\.\)\+\>(.*)
As you seem to want to include the entire object chain (so JSON.parse instead of just parse), the actual match includes both keyword characters (\k) and the period. There's one complication with that: The negative lookahead will latch onto the log() in console.log(), because the leading keyword boundary assertion (\<) matches there as well. We can disallow that match by also excluding a period just before the function; i.e. by placing \.\#<! in between:
\<\%(\%(if *\|console\.log\|function\)(\)\#!\.\#<!\(\k\|\.\)\+\>(.*)
That will highlight just the following calls:
JSON.parse(localStorage["beards"])
BLAH.blah.somefunc(arge, arg,arg);
foo.log(asdf)

Regex in lex [^a-z]*

For the given regex, [^a-z]* in lex, the question is will it match any word not containing any lower case letter, or this is not the correct implementation? I.e., for that specific scenario, should the given regex be used, or this is a proper one for matching a word that has no lower case letters: [^a-z]+?
My reasoning is that it is not, it should be + instead of the *, since the negation of range, with 0 or more possible cases. seems wrong. But it's hard to get my mind on why is it wrong. I tried several regex tools online, and its hit and miss, some manage to show that it works, some show more matches between characters.
I would say that negating a lowercase string, and saying its either 0 or more of those, that would match the string abc as well, since it (does satisfy the scenario that it doesn't have 0 of anything. That could be said for any string. + seems like a more intuitive option, but in this case * was used and I think its an incorrect implementation, but can't find any resources to back it up since Google doesn't play nicely with those search strings.
Some test cases, this is is node.js:
/[^a-z]*$/.test('testTEST123') - True
/[^a-z]*$/.test('test') - True (this one should be false as per problem statement)
/[^a-z]+$/.test('testTEST123') - True
/[^a-z]+$/.test('test') - False (this one is correct, so there are no matches that dont satisfy the regex)
On regex101.com, the results are similar, but the highlighted part is the end of the line, although there are no characters there.
I dont know is there some specific lex implementation of regex that is different, but as I described, something feels wrong with * usage for not-matching the range.
(F)lex rules never match the empty string, so it does not make any difference whether you use * or + in this context.
But I don't think the question captures the behaviour. A (f)lex rule matches the longest string matching any pattern and [^a-z]+ will match any sequence of characters, whether punctuation, white space, unprintable control codes, etc., except lower-case letters. (In other words, it does not just match "words" unless you have an unusual definition of "word".

Find strings that do not begin with something

This has been gone over but I've not found anything that works consistently... or assist me in learning where I've gone awry.
I have file names that start with 3 or more digits and ^\d{3}(.*) works just fine.
I also have strings that start with the word 'account' and ^ACCOUNT(.*) works just fine for these.
The problem I have is all the other strings that DO NOT meet the two previous criteria. I have been using ^[^\d{3}][^ACCOUNT](.*) but occasionally it fails to catch one.
Any insights would be greatly appreciated.
^[^\d{3}][^ACCOUNT](.*)
That's definitely not what you want. Square brackets create character classes: they match one character from the list of characters in brackets. If you put a ^ then the match is inverted and it matches one character that's not listed. The meaning of ^ inside brackets is completely different from its meaning outside.
In short, [] is not at all what you want. What you can do, if your regex implementation supports it, is use a negative lookahead assertion.
^(?!\d{3}|ACCOUNT)(.*)
This negative lookahead assertion doesn't match anything itself. It merely checks that the next part of the string (.*) does not match either \d{3} or ACCOUNT.
Demorgan's law says: !(A v B) = !A ^ !B.
But unfortunately Regex itself does
not support the negation of expressions. (You always could rewrite it, but sometimes, this is a huge task).
Instead, you should look at your Programming Language, where you can negate values without problems:
let the "matching" function be "match" and you are using match("^(?:\d{3}|ACCOUNT)(.)") to determine, whether the string matches one of both conditions. Then you could simple negate the boolean return value of that matching function and you'll receive every string that does NOT match.

The Greedy Option of Regex is really needed?

The Greedy Option of Regex is really needed?
Lets say I have following texts, I like to extract texts inside [Optionx] and [/Optionx] blocks
[Option1]
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
[/Option2]
But with Regex Greedy Option, its give me
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
Anybody need like that? If yes, could you let me know?
If I understand correctly, the question is “why (when) do you need greedy matching?”
The answer is – almost always. Consider a regular expression that matches a sequence of arbitrary – but equal – characters, of length at least two. The regular expression would look like this:
(.)\1+
(\1 is a back-reference that matches the same text as the first parenthesized expression).
Now let’s search for repeats in the following string: abbbbbc. What do we find? Well, if we didn’t have greedy matching, we would find bb. Probably not what we want. In fact, in most application s we would be interested in finding the whole substring of bs, bbbbb.
By the way, this is a real-world example: the RLE compression works like that and can be easily implemented using regex.
In fact, if you examine regular expressions all around you will see that a lot of them use quantifiers and expect them to behave greedily. The opposite case is probably a minority. Often, it makes no difference because the searched expression is inside guard clauses (e.g. a quoted string is inside the quote marks) but like in the example above, that’s not always the case.
Regular expressions can potentially match multiple portion of a text.
For example consider the expression (ab)*c+ and the string "abccababccc". There are many portions of the string that can match the regular expressions:
(abc)cababccc
(abcc)ababccc
abcc(ababccc)
abccab(abccc)
ab(c)cababccc
ab(cc)ababccc
abcabab(c)ccc
....
some regular expressions implementation are actually able to return the entire set of matches but it is most common to return a single match.
There are many possible ways to determine the "winning match". The most common one is to take the "longest leftmost match" which results in the greedy behaviour you observed.
This is tipical of search and replace (a la grep) when with a+ you probably mean to match the entire aaaa rather than just a single a.
Choosing the "shortest non-empty leftmost" match is the usual non-greedy behaviour. It is the most useful when you have delimiters like your case.
It all depends on what you need, sometimes greedy is ok, some other times, like the case you showed, a non-greedy behaviour would be more meaningful. It's good that modern implementations of regular expressions allow us to do both.
If you're looking for text between the optionx blocks, instead of searching for .+, search for anything that's not "[\".
This is really rough, but works:
\[[^\]]+]([^(\[/)]+)
The first bit searches for anything in square brackets, then the second bit searches for anything that isn't "[\". That way you don't have to care about greediness, just tell it what you don't want to see.
One other consideration: In many cases, greedy and non-greedy quantifiers result in the same match, but differ in performance:
With a non-greedy quantifier, the regex engine needs to backtrack after every single character that was matched until it finally has matched as much as it needs to. With a greedy quantifier, on the other hand, it will match as much as possible "in one go" and only then backtrack as much as necessary to match any following tokens.
Let's say you apply a.*c to
abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc. This finds a match in 5 steps of the regex engine. Now apply a.*?c to the same string. The match is identical, but the regex engine needs 101 steps to arrive at this conclusion.
On the other hand, if you apply a.*c to abcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb, it takes 101 steps whereas a.*?c only takes 5.
So if you know your data, you can tailor your regex to match it as efficiently as possible.
just use this algorithm which you can use in your fav language. No need regex.
flag=0
open file for reading
for each line in file :
if check "[/Option" in line:
flag=0
if check "[Option" in line:
flag=1
continue
if flag:
print line.strip()
# you can store the values of each option in this part

Regular Expression Opposite

Is it possible to write a regex that returns the converse of a desired result? Regexes are usually inclusive - finding matches. I want to be able to transform a regex into its opposite - asserting that there are no matches. Is this possible? If so, how?
http://zijab.blogspot.com/2008/09/finding-opposite-of-regular-expression.html states that you should bracket your regex with
/^((?!^ MYREGEX ).)*$/
, but this doesn't seem to work. If I have regex
/[a|b]./
, the string "abc" returns false with both my regex and the converse suggested by zijab,
/^((?!^[a|b].).)*$/
. Is it possible to write a regex's converse, or am I thinking incorrectly?
Couldn't you just check to see if there are no matches? I don't know what language you are using, but how about this pseudocode?
if (!'Some String'.match(someRegularExpression))
// do something...
If you can only change the regex, then the one you got from your link should work:
/^((?!REGULAR_EXPRESSION_HERE).)*$/
The reason your inverted regex isn't working is because of the '^' inside the negative lookahead:
/^((?!^[ab].).)*$/
^ # WRONG
Maybe it's different in vim, but in every regex flavor I'm familiar with, the caret matches the beginning of the string (or the beginning of a line in multiline mode). But I think that was just a typo in the blog entry.
You also need to take into account the semantics of the regex tool you're using. For example, in Perl, this is true:
"abc" =~ /[ab]./
But in Java, this isn't:
"abc".matches("[ab].")
That's because the regex passed to the matches() method is implicitly anchored at both ends (i.e., /^[ab].$/).
Taking the more common, Perl semantics, /[ab]./ means the target string contains a sequence consisting of an 'a' or 'b' followed by at least one (non-line separator) character. In other words, at ANY point, the condition is TRUE. The inverse of that statement is, at EVERY point the condition is FALSE. That means, before you consume each character, you perform a negative lookahead to confirm that the character isn't the beginning of a matching sequence:
(?![ab].).
And you have to examine every character, so the regex has to be anchored at both ends:
/^(?:(?![ab].).)*$/
That's the general idea, but I don't think it's possible to invert every regex--not when the original regexes can include positive and negative lookarounds, reluctant and possessive quantifiers, and who-knows-what.
You can invert the character set by writing a ^ at the start ([^…]). So the opposite expression of [ab] (match either a or b) is [^ab] (match neither a nor b).
But the more complex your expression gets, the more complex is the complementary expression too. An example:
You want to match the literal foo. An expression, that does match anything else but a string that contains foo would have to match either
any string that’s shorter than foo (^.{0,2}$), or
any three characters long string that’s not foo (^([^f]..|f[^o].|fo[^o])$), or
any longer string that does not contain foo.
All together this may work:
^[^fo]*(f+($|[^o]|o($|[^fo]*)))*$
But note: This does only apply to foo.
You can also do this (in python) by using re.split, and splitting based on your regular expression, thus returning all the parts that don't match the regex, how to find the converse of a regex
In perl you can anti-match with $string !~ /regex/;.
With grep, you can use --invert-match or -v.
Java Regexps have an interesting way of doing this (can test here) where you can create a greedy optional match for the string you want, and then match data after it. If the greedy match fails, it's optional so it doesn't matter, if it succeeds, it needs some extra data to match the second expression and so fails.
It looks counter-intuitive, but works.
Eg (foo)?+.+ matches bar, foox and xfoo but won't match foo (or an empty string).
It might be possible in other dialects, but couldn't get it to work myself (they seem more willing to backtrack if the second match fails?)