How is this regex wrong? - regex

I have a regex which I'm using to match user functions inside an IDE (Sublime). It matches what I want (the function name itself), but it also matches the opening parenthesis. The match therefore looks like this:
this._myFunction('content');
Notice the opening paren.
Here is my expression:
(?:[^\._])?([\w-]+)(?:[\(]){1}
How can I exclude the opening paren from the match?
As a bonus question: how can I avoid matching the string function? As you can expect, function( matches too (not fun in JS).
Thank you to anyone who can assist.

You can use (?=pattern):
A zero-width positive look-ahead assertion. For example,
"/\w+(?=\t)/" matches a word followed by a tab, without
including the tab in $&.
So where you match your open paren, wrap it in (?=) instead of (?:).

Unfortunately, you cannot really use regex to parse a context-free grammar, but hopefully this does better. It uses a positive lookahead so that the opening paren is required but not included in the match:
(?:[^\._])?([\w-]+)(?=[\(])
If your IDE's regex engine supports negative lookbehind (the subexpression is not found before the match), you can avoid matching the string 'function' or "function":
(?<!['"])(?:[^\._])?([\w-]+)(?=[\(])
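To make the difference between the consuming group and the lookahead concrete, here is a minimal sketch in Python. Using Python's re module is an assumption (the question is about Sublime's search engine, which may differ in details), and the (?!function\b) guard for the bonus question is one possible approach, not something taken from the answers above.

```python
import re

line = "this._myFunction('content');"

# Original pattern: the trailing group consumes "(", so it ends up in the match.
consuming = re.search(r"(?:[^\._])?([\w-]+)(?:[\(]){1}", line)

# Lookahead version: "(" must follow, but it is not part of the match.
lookahead = re.search(r"(?:[^\._])?([\w-]+)(?=[\(])", line)

print(consuming.group(0))  # _myFunction(
print(lookahead.group(0))  # _myFunction

# Hypothetical guard for the bonus question: skip the keyword "function".
no_keyword = re.compile(r"\b(?!function\b)[\w-]+(?=\()")
names = no_keyword.findall("var f = function(x) { return g(x); }")
print(names)  # ['g']
```

The guard works because the negative lookahead is checked at the word boundary, before any character of the name has been consumed.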

Related

conditional group matching using regex

how to match a group except if it starts with a certain character.
e.g. I have the following sentence:
just _checking any _string.
I have the regex ([\w]+) which matches all the words {just, _checking, any, _string}. But what I want is to match only the words that don't start with the character _, i.e. {just, any}.
The above example is a watered down version of what I'm actually trying to parse.
I'm parsing a code file, which contains string in the following format :
package1.class1<package2.class2 <? extends package3.class3> , package4.class4 <package5.package6.class5<?>.class6.class7<class8> >.class9.class10
The output I need is a match for every fully qualified name (having at least one . in the middle), stopping when a < is encountered.
So, the result should be :
{ package1.class1, package2.class2, package3.class3, package4.class4, package5.package6.class5 }
I wrote ([\w]+\.)+([\w]+) to parse it but it also matches class6.class7 and class9.class10 which I don't want. I know it's way off the mark and I apologize for that.
Hence, I earlier asked if I can ignore a capture group starting from a specific character.
Here's the link where I tried : regex101
There, everything it matches is correct except for class6.class7 and class9.class10.
I'm not sure how to proceed on this. I'm using C++14 and it supports ECMAScript grammar as well along with POSIX style.
EDIT : as suggested by #Corion, I've added more details.
EDIT2 : added regex101 link
Just use a word boundary \b and make sure that the first character is not an underscore (but still a letter):
(\b(?=[^_])[\w]+)
Using the following Perl script to validate that:
perl -wlne "print qq(Matched <$_>) for /(\b(?=[^_])[\w]+)/g"
Matched <just>
Matched <any>
regex101 playground
In response to the expansion of the question in the comment, the following regular expression will also capture dots in the "middle" of the word (but still disallow them at the start of a word):
(\b(?=[^_.])[\w.]+)
perl -wlne "print qq(Matched <$_>) for /(\b(?=[^_.])[\w.]+)/g"
just _checking any _string. and. this. inclu.ding dots
Matched <just>
Matched <any>
Matched <and.>
Matched <this.>
Matched <inclu.ding>
Matched <dots>
regex101 playground
After the third expansion of the question, I've expanded the regular expression to match the class names but exclude the extends keyword, and to start a new match only at the beginning of the string or after whitespace (\s), a less-than sign (<) or a greater-than sign (>). Fully qualified matches are achieved by forcing a dot ( \. ) to appear in the match:
(?:^|[<>\s])(?:(?![_.]|\bextends\b)([\w]+\.[\w.]+))
perl -nwle "print qq(Matched <$_>) for /(?:^|[<>\s])(?:(?![_.]|\bextends\b)([\w]+\.[\w.]+))/g"
Matched <package1.class1>
Matched <package2.class2>
Matched <package3.class3>
Matched <package4.class4>
Matched <package5.package6.class5>
regex 101 playground
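The same two patterns can be checked outside Perl. The sketch below uses Python's re module, which is an assumption (the question targets C++14 with the ECMAScript grammar, where lookahead is also available but the API differs). The sample strings are the ones from the question.

```python
import re

sentence = "just _checking any _string."
# \b anchors at a word start; (?=[^_]) rejects words whose first character is "_".
words = re.findall(r"\b(?=[^_])\w+", sentence)
print(words)  # ['just', 'any']

code = ("package1.class1<package2.class2 <? extends package3.class3> , "
        "package4.class4 <package5.package6.class5<?>.class6.class7<class8> >"
        ".class9.class10")
# A match may only start at the beginning or after "<", ">" or whitespace,
# must not be the keyword "extends", and must contain at least one dot.
fqn = re.compile(r"(?:^|[<>\s])(?:(?![_.]|\bextends\b)([\w]+\.[\w.]+))")
names = fqn.findall(code)
print(names)
```

findall returns only the capture group here, so the leading <, > or space that the non-capturing prefix consumes does not show up in the results.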

PCRE Regex Match /x... but not /y/x

When configuring redirections, it's common to run into multiple pages that include some of the same path strings. We've run into this situation multiple times, where we need to redirect:
https://example.com/x...
But not:
https://example.com/y/x...
To match the /x... we use PCRE regex of:
/x.*
We've been struggling to get the exclude to match correctly; we apologize in advance as our regex is a bit weak, here's our pseudo code:
Match all /x... except /y/x...
Here is what we thought that looked like:
^\/(?!y\/).x.*
In our mind that reads:
Any query starting with /x..., except starting with /y/x...
Thank you in advance, and please feel free to suggest better formatting, we are not stack overflow pros.
Your regex matches a forward slash at the start of the string, then uses a negative lookahead to check that what follows is not y/. If that holds, it matches any single character, then x, then zero or more characters. That means it will match, for example, //x///
Leaving the rest of the URL aside, one way is to use a negative lookahead (?!...) to check that what is on the right does not contain /y/x, and then match any characters:
^(?!.*/y/x).+
Regex demo
You may use a negative lookbehind assertion:
~(?<!/y)/x~
RegEx Demo
(?<!/y) is a negative lookbehind assertion that fails the match if /y appears immediately before /x.
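Both suggestions can be compared side by side. This is a small sketch in Python rather than PCRE (an assumption; the ~ delimiters from the PHP-style pattern are dropped), and the sample paths are made up for illustration.

```python
import re

# Hypothetical sample paths.
paths = ["/x/page", "/y/x/page", "/x", "/y/x"]

# Option 1: negative lookahead at the start; reject any path containing "/y/x".
lookahead = re.compile(r"^(?!.*/y/x).+")
# Option 2: negative lookbehind; match "/x" only when "/y" is not right before it.
lookbehind = re.compile(r"(?<!/y)/x")

for path in paths:
    print(path, bool(lookahead.search(path)), bool(lookbehind.search(path)))
```

Note that neither pattern anchors /x to the start of the path, so a real redirect rule would probably still need a leading ^ in front of /x.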

Don't match regex when trailed by character

Current regex: [[\/\!]*?[^\[\]]*?]
The goal is to successfully match [size=16] and [/size] in the following test case but not match [abc].
[size=16]1234[/size]
[abc](htt)
The regex currently also matches the third tag, [abc], which is always followed by a parenthesis. So I was thinking about using logic like: if the character after the group is "(", do not match.
But I don't really know how to write logic like that in regex...
Lookaround assertions look behind or ahead of the current position to see whether a pattern matches, and the overall match proceeds (or not) depending on the result.
A negative lookahead assertion looks like this:
(?!regex)
Stick it on the end, giving it the escaped parenthesis, and you're good to go:
[[\/\!]*?[^\[\]]*?](?!\()
https://regex101.com/r/2jEApI/1
What you want is a "negative lookahead".
A "lookaround" is a group which gets matched, but not included in the result. They start with (? and end with ).
There are two types of lookaround, lookahead and lookbehind:
A "lookbehind" looks backward and is indicated with a < immediately after the ? (i.e. ?<), but that's not what you're here for.
A "lookahead" looks forward and is the default if there is no < after the ?.
Both types can be either positive or negative:
A positive lookaround requires the included group to be present to form a match and is indicated with an =.
A negative lookaround requires that the included group is NOT present to form a match and is indicated with an !.
Once you have the basic structure for a positive or negative lookahead or lookbehind, the contents in the middle use normal regular expression syntax, the same as in any other group. In your case you'll need an escaped left parenthesis: \(.
Put it all together and you just need to tack this on the end of what you have: (?!\()
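A quick way to check the result is to run it against the test case. The sketch below uses Python's re module (an assumption, the question does not name an engine). The only change to the pattern is that the [ inside the first character class and the final ] are escaped, which avoids a nested-set warning in Python without changing what is matched.

```python
import re

# Pattern from the question plus the negative lookahead (?!\().
tag = re.compile(r"[\[/!]*?[^\[\]]*?\](?!\()")

print(tag.findall("[size=16]1234[/size]"))  # ['[size=16]', '[/size]']
print(tag.findall("[abc](htt)"))            # []
```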

How to get first match of string by Regular Expression?

I have the following text string:
$ABCD(file="somefile.txt")$' />Some more text followed by a dollar like this one)$. Some more random text
I am trying to match the $ABCD(file="somefile.txt")$ part of the string using a regular expression.
I am using this (?=[$]ABCD[(]file=).*(?<=[)][$]) regular expression pattern to make the intended match. It's not working as expected because I am getting a match all the way to the second )$ in the string.
For example, the match will be as follows:
$ABCD(file="somefile.txt")$' />Some more text followed by a dollar like this one)$
How should I modify the pattern to match to the end of the first occurrence of the )$?
Here is a good online regular expression engine tester:
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
Try appending a ? to the greedy *:
(?=[$]ABCD[(]file=).*?(?<=[)][$])
Lazy quantification
The standard quantifiers in regular expressions are greedy, meaning
they match as much as they can. Modern regular expression tools allow a quantifier to be specified as lazy (also known as non-greedy, reluctant, minimal, or ungreedy) by putting a question mark after the quantifier.
You could just use this:
\$ABCD\(file="[a-z.]+"\)\$
to get $ABCD(file="somefile.txt")$.
Your problem was the .* bit: it was too general and thus matched everything up to the last )$.
I would advise you to use the second quote to define the end of the searched pattern: [^"]* matches anything except ".
So the pattern for the file name would be: \$ABCD\(file="([^"]*)
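The greedy, lazy and negated-class variants can be compared on the string from the question. This is a minimal sketch in Python (an assumption; the linked tester suggests the original context is .NET, where the same constructs exist).

```python
import re

text = ('$ABCD(file="somefile.txt")$'
        "' />Some more text followed by a dollar like this one)$. Some more random text")

# Greedy: .* runs as far as it can, so the match ends at the last ")$".
greedy = re.search(r"(?=[$]ABCD[(]file=).*(?<=[)][$])", text)
# Lazy: .*? stops as soon as the lookbehind sees the first ")$".
lazy = re.search(r"(?=[$]ABCD[(]file=).*?(?<=[)][$])", text)
# Negated class: [^"]* cannot run past the closing quote of the file name.
negated = re.search(r'\$ABCD\(file="([^"]*)', text)

print(greedy.group(0))
print(lazy.group(0))     # $ABCD(file="somefile.txt")$
print(negated.group(1))  # somefile.txt
```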

How to negate the whole regex?

I have a regex, for example (ma|(t){1}). It matches ma and t and doesn't match bla.
I want to negate the regex, thus it must match bla and not ma and t, by adding something to this regex. I know I can write bla, the actual regex is however more complex.
Use negative lookaround: (?!pattern)
Positive lookarounds can be used to assert that a pattern matches. Negative lookarounds are the opposite: they assert that a pattern DOES NOT match. Some flavors support assertions; some put limitations on lookbehind, etc.
Links to regular-expressions.info
Lookahead and Lookbehind Zero-Width Assertions
Flavor comparison
See also
How do I convert CamelCase into human-readable names in Java?
Regex for all strings not containing a string?
A regex to match a substring that isn’t followed by a certain other substring.
More examples
These are attempts to come up with regex solutions to toy problems as exercises; they should be educational if you're trying to learn the various ways you can use lookarounds (nesting them, using them to capture, etc):
codingBat plusOut using regex
codingBat repeatEnd using regex
codingbat wordEnds using regex
Assuming you only want to disallow strings that match the regex completely (i.e., mmbla is okay, but mm isn't), this is what you want:
^(?!(?:m{2}|t)$).*$
(?!(?:m{2}|t)$) is a negative lookahead; it says "starting from the current position, the next few characters are not mm or t, followed by the end of the string." The start anchor (^) at the beginning ensures that the lookahead is applied at the beginning of the string. If that succeeds, the .* goes ahead and consumes the string.
FYI, if you're using Java's matches() method, you don't really need the ^ and the final $, but they don't do any harm. The $ inside the lookahead is required, though.
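Here is a small check of that pattern, sketched in Python rather than Java (an assumption). The sample inputs are chosen for illustration.

```python
import re

# Matches any string that is not exactly "mm" or "t".
negated = re.compile(r"^(?!(?:m{2}|t)$).*$")

samples = ["mm", "t", "mmbla", "bla", ""]
results = [bool(negated.match(s)) for s in samples]
for s, ok in zip(samples, results):
    print(repr(s), ok)

# In code, negating the result of a full match is often simpler than
# negating the pattern; for these samples it gives the same answers.
inverted = [re.fullmatch(r"m{2}|t", s) is None for s in samples]
```

If the regex lives in code rather than in a config field, the second form is usually easier to read and maintain.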
\b(?=\w)(?!(ma|(t){1}))\b(\w*)
This is for the given regex.
The \b finds a word boundary.
The positive lookahead (?=\w) is there to avoid spaces.
The negative lookahead over the original regex prevents matches of it.
Finally, the (\w*) catches all the words that are left.
The group that holds the words is group 3.
The simple (?!pattern) will not work, as any substring will match.
The simple ^(?!(?:m{2}|t)$).*$ will not work, as its granularity is full lines.
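The word-level variant can be tried out like this. It is a sketch in Python (an assumption), with a made-up input string.

```python
import re

pattern = re.compile(r"\b(?=\w)(?!(ma|(t){1}))\b(\w*)")

# Group 3 holds the surviving word; groups 1 and 2 sit inside the
# negative lookahead and are never set on a successful match.
words = [m.group(3) for m in pattern.finditer("ma t bla table")]
print(words)  # ['bla']
```

Note that table is skipped as well: the negative lookahead only inspects the beginning of each word, so any word that starts with ma or t is rejected, not just the exact words ma and t.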
This regexp matches your condition:
^.*(?<!ma|t)$
Look at how it works:
https://regex101.com/r/Ryg2FX/1
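The lookbehind (?<!ma|t) has alternatives of different lengths, which PCRE accepts but which, as far as I know, Python's re module rejects because it requires fixed-width lookbehind. A sketch of an equivalent for Python, assuming that restriction applies, splits it into two lookbehinds:

```python
import re

# Two fixed-width lookbehinds stand in for (?<!ma|t).
not_ending = re.compile(r"^.*(?<!ma)(?<!t)$")

samples = ["bla", "ma", "t", "puma", "mat", "table"]
results = [bool(not_ending.search(s)) for s in samples]
for s, ok in zip(samples, results):
    print(s, ok)
```

This rejects every string that ends in ma or t (puma and mat included), which is broader than rejecting only the exact strings ma and t.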
Apply this if you use Laravel.
Laravel has a not_regex rule: the field under validation must not match the given regular expression. It uses the PHP preg_match function internally.
'email' => 'not_regex:/^.+$/i'