How to find a pattern that does not contains a subpattern? - regex

I'm working on a Netbeans project (Java), and I'm checking if everything is allright. I've just discovered that the Find tool in Netbeans supports regular expressions, so I used the following Regex to find the correct code:
ps\.set.*\(3, c0.get\((.*)\)\);(\n.*){3}.*ps\.set.*\(7, c1.get\(t\).get\(\1\)\);\n.*ps\.addBatch\(\);
^^ ^^
Subpattern 1... ...must also appear here
Example (matched by Regex above):
ps.setString(3, c0.get(fooPattern)); // "fooPattern" here...
ps.setDouble(5, s.get(k));
ps.setDouble(6, 0d);
ps.setLong(7, c1.get(t).get(fooPattern)); // ... must also be here
ps.addBatch();
What I need it to check if the subpattern 1 appears in the second position. This Regex works. However, I need to find all occurrences where this does not happen.
Example (what I'd like to find):
ps.setString(3, c0.get(fooPattern)); // "fooPattern" here...
ps.setDouble(5, s.get(k));
ps.setDouble(6, 0d);
ps.setLong(7, c1.get(t).get(barPattern)); // ... is NOT here
ps.addBatch();
So, the specific question is: How to modify this Regex to find all occurrences where Subpattern 1 is not repeated in the position it should be?

Use a negative lookahead as I have already mentioned, just use a consuming subpattern instead of the \1 (that was consuming text in your original regex). Also, do not forget that to match a literal dot, you need to escape it.
ps\.set.*\(3, c0\.get\((.*)\)\);(\n.*){3}.*ps\.set.*\(7, c1\.get\(t\)\.get\((?!\1\))[^()]*\)\);\n.*ps\.addBatch\(\);
^^^^^^^^^^^^
See the regex demo
The (?!\1\)) negative lookahead restricts the more generic [^()]* pattern (that matches zero or more characters other than ( and ), you may replace it with .* if there can be parentheses inside) making it fail if these 0+ chars are equal to the value captured into Group 1.

Related

How can I get the second part of a hyphenated word using regex?

For example, I have the word: sh0rt-t3rm.
How can I get the t3rm part using perl regex?
I could get sh0rt by using [(a-zA-Z0-9)+]\[-\], but \[-\][(a-zA-Z0-9)+] doesn't work to get t3rm.
The syntax used for the regex is not correct to get either sh0rt or t3rm
You flipped the square brackets and the parenthesis, and the hyphen does not have to be between square brackets.
To get sh0rt in sh0rt-t3rm you you might use for example one of:
Regex
Demo
Explanation
\b([a-zA-Z0-9]+)-
Demo 1
\b is a word boundary to prevent a partial word match, the value is in capture group 1.
\b[a-zA-Z0-9]+(?=-)
Demo 2
Match the allowed chars in the character class, and assert a - to the right using a positive lookahead (?=-)
To get t3rm in sh0rt-t3rm you might use for example one of:
Regex
Demo
Explanation
-([a-zA-Z0-9]+)\b
Demo 3
The other way around with a leading - and get the value from capture group 1.
-\K[a-zA-Z0-9]+\b
Demo 4
Match - and use \K to keep out what is matched so far. Then match 1 or more times the allowed chars in the character class.
If your whole target string is literally just sh0rt-t3rm then you want all that comes after the -.
So the barest and minimal version, cut precisely for this description, is
my ($capture) = $string =~ /-(.+)/;
We need parenthesis on the left-hand-side so to make regex run in a list context because that's when it returns the matches (otherwise it returns true/false, normally 1 or '').
But what if the preceding text may have - itself? Then make sure to match all up to that last -
my ($capture) = $string =~ /.*-(.+)/;
Here the "greedy" nature of the * quantifier makes the previous . match all it possibly can so that the whole pattern still matches; thus it goes up until the very last -.
There are of course many other variations on how the data may look like, other than just being one hyphenated-word. In particular, if it's a part of a text, you may want to include word-boundaries
my ($capture) = $string =~ /\b.*?-(.+?)\b/;
Here we also need to adjust our "wild-card"-like pattern .+ by limiting it using ? so that it is not greedy. This matches the first such hyphenated word in the $string. But if indeed only "word" characters fly then we can just use \w (instead of . and word-boundary anchors)
my ($capture) = $string =~ /\w*?-(\w+)/;
Note that \w matches [a-zA-Z0-9_] only, which excludes some characters that may appear in normal text (English, not to mention all other writing systems).
But this is clearly getting pickier and cookier and would need careful close inspection and testing, and more complete knowledge of what the data may look like.
Perl offers its own tutorial, perlretut, and the main full reference is perlre
-([a-zA-Z0-9]+) will match a - followed by a word, with just the word being captured.
Demo

Regex Pattern to Match except when the clause enclosed by the tilde (~) on both sides

I want to extract matches of the clauses match-this that is enclosed with anything other than the tilde (~) in the string.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches from above. The matches are explained below (enclosed by []):
Either match-this~ or match-this is correct for first match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.
I can use the pattern ~match-this~ catch these ~match-this~, but when I tried the negation of it (?!(~match-this)), it literally catches all nulls.
When I tried the pattern [^~]match-this[^~], it catches only one match (the 2nd match from above). And when I tried to add asterisk wild card on any negation of tilde, either [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk wild card on both, it catches all match-this including those which enclosed by tildes ~.
Is it possible to achieve this with only one regex test? Or Does it need more??
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
(?:(?!match-this).)* Match any char if not directly followed by *match-this~
match-this Match literally
(?:(?!match-this)[^#~])* Match any char except ~ or # if not directly followed by match this
) Close group 1
See a regex demo and a Python demo.
Example
import re
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
res = [m for m in re.findall(pattern, s) if m]
print (res)
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only matches that are captured (to capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~" its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
Demo
The regular expression can be broken down as follows.
~match-this~ # match literal
| # or
( # begin capture group 1
\b # match a word boundary
match-this # match literal
\b # match a word boundary
) # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
For this you need both kinds of lookarounds. This will match the 5 spots you want, and there's a reason why it only works this way and not another and why the prefix and/or suffix can't be included:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
Why other ways won't work:
[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem for a starting text. The latter is a problem for having advanced too far, so a "don't match" character is gone already.
(^|[^~]) would solve the first problem: either the text starts or it must be a character not matching this. We could do the same for ending texts, but this is a dead again anyway.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
As per the nature of lookarounds the character in front or behind cannot be captured. Additionally if you want to also match either a leading or a trailing character then this collides with recognizing the next potential match.
It's a difference between telling the engine to "not match" a character and to tell the engine to "look out" for something without actually consuming characters and advancing the current position in the text. Also not every regex engine supports all lookarounds, so it matters where you actually want to use it. For me it works fine in TextPad 8 and should also work fine in PCRE (f.e. in PHP). As per regex101.com/r/CjcaWQ/1 it also works as expected by me.
What irritates me: if the leading and/or trailing character of a found match is important to you, then just extract it from the input when processing all the matches, since they also come with starting positions and lengths: first match at position 0 for 10 characters means you look at input text position -1 and 10.

Regex negation?

I'm playing Regex Golf (http://regex.alf.nu/) and I'm doing the Abba hole. I have the following regex that matches the wrong side entirely (which is what I was trying to do):
(([\w])([\w])\3\2)
However, I'm trying to negate it now so it matches the other side. I can't seem to figure that part out. I tried:
(?!([\w])([\w])\3\2)
But that didn't work. Any tips from the regex masters?
You can make it much shorter (and get more points) by simply using . and removing unnecessary parens:
^(?!.*(.)(.)\2\1)
It just makes sure that there's no "abba" ("abba" here means 4 letters in that particular order we don't want to match) in any part of the string without having to match the whole word.
Using the explanation here: https://stackoverflow.com/a/406408/584663
I came up with: ^((?!((\w)(\w)\4\3)).)*$
The key here turns out to be the leading caret, ^, and the .*
(?! ...) is a look-ahead construct, and so does not advance the regex processing engine.
/(?! ...)/ on its own will correctly return a negative result for items matching the expression within; but for items which do not match (...) the regex engine continues processing. However if your regex only contains the (?! ) there is nothing left to process, and the regex processing position never advances. (See this great answer).
Apparently since the remaining regex is empty, it matches any zero-width segment of a string, i.e. it matches any string.
[begin SWAG]
With the caret ^ present, the regex engine is able to recognize that you are looking for a real answer and that you do not want it to tell you the string contains zero-width components.
[end SWAG]
Thus it is able to correctly fail to match when the (?! ) succeeds.

Regex to match number in #define statement

I have a line like this:
#define PROG_HWNR "36084"
or this:
#define PROG_HWNR "#37595"
I'd like to extract the number (and increase it, but that's not the matter here)
I wrote a regex, but it's not working (at least in http://gskinner.com/RegExr/ )
(?<="#?)(.*?)(?=")
I also tried variations like
(?<=("#?))(.*?)(?=")
or
(?<=("|"#)))(.*?)(?=")
But no success. The problem is, that I want to match only the number, no matter if there is a # or not ...
Can you point me in the right direction? Thanks!!
Try this regex:
"#?(\d+)"$
It will match:
" a quote
#? optional hash
( (start capturing)
\d+ one or more digits
) (stop capturing)
" a quote
$ anchor to end
Here is a JSFiddle, and here is a RegExr
The problem is the variable length of the lookbehind. Only few regex engines can deal with this. Because there are only two possible lookbehinds (including the # or not), you can expand that into two lookbehinds:
(?:(?<="#)|(?<=")).*?(?=")
Note that you don't need to capture the .*? if you use lookarounds, as they are excluded from the match anyway. Also, a better way than using non-greedy .*? is to use a greedy expression that can never go past the ending delimiter:
(?:(?<="#)|(?<="))[^"]*(?=")
Alternatively (if you can access captured submatches), you can use a capturing approach and get rid of the lookarounds:
"#?([^"]*)"
Try this:
^#define \w+ "#?(\d+)"$
That will match the whole line, with the first/single group being the number you are looking for.
This is actually pretty basic regex functionality: match an optional character (?) and match a group of characters (the parentheses).
You can even go one simpler:
\d+
will match a string of digits. Only the digits. And ignore the rest of the input string.
Use this tool for testing this stuff, I found it pretty handy: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx

Regex to match all permutations of {1,2,3,4} without repetition

I am implementing the following problem in ruby.
Here's the pattern that I want :
1234, 1324, 1432, 1423, 2341 and so on
i.e. the digits in the four digit number should be between [1-4] and should also be non-repetitive.
to make you understand in a simple manner I take a two digit pattern
and the solution should be :
12, 21
i.e. the digits should be either 1 or 2 and should be non-repetitive.
To make sure that they are non-repetitive I want to use $1 for the condition for my second digit but its not working.
Please help me out and thanks in advance.
You can use this (see on rubular.com):
^(?=[1-4]{4}$)(?!.*(.).*\1).*$
The first assertion ensures that it's ^[1-4]{4}$, the second assertion is a negative lookahead that ensures that you can't match .*(.).*\1, i.e. a repeated character. The first assertion is "cheaper", so you want to do that first.
References
regular-expressions.info/Lookarounds and Backreferences
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
Just for a giggle, here's another option:
^(?:1()|2()|3()|4()){4}\1\2\3\4$
As each unique character is consumed, the capturing group following it captures an empty string. The backreferences also try to match empty strings, so if one of them doesn't succeed, it can only mean the associated group didn't participate in the match. And that will only happen if string contains at least one duplicate.
This behavior of empty capturing groups and backreferences is not officially supported in any regex flavor, so caveat emptor. But it works in most of them, including Ruby.
I think this solution is a bit simpler
^(?:([1-4])(?!.*\1)){4}$
See it here on Rubular
^ # matches the start of the string
(?: # open a non capturing group
([1-4]) # The characters that are allowed the found char is captured in group 1
(?!.*\1) # That character is matched only if it does not occur once more
){4} # Defines the amount of characters
$
(?!.*\1) is a lookahead assertion, to ensure the character is not repeated.
^ and $ are anchors to match the start and the end of the string.
While the previous answers solve the problem, they aren't as generic as they could be, and don't allow for repetitions in the initial string. For example, {a,a,b,b,c,c}. After asking a similar question on Perl Monks, the following solution was given by Eily:
^(?:(?!\1)a()|(?!\2)a()|(?!\3)b()|(?!\4)b()|(?!\5)c()|(?!\6)c()){6}$
Similarly, this works for longer "symbols" in a string, and for variable length symbols too.