Why do I get successful but empty regex matches?

Why do I get successful but empty regex matches? - regex

I'm searching the pattern (.*)\\1 on the text blabl with regexec(). I get successful but empty matches in regmatch_t structures. What exactly has been matched?

The regex .* can match successfully a string of zero characters, or the nothing that occurs between adjacent characters.
So your pattern is matching zero characters in the parens, and then matching zero characters immediately following that.
So if your regex was /f(.*)\1/ it would match the string "foo" between the 'f' and the first 'o'.
You might try using .+ instead of .*, as that matches one or more instead of zero or more. (Using .+ you should match the 'oo' in 'foo')

\1 is the backreference typically used for replacement later or when trying to further refine your regex by getting a match within a match. You should just use (.*), this will give you the results you want and will automatically be given the backreference number 1. I'm no regex expert but these are my thoughts based on my limited knowledge.
As an aside, I always revert back to RegexBuddy when trying to see what's really happening.

\1 is the "re-match" instruction. The question is, do you want to re-match immediately (e.g., BLABLA)
/(.+)\1/
or later (e.g., BLAahemBLA)
/(.+).*\1/

Related

Why the character ^ is required in an regex ^(?!.*?spam) to filter strings?

I try to filter strings, that don't contain word "spam".
I use the regex from here!
But I can't understand why I need the symbol ^ at the start of expression. I know that it signs the start of regex but I do not understand why it doesn't work without ^ in my case?
UPD. All the answers hereunder are very usefull.
It's completely clear now. Thank you!

The regex (?!.*?spam) matches a position in a string that is not followed by something matching .*?spam.
Every single string has such a position, because if nothing else, the very end of the string is certainly not followed by anything matching .*?spam.
So every single string contains a match for the regex (?!.*?spam).
The anchor ^ in ^(?!.*?spam) restricts the regex, so that it only matches strings where the very beginning of the string isn't followed by anything matching .*?spam — i.e., strings that don't contain spam at all (or anywhere in the first line, at least, depending on whether . matches newlines).

The lookahead is a zero-width assertion (that is, it ensures a position in your string). In your case it is a negative lookahead making sure that not "zero more characters, followed by the word spam" are following. This is true for a couple of positions in your string, see a demo on regex101.com without the anchor.
With the anchor the matching process starts right at the very beginning, so the whole string is analyzed, see the altered demo on regex101.com as well.

Regex negation?

I'm playing Regex Golf (http://regex.alf.nu/) and I'm doing the Abba hole. I have the following regex that matches the wrong side entirely (which is what I was trying to do):
(([\w])([\w])\3\2)
However, I'm trying to negate it now so it matches the other side. I can't seem to figure that part out. I tried:
(?!([\w])([\w])\3\2)
But that didn't work. Any tips from the regex masters?

You can make it much shorter (and get more points) by simply using . and removing unnecessary parens:
^(?!.*(.)(.)\2\1)
It just makes sure that there's no "abba" ("abba" here means 4 letters in that particular order we don't want to match) in any part of the string without having to match the whole word.

Using the explanation here: https://stackoverflow.com/a/406408/584663
I came up with: ^((?!((\w)(\w)\4\3)).)*$

The key here turns out to be the leading caret, ^, and the .*
(?! ...) is a look-ahead construct, and so does not advance the regex processing engine.
/(?! ...)/ on its own will correctly return a negative result for items matching the expression within; but for items which do not match (...) the regex engine continues processing. However if your regex only contains the (?! ) there is nothing left to process, and the regex processing position never advances. (See this great answer).
Apparently since the remaining regex is empty, it matches any zero-width segment of a string, i.e. it matches any string.
[begin SWAG]
With the caret ^ present, the regex engine is able to recognize that you are looking for a real answer and that you do not want it to tell you the string contains zero-width components.
[end SWAG]
Thus it is able to correctly fail to match when the (?! ) succeeds.

Regex for deleting characters before a certain character?

I'm very new at regex, and to be completely honest it confounds me. I need to grab the string after a certain character is reached in said string. I figured the easiest way to do this would be using regex, however like I said I'm very new to it. Can anyone help me with this or point me in the right direction?
For instance:
I need to check the string "23444:thisstring" and save "thisstring" to a new string.

If this is your string:
I'm very new at regex, and to be completely honest it confounds me
and you want to grab everything after the first "c", then this regular expression will work:
/c(.*)/s
It will return this match in the first matched group:
"ompletely honest it confounds me"
Try it at the regex tester here: regex tester
Explanation:
The c is the character you are looking for
.* (in combination with /s) matches everything left
(.*) captures what .* matched, making it available in $1 and returned in list context.

Regex for deleting characters before a certain character!
You can use lookahead like this
.*(?=x)
where x is a particular character or word or string.{using characters like .,$,^,*,+ have special meaning in regex so don't forget to escape when using it within x}
EDIT
for your sample string it would be
.*(?=thisstring)
.* matches 0 to many characters till thisisstring

Here is a one-line solution for matching everything after "before"
print $1."\n" if "beforeafter" =~ m/before(.*)/;
Edit:
While using lookbehind is possible, it's not required. Grouping provides an easier solution.

To get the string before : in your example, you have to use [^:][^:]*:\(.*\). Notice that you should have at least one [^:] followed by any number of [^:]s followed by an actual :, the character you are searching for.

Regular exp to match string from beginning until certain char is met

I have some long string where i'm trying to catch a substring until a certain character is met.
Lets suppose I have the following string, and I would like to get the text until the first ampersand.
abc.8965.aghtj&hgjkiyu5.8jfhsdj
I would like to extract what is present before the ampersand so: abc.8965.aghtj
W thought this would work:
grep'^.*&{1}'
I would translate it as
^ start of string
.* match whatever chars
&{1} until the first ampersand is matched
Any advice?
I'm afraid this will take me weeks

{1} does not match the first occurrence; instead it means "match exactly one of the preceding pattern/character", which is identical to just matching the character (&{3} would match &&&).
In order to match the first occurrence of &, you need to use .*?:
grep'^.*?&'
Normally, .* is greedy, meaning it matches as much as possible. This means your pattern would match the last ampersand rather than the first one. .*? is the non-greedy version, matching as little as possible while fulfilling the pattern.
Update: That syntax may not be supported by grep. Here is another option:
'^[^&]*&'
It matches anything that is not an ampersand, up to the first ampersand.
You also may have to enable extended regular expression in grep (-E).

Try this one:
^.*?(?=&)
it won't get ampersand sign, just a text before it

Regex to match all permutations of {1,2,3,4} without repetition

I am implementing the following problem in ruby.
Here's the pattern that I want :
1234, 1324, 1432, 1423, 2341 and so on
i.e. the digits in the four digit number should be between [1-4] and should also be non-repetitive.
to make you understand in a simple manner I take a two digit pattern
and the solution should be :
12, 21
i.e. the digits should be either 1 or 2 and should be non-repetitive.
To make sure that they are non-repetitive I want to use $1 for the condition for my second digit but its not working.
Please help me out and thanks in advance.

You can use this (see on rubular.com):
^(?=[1-4]{4}$)(?!.*(.).*\1).*$
The first assertion ensures that it's ^[1-4]{4}$, the second assertion is a negative lookahead that ensures that you can't match .*(.).*\1, i.e. a repeated character. The first assertion is "cheaper", so you want to do that first.
References
regular-expressions.info/Lookarounds and Backreferences
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?

Just for a giggle, here's another option:
^(?:1()|2()|3()|4()){4}\1\2\3\4$
As each unique character is consumed, the capturing group following it captures an empty string. The backreferences also try to match empty strings, so if one of them doesn't succeed, it can only mean the associated group didn't participate in the match. And that will only happen if string contains at least one duplicate.
This behavior of empty capturing groups and backreferences is not officially supported in any regex flavor, so caveat emptor. But it works in most of them, including Ruby.

I think this solution is a bit simpler
^(?:([1-4])(?!.*\1)){4}$
See it here on Rubular
^ # matches the start of the string
(?: # open a non capturing group
([1-4]) # The characters that are allowed the found char is captured in group 1
(?!.*\1) # That character is matched only if it does not occur once more
){4} # Defines the amount of characters
$
(?!.*\1) is a lookahead assertion, to ensure the character is not repeated.
^ and $ are anchors to match the start and the end of the string.

While the previous answers solve the problem, they aren't as generic as they could be, and don't allow for repetitions in the initial string. For example, {a,a,b,b,c,c}. After asking a similar question on Perl Monks, the following solution was given by Eily:
^(?:(?!\1)a()|(?!\2)a()|(?!\3)b()|(?!\4)b()|(?!\5)c()|(?!\6)c()){6}$
Similarly, this works for longer "symbols" in a string, and for variable length symbols too.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Why do I get successful but empty regex matches? - regex

I'm searching the pattern (.*)\\1 on the text blabl with regexec(). I get successful but empty matches in regmatch_t structures. What exactly has been matched?

\1 is the "re-match" instruction. The question is, do you want to re-match immediately (e.g., BLABLA) /(.+)\1/ or later (e.g., BLAahemBLA) /(.+).*\1/

Related

Why the character ^ is required in an regex ^(?!.*?spam) to filter strings?

Regex negation?

Regex for deleting characters before a certain character?

Regular exp to match string from beginning until certain char is met

Regex to match all permutations of {1,2,3,4} without repetition

Categories

Resources