What PCRE will deliver a sub-string if present, else null? - regex

What regex that includes X as a subexpression will, when replaced by $1, yield the first match with X, or if there's none, null (i.e. empty string)?
For example, with X == "there"
<?php
echo '1: '.preg_replace(???, '$1','hello there dolly')."\n"; // -> 'there'
echo '2: '.preg_replace(???, '$1','hello dolly')."\n" ; // -> ''
?>
Please note that what I'm seeking is an answer to the question, not just to this one example.

If you make the capture optional, you'll get a blank if there's no match:
(?<=hello )(\w+)?(?= dolly)
Note: I have assumed you want to match a word between "hello" and "dolly". Adjust the regex to suit.

You can use \w* for zero or more in the middle match:
^hello\s(\w*)\s?dolly
If you want to match everything in between (like bookends) you can make the matching group optional:
^hello\s(.*)?\bdolly

If I understand your question, it's probably easiest to check for a failed match in whatever language you are calling PCRE from.
In Perl itself, for instance, a failed match does not update the capture variables. For this reason, usually you want to check the success or failure of a match: print "$1\n" if /(there)/. But you can use this behavior to your advantage:
{ # Start a new scope so that $1 is null.
/(there)/; # Or whatever pattern you are searching for
print "$1\n"; # Print whether or not the string matched
}
You might be able to do it in a regex if you know something more about the string. A commenter suggested:
You only need to make the group optional: see this example: regex101.com/r/iO5iA5/1 – Casimir et Hippolyte
As you noted, that regex assumes the subpattern is surrounded by spaces. If the string you are matching doesn't have spaces to anchor the capturing group, it will fail. If you remove all anchors, the optional group will match each null string, which can produce some strange results.
In summary, if you know something about the structure of the string, you can use an optional capturing group. If you just want to check if a string contains a particular pattern (and return null if not) use the language that wraps PCRE.
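As a rough sketch of the "let the host language decide" approach, here is the same idea in JavaScript rather than PHP or Perl. The helper name firstCapture is made up for illustration; it returns group 1 of the first match, or an empty string when the pattern does not match at all.

```javascript
// Hypothetical helper: return capture group 1 of the first match, or "" if none.
function firstCapture(pattern, subject) {
  const match = subject.match(pattern); // null when the pattern does not match
  if (match === null || match[1] === undefined) {
    return "";
  }
  return match[1];
}

console.log("1: " + firstCapture(/(there)/, "hello there dolly"));
console.log("2: " + firstCapture(/(there)/, "hello dolly"));
```

No anchoring or optional group is needed here, because the "no match" case is handled outside the regex.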

.*?(X).*|.*
e.g.
<?php
echo '1: '.preg_replace('~.*?(there).*|.*~', '$1','hello there dolly')."\n"; // -> 'there'
echo '2: '.preg_replace('~.*?(there).*|.*~', '$1','hello dolly')."\n" ; // -> ''
?>
Fuller tests here https://regex101.com/r/YLRTfZ/3/tests .
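To make the general shape of this answer concrete, here is a small JavaScript sketch (not PHP) that builds the .*?(X).*|.* pattern for an arbitrary X. The function name extractOrEmpty is an assumption for illustration, and X is inserted verbatim, so it must already be a valid regex fragment.

```javascript
// Build the ".*?(X).*|.*" pattern for an arbitrary subexpression X.
// X is inserted as-is, so the caller is responsible for it being valid regex.
function extractOrEmpty(x, subject) {
  // The "s" (dotAll) flag lets "." cross newlines so the whole subject is consumed.
  const pattern = new RegExp(".*?(" + x + ").*|.*", "s");
  // Without the "g" flag only the first match is replaced; an unmatched
  // group 1 is substituted as the empty string.
  return subject.replace(pattern, "$1");
}

console.log("1: " + extractOrEmpty("there", "hello there dolly"));
console.log("2: " + extractOrEmpty("there", "hello dolly"));
```

The second alternative (.*) is what swallows the whole subject when X is absent, so the replacement leaves nothing behind.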

Related

Capturing text before and after a C-style code block with a Perl regular expression

I am trying to capture some text before and after a C-style code block using a Perl regular expression. So far this is what I have:
use strict;
use warnings;
my $text = << "END";
int max(int x, int y)
{
if (x > y)
{
return x;
}
else
{
return y;
}
}
// more stuff to capture
END
# Regex to match a code block
my $code_block = qr/(?&block)
(?(DEFINE)
(?<block>
\{ # Match opening brace
(?: # Start non-capturing group
[^{}]++ # Match non-brace characters without backtracking
| # or
(?&block) # Recursively match the last captured group
)* # Match 0 or more times
\} # Match closing brace
)
)/x;
# $2 ends up undefined after the match
if ($text =~ m/(.+?)$code_block(.+)/s){
print $1;
print $2;
}
I am having an issue with the 2nd capture group not being initialized after the match. Is there no way to continue a regular expression after a DEFINE block? I would think that this should work fine.
$2 should contain the comment below the block of code but it doesn't and I can't find a good reason why this isn't working.
Capture groups are numbered left-to-right in the order they occur in the regex, not in the order they are matched. Here is a simplified view of your regex:
m/
(.+?) # group 1
(?: # the $code_block regex
(?&block)
(?(DEFINE)
(?<block> ... ) # group 2
)
)
(.+) # group 3
/xs
Named groups can also be accessed as numbered groups.
The 2nd group is the block group. However, this group is only used as a named subpattern, not as a capture. As such, the $2 capture value is undef.
As a consequence, the text after the code-block will be stored in capture $3.
There are two ways to deal with this problem:
For complex regexes, only use named capture. Consider a regex to be complex as soon as you assemble it from regex objects, or if captures are conditional. Here:
if ($text =~ m/(?<before>.+?)$code_block(?<afterwards>.+)/s){
print $+{before};
print $+{afterwards};
}
Put all your defines at the end, where they can't mess up your capture numbering. For example, your $code_block regex would only define a named pattern which you then invoke explicitly.
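The numbering issue is not specific to Perl. The sketch below uses JavaScript, which has no (?(DEFINE)...) construct, so it only illustrates the numbering point: a capture group hidden inside an interpolated fragment shifts the numbers of later groups, while named groups stay stable. The block fragment here handles a single level of nested braces only, and the sample text is made up.

```javascript
// One level of nested braces only. The outer parentheses form a capture group,
// which is what shifts the numbering of every group that follows it.
const block = "(\\{(?:[^{}]|\\{[^{}]*\\})*\\})";

const pattern = new RegExp(
  "(?<before>[\\s\\S]+?)" + block + "(?<after>[\\s\\S]+)"
);

const text = "int f()\n{\n  if (x) { return 1; }\n}\n// tail\n";
const match = text.match(pattern);

console.log(JSON.stringify(match.groups.before)); // text before the block
console.log(JSON.stringify(match.groups.after));  // text after the block
console.log(match[3] === match.groups.after);     // "after" is group 3, not 2
```

Reading the captures by name means the code keeps working if the interpolated fragment later gains or loses a group.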
There are also ready tools that can be leveraged for this, in a few lines of code.
Perhaps the first module to look at is the core Text::Balanced.
The extract_bracketed function in list context returns: the matched substring, the remainder of the string after the match, and the substring before the match. Then we can keep matching in the remainder.
use warnings;
use strict;
use feature 'say';
use Text::Balanced qw/extract_bracketed/;
my $text = 'start {some {stuff} one} and {more {of it} two}, and done';
my ($match, $lead);
while (1) {
($match, $text, $lead) = extract_bracketed($text, '{', '[^{]*');
say $lead // $text;
last if not defined $match;
}
which prints
start
and
, and done
Once there is no match we need to print the remainder, thus $lead // $text (as there can be no $lead either). The code uses $text directly and modifies it, down to the last remainder; if you'd like to keep the original text save it away first.
I've used a made-up string above, but I tested it on your code sample as well.
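For readers without Text::Balanced at hand, the same "text outside the balanced blocks" result can be sketched with a plain depth counter. This is JavaScript, not Perl, and it is not how extract_bracketed works internally; outsideBraces is a hypothetical name, and stray closing braces are simply dropped.

```javascript
// Collect the pieces of text that lie outside any {...} block.
function outsideBraces(text) {
  const segments = [];
  let depth = 0;    // current brace nesting level
  let current = ""; // text gathered at depth 0 since the last block

  for (const ch of text) {
    if (ch === "{") {
      if (depth === 0) {
        segments.push(current); // a top-level block starts: close the segment
        current = "";
      }
      depth += 1;
    } else if (ch === "}") {
      if (depth > 0) depth -= 1; // a stray "}" at depth 0 is ignored
    } else if (depth === 0) {
      current += ch;
    }
  }
  segments.push(current); // remainder after the last block
  return segments;
}

const sample = "start {some {stuff} one} and {more {of it} two}, and done";
console.log(outsideBraces(sample));
```

Unlike a regex, the counter handles any nesting depth, but it knows nothing about braces inside string literals or comments.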
This can also be done using Regexp::Common.
Break the string using its $RE{balanced} regex, then take odd elements
use Regexp::Common qw(balanced);
my @parts = split /$RE{balanced}{-parens=>'{}'}/, $text;
my @out_of_blocks = @parts[ grep { $_ & 1 } 1..$#parts ];
say for @out_of_blocks;
If the string starts with the delimiter the first element is an empty string, as usual with split.
To clean out leading and trailing spaces pass it through map { s/^\s*|\s*$//gr }.
You're very close.
(?(DEFINE)) will define the expression & parts you want to use but it doesn't actually do anything other than define them. Think of this tag (and everything it envelops) as you defining variables. That's nice and clean, but defining the variables doesn't mean the variables get used!
You want to use the code block after defining it so you need to add the expression after you've declared your variables (like in any programming language)
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)
This part defines your variables
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
This part calls your variables into use.
(?&block)
Edits
Edit 1
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)\s*(?:\/\/|\/\*)([\s\S]*?)(?:\r\n|\r|\n|$)
The regex above will get the comment after a block (as you've already defined).
You had a . which will match any character (except newline - unless you use the s modifier which specifies that . should also match newline characters)
Edit 2
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)\s*(?:(?:\/\/([\s\S]*?)(?:\r\n|\r|\n|$))|\/\*([\s\S]*?)\*\/)
This regex is more syntactically correct for capturing comments. The previous edit will work with /* up until a new line or end of file. This one will work until the closing tag or end of file.
Edit 3
As for your code not working, I'm not exactly sure. You can see your code running here and it seems to be working just fine. I would use one of the regular expressions I've written above instead.
Edit 4
I think I finally understand what you're saying. What you're trying to do is impossible with regex. You cannot reference a group without capturing it; therefore, the only true solution is to capture it. There is, however, a workaround that works for your situation. If you want to grab the first and last sections without the second, you can use the following regex, which will not check the second section for proper syntax (the downside). If you do need to check the syntax, you're going to have to deal with there being an additional capture group.
(.+?)\{.*\}\s*(?:(?:\/\/([\s\S]*?)(?:\r\n|\r|\n|$))|\/\*([\s\S]*?)\*\/)
This regex captures everything before the { character, then matches everything after it until it meets } followed by any whitespace, and finally by //. This, however, will break if you have a comment within a block of code (after a })

Regex to match absence of substring

I am looking for a regex which only matches when certain sub-strings are not present. In particular - if a line of code does not assign or return the return value from a method.
Examples:
this.execute(); // should match
var x = this.execute(); // no match
return this.execute(); // no match
I was trying to use the following regex
^(?!.*=|return).*execute\(\).*
This works with regex testers etc. - but I am getting "invalid perl operator" exception when using in practice.
Thanks..
Since you want to exclude only assignment or return it's easily negated
while (<DATA>) { print if not /(?:=|return)\s+this\.execute/ }
__DATA__
this.execute();
var x = this.execute();
return this.execute();
This prints only the line this.execute();.
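The same negated match can be sketched in JavaScript (used here only as a neutral illustration; the sample lines are the ones from the question):

```javascript
const lines = [
  "this.execute();",
  "var x = this.execute();",
  "return this.execute();",
];

// Keep only the lines where the call is NOT preceded by "=" or "return".
const unused = lines.filter(
  (line) => !/(?:=|return)\s+this\.execute/.test(line)
);

console.log(unused);
```

The regex itself stays positive and simple; the negation lives in the surrounding code.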
With Lookaround Assertions, a negative lookahead that you offer does work
if (/^(?!.*=|return)\s*this\.execute/x) { print "$_\n" }
As for the negative lookbehind, there is one problem. First, here's what works
if ( /(?<! =\s ) this\.execute/x ) { print "$_\n" }
if ( /(?<! return \s ) this\.execute/x ) { print "$_\n" }
This excludes = or return, with one space. The thing is, we can't put \s+ there nor can we do alternation -- Perl can't do it for this particular assertion, see perlretut. We get
Variable length lookbehind not implemented in regex m/(?<!=\s+)this\.execute/ at
We can add varying space \s+ outside of the assertion, with this..., and then combine multiple conditions to provide for a possibility that there is no space between = and this....
However, there's no reason for this if you can use a regular negated match.
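Some engines do accept variable-length lookbehind. JavaScript (ES2018 and later) is one of them, so as a point of comparison only, and not as something the Perl discussed above would accept, the two conditions can be folded into one assertion there:

```javascript
// Variable-length negative lookbehind: the call must not be preceded by
// "=" or "return" followed by any amount of whitespace.
const discarded = /(?<!(?:=|return)\s*)this\.execute\(/;

console.log(discarded.test("this.execute();"));         // result is thrown away
console.log(discarded.test("var x = this.execute();")); // result is assigned
console.log(discarded.test("return this.execute();"));  // result is returned
```

Because \s* may match nothing, this also rejects x=this.execute() with no space after the equals sign.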
The reported error can only be about basic syntax. It is about the exact code you run, not the regex.
I'm not sure I understand the question, but you might consider trying this one: ^this\.execute\(\);
With situations like these, it's best to find the "lowest common denominator" in the matches you want to distinguish from similar-looking strings. In this case, the var x can be ignored - your requirements are satisfied by saying "anything before the method call is ok - the method call alone is not." That statement is probably a bit too tight though, so let's change it to "anything other than whitespace before the method call is ok, otherwise flag the call". Which means:
my $method_call = qr/ ( this \. \w+ ) \( /x;
while (<$fh>) {
if (/ ^ \s* $method_call /x) {
warn "Found method call on line $.: $1\n"
}
}
I'm presuming $fh is a filehandle to the source code file. I've also made some presumptions, which you may need to tweak, about how you want to define a method call, i.e. the opening bracket for parameters is compulsory. Using extended mode (the /x flag) allows the use of whitespace in the regex for easier reading. Also, using the qr// operator allows referring to a regex by name inside another to make things clearer.
If on the other hand, you want to insist on the presence of var x or return before giving the ok, we can reverse the search - ie explicitly look for the "ok" situations and flag any other calls:
my $method_call = qr/ ( this \. \w+ ) \( /x;
while (<$fh>) {
next if / ^ \s* return \s+ $method_call /x; # return OK
next if / ^ \s* var \s+ \w+ \s* = \s* $method_call /x; # var OK
warn "Found method call on line $.: $1\n" if /$method_call/ ;
}
Both of these are a little verbose but show more clearly what you're trying to do.
I don't think we have enough information here. I say this because the following works for me in the shell
~$ echo "execute()"| perl -ne 'print if /^(?!.*=|return).*execute\(\).*/'
execute()
~$ echo "return execute()"| perl -ne 'print if /^(?!.*=|return).*execute\(\).*/'
~$
In the above code, I am running a one liner in a shell that pipes a string into a perl program. The perl program will print the string if it matches the regex. I get no errors from your regex.
It's possible that the error is due to your version of perl or something else entirely may be happening.
I am using perl v5.22.2
I mean, the simple answer is, just use the ! operator on your test, but here's the conversion in case you were wondering:
/expression/ => /^(?!.*expression)/ (either use DOTALL or [^] in JavaScript)
/^expression/ => /^(?!expression)/
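A quick way to convince yourself of the first conversion is to compare both forms on a few inputs. The sketch below uses JavaScript; negate is a hypothetical helper, and [\s\S]* stands in for .* with DOTALL.

```javascript
// Hypothetical helper: build a regex that matches exactly when `source` does not.
function negate(source) {
  // [\s\S]* plays the role of ".*" with DOTALL, so newlines are covered too.
  return new RegExp("^(?![\\s\\S]*(?:" + source + "))");
}

const containsFoobar = /foobar/;
const lacksFoobar = negate("foobar");

for (const s of ["foo", "foobar", "a\nfoobar", ""]) {
  // The two right-hand columns should always agree.
  console.log(JSON.stringify(s), !containsFoobar.test(s), lacksFoobar.test(s));
}
```

The negated regex only ever produces an empty match, so it is useful for testing but not for extracting text.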

Regex matching only a portion of string

I would like to match a portion of a URL in this order.
First, the domain name will remain static, so there is nothing to check there with a regex.
$domain_name = "http://foo.com/";
What I would like to validate is what comes after the last /.
So, my AIM is to create something like.
$strings_only = "[\w+]";
$number_only = "[\d+]";
$numbers_and_strings = "[0-9][a-z][A-Z]";
Now, I would like to just use the above variables to check whether a URL conforms to the patterns mentioned.
$example_url = "http://foo.com/some-title-with-id-1";
var_dump(preg_match({$domain_name}{$strings_only}, $example_url));
The above should return false, because title is NOT $string_only.
$example_url = "http://foo.com/foobartar";
var_dump(preg_match({$domain_name}{$strings_only}, $example_url));
The above should return true, because title is $string_only.
Update:
~^http://foo\.com/[a-z]+/?$~i
~^http://foo\.com/[0-9]+/?$~
~^http://foo\.com/[a-z0-9]+/?$~i
These would be your three expressions to match alphabetical URLs, numeric URLS, and alphanumeric. A couple notes, \w matches [a-zA-Z0-9_] so I don't think it is what you expected. The + inside of your character class ([]) does not have any special meaning, like you may expect. \w and \d are "shorthand character classes" and do not need to be within the [] syntax (however they can be, e.g. [\w.,]). Notice the i modifier, this makes the expressions case-insensitive so we do not need to use [a-zA-Z].
$strings_only = '~^http://foo\.com/[a-z]+/?$~i';
$url = 'http://foo.com/some-title-with-id-1';
var_dump(preg_match($strings_only, $url)); // int(0)
$url = 'http://foo.com/foobartar';
var_dump(preg_match($strings_only, $url)); // int(1)
Test/tweak all of my above expressions with Regex101.
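If you need to know which of the three patterns a URL satisfies, one option is to try them in order. This is a JavaScript sketch of that idea (the labels and the classify name are made up for illustration); the PHP version would loop over the same three expressions with preg_match.

```javascript
// The three expressions from above, tried from most to least specific.
const patterns = [
  ["alphabetic", /^http:\/\/foo\.com\/[a-z]+\/?$/i],
  ["numeric", /^http:\/\/foo\.com\/[0-9]+\/?$/],
  ["alphanumeric", /^http:\/\/foo\.com\/[a-z0-9]+\/?$/i],
];

// Return the label of the first pattern that matches, or null if none does.
function classify(url) {
  for (const [label, re] of patterns) {
    if (re.test(url)) return label;
  }
  return null;
}

console.log(classify("http://foo.com/foobartar"));
console.log(classify("http://foo.com/some-title-with-id-1"));
```

Order matters: an all-letters slug also satisfies the alphanumeric pattern, so the narrower patterns are listed first.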
. matches any character, but only once. Use .* for 0+ or .+ for 1+. However, these will be greedy and match your whole string and can potentially cause problems. You can make it lazy by adding ? to the end of them (meaning it will stop as soon as it sees the next character /). Or, you can specify anything but a / using a negative character class [^/].
My final regex of choice would be:
~^https://stolak\.ru/([^/]+)/?$~
Notice the ~ delimiters, so that you don't need to escape every /. Also, you need to escape the . with \ since it has a special meaning. I threw the [^/]+ URI parameter into a capture group and made the trailing slash optional by using /?. Finally, I anchored this to the beginning and the end of the strings (^ and $, respectively).
Your question was somewhat vague, so I tried to interpret what you wanted to match. If I was wrong, let me know and I can update it. However, I tried to explain it all so that you could learn and tweak it to your needs. Also, play with my Regex101 link -- it will make testing easier.
Implementation:
$pattern = '~^https://stolak\.ru/([^/]+)/?$~';
$url = 'https://stolak.ru/car-type-b1';
preg_match($pattern, $url, $matches);
var_dump($matches);
// array(2) {
// [0]=>
// string(29) "https://stolak.ru/car-type-b1"
// [1]=>
// string(11) "car-type-b1"
// }

RegEx: Word immediately before the last opened parenthesis

I have a little knowledge of regex, but at the moment this is far beyond my abilities.
I need help finding the text before the last open parenthesis that doesn't have a matching close parenthesis.
(It is for the CallTip of an open-source software project in development.)
Below some examples:
--------------------------------
Text                    I need
--------------------------------
aaa(                    aaa
aaa(x)                  ''
aaa(bbb(                bbb
aaa(y=bbb(              bbb
aaa(y=bbb()             aaa
aaa(y <- bbb()          aaa
aaa(bbb(x)              aaa
aaa(bbb(ccc(            ccc
aaa(bbb(x), ccc(        ccc
aaa(bbb(x), ccc()       aaa
aaa(bbb(x), ccc())      ''
--------------------------------
Is it possible to write a RegEx (PCRE) for these situations?
The best I got was \([^\(]+$ but, it is not good and it is the opposite of what I need.
Can anyone help, please?
Take a look at this JavaScript function
var recreg = function(x) {
var r = /[a-zA-Z]+\([^()]*\)/;
while(x.match(r)) x = x.replace(r,'');
return x
}
After applying this you are left with all the unmatched parts, which have no closing parenthesis, and we just need the last alphabetic word.
var lastpart = function(y) { return y.match(/([a-zA-Z]+)\([^(]*$/); }
The idea is to use it like
lastpart(recreg('aaa(y <- bbb()'))
Then check whether the result is null; otherwise take the matching group, which will be result[1]. Most regex engines don't support the (?R) construct, which is needed for recursive regex matching.
Note that this is a sample JavaScript representation which simulated recursive regex.
Read http://www.catonmat.net/blog/recursive-regular-expressions/
This works correctly on all your sample strings:
\w+(?=\((?:[^()]*\([^()]*\))*[^()]*$)
The most interesting part is this:
(?:[^()]*\([^()]*\))*
It matches zero or more balanced pairs of parentheses along with the non-paren characters before and between them (like the y=bbb() and bbb(x), ccc() in your sample strings). When that part is done, the final [^()]*$ ensures that there are no more parens before the end of the string.
Be aware, though, that this regex is based on the assumption that there will never be more than one level of nesting. In other words, it assumes these are valid:
aaa()
aaa(bbb())
aaa(bbb(), ccc())
...but this isn't:
aaa(bbb(ccc()))
The string aaa(bbb(ccc( in your samples seems to imply that multi-level nesting is indeed permitted. If that's the case, you won't be able to solve your problem with regex alone. (Sure, some regex flavors support recursive patterns, but the syntax is hideous even by regex standards. I guarantee you won't be able to read your own regex a week after you write it.)
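If multi-level nesting does occur, a small stack is one way around the limitation. The following JavaScript sketch is an alternative technique, not the regex above: it tokenizes the text into "word(" and ")" and tracks which calls are still open. The name callTipWord is hypothetical, and the scanner ignores string literals and comments.

```javascript
// Return the word before the last "(" that has no matching ")", or "".
function callTipWord(text) {
  const open = [];             // words before each "(" that is still unclosed
  const token = /(\w*)\(|\)/g; // either "word(" or a lone ")"

  for (const m of text.matchAll(token)) {
    if (m[0] === ")") {
      open.pop();              // closes the most recent open call, if any
    } else {
      open.push(m[1]);         // m[1] is "" for a bare "(" with no word
    }
  }
  return open.length > 0 ? open[open.length - 1] : "";
}

console.log(callTipWord("aaa(bbb(x), ccc("));
console.log(callTipWord("aaa(bbb(ccc())"));
```

Each ")" pops exactly one entry, so the depth of nesting does not matter.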
A partial solution - this is assuming that your regex is called from within a programming language that can loop.
1) Prune the input: find matching parentheses and remove them along with everything in between. Keep going until there is no match. The regex would look for \([^()]*\) - an open parenthesis, any number of non-parenthesis characters, a close parenthesis. It has to be part of a "find and replace with nothing" loop. This trims "from the inside out".
2) After the pruning you have either no parentheses left, or only unclosed ones. Now you have to find a word just before an open parenthesis. This requires a regex like \w+\(. But that alone finds the first one if there are multiple unclosed parentheses. Taking the last one can be done with a greedy match (with grouping around the last word): ^.*\b(\w+)\( - "as many characters as you can up to a word before a parenthesis" - this will find the last one.
I am saying "partial" solution because, depending on the environment you are using, how you refer to "this matching group" and whether you need to put a backslash before the parentheses varies. I could not check those details, as it's hard to do on my iPhone.
I hope this inspires you or others to come up with a complete solution.
I'm not sure which regex language/platform you're using for this, or whether subpatterns are allowed on your platform. However, the following two-step PHP code will work for all the cases you listed above:
$str = 'aaa(bbb(x), ccc()'; // your original string
// find and replace all balanced parentheses with blank
$repl = preg_replace('/ ( \( (?: [^()]* | (?1) )* \) ) /x', '', $str);
$matched = '';
// find word just before opening parenthesis in replaced string
if (preg_match('/\w+(?=[^\w(]*\([^(]*$)/', $repl, $arr))
$matched = $arr[0];
echo "*** Matched: [$matched]\n";
Live Demo: http://ideone.com/evXQYt

regex string does not contain substring

I am trying to match a string which does not contain a substring
My string always starts "http://www.domain.com/"
The substring I want to exclude from matches is ".a/" which comes after the string (a folder name in the domain name)
There will be characters in the string after the substring I want to exclude
For example:
"http://www.domain.com/.a/test.jpg" should not be matched
But "http://www.domain.com/test.jpg" should be
Use a negative lookahead assertion as:
^http://www\.domain\.com/(?!\.a/).*$
The part (?!\.a/) fails the match if the domain prefix is immediately followed by the string .a/.
My advice in such cases is not to construct overly complicated regexes with negative lookahead assertions or similar.
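As a quick check of the lookahead, here is the same pattern exercised in JavaScript on the two URLs from the question (the variable name allowed is just for illustration):

```javascript
// Same pattern as above; "/" is escaped only because it is the JS delimiter.
const allowed = /^http:\/\/www\.domain\.com\/(?!\.a\/).*$/;

console.log(allowed.test("http://www.domain.com/.a/test.jpg")); // excluded folder
console.log(allowed.test("http://www.domain.com/test.jpg"));    // ordinary path
```

The lookahead consumes nothing, so .* still matches the full path when the check passes.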
Keep it simple and stupid!
Do 2 matches, one for the positives, and sort out later the negatives (or the other way around). Most of the time, the regexes become easier, if not trivial.
And your program gets clearer.
For example, to extract all lines with foo, but not foobar, I use:
grep foo | grep -v foobar
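The same two-pass idea carries over to any language with a filter function. A minimal JavaScript sketch, with made-up sample lines:

```javascript
const lines = ["foo", "foobar", "a foo and a foobar", "bar"];

// First pass keeps the positives, second pass drops the negatives,
// mirroring "grep foo | grep -v foobar".
const wanted = lines
  .filter((line) => /foo/.test(line))
  .filter((line) => !/foobar/.test(line));

console.log(wanted);
```

As with the grep pipeline, a line containing both a standalone foo and a foobar is dropped by the second pass.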
I would try with
^http:\/\/www\.domain\.com\/([^.]|\.[^a]).*$
You want to match your domain, plus everything that does not continue with a ., and everything that does continue with a . but is not then followed by an a. (You can add the / afterwards if needed.)
If you don't use look ahead, but just simple regex, you can just say, if it matches your domain but doesn't match with a .a/
<?php
function foo($s) {
$regexDomain = '{^http://www\.domain\.com/}';
$regexDomainBadPath = '{^http://www\.domain\.com/\.a/}';
return preg_match($regexDomain, $s) && !preg_match($regexDomainBadPath, $s);
}
var_dump(foo('http://www.domain.com/'));
var_dump(foo('http://www.otherdomain.com/'));
var_dump(foo('http://www.domain.com/hello'));
var_dump(foo('http://www.domain.com/hello.html'));
var_dump(foo('http://www.domain.com/.a'));
var_dump(foo('http://www.domain.com/.a/hello'));
var_dump(foo('http://www.domain.com/.b/hello'));
var_dump(foo('http://www.domain.com/da/hello'));
?>
Note that http://www.domain.com/.a will pass the test, because it doesn't end with /.