Regex: How do I match something that may OR may not be between [ ] - regex

I am parsing a log using Perl and I am stumped as to how I can parse something like this:
from=[ihatethisregex#hotmail.com]
from=ihatethisregex#hotmail.com
What I need is ihatethisregex#hotmail.com and I need to capture this in a named capture group called "email".
I tried the following:
(?<email>(?:\[[^\]]+\])|(?:\S+))
But this captures the square brackets when it parses the first line. I don't want the square brackets. Was wondering if I could do something like this:
(?:\[(?<email>[^\]]+)\])|(?<email>\S+)
and when I evaluate $+{email}, it will just take whichever one that was matched. I also tried the following:
(?:\[?(?<email>(?:[^\]]+\])|(?:\S+)))
But this gave strange results when the email was wrapped in a pair of square brackets.
Any help is appreciated.

/(\[)?your-regexp-here(?(1)\]|)/
( ) capture group #1
\[ opening bracket
? optionally
your-regexp-here your regexp
(?( ) ) conditional match:
1 if capture group #1 evaluated,
\] closing bracket
| else nothing
Note that this does not work in all languages, since conditional match is not a part of a standard regular expression, but rather an extension. Works in Perl, though.
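As a rough sketch of how the conditional could be combined with the named group from the question (the character class [^\]\s]+ standing in for "your-regexp-here" is only an assumption about what an address may contain):

```perl
use strict;
use warnings;

my @emails;
for my $line ('from=[ihatethisregex#hotmail.com]', 'from=ihatethisregex#hotmail.com') {
    # Group 1 records whether an opening bracket was seen;
    # (?(1)\]) then demands the closing bracket only in that case.
    if ($line =~ /from=(\[)?(?<email>[^\]\s]+)(?(1)\])/) {
        push @emails, $+{email};
        print "$+{email}\n";
    }
}
```

Both lines should yield the bare address without brackets.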
EDIT: misplaced question mark.

I tend to do these kinds of things in two steps, just because it's clearer:
my ($val) = /\w+=(.*)/;
$val =~ s/\[(.*)\]/$1/;
This trims off the [] separately.

Perhaps the following will be helpful:
use strict;
use warnings;
while (<DATA>) {
/from\s*=\s*\[?(?<email>(?:[^\]]+))\]?/;
print $+{email}, "\n";
}
__DATA__
from=[ihatethisregex#hotmail.com]
from=ihatethisregex#hotmail.com
Output:
ihatethisregex#hotmail.com
ihatethisregex#hotmail.com

Related

How to do Perl regex with alternation and substitution

I wish to transform "eAlpha eBeta eGamma" into "fAlpha fBeta fGamma." Of course this is just a simplified example of more complex substitutions.
Here is my perl program:
my $data= "eAlpha eBeta eGamma";
$data=~ s/(e)(Alpha)|(e)(Beta)|(e)(Gamma)/f$2/g;
say $data;
The output is
fAlpha f f
Perl regex seems to remember the $1 but not the $2. Is there a way to use regex alternation, global substitution, and capture variables like $1, $2?
There are never more than 3 alternates so I could do it in three steps but wish not to.
Any help would be appreciated.
You can use a positive lookahead with alternation, so that the regex matches only the e and substitutes it with f.
my $data = "eAlpha eBeta eGamma";
$data=~ s/e(?=Alpha|Beta|Gamma)/f/g;
print($data);
Prints,
fAlpha fBeta fGamma
The capture groups go from left to right without "respecting" the |. So Beta gets captured in $4 and Gamma in $6.
But you can condense it to two groups and additionally make the first group non-capturing with ?: (or don't use a group at all, as it's just a single character, but you may have something else in there in the "real thing" that requires a group).
...
$data =~ s/(?:e)(Alpha|Beta|Gamma)/f$1/g;
...
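To make the numbering visible, here is a small sketch that prints all six groups for each word and then applies the condensed substitution (the variable names are illustrative only):

```perl
use strict;
use warnings;

my @report;
for my $word (qw(eAlpha eBeta eGamma)) {
    if ($word =~ /(e)(Alpha)|(e)(Beta)|(e)(Gamma)/) {
        # Show each of the six groups, with "-" for the ones that did not take part.
        my @groups = map { defined $_ ? $_ : '-' } ($1, $2, $3, $4, $5, $6);
        push @report, "${word}: @groups";
    }
}
print "$_\n" for @report;

# One capturing group shared by all alternatives, so $1 is always set.
my $data = "eAlpha eBeta eGamma";
$data =~ s/(?:e)(Alpha|Beta|Gamma)/f$1/g;
print "$data\n";
```

Only the pair of groups belonging to the alternative that matched is defined, which is why f$2 came out empty for Beta and Gamma.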

Capturing text before and after a C-style code block with a Perl regular expression

I am trying to capture some text before and after a C-style code block using a Perl regular expression. So far this is what I have:
use strict;
use warnings;
my $text = << "END";
int max(int x, int y)
{
if (x > y)
{
return x;
}
else
{
return y;
}
}
// more stuff to capture
END
# Regex to match a code block
my $code_block = qr/(?&block)
(?(DEFINE)
(?<block>
\{ # Match opening brace
(?: # Start non-capturing group
[^{}]++ # Match non-brace characters without backtracking
| # or
(?&block) # Recursively match the last captured group
)* # Match 0 or more times
\} # Match closing brace
)
)/x;
# $2 ends up undefined after the match
if ($text =~ m/(.+?)$code_block(.+)/s){
print $1;
print $2;
}
I am having an issue with the 2nd capture group not being initialized after the match. Is there no way to continue a regular expression after a DEFINE block? I would think that this should work fine.
$2 should contain the comment below the block of code but it doesn't and I can't find a good reason why this isn't working.
Capture groups are numbered left-to-right in the order they occur in the regex, not in the order they are matched. Here is a simplified view of your regex:
m/
(.+?) # group 1
(?: # the $code_block regex
(?&block)
(?(DEFINE)
(?<block> ... ) # group 2
)
)
(.+) # group 3
/xs
Named groups can also be accessed as numbered groups.
The 2nd group is the block group. However, this group is only used as a named subpattern, not as a capture. As such, the $2 capture value is undef.
As a consequence, the text after the code-block will be stored in capture $3.
There are two ways to deal with this problem:
For complex regexes, only use named capture. Consider a regex to be complex as soon as you assemble it from regex objects, or if captures are conditional. Here:
if ($text =~ m/(?<before>.+?)$code_block(?<afterwards>.+)/s){
print $+{before};
print $+{afterwards};
}
Put all your defines at the end, where they can't mess up your capture numbering. For example, your $code_block regex would only define a named pattern which you then invoke explicitly.
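A minimal sketch of that second option, assuming the whole pattern is written in one place rather than assembled from a separate $code_block object (the sample text is a shortened version of the one in the question):

```perl
use strict;
use warnings;

my $text = <<'END';
int max(int x, int y)
{
    if (x > y) { return x; }
    else { return y; }
}
// more stuff to capture
END

# The DEFINE group comes last, so "before" is group 1 and "after" is group 2;
# the named pattern itself takes group 3, which is never set.
my ($before, $after) = $text =~ m/
    (.+?)          # group 1: text before the block
    (?&block)      # invoke the named pattern defined below
    (.+)           # group 2: text after the block
    (?(DEFINE)
        (?<block> \{ (?: [^{}]++ | (?&block) )* \} )
    )
/xs;

print "before: $before";
print "after: $after";
```

With this layout the numbered captures line up with what the question expected from $1 and $2.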
There are also ready tools that can be leveraged for this, in a few lines of code.
Perhaps the first module to look at is the core Text::Balanced.
In list context, extract_bracketed returns the matched substring, the remainder of the string after the match, and the substring before the match. We can then keep matching in the remainder:
use warnings;
use strict;
use feature 'say';
use Text::Balanced qw/extract_bracketed/;
my $text = 'start {some {stuff} one} and {more {of it} two}, and done';
my ($match, $lead);
while (1) {
($match, $text, $lead) = extract_bracketed($text, '{', '[^{]*');
say $lead // $text;
last if not defined $match;
}
which prints
start
and
, and done
Once there is no match we need to print the remainder, thus $lead // $text (as there can be no $lead either). The code uses $text directly and modifies it, down to the last remainder; if you'd like to keep the original text save it away first.
I've used a made-up string above, but I tested it on your code sample as well.
This can also be done using Regexp::Common.
Break the string using its $RE{balanced} regex, then take odd elements
use Regexp::Common qw(balanced);
my @parts = split /$RE{balanced}{-parens=>'{}'}/, $text;
my @out_of_blocks = @parts[ grep { $_ & 1 } 1..$#parts ];
say for @out_of_blocks;
If the string starts with the delimiter the first element is an empty string, as usual with split.
To clean out leading and trailing spaces pass it through map { s/^\s*|\s*$//gr }.
You're very close.
(?(DEFINE)) will define the expression & parts you want to use but it doesn't actually do anything other than define them. Think of this tag (and everything it envelops) as you defining variables. That's nice and clean, but defining the variables doesn't mean the variables get used!
You want to use the code block after defining it so you need to add the expression after you've declared your variables (like in any programming language)
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)
This part defines your variables
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
This part calls your variables into use.
(?&block)
Edits
Edit 1
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)\s*(?:\/\/|\/\*)([\s\S]*?)(?:\r\n|\r|\n|$)
The regex above will get the comment after a block (as you've already defined).
You had a . which will match any character (except newline - unless you use the s modifier which specifies that . should also match newline characters)
Edit 2
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)\s*(?:(?:\/\/([\s\S]*?)(?:\r\n|\r|\n|$))|\/\*([\s\S]*?)\*\/)
This regex is more syntactically correct for capturing comments. The previous edit will work with /* up until a new line or end of file. This one will work until the closing tag or end of file.
Edit 3
As for your code not working, I'm not exactly sure. You can see your code running here and it seems to be working just fine. I would use one of the regular expressions I've written above instead.
Edit 4
I think I finally understand what you're saying. What you're trying to do is impossible with regex. You cannot reference a group without capturing it, so the only true solution is to capture it. There is, however, a workaround that fits your situation. If you want to grab the first and last sections without the second section, you can use the following regex. Its downside is that it will not check the second section for proper syntax. If you do need to check the syntax, you're going to have to deal with there being an additional capture group.
(.+?)\{.*\}\s*(?:(?:\/\/([\s\S]*?)(?:\r\n|\r|\n|$))|\/\*([\s\S]*?)\*\/)
This regex captures everything before the { character, then matches everything after it until it meets } followed by any whitespace, and finally by //. This, however, will break if you have a comment within a block of code (after a })

Can not catch substring by regex which ends with tab

I have two types of strings:
1: ANN=abcdefgh;blabla
2 wrong version: ANN=abcdefgh\tyxz\tyxz
2 actual version: ANN=abcdefgh
Now I want to extract the abcdefgh with a regex. So the start to extract is always after "ANN=". But the end is either a semicolon (;) or the FIRST occurrence of a tab.
How does the regex for this look? I tried:
(my @splitUpAnn) = $tabValues[7] =~ /ANN=(.*)[;\t]/;
But I always get just the version 1 with the semicolon back, but it does not work for version two...
EDIT: To be clear. I did not get back ANYTHING for the version two. The problem is NOT that the last tab is used!
EDIT2: Oops, there was something different in the input data than expected. Either I have a semicolon at the end or NOTHING (see "2 actual version"). Sorry for that! So what would the regex then be?
Use .*? instead of .*.
.* is greedy, so it matches up to the last occurrence of a tab.
Just use the non-greedy quantifier *? that matches the least it can:
for my $string ('ANN=abcdefgh;blabla', "ANN=abcdefgh\tyxz\tyxz") {
(my @splitUpAnn) = $string =~ /ANN=(.*?)[;\t]/;
print "@splitUpAnn\n";
}
If you want to get the string up to the first semicolon if present, or everything otherwise, just use
$string =~ /ANN=([^;]*)/
i.e. capture everything that's not a semicolon.
/ANN=(.*?)[;\t]/
Make your regex non greedy.
.* is greedy and will match up to the last ; or \t available.
my ($ann) = $tabValues[7] =~ /ANN=([^;\t]*)/;
The leading ^ negates the character class, so [^;\t] matches any character except ; and tab.
There are multiple suggestions of making your .* non-greedy, but using non-greediness as anything but an optimization is very fragile and error prone.
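A quick sketch of the negated character class [^;\t] on the three input shapes from the question, including the case where nothing follows the value (the @inputs and @values names are made up for the example):

```perl
use strict;
use warnings;

# Sample inputs from the question, including the "nothing after the value" case.
my @inputs = (
    'ANN=abcdefgh;blabla',
    "ANN=abcdefgh\tyxz\tyxz",
    'ANN=abcdefgh',
);

my @values;
for my $string (@inputs) {
    # Capture everything after "ANN=" up to the first ";" or tab, or to the end.
    my ($ann) = $string =~ /ANN=([^;\t]*)/;
    push @values, $ann;
    print "$ann\n";
}
```

Because the class cannot cross a semicolon or tab, the result does not depend on greediness, and no terminator is required after the value.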
I've tested and I got a match
if ( "ANN=abcdefgh;blabla" =~ /(ANN=(.*)[;\t])/ ) {
print $1."\n" ;}
if ( "ANN=abcdefgh\tyxz\tyxz" =~ /(ANN=(.*)[;\t])/ ) {
print $1."\n" ;}
result is:
ANN=abcdefgh;
ANN=abcdefgh yxz
So:
your regex really is greedy, as described in previous answers
Perhaps the problem lies in the way you put the values in the array, but the regexp is correct

Perl Regex negation for multiple words

I need to exclude some URLs for a jMeter test:
Don't exclude:
http://foo/bar/is/valid/with/this
http://foo/bar/is/also/valid/with/that
exclude:
http://foo/bar/is/not/valid/with/?=action
http://foo/bar/is/not/valid/with/?=action
http://foo/bar/is/not/valid/with/specialword
Please help me?
My following regex isn't working:
foo/(\?=|\?action|\?form_action|specialword).*
First problem: / is the general delimiter so escape it with \/ or alter the delimiter.
Second Problem: It will match only foo/action and so on, you need to include a wildcard before the brackets: foo\/.*(\?=|\?action|\?form_action|specialword).*
So:
/foo\/.*(\?=|\?action|\?form_action|specialword).*/
Next problem is that this will match the opposite: Your excludes. You can either finetune your regex to do the inverse OR you can handle this in your language (i.e. if there is no match, do this and that).
Always pay attention to special characters in regex. See here also.
There are countless ways to shoot yourself in the foot with regular expressions. You could write some kind of "parser" using /g and /c in a loop, but why bother? It seems like you are already having trouble with the current regular expression.
Break the problem down into smaller parts and everything will be less complicated. You could write yourself some kind of filter for grep like:
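use URI;
use feature 'say';

# Return the URL if it should be kept, or nothing if it should be excluded.
sub filter {
    my $u   = shift;
    my $uri = URI->new($u);
    return if $uri->query;                                       # has a query string
    return if grep { $_ eq 'specialword' } $uri->path_segments;  # has the special path segment
    return $u;
}
say for grep { filter($_) } @urls;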
I wouldn't cling that hard to a regular expression, especially if others have to read the code too...
Change the regex delimiter to something other than '/' so you don't have to escape it in your matches. You might do:
m{//foo/.+(?:\?=action|\?form_action|specialword)$};
The ?: denotes grouping-only.
Using this, you could say:
print unless m{//foo/.+(?:\?=action|\?form_action|specialword)$};
Your alternation is wrong. foo/(\?=|\?action|\?form_action|specialword) matches any of
foo/?=
foo/?action
foo/?form_action
foo/specialword
so you need instead
m{foo/.*(?:\?=action|\?=form_action|specialword)}
The .* is necessary to account for the possible bar/is/valid/with/this after /foo/.
Note that I have changed your ( .. ) to the non-capturing (?: .. ) and I have used braces for the regex delimiter to avoid having to escape the slashes in the expression.
Finally, you need to write either
unless ($url =~ m{/foo/.*(?:\?=action|\?=form_action|specialword)}) { ... }
or
if ($url !~ m{/foo/.*(?:\?=action|\?=form_action|specialword)}) { ... }
since the regex matches URLs that are to be discarded.
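Put together, a filter along these lines might look as follows. This is a sketch only: the @urls list is copied from the question, and $exclude and @kept are names made up for the example.

```perl
use strict;
use warnings;

my @urls = (
    'http://foo/bar/is/valid/with/this',
    'http://foo/bar/is/also/valid/with/that',
    'http://foo/bar/is/not/valid/with/?=action',
    'http://foo/bar/is/not/valid/with/specialword',
);

# The pattern describes the URLs to discard, so keep the ones that do NOT match.
my $exclude = qr{/foo/.*(?:\?=action|\?=form_action|specialword)};
my @kept = grep { $_ !~ $exclude } @urls;
print "$_\n" for @kept;
```

Only the first two URLs should survive the grep.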

Regex: delete contents of square brackets

Is there a regular expression that can be used with search/replace to delete everything occurring within square brackets (and the brackets)?
I've tried \[.*\] which chomps extra stuff (e.g. "[chomps] extra [stuff]")
Also, the same thing with lazy matching \[.*?\] doesn't work when there is a nested bracket (e.g. "stops [chomping [too] early]!")
Try something like this:
$text = "stop [chomping [too] early] here!";
$text =~ s/\[([^\[\]]|(?0))*]//g;
print($text);
which will print:
stop here!
A short explanation:
\[ # match '['
( # start group 1
[^\[\]] # match any char except '[' and ']'
| # OR
(?0) # recursively match group 0 (the entire pattern!)
)* # end group 1 and repeat it zero or more times
] # match ']'
The regex above will get replaced with an empty string.
You can test it online: http://ideone.com/tps8t
EDIT
As @ridgerunner mentioned, you can make the regex more efficient by making the character class [^\[\]] match once or more, making both it and the * possessive, and turning group 1 into a non-capturing group:
\[(?:[^\[\]]++|(?0))*+]
But a real improvement in speed might only be noticeable when working with large strings (you can test it, of course!).
This is technically not possible with regular expressions because the language you're matching does not meet the definition of "regular". There are some extended regex implementations that can do it anyway using recursive expressions, among them are:
Greta:
http://easyethical.org/opensource/spider/regexp%20c++/greta2.htm#_Toc39890907
and
PCRE
http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions
See "Recursive Patterns", which has an example for parentheses.
A PCRE recursive bracket match would look like this:
\[(?:[^\[\]]|(?R))*\]
edit:
Since you added that you're using Perl, here's a page that explicitly describes how to match balanced pairs of operators in Perl:
http://perldoc.perl.org/perlfaq6.html#Can-I-use-Perl-regular-expressions-to-match-balanced-text%3f
Something like:
$string =~ m/(\[(?:[^\[\]]++|(?1))*\])/xg;
Since you're using Perl, you can use modules from the CPAN and not have to write your own regular expressions. Check out the Text::Balanced module that allows you to extract text from balanced delimiters. Using this module means that if your delimiters suddenly change to {}, you don't have to figure out how to modify a hairy regular expression, you only have to change the delimiter parameter in one function call.
If you are only concerned with deleting the contents and not capturing them to use elsewhere you can use a repeated removal from the inside of the nested groups to the outside.
my $string = "stops [chomping [too] early]!";
# remove any [...] sequence that doesn't contain a [...] inside it
# and keep doing it until there are no [...] sequences to remove
1 while $string =~ s/\[[^\[\]]*\]//g;
print $string;
The 1 while will basically do nothing while the condition is true. If a s/// matches and removes a bracketed section the loop is repeated and the s/// is run again.
This will work even if you're using an older version of Perl or another language that doesn't support the (?0) recursion extended pattern in Bart Kiers's answer.
You want to remove only things between the []s that aren't []s themselves, i.e.:
\[[^\]]*\]
Which is a pretty hairy mess of []s ;-)
It won't handle multiple nested []s though, i.e. matching [foo[bar]baz] won't work.
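A small sketch of that limitation, assuming Perl 5.14 or later for the /r flag (the sample strings and variable names are invented for the example):

```perl
use v5.14;
use warnings;

my $pattern = qr/\[[^\]]*\]/;

my $flat   = '[chomps] extra [stuff]';
my $nested = 'matching [foo[bar]baz] fails';

# Flat brackets are removed cleanly.
my $flat_out = $flat =~ s/$pattern//gr;

# With nesting, the match stops at the first "]" and leaves "baz]" behind.
my $nested_out = $nested =~ s/$pattern//gr;

print "flat:   '$flat_out'\n";
print "nested: '$nested_out'\n";
```

For nested input, the recursive pattern or the repeated inside-out removal shown in the other answers is needed instead.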