Perl Regex negation for multiple words

I need to exclude some URLs for a JMeter test:
Don't exclude:
http://foo/bar/is/valid/with/this
http://foo/bar/is/also/valid/with/that
Exclude:
http://foo/bar/is/not/valid/with/?=action
http://foo/bar/is/not/valid/with/?=action
http://foo/bar/is/not/valid/with/specialword
Please help me.
My regex below isn't working:
foo/(\?=|\?action|\?form_action|specialword).*

First problem: / is the general delimiter so escape it with \/ or alter the delimiter.
Second problem: it only matches when the alternation comes directly after foo/ (foo/?action and so on), so you need to include a wildcard before the parentheses: foo\/.*(\?=|\?action|\?form_action|specialword).*
So:
/foo\/.*(\?=|\?action|\?form_action|specialword).*/
The next problem is that this matches the opposite of what you want: your excludes. You can either fine-tune your regex to do the inverse, or handle this in your language (i.e. if there is no match, do this and that).
Always pay attention to special characters in regex.
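To illustrate the "do the inverse in the regex itself" option, here is a minimal Python sketch using a negative lookahead. The name KEEP and the sample list are made up for illustration, and whether your tool's regex engine supports lookahead is an assumption you would need to check:

```python
import re

# Hypothetical pattern: the negative lookahead (?!...) rejects any URL in
# which one of the excluded fragments occurs somewhere after "foo/".
KEEP = re.compile(r"^(?!.*foo/.*(?:\?=|\?action|\?form_action|specialword)).*$")

urls = [
    "http://foo/bar/is/valid/with/this",
    "http://foo/bar/is/also/valid/with/that",
    "http://foo/bar/is/not/valid/with/?=action",
    "http://foo/bar/is/not/valid/with/specialword",
]

# Only URLs that do NOT contain an excluded fragment match KEEP.
kept = [u for u in urls if KEEP.match(u)]
for u in kept:
    print(u)
```

With this shape a "match" means "keep", so no negation is needed in the calling code.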

There are countless ways to shoot yourself in the foot with regular expressions. You could write some kind of "parser" using /g and /c in a loop, but why bother? It seems like you are already having trouble with the current regular expression.
Break the problem down into smaller parts and everything will be less complicated. You could write yourself some kind of filter for grep like:
use feature 'say';
use URI;

sub filter {
    my $u   = shift;
    my $uri = URI->new($u);
    return if $uri->query;    # reject anything with a query string
    return if grep { $_ eq 'specialword' } $uri->path_segments;
    return $u;
}
say for grep { filter($_) } @urls;
I wouldn't cling that hard to a regular expression, especially if others have to read the code too...
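The same "parse the URL instead of matching it" idea can be sketched in Python with the standard library. The function name keep_url is hypothetical, and the rule (no query string, no path segment equal to specialword) is my reading of the filter above:

```python
from urllib.parse import urlsplit


def keep_url(url: str) -> bool:
    """Return True if the URL should stay in the test plan."""
    parts = urlsplit(url)
    # Reject anything that carries a query string, e.g. "?=action".
    if parts.query:
        return False
    # Reject URLs where one path segment is exactly "specialword".
    return "specialword" not in parts.path.split("/")


urls = [
    "http://foo/bar/is/valid/with/this",
    "http://foo/bar/is/not/valid/with/?=action",
    "http://foo/bar/is/not/valid/with/specialword",
]
kept = [u for u in urls if keep_url(u)]
print(kept)
```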

Change the regex delimiter to something other than '/' so you don't have to escape it in your matches. You might do:
m{//foo/.+(?:\?=action|\?form_action|specialword)$};
The (?: ... ) construct groups without capturing.
Using this, you could say:
print unless m{//foo/.+(?:\?=action|\?form_action|specialword)$};

Your alternation is wrong. foo/(\?=|\?action|\?form_action|specialword) matches any of
foo/?=
foo/?action
foo/?form_action
foo/specialword
so you need instead
m{foo/.*(?:\?=action|\?=form_action|specialword)}
The .* is necessary to account for the possible bar/is/valid/with/this after /foo/.
Note that I have changed your ( .. ) to the non-capturing (?: .. ) and I have used braces for the regex delimiter to avoid having to escape the slashes in the expression.
Finally, you need to write either
unless ($url =~ m{/foo/.*(?:\?=action|\?=form_action|specialword)}) { ... }
or
if ($url !~ m{/foo/.*(?:\?=action|\?=form_action|specialword)}) { ... }
since the regex matches URLs that are to be discarded.
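For comparison, a small Python sketch of why the original pattern finds nothing and how the corrected one is used with a negated test. The variable names are illustrative only:

```python
import re

# The asker's pattern: the alternation must follow "foo/" immediately.
ORIGINAL = re.compile(r"foo/(\?=|\?action|\?form_action|specialword).*")
# The corrected pattern: ".*" allows intermediate path segments.
CORRECTED = re.compile(r"/foo/.*(?:\?=action|\?=form_action|specialword)")

urls = [
    "http://foo/bar/is/valid/with/this",
    "http://foo/bar/is/also/valid/with/that",
    "http://foo/bar/is/not/valid/with/?=action",
    "http://foo/bar/is/not/valid/with/specialword",
]

# The original pattern hits none of the sample URLs.
original_hits = [u for u in urls if ORIGINAL.search(u)]
# The corrected pattern matches the URLs to discard, so keep the non-matches.
kept = [u for u in urls if not CORRECTED.search(u)]
print(original_hits)
print(kept)
```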

Related

Regex: How do I match something that may OR may not be between [ ]

I am parsing a log using Perl and I am stumped as to how I can parse something like this:
from=[ihatethisregex@hotmail.com]
from=ihatethisregex@hotmail.com
What I need is ihatethisregex@hotmail.com and I need to capture this in a named capture group called "email".
I tried the following:
(?<email>(?:\[[^\]]+\])|(?:\S+))
But this captures the square brackets when it parses the first line. I don't want the square brackets. Was wondering if I could do something like this:
(?:\[(?<email>[^\]]+)\])|(?<email>\S+)
and when I evaluate $+{email}, it will just take whichever one that was matched. I also tried the following:
(?:\[?(?<email>(?:[^\]]+\])|(?:\S+)))
But this gave strange results when the email was wrapped in a pair of square brackets.
Any help is appreciated.
/(\[)?your-regexp-here(?(1)\]|)/
 (  )                              capture group #1
  \[                               opening bracket
     ?                             optionally
      your-regexp-here             your regexp
                      (?( )   )    conditional match:
                         1           if capture group #1 evaluated,
                           \]        closing bracket
                             |       else nothing
Note that this does not work in all languages, since conditional match is not a part of a standard regular expression, but rather an extension. Works in Perl, though.
EDIT: misplaced question mark.
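Python's re module also supports this conditional construct, so the idea can be tried outside Perl. A minimal sketch, where the character class used for the address is my own assumption and not part of the answer above:

```python
import re

# (\[)?                  optional opening bracket, remembered as group 1
# (?P<email>[^\[\]\s]+)  the address: anything but brackets and whitespace
# (?(1)\])               require a closing bracket only if group 1 matched
PATTERN = re.compile(r"from=(\[)?(?P<email>[^\[\]\s]+)(?(1)\])")

lines = [
    "from=[ihatethisregex@hotmail.com]",
    "from=ihatethisregex@hotmail.com",
]
emails = [PATTERN.search(line).group("email") for line in lines]
print(emails)
```

Note that Python spells named groups (?P<name>...) where Perl uses (?<name>...).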
I tend to do these kinds of things in two steps, just because it's clearer:
my ($val) = /\w+=(.*)/;
$val =~ s/\[(.*)\]/$1/;
This trims off the [] separately.
Perhaps the following will be helpful:
use strict;
use warnings;
while (<DATA>) {
/from\s*=\s*\[?(?<email>(?:[^\]]+))\]?/;
print $+{email}, "\n";
}
__DATA__
from=[ihatethisregex@hotmail.com]
from=ihatethisregex@hotmail.com
Output:
ihatethisregex@hotmail.com
ihatethisregex@hotmail.com

Regex to find text between second and third slashes

I would like to capture the text that occurs after the second slash and before the third slash in a string. Example:
/ipaddress/databasename/
I need to capture only the database name. The database name might have letters, numbers, and underscores. Thanks.
How you access it depends on your language, but you'll basically just want a capture group for whatever falls between your second and third "/". Assuming your string is always in the same form as your example, this will be:
/.*/(.*)/
If multiple slashes can exist, but a slash can never exist in the database name, make both quantifiers lazy so the first one stops at the second slash:
/.*?/(.*?)/
In the event that your lines always have / at the end of the line:
([^/]*)/$
Alternate split method:
split("/")[2]
The regex would be:
/[^/]*/([^/]*)/
so in Perl, the regex capture statement would be something like:
($database) = $text =~ m!/[^/]*/([^/]*)/!;
Normally the / character is used to delimit regexes but since they're used as part of the match, another character can be used. Alternatively, the / character can be escaped:
($database) = $text =~ /\/[^\/]*\/([^\/]*)\//;
You can shorten the pattern even further:
[^/]+/(\w+)
Here \w includes characters like A-Z, a-z, 0-9 and _
I would suggest giving a split function priority, since in my experience it performs better than regex functions wherever it can be used.
You can use the explode function in PHP, or split in other languages, to do such an operation.
Anyway, here is a regex pattern:
/[\/]*[^\/]+[\/]([^\/]+)/
I know you specifically asked for regex, but you don't really need regex for this. You simply need to split the string by its delimiter (in this case a forward slash), then choose the part you need (in this case the 3rd field, since the first field is empty).
cut example:
cut -d '/' -f 3 <<< "$string"
awk example:
awk -F '/' '{print $3}' <<< "$string"
perl expression, using split function:
(split '/', $string)[2]
etc.
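As a quick side-by-side of the two approaches discussed in these answers, here is a Python sketch. The variable names are illustrative, and the sample string is the one from the question:

```python
import re

text = "/ipaddress/databasename/"

# Split approach: the string starts with "/", so field 0 is empty,
# field 1 is the address and field 2 is the database name.
by_split = text.split("/")[2]

# Regex approach: skip one slash-free segment, capture the next one.
match = re.match(r"/[^/]*/([^/]*)/", text)
by_regex = match.group(1) if match else None

print(by_split, by_regex)
```

Both give the same result on well-formed input; they differ in how they fail (an IndexError from the split versus no match from the regex).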

Regular expression using powershell

Here is the scenario: I have the lines below and want to extract only the middle part between the two dots.
"scvmm.new.resources" --> a regular expression match on this should return only "new"
"sc.new1.rerces" --> a regular expression match on this should return only "new1"
My basic requirement is to extract whatever lies between the two dots; anything can come before and after:
(.*).<required code>.(.*)
Could anyone please help me out?
You can do that without using regex. Split the string on '.' and grab the middle element:
PS> "scvmm.new.resources".Split('.')[1]
new
Or this
'scvmm.new.resources' -replace '.*\.(.*)\..*', '$1'
Like this:
([regex]::Match("scvmm.new1.resources", '(?<=\.)([^\.]*)(?=\.)' )).value
You don't actually need regular expressions for such a trivial substring extraction. As with Shay's Split('.'), one can use IndexOf() to similar effect, like so:
$s = "scvmm.new.resources"
$l = $s.IndexOf(".")+1
$r = $s.IndexOf(".", $l)
$s.Substring($l, $r-$l) # Prints new
$s = "sc.new1.rerces"
$l = $s.IndexOf(".")+1
$r = $s.IndexOf(".", $l)
$s.Substring($l, $r-$l) # Prints new1
This looks for the first occurrence of a dot. Then it looks for the first occurrence of a dot after the first hit. Then it extracts the characters between the two locations. This is useful in, say, scenarios in which the separation characters are not the same (though the Split() way would work in many cases too).
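The index-based steps above translate almost one-to-one to other languages. A Python sketch for comparison, where the function name is made up and str.index raises ValueError if a dot is missing:

```python
def between_first_two_dots(s: str) -> str:
    """Return the text between the first and second '.' in s."""
    left = s.index(".") + 1      # position just after the first dot
    right = s.index(".", left)   # position of the next dot after that
    return s[left:right]


print(between_first_two_dots("scvmm.new.resources"))
print(between_first_two_dots("sc.new1.rerces"))
```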

Perl - regex - I want to read and search each line for a string followed by a ";"

I'm playing and learning Perl so that I can read log files. I want to search every line and look for a string of alphanumeric followed by this ; at the beginning of each line.
This is part of what I have:
if ($line =~ /\S([a-zA-Z][a-zA-Z0-9]*)/)
but I think this is wrong.
Please advise.
"Alphanumeric" is a bit ambiguous now, since many people still infected with ASCII think it means A-Z with 0-9, but Perl thinks about it differently depending on the version (Know your character classes under different semantics). As with any regular expression, your job is to design a pattern that includes only what you want and doesn't exclude anything that you do want.
Also, many people still use the ^ to mean the beginning of the string, which it does if there's no /m flag. However, the re module can now set default flags, so your regex might not be what you think it is when another programmer tries to be helpful.
I tend to write things like:
my $alphanum = qr/[a-z0-9]/i;
my $regex = qr/
    \A               # absolute start of string
    (?:$alphanum)+   # I can change this elsewhere
    ;
/x;
if( $line =~ $regex ) { ... }
Try:
if ($line =~ /^[a-z0-9]+;/i) { ... }
^ matches the start of a line. The + matches once or more. /i makes the search case-insensitive.
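The difference between ^ and \A discussed above can be seen in a short Python sketch (Python's re has the same two anchors). The names LINE_START and starts_with_token are illustrative, and the explicit ASCII character class is my assumption about what "alphanumeric" should mean here:

```python
import re

# \A always means "start of the string", whatever flags are in effect.
LINE_START = re.compile(r"\A[A-Za-z0-9]+;")


def starts_with_token(line: str) -> bool:
    """True if the line begins with ASCII alphanumerics followed by ';'."""
    return LINE_START.match(line) is not None


# With the multiline flag, ^ also matches after an embedded newline.
text = "!!\nabc;"
caret_multiline = re.search(r"^[a-z0-9]+;", text, re.M | re.I) is not None
anchored = LINE_START.search(text) is not None

print(starts_with_token("abc123;rest"), caret_multiline, anchored)
```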

Regex: delete contents of square brackets

Is there a regular expression that can be used with search/replace to delete everything occurring within square brackets (and the brackets)?
I've tried \[.*\] which chomps extra stuff (e.g. "[chomps] extra [stuff]")
Also, the same thing with lazy matching \[.*?\] doesn't work when there is a nested bracket (e.g. "stops [chomping [too] early]!")
Try something like this:
$text = "stop [chomping [too] early] here!";
$text =~ s/\[([^\[\]]|(?0))*]//g;
print($text);
which will print (with two spaces, since only the bracketed text is removed):
stop  here!
A short explanation:
\[            # match '['
(             # start group 1
  [^\[\]]     #   match any char except '[' and ']'
  |           #   OR
  (?0)        #   recursively match group 0 (the entire pattern!)
)*            # end group 1 and repeat it zero or more times
]             # match ']'
The regex above will get replaced with an empty string.
You can test it online: http://ideone.com/tps8t
EDIT
As @ridgerunner mentioned, you can make the regex more efficient by making the * and the character class [^\[\]] match once or more and making them possessive, and even by turning group 1 into a non-capturing group:
\[(?:[^\[\]]++|(?0))*+]
But a real improvement in speed might only be noticeable when working with large strings (you can test it, of course!).
This is technically not possible with regular expressions because the language you're matching does not meet the definition of "regular". There are some extended regex implementations that can do it anyway using recursive expressions, among them are:
Greta:
http://easyethical.org/opensource/spider/regexp%20c++/greta2.htm#_Toc39890907
and
PCRE
http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions
See "Recursive Patterns", which has an example for parentheses.
A PCRE recursive bracket match would look like this:
\[(?:[^\[\]]|(?R))*\]
edit:
Since you added that you're using Perl, here's a page that explicitly describes how to match balanced pairs of operators in Perl:
http://perldoc.perl.org/perlfaq6.html#Can-I-use-Perl-regular-expressions-to-match-balanced-text%3f
Something like:
$string =~ m/(\[(?:[^\[\]]++|(?1))*\])/xg;
Since you're using Perl, you can use modules from the CPAN and not have to write your own regular expressions. Check out the Text::Balanced module that allows you to extract text from balanced delimiters. Using this module means that if your delimiters suddenly change to {}, you don't have to figure out how to modify a hairy regular expression, you only have to change the delimiter parameter in one function call.
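To show why parameterized delimiters are convenient, here is a hand-rolled Python sketch that deletes balanced delimited text with a depth counter. This is not Text::Balanced and does not mirror its API; the function name and behavior on malformed input (a stray closer is kept, an unclosed opener swallows the rest) are my own choices:

```python
def strip_bracketed(text: str, open_ch: str = "[", close_ch: str = "]") -> str:
    """Remove everything inside (possibly nested) open_ch ... close_ch pairs."""
    out = []
    depth = 0  # how many unclosed openers we are currently inside
    for ch in text:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)  # keep only characters outside all pairs
    return "".join(out)


print(strip_bracketed("stops [chomping [too] early]!"))
print(strip_bracketed("a{b{c}d}e", "{", "}"))
```

Switching from [] to {} is just a change of arguments, which is the point the answer makes about using a module instead of a hairy regex.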
If you are only concerned with deleting the contents and not capturing them to use elsewhere you can use a repeated removal from the inside of the nested groups to the outside.
my $string = "stops [chomping [too] early]!";
# remove any [...] sequence that doesn't contain a [...] inside it
# and keep doing it until there are no [...] sequences to remove
1 while $string =~ s/\[[^\[\]]*\]//g;
print $string;
The 1 while will basically do nothing while the condition is true. If a s/// matches and removes a bracketed section the loop is repeated and the s/// is run again.
This will work even if you're using an older version of Perl or another language that doesn't support the (?0) recursion extended pattern in Bart Kiers's answer.
You want to remove only things between the []s that aren't []s themselves, i.e.:
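As an example of "another language", Python's standard re module has no recursion, so the inside-out loop is a natural fit there. A minimal sketch, with illustrative names:

```python
import re

# Matches a [...] pair that contains no other bracket, i.e. an innermost pair.
INNERMOST = re.compile(r"\[[^\[\]]*\]")


def remove_nested(text: str) -> str:
    """Delete bracketed text, peeling nested pairs from the inside out."""
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:  # nothing left to remove
            return text


print(remove_nested("stops [chomping [too] early]!"))
```

On the sample string the loop runs three passes: one removes [too], one removes the now-innermost outer pair, and the last finds nothing and stops.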
\[[^\]]*\]
Which is a pretty hairy mess of []s ;-)
It won't handle multiple nested []s though. IE, matching [foo[bar]baz] won't work.