Perl Regular Expression to match special characters - regex

I have the below code in which I am checking for a specific variable location in an array excluded.It works fine with all the array elements except one (abc/def/libraries/linux_3.2.60-1+deb7u3.dsc). When I provide this element as my location its printing "location not excluded" , even though its excluded.
How can I made my code to get this element as well as excluded?
use strict;
use warnings;
my #excluded = (
"xyz/efg/headers/",
"abc/def/libraries/jni-mr.h",
"abc/def/libraries/linux_3.2.60-1+deb7u3.dsc",
);
my $location = "abc/def/libraries/linux_3.2.60-1+deb7u3.dsc";
my $badpath = 0;
foreach (#excluded) {
# -- Check if location is contained in excluded array
if ($location =~ /^$_/) {
$badpath = 1;
print "location is excluded : $location \n";
}
}
if (! $badpath) {
print "location is not excluded : $location \n";
}
Desired Output:
location is excluded : abc/def/libraries/linux_3.2.60-1+deb7u3.dsc
Current Output:
location is not excluded : abc/def/libraries/linux_3.2.60-1+deb7u3.dsc

Use quotemeta($text) or \Q$text\E (inside double quotes or a regex literal) to create a pattern that matches the value of $text. In other words, use
if ($location =~ /^\Q$_\E/)
instead of:
if ($location =~ /^$_/)

It looks like you intend to define your exclusions by regex, but you have not escaped regex metachars properly in those regexes. For your failing case, the metachar causing it to fail is the plus (+), which is a one-or-more multiplier in most regex flavors (including Perl), but you need to match it literally.
Also, I'd recommend moving the ^ anchor from the loop to each individual regex, which would make the code more flexible, in that you could choose not to anchor some of the exclusion regexes if you wanted.
Also, you should use the qr() construct, which allows you to precompile regexes, saving on CPU.
Also, this requirement is a good candidate for using grep().
use strict;
use warnings;
my #excluded = (
qr(^xyz/efg/headers/),
qr(^abc/def/libraries/jni-mr\.h),
qr(^abc/def/libraries/linux_3\.2\.60-1\+deb7u3\.dsc),
);
my $location = 'abc/def/libraries/linux_3.2.60-1+deb7u3.dsc';
# -- Check if location is contained in excluded array
my $badpath = scalar(grep($location =~ $_, #excluded )) >= 1 ? 1 : 0;
if ($badpath) {
print "location is excluded : $location \n";
} else {
print "location is not excluded : $location \n";
}

Related

Search for substring and store another part of the string as variable in perl

I am revamping an old mail tool and adding MIME support. I have a lot of it working but I'm a perl dummy and the regex stuff is losing me.
I had:
foreach ( #{$body} ) {
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
if ( $delimit ) {
next if (/$delimit/ && ! $tp);
last if (/$delimit/ && $tp);
$tp = 1, next if /text.plain/;
$tp = 0, next if /text.html/;
s/<[^>]*>//g;
$newbody .= $_ if $tp;
} else {
s/<[^>]*>//g;
$newbody .= $_ ;
}
} # End Foreach
Now I have $body_text as the plain text mail body thanks to MIME::Parser. So now I just need this part to work:
foreach ( #{$body_text} ) {
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
} # End Foreach
The actual challenge is to find NEMS=12345 or NEMS=1234567 and set $nems=12345 if found. I think I have a very basic syntax problem with the test because I'm not exposed to perl very often.
A coworker suggested:
foreach (split(/\n/,$body_text)){
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
}
Which seems to be working, but it may not be the preferred way?
edit:
So this is the most current version based on tips here and testing:
foreach (split(/\n/,$body_text)){
next if /^$/;
if ( /NEMS/i ) {
/^\s*NEMS\s*=\s*(\d+)/i;
$nems = $1;
next;
}
}
Match the last two digits as optional and capture the first five, and assign the capture directly
($nems) = /(\d{5}) (?: \d{2} )?/x; # /x allows spaces inside
The construct (?: ) only groups what's inside, without capture. The ? after it means to match that zero or one time. We need parens so that it applies to that subpattern only. So the last two digits are optional -- five digits or seven digits match. I removed the unneeded .*? and .*
However, by what you say it appears that the whole thing can be simplified
if ( ($nems) = /^\s*NEMS \s* = \s* (\d{5}) (?:\d{2})?/ix ) { next }
where there is now no need for if (/NEMS/) and I've adjusted to the clarification that NEMS is at the beginning and that there may be spaces around =. Then you can also say
my $nems;
foreach ( split /\n/, $body_text ) {
# ...
next if ($nems) = /^\s*NEMS\s*=\s*(\d{5})(?:\d{2})?/i;
# ...
}
what includes the clarification that the new $body_text is a multiline string.
It is clear that $nems is declared (needed) outside of the loop and I indicate that.
This allows yet more digits to follow; it will match on 8 digits as well (but capture only the first five). This is what your trailing .* in the regex implies.
Edit It's been clarified that there can only be 5 or 7 digits. Then the regex can be tightened, to check whether input is as expected, but it should work as it stands, too.
A few notes, let me know if more would be helpful
The match operator returns a list so we need the parens in ($nems) = /.../;
The ($nems) = /.../ syntax is a nice shortcut, for ($nems) = $_ =~ /.../;.
If you are matching on a variable other than $_ then you need the whole thing.
You always want to start Perl programs with
use warnings 'all';
use strict;
This directly helps and generally results in better code.
The clarification of the evolved problem understanding states that all digits following = need be captured into $nems (and there may be 5,(not 6),7,8,9,10 digits). Then the regex is simply
($nems) = /^\s*NEMS\s*=\s*(\d+)/i;
where \d+ means a digit, one or more times. So a string of digits (match fails if there are none).

Regex help need to match an ampersand OR and end of string

I'm trying to create a regex to match part of a URL
The possible URLs might be
www.mysite.com?userid=123xy
www.mysite.com?userid=123x&username=joe
www.mysite.com?tag=xyz&userid=1ww45
www.mysite.com?tag=xyz&userid=1g3x5&username=joe
I'm trying to match the userid=123456
So far I have
Dim r As New Regex("[&?]userID.*[?&]")
Debug.WriteLine(r.Match(strUrl))
But this is only matching lines 2 and 4.
Can anyone help?
(?<=[?&]userid=)[^&#\s]*
Output:
123xy
123x
1ww45
1g3x5
A few points:
This works both if you are matching one URL at a time and if you have a whitespace-separated set.
This captures the username only. It uses the non-capturing positive look-behind assertion since you only care about the username.
The fragment part, if present, will be ignored (e.g. if the URL looked like this: www.mysite.com?tag=xyz&userid=1ww45#top)
If the case of userid doesn't matter, use RegexOptions.IgnoreCase.
I got it:
[&?]userID=[^\s&#]+
PHP solution:
"/[\\?&]userid=([^&]*)/"
Tests:
$tests = [
[
"regex" => "/[\\?&]userid=([^&]*)/",
"expected" => "123xy",
"inputs" => [
"www.mysite.com?userid=123xy",
"www.mysite.com?userid=123xy&username=joe",
"www.mysite.com?tag=xyz&userid=123xy",
"www.mysite.com?tag=xyz&userid=123xy&username=joe"
]
]
];
foreach ($tests as $test) {
$regex = $test['regex'];
$expected = $test['expected'];
foreach ($test['inputs'] as $input) {
if (!preg_match($regex, $input, $match)) {
throw new Exception("Regex '{$regex}' doesn't match for input '{$input}' or error has occured.");
}
$matched = $match[1];
if ($matched !== $expected) {
throw new Exception("Found '{$matched}' instead of '{$expected}'.");
}
echo "Matched '{$matched}' in '{$input}'." . PHP_EOL;
}
}
Results:
Matched '123xy' in 'www.mysite.com?userid=123xy'.
Matched '123xy' in 'www.mysite.com?userid=123xy&username=joe'.
Matched '123xy' in 'www.mysite.com?tag=xyz&userid=123xy'.
Matched '123xy' in 'www.mysite.com?tag=xyz&userid=123xy&username=joe'.
You can use the regex: .*?(userid=\d+).*
.*? - is a non-greedy way to express: everything that comes before (userid=\d+)
Python example:
import re
a = 'www.mysite.com?userid=12345'
b = 'www.mysite.com?userid=12345&username=joe'
mat = re.match('.*?(userid=\d+).*', a)
print mat.group(1) # prints userid=12345
mat = re.match('.*?(userid=\d+).*', b)
print mat.group(1) # prints userid=12345
Link to Fiddler

regular expression help: catch this: |TrxId=475665|

For example I have a string:
MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|
and I want to catch this: |TrxId=475665|
after TrxId= it could be any numbers and any amount of them, so regex should catch as well:
|TrxId=111333| and |TrxId=0000011112222| and |TrxId=123|
TrxId=(\d+)
That would give a group (1) with the TrxId.
PS: Use global modifier.
The regex should look somewhat like this:
TrxId=[0-9]+
It will match TrxId= followed by at least one digit.
An example solution in Python:
In [107]: data = 'MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|'
In [108]: m = re.search(r'\|TrxId=(\d+)\|', data)
In [109]: m.group(0)
Out[109]: '|TrxId=475665|'
In [110]: m.group(1)
Out[110]: '475665'
/MsgNam\=.*?\|(TrxId\=\d+)\|.*/
for example in perl:
$a = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100111|";
$a =~ /MsgNam\=.*?\|(TrxId\=\d+)\|.*/;
print $1;
will print TrxId=475665
You know what your delimiters look like, so you don't need a regex, you need to split. Here's an implementation in Perl.
use strict;
use warnings;
my $input = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|";
my #first_array = split(/\|/,$input); #splitting $input on "|"
#Now, since the last character of $input is "|", the last element
#of this array is undef (ie the Perl equivalent of null)
#So, filter that out.
#first_array = grep{defined}#first_array;
#Also filter out elements that do not have an equals sign appearing.
#first_array = grep{/=/}#first_array;
#Now, put these elements into an associative array:
my %assoc_array;
foreach(#first_array)
{
if(/^([^=]+)=(.+)$/)
{
$assoc_array{$1} = $2;
}
else
{
#Something weird may be happening...
#we may have an element starting with "=" for example.
#Do what you want: throw a warning, die, silently move on, etc.
}
}
if(exists $assoc_array{TrxId})
{
print "|TrxId=" . $assoc_array{TrxId} . "|\n";
}
else
{
print "Sorry, TrxId not found!\n";
}
The code above yields the expected output:
|TrxId=475665|
Now, obviously this is more complex than some of the other answers, but it's also a bit more robust in that it allows you to search for more keys as well.
This approach does have a potential issue if your keys appear more than once. In that case, it's easy enough to modify the code above to collect an array reference of values for each key.

Need help converting "sassy" to "$a55y" using a regular expression?

Any s at the beginning of the word should be converted to a $.
Any s inside the word should be converted to a 5.
To match an s at the start of the word, use \b to match word boundaries and \w to match alphanumerics:
/\bs\w/
(as #Matthew points out, the \w is really superfluous:)
/\bs/
Once you've replaced all s at the start of a word, then the only remaining ones are inside the word (I'm assuming that you also want to replace s at the end of a word with 5) so you can simply use
/s/
For completeness, here's how to put it all together (I'm going to assume JavaScript):
function pimpMyEsses(str)
{
return str.replace(/\bs/gi, '$').replace(/s/gi, '5');
}
console.log(pimpMyEsses('slither quantum Sassy. arcades'));
// > "$lither quantum $a55y. arcade5"
Depending on the language it may be possible to capture the substitutions with a single regular expression and replace them procedurally. Here's a PHP example:
<?php
$word = 'sassy';
preg_match_all('/\b(s)|([^s]+)|(s)/', $word, $matches, PREG_SET_ORDER);
/* captures:
* $matches = array(
* array('s','s'),
* array('a','','a'),
* array('s','','','s'),
* array('s','','','s'),
* array('y','','y')
* )
*/
$newword = '';
foreach ($matches as $m){
if ($m[1]) $newword .= '$'; # leading s --> $
elseif ($m[2]) $newword .= $m[2]; # not an s --> as-is
else $newword .= '5'; # any other s --> 5
}
echo $newword;
Because I've used \b to match a word-boundary before the "leading s", the string 'sassy socks' becomes '$a55y $ock5'
If you want only the s at the start of "sassy" to become a $, change the regular expression to:
'/^(s)|([^s]+)|(s)/'
You can do:
/^(s)/ to select only the first "s";
/(?:[^s])(?:(s)[^s]*)+ to select all other "s". Note that the first character will be skipped (which is independent of);
Explain:ignore first character;Repeat one or more: get a "s" and ignore others character that not "s";
Next step: you need to determinate what language you will use.

Why does my regex fail when the number ends in 0?

This is a really basic regex question but since I can't seem to figure out why the match is failing in certain circumstances I figured I'd post it to see if anyone else can point out what I'm missing.
I'm trying to pull out the 2 sets of digits from strings of the form:
12309123098_102938120938120938
1321312_103810312032123
123123123_10983094854905490
38293827_1293120938129308
I'm using the following code to process each string:
if($string && $string =~ /^(\d)+_(\d)+$/) {
if(IsInteger($1) && IsInteger($2)) { print "success ('$1','$2')"; }
else { print "fail"; }
}
Where the IsInterger() function is as follows:
sub IsInteger {
my $integer = shift;
if($integer && $integer =~ /^\d+$/) { return 1; }
return;
}
This function seems to work most of the time but fails on the following for some reason:
1287123437_1268098784380
1287123437_1267589971660
Any ideas on why these fail while others succeed? Thanks in advance for your help!
This is an add-on to the answers from unicornaddict and ZyX: what are you trying to match?
If you're trying to match the sequences left and right of '_', unicorn addict is correct and your regex needs to be ^(\d+)_(\d+)$. Also, you can get rid of the first qualifier and the 'IsIntrger()` function altogether - you already know it's an integer - it matched (\d+)
if ($string =~ /^(\d+)_(\d+)$/) {
print "success ('$1','$2')";
} else {
print "fail\n";
}
If you're trying to match the last digit in each and wondering why it's failing, it's the first check in IsInteger() ( if($intger && ). It's redundant anyway (you know it's an integer) and fails on 0 because, as ZyX notes - it evaluates to false.
Same thing applies though:
if ($string =~ /^(\d)+_(\d)+$/) {
print "success ('$1','$2')";
} else {
print "fail\n";
}
This will output success ('8','8') given the input 12309123098_102938120938120938
Because you have 0 at the end of the second string, (\d)+ puts only the last match in the $N variable, string "0" is equivalent to false.
When in doubt, check what your regex is actually capturing.
use strict;
use warnings;
my #data = (
'1321312_103810312032123',
'123123123_10983094854905490',
);
for my $s (#data){
print "\$1=$1 \$2=$2\n" if $s =~ /^(\d)+_(\d)+$/;
# Output:
# $1=2 $2=3
# $1=3 $2=0
}
You probably intended the second of these two approaches.
(\d)+ # Repeat a regex group 1+ times,
# capturing only the last instance.
(\d+) # Capture 1+ digits.
In addition, both in your main loop and in IsInteger (which seems unnecessary, given the initial regex in the main loop), you are testing for truth rather than something more specific, such as defined or length. Zero, for example, is a valid integer but false.
Shouldn't + be included in the grouping:
^(\d+)_(\d+)$ instead of ^(\d)+_(\d)+$
Many people have commented on your regex, but the problem you had in your IsInteger (which you really don't need for your example). You checked for "truth" when you really want to check for defined:
sub IsInteger {
my $integer = shift;
if( defined $integer && $integer =~ /^\d+$/) { return 1; }
return;
}
You don't need most of the infrastructure in that subroutine though:
sub IsInteger {
defined $_[0] && $_[0] =~ /^\d+$/
}