Perl regular expression question - regex

Suppose I have variables
$x1 = 'XX a b XX c d XX';
$x2 = 'XX a b XX c d XX e f XX';
I want a regular expression that will find each instance of letters between XX. I'm looking for a general solution, because I don't know how many XX's there are.
I tried using /XX(.*?)XX/g, but this only matches "a b" for x1 and "a b", "e f" for x2, because once the first match is found, the engine has already consumed the second "XX".
Thanks for any help.

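Try using a positive lookahead, which checks for the closing XX without consuming it (the g modifier is still needed to find every instance):
/XX(.*?)(?=XX)/g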
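As a minimal sketch of how the lookahead behaves on the two strings from the question (the helper name between_xx is only an illustration, not part of the original answer):

```perl
use strict;
use warnings;

# Return every run of text that sits between two XX markers.
# The lookahead (?=XX) checks for the closing marker without consuming it,
# so the same XX can also open the next match.
sub between_xx {
    my ($text) = @_;
    return $text =~ /XX(.*?)(?=XX)/g;
}

my $x1 = 'XX a b XX c d XX';
my $x2 = 'XX a b XX c d XX e f XX';

print join('|', between_xx($x1)), "\n";   # prints " a b | c d "
print join('|', between_xx($x2)), "\n";   # prints " a b | c d | e f "
```

The captures keep their surrounding spaces; trim them afterwards if they are not wanted.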

Like knittl, I would suggest split. But you might want to remove the whitespace as well:
my @stuff = split /\s*XX\s*/, $line;
You could also use lookaheads, but you don't really need them, because a reasonably complex alternation works as well.
The version that keeps the whitespace would just be:
my @stuff = $line =~ m/XX((?:[^X]|X[^X])*)/g;
The alternation says that you'll take anything if it's not an 'X'--but you will take an 'X' if it's not followed by another 'X'. There will be one character of lookahead, but it can consume characters aggressively, without backtracking.
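The trimming version has to backtrack to give up trailing space characters, so the expression is uglier. The inner group must stay greedy; with a lazy quantifier the capture would stop after its first non-space character.
my @stuff = $line =~ m/XX\s*((?:[^X]|X[^X])*(?:[^X\s]|X[^X]))/g;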
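A small sketch comparing the two alternation patterns. The input string is a made-up example with a lone X inside the data, to show that a single X does not end a field:

```perl
use strict;
use warnings;

# Hypothetical input: the second field contains a single X.
my $line = 'XX a b XX c Xd XX';

# Untrimmed: consume any non-X character, or an X that is not followed by X.
my @raw = $line =~ m/XX((?:[^X]|X[^X])*)/g;

# Trimmed: skip leading whitespace, then require the capture to end on a
# non-space character, which makes the engine give back trailing spaces.
my @trimmed = $line =~ m/XX\s*((?:[^X]|X[^X])*(?:[^X\s]|X[^X]))/g;

print join('|', @raw), "\n";       # prints " a b | c Xd |"
print join('|', @trimmed), "\n";   # prints "a b|c Xd"
```

Note that the untrimmed pattern also matches the final XX with an empty capture, so its list ends with an empty string; the trimmed pattern requires at least one character and skips it.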

You can use split
@stuff_between_xx = split /XX/, $x1;
To store the number of fields in a scalar variable instead (the count includes the empty field in front of a leading XX):
$stuff_between_xx = split /XX/, $x1;

my $x2 = 'XX a b XX c d XX e f XX';
my @parts = grep { $_ ne '' } split /\s*XX\s*/, $x2;
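A short sketch of what the split variants above return for $x1, which may explain why the grep is useful. It relies on the documented behaviour that split keeps a leading empty field and drops trailing ones:

```perl
use strict;
use warnings;

my $x1 = 'XX a b XX c d XX';

# split keeps the empty field in front of the leading XX,
# but drops empty fields at the end of the list.
my @fields = split /XX/, $x1;            # ('', ' a b ', ' c d ')

# In scalar context split returns the number of fields, empty one included.
my $count = split /XX/, $x1;             # 3

# Trimming whitespace in the separator and filtering out empty strings
# leaves only the text between the markers.
my @parts = grep { $_ ne '' } split /\s*XX\s*/, $x1;   # ('a b', 'c d')

print scalar(@fields), " fields, count $count, parts: ", join('|', @parts), "\n";
```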

Related

How to use regex in Perl to see if a variable has only specific characters?

I'd like to use regex in Perl to see if a scalar variable has any characters other than the ones I'm looking for. The placement or order of the characters don't matter.
For example, if I want to detect anything other than the characters C and F:
Matching to ABCADF would equal true (it has other than my filter characters)
Matching FFC would equal false.
Matching CCCC would also equal false.
Thanks
The following returns true if the string contains a character that is neither C nor F:
$str =~ /[^CF]/
In the comments, you mentioned you actually want the opposite. You could negate the above as follows:
!( $str =~ /[^CF]/ )
$str !~ /[^CF]/
If you'd rather avoid the double-negative, you could check if the string consists entirely of C and F characters.
$str =~ /^[CF]*\z/
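To make the two checks concrete, here is a minimal sketch run against the three strings from the question. The sub names are illustrative only:

```perl
use strict;
use warnings;

# True if the string contains any character other than C or F.
sub has_other_chars {
    my ($str) = @_;
    return $str =~ /[^CF]/ ? 1 : 0;
}

# True if the string consists entirely of C and F characters.
sub only_c_and_f {
    my ($str) = @_;
    return $str =~ /^[CF]*\z/ ? 1 : 0;
}

for my $s (qw(ABCADF FFC CCCC)) {
    printf "%-6s other=%d only=%d\n", $s, has_other_chars($s), only_c_and_f($s);
}
```

Because of the * quantifier, the empty string also counts as consisting only of C and F; use + instead if an empty string should be rejected.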

Regex logic comparison simplification

I have several pieces of complex logic, and one of them is a comparison of two strings.
$db_string_param1 = 'AA1 AB AC1 AK1 BKK2';
$file_string_param1 = 'AK1B25';
I need to test whether $file_string_param1 contains any of the content of $db_string_param1, delimited by space without making $db_string_param1 an array.
I am thinking maybe this is possible using regex, and for now I am not that good using complex regex.
If your data contains no special characters, a simple substitution such as s/\s/|/g is enough to turn the spaces into an alternation.
The substitution below also handles special characters.
It captures either a whitespace character ($1) or a non-word character ($2). The e flag evaluates the right-hand side as an expression: if $1 is defined, the whitespace is replaced with |; otherwise $2 holds a special character, which is escaped with a backslash.
$file_string_param1 = 'AK1B25';
$db_string_param1 = 'AA1 AB AC1 AK1 BKK2';
$db_string_param1=~s/(\s)|(\W)/ defined($1) ? "|" : "\\$2" /ge;
$file_string_param1=~m/$db_string_param1/;
print $&,"\n";
This solution doesn't suffer from a regex injection bug. It will work no matter which characters appear between the whitespace separators.
my $pat = join '|', map quotemeta, split ' ', $db_string_param1;
if ( $file_string_param1 =~ /$pat/ ) {
...
}
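A hedged sketch of why the quotemeta step matters. The list values 'A.B' and 'C++' are invented for illustration, and pattern_from_list is a hypothetical helper wrapping the same join/map/split idea:

```perl
use strict;
use warnings;

# Build an alternation from a space-separated list, escaping every
# regex metacharacter so each item is matched literally.
sub pattern_from_list {
    my ($list) = @_;
    my $pat = join '|', map { quotemeta($_) } split ' ', $list;
    return qr/$pat/;
}

# Hypothetical values that contain metacharacters.
my $re = pattern_from_list('AA1 A.B C++');

print 'AXB'   =~ $re ? "match\n" : "no match\n";   # no match: the dot is literal
print 'A.B'   =~ $re ? "match\n" : "no match\n";   # match
print 'xC++y' =~ $re ? "match\n" : "no match\n";   # match
```

Without quotemeta, 'A.B' would also match 'AXB', and 'C++' would not even compile as a pattern because of the nested quantifier.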
You don't explain why you don't want your database parameter converted into an array, so it is hard to understand your intention, but this program demonstrates how to convert the string into a regex pattern and test the file parameter against it:
use strict;
use warnings 'all';
use feature 'say';
my $db_string_param1 = 'AA1 AB AC1 AK1 BKK2';
my $file_string_param1 = 'AK1B25';
( my $db_re = $db_string_param1 ) =~ s/\A\s+|\s+\z//g;
$db_re =~ s/\s+/|/g;
$db_re = qr/$db_re/;
say $file_string_param1 =~ /(?<![A-Z])(?:$db_re)(?![0-9])/ ? 'found' : 'not found';
output
found

Perl assign regex match groups to variables

I have a string in the following format: _num1_num2. I need to assign the num1 and num2 values to some variables. My regex is (\d+) and it shows me the correct match groups on Rubular.com, but I don't know how to assign these match groups to variables. Can anybody help me? Thanks in advance.
That should be (assuming your string is stored in '$string'):
my ($var1, $var2) = $string =~ /_(\d+)_(\d+)/s;
The idea is to grab numbers until you get a non-number character: here '_'.
Each capturing group is then assigned to its respective variable.
As mentioned in this question (and in the comments below by Kaoru):
\d can indeed match more than 10 different characters, if applied to Unicode strings.
So you can use instead:
my ($var1, $var2) = $string =~ /_([0-9]+)_([0-9]+)/s;
Using the g modifier also allows you to do away with the grouping parentheses:
my ($five, $sixty) = '_5_60' =~ /\d+/g;
This allows any separation of integers but it doesn't verify the input format.
The use of the global flag in the first answer is a bit confusing. The regex /_(\d+)_(\d+)/ already captures two integers. Additionally the g modifier tries to match multiple times. So this is redundant.
IMHO the g modifier should be used when the number of matches is unknown or when it simplifies the regex.
As far as I see this works exactly the same way as in JavaScript.
Here are some examples:
use strict;
use warnings;
use Data::Dumper;
my $str_a = '_1_22'; # two integers, each preceded by an underscore
# expect two integers
# using the g modifier for global matching
my ($int1_g, $int2_g) = $str_a =~ m/_(\d+)/g;
print "global:\n", Dumper( $str_a, $int1_g, $int2_g ), "\n";
# match two ints explicitly
my ( $int1_e, $int2_e) = $str_a =~ m/_(\d+)_(\d+)/;
print "explicit:\n", Dumper( $str_a, $int1_e, $int2_e ), "\n";
# matching an unknown number of integers
my $str_b = '_1_22_333_4444';
my @ints = $str_b =~ m/_(\d+)/g;
print "multiple integers:\n", Dumper( $str_b, \@ints ), "\n";
# alternatively you can use split
# the string starts with '_', so split returns an empty first field; discard it
my ( undef, $int1_s, $int2_s ) = split m/_/, $str_a;
print "split:\n", Dumper( $str_a, $int1_s, $int2_s ), "\n";

How to change nested quotes?

I'm looking for a way to change quotes into fancy ones: "abc" -> «abc».
It works for me in simple situations, and the next step I am looking for is how to get it to work with nested quotes as well: "abc "d e f" ghi" -> «abc «d e f» ghi»
$pk =~ s/
"( # first quote, start capture
[\p{Word}\.]+? # at least one word-char or point
.*?\b[\.,?!]*? # any char followed boundary + opt. punctuation
)" # stop capture, ending quote
/«$1»/xg; # change to fancy
I hoped the regex would match the 1st and 3rd quotes and change them, and it does. The problem is that I hoped it would then match the 2nd and 4th as well, but it won't, because the 2nd has already been left behind. One solution is to run the same replacement again until fewer than 2 quote characters remain.
Is there a better way to achieve my goal? My approach won't work with a third level of nesting, but that is not my goal; I am staying with 2 levels.
NB! Changing the opening and closing quotes in separate replacements won't work, because then lone double quotes would be replaced too. I need to replace them only when they appear as a pair!
MORE examples:
"abc "d e f" -> «abc "d e f»
"abc"d e f" -> «abc"d e f»
This seems impossible:
"abc" d e f" -> «abc" d e f»
There is no general way to pair up nested double quotes. If your quotes are always next to the beginning or end of a word, then this may work. It replaces a double quote that precedes a non-space character with an opening quote, and one that follows a non-space character with a closing quote.
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)'; # print « and » as UTF-8 rather than Latin-1 bytes
my $string = '"abc "d e f" ghi"';
$string =~ s/"(?=\S)/«/g;
$string =~ s/(?<=\S)"/»/g;
print $string;
output
«abc «d e f» ghi»
You can use negative lookaround assertions to find the matching directions on your fancy quotes. The double negations help handle the edge cases (e.g. end/beginning of line). I used << and >> instead of your fancy quotes here for simplicity.
use strict;
use warnings;
while (<DATA>) {
s/(?<!\S)"(?!\s)/<</g;
s/(?<!\s)"(?!\S)/>>/g;
print;
}
__DATA__
"abc "d e f" ghi"
Output:
<<abc <<d e f>> ghi>>

Regular expression for selecting single spaced phrases but not whitespace

I need a rather complicated regular expression that will select words with one space between them, where the words can include the '-' symbol. It should not, however, select continuous whitespace.
'KENEDY JOHN G JR E' 'example' 'D-54'
I have tried the following regular expression:
\'([\s\w-]+)\'
but it selects continuous whitespace which I don't want it to do.
I want the expression to select
'KENEDY JOHN G JR E'
'example'
'D-54'
Perhaps,
\'([\w-]+(?:\s[\w-]+)*)\'
?
EDIT
If leading/trailing dashes (on the word boundaries) are not allowed, this should read:
/\'(\w+(?:[\s-]\w+)*)\'/
An expression like this should do it:
'[\w-]+(?:\s[\w-]+)*'
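A quick sketch of how this pattern treats runs of whitespace. The last quoted item, with two spaces inside it, is an invented test case and not part of the question's data:

```perl
use strict;
use warnings;

# Words made of word characters or dashes, separated by exactly one
# whitespace character, enclosed in single quotes.
my $re = qr/'[\w-]+(?:\s[\w-]+)*'/;

# The last item is a made-up example containing a double space.
my $data = "'KENEDY JOHN G JR E' 'example' 'D-54' 'TWO  SPACES'";

# With no capture groups, a /g match in list context returns whole matches.
my @hits = $data =~ /$re/g;

print "$_\n" for @hits;
```

The first three items are printed with their quotes; the item with two consecutive spaces is skipped because each \s must be followed by at least one word character or dash.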
Try this:
my $data = "'KENEDY JOHN G JR E' 'example' 'D-54'";
# Sets of
# one or more word characters or dash
# followed by an optional space
# enclosed in single quotes
#
# The outermost ()s are optional. They're just
# there so I can print the match easily as $1.
while ($data =~ /(\'([\w-]+\s?)+\')/g)
{
print $1, "\n";
}
outputs
'KENEDY JOHN G JR E'
'example'
'D-54'
Not sure if this applies to you, since you asked for a regex specifically. However, if your strings are separated by two or more whitespace characters, you can use split:
use strict;
use warnings;
use v5.10;
my $str = q('KENEDY JOHN G JR E' 'example' 'D-54');
my @match = split /\s{2,}/, $str;
say for @match;
A regex with similar functionality would be
my @match = $str =~ /(.*?)(?:\s{2,}|$)/g;
Note that you'll need the edge case of finding end of line $.
The benefit of using split or the wildcard . is that you rely on whitespace to define your fields, not the content of the fields themselves.
Your code actually works as is.
use feature qw( say );
$_ = "'KENEDY JOHN G JR E' 'example' 'D-54'";
say for /\'([\s\w-]+)\'/g;
output:
KENEDY JOHN G JR E
example
D-54
(Move the parens if you want the quotes too.)
I would simply use
my @data = /'([^']*)'/g;
If you have any validation to do, do it afterwards.