Regex logic comparison simplification - regex

I have multiple complex logic and one of them is comparison of 2 strings.
$db_string_param1 = 'AA1 AB AC1 AK1 BKK2';
$file_string_param1 = 'AK1B25';
I need to test whether $file_string_param1 contains any of the content of $db_string_param1, delimited by space without making $db_string_param1 an array.
I am thinking maybe this is possible using regex, and for now I am not that good using complex regex.

Suppose your data hasn't contain the special character you could use simple | in a substitution s/\s/|/g.
In below RegEx substitution I have handle the special characters.
Group the space and non word character. Then e flag used for evaluate the right-hand side as an expression. Then check if $1 is defined or not if it is defined use | to separate. if it is not defined it means $2 contain the special character so you could escape the character.
$file_string_param1 = 'AK1B25';
$db_string_param1 = 'AA1 AB AC1 AK1 BKK2';
$db_string_param1=~s/(\s)|(\W)/ defined($1) ? "|" : "\\$2" /ge;
$file_string_param1=~m/$db_string_param1/;
print $&,"\n";

This solution doesn't suffer from regex injection bug. It will work no matter what characters are separated by whitespace.
my $pat = join '|', map quotemeta, split ' ', $db_string_param1;
if ( $file_string_param1 =~ /$pat/ ) {
...
}

You don't explain why you don't want your database parameter converted into an array so it is hard to understand your intention, but this program demonstrates how to convert the string into a regex pattern and test the file parameter against it
use strict;
use warnings 'all';
use feature 'say';
my $db_string_param1 = 'AA1 AB AC1 AK1 BKK2';
my $file_string_param1 = 'AK1B25';
( my $db_re = $db_string_param1 ) =~ s/\A\s+|\s+\z//g;
$db_re =~ s/\s+/|/g;
$db_re = qr/$db_re/;
say $file_string_param1 =~ /(?<![A-Z])(?:$db_re)(?![0-9])/ ? 'found' : 'not found';
output
found

Related

Perl split function - use repeating characters as delimiter

I want to split a string using repeating letters as delimiter, for example,
"123aaaa23a3" should be split as ('123', '23a3') while "123abc4" should be left unchanged.
So I tried this:
#s = split /([[:alpha:]])\1+/, '123aaaa23a3';
But this returns '123', 'a', '23a3', which is not what I wanted. Now I know that this is because the last 'a' in 'aaaa' is captured by the parantheses and thus preserved by split(). But anyway, I can't add something like ?: since [[:alpha:]] must be captured for back reference.
How can I resolve this situation?
Hmm, its an interesting one. My first thought would be - your delimiter will always be odd numbers, so you can just discard any odd numbered array elements.
Something like this perhaps?:
my %s = (split (/([[:alpha:]])\1+/, '123aaaa23a3'), '' );
print Dumper \%s;
This'll give you:
$VAR1 = {
'23a3' => '',
'123' => 'a'
};
So you can extract your pattern via keys.
Unfortunately my second approach of 'selecting out' the pattern matches via %+ doesn't help particularly (split doesn't populate the regex stuff).
But something like this:
my #delims ='123aaaa23a3' =~ m/(?<delim>[[:alpha:]])\g{delim}+/g;
print Dumper \%+;
By using a named capture, we identify that a is from the capture group. Unfortunately, this doesn't seem to be populated when you do this via split - which might lead to a two-pass approach.
This is the closest I got:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $str = '123aaaa23a3';
#build a regex out of '2-or-more' characters.
my $regex = join ( "|", map { $_."{2,}"} $str =~ m/([[:alpha:]])\1+/g);
#make the regex non-capturing
$regex = qr/(?:$regex)/;
print "Using: $regex\n";
#split on the regex
my #s = split m/$regex/, $str;
print Dumper \#s;
We first process the string to extract "2-or-more" character patterns, to set as our delmiters. Then we assemble a regex out of them, using non-capturing, so we can split.
One solution would be to use your original split call and throw away every other value. Conveniently, List::Util::pairkeys is a function that keeps the first of every pair of values in its input list:
use List::Util 1.29 qw( pairkeys );
my #vals = pairkeys split /([[:alpha:]])\1+/, '123aaaa23a3';
Gives
Odd number of elements in pairkeys at (eval 6) line 1.
[ '123', '23a3' ]
That warning comes from the fact that pairkeys wants an even-sized list. We can solve that by adding one more value at the end:
my #vals = pairkeys split( /([[:alpha:]])\1+/, '123aaaa23a3' ), undef;
Alternatively, and maybe a little neater, is to add that extra value at the start of the list and use pairvalues instead:
use List::Util 1.29 qw( pairvalues );
my #vals = pairvalues undef, split /([[:alpha:]])\1+/, '123aaaa23a3';
The 'split' can be made to work directly by using the delayed execution assertion (aka postponed regular subexpression), (??{ code }), in the regular expression:
#s = split /[[:alpha:]](??{"$&+"})/, '123aaaa23a3';
(??{ code }) is documented on the 'perlre' manual page.
Note that, according to the 'perlvar' manual page, the use of $& anywhere in a program imposes a considerable performance penalty on all regular expression matches. I've never found this to be a problem, but YMMV.

Extracting first two words in perl using regex

I want to create extract the first two words from a sentence using a Perl function in PostgreSQL. In PostgreSQL, I can do this with:
text = "I am trying to make this work";
Select substring(text from '(^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)');
It would return "I Am"
I tried to build a Perl function in Postgresql that does the same thing.
CREATE OR REPLACE FUNCTION extract_first_two (text)
RETURNS text AS
$$
my $my_text = $_[0];
my $temp;
$pattern = '^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)';
my $regex = qr/$pattern/;
if ($my_text=~ $regex) {
$temp = $1;
}
return $temp;
$$ LANGUAGE plperl;
But I receive a syntax error near the regular expression. I am not sure what I am doing wrong.
Extracting words is none trivial even in English. Take the following contrived example using Locale::CLDR
use 'Locale::CLDR';
my $locale = Locale::CLDR->new('en');
my #words = $locale->split_words('adf543. 123.25');
#words now contains
adf543
.
123.25
Note that the full stop after adf543 is split into a separate word but the one between 123 and 25 is kept as part of the number 123.25 even though the '.' is the same character
If gets worse when you look at non English languages and much worse when you use non Latin scripts.
You need to precisely define what you think a word is otherwise the following French gets split incorrectly.
Je avais dit «Elle a dit «Il a dit «Ni» il ya trois secondes»»
The parentheses are mismatched in our regex pattern. It has three opening parentheses and four closing ones.
Also, you have two single quotes in the middle of a singly-quoted string, so
'^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)'
is parsed as two separate strings
'^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)'
and
'^\w+-\w+|^\w+(\s+)?(!|,|\&|'
')?(\s+)?\w+)'
But I can't suggest how to fix it as I don't understand your intention.
Did you mean a double quote perhaps? In which case (!|,|\&|")? can be written as [!,&"]?
Update
At a rough guess I think you want this
my $regex = qr{ ^ \w++ \s* [-!,&"]* \s* \w+ }x;
$temp = $1 if $my_text=~ /($regex)/;
but I can't be sure. If you describe what you're looking for in English then I can help you better. For instance, it's unclear why you don't have question marks, full stops, and semicolons in the list of intervening punctuation.

Perl assign regex match groups to variables

I'have a string in a following format: _num1_num2. I need to assign num1 and num2 values to some variables. My regex is (\d+) and it shows me the correct match groups on Rubular.com, but I don't know how to assign these match groups to some variables. Can anybody help me? Thanks in adv.
That should be (assuming your string is stored in '$string'):
my ($var1, $var2) = $string =~ /_(\d+)_(\d+)/s;
The idea is to grab numbers until you get a non-number character: here '_'.
Each capturing group is then assign to their respective variable.
As mentioned in this question (and in the comments below by Kaoru):
\d can indeed match more than 10 different characters, if applied to Unicode strings.
So you can use instead:
my ($var1, $var2) = $string =~ /_([0-9]+)_([0-9]+)/s;
Using the g-modifier also allows you to do away with the the grouping parenthesis:
my ($five, $sixty) = '_5_60' =~ /\d+/g;
This allows any separation of integers but it doesn't verify the input format.
The use of the global flag in the first answer is a bit confusing. The regex /_(\d+)_(\d+)/ already captures two integers. Additionally the g modifier tries to match multiple times. So this is redundant.
IMHO the g modifier should be used when the number of matches is unknown or when it simplifies the regex.
As far as I see this works exactly the same way as in JavaScript.
Here are some examples:
use strict;
use warnings;
use Data::Dumper;
my $str_a = '_1_22'; # three integers seperated by an underscore
# expect two integert
# using the g modifier for global matching
my ($int1_g, $int2_g) = $str_a =~ m/_(\d+)/g;
print "global:\n", Dumper( $str_a, $int1_g, $int2_g ), "\n";
# match two ints explicitly
my ( $int1_e, $int2_e) = $str_a =~ m/_(\d+)_(\d+)/;
print "explicit:\n", Dumper( $str_a, $int1_e, $int2_e ), "\n";
# matching an unknown number of integers
my $str_b = '_1_22_333_4444';
my #ints = $str_b =~ m/_(\d+)/g;
print "multiple integers:\n", Dumper( $str_b, \#ints ), "\n";
# alternatively you can use split
my ( $int1_s, $int2_s ) = split m/_/, $str_a;
print "split:\n", Dumper( $str_a, $int1_g, $int2_g ), "\n";

How do I check if a string has exactly one of a certain character

I'm trying to scan strings to see if they have exactly one of a certain character.
For example if I'm looking for a question mark
Hello? I'm here
Will match the regex however
Hello? Are you listening?
Will not
I've tried ?{1} and ?{1}[^?]+ but they both don't work. Can anyone point me in the right direction?
Why not do:
(\?)
and count the number of matches.
Or even more simply, count number of ? in string using tr///
my $c = $string1 =~ tr/?//;
You could do something like
my $cnt = () = $str =~ m/\Q$pat/g;
if ($cnt == 1) {
# matched
}
else {
# failed
}
$pat is the pattern (character in this case) you want to match, such as '?'.
If you're looking for a particular character only, you can use the transliteration operator, tr///:
my $count = $string =~ tr/?/?/;
if( $count == 1 ) {
...
}
With the transliteration operator, I can leave off the replacement side and any characters not lined up with a replacement character will use the previous replacement character. If there isn't a previous replacement character, it makes no replacement. I just leave out the second part of the tr///:
my $count = $string =~ tr/?//;
if( $count == 1 ) {
...
}
This won't work for patterns though. This is strictly for character-to-character replacements. For a pattern, you do the same thing with Lee Duhem's answer
You can use this regex:
^[^?]*\?[^?]*$
Online Demo

Perl - Regex to extract only the comma-separated strings

I have a question I am hoping someone could help with...
I have a variable that contains the content from a webpage (scraped using WWW::Mechanize).
The variable contains data such as these:
$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig"
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"
The only bits I am interested in from the above examples are:
#array = ("cat_dog","horse","rabbit","chicken-pig")
#array = ("elephant","MOUSE_RAT","spider","lion-tiger")
#array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA")
The problem I am having:
I am trying to extract only the comma-separated strings from the variables and then store these in an array for use later on.
But what is the best way to make sure that I get the strings at the start (ie cat_dog) and end (ie chicken-pig) of the comma-separated list of animals as they are not prefixed/suffixed with a comma.
Also, as the variables will contain webpage content, it is inevitable that there may also be instances where a commas is immediately succeeded by a space and then another word, as that is the correct method of using commas in paragraphs and sentences...
For example:
Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.
^ ^
| |
note the spaces here and here
I am not interested in any cases where the comma is followed by a space (as shown above).
I am only interested in cases where the comma DOES NOT have a space after it (ie cat_dog,horse,rabbit,chicken-pig)
I have a tried a number of ways of doing this but cannot work out the best way to go about constructing the regular expression.
How about
[^,\s]+(,[^,\s]+)+
which will match one or more characters that are not a space or comma [^,\s]+ followed by a comma and one or more characters that are not a space or comma, one or more times.
Further to comments
To match more than one sequence add the g modifier for global matching.
The following splits each match $& on a , and pushes the results to #matches.
my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my #matches;
while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
push(#matches, split(/,/, $&));
}
print join("\n",#matches),"\n";
Though you can probably construct a single regex, a combination of regexs, splits, grep and map looks decently
my #array = map { split /,/ } grep { !/^,/ && !/,$/ && /,/ } split
Going from right to left:
Split the line on spaces (split)
Leave only elements having no comma at the either end but having one inside (grep)
Split each such element into parts (map and split)
That way you can easily change the parts e.g. to eliminate two consecutive commas add && !/,,/ inside grep.
I hope this is clear and suits your needs:
#!/usr/bin/perl
use warnings;
use strict;
my #strs = ("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
"fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf",
"dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew",
"Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
"Another sentence, although having commas, should not confuse the regex with this: a,b,c,d");
my $regex = qr/
\s #From your examples, it seems as if every
#comma separated list is preceded by a space.
(
(?:
[^,\s]+ #Now, not a comma or a space for the
#terms of the list
, #followed by a comma
)+
[^,\s]+ #followed by one last term of the list
)
/x;
my #matches = map {
$_ =~ /$regex/;
if ($1) {
my $comma_sep_list = $1;
[split ',', $comma_sep_list];
}
else {
[]
}
} #strs;
$var =~ tr/ //s;
while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
push (#arr, $&);
}
the regular expression matches three cases :
(?<!, )\b[^, ]+(?=,\S) : matches cat_dog
(?<=,)[^, ]+(?=,) : matches horse & rabbit
(?<=\S,)[^, ]+\b(?! ,) : matches chicken-pig