How can I determine if a value is in a Perl array? - regex

I'm using this small snippet to determine whether or not a URL is currently being stored in an array:
if( $self->{_local} eq "true" && ! grep {m|^$new_href?$|} #m_href_array ) {
push( #m_href_array, $new_href );
push( #href_array, $new_href );
}
It seems to work but then my code throws an error saying:
Sequence (?$...) not implemented in regex; marked by <-- HERE in m/^javascript:SearchGo(?$ <-- HERE / at C:/Perl/site/lib/ACTC.pm line 152, <> line 1.
Can anyone explain why this is happening?

When searching for a string in an array, you can just use eq, rather than a regular expression:
grep { $_ eq $new_href } #m_href_array
However, if you really do need to use a regular expression (for example you are searching for a string matching a substring in the array, you should always quote the string, so that embedded special characters in your string do not have undesired effects:
grep { /\Q$substr\Esomething_else/ } #array
Moreover, if all you care about is whether the value is there, somewhere, you can short-circuit as soon as you've found a match:
use List::Util 'first';
if (first { $_ eq $new_href } #m_href_array) { ... }
or
use List::MoreUtils 'any';
if (any { $_ eq $new_href } #m_href_array) { ... }
If you're going to be doing a lot of searches, or your array is really long, you can make the process faster still by transforming the array into a hash, so you have O(1) lookups:
my %values_index;
#values_index{#array} = ();
if (exists $values_index{$element}) { ... }

You don't need regexp here. Just use eq:
grep { $_ eq $new_href } #m_href_array
Also it's a good idea to use hash instead of array for faster checking:
my %allready_used_url;
if ( $self->{_local} eq "true" && ! exists $allready_used_url{ $new_href } ) {
$allready_used_url{ $new_href } = 1; ## add url to hash
push( #m_href_array, $new_href );
push( #href_array, $new_href );
}

What do you mean by the ? in $new_href?? Assuming there's a string in $new_href, do you expect the last letter of the string to be optional? That's not how the RE parser reads it.

Looks like the value of $new_herf is javascript:SearchGo( which when substituted in the regex check looks like:
^javascript:SearchGo(?$
which is a broken regex as there is no matching ) for (

You're using the URL as the pattern, and it's not a valid pattern. That's not so bad because there are much better ways to do this. The smart match makes it almost trivial:
use 5.010;
if( $new_href ~~ #urls ) { ... }

Related

Searching Perl array with regex and return single capturing group only

I have a Perl script in which I perform web service calls in a loop. The server returns a multivalued HTTP header that I need to parse after each call with information that I will need to make the next call (if it doesn't return the header, I want to exit the loop).
I only care about one of the values in the header, and I need to get the information out of it with a regular expression. Let's say the header is like this, and I only care about the "foo" value:
X-Header: test-abc12345; blah=foo
X-Header: test-fgasjhgakg; blah=bar
I can get the header values like this: #values = $response->header( 'X-Header' );. But how do I quickly check if
There is a foo value, and
Parse and save the foo value for the next iteration?
Ideally, I'd like to do something like this:
my $value = 'default';
do {
# (do HTTP request; use $value)
#values = $response->header( 'X-Header' );
} while( $value = first { /(?:test-)([^;]+)(?:; blah=foo)/ } #values );
But grep, first (from List::Util), etc. all return the entire match and not just the single capturing group I want. I want to avoid cluttering up my code by looping over the array and matching/parsing inside the loop body.
Is what I want possible? What would be the most compact way to write it? So far, all I can come up with is using lookarounds and \K to discard the stuff I don't care about, but this isn't super readable and makes the regex engine perform a lot of unnecessary steps.
So it seems that you want to catch the first element with a certain pattern, but acquire only the pattern. And you want it done nicely. Indeed, first and grep only pass the element itself.
However, List::MoreUtils::first_result does support processing of its match
use List::MoreUtils 0.406 qw(first_result);
my #w = qw(a bit c dIT); # get first "it" case-insensitive
my $res = first_result { ( /(it)/i )[0] } #w;
say $res // 'undef'; #--> it
That ( ... )[0] is needed to put the regex in the list context so that it returns the actual capture. Another way would be firstres { my ($r) = /(it)/i; $r }. Pick your choice
For the data in the question
use warnings;
use strict;
use feature 'say';
use List::MoreUtils 0.406 qw(firstres);
my #data = (
'X-Header: test-abc12345; blah=foo',
'X-Header: test-fgasjhgakg; blah=bar'
);
if (my $r = firstres { ( /test-([^;]+);\s+blah=foo/ )[0] } #data) {
say $r
}
Prints abc12345, clarified in a comment to be the sought result.
Module versions prior to 0.406 (of 2015-03-03) didn't have firstres (alias first_result)
first { ... } #values returns one the values (or undef).
You could use either of these:
my ($value) = map { /...(...).../ } #values;
my $value = ( map { /...(...).../ } #values ) ? $1 : undef;
my $value = ( map { /...(...).../ } #values )[0];
Using first, it would look like the following, which is rather silly:
my $value = first { 1 } map { /...(...).../ } #values;
However, assuming the capture can't be an empty string or the string 0, List::MoreUtils's first_result could be used to avoid the unnecessary matches:
my $value = first_result { /...(...).../ ? $1 : undef } #values;
my $value = first_result { ( /...(...).../ )[0] } #values;
If the returned value can be false (e.g. an empty string or a 0) you can use something like
my $value = first_result { /...(...).../ ? \$1 : undef } #values;
$value = $$value if $value;
The first_result approach isn't necessarily faster in practice.
Following code snippet is looking for foo stored in a variable $find, the found values is stored in variable $found.
my $find = 'foo';
my $found;
while( $response->header( 'X-Header' ) ) {
if( /X-Header: .*?blah=($find)/ ) {
$found = $1;
last;
}
}
say $found if $found;
Sample demo code
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $find = 'foo';
my $found;
my #header = <DATA>;
chomp(#header);
for ( #header ) {
$found = $1 if /X-Header: .*?blah=($find)/;
last if $found;
}
say Dumper(\#header);
say "Found: $found" if $found;
__DATA__
X-Header: test-abc12345; blah=foo
X-Header: test-fgasjhgakg; blah=bar
Output
$VAR1 = [
'X-Header: test-abc12345; blah=foo',
'X-Header: test-fgasjhgakg; blah=bar'
];
Found: foo

How can I match these function calls, and extract the nmae of the function and the first argument?

I am trying to parse an array of elements. Those who match a pattern like the following:
Jim("jjanson", Customer.SALES);
I want to create a hash table like Jim => "jjanson"
How can I do this?
I can not match the lines using:
if($line =~ /\s*[A-Za-z]*"(.*),Customer.*\s*/)
You're not matching either the '(' after the name, nor the ' ' after the comma, before "Customer.".
I can get 'jjanson"' using this expression:
/\s*[A-Za-z]\(*"(.*), Customer.*\s*/
But I assume you don't want jjanson", so we need to modify it like so. (I tend to include the negative character class when I'm looking for simply-delimited stuff. So, in this case I'll make the expression "[^"]*"
/\s*[A-Za-z]\(*"([^"]+)", Customer.*\s*/
Also, I try not to depend upon whitespace, presence or number, I'm going to replace the space with \s*. That you didn't notice that you skipped the whitespace is a good illustration of the need to say "ignore a bunch of whitespace".
/\s*[A-Za-z]\(*"([^"]+)",\s*Customer.*\s*/
Now it's only looking for the sequence ',' + 'Customer' in the significant characters. Functionally, the same, if more flexible.
But since you only do one capture, I can't see what you'd map to what. So I'll do my own mapping:
my %records;
while ( my $line = $source->()) { # simply feed for a source of lines.
my ( $first, $user, $tag )
= $line = m/\s*(\p{Alpha}+)\s*\(\s*"([^"]+)",\s*Customer\.(\S+?)\)\/
;
$records{ $user }
= { first => $first
, username => $user
, tag => $tag
};
}
This is much more than you would tend to need in a one-off, quick solution. But I like to store as much of my input as seems significant.
Note that Jim("jjanson", Customer.SALES); matches the syntax of a function call with two arguments. You can thus abuse string eval:
#!/usr/bin/env perl
use strict;
use warnings;
use YAML::XS;
my $info = extract_first_arg(q{ Jim("jjanson", Customer.SALES);} );
print Dump $info;
sub extract_first_arg {
my $call = shift;
my ($name) = ($call =~ m{ \A \s* (\w+) }x);
unless ($name) {
warn "Failed to find function name in '$call'";
return;
}
my $username = eval sprintf(q{
package My::DangerZone;
no strict;
local *{ %s } = sub { $_[0] };
%s
}, $name, $call);
return { $name => $username };
}
Output:
---
Jim: jjanson
Or, you can abuse autoloading:
our $AUTOLOAD;
print Dump eval 'no strict;' . q{ Jim("jjanson", Customer.SALES); };
sub AUTOLOAD {
my ($fn) = ($AUTOLOAD =~ /::(\w+)\z/);
return { $fn => $_[0] };
}
I would not necessarily recommend using these methods, especially on input that is not in your control, and in a situation where this script has access to sensitive facilities.
On the other hand, I have, in the right circumstances, utilized this kind of thing in transforming one given set of information into something that can be used elsewhere.
Try this:
$line = 'Jim("jjanson", Customer.SALES)';
my %hashStore = (); #Jim("jjanson"
if($line=~m/^\s*([^\(\)]*)\(\"([^\"]*)\"/g) { $hashStore{$1} = $2; }
use Data::Dumper;
print Dumper \%hashStore;
Output:
$VAR1 = {
'Jim' => 'jjanson'
};

Regular expressions, matching operator using a string variable in Perl

I am using a regex but am getting some odd, unexpected "matches". "Names" are sent to a subroutine to be compared to an array called #ASlist, which contains multiple rows. The first element of each row is also a name, followed by 0 to several synonyms. The goal is to match the incoming "name" to any row in #ASlist that has a matching cell.
Sample input, from which $names is derived for the comparison against #ASlist:
13 1 13 chr7 7 70606019 74345818 Otud7a Klf13 E030018B13Rik Trpm1 Mir211 Mtmr10 Fan1 Mphosph10 Mcee Apba2 Fam189a1 Ndnl2 Tjp1 Tarsl2 Tm2d3 1810008I18Rik Pcsk6 Snrpa1 H47 Chsy1 Lrrk1 Aldh1a3 Asb7 Lins Lass3 Adamts17
Sample lines from #ASlist:
HSPA5 BIP FLJ26106 GRP78 MIF2
NDUFA5 B13 CI-13KD-B DKFZp781K1356 FLJ12147 NUFM UQOR13
ACAN AGC1 AGCAN CSPG1 CSPGCP MSK16 SEDK
The code:
my ($name) = #_; ## this comes in from another loop elsewhere in code I did not include
chomp $name;
my #collectmatches = (); ## container to collect matches
foreach my $ASline ( #ASlist ){
my #synonyms = split("\t", $ASline );
for ( my $i = 0; $i < scalar #synonyms; $i++ ){
chomp $synonyms[ $i ];
#print "COMPARE $name TO $synonyms[ $i ]\n";
if ( $name =~m/$synonyms[$i]/ ){
print "\tname $name from block matches\n\t$synonyms[0]\n\tvia $synonyms[$i] from AS list\n";
push ( #collectmatches, $synonyms[0], $synonyms[$i] );
}
else {
# print "$name does not match $synonyms[$i]\n";
}
}
}
The script is working but also reports weird matches. Such as, when $name is "E030018B13Rik" it matches "NDUFA5" when it occurs in #ASlist. These two should not be matched up.
If I change the regex from ~m/$synonyms[$i]/ to ~m/^$synonyms[$i]$/, the "weird" matches go away, BUT the script misses the vast majority of matches.
The NDUFA5 record contains B13 as a pattern, which will match E030018<B13>Rik.
If you want to be more literal, then add boundary conditions to your regular expression /\b...\b/. Also should probably escape regular expression special characters using quotemeta.
if ( $name =~ m/\b\Q$synonyms[$i]\E\b/ ) {
Or if you want to test straight equality, then just use eq
if ( $name eq $synonyms[$i] ) {
Another, more Perlish way to test for string equality is to use a hash.
You don't show any real test data, but this short Perl program builds a hash from your array #ASlist of lines of match strings. After that, most of the work is done.
The subsequent for loop tests just E030018B13Rik to see if it is one of the keys of the new %ASlist and prints an appropriate message
use strict;
use warnings;
my #ASlist = (
'HSPA5 BIP FLJ26106 GRP78 MIF2',
'NDUFA5 B13 CI-13KD-B DKFZp781K1356 FLJ12147 NUFM UQOR13',
'ACAN AGC1 AGCAN CSPG1 CSPGCP MSK16 SEDK',
);
my %ASlist = map { $_ => 1 } map /\S+/g, #ASlist;
for (qw/ E030018B13Rik /) {
printf "%s %s\n", $_, $ASlist{$_} ? 'matches' : 'doesn\'t match';
}
output
E030018B13Rik doesn't match
Since you only need to compare two strings, you can simply use eq:
if ( $name eq $synonyms[$i] ){
You are using B13 as the regular expression. As none of the characters has a special meaning, any string containing the substring B13 matches the expression.
E030018B13Rik
^^^
If you want the expression to match the whole string, use anchors:
if ($name =~m/^$synonyms[$i]$/) {
Or, use index or eq to detect substrings (or identical strings, respectively), as your input doesn't seem to use any features of regular expressions.

How do I use Perl's smart matching to match many patterns at once?

I was trying to follow some examples to use smart matching in the following piece of code, but failed (nothing was filtered out). How can I use smart matching here to match against multiple regexes at once?
my $regexes_to_filter_a = ("tmp", "temp", "del")
my #organism_dirs = (); # this will hold final list of dirs to processs
my #subdirs = File::Find::Rule->directory->maxdepth(1)->in($root_dir);
foreach my $subdir (#subdirs) {
my $filter = 0;
# IMPROVE: can do smart matching here
foreach my $regex ( #{$regexes_to_filter_a} ) {
if ( basename($subdir) =~ $regex ) {
$filter = 1; # filter out this dir
last;
}
}
unless ($filter) {
push #organism_dirs, $subdir;
}
}
You don't need smart matching here. The ~~ with a single regex on the right hand side and a string on the left hand side might as well be a =~, just like you have it. What are you trying to do?
For your match, you have two ways to go. If you want to use a string as a pattern, you need to use the match operator:
basename($subdir) =~ m/$regex/
If you want to not use the match operator, as you have it now, you need a regex object:
my $regexes_to_filter_a = (qr/tmp/, qr/temp/, qr/del/);
I guess you could match all the regexes at once. Note that if you are going to set maxdepth to 1, you don't really need File::Find::Rule. If you aren't going to walk a directory structure, don't use a module designed to walk a directory structure:
my $regexes_to_filter_a = (qr/tmp/, qr/temp/, qr/del/);
my #organism_dirs = ();
foreach my $subdir ( glob( '*' ) ) {
next unless -d $subdir;
unless (basename($subdir) ~~ #regexes_to_filter_a) {
push #organism_dirs, $subdir;
}
}
I think all of that is too much work though. If you want to exclude known, static directory names (so, not patterns), just use a hash:
my %ignore = map { $_, 1 } qw( tmp temp del );
my #organism_dirs =
grep { ! exists $ignore{ basename($_) } }
glob( "$rootdir/*" );
If you really want to use the smart match:
my %ignore = map { $_, 1 } qw( tmp temp del );
my #organism_dirs =
grep { basename($_) ~~ %ignore }
glob( "$rootdir/*" );
Here is a quick untested change to your example:
my #regexes_to_filter_a = (qr/^tmp$/, qr/^temp/, qr/del/);
my #organism_dirs = (); # this will hold final list of dirs to processs
my #subdirs = File::Find::Rule->directory->maxdepth(1)->in($root_dir);
foreach my $subdir (#subdirs) {
unless (basename($subdir) ~~ #regexes_to_filter_a) {
push #organism_dirs, $subdir;
}
}
The key changes are:
i) should be either #array = (...list...); or $array_ref = [...list...];
my #regexes_to_filter_a = ("tmp", "temp", "del");
ii) and change to using smart match. Below checks that basename($subdir) is in (~~) the #regexes_to_filter_a array. So no need to loop through the array and do individual regex checks.
unless (basename($subdir) ~~ #regexes_to_filter_a) { ... }
/I3az/

Perl regex replacement string special variable

I'm aware of the match, prematch, and postmatch predefined variables. I'm wondering if there is something similar for the evaluated replacement part of the s/// operator.
This would be particularly useful in dynamic expressions so they don't have to be evaluated a 2nd time.
For example, I currently have %regexs which is a hash of various search and replace strings.
Here's a snippet:
while (<>) {
foreach my $key (keys %regexs) {
while (s/$regexs{$key}{'search'}/$regexs{$key}{'replace'}/ee) {
# Here I want to do something with just the replaced part
# without reevaluating.
}
}
print;
}
Is there a convenient way to do it? Perl seems to have so many convenient shortcuts, and it seems like a waste to have to evaluate twice (which appears to be the alternative).
EDIT: I just wanted to give an example: $regexs{$key}{'replace'} might be the string '"$2$1"' thus swapping the positions of some text in the string $regexs{$key}{'search'} which might be '(foo)(bar)' - thus resulting in "barfoo". The second evaluation that I'm trying to avoid is the output of $regexs{$key}{'replace'}.
Instead of using string eval (which I assume is what's going on with s///ee), you could define code references to do the work. Those code references can then return the value of the replacement text. For example:
use strict;
use warnings;
my %regex = (
digits => sub {
my $r;
return unless $_[0] =~ s/(\d)(\d)_/$r = $2.$1/e;
return $r;
},
);
while (<DATA>){
for my $k (keys %regex){
while ( my $replacement_text = $regex{$k}->($_) ){
print $replacement_text, "\n";
}
}
print;
}
__END__
12_ab_78_gh_
34_cd_78_yz_
I'm pretty sure there isn't any direct way to do what you're asking, but that doesn't mean it's impossible. How about this?
{
my $capture;
sub capture {
$capture = $_[0] if #_;
$capture;
}
}
while (s<$regexes{$key}{search}>
<"capture('" . $regexes{$key}{replace}) . "')">eeg) {
my $replacement = capture();
#...
}
Well, except to do it really properly you'd have to shoehorn a little more code in there to make the value in the hash safe inside a singlequotish string (backslash singlequotes and backslashes).
If you do the second eval manually you can store the result yourself.
my $store;
s{$search}{ $store = eval $replace }e;
why not assign to local vars before:
my $replace = $regexs{$key}{'replace'};
now your evaluating once.