Perl variable in regex for finding expression in hash

I have a big hash with a lot of elements.
%my_hash = ();
# %my_hash is filled automatically
$my_variable;
# set the value of $my_variable within a loop
Now I want to find the value of $my_variable within %my_hash. I tried it with
if(grep {/$my_variable/} keys %my_hash){
$my_new_variable = #here should be the element of %my_hash which makes the statement true
}
How do I do that?
Edit: The problem is that $my_variable is only part of the key in %my_hash, e.g.
$my_variable = astring
$modules_by_path{"this_is_a_longer_astring"} = (something)
now I want to find this...

If you're looking only for one particular key from %my_hash,
if (my ($my_new_variable) = grep /\Q$my_variable/, keys %my_hash) {
..
}
or
if (my @keys = grep /\Q$my_variable/, keys %my_hash) { .. }
if more than one key may match the pattern. (Use the \Q prefix if $my_variable is a literal string to be matched rather than a regex.)

You can use grep, but you need to put it in list context to get the result you want; in scalar context it returns only the number of matches. You also need to escape the contents of $my_variable if there's any chance that it contains any regex metacharacters.
This uses \Q to escape the non-alphanumeric characters, and leaves all the hash keys that match in @matching_keys. It's up to you to decide what to do if there's more than one match!
my @matching_keys = grep /\Q$my_variable/, keys %my_hash;
I suspect that there's a better way to do this. It's spoiling the whole point of hashes to search through them like that, and I think a better data design would help. But I can't say any more unless you describe your data and your application.
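A minimal sketch of getting from the matching keys to the values the question asks for. The hash contents and the sub name values_for_matching_keys are made up for illustration, and the sort is only there to make the order predictable.

```perl
use strict;
use warnings;

# Hypothetical data, modelled on the example in the question.
my %my_hash = (
    this_is_a_longer_astring => 'something',
    unrelated_key            => 'other',
    'astring.v2'             => 'dotted',
);

# Return the values whose keys contain $needle as a literal substring.
sub values_for_matching_keys {
    my ($hash, $needle) = @_;
    my @matching_keys = sort grep { /\Q$needle\E/ } keys %$hash;
    return @{$hash}{@matching_keys};    # hash slice through the reference
}

my @found = values_for_matching_keys(\%my_hash, 'astring');
print "$_\n" for @found;
```

Because of \Q, a needle such as 'a.tring' is treated literally and matches nothing here, whereas without it the dot would match any character.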

If you want to match every key of your hash, you have to iterate through them in a loop as well. This is how I would do it, though I don't know if it is the most elegant way:
#!/usr/bin/env perl
use strict;
use warnings;
my %hash = (
foo => 1,
bar => 1,
baz => 1,
);
my $variable = "bar";
my $new_variable;
for my $key (keys %hash){
if ($key =~ /$variable/){
$new_variable = $hash{$key};
}
}
print $new_variable, "\n";
Also, always write code like this with use strict; it will spare you many classic mistakes.

Related

perl string catenation and substitution in a single line?

I need to modify a perl variable containing a file path; it needs to begin and end with a forward slash (/) and have all instances of multiple forward slashes reduced to a single slash.
(This is because an existing process does not enforce a consistent configuration syntax, so there are hundreds of config files scattered everywhere that may or may not have slashes in the right places in file names and path names.)
Something like this:
foreach ( ($config->{'backup_path'},
$config->{'work_path'},
$config->{'output_path'}
) ) {
$_ = "/" . $_ . "/";
$_ =~ s/\/{2,}/\//g;
}
but this does not look optimal or particularly readable to me; I'd rather have a more elegant expression (if it ends up using an unusual regex I'll use a comment to make it clearer.)
Input & output examples
home/datamonster//c2counts becomes /home/datamonster/c2counts/
home/////teledyne/tmp/ becomes /home/teledyne/tmp/
and /var/backup/DOC/all_instruments/ will pass through unchanged
Well, just rewriting what you got:
my @vars = qw ( backup_path work_path output_path );
for ( @{$config}{@vars} ) {
s,^/*,/,; #prefix
s,/*$,/,; #suffix
s,/+,/,g; #double slashes anywhere else.
}
I'd be cautious - optimising for magic regex is not an advantage in every situation, because they become quite quickly unreadable.
The above uses the hash slice mechanism to select values out of a hash (a hash reference in this case), and relies on s/// implicitly operating on $_. Because $_ is aliased to each selected value, the original entry in the hash is modified.
But it's also useful to know, if you're operating on patterns containing / it's helpful to switch delimiters, because that way you don't get the "leaning toothpicks" effect.
s/\/{2,}/\//g can be written as:
s,/+,/,g
or
s|/{2,}|/|g
if you want to keep the numeric quantifier. + means one or more, which gives the same result here because a run of slashes collapses to a single one either way; technically it also matches a lone / (and replaces it with /), which the original pattern doesn't. And you wouldn't want to use , as the delimiter if your pattern contains a comma, for the same reason.
However I think this does the trick;
s,(?:^/*|\b\/*$|/+),/,g for @{$config}{qw ( backup_path work_path output_path )};
This matches an alternation grouping, replacing either:
start of line, zero or more /
word boundary, zero or more / end of line
one or more slashes anywhere else.
with a single /.
uses the hash slice mechanism as above, but without the intermediate 'vars'.
(For some reason the second grouping doesn't work correctly without the word boundary \b zero width anchor - I think this is a backtracking issue, but I'm not entirely sure)
For bonus points - you could probably select @vars using grep if your source data structure is appropriate:
my @vars = grep { /_path$/ } keys %$config;
#etc.
Or inline with:
s,(?:^/*|\b\/*$|/+),/,g for @{$config}{grep { /_path$/ } keys %$config };
Edit: Or as Borodin notes:
s,(?:/|\A|\b\z)/*,/,g
Giving us:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $config = {
backup_path => "/fish/",
work_path => "narf//zoit",
output_path => "/wibble",
test_path => 'home/datamonster//c2counts',
another_path => "/home/teledyne/tmp/",
again_path => 'home/////teledyne/tmp/',
this_path => '/var/backup/DOC/all_instruments/',
};
s,(?:/|\A|\b\z)/*,/,g for @{$config}{grep { /_path$/ } keys %$config };
print Dumper $config;
Results:
$VAR1 = {
'output_path' => '/wibble/',
'this_path' => '/var/backup/DOC/all_instruments/',
'backup_path' => '/fish/',
'work_path' => '/narf/zoit/',
'test_path' => '/home/datamonster/c2counts/',
'another_path' => '/home/teledyne/tmp/',
'again_path' => '/home/teledyne/tmp/'
};
you could do it like this, but I wouldn't call it more readable:
foreach ( ($config->{'backup_path'},
$config->{'work_path'},
$config->{'output_path'}
) ) {
( $_ = "/$_/" ) =~ s/\/{2,}/\//g;
}
This question already got many fantastic answers.
From the view of non-perl-expert (me), some are hard to read / understand. ;)
So, I would probably use this:
my @vars = qw ( backup_path work_path output_path );
for my $var (@vars) {
my $value = '/' . $config->{$var} . '/';
$value =~ s|//+|/|g;
$config->{$var} = $value;
}
For me, this will be readable after a year too. :)
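The wrap-then-collapse idea can also be packaged as a small sub and checked against the examples from the question. This is only a sketch; the name normalize_path is made up here.

```perl
use strict;
use warnings;

# Wrap the path in slashes, then collapse every run of slashes to one.
sub normalize_path {
    my ($path) = @_;
    (my $clean = "/$path/") =~ s{/+}{/}g;
    return $clean;
}

for my $path ('home/datamonster//c2counts',
              'home/////teledyne/tmp/',
              '/var/backup/DOC/all_instruments/') {
    printf "%s becomes %s\n", $path, normalize_path($path);
}
```

Using braces as the delimiters avoids escaping the slashes, and working on a copy leaves the caller's value untouched until it is assigned back.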

How to assign class based on regexp match (sorting in perl)

I am reading from file. Based on value in one column, I want to assign my own class/tag to it.
These regexps:
'LTR*','MLT*','MST*' ...
belong to the class HERV.
'Charlie*','Looper*' ...
belong to the class DNA
Right now I have two arrays, one with regexps and one with respective classes:
my @array = map { qr{$_} } ('Alu*', 'HERV*', 'Charlie*' ...
my @classes = ('Alu', 'HERV', 'DNA', 'LINE' ...
So that I know that if my line matches Charlie*, it belongs to the class DNA.
To sum it up, for every line of the file I am looping the whole array and looking for match:
for my $i (0 .. $#array) {
if ($type =~ m/$array[$i]/) {
my $class=$classes[$i];
}
}
Of course, this is not too clever. It would be much better to say: "this group of regexps belongs to this class" which suggests use of hash.
However, I consider it quite inconvenient to loop over all lines, then all keys of the hash, and then all values of each key and, when there is a match, use the key as the resulting class/tag. Is this a good solution or not?
Thank you very much.
You can do something like this:
my %re = (
HERV=>qr/LTR|MLT|MST/,
DNA=> qr/Charlie|Looper/
);
my $class;
for (keys %re) {
$class = $_, last if ($type =~ $re{$_});
}
This will save you some regex compilation and one loop.
The CPAN module Text::Prefix::XS appears to do what you want: determine which if any of a list of prefixes match a given text. I have not used the module, but from what I can tell you would do something like:
my %prefix2class = ( LTR => 'HERV',
MLT => 'HERV',
...
Charlie => 'DNA' );
my $search = prefix_search_create( keys %prefix2class );
# ... now, for a given $type, no need to loop ...
my $pfx = prefix_search($search, $type);
my $class = $prefix2class{$pfx};
(Note: Your regexes look to me like shell-style/fnmatch-style patterns dubiously compiled as regexes, and from this I infer that you actually want simple prefix matching. Otherwise, the regex /Charlie*/, for example, would match Charli, Charlieeee, fooCharliebar, and so on — that seems unlikely to be representative of your "value in one column".)
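If plain prefix matching is indeed what is wanted, it can also be sketched without any CPAN module by compiling all prefixes into one anchored alternation. The table contents and the sub name class_for below are assumptions for illustration, not part of the original answers.

```perl
use strict;
use warnings;

# Hypothetical prefix table built from the names in the question.
my %prefix2class = (
    LTR     => 'HERV',
    MLT     => 'HERV',
    MST     => 'HERV',
    Charlie => 'DNA',
    Looper  => 'DNA',
);

# Longest prefixes first, so a longer prefix wins over a shorter one it contains.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b } keys %prefix2class;
my $prefix_re = qr/^($alternation)/;

# Return the class for $type, or undef when no prefix matches.
sub class_for {
    my ($type) = @_;
    return $type =~ $prefix_re ? $prefix2class{$1} : undef;
}

for my $type (qw(LTR12C Charlie4a MIR)) {
    my $class = class_for($type);
    print "$type => ", defined $class ? $class : 'unknown', "\n";
}
```

The regex is compiled once, and each line of the file then needs a single match plus one hash lookup instead of a loop over all patterns.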

evaluate pattern stored in variable perl regexp

I am trying to find out if basket has apple [simplified version of a big problem]
$check_fruit = "\$fruit =~ \/has\/apple\/";
$fruit="basket/has/mango/";
if ($check_fruit) {
print "apple found\n";
}
The check_fruit variable holds the statement that evaluates the regexp.
However, the check_fruit variable is always true, so it always shows apple found :(
Can somebody tell me what I am missing?
Goal to accomplish:
Okay so let me explain:
1. I have a file with a pattern clause defined on each line, similar to:
Line1: $fruit_origin=~/europe\\/finland/ && $fruit_taste=~/sweet/
Line2: similar stuff that can contain ~10 pattern checks separated by && or || with metacharacters too
2. I have another list of fruit attributes, from a Perl hash containing many such fruits.
3. I want to categorize each fruit to see how many fruits fall into the category defined by each line of the file separately.
A sort of fruit count/profile per line. Is there an easier way to accomplish this? Thanks a lot
if ($check_fruit) returns true because $check_fruit is defined, not empty and not zero. If you want to evaluate its content, use eval. But a subroutine would serve better:
sub check_fruit {
my $fruit = shift;
return $fruit =~ m(has/apple);
}
if (check_fruit($fruit)) {
print "Apple found\n";
}
Why is there a need to store the statement in a variable? If you're sure the value isn't set by a user, then you can do
if (eval $check_fruit) {
but this isn't safe if the user can set anything in that expression.
Put the pattern (and only the pattern) into the variable, use the variable inside the regular expression matching delimiters m/.../. If you don't know the pattern in advance then use quotemeta for escaping any meta characters.
It should look like this:
my $check_fruit = '/has/apple/'; # here no quotemeta is needed
my $fruit = 'basket/has/mango/';
if ($fruit =~ m/$check_fruit/) {
# do stuff!
}
$check_fruit is nothing but a variable holding string data. If you want to execute the code it contains, you have to use eval.
There were also some other errors in your code related to string quoting/escaping. This fixes that as well:
use strict;
use warnings;
my $check_fruit = '$apple =~ m|/has/mango|';
my $apple="basket/has/mango/";
if (eval $check_fruit) {
print "apple found\n";
}
However, this is not usually a good design. At the very least, it makes for confusing code. It is also a huge security hole if $check_fruit is coming from the user. You can put a regex into a variable, which is preferable:
Edit: note that a regex that comes from user input can be a security problem as well, but it is more limited in scope.
my $check_fruit = qr|/has/mango|;
my $apple="basket/has/mango/";
if ($apple =~ /$check_fruit/) {
print "apple found\n";
}
There are other things you can do to make your Perl code more dynamic, as well. The best approach would depend on what you are trying to accomplish.
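One such approach, sketched under the assumption that the rules can be written in the program rather than read as code from a file: store each rule as an anonymous sub and count how many fruits satisfy it. The rule names, attribute names, and fruit data are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical rules: each one is a code ref taking a fruit (hash ref),
# so nothing has to go through string eval.
my %rule = (
    finnish_sweet => sub { $_[0]{origin} =~ m{europe/finland} && $_[0]{taste} =~ /sweet/ },
    any_sour      => sub { $_[0]{taste} =~ /sour/ },
);

# Hypothetical fruit attributes.
my @fruits = (
    { name => 'apple', origin => 'europe/finland', taste => 'sweet' },
    { name => 'lemon', origin => 'europe/spain',   taste => 'sour'  },
    { name => 'berry', origin => 'europe/finland', taste => 'sour'  },
);

# grep in scalar context returns the number of fruits that pass the rule.
my %count;
for my $name (keys %rule) {
    $count{$name} = grep { $rule{$name}->($_) } @fruits;
}

print "$_: $count{$_}\n" for sort keys %count;
```

If the clauses really must stay in an external file, each line would still need eval (with the security caveat above) or a small parser that turns it into subs like these.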

Trying to simplify a Regex

I'm spending my weekend analyzing Campaign Finance Contribution records. Fun!
One of the annoying things I've noticed is that entity names are entered differently:
For example, i see stuff like this: 'llc', 'llc.', 'l l c', 'l.l.c', 'l. l. c.', 'llc,', etc.
I'm trying to catch all these variants.
So it would be something like:
"l([,\.\ ]*)l([,\.\ ]*)c([,\.\ ]*)"
Which isn't so bad... except there are about 40 entity suffixes that I can think of.
The best thing I can think of is programmatically building up this pattern , based on my list of suffixes.
I'm wondering if there's a better way to handle this within a single regex that is human readable/writable.
You could just strip out excess crap. Using Perl:
my $suffix = "l. lc.."; # the worst case imaginable!
$suffix =~ s/[.\s]//g;
# no matter what variation $suffix was, it's now just "llc"
Obviously this may maul your input if you use it on the full company name, but getting too in-depth with how to do that would require knowing what language we're working with. A possible regex solution is to copy the company name and strip out a few common words and any words with more than (about) 4 characters:
my $suffix = $full_name;
$suffix =~ s/\w{4,}//g; # strip words of 4 or more characters
$suffix =~ s/\b(?:a|the|an|of)\b//ig; # strip a few common whole words
# now we can mangle $suffix all we want
# and be relatively sure of what we're doing
It's not perfect, but it should be fairly effective, and more readable than using a single "monster regex" to try to match all of them. As a rule, don't use a monster regex to match all cases, use a series of specialized regexes to narrow many cases down to a few. It will be easier to understand.
Regexes (other than relatively simple ones) and readability rarely go hand-in-hand. Don't misunderstand me, I love them for the simplicity they usually bring, but they're not fit for all purposes.
If you want readability, just create an array of possible values and iterate through them, checking your field against them to see if there's a match.
Unless you're doing gene sequencing, the speed difference shouldn't matter. And it will be a lot easier to add a new one when you discover it. Adding an element to an array is substantially easier than reverse-engineering a regex.
The first two "l" parts can be simplified by [the first "l" part here]{2}.
You can squish periods and whitespace first, before matching: for instance, in perl:
while (<>) {
    $Sq = $_;
    $Sq =~ s/[.\s]//g; # squish away . and whitespace in the temporary copy
    $Sq = lc($Sq);
    $Sq =~ /^llc$/ and $_ = "L.L.C.\n"; # match the squished copy; if so, save the canonical version
    $Sq =~ /^ibm/ and $_ = "IBM\n"; # a different match
    print $_;
}
Don't use regexes, instead build up a map of all discovered (so far) entries and their 'canonical' (favourite) versions.
Also build a tool to discover possible new variants of postfixes by identifying common prefixes to a certain number of characters and printing them on the screen so you can add new rules.
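A minimal sketch of such a map, assuming the variants differ only in case, periods, commas, and whitespace. The table entries and the sub name canonical_suffix are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical lookup table: squished, lower-cased suffix => preferred spelling.
my %canonical = (
    llc => 'LLC',
    inc => 'Inc.',
    ltd => 'Ltd.',
);

# Strip separators and case, then look the result up; undef means "not seen yet".
sub canonical_suffix {
    my ($raw) = @_;
    (my $key = lc $raw) =~ s/[.,\s]//g;
    return $canonical{$key};
}

for my $raw ('l. l. c.', 'L L C', 'llc,', 'Inc', 'gmbh') {
    my $found = canonical_suffix($raw);
    print "'$raw' -> ", defined $found ? $found : '(unknown, add a rule)', "\n";
}
```

Adding a newly discovered suffix is then one more line in the table rather than a change to a pattern.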
In Perl you can build up regular expressions inside your program using strings. Here's some example code:
#!/usr/bin/perl
use strict;
use warnings;
my @strings = (
"l.l.c",
"llc",
"LLC",
"lLc",
"l,l,c",
"L . L C ",
"l W c"
);
my @seps = ('.',',','\s');
my $sep_regex = '[' . join('', @seps) . ']*';
my $regex_def = join '', (
'[lL]',
$sep_regex,
'[lL]',
$sep_regex,
'[cC]'
);
print "definition: $regex_def\n";
foreach my $str (@strings) {
if ( $str =~ /$regex_def/ ) {
print "$str matches\n";
} else {
print "$str doesn't match\n";
}
}
This regular expression could also be simplified by using case-insensitive matching (which means $match =~ /$regex/i ). If you run this a few times on the strings that you define, you can easily see cases that don't validate according to your regular expression. Building up your regular expression this way can be useful in only defining your separator symbols once, and I think that people are likely to use the same separators for a wide variety of abbreviations (like IRS, I.R.S, irs, etc).
You also might think about looking into approximate string matching algorithms, which are popular in a large number of areas. The idea behind these is that you define a scoring system for comparing strings, and then you can measure how similar input strings are to your canonical string, so that you can recognize that "LLC" and "lLc" are very similar strings.
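One common scoring system of this kind is the Levenshtein edit distance: the number of single-character insertions, deletions, and substitutions needed to turn one string into another. The following pure-Perl sketch is an illustration of that idea only; the sub name edit_distance and the sample strings are assumptions.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Dynamic-programming edit distance (Levenshtein), keeping one row at a time.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # deletion
                $cur[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,    # substitution (or match)
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Compare a few variants against the canonical lower-case form.
printf "%-8s %d\n", $_, edit_distance('llc', lc $_) for ('LLC', 'l.l.c', 'lcc', 'inc');
```

A small distance suggests a variant of the same suffix, but the threshold has to be chosen by hand, since short suffixes such as "llc" and "inc" are only a couple of edits apart.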
Alternatively, as other people have suggested you could write an input sanitizer that removes unwanted characters like whitespace, commas, and periods. In the context of the program above, you could do this:
my $sep_regex = '[' . join('', @seps) . ']*';
foreach my $str (@strings) {
my $copy = $str;
$copy =~ s/$sep_regex//g;
$copy = lc $copy;
print "$str -> $copy\n";
}
If you have control of how the data is entered originally, you could use such a sanitizer to validate input from the users and other programs, which will make your analysis much easier.

How do I assign many values to a particular Perl variable?

I am writing a script in Perl which searches for a motif(substring) in protein sequence(string). The motif sequence to be searched (or substring) is hhhDDDssEExD, where:
h is any hydrophobic amino acid
s is any small amino acid
x is any amino acid
h,s,x can have more than one value separately
Can more than one value be assigned to one variable? If yes, how should I do that? I want to assign a list of multiple values to a variable.
It seems like you want some kind of pattern matching. This can be done with strings using regular expressions.
You can use character classes in your regular expression. The classes you mentioned would be:
h -> [VLIM]
s -> [AG]
x -> [A-IK-NP-TV-Z]
The last one means "A to I, K to N, P to T, V to Z".
The regular expression for your example would be:
/[VLIM]{3}D{3}[AG]{2}E{2}[A-IK-NP-TV-Z]D/
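A short sketch of how this pattern might be used in a script, with each class kept in its own compiled regex so it is defined once. The classes are the ones listed above; the sub name find_motifs and the sample sequence are made up for illustration.

```perl
use strict;
use warnings;

# Character classes taken from the answer above.
my $h = qr/[VLIM]/;            # hydrophobic
my $s = qr/[AG]/;              # small
my $x = qr/[A-IK-NP-TV-Z]/;    # any amino acid

# (?:...) keeps Perl from reading "$h{3}" as a hash element.
my $motif = qr/(?:$h){3}D{3}(?:$s){2}E{2}(?:$x)D/;

# Return every occurrence of the motif hhhDDDssEExD as [offset, text].
sub find_motifs {
    my ($sequence) = @_;
    my @hits;
    while ($sequence =~ /($motif)/g) {
        push @hits, [ $-[1], $1 ];    # $-[1] is the 0-based start of group 1
    }
    return @hits;
}

my $protein = 'MKTVLIDDDAGEEWDPQRSTLLMDDDGGEEKDAA';    # hypothetical sequence
printf "offset %d: %s\n", @$_ for find_motifs($protein);
```

So the "many values" live in the character class rather than in a variable holding a list; widening a class is a one-line change.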
I am no great expert in perl, so there is quite possibly a quicker way to do this, but it seems like the match operator "//" in list context is what you need. When you assign the result of a match operation to a list, the match operator takes on list context and returns a list with each of the parenthesis-delimited sub-expressions. If you specify global matches with the "g" flag, it will return a list of all the matches of each sub-expression. Example:
# print a list of each match for "x" in "xxx"
@aList = ("xxx" =~ /(x)/g);
print(join(".", @aList));
Will print out
x.x.x
I'm assuming you have a regular expression for each of those 5 types h, D, s, E, and x. You didn't say whether each of these parts is a single character or multiple, so I'm going to assume they can be multiple characters. If so, your solution might be something like this:
$h = ""; # Insert regex to match "h"
$D = ""; # Insert regex to match "D"
$s = ""; # Insert regex to match "s"
$E = ""; # Insert regex to match "E"
$x = ""; # Insert regex to match "x"
# Capture each repeated run as a whole; ($h){3} would keep only the last repetition.
$sequenceRE = "((?:$h){3})((?:$D){3})((?:$s){2})((?:$E){2})($x)($D)";
if ($line =~ /$sequenceRE/) {
    $hPart = $1;
    $sPart = $3;
    $xPart = $5;
    @hValues = ($hPart =~ /($h)/g);
    @sValues = ($sPart =~ /($s)/g);
    @xValues = ($xPart =~ /($x)/g);
}
}
I'm sure there is something I've missed, and there are some subtleties of perl that I have overlooked, but this should get you most of the way there. For more information, read up on perl's match operator, and regular expressions.
I could be way off, but it sounds like you want an object with a built in method to output as a string.
If you start with a string, like the one you mentioned, you could pass the string to the class as a new object, use regular expressions like everyone has already suggested to parse out the chunks that you would then assign as variables to that object. Finally, you could have it output a string based on the variables of that object, for instance:
$string = "COHOCOHOCOHOCOHOCOHOC";
$sugar = new Organic($string);
class Organic {
    public $hydro;
    public $carb;
    function __construct($chem) {
        // count how often each fragment occurs in the string
        $this->hydro = preg_match_all('/OHO/', $chem);
        $this->carb = preg_match_all('/C/', $chem);
    }
    function __toString() {
        return $this->carb . "=" . $this->hydro;
    }
}
echo $sugar;
Okay, that kind of fell apart in the end, and it was pseudo-php, not perl. But if I understand your question correctly, you are looking for a way to get all of the info from the string but keep it tied to that string. That would be objects and classes.
You probably want an array (or arrayref) or a pattern (qr//).
Or maybe Quantum::Superpositions.