Compile dynamic regex in perl - regex

Is it at all possible to dynamically generate a regular expression using values from an array in Perl?
Let's assume I have an array of keywords that I want to match on. How can I build the regex from the values in that array?
The following doesn't seem to work:
### Generate regex dynamically
my @regx_array = ('apples','oranges','bananas');
my $dynamic_regx = qr/join("|",@regx_array)/;
As I'm looking for the following regex:
(?^i:apples|oranges|bananas);
But instead I end up with
(?^i:join("|",ARRAY(0x34c5924)));
Any help would be greatly appreciated.

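You have a couple of things wrong. First, you're building your array incorrectly. Square brackets create an array reference, so this stores a single reference in the array:
my @regx_array = ['apples','oranges','bananas'];
Use parentheses to create a list instead.
my @regx_array = ('apples','oranges','bananas');
Second, a function call such as join() is not interpolated inside qr//, so build the string first and interpolate the variable:
my $list = join( '|', @regx_array );
my $dynamic_regx = qr/$list/i;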

my @regx_array = ('apples','oranges','bananas');
my ($dynamic_regx) = map qr/$_/i, join "|", map quotemeta, @regx_array;

Even easier than the other two:
my @regx_array = qw(apples oranges bananas);
local $" = '|';
my $regex = qr/(@regx_array)/i;
$" is also known as the $LIST_SEPARATOR. And about that value:
When an array or an array slice is interpolated into a double-quoted string ... its elements are separated by this value. Default is a space.
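The answers above can be combined into one small sketch. It assumes the keywords should be matched literally, so each one goes through quotemeta before being joined; the extra keyword 'a.b' and the test strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical keyword list; 'a.b' is included to show why escaping matters.
my @keywords = ('apples', 'oranges', 'bananas', 'a.b');

# Escape each keyword, join with "|", then compile once with qr//.
my $alternation = join '|', map { quotemeta($_) } @keywords;
my $regex = qr/(?:$alternation)/i;

# Keep only the strings that contain one of the keywords.
my @hits = grep { $_ =~ $regex } ('Green APPLES', 'pears', 'a.b', 'axb');
print "$_\n" for @hits;
```

Without quotemeta, the dot in 'a.b' would match any character, so 'axb' would be reported as a hit as well.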

Related

Perl variable in regex for finding expression in hash

I have a big hash with a lot of elements.
%my_hash = ();
# filling of %my_hash automatically
$my_variable;
# set the value of $my_variable within a loop
Now I want to find the value of $my_variable within %my_hash. I tried it with
if(grep {/$my_variable/} keys %my_hash){
$my_new_variable = #here should be the element of %my_hash which makes the statement true
}
How can I do that?
Edit: The problem is that $my_variable is not a complete key of %my_hash, only a part of one, e.g.
$my_variable = astring
$modules_by_path{"this_is_a_longer_astring"} = (something)
now I want to find this...
If you're looking only for one particular key from %my_hash,
if (my ($my_new_variable) = grep /\Q$my_variable/, keys %my_hash) {
..
}
or
if (my @keys = grep /\Q$my_variable/, keys %my_hash) { .. }
if more than one key may match the regex. (Use the \Q prefix if $my_variable is a literal string to be matched rather than a regex.)
You can use grep, but you need to put it in list context to get the result you want (in scalar context it returns only the number of matches). You also need to escape the contents of $my_variable if there's any chance that it contains any regex metacharacters.
This uses \Q to escape the non-alphanumeric characters, and leaves all the hash keys that match in @matching_keys. It's up to you to decide what to do if there's more than one match!
my @matching_keys = grep /\Q$my_variable/, keys %my_hash;
I suspect that there's a better way to do this. It's spoiling the whole point of hashes to search through them like that, and I think a better data design would help. But I can't say any more unless you describe your data and your application.
If you want to match against every key of your hash, you have to iterate through them in a loop as well. This is how I would do it, though I don't know if it is the most elegant way:
#!/usr/bin/env perl
use strict;
use warnings;
my %hash = (
    foo => 1,
    bar => 1,
    baz => 1,
);
my $variable = "bar";
my $new_variable;
for my $key (keys %hash){
    # \Q treats $variable as a literal string rather than a regex
    if ($key =~ /\Q$variable/){
        $new_variable = $hash{$key};
    }
}
print $new_variable, "\n";
Also, always try to write code like this with use strict; it will spare you many classic mistakes.
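To tie the grep answers back to the original question (getting the element that matched), here is a minimal sketch. The hash contents are invented for illustration, and the hash slice is one possible way to fetch the values, not the only one.

```perl
use strict;
use warnings;

# Hypothetical data: the search string is only part of one key.
my %my_hash = (
    this_is_a_longer_astring => 'something',
    unrelated_key            => 'other',
);
my $my_variable = 'astring';

# List context returns the matching keys; \Q...\E treats the string literally.
my @matching_keys = grep { /\Q$my_variable\E/ } keys %my_hash;

# A hash slice fetches the values for all matching keys at once.
my @matching_values = @my_hash{@matching_keys};

print "$matching_keys[0] => $matching_values[0]\n" if @matching_keys;
```

If several keys can match, the order of @matching_keys follows the hash's internal order, so sort the keys first when the order matters.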

How to replace without overwriting

I've got a list of words I'm using for writing a game:
words[0[0] = 'INCREDIBLE'
words[0[1] = 'SUPERB'
words[0[2] = 'SUBLIME'
words[0[3] = 'PHENOMENAL'
words[0[4] = 'BLITZKRIEG'
words[1[0] = 'EXCELLENT'
words[1[1] = 'BOFFO'
words[1[2] = 'SMASH'
words[1[3] = 'SUPREME'
words[1[4] = 'OUTSTANDING'
I want to make this into a 2d array by replacing the second '[' with ','
Obviously I can do this manually in no time at all. Nevertheless it's something I'd very much like to learn how to do with regex and notepad++. How would I identify the second '[' and then replace it without changing the adjoining numbers?
Currently I use \d+[\d+ to find it.
Just use this:
Find what: (\[\d+)\[
Replace with: $1,
If all or most of the [ characters are in the same column, you can also hold Alt and drag with the mouse to select the whole column, then type , to replace them across the whole selection.
Try replacing (^words\[\d+)\[ with $1,
I used a lookbehind: (?<=\d)\[
Lookahead and lookbehind are two very neat (and seemingly necessary) regex features.
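The question is about Notepad++, but the same two patterns can be tried outside the editor. The sketch below uses Perl's s/// only as a stand-in for the Find/Replace dialog; the sample line is taken from the question.

```perl
use strict;
use warnings;

my $line = "words[0[1] = 'SUPERB'";

# Capture-group version: keep "[0" and replace the following "[" with ",".
(my $with_group = $line) =~ s/(\[\d+)\[/$1,/;

# Lookbehind version: replace a "[" only when a digit precedes it.
(my $with_lookbehind = $line) =~ s/(?<=\d)\[/,/;

print "$with_group\n$with_lookbehind\n";
```

Both substitutions leave the first [ alone, because it is preceded by a letter rather than by a digit.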

Regex to select semicolons that are not enclosed in double quotes

I have string like
a;b;"aaa;;;bccc";deef
I want to split string based on delimiter ; only if ; is not inside double quotes. So after the split, it will be
a
b
"aaa;;;bccc"
deef
I tried using look-behind, but I'm not able to find a correct regular expression for splitting.
Regular expressions are probably not the right tool for this. If possible, you should use a CSV library and specify ; as the delimiter and " as the quote character. This should give you the exact fields you are looking for.
That being said here is one approach that works by ensuring that there are an even number of quotation marks between the ; we are considering the split at and the end of the string.
;(?=(([^"]*"){2})*[^"]*$)
Example: http://www.rubular.com/r/RyLQyR8F19
This will break down if you can have escaped quotation marks within a string, for example a;"foo\"bar";c.
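As a rough sketch, the lookahead can be passed to Perl's split. One assumption here: the groups are rewritten as non-capturing (?:...), because split returns the text of capturing groups as extra fields.

```perl
use strict;
use warnings;

my $string = 'a;b;"aaa;;;bccc";deef';

# Split on ";" only when an even number of quotes follows up to the end.
# Non-capturing groups are used so split does not return captures as fields.
my @fields = split /;(?=(?:(?:[^"]*"){2})*[^"]*$)/, $string;

print "$_\n" for @fields;
```

The lookahead rescans the rest of the string at every semicolon, so this may become slow on very long lines.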
Here is a much cleaner example using Python's csv module:
import csv, io
reader = csv.reader(io.StringIO('a;b;"aaa;;;bccc";deef'),
                    delimiter=';', quotechar='"')
for row in reader:
    print('\n'.join(row))
Regular expressions will only get messier and break on even minor changes. You are better off using a CSV parser in any scripting language. Perl has a built-in module called Text::ParseWords (so you don't need to download anything from CPAN if there are restrictions) that lets you specify the delimiter, so you are not limited to a comma. Here is a sample snippet:
#!/usr/local/bin/perl
use strict;
use warnings;
use Text::ParseWords;
my $string = 'a;b;"aaa;;;bccc";deef';
my @ary = parse_line(q{;}, 0, $string);
print "$_\n" for @ary;
Output
a
b
aaa;;;bccc
deef
This is kind of ugly, but if you don't have \" inside your quoted strings (meaning you don't have strings that look like "foo bar \"badoo\" goo"), you can split on the " first and then assume that every odd-indexed element of the resulting array (counting from zero) is a quoted string, and split the even-indexed elements into their component parts on the ; token.
If you do have \" in your strings, then you'll want to first convert those into some other temporary token that you convert back after you've performed your operation.
Here's a fiddle...
http://jsfiddle.net/VW9an/
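var str = 'abc;def;ghi"some other dogs say \\"bow; wow; wow\\". yes they do!"and another; and a fifth';
// Hide escaped quotes behind a temporary token so they don't count as delimiters.
var strCp = str.replace(/\\"/g, "--##--");
var parts = strCp.split(/"/);
var allPieces = [];
for (var i = 0; i < parts.length; i++) {
    if (i % 2 === 0) {
        // Even-indexed parts lie outside quotes: split them on ";".
        var innerParts = parts[i].split(/;/);
        for (var j = 0; j < innerParts.length; j++) {
            allPieces.push(innerParts[j]);
        }
    } else {
        // Odd-indexed parts were inside quotes: keep them whole.
        allPieces.push('"' + parts[i] + '"');
    }
}
// Restore the escaped quotes.
for (var a = 0; a < allPieces.length; a++) {
    allPieces[a] = allPieces[a].replace(/--##--/g, '\\"');
}
console.log(allPieces);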
Match All instead of Splitting
Answering long after the battle because no one used the way that seems the simplest to me.
Once you understand that Match All and Split are Two Sides of the Same Coin, you can use this simple regex:
"[^"]*"|[^";]+
See the matches in the Regex Demo.
The left side of the alternation | matches full quoted strings
The right side matches any chars that are neither ; nor "
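In Perl, a global match in list context collects all matches, so the "match all" idea can be sketched as follows. The second input string is a made-up example that shows one limitation of this approach.

```perl
use strict;
use warnings;

my $string = 'a;b;"aaa;;;bccc";deef';

# In list context, /g returns every match: quoted strings first, then bare runs.
my @fields = $string =~ /"[^"]*"|[^";]+/g;
print "$_\n" for @fields;

# Caveat: empty fields produce no match, so they are silently dropped.
my @with_gap = 'a;;b' =~ /"[^"]*"|[^";]+/g;
print scalar(@with_gap), " fields found in 'a;;b'\n";
```

If empty fields need to be preserved, a splitting approach or a CSV parser is probably the safer choice.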

How to assign class based on regexp match (sorting in perl)

I am reading from a file. Based on the value in one column, I want to assign my own class/tag to each line.
These regexps:
'LTR*','MLT*','MST*' ...
belong to the class HERV.
'Charlie*','Looper*' ...
belong to the class DNA
Right now I have two arrays, one with regexps and one with respective classes:
my @array = map { qr{$_} } ('Alu*', 'HERV*', 'Charlie*' ...
my @classes = ('Alu', 'HERV', 'DNA', 'LINE' ...
So that I know that if my line matches Charlie*, it belongs to the class DNA.
To sum it up, for every line of the file I am looping the whole array and looking for match:
for my $i (0 .. $#array) {
if ($type =~ m/$array[$i]/) {
my $class=$classes[$i];
}
}
Of course, this is not too clever. It would be much better to say: "this group of regexps belongs to this class" which suggests use of hash.
However, I consider it quite inconvenient to loop over all lines, then over all keys of the hash, then over all values of each key, and, when there is a match, use the key as the resulting class/tag. Is this a good solution or not?
Thank you very much.
You can do something like this:
my %re = (
HERV=>qr/LTR|MLT|MST/,
DNA=> qr/Charlie|Looper/
);
my $class;
for (keys %re) {
$class = $_, last if ($type =~ $re{$_});
}
This will save you some regex compilation and one loop.
The CPAN module Text::Prefix::XS appears to do what you want: determine which if any of a list of prefixes match a given text. I have not used the module, but from what I can tell you would do something like:
use Text::Prefix::XS;
my %prefix2class = ( LTR => 'HERV',
    MLT => 'HERV',
    ...
    Charlie => 'DNA' );
my $search = prefix_search_create( keys %prefix2class );
# ... now, for a given $type, no need to loop ...
my $pfx = prefix_search($search, $type);
my $class = $prefix2class{$pfx};
(Note: Your regexes look to me like shell-style/fnmatch-style patterns dubiously compiled as regexes, and from this I infer that you actually want simple prefix matching. Otherwise, the regex /Charlie*/, for example, would match Charli, Charlieeee, fooCharliebar, and so on — that seems unlikely to be representative of your "value in one column".)
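If prefix matching is indeed what is wanted, a pure-Perl sketch without any CPAN module could look like this. It assumes the prefixes from the question, anchors each pattern with \A, and uses an ordered list so the first matching rule wins; the sub name classify and the sample values are hypothetical.

```perl
use strict;
use warnings;

# Ordered list of [class, pattern] pairs; \A anchors each pattern to the start.
my @rules = (
    [ HERV => qr/\A(?:LTR|MLT|MST)/ ],
    [ DNA  => qr/\A(?:Charlie|Looper)/ ],
);

# Return the first class whose pattern matches, or undef if none does.
sub classify {
    my ($type) = @_;
    for my $rule (@rules) {
        my ($class, $re) = @$rule;
        return $class if $type =~ $re;
    }
    return;
}

# Hypothetical column values, for illustration only.
for my $type (qw(LTR12 Charlie3 fooCharliebar)) {
    my $class = classify($type) // 'unknown';
    print "$type => $class\n";
}
```

An array of pairs is used instead of a hash because hash keys come back in no particular order, which matters once two patterns can match the same value.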

How do I assign many values to a particular Perl variable?

I am writing a script in Perl which searches for a motif (substring) in a protein sequence (string). The motif sequence to be searched for (the substring) is hhhDDDssEExD, where:
h is any hydrophobic amino acid
s is any small amino acid
x is any amino acid
h,s,x can have more than one value separately
Can more than one value be assigned to one variable? If yes, how should I do that? I want to assign a list of multiple values to a variable.
It seems like you want some kind of pattern matching. This can be done with strings using regular expressions.
You can use character classes in your regular expression. The classes you mentioned would be:
h -> [VLIM]
s -> [AG]
x -> [A-IK-NP-TV-Z]
The last one means "A to I, K to N, P to T, V to Z".
The regular expression for your example would be:
/[VLIM]{3}D{3}[AG]{2}E{2}[A-IK-NP-TV-Z]D/
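Rather than writing the pattern by hand, it could be generated from the motif string. The sketch below assumes the character classes given in this answer (they are not an authoritative definition of hydrophobic or small residues) and uses an invented sequence for testing.

```perl
use strict;
use warnings;

# Character classes taken from the answer above; adjust them to your own definitions.
my %class_for = (
    h => '[VLIM]',
    s => '[AG]',
    x => '[A-IK-NP-TV-Z]',
);

# Translate the motif letter by letter; letters without a class match themselves.
my $motif    = 'hhhDDDssEExD';
my $pattern  = join '', map { $class_for{$_} // quotemeta($_) } split //, $motif;
my $motif_re = qr/$pattern/;

# Hypothetical protein sequence, for illustration only.
my $sequence = 'AAVLIDDDAGEEKDAA';
my ($found) = $sequence =~ /($motif_re)/;
print defined $found ? "Found $found\n" : "No match\n";
```

Keeping the classes in a hash means one variable name maps to many allowed values, which is close to what the question asks for.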
I am no great expert in Perl, so there is quite possibly a quicker way to do this, but it seems like the match operator "//" in list context is what you need. When you assign the result of a match operation to a list, the match operator takes on list context and returns a list with each of the parenthesis-delimited sub-expressions. If you specify global matches with the "g" flag, it will return a list of all the matches of each sub-expression. Example:
# print a list of each match for "x" in "xxx"
@aList = ("xxx" =~ /(x)/g);
print(join(".", @aList));
Will print out
x.x.x
I'm assuming you have a regular expression for each of those 5 types h, D, s, E, and x. You didn't say whether each of these parts is a single character or multiple, so I'm going to assume they can be multiple characters. If so, your solution might be something like this:
$h = ""; # Insert regex to match "h"
$D = ""; # Insert regex to match "D"
$s = ""; # Insert regex to match "s"
$E = ""; # Insert regex to match "E"
$x = ""; # Insert regex to match "x"
# Outer groups capture each whole run; the inner (?:...) groups only repeat.
$sequenceRE = "((?:$h){3})((?:$D){3})((?:$s){2})((?:$E){2})($x)($D)";
if ($line =~ /$sequenceRE/) {
    $hPart = $1;
    $sPart = $3;
    $xPart = $5;
    @hValues = ($hPart =~ /($h)/g);
    @sValues = ($sPart =~ /($s)/g);
    @xValues = ($xPart =~ /($x)/g);
}
I'm sure there is something I've missed, and there are some subtleties of perl that I have overlooked, but this should get you most of the way there. For more information, read up on perl's match operator, and regular expressions.
I could be way off, but it sounds like you want an object with a built in method to output as a string.
If you start with a string, like the one you mentioned, you could pass the string to the class as a new object, use regular expressions like everyone has already suggested to parse out the chunks that you would then assign as variables to that object. Finally, you could have it output a string based on the variables of that object, for instance:
$string = "COHOCOHOCOHOCOHOCOHOC";
$sugar = new Organic($string);
Class Organic {
$chem;
function __construct($chem) {
$hydro_find = "OHO";
$carb_find = "C";
$this-> hydro = preg_find ($hydro_find, $chem);
$this -> carb = preg_find ($carb_find, $chem);
function __TO_STRING() {
return $this->carb."="$this->hydro;
}
}
echo $sugar;
Okay, that kind of fell apart in the end, and it was pseudo-php, not perl. But if I understand your question correctly, you are looking for a way to get all of the info from the string but keep it tied to that string. That would be objects and classes.
You probably want an array (or arrayref) or a pattern (qr//).
Or maybe Quantum::Superpositions.