To match a variable containing metacharacters using regular expressions in perl - regex

$match="";
for($i=0;$i<=$#wor;$i++)
{
$match=$match.$letter[$wor[$i]];
}
print $match;
open ABC,"<words.txt";
while(<ABC>)
{
if($_ =~ /^$match$/ )
{
print "$_";
print "\n";
}
}
In the following code, I am not able to match the line of the file i.e. $_ with the variable $match (which contains the actual metacharacters which are to be matched )?
And hence no output is produced
What changes are needed?

You need to remove the ^ and $ anchors from your regexp which match the beginning and end of a string.
With them, the regexp will only match lines which only contain the meta-characters.
You probably also want to wrap $match in [ .. ] characters, to indicate that it's a range of characters, and not a word.
For example, if you wanted to exclude any line containing _ or % your $match would need to contain [_$]
EDIT if, per the comments, you only want to match if the meta characters are found at either end, use:
if (/^${match}/ || /${match}$/) {
...
}

Related

Perl regular expression (starts with ATG and ends with TAG, TAA, or TGA)

I need a regular expression in perl that will match with ATG at the start, and ends with either TAG, TAA, or TGA. This is the code I have so far.
my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';
while($sequence =~ ____) {
print $1;
}
Since you're dealing with codons here, you probably forgot to mention that the nuclotides in between must be a multiple of 3.
Code:
my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';
# |--------------1-------------|
# |---2---|
while($sequence =~ /ATG(?:[ACTG]{3})*?T(?:A[AG]|GA)/g)
{
print $&."\n";
}
Output:
ATGGTTTCTCCCATCTCTCCATCGGCATAA
ATGATCTAA
Description:
ATG - Matches "ATG" literally
(?:[ACTG]{3})*? - is a non capturing group, repeated 0 o more times, as few as possible (lazy quantifier, the extra ?), matching:
[ACTG]{3} - 3 characters/nucleotides (either "A", "C", "T" or "G")
T(?:A[AG]|GA) - matches "TAA", "TAG", or "TGA". Also, as Borodin commented, this can be written as (?:TAG|TAA|TGA) if you prefer to improve readability.
But if you also need to match overlapping sequences, you should use a lookahead to prevent the match from consuming the characters.
Code:
# modified to include overlapping sequences
my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATATGGAATGATCTAACGAA';
# |--------------1-------------|
# |---2---|
# |---3---|
while($sequence =~ /ATG(?=((?:[ACTG]{3})*?T(?:A[AG]|GA)))/g)
{
print $&.$1."\n";
}
Output:
ATGGTTTCTCCCATCTCTCCATCGGCATAA
ATGGAATGA
ATGATCTAA
And finally, this is a more efficent version of the last expression, using the Unrolling the Loop technique, that will yield better results when you're dealing with large sequences.
Code:
my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATATGGAATGATCTAACGAA';
# modified to include overlapping sequences
while($sequence =~ /ATG(?=((?:[ACG][ACTG]{2})*(?:T(?:A[CT]|G[CTG]|[CT][ACTG])(?:[ACG][ACTG]{2})*)*T(?:A[AG]|GA)))/g)
{
print $&.$1."\n";
}
Output:
ATGGTTTCTCCCATCTCTCCATCGGCATAA
ATGGAATGA
ATGATCTAA
Not sure if this is what you are looking for, but this finds one match so it is non-exhaustive
my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';
if($sequence =~ /(ATG.*?(:?TAG|TAA|TGA))/) {
print $1;
}
OUTPUT
ATGGTTTCTCCCATCTCTCCATCGGCATAA
Try this:
#!/usr/bin/perl
my $sequence ='AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';
while($sequence =~m/ATG.*?(TGA|TAG|TAA)/g)
{
print $&,"\n";
}

perl Regex replace for specific string length

I am using Perl to do some prototyping.
I need an expression to replace e by [ee] if the string is exactly 2 chars and finishes by "e".
le -> l [ee]
me -> m [ee]
elle -> elle : no change
I cannot test the length of the string, I need one expression to do the whole job.
I tried:
`s/(?=^.{0,2}\z).*e\z%/[ee]/g` but this is replacing the whole string
`s/^[c|d|j|l|m|n|s|t]e$/[ee]/g` same result (I listed the possible letters that could precede my "e")
`^(?<=[c|d|j|l|m|n|s|t])e$/[ee]/g` but I have no match, not sure I can use ^ on a positive look behind
EDIT
Guys you're amazing, hours of search on the web and here I get answers minutes after I posted.
I tried all your solutions and they are working perfectly directly in my script, i.e. this one:
my $test2="le";
$test2=~ s/^(\S)e$/\1\[ee\]/g;
print "test2:".$test2."\n";
-> test2:l[ee]
But I am loading these regex from a text file (using Perl for proto, the idea is to reuse it with any language implementing regex):
In the text file I store for example (I used % to split the line between match and replace):
^(\S)e$% \1\[ee\]
and then I parse and apply all regex like that:
my $test="le";
while (my $row = <$fh>) {
chomp $row;
if( $row =~ /%/){
my #reg = split /%/, $row;
#if no replacement, put empty string
if($#reg == 0){
push(#reg,"");
}
print "reg found, reg:".$reg[0].", replace:".$reg[1]."\n";
push #regs, [ #reg ];
}
}
print "orgine:".$test."\n";
for my $i (0 .. $#regs){
my $p=$regs[$i][0];
my $r=$regs[$i][1];
$test=~ s/$p/$r/g;
}
print "final:".$test."\n";
This technique is working well with my other regex, but not yet when I have a $1 or \1 in the replace... here is what I am obtaining:
final:\1\ee\
PS: you answered to initial question, should I open another post ?
Something like s/(?i)^([a-z])e$/$1[ee]/
Why aren't you using a capture group to do the replacement?
`s/^([c|d|j|l|m|n|s|t])e$/\1 [ee]/g`
If those are the characters you need and if it is indeed one word to a line with no whitespace before it or after it, then this will work.
Here's another option depending on what you are looking for. It will match a two character string consisting of one a-z character followed by one 'e' on its own line with possible whitespace before or after. It will replace this will the single a-z character followed by ' [ee]'
`s/^\s*([a-z])e\s*$/\1 [ee]/`
^(\S)e$
Try this.Replace by $1 [ee].See demo.
https://regex101.com/r/hR7tH4/28
I'd do something like this
$word =~ s/^(\w{1})(e)$/$1$2e/;
You can use following regex which match 2 character and then you can replace it with $1\[$2$2\]:
^([a-zA-Z])([a-zA-Z])$
Demo :
$my_string =~ s/^([a-zA-Z])([a-zA-Z])$/$1[$2$2]/;
See demo https://regex101.com/r/iD9oN4/1

Matching numbers for substitution in Perl

I have this little script:
my #list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (#list) {
s/(\d{2}).*\.txt$/$1.txt/;
s/^0+//;
print $_ . "\n";
}
The expected output would be
5.txt
12.txt
1.txt
But instead, I get
R3_05.txt
T3_12.txt
1.txt
The last one is fine, but I cannot fathom why the regex gives me the string start for $1 on this case.
Try this pattern
foreach (#list) {
s/^.*?_?(?|0(\d)|(\d{2})).*\.txt$/$1.txt/;
print $_ . "\n";
}
Explanations:
I use here the branch reset feature (i.e. (?|...()...|...()...)) that allows to put several capturing groups in a single reference ( $1 here ). So, you avoid using a second replacement to trim a zero from the left of the capture.
To remove all from the begining before the number, I use :
.*? # all characters zero or more times
# ( ? -> make the * quantifier lazy to match as less as possible)
_? # an optional underscore
Note that you can ensure that you have only 2 digits adding a lookahead to check if there is not a digit that follows:
s/^.*?_?(?|0(\d)|(\d{2}))(?!\d).*\.txt$/$1.txt/;
(?!\d) means not followed by a digit.
The problem here is that your substitution regex does not cover the whole string, so only part of the string is substituted. But you are using a rather complex solution for a simple problem.
It seems that what you want is to read two digits from the string, and then add .txt to the end of it. So why not just do that?
my #list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
for (#list) {
if (/(\d{2})/) {
$_ = "$1.txt";
}
}
To overcome the leading zero effect, you can force a conversion to a number by adding zero to it:
$_ = 0+$1 . ".txt";
I would modify your regular expression. Try using this code:
my #list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (#list) {
s/.*(\d{2}).*\.txt$/$1.txt/;
s/^0+//;
print $_ . "\n";
}
The problem is that the first part in your s/// matches, what you think it does, but that the second part isn't replacing what you think it should. s/// will only replace what was previously matched. Thus to replace something like T3_ you will have to match that too.
s/.*(\d{2}).*\.txt$/$1.txt/;

Perl - Regex to extract only the comma-separated strings

I have a question I am hoping someone could help with...
I have a variable that contains the content from a webpage (scraped using WWW::Mechanize).
The variable contains data such as these:
$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig"
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"
The only bits I am interested in from the above examples are:
#array = ("cat_dog","horse","rabbit","chicken-pig")
#array = ("elephant","MOUSE_RAT","spider","lion-tiger")
#array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA")
The problem I am having:
I am trying to extract only the comma-separated strings from the variables and then store these in an array for use later on.
But what is the best way to make sure that I get the strings at the start (ie cat_dog) and end (ie chicken-pig) of the comma-separated list of animals as they are not prefixed/suffixed with a comma.
Also, as the variables will contain webpage content, it is inevitable that there may also be instances where a commas is immediately succeeded by a space and then another word, as that is the correct method of using commas in paragraphs and sentences...
For example:
Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.
^ ^
| |
note the spaces here and here
I am not interested in any cases where the comma is followed by a space (as shown above).
I am only interested in cases where the comma DOES NOT have a space after it (ie cat_dog,horse,rabbit,chicken-pig)
I have a tried a number of ways of doing this but cannot work out the best way to go about constructing the regular expression.
How about
[^,\s]+(,[^,\s]+)+
which will match one or more characters that are not a space or comma [^,\s]+ followed by a comma and one or more characters that are not a space or comma, one or more times.
Further to comments
To match more than one sequence add the g modifier for global matching.
The following splits each match $& on a , and pushes the results to #matches.
my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my #matches;
while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
push(#matches, split(/,/, $&));
}
print join("\n",#matches),"\n";
Though you can probably construct a single regex, a combination of regexs, splits, grep and map looks decently
my #array = map { split /,/ } grep { !/^,/ && !/,$/ && /,/ } split
Going from right to left:
Split the line on spaces (split)
Leave only elements having no comma at the either end but having one inside (grep)
Split each such element into parts (map and split)
That way you can easily change the parts e.g. to eliminate two consecutive commas add && !/,,/ inside grep.
I hope this is clear and suits your needs:
#!/usr/bin/perl
use warnings;
use strict;
my #strs = ("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
"fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf",
"dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew",
"Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
"Another sentence, although having commas, should not confuse the regex with this: a,b,c,d");
my $regex = qr/
\s #From your examples, it seems as if every
#comma separated list is preceded by a space.
(
(?:
[^,\s]+ #Now, not a comma or a space for the
#terms of the list
, #followed by a comma
)+
[^,\s]+ #followed by one last term of the list
)
/x;
my #matches = map {
$_ =~ /$regex/;
if ($1) {
my $comma_sep_list = $1;
[split ',', $comma_sep_list];
}
else {
[]
}
} #strs;
$var =~ tr/ //s;
while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
push (#arr, $&);
}
the regular expression matches three cases :
(?<!, )\b[^, ]+(?=,\S) : matches cat_dog
(?<=,)[^, ]+(?=,) : matches horse & rabbit
(?<=\S,)[^, ]+\b(?! ,) : matches chicken-pig

after matching pattern how to add dash after string in perl.regex

i have this type of data:
please help me out i am new to regular expressions,and please explain each step while answering.thanks..
7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN
7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN
7210243_U1A_1X50_LI_MOTORTRAEGER_VORN_INNEN
7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD
i want to extract only this data from above lines:
7210315_AX1A_MOTORTRAEGER_VORN_AUSSEN
7210316_W1A_MOTORTRAEGER_VORN_AUSSEN
7210243_U1A_MOTORTRAEGER_VORN_INNEN
7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD
then if AX1A contains two consecutive alphabets after underscore ,it should be written as AX_ , and if contains single digit and single alphabet then they become as -1_ and -A_ so after applying this pattern it will become: AX_-1_-A_ and all other data should be remain same.
similarly in next line "W1A" so firstly it contains single alphabet "W" which should be converted to -W_ now next character is a single digit so it should also be converted as same pattern -1_ similarly last one is also treated same.so it become -W_-1_-A_
we are only interested in applying regex to the part after digits followed by underscore.
_AX1A_
_W1A_
_U1A_
_AV21NA_
output should be:
7210315_AX_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210316_-W_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210243_-U_-1_-A_MOTORTRAEGER_VORN_INNEN
7210330_AV_21_NA_ABSTUETZUNG_STUETZTRAEGER_RAD
use strict;
use warnings;
my $match
= qr/
( \d+ # group of digits
_ # followed by an underscore
) # end group
( \p{Alpha}+ ) # group of alphas
( \d+ ) # group of digits
( \p{Alpha}* ) # group of alphas
( \w+ ) # group of word characters
/x
;
while ( my $record = <$input> ) { # record of input
# match and capture
if ( my ( $pre, $pre_alpha, $num, $post_alpha, $post ) = $record =~ m/$match/ ) {
say $pre
# if the alpha has length 1, add a dash before it
. ( length $pre_alpha == 1 ? '-' : '' )
# then the alpha
. $pre_alpha
# then the underscore
. '_'
# test if the length of the number is 1 and the length of the
# trailing alpha string is 1
. ( length( $num ) == 1 && length( $post_alpha ) == 1
# if true, apply a dash before each
? "-$num\_-$post_alpha"
# otherwise treat as AV21NA in example.
: "$num\_$post_alpha"
)
. $post
;
}
}
I don't know all the ins and outs of what you need stripped, but I'll extrapolate and let you clarify if this doesn't do quite what you need.
For the first step, extracting the 1X50_RE_ and 1X50_LI, you could search for those strings and replace them with nothing.
Next, to split your second letter/number code into your small chunks, you can use a pair of matches, using a look-ahead on each. However, since you only want to mess with that second code chunk, I'd split the overall line up first, work on the second chunk, and then join the pieces back together again.
while (<$input>) {
# Replace the 1X50_RE/LI_ bits with nothing (i.e., delete them)
s/1X50_(RE|LI)_//;
my #pieces = split /_/; # split the line into pieces at each underscore
# Just working with the second chunk. /g, means do it for all matches found
$pieces[1] =~ s/([A-Z])(?=[0-9])/$1_-/g; # Convert AX1 -> AX_-1
$pieces[1] =~ s/([0-9])(?=[A-Z])/$1_-/g; # Convert 1A -> 1-_A
# Join the pieces back together again
$_ = join '_', #pieces;
print;
}
The $_ is the variable many Perl operations work on if you don't specify. The <$input> reads the next line of the file handle named $input into $_. The s///, split, and print functions work on $_ when not given. The =~ operator is the way you tell Perl to use $pieces[1] (or whichever variable you are working on) instead of $_ for regular expression operations. (For split or print, you'd pass the variables as the argument instead, so split /_/ is the same as split /_/, $_ and print is the same as print $_.)
Oh, and to explain the regular expressions a bit:
s/1X50_(RE|LI)_//;
This is matching anything containing 1X50_RE or 1X50_LI (the (|) is a list of alternatives) and replacing them with nothing (the empty // at the end).
Looking at one of the other lines:
s/([A-Z])(?=[0-9])/$1_-/g;
The plain parentheses (...) around [A-Z] cause $1 to be set to whatever letter is matched inside (in this case a letter, A-Z). The (?=...) parenthesis cause a zero-width positive look-ahead assertion. That means the regular expression only matches if the very next thing in the string matches the expression (a digit, 0-9), but that part of the match is not included as part of the string that is replaced.
The /$1_-/ causes the matched part of the string, the [A-Z], to be replaced with the value captured by the parentheses, (...), but before the look-head, [0-9], with the addition of the _- you require.
#!/usr/bin/perl -w
use strict;
while (<>) {
next if /^\s*$/;
chomp;
## Remove those parts of the line we do not want
## You do not specify what, if anything, is constant about
## the parts you do not want. One of the following cases should
## serve.
## i) Remove the string _1X50_ and the next characters between
## two underscores:
s/_1X50_.+?_/_/;
## ii) keep the first 2 and last 3 sections of each line.
## Uncomment this line and comment the previous one to use this:
#s/^(.+?_.+?)_.+_(.+_.+_.+)$/$1_$2/;
## The line now contains only those regions we are
## interested in. Split on '_' to collect an array of the
## different parts (#a):
my #a=split(/_/);
## $a[1] is the second string, eg AX1A,W1A etc.
## We search for one or more letters, followed by one or more digits
## followed by one or more letters. The 'i' operand makes the match
## case Insensitive and the 'g' operand makes the search global, allowing
## us to capture the matches in the #matches array.
my #matches=($a[1]=~/^([a-z]*)(\d*)([a-z]*)/ig);
## So, for each of the matched strings, if the length of the match
## is less than 2, add a '-' to the beginning of the string:
foreach my $match (#matches) {
if (length($match)<2) {
$match="-" . $match;
}
}
## Now replace the original $a[1] with each string in
## #matches, connected by '_':
$a[1]=join("_", #matches);
## Finally, build the string $kk by joining each element
## of the line (#a) by a '_', and print:
my $kk=join("_", #a);
print "$kk\n";
}
Are you sure like this:
while (<DATA>) {
s/1X50_(LI|RE)_//;
s/(\d+)_([A-Z])(\d)([A-Z])/$1_-$2_-$3_-$4/;
s/(\d+)_([A-Z]{2})(\d)([A-Z])/$1_$2_-$3_-$4/;
s/(\d+)_([A-Z]{1,2})(\d+)([A-Z]+)/$1_$2_$3_$4/;
print;
}
__DATA__
7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN
7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN
7210243_U1A_1X50_LI_MOTORTRAEGER_VORN_INNEN
7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD
output:
7210315_AX_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210316_-W_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210243_-U_-1_-A_MOTORTRAEGER_VORN_INNEN
7210330_AV_21_NA_ABSTUETZUNG_STUETZTRAEGER_RAD
zostay's suggestion of splitting the line may make things easier if you are a regex beginner. However, avoiding the split is optimal from a performance perspective. Here is how to do it without splitting:
open IN_FILE, "filename" or die "Whoops! Can't open file.";
while (<IN_FILE>)
{
s/^\d{7}_\K([A-Z]{1,2})(\d{1,2})([A-Z]{1,2})/-${1}-${2}-${3}/
or print "line didn't match: $line\n";
s/1X50_(LI|RE)_//;
}
Breaking down the first pattern:
s/// is the search-and-replace operator.
^ match the beginning of the line
\d{7}_ match seven digits, followed by an underscore
\K look-behind operator. This means that whatever came before won't be part of the string that is replaced. () each set of parentheses specifies a chunk of the match that will be captured. These will be put into the match variables $1, $2, etc. in order. [A-Z]{1,2} this means match between one and two capital letters. You can probably figure out what the other two sections in parentheses mean. -${1}-${2}-${3} Replace what matched with the first three match variables, preceded by dashes. The only reason for the curly braces is to make clear what the variable name is.