parsing of string using regex in perl - regex

DESCR: "10GE SR"
I need to match the part above, which sits inside a longer string. I'm using a regex in Perl.
I tried:
if ($line =~ /DESCR: \"([a-zA-Z0-9)\"/) {
print "$1\n";
}
But I don't understand how to allow for spaces inside the quoted string. The spaces can occur anywhere within the quotes. Can someone help me out?

$str = 'DESCR: "10GE SR"';
if ($str =~ /DESCR: \"([a-zA-Z0-9\s]+)\"/) {
print "$1\n";
}

This pattern matches a double-quoted string, including backslash-escaped characters inside it:
if ($line =~ /DESCR: \"((?:[^\\"]|\\.)*)\"/) {
print "$1\n";
}

It may be simpler:
if ( $line =~ /DESCR: "([^"]+)"/ ) {
print "$1\n";
}
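As a rough sketch of the simpler pattern in use: the helper name `extract_descr` and the second sample line are made up for illustration and do not come from the question. Because `[^"]+` accepts anything except a double quote, spaces (and characters such as hyphens, which `[a-zA-Z0-9\s]+` would reject) are covered without listing them.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between the quotes after "DESCR: ",
# or undef if the line does not contain such a field.
sub extract_descr {
    my ($line) = @_;
    return $line =~ /DESCR: "([^"]+)"/ ? $1 : undef;
}

# Sample lines; the second one is invented to show spaces and a hyphen.
for my $line ('DESCR: "10GE SR"', 'NAME: "1", DESCR: "WS-C3750 48 port"') {
    my $descr = extract_descr($line);
    print defined $descr ? "$descr\n" : "no match\n";
}
```

If the quoted value may itself contain escaped quotes, the longer `(?:[^\\"]|\\.)*` pattern is the safer choice.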

Related

Perl - string matching issue

I have a problem I cannot understand. I have this string:
gene_id "siRNA_Z27kG1_20543"transcript_id "siRNA_Z27kG1_20543_X_1";tss_id "TSS124620"
And I want to change the gene_id. So, I have the following code:
if ($line =~ /;transcript_id "([A-Za-z0-9:\-._]*)(_[oxOX][_.][0-9]*)";/) {
$num = $2;
$line =~ s/gene_id "([A-Za-z0-9:\-._]*)";/gene_id "$1$num";/g;
print $new $line."\n";
}
The aim of my code is to change siRNA_Z27kG1_20543 to siRNA_Z27kG1_20543_X_1. However, my code does not produce that output. Why? I can't work it out.
My regex needs to be as it is because I match other strings (this time with success).
#!/usr/bin/perl
use strict;
use warnings;
my $string = q{gene_id "siRNA_Z27kG1_20543"transcript_id "siRNA_Z27kG1_20543_X_1";tss_id "TSS124620"};
if($string =~ m|transcript_id "([A-Za-z0-9:\-._]*)(_[oxOX][_.][0-9]*)"|){
my $replace_with = qq{gene_id "$1$2"};
$string =~ s/gene_id (\"\w+\")/$replace_with/g;
}
print "$string";
Output: gene_id "siRNA_Z27kG1_20543_X_1"transcript_id "siRNA_Z27kG1_20543_X_1";tss_id "TSS124620"
Remove the semicolon at the start of the pattern, because the string has no semicolon before transcript_id:
if ($line =~ /transcript_id "([A-Za-z0-9:\-._]*)(_[oxOX][_.][0-9]*)";/) {
$num = $2;
$line =~ s/gene_id "([A-Za-z0-9:\-._]*)";/gene_id "$1$num";/g;
print $new $line."\n";
}
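A minimal sketch of the same idea, assuming the input looks exactly like the sample string printed in the question (no semicolon after the gene_id value, so that one is dropped from the pattern here as well). The sub name `fix_gene_id` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: appends the transcript suffix (e.g. "_X_1")
# to the gene_id value and returns the modified line.
sub fix_gene_id {
    my ($line) = @_;
    if ($line =~ /transcript_id "([A-Za-z0-9:\-._]*)(_[oxOX][_.][0-9]*)"/) {
        my $suffix = $2;    # save it: the next match overwrites $1 and $2
        $line =~ s/gene_id "([A-Za-z0-9:\-._]*)"/gene_id "$1$suffix"/;
    }
    return $line;
}

my $sample = q{gene_id "siRNA_Z27kG1_20543"transcript_id "siRNA_Z27kG1_20543_X_1";tss_id "TSS124620"};
print fix_gene_id($sample), "\n";
```

Copying `$2` into a lexical before the substitution matters, since the capture variables are reset by the second match.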

Is there a way to combine a substitution regex and match test in one line?

The answer is probably obvious, but I'm wondering if there's a shorter way to write this:
if ($line =~ m/^REF: /){
$line =~ s/^REF: //;
# do something else
}
s/// returns the number of substitutions made. An equivalent of your code would be:
if ($line =~ s/^REF: //) {
# do something else
}
Do you mean this?
if ($line =~ s/^REF: //){
print $line."\n";
}
else {
print "Line not touched\n";
}
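To illustrate the return value, here is a small sketch; the helper name `strip_ref` and the sample strings are assumptions for the example. With the `/g` flag the same return value is the count of replacements, and when nothing is replaced it is false.

```perl
use strict;
use warnings;

# Hypothetical helper: strips a leading "REF: " and reports whether it did.
sub strip_ref {
    my ($line) = @_;
    my $changed = ($line =~ s/^REF: //) ? 1 : 0;
    return ($changed, $line);
}

for my $input ('REF: ticket 42', 'no prefix here') {
    my ($changed, $rest) = strip_ref($input);
    print $changed ? "stripped: $rest\n" : "untouched: $rest\n";
}

# With /g, s/// returns how many substitutions were made.
my $text  = 'a-b-c';
my $count = ($text =~ s/-/+/g);
print "$count replacements, result: $text\n";
```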

How to replace a variable with another variable in PERL?

I am trying to replace all words from a text except some that I have in an array. Here's my code:
my $text = "This is a text!And that's some-more text,text!";
while ($text =~ m/([\w']+)/g) {
next if $1 ~~ @ignore_words;
my $search = $1;
my $replace = uc $search;
$text =~ s/$search/$replace/e;
}
However, the program doesn't work. Basically I am trying to make all words uppercase but skip the ones in @ignore_words. I know it's a problem with the variables being used in the regular expression, but I can't figure the problem out.
#!/usr/bin/perl
my $text = "This is a text!And that's some-more text,text!";
my @ignorearr=qw(is some);
my %h1=map{$_ => 1}@ignorearr;
$text=~s/([\w']+)/($h1{$1})?$1:uc($1)/ge;
print $text;
Running this prints:
THIS is A TEXT!AND THAT'S some-MORE TEXT,TEXT!
Rather than modifying $text inside a while loop that is matching against that same variable, let s///eg do the replacement globally in one pass:
my $text = "This is a text!And that's some-more text,text!";
my @ignore_words = qw{ is more };
$text =~ s/([\w']+)/$1 ~~ @ignore_words ? $1 : uc($1)/eg;
print $text;
And on running:
THIS is A TEXT!AND THAT'S SOME-more TEXT,TEXT!

How can I know which portion of a Perl regex is matched by a string?

I want to search the lines of a file to see if any of them match one of a set of regexes.
Something like this:
my @regs = (qr/a/, qr/b/, qr/c/);
foreach my $line (<ARGV>) {
foreach my $reg (@regs) {
if ($line =~ /$reg/) {
printf("matched %s\n", $reg);
}
}
}
But this can be slow.
It seems like the regex compiler could help. Is there an optimization like this:
my $master_reg = join("|", @regs); # this is wrong syntax. what's the right way?
foreach my $line (<ARGV>) {
$line =~ /$master_reg/;
my $matched = special_function();
printf("matched the %sth reg: %s\n", $matched, $regs[$matched]);
}
where 'special_function' is the special sauce telling me which portion of the regex was matched.
Use capturing parentheses. Basic idea looks like this:
my #matches = $foo =~ /(one)|(two)|(three)/;
defined $matches[0]
and print "Matched 'one'\n";
defined $matches[1]
and print "Matched 'two'\n";
defined $matches[2]
and print "Matched 'three'\n";
Add capturing groups:
"pear" =~ /(a)|(b)|(c)/;
if (defined $1) {
print "Matched a\n";
} elsif (defined $2) {
print "Matched b\n";
} elsif (defined $3) {
print "Matched c\n";
} else {
print "No match\n";
}
Obviously in this simple example you could have used /(a|b|c)/ just as well and just printed $1, but when 'a', 'b', and 'c' can be arbitrarily complex expressions this is a win.
If you're building up the regex programmatically you might find it painful to have to use the numbered variables, so instead of breaking strictness, look in the @- or @+ arrays instead, which contain offsets for each match position. $-[0] is always set as long as the pattern matched at all, but higher $-[$n] will only contain defined values if the nth capturing group matched.
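A minimal sketch of the @- approach for a programmatically built pattern. The sub name `which_matched` is hypothetical, and the sketch assumes none of the individual patterns contains capturing groups of its own, since that would shift the group numbering.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the index (0-based) of the pattern in @$regs
# whose alternative matched $line, or undef if nothing matched.
sub which_matched {
    my ($line, $regs) = @_;
    # Wrap each pattern in its own capturing group: (a)|(b)|(c)
    my $master = join '|', map { "($_)" } @$regs;
    return undef unless $line =~ /$master/;
    my @starts = @-;    # copy the match start offsets right away
    # $starts[$n] is defined only if group $n took part in the match.
    for my $n (1 .. $#starts) {
        return $n - 1 if defined $starts[$n];
    }
    return undef;
}

my @regs = (qr/a/, qr/b/, qr/c/);
for my $line ('pear', 'cab', 'xyz') {
    my $idx = which_matched($line, \@regs);
    print defined $idx ? "$line: matched regex $idx\n" : "$line: no match\n";
}
```

Note that only one alternative is reported per match: the one that succeeds at the leftmost position, trying the alternatives in order.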

perl regex help matching varying number values

Looking for some help with this perl regex.
I need to extract three items from this filename:
abc101.name.aue-abc_p002.20110124.csv
The third item, 002 in this example, can be up to 4 digits long, e.g. 0002.
Here's my non-working regex:
while (my $line=<>) {
chomp $line;
if ($line =~ m/abc(d{3}).name.(w{3})_p([0-9]).[0-9].csv/) {
print $1;
print $2;
print $3;
}
}
while (my $line=<>) {
chomp $line;
if ($line =~ /^abc(\d{3})\.name\.(\w{3})-abc_p(\d{1,4})\.\d+\.csv$/) {
print $1;
print $2;
print $3;
}
}
You are missing a few plus signs (or { } quantifiers), the backslashes on \d and \w, the literal -abc before _p, and escapes on the dots:
/abc(\d{3})\.name\.(\w{3})-abc_p([0-9]{3,4})\.[0-9]+\.csv/
Untested: /^abc(\d{3})\.name\.([a-z]{3})-abc_p(\d{1,4})\.\d+\.csv$/
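A quick check of the anchored pattern against the sample filename, as a sketch; the sub name `parse_name` is an assumption for the example. In list context a successful match returns the captured groups, and a failed match returns an empty list.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the three captured fields, or an empty list
# if the filename does not have the expected shape.
sub parse_name {
    my ($name) = @_;
    return $name =~ /^abc(\d{3})\.name\.(\w{3})-abc_p(\d{1,4})\.\d+\.csv$/;
}

my @fields = parse_name('abc101.name.aue-abc_p002.20110124.csv');
print join(',', @fields), "\n";   # prints 101,aue,002
```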