perl regex help matching varying number values - regex

Looking for some help with this perl regex.
I need to extract (3) items from this filename: abc101.name.aue-abc_p002.20110124.csv
abc101.name.aue-abc_p002.20110124.csv
where item (3) in this example 002, can also be a max of 4 digits, 0002
Here's my non working regex:
while (my $line=<>) {
chomp $line;
if ($line =~ m/abc(d{3}).name.(w{3})_p([0-9]).[0-9].csv/) {
print $1;
print $2;
print $3;
}
}

while (my $line=<>) {
chomp $line;
if ($line =~ /^abc(\d{3})\.name\.(\w{3})-abc_p(\d{1,4})\.\d+.csv$/) {
print $1;
print $2;
print $3;
}
}

You are missing a few plus signs (or { } quantifiers) and escaping the dots:
abc(d{3})\.name\.(w{3})_p([0-9]{3,4})\.[0-9]+\.csv/

Untested: /^abc(\d{3})\.name\.([a-z]{3})-abc_p(\d{1,4})\.\d+\.csv$/

Related

Counting number of pattern matches in Perl

I am VERY new to perl, and to programming in general.
I have been searching for the past couple of days on how to count the number of pattern matches; I have had a hard time understanding others solutions and applying them to the code I have already written.
Basically, I have a sequence and I need to find all the patterns that match [TC]C[CT]GGAAGC
I believe I have that part down. but I am stuck on counting the number of occurrences of each pattern match. Does anyone know how to edit the code I already have to do this? Any advice is welcomed. Thanks!
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
# open fasta file for reading
unless( open( FASTA, "<", '/scratch/Drosophila/dmel-all-chromosome- r6.02.fasta' )) {
die "Can't open dmel-all-chromosome-r6.02.fasta for reading:", $!;
}
#split the fasta record
local $/ = ">";
#scan through fasta file
while (<FASTA>) {
chomp;
if ( $_ =~ /^(.*?)$(.*)$/ms) {
my $header = $1;
my $seq = $2;
$seq =~ s/\R//g; # \R removes line breaks
while ( $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
}
}
}
Update, I have added in
my #matches = $seq =~ /([TC]C[CT]GGAAGC)/g;
print scalar #matches;
In the code below. However, it seems to be outputting 0 in front of each pattern match, instead of outputting the total sum of all pattern matches.
while (<FASTA>) {
chomp;
if ( $_ =~ /^(.*?)$(.*)$/ms) {
my $header = $1;
my $seq = $2;
$seq =~ s/\R//g; # \R removes line breaks
while ( $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
my #matches = $seq =~ /([TC]C[CT]GGAAGC)/g;
print scalar #matches;
}
}
}
Edit: I need the output to list ever pattern match found. I also need it to find the total number of matches found. For example:
CCTGGAAGC
TCTGGAAGC
TCCGGAAGC
3 matches found
counting the number of occurrences of each pattern match
my #matches = $string =~ /pattern/g
#matches array will contain all the matched parts. You can then do below to get the count.
print scalar #matches
Or you could directly write
my $matches = () = $string =~ /pattern/
I would suggest you to use the former as you might need to check "what was matched" in future (perhaps for debugging?).
Example 1:
use strict;
use warnings;
my $string = 'John Doe John Done';
my $matches = () = $string =~ /John/g;
print $matches; #prints 2
Example 2:
use strict;
use warnings;
my $string = 'John Doe John Done';
my #matches = $string =~ /John/g;
print "#matches"; #prints John John
print scalar #matches; #prints 2
Edit:
while ( my #matches = $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
print "Count of matches:". scalar #matches;
}
As you have written the code, you have to count the matches yourself:
local $/ = ">";
my $count = 0;
#scan through fasta file
while (<FASTA>) {
chomp;
if ( $_ =~ /^(.*?)$(.*)$/ms) {
my $header = $1;
my $seq = $2;
$seq =~ s/\R//g; # \R removes line breaks
while ( $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
$count = $count +1;
}
}
}
print "Fount $count matches\n";
should do the job.
HTH Georg
my #count = ($seq =~ /([TC]C[CT]GGAAGC)/g);
print scalar #count ;

Perl - string matching issue

I have a problem I cannot understand. I have this string:
gene_id "siRNA_Z27kG1_20543"transcript_id "siRNA_Z27kG1_20543_X_1";tss_id "TSS124620"
And I want to change the gene_id. So, I have the following code:
if ($line =~ /;transcript_id "([A-Za-z0-9:\-._]*)(_[oxOX][_.][0-9]*)";/) {
$num = $2;
$line =~ s/gene_id "([A-Za-z0-9:\-._]*)";/gene_id "$1$num";/g;
print $new $line."\n";
}
The aim of my code is to change siRNA_Z27kG1_20543 for siRNA_Z27kG1_20543_X_1. However, my code does not produce that output. Why? I can't understand that.
My regex needs to be as it is because I match other strings (this time with success).
#!/usr/bin/perl
use strict;
use warnings;
my $string = q{gene_id "siRNA_Z27kG1_20543"transcript_id "siRNA_Z27kG1_20543_X_1";tss_id "TSS124620"};
if($string =~ m|transcript_id "([A-Za-z0-9:\-._]*)(_[oxOX][_.][0-9]*)"|){
my $replace_with = qq{gene_id "$1$2"};
$string =~ s/gene_id (\"\w+\")/$replace_with/g;
}
print "$string";
Output: gene_id "siRNA_Z27kG1_20543_X_1"transcript_id "siRNA_Z27kG1_20543_X_1";tss_id "TSS124620"
Demo
Remove the semicolon at the start of the pattern as it is not present in the string :-
if ($line =~ /transcript_id "([A-Za-z0-9:\-._]*)(_[oxOX][_.][0-9]*)";/) {
$num = $2;
$line =~ s/gene_id "([A-Za-z0-9:\-._]*)";/gene_id "$1$num";/g;
print $new $line."\n";
}

Match different variant of a word using regex Perl

I am splitting sentences at individual space characters, and then matching these terms against keys of hashes. I am getting matches only if the terms are 100% similar, and I am struggling to find a perfect regex that could match several occurrences of the same word. Eg. Let us consider I have a term 'antagon' now it perfectly matches with the term 'antagon' but fails to match with antagonists, antagonistic or pre-antagonistic, hydro-antagonist etc. Also I need a regex to match occurrences of words like MCF-7 with MCF7 or MC-F7 silencing the effect of special characters and so on.
This is the code that I have till now; thr commented part is where I am struggling.
(Note: Terms in the hash are stemmed to root form of a word).
use warnings;
use strict;
use Drug;
use Stop;
open IN, "sample.txt" or die "cannot find sample";
open OUT, ">sample1.txt" or die "cannot find sample";
while (<IN>) {
chomp $_;
my $flag = 0;
my $line = lc $_;
my #full = ();
if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
my $string = $1;
chomp $string;
$string =~ s/,/ , /g;
$string =~ s/\./ \. /g;
$string =~ s/;/ ; /g;
$string =~ s/\(/ ( /g;
$string =~ s/\)/ )/g;
$string =~ s/\:/ : /g;
$string =~ s/\::/ :: )/g;
my #array = split / /, $string;
foreach my $word (#array) {
chomp $word;
if ( $word =~ /\,|\;|\.|\(|\)/g ) {
push( #full, $word );
}
if ( $Stop_words{$word} ) {
push( #full, $word );
}
if ( $Values{$word} ) {
my $term = "<Drug>$word<\/Drug>";
push( #full, $term );
}
else {
push( #full, $word );
}
# if($word=~/.*\Q$Values{$word}\E/i)#Changed this
# {
# $term="<Drug>$word</$Drug>";
# print $term,"\n";
# push(#full,$term);
# }
}
}
my $mod_str = join( " ", #full );
print OUT $mod_str, "\n";
}
I need a regex to match occurances of words like MCF-7 with MCF7 or
MC-F7
The most straightforward approach is just to strip out the hyphenss i.e.
my $ignore_these = "[-_']"
$word =~ s{$ignore_these}{}g;
I am not sure what is stored in your Value hash, so its hard to tell what you expect to happen
if($word=~/.*\Q$Values{$word}\E/i)
However, the kind of thing I imagin you want is (simplified your code somewhat)
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use 5.10.0;
use Data::Dumper;
while (<>) {
chomp $_;
my $flag = 0;
my $line = lc $_;
my #full = ();
if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
my $string = $1;
chomp $string;
$string =~ s/([,\.;\(\)\:])/ $1 /g; # squished these together
$string =~ s/\:\:/ :: )/g; # typo in original
my #array = split /\s+/, $string; # split on one /or more/ spaces
foreach my $word (#array) {
chomp $word;
my $term=$word;
my $word_chars = "[\\w\\-_']";
my $word_part = "antagon";
if ($word =~ m{$word_chars*?$word_part$word_chars+}) {
$term="<Drug>$word</Drug>";
}
push(#full,$term); # push
}
}
my $mod_str = join( " ", #full );
say "<Sentence>$mod_str</Sentence>";
}
This gives me the following output, which is my best guess at what you expect:
$ cat tmp.txt
<Sentence>This in antagonizing the antagonist's antagonism pre-antagonistically.</Sentence>
$ cat tmp.txt | perl x.pl
<Sentence>this in <Drug>antagonizing</Drug> the <Drug>antagonist's</Drug> <Drug>antagonism</Drug> <Drug>pre-antagonistically</Drug> .</Sentence>
$
perl -ne '$things{$1}++while s/([^ ;.,!?]*?antagon[^ ;.,!?]++)//;END{print "$_\n" for sort keys %things}' FILENAME
If the file contains the following:
he was an antagonist
antagonize is a verb
why are you antagonizing her?
this is an alpha-antagonist
This will return:
alpha-antagonist
antagonist
antagonize
antagonizing
Below is the a regular (not one-liner) version:
#!/usr/bin/perl
use warnings;
use strict;
open my $in, "<", "sample.txt" or die "could not open sample.txt for reading!";
open my $out, ">", "sample1.txt" or die "could not open sample1.txt for writing!";
my %things;
while (<$in>){
$things{$1}++ while s/([^ ;.,!?]*?antagon[^ ;.,!?]++)//
}
print $out "$_\n" for sort keys %things;
You may want to take another look at your assumptions on your approach. What it sounds like to me is that you are looking for words which are within a certain distance of a list of words. Take a look at the Levenshtein distance formula to see if this is something you want. Be aware, however, that computing this might take exponential time.

parsing of string using regex in perl

DESCR: "10GE SR"
i need match this above part which is part of my rest of the string. Im using regex in perl.
i tried
if ($line =~ /DESCR: \"([a-zA-Z0-9)\"/) {
print "$1\n";
}
but im not able to understand how to consider spaces inside my string. these spaces can occur any where within the quotes. can someone help me out.
$str = 'DESCR: "10GE SR"';
if ($str =~ /DESCR: \"([a-zA-Z0-9\s]+)\"/) {
print "$1\n";
}
Take a look, this pattern can match double quoted in string:
if ($line =~ /DESCR: \"((?:[^\\"]|\\.)*)\"/) {
print "$1\n";
}
It may be simpler:
if ( $line =~ /DESCR: "([^"]+)"/ ) {
print "$1\n";
}

How do I use perl regex to extract the digit value from '[1]'?

My code...
$option = "[1]";
if ($option =~ m/^\[\d\]$/) {print "Activated!"; $str=$1;}
I need a way to drop off the square brackets from $option. $str = $1 does not work for some reason. Please advise.
To get $1 to work you need to capture the value inside the brackets using parentheses, i.e:
if ($option =~ m/^\[(\d)\]$/) {print "Activated!"; $str=$1;}
if ($option =~ m/^\[(\d)\]$/) { print "Activated!"; $str=$1; }
Or
if (my ($str) = $option =~ m/^\[(\d)\]$/) { print "Activated!" }
Or
if (my ($str) = $option =~ /(\d)/) { print "Activated!" }
..and a bunch of others. You forgot to capture your match with ()'s.
EDIT:
if ($option =~ /(?<=^\[)\d(?=\]$)/p && (my $str = ${^MATCH})) { print "Activated!" }
Or
my $str;
if ($option =~ /^\[(\d)(?{$str = $^N})\]$/) { print "Activated!" }
Or
if ($option =~ /^\[(\d)\]$/ && ($str = $+)) { print "Activated!" }
For ${^MATCH}, $^N, and $+, perlvar.
I love these questions : )