Splitting regex matches by newlines in perl - regex

I am trying to read files from a directory and print out regexp matches.
I am trying to match
<110>
*everything here*
<120>
My matches would be
SCHALLY, ANDREW V.
CAI, REN ZHI
ZARANDI, MARTA
However, when I try to split this by newline and join using "|", I do not get the desired output, which is
Applicant : SCHALLY, ANDREW V. | CAI, REN ZHI | ZARANDI, MARTA
My current output is only
| ZARANDI, MARTA
Can someone see any obvious mistakes?
#!/usr/bin/perl
use warnings;
use strict;
use IO::Handle;
open (my $fh, '>', '../logfile.txt') || die "can't open logfile.txt";
open (STDERR, ">>&=", $fh) || die "can't redirect STDERR";
$fh->autoflush(1);
my $input_path = "../input/";
my $output_path = "../output/";
my $whole_file;
opendir INPUTDIR, $input_path or die "Cannot find dir $input_path : $!";
my @input_files = readdir INPUTDIR;
closedir INPUTDIR;
foreach my $input_file (@input_files)
{
$whole_file = &getfile($input_path.$input_file);
if ($whole_file){
$whole_file =~ /[<][1][1][0][>](.*)[<][1][2][0][>]/s ;
if ($1){
my $applicant_string = "Applicant : $1";
my $op = join( "|", split("\n", $applicant_string) );
print $op;
}
}
}
close $fh;
sub getfile {
my $filename = shift;
open F, "< $filename " or die "Could not open $filename : $!" ;
local $/ = undef;
my $contents = <F>;
close F;
return $contents;
}
EDIT 1
I ran the code on a single file:
#!/usr/bin/perl
use warnings;
use strict;
use IO::Handle;
my $input_file = "01.txt-WO13_090919_PD_20130620";
my $input_path = "../input/";
my $whole_file = &getfile($input_path.$input_file);
if ($whole_file =~ /[<][1][1][0][>](.*)[<][1][2][0][>]/s ) {
print $1;
my @split_string = split("\n", $1);
my $new_string = join("|", @split_string) ;
print "$new_string \n";
}
sub getfile {
my $filename = shift;
open F, "< $filename " or die "Could not open $filename : $!" ;
local $/ = undef;
my $contents = <F>;
close F;
return $contents;
}
Output
Chen, Guokai
Thomson, James
Hou, Zhonggang
Hou, Zhonggang

Replace
$whole_file =~ /[<][1][1][0][>](.*)[<][1][2][0][>]/s ;
if ($1) {
with
if ($whole_file =~ /[<][1][1][0][>](.+)[<][1][2][0][>]/s) {
The problem with your original code is that $1 is unchanged (i.e. retained from the previous file) if the regexp doesn't match.
If that doesn't solve the problem, then double check and make sure that you have the correct value of $applicant_string. Your join + split line looks correct.
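To see why an unconditional match followed by if ($1) misbehaves, here is a minimal sketch. The sample strings and the helper name extract_block are made up for illustration. A failed match leaves $1 exactly as the previous successful match set it, so the capture should only be read when the match itself succeeded.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between <110> and <120>,
# or undef when the markers are not both present.
sub extract_block {
    my ($text) = @_;
    # In list context a successful match returns its captures and a
    # failed match returns an empty list, so $block stays undef.
    my ($block) = $text =~ /<110>(.+)<120>/s;
    return $block;
}

my $first  = "<110>\nSCHALLY, ANDREW V.\n<120>\n";
my $second = "no markers in this file\n";

$first  =~ /<110>(.+)<120>/s;    # succeeds and sets $1
$second =~ /<110>(.+)<120>/s;    # fails and leaves $1 untouched
my $stale = $1;                  # still the capture from $first

print "stale capture: $stale";
print "safe result is ",
    ( defined( extract_block($second) ) ? "defined" : "undef" ), "\n";
```

Assigning the match in list context, as extract_block does, avoids the capture variables altogether, which is one way to sidestep the stale value.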

I ran your code and got
|SCHALLY, ANDREW V. |CAI, REN ZHI| ZARANDI, MARTA
which is pretty close. All you need to do is trim whitespace before you join. So replace this
my @split_string = split("\n", $1);
my $new_string = join("|", @split_string) ;
with this:
my @split_string = split("\n", $1);
my @names;
foreach my $name ( @split_string ) {
$name =~ s/^\s+|\s+$//g;   # trim leading and trailing whitespace
next if $name =~ /^$/;
push @names, $name;
}
my $new_string = join("|", @names);

@pts is correct: the regex capture variables are not reset to undef
when a match fails; they retain their last value.
So his solution should work for you. Use the if ( $whole_file =~ // ) {} form.
Beyond that, you could clean up the operation a little by doing something like this
use strict;
use warnings;
$/ = undef;
my $whole_file = <DATA>;
if ( $whole_file =~ /[<][1][1][0][>](.*)[<][1][2][0][>]/s )
{
my $applicant_string = $1;
$applicant_string =~ s/^\s+|\s+$//g;
my $op = "Applicant : " . join( " | ", split( /\s*\r?\n\s*/, $applicant_string) );
print $op;
}
__DATA__
<110>
SCHALLY, ANDREW V.
CAI, REN ZHI
ZARANDI, MARTA
<120>
Output:
Applicant : SCHALLY, ANDREW V. | CAI, REN ZHI | ZARANDI, MARTA

Related

Perl : match multiple lines

Using a Perl regex, I want to check whether two consecutive lines match and, if so, count the number of lines.
I want the number of lines up to the point where this pattern matches:
D001
0000
open ($file, "$file") || die;
my @lines_f = $file;
my $total_size = $#lines_f +1;
foreach my $line (@lines_f)
{
if ($line =~ /D001/) {
$FSIZE = $k + 1;
} else {
$k++;}
}
Instead of just D001, I also want to check if the next line is 0000. If so $FSIZE is the $file size.
The $file would look something like this
00001
00002
.
.
.
D0001
00000
00000
Here is an example. This sets $FSIZE to undef if it cannot find the marker lines:
use strict;
use warnings;
my $fn = 'test.txt';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
chomp (my @lines = <$fh>);
close $fh;
my $FSIZE = undef;
for my $i (0..$#lines) {
if ($lines[$i] =~ /D0001/) {
if ( $i < $#lines ) {
if ( $lines[$i+1] =~ /00000/ ) {
$FSIZE = $i + 1;
last;
}
}
}
}

Extract the matched pattern (Numbers +Name) in perl

I have used the pattern below to search for and extract a substring from a large string.
An example input string looks like
loadStringCombo('1',10,1,10,MaxCallApprComboBxId,quatstyle='width:50px;'quat)
Expected Output
(10,1,10,MaxCallApprComboBxId,)
But this way I am getting only combobox1 as output.
while ( my $st = $str =~ /[0-9]+[\,][0-9]+[\,][0-9]+[\,][0-9a-zA-Z]+[\,]/g ) {
my $str3 = "combobox" . $st;
push @arry1, $str3 . "\n";
print @arry1, "\n";
open FILE, ">test.txt" or die $!;
print FILE @arry1, "\n";
}
Please guide me to extract the value 10,1,10,MaxCallApprComboBxId,.
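Replace this line:
while ( my $st = $str =~ /[0-9]+[\,][0-9]+[\,][0-9]+[\,][0-9a-zA-Z]+[\,]/g ) {
with one that captures the matched text. In your version $st receives the result of the match (1 for success), not the matched string, which is why you only get combobox1:
while ( $str =~ /(\d+,\d+,\d+,[0-9a-zA-Z]+,)/g ) {
Keep the match in scalar context. A list assignment such as my ($st) = $str =~ /(...)/g reruns the match from the start on every iteration, so the loop never terminates once the string matches.
The whole loop:
while ($str =~ /(\d+,\d+,\d+,[0-9a-zA-Z]+,)/g ) {
push @arry1, "combobox$1";
}
use Data::Dumper;
print Dumper \@arry1;
open my $FILE, '>', 'test.txt' or die $!;
print $FILE "@arry1";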
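The behaviour of m//g depends on context, which is what the loop above relies on. As a rough sketch (the second loadStringCombo call and the variable names are invented for illustration): in scalar context m//g finds one match per call and remembers where it stopped, while in list context it returns every capture at once.

```perl
use strict;
use warnings;

my $str = "loadStringCombo('1',10,1,10,MaxCallApprComboBxId,x) "
        . "loadStringCombo('2',20,2,20,OtherComboBxId,y)";
my $re  = qr/(\d+,\d+,\d+,[0-9a-zA-Z]+,)/;

# Scalar context: one match per iteration, $1 holds the current capture.
my @one_by_one;
while ( $str =~ /$re/g ) {
    push @one_by_one, "combobox$1";
}

# List context: all captures are returned by a single call, no loop needed.
my @all_at_once = map { "combobox$_" } $str =~ /$re/g;

print "$_\n" for @one_by_one;
```

Both arrays end up with the same two entries; which form to use is mostly a matter of whether you need to do extra work per match.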

match using regex in perl

Hi, I am trying to extract some data from a text file in Perl. My file looks like this:
Name:John
FirstName:Smith
Name:Alice
FirstName:Meyers
....
I want my string to look like John Smith and Alice Meyers
I tried something like this but I'm stuck and I don't know how to continue
while (<INPUT>) {
if (/^[Name]/) {
$match =~ /(:)(.*?)(\n) /
$string = $string.$2;
}
if (/^[FirstName]/) {
$match =~ /(:)(.*?)(\n)/
$string = $string.$2;
}
}
What I am trying to do is, when I match Name or FirstName, copy the content between : and \n, but I get confused about which is $1 and which is $2.
This will put your first and last names in a hash:
use strict;
use warnings;
use Data::Dumper;
open my $in, '<', 'in.txt' or die "Could not open in.txt: $!";
my (%data, $names, $firstname);
while(<$in>){
chomp;
($names) = /Name:(.*)/ if /^Name/;
($firstname) = /FirstName:(.*)/ if /^FirstName/;
$data{$names} = $firstname;
}
print Dumper \%data;
With a Perl one-liner:
$ perl -0777 -pe 's/(?m).*?Name:([^\n]*)\nFirstName:([^\n]*).*/$1 $2/g' file
John Smith
Alice Meyers
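while (<INPUT>) {
# Capture the whole key; ([A-Za-z])+ would keep only its last letter.
next unless /^([A-Za-z]+):\s*(.*)$/;
if ($1 eq 'Name') {
$surname = $2;
} elsif ($1 eq 'FirstName') {
# The Name line comes first in the file, so its value goes first.
$completeName = $surname . " " . $2;
} else {
# Error: unexpected key
}
}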
You might want to add some error handling, e.g. make sure that a Name is always followed by a FirstName and so on.
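The error handling suggested above could look roughly like this. It is only a sketch: the helper name pair_names and the error messages are hypothetical, and it assumes every record is a Name line immediately followed by a FirstName line.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "Name:..." / "FirstName:..." line pairs into
# "Name FirstName" strings, dying when the lines are out of order.
sub pair_names {
    my @lines = @_;
    my ( @full, $pending );
    for my $line (@lines) {
        chomp $line;
        next unless $line =~ /^(Name|FirstName):\s*(.*)$/;
        my ( $key, $value ) = ( $1, $2 );
        if ( $key eq 'Name' ) {
            die "Name '$pending' has no FirstName\n" if defined $pending;
            $pending = $value;
        }
        else {
            die "FirstName '$value' has no preceding Name\n"
                unless defined $pending;
            push @full, "$pending $value";
            undef $pending;
        }
    }
    die "Name '$pending' has no FirstName\n" if defined $pending;
    return @full;
}

print "$_\n" for pair_names(
    "Name:John\n",  "FirstName:Smith\n",
    "Name:Alice\n", "FirstName:Meyers\n",
);
```

Copying $1 and $2 into named variables right after the match keeps them safe from any later match in the same block.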
$1, $2, $3 ... $N hold the capture results of the () groups inside the regex.
If you assign the match in list context like this, you can avoid using the $1-style variables:
my ($matched1,$matched2) = $text =~ /(.*):(.*)/;
my $names = [];
my $name = '';
while(my $row = <>){
next unless $row =~ /:(.*)/;
# Join with a space only once the first half is already stored.
$name = length($name) ? "$name $1" : $1;
push(@$names,$name) if $name =~ / /;
$name = '' if $name =~ / /;
}
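use Data::Dumper;
open (FH, '<', 'abc.txt') or die "Could not open abc.txt: $!";
my(%hash,@array);
# Strip everything up to the colon, then pair consecutive values as key => value.
map{$_=~s/.*?://g;chomp($_);push(@array,$_)} <FH>;
%hash=@array;
print Dumper \%hash;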

How to extract a part of string in Perl?

I have an input file with data like below:
X-X-D-X-X-A
X-D-X-A-X
D-X-X-X-X-A-X-X
I need the result to be only giving me
D-X-X-A
D-X-A
D-X-X-X-X-A
Please help me!!
You can try,
open my $fh, "<", "file" or die $!;
while (my $line = <$fh>) {
$line =~ s/^[-X]+ | [-X]+(?=\s*$)//xg;
print $line;
}
close $fh;
or from cmd line,
perl -pe 's/^[-X]+ | [-X]+(?=\s*$)//xg' file
Tested code:
my @a = qw[
X-X-D-X-X-A
X-D-X-A-X
D-X-X-X-X-A-X-X
];
for my $a (@a) {
if ($a =~ /(D[-X]+A)/) {
print $1,"\n";
}
}
perl -lne 'print $1 if(/[^D]*(D.*A).*/)'
test
inside a script while reading a file:
while (my $line = <$fh>) {
print "$1\n" if $line =~ m/[^D]*(D.*A).*/;
}
Regex:
[^D]*(D.*A).*
[^D]* skips any leading characters that are not D.
(D.*A) captures (round brackets) everything from the D up to the last A in the
line. The captured part is stored in $1.
.* matches whatever is left of the line.
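The "last A" part matters when a line contains more than one A after the D. A small sketch on a made-up input line, comparing the greedy pattern with a non-greedy one and with the character-class version from the tested code above:

```perl
use strict;
use warnings;

my $line = "X-D-X-A-X-A-X";    # made-up input with two A's after the D

my ($greedy) = $line =~ /(D.*A)/;       # .* runs to the last A
my ($lazy)   = $line =~ /(D.*?A)/;      # .*? stops at the first A
my ($strict) = $line =~ /(D[-X]+A)/;    # only - and X allowed between D and A

print "greedy: $greedy\n";    # D-X-A-X-A
print "lazy:   $lazy\n";      # D-X-A
print "strict: $strict\n";    # D-X-A
```

For the sample data in the question every line has a single A, so all three patterns give the same result there.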

Perl: How to extract sequences based on gene number and nucleotide length?

I have 2 files, as follows:
file1.txt:
0 117nt, >gene_73|GeneMark.hm... *
0 237nt, >gene_3097|GeneMark.... *
0 237nt, >gene_579|GeneMark.h... *
0 237nt, >gene_988|GeneMark.h... *
0 189nt, >gene_97|GeneMark.hm... *
0 183nt, >gene_97|GeneMark.hm... *
file2.fasta:
>gene_735|GeneMark.hmm|237_nt|+|798985|799221
TTGTGGTTCGTGCCGCGCGACGCGTTGCGTCTGCAAACGCCCGACGAAGACATCGCGACCTATCTGTTCAACAAGCATGTGATTCGGCATCGGTTCTGTCCGACCTGCGGGATTCATCCGTTCGCGGAAGGCACGGACCCGAAGGGCAACGCGATGGCGGCCGTCAATCTTCGCTGCGTCGACGGCGTCGATCTCGACGCGTTGAGCGTCCGCCATTTCGACGGGCGCGCGCTCTGA
>gene_579|GeneMark.hmm|237_nt|+|667187|667423
ATGTACCACGGCGCCGAATTTGCCGCTGCCAAGGGCATGCGCTGGCTGCGAGATGCCGCCAACGGCTCTGCCTTCATCGCACCGGGCAGTCCGTGGCAAAACGGTTTCGTCGAGCGTTTCAACGGCAAGCTGCATGACGAATTGCTGAACCGGGAATGGTTCCGCGGCCGTGCCGAGACCAAGATGCTCATCGAACGCTCCGGCTACGGTCCGTCGAGTCTGACCGGATTCCGATGA
>gene_1876|GeneMark.hmm|234_nt|-|2168498|2168731
ATGCTGTTCTTTTCGCGCGCGGGCGTGTCGCGTGCGGCCGGCGGCCAATCATGCGGCGAGTCGTTTTGTCGCGGCTCGCGGCGCTTGCCGACGTTGGAATCGCGCGCGCCGATGCGCGGATCGGGGCGGCAACGTTTGCGTATGAGGAATGATGCGTTTGCGCATCGGGAATGGGCGCCTCGCCCCGGTTTCGCCGCGATTCCGCCCGACTCGAGGCAGTCGTTTTTCCGCTAA
>gene_3097|GeneMark.hmm|237_nt|-|3467022|3467258
GTGTCGAACGAACGTCGCGGCGAACGGCCGCTGCGGGCATCGCCGCAGGACGTCACACGGCGAACGTCGCGCGCGATCCTCGGCGGCCGCGAACGTGGGCCGTCCCGTGGCACGTTCGGCTCGCTCGGCATGGCGAACGACCGCCGCATCGCGCATCGCCGTCGCGCGGCCTCCAAAAAAACGGCGGTCAGCGACCGCCGGCTTTGGCCGAAACCGATGCGTCGTACGAATCAGTGA
>gene_988|GeneMark.hmm|237_nt|+|1121027|1121263
ATGACCTTGTCAGGCAACATCAAGGACGGCGACTGGACGGTCGAGGTGACGACATCGCCGGTGCAGGGCGGTTACGTGTGCGACATCGAGGTGATGCACGGCGCGCCGGGCGGCGCGTTCCGGCACGCGTTCCGGCACGGCGGCACTTATCCGGCCGAGCGCGACGCGATGATCGAGGGGCTGCGCGCGGGCATGACCTGGATCGAGCTGAAGATGTCGAAAGCATTCAATCTGTAA
>gene_97|GeneMark.hmm|105_nt|+|90122|90226
GTGACGCGTTTCGCGACGCGCGTCGATGGGGCGGGCGCGAAACCCGTTCGCCGCGATGCGGCGGACGGGGTATGGCCGAGCGCCGTCCGTCGCGGCGAGAGTTGA
>gene_97|GeneMark.hmm|183_nt|-|107002|107184
ATGGAGGCAATCGTGATCGAGCAAGTGATACTGGGCGTCTTTCTCGTACTGCCGCTTCTCATCGTCGCGGTGCTGTACTCCGACGAACTCTGGCAAGAACACCGCCTGCAGCATCCGCGCGACGAGCACACGCCACATATCGACTGGCGTCATCCGTGGCGGATCCTGCGGCGAGGGCACTAA
>gene_97|GeneMark.hmm|189_nt|-|98624|98812
GTGAAATACACGAGCGACCATTACGCGGGCGTCAAATTTGGCGCGCTGTACGGGTTCTCGAACGCGGCGAACTTCGCCGACAACCGCGCTCGCCGGCGCATGCGCGGCGTTCGCATACGCGATCGGCAAAAGCGGCGTGATGTGCGGTTGCCTGCCGCGCTCGCGCTATGCGCGGCACGCCATCGATGA
>gene_97|GeneMark.hmm|234_nt|+|105494|105727
ATGAAGATTCAAATCGCCATTGTTTATTTTGTCGCCCGTCACGCAAACGAGCAGGCGCGAAGCGGATCGGCGCGCATTGGCGAAGAGCCGGCGCGCATCGGCATCGCGCTCGCGCGACACATGCGCGCCGCGCGCGGCCGGTCGACGCCGGATTCGCCTGTCGATCGATCCGGTGCGCCCCGAGCCGATGAGCGGTACGCTTCGGCGCGCGCGCGACACGCGCGACACGCGTGA
>gene_979|GeneMark.hmm|225_nt|-|1115442|1115666
TTGATCGACGCGCGGGGCCGGCCGGGCCGCGGGGTATCGAAGGCGATCGACGCGCAACACGAATCGCCGCCGCGCGCCGAAACCTCGCTATGCGCGTCGCGCGCACGCGCGGCCGGCGGCGCACGCGCGGGTGTGCGCGGGCCGGCGGCGCGGCCGCTCGCACTGCGCGACCGCTCGCGCGCACGCCTTCCTCGGCACGCGCCGGGAATCCCGGCCCTTCAATGA
The output that I expect is:
>gene_579|GeneMark.hmm|237_nt|+|667187|667423
ATGTACCACGGCGCCGAATTTGCCGCTGCCAAGGGCATGCGCTGGCTGCGAGATGCCGCCAACGGCTCTGCCTTCATCGCACCGGGCAGTCCGTGGCAAAACGGTTTCGTCGAGCGTTTCAACGGCAAGCTGCATGACGAATTGCTGAACCGGGAATGGTTCCGCGGCCGTGCCGAGACCAAGATGCTCATCGAACGCTCCGGCTACGGTCCGTCGAGTCTGACCGGATTCCGATGA
>gene_3097|GeneMark.hmm|237_nt|-|3467022|3467258
GTGTCGAACGAACGTCGCGGCGAACGGCCGCTGCGGGCATCGCCGCAGGACGTCACACGGCGAACGTCGCGCGCGATCCTCGGCGGCCGCGAACGTGGGCCGTCCCGTGGCACGTTCGGCTCGCTCGGCATGGCGAACGACCGCCGCATCGCGCATCGCCGTCGCGCGGCCTCCAAAAAAACGGCGGTCAGCGACCGCCGGCTTTGGCCGAAACCGATGCGTCGTACGAATCAGTGA
>gene_988|GeneMark.hmm|237_nt|+|1121027|1121263
ATGACCTTGTCAGGCAACATCAAGGACGGCGACTGGACGGTCGAGGTGACGACATCGCCGGTGCAGGGCGGTTACGTGTGCGACATCGAGGTGATGCACGGCGCGCCGGGCGGCGCGTTCCGGCACGCGTTCCGGCACGGCGGCACTTATCCGGCCGAGCGCGACGCGATGATCGAGGGGCTGCGCGCGGGCATGACCTGGATCGAGCTGAAGATGTCGAAAGCATTCAATCTGTAA
>gene_97|GeneMark.hmm|183_nt|-|107002|107184
ATGGAGGCAATCGTGATCGAGCAAGTGATACTGGGCGTCTTTCTCGTACTGCCGCTTCTCATCGTCGCGGTGCTGTACTCCGACGAACTCTGGCAAGAACACCGCCTGCAGCATCCGCGCGACGAGCACACGCCACATATCGACTGGCGTCATCCGTGGCGGATCCTGCGGCGAGGGCACTAA
>gene_97|GeneMark.hmm|189_nt|-|98624|98812
GTGAAATACACGAGCGACCATTACGCGGGCGTCAAATTTGGCGCGCTGTACGGGTTCTCGAACGCGGCGAACTTCGCCGACAACCGCGCTCGCCGGCGCATGCGCGGCGTTCGCATACGCGATCGGCAAAAGCGGCGTGATGTGCGGTTGCCTGCCGCGCTCGCGCTATGCGCGGCACGCCATCGATGA
There are 4 sequences with gene number 97, each with a different length. I want only the sequences whose length matches the one listed in file1.txt to be written to the output.fasta file. What I've done so far is below, but it fails with errors:
#!/usr/bin/perl
use strict;
use warnings;
my @genes;
open my $list, '<file1.txt';
while (my $line = <$list>) {
push (@genes, $1) if $line =~/\>(.*?)\|/gs;
}
my $tag1 = "0\t";
my $tag2 = "nt";
while (my $line = <$list>) {
if ($line =~ /$tag1(.*?)$tag2/) {
my $match1 = $1;
}
}
my $input;
{
local $/ = undef;
open my $fasta, '<file2.fasta';
my $tag3 = "GeneMark.hmm";
my $tag4 = "_nt";
while (my $input = <$fasta>) {
if ($input =~ /$tag3(.*?)$tag4/) {
my $match2 = $1; }}
close $fasta;
}
my @lines = split(/>/,$input);
foreach my $l (@lines) {
if ($l =~ /(.+?)\|/) {
my $real_name = $1;
if ($real_name ~~ @genes) {
if ($match2 = $match1) {
open (OUTFILE, '>>output.fasta');
print OUTFILE ">$l"; }
}
}
}
Can anyone give me some guide to correct the code? Or is there any better way to do this? Any help will be very much appreciated! Thanks! :)
Here's an option that uses Bio::SeqIO:
use strict;
use warnings;
use Bio::SeqIO;
my %hash;
open my $fh, '<', $ARGV[0] or die $!;
while (<$fh>) {
push @{ $hash{$2} }, $1 if /\s+(\d+)nt,.+?>(gene_\d+)\|/;
}
close $fh;
my $in = Bio::SeqIO->new( -file => $ARGV[1], -format => 'Fasta' );
my $out = Bio::SeqIO->new( -fh => \*STDOUT, -format => 'Fasta' );
while ( my $seq = $in->next_seq() ) {
$out->write_seq($seq)
if $seq->id =~ /(gene_\d+)\|.+?\|(\d+)_nt\|/ and grep /$2/, @{ $hash{$1} };
}
Usage: perl script.pl file1.txt file2.fasta [>outFile.fasta]
The optional redirection at the end directs the output to a file.
Output from your data:
>gene_579|GeneMark.hmm|237_nt|+|667187|667423
ATGTACCACGGCGCCGAATTTGCCGCTGCCAAGGGCATGCGCTGGCTGCGAGATGCCGCC
AACGGCTCTGCCTTCATCGCACCGGGCAGTCCGTGGCAAAACGGTTTCGTCGAGCGTTTC
AACGGCAAGCTGCATGACGAATTGCTGAACCGGGAATGGTTCCGCGGCCGTGCCGAGACC
AAGATGCTCATCGAACGCTCCGGCTACGGTCCGTCGAGTCTGACCGGATTCCGATGA
>gene_3097|GeneMark.hmm|237_nt|-|3467022|3467258
GTGTCGAACGAACGTCGCGGCGAACGGCCGCTGCGGGCATCGCCGCAGGACGTCACACGG
CGAACGTCGCGCGCGATCCTCGGCGGCCGCGAACGTGGGCCGTCCCGTGGCACGTTCGGC
TCGCTCGGCATGGCGAACGACCGCCGCATCGCGCATCGCCGTCGCGCGGCCTCCAAAAAA
ACGGCGGTCAGCGACCGCCGGCTTTGGCCGAAACCGATGCGTCGTACGAATCAGTGA
>gene_988|GeneMark.hmm|237_nt|+|1121027|1121263
ATGACCTTGTCAGGCAACATCAAGGACGGCGACTGGACGGTCGAGGTGACGACATCGCCG
GTGCAGGGCGGTTACGTGTGCGACATCGAGGTGATGCACGGCGCGCCGGGCGGCGCGTTC
CGGCACGCGTTCCGGCACGGCGGCACTTATCCGGCCGAGCGCGACGCGATGATCGAGGGG
CTGCGCGCGGGCATGACCTGGATCGAGCTGAAGATGTCGAAAGCATTCAATCTGTAA
>gene_97|GeneMark.hmm|183_nt|-|107002|107184
ATGGAGGCAATCGTGATCGAGCAAGTGATACTGGGCGTCTTTCTCGTACTGCCGCTTCTC
ATCGTCGCGGTGCTGTACTCCGACGAACTCTGGCAAGAACACCGCCTGCAGCATCCGCGC
GACGAGCACACGCCACATATCGACTGGCGTCATCCGTGGCGGATCCTGCGGCGAGGGCAC
TAA
>gene_97|GeneMark.hmm|189_nt|-|98624|98812
GTGAAATACACGAGCGACCATTACGCGGGCGTCAAATTTGGCGCGCTGTACGGGTTCTCG
AACGCGGCGAACTTCGCCGACAACCGCGCTCGCCGGCGCATGCGCGGCGTTCGCATACGC
GATCGGCAAAAGCGGCGTGATGTGCGGTTGCCTGCCGCGCTCGCGCTATGCGCGGCACGC
CATCGATGA
Bio::SeqIO lives to parse fasta (and other such) files, so the above leverages this capability. After creating a hash of arrays (HoA) from file1.txt, the fasta file is processed, and only matching fasta records are printed.
Hope this helps!
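If installing BioPerl is not an option, the same filtering idea can be sketched in core Perl. This is only an illustration under assumptions: the data is read from in-memory strings (trimmed from the question, with shortened sequences) rather than from file1.txt and file2.fasta, and the gene/length key format is made up for the example.

```perl
use strict;
use warnings;

# Sample data trimmed from the question; sequences shortened for brevity.
my $list_text = <<'END';
0 237nt, >gene_579|GeneMark.h... *
0 189nt, >gene_97|GeneMark.hm... *
0 183nt, >gene_97|GeneMark.hm... *
END

my $fasta_text = <<'END';
>gene_579|GeneMark.hmm|237_nt|+|667187|667423
ATGTACCACGGC
>gene_97|GeneMark.hmm|105_nt|+|90122|90226
GTGACGCGTTTC
>gene_97|GeneMark.hmm|183_nt|-|107002|107184
ATGGAGGCAATC
END

# Build a lookup of wanted "gene/length" pairs from the list.
my %wanted;
open my $list, '<', \$list_text or die "Cannot read list: $!";
while ( my $line = <$list> ) {
    $wanted{"$2/$1"} = 1 if $line =~ /(\d+)nt,\s*>(gene_\d+)\|/;
}
close $list;

# Walk the FASTA text; each header line switches collection on or off
# for the record that follows it.
my ( $keep, @kept ) = (0);
open my $fasta, '<', \$fasta_text or die "Cannot read FASTA: $!";
while ( my $line = <$fasta> ) {
    if ( $line =~ /^>(gene_\d+)\|[^|]*\|(\d+)_nt\|/ ) {
        $keep = exists $wanted{"$1/$2"};
    }
    push @kept, $line if $keep;
}
close $fasta;

print @kept;
```

Looking up the exact gene/length pair in a hash also avoids partial matches on the length, for example 37 matching inside 237.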