Extract the matched pattern (Numbers +Name) in perl

I used the pattern below to search for and extract a substring from a larger string.
Example input string:
loadStringCombo('1',10,1,10,MaxCallApprComboBxId,quatstyle='width:50px;'quat)
Expected output:
(10,1,10,MaxCallApprComboBxId,)
But this way I only get combobox1 as output.
while ( my $st = $str =~ /[0-9]+[\,][0-9]+[\,][0-9]+[\,][0-9a-zA-Z]+[\,]/g ) {
    my $str3 = "combobox" . $st;
    push @arry1, $str3 . "\n";
    print @arry1, "\n";
    open FILE, ">test.txt" or die $!;
    print FILE @arry1, "\n";
}
Please guide me on how to extract the value 10,1,10,MaxCallApprComboBxId,.

Replace this line:
while ( my $st = $str =~ /[0-9]+[\,][0-9]+[\,][0-9]+[\,][0-9a-zA-Z]+[\,]/g ) {
with:
while ( $str =~ /(\d+,\d+,\d+,[0-9a-zA-Z]+,)/g ) {
In scalar context the match only returns true or false (hence the 1 in combobox1). With the capturing parentheses added, the matched text is available in $1.
Whole loop:
while ( $str =~ /(\d+,\d+,\d+,[0-9a-zA-Z]+,)/g ) {
    push @arry1, "combobox$1";
}
use Data::Dumper;
print Dumper \@arry1;
open my $FILE, '>', 'test.txt' or die $!;
print $FILE "@arry1";
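
To make the context difference concrete, here is a minimal sketch using the input string from the question. The variable names ($re, $flag, $captured, @hits) are only illustrative and do not come from the original code.

```perl
use strict;
use warnings;

my $str = "loadStringCombo('1',10,1,10,MaxCallApprComboBxId,quatstyle='width:50px;'quat)";
my $re  = qr/(\d+,\d+,\d+,[0-9a-zA-Z]+,)/;

# Scalar context: the match returns true/false, not the matched text.
my $flag = $str =~ $re;

# List context (note the parentheses): the match returns the captured groups.
my ($captured) = $str =~ $re;

# In a while loop with /g the match is in scalar context again,
# so each hit is read from $1 instead.
my @hits;
while ( $str =~ /$re/g ) {
    push @hits, "combobox$1";
}

print "flag=$flag\n";
print "captured=$captured\n";
print "$_\n" for @hits;
```

The scalar-context result is what produced combobox1 in the question; the list-context form is handy when only the first match is needed.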

Related

perl to check if sequence contains at least 3 unique bases and if not delete

I have a FASTA file. I need to remove sequences that contain “N” or that do not contain at least 3 unique bases.
My code so far is below. Also, how would I remove the sequence ID line for the sequences I delete as I go along?
#!/usr/bin/perl
use strict;
use warnings;
open FILE, '<', $ARGV[0] or die qq{Failed to open "$ARGV[0]" for input: $!\n};
open match_fh, ">$ARGV[0]_trimmed.fasta"
or die qq{Failed to open for output: $!\n};
while ( my $line = <FILE> ) {
chomp($line);
if ( $line =~ m/^>/ ) {
print match_fh "$line\n";
my @data = split( /\|/, $line );
my $nextline = <FILE>;
if ( $nextline !~ /N+/g ) {
if ( $nextline =~ /[ATGC]{3}/g ) {
}
print match_fh "$nextline";
}
}
}
close FILE;
close match_fh;
INPUT
>seq1
ATGCGGGATGATCCGAACGTTTAATCTCGTATGCCGTCTTCTATCTCNNN
>seq2
GATGAGCTTGACTCTAGTCCATCTCGTATGCCGTCTTCTGCTATCTCGTA
>seq3
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTC
>seq4
TGGTACTGTAAGCATGAGAGTAATCTCGTATGCCGTCTTCTGCTTGAAAA
OUTPUT
>seq2
GATGAGCTTGACTCTAGTCCATCTCGTATGCCGTCTTCTGCTATCTCGTA
>seq4
TGGTACTGTAAGCATGAGAGTAATCTCGTATGCCGTCTTCTGCTTGAAAA

while(my $head = <FILE>) {
next if($head !~ /^>/);
$_=<FILE>;
if(!/N+/ && /A/+/T/+/G/+/C/ >= 3) {
print match_fh $head, $_;
}
}
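
The same test can be written as a small predicate, which may be easier to read than summing four matches. This is only a sketch: keep_sequence is a hypothetical helper name, and it assumes the bases are written as uppercase A, C, G, T.

```perl
use strict;
use warnings;

# Return 1 when a sequence has no "N" and at least three distinct bases,
# 0 otherwise. Assumes uppercase A/C/G/T input.
sub keep_sequence {
    my ($seq) = @_;
    return 0 if $seq =~ /N/;

    # Collect the distinct bases that occur in the sequence.
    my %seen;
    $seen{$_}++ for $seq =~ /[ACGT]/g;

    return keys(%seen) >= 3 ? 1 : 0;
}

print keep_sequence("GATGAGCTTGACTCTAGTCC") ? "keep\n" : "drop\n";
```

Inside the loop above, the header and sequence would then be printed together only when keep_sequence($_) is true, so the ID line is dropped along with the sequence.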

How to get the next line of file if pattern match in file using Perl script

I am trying to find the pattern Pattern String. Once it is found, I need the line that follows it, which contains the page number. From the sample text file below I need to extract the page number 2 from Page: 2 of 5. Here is my attempt:
my $filename="sample.txt";
$i=1;
open(FILE, "<$filename") or die "File couldn't be matched $filename\n";
@array = <FILE>;
foreach $line(@array){
chomp($line);
if ($array[$i]=~/(\s+)Pattern String(\s+)/) {
if ($array[$i]=~/(\s+)Page:(\s+)(.*) of (.*)/) {
$page = $3;
}
}
Here is my sample text file :
Pattern String
MCN: 349450A0 NCP Account ID: 999 600-0089 Page: 2 of 5
=============================================================================
Customer Name: PCS HEALTH SYSTEMS
Customer Number: 349450A0

What about this? Is it what you want? After a match, the next non-empty line is printed. Let me know if it works for you.
# Perl:
my $filename = "sample.txt";
my $match = 0;    # set once "Pattern String" has been seen
open(my $fh, "<", $filename) or die "Failed to open file: $!";
while (my $line = <$fh>) {
    if ( $line =~ /Pattern\sString/ ) {
        $match = 1;
        next;
    }
    # print the first non-empty line after the match
    if ( $match && $line !~ /^$/ ) {
        print $line;
        $match = 0;
    }
}

I think this will solve the problem (I'm assuming that the sample files will always have the same format). I hope this will help you, please let me know if it worked.
my $filename="sample.txt";
my $count = 0;
my $tgline = 0;
open(my $fh, "<", $filename) or die "Failed to open file: $!";
my @lines = <$fh>;
foreach (@lines) {
if ( $_ =~ /.*Pattern\sString.*/ ) {
$tgline = $count + 2;
if ( $lines[$tgline] =~ /.*Page\:\s(\d+)\sof\s(\d+)$/ ) {
print "Current page: " . $1 . "\n";
print "Total page #: " . $2 . "\n";
}
}
$count+=1;
}

I don't know why you are matching Pattern String if your goal is to get the 2 from Page: 2 of 5 in your input file. This is one way to do it:
use warnings;
use strict;
my $filename = "sample.txt";
open my $fh, "<","$filename" or die "Couldn't open $filename: $!";
while (my $line = <$fh>)
{
if($line =~ m/.*Page:\s(\d+)\sof\s(\d+)$/)
{
print "$1\n";
}
}
sample.txt:
Pattern String
MCN: 349450A0 NCP Account ID: 999 600-0089 Page: 2 of 5
=============================================================================
Customer Name: PCS HEALTH SYSTEMS
Customer Number: 349450A0
Output:
2
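
If the page number should only be taken from the line that follows Pattern String, as the question asks, the two ideas can be combined. This is a rough sketch under that assumption: page_after_pattern is a hypothetical helper, and it takes the lines as a list so it can be tried without a file.

```perl
use strict;
use warnings;

# Hypothetical helper: return the page number found on the first non-empty
# line that follows a line containing "Pattern String", or undef if none.
sub page_after_pattern {
    my (@lines) = @_;
    my $armed = 0;    # becomes true once "Pattern String" has been seen
    for my $line (@lines) {
        if ( $line =~ /Pattern\s+String/ ) {
            $armed = 1;
            next;
        }
        next unless $armed;
        next if $line =~ /^\s*$/;    # skip blank lines after the marker
        return $line =~ /Page:\s*(\d+)\s+of\s+\d+/ ? $1 : undef;
    }
    return undef;
}

my @sample = (
    "Pattern String\n",
    "MCN: 349450A0 NCP Account ID: 999 600-0089 Page: 2 of 5\n",
);
my $page = page_after_pattern(@sample);
print defined $page ? "$page\n" : "no page found\n";
```

With a real file the list would simply be the result of reading the file handle in list context.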

Splitting regex matches by newlines in perl

I am trying to read the files in a directory and print out regexp matches.
I am trying to match
<110>
*everything here*
<120>
My matches would be
SCHALLY, ANDREW V.
CAI, REN ZHI
ZARANDI, MARTA
However, when I try to split this by newline and join using "|", I do not get the desired output, which is
Applicant : SCHALLY, ANDREW V. | CAI, REN ZHI | ZARANDI, MARTA
My current output is only
| ZARANDI, MARTA
Can someone see any obvious mistakes?
#!/usr/bin/perl
use warnings;
use strict;
use IO::Handle;
open (my $fh, '>', '../logfile.txt') || die "can't open logfile.txt";
open (STDERR, ">>&=", $fh) || die "can't redirect STDERR";
$fh->autoflush(1);
my $input_path = "../input/";
my $output_path = "../output/";
my $whole_file;
opendir INPUTDIR, $input_path or die "Cannot find dir $input_path : $!";
my @input_files = readdir INPUTDIR;
closedir INPUTDIR;
foreach my $input_file (@input_files)
{
$whole_file = &getfile($input_path.$input_file);
if ($whole_file){
$whole_file =~ /[<][1][1][0][>](.*)[<][1][2][0][>]/s ;
if ($1){
my $applicant_string = "Applicant : $1";
my $op = join( "|", split("\n", $applicant_string) );
print $op;
}
}
}
close $fh;
sub getfile {
my $filename = shift;
open F, "< $filename " or die "Could not open $filename : $!" ;
local $/ = undef;
my $contents = <F>;
close F;
return $contents;
}
EDIT 1
I ran the code on a single file:
#!/usr/bin/perl
use warnings;
use strict;
use IO::Handle;
my $input_file = "01.txt-WO13_090919_PD_20130620";
my $input_path = "../input/";
my $whole_file = &getfile($input_path.$input_file);
if ($whole_file =~ /[<][1][1][0][>](.*)[<][1][2][0][>]/s ) {
print $1;
my @split_string = split("\n", $1);
my $new_string = join("|", @split_string) ;
print "$new_string \n";
}
sub getfile {
my $filename = shift;
open F, "< $filename " or die "Could not open $filename : $!" ;
local $/ = undef;
my $contents = <F>;
close F;
return $contents;
}
Output
Chen, Guokai
Thomson, James
Hou, Zhonggang
Hou, Zhonggang

Replace
$whole_file =~ /[<][1][1][0][>](.*)[<][1][2][0][>]/s ;
if ($1) {
with
if ($whole_file =~ /[<][1][1][0][>](.+)[<][1][2][0][>]/s) {
The problem with your original code is that $1 is unchanged (i.e. retained from the previous file) if the regexp doesn't match.
If that doesn't solve the problem, then double-check that you have the correct value in $applicant_string. Your join + split line looks correct.
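
A minimal sketch of the stale-capture behaviour, using two made-up strings in a single scope rather than the files from the question. The names $first, $second, $after_second and $guarded are only for illustration.

```perl
use strict;
use warnings;

# Made-up inputs: the first contains the markers, the second does not.
my $first  = "<110>\nSMITH, A.\n<120>\n";
my $second = "no markers here\n";

$first =~ /<110>\s*(.*?)\s*<120>/s;     # succeeds and sets $1
my $after_first = $1;

$second =~ /<110>\s*(.*?)\s*<120>/s;    # fails, so $1 is left untouched
my $after_second = $1;                  # still the value from $first

# Guarded form: only read $1 when this particular match succeeded.
my $guarded = $second =~ /<110>\s*(.*?)\s*<120>/s ? $1 : undef;

print "after first:  $after_first\n";
print "after second: $after_second\n";
print "guarded:      ", ( defined $guarded ? $guarded : "(no match)" ), "\n";
```

Testing the match itself, as in the replacement above, avoids ever reading a capture left over from an earlier match.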

I ran your code and got
|SCHALLY, ANDREW V. |CAI, REN ZHI| ZARANDI, MARTA
which is pretty close. All you need to do is trim whitespace before you join. So replace this
my @split_string = split("\n", $1);
my $new_string = join("|", @split_string) ;
with this:
my @split_string = split("\n", $1);
my @names;
foreach my $name ( @split_string ) {
    $name =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    next if $name =~ /^$/;
    push @names, $name;
}
my $new_string = join("|", @names);

@pts is correct: the regex capture variables are not reset to undef
when a match fails; they retain their last value.
So his solution should work for you. Use the if ( $whole_file =~ // ) {} form.
Beyond that, you could clean up the operation a little by doing something like this:
use strict;
use warnings;
$/ = undef;
my $whole_file = <DATA>;
if ( $whole_file =~ /[<][1][1][0][>](.*)[<][1][2][0][>]/s )
{
my $applicant_string = $1;
$applicant_string =~ s/^\s+|\s+$//g;
my $op = "Applicant : " . join( " | ", split( /\s*\r?\n\s*/, $applicant_string) );
print $op;
}
__DATA__
<110>
SCHALLY, ANDREW V.
CAI, REN ZHI
ZARANDI, MARTA
<120>
Output:
Applicant : SCHALLY, ANDREW V. | CAI, REN ZHI | ZARANDI, MARTA

Match different variant of a word using regex Perl

I am splitting sentences at individual space characters and then matching the resulting terms against the keys of hashes. I only get a match if the terms are 100% identical, and I am struggling to find a regex that matches variants of the same word. For example, if I have the term 'antagon', it matches 'antagon' perfectly but fails to match antagonists, antagonistic, pre-antagonistic, hydro-antagonist, etc. I also need a regex that matches words like MCF-7 with MCF7 or MC-F7, ignoring the special characters.
This is the code I have so far; the commented part is where I am struggling.
(Note: terms in the hash are stemmed to the root form of a word.)
use warnings;
use strict;
use Drug;
use Stop;
open IN, "sample.txt" or die "cannot find sample";
open OUT, ">sample1.txt" or die "cannot find sample";
while (<IN>) {
chomp $_;
my $flag = 0;
my $line = lc $_;
my @full = ();
if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
my $string = $1;
chomp $string;
$string =~ s/,/ , /g;
$string =~ s/\./ \. /g;
$string =~ s/;/ ; /g;
$string =~ s/\(/ ( /g;
$string =~ s/\)/ )/g;
$string =~ s/\:/ : /g;
$string =~ s/\::/ :: )/g;
my @array = split / /, $string;
foreach my $word (@array) {
chomp $word;
if ( $word =~ /\,|\;|\.|\(|\)/g ) {
push( @full, $word );
}
if ( $Stop_words{$word} ) {
push( @full, $word );
}
if ( $Values{$word} ) {
my $term = "<Drug>$word<\/Drug>";
push( @full, $term );
}
else {
push( @full, $word );
}
}
# if($word=~/.*\Q$Values{$word}\E/i)#Changed this
# {
# $term="<Drug>$word</$Drug>";
# print $term,"\n";
# push(@full,$term);
# }
}
}
my $mod_str = join( " ", @full );
print OUT $mod_str, "\n";
}

"I need a regex to match occurrences of words like MCF-7 with MCF7 or MC-F7"
The most straightforward approach is just to strip out the hyphens, i.e.
my $ignore_these = "[-_']";
$word =~ s{$ignore_these}{}g;
I am not sure what is stored in your Values hash, so it's hard to tell what you expect to happen in
if($word=~/.*\Q$Values{$word}\E/i)
However, the kind of thing I imagine you want is this (I have simplified your code somewhat):
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use 5.10.0;
use Data::Dumper;
while (<>) {
chomp $_;
my $flag = 0;
my $line = lc $_;
my @full = ();
if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
my $string = $1;
chomp $string;
$string =~ s/([,\.;\(\)\:])/ $1 /g; # squished these together
$string =~ s/\:\:/ :: )/g; # typo in original
my @array = split /\s+/, $string; # split on one /or more/ spaces
foreach my $word (@array) {
chomp $word;
my $term=$word;
my $word_chars = "[\\w\\-_']";
my $word_part = "antagon";
if ($word =~ m{$word_chars*?$word_part$word_chars+}) {
$term="<Drug>$word</Drug>";
}
push(@full,$term); # push
}
}
my $mod_str = join( " ", @full );
say "<Sentence>$mod_str</Sentence>";
}
This gives me the following output, which is my best guess at what you expect:
$ cat tmp.txt
<Sentence>This in antagonizing the antagonist's antagonism pre-antagonistically.</Sentence>
$ cat tmp.txt | perl x.pl
<Sentence>this in <Drug>antagonizing</Drug> the <Drug>antagonist's</Drug> <Drug>antagonism</Drug> <Drug>pre-antagonistically</Drug> .</Sentence>
$
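
For the MCF-7 / MCF7 / MC-F7 part, one option is to normalize both the word and the hash key before comparing them. The sketch below only illustrates that idea: normalize_term and matches_stem are hypothetical helpers, and it assumes the hash keys are already stemmed as the question says.

```perl
use strict;
use warnings;

# Hypothetical helper: lowercase a term and drop hyphens, underscores and
# apostrophes so that spelling variants compare equal.
sub normalize_term {
    my ($term) = @_;
    ( my $clean = lc $term ) =~ s/[-_']//g;
    return $clean;
}

# Return 1 when the normalized word contains the normalized stem.
sub matches_stem {
    my ( $word, $stem ) = @_;
    return index( normalize_term($word), normalize_term($stem) ) >= 0 ? 1 : 0;
}

for my $word (qw(MCF-7 MC-F7 pre-antagonistic agonist)) {
    printf "%-18s mcf7:%d antagon:%d\n",
        $word, matches_stem( $word, 'MCF7' ), matches_stem( $word, 'antagon' );
}
```

Note that a plain substring test like this is looser than the regex above, so very short stems will match many unrelated words.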

perl -ne '$things{$1}++while s/([^ ;.,!?]*?antagon[^ ;.,!?]++)//;END{print "$_\n" for sort keys %things}' FILENAME
If the file contains the following:
he was an antagonist
antagonize is a verb
why are you antagonizing her?
this is an alpha-antagonist
This will return:
alpha-antagonist
antagonist
antagonize
antagonizing
Below is a regular (not one-liner) version:
#!/usr/bin/perl
use warnings;
use strict;
open my $in, "<", "sample.txt" or die "could not open sample.txt for reading!";
open my $out, ">", "sample1.txt" or die "could not open sample1.txt for writing!";
my %things;
while (<$in>){
$things{$1}++ while s/([^ ;.,!?]*?antagon[^ ;.,!?]++)//
}
print $out "$_\n" for sort keys %things;

You may want to take another look at the assumptions behind your approach. It sounds to me as if you are looking for words that are within a certain distance of a list of words. Take a look at the Levenshtein distance to see if this is what you want. Be aware, however, that a naive recursive implementation takes exponential time; the usual dynamic-programming version takes time proportional to the product of the two word lengths, which still adds up when every word is compared against every key.
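
For reference, a short sketch of the dynamic-programming form of the Levenshtein distance. The function name levenshtein is just a label chosen here; List::Util ships with Perl.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Edit distance between two strings, computed row by row with the standard
# dynamic-programming recurrence (insert, delete, substitute each cost 1).
sub levenshtein {
    my ( $s, $t ) = @_;
    my @prev = ( 0 .. length $t );    # distances from the empty prefix of $s
    for my $i ( 1 .. length $s ) {
        my @cur = ($i);
        for my $j ( 1 .. length $t ) {
            my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
            push @cur,
                min( $prev[$j] + 1, $cur[ $j - 1 ] + 1, $prev[ $j - 1 ] + $cost );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print levenshtein( "antagon", "antagonist" ), "\n";
```

A word could then be tagged when its distance to some key falls under a threshold, although choosing that threshold is a judgment call that depends on the data.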

How can I extract abbreviations from a file using Perl?

I need to extract certain abbreviations from a file, such as ABS, TVS, and PERL: any abbreviations that are written in uppercase letters. I'd preferably like to do this with a regular expression. Any help is appreciated.

It would have been nice to hear what part you were particularly having trouble with.
my %abbr;
open my $inputfh, '<', 'filename'
or die "open error: $!\n";
while ( my $line = readline($inputfh) ) {
while ( $line =~ /\b([A-Z]{2,})\b/g ) {
$abbr{$1}++;
}
}
for my $abbr ( sort keys %abbr ) {
print "Found $abbr $abbr{$abbr} time(s)\n";
}

Reading text to be searched from standard input and writing
all abbreviations found to standard output, separated by spaces:
my $text;
# Slurp all text
{ local $/ = undef; $text = <>; }
# Extract all sequences of 2 or more uppercase characters
my @abbrevs = $text =~ /\b([[:upper:]]{2,})\b/g;
# Output separated by spaces
print join(" ", @abbrevs), "\n";
Note the use of the POSIX character class [:upper:], which can match
uppercase characters beyond the English ones (A-Z), provided the input has been decoded as Unicode text or a suitable locale is in effect.
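
As a variation on the same idea, the Unicode property \p{Lu} (uppercase letter) can be used in place of [:upper:]. This is a small sketch with a hypothetical helper name, and it assumes the text passed in is already a decoded character string.

```perl
use strict;
use warnings;

# Return every run of two or more uppercase letters, in order of appearance.
# Assumes $text is a decoded character string, not raw bytes.
sub extract_abbreviations {
    my ($text) = @_;
    return $text =~ /\b(\p{Lu}{2,})\b/g;
}

my @found = extract_abbreviations("Use ABS, TVS and PERL, not Perl.");
print join( " ", @found ), "\n";
```

When reading from a file, opening it with an encoding layer such as '<:encoding(UTF-8)' is one way to get decoded text.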

Untested:
my %abbr;
open (my $input, "<", "filename")
|| die "open: $!";
while ( <$input> ) {
while (s/([A-Z][A-Z]+)//) {
$abbr{$1}++;
}
}
Modified it to look for at least two consecutive capital letters.

#!/usr/bin/perl
use strict;
use warnings;
my %abbrs = ();
while(<>){
my @words = split ' ', $_;
foreach my $word(@words){
$word =~ /([A-Z]{2,})/ && $abbrs{$1}++;
}
}
# %abbrs now contains all abbreviations
