Perl Print if regex match A and not regex match B

I want to print a line of a file if it contains string A and does not contain string B (also splitting the line into separate lines at each colon). What is the proper syntax? Here is what I tried (I want it to print lines containing "bash" but not lines containing numbers):
my $file = passwdtest;
open(FH, "$file");
foreach (<FH>) {
print join("\n", split(/:/, "$_")) if ($_ =~ /bash/ and $_ != /\d+/);
};
close FH;

$_ != /\d+/
is short for
$_ != ($_ =~ /\d+/)
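That compares $_ numerically with the result of the match instead of testing the pattern. Instead of != you need the negated binding operator !~: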
if ($_ =~ /bash/ and $_ !~ /\d+/);

Isn't that just:
if (/bash/ && ! /\d+/)
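To see the corrected condition in a complete script, here is a minimal sketch. The helper name fields_if_wanted and the sample lines are made up for illustration; they are not from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the colon-separated fields of a line that
# mentions "bash" and contains no digits; returns an empty list otherwise.
sub fields_if_wanted {
    my ($line) = @_;
    chomp $line;
    return unless $line =~ /bash/ && $line !~ /\d/;
    return split /:/, $line;
}

# Made-up sample lines, not a real passwd file.
my @lines = (
    "alice:x:/home/alice:/bin/bash\n",
    "bob:x:1001:/home/bob:/bin/bash\n",
    "carol:x:/home/carol:/bin/zsh\n",
);

for my $line (@lines) {
    my @fields = fields_if_wanted($line);
    print join("\n", @fields), "\n" if @fields;
}
```

Only the first sample line passes both tests: the second contains digits and the third does not mention bash.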

Related

Counting number of pattern matches in Perl

I am VERY new to Perl, and to programming in general.
I have been searching for the past couple of days for how to count the number of pattern matches; I have had a hard time understanding others' solutions and applying them to the code I have already written.
Basically, I have a sequence and I need to find all the patterns that match [TC]C[CT]GGAAGC
I believe I have that part down, but I am stuck on counting the number of occurrences of each pattern match. Does anyone know how to edit the code I already have to do this? Any advice is welcome. Thanks!
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
# open fasta file for reading
unless( open( FASTA, "<", '/scratch/Drosophila/dmel-all-chromosome-r6.02.fasta' )) {
die "Can't open dmel-all-chromosome-r6.02.fasta for reading:", $!;
}
#split the fasta record
local $/ = ">";
#scan through fasta file
while (<FASTA>) {
chomp;
if ( $_ =~ /^(.*?)$(.*)$/ms) {
my $header = $1;
my $seq = $2;
$seq =~ s/\R//g; # \R removes line breaks
while ( $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
}
}
}
Update, I have added in
my @matches = $seq =~ /([TC]C[CT]GGAAGC)/g;
print scalar @matches;
In the code below. However, it seems to be outputting 0 in front of each pattern match, instead of outputting the total sum of all pattern matches.
while (<FASTA>) {
chomp;
if ( $_ =~ /^(.*?)$(.*)$/ms) {
my $header = $1;
my $seq = $2;
$seq =~ s/\R//g; # \R removes line breaks
while ( $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
my @matches = $seq =~ /([TC]C[CT]GGAAGC)/g;
print scalar @matches;
}
}
}
Edit: I need the output to list every pattern match found. I also need it to report the total number of matches found. For example:
CCTGGAAGC
TCTGGAAGC
TCCGGAAGC
3 matches found

counting the number of occurrences of each pattern match
my @matches = $string =~ /pattern/g;
The @matches array will contain all the matched parts. You can then do the following to get the count:
print scalar @matches;
Or you could directly write
my $matches = () = $string =~ /pattern/g;
I suggest using the former, as you might need to check what was matched in future (perhaps for debugging).
Example 1:
use strict;
use warnings;
my $string = 'John Doe John Done';
my $matches = () = $string =~ /John/g;
print $matches; #prints 2
Example 2:
use strict;
use warnings;
my $string = 'John Doe John Done';
my @matches = $string =~ /John/g;
print "@matches"; #prints John John
print scalar @matches; #prints 2
Edit:
if ( my @matches = $seq =~ /([TC]C[CT]GGAAGC)/g ) {
    print "$_\n" for @matches;
    print "Count of matches: " . scalar(@matches) . "\n";
}

As you have written the code, you have to count the matches yourself:
local $/ = ">";
my $count = 0;
#scan through fasta file
while (<FASTA>) {
chomp;
if ( $_ =~ /^(.*?)$(.*)$/ms) {
my $header = $1;
my $seq = $2;
$seq =~ s/\R//g; # \R removes line breaks
while ( $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
$count = $count +1;
}
}
}
print "Found $count matches\n";
should do the job.
HTH Georg

my @count = ($seq =~ /([TC]C[CT]GGAAGC)/g);
print scalar @count;
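Putting the answers above together, a minimal self-contained sketch of "list every match, then print the total" might look like this. The helper name find_motifs and the short test sequence are assumptions for illustration; for a real FASTA file you would add up the counts across records.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every non-overlapping match of the motif.
sub find_motifs {
    my ($seq) = @_;
    my @matches = $seq =~ /([TC]C[CT]GGAAGC)/g;
    return @matches;
}

# Made-up sequence containing three variants of the motif.
my $seq   = 'AACCTGGAAGCTTTCTGGAAGCGGTCCGGAAGC';
my @found = find_motifs($seq);

print "$_\n" for @found;
print scalar(@found), " matches found\n";
```

Collecting the matches in list context first avoids running the regex twice, and the array length gives the count for free.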

Remove matching words from the string using Perl

I want to remove the words Z or ZN and LVT from the strings in my file, but I couldn't get it to work. Can someone check my code?
Input
abchsfk/jshflka/ZN (cellLVT)
asjkfsa/sfklfkshfsf/Z (mobLVT)
asjhfdjkfd/sjfdskjfhdk/hsakfshf/Z (celLVT)
asjhdjs/jhskjds/ZN (abcLVT)
shdsjk/jhskd/ZN (xyzLVT)
Output
abchsfk/jshflka cell
asjkfsa/sfklfkshfsf mob
asjhfdjkfd/sjfdskjfhdk/hsakfshf cel
asjhdjs/jhskjds abc
shdsjk/jhskd xyz
CODE:
if ($line =~ /LVT/ && ($line =~ /ZN/ || $line =~ /Z/) )
#### matches the words LVT and ( Z or ZN)
{
my @names = split / /, $line; ##### splits the line
$names[2] =~ s/\/Z|/ZN//g; #### remove Z or ZN
$names[3] =~ s/\(|LVT\)//g ; #### remove LVT & braces
print OUT " $names[2] $names[3] \n"; #### print
}

The problem is the order of the alternatives in s/\/Z|\/ZN//g (the second backslash is missing in your code!). You should put the longer string first; otherwise /Z matches and the N is not deleted.
There's an even easier way, though: just use \/ZN?:
#!/usr/bin/perl
use warnings;
use strict;
while (my $line = <DATA>) {
if ($line =~ /LVT/ && $line =~ /ZN?/) {
my @names = split ' ', $line;
$names[0] =~ s/\/ZN?//g;
$names[1] =~ s/\(|LVT\)//g;
print "$names[0] $names[1]\n";
}
}
__DATA__
abchsfk/jshflka/ZN (cellLVT)
asjkfsa/sfklfkshfsf/Z (mobLVT)
asjhfdjkfd/sjfdskjfhdk/hsakfshf/Z (celLVT)
asjhdjs/jhskjds/ZN (abcLVT)
shdsjk/jhskd/ZN (xyzLVT)
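The effect of alternation order described above can be checked with a small sketch. The variable names are made up for illustration; the input is the first sample line's path.

```perl
use strict;
use warnings;

my $text = 'abchsfk/jshflka/ZN';

# Shorter alternative first: "/Z" matches and the "N" is left behind.
(my $short_first = $text) =~ s/\/Z|\/ZN//g;

# Longer alternative first: the whole "/ZN" is removed.
(my $long_first = $text) =~ s/\/ZN|\/Z//g;

# Optional "N": same result as longer-first, but more compact.
(my $optional_n = $text) =~ s/\/ZN?//g;

print "$short_first\n$long_first\n$optional_n\n";
```

Perl tries alternatives left to right and takes the first one that succeeds at a given position, which is why the shorter alternative "wins" when it is listed first.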

How to extract a part of string in Perl?

I have an input file with data like below:
X-X-D-X-X-A
X-D-X-A-X
D-X-X-X-X-A-X-X
I need the result to be only giving me
D-X-X-A
D-X-A
D-X-X-X-X-A
Please help me!!

You can try,
open my $fh, "<", "file" or die $!;
while (my $line = <$fh>) {
$line =~ s/^[-X]+ | [-X]+(?=\s*$)//xg;
print $line;
}
close $fh;
or from cmd line,
perl -pe 's/^[-X]+ | [-X]+(?=\s*$)//xg' file

Tested code:
my @a = qw[
X-X-D-X-X-A
X-D-X-A-X
D-X-X-X-X-A-X-X
];
for my $a (@a) {
if ($a =~ /(D[-X]+A)/) {
print $1,"\n";
}
}

perl -lne 'print $1 if(/[^D]*(D.*A).*/)'
test
inside a script while reading a file:
while (my $line = <$fh>) {
print "$1\n" if $line =~ m/[^D]*(D.*A).*/;
}
Regex:
[^D]*(D.*A).*
[^D]* skips any leading characters that are not D
(D.*A) captures (round brackets) everything from the first D to the last A in the line; the captured part is stored in $1
.* matches whatever is left of the line
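As a quick check of the capture-based approach, here is a minimal sketch run against the three sample lines from the question. The helper name extract_d_to_a is hypothetical, and it uses the shorter pattern (D.*A), which captures the same span as the longer regex above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text from the first D through the last A,
# or undef when the line contains no such span.
sub extract_d_to_a {
    my ($line) = @_;
    return $line =~ /(D.*A)/ ? $1 : undef;
}

for my $line (qw(X-X-D-X-X-A X-D-X-A-X D-X-X-X-X-A-X-X)) {
    my $part = extract_d_to_a($line);
    print "$part\n" if defined $part;
}
```

Testing the match before using $1 matters: after a failed match, $1 still holds whatever an earlier successful match captured.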

Perl - regexp not being replaced

I have an array containing words that I want to remove from each line of a file. The code I am using is as follows:
my $INFILE;
my $OUTFILE;
my $STOPLIST;
open($INFILE, '<', $ARGV[0]);
open($STOPLIST, '<', "stop.txt");
open($OUTFILE, '>', $ARGV[1]);
my @stoplist = <$STOPLIST>;
my $line;
my $stopword;
while (<$INFILE>) {
$line = $_;
$line =~ s/\[[0-9]*\] //g;
$line =~ s/i\/.*\/; //g;
foreach (@stoplist) {
$stopword = $_;
$line =~ s/${stopword}//g;
}
print $OUTFILE lc($line);
}
However, the words in the stoplist still appear in the text in the output file, which would indicate that the $line =~ s/${stopword}//g; line isn't doing its job as I expected.
How can I make sure that all words in the stop list that appear in the input text are replaced with 0 characters in the output?

You need to remove newlines from your stoplist using chomp:
my @stoplist = <$STOPLIST>;
chomp @stoplist;
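Here is a minimal sketch of why the newline matters. The helper name remove_stopwords and the sample data are made up; \Q...\E and \b are additions of mine, not part of the original code, so treat them as optional hardening.

```perl
use strict;
use warnings;

# Hypothetical helper: removes each stop word from $line.
# \Q...\E treats the word literally and \b avoids deleting it from inside
# longer words; both are assumptions beyond the original question.
sub remove_stopwords {
    my ($line, @stoplist) = @_;
    chomp @stoplist;    # strip the trailing newline left by reading a file
    for my $stopword (@stoplist) {
        next unless length $stopword;    # skip blank lines in the list
        $line =~ s/\b\Q$stopword\E\b//g;
    }
    return $line;
}

# Stop words as they would arrive from <$STOPLIST>, newlines included.
my @stoplist = ("the\n", "of\n");
print remove_stopwords("the end of the other story\n", @stoplist);
```

Without the chomp, each pattern would be "the\n" or "of\n", which only matches when the word happens to sit at the very end of a line.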

Match different variant of a word using regex Perl

I am splitting sentences at individual space characters and then matching the resulting terms against the keys of hashes. I only get a match when the terms are identical, and I am struggling to find a regex that matches variants of the same word. For example, the term 'antagon' matches 'antagon' exactly but fails to match antagonists, antagonistic, pre-antagonistic, hydro-antagonist and so on. I also need a regex that matches words like MCF-7 against MCF7 or MC-F7, ignoring the special characters.
This is the code that I have so far; the commented part is where I am struggling.
(Note: Terms in the hash are stemmed to root form of a word).
use warnings;
use strict;
use Drug;
use Stop;
open IN, "sample.txt" or die "cannot find sample";
open OUT, ">sample1.txt" or die "cannot find sample";
while (<IN>) {
chomp $_;
my $flag = 0;
my $line = lc $_;
my @full = ();
if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
my $string = $1;
chomp $string;
$string =~ s/,/ , /g;
$string =~ s/\./ \. /g;
$string =~ s/;/ ; /g;
$string =~ s/\(/ ( /g;
$string =~ s/\)/ )/g;
$string =~ s/\:/ : /g;
$string =~ s/\::/ :: )/g;
my @array = split / /, $string;
foreach my $word (@array) {
chomp $word;
if ( $word =~ /\,|\;|\.|\(|\)/g ) {
push( @full, $word );
}
if ( $Stop_words{$word} ) {
push( @full, $word );
}
if ( $Values{$word} ) {
my $term = "<Drug>$word<\/Drug>";
push( @full, $term );
}
else {
push( @full, $word );
}
}
# if($word=~/.*\Q$Values{$word}\E/i)#Changed this
# {
# $term="<Drug>$word</$Drug>";
# print $term,"\n";
# push(@full,$term);
# }
}
}
my $mod_str = join( " ", @full );
print OUT $mod_str, "\n";
}

I need a regex to match occurrences of words like MCF-7 with MCF7 or MC-F7
The most straightforward approach is just to strip out the hyphens, i.e.
my $ignore_these = "[-_']";
$word =~ s{$ignore_these}{}g;
I am not sure what is stored in your Values hash, so it's hard to tell what you expect to happen in
if($word=~/.*\Q$Values{$word}\E/i)
However, the kind of thing I imagine you want is this (I have simplified your code somewhat):
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use 5.10.0;
use Data::Dumper;
while (<>) {
chomp $_;
my $flag = 0;
my $line = lc $_;
my @full = ();
if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
my $string = $1;
chomp $string;
$string =~ s/([,\.;\(\)\:])/ $1 /g; # squished these together
$string =~ s/\:\:/ :: )/g; # typo in original
my @array = split /\s+/, $string; # split on one /or more/ spaces
foreach my $word (@array) {
chomp $word;
my $term=$word;
my $word_chars = "[\\w\\-_']";
my $word_part = "antagon";
if ($word =~ m{$word_chars*?$word_part$word_chars+}) {
$term="<Drug>$word</Drug>";
}
push(@full,$term); # push
}
}
my $mod_str = join( " ", @full );
say "<Sentence>$mod_str</Sentence>";
}
This gives me the following output, which is my best guess at what you expect:
$ cat tmp.txt
<Sentence>This in antagonizing the antagonist's antagonism pre-antagonistically.</Sentence>
$ cat tmp.txt | perl x.pl
<Sentence>this in <Drug>antagonizing</Drug> the <Drug>antagonist's</Drug> <Drug>antagonism</Drug> <Drug>pre-antagonistically</Drug> .</Sentence>
$
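For the MCF-7 / MCF7 / MC-F7 part of the question, one way to apply the "strip the hyphens" idea is to normalize both the hash keys and the words before looking them up. This is only a sketch under that assumption; normalize_term and %values are hypothetical names, not the asker's Drug module.

```perl
use strict;
use warnings;

# Hypothetical helper: lowercases a term and strips hyphens, underscores
# and apostrophes so that spelling variants compare equal.
sub normalize_term {
    my ($term) = @_;
    (my $norm = lc $term) =~ s/[-_']//g;
    return $norm;
}

# Hypothetical lookup table keyed by the normalized term.
my %values;
$values{ normalize_term('MCF-7') } = 1;

for my $word (qw(MCF7 MC-F7 mcf-7 MCF-10)) {
    my $hit = $values{ normalize_term($word) } ? 'match' : 'no match';
    print "$word: $hit\n";
}
```

Normalizing once when the hash is built keeps each lookup a plain hash access rather than a regex test against every key.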

perl -ne '$things{$1}++while s/([^ ;.,!?]*?antagon[^ ;.,!?]++)//;END{print "$_\n" for sort keys %things}' FILENAME
If the file contains the following:
he was an antagonist
antagonize is a verb
why are you antagonizing her?
this is an alpha-antagonist
This will return:
alpha-antagonist
antagonist
antagonize
antagonizing
Below is a regular (not one-liner) version:
#!/usr/bin/perl
use warnings;
use strict;
open my $in, "<", "sample.txt" or die "could not open sample.txt for reading!";
open my $out, ">", "sample1.txt" or die "could not open sample1.txt for writing!";
my %things;
while (<$in>){
$things{$1}++ while s/([^ ;.,!?]*?antagon[^ ;.,!?]++)//
}
print $out "$_\n" for sort keys %things;

You may want to take another look at the assumptions behind your approach. It sounds to me as if you are looking for words that are within a certain distance of the words in a list. Take a look at the Levenshtein distance to see if this is what you want. Be aware, however, that computing it with the standard dynamic-programming method takes time proportional to the product of the two string lengths, which adds up when every word is compared against a long list of terms.
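For reference, here is a minimal sketch of the Levenshtein distance using the standard dynamic-programming table, keeping only the previous row. The sub name levenshtein is my own; CPAN also has ready-made implementations, which I have not used here.

```perl
use strict;
use warnings;

# Levenshtein distance: minimum number of single-character insertions,
# deletions and substitutions needed to turn $s into $t.
# Runs in O(length($s) * length($t)) time.
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;

    # Row 0: distance from the empty prefix of $s to each prefix of $t.
    my @prev = (0 .. scalar @t);

    for my $i (1 .. scalar @s) {
        my @cur = ($i);    # distance from a prefix of $s to the empty string
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                      # substitute
            $best = $prev[$j] + 1    if $prev[$j] + 1 < $best;     # delete
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best;  # insert
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print levenshtein('antagonist', 'antagonism'), "\n";
```

A small distance threshold (for example, accepting words within 1 or 2 edits of a stem) is one possible way to use this, though it would not by itself match long suffixes such as "antagonistically".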
