I have an array containing words that I want to remove from each line of a file. The code I am using is as follows:
my $INFILE;
my $OUTFILE;
my $STOPLIST;
open($INFILE, '<', $ARGV[0]);
open($STOPLIST, '<', "stop.txt");
open($OUTFILE, '>', $ARGV[1]);
my @stoplist = <$STOPLIST>;
my $line;
my $stopword;
while (<$INFILE>) {
$line = $_;
$line =~ s/\[[0-9]*\] //g;
$line =~ s/i\/.*\/; //g;
foreach (@stoplist) {
$stopword = $_;
$line =~ s/${stopword}//g;
}
print $OUTFILE lc($line);
}
However, the words in the stop list still appear in the output file, which indicates that the $line =~ s/${stopword}//g; line isn't doing its job as I expected.
How can I make sure that all words in the stop list that appear in the input text are replaced with 0 characters in the output?
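You need to remove the newlines from your stop list using chomp. Each line read from stop.txt still ends in a newline, so the pattern only matches a stop word that is immediately followed by a line break:
my @stoplist = <$STOPLIST>;
chomp @stoplist;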
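A minimal sketch of how the chomped list could be applied in one pass, assuming the stop words are plain words rather than regex patterns. The helper name strip_stopwords and the sample data are made up for illustration. Each word is quoted with quotemeta and the alternation is wrapped in word boundaries, so only whole words are removed:

```perl
use strict;
use warnings;

# Hypothetical helper: remove every stop word from a line of text.
sub strip_stopwords {
    my ($line, @stoplist) = @_;
    chomp @stoplist;                        # drop the trailing newline from each word
    @stoplist = grep { length } @stoplist;  # ignore empty lines in the list
    return $line unless @stoplist;
    # Quote each word so regex metacharacters are taken literally, and
    # anchor on word boundaries so "the" does not eat part of "there".
    my $alternation = join '|', map { quotemeta } @stoplist;
    $line =~ s/\b(?:$alternation)\b//g;
    return $line;
}

# Sample data standing in for the lines read from stop.txt.
my @stoplist = ("the\n", "a\n", "of\n");
print strip_stopwords("the cat sat on a mat of straw\n", @stoplist);
```

The surrounding spaces are left in place, so you may want to squeeze repeated spaces afterwards.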
I am opening files in a directory that contain two lines of sequences in each file. The top sequence is longer than the bottom, but includes the bottom sequence. I would like to extend the bottom sequence by the two flanking letters in each direction once it is found in the top sequence. I am trying to do this with a regex match, but I am getting an uninitialized-value error for the $newsequence variable.
Here is what a typical file looks like:
>CCCCNNNNNCCCC
NNNNN
I would like to print to one file all the sequences in the following format:
>CCCCNNNNNCCCC
CCNNNNNCC
Here is my code so far:
use strict;
use warnings;
my ($directory) = @ARGV;
my @array = glob "$directory/*";
my $header;
my $sequence;
my $newsequence;
open(OUT, ">", "/path/to/out.txt") or die $!;
foreach my $file (@array){
open (my $fh, $file) or die $!;
while (my $line = <$fh>){
chomp $line;
if ($line =~ /^>/) {
$header = $line;
} elsif ($line =~ /^[CN]/) {
$sequence = $line;
}
my ($newsequence) = $header =~ /(([CN]{2})($sequence)([CN]{2}))/;
}
print OUT $header, "\n", $newsequence, "\n";
}
How can I improve my regex assignment to $newsequence to get adequate output? Thanks.
This line is wrong:
my ($newsequence) = $header =~ /(([CN]{2})($sequence)([CN]{2}))/;
The my keyword is creating a new variable $newsequence local to the while loop, not assigning the variable in the main script. So when you try to write $newsequence after the loop is done, the variable is still uninitialized.
Either put the print statement inside the while loop, or remove the my keyword in this assignment.
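A small sketch of that scoping behaviour, with a made-up sub name (scope_demo) and placeholder values; it only illustrates how an inner my hides the outer variable and is not taken from the question's script:

```perl
use strict;
use warnings;

# Hypothetical demo: the inner "my" creates a second, block-local variable.
sub scope_demo {
    my $newsequence = 'outer';            # declared once, outside the loop
    for my $pass (1 .. 3) {
        my $newsequence = "inner $pass";  # hides the outer variable until the block ends
    }
    return $newsequence;                  # still holds the outer value
}

print scope_demo(), "\n";
```

Dropping the my inside the loop would make the assignment update the outer variable instead.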
Also, you should put that assignment statement inside the elsif block. Otherwise, you'll try to use $sequence before you've assigned it. So the whole thing should look like:
foreach my $file (@array){
open (my $fh, $file) or die $!;
while (my $line = <$fh>){
chomp $line;
if ($line =~ /^>/) {
$header = $line;
} elsif ($line =~ /^[CN]/) {
$sequence = $line;
($newsequence) = $header =~ /(([CN]{2})($sequence)([CN]{2}))/;
print OUT $header, "\n", $newsequence, "\n";
}
}
}
If your conditions are accurate (each file contains only 2 lines, and the sequence is always found in the header), then you can make your code a lot simpler, including the regex:
for my $file (@array) {
open (my $fh, $file) or die $!;
chomp ((my $header, my $sequence) = <$fh>);
$header =~ /(..)$sequence(..)/;
print OUT "$header\n$1$sequence$2\n";
}
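If you cannot be sure the sequence always occurs in the header, it may be safer to check the match before printing. A rough sketch, assuming sequences contain only C and N as in the sample; the sub name extend_sequence is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return the sequence plus two flanking letters on
# each side, or undef when it is not found in the header.
sub extend_sequence {
    my ($header, $sequence) = @_;
    # \Q...\E keeps the sequence literal; [CN]{2} are the two flanking letters.
    return $header =~ /([CN]{2}\Q$sequence\E[CN]{2})/ ? $1 : undef;
}

my $extended = extend_sequence('>CCCCNNNNNCCCC', 'NNNNN');
print defined $extended ? "$extended\n" : "no match\n";
```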
I have a file eg.txt with contents of this sort :
....text...
....text...
COMP1 = ../../path1/path2/path3
COMP2 = ../../path4/path5/path6
and so on, for a large number of application names (the "COMP"s). I need to get the path -- the stuff including and after the second slash -- for a user-specified application.
This is the code I've been trying :
use strict;
use warnings;
my $line = "";
my $app = "";
print "Enter the app";
$app = <STDIN>;
print $app;
open my $fh, '<', "eg.txt" or die "Cannot open $!";
while (<$fh>) {
$line = <$fh>;
if ( $line && $line =~ /($app)( = )(..\/)(..)(.*)/ ) {
print $5;
}
}
This prints the name of the user-input application, and does nothing else. Any help would be greatly appreciated!
There are two main problems with your program:
The $app variable contains a newline at the end from the enter key you pressed when you typed it in. That will prevent the pattern from matching, so you need to use chomp to remove it. The same applies to lines read from your file.
The <$fh> in your while statement reads a line from your file into the default variable $_, and then $line = <$fh> reads another, so you are ignoring alternate lines from the file.
Here is a version of your program that I think should work although I am unable to test it at present. I have dropped your $line variable altogether and hope that doesn't confuse you. $_ is the default variable for the pattern match so it isn't mentioned explicitly anywhere
use strict;
use warnings;
print "Enter the app: ";
my $app = <STDIN>;
chomp $app;
open my $fh, '<', 'eg.txt' or die "Cannot open: $!";
while ( <$fh> ) {
if ( /$app\s*=\s*(.+)/ ) {
my $path = $1;
$path =~ s/.*\.\.//;
print $path, "\n";
}
}
The input did not match the regex because of the trailing newlines, so use chomp to trim them. Your while loop also reads from the file handle twice per iteration, which skips every other line. With those corrections this should work:
use strict;
use warnings;
my $line = "";
my $app = "";
print "Enter the app";
chomp($app = <STDIN>);
print "$app: ";
open my $fh, '<', "eg.txt" or die "Cannot open $!";
while($line = <$fh>)
{
chomp $line;
if($line && $line =~ /($app)( = )(..\/)(..)(.*)/)
{
print "$5 \n";
}
}
close($fh);
Try this code:
use strict;
use warnings;
my $line = "";
my $app = "";
print "Enter the app";
$app = <STDIN>;
print $app;
open my $fh, '<', "eg.txt" or die "Cannot open $!";
my @line = <$fh>;
my @fetch = map { /COMP\d+\s\=\s(\..\/\..\/.*)/g } @line;
$, = "\n";
print @fetch;
and then let me know whether it works for you.
You are accessing <$fh> twice in your loop. This will have the effect of interpreting only every other line. You might want to change the top of the loop to something like this:
while (defined(my $line = <$fh>)) {
and remove the my $line ... at the top of the program.
Also, you might want to consider chomping your input line so that you don't have to think about the trailing newline character:
while (defined(my $line = <$fh>)) {
chomp $line;
Your regular expression is also a bit dicey. You probably want to bind it to the beginning and end of the search space and escape the literal dots. You may also want $app to be interpreted as a string rather than a regexp, which can be done by wrapping it with \Q...\E. Also unless your file format specifies single spaces around the equals, I'd be tempted to make those flexible to zero or more occurrences. Also, if you aren't going to use the earlier captures, I would say don't do them, so:
if ($line && $line =~ /^\Q$app\E *= *\.\.\/\.\.(.*)$/)
{
print $1;
}
(Some may say you should use \A and \z rather than ^ and $. That choice is left as an exercise to the reader.)
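Putting those suggestions together, a lookup could be sketched roughly as follows. The sub name find_path and the in-memory sample lines are assumptions for illustration; a real script would read the lines from eg.txt instead:

```perl
use strict;
use warnings;

# Hypothetical helper: return the path for $app from a list of config lines.
sub find_path {
    my ($app, @lines) = @_;
    chomp $app;                       # user input arrives with a trailing newline
    for my $line (@lines) {
        chomp $line;
        # \Q...\E matches $app literally; the capture starts at the second slash.
        return $1 if $line =~ /^\Q$app\E\s*=\s*\.\.\/\.\.(\/.*)$/;
    }
    return;                           # no matching line
}

# Sample lines standing in for the contents of eg.txt.
my @lines = ("COMP1 = ../../path1/path2/path3\n", "COMP2 = ../../path4/path5/path6\n");
my $path = find_path("COMP2\n", @lines);
print defined $path ? "$path\n" : "not found\n";
```

Because the pattern is anchored at the start and requires the equals sign right after the name, COMP will not accidentally match COMP1.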
The string MODEL 1 appears 20 times in my file, interspersed with lines of text. I want to replace each occurrence with its running count: the first stays MODEL 1, the next becomes MODEL 2, then MODEL 3, and so on.
However, my loop gets stuck at the first value and never numbers the remaining occurrences.
Can anyone tell me what I am missing? Any help would be much appreciated.
The code is listed below:
#!/usr/bin/perl -w
my $file = 'test.text';
open (my $fh, $file);
while (my $row = <$fh>) {
chomp $row;
if (($row) =~ /^MODEL 1/){
$i = 1;
$row =~ s/^MODEL 1/MODEL $i/g;
$i++;
}
print "$row\n";
}
You need to move your counter variable outside of the loop.
As a simplification, use s///e to match and replace in a single step:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
my $file = 'test.text';
open my $fh, '<', $file;
my $counter = 0;
while (<$fh>) {
chomp;
s/^MODEL \K1/++$counter/e;
print "$_\n";
}
Move the $i = 1 initialization above the while loop:
$i = 1;
while (my $row = <$fh>) { chomp $row;
...
}
You're resetting it back to 1 for every line, so the number never changes.
You can increment a counter in the replacement pattern itself with the e modifier:
my $i=1;
while (my $row = <$fh>) {
chomp $row;
$row =~ s/MODEL \K1/$i++/ge;
print "$row\n";
}
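The same idea can be tried on a few in-memory lines without touching a file. This is only a sketch; the sub name renumber_models and the sample rows are made up, and \K needs Perl 5.10 or later:

```perl
use strict;
use warnings;

# Hypothetical helper: number each "MODEL 1" line in order of appearance.
sub renumber_models {
    my @rows = @_;
    my $counter = 0;                  # declared once, outside the loop
    for my $row (@rows) {
        # /e evaluates the replacement as Perl code, so each match
        # receives the next counter value; \K keeps "MODEL " untouched.
        $row =~ s/^MODEL \K1\b/++$counter/e;
    }
    return @rows;
}

print "$_\n" for renumber_models('MODEL 1', 'ATOM x', 'MODEL 1', 'MODEL 1');
```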
I have an input file with data like below:
X-X-D-X-X-A
X-D-X-A-X
D-X-X-X-X-A-X-X
I need the result to be only giving me
D-X-X-A
D-X-A
D-X-X-X-X-A
Please help me!!
You can try,
open my $fh, "<", "file" or die $!;
while (my $line = <$fh>) {
$line =~ s/^[-X]+ | [-X]+(?=\s*$)//xg;
print $line;
}
close $fh;
or from cmd line,
perl -pe 's/^[-X]+ | [-X]+(?=\s*$)//xg' file
Tested code:
my @a = qw[
X-X-D-X-X-A
X-D-X-A-X
D-X-X-X-X-A-X-X
];
for my $a (@a) {
if ($a =~ /(D[-X]+A)/) {
print $1,"\n";
}
}
perl -lne 'print $1 if(/[^D]*(D.*A).*/)'
test
inside a script while reading a file:
while (my $line = <$fh>) {
$line =~ m/[^D]*(D.*A).*/g;
print $1;
}
Regex:
[^D]*(D.*A).*
[^D]* skips any characters before the first D (the line need not start with D).
(D.*A) captures (round brackets) everything from the first D to the last A in the line; the captured part is stored in $1.
.* matches whatever is left of the line.
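For a quick check against the sample data, the capture can be wrapped in a small sub. This is a sketch with a made-up name (trim_to_da), and it uses the stricter D[-X]*A form so that only X and - may appear between the D and the A:

```perl
use strict;
use warnings;

# Hypothetical helper: return the D...A stretch of a line, or undef if absent.
sub trim_to_da {
    my ($line) = @_;
    return $line =~ /(D[-X]*A)/ ? $1 : undef;
}

for my $line (qw(X-X-D-X-X-A X-D-X-A-X D-X-X-X-X-A-X-X)) {
    my $trimmed = trim_to_da($line);
    print defined $trimmed ? "$trimmed\n" : "no match\n";
}
```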
I am splitting sentences at individual space characters and then matching the resulting terms against the keys of hashes. I only get a match when a term is identical to a key, and I am struggling to find a regex that matches the different forms of the same word. For example, the term 'antagon' matches 'antagon' exactly but fails to match antagonists, antagonistic, pre-antagonistic, hydro-antagonist and so on. I also need a regex that matches variants such as MCF-7, MCF7 and MC-F7 by ignoring special characters.
This is the code that I have so far; the commented part is where I am struggling.
(Note: Terms in the hash are stemmed to root form of a word).
use warnings;
use strict;
use Drug;
use Stop;
open IN, "sample.txt" or die "cannot find sample";
open OUT, ">sample1.txt" or die "cannot find sample";
while (<IN>) {
chomp $_;
my $flag = 0;
my $line = lc $_;
my @full = ();
if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
my $string = $1;
chomp $string;
$string =~ s/,/ , /g;
$string =~ s/\./ \. /g;
$string =~ s/;/ ; /g;
$string =~ s/\(/ ( /g;
$string =~ s/\)/ )/g;
$string =~ s/\:/ : /g;
$string =~ s/\::/ :: )/g;
my @array = split / /, $string;
foreach my $word (@array) {
chomp $word;
if ( $word =~ /\,|\;|\.|\(|\)/g ) {
push( @full, $word );
}
if ( $Stop_words{$word} ) {
push( @full, $word );
}
if ( $Values{$word} ) {
my $term = "<Drug>$word<\/Drug>";
push( @full, $term );
}
else {
push( @full, $word );
}
# if($word=~/.*\Q$Values{$word}\E/i)#Changed this
# {
# $term="<Drug>$word</$Drug>";
# print $term,"\n";
# push(@full,$term);
# }
}
}
my $mod_str = join( " ", @full );
print OUT $mod_str, "\n";
}
I need a regex to match occurrences of words like MCF-7 with MCF7 or MC-F7
The most straightforward approach is just to strip out the hyphens, i.e.
my $ignore_these = "[-_']";
$word =~ s{$ignore_these}{}g;
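As a rough sketch of how that could feed a hash lookup: normalise both the hash keys and each incoming word the same way before comparing. The sub name normalize_term and the %drugs hash below are made up for illustration and are not the asker's %Values:

```perl
use strict;
use warnings;

# Hypothetical helper: lower-case a term and drop hyphens, underscores
# and apostrophes so spelling variants collapse to one key.
sub normalize_term {
    my ($word) = @_;
    (my $key = lc $word) =~ s/[-_']//g;
    return $key;
}

# Made-up lookup table keyed on the normalised form.
my %drugs = ( normalize_term('MCF-7') => 1 );

for my $word (qw(MCF-7 MCF7 MC-F7 HeLa)) {
    my $hit = $drugs{ normalize_term($word) } ? 'match' : 'no match';
    print "$word: $hit\n";
}
```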
I am not sure what is stored in your Values hash, so it's hard to tell what you expect to happen with
if($word=~/.*\Q$Values{$word}\E/i)
However, the kind of thing I imagine you want is this (I have simplified your code somewhat):
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use 5.10.0;
use Data::Dumper;
while (<>) {
chomp $_;
my $flag = 0;
my $line = lc $_;
my @full = ();
if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
my $string = $1;
chomp $string;
$string =~ s/([,\.;\(\)\:])/ $1 /g; # squished these together
$string =~ s/\:\:/ :: )/g; # typo in original
my @array = split /\s+/, $string; # split on one /or more/ spaces
foreach my $word (@array) {
chomp $word;
my $term=$word;
my $word_chars = "[\\w\\-_']";
my $word_part = "antagon";
if ($word =~ m{$word_chars*?$word_part$word_chars+}) {
$term="<Drug>$word</Drug>";
}
push(@full,$term); # push
}
}
my $mod_str = join( " ", @full );
say "<Sentence>$mod_str</Sentence>";
}
This gives me the following output, which is my best guess at what you expect:
$ cat tmp.txt
<Sentence>This in antagonizing the antagonist's antagonism pre-antagonistically.</Sentence>
$ cat tmp.txt | perl x.pl
<Sentence>this in <Drug>antagonizing</Drug> the <Drug>antagonist's</Drug> <Drug>antagonism</Drug> <Drug>pre-antagonistically</Drug> .</Sentence>
$
perl -ne '$things{$1}++while s/([^\s;.,!?]*?antagon[^\s;.,!?]++)//;END{print "$_\n" for sort keys %things}' FILENAME
If the file contains the following:
he was an antagonist
antagonize is a verb
why are you antagonizing her?
this is an alpha-antagonist
This will return:
alpha-antagonist
antagonist
antagonize
antagonizing
Below is a regular (not one-liner) version:
#!/usr/bin/perl
use warnings;
use strict;
open my $in, "<", "sample.txt" or die "could not open sample.txt for reading!";
open my $out, ">", "sample1.txt" or die "could not open sample1.txt for writing!";
my %things;
while (<$in>){
$things{$1}++ while s/([^\s;.,!?]*?antagon[^\s;.,!?]++)//;
}
print $out "$_\n" for sort keys %things;
You may want to take another look at the assumptions behind your approach. It sounds to me as if you are looking for words that are within a certain distance of a list of words. Take a look at the Levenshtein distance to see if this is what you want. Be aware, however, that computing it for every word against every key can be slow: the standard dynamic-programming algorithm takes time proportional to the product of the two word lengths for each comparison.
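For reference, a minimal sketch of the Levenshtein distance using the standard dynamic-programming recurrence and keeping only two rows of the table. The sub name levenshtein is my own; it relies on min from the core List::Util module, and CPAN modules that compute the same thing exist if you would rather not roll your own:

```perl
use strict;
use warnings;
use List::Util qw(min);

# Edit distance between two strings (insert, delete, substitute each cost 1).
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);          # distances from the empty prefix of $s
    for my $i (1 .. scalar @s) {
        my @curr = ($i);                  # distance from the empty prefix of $t
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,            # deletion
                $curr[$j - 1] + 1,        # insertion
                $prev[$j - 1] + $cost,    # substitution or match
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

print levenshtein('antagon', 'antagonist'), "\n";
```

A small distance relative to the word length could then be treated as a match, although the threshold is something you would have to tune for your data.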