Search for a pattern in a file - regex

I have a file eg.txt with contents of this sort :
....text...
....text...
COMP1 = ../../path1/path2/path3
COMP2 = ../../path4/path5/path6
and so on, for a large number of application names (the "COMP"s). I need to get the path -- the stuff including and after the second slash -- for a user-specified application.
This is the code I've been trying :
use strict;
use warnings;
my $line = "";
my $app = "";
print "Enter the app";
$app = <STDIN>;
print $app;
open my $fh, '<', "eg.txt" or die "Cannot open $!";
while (<$fh>) {
$line = <$fh>;
if ( $line && $line =~ /($app)( = )(..\/)(..)(.*)/ ) {
print $5;
}
}
This prints the name of the user-input application, and does nothing else. Any help would be greatly appreciated!

There are two main problems with your program
The $app variable contains a newline at the end from the enter key you pressed when you typed it in. That will prevent the pattern from matching so you need to use chomp to remove it. The same applies to lines read from your file
The <$fh> in your while statement reads a line from your file into the default variable $_, and then $line = <$fh> reads another, so you are ignoring alternate lines from the file
Here is a version of your program that I think should work although I am unable to test it at present. I have dropped your $line variable altogether and hope that doesn't confuse you. $_ is the default variable for the pattern match so it isn't mentioned explicitly anywhere
use strict;
use warnings;
print "Enter the app: ";
my $app = <STDIN>;
chomp $app;
open my $fh, '<', 'eg.txt' or die "Cannot open: $!";
while ( <$fh> ) {
if ( /$app\s*=\s*(.+)/ ) {
my $path = $1;
$path =~ s/.*\.\.//;
print $path, "\n";
}
}

The input did not matched in regex because newlines were coming along with them, so better use chomp to trim them. In while loop you are displacing two times the file handle, I don't know why. So after corrections this should work:
use strict;
use warnings;
my $line = "";
my $app = "";
print "Enter the app";
chomp($app = <STDIN>);
print "$app: ";
open my $fh, '<', "eg.txt" or die "Cannot open $!";
while($line = <$fh>)
{
chomp $line;
if($line && $line =~ /($app)( = )(..\/)(..)(.*)/)
{
print "$5 \n";
}
}
close($fh);

Try this code:
use strict;
use warnings;
my $line = "";
my $app = "";
print "Enter the app";
$app = <STDIN>;
print $app;
open my $fh, '<', "eg.txt" or die "Cannot open $!";
my #line = <$fh>;
my #fetch = map { /COMP\d+\s\=\s(\..\/\..\/.*)/g } #line ;
$, = "\n";
print #fetch;
and then please send your response.

You are accessing <$fh> twice in your loop. This will have the effect of interpreting only every other line. You might want to change the top of the loop to something like this:
while (defined(my $line = <$fh>)) {
and remove the my $line ... at the top of the program.
Also, you might want to consider chomping your input line so that you don't have to think about the trailing newline character:
while (defined(my $line = <$fh>)) {
chomp $line;
Your regular expression is also a bit dicey. You probably want to bind it to the beginning and end of the search space and escape the literal dots. You may also want $app to be interpreted as a string rather than a regexp, which can be done by wrapping it with \Q...\E. Also unless your file format specifies single spaces around the equals, I'd be tempted to make those flexible to zero or more occurrences. Also, if you aren't going to use the earlier captures, I would say don't do them, so:
if ($line && $line =~ /^\Q$app\E *= *\.\.\/\.\.(.*)$/)
{
print $1;
(Some may say you should use \A and \z rather than ^ and $. That choice is left as an exercise to the reader.)

Related

Assigning regex search to variable: Uninitialized variable error

I am opening files in a directory that contain two lines of sequences in each file. The top sequence is longer than the bottom, but includes the bottom sequence. I would like to extend the bottom sequence by the two flanking letters in each direction once it is found in the top sequence. I am trying this by a doing a regex match, but am getting a uninitialized error for the $newsequence variable.
Here is what a typical file looks like:
>CCCCNNNNNCCCC
NNNNN
I would like to print to one file all the sequences in the following format:
>CCCCNNNNNCCCC
CCNNNNNCC
Here is my code so far:
use strict;
use warnings;
my ($directory) = #ARGV
my #array = glob "$directory/*";
my $header;
my $sequence;
my $newsequence;
open(OUT, ">", "/path/to/out.txt") or die $!;
foreach my $file (#array){
open (my $fh, $file) or die $!;
while (my $line = <$fh>){
chomp $line;
if ($line =~ /^>/) {
$header = $line;
} elsif ($line =~ /^[CN]/) {
$sequence = $line;
}
my ($newsequence) = $header =~ /(([CN]{2})($sequence)([CN]{2}))/;
}
print OUT $header, "\n", $newsequence, "\n";
}
How can I improve my regex assignment to $newsequence to get adequate output? Thanks.
This line is wrong:
my ($newsequence) = $header =~ /(([CN]{2})($sequence)([CN]{2}))/;
The my keyword is creating a new variable $newsequence local to the while loop, not assigning the variable in the main script. So when you try to write $newsequence after the loop is done, the variable is still uninitialized.
Either put the print statement inside the while loop, or remove the my keyword in this assignment.
Also, you should put that assignment statement inside the elseif block. Otherwise, you'll try to use $sequence before you've assigned it. So the whole thing should look like:
foreach my $file (#array){
open (my $fh, $file) or die $!;
while (my $line = <$fh>){
chomp $line;
if ($line =~ /^>/) {
$header = $line;
} elsif ($line =~ /^[CN]/) {
$sequence = $line;
($newsequence) = $header =~ /(([CN]{2})($sequence)([CN]{2}))/;
print OUT $header, "\n", $newsequence, "\n";
}
}
}
If your conditions are accurate (each file contains only 2 lines, and the sequence is always found in the header), then you can make your code a lot simpler, including the regex:
for my $file (#array) {
open (my $fh, $file) or die $!;
chomp ((my $header, my $sequence) = <$fh>);
$header =~ /(..)$sequence(..)/;
print OUT "$header\n$1$sequence$2";
}

Find and get the location of a list of words in a text

I'm sure this is simple but I just can't figure out what to do...
I have a text file with a bunch of words in it (let's call it "wordlist") organized in a single column. Then I have a big text file (let's call it "essay"). What I want to do is to look in the "essay" file for the words in my "wordlist".
The trick is that I want to know the position of the matched word in the "essay" (meaning, match found after X characters).
I'm actually able to do it when I look for a single word (so wordlist containing just 1 word) but I can't get it to work when working with a list of words...
Any advice ?
thanks a lot
Ok so I just realized it would just tell me "no match found" anyway...Here is the code
use strict;
use warnings;
open (my $wordlist, "<", "/wordlist.txt")
or die "cannot open < wordlist.txt $!";
open (my $essay, "<", "/essay.txt")
or die "cannot open < essay.txt $!";
while (<$essay>) { print "match found\n" if ($essay =~ m/$wordlist/) ; }
{ print "no match found\n" if ($essay !~ m/$wordlist/) ; }
Help please...?
perl index function basically matches substring which does not ensure the match of a full string. A regular expression based match is more useful here imho.
Explanation:
Read whole text of essay in a string. => $essay
For each word from wordlist.txt => $_
-- Keep matching $_ within $essay with proper regex. The one used here is b$_\b
-- For each match, collect the value of #-[0]
\b: is the word boundary character here which ensures that it only matches with complete words not substrings.
#-: is a special variable that contains the start position of the last regex match.
Here is a sample code:
use strict;
use warnings;
use 5.010;
my $wordlist_file = 'wordlist.txt';
open my $wordlist_fh, '<', $wordlist_file or die "Failed to open '$wordlist_file': $!";
my %pos;
my $essay_file = 'essay.txt';
my $essay = do {
local $/ = undef;
open my $fh, "<", $essay_file
or die "could not open $essay_file: $!";
<$fh>;
};
while (<$wordlist_fh>) {
chomp;
$pos{$_} = [] unless $pos{$_};
while($essay =~ m/\b$_\b/g){
push #{$pos{$_}}, #-;
}
}
use Data::Dumper;
print Dumper(\%pos);
the wordlist file and essay files are similar as mentioned by ThisSuitIsBlackNot.
wordlist.txt
I
Perl
hacker
essay.txt
I want to be just another Perl hacker when I grow up
I want to be just another Perl hacker when I grow up
The %pos hash now contains all the positions of your each word. I just showed them through dumper
$VAR1 = {
'hacker' => [
'31',
'84'
],
'Perl' => [
'26',
'79'
],
'I' => [
'0',
'43',
'53',
'96'
]
};
Note that the counts are including the newline characters at the end of each line.
Maybe you can use index() function.
Here is the link: Using the Perl index() function
This is my sample. The performance may be not too well. Hope it helps~:)
open (my $wordlist, "<", "files/wordlist.txt")
or die "cannot open < wordlist.txt $!";
open (my $essay, "<", "files/essay.txt")
or die "cannot open < essay.txt $!";
my $words = {};
while (<$wordlist>) {
chomp($_);
$words->{$_} = 1;
}
my $row_count = 0;
while (<$essay>) {
$row_count++;
chomp($_);
foreach my $word (keys %{$words}) {
my $offset = 0;
my $r = index($_, $word, $offset);
while ($r != -1) {
print "Found [$word] in line $row_count at $r\n";
$offset = $r + 1;
$r = index($_, $word, $offset);
}
}
}
In your code, $essay and $wordlist are both filehandles. When you say
print "match found\n" if ($essay =~ m/$wordlist/);
You're trying to match the stringification of one filehandle to the stringification of another filehandle. When a filehandle is stringified, it looks something like this:
GLOB(0x9a26c38)
So your code actually does something like:
print "match found\n" if ('GLOB(0x9a26c38)' =~ m/GLOB(0x94bbc38)/);
This is not what you want. You need to read the contents of your files and compare those, not the filehandles themselves.
Essay words each on their own line
The following code assumes that your "essay" consists of one word per line. We read the contents of the essay file into a hash of arrays, with the lines as keys and an array of positions as values. We use an array in case the same word appears multiple times in the file. The position of the first word is zero. We then loop through the word list file, printing the word and the first matching position, if there is one.
use strict;
use warnings;
use 5.010;
my $essay_file = 'files/essay.txt';
open my $essay_fh, '<', $essay_file or die "Failed to open '$essay_file': $!";
my $pos = 0;
my %essay;
while (<$essay_fh>) {
chomp;
push #{ $essay{$_} }, $pos;
$pos += length $_;
}
my $wordlist_file = 'files/wordlist.txt';
open my $wordlist_fh, '<', $wordlist_file or die "Failed to open '$wordlist_file': $!";
while (<$wordlist_fh>) {
chomp;
say "$_: $essay{$_}[0]" if exists $essay{$_};
}
essay.txt
I
want
to
be
just
another
Perl
hacker
when
I
grow
up
wordlist.txt
I
Perl
hacker
Output
I: 0
Perl: 20
hacker: 24
Note that I'm ignoring newline characters when computing the position values. You can adjust this as necessary.
Essay words more than one per line
If your essay file can have more than one word per line, we can use a regex to check for matches:
use strict;
use warnings;
use 5.010;
# Slurp entire essay file into a variable
my $essay = do {
local $/;
my $essay_file = 'files/essay.txt';
open my $essay_fh, '<', $essay_file or die "Failed to open '$essay_file': $!";
<$essay_fh>;
};
my $wordlist_file = 'files/wordlist.txt';
open my $wordlist_fh, '<', $wordlist_file or die "Failed to open '$wordlist_file': $!";
while (<$wordlist_fh>) {
chomp;
say "$_: ", pos($essay) - length($_) if $essay =~ /\b$_\b/g;
}
essay.txt
I want to be just another Perl hacker when I grow up
wordlist.txt
I
Perl
hacker
hack
Output
I: 0
Perl: 26
hacker: 31
Note that the results are a little bit different from our other program, because now there are spaces between words. Also note that there is no output for the word hack, since we're only checking for whole word matches.

Recurrent replacement using Perl

I have a word (MODEL 1) in my file 20 times interspersed by lines of text. I want to replace it with the frequency number of occurrence e.g. MODEL 1 and then when it occurs again then MODEL 2 and then MODEL 3 and so on so forth.
However my loop gets stuck at the first round and not looping till it has replaced all of words.
Can any one tell me what I have been missing out. Any help would be much appreciated.
The code is listed below:
#!/usr/bin/perl -w
my $file = 'test.text';
open (my $fh, $file);
while (my $row = <$fh>) {
chomp $row;
if (($row) =~ /^MODEL 1/){
$i = 1;
$row =~ s/^MODEL 1/MODEL $i/g;
$i++;
}
print "$row\n";
}
You need to move your counter variable outside of the loop.
As a simplification, use s///e to match and replace in a single step:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
my $file = 'test.text';
open my $fh, '<', $file;
my $counter = 0;
while (<$fh>) {
chomp;
s/^MODEL \K1/++$counter/e;
print "$_\n";
}
Move $i = 1 initialization to above while loop
$i = 1;
while (my $row = <$fh>) { chomp $row;
...
}
You're resetting it back to for every line, so there won't be any change at all.
You can increment a counter in the replacement pattern itself with the e modifier:
my $i=1;
while (my $row = <$fh>) {
chomp $row;
$row =~ s/MODEL \K1/$i++/ge
print "$row\n";
}

Perl extract between start and end

I am aiming to extract a string from start to an end word, (dIonly is start and should be the end workset [including these parenthesis]; furthermore I would like to print the output into a file named report.
I have had problems with lookbehind, as the variable length was not implemented.
Now I reversed the string, to do lookahead. However, something is still not working.
I need to start from dIonly which means I have to reverse the string to circumvent the problem described above, as there are many workset(( in the whole string, which means I can't start from there...
Thank you! I edited the script now. What I need to do is reverse the string. I did that by splitting the string with a space as delimiter into a list, then reversed it, and put it into a string again. Just to split it into a list again at the delimiter 'solution' as my output will have several strings of which I want to extract dIonly to workset (this only works once the string is reversed as otherwise I would encouter worksets that I do not want and extract a different string, as dIonly is a distinct part of the pattern of the solution from which I can work forward to the second workset (which itself is the first workset with 2 parenthesis). Then I want to print it to a new output file. Any suggestions welcome!
This is a sample of the data:
... denotes that it continues after
..... maxRiskC(cA, 3)) c workset((RiskCA(cA, 3), RiskCB(cB, 2), maxRiskC(cA, 3))) c RiskCA(cA, 3) c RiskCB(cB, 2)) ***********
equation (built-in equation for symbol <=) 6 <= 40 ---> true
Solution 4 (state 31) states: 40 rewrites: 8421 in 5357394502ms cpu
(1464ms real) (0) rewrites/second) G:Game --> workset(empty) c playA
c dIonly c
.....
#!/usr/bin/perl
# perl -d ./perl_debugger.pl
use strict;
use Data::Dumper qw(Dumper);
use File::Slurp;
my #a_linesorig;
my #a_out;
my #a_str;
my $line;
my $reversedline;
my #a_linesrev;
my #reversedarray;
my $reversedline;
my $str;
open(my $fh, "<", "data.txt")
or die "cannot open < data.txt: $!";
my $line = read_file('data.txt');
#a_linesorig = split(' ', $line);
#a_linesrev = reverse(#a_linesorig);
$reversedline = join(' ', #a_linesrev); # joins the reversed list to a single string again
#reversedarray = split( /solution/, $reversedline ); # should split huge string into a list from one solution to next
foreach $str (#reversedarray) {
if ($str =~ /\bdIonly:\b(.*?)\bworkset\b/g);
print Dumper \$str;
print (#a_out, "$str");
}
close $fh
or die "can't close file: $!";
open(my $fh, ">", "output.txt")
or die "cannot open > output.txt: $!";
foreach $str (#a_out)
{
print ($fh "$str\n");
}
close $fh
or die "can't close file: $!";
Take off the reverse, it will reverse letters also and not individual words, for that scalar.
You can try it with a greedy match since you are only interested in the last workset:
while (my $line = <$input>) {
chomp $line;
if ($line =~ /.*workset(.*dIonly)/) {
# do something with results
say $fh "'$1'";
}
}
And if you need to reverse before writing to the file, you can do:
while (my $line = <$input>) {
chomp $line;
if ($line =~ /.*workset(.*dIonly)/) {
say $fh join " ",reverse (split / /,$1);
}
}

How to match exactly two empty lines

I have a question about regular expressions. I have a file and I need to parse it in such a way that I could distinguish some specific blocks of text in it. These blocks of text are separated by two empty lines (there are blocks which are separated by 3 or 1 empty lines but I need exactly 2). So I have a piece of code and this is \s*$^\s*$/ regular expression I think should match, but it does not.
What is wrong?
$filename="yu";
open($in,$filename);
open(OUT,">>out.text");
while($str=<$in>)
{
unless($str = /^\s*$^\s*$/){
print "yes";
print OUT $str;
}
}
close($in);
close(OUT);
Cheers,
Yuliya
By default, Perl reads files a line at a time, so you won't see multiple new lines. The following code selects text terminated by a double new line.
local $/ = "\n\n" ;
while (<> ) {
print "-- found $_" ;
}
New Answer
After having problems excluding >2 empty lines, and a good nights sleep here is a better method that doesn't even need to slurp.
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'yu';
my #blocks; #each element will be an arrayref, one per block
#that referenced array will hold lines in that block
open(my $fh, '<', $file);
my $empty = 0;
my $block_num = 0;
while (my $line = <$fh>) {
chomp($line);
if ($line =~ /^\s*$/) {
$empty++;
} elsif ($empty == 2) { #not blank and exactly 2 previous blanks
$block_num++; # move on to next block
$empty = 0;
} else {
$empty = 0;
}
push #{ $blocks[$block_num] }, $line;
}
#write out each block to a new file
my $file_num = 1;
foreach my $block (#blocks) {
open(my $out, '>', $file_num++ . ".txt");
print $out join("\n", #$block);
}
In fact rather than store and write later, you could simply write to one file per block as you go:
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'yu';
open(my $fh, '<', $file);
my $empty = 0;
my $block_num = 1;
open(OUT, '>', $block_num . '.txt');
while (my $line = <$fh>) {
chomp($line);
if ($line =~ /^\s*$/) {
$empty++;
} elsif ($empty == 2) { #not blank and exactly 2 previous blanks
close(OUT); #just learned this line isn't necessary, perldoc -f close
open(OUT, '>', ++$block_num . '.txt');
$empty = 0;
} else {
$empty = 0;
}
print OUT "$line\n";
}
close(OUT);
use 5.012;
open my $fh,'<','1.txt';
#slurping file
local $/;
my $content = <$fh>;
close $fh;
for my $block ( split /(?<!\n)\n\n\n(?!\n)/,$content ) {
say 'found:';
say $block;
}
Deprecated in favor of new answer
justintime's answer works by telling perl that you want to call the end of a line "\n\n", which is clever and will work well. One exception is that this must match exactly. By the regex you are using it makes it seem like there might be whitespace on the "empty" lines, in which case this will not work. Also his method will split even on more than 2 linebreaks, which was not allowed in the OP.
For completeness, to do it the way you were asking, you need to slurp the whole file into a variable (if the file is not so large as to use all your memory, probably fine in most cases).
I would then probably say to use the split function to split the block of text into an array of chunks. Your code would then look something like:
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'yu';
my $text;
open(my $fh, '<', $file);
{
local $/; enables slurp mode inside this block
$text = <$fh>;
}
close($fh);
my #blocks = split(
/
(?<!\n)\n #check to make sure there isn't another \n behind this one
\s*\n #first whitespace only line
\s*\n #second "
(?!\n) #check to make sure there isn't another \n after this one
/x, # x flag allows comments and whitespace in regex
$text
);
You can then do operations on the array. If I understand your comment to justintime's answer, you want to write each block out to a different file. That would look something like
my $file_num = 1;
foreach my $block (#blocks) {
open(my $out, '>', $file_num++ . ".txt");
print $out $block;
}
Notice that since you open $out lexically (with my) when it reaches the end of the foreach block, the $out variable dies (i.e. "goes out of scope"). When this happens to a lexical filehandle, the file is automatically closed. And you can do a similar thing to that with justintime's method as well:
local $/ = "\n\n" ;
my $file_num = 1;
while (<>) {
open(my $out, '>', $file_num++ . ".txt");
print $out $block;
}