Printing first instance of match in each line of file (Perl)

I have the following in an executable .pl file:
#!/usr/bin/env perl
$file = 'TfbG_peaks.txt';
open(INFO, $file) or die("Could not open file.");
foreach $line (<INFO>) {
if ($line =~ m/[^_]*(?=_)/){
#print $line; #this prints lines, which means there are matches
print $1; #but this prints nothing
}
}
Based on my reading at http://goo.gl/YlEN7 and http://goo.gl/VlwKe, print $1; should print the first match in each line, but it doesn't. Help!

No, $1 holds the string saved by the first capture group (created by the bracketing construct ( ... )). For example:
if ($line =~ m/([^_]*)(?=_)/){
print $1;
# now this will print something,
# unless the string begins with an underscore
# (which still matches the pattern, as * is read as 'zero or more instances')
# are you sure you don't need `+` here?
}
The pattern in your original code didn't have any capture groups; that's why $1 was empty (undef, to be precise) there. And (?=...) doesn't count: it is a look-ahead assertion, not a capture group.

$1 contains what the first capture group ((...)) in the pattern captured.
Maybe you were thinking of
print $& if $line =~ /[^_]*(?=_)/; # BAD
or
print ${^MATCH} if $line =~ /[^_]*(?=_)/p; # 5.10+
But the following would be simpler (and work before 5.10):
print $1 if $line =~ /([^_]*)_/;
Note: You'll get a performance boost when the pattern doesn't match if you add a leading ^ or (?:^|_) (whichever is appropriate).
print $1 if $line =~ /^([^_]*)_/;
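Putting the pieces together, here is a minimal sketch of the whole loop. The sample lines are hypothetical stand-ins for the contents of TfbG_peaks.txt, and the pattern uses + rather than * (as suggested in the first answer) so that a line starting with an underscore is skipped instead of yielding an empty string.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for TfbG_peaks.txt.
my @lines = ("chr1_100_200\n", "no underscore here\n", "_leading\n");

my @prefixes;
for my $line (@lines) {
    # The parentheses create capture group 1; the anchored pattern
    # grabs everything before the first underscore on the line.
    if ($line =~ /^([^_]+)_/) {
        push @prefixes, $1;
    }
}

print "$_\n" for @prefixes;
```

Only the first line produces output here: the second has no underscore, and the third has nothing in front of its underscore.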

Related

splitting the string with regex - print $2 twice

I have the code:
#!/bin/env perl
use warnings;
$line = 'GFHFHDSH4567 FHGFF 687686';
print $line, "\n";
$line =~ s/ +/ /g;
$line =~ s/^(.+)(?=\s+(\d+)$)/1:$1\t2:$2/g;
print $line, "\n";
I get:
GFHFHDSH4567 FHGFF 687686
1:GFHFHDSH4567 FHGFF 2:687686 687686
Why is the number printed twice?

Your regex includes a look-ahead assertion. What the assertion matches does not form part of the final matched string, so only the first part of $line, 'GFHFHDSH4567 FHGFF', is replaced with the replacement string "1:GFHFHDSH4567 FHGFF\t2:687686". The original ' 687686' is left unchanged after it, which is why the number appears twice.
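One possible fix, sketched under the assumption that the trailing number should be consumed by the substitution: match the whitespace and digits directly instead of only looking ahead at them. The variable $fixed is introduced here for illustration only.

```perl
use strict;
use warnings;

my $line = 'GFHFHDSH4567 FHGFF 687686';

# The trailing number is now part of the match itself (no look-ahead),
# so the substitution replaces it along with the rest of the line.
(my $fixed = $line) =~ s/^(.+)\s+(\d+)$/1:$1\t2:$2/;

print "$fixed\n";
```

The /g modifier is dropped because a pattern anchored with ^ and $ can match only once in a single-line string.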

Why is Perl not printing all regex matches in a multi-line regex?

I have this text (shortened version of my original text):
mytext.txt
BAHJSBUBGUCYHAGSBUCAGSUCBASBCYHUBXZCZPZHCUIHAUISHCIUJXZJCBZYAUSGHDYUAGWEBWHBHJASBHJASCXZBUYTRTRTRJFUARGAFGOOPWWKBBCAAAABBXHABSDAUSBCZAAAAAAAAACGAFAXHJBJHXZCXZCCZCXZUCAGSUCBASBCYHUBXZCZPZHCUIHAUISHCIUJXZJCBZYAUSGHDYUAGWEBWHBHJASBHJASCXZBUYHABSDAUSZXHJBRRRRRRJFUABGAFGLLPKWAACAAAABBZJHXZXHJBJHXZXHJBJHXJBJHXZCXZCCZCXZUCAGSAJIJICXZIJUAUUISUSJUSSJSJSJAJCXZXCZTTTTTRJFUABGAFGLOPKWABCAAAABBU
My code is the following, which is intended to print all of the matches and then save them into a file as well. But I am not getting any matches, while I expect there to be at least 10 in my original file.
open(text, "<mytext.txt");
push (@matches,$&) while(<text> =~ m{
([TR]{6}
JFUA
[ABR]{1}
GAFG
( [LOP]{2,3} )
[KW]{2,5}
(??{ $2 =~ tr/LOP/ABC/r })
AAAABB[UXZ]{1})
/g
}x);
print "@matches\n";
my $filename = 'results_matches.txt';
open(my $fh, '>', $filename) or die "Could not open file '$filename' $!";
print $fh "@matches\n";
close $fh;
print "done\n";
I have also tried the following code and this also does not work:
my @matches = <text> =~ m{
([TR]{6}
JFUA
[ABR]{1}
GAFG
( [LOP]{2,3} )
[KW]{2,5}
(??{ $2 =~ tr/LOP/ABC/r })
AAAABB[UXZ]{1})
/g
}x;
print "@matches\n";
I have the following code which successfully prints out only one (the first) result. But it fails to print all of the matches.
if (<text> =~ m{
([TR]{6}
JFUA
[ABR]{1}
GAFG
( [LOP]{2,3} )
[KW]{2,5}
(??{ $2 =~ tr/LOP/ABC/r })
AAAABB[UXZ]{1})
}x) {print "$1\n";}
I have followed the answers in this topic but have not been able to get any of them to work: How can I find all matches to a regular expression in Perl?

By using while (<text> ...), you are reading a new line from the file handle on each iteration of the loop. You need two loops: an outer one iterating over the lines, and an inner one iterating over the matches.
while (my $line = <text>) {
push @matches, $1 while $line
=~ m{
([TR]{6}
JFUA
[ABR]
GAFG
( [LOP]{2,3} )
[KW]{2,5}
(??{ $2 =~ tr/LOP/ABC/r })
AAAABB[UXZ])
}xg;
}
I also removed {1} as it's useless; used $1 instead of $& because $& imposes a performance penalty on all the matching you do in a program; and removed the stray /g from inside the pattern, putting the g modifier in the right place (i.e. next to }x).
When testing, I copied'n'pasted the input from here, i.e. I have all the characters in one line. If your input is different, please use the code formatting for it, not quotation.
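For a quick check without the input file, the inner loop can be run against a single string. This is only a sketch: $line below is a hypothetical input glued together from two fragments of the sample text in the question, and tr///r requires Perl 5.14 or later.

```perl
use strict;
use warnings;
use 5.014;   # the /r flag of tr/// needs Perl 5.14+

# Hypothetical one-line input built from two fragments of the question's sample.
my $line = 'BBGTRTRTRJFUARGAFGOOPWWKBBCAAAABBXHABSRRRRRRJFUABGAFGLLPKWAACAAAABBZJH';

my @matches;
# Scalar-context /g resumes after the previous match on each iteration.
while ($line =~ m{
        ( [TR]{6} JFUA [ABR] GAFG
          ( [LOP]{2,3} )
          [KW]{2,5}
          (??{ $2 =~ tr/LOP/ABC/r })   # group 2 translated L->A, O->B, P->C
          AAAABB [UXZ] )
    }xg) {
    push @matches, $1;
}

print "$_\n" for @matches;
```

Wrapping this while loop in an outer loop over the lines of the file gives the two-loop structure described above.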

Find multiline blocks

I'm trying to find occurrences of BLOB_SMUGHO, from the file test.out from the bottom of the file. If found, return a chunk of data which I'm interested in between the string "2014.10"
I'm getting Use of uninitialized value $cc in pattern match (m//) at
What is wrong with this script?
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);
use File::ReadBackwards;
my $find = "BLOB_SMUGHO";
my $chnkdelim = "\n[" . strftime "%Y.%m", localtime;
my $fh = File::ReadBackwards->new('test.out', $chnkdelim, 0) or die "err-file: $!\n";
while ( defined(my $line = $fh->readline) ) {
if(my $cc =~ /$find/){
print $cc;
}
}
close($fh);
In case it helps, here is some sample content of test.out:
2014.10.31 lots and
lots of
gibbrish
2014.10.31 which I'm not
interested
in. It
also
2014.10.31 spans
across thousands of
lines and somewhere in the middle there will be
2014.10.31
this precious word BLOB_SMUGHO and
2014.10.31 certain other
2014.10.31 words
2014.10.31
this precious word BLOB_SMUGHO and
2014.10.31
this precious word BLOB_SMUGHO and
which
I
will
be
interested
in.
And I'm expecting to capture all the multiple occurrences of the chunk of the text from bottom of the file.
2014.10.31
this precious word BLOB_SMUGHO and

First, you have written your match incorrectly due to misunderstanding the =~ operator:
if(my $cc =~ /$find/){ # incorrect, like saying if(undef matches something)
If you want to match what is in $line against the pattern between /.../ then do:
if($line =~ /$find/) {
The match operator expects a value on its left side as well as a pattern on its right. You were using it like an assignment operator.
If you need to capture the match into a variable or list, put the variable to the left of an equals sign and add a capture group to the pattern:
if(my ($cc) = $line =~ /($find)/) { # wrap $cc in () for list context
By the way, I think you are better off writing:
if($line =~ /$find/) {
print $line;
or, if you want to print only what you matched:
print $&;
Since you aren't capturing a substring, it doesn't really matter here.
Now, as to how to match everything between two patterns, the task is easier if you don't match line by line, but match across newlines using the /s modifier.
In Perl, you can set the record separator to undef and use slurp mode.
local $/ = undef;
my $s = <>; # read all lines into $s
Now to scan $s for patterns
while($s =~ /(START.*?STOP)/gsm) { print "$1\n"; } # print the pattern inclusive of START and STOP
Or to capture between START and STOP
while($s =~ /START(.*?)STOP/gsm) { print "$1\n"; } # print the pattern between of START and STOP
So in your case the start pattern is 2014.10.31 and stop is BLOB_SMUGHO
while($s =~ /(2014\.10\.31.*?BLOB_SMUGHO)/gsm) {
print "$1\n";
}
NOTE: Regex modifiers in Perl come after the closing /. Here /g matches globally (each iteration of the loop resumes where the previous match ended), /s lets . match newlines, and /m lets ^ and $ match at line boundaries (unused by these patterns, but harmless).
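One caveat: a non-greedy .*? still starts at the leftmost date stamp, so the pattern above can run from the very first 2014.10.31 in the file up to the first BLOB_SMUGHO. A possible refinement, sketched here on hypothetical data modeled on test.out, is to forbid a new date stamp inside the match with a negative look-ahead.

```perl
use strict;
use warnings;

# Hypothetical sample data modeled on the question's test.out.
my $s = join "\n",
    '2014.10.31 lots and',
    'lots of',
    '2014.10.31',
    'this precious word BLOB_SMUGHO and',
    '2014.10.31 certain other',
    '2014.10.31',
    'this precious word BLOB_SMUGHO and',
    'which';

# After a date stamp, consume characters only while no new date stamp begins,
# so a match cannot start in an earlier, uninteresting chunk.
my @chunks = $s =~ /(2014\.10\.31(?:(?!2014\.10\.31).)*?BLOB_SMUGHO)/gs;

print "$_\n---\n" for @chunks;
```

In list context a /g match returns every capture, so @chunks holds the chunks in file order; reverse @chunks would list them from the bottom of the file upward.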

Perl: Empty $1 regex value when matching?

Readers,
I have the following regex problem:
code
#!/usr/bin/perl -w
use 5.010;
use warnings;
my $filename = 'input.txt';
open my $FILE, "<", $filename or die $!;
while (my $row = <$FILE>)
{ # take one input line at a time
chomp $row;
if ($row =~ /\b\w*a\b/)
{
print "Matched: |$`<$&>$'|\n"; # the special match vars
print "\$1 contains '$1' \n";
}
else
{
#print "No match: |$row|\n";
}
}
input.txt
I like wilma.
this line does not match
output
Matched: |I like <wilma>|
Use of uninitialized value $1 in concatenation (.) or string at ./derp.pl line 14, <$FILE> line 22.
$1 contains ''
I am totally confused. If it is matching, and I am checking things in a conditional, why am I getting an empty result for $1? This isn't supposed to be happening. What am I doing wrong? How can I get 'wilma' to be in $1?
I looked here but this didn't help because I am getting a "match".

You don't have any parentheses in your regex. No parentheses, no $1.
I'm guessing you want the "word" value that ends in -a, so that would be /\b(\w*a)\b/.
Alternatively, since your whole regex only matches the bit you want, you can just use $& instead of $1, like you did in your debug output.
Another example:
my $row = 'I like wilma.';
$row =~ /\b(\w+)\b\s*\b(\w+)\b\s*(\w+)\b/;
print join "\n", "\$&='$&'", "\$1='$1'", "\$2='$2'", "\$3='$3'\n";
The above code produces this output:
$&='I like wilma'
$1='I'
$2='like'
$3='wilma'
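Applied to the pattern from the question, the suggested fix can be sketched as follows; $word is a name introduced here for illustration.

```perl
use strict;
use warnings;

my $row = 'I like wilma.';
my $word;

# The parentheses make the word ending in "a" available as $1.
if ($row =~ /\b(\w*a)\b/) {
    $word = $1;
    print "\$1 contains '$word'\n";
}
```

Copying $1 into a lexical right after the match is a common habit, since $1 is overwritten by the next successful match.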

Perl Extracting Text

I have been working on this for so long!
I'd appreciate your help...
What my doc will look like:
<text>
<text> command <+>= "stuff_i_need" <text>
<text>
<text> command <+>= stuff <text>
<text>
<text> command <+>= -stuff <text>
<text>
Anything with angle brackets around it is optional
stuff could be anything (apple, orange, banana) but it is what I need to extract
the command is fixed
My code so far:
#!/usr/bin/env perl
use warnings;
use strict;
use Text::Diff;
# File Handlers
open(my $ofh, '>in.txt');
open(my $ifh, '<out.txt');
while (<$ifh>)
{
# Read in a line
my $line = $_;
chomp $line;
# Extract stuff
my $extraction = $line;
if ($line =~ /command \+= /i) {
$extraction =~ s/.*"(.*)".*/$1/;
# Write to file
print $ofh "$extraction\n";
}
}

Based on the example input:
if ($line =~ /command\d*\s*\+?=\s*["-]?(\w+)"?/i) {
$extraction = $1;
print "$extraction\n";
}

A few things:
For extraction, don't use substitution (i.e., use m// and not s///). If you use a match, the parenthetical groups inside the match will be returned as a list (and assigned to $1, $2, $3, etc. if you prefer).
The =~ binds the variable you want to match. So you want $extraction to actually be $line.
Your .* match is too greedy and can keep the match from succeeding the way you want. What I mean by "greedy" is that the leading .* first consumes the whole line and then backtracks only as far as it must, so if a line contains more than one quoted string, $1 ends up holding the last one rather than the first.
You want to specify what the word could be. For example, if it's letters, then match [a-zA-Z]
my ($extraction) = $line =~ /command \+= "([a-zA-Z]*)"/;
If it's a number, you want [0-9]:
my ($extraction) = $line =~ /command \+= "([0-9]*)"/;
If it could be anything except ", use [^"], which means "anything but "":
my ($extraction) = $line =~ /command \+= "([^"]*)"/;
It usually helps to try to match against just what you are looking for rather than the blanket .*.
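As a small sketch of the list-assignment idiom above, the following runs the [^"] pattern over a few hypothetical lines (the sample strings and the @found array are assumptions for illustration). When the match fails, the list assignment leaves $extraction undefined, so it is worth checking before use.

```perl
use strict;
use warnings;

# Hypothetical input lines; only the first has the quoted "command +=" form.
my @lines = (
    'foo command += "stuff_i_need" bar',
    'command = apple',
    'nothing to see here',
);

my @found;
for my $line (@lines) {
    # List context: the match returns its capture groups, or an empty list on failure.
    my ($extraction) = $line =~ /command \+= "([^"]*)"/;
    push @found, $extraction if defined $extraction;
}

print "$_\n" for @found;
```

Handling the unquoted and dash-prefixed forms would need a looser pattern, such as the ones in the other answers.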

The following regular expression would help you:
m{
(?<= = ) # Find an `=`
\s* # Match 0 or more whitespaces
(?: # Do not capture
["\-] # Match either a `"` or a `-` (spaces inside [...] are literal under /x)
)? # Match once or never
( # Capture
[^"\s]+ # Match anything but a `"` or whitespace
)
}x;

The following one-liner will extract a word (a sequence of characters without spaces) that follows an equal sign prefixed by an optional plus sign, surrounded by optional quotes. It will read from in.txt and write to out.txt.
perl -lne 'push @a, $1 if /command\s*\+?=\s*("?\S+"?)/ }{
print for @a' in.txt > out.txt
The full code - if you prefer script form - is:
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
chomp $_;
push @a, $1 if /command\s*\+?=\s*("?\S+"?)/;
}
{
print $_ foreach (@a);
}
Courtesy of the Deparse function of the O module.

A light solution.
#!/usr/bin/env perl
use warnings;
use strict;
open my $ifh, '<', 'in.txt' or die "Cannot open in.txt: $!";
open my $ofh, '>', 'out.txt' or die "Cannot open out.txt: $!";
while (<$ifh>)
{
if (/
\s command\s\+?=\s
(?:-|("))? # The word can be preceded by an optional - or "
(\w+)
(?(1)\1)\s+ # If the word is preceded by a " it must end
# with a "
/x)
{
print $ofh $2."\n";
}
}
