Can't grab value from subexpressions - regex

I'm reading null-separated blocks of data from a socket; upon finding a null, the block read so far is handed for processing, and the buffer is truncated to whatever was left after the null (usually nothing).
do {
$in .= <$sock>
} while(!in =~ /^(.*?)\x00(.*)/) ;
print "A:[ $in, $block ]\n";
$block = $1;
$in = $2;
print "B:[ $in, $block ]";
Result:
A:[ {"hn":"ITtestKA","v":{"m":4,"u":4}}
, ]
B:[ , ]
Why can't I pick the data from the subexpressions $1, $2?

Another idea is to change the value of $/, which determines how much <$sock> reads from $sock. So for example, if you do
local $/="\0"
then for that scope, <$sock> will read until the end of the next (null-separated) block.
Edit: If the input consists of null-separated blocks, it does not make sense to use <$sock> with $/ having its default value, because then you would be reading the input line by line. So I think there are 2 approaches:
Set $/ to \0 and use <$sock> to read the next block.
Read a fixed size chunk and then search for a null byte in it to extract the next block. In this scenario, you can either use <$sock> with $/ set to a reference of the chunk size (e.g. local $/=\4096), or use read/sysread. I would also use index/substr to search for the null and extract the chunk in that case.
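Both approaches can be sketched roughly as follows. This is only an illustration, not the original poster's code: an in-memory filehandle stands in for the socket, the sample data is invented, and the helper names blocks_by_separator and blocks_by_chunk are made up for the example.

```perl
use strict;
use warnings;

# Approach 1: make <$fh> stop at each null byte by setting $/.
sub blocks_by_separator {
    my ($fh) = @_;
    local $/ = "\0";              # record separator for this scope only
    my @blocks;
    while (my $block = <$fh>) {
        chomp $block;             # chomp strips the current $/, i.e. the null
        push @blocks, $block;
    }
    return @blocks;
}

# Approach 2: read fixed-size chunks and cut blocks out with index/substr.
sub blocks_by_chunk {
    my ($fh, $chunk_size) = @_;
    my $buf = '';
    my @blocks;
    while (read($fh, my $chunk, $chunk_size)) {
        $buf .= $chunk;
        while ((my $pos = index($buf, "\0")) >= 0) {
            push @blocks, substr($buf, 0, $pos);
            substr($buf, 0, $pos + 1, '');   # drop the block and its null
        }
    }
    return @blocks;               # anything left in $buf is an unfinished block
}

# Hypothetical sample data standing in for what the socket would deliver.
my $data = "abcd\0efghijkl\0mnop\0";

open my $fh1, '<', \$data or die "Cannot open in-memory handle: $!";
my @by_separator = blocks_by_separator($fh1);

open my $fh2, '<', \$data or die "Cannot open in-memory handle: $!";
my @by_chunk = blocks_by_chunk($fh2, 4);

print "separator: [$_]\n" for @by_separator;
print "chunk:     [$_]\n" for @by_chunk;
```

With the first approach a trailing block that is not yet terminated by a null is returned as is, so a real socket reader would probably want to check whether chomp actually removed anything before treating a block as complete.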

There's a negated form of the binding operator:
while($in !~ /^(.*?)\x00(.*)/)

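in and $in are different things. Do you use strict? Without it, the bareword in is treated as the string "in", and because ! binds tighter than =~, the condition matches !"in" (an empty string) against the regex. That never matches, so the loop exits after the first read and $1 and $2 are never set.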
#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my @socket = ( "abcd\x00efgh",
"ijkl",
"mnop\x00qrst",
"uvwx\x00",
);
my ($in, $block);
while (@socket) {
do {
$in .= shift @socket;
} while ($in !~ /^(.*?)\x00(.*)/) ;
say "A:[ $in, $block ]";
($block, $in) = ($1, $2);
say "B:[ $in, $block ]";
}

Related

The perfect regex to extract a particular string in perl

I have a text file abc.txt that looks like this:
dQdC(sA1B2C3,sC5) = A lot of stuff
a = b = c
Baseball
dQdC(sC2V3X1,sD5) = A lot of stuff again
Now I want to create two arrays in perl, one of which will contain A1B2C3 and C2V3X1, while the other will contain C5 and D5. I don't care about the other intermediate lines. To achieve this goal, I am trying this perl script:
for (my $in=0;$in<=$#lines;$in++){
if ($lines[$in]=~/dQdC\(s([A-Z0-9]+?),s([A-Z0-9]+?)\)/) {
print "1111"; #this line is just to check if it is at all going inside the loop
@A = $1;
@B = $2;
}
}
However, it is not even going inside the loop. So I guess I did something wrong with the regex. Will someone please tell me what I am doing wrong here?
my (@a, @b);
while ($file =~ /^dQdC\(s(\w+),s(\w+)\)/mg) {
push @a, $1;
push @b, $2;
}
or
my (@a, @b);
while (<$fh>) {
if (/^dQdC\(s(\w+),s(\w+)\)/) {
push @a, $1;
push @b, $2;
}
}
Working with parallel arrays isn't nice.
Alternative 1: Hash
my %hash = $file =~ /^dQdC\(s(\w+),s(\w+)\)/mg;
or
my %hash;
while (<$fh>) {
if (/^dQdC\(s(\w+),s(\w+)\)/) {
$hash{$1} = $2;
}
}
Alternative 2: AoA
use List::Util qw( pairs ); # 1.29+
my @pairs = pairs( $file =~ /^dQdC\(s(\w+),s(\w+)\)/mg );
or
my @pairs;
while (<$fh>) {
if (/^dQdC\(s(\w+),s(\w+)\)/) {
push @pairs, [ $1, $2 ];
}
}
If the format of your target lines is always as shown, you can use
use warnings;
use strict;
my $file = ...
my (@ary_1, @ary_2);
open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>)
{
my ($v1, $v2) = /dQdC\(s([^,]+),s([^\)]+)/ or next;
push @ary_1, $v1;
push @ary_2, $v2;
}
This captures what lies between (s and the comma, and then between ,s and the closing parenthesis. The first pattern might as well be s(.*?), since the negated character class brings no benefit when the following comma still needs to be matched (I kept [^...] for consistency with the second one).
Comments
In general it is better to process a file line by line, unless there is a specific reason to read it all first
A C-style loop is rarely needed. To iterate over array indices use for my $i (0..$#ary)
Please always use warnings; and use strict;
Try this:
(?<=\(s)([A-Z0-9]+)(?=,)
It matches substrings that come between (s and , using lookbehind and lookahead.
Similarly, use (?<=,s)([A-Z0-9]+)(?=\)) to capture the substrings between ,s and ).
Putting them together, you can create two capturing groups, each containing the different kind of substrings: (A1B2C3, C2V3X1), (C5, D5)
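A minimal sketch of how the two lookaround patterns could fill the two arrays, using the sample lines from the question as in-line data. The array names @first and @second are chosen only for this illustration. Note that these patterns do not check for the dQdC prefix, so on other input they may match lines the question did not intend.

```perl
use strict;
use warnings;

# Sample lines taken from the question's abc.txt.
my @lines = (
    'dQdC(sA1B2C3,sC5) = A lot of stuff',
    'a = b = c',
    'Baseball',
    'dQdC(sC2V3X1,sD5) = A lot of stuff again',
);

my (@first, @second);
for my $line (@lines) {
    # Text between "(s" and ","
    push @first,  $1 if $line =~ /(?<=\(s)([A-Z0-9]+)(?=,)/;
    # Text between ",s" and ")"
    push @second, $1 if $line =~ /(?<=,s)([A-Z0-9]+)(?=\))/;
}

print "first:  @first\n";
print "second: @second\n";
```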

Save result from flip-flop in variable?

I have about 1kB of text from STDIN
my $f = join("", <STDIN>);
and I would like to get the content between open1 and close1, so /open1/../close1/ comes to mind.
I have only seen it being used in one-liners, and in scripts inside while loops operating on $_.
Question
How can I get the result from /open1/../close1/ in my script when everything is in $f?
Capturing all matches with a single regular expression
If you want to capture all the lines between the open1 and close1 markers (excluding the markers), it is easily done with a single regular expression:
my $f = join("", <STDIN>);
my @matches = ( $f =~ m/\bopen1\b(.*?)\bclose1\b/gs );
for my $m (@matches) {
print "$m";
}
where
the s modifier treats the string as a single line, so that . also matches newlines;
the g modifier captures all the matches;
(.*?) matches a group of any characters using the lazy quantifier.
Using the range operator
The range operator (the so-called flip-flop) is not very convenient for this task if you want to avoid capturing the markers, because an expression like /open1/ .. /close1/ also returns true for the lines that match the patterns themselves.
The expression /^open1$/ .. /^close1$/ returns false until /^open1$/ is true. Once it has matched, the left regular expression is no longer evaluated, and the range expression keeps returning true up to and including the line where /^close1$/ matches. After the right expression has matched, the cycle restarts. Thus, the open1 and close1 marker lines are themselves included in $matches.
It is even less convenient, if the input is stored in a variable, because you will need to read the contents of the variable line by line, e.g.:
my $matches = "";
my @lines = split /\n/, $f;
foreach my $line (@lines) {
if ($line =~ m/^open1$/ .. $line =~ m/^close1$/) {
$matches .= "$line\n";
}
}
Note that it is possible to use arbitrary Perl expressions as operands of the range operator. I wouldn't recommend this code, as it is neither very efficient nor very readable. At the same time, it is easy to adapt the first example to the case where the open1 and close1 markers are included in the set of matches, e.g.:
my @matches = ( $f =~ m/\bopen1\b(.*?)\bclose1\b/gs );
for my $m (@matches) {
print "open1${m}close1\n";
}
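If the flip-flop is used anyway, its return value can help drop the marker lines: as documented in perlop, the operator returns a sequence number starting at 1 while it is true, and the last number of a range has "E0" appended. The following is a minimal sketch, assuming the markers sit on lines of their own. The sample text and the variable $inside are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical input with one open1/close1 range.
my $f = "before\nopen1\nline a\nline b\nclose1\nafter\n";

open my $fh, '<', \$f or die "Cannot open in-memory handle: $!";
my $inside = '';
while (my $line = <$fh>) {
    # In scalar context .. is the flip-flop and yields a sequence number.
    my $seq = ($line =~ /^open1$/) .. ($line =~ /^close1$/);
    next unless $seq;                  # outside the range
    next if $seq == 1;                 # the open1 line itself
    next if $seq =~ /E0$/;             # the close1 line itself
    $inside .= $line;
}
print $inside;
```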
You can rewrite how $f is generated so that it takes advantage of the flip-flop inside a while loop:
my ( $f, $matched );
while ( <> ) {
$f .= $_;
$matched .= $_ if /open1/ .. /close1/;
}
Another way is to create a new input stream out of the contents of $f.
open my $fh, '<', \$f or die "Can't open in-memory filehandle: $!";
while (<$fh>) {
if (/open1/ .. /close1/) {
...
}
}
You can also employ split. To get what is between the first pair of open1 and close1
my $open_to_close = (split /open1|close1/, $f)[1];
The delimiter can be either open1 or close1, so split returns a list of three elements: the text before open1, the text between the markers, and the text after close1. We take the second element.
If there are more open1/close1 pairs, take all odd-indexed elements.
Either get the array as well
my @parts = split /open1|close1/, $f;
my @all_open_to_close = @parts[ grep { $_ & 1 } 0..$#parts ];
or get it directly from the list
my @all_open_to_close =
grep { CORE::state $i; ++$i % 2 == 0 } split /open1|close1/, $f;
state is a feature available from v5.10. If you already enable it, you don't need the CORE:: prefix. The CORE:: form itself requires v5.16 or later.
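As a quick illustration of the odd-index selection, here is a sketch on an invented string with two open1/close1 pairs. It assumes the markers strictly alternate; an unmatched or nested marker would shift the indices and give wrong results.

```perl
use strict;
use warnings;

# Hypothetical input with two marker pairs.
my $f = "aa open1 first close1 bb open1 second close1 cc";

# Fields alternate: outside, inside, outside, inside, outside.
my @parts = split /open1|close1/, $f;

# Keep the odd-indexed fields, i.e. the text between the markers.
my @all_open_to_close = @parts[ grep { $_ & 1 } 0 .. $#parts ];

print "[$_]\n" for @all_open_to_close;
```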

In array listitem how to store values with comma separation and end with dot using perl

use strict;
use warnings;
my $tmp = join "\n", <DATA>;
my @biblables = ();
The list items are fetched and stored in @biblables in a while loop:
while($tmp=~m/\\bibitem\[([^\[\]]*)\]\{([^\{\}]*)\}/g)
{
push(@biblables, "\\cite{$2}, ");
}
print @biblables;
Printing this gives output like:
\cite{BuI2001},\cite{BuI2002},\cite{BuI2003},\cite{BuI2004},\cite{BuI2005},\cite{BuI2006},
However we need the output like this
\cite{BuI2001},\cite{BuI2002},\cite{BuI2003},\cite{BuI2004},\cite{BuI2005},\cite{BuI2006}.
So we use the postmatch variable $' to check whether another item follows, and end the last list item with a dot:
while($tmp=~m/\\bibitem\[([^\[\]]*)\]\{([^\{\}]*)\}/g)
{
my ($key, $post) = ($2, $'); # save $2 before the next match overwrites it
if($post!~m/\\bibitem\[([^\[\]]*)\]\{([^\{\}]*)\}/)
{ push(@biblables, "\\cite{$key}."); }
else { push(@biblables, "\\cite{$key}, "); }
}
print @biblables;
Could you please advise me if there is a shorter way to get this output?
#
__DATA__
\bibitem[{BuI (2001)}]{BuI2001}
\bibitem[{BuII (2002)}]{BuI2002}
\bibitem[{BuIII (2003)}]{BuI2003}
\bibitem[{BuIV (2004)}]{BuI2004}
\bibitem[{BuV (2005)}]{BuI2005}
\bibitem[{BuVI (2006)}]{BuI2006}
You can add the comma and period after the fact:
while($tmp=~m/\\bibitem\[([^\[\]]*)\]\{([^\{\}]*)\}/g){
push(@biblables, "\\cite{$2}");
}
print join ', ', @biblables;
print ".\n";
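The same idea can be written without an explicit loop. In this sketch (shortened sample data, and a pattern that captures only the citation key so that a list-context match with /g returns just the keys), map builds the \cite{...} items and join adds the separators. The variable names @keys and $out are chosen for the example.

```perl
use strict;
use warnings;

# Shortened sample input, modelled on the question's __DATA__ section.
my $tmp = join '', (
    "\\bibitem[{BuI (2001)}]{BuI2001}\n",
    "\\bibitem[{BuII (2002)}]{BuI2002}\n",
    "\\bibitem[{BuIII (2003)}]{BuI2003}\n",
);

# Only the citation key is captured, so //g in list context returns the keys.
my @keys = $tmp =~ /\\bibitem\[[^\[\]]*\]\{([^{}]*)\}/g;

# Commas go between items, the dot goes after the last one.
my $out = join(', ', map { "\\cite{$_}" } @keys) . '.';
print "$out\n";
```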
If you read from a filehandle you can use eof to determine that you are on the last line, at which point you replace the comma by the dot in the last element. This allows you to build the array completely in the loop, as required.
use warnings;
use strict;
open my $fh, '<', 'bibitems.txt' or die "Can't open bibitems.txt: $!";
my @biblabels;
while (<$fh>) {
push @biblabels, "\\cite{$2}," if /\\bibitem\[([^\[\]]*)\]\{([^\{\}]*)\}/;
$biblabels[-1] =~ tr/,/./ if eof;
}
print "$_ " for @biblabels;
print "\n";
This prints your desired output.
eof returns true if the next read will return end-of-file. This means that you've just read the last line, which was put on the array if it matched. This function is rarely needed, but here it seems to serve a fitting purpose. Note that eof and eof() behave a little differently. Please see the eof page.
If the other capture in the regex is meant to be used, change the above to if (...) { ... }. Note that what is in {} is called a citation key in LaTeX, while the (optional) labels are the things inside []. I'd go with the array name @citkeys for clarity.
If you're determined to add the commas and dots to the elements while matching in the regex while loop, it can be done like this. Since you don't know the total number of matches yet, just keep a reference to the most recently pushed element, then append the , or . as needed.
Code
use strict;
use warnings;
$/ = undef;
my $tmp = <DATA>;
my @biblables = ();
my $ref = undef;
while( $tmp =~ /\\bibitem\[([^\[\]]*)\]\{([^\{\}]*)\}/g )
{
$$ref .= ", " if defined $ref;
# push returns the new element count, so the last index is count - 1
$ref = \$biblables[ push(@biblables,"\\cite{$2}") - 1 ];
}
$$ref .= "." if defined $ref;
print @biblables;
__DATA__
\bibitem[{BuI (2001)}]{BuI2001}
\bibitem[{BuII (2002)}]{BuI2002}
\bibitem[{BuIII (2003)}]{BuI2003}
\bibitem[{BuIV (2004)}]{BuI2004}
\bibitem[{BuV (2005)}]{BuI2005}
\bibitem[{BuVI (2006)}]{BuI2006}
Output
\cite{BuI2001}, \cite{BuI2002}, \cite{BuI2003}, \cite{BuI2004}, \cite{BuI2005}, \cite{BuI2006}.

Values from IF statement regex match (Perl)

I'm currently extracting values from a table within a file via REGEX line matches against the table rows.
foreach my $line (split("\n", $file)) {
if ($line =~ /^(\S+)\s*(\S+)\s*(\S+)$/) {
my ($val1, $val2, $val3) = ($1, $2, $3);
# $val's used here
}
}
I purposely assign vals for clarity in the code. Some of my table rows contain 10+ vals (aka columns) - is there a more efficient method of assigning the vals instead of doing ... = ($1, $2, ..., $n)?
A match in list context yields a list of the capture groups. If it fails, it returns an empty list, which is false. You can therefore assign and test in a single statement:
if( my ( $val1, $val2, $val3 ) = $line =~ m/^(\S+)\s*(\S+)\s*(\S+)$/ ) {
...
}
However, a number of red flags are apparent in this code. That regexp capture looks very similar to a split:
if( my ( $val1, $val2, $val3 ) = split ' ', $line ) {
...
}
Secondly, why split $file by linefeeds? If you are reading the contents of a file, it is far nicer to read one line at a time:
while( my $line = <$fh> ) {
...
}
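For rows with many columns, one further option (not taken from the answers here, and offered only as a sketch) is to keep the column names in an array and fill a hash slice from split, so that each value can be looked up by name. The column names and sample rows below are invented for illustration.

```perl
use strict;
use warnings;

my @columns = qw( host port status );   # hypothetical column names
my $table   = "alpha 80 up\nbeta 443 down\nmalformed row\n";

open my $fh, '<', \$table or die "Cannot open in-memory handle: $!";
my @rows;
while (my $line = <$fh>) {
    my @vals = split ' ', $line;
    next unless @vals == @columns;      # skip rows with the wrong column count
    my %row;
    @row{@columns} = @vals;             # hash slice: column name => value
    push @rows, \%row;
}

print "$_->{host} is $_->{status}\n" for @rows;
```

Adding a column then only means extending @columns, rather than adding another capture group and another variable.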
I assume that this is not your actual code, because if so, it will not work:
foreach my $line (split("\n", $file)) {
if ($line =~ /^(\S+)\s*(\S+)\s*(\S+)$/) {
my ($val1, $val2, $val3) = ($1, $2, $3);
}
# all the $valX variables are now out of scope
}
You should also be aware that \s* will also match the empty string, and may cause subtle errors. For example:
"a bug" =~ /^(\S+)\s*(\S+)\s*(\S+)$/;
# the captures are now: $1 = "a"; $2 = "bu"; $3 = "g"
Even despite the fact that \S+ is greedy, the anchors ^ ... $ will force the regex to fit, hence allowing the empty strings to split the words.
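The difference is easy to see by running the same string against \s* and \s+ variants of the pattern. A small sketch, where $loose and $strict are names chosen for this example:

```perl
use strict;
use warnings;

my $loose  = qr/^(\S+)\s*(\S+)\s*(\S+)$/;   # separators may be empty
my $strict = qr/^(\S+)\s+(\S+)\s+(\S+)$/;   # at least one whitespace character

my @loose_caps  = "a bug" =~ $loose;    # matches by splitting "bug" in two
my @strict_caps = "a bug" =~ $strict;   # only two words: no match, empty list

print "loose:  ", join('|', @loose_caps), "\n";
print "strict: ", (@strict_caps ? join('|', @strict_caps) : 'no match'), "\n";
```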
If your intention is to capture all the words that are separated by whitespace, using split is your best option, as others have already mentioned.
open my $fh, "<", "file.txt" or die $!;
my @stored;
while (<$fh>) {
my @vals = split;
push(@stored, \@vals) if @vals; # ignore empty values
}
This will store any captured values into a two-dimensional array. Using the file handle directly and reading line-by-line is the preferred method, unless for some reason you actually need to have the entire file in memory.
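As a short usage sketch, the stored values can then be addressed by row and column index. The table contents here are invented, and an in-memory filehandle replaces file.txt so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical table with an empty line in the middle.
my $table = "a1 b1 c1\n\na2 b2 c2\n";
open my $fh, '<', \$table or die "Cannot open in-memory handle: $!";

my @stored;
while (<$fh>) {
    my @vals = split;                 # split $_ on whitespace
    push @stored, \@vals if @vals;    # skip lines without any values
}

# Row and column indices both start at 0.
print "row 2, column 3: $stored[1][2]\n";
for my $row (@stored) {
    print join(', ', @$row), "\n";
}
```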
It looks like you are just using a table with a space delimiter. You can use the split function:
my @valuearray = split(" ", $line);
And then address the elements as:
$valuearray[0], $valuearray[1], etc.

How to match exactly two empty lines

I have a question about regular expressions. I have a file and I need to parse it in such a way that I can distinguish some specific blocks of text in it. These blocks of text are separated by two empty lines (there are blocks which are separated by 3 or 1 empty lines, but I need exactly 2). So I have a piece of code, and this is the regular expression /^\s*$^\s*$/ that I think should match, but it does not.
What is wrong?
$filename="yu";
open($in,$filename);
open(OUT,">>out.text");
while($str=<$in>)
{
unless($str = /^\s*$^\s*$/){
print "yes";
print OUT $str;
}
}
close($in);
close(OUT);
Cheers,
Yuliya
By default, Perl reads files a line at a time, so you won't see multiple new lines. The following code selects text terminated by a double new line.
local $/ = "\n\n" ;
while (<> ) {
print "-- found $_" ;
}
New Answer
After having problems excluding more than 2 empty lines, and a good night's sleep, here is a better method that doesn't even need to slurp.
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'yu';
my @blocks; # each element will be an arrayref, one per block
# that referenced array will hold the lines in that block
open(my $fh, '<', $file);
my $empty = 0;
my $block_num = 0;
while (my $line = <$fh>) {
chomp($line);
if ($line =~ /^\s*$/) {
$empty++;
} elsif ($empty == 2) { #not blank and exactly 2 previous blanks
$block_num++; # move on to next block
$empty = 0;
} else {
$empty = 0;
}
push @{ $blocks[$block_num] }, $line;
}
#write out each block to a new file
my $file_num = 1;
foreach my $block (@blocks) {
open(my $out, '>', $file_num++ . ".txt");
print $out join("\n", @$block);
}
In fact rather than store and write later, you could simply write to one file per block as you go:
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'yu';
open(my $fh, '<', $file);
my $empty = 0;
my $block_num = 1;
open(OUT, '>', $block_num . '.txt');
while (my $line = <$fh>) {
chomp($line);
if ($line =~ /^\s*$/) {
$empty++;
} elsif ($empty == 2) { #not blank and exactly 2 previous blanks
close(OUT); #just learned this line isn't necessary, perldoc -f close
open(OUT, '>', ++$block_num . '.txt');
$empty = 0;
} else {
$empty = 0;
}
print OUT "$line\n";
}
close(OUT);
use 5.012;
open my $fh,'<','1.txt';
#slurping file
local $/;
my $content = <$fh>;
close $fh;
for my $block ( split /(?<!\n)\n\n\n(?!\n)/,$content ) {
say 'found:';
say $block;
}
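To see that this pattern splits only on exactly two empty lines, here is a small sketch on an invented string where the blocks are separated by one, two and three empty lines in turn. It assumes the empty lines are truly empty (no spaces, no carriage returns).

```perl
use strict;
use warnings;

# A and B: one empty line; B and C: two empty lines; C and D: three.
my $content = "A\n\nB\n\n\nC\n\n\n\nD\n";

# Three newlines in a row, with no further newline before or after.
my @blocks = split /(?<!\n)\n\n\n(?!\n)/, $content;

printf "%d blocks\n", scalar @blocks;
for my $block (@blocks) {
    print "found:\n$block\n";
}
```

Only the gap between B and C is used as a separator, so A and B stay together in the first block and C and D in the second.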
Deprecated in favor of new answer
justintime's answer works by telling perl that you want to call the end of a line "\n\n", which is clever and will work well. One exception is that this must match exactly. By the regex you are using it makes it seem like there might be whitespace on the "empty" lines, in which case this will not work. Also his method will split even on more than 2 linebreaks, which was not allowed in the OP.
For completeness, to do it the way you were asking, you need to slurp the whole file into a variable (if the file is not so large as to use all your memory, probably fine in most cases).
I would then probably say to use the split function to split the block of text into an array of chunks. Your code would then look something like:
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'yu';
my $text;
open(my $fh, '<', $file);
{
local $/; # enables slurp mode inside this block
$text = <$fh>;
}
close($fh);
my @blocks = split(
/
(?<!\n)\n #check to make sure there isn't another \n behind this one
\s*\n #first whitespace only line
\s*\n #second "
(?!\n) #check to make sure there isn't another \n after this one
/x, # x flag allows comments and whitespace in regex
$text
);
You can then do operations on the array. If I understand your comment to justintime's answer, you want to write each block out to a different file. That would look something like
my $file_num = 1;
foreach my $block (@blocks) {
open(my $out, '>', $file_num++ . ".txt");
print $out $block;
}
Notice that since you open $out lexically (with my) when it reaches the end of the foreach block, the $out variable dies (i.e. "goes out of scope"). When this happens to a lexical filehandle, the file is automatically closed. And you can do a similar thing to that with justintime's method as well:
local $/ = "\n\n" ;
my $file_num = 1;
while (<>) {
open(my $out, '>', $file_num++ . ".txt");
print $out $_;
}