Perl - How to remove mulitline numbers with a Regex - regex

I have a data file with the following.
Some random text here
1
2
3
13
Show:
120
items per page
I want to remove the numbers, "Show:" and the number below.
So the result becomes
Some random text here
items per page
I have the following code:
my $Showing = "((\\d{1,}\\n))*Show:\\n\\d{1,}\\n";
$FileContents =~ s/$Showing//ig;
which results in the following:
Some random text here
1
2
3
items per page
It only removes one number above "Show:", I have tried a number of variations of the $Showing variable. How can I get this to work.
I have another data file with the following:
Showing 1 - 46 of 46 products
20
50
per page
With the code, this code works.
my $Showing = 'Showing.*\n((\\d{1,}\\n)*)';
$FileContents =~ s/$Showing//ig;
The difference is the numbers are below "Showing", whereas for the one that does not work the numbers are above.

The attempted regex appears OK, even though I'd avoid the double quotes (and thus the need to then escape things!). Better yet, use qr operator to first build the regex pattern
my $re = qr/(?:[0-9]+\s*\n\s*)+Show:\s*\n\s*[0-9]+\s*\n/;
Then
$text =~ s/$re//;
results in the wanted two lines. The whole file is in the string $text.
I've sprinkled that pattern with possible spaces everywhere, but then since \s mostly includes all manner of new lines you can probably leave only the \s+
my $re = qr/(?:[0-9]+\s+)+Show:\s+[0-9]+\s+/;
(I left explicit \n's in the first pattern to avoid confusion.)
It is possible that something's "wrong" with new lines in your file, like having a carriage return and linefeed pair (instead of just a newline character). So if this isn't working try to tweak the \n in the pattern.
Options are to use [\n\r]+ (either or both of linefeed and carriage return), or \R (Unicode newline), or \v (vertical space). Or \s+, equivalent to [\h\v]. See the perlrecharclass link above.

I would solve this by just doing multiple regexes. For example
#!/usr/bin/env perl
use strict;
use warnings;
use v5.32;
while (my $line = <>) {
next if $line =~ m/\A\d+\s*\z/xms;
next if $line =~ m/\AShow:\s*\z/xms;
print $line;
}
In Shell it works like
$ ./remover.pl data.txt
Some random text here
items per page

Related

Telling regex search to only start searching at a certain index

Normally, a regex search will start searching for matches from the beginning of the string I provide. In this particular case, I'm working with a very large string (up to several megabytes), and I'd like to run successive regex searches on that string, but beginning at specific indices.
Now, I'm aware that I could use the substr function to simply throw away the part at the beginning I want to exclude from the search, but I'm afraid this is not very efficient, since I'll be doing it several thousand times.
The specific purpose I want to use this for is to jump from word to word in a very large text, skipping whitespace (regardless of whether it's simple space, tabs, newlines, etc). I know that I could just use the split function to split the text into words by passing \s+ as the delimiter, but that would make things for more complicated for me later on, as there a various other possible word delimiters such as quotes (ok, I'm using the term 'word' a bit generously here), so it would be easier for me if I could just hop from word to word using successive regex searches on the same string, always specifying the next index at which to start looking as I go. Is this doable in Perl?
So you want to match against the words of a body of text.
(The examples find words that contain i.)
You think having the starting positions of the words would help, but it isn't useful. The following illustrates what it might look like to obtain the positions and use them:
my #positions;
while ($text =~ /\w+/g) {
push #positions, $-[0];
}
my #matches;
for my $pos (#positions) {
pos($text) = $pos;
push #matches $1 if $text =~ /\G(\w*i\w*)/g;
}
If would far simpler not to use the starting positions at all. Aside from being far simpler, we also remove the need for two different regex patterns to agree as to what constitute a word. The result is the following:
my #matches;
while ($text =~ /\b(\w*i\w*)/g) {
push #matches $1;
}
or
my #matches = $text =~ /\b(\w*i\w*)/g;
A far better idea, however, is to extra the words themselves in advance. This approach allows for simpler patterns and more advanced definitions of "word"[1].
my #matches;
while ($text =~ /(\w+)/g) {
my $word = $1;
push #matches, $word if $word =~ /i/;
}
or
my #matches = grep { /i/ } $text =~ /\w+/g;
For example, a proper tokenizer could be used.
In the absence of more information, I can only suggest the pos function
When doing a global regex search, the engine saves the position where the previous match ended so that it knows where to start searching for the next iteration. The pos function gives access to that value and allows it to be set explicitly, so that a subsequent m//g will start looking at the specified position instead of at the start of the string
This program gives an example. The string is searched for the first non-space character after each of a list of offsets, and displays the character found, if any
Note that the global match must be done in scalar context, which is applied by if here, so that only the next match will be reported. Otherwise the global search will just run on to the end of the file and leave information about only the very last match
use strict;
use warnings 'all';
use feature 'say';
my $str = 'a b c d e f g h i j k l m n';
# 0123456789012345678901234567890123456789
# 1 2 3
for ( 4, 31, 16, 22 ) {
pos($str) = $_;
say $1 if $str =~ /(\S)/g;
}
output
c
l
g
i

Perl regex wierd behavior : works with smaller strings fails with longer repetitions of the smaller ones

here is a REGEX in perl that I use to identify strings that match this pattern : include any number of occurrences of any character but single quote ' or backslash , allow only escaped occurrences of ' or , respectively : \' and \ and finally it has to end with a (non-escaped) single quote '
foo.pl
#!/usr/bin/perl
my $line;
my $matchString;
Main();
sub Main() {
foreach $line( <STDIN> ) {
$line =~ m/(^(([^\\\']*?(\\\')*?(\\\\)*?)*?\'))/g;
$matchString = $1;
print "matchString:$matchString\n"
}
}
It seems to work fine for strings like :
./foo.pl
asasas'
sdsdsdsdsdsd'
\\\'sdsdsdsdsd\\\'sdsdsdsd\\'
\'sddsd\\sdsdsds\\\\\\sdsdsdsd\\\\\\'
matchString:asasas'
matchString:sdsdsdsdsdsd'
matchString:\\\'sdsdsdsdsd\\\'sdsdsdsd\\'
matchString:\'sddsd\\sdsdsds\\\\\\sdsdsdsd\\\\\\'
Then I create a file with the following recurring pattern :
AAAAAAAAAAAAAAAAAAAAAAAAAAAAA\\BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\'CCCCCCCCCCCCCCCCCCCCCC\\sdsdsd\\\\\' ZZZZ\'GGGGGG
By creating a string by repeating this pattern one or more times and adding a single quote ' at the end should match the reg exp. I created a file called zz3 with 16 repetitions of the above pattern. I created then a file called ZZ6 with 18 repetitions of zz3 and another one called ZZ7 with the contents of ZZ6 + one additional instance of zz3, hence 19 repetitions of zz3.
By adding a single quote at the end of zz3 it results in a match. By adding a single quote at the end of ZZ6 it also results in a match as expected.
Now here is the tough part, by adding a single quote at the end of ZZ7 does not result in a match!
here is a link to the 3 files :
https://drive.google.com/file/d/0BzIKyGguqkWvOWdKaElGRjhGdjg/view?usp=sharing
The perl version I am using is v5.16.3 on FreeBSD bit i tried with various versions on either FreeBSD or linux with identical results. It seems to me that either perl has a problem with the size from 34274 bytes (ZZ6) to 36178 bytes (ZZ7), or I am missing something badly.
Your regular expression leads to catastrophic backtracking because you have nested quantifiers.
If you change it to
(^(([^\\\']*+(\\')*+(\\\\)*+)*?'))
(using possessive quantifiers to avoid backtracking), it should work.
I just would like to note that the whole problem appeared in an effort to re-engineer an old in-house program to parse escaped PostgreSQL bytea values.
Following this discussion it is clear that perl cannot match any repetition of non dot (.) patterns for more than 32766(=32K-2) times.
The solution is to masquerade the \\ and \' sequences with some chars that are certain to not appear in the input, such as Device Ctrl1 (\x11) and Device Ctrl2 (\x12), (presented as ^Q, ^R in vi respectively) :
$dataField =~ s/\\\\/\x11/g;
$dataField =~ s/\\\'/\x12/g;
then try to match non greedily any input till the first single quote.
$dataField =~ m/(^.*?\')/s;
$matchString = $1;
and finally substitute the above Ctrl chars back to their initial values
$matchString =~ s/\x11/\\\\/g;
$matchString =~ s/\x12/\\\'/g;
This is very fast. Another solution would be to parse till the first single quote and count the number of \'s. If it is even then we have found our last non escaped single quote in the text so we have found our desired match, otherwise the single quote is an escape one and thus considered part of the text, so we keep this value and iterate to the next single quote and repeat the same logic, by concatenating the value to the previous value. This tends to be very slow for big files with many intermediate escaped single quotes.
Perl regex's seem to be much faster than Perl code.

Matching the end of line $ in perl; print showing different behavior with chomp

I am reading a file and matching a regex for lines with a hex number at the start followed by few dot separated hex values followed by optional array name which may contain an option index. For eg:
010c10 00000000.00000000.0000a000.02300000 myFooArray[0]
while (my $rdLine = <RDHANDLE>) {
chomp $rdLine;
if ($rdLine =~ m/^([0-9a-z]+)[ \t]+([0-9.a-z]+)[ \t]*([A-Za-z_0-9]*)\[*[0-9]*\]*$/) {
...
My source file containing these hex strings is also script generated. This match works fine for some files but other files produced thru the exact same script (ie no extra spaces, formats etc) do not match when the last $ is present on the match condition.
If I modify the condition to not have the end $, lines match as expected.
Another curious thing is for debugging this, I added a print statement like this:
if ($rdLine =~ m/^([0-9a-z]+)[ \t]+/) {
print "Hey first part matched for $rdLine \n";
}
if ($rdLine =~ m/^([0-9a-z]+)[ \t]+([0-9.a-z]+)/) {
print "Hey second part matched for $rdLine \n";
}
The output on the terminal for the following input eats the first character :
010000 00000000 foo
"ey first part matched for 010000 00000000 foo
ey second part matched for 010000 00000000 foo"
If I remove the chomp, it prints the Hey correctly instead of just ey.
Any clues appreciated!
"other files produced thru the exact same script (ie no extra spaces, formats etc) do not match when the last $ is present on the match condition"
Although you deny it, I am certain that your file contains a single space character directly before the end of the line. You should check by using Data::Dump to display the true contents of each file record. Like this
use Data::Dump;
dd \$read_line;
It is probably best to use
$read_line =~ s/\s+\z//;
in place of chomp. That will remove all spaces and tabs, as well as line endings like carriage-return and linefeed from the end of each line.
"If I remove the chomp, it prints the Hey correctly instead of just ey."
It looks like you are working on a Linux machine, processing a file that was generated on a Windows platform. Windows uses the two characters CR LF as a record separator, whereas Linux uses just LF, so a chomp removes just the trailing LF, leaving CR to cause the start of the string to be overwritten.
If it wasn't for your secondary problem of having trailing whitespace, tThe best solution here would be to replace chomp $read_line with $read_line =~ s/\R\z//. The \R character class matches the Unicode idea of a line break sequence, and was introduced in version 10 of Perl 5. However, the aforementioned s/\s+\z// will deal with your line endings as well, and should be all that you need.
Borodin is right, \r\n is the culprit.
I used a less elegant solution, but it works:
$rdLine =~ s/\r//g;
followed by:
chomp $rdLine;

Use Perl to check if a string has only English characters

I have a file with submissions like this
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi
I am stripping everything but the song name by using this regex.
$line =~ s/.*>|([([\/\_\-:"``+=*].*)|(feat.*)|[?¿!¡\.;&\$#%#\\|]//g;
I want to make sure that the only strings printed are ones that contain only English characters, so in this case it would the first song title Ai Wo Quing shut up and not the next one because of the è.
I have tried this
if ( $line =~ m/[^a-zA-z0-9_]*$/ ) {
print $line;
}
else {
print "Non-english\n";
I thought this would match just the English characters, but it always prints Non-english. I feel this is me being rusty with regex, but I cannot find my answer.
Following from the comments, your problem would appear to be:
$line =~ m/[^a-zA-z0-9_]*$/
Specifically - the ^ is inside the brackets, which means that it's not acting as an 'anchor'. It's actually a negation operator
See: http://perldoc.perl.org/perlrecharclass.html#Negation
It is also possible to instead list the characters you do not want to match. You can do so by using a caret (^) as the first character in the character class. For instance, [^a-z] matches any character that is not a lowercase ASCII letter, which therefore includes more than a million Unicode code points. The class is said to be "negated" or "inverted".
But the important part is - that without the 'start of line' anchor, your regular expression is zero-or-more instances (of whatever), so will match pretty much anything - because it can freely ignore the line content.
(Borodin's answer covers some of the other options for this sort of pattern match, so I shan't reproduce).
It's not clear exactly what you need, so here are a couple of observations that speak to what you have written.
It is probably best if you use split to divide each line of data on <SEP>, which I presume is a separator. Your question asks for the fourth such field, like this
use strict;
use warnings;
use 5.010;
while ( <DATA> ) {
chomp;
my #fields = split /<SEP>/;
say $fields[3];
}
__DATA__
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi
output
Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
Achète-moi
Also, the word character class \w matches exactly [a-zA-z0-9_] (and \W matches the complement) so you can rewrite your if statement like this
if ( $line =~ /\W/ ) {
print "Non-English\n";
}
else {
print $line;
}

Perl: why does this web scraper regex work inconsistently?

I have run into another problem in relation to a site I am trying to scrape.
Basically I have stripped most of what I don't want from the page content and thanks to some help given here have managed to isolate the dates I wanted. Most of it seems to be working fine, despite some initial problems matching a non-breaking space. However, I am now having difficulty with the final regex, which is intended to split each line of data into fields. Each line represents the price of a share price index. The fields on each line are:
A name of arbitrary length made from characters from the latin alphabet and sometimes a comma or ampersand, no numerics.
A number with two digits after the decimal point (the absolute value of the index).
A number with two digits after the decimal point (the change in the value).
A number with two digits after the decimal point followed by a percent sign (the percentage change in value).
Here is an example string, before splitting:
"Fishery, Agriculture & Forestry243.45-1.91-0.78% Mining360.74-4.15-1.14% Construction465.36-1.01-0.22% Foods783.2511.281.46% Textiles & Apparels412.070.540.13% Pulp & Paper333.31-0.29-0.09% Chemicals729.406.010.83% "
The regex I am using to split this line is this:
$mystr =~ s/\n(.*?)(\d{1,4}\.\d{2})(\-?\d{1,3}\.\d{2})(.*?%)\n/\n$1 == $2 == $3 == $4\n/ig;
It works sometimes but not other times and I cannot work out why this should be. (The doubled equal signs in the example output below are used to make the field split more easily visible.)
Fishery, Agriculture & Forestry == 243.45 == -1.91 == -0.78%
Mining360.74-4.15-1.14%
Construction == 465.36 == -1.01 == -0.22%
Foods783.2511.281.46%
I thought the minus sign was an issue for those indices that saw a negative change in the price of the index, but sometimes it works despite the minus sign.
Q. Why is the final regex shown below failing to split the fields consistently?
Example code follows.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Tree;
my $url_full = "http://www.tse.or.jp/english/market/STATISTICS/e06_past.html";
my $content = get($url_full);
# get dates:
(my #dates) = $content =~ /(?<=dateFormat\(')\d{4}\/\d{2}\/\d{2}(?='\))/g;
foreach my $date (#dates) { # convert to yyyy-mm-dd
$date =~ s/\//-/ig;
}
my $tree = HTML::Tree->new();
$tree->parse($content);
my $mystr = $tree->as_text;
$mystr =~ s/\xA0//gi; # remove non-breaking spaces
# remove first chunk of text:
$mystr =~
s/^(TSE.*?)IndustryIndexChange ?/IndustryIndexChange\n$dates[0]\n\n/gi;
$mystr =~ s/IndustryIndexChange ?/IndustryIndexChange/ig;
$mystr =~ s/IndustryIndexChange/Industry Index Change\n/ig;
$mystr =~ s/% /%\n/gi; # percent symbol is market for end of line
# indicate breaks between days:
$mystr =~ s/Stock.*?IndustryIndexChange/\nDAY DELIMITER\n/gi;
$mystr =~ s/Exemption from Liability.*$//g; # remove boilerplate at bottom
# and here's the problem regex...
# try to split it:
$mystr =~
s/\n(.*?)(\d{1,4}\.\d{2})(\-?\d{1,3}\.\d{2})(.*?%)\n/\n$1 == $2 == $3 == $4\n/ig;
print $mystr;
It appears to be doing every other one.
My guess is that your records have a single \n between them, but your pattern starts and ends with a \n. So the final \n on the first match consumes the \n that the second match needed to find the second record. The net result is that it picks up every other record.
You might be better off wrapping your pattern in ^ and $ (instead of \n and \n), and using the m flag on the s///.
The problem is that you have \n both at the start and at the end of the regex.
Consider something like this:
$s = 'abababa';
$s =~ s/aba/axa/g;
that will set $s to axabaxa, not axaxaxa, because there are only two non-overlapping occurrences of aba.
My interpretation (pseudocode) -
one = [a-zA-Z,& ]+
two = \d{1,4}.\d\d
three = <<two>>
four = <<two>>%
regex = (<<one>>)(<<two>>)(<<three>>)(<<four>>)
= ([a-zA-Z,& ]+)(\d{1,4}.\d\d)(\d{1,4}.\d\d)(\d{1,4}.\d\d%)
However, you are already presented with 'structured' data in the form of HTML. Why not take advantage of this?
HTML parsing in perl references MOJO
for DOM based parsing in perl, and unless there are serious performance reasons,
I'd highly recommend such an approach.