Perl Regex Problem! - regex

I am reading a string from a file:
2343,0,1,0 ... 500 times ...3
Above is an example of $_ when it is read from a file. It is any number, followed by 500 comma separated 0's/1's then the number 3.
while(<FILE>){
my $string = $_;
chomp($string);
my $a = chop($string);
my $found;
if($string=~m/^[0-9]*\,((0,|1,){$i})/){
$found = $&.$a;
print OTH $found,"\n";
}
}
I am using chop to get the number 3 from the end of the string. Then matching the first number followed by $i occurences of 0, or 1. The problem I'm having is that chop is not working on the string for some reason. In the if statement when I try to concat the match and the chopped number all I get returned is the contents of $&.
I have also tried using my $a = substr $a,-1,1; to get the number 3 and this also hasn't worked.
The thing that's odd is that this code works in Eclipse on Windows, and when I put it onto a Linux server it won't work. Can anyone spot the silly mistake I'm making?

As a rule, I tend always to allow for unseen whitespace in my data. I find that it makes my code more robust expecting that somebody didn't see an extra space at the end of a line or string (as in writing to a log). So I think this would solve your problem:
my ( $a ) = $string =~ /(\S)\s*$/;
Of course, since you know you are looking for a number, it's better to be more precise:
my ( $a ) = $string =~ /(\d+)\s*$/;

Take care of the end of line char… I can not test here but I assume you just chop a newline. Try first to trim your string then chop it. See for example http://www.somacon.com/p114.php

Instead of trying to do it that way, why not use a regexp to pull out everything you need in one go?
my $x = "4123,0,1,0,1,4";
$x =~ /^[0-9]+,((?:0,|1,){4})([0-9]+)/;
print "$1\n$2\n";
Produces:
0,1,0,1,
4
Which is pretty much what you're looking for. Both sets of needed answers are in the match variables.
Note that I included ?: in the front of the 0,1, matching so that it didn't end up in the output match variables.

I'm really not sure what you are trying to achieve here but I've tried the code on Win32 and Solaris and it works. Are you sure $i is the correct number? Might be easier to use * or ?
use strict;
use warnings;
while(<DATA>){
my $string = $_;
chomp($string);
my $a = chop($string);
print "$string\n";
my $found;
if($string=~m/^[0-9]*\,((0,|1,)*)/){
$found = $&.$a;
print $found,"\n";
}
}
__DATA__
2343,0,1,0,0,1,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,0,3

I don't see much reason to use a regex in this case, just use split.
use strict;
use warnings;
use autodie; # open will now die on failure
my %data;
{
# limit the scope of $fh
open my $fh, '<', 'test.data';
while(<$fh>){
chomp;
s(\s+){}g; # remove all spaces
my($number,#bin) = split ',', $_;
# uncomment if you want to throw away the 3
# pop #bin if $bin[-1] == 3;
$data{$number} = \#bin;
}
close $fh;
}
If all you want is the 3
while(<$fh>){
# the .* forces it to look for the last set of numbers
my($last_number) = /.*([0-9]+)/;
}

Related

How to capture multiple words using regex on this particular text?

I'm trying to extract the best paying job titles from this sample text:
Data Scientist
#1 in Best Paying Jobs
5,100 Projected Jobs $250,000 Median Salary 0.5% Unemployment Rate
Programmer
#2 in Best Paying Jobs
4,000 Projected Jobs $240,000 Median Salary 1.0% Unemployment Rate
SAP Module Consultant
#3 in Best Paying Jobs
3,000 Projected Jobs $220,000 Median Salary 0.2% Unemployment Rate
by using the following regex and Perl code.
use File::Glob;
local $/ = undef;
my $file = #ARGV[0];
open INPUT, "<", $file
or die "Couldn't open file $!\n";
my $content = <INPUT>;
my $regex = "^\w+(\w+)*$\n\n#(\d+)";
my #arr_found = ($content =~ m/^\w+(\w+)*$\n\n#(\d+)/g);
close (INPUT);
Q1: The regex finds only the one-word titles*. How to make it find the multiple word titles and how to forward (i.e. how to properly capture) those found titles into the Perl array?
Q2: I defined the regex into a Perl variable and tried to use that variable for the regex operation like:
my #arr_found = ($content =~ m/"$regex"/g);
but it gave error. How to make it?
* When I apply the regex ^\w+(\w+)*$\n\n#(\d+) on Sublime Text 2, it finds only the one word titles.
Why not process line-by-line, simple and easy
use warnings;
use strict;
use feature 'say';
my $file = shift || die "Usage: $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";
my (#jobs, $prev_line);
while (my $line = <$fh>) {
chomp $line;
next if not $line =~ /\S/;
if ($line =~ /^\s*#[0-9]/) {
push #jobs, $prev_line;
}
$prev_line = $line;
}
say for #jobs;
This relies on the requirement that the #N line is the first non-empty line after the jobs title.
It prints
Data Scientist
Programmer
SAP Module Consultant
The question doesn't say whether rankings are wanted as well but there is a hint in the regex that they may be. Then, assuming that the ordering in the file is "correct" you can iterate over the array indices and print elements (titles) with their indices (rank).
Or, to be certain, capture them in the regex, /^\s*#([0-9]+)/. Then you can directly print both the title and its rank, or perhaps store them in a hash with key-value pairs rank => title.
As for the regex, there are a few needed corrections. To compose a regex ahead of matching, what is a great idea, you want the qr operator. To work with multi-line strings you need the /m modifier. (See perlretut.) The regex itself needs fixing. For example
my $regex = qr/^(.+)?(?:\n\s*)+\n\s*#\s*[0-9]/m;
my #titles = $content =~ /$regex/g
what captures a line followed by at least one empty line and then #N on another line.
If the ranking of titles is needed as well then capture it, too, and store in a hash
my $regex = qr/^(.+)?(?:\n\s*)+\n\s*#\s*([0-9]+)/m;
my %jobs = reverse $content =~ /$regex/g;
or maybe better not push it with reverse-ing the list of matches but iterate through pairs instead
my %jobs;
while ($content =~ /$regex/g) {
$jobs{$2} = $1;
}
since with this we can check our "catch" at each iteration, do other processing, etc. Then you can sort the keys to print in order
say "#$_ $jobs{$_}" for sort { $a <=> $b } keys %jobs;
and just in general pick jobs by their rank as needed.
I think that it's fair to say that the regex here is much more complex than the first program.
Answers for your questions:
you are capturing the second word only and you do not allow for space in between them. That's why it won't match e.g. Data Scientist
use the qr// operator to compile regexes with dynamic content. The error stems from the $ in the middle of the regex which Perl regex compiler assumes you got wrong, because $ should only come at the end of a regex.
The following code should achieve what you want. Note the two-step approach:
Find matching text
beginning of a line (^)
one-or-more words separated by white space (\w+(?:\s+\w+)*, no need to capture match)
2 line ends (\n\n)
# followed by a number (\d+)
apply regex multiple times (/g) and treat strings as multiple lines (/m, i.e. ^ will match any beginning of a line in the input text)
Split match at line ends (\n) and extract the 1st and the 3rd field
as we know $match will contain three lines, this approach is much easier than writing another regex.
#!/usr/bin/perl
use strict;
use warnings;
use feature qw(say);
use File::Slurper qw(read_text);
my $input = read_text($ARGV[0])
or die "slurp: $!\n";
my $regex = qr/^(\w+(?:\s+\w+)*\n\n#\d+)/m;
foreach my $match ($input =~ /$regex/g) {
#say $match;
my($title, undef, $rank) = split("\n", $match);
$rank =~ s/^#//;
say "MATCH '${title}' '${rank}'";
}
exit 0;
Test run over the example text you provided in your question.
$ perl dummy.pl dummy.txt
MATCH 'Data Scientist' '1'
MATCH 'Programmer' '2'
MATCH 'SAP Module Consultant' '3'
UNICODE UPDATE: as suggested by #Jan's answer the code can be improved like this:
my $regex = qr/^(\w+(?:\s+\w+)*\R\R#\d+)/m;
...
my($title, undef, $rank) = split(/\R/, $match);
That is probably the more generic approach, as UTF-8 is the default for File::Slurper::read_text() anyway...
You were not taking whitespaces (as in Data Scientist) into account:
^\w+.*$\R+#(\d+)
See a demo on regex101.com.
\R is equal to (?>\r\n|\n|\r|\f|\x0b|\x85) (matches Unicode newlines sequences).

Capitalize all letters after a period and a space

I am trying to capitalize all occurrences of small letters after a period and a space using Perl. This is an example of an input:
...so, that's our art. the 4 of us can now have a dialog. we can have a conversation. we can speak to...
This is the output I'd like to see:
...so, that's our art. The 4 of us can now have a dialog. We can have a conversation. We can speak to...
I have tried multiple regexes without much success--for instance:
$currentLine =~ s/\.\s([a-z])/\. \u$1/g;
or
$currentLine =~ s/([\.!?]\s*)(\w)/$1\U$2/g;
But I don't get the intended result. Help please!
UPDATE
To provide context, as somebody pointed out, the problem may lie elsewhere. The regexes are used in the context of this little script which does a few things besides the step that originated this post. I run it on long SRT files obtained from video closed captions. Thanks again for your help.
#! perl
use strict;
use warnings;
my $filename = $ARGV[0];
open(INPUT_FILE, $filename)
or die "Couldn't open $filename for reading!";
while (<INPUT_FILE>) {
my $currentLine = $_;
# Remove empty lines and lines that start with digits
if ($currentLine =~ /^[\s+|\d+]/){
next;
}
# Remove all carriage returns
$currentLine =~ s/\R$/ /;
# Convert all letters to lower case
$currentLine =~ s/([A-Z])/\l$1/g;
# Capitalize after period <= STEP THAT DOES NOT WORK
$currentLine =~ s/\.\s([a-z])/\. \u$1/g;
print $currentLine;
}
close(INPUT_FILE);
Try this
Use look behind, and capture the pattern and use \U for the change the beginning of the string to uppercase
$str ="...so, that's our art. the 4 of us can now have a dialog. we can have a conversation. we can speak to...";
$str =~ s/(?<=\w\.\s)(\w)/\U$1/g;
print $str
Or else try to \K for keep the word by the substitution.
$str =~ s/\w\.\s\K(\w)/\U$1/g;
One problem is the code:
if ($currentLine =~ /^[\s+|\d+]/){
next;
}
Contrary to the comment, this ignores lines that start with a space, a digit, a plus or a pipe symbol. This is probably sending you down the wrong track. You likely meant to write:
next if /^(\s+$|\d)/;
This skips a line if the whole line is spaces, or if the first character is a digit.
You could simplify your loop, and generalize it, with:
#!/usr/bin/env perl
use strict;
use warnings;
while (<>) {
# Remove empty lines and lines that start with digits. sometimes
next if /^(\s+$|\d)/;
# Remove all carriage returns. forever
s/\R$//;
# Convert all letters to lower case. always
s/([A-Z])/\l$1/g;
# Capitalize after period <=... STEP THAT DOES NOT WORK
s/\.\s([a-z])/\. \u$1/g;
print "$_\n";
}
When run on itself, the output is:
#!/usr/bin/env perl
use strict;
use warnings;
while (<>) {
# remove empty lines and lines that start with digits. Sometimes
next if /^(\s+$|\d)/;
# remove all carriage returns. Forever
s/\r$//;
# convert all letters to lower case. Always
s/([a-z])/\l$1/g;
# capitalize after period <=... Step that does not work
s/\.\s([a-z])/\. \u$1/g;
print "$_\n";
}
Note that for the converted script to work, you'd need to use /gi as the modifier (instead of /g) on the substitute operations. There is plenty of room for improvement in this code, still.
One basic way of testing what's going on is to print everything at each step.
while (<INPUT_FILE>) {
print "## $_";
my $currentLine = $_;
# Remove empty lines and lines that start with digits
if ($currentLine =~ /^[\s+|\d+]/){
print "#SKIP# $currentLine";
next;
}
# Remove all carriage returns
$currentLine =~ s/\R$/ /;
print "#EOL# $currentLine##\n";
# Convert all letters to lower case
$currentLine =~ s/([A-Z])/\l$1/g;
print "#LC# $currentLine##\n";
# Capitalize after period <= STEP THAT DOES NOT WORK
$currentLine =~ s/\.\s([a-z])/\. \u$1/g;
print "#CAPS# $currentLine##\n";
print $currentLine; # Needs a newline!
}
This would have told you what was going on, and going wrong. Note that replacing the generic EOL (\R) with a blank means that the output doesn't end with a newline. That's a bad idea too — and it's why the outputs I generate end with a newline; either the one read from the file, or adding one after that's been removed.
Also, you should avoid ALL_CAPS file handles and use lexical ones — when you need an explicit file handle at all.
open my $fh, '<', $filename
or die "Couldn't open $filename for reading!";
Good work on including the file name in the error message (though adding $! to report the system error message would be a good idea too).
# (char)(char)(char) (char)(char)(char) Uppercase the 3rd
$str =~ s/(\.)(\s)(\w)/$1$2\U$3/g;
print $str
...so, that's our art. the 4 of us can now have a dialog. we can have a conversation. we can speak to...
...so, that's our art. The 4 of us can now have a dialog. We can have a conversation. We can speak to...

Perl Regex Match Text String and Extract Following Number

I have a giant text data file (~100MB) that is a concatenation of a bunch of data files with various header information then some columns of data. Here's the problem. I want to extract a particular number from the header info before each of these data sets and then append that to another column in the data (and write out that data to a different file).
The header info that I want is of the format ex: BGA 1
Where what I want for that extra data column is the # after word BGA. It will be a number between 1 and maybe 20000. I can write the regex to pull the word BGA, but I don't seem to be able to figure out how to just get the digit after it.
To add EXTRA fun, that text "BGA 1" is repeated in each data section TWICE.
Here's what I have so far, which actually doesn't work... I want it to at least print "BGA" everytime it encounters the word BGA, but it prints nothing.... Any help would be appreciated.
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'alldata.txt';
open my $info, $file or die "Could not open $file: $!";
$_="";
while(my $line = <$info>){
if ($line eq "/BGA/"){
print <>,"\n";
}
}
close $file;
if ($line =~ /BGA\s(\d+)/){
#your code
print "BGA number $1 \n";
#your code
}
And $1 variable will have the number you want
If there is more than one BGA per line, you'll need to allow the regex to match more than once per line:
while (my $line = <$info>) {
while ( $line =~ /BGA\s(\d+)/g ) {
print "$1\n";
}
}
This should print out all the BGA numbers as a single column. Without any further information it's hard to answer this any better.
First, a 100 MB file is not giant. Don't be so defeatist. You could even slurp it into memory:
Let's look at the few critical places in your code:
while(my $line = <$info>) {
if ($line eq "/BGA/") {
Your condition $line eq "/BGA/" tests if the line literally consists of the string "/BGA/". But, that can never be true for the line with at least have the input record separator, i.e. the contents of $/ at the end because you did not chomp it. In any case, what you want is to match lines that contain "BGA" anywhere and the proper Perl syntax to do that is
if ($line =~ /BGA/) {
Now, once you fix that, you are going to run into a problem with the following statement:
print <>,"\n";
What you really want is print $line;. The diamond operator, <>, in list context is going to try to slurp from STDIN or any files specified as arguments on the command line. Not a good idea.
Others have pointed out how to match the string "BGA" followed by a digit. For better answers, you are going to need to show examples of input and expected output.

How do I do optional matching in a regular expression using Perl?

I want to extract the size value from a string. The string can be be formatted in one of two ways:
Data-Size: (2000 bytes)
or
file Data-Size: (2082 bytes)
If the string is present in a file, it will appear only once.
So far I have:
#!/usr/bin/perl
use strict;
use warnings;
open FILE, "</tmp/test";
my $input = do { local $/; <FILE> };
my ($length) = $input =~ /(file)?\s*Data-Size: \((\d+) bytes\)/m;
$length or die "could not get data length\n";
print "length: $length\n";
The problem seems to be with making the word file optional. I thought I could do this with:
(file)?
But this seems to be stopping matches when the word file is not present. Also when the word file is there it sets $length to the string "file". I think this is because the parenthesis around file also mean extraction.
So how do I match either of the two strings and extract the size value?
You want the second capture in $length. To do that, you could use
my (undef, $length) = $input =~ /(file)?\s*Data-Size: \((\d+) bytes\)/;
or
my $length = ( $input =~ /(file)?\s*Data-Size: \((\d+) bytes\)/ )[1];
But a much better approach would be to avoid capturing something you're not interested in capturing.
my ($length) = $input =~ /(?:file)?\s*Data-Size: \((\d+) bytes\)/;
Of course, you'd get the same result from
my ($length) = $input =~ /Data-Size: \((\d+) bytes\)/;
By the way, I removed the needless /m. /m changes the meaning of ^ and $, yet neither are present in the pattern.
Just my 2 cents, you can make optional matching other way:
/(file|)\s*Data-Size: ((\d+) bytes)/

How do I use Perl to intersperse characters between consecutive matches with a regex substitution?

The following lines of comma-separated values contains several consecutive empty fields:
$rawData =
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
I want to replace these empty fields with 'N/A' values, which is why I decided to do it via a regex substitution.
I tried this first of all:
$rawdata =~ s/,([,\n])/,N\/A/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which returned
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,,N/A,\n
Not what I wanted. The problem occurs when more than two consecutive commas occur. The regex gobbles up two commas at a time, so it starts at the third comma rather than the second when it rescans the string.
I thought this could be something to do with lookahead vs. lookback assertions, so I tried the following regex out:
$rawdata =~ s/(?<=,)([,\n])|,([,\n])$/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which resulted in:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,,N/A,,N/A\n
That didn't work either. It just shifted the comma-pairings by one.
I know that washing this string through the same regex twice will do it, but that seems crude. Surely, there must be a way to get a single regex substitution to do the job. Any suggestions?
The final string should look like this:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,N/A,N/A,N/A,N/A\n
EDIT: Note that you could open a filehandle to the data string and let readline deal with line endings:
#!/usr/bin/perl
use strict; use warnings;
use autodie;
my $str = <<EO_DATA;
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,
EO_DATA
open my $str_h, '<', \$str;
while(my $row = <$str_h>) {
chomp $row;
print join(',',
map { length $_ ? $_ : 'N/A'} split /,/, $row, -1
), "\n";
}
Output:
E:\Home> t.pl
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A
You can also use:
pos $str -= 1 while $str =~ s{,(,|\n)}{,N/A$1}g;
Explanation: When s/// finds a ,, and replaces it with ,N/A, it has already moved to the character after the last comma. So, it will miss some consecutive commas if you only use
$str =~ s{,(,|\n)}{,N/A$1}g;
Therefore, I used a loop to move pos $str back by a character after each successful substitution.
Now, as #ysth shows:
$str =~ s!,(?=[,\n])!,N/A!g;
would make the while unnecessary.
I couldn't quite make out what you were trying to do in your lookbehind example, but I suspect you are suffering from a precedence error there, and that everything after the lookbehind should be enclosed in a (?: ... ) so the | doesn't avoid doing the lookbehind.
Starting from scratch, what you are trying to do sounds pretty simple: place N/A after a comma if it is followed by another comma or a newline:
s!,(?=[,\n])!,N/A!g;
Example:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
use Data::Dumper;
$Data::Dumper::Useqq = $Data::Dumper::Terse = 1;
print Dumper($rawData);
$rawData =~ s!,(?=[,\n])!,N/A!g;
print Dumper($rawData);
Output:
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n"
You could search for
(?<=,)(?=,|$)
and replace that with N/A.
This regex matches the (empty) space between two commas or between a comma and end of line.
The quick and dirty hack version:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
while ($rawData =~ s/,,/,N\/A,/g) {};
print $rawData;
Not the fastest code, but the shortest. It should loop through at max twice.
Not a regex, but not too complicated either:
$string = join ",", map{$_ eq "" ? "N/A" : $_} split (/,/, $string,-1);
The ,-1 is needed at the end to force split to include any empty fields at the end of the string.