I am trying to capitalize all occurrences of small letters after a period and a space using Perl. This is an example of an input:
...so, that's our art. the 4 of us can now have a dialog. we can have a conversation. we can speak to...
This is the output I'd like to see:
...so, that's our art. The 4 of us can now have a dialog. We can have a conversation. We can speak to...
I have tried multiple regexes without much success--for instance:
$currentLine =~ s/\.\s([a-z])/\. \u$1/g;
or
$currentLine =~ s/([\.!?]\s*)(\w)/$1\U$2/g;
But I don't get the intended result. Help please!
UPDATE
To provide context, as somebody pointed out, the problem may lie elsewhere. The regexes are used in the context of this little script which does a few things besides the step that originated this post. I run it on long SRT files obtained from video closed captions. Thanks again for your help.
#! perl
use strict;
use warnings;
my $filename = $ARGV[0];
open(INPUT_FILE, $filename)
or die "Couldn't open $filename for reading!";
while (<INPUT_FILE>) {
my $currentLine = $_;
# Remove empty lines and lines that start with digits
if ($currentLine =~ /^[\s+|\d+]/){
next;
}
# Remove all carriage returns
$currentLine =~ s/\R$/ /;
# Convert all letters to lower case
$currentLine =~ s/([A-Z])/\l$1/g;
# Capitalize after period <= STEP THAT DOES NOT WORK
$currentLine =~ s/\.\s([a-z])/\. \u$1/g;
print $currentLine;
}
close(INPUT_FILE);
Try this.
Use a lookbehind to anchor the match after a word character, a period and a whitespace character, capture the character that follows, and use \U to change it to uppercase:
$str ="...so, that's our art. the 4 of us can now have a dialog. we can have a conversation. we can speak to...";
$str =~ s/(?<=\w\.\s)(\w)/\U$1/g;
print $str;
Alternatively, use \K to keep everything matched before it out of the substitution, so only the captured character is replaced:
$str =~ s/\w\.\s\K(\w)/\U$1/g;
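As a minimal sketch of the \K approach applied to text like the sample in the question: the sub name capitalize_sentences is made up for illustration, and widening the match to ! and ? as well as the period is an assumption rather than something the question asked for.

```perl
use strict;
use warnings;

# Hypothetical helper: uppercase a lowercase letter that follows
# sentence-ending punctuation and whitespace.
sub capitalize_sentences {
    my ($text) = @_;
    # \K drops the punctuation and whitespace from the matched text,
    # so only the captured letter is replaced.
    $text =~ s/[.!?]\s+\K([a-z])/\u$1/g;
    return $text;
}

my $in = "...so, that's our art. the 4 of us can now have a dialog. we can have a conversation.";
print capitalize_sentences($in), "\n";
```

\u uppercases only the next character, which is all that is needed here; \U would also work because the capture is a single letter.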
One problem is the code:
if ($currentLine =~ /^[\s+|\d+]/){
next;
}
Contrary to the comment, this ignores lines that start with a space, a digit, a plus or a pipe symbol. This is probably sending you down the wrong track. You likely meant to write:
next if /^(\s+$|\d)/;
This skips a line if the whole line is spaces, or if the first character is a digit.
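The difference between the character class and the alternation can be checked with a short sketch. The two helper names and the test lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two patterns so they can be compared.
sub skipped_by_class       { return $_[0] =~ /^[\s+|\d+]/ ? 1 : 0 }
sub skipped_by_alternation { return $_[0] =~ /^(\s+$|\d)/ ? 1 : 0 }

for my $line ("+1 for this\n", "|pipe\n", " indented text\n", "42\n", "\n", "plain text\n") {
    (my $shown = $line) =~ s/\n/\\n/;   # make the newline visible in the report
    printf "%-20s class=%d alternation=%d\n", $shown,
        skipped_by_class($line), skipped_by_alternation($line);
}
```

Inside [...] the +, | and repeated \d are just literal members of the set, which is why lines starting with a plus, a pipe or a single space are skipped by the first pattern but not by the second.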
You could simplify your loop, and generalize it, with:
#!/usr/bin/env perl
use strict;
use warnings;
while (<>) {
# Remove empty lines and lines that start with digits. sometimes
next if /^(\s+$|\d)/;
# Remove all carriage returns. forever
s/\R$//;
# Convert all letters to lower case. always
s/([A-Z])/\l$1/g;
# Capitalize after period <=... STEP THAT DOES NOT WORK
s/\.\s([a-z])/\. \u$1/g;
print "$_\n";
}
When run on itself, the output is:
#!/usr/bin/env perl
use strict;
use warnings;
while (<>) {
# remove empty lines and lines that start with digits. Sometimes
next if /^(\s+$|\d)/;
# remove all carriage returns. Forever
s/\r$//;
# convert all letters to lower case. Always
s/([a-z])/\l$1/g;
# capitalize after period <=... Step that does not work
s/\.\s([a-z])/\. \u$1/g;
print "$_\n";
}
Note that for the converted script to work, you'd need to use /gi as the modifier (instead of /g) on the substitute operations. There is plenty of room for improvement in this code, still.
One basic way of testing what's going on is to print everything at each step.
while (<INPUT_FILE>) {
print "## $_";
my $currentLine = $_;
# Remove empty lines and lines that start with digits
if ($currentLine =~ /^[\s+|\d+]/){
print "#SKIP# $currentLine";
next;
}
# Remove all carriage returns
$currentLine =~ s/\R$/ /;
print "#EOL# $currentLine##\n";
# Convert all letters to lower case
$currentLine =~ s/([A-Z])/\l$1/g;
print "#LC# $currentLine##\n";
# Capitalize after period <= STEP THAT DOES NOT WORK
$currentLine =~ s/\.\s([a-z])/\. \u$1/g;
print "#CAPS# $currentLine##\n";
print $currentLine; # Needs a newline!
}
This would have told you what was going on, and going wrong. Note that replacing the generic EOL (\R) with a blank means that the output doesn't end with a newline. That's a bad idea too, and it's why the outputs I generate end with a newline: either the one read from the file, or one added after the original has been removed.
Also, you should avoid ALL_CAPS file handles and use lexical ones, when you need an explicit file handle at all.
open my $fh, '<', $filename
or die "Couldn't open $filename for reading!";
Good work on including the file name in the error message (though adding $! to report the system error message would be a good idea too).
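Putting these points together, here is one possible sketch of the whole clean-up. It rests on an assumption the question does not confirm: that caption files break sentences across lines, so the period at the end of one line and the first letter of the next are never seen by the same per-line substitution. The sub name clean_captions and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: clean caption lines, join them, then capitalise
# sentence starts on the joined text so matches can span line breaks.
sub clean_captions {
    my @lines = @_;
    my @kept;
    for my $line (@lines) {
        next if $line =~ /^(\s*$|\d)/;   # skip blank lines and lines starting with a digit
        $line =~ s/\R\z//;               # strip any kind of line ending
        push @kept, lc $line;            # lower-case the whole line
    }
    my $text = join ' ', @kept;
    $text =~ s/[.!?]\s+\K([a-z])/\u$1/g; # capitalise after sentence-ending punctuation
    return $text;
}

my @sample = ("1\n", "00:00:01,000 --> 00:00:02,000\n",
              "We can have a dialog.\n", "\n", "we can speak.\r\n");
print clean_captions(@sample), "\n";
```

To run it on a real file, the lines could come from a lexical handle opened with open my $fh, '<', $filename or die "Couldn't open $filename: $!"; and be passed in as clean_captions(<$fh>).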
# (char)(char)(char) (char)(char)(char) Uppercase the 3rd
$str =~ s/(\.)(\s)(\w)/$1$2\U$3/g;
print $str
...so, that's our art. the 4 of us can now have a dialog. we can have a conversation. we can speak to...
...so, that's our art. The 4 of us can now have a dialog. We can have a conversation. We can speak to...
I'm trying to extract the best paying job titles from this sample text:
Data Scientist
#1 in Best Paying Jobs
5,100 Projected Jobs $250,000 Median Salary 0.5% Unemployment Rate
Programmer
#2 in Best Paying Jobs
4,000 Projected Jobs $240,000 Median Salary 1.0% Unemployment Rate
SAP Module Consultant
#3 in Best Paying Jobs
3,000 Projected Jobs $220,000 Median Salary 0.2% Unemployment Rate
by using the following regex and Perl code.
use File::Glob;
local $/ = undef;
my $file = @ARGV[0];
open INPUT, "<", $file
or die "Couldn't open file $!\n";
my $content = <INPUT>;
my $regex = "^\w+(\w+)*$\n\n#(\d+)";
my @arr_found = ($content =~ m/^\w+(\w+)*$\n\n#(\d+)/g);
close (INPUT);
Q1: The regex finds only the one-word titles*. How to make it find the multiple word titles and how to forward (i.e. how to properly capture) those found titles into the Perl array?
Q2: I defined the regex into a Perl variable and tried to use that variable for the regex operation like:
my @arr_found = ($content =~ m/"$regex"/g);
but it gave error. How to make it?
* When I apply the regex ^\w+(\w+)*$\n\n#(\d+) on Sublime Text 2, it finds only the one word titles.
Why not process line-by-line, simple and easy
use warnings;
use strict;
use feature 'say';
my $file = shift || die "Usage: $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";
my (@jobs, $prev_line);
while (my $line = <$fh>) {
chomp $line;
next if not $line =~ /\S/;
if ($line =~ /^\s*#[0-9]/) {
push @jobs, $prev_line;
}
$prev_line = $line;
}
say for @jobs;
This relies on the requirement that the #N line is the first non-empty line after the jobs title.
It prints
Data Scientist
Programmer
SAP Module Consultant
The question doesn't say whether rankings are wanted as well but there is a hint in the regex that they may be. Then, assuming that the ordering in the file is "correct" you can iterate over the array indices and print elements (titles) with their indices (rank).
Or, to be certain, capture them in the regex, /^\s*#([0-9]+)/. Then you can directly print both the title and its rank, or perhaps store them in a hash with key-value pairs rank => title.
As for the regex, a few corrections are needed. To compose a regex ahead of matching, which is a great idea, you want the qr operator. To work with multi-line strings you need the /m modifier. (See perlretut.) The regex itself needs fixing. For example
my $regex = qr/^(.+)?(?:\n\s*)+\n\s*#\s*[0-9]/m;
my @titles = $content =~ /$regex/g;
which captures a line followed by at least one empty line and then #N on another line.
If the ranking of titles is needed as well then capture it, too, and store in a hash
my $regex = qr/^(.+)?(?:\n\s*)+\n\s*#\s*([0-9]+)/m;
my %jobs = reverse $content =~ /$regex/g;
or, perhaps better, don't rely on reversing the list of matches but iterate through the pairs instead
my %jobs;
while ($content =~ /$regex/g) {
$jobs{$2} = $1;
}
since with this we can check our "catch" at each iteration, do other processing, etc. Then you can sort the keys to print in order
say "#$_ $jobs{$_}" for sort { $a <=> $b } keys %jobs;
and just in general pick jobs by their rank as needed.
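For reference, a self-contained sketch of the rank => title approach. It uses a simpler pattern than the one above, and it assumes the real input has a blank line between each title and its #N line, which the flattened sample in the question does not show. The sub name rank_titles is made up.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: return a hash of rank => title.
sub rank_titles {
    my ($text) = @_;
    # A non-empty line not starting with '#', then a newline, optional
    # blank lines, and a line starting with '#' and a number.
    my $regex = qr/^([^\n#][^\n]*)\n\s*#([0-9]+)/m;
    my %jobs;
    while ($text =~ /$regex/g) {
        $jobs{$2} = $1;
    }
    return %jobs;
}

# Single-quoted heredoc so that $250,000 is not interpolated.
my $content = <<'END';
Data Scientist

#1 in Best Paying Jobs

5,100 Projected Jobs $250,000 Median Salary 0.5% Unemployment Rate

Programmer

#2 in Best Paying Jobs

4,000 Projected Jobs $240,000 Median Salary 1.0% Unemployment Rate
END

my %jobs = rank_titles($content);
say "#$_ $jobs{$_}" for sort { $a <=> $b } keys %jobs;
```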
I think that it's fair to say that the regex here is much more complex than the first program.
Answers to your questions:
You are capturing only the second word, and you do not allow for spaces between words. That's why it won't match e.g. Data Scientist.
Use the qr// operator to build a regex ahead of time. The error stems from putting the pattern in a double-quoted string: there $\ is interpolated as Perl's output record separator variable, and escapes such as \w and \d lose their backslash before the regex engine ever sees them. The double quotes inside m/"$regex"/ are also matched literally.
The following code should achieve what you want. Note the two-step approach:
1. Find matching text
   - beginning of a line (^)
   - one-or-more words separated by white space (\w+(?:\s+\w+)*, no need to capture match)
   - 2 line ends (\n\n)
   - # followed by a number (\d+)
   - apply regex multiple times (/g) and treat strings as multiple lines (/m, i.e. ^ will match any beginning of a line in the input text)
2. Split match at line ends (\n) and extract the 1st and the 3rd field
   - as we know $match will contain three lines, this approach is much easier than writing another regex.
#!/usr/bin/perl
use strict;
use warnings;
use feature qw(say);
use File::Slurper qw(read_text);
my $input = read_text($ARGV[0])
or die "slurp: $!\n";
my $regex = qr/^(\w+(?:\s+\w+)*\n\n#\d+)/m;
foreach my $match ($input =~ /$regex/g) {
#say $match;
my($title, undef, $rank) = split("\n", $match);
$rank =~ s/^#//;
say "MATCH '${title}' '${rank}'";
}
exit 0;
Test run over the example text you provided in your question.
$ perl dummy.pl dummy.txt
MATCH 'Data Scientist' '1'
MATCH 'Programmer' '2'
MATCH 'SAP Module Consultant' '3'
UNICODE UPDATE: as suggested by @Jan's answer the code can be improved like this:
my $regex = qr/^(\w+(?:\s+\w+)*\R\R#\d+)/m;
...
my($title, undef, $rank) = split(/\R/, $match);
That is probably the more generic approach, as UTF-8 is the default for File::Slurper::read_text() anyway...
You were not taking whitespaces (as in Data Scientist) into account:
^\w+.*$\R+#(\d+)
See a demo on regex101.com.
\R is roughly equivalent to (?>\r\n|\n|\r|\f|\x0b|\x85), plus the Unicode line and paragraph separators (it matches Unicode newline sequences).
I am using a Perl script to remove all stopwords from a text. The stopwords are stored one per line. I am using the Mac OS X command line and Perl is installed correctly.
This script is not working properly and has a boundary problem.
#!/usr/bin/env perl -w
# usage: script.pl words text >newfile
use English;
# poor man's argument handler
open(WORDS, shift @ARGV) || die "failed to open words file: $!";
open(REPLACE, shift @ARGV) || die "failed to open replacement file: $!";
my @words;
# get all words into an array
while ($_=<WORDS>) {
chop; # strip eol
push @words, split; # break up words on line
}
# (optional)
# sort by length (makes sure smaller words don't trump bigger ones); ie, "then" vs "the"
@words=sort { length($b) <=> length($a) } @words;
# slurp text file into one variable.
undef $RS;
$text = <REPLACE>;
# now for each word, do a global search-and-replace; make sure only words are replaced; remove possible following space.
foreach $word (@words) {
$text =~ s/\b\Q$word\E\s?//sg;
}
# output "fixed" text
print $text;
sample.txt
$ cat sample.txt
how about i decide to look at it afterwards what
across do you think is it a good idea to go out and about i
think id rather go up and above
stopwords.txt
I
a
about
an
are
as
at
be
by
com
for
from
how
in
is
it
..
Output:
$ ./remove.pl stopwords.txt sample.txt
i decide look fterwards cross do you think good idea go out d i
think id rather go up d bove
As you can see, it turns afterwards into fterwards by removing the leading a. I think it's a regex problem. Can somebody please help me fix this? Thanks for all the help.
Use word-boundary on both sides of your $word. Currently, you are only checking for it at the beginning.
You won't need the \s? condition with the \b in place:
$text =~ s/\b\Q$word\E\b//sg;
Your regex is not strict enough.
$text =~ s/\b\Q$word\E\s?//sg;
When $word is a, the command is effectively s/\ba\s?//sg. This means: remove every a that starts a word, together with at most one following whitespace character. In afterwards, this will successfully match the first a.
You can make the match stricter by ending the word with another \b, like this:
$text =~ s/\b\Q$word\E\b\s?//sg;
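As a sketch of the same fix applied in a single pass: all stopwords are joined into one alternation instead of looping over them. The sub name remove_stopwords is made up, and the /i flag is an assumption added so that a stopword listed as I also removes a lowercase i; drop it for case-sensitive matching.

```perl
use strict;
use warnings;

# Hypothetical helper: delete whole-word stopwords from a text.
sub remove_stopwords {
    my ($text, @words) = @_;
    return $text unless @words;          # an empty alternation would match everywhere
    # Longest first, each word escaped so it is matched literally.
    my $alternation = join '|',
        map  { quotemeta }
        sort { length($b) <=> length($a) } @words;
    # \b on both sides so that "a" no longer matches inside "afterwards".
    $text =~ s/\b(?:$alternation)\b\s?//gi;
    return $text;
}

my @stopwords = qw(I a about an at how is it);
print remove_stopwords("how about i decide to look at it afterwards", @stopwords), "\n";
```

With \b on both sides, the longest-first sort is no longer strictly required, but it is harmless and mirrors the original script.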
I'm trying to find occurrences of BLOB_SMUGHO in the file test.out, searching from the bottom of the file. If one is found, I want to return the chunk of data I'm interested in, which is delimited by the string "2014.10".
I'm getting Use of uninitialized value $cc in pattern match (m//) at
What is wrong with this script?
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);
use File::ReadBackwards;
my $find = "BLOB_SMUGHO";
my $chnkdelim = "\n[" . strftime "%Y.%m", localtime;
my $fh = File::ReadBackwards->new('test.out', $chnkdelim, 0) or die "err-file: $!\n";
while ( defined(my $line = $fh->readline) ) {
if(my $cc =~ /$find/){
print $cc;
}
}
close($fh);
In case if this helps, here is a sample content of test.out
2014.10.31 lots and
lots of
gibbrish
2014.10.31 which I'm not
interested
in. It
also
2014.10.31 spans
across thousands of
lines and somewhere in the middle there will be
2014.10.31
this precious word BLOB_SMUGHO and
2014.10.31 certain other
2014.10.31 words
2014.10.31
this precious word BLOB_SMUGHO and
2014.10.31
this precious word BLOB_SMUGHO and
which
I
will
be
interested
in.
And I'm expecting to capture all the multiple occurrences of the chunk of the text from bottom of the file.
2014.10.31
this precious word BLOB_SMUGHO and
First, you have written your match incorrectly due to misunderstanding the =~ operator:
if(my $cc =~ /$find/){ # incorrect, like saying if(undef matches something)
If you want to match what is in $line against the pattern between /.../ then do:
if($line =~ /$find/) {
The match operator expects a value on its left side as well as a pattern on its right. You were using it like an assignment operator.
If you need to capture the match(es) into a variable or list, then add it to the left of an equal sign:
if(my ($cc) = $line =~ /($find)/) { # wrap $cc in () for list context; the parentheses in the pattern capture the match
By the way, I think you are better off writing:
if($line =~ /$find/) {
print $line;
or, if you want to print only what you matched:
print $&;
Since you aren't capturing a substring, it doesn't really matter here.
Now, as to how to match everything between two patterns, the task is easier if you don't match line by line, but match across newlines using the /s modifier.
In Perl, you can set the record separator to undef and use slurp mode.
local $/ = undef;
my $s = <>; # read all lines into $s
Now to scan $s for patterns
while($s =~ /(START.*?STOP)/gsm) { print "$1\n"; } # print the pattern inclusive of START and STOP
Or to capture between START and STOP
while($s =~ /START(.*?)STOP/gsm) { print "$1\n"; } # print the pattern between of START and STOP
So in your case the start pattern is 2014.10.31 and stop is BLOB_SMUGHO
while($s =~ /(2014\.10\.31.*?BLOB_SMUGHO)/gsm) {
print "$1\n";
}
NOTE: Regex modifiers in Perl come after the last /. Here I use /gsm: /g for global matching (get multiple matches in a loop by remembering the last position), /s so that . also matches newlines, and /m for multiline anchors.
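An alternative sketch for the "chunks from the bottom of the file" requirement: a non-greedy START.*?STOP match begins at the first date stamp it finds, so it can swallow several unrelated chunks before reaching the keyword. Splitting the slurped text in front of each date-stamped line and filtering the chunks avoids that. The sub name find_chunks and the YYYY.MM.DD pattern are assumptions based on the sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: return the chunks that contain $needle, last chunk first.
sub find_chunks {
    my ($text, $needle) = @_;
    # Split in front of every line that starts with a YYYY.MM.DD stamp;
    # the lookahead keeps the stamp at the start of its chunk.
    my @chunks = split /^(?=\d{4}\.\d{2}\.\d{2})/m, $text;
    return grep { /\Q$needle\E/ } reverse @chunks;
}

my $s = "2014.10.31 noise\nmore noise\n"
      . "2014.10.31\nfirst BLOB_SMUGHO and\n"
      . "2014.10.31 words\n"
      . "2014.10.31\nsecond BLOB_SMUGHO and\nmore\n";

print "---\n$_" for find_chunks($s, 'BLOB_SMUGHO');
```

This reads the whole file into memory, so for very large logs the File::ReadBackwards approach from the question may still be preferable.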
I have a string that is read from a text file, but in Ubuntu Linux, and I try to delete its newline character from the end.
I have tried everything I could find. With s/\n|\r/-/ (to check whether it finds and replaces any newline character) the substitution is made, but the output still moves to the next line when I print it. Moreover, when I use chomp or chop, the string is deleted completely. I could not find any other solution. How can I fix this problem?
use strict;
use warnings;
use v5.12;
use utf8;
use encoding "utf-8";
open(MYINPUTFILE, "<:encoding(UTF-8)", "file.txt");
my @strings;
my @fileNames;
my @erroredFileNames;
my $delimiter;
my $extensions;
my $id;
my $surname;
my $name;
while (<MYINPUTFILE>)
{
my ($line) = $_;
my ($line2) = $_;
if ($line !~ /^(((\X|[^\W_ ])+)(.docx)(\n|\r))/g) {
#chop($line2);
$line2 =~ s/^\n+//;
print $line2 . " WRONG FORMAT!\n";
}
else {
#print "INSERTED:".$13."\n";
my($id) = $13;
my($name) = $2;
print $name . "\t" . $id . "\n";
unshift(@fileNames, $line2);
unshift(@strings, $line2 =~ /[^\W_]+/g);
}
}
close(MYINPUTFILE);
The correct way to remove Unicode linebreak graphemes, including CRLF pairs, is using the \R regex metacharacter, introduced in v5.10.
The use encoding pragma is strongly deprecated. You should either use the use open pragma, or use an encoding in the mode argument on 3-arg open, or use binmode.
use v5.10; # minimal Perl version for \R support
use utf8; # source is in UTF-8
use warnings qw(FATAL utf8); # encoding errors raise exceptions
use open qw(:utf8 :std); # default open mode, `backticks`, and std{in,out,err} are in UTF-8
while (<>) {
s/\R\z//;
...
}
You are probably experiencing a line ending from a Windows file causing issues. For example, a string such as "foo bar\n", would actually be "foo bar\r\n". When using chomp on Ubuntu, you would be removing whatever is contained in the variable $/, which would be "\n". So, what remains is "foo bar\r".
This is a subtle, but very common error. For example, if you print "foo bar\r" and add a newline, you would not notice the error:
my $var = "foo bar\r\n";
chomp $var;
print "$var\n"; # Remove and put back newline
But when you concatenate the string with another string, the second string overwrites the first on screen, because \r moves the cursor back to the beginning of the line. For example:
print "$var: WRONG\n";
It would effectively be "foo bar\r: WRONG\n", but the text after \r would cause the following text to wrap back on top of the first part:
foo bar\r # \r resets position
: WRONG\n # Second line prints and overwrites
This is more obvious when the first line is longer than the second. For example, try the following:
perl -we 'print "foo bar\rbaz\n"'
And you will get the output:
baz bar
The solution is to remove the bad line endings. You can do this with the dos2unix command, or directly in Perl with:
$line =~ s/[\r\n]+$//;
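A small sketch comparing chomp with a regex-based strip on a Windows-style line. The sub name strip_eol is made up, and \R needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: remove one trailing line ending of any kind.
sub strip_eol {
    my ($s) = @_;
    $s =~ s/\R\z//;    # \R matches "\r\n" as a unit, as well as "\n" or "\r"
    return $s;
}

my $var = "foo bar\r\n";

my $chomped = $var;
chomp $chomped;        # removes only "\n" (the default value of $/)

printf "after chomp:     %d characters\n", length $chomped;         # the "\r" is still there
printf "after strip_eol: %d characters\n", length strip_eol($var);
```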
Also, be aware that your other code is somewhat horrific. What do you for example think that $13 contains? That'd be the string captured by the 13th parenthesis in your previous regular expression. I'm fairly sure that value will always be undefined, because you do not have 13 parentheses.
You declare two sets of $id and $name: one at the top of the script and one inside the loop. This is very poor practice, IMO. Only declare variables within the scope they need, and never just bunch all your declarations at the top of your script, unless you explicitly want them to be global to the file.
Why use $line and $line2 when they have the same value? Just use $line.
And seriously, what is up with this:
if ($line !~ /^(((\X|[^\W_ ])+)(.docx)(\n|\r))/g) {
That looks like an attempt to obfuscate, no offence. Three nested negations and a bunch of unnecessary parentheses?
First off, since it is an if-else, just swap the branches and invert the test. Second, [^\W_] is a double negation and rather confusing. Why not just use [A-Za-z0-9] (if you only need ASCII)? You can split this up to make it easier to parse:
if ($line =~ /^(.+)(\.docx)\s*$/) {
my $pre = $1;
my $ext = $2;
You can wipe the linebreaks with something like this:
$line =~ s/[\n\r]//g;
When you do that though, you'll need to change the regex in your if statement to not look for them. I also don't think you want a /g in your if. You really shouldn't have a $line2 either.
I also wouldn't do this type of thing:
print $line2." WRONG FORMAT!\n";
You can do
print "$line2 WRONG FORMAT!\n";
... instead. Also, print accepts a list, so instead of concatenating your strings, you can just use commas.
You can do something like this (note the /d modifier; without it, tr only counts the characters):
$line =~ tr/\n//d;
But really chomp should work:
while (<filehandle>){
chomp;
...
}
Also s/\n|\r// only replaces the first occurrence of \r or \n. If you wanted to replace all occurrences you would want the global modifier at the end s/\r|\n//g.
Note: if you're including \r for windows it usually ends its line as \r\n so you would want to replace both (e.g. s/(?:\r\n|\n)//), of course the statement above (s/\r|\n//g) with the global modifier would take care of that anyways.
$variable = join('',split(/\n/,$variable))
I am reading a string from a file:
2343,0,1,0 ... 500 times ...3
Above is an example of $_ when it is read from a file. It is any number, followed by 500 comma separated 0's/1's then the number 3.
while(<FILE>){
my $string = $_;
chomp($string);
my $a = chop($string);
my $found;
if($string=~m/^[0-9]*\,((0,|1,){$i})/){
$found = $&.$a;
print OTH $found,"\n";
}
}
I am using chop to get the number 3 from the end of the string, then matching the first number followed by $i occurrences of 0, or 1. The problem I'm having is that chop is not working on the string for some reason. In the if statement, when I try to concatenate the match and the chopped number, all I get back is the contents of $&.
I have also tried using my $a = substr $a,-1,1; to get the number 3 and this also hasn't worked.
The thing that's odd is that this code works in Eclipse on Windows, and when I put it onto a Linux server it won't work. Can anyone spot the silly mistake I'm making?
As a rule, I tend always to allow for unseen whitespace in my data. I find that it makes my code more robust expecting that somebody didn't see an extra space at the end of a line or string (as in writing to a log). So I think this would solve your problem:
my ( $a ) = $string =~ /(\S)\s*$/;
Of course, since you know you are looking for a number, it's better to be more precise:
my ( $a ) = $string =~ /(\d+)\s*$/;
Take care of the end-of-line character. I cannot test here, but I assume you are just chopping a newline. Try trimming your string first, then chop it. See for example http://www.somacon.com/p114.php
Instead of trying to do it that way, why not use a regexp to pull out everything you need in one go?
my $x = "4123,0,1,0,1,4";
$x =~ /^[0-9]+,((?:0,|1,){4})([0-9]+)/;
print "$1\n$2\n";
Produces:
0,1,0,1,
4
Which is pretty much what you're looking for. Both sets of needed answers are in the match variables.
Note that I included ?: in the front of the 0,1, matching so that it didn't end up in the output match variables.
I'm really not sure what you are trying to achieve here but I've tried the code on Win32 and Solaris and it works. Are you sure $i is the correct number? Might be easier to use * or ?
use strict;
use warnings;
while(<DATA>){
my $string = $_;
chomp($string);
my $a = chop($string);
print "$string\n";
my $found;
if($string=~m/^[0-9]*\,((0,|1,)*)/){
$found = $&.$a;
print $found,"\n";
}
}
__DATA__
2343,0,1,0,0,1,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,0,3
I don't see much reason to use a regex in this case, just use split.
use strict;
use warnings;
use autodie; # open will now die on failure
my %data;
{
# limit the scope of $fh
open my $fh, '<', 'test.data';
while(<$fh>){
chomp;
s(\s+){}g; # remove all spaces
my($number,@bin) = split ',', $_;
# uncomment if you want to throw away the 3
# pop @bin if $bin[-1] == 3;
$data{$number} = \@bin;
}
close $fh;
}
If all you want is the 3
while(<$fh>){
# the greedy .* pushes the match to the last run of digits;
# \b keeps a multi-digit number from being cut down to its final digit
my($last_number) = /.*\b([0-9]+)/;
}
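A sketch that combines the split approach with the advice about unseen whitespace. It assumes the Linux/Windows difference comes from "\r\n" line endings, which the question does not confirm, and the sub name parse_record is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record into its leading number,
# the 0/1 flags, and the final number.
sub parse_record {
    my ($line) = @_;
    $line =~ s/\s+\z//;                  # drops "\n", "\r\n" and stray trailing spaces
    my ($id, @fields) = split /,/, $line;
    my $last = pop @fields;              # the number that ends the record
    return ($id, \@fields, $last);
}

my ($id, $bits, $last) = parse_record("2343,0,1,0,3\r\n");
print "id=$id bits=@$bits last=$last\n";
```

With a "\r\n" line ending, chomp removes the "\n" and chop then removes the "\r" instead of the final digit, which would explain why the same code behaves differently on the two systems.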