How do I use a Perl regExp to count sentences? - regex

I've struggled with regexps in Perl from the start for some reason. I wrote a quick script to count the sentences in some input text, but it won't work. I just get the number 1 back at the end, and I know the specified file contains several sentences, so the count should be higher. I can't see the issue...
#!C:\strawberry\perl\bin\perl.exe
#strict
#diagnostics
#warnings
$count = 0;
$file = "c:/programs/lorem.txt";
open(IN, "<$file") || die "Sorry, the file failed to open: $!";
while($line = <IN>)
{
if($line =~ m/^[A-Z]/)
{
$count++;
}
}
close(IN);
print("Sentence count was: ($count)");
The file lorem.txt is here......
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim. Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus. Phasellus viverra nulla ut metus varius laoreet. Quisque rutrum. Aenean imperdiet. Etiam ultricies nisi vel augue. Curabitur ullamcorper ultricies nisi. Nam eget dui. Etiam rhoncus. Maecenas tempus, tellus eget condimentum rhoncus, sem quam semper libero, sit amet adipiscing sem neque sed ipsum. Nam quam nunc, blandit vel, luctus pulvinar, hendrerit id, lorem. Maecenas nec odio et ante tincidunt tempus. Donec vitae sapien ut libero venenatis faucibus. Nullam quis ante. Etiam sit amet orci eget eros faucibus tincidunt. Duis leo. Sed fringilla mauris sit amet nibh. Donec sodales sagittis magna. Sed consequat, leo eget bibendum sodales, augue velit cursus nunc,

I don't know what's in your lorem.txt, but the code that you've given is not counting sentences. It's counting lines, and furthermore it's counting lines that begin with a capital letter.
This regex:
/^[A-Z]/
will only match at the beginning of a line, and only if the first character on that line is a capital letter. So a line that looks like "it. And then we went..." will not be matched.
If you want to match all capital letters, just remove the ^ from the beginning of the regex.
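As a rough illustration of counting sentences rather than lines, here is a minimal sketch that counts sentence terminators instead of capital letters. The sub name `count_sentences` and the rule "one or more of . ! ? followed by whitespace or the end of the text" are assumptions made for this example, not something from the question, and the heuristic will miscount abbreviations such as "e.g.".

```perl
use strict;
use warnings;

# Count sentence terminators: one or more of . ! ? followed by
# whitespace or the end of the text. This is only a heuristic.
sub count_sentences {
    my ($text) = @_;
    # The "= () =" idiom forces list context, so we get the number of matches.
    my $count = () = $text =~ /[.!?]+(?=\s|\z)/g;
    return $count;
}

my $sample = "Lorem ipsum dolor sit amet. Aenean massa. Cum sociis natoque,";
print "Sentence count was: ", count_sentences($sample), "\n";
```

Because the whole text is scanned with /g, it does not matter how many sentences share a line.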

This does not answer your specific question about regexp, but you could consider using a CPAN module: Text::Sentence. You can look at its source code to see how it defines a sentence.
use warnings;
use strict;
use Data::Dumper;
use Text::Sentence qw(split_sentences);
my $text = <<EOF;
One sentence. Here is another.
And yet another.
EOF
my @sentences = split_sentences($text);
print Dumper(\@sentences);
__END__
$VAR1 = [
'One sentence.',
'Here is another.',
'And yet another.'
];
A google search also turned up: Lingua::EN::Sentence

You are currently counting all lines that begin with a capital letter. Perhaps you intend to count all words that start with a capital letter? If so, try:
m/\W[A-Z]/
(Although this is not a robust count of sentences)
On another note, there is no need to do the file manipulation explicitly. perl does a really good job of that for you. Try this:
$ARGV[ 0 ] = "c:/programs/lorem.txt" unless @ARGV;
while( $line = <> ) {
...
If you do insist on doing an explicit open/close, it is considered bad practice to use bareword filehandles. In other words, instead of "open IN...", use a lexical filehandle and the three-argument form: "open my $fh, '<', $file_name;"
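For reference, a minimal sketch of an explicit open with a lexical filehandle and three-argument open. The sub name `slurp_file` is made up for this example, and the demonstration reads from an in-memory string (Perl lets open() take a reference to a scalar) so that it runs without a real file.

```perl
use strict;
use warnings;

# Read a whole file using a lexical filehandle and three-argument open.
sub slurp_file {
    my ($file_name) = @_;
    open my $fh, '<', $file_name
        or die "Sorry, the file failed to open: $!";
    local $/;            # undefine the record separator: read everything at once
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Demonstration only: a reference to a string acts as an in-memory file.
# A real path such as "c:/programs/lorem.txt" is passed the same way.
my $fake_file = "First line.\nSecond line.\n";
my $text = slurp_file(\$fake_file);
print length($text), " characters read\n";
```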

Related

Remove page numbers at bottom from PDF copy text

I copied some paragraphs from a PDF file and want to remove the page numbers at the bottom:
Duis diam dolor, iaculis a efficitur vitae, feugiat sed diam. Phasellus porta dolor non mauris
12
imperdiet ante. Etiam volutpat rhoncus massa, ut laoreet elit suscipit sed.
13
Integer quis ultrices turpis. Nunc molestie euismod aliquet.
14
If a line does not end with a dot, it should be merged with the line that follows it. I don't know how to combine the two pieces into one paragraph.
So the final result should look like:
Duis diam dolor, iaculis a efficitur vitae, feugiat sed diam. Phasellus porta dolor non mauris imperdiet ante. Etiam volutpat rhoncus massa, ut laoreet elit suscipit sed.
Integer quis ultrices turpis. Nunc molestie euismod aliquet.
I tried
[^.]\n([0-9]+)
on Linux bash, but no luck.
This might work for you (GNU sed):
sed -E ':a;N;s/\n[0-9]+$//;ta;s/([^.])\n/\1 /;ba' file
Append lines, removing any lines with page numbers and any newlines if the previous line does not end with a period.
This will remove any page numbers, join lines inside a paragraph and leave previous paragraphs intact.
This will produce the output you posted from the sample input you provided:
$ awk '!/^[0-9]+$/{buf = (buf == "" ? "" : buf " ") $0} /\.$/{print buf; buf=""}' file
Duis diam dolor, iaculis a efficitur vitae, feugiat sed diam. Phasellus porta dolor non mauris imperdiet ante. Etiam volutpat rhoncus massa, ut laoreet elit suscipit sed.
Integer quis ultrices turpis. Nunc molestie euismod aliquet.
It'll behave the same way using any awk in any shell on every UNIX box.
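If you would rather do this in Perl, the same idea can be sketched as follows. This is only a sketch under a few assumptions: a line made up solely of digits is a page number, a line ending in a period closes a paragraph, and the sub name `strip_page_numbers` is invented for the example.

```perl
use strict;
use warnings;

# Drop digit-only lines (assumed to be page numbers), join the remaining
# lines with single spaces, and end a paragraph after every line that
# ends with a period.
sub strip_page_numbers {
    my ($text) = @_;
    my (@paragraphs, @current);
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*[0-9]+\s*$/;    # page number
        $line =~ s/^\s+|\s+$//g;              # trim so joins use one space
        next if $line eq '';
        push @current, $line;
        if ($line =~ /\.$/) {                 # paragraph ends here
            push @paragraphs, join(' ', @current);
            @current = ();
        }
    }
    push @paragraphs, join(' ', @current) if @current;   # trailing fragment
    return join("\n", @paragraphs);
}

my $input = "non mauris\n12\nimperdiet ante. Etiam sed.\n13\nInteger quis. Nunc aliquet.\n14\n";
print strip_page_numbers($input), "\n";
```

Note that a genuine text line consisting only of a number would be dropped as well, so check the result against the original.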

Match a string pattern in line and replace line with pattern in Notepad++

I want to search for a string pattern in a line and if found replace the whole line with the matched string pattern.
My string pattern starts with 2 alpha characters and followed with either 5 or 6 numeric characters. Ex. HR12345 or HR123456
Here is a sample of what the lines with the pattern look like.
Class cum accumsan. In. Pellentesque nec magna interdum fusce metus, massa aliquam HR032145
Amet commodo arcu, felis orci Per. Facilisis blandit rhoncus hac porttitor ut duis eu HR32145
Mattis quis magna, suspendisse HR32146 aucibus vel, fames Nonummy molestie penatibus ad.
Nascetur mattis ad egestas et nec HR032111 Penatibus posuere. Posuere.
Inceptos consectetuer neque nullam HR032114. rutrum Eleifend.
Netus tortor conubia parturient sapien interdum adipiscing sociis luctus integer HR032113
HR032112 Mattis erat a ante. Rutrum. Mattis risus fames. Euismod sapien morbi habitasse.
Platea sapien vitae Risus. Erat dictum elit dapibus convallis.
Facilisis ut dis morbi integer fusce dolor Et class Primis iaculis.
Aptent per risus phasellus HR032188
After search replace it should look like
HR032145
HR32145
HR32146
HR032111
HR032114
HR032113
HR032112
Platea sapien vitae Risus. Erat dictum elit dapibus convallis.
Facilisis ut dis morbi integer fusce dolor Et class Primis iaculis.
HR032188
Try the following simple find and replace:
Find:
^.*(HR\d+).*$
Replace:
$1
This replacement will only happen with lines containing HR followed by one or more digits. Hence, the lines which do not have this pattern will not even match, and no replacement will take place there.
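Outside Notepad++, the same replacement can be sketched in Perl with a stricter pattern that follows the description in the question (two letters, then exactly five or six digits). The sub name `keep_codes` and the use of \b word boundaries are assumptions for this example.

```perl
use strict;
use warnings;

# Replace each line that contains a code (two letters followed by five
# or six digits) with the code alone; other lines pass through unchanged.
sub keep_codes {
    my @out;
    for my $line (@_) {
        if ($line =~ /\b([A-Za-z]{2}[0-9]{5,6})\b/) {
            push @out, $1;       # keep only the matched code
        }
        else {
            push @out, $line;    # no code on this line
        }
    }
    return @out;
}

my @sample = (
    'massa aliquam HR032145',
    'Inceptos neque nullam HR032114. rutrum Eleifend.',
    'Platea sapien vitae Risus.',
);
print "$_\n" for keep_codes(@sample);
```

The word boundaries keep the pattern from matching codes with four digits or with seven or more.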

how to replace dots inside quote in sentence with regex

Let's say there is some text like this:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. "Vestibulum interdum dolor nec sapien blandit a suscipit arcu fermentum. Nullam lacinia ipsum vitae enim consequat iaculis quis in augue. Phasellus fermentum congue blandit. Donec laoreet, ipsum et vestibulum vulputate, risus augue commodo nisi, vel hendrerit sem justo sed mauris." Phasellus ut nunc neque, id varius nunc. In enim lectus, blandit et dictum at, molestie in nunc. Vivamus eu ligula sed augue pretium tincidunt sit amet ac nisl. "Morbi eu elit diam, sed tristique nunc."
and I want it to become something like this:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. "Vestibulum interdum dolor nec sapien blandit a suscipit arcu fermentum[dot] Nullam lacinia ipsum vitae enim consequat iaculis quis in augue[dot] Phasellus fermentum congue blandit[dot] Donec laoreet, ipsum et vestibulum vulputate, risus augue commodo nisi, vel hendrerit sem justo sed mauris[dot]" Phasellus ut nunc neque, id varius nunc. In enim lectus, blandit et dictum at, molestie in nunc. Vivamus eu ligula sed augue pretium tincidunt sit amet ac nisl. "Morbi eu elit diam, sed tristique nunc[dot]"
I found a regex that selects every "{sentence}", namely "(.)+?", which I can use like this:
regex('"(.)+?"','[sentence]')
But can we replace the dots inside a group, so that I get the output shown in the example above?
I'm not sure regexps are able to suit your needs on their own.
You should implement an algorithm that replaces nested dots until the string doesn't contain nested dots anymore.
For example in PHP:
$string = 'He asked "Please." while she answered "No. Or maybe yes."';
var_dump($string);
while(preg_match('/"[^"]*\.[^"]*"/', $string)) {
$string = preg_replace('/("[^"]*)\.([^"]*")/', '$1[dot]$2', $string);
}
var_dump($string);
which prints:
string 'He asked "Please." while she answered "No. Or maybe yes."' (length=57)
string 'He asked "Please[dot]" while she answered "No[dot] Or maybe yes[dot]"' (length=69)
This is what I would do.
echo
preg_replace_callback('~(?<!\\\)"(.+?)((?<!\\\)")~',
/*
Pattern:
--------
(?<!\\\)" a double quote not preceded by a backward (escaping) slash
(.+?) anything (with min 1 char.) between condition above and below
((?<!\\\)") a double quote not preceded by a backward (escaping) slash
*/
// for anything that matches the above pattern
// the following function is called
function ($m) {
return preg_replace('~\.~', '[dot]', $m[0]); },
// which replaces each dot with [dot] and returns the match
$str);
EDIT: Added explanations in comments.
try this:
(\"[^\.]*)\.([^\"]*) to \1[dot]\2
works well in my editor, but sometimes $ is used instead of \ in replacement (e.g. in php)
With Javascript I would just do a basic replace:
str = str.replace(/".+?"/g,function(m) {
return m.replace(/\./g,'[dot]');
});
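The callback approach carries over to Perl through the /e modifier, which evaluates the replacement as code. A minimal sketch, assuming the quotes are balanced and never escaped or nested (the sub name `mask_quoted_dots` is made up for this example):

```perl
use strict;
use warnings;

# Replace every dot inside a double-quoted span with "[dot]".
sub mask_quoted_dots {
    my ($text) = @_;
    $text =~ s{("[^"]*")}{
        my $quoted = $1;             # copy the match before the inner substitution
        $quoted =~ s/\./[dot]/g;     # rewrite dots inside this quoted span only
        $quoted;
    }ge;
    return $text;
}

print mask_quoted_dots('He asked "Please." while she answered "No. Or maybe yes."'), "\n";
```

Text outside the quotes is never touched, because the inner substitution only sees the quoted span.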

Regex to match any character including new lines

Is there a regex to match "all characters including newlines"?
For example, in the regex below, there is no output from $2 because (.+?) doesn't include new lines when matching.
$string = "START Curabitur mollis, dolor ut rutrum consequat, arcu nisl ultrices diam, adipiscing aliquam ipsum metus id velit. Aenean vestibulum gravida felis, quis bibendum nisl euismod ut.
Nunc at orci sed quam pharetra congue. Nulla a justo vitae diam eleifend dictum. Maecenas egestas ipsum elementum dui sollicitudin tempus. Donec bibendum cursus nisi, vitae convallis ante ornare a. Curabitur libero lorem, semper sit amet cursus at, cursus id purus. Cras varius metus eu diam vulputate vel elementum mauris tempor.
Morbi tristique interdum libero, eu pulvinar elit fringilla vel. Curabitur fringilla bibendum urna, ullamcorper placerat quam fermentum id. Nunc aliquam, nunc sit amet bibendum lacinia, magna massa auctor enim, nec dictum sapien eros in arcu.
Pellentesque viverra ullamcorper lectus, a facilisis ipsum tempus et. Nulla mi enim, interdum at imperdiet eget, bibendum nec END";
$string =~ /(START)(.+?)(END)/;
print $2;
If you don't want to add the /s regex modifier (perhaps you still want . to retain its original meaning elsewhere in the regex), you may also use a character class. One possibility:
[\S\s]
a character which is not a space or is a space. In other words, any character.
You can also change modifiers locally in a small part of the regex, like so:
(?s:.)
Add the s modifier to your regex to cause . to match newlines:
$string =~ /(START)(.+?)(END)/s;
Yep, you just need to make . match newlines:
$string =~ /(START)(.+?)(END)/s;
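Note that "multiline" (/m) is not the modifier you want here: it only changes where ^ and $ match. To make . match newlines, use /s:
$string =~ /(START)(.+?)(END)/s;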
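To compare the options mentioned in this thread side by side, here is a small sketch that tries each pattern on a two-line string. The helper name `capture_between` and the sample string are assumptions for this example.

```perl
use strict;
use warnings;

# Return the first capture of $pattern in $text, or undef if it fails.
sub capture_between {
    my ($text, $pattern) = @_;
    my ($inner) = $text =~ $pattern;   # list context returns the captures
    return $inner;
}

my $string = "START first line\nsecond line END";

my @cases = (
    [ 'no modifier' => qr/START(.+?)END/       ],  # "." stops at the newline
    [ '/m'          => qr/START(.+?)END/m      ],  # only affects ^ and $
    [ '/s'          => qr/START(.+?)END/s      ],  # "." matches newlines too
    [ '[\s\S]'      => qr/START([\s\S]+?)END/  ],  # class that matches any character
    [ '(?s:.)'      => qr/START((?s:.)+?)END/  ],  # /s applied locally
);

for my $case (@cases) {
    my ($label, $pattern) = @$case;
    my $inner = capture_between($string, $pattern);
    printf "%-12s %s\n", $label, defined $inner ? 'matched' : 'no match';
}
```

Only the last three patterns can span the newline; the first two fail on this input.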

Continuing regular expression match after successful match

For the code below, what do I need to change (in my regex?) so it prints out all the instances of the regex that appear in that match in $string?
So right now the output is just the first instance that matches the regex, but I want to print out the following 2 instances as well. I thought the /g at the end of the regex would do this.
#!/usr/bin/perl
$start = '<!--Start here-->';
$end = '<!--End now-->';
$string = "Lorem ipsum dolor sit amet, consectetur <!--Start here-->adipiscing elit. Maecenas gravida dictum erat et sollicitudin. Class aptent taciti sociosqu ad litora torquent per <!--End now-->conubia nostra, per inceptos himenaeos. <!--Start here-->Mauris ac elementum enim. <!--End now-->Etiam hendrerit accumsan sodales. Morbi mi tortor, adipiscing in interdum eu, volutpat quis neque. Aenean tincidunt ornare risus, id faucibus augue dictum ut. Nullam aliquet metus vel nibh ullamcorper ornare. Vestibulum a sapien augue. Praesent tellus nulla, congue non vestibulum eget, venenatis eu tellus. <!--Start here-->Donec varius porttitor blandit.<!--End now--> a sapien augue ipsum dolor.";
$string =~ m/(($start)(.+?)($end))/g;
print $1;
You do it in a loop:
print $1 while ($string =~ m/(($start)(.+?)($end))/g);
Just change the way you print the results :)
while ($string =~ m/(($start)(.+?)($end))/g) {
print $1, "\n";
}
Try
my @matches = $string =~ m/(($start)(.+?)($end))/g;
to capture them all in the array @matches.
That's probably not the best way for capturing it, unless you really want the starts, the ends, the full phrase including the start and end, the contents, etc. The regex operator returns match results in list context, and you can use that to do:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my $start = '<!--Start here-->';
my $end = '<!--End now-->';
my $string = "Lorem ipsum dolor sit amet, consectetur <!--Start here-->adipiscing elit. Maecenas gravida dictum erat et sollicitudin. Class aptent taciti sociosqu ad litora torquent per <!--End now-->conubia nostra, per inceptos himenaeos. <!--Start here-->Mauris ac elementum enim. <!--End now-->Etiam hendrerit accumsan sodales. Morbi mi tortor, adipiscing in interdum eu, volutpat quis neque. Aenean tincidunt ornare risus, id faucibus augue dictum ut. Nullam aliquet metus vel nibh ullamcorper ornare. Vestibulum a sapien augue. Praesent tellus nulla, congue non vestibulum eget, venenatis eu tellus. <!--Start here-->Donec varius porttitor blandit.<!--End now--> a sapien augue ipsum dolor.";
my @matches = ($string =~ m/$start(.+?)$end/g);
say for @matches;
Outputs:
adipiscing elit. Maecenas gravida dictum erat et sollicitudin. Class aptent taciti sociosqu ad litora torquent per
Mauris ac elementum enim.
Donec varius porttitor blandit.
i.e. no messing around with $1 $2 $3 etc.
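A small variation on the list-context approach, wrapped in a sub. This is a sketch rather than a drop-in replacement: the name `extract_between` is invented here, \Q...\E is added so that delimiters containing regex metacharacters are matched literally, and /s is added on the assumption that the enclosed text may span several lines.

```perl
use strict;
use warnings;

# Return the text found between each $start/$end pair.
sub extract_between {
    my ($string, $start, $end) = @_;
    # With /g in list context, the match returns every capture at once.
    my @found = $string =~ /\Q$start\E(.+?)\Q$end\E/gs;
    return @found;
}

my @matches = extract_between(
    'a <!--Start here-->one<!--End now--> b <!--Start here-->two<!--End now-->',
    '<!--Start here-->',
    '<!--End now-->',
);
print "$_\n" for @matches;
```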