How to specify number of paragraphs in perl - regex

I want to match a specific number of newlines in Perl using a regex.
How do I specify the number of occurrences of a newline in a Perl pattern?
For example
I have text containing
dog
cat
other 23 newlines
puppy
Kitten
I can do this with a regex in Notepad: find "(dog)((?:.*[\r\n]+){25})(\w.*)" and replace with "\1 = \3 \2".
EDIT
The important thing is how to find what is on the 25th paragraph after dog.
Is there a simpler way?
How can I shorten this find string?
(dog)(.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+.*[\r\n]+)(.*)
What is the alternative for specifying a number of newlines in Perl?
I posted this on Super User by mistake; it has now been moved here.

You can skip the lines when reading the input:
while (<>) {
if (/dog/) {
<> for 1 .. 24;
print scalar <>;
}
}
Or, if the whole input is in a single string, you can use a non-capturing group with a counted quantifier:
my ($puppy) = $string =~ /dog(?:.*\n){25}(.*)/;
print $puppy;
In a regex, a dot doesn't match a newline (unless the /s modifier is used).
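As a small illustration of the counted group, here is a minimal sketch with a shorter text and a smaller count. The helper name line_after and the sample words are assumptions made for this example, not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): return the line that
# sits $n lines below the first occurrence of $marker, or undef.
sub line_after {
    my ($text, $marker, $n) = @_;
    # \Q..\E quotes the marker; (?:.*\n){$n} consumes the rest of the
    # marker's line plus the next $n - 1 whole lines.
    my ($line) = $text =~ /\Q$marker\E(?:.*\n){$n}(.*)/;
    return $line;
}

my $text = join "\n", qw(dog cat bird puppy kitten);
print line_after($text, 'dog', 3), "\n";    # prints "puppy"
```

Because the dot does not match a newline here, each repetition of the group consumes exactly one line, so the count is a line offset.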

You can use the tr/// operator (see perlop) to count the number of occurrences of a single character.
#!/usr/bin/perl
use warnings;
use strict;
my $string = 'dog
cat
other 23 newlines
puppy
Kitten';
print $string =~ tr/\n//, "\n";
Output:
4

This was the answer I wanted.
perl -i.bak -pe "BEGIN{undef $/;} s/(dog)(.*[\r\n]+){23}(.*)/$1 = $3/smg" 1.rtf

Related

using the command line and regex to determine words that start sentences

I have the text:
This is a test. This is only a test! If there were an emergency, then Information would be provided for you.
I want to be able to determine which words start sentences. What I have now is:
$ cat <FILE> | perl -pe 's/[\s.?!]/\n/g;'
This just gets rid of punctuation and replaces them with newlines, giving me:
This
is
a
test
This
is
only
a
test
If
there
were
an
emergency,
then
Information
would
be
provided
for
you
From here I could somehow extract the words that have either nothing above them (start of file) or a blank line above them, but I am unsure of exactly how to do this.
If you have a Perl of at least version 5.22.1 (or 5.22.0 and this case is not affected by the bug described here), then you can use the sentence boundaries in your regular expression.
use feature 'say';
foreach my $sentence (m/\b{sb}(\w+)/g) {
say $sentence;
}
Or, as a one-liner:
perl -nE 'say for /\b{sb}(\w+)/g'
If called with your example text, the output is:
This
This
If
It uses \b{sb}, which is the sentence boundary. You can read a tutorial about it at brian d foy's blog. The \b{...} construct is called a Unicode boundary and is described in perlrebackslash.
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
local $/;
# \s* (not \s+) so the very first word, which has no whitespace before it, is found too
my @words = <DATA> =~ m/(?:^|[.!?]+)\s*(\w+)/g;
print Dumper \@words;
__DATA__
This is a test. This is only a test! If there were an emergency, then Information would be provided for you.
So as a command line:
perl -nE 'say for m/(?:^|[.!?])\s*(\w+)/g' somefile
You can use this GNU grep command to extract the first word after each period, ! or ?:
grep -oP '(?:^|[.?!])\s*\K[A-Z][a-z]+' file
This
This
If
Though I must caution that you may get false positives in cases like Mr. Smith.
Regex Breakup:
(?:^|[.?!]) - match start or DOT or ! or ?
\s* - match 0 or more whitespaces
\K - reset the match start, discarding what was matched so far
[A-Z][a-z]+ - match a word starting with an upper case letter
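The same pattern can be used from Perl itself, since Perl also supports \K (from version 5.10). This is only a sketch: the helper name sentence_starts is made up for the example, and it has the same Mr. Smith limitation as the grep version.

```perl
use strict;
use warnings;

# Hypothetical helper: list the capitalised words that open a sentence.
sub sentence_starts {
    my ($text) = @_;
    # With no capture groups, m//g in list context returns each whole
    # match; \K drops the punctuation and whitespace from that match.
    return $text =~ /(?:^|[.?!])\s*\K[A-Z][a-z]+/g;
}

my $text = 'This is a test. This is only a test! If there were an '
         . 'emergency, then Information would be provided for you.';
print "$_\n" for sentence_starts($text);
```

Information is not reported because it follows a plain word rather than sentence-ending punctuation.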

regular expression that matches any word that starts with pre and ends in al

The following regular expression gives me the correct results in the Notepad++ editor, but the Perl program below gives wrong results. Please give the right answer with an explanation.
The file I used for testing my pattern is at:
(http://sainikhil.me/stackoverflow/dictionaryWords.txt)
Regular expression: ^Pre(.*)al(\s*)$
Perl program:
use strict;
use warnings;
sub print_matches {
my $pattern = "^Pre(.*)al(\s*)\$";
my $file = shift;
open my $fp, $file;
while(my $line = <$fp>) {
if($line =~ m/$pattern/) {
print $line;
}
}
}
print_matches @ARGV;
A few thoughts:
You should not escape the dollar sign
The capturing group around the whitespaces is useless
The same goes for the capturing group around .*
which leads to:
^Pre.*al\s*$
If you don't want lines like precious final to match (because of the whitespace in the middle), change the regex to:
^Pre\S*al\s*$
Included in your code:
while(my $line = <$fp>) {
if($line =~ /^Pre\S*al\s*$/m) {
print $line;
}
}
You're getting tripped up by assigning the pattern to a variable before using it as a regex, and by putting it in a double-quoted string when you do so.
The double quotes are why you need to escape the $: in a double-quoted string, a bare $ indicates that you want to interpolate the value of a variable (e.g., my $str = "foo$bar";).
They also cause your problem: the backslash in \s is treated as escaping the s, which gives you just a plain s:
$ perl -E 'say "^Pre(.*)al(\s*)\$";'
^Pre(.*)al(s*)$
As a result, when you go to execute the regex, it's looking for zero or more s characters rather than zero or more whitespace characters.
The most direct fix for this would be to escape the backslash:
$ perl -E 'say "^Pre(.*)al(\\s*)\$";'
^Pre(.*)al(\s*)$
A better fix would be to use single quotes instead of double quotes and not escape the $:
$ perl -E "say '^Pre(.*)al(\s*)$';"
^Pre(.*)al(\s*)$
The best fix would be to use the qr (quote regex) operator instead of single or double quotes, although that makes it a little less human-readable if you print it out later to verify the content of the regex (which I assume to be why you're putting it into a variable in the first place):
$ perl -E "say qr/^Pre(.*)al(\s*)$/;"
(?^u:^Pre(.*)al(\s*)$)
Or, of course, just don't put it into a variable at all and do your matching with
if($line =~ m/^Pre(.*)al(\s*)$/) ...
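To see the difference in practice, here is a rough sketch that compares the pattern the double-quoted string collapses to with the intended one. The variable names and the three sample words are assumptions for illustration only.

```perl
use strict;
use warnings;

# What the double-quoted string collapses to: "\s" has become a plain "s".
my $collapsed = qr/^Pre(.*)al(s*)$/;
# The intended pattern, compiled with qr// so that \s stays "whitespace".
my $intended  = qr/^Pre(.*)al(\s*)$/;

# Hypothetical sample words, each with the newline that <$fp> leaves on.
for my $line ("Prenatal\n", "Presals\n", "Prefix\n") {
    my $c = $line =~ $collapsed ? 'match' : 'no match';
    my $i = $line =~ $intended  ? 'match' : 'no match';
    print "collapsed: $c, intended: $i, for $line";
}
```

A word such as Presals is the kind of wrong result the collapsed pattern lets through: it ends in al followed by s, which (s*) happily accepts.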
Try removing trailing newline character(s):
while(my $line = <$fp>) {
$line =~ s/[\r\n]+$//s;
And, to match only words that begin with Pre and end with al, try this regular expression:
/^Pre\w*al$/
(\w matches a word character, that is a letter, digit or underscore, rather than just any character)
And, if you want to match both Pre and pre, do a case-insensitive match:
/^Pre\w*al$/i

perl regex to remove dashes

I have some files I am processing, and I would like to remove the dashes from the non date fields.
I came up with s/([^0-9]+)-([^0-9]+)/$1 $2/g but that only removes one dash, so it only works if there is a single dash in the string.
So let's say I have:
2014-05-01
this-and
this-and-that
this-and-that-and-that-too
2015-01-01
What regex would I use to produce
2014-05-01
this and
this and that
this and that and that too
2015-01-01
Don't do it with one regex. There is no requirement that a single regex must contain all of your code's logic.
Use one regex to see if it's a date, and then a second one to do your transformation. It will be much clearer to the reader (that's you, in the future) if you split it up into two.
#!/usr/bin/perl
use warnings;
use strict;
while ( my $str = <DATA>) {
chomp $str;
my $old = $str;
if ( $str !~ /^\d{4}-\d{2}-\d{2}$/ ) { # First regex to see if it's a date
$str =~ s/-/ /g; # Second regex to do the transformation
}
print "$old\n$str\n\n";
}
__DATA__
2014-05-01
this-and
this-and-that
this-and-that-and-that-too
2015-01-01
Running that gives you:
2014-05-01
2014-05-01
this-and
this and
this-and-that
this and that
this-and-that-and-that-too
this and that and that too
2015-01-01
2015-01-01
Using lookaround:
$ perl -pe 's/
(?<!\d) # a negative look-behind with a digit: \d
- # a dash, literal
(?!\d) # a negative look-ahead with a digit: \d
/ /gx' file
OUTPUT
2014-05-01
this and
this and that
this and that and that too
2015-01-01
Lookarounds are assertions; here they ensure that there is no digit on either side of the -. A lookaround doesn't capture or consume anything, it only tests a condition. It's a good tool to have at hand.
Check :
http://www.perlmonks.org/?node_id=518444
http://www.regular-expressions.info/lookaround.html
Lose the +: it's catching the string up until the last -, including any previous - characters:
s/([^0-9]|^)-+([^0-9]|$)/$1 $2/g;
Example: https://ideone.com/r2CI7v
As long as your program receives each field separately in the $_ variable, all you need is
tr/-/ / if /[^-\d]/
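A minimal sketch of how that line behaves when applied to a list of fields; the helper name undash and the sample fields are assumptions for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: replace dashes in every field that contains
# something other than digits and dashes; date-like fields are left alone.
sub undash {
    my @fields = @_;                      # work on copies
    for (@fields) {
        tr/-/ / if /[^-\d]/;
    }
    return @fields;
}

print "$_\n" for undash('2014-05-01', 'this-and-that', '2015-01-01');
```

The test /[^-\d]/ only asks whether the field contains anything besides digits and dashes, so it is looser than a real date check: a field like 12-34 would also be left unchanged.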
This should do it
$line =~ s/(\D)-/$1 /g;
As I explained in a comment, you really need to use Text::CSV to split each record into fields before you edit the data. That's because data that contain whitespace need to be enclosed in double quotes, so a field like this-and-that will start out without spaces, but needs them added when the hyphens are translated to spaces.
This program shows a simple example that uses your own data.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({eol => $/});
while (my $row = $csv->getline(\*DATA)) {
for (@$row) {
tr/-/ / unless /^\d\d\d\d-\d\d-\d\d$/;
}
$csv->print (\*STDOUT, $row);
}
__DATA__
2014-05-01,this-and-that,this-and-that,this-and-that-and-that-too,2015-01-01
output
2014-05-01,"this and that","this and that","this and that and that too",2015-01-01

How do I return all characters that begin and end with certain characters in Perl (Or C++)?

note: I'm running Perl 5 on Linux
I'm currently doing a project where I have to input a few words and then return the words that begin with "d" and end with "e". I'm not using a predefined list; for example, I type Done, Dish, Dome and Death into the console. I want it to return Done and Dome, but not the other words. I'd like help doing this in Perl, but C++ would also help if Perl doesn't work out.
perl -ne ' print if /^d/i && /e$/i ' < words
Since you are using Linux, it may be simpler to use grep(1):
grep -i '^d.*e$' < words
That's almost trivial in Perl:
$ perl -nE 'say "ok" if /^d.*e$/i'
Done
ok
Dish
Dome
ok
Death
It reads from STDIN and says ok if the line matched. This is useful while debugging regular expressions. You just want to output matching lines, so you could simply replace say "ok" with say:
$ perl -nlE 'say if /^d.*e$/i' words
where words is the filename of your words file; the -n switch makes Perl read it line by line. Short explanation of that regular expression match:
^ # start of the line
d # the literal character 'd' (case-insensitive because of the i switch)
.* # everything allowed here
e # the literal character 'e' (also case-insensitive)
$ # end of the line
It's not often I answer Perl questions, but I think this does the trick.
my @words = ...;
@words = grep(/^d.*e$/i, @words);
grep uses a regular expression to filter the words.
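A runnable sketch of that filter, using the four words from the question; the helper name d_to_e is an assumption for this example.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the grep filter.
sub d_to_e {
    return grep { /^d.*e$/i } @_;
}

my @words = qw(Done Dish Dome Death);     # the words from the question
print join(', ', d_to_e(@words)), "\n";   # prints "Done, Dome"
```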
How about:
#!/usr/bin/perl -Tw
use strict;
use warnings;
for my $word (@ARGV) {
if ( $word =~ m{\A d .* e \z}xmsi ) {
print "$word\n";
}
}

How do I ignore a regex match if a line has a special prefix?

I'm using this regex in Perl to match and replace the following expressions:
_HI2_
_HI_2
HI2_
_HI_2
if ($subject =~ m/_?HI2?_?|HI2?_?/) {
# Successful match
} else {
# Match attempt failed
}
I also want to do this though:
The text is: ABCDEMAFGHIJ
There is a sequence HI in there, but it must be ignored because, if you look to the left, you can see that this line starts with The text is:.
The text is: ABCDEHI2FGHI
As above, but with two HI sequences here.
How can I build into this regex a way to ignore matches on lines that have this prefix?
Why not just match twice?
If $subject does not match /^The text is:/, run the replace ..
Try this regex:
/^(?!The text is:).*(?:_?HI2?_?|HI2?_?)/
Or use two matches like:
if($subject !~ /^The text is:/i && $subject =~ /_?HI2?_?|HI2?_?/)
I just discovered this brilliant resource here and the section on Perl.
You can find there details of a (*SKIP)(*F) construct which will blow your mind; your described problem as a one-liner:
cat > test.txt <<EOF
_HI2_
_HI_2xxxHI_2
The text is: ABCDEMAFGHIJ
HI2_
The text is: ABCDEHI2FGHI
_HI_2
EOF
perl -ne '/^The text is:.*$(*SKIP)(*F)|.+/ && s/_?HI_?2?_?/HAPPY/; print' test.txt
# or
perl -ne 's/(^The text is:.*$)(*SKIP)(*F)|_?HI_?2?_?/HAPPY/g; print' test.txt
I have newfound love and respect for Perl. Sed is my go-to, but now that I know how to skip lines (read: leave them unchanged) in Perl, I will hesitate less.
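The same idea inside a script, as a minimal sketch. The helper name happy and the sample lines are assumptions for illustration, and the trailing $ of the one-liner is left out because .* already runs to the end of the line.

```perl
use strict;
use warnings;

# Hypothetical helper: replace HI-style tokens unless the line is prefixed.
sub happy {
    my ($line) = @_;
    # The first alternative swallows a prefixed line, then (*SKIP)(*F)
    # rejects it and stops the engine retrying inside the swallowed text.
    $line =~ s/^The text is:.*(*SKIP)(*F)|_?HI_?2?_?/HAPPY/g;
    return $line;
}

print happy($_), "\n" for '_HI2_', '_HI_2xxxHI_2', 'The text is: ABCDEHI2FGHI';
```

The backtracking verbs (*SKIP) and (*F) need Perl 5.10 or later.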
Try anchoring the pattern to the start of the line with "^", and allow leading whitespace if you think that is needed (I always tend to do it). You could also mark the end of the string with "$".
if ($subject =~ m/^\s*_?HI2?_?|HI2?_?/) {
# Successful match
} else {
# Match attempt failed
}
Not the most elegant method but easy to understand (TIMTOWTDI :)
#!/usr/bin/perl
use strict;
use warnings;
my @text = ("ABCDEHI2FGHI", "The text is: ABCDEHI2FGHI");
for (@text) {
my $new = my_replace($_); # do the replacement
print "$new\n"; # print result
}
sub my_replace {
my ($text) = @_;
return $text if ($text =~ m/The text is:/); # return if prefixed / no replacement
$text =~ s/(_?HI2?_?|HI2?_?)/__replacement__/g; # do replace (give a replacement string here)
return $text; # return result of replacement
}
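Otherwise you can use a "negative lookbehind". Note that Perl itself does not support unbounded variable-length lookbehind, so the .* inside the lookbehind below is rejected by Perl; the pattern only works in regex engines that allow it.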
To try see regex101 or debuggex.
/(?<!^The text is.*)(_?HI2?_?|HI2?_?)/