Attach a newline to every sentence - regex

I was wondering how to turn a paragraph into bullet sentences.
before:
sentence1. sentence2. sentence3. sentence4. sentence5. sentence6. sentence7.
after:
sentence1.
sentence2.
sentence3
sentence4.
sentence5.

Since all the other answers so far show how to do it in various programming languages and you have tagged the question with Vim, here's how to do it in Vim:
:%s/\.\(\s\+\|$\)/.\r\r/g
I've used two carriage returns to match the output format you showed in the question. There are a number of alternative regular expression forms you could use:
" Using a look-behind
:%s/\.\@<=\( \|$\)/\r\r/g
" Using 'very magic' to reduce the number of backslashes
:%s/\v\.( |$)/.\r\r/g
" Slightly different formation: this will also break if there
" are no spaces after the full-stop (period).
:%s/\.\s*$\?/.\r\r/g
and probably many others.
A non-regexp way of doing it would be:
:let s = getline('.')
:let lineparts = split(s, '\.\@<=\s*')
:call append('.', lineparts)
:delete
See:
:help pattern.txt
:help change.txt
:help \@<=
:help :substitute
:help getline()
:help append()
:help split()
:help :d

You can use a regex
/\.( |$)/g
That will match the end of the sentence, then you can add newlines.
Or you can use a split function on ". " (dot space) and "." (dot), then join the pieces with newlines.
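The split-and-join idea could look roughly like this in Python. It is only a minimal sketch that assumes every sentence ends with a period; the function name bullet_sentences is made up for illustration.

```python
def bullet_sentences(paragraph: str) -> str:
    """Put each period-terminated sentence on its own line."""
    # Split on the period, trim surrounding whitespace from each piece.
    parts = [piece.strip() for piece in paragraph.split(".")]
    # split() removes the periods, so add one back to every non-empty piece.
    return "\n".join(piece + "." for piece in parts if piece)


print(bullet_sentences("sentence1. sentence2. sentence3."))
```

Because the period is used as the separator, it has to be added back after the split, which is the main thing a regex substitution saves you.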

Just replace the space at the end of each sentence, /(?<=\.) /, with two newline characters, /\n\n/. The look-behind leaves the period in place, so it does not need to be repeated in the replacement. The syntax would of course depend on the language you are using.
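As one possible rendering of the look-behind approach, here is a short Python sketch. The function name split_after_periods is hypothetical, and using \s+ instead of a single space is my own assumption so that runs of spaces are handled too.

```python
import re


def split_after_periods(paragraph: str) -> str:
    # (?<=\.) asserts that a period precedes the whitespace without consuming it,
    # so only the whitespace is replaced and the period stays where it is.
    return re.sub(r"(?<=\.)\s+", "\n\n", paragraph.strip())


print(split_after_periods("sentence1. sentence2. sentence3."))
```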

Using Perl:
perl -e '$_ = <>; s/\.\s*/.\n/g; print'
Longer, somewhat more readable version:
my $input = 'foo. bar. baz.';
$input =~ s/
\. # A literal '.'
\s* # Followed by 0 or more space characters
/.\n/gx; # g for all occurrences, x to allow comments and whitespace in regex
print $input;
Using Python:
import re
text = 'foo. bar. baz.'
print(re.sub(r'\.\s*', '.\n', text))

An example using Ruby:
ruby-1.9.2 > a = "sentence1. sentence2. sentence3. and array.split(). the end."
=> "sentence1. sentence2. sentence3. and array.split(). the end."
ruby-1.9.2 > puts a.gsub(/\.(\s+|$)/, ".\n\n")
sentence1.
sentence2.
sentence3.
and array.split().
the end.
In words: every . that is followed by one or more whitespace characters, or by the end of the line, is replaced with just . and two newline characters.

Using awk (this splits on whitespace, so it works here only because the sample sentences contain no spaces):
$ awk '{$1=$1}1' OFS="\n" file
sentence1.
sentence2.
sentence3.
sentence4.
sentence5.
sentence6.
sentence7

In PHP:
<?php
$input = "sentence. sentence. sentence.";
$output = preg_replace("/(.*?)\\.[\\s]+/", "$1.\n", $input);
?>
Also, regular expressions are a blast, but not necessary for this problem. You can also try:
<?php
$input = "sentence. sentence. sentence.";
$arr = explode('.', $input);
foreach ($arr as $k => $v) $arr[$k] = trim($v);
$output = implode("\n", $arr);
?>

I figured out how to do this in RegExr
Search String is
(\-=?\s+)
--
Replace String is
\n\n
This is the generated information for the current regex
RegExp: /(\-=?\s+)/g
pattern: (\-=?\s+)
flags: g
capturing groups: 1
group 1: (\-=?\s+)
This will find every - in the sentence below and replace it with two newlines
Sentence 1- Sentence 2- Sentence 3- Sentence 4- Sentence 5-
The end result is
Sentence 1
Sentence 2
Sentence 3
Sentence 4
Sentence 5

I have a really simple, naive solution using a capturing group.
:%s/\([.!?]\)/\1\r\r/g
The main drawback is that this won't handle ellipses or multiple punctuation marks.
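One way around the ellipsis problem is to capture a whole run of end punctuation instead of a single character. The Python sketch below illustrates that idea; the function name break_sentences is made up, and it still breaks on things like decimal numbers or abbreviations.

```python
import re


def break_sentences(text: str) -> str:
    # ([.!?]+) captures a whole run of end punctuation ("...", "?!") as one unit;
    # \s* swallows any whitespace that follows it.
    broken = re.sub(r"([.!?]+)\s*", r"\1\n\n", text)
    # Drop the newlines added after the very last sentence.
    return broken.rstrip("\n")


print(break_sentences("Wait... Really?! Yes."))
```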

Related

How to split text into "steps" using regex in perl?

I am trying to split texts into "steps"
Let's say my text is
my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!";
I'd like the output to be:
"1.Do this."
"2.Then do that."
"3.And then maybe that."
"4.Complete!"
I'm not really that good with regex so help would be great!
I've tried many combinations like:
split /(\s\d.)/
But it splits the numbering away from text
I would indeed use split. But you need to exclude the digit from the match by using a lookahead.
my @steps = split /\s+(?=\d+\.)/, $steps;
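For comparison, the same look-ahead split carries over to Python almost unchanged. This is just a sketch using the sample text from the question.

```python
import re

steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!"

# Split on whitespace only where a "number + period" follows. The look-ahead
# (?=...) does not consume the number, so it stays attached to its step.
parts = re.split(r"\s+(?=\d+\.)", steps)

for part in parts:
    print(part)
```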
All step-descriptions start with a number followed by a period and then have non-numbers, until the next number. So capture all such patterns
my @s = $steps =~ / [0-9]+\. [^0-9]+ /xg;
say for @s;
This works only if there are surely no numbers in the steps' description, like any approach relying on matching a number (even if followed by a period, for decimal numbers)†
If there may be numbers in there, we'd need to know more about the structure of the text.
Another delimiting pattern to consider is punctuation that ends a sentence (. and ! in these examples), if there are no such characters in steps' description and there are no multiple sentences
my @s = $steps =~ / [0-9]+\. .*? [.!] /xg;
Augment the list of patterns that end an item's description as needed, say with a ?, and/or ." sequence as punctuation often goes inside quotes.‡
If an item can have multiple sentences, or use end-of-sentence punctuation mid-sentence (as a part of a quotation perhaps) then tighten the condition for an item's end by combining footnotes -- end-of-sentence punctuation and followed by number+period
my @s = $steps =~ /[0-9]+\. .*? (?: \."|\!"|[.\!]) (?=\s+[0-9]+\. | \z)/xg;
If this isn't good enough either then we'd really need a more precise description of that text.
† An approach using a "numbers-period" pattern to delimit item's description, like
/ [0-9]+\. .*? (?=\s+[0-9]+\. | \z) /xg;
(or in a lookahead in split) fails with text like
1. Only $2.50   or   1. Version 2.4.1   ...
‡ To include text like 1. Do "this." and 2. Or "that!" we'd want
/ [0-9]+\. .*? (?: \." | !" | [.!?]) /xg;
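To see how the tightened pattern (end-of-sentence punctuation followed by the next number+period, or the end of the text) behaves on the awkward inputs from the first footnote, here is a Python transcription. The names ITEM and split_items are made up for this sketch, and the second sample sentence is my own, built around the footnote's "$2.50" and "2.4.1" cases.

```python
import re

ITEM = re.compile(
    r"""
    [0-9]+\.                   # item number and its period
    .*?                        # description, as short as possible
    (?: \." | !" | [.!] )      # end-of-sentence punctuation, optionally before a quote
    (?= \s+[0-9]+\. | \Z )     # then the next item number, or the end of the text
    """,
    re.VERBOSE,
)


def split_items(text: str) -> list:
    # No capturing groups are used, so findall() returns the full matches.
    return ITEM.findall(text)


for item in split_items("1. Only $2.50 today. 2. Version 2.4.1 works!"):
    print(item)
```

The periods inside "$2.50" and "2.4.1" are skipped because they are not followed by whitespace and a new item number.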
The following sample code demonstrates the power of regex by filling the %steps hash in one line of code.
Once the data is obtained, you can slice and dice it any way your heart desires.
Check the sample against your actual problem.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my($str,%steps,$re);
$str = '1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!';
$re = qr/(\d+)\.(\D+)\./;
%steps = $str =~ /$re/g;
say Dumper(\%steps);
say "$_. $steps{$_}" for sort keys %steps;
Output (step 4 is missing because it ends with ! and the pattern requires a closing period):
$VAR1 = {
'1' => 'Do this',
'2' => 'Then do that',
'3' => 'And then maybe that'
};
1. Do this
2. Then do that
3. And then maybe that
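The same number-to-description mapping can be sketched in Python. Note that the pattern in the Perl sample requires a closing period, so "4.Complete!" is not captured there; the variant below accepts either . or ! as the terminator, which is my own adjustment rather than part of the answer.

```python
import re

text = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!"

# Each match yields a (number, description) pair; dict() turns the pairs into a mapping.
# [.!] accepts either terminator, so the "4.Complete!" step is kept as well.
steps = dict(re.findall(r"(\d+)\.(\D+?)[.!]", text))

# Sort numerically so that step 10 would come after step 9.
for number in sorted(steps, key=int):
    print(f"{number}. {steps[number]}")
```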

Extract first word after specific word

I'm having difficulty writing a Perl program to extract the word following a certain word.
For example:
Today i'm not going anywhere except to office.
I want the word after anywhere, so the output should be except.
I have tried this
my $words = "Today i'm not going anywhere except to office.";
my $w_after = ( $words =~ /anywhere (\S+)/ );
but it seems this is wrong.
Very close:
my ($w_after) = ($words =~ /anywhere\s+(\S+)/);
   ^        ^                       ^^^
   +--------+                        |
     Note 1                        Note 2
Note 1: =~ returns a list of captured items, so the assignment target needs to be a list.
Note 2: allow one or more blanks after anywhere
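For readers coming from other languages, the capture itself looks like this in Python. Python has no list-context rule, but there is a similar trap: search() returns a match object (or None), and the captured word has to be read from it with group(1). A minimal sketch:

```python
import re

words = "Today i'm not going anywhere except to office."

match = re.search(r"anywhere\s+(\S+)", words)
# search() returns None when nothing matches, so guard before reading the group.
w_after = match.group(1) if match else None

print(w_after)
```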
In Perl v5.22 and later, you can use \b{wb} to get better results for natural language. The pattern could be
/anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/
"wb" stands for word break, and it will account for words that have apostrophes in them, like "I'll", that plain \b doesn't.
.+?\b{wb}
matches the shortest non-empty sequence of characters that don't have a word break in them. The first one matches the span of spaces in your sentence; and the second one matches "except". It is enclosed in parentheses, so upon completion $1 contains "except".
\b{wb} is documented most fully in perlrebackslash
First, you have to write parentheses around the left-hand side of the = operator to force list context for the regexp evaluation. See m// and // in the perlop documentation.[1] You can also write parentheses around the =~ binding operator to improve readability, but it is not necessary because =~ has fairly high precedence.
Use the word character class \w (and its negation \W):
my ($w_after) = ($words =~ / \b anywhere \W+ (\w+) \b /x);
Note that I'm using the x flag so that whitespace in the regexp is ignored. Also use the \b word boundary to anchor the regexp correctly.
[1]: I write my ($w_after) just for convenience; you can write my ($a, $b, $c, @rest) as the equivalent of (my $a, my $b, my $c, my @rest), but you can also control the scope of your variables, as in (my $a, our $UGLY_GLOBAL, local $_, @_).
This regex should match:
my ($expect) = ($words=~m/anywhere\s+([^\s]+)\s+/);
[^\s]+ captures the word between the two runs of whitespace.
Thanks.
If you want to also take into consideration the punctuation marks, like in:
my $words = "Today i'm not going anywhere; except to office.";
Then try this:
my ($w_after) = ($words =~ /anywhere[[:punct:]\s]+(\S+)/);
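The punctuation-tolerant version can also be tried outside Perl. Python's re module has no [[:punct:]] class, so the sketch below substitutes string.punctuation (ASCII only) for it; that substitution and the helper name word_after are assumptions made for illustration.

```python
import re
import string

# Character class of ASCII punctuation plus whitespace, one or more times.
# re.escape() protects characters such as ] ^ - \ inside the brackets.
SEPARATORS = "[" + re.escape(string.punctuation) + r"\s]+"


def word_after(keyword: str, text: str):
    """Return the first word after keyword, skipping punctuation and spaces."""
    match = re.search(re.escape(keyword) + SEPARATORS + r"(\S+)", text)
    return match.group(1) if match else None


print(word_after("anywhere", "Today i'm not going anywhere; except to office."))
```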

Copy matched pattern of line at the end of it

I want to match a pattern in a line of text and then append it to the end of that line. In the case below I want to match numbers and paste them at the end of the line. When there are two matches, they should be comma separated.
Basically I am looking for how to use the matched portion as a variable.
I am looking to do it in Bash.
abc 123=
agdaf456ad
dfaf879:
abc123xyz12:
To
abc 123=123
agdaf456ad456
dfaf879:879
abc123xyz12:123,12
Something like
(\d+)(.*)$
And replace with
$1$2$1
Regex Demo
Example
$replace = preg_replace("/(\d+)(.*)$/", "$1$2$1", "abc 123=");
echo $replace;
=> abc 123=123
To get all sequences of digits in a given string, you can use a mere \d+ regex, and then just implode the obtained result array and append it to the input string:
$str = "abc123xyz12:";
preg_match_all('/\d+/', $str, $m);
$append = implode(",", $m[0]);
echo $str . $append;
See demo
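The find-all-then-join approach is not tied to PHP. As a rough illustration, here it is in Python, run over the four sample lines from the question; the function name append_numbers is made up.

```python
import re


def append_numbers(line: str) -> str:
    # Collect every run of digits and append them, comma-separated, to the line.
    return line + ",".join(re.findall(r"\d+", line))


for line in ["abc 123=", "agdaf456ad", "dfaf879:", "abc123xyz12:"]:
    print(append_numbers(line))
```

A line without any digits comes back unchanged, since joining an empty list gives an empty string.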

perl Regex replace for specific string length

I am using Perl to do some prototyping.
I need an expression to replace e by [ee] if the string is exactly 2 chars and finishes by "e".
le -> l [ee]
me -> m [ee]
elle -> elle : no change
I cannot test the length of the string, I need one expression to do the whole job.
I tried:
`s/(?=^.{0,2}\z).*e\z%/[ee]/g` but this is replacing the whole string
`s/^[c|d|j|l|m|n|s|t]e$/[ee]/g` same result (I listed the possible letters that could precede my "e")
`^(?<=[c|d|j|l|m|n|s|t])e$/[ee]/g` but I have no match, not sure I can use ^ on a positive look behind
EDIT
Guys you're amazing, hours of search on the web and here I get answers minutes after I posted.
I tried all your solutions and they are working perfectly directly in my script, i.e. this one:
my $test2="le";
$test2=~ s/^(\S)e$/\1\[ee\]/g;
print "test2:".$test2."\n";
-> test2:l[ee]
But I am loading these regex from a text file (using Perl for proto, the idea is to reuse it with any language implementing regex):
In the text file I store for example (I used % to split the line between match and replace):
^(\S)e$% \1\[ee\]
and then I parse and apply all regex like that:
my $test="le";
while (my $row = <$fh>) {
    chomp $row;
    if( $row =~ /%/){
        my @reg = split /%/, $row;
        #if no replacement, put empty string
        if($#reg == 0){
            push(@reg,"");
        }
        print "reg found, reg:".$reg[0].", replace:".$reg[1]."\n";
        push @regs, [ @reg ];
    }
}
print "orgine:".$test."\n";
for my $i (0 .. $#regs){
    my $p=$regs[$i][0];
    my $r=$regs[$i][1];
    $test=~ s/$p/$r/g;
}
print "final:".$test."\n";
This technique is working well with my other regex, but not yet when I have a $1 or \1 in the replace... here is what I am obtaining:
final:\1\ee\
PS: you answered to initial question, should I open another post ?
Something like s/(?i)^([a-z])e$/$1\[ee\]/
Why aren't you using a capture group to do the replacement?
`s/^([c|d|j|l|m|n|s|t])e$/\1 [ee]/g`
If those are the characters you need and if it is indeed one word to a line with no whitespace before it or after it, then this will work.
Here's another option, depending on what you are looking for. It will match a two-character string consisting of one a-z character followed by one 'e' on its own line, with possible whitespace before or after. It will replace this with the single a-z character followed by ' [ee]'.
`s/^\s*([a-z])e\s*$/\1 [ee]/`
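Since the question mentions reusing the expression in other languages, here is the anchored capture-group substitution as a Python sketch. The function name mark_final_e is hypothetical, and the case-insensitive flag is an assumption borrowed from the (?i) suggestion above.

```python
import re


def mark_final_e(word: str) -> str:
    # ^ and $ pin the match to the whole string, so only two-character words
    # made of one letter plus "e" are rewritten; everything else is returned as is.
    return re.sub(r"^([a-z])e$", r"\1 [ee]", word, flags=re.IGNORECASE)


for word in ["le", "me", "elle"]:
    print(word, "->", mark_final_e(word))
```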
^(\S)e$
Try this. Replace by $1 [ee]. See demo.
https://regex101.com/r/hR7tH4/28
I'd do something like this
$word =~ s/^(\w{1})(e)$/$1$2e/;
You can use the following regex, which matches two characters, and then replace the match with $1\[$2$2\]:
^([a-zA-Z])([a-zA-Z])$
Demo :
$my_string =~ s/^([a-zA-Z])([a-zA-Z])$/$1\[$2$2\]/;
See demo https://regex101.com/r/iD9oN4/1
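On the follow-up in the question about loading rules from a text file: in the Perl loop, s/$p/$r/g inserts the contents of $r as plain text, so a \1 or $1 stored in that variable is not expanded as a backreference. If the rule file is meant to be reused from other languages, one option is an engine whose replacement template expands \1 itself. The Python sketch below shows that idea only; load_rules and apply_rules are made-up names, and the rule line is written as \1 [ee] (brackets unescaped), which is an assumption about how the file would be adapted.

```python
import re


def load_rules(lines):
    """Parse 'pattern%replacement' lines into (compiled pattern, replacement) pairs."""
    rules = []
    for raw in lines:
        row = raw.rstrip("\n")
        if "%" not in row:
            continue
        # partition() splits at the first %, leaving "" when no replacement is given.
        pattern, _, replacement = row.partition("%")
        rules.append((re.compile(pattern), replacement))
    return rules


def apply_rules(word, rules):
    for pattern, replacement in rules:
        # sub() expands \1 in the replacement template on every match.
        word = pattern.sub(replacement, word)
    return word


rules = load_rules([r"^(\S)e$%\1 [ee]"])
print(apply_rules("le", rules))
print(apply_rules("elle", rules))
```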

split one line regex in a multiline regexp in perl

I have trouble splitting my regex across multiple lines. I want my regex to match this line:
* Code "l;k""dfsakd;.*[])_lkaDald"
So I created this regex which work:
my $firstRegexpr = qr/^\s*\*\s*Code\s+\"(?<Code>((\")*[^\"]+)+)\"/x;
But now I want to split it across multiple lines like this (and want it to match the same thing!):
my $firstRegexpr = qr/^\s*\*\s*Code\s+\"
(?<Code>((\")*[^\"]+)+)\"/x;
I read about this, but I have trouble using it:
/
^\s*\*\s*Code\s+\"
(?<Code>((\")*[^\"]+)+)\"
/x
My last question is about avoiding variable interpolation in a Perl regex:
my $firstRegexpr = qr/^\s*\*\s*Code\s+\"(?<Code>((\")*[^\"$]+)+)\"\$/x;
The characters $] are interpreted as a variable in the regex. How can I make them be treated literally?
Thanks a lot for your time and please provide explicit example.
The x flag simply tells the regex engine to ignore whitespace in the pattern.
So a literal space in the pattern no longer matches a space character; you have to use \s or similar instead.
So you can write:
if ( m/
^
\d+\s+
fish:\w+\s+
$
/x ) {
print "Matched\n";
}
You can test regular expressions with various websites but one example is https://regex101.com/
So to take your example: https://regex101.com/r/eG5jY8/1
But how is yours not working?
This matches:
my $string = q{* Code "l;k""dfsakd;.*[])_lkaDald"};
my $firstRegexpr = qr/^\s*
\*
\s*
Code\s+
\"
(?<Code>((\")*[^\"]+)+)
\"
/x;
print "Compiled_Regex: $firstRegexpr\n";
print "Matched\n" if ( $string =~ m/$firstRegexpr/ );
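The same layout trick exists in other engines too. As an illustration, here is the multi-line pattern in Python's verbose mode; note that Python spells the named group (?P<Code>...), and the constant name FIRST_REGEX and the use of a non-capturing inner group are my own choices for this sketch.

```python
import re

FIRST_REGEX = re.compile(
    r'''
    ^\s* \* \s*            # leading "*" with optional whitespace around it
    Code \s+               # the keyword, then at least one whitespace character
    "                      # opening quote
    (?P<Code>              # named group (Python spells it ?P<name>)
        (?: "* [^"]+ )+    # runs of non-quote text, each optionally preceded by quotes
    )
    "                      # closing quote
    ''',
    re.VERBOSE,
)

line = '* Code "l;k""dfsakd;.*[])_lkaDald"'
match = FIRST_REGEX.match(line)
if match:
    print(match.group("Code"))
```

As with Perl's x flag, whitespace and comments in the pattern are ignored, so real whitespace has to be written as \s.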
As for $] being treated as a variable, there are two options: either escape it with \, or use \Q\E.