How to split text into "steps" using regex in Perl?

I am trying to split text into "steps".
Let's say my text is
my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!";
I'd like the output to be:
"1.Do this."
"2.Then do that."
"3.And then maybe that."
"4.Complete!"
I'm not really that good with regex, so help would be great!
I've tried many combinations, like:
split /(\s\d.)/
But it splits the numbering away from the text.

I would indeed use split. But you need to exclude the digit from the match by using a lookahead.
my @steps = split /\s+(?=\d+\.)/, $steps;
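A minimal sketch of that split applied to the sample string from the question; the quoting in the print statement is only there to mirror the desired output:

```perl
use strict;
use warnings;

my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!";

# Split on whitespace only where a "number + period" follows;
# the lookahead consumes nothing, so the number stays with its step.
my @steps = split /\s+(?=\d+\.)/, $steps;

print qq("$_"\n) for @steps;
# "1.Do this."
# "2.Then do that."
# "3.And then maybe that."
# "4.Complete!"
```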

All step descriptions start with a number followed by a period and then have non-digits until the next number. So capture all such patterns:
my @s = $steps =~ / [0-9]+\. [^0-9]+ /xg;
say for @s;
This works only if the step descriptions are certain to contain no numbers, as is the case for any approach that relies on matching a number (even one followed by a period, because of decimal numbers).†
If there may be numbers in there, we'd need to know more about the structure of the text.
Another delimiting pattern to consider is the punctuation that ends a sentence (. and ! in these examples), provided that no such characters occur inside a step's description and no step has multiple sentences:
my @s = $steps =~ / [0-9]+\. .*? [.!] /xg;
Extend the list of patterns that end an item's description as needed, say with a ?, and/or with a ." sequence, since punctuation often goes inside quotes.‡
If an item can have multiple sentences, or can use end-of-sentence punctuation mid-sentence (as part of a quotation, perhaps), then tighten the condition for an item's end by combining the two footnotes: end-of-sentence punctuation that is followed by number+period or by the end of the string:
my @s = $steps =~ /[0-9]+\. .*? (?: \."|\!"|[.\!]) (?=\s+[0-9]+\. | \z)/xg;
If this isn't good enough either then we'd really need a more precise description of that text.
† An approach using a "numbers-period" pattern to delimit item's description, like
/ [0-9]+\. .*? (?=\s+[0-9]+\. | \z) /xg;
(or in a lookahead in split) fails with text like
1. Only $2.50   or   1. Version 2.4.1   ...
‡ To include text like 1. Do "this." and 2. Or "that!" we'd want
/ [0-9]+\. .*? (?: \." | !" | [.!?]) /xg;
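As a rough check of the combined condition (end-of-sentence punctuation followed by number+period or end of string), here is a sketch run against a made-up input in which digits and periods also occur inside the descriptions. The input string is purely hypothetical, and the character class is extended with ? as the footnote suggests:

```perl
use strict;
use warnings;

# Hypothetical input, not from the question: "$2.50" and "2.4.1"
# would break the simpler number-only or punctuation-only patterns.
my $steps = '1. Pay $2.50 now. 2. Use version 2.4.1 today. 3. Done!';

# An item ends at sentence punctuation that is followed either by
# whitespace + number + period, or by the end of the string.
my @s = $steps =~ /[0-9]+\. .*? (?: \." | !" | [.!?]) (?=\s+[0-9]+\. | \z)/xg;

print "$_\n" for @s;
# 1. Pay $2.50 now.
# 2. Use version 2.4.1 today.
# 3. Done!
```

This still breaks on text such as "See item 2. 3. Next", where a sentence inside an item happens to end right before a number and period.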

The following sample code fills the %steps hash with a single regex match in list context.
Once you have the data you can slice and dice it any way you like.
Check the sample against your actual input.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
$Data::Dumper::Sortkeys = 1; # dump hash keys in sorted order
my($str,%steps,$re);
$str = '1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!';
$re = qr/(\d+)\.(\D+)[.!]/; # number, period, then text up to the closing . or !
%steps = $str =~ /$re/g;
say Dumper(\%steps);
say "$_. $steps{$_}" for sort { $a <=> $b } keys %steps; # numeric sort
Output
$VAR1 = {
'1' => 'Do this',
'2' => 'Then do that',
'3' => 'And then maybe that',
'4' => 'Complete'
};
1. Do this
2. Then do that
3. And then maybe that
4. Complete

Related

Regular expression puzzler

I have been doing regular expression for 25+ years but I don't understand why this regex is not a match (using Perl syntax):
"unify" =~ /[iny]{3}/
# as in
perl -e 'print "Match\n" if "unify" =~ /[iny]{3}/'
Can someone help solve that riddle?
The quantifier {3} in the pattern [iny]{3} means to match a character with that pattern (either i or n or y), and then another character with the same pattern, and then another. Three -- one after another. So your string unify doesn't have that, but can muster two at most, ni.
That's been explained in other answers already. What I'd like to add is an answer to a clarification in comments: how to check for these characters appearing 3 times in the string, scattered around at will. Apart from matching that whole substring, as shown already, we can use a lookahead:
(?=[iny].*[iny].*[iny])
This does not "consume" any characters but rather "looks" ahead for the pattern, not advancing the engine from its current position. As such it can be very useful as a subpattern, in combination with other patterns in a larger regex.
A Perl example, to copy-paste on the command line:
perl -wE'say "Match" if "unify" =~ /(?=[iny].*[iny].*[iny])/'
The drawback to this, as well as to consuming the whole such substring, is the literal spelling out of all three subpatterns. What if the number needs to be decided dynamically? Or if it's twelve? The pattern can be built at runtime, of course. In Perl, one way is
my $pattern = '(?=' . join('.*', ('[iny]')x3) . ')';
and then use that in the regex.
 
For the sake of performance, for long strings and many repetitions, make that .* non-greedy
(?=[iny].*?[iny].*?[iny])
(when forming the pattern dynamically join with .*?)
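A small sketch of building the non-greedy lookahead at runtime. The helper name at_least_n and the test words are illustrative assumptions, not part of the answer above:

```perl
use strict;
use warnings;

# Hypothetical helper: build a lookahead that requires at least $n
# characters from $class somewhere ahead, joined non-greedily.
sub at_least_n {
    my ($class, $n) = @_;
    my $pattern = '(?=' . join('.*?', ($class) x $n) . ')';
    return qr/$pattern/;
}

my $three = at_least_n('[iny]', 3);

for my $word (qw(unify unit)) {
    my $verdict = $word =~ $three ? 'match' : 'no match';
    print "$word: $verdict\n";
}
# unify: match      (n, i, y)
# unit: no match    (only n and i)
```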
A simple benchmark for illustration (in Perl)
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use List::Util qw(shuffle);
use Benchmark qw( cmpthese );
# For how many seconds to run each option (-r N, default 3),
# how many times to repeat for the test string (-n N, default 2)
my ($runfor, $n) = (3, 2);
GetOptions('r=i' => \$runfor, 'n=i' => \$n);
my $str = 'aa'
. join('', map { (shuffle 'b'..'t')x$n, 'a' } 1..$n)
. 'a'x($n+1)
. 'zzz';
my $pat_greedy = '(?=' . join('.*', ('a')x$n) . ')';
my $pat_non_greedy = '(?=' . join('.*?', ('a')x$n) . ')';
#my $pat_greedy = join('.*', ('a')x$n); # test straight match,
#my $pat_non_greedy = join('.*?', ('a')x$n); # not lookahead
sub match_repeated {
my ($s, $pla) = @_;
return ( $s =~ /$pla(.*z)/ ) ? "match" : "no match";
}
cmpthese(-$runfor, {
greedy => sub { match_repeated($str, $pat_greedy) },
non_greedy => sub { match_repeated($str, $pat_non_greedy) },
});
(Shuffling of that string is probably unneeded but I feared optimizations intruding.)
When a string is made with the factor of 20 (program.pl -n 20) the output is
              Rate     greedy non_greedy
greedy      56.3/s         --      -100%
non_greedy 90169/s    159926%         --
So ... some 1600 times better non-greedy. That test string is 7646 characters long and the pattern to match has 20 subpatterns (a) with .* between them (in greedy case); so there's a lot going on there. With default 2, so for a short string and a simpler pattern, the difference is 10%.
Btw, to test for straight-up matches (not using lookahead) just move those comment signs around the pattern variables, and it's nearly twice as bad:
               Rate     greedy non_greedy
greedy       56.5/s         --      -100%
non_greedy 171949/s    304117%         --
The letters n, i, and y aren't all adjacent. There's an f in between them.
/[iny]{3}/ matches any string that contains a substring of three letters taken from the set {i, n, y}. The letters can be in any order; they can even be repeated.
Choosing from three characters three times, with replacement, means there are 3^3 = 27 matching substrings:
iii, iin, iiy, ini, inn, iny, iyi, iyn, iyy
nii, nin, niy, nni, nnn, nny, nyi, nyn, nyy
yii, yin, yiy, yni, ynn, yny, yyi, yyn, yyy
To match non-adjacent letters you can use one of these:
[iny].*[iny].*[iny]
[iny](.*[iny]){2}
([iny].*){3}
(The last option will work fine on its own since your search is unanchored, but might not be suitable as part of a larger regex. The final .* could match more than you intend.)
That pattern looks for three consecutive occurrences of the letters i, n, or y. You do not have three consecutive occurrences.
Perhaps you meant to use [inf] or [ify]?
Looks like you are looking for 3 consecutive letters, so yours should not match
[iny]{3} //no match
[unf]{3} //no match
[nif]{3} //matches nif
[nify]{3} //matches nif
[ify]{3} //matches ify
[uni]{3} //matches uni
Hope that helps somewhat :)
The {3} quantifier means "exactly three consecutive matches of the preceding element." While all of the letters in your character class are present in the string, they are not consecutive, as they are separated by other characters in your string.
It isn't the order of items in the character class that's at issue. It's the fact that you can't match any combination of the three letters in your character class where exactly three of them are directly adjacent to one another in your example string.

Extract first word after specific word

I'm having difficulty writing a Perl program to extract the word following a certain word.
For example:
Today i'm not going anywhere except to office.
I want the word after anywhere, so the output should be except.
I have tried this
my $words = "Today i'm not going anywhere except to office.";
my $w_after = ( $words =~ /anywhere (\S+)/ );
but it seems this is wrong.
Very close:
my ($w_after) = ($words =~ /anywhere\s+(\S+)/);
^ ^ ^^^
+--------+ |
Note 1 Note 2
Note 1: =~ returns a list of captured items, so the assignment target needs to be a list.
Note 2: allow one or more blanks after anywhere
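To see why the parentheses on the left matter, here is a small sketch that runs the same match in both contexts, using the sentence from the question. The variable names are only illustrative:

```perl
use strict;
use warnings;

my $words = "Today i'm not going anywhere except to office.";

# Scalar assignment: the match only reports success (1) or failure.
my $in_scalar = ( $words =~ /anywhere\s+(\S+)/ );

# List assignment: the match returns the captured groups.
my ($in_list) = ( $words =~ /anywhere\s+(\S+)/ );

print "scalar context: $in_scalar\n";   # scalar context: 1
print "list context:   $in_list\n";     # list context:   except
```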
In Perl v5.22 and later, you can use \b{wb} to get better results for natural language. The pattern could be
/anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/
"wb" stands for word break, and it will account for words that have apostrophes in them, like "I'll", that plain \b doesn't.
.+?\b{wb}
matches the shortest non-empty sequence of characters that don't have a word break in them. The first one matches the span of spaces in your sentence; and the second one matches "except". It is enclosed in parentheses, so upon completion $1 contains "except".
\b{wb} is documented most fully in perlrebackslash
First, you have to write parentheses around the left-hand side of the = operator to force list context for the regexp evaluation. See m// and // in the perlop documentation.[1] You can also write parentheses around the =~ binding expression to improve readability, but they are not necessary because =~ has fairly high precedence.
Use the word character class \w:
my ($w_after) = ($words =~ / \b anywhere \W+ (\w+) \b /x);
Note that I'm using the /x modifier, so whitespace in the regexp is ignored. Also use the \b word boundary to anchor the regexp correctly.
[1]: I write my ($w_after) just for convenience, because you can write my ($a, $b, $c, @rest) as an equivalent of (my $a, my $b, my $c, my @rest), but you can also control the scope of your variables, as in (my $a, our $UGLY_GLOBAL, local $_, @_).
This regex will match:
my ($expect) = ($words=~m/anywhere\s+([^\s]+)\s+/);
[^\s]+ captures the word between the two runs of whitespace. Note that the trailing \s+ means the word must be followed by whitespace.
Thanks.
If you also want to take punctuation marks into consideration, as in:
my $words = "Today i'm not going anywhere; except to office.";
Then try this (inside a bracketed class no | is needed; it would be taken as a literal pipe):
my ($w_after) = ($words =~ /anywhere[[:punct:]\s]+(\S+)/);

How to use a REGEX pattern to remove a specific word "THE" only if at beginning of text string?

I have a text input field for titles of various things, and to help minimize false negatives in search results (the internal search is not the best), I need a regex pattern that looks at the first four characters of the input string and removes the word "the" (and the space after it), but only if it is at the beginning.
For example, if we are talking about the names of bands and someone enters The Rolling Stones, what I need is for the entry to say only Rolling Stones.
Can a regex be used to automatically strip these four characters?
Applying the regex
^(?:\s*the\s*)?(.*)$
will match any string, and capture it in backreference no. 1, unless it starts with the (optionally surrounded by whitespace), in which case backref no. 1 will contain whatever follows.
You need to set the case-insensitive option in your regex engine for this to work.
You can use the ^ anchor to match a pattern at the beginning of a line; however, for what you are doing here, it can be considered overkill.
Many languages support string manipulation, which is a more suitable choice. Here is an example in Python:
>>> def func(n):
        n = n[4:len(n)] if n[0:4] == "The " else n
        return n
>>> func("The Rolling Stones")
'Rolling Stones'
>>> func("They Might Be Giants")
'They Might Be Giants'
Since you don't specify a language, here is a solution in Perl:
my $str = "The Rolling Stones";
$str =~ s/^the //i;
say $str; # Rolling Stones
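A slightly more defensive sketch of the same substitution, wrapped in a hypothetical helper (strip_leading_the is an assumed name). It also tolerates leading whitespace and several spaces after the article, and it leaves titles that merely begin with the letters "the" untouched:

```perl
use strict;
use warnings;

# Hypothetical helper: drop a leading "the" (any case) together with
# the whitespace around it, but only at the very start of the title.
sub strip_leading_the {
    my ($title) = @_;
    $title =~ s/^\s*the\s+//i;
    return $title;
}

for my $title ('The Rolling Stones', 'the  Who', 'They Might Be Giants', 'Theatre of Tragedy') {
    my $stripped = strip_leading_the($title);
    print "$stripped\n";
}
# Rolling Stones
# Who
# They Might Be Giants
# Theatre of Tragedy
```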

How can I parse a phone number in Perl?

I am trying to grab any digits in front of a known line number of a phone, if they exist (in Perl). There will be no dashes, only digits.
For example, say I know the line number will always be 8675309. 8675309 may or may not have leading digits, if it does I want to capture them. There is not really a limit on the number of leading digits.
$input          $digits   $number
'8675309'       ''        '8675309'
'8008675309'    '800'     '8675309'
'18888675309'   '1888'    '8675309'
'18675309'      '1'       '8675309'
'86753091'      not a match
/8675309$/ will match, but how do I capture the leading digits in one regex?
Some regexes work better backwards than forwards. So sometimes it is useful to use sexeger, rather than regexes.
my $pn = '18008675309';
reverse($pn) =~ /^9035768(\d*)/;
my $got = reverse $1;
The regex is cleaner and avoids a lot of backtracking, at the cost of some flummery with reversing the input and the captured values.
The backtracking gain is smaller in this case than it would be if you had a general phone number extraction regex:
Regex: /^(\d*)\d{7}$/
Sexeger: /^\d{7}(\d*)/
There is a whole class of problems where this technique is useful. For more info see the sexeger post on Perlmonks.
my($digits,$number);
if ($input =~ /^(\d*)(8675309)$/) {
($digits,$number) = ($1,$2);
}
The * quantifier is greedy, but that means it matches as much as possible while still allowing a match. So initially, yes, \d* tries to gobble up all the digits in $number, but it reluctantly gives up character-by-character what it's matched until the whole pattern matches successfully.
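To check that pattern against the table in the question, here is a short sketch. The helper name split_number is an assumption made for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (leading digits, line number) on a match
# and an empty list otherwise.
sub split_number {
    my ($input) = @_;
    return $input =~ /^(\d*)(8675309)$/ ? ($1, $2) : ();
}

for my $input (qw(8675309 8008675309 18888675309 18675309 86753091)) {
    # A list assignment in boolean context yields the number of values
    # returned, so an empty leading-digits string still counts as a match.
    if ( my ($digits, $number) = split_number($input) ) {
        print "'$input' '$digits' '$number'\n";
    }
    else {
        print "'$input' not a match\n";
    }
}
# '8675309' '' '8675309'
# '8008675309' '800' '8675309'
# '18888675309' '1888' '8675309'
# '18675309' '1' '8675309'
# '86753091' not a match
```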
Another approach is to chop off the tail:
(my $digits = $input) =~ s/8675309$//;
You could do the same without using a regular expression:
my $digits = $input;
substr($digits, -7) = "";
The above, at least with perl-5.10-1, could even be condensed to
substr(my $digits = $input, -7) = "";
The regex special variables $` and $& are another way of grabbing those pieces of information. They hold the contents of the data preceding the match and the match itself respectively.
if ( /8675309$/ )
{
printf( "%s,%s,%s\n", $_, $`, $& );
}
else
{
printf( "%s,Not a match\n", $_ );
}
There's a Perl package that deals with at least UK and US phone numbers.
It's called Number::Phone and the code is somewhere on the cpan.org site.
How about /(\d)?(8675309)/?
UPDATE:
whoops that should have been /(\d*)(8675309)/
I might not understand the problem. Why is there a difference between the first and fourth examples:
'8675309' '' '8675309'
...
'8675309' '1' '8675309'
If all you want is to separate the last seven digits from everything else, you could have said it that way rather than provide confusing examples. A regex for that would be:
/(\d*)(\d{7,7})$/
If you weren't just providing a hypothetical number, and really are only looking for lines with '8675309' (seems strange), replace the '\d{7,7}' with '8675309'.

regex to match a maximum of 4 spaces

I have a regular expression to match a person's name.
So far I have ^([a-zA-Z\'\s]+)$ but I'd like to add a check to allow for a maximum of 4 spaces. How do I amend it to do this?
Edit: what I meant was 4 spaces anywhere in the string
Don't attempt to validate a name with a regex. People are allowed to call themselves whatever they like. This can include ANY character. Just because you live somewhere that only uses English doesn't mean that all the people who use your system will have English names. We have even had to make the name field in our system Unicode. It is the only Unicode type in the database.
If you care, we actually split the name at " " and store each name part as a separate record, but we have some very specific requirements that mean this is a good idea.
PS. My step mum has 5 spaces in her name.
^ # Start of string
(?!\S*(?:\s\S*){5}) # Negative look-ahead for five spaces.
([a-zA-Z\'\s]+)$ # Original regex
Or in one line:
^(?!(?:\S*\s){5})([a-zA-Z\'\s]+)$
If there are five or more spaces in the string, five will be matched by the negative lookahead, and the whole match will fail. If there are four or fewer, the original regex will be matched.
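A quick sketch exercising the one-line version in Perl. The two sample names are made up for illustration, one with four spaces and one with five:

```perl
use strict;
use warnings;

# The one-line pattern from above: reject as soon as five whitespace
# characters can be found, otherwise apply the original character check.
my $name_re = qr/^(?!(?:\S*\s){5})([a-zA-Z'\s]+)$/;

for my $name ('Mary Ann de la Cruz', 'a b c d e f') {
    my $verdict = $name =~ $name_re ? 'accepted' : 'rejected';
    print "'$name': $verdict\n";
}
# 'Mary Ann de la Cruz': accepted   (4 spaces)
# 'a b c d e f': rejected           (5 spaces)
```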
Screw the regex.
Using a regex here seems to be creating a problem for a solution instead of just solving a problem.
This task should be 'easy' for even a novice programmer, and the novel idea of regex has polluted our minds!
1: Get Input
2: Trim White Space
3: If this makes sense, trim out any 'bad' characters.
4: Use the "split" utility provided by your language to break it into words
5: Return the first 5 Words.
ROCKET SCIENCE.
replies
what do you mean screw the regex? you're obviously a VB programmer.
Regex is the most efficient way to work with strings. Learn them.
No. PHP, toyed a bit with Ruby, now going manically into Perl.
There are some things (like this case) where the regex-based alternative is computationally and logically far too complex for the task.
I've parsed entire PHP source files with regex; I'm not exactly a novice in their use.
But there are many cases, such as this, where you're employing a logging company to prune your rose bush.
I could do all steps 2 to 5 with regex of course, but they would be simple and atomic regex, with no weird backtracking syntax or potential for recursive searching.
The steps 1 to 5 I list above have a known scope, known range of input, and there's no ambiguity to how it functions. As to your regex, the fact you have to get contributions of others to write something so simple is proving the point.
I see somebody marked my post as offensive, I am somewhat unhappy I can't mark this fact as offensive to me. ;)
Proof Of Pudding:
sub getNames{
    my @args = @_;
    my $text = shift @args;
    my $num = shift @args;
    # Trim Whitespace from Head/End
    $text =~ s/^\s*//;
    $text =~ s/\s*$//;
    # Trim Bad Characters (??)
    $text =~ s/[^a-zA-Z\'\s]//g;
    # Tokenise By Space
    my @words = split( /\s+/, $text );
    #return 0..n
    return @words[ 0 .. $num - 1 ];
} ## end sub getNames
print join ",", getNames " Hello world this is a good test", 5;
>> Hello,world,this,is,a
If anything about how that works is unclear to anybody, I'll be glad to explain it. Note that I'm still doing it with regexps. In other languages I would have used their native "trim" functions where possible.
Bollocks -->
I first tried this approach. This is your brain on regex. Kids, don't do regex.
This might be a good start
/([^\s]+
(\s[^\s]+
(\s[^\s]+
(\s[^\s]+
(\s[^\s]+|)
|)
|)
|)
)/
( Linebroken for clarity )
/([^\s]+(\s[^\s]+(\s[^\s]+(\s[^\s]+(\s[^\s]+|)|)|)|))/
( Actual )
I've used [^\s]+ here instead of your A-Z combo for succinctness, but the point here is the nested optional groups
ie:
(Hello( this( is( example))))
(Hello( this( is( example( two)))))
(Hello( this( is( better( example))))) three
(Hello( this( is())))
(Hello( this()))
(Hello())
( Note: this, while being convoluted, has the benefit that it will match each name into its own group )
If you want readable code:
$word = '[^\s]+';
$regex = "/($word(\s$word(\s$word(\s$word(\s$word|)|)|)|)|)/";
( it anchors around the (capture|) mantra of "get this, or get nothing" )
@Sir Psycho : Be careful about your assumptions here. What about hyphenated names? Dotted names (e.g. Brian R. Bondy) and so on?
Here's the answer that you're most likely looking for:
^[a-zA-Z']+(\s[a-zA-Z']+){0,4}$
That says (in English): "From start to finish, match one or more letters, there can also be a space followed by another 'name' up to four times."
BTW: Why do you want them to have apostrophes anywhere in the name?
^([a-zA-Z']+\s){0,4}[a-zA-Z']+$
This assumes you want 4 spaces inside this string (i.e. you have trimmed it)
Edit: If you want 4 spaces anywhere I'd recommend not using regex - you'd be better off using a substr_count (or the equivalent in your language).
I also agree with pipTheGeek that there are so many different ways of writing names that you're probably best off trusting the user to get their name right (although I have found that a lot of people don't bother using capital letters on ecommerce checkouts).
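If the limit really is "at most four spaces anywhere", counting them directly is straightforward. In Perl, a sketch of the counting approach suggested in the edit above could use tr, which returns the number of characters it matched. The helper name and the sample names are assumptions made for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $name contains at most $max spaces.
sub has_at_most_spaces {
    my ($name, $max) = @_;
    # With an empty replacement list and no /d, tr only counts the
    # matching characters and leaves the string unchanged.
    my $count = ( $name =~ tr/ // );
    return $count <= $max;
}

for my $name ('Mary Ann de la Cruz', 'a b c d e f') {
    my $verdict = has_at_most_spaces($name, 4) ? 'ok' : 'too many spaces';
    print "'$name': $verdict\n";
}
# 'Mary Ann de la Cruz': ok
# 'a b c d e f': too many spaces
```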
Match multiple whitespace followed by two characters at the end of the line.
Related problem ----
From a string, remove trailing 2 characters preceded by multiple white spaces... For example, if the column contains this string -
" 'This is a long string with 2 chars at the end AB "
then, AB should be removed while retaining the sentence.
Solution ----
select 'This is a long string with 2 chars at the end AB' as "C1",
regexp_replace('This is a long string with 2 chars at the end AB',
'[[:space:]]+[a-zA-Z][a-zA-Z]$') as "C2" from dual;
Output ----
C1
This is a long string with 2 chars at the end AB
C2
This is a long string with 2 chars at the end
Analysis ----
The regular expression says: match one or more whitespace characters ([[:space:]]+) followed by exactly two letters ([a-zA-Z][a-zA-Z]) at the end of the line ($). Because no replacement string is given, regexp_replace removes the matched text.
Hope this is useful.