Perl regex expression - regex

I have inherited a perl script that pulls data out of some files. The whole script works fine but recently some engineers have been putting in more than one number for a certain spot that usually took one number, so the output is not showing all of what is expected.
Sample input:
CRXXXX: "Then some text"
CRs XXXX, XXXX, XX, XXX
CRXXX "Some Text"
Currently this regex statement I have pulls out the number after the CR, but if given then second line of sample input it prints "s XXXX, XXXX, XX, XXX" instead of the wanted "XXXX XXXX XX XXX"
I am very new to perl and am struggling to figure out how to alter this regex to work on all of the inputs.
$temp_comment =~ s/\s[cC][rR][-\s:;]*([\d])/\n$1/mg;
Thanks in advance!
Brock

For sample data like:
my $temp_comment =
'CR1234: "Then some text"
CRs 2345, 3456, 45, 567
CR678 "Some Text"';
try:
$temp_comment =~ s/(,)|[^\d\n]+/$1?' ':''/semg;
or, if you want to stay close to the string templates:
$temp_comment =~ s/ ^ # multi-line mode, line start
\s* # leading blanks?
CR # CR tag
\D* # non-number stuff
( # start capture group
(?:\d+ [,\s]*)+ # find (number, comma, space) groups
) # end capture group
\D* # skip remaining non-number stuff
$ # multi-line mode, line end
/$1/mxg; # set multi-line mode + regex comments "x"
but you'd have to remove the commas in the number group in a subsequent step.
$temp_comment =~ tr/,//d; # remove commas in the whole string
or
$temp_comment =~ s/(?<=\d),(?=\s\d)//g; # remove commas between numbers '11, 22'
For "single step", you have to use the /e modifier:
$temp_comment =~ s{ ^ # line start
\s* # leading blanks?
CR # CR tag
\D* # non-number stuff
((?:\d+ [,\s]*)+) # single or group of numbers
\D* # non number stuff
$ # line end
}
{do{(local$_=$1)=~y/,//d;$_}}mxeg;
This will, on the data above, result in:
1234
2345 3456 45 567
678
But really, please use, if possible, the simpler two step approach. The latter regex might be a maintainance nightmare for your successors.

You may be better off doing this in two steps:
1) Create your regular expression
s/\s[cC][rR][-\s:;]*([\d\ ]+)/\n$1/mg (note the new way to capture all of the numbers, you're only capturing the first number above)
2) Then just strip out the commas in the string with find/replace.

my ($v) = /CR[s ]*((?:\d+[\s,]*)*)/ig;
$v =~ s/,//g;
print $v,"\n";

Perhaps the following will work for you:
use Modern::Perl;
say join ' ', (/(\d+)/g) for <DATA>;
__DATA__
CR1234: "Then some text"
CRs 1111, 2222, 33, 444
CR567 "Some Text"
Output:
1234
1111 2222 33 444
567
Hope this helps!

Related

perl regex to convert currency

i need some help in text cleaning/normalization process
i struck at a place where i need to convert a currency format
input: $100 million output: 100 million dollar
input: eur20 million output: 20 million euros
i'm using perl regex for the cleaning process, help will be appreciated if someone can help me in providing a regex to convert input to output
this is my code so far
s/([\$])([0-9\.])([million])/ $2 $3 dollars/g;
example number is $4.2 million
this is what i tried for converting dollars symbol into word "dollars" and shift it to end of phrase, but it is not providing the result as expected, it provide me ".2 million" as output
[...] in a regex introduces a character class, so [million] is the same as [nolim], and it matches one of those characters.
I'd create a translation table for the currencies in a hash. From the keys of the hash, you can build a regex that matches them, and use it in the replacement:
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
use feature qw{ say };
my %currency = ( '$' => 'dollar', # or dollars?
eur => 'euros',
'€' => 'euros',
);
my $regex = join '|', map quotemeta, keys %currency;
for my $input ('$100 million', 'eur20 million', '€13.2 thousand') {
( my $output = $input )
=~ s/($regex)([0-9.]+ (?:million|thousand))/$2 $currency{$1}/g;
say $output;
}
Your regex does not give the result you claim it does.
s/([\$])([0-9.])([million])/ $2 $3 dollars/g;
With the help of the /x modifier we can add whitespace (even newlines and comments) to the pattern to improve readability. Your pattern can then be re-written as
s/([\$]) # match a literal $ and capture that as $1
([0-9.]) # match ONE digit or a dot and capture as $2
([million]) # match ONE character of 'm', 'i', 'l', 'o', 'n'
# and capture as $3
/ $2 $3 dollars/gx;
There is no way $100 million matches this pattern and results in .2 million. Possible inputs would be
$3i, $.o or $9m. They would give 3 i dollars, . o dollars, and 9 m dollars.
What you are looking for is a pattern like this:
s/\$ # a literal '$'
([\d.]+) # one or more digits or dots, like e.g. '99.5',
# captured as $1
\s+ # one or more whitespace
(million) # the literal text 'million', captured as $2
/$1 $2 dollars/gx;
(or, as a one-liner: s/\$([\d.]+)\s+(million)/$1 $2 dollars/g;)
Note that $2 in this case always is million and you could also rewrite it as s/\$([\d.]+)\s+million/$1 million dollars/g; (omitting the () around million).

Regex for receipt items

I have a simple receipt which I typed out. I need to be able to read the items purchased on the receipt. The sample receipt is below.
Tim Hortons
Alwasy Fresh
1 Brek Wrap Combo /A ($0.76)
1 Bacon-wrap $3.79
1 Grilled $0.00
1 5 Pieces Bacon-wrap $0.00
1 Orange $1.40
1 Deposit $0.10
Subtotal: $55.84
GST: $0.29
Debit: $55.84
Take out
Thanks for stopping by!!
Tell us how we did
I came up with the following regex string to find the items.
\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}
It works for the most part but there are a few incorrect lines like
4
GST: $0.29
Can someone come up with a better pattern. Below is a link to see it in action.
http://regexr.com/3cnk9
I see a number of problems with this original regex:
\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}
First, parentheses both group and match, though when you quantify your match, only the last iteration is captured, so matching like (.)* will only store the last character; you wanted (.*) for that. Since it's greedy, that will be the character before the space preceding a dollar sign, which given your data will always be a space. Similarly, you're quantifying a group at the beginning with (\s){1,10}, which captures only the last whitespace character. In this case, you don't need the group since \s is a single space character, so you can simply use \s{1,10}.
Here is a piece-by-piece explanation of what that regular expression does.
Capturing solution
The following regex captures the quantity ($1), item description ($2), whether the price is parenthesized ($3), and the price ($4):
^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$
Explained and matched to your sample at regex101.
Separated out and commented (assumes the /x flag is supported):
/ # begin regex
^\s* # start of line, ignore leading spaces if present
(\d+) # $1 = quantity
\s+ # spacing as a delimiter
(.*\S) # $2 = item: contains anything, must end in a non-space char
\s+ # spacing as a delimiter
(\(?) # $3 = negation, an optional open parenthesis
\$ # dollar sign
([0-9.]+) # $4 = price
\)?\s*$ # trailing characters: optional end-paren and space(s)
/x # end regex, multi-line regex flag
with sample perl code executed from a command line:
perl -ne '
my ($quantity, $item, $neg, $price)
= /^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$/;
if ($item) {
if ($neg) { $price *= -1; }
print "<$quantity><$item><$price>\n"
}' RECEIPT_FILE
(If you want that as a perl script, wrap the code with while(<>) { } and you're done.)
This assigns the variables $quantity, $item, and $price to the itemized lines on your receipt. I am assuming that a parenthesized item is to be subtracted (but I can't verify that since the totals are nonsensical), so $neg notes the existence of a parenthesis so the $price can be negated.
I set the output to use angle brackets (< and >) to indicate what each variable stores.
The output of your given sample receipt would therefore be:
<1><Brek Wrap Combo /A><-0.76>
<1><Bacon-wrap><3.79>
<1><Grilled><0.00>
<1><5 Pieces Bacon-wrap><0.00>
<1><Orange><1.40>
<1><Deposit><0.10>
Prices only solution
You didn't say what you wanted to match. If you don't care about anything but the prices and there are no negative values, you don't need matchers if you have negative look-behind or \K:
grep -Po '^\s*[0-9].*\$\K[0-9.]+' RECEIPT_FILE
Grep's -P flag invokes libpcre (which may not be available if you're on an old or embedded system) and -o displays only the matching text. \K denotes the start of the match. Put the \$ after the \K if you want to capture it. (See also the regex101 description and matches.)
Output from that grep command:
0.76
3.79
0.00
0.00
1.40
0.10
Prices only – with awk
There aren't great ways to handle this regex with efficiency. If you're processing through a mountain of content, you'll feel the hurt. Here's a solution using awk that should be significantly faster. (The difference won't be noticeable with a small input.)
awk '$1 / 1 > 0 && $NF ~ /\$/ { gsub(/[()]/, "", $0); print $NF; }' RECEIPT_FILE
Commented version with explanation:
awk '
# if the quantity is indeed a number and the last field has a dollar sign
$1 / 1 > 0 && $NF ~ /\$/ {
gsub(/[()]/, "", $NF); # remove all parentheses from the last field
print $NF; # print the contents of the last field
}' RECEIPT_FILE
Prices only – with awk, supporting negative prices
awk '
# if the quantity is indeed a number and the last field has a dollar sign
$1 / 1 > 0 && $NF ~ /\$/ {
neg = 1;
if ( $NF ~ /\(/ ) { # the last field has an open parenthesis
gsub(/[()]/, "", $NF); # remove all parentheses from the last field
neg = -1;
}
print $NF * neg; # print the last field, negated if parenthesized
}' RECEIPT_FILE
Here's my attempt:
^(\d+)\s+(.*)\s+\(?(\$.+)\)?$
Stub. Remember to turn the multiline option on. Components:
^ - beginning of line
(\d+) - capture the quantity at the beginning of each line item
\s+ - one or more space
(.*) - capture the item description
\s+ - one or more space
\(? - optional open bracket `(` character
($.+) - capture anything including and after the dollar sign
\)? - optional close bracket `)` character
$ - end of line
You can use
^(\d+)\s+(.*?)\s+\(?\$(\d+\.\d+)
See the regex demo
This regex should be used with the /m modifier to match data on different lines. In JS, the /g modifier is also required.
Explanation:
^ - start of a line
(\d+) - Group 1 capturing one or more digits
\s+ - one or more whitespaces
(.*?) - Group 2 capturing zero or more any characters but a newline up to the closest
\s+ - one or more whitespaces
\(? - an optional ( (on the first line)
\$ - a literal $
(\d+\.\d+) - Group 3 capturing one or more digits followed with . and one or more digits.
JS demo:
var re = /^(\d+)\s+(.*?)\s+\(?\$(\d+\.\d+)/gm;
var str = ' Tim Hortons\n Alwasy Fresh\n\n1 Brek Wrap Combo /A ($0.76)\n1 Bacon-wrap $3.79\n1 Grilled $0.00\n1 5 Pieces Bacon-wrap $0.00\n1 Orange $1.40\n1 Deposit $0.10\nSubtotal: $55.84\nGST: $0.29\nDebit: $55.84\nTake out\n\n Thanks for stopping by!!\n Tell us how we did';
while ((m = re.exec(str)) !== null) {
document.body.innerHTML += "Pcs: <b>" + m[1] + "</b>, item: <b>" + m[2] + "</b>, paid: <b>" + m[3] + "</b><br/>";
}
Adam Katz's answer should be the accepted one! I used this variation of his answer for an implementation in JavaScript:
const receiptRegex = /^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$/gm
let items = [];
const matches = inputStr.matchAll(receiptRegex);
for (const matchedGroup of matches) {
const [
fullString, //[0] -> matched string "1 Blue gatorade $2.00"
quantity, //[1] -> quantity "1"
item, //[2] -> item description "Blue gatorade"
ignoredSymbol, //[3] -> "$" (should probably always ignore)
price //[4] -> amount "2.00"
] = matchedGroup;
items.push({
quantity,
item,
price,
});
}

Need to match multiple pattern in the same line - Perl

I need to match multiple pattern in the same line. For example, in this file:
Hello, Chester [McAllister;Scientist] lives in Boston [Massachusetts;USA;Fenway Park] # McAllister works in USA
I'm now working in New-York [NYC;USA] # I work in USA
...
First, I want to match every string into the brackets knowing that it is possible to have more than 1 pattern and also that we can have 1 to n strings into the brackets always separated by a semicolon.
Finally, for each line i need to compare the values to the string located after the #. For example in the first sentence, i want to compare:
[McAllister;Scientist] & [Massachusetts;USA;Fenway Park] TO "McAllister works in USA"
The tidiest way is probably to use a regex to find all the embedded sequences delimited by square brackets, and then use map with split to separate those sequences into terms.
This program demonstrates.
Note that I have assumed that all of the data in the file has been read into a single scalar variable. You can alter this to process a single line at a time, but only if the bracketed subsequences are never split across multiple lines
use strict;
use warnings;
my $s = <<END_TEXT;
Hello, Chester [McAllister;Scientist] lives in Boston [Massachusetts;USA;Fenway Park] # McAllister works in USA
I'm now working in New-York [NYC;USA] # I work in USA
END_TEXT
my #data = map [ split /;/ ], $s =~ / \[ ( [^\[\]]+ ) \] /xg;
use Data::Dump;
dd \#data;
output
[
["McAllister", "Scientist"],
["Massachusetts", "USA", "Fenway Park"],
["NYC", "USA"],
]
Try this
This is also gives what you expect.
use strict;
use warnings;
open('new',"file.txt");
my #z =map{m/\[[\w;\s]+\]/g} <new>;
print "$_ ,",foreach(#z);
You actually need match the words separated by the ; within the [].

How to Capture Only Surnames from a Regex Pattern?

Team
I have written a Perl program to validate the accuracy of formatting (punctuation and the like) of surnames, forenames, and years.
If a particular entry doesn't follow a specified pattern, that entry is highlighted to be fixed.
For example, my input file has lines of similar text:
<bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., & Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>
My programs works just fine, that is, if any entry doesn't follow the pattern, the script generates an error. The above input text doesn't generate any error. But the one below is an example of an error because Rose A. J. is missing a comma after Rose:
NOT FOUND: <bibliomixed id="bkrmbib120">Asher, S. R., & Rose A. J. (1997). Promoting children’s social-emotional adjustment with peers. In P. Salovey & D. Sluyter, (Eds). <emphasis>Emotional development and emotional intelligence: Educational implications.</emphasis> New York: Basic Books.</bibliomixed>
From my regex search pattern, is it possible to capture all the surnames and the year, so I can generate a text prefixed to each line as shown below?
<BIB>Abdo, Afif-Abdo, Otani, Machado, 2008</BIB><bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., & Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>
My regex search script is as follows:
while(<$INPUT_REF_XML_FH>){
$line_count += 1;
chomp;
if(/
# bibliomixed XML ID tag and attribute----<START>
<bibliomixed
\s+
id=".*?">
# bibliomixed XML ID tag and attribute----<END>
# --------2 OR MORE AUTHOR GROUP--------<START>
(?:
(?:
# pattern for surname----<START>
(?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
(?:(?:[\w\x{2019}|\x{0027}]+-)+)? # surnames with hyphens
(?:[A-Z](?:\x{2019}|\x{0027}))? # surnames with closing single quote or apostrophe O’Leary
(?:St\.\s)? # pattern for St.
(?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
(?:[\w\x{2019}|\x{0027}]+) # final surname pattern----REQUIRED
# pattern for surname----<END>
,\s
# pattern for forename----<START>
(?:
(?:(?:[A-Z]\.\s)+)? #initials with periods
(?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
(?:(?:[A-Z]\.\s)+)? #initials with periods
[A-Z]\. #----REQUIRED
# pattern for titles....<START>
(?:,\s(?:Jr\.|Sr\.|II|III|IV))?
# pattern for titles....<END>
)
# pattern for forename----<END>
,\s)+
#---------------FINAL AUTHOR GROUP SEPATOR----<START>
&\s
#---------------FINAL AUTHOR GROUP SEPATOR----<END>
# --------2 OR MORE AUTHOR GROUP--------<END>
)?
# --------LAST AUTHOR GROUP--------<START>
# pattern for surname----<START>
(?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
(?:(?:[\w\x{2019}|\x{0027}]+-)+)? # surnames with hyphens
(?:[A-Z](?:\x{2019}|\x{0027}))? # surnames with closing single quote or apostrophe O’Leary
(?:St\.\s)? # pattern for St.
(?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
(?:[\w\x{2019}|\x{0027}]+) # final surname pattern----REQUIRED
# pattern for surname----<END>
,\s
# pattern for forename----<START>
(?:
(?:(?:[A-Z]\.\s)+)? #initials with periods
(?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
(?:(?:[A-Z]\.\s)+)? #initials with periods
[A-Z]\. #----REQUIRED
# pattern for titles....<START>
(?:,\s(?:Jr\.|Sr\.|II|III|IV))?
# pattern for titles....<END>
)
# pattern for forename----<END>
(?: # pattern for editor notation----<START>
\s\(Ed(?:s)?\.\)\.
)? # pattern for editor notation----<END>
# --------LAST AUTHOR GROUP--------<END>
\s
\(
# pattern for a year----<START>
(?:[A-Za-z]+,\s)? # July, 1999
(?:[A-Za-z]+\s)? # July 1999
(?:[0-9]{4}\/)? # 1999\/2000
(?:\w+\s\d+,\s)?# August 18, 2003
(?:[0-9]{4}|in\spress|manuscript\sin\spreparation) # (1999) (in press) (manuscript in preparation)----REQUIRED
(?:[A-Za-z])? # 1999a
(?:,\s[A-Za-z]+\s[0-9]+)? # 1999, July 2
(?:,\s[A-Za-z]+\s[0-9]+\x{2013}[0-9]+)? # 2002, June 19–25
(?:,\s[A-Za-z]+)? # 1999, Spring
(?:,\s[A-Za-z]+\/[A-Za-z]+)? # 1999, Spring\/Winter
(?:,\s[A-Za-z]+-[A-Za-z]+)? # 2003, Mid-Winter
(?:,\s[A-Za-z]+\s[A-Za-z]+)? # 2007, Anniversary Issue
# pattern for a year----<END>
\)\.
/six){
print $FOUND_REPORT_FH "$line_count\tFOUND: $&\n";
$found_count += 1;
} else{
print $ERROR_REPORT_FH "$line_count\tNOT FOUND: $_\n";
$not_found_count += 1;
}
Thanks for your help,
Prem
Alter this bit
# pattern for surname----<END>
,?\s
This now means an optional , followed by white space. If the Persons surname is "Bunga Bunga" it won't work
All of your subpatterns are non-capturing groups, starting with (?:. This reduces compilation times by a number of factors, one of which being that the subpattern is not captured.
To capture a pattern you merely need to place parenthesis around the part you require to capture. So you could remove the non-capturing assertion ?: or place parens () where you need them. http://perldoc.perl.org/perlretut.html#Non-capturing-groupings
I'm not sure but, from your code I think you may be attempting to use lookahead assertions as, for example, you test for surnames with spaces, if none then test for surnames with hyphens. This will not start from the same point every time, it will either match the first example or not, then move forward to test the next position with the second surname pattern, whether the regex will then test the second name for the first subpattern is what I am unsure of. http://perldoc.perl.org/perlretut.html#Looking-ahead-and-looking-behind
#!usr/bin/perl
use warnings;
use strict;
my $line = '123 456 7antelope89';
$line =~ /^(\d+\s\d+\s)?(\d+\w+\d+)?/;
my ($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
$line = '123 456 7bealzelope89';
$line =~ /(?:\d+\s\d+\s)?(?:\d+\w+\d+)?/;
($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
$line = '123 456 7canteloupe89';
$line =~ /((?:\d+\s\d+\s))?(?:\d+(\w+)\d+)?/;
($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
exit 0;
For capturing the whole pattern the first pattern of the third example does not make sense, as this tells the regex to not capture the pattern group while also capturing the pattern group. Where this is useful is in the second pattern which is a fine grained pattern capture, in that the pattern captured is part of a non-capturing group.
a: 123 456 b: 7antelope89
a: nocapture b: nocapture
a: 123 456 b: canteloupe
One little nitpic
id=".*?"
may be better as
id="\w*?"
id names requiring to be _alphanumeric iirc.

perl regex for extracting multiline blocks

I have text like this:
00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
So, I don't have a block end, just a new block start.
I want to recursively get all blocks:
1 = 00:00 stuff
2 = 00:01 more stuff
multi line
and going
etc
The bellow code only gives me this:
$VAR1 = '00:00';
$VAR2 = '';
$VAR3 = '00:01';
$VAR4 = '';
$VAR5 = '00:02';
$VAR6 = '';
What am I doing wrong?
my $text = '00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
';
my #array = $text =~ m/^([0-9]{2}:[0-9]{2})(.*?)/gms;
print Dumper(#array);
Version 5.10.0 introduced named capture groups that are useful for matching nontrivial patterns.
(?'NAME'pattern)
(?<NAME>pattern)
A named capture group. Identical in every respect to normal capturing parentheses () but for the additional fact that the group can be referred to by name in various regular expression constructs (such as \g{NAME}) and can be accessed by name after a successful match via %+ or %-. See perlvar for more details on the %+ and %- hashes.
If multiple distinct capture groups have the same name then the $+{NAME} will refer to the leftmost defined group in the match.
The forms (?'NAME'pattern) and (?<NAME>pattern) are equivalent.
Named capture groups allow us to name subpatterns within the regex as in the following.
use 5.10.0; # named capture buffers
my $block_pattern = qr/
(?<time>(?&_time)) (?&_sp) (?<desc>(?&_desc))
(?(DEFINE)
# timestamp at logical beginning-of-line
(?<_time> (?m:^) [0-9][0-9]:[0-9][0-9])
# runs of spaces or tabs
(?<_sp> [ \t]+)
# description is everything through the end of the record
(?<_desc>
# s switch makes . match newline too
(?s: .+?)
# terminate before optional whitespace (which we remove) followed
# by either end-of-string or the start of another block
(?= (?&_sp)? (?: $ | (?&_time)))
)
)
/x;
Use it as in
my $text = '00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
';
while ($text =~ /$block_pattern/g) {
print "time=[$+{time}]\n",
"desc=[[[\n",
$+{desc},
"]]]\n\n";
}
Output:
$ ./blocks-demo
time=[00:00]
desc=[[[
stuff
]]]
time=[00:01]
desc=[[[
more stuff
multi line
and going
]]]
time=[00:02]
desc=[[[
still
have
]]]
This should do the trick. Beginning of next \d\d:\d\d is treated as block end.
use strict;
my $Str = '00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
00:03 still
have' ;
my #Blocks = ($Str =~ m#(\d\d:\d\d.+?(?:(?=\d\d:\d\d)|$))#gs);
print join "--\n", #Blocks;
Your problem is that .*? is non-greedy in the same way that .* is greedy. When it is not forced, it matches as little as possible, which in this case is the empty string.
So, you'll need something after the non-greedy match to anchor up your capture. I came up with this regex:
my #array = $text =~ m/\n?([0-9]{2}:[0-9]{2}.*?)(?=\n[0-9]{2}:|$)/gs;
As you see, I removed the /m option to accurately be able to match end of string in the look-ahead assertion.
You might also consider this solution:
my #array = split /(?=[0-9]{2}:[0-9]{2})/, $text;