Prefixing multiple lines with a previously captured token - regex

I'm looking for a search/replace regular expression which will capture tokens and apply them as a prefix to every subsequent line within a document.
So this..
Tokens always start with ##..
Nothing is prefixed until a token is encountered..
##CAT
furball
scratch
##DOG
egg
##MOUSE
wheel
on the stair
Becomes..
Tokens always start with ##..
Nothing is prefixed until a token is encountered..
##CAT
CAT furball
CAT scratch
##DOG
DOG egg
##MOUSE
MOUSE wheel
MOUSE on the stair

You can use this pattern:
search: ((?:\A|\n)##([^\r\n]+)(?>\r?\n\2[^\r\n]+)*+\r?\n(?!##))
replace: $1$2 <= with a space at the end
But you must apply the search/replace repeatedly until there are no more matches.

As far as I know, this is impossible. The closest I can get is replacing
^##(.*)\r?\n(.*)
with
##\1\n\1 \2
Output:
Tokens always start with ##..
Nothing is prefixed until a token is encountered..
##CAT
CAT furball
scratch
##DOG
DOG egg
##MOUSE
MOUSE wheel
on the stair

You have the pcre tag and the Notepad++ tag.
I don't think you can actually do this without a callback mechanism.
That being said, you can do it without a callback, but you need to divide
up the functionality.
This is a php example that might give you some ideas.
Note: PHP concatenates strings with the '.' operator, which is what the example uses.
Use the pattern in multi-line mode (the /m modifier).
^ # Begin of line
(?-s) # Modifier, No Dot-All
(?:
( # (1 start)
\#\# # Token indicator
( \w+ ) # (2), Token
.* # The rest of line
) # (1 end)
| # or,
( .* ) # (3), Just a non-token line
)
$ # End of line
# $token = "";
# $str = preg_replace_callback('/^(?-s)(?:(\#\#(\w+).*)|(.*))$/m',
# function( $matches ) use ( &$token ) { // by reference, so the token persists between calls
# if ( $matches[1] != "" ) {
# $token = $matches[2];
# return $matches[1];
# }
# else
# if ( $token != "" ) {
# return $token . " " . $matches[3];
# }
# return $matches[3];
# },
# $text);
#
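For comparison, here is a minimal Python sketch of the same "remember the last token" idea, written as a plain line-by-line pass rather than a regex callback. The function name prefix_with_tokens and the sample text are made up for illustration, and it assumes tokens are word characters directly after ##, as in the pattern above.

```python
import re

# A token line starts with ## followed by word characters (assumption taken
# from the \w+ used in the pattern above).
TOKEN_LINE = re.compile(r"^##(\w+)")


def prefix_with_tokens(text: str) -> str:
    """Prefix every non-token line with the most recent ##TOKEN seen."""
    token = None  # nothing is prefixed until a token line is encountered
    out = []
    for line in text.split("\n"):
        m = TOKEN_LINE.match(line)
        if m:
            token = m.group(1)  # remember the new token, keep the line as-is
            out.append(line)
        elif token is not None:
            out.append(f"{token} {line}")
        else:
            out.append(line)
    return "\n".join(out)


sample = "intro\n##CAT\nfurball\nscratch\n##DOG\negg"
print(prefix_with_tokens(sample))
```

Like the callback version, this also prefixes empty lines that follow a token; add a check for that if it is not wanted.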

Related

How to avoid warnings in Perl regex substitution with alternatives?

I have this regex.
$string =~ s/(?<!["\w])(\w+)(?=:)|(?<=:)([\w\d\\.+=\/]+)/"$1$2"/g;
The regex itself works fine.
But since I am substituting alternatives (and globally), I always get a warning that $1 or $2 is uninitialized. These warnings clutter my logfile.
What can I do better to avoid such warning?
Or is my best option to just turn the warning off? I doubt this.
Side question: Is there possibly some better way of doing this, e.g. not using regex at all?
What I am doing is fixing JSON where some key:value pairs do not have double quotes and JSON module does not like it when trying to decode.
There are a couple of approaches to get around this.
If you intend to use capture groups:
When each branch of the alternation is captured in its entirety,
combine the capture groups into one and move that group outside the alternation.
( # (1 start)
(?<! ["\w] )
\w+
(?= : )
|
(?<= : )
[\w\d\\.+=/]+
) # (1 end)
s/((?<!["\w])\w+(?=:)|(?<=:)[\w\d\\.+=\/]+)/"$1"/g
Use a Branch Reset construct (?| aaa ).
This will cause the capture groups in each alternation to start numbering from the same point.
from the same point.
(?|
(?<! ["\w] )
( \w+ ) # (1)
(?= : )
|
(?<= : )
( [\w\d\\.+=/]+ ) # (1)
)
s/(?|(?<!["\w])(\w+)(?=:)|(?<=:)([\w\d\\.+=\/]+))/"$1"/g
Use named capture groups that are reusable (similar to a branch reset).
In each branch of the alternation, reuse the same names, and make the group that is not relevant to that branch an empty group.
This works by using the name in the substitution instead of the number.
(?<! ["\w] )
(?<V1> \w+ ) # (1)
(?<V2> ) # (2)
(?= : )
|
(?<= : )
(?<V1> ) # (3)
(?<V2> [\w\d\\.+=/]+ ) # (4)
s/(?<!["\w])(?<V1>\w+)(?<V2>)(?=:)|(?<=:)(?<V1>)(?<V2>[\w\d\\.+=\/]+)/"$+{V1}$+{V2}"/g
The two concepts of the named substitution and a branch reset can be combined
if an alternation contains more than 1 capture group.
The example below uses the capture group numbers.
The theory is that you put dummy capture groups in each alternation to
"pad" the branch to equal the largest number of groups in a single alternation.
Indeed, this must be done to avoid the bug in Perl regex that could cause a crash.
(?| # Branch Reset
# ------ Br 1 --------
( ) # (1)
( \d{4} ) # (2)
ABC294
( [a-f]+ ) # (3)
|
# ------ Br 2 --------
( :: ) # (1)
( \d+ ) # (2)
ABC555
( ) # (3)
|
# ------ Br 3 --------
( == ) # (1)
( ) # (2)
ABC18888
( ) # (3)
)
s/(?|()(\d{4})ABC294([a-f]+)|(::)(\d+)ABC555()|(==)()ABC18888())/"$1$2$3"/g
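The first approach (one outer group around the whole alternation) is not Perl-specific. As a rough illustration, here it is in Python, where the same pattern works because both lookbehinds are fixed-width. The function name quote_bare_words and the sample input are invented for this sketch.

```python
import re

# One outer group wraps both branches, so group 1 is set for every match,
# whichever branch matched.
PATTERN = re.compile(r'((?<!["\w])\w+(?=:)|(?<=:)[\w\\.+=/]+)')


def quote_bare_words(s: str) -> str:
    """Wrap unquoted keys and unquoted values in double quotes."""
    return PATTERN.sub(r'"\1"', s)


print(quote_bare_words("{foo:1,bar:baz}"))
```

Note that, exactly like the original substitution, this turns bare numbers into quoted strings.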
You can try using Cpanel::JSON::XS's relaxed mode, or JSONY, to parse the almost-JSON and then write out regular JSON using Cpanel::JSON::XS. Depending what exactly is wrong with your input data one or the other might understand it better.
use strict;
use warnings;
use Cpanel::JSON::XS 'encode_json';
# JSON is normally UTF-8 encoded; if you're reading it from a file, you will likely need to decode it from UTF-8
my $string = q<{foo: 1,bar:'baz',}>;
my $data = Cpanel::JSON::XS->new->relaxed->decode($string);
my $json = encode_json $data;
print "$json\n";
# Alternatively, parse with JSONY (a separate script; $string and encode_json as above)
use JSONY;
my $data = JSONY->new->load($string);
my $json = encode_json $data;
print "$json\n";

Capturing the same regular expression over multiple lines

I want to capture series of file names that are listed each in a new line, and I have figured out how to capture the file name in the first line, but I haven't figured out how to repeat it on the subsequent lines.
# Input
# data/raw/file1
# data/raw/file2
# Output
# data/interim/file1
# data/interim/file2
Current Attempt
The regular expression I currently have is
# Input\n(# (.*))
And my inner capture group properly captures data/raw/file1.
Desired Output
What I want is to grab all of the files in between # Input and # Output, so in this example, data/raw/file1 and data/raw/file2.
Go with \G magic:
(?:^#\s+Input|\G(?!\A))\R*(?!#\s+Output)#\s*(.*)|[\s\S]*
Regex breakdown
(?: # Start of non-capturing group (a)
^#\s+Input # Match a line beginning with `# Input`
| # Or
\G(?!\A) # Continue from previous successful match point
) # End of NCG (a)
\R* # Match any kind of newline characters
(?!#\s+Output) # Which are not followed by such a line `# Output`
#\s*(.*) # Start matching a path line and capture path
| # If previous patterns didn't match....
[\s\S]* # Then consume the rest of the input so the engine does not retry at every position
PHP code:
$re = '~(?:^#\s+Input|\G(?!\A))\R*(?!#\s+Output)#\s*(.*)|[\s\S]*~m';
$str = '# Input
# data/raw/file1
# data/raw/file2
# Output
# data/interim/file1
# data/interim/file2';
preg_match_all($re, $str, $matches, PREG_PATTERN_ORDER, 0);
// Print the entire match result
print_r(array_filter($matches[1]));
Output:
Array
(
[0] => data/raw/file1
[1] => data/raw/file2
)
Using the s modifier, preg_match, and preg_split you can get each result on its own.
preg_match('/# Input\n(# (?:.*?))# Output/s', '# Input
# data/raw/file1
# data/raw/file2
# Output
# data/interim/file1
# data/interim/file2', $match);
$matched = preg_split('/# /', $match[1], -1, PREG_SPLIT_NO_EMPTY);
print_r($matched);
Demo: https://3v4l.org/dAcRp
Regex demo: https://regex101.com/r/5tfJGM/1/
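If your engine has no \G (Python's re module, for instance, does not), the two-step idea from the second answer still applies: isolate the block first, then pull the paths out of it. A small sketch, assuming the input looks exactly like the sample in the question; input_paths is just an illustrative name.

```python
import re

text = """# Input
# data/raw/file1
# data/raw/file2
# Output
# data/interim/file1
# data/interim/file2"""


def input_paths(s: str) -> list[str]:
    # Step 1: isolate the lines between "# Input" and "# Output".
    block = re.search(r"^# Input\n(.*?)^# Output", s, re.M | re.S)
    if block is None:
        return []
    # Step 2: capture the path on each "# ..." line of that block.
    return re.findall(r"^#\s*(.+)$", block.group(1), re.M)


print(input_paths(text))
```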

I want to match anything not including '#' but yes including '\#'

I'm looking at a line of Perl code that may contain regexps and comments.
I need to capture everything up to the comment, so I want all characters until # but I AM INTERESTED in capturing \#
for example, if the line was:
if ($line=/\#/) { #captures lines with '#'
I want to capture:
if ($line=/\#/) {
Give this a try:
use PPI;
my $ppi = PPI::Document->new('source.pl');
my $source = '';
for my $token ( @{ $ppi->find("PPI::Token") } ) {
last if $token->isa("PPI::Token::Comment");
$source .= $token;
}
print $source;
This should handle pretty much everything except here-docs. If you need to deal with those, start by copying PPI::Document::serialize and modify it to stop on the first comment.
Try this
^(?:[^#]|(?<=\\)#)+
This will match anything from the start of the string (^) that is not a # ([^#]) OR
a # that is preceded by a backslash ((?<=\\)#)
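To see the pattern at work on the line from the question, here is a small Python sketch (the helper name strip_comment is made up). It trims the trailing space that the match leaves before the comment.

```python
import re

# Anything that is not '#', or a '#' that directly follows a backslash.
CODE_BEFORE_COMMENT = re.compile(r"^(?:[^#]|(?<=\\)#)+")


def strip_comment(line: str) -> str:
    m = CODE_BEFORE_COMMENT.match(line)
    return m.group(0).rstrip() if m else ""


line = r"if ($line=/\#/) { #captures lines with '#'"
print(strip_comment(line))  # → if ($line=/\#/) {
```

Keep in mind that a regex like this only looks at backslashes: an unescaped # inside a quoted string would still be treated as the start of a comment.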

perl regex for extracting multiline blocks

I have text like this:
00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
So, I don't have a block end, just a new block start.
I want to recursively get all blocks:
1 = 00:00 stuff
2 = 00:01 more stuff
multi line
and going
etc
The code below only gives me this:
$VAR1 = '00:00';
$VAR2 = '';
$VAR3 = '00:01';
$VAR4 = '';
$VAR5 = '00:02';
$VAR6 = '';
What am I doing wrong?
my $text = '00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
';
my @array = $text =~ m/^([0-9]{2}:[0-9]{2})(.*?)/gms;
print Dumper(@array);
Version 5.10.0 introduced named capture groups that are useful for matching nontrivial patterns.
(?'NAME'pattern)
(?<NAME>pattern)
A named capture group. Identical in every respect to normal capturing parentheses () but for the additional fact that the group can be referred to by name in various regular expression constructs (such as \g{NAME}) and can be accessed by name after a successful match via %+ or %-. See perlvar for more details on the %+ and %- hashes.
If multiple distinct capture groups have the same name then the $+{NAME} will refer to the leftmost defined group in the match.
The forms (?'NAME'pattern) and (?<NAME>pattern) are equivalent.
Named capture groups allow us to name subpatterns within the regex as in the following.
use 5.10.0; # named capture buffers
my $block_pattern = qr/
(?<time>(?&_time)) (?&_sp) (?<desc>(?&_desc))
(?(DEFINE)
# timestamp at logical beginning-of-line
(?<_time> (?m:^) [0-9][0-9]:[0-9][0-9])
# runs of spaces or tabs
(?<_sp> [ \t]+)
# description is everything through the end of the record
(?<_desc>
# s switch makes . match newline too
(?s: .+?)
# terminate before optional whitespace (which we remove) followed
# by either end-of-string or the start of another block
(?= (?&_sp)? (?: $ | (?&_time)))
)
)
/x;
Use it as in
my $text = '00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
';
while ($text =~ /$block_pattern/g) {
print "time=[$+{time}]\n",
"desc=[[[\n",
$+{desc},
"]]]\n\n";
}
Output:
$ ./blocks-demo
time=[00:00]
desc=[[[
stuff
]]]
time=[00:01]
desc=[[[
more stuff
multi line
and going
]]]
time=[00:02]
desc=[[[
still
have
]]]
This should do the trick. Beginning of next \d\d:\d\d is treated as block end.
use strict;
my $Str = '00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
00:03 still
have' ;
my @Blocks = ($Str =~ m#(\d\d:\d\d.+?(?:(?=\d\d:\d\d)|$))#gs);
print join "--\n", @Blocks;
Your problem is that .*? is non-greedy in the same way that .* is greedy. When it is not forced, it matches as little as possible, which in this case is the empty string.
So, you'll need something after the non-greedy match to anchor up your capture. I came up with this regex:
my @array = $text =~ m/\n?([0-9]{2}:[0-9]{2}.*?)(?=\n[0-9]{2}:|$)/gs;
As you see, I removed the /m option to accurately be able to match end of string in the look-ahead assertion.
You might also consider this solution:
my @array = split /(?=[0-9]{2}:[0-9]{2})/, $text;
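The same "lazy match anchored by a lookahead" idea can be sketched in Python, in case it helps to see it outside Perl. This is only an illustration using the sample text from the question; split_blocks is a made-up name, and trailing newlines are stripped from each block.

```python
import re

text = """00:00 stuff
00:01 more stuff
multi line
and going
00:02 still
have
"""

# A block starts at a timestamp at the beginning of a line and runs lazily
# until the next such timestamp or the end of the string.
BLOCK = re.compile(r"^\d\d:\d\d.*?(?=^\d\d:\d\d|\Z)", re.M | re.S)


def split_blocks(s: str) -> list[str]:
    return [b.rstrip() for b in BLOCK.findall(s)]


for i, block in enumerate(split_blocks(text), start=1):
    print(f"{i} = {block}")
```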

How to ignore whitespace in a regular expression subject string, but only if it comes after a newline?

What is the best way to ignore the white space in a target string when searching for matches using a regular expression pattern, but only if the whitespace comes after a newline (\n)? For example, if my search is for "cats", I would want "c\n ats" or "ca\n ts" to match but not "c ats" since the whitespace doesn't come after a newline. I can't strip out the whitespace beforehand because I need to find the begin and end index of the match (including any whitespace) in order to highlight that match and any whitespace needs to be there for formatting purposes.
If the regex engine you're using supports lookaround assertions, use a positive lookbehind assertion to check for the presence of a preceding newline:
(?<=\n)\s
"What is the best way to ignore the white space in a target string when searching for matches using a regular expression pattern"
I would construct a regex dynamically, inserting a (?:\n\s)? between each character.
use strict;
use warnings;
my $needed = 'cats';
my $regex = join '(?:\n\s)?' , split ( '',$needed );
print "\nRegex = $regex\n", '-'x40, "\n\n";
my $target = "
cats
c ats
c\n ats
ca ts
ca\n ts
cat s
cat\n s
";
while ( $target =~ /($regex)/g)
{
print "Found - '$1'\n\n";
}
The output:
Regex = c(?:\n\s)?a(?:\n\s)?t(?:\n\s)?s
----------------------------------------
Found - 'cats'
Found - 'c
ats'
Found - 'ca
ts'
Found - 'cat
s'
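Since the question asks for the begin and end index of each match, here is a rough Python version of the same dynamically built pattern that reports spans. The name newline_tolerant is hypothetical; each character of the search term is escaped so that terms containing regex metacharacters still work.

```python
import re


def newline_tolerant(needle: str) -> re.Pattern:
    # Allow an optional "newline + one whitespace char" between characters.
    return re.compile(r"(?:\n\s)?".join(re.escape(ch) for ch in needle))


pattern = newline_tolerant("cats")
for sample in ["cats", "c\n ats", "ca\n ts", "c ats"]:
    m = pattern.search(sample)
    # span() gives (start, end) including any newline/whitespace consumed
    print(repr(sample), m.span() if m else None)
```

The span covers the skipped newline and whitespace too, so it can be used directly for highlighting in the original text.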
I have made a small ruby snippet based on the rules you have listed. Is this what you are looking for?
data = <<DATA
test1c\n atsOKexpected

test2ca\n tsOKexpected

test3catsOKexpected

test5ca tsBADexpected

test6 catsOKexpected

test7cats OKexpected
DATA
tests = data.split(/\n\n/)
regex = /c(\n )?a(\n )?t(\n )?s/
tests.each do |s|
if s =~ regex
puts "OK\n#{s}\n\n"
else
puts "BAD\n#{s}\n\n"
end
end
# RESULTS
# OK
# test1c
# atsOKexpected
#
# OK
# test2ca
# tsOKexpected
#
# OK
# test3catsOKexpected
#
# BAD
# test5ca tsBADexpected
#
# OK
# test6 catsOKexpected
#
# OK
# test7cats OKexpected