Why doesn't zero-width match regex work? - regex

I wrote a Perl function that replaces the job name in a JCL script. It uses zero-width assertions.
sub modify_jcl_jobname ()
{
my ($jcl, $old, $new) = @_;
$jcl =~ s/
# The name must begin in column 3.
^(?<=\/\/)
# The first character must be alphabetic or national.
($old)
# The name must be followed by at least one blank.
# Append JCL keyword JOB
(?=\s+JOB)
/$new/xmig; # Multi-lines, ignore case.
return $jcl;
}
But this function didn't work until I made one simple change: deleting the leading "^".
#before ^(?<=\/\/)
#after (?<=\/\/)
I'd like to understand the cause of the problem. Any reply would be appreciated. Thanks.

The problem lies with
^(?<=\/\/)
That pattern can only match if the position where ^ matched is immediately preceded by the two characters //. That never happens: under /m, ^ matches only at the start of the string or right after a newline, so the two characters before that position can never be //.
But you don't want to start matching at the start of the line. You want to start matching 2 characters in. What you want is actually:
(?<=^\/\/)
After doing some improvements, the code looks like:
sub modify_jcl_jobname {
my ($jcl, $old, $new) = @_;
$jcl =~ s{
(?<= ^// )
\Q$old\E
(?= \s+ JOB )
}{$new}xmig;
return $jcl;
}
Improvements:
Removed the incorrect prototype (()). It forced the caller to tell Perl to ignore the prototype (by using &).
Added code (\Q...\E) to convert the contents of $old into a regex pattern before using it as such.
Removed the needless capture ((...)).
Switched the delimiters of the substitution (from s/// to s{}{}) to require less escaping.
Removed highly redundant comments. (Good comments explain why something is being done rather than what is being done.)
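To see why \Q...\E matters, here is a minimal sketch. The job card and the job name PAY$01 are made up for illustration (assuming $ is an allowed character in a job name); nothing here comes from the original code beyond the pattern itself.

```perl
use strict;
use warnings;

# Hypothetical job card; the name PAY$01 is an assumption for illustration.
my $jcl = '//PAY$01  JOB (ACCT)' . "\n";
my $old = 'PAY$01';

# Interpolated as a pattern: the "$" in the name is read as an end-of-line anchor,
# so the pattern cannot match the literal text PAY$01.
my $raw_matches    = ( $jcl =~ m{ (?<= ^// ) $old     (?= \s+ JOB ) }xmi ) ? 1 : 0;

# Quoted with \Q...\E: every character of the name is matched literally.
my $quoted_matches = ( $jcl =~ m{ (?<= ^// ) \Q$old\E (?= \s+ JOB ) }xmi ) ? 1 : 0;

print "unquoted: $raw_matches, quoted: $quoted_matches\n";
```

The unquoted pattern finds nothing, while the quoted one finds the job name.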
The optimiser might handle this version better:
$jcl =~ s{
^// \K
\Q$old\E
(?= \s+ JOB )
}{$new}xmig;
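The difference between the three forms can be checked with a short script. This is only a sketch: the two-line JCL fragment and the names OLDJOB and NEWJOB are invented for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical JCL fragment, used only for illustration.
my $jcl = "//OLDJOB  JOB (ACCT),'TEST'\n//STEP1   EXEC PGM=IEFBR14\n";

# ^ followed by a look-behind for "//": wherever ^ matches (start of a line),
# the two preceding characters can never be "//", so nothing is replaced.
(my $broken = $jcl) =~ s{ ^ (?<= // ) OLDJOB (?= \s+ JOB ) }{NEWJOB}xmig;

# Anchor moved inside the look-behind: "start of line, then //" precedes the name.
(my $fixed = $jcl) =~ s{ (?<= ^// ) OLDJOB (?= \s+ JOB ) }{NEWJOB}xmig;

# Same effect with \K: "//" is matched but kept out of the replaced text.
(my $with_k = $jcl) =~ s{ ^// \K OLDJOB (?= \s+ JOB ) }{NEWJOB}xmig;

print $broken eq $jcl ? "broken pattern: unchanged\n" : "broken pattern: changed\n";
print $fixed;
print $with_k;
```

Only the first variant leaves the input untouched; the other two produce the same result.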

The ^ sign matches the beginning of the line. You then want something preceded by two slashes - where should these slashes go if the next character is the very first character of the line?
s{^//
($old)
...
}{//$new}xmig
should work: you need no look behind.
Update: Thanks to ikegami, I now see why you used it. You want to keep the // in the string: well, you can repeat them in the substitution, or move the ^ character into the look-behind.

Related

Regex to find(/replace) multiple instances of character in string

I have a (probably very basic) question about how to construct a (perl) regex, perl -pe 's///g;', that would find/replace multiple instances of a given character/set of characters in a specified string. Initially, I thought the g "global" flag would do this, but I'm clearly misunderstanding something very central here. :/
For example, I want to eliminate any non-alphanumeric characters in a specific string (within a larger text corpus). Just by way of example, the string is identified by starting with [ followed by #, possibly with some characters in between.
[abc#def"ghi"jkl'123]
The following regex
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1$2/g;
will find the first " and if I run it three times I have all three.
Similarly, what if I want to replace the non-alphanumeric characters with something else, let's say an X.
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1X$2/g;
does the trick for one instance. But how can I find all of them in one go?
The reason your code doesn't work is that /g doesn't rescan the string after a substitution. It finds all non-overlapping matches of the given regex and then substitutes the replacement part in.
In [abc#def"ghi"jkl'123], there is only a single match (which is the [abc#def" part of the string, with $1 = '[abc#def' and $2 = ''), so only the first " is removed.
After the first match, Perl scans the remaining string (ghi"jkl'123]) for another match, but it doesn't find another [ (or #).
I think the most straightforward solution is to use a nested search/replace operation. The outer match identifies the string within which to substitute, and the inner match does the actual replacement.
In code:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9//cdr }xe;
Or to replace each match by X:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9/X/cr }xe;
We match a prefix of [, followed by 0 or more characters that are not [ or ] or #, followed by #.
\K is used to mark the virtual beginning of the match (i.e. everything matched so far is not included in the matched string, which simplifies the substitution).
We match and capture 0 or more characters that are not [ or ].
Finally we match a suffix of ] in a look-ahead (so it's not part of the matched string either).
The replacement part is executed as a piece of code, not a string (as indicated by the /e flag). Here we could have used $1 =~ s/[^a-zA-Z0-9]//gr or $1 =~ s/[^a-zA-Z0-9]/X/gr, respectively, but since each inner match is just a single character, it's also possible to use a transliteration.
We return the modified string (as indicated by the /r flag) and use it as the replacement in the outer s operation.
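Putting the pieces together, the following sketch runs the original attempt and both nested substitutions on the sample string from the question. The variable names are chosen here for illustration, and the /r flag assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

my $line = q{[abc#def"ghi"jkl'123]};

# The attempt from the question: /g finds a single match, so only the first " goes.
(my $once = $line) =~ s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1$2/g;

# Outer match selects the text after '#'; the inner tr/// deletes non-alphanumerics.
(my $clean = $line) =~ s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9//cdr }xe;

# Same outer match; the inner tr/// replaces each non-alphanumeric with X.
(my $masked = $line) =~ s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9/X/cr }xe;

print "$once\n$clean\n$masked\n";
```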
So...I'm going to suggest a marvelously computationally inefficient approach to this. Marvelously inefficient, but possibly still faster than a variable-length lookbehind would be...and also easy (for you):
The \K causes everything before it to be dropped....so only the character after it is actually replaced.
perl -pe 'while (s/\[[^]]*#[^]]*\K[^]a-zA-Z0-9]//){}' file
Basically we just have an empty loop that executes until the search and replace replaces nothing.
Slightly improved version:
perl -pe 'while (s/\[[^]]*?#[^]]*?\K[^]a-zA-Z0-9](?=[^]]*?])//){}' file
The (?=) verifies that its content exists after the match without being part of the match. This is a variable-length lookahead (what we're missing going the other direction). I also made the *s lazy with the ? so we get the shortest match possible.
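As a script rather than a one-liner, the loop can be sketched like this. The surrounding text "x" and "y" and the pass counter are additions for illustration; the counter just makes visible that each pass removes one character.

```perl
use strict;
use warnings;

my $line = q{x [abc#def"ghi"jkl'123] y};

# Each successful substitution deletes exactly one non-alphanumeric character
# located after the '#' and before the closing ']'. Repeat until none is left.
my $passes = 0;
$passes++ while $line =~ s/\[[^]]*#[^]]*\K[^]a-zA-Z0-9]//;

print "$line (after $passes passes)\n";
```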
Here is another approach. Capture precisely the substring that needs work, and in the replacement part run a regex on it that cleans it of non-alphanumeric characters
use warnings;
use strict;
use feature 'say';
my $var = q(ah [abc#def"ghi"jkl'123] oh); #'
say $var;
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
(my $v = $1) =~ s{[^0-9a-zA-Z]}{}g;
$v
}ex;
say $var;
where the lone $v is needed so that the block returns the modified string rather than the number of substitutions, which is what the s/// operator itself returns. This can be improved with the /r modifier, which returns the changed string and leaves the original untouched (so it doesn't attempt to modify $1, which isn't allowed):
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
$1 =~ s/[^0-9a-zA-Z]//gr;
}ex;
The \K keeps everything matched before it out of the reported match, so that text stays in the string and we don't need to capture it in order to put it back. The /e modifier makes the replacement part be evaluated as code.
The code in the question doesn't work because everything matched is consumed, and (under /g) the search continues from the position after the last match, attempting to find that whole pattern again further down the string. That fails and only that first occurrence is replaced.
When part of the match has to stay in the string, \K (used in all current answers) is often the remedy: whatever was matched before it is excluded from the text that gets replaced.

Extract certain part of a string in Perl

I have the following Perl strings. The lengths and the patterns are different. The file is always named *log.999
my $file1 = '/user/mike/desktop/sys/syslog.1';
my $file2 = '/user/mike/desktop/movie/dnslog.2';
my $file3 = '/haselog.3';
my $file4 = '/user/mike/desktop/movie/dns-sys.log';
I need to extract the words before log. In this case, sys, dns, hase and dns-sys.
How can I write a regular expression to extract them?
\w+(?=log\b)
matches one or more word characters (letters, digits, underscore) that are followed by log (but not by logging etc.)
If the filename format is fixed, you can make the regex more reliable by using
\w+(?=log\.\d+$)
The main property of the strings shown is that the *log* part comes last.
So anchor the pattern at the end, so that it can't match a log somewhere in the middle:
my ($name) = $string =~ /(\w+)log\.[0-9]+$/;
or, if the .N extension is optional:
my ($name) = $string =~ /(\w+)log(?:\.[0-9]+)?$/;
The above uses the \w+ pattern to capture the text preceding log. But that text may also contain non-word characters (-, ., etc), in which case we would use [^/]+ to capture everything after the last /, as pointed out in Abigail's answer. With .N optional, per question in the comments
my ($name) = $string =~ m{ ([^/]+) log (?: \.[0-9]+ )? $}x;
where I added the /x modifier, under which whitespace inside the pattern is ignored, which can aid readability.
I use a set of delimiters other than / to be able to use / inside without escaping it, and then the m is compulsory. The [^...] is a negated character class, matching any character not listed inside. So [^/]+log matches all successive characters which are not /, coming before log.
The non capturing group (?: ... ) groups patterns inside, so that ? applies to the whole group, but doesn't needlessly capture them.
The (?:\.[0-9]+)? pattern was written specifically to disallow things like log. (nothing after the dot) and log5. But if these are acceptable, change it to the simpler \.?[0-9]*
Update Corrected a typo in code: for optional .N there is +, not *
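A quick way to check the last pattern is to run it over the four strings from the question. This is a sketch with variable names picked for illustration. Note that the fourth string ends in .log rather than log.N, so for it the [^/]+ capture keeps the trailing dot.

```perl
use strict;
use warnings;

my @paths = (
    '/user/mike/desktop/sys/syslog.1',
    '/user/mike/desktop/movie/dnslog.2',
    '/haselog.3',
    '/user/mike/desktop/movie/dns-sys.log',
);

my @names;
for my $path (@paths) {
    # Everything after the last '/' and before a final "log" or "log.N".
    my ($name) = $path =~ m{ ([^/]+) log (?: \.[0-9]+ )? $}x;
    push @names, $name // '(no match)';
    print "$path => $names[-1]\n";
}
```

If the dot should not be part of the result for names like dns-sys.log, it would have to be stripped separately or excluded in the pattern.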

Problems with perl regex

I need a Perl regex to match A.CC3 on a line beginning with something, followed by anything, then my 'A.CC3', and then anything...
I am surprised this (text =~ /^\W+\CC.*\A\.CC\[3].*/) is not working
Thanks
\A is an escape sequence that anchors the match at the beginning of the string, much like the ^ at the start of your regex. Remove the backslash to make it match a literal A.
Edit: You also seem to have \C in there. You should only use backslash to escape meta characters such as period ., or to create escape sequences, such as \Q .. \E.
At its simplest, a regex to match A.CC3 would be
$text =~ /A\.CC3/
That's all you need. This will match any string with A.CC3 in it. In the comments you mention the string you are matching is this:
my $text = "//%CC Unused Static Globals, A.CC3, Halstead Progam Volume";
You might want to avoid partial matches, in which case you can use word boundary \b
$text =~ /\bA\.CC3\b/
You might require that a line begins with //%
$text =~ m#^//%.*\bA\.CC3\b#
Of course, only you know which parts of the string should be matched and in what way. "Something followed by anything followed by A.CC3 followed by anything" really just needs the first simple regex.
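The variants above can be compared side by side. In this sketch the extra test strings 'AA.CC30' and 'AxCC3' are invented to show what the word boundaries and the escaped dot are for.

```perl
use strict;
use warnings;

my $text = '//%CC Unused Static Globals, A.CC3, Halstead Progam Volume';

my $plain    = $text =~ /A\.CC3/            ? 1 : 0;  # substring match
my $bounded  = $text =~ /\bA\.CC3\b/        ? 1 : 0;  # whole token only
my $anchored = $text =~ m#^//%.*\bA\.CC3\b# ? 1 : 0;  # line must start with //%

# Hypothetical inputs: a longer token, and a string that only an unescaped dot accepts.
my $partial   = 'AA.CC30' =~ /\bA\.CC3\b/ ? 1 : 0;
my $unescaped = 'AxCC3'   =~ /A.CC3/      ? 1 : 0;
my $escaped   = 'AxCC3'   =~ /A\.CC3/     ? 1 : 0;

print "plain=$plain bounded=$bounded anchored=$anchored\n";
print "partial=$partial unescaped=$unescaped escaped=$escaped\n";
```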
It doesn't seem like you're trying to capture anything. If that's the case, and all you need to do is find lines that contain A.CC3 then you can simply do
if ( index( $str, 'A.CC3' ) >= 0 ) # Found it...
No need for a regex.
Try to give this a shot:
^.*?A\.CC.*$
That will match anything until it reaches A, then a literal ., followed by CC, then anything until end of string.
It depends what you want to match. If you want to pull back the whole line in which the A.CC3 pattern occurs then something like this should work:
^.*A\.CC3.*$

How to exclude regex text between two matches?

I have a set of specific repeating text blocks. They have a dynamic file name, and a dynamic message. For every filename I want to extract the message.
Filename: dynamicFile.txt
Property: some property to neglect
Message: the message I want
Time: dynamicTime
I want to extract the part after message, which would be: the message I want.
What I have: The following would match anything between Filename and Time.
(?<=Filename: %myFileVar%)(?s)(.*)(?=Time:)
where %myFileVar% is a dynamic file variable that I will feed into the expression.
Now I need to find a way to omit everything after the filename up to the message part. Here I would have to omit:
Property: some property to neglect
Message:
How could this be done?
use warnings;
use strict;
my $text;
{
local $/;
$text = <DATA>;
}
my $myFileVar = 'dynamicFile.txt';
if ($text =~ /Filename: \Q$myFileVar\E.*?Message: (.*?)\s*Time:/s)
{
print $1;
}
__DATA__
Filename: dynamicFile.txt
Property: some property to neglect
Message: the message I want
Time: dynamicTime
Note: this assumes that Time: always comes right after the message line. If that is not true, ikegami's solution offers a way to skip any other lines.
Explanation:
You can simply insert a variable into your pattern, and it will be interpolated.
However, if the variable contains any special regex characters, they will be treated as regex characters. Thus you need to surround the variable with \Q...\E, which make everything in between be treated literally. If you did not do that, the dot in your filename would match any character.
You don't need to use lookarounds to only capture part of a string. Instead, use a capture group--any normal sets of parentheses within the pattern will automatically be put into the variables $1, $2, etc.
For a simple case like this, it is better to enable single-line mode as a flag after the pattern (/s instead of (?s)). Turning it on inside the pattern is mainly useful when it should apply to only part of the pattern.
.*? should be used instead of .*. Otherwise the pattern will match everything from the first Message: to the last Time: in the file.
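The last point is easy to verify with two blocks in the input. This is a minimal sketch; the file names and messages in the sample text are made up.

```perl
use strict;
use warnings;

# Hypothetical input with two blocks.
my $text = <<'END';
Filename: a.txt
Message: first
Time: t1
Filename: b.txt
Message: second
Time: t2
END

# Greedy: runs from the first "Message:" to the LAST "Time:".
my ($greedy) = $text =~ /Message: (.*)\s*Time:/s;

# Lazy: stops at the first "Time:" after the message.
my ($lazy) = $text =~ /Message: (.*?)\s*Time:/s;

print "greedy captured ", length($greedy), " characters\n";
print "lazy captured '$lazy'\n";
```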
/
^
Filename: \s* \Q$myFileVar\E \n
(?: (?!Message:) [^\n]*\n )*
Message: \s* ([^\n]*) \n
(?: (?!Time:) [^\n]*\n )*
Time:
/mx
(?: [^\n]*\n )* skips any number of lines.
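Here is the pattern above in a small runnable sketch. The sample text, including the extra other.txt block, is an assumption added to show that the match is tied to the requested file name.

```perl
use strict;
use warnings;

# Hypothetical input: the first block belongs to a different file.
my $text = <<'END';
Filename: other.txt
Property: p1
Message: wrong message
Time: t1
Filename: dynamicFile.txt
Property: some property to neglect
Message: the message I want
Time: dynamicTime
END

my $myFileVar = 'dynamicFile.txt';

my ($message) = $text =~ /
    ^
    Filename: \s* \Q$myFileVar\E \n
    (?: (?!Message:) [^\n]*\n )*    # skip lines that do not start with Message:
    Message: \s* ([^\n]*) \n
    (?: (?!Time:) [^\n]*\n )*       # skip lines that do not start with Time:
    Time:
/mx;

print "$message\n";
```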
Perl can do \K Magic
Adding a late answer because I'm not seeing my favorite solution. In Perl regex, \K tells the engine to drop everything we have matched so far from the final match. So you could have used this regex:
(?sm)^Filename:.*?Message: \K[^\r\n]+
or even:
(?m)^Message: \K[^\r\n]+

What REGEX pattern should I use to look for a specific string pattern and remove anything else that doesnt match?

I'm parsing through code using a Perl-REGEX parsing engine in my IDE and I want to grab any variables that look like
$hash->{ hash_key04}
and nuke the rest of the code..
So far my very basic regex doesn't do what I expected
(.*)(\$hash\-\>\{[\w\s]+\})(.*)
(
\$
hash
\-\>
\{
[\w\s]+
\}
)
I know to use replace for this ($1, $2, etc.), but matching (.*) before and after the target string doesn't seem to capture all the rest of the code!
UPDATED:
I tried matching everything except null, but of course that's too greedy.
([^\0]*)
What expression in regex should i use to look only for the string pattern and remove the rest?
The problem is I want to be left with the list of $hash->{} strings after the replace runs in the IDE.
This is better approached from the other direction. Instead of trying to delete everything you don't want, what about extracting everything you do want?
my @vars = $src_text =~ /(\$hash->\{[\w\s]+\})/g;
Breaking down the regex:
/( # start of capture group
\$hash-> # prefix string with $ escaped
\{ # opening escaped delimiter
[\w\s]+ # any word characters or space
\} # closing escaped delimiter
)/g; # match repeatedly returning a list of captures
Here is another way that might fit within your IDE better:
s/(\$hash->\{[\w\s]+\})|./$1/gs;
This regex tries to match one of your hash variables at each location, and if it fails, it deletes the next character and then tries again, which after running over the whole file will have deleted everything you don't want.
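Both approaches can be tried on a small sample. This is a sketch under a few assumptions: the source text is invented, and the substitution is written with /e and $1 // '' (Perl 5.10 or later) so that it stays quiet under use warnings when the second alternative matches and $1 is undefined.

```perl
use strict;
use warnings;

# Hypothetical source text to scan.
my $src_text = <<'END';
my $x = $hash->{ hash_key04} + 1;
print $hash->{name}, "\n" if $other->{skip};
END

# Extraction: collect every match in list context.
my @vars = $src_text =~ /(\$hash->\{[\w\s]+\})/g;

# Deletion: keep the hash accesses, drop every other character (newlines too, via /s).
(my $stripped = $src_text) =~ s!(\$hash->\{[\w\s]+\})|.!$1 // ''!gse;

print join("\n", @vars), "\n";
print "$stripped\n";
```

The extraction keeps the matches as separate list elements, whereas the substitution leaves them run together in one string.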
Depends on your coding language. What you want is group 2 (The second set of characters in parenthesis). In perl that would be $2, in VIM it would be \2, etc ...
It depends on the platform, but generally, replace the pattern with an empty string.
In javascript,
// prints "the la in ing"
console.log('the latest in testing'.replace(/test/g, ''));
In bash
$ echo 'the latest in testing' | sed 's/test//g'
the la in ing
In C#
Console.WriteLine(Regex.Replace("the latest in testing", "test", ""));
etc
By default the wildcard . won't match newlines. You can enable newlines in its matching set using a flag depending on what regex standard you're using and under what language/api. Or you can add them explicitly yourself by defining a character set:
[\s\S]* <- Matches any character, including newline and carriage return (a dot inside [...] is a literal dot, so [.\n\r] is not enough).
Combine this with capture groups to grab desired variables from your code and skip over lines which contain no capture group.
If you want help constructing the proper regex for your context you'll need to paste some input text and specify what the output should be.
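In Perl specifically, the newline behaviour of the dot can be sketched as follows; the two-line sample string is made up.

```perl
use strict;
use warnings;

my $code = "first line\nsecond line";

my ($default) = $code =~ /\A(.*)/;        # . stops at the newline
my ($single)  = $code =~ /\A(.*)/s;       # /s lets . match newlines as well
my ($class)   = $code =~ /\A([\s\S]*)/;   # explicit "any character" class

print length($default), " ", length($single), " ", length($class), "\n";
```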
I think you want to add a ^ to the beginning of the regex, as in s/^.*(PATTERN)(.*)$/$1/, so that it starts at the beginning of the line and goes to the end, removing anything except that pattern.