How to exclude regex text between two matches? - regex

I have a set of specific repeating text blocks. They have a dynamic file name and a dynamic message. For every filename I want to extract the message.
Filename: dynamicFile.txt
Property: some property to neglect
Message: the message I want
Time: dynamicTime
I want to extract the part after message, which would be: the message I want.
What I have: The following would match anything between Filename and Time.
(?<=Filename: %myFileVar%)(?s)(.*)(?=Time:)
whereas %myFileVar% are dynamic file variables I will feed the expression with.
Now I need to find a way to omit anything after the filename up to the message part. Here I would have to omit:
Property: some property to neglect
Message:
How could this be done?

use warnings;
use strict;
my $text;
{
local $/;
$text = <DATA>;
}
my $myFileVar = 'dynamicFile.txt';
if ($text =~ /Filename: \Q$myFileVar\E.*?Message: (.*?)\s*Time:/s)
{
print $1;
}
__DATA__
Filename: dynamicFile.txt
Property: some property to neglect
Message: the message I want
Time: dynamicTime
Note: this assumes that Time: always comes right after the message line. If that is not true, ikegami's solution offers a way to skip any other lines.
Explanation:
You can simply insert a variable into your pattern, and it will be interpolated.
However, if the variable contains any special regex characters, they will be treated as regex characters. Thus you need to surround the variable with \Q...\E, which make everything in between be treated literally. If you did not do that, the dot in your filename would match any character.
You don't need to use lookarounds to only capture part of a string. Instead, use a capture group--any normal sets of parentheses within the pattern will automatically be put into the variables $1, $2, etc.
For a simple case like this, it is better to enable single-line mode (s) as a switch after the pattern (/s instead of (?s)). Turning it on within the pattern is mainly useful when you need it to apply to only part of the pattern.
.*? should be used instead of .*. Otherwise the pattern will match everything from the first Message: to the last Time: in the file.
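The points above carry over to other regex engines. Here is a minimal sketch in Python rather than Perl, assuming re.escape as the counterpart of \Q...\E and re.DOTALL as the counterpart of /s. The sample text and the name my_file_var are made up for illustration.

```python
import re

# Two blocks, to show that the file name selects the right one.
text = """Filename: otherFile.txt
Property: some property to neglect
Message: a different message
Time: otherTime
Filename: dynamicFile.txt
Property: some property to neglect
Message: the message I want
Time: dynamicTime
"""

my_file_var = "dynamicFile.txt"  # hypothetical input, as in the question

# re.escape plays the role of \Q...\E: the dot in the file name stays literal.
pattern = re.compile(
    r"Filename: " + re.escape(my_file_var) + r".*?Message: (.*?)\s*Time:",
    re.DOTALL,  # like Perl's /s: '.' also matches newlines
)

match = pattern.search(text)
# group(1) corresponds to Perl's $1.
message = match.group(1) if match else None
print(message)
```

The lazy .*? quantifiers keep the match inside a single block, which is the same reasoning as in the last point above.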

/
^
Filename: \s* \Q$myFileVar\E \n
(?: (?!Message:) [^\n]*\n )*
Message: \s* ([^\n]*) \n
(?: (?!Time:) [^\n]*\n )*
Time:
/mx
(?: [^\n]*\n )* skips any number of lines.
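The line-skipping idea can be tried out in Python as well. This is a sketch under the assumption that re.VERBOSE and re.MULTILINE stand in for Perl's /x and /m. The helper name message_for and the extra "Severity" line are hypothetical, added only to show that lines between Message: and Time: are skipped.

```python
import re

def message_for(text, file_name):
    """Return the message of the block for file_name, or None."""
    pattern = re.compile(
        r"""
        ^Filename: \s* """ + re.escape(file_name) + r""" \n
        (?: (?!Message:) [^\n]*\n )*   # skip lines that are not the Message line
        Message: \s* ([^\n]*) \n
        (?: (?!Time:) [^\n]*\n )*      # skip lines that are not the Time line
        Time:
        """,
        re.MULTILINE | re.VERBOSE,
    )
    match = pattern.search(text)
    return match.group(1) if match else None

sample = """Filename: dynamicFile.txt
Property: some property to neglect
Message: the message I want
Severity: low
Time: dynamicTime
"""

result = message_for(sample, "dynamicFile.txt")
print(result)
```

re.escape also escapes spaces and #, so the escaped file name stays intact in verbose mode.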

Perl can do \K Magic
Adding a late answer because I'm not seeing my favorite solution. In Perl regex, \K tells the engine to drop everything we have matched so far from the final match. So you could have used this regex:
(?sm)^Filename:.*?Message: \K[^\r\n]+
or even:
(?m)^Message: \K[^\r\n]+
See demo.
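Not every engine has \K. Python's built-in re module, for example, does not support it, so a capture group is the usual substitute there. A small sketch, with made-up sample text:

```python
import re

text = (
    "Filename: dynamicFile.txt\n"
    "Property: some property to neglect\n"
    "Message: the message I want\n"
    "Time: dynamicTime\n"
)

# Same idea as (?m)^Message: \K[^\r\n]+, but the wanted part is captured
# instead of relying on \K to drop the prefix from the match.
messages = re.findall(r"(?m)^Message: ([^\r\n]+)", text)

# Counterpart of (?sm)^Filename:.*?Message: \K[^\r\n]+
first = re.search(r"(?sm)^Filename:.*?Message: ([^\r\n]+)", text)
first_message = first.group(1) if first else None

print(messages, first_message)
```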

Related

Regex find replace to add function parameter

I'm trying to find and replace some function calls in py program. The idea is to add some boolean parameter to each call found on the project.
I looked for solutions on the internet because I don't know regex at all... It seems like a basic exercise for regex people, but still.
In my case I have this call in a lot of files :
myFunction("test")
My goal is to find and replace this call with:
myFunction("test", false)
Could you help me write the regex ?
Try this command:
sed -re 's/(myFunction)[[:space:]]*\([[:space:]]*("test")[[:space:]]*\)/\1(\2, false)/' SOURCE_FILENAME
If you prefer to replace the existing source file with an updated one, then write -i SOURCE_FILENAME instead of SOURCE_FILENAME.
This works by defining a pattern to match the function call you would like to update:
myFunction (obviously) matches the text myFunction;
[[:space:]] matches any whitespace character, mainly spaces and tabs.
[[:space:]]* matches zero or more whitespace characters.
\( and \) match literal parentheses in your program text;
( and ) are regex metacharacters that match nothing, but ("test") matches "test" and captures the matched text for later use.
Note that this pattern captures two things using ( and ). The ("test") is the second of these.
Now let us examine the overall structure of the Sed command 's/.../.../'. The s means "substitute," so 's/.../.../' is Sed's substitution command.
Between the first and second slashes comes the pattern we have just discussed. Between the second and third slashes comes the replacement text Sed uses to replace the matched part of any line of your program text that matches the pattern. Within the replacement text, the \1 and \2 are backreferences that place the text earlier captured using ( and ).
So, there it is. Not only have I helped you to write the regex but have shown you how the regex works so that, next time, you can write your own.
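The same substitution can be sketched in Python, assuming \s as the counterpart of [[:space:]]. The names CALL and add_flag and the sample source text are hypothetical. One difference to be aware of: re.sub replaces every match, while the Sed command as written replaces only the first match on each line.

```python
import re

# Group 1 is the function name, group 2 the "test" argument.
CALL = re.compile(r'(myFunction)\s*\(\s*("test")\s*\)')

def add_flag(source):
    """Append ', false' to every myFunction("test") call in source."""
    return CALL.sub(r'\1(\2, false)', source)

sample = 'x = myFunction("test")\ny = myFunction ( "test" )\nz = other("test")\n'
updated = add_flag(sample)
print(updated)
```

The \1 and \2 in the replacement are backreferences, just as in the Sed replacement text.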
Refer to this:
import re
# Replace all whitespace characters with the digit "9":
text = "The rain in Spain"
x = re.sub(r"\s", "9", text)
print(x)
you could use this regex to match and capture
(myFunction\("test")(\))
then use the regex below to replace
$1, false$2

Extract certain part of a string in Perl

I have the following Perl strings. The lengths and the patterns are different. The file is always named *log.999
my $file1 = '/user/mike/desktop/sys/syslog.1';
my $file2 = '/user/mike/desktop/movie/dnslog.2';
my $file3 = '/haselog.3';
my $file4 = '/user/mike/desktop/movie/dns-sys.log';
I need to extract the words before log. In this case, sys, dns, hase and dns-sys.
How can I write a regular expression to extract them?
\w+(?=log\b)
matches one or more alphanumeric characters that are followed by log (but not logging etc.)
If the filename format is fixed, you can make the regex more reliable by using
\w+(?=log\.\d+$)
The main property of shown strings is that the *log* phrase is last.
Then anchor the pattern, so we wouldn't match a log somewhere in the middle
my ($name) = $string =~ /(\w+)log\.[0-9]+$/;
while if .N extension is optional
my ($name) = $string =~ /(\w+)log(?:\.[0-9]+)?$/;
The above uses the \w+ pattern to capture the text preceding log. But that text may also contain non-word characters (-, ., etc), in which case we would use [^/]+ to capture everything after the last /, as pointed out in Abigail's answer. With .N optional, per question in the comments
my ($name) = $string =~ m{ ([^/]+) log (?: \.[0-9]+ )? $}x;
where I added the /x modifier, with which whitespace inside the pattern is ignored, which can aid readability.
I use a set of delimiters other than / to be able to use / inside without escaping it, and then the m is compulsory. The [^...] is a negated character class, matching any character not listed inside. So [^/]+log matches all successive characters which are not /, coming before log.
The non capturing group (?: ... ) groups patterns inside, so that ? applies to the whole group, but doesn't needlessly capture them.
The (?:\.[0-9]+)? pattern was written specifically so to disallow things like log. (nothing after dot) and log5. But if these are acceptable, change it to the simpler \.?[0-9]*
Update Corrected a typo in code: for optional .N there is +, not *
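For comparison, here is a sketch of the last pattern in Python, whose re syntax matches Perl's for everything used here. The name LOG_NAME is made up, and only the three paths from the question that follow the *log.N naming are included.

```python
import re

# Everything after the last '/', up to 'log' plus an optional '.N' at the end.
LOG_NAME = re.compile(r"([^/]+)log(?:\.[0-9]+)?$")

paths = [
    "/user/mike/desktop/sys/syslog.1",
    "/user/mike/desktop/movie/dnslog.2",
    "/haselog.3",
]

names = [LOG_NAME.search(path).group(1) for path in paths]
print(names)
```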

Regular Expression required for matching a string that should not be followed by another specific string

I am using the code below to match a string (e.g. <jdgdt\s+mdy=.*?>\s*) that should not be followed by another specific string (<jdg>), but I am unable to get the desired output. Can anyone help me with this?
Input file :
<dckt>Docket No. 7677-12.</dckt>
<jdgdt mdy='02/25/2014'>
<jdg>Opinion by Marvel, <e>J.</e></jdg>
<taxyr></taxyr>
<disp></disp>
</tcpar>
<dckt>Docket No. 7237-13.</dckt>
<jdgdt mdy='02/24/2014'>
</tcpar>
Desired Output:
<dckt>Docket No. 7677-12.</dckt>
<jdgdt mdy='02/25/2014'>
<jdg>Opinion by Marvel, <e>J.</e></jdg>
<taxyr></taxyr>
<disp></disp>
</tcpar>
<dckt>Docket No. 7237-13.</dckt>
<jdgdt mdy='02/24/2014'>
<jdg>Opinion by Marvel, <e>J.</e></jdg>
<taxyr></taxyr>
<disp></disp>
</tcpar>
Code:
#!/usr/bin/perl
my $filename = $ARGV[0];
my $ext = $ARGV[1];
my $InputFile = "$filename" . "\." . "$ext";
my $document = do {
local $/ = undef;
open my $fh, "<", $InputFile or die "Error: Could Not Open File $InputFile: $!";
<$fh>;
};
$document =~ s/(<jdgdt\s+mdy=.*?>\s*)(?!<jdg>)/$1<jdg>Opinion by Marvel,<e>J.<\/e><\/jdg>\n<taxyr><\/taxyr>\n<disp><\/disp>/isg;
print $document;
I had to make two minor adjustments to your regex to get the desired output:
$document =~ s{(<jdgdt\s+mdy\=[^>]*>\s*)(?!\s*<jdg>)}{$1<jdg>Opinion by Marvel,<e>J.</e></jdg>\n<taxyr></taxyr>\n<disp></disp>}isg;
Also, to clean up the code, I switched from using / to using {} to delimit the regex; that way, you don't need to backslash all the slashes that you actually want there in your replacement.
Explanation of what I changed:
First off, negative lookahead is tricky. What you have to remember is that perl will try to match your expression the maximum amount of times possible. Because you had this initially:
/(<jdgdt\s+mdy\=.*?>\s*)(?!<jdg>)/
What would happen is that in that first clause you'd get this match:
<jdgdt mdy='02/25/2014'>\n<jdg>Opinion by Marvel, <e>J.</e></jdg>
^^^^^^^^^^^^^^^^^^^^^^^^
(this part matched by paren. Note the \n is not matched!)
Perl would consider this a match because after the first parenthesized expression, you have "\n<jdg>". Well, that doesn't match the expression "<jdg>" (because of the initial newline), so yay! found a match.
In other words, initially, perl would have the \s* that you end your parenthesized expression with match the empty string, and therefore it would find a match and you'd end up stuffing things into the first clause that you didn't want. Another way to put it is that because of the freedom to choose what went into \s*, perl would choose the amount that allowed the expression as a whole to match. (and would fill \s* with the empty string for the first docket record, and newline for the second docket record)
To get perl to never find a match on the first docket record, I repeated the \s* in the negative lookahead as well. That way, no choice of what to put in \s* could make the expression as a whole match on the initial docket record, and perl had to give up and move to the second docket record.
But then there was a second problem! Remember how I said perl was really aggressive about finding matches anywhere it could? Well, next perl would expand your mdy\=.*?> bit to still find a result in the first docket record. After I added \s* to the negative lookahead, the first docket was still matching (but in a different spot) with:
<jdgdt mdy='02/25/2014'>\n<jdg>Opinion by Marvel, <e>J.</e></jdg>
^^^^^^^^^^^???????????????????^
(Underlined part matched by paren. ? denotes the bit matched by .*?)
See how perl expanded your .*? way beyond what you had intended? You'd intended that bit to match only stuff up to the first > character, but perl will stretch your non-greedy matches as far as necessary so that the whole pattern matches. This time, it stretched your .*? to cover the > that closed the <jdg> tag so that it could find a spot where the negative lookahead didn't block the match.
To keep perl from stretching your .*? pattern that far, I replaced .*? with [^>]*, which is really what you meant.
After these two changes, we then only found a match in the second docket record, as initially desired.
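The backtracking behaviour described above is not specific to Perl. The sketch below replays the three stages in Python, with re.I | re.S assumed as the counterparts of the /i and /s modifiers. The variable names are made up for illustration.

```python
import re

document = """<dckt>Docket No. 7677-12.</dckt>
<jdgdt mdy='02/25/2014'>
<jdg>Opinion by Marvel, <e>J.</e></jdg>
<taxyr></taxyr>
<disp></disp>
</tcpar>
<dckt>Docket No. 7237-13.</dckt>
<jdgdt mdy='02/24/2014'>
</tcpar>
"""

flags = re.I | re.S

# 1. Original pattern: \s* backs off to the empty string, so the first
#    record matches too.
original = re.findall(r"(<jdgdt\s+mdy=.*?>\s*)(?!<jdg>)", document, flags)

# 2. \s* repeated in the lookahead: now .*? stretches past the first '>'.
stretched = re.findall(r"(<jdgdt\s+mdy=.*?>\s*)(?!\s*<jdg>)", document, flags)

# 3. [^>]* cannot cross a '>', so only the second record matches.
fixed = re.findall(r"(<jdgdt\s+mdy=[^>]*>\s*)(?!\s*<jdg>)", document, flags)

print(len(original), len(stretched), len(fixed))
```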
Use a negative lookahead, such as (?!<jdg>) or something similar; look it up.

Why doesn't zero-width match regex work?

I wrote a Perl function to replace job name in JCL script. Zero-width match was used here.
sub modify_jcl_jobname ()
{
my ($jcl, $old, $new) = @_;
$jcl =~ s/
# The name must begin in column 3.
^(?<=\/\/)
# The first character must be alphabetic or national.
($old)
# The name must be followed by at least one blank.
# Append JCL keyword JOB
(?=\s+JOB)
/$new/xmig; # Multi-lines, ignore case.
return $jcl;
}
But this function didn't work until I did a simple modification that just deleted the leading sign "^".
#before ^(?<=\/\/)
#after (?<=\/\/)
So I'd like to understand the cause of the problem. Any reply would be appreciated. Thanks.
The problem lies with
^(?<=\/\/)
That pattern can only match if the position where ^ matched is preceded by the two characters //. That never happens: with /m, ^ matches only at the start of the string or right after a newline, so the character before that position is never a slash.
But you don't want to start matching at the start of the line. You want to start matching 2 characters in. What you want is actually:
(?<=^\/\/)
After doing some improvements, the code looks like:
sub modify_jcl_jobname {
my ($jcl, $old, $new) = @_;
$jcl =~ s{
(?<= ^// )
\Q$old\E
(?= \s+ JOB )
}{$new}xmig;
return $jcl;
}
Improvements:
Removed the incorrect prototype (()). It forced the caller to tell Perl to ignore the prototype (by using &).
Added code (\Q...\E) to convert the contents of $old into a regex pattern before using it as such.
Removed the needless capture ((...)).
Switched the delimiters of the substitution (from s/// to s{}{}) to require less escaping.
Removed highly redundant comments. (Good comments explain why something is being done rather than what is being done.)
The optimiser might handle this version better:
$jcl =~ s{
^// \K
\Q$old\E
(?= \s+ JOB )
}{$new}xmig;
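The difference between the two anchor placements can be checked quickly. The sketch below uses Python instead of Perl, assuming re.MULTILINE | re.IGNORECASE for /mi and re.escape for \Q...\E. The JCL sample and job names are made up. Python's re has no \K, so only the look-behind form is shown.

```python
import re

jcl = "//MYJOB   JOB (ACCT),'NAME'\n//STEP1   EXEC PGM=IEFBR14\n"
old, new = "MYJOB", "NEWJOB"
flags = re.MULTILINE | re.IGNORECASE
tail = re.escape(old) + r"(?=\s+JOB)"

# ^ outside the look-behind: the position at the start of a line can never
# be preceded by '//', so nothing is replaced.
broken = re.sub(r"^(?<=//)" + tail, lambda m: new, jcl, flags=flags)

# ^ inside the look-behind: the job name must sit right after a line-initial '//'.
fixed = re.sub(r"(?<=^//)" + tail, lambda m: new, jcl, flags=flags)

print(broken == jcl)
print(fixed)
```

A function is used as the replacement so that backslashes in the new name, if any, are not interpreted as escapes.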
The ^ sign matches the beginning of the line. You then want something preceded by two slashes - where should these slashes go if the next character is the very first character of the line?
s{^//
($old)
...
}{//$new}xmig
should work: you need no look behind.
Update: Thanks to ikegami, I now see why you used it. You want to keep the // in the string: well, you can repeat them in the substitution, or move the ^ character into the look-behind.

What REGEX pattern should I use to look for a specific string pattern and remove anything else that doesnt match?

I'm parsing through code using a Perl-REGEX parsing engine in my IDE and I want to grab any variables that look like
$hash->{ hash_key04}
and nuke the rest of the code..
So far my very basic REGEX doesn't do what I expected
(.*)(\$hash\-\>\{[\w\s]+\})(.*)
(
\$
hash
\-\>
\{
[\w\s]+
\}
)
I know to use replace for this ($1, $2, etc.), but matching (.*) before and after the target string doesn't seem to capture all the rest of the code!
UPDATED:
I tried matching everything except null, but of course that's too greedy.
([^\0]*)
What expression in regex should i use to look only for the string pattern and remove the rest?
The problem is I want to be left with the list of $hash->{} strings after the replace runs in the IDE.
This is better approached from the other direction. Instead of trying to delete everything you don't want, what about extracting everything you do want?
my @vars = $src_text =~ /(\$hash->\{[\w\s]+\})/g;
Breaking down the regex:
/( # start of capture group
\$hash-> # prefix string with $ escaped
\{ # opening escaped delimiter
[\w\s]+ # any word characters or space
\} # closing escaped delimiter
)/g; # match repeatedly returning a list of captures
Here is another way that might fit within your IDE better:
s/(\$hash->\{[\w\s]+\})|./$1/gs;
This regex tries to match one of your hash variables at each location, and if it fails, it deletes the next character and then tries again, which after running over the whole file will have deleted everything you don't want.
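Both approaches can be sketched in Python for comparison, with re.findall standing in for the list-context match and re.S for the /s modifier. The source text below is made up.

```python
import re

src_text = 'my $x = $hash->{ hash_key04} + 1;\nprint $hash->{key2}, $other;\n'

HASH_VAR = r"(\$hash->\{[\w\s]+\})"

# Approach 1: extract what you want.
found = re.findall(HASH_VAR, src_text)

# Approach 2: match a hash variable or else any single character, and keep
# only group 1. Characters matched by '.' are replaced by the empty string.
kept = re.sub(HASH_VAR + r"|.", lambda m: m.group(1) or "", src_text, flags=re.S)

print(found)
print(kept)
```

Note that the second approach leaves the kept variables glued together with no separator, because everything between them is deleted.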
It depends on your coding language. What you want is group 2 (the second set of characters in parentheses). In Perl that would be $2, in Vim it would be \2, etc.
It depends on the platform, but generally, replace the pattern with an empty string.
In javascript,
// prints "the la in ing"
console.log('the latest in testing'.replace(/test/g, ''));
In bash
$ echo 'the latest in testing' | sed 's/test//g'
the la in ing
In C#
Console.WriteLine(Regex.Replace("the latest in testing", "test", ""));
etc
By default the wildcard . won't match newlines. You can make it match them using a flag, depending on which regex flavor you're using and under which language/API. Or you can use a character class that matches everything (a . inside square brackets is only a literal dot, so it cannot be used there):
[\s\S]* <- Matches any character including newline, carriage return.
Combine this with capture groups to grab desired variables from your code and skip over lines which contain no capture group.
If you want help constructing the proper regex for your context you'll need to paste some input text and specify what the output should be.
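How the dot and character classes treat newlines is easy to check. The sketch below uses Python's re module, where the flag in question is re.DOTALL; other flavors name it differently. The sample text is made up.

```python
import re

text = "first line\nsecond line"

# Plain '.' stops at the newline, so the whole text is not matched.
dot_default = re.fullmatch(r".*", text)

# With DOTALL, '.' matches newlines as well.
dot_all = re.fullmatch(r".*", text, flags=re.DOTALL)

# [\s\S] matches any character regardless of flags.
any_class = re.fullmatch(r"[\s\S]*", text)

# Inside a character class the dot is a literal dot, so this class only
# matches '.', '\n' and '\r'.
dot_class = re.fullmatch(r"[.\n\r]*", text)

print(bool(dot_default), bool(dot_all), bool(any_class), bool(dot_class))
```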
I think you want to add a ^ to the beginning of the regex, as in s/^.*(PATTERN)(.*)$/$1/, so that it starts at the beginning of the line and goes to the end, removing anything except that pattern.