Perl extract group with lookbehind from different line - regex

I've tried web search and have read several answers on stackexchange, still cannot grasp why command does not extract anything. At the end I want to extract group with lookbehind from different line, e.g. from
Code>TEST1<Code Code2>best<Code2
Code>test2<Code
Type>false<Type
by finding the needed key between the Type tags and extracting the first Code value above that hit; in the case above, that is test2. But I cannot extract anything at all across multiple lines, i.e.
perl -lne 'print $1,"_",$2 if /Code>(.*)<Code[\s\S\n]*?Type>(.*)<Type/'<test.txt prints nothing.
I've played with removing ln parameters and adding/removing greedy ? and trying just . in place of [\s\S\n].
perl -lne 'print $1,"_",$2 if /Code>(.*)<Code[\s\S\n]*?Code2>(.*)<Code2/'<test.txt
gives TEST1_best so same line extraction works.
What am I missing? Can what I want be done in one line of command?

The following command answers your question: it collects all values contained in a Code>...<Code pattern, if they are followed by a Type>...<Type pattern (with potentially other patterns in between, but no other occurrences of Code>...<Code in between):
perl -lne 's/^.*?(?=Code>)//s; for (split /Code>/) { print qq($1:$2\n) if /(.*?)<Code.*?Type>(.*?)<Type/s }' -0777 <test.txt
If e.g. test.txt contains the following lines,
Code>test4<Code Type>false<Type
Code>test3<Code
Type>true<Type
Code>TEST1<Code Code2>best<Code2
Code>test2<Code
Type>false<Type
then the command will collect the following value pairs:
test4:false
test3:true
test2:false
Edited on 04/08/2019, 17:38 CEST I edited the command to remove the "header part" of the file (the part before the first occurrence of Code>), as it might, by some error of the file's editor, contain a closing tag <Code which had not been opened with Code> but instead with a typo like e.g. Cde>. My assumption was that the complete file was "syntactically correct" in the sense that it consists of elements of type /(\w+)>.*?<\1/, separated by whitespace (including newlines). For files which do not conform to this syntax, the earlier version of the command was not watertight.
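For readers who want to check the logic outside of a one-liner, here is a rough Python port of the same idea (slurp the whole input, drop the header before the first Code>, split on Code>, then match each chunk). It is only a sketch: the function name code_type_pairs and the inline sample string are made up for illustration, and re.DOTALL stands in for Perl's /s modifier.

```python
import re

def code_type_pairs(text):
    """Return (code, type) pairs where Code>...<Code is followed by
    Type>...<Type with no other Code>...<Code in between."""
    # Drop everything before the first "Code>", like s/^.*?(?=Code>)//s.
    start = text.find("Code>")
    if start < 0:
        return []
    pairs = []
    # Each chunk begins right after one "Code>" opener.
    for chunk in text[start:].split("Code>"):
        m = re.search(r"(.*?)<Code.*?Type>(.*?)<Type", chunk, re.DOTALL)
        if m:
            pairs.append((m.group(1), m.group(2)))
    return pairs

# Same content as the example test.txt above (hypothetical inline copy).
sample = (
    "Code>test4<Code Type>false<Type\n"
    "Code>test3<Code\n"
    "Type>true<Type\n"
    "Code>TEST1<Code Code2>best<Code2\n"
    "Code>test2<Code\n"
    "Type>false<Type\n"
)

for code, typ in code_type_pairs(sample):
    print(f"{code}:{typ}")
```

The key point is the same as in the Perl command: the match can only succeed across lines because the input is processed as one string rather than line by line.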

Another way to do it, using progressive matching and embedded code
perl -lne 'while (/\b(?:Code>(.*?)<Code(?{$c=$1})|Type>(.*?)<Type(?{print qq($c:$2\n) if defined $c;undef $c}))\b/g){}' -0777 <test.txt
Explanations:
Basically, the expression finds occurrences of Code>(.*?)<Code or Type>(.*?)<Type. This gives the basic form of an alternation in a non-capturing group: (?:Code>(.*?)<Code|Type>(.*?)<Type).
The word boundary assertions \b around the group ensure that the keywords Code and Type are matched, but not e.g. Code2 or TType.
The modifier g ensures progressive application of the regular expression on the string. Since I want to extract the result inside of the expression itself, I place the regex in an empty loop, i.e. while (/.../g) {}.
You assume a grammar rule Code ⟶ Type, i.e. you look for occurrences of a Type token following a Code token. For this, a Code token is memorized in a variable $c with the code expression (?{$c=$1}). If a Type token is found, it is considered a match only if a Code token has been found before it, indicated by the fact that the variable $c is defined. In any case, once a Type token has been found, the variable $c is undefined again to prepare it for the next search. This gives the code evaluation (?{print qq($c:$2\n) if defined $c;undef $c}) in the Type branch of the regular expression.
Note that the captures of the Code>(.*?)<Code and Type>(.*?)<Type tokens may be the empty string. This is why I am working with undef $c and if defined $c instead of the simpler $c='' and if $c.
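The same scan can be sketched in Python, where the embedded code blocks become ordinary statements in a loop over re.finditer. This is an illustrative port, not the answer's own code: the names TOKEN, scan_pairs and sample are invented here, and None plays the role of Perl's undef so that an empty capture still counts as a pending Code token.

```python
import re

# Either a Code token (group 1) or a Type token (group 2), whole words only.
TOKEN = re.compile(r"\b(?:Code>(.*?)<Code|Type>(.*?)<Type)\b")

def scan_pairs(text):
    pairs = []
    code = None  # None = no Code token pending (an empty string is a valid value)
    for m in TOKEN.finditer(text):
        if m.group(1) is not None:       # the Code branch matched
            code = m.group(1)
        else:                            # the Type branch matched
            if code is not None:
                pairs.append((code, m.group(2)))
            code = None                  # reset for the next search
    return pairs

sample = (
    "Code>test4<Code Type>false<Type\n"
    "Code>test3<Code\n"
    "Type>true<Type\n"
    "Code>TEST1<Code Code2>best<Code2\n"
    "Code>test2<Code\n"
    "Type>false<Type\n"
)

for code, typ in scan_pairs(sample):
    print(f"{code}:{typ}")
```

Because a later Code token simply overwrites the pending one, TEST1 is discarded in favour of test2, which is the "first Code above the Type" behaviour the question asks for.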

If your data is in a file named d, with GNU sed:
sed -Ez 's/.*Code>(\w+)<Code\sType>\w*<Type.*/\1/' d
Perl
perl -ne 'BEGIN{undef $/} /Code>(\w+)<Code\nType>\w*<Type/; print $1' d

Related

Powershell Regex - Replace between string A and string B only if contains string C

I have a file which looks like this
ABC01|01
Random data here 2131233154542542542
More random data
STRING-C
A bit more random stuff
&(%+
ABC02|01
Random data here 88888888
More random data 22222
STRING-D
A bit more random stuff
&(%+
I'm trying to make a script to Find everything between ABC01 and &(%+ ONLY if it contains STRING-C
I came up with this for regex ABC([\s\S]*?)STRING-C(?s)(.*?)&\(%\+
I'm getting this content from a text file with get-content.
$bad_content = gc $bad_file -raw
I want to do something like $bad_content.replace($pattern,"") to remove the regex match.
How can I replace my matches in the file with nothing? I'm not even sure if my regex is correct but on regex101 it seems to find the strings I'm needing.
Your regex works with the sample input given, but not robustly, because if the order of blocks were reversed, it would mistakenly match across the blocks and remove both.
Tim Biegeleisen's helpful answer shows a regex that fixes the problem, via a negative lookahead assertion ((?!...)).
Let me show how to make it work from PowerShell:
You need to use the regex-based -replace operator, not the literal-substring-based .Replace() method[1], to apply it.
To read the input string from a file, use Get-Content's -Raw switch to ensure that the file is read as a single, multi-line string; by default, Get-Content returns an array (stream) of lines, which would cause the -replace operation to be applied to each line individually.
(Get-Content -Raw file.txt) -replace '(?s)ABC01(?:(?!&\(%\+).)*?STRING-C.*?&\(%\+'
Not specifying replacement text (as the optional 2nd RHS operand to -replace) replaces the match with the empty string and therefore effectively removes what was matched.
The regex borrowed from Tim's answer is simplified a bit, by using the inline method of specifying matching options to turn on the single-line option ((?s)) at the start of the expression, which makes subsequent . instances match newlines too (a shorter and more efficient alternative to [\s\S]).
[1] See this answer for the juxtaposition of the two, including guidance on when to use which.
We can use a tempered dot trick when matching between the two markers to ensure that we don't cross the ending marker before matching STRING-C:
ABC01(?:(?!&\(%\+)[\s\S])*?STRING-C[\s\S]*?&\(%\+
Here is an explanation of the regex pattern:
ABC01                    match the starting marker
(?:(?!&\(%\+)[\s\S])*?   match any content, without crossing the ending marker
STRING-C                 match the nearest STRING-C marker
[\s\S]*?                 then match all content, across lines, until reaching
&\(%\+                   the ending marker
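To see what the tempered dot buys, here is a small Python sketch comparing the pattern with and without it. The two-block input is made up for the demonstration: unlike the question's file, both blocks start with ABC01 so that the start marker alone cannot tell them apart. The names naive, tempered, block_d and block_c are likewise illustrative.

```python
import re

# Without tempering: the lazy [\s\S]*? may run past an ending marker.
naive = re.compile(r"ABC01[\s\S]*?STRING-C[\s\S]*?&\(%\+")
# With tempering: each consumed character must not start the ending marker.
tempered = re.compile(r"ABC01(?:(?!&\(%\+)[\s\S])*?STRING-C[\s\S]*?&\(%\+")

# Hypothetical input: the block WITHOUT STRING-C comes first.
block_d = "ABC01|01\nrandom\nSTRING-D\nmore\n&(%+\n"
block_c = "ABC01|02\nrandom\nSTRING-C\nmore\n&(%+\n"
text = block_d + block_c

print(repr(naive.sub("", text)))     # the match spans both blocks
print(repr(tempered.sub("", text)))  # only the STRING-C block is removed
```

The untempered pattern starts at the first ABC01 and keeps going until it finds STRING-C in the second block, so both blocks disappear; the tempered one gives up at the first &(%+ and retries from the next ABC01.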

Perl: how to use string variables as search pattern and replacement in regex

I want to use string variables for both search pattern and replacement in regex. The expected output is like this,
$ perl -e '$a="abcdeabCde"; $a=~s/b(.)d/_$1$1_/g; print "$a\n"'
a_cc_ea_CC_e
But when I moved the pattern and replacement to a variable, $1 was not evaluated.
$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/g; print "$a\n"'
a_$1$1_ea_$1$1_e
When I use "ee" modifier, it gives errors.
$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/gee; print "$a\n"'
Scalar found where operator expected at (eval 1) line 1, near "$1$1"
(Missing operator before $1?)
Bareword found where operator expected at (eval 1) line 1, near "$1_"
(Missing operator before _?)
Scalar found where operator expected at (eval 2) line 1, near "$1$1"
(Missing operator before $1?)
Bareword found where operator expected at (eval 2) line 1, near "$1_"
(Missing operator before _?)
aeae
What do I miss here?
Edit
Both $p and $r are written by myself. What I need is to do multiple similar regex replacing without touching the perl code, so $p and $r have to be in a separate data file. I hope this file can be used with C++/python code later.
Here are some examples of $p and $r.
^(.*\D)?((19|18|20)\d\d)年 $1$2<digits>年
^(.*\D)?(0\d)年 $1$2<digits>年
([TKZGD])(\d+)/(\d+)([^\d/]) $1$2<digits>$3<digits>$4
([^/TKZGD\d])(\d+)/(\d+)([^/\d]) $1$3分之$2$4
With $p="b(.)d"; you are getting a string with literal characters b(.)d. In general, regex patterns are not preserved in quoted strings and may not have their expected meaning in a regex. However, see Note at the end.
This is what qr operator is for: $p = qr/b(.)d/; forms the string as a regular expression.
As for the replacement part and /ee, the problem is that $r is first evaluated, to yield _$1$1_, which is then evaluated as code. Alas, that is not valid Perl code. The _ are barewords and even $1$1 itself isn't valid (for example, $1 . $1 would be).
The provided examples of $r have $Ns mixed with text in various ways. One way to parse this is to extract all $N and all else into a list that maintains their order from the string. Then, that can be processed into a string that will be valid code. For example, we need
'$1_$2$3other' --> $1 . '_' . $2 . $3 . 'other'
which is valid Perl code that can be evaluated.
The part of breaking this up is helped by split's capturing in the separator pattern.
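sub repl {
    my ($r) = @_;
    # split with a capturing separator keeps the $N terms; drop empty strings
    my @terms = grep { length } split /(\$\d)/, $r;
    # quote everything that is not a $N, then join it all into a Perl expression
    return join '.', map { /^\$/ ? $_ : q(') . $_ . q(') } @terms;
}
$var =~ s/$p/repl($r)/gee;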
With capturing /(...)/ in split's pattern, the separators are returned as a part of the list. Thus this extracts from $r an array of terms which are either $N or other text, in their original order and with everything kept (split only drops trailing empty fields). This includes possible (leading) empty strings, so those need to be filtered out.
Then every term other than $Ns is wrapped in '', so when they are all joined by . we get a valid Perl expression, as in the example above.
Then /ee will have this function return the string (such as above), and evaluate it as valid code.
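Since the question mentions reusing the same pattern file from Python later, here is a hedged sketch of how the same split-on-$N idea could look there. It is not part of the Perl solution: the helper name to_python_template is invented, and instead of building code for eval it rewrites $N into Python's \g<N> group references, so nothing from the data file is ever executed.

```python
import re

def to_python_template(repl):
    """Turn a Perl-style replacement such as '_$1$1_' into a Python re template."""
    # A capturing group in the separator keeps the $N terms in the result list.
    parts = re.split(r"(\$\d)", repl)
    out = []
    for part in parts:
        if re.fullmatch(r"\$\d", part):
            out.append(r"\g<%s>" % part[1])          # $1 -> \g<1>
        else:
            out.append(part.replace("\\", r"\\"))    # literal text: protect backslashes
    return "".join(out)

pattern, repl = r"b(.)d", "_$1$1_"
print(re.sub(pattern, to_python_template(repl), "abcdeabCde"))
```

For the pattern and replacement from the question this prints a_cc_ea_CC_e, matching the expected output of the hard-coded Perl substitution.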
We are told that safety of using /ee on external input is not a concern here. Still, this is something to keep in mind. See this post, provided by Håkon Hægland in a comment. Along with the discussion it also directs us to String::Substitution. Its use is demonstrated in this post. Another way to approach this is with replace from Data::Munge
For more discussion of /ee see this post, with several useful answers.
Note on using "b(.)d" for a regex pattern
In this case, with parens and dot, their special meaning is maintained. Thanks to kangshiyin for an early mention of this, and to Håkon Hægland for confirming it. However, this is a special case. Double-quoted strings mangle many patterns because interpolation and escape processing happen first: for example, "\w" is just an escaped w (an unrecognized escape), so the regex only sees a plain w. Single quotes should work, as there is no interpolation. Still, strings intended for use as regex patterns are best formed using qr, as we are getting a true regex. Then all modifiers may be used as well.

How to substitute a string even it contains regex meta characters using Shell or Perl?

I want to substitute a word which may contain regex metacharacters with another word, for example, substitute .Precilla123 with .Precill. I tried the solution below:
sed 's/.Precilla123/.Precill/g'
but it will change below line
"Precilla123";"aaaa aaa";"bbb bbb"
to
.Precill";"aaaa aaa";"bbb bbb"
This side effect is not what I wanted. So I tried:
perl -pe 's/\Q.Precilla123\E/.Precill/g'
The above solution disables the interpretation of regex metacharacters, so it does not have the side effect.
Unfortunately, if the word contains $ or @, this solution still cannot work, because Perl treats $ and @ as variable prefixes.
Can anybody help this? Many thanks.
Please note that the word I want to substitute is NOT hard coded, it comes from a input file, you can consider it as variable.
Unfortunately, if the word contains $ or @, this solution still cannot work, because Perl treats $ and @ as variable prefixes.
This is not true.
If the value that you want to replace is in a Perl variable, then quotemeta will work on the variable's contents just fine, including the characters $ and @:
echo 'pre$foo to .$foobar' | perl -pe 'my $from = q{.$foo}; s/\Q$from\E/.to/g'
Outputs:
pre$foo to .tobar
If the words that you want to replace are in an external file, then simply load that data in a BEGIN block before composing your regular expressions for replacement.
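As a cross-check of the "escape the variable's contents" idea, here is the equivalent in Python, where re.escape plays the role of quotemeta / \Q...\E. This is an illustrative sketch only; the function name replace_literal and the sample strings (apart from those taken from this thread) are made up.

```python
import re

def replace_literal(text, word, replacement):
    """Replace every literal occurrence of `word`, whatever characters it contains."""
    # re.escape neutralizes regex metacharacters in the search word; the lambda
    # keeps the replacement literal too (no backslash or group processing).
    return re.sub(re.escape(word), lambda m: replacement, text)

# The leading dot no longer matches an arbitrary character:
print(replace_literal('"Precilla123";"aaaa aaa"', ".Precilla123", ".Precill"))
# $ and @ in the word are handled like any other character:
print(replace_literal("pre$foo to .$foobar", ".$foo", ".to"))
print(replace_literal("mail @list and .@list", ".@list", ".x"))
```

In both languages the point is the same: the word is held in a variable and escaped as data, so sigils such as $ and @ are never seen by the parser as part of the source code.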
sed 's/\.Precilla123/.Precill/g'
Escape the meta character with \.
Be careful: the metacharacters are not the same on both sides of the substitution. In the search pattern they are mainly the regex ones, []{}()\.*+^$, whereas in the replacement only & and \ are special (plus, on both sides, the separator, which is whatever character follows the s).

Bash regex to match dots and characters

I'm trying to use the =~ operator to execute a regular expression pattern against a curl response string.
The pattern I'm currently using is:
name\":\"(\.[a-zA-Z]+)\"
Currently, however, this pattern only extracts values that contain only the characters a-z and A-Z. I need this pattern to also pick up values that contain a '.' character and a '@' character. How would I do this?
Also, is there any way this pattern can be improved performance wise? It takes quite a long time to execute against the string.
Cheers.
I recently ran into this problem in my script that sets my bash prompt according to my git status, and found that it was because of the placement of other things (namely, a hyphen) I wanted to match inside the expression.
For example, I wanted to match a certain part of a git status output, e.g. the part where it says "Your branch is ahead of 'origin/mybranch' by 1 commit."
This was my original pattern:
"Your branch is (ahead of|behind) '([a-zA-Z0-9_-]+)/([a-zA-Z0-9_-]+)' by ([0-9]+) commit".
One day I created a branch that had a . in it and found that my bash prompt wasn't showing me the right thing, and modified the expression to the following:
"Your branch is (ahead of|behind) '([a-zA-Z0-9_-]+)/([a-zA-Z0-9_-.]+)' by ([0-9]+) commit".
I expected it to work just fine, but instead there was no match at all.
After reading a lot of posts, I realized it was because of the placement of the hyphen (-); I had to put it right after the first square bracket, otherwise it would be interpreted as a range (in this case, it was trying to interpret the range _-., which is invalid or just somehow makes the whole expression fall over).
It started working when I changed the expression to the following:
"Your branch is (ahead of|behind) '([a-zA-Z0-9_-]+)/([-a-zA-Z0-9_.]+)' by ([0-9]+) commit".
So basically what I meant to say is that it could be something else in your expression (like the hyphen in mine) that is interfering with the matching of the dot and the at sign.
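The hyphen-placement effect can be reproduced in other regex engines too. The sketch below uses Python's re module, not bash's [[ =~ ]] (which relies on the C library's regex functions and may report the problem differently), so treat it as an illustration of the principle rather than of bash's exact behaviour. The variable names are made up.

```python
import re

bad = r"[a-zA-Z0-9_-.]+"    # "_-." is read as a range running from "_" to "."
good = r"[-a-zA-Z0-9_.]+"   # a leading hyphen is taken literally

try:
    re.compile(bad)
    bad_error = None
except re.error as exc:     # Python rejects the reversed range outright
    bad_error = str(exc)

print("bad pattern error:", bad_error)
print(re.fullmatch(good, "my-branch_1.2") is not None)
```

Python refuses to compile the first pattern because "_" sorts after "." and the range is therefore reversed; moving the hyphen to the front of the bracket expression removes the ambiguity.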
Working example script:
#!/bin/bash
regex='"name":"([a-zA-Z.@]+)"'
input='"name":"internal.action.retry.queue@temp"'
if [[ $input =~ $regex ]]
then
    echo "$input matches regex $regex"
    for (( i=0; i<${#BASH_REMATCH[@]}; i++))
    do
        echo -e "\tGroup[$i]: ${BASH_REMATCH[$i]}"
    done
else
    echo "$input does not match regex $regex"
fi
Just add the dot ('.') and the at sign ('@'):
name\":\"(\.[a-zA-Z.@]+)\"
If you don't need the mandatory dot at the beginning of the value, use this:
\"name\":\"([a-zA-Z.@]+)\"

Specific Perl Regular Expression Needed

Have a perl script that needs to process all files of a certain type from a given directory. The files should be those that end in .jup, and SHOULDN'T contain the word 'TEMP_' in the filename. I.E. It should allow corrected.jup, but not TEMP_corrected.jup.
Have tried a look-ahead, but obviously have the pattern incorrect:
/(?!TEMP_).*\.jup$/
This returns the entire directory contents though, including files with any extension and those containing TEMP_, such as the file TEMP_corrected.jup.
The regular expression you want is:
/^((?!TEMP_).)*\.jup$/
The main difference is that your regular expression is not anchored at the start of the string, so it matches any substring that satisfies your criteria - so in the example of TEMP_corrected.jup, the substrings corrected.jup and EMP_corrected.jup both match.
(The other difference is that putting () round both the lookahead and the . ensures that TEMP_ isn't allowed anywhere in the string, as opposed to just not at the start. Not sure whether that's important to you or not!)
If you're getting files other than .jup files, then there is another problem - your expression should only match .jup files. You can test your expression with:
perl -ne 'if(/^((?!TEMP_).)*\.jup$/) {print;}'
then type strings: perl will echo them back if they match, and not if they don't. For example:
$ perl -ne 'if(/^((?!TEMP_).)*\.jup$/) {print;}'
foo
foo.jup
foo.jup <-- perl printed this because 'foo.jup' matched
TEMP_foo.jup
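The difference between the two expressions can also be checked in a batch rather than interactively. The following Python sketch applies both patterns to a few made-up file names (the list and the variable names are purely illustrative); Python's lookahead syntax is the same as Perl's for this case.

```python
import re

UNANCHORED = re.compile(r"(?!TEMP_).*\.jup$")      # the question's attempt
ANCHORED = re.compile(r"^((?!TEMP_).)*\.jup$")     # the anchored, tempered version

names = ["corrected.jup", "TEMP_corrected.jup", "old_TEMP_x.jup", "notes.txt"]

# The unanchored pattern can start matching one character in ("EMP_...").
print([n for n in names if UNANCHORED.search(n)])
# The anchored pattern rejects TEMP_ anywhere in the name.
print([n for n in names if ANCHORED.search(n)])
```

Only corrected.jup survives the anchored pattern, while the unanchored one lets every .jup name through; neither accepts notes.txt, which supports the remark that files with other extensions must be coming from somewhere else in the script.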