Extract text from a multiline string using Perl - regex

I have a string that covers several lines. I need to extract the text between two strings. For example:
Start Here Some example
text covering a few
lines. End Here
I need to extract the string, Start Here Some example text covering a few lines.
How do I go about this?

Use the /s regex modifier to treat the string as a single line:
/s
Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
$string =~ /(Start Here.*)End Here/s;
print $1;
This will capture up to the last End Here, in case it appears more than once in your text.
If this is not what you want, then you can use:
$string =~ /(Start Here.*?)End Here/s;
print $1;
This will stop matching at the very first occurrence of End Here.
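The difference between the two patterns, and what happens without /s, can be checked with a small script. This is only a sketch: the sample string, including its second End Here marker, is made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample text with two "End Here" markers.
my $string = "Start Here Some example\ntext covering a few\n"
           . "lines. End Here\nmore text End Here\n";

# Greedy: .* runs on to the last "End Here".
my ($greedy) = $string =~ /(Start Here.*)End Here/s;

# Non-greedy: .*? stops at the first "End Here".
my ($lazy) = $string =~ /(Start Here.*?)End Here/s;

# Without /s, "." cannot cross a newline, so this one fails to match.
my ($no_s) = $string =~ /(Start Here.*?)End Here/;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
print "no /s:  ", (defined $no_s ? "[$no_s]" : "no match"), "\n";
```

Capturing in list context, as above, leaves the variable undefined when the match fails, which avoids printing a stale $1 from an earlier match.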

print $1 if /(Start Here.*?)End Here/s;

Wouldn't the correct modifier to treat the string as a single line be (?s) rather than (/s)? I've been wrestling with a similar problem for quite a while now, and the RegExp Tester embedded in JMeter's View Results Tree listener shows that my regular expression extractor with the regex
(?s)<FMSFlightPlan>(.*?)</FMSFlightPlan>
matches
<FMSFlightPlan>
C87D
AN NTEST/GL
- FPN/FN/RP:DA:GCRR:AA:EIKN:F:SAMAR,N30540W014249.UN873.
BAROK,N35580W010014..PESUL,N40529W008069..RELVA,N41512W008359..
SIVIR,N46000W008450..EMPER,N49000W009000..CON,N53545W008492
</FMSFlightPlan>
while the regex
(/s)<FMSFlightPlan>(.*?)</FMSFlightPlan>
does not match. Other regex testers show the same result. However, when I try to execute the script I get the Beanshell Assertion error:
Assertion failure message: org.apache.jorphan.util.JMeterException: Error invoking bsh method: eval Sourced file: inline evaluation of: ``import java.io.*; //write out the data results to a file outfile = "/Users/Dani . . . '' Token Parsing Error: Lexical error at line 12, column 380. Encountered: "\n" (10),
So something else is definitely wrong with mine. Anyway, just a suggestion

Related

Perl Regex to exclude certain strings within the middle of a string

I have this string:
The file FILENAME has not been received
I'm trying to get the regex to match this, but not if the string is this (for example):
The file FAILNAME has not been received
I've got this regex so far:
/^(?=.*?\bThe\sfile\b)((?!FAILNAME).)*$/
But I'm unsure how to continue the expected text after the exclusion.
I hope I've explained that correctly :)
Thanks in advance!
I'd do it in two steps:
if ($message =~ /\AThe file (\S+) has not been received\z/ && $1 ne 'FAILNAME') { ... }
I.e. use a regex to validate the general format and extract the filename, then check the extracted name separately.
Why cram everything into a single regex?
Speaking of which, you can actually cram arbitrary conditions into a regex. I wouldn't recommend it in this case, but:
/\AThe file (\S+) has not been received\z(?(?{ $1 eq 'FAILNAME' })(*FAIL))/
This extended pattern essentially says "if $1 equals FAILNAME, fail the match".
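The two-step check from this answer could be wrapped in a small helper, sketched below. The name is_wanted and the sample messages are hypothetical and only serve to illustrate the idea.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the message has the expected shape
# and the extracted filename is not the excluded name.
sub is_wanted {
    my ($message) = @_;
    # Step 1: validate the overall format and capture the filename.
    return 0 unless $message =~ /\AThe file (\S+) has not been received\z/;
    # Step 2: check the captured name separately.
    return $1 ne 'FAILNAME' ? 1 : 0;
}

for my $msg (
    'The file FILENAME has not been received',
    'The file FAILNAME has not been received',
    'The file FILENAME has been received',
) {
    printf "%-42s %s\n", $msg, is_wanted($msg) ? 'match' : 'no match';
}
```

Keeping the exclusion outside the regex makes it easy to replace the single name with a list or a hash lookup later.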
You could move the negative lookahead to after file followed by a whitespace character to assert what is directly on the right is not FAILNAME:
^The\sfile\s(?!\bFAILNAME\b).*$
Or, if it cannot occur anywhere in the string after The file, add .* inside the lookahead:
^The\sfile\s(?!.*\bFAILNAME\b).*$
If FAILNAME must not be directly preceded or followed by a non-whitespace character, you could use lookarounds:
^The\sfile\s(?!.*(?<!\S)FAILNAME(?!\S)).*$
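To see how the first two lookahead variants differ, here is a short sketch that runs them against a few invented lines. The variable names and the sample lines are assumptions, not part of the question.

```perl
use strict;
use warnings;

# FAILNAME must not come directly after "The file ".
my $next_word = qr/^The\sfile\s(?!\bFAILNAME\b).*$/;
# FAILNAME must not appear anywhere after "The file ".
my $anywhere  = qr/^The\sfile\s(?!.*\bFAILNAME\b).*$/;

my @lines = (
    'The file FILENAME has not been received',
    'The file FAILNAME has not been received',
    'The file FILENAME was renamed to FAILNAME',
    'The file AFAILNAME has not been received',
);

for my $line (@lines) {
    printf "%-44s next_word=%-3s anywhere=%s\n",
        $line,
        ($line =~ $next_word ? 'yes' : 'no'),
        ($line =~ $anywhere  ? 'yes' : 'no');
}
```

Only the third line separates the two patterns: FAILNAME appears later in the string but not directly after The file, so the first pattern accepts it and the second rejects it.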
My guess is that we wish to fail the strings that contain FAILNAME, which your original expression already seems to do, so we could modify it slightly; this might work:
^(?=The\sfile\s)(?!.*\s\bFAILNAME\b\s.*).*$
Here the \s on either side of FAILNAME acts as an extra left and right boundary; if we do not want those boundaries, we can simply leave them out.
Example
use strict;
my $str = 'The file FILENAME has not been received
The file FILENAME has been received
The file FILENAME
The file AFAILNAME has not been received
FILENAME
The file FAILNAME has not been received
FAILNAME has not been received
The file FAILNAME
';
my $regex = qr/^(?=The\sfile\s)(?!.*\s\bFAILNAME\b\s.*).*$/mp;
if ( $str =~ /$regex/g ) {
print "Whole match is ${^MATCH} and its start/end positions can be obtained via \$-[0] and \$+[0]\n";
# print "Capture Group 1 is $1 and its start/end positions can be obtained via \$-[1] and \$+[1]\n";
# print "Capture Group 2 is $2 ... and so on\n";
}
# ${^POSTMATCH} and ${^PREMATCH} are also available with the use of '/p'
# Named capture groups can be called via $+{name}

Perl extract group with lookbehind from different line

I've tried a web search and have read several answers on Stack Exchange, but I still cannot grasp why my command does not extract anything. In the end I want to extract a group with lookbehind from a different line, e.g. from
Code>TEST1<Code Code2>best<Code2
Code>test2<Code
Type>false<Type
by finding the needed key between the Type tags and extracting the first Code above the match, so in the case above the result would be test2. But I cannot manage to extract anything at all across multiple lines, i.e.
perl -lne 'print $1,"_",$2 if /Code>(.*)<Code[\s\S\n]*?Type>(.*)<Type/'<test.txt prints nothing.
I've played with removing the l and n switches, adding/removing the non-greedy ?, and trying just . in place of [\s\S\n].
perl -lne 'print $1,"_",$2 if /Code>(.*)<Code[\s\S\n]*?Code2>(.*)<Code2/'<test.txt
gives TEST1_best so same line extraction works.
What am I missing? Can what I want be done in one line of command?
The following command answers your question: it collects all values contained in a Code>...<Code pattern, if they are followed by a Type>...<Type pattern (with potentially other patterns in between, but no other occurrences of Code>...<Code in between):
perl -lne 's/^.*?(?=Code>)//s; for (split /Code>/) { print qq($1:$2\n) if /(.*?)<Code.*?Type>(.*?)<Type/s }' -0777 <test.txt
If e.g. test.txt contains the following lines,
Code>test4<Code Type>false<Type
Code>test3<Code
Type>true<Type
Code>TEST1<Code Code2>best<Code2
Code>test2<Code
Type>false<Type
then the command will collect the following value pairs:
test4:false
test3:true
test2:false
Edited on 04/08/2019, 17:38 CEST: I edited the command to remove the "header part" of the file (the part before the first occurrence of Code>), as it might, through some error by the file's editor, contain a closing tag <Code that had not been opened with Code> but with a typo such as Cde>. My assumption was that the complete file was "syntactically correct" in the sense that it consists of elements of type /(\w+)>.*?<\1/, separated by whitespace (including newlines). For files which do not conform to this syntax, the statement was not watertight.
Another way to do it, using progressive matching and embedded code:
perl -lne 'while (/\b(?:Code>(.*?)<Code(?{$c=$1})|Type>(.*?)<Type(?{print qq($c:$2\n) if defined $c;undef $c}))\b/g){}' -0777 <test.txt
Explanations:
Basically, the expression finds occurrences of Code>(.*?)<Code or Type>(.*?)<Type. This gives the basic form of an alternation in a non-capturing group: (?:Code>(.*?)<Code|Type>(.*?)<Type).
The word boundary assertions \b around the group ensure that the keywords Code and Type are matched, but not e.g. Code2 or TType.
The modifier g ensures progressive application of the regular expression on the string. Since I want to extract the result inside of the expression itself, I place the regex in an empty loop, i.e. while (/.../g) {}.
You assume a grammar rule Code ⟶ Type, i.e. you look for occurrences of a Type token following a Code token. For this, a Code token is stored in a variable $c by the code expression (?{$c=$1}). If a Type token is found, it is considered a match only if a Code token has been found before it, indicated by the fact that the variable $c is defined. In any case, once a Type token has been found, the variable $c is undefined again to prepare for the next search. This gives the code evaluation (?{print qq($c:$2\n) if defined $c;undef $c}) in the Type branch of the regular expression.
Note that the captures of the Code>(.*?)<Code and Type>(.*?)<Type tokens may be the empty string. This is why I am working with undef $c and if defined $c instead of the simpler $c='' and if $c.
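The same state machine can also be written as an ordinary script, with the bookkeeping moved out of the embedded (?{...}) blocks and into the loop body. The following is a sketch of that variation rather than the answer's exact one-liner. The variable names are assumptions, and the sample text mirrors the test.txt contents shown earlier.

```perl
use strict;
use warnings;

# Sample input mirroring the test.txt contents shown earlier.
my $text = <<'END';
Code>test4<Code Type>false<Type
Code>test3<Code
Type>true<Type
Code>TEST1<Code Code2>best<Code2
Code>test2<Code
Type>false<Type
END

my @pairs;
my $code;    # most recent Code value not yet paired with a Type

while ($text =~ /\b(?:Code>(.*?)<Code|Type>(.*?)<Type)\b/g) {
    if (defined $1) {
        $code = $1;                               # Code branch matched
    }
    else {
        push @pairs, "$code:$2" if defined $code; # Type branch matched
        undef $code;                              # reset for the next pair
    }
}

print "$_\n" for @pairs;
```

Testing defined $1 tells the two branches apart even when a captured value is the empty string, for the same reason the answer uses defined $c.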
If your data is in a file named d, with GNU sed:
sed -Ez 's/.*Code>(\w+)<Code\sType>\w*<Type.*/\1/' d
With Perl:
perl -ne 'BEGIN{undef $/} /Code>(\w+)<Code\nType>\w*<Type/; print $1' d

perl script to match comma

I have a netlist generated from a schematic. This netlist includes power pins. I am trying to write a Perl script to remove the power pins from the netlist.
As part of this I have to search for a string that matches the pattern shown below:
", );"
I have used the following code and it is not working:
$line =~ s/,\s+\);//g
I have observed that patterns ending with a comma are matched, but patterns starting with a comma or with a comma in the middle are not.
Any suggestions on how to get this to work?
You need to use this instead:
s/,\s*\);//
You should be defensive and be able to handle no whitespace between the , and the ). You have to escape the ). See perldoc perlre for more info.
Thank you, everyone. I have found the problem: the pattern to be recognized is split across two lines. The "," is on one line, followed by ");" on the next line. At first I was removing the newline character and assumed that the next line would get appended to the current line, which does not happen. Hence the pattern matching did not work.
To resolve this, I have to read the file once again and then replace the pattern.
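One way to handle a pattern that is split across two lines is to read the whole file into a single string before substituting. The sketch below assumes a made-up netlist fragment and reads it through an in-memory filehandle so that it is self-contained; a real script would open the netlist file instead.

```perl
use strict;
use warnings;

# Made-up netlist fragment: the trailing "," and the ");" are on
# different lines, as described above.
my $source = "  .a(a),\n  .b(b),\n);\n";

# Slurp everything at once: with $/ undefined, <$fh> returns the whole input.
open my $fh, '<', \$source or die "open failed: $!";
my $text = do { local $/; <$fh> };
close $fh;

# Line by line, no single line contains both the comma and ");".
my $per_line_hits = grep { /,\s*\);/ } split /^/, $text;

# On the whole text, \s* also matches the newline between the two parts.
(my $cleaned = $text) =~ s/,\s*\);//g;

print "lines matching on their own: $per_line_hits\n";
print $cleaned;
```

The replacement is left empty here, as in the substitution from the answer above; whether the closing ); should be kept depends on the netlist format.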

Tcl regexp does not escape asterisk (*)

In my script I get a string that looks like this:
Reading thisfile.txt
"lib" maps to directory somedir/work.
"superlib" maps to directory somedir/work.
"anotherlib" maps to directory somedir/anotherlib.
** Error: (errorcode) Cannot access file "somedir/anotherlib". <--
No such file or directory. (errno = ENOENT) <--
Reading anotherfile.txt
.....
But the two marked lines with the error code only appear from time to time.
I'm trying to use a regular expression to get the lines from after Reading thisfile.txt to the line before either Reading anotherfile.txt or, if it is there, before **.
So result should in every case look like this:
"lib" maps to directory somedir/work.
"superlib" maps to directory somedir/work.
"anotherlib" maps to directory somedir/anotherlib.
I have tried it with this regexp:
set pattern ".*Reading thisfile.txt\n(.*)\n.*Reading .*$"
Then I do
regexp -all $pattern $data -> result
But that only works if there is no error message.
So I'm trying to look for the *.
set pattern ".*Reading thisfile.txt\n(.*)\n.*\[\*|Reading\].*$"
But that also does not work. The part with ** Error is still there.
I wonder why. This one doesn't even compile:
set pattern ".*Reading thisfile.txt\n(.*)\n.*\*?.*Reading .*$"
Any idea how I can match the literal * here?
From the way you wrote your regex, you will have to use braces:
set pattern {.*Reading thisfile\.txt\n(.*)\n.*\*?.*Reading .*$}
If you used quotes, you would have had to use:
set pattern ".*Reading thisfile\\.txt\n(.*)\n.*\\*?.*Reading .*$"
i.e. basically put a second backslash to escape the first ones.
The above will be able to grab something; albeit everything between the first and the last Reading.
If you want to match from Reading thisfile.txt to the next line beginning with asterisk, then you could use:
set pattern {^Reading thisfile\.txt\n(.*?)\n(?=^Reading|^\*)}
regexp -all -lineanchor -- $pattern $data -> result
(?=^Reading|^\*) is a positive lookahead and I changed your (.*) to (.*?) so that you really get all the occurrences and not from the first to the last Reading.
The positive lookahead will match if either Reading or * is ahead and are both starting on a new line.
-lineanchor makes ^ match at every beginning of line instead of at the start of the string.
I forgot to mention that if you have more than one match, you will have to set the results of the regexp and use the -inline modifier instead of using the above construct (else you'll get only the last submatch)...
set results [regexp -all -inline -lineanchor -- $pattern $data]
foreach {main sub} $results {
    puts $sub
}
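For readers more at home in Perl, the lookahead pattern above translates almost directly. This is a sketch of the equivalent, not part of the Tcl answer: /m plays the role of -lineanchor, and /s is added because . matches a newline by default in Tcl but not in Perl. The sample data is the listing from the question without the <-- markers.

```perl
use strict;
use warnings;

my $data = <<'END';
Reading thisfile.txt
"lib" maps to directory somedir/work.
"superlib" maps to directory somedir/work.
"anotherlib" maps to directory somedir/anotherlib.
** Error: (errorcode) Cannot access file "somedir/anotherlib".
No such file or directory. (errno = ENOENT)
Reading anotherfile.txt
END

# /m: ^ matches at the start of every line; /s: . also matches "\n".
# The lookahead stops the lazy capture before a line that starts
# with "Reading" or with "*".
my ($block) = $data =~ /^Reading thisfile\.txt\n(.*?)\n(?=^Reading|^\*)/ms;

print "$block\n" if defined $block;
```

If the error lines are absent, the capture stops before the next Reading line instead.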
I'm unfamiliar with tcl but the following regex will give you matches of which the 1st capture-group contains the filename and the 2nd capture-group contains all the lines you want:
^Reading ([^\n]*)\n((?:[^\n]|\n(?!Reading|\*\*))*)
Basically the (?:[^\n]|\n(?!Reading|\*\*))* is saying "Match anything that isn't a new-line character or a new-line character not followed by either Reading or **".
What I'm getting from Jerry's answer is you'd define that in tcl like so:
set pattern {^Reading ([^\n]*)\n((?:[^\n]|\n(?!Reading|\*\*))*)}

What is the regular Expression to uncomment a block of Perl code in Eclipse?

I need a regular expression to uncomment a block of Perl code, commented with # in each line.
As of now, my find expression in the Eclipse IDE is (^#(.*$\R)+) which matches the commented block, but if I give $2 as the replace expression, it only prints the last matched line. How do I remove the # while replacing?
For example, I need to convert:
# print "yes";
# print "no";
# print "blah";
to
print "yes";
print "no";
print "blah";
In most flavors, when a capturing group is repeated, only the last capture is kept. Your original pattern uses + repetition to match multiple lines of comments, but group 2 can only keep what was captured in the last match from the last line. This is the source of your problem.
To fix this, you can remove the outer repetition, so you match and replace one line at a time. Perhaps the simplest pattern to do this is to match:
^#\s*
And replace with the empty string.
Since this performs the match and replacement one line at a time, you must repeat it as many times as necessary (in some flavors you can use the g global flag; in Java, for example, there is the replaceFirst/replaceAll pair of methods instead).
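Both points can be reproduced in Perl, used here only as a convenient regex flavor for illustration. The first pattern is a simplified variant of the one in the question, and the variable names are assumptions.

```perl
use strict;
use warnings;

my $block = qq{# print "yes";\n# print "no";\n# print "blah";\n};

# A capturing group inside a repetition keeps only its final iteration,
# so the capture holds text from the last comment line only.
my ($last) = $block =~ /^(?:#(.*)\n)+/;

# Matching one line at a time and repeating with /g (plus /m so that ^
# matches at every line start) strips the marker from every line.
# Caveat: \s* would also swallow the newline of an empty "#" line.
(my $uncommented = $block) =~ s/^#\s*//mg;

print "last capture: [$last]\n";
print $uncommented;
```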
References
regular-expressions.info/Repeating a Captured Group vs Capturing a Repeated Group
Related questions
Is there a regex flavor that allows me to count the number of repetitions matched by * and +?
.NET regex keeps all repeated matches
Special note on Eclipse keyboard shortcuts
In Java mode, Eclipse already has keyboard shortcuts to add/remove/toggle block comments. By default, Ctrl+/ binds to the "Toggle comment" action. You can highlight multiple lines and hit Ctrl+/ to toggle block comments (i.e. //) on and off.
You can hit Ctrl+Shift+L to see a list of all keyboard shortcuts. There may be one in Perl mode to toggle Perl block comments #.
Related questions
What is your favorite hot-key in Eclipse?
Hidden features of Eclipse
Search with ^#(.*$) and replace with $1
You can try this one: -
use strict;
use warnings;
my $data = "#Hello\n#stack\n#overflow\n";
$data =~ s/^#//mg;   # /m lets ^ match at the start of every line
print $data;
OUTPUT:-
Hello
stack
overflow
Or
open(IN, '<', "test.pl") or die $!;
read(IN, my $data, -s "test.pl"); #reading a file
$data =~ s/^#//mg;   # strip a leading # from every line
open(OUT, '>', "test1.pl") or die $!;
print OUT $data; #Writing a file
close OUT;
close IN;
Note: Watch out for the #!/usr/bin/perl line in the Perl script; it will be uncommented as well.
You need the GLOBAL g switch.
s/^#(.+)/$1/g
In order to determine whether a perl '#' is a comment or something else, you have to compile the perl and build a parse tree, because of Schwartz's Snippet
whatever / 25 ; # / ; die "this dies!";
Whether that '#' is a comment or part of a regex depends on whether whatever() is nullary, which depends on the parse tree.
For the simple cases, however, yours is failing because (^#(.*$\R)+) repeats a capturing group, which is not what you wanted.
But anyway, if you want to handle simple cases, I don't even like the regex that everyone else is using, because it fails if there is whitespace before the # on the line. What about
^\s*#(.*)$
? This will match any line that begins with a comment (optionally with whitespace, e.g., for indented blocks).
Try this regex:
(^[\t ]+)(\#)(.*)
With this replacement:
$1$3
Group 1 is (^[\t ]+) and matches all leading whitespace (spaces and tabs).
Group 2 is (#) and matches one # character.
Group 3 is (.*) and matches the rest of the line.
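A variation on this idea, sketched in Perl, uses [ \t]* instead of [\t ]+ so that comments starting in the first column are handled too, and drops the separate group for the # itself. These changes are assumptions of this sketch, not part of the answer above.

```perl
use strict;
use warnings;

# One indented comment line and one starting in column 1.
my $src = <<'END';
    # print "yes";
# print "no";
END

# $1 keeps the leading whitespace (possibly empty), $2 is the rest of
# the line after the #; /m lets ^ and $ work per line, /g repeats.
(my $out = $src) =~ s/^([ \t]*)#(.*)$/$1$2/mg;

print $out;
```

The space that followed each # is kept, so the uncommented code ends up indented one column further than the comment marker was.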