regular expression to match multiline text including delimiters - regex

I want to get data between delimiters and include the delimiters in the match.
Example text:
>>> Possible error is caused by the segmentation fault
provided detection report:
<detection-report>
This is somthing that already in the report.
just an example report.
</detection-report>
---------------------------------------------
have a nice day
My current code is:
if($oopsmessage =~/(?<=<detection-report>)((.|\n|\r|\s)+)(?=<\/detection-report>)/) {
$this->{'detection_report'} = $1;
}
It retrieves the following:
This is something that already in the report.
just an example report.
How can I include both detection-report delimiters in the match?

You can simplify the regex to the following:
my ($report) = $oopsmessage =~ m{(<detection-report>.*?</detection-report>)}s;
Notice that I used different delimiters (m{...}) to avoid the "leaning toothpick syndrome".
The s modifier makes . match newlines.
The parentheses in ($report) force list context, so the match returns all the matching groups. $1 is therefore assigned to $report.
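A minimal, self-contained sketch of the list-context match described above. The sample message is copied from the question; the print line is only there for illustration.

```perl
use strict;
use warnings;

# Sample message modeled on the question's example text.
my $oopsmessage = <<'END';
>>> Possible error is caused by the segmentation fault
provided detection report:
<detection-report>
This is somthing that already in the report.
just an example report.
</detection-report>
---------------------------------------------
have a nice day
END

# List context: the match returns its capture groups, so $report receives $1.
# /s lets . cross newlines; .*? stops at the first closing tag.
my ($report) = $oopsmessage =~ m{(<detection-report>.*?</detection-report>)}s;

print defined $report ? "$report\n" : "no report found\n";
```

If the pattern does not match, $report is simply undef, so the defined check doubles as the success test.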

(<detection-report>(?:(?!<\/detection-report>).)*<\/detection-report>)
Try this. Use the g and s flags. See demo.
http://regex101.com/r/xT7yD8/18

Just do:
if ($oopsmessage =~ m#(<detection-report>[\s\S]+?</detection-report>)#) {
$this->{'detection_report'} = $1;
}
or, if you're reading a file line by line:
while(<$fh>) {
if (/<detection-report>/ .. /<\/detection-report>/) {
$this->{'detection_report'} .= $_;
}
}
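The line-by-line variant can be tried without a real file. This sketch assumes an in-memory filehandle and a made-up input string; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical input; a real script would read from a file instead.
my $text = "intro line\n<detection-report>\nline one\nline two\n</detection-report>\ntrailer\n";

# Open an in-memory filehandle so the sketch runs on its own.
open my $fh, '<', \$text or die "cannot open in-memory handle: $!";

my $detection_report = '';
while (my $line = <$fh>) {
    # In boolean context, .. acts as a flip-flop: true from the line matching
    # the first pattern through the line matching the second, inclusive.
    if ($line =~ /<detection-report>/ .. $line =~ m{</detection-report>}) {
        $detection_report .= $line;
    }
}
close $fh;

print $detection_report;
```

Because the range is inclusive at both ends, both delimiter lines end up in the collected text.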

Use the below regex to get the data with delimiters.
(<detection-report>[\S\s]+?<\/detection-report>)
Group index 1 contains the string you want.
[\S\s] matches any single character, whitespace or not, including newlines. The +? after it repeats that one or more times, as few times as possible.
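The ? after the + makes the quantifier lazy. Assuming a message could ever contain more than one report (the question only shows one), this sketch compares a greedy and a lazy quantifier on a made-up two-report string:

```perl
use strict;
use warnings;

# Hypothetical input with two reports, used only to compare the quantifiers.
my $two = "<detection-report>A</detection-report>\nnoise\n<detection-report>B</detection-report>\n";

# Greedy: runs from the first opening tag to the LAST closing tag.
my ($greedy) = $two =~ m#(<detection-report>[\S\s]+</detection-report>)#;

# Lazy: stops at the FIRST closing tag.
my ($lazy) = $two =~ m#(<detection-report>[\S\s]+?</detection-report>)#;

print "greedy spans both reports: ", ($greedy =~ /noise/ ? "yes" : "no"), "\n";
print "lazy: $lazy\n";
```

With a single report in the input, both forms return the same text.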

/(<detection-report>.*?<\/detection-report>)/gs

You can simplify your regex to the following:
if($oopsmessage =~ m#(<detection-report>.+</detection-report>)#s) {
$this->{'detection_report'} = $1;
}
say $this->{'detection_report'};
The s modifier allows a multiline match, because . can then match a newline. Using # instead of / as the delimiter means no faffing around with escaping slashes.
Output:
<detection-report>
This is somthing that already in the report.
just an example report.
</detection-report>

Related

How to write regular expression in powershell

I need regular expression in powershell to split string by a string ## and remove string up-to another character (;).
I have the following string.
$temp = "admin#test.com## deliver, expand;user1#test.com## deliver, expand;group1#test.com## deliver, expand;"
Now I want to split this string and collect only the email IDs into a new array. My expected output is:
admin#test.com
user1#test.com
group1#test.com
To get the above output, I need to split the string on ## and remove the substring up to the semicolon (;).
Can anyone help me write a regex to achieve this in PowerShell?
If you want to use regex-based splitting with your approach, you can use the ##[^;]*; regex with this code, which also removes all empty values (with | ? { $_ }):
$res = [regex]::Split($temp, '##[^;]*;') | ? { $_ }
The ##[^;]*; matches:
## - double #
[^;]* - zero or more characters other than ;
; - a literal ;.
Use [regex]::Matches to get all occurrences of your regular expression. If this suits you, you probably don't need to split the string first:
\b\w+#[^#]*
PowerShell code:
[regex]::Matches($temp, '\b\w+#[^#]*') | ForEach-Object { $_.Groups[0].Value }
Output:
admin#test.com
user1#test.com
group1#test.com

perl Regex replace for specific string length

I am using Perl to do some prototyping.
I need an expression that replaces e with [ee] if the string is exactly 2 chars long and ends with "e".
le -> l [ee]
me -> m [ee]
elle -> elle : no change
I cannot test the length of the string, I need one expression to do the whole job.
I tried:
`s/(?=^.{0,2}\z).*e\z%/[ee]/g` but this is replacing the whole string
`s/^[c|d|j|l|m|n|s|t]e$/[ee]/g` same result (I listed the possible letters that could precede my "e")
`^(?<=[c|d|j|l|m|n|s|t])e$/[ee]/g` but I have no match, not sure I can use ^ on a positive look behind
EDIT
Guys you're amazing, hours of search on the web and here I get answers minutes after I posted.
I tried all your solutions and they are working perfectly directly in my script, i.e. this one:
my $test2="le";
$test2=~ s/^(\S)e$/\1\[ee\]/g;
print "test2:".$test2."\n";
-> test2:l[ee]
But I am loading these regex from a text file (using Perl for proto, the idea is to reuse it with any language implementing regex):
In the text file I store for example (I used % to split the line between match and replace):
^(\S)e$% \1\[ee\]
and then I parse and apply all regex like that:
my $test="le";
while (my $row = <$fh>) {
chomp $row;
if( $row =~ /%/){
my @reg = split /%/, $row;
#if no replacement, put empty string
if($#reg == 0){
push(@reg,"");
}
print "reg found, reg:".$reg[0].", replace:".$reg[1]."\n";
push @regs, [ @reg ];
}
}
print "orgine:".$test."\n";
for my $i (0 .. $#regs){
my $p=$regs[$i][0];
my $r=$regs[$i][1];
$test=~ s/$p/$r/g;
}
print "final:".$test."\n";
This technique is working well with my other regex, but not yet when I have a $1 or \1 in the replace... here is what I am obtaining:
final:\1\ee\
PS: you answered to initial question, should I open another post ?
Something like s/(?i)^([a-z])e$/$1\[ee\]/ (the brackets are escaped so that Perl does not read $1[ee] as an array element).
Why aren't you using a capture group to do the replacement?
`s/^([cdjlmnst])e$/$1 [ee]/g`
If those are the characters you need and if it is indeed one word to a line with no whitespace before it or after it, then this will work.
Here's another option, depending on what you are looking for. It matches a two-character string consisting of one a-z character followed by an 'e', on its own line, with optional whitespace before or after. It replaces this with the single a-z character followed by ' [ee]':
`s/^\s*([a-z])e\s*$/\1 [ee]/`
^(\S)e$
Try this. Replace by $1 [ee]. See demo.
https://regex101.com/r/hR7tH4/28
I'd do something like this
$word =~ s/^(\w)e$/$1\[ee\]/;
You can use following regex which match 2 character and then you can replace it with $1\[$2$2\]:
^([a-zA-Z])([a-zA-Z])$
Demo :
$my_string =~ s/^([a-zA-Z])([a-zA-Z])$/$1\[$2$2\]/;
See demo https://regex101.com/r/iD9oN4/1
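For the follow-up in the EDIT, where the pattern and replacement are read from a text file: a replacement held in a variable is inserted as plain text, so \1 or $1 inside it is never expanded. One possible workaround, offered as a sketch rather than the accepted fix, is to expand the placeholders by hand. The helper names expand_template and apply_rule are hypothetical, and the sketch assumes at most three capture groups per rule.

```perl
use strict;
use warnings;

# Expand \1..\9 or $1..$9 in a replacement template, then drop the
# backslash from escaped literals such as \[ and \].
sub expand_template {
    my ($template, @groups) = @_;    # $groups[0] is the first capture, and so on
    $template =~ s{(?:\\|\$)([1-9])|\\(.)}
                  { defined $1 ? ($groups[$1 - 1] // '') : $2 }ges;
    return $template;
}

# Apply one pattern/template pair, as it might be stored in the rules file.
sub apply_rule {
    my ($text, $pattern, $template) = @_;
    # /e evaluates the replacement as Perl code, so the captures of the
    # current match can be passed on (assumption: no more than three groups).
    $text =~ s/$pattern/expand_template($template, $1, $2, $3)/ge;
    return $text;
}

print apply_rule($_, '^(\S)e$', '\1\[ee\]'), "\n" for qw(le me elle);
```

This avoids evaluating text from the file as Perl code, which a double /ee on the raw replacement string would do.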

How to regex extract something from a string

I have this line:
[1] "RPKM_AB123_Gm12878_control.extended.bed_28m_control_500 and RPKM_AB156_GM12878-50ng_test.extended.bed_28m_test_500"
and I want to extract AB123_Gm12878_control and AB156_GM12878-50ng from the string.
I have tried this and it isn't working yet.
if ($_ =~ /.*"RPKM_([\w.]+).extended.+\s\w+\sRPKM_([\w.]+).extended.+"/){
print $1,"\t",$2,"\t";
}
Can someone point out where I did it wrong? Thanks!
".*RPKM_([\w.]+).extended.+\s\w+\sRPKM_([\w.]+).extended.+"
                                        ^^^^^
This character class does not accept -, which the string you're matching against contains.
Try putting the hyphen in:
".*RPKM_([\w.]+)\.extended.+\s\w+\sRPKM_([\w.-]+)\.extended.+"
Also, it's good to escape the periods.
You can simplify regex and match all occurrences using /g
if ( my($m1, $m2) = /RPKM_([^.]+)/g ) {
print $m1,"\t",$m2,"\t";
}
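A runnable sketch that combines both suggestions (allow the hyphen, escape the period, match globally). The input line is taken from the question; the variable names are illustrative. Note that the second capture still includes the _test suffix, since nothing in the pattern distinguishes it from the rest of the name.

```perl
use strict;
use warnings;

my $line = '[1] "RPKM_AB123_Gm12878_control.extended.bed_28m_control_500 and RPKM_AB156_GM12878-50ng_test.extended.bed_28m_test_500"';

# A global match in list context returns every capture.
# [\w-]+ accepts the hyphen and cannot run past the first period.
my @names = $line =~ /RPKM_([\w-]+)\.extended/g;

print join("\t", @names), "\n";
```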

Matching numbers for substitution in Perl

I have this little script:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
s/(\d{2}).*\.txt$/$1.txt/;
s/^0+//;
print $_ . "\n";
}
The expected output would be
5.txt
12.txt
1.txt
But instead, I get
R3_05.txt
T3_12.txt
1.txt
The last one is fine, but I cannot fathom why the start of the string is kept in front of $1 in this case.
Try this pattern
foreach (@list) {
s/^.*?_?(?|0(\d)|(\d{2})).*\.txt$/$1.txt/;
print $_ . "\n";
}
Explanations:
I use the branch reset feature here (i.e. (?|...()...|...()...)), which lets the capturing groups in each alternative share the same number ($1 here). That way you avoid a second substitution to trim the leading zero from the capture.
To remove everything before the number, I use:
.*? # any characters, zero or more times
# (the ? makes the * quantifier lazy, so it matches as little as possible)
_? # an optional underscore
Note that you can ensure that you have only 2 digits adding a lookahead to check if there is not a digit that follows:
s/^.*?_?(?|0(\d)|(\d{2}))(?!\d).*\.txt$/$1.txt/;
(?!\d) means not followed by a digit.
The problem here is that your substitution regex does not cover the whole string, so only part of the string is substituted. But you are using a rather complex solution for a simple problem.
It seems that what you want is to read two digits from the string, and then add .txt to the end of it. So why not just do that?
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
for (@list) {
if (/(\d{2})/) {
$_ = "$1.txt";
}
}
To overcome the leading zero effect, you can force a conversion to a number by adding zero to it:
$_ = 0+$1 . ".txt";
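Putting the two pieces of this answer together, a minimal sketch (the @renamed array is an illustrative name, not part of the original script):

```perl
use strict;
use warnings;

my @list = ('R3_05_foo.txt', 'T3_12_foo_bar.txt', '01.txt');

my @renamed;
for my $name (@list) {
    # Take the first run of two digits; adding 0 converts "05" to the number 5.
    if ($name =~ /(\d{2})/) {
        push @renamed, (0 + $1) . '.txt';
    }
}

print "$_\n" for @renamed;
```

Assigning a whole new string sidesteps the original problem, because nothing has to be matched just so that it can be thrown away.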
I would modify your regular expression. Try using this code:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
s/.*(\d{2}).*\.txt$/$1.txt/;
s/^0+//;
print $_ . "\n";
}
The problem is that the pattern in your s/// matches what you think it does, but the substitution only replaces the text that was actually matched. To get rid of a prefix such as T3_, the pattern has to match that prefix too.
s/.*(\d{2}).*\.txt$/$1.txt/;

cant get the perl regex to work

My perl is getting rusty. It only prints "matched=" but $1 is blank!?!
EDIT 1: WHo the h#$! downvoted this? There are no wrong questions. If you dont like it, move on to next one!
$crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if ($crazy =~ m/([.\n\r]+)/gsi) {
print "matched=", $1, "\n";
} else {
print "not matched!\n";
}
EDIT 2: This is the code fragment with updated regex, works great!
$crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if ($crazy =~ m/([\s\S]+)/gsi) {
print "matched=", $1, "\n";
} else {
print "not matched!\n";
}
EDIT 3: Haha, i see perl police strikes yet again!!!
I don't know if this is your exact problem, but inside square brackets, '.' is just looking for a period. I didn't see a period in the input, so I wondered which you meant.
Aside from the period, the rest of the character class is looking for consecutive whitespace. And as you didn't use the multiline switch, you've got newlines being counted as whitespace (and any character), but no indication to scan beyond the first record separator. But because of the way that you print it out, it also gives some indication that you meant more than the literal period, as mentioned above.
Axeman is correct; your problem is that . in a character class doesn't do what you expect.
By default, . outside a character class (and not backslashed) matches any character but a newline. If you want to include newlines, you specify the /s flag (which you seem to already have) on your regex or put the . in a (?s:...) group:
my $crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if ($crazy =~ m/((?s:.+))/) {
print "matched=", $1, "\n";
} else {
print "not matched!\n";
}
. in a character class is a literal period; it does not match any character. What you really want is /(.+)/s. The /g flag says to match multiple times, but you are using the regex in scalar context, so it will only match the first item. The /i flag makes the regex case insensitive, but there are no characters with case in your regex. The /s flag makes . match newlines, and . always matches "\r", so instead of [.\n\r] you can just use a plain . in the pattern.
However, /(.+)/s will match any string with one or more characters, so you would be better off with
my $crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if (length $crazy) {
print "matched=$crazy\n";
} else {
print "not matched!\n";
}
It is possible you meant to do something like this:
#!/usr/bin/perl
use strict;
use warnings;
my $crazy = "abcd\r\nallo\nXYZ\n\n\nQQQ";
while ($crazy =~ /(.+)[\r\n]+/g) {
print "matched=$1\n";
}
But that would probably be better phrased:
#!/usr/bin/perl
use strict;
use warnings;
my $crazy = "abcd\r\nallo\nXYZ\n\n\nQQQ";
for my $part (split /[\r\n]+/, $crazy) {
print "matched=$part\n";
}
$1 contains only whitespace, which is why you don't see it when you print it like that. Print something after it, or put delimiters around it.
Example:
perl -E "qq'abcd\r\nallo\nXYZ\n\n\nQQQ'=~/([.\n\r]+)/gsi;say 'got(',length($1),qq') >$1<';"
got(2) >
<
Updated for your comments:
To match everything you can simply use /(.+)/s
[.] (dot inside a character class) does not mean "match any character", it just means match the literal . character. So in an input string without any dots,
m/([.\n\r]+)/gsi
will just match strings of \n and \r characters.
With the /s modifier, you are already asking the regex engine to include newlines with . (match any character), so you could just write
m/(.+)/gsi
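A quick way to see the difference for yourself, as a small sketch using the string from the question (the variable names $class and $any are illustrative):

```perl
use strict;
use warnings;

my $crazy = "abcd\r\nallo\nXYZ\n\n\nQQQ";

# Inside a character class the dot is literal, so this only matches
# runs of periods, line feeds and carriage returns.
my ($class) = $crazy =~ /([.\n\r]+)/;

# Outside a class, with /s, the dot matches every character.
my ($any) = $crazy =~ /(.+)/s;

printf "character class matched %d characters\n", length $class;
printf "dot with /s matched %d of %d characters\n", length $any, length $crazy;
```

The first match is just the "\r\n" after abcd, which is why the original print appeared to show an empty $1.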