How to regex extract something from a string - regex

I have this line:
[1] "RPKM_AB123_Gm12878_control.extended.bed_28m_control_500 and RPKM_AB156_GM12878-50ng_test.extended.bed_28m_test_500"
and I want to extract AB123_Gm12878_control and AB156_GM12878-50ng from the string.
I have tried this and it isn't working yet.
if ($_ =~ /.*"RPKM_([\w.]+).extended.+\s\w+\sRPKM_([\w.]+).extended.+"/){
print $1,"\t",$2,"\t";
}
Can someone point out where I did it wrong? Thanks!

".*RPKM_([\w.]+).extended.+\s\w+\sRPKM_([\w.]+).extended.+"
^^^^^
This character class does not accept -, which the string you're matching against contains.
Try putting the hyphen in:
".*RPKM_([\w.]+)\.extended.+\s\w+\sRPKM_([\w.-]+)\.extended.+"
Also, it's good to escape the periods.

You can simplify the regex and match all occurrences using /g:
if ( my($m1, $m2) = /RPKM_([^.]+)/g ) {
print $m1,"\t",$m2,"\t";
}
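For reference, here is a minimal runnable sketch of the /g approach above, using the sample line from the question. The variable names ($line, @ids) are only illustrative.

```perl
use strict;
use warnings;

# Sample line taken from the question.
my $line = '[1] "RPKM_AB123_Gm12878_control.extended.bed_28m_control_500 and RPKM_AB156_GM12878-50ng_test.extended.bed_28m_test_500"';

# In list context, /g returns every capture:
# the text after each "RPKM_" up to (not including) the first period.
my @ids = $line =~ /RPKM_([^.]+)/g;

print join("\t", @ids), "\n";
```

Note that the second capture comes out as AB156_GM12878-50ng_test, so if the trailing _test is really unwanted it would need an extra trimming step.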

Related

perl Regex replace for specific string length

I am using Perl to do some prototyping.
I need an expression to replace e with [ee] if the string is exactly 2 chars long and ends with "e".
le -> l [ee]
me -> m [ee]
elle -> elle : no change
I cannot test the length of the string, I need one expression to do the whole job.
I tried:
`s/(?=^.{0,2}\z).*e\z%/[ee]/g` but this is replacing the whole string
`s/^[c|d|j|l|m|n|s|t]e$/[ee]/g` same result (I listed the possible letters that could precede my "e")
`^(?<=[c|d|j|l|m|n|s|t])e$/[ee]/g` but I have no match, not sure I can use ^ on a positive look behind
EDIT
Guys you're amazing, hours of search on the web and here I get answers minutes after I posted.
I tried all your solutions and they are working perfectly directly in my script, i.e. this one:
my $test2="le";
$test2=~ s/^(\S)e$/\1\[ee\]/g;
print "test2:".$test2."\n";
-> test2:l[ee]
But I am loading these regex from a text file (using Perl for proto, the idea is to reuse it with any language implementing regex):
In the text file I store for example (I used % to split the line between match and replace):
^(\S)e$% \1\[ee\]
and then I parse and apply all regex like that:
my $test="le";
while (my $row = <$fh>) {
chomp $row;
if( $row =~ /%/){
my @reg = split /%/, $row;
#if no replacement, put empty string
if($#reg == 0){
push(@reg,"");
}
print "reg found, reg:".$reg[0].", replace:".$reg[1]."\n";
push @regs, [ @reg ];
}
}
print "orgine:".$test."\n";
for my $i (0 .. $#regs){
my $p=$regs[$i][0];
my $r=$regs[$i][1];
$test=~ s/$p/$r/g;
}
print "final:".$test."\n";
This technique is working well with my other regex, but not yet when I have a $1 or \1 in the replace... here is what I am obtaining:
final:\1\ee\
PS: you answered the initial question; should I open another post?
Something like s/(?i)^([a-z])e$/$1\[ee\]/ (the brackets are escaped so that Perl does not read $1[ee] as an array element).
Why aren't you using a capture group to do the replacement?
`s/^([c|d|j|l|m|n|s|t])e$/\1 [ee]/g`
If those are the characters you need and if it is indeed one word to a line with no whitespace before it or after it, then this will work.
Here's another option, depending on what you are looking for. It matches a two-character string consisting of one a-z character followed by one 'e', on its own line, with possible whitespace before or after. It replaces this with the single a-z character followed by ' [ee]'.
`s/^\s*([a-z])e\s*$/\1 [ee]/`
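A quick sketch to check that substitution against the three words from the question. It uses $1 rather than \1 in the replacement, since Perl warns about \1 there when warnings are enabled; the variable names are just for illustration.

```perl
use strict;
use warnings;

my @results;
for my $word ('le', 'me', 'elle') {
    # Work on a copy so the original word is kept for the printout.
    (my $copy = $word) =~ s/^\s*([a-z])e\s*$/$1 [ee]/;
    push @results, $copy;
    print "$word -> $copy\n";
}
```

Only the two-character words are rewritten; elle has four characters, so the anchored pattern does not match and the word is left alone.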
^(\S)e$
Try this. Replace by $1 [ee]. See demo.
https://regex101.com/r/hR7tH4/28
I'd do something like this
$word =~ s/^(\w{1})(e)$/$1$2e/;
You can use the following regex, which matches two characters, and then replace the match with $1\[$2$2\]:
^([a-zA-Z])([a-zA-Z])$
Demo :
$my_string =~ s/^([a-zA-Z])([a-zA-Z])$/$1\[$2$2\]/;
See demo https://regex101.com/r/iD9oN4/1

split one line regex in a multiline regexp in perl

I have trouble splitting my regex across multiple lines. I want my regex to match the following line:
* Code "l;k""dfsakd;.*[])_lkaDald"
So I created this regex, which works:
my $firstRegexpr = qr/^\s*\*\s*Code\s+\"(?<Code>((\")*[^\"]+)+)\"/x;
But now I want to split it across multiple lines like this (and want it to match the same thing!):
my $firstRegexpr = qr/^\s*\*\s*Code\s+\"
(?<Code>((\")*[^\"]+)+)\"/x;
I read about this, but I have trouble using it:
/
^\s*\*\s*Code\s+\"
(?<Code>((\")*[^\"]+)+)\"
/x
My last question is about preventing variable interpolation in a Perl regex:
my $firstRegexpr = qr/^\s*\*\s*Code\s+\"(?<Code>((\")*[^\"$]+)+)\"\$/x;
The characters $] are treated as a variable in the regex; how do I keep them from being interpreted as a variable?
Thanks a lot for your time, and please provide an explicit example.
What the x flag does is, very simply, say 'ignore whitespace in the pattern'.
So literal space characters in the pattern no longer match anything, and you have to use \s or similar instead.
So you can write:
if ( m/
^
\d+\s+
fish:\w+\s+
$
/x ) {
print "Matched\n";
}
You can test regular expressions with various websites but one example is https://regex101.com/
So to take your example: https://regex101.com/r/eG5jY8/1
But how is yours not working?
This matches:
my $string = q{* Code "l;k""dfsakd;.*[])_lkaDald"};
my $firstRegexpr = qr/^\s*
\*
\s*
Code\s+
\"
(?<Code>((\")*[^\"]+)+)
\"
/x;
print "Compiled_Regex: $firstRegexpr\n";
print "Matched\n" if ( $string =~ m/$firstRegexpr/ );
And as for not having $] treated as a variable, there are two answers. Either use \ to escape it, or use \Q\E.
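As a sketch of the backslash option only: the input string below is hypothetical (the question's line shape, shortened, with a trailing dollar sign added), and the capture is simplified to a single character class.

```perl
use strict;
use warnings;

# Hypothetical input: shortened from the question, with a trailing "$".
my $string = q{* Code "abc"$};

# Writing \$ inside the character class stops Perl from reading "$]" as a
# built-in variable; the final \$ matches a literal dollar sign.
my $re = qr/^\s*\*\s*Code\s+"(?<Code>[^"\$]+)"\$/x;

my $code = $string =~ $re ? $+{Code} : undef;
print "Code: $code\n" if defined $code;
```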

regular expression to match multiline text including delimiters

I want to get data between delimiters and include the delimiters in the match.
Example text:
>>> Possible error is caused by the segmentation fault
provided detection report:
<detection-report>
This is somthing that already in the report.
just an example report.
</detection-report>
---------------------------------------------
have a nice day
My current code is:
if($oopsmessage =~/(?<=<detection-report>)((.|\n|\r|\s)+)(?=<\/detection-report>)/) {
$this->{'detection_report'} = $1;
}
It retrieves the following:
This is something that already in the report.
just an example report.
How can I include both the detection-report delimiters?
You can simplify the regex to the following:
my ($report) = $oopsmessage =~ m{(<detection-report>.*?</detection-report>)}s;
Notice I used different delimiters to avoid the "leaning toothpick syndrome".
The s modifier makes . match newlines.
The parentheses in ($report) force list context, so the match returns all the matching groups. $1 is therefore assigned to $report.
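Here is a self-contained sketch of that list-context match, with the example text from the question inlined as a heredoc (lightly cleaned up; the heredoc is just a stand-in for however $oopsmessage is really populated).

```perl
use strict;
use warnings;

# Stand-in for the real message.
my $oopsmessage = <<'END';
>>> Possible error is caused by the segmentation fault
provided detection report:
<detection-report>
This is something that is already in the report.
just an example report.
</detection-report>
---------------------------------------------
have a nice day
END

# /s lets . cross newlines; .*? stops at the first closing tag.
my ($report) = $oopsmessage =~ m{(<detection-report>.*?</detection-report>)}s;

print "$report\n" if defined $report;
```

If there is no match, $report stays undefined, which is why the print is guarded.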
(<detection-report>(?:(?!<\/detection-report>).)*<\/detection-report>)
Try this. Put flags g and s. See demo.
http://regex101.com/r/xT7yD8/18
Just do:
if ($oopsmessage =~ m#(<detection-report>[\s\S]+?</detection-report>)#) {
$this->{'detection_report'} = $1;
}
or, if you're reading a file line by line:
while(<$fh>) {
if (/<detection-report>/ .. /<\/detection-report>/) {
$this->{'detection_report'} .= $_;
}
}
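To try the line-by-line version without a real file, here is a sketch that reads a shortened copy of the example through an in-memory filehandle. The plain $report variable replaces $this->{'detection_report'} purely for illustration.

```perl
use strict;
use warnings;

my $text = <<'END';
provided detection report:
<detection-report>
just an example report.
</detection-report>
have a nice day
END

# Read the string through an in-memory filehandle, standing in for a real file.
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

my $report = '';
while (my $line = <$fh>) {
    # The range operator is true from the opening-tag line
    # through the closing-tag line, inclusive.
    if ($line =~ /<detection-report>/ .. $line =~ m{</detection-report>}) {
        $report .= $line;
    }
}
close $fh;

print $report;
```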
Use the below regex to get the data with delimiters.
(<detection-report>[\S\s]+?<\/detection-report>)
Group index 1 contains the string you want.
[\S\s]+? matches one or more characters of any kind, whitespace or not (so newlines too), as few as possible.
/(<detection-report>.*?<\/detection-report>)/gs
You can simplify your regex to the following:
if($oopsmessage =~ m#(<detection-report>.+</detection-report>)#s) {
$this->{'detection_report'} = $1;
}
say $this->{'detection_report'};
Using the s modifier allows a multiline match in which . can match a newline. Using # instead of / as the delimiter means no faffing around with escaping slashes.
Output:
<detection-report>
This is somthing that already in the report.
just an example report.
</detection-report>

Matching numbers for substitution in Perl

I have this little script:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
s/(\d{2}).*\.txt$/$1.txt/;
s/^0+//;
print $_ . "\n";
}
The expected output would be
5.txt
12.txt
1.txt
But instead, I get
R3_05.txt
T3_12.txt
1.txt
The last one is fine, but I cannot fathom why the regex leaves the start of the string in front of $1 in this case.
Try this pattern
foreach (@list) {
s/^.*?_?(?|0(\d)|(\d{2})).*\.txt$/$1.txt/;
print $_ . "\n";
}
Explanations:
Here I use the branch reset feature, i.e. (?|...()...|...()...), which lets capture groups in different alternatives share the same number ($1 here). That way you avoid a second substitution to trim the leading zero from the capture.
To remove everything before the number, I use:
.*? # all characters zero or more times
# ( ? -> makes the * quantifier lazy, to match as little as possible)
_? # an optional underscore
Note that you can ensure that you have only 2 digits by adding a lookahead to check that no digit follows:
s/^.*?_?(?|0(\d)|(\d{2}))(?!\d).*\.txt$/$1.txt/;
(?!\d) means not followed by a digit.
The problem here is that your substitution regex does not cover the whole string, so only part of the string is substituted. But you are using a rather complex solution for a simple problem.
It seems that what you want is to read two digits from the string, and then add .txt to the end of it. So why not just do that?
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
for (@list) {
if (/(\d{2})/) {
$_ = "$1.txt";
}
}
To overcome the leading zero effect, you can force a conversion to a number by adding zero to it:
$_ = 0+$1 . ".txt";
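Putting the two pieces together, a runnable sketch might look like this. It collects the new names in a separate @renamed array instead of modifying @list in place, which is just a choice made for the example.

```perl
use strict;
use warnings;

my @list = ('R3_05_foo.txt', 'T3_12_foo_bar.txt', '01.txt');

my @renamed;
for my $name (@list) {
    # Grab the first run of two consecutive digits.
    if ($name =~ /(\d{2})/) {
        # Adding zero turns "05" into the number 5, dropping the leading zero.
        push @renamed, (0 + $1) . '.txt';
    }
}

print "$_\n" for @renamed;
```

For the three sample names this prints 5.txt, 12.txt and 1.txt.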
I would modify your regular expression. Try using this code:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
s/.*(\d{2}).*\.txt$/$1.txt/;
s/^0+//;
print $_ . "\n";
}
The problem is that the first part of your s/// matches what you think it does, but the second part isn't replacing what you think it should. s/// only replaces what was matched. So to replace something like T3_ you have to match that too.
s/.*(\d{2}).*\.txt$/$1.txt/;

How to match a string which starts with either a new line or after a comma?

My string is $tables="newdb1.table1:100,db2.table2:90,db1.table1:90". My search string is db1.table1 and my aim is to extract the value after : (i.e 90 in this case).
I am using:
if ($tables =~ /db1.table1:(\d+)/) { print $1; }
but the problem is it is matching newdb1.table1:100 and printing 100.
Can you please give me a regular expression to match a string which either starts at the beginning of the line or has a comma before it?
Use word boundaries:
if ($tables =~ /\bdb1.table1:(\d+)/) { print $1; }
here __^^
if ($tables =~ /(^|,)db1.table1:(\d+)/) { print $2; }
To answer your exact question, that is to match just after the start of the string or a comma, you want a positive look-behind assertion. You may be tempted to write a pattern of
/(?<=^|,)db1\.table1:(\d+)/
but that may fail with an error of
Variable length lookbehind not implemented in regex m/(?<=^|,)db1\.table1:(\d+)/ ...
So hold the regex engine's hand a bit by moving the start-of-string anchor out of the look-behind, so that only the fixed-length comma stays inside it.
/(?:^|(?<=,))db1\.table1:(\d+)/
While we are locking it down, let's be sure to bracket the end with a look-ahead assertion.
while ($tables =~ /(?:^|(?<=,))db1\.table1:(\d+)(?=,|$)/g) {
print "[$1]\n";
}
Output:
[90]
You could also use \b for a regex word boundary, which has the same output.
while ($tables =~ /\bdb1\.table1:(\d+)(?=,|$)/g) {
print "[$1]\n";
}
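As a sketch of the "start of string or right after a comma" idea, one option is to keep ^ outside the look-behind so that only the comma sits inside it. This is an assumption about one workable pattern, not the only way to write it. The second test string is hypothetical and is there to cover the case where the target entry comes first.

```perl
use strict;
use warnings;

# Match only at the start of the string or right after a comma,
# and require a comma or the end of the string after the digits.
my $re = qr/(?:^|(?<=,))db1\.table1:(\d+)(?=,|$)/;

my @cases = (
    'newdb1.table1:100,db2.table2:90,db1.table1:90',  # from the question
    'db1.table1:42,newdb1.table1:100',                # hypothetical: target comes first
);

my @found;
for my $tables (@cases) {
    push @found, $tables =~ $re ? $1 : 'none';
}

print "[$_]\n" for @found;
```

Neither case picks up the newdb1.table1 entry, because the n before db1 is neither the start of the string nor a comma.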
For the most natural solution, follow the rule of thumb proposed by Randal Schwartz, author of Learning Perl. Use capturing when you know what you want to keep and split when you know what you want to throw away. In your case you have a mixture: you want to discard the comma separators, and you want to keep the digits after the colon for a certain table. Write that as
for (split /\s*,\s*/, $tables) {
if (my($value) = /^db1\.table1:(\d+)$/) {
print "[$value]\n";
}
}
Output:
[90]