Cannot capture a substring that ends with a tab - regex

I have two types of strings:
1: ANN=abcdefgh;blabla
2 wrong version: ANN=abcdefgh\tyxz\tyxz
2 actual version: ANN=abcdefgh
Now I want to extract the abcdefgh with a regex. The part to extract always starts right after "ANN=", but it ends at either a semicolon (;) or the FIRST occurrence of a tab.
How does the regex for this look? I tried:
(my @splitUpAnn) = $tabValues[7] =~ /ANN=(.*)[;\t]/;
But I always get just the version 1 with the semicolon back, but it does not work for version two...
EDIT: To be clear. I did not get back ANYTHING for the version two. The problem is NOT that the last tab is used!
EDIT2: Oops, the input data was different from what I expected. The value is followed either by a semicolon or by NOTHING (see "2 actual version"). Sorry about that! So what would the regex be then?

Use .*? instead of .*.
.* is greedy, so it matches up to the last occurrence of TAB.

Just use the non-greedy quantifier *? that matches the least it can:
for my $string ('ANN=abcdefgh;blabla', "ANN=abcdefgh\tyxz\tyxz") {
(my @splitUpAnn) = $string =~ /ANN=(.*?)[;\t]/;
print "@splitUpAnn\n";
}
If you want to get the string up to the first semicolon if present, or everything otherwise, just use
$string =~ /ANN=([^;]*)/
i.e. capture everything that's not a semicolon.
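To make the difference between the two quantifiers concrete, here is a small sketch. It uses Python's re module instead of Perl (the three patterns themselves are the same as above), and the variable names are only illustrative.

```python
import re

# Sample inputs from the question (with a real tab character).
tabbed = "ANN=abcdefgh\tyxz\tyxz"
with_semicolon = "ANN=abcdefgh;blabla"

# Greedy: .* grabs as much as possible, then backtracks to the LAST tab.
greedy = re.search(r"ANN=(.*)[;\t]", tabbed).group(1)

# Lazy: .*? grabs as little as possible, so it stops at the FIRST tab.
lazy = re.search(r"ANN=(.*?)[;\t]", tabbed).group(1)

# Negated class: everything that is not a semicolon.
up_to_semicolon = re.search(r"ANN=([^;]*)", with_semicolon).group(1)

print(repr(greedy))           # 'abcdefgh\tyxz'
print(repr(lazy))             # 'abcdefgh'
print(repr(up_to_semicolon))  # 'abcdefgh'
```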

/ANN=(.*?)[;\t]/
Make your regex non greedy.
.* is greedy and will match up to the last ; or \t available.

my ($ann) = $tabValues[7] =~ /ANN=([^;\t]*)/;
The leading ^ negates the character class, so [^;\t] matches any character except ; and tab.
There are multiple suggestions to make your .* non-greedy, but using non-greediness as anything but an optimization is very fragile and error prone.
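A quick comparison may help show why the negated class is the more robust choice for the "2 actual version" input, where nothing follows the value. This is a sketch in Python rather than Perl; the helper name capture and the sample list are assumptions made for the illustration.

```python
import re

# The three shapes of input mentioned in the question.
samples = ["ANN=abcdefgh;blabla", "ANN=abcdefgh\tyxz\tyxz", "ANN=abcdefgh"]

NEGATED = re.compile(r"ANN=([^;\t]*)")   # stop before ';' or tab, or at the end
LAZY = re.compile(r"ANN=(.*?)[;\t]")     # REQUIRES a ';' or tab to follow

def capture(pattern, text):
    """Return group 1 of the first match, or None if there is no match."""
    m = pattern.search(text)
    return m.group(1) if m else None

negated_results = [capture(NEGATED, s) for s in samples]
lazy_results = [capture(LAZY, s) for s in samples]

print(negated_results)  # ['abcdefgh', 'abcdefgh', 'abcdefgh']
print(lazy_results)     # ['abcdefgh', 'abcdefgh', None]
```

The lazy pattern fails on the last sample because it still demands a terminator, while the negated class simply runs to the end of the string.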

I've tested and I got a match
if ( "ANN=abcdefgh;blabla" =~ /(ANN=(.*)[;\t])/ ) {
print $1."\n" ;}
if ( "ANN=abcdefgh\tyxz\tyxz" =~ /(ANN=(.*)[;\t])/ ) {
print $1."\n" ;}
result is:
ANN=abcdefgh;
ANN=abcdefgh yxz
So:
your regex is indeed greedy, as described in the previous answers
perhaps the problem lies in the way you put the values into the array, but the regexp itself does match

Related

Remove string between 2 pattern using gawk regex only

Input:
secNm:ATA,_class:com.dddao.domaffin.summaggrfy.GddenericMohsg},{ttlRec:0,ttlVal:{:0}secNm:B2B,_class:com.xyz.dakjdain.sfffummary.GenericMo73hs}extra
secNm:ATA,_class:com.dddao.domaffin.summaggrfy.GddenericMohsg},{ttlRec:0,ttlVal:{:0}secNm:B2B,_class:com.xyz.dakjdain.sfffummary.GenericMo73hs
In both of the strings above I want to remove:
For String 1: the parts that start at ",_class" and end at the 1st occurrence of "}"
For String 2: the parts that start at ",_class" and run to the end, if the 1st condition fails.
Output:
secNm:ATA,{ttlRec:0,ttlVal:{:0}secNm:B2Bextra
secNm:ATA,{ttlRec:0,ttlVal:{:0}secNm:B2B
This kind of pattern can occur any number of times in the string.
I simply want to remove those parts.
I have written the call gsub(/,_class(.*?)\}/,"",$0)
I want an answer that uses only gawk's regex functions, no other method.
The call above has a problem: it removes a big part of the string.
Please help me correct my regex.
Thanks in advance.
You may use a [^}] negated bracket expression to match any char but } since lazy quantifiers are not supported.
Besides, you do not even need a grouping construct here as you are not referring to the captured value here. You may remove ( and ) safely.
Use
/,_class[^}]*}/
Basically, this should be understood as:
,_class - match ,_class substring
[^}]* - 0 or more chars other than }
} - up to and including }.
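The breakdown above can be checked with a short sketch. It is written in Python rather than gawk, so re.sub stands in for gsub. The second pattern, with an optional closing brace, is an assumption added here to cover the "till the end" case of String 2; it is not part of the answer above.

```python
import re

line1 = ("secNm:ATA,_class:com.dddao.domaffin.summaggrfy.GddenericMohsg},"
         "{ttlRec:0,ttlVal:{:0}secNm:B2B,_class:com.xyz.dakjdain.sfffummary.GenericMo73hs}extra")
line2 = ("secNm:ATA,_class:com.dddao.domaffin.summaggrfy.GddenericMohsg},"
         "{ttlRec:0,ttlVal:{:0}secNm:B2B,_class:com.xyz.dakjdain.sfffummary.GenericMo73hs")

# Pattern from the answer: ",_class", then non-"}" characters, then "}".
STRICT = re.compile(r",_class[^}]*\}")
# Assumed variant: the closing "}" is optional, so a trailing ",_class..."
# with no brace after it is removed up to the end of the string.
OPTIONAL_BRACE = re.compile(r",_class[^}]*\}?")

strict1 = STRICT.sub("", line1)
strict2 = STRICT.sub("", line2)
optional1 = OPTIONAL_BRACE.sub("", line1)
optional2 = OPTIONAL_BRACE.sub("", line2)

print(strict1)    # secNm:ATA,{ttlRec:0,ttlVal:{:0}secNm:B2Bextra
print(optional2)  # secNm:ATA,{ttlRec:0,ttlVal:{:0}secNm:B2B
```

With the strict pattern the last ",_class..." part of String 2 is left in place, because no "}" follows it.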

Regex to match absence of substring

I am looking for a regex which only matches when certain sub-strings are not present. In particular - if a line of code does not assign or return the return value from a method.
Examples:
this.execute(); // should match
var x = this.execute(); // no match
return this.execute(); // no match
I was trying to use the following regex
^(?!.*=|return).*execute\(\).*
This works with regex testers etc. - but I am getting "invalid perl operator" exception when using in practice.
Thanks..
Since you want to exclude only assignment or return it's easily negated
while (<DATA>) { print if not /(?:=|return)\s+this\.execute/ }
__DATA__
this.execute();
var x = this.execute();
return this.execute();
This prints only the line this.execute();.
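The same "match the unwanted case, then negate the test" idea, sketched in Python instead of Perl. The pattern is the one from the Perl line above; the names ASSIGNED_OR_RETURNED, lines and bare_calls are only illustrative.

```python
import re

# Matches a call whose result is assigned or returned.
ASSIGNED_OR_RETURNED = re.compile(r"(?:=|return)\s+this\.execute")

lines = [
    "this.execute();",
    "var x = this.execute();",
    "return this.execute();",
]

# Keep only the lines where the "unwanted" pattern does NOT match.
bare_calls = [line for line in lines if not ASSIGNED_OR_RETURNED.search(line)]

print(bare_calls)  # ['this.execute();']
```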
With lookaround assertions, the negative lookahead that you propose does work
if (/^(?!.*=|return)\s*this\.execute/x) { print "$_\n" }
As for the negative lookbehind, there is one problem. First, here's what works
if ( /(?<! =\s ) this\.execute/x ) { print "$_\n" }
if ( /(?<! return \s ) this\.execute/x ) { print "$_\n" }
This excludes = or return, with one space. The thing is, we can't put \s+ there nor can we do alternation -- Perl can't do it for this particular assertion, see perlretut. We get
Variable length lookbehind not implemented in regex m/(?<!=\s+)this\.execute/ at
We can add varying space \s+ outside of the assertion, with this..., and then combine multiple conditions to provide for a possibility that there is no space between = and this....
However, there's no reason for this if you can use a regular negated match.
The reported error can only be about basic syntax. It is about the exact code you run, not the regex.
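For comparison, Python's standard re module has a similar restriction on lookbehind, so the point can be sketched there too. Note the assumption: this shows Python's behaviour, not Perl's exact error message. Two fixed-width lookbehinds can be stacked to get the effect of an alternation.

```python
import re

# Two fixed-width negative lookbehinds placed back to back.
FIXED_WIDTH = re.compile(r"(?<!=\s)(?<!return\s)this\.execute")

lines = [
    "this.execute();",
    "var x = this.execute();",
    "return this.execute();",
]
matches = [line for line in lines if FIXED_WIDTH.search(line)]

# A variable-width lookbehind such as (?<!=\s+) is rejected at compile time.
try:
    re.compile(r"(?<!=\s+)this\.execute")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(matches)            # ['this.execute();']
print(variable_width_ok)  # False
```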
Not sure I understand the question, but you might try this one: ^this\.execute\(\);
With situations like these, it's best to find the "lowest common denominator" in the matches you want to distinguish from similar looking strings. In this case, the var x can be ignored: your requirements are satisfied by saying "anything before the method call is ok; the method call alone is not." That statement is probably a bit too tight though, so let's change it to "anything other than whitespace before the method call is ok, otherwise flag the call". Which means:
my $method_call = qr/ ( this \. \w+ ) \( /x;
while (<$fh>) {
if (/ ^ \s* $method_call /x) {
warn "Found method call on line $.: $1\n"
}
}
I'm presuming $fh is a filehandle to the source code file. I've also made some presumptions, which you may need to tweak, about how you want to define a method call, i.e. the opening bracket for parameters is compulsory. Using extended-mode regexes (the /x flag) allows whitespace in the regex for easier reading. Also, using the qr// quote-like operator allows referring to a regex by name inside another to make things clearer.
If on the other hand, you want to insist on the presence of var x or return before giving the ok, we can reverse the search - ie explicitly look for the "ok" situations and flag any other calls:
my $method_call = qr/ ( this \. \w+ ) \( /x;
while (<$fh>) {
next if / ^ \s* return \s+ $method_call /x; # return OK
next if / ^ \s* var \s+ \w+ \s* = \s* $method_call /x; # var OK
warn "Found method call on line $.: $1\n" if /$method_call/ ;
}
Both of these are a little verbose but show more clearly what you're trying to do.
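The first variant (flag a call when only whitespace precedes it) could look roughly like this in Python. This is a sketch, not a translation of the exact Perl above: the function name find_bare_calls and the list-of-lines input are assumptions, and a string fragment plays the role of the qr// sub-pattern.

```python
import re

# Reusable fragment: "this.<name>(" with the call name captured.
METHOD_CALL = r"(this\.\w+)\("
# A "bare" call: nothing but optional whitespace before the method call.
BARE_CALL = re.compile(r"^\s*" + METHOD_CALL)

def find_bare_calls(lines):
    """Return (line_number, call) pairs for calls whose result is unused."""
    found = []
    for lineno, line in enumerate(lines, start=1):
        m = BARE_CALL.match(line)
        if m:
            found.append((lineno, m.group(1)))
    return found

source = [
    "this.execute();",
    "var x = this.execute();",
    "return this.execute();",
    "    this.save(1);",
]
print(find_bare_calls(source))  # [(1, 'this.execute'), (4, 'this.save')]
```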
I don't think we have enough information here. I say this because the following works for me in the shell
~$ echo "execute()"| perl -ne 'print if /^(?!.*=|return).*execute\(\).*/'
execute()
~$ echo "return execute()"| perl -ne 'print if /^(?!.*=|return).*execute\(\).*/'
~$
In the above code, I am running a one liner in a shell that pipes a string into a perl program. The perl program will print the string if it matches the regex. I get no errors from your regex.
It's possible that the error is due to your version of perl or something else entirely may be happening.
I am using perl v5.22.2
I mean, the simple answer is, just use the ! operator on your test, but here's the conversion in case you were wondering:
/expression/ => /^(?!.*expression)/ (either use DOTALL or [^] in JavaScript)
/^expression/ => /^(?!expression)/
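The two conversions can be sanity-checked with a sketch like the following, in Python. The helper names are made up for the example, re.DOTALL plays the role of the DOTALL option mentioned above, and the expression is wrapped in a non-capturing group so that an alternation stays inside the lookahead.

```python
import re

def contains(expr, text):
    """Plain positive test: does expr match anywhere in text?"""
    return re.search(expr, text, re.DOTALL) is not None

def lacks(expr, text):
    """/expression/ rewritten as /^(?!.*expression)/."""
    return re.search(r"^(?!.*(?:" + expr + r"))", text, re.DOTALL) is not None

def does_not_start_with(expr, text):
    """/^expression/ rewritten as /^(?!expression)/."""
    return re.search(r"^(?!(?:" + expr + r"))", text) is not None

samples = ["this.execute();", "var x = this.execute();", "line one\nreturn x"]
for s in samples:
    # The lookahead form should always be the exact negation of the plain test.
    print(repr(s), contains(r"=|return", s), lacks(r"=|return", s))
```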

regular expression to match strings with decimals

I'm trying to create a regex which will do the following:
Name description: "QUARTERLY PATCH FOR XAQE (JUL 2013 - 11.2.0.3.20) : (125546467)"
Val version : 11.2.0.3.4
In order to output:
"Name, 11.2.0.3.20"
"Val, 11.2.0.3.4"
I have created the following regex: /^([\w]+).*([\d\.\d]+).*/, but it is only matching the last number in the 2nd group, i.e. in 11.2.0.3.4 it will only match 4. Could anyone help?
Also, there could be more than the two lines given above, so it needs to account for arbitrary lines where the version number could be anywhere in the line.
You can use a one-liner for this as well:
perl -lne '/(\w+).*?(\d+(\.\d+)+)/; print "$1, $2"' <filename>
__END__
Name, 11.2.0.3.20
Val, 11.2.0.3.4
If you are only planning for the output and not doing any processing over the captured groups, then this will do:
$str =~ s/([\n\r]|^)(Name|Val).*?(\d+(\.\d+)+).*/$1"$2, $3"/g;
Your problem is that .* is greedy and will consume as much as it can whilst the pattern still matches. One solution is to make it lazy: .*?
Also, [\d\.\d]+ is a character class that matches one or more characters that are each either a digit or a dot, so it's the same as [\d.]+, which isn't what you want since it would match "2013" in the first line. \d+(\.\d+)+ is more suitable.
After those 2 changes you have:
^([\w]+).*?(\d+(\.\d+)+).*
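Here is a small check of both the original and the corrected pattern, sketched in Python. One deliberate difference from the pattern above: the inner group is written as non-capturing, (?:\.\d+), so that only the name and the version are captured. The variable names are illustrative.

```python
import re

lines = [
    'Name description: "QUARTERLY PATCH FOR XAQE (JUL 2013 - 11.2.0.3.20) : (125546467)"',
    "Val version : 11.2.0.3.4",
]

# Original attempt: greedy .* leaves only the last digit for group 2.
ORIGINAL = re.compile(r"^([\w]+).*([\d\.\d]+).*")
# Corrected: lazy .*? plus "digits followed by one or more .digits groups".
FIXED = re.compile(r"^(\w+).*?(\d+(?:\.\d+)+)")

original_version = ORIGINAL.search(lines[1]).group(2)
pairs = [FIXED.search(line).groups() for line in lines]

print(original_version)  # 4
for name, version in pairs:
    print(f"{name}, {version}")
# Name, 11.2.0.3.20
# Val, 11.2.0.3.4
```

"2013" is skipped because it is not followed by a dot and more digits.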

Regex: How do I match something that may OR may not be between [ ]

I am parsing a log using Perl and I am stumped with as to how I can parse something like this:
from=[ihatethisregex#hotmail.com]
from=ihatethisregex#hotmail.com
What I need is ihatethisregex#hotmail.com and I need to capture this in a named capture group called "email".
I tried the following:
(?<email>(?:\[[^\]]+\])|(?:\S+))
But this captures the square brackets when it parses the first line. I don't want the square brackets. Was wondering if I could do something like this:
(?:\[(?<email>[^\]]+)\])|(?<email>\S+)
and when I evaluate $+{email}, it will just take whichever one that was matched. I also tried the following:
(?:\[?(?<email>(?:[^\]]+\])|(?:\S+)))
But this gave strange results when the email was wrapped in a pair of square brackets.
Any help is appreciated.
/(\[)?your-regexp-here(?(1)\]|)/
( ) capture group #1
\[ opening bracket
? optionally
your-regexp-here your regexp
(?( ) ) conditional match:
1 if capture group #1 evaluated,
\] closing bracket
| else nothing
Note that this does not work in all languages, since conditional match is not a part of a standard regular expression, but rather an extension. Works in Perl, though.
EDIT: misplaced question mark.
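Python's re module happens to support the same conditional construct, so the idea can be tried out there. In this sketch the placeholder your-regexp-here is filled in with an assumed email pattern ("anything that is not a bracket or whitespace"), and Python spells the named group (?P<email>...) rather than Perl's (?<email>...).

```python
import re

# Group 1 optionally matches "[". The conditional (?(1)\]) then demands
# a closing "]" only if group 1 actually took part in the match.
FROM = re.compile(r"from=(\[)?(?P<email>[^\[\]\s]+)(?(1)\])")

def extract_email(line):
    """Return the email with or without surrounding brackets, else None."""
    m = FROM.search(line)
    return m.group("email") if m else None

print(extract_email("from=[ihatethisregex#hotmail.com]"))  # ihatethisregex#hotmail.com
print(extract_email("from=ihatethisregex#hotmail.com"))    # ihatethisregex#hotmail.com
print(extract_email("from=[ihatethisregex#hotmail.com"))   # None
```

The last call shows the benefit over two independent optional brackets: an opening bracket without its closing partner is rejected.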
I tend to do these kinds of things in two steps, just because it's clearer:
my ($val)= /\w+=(.*)/ ;
$val =~ s/\[(.*)\]/$1/ ;
This trims off the [] separately.
Perhaps the following will be helpful:
use strict;
use warnings;
while (<DATA>) {
/from\s*=\s*\[?(?<email>[^\]\s]+)\]?/;
print $+{email}, "\n";
}
__DATA__
from=[ihatethisregex#hotmail.com]
from=ihatethisregex#hotmail.com
Output:
ihatethisregex#hotmail.com
ihatethisregex#hotmail.com

Little vim regex

I have a bunch of strings that look like this: '../DisplayPhotod6f6.jpg?t=before&tn=1&id=130', and I'd like to take out everything after the question mark, to look like '../DisplayPhotod6f6.jpg'.
s/\(.\.\.\/DisplayPhoto.\{4,}\.jpg\)*'/\1'/g
This regex is capturing some but not all occurences, can you see why?
\.\{4,} is trying to match 4 or more . characters. What it looks like you wanted is "match 4 or more of any character" (.\{4,}) but "match 4 or more non-. characters" ([^.]\{4,}) might be more accurate. You'll also need to change the lone * at the end of the pattern to .* since the * is currently applying to the entire \(\) group.
I think the easiest way to go for this is:
s/?.*$/'/g
This says: delete everything after the question mark and replace it with a single quote.
I would use macros, which are sometimes simpler than a regexp (and interactive):
qa
/DisplayPhoto<Enter>
f?dt'
n
q
And then run @a a few times, or 20000@a to go through all lines.
The following regexp: /(\.\./DisplayPhoto.*\.jpg)/gi
tested against following examples:
../DisplayPhotocef3.jpg?t=before&tn=1&id=54
../DisplayPhotod6f6.jpg?t=before&tn=1&id=130
will result:
../DisplayPhotocef3.jpg
../DisplayPhotod6f6.jpg
%s/\('\.\.\/DisplayPhoto\w\{4,}\.jpg\).*'/\1'/g
Some notes:
% will cause the swap to work on all lines.
\w instead of '.', in case there are some malformed file names.
Replace '.' at the start of your matching regex with ' which is exactly what it should be matching.
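For anyone who wants to test the substitution outside vim, here is a rough Python equivalent. Python's re uses unescaped ( ) and { } where vim's default mode needs backslashes. One assumption differs from the vim command: [^']* is used instead of .* so that several quoted paths on one line are handled separately.

```python
import re

# Quoted path up to ".jpg", then everything up to the closing quote.
PHOTO = re.compile(r"('\.\./DisplayPhoto\w{4,}\.jpg)[^']*'")

def strip_query(text):
    """Drop the ?query part of every quoted DisplayPhoto path in text."""
    return PHOTO.sub(r"\1'", text)

one = "'../DisplayPhotod6f6.jpg?t=before&tn=1&id=130'"
two = ("'../DisplayPhotod6f6.jpg?t=before&tn=1&id=130' and "
       "'../DisplayPhotocef3.jpg?t=before&tn=1&id=54'")

print(strip_query(one))  # '../DisplayPhotod6f6.jpg'
print(strip_query(two))  # '../DisplayPhotod6f6.jpg' and '../DisplayPhotocef3.jpg'
```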