RegEx: don't capture match, but capture after match

RegEx: don't capture match, but capture after match - regex

There are a thousand regular expression questions on SO, so I apologize if this is already covered. I did look first.
I have string:
Name Subname 11X22 88X620 AB33(20) YA5619 77,66
I need to capture this string: YA5619
What I am doing is just finding AB33(20) and after this I am capturing until first white space. But AB33(20) can be AB-33(20) or AB33(-20) or AB33(-1).
My preg_match regex is: (?<=\bAB\d{2}\(\d{2}\)\s).+?(?=\s)
Why I am getting error when I change from \d{2} to \d+?
For final result I was thinking this regix will work but no:
(?<=\bAB-?\d+\(-?\d+\)\s).+?(?=\s)
Any ideas what I am doing wrong?

With most regex flavors, lookbehind needs to evaluate to a fixed-length sequence, so you can't use variable quantifiers like * or + or even {1,2}.
Instead of using lookaround, you can simply match your marker pattern and then forget it with \K.
AB-?\d+(?:\(-?\d+\))? \K[^ ]+
demo: https://regex101.com/r/8XXngH/1

It depends on the language. If it is in .NET for example, it matches due to the various length in the lookbehind.
Another solution might be to use a character class and add the character you would allow to match. Then match a whitespace character and capture in a group matching \S+ which matches 1+ times not a whitespace character.
\bAB[()\d-]+\s\K\S+
Explanation
\bAB Match literally prepended with word boundary to prevent AB being part of a larger match.
[()\d-]+ Match 1+ times any of the listed character in the character class
\s Match a whitespace char (or \s+ to match 1 or more)
\K Reset the starting point of the reported match( Forget what was matched)
\S+ Match in a group 1+ times not a whitespace character
Regex demo | Php demo

Related

Regex: Match pattern unless preceded by pattern containing element from the matching character class

I am having a hard time coming up with a regex to match a specific case:
This can be matched:
any-dashed-strings
this-can-be-matched-even-though-its-big
This cannot be matched:
strings starting with elem- or asdf- or a single -
elem-this-cannot-be-matched
asdf-this-cannot-be-matched
-
So far what I came up with is:
/\b(?!elem-|asdf-)([\w\-]+)\b/
But I keep matching a single - and the whole -this-cannot-be-matched suffix. I cannot figure it out how to not only ignore a character present inside the matching character class conditionally, and not matching anything else if a suffix is found
I am currently working with the Oniguruma engine (Ruby 1.9+/PHP multi-byte string module).
If possible, please elaborate on the solution. Thanks a lot!

If a lookbehind is supported, you can assert a whitespace boundary to the left, and make the alternation for both words without the hyphen optional.
(?<!\S)(?!(?:elem|asdf)?-)[\w-]+\b
Explanation
(?<!\S) Assert a whitespace boundary to the left
(?! Negative lookahead, assert the directly to the right is not
(?:elem|asdf)?- Optionally match elem or asdf followed by -
) Close the lookahead
[\w-]+ Match 1+ word chars or -
\b A word boundary
See a regex demo.
Or a version with a capture group and without a lookbehind:
(?:\s|^)(?!(?:elem|asdf)?-)([\w-]+)\b
See another regex demo.

What is the proper regex for capturing everything after "String" and between two delimeters ('=' and and non alphanumeric))

Details={
AwsEc2SecurityGroup={GroupName=m.com-rds, OwnerId=123, VpcId=vpc-123,
IpPermissions=[{FromPort=3306, ToPort=3306, IpProtocol=tcp, IpRanges=[{CidrIp=1.1.1.1/32}, {CidrIp=2.2.2.2/32}, {CidrIp=0.0.0.0/0}, {CidrIp=3.3.3.3/32}],
UserIdGroupPairs=[{UserId=123, GroupId=sg-123abc}]}], IpPermissionsEgress=[{IpProtocol=-1, IpRanges=[{CidrIp=0.0.0.0/0}]}], GroupId=sg-123abc}},
Region=us-east-1, Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc}]
}
I want to capture exactly arn:aws:ec2:us-east-1:123:security-group/sg-123abc in this example. Generically, I want to capture the value of Id regardless of placement. My current solution is /Details={.*Id=(.*\w)/, but this only works if it's the last object in the data. How can I take into account the following potential scenario:
Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc, Thing=123abc}]

You have a pattern with 2 times .* which will first match till the end of the line/string (depending on if the dot matches a newline) and it will backtrack to match the last occurrence where this part of the pattern Id=(.*\w) can match.
If you want to use a capture group, you can make the format and the allowed characters a bit more specific:
\bId=(\w+(?:[:\/-]\w+)+)
The pattern in parts
\b A word boundary to prevent a partial word match
Id= Match literally
( Capture group 1
\w+ Match 1+ word chars
(?:[:\/-]\w+)+ Repeat 1+ times either : / - and 1+ word chars
) Close group 1
Regex demo
Or if you know that it starts with Id=arn:
\bId=(arn:[\w:\/-]+)
Regex demo
Note that you don't have to escape the \/ only when the delimiters of the regex are forward slashes, but there is no language tagged.

You can use look-behind to check that there is the Id= prefix, and then match anything that is not a space, comma or closing brace:
(?<=\bId=)[^,}\s]*

How to consume lookaround in regex?

I want to match
abc_def_ghi,
abc_abc_ghi,
abc_a2a_ghi,
abc_999_ghi
but not abc_xxx_ghi (with xxx in center).
I came up to manually consuming look ahead (abc_(?!xxx)..._ghi), but I wonder is there any other way without manually specifying number of characters to skip.
Original qustion was with numbers, updated for strings case.

If you don't want to specify exactly how many characters to skip, perhaps you could use a quantifier like + in the negative lookahead and use a negated character class to match not an underscore.
\babc_(?!x+_)[^_]+_ghi\b
Explanation
\babc_ Word boundary, match abc_
(?! Negative lookahead, assert what is directly on the right is not
x+_ Match 1+ times x followed by an underscore
) Close lookahead
[^_]+_ Negated character class, match 1+ times any char except _
ghi\b Match ghi and word boundary
Regex demo

You can use this
123_(?:(?!000)\d){3}_789
Regex demo

If you don't wish to use look-arounds, this expression might be an option:
(?:abc_xxx_ghi)|(abc_.{3}_ghi)
Other than that I can't think of anything else.
DEMO

Regex for spoof

I would like to ask for help regarding my problem when it comes to spoofing let say usernames and I want to catch them using regex.
for example the correct username is :
rolf
and here are the spoofed versions that I could think of:
roooolf
r123olf
123rolf123
rolf5623
123rolf
rollllf
rrrrrrolf
rolffff
So basically I have this regex expression ( that I know is not sufficient because I've tried it on regex101 website )
.+(?![rolf]).+
I'm using this as a baseline because it doesnt catch the correct username which is :
rolf
but it doesn't catch all the other "spoofed" versions of the username.
Any Ideas how can I make my regex more efficient?
Thanks in advance!

You may try this too
(?m)^(?![^\n]*?rolf[^\n]*$).*$
Demo

To match not exactly rolf You can use a negative lookahead (?! to assert that what follows from the beginning of the string is not 'rolf' until the end of the string.
^(?!rolf$).+$
That would match
^ Assert position at the begin of the string
(?! Negative lookahead that asserts that what follows is not
rolf Match literally
) Close negative lookahead
.+ Match any character one or more times
$Assert position at the end of the string
From your example regex you match .+ where #Ωmega has a fair point, matches spaces.
Instead of .+ you could specify what characters you might accept like \w+ for example to match one or more word characters or specify more using a character class.

You can use a regex pattern
\b(?!rolf\b)\S+\b
\b Word boundary - Matches a word boundary position between a
word character and non-word character or position (start / end of
string).
(?! Negative lookahead - Specifies a group that can not match
after the main expression (if it matches, the result is discarded).
\S Not whitespace - Matches any character that is not a
whitespace character (spaces, tabs, line breaks).
+ Quantifier - Match 1 or more of the preceding token.
Test your inputs with this pattern here.

Regex Matching Behaviour Of \w

I noticed some interesting behaviour with some regex work I am doing, and I'd like some insight.
From what I understand, the word character, \w should match the following [a-zA-Z_0-9]
Given this input,
0000000060399301+0000000042456971+0000000
What should this regex
(\d+)\w
Capture?
I would expect it to capture 0000000060399301 but it actually captures 000000006039930
Is there something I am missing? Why is the 1 dropped from the end?
I noticed if I changed the regex to
(\d+\w)
It captures correctly i.e. including the 1
Anyone care to explain? Thanks

You require the regex to match a trailing word character - that would be the 1.
It cannot be another character, because
+ is not a word class character
+ is not a digit
matching is greedy

\d+ - matches one or more digit characters.
\w+ - matches one or more word characters. [A-Za-z\d_]
So with this string 0000000060399301+, \d+ in this (\d+)\w regex matches all the digits (including the 1 before +) at very first, since the following pattern is \w , regex engine tries to find a match, so it backtracks one character to the left and forces \w to match the digit before + . Now the captured group contains 000000006039930 and the last 1 is matched by \w

The 1 is being dropped because \w isn't in the capture group.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

RegEx: don't capture match, but capture after match - regex

Related

Regex: Match pattern unless preceded by pattern containing element from the matching character class

What is the proper regex for capturing everything after "String" and between two delimeters ('=' and and non alphanumeric))

How to consume lookaround in regex?

Regex for spoof

Regex Matching Behaviour Of \w

Categories

Resources