Is there a way to use a single regular expression to match only within another match? For example, if I want to remove spaces from a string, but only within parentheses:
source : "foobar baz blah (some sample text in here) and some more"
desired: "foobar baz blah (somesampletextinhere) and some more"
In other words, is it possible to restrict matching to a specific part of the string?
In PCRE a combination of \G and \K can be used:
(?:\G(?!^)|\()[^)\s]*\K\s+
\G continues where the previous match ended
\K resets beginning of the reported match
[^)\s] matches any character not in the set
See demo at regex101
The idea is to chain matches to an opening parenthesis. The chain links are either [^)\s]* or \s+. To report only the spaces, \K resets the match start just before them. This solution does not require a closing ).
In other regex flavors that support \G but not \K, capturing groups can help out. E.g. search for
(\G(?!^)|\()([^)\s]*)\s+
and replace with captures of the 2 groups (depending on lang: $1$2 or \1\2) - Regex101 demo
Further, there is (*SKIP)(*F), a PCRE feature for skipping over certain parts. It is often used together with The Trick. The idea is simple: skip this(*SKIP)(*F)|match that - Regex101 demo. This can also be worked around with capture groups, e.g. replace ([^)(]*\(|\)[^)(]*)|\s with $1
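Python's built-in re module supports neither \G, \K nor (*SKIP)(*F), so of the options above only the capture-group workaround carries over directly. A minimal sketch, assuming Python and non-nested parentheses; the function name strip_spaces_in_parens is made up for illustration:

```python
import re

# Group 1 swallows the text that must be kept: anything up to and including "(",
# or ")" plus whatever follows it. The \s branch only gets a chance inside parentheses.
PATTERN = re.compile(r'([^)(]*\(|\)[^)(]*)|\s')

def strip_spaces_in_parens(text: str) -> str:
    # Put group 1 back when it matched; otherwise the match is whitespace to drop.
    return PATTERN.sub(lambda m: m.group(1) or '', text)

source = "foobar baz blah (some sample text in here) and some more"
print(strip_spaces_in_parens(source))
```

One caveat of this pattern: a string with no parentheses at all never triggers group 1, so every whitespace character in it would be removed.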
One idea is to replace any space between parentheses using a lookahead pattern:
(?=([^\s\(]+ )*\S*\))(?!\S*\s*\()
The lookahead will attempt to match the last space before the closed parenthesis (\S*\)) and any optional space before ([^\s\(]+ )* (if found).
Detailed Regex Explanation:
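' ': a literal space (the pattern starts with one; this is the character that gets replaced)
(?=([^\s\(]+ )*\S*\)): positive lookahead (zero-width, consumes nothing)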
([^\s\(]+ )*: any combination characters not including the open parenthesis and the space characters + space (this group is optional)
\S*\): any non-space character + closed parenthesis
(?!\S*\s*\(): what lookahead should not be
\S*: any non space character (optional), followed by
\s*: any space character (optional), followed by
\(: the open parenthesis
Check the demo here.
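The lookahead pattern uses no engine-specific features, so it can be tried as-is in most flavors. A short sketch in Python (the variable names are illustrative only); note the literal space at the very start of the pattern:

```python
import re

# A literal space, followed by the two lookaheads described above.
SPACE_IN_PARENS = re.compile(r' (?=([^\s\(]+ )*\S*\))(?!\S*\s*\()')

source = "foobar baz blah (some sample text in here) and some more"
result = SPACE_IN_PARENS.sub('', source)
print(result)
```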
I am having a hard time coming up with a regex to match a specific case:
This can be matched:
any-dashed-strings
this-can-be-matched-even-though-its-big
This cannot be matched:
strings starting with elem- or asdf- or a single -
elem-this-cannot-be-matched
asdf-this-cannot-be-matched
-
So far what I came up with is:
/\b(?!elem-|asdf-)([\w\-]+)\b/
But I keep matching a single - as well as the -this-cannot-be-matched remainder of the excluded strings. I cannot figure out how to conditionally reject a character that is part of the matching character class, and how to match nothing at all when one of the excluded prefixes is found.
I am currently working with the Oniguruma engine (Ruby 1.9+/PHP multi-byte string module).
If possible, please elaborate on the solution. Thanks a lot!
If a lookbehind is supported, you can assert a whitespace boundary to the left, and make the alternation for both words without the hyphen optional.
(?<!\S)(?!(?:elem|asdf)?-)[\w-]+\b
Explanation
(?<!\S) Assert a whitespace boundary to the left
(?! Negative lookahead, assert that what is directly to the right is not
(?:elem|asdf)?- Optionally match elem or asdf followed by -
) Close the lookahead
[\w-]+ Match 1+ word chars or -
\b A word boundary
See a regex demo.
Or a version with a capture group and without a lookbehind:
(?:\s|^)(?!(?:elem|asdf)?-)([\w-]+)\b
See another regex demo.
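As a quick check of the lookbehind version, here is a sketch in Python rather than Oniguruma (the syntax used is the same in both, but the choice of Python and the variable names are assumptions made for illustration):

```python
import re

# Whitespace boundary on the left, then reject "elem-", "asdf-" or a bare "-".
NOT_PREFIXED = re.compile(r'(?<!\S)(?!(?:elem|asdf)?-)[\w-]+\b')

candidates = "\n".join([
    "any-dashed-strings",
    "this-can-be-matched-even-though-its-big",
    "elem-this-cannot-be-matched",
    "asdf-this-cannot-be-matched",
    "-",
])
matches = NOT_PREFIXED.findall(candidates)
print(matches)
```

The lookbehind is what stops the engine from restarting in the middle of a rejected string and matching its tail.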
Details={
AwsEc2SecurityGroup={GroupName=m.com-rds, OwnerId=123, VpcId=vpc-123,
IpPermissions=[{FromPort=3306, ToPort=3306, IpProtocol=tcp, IpRanges=[{CidrIp=1.1.1.1/32}, {CidrIp=2.2.2.2/32}, {CidrIp=0.0.0.0/0}, {CidrIp=3.3.3.3/32}],
UserIdGroupPairs=[{UserId=123, GroupId=sg-123abc}]}], IpPermissionsEgress=[{IpProtocol=-1, IpRanges=[{CidrIp=0.0.0.0/0}]}], GroupId=sg-123abc}},
Region=us-east-1, Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc}]
}
I want to capture exactly arn:aws:ec2:us-east-1:123:security-group/sg-123abc in this example. Generically, I want to capture the value of Id regardless of placement. My current solution is /Details={.*Id=(.*\w)/, but this only works if it's the last object in the data. How can I take into account the following potential scenario:
Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc, Thing=123abc}]
You have a pattern with 2 times .* which will first match till the end of the line/string (depending on if the dot matches a newline) and it will backtrack to match the last occurrence where this part of the pattern Id=(.*\w) can match.
If you want to use a capture group, you can make the format and the allowed characters a bit more specific:
\bId=(\w+(?:[:\/-]\w+)+)
The pattern in parts
\b A word boundary to prevent a partial word match
Id= Match literally
( Capture group 1
\w+ Match 1+ word chars
(?:[:\/-]\w+)+ Repeat 1+ times either : / - and 1+ word chars
) Close group 1
Regex demo
Or if you know that it starts with Id=arn:
\bId=(arn:[\w:\/-]+)
Regex demo
Note that you only have to escape the / (as \/) when the delimiters of the regex are forward slashes, but there is no language tagged.
You can use look-behind to check that there is the Id= prefix, and then match anything that is not a space, comma or closing brace:
(?<=\bId=)[^,}\s]*
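Since no language is tagged, the following is only a sketch of both approaches in Python, run against a shortened copy of the sample data with the extra Thing=123abc field after Id:

```python
import re

# Capture-group variant; "/" needs no escaping because Python has no regex delimiters.
ID_CAPTURE = re.compile(r'\bId=(\w+(?:[:/-]\w+)+)')
# Lookbehind variant; the prefix is fixed-width, which Python's re requires.
ID_LOOKBEHIND = re.compile(r'(?<=\bId=)[^,}\s]*')

data = (
    "Details={AwsEc2SecurityGroup={GroupName=m.com-rds, OwnerId=123, VpcId=vpc-123, "
    "GroupId=sg-123abc}}, Region=us-east-1, "
    "Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc, Thing=123abc}]}"
)

captured = ID_CAPTURE.search(data).group(1)
looked_behind = ID_LOOKBEHIND.search(data).group(0)
print(captured)
print(looked_behind)
```

The \b is what keeps OwnerId=, VpcId= and GroupId= from matching, because there is no word boundary before their Id.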
I have this regex:
\[tag\](.*?)\[\/tag\]
It matches any characters between the two tags. The problem is that it also matches empty content or just whitespace inside the tags, for example:
[tag][/tag]
[tag] [/tag]
How can I avoid it? I want it to match at least 1 character, and not only whitespace. Thanks!
Use
\[tag\](?!\s*\[\/tag\])(.*?)\[\/tag\]
^^^^^^^^^^^^^^^^
See the regex demo and the Regulex graph:
The (?!\s*\[\/tag\]) is a negative lookahead that fails the match if, immediately to the right of the current location, there is 0+ whitespaces, [/tag].
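A small sketch of this lookahead in Python (an assumed language, as the question names none), using the two unwanted inputs from the question plus one tag with real content:

```python
import re

# Same pattern as above; "/" is left unescaped since Python has no regex delimiters.
NON_BLANK_TAG = re.compile(r'\[tag\](?!\s*\[/tag\])(.*?)\[/tag\]')

text = "[tag][/tag] [tag] [/tag] [tag]hello world[/tag]"
contents = NON_BLANK_TAG.findall(text)
print(contents)
```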
You might change your expression to something similar to this:
\[tag\]([\s\S]+)\[\/tag\]
and you might add a quantifier to it, and bound it with number of chars, similar to this expression:
\[tag\]([\s\S]{3,})\[\/tag\]
Or you could do the same with your original expression as this expression:
Try this regex:
\[(tag)\](?!\s*\[\/\1\])(.*?)\[\/\1\]
This regex matches tag only if it has at least one non-whitespace char.
If this is PCRE (e.g. PHP), Notepad++ or Perl, use this
(?s)(?:\[tag\]\s*\[/tag\](*SKIP)(?!)|\[tag\]\s*(.+?)\s*\[/tag\])
https://regex101.com/r/aCsOoQ/1
If not, you're stuck with using Stribnetz regex, which works because of
an odd condition of your requirements.
Readable
(?s)
(?:
\[tag\]
\s*
\[/tag\]
(*SKIP)
(?!)
|
\[tag\]
\s*
( .+? ) # (1)
\s*
\[/tag\]
)
There are a thousand regular expression questions on SO, so I apologize if this is already covered. I did look first.
I have string:
Name Subname 11X22 88X620 AB33(20) YA5619 77,66
I need to capture this string: YA5619
What I am doing is just finding AB33(20) and after this I am capturing until first white space. But AB33(20) can be AB-33(20) or AB33(-20) or AB33(-1).
My preg_match regex is: (?<=\bAB\d{2}\(\d{2}\)\s).+?(?=\s)
Why am I getting an error when I change \d{2} to \d+?
For the final result I thought this regex would work, but it does not:
(?<=\bAB-?\d+\(-?\d+\)\s).+?(?=\s)
Any ideas what I am doing wrong?
With most regex flavors, lookbehind needs to evaluate to a fixed-length sequence, so you can't use variable quantifiers like * or + or even {1,2}.
Instead of using lookaround, you can simply match your marker pattern and then forget it with \K.
AB-?\d+(?:\(-?\d+\))? \K[^ ]+
demo: https://regex101.com/r/8XXngH/1
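The same restriction can be observed outside PHP. The sketch below uses Python, whose re module also insists on fixed-width lookbehind and has no \K, so a capture group stands in for \K here (that substitution, and the variable names, are assumptions for illustration):

```python
import re

line = "Name Subname 11X22 88X620 AB33(20) YA5619 77,66"

# The variable-length lookbehind from the question is rejected at compile time.
try:
    re.compile(r'(?<=\bAB-?\d+\(-?\d+\)\s).+?(?=\s)')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Match the marker, then capture the token instead of resetting with \K.
MARKER_THEN_TOKEN = re.compile(r'AB-?\d+(?:\(-?\d+\))? ([^ ]+)')
token = MARKER_THEN_TOKEN.search(line).group(1)
print(lookbehind_ok, token)
```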
It depends on the language. If it is in .NET for example, it matches due to the various length in the lookbehind.
Another solution might be to use a character class listing the characters you would allow to match. Then match a whitespace character, reset the match with \K, and match \S+, which matches 1+ non-whitespace characters.
\bAB[()\d-]+\s\K\S+
Explanation
\bAB Match literally prepended with word boundary to prevent AB being part of a larger match.
[()\d-]+ Match 1+ times any of the listed character in the character class
\s Match a whitespace char (or \s+ to match 1 or more)
\K Reset the starting point of the reported match (forget what was matched so far)
\S+ Match 1+ non-whitespace characters
Regex demo | Php demo
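To see that the character class covers all the marker variants from the question, here is a sketch in Python; since Python's re has no \K, the \S+ part is wrapped in a capture group instead, which is an adaptation and not the pattern as posted:

```python
import re

# \K replaced by a capture group around \S+ (Python's re does not support \K).
AFTER_AB = re.compile(r'\bAB[()\d-]+\s(\S+)')

variants = ["AB33(20)", "AB-33(20)", "AB33(-20)", "AB33(-1)"]
results = [
    AFTER_AB.search(f"Name Subname 11X22 88X620 {marker} YA5619 77,66").group(1)
    for marker in variants
]
print(results)
```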
I have following input:
!foo\[bar[bB]uz\[xx/
I want to match everything from the start up to [, including the escaped bracket \[ and omitting the first characters if they are in the [!#\s] group
Expected output:
foo\[bar
I've tried with:
(?![!#\s])[^/\s]+\[
But it returns:
foo\[bar[bB]uz\[
Java: Use Lookbehind
(?<=!)(?:\\\[|[a-z])+
See the regex demo
Explanation
The lookbehind (?<=!) asserts that what precedes the current position is the character !
The non-capture group (?:\\\[|[a-z]) matches \[ OR | a letter between a and z
The + causes the group to be matched one or more times
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind
You can use this regex:
!((?:[^[\\]*\\\[)*[^[]*)
Online Regex Demo
Add a ? after [^/\s]+ to catch the shortest group possible
Add \w+ to the end to catch the first group of alphanumeric characters after \[
Result :
(?![!#\s])[^\/\s]+?\[\w+
Try it
You can try this pattern:
(?<=^[!#\s]{0,1000})(?:[^!#\s\\\[]|\\.)(?>[^\[\\]+|\\.)*(?=\[)
pattern details:
The beginning is a lookbehind and means "preceded by zero or more forbidden characters at the start of the string"
(?:[^!#\s\\\[]|\\.) ensures that the first character is an allowed character or an escaped character.
(?>[^\[\\]+|\\.)* describes the content: all that is not a [ or a \, or an escaped character. (note that this subpattern can be written like that too: (?:[^\[\\]|\\.)*)
(?=\[) checks that the next character is a literal opening square bracket. (since all escaped characters are matched by the precedent group, you can be sure that this one is not escaped)
link to fiddle (push the Java button)
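The bounded lookbehind above relies on Java accepting {0,1000} inside a lookbehind. In an engine that requires fixed-width lookbehind, the same idea can be approximated by consuming the forbidden prefix and capturing the rest. A sketch in Python, which is an adaptation under that assumption and not the original pattern:

```python
import re

# Consume leading forbidden characters, then capture:
#   first an allowed or escaped character,
#   then any run of non-"[" / non-"\" characters or escaped characters,
# and finally require an unescaped "[" right after the capture.
UP_TO_BRACKET = re.compile(r'^[!#\s]*((?:[^!#\s\\\[]|\\.)(?:[^\[\\]|\\.)*)(?=\[)')

sample = r'!foo\[bar[bB]uz\[xx/'
captured = UP_TO_BRACKET.search(sample).group(1)
print(captured)
```

The wanted text ends up in group 1 instead of being the whole match, which is the price of dropping the lookbehind.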
Use a negated character class for the start (i.e. the match must not start with a special char), then a reluctant quantifier (which stops at the first hit), with a negative lookbehind to skip over escaped brackets:
[^!#\s].*?(?<!\\)\[
See live demo