I have a string:
users/554983490\/Another+Test+/Question????\/+dhjkfsdf/
How would I write a RegExp that matches all of the forward slashes NOT preceded by a backslash?
EDIT: Is there a way to do it without using negative lookbehinds?
If your regular expressions support negative lookbehinds:
/(?<!\\)\//
Otherwise, you will need to match the character before the / as well:
/(^|[^\\])\//
This matches either the start of a string (^), or (|) anything other than a \ ([^\\]) as capture group #1 (). Then it matches the literal / after. Whatever character was before the / will be stored in the capture group $1 so you can put it back in if you are doing a replace.
Example (JavaScript):
'st/ri\\/ng'.replace(/(^|[^\\])\//, "$1\\/");
// returns "st\/ri\/ng"
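As a rough sketch of the capture-group workaround applied to the whole string from the question: the g flag and the ";" replacement are assumptions added here for illustration, not part of the answer above. It also shows one caveat worth checking against your data: because the character before the slash is consumed by the match, the second of two adjacent slashes is skipped.

```javascript
// The question's string; String.raw keeps the backslashes literal.
const input = String.raw`users/554983490\/Another+Test+/Question????\/+dhjkfsdf/`;

// Capture-group workaround with the g flag (added here as an assumption)
// so that every unescaped "/" is replaced, not just the first one.
const withoutLookbehind = /(^|[^\\])\//g;

// "$1" puts back the character that was consumed in front of the slash.
const replaced = input.replace(withoutLookbehind, "$1;");
console.log(replaced);

// Caveat: in "a//b" the first match consumes "a/", and the second "/"
// cannot match because the only character in front of it is already used.
const adjacent = "a//b".replace(withoutLookbehind, "$1;");
console.log(adjacent); // "a;/b"
```

If adjacent slashes can occur in your input, the lookbehind version avoids this because it does not consume anything in front of the slash.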
You can use this:
/(?<!\\)\//
This is called a negative lookbehind.
I used / as the delimiters
(?<! <-- Start of the negative lookbehind (means that it should NOT be preceded by the following pattern)
\\ <-- The \ character (escaped)
) <-- End of the negative lookbehind
\/ <-- The / character (escaped)
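A minimal JavaScript sketch of the lookbehind, run against the string from the question. It assumes an engine with lookbehind support (for example a recent Node.js or Chromium); the variable names are only illustrative.

```javascript
// The question's string; String.raw keeps the backslashes literal.
const input = String.raw`users/554983490\/Another+Test+/Question????\/+dhjkfsdf/`;

// Negative lookbehind: "/" only where the previous character is not "\".
const unescapedSlash = /(?<!\\)\//g;

// Collect the index of every match.
const positions = [...input.matchAll(unescapedSlash)].map((m) => m.index);
console.log(positions.length); // 3 unescaped slashes

// The lookbehind consumes nothing, so adjacent slashes are both matched.
const adjacentResult = "a//b".replace(unescapedSlash, ";");
console.log(adjacentResult); // "a;;b"
```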
I'm trying to match a pair of special characters, while excluding the enclosed content from the match. For example, ~some enclosed content~ should match only the pair of ~ and leave out some enclosed content entirely. I can only use vanilla PCRE, and capture groups aren't an option.
My match criteria for the entire string is ~([^\s].*?(?<!\s))~. Matching the first and second ~ separately would also be acceptable.
Looking at your pattern, you want a non-whitespace char right after the opening ~ and a non-whitespace char right before the closing ~.
As those are the delimiters, and the non-whitespace char should also not be ~ itself, you might use:
~(?=[^~\s](?:[^~\r\n]*[^\s~])?~)|(?<=~)[^\s~](?:[^~\r\n]*[^\s~])?\K~
Explanation
~ Match literally
(?= Positive lookahead, assert that to the right is
[^~\s] Match a non whitespace char except for ~
(?: Non capture group
[^~\r\n]*[^\s~] Match repeating any char other than a newline or ~ followed by a non whitespace char except for ~
)? Close non capture group and make it optional (to also match a single char ~a~)
~ Match literally
) Close the lookahead
| Or
(?<=~) Positive lookbehind, assert ~ to the left
[^\s~] Match a non whitespace char except for ~
(?:[^~\r\n]*[^\s~])? Same optional pattern as in the lookahead
\K Forget what is matched so far
~ Match literally
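For anyone trying the same idea outside PCRE: JavaScript has no \K, so the sketch below swaps in a different technique and moves the "forgotten" part into a variable-length lookbehind. This is an adaptation under that assumption, not the PCRE pattern itself, and the sample strings are made up for illustration.

```javascript
// Opening "~": the lookahead is the same as in the PCRE pattern.
// Closing "~": the text PCRE would match and then discard with \K is
// placed in a variable-length lookbehind instead (supported by V8).
const tildePair =
  /~(?=[^~\s](?:[^~\r\n]*[^\s~])?~)|(?<=~[^\s~](?:[^~\r\n]*[^\s~])?)~/g;

const sample = "~some enclosed content~";
const tildeIndexes = [...sample.matchAll(tildePair)].map((m) => m.index);
console.log(tildeIndexes); // index of the first and of the last character

// Whitespace right after the opening "~" disqualifies the pair.
const rejected = [..."~ bad~".matchAll(tildePair)].length;
console.log(rejected); // 0
```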
I'm trying to get UNIX group names using a regex (can't use groups because I can only get the process uid, so I'm using id <process_id> to get groups)
input looks like this
uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n
I'd like to capture kawsay, sudo, video and gpio
The only pieces I've got are:
a positive lookbehind to start capturing after groups: /(?<=groups)/
capture the parenthesis content: /\((\w+)\)/
Using PCRE's \G you may use this regex:
(?:\bgroups=|(?<!^)\G)[^(]*\(([^)]+)\)
Your intended matches are available in capture group #1
RegEx Details:
(?:: Start non-capture group
\bgroups=: Match word groups followed by a =
|: OR
(?<!^)\G: Start from end position of the previous match
): End non-capture group
[^(]*: Match 0 or more of any character that is not (
\(: Match opening (
([^)]+): Use capture group #1 to match 1+ of any non-) characters
\): Match closing )
You can use
(?:\G(?!\A)\),|\bgroups=)\d+\(\K\w+
Details:
(?:\G(?!\A)\),|\bgroups=) - either of
\G(?!\A)\), - end of the previous match (\G operator matches either start of string or end of the previous match, so the (?!\A) is necessary to exclude the start of string location) and then ), substring
| - or
\bgroups= - a whole word groups (\b is a word boundary) and then a = char
\d+\( - one or more digits and a (
\K - match reset operator that makes the regex engine "forget" the text matched so far
\w+ - one or more word chars.
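Both patterns above rely on \G (and the second on \K), which JavaScript does not have. As a hedged stand-in for engines like that, the sketch below swaps in a variable-length lookbehind that requires "groups=" earlier on the same line. It is a different technique from \G, shown only for comparison; the names are illustrative.

```javascript
const idOutput =
  "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n";

// First lookbehind: "groups=" must appear somewhere earlier on this line.
// Second lookbehind: the name must directly follow "(".
const groupName = /(?<=\bgroups=.*)(?<=\()\w+/g;

const groupNames = idOutput.match(groupName) ?? [];
console.log(groupNames); // [ 'kawsay', 'sudo', 'video', 'gpio' ]
```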
Here are two more ways to extract the strings of interest. Both return matches and do not employ capture groups. My preference is for the second one.
str = "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"
Match substrings between parentheses that are not followed later in the string by "groups="
Match the regular expression
rgx = /(?<=\()(?!.*\bgroups=).*?(?=\))/
str.scan(rgx)
#=> ["kawsay", "sudo", "video", "gpio"]
See String#scan.
This expression can be broken down as follows.
(?<=\() # positive lookbehind asserts previous character is '('
(?! # begin negative lookahead
.* # match zero or more characters
\bgroups= # match 'groups=' preceded by a word boundary
) # end negative lookahead
.*? # match zero or more characters lazily
(?=\)) # positive lookahead asserts next character is ')'
This may not be as efficient as expressions that employ \G (because of the need to determine if 'groups=' appears in the string after each left parenthesis), but that may not matter.
Extract the portion of the string following "groups=" and then match substrings between parentheses
First, obtain the portion of the string that follows "groups=":
rgx1 = /(?<=\bgroups=).*/
s = str[rgx1]
#=> "1001(kawsay),27(sudo),44(video),997(gpio)\n"
See String#[].
Then match the regular expression
rgx2 = /(?<=\()[^\)\r\n]+/
against s:
s.scan(rgx2)
#=> ["kawsay", "sudo", "video", "gpio"]
The regular expression rgx1 can be broken down as follows:
(?<=\bgroups=) # Positive lookbehind asserts that the current
# position in the string is preceded by 'groups=',
# which is preceded by a word boundary
.* # match zero or more characters other than line
# terminators (to end of line)
rgx2 can be broken down as follows:
(?<=\() # Use a positive lookbehind to assert that the
# following character is preceded by '('
[^\)\r\n]+ # Match one or more characters other than
# ')', '\r' and '\n'
Note:
The operations can of course be chained: str[/(?<=\bgroups=).*/].scan(/(?<=\()[^\)\r\n]+/); and
rgx2 could alternatively be written /(?<=\().+?(?=\))/, where ? makes the match of one or more characters lazy and (?=\)) is a positive lookahead that asserts that the match is followed by a right parenthesis.
This would probably be the fastest solution of those offered and certainly the easiest to test.
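The same two-step idea carries over to other languages. Here is a rough JavaScript equivalent of the Ruby above, assuming an engine with lookbehind support; the variable names are mine.

```javascript
const idOutput =
  "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n";

// Step 1: keep only what follows "groups=" (up to the end of that line).
const afterGroups = (idOutput.match(/(?<=\bgroups=).*/) ?? [""])[0];
console.log(afterGroups); // "1001(kawsay),27(sudo),44(video),997(gpio)"

// Step 2: take each run of characters that directly follows "(".
const names = afterGroups.match(/(?<=\()[^)\r\n]+/g) ?? [];
console.log(names); // [ 'kawsay', 'sudo', 'video', 'gpio' ]
```

Splitting the work this way keeps each regex small enough to test on its own.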
I have this string
(Mozilla/5.0 \(X11; Linux x86_64\) AppleWebKit/537.36 \(KHTML, like Gecko\) Chrome/data Safari/data2) /Producer (Skia/PDF m80) /CreationDate (D:20200420090009+00'00') /ModDate (D:20200420090009+00'00')
I want to get the first occurrence of () where there isn't any \ before ( or ). In that case I would get
(Mozilla/5.0 \(X11; Linux x86_64\) AppleWebKit/537.36 \(KHTML, like Gecko\) Chrome/data Safari/data2)
I'm using this regex expression
\([\s\S]*[^\\]{1}\)?
However, I get the whole string.
Your regex can be broken down like so.
[The spaces and newlines are for clarity]
\( match a literal (
[\s\S]* match 0 or more of whitespace or not-whitespace (anything)
[^\\]{1} match 1 thing which is not \
\)? optionally match a literal )
It's that [\s\S]* which winds up slurping in everything.
The ? on the end doesn't mean lazy, it makes matching the ) optional. To be lazy, ? must be put after an open-ended quantifier, as in *? or +? or {3,}? or {1,5}?.
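A small JavaScript sketch of the difference, using made-up sample strings: a greedy quantifier, a lazy one, and a trailing ? that only makes the ) optional.

```javascript
const text = "(a)(b)";

// Greedy: .* runs to the end of the line, then backtracks to the LAST ")".
const greedy = text.match(/\(.*\)/)[0];
console.log(greedy); // "(a)(b)"

// Lazy: .*? expands one character at a time and stops at the FIRST ")".
const lazy = text.match(/\(.*?\)/)[0];
console.log(lazy); // "(a)"

// A "?" after \) is not laziness: it makes the ")" optional, so the greedy
// .* keeps everything and the ")" is simply not required.
const optionalParen = "(a)(b".match(/\(.*\)?/)[0];
console.log(optionalParen); // "(a)(b"
```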
To match just the first set of parentheses, we want to lazily match anything between unescaped parens. Lazy matching anything is easy: .*?.
Matching unescaped parens is a little harder. We could match [^\\]\), but that requires a character to match. This won't work if the opening paren is at the beginning of the string because there's no character before the (. We can solve this by also matching the beginning of the string: (?:[^\\]|^)\(.
(?: non-capturing group
[^\\] match a non \
| or
^ the beginning of the string
)
\( match a literal (
.*? lazy match 0 or more of anything
[^\\] match a non \
\) match a literal )
But this will be foiled by (). It will match all of ()(foo).
(?:[^\\]|^) matches the beginning of the string. \( matches the first (. That leaves .*?[^\\]\) looking at )(foo). The first ) does not match because there is no leading character, it was already consumed. So .*? gobbles up characters until it hits o) which matches [^\\]\).
The boundary problem is better solved by negative look behinds. (?<!\\) says the preceding character must not be a \ which includes no character at all. Lookbehinds don't consume what they match so they can be used to peek behind and ahead. Most, but not all, regex engines support them.
(?<!\\) \( match a literal ( which is not after a \
.*? lazy match 0 or more of anything
(?<!\\) \) match a literal ) which is not after a \
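To compare the two approaches side by side, here is a hedged JavaScript sketch (lookbehind support assumed) that runs the lookbehind pattern on the string from the question and both patterns on the ()(foo) edge case. The variable names are illustrative.

```javascript
// The question's string; String.raw keeps the backslashes literal.
const pdfInfo = String.raw`(Mozilla/5.0 \(X11; Linux x86_64\) AppleWebKit/537.36 \(KHTML, like Gecko\) Chrome/data Safari/data2) /Producer (Skia/PDF m80) /CreationDate (D:20200420090009+00'00') /ModDate (D:20200420090009+00'00')`;

// Lookbehind version: "(" and ")" only count when not preceded by "\".
const unescapedParens = /(?<!\\)\(.*?(?<!\\)\)/;
const first = pdfInfo.match(unescapedParens)[0];
console.log(first);

// Edge case: the version that consumes the preceding character
// overshoots on "()", while the lookbehind version does not.
const consuming = /(?:[^\\]|^)\(.*?[^\\]\)/;
const overshoot = "()(foo)".match(consuming)[0];
const exact = "()(foo)".match(unescapedParens)[0];
console.log(overshoot); // "()(foo)"
console.log(exact);     // "()"
```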
However, there are libraries to parse User-Agents. ua-parser has libraries for many languages.
I am working on an .xml file with this tag
<Categories><![CDATA[Test/Test1-Test2-Test3|Test4/Test5-Test6|Test7/Test8]]></Categories>
and I am trying to replace / with ; by using regular expressions in Sublime Text 3.
The output should be
<Categories><![CDATA[Test;Test1-Test2-Test3|Test4;Test5-Test6|Test7;Test8]]></Categories>
When I use this (<Categories>\S+)\/(.+</Categories>) it matches all the line and of course if I use this \/ it matches all / everywhere inside the .xml file.
Could you please help?
For your example string, you could make use of a \G to assert the position at the end of the previous match and use \K to forget what has already been matched and then match a forward slash.
In the replacement use a ;
Use a positive lookahead to assert what is on the right is ]]></Categories>
(?:<Categories><!\[CDATA\[|\G(?!^))[^/]*\K/(?=[^][]*]]></Categories>)
Explanation
(?: Non capturing group
<Categories><!\[CDATA\[ Match <Categories><![CDATA[
| Or
\G(?!^) Assert the position at the end of the previous match, not at the start
) Close non capturing group
[^/]* Match 0+ times not / using a negated character class
\K/ Forget what was matched, then match /
(?= Positive lookahead, assert what is on the right is
[^][]*]]></Categories> Match 0+ times not [ or ], then match ]]></Categories>
) Close positive lookahead
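If you ever need the same replacement in a script rather than in the editor, note that \G and \K are not available in JavaScript. The sketch below swaps in a different technique: it captures the CDATA body and replaces the slashes inside it with a callback. The extra `<Path>` element is a hypothetical line added only to show that slashes elsewhere are left alone.

```javascript
const xml = [
  "<Path>keep/this</Path>", // hypothetical element, not from the question
  "<Categories><![CDATA[Test/Test1-Test2-Test3|Test4/Test5-Test6|Test7/Test8]]></Categories>",
].join("\n");

// Capture the opening marker, the CDATA body, and the closing marker.
const categoriesCdata = /(<Categories><!\[CDATA\[)([^\]]*)(\]\]><\/Categories>)/g;

// Replace "/" with ";" inside the captured body only.
const converted = xml.replace(
  categoriesCdata,
  (_match, open, body, close) => open + body.replace(/\//g, ";") + close
);
console.log(converted);
```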
I have following input:
!foo\[bar[bB]uz\[xx/
I want to match everything from the start to [, including the escaped bracket \[ and omitting the first characters if they are in the [!#\s] group
Expected output:
foo\[bar
I've tried with:
(?![!#\s])[^/\s]+\[
But it returns:
foo\[bar[bB]uz\[
Java: Use Lookbehind
(?<=!)(?:\\\[|[a-z])+
Explanation
The lookbehind (?<=!) asserts that what precedes the current position is the character !
The non-capture group (?:\\\[|[a-z]) matches \[ OR | a letter between a and z
The + causes the group to be matched one or more times
You can use this regex:
!((?:[^[\\]*\\\[)*[^[]*)
Add a ? after [^/\s]+ to catch the shortest group possible
Add \w+ to the end to catch the first group of alphanumeric characters after \[
Result :
(?![!#\s])[^\/\s]+?\[\w+
You can try this pattern:
(?<=^[!#\s]{0,1000})(?:[^!#\s\\\[]|\\.)(?>[^\[\\]+|\\.)*(?=\[)
pattern details:
The beginning is a lookbehind and means preceded by zero or several forbidden characters at the start of the string
(?:[^!#\s\\\[]|\\.) ensures that the first character is an allowed character or an escaped character.
(?>[^\[\\]+|\\.)* describes the content: all that is not a [ or a \, or an escaped character. (note that this subpattern can be written like that too: (?:[^\[\\]|\\.)*)
(?=\[) checks that the next character is a literal opening square bracket. (since all escaped characters are matched by the preceding group, you can be sure that this one is not escaped)
Use a negated character class for the start (i.e. the match must not start with a special char), then a reluctant quantifier (which stops at the first hit), with a negative lookbehind to skip over escaped brackets:
[^!#\s].*?(?<!\\)\[
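A quick check of that pattern in JavaScript (lookbehind support assumed) against the input from the question. As written, the match includes the terminating [. The second pattern is my own variation, not from the answer: it moves the unescaped [ into a lookahead so the result matches the expected output exactly.

```javascript
// The question's input; String.raw keeps the backslashes literal.
const input = String.raw`!foo\[bar[bB]uz\[xx/`;

// As written, the pattern includes the terminating "[" in the match.
const withBracket = input.match(/[^!#\s].*?(?<!\\)\[/)[0];
console.log(withBracket); // foo\[bar[

// Variation (an assumption about the intent): require the unescaped "["
// via a lookahead so that it is not part of the match.
const withoutBracket = input.match(/[^!#\s].*?(?=(?<!\\)\[)/)[0];
console.log(withoutBracket); // foo\[bar
```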