Regex multiple characters but without specific string - regex

I have lines which consist of:
optional: "no ",
optional: two non whitespace characters,
several non whitespace characters.
I want to capture string from each line which consist of:
optional: two non whitespace characters (but not "no " part),
several non whitespace characters.
Example lines:
ab123
ab 123
no abc123
no ab 123
I want to capture:
ab123
ab 123
abc123
ab 123
My regexp (works only for examples without "no ").
^
(?! no \s) # not "no "
( # match it
(?: \S{1,2} \s )? # optional: 1-2 non whitespace characters and one space, BUT NOT GET "no " (it doesn't works)
\S+ # non whitespace characters
)
$
Example online (4 unit tests): https://regex101.com/r/70soe2/1
Maybe should I use negative look ahead (?! no \\s) or negative look behind (?<! no \\s) in some way? But I don't know how to use it.

You cannot actually rely on lookarounds here, you need to consume the optional no + whitespace part of the string.
It is best to use a non-capturing optional group at the start:
^
(?: no \s)? # not "no "
( # capture it
(?: \S{1,2} \s )? # optional: 1-2 non whitespace characters and one space, BUT NOT GET "no " (it doesn't works)
\S+ # non whitespace characters
)
$
See the regex demo
The value you need is inside Group 1.
If your regex engine supports \K construct, you may use this instead:
^
(?:no \s \K)? # not "no "
( # match it
(?: \S{1,2} \s )? # optional: 1-2 non whitespace characters and one space, BUT NOT GET "no " (it doesn't works)
\S+ # non whitespace characters
)
$
The \K in (?:no \s \K)? will omit the consumed string part from the match value, and you will get the expected result as a whole match value.
See the regex demo

Related

Parenthesis content after a specific word

I'm trying to get UNIX group names using a regex (can't use groups because I can only get the process uid, so I'm using id <process_id> to get groups)
input looks like this
uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n
I'd like to capture kawsay, sudo, video and gpio
The only pieces I've got are:
a positive lookbehind to start capturing after groups: /(?<=groups)/
capture the parenthesis content: /\((\w+)\)/
Using PCRE's \G you may use this regex:
(?:\bgroups=|(?<!^)\G)[^(]*\(([^)]+)\)
Your intended matches are available in capture group #1
RegEx Demo
RegEx Details:
(?:: Start non-capture group
\bgroups=: Match word groups followed by a =
|: OR
(?<!^)\G: Start from end position of the previous match
): End non-capture group
[^(]*: Match 0 or more of any character that is not (
\(: Match opening (
([^)]+): Use capture group #1 to match 1+ of any non-) characters
\): Match closing )
You can use
(?:\G(?!\A)\),|\bgroups=)\d+\(\K\w+
See the regex demo. Details:
(?:\G(?!\A)\),|\bgroups=) - either of
\G(?!\A)\), - end of the previous match (\G operator matches either start of string or end of the previous match, so the (?!\A) is necessary to exclude the start of string location) and then ), substring
| - or
\bgroups= - a whole word groups (\b is a word boundary) and then a = char
\d+\( - one or more digits and a (
\K - match reset operator that makes the regex engine "forget" the text matched so far
\w+ - one or more word chars.
Here are two more ways to extract the strings of interest. Both return matches and do not employ capture groups. My preference is for second one.
str = "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"
Match substrings between parentheses that are not followed later in the string with "groups="
Match the regular expression
rgx = /(?<=\()(?!.*\bgroups=).*?(?=\))/
str.scan(rgx)
#=> ["kawsay", "sudo", "video", "gpio"]
Demo
See String#scan.
This expression can be broken down as follows.
(?<=\() # positive lookbehind asserts previous character is '('
(?! # begin negative lookahead
.* # match zero or more characters
\bgroups= # match 'groups=' preceded by a word boundary
) # end negative lookahead
.* # match zero or more characters lazily
(?=\)) # positive lookahead asserts next character is ')'
This may not be as efficient as expressions that employ \G (because of the need to determine if 'groups=' appears in the string after each left parenthesis), but that may not matter.
Extract the portion of the string following "groups=" and then match substrings between parentheses
First, obtain the portion of the string that follows "groups=":
rgx1 = /(?<=\bgroups=).*/
s = str[rgx1]
#=> "1001(kawsay),27(sudo),44(video),997(gpio)\n"
See String#[].
Then match the regular expression
rgx2 = /(?<=\()[^\)\r\n]+/
against s:
s.scan(rgx2)
#=> ["kawsay", "sudo", "video", "gpio"]
The regular expression rgx1 can be broken down as follows:
(?<=\bgroups=) # Positive lookbehind asserts that the current
# position in the string is preceded by`'groups'`,
# which is preceded by a word boundary
.* # match zero of more characters other than line
# terminators (to end of line)
rgx2 can be broken down as follows:
(?<=\() # Use a positive lookbehind to assert that the
# following character is preceded by '('
[^\)\r\n]+ # Match one or more characters other than
# ')', '\r' and '\n'
Note:
The operations can of course be chained: str[/(?<=\bgroups=).*/].scan(/(?<=\()[^\)\r\n]+/); and
rgx2 could alternatively be written /(?<=\().+?(?=\)), where ? makes the match of one or more characters lazy and (?=\)) is a positive lookahead that asserts that the match is followed by a right parenthesis.
This would probably be the fastest solution of those offered and certainly the easiest to test.

regex pattern to highlight all the matches for the punctuation in VBA

need an expression to allow only the below pattern
end word(dot)(space)start word [eg: end. start]
in other words
no space before colon,semicolon and dot |
one space after colon,semicolon and dot
rest of the all other patterns need to get capture to identify such as
end.start || end . start || end .start
i used
"([\s{0,}][\.]|[\.][\s{2,}a-z]|[\.][\s{0,}a-z])"
but not working as i expected.Need your support please
need_regex_patterns aim_of_regex_need
You could match 1+ word characters using \w+ and match either a colon or semi colon using a character class [;:] between optional spaces ?.
After that, match again 1+ word characters.
\w+ ?[;:] ?\w+
Regex demo
To match the dot followed by a single space variant, you don't need a character class but you could match the dot only using \.
\w+\. \w+
Regex demo
Edit
To highlight all the matches for the punctuations:
(?: [.:;]|[.:;] {2,}|(?<=\S)[;:.](?=\S))
Explanation
(?: Non capture group
[.:;] match a space followed by either . : or ;
| Or
[.:;] {2,} Match one of the listed followed by 2 or more spaces
| Or
(?<=\S)[;:.](?=\S) Match one of the listed surrounded by non whitespace chars
) Close group
Regex demo

Exclude curly brace matches

I have the following strings:
logger.debug('123', 123)
logger.debug(`123`,123)
logger.debug('1bc','test')
logger.debug('1bc', `test`)
logger.debug('1bc', test)
logger.debug('1bc', {})
logger.debug('1bc',{})
logger.debug('1bc',{test})
logger.debug('1bc',{ test })
logger.debug('1bc',{ test})
logger.debug('1bc',{test })
Instead of debug there can be other calls like warn, fatal etc.
All quote pairs can be "", '' or ``.
I need to create a regular express which matches case 1 - 5 but not 6 - 11.
That's what I've come up with:
logger.*\(['`].*['`],\s*.([^{.*}])
This also matches 8 - 11, so I'm suspecting this part is wrong ([^{.*}]) but I don't get it why.
You can try this
logger\.[^(]+\((?:"(?:\\"|[^"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`),[^{}]*?\)
Regex Demo
P.S:- This pattern can be shorten if we are sure there won't be any mismatch of quotes, also if there won't be any escaped quote inside string
If there's no escaped string
logger\.[^(]+\((?:"[^"]*"|'[^']*'|`[^`]*`),[^{}]*?\)
If there's no quotes in between string. i.e no strings like "mr's jhon
logger\.[^(]+\(([`"'])[^"'`]*\1,[^{}]*?\)
If there are no quotes between the quoted parts, you could make use of a capturing group to match one of the quote types (['`"]) and use a backreference \1 to match the closing quote type.
The \r\n in the negated character class is to not cross newline boundaries.
The pattern will match either the quoted parts or 1+ times a word character for the first part.
The second part matches any char except { or } or ) using a negated character class.
logger\.[^(\r\n]+\((?:(['`"])[^'`"]+\1|\w+),[^{})\r\n]+\)
That will match
logger\. Match logger.
[^(\r\n]+ Match 1+ times any char except ( or a newline
\( Match (
(?: Non capture group
(['`"]) Capture group 1
[^'`"]+\1 Match 1+ times any char except the quote types, backreference to the captured
| or
\w+ Match 1+ word chars
), Close non capture group and match ,
[^{})\r\n]+ Match 1+ times any char except { } ) or a newline
\) Match )
Regex demo

Regex Ruby How to group every word within parentheses

I'm trying to get all the words between the parentheses after a specific word and the end of the string.
For example, I have this case:
p " some other text in downcase LOREM (foo, bar)".scan(/ LOREM \((.*?)\)\z/m)
# [["foo, bar"]]
The regex is getting foo, bar which is between the parenthesis, it's okay, but I'd like to get them like two separate elements within a single array, meaning:
["foo", "bar"]
That's to say, the regex should group every words as a separate element.
My intention is to get everything between LOREM ( and the last closing parenthesis ).
I've tried adding (\b\w+\b), which groups every word in the string. But when adding it to the attempt to get the words from the parenthesis, it returns nothing.
You may use
.scan(/(?:\G(?!\A)\s*,\s*|\sLOREM\s+\()\K\w+(?=[^()]*\)\z)/
See the Ruby demo and the Rubular regex demo. You may replace \w+ with [[:alnum:]]+, or \p{L}+ (to only match letters), or [^\s,()]+ (to match any 1+ chars other than whitespace, ,, ( and )), it all depends on what you want to match inside the paretheses.
Details
(?:\G(?!\A)\s*,\s*|\sLOREM\s+\() - either the end of the previous successful match and a , enclosed with 0+ whitespaces, or whitespace, LOREM, 1+ whitespaces and (
\K - omit the text matched so far
\w+ - consume 1+ word chars
(?=[^()]*\)\z) - immediately to the right, there must be 0 or more chars other than ( and ) and then ) at the end of the string.
r = /
(?<= # begin a positive lookbehind
LOREM[ ] # match 'LOREM '
\( # match left paren
| # or
,[ ] # match a comma followed by a space
) # end positive lookbehind
(?: # begin a non-capture group
[^, ")]+ # match one or more characters other than ',', ' ', '"' and ")"
| # or
\" # match a double quote
[^, ")]+ # match one or more characters other than ',', ' ', '"' and ")"
\" # match a double quote
) # end non-capture group
(?= # begin a positive lookahead
.*\) # match any number of characters followed by a right paren
) # end positive lookahead
/x # free-spacing regex definition mode
Conventionally this is written
r = /(?<=LOREM \(|, )(?:[^, ")]+|\"[^, ")]+\")(?=.*\))/
Let's try it.
str = "some other text in downcase LOREM (foo, \"bar\", \"baz), daz"
str.scan(r)
#=> ["foo", "\"bar\""]
The first match, "foo", matches
str.scan /(?<=LOREM \()[^, ")]+/
#=> ["foo"]
That is, this matches one or more characters other than a comma, space, double quote or left parenthesis, immediately preceded by "LOREM " followed by a left parenthesis.
The next attempted match begins at the end of "foo". There is no match of "L" in "LOREM" so an attempt is made to match ", ", which is met with success. [^, ")]+ does not match "bar", so an attempt is made to match \"[^, ")]+\", which is successful. As ", " is matched within the lookaround it is not part of the match returned. This matches '"bar"'.
\"baz is not matched because it has no closing double quote.

Regular expression captures unwanted string

I have created the following expression: (.NET regex engine)
((-|\+)?\w+(\^\.?\d+)?)
hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121
It works well except that :
it captures the term 'hello+' but without the '+' : this group should not be captured at all
the last term 'hello^-1212121' as 2 groups 'hello' and '-1212121' both should be ignored
The strings to capture are as follows :
word can have a + or a - before it
or word can have a ^ that is followed by a positive number (not necessarily an integer)
words are separated by commas and any number of white spaces (both not part of the capture)
A few examples of valid strings to capture :
hello^2
hello^.2
+hello
-hello
hello
EDIT
I have found the following expression which effectively captures all these terms, it's not really optimized but it just works :
([a-zA-Z]+(?= ?,))|((-|\+)[a-zA-Z]+(?=,))|([a-zA-Z]+\^\.?\d+)
Ok, there are some issues to tackle here:
((-|+)?\w+(\^.?\d+)?)
^ ^
The + and . should be escaped like this:
((-|\+)?\w+(\^\.?\d+)?)
Now, you'll also get -1212121 there. If your string hello is always letters, then you would change \w to [a-zA-Z]:
((-|\+)?[a-zA-Z]+(\^\.?\d+)?)
\w includes letters, numbers and underscore. So, you might want to restrict it down a bit to only letters.
And finally, to take into consideration of the completely not capturing groups, you'll have to use lookarounds. I don't know of anyway otherwise to get to the delimiters without hindering the matches:
(?<=^|,)\s*((-|\+)?[a-zA-Z]+(\^\.?\d+)?)\s*(?=,|$)
EDIT: If it cannot be something like -hello^2, and if another valid string is hello^9.8, then this one will fit better:
(?<=^|,)\s*((?:-|\+)?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)(?=\s*(?:,|$))
And lastly, if capturing the words is sufficient, we can remove the lookarounds:
([-+]?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)
It would be better if you first state what it is you are looking to extract.
You also don't indicate which Regular Expression engine you're using, which is important since they vary in their features, but...
Assuming you want to capture only:
words that have a leading + or -
words that have a trailing ^ followed by an optional period followed by one or more digits
and that words are sequences of one or more letters
I'd use:
([a-zA-Z]+\^\.?\d+|[-+][a-zA-Z]+)
which breaks down into:
( # start capture group
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
\^ # literal
\.? # optional period
\d+ # one or more digits
| # OR
[+-]? # optional plus or minus
[a-zA-Z]+ # one or more letters or underscores
) # end of capture group
EDIT
To also capture plain words (without leading or trailing chars) you'll need to rearrange the regexp a little. I'd use:
([+-][a-zA-Z]+|[a-zA-Z]+\^(?:\.\d+|\d+\.\d+|\d+)|[a-zA-Z]+)
which breaks down into:
( # start capture group
[+-] # literal plus or minus
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
| # OR
[a-zA-Z]+ # one or more letters
\^ # literal
(?: # start of non-capturing group
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
) # end of non-capturing group
| # OR
[a-zA-Z]+ # one or more letters
) # end of capture group
Also note that, per your updated requirements, this regexp captures both true non-negative numbers (i.e. 0, 1, 1.2, 1.23) as well as those lacking a leading digit (i.e. .1, .12)
FURTHER EDIT
This regexp will only match the following patterns delimited by commas:
word
word with leading plus or minus
word with trailing ^ followed by a positive number of the form \d+, \d+.\d+, or .\d+
([+-][A-Za-z]+|[A-Za-z]+\^(?:.\d+|\d+(?:.\d+)?)|[A-Za-z]+)(?=,|\s|$)
Please note that the useful match will appear in the first capture group, not the entire match.
So, in Javascript, you'd:
var src="hello , hello ,hello,+hello,-hello,hello+,hello-,hello^1,hello^1.0,hello^.1",
RE=/([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)/g;
while(RE.test(src)){
console.log(RegExp.$1)
}
which produces:
hello
hello
hello
+hello
-hello
hello^1
hello^1.0
hello^.1