why adding group to my regex changes what it catches - regex

I have the line:
[asos-qa:2021:5]#0 Row[info=[ts=-9223372036854775808] ]: 6, 23 |
I want to get the first word: asos-qa, so I tried this regex: ^\[\S*?(:|]) and it gets me: [asos-qa:.
So in order to get only the word without the other characters I tried to add a group (python syntax): ^\[(?P<app_id>\S*)?(:|]) but for some reason it returns [asos-qa:2021:5].
What am I doing wrong?

Your ^\[(?P<app_id>\S*)?(:|]) regex returns [asos-qa:2021:5] because \S* greedily matches zero or more non-whitespace chars, up to the last available : or ] in the current chunk of non-whitespace chars. The ? you used applies to the whole (?P<app_id>\S*) group, not to \S*, and is also greedy, i.e. the regex engine tries to match the group pattern at least once before skipping it.
You need
^\[(?P<app_id>[^]\s:]+)
See the regex demo. Details:
^ - start of string
\[ - a [ char
(?P<app_id>[^]\s:]+) - Group "app_id": any one or more chars other than ], whitespace and :. NOTE: ] does not need to be escaped when it is the first char in the character class.
See the Python demo:
import re
pattern = r"^\[(?P<app_id>[^]\s:]+)"
text = "[asos-qa:2021:5]#0 Row[info=[ts=-9223372036854775808] ]: 6, 23 |"
m = re.search(pattern, text)
if m:
    print(m.group(1))
# => asos-qa
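To see the difference side by side, here is a small sketch that runs the two patterns from the question, plus a variant with the lazy quantifier moved inside the group, against the sample line. The variable names are only illustrative.

```python
import re

text = "[asos-qa:2021:5]#0 Row[info=[ts=-9223372036854775808] ]: 6, 23 |"

# Lazy \S*? stops at the first ':' or ']', but the overall match
# still includes the leading '[' and the delimiter.
lazy_no_group = re.search(r"^\[\S*?(:|])", text).group(0)

# Here '?' makes the whole group optional; \S* inside it stays greedy.
greedy_group = re.search(r"^\[(?P<app_id>\S*)?(:|])", text)

# Moving '?' inside the group makes \S* lazy again.
lazy_group = re.search(r"^\[(?P<app_id>\S*?)(:|])", text)

print(lazy_no_group)                 # [asos-qa:
print(greedy_group.group(0))         # [asos-qa:2021:5]
print(greedy_group.group("app_id"))  # asos-qa:2021:5
print(lazy_group.group("app_id"))    # asos-qa
```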

Your pattern uses a greedy \S* which matches as many non-whitespace characters as possible.
You can make it non-greedy using \S*? like ^\[(?P<app_id>\S*?)(:|]) which will have the value in capture group 1 (the named group app_id).
Or you can use a negated character class not matching : assuming the closing ] will be there.
^\[(?P<app_id>[^:]+)
Example code
import re
pattern = r"\[(?P<app_id>[^:]+)"
s = "[asos-qa:2021:5]#0 Row[info=[ts=-9223372036854775808] ]: 6, 23 |"
match = re.match(pattern, s)
if match:
    print(match.group("app_id"))
Output
asos-qa
Or matching only words characters with an optional hyphen in between:
^\[(?P<app_id>\w+(?:-\w+)*)[^]\[]*]
Regex demo
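The negated-class patterns behave the same on the sample line but differ on other inputs. The sketch below compares them on the original text and on a hypothetical line with no colon inside the brackets (that second input is an assumption, not from the question). The helper name extract is illustrative.

```python
import re

patterns = {
    "no_colon_class": r"^\[(?P<app_id>[^:]+)",
    "strict_class": r"^\[(?P<app_id>[^]\s:]+)",
    "word_chars": r"^\[(?P<app_id>\w+(?:-\w+)*)[^]\[]*]",
}

def extract(pattern, line):
    """Return the app_id group, or None when the pattern does not match."""
    m = re.match(pattern, line)
    return m.group("app_id") if m else None

with_colon = "[asos-qa:2021:5]#0 Row"
without_colon = "[asos-qa]#0 Row"  # hypothetical input without a ':'

results = {
    name: (extract(p, with_colon), extract(p, without_colon))
    for name, p in patterns.items()
}
for name, pair in results.items():
    print(name, pair)
# no_colon_class ('asos-qa', 'asos-qa]#0 Row')
# strict_class ('asos-qa', 'asos-qa')
# word_chars ('asos-qa', 'asos-qa')
```

The [^:]+ variant relies on a colon following the value; without one it runs on past the closing bracket.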

Related

Regex ignoring matches between square brackets

Hi, I'm trying to create a regex to help separate a string into a series of object fields, but I'm having issues where the individual field values are themselves lists and are therefore comma-separated internally.
string = "field1:1234,field2:[[1, 3],[3,4]], field3:[[1, 3],[3,4]]"
I want the regex to identify only the commas before "field2" and "field3", ignoring the ones separating the list values (e.g. 1 and 3, ] and [, 3 and 4).
I've tried using non-capturing groups to ignore the character after the commas (e.g. (,)([?!a-z]) ) but given I'm running this in Kotlin I don't think non-capturing and group separation is useful.
Is there a way to ignore string values between specified characters? E.g. ignore anything between "[[" and "]]" would work here.
Any help appreciated.
You can tweak the existing Java recursion mimicking regex to extract all the matches you need:
val rx = """\w+:(?:(?=\[)(?:(?=.*?\[(?!.*?\1)(.*\](?!.*\2).*))(?=.*?\](?!.*?\2)(.*)).)+?.*?(?=\1)[^\[]*(?=\2$)|\w+)""".toRegex()
val matches = rx.findAll(string).map{it.value}.joinToString("\n")
See the regex demo. Quick details:
\w+ - one or more letters, digits, underscores
: - a colon
(?: - start of a non-capturing group matching either
(?=\[)(?:(?=.*?\[(?!.*?\1)(.*\](?!.*\2).*))(?=.*?\](?!.*?\2)(.*)).)+?.*?(?=\1)[^\[]*(?=\2$) - a substring between two paired [ and ]
| - or
\w+ - one or more word chars
) - end of the non-capturing group.
See the Kotlin demo:
val string = "field1:1234,field2:[[1, 3],[3,4]], field3:[[1, 3],[3,4]]"
val rx = """\w+:(?:(?=\[)(?:(?=.*?\[(?!.*?\1)(.*\](?!.*\2).*))(?=.*?\](?!.*?\2)(.*)).)+?.*?(?=\1)[^\[]*(?=\2$)|\w+)""".toRegex()
print( rx.findAll(string).map{it.value}.joinToString("\n") )
Output:
field1:1234
field2:[[1, 3],[3,4]]
field3:[[1, 3],[3,4]]
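If a regex is not a hard requirement, a plain bracket-depth scan is an alternative to the recursion-mimicking pattern above. This is a different technique from the answer's regex, sketched in Python rather than Kotlin; the function name split_top_level is hypothetical, and it assumes the brackets in the input are balanced.

```python
def split_top_level(text, sep=","):
    """Split text at separators that are not nested inside [ ... ]."""
    parts = []
    depth = 0   # current bracket nesting level
    start = 0   # start index of the field being collected
    for i, ch in enumerate(text):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    return parts

string = "field1:1234,field2:[[1, 3],[3,4]], field3:[[1, 3],[3,4]]"
fields = split_top_level(string)
print("\n".join(fields))
# field1:1234
# field2:[[1, 3],[3,4]]
# field3:[[1, 3],[3,4]]
```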

Regex to extract the characters [duplicate]

I have a text like this;
[Some Text][1][Some Text][2][Some Text][3][Some Text][4]
I want to match [Some Text][2] with this regex;
/\[.*?\]\[2\]/
But it returns [Some Text][1][Some Text][2]
How can i match only [Some Text][2]?
Note : There can be any character in Some Text including [ and ] And the numbers in square brackets can be any number not only 1 and 2. The Some Text that i want to match can be at the beginning of the line and there can be multiple Some Texts
JSFiddle
The \[.*?\]\[2\] pattern works like this:
\[ - finds the leftmost [ (as the regex engine processes the string input from left to right)
.*? - matches any 0+ chars other than line break chars, as few as possible, but as many as needed for a successful match, as there are subsequent patterns, see below
\]\[2\] - ][2] substring.
So, the .*? gets expanded upon each failure until it finds the leftmost ][2]. Note the lazy quantifiers do not guarantee the "shortest" matches.
Solution
Instead of a .*? (or .*) use negated character classes that match any char but the boundary char.
\[[^\]\[]*\]\[2\]
See this regex demo.
Here, .*? is replaced with [^\]\[]* - 0 or more chars other than ] and [.
Other examples:
Strings between angle brackets: <[^<>]*> matches <...> with no < and > inside
Strings between parentheses: \([^()]*\) matches (...) with no ( and ) inside
Strings between double quotation marks: "[^"]*" matches "..." with no " inside
Strings between curly braces: \{[^{}]*} matches {...} with no { and } inside
In other situations, when the starting pattern is a multichar string or complex pattern, use a tempered greedy token, (?:(?!start).)*?. To match abc 1 def in abc 0 abc 1 def, use abc(?:(?!abc).)*?def.
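The three behaviours described above can be checked quickly. This sketch uses Python's re module instead of JavaScript, but the patterns are the ones from the answer and behave the same way here.

```python
import re

text = "[Some Text][1][Some Text][2][Some Text][3][Some Text][4]"

# Lazy .*? starts at the leftmost '[' and expands until ][2] is found.
lazy = re.search(r"\[.*?\]\[2\]", text).group(0)

# The negated class cannot cross a '[' or ']', so only one pair is matched.
negated = re.search(r"\[[^\]\[]*\]\[2\]", text).group(0)

# Tempered greedy token: each char is consumed only if 'abc' does not start there.
tempered = re.search(r"abc(?:(?!abc).)*?def", "abc 0 abc 1 def").group(0)

print(lazy)      # [Some Text][1][Some Text][2]
print(negated)   # [Some Text][2]
print(tempered)  # abc 1 def
```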
You could try the below regex,
(?!^)(\[[A-Z].*?\]\[\d+\])

Replace N spaces at the beginning of a line with N characters

I am looking for a regex substitution to transform N white spaces at the beginning of a line to N . So this text:
list:
- first
should become:
list:
- first
I have tried:
str = "list:\n - first"
str.gsub(/(?<=^) */, " ")
which returns:
list:
- first
which is missing one . How to improve the substitution to get the desired output?
You could make use of the \G anchor and \K to reset the starting point of the reported match.
To match all leading single spaces:
(?:\R\K|\G)
(?: Non capture group
\R\K Match a newline and clear the match buffer
| Or
\G Assert the position at the end of the previous match
) Close non capture group and match a space
See a regex demo and a Ruby demo.
To match only the single leading spaces in the example string:
(?:^.*:\R|\G)\K
In parts, the pattern matches:
(?: Non capture group
^.*:\R Match a line that ends with : and match a newline
| Or
\G Assert the position at the end of the previous match, or at the start of the string
) Close non capture group
\K Forget what is matched so far and match a space
See a regex demo and a Ruby demo.
Example
re = /(?:^.*:\R|\G)\K /
str = 'list:
- first'
result = str.gsub(re, ' ')
puts result
Output
list:
- first
I would write
"list:\n - first".gsub(/^ +/) { |s| ' ' * s.size }
#=> "list:\n - first"
See String#*
Use gsub with a callback function:
str = "list:\n - first"
output = str.gsub(/(?<=^|\n)[ ]+/) {|m| m.gsub(" ", " ") }
This prints:
list:
- first
The pattern (?<=^|\n)[ ]+ captures one or more spaces at the start of a line. This match then gets passed to the callback, which replaces each space, one at a time, with .
You can use a short /(?:\G|^) / regex with a plain text replacement pattern:
result = text.gsub(/(?:\G|^) /, ' ')
See the regex demo. Details:
(?:\G|^) - start of a line or string or the end of the previous match
- a space.
See a Ruby demo:
str = "list:\n - first"
result = str.gsub(/(?:\G|^) /, ' ')
puts result
# =>
# list:
# - first
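If you need to match any whitespace, replace the literal space with a \s pattern. Note that in Ruby \h matches a hex digit, not horizontal whitespace, so use [ \t] or [[:blank:]] if you need to match only horizontal whitespace.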
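For comparison, the callback approach from the answers above carries over to Python, whose built-in re module supports neither \G nor \K. This is a minimal sketch: the function name fill_leading_spaces and the "*" replacement token are hypothetical stand-ins, so substitute whatever string should replace each space.

```python
import re

FILL = "*"  # hypothetical replacement token, one per leading space

def fill_leading_spaces(text, fill=FILL):
    """Replace each leading space on every line with one copy of fill."""
    return re.sub(
        r"^ +",                               # run of spaces at a line start
        lambda m: fill * len(m.group(0)),     # N spaces -> N copies of fill
        text,
        flags=re.MULTILINE,
    )

result = fill_leading_spaces("list:\n  - first")
print(result)
# list:
# **- first
```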

Extracting only the first occurrence between square brackets in a line

I want to extract only the value between square brackets in a given line.
From the text
TID: [-1] [] [2019-07-29 10:18:41,876] INFO
I want to extract the first occurrence between square brackets which is -1.
I tried using
(?<Ten ID>((^(?!(TID: )))*((?<=\[).*?(?=\]))))
but it gives
-1, ,2019-07-29 10:18:41,876
as resultant matches.
How to capture only the first occurrence?
You can access the regex editor here.
Regarding
Is there a solution without group capturing?
You may use
/\bTID:\s*\[\K[^\]]+(?=\])/
See the Rubular demo
Details
\bTID: - whole word TID followed with a colon
\s* - 0+ whitespace chars
\[ - a [ char
\K - match reset operator that discards the text matched so far
[^\]]+ - one or more chars other than ]
(?=\]) - a positive lookahead that makes sure there is a ] char immediately to the right of the current location.
You might capture the first occurrence in the named capturing group using a negated character class:
\ATID: \[(?<Ten ID>[^\[\]]+)\]
\A Start of string
TID: Match literally
\[ Match [
(?<Ten ID> Named capturing group Ten ID
[^\[\]]+ Match not [ or ] using a negated character class
) Close group
\] Match ]
See https://rubular.com/r/4Hc80yrDxGVgvi
str = "TID:] [-1] [] [2019-07-29 10:18:41,876] INFO"
i1 = str.index('[')
#=> 6
i2 = str.index(']', i1+1)
#=> 9
i1.nil? || i2.nil? ? nil : str[i1+1..i2-1]
#=> "-1"
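The same ideas can be sketched in Python. Its built-in re module has no \K, so a capturing group is used in its place; the variable names are illustrative. The first call also shows why an unanchored lookaround pattern yields three results.

```python
import re

line = "TID: [-1] [] [2019-07-29 10:18:41,876] INFO"

# Unanchored search returns every bracketed value, including the empty one.
all_brackets = re.findall(r"\[(.*?)\]", line)

# Anchoring on the TID: prefix restricts the match to the first bracket pair.
m = re.search(r"\bTID:\s*\[([^\]]+)\]", line)
first_value = m.group(1) if m else None

# Index-based variant without any regex.
start = line.find("[")
end = line.find("]", start + 1) if start != -1 else -1
by_index = line[start + 1:end] if start != -1 and end != -1 else None

print(all_brackets)  # ['-1', '', '2019-07-29 10:18:41,876']
print(first_value)   # -1
print(by_index)      # -1
```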

re.findall() equivalent to a string.split() loop with inner search

Is there a regex string <regex> such that re.findall(r'<regex>', doc) will return the same result as the following code?
doc = ' th_is is stuff. and2 3things if y4ou kn-ow ___ whaaaat iii mean)'
new_doc = []
for word in re.split(r'\s+', doc.strip()):
    if not re.search(r'(.)\1{2,}|[_\d\W]+', word):
        new_doc.append(word)
>>> new_doc
['is', 'if']
Perhaps, your current way of getting the matches is the best.
You can't do that without some additional operation, e.g. list comprehension, because re.findall with a pattern that contains a capturing group outputs the captured substrings in the resulting list.
Thus, you may either add an outer capturing group and use re.findall (taking the first item of each tuple), or use re.finditer and take the whole match (group 0) from each result, using
(?<!\S)(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W])\S+
See this regex demo.
Details
(?<!\S) - a whitespace or start of string must be immediately to the left of the current location
(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W]) - after any 0+ non-whitespace chars immediately to the right of the current location, there cannot be 3 identical non-whitespace chars in a row, or a char that is a _, a digit or any non-word char other than whitespace
\S+ - 1+ non-whitespace chars.
See the Python demo:
import re
doc = ' th_is is stuff. and2 3things if y4ou kn-ow ___ whaaaat iii mean)'
new_doc = [x.group(0) for x in re.finditer(r'(?<!\S)(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W])\S+', doc)]
print(new_doc) # => ['is', 'if']
new_doc2 = re.findall(r'(?<!\S)((?!\S*(\S)\2{2}|\S*(?!\s)[_\d\W])\S+)', doc)
print([x[0] for x in new_doc2]) # => ['is', 'if']