I am looking for a regex substitution that transforms N whitespace characters at the beginning of a line into N non-breaking spaces. So this text:
list:
- first
should become:
list:
- first
I have tried:
str = "list:\n - first"
str.gsub(/(?<=^) */, " ")
which returns:
list:
- first
which is missing one non-breaking space. How can I improve the substitution to get the desired output?
You could make use of the \G anchor and \K to reset the starting point of the reported match.
To match all leading single spaces:
(?:\R\K|\G)
(?: Non capture group
\R\K Match a newline and clear the match buffer
| Or
\G Assert the position at the end of the previous match
) Close non capture group and match a space
See a regex demo and a Ruby demo.
To match only the single leading spaces in the example string:
(?:^.*:\R|\G)\K
In parts, the pattern matches:
(?: Non capture group
^.*:\R Match a line that ends with : and match a newline
| Or
\G Assert the position at the end of the previous match, or at the start of the string
) Close non capture group
\K Forget what is matched so far and match a space
See a regex demo and a Ruby demo.
Example
re = /(?:^.*:\R|\G)\K /
str = 'list:
- first'
result = str.gsub(re, ' ')
puts result
Output
list:
- first
I would write
"list:\n - first".gsub(/^ +/) { |s| ' ' * s.size }
#=> "list:\n - first"
See String#*
Use gsub with a callback function:
str = "list:\n - first"
output = str.gsub(/(?<=^|\n)[ ]+/) {|m| m.gsub(" ", " ") }
This prints:
list:
- first
The pattern (?<=^|\n)[ ]+ matches one or more spaces at the start of a line. This match then gets passed to the block, which replaces each space, one at a time, with a non-breaking space.
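The callback idea is not specific to Ruby. As a rough sketch in Python (not part of the original answers), re.sub accepts a function as the replacement. The helper name indent_to_nbsp and the literal string &nbsp; used as the replacement are assumptions made for illustration only:

```python
import re


def indent_to_nbsp(text, replacement="&nbsp;"):
    """Replace each leading space of every line with `replacement`.

    `replacement` is an assumed target string; pass the U+00A0 character
    instead if a real non-breaking space is wanted.
    """
    # re.MULTILINE makes ^ match at the start of every line, so " +" only
    # ever covers a run of leading spaces. The callback repeats the
    # replacement once per matched space.
    return re.sub(
        r"^ +",
        lambda m: replacement * len(m.group()),
        text,
        flags=re.MULTILINE,
    )


print(indent_to_nbsp("list:\n  - first"))
```

With two leading spaces on the second line, this prints list: followed by &nbsp;&nbsp;- first. Spaces that are not at the start of a line are left alone.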
You can use a short /(?:\G|^) / regex with a plain text replacement pattern:
result = text.gsub(/(?:\G|^) /, ' ')
See the regex demo. Details:
(?:\G|^) - start of a line or string or the end of the previous match
" " - a literal space.
See a Ruby demo:
str = "list:\n - first"
result = str.gsub(/(?:\G|^) /, ' ')
puts result
# =>
# list:
# - first
If you need to match any whitespace, replace the literal space with a \s pattern. Or use \h if you only need to match horizontal whitespace.
I have the line:
[asos-qa:2021:5]#0 Row[info=[ts=-9223372036854775808] ]: 6, 23 |
I want to get the first word: asos-qa, so I tried this regex: ^\[\S*?(:|]) and it gets me: [asos-qa:.
So in order to get only the word without the other characters I tried to add a group (python syntax): ^\[(?P<app_id>\S*)?(:|]) but for some reason it returns [asos-qa:2021:5].
What am I doing wrong?
Your ^\[(?P<app_id>\S*)?(:|]) regex returns [asos-qa:2021:5] because \S* greedily matches zero or more non-whitespace chars, up to the last available : or ] in the current chunk of non-whitespace chars. The ? you used applies to the whole (?P<app_id>\S*) group pattern and is also greedy, i.e. the regex engine tries to match the group pattern at least once before skipping it.
You need
^\[(?P<app_id>[^]\s:]+)
See the regex demo. Details:
^ - start of string
\[ - a [ char
(?P<app_id>[^]\s:]+) - Group "app_id": any one or more chars other than ], whitespace and :. NOTE: ] does not need to be escaped when it is the first char in the character class.
See the Python demo:
import re
pattern = r"^\[(?P<app_id>[^]\s:]+)"
text = "[asos-qa:2021:5]#0 Row[info=[ts=-9223372036854775808] ]: 6, 23 |"
m = re.search(pattern, text)
if m:
    print(m.group(1))
# => asos-qa
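To make the greedy behaviour described above concrete, here is a small sketch that runs the question's pattern next to a lazy variant on the same input. The variable names greedy and lazy are only illustrative:

```python
import re

text = "[asos-qa:2021:5]#0 Row[info=[ts=-9223372036854775808] ]: 6, 23 |"

# Greedy: \S* first grabs the whole non-whitespace chunk, then backtracks
# only as far as the LAST ':' or ']' that lets the rest of the pattern match.
greedy = re.search(r"^\[(?P<app_id>\S*)?(:|])", text)

# Lazy: \S*? grows one character at a time and stops at the FIRST ':' or ']'.
lazy = re.search(r"^\[(?P<app_id>\S*?)(:|])", text)

print(greedy.group(0), greedy.group("app_id"))
print(lazy.group(0), lazy.group("app_id"))
```

The greedy version reports [asos-qa:2021:5] as the whole match, while the lazy version stops at [asos-qa: and captures asos-qa in the group.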
Your pattern uses a greedy \S*, which matches as many non-whitespace characters as possible.
You can make it non greedy using \S*? like ^\[(?P<app_id>\S*?)(:|]) which will have the value in capture group 1.
Or you can use a negated character class not matching : assuming the closing ] will be there.
^\[(?P<app_id>[^:]+)
Regex demo | Python demo
Example code
import re
pattern = r"\[(?P<app_id>[^:]+)"
s = "[asos-qa:2021:5]#0 Row[info=[ts=-9223372036854775808] ]: 6, 23 |"
match = re.match(pattern, s)
if match:
    print(match.group("app_id"))
Output
asos-qa
Or match only word characters with optional hyphens in between:
^\[(?P<app_id>\w+(?:-\w+)*)[^]\[]*]
Regex demo
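A minimal Python sketch of that last pattern, assuming the same sample line as above. The helper name first_word is hypothetical:

```python
import re

# \w+(?:-\w+)* accepts word characters with single hyphens in between;
# [^]\[]*] then skips to the closing bracket. Inside the character class,
# "]" is literal because it comes first, and "[" is escaped.
PATTERN = re.compile(r"^\[(?P<app_id>\w+(?:-\w+)*)[^]\[]*]")


def first_word(line):
    """Return the leading bracketed word, or None when the line does not match."""
    m = PATTERN.match(line)
    return m.group("app_id") if m else None


s = "[asos-qa:2021:5]#0 Row[info=[ts=-9223372036854775808] ]: 6, 23 |"
print(first_word(s))
```

Unlike the negated character class [^:]+, this variant requires the closing ] to be present, so a line such as [asos-qa:2021 without it yields None.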
I have a file with text like this:
"Title" = "Body"
And I would like to remove both " before the =, to leave it like this:
Title = "Body"
So far I managed to select the first block of text with:
.+(=)
That selects everything up to the =, but I can't find how to replace (or delete) both " .
Any suggestions?
You could use a capture group in the replacement, and match the double quotes to be removed while asserting an equals sign at the right.
Find what:
"([^"]+)"(?=\h*=)
" Match literally
([^"]+) Capture group 1, match 1+ times any char other than "
" Match literally
(?=\h*=) Positive lookahead, assert an = sign at the right
Regex demo
Replace with:
$1
To match the whole pattern from the start to the end of the string, you might also use 2 capture groups and use those in the replacement.
^"([^"]+)"(\h*=\h*"[^"]+")$
Regex demo
In the replacement use $1$2
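The two find/replace pairs above can be tried outside the editor as well. The following is a sketch in Python under two stated adaptations: Python's re module does not support \h, so [ \t] stands in for it, and backreferences in the replacement are written \1 and \2 instead of $1 and $2:

```python
import re

line = '"Title" = "Body"'

# Variant 1: strip the quotes around a quoted token that is followed by "=".
# The lookahead only checks for the "=", it does not consume it.
unquoted = re.sub(r'"([^"]+)"(?=[ \t]*=)', r"\1", line)

# Variant 2: match the whole line with two groups and rebuild it from them.
rebuilt = re.sub(r'^"([^"]+)"([ \t]*=[ \t]*"[^"]+")$', r"\1\2", line)

print(unquoted)
print(rebuilt)
```

Both variants leave the quoted value on the right of the = untouched, because the lookahead (variant 1) or the second group (variant 2) keeps that part as it is.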
You can use
(?:\G(?!^)|^(?=.*=))[^"=\v]*\K"
Replace with an empty string.
Details:
(?:\G(?!^)|^(?=.*=)) - end of the previous successful match (\G(?!^)) or (|) start of a line that contains = somewhere on it (^(?=.*=))
[^"=\v]* - any zero or more chars other than ", = and vertical whitespace
\K - omit the text matched
" - a " char (matched, consumed and removed)
Is there a regex string <regex> such that re.findall(r'<regex>', doc) will return the same result as the following code?
doc = ' th_is is stuff. and2 3things if y4ou kn-ow ___ whaaaat iii mean)'
new_doc = []
for word in re.split(r'\s+', doc.strip()):
    if not re.search(r'(.)\1{2,}|[_\d\W]+', word):
        new_doc.append(word)
>>> new_doc
['is', 'if']
Perhaps, your current way of getting the matches is the best.
You can't do that without some additional operation, e.g. list comprehension, because re.findall with a pattern that contains a capturing group outputs the captured substrings in the resulting list.
Thus, you may either add an outer capturing group and use re.findall, or use re.finditer and take the whole match (group 0) of each result. Either way, the pattern is
(?<!\S)(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W])\S+
See this regex demo.
Details
(?<!\S) - a whitespace or start of string must be immediately to the left of the current location
(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W]) - immediately to the right of the current location, after any 0+ non-whitespace chars, there cannot be 3 identical consecutive non-whitespace chars, nor a char that is a _, a digit or any non-word char other than whitespace
\S+ - 1+ non-whitespace chars.
See the Python demo:
import re
doc = ' th_is is stuff. and2 3things if y4ou kn-ow ___ whaaaat iii mean)'
new_doc = [x.group(0) for x in re.finditer(r'(?<!\S)(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W])\S+', doc)]
print(new_doc) # => ['is', 'if']
new_doc2 = re.findall(r'(?<!\S)((?!\S*(\S)\2{2}|\S*(?!\s)[_\d\W])\S+)', doc)
print([x[0] for x in new_doc2]) # => ['is', 'if']
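As a sanity check, the loop from the question and the single-pattern version can be run side by side. This is only a sketch: the function names filter_with_loop and filter_with_regex are made up here, and the comparison covers just the sample string, not every possible input:

```python
import re

PATTERN = re.compile(r'(?<!\S)(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W])\S+')


def filter_with_loop(doc):
    """The split-and-test approach from the question."""
    return [
        word
        for word in re.split(r'\s+', doc.strip())
        if not re.search(r'(.)\1{2,}|[_\d\W]+', word)
    ]


def filter_with_regex(doc):
    """The single-pattern approach, taking the whole match of each result."""
    return [m.group(0) for m in PATTERN.finditer(doc)]


doc = ' th_is is stuff. and2 3things if y4ou kn-ow ___ whaaaat iii mean)'
print(filter_with_loop(doc))
print(filter_with_regex(doc))
```

Using finditer with group(0) sidesteps the fact that findall would return the inner capturing group instead of the whole word.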
I want to keep only the last term of a string separated by dots
Example:
My string is:
abc"val1.val2.val3.val4"zzz
Expected string after I use the regex:
abc"val4"zzz
Which means I want to keep only the last of the parts that are separated by dots (.)
The most relevant I tried was
val json="""abc"val1.val2.val3.val4"zzz"""
val sortie="""(([A-Za-z0-9]*)\.([A-Za-z0-9]*){2,10})\.([A-Za-z0-9]*)""".r.replaceAllIn(json, a=> a.group(3))
the result was:
abc".val4"zzz
Can you tell me if you have different solution for regex please?
Thanks
You may use
val s = """abc"val1.val2.val3.val4"zzz"""
val res = "(\\w+\")[^\"]*\\.([^\"]*\")".r replaceAllIn (s, "$1$2")
println(res)
// => abc"val4"zzz
See the Scala demo
Pattern details:
(\\w+\") - Group 1 capturing 1+ word chars and a "
[^\"]* - 0+ chars other than "
\\. - a dot
([^\"]*\") - Group 2 capturing 0+ chars other than " and then a ".
The $1 is the backreference to the first group and $2 inserts the text inside Group 2.
Maybe without Regex at all:
scala> json.split("\"").map(_.split("\\.").last).mkString("\"")
res4: String = abc"val4"zzz
This assumes you want each "token" (separated by ") to become the last dot-separated inner token.
I am not good at regex and need to update the following pattern without affecting other patterns. Any suggestions? The match starts with 1 to 4 $ signs. The $ signs are always at the beginning of the line (leading spaces may or may not be present).
import re
data = " $$$AKL_M0_90_2K: Two line end vias (VIAG, VIAT and/or"
patt = '^ (?:ABC *)?([A-Za-z0-9/\._\:]+)\s*: ? '
match = re.findall( patt, data, re.M )
print match
Note: data is a multi-line string.
match should contain this result: "$$$AKL_M0_90_2K"
I suggest the following solution (see IDEONE demo):
import re
data = r" $$$AKL_M0_90_2K: Two line end vias (VIAG, VIAT and/or"
patt = r'^\s*([$]{1,4}[^:]+)'
match = re.findall( patt, data, re.M )
print(match)
The re.findall will return the list with just one match. The ^\s*([$]{1,4}[^:]+) regex matches:
^ - start of a line (you use re.M)
\s* - zero or more whitespaces
([$]{1,4}[^:]+) - Group 1 capturing 1 to 4 $ symbols, and then one or more characters other than :.
See the regex demo
If you need to keep your own regex, just do one of the following:
Add $ to the character class (demo): ^ (?:ABC *)?([$A-Za-z0-9/._:]+)\s*: ?
Add an alternative to the first non-capturing group and place it at the start of the capturing one (demo): ^ ((?:ABC *|[$]{1,4})?[A-Za-z0-9/._:]+)\s*: ?
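The three candidate patterns can be compared on the sample line with a short script. This is a sketch, assuming the line starts with exactly one space as shown in the question; the dictionary keys are just labels chosen here:

```python
import re

data = " $$$AKL_M0_90_2K: Two line end vias (VIAG, VIAT and/or"

patterns = {
    # Dedicated pattern: optional whitespace, 1 to 4 "$", then everything up to ":".
    "dedicated": r'^\s*([$]{1,4}[^:]+)',
    # Original pattern with "$" added to the character class.
    "dollar_in_class": r'^ (?:ABC *)?([$A-Za-z0-9/._:]+)\s*: ?',
    # Original pattern with a "$" alternative moved inside the capturing group.
    "dollar_alternative": r'^ ((?:ABC *|[$]{1,4})?[A-Za-z0-9/._:]+)\s*: ?',
}

# re.M lets ^ match at the start of every line of a multi-line string.
results = {name: re.findall(p, data, re.M) for name, p in patterns.items()}
for name, found in results.items():
    print(name, found)
```

All three return a one-element list with $$$AKL_M0_90_2K for this input. Note that in the last variant the optional ABC prefix sits inside the capturing group, so on lines that start with ABC the prefix becomes part of the captured text.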