Python regular expression to find and replace multiple matches - regex

I have a string as follows which can have any number of spaces after the first [ or before the last ]:
my_string = " [ 0.53119281 1.53762345 ]"
I have a regular expression which matches and replaces each one individually as follows:
my_regex_start = "(\[\s+)" #Find square bracket and any number of white spaces
replaced_1 = re.sub(my_regex_start, '[', my_string) --> " [0.53119281 1.53762345 ]"
my_regex_end = "(\s+\])" #Find any number of white spaces and a square bracket
replaced_2 = re.sub(my_regex_end, ']', my_string) --> " [ 0.53119281 1.53762345]"
I have a regular expression which finds one OR the other:
my_regex_both = "(\[\s+)|(\s+\])" # Find a square bracket followed by any amount of whitespace OR any amount of whitespace followed by a square bracket
How can I use my_regex_both to replace either one, or both, when they are found?

Instead of catching the brackets, you can replace the spaces that are preceded by [ or followed by ] with an empty string:
import re
my_string = "[ 0.53119281 1.53762345 ]"
my_regex_both = r"(?<=\[)\s+|\s+(?=\])"
replaced = re.sub(my_regex_both, '', my_string)
print(replaced)
Output:
[0.53119281 1.53762345]
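If you would rather keep the original my_regex_both with its capturing groups, one possibility is to pass a function as the replacement, so that each match is swapped for the bracket it contains. This is only a minimal sketch; the helper name tidy_brackets is made up for illustration.

```python
import re

# The alternation from the question: group 1 is "[ + spaces", group 2 is "spaces + ]".
my_regex_both = r"(\[\s+)|(\s+\])"

def tidy_brackets(text):
    # Hypothetical helper: return "[" when group 1 matched, otherwise "]".
    return re.sub(my_regex_both, lambda m: "[" if m.group(1) else "]", text)

my_string = " [ 0.53119281 1.53762345 ]"
replaced = tidy_brackets(my_string)
print(repr(replaced))
```

Only the whitespace next to the brackets is touched, so a space in front of the opening bracket stays where it is.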

Another option you can use aside from MrGeek's answer would be to use a capture group to catch everything between your my_regex_start and my_regex_end like so:
import re
string1 = " [ 0.53119281 1.53762345 ]"
result = re.sub(r"(\[\s+)(.*?)(\s+\])", r"[\2]", string1)
print(result)
I have just sandwiched (.*?) between your two expressions. It lazily captures whatever lies between the brackets, and the replacement refers to that capture as \2.
OUTPUT
[0.53119281 1.53762345]

Related

Ruby Regex: empty space at beginning and end of string

I want to find all users with a first name that has an empty space at the beginning or ending.
It could look like: "Juliette " or " Juliette"
For now I only have the regex to match when the space is at the end of string:
^[ab]:[[:space:]]|$
I didn't find how to match the empty space at the beginning of the string and I don't know if it's possible to accomplish both of these conditions in one regex ?
Thanks for your help.
Test for Strippable Whitespace without Regexp
There's a little trick you can use with String#strip!, which returns nil if it can't find whitespace to strip. For example:
# return true if str has leading/trailing whitespace;
# otherwise returns false
def strippable? str
  { str => !!str.dup.strip! }
end
# leading space, trailing space, no space
test_values = [ ' foo', 'foo ', 'foo' ]
test_values.map { |str| strippable? str }
#=> [{" foo"=>true}, {"foo "=>true}, {"foo"=>false}]
This doesn't rely on a regular expression, but rather on properties of the String and the Boolean result of an inverted #strip!. Regardless of whether the Ruby engine uses regular expressions under the hood, these types of String methods are often faster than comparable Regexp matches, but your mileage and specific use cases may vary.
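For comparison, the same idea can be sketched in Python. There, str.strip() returns a new string instead of nil, so the test compares the stripped copy with the original. The function name strippable is an assumption carried over from the Ruby example above.

```python
def strippable(s):
    # True when s has leading or trailing whitespace, i.e. stripping changes it.
    return s != s.strip()

# leading space, trailing space, no space
test_values = [" foo", "foo ", "foo"]
print({s: strippable(s) for s in test_values})
```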
Alternatives with Regexp
Using the same test data as above, you could do something similar with a regular expression. For example:
# leading space, trailing space, no space
test_values = [ ' foo', 'foo ', 'foo' ]
# test start/end of string
test_values = [ ' foo', 'foo ', 'foo' ].grep /\A\s+|\s+\z/
#=> [" foo", "foo "]
# test start/end of line
test_values = [ ' foo', 'foo ', 'foo' ].grep /^\s+|\s+$/
#=> [" foo", "foo "]
Benchmarks
require 'benchmark'
ITERATIONS = 1_000_000
TEST_VALUES = [ ' foo', 'foo ', 'foo' ]
def regex_grep array
  array.grep /^\s+|\s+$/
end
def string_strip array
  array.map { |str| { str => !!str.dup.strip! } }
end
Benchmark.bmbm do |x|
  n = ITERATIONS
  x.report('regex') { n.times { regex_grep TEST_VALUES } }
  x.report('strip') { n.times { string_strip TEST_VALUES } }
end
user system total real
regex 1.539269 0.001325 1.540594 ( 1.541438)
strip 1.256836 0.001357 1.258193 ( 1.259955)
A quarter second over a million iterations may not seem like a big difference, but on significantly larger data sets or iterations it can add up. Whether or not it's enough for you to care for this particular use case is up to you, but the general pattern is that native String methods (regardless of how they're implemented by the interpreter under the hood) are generally faster than regular expression pattern matching. Of course there are edge cases, but that's what benchmarks are for!
You can use
/\A([a-zA-Z]+ | [a-zA-Z]+)\z/
/\A(?:[[:alpha:]]+[[:space:]]|[[:space:]][[:alpha:]]+)\z/
/\A(?:\p{L}+[\p{Z}\t]|[\p{Z}\t]\p{L}+)\z/
See the Rubular demo (with line anchors instead of string anchors used for the demo purposes)
Details:
\A - a string start anchor
(...) - a capturing group
(?:...) - a non-capturing group (it is preferred here since you are not extracting, just validating)
[a-zA-Z]+ - any one or more ASCII letters
\p{L}+ - any one or more Unicode letters
| - or
\z - end of string anchor.

Replace '-' with space if the next character is a letter not a digit and remove when it is at the start

I have a list of string i.e.
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
I want to remove the '-' when it is the first character of a string and is followed by letters rather than digits. If the '-' has a digit or letter before it and letters after it, it should be replaced with a space.
So for the list slist I want the output as
["args", "-111111", "20 args", "20 - 20", "20-10", "args deep"]
I have tried
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
nlist = list()
for estr in slist:
    nlist.append(re.sub("((^-[a-zA-Z])|([0-9]*-[a-zA-Z]))", "", estr))
print(nlist)
and i get the output
['rgs', '-111111', 'rgs', '20 - 20', '20-10', 'argseep']
You may use
nlist.append(re.sub(r"-(?=[a-zA-Z])", " ", estr).lstrip())
or
nlist.append(re.sub(r"-(?=[^\W\d_])", " ", estr).lstrip())
Result: ['args', '-111111', '20 args', '20 - 20', '20-10', 'args deep']
See the Python demo.
The -(?=[a-zA-Z]) pattern matches a hyphen before an ASCII letter (-(?=[^\W\d_]) matches a hyphen before any letter), and replaces the match with a space. Since - may be matched at the start of a string, the space may appear at that position, so .lstrip() is used to remove the space(s) there.
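Put together as a self-contained script, the lookahead approach might look like the sketch below. The list comprehension is just one way to write the loop from the question.

```python
import re

slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]

# Replace a hyphen with a space only when a letter follows it,
# then drop the space that a leading hyphen leaves behind.
nlist = [re.sub(r"-(?=[a-zA-Z])", " ", s).lstrip() for s in slist]
print(nlist)
```

Note that .lstrip() also removes any leading whitespace that was already in the input, which may or may not matter for your data.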
Here, we might just want to capture the first letter after a starting -, then replace it with that letter only, maybe with an i flag expression similar to:
^-([a-z])
DEMO
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"^-([a-z])"
test_str = ("-args\n"
"-111111\n"
"20-args\n"
"20 - 20\n"
"20-10\n"
"args-deep")
subst = "\\1"
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0, re.MULTILINE | re.IGNORECASE)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Demo
const regex = /^-([a-z])/gmi;
const str = `-args
-111111
20-args
20 - 20
20-10
args-deep`;
const subst = `$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
RegEx
If this expression wasn't desired, it can be modified or changed in regex101.com.
One option could be to do the replacement in two passes. First match the hyphen at the start of the string when only letters follow it:
^-(?=[a-zA-Z]+$)
Regex demo
In the replacement use an empty string.
Then capture one or more letters or digits in group 1, match the -, and capture one or more letters in group 2:
^([a-zA-Z0-9]+)-([a-zA-Z]+)$
Regex demo
In the replacement use r"\1 \2"
For example
import re
regex1 = r"^-(?=[a-zA-Z]+$)"
regex2 = r"^([a-zA-Z0-9]+)-([a-zA-Z]+)$"
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
slist = list(map(lambda s: re.sub(regex2, r"\1 \2", re.sub(regex1, "", s)), slist))
print(slist)
Result
['args', '-111111', '20 args', '20 - 20', '20-10', 'args deep']
Python demo

regex - how to specify the expressions to exclude

I need to replace two characters {, } with {\n, \n}.
But they must be not surrounded in '' or "".
I tried this code to achieve that
text = 'hello(){imagine{myString("HELLO, {WORLD}!")}}'
replaced = re.sub(r'{', "{\n", text)
Ellipsis...
Naturally, this code also replaces curly brackets that are enclosed in quote marks.
What are the negative statements like ! or not that can be used in regular expressions?
And the following is what I wanted.
hello(){
imagine{
puts("{HELLO}")
}
}
In a nutshell - what I want to do is
Search { and }.
If that is not enclosed in '' or ""
replace { or } to {\n or \n}
In the opposite case, I can solve it with (?P<a>\".*){(?P<b>.*?\").
But I have no clue how I can solve it in my case.
First replace all { characters with {\n. This also turns "{ into "{\n, so afterwards replace every "{\n back with "{.
text = 'hello(){imagine{puts("{HELLO}")}}'
replaced = text.replace('{', '{\n').replace('"{\n', '"{')
You may match single and double quoted (C-style) string literals (those that support escape entities with backslashes) and then match { and } in any other context that you may replace with your desired values.
See Python demo:
import re
text = 'hello(){imagine{puts("{HELLO}")}}'
dblq = r'(?<!\\)(?:\\{2})*"[^"\\]*(?:\\.[^"\\]*)*"'
snlq = r"(?<!\\)(?:\\{2})*'[^'\\]*(?:\\.[^'\\]*)*'"
rx = re.compile(r'({}|{})|[{{}}]'.format(dblq, snlq))
print(rx.pattern)
def repl(m):
    if m.group(1):
        return m.group(1)
    elif m.group() == '{':
        return '{\n'
    else:
        return '\n}'
# Examples
print(rx.sub(repl, text))
print(rx.sub(repl, r'hello(){imagine{puts("Nice, Mr. \"Know-all\"")}}'))
print(rx.sub(repl, "hello(){imagine{puts('MORE {HELLO} HERE ')}}"))
The pattern that is generated in the code above is
((?<!\\)(?:\\{2})*"[^"\\]*(?:\\.[^"\\]*)*"|(?<!\\)(?:\\{2})*'[^'\\]*(?:\\.[^'\\]*)*')|[{}]
It can actually be reduced to
(?<!\\)((?:\\{2})*(?:"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'))|[{}]
See the regex demo.
Details:
The pattern matches 2 main alternatives. The first one matches single- and double-quoted string literals.
(?<!\\) - no \ immediately to the left is allowed
((?:\\{2})*(?:"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')) - Group 1:
(?:\\{2})* - 0+ repetitions of two consecutive backslashes
(?: - a non-capturing group:
"[^"\\]*(?:\\.[^"\\]*)*" - a double quoted string literal
| - or
'[^'\\]*(?:\\.[^'\\]*)*' - a single quoted string literal
) - end of the non-capturing group
| - or
[{}] - a { or }.
In the repl method, Group 1 is checked for a match. If it matched, a single- or double-quoted string literal was found and is put back unchanged. Otherwise, if the match value is {, it is replaced with {\n; if not, the match is } and is replaced with \n}.
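If the regex feels hard to maintain, the same "skip string literals, rewrite braces elsewhere" logic can also be written as a plain character scan. This is a different technique from the pattern above, shown only as a rough sketch: the function name break_braces is hypothetical, and it assumes C-style literals where a backslash escapes the next character.

```python
def break_braces(text):
    out = []
    quote = None      # the quote character that opened the current literal, if any
    escaped = False   # True right after a backslash inside a literal
    for ch in text:
        if quote:
            # Inside a string literal: copy everything through unchanged.
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
            out.append(ch)
        elif ch == "{":
            out.append("{\n")
        elif ch == "}":
            out.append("\n}")
        else:
            out.append(ch)
    return "".join(out)

print(break_braces('hello(){imagine{puts("{HELLO}")}}'))
```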
Replace { with {\n:
text.replace('{', '{\n')
Replace } with \n}:
text.replace('}', '\n}')
Now to fix the braces that were quoted:
text.replace('"{\n','"{')
and
text.replace('\n}"', '}"')
Combined together:
replaced = text.replace('{', '{\n').replace('}', '\n}').replace('"{\n','"{').replace('\n}"', '}"')
Output
hello(){
imagine{
puts("{HELLO}")
}
}
You can check the similarities with the input and try to match them.
text = 'hello(){imagine{puts("{HELLO}")}}'
replaced = text.replace('){', '){\n').replace('{puts', '{\nputs').replace('}}', '\n}\n}')
print(replaced)
output:
hello(){
imagine{
puts("{HELLO}")
}
}
UPDATE
try this: https://regex101.com/r/DBgkrb/1

Groovy complaining about illegal character range in regex

Groovy 2.4 here. I am trying to build a regex that will filter out all the following characters:
`,./;[]-&<>?:"()|
Here's my best attempt:
static void main(String[] args) {
// `,./;[]-&<>?:"()|
String regex = "`,./;[]-&<>?:\"()|"
String test = "ooekrofkrofor ` oxkeoe , wdkeodeko / kodek ] woekoedk \" swjiej ' wsjwdjeiji :"
println test.replaceAll(regex, "")
}
However this produces a compile error on the regex string definition, complaining:
illegal character range (to < from)
Not sure if this is a Java or Groovy thing, but I can't figure out how to define the regex properly so that it quiets the error and correctly strips these "illegal characters" out of my string. Any ideas?
It seems to me you want to remove all the characters listed in your regex variable. The problem is that you declared a sequence while you need a character class (enclose the characters with []).
See Groovy demo:
String regex = "[`,./;\\[\\]&<>?:\"()|-]+"
String test = "ooekrofkrofor ` oxkeoe , wdkeodeko / kodek ] woekoedk \" swjiej ' wsjwdjeiji :"
println test.replaceAll(regex, "")
Output: ooekrofkrofor oxkeoe wdkeodeko kodek woekoedk swjiej ' wsjwdjeiji
The pattern now contains a character class matching any of the characters defined inside it - [`,./;\[\]&<>?:\"()|-] - one or more times due to the + quantifier. Note that inside the character class, ] and [ must always be escaped, and the - can be left unescaped when placed at the start/end of the character class.
You need to escape a few special characters in your pattern:
String regex = "[`,./;\\[\\]\\-&<>?:\"\\(\\)|]+"
Note using double \\ to turn them into a single \ in the string, so when the pattern is parsed, the next character is escaped.
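For readers who hit the same problem in Python rather than Groovy, one way to avoid hand-escaping is to let re.escape do it and wrap the result in a character class. This is a sketch in a different language from the question, and the names ILLEGAL and pattern are made up for illustration.

```python
import re

ILLEGAL = '`,./;[]-&<>?:"()|'

# re.escape backslash-escapes the characters that are special in a pattern,
# including [, ] and -, so the whole list can sit inside one character class.
pattern = re.compile("[" + re.escape(ILLEGAL) + "]+")

test = 'ooekrofkrofor ` oxkeoe , wdkeodeko / kodek ] woekoedk " swjiej \' wsjwdjeiji :'
cleaned = pattern.sub("", test)
print(cleaned)
```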

Regex that matches specific spaces

I've been trying to do this Regex for a while now. I'd like to create one that matches all the spaces of a text, except those in literal string.
Example:
123 Foo "String with spaces"
Space between 123 and Foo would match, as well as the one between Foo and "String with spaces", but only those two.
Thanks
A common, simple strategy for this is to count the number of quotes leading up to your location in the string. If the count is odd, you are inside a quoted string; if the count is even, you are outside a quoted string. I can't think of a way to do this in regular expressions, but you could use this strategy to filter the results.
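A minimal sketch of that quote-counting strategy, assuming double quotes only and no escaped quotes inside the literals (the function name spaces_outside_quotes is hypothetical):

```python
def spaces_outside_quotes(text):
    positions = []
    inside = False  # flips on every double quote: odd count means inside a literal
    for i, ch in enumerate(text):
        if ch == '"':
            inside = not inside
        elif ch == " " and not inside:
            positions.append(i)
    return positions

print(spaces_outside_quotes('123 Foo "String with spaces"'))
```

The returned indexes are the spaces that lie outside any quoted string; for the example line these are the two spaces before the opening quote.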
You could use re.findall to match either a string or a space and then afterwards inspect the matches:
import re
hits = re.findall("\"(?:\\\\.|[^\\\"])*\"|[ ]", 'foo bar baz "another\\" test\" and done')
for h in hits:
    print("found: [%s]" % h)
yields:
found: [ ]
found: [ ]
found: [ ]
found: ["another\" test"]
found: [ ]
found: [ ]
A short explanation:
" # match a double quote
(?: # start non-capture group 1
\\\\. # match a backslash followed by any character (except line breaks)
| # OR
[^\\\"] # match any character except a '\' and '"'
)* # end non-capture group 1 and repeat it zero or more times
" # match a double quote
| # OR
[ ] # match a single space
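To go from "strings or spaces" to "only the spaces outside strings", the matches can be filtered afterwards. The sketch below uses re.finditer so the positions are available; the pattern is written as a raw string that excludes both the backslash and the quote inside the class, and the names token and space_positions are assumptions for illustration.

```python
import re

# A quoted literal (with backslash escapes) OR a single space.
token = re.compile(r'"(?:\\.|[^"\\])*"|[ ]')

def space_positions(text):
    # Keep only the matches that are a bare space; quoted literals are consumed whole.
    return [m.start() for m in token.finditer(text) if m.group() == " "]

print(space_positions('123 Foo "String with spaces"'))
```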
If ->123 Foo "String with spaces"<- is the structure of every line, that is, unquoted text followed by quoted text, you could capture the unquoted and the quoted text in two groups and handle them separately.
For example, with the regex (.*)(".*"), $1 contains ->123 Foo <- and $2 contains ->"String with spaces"<-
Java example:
String aux = "123 Foo \"String with spaces\"";
String regex = "(.*)(\".*\")";
String unquoted = aux.replaceAll(regex, "$1").replace(" ", "");
String quoted = aux.replaceAll(regex, "$2");
System.out.println(unquoted+quoted);
javascript example.
<SCRIPT LANGUAGE="JavaScript">
<!--
str='1 23 Foo \"String with spaces\"';
re = new RegExp('(.*)(".*")') ;
var unquoted = str.replace(re, "$1");
var quoted = str.replace(re, "$2");
document.write (unquoted.split(' ').join('')+quoted);
// -->
</SCRIPT>