Regex in Python to remove commas and spaces

I have a string with multiple commas and spaces as delimiters between words. Here are some examples:
ex #1: string = 'word1,,,,,,, word2,,,,,, word3,,,,,,'
ex #2: string = 'word1 word2 word3'
ex #3: string = 'word1,word2,word3,'
I want to use a regex to convert any of the three examples above to "word1, word2, word3" (note: no comma after the last word in the result).
I used the following code:
import re
input_col = 'word1 , word2 , word3, '
test_string = ''.join(input_col)
test_string = re.sub(r'[,\s]+', ' ', test_string)
test_string = re.sub(' +', ',', test_string)
print(test_string)
I get the output "word1,word2,word3,", whereas I actually want "word1, word2, word3", with no comma after word3.
What kind of regex and re methods should I use to achieve this?

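You can use re.split to build a list and then filter out the empty strings:
import re
s = 'word1 , word2 , word3, '
r = re.split(r"[^a-zA-Z\d]+", s)  # split on runs of non-alphanumeric characters
ans = ', '.join([i for i in r if len(i) > 0])  # drop empty pieces, join with ", "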

How about adding the following line to the end of your program:
test_string = re.sub(r',+$', '', test_string)
This removes any trailing commas from the end of the string.

One approach is to first split on an appropriate pattern, then join the resulting array by comma:
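import re
string = 'word1,,,,,,, word2,,,,,, word3,,,,,,'
# Split on runs of commas and whitespace. A pattern that can match the empty
# string, such as ",*\s*", splits between every character on Python 3.7+.
parts = re.split(r"[,\s]+", string)
sep = ','
output = re.sub(',$', '', sep.join(parts))
print(output)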
word1,word2,word3
Note that I make a final call to re.sub to remove a possible trailing comma.

You can use [ ]+,[ ]+ to match a comma surrounded by extra spaces and ,\s*$ to match the last comma. Then substitute the first match with a comma followed by a single space, and the last comma with an empty string:
import re
input_col = 'word1 , word2 , word3, '
test_string = re.sub(r'[ ]+,[ ]+', ', ', input_col) # remove extra spaces around commas
test_string = re.sub(r',\s*$', '', test_string) # remove last comma
print(test_string)
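As a minimal sketch covering all three examples from the question in one pass: instead of cleaning up the delimiters, one can extract the words themselves and join them. The helper name normalize_words is made up for illustration, and the sketch assumes a word is any run of characters that are neither commas nor whitespace.

```python
import re

def normalize_words(text):
    # Hypothetical helper: collect every run of non-comma, non-whitespace
    # characters, then join the runs with ", ". Leading and trailing
    # delimiters produce no empty pieces, so no trailing comma appears.
    words = re.findall(r'[^,\s]+', text)
    return ', '.join(words)

for example in ('word1,,,,,,, word2,,,,,, word3,,,,,,',
                'word1 word2 word3',
                'word1,word2,word3,'):
    print(normalize_words(example))
```

Because re.findall only returns the matched words, there is no separate step to strip a trailing comma.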

Related

Matching any combination of space AND newline

I'm trying to find a regexp that catches all instances that contain at least one \n and any number of (space), no matter the order. So, for instance (with spaces denoted with _), all of these should be caught by the regexp:
\n
\n\n\n\n
\n\n\n_\n\n
_\n
\n_
_\n_
_\n\n
\n\n_
_\n\n_
_\n\n_\n
\n_\n_
_\n\n_\n_
___\n__\n and so on...
However, it must not catch spaces that do not border a \n.
In other words, I'd like to reduce all of this (if I'm not making any mistake) to one line:
import re
mystring = re.sub(r'(\n)+' , '\n' , mystring)
mystring = re.sub(r'( )+' , ' ' , mystring)
mystring = re.sub(r'\n ' , '\n' , mystring)
mystring = re.sub(r' \n' , '\n' , mystring)
mystring = re.sub(r'(\n)+' , '\n' , mystring)
mystring = re.sub(r'(\n)+' , ' | ' , mystring)
You can use:
[ ]*(?:\n[ ]*)+
or, if you want to match tabs as well:
[ \t]*(?:\n[ \t]*)+
You can use the following regular expression:
(( )*\n+( )*)+
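A minimal sketch of how such a pattern might replace the chain of re.sub calls in the question. The helper name collapse_newline_runs and the sample string are assumptions for illustration; the pattern is the space-only variant suggested above.

```python
import re

# Optional spaces, then one or more newlines, each optionally followed by spaces.
NEWLINE_RUN = re.compile(r'[ ]*(?:\n[ ]*)+')

def collapse_newline_runs(text, replacement='\n'):
    # Hypothetical helper: replace each run of newlines (with any bordering
    # spaces) by a single replacement string.
    return NEWLINE_RUN.sub(replacement, text)

sample = 'a \n\n  \n b  c'
print(repr(collapse_newline_runs(sample)))          # spaces between b and c are kept
print(repr(collapse_newline_runs(sample, ' | ')))
```

Spaces that do not border a newline, such as the two between b and c, are left alone because every match must contain at least one \n.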

Python regular expression to find and replace multiple matches

I have a string as follows which can have any number of spaces after the first [ or before the last ]:
my_string = " [ 0.53119281 1.53762345 ]"
I have a regular expression which matches and replaces each one individually as follows:
my_regex_start = "(\[\s+)" # Find a square bracket and any number of white spaces
replaced_1 = re.sub(my_regex_start, '[', my_string) # --> " [0.53119281 1.53762345 ]"
my_regex_end = "(\s+\])" # Find any number of white spaces and a square bracket
replaced_2 = re.sub(my_regex_end, ']', my_string) # --> " [ 0.53119281 1.53762345]"
I have a regular expression which finds one OR the other:
my_regex_both = "(\[\s+)|(\s+\])" # Find a square bracket and any number of white spaces OR any number of white spaces and a square bracket
How can I use my_regex_both to replace the first, the second, or both, whichever are found?
Instead of catching the brackets, you can replace the spaces that are preceded by [ or followed by ] with an empty string:
import re
my_string = "[ 0.53119281 1.53762345 ]"
my_regex_both = r"(?<=\[)\s+|\s+(?=\])"
replaced = re.sub(my_regex_both, '', my_string)
print(replaced)
Output:
[0.53119281 1.53762345]
Another option you can use aside from MrGeek's answer would be to use a capture group to catch everything between your my_regex_start and my_regex_end like so:
import re
string1 = " [ 0.53119281 1.53762345 ]"
result = re.sub(r"(\[\s+)(.*?)(\s+\])", r"[\2]", string1)
print(result)
I have just sandwiched (.*?) between your two expressions. It lazily captures whatever lies between them, which the replacement can then refer to as \2.
OUTPUT
[0.53119281 1.53762345]
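To answer the question as literally asked, here is a rough sketch that keeps the asker's my_regex_both alternation and picks the replacement per match with a function. The helper name strip_bracket_padding is hypothetical; the approach relies only on re.sub accepting a callable as the replacement.

```python
import re

# The asker's combined pattern: group 1 is "[" plus whitespace,
# group 2 is whitespace plus "]".
my_regex_both = re.compile(r"(\[\s+)|(\s+\])")

def strip_bracket_padding(text):
    # Hypothetical helper: if group 1 matched, emit a bare "[",
    # otherwise group 2 matched, so emit a bare "]".
    return my_regex_both.sub(lambda m: "[" if m.group(1) else "]", text)

print(repr(strip_bracket_padding(" [ 0.53119281 1.53762345 ]")))
print(repr(strip_bracket_padding("[0.53119281 1.53762345   ]")))
```

Whitespace outside the brackets, such as the leading space in the first sample, is left untouched because neither alternative matches it.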

Replace '-' with space if the next character is a letter not a digit and remove when it is at the start

I have a list of strings, i.e.
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
I want to remove the '-' from a string when it is the first character and is followed by letters rather than digits. If the '-' is preceded by a digit or letter and followed by letters, it should be replaced with a space.
So for the list slist I want the output as
["args", "-111111", "20 args", "20 - 20", "20-10", "args deep"]
I have tried
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
nlist = list()
for estr in slist:
    nlist.append(re.sub("((^-[a-zA-Z])|([0-9]*-[a-zA-Z]))", "", estr))
print(nlist)
and I get the output
['rgs', '-111111', 'rgs', '20 - 20', '20-10', 'argseep']
You may use
nlist.append(re.sub(r"-(?=[a-zA-Z])", " ", estr).lstrip())
or
nlist.append(re.sub(r"-(?=[^\W\d_])", " ", estr).lstrip())
Result: ['args', '-111111', '20 args', '20 - 20', '20-10', 'args deep']
The -(?=[a-zA-Z]) pattern matches a hyphen before an ASCII letter (-(?=[^\W\d_]) matches a hyphen before any letter), and replaces the match with a space. Since - may be matched at the start of a string, the space may appear at that position, so .lstrip() is used to remove the space(s) there.
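As an alternative sketch that avoids the trailing .lstrip(), a single pattern can distinguish the two cases and a replacement function can choose between deleting the hyphen and turning it into a space. The helper name fix_hyphen and the combined pattern are assumptions for illustration, restricted to ASCII letters and digits.

```python
import re

# First alternative: a hyphen at the very start, followed by a letter.
# Second alternative: a hyphen between a letter/digit and a letter.
HYPHEN = re.compile(r'^-(?=[a-zA-Z])|(?<=[a-zA-Z0-9])-(?=[a-zA-Z])')

def fix_hyphen(s):
    # Hypothetical helper: drop a leading hyphen, replace an inner one with a space.
    return HYPHEN.sub(lambda m: '' if m.start() == 0 else ' ', s)

slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
print([fix_hyphen(s) for s in slist])
```

The lookbehind in the second alternative needs a character before the hyphen, so only the first alternative can match at position 0.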
Here, we might just want to capture the first letter after a leading -, then replace the match with that letter only, perhaps with the i (case-insensitive) flag, using an expression similar to:
^-([a-z])
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"^-([a-z])"
test_str = ("-args\n"
"-111111\n"
"20-args\n"
"20 - 20\n"
"20-10\n"
"args-deep")
subst = "\\1"
# count=0 replaces every match; pass a positive count to limit the number of replacements
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE | re.IGNORECASE)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Demo
const regex = /^-([a-z])/gmi;
const str = `-args
-111111
20-args
20 - 20
20-10
args-deep`;
const subst = `$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
RegEx
If this expression wasn't desired, it can be modified or changed in regex101.com.
One option is to do two replacements. First, match the hyphen at the start when only letters follow it:
^-(?=[a-zA-Z]+$)
In the replacement use an empty string.
Then capture one or more letters or digits in group 1, match -, and capture one or more letters in group 2:
^([a-zA-Z0-9]+)-([a-zA-Z]+)$
In the replacement use r"\1 \2"
For example
import re
regex1 = r"^-(?=[a-zA-Z]+$)"
regex2 = r"^([a-zA-Z0-9]+)-([a-zA-Z]+)$"
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
slist = list(map(lambda s: re.sub(regex2, r"\1 \2", re.sub(regex1, "", s)), slist))
print(slist)
Result
['args', '-111111', '20 args', '20 - 20', '20-10', 'args deep']

python replace line text with weird characters

How do I replace the following using Python?
GSA*HC*11177*NYSfH-EfC*23130303*0313*1*R*033330103298
STEM*333*3001*0030303238
BHAT*3319*33*33377*23330706*031829*RTRCP
NUM4*41*2*My Break Room Place*****6*1133337
I want to replace every character after the first occurrence of '*'. All characters must be replaced except '*'.
Example input:
NUM4*41*2*My Break Room Place*****6*1133337
Example output:
NUM4*11*1*11 11111 1111 11111*****1*1111111
Fairly simple: use a callback that returns group 1 (if matched) unaltered and otherwise returns the replacement 1.
Note: this also works on multi-line strings. If you need that, just add (?m) to the beginning of the regex:
(?m)(?:(^[^*]*\*)|[^*\s])
You'd probably want to test the string for the * character first.
   ( ^ [^*]* \* )    # (1), BOS/BOL up to first *
 |                   # or,
   [^*\s]            # Not a * nor whitespace
Python
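import re

def repl(m):
    # Keep the prefix up to and including the first * (group 1) unchanged
    if m.group(1):
        return m.group(1)
    # Any other matched character becomes "1"
    return "1"

line = 'NUM4*41*2*My Break Room Place*****6*1133337'
if '*' in line:  # str.find() returns -1 (truthy) when nothing is found
    newstr = re.sub(r'(^[^*]*\*)|[^*\s]', repl, line)
    print(newstr)
else:
    print('* not found in string')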
Output
NUM4*11*1*11 11111 1111 11111*****1*1111111
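For comparison, the same masking can be sketched without a regex by splitting at the first * with str.partition. This is a swapped-in technique, not the answer's method, and the helper name mask_after_first_star is made up for illustration. Like the regex above, it leaves * and whitespace untouched.

```python
def mask_after_first_star(line):
    # Hypothetical helper: split once at the first "*". If there is no "*",
    # star and tail are empty strings and the line is returned unchanged.
    head, star, tail = line.partition('*')
    # Replace every character of the tail with "1", except "*" and whitespace.
    masked = ''.join(ch if ch == '*' or ch.isspace() else '1' for ch in tail)
    return head + star + masked

print(mask_after_first_star('NUM4*41*2*My Break Room Place*****6*1133337'))
print(mask_after_first_star('STEM*333*3001*0030303238'))
```

A line without any * comes back unchanged, so no separate existence check is needed in this variant.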
If you want to use a regex, you can use (?<=\*)[^*]+ with re.sub. Pass a function as the replacement so that every non-whitespace character in each run becomes 1, rather than the whole run collapsing to a single 1:
import re
inputs = ['GSA*HC*11177*NYSfH-EfC*23130303*0313*1*R*033330103298',
'STEM*333*3001*0030303238',
'BHAT*3319*33*33377*23330706*031829*RTRCP',
'NUM4*41*2*My Break Room Place*****6*1133337']
outputs = [re.sub(r'(?<=\*)[^*]+', lambda m: re.sub(r'\S', '1', m.group()), inputline) for inputline in inputs]

Split words from CamelCase string

I have a string
string = 'one Two9three four_Five 67SixSevenEightNine';
I need to split it into the words:
'one' 'two' 'three' 'four' 'five' 'six' 'seven' 'eight' 'nine'
I managed to separate all except the CamelCase, when the lowercase letter is followed by uppercase:
while ~isempty(string)
    [str,string] = ...
        strtok(string, ...
        [' ~#$/#.-:&*+=[]?!(){},''">_<;%' char(9) char(10) char(13) '0-9']);
    str = regexprep(str, '[0-9]','');
end
I can also get the index of the pattern, and if I knew how to insert a space or some other character in between, I could use the code above once again to split into words:
pattern = '[a-z][A-Z]+';
[pat,idx]=regexp(str, pattern,'match');
Any ideas?
Thanks!
Why not replace the camelCase before you do your other processing?
newstring = regexprep(string, '([a-z])([A-Z])', '$1 $2');
while ~isempty(newstring)
...
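The same idea (break the camelCase boundary first, then split) can be sketched in Python as an analogue of the MATLAB approach above. The helper name split_camel_words is hypothetical, and the sketch assumes ASCII letters only and that every non-letter character acts as a separator.

```python
import re

def split_camel_words(text):
    # Hypothetical helper. Step 1: insert a space wherever a lowercase
    # letter is directly followed by an uppercase letter.
    spaced = re.sub(r'([a-z])([A-Z])', r'\1 \2', text)
    # Step 2: treat everything that is not a letter as a separator and
    # lowercase each remaining word.
    return [word.lower() for word in re.findall(r'[A-Za-z]+', spaced)]

print(split_camel_words('one Two9three four_Five 67SixSevenEightNine'))
```

Doing the camelCase substitution first means the later split never has to know about case at all.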