Retaining punctuations in a word - regex

How can I remove punctuation from a line but keep the punctuation that occurs inside a word, using re?
For example:
Input = "Hello!!!, i don't like to 'some String' .... isn't"
Output = ['hello', 'i', "don't", 'like', 'to', 'some', 'string', "isn't"]
I am trying to do this:
re.sub(r'\W+', ' ', myLine.lower()).split()
But this splits words like "don't" into don and t.

You can use lookarounds in your regex:
>>> input = "Hello!!!, i didn''''t don't like to 'some String' .... isn't"
>>> regex = r'\W+(?!\S*[a-z])|(?<!\S)\W+'
>>> print(re.sub(regex, '', input, flags=re.IGNORECASE).split())
['Hello', 'i', "didn''''t", "don't", 'like', 'to', 'some', 'String', "isn't"]
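\W+(?!\S*[a-z])|(?<!\S)\W+ matches a run of non-word characters in two cases: when no letter follows it before the next whitespace (trailing punctuation such as !!!, or ....), or when the run starts a token, that is, it is not preceded by a non-space character (leading punctuation such as the quote in 'some). Punctuation inside a word, as in don't, matches neither case and is kept.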
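As a rough Python 3 sketch of the same approach, the pattern can be compiled once and wrapped in a helper. The names EDGE_PUNCT and tokenize are illustrative, and the lower() call is taken from the expected output in the question rather than from the answer.

```python
import re

# Pattern from the answer: it strips punctuation at the edges of a token only.
EDGE_PUNCT = re.compile(r"\W+(?!\S*[a-z])|(?<!\S)\W+", re.IGNORECASE)

def tokenize(line):
    """Drop leading/trailing punctuation of each token, keep punctuation inside words."""
    return EDGE_PUNCT.sub("", line).lower().split()

print(tokenize("Hello!!!, i don't like to 'some String' .... isn't"))
# ['hello', 'i', "don't", 'like', 'to', 'some', 'string', "isn't"]
```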

Related

regex in Python to remove commas and spaces

I have a string with multiple commas and spaces as delimiters between words. Here are some examples:
ex #1: string = 'word1,,,,,,, word2,,,,,, word3,,,,,,'
ex #2: string = 'word1 word2 word3'
ex #3: string = 'word1,word2,word3,'
I want to use a regex to convert any of the three examples above to "word1, word2, word3" (note: no comma after the last word in the result).
I used the following code:
import re
input_col = 'word1 , word2 , word3, '
test_string = ''.join(input_col)
test_string = re.sub(r'[,\s]+', ' ', test_string)
test_string = re.sub(' +', ',', test_string)
print(test_string)
I get the output "word1,word2,word3,", whereas I actually want "word1, word2, word3", with no comma after word3.
What kind of regex and re methods should I use to achieve this?
You can use re.split to create a list and then filter out the empty strings:
import re
s = 'word1 , word2 , word3, '
r = re.split(r"[^a-zA-Z\d]+", s)
ans = ', '.join([i for i in r if len(i) > 0])
How about adding the following statement to the end of your program:
test_string = re.sub(',+$', '', test_string)
This removes the comma at the end of the string.
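One approach is to first split on an appropriate pattern, then join the resulting list with commas:
import re
string = 'word1,,,,,,, word2,,,,,, word3,,,,,,'
# [,\s]+ cannot match the empty string; a pattern that can (such as ,*\s*)
# makes re.split cut between every character on Python 3.7+
parts = re.split(r"[,\s]+", string)
sep = ','
output = re.sub(',$', '', sep.join(parts))
print(output)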
word1,word2,word3
Note that I make a final call to re.sub to remove a possible trailing comma.
You can use [ ]+,[ ]+ to find commas surrounded by extra spaces and ,\s*$ to find the trailing comma. Then replace the first with ', ' and the second with an empty string:
import re
input_col = 'word1 , word2 , word3, '
test_string = re.sub('[ ]+,[ ]+', ', ', input_col) # remove extra space
test_string = re.sub(r',\s*$', '', test_string) # remove last comma
print(test_string)
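The answers above each handle some of the inputs. A minimal sketch that covers all three examples from the question is to collect the words directly instead of cleaning up the delimiters; the function name normalize_delimiters is made up for illustration.

```python
import re

def normalize_delimiters(s):
    """Return the words of s joined by ', ' with no trailing comma."""
    # Every maximal run of characters that are neither commas nor whitespace is a word.
    words = re.findall(r"[^,\s]+", s)
    return ", ".join(words)

for example in ("word1,,,,,,, word2,,,,,, word3,,,,,,",
                "word1 word2 word3",
                "word1,word2,word3,"):
    print(normalize_delimiters(example))  # word1, word2, word3
```

Because the words are matched rather than the separators, there are no empty strings to filter and no trailing comma to strip.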

python replace line text with weird characters

How do I replace the following using python
GSA*HC*11177*NYSfH-EfC*23130303*0313*1*R*033330103298
STEM*333*3001*0030303238
BHAT*3319*33*33377*23330706*031829*RTRCP
NUM4*41*2*My Break Room Place*****6*1133337
I want to replace every character after the first occurrence of '*' with 1. All characters must be replaced except '*'; as the example shows, spaces are kept too.
Example input:
NUM4*41*2*My Break Room Place*****6*1133337
example output:
NUM4*11*1*11 11111 1111 11111*****1*1111111
Fairly simple: use a callback that returns group 1 unaltered if it matched, and otherwise returns the replacement character 1.
Note: this also works on multi-line strings. If you need that, add (?m) to the beginning of the regex: (?m)(?:(^[^*]*\*)|[^*\s])
You'd probably want to test the string for the * character first.
( ^ [^*]* \* ) # (1), BOS/BOL up to first *
| # or,
[^*\s] # Not a * nor whitespace
Python
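import re

def repl(m):
    # Keep the prefix up to and including the first '*' unchanged
    if m.group(1):
        return m.group(1)
    # Any other matched character becomes "1"
    return "1"

s = 'NUM4*41*2*My Break Room Place*****6*1133337'
# str.find returns -1 (truthy) when nothing is found, so test membership instead
if '*' in s:
    newstr = re.sub(r'(^[^*]*\*)|[^*\s]', repl, s)
    print(newstr)
else:
    print('* not found in string')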
Output
NUM4*11*1*11 11111 1111 11111*****1*1111111
If you want to use regex, you can use this one with re.sub: (?<=\*)[^\*]+
Note that it replaces each whole run of characters between asterisks with a single 1 (the last input becomes NUM4*1*1*1*****1*1), not one 1 per character:
import re
inputs = ['GSA*HC*11177*NYSfH-EfC*23130303*0313*1*R*033330103298',
'STEM*333*3001*0030303238',
'BHAT*3319*33*33377*23330706*031829*RTRCP',
'NUM4*41*2*My Break Room Place*****6*1133337']
outputs = [re.sub(r'(?<=\*)[^\*]+', '1', inputline) for inputline in inputs]
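If one 1 per character is wanted without a callback, one possible sketch is to split off the prefix with str.partition and mask the rest. The function name mask_after_first_star and the fill parameter are illustrative and do not come from the answers.

```python
import re

def mask_after_first_star(line, fill="1"):
    """Replace every character after the first '*' except '*' and whitespace."""
    # partition returns (line, '', '') when there is no '*', so such lines pass through unchanged.
    head, star, tail = line.partition("*")
    return head + star + re.sub(r"[^*\s]", fill, tail)

print(mask_after_first_star("NUM4*41*2*My Break Room Place*****6*1133337"))
# NUM4*11*1*11 11111 1111 11111*****1*1111111
```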

Python 2.7 RE Search by condition

I have a problem when using re.search.
For example:
a = '<span class="chapternum">1 </span>abc,def.</span>'
How can I search for the number '1'?
Or, how can I match digits that are preceded by ">" and followed by whitespace?
I tried:
test = re.search('(^>)(\d+)(\s$)', a)
print test
>> []
It fails to get the number "1".
^ and $ indicate the beginning and the end of the string. If you get rid of them you have your answer:
>>> test = re.search('(>)(\d+)(\s)', a)
>>> test.groups()
('>', '1', ' ')
Not sure that you need the first and last groups though (capturing with parentheses):
>>> a = '<span class="chapternum">23 </span>abc,def.</span>'
>>> test = re.search('>(\d+)\s', a)
>>> test.group(1)
'23'
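re.search returns None when nothing matches, so it is worth checking the result before calling group(). The sketch below does that and uses lookarounds instead of capturing groups; the helper name chapter_number is made up for illustration.

```python
import re

def chapter_number(html):
    """Return the digits found between '>' and whitespace, or None if there are none."""
    # (?<=>) requires a preceding '>' and (?=\s) requires following whitespace;
    # neither is included in the match itself.
    m = re.search(r"(?<=>)\d+(?=\s)", html)
    return m.group(0) if m else None

print(chapter_number('<span class="chapternum">1 </span>abc,def.</span>'))  # 1
```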

Python 2.7 : Remove elements from a multidimensional list

Basically, I have a 3-dimensional list (it is a list of tokens, where the first dimension is for the text, the second for the sentence, and the third for the word).
An element in the list (let's call it mat) can be addressed like this, for example:
mat[2][3][4]. That would give us the fifth word of the fourth sentence in the third text.
But some of the words are just symbols like '.' or ',' or '?'. I need to remove all of them. I thought I would do that with a procedure:
def removePunc(mat):
    newMat = []
    newText = []
    newSentence = []
    for text in mat:
        for sentence in text:
            for word in sentence:
                if word not in " !##$%^&*()-_+={}[]|\\:;'<>?,./\"":
                    newSentence.append(word)
            newText.append(newSentence)
        newMat.append(newText)
    return newMat
Now, when I try to use that:
finalMat = removePunc(mat)
it is giving me the same list (mat is a 3 dimensional list). My idea was to iterate over the list and remove only the 'words' which are actually punctuation symbols.
I don't know what I am doing wrong but surely there is a simple logical mistake.
Edit: I need to keep the structure of the array. So, words of the same sentence should still be in the same sentence (just without the 'punctuation symbol' words). Example:
a = [[['as', '.'], ['w', '?', '?']], [['asas', '23', '!'], ['h', ',', ',']]]
after the changes should be:
a = [[['as'], ['w']], [['asas', '23'], ['h']]]
Thanks for reading and/or giving me a reply.
I would suspect that your data are not organized as you think they are. And although I am usually not the one to propose regular expressions, I think in your case they may be among the best solutions.
I would also suggest that instead of eliminating non-alphabetic characters from words, you process whole sentences:
>>> import re
>>> non_word = re.compile(r'\W+') # If your sentences may
>>> sentence = '''The formatting sucks, but the only change that I've made to your code was shortening the "symbols" string to one character. The only issue that I can identify is either with the "symbols" string (though it looks like all chars in it are properly escaped) that you used, or the punctuation is not actually separate words'''
>>> words = re.split(non_word, sentence)
>>> words
['The', 'formatting', 'sucks', 'but', 'the', 'only', 'change', 'that', 'I', 've', 'made', 'to', 'your', 'code', 'was', 'shortening', 'the', 'symbols', 'string', 'to', 'one', 'character', 'The', 'only', 'issue', 'that', 'I', 'can', 'identify', 'is', 'either', 'with', 'the', 'symbols', 'string', 'though', 'it', 'looks', 'like', 'all', 'chars', 'in', 'it', 'are', 'properly', 'escaped', 'that', 'you', 'used', 'or', 'the', 'punctuation', 'is', 'not', 'actually', 'separate', 'words']
>>>
The code you wrote seems solid and it looks like "it should work", but only if this:
But, some of the words are just symbols like '.' or ',' or '?'
is actually fulfilled.
I would actually expect the symbols to not be separate from words, so instead of:
["Are", "you", "sure", "?"] #example sentence
you would rather have:
["Are", "you", "sure?"] #example sentence
If this is the case, you would need to go along the lines of:
def removePunc(mat):
    newMat = []
    for text in mat:
        newText = []
        for sentence in text:
            newSentence = []
            for word in sentence:
                # Rebuild the word from its non-punctuation characters
                newWord = ""
                for char in word:
                    if char not in " !##$%^&*()-_+={}[]|\\:;'<>?,./\"":
                        newWord += char
                # Drop words that consisted only of punctuation
                if newWord:
                    newSentence.append(newWord)
            newText.append(newSentence)
        newMat.append(newText)
    return newMat
Finally found it. As expected, it was a very small logical mistake that was there all along but that I couldn't see: the accumulator lists have to be reset inside the loops. Here is the working solution:
def removePunc(mat):
    newMat = []
    for text in mat:
        newText = []
        for sentence in text:
            newSentence = []
            for word in sentence:
                if word not in " !##$%^&*()-_+={}[]|\\:;'<>?,./\"":
                    newSentence.append(word)
            newText.append(newSentence)
        newMat.append(newText)
    return newMat
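Note that word not in "..." is a substring test, so it also treats the empty string and any multi-character substring of the symbol list as punctuation. A possible alternative, sketched here with nested list comprehensions, checks each character against a set. Using string.punctuation in place of the hand-written symbol string is a substitution made for this sketch, and the names remove_punc and is_punct are illustrative.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punct(word):
    """True if word is non-empty and consists only of punctuation characters."""
    return bool(word) and all(ch in PUNCT_CHARS for ch in word)

def remove_punc(mat):
    # Rebuild the text -> sentence -> word nesting, dropping punctuation-only words.
    return [[[w for w in sentence if not is_punct(w)]
             for sentence in text]
            for text in mat]

a = [[['as', '.'], ['w', '?', '?']], [['asas', '23', '!'], ['h', ',', ',']]]
print(remove_punc(a))  # [[['as'], ['w']], [['asas', '23'], ['h']]]
```

Each comprehension level builds a fresh list, so the accumulator-reset mistake cannot occur.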

Split words from CamelCase string

I have a string
string = 'one Two9three four_Five 67SixSevenEightNine';
I need to split it into the words:
'one' 'two' 'three' 'four' 'five' 'six' 'seven' 'eight' 'nine'
I managed to separate all except the CamelCase, when the lowercase letter is followed by uppercase:
while ~isempty(string)
[str,string] = ...
strtok(string, ...
[' ~#$/#.-:&*+=[]?!(){},''">_<;%' char(9) char(10) char(13) '0-9']);
str = regexprep(str, '[0-9]','');
end
I also can get the index of the pattern, but only if I knew how to insert space or some character between, then I could use the code above once again to split into words:
pattern = '[a-z][A-Z]+';
[pat,idx]=regexp(str, pattern,'match');
any ideas?
Thanks!
Why not replace the camelCase before you do your other processing?
newstring = regexprep(string, '([a-z])([A-Z])', '$1 $2');
while ~isempty(newstring)
...
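The thread is about MATLAB; purely for comparison, the same two-step idea (insert a separator at each lowercase-to-uppercase boundary, then tokenize) can be sketched in Python. The function name split_words and the ASCII-only letter class are assumptions of this sketch.

```python
import re

def split_words(s):
    """Split s into lowercase words, breaking camelCase at lower-to-upper boundaries."""
    # Insert a space wherever a lowercase letter is directly followed by an uppercase one.
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", s)
    # Keep runs of ASCII letters only; digits, underscores and spaces act as separators.
    return [w.lower() for w in re.findall(r"[A-Za-z]+", spaced)]

print(split_words("one Two9three four_Five 67SixSevenEightNine"))
# ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine']
```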