Basically, I have a 3-dimensional list (a list of tokens, where the first dimension is the text, the second the sentence, and the third the word).
An element in the list (let's call it mat) can be addressed like this, for example:
mat[2][3][4]. That would give us the fifth word of the fourth sentence in the third text.
But some of the words are just symbols like '.' or ',' or '?'. I need to remove all of them. I thought I would do that with a procedure:
def removePunc(mat):
    newMat = []
    newText = []
    newSentence = []
    for text in mat:
        for sentence in text:
            for word in sentence:
                if word not in " !##$%^&*()-_+={}[]|\\:;'<>?,./\"":
                    newSentence.append(word)
            newText.append(newSentence)
        newMat.append(newText)
    return newMat
Now, when I try to use that:
finalMat = removePunc(mat)
it gives me back the same list (mat is a 3-dimensional list). My idea was to iterate over the list and remove only the 'words' which are actually punctuation symbols.
I don't know what I am doing wrong, but surely there is a simple logical mistake.
Edit: I need to keep the structure of the array. So, words of the same sentence should still be in the same sentence (just without the 'punctuation symbol' words). Example:
a = [[['as', '.'], ['w', '?', '?']], [['asas', '23', '!'], ['h', ',', ',']]]
after the changes should be:
a = [[['as'], ['w']], [['asas', '23'], ['h']]]
Thanks for reading and/or giving me a reply.
I would suspect that your data are not organized as you think they are. And although I am usually not the one to propose regular expressions, I think in your case they may be among the best solutions.
I would also suggest that, instead of eliminating non-alphabetic characters from words, you process whole sentences:
>>> import re
>>> non_word = re.compile(r'\W+') # If your sentences may
>>> sentence = '''The formatting sucks, but the only change that I've made to your code was shortening the "symbols" string to one character. The only issue that I can identify is either with the "symbols" string (though it looks like all chars in it are properly escaped) that you used, or the punctuation is not actually separate words'''
>>> words = re.split(non_word, sentence)
>>> words
['The', 'formatting', 'sucks', 'but', 'the', 'only', 'change', 'that', 'I', 've', 'made', 'to', 'your', 'code', 'was', 'shortening', 'the', 'symbols', 'string', 'to', 'one', 'character', 'The', 'only', 'issue', 'that', 'I', 'can', 'identify', 'is', 'either', 'with', 'the', 'symbols', 'string', 'though', 'it', 'looks', 'like', 'all', 'chars', 'in', 'it', 'are', 'properly', 'escaped', 'that', 'you', 'used', 'or', 'the', 'punctuation', 'is', 'not', 'actually', 'separate', 'words']
>>>
The code you wrote seems solid and it looks like "it should work", but only if this:
But, some of the words are just symbols like '.' or ',' or '?'
is actually fulfilled.
I would actually expect the symbols to not be separate from words, so instead of:
["Are", "you", "sure", "?"] #example sentence
you would rather have:
["Are", "you", "sure?"] #example sentence
If this is the case, you would need to go along the lines of:
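def removePunc(mat):
    newMat = []
    for text in mat:
        newText = []  # start a fresh list for every text
        for sentence in text:
            newSentence = []  # start a fresh list for every sentence
            for word in sentence:
                # Rebuild the word from its non-punctuation characters.
                newWord = ""
                for char in word:
                    if char not in " !##$%^&*()-_+={}[]|\\:;'<>?,./\"":
                        newWord += char
                # Skip tokens that consisted only of punctuation.
                if newWord:
                    newSentence.append(newWord)
            newText.append(newSentence)
        newMat.append(newText)
    return newMat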
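Finally found it. As expected, it was a very small logical mistake that was there all along, but I couldn't see it: newText and newSentence were created once, outside the loops, so they were never reset for each text and each sentence. Here is the working solution: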
def removePunc(mat):
    newMat = []
    for text in mat:
        newText = []
        for sentence in text:
            newSentence = []
            for word in sentence:
                if word not in " !##$%^&*()-_+={}[]|\\:;'<>?,./\"":
                    newSentence.append(word)
            newText.append(newSentence)
        newMat.append(newText)
    return newMat
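As a side note, `word not in "..."` is a substring test against the whole symbols string, so a multi-character token that happens to occur inside that string would be dropped as well. A more compact variant is sketched below. It assumes that every punctuation token is a single character from Python's string.punctuation, which is a different symbol set from the one used above, and the names PUNCT and remove_punc are only illustrative.

```python
import string

# Assumption: punctuation tokens are single characters from string.punctuation.
PUNCT = set(string.punctuation)

def remove_punc(mat):
    """Return a new text/sentence/word structure without punctuation tokens."""
    return [
        [[word for word in sentence if word not in PUNCT] for sentence in text]
        for text in mat
    ]

a = [[['as', '.'], ['w', '?', '?']], [['asas', '23', '!'], ['h', ',', ',']]]
print(remove_punc(a))  # [[['as'], ['w']], [['asas', '23'], ['h']]]
```

Because each comprehension builds a fresh list, the nesting of texts and sentences is preserved and there are no accumulators to forget to reset.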
How can I remove punctuation from a line but retain punctuation inside words, using re?
For example:
Input = "Hello!!!, i don't like to 'some String' .... isn't"
Output = ['hello', 'i', "don't", 'like', 'to', 'some', 'string', "isn't"]
I am trying to do this:
re.sub('\W+', ' ', myLine.lower()).split()
But this splits words like "don't" into don and t.
You can use lookarounds in your regex:
>>> import re
>>> input = "Hello!!!, i didn''''t don't like to 'some String' .... isn't"
>>> regex = r'\W+(?!\S*[a-z])|(?<!\S)\W+'
>>> print(re.sub(regex, '', input, flags=re.IGNORECASE).split())
['Hello', 'i', "didn''''t", "don't", 'like', 'to', 'some', 'String', "isn't"]
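The regex has two alternatives. \W+(?!\S*[a-z]) matches a run of non-word characters that is not followed by a letter before the next whitespace, which covers trailing punctuation. (?<!\S)\W+ matches a run of non-word characters that starts at the beginning of the string or right after whitespace, which covers leading punctuation. An apostrophe inside a word, as in don't, has a letter on both sides and is left alone.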
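If a regex is not a hard requirement, a similar result can be sketched without lookarounds by splitting on whitespace and trimming punctuation from both ends of each token. This is a different technique from the one above, it lowercases the text as the question asks, and the function name tokenize is only illustrative. It assumes that the punctuation to drop is covered by string.punctuation.

```python
import string

def tokenize(line):
    """Lowercase, split on whitespace, trim punctuation from both ends of each token."""
    stripped = (token.strip(string.punctuation) for token in line.lower().split())
    # Tokens made only of punctuation become empty strings and are dropped.
    return [token for token in stripped if token]

print(tokenize("Hello!!!, i don't like to 'some String' .... isn't"))
# ['hello', 'i', "don't", 'like', 'to', 'some', 'string', "isn't"]
```

Since str.strip only removes characters from the ends, apostrophes inside words such as don't survive.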
I have a list containing strings and I want to capitalize the first letter of the first string only, not of every string in the list, using Python.
I have attempted the following, but the first letter of every string in the list is capitalized:
L = ("hello", "what", "is", "your", "name")
LCaps = [str.capitalize(element) for element in L]
print(LCaps)
So, you want to capitalize the first string and only the first string of a tuple. Use:
>>> L = ("hello", "what", "is", "your", "name")
>>> (L[0].capitalize(),) + L[1:]
('Hello', 'what', 'is', 'your', 'name')
Key points:
Strings have methods. There is no need to use the string module: just use the string's capitalize method.
By running L[0].capitalize(), we capitalize the first string but none of the others.
Because L is a tuple, we can't change the first string in-place. We can however capitalize the first string and concatenate it with the rest.
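The key points above can be wrapped in a small helper, sketched here under the assumption that the input is a tuple or a list of strings. The name capitalize_first is hypothetical. Note that str.capitalize also lowercases the rest of the string.

```python
def capitalize_first(items):
    """Return a copy of items (tuple or list) with only the first string capitalized."""
    if not items:
        return items
    # Build a one-element container of the same type so that + concatenation works.
    head = type(items)([items[0].capitalize()])
    return head + items[1:]

L = ("hello", "what", "is", "your", "name")
print(capitalize_first(L))  # ('Hello', 'what', 'is', 'your', 'name')
```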
(L[0].title(),) + L[1:]
You can use title() too.
I'm working with Python v2.7, and I'm trying to find out whether a word is in a string.
For example, if I have a string and the word I want to find:
str = "ask and asked, ask are different ask. ask"
word = "ask"
How should I code this so that the result doesn't include words that are part of other words? In the example above I want every "ask" except the one inside "asked".
I have tried the following code, but it doesn't work:
def exact_Match(str1, word):
    match = re.findall(r"\\b" + word + "\\b",str1, re.I)
    if len(match) > 0:
        return True
    return False
Can someone please explain how I can do it?
You can use the following function:
>>> import re
>>> test_str = "ask and asked, ask are different ask. ask"
>>> word = "ask"
>>> def finder(s,w):
...     return re.findall(r'\b{}\b'.format(w),s,re.U)
...
>>> finder(test_str,word)
['ask', 'ask', 'ask', 'ask']
Note that you need \b as the word-boundary anchor in the regex!
Or you can use the following function to return the indices of the matching words in the split string:
>>> def finder(s,w):
...     return [i for i,j in enumerate(re.findall(r'\b\w+\b',s,re.U)) if j==w]
...
>>> finder(test_str,word)
[0, 3, 6, 7]
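For reference, the attempt in the question fails because r"\\b" is a raw string, so it asks the regex engine for a literal backslash followed by the letter b rather than a word boundary. A minimal sketch of a boolean check along the lines of the original function follows. The name exact_match is illustrative, and re.escape is an added precaution for words that contain regex metacharacters.

```python
import re

def exact_match(text, word):
    """Return True if word occurs in text as a whole word, ignoring case."""
    # In a raw string a single backslash is enough: r"\b" is the word-boundary anchor.
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

print(exact_match("ask and asked, ask are different ask. ask", "ask"))  # True
print(exact_match("asked and asking", "ask"))  # False
```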
Can someone please suggest an approach for writing code that automatically maps the letters in letters_str onto dic_key (a dict key of string type made of dashes, with the same length as the words in word_lst)?
The mapping only occurs if a letter appears at the same position in every word in the list, no matter how many words are in the list.
If no letter appears at the same position in all the words, then new_dic_key would be '----'. Please see the examples below.
Thanks
word_lst = ['ague', 'bute', 'byre', 'came', 'case', 'doze']
dic_key = '----'
letters_str ='abcdefghijklmnopqrstuvwxyz'
new_dic_key = '---e'
if
word_list = ['bute', 'byre']
new_dic_key = 'b--e'
or
word_list = ['drek', 'drew', 'dyes']
new_dic_key = 'd-e-'
If the words in word_list are all the same length, this code will give you what you want:
word_list = ['drek', 'drew', 'dyes']
cols = []
for i in range(len(word_list[0])):
    cols.append([])
for word in word_list:
    for i, ch in enumerate(word):
        cols[i].append(ch)
pattern = [item[0] if len(set(item)) == 1 else '-' for item in cols]
print(''.join(pattern))
d-e-
Explanation:
We initialize cols to be a list of lists. It will contain a two-dimensional representation of the letters in the words of word_list. After populating cols, this is what it looks like:
[['d', 'd', 'd'], ['r', 'r', 'y'], ['e', 'e', 'e'], ['k', 'w', 's']]
So the final result new_dic_key will contain the letter only if all elements in the sub-list above have the same letter, otherwise it will contain a -. This is achieved using the list comprehension for pattern.
Hope it helps.
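The same column-wise idea can be sketched more compactly with zip, which transposes the words into tuples of letters per position. This assumes, as above, that all words have the same length (zip silently stops at the shortest word otherwise). The function name common_pattern is only illustrative.

```python
def common_pattern(words):
    """Keep a letter where every word has it at that position, else use '-'."""
    return ''.join(
        column[0] if len(set(column)) == 1 else '-'
        for column in zip(*words)  # one tuple of letters per position
    )

print(common_pattern(['ague', 'bute', 'byre', 'came', 'case', 'doze']))  # ---e
print(common_pattern(['bute', 'byre']))                                  # b--e
print(common_pattern(['drek', 'drew', 'dyes']))                          # d-e-
```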
I have a string
string = 'one Two9three four_Five 67SixSevenEightNine';
I need to split it into the words:
'one' 'two' 'three' 'four' 'five' 'six' 'seven' 'eight' 'nine'
I managed to separate all except the CamelCase, when the lowercase letter is followed by uppercase:
while ~isempty(string)
    [str,string] = ...
        strtok(string, ...
        [' ~#$/#.-:&*+=[]?!(){},''">_<;%' char(9) char(10) char(13) '0-9']);
    str = regexprep(str, '[0-9]','');
end
I can also get the index of the pattern; if only I knew how to insert a space or some other character in between, I could use the code above once again to split into words:
pattern = '[a-z][A-Z]+';
[pat,idx]=regexp(str, pattern,'match');
Any ideas?
Thanks!
Why not replace the camelCase before you do your other processing?
newstring = regexprep(string, '([a-z])([A-Z])', '$1 $2');
while ~isempty(newstring)
...
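For comparison, here is the same two-step idea sketched in Python rather than MATLAB: first insert a space at every lowercase-to-uppercase transition, then pull out the runs of letters. The function name split_words is hypothetical, and the sketch assumes that only ASCII letters count as word characters.

```python
import re

def split_words(text):
    """Split text into lowercase words, treating camelCase humps as word breaks."""
    # Insert a space wherever a lowercase letter is directly followed by an uppercase one.
    spaced = re.sub(r'([a-z])([A-Z])', r'\1 \2', text)
    # Keep only runs of ASCII letters, so digits, underscores and spaces act as separators.
    return [word.lower() for word in re.findall(r'[A-Za-z]+', spaced)]

print(split_words('one Two9three four_Five 67SixSevenEightNine'))
# ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine']
```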