Gurus,
I have a list which looks like the following:
[u'test1', u'test2', '', '']
I am trying to find a way to remove the u character that appears before 'test1' and 'test2'. After removing it, the list should look like:
['test1','test2', '', '']
Initially I had a list like the following:
[u'test1\n', u'test2\r\n', '', '']
I was able to clean that up using the following:
row_val = [w.replace('\n', '') for w in row_val]
row_val = [w.replace('\r', '') for w in row_val]
Let me know if there is a way to do the same without iterating through each string.
The u is not part of the string; it is telling you that the value is a unicode object rather than a str object.
You can just do:
row_val = [str(w) for w in row_val]
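If you also want to fold the newline removal into the same pass, a minimal sketch (assuming every element of row_val is a string):
row_val = [str(w).strip() for w in row_val]  # strip() drops leading/trailing whitespace, including \n and \r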
I have a long list that I can simplify as below, and even trying the function re.sub I can't remove the blank '' elements.
import os
import re
import pytesseract
from PIL import Image

overall_list = []
directory = '/content/drive/MyDrive/Colab Notebooks/S N'
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    imagestring = pytesseract.image_to_string(Image.open(f))
    # on Python 3.7+ an empty pattern matches at position 0, so this yields ['', imagestring]
    string_lists = re.split('', imagestring, 1)
    print(string_lists)
    for x in string_lists:
        x = re.sub('\x0c', '', x)
        x = re.sub('[\n-\x0c]', ' ', x)
        x = re.sub('', '', x)  # my attempt at removing the '' elements; this does nothing
        overall_list.append(x)
print(overall_list)
All the code above returns the scanned images as individual lists:
['', 'N/S:10229876-5\n\x0c']
['', '192.1638.1 729.200\n\x0c']
['', '192.168.179.103 SPARE\n\x0c']
And the "overall_list" is all the above in one list
['', 'N/S:10229876-5 ', '', '192.1638.1 729.200 ', '', '192.168.179.103 SPARE ']
But I ran out of ideas for cleaning the '' elements out of this list. However, I noticed that they occur in an alternating pattern, so maybe I can use pop in a for loop and delete one every time it appears.
How do I structure a loop for this particular goal?
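A sketch of a simpler route (assuming the only unwanted elements are empty strings): skip the index bookkeeping entirely and keep just the non-empty strings with a list comprehension.
overall_list = [s for s in overall_list if s != '']  # keep only the non-empty strings
# equivalently: overall_list = list(filter(None, overall_list))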
I have a string with multiple commas and spaces as delimiters between words. Here are some examples:
ex #1: string = 'word1,,,,,,, word2,,,,,, word3,,,,,,'
ex #2: string = 'word1 word2 word3'
ex #3: string = 'word1,word2,word3,'
I want to use a regex to convert any of the above 3 examples to "word1, word2, word3" (note: no comma after the last word in the result).
I used the following code:
import re
input_col = 'word1 , word2 , word3, '
test_string = ''.join(input_col)
test_string = re.sub(r'[,\s]+', ' ', test_string)
test_string = re.sub(' +', ',', test_string)
print(test_string)
I get the output as "word1,word2,word3,". Whereas I actually want "word1, word2, word3". No comma after word3.
What kind of regex and re methods should I use to achieve this?
You can use split to create a list and filter out the elements with length < 1:
import re
s = 'word1 , word2 , word3, '
r = re.split(r"[^a-zA-Z\d]+", s)  # split on runs of non-alphanumeric characters
ans = ', '.join([i for i in r if len(i) > 0])
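Printing the result (joined with ', ') gives the requested output:
print(ans)
# word1, word2, word3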
How about adding the following statement to the end of your program:
test_string = re.sub(',+$', '', test_string)
which removes the trailing comma(s) from the end of the string.
One approach is to first split on an appropriate pattern, then join the resulting array by comma:
import re

string = 'word1,,,,,,, word2,,,,,, word3,,,,,,'
# split on runs of commas and/or whitespace; the original pattern ",*\s*" can
# match an empty string, which on Python 3.7+ splits between every character
parts = re.split(r"[,\s]+", string)
sep = ','
output = re.sub(',$', '', sep.join(parts))
print(output)
# word1,word2,word3
Note that I make a final call to re.sub to remove a possible trailing comma.
You can use [ ]+ to detect the extra spaces and ,\s*$ to detect the last comma. Then you can substitute [ ]+,[ ]+ with ', ' and the trailing comma with an empty string:
import re
input_col = 'word1 , word2 , word3, '
test_string = re.sub(r'[ ]+,[ ]+', ', ', input_col)  # collapse the extra spaces around the commas
test_string = re.sub(r',\s*$', '', test_string)      # remove the last comma
print(test_string)
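A compact alternative (my own sketch, not from the answers above) that handles all three example inputs, assuming the words themselves never contain commas or whitespace: pull out the words and join them with ', '.
import re

for string in ['word1,,,,,,, word2,,,,,, word3,,,,,,',
               'word1 word2 word3',
               'word1,word2,word3,']:
    print(', '.join(re.findall(r'[^,\s]+', string)))
# each line prints: word1, word2, word3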
How can I remove punctuation from a line, but retain the punctuation inside a word, using re?
For Example :
Input = "Hello!!!, i don't like to 'some String' .... isn't"
Output = ['hello', 'i', "don't", 'like', 'to', 'some', 'string', "isn't"]
I am trying to do this:
re.sub(r'\W+', ' ', myLine.lower()).split()
But this is splitting the words like "don't" into don and t.
You can use lookarounds in your regex:
>>> import re
>>> input = "Hello!!!, i didn''''t don't like to 'some String' .... isn't"
>>> regex = r'\W+(?!\S*[a-z])|(?<!\S)\W+'
>>> print(re.sub(regex, '', input, 0, re.IGNORECASE).split())
['Hello', 'i', "didn''''t", "don't", 'like', 'to', 'some', 'String', "isn't"]
\W+(?!\S*[a-z])|(?<!\S)\W+ matches one or more non-word characters that are not followed by a letter anywhere later in the same token (trailing punctuation), or one or more non-word characters not preceded by a non-space character (punctuation at the start of a token). In other words, it strips punctuation at the edges of words while leaving punctuation inside words alone.
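An alternative sketch without lookarounds (my own variant, not part of the answer above): split on whitespace and strip punctuation from the edges of each token, assuming string.punctuation covers everything you want removed.
import string

line = "Hello!!!, i don't like to 'some String' .... isn't"
words = [w.strip(string.punctuation) for w in line.lower().split()]
words = [w for w in words if w]  # drop tokens that were pure punctuation
print(words)
# ['hello', 'i', "don't", 'like', 'to', 'some', 'string', "isn't"]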
I am a beginner learning Python. I would like to convert column values from a Unix time string ('1383260400000') to a timestamp (e.g. 1970-01-01 00:00:01). I have read and tried the following but it's giving me an error.
ti=datetime.datetime.utcfromtimestamp(int(arr[1]).strftime('%Y-%m-%d %H:%M:%S');
It's saying invalid syntax. I read and tried a few other things but I cannot get it right. Any suggestions?
And another one: in the same file I have some empty cells that I would like to replace with 0. I tried this too and it's giving me invalid syntax:
smsin=arr[3];
if arr[3]='' :
smsin='0';
Please help. Thank you a lot.
You seem to have forgotten a closing parenthesis after int(arr[1]).
import datetime
arr = ['23423423', '1163838603', '1263838603', '1463838603']
ti = datetime.datetime.utcfromtimestamp(int(arr[1])).strftime('%Y-%m-%d %H:%M:%S')
print(ti)
# => 2006-11-18 08:30:03
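Side note (an assumption based on the 13-digit length): the value in your question, '1383260400000', looks like milliseconds rather than seconds, so you would divide by 1000 first.
ms = '1383260400000'
ti = datetime.datetime.utcfromtimestamp(int(ms) / 1000).strftime('%Y-%m-%d %H:%M:%S')
print(ti)
# => 2013-10-31 23:00:00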
To replace empty strings with '0's in your list you could do:
arr = ['123', '456', '', '789', '']
arr = [x if x else '0' for x in arr]
print(arr)
# => ['123', '456', '0', '789', '0']
Note that the latter only works correctly since the empty string '' is the only string with a truth value of False. If you had other data types within arr (e.g. 0, 0L, 0.0, (), [], ...) and only wanted to replace the empty strings you would have to do:
arr = [x if x != '' else '0' for x in arr]
More efficient yet would be to modify arr in place instead of recreating the whole list.
for index, item in enumerate(arr):
    if item == '':
        arr[index] = '0'
But if that is not an issue (e.g. your list is not too large) I would prefer the former (more readable) way.
Also you don't need to put ;s at the end of your code lines as Python does not require them to terminate statements. They can be used to delimit statements if you wish to put multiple statements on the same line but that is not the case in your code.
Basically, I have a 3-dimensional list (a list of tokens, where the first dimension is for the text, the second for the sentence, and the third for the word).
Addressing an element in the list (let's call it mat) can be done, for example, as mat[2][3][4]. That would give us the fifth word of the fourth sentence in the third text.
But, some of the words are just symbols like '.' or ',' or '?'. I need to remove all of them. I thought to do that with a procedure:
def removePunc(mat):
    newMat = []
    newText = []
    newSentence = []
    for text in mat:
        for sentence in text:
            for word in sentence:
                if word not in " !@#$%^&*()-_+={}[]|\\:;'<>?,./\"":
                    newSentence.append(word)
            newText.append(newSentence)
        newMat.append(newText)
    return newMat
Now, when I try to use that:
finalMat = removePunc(mat)
it is giving me the same list (mat is a 3 dimensional list). My idea was to iterate over the list and remove only the 'words' which are actually punctuation symbols.
I don't know what I am doing wrong but surely there is a simple logical mistake.
Edit: I need to keep the structure of the array. So, words of the same sentence should still be in the same sentence (just without the 'punctuation symbol' words). Example:
a = [[['as', '.'], ['w', '?', '?']], [['asas', '23', '!'], ['h', ',', ',']]]
after the changes should be:
a = [[['as'], ['w']], [['asas', '23'], ['h']]]
Thanks for reading and/or giving me a reply.
I would suspect that your data are not organized as you think they are. And although I am usually not the one to propose regular expressions, I think in your case they may be among the best solutions.
I would also suggest that instead of eliminating non-alphabetic characters from words, you process whole sentences:
>>> import re
>>> non_word = re.compile(r'\W+') # If your sentences may
>>> sentence = '''The formatting sucks, but the only change that I've made to your code was shortening the "symbols" string to one character. The only issue that I can identify is either with the "symbols" string (though it looks like all chars in it are properly escaped) that you used, or the punctuation is not actually separate words'''
>>> words = re.split(non_word, sentence)
>>> words
['The', 'formatting', 'sucks', 'but', 'the', 'only', 'change', 'that', 'I', 've', 'made', 'to', 'your', 'code', 'was', 'shortening', 'the', 'symbols', 'string', 'to', 'one', 'character', 'The', 'only', 'issue', 'that', 'I', 'can', 'identify', 'is', 'either', 'with', 'the', 'symbols', 'string', 'though', 'it', 'looks', 'like', 'all', 'chars', 'in', 'it', 'are', 'properly', 'escaped', 'that', 'you', 'used', 'or', 'the', 'punctuation', 'is', 'not', 'actually', 'separate', 'words']
>>>
The code you wrote seems solid and it looks like "it should work", but only if this:
But, some of the words are just symbols like '.' or ',' or '?'
is actually fulfilled.
I would actually expect the symbols to not be separate from words, so instead of:
["Are", "you", "sure", "?"] #example sentence
you would rather have:
["Are", "you", "sure?"] #example sentence
If this is the case, you would need to go along the lines of:
def removePunc(mat):
    newMat = []
    for text in mat:
        newText = []
        for sentence in text:
            newSentence = []
            for word in sentence:
                newWord = ""  # reset for every word, otherwise the words run together
                for char in word:
                    if char not in " !@#$%^&*()-_+={}[]|\\:;'<>?,./\"":
                        newWord += char
                if newWord:  # skip words that were nothing but punctuation
                    newSentence.append(newWord)
            newText.append(newSentence)
        newMat.append(newText)
    return newMat
Finally, I found it. As expected, it was a very small logical mistake that was always there, but I couldn't see it. Here is the working solution:
def removePunc(mat):
    newMat = []
    for text in mat:
        newText = []
        for sentence in text:
            newSentence = []
            for word in sentence:
                if word not in " !@#$%^&*()-_+={}[]|\\:;'<>?,./\"":
                    newSentence.append(word)
            newText.append(newSentence)
        newMat.append(newText)
    return newMat
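As a quick check, running the fixed function on the example from the question reproduces the expected result:
a = [[['as', '.'], ['w', '?', '?']], [['asas', '23', '!'], ['h', ',', ',']]]
print(removePunc(a))
# [[['as'], ['w']], [['asas', '23'], ['h']]]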