String value decode utf-8 - python-2.7

I want to decode string values to utf-8, but the output doesn't change.
So, here is my code:
self.textEdit_3.append(str(self.new_header).decode("utf-8") + "\n")
The original output value is:
['matchkey', 'a', 'b', 'd', '안녕'] # 안녕 is Korean Language
I changed the default encoding for encoding / decoding with unicode to utf-8 instead of ascii. On the first line I added this code:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Why doesn't the string value change?

You can fix your code like this:
header = str(self.new_header).decode('string-escape').decode("utf-8")
self.textEdit_3.append(header + "\n")
You do not need the setdefaultencoding lines.
Explanation:
The original value is a list containing byte-strings:
>>> value = ['matchkey', 'a', 'b', 'd', '안녕']
>>> value
['matchkey', 'a', 'b', 'd', '\xec\x95\x88\xeb\x85\x95']
If you convert this list with str, it will use repr on all the list elements:
>>> strvalue = str(value)
>>> strvalue
"['matchkey', 'a', 'b', 'd', '\\xec\\x95\\x88\\xeb\\x85\\x95']"
The repr parts can be decoded like this:
>>> strvalue = strvalue.decode('string-escape')
>>> strvalue
"['matchkey', 'a', 'b', 'd', '\xec\x95\x88\xeb\x85\x95']"
and this can now be decoded to unicode like this:
>>> univalue = strvalue.decode('utf-8')
>>> univalue
u"['matchkey', 'a', 'b', 'd', '\uc548\ub155']"
>>> print univalue
['matchkey', 'a', 'b', 'd', '안녕']
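If the goal is only readable text, a possible alternative is to skip the repr round trip and decode each list element on its own. The following is a minimal sketch, not the method from the answer above: `value` is a hypothetical stand-in for self.new_header, written with b'' literals so that the same lines run on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for self.new_header: a list of UTF-8 encoded byte strings.
value = [b'matchkey', b'a', b'b', b'd', b'\xec\x95\x88\xeb\x85\x95']

# Decode every element individually instead of decoding str(value).
decoded = [item.decode('utf-8') for item in value]

# Build the display text by hand so that no repr escaping is involved.
text = u"[" + u", ".join(u"'" + item + u"'" for item in decoded) + u"]"

# text is now a unicode string and could be passed to the widget,
# e.g. self.textEdit_3.append(text + u"\n")
```

Because no repr is involved here, the 'string-escape' step is not needed in this variant.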
PS:
Regarding the problems reading files with a utf-8 bom, please test this script:
# -*- coding: utf-8 -*-
import os, codecs, tempfile
text = u'a,b,d,안녕'
data = text.encode('utf-8-sig')
print 'text:', repr(text), len(text)
print 'data:', repr(data), len(data)
f, path = tempfile.mkstemp()
print 'write:', os.write(f, data)
os.close(f)
with codecs.open(path, 'r', encoding='utf-8-sig') as f:
    string = f.read()
print 'read:', repr(string), len(string), string == text

Related

Why my RegexTokenizer transformation in PySpark gives me the opposite of the required pattern?

When I use the RegexTokenizer from pyspark.ml.feature to tokenize the Sentence column in my dataframe and find all the word characters, I get the opposite of what I get when I use the Python re package on the same sentence. Here is the sample code:
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer
spark = SparkSession.builder \
    .master("local") \
    .appName("Word list") \
    .getOrCreate()
df = spark.createDataFrame(data = [["Hi there, I have a question about RegexTokenizer, Could you please help me..."]], schema = ["Sentence"])
regexTokenizer = RegexTokenizer(inputCol="Sentence", outputCol="letters", pattern="\\w")
df = regexTokenizer.transform(df)
df.first()['letters']
This gives the following output:
[' ', ', ', ' ', ' ', ' ', ' ', ' ', ', ', ' ', ' ', ' ', ' ', '...']
On the other hand if I use the re module on the same sentence and use the same pattern to match the letters, using this code here:
import re
sentence = "Hi there, I have a question about RegexTokenizer, could you please help me..."
letters_list = re.findall("\\w", sentence)
print(letters_list)
I get the desired output as per the regular expression pattern as:
['H', 'i', 't', 'h', 'e', 'r', 'e', 'I', 'h', 'a', 'v', 'e', 'a',
'q', 'u', 'e', 's', 't', 'i', 'o', 'n', 'a', 'b', 'o', 'u', 't',
'R', 'e', 'g', 'e', 'x', 'T', 'o', 'k', 'e', 'n', 'i', 'z', 'e',
'r', 'c', 'o', 'u', 'l', 'd', 'y', 'o', 'u', 'p', 'l', 'e', 'a',
's', 'e', 'h', 'e', 'l', 'p', 'm', 'e']
I also found that I need to use \W instead of \w in PySpark to solve this problem. Why is there this difference? Or have I misunderstood the usage of the pattern argument in RegexTokenizer?
According to the documentation, RegexTokenizer has a parameter called gaps. When gaps is True (the default), the pattern matches the gaps between tokens, so the text is split on it. When gaps is False, the pattern matches the tokens themselves.
Try setting it explicitly to the value you need: in your case, pass gaps=False when you create the tokenizer.

Python splitting a string with accented letters

I would like to split a string which contains accented characters into characters without breaking the accent and the letter apart.
A simple example is
>>> o = u"šnjiwgetit"
>>> print u" ".join(o)
s ̌ n j i w g e t i t
or
>>> print list(o)
[u's', u'\u030c', u'n', u'j', u'i', u'w', u'g', u'e', u't', u'i', u't']
Whereas I would like the result to be š n j i w g e t i t so that the accent stays on top of the consonant.
The solution should work even with more difficult characters such as h̭ɛ̮ŋkkɐᴅ
You can use regex to group the characters. Here is a sample code for doing so:
import re
pattern = re.compile(r'(\w[\u02F3\u1D53\u0300\u2013\u032E\u208D\u203F\u0311\u0323\u035E\u031C\u02FC\u030C\u02F9\u0328\u032D:\u02F4\u032F\u0330\u035C\u0302\u0327\u03572\u0308\u0351\u0304\u02F2\u0352\u0355\u00B7\u032C\u030B\u2019\u0339\u00B4\u0301\u02F1\u0303\u0306\u030A7\u0325\u0307\u0354`\u02F0]+|\w|\W)', re.UNICODE | re.IGNORECASE)
If some accents are missing, add them to the pattern.
Then, you can split words into characters as follows.
print(list(pattern.findall('šnjiwgetit')))
['š', 'n', 'j', 'i', 'w', 'g', 'e', 't', 'i', 't']
print(list(pattern.findall('h̭ɛ̮ŋkkɐᴅ')))
['h̭', 'ɛ̮', 'ŋ', 'k', 'k', 'ɐ', 'ᴅ']
If you are using Python2, add from __future__ import unicode_literals at the beginning of the file.
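A different approach, which avoids listing every accent by hand, is to ask the unicodedata module whether a character is a combining mark and attach it to the preceding character. The sketch below is written for Python 3 and uses a hypothetical helper name, `split_clusters`; it handles combining marks only and is not a full grapheme-cluster implementation.

```python
import unicodedata

def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Escapes are used so the combining characters are unambiguous:
# U+030C is the caron, U+032D and U+032E are marks placed below the letter.
word1 = "s\u030cnjiwgetit"
word2 = "h\u032d\u025b\u032e\u014bkk\u0250\u1d05"

print(len(split_clusters(word1)))  # 10
print(len(split_clusters(word2)))  # 7
```

Joining the result with " ".join(split_clusters(word1)) keeps each accent on its letter.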

How to remove all consonants and print vowels in a list

Here is my code:
#Alphabet class
class Alphabet(object):
    def __init__(self, s):
        self.s = s
    def __str__(self):
        return "Before: " + str(self.s)
#Define your Vowels class here
class Vowels:
    def __init__(self,vowelList):
        self.vowelList = vowelList
    def __str__(self):
        return "Invoking the method in Vowels by passing the Alphabet object\nAfter: " + str(vowelList)
    def addVowels(self,a_obj):
        for letter in a_obj:
            if letter in 'aeiou':
                vowelList.append(letter)
        l = ','.join(vowelList)
a1 = Alphabet('A,B,C,E,I')
print a1
b = Vowels(a1)
b.addVowels(a1)
print (a2)
Right now, all it prints is "Before: A,B,C,E,I". What I am trying to do is take a string of letters separated by commas (i.e. a_obj), extract the vowels from the string, and append them to a list. I have looked at other answers about finding and printing only the vowels, which is why I have the for loop and if statement in addVowels, but no luck. Just to note, Vowels is supposed to be a container class for Alphabet.
When I try to get the output, the code below gives me:
a1 = Alphabet('A,B,C,E,I')
print a1
a2 = Vowels(a1)
print a2
output:
Before: A,B,C,E,I
Invoking the method in Vowels by passing the Alphabet object
After: []
It seems like it isn't passing the letters from Alphabet...
You can create the list and get rid of the commas in one line by using split.
>>> "a,b,c,d,e,f,g,h,i,j".split(",")
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
From there you can remove the consonants by only keeping the vowels.
You can use a for loop:
letterList = ['a', 'b', 'c', 'd']
vowelList = []
for letter in letterList:
    if letter in 'aeiou':
        vowelList.append(letter)
Or you can use list comprehension:
letterList = ['a', 'b', 'c', 'd']
vowelList = [letter for letter in letterList if letter in 'aeiou']
Example of how this would work for your code:
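class Vowels(object):
    def __init__(self, a_obj):
        # a_obj is an Alphabet instance; a_obj.s holds the comma-separated letters
        lettersList = a_obj.s.split(",")
        # lower() so that upper-case input such as 'A,B,C,E,I' is matched too
        self.vowelList = [letter for letter in lettersList if letter.lower() in 'aeiou']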
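Putting the pieces together, a complete version of the two classes might look like the sketch below. It assumes that Vowels should read the letters from the Alphabet object's s attribute and that upper-case vowels count as vowels; the `VOWELS` constant and the shortened "After: " text are illustrative choices, not part of the original code.

```python
VOWELS = set("aeiou")  # illustrative constant, not in the original code

class Alphabet(object):
    def __init__(self, s):
        self.s = s

    def __str__(self):
        return "Before: " + str(self.s)

class Vowels(object):
    def __init__(self, a_obj):
        self.vowelList = []
        self.addVowels(a_obj)

    def addVowels(self, a_obj):
        # Read the letters from the Alphabet object's string attribute.
        for letter in a_obj.s.split(","):
            # Membership in a set avoids matching '' or multi-letter chunks.
            if letter.strip().lower() in VOWELS:
                self.vowelList.append(letter.strip())

    def __str__(self):
        return "After: " + ",".join(self.vowelList)

a1 = Alphabet('A,B,C,E,I')
a2 = Vowels(a1)
print(a1)  # Before: A,B,C,E,I
print(a2)  # After: A,E,I
```

The key differences from the question's code are that every access goes through self.vowelList and that the loop iterates over a_obj.s.split(",") rather than over the Alphabet object itself.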
I'm using this code, and it works for me.
def getVowels(text):
    vowel_letters = []
    vowel_list = ['a', 'e', 'i', 'o', 'u', 'A', 'E', 'I', 'O', 'U']
    for vowels in text:
        if vowels in vowel_list:
            vowel_letters.append(vowels)
    return vowel_letters
print(getVowels('Hi, How are you today!'))
## Output: ['i', 'o', 'a', 'e', 'o', 'u', 'o', 'a']

Regex to split by square brackets and dots with python and re module

I want to build a regex expression to split by '.' and '[]', but here, I would want to keep the result between square brackets.
I mean:
import re
pattern = re.compile("\.|[\[-\]]")
my_string = "a.b.c[0].d.e[12]"
pattern.split(my_string)
# >>> ['a', 'b', 'c', '0', '', 'd', 'e', '12', '']
But I would wish to get the following output (without any empty string):
# >>> ['a', 'b', 'c', '0', 'd', 'e', '12']
Would it be possible? I've tested a lot of regex patterns and this is the best I've found, but it's not perfect.
You can use a quantifier in your regex and filter:
>>> pattern = re.compile(r'[.\[\]]+')
>>> my_string = "a.b.c[0].d.e[12]"
>>> filter(None, pattern.split(my_string))
['a', 'b', 'c', '0', 'd', 'e', '12']
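On Python 3, filter returns an iterator rather than a list, so the result would need to be wrapped in list(). Two equivalent ways to get the list directly are sketched below; the variable names are illustrative.

```python
import re

my_string = "a.b.c[0].d.e[12]"

# Split on runs of separators and drop the empty string left at the end.
parts = [piece for piece in re.split(r"[.\[\]]+", my_string) if piece]

# Or match the tokens instead of the separators: runs of non-separator characters.
tokens = re.findall(r"[^.\[\]]+", my_string)

print(parts)   # ['a', 'b', 'c', '0', 'd', 'e', '12']
print(tokens)  # ['a', 'b', 'c', '0', 'd', 'e', '12']
```

The findall variant never produces empty strings, so no filtering step is needed.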

Comparing lists - Homework Python

correct_ans = ['B', 'D', 'A', 'A', 'C', 'A', 'B', 'A', 'C', 'D', 'B', 'C', \
'D', 'A', 'D', 'C', 'C', 'B', 'D', 'A']
Here is my statement to import the list from the txt file:
# import user answers into a list
infile = open('testscores.txt', 'r')
driver_ans = infile.readlines()
infile.close()
driver_ans = ['B', 'D', 'A', 'A', 'C', 'B', 'B', 'A', 'C', 'D', 'B', 'C', \
'D', 'A', 'D', 'C', 'C', 'B', 'D', 'A']
for index in range(0, 20):
    if driver_ans[index] == correct_ans[index]:
        total_correct += 1
    else:
        wrong_ans.append(index + 1)
This logic keeps reporting every answer as wrong. That is not correct, as I can see by comparing my "correct_ans" list and my "driver_ans" list visually. What am I doing wrong?!
This is only a guess, but if testscores.txt has the content
B
D
A
A
...
keep in mind that driver_ans will be
['B\n', 'D\n', 'A\n', 'A\n', ...
Maybe try:
driver_ans = [x.strip('\n') for x in infile.readlines()]
The readlines() function returns lines that include the trailing newline. So, try:
driver_ans = [x.strip() for x in infile.readlines()]
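A self-contained sketch of the whole comparison follows. It writes a small temporary file in place of testscores.txt, and the four-answer key and file contents are made-up sample data, so only the technique (strip each line, then compare position by position) carries over.

```python
import os
import tempfile

# Made-up sample data standing in for the real answer key and testscores.txt.
correct_ans = ['B', 'D', 'A', 'A']
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as outfile:
    outfile.write('B\nD\nC\nA\n')

# Strip the trailing newline from every line while reading.
with open(path) as infile:
    driver_ans = [line.strip() for line in infile]
os.remove(path)

total_correct = 0
wrong_ans = []
# enumerate(..., start=1) yields 1-based question numbers.
for number, (given, expected) in enumerate(zip(driver_ans, correct_ans), start=1):
    if given == expected:
        total_correct += 1
    else:
        wrong_ans.append(number)

print(total_correct, wrong_ans)  # 3 [3]
```

Without the strip() call, every element would be compared as 'B\n' against 'B' and all answers would be counted as wrong, which matches the behaviour described in the question.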