Scandinavian letters (æøå) in python 2.7 - python-2.7

I am having a strange problem when using 'æ', 'ø' and 'å' in Python.
I have included: # -*- coding: utf-8 -*-
at the top of every file, and æøå prints fine, so no worries there. However, if I do len('æ') I get 2. I am writing a program that loops over and analyzes Danish text, so this is a big problem.
Below are some examples from the Python terminal to illustrate the problem:
In [1]: 'a'.islower()
Out[1]: True
In [2]: 'æ'.islower()
Out[2]: False
In [3]: len('a')
Out[3]: 1
In [4]: len('æ')
Out[4]: 2
In [5]: for c in 'æ': print c in "æøå"
True
True
In [6]: print "æøå are troublesome characters"
æøå are troublesome characters
I can get around the problem of islower() and isupper() not working for 'æ', 'ø' and 'å' by simply doing c.islower() or c in "æøå" to check whether c is a lower-case letter, but as shown above, both parts of 'æ' will then count as lower case and be counted twice.
Is there a way to make those letters behave like any other letter?
I run Python 2.7 on Windows 10 using Canopy, as it's an easy way to get sklearn and numpy, which I need.

You have stumbled across the fact that strings are bytes by default in Python 2. With your header # -*- coding: utf-8 -*- you have only told the interpreter that your source code is UTF-8; this has no effect on how strings are handled at runtime. That is why len('æ') is 2: 'æ' takes two bytes in UTF-8.
The solution to your problem is to convert all your strings to unicode objects with the decode method, e.g.
danish_text_raw = 'æ' # here you would load your text
print(type(danish_text_raw)) # prints <type 'str'>
danish_text = danish_text_raw.decode('utf-8')
print(type(danish_text)) # prints <type 'unicode'>
The issues with islower and len should then be fixed. Make sure that all the strings you use in your program are unicode objects and not byte strings; otherwise comparisons can lead to strange results. For example
danish_text_raw == danish_text # False in Python 2 (with a UnicodeWarning)
To make sure that you are working with unicode strings you can, for example, pass every incoming string through a function like this
def to_unicode(in_string):
    if isinstance(in_string, str):
        out_string = in_string.decode('utf-8')
    elif isinstance(in_string, unicode):
        out_string = in_string
    else:
        raise TypeError('not stringy')
    return out_string
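As a minimal sketch of the decode-first approach described above, the snippet below counts lower-case letters in UTF-8 encoded input. The helper name count_lowercase and the sample word are assumptions made for illustration, not part of the original answer; the code is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

def count_lowercase(raw_bytes, encoding='utf-8'):
    """Count lower-case letters in encoded input (hypothetical helper)."""
    # Decode once at the boundary; after that every item is one character.
    text = raw_bytes.decode(encoding)
    return sum(1 for ch in text if ch.islower())

raw = u'æ'.encode('utf-8')          # bytes, as they would arrive from a UTF-8 file
print(len(raw))                     # 2: 'æ' occupies two bytes in UTF-8
print(len(raw.decode('utf-8')))     # 1: one character once decoded
print(count_lowercase(u'Blåbærgrød'.encode('utf-8')))  # 9
```

Because the text is decoded before the loop, 'æ', 'ø' and 'å' are each seen once and islower() recognizes them, so no special case for those letters is needed.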

Related

Python Decoding and Encoding, List Element utf-8

Just another question about encoding in Python, I think. I have this program:
regex = re.compile(ur'\b[sw]\w+', flags= re.U | re.I)
ergebnisliste = []
for line in fileobject:
    print str(line)
    erg = regex.findall(line)
    ergebnisliste = ergebnisliste + erg
ergebnislistesortiert = sorted(ergebnisliste, key=lambda x: len(x))
print ergebnislistesortiert
fileobject.close()
I am searching a text file for words beginning with s or w. "ergebnislistesortiert" is the sorted result list.
When I print the result list, there appears to be a problem with the encoding:
['so', 'Wer', 'sp\xc3']
The 'sp\xc3' should be printed as spät. What is wrong here? Why is the list element UTF-8?
And how can I get the right decoding to print "spät"?
Thanks a lot guys!
\xc3 on its own is not UTF-8. It's a fragment of the full UTF-8 encoding of U+00E4 (ä), which is the two bytes \xc3\xa4. You're probably reading it with something like a Latin-1 decoder (which is effectively what Python 2 does if you read bytes without specifying an encoding), in which case the second byte in the UTF-8 sequence isn't matched by \w and the match stops after \xc3.
The real fix is to decode the data when you are reading it into Python in the first place. If you are writing new code, switching to Python 3 is probably the best and easiest fix.
If you're stuck on Python 2.7, a somewhat Python 3-compatible approach is something like
import io
fileobject = io.open(filename, encoding='utf-8')
If you have control over the input file and want to postpone the proper solution until you are older, (ask your parents for permission to) convert the UTF-8 input file to some legacy 8-bit encoding.
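A minimal sketch of the "decode when reading" fix, assuming the file is UTF-8: the function name find_s_w_words and the throwaway demo file are assumptions for illustration only. The pattern is the one from the question, written without the ur prefix so that it is also valid in Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import tempfile

def find_s_w_words(path):
    """Return words starting with s or w, shortest first (hypothetical helper)."""
    pattern = re.compile(r'\b[sw]\w+', flags=re.UNICODE | re.IGNORECASE)
    words = []
    # io.open decodes each line to text, so \w sees 'ä' as one letter.
    with io.open(path, encoding='utf-8') as fileobject:
        for line in fileobject:
            words.extend(pattern.findall(line))
    return sorted(words, key=len)

# Demo with a throwaway UTF-8 file.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(u'Wer so spät\n'.encode('utf-8'))
result = find_s_w_words(tmp.name)
os.remove(tmp.name)
print(result)
```

With decoded input the last element is the whole word spät rather than the truncated 'sp\xc3'. On Python 2 the list prints with u'...' reprs; printing the individual elements shows the characters.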

How can I change this function to be compatible with Python 2 and Python 3? I'm running into string, unicode and other problems

I have a function that is designed to make some text safe for filenames or URLs. I'm trying to change it so that it works in Python 2 and Python 3. In my attempt, I've confused myself with bytecode and would welcome some guidance. I'm encountering errors like sequence item 1: expected a bytes-like object, str found.
def slugify(
    text = None,
    filename = True,
    URL = False,
    return_str = True
    ):
    if sys.version_info >= (3, 0):
        pass # insert magic here
    else:
        if type(text) is not unicode:
            text = unicode(text, "utf-8")
    if filename and not URL:
        text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")
        text = unicode(re.sub("[^\w\s-]", "", text).strip())
        text = unicode(re.sub("[\s]+", "_", text))
    elif URL:
        text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")
        text = unicode(re.sub("[^\w\s-]", "", text).strip().lower())
        text = unicode(re.sub("[-\s]+", "-", text))
    if return_str:
        text = str(text)
    return text
It seems like your main problem is figuring out how to convert the text to unicode and back to bytes when you aren't sure what the original type was. In fact, you can do this without any conditional checks if you're careful.
if isinstance(s, bytes):
    s = s.decode('utf8')
This should be sufficient to convert something to unicode in either Python 2 or 3 (assuming 2.6+ and 3.2+, as is usual). It works because bytes exists as an alias for str in Python 2. The explicit utf8 argument is only required in Python 2, but there's no harm in providing it in Python 3 as well. Then, to convert back to a byte string, you just do the reverse.
if not isinstance(s, bytes):
    s = s.encode('utf8')
Of course, I would recommend that you think hard about why you are unsure what types your strings have in the first place. It is better to keep the two types separate than to write "weak" APIs that accept either. Python 3 just encourages you to maintain the separation.
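Putting the two conversions to work, here is one possible sketch of a slugify that runs unchanged on Python 2.7 and Python 3. It is not the asker's final code: the helper to_text, the lower-case url parameter and the simplified signature are assumptions made for illustration. The key step is decoding back to text after the ASCII round trip, since encode() returns bytes and mixing bytes with a text pattern in re.sub is what fails on Python 3.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata

def to_text(s, encoding='utf-8'):
    """Decode byte strings; pass text through unchanged (hypothetical helper)."""
    if isinstance(s, bytes):
        return s.decode(encoding)
    return s

def slugify(text, url=False):
    text = to_text(text)
    # Decompose accented characters, drop the non-ASCII parts, then go
    # straight back to text so the regex calls below never see bytes.
    text = unicodedata.normalize('NFKD', text)
    text = text.encode('ascii', 'ignore').decode('ascii')
    text = re.sub(r'[^\w\s-]', '', text).strip()
    if url:
        return re.sub(r'[-\s]+', '-', text.lower())
    return re.sub(r'\s+', '_', text)

print(slugify(u'Héllo Wörld!'))            # Hello_World
print(slugify(u'Héllo Wörld!', url=True))  # hello-world
print(slugify(b'sp\xc3\xa4t'))             # spat
```

The function always returns text; a caller that really needs bytes can apply the encode step from the answer at the point where the bytes are required.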

python hangman code for a beginner

I just started to learn Python about a week ago. I tried to create a simple hangman game today. All of my code works so far, but there is one thing that I cannot work out how to implement. I want the code to print 'you win' when the player correctly types 'python', letter by letter, but I can't seem to end the game after they get it right. It will end if they type 'python' in one attempt, as opposed to letter by letter. My attempt is on the line with the .join, but I can't figure it out. Any help or advice for a new programmer would be greatly appreciated.
guesses = []
count = 1
ans = 'python'
word = ''
while count < 10:
    guess = raw_input('guess a letter: ')
    guesses.append(guess)
    if ''.join(word) == ans:
        print 'you win'
        break
    elif len(guess) > 1 and ans == guess:
        print ans
        print 'you win'
        break
    else:
        for char in ans:
            if char in guesses:
                word.append(char)
                print char,
            else:
                print '_',
    count += 1
else:
    print '\nyou lose'
First, I want to start off by saying: unless you are dealing with legacy code or some library you need that only works in 2.7, do not use Python 2.7; use Python 3.x instead (currently on 3.6). Support for 2.7 will end soon, and 3.6+ has many more features and a lot of quality-of-life improvements to the syntax and language that you will appreciate (and it supports functionality that 2.7 just doesn't have).
With that said, I'll make the translation to 3.6 for you. It barely makes a difference.
guesses = []
count = 1
ans = 'python'
word = ''
while count < 10:
    guess = input('guess a letter: ')
    guesses.append(guess)
    if ''.join(word) == ans:
        print('you win')
        break
    elif len(guess) > 1 and ans == guess:
        print(ans)
        print('you win')
        break
    else:
        for char in ans:
            if char in guesses:
                word.append(char)
                print(char, end=' ')  # end=' ' replaces the Python 2 trailing comma
            else:
                print('_', end=' ')
    count += 1
else:
    print('\nyou lose')
The only two changes here are that print now requires parentheses, so every print 'stuff' is now print('stuff'), and raw_input is now input('input prompt'). Other than that, I'm surprised you were able to get away with word.append(char). You cannot use append() on a Python str in either 2.7 or 3.x. I think you were trying to use it as a list, as that is the only reason you would use ''.join(word). To fix this I would do word = [] instead of word = ''. Now your ''.join(word) should work properly.
I would advise you to take the next step and try to add the following things to your program: If the user doesn't enter a single character, make it so that the input is not added to the guesses list. Try making this a main.py file if you haven't already. Make parts of the program into functions. Add a new-game command. Print an actual hangman drawn in characters every turn. Add file I/O to read guess words (i.e. instead of just python, you could put a lot of words in a file to choose from).
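One way the suggestions above might fit together is sketched below. This is only an illustration under stated assumptions: the function names mask_word, is_solved and play, and the read parameter used to inject input, are not from the original answer. It avoids the word list entirely by deciding the win from the guesses, which also prevents letters from being appended again on every turn.

```python
def mask_word(ans, guesses):
    """Show guessed letters and underscores for the rest."""
    return ' '.join(c if c in guesses else '_' for c in ans)

def is_solved(ans, guesses):
    """True once every letter of the answer has been guessed."""
    return all(c in guesses for c in ans)

def play(ans='python', max_turns=9, read=input):
    guesses = []
    for _ in range(max_turns):
        guess = read('guess a letter: ')
        if guess == ans:
            # Whole word typed in one go.
            print('you win')
            return True
        if len(guess) == 1:
            # Only single characters count as letter guesses.
            guesses.append(guess)
        print(mask_word(ans, guesses))
        if is_solved(ans, guesses):
            print('you win')
            return True
    print('you lose')
    return False
```

Calling play() starts an interactive game in Python 3; passing a different read function makes the loop easy to drive without a keyboard, for example from a list of prepared letters.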

Certain dictionary key is overlooked/missed/not recognized during iteration in python

I was trying to define a simple function to convert numbers expressed as string to real numbers. e.g 1234K to 1234000, 1234M to 1234000000. This can be easily done using if statement. Out of curiosity, I used dictionary instead and found out the following problem. Please see my code below.
# -*- coding: utf-8 -*-
dict_EN_num={"K":1000,'M':1000000}
def ENtoNum(value):
    x=value
    if type(x)==str:
        for k,v in dict_EN_num.items():
            if k in x:
                x=int(x[:x.find(k)])*v
            break
    return x
y="1234K"
z="1234M"
print ENtoNum(y)
print ENtoNum(z)
The result in my iPython console was:
1234000
1234M
The conversion of variable y with "K" in it worked but the conversion of variable z with "M" failed.
Any idea why this is the case?
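The problem is that your break needs to be indented one more level, so that it sits inside the if block. As written, the loop exits after the first key whether or not that key matched, so only the key that happens to come first in the dictionary ("K" in your run) is ever checked.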
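A minimal sketch of the corrected loop, written for Python 3: the names SUFFIX_MULTIPLIERS and en_to_num are renamings chosen for this illustration. Returning from inside the if has the same effect as a break indented one level deeper.

```python
SUFFIX_MULTIPLIERS = {'K': 1000, 'M': 1000000}

def en_to_num(value):
    """Convert strings such as '1234K' to integers; leave other input as is."""
    if isinstance(value, str):
        for suffix, multiplier in SUFFIX_MULTIPLIERS.items():
            if suffix in value:
                # Stop only once a suffix has actually matched.
                return int(value[:value.find(suffix)]) * multiplier
    return value

print(en_to_num('1234K'))  # 1234000
print(en_to_num('1234M'))  # 1234000000
```

Every key now gets a chance to match, so the result no longer depends on the order in which the dictionary yields its keys.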

URL-Encoding a Byte String?

I am writing a Bittorrent client. One of the steps involved requires that the program sends a HTTP GET request to the tracker containing an SHA1 hash of part of the torrent file. I have used Fiddler2 to intercept the request sent by Azureus to the tracker.
The hash that Azureus sends is URL-Encoded and looks like this: %D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR
The hash should look like this before it's URL-Encoded: d90c3ce39418f0c5d98358e03349262b608cbf52
I notice that it is not as simple as placing a '%' symbol every two characters, so how would I go about encoding this byte string to get the same result as Azureus?
Thanks in advance.
Actually, you can just place a % symbol every two characters. Azureus doesn't do that because, for example, R is a safe character in a URL, and 52 is the hexadecimal representation of R, so it doesn't need to percent-encode it. Using %52 instead is equivalent.
Go through the string from left to right. If you encounter a %, output the next two characters, converting upper-case to lower-case. If you encounter anything else, output the ASCII code for that character in hex using lower-case letters.
%D9 %0C %3C %E3 %94 %18 %F0 %C5 %D9 %83 X %E0 3 I %26 %2B %60 %8C %BF R
The ASCII code for X is 0x58, so that becomes 58. The ASCII code for 3 is 0x33.
(I'm kind of puzzled why you had to ask though. Your question clearly shows that you recognized this as URL-Encoded.)
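The left-to-right procedure described above can be sketched in a few lines of Python. The function names percent_encoded_to_hex and hex_to_percent_every_byte are assumptions made for this illustration; the second function is the "place a % every two characters" form, which the answer notes is equivalent for the tracker even though it is not byte-for-byte what Azureus sends.

```python
def percent_encoded_to_hex(escaped):
    """Walk the string once, turning each byte back into two hex digits."""
    out = []
    i = 0
    while i < len(escaped):
        if escaped[i] == '%':
            # Escaped byte: copy the two hex digits, lower-cased.
            out.append(escaped[i + 1:i + 3].lower())
            i += 3
        else:
            # Literal character: emit its ASCII code in hex.
            out.append('{:02x}'.format(ord(escaped[i])))
            i += 1
    return ''.join(out)

def hex_to_percent_every_byte(hex_str):
    """Percent-encode every byte, including the URL-safe ones."""
    pairs = (hex_str[i:i + 2] for i in range(0, len(hex_str), 2))
    return ''.join('%' + pair.upper() for pair in pairs)

esc = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
print(percent_encoded_to_hex(esc))
# d90c3ce39418f0c5d98358e03349262b608cbf52
```

Running percent_encoded_to_hex on the output of hex_to_percent_every_byte gives back the original hash, which is a quick way to confirm that %58 and X stand for the same byte.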
Even though I know well the original question was about C++, it might be useful somehow, sometimes to see alternative solutions. Therefore, for what it's worth (10 years later), here's
An alternative solution implemented in Python 3.6+
import binascii
import urllib.parse
def hex_str_to_esc_str(hex_str: str, *, encoding: str='Windows-1252') -> str:
    # decode the bytes behind the hex string as text in the given encoding
    decoded = binascii.unhexlify(hex_str).decode(encoding)
    # escape string and return
    return urllib.parse.quote(decoded, encoding=encoding)
def esc_str_to_hex_str(esc_str: str, *, encoding: str='Windows-1252') -> str:
    # unescape the escaped string as text in the given encoding
    decoded = urllib.parse.unquote(esc_str, encoding=encoding)
    # encode string, hexlify, and return
    return decoded.encode(encoding).hex()
Two elementary tests:
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(hex_str_to_esc_str(hex_str) == esc_str) # True
print(esc_str_to_hex_str(esc_str) == hex_str) # True
Note
Windows-1252 (aka cp1252) emerged as the default encoding as a result of the following test:
import binascii
import chardet
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(
    chardet.detect(
        binascii.unhexlify(hex_str)
    )
)
...which gave a pretty strong clue:
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
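A caveat on the detection result: a SHA-1 digest is arbitrary binary data rather than text, and Python's Windows-1252 codec has no mapping for a few byte values (0x81, for example), so the text round trip can raise an error on other hashes. A sketch that stays in bytes and avoids choosing an encoding at all is shown below; the function names are assumptions made for this illustration, while quote_from_bytes and unquote_to_bytes are standard urllib.parse functions.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

def hex_to_percent(hex_str):
    """Percent-encode the raw bytes behind a hex digest."""
    # safe='' so that '/' is escaped too; unreserved characters stay literal.
    return quote_from_bytes(bytes.fromhex(hex_str), safe='')

def percent_to_hex(esc_str):
    """Recover the hex digest from a percent-encoded string."""
    return unquote_to_bytes(esc_str).hex()

esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(hex_to_percent(hex_str) == esc_str)  # True
print(percent_to_hex(esc_str) == hex_str)  # True
```

Since no text decoding is involved, every one of the 256 possible byte values survives the round trip.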