Python 2.7 "latin-1" encoding used instead of "UTF-8" - python-2.7

I am aware that there are plenty of discussions on the "UTF-8" encoding issue on Python 2 but I was unable to find a solution to my problem so far. I am currently creating a script to get the name of a file and hyperlink it in xlwt, so that the file can be accessed by clicks in the spreadsheet. Problem is, some of the names of these files include non-ASCII characters.
Question 1
I used the following line to retrieve the name of the file. There is only one file in the folder by the way.
>>f = filter(os.path.isfile, os.listdir(tmp_path))[0]
And then
>>print f
'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc'
>>print sys.stdout.encoding
'UTF-8'
>>f.decode("UTF-8")
*** UnicodeDecodeError: 'utf8' codec can't decode byte 0xe7 in position 76: invalid continuation byte
From browsing the discussions here, I realized that "\xe7\xe3o" is not a "UTF-8" encoding. Running the following line seems to back this point.
>>f.decode("latin-1")
u'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc'
My question is then, why is the variable f being encoded in "latin-1" when the system encoding is set to "UTF-8"?
Question 2
While f.decode("latin-1") gives me the output that I want, I am still unable to supply the variable to the hyperlink function in the spreadsheet.
>>data.append(["File", xlwt.Formula('HYPERLINK("%s";"%s")' % (os.path.join(dl_path,f.decode("latin-1")),f.decode("latin-1")))])
*** FormulaParseException: can't parse formula HYPERLINK("u'H:\\Mad Lab\\SE Doc Crawler\\bovespa\\download\\521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc's;"u'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc's)
Apparently, the closing double quote got eaten up and was replaced by a " 's" suffix. Can somebody help to explain what's going on here? 0.0
Oh and if someone can suggest a solution to Question 2 above then I will be very grateful - for you would have saved my weekend from misery!
Thanks in advance all!

Welcome to the confusing world of encoding! There's at least file encoding, terminal encoding and filename encoding to deal with, and all three could be different.
In Python 2.x, the goal is to get a Unicode string (different from str) from an encoded str. The problem is that you don't always know the encoding used for the str so it's difficult to decode it.
When using listdir() to get filenames, there's a documented but often overlooked quirk - if you pass a str to listdir() you get encoded strs back. These will be encoded according to your locale. On Windows these will be an 8bit character set, like windows-1252.
Alternatively, pass listdir() a Unicode string instead.
E.g.
os.listdir(u'C:\\mydir')
Note the u prefix
This will return properly decoded Unicode filenames. On Windows and OS X, this is pretty reliable as long your environment locale hasn't been messed with.
In your case, listdir() would return:
u'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc'
Again, note the u prefix. You can now print this to your PyCharm console with no modification.
E.g.
f = filter(os.path.isfile, os.listdir(tmp_path))[0]
print f

As for Question 2, I did not investigate further but just printed the output as unicode strings, rather than xlwt objects, due to time constraint. I'm able to continue with the project, though without the understanding of what went wrong here. In that sense, the 2 questions above have been answered.

Related

Print special character from utf-8 encoded string

I'm having trouble dealing with encoding in Python:
I get some strings from a csv that I open using pandas.read_csv(), they are encoded in unicode so I encode it to utf-8 doing the following
# data is from my csv
string = data.encode('utf-8')
print string
However, when I print it, i get
"Parc d'Activit\xc3\xa9s des Gravanches"
and i would like to return
"Parc d'Activités des Gravanches"
It seems like an easy issue but I'm quite new to python and did not find anything close enough to my problem.
Note: I am using Python 2.7 and my file starts with
#!/usr/bin/env python2.7
# coding: utf8
EDIT: I just say that you are using Python 2, okay, I think the answer below is still valuable though.
In Python 2 this is even more complicated and inconsistent. Here you have str and unicode, and the default str doesn't support unicode stuff.
Anyways, the situation is more or less the same, use decode instead of encode to convert from str to unicode. That should fix it.
More info at: https://pythonhosted.org/kitchen/unicode-frustrations.html
This is a common source of confusion.The issue is a bit complex, but I'll try to simplify it. I'm talking about Python 3 here, I believe there's several differences with Python 2.
There's two types of what you would call a string: str and bytes.
str is the general string type form Python, it supports unicode seamlessly in Python 3, but the way it encodes the actual data is not relevant, it's an object.
bytes is a byte array, like char* in C. It's a sequence of bytes.
Strings can be represented both ways, but you need to specify an encoding standard to translate between the two, as bytes needs to be interpreted, because it's just, again, a raw array of bytes.
encode converts a str into bytes, that's the mistake you make. Of course, if you print bytes it will just show it's raw data, AKA, the string encoded as utf-8.
decode does the opposite operation, that may be what you need.
However, if you open the file normally (open(file_name, 'r')) instead of in byte mode (open(file_name, 'b'), which I doubt you are doing, you shouldn't need to do anything, printing data should just work as you want it to.
More info at: https://docs.python.org/3/howto/unicode.html

Convert \xc3\xd8\xe8\xa7\xc3\xb4\xd to human readable format

I am having trouble converting '\xc3\xd8\xe8\xa7\xc3\xb4\xd' (which is a Thai text) to a readable format. I get this value from a smart card, and it basically was working for Windows but not in Linux.
If I print in my Python console, I get:
����ô
I tried to follow some google hints but I am unable to accomplish my goal.
Any suggestion is appreciated.
Your text does not seem to be a Unicode text. Instead, it looks like it is in one of Thai encodings. Hence, you must know the encoding before printing the text.
For example, if we assume your data is encoded in TIS-620 (and the last character is \xd2 instead of \xd) then it will be "รุ่งรดา".
To work with the non-Unicode strings in Python, you may try: myString.decode("tis-620") or even sys.setdefaultencoding("tis-620")

Unicode, UTF-8, UTF-16 and UTF-32 questions [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I read a lot about Unicode, ASCII, code pages, all the history, the invention of UTF-8, UTF-16 (UCS-2), UTF-32 (UCS-4) and who use them and so on, but I still having some questions that I tried hardly to find answers but I couldn't and I hope you to help me.
1 - Unicode is a standard for encoding characters and they specify a code point for each character. Something like U+0000 (example). Imagine that I have a file that has those code points (\u0000), in which point of my application I'm going to use it?
This might be a silly question but I really don't know in which point of my application I'm going to use it.
I'm creating an application that can read file that has those code points using the escape \u and I know that I can read it, decode it but now the next question.
2 - To which character set (code page) do I need to convert it? I saw some C++ libraries that they uses the name utf8_to_unicode or utf8-to-utf16 and also only utf8_decode, and this is what makes me confuse.
I don't know if will appear answers like this, but some might say: You need to convert it into code pages that you are going to use, but what if my application needs to be internationalized?
3 - I was wondering, in C++ if I try to display non-ASCII characters on terminal I got some confusing words. The question is: What makes the characters to be displayed are the fonts?
#include <iostream>
int main()
{
std::cout << "ö" << std::endl;
return 0;
}
The output (Windows):
├Â
4 - In which part of that process the encoding enter? It encodes, takes the code point and try to find the word that is equal on the fonts?
5 = WebKit is an engine for rendering web pages in web browsers, if you specify the charset as UTF-8 it works nicely with all characters, but if I specify another charset it doesn't, doesn't matter the font that I'm using, what happen?
<html>
<head>
<meta charset="iso-8859-1">
</head>
<body>
<p>ö</p>
</body>
</html>
The output:
ö
Works using:
<meta charset="utf-8">
6 - Imagine now that I read the file, I encode it, I have all the code points and I need to save the file again. Do I need to save it encoded (\u0000) or I need to decode first to transform again into characters and then save?
7 - Why the word "unicode" is a bit overloaded and is sometimes understood to mean utf-16? (source)
That's all for now. Thanks in advance.
I'm creating an application that can read file that has those code points using the escape \u and I know that I can read it, decode it but now the next question.
If you're writing a program that processes some kind of custom escapes, such as \uXXXX, it's entirely up to you when to convert these escapes into Unicode code points.
To which character set (code page) do I need to convert it?
That depends on what you want to do. If you're using some other library that requires a specific code page then it's up to you to convert data from one encoding into the encoding required by that library. If you don't have any hard requirements imposed by such third party libraries then there may be no reason to do any conversion.
I was wondering, in C++ if I try to display non-ASCII characters on terminal I got some confusing words.
This is because various layers of the technology stack use different encodings. From the sample output you give, "├Â" I can see that what's happening is that your compiler is encoding the string literal as UTF-8, but the console is using Windows codepage 850. Normally when there are encoding problems with the console you can fix them by setting the console output codepage to the correct value, unfortunately passing UTF-8 through std::cout currently has some unique problems. Using printf instead worked for me in VS2012:
#include <cstdio>
#include <Windows.h>
int main() {
SetConsoleOutputCP(CP_UTF8);
std::printf("%s\n", "ö");
}
Hopefully Microsoft fixes the C++ libraries if they haven't already done so in VS 14.
In which part of that process the encoding enter? It encodes, takes the code point and try to find the word that is equal on the fonts?
Bytes of data are meaningless unless you know the encoding. So the encoding matters in all parts of the process.
I don't understand the second question here.
if you specify the charset as UTF-8 it works nicely with all characters, but if I specify another charset it doesn't, doesn't matter the font that I'm using, what happen?
What's going on here is that when you write charset="iso-8859-1" you also have to actually convert the document to that encoding. You're not doing that and instead you're leaving the document as UTF-8 encoded.
As a little exercise, say I have a file that contains the following two bytes:
0xC3 0xB6
Using information on UTF-8 encoding and decoding, what codepoint do the bytes decode to?
Now using this 8859-1 codepage, what do the same bytes decode to?
As another exercise, save two copies of your HTML document, one using charset="iso-8859-1" and one with charset="utf-8". Now use a hex editor and examine the contents of both files.
Imagine now that I read the file, I encode it, I have all the code points and I need to save the file again. Do I need to save it encoded (\u0000) or I need to decode first to transform again into characters and then save?
This depends on the program that will need to read the file. If the program expects all non-ASCII characters to be escaped like that then you have to save the file that way. But escaping characters with \u is not a normal thing to do. I only see this done in a few places, such as JSON data and C++ source code.
Why the word "unicode" is a bit overloaded and is sometimes understood to mean utf-16?
Largely because Microsoft uses the term this way. They do so for historical reasons: When they added Unicode support they named all their options and setting "Unicode" but the only encoding they supported was UTF-16.

Encode gives wrong value of Japanese kanji

As a part of a scraper, I need to encode kanji to URLs, but I just can't seem to even get the correct output from a simple sign, and I'm currently blinded by everything I've tried thus far from various Stack Overflow posts.
The document is set to UTF-8.
sampleText=u'ル'
print sampleText
print sampleText.encode('utf-8')
print urllib2.quote(sampleText.encode('utf-8'))
It gives me the values:
ル
ル
%E3%83%AB
But as far as I understand, it should give me:
ル
XX
%83%8B
What am I doing wrong? Are there some settings I don't have correct? Because as far as I understand it, my output from the encode() should not be ル.
The code you show works correctly. The character ル is KATAKANA LETTER RU, and is Unicode codepoint U+30EB. When encoded to UTF-8, you'll get the Python bytestring '\xe3\x83\xab', which prints out as ル if your console encoding is Latin-1. When you URL-escape those three bytes, you get %E3%83%AB.
The value you seem to be expecting, %83%8B is the Shift-JIS encoding of ル, rather than UTF-8 encoding. For a long time there was no standard for how to encode non-ASCII text in a URL, and as this Wikipedia section notes, many programs simply assumed a particular encoding (often without specifying it). The newer standard of Internationalized Resource Identifiers (IRIs) however says that you should always convert Unicode text to UTF-8 bytes before performing percent encoding.
So, if you're generating your encoded string for a new program that wants to meet the current standards, stick with the UTF-8 value you're getting now. I would only use the Shift-JIS version if you need it for backwards compatibility with specific old websites or other software that expects that the data you send will have that encoding. If you have any influence over the server (or other program), see if you can update it to use IRIs too!

How to convert ISO-8859-1 to UTF-8 using libiconv in C++

I'm using libcurl to fetch some HTML pages.
The HTML pages contain some character references like: סלקום
When I read this using libxml2 I'm getting: ׳₪׳¨׳˜׳ ׳¨
is it the ISO-8859-1 encoding?
If so, how do I convert it to UTF-8 to get the correct word.
Thanks
EDIT: I got the solution, MSalters was right, libxml2 does use UTF-8.
I added this to eclipse.ini
-Dfile.encoding=utf-8
and finally I got Hebrew characters on my Eclipse console.
Thanks
Have you seen the libxml2 page on i18n ? It explains how libxml2 solves these problems.
You will get a ס from libxml2. However, you said that you get something like ׳₪׳¨׳˜׳ ׳¨. Why do you think that you got that? You get an XMLchar*. How did you convert that pointer into the string above? Did you perhaps use a debugger? Does that debugger know how to render a XMLchar* ? My bet is that the XMLchar* is correct, but you used a debugger that cannot render the Unicode in a XMLchar*
To answer your last question, a XMLchar* is already UTF-8 and needs no further conversion.
No. Those entities correspond t the decimal value of the Unicode sequence number of your characters. See this page for example.
You can therefore store your Unicode values as integers and use an algorithm to transform those integers to an UTF-8 multibyte character. See UTF-8 specification for this.
This answer was given in the assumpltion that the encoded text is returned as UTF-16, which as it turns out, isn't the case.
I would guess the encoding is UTF-16 or UCS2. Specify this as input for iconv. There might also be an endian issue, have a look here
The c-style way would be (no checking for clarity):
iconv_t ic = iconv_open("UCS-2", "UTF-8");
iconv(ic, myUCS2_Text, inputSize, myUTF8-Text, outputSize);
iconv_close(ic);