View utf-8 tuple element as string in python - python-2.7

I have a list with Unicode utf-8 tuples like:
((u'\u0d2a\u0d31\u0d1e\u0d4d\u0d1e\u0d41',
u'\u0d15\u0d47\u0d3e\u0d23\u0d4d\u200d\u0d17\u0d4d\u0d30\u0d38\u0d4d'),
7.5860818562067314)
I want to display these escaped values as readable strings. I have tried decode, but I get an error.
Can anyone please help me out?
Thanks in advance!

You should use the .encode('utf-8') method instead of .decode(): your strings are already of type unicode, and what you want are byte strings. Calling .decode() on a unicode object makes Python 2 first encode it implicitly with the ASCII codec, which fails for non-ASCII characters.
Here is a great how-to on string encoding in Python 2.7; it is a must-read:
http://docs.python.org/2/howto/unicode.html
For example, data[0][0].encode('utf-8') produces the expected result.
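As a minimal sketch of the difference between the two methods: the name data follows the answer above and is assumed here to hold the tuple from the question. The snippet is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Assumption: `data` is the tuple shown in the question.
data = ((u'\u0d2a\u0d31\u0d1e\u0d4d\u0d1e\u0d41',
         u'\u0d15\u0d47\u0d3e\u0d23\u0d4d\u200d\u0d17\u0d4d\u0d30\u0d38\u0d4d'),
        7.5860818562067314)

word = data[0][0]                      # a text (unicode) object
utf8_bytes = word.encode('utf-8')      # text -> UTF-8 byte string
restored = utf8_bytes.decode('utf-8')  # byte string -> text again

print(repr(utf8_bytes))                # escaped bytes, safe on any terminal
print(restored == word)
```

encode goes from text to bytes and decode goes back, so decode is only meaningful on the byte string.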

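Those are not UTF-8; they are unicode objects, which can be printed directly if your terminal encoding supports the characters: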
print data[0][0], data[0][1]

Related

Print special character from utf-8 encoded string

I'm having trouble dealing with encoding in Python:
I get some strings from a CSV file that I open using pandas.read_csv(). They are unicode, so I encode them to UTF-8 as follows:
# data is from my csv
string = data.encode('utf-8')
print string
However, when I print it, I get
"Parc d'Activit\xc3\xa9s des Gravanches"
and I would like to get
"Parc d'Activités des Gravanches"
It seems like an easy issue but I'm quite new to python and did not find anything close enough to my problem.
Note: I am using Python 2.7 and my file starts with
#!/usr/bin/env python2.7
# coding: utf8
EDIT: I just saw that you are using Python 2. I think the answer below is still valuable, though.
In Python 2 this is even more complicated and inconsistent. Here you have str and unicode, and the default str doesn't support unicode stuff.
Anyways, the situation is more or less the same, use decode instead of encode to convert from str to unicode. That should fix it.
More info at: https://pythonhosted.org/kitchen/unicode-frustrations.html
This is a common source of confusion. The issue is a bit complex, but I'll try to simplify it. I'm talking about Python 3 here; I believe there are several differences from Python 2.
There are two types of what you would call a string: str and bytes.
str is the general string type in Python. It supports Unicode seamlessly in Python 3, and the way it stores the actual data is not relevant: it's an object.
bytes is a byte array, like char* in C. It's a sequence of bytes.
Strings can be represented both ways, but you need to specify an encoding standard to translate between the two, because bytes needs to be interpreted: it's just, again, a raw array of bytes.
encode converts a str into bytes; that's the mistake you are making. Of course, if you print bytes it will just show its raw data, i.e. the string encoded as UTF-8.
decode does the opposite operation, and that may be what you need.
However, if you open the file normally (open(file_name, 'r')) instead of in byte mode (open(file_name, 'rb')), which I doubt you are doing, you shouldn't need to do anything; printing data should just work as you want it to.
More info at: https://docs.python.org/3/howto/unicode.html
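The encode/decode relationship described in this answer can be sketched with the string from the question. This is only an illustration, with the byte string written out by hand instead of being read from the CSV. It should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# The UTF-8 encoded bytes that the question shows being printed.
raw = b"Parc d'Activit\xc3\xa9s des Gravanches"

text = raw.decode('utf-8')         # bytes -> text: \xc3\xa9 becomes one character
again = text.encode('utf-8')       # text -> bytes: back to the original bytes

print(len(raw) - len(text))        # the accented letter takes two bytes
# print(text) shows the accented form on a terminal that uses UTF-8.
```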

Convert \xc3\xd8\xe8\xa7\xc3\xb4\xd to human readable format

I am having trouble converting '\xc3\xd8\xe8\xa7\xc3\xb4\xd' (which is Thai text) to a readable format. I get this value from a smart card, and it worked on Windows but does not work on Linux.
If I print in my Python console, I get:
����ô
I tried to follow some google hints but I am unable to accomplish my goal.
Any suggestion is appreciated.
Your text does not seem to be Unicode text. Instead, it looks like it is in one of the Thai encodings. Hence, you must know the encoding before printing the text.
For example, if we assume your data is encoded in TIS-620 (and the last character is \xd2 instead of \xd) then it will be "รุ่งรดา".
To work with non-Unicode strings in Python 2, you may try myString.decode("tis-620") or even sys.setdefaultencoding("tis-620"). The latter changes the default for the whole process and is generally discouraged, so prefer the explicit decode.
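A small sketch of the explicit decode, under the same assumptions as the answer above: that the data is TIS-620 and that the final byte is \xd2. It should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Assumption: TIS-620 data whose last byte is \xd2, as supposed above.
raw = b'\xc3\xd8\xe8\xa7\xc3\xb4\xd2'

text = raw.decode('tis-620')         # TIS-620 bytes -> text
utf8_bytes = text.encode('utf-8')    # re-encode for a UTF-8 terminal or file

# Show the code points instead of the glyphs, so the check does not
# depend on the terminal font or encoding.
print([hex(ord(ch)) for ch in text])
```

If the code points fall in the Thai block (U+0E00 to U+0E7F), the guess about the encoding is at least plausible.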

Segment a korean word into individual syllables - C++/Python

I am trying to segment a Korean string into individual syllables.
So the input would be a string like "서울특별시" and the outcome "서","울","특","별","시".
I have tried with both C++ and Python to segment a string, but the result is a series of ? or white spaces respectively (the string itself, however, can be printed correctly on the screen).
In C++ I first initialized the input string as string korean="서울특별시" and then used a string::iterator to go through the string and print each individual component.
In Python I have just used a simple for loop.
I am wondering if there is a solution to this problem. Thanks.
I don't know Korean at all, and can't comment on the division into syllables, but in Python 2 the following works:
# -*- coding: utf-8 -*-
print(repr(u"서울특별시"))
print(repr(u"서울특별시"[0]))
Output:
u'\uc11c\uc6b8\ud2b9\ubcc4\uc2dc'
u'\uc11c'
In Python 3 you don't need the u for Unicode strings.
The outputs are the Unicode values of the characters in the string, which means that the string has been correctly cut up in this case. The reason I printed them with repr is that the font in the terminal I used can't represent them, so without repr I just see square boxes. But that's purely a rendering issue; repr demonstrates that the data is correct.
So, if you know logically how to identify the syllables then you can use repr to see what your code has actually done. Unicode NFC sounds like a good candidate for actually identifying them (thanks to R. Martinho Fernandes), and unicodedata.normalize() is the way to get that.
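A minimal sketch of the normalization idea mentioned above. It assumes that, after NFC, each modern Hangul syllable is a single precomposed code point, so iterating over the text yields one syllable per item. The NFD step is only there to simulate input that arrives split into jamo. It should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import unicodedata

# The string from the question, written with escapes (see the repr output above).
word = u'\uc11c\uc6b8\ud2b9\ubcc4\uc2dc'

decomposed = unicodedata.normalize('NFD', word)      # syllables split into jamo
composed = unicodedata.normalize('NFC', decomposed)  # jamo recombined into syllables
syllables = list(composed)                           # one item per code point

print(len(syllables))
print([repr(s) for s in syllables])
```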

PHP: How to get rid of � symbol inside text?

I can't figure out how to remove this � symbol from a string.
The string is in UTF-8 format.
What to do? :(
This removes the whole string:
preg_replace('/\W/','',utf8_decode(substr(utf8_encode($ad['description']),0,125)))
Thanks ;)
Update:
Using:
header('Content-Type: text/html; charset=utf-8');
After the replacement I call exit() right away.
U+FFFD REPLACEMENT CHARACTER is shown when a character has no representation in the current character encoding, or when a byte sequence cannot be decoded with it. Declare your encoding properly as UTF-8 and use UTF-8 strings, and it will not show up on most platforms.
The problem here is that your string is not in UTF-8 format. You treat it as if it were and handle the data accordingly, but the string probably contains ANSI characters (a single-byte encoding such as Windows-1252 or ISO-8859-1). It is not enough to send the Content-Type: text/html; charset=utf-8 header; your content needs to be converted to UTF-8 before it is sent as well.
You could try utf8_decode('string'); or utf8_encode('string');
but you should really try to find the actual problem: make sure the headers are set correctly, check the document type, and check that the text is encoded in the right format when it is saved.
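The mechanism behind the � symbol can be illustrated outside PHP. The following is a Python sketch, not the asker's code, and the sample bytes are a hypothetical stand-in for the stored description. It shows single-byte (ISO-8859-1) data being misread as UTF-8, and the conversion that avoids it.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: the word "café" stored as ISO-8859-1 bytes.
latin1_bytes = b'caf\xe9'

# Treating the bytes as UTF-8 fails on \xe9; with the 'replace' error
# handler the undecodable byte turns into U+FFFD.
wrong = latin1_bytes.decode('utf-8', 'replace')

# Decoding with the real source encoding and re-encoding gives valid UTF-8.
fixed = latin1_bytes.decode('iso-8859-1').encode('utf-8')

print(repr(fixed))
```

Stripping the symbol afterwards loses the character; converting from the actual source encoding keeps it.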

How to convert ISO-8859-1 to UTF-8 using libiconv in C++

I'm using libcurl to fetch some HTML pages.
The HTML pages contain some character references like: סלקום
When I read this using libxml2 I'm getting: ׳₪׳¨׳˜׳ ׳¨
Is it the ISO-8859-1 encoding?
If so, how do I convert it to UTF-8 to get the correct word?
Thanks
EDIT: I got the solution, MSalters was right, libxml2 does use UTF-8.
I added this to eclipse.ini
-Dfile.encoding=utf-8
and finally I got Hebrew characters on my Eclipse console.
Thanks
Have you seen the libxml2 page on i18n? It explains how libxml2 solves these problems.
You will get a ס from libxml2. However, you said that you get something like ׳₪׳¨׳˜׳ ׳¨. Why do you think that you got that? You get an xmlChar*. How did you convert that pointer into the string above? Did you perhaps use a debugger? Does that debugger know how to render an xmlChar*? My bet is that the xmlChar* is correct, but you used a debugger that cannot render the Unicode in an xmlChar*.
To answer your last question, an xmlChar* is already UTF-8 and needs no further conversion.
No. Those entities correspond to the decimal value of the Unicode code point of each of your characters. See this page for example.
You can therefore store your Unicode values as integers and use an algorithm to transform those integers into UTF-8 multibyte characters. See the UTF-8 specification for this.
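The mapping from numeric character references to UTF-8 can be sketched in Python 3 (html.unescape is available from Python 3.4). The reference string below is an assumption: it is built from the decimal code points of the letters in the Hebrew word shown in the question, not copied from the original page.

```python
import html

# Assumption: decimal references for the letters of the word in the question.
references = "&#1505;&#1500;&#1511;&#1493;&#1501;"

text = html.unescape(references)     # numeric references -> text
utf8_bytes = text.encode("utf-8")    # text -> UTF-8 bytes

# The same result by hand: each number is a code point, so chr() is enough.
manual = "".join(chr(n) for n in (1505, 1500, 1511, 1493, 1501))

print(utf8_bytes)
```

Each of these letters lies in the U+0080 to U+07FF range, so it encodes to two bytes in UTF-8.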
This answer was given on the assumption that the encoded text is returned as UTF-16, which, as it turns out, isn't the case.
I would guess the encoding is UTF-16 or UCS-2. Specify this as the input for iconv. There might also be an endianness issue; have a look here
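The C-style way would be (error checking omitted for clarity):
// iconv_open takes the target encoding first, then the source encoding.
iconv_t ic = iconv_open("UTF-8", "UCS-2");
// myUCS2_Text and myUTF8_Text are char* cursors; inputSize and outputSize are size_t byte counts.
iconv(ic, &myUCS2_Text, &inputSize, &myUTF8_Text, &outputSize);
iconv_close(ic);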