len() with unicode strings - python-2.7

If I do:
print "\xE2\x82\xAC"
print len("€")
print len(u"€")
I get:
€
3
1
But if I do:
print '\xf0\xa4\xad\xa2'
print len("𤭢")
print len(u"𤭢")
I get:
𤭢
4
2
In the second example, the len() function returned 2 instead of 1 for the one character unicode string u"𤭢".
Can someone explain to me why this is the case?

Python 2 can use UTF-16 as the internal encoding for unicode objects (a so-called "narrow" build), which means 𤭢 is stored as two surrogate code units: D852 DF62. In that case, len returns the number of UTF-16 code units, not the number of actual Unicode code points.
Python 2 can also be compiled with UTF-32 for unicode (a so-called "wide" build), which means most unicode objects take twice as much memory, but then len(u'𤭢') == 1.
Since Python 3.3 (PEP 393), str objects choose the narrowest fixed-width representation that fits the string (Latin-1, UCS-2, or UCS-4), so you never encounter this problem: len('𤭢') == 1.
str in Python 3.0 to 3.2 is the same as unicode in Python 2.
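The narrow-build behaviour can still be observed indirectly on Python 3, where len counts code points but an explicit UTF-16 encoding exposes the surrogate pair (a sketch; U+24B62 is 𤭢):

```python
import struct

s = '\U00024B62'          # 𤭢 as a single code point
assert len(s) == 1        # Python 3.3+: one character

# Encoding to UTF-16 produces the two surrogate code units that a
# narrow Python 2 build would have stored internally.
utf16 = s.encode('utf-16-be')
hi, lo = struct.unpack('>HH', utf16)
assert (hi, lo) == (0xD852, 0xDF62)
```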

Related

Playing cards Unicode printing in C++

According to this wiki link, the playing cards have Unicode code points of the form U+1F0A1.
I wanted to create an array in C++ to store the 52 standard playing cards, but I noticed this code point is longer than 2 bytes.
So my simple example below does not work; how do I store a Unicode character that is longer than 2 bytes?
wchar_t t = '\u1f0a1';
printf("%lc",t);
The above code truncates t to \u1f0a
How do I store a Unicode character that is longer than 2 bytes?
You can use char32_t with the U prefix, but there's no way to print it to the console. Besides, you don't need char32_t at all for a BMP character such as U+2660; UTF-16 encodes it in a single unit: wchar_t t = L'\u2660', where the L prefix specifies a wide-character literal.
If you are using Windows with the Visual C++ compiler, I recommend the following:
Save your source file with UTF-8 encoding.
Set the compiler option /utf-8 (reference here).
Use a console that supports UTF-8, such as Git Bash, to see the result.
On Windows, wchar_t stores a UTF-16 code unit, so you have to store your string as UTF-16 (using a string literal with the L prefix). That doesn't help you either, since the Windows console can only output characters up to 0xFFFF. See this:
How to use unicode characters in Windows command line?
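The arithmetic behind the surrogate pair is easy to check; a sketch in Python (U+1F0A1 is the ace-of-spades card from the question):

```python
cp = 0x1F0A1                 # PLAYING CARD ACE OF SPADES
v = cp - 0x10000             # 20 bits to split across two UTF-16 units
hi = 0xD800 + (v >> 10)      # high surrogate: top 10 bits
lo = 0xDC00 + (v & 0x3FF)    # low surrogate: bottom 10 bits
assert (hi, lo) == (0xD83C, 0xDCA1)

# The same pair is what a UTF-16 encoder produces:
assert '\U0001F0A1'.encode('utf-16-be') == bytes.fromhex('D83CDCA1')
```

This is why the character cannot fit in a single 2-byte wchar_t: it needs two UTF-16 code units.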

Correct len() for 32-bit unicode strings in Python

I am facing a problem with 32-bit unicode strings in Python 2.7. A simple declaration such as:
s = u'\U0001f601'
print s
Will print a nice 😁 (smiley face) in the shell (if the shell supports unicode). The problem is that when I try:
print len(s), s.encode('latin-1', errors='replace')
I get different responses for different platforms. In Linux, I get:
1 ?
But in Mac, I get:
2 ??
Is the string declaration correct? Is this a bug in Python for Mac?
The OS X Python has been compiled with UCS-2 (really UTF-16) support versus UCS-4 support for Linux. This means that a surrogate pair with a length of 2 characters is being used to represent the SMP character on OS X.
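On Python 2 you can detect which build you have via sys.maxunicode (0xFFFF on narrow builds, 0x10FFFF on wide ones); every Python 3.3+ build reports the full range. A sketch:

```python
import sys

s = u'\U0001f601'
if sys.maxunicode == 0x10FFFF:
    # Wide build (the Linux Python 2 in the question, and every Python 3.3+)
    assert len(s) == 1
else:
    # Narrow build (the OS X Python 2 in the question): a surrogate pair
    assert len(s) == 2
```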

find if string starts with \U in Python 3.3

I have a string and I want to find out if it starts with \U.
Here is an example
myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
I was trying this:
myStr.startswith('\\U')
but I get False.
How can I detect \U in a string?
The larger picture:
I have a list of strings, most of them are normal English word strings, but there are a few that are similar to what I have shown in myStr, how can I distinguish them?
The original string does not have the character \U. It has the unicode escape sequence \U0001f64c, which is a single Unicode character.
Therefore, it does not make sense to try to detect \U in the string you have given.
Trying to detect the \U in that string is similar to trying to detect \x in the C string "\x90".
It makes no sense because the interpreter has read the sequence and converted it. Of course, if you want to detect the first Unicode character in that string, that works fine.
myStr.startswith('\U0001f64c')
Note that if you define the string with a literal backslash-U, like this, you can detect it just fine. Based on some experimentation, I believe Python 2.7.6 defaults to this behavior for plain (non-unicode) string literals, where \U is not treated as an escape.
myStr = r'\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
myStr.startswith('\\U') # Returns True.
Update: The OP requested a way to convert from the Unicode string into the raw string above.
I will show the solution in two steps.
First observe that we can view the raw hex for each character like this.
>>> [hex(ord(x)) for x in myStr]
['0x1f64c', '0x1f60d', '0x1f4a6', '0x1f445', '0x1f4af']
Next, we format it by using a format string.
formatString = "".join(r'\U%08x' for x in myStr)
output = formatString % tuple(ord(x) for x in myStr)
output.startswith("\\U") # Returns True.
Note of course that since we are converting a Unicode string and formatting it this way deliberately, it is guaranteed to start with \U. However, I assume your actual application is not just to detect whether it starts with \U.
Update 2: If the OP is trying to differentiate between "normal English" strings and "Unicode strings", the above approach will not work, because every character has a Unicode representation.
However, one heuristic you might use to check whether a string looks like ASCII is to check whether the value of each character falls inside the normal ASCII range. Assuming that you consider the normal printable ASCII range to be between 32 and 127 (you can take a look here and decide what you want to include), you can do something like the following.
def isNormal(myStr):
    myChars = [ord(x) for x in myStr]
    return all(x < 128 and x > 31 for x in myChars)
This can be done in one line, but I separated it to make it more readable.
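A quick check of that definition (repeated here so the snippet runs on its own; the sample inputs are illustrative):

```python
def isNormal(myStr):
    # True when every character sits in the printable ASCII range 32..127
    myChars = [ord(x) for x in myStr]
    return all(x < 128 and x > 31 for x in myChars)

assert isNormal("hello world")
assert not isNormal('\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af')
assert not isNormal("caf\u00e9")   # é (233) is outside the range
```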
Your string:
myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
is not foreign-language text. It is 5 Unicode characters, which are (in order):
PERSON RAISING BOTH HANDS IN CELEBRATION
SMILING FACE WITH HEART-SHAPED EYES
SPLASHING SWEAT SYMBOL
TONGUE
HUNDRED POINTS SYMBOL
If you want to get strings that only contain 'normal' characters, you can use something like this:
if re.search(r'[^A-Za-z0-9\s]', myStr):
    # String contained 'weird' characters.
Note that this will also trip on characters like é, which will sometimes be used in English on words with a French origin.
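A runnable sketch of that regex check (the helper name looksEnglish is made up for illustration):

```python
import re

def looksEnglish(s):
    # True when the string contains only letters, digits, and whitespace
    return not re.search(r'[^A-Za-z0-9\s]', s)

assert looksEnglish("hello world 123")
assert not looksEnglish('\U0001f64c\U0001f60d')
assert not looksEnglish("caf\u00e9")   # trips on é, as noted above
```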

How to get Unicode code points for character strings (UTF-8) in C or C++ (Linux)

I am working on an application in which I need to know the Unicode code points of characters in order to classify them as Chinese characters, Japanese characters (Kanji, Katakana, Hiragana), Latin, Greek, etc.
The given string is in UTF-8 format.
Is there any way to get the Unicode code point of a UTF-8 character? For example:
Character '≠' has U+2260 Unicode value.
Character '建' has U+5EFA Unicode value.
UTF-8 is a variable-width encoding of Unicode. Each Unicode code point is encoded as one to four bytes (char values).
To decode a char* string and extract a single code point, read one byte. If its most significant bit is clear, that byte is the code point itself. Otherwise, the code point spans multiple bytes, and the number of leading 1 bits in the first byte tells you how many char values encode it.
This table explains how to make the conversion:
UTF-8 (char*)                       | Unicode (21 bits)
------------------------------------+----------------------
0xxxxxxx                            | 00000000000000xxxxxxx
------------------------------------+----------------------
110yyyyy 10xxxxxx                   | 0000000000yyyyyxxxxxx
------------------------------------+----------------------
1110zzzz 10yyyyyy 10xxxxxx          | 00000zzzzyyyyyyxxxxxx
------------------------------------+----------------------
11110www 10zzzzzz 10yyyyyy 10xxxxxx | wwwzzzzzzyyyyyyxxxxxx
Based on that, the code is relatively straightforward to write. If you don't want to write it yourself, you can use a library that does the conversion for you. There are many available on Linux: libiconv, ICU, GLib, ...
libiconv can help you convert the UTF-8 string to UTF-16 or UTF-32. UTF-32 would be the safest option if you really want to support every possible Unicode code point.
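That decoding logic can be sketched in Python, following the table above (a hand-rolled decoder for illustration; in C you would perform the same shifts and masks on unsigned char):

```python
def decode_first(data):
    """Decode the first UTF-8 code point in `data` (bytes).

    Returns (codepoint, number_of_bytes_consumed).
    """
    b = data[0]
    if b < 0x80:                 # 0xxxxxxx: ASCII, one byte
        return b, 1
    elif b >> 5 == 0b110:        # 110yyyyy: two-byte sequence
        n, cp = 2, b & 0x1F
    elif b >> 4 == 0b1110:       # 1110zzzz: three-byte sequence
        n, cp = 3, b & 0x0F
    elif b >> 3 == 0b11110:      # 11110www: four-byte sequence
        n, cp = 4, b & 0x07
    else:
        raise ValueError("invalid leading byte")
    for cont in data[1:n]:       # each continuation byte contributes 6 bits
        if cont >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (cont & 0x3F)
    return cp, n

assert decode_first('≠'.encode('utf-8')) == (0x2260, 3)
assert decode_first('建'.encode('utf-8')) == (0x5EFA, 3)
```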

Unicode Woes! MS Access 97 migration to MS Access 2007

Problem is categorized in two steps:
Problem Step 1. Access 97 db containing XML strings that are encoded in UTF-8.
The problem boils down to this: the Access 97 db contains XML strings encoded in UTF-8, so I created a patch tool to convert the XML strings separately from UTF-8 to Unicode. To convert a UTF-8 string to Unicode, I used the function
MultiByteToWideChar(CP_UTF8, 0, PChar(OriginalName), -1, @newName, Size); (where newName is declared as "newName : Array[0..2048] of WideChar;").
This function works well in most cases; I have checked it with Spanish and Arabic characters. But when I work with Greek and Chinese characters, it chokes.
For some Greek characters, like "Ευγ. ΚαÏαβιά" (as stored in Access 97), the resulting string contains null characters in between, and when it is stored to a wide string the characters get clipped.
For some Chinese characters, like "?¢»?µ?" (as stored in Access 97), the result is totally absurd, like "?¢»?µ?".
Problem Step 2. Access 97 db text strings: the application GUI takes Unicode input and saves it in Access 97.
First I checked with Arabic and Spanish characters, and it seemed that no explicit character encoding was required. But again the problem appears with Greek and Chinese characters.
I tried the same function mentioned above for the text conversion (is that correct?), and the result was again disappointing. The Spanish characters that are fine without conversion get their Unicode characters either lost or converted to plain ASCII letters.
The Greek and Chinese characters show behaviour similar to that mentioned in step 1.
Please guide me. Am I taking the right approach? Is there some other way around this?
Right now I am confused and full of questions :)
There is no special requirement for working with Greek characters. The real problem is that the characters were stored in an encoding that Access doesn't recognize in the first place. When the application stored the UTF-8 values in the database, it tried to convert every single byte to the equivalent byte in the database's codepage. Every character that had no correspondence in that encoding was replaced with ?. That may mean that the Greek text is OK, while the Chinese text may be gone.
In order to convert the data to something readable, you have to know the codepage it is stored in. Using that, you can recover the actual bytes and then convert them to Unicode.
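The recovery step can be sketched in Python: re-encode the garbled text under the codepage it was misread with, then decode the recovered bytes as UTF-8. Here Latin-1 stands in for the unknown codepage, and the Greek sample text is illustrative; whether Latin-1 is right for your data is an assumption you must verify:

```python
# UTF-8 bytes for Greek text that were misread as Latin-1: classic mojibake
original = 'Ευγενία'
garbled = original.encode('utf-8').decode('latin-1')   # e.g. 'Î\x95Ï\x85γ...'

# Reverse the damage: recover the raw bytes, then decode them as UTF-8
recovered = garbled.encode('latin-1').decode('utf-8')
assert recovered == original
```

Latin-1 is convenient for this round trip because it maps every byte value 0x00-0xFF, so no information is lost in either direction.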