Correct len() for 32-bit unicode strings in Python - python-2.7

I am facing a problem with 32-bit unicode strings in Python 2.7. A simple declaration such as:
s = u'\U0001f601'
print s
Will print a nice 😁 (smiley face) in the shell (if the shell supports unicode). The problem is that when I try:
print len(s), s.encode('latin-1', errors='replace')
I get different responses for different platforms. In Linux, I get:
1 ?
But in Mac, I get:
2 ??
Is the string declaration correct? Is this a bug in Python for Mac?

The OS X Python has been compiled with UCS-2 (really UTF-16) support, whereas the Linux build uses UCS-4. This means that on OS X the SMP character is stored as a surrogate pair of two code units, and that pair is what len() counts.
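A quick way to confirm which build you are running is to check sys.maxunicode; a small sketch (the comments assume the narrow/wide behaviour described above):

import sys

s = u'\U0001f601'
if sys.maxunicode > 0xffff:
    print 'wide build (UCS-4):', len(s)     # 1: one code point
else:
    print 'narrow build (UTF-16):', len(s)  # 2: a surrogate pair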

Related

how to use stdscr.addstr() (curses) to print unicode characters

I know how to use the print() function to print unicode characters, but I do not know how to do it using stdscr.addstr().
I'm using Python 2.7 on a Linux operating system.
Thanks
I'm pretty sure you need to encode the string.
The docs read:
Since version 5.4, the ncurses library decides how to interpret non-ASCII data using the nl_langinfo function. That means that you have to call locale.setlocale() in the application and encode Unicode strings using one of the system’s available encodings.
This example worked for me on Python 2.7.12:
import locale
locale.setlocale(locale.LC_ALL, '')
stdscr.addstr(0, 0, mystring.encode('UTF-8'))
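For reference, a minimal self-contained sketch under the same assumptions (a UTF-8 terminal locale; the sample text and the name mystring are just illustrative):

import curses
import locale

locale.setlocale(locale.LC_ALL, '')   # let ncurses pick up the system encoding

def main(stdscr):
    mystring = u'caf\xe9 \u263a'      # sample unicode text (escapes avoid needing a coding declaration)
    stdscr.addstr(0, 0, mystring.encode('utf-8'))
    stdscr.refresh()
    stdscr.getch()                    # wait for a key press before exiting

curses.wrapper(main)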

C++ Infinity Sign

Hello, I was just wondering how I can display the infinity sign (∞) in C++? I am using CodeBlocks. I read a couple of Q&As on this topic, but I'm a newbie at this stuff, especially with hex codes and such. What do I have to include and what do I type out exactly? If someone could write the code and explain it, that'd be great! Thanks!
The symbol is not part of ASCII. However, in code page 437 (usually the default in the Windows Command Prompt with English/US regional settings) it is represented as character #236. So in principle
std::cout << static_cast<unsigned char>(236);
should display it, but the result depends on the current locale/encoding. On my Mac (OS X) it is not displayed properly.
The best way to go about it is to use the Unicode character set (which standardizes a huge number of characters and symbols). In this case,
std::cout << "\u221E";
should do the job, as the Unicode code point U+221E represents the infinity sign.
However, to be able to display Unicode, your output device must support a UTF encoding. On my Mac, Terminal uses UTF-8, whereas the Windows Command Prompt still uses the old code page 437 (thanks to @chris for pointing this out). According to this answer, you can switch to the UTF-8 code page by typing
chcp 65001
in a Command Prompt.
You can show it through its Unicode code point:
∞ has the value \u221E (U+221E).
You can show any character from the Character Map by its Unicode value.

len() with unicode strings

If I do:
print "\xE2\x82\xAC"
print len("€")
print len(u"€")
I get:
€
3
1
But if I do:
print '\xf0\xa4\xad\xa2'
print len("𤭢")
print len(u"𤭢")
I get:
𤭢
4
2
In the second example, the len() function returned 2 instead of 1 for the one-character unicode string u"𤭢".
Can someone explain to me why this is the case?
Python 2 can use UTF-16 as the internal encoding for unicode objects (a so-called "narrow" build), which means 𤭢 is stored as two surrogates: D852 DF62. In this case, len returns the number of UTF-16 code units, not the number of actual Unicode code points.
Python 2 can also be compiled with UTF-32 enabled for unicode (a so-called "wide" build), which means most unicode objects take twice as much memory, but then len(u'𤭢') == 1.
Python 3's str objects since 3.3 switch on demand between 1-, 2- and 4-byte storage per code point (PEP 393), so you'd never encounter this problem: len('𤭢') == 1.
str in Python 3.0 to 3.2 is the same as unicode in Python 2.
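If you are stuck on a narrow build and need the actual code-point count, one option is a small helper that pairs up surrogates; a sketch (the name count_code_points is just illustrative):

import sys

def count_code_points(u):
    # on a wide build len() already counts code points
    if sys.maxunicode > 0xffff:
        return len(u)
    count = 0
    i = 0
    while i < len(u):
        # a high surrogate followed by a low surrogate is one code point
        if u'\ud800' <= u[i] <= u'\udbff' and i + 1 < len(u) and u'\udc00' <= u[i + 1] <= u'\udfff':
            i += 2
        else:
            i += 1
        count += 1
    return count

print count_code_points(u'\U00024b62')   # 1 on both narrow and wide builds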

Python 2.7 range regex matching unicode emoticons

How do I count the number of unicode emoticons in a string using a Python 2.7 regex? I tried the first answer posted for this question, but it keeps raising an invalid expression error.
re.findall(u'[\U0001f600-\U0001f650]', s.decode('utf-8')) is not working; it raises an invalid expression error.
How to find and count emoticons in a string using python?
"Thank you for helping out 😊(Emoticon1) Smiley emoticon rocks!😉(Emoticon2)"
Count : 2
The problem is probably due to using a "narrow build" of Python 2. That is, if you fire up your interpreter, you'll find that sys.maxunicode == 0xffff is True.
This site has a few interesting notes on wide builds of Python (which are commonly found on Linux, but not, as the link suggests, on OS X in my experience). These builds use UCS-4 internally to encode characters, and as a result have saner support for higher-range Unicode code points, such as the ranges you are talking about. Narrow builds use UTF-16 internally, and as a result encode these higher code points as "surrogate pairs". I presume this is the reason you see a bad character range error when you try to compile this regular expression.
The only solutions I know of are to switch to a Python version >= 3.3 (which no longer has the wide/narrow distinction), if you can, or to install a wide Python build.
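That said, here is a hedged sketch of a workaround on a narrow build (it papers over the limitation rather than fixing it): the code-point range U+1F600-U+1F650 can also be written as its UTF-16 surrogate-pair equivalent, \ud83d followed by [\ude00-\ude50]. The sample string mirrors the one in the question:

import re
import sys

s = u'Thank you for helping out \U0001f60a Smiley emoticon rocks!\U0001f609'

if sys.maxunicode > 0xffff:
    # wide build: the code-point range compiles directly
    emoticons = re.compile(u'[\U0001f600-\U0001f650]')
else:
    # narrow build: match the equivalent UTF-16 surrogate-pair range
    emoticons = re.compile(u'\ud83d[\ude00-\ude50]')

print 'Count :', len(emoticons.findall(s))   # Count : 2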

ASCII character problem on Mac. Can't print the black square (which is char(219))

When I try to run this code in C++
cout << char(219);
the output on my Mac is a question mark: ?
However, on a PC it gives me a black square.
Does anyone have any idea why on the Mac there are only 128 characters, when there should be 256?
Thanks for your help.
There's no such thing as ASCII character 219. ASCII only goes up to 127. Characters 128-255 are defined in different ways by different character encodings, for different languages and different OSs.
MacRoman defines it as €.
IBM code page 437 (used at the Windows command prompt) defines it as █.
Windows code page 1252 (used in Windows GUI programs) defines it as Û.
UTF-8 defines it as a part of a 2-byte character. (Specifically, the lead byte of the characters U+06C0 to U+06FF.)
ASCII is really a 7-bit encoding. If you are printing char(219), some other encoding is in use: on the Windows console, most probably CP 437 (which is where the black square comes from). On Mac, I have no idea...
When a character is missing from an encoding, Windows shows a box (that box is not character 219, which doesn't exist), while Macs show a question mark in a diamond because a designer wanted it that way. Both mean the same thing: a missing/invalid character.