How to add non-ASCII characters to a Python list? - python-2.7

I am new to Python and want a list of strings containing non-ASCII characters.
This answer suggested a way to do this, but when I tried it, I got some odd results. Please see the following MWE:
#-*- coding: utf-8 -*-
mylist = ["अ,ब,क"]
print mylist
The output was ['\xe0\xa4\x85,\xe0\xa4\xac,\xe0\xa4\x95']
When I use ASCII characters in the list, say ["a,b,c"], the output is ['a,b,c'] as expected. I want the output of my code to be ["अ,ब,क"].
How can I do this?
PS - I am using Python 2.7.16.

You want to mark these as Unicode strings.
mylist = [u"अ,ब,क"]
Depending on what you want to accomplish, if the data is just a single string, it might not need to be in a list. Or perhaps you want a list of strings?
mylist = [u"अ", u"ब", u"क"]
Python 3 brings a lot of relief to working with Unicode (and doesn't need the u sigil in front of Unicode strings, because all strings are Unicode). It should definitely be your learning target, unless you are specifically tasked with maintaining legacy software after Python 2 is officially abandoned at the end of this year.
Regardless of your Python version, there may still be issues with displaying Unicode on your system, in particular on older systems and on Windows.
If you are unfamiliar with encoding issues, you'll want to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and perhaps the Python-specific Pragmatic Unicode.
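As a quick illustration of the relief Python 3 brings, here is a small sketch: since PEP 3138, repr() shows printable non-ASCII characters directly rather than as escape sequences, so the list prints exactly the way the asker wanted.

```python
# Python 3: all strings are Unicode, and repr() shows printable
# non-ASCII characters as-is instead of \x.. or \u.. escapes.
mylist = ["अ", "ब", "क"]
print(mylist)               # the Devanagari letters appear directly

joined = ",".join(mylist)   # rebuild the single comma-separated string
print(joined)
```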

Use:
#-*- coding: utf-8 -*-
mylist = ["अ,ब,क"]
print [i.decode('utf-8') for i in mylist]
Or use:
#-*- coding: utf-8 -*-
mylist = ["अ,ब,क"]
print map(lambda i: i.decode('utf-8'), mylist)
(Note that a bare unicode(i) would raise a UnicodeDecodeError here: with no encoding argument it assumes ASCII, and these byte strings are UTF-8, so the encoding must be given explicitly.)

Related

Replacing unicode characters with ascii characters in Python/Django

I'm using Python 2.7 here (which is very relevant).
Let's say I have a string containing an "em" dash, "—". This isn't encoded in ASCII. Therefore, when my Django app processes it, it complains. A lot.
I want to replace some such characters with ASCII equivalents for string tokenization and for use with a spell-checking API (PyEnchant, which considers non-ASCII apostrophes to be misspellings), for example by using the plain "-" hyphen instead of an em dash. Here's what I'm doing:
s = unicode(s).replace(u'\u2014', '-').replace(u'\u2018', "'").replace(u'\u2019', "'").replace(u'\u201c', '"').replace(u'\u201d', '"')
Unfortunately, this isn't actually replacing any of the unicode characters, and I'm not sure why.
I don't really have time to upgrade to Python 3 right now. Importing unicode_literals from __future__ at the top of the file (or setting the encoding there) does not let me place actual Unicode literals in the code as it should, and I have tried endless tricks with encode() and decode().
Can anyone give me a straightforward, failsafe way to do this in Python 2.7?
Oh boy... false alarm here! It actually works; I had entered some incorrect character codes. I'm going to leave the question up, since that code is the only thing that let me complete this particular task in this environment.
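For reference, the chain of replace() calls above can be expressed more compactly with a translation table. This is a Python 3 sketch (in Python 3, str.translate accepts a dict mapping code points to replacement strings; the function name asciify_punct is made up for illustration):

```python
# Map "smart" punctuation code points to ASCII equivalents.
# str.translate takes a {code_point: replacement} dict in Python 3.
PUNCT_MAP = {
    0x2014: '-',   # em dash
    0x2018: "'",   # left single quotation mark
    0x2019: "'",   # right single quotation mark
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
}

def asciify_punct(s):
    return s.translate(PUNCT_MAP)

print(asciify_punct('\u201cHello\u201d \u2014 it\u2019s fine'))
```

The translation table also makes it easy to extend the mapping later without stacking more replace() calls.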

Is it possible to create string decorators?

In Python I can automatically create a Unicode object by prepending a u (as in u"test").
Can I build something like that myself?
Is it possible to build something like that myself?
All things are possible - but in this case, only by modifying the source code of the Python interpreter and recompiling.
A related question with the same answer: Can you add new statements to Python's syntax?
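Short of patching the interpreter, the closest approximation is an ordinary function applied to a string literal; it reads almost like a prefix. A minimal sketch (the helper name up is invented for illustration):

```python
# Python has no user-definable literal prefixes, but a short
# function called on a literal comes close in practice.
def up(s):
    # Hypothetical "prefix": return the uppercased string.
    return s.upper()

print(up("test"))  # TEST
```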
Yes you can.
All you need to do is type the following:
ur"\u<hex>"
For example, if you were to type
print ur"\u0489"
it would output the following character (provided the font you are using can render it):
҉
To print this directly, you can simply type
print "҉"
but for this to be allowed, you must put the following line of code as the FIRST line of the file:
# -*- coding: utf-8 -*-
Yes, I know it starts with the # symbol; that is supposed to be there. Python reads the encoding declaration from that comment.
Hope this helps! Have fun with Unicode! :)

How can the Python interpreter read the charset?

I'm learning Python, and I learned that every comment starts with a hash "#". So how can the Python interpreter read this line?
# -*- coding: utf-8 -*-
and set the charset to utf-8? (I'm using Python 2.7.3)
Thank you in advance.
Yes, it is a comment. But that does not mean Python doesn't see it, so it can obviously parse it, too.
What Python actually does is apply the regular expression coding[:=]\s*([-\w.]+) to the first two lines of the file. Most likely this happens even before the actual Python parser steps in.
See PEP-0263 for details.
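The detection can be sketched with that same regular expression. This is a simplified illustration; CPython's real tokenizer additionally checks that the match occurs in a comment on one of the first two lines and validates the encoding name:

```python
import re

# The pattern from PEP 263 used to find a declared source encoding.
CODING_RE = re.compile(r'coding[:=]\s*([-\w.]+)')

first_line = '# -*- coding: utf-8 -*-'
match = CODING_RE.search(first_line)
if match:
    print(match.group(1))  # utf-8
```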

Segment a korean word into individual syllables - C++/Python

I am trying to segment a Korean string into individual syllables.
So the input would be a string like "서울특별시" and the output "서", "울", "특", "별", "시".
I have tried to segment the string in both C++ and Python, but the result is a series of ? characters or whitespace, respectively (the string itself, however, can be printed correctly on the screen).
In C++ I first initialized the input string as string korean="서울특별시" and then used a string::iterator to go through the string and print each individual component.
In Python I just used a simple for loop.
I was wondering if there is a solution to this problem. Thanks.
I don't know Korean at all, and can't comment on the division into syllables, but in Python 2 the following works:
# -*- coding: utf-8 -*-
print(repr(u"서울특별시"))
print(repr(u"서울특별시"[0]))
Output:
u'\uc11c\uc6b8\ud2b9\ubcc4\uc2dc'
u'\uc11c'
In Python 3 you don't need the u for Unicode strings.
The outputs are the Unicode values of the characters in the string, which means the string has been cut up correctly in this case. The reason I printed them with repr is that the font in the terminal I used can't represent them, so without repr I just see square boxes. But that's purely a rendering issue; repr demonstrates that the data is correct.
So, if you know logically how to identify the syllables, you can use repr to see what your code has actually done. Unicode NFC sounds like a good candidate for actually identifying them (thanks to R. Martinho Fernandes), and unicodedata.normalize() is the way to get it.
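In Python 3 this becomes direct: iterating over a str yields code points, and precomposed Hangul syllables are single code points after NFC normalization. A sketch of the segmentation:

```python
import unicodedata

word = "서울특별시"
# Normalize to NFC so each Hangul syllable is one precomposed code point,
# then split the string into its individual characters.
syllables = list(unicodedata.normalize("NFC", word))
print(syllables)
```

If the input might contain decomposed jamo (NFD), the normalize() step is what recombines them into whole syllables before splitting.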

Accents/special characters (e.g., ñ) in verbose_name or help_text?

How do I use letters with accent marks or special characters like ñ in verbose_name or help_text?
Include this at the top of your file:
# -*- coding: utf-8 -*-
and then use this:
u'áéíóú'
I did it:
import os, sys
#encoding= utf-8
Thanks
(Note: the encoding declaration only takes effect on the first or second line of the file, so it must go above the imports.)
@diegueus9 has the right answer for using raw Unicode characters in the source file: use whatever characters you like, as long as you declare the encoding per the instructions in PEP 263. However, for just a few special characters you may find it easier to declare the string as Unicode with the u prefix and use the character's code point. The following are equivalent ways of writing "ñ":
help_text=u'\xF1 \u00F1 \U000000F1'
When it comes to actually finding the code point for a character... that's a little harder. Windows has the useful Character Map utility, and gucharmap is similar. The charts at unicode.org provide alphabet-specific PDFs you can search through. Does anyone know an easier way?
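One programmatic option, sketched here with the standard unicodedata module: ord() gives a character's code point, unicodedata.name() gives its official Unicode name, and a \N{...} escape lets you write that name directly in a literal instead of memorizing the hex value.

```python
import unicodedata

ch = 'ñ'
print(hex(ord(ch)))           # the code point, 0xf1
print(unicodedata.name(ch))   # the official Unicode character name
# Use the name in a literal via the \N{...} escape:
print('\N{LATIN SMALL LETTER N WITH TILDE}')
```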