In Python I can create a Unicode object automatically by prefixing a string literal with u (as in u"test").
Is it possible to build something like that myself?
All things are possible, but in this case only by modifying the source code of the Python interpreter and recompiling.
A related question with the same answer: Can you add new statements to Python's syntax?
Yes, you can.
All you need to do is type the following:
ur"\u<hex>"
For example, if you were to type
print ur"\u0186"
it would output the following character (provided your font can display it):
҉
To print this character directly, you could simply type
print "҉"
but for this to be allowed, you must put the following line of code as the FIRST line of the file:
# -*- coding: utf-8 -*-
Yes, I know it starts with the # symbol; that is supposed to be there.
Hope this helps! Have fun with unicoding! :)
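Putting the two together, a minimal Python 2 script might look like this (a sketch, assuming a terminal and font that can display the glyph):
# -*- coding: utf-8 -*-
print u"\u0489"   # escape form of U+0489, COMBINING CYRILLIC MILLIONS SIGN
print "҉"         # literal form, permitted by the coding declaration above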
I am a new learner of Python, and I want a list of strings containing non-ASCII characters.
This answer suggested a way to do this, but when I tried the code, I got some weird results. Please see the following MWE:
#-*- coding: utf-8 -*-
mylist = ["अ,ब,क"]
print mylist
The output was ['\xe0\xa4\x85,\xe0\xa4\xac,\xe0\xa4\x95']
When I use ASCII characters in the list, say ["a,b,c"], the output is ['a,b,c'] as expected. I want the output of my code to be ["अ,ब,क"].
How to do this?
PS: I am using Python 2.7.16.
You want to mark these as Unicode strings.
mylist = [u"अ,ब,क"]
Depending on what you want to accomplish, if the data is just a single string, it might not need to be in a list. Or perhaps you want a list of strings?
mylist = [u"अ", u"ब", u"क"]
Python 3 brings a lot of relief to working with Unicode (and doesn't need the u sigil in front of Unicode strings, because all strings are Unicode). It should definitely be your learning target, unless you are specifically tasked with maintaining legacy software after Python 2 is officially abandoned at the end of this year.
Regardless of your Python version, there may still be issues with displaying Unicode on your system, in particular on older systems and on Windows.
If you are unfamiliar with encoding issues, you'll want to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and perhaps the Python-specific Pragmatic Unicode.
Use:
#-*- coding: utf-8 -*-
mylist = ["अ,ब,क"]
print [unicode(i, 'utf-8') for i in mylist]
Or use:
#-*- coding: utf-8 -*-
mylist = ["अ,ब,क"]
print map(lambda i: unicode(i, 'utf-8'), mylist)
(unicode() needs the explicit 'utf-8' argument here; without it, Python 2 assumes ASCII and raises UnicodeDecodeError on these bytes. Printing the list still shows the elements' reprs; print the strings themselves to see the characters.)
Given the six.text_type function, it's easy to write I/O code for Unicode text, e.g. https://github.com/nltk/nltk/blob/develop/nltk/parse/malt.py#L188
fout.write(text_type(line))
But without the six module, it would require try-except gymnastics that look like this:
try:
    fout.write(text_type(line))
except:
    try:
        fout.write(unicode(line))
    except:
        fout.write(bytes(line))
What is the Pythonic way to write a Unicode line to a file while keeping the script compatible with both Python 2.x and 3.x?
Is the try-except above the Pythonic way to handle py2-to-py3 compatibility? What other alternatives are there?
For more details/context of this question: https://github.com/nltk/nltk/issues/1080#issuecomment-134542174
Do what six does, and define text_type yourself:
try:
    # Python 2
    text_type = unicode
except NameError:
    # Python 3
    text_type = str
In any case, never use blanket except: clauses here; you'll be masking other issues entirely unrelated to using a different Python version.
It is not clear to me, however, what kind of file object you are writing to. If you are using io.open() to open the file, you'll get a file object that always expects Unicode text, in both Python 2 and 3, and you should never need to convert text to bytes.
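Putting the two together, a minimal sketch that runs unchanged on Python 2 and 3 (the filename and data are made up for illustration):
import io

try:
    text_type = unicode   # Python 2
except NameError:
    text_type = str       # Python 3

lines = [u"first line", u"second line"]   # hypothetical data
with io.open("out.txt", "w", encoding="utf-8") as fout:
    for line in lines:
        fout.write(text_type(line) + u"\n")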
I have an application that needs to handle non-ASCII characters of unknown encoding. The program may delete or replace these characters (if they are discovered in a user dictionary file); otherwise they need to pass through unaltered. What's mind-boggling is that it works one minute, then I make some seemingly trivial change, and now it fails with UnicodeDecodeError, UnicodeEncodeError, or kindred errors. Addressing this has led me down the road of cargo-cult programming: making random tweaks that get it working again, but I have no idea why. Is there a general-purpose solution for dealing with this, perhaps even the creation of a class that modifies the normal way Python deals with strings?
I'm not sure what code to include as about five separate modules are involved. Here is what I am doing in abstract terms:
Taking text from one of two sources: text that the user has pasted directly into a Tkinter toplevel window, or text captured from the Win32 clipboard via a hotkey command.
The text is processed, including the removal of whitespace characters; then certain characters/words are replaced or simply deleted based on a customizable user dictionary.
The result is then returned to the Tkinter GUI or the Win32 clipboard, depending on whether or not the keyboard shortcut was used.
Some details that may be relevant:
All modules use
# -*- coding: utf-8 -*-
The user dictionary is saved in UTF-16 LE with BOM (a function removes BOM characters when parsing the file). The file object is instantiated with
self.pf = codecs.open(self.pattern_fn, 'r', 'utf-16')
Text enters either via a Tkinter Text widget:
text = self.paste_to_field.get(1.0, Tkinter.END)
Or from the clipboard:
text = win32clipboard.GetClipboardData(win32clipboard.CF_UNICODETEXT)
An example error:
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u201d' in position 2: character maps to <undefined>
Furthermore, the same text might work when tested on OS X (where I do development work) but cause an error on Windows.
Regular expressions are used; however, in this case no non-ASCII characters appear in the patterns. For non-ASCII characters I simply use
text = text.replace(old, new)
Another thing to consider: iterations like for c in text are no good, because a non-ASCII character may look like several characters to Python; in a byte string, one character can span several bytes, so the normal word/character distinction no longer holds. Also, using bad_letter = repr(non_ASCII) doesn't help, since str(bad_letter) merely returns the escape sequence as a string; it can't restore the original character.
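For example, in Python 2 (with a UTF-8 source file):
# -*- coding: utf-8 -*-
s = "अ"            # a byte string: three UTF-8 bytes
print len(s)       # 3
print len(u"अ")    # 1 -- iterating a u'' string yields real characters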
Sorry if this is extremely vague. Please let me know what info I can provide to help clarify. Thanks in advance for reading this.
I'm learning Python, and I learned that every comment starts with a hash "#". So how can the Python interpreter read this line?
# -*- coding: utf-8 -*-
and set the charset to UTF-8? (I'm using Python 2.7.3.)
Thank you in advance.
Yes, it is a comment, but that does not mean Python can't see it: the interpreter can still read the raw line.
What Python actually does is apply the regular expression coding[:=]\s*([-\w.]+) to the first two lines of the file. Most likely this happens even before the actual Python parser steps in.
See PEP 263 for details.
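As a rough illustration (this is not the interpreter's actual implementation, just the same pattern applied by hand):
import re

first_line = "# -*- coding: utf-8 -*-"
match = re.search(r"coding[:=]\s*([-\w.]+)", first_line)
if match:
    print match.group(1)   # utf-8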
How do I use letters with accent marks or special characters like ñ in verbose_name or help_text?
Include this at the top of your file:
# -*- coding: utf-8 -*-
and then use this:
u'áéíóú'
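For example, in a model definition (the Persona model and its field are made up for illustration):
# -*- coding: utf-8 -*-
from django.db import models

class Persona(models.Model):
    anio = models.IntegerField(
        verbose_name=u'Año',
        help_text=u'Introduzca el año de nacimiento',
    )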
I did it (the declaration has to be one of the first two lines to take effect):
# encoding=utf-8
import os, sys
Thanks
@diegueus9 has the right answer for using raw Unicode characters in the source file. Use whatever characters you like, as long as you declare the encoding per the instructions in PEP 263. However, for just a few special characters you may find this easier: declare the string as Unicode with the u prefix and use the character's code point. The following are equivalent ways of writing "ñ":
help_text=u'\xF1 \u00F1 \U000000F1'
When it comes to actually finding the code point for a character...that's a little harder. Windows has the useful Character Map utility. gucharmap is similar. The charts at unicode.org provide alphabet-specific PDFs you can search through. Anyone know an easier way?
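For what it's worth, Python itself can report a code point, via ord() and the standard unicodedata module (a small Python 2 sketch):
# -*- coding: utf-8 -*-
import unicodedata

ch = u'ñ'
print hex(ord(ch))           # 0xf1
print unicodedata.name(ch)   # LATIN SMALL LETTER N WITH TILDE
print repr(unicodedata.lookup('LATIN SMALL LETTER N WITH TILDE'))  # u'\xf1'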