Accents/special characters (e.g., ñ) in verbose_name or help_text ? - django

How do I use letters with accent marks or special characters like ñ in verbose_name or help_text?

include this in the head of your file:
# -*- coding: utf-8 -*-
and then use this:
u'áéíóú'

i did it:
import os, sys
#encoding= utf-8
Thanks

#diegueus9 has the right answer for using raw Unicode characters in the source file. Use whatever characters you like as long as you declare the encoding as per the instructions in PEP263. However, for using just a few special characters you may find this easier: declare the string as Unicode with the u prefix and use the character's code point. The following are equivalent ways of writing "ñ":
help_text=u'\xF1 \u00F1 \U000000F1'
When it comes to actually finding the code point for a character...that's a little harder. Windows has the useful Character Map utility. gucharmap is similar. The charts at unicode.org provide alphabet-specific PDFs you can search through. Anyone know an easier way?

Related

How to add non ASCII characters in a python list?

I am a new learner of python. I want to have a list of strings with non-ASCII characters.
This answer suggested a way to do this, but when I tried a code, I got some weird results. Please see the following MWE -
#-*- coding: utf-8 -*-
mylist = ["अ,ब,क"]
print mylist
The output was ['\xe0\xa4\x85,\xe0\xa4\xac,\xe0\xa4\x95']
When I use ASCII characters in the list, let's say ["a,b,c"] the output also is ['a,b,c']. I want the output of my code to be ["अ,ब,क"]
How to do this?
PS - I am using python 2.7.16
You want to mark these as Unicode strings.
mylist = [u"अ,ब,क"]
Depending on what you want to accomplish, if the data is just a single string, it might not need to be in a list. Or perhaps you want a list of strings?
mylist = [u"अ", u"ब", u"क"]
Python 3 brings a lot of relief to working with Unicode (and doesn't need the u sigil in front of Unicode strings, because all strings are Unicode), and should definitely be your learning target unless you are specifically tasked with maintaining legacy software after Python 2 is officially abandoned at the end of this year.
Regardless of your Python version, there may still be issues with displaying Unicode on your system, in particular on older systems and on Windows.
If you are unfamiliar with encoding issues, you'll want to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and perhaps the Python-specific Pragmatic Unicode.
Use:
#-*- coding: utf-8 -*-
mylist = ["अ,ब,क"]
print [unicode(i) for i in mylist]
Or use:
#-*- coding: utf-8 -*-
mylist = ["अ,ब,क"]
print map(unicode, mylist)

Replacing unicode characters with ascii characters in Python/Django

I'm using Python 2.7 here (which is very relevant).
Let's say I have a string containing an "em" dash, "—". This isn't encoded in ASCII. Therefore, when my Django app processes it, it complains. A lot.
I want to to replace some such characters with unicode equivalents for string tokenization and use with a spell-checking API (PyEnchant, which considers non-ASCII apostrophes to be misspellings), for example by using the shorter "-" dash instead of an em dash. Here's what I'm doing:
s = unicode(s).replace(u'\u2014', '-').replace(u'\u2018', "'").replace(u'\u2019', "'").replace(u'\u201c', '"').replace(u'\u201d', '"')
Unfortunately, this isn't actually replacing any of the unicode characters, and I'm not sure why.
I don't really have time to upgrade to Python 3 right now, importing unicode_literals from future at the top of the page or setting the encoding there does not let me place actual unicode literals in the code, as it should, and I have tried endless tricks with encode() and decode().
Can anyone give me a straightforward, failsafe way to do this in Python 2.7?
Oh boy... false alarm, here! It actually works, but I entered some incorrect character codes. I'm going to leave the question up since that code is the only thing that seemed to let me complete this particular task in this environment.

Is it possible to create string decorators?

In Python I can automatically create an unicode object by prepending an u (as in u"test").
Is it possible to build something like that myself?
All things are possible - but in this case, only by modifying the source code of the Python interpreter and recompiling.
A related question with the same answer: Can you add new statements to Python's syntax?
Yes you can.
All you need to do is type the following:
ur"\u<hex>"
For example, if you were to type
print ur"\u0186"
it would output the following character (given you are using a certain font)
҉
To print this, you could just simply type
print "҉"
but for this to be allowed, you must put the following line of code as the FIRST line of code
# -*- coding: utf-8 -*-
Yes, I know it has the # symbol, that is supposed to be there.
Hope this helps! Have fun with unicoding! :)

Python 2.7: Handeling Unicode Objects

I have an application that needs to be able to handle non-ASCII characters of unknown encoding. The program may delete or replace these characters (if they are discovered in a user dictionary file), otherwise they need to pass cleanly through unaltered. What's mind-boggling is, it works one minute, then I make some seemingly trivial change, and now it fails with UnicodeDecode, UnicodeEncode, or kindred errors. Addressing this has led me down the road of cargo cult programing--making random tweaks that get it working again, but I have no idea why. Is there a general-purpose solution for dealing this, perhaps even the creation of class that modifies the normal way Python deals with strings?
I'm not sure what code to include as about five separate modules are involved. Here is what I am doing in abstract terms:
Taking a text from one of two sources: text that the user has pasted directly into a Tkinter toplevel window; text captured from the Win32 clipboard via a hotkey command.
The text is processed, including the removal of whitespace charters, then certain characters/words are replaced or simply deleted based on a customizable user dictionary.
The result is then returned to the Tkinter GUI or the Win32 clipboard, depending on whether or not the keyboard shortcut was used.
Some details that may be relevant:
All modules use
# -*- coding: utf-8 -*-
The user dictionary is saved in UTF-16 LE with BOM (a function removes BOM characters when parsing the file). The file object is instantiated with
self.pf = codecs.open(self.pattern_fn, 'r', 'utf-16')
The text entry points for text are via a Tkinter GUI Text widget:
text = self.paste_to_field.get(1.0, Tkinter.END)
Or from the clipboard:
text = win32clipboard.GetClipboardData(win32clipboard.CF_UNICODETEXT)
And example error:
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u201d' in position
2: character maps to <undefined>
Furthermore, the same text might work when tested on OS X (where I do development work) but cause an error on Windows.
Regular expressions are used, however in this case no non-ASCIIs are included in the pattern. For non-ASCIIs I simply
text = text.replace(old, new)
Another thing to consider: for c in text type iterations are no good because a non-ASCII may look like several characters to Python. The normal word/character distinction no longer holds. Also, using bad_letter = repr(non_ASCII) doesn't help since str(bad_letter) merely returns a string of the escape sequence--it can't restore the original character.
Sorry if this is extremely vague. Please let me know what info I can provide to help clarify. Thanks in advance for reading this.

Segment a korean word into individual syllables - C++/Python

I am trying to segment a Korean string into individual syllable.
So the input would be a string like "서울특별시" and the outcome "서","울","특","별","시".
I have tried with both C++ and Python to segment a string but the result is a series of ? or white spaces respectively (The string itself however can be printed correctly on the screen).
In c++ I have first initialized the input string as string korean="서울특별시" and then used a string::iterator to go through the string and print each individual component.
In Python I have just used a simple for loop.
I have wondering if there is a solution to this problem. Thanks.
I don't know Korean at all, and can't comment on the division into syllables, but in Python 2 the following works:
# -*- coding: utf-8 -*-
print(repr(u"서울특별시"))
print(repr(u"서울특별시"[0]))
Output:
u'\uc11c\uc6b8\ud2b9\ubcc4\uc2dc'
u'\uc11c'
In Python 3 you don't need the u for Unicode strings.
The outputs are the unicode values of the characters in the string, which means that the string has been correctly cut up in this case. The reason I printed them with repr is that the font in the terminal I used, can't represent them and so without repr I just see square boxes. But that's purely a rendering issue, repr demonstrates that the data is correct.
So, if you know logically how to identify the syllables then you can use repr to see what your code has actually done. Unicode NFC sounds like a good candidate for actually identifying them (thanks to R. Martinho Fernandes), and unicodedata.normalize() is the way to get that.