Django's settings.py can't read UTF-8 characters

I tried to set MEDIA_ROOT to a path that contains a word with an accent mark, but Django doesn't accept it.
I tried making it a unicode string (UTF-8) and encoding it, with no positive results.
The error that I get is: SyntaxError: Non-ASCII character '\xc3' in file
What can I do to make settings.py accept accent marks (ó, á, é, í, ú)?
SyntaxError: Non-ASCII character '\xc3' in file C:\Users\Meccha\Documents\django\project\settings.py on line 160, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
And on line 160 I have: MEDIA_ROOT = os.path.join(u'D:', u'INVESTIGACIÓN._P')

In python-2.x, there is a distinction between str and unicode. A str is a byte string, and a str literal in a source file with no declared encoding may contain only ASCII characters. A unicode string, on the other hand, can contain any Unicode character.
You can define a unicode string with the u prefix. This lets you write Unicode characters as escapes, for example u'\xf3' for a unicode string that contains the ó character.
If, however, you want to type the accented characters directly in the source, you need to declare the encoding of the file in its header. The settings.py file then looks like:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import os
# ...
# (some other settings)
# ...
MEDIA_ROOT = os.path.join('D:', u'INVESTIGACIÓN_P')
So the top part declares the encoding, and the string has a u prefix to mark it as a unicode string (with unicode_literals imported, unprefixed literals such as 'D:' are unicode as well).

Related

Replace utf8 characters

I want to replace one set of UTF-8 characters with another, but whatever I try I end up with errors.
I am new to Python, so please be patient.
What I want to achieve is converting characters by Unicode values or by HTML entities (more readable, for maintenance).
Attempts (with examples):
1. First
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#Found this function
def multiple_replace(dic, text):
    pattern = "|".join(map(re.escape, dic.keys()))
    return re.sub(pattern, lambda m: dic[m.group()], text)
text="Larry Wall is ùm© some text"
replace_table = {
u'\x97' : u'\x82' # ù -> é
}
text2=multiple_replace(dic,text)
print text #Expected:Larry Wall is ém© some text
#Got: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
2. HTML entities
dic = {
"ú" : "é" # ù -> é
}
some_text="Larry Wall is ùm© some text"
some_text2=some_text.encode('ascii', 'xmlcharrefreplace')
some_text2=multiple_replace(dic,some_text2)
print some_text2
#Got:UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14: ordinal not in range(128)
Any ideas are welcome
Your problem is that your input strings are byte strings (<type 'str'>) rather than unicode strings (<type 'unicode'>). You must define the input string using the u"..." syntax:
text=u"Larry Wall is ùm© some text"
#    ^
(You will also have to fix the last statements in your first example: the call passes dic although the table is named replace_table, and the script prints the input string (text), whereas I am pretty sure you meant to print the result (text2).)
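As a minimal sketch of the first attempt with those fixes applied: every string is unicode, and the characters are written as \u escapes so the file itself stays pure ASCII. The mapping ù -> é and the sample sentence come from the question; the variable name result is only illustrative.

```python
# -*- coding: utf-8 -*-
import re

def multiple_replace(table, text):
    # Build one alternation pattern from the (escaped) keys of the table.
    pattern = u"|".join(map(re.escape, table.keys()))
    # Replace each match with the value stored under the matched text.
    return re.sub(pattern, lambda m: table[m.group()], text)

# u'\u00f9' is the code point of "ù", u'\u00e9' the code point of "é".
replace_table = {u"\u00f9": u"\u00e9"}

text = u"Larry Wall is \u00f9m\u00a9 some text"
result = multiple_replace(replace_table, text)
```

Because the pattern, the table and the input are all unicode, no implicit ASCII decoding takes place, which is what triggered the UnicodeWarning and UnicodeDecodeError in the question.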

Python: ascii codec can't encode en-dash

I'm trying to print a poem from the Poetry Foundation's daily poem RSS feed on a thermal printer that supports the CP437 encoding. This means I need to translate some characters; in this case, an en-dash to a hyphen. But Python won't even encode the en-dash to begin with. When I try to decode the string and replace the en-dash with a hyphen, I get the following error:
Traceback (most recent call last):
  File "pftest.py", line 46, in <module>
    str = str.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 140: ordinal not in range(128)
And here is my code:
#!/usr/bin/python
#-*- coding: utf-8 -*-
# This string is actually a variable entitled d['entries'][1].summary_detail.value
str = """Love brought by night a vision to my bed,
One that still wore the vesture of a child
But eighteen years of age – who sweetly smiled"""
str = str.decode('utf-8')
str = str.replace("\u2013", "-") #en dash
str = str.replace("\u2014", "--") #em dash
print (str)
I can actually print the output using the following code without errors in my terminal window (Mac), but my printer spits out sets of 3 CP437 characters:
str = u''.str.encode('utf-8')
I'm using Sublime Text as my editor, and I've saved the page with UTF-8 encoding, but I'm not sure that will help things. I would greatly appreciate any help with this code. Thank you!
I don't fully understand what's happening in your code, but I've also been trying to replace en-dashes with hyphens in a string I got from the Web, and here's what's working for me. My code is just this:
txt = re.sub(u"\u2013", "-", txt)
I'm using Python 2.7 and Sublime Text 2, but I don't bother setting -*- coding: utf-8 -*- in my script, as I'm trying not to introduce any new encoding issues. (Even though my variables may contain Unicode I like to keep my code pure ASCII.) Do you need to include Unicode in your .py file, or was that just to help with debugging?
I'll note that my txt variable is already a unicode string, i.e.
print type(txt)
produces
<type 'unicode'>
I'd be curious to know what type(str) would produce in your case.
One thing I noticed in your code is
str = str.replace("\u2013", "-") #en dash
Are you sure that does anything? My understanding is that \u only means "unicode character" inside a u"" string, and what you've created there is a byte string with 6 characters: a backslash, a "u", a "2", a "0", etc. (The backslash stays because, when an escape has no special meaning, unlike '\n' or '\t', Python leaves it in the string unchanged.)
Also, the fact that you get 3 CP437 characters from your printer makes me suspect that you still have an en-dash in your string. The UTF-8 encoding of an en-dash is 3 bytes: 0xe2 0x80 0x93. When you call str.encode('utf-8') on a unicode string that contains an en-dash you get those three bytes in the returned string. I'm guessing that your terminal knows how to interpret that as an en-dash and that's what you're seeing.
If you can't get my first method to work, I'll mention that I also had success with this:
txt = txt.encode('utf-8')
txt = re.sub("\xe2\x80\x93", "-", txt)
Maybe that re.sub() would work for you if you put it after your call to encode(). And in that case you might not even need that call to decode() at all. I'll confess that I really don't understand why it's there.
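Putting the pieces together, here is a rough sketch of the whole pipeline, assuming the feed delivers UTF-8 bytes (if your feed parser already returns unicode, skip the decode step). The names raw, text and printer_bytes are made up for illustration; the "replace" error handler substitutes "?" for anything CP437 cannot represent.

```python
# -*- coding: utf-8 -*-

# UTF-8 bytes as they might arrive from the feed; \xe2\x80\x93 is the en-dash.
raw = b"But eighteen years of age \xe2\x80\x93 who sweetly smiled"

# 1. Decode the bytes to a unicode string exactly once.
text = raw.decode("utf-8")

# 2. Replace the dashes using unicode literals, so \u2013 is a real escape.
text = text.replace(u"\u2013", u"-")    # en-dash
text = text.replace(u"\u2014", u"--")   # em-dash

# 3. Encode for the printer; unsupported characters become "?".
printer_bytes = text.encode("cp437", "replace")
```

Sending printer_bytes, rather than UTF-8 bytes, to the printer should avoid the groups of three CP437 characters.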

python different result from IDLE and python script

I have tried the following in Python 2.7 shell:
>>> from nltk.stem.isri import ISRIStemmer
>>> st = ISRIStemmer()
>>> string = u'\u062D\u064E\u062F\u0651\u064E\u062B\u064E\u0646\u064E\u0627'
>>> st.stem(string)
u'\u062d\u062f\u062b'
So basically, I am trying to obtain:
u'\u062d\u062f\u062b'
from
u'\u062D\u064E\u062F\u0651\u064E\u062B\u064E\u0646\u064E\u0627'
using nltk's Arabic stemmer, which works!
However, when I try to accomplish the exact same thing through a Python script, it fails to stem any of the words in the list tokens:
#!/c/Python27/python
# -*- coding: utf8 -*-
import nltk
import nltk.data
from nltk.stem.isri import ISRIStemmer
#In my script, I tokenize the following string
commasection = '\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
#The tokenizing works
tokens = nltk.word_tokenize(commasection)
st = ISRIStemmer()
for word in tokens:
    #But the stemming of each word in tokens doesn't work????
    print st.stem(word)
#Should display
#u'\u0623\u062e\u0628\u0631'
#u'\u0628\u0634\u0631'
#u'\u0628\u0646'
#u'\u0647\u0644\u0644'
#But it just shows whatever is in commasection
I need my Python code to stem all words in tokens, but I don't get why the simpler example works in the Python shell while this script does not.
I have noticed that in the shell scenario there is a 'u' in front of the sequence of unicode escapes, so I tried all sorts of encodings/decodings and read a lot about it all night long (pulled an all-nighter on this one), but this script is just not stemming the words from tokens like the Python shell does!
If anyone can help me make my script display the correct result, I would be super appreciative.
Unicode escapes only work in unicode literals.
commasection = u'\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
Ignacio is correct that I need unicode literals for the stemming to work, but since I am grabbing this string dynamically, I had to find a way to convert what I get at runtime,
i.e. '\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
into a unicode string, as if it had been written as a literal with unicode escapes, i.e.
u'\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
(notice the u in front).
This can be done with unichr() (http://infohost.nmt.edu/tcc/help/pubs/python/web/unichr-function.html):
word = "".join([unichr(int(x, 16)) for x in word.split("\\u") if x !=""])
So basically I grab the hexadecimal code of each escape and build the corresponding unicode character from it. And my stemmer works!
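For comparison, a different technique from the unichr() join above: the standard library's unicode_escape codec can turn backslash-u text into a unicode string in one step. This is only a sketch, assuming the dynamic input is plain ASCII text holding the escapes; raw and word are illustrative names, and the sample value is one token from the question.

```python
# Backslash-u text as the script receives it at runtime (a raw literal here,
# so the backslashes are kept as ordinary characters).
raw = r"\u0628\u0650\u0634\u0652\u0631\u064F"

# Interpret the escapes: the result is a unicode string of 6 code points.
word = raw.encode("ascii").decode("unicode_escape")
```

The resulting word can then be passed to st.stem() just like a u'' literal.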

if statement for checking if string with non standard characters is within another string is not working

So say I have the following code:
a = 'naïve' # It contains the character ï
b = 'some text that may or may not contain the word we are looking for'
if a in b: #error happens here
    print 'success'
I'm trying to see if a is within b but it apparently doesn't know how to take in and work with unicode characters that are not english-standard.
It throws me the following error:
SyntaxError: Non-ASCII character '\xc3' in file app.py on line 10, but no encoding declared
I am not sure what to do or try. Any clues? Thank you.
Put this line at the top of your file:
# -*- coding: utf-8 -*-
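A small sketch of the corrected file, assuming it is saved as UTF-8. Besides the encoding declaration, both strings are made unicode so the comparison is done on characters rather than bytes. The sentence in b is a made-up sample, and found is an illustrative name.

```python
# -*- coding: utf-8 -*-
# The line above tells Python that this source file is UTF-8 encoded.

a = u'naïve'  # contains the character ï
b = u'it would be naïve to skip the encoding declaration'  # hypothetical sample text

found = a in b  # both operands are unicode, so no implicit ASCII decoding occurs
if found:
    print('success')
```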

regular expression with special chars

I need a regular expression to validate string with one or more of these characters:
a-z
A-Z
'
àòèéùì
simple white space
For example, these strings are valid:
D' argon calabrò
maryòn l' Ancol
These strings are NOT valid:
hello38239
my_house
work [tab] with me
I tried this:
re.match(r"^[a-zA-Z 'òàèéìù]+$", string )
It seems to work in my python shell but in Django I get this error:
SyntaxError at /home/
("Non-ASCII character '\\xc3' ...
Why?
Edit:
I have added # -*- coding: utf-8 -*- at the top of my forms.py, but strings containing à, è, ò, ù, é or ì never match.
This is my forms.py clean method:
def clean_title(self):
    if(re.match(r"^[a-zA-Z 'òàèéìù]+$", self.cleaned_data['title'].strip())):
        return self.cleaned_data['title'].strip()
    raise forms.ValidationError(_("This title is not valid."))
If you use non-ASCII characters in your Python source files, you should declare the encoding at the top of the source file, like this:
# -*- coding: utf-8 -*-
utf_string='čćžđšp'
See PEP 263, "Defining Python Source Code Encodings".
This seems to work fine for me:
>>> import re
>>> mystring = "D' argon calabrò"
>>> matched = re.match(r"^([a-zA-Z 'òàèéìù]+)$", mystring)
>>> print matched.groups()
("D' argon calabr\xc3\xb2",)
Well, those are pretty much all non-ASCII characters, so I'd guess that it's using plain ASCII as the character encoding. Maybe you need to configure it to use UTF-8?
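One possible reason the accented titles still never match after adding the encoding declaration: in Python 2 the pattern is a byte string, so its character class holds individual UTF-8 bytes, while Django form data typically arrives as unicode. A sketch that makes the pattern unicode as well is below. TITLE_RE and is_valid_title are hypothetical names, not part of Django, and the accented letters are written as \u escapes so the example does not depend on the file encoding.

```python
# -*- coding: utf-8 -*-
import re

# Unicode pattern: letters, space, apostrophe and ò à è é ì ù.
TITLE_RE = re.compile(u"^[a-zA-Z '\u00f2\u00e0\u00e8\u00e9\u00ec\u00f9]+$")

def is_valid_title(title):
    # title is assumed to be a unicode string, like values in cleaned_data.
    return TITLE_RE.match(title.strip()) is not None

valid = is_valid_title(u"D' argon calabr\u00f2")
invalid = is_valid_title(u"hello38239")
```

Inside clean_title, the same compiled pattern could replace the inline re.match call.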