I need a regular expression to validate string with one or more of these characters:
a-z
A-Z
'
àòèéùì
simple white space
FOR EXAMPLE these string are valide:
D' argon calabrò
maryòn l' Ancol
these string are NOT valide:
hello38239
my_house
work [tab] with me
I tryed this:
re.match(r"^[a-zA-Z 'òàèéìù]+$", string )
It seems to work in my python shell but in Django I get this error:
SyntaxError at /home/
("Non-ASCII character '\\xc3' ...
Why ?
Edit:
I have added # -- coding: utf-8 -- at the top of my forms.py but the strings with à,è,ò,ù,é or ì doesn't match never.
This is my forms.py clean method:
def clean_title(self):
if(re.match(r"^[a-zA-Z 'òàèéìù]+$", self.cleaned_data['title'].strip())):
return self.cleaned_data['title'].strip()
raise forms.ValidationError(_("This title is not valid."))
If you user Non-ASCII characters in your python source files you should add proper encoding to the top of your source file like this:
# -*- coding: utf-8 -*-
utf_string='čćžđšp'
Defining Python Source Code Encodings
This seems to work fine for me:
>>> import re
>>> mystring = "D' argon calabrò"
>>> matched = re.match(r"^([a-zA-Z 'òàèéìù]+)$", mystring)
>>> print matched.groups()
("D' argon calabr\xc3\xb2",)
Well, those are pretty much all non-ascii characters. So i'd figure that it's using just ascii for character encoding. Maybe you need to configure it to using UTF-8?
Related
I need to create a regular expression, that would match only the characters NOT in the windows-1251 encoding character set, to detect if there are any characters in a given piece of text that would violate the encoding. I tried to do it through the [^\u0000-\u044F]+ expression, however it is also matching some characters that are actually in line with the encoding.
Appreciate any help on the issue
No language specified, but in Python no need for a regex with sets. Create a set of all Unicode code points that are members of Windows-1251 and subtract it from the set of the text. Note that only byte 98h is not used in Windows-1251 encoding:
>>> # Create the set of characters in code page 1251
>>> cp1251 = set(bytes(range(256)).decode('cp1251',errors='ignore'))
>>> set('This is a test \x98 马') - cp1251
{'\x98', '马'}
As a regular expression:
>>> import re
>>> text = ''.join(cp1251) # string of all Windows-1251 codepoints from previous set
>>> text
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\xa0¤¦§©«¬\xad®°±µ¶·»ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяёђѓєѕіїјљњћќўџҐґ–—‘’‚“”„†‡•…‰‹›€№™'
>>> not_cp1251 = re.compile(r'[^\x00-\x7f\xa0\xa4\xa6\xa7\xa9\xab-\xae\xb0\xb1\xb5-\xb7\xbb\u0401-\u040c\u040e-\u044f\u0451-\u045c\u045e\u045f\u0490\u0491\u2013\u2014\u2018-\u201a\u201c-\u201e\u2020-\u2022\u2026\u2030\u2039\u203a\u20ac\u2116\u2122]')
>>> not_cp1251.findall(text) # all cp1251 text finds no outliers
[]
>>> not_cp1251.findall(text+'\x98') # adding known outlier
['\x98']
>>> not_cp1251.findall('马克'+text+'\x98') # adding other outliers
['马', '克', '\x98']
I have tried to give a value to my MEDIA_ROOT that constain a word with an accent mark, but django doent accept it.
I have tried to unicode(utf-8) and encoding it with no positive results
The error that I get is: SyntaxError: Non-ASCII character '\xc3' in file
What can I do in order to make settings accept acent marks(ó,á,é,í,ú)
SyntaxError: Non-ASCII character '\xc3' in file C:\Users\Meccha\Documents\django\project\settings.py on line 160, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
And in line 160 I have:MEDIA_ROOT = os.path.join(u'D:', u'INVESTIGACIÓN._P')
In python-2.x, there is a destinction between str, and unicode. str are ASCII strings, so these can contain only ASCII characters. On the other hand unicode strings can contain all unicode characters.
You can define a unicode string with the u prefix, this allows to write unicode characters like u'\xf3' to write a unicode string that contains the à character.
If you however want to write unicode strings as well, you need to specify the encoding of the file, in the header of the file. So then the settings.py file looks like:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import os
# ...
# (some other settings)
# ...
MEDIA_ROOT = os.path.join('D:', u' INVESTIGACIÓN_P')
So the top part specifies the encoding, and the latter has a u prefix to mark the string as a unicode string.
I want to replace some utf-8 characters set with another utf-8 character set but anything I try I end up with errors.
I am a noob at Python so please be patient
What I want to achieve is converting characters by unicode values or by html entities (more readable, for maintanance)
Tries (with example):
1.First
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#Found this function
def multiple_replace(dic, text):
pattern = "|".join(map(re.escape, dic.keys()))
return re.sub(pattern, lambda m: dic[m.group()], text)
text="Larry Wall is ùm© some text"
replace_table = {
u'\x97' : u'\x82' # ù -> é
}
text2=multiple_replace(dic,text)
print text #Expected:Larry Wall is ém© some text
#Got: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
2.Html entities
dic = {
"ú" : "é" # ù -> é
}
some_text="Larry Wall is ùm© some text"
some_text2=some_text.encode('ascii', 'xmlcharrefreplace')
some_text2=multiple_replace(dic,some_text2)
print some_text2
#Got:UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14: ordinal not in range(128)
Any ideas are welcome
Your problem is due to the fact that your input strings are in non-unicode representation (<type 'str'> rather than <type 'unicode'>). You must define the input string using the u"..." syntax:
text=u"Larry Wall is ùm© some text"
# ^
(Besides you will have to fix the last statement in your first example - currently it prints the input string (text), whereas I am pretty sure that you meant to see the result (text2)).
I have tried the following in Python 2.7 shell:
>>> from nltk.stem.isri import ISRIStemmer
>>> st = ISRIStemmer()
>>> string = u'\u062D\u064E\u062F\u0651\u064E\u062B\u064E\u0646\u064E\u0627'
>>> st.stem(string)
u'\u062d\u062f\u062b'
So basically, I am trying to obtain:
u'\u062d\u062f\u062b'
from
u'\u062D\u064E\u062F\u0651\u064E\u062B\u064E\u0646\u064E\u0627'
using nltk's arabic stemmer, which works!
However, when I try to accomplish the exact thing through a python script, it fails to stem any of the words in the list, tokens :
#!/c/Python27/python
# -*- coding: utf8 -*-
import nltk
import nltk.data
from nltk.stem.isri import ISRIStemmer
#In my script, I tokenize the following string
commasection = '\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
#The tokenizing works
tokens = nltk.word_tokenize(commasection)
st = ISRIStemmer()
for word in tokens:
#But the stemming of each word in tokens doesn't work????
print st.stem(word)
#Should display
#u'u0623\u062e\u0628\u0631'
#u'\u0628\u0634\u0631'
#u'\u0628\u0646'
#u'\u0647\u0644\u0644'
#But it just shows whatever is in commasection
I need my python code to stem all words in tokens. But I don't get how the simpler example running in python shell works but not this script.
I have noticed that in the shell scenario, there is that 'u' in front of the sequence of unicode, so I tried all sorts of encodings/decodings and read a lot about it all night long (pulled an all-nighter on this one), but this python script is just not stemming word from tokens like the python shell!!!
If anyone can please help me make my script display the correct result I would be super super appreciative
Unicode escapes only work in unicode literals.
commasection = u'\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
Ignacio is correct that I have to have unicode literals in order for the stemming to work, but since I am grabbing this string dynamically, I had to find a way to convert what I get dynamically
i.e. '\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
into a string literal with a unicode escapes i.e.
u'\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
(notice the u in front)
This can be done with the following unichr() http://infohost.nmt.edu/tcc/help/pubs/python/web/unichr-function.html:
word = "".join([unichr(int(x, 16)) for x in word.split("\\u") if x !=""])
So basically I grab the numeric codes and form the unicode character while maintaining the unicode escape. And my stemmer works!
So say I have the following code:
a = 'naïve' # It contains the character ï
b = 'some text that may or may not contain the word we are looking for'
if a in b: #error happens here
print 'success'
I'm trying to see if a is within b but it apparently doesn't know how to take in and work with unicode characters that are not english-standard.
It throws me the following error:
SyntaxError: Non-ASCII character '\xc3' in file app.py on line 10, but no encoding declared
I am not sure what to do or try. Any clues? Thank you.
Put this line at the top of your file:
# -*- coding: utf-8 -*-