Removal of Diacritics using regex in JAVA [duplicate] - regex

The problem is that, as you know, there are thousands of characters in the Unicode chart, and I want to convert all the similar characters to the letters of the English alphabet.
For instance here are a few conversions:
ҥ->H
Ѷ->V
Ȳ->Y
Ǭ->O
Ƈ->C
tђє Ŧค๓เℓy --> the Family
...
I have seen that there are more than 20 versions of the letter A/a alone, and I don't know how to classify them. They are like needles in a haystack.
The complete list of Unicode characters is at http://www.ssec.wisc.edu/~tomw/java/unicode.html or http://unicode.org/charts/charindex.html. Just try scrolling down and see the variations of the letters.
How can I convert all these with Java? Please help me :(

Reposting my post from How do I remove diacritics (accents) from a string in .NET?
This method works fine in Java (purely for the purpose of removing diacritical marks, aka accents).
It basically converts all accented characters into their de-accented counterparts followed by their combining diacritics. Now you can use a regex to strip off the diacritics.
import java.text.Normalizer;
import java.util.regex.Pattern;

public String deAccent(String str) {
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
}

It's part of Apache Commons Lang as of version 3.0.
org.apache.commons.lang3.StringUtils.stripAccents("Añ");
returns An
Also see http://www.drillio.com/en/software-development/java/removing-accents-diacritics-in-any-language/

Attempting to "convert them all" is the wrong approach to the problem.
Firstly, you need to understand the limitations of what you are trying to do. As others have pointed out, diacritics are there for a reason: they are essentially unique letters in the alphabet of that language, with their own meaning and sound; removing those marks is just the same as replacing random letters in an English word. This is before you even go on to consider the Cyrillic languages and other script-based texts such as Arabic, which simply cannot be "converted" to English.
If you must, for whatever reason, convert characters, then the only sensible way to approach this is to first reduce the scope of the task at hand. Consider the source of the input - if you are coding an application for "the Western world" (to use as good a phrase as any), it would be unlikely that you would ever need to parse Arabic characters. Similarly, the Unicode character set contains hundreds of mathematical and pictorial symbols: there is no (easy) way for users to directly enter these, so you can assume they can be ignored.
By taking these logical steps you can reduce the number of possible characters to parse to the point where a dictionary-based lookup / replace operation is feasible. It then becomes a small amount of slightly boring work creating the dictionaries, and a trivial task to perform the replacement. If your language supports native Unicode characters (as Java does) and optimises static structures correctly, such find-and-replace operations tend to be blindingly quick.
This comes from experience of having worked on an application that was required to allow end users to search bibliographic data that included diacritic characters. The lookup arrays (as it was in our case) took perhaps one day of effort to produce, covering all diacritic marks for all Western European languages.
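To make the dictionary approach concrete, here is a minimal sketch in Java. The class name, method name and the handful of sample mappings are mine, purely for illustration; a real table would be built to cover every mark in the languages in scope:

import java.util.HashMap;
import java.util.Map;

public final class DiacriticFolder {
    // Illustrative entries only; extend per the languages you actually need to support.
    private static final Map<Character, String> LOOKUP = new HashMap<>();
    static {
        LOOKUP.put('é', "e");
        LOOKUP.put('è', "e");
        LOOKUP.put('ö', "o");
        LOOKUP.put('ç', "c");
        LOOKUP.put('ß', "ss");
    }

    private DiacriticFolder() { }

    public static String fold(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String mapped = LOOKUP.get(c);
            sb.append(mapped != null ? mapped : String.valueOf(c));
        }
        return sb.toString();
    }
}

Because the map is a static structure built once, each lookup is constant-time and the whole conversion is a single pass, which is why such replacements tend to be so quick.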

Since the encoding that turns "the Family" into "tђє Ŧค๓เℓy" is effectively random and not following any algorithm that can be explained by the information of the Unicode codepoints involved, there's no general way to solve this algorithmically.
You will need to build the mapping of Unicode characters onto the Latin characters they resemble. You could probably do this with some smart machine learning on the actual glyphs representing the Unicode codepoints, but I think the effort for this would be greater than manually building the mapping, especially if you have a good number of examples from which to build it.
To clarify: a few of the substitutions can actually be solved via the Unicode data (as the other answers demonstrate), but some letters simply have no reasonable association with the Latin characters they resemble.
Examples:
"ђ" (U+0452 CYRILLIC SMALL LETTER DJE) is more related to "d" than to "h", but is used to represent "h".
"Ŧ" (U+0166 LATIN CAPITAL LETTER T WITH STROKE) is somewhat related to "T" (as the name suggests) but is used to represent "F".
"ค" (U+0E04 THAI CHARACTER KHO KHWAI) is not related to any latin character at all and in your example is used to represent "a"

String tested: ÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝß
Results:
Output from Apache Commons Lang3 : AAAAAÆCEEEEIIIIÐNOOOOOØUUUUYß
Output from ICU4j : AAAAAÆCEEEEIIIIÐNOOOOOØUUUUYß
Output from JUnidecode : AAAAAAECEEEEIIIIDNOOOOOOUUUUUss (problem with Ý and another issue)
Output from Unidecode : AAAAAAECEEEEIIIIDNOOOOOOUUUUYss
The last choice is the best.
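For reference, a hedged sketch of how the Java side of such a comparison could be driven. It assumes commons-lang3, icu4j and junidecode are on the classpath; the JUnidecode package and class names here are from memory and may differ between versions:

import org.apache.commons.lang3.StringUtils;
import com.ibm.icu.text.Transliterator;
import net.sf.junidecode.Junidecode; // assumed coordinates; check your junidecode version

public class FoldingComparison {
    public static void main(String[] args) {
        String tested = "ÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝß";
        System.out.println(StringUtils.stripAccents(tested));  // Commons Lang3
        Transliterator t = Transliterator.getInstance("NFD; [:Nonspacing Mark:] Remove; NFC");
        System.out.println(t.transliterate(tested));           // ICU4j
        System.out.println(Junidecode.unidecode(tested));      // JUnidecode
    }
}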

The original request has been answered already.
However, I am posting the answer below for those who might be looking for generic transliteration code to transliterate any charset to Latin/English in Java.
A naive definition of transliteration: the translated string, in its final form / target charset, sounds like the string in its original form.
If we want to transliterate any charset to Latin (the English alphabet), then ICU4J (the ICU library for Java) will do the job.
Here is the code snippet in Java:
import com.ibm.icu.text.Transliterator; // ICU4J library import

public static final String TRANSLITERATE_ID = "NFD; Any-Latin; NFC";
public static final String NORMALIZE_ID = "NFD; [:Nonspacing Mark:] Remove; NFC";

/**
 * Returns the transliterated string, converting any charset to Latin.
 */
public static String transliterate(String input) {
    Transliterator transliterator = Transliterator.getInstance(TRANSLITERATE_ID + "; " + NORMALIZE_ID);
    return transliterator.transliterate(input);
}
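A quick usage sketch (the exact romanization shown is my assumption and depends on the ICU version and its transliteration rules):

String latin = transliterate("Владивосток");
System.out.println(latin); // typically prints "Vladivostok"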

If the need is to convert "òéışöç->oeisoc", you can use this as a starting point:
public class AsciiUtils {
    private static final String PLAIN_ASCII =
          "AaEeIiOoUu"   // grave
        + "AaEeIiOoUuYy" // acute
        + "AaEeIiOoUuYy" // circumflex
        + "AaOoNn"       // tilde
        + "AaEeIiOoUuYy" // umlaut
        + "Aa"           // ring
        + "Cc"           // cedilla
        + "OoUu"         // double acute
        ;
    private static final String UNICODE =
          "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"
        + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD"
        + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177"
        + "\u00C3\u00E3\u00D5\u00F5\u00D1\u00F1"
        + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF"
        + "\u00C5\u00E5"
        + "\u00C7\u00E7"
        + "\u0150\u0151\u0170\u0171"
        ;

    // Private constructor; this class can't be instantiated.
    private AsciiUtils() { }

    // Remove accented characters from a string and replace them with their ASCII equivalents.
    public static String convertNonAscii(String s) {
        if (s == null) return null;
        StringBuilder sb = new StringBuilder();
        int n = s.length();
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            int pos = UNICODE.indexOf(c);
            if (pos > -1) {
                sb.append(PLAIN_ASCII.charAt(pos));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "The result : È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô,ç";
        System.out.println(AsciiUtils.convertNonAscii(s));
        // output:
        // The result : E,E,E,E,U,U,I,I,A,A,O,e,e,e,e,u,u,i,i,a,a,o,c
    }
}
JDK 1.6 provides the java.text.Normalizer class, which can be used for this task.
See an example here

The problem with "converting" arbitrary Unicode to ASCII is that the meaning of a character is culture-dependent. For example, “ß” to a German-speaking person should be converted to "ss" while an English-speaker would probably convert it to “B”.
Add to that the fact that Unicode has multiple code points for the same glyphs.
The upshot is that the only way to do this is to create a massive table mapping each Unicode character to the ASCII character(s) you want to convert it to. You can take a shortcut by normalizing accented characters to normalization form KD, but not all characters normalize to ASCII. In addition, Unicode does not define which parts of a glyph are "accents".
Here is a tiny excerpt from an app that does this:
switch (c)
{
    case 'A':
    case '\u00C0': // À LATIN CAPITAL LETTER A WITH GRAVE
    case '\u00C1': // Á LATIN CAPITAL LETTER A WITH ACUTE
    case '\u00C2': // Â LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    // and so on for about 20 lines...
        return "A";
    case '\u00C6': // Æ LATIN CAPITAL LIGATURE AE
        return "AE";
    // And so on for pages...
}
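The form-KD shortcut mentioned above can be sketched as follows. This is my own hedged illustration, not the quoted app's code; note that characters such as 'Æ' or 'ß' survive the normalization, which is exactly why the big table is still needed as a fallback:

import java.text.Normalizer;

public static String foldToAscii(String s) {
    // NFKD decomposes many accented letters into base letter + combining marks.
    String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
    // Strip the combining marks; whatever non-ASCII remains needs the lookup table.
    return decomposed.replaceAll("\\p{M}+", "");
}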

You could try using unidecode, which is available as a Ruby gem and as a Perl module on CPAN. Essentially, it works as a huge lookup table, where each Unicode code point maps to an ASCII character or string.

There is no easy or general way to do what you want, because it is just your subjective opinion that these letters look like the Latin letters you want to convert them to. They are actually separate letters with their own distinct names and sounds, which just happen to look superficially like Latin letters.
If you want that conversion, you have to create your own translation table based on what latin letters you think the non-latin letters should be converted to.
(If you only want to remove diacritical marks, there are some answers in this thread: How do I remove diacritics (accents) from a string in .NET? However, you describe a more general problem.)

I'm late to the party, but after facing this issue today, I found this answer to be very good:
String asciiName = Normalizer.normalize(unicodeName, Normalizer.Form.NFD)
        .replaceAll("[^\\p{ASCII}]", "");
Reference:
https://stackoverflow.com/a/16283863

The following class does the trick:
org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter

Related

Regex to exclude non-ASCII but keep Nordic characters

I have a macro in which I use Regex to strip a text of all non-ASCII characters (in order to create folder names).
I am relatively new to Regex and I was wondering how to strip all non-ASCII but still include Nordic characters, as the macro goes through Scandinavian data. Basically, I would need to include characters 128 to 165 from this table
Here is my code so far:
Public Function GetStrippedText(txt As String) As String
    Dim regEx As Object
    Set regEx = CreateObject("vbscript.regexp")
    regEx.Pattern = "[^\u0000-\u007F]"
    GetStrippedText = regEx.Replace(txt, "")
End Function
I understand that I need to include this range in there somehow "[^\u0000-\u007F]", I just don't know where to find the associated code or how to include it.
To the best of my knowledge, there are a few points to highlight here:
Not all extended (or non-) ASCII tables follow the same character encoding. The table you linked seems to follow CP437, while Excel follows UTF-8 (Unicode), which you can test using the UNICODE function in Excel. Here is a link to see the difference it makes in hex codes. So you most likely need to pick a range of interest within the "Latin-1 Supplement", which can be found here. For this exercise I went with the characters À-ÿ, which is the range \u00C0-\u00FF.
Next, your current character class covers normal ASCII characters; however, I believe you might just be interested in 0020-007F, as you probably don't want to include 0000-001F.
Thirdly, you did not set the Global parameter to True, which means your current UDF will only replace the first character it finds outside your character class. So you'll need to set this parameter to replace all characters outside the defined character class.
So to conclude, the below might work for you:
Public Function GetStrippedText(txt As String) As String
    Dim regEx As Object
    Set regEx = CreateObject("vbscript.regexp")
    regEx.Global = True
    regEx.Pattern = "[^\u0020-\u007F\u00C0-\u00FF]"
    GetStrippedText = regEx.Replace(txt, "")
End Function
For your understanding; [^\u0020-\u007F\u00C0-\u00FF] means:
[....] - The brackets tell us this is a character class
^ - The caret means it's a negated character class
\u0020-\u007F - means the characters run from index 32 till index 127 and \u00C0-\u00FF runs from 192 till 255.
In this same fashion you can extend the amount of character ranges.
Note1: Instead of Unicode, you could also just use the Hex codes: "[^\x20-\x7F\xC0-\xFF]"
Note2: You could also create a character class without Unicode or Hex ranges. Simply concatenate the characters of interest instead.

ICU combining Thai vowel signs and when to ignore

I'm processing Thai keyboard input. Some of the keys are vowel signs that are only allowed when combined with certain preceding characters.
Here 0x0E33 is the vowel sign
For example 0x0E1C + 0x0E33 is valid
but 0x0E44 + 0x0E33 is not valid and the 0x0E33 should be ignored.
I'm looking to find a way to know when I should ignore the vowel sign, or when it does not combine with the previous character.
Any ideas please?
Many Thai vowels (and tone marks, by the way) belong to the Non-Spacing Combining Marks category. Your goal is to use some library that can tell you which category each character belongs to. Then you may decide whether to "ignore" it, whatever "ignoring" means in your application context.
Check Unicode General Category Values
Your two points of interest are:
Lo | Other_Letter for normal letters;
Mn | Nonspacing_Mark for zero-width non-spacing marks;
Further reading:
Unicode data for Thai script (scroll down till the first occurrence of "THAI CHARACTER")
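In Java, for example, the general category is available without any extra library through Character.getType. The following is a hedged sketch of the check described above; the method name and the exact validity rule are mine, and a real input method would likely need more context than the single preceding character:

// Reject a zero-width combining mark (Mn) that does not follow a base letter (Lo).
static boolean isValidCombination(int previousCodePoint, int newCodePoint) {
    if (Character.getType(newCodePoint) != Character.NON_SPACING_MARK) {
        return true; // not a combining mark, nothing to validate here
    }
    return Character.getType(previousCodePoint) == Character.OTHER_LETTER;
}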
I know this thread is from a few years ago, but this is what I have come up with using the ICU library. I suspect it can be improved...
UChar32 newChar;      // the incoming character
UChar32 previousChar; // the character typed before it
int32_t gcb = u_getIntPropertyValue(newChar, UCHAR_GRAPHEME_CLUSTER_BREAK);
if (gcb != U_GCB_OTHER)
{
    int32_t insc = u_getIntPropertyValue(newChar, UCHAR_INDIC_SYLLABIC_CATEGORY);
    if (insc == U_INSC_VOWEL_DEPENDENT || insc == U_INSC_TONE_MARK)
    {
        if (u_getIntPropertyValue(previousChar, UCHAR_INDIC_SYLLABIC_CATEGORY) != U_INSC_CONSONANT)
        {
            // invalid combination, ignore
        }
    }
}

How can I parse a char array with octal values in Python?

EDIT: I should note that I want a general case for any hex array, not just the google one I provided.
EDIT BACKGROUND: Background is networking: I'm parsing a DNS packet and trying to get its QNAME. I'm taking in the whole packet as a string, and every character represents a byte. Apparently this problem looks like a Pascal string problem, and using the struct module seems like the way to go.
I have a char array in Python 2.7 which includes octal values. For example, let's say I have an array
DNS = "\03www\06google\03com\0"
I want to get:
www.google.com
What's an efficient way to do this? My first thought would be iterating through the DNS char array and adding chars to my new array answer. Every time I see a '\' char, I would ignore the '\' and the two chars after it. Is there a way to get the resulting www.google.com without using a new array?
My disgusting implementation (my answer is a list of chars, which is not what I want; I want just the string www.google.com):
DNS = "\\03www\\06google\\03com\\0"
answer = []
i = 0
while i < len(DNS):
if DNS[i] == '\\' and DNS[i+1] != 0:
i += 3
elif DNS[i] == '\\' and DNS[i+1] == 0:
break
else:
answer.append(DNS[i])
i += 1
Now that you've explained your real problem, none of the answers you've gotten so far will work. Why? Because they're all ways to remove sequences like \03 from a string. But you don't have sequences like \03, you have single control characters.
You could, of course, do something similar, just replacing any control character with a dot.
But what you're really trying to do is not replace control characters with dots, but parse DNS packets.
DNS is defined by RFC 1035. The QNAME in a DNS packet is:
a domain name represented as a sequence of labels, where each label consists of a length octet followed by that number of octets. The domain name terminates with the zero length octet for the null label of the root. Note that this field may be an odd number of octets; no padding is used.
So, let's parse that. If you understand how labels consisting of "a length octet followed by that number of octets" relate to "Pascal strings", there's a quicker way. Also, you could write this more cleanly and less verbosely as a generator. But let's do it the dead-simple way:
import struct

def parse_qname(packet):
    components = []
    offset = 0
    while True:
        length, = struct.unpack_from('B', packet, offset)
        offset += 1
        if not length:
            break
        component, = struct.unpack_from('{}s'.format(length), packet, offset)
        offset += length
        components.append(component)
    return components, offset
import re

DNS = "\\03www\\06google\\03com\\0"
m = re.sub(r"\\[0-9a-f]+", ".", DNS).strip(".")
print(m)  # www.google.com
Maybe something like this?
#!/usr/bin/python3
import re

def convert(adorned_hostname):
    result1 = re.sub(r'^\\03', '', adorned_hostname)
    result2 = re.sub(r'\\0[36]', '.', result1)
    result3 = re.sub(r'\\0$', '', result2)
    return result3

def main():
    adorned_hostname = r"\03www\06google\03com\0"
    expected_result = 'www.google.com'
    actual_result = convert(adorned_hostname)
    print(actual_result, expected_result)
    assert actual_result == expected_result

main()
For the question as originally asked, replacing the backslash-hex sequences in strings like "\\03www\\06google\\03com\\0" with dots…
If you want to do this with a regular expression:
\\ matches a backslash.
[0-9A-Fa-f] matches any hex digit.
[0-9A-Fa-f]+ matches one or more hex digits.
\\[0-9A-Fa-f]+ matches a backslash followed by one or more hex digits.
You want to find each such sequence, and replace it with a dot, right? If you look through the re docs, you'll find a function called sub which is used for replacing a pattern with a replacement string:
re.sub(r'\\[0-9A-Fa-f]+', '.', DNS)
I suspect these may actually be octal, not hex, in which case you want [0-7] rather than [0-9A-Fa-f], but nothing else would change.
A different way to do this is to recognize that these are valid Python escape sequences. And, if we unescape them back to where they came from (e.g., with DNS.decode('string_escape')), this turns into a sequence of length-prefixed (aka "Pascal") strings, a standard format that you can parse in any number of ways, including the stdlib struct module. This has the advantage of validating the data as you read it, and not being thrown off by any false positives that could show up if one of the string components, say, had a backslash in the middle of it.
Of course that's presuming more about the data. It seems likely that the real meaning of this is "a sequence of length-prefixed strings, concatenated, then backslash-escaped", in which case you should parse it as such. But it could be just a coincidence that it looks like that, in which case it would be a very bad idea to parse it as such.

Strategy to replace spaces in string

I need to store a string, replacing its spaces with some character. When I retrieve it back, I need to replace the character with spaces again. The strategy I have thought of: while storing, I will replace space with _a and _a with _aa; while retrieving, I will replace _a with space and _aa with _a. That way, even if the user enters _a in the string, it will be handled. But I don't think this is a good strategy. Please let me know if anyone has a better one?
Replacing spaces with something else is a problem when that something can already be in the string. Why don't you simply encode the string? There are many ways to do that; one is to convert all characters to hexadecimal.
For instance
Hello world!
is encoded as
48656c6c6f20776f726c6421
The space is 0x20. Then you simply decode the string back (hex to ASCII).
This way there are no spaces in the encoded string.
-- Edit - optimization --
You replace all % and all spaces in the string with %xx where xx is the hex code of the character.
For instance
Wine having 12% alcohol
becomes
Wine%20having%2012%25%20alcohol
%20 is space
%25 is the % character
This way, neither % nor space is a problem anymore, and decoding is easy.
Encoding algorithm
- replace all `%` with `%25`
- replace all ` ` with `%20`
Decoding algorithm
- replace all `%xx` with the character having `xx` as hex code
(You may even optimize further, since you only need to encode two characters: use %1 for % and %2 for space. But I recommend the %xx solution, as it is more portable and can be extended later if you need to encode more characters.)
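A minimal sketch of this percent-style scheme in Java; the method names are mine, and it deliberately handles only the two characters discussed (malformed input is not checked):

// Encode: escape '%' first, then spaces, so decoding stays unambiguous.
static String encode(String s) {
    return s.replace("%", "%25").replace(" ", "%20");
}

// Decode: turn every %xx back into the character with that hex code.
static String decode(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == '%') {
            sb.append((char) Integer.parseInt(s.substring(i + 1, i + 3), 16));
            i += 2; // skip the two hex digits
        } else {
            sb.append(c);
        }
    }
    return sb.toString();
}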
I'm not sure your solution will work. When reading, how would you distinguish between strings that were originally " a" and strings that were originally "_a"? If I understand correctly, both will end up as "_aa".

In general, given a situation where a specific set of characters cannot appear as such but must be encoded, the solution is to choose one of the allowed characters as an "escape" character, remove it from the set of allowed characters, and encode all of the forbidden characters (including the escape character) as a two (or more) character sequence starting with the escape character. In C++, for example, a new line is not allowed in a string or character literal. The escape character is \; because of that, it must be encoded as an escape sequence as well. So we have "\n" for a new line (the choice of n is arbitrary), and "\\" for a \. (The choice of \ for the second character is also arbitrary, but it is fairly usual to use the escape character, escaped, to represent itself.) In your case, if you want to use _ as the escape character, and "_a" to represent a space, the logical choice would be "__" to represent a _ (but I'd suggest something a little more visually suggestive: maybe ^ as the escape, with "^_" for a space and "^^" for a ^). When reading, any time you see the escape character, the following character must be mapped (and if it isn't one of the predefined mappings, the input text is in error). This is simple to implement and very reliable; about the only disadvantage is that in an extreme case it can double the size of your string.
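A hedged sketch of that escape scheme in Java, using ^ as the escape character as suggested above. The method names are mine, and error handling for a malformed trailing escape is omitted:

// Encode: escape '^' itself first, then spaces, so the two rules never collide.
static String escapeSpaces(String s) {
    return s.replace("^", "^^").replace(" ", "^_");
}

// Decode: every '^' announces a two-character sequence to map back.
static String unescapeSpaces(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == '^') {
            char next = s.charAt(++i);
            sb.append(next == '_' ? ' ' : next); // "^_" -> space, "^^" -> '^'
        } else {
            sb.append(c);
        }
    }
    return sb.toString();
}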
You want to implement this using C/C++? I think you should split your string into multiple parts, separated by spaces.
If your string contains consecutive spaces, for example "a  b", it will be split into:
sub[0] = "a";
sub[1] = "";
sub[2] = "b";
Hope this will help!
With a normal string drawn from an alphabet of X characters, you cannot encode it using only X-1 characters if every input character must map to a single output character.
You can use a combination of 2 chars to replace a given character (this is exactly what you are trying in your example).
To do this, loop through your string to count the occurrences of spaces along with its length, make a new character array, and replace those spaces with "//", for example. The problem with this approach is that you cannot have "//" in your input string.
Another approach would be to use a rarely used char, for example "^", to replace the spaces.
The last approach is a combination of these two. It is used in Unix and PHP to include a syntax character as a literal in a string: if you want a literal ", you simply write it as \", etc.
Why don't you use the Replace function?
String* stringWithoutSpace= stringWithSpace->Replace(S" ", S"replacementCharOrText");
So now stringWithoutSpace contains no spaces. When you want to put those spaces back,
String* stringWithSpacesBack= stringWithoutSpace ->Replace(S"replacementCharOrText", S" ");
I think just encoding to ASCII hexadecimal is a neat idea, but it of course doubles the amount of storage needed.
If you want to do this using less memory, then you will need two-letter sequences, and you have to be careful that you can decode easily.
You could e.g. replace a blank by _a, but you also need to take care of your escape character _. To do this, replace every _ by __ (two underscores). You need to scan through the string once and do both replacements simultaneously.
This way, in the resulting text all original underscores will be doubled, and the only other occurrence of an underscore will be in the combination _a. You can safely translate this back. Whenever you see an underscore, you need a lookahead of 1 to see what follows. If an a follows, then this was a blank before. If _ follows, then it was an underscore before.
Note that the point is to replace your escape character (_) in the original string, and not the character sequence to which you map the blank. Your idea of replacing _a breaks, as you do not know whether _aa was originally _a or a blank followed by a.
I'm guessing that there is more to this question than appears; for example, that the strings you are storing must not only be free of spaces, but must also look like words or some such. You should be clear about your requirements (and you might consider satisfying the curiosity of the spectators by explaining why you need to do such things.)
Edit: As JamesKanze points out in a comment, the following won't work in the case where you can have more than one consecutive space. But I'll leave it here anyway, for historical reference. (I modified it to compress consecutive spaces, so it at least produces unambiguous output.)
std::string out;
char prev = 0;
for (char ch : in) {
    if (ch == ' ') {
        if (prev != ' ') out.push_back('_'); // runs of spaces are compressed into one '_'
    } else {
        if (prev == '_' && ch != '_') out.push_back('_'); // finish doubling an original '_'
        out.push_back(ch);
    }
    prev = ch;
}
if (prev == '_') out.push_back('_'); // close a trailing '_'

How can I specify Cyrillic character ranges in a Python 3.2 regex?

Once upon a time, I found this question interesting.
Today I decided to play around with the text of that book.
I want to use the regular expression in this script. When I use the script on Cyrillic text, it wipes out all of the Cyrillic characters, leaving only punctuation and whitespace.
#!/usr/bin/env python3.2
# coding=UTF-8
import sys, re

for file in sys.argv[1:]:
    f = open(file)
    fs = f.read()
    regexnl = re.compile('[^\s\w.,?!:;-]')
    rstuff = regexnl.sub('', fs)
    f.close()
    print(rstuff)
Something very similar has already been done in this answer.
Basically, I just want to be able to specify a set of characters that are not alphabetic, alphanumeric, or punctuation or whitespace.
This doesn't exactly answer your question, but the regex module has much better Unicode support than the built-in re module. For example, regex supports the \p{Cyrillic} property and its negation \P{Cyrillic} (as well as a huge number of other Unicode properties). It also handles Unicode case-insensitivity correctly.
You can specify the unicode range pretty easily: \u0400-\u0500. See also here.
Here's an example with some text from the Russian wikipedia, and also a sentence from the English wikipedia containing a single word in cyrillic.
#coding=utf-8
import re

ru = u"Владивосток находится на одной широте с Сочи, однако имеет среднегодовую температуру почти на 10 градусов ниже."
en = u"Vladivostok (Russian: Владивосток; IPA: [vlədʲɪvɐˈstok] ( listen); Chinese: 海參崴; pinyin: Hǎishēnwǎi) is a city and the administrative center of Primorsky Krai, Russia"

cyril1 = re.findall(u"[\u0400-\u0500]+", en)
cyril2 = re.findall(u"[\u0400-\u0500]+", ru)
for x in cyril1:
    print x
for x in cyril2:
    print x
output:
Владивосток
------
Владивосток
находится
на
одной
широте
с
Сочи
однако
имеет
среднегодовую
температуру
почти
на
градусов
ниже
Addition:
Two other ways that should also work, and in a bit less hackish fashion than specifying a unicode range:
re.findall("(?u)\w+", text) should match Cyrillic as well as Latin word characters.
re.findall("\w+", text, re.UNICODE) is equivalent
So, more specifically for your problem:
re.compile(r'[^\s\w.,?!:;-]', re.UNICODE) should do the trick.
See here (point 7)
For practical reasons I suggest using the exact Modern Russian subset of glyphs, instead of the general Cyrillic range. This is because Russian websites never use the full Cyrillic range, which includes Belarusian, Ukrainian, Slavonic and Macedonian glyphs. For historical reasons I am keeping "\u0463".
//Basic Cyr Unicode range for use on Russian websites.
0401,0406,0410,0411,0412,0413,0414,0415,0416,0417,0418,0419,041A,041B,041C,041D,041E,041F,0420,0421,0422,0423,0424,0425,0426,0427,0428,0429,042A,042B,042C,042D,042E,042F,0430,0431,0432,0433,0434,0435,0436,0437,0438,0439,043A,043B,043C,043D,043E,043F,0440,0441,0442,0443,0444,0445,0446,0447,0448,0449,044A,044B,044C,044D,044E,044F,0451,0462,0463
Using this subset on a multilingual website will save you 60% of bandwidth, in comparison to using the original full range, and will increase page loading speed accordingly.