How to use regex to strip punctuation without tainting UTF-8 or UTF-16 encoded text like Chinese?

How do I strip punctuation from ASCII and UTF-8 encoded strings in R without mangling the original UTF-8 characters, specifically Chinese?
text <- "Longchamp Le Pliage 肩背包 (小)"
stri_replace_all_regex(text, '\\p{P}', '')
results in:
Longchamp Le Pliage ��背�� 小
but the desired result should be:
Longchamp Le Pliage 肩背包 小
I'm looking to remove all the CJK Symbols and Punctuation as well as ASCII punctuation.
@akrun, sessionInfo() is as follows:
locale:
[1] LC_COLLATE=English_Singapore.1252 LC_CTYPE=English_Singapore.1252 LC_MONETARY=English_Singapore.1252
[4] LC_NUMERIC=C LC_TIME=English_Singapore.1252

Display of Chinese characters (hanzi) varies by platform and IDE (see this answer for details on R's handling of non-ASCII characters). It looks to me like stri_replace_all_regex is doing what you want, but some of the hanzi are being displayed wrongly even though their underlying codepoints are correct. Try this:
library(stringi)
my_text <- "Longchamp Le Pliage 肩背包 (小)"
plot(0,0)
text(0, 0, my_text, pos=3)
If you can get the text to display on a plot, then the underlying string is properly encoded and the problem is just how it is displayed in the R console. If not, check Encoding(my_text) and consider using enc2utf8 before further text processing. If the plotting worked, try:
no_punct <- stri_replace_all_regex(my_text, "\\p{P}", "")
text(0, 0, no_punct, pos=1)
to confirm that stri_replace_all_regex is in fact doing what you expect.
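For what it's worth, the same removal expressed with Python's third-party regex module shows that the \p{P} pattern itself is sound once the string is properly decoded (a minimal sketch, assuming pip install regex; Python 3):

import regex

text = "Longchamp Le Pliage 肩背包 (小)"
# \p{P} covers ASCII punctuation and the CJK punctuation marks alike
print(regex.sub(r"\p{P}+", "", text))  # Longchamp Le Pliage 肩背包 小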

Related

Replace all emojis from a given unicode string

I have a list of unicode symbols from the emoji package. My end goal is to create a function that takes a Unicode string as input, e.g. some👩😌thing, and removes all emojis, leaving "something". Below is a demonstration of what I want to achieve:
from emoji import UNICODE_EMOJI
text = 'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
output = ... = 'something'
I have been trying to do the above, and in the process I came across some strange behavior, which I demonstrate below. I believe that if the code below is fixed, I will be able to achieve my end goal.
import regex as re
print u'\U0001F469' # 👩
print u'\U0001F60C' # 😌
print u'\U0001F469\U0001F60C' # 👩😌
text = u'some\U0001F469\U0001F60Cthing'
print text # some👩😌thing
# Removing "👩😌" works
print re.sub(ur'[\U0001f469\U0001F60C]+', u'', text) # something
# Removing only "👩" doesn't work
print re.sub(ur'[\U0001f469]+', u'', text) # some�thing
On a narrow build of Python 2.7 (the default on Windows and macOS), Unicode codepoints above U+FFFF are stored as a surrogate pair, meaning Python actually sees them as two characters. You can prove this to yourself with len(u'\U0001F469'), which is 2 on a narrow build.
The best way to solve this is to move to a Python that treats those codepoints as a single entity rather than a surrogate pair. You can compile Python 2.7 as a wide build (configure with --enable-unicode=ucs4), and Python 3.3+ does it automatically.
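A quick way to check which kind of build you are on (a minimal sketch; the commented values are what a narrow build prints):

# -*- coding: utf-8 -*-
import sys

woman = u'\U0001F469'  # 👩

print sys.maxunicode  # 65535 on a narrow build, 1114111 on a wide build
print len(woman)      # 2 on a narrow build (a surrogate pair), 1 on a wide build
print list(woman)     # [u'\ud83d', u'\udc69'] on a narrow build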
To create a regular expression to use for the replace, simply join all the characters together with |. Since the list of characters is already encoded with surrogate pairs, the join creates the proper pattern.
subs = u'|'.join(exclude_list)
print re.sub(subs, u'', text)
The old 2.7 regex engine gets confused because:
Python 2.7 narrow builds use fixed 16-bit Unicode storage, in which codepoints above U+FFFF are automatically represented as surrogate pairs.
Before the regex engine "sees" your Python string, Python has already split each large codepoint into two separate code units, each a lone surrogate that is meaningless on its own.
That means ur'[\U0001f469]+' is really a character class of two surrogate code units. The class matches any run of either unit, so it consumes 👩 entirely plus the high surrogate of 😌, leaving a lone low surrogate behind; that lone surrogate is the � in your output.
This fixes it:
print re.sub(ur'(\U0001f469|\U0001F60C)+', u'', text) # something
# Removing only "👩", which failed with a character class, works with a group:
print re.sub(ur'(\U0001f469)+', u'', text) # some😌thing
This works because the regex engine now sees the exact same sequence of code units, surrogate pairs or otherwise, that you are looking for.
If you want to remove all emoji from the exclude_list, you can explicitly loop over its contents and replace one by one:
exclude_list = UNICODE_EMOJI.keys()
for bad in exclude_list:  # or simply "for bad in UNICODE_EMOJI" if you gotta catch them all
    if bad in text:
        print 'Removing ' + bad
        text = text.replace(bad, '')
Removing 👩
Removing 😌
something
(This also shows the intermediate results as proof it works; you only need the replace line in the loop.)
To remove all emojis from the input string using the current approach, use
import re
from emoji import UNICODE_EMOJI
text = u'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
rx = ur"(?:{})+".format("|".join(map(re.escape,exclude_list)))
print re.sub(rx, u'', text)
# => u'something'
If you do not re.escape the emoji chars, you will get a nothing to repeat error, because some entries begin with literal regex metacharacters (the keycap sequences start with * or #) that clash with the alternation inside the group, so map(re.escape, exclude_list) is required.
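A minimal sketch of the failure mode, using a hypothetical two-entry exclude_list (the keycap asterisk really does begin with a literal *):

import re

exclude_list = [u'\U0001F469', u'*\ufe0f\u20e3']  # 👩 and the keycap asterisk

try:
    # Unescaped, the pattern becomes (?:👩|*️⃣)+ and the bare * breaks it
    re.compile(u"(?:{})+".format(u"|".join(exclude_list)))
except re.error as e:
    print e  # nothing to repeat

# Escaping each entry first makes the same pattern compile:
rx = u"(?:{})+".format(u"|".join(map(re.escape, exclude_list)))
print re.sub(rx, u'', u'some\U0001F469thing')  # => something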
Tested in Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2.

How to determine if character string contains non-Roman characters in R

What is the preferred way of determining if a string contains non-Roman/non-English (e.g., ないでさ) characters?
You could use regex/grep to check for hex values of characters outside the range of printable ASCII characters:
x <- 'ないでさ'
grep( "[^\x20-\x7F]",x )
#[1] 1
grep( "[^\x20-\x7F]","Normal text" )
#integer(0)
If you wanted to allow non-printing ("control") characters to be considered "English", you could extend the range of the character class in the first argument to grep to start at "\x01". See ?regex for more information on character classes, and ?Quotes for how to specify characters as Unicode, hexadecimal, or octal escapes.
The R.oo package has conversion functions that may be useful:
library(R.oo)
?intToChar
?charToInt
The fact that Henrik Bengtsson saw fit to include these in his package suggests there is no handy method for this in base R; he's a long-time useR/guRu.
Seeing the other answer prompted this effort, which seems straightforward:
> is.na( iconv( c(x, "OrdinaryASCII") , "", "ASCII") )
[1] TRUE FALSE
You could determine whether a string contains non-Latin/non-ASCII characters with iconv and grep:
# My example, because you didn't add your data
characters <- c("ないでさ, satisfação, катынь, Work, Awareness, Potential, für")
# First you convert string to vector of words
characters.unlist <- unlist(strsplit(characters, split=", "))
# Then find indices of words with non-ASCII characters using ICONV
characters.non.ASCII <- grep("characters.unlist", iconv(characters.unlist, "latin1", "ASCII", sub="characters.unlist"))
# subset original vector of words to exclude words with non-ASCII characters
data <- characters.unlist[-characters.non.ASCII]
# convert vector back to a string
dat.1 <- paste(data, collapse = ", ")
# Now if you run
characters.non.ASCII
[1] 1 2 3 7
That means the words at indices 1, 2, 3 and 7 contain non-ASCII characters; in my case those are ないでさ, satisfação, катынь and für.
You could also run
dat.1 # and the output will be all ASCII characters
[1] "Work, Awareness, Potential"

ASCII Control characters: \x0e - \x1f

I want to convert the \x0e and \x0f characters to equivalent keyboard text.
Is Python able to encode/decode the ASCII control characters (\x0e - \x1f) to keyboard text?
These characters are not mis-encoded; \x0e and \x0f are valid ASCII, they are just non-printing control characters (Shift Out and Shift In) with no visible glyph. Compare \55, which is an octal escape for the printable character "-".
Proof of this can be found if we run your characters, together with \55, through this simple program:
text = "\x0e \x0f \55" # What we want to try goes here
new_text = text.encode('ascii') # What we want to encode it in
print new_text # Print the outcome
The outcome is:
'\x0e \x0f -'
This shows that \55 is simply an escape for a printable character, while \x0e and \x0f pass through the ASCII encoding unchanged; they just have no printable representation. If by "keyboard text" you mean the key combination that produces each control character, the conventional rendering is caret notation: Ctrl+N produces \x0e, so it is written ^N.
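If you want that caret rendering in Python, here is a small sketch (to_caret is my own helper, not a standard function):

# Map ASCII control characters (\x00 - \x1f) to caret notation:
# the control character chr(c) is typed as Ctrl plus the letter chr(c + 64).
def to_caret(s):
    out = []
    for ch in s:
        if ord(ch) < 0x20:
            out.append('^' + chr(ord(ch) + 64))
        else:
            out.append(ch)
    return ''.join(out)

print to_caret("\x0e \x0f \x1f")  # ^N ^O ^_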

How can I specify Cyrillic character ranges in a Python 3.2 regex?

Once upon a time, I found this question interesting.
Today I decided to play around with the text of that book.
I want to use the regular expression in this script. When I use the script on Cyrillic text, it wipes out all of the Cyrillic characters, leaving only punctuation and whitespace.
#!/usr/bin/env python3.2
# coding=UTF-8
import sys, re
for file in sys.argv[1:]:
    f = open(file)
    fs = f.read()
    regexnl = re.compile('[^\s\w.,?!:;-]')
    rstuff = regexnl.sub('', fs)
    f.close()
    print(rstuff)
Something very similar has already been done in this answer.
Basically, I just want to be able to specify a set of characters that are not alphabetic, alphanumeric, or punctuation or whitespace.
This doesn't exactly answer your question, but the regex module has much much better unicode support than the built-in re module. e.g. regex supports the \p{Cyrillic} property and its negation \P{Cyrillic} (as well as a huge number of other unicode properties). Also, it handles unicode case-insensitivity correctly.
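For example (a short sketch; assumes pip install regex, Python 3):

import regex

en = "Vladivostok (Russian: Владивосток) is a city in Primorsky Krai"
print(regex.findall(r"\p{Cyrillic}+", en))  # ['Владивосток']
print(regex.sub(r"\p{Cyrillic}+", "", en))  # strips the Cyrillic run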
You can specify the Unicode range pretty easily: \u0400-\u0500 (the Cyrillic block proper is U+0400-U+04FF).
Here's an example with some text from the Russian wikipedia, and also a sentence from the English wikipedia containing a single word in cyrillic.
#coding=utf-8
import re
ru = u"Владивосток находится на одной широте с Сочи, однако имеет среднегодовую температуру почти на 10 градусов ниже."
en = u"Vladivostok (Russian: Владивосток; IPA: [vlədʲɪvɐˈstok] ( listen); Chinese: 海參崴; pinyin: Hǎishēnwǎi) is a city and the administrative center of Primorsky Krai, Russia"
cyril1 = re.findall(u"[\u0400-\u0500]+", en)
cyril2 = re.findall(u"[\u0400-\u0500]+", ru)
for x in cyril1:
    print x
for x in cyril2:
    print x
output:
Владивосток
------
Владивосток
находится
на
одной
широте
с
Сочи
однако
имеет
среднегодовую
температуру
почти
на
градусов
ниже
Addition:
Two other ways that should also work, and in a bit less hackish fashion than specifying a unicode range:
re.findall("(?u)\w+", text) should match Cyrillic as well as Latin word characters.
re.findall("\w+", text, re.UNICODE) is equivalent
So, more specifically for your problem:
re.compile(r'[^\s\w.,?!:;-]', re.UNICODE) should do the trick; note that the flag is passed as a separate argument, not inside the pattern string.
See here (point 7)
For practical reasons I suggest using the exact Modern Russian subset of characters, instead of general Cyrillic. This is because Russian websites never use the full Cyrillic range, which also includes Belarusian, Ukrainian, Slavonic and Macedonian letters. For historical reasons I am keeping "\u0463" (yat).
//Basic Cyr Unicode range for use on Russian websites.
0401,0406,0410,0411,0412,0413,0414,0415,0416,0417,0418,0419,041A,041B,041C,041D,041E,041F,0420,0421,0422,0423,0424,0425,0426,0427,0428,0429,042A,042B,042C,042D,042E,042F,0430,0431,0432,0433,0434,0435,0436,0437,0438,0439,043A,043B,043C,043D,043E,043F,0440,0441,0442,0443,0444,0445,0446,0447,0448,0449,044A,044B,044C,044D,044E,044F,0451,0462,0463
Using this subset on a multilingual website will save you 60% of bandwidth, in comparison to using the original full range, and will increase page loading speed accordingly.
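If you want to turn that list into something usable, one way (my own sketch, not part of the original answer) is to build a regex character class from the codepoints:

import re

# Abbreviated here; paste in the full comma-separated list from above.
codepoints = "0401,0410,0430,0438,043A,043B,0451,0463"
char_class = "[" + "".join(chr(int(cp, 16)) for cp in codepoints.split(",")) + "]+"

print(re.findall(char_class, "Ёлка и ѣ"))  # ['Ёлка', 'и', 'ѣ']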

QString and german umlauts

I am working with C++ and Qt and have a problem with German umlauts. I have a QString that should read "wir sind müde", but the ü is stored as two characters rather than one, and I want to convert it so that the text shows correctly in a QTextBrowser.
I tried to do it like this:
s = s.replace( QChar('ü'), QString("ü"));
But it does not work.
Also
s = s.replace( QChar('\u00fc'), QString("ü"))
does not work.
When I iterate through all characters of the string in a loop, the 'ü' are two characters.
Can anybody help me?
QStrings are UTF-16.
QString stores a string of 16-bit QChars, where each QChar corresponds to one Unicode 4.0 character. (Unicode characters with code values above 65535 are stored using surrogate pairs, i.e., two consecutive QChars.)
So try
// if your source file encoding matches what the compiler assumes, a plain literal works:
s.replace("ü", "ü");
// if the string was typed in an editor in Latin-1 mode:
s.replace(QString::fromLatin1("ü"), "ü");
// or, if the raw bytes are UTF-8:
s.replace(QString::fromUtf8("ü"), "ü");
// there are a bunch of other from*() conversions; just make sure to select the correct one
There are two different representations of ü in Unicode:
The single code point U+00FC (LATIN SMALL LETTER U WITH DIAERESIS)
The sequence U+0075 (LATIN SMALL LETTER U) followed by U+0308 (COMBINING DIAERESIS)
You should check for both.
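To see the difference concretely, here is a sketch in Python purely for illustration (in Qt itself, QString::normalized performs the equivalent NFC/NFD conversions):

import unicodedata

nfc = "\u00fc"   # LATIN SMALL LETTER U WITH DIAERESIS (precomposed)
nfd = "u\u0308"  # LATIN SMALL LETTER U + COMBINING DIAERESIS (decomposed)

print(nfc == nfd)                                # False: different code points
print(len(nfc), len(nfd))                        # 1 2
print(unicodedata.normalize("NFC", nfd) == nfc)  # True: NFC composes the pair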