Why does this strange character encoding happen? - html-entities

This HTML source code:
<td class="result">'DIVIS&Atilde;O DE EDUCA&Ccedil;&Atilde;O PR?ESCOLAR E ENSINO PRIM&Aacute;RIOO'</td>
displays as:
'DIVISÃO DE EDUCAÇÃO PR?ESCOLAR E ENSINO PRIMÁRIOO'
Yeah, these are some Portuguese characters.
Why does &Atilde; stand for Ã?

Those are just HTML character entities. Here's a whole list. &Atilde; stands for the Ã character because "Atilde" is a reasonable name for an A with a ~ over it ;-)

à is an entity much like &nbsp ;
It stands for a unicode point which defines the character A with a tilde on top.
This effect is not due to any special character encoding. The entity is defined in all common encodings. Have a look at ISO-8859-1:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1
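If you want to see an entity resolve to its character programmatically, here is a small illustrative sketch in Python (the standard library's html.unescape is just one of many ways to decode named entities):

from html import unescape

# &Atilde; is the named entity for U+00C3, LATIN CAPITAL LETTER A WITH TILDE
print(unescape("DIVIS&Atilde;O"))        # DIVISÃO
print(unescape("&Atilde;") == "\u00c3")  # True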

Related

BeautifulSoup misses some characters when decoding from utf-8 to unicode

I'm trying to parse Cyrillic text from a site page, and I get this error when I try to print soup.text for a string that includes closing quotation marks around a word:
'charmap' codec can't encode character u'\xab' in position 6: character maps to <undefined>
The original string from the page (UTF-8), as returned by urllib2.urlopen:
bbb = '\xab\x80\xd1\x8c\xc2\xbb'
\xab and \xbb are the opening and closing quotation marks.
I try to convert to unicode by hand (BeautifulSoup does this too)
unicode(bbb, 'utf8', errors='ignore')
But in spite of errors='ignore', the unknown elements are still there; I get:
u'\xab\u0446\u0435\u0437\u0430\u0440\u044c\xbb'
I tried to delete all the unknown elements starting with \x using a regular expression, but it doesn't work:
bbb = re.sub(r'[\x00-\x7f]', r' ', bbb)
u'\xbb' is not an unknown element, there is no problem there. It represents the character U+00BB Right-Pointing Double Angle Quotation Mark. The Unicode string literals u'\xbb' and u'\u00bb' represent the same string.
\x has a different meaning depending on what kind of string literal it is used in. In a byte string, it introduces a hex-encoded byte from 0x00 to 0xFF. In a Unicode string, it introduces a hex-encoded character from U+0000 to U+00FF. When producing the repr() representation of a string, Python prefers to output characters in the range up to U+00FF using \x escapes rather than the arguably-clearer \u escapes, because they're shorter.
The \u and \x are merely alternative ways to refer to a character in the string literal representation; they are not literally part of the value of the string. There is no actual backslash in the value, so you can't use re to try to remove characters that might appear in the repr() form as backslash escapes.
The actual error:
'charmap' codec can't encode character u'\xab' in position 6: character maps to <undefined>
is just PrintFails again, as usual. Apparently your console is using an encoding that doesn't include the character U+00AB.
If you are using the Windows Command Prompt, you could try to use win-unicode-console as a workaround for the brokenness of that particular console.
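If switching console setups isn't an option, here is a minimal sketch of another workaround (written for the Python 2 environment the question uses; the variable names are only illustrative, and bbb is the raw byte string from the question): decode the bytes once, then encode explicitly for the console, replacing anything it cannot represent.

import sys

bbb = '\xab\x80\xd1\x8c\xc2\xbb'        # raw UTF-8 bytes from urllib2
text = unicode(bbb, 'utf-8', 'ignore')  # decode bytes -> unicode once

# repr() displays U+00AB as \xab and U+00BB as \xbb, but there is no literal
# backslash in the string, so there is nothing to strip:
assert u'\xbb' == u'\u00bb'

# Printing can fail if the console encoding lacks a character; encoding
# explicitly with 'replace' substitutes '?' instead of raising an error.
console_encoding = sys.stdout.encoding or 'ascii'
print text.encode(console_encoding, 'replace')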

How to use Regex to strip punctuation without tainting UTF-8 or UTF-16 encoded text like chinese?

How do I strip punctuation from ASCII and UTF-8 encoded strings in R without messing up the original UTF-8 characters, specifically Chinese?
text <- "Longchamp Le Pliage 肩背包 (小)"
stri_replace_all_regex(text, '\\p{P}', '')
results in:
Longchamp Le Pliage ��背�� 小
but the desired result should be:
Longchamp Le Pliage 肩背包 小
I'm looking to remove all the CJK Symbols and Punctuation as well as the ASCII punctuation.
@akrun, sessionInfo() is as follows:
locale:
[1] LC_COLLATE=English_Singapore.1252 LC_CTYPE=English_Singapore.1252 LC_MONETARY=English_Singapore.1252
[4] LC_NUMERIC=C LC_TIME=English_Singapore.1252
Display of Chinese characters (hanzi) works variably depending on platform and IDE (see this answer for lots of details about R's handling of non-ASCII characters). It looks to me like stri_replace_all_regex is doing what you want, but that some of the hanzi are being displayed wrong (even if their underlying codepoints are correct). Try this:
library(stringi)
my_text <- "Longchamp Le Pliage 肩背包 (小)"
plot(0,0)
text(0, 0, my_text, pos=3)
If you can get the text to display on a plot, then underlyingly the string is properly encoded and the problem is just how it displays in the R terminal. If not, check Encoding(my_text) and consider using enc2utf8 before further text processing. If the plotting worked, try:
no_punct <- stri_replace_all_regex(my_text, "\\p{P}", "")
text(0, 0, no_punct, pos=1)
to see if the result of stri_replace_all_regex is in fact doing what you expect.
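As a cross-check outside R (purely illustrative; it uses the third-party Python regex module, which understands \p{…} property classes much as stringi does), the same property-based pattern removes the punctuation while leaving the hanzi untouched:

# pip install regex   (the standard-library re module does not support \p{P})
import regex

text = u"Longchamp Le Pliage 肩背包 (小)"
# \p{P} matches Unicode punctuation (the parentheses here) but not CJK ideographs
print(regex.sub(r"\p{P}+", "", text))    # Longchamp Le Pliage 肩背包 小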

QString and german umlauts

I am working with C++ and Qt and have a problem with German umlauts. I have a QString like "wir sind müde" and want to change it to "wir sind m&uuml;de" in order to show it correctly in a QTextBrowser.
I tried to do it like this:
s = s.replace( QChar('ü'), QString("&uuml;"));
But it does not work.
Also
s = s.replace( QChar('\u00fc'), QString("&uuml;"))
does not work.
When I iterate through all the characters of the string in a loop, the 'ü' shows up as two characters.
Can anybody help me?
QStrings are UTF-16.
QString stores a string of 16-bit QChars, where each QChar corresponds to one Unicode 4.0 character. (Unicode characters with code values above 65535 are stored using surrogate pairs, i.e., two consecutive QChars.)
So try
// if the ü in your source is UTF-16, check your file encoding to know this
s.replace("ü", "&uuml;");
// if you are inputting ü from an editor in latin1 mode
s.replace(QString::fromLatin1("ü"), "&uuml;");
s.replace(QString::fromUtf8("ü"), "&uuml;"); // there are a bunch of others, just make sure to select the correct one
There are two different representations of ü in Unicode:
The single point 00FC (LATIN SMALL LETTER U WITH DIAERESIS)
The sequence 0075 (LATIN SMALL LETTER U) 0308 (COMBINING DIAERESIS)
You should check for both.
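To make the two forms concrete, here is a short Python illustration (variable names are arbitrary; in Qt you can use QString::normalized to fold one form into the other before comparing or replacing):

import unicodedata

precomposed = u"\u00fc"   # single code point: LATIN SMALL LETTER U WITH DIAERESIS
combining = u"u\u0308"    # 'u' followed by COMBINING DIAERESIS

print(precomposed == combining)                                # False: different code points
print(unicodedata.normalize("NFC", combining) == precomposed)  # True: same text after normalization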

Regular Expression To Anglicize String Characters?

Is there a common regular expression that replaces all known special characters in non-English languages:
é, ô, ç, etc.
with English characters:
e, o, c, etc.
¡⅁uoɹʍ puɐ ⅂IɅƎ
This cannot be done, and you should not want to do it! It’s offensive to the whole world, and it’s naïve to the point of ignorance to believe that façade rhymes with arcade, or that Cañon City, Colorado falls under canon law.
You could run the string through Unicode Normalization Form D and discard the mark characters, but I am certainly not going to tell you how, because it is evil and wrong. It is evil for reasons already outlined, and it is wrong because there are a zillion cases it doesn't address at all.
Study Material
Here is what you need to read up on:
Unicode Normalization Forms - UAX #15 This annex describes normalization forms for Unicode text. When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation. This annex also provides examples, additional specifications regarding normalization of Unicode text, and information about conformance testing for Unicode normalization forms.
Canonical Equivalence in Applications - UTN #5 This document describes methods and formats for efficient processing of text under canonical equivalence, as defined in UAX #15 Unicode Normalization Forms [UAX15].
Unicode Collation Algorithm - UTS #10 This report is the specification of the Unicode Collation Algorithm (UCA), which details how to compare two Unicode strings while remaining conformant to the requirements of the Unicode Standard. The UCA also supplies the Default Unicode Collation Element Table (DUCET) as the data specifying the default collation order for all Unicode characters.
You MUST learn how to compare strings in a way that makes sense, and mutilating them simply never makes any sense whatso [pəʇələp] ever.
You must never just compare unnormalized strings code point by code point, and if possible you need to take the language into account, since rules differ between them.
Practical Examples
No matter the programming language you’re using, it may also help you to read the documentation for Perl’s Unicode::Normalize, Unicode::Collate, and Unicode::Collate::Locale modules.
For example, to search for "MÜSS" in a text that has "muß" in it, you would do this:
my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
# (normalization => undef) is REQUIRED.
my $str = "Ich muß studieren Perl.";
my $sub = "MÜSS";
my $match;
if (my($pos, $len) = $Collator->index($str, $sub)) {
    $match = substr($str, $pos, $len);
}
That will put "muß" into $match.
The Unicode::Collate::Locale module has support for tailoring to these locales:
af Afrikaans
ar Arabic
az Azerbaijani (Azeri)
be Belarusian
bg Bulgarian
ca Catalan
cs Czech
cy Welsh
da Danish
de__phonebook German (umlaut as 'ae', 'oe', 'ue')
eo Esperanto
es Spanish
es__traditional Spanish ('ch' and 'll' as a grapheme)
et Estonian
fi Finnish
fil Filipino
fo Faroese
fr French
ha Hausa
haw Hawaiian
hr Croatian
hu Hungarian
hy Armenian
ig Igbo
is Icelandic
ja Japanese [1]
kk Kazakh
kl Kalaallisut
ko Korean [2]
lt Lithuanian
lv Latvian
mk Macedonian
mt Maltese
nb Norwegian Bokmal
nn Norwegian Nynorsk
nso Northern Sotho
om Oromo
pl Polish
ro Romanian
ru Russian
se Northern Sami
sk Slovak
sl Slovenian
sq Albanian
sr Serbian
sv Swedish
sw Swahili
tn Tswana
to Tonga
tr Turkish
uk Ukrainian
vi Vietnamese
wo Wolof
yo Yoruba
zh Chinese
zh__big5han Chinese (ideographs: big5 order)
zh__gb2312han Chinese (ideographs: GB-2312 order)
zh__pinyin Chinese (ideographs: pinyin order)
zh__stroke Chinese (ideographs: stroke order)
You have a choice: you can do this right, or you can not do it at all. No one will thank you if you do it wrong.
Doing it right means taking UAX#15 and UTS#10 into account.
Nothing less is acceptable in this day and age. It’s not the 1960s any more, you know!
No, there is no such regex. Note that a regex only "describes" a particular piece of text.
A regex implementation may let you perform replacements, but a single substitution applies only one replacement string: it will not map a to a' and b to b' and so on in one pass.
Perhaps the language you're working with has a method in its API to perform this kind of replacement, but it won't be doing it with a regex.
This task is what the iconv library is for. Find out how to use it in whichever language you're developing in; chances are your language already has a binding for it.

How can I change extended latin characters to their unaccented ASCII equivalents?

I need a generic transliteration or substitution regex that will map extended latin characters to similar looking ASCII characters, and all other extended characters to '' (empty string) so that...
é becomes e
ê becomes e
á becomes a
ç becomes c
Ď becomes D
and so on, but things like ‡ or Ω or ‰ just get stripped away.
Use Unicode::Normalize to get NFD($str). In this form, all characters with diacritics are turned into a base character followed by a combining diacritic character. Then simply remove all the non-ASCII characters.
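A minimal sketch of that idea, shown in Python for illustration (the helper name is arbitrary; the Perl route does the same with Unicode::Normalize's NFD plus a substitution to drop what's left):

import unicodedata

def to_plain_ascii(s):
    # NFD decomposition turns accented letters into base letter + combining mark;
    # then drop the combining marks and anything else outside ASCII.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed
                   if not unicodedata.combining(c) and ord(c) < 128)

print(to_plain_ascii(u"é ê á ç Ď"))   # e e a c D
print(to_plain_ascii(u"‡Ω‰"))         # empty string: the symbols are stripped entirely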
Maybe a CPAN module might be of help?
Text::Unidecode looks promising, though it does not strip ‡ or Ω or ‰. Rather these are replaced by ++, O and %o. This might or might not be what you want.
Text::Unaccent, is another candidate but only for the part of getting rid of the accents.
Text::Unaccent or alternatively Text::Unaccent::PurePerl sounds like what you're asking for, at least the first half of it.
$unaccented = unac_string($charset, $string);
Removing all the non-ASCII characters would then be relatively simple:
s/[^\000-\177]+//g;
All brilliant answers, but none of them actually worked for me. Putting extended characters directly in the source code caused problems when working in terminal windows or various code/text editors across platforms. I was able to try out Unicode::Normalize, Text::Unidecode and Text::Unaccent, but wasn't able to get any of them to do exactly what I want.
In the end I just enumerated all the characters I wanted transliterated myself, for UTF-8 input (the most frequent encoding found in my data).
I needed two extra substitutions to take care of æ and Æ, which I want mapped to two characters.
For interested parties the final code is: (the tr is a single line)
$word =~ tr/\xC0\xC1\xC2\xC3\xC4\xC5\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF
\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD8\xD9\xDA\xDB\xDC\xDD\xE0\xE1\xE2\xE3\xE4
\xE5\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF8
\xF9\xFA\xFB\xFC\xFD\xFF/AAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaceeeeiiiionoo
oooouuuuyy/;
$word =~ s/\xC6/AE/g;
$word =~ s/\xE6/ae/g;
$word =~ s/[^\x00-\x7F]+//g;
Since things like Ď are outside the Latin-1 range the tr above covers, they don't occur nearly so often in my input data; for those, I chose to just lose everything above 127.
When I want to translate whole strings, not only single characters, I use this approach:
my %trans = (
    'é' => 'e',
    'ê' => 'e',
    'á' => 'a',
    'ç' => 'c',
    'Ď' => 'D',
    map +($_ => ''), qw(‡ Ω ‰),
);
my $re = qr/${ \(join'|', map quotemeta, keys %trans)}/;
s/($re)/$trans{$1}/ge;
If you want something more complicated, you can use functions instead of string constants. With this approach you can do anything you want, but for your case tr should be more effective:
tr/éêáçĎ/eeacD/;
tr/‡Ω‰//d;