Trouble with encodings on my server, htmlentities shows wrong output by default - html-entities

Here is my problem
<?php
$aeo = 'å ä ö';
echo htmlentities($aeo);
?>
OUTPUT: (centos w/apache) [wrong]
å ä ö
[output with html:] å ä ö
OUTPUT SHOULD BE: (works perfectly on localhost with xampp on windows 7) [correct]
å ä ö
[output with html] å ä ö
I have no idea how to fix this and I have tried everything. Do you know how to possibly solve it? htmlentities isn't working correctly, apparently using wrong encoding or something like that... And the thing is, if I use htmlentities($aeo, ENT_QUOTES, 'UTF-8'); it works correctly (shows å ä ö as it should), but I have this in my php.ini: default_charset = "UTF-8" and this in my core.php: setlocale(LC_ALL, "sv_SE.UTF-8"); Thanks in advance

Unfortunately depends on the version of PHP. Newer versions use UTF-8 as default and older versions (less then 5.4.0) the default is ISO-8859-1 and is independent of the CHARSET value set in PHP.INI

Related

re.compile(r' [Б]$').search(' Б') returns None

It seems that re.compile(r' [Б]$').search(' Б') returns None even though it should return <_sre.SRE_Match object; span=(0, 2), match=' Б'>.
This happens when running it on python2, but not on python3, and it happens only with a Unicode symbol (I tried Cyrillic and Chinese). It works fine with Latin symbols.
sashoalm#HP:~/$ python2
Python 2.7.17 (default)
>>> print(re.compile(r' [Б]$').search(' Б'))
None
Any idea what is happening? Is it a real bug or is it supposed to fail?
OK, I realized what is happening after reading https://snarky.ca/why-python-3-exists/ - the part about bytes - python is treating the unicode utf8 symbol as 2 ASCII characters - \xd0\x91, so it has to match a space, then only one of the 2, then end.
This means that print(re.compile(r' [Б][Б]$').search(' Б')) will match.

How to use Regex to strip punctuation without tainting UTF-8 or UTF-16 encoded text like chinese?

How do I strip punctuation from ASCII and UTF-8 encoded strings without messing up the UTF-8 original characters, specifically Chinese, in R.
text <- "Longchamp Le Pliage 肩背包 (小)"
stri_replace_all_regex(text, '\\p{P}', '')
results in:
Longchamp Le Pliage ��背�� 小
but the desired result should be:
Longchamp Le Pliage 肩背包 小
I'm looking to remove all the CJK Symbols and Punctuation as well ask ASCII punctuations.
#akrun, sessionInfo() is as follows
locale:
[1] LC_COLLATE=English_Singapore.1252 LC_CTYPE=English_Singapore.1252 LC_MONETARY=English_Singapore.1252
[4] LC_NUMERIC=C LC_TIME=English_Singapore.1252
Display of Chinese characters (hanzi) works variably depending on platform and IDE (see this answer for lots of details about R's handling of non-ASCII characters). It looks to me like stri_replace_all_regex is doing what you want, but that some of the hanzi are being displayed wrong (even if their underlying codepoints are correct). Try this:
library(stringi)
my_text <- "Longchamp Le Pliage 肩背包 (小)"
plot(0,0)
text(0, 0, my_text, pos=3)
If you can get the text to display on a plot, then underlyingly the string is properly encoded and the problem is just how it displays in the R terminal. If not, check Encoding(my_text) and consider using enc2utf8 before further text processing. If the plotting worked, try:
no_punct <- stri_replace_all_regex(my_text, "\\p{P}", "")
text(0, 0, no_punct, pos=1)
to see if the result of stri_replace_all_regex is in fact doing what you expect.

How to search all CJK chars in vim?

I can search a CJK char (such as 小) by using a unicode code point:
/\%u5c0f
/[\u5c0f]
I cannot search all of CJK chars by using [\u4E00-\u9FFF], because vim manual says:
:help /[]
NOTE: The other backslash codes mentioned above do not work inside []!
Is these a way to do the job?
It seems that Vim ranges are somehow limited to the same high byte, because /[\u4E00-\u4eFF] works fine. If you don't mind the mess, try:
/[\u4e00-\u4eff\u4f00-\u4fff\u5000-\u50ff\u5100-\u51ff\u5200-\u52ff\u5300-\u53ff\u5400-\u54ff\u5500-\u55ff\u5600-\u56ff\u5700-\u57ff\u5800-\u58ff\u5900-\u59ff\u5a00-\u5aff\u5b00-\u5bff\u5c00-\u5cff\u5d00-\u5dff\u5e00-\u5eff\u5f00-\u5fff\u6000-\u60ff\u6100-\u61ff\u6200-\u62ff\u6300-\u63ff\u6400-\u64ff\u6500-\u65ff\u6600-\u66ff\u6700-\u67ff\u6800-\u68ff\u6900-\u69ff\u6a00-\u6aff\u6b00-\u6bff\u6c00-\u6cff\u6d00-\u6dff\u6e00-\u6eff\u6f00-\u6fff\u7000-\u70ff\u7100-\u71ff\u7200-\u72ff\u7300-\u73ff\u7400-\u74ff\u7500-\u75ff\u7600-\u76ff\u7700-\u77ff\u7800-\u78ff\u7900-\u79ff\u7a00-\u7aff\u7b00-\u7bff\u7c00-\u7cff\u7d00-\u7dff\u7e00-\u7eff\u7f00-\u7fff\u8000-\u80ff\u8100-\u81ff\u8200-\u82ff\u8300-\u83ff\u8400-\u84ff\u8500-\u85ff\u8600-\u86ff\u8700-\u87ff\u8800-\u88ff\u8900-\u89ff\u8a00-\u8aff\u8b00-\u8bff\u8c00-\u8cff\u8d00-\u8dff\u8e00-\u8eff\u8f00-\u8fff\u9000-\u90ff\u9100-\u91ff\u9200-\u92ff\u9300-\u93ff\u9400-\u94ff\u9500-\u95ff\u9600-\u96ff\u9700-\u97ff\u9800-\u98ff\u9900-\u99ff\u9a00-\u9aff\u9b00-\u9bff\u9c00-\u9cff\u9d00-\u9dff\u9e00-\u9eff\u9f00-\u9fff]
I played around with this quite a bit and in vim the following seems to find all the Kanji characters in my Kanji/Pinyin/English text:
[^!-~0-9 aāáǎăàeēéěèiīíǐĭìoōóǒŏòuūúǔùǖǘǚǜ]
Vim cannot actually do this by itself, since you aren’t given access to Unicode properties like \p{Han}.
As of Unicode v6.0, the range of codepoints for characters in the Han script is:
2E80-2E99 2E9B-2EF3 2F00-2FD5 3005-3005 3007-3007 3021-3029 3038-303B 3400-4DB5 4E00-9FCB F900-FA2D FA30-FA6D FA70-FAD9 20000-2A6D6 2A700-2B734 2B740-2B81D 2F800-2FA1D
Whereas with Unicode v6.1, the range of Han codepoints has changed to:
2E80-2E99 2E9B-2EF3 2F00-2FD5 3005-3005 3007-3007 3021-3029 3038-303B 3400-4DB5 4E00-9FCC F900-FA6D FA70-FAD9 20000-2A6D6 2A700-2B734 2B740-2B81D 2F800-2FA1D
I also seem to recall that Vim has difficulties expressing astral code points, which are needed for this to work correctly. For example, using the flexible \x{HHHHHH} notation from Java 7 or Perl, you would have:
[\x{2E80}-\x{2E99}\x{2E9B}-\x{2EF3}\x{2F00}-\x{2FD5}\x{3005}-\x{3005}\x{3007}-\x{3007}\x{3021}-\x{3029}\x{3038}-\x{303B}\x{3400}-\x{4DB5}\x{4E00}-\x{9FCC}\x{F900}-\x{FA6D}\x{FA70}-\x{FAD9}\x{20000}-\x{2A6D6}\x{2A700}-\x{2B734}\x{2B740}-\x{2B81D}\x{2F800}-\x{2FA1D}]
Notice that the last part of the range is \x{2F800}-\x{2FA1D}, which is beyond the BMP. But what you really need is \p{Han} (meaning, \p{Script=Han}). This again shows that regex dialects that don’t support at least Level 1 of UTS#18: Basic Unicode Support are inadequate for working with Unicode. Vim’s regexes are inadequate for basic Unicode work.
EDITED TO ADD
Here’s the program that dumps out the ranges of code points that apply to any given Unicode script.
#!/usr/bin/env perl
#
# uniscrange - given a Unicode script name, print out the ranges of code
# points that apply.
# Tom Christiansen <tchrist#perl.com>
use strict;
use warnings;
use Unicode::UCD qw(charscript);
for my $arg (#ARGV) {
print "$arg: " if #ARGV > 1;
dump_range($arg);
}
sub dump_range {
my($scriptname) = #_;
my $alist = charscript($scriptname);
unless ($alist) {
warn "Unknown script '$scriptname'\n";
return;
}
for my $aref (#$alist) {
my($start, $stop, $name) = #$aref;
die "got $name, not $scriptname\n" unless $name eq $scriptname;
printf "%04X-%04X ", $start, $stop;
}
print "\n";
}
Its answers depend on which version of Perl — and thus, which version of Unicode — you’re running it against.
$ perl5.8.8 ~/uniscrange Latin Greek
Latin: 0041-005A 0061-007A 00AA-00AA 00BA-00BA 00C0-00D6 00D8-00F6 00F8-01BA 01BB-01BB 01BC-01BF 01C0-01C3 01C4-0241 0250-02AF 02B0-02B8 02E0-02E4 1D00-1D25 1D2C-1D5C 1D62-1D65 1D6B-1D77 1D79-1D9A 1D9B-1DBF 1E00-1E9B 1EA0-1EF9 2071-2071 207F-207F 2090-2094 212A-212B FB00-FB06 FF21-FF3A FF41-FF5A
Greek: 0374-0375 037A-037A 0384-0385 0386-0386 0388-038A 038C-038C 038E-03A1 03A3-03CE 03D0-03E1 03F0-03F5 03F6-03F6 03F7-03FF 1D26-1D2A 1D5D-1D61 1D66-1D6A 1F00-1F15 1F18-1F1D 1F20-1F45 1F48-1F4D 1F50-1F57 1F59-1F59 1F5B-1F5B 1F5D-1F5D 1F5F-1F7D 1F80-1FB4 1FB6-1FBC 1FBD-1FBD 1FBE-1FBE 1FBF-1FC1 1FC2-1FC4 1FC6-1FCC 1FCD-1FCF 1FD0-1FD3 1FD6-1FDB 1FDD-1FDF 1FE0-1FEC 1FED-1FEF 1FF2-1FF4 1FF6-1FFC 1FFD-1FFE 2126-2126 10140-10174 10175-10178 10179-10189 1018A-1018A 1D200-1D241 1D242-1D244 1D245-1D245
$ perl5.10.0 ~/uniscrange Latin Greek
Latin: 0041-005A 0061-007A 00AA-00AA 00BA-00BA 00C0-00D6 00D8-00F6 00F8-01BA 01BB-01BB 01BC-01BF 01C0-01C3 01C4-0293 0294-0294 0295-02AF 02B0-02B8 02E0-02E4 1D00-1D25 1D2C-1D5C 1D62-1D65 1D6B-1D77 1D79-1D9A 1D9B-1DBE 1E00-1E9B 1EA0-1EF9 2071-2071 207F-207F 2090-2094 212A-212B 2132-2132 214E-214E 2184-2184 2C60-2C6C 2C74-2C77 FB00-FB06 FF21-FF3A FF41-FF5A
Greek: 0374-0375 037A-037A 037B-037D 0384-0385 0386-0386 0388-038A 038C-038C 038E-03A1 03A3-03CE 03D0-03E1 03F0-03F5 03F6-03F6 03F7-03FF 1D26-1D2A 1D5D-1D61 1D66-1D6A 1DBF-1DBF 1F00-1F15 1F18-1F1D 1F20-1F45 1F48-1F4D 1F50-1F57 1F59-1F59 1F5B-1F5B 1F5D-1F5D 1F5F-1F7D 1F80-1FB4 1FB6-1FBC 1FBD-1FBD 1FBE-1FBE 1FBF-1FC1 1FC2-1FC4 1FC6-1FCC 1FCD-1FCF 1FD0-1FD3 1FD6-1FDB 1FDD-1FDF 1FE0-1FEC 1FED-1FEF 1FF2-1FF4 1FF6-1FFC 1FFD-1FFE 2126-2126 10140-10174 10175-10178 10179-10189 1018A-1018A 1D200-1D241 1D242-1D244 1D245-1D245
You can use the corelist -a Unicode command to see which version of Unicode goes with which version of Perl. Here is selected output:
$ corelist -a Unicode
v5.8.8 4.1.0
v5.10.0 5.0.0
v5.12.2 5.2.0
v5.14.0 6.0.0
v5.16.0 6.1.0
I don't understand the "same high byte problem" but it seems like it does not apply (at least not for me, VIM 7.4) when you actually enter the character to build up the ranges.
I usually search from U+3400(㐀) to U+9FCC(鿌) to capture Chinese characters in Japanese texts.
U+3400(㐀) is beginning of "CJK Unified Ideographs Extension A"
U+4DC0 - U+4DFF "Yijing Hexagram Symbols" is in between but not excluded for the sake of simplicity.
U+9FCC(鿌) is the end of "CJK Unified Ideographs"
Please note that the Japanese writing uses "々" as a kanji repetition symbol which is not part of this block. You can find it in the Block "Japanese Symbols and Punctuation."
/[㐀-鿌]
A (almost?) complete set of Chinese characters with extensions
/[㐀-鿌豈-龎𠀀-𪘀]
This range includes:
CJK Unified Ideographs Extension A
Yijing Hexagram Symbols (shouldn't be part of it)
CJK Unified Ideographs (main part)
CJK Compatibility Ideographs
CJK Unified Ideographs Extension B,
CJK Unified Ideographs Extension C,
CJK Unified Ideographs Extension D,
CJK Compatibility Ideographs Supplement
Bonus for people working on content in Japanese language:
Hiragana goes from U+3041 to U+3096
/[ぁ-ゟ]
Katakana
/[゠-ヿ]
Kanji Radicals
/[⺀-⿕]
Japanese Symbols and Punctuation.
Note that this range also includes 々(repetition of last kanji) and 〆(abbreviation for shime「しめ」). You might want to add them to your range to find words.
[ -〿]
Miscellaneous Japanese Symbols and Characters
/[ㇰ-ㇿ㈠-㉃㊀-㍿]
Alphanumeric and Punctuation (Full Width)
[!-~]
sources:
http://www.fileformat.info/info/unicode/char/9fcc/index.htm
http://www.localizingjapan.com/blog/2012/01/20/regular-expressions-for-japanese-text/comment-page-1/#comment-46891
In some simple cases, I use this to search chinese characters. It also matches Japanese, Russian characters and so on.
[^\x00-\xff]

QString and german umlauts

I am working with C++ and QT and have a problem with german umlauts. I have a QString like "wir sind müde" and want to change it to "wir sind müde" in order to show it correctly in a QTextBrowser.
I tried to do it like this:
s = s.replace( QChar('ü'), QString("ü"));
But it does not work.
Also
s = s.replace( QChar('\u00fc'), QString("ü"))
does not work.
When I iterate through all characters of the string in a loop, the 'ü' are two characters.
Can anybody help me?
QStrings are UTF-16.
QString stores a string of 16-bit QChars, where each QChar corresponds one Unicode 4.0 character. (Unicode characters with code values above 65535 are stored using surrogate pairs, i.e., two consecutive QChars.)
So try
//if ü is utf-16, see your fileencoding to know this
s.replace("ü", "ü")
//if ü if you are inputting it from an editor in latin1 mode
s.replace(QString::fromLatin1("ü"), "ü");
s.replace(QString::fromUtf8("ü"), "ü"); //there are a bunch of others, just make sure to select the correct one
There are two different representations of ü in Unicode:
The single point 00FC (LATIN SMALL LETTER U WITH DIAERESIS)
The sequence 0075 (LATIN SMALL LETTER U) 0308 (COMBINING DIAERESIS)
You should check for both.

How can I change extended latin characters to their unaccented ASCII equivalents?

I need a generic transliteration or substitution regex that will map extended latin characters to similar looking ASCII characters, and all other extended characters to '' (empty string) so that...
é becomes e
ê becomes e
á becomes a
ç becomes c
Ď becomes D
and so on, but things like ‡ or Ω or ‰ just get striped away.
Use Unicode::Normalize to get the NFD($str). In this form all the characters with diacritics will be turned into a base character followed by a combining diacritic character. Then simply remove all the non-ASCII characters.
Maybe a CPAN module might be of help?
Text::Unidecode looks promising, though it does not strip ‡ or Ω or ‰. Rather these are replaced by ++, O and %o. This might or might not be what you want.
Text::Unaccent, is another candidate but only for the part of getting rid of the accents.
Text::Unaccent or alternatively Text::Unaccent::PurePerl sounds like what you're asking for, at least the first half of it.
$unaccented = unac_string($charset, $string);
Removing all non-ASCII characters would be a relatively simple.
s/[^\000-\177]+//g;
All brilliant answers. But none actually really worked. Putting extended characters directly in the source-code caused problems when working in terminal windows or various code/text editors across platforms. I was able to try out Unicode::Normalize, Text::Unidecode and Text::Unaccent, but wan't able to get any of them to do exactly what I want.
In the end I just enumerated all the characters I wanted transliterated myself for UTF-8 (which is most frequent code page found in my input data).
I needed two extra substitutions to take care of æ and Æ which I want mapping to two characters
For interested parties the final code is: (the tr is a single line)
$word =~ tr/\xC0\xC1\xC2\xC3\xC4\xC5\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF
\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD8\xD9\xDA\xDB\xDC\xDD\xE0\xE1\xE2\xE3\xE4
\xE5\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF8
\xF9\xFA\xFB\xFC\xFD\xFF/AAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaceeeeiiiionoo
oooouuuuyy/;
$word =~ s/\xC6/AE/g;
$word =~ s/\xE6/ae/g;
$word =~ s/[^\x00-\x7F]+//g;
Since things like Ď are not part of UTF-8, they don't occur nearly so often in my input data. For non-UTF-8 input, I chose to just loose everything above 127.
When I would like translate some string, not only chars, I'm using this approach:
my %trans = (
'é' => 'e',
'ê' => 'e',
'á' => 'a',
'ç' => 'c',
'Ď' => 'D',
map +($_=>''), qw(‡ Ω ‰)
};
my $re = qr/${ \(join'|', map quotemeta, keys %trans)}/;
s/($re)/$trans{$1}/ge;
If you want some more complicated you can use functions instead string constants. With this approach you can do anything what you want. But for your case tr should be more effective:
tr/éêáçĎ/eeacD/;
tr/‡Ω‰//d;