C++ encode string to Unicode - ICU library - c++

I need to convert a bunch of bytes in ISO-2022-JP and ISO-2022-JP-2 (and other variations of ISO-2022) into Unicode. I am trying to use ICU, but the following code doesn't work.
std::string input = "\x1B\x28\x4A" "ABC\xA6\xA7"; //the first 3 chars are escape sequence to use JIS_X201 character set in GL/GR
UErrorCode status = U_ZERO_ERROR;
UConverter *conv;
// set up the converter
conv = ucnv_open("ISO-2022-JP", &status);
if (status != U_ZERO_ERROR) return false; //couldn't find character set
UChar * convDest = new UChar[2*input.length()]; //ucnv_toUChars will use up to 2*length
// convert to Unicode
int resultLen = (int)ucnv_toUChars(conv, convDest, 2*input.length(), input.c_str(), input.length(), &status);
This doesn't work. The result contains '?' characters for anything I put in that was above ASCII. The status has no error. What am I doing wrong?
On top of that, I was having trouble compiling version 4.4 of the library, as the MSVC 9 project would not convert to an MSVC 10 project.
I am also aware of the open-source libiconv library. I couldn't compile that one on Windows. If anyone has any advice on a different library, that's also welcome.
Thanks.
EDIT
The escape sequence I originally used was wrong. So now ICU takes the string, strips out the escape sequence - which is a step in the right direction. But the result still contains '?' chars.
EDIT2 The reason I couldn't convert to an MSVC 10 project was that the x64 platform wasn't installed (it isn't by default). Alternatively, I could open all the projects in a text editor and remove all mention of the x64 target.

This doesn't resemble an ISO 2022 encoding. The high bits are supposed to be zero. The escape sequence looks somewhat recognizable, but it should start with ESC, 0x1b, not 0xb0. No idea what those byte values really mean.

(This question looks familiar, Hi again.)
A minor, minor nit: You want to check the error status with if(U_FAILURE(status)) (or conversely, U_SUCCESS(status)).
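For reference, a minimal sketch of how that call sequence might look with U_FAILURE() checks and cleanup; the converter name comes from the question, while the helper name, buffer handling, and output type are my own assumptions. This is only about the call structure, not about the JIS X 0201 mapping issue discussed further down.
#include <unicode/ucnv.h>
#include <string>
#include <vector>

// Hedged sketch: convert a byte string to UTF-16 code units with ICU,
// checking the status with U_FAILURE() and releasing the converter.
bool toUnicode(const std::string &input, std::vector<UChar> &out)
{
    UErrorCode status = U_ZERO_ERROR;
    UConverter *conv = ucnv_open("ISO-2022-JP", &status);
    if (U_FAILURE(status)) return false;              // converter not found
    out.resize(2 * input.length() + 1);               // worst-case capacity
    int32_t len = ucnv_toUChars(conv, out.data(), (int32_t)out.size(),
                                input.c_str(), (int32_t)input.length(),
                                &status);
    ucnv_close(conv);                                 // always release the converter
    if (U_FAILURE(status)) return false;
    out.resize(len);
    return true;
}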

I couldn't get the conversion to work for the JIS_X201 character set in the ISO-2022-JP encoding. And I couldn't generate a "valid" stream using any tools at my disposal - I tried Java (both ICU and non-ICU implementations of ISO-2022) and C++.
So I basically just wrote a function to do a code lookup and convert to Unicode using this table: wikipedia.
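If it helps anyone, the JIS X 0201 right-half (GR) range is a straight linear mapping, so the lookup barely needs a table. A rough sketch of such a helper (the function name is mine; it assumes the bytes have already been identified as JIS X 0201):
#include <cstdint>

// Hedged sketch of a JIS X 0201 -> Unicode code point lookup.
// GL (0x21-0x7E) is ASCII apart from two substitutions; GR (0xA1-0xDF)
// is the half-width katakana block, which maps linearly to U+FF61-U+FF9F.
uint32_t jisX0201ToUnicode(uint8_t b)
{
    if (b == 0x5C) return 0x00A5;                   // YEN SIGN instead of backslash
    if (b == 0x7E) return 0x203E;                   // OVERLINE instead of tilde
    if (b <= 0x7F) return b;                        // plain ASCII otherwise
    if (b >= 0xA1 && b <= 0xDF) return b + 0xFEC0;  // 0xA1 -> U+FF61 ... 0xDF -> U+FF9F
    return 0xFFFD;                                  // REPLACEMENT CHARACTER for invalid bytes
}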
EDIT
As I started filling out the bug report, I wanted to include the RFC for ISO-2022-JP. Then I found this line in the RFC: "The Kana set of JIS X 0201 is not used in ISO-2022-JP messages." So it appears the standard doesn't actually define the upper bits. ISO-2022-JP-3 WILL map the kana, but only from the lower plane. So I would have to take each byte >= 128, subtract 0x80 from it, and pass it through an ISO-2022-JP-3 converter, and take the other bytes < 128 and pass them through an ISO-2022-JP converter, to cover the full JIS_X201 character set. Well, it's a lot easier to just do it myself.
So strictly speaking I would say it's not a bug. It's a huge headache though.
P.S. The whole messed-up stream that I'm trying to decode comes from DICOM. See page 107 of the PDF to see what they consider acceptable.

Related

JFlex String Regex Strange Behaviour

I am trying to write a JSON string parser in JFlex, so far I have
string = \"((\\(\"|\\|\/|b|f|n|r|t|u[0-9a-fA-F]{4})) | [^\"\\])*\"
which I thought captured the specs (http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf).
I have tested it on the control characters and standard characters and symbols, but for some reason it does not accept £ or ( or ) or ¬. Please can someone let me know what is causing this behaviour?
Perhaps you are running in JLex compatibility mode? If so, please see the following from the official JFlex User's Manual. It seems that it will use 7-bit character codes for input by default, whereas what you want is 16-bit (Unicode).
You can fix this by adding the line %unicode after the first %%.
Input Character sets
%7bit
Causes the generated scanner to use a 7-bit input character set (character codes 0-127). If an input character with a code greater than 127 is encountered in an input at runtime, the scanner will throw an ArrayIndexOutOfBoundsException. Not only because of this, you should consider using the %unicode directive. See also Encodings for information about character encodings. This is the default in JLex compatibility mode.
%full
%8bit
Both options cause the generated scanner to use an 8-bit input character set (character codes 0-255). If an input character with a code greater than 255 is encountered in an input at runtime, the scanner will throw an ArrayIndexOutOfBoundsException. Note that even if your platform uses only one byte per character, the Unicode value of a character may still be greater than 255. If you are scanning text files, you should consider using the %unicode directive. See also section Encodings for more information about character encodings.
%unicode
%16bit
Both options cause the generated scanner to use the full Unicode input character set, including supplementary code points: 0-0x10FFFF. %unicode does not mean that the scanner will read two bytes at a time. What is read and what constitutes a character depends on the runtime platform. See also section Encodings for more information about character encodings. This is the default unless the JLex compatibility mode is used (command line option --jlex).

Encode gives wrong value of Japanese kanji

As part of a scraper, I need to encode kanji into URLs, but I just can't seem to get the correct output even for a single character, and I'm currently blinded by everything I've tried so far from various Stack Overflow posts.
The document is set to UTF-8.
import urllib2

sampleText=u'ル'
print sampleText
print sampleText.encode('utf-8')
print urllib2.quote(sampleText.encode('utf-8'))
It gives me the values:
ル
ル
%E3%83%AB
But as far as I understand, it should give me:
ル
XX
%83%8B
What am I doing wrong? Are there some settings I don't have correct? Because as far as I understand it, my output from the encode() should not be ル.
The code you show works correctly. The character ル is KATAKANA LETTER RU, and is Unicode codepoint U+30EB. When encoded to UTF-8, you'll get the Python bytestring '\xe3\x83\xab', which prints out as ル if your console encoding is Latin-1. When you URL-escape those three bytes, you get %E3%83%AB.
The value you seem to be expecting, %83%8B is the Shift-JIS encoding of ル, rather than UTF-8 encoding. For a long time there was no standard for how to encode non-ASCII text in a URL, and as this Wikipedia section notes, many programs simply assumed a particular encoding (often without specifying it). The newer standard of Internationalized Resource Identifiers (IRIs) however says that you should always convert Unicode text to UTF-8 bytes before performing percent encoding.
So, if you're generating your encoded string for a new program that wants to meet the current standards, stick with the UTF-8 value you're getting now. I would only use the Shift-JIS version if you need it for backwards compatibility with specific old websites or other software that expects that the data you send will have that encoding. If you have any influence over the server (or other program), see if you can update it to use IRIs too!

how to display extended ascii character in QTextEdit

All the ASCII codes greater than 127 are replaced by a diamond question-mark symbol (the replacement character). How can I display those characters? I have an unsigned char buffer[1024] which contains values from 0 to 255.
Use the QString class's fromAscii() method. By default this will treat byte values above 127 as Latin-1 characters. To change this behavior, use the QTextCodec::setCodecForCStrings() method to set the correct codec for your usage.
I believe Qt 5 may have removed the setCodecForCStrings() method.
EDIT: Adnan supplied the Qt 5 alternative to the setCodecForCStrings() method; adding it to the answer for completeness.
The Qt 5 alternative to setCodecForCStrings is QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
This is a rabbit hole with no end. Qt does not fully support printing ASCII > 127, as it is not well defined. The current method is to use fromLocal8Bit(), which will take a char array and transform it into the "right" Unicode string (the only thing Qt supports printing).
QTextCodec::setCodecForLocale can be used to identify the character set you wish to transform from. Many codecs are supported, but for some reason IBM437 (the character set used by IBM PCs in the US for decades) is not supported, whereas several other codecs used in Europe, etc. are. Probably some characters in IBM437 were never assigned proper code points in Unicode, so transforming it isn't possible?
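As a rough sketch of that approach (the codec name "IBM 850" is only an example of one Qt does ship; substitute whatever 8-bit codec your data actually uses):
#include <QTextCodec>
#include <QTextEdit>

// Hedged sketch: interpret an 8-bit buffer with an explicit codec and hand
// the resulting QString to a QTextEdit. "IBM 850" is only an example name.
void showBuffer(QTextEdit *edit, const char *buffer, int length)
{
    QTextCodec *codec = QTextCodec::codecForName("IBM 850");
    if (codec)
        QTextCodec::setCodecForLocale(codec);       // affects fromLocal8Bit()
    edit->setPlainText(QString::fromLocal8Bit(buffer, length));
}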
What's frustrating is that there are fonts with glyphs for all 256 code points, but it is simply not possible to display these in Qt, as it only works with Unicode strings. There are a handful of glyphs it doesn't support, and the list seems to grow with newer versions of Qt. Currently I know of 9, 10, 12, 13, and 173. Some of these are for obvious reasons (usually you don't want to print a carriage return glyph, though it did exist in DOS), but others used to work in Qt and now do not.
In my application, I resorted to creating a new font that has copies of the unprintable glyphs at higher Unicode code points, and translating them before printing them on the screen. It's quite silly, but Qt gave up on ASCII many years ago, so it's the best option I could find.

what locale does wstring support?

In my program I used wstring to print out the text I needed, but it gave me random gibberish (due to a different encoding scheme). For example, I have this block of code.
wstring text;
text.append(L"Some text");
Then I use DirectX to render it on screen. I used to use wchar_t, but I heard it has portability problems, so I switched to wstring. wchar_t worked fine, but as far as I can tell it only took English characters (the printout just totally ignored any non-English characters entered), which was fine, until I switched to wstring: I only got random gibberish that looked like Chinese and Korean mixed together. Interestingly, my computer's locale for non-Unicode text is Chinese. Based on what I saw, I suspected it would render Chinese characters correctly, so I tried that, and it does display the character correctly but with a square in front (which is still kind of incorrect). I then guessed the encoding might depend on the language locale, so I switched the locale to English (US) (I use Windows 8), restarted, and saw the Chinese test character in my source file turn into random junk (my file is not saved in a Unicode format since all the text is English). Then I tried with English characters, but no luck: the display looked exactly the same and seemed to have nothing to do with the locale. I don't understand why it doesn't display correctly and looks like Asian characters (even when I use the English locale).
Is there some conversion that should be done, or should I save my file in a different encoding format? The problem is I want to display English characters correctly, which is the default.
In the absence of code that demonstrates your problem, I will give you a correspondingly general answer.
You are trying to display English characters, but see Chinese characters. That is what happens when you pass 8-bit ANSI text to an API that receives UTF-16 text. Look for somewhere in your program where you cast from char* to wchar_t*.
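Since this is Windows plus DirectX, here is a hedged sketch of what that cast should be replaced with; CP_ACP (the system ANSI code page) is an assumption, use CP_UTF8 if your narrow strings are UTF-8:
#include <windows.h>
#include <string>

// Hedged sketch: convert an 8-bit ANSI string to UTF-16 instead of casting
// char* to wchar_t*. CP_ACP assumes the system ANSI code page.
std::wstring ansiToWide(const std::string &s)
{
    if (s.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), (int)s.size(),
                                  NULL, 0);          // query required length
    std::wstring out(len, L'\0');
    MultiByteToWideChar(CP_ACP, 0, s.c_str(), (int)s.size(), &out[0], len);
    return out;
}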
First of all, what type of file are you trying to store the text in? Normal txt files are stored in ANSI by default (so does Excel). So when you try to print a Unicode character to an ANSI file, it will print junk. Two ways of overcoming this problem are:
1. try to open the file in UTF-8 or UTF-16 mode and then write
2. convert Unicode to ANSI before writing to the file. If you are using Windows, then MSDN documents particular APIs to do Unicode-to-ANSI conversion and vice versa. If you are using Linux, then Google for conversion of Unicode to ANSI. There are a lot of solutions out there.
Hope this helps!!!
std::wstring does not have any locale/internationalisation support at all. It is just a container for storing sequences of wchar_t.
The problem with wchar_t is that its encoding is unspecified. It might be Unicode UTF-16, or Unicode UTF-32, or Shift-JIS, or something completely different. There is no way to tell from within a program.
You will have the best chances of getting things to work if you ensure that the encoding of your source code is the same as the encoding used by the locale under which the program will run.
But, the use of third-party libraries (like DirectX) can place additional constraints due to possible limitations in what encodings those libraries expect and support.
Bug solved: it turned out to be a CASTING problem (not a rendering problem as previously said).
The garbled text was an intermediate product of some internal conversion process using wstringstream (which I forgot to mention); the code is as follows:
wstringstream wss;
wstring text;
text.append(L"some text");
wss << timer->getTime();
text.append(wss.str());
Right after this process, the debugger shows the text as a bunch of random stuff, but later it somehow converts back so it's readable. The problem appears at the rendering stage using DirectX: I had somehow left in a cast to LPCWSTR, which resulted in the incorrect rendering.
old:
LPCWSTR lpcwstrText = (LPCWSTR)textToDraw->getText();
new:
LPCWSTR lpcwstrText = (*textToDraw->getText()).c_str();
Changing that solved the problem.
So, this was caused by a bad cast, as some kind people pointed out when correcting my statement.

How can I recognize RTL strings in C++

I need to know the direction of my text before printing.
I'm using Unicode Characters.
How can I do that in C++?
If you don't want to use ICU, you can always manually parse the Unicode character database (e.g., with a Python script). It's a semicolon-separated text file, with each line representing a character code point. Look for the fifth field in each line - that's the character's BiDi class. If it's R or AL, you have an RTL character, and L is an LTR character. Other classes are weak or neutral types (like numerals), which I guess you'd want to ignore. Using that info, you can generate a lookup table of all RTL characters and then use it in your C++ code. If you really care about code size, you can minimize the size the lookup table takes in your code by using ranges (instead of an entry for each character), since most characters come in blocks of the same BiDi class.
Now, define a function called GetCharDirection(wchar_t ch) which returns an enum value (say: Dir_LTR, Dir_RTL or Dir_Neutral) by checking the lookup table.
Now you can define a function GetStringDirection(const wchar_t*) which runs through all characters in the string until it encounters a character which is not Dir_Neutral. This first non-neutral character in the string should set the base direction for that string. Or at least that's how ICU seems to work.
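A skeleton of those two functions might look like this (the ranges below are only crude placeholders; the real data should come from the lookup table you generated):
enum CharDirection { Dir_LTR, Dir_RTL, Dir_Neutral };

// Hedged sketch: the ranges below are placeholders; the real data should
// come from the table generated out of the Unicode database
// (BiDi classes R and AL -> Dir_RTL, L -> Dir_LTR, everything else neutral).
CharDirection GetCharDirection(wchar_t ch)
{
    if ((ch >= 0x0590 && ch <= 0x05FF) ||           // Hebrew block (placeholder range)
        (ch >= 0x0600 && ch <= 0x06FF))             // Arabic block (placeholder range)
        return Dir_RTL;
    if ((ch >= L'A' && ch <= L'Z') || (ch >= L'a' && ch <= L'z'))
        return Dir_LTR;                             // placeholder for the L class
    return Dir_Neutral;
}

// The first non-neutral character sets the base direction of the string.
CharDirection GetStringDirection(const wchar_t *s)
{
    for (; s && *s; ++s) {
        CharDirection d = GetCharDirection(*s);
        if (d != Dir_Neutral) return d;
    }
    return Dir_Neutral;
}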
You could use the ICU library, which has functions for that (ubidi_getDirection, ubidi_getBaseDirection).
The size of ICU can be reduced by recompiling the data library (which is normally about 15 MB) to include only the converters/locales which are needed for the project.
The section "Reducing the Size of ICU's Data: Conversion Tables" of the site http://userguide.icu-project.org/icudata contains information on how you can reduce the size of the data library.
If you only need support for the most common encodings (US-ASCII, ISO-8859-1, UTF-7/8/16/32, SCSU, BOCU-1, CESU-8), the data library won't be needed anyway.
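If you do link against ICU, a minimal sketch of the ubidi_getBaseDirection call mentioned above (available since ICU 4.6; it works on a UTF-16 buffer and reports the direction of the first strong character):
#include <unicode/ubidi.h>
#include <unicode/unistr.h>

// Hedged sketch: base direction of a UTF-16 string from its first strong
// directional character (ubidi_getBaseDirection, ICU 4.6 and later).
bool isRtl(const icu::UnicodeString &text)
{
    return ubidi_getBaseDirection(text.getBuffer(), text.length()) == UBIDI_RTL;
}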
Following what Boaz Yaniv said before, maybe something like this will be easier and faster than parsing the whole file:
int aft_isrtl(int c){
if (
(c==0x05BE)||(c==0x05C0)||(c==0x05C3)||(c==0x05C6)||
((c>=0x05D0)&&(c<=0x05F4))||
(c==0x0608)||(c==0x060B)||(c==0x060D)||
((c>=0x061B)&&(c<=0x064A))||
((c>=0x066D)&&(c<=0x066F))||
((c>=0x0671)&&(c<=0x06D5))||
((c>=0x06E5)&&(c<=0x06E6))||
((c>=0x06EE)&&(c<=0x06EF))||
((c>=0x06FA)&&(c<=0x0710))||
((c>=0x0712)&&(c<=0x072F))||
((c>=0x074D)&&(c<=0x07A5))||
((c>=0x07B1)&&(c<=0x07EA))||
((c>=0x07F4)&&(c<=0x07F5))||
((c>=0x07FA)&&(c<=0x0815))||
(c==0x081A)||(c==0x0824)||(c==0x0828)||
((c>=0x0830)&&(c<=0x0858))||
((c>=0x085E)&&(c<=0x08AC))||
(c==0x200F)||(c==0xFB1D)||
((c>=0xFB1F)&&(c<=0xFB28))||
((c>=0xFB2A)&&(c<=0xFD3D))||
((c>=0xFD50)&&(c<=0xFDFC))||
((c>=0xFE70)&&(c<=0xFEFC))||
((c>=0x10800)&&(c<=0x1091B))||
((c>=0x10920)&&(c<=0x10A00))||
((c>=0x10A10)&&(c<=0x10A33))||
((c>=0x10A40)&&(c<=0x10B35))||
((c>=0x10B40)&&(c<=0x10C48))||
((c>=0x1EE00)&&(c<=0x1EEBB))
) return 1;
return 0;
}
If you are using Windows GDI, it would seem that GetFontLanguageInfo(HDC) returns a DWORD; if GCP_REORDER is set, the language requires reordering for display, for example, Hebrew or Arabic.
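A brief sketch of that check (GetFontLanguageInfo and GCP_REORDER are standard Win32 GDI; the wrapper function is my own):
#include <windows.h>

// Hedged sketch: ask GDI whether the script selected into the device
// context needs reordering for display (e.g. Hebrew or Arabic).
bool fontNeedsReordering(HDC hdc)
{
    DWORD info = GetFontLanguageInfo(hdc);
    return (info & GCP_REORDER) != 0;
}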