MultiByteToWideChar Byte Boundary - C++

I am attempting to use MultiByteToWideChar to convert text from any encoding supported by that function into another encoding such as UTF-8.
The issue is that when the buffer ends in the middle of a multibyte character, MultiByteToWideChar just reports an error and gives NO indication of which character it failed at.
Take this:
tes字hello
and say it's UTF-8. I want to convert it into UTF-16.
Now, in my situation, I read, say, 4 bytes. Then I call MultiByteToWideChar on those 4 bytes.
Well, the Asian character is split across the buffer boundary.
Now MultiByteToWideChar will fail, and will NOT tell me which byte it failed at so that I could readjust.
I read 4 bytes, or bufferSize bytes, because I have streaming data.
I have used iconv for encoding conversion, but it's MUCH too slow.
I have also used ICU, and it's fast, but even completely trimmed it is STILL 6.5 MB in size, which is too big.
Is there another solution that is also fast but small and supports a wide range of encodings?
I have also tried CharNextExA and similar functions, but they don't work with other encodings.
The function's return value is a count of characters, so I do not know how many bytes have actually been converted; multibyte characters can vary in length.
I need the number of bytes converted because then I can copy over those bytes into the next buffer for reuse.
What I'm trying to do is read a very large file in chunks and convert that file's encoding, which varies, into UTF-8.
NOTE:
I'm curious, how does ICU4C work? Basically, I copy the source files over, but out of the box it only supports encodings like UTF-8, not Big5. To add Big5, I have to create a 5 MB .data file which I then feed to ICU4C, and then Big5 is available. The thing is, I don't think the .data file is code, because a file built for x64 works perfectly fine for x86 as well. Is there a way to avoid that 5 MB?

You could use the return value from MultiByteToWideChar as the input length to WideCharToMultiByte, and the length it returns would then tell you how many multibyte bytes were actually converted. Most of the time, if I need this level of detail, I simply suck it up, use ICU, and ignore the resulting size.
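A minimal sketch of that round-trip idea on Windows, assuming UTF-8 source data (the function name and error handling here are illustrative, not from any library):

#include <windows.h>
#include <vector>

// Converts 'srcBytes' bytes of UTF-8 into 'outWide' and returns how many source
// bytes the converted text corresponds to, or -1 on failure.
int ConvertAndMeasure(const char* src, int srcBytes, std::vector<wchar_t>& outWide)
{
    // First pass: ask how many wide characters the input yields.
    int wideLen = MultiByteToWideChar(CP_UTF8, 0, src, srcBytes, nullptr, 0);
    if (wideLen == 0)
        return -1;

    outWide.resize(wideLen);
    MultiByteToWideChar(CP_UTF8, 0, src, srcBytes, outWide.data(), wideLen);

    // Round trip: converting the wide result back reports a byte count, which is
    // how many multibyte bytes those wide characters represent.
    return WideCharToMultiByte(CP_UTF8, 0, outWide.data(), wideLen,
                               nullptr, 0, nullptr, nullptr);
}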

I don't think there is a one-function solution.
Without using a 3rd-party library you might be stuck with something like this:
Read a byte into a buffer.
If IsDBCSLeadByteEx is true, append the next byte to the buffer.
Call MultiByteToWideChar. If this fails the trailing byte (if any) was incorrect.
Note that IsDBCSLeadByteEx does not support Unicode so when the code page is UTF-8 you need to do your own length handling until your buffer contains one complete code point.
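A rough sketch of that per-character loop for a DBCS code page such as Big5 (CP 950); the function name and buffer handling are illustrative:

#include <windows.h>

// Converts one character from 'src' and returns how many source bytes it used,
// or 0 if more data is needed or the byte sequence is invalid.
int ConvertOneDbcsChar(UINT codePage, const char* src, int srcBytesAvailable,
                       wchar_t* out, int outLen)
{
    if (srcBytesAvailable <= 0)
        return 0;

    // A lead byte means the character continues into the following byte.
    int charBytes = IsDBCSLeadByteEx(codePage, (BYTE)src[0]) ? 2 : 1;
    if (charBytes > srcBytesAvailable)
        return 0; // the caller must read more data first

    int converted = MultiByteToWideChar(codePage, MB_ERR_INVALID_CHARS,
                                        src, charBytes, out, outLen);
    return converted > 0 ? charBytes : 0;
}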

Related

Read file with unknown character type

I need to read text from a file that could contain any type of character (char, char8_t, wchar_t, etc). How can I determine which type of character is used and create an instance of basic_ifstream<char_type> depending on that type?
So I guess you want to auto-detect the encoding of an unknown text file.
This is impossible to do in a 100% reliable way. However, my experience shows that you can achieve very high reliability (> 99.99%) in most practical situations. The bigger the file, the more reliable the guess: a few tens of bytes are usually already enough to be confident in it.
A valid Unicode code point is a value from U+0001 to U+10FFFF inclusive, excluding the surrogate range U+D800 to U+DFFF. Code point U+0000 is actually valid, but excluding it greatly decreases the number of false positives (NUL bytes should never appear in any practical text file). For an even better guess, we can exclude a few more very rare control characters.
Here is the algorithm I would propose:
If the file begins with a valid BOM (UTF-8, UTF-16BE/LE, UTF-32BE/LE), trust that BOM.
If the file contains only ASCII characters (non null bytes < 128), treat it as ASCII (use char).
If the file is valid UTF-8, assume it is UTF-8 (use char8_t, but char will also work). Note that ASCII is a subset of UTF-8, so the previous check could be skipped; a validity-check sketch follows this list.
If the file is valid UTF-32 (check both little and big endian versions), then assume UTF-32 (char32_t, possibly also wchar_t on Linux or macOS). Swap the bytes if needed.
If the file is valid UTF-16 (check both little and big endian versions), including restrictions on surrogate pairs, and there is a higher correlation between even or odd bytes than between all bytes together, assume UTF-16 (char16_t, possibly also wchar_t on Windows). Swap the bytes if needed.
Otherwise, the file is probably not in a Unicode encoding and may use an old code page. Good luck auto-detecting which one; the most common by far is ISO-8859-1 (Latin-1), so use char. It may also be raw binary data.
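As an illustration of the UTF-8 step above, here is a minimal validity check (a sketch only; it does not reject overlong encodings, which a production validator should):

#include <cstddef>

// Returns true if the buffer looks like well-formed UTF-8.
bool LooksLikeValidUtf8(const unsigned char* data, std::size_t len)
{
    std::size_t i = 0;
    while (i < len) {
        unsigned char b = data[i];
        std::size_t extra;        // number of continuation bytes expected
        unsigned long cp;         // code point being decoded
        if (b == 0x00)               return false;  // exclude NUL, as discussed above
        if (b < 0x80)                { extra = 0; cp = b; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1Fu; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0Fu; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07u; }
        else                         return false;  // invalid lead byte
        if (i + extra >= len)        return false;  // sequence runs past the buffer
        for (std::size_t k = 1; k <= extra; ++k) {
            if ((data[i + k] & 0xC0) != 0x80)
                return false;                        // not a continuation byte
            cp = (cp << 6) | (data[i + k] & 0x3Fu);
        }
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;                            // out of range or surrogate
        i += extra + 1;
    }
    return true;
}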
It's impossible to know for sure. You have to be told what the character type is. Frequently text files will begin with a Byte-Order-Mark to clue you in, but even that's not entirely foolproof.
You can make reasonable guesses as to the file contents, for example, if you "know" that most of it is ASCII-range text, it should be easy to figure out if the file is full of char or wchar_t characters. Even this relies on assumptions and should not be considered bulletproof.

How to compare/replace non-ASCII chars in array in C++?

I have a large char array which contains Czech diacritical characters (e.g. "á"), coded in UTF-8. I need to replace them with their ASCII equivalents (e.g. "a"), because the program must work on Windows (the Linux console handles these chars perfectly).
I am reading the array char by char and writing the content into a string.
Here is the code I am using; it doesn't work:
int array_size = 50000; //size of file array
char * array = new char[array_size]; //array to store file contents
string ascicontent="";
if ('\u00E1'==array[zacatek]) { //check if char is "á"
    ascicontent +='a'; //write ordinal "a" into string
}
I even tried replacing '\u00E1' with 'á', but that doesn't work either. I am guessing the problem is that these chars are longer than ASCII ones.
How can I declare the non-ASCII char so it can be compared?
Each char is a single byte; however, UTF-8 can use multiple bytes to encode a single character. In particular, U+00E1 is encoded as two bytes: 0xC3 0xA1. So you can't do what you want by comparing a single char.
There are multiple ways that you might be able to tackle your problem:
A) First, try googling for "windows console utf-8" and see if that gives anything which might make things just work without having to alter the characters at all. (I don't know if anything can work for you, I've never tried this.)
B) Convert the data to wide characters (wchar_t) using MultiByteToWideChar or mbstowcs and then google how to use wcout or such to output UTF-16 to the console.
C) Use MultiByteToWideChar to convert the data from UTF-8 to UTF-16. Then use WideCharToMultiByte to convert from UTF-16 to the console's code page, relying on the fact that it can automatically "best fit" common characters (such as "á" to "a").
D) If you really only care about a limited set of characters (such as only the accented characters in the Czech code page), then you could write your own lookup table of UTF-8 byte sequences and your desired replacements; you just need to compare the UTF-8 data by those multiple bytes rather than by individual chars (a small sketch of this follows at the end of this answer). Among various tools out there, I've found this page helpful for seeing how characters are encoded in various ways.
Which of these makes the most sense for your program depends on various factors, such as how easy or hard it might be to keep the Windows-specific pieces from conflicting with the Linux-specific or cross-platform parts.
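A toy version of the lookup-table idea from option D, assuming UTF-8 input; the table is deliberately tiny and the function name is made up for illustration:

#include <string>

std::string StripCzechAccents(const std::string& utf8)
{
    // UTF-8 byte sequences for a handful of characters and their ASCII replacements.
    struct { const char* seq; char replacement; } table[] = {
        { "\xC3\xA1", 'a' },  // á
        { "\xC3\xA9", 'e' },  // é
        { "\xC3\xAD", 'i' },  // í
        { "\xC5\xBE", 'z' },  // ž
    };

    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ) {
        bool replaced = false;
        for (const auto& entry : table) {
            std::size_t n = std::char_traits<char>::length(entry.seq);
            if (utf8.compare(i, n, entry.seq) == 0) {   // compare the whole sequence
                out += entry.replacement;
                i += n;
                replaced = true;
                break;
            }
        }
        if (!replaced)
            out += utf8[i++];   // pass other bytes through unchanged
    }
    return out;
}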
char in C is not Unicode; it is really a byte. It only gets converted to a glyph by the terminal you happen to use. On some Linux systems (like Debian) the terminal defaults to UTF-8, so if your program outputs a sequence of bytes encoded in UTF-8, your terminal will display the proper glyphs. If you know that the array is UTF-8 encoded, you must check for the proper byte sequence.
Edit: take a look at The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Please take a look at this link http://en.wikipedia.org/wiki/Wide_character.
And I believe this code might help you:
std::wstring str(L"cccccááddddddd");
std::replace( str.begin(), str.end(), L'á', L'a');

Create UTF-16 string from char*

So I have a standard C string:
char* name = "Jakub";
And I want to convert it to UTF-16. I figured out that UTF-16 will be twice as long: one character takes two chars.
So I create another string:
char name_utf_16[10]; //"Jakub" is 5 characters
Now, I believe that with ASCII characters only the lower bytes will be used, so for all of them it will be like 74 00 for J, and so on. With that belief, I can write code like this:
void charToUtf16(char* input, char* output, int length) {
    /*Todo: how to check if output is long enough?*/
    for(int i=0; i<length; i+=2) //Step over 2 bytes
    {
        //Lets use little-endian - smallest bytes first
        output[i] = input[i];
        output[i+1] = 0; //We will never have any data for this field
    }
}
But with this process, I ended up with "Jkb". I know of no way to test this properly; I just sent the string to a Minecraft Bukkit server, and this is what it said upon disconnecting:
13:34:19 [INFO] Disconnecting jkb?? [/127.0.0.1:53215]: Outdated server!
Note: I'm aware that Minecraft uses big-endian. The code above is just an example; in fact, I have my conversion implemented in a class.
Before I answer your question, consider this:
This area of programming is full of mantraps. It makes a lot of sense to understand the differences between ASCII, UTF-7/8 and ANSI/'Multi-Byte Character Strings (MBCS)', all of which will look and feel identical to an English-speaking programmer but need very different handling if they are put in front of a European or Asian user.
ASCII: Characters are in the range 32-127, only ever one byte. The clue is in the name: they are great for Americans, but not fit for purpose in the rest of the world.
ANSI/MBCS: This is the reason for 'code pages'. Characters 32-127 are the same as ASCII, but it is possible to have characters in the range 128-255 as well for additional characters, and some of the 128-255 range can be used as a flag to mark that the character continues into a second, third or even fourth byte. To process the string correctly, you need both the string bytes and the correct code page. If you try processing the string using the wrong code page you will not get the right characters, and you will misinterpret whether a character is a one-, two- or even four-byte character.
UTF-7/8: These are 8-bit-wide encodings of 21-bit Unicode code points. In UTF-7 and UTF-8, Unicode characters can be between one and four bytes long. The advantage that UTF encodings have over ANSI/MBCS is that there is no ambiguity caused by code pages: each glyph in every script has a unique Unicode code point, which means it is not possible to mangle the character sets by interpreting the data on a different computer with different regional settings.
So, to start to answer your question:
You are making the assumption that your char* will only ever point to an ASCII string. That is a really dangerous assumption: users are in control of the data that is typed in, not the programmer, and Windows programs will be storing it as MBCS by default.
Your second assumption is that a UTF-16 encoding will be twice the size of an 8-bit encoding. That is not generally safe: depending on the source encoding, the UTF-16 encoding may be twice the size, may be less than twice the size, and in extreme cases may actually be shorter.
So, what is the safe solution?
The safe option is to implement your application internally as Unicode. On Windows this is a compiler option, and it means your Windows controls all use wchar_t* strings for their data type. On Linux I'm less sure that you can always use Unicode graphics and OS libraries. You must also use the wcslen() family of functions to get the length of strings, etc. When you interact with the outside world, be precise about the character encodings used.
The answer to your question then becomes a different question: what do I do when I receive non-UTF-16 data?
Firstly, be very clear about what assumptions you are making about its format; secondly, accept the fact that the conversion to UTF-16 may sometimes fail.
If you are clear on the source format, you can then choose the appropriate Win32 or standard-library converter, e.g. mbstowcs() from <cstdlib> or MultiByteToWideChar() on Windows, and you should look for evidence that the conversion failed before using the result. However, using either of these approaches safely means you need to understand all of the above.
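For example, a minimal sketch of the "convert, then check for failure" idea on Windows, assuming the input is expected to be UTF-8 (names are illustrative):

#include <windows.h>
#include <string>
#include <stdexcept>

std::wstring Utf8ToUtf16OrThrow(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();

    // MB_ERR_INVALID_CHARS makes the call fail instead of silently substituting.
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), (int)utf8.size(), nullptr, 0);
    if (len == 0)
        throw std::runtime_error("input is not valid UTF-8");

    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), (int)utf8.size(), &wide[0], len);
    return wide;
}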
All other options introduce risk. Use MBCS strings and you will have data mangled by being entered under one code page and processed under a different one. Assume ASCII data, and when you encounter a non-ASCII character your code will break, and you will 'blame' the user for your own shortcomings.
Why do you want to write your own Unicode conversion functionality when there are existing C/C++ functions for this, like mbstowcs(), which is included in <cstdlib>?
If you still want to roll your own, have a look at the Unicode Consortium's open source code, which can be found here:
Convert UTF-16 to UTF-8 under Windows and Linux, in C
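A short example of the mbstowcs() route mentioned above (a sketch: the locale call and buffer sizing are illustrative, and on Windows the resulting wchar_t string is UTF-16):

#include <clocale>
#include <cstdlib>
#include <cstring>
#include <vector>

int main()
{
    std::setlocale(LC_ALL, "");   // interpret the input using the environment's code page

    const char* name = "Jakub";
    std::vector<wchar_t> wide(std::strlen(name) + 1);

    std::size_t n = std::mbstowcs(wide.data(), name, wide.size());
    if (n == (std::size_t)-1)
        return 1;                 // invalid multibyte sequence for the current locale

    // wide.data() now holds the wide-character string (UTF-16 on Windows).
    return 0;
}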
output[i] = input[i];
This assigns only every other byte of the input, because you increment i by 2, so no wonder you end up with "Jkb".
You probably wanted to write:
output[i] = input[i / 2];
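Putting the fix into the question's function (still assuming plain ASCII input and little-endian UTF-16, with length being the output size in bytes):

void charToUtf16(const char* input, char* output, int length)
{
    for (int i = 0; i < length; i += 2) {  // two output bytes per input character
        output[i]     = input[i / 2];      // low byte: the ASCII character
        output[i + 1] = 0;                 // high byte: always zero for ASCII
    }
}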

UTF-8 decoding library

I have to code in an application which uses Unicode (UTF-8) on Windows, MSVC 10. I'm aware that UTF-8 encoded strings use one to four bytes per character. So my question is: is std::string suitable for this? If yes, how do I decode the strings? As far as I understand, std::string is just an array of bytes and it doesn't provide any decoding logic.
How can I know the logical length of the string? How can I extract logical characters from a string? Are there any libraries which help me extract logical characters from the string?
e.g.: If I have the string "olé" in a std::string, I need to know that the length is 3, not 4.
A commonly used library is ICU - International Components for Unicode.
Yes, std::string is appropriate, but as you've noticed it only operates on bytes, not Unicode code points. In that sense, std::string is an opaque type; this isn't necessarily bad (in fact, it does have some advantages, see the links below for information), but it makes it necessary to decode the string if you need information about characters.
For the actual handling of UTF-8 (where necessary), you can use the Boost.NoWide library to decode UTF-8.
Furthermore, I suggest reading the UTF-8 everywhere manifesto for some information about the use of UTF-8 vs. other Unicode transformations.
First you may want to call the mbstowcs() function to transform the UTF-8 characters into wide characters. Then, if you want the result to be 8 bits, you'll have a loss of data in the event you have "Unicode" characters (characters outside the ISO-8859-1 range, also called Latin-1).
Note that the "Windows" encoding is not a 1-to-1 equivalent of ISO-8859-1, but in most cases ISO-8859-1 is what people use these days.
Reference: http://www.cplusplus.com/reference/clibrary/cstdlib/mbstowcs/
Okay, if you just want the length in characters, use the mblen() function:
len = mblen(str.c_str(), str.length());
Additional note: an easy way to implement mblen() is to count the number of bytes that are not between 0x80 and 0xBF, since those are part of a multi-byte sequence. This is particularly useful if you receive a UTF-8 byte sequence over a flaky serial connection.
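A sketch of that counting trick (assuming the input is valid UTF-8; the function name is made up here):

#include <cstddef>
#include <string>

// Every byte outside 0x80-0xBF starts a new character, so counting those bytes
// gives the number of code points in a valid UTF-8 string.
std::size_t Utf8CodePointCount(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s) {
        if (c < 0x80 || c > 0xBF)   // not a continuation byte
            ++count;
    }
    return count;
}

// For example, "ol\xC3\xA9" ("olé" in UTF-8) yields 3, while its .size() is 4.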

Determine if a byte array contains an ANSI or Unicode string?

Say I have a function that receives a byte array:
void fcn(byte* data)
{
...
}
Does anyone know a reliable way for fcn() to determine if data is an ANSI string or a Unicode string?
Note that I'm intentionally NOT passing a length arg, all I receive is the pointer to the array. A length arg would be a great help, but I don't receive it, so I must do without.
This article mentions an OLE API that apparently does it, but of course they don't tell you WHICH API function: http://support.microsoft.com/kb/138142
First, a word on terminology. There is no such thing as an ANSI string; there are ASCII strings, where ASCII is a character encoding. ASCII was developed by ANSI, but the terms are not interchangeable.
Also, there is no such thing as a Unicode string. There are Unicode encodings, but those are only a part of Unicode itself.
I will assume that by "Unicode string" you mean "UTF-8 encoded codepoint sequence." And by ANSI string, I'll assume you mean ASCII.
If so, then every ASCII string is also a UTF-8 string, by the definition of UTF-8's encoding. ASCII only defines characters up to 0x7F, and all UTF-8 code units (bytes) up to 0x7F mean the same thing as they do under ASCII.
Therefore, your concern would be for the other 128 possible values. That is... complicated.
The only reason you would ask this question is if you have no control over the encoding of the string input. And therefore, the problem is that ASCII and UTF-8 are not the only possible choices.
There's Latin-1, for example. There are many strings out there that are encoded in Latin-1, which takes the other 128 bytes that ASCII doesn't use and defines characters for them. That's bad, because those other 128 bytes will conflict with UTF-8's encoding.
There are also code pages. Many strings were encoded against a particular code page; this is particularly so on Windows. Decoding them requires knowing what codepage you're working on.
If you are in a situation where you are certain that a string is either ASCII (7-bit, with the high bit always 0) or UTF-8, then you can make the determination easily. Either the string is ASCII (and therefore also UTF-8), or one or more of the bytes will have the high bit set to 1. In which case, you must use UTF-8 decoding logic.
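That check is simple to write; a tiny sketch (the buffer length is assumed to be known somehow, e.g. from a terminating NUL):

#include <cstddef>

bool IsPureAscii(const unsigned char* data, std::size_t len)
{
    for (std::size_t i = 0; i < len; ++i) {
        if (data[i] & 0x80)
            return false;   // high bit set: not 7-bit ASCII, needs UTF-8 decoding logic
    }
    return true;
}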
Unless you are truly certain that these are the only possibilities, you are going to need to do a bit more. You can validate the data by trying to run it through a UTF-8 decoder; if it runs into an invalid code unit sequence, then you know it isn't UTF-8. The problem is that it is theoretically possible to create a Latin-1 string that is technically valid UTF-8. You're kinda screwed at that point. The same goes for code page-based strings.
Ultimately, if you don't know what encoding the string is, there's no guarantee you can display it properly. That's why it's important to know where your strings come from and what they mean.