trouble with std::codecvt_utf8 facet - c++

Here is a snippet of a code that is using std::codecvt_utf8<> facet to convert from wchar_t to UTF-8. With Visual Studio 2012, my expectations are not met (see the condition at the end of the code). Are my expectations wrong? Why? Or is this a Visual Studio 2012 library issue?
#include <locale>
#include <codecvt>
#include <cstdlib>
int main ()
{
std::mbstate_t state = std::mbstate_t ();
std::locale loc (std::locale (), new std::codecvt_utf8<wchar_t>);
typedef std::codecvt<wchar_t, char, std::mbstate_t> codecvt_type;
codecvt_type const & cvt = std::use_facet<codecvt_type> (loc);
wchar_t ch = L'\u5FC3';
wchar_t const * from_first = &ch;
wchar_t const * from_mid = &ch;
wchar_t const * from_end = from_first + 1;
char out_buf[1];
char * out_first = out_buf;
char * out_mid = out_buf;
char * out_end = out_buf + 1;
std::codecvt_base::result cvt_res
= cvt.out (state, from_first, from_end, from_mid,
out_first, out_end, out_mid);
// This is what I expect:
if (cvt_res == std::codecvt_base::partial
&& out_mid == out_end
&& state != 0)
;
else
abort ();
}
The expectation here is that the out() function output one byte of the UTF-8 conversion at a time but the middle of the if conditional above is false with Visual Studio 2012.
UPDATE
What fails is the out_mid == out_end and state != 0 conditions. Basically, I expect at least one byte to be produced and the necessary state, for next byte of the UTF-8 sequence to be producible, to be stored in the state variable.

The standard description of partial return code of codecvt::do_out says exactly this:
in Table 83:
partial not all source characters converted
In 22.4.1.4.2[locale.codecvt.virtuals]/5:
Returns: An enumeration value, as summarized in Table 83. A return value of partial, if (from_next==from_end), indicates that either the destination sequence
has not absorbed all the available destination elements, or that additional source elements are needed before another destination element can be produced.
In your case, not all (zero) source characters were converted, which technically says nothing of the contents of the output sequence (the 'if' clause in the sentence is not entered), but speaking generally, "the destination sequence has not absorbed all the available destination elements" here talks about valid multibyte characters. They are the elements of the multibyte character sequence produced by codecvt_utf8.
It would be nice to have a more explicit standard wording, but here are two circumstantial pieces of evidence:
One: the old C's wide-to-multibyte conversion function std::wcsrtombs (whose locale-specific variants are usually called by the existing implementations of codecvt::do_out for system-supplied locales) is defined as follows:
Conversion stops [...] when the next multibyte character would exceed the limit of len total bytes to be stored into the array pointed to by dst.
And two, look at the existing implementations of codecvt_utf8: you've already explored Microsoft's, and here's what's in libc++: codecvt_utf8::do_out here calls ucs2_to_utf8 on Windows and ucs4_to_utf8 on other systems, and ucs2_to_utf8 does the following (comments mine):
else if (wc < 0x0800)
{
// not relevant
}
else // if (wc <= 0xFFFF)
{
if (to_end-to_nxt < 3)
return codecvt_base::partial; // <- look here
*to_nxt++ = static_cast<uint8_t>(0xE0 | (wc >> 12));
*to_nxt++ = static_cast<uint8_t>(0x80 | ((wc & 0x0FC0) >> 6));
*to_nxt++ = static_cast<uint8_t>(0x80 | (wc & 0x003F));
}
nothing is written to the output sequence if it cannot fit a multibyte character that results from consuming one input wide character.

Although there is no direct reference of it, I'd think that is most logical behavior of std::codecvt::out. Consider following scenario:
You use std::codecvt::out in the same manner as you did - not translating any characters (possibly without knowing) into your out_buf.
You now want to translate another string into your out_buf (again using std::codecvt::out) such that it appends the content which is already inside
To do so, you decide to use your buf_mid as you know it points directly after your string that you translated in the first step.
Now, if std::codecvt::out worked according to your expectations (buf_mid pointing to the character after first) then the first character of your out_buf would never be written which would not be exactly what you would want/expect in this case.
In essence, extern_type*& to_next (last parameter of std::codecvt::out) is here for you as a reference of where you left of - so you know where to continue - which is in your case indeed the same position as where you started (extern_type* to) parameter.
cppreferece.com on std::codecvt::out
cpulusplus.com on std::codecvt::out

Related

How can I "convert" ISO-8859-7 strings to UTF-8 in C++?

I'm working with 10+ years old machines which use ISO 8859-7 to represent Greek characters using a single byte each.
I need to catch those characters and convert them to UTF-8 in order to inject them in a JSON to be sent via HTTPS.
Also, I'm using GCC v4.4.7 and I don't feel like upgrading so I can't use codeconv or such.
Example: "OΛΑ":
I get char values [ 0xcf, 0xcb, 0xc1, ], I need to write this string "\u039F\u039B\u0391".
PS: I'm not a charset expert so please avoid philosophical answers like "ISO 8859 is a subset of Unicode so you just need to implement the algorithm".
Given that there are so few values to map, a simple solution is to use a lookup table.
Pseudocode:
id_offset = 0x80 // 0x00 .. 0x7F same in UTF-8
c1_offset = 0x20 // 0x80 .. 0x9F control characters
table_offset = id_offset + c1_offset
table = [
u8"\u00A0", // 0xA0
u8"‘", // 0xA1
u8"’",
u8"£",
u8"€",
u8"₯",
// ... Refer to ISO 8859-7 for full list of characters.
]
let S be the input string
let O be an empty output string
for each char C in S
reinterpret C as unsigned char U
if U less than id_offset // same in both encodings
append C to O
else if U less than table_offset // control code
append char '\xC2' to O // lead byte
append char C to O
else
append string table[U - table_offset] to O
All that said, I recommend to save some time by using a library instead.
One way could be to use the Posix libiconv library. On Linux, the functions needed (iconv_open, iconv and iconv_close) are even included in libc so no extra linkage is needed there. On your old machines you may need to install libiconv but I doubt it.
Converting may be as simple as this:
#include <iconv.h>
#include <cerrno>
#include <cstring>
#include <iostream>
#include <iterator>
#include <stdexcept>
#include <string>
// A wrapper for the iconv functions
class Conv {
public:
// Open a conversion descriptor for the two selected character sets
Conv(const char* to, const char* from) : cd(iconv_open(to, from)) {
if(cd == reinterpret_cast<iconv_t>(-1))
throw std::runtime_error(std::strerror(errno));
}
Conv(const Conv&) = delete;
~Conv() { iconv_close(cd); }
// the actual conversion function
std::string convert(const std::string& in) {
const char* inbuf = in.c_str();
size_t inbytesleft = in.size();
// make the "out" buffer big to fit whatever we throw at it and set pointers
std::string out(inbytesleft * 6, '\0');
char* outbuf = out.data();
size_t outbytesleft = out.size();
// the const_cast shouldn't be needed but my "iconv" function declares it
// "char**" not "const char**"
size_t non_rev_converted = iconv(cd, const_cast<char**>(&inbuf),
&inbytesleft, &outbuf, &outbytesleft);
if(non_rev_converted == static_cast<size_t>(-1)) {
// here you can add misc handling like replacing erroneous chars
// and continue converting etc.
// I'll just throw...
throw std::runtime_error(std::strerror(errno));
}
// shrink to keep only what we converted
out.resize(outbuf - out.data());
return out;
}
private:
iconv_t cd;
};
int main() {
Conv cvt("UTF-8", "ISO-8859-7");
// create a string from the ISO-8859-7 data
unsigned char data[]{0xcf, 0xcb, 0xc1};
std::string iso88597_str(std::begin(data), std::end(data));
auto utf8 = cvt.convert(iso88597_str);
std::cout << utf8 << '\n';
}
Output (in UTF-8):
ΟΛΑ
Using this you can create a mapping table, from ISO-8859-7 to UTF-8, that you include in your project instead of iconv:
Demo
Ok I decided to do this myself instead of looking for a compatible library. Here's how I did.
The main problem was figuring out how to fill the two bytes for Unicode using the single one for ISO, so I used the debugger to read the value for the same character, first written by the old machine and then written with a constant string (UTF-8 by default). I started with "O" and "Π" and saw that in UTF-8 the first byte was always 0xCE while the second one was filled with the ISO value plus an offset (-0x30). I built the following code to implement this and used a test string filled with all greek letters, both upper and lower case. Then I realised that starting from "π" (0xF0 in ISO) both the first byte and the offset for the second one changed, so I added a test to figure out which of the two rules to apply. The following method returns a bool to let the caller know whether the original string contained ISO characters (useful for other purposes) and overwrites the original string, passed as reference, with the new one. I worked with char arrays instead of strings for coherence with the rest of the project which is basically a C project written in C++.
bool iso_to_utf8(char* in){
bool wasISO=false;
if(in == NULL)
return wasISO;
// count chars
int i=strlen(in);
if(!i)
return wasISO;
// create and size new buffer
char *out = new char[2*i];
// fill with 0's, useful for watching the string as it gets built
memset(out, 0, 2*i);
// ready to start from head of old buffer
i=0;
// index for new buffer
int j=0;
// for each char in old buffer
while(in[i]!='\0'){
if(in[i] >= 0){
// it's already utf8-compliant, take it as it is
out[j++] = in[i];
}else{
// it's ISO
wasISO=true;
// get plain value
int val = in[i] & 0xFF;
// first byte to CF or CE
out[j++]= val > 0xEF ? 0xCF : 0xCE;
// second char to plain value normalized
out[j++] = val - (val > 0xEF ? 0x70 : 0x30);
}
i++;
}
// add string terminator
out[j]='\0';
// paste into old char array
strcpy(in, out);
return wasISO;
}

SetWindowText with a single dimensional array

Is it possible to display a single dimensional array of values using SetWindowsText() in a text box on windows api?
for example. SetWindowText(hwndStatic3, sArray);
******************EDIT************
I have a textbox on the windows api where I use GetWindowText() to retrieve the string written in the text box then I convert the string to decimal array. I then convert this decimal array value to hexadecimal value as I am trying to print those values using SetwindowsText within another textbox. However only the last value of the array is printing. How can I print all the values?
******************EDIT************
code:
GetWindowText(hwndtext1, value, 256);
for (i = 15; i >= 0; i--)
{
temp[i] = atoll(value); //converts sting to decimal
ulltoa(temp[i] , sArray, 16); //converts decimal to hexadecimal
buf[i] = temp[i];
}
SetWindowText(hwndStatic3, sArray);
SetWindowText is just a macro with signature:
BOOL SetWindowText(HWND, const TCHAR*);
Depending on your build settings, it will call one of the following:
BOOL SetWindowTextA(HWND, const char*); //ansi version
BOOL SetWindowTextW(HWND, const wchar_t*); //unicode version
where TCHAR is defined as:
#ifdef _UNICODE
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif
So, an array of strings is not compatible with SetWindowText but an array of characters will work, provided that the array is of type TCHAR *, or of type (char * or wchar_t *) that is compatible with your settings.
First, atoll and ulltoa aren't documented with the Microsoft Visual C/C++ (which is what I use for Windows) so I'm working from documentation I found online. Either your versions do more than those I've found documented, or you've left out some significant code from your example.
Based on the loop control, I'm guessing that you expect to always find 15 values in the string you read from the first control. BUT... the atoll and ulltoa functions only operate on one value at a time and do nothing to advance through the input list. So your loop is converting the first number from string to 64 bit int and then converting that into a string 15 times.
Since you say the last value is the only one you see, your functions must actually be parsing the value string in some way that is not apparent in your example. However, ulltoa seems to always be placing the value into the same place in the same string variable, with each subsequent call in the loop overwriting the previous call. My lazy self would add a bit like this:
int len = 0;
char szOutput[15*20]; // enough space for 15 64 bit hex strings
GetWindowText(hwndtext1, value, 256);
for (i = 15; i >= 0; i--)
{
temp[i] = atoll(value); //converts sting to decimal
ulltoa(temp[i] , sArray, 16); //converts decimal to hexadecimal
buf[i] = temp[i];
len += sprintf( szOutput+len, "%s ", sArray );
}
szOutput[len-1] - '\0'; // remove the final space
SetWindowText(hwndStatic3, szOutput);
Of course, with the sprintf you could also skip the ulltoa call entirely and change the sprintf line to:
len += sprintf( szOutput+len, "%16.16I64X", temp[i] );
(or whatever flavor/form of the hex output you want (see the printf format documentation for details.) If you want your list to be one item per line, then replace the trailing space with a newline. Oh, the I64 in the %16.16I64X is a Microsoft thing that might be different in other compilers/libraries.
FYI, the sprintf technique I used lets the function keep appending to the end of the buffer but incrementing the offset into the buffer (len) by the length of the string just appended, which is the value returned by sprintf. It is a quick and easy way to assembling string lists such as yours.

How does one manipulate Unicode strings at the character level?

Sometimes manipulating character strings at the character level is unavoidable.
Here I have a function written for ANSI/ASCII based character strings that replaces CR/LF sequences with LF only, and also replaces CR with LF. We use this because incoming text files often have goofy line endings due to various text or email programs that have made a mess of them, and I need them to be in a consistent format to make parsing / processing / output work properly down the road.
Here's a fairly efficient implementation of this compression from various line-endings to LF only, for single byte per character implementations:
// returns the in-place conversion of a Mac or PC style string to a Unix style string (i.e. no CR/LF or CR only, but rather LF only)
char * AnsiToUnix(char * pszAnsi, size_t cchBuffer)
{
size_t i, j;
for (i = 0, j = 0; pszAnsi[i]; ++i, ++j)
{
// bounds checking
ASSERT(i < cchBuffer);
ASSERT(j <= i);
switch (pszAnsi[i])
{
case '\n':
if (pszAnsi[i + 1] == '\r')
++i;
break;
case '\r':
if (pszAnsi[i + 1] == '\n')
++i;
pszAnsi[j] = '\n';
break;
default:
if (j != i)
pszAnsi[j] = pszAnsi[i];
}
}
// append null terminator if we changed the length of the string buffer
if (j != i)
pszAnsi[j] = '\0';
// bounds checking
ASSERT(pszAnsi[j] == 0);
return pszAnsi;
}
I'm trying to transform this into something that will work correctly with multibyte/unicode strings, where the size of the next character can be multible bytes wide.
So:
I need to look at a character only at a valid character-point (not in the middle of a character)
I need to copy over the portion of the character that is part of the rejected piece properly (i.e. copy whole characters, not just bytes)
I understand that _mbsinc() will give me the address of the next start of a real character. But what is the equivalent for Unicode (UTF16), and are there already primitives to be able to copy a full character (e.g. length_character(wsz))?
One of the beautiful things about UTF-8 is that if you only care about the ASCII subset, your code doesn't need to change at all. The non-ASCII characters get encoded to multi-byte sequences where all of the bytes have the upper bit set, keeping them out of the ASCII range themselves. Your CR/LF replacement should work without modification.
UTF-16 has the same property. Characters that can be encoded as a single 16-bit entity will never conflict with the characters that require multiple entities.
Do not try to keep text internally in mix of whatever encodings and work with those it is true Hell.
First pick some "internal" encoding. When target platform is UNIX then UTF-8 is good candidate, it is slightly easier to display there. When target platform is Windows then UTF-16 is good candidate, Windows uses it internally anyway everywhere. Whatever you pick, stick to it an only it.
Then you convert all incoming "dirty" text into that encoding. Also you may make some re-formatting that actually looks exactly like your code, only that on case of wchar_t containing UTF-16 you have to use literals like L'\n'.

Convert wchar_t to int

how can I convert a wchar_t ('9') to a digit in the form of an int (9)?
I have the following code where I check whether or not peek is a digit:
if (iswdigit(peek)) {
// store peek as numeric
}
Can I just subtract '0' or is there some Unicode specifics I should worry about?
If the question concerns just '9' (or one of the Roman
digits), just subtracting '0' is the correct solution. If
you're concerned with anything for which iswdigit returns
non-zero, however, the issue may be far more complex. The
standard says that iswdigit returns a non-zero value if its
argument is "a decimal digit wide-character code [in the current
local]". Which is vague, and leaves it up to the locale to
define exactly what is meant. In the "C" locale or the "Posix"
locale, the "Posix" standard, at least, guarantees that only the
Roman digits zero through nine are considered decimal digits (if
I understand it correctly), so if you're in the "C" or "Posix"
locale, just subtracting '0' should work.
Presumably, in a Unicode locale, this would be any character
which has the general category Nd. There are a number of
these. The safest solution would be simply to create something
like (variables here with static lifetime):
wchar_t const* const digitTables[] =
{
L"0123456789",
L"\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669",
// ...
};
//! \return
//! wch as a numeric digit, or -1 if it is not a digit
int asNumeric( wchar_t wch )
{
int result = -1;
for ( wchar_t const* const* p = std::begin( digitTables );
p != std::end( digitTables ) && result == -1;
++ p ) {
wchar_t const* q = std::find( *p, *p + 10, wch );
if ( q != *p + 10 ) {
result = q - *p;
}
return result;
}
If you go this way:
you'll definitely want to download the
UnicodeData.txt file from the Unicode consortium
("Uncode Character
Database"—this page has a links to both the Unicode data
file and an explination of the encodings used in it), and
possibly write a simple parser of this file to extract the
information automatically (e.g. when there is a new version of
Unicode)—the file is designed for simple programmatic
parsing.
Finally, note that solutions based on ostringstream and
istringstream (this includes boost::lexical_cast) will not
work, since the conversions used in streams are defined to only
use the Roman digits. (On the other hand, it might be
reasonable to restrict your code to just the Roman digits. In
which case, the test becomes if ( wch >= L'0' && wch <= L'9' ),
and the conversion is done by simply subtracting L'0'—
always supposing the the native encoding of wide character
constants in your compiler is Unicode (the case, I'm pretty
sure, of both VC++ and g++). Or just ensure that the locale is
"C" (or "Posix", on a Unix machine).
EDIT: I forgot to mention: if you're doing any serious Unicode programming, you
should look into ICU. Handling Unicode
correctly is extremely non-trivial, and they've a lot of functionality already
implemented.
Look into the atoi class of functions: http://msdn.microsoft.com/en-us/library/hc25t012(v=vs.71).aspx
Especially _wtoi(const wchar_t *string); seems to be what you're looking for. You would have to make sure your wchar_t is properly null terminated, though, so try something like this:
if (iswdigit(peek)) {
// store peek as numeric
wchar_t s[2];
s[0] = peek;
s[1] = 0;
int numeric_peek = _wtoi(s);
}
You could use boost::lexical_cast:
const wchar_t c = '9';
int n = boost::lexical_cast<int>( c );
Despite MSDN documentation, a simple test suggest that not only ranger L'0'-L'9' returns true.
for(wchar_t i = 0; i < 0xFFFF; ++i)
{
if (iswdigit(i))
{
wprintf(L"%d : %c\n", i, i);
}
}
That means that L'0' subtraction probably won't work as you may expected.
For most purposes you can just subtract the code for '0'.
However, the Wikipedia article on Unicode numerials mentions that the decimal digits are represented in 23 separate blocks (including twice in Arabic).
If you are not worried about that, then just subtract the code for '0'.

Convert wchar_t to char

I was wondering is it safe to do so?
wchar_t wide = /* something */;
assert(wide >= 0 && wide < 256 &&);
char myChar = static_cast<char>(wide);
If I am pretty sure the wide char will fall within ASCII range.
Why not just use a library routine wcstombs.
assert is for ensuring that something is true in a debug mode, without it having any effect in a release build. Better to use an if statement and have an alternate plan for characters that are outside the range, unless the only way to get characters outside the range is through a program bug.
Also, depending on your character encoding, you might find a difference between the Unicode characters 0x80 through 0xff and their char version.
You are looking for wctomb(): it's in the ANSI standard, so you can count on it. It works even when the wchar_t uses a code above 255. You almost certainly do not want to use it.
wchar_t is an integral type, so your compiler won't complain if you actually do:
char x = (char)wc;
but because it's an integral type, there's absolutely no reason to do this. If you accidentally read Herbert Schildt's C: The Complete Reference, or any C book based on it, then you're completely and grossly misinformed. Characters should be of type int or better. That means you should be writing this:
int x = getchar();
and not this:
char x = getchar(); /* <- WRONG! */
As far as integral types go, char is worthless. You shouldn't make functions that take parameters of type char, and you should not create temporary variables of type char, and the same advice goes for wchar_t as well.
char* may be a convenient typedef for a character string, but it is a novice mistake to think of this as an "array of characters" or a "pointer to an array of characters" - despite what the cdecl tool says. Treating it as an actual array of characters with nonsense like this:
for(int i = 0; s[i]; ++i) {
wchar_t wc = s[i];
char c = doit(wc);
out[i] = c;
}
is absurdly wrong. It will not do what you want; it will break in subtle and serious ways, behave differently on different platforms, and you will most certainly confuse the hell out of your users. If you see this, you are trying to reimplement wctombs() which is part of ANSI C already, but it's still wrong.
You're really looking for iconv(), which converts a character string from one encoding (even if it's packed into a wchar_t array), into a character string of another encoding.
Now go read this, to learn what's wrong with iconv.
An easy way is :
wstring your_wchar_in_ws(<your wchar>);
string your_wchar_in_str(your_wchar_in_ws.begin(), your_wchar_in_ws.end());
char* your_wchar_in_char = your_wchar_in_str.c_str();
I'm using this method for years :)
A short function I wrote a while back to pack a wchar_t array into a char array. Characters that aren't on the ANSI code page (0-127) are replaced by '?' characters, and it handles surrogate pairs correctly.
size_t to_narrow(const wchar_t * src, char * dest, size_t dest_len){
size_t i;
wchar_t code;
i = 0;
while (src[i] != '\0' && i < (dest_len - 1)){
code = src[i];
if (code < 128)
dest[i] = char(code);
else{
dest[i] = '?';
if (code >= 0xD800 && code <= 0xD8FF)
// lead surrogate, skip the next code unit, which is the trail
i++;
}
i++;
}
dest[i] = '\0';
return i - 1;
}
Technically, 'char' could have the same range as either 'signed char' or 'unsigned char'. For the unsigned characters, your range is correct; theoretically, for signed characters, your condition is wrong. In practice, very few compilers will object - and the result will be the same.
Nitpick: the last && in the assert is a syntax error.
Whether the assertion is appropriate depends on whether you can afford to crash when the code gets to the customer, and what you could or should do if the assertion condition is violated but the assertion is not compiled into the code. For debug work, it seems fine, but you might want an active test after it for run-time checking too.
Here's another way of doing it, remember to use free() on the result.
char* wchar_to_char(const wchar_t* pwchar)
{
// get the number of characters in the string.
int currentCharIndex = 0;
char currentChar = pwchar[currentCharIndex];
while (currentChar != '\0')
{
currentCharIndex++;
currentChar = pwchar[currentCharIndex];
}
const int charCount = currentCharIndex + 1;
// allocate a new block of memory size char (1 byte) instead of wide char (2 bytes)
char* filePathC = (char*)malloc(sizeof(char) * charCount);
for (int i = 0; i < charCount; i++)
{
// convert to char (1 byte)
char character = pwchar[i];
*filePathC = character;
filePathC += sizeof(char);
}
filePathC += '\0';
filePathC -= (sizeof(char) * charCount);
return filePathC;
}
one could also convert wchar_t --> wstring --> string --> char
wchar_t wide;
wstring wstrValue;
wstrValue[0] = wide
string strValue;
strValue.assign(wstrValue.begin(), wstrValue.end()); // convert wstring to string
char char_value = strValue[0];
In general, no. int(wchar_t(255)) == int(char(255)) of course, but that just means they have the same int value. They may not represent the same characters.
You would see such a discrepancy in the majority of Windows PCs, even. For instance, on Windows Code page 1250, char(0xFF) is the same character as wchar_t(0x02D9) (dot above), not wchar_t(0x00FF) (small y with diaeresis).
Note that it does not even hold for the ASCII range, as C++ doesn't even require ASCII. On IBM systems in particular you may see that 'A' != 65