C++ string to valid UTF-8 string using utf8proc

I have a std::string output. Using utf8proc, I would like to transform it into a valid UTF-8 string.
http://www.public-software-group.org/utf8proc-documentation
typedef int int32_t;
#define ssize_t int
ssize_t utf8proc_reencode(int32_t *buffer, ssize_t length, int options)
Reencodes the sequence of unicode characters given by the pointer buffer and length as UTF-8. The result is stored in the same memory area where the data is read. Following flags in the options field are regarded: (Documentation missing here) In case of success the length of the resulting UTF-8 string is returned, otherwise a negative error code is returned.
WARNING: The amount of free space being pointed to by buffer, has to exceed the amount of the input data by one byte, and the entries of the array pointed to by str have to be in the range of 0x0000 to 0x10FFFF, otherwise the program might crash!
So first, how do I add an extra byte at the end? Then how do I convert from std::string to int32_t *buffer?
This does not work:
std::string g = output();
fprintf(stdout,"str: %s\n",g.c_str());
g += " "; //add an extra byte??
g = utf8proc_reencode((int*)g.c_str(), g.size()-1, 0);
fprintf(stdout,"strutf8: %s\n",g.c_str());

You very likely don't actually want utf8proc_reencode() - that function takes a valid UTF-32 buffer and turns it into a valid UTF-8 buffer, and since you say you don't know what encoding your data is in, you can't use it.
So, first you need to figure out what encoding your data is actually in. You can use http://utfcpp.sourceforge.net/ to test whether you already have valid UTF-8 with utf8::is_valid(g.begin(), g.end()). If that's true, you're done!
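A minimal sketch of that check, assuming utfcpp is available as "utf8.h" (the helper name scrub_to_utf8 is made up here). If you only need the output to be well-formed UTF-8 and can tolerate replacement characters, utf8::replace_invalid can substitute U+FFFD for the bad sequences:

#include <iterator>
#include <string>
#include "utf8.h"

// Hypothetical helper: return g unchanged if it is already valid UTF-8,
// otherwise return a copy with invalid sequences replaced by U+FFFD.
std::string scrub_to_utf8(const std::string& g)
{
    if (utf8::is_valid(g.begin(), g.end()))
        return g; // already valid, nothing to do
    std::string fixed;
    utf8::replace_invalid(g.begin(), g.end(), std::back_inserter(fixed));
    return fixed;
}

Replacement is lossy, though; to preserve the original text you'll need to detect and convert the real encoding.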
If false, things get complicated...but ICU ( http://icu-project.org/ ) can help you; see http://userguide.icu-project.org/conversion/detection
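A sketch of that detection step with ICU's ucsdet C API (detection is statistical, so treat the result as a best guess to verify, not a guarantee):

#include <unicode/ucsdet.h>
#include <string>

// Returns ICU's best guess at the charset name, or "" on failure.
std::string guess_charset(const std::string& g)
{
    UErrorCode err = U_ZERO_ERROR;
    UCharsetDetector* det = ucsdet_open(&err);
    ucsdet_setText(det, g.c_str(), (int32_t)g.size(), &err);
    const UCharsetMatch* match = ucsdet_detect(det, &err); // may be NULL
    std::string name = (match && U_SUCCESS(err)) ? ucsdet_getName(match, &err) : "";
    ucsdet_close(det); // also invalidates match, so the name is copied first
    return name;       // e.g. "ISO-8859-1", usable as the ucnv_open() argument below
}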
Once you somewhat reliably know what encoding your data is in, ICU can help again with getting it to UTF-8. For example, assuming your source data g is in ISO-8859-1:
#include <unicode/ucnv.h>
#include <vector>

UErrorCode err = U_ZERO_ERROR; // check this after every call...

// CONVERT FROM ISO-8859-1 TO UChar (UTF-16)
UConverter *conv_from = ucnv_open("ISO-8859-1", &err);
std::vector<UChar> converted(g.size()*2); // *2 is usually more than enough
int32_t conv_len = ucnv_toUChars(conv_from, &converted[0], converted.size(), g.c_str(), g.size(), &err);
converted.resize(conv_len);
ucnv_close(conv_from);

// CONVERT FROM UChar TO UTF-8
g.resize(converted.size()*4); // UTF-8 needs at most 4 bytes per code point
UConverter *conv_u8 = ucnv_open("UTF-8", &err);
int32_t u8_len = ucnv_fromUChars(conv_u8, &g[0], g.size(), &converted[0], converted.size(), &err);
g.resize(u8_len);
ucnv_close(conv_u8);
after which g holds UTF-8 data.

Related

Converting to UTF-8 from ToUnicodeEx()

I get input using GetAsyncKeyState(), which I then convert to Unicode using ToUnicodeEx():
wchar_t character[1];
ToUnicodeEx(i, scanCode, keyboardState, character, 1, 0, layout);
I can write this to a file using wfstream like so:
wchar_t buffer[128]; // Will not print unicode without these 2 lines
file.rdbuf()->pubsetbuf(buffer, 128);
file.put(0xFEFF); // BOM needed since it's encoded using UCS-2 LE
file << character[0];
When I open this file in Notepad++ it's in UCS-2 LE, but I want it to be in UTF-8 format. I believe ToUnicodeEx() is returning it in UCS-2 LE format; it also only works with wide chars. Is there any way to do this using either fstream or wfstream by somehow converting to UTF-8 first? Thanks!
You might want to use the WideCharToMultiByte function.
For example:
wchar_t buffer[LEN];         // input buffer
char output_buffer[OUT_LEN]; // output buffer where the UTF-8 string will be written
int num = WideCharToMultiByte(
    CP_UTF8,                        // target code page
    0,                              // conversion flags
    buffer,                         // source UTF-16 string
    number_of_characters_in_buffer, // or -1 if buffer is null-terminated
    output_buffer,                  // destination
    size_in_bytes_of_output_buffer, // destination capacity
    NULL,                           // must be NULL for CP_UTF8
    NULL);                          // must be NULL for CP_UTF8
The Windows API generally refers to UTF-16 as "Unicode", which is a little confusing. This means most Unicode Win32 functions operate on or return UTF-16 strings.
So ToUnicodeEx returns a UTF-16 string.
If you need this as UTF-8, you'll need to convert it using WideCharToMultiByte.
Thank you for all the help; I've managed to solve my problem with additional help from a blog post about WideCharToMultiByte() and UTF-8 here.
This function converts wide char arrays to a UTF-8 string:
// Takes in pointer to wide char array and length of the array
std::string ConvertCharacters(const wchar_t* buffer, int len)
{
int nChars = WideCharToMultiByte(CP_UTF8, 0, buffer, len, NULL, 0, NULL, NULL);
if (nChars == 0)
{
return u8"";
}
std::string newBuffer;
newBuffer.resize(nChars);
WideCharToMultiByte(CP_UTF8, 0, buffer, len, const_cast<char*>(newBuffer.c_str()), nChars, NULL, NULL);
return newBuffer;
}
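And a hypothetical call site tying this back to the ToUnicodeEx() snippet above: once the characters are UTF-8, a plain narrow ofstream opened in binary mode writes them out, with no BOM or pubsetbuf tricks needed (the file name is a placeholder):

// Needs <fstream> and <string>
std::string utf8 = ConvertCharacters(character, 1);
std::ofstream file("log.txt", std::ios::binary | std::ios::app);
file.write(utf8.data(), (std::streamsize)utf8.size());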

Number of bytes of CString in C++

I have a Unicode string stored in a CString and I need to know the number of bytes this string takes in UTF-8 encoding. I know CString has a method GetLength(), but that returns the number of characters, not bytes.
I tried (among other things) converting to a char array, but I get (logically, I guess) only an array of wchar_t, so this doesn't solve my problem.
To be clear about my goal: for the input, let's say, "aaa" I want "3" as output (since "a" takes one byte in UTF-8). But for the input "āaa", I'd like to see the output "4" (since ā is a two-byte character).
I think this has to be quite a common request, but even after 1.5 hours of searching and experimenting, I couldn't find the correct solution.
I have very little experience with Windows programming, so maybe I left out some crucial information. If so, please let me know and I'll add any information you request.
As your CString contains a series of wchar_t, you can just use WideCharToMultiByte with the output charset set to CP_UTF8. The function returns the number of bytes written to the output buffer, i.e. the length of the UTF-8 encoded string:
LPWSTR instr;                 // the wide (UTF-16) input string
char outstr[MAX_OUTSTR_SIZE]; // room for the UTF-8 output
int utf8_len = WideCharToMultiByte(CP_UTF8, 0, instr, -1,
                                   outstr, MAX_OUTSTR_SIZE, NULL, NULL);
If you don't need the output string, you can simply set the output buffer size to 0
cbMultiByte
Size, in bytes, of the buffer indicated by lpMultiByteStr. If this parameter is set to 0, the function returns the required buffer size for lpMultiByteStr and makes no use of the output parameter itself.
In that case the function will return the number of bytes needed for the UTF-8 string without actually writing anything. Note that passing -1 as the input length makes the returned count include the terminating null byte; pass the explicit character count if you want the length of the text alone.
int utf8_len = WideCharToMultiByte(CP_UTF8, 0, instr, -1, NULL, 0, NULL, NULL);
If your CString is really a CStringA, i.e. _UNICODE is not defined, then you need to use MultiByteToWideChar to convert the string to UTF-16, and then convert from UTF-16 to UTF-8 with WideCharToMultiByte. See How do I convert an ANSI string directly to UTF-8? But new code should never be compiled without Unicode support anyway.
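A hedged sketch of that two-step path (it assumes the CStringA holds text in the system ANSI code page, CP_ACP, and the helper name is made up):

#include <vector>

// Sketch: ANSI -> UTF-16 -> UTF-8, returning only the UTF-8 byte count.
int Utf8BytesOfAnsi(const CStringA& s)
{
    int wlen = MultiByteToWideChar(CP_ACP, 0, s, s.GetLength(), NULL, 0);
    if (wlen == 0)
        return 0;
    std::vector<wchar_t> wide(wlen);
    MultiByteToWideChar(CP_ACP, 0, s, s.GetLength(), &wide[0], wlen);
    // Explicit length in, so the returned count has no null terminator in it.
    return WideCharToMultiByte(CP_UTF8, 0, &wide[0], wlen, NULL, 0, NULL, NULL);
}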

How to declare appropriate size for the buffer

I'm using TCHAR in the Visual C++ project I'm working on; its definition is shown below:
#ifdef _UNICODE
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif
I need to put some data into buffer buff:
char buff[size] = {0}; // how to declare the buffer size - what should be its value ?
sprintf(buff, "%s (ID: %i)", res->name(), res->id());
where:
name() returns TCHAR*
id() returns int
How do I calculate the value of size, the exact buffer capacity actually needed (smaller if Unicode is not defined, bigger if it is)? In addition, I'd like to protect myself from buffer overflows; what kind of protection should I use?
What's more, I've declared the buffer as char here. If I declared it as int, would the size value differ (i.e. be 4 times smaller than when declared as char)?
UPDATE
What I came up with, partially based on Mats Petersson's answer, is:
size_t len;
const char *FORMAT;
#ifndef _UNICODE
len = strlen((char*)res->name());
FORMAT = "%s (ID: %i)";
#else
len = wcslen(res->name());
FORMAT = "%S (ID: %i)";
#endif
int size = 7 * sizeof(TCHAR) +                              /* place for characters inside format string */
           len * sizeof(TCHAR) +                            /* place for "name" characters */
           strlen(_itoa(id, ioatmp, 10)) * sizeof(TCHAR) +  /* place for "id" digits */
           1 * sizeof(TCHAR);                               /* zero byte(s) string terminator */
char *buff = new char[size]; /* buffer has to be declared dynamically on the heap,
                              * because its exact size is not known at compile time */
sprintf(buff, FORMAT, name, id);
delete[] buff;
Is it correct thinking or did I miss something ?
To begin from the back: buff should always be char, because that's what sprintf stores.
Second, if your res->name() returns a wide-char (Unicode) string, your format string should use "%S"; for regular ASCII you should use "%s".
Now, to calculate the length required for the buffer and avoid overflows, it's not that hard to do something like:
const TCHAR *nm = res->name();
size_t len;
#ifndef _UNICODE
len = strlen(nm);
#else
... see below.
#endif
and then guesstimate the length of the number (an integer can't take more than 12 places), along with the exact number of characters produced as constants in the format string.
This works fine for the standard ASCII variant.
However, it gets more fun with the wide char variant, as that can take up multiple bytes in the output string (e.g. writing Chinese characters that always require multibyte encoding). One solution is:
len = snprintf(NULL, 0, "%S", nm);
which should give you the correct number [I think]. It's a pretty cumbersome method, but it will work. I'm not sure there is an easy way to convert a wide-string to "number of bytes needed to store this string" in another way.
Edit: I would seriously consider whether there's much point in supporting the non-Unicode variant, and just convert the whole thing to using swprintf(...) instead. You still need the length, but it should just be the result of wcslen(res->name()), rather than requiring some complex conversion calculation.
You can use snprintf / swprintf, which will return the number of chars/wchars needed.
Here, char buff[size] = {0}; you are writing outside of the buffer. UPDATE: I take that back - it's just a declaration with initialization, if size is a constant.
This "%s (ID: %i)" should be changed to "%s (ID: %d)" if the last parameter is an int (though for the printf family, %i and %d behave identically).

OPOS BSTR* not converting properly

So after a bunch of research I found that using WideCharToMultiByte worked great for sending data from the Control Object through OPOS to my custom SO. Well, we encountered a bug. In the DirectIO portion, the C# Control Object's mapping is DirectIO(int command, ref int data, ref string object);
and for the longest time we only needed to send simple commands through DirectIO. For instance, to turn on the LED we would set the data to be the length in milliseconds, and the object to be the color. When we needed to write data to a tag or a card, the text had to be parsed from a special XML-styled string to a byte array... Well, now the need has come about to take a byte array, use ASCII encoding to put that array into string form, and have it written.
The problem is that when I convert this string in my Service Object, it doesn't convert right: it seems to stop on a null, even though SysStringLen knows the length is 4. For example, the control object does this:
int page = 16;
byte[] data = new byte[] { 0x19, 0x00, 0x30, 0x00 };
string pData = System.Text.ASCIIEncoding.ASCII.GetString(data);
msr.DirectIO(902, ref page, ref pData);
The SO sees this
int len = (int)SysStringLen(*pString);
long dataData = *pData;
char* dataObject = new char[1+len];
WideCharToMultiByte(CP_ACP, 0, *pString, -1, dataObject, len, NULL, NULL);
ByteUtil::LogArray("dataObject", (BYTE*)dataObject, len);
yields the output of
dataObject(4)-19:00:00:00
Basically, as soon as the first null character is reached, the rest of the data is lost. Now if I convert the number from a string to a string it works out OK, because I have a ByteUtil function just for that occasion... but it doesn't seem right for me to have to do that... why can't I just have it as a BYTE array?
Very easy, just change this line:
WideCharToMultiByte(CP_ACP, 0, *pString, -1, dataObject, len, NULL, NULL);
into:
WideCharToMultiByte(CP_ACP, 0, *pString, len, dataObject, len, NULL, NULL);
If you set the fourth parameter to -1, WideCharToMultiByte treats the input string as a null-terminated string. BSTRs are null-terminated for compatibility reasons, but you should never treat them as null-terminated because they can contain null characters as part of the string.
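If the payload is really arbitrary bytes, one alternative (a sketch, not part of the accepted fix) is to skip the code-page conversion entirely and pull the low byte out of each UTF-16 unit, since the C# side widened each byte with ASCIIEncoding.GetString(). Be aware that ASCII encoding on the C# side already replaces bytes above 0x7F with '?', so this round-trips cleanly only for data below 0x80:

#include <vector>

// Sketch: recover the original byte array from the BSTR without any
// code-page conversion; each UTF-16 unit holds one original byte here.
UINT len = SysStringLen(*pString);
std::vector<BYTE> raw(len);
for (UINT i = 0; i < len; ++i)
    raw[i] = (BYTE)(*pString)[i]; // keep the low byte, embedded nulls included
ByteUtil::LogArray("raw", raw.empty() ? NULL : &raw[0], (int)len);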

Call popen() on a command with Chinese characters on Mac

I'm trying to execute a program on a file using the popen() command on a Mac. For this, I create a command of the form <path-to-executable> <path-to-file> and then call popen() on this command. Right now, both of these components are declared as char*. I need to read the output of the command, so I need the pipe given by popen().
Now it turns out that path-to-file can contain Chinese, Japanese, Russian and pretty much any other characters. For this, I can represent the path-to-file as wchar_t*. But this doesn't work with popen() because apparently Mac / Linux don't have a wide _wpopen() like Windows.
Is there any other way I can make this work? I'm getting the path-to-file from a data structure that can only give me wchar_t* so I have to take it from there and convert it appropriately, if needed.
Thanks in advance.
Edit:
Seems like one of those days when you just end up pulling your hair out.
So I tried using wcstombs, but the setlocale call failed for "C.UTF-8" and any of its permutations. Unsurprisingly, the wcstombs call then failed, returning -1.
Then I tried to write my own iconv implementation based on some sample codes searched on Google. I came up with this, which stubbornly refuses to work:
iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
// error checking here
wchar_t* inbuf = ...; // get wchar_t* here
char outbuf[<size-of-inbuf>*4+1];
size_t inlen = <size-of-inbuf>;
size_t outlen = <size-of-inbuf>*4+1;
char* c_inbuf = (char*) inbuf;
char* c_outbuf = outbuf;
int ret = iconv(cd, &c_inbuf, &inlen, &c_outbuf, &outlen);
// more error checking here
iconv always returns -1 and the errno is set to EINVAL. I've verified that <size-of-len> is set correctly. I've got no clue why this code's failing now.
Edit 2:
iconv was failing because I was not setting the input buffer length right. Also, Mac doesn't seem to support the "WCHAR_T" encoding so I've changed it to UTF-16. Now I've corrected the length and changed the from encoding but iconv just returns without converting any character. It just returns 0.
To debug this issue, I even changed the input string to a temp string and set the input length appropriately. Even this iconv call just returns 0. My code now looks like:
iconv_t cd = iconv_open("UTF-8", "UTF-16");
// error checking here
wchar_t* inbuf = ...; // get wchar_t* here - guaranteed to be UTF-16
char outbuf[<size-of-inbuf>*4+1];
size_t inlen = <size-of-inbuf>;
size_t outlen = <size-of-inbuf>*4+1;
char* c_inbuf = "abc"; // (char*) inbuf;
inlen = 4;
char* c_outbuf = outbuf;
int ret = iconv(cd, &c_inbuf, &inlen, &c_outbuf, &outlen);
// more error checking here
I've confirmed that the converter descriptor is being opened correctly. The from-encoding is correct. The input buffer contains a few simple characters. Everything is hardcoded and still, iconv doesn't convert any characters and just returns 0 and outbuf remains empty.
Sanity loss alert!
You'll need a UTF-8 string for popen. For this, you can use iconv to convert between different encodings, including from the local wchar_t encoding to UTF-8. (Note that on my Mac OS install, wchar_t is actually 32 bits, not 16.)
EDIT Here's an example that works on OS X Lion. I did not have problems using the wchar_t encoding (and it is documented in the iconv man page).
#include <sys/param.h>
#include <string.h>
#include <stdlib.h>
#include <iconv.h>
#include <stdio.h>
#include <errno.h>

char* utf8path(const wchar_t* wchar, size_t utf32_bytes)
{
    char result_buffer[MAXPATHLEN];
    iconv_t converter = iconv_open("UTF-8", "wchar_t");
    char* result = result_buffer;
    char* input = (char*)wchar;
    size_t output_available_size = sizeof result_buffer;
    size_t input_available_size = utf32_bytes;
    size_t result_code = iconv(converter, &input, &input_available_size, &result, &output_available_size);
    iconv_close(converter); // close in all cases to avoid leaking the descriptor
    if (result_code == (size_t)-1)
    {
        perror("iconv");
        return NULL;
    }
    return strdup(result_buffer);
}

int main()
{
    wchar_t hello_world[] = L"/éè/path/to/hello/world.txt";
    char* utf8 = utf8path(hello_world, sizeof hello_world);
    printf("%s\n", utf8);
    free(utf8);
    return 0;
}
The utf8path function accepts a wchar_t string with its byte length and returns the equivalent UTF-8 string. If you deal with a pointer to wchar_t instead of an array of wchar_t, you'll want to use (wcslen(ptr) + 1) * sizeof(wchar_t) instead of sizeof.
Mac OS X uses UTF-8, so you need to convert the wide-character strings into UTF-8. You can do this using wcstombs, provided you first switch into a UTF-8 locale. For example:
// Do this once at program startup
setlocale(LC_ALL, "en_US.UTF-8");
...
// Error checking omitted for expository purposes
wchar_t *wideFilename = ...; // This comes from wherever
char filename[256]; // Make sure this buffer is big enough!
wcstombs(filename, wideFilename, sizeof(filename));
// Construct popen command using the UTF-8 filename
You can also use libiconv to do the wide-character to UTF-8 conversion for you if you don't want to change your program's locale setting; you could also roll your own implementation, as doing the conversion is not all that complicated.
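Tying the pieces together, a minimal sketch of the whole flow with the locale approach (the tool path and buffer sizes are placeholders):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

// Sketch: wide path -> UTF-8 via wcstombs -> popen. Assumes setlocale()
// with a UTF-8 locale succeeded at program startup.
FILE* run_tool_on(const wchar_t* widePath)
{
    char path[1024];
    if (wcstombs(path, widePath, sizeof path) == (size_t)-1)
        return NULL; // conversion failed
    char command[2048];
    snprintf(command, sizeof command, "/path/to/tool \"%s\"", path); // placeholder tool
    return popen(command, "r"); // read the tool's output from the returned pipe
}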