UTF-8 encoded C string byte length - C++

Can I use std::strlen() on a null-terminated UTF-8 string and expect it to work? That is, does it count the string length in bytes, not in glyphs/codepoints?
Or do UTF-8 multi-byte codepoints just break strlen() in this case?
I'm a bit paranoid about it, so I'm currently doing the following with the utf8cpp library:
size_t utf8bytelen(const char *utf8str) {
    const char *itr = utf8str;
    while (*itr && utf8::unchecked::next(itr));
    return std::distance(utf8str, itr);
}
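For what it's worth, strlen() already gives the byte length: it counts bytes up to the first NUL, and in well-formed UTF-8 no byte of a multi-byte sequence is ever 0x00, so the terminator cannot be mistaken for part of a code point. A minimal check (the function name `utf8bytelen_simple` is illustrative):

```cpp
#include <cstring>

// Byte length of a NUL-terminated UTF-8 string. strlen() is sufficient:
// UTF-8 never uses the byte 0x00 inside a multi-byte sequence, so strlen
// counts exactly the bytes of the encoded string.
size_t utf8bytelen_simple(const char *utf8str) {
    return std::strlen(utf8str);
}
```

For example, "héllo" is five code points but six bytes ('é' encodes as 0xC3 0xA9), and both strlen() and the utf8cpp loop above report 6.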

Related

What is the right way to convert UTF16 string to wchar_t on Mac?

In a project that still uses Xcode 3 (so no C++11 features like codecvt).
Use a conversion library, like libiconv. You can set its input encoding to "UTF-16LE" or "UTF-16BE" as needed, and set its output encoding to "wchar_t" rather than any specific charset.
#include <iconv.h>
#include <errno.h>

uint16_t *utf16 = ...;   // input data
size_t utf16len = ...;   // in bytes
wchar_t *outbuf = ...;   // allocate an initial buffer
size_t outbuflen = ...;  // in bytes

char *inptr = (char*) utf16;
char *outptr = (char*) outbuf;

iconv_t cvt = iconv_open("wchar_t", "UTF-16LE");
while (utf16len > 0)
{
    if (iconv(cvt, &inptr, &utf16len, &outptr, &outbuflen) == (size_t)(-1))
    {
        if (errno == E2BIG)
        {
            // resize outbuf to a larger size and
            // update outptr and outbuflen accordingly...
        }
        else
            break; // conversion failure
    }
}
iconv_close(cvt);
Why do you want wchar_t on Mac? wchar_t is not necessarily 16-bit, and it is not very useful on Mac.
I suggest converting to NSString instead. Since UTF-16 data contains embedded zero bytes, use the length-aware initializer rather than a C-string method (payloadBytes below stands in for however you track the byte length):
char* payload; // points to string with UTF16 encoding
size_t payloadBytes; // its length in bytes
NSString* s = [[NSString alloc] initWithBytes:payload length:payloadBytes encoding:NSUTF16LittleEndianStringEncoding];
To convert NSString to UTF16
const char* payload = [s cStringUsingEncoding:NSUTF16LittleEndianStringEncoding];
Note that Mac supports NSUTF16BigEndianStringEncoding as well.
Note 2: Although const char* is used, the data is encoded as UTF-16, so don't pass it to strlen().
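To make that warning concrete: every ASCII character in UTF-16LE has a zero high byte, so strlen() stops almost immediately. A small illustration:

```cpp
#include <cstring>

// "Hi" encoded as UTF-16LE: 'H' = 48 00, 'i' = 69 00, plus a 16-bit terminator.
const char utf16le_hi[] = { 'H', 0, 'i', 0, 0, 0 };
// strlen() sees the 00 byte after 'H' and reports a length of 1,
// even though the payload occupies 4 bytes.
```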
I would go the safest route.
Get the UTF-16 string as a UTF-8 string (using NSString)
set the locale to UTF-8
use mbstowcs() to convert the UTF-8 multi-byte string to a wchar_t string
At each step you are ensured the string value will be protected.
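A rough sketch of steps 2 and 3 above (the function name and the fallback to the environment locale are illustrative assumptions; on a real system you would pick whichever UTF-8 locale is installed):

```cpp
#include <clocale>
#include <cstdlib>
#include <string>

// Sketch of the locale + mbstowcs() route. setlocale() selects a UTF-8
// locale (falling back to the environment default if unavailable), and
// mbstowcs(nullptr, ...) measures the converted length first.
std::wstring utf8_to_wstring_mbs(const std::string &utf8) {
    if (!std::setlocale(LC_CTYPE, "en_US.UTF-8"))
        std::setlocale(LC_CTYPE, "");            // fallback: environment locale
    size_t len = std::mbstowcs(nullptr, utf8.c_str(), 0);
    if (len == (size_t)-1)
        return std::wstring();                   // invalid sequence for this locale
    std::wstring out(len, L'\0');
    std::mbstowcs(&out[0], utf8.c_str(), len);
    return out;
}
```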

searching an unsigned char array for characters

I have a binary data file that I am trying to read. The values in the file are 8-bit unsigned integers, with "record" delimiters that are ASCII text ($MSG, $GRP, for example). I read the data as one big chunk, as follows:
unsigned char *inBuff = (unsigned char*)malloc(file_size*sizeof(unsigned char));
result = fread(inBuff, sizeof(unsigned char), file_size, pFile);
I need to search this array to find records that start with $GRP (so I can then read the data that follows), can someone suggest a good way to do this? I have tried several things, and none of them have worked. For example, my most recent attempt was:
std::stringstream str1;
str1 << inBuff;
std::string strTxt = str1.str();
However, when I check the length on this, it is only 5. I looked at the file in Notepad, and noticed that the sixth character is a NULL. So it seems like it is cutting off there because of the NULL. Any ideas?
fread() returns the number of elements it read (it never returns -1); that value tells you how many bytes are available to search.
It is unreasonable to expect to be able to do a string search on binary data, as there may be NUL characters in the binary data which will cause the length function to terminate early.
One possible way to search the data is to use memcmp on the buffer, with your search key and the length of the search key.
(As per my comment)
C str functions assume zero-terminated strings. Any C string function will stop at the very first binary 0. Use memchr to locate the $ and then use strncmp or memcmp. In particular, do not assume the byte immediately after the 4-byte identifier is a binary 0.
In code (C):
/* recordId should point to a simple string such as "$GRP" */
unsigned char *find_record (unsigned char *data, size_t max_length, const char *recordId)
{
    unsigned char *ptr = data;
    size_t id_length = strlen(recordId);
    size_t remaining_length = max_length;

    if (id_length > max_length)
        return NULL;
    do
    {
        /* fast scan for the first character only */
        ptr = memchr(ptr, recordId[0], remaining_length);
        if (!ptr)
            return NULL;
        /* take care not to overrun the end of the data */
        if ((size_t)(ptr - data) + id_length > max_length)
            return NULL;
        /* first character matches, test the entire string */
        if (!memcmp(ptr, recordId, id_length))
            return ptr;
        /* no match; test onwards from the next possible position */
        ptr++;
        remaining_length = max_length - (size_t)(ptr - data);
    } while (remaining_length > 0);
    return NULL;
}
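Since this is C++, std::search is a ready-made alternative to a hand-rolled scan; it takes explicit begin/end ranges and is therefore unaffected by embedded NULs (the function name `find_record_cpp` is illustrative):

```cpp
#include <algorithm>
#include <cstring>

// Search raw bytes for a record identifier using std::search.
// Explicit ranges mean embedded 0 bytes are treated as ordinary data.
const unsigned char *find_record_cpp(const unsigned char *data, size_t length,
                                     const char *recordId) {
    const unsigned char *id = reinterpret_cast<const unsigned char *>(recordId);
    const unsigned char *end = data + length;
    const unsigned char *hit = std::search(data, end, id, id + std::strlen(recordId));
    return hit == end ? nullptr : hit;
}
```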

Replace method changes size of QByteArray

I want to manipulate a 32 bit write command which I have stored in a QByteArray. But the thing that confuses me is that my QByteArray changes size and I cannot figure out why that happens.
My code:
const char CMREFCTL[] = {0x85,0x00,0x00,0x0B};
QByteArray test = QByteArray::fromRawData(CMREFCTL, sizeof(CMREFCTL));
qDebug()<<test.toHex();
const char last1 = 0x0B;
const char last2 = 0x0A;
test.replace(3,1,&last2);
qDebug()<<test.toHex();
test.replace(3,1,&last1);
qDebug()<<test.toHex();
Generates:
"0x8500000b"
"0x8500000a0ba86789"
"0x8500000ba867890ba86789"
I expected the following output:
"0x8500000b"
"0x8500000a"
"0x8500000b"
Using test.replace(3,1,&last2,1) works, but I don't see why my code above doesn't give the same result.
Best regards!
Here is the documentation for the relevant method:
QByteArray & QByteArray::replace(int pos, int len, const char *after)
This is an overloaded function.
Replaces len bytes from index position pos with the zero-terminated string after.
Notice: this can change the length of the byte array.
You are not giving the byte array a zero-terminated string, but a pointer to a single char. So it will scan forward in memory from that pointer until it hits a 0, and treat all that memory as the string to replace with.
If you just want to change a single character, test[3] = last2; should do what you want.
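The same overload trap is easy to demonstrate without Qt: std::string::replace(pos, len, ptr) likewise assumes ptr is NUL-terminated, while replace(pos, len, ptr, n) copies exactly n bytes. This sketch is an analogy to QByteArray's four-argument overload, not Qt code:

```cpp
#include <string>

// Length-explicit replacement: the four-argument overload copies exactly
// n bytes, so the string's size cannot change unexpectedly.
std::string replace_one_byte(std::string s, size_t pos, char c) {
    s.replace(pos, 1, &c, 1); // like QByteArray::replace(pos, 1, &c, 1)
    return s;
}
```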

libxml2 xmlChar * to std::wstring

libxml2 seems to store all its strings in UTF-8, as xmlChar *.
/**
* xmlChar:
*
* This is a basic byte in an UTF-8 encoded string.
* It's unsigned allowing to pinpoint case where char * are assigned
* to xmlChar * (possibly making serialization back impossible).
*/
typedef unsigned char xmlChar;
As libxml2 is a C library, there are no provided routines to get a std::wstring out of an xmlChar *. I'm wondering whether the prudent way to convert xmlChar * to std::wstring in C++11 is to use the mbstowcs C function, via something like this (work in progress):
std::wstring xmlCharToWideString(const xmlChar *xmlString) {
    if (!xmlString) { abort(); } //provided string was null
    int charLength = xmlStrlen(xmlString); //excludes null terminator
    wchar_t *wideBuffer = new wchar_t[charLength];
    size_t wcharLength = mbstowcs(wideBuffer, (const char *)xmlString, charLength);
    if (wcharLength == (size_t)(-1)) { abort(); } //mbstowcs failed
    std::wstring wideString(wideBuffer, wcharLength);
    delete[] wideBuffer;
    return wideString;
}
Edit: Just an FYI, I'm very aware of what xmlStrlen returns; it's the number of xmlChar used to store the string. I know it's not the number of characters but rather the number of unsigned char. It would have been less confusing if I had named it byteLength, but I thought this was clearer since I have both charLength and wcharLength. As for the correctness of the code, the wideBuffer will always be larger than or equal to the required size to hold the result, I believe, since characters that require more space than wchar_t will be truncated (I think).
xmlStrlen() returns the number of UTF-8 encoded code units in the xmlChar* string. That is not going to be the same as the number of wchar_t encoded code units needed when the data is converted, so do not use xmlStrlen() to allocate the size of your wchar_t string. You need to call std::mbstowcs() once to get the correct length, then allocate the memory, and call mbstowcs() again to fill the memory. You will also have to use std::setlocale() to tell mbstowcs() to use UTF-8 (messing with the locale may not be a good idea, especially if multiple threads are involved). For example:
std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
    if (!xmlString) { abort(); } //provided string was null
    std::wstring wideString;
    int charLength = xmlStrlen(xmlString);
    if (charLength > 0)
    {
        char *origLocale = setlocale(LC_CTYPE, NULL);
        setlocale(LC_CTYPE, "en_US.UTF-8");
        size_t wcharLength = mbstowcs(NULL, (const char*) xmlString, 0); //measure; excludes null terminator
        if (wcharLength != (size_t)(-1))
        {
            wideString.resize(wcharLength);
            mbstowcs(&wideString[0], (const char*) xmlString, wcharLength);
        }
        setlocale(LC_CTYPE, origLocale);
        if (wcharLength == (size_t)(-1)) { abort(); } //mbstowcs failed
    }
    return wideString;
}
A better option, since you mention C++11, is to use std::codecvt_utf8 with std::wstring_convert instead so you do not have to deal with locales:
std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
if (!xmlString) { abort(); } //provided string was null
try
{
std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
return conv.from_bytes((const char*)xmlString);
}
catch(const std::range_error& e)
{
abort(); //wstring_convert failed
}
}
An alternative option is to use an actual Unicode library, such as ICU or ICONV, to handle Unicode conversions.
There are some problems in this code, besides the fact that you are using wchar_t and std::wstring which is a bad idea unless you're making calls to the Windows API.
xmlStrlen() does not do what you think it does. It counts the number of UTF-8 code units (a.k.a. bytes) in a string. It does not count the number of characters. This is all stuff in the documentation.
Counting characters will not portably give you the correct size for a wchar_t array anyway. So not only does xmlStrlen() not do what you think it does, what you wanted isn't the right thing either. The problem is that the encoding of wchar_t varies from platform to platform, making it 100% useless for portable code.
The mbstowcs() function is locale-dependent. It only converts from UTF-8 if the locale is a UTF-8 locale!
This code will leak memory if the std::wstring constructor throws an exception.
My recommendations:
Use UTF-8 if at all possible. The wchar_t rabbit hole is a lot of extra work for no benefit (except the ability to make Windows API calls).
If you need UTF-32, then use std::u32string. Remember that wstring has a platform-dependent encoding: it could be a variable-length encoding (Windows) or fixed-length (Linux, OS X).
If you absolutely must have wchar_t, then chances are good that you are on Windows. Here is how you do it on Windows:
std::wstring utf8_to_wstring(const char *utf8)
{
    int utf8len = (int) std::strlen(utf8);
    int wclen = MultiByteToWideChar(
        CP_UTF8, 0, utf8, utf8len, NULL, 0);
    wchar_t *wc = NULL;
    try {
        wc = new wchar_t[wclen];
        MultiByteToWideChar(
            CP_UTF8, 0, utf8, utf8len, wc, wclen);
        std::wstring wstr(wc, wclen);
        delete[] wc;
        return wstr;
    } catch (...) {
        delete[] wc;
        throw;
    }
}
If you absolutely must have wchar_t and you are not on Windows, use iconv() (see man 3 iconv, man 3 iconv_open and man 3 iconv_close for the manual). You can specify "WCHAR_T" as one of the encodings for iconv().
Remember: You probably don't want wchar_t or std::wstring. What wchar_t does portably isn't useful, and making it useful isn't portable. C'est la vie.
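For the std::u32string route suggested above, C++11's std::wstring_convert works without touching the global locale (note that it was deprecated in C++17, so ICU or iconv are the long-term options; the function name here is illustrative):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 -> UTF-32: one char32_t per code point, no locale involved.
std::u32string utf8_to_u32(const std::string &utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);
}
```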
Add
#include <boost/locale.hpp>
Convert xmlChar* to string:
std::string strGbk((char*)node);
Convert string to wstring:
std::string strGbk = "china powerful forever";
std::wstring wstr = boost::locale::conv::to_utf<wchar_t>(strGbk, "gbk");
std::cout << strGbk << std::endl;
std::wcout << wstr << std::endl;
(Note that libxml2's xmlChar strings are UTF-8, so for those the source charset argument should be "UTF-8" rather than "gbk".)
It works for me, good luck.

Need to convert char* (pointer) to wchar_t* (pointer)

I am doing some serial port communication with a computer-controlled pump, and the CreateFile function I use to communicate requires the COM port name to be passed as a wchar_t pointer.
I am also using Qt to create a form and acquire the COM port name as a QString.
This QString is converted to a char array and pointed to as follows:
char* Dialog::GetPumpSerialPortNumber() {
    QString mystring;
    mystring = ui->comboBox_2->currentText();
    char *mychar;
    mychar = mystring.toLatin1().data();
    return mychar;
}
I now need to set my port number which is stored as a wchar_t* in my pump object. I do this by calling the following function:
void pump::setPortNumber(wchar_t* portNumber){
this->portNumber = portNumber;
}
Thus how do I change my char* (mychar) into a wchar_t* (portNumber)?
Thanks.
If you're talking about just needing to convert a char array to a wchar_t array, here's a solution for you:
static wchar_t* charToWChar(const char* text)
{
    size_t size = strlen(text) + 1;
    wchar_t* wa = new wchar_t[size];
    mbstowcs(wa, text, size);
    return wa;
}
An enhancement to leetNightshade's answer could be
size_t unistrlen(const char *s) {
    size_t sz = 0;
    const unsigned char *sc = (const unsigned char *)s;
    while (*sc != '\0') {
        if      ((*sc & 0xE0) == 0xC0) sc += 2; /* lead byte of a 2-byte character */
        else if ((*sc & 0xF0) == 0xE0) sc += 3; /* lead byte of a 3-byte character */
        else if ((*sc & 0xF8) == 0xF0) sc += 4; /* lead byte of a 4-byte character */
        else                           sc += 1; /* single-byte character */
        sz++;
    }
    return sz;
}
wchar_t* charToWChar(const char* text) {
    size_t size = unistrlen(text) + 1;
    wchar_t* wa = new wchar_t[size];
    mbstowcs(wa, text, size);
    return wa;
}
Here unistrlen returns how many characters (single- or multi-byte) are in your string, unlike strlen, which counts byte by byte; sizing the buffer from strlen would waste some memory whenever the string contains multi-byte characters.
You can use the toWCharArray function of QString to get your wchar_t* value and return a wchar_t* from your GetPumpSerialPortNumber function.
I've found a helpful article in MSDN - How to: Convert Between Various String Types. I guess it should be useful.
QString::toWCharArray ( wchar_t * array ) ?