Using iconv while maintaining code correctness - c++

I'm currently using iconv to convert documents with different encodings.
The iconv() function has the following prototype:
size_t iconv(iconv_t cd,
             const char **inbuf,
             size_t *inbytesleft,
             char **outbuf,
             size_t *outbytesleft);
So far I have only had to convert buffers of type char*, but I realize I may also have to convert buffers of type wchar_t*. In fact, iconv even has a dedicated encoding name, "wchar_t", for such buffers. This encoding adapts to the operating system: on my machines, it refers to UCS-2 on Windows and to UTF-32 on Linux.
But here lies the problem: if I have a wchar_t* buffer, I can reinterpret_cast it to a char* buffer to use it with iconv, but then I face implementation-defined behavior: I cannot be sure that all compilers will behave the same regarding the cast.
What should I do here?

reinterpret_cast<char const*> is safe and not implementation-defined, at least not on any real implementation.
The language explicitly allows any object to be reinterpreted as an array of characters, and the way you get that array of characters is by using reinterpret_cast.
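As a sketch of that cast in practice, here is a minimal wchar_t-to-UTF-8 conversion through iconv. It assumes a platform whose iconv knows the "WCHAR_T" encoding name (glibc does); error handling is omitted for brevity, and the function name is hypothetical:

```cpp
#include <iconv.h>
#include <string>

// Sketch only: convert a wchar_t string to UTF-8 with iconv, feeding the
// wide buffer through reinterpret_cast<char*> as discussed above.
std::string wide_to_utf8(const std::wstring& in) {
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
    if (cd == (iconv_t)-1) return {};
    std::string out(in.size() * 4 + 4, '\0');       // generous worst-case UTF-8 size
    char* inbuf = reinterpret_cast<char*>(const_cast<wchar_t*>(in.data()));
    size_t inleft = in.size() * sizeof(wchar_t);
    char* outbuf = &out[0];
    size_t outleft = out.size();
    iconv(cd, &inbuf, &inleft, &outbuf, &outleft);  // the cast in action
    iconv_close(cd);
    out.resize(out.size() - outleft);               // trim to the bytes actually written
    return out;
}
```

Note that glibc's iconv() takes plain char** for both buffer arguments, which is why the input pointer is cast to char* rather than char const*.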

Related

c++ how to convert wchar_t to char16_t

Convert between string, u16string & u32string
That post explains the opposite of my question, so I need to post a new one.
I need to convert wchar_t to char16_t. I found a sample of doing the opposite (char16_t -> wchar_t) here:
I am not familiar with templates etc, sorry. Can anybody give me an example of converting wchar_t to char16_t please?
I have this piece of code that I want to adapt for converting wchar_t to char16_t.
std::wstring u16fmt(const char16_t* str) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert_wstring;
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    std::string utf8str = convert.to_bytes(str);
    std::wstring wstr = convert_wstring.from_bytes(utf8str);
    return wstr;
}
Ah, and it should run on Windows and Linux
If sizeof( wchar_t ) == 2 (*), you're saddled with Windows and can only hope your wstring holds UTF-16 (and hasn't been smashed flat to UCS-2 by some old Windows function).
If sizeof( wchar_t ) == 4 (*), you're not on Windows and need to do a UTF-32 to UTF-16 conversion.
(*): Assuming CHAR_BIT == 8.
I am, however, rather pessimistic about standard library's Unicode capabilities beyond simple "piping through", so if you're going to do any actual work on those strings, I'd recommend ICU, the de-facto C/C++ standard library for all things Unicode.
icu::UnicodeString has a wchar_t * constructor, and you can call getTerminatedBuffer() to get a (non-owning) const char16_t *. Or, of course, just use icu::UnicodeString, which uses UTF-16 internally.
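For the direct adaptation the question asks about (wchar_t to char16_t), a hand-rolled sketch that avoids the deprecated codecvt machinery might look like this; it assumes wchar_t holds UTF-32 code points when it is 4 bytes wide (Linux/macOS) and UTF-16 code units when it is 2 bytes wide (Windows), and the helper name is hypothetical:

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: widen a wstring to UTF-16.
std::u16string to_u16(const std::wstring& in) {
    std::u16string out;
    out.reserve(in.size());
    for (wchar_t wc : in) {
        std::uint32_t cp = static_cast<std::uint32_t>(wc);
        if (sizeof(wchar_t) == 2 || cp < 0x10000) {
            // BMP code point (or already a UTF-16 code unit): copy through
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // supplementary plane: split into a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```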

Is there a function to concatenate two char16_t

There is the function wcsncat_s() for concatenating two wchar_t*:
errno_t wcsncat_s(wchar_t *restrict dest, rsize_t destsz,
                  const wchar_t *restrict src, rsize_t count);
Is there an equivalent function for concatenating two char16_t strings?
Not really.
On Windows, though, wchar_t is functionally identical to char16_t, so you could just cast your char16_t* to a wchar_t*.
Otherwise you can do it simply enough by writing yourself a function for it.
You could use std::u16string if you want something portable.
std::u16string str1(u"The quick brown fox ");
std::u16string str2(u"Jumped over the lazy dog");
std::u16string str3 = str1 + str2; // concatenate
const char16_t* psz = str3.c_str();
psz remains valid only as long as str3 does not go out of scope.
But the more portable and flexible solution is to just use wchar_t everywhere (which is 32-bit on Mac). Unless you are explicitly using 16-bit char strings (perhaps for a specific UTF-16 processing routine), it's easier to just keep your code in the wide-char (wchar_t) space, which plays nicer with native APIs and libraries on Mac and Windows.

Convert ICU Unicode string to std::wstring (or wchar_t*)

Is there an icu function to create a std::wstring from an icu UnicodeString ? I have been searching the ICU manual but haven't been able to find one.
(I know I can convert the UnicodeString to UTF-8 and then convert that to a platform-dependent wchar_t*, but I am looking for a single function on UnicodeString that can do this conversion.)
The C++ standard doesn't dictate any specific encoding for std::wstring. On Windows systems, wchar_t is 16-bit, and on Linux, macOS, and several other platforms, wchar_t is 32-bit. As far as C++'s std::wstring is concerned, it is just an arbitrary sequence of wchar_t in much the same way that std::string is just an arbitrary sequence of char.
It seems that icu::UnicodeString has no in-built way of creating a std::wstring, but if you really want to create a std::wstring anyway, you can use the C-based API u_strToWCS() like this:
icu::UnicodeString ustr = /* get from somewhere */;
std::wstring wstr;
int32_t requiredSize;
UErrorCode error = U_ZERO_ERROR;
// obtain the size of string we need
u_strToWCS(nullptr, 0, &requiredSize, ustr.getBuffer(), ustr.length(), &error);
// the preflight call reports U_BUFFER_OVERFLOW_ERROR by design; reset it before converting
error = U_ZERO_ERROR;
// resize accordingly (this will not include any terminating null character, but it doesn't need to either)
wstr.resize(requiredSize);
// copy the UnicodeString buffer to the std::wstring.
u_strToWCS(wstr.data(), wstr.size(), nullptr, ustr.getBuffer(), ustr.length(), &error);
Supposedly, u_strToWCS() will use the most efficient method for converting from UChar to wchar_t (if they are the same size, then it is just a straightforward copy, I suppose).

Take wchar_t and put into char?

I've tried a few things and haven't yet been able to figure out how to get const wchar_t *text (shown below) into the variable StoreText (shown below). What am I doing wrong?
void KeyboardComplete(int localClientNum, const wchar_t *text, unsigned int len)
{
    char* StoreText = text; // This is where the error occurs
}
You cannot directly assign a wchar_t* to a char*, as they are different and incompatible data types.
If StoreText needs to point at the same memory address that text is pointing at, such as if you are planning on looping through the individual bytes of the text data, then a simple type-cast will suffice:
char* StoreText = (char*)text;
However, if StoreText is expected to point to its own separate copy of the character data, then you would need to convert the wide character data into narrow character data instead. Such as by:
using the WideCharToMultiByte() function on Windows:
void KeyboardComplete(int localClientNum, const wchar_t *text, unsigned int len)
{
    int StoreTextLen = 1 + WideCharToMultiByte(CP_ACP, 0, text, len, NULL, 0, NULL, NULL);
    std::vector<char> StoreTextBuffer(StoreTextLen);
    WideCharToMultiByte(CP_ACP, 0, text, len, &StoreTextBuffer[0], StoreTextLen, NULL, NULL);
    char* StoreText = &StoreTextBuffer[0];
    //...
}
using the std::wcsrtombs() function:
#include <cwchar>
#include <vector>

void KeyboardComplete(int localClientNum, const wchar_t *text, unsigned int len)
{
    std::mbstate_t state = std::mbstate_t();
    std::size_t StoreTextLen = 1 + std::wcsrtombs(NULL, &text, 0, &state);
    std::vector<char> StoreTextBuffer(StoreTextLen);
    std::wcsrtombs(&StoreTextBuffer[0], &text, StoreTextLen, &state);
    char *StoreText = &StoreTextBuffer[0];
    //...
}
using the std::wstring_convert class (C++11 and later); note that std::codecvt has a protected destructor, so it needs a small wrapper to be usable with wstring_convert:
#include <locale>
#include <string>

template <class Facet>
struct deletable_facet : Facet {
    using Facet::Facet;
    ~deletable_facet() {}
};

void KeyboardComplete(int localClientNum, const wchar_t *text, unsigned int len)
{
    std::wstring_convert<deletable_facet<std::codecvt<wchar_t, char, std::mbstate_t>>> conv;
    std::string StoreTextBuffer = conv.to_bytes(text, text + len);
    char *StoreText = &StoreTextBuffer[0];
    //...
}
using similar conversions from the iconv or ICU libraries.
First of all, for strings you should use std::wstring/std::string instead of raw pointers.
The C++11 Locale (http://en.cppreference.com/w/cpp/locale) library can be used to convert wide string to narrow string.
I wrote a wrapper function below and have used it for years. Hope it will be helpful to you, too.
#include <string>
#include <locale>
#include <codecvt>
std::string WstringToString(const std::wstring & wstr, const std::locale & loc /*= std::locale()*/)
{
    std::string buf(wstr.size(), 0);
    std::use_facet<std::ctype<wchar_t>>(loc).narrow(wstr.c_str(), wstr.c_str() + wstr.size(), '?', &buf[0]);
    return buf;
}
wchar_t is a wide character. It is typically 16 or 32 bits per character, but this is system dependent.
char is a good ol' CHAR_BIT-sized data type. Again, how big it is is system dependent. Most likely it's going to be one byte, but I can't think of a reason why CHAR_BIT can't be 16 or 32 bits, making it the same size as wchar_t.
If they are different sizes, a direct assignment is doomed. For example an 8 bit char will see 2 characters, and quite likely 2 completely unrelated characters, for every 1 character in a 16 bit wchar_t. This would be bad.
Second, even if they are the same size, they may have different encodings. For example, the numeric value assigned to the letter 'A' may be different for the char and the wchar_t. It could be 65 in char and 16640 in wchar_t.
To make any sense in the other data type, char and wchar_t need to be translated to the other's encoding. std::wstring_convert will often perform this translation for you, but look into the locale library for more complicated translations. Both require a compiler supporting C++11 or later. In earlier C++ standards, a small army of functions provided conversion support. Third-party libraries such as Boost.Locale help unify and widen that support.
Conversion functions are supplied by the operating system to translate between the encoding used by the OS and other common encodings.
You have to do a cast, you can do this:
char* StoreText = (char*)text;
I think this may work.
But you can use the wcstombs function from the <cstdlib> library:
char someText[12];
wcstombs(someText, text, 12);
The last parameter must be the number of bytes available in the destination array.

Given a buffer (void*) and an encoding name, how can I find a string or character's index?

I have a void* to a buffer (and its length, in bytes). I also know the text encoding of the buffer, it might be plain ASCII, UTF-8, Shift-JIS, UTF-16 (both LE or BE), or something else. And I need to find out if "<<<" appears in the buffer.
I'm using C++11 (VS2013) and coming from a .NET background the problem wouldn't be that hard: create an Encoding object instance for the encoding, then pass it the data (as Byte[]) and convert it to a String instance (internally using UTF-16LE), and use the string functions IndexOf or a Regex.
However, the C++ equivalent doesn't necessarily work, as Microsoft's C++ runtime library, specifically its implementation of locale, does not support UTF-8 as a multi-byte encoding (a complete mystery to me). I'm also not keen on the prospect of performing so many buffer copies (as in .NET, where you would Marshal.Copy the void* into a Byte[], and then again into a String via the Encoding instance).
Here's the pseudo-code high-level logic I'm after:
const char* needle = "<<<";

size_t IndexOf(void* buffer, size_t bufferCB, std::string encodingName, std::string needle) {
    char* needleEncoded = Convert(needle, encodingName);
    setlocale(encodingName);
    size_t index = strstr(buffer, bufferCB, needleEncoded);
    return index;
}
My versions of setlocale and strstr aren't in the standard, but exemplify the kind of behaviour I need.
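As a sketch of the search step in that pseudo-code: once the needle has been converted into the buffer's encoding (with iconv, ICU, or a platform API), the lookup itself can be a plain byte search that sidesteps setlocale entirely. The helper below is hypothetical, not from the question:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Byte-level search: returns the byte offset of needleEncoded within the
// buffer, or std::string::npos if absent. For UTF-16 text the caller
// should additionally check that a hit lands on an even (code-unit
// aligned) offset before trusting it.
std::size_t find_bytes(const void* buffer, std::size_t bufferCB,
                       const std::string& needleEncoded) {
    const char* hay = static_cast<const char*>(buffer);
    const std::size_t n = needleEncoded.size();
    if (n == 0 || n > bufferCB) return std::string::npos;
    for (std::size_t i = 0; i + n <= bufferCB; ++i) {
        if (std::memcmp(hay + i, needleEncoded.data(), n) == 0)
            return i;
    }
    return std::string::npos;
}
```

This avoids any locale machinery and works on raw bytes, which is also why the alignment caveat matters for multi-byte encodings.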