How to pass string from C# to C++ specifying encoding

I'm writing a native C++ library to be used from C#. I need a C++ method that receives a string (or char array) and an Encoding. Inside this method I want to convert the string to a byte array with respect to the Encoding, work with it, and send back a string converted from the byte array with respect to the same Encoding. Since this method will be called from C#, I could pass a System.Text.Encoding to it, but I don't know of any analog for it in C++. What approach would you suggest?

It would be much simpler to pass the byte array to the C++ library if you're only going to operate on the bytes anyway...
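For illustration, a minimal sketch of what the native side might look like under that approach (the function name and signature are made up): the C# caller does the conversion itself with Encoding.GetBytes()/GetString() and hands the native code nothing but raw bytes and lengths.

    // Hypothetical exported entry point: the caller has already applied
    // System.Text.Encoding, so the native code only ever sees raw bytes.
    extern "C" __declspec(dllexport)
    int ProcessBytes(const unsigned char* data, int length,
                     unsigned char* out, int outCapacity)
    {
        // Work on the bytes here; this stub just copies them back.
        int n = length < outCapacity ? length : outCapacity;
        for (int i = 0; i < n; ++i)
            out[i] = data[i];
        return n;   // bytes written; a negative value could signal an error
    }

On the C# side this would be declared with [DllImport] taking byte[] parameters, wrapped by Encoding.GetBytes before the call and Encoding.GetString after it.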

Related

Convert array to encoded string

I am using libiconv to convert an array of char to an encoded string.
I am new to the library.
I wonder how I can find out what encoding the given array is encoded in before I call iconv_open("To-be-encoded", "given-encoded-type").
It's the second parameter that I need to know.
Yes, you do indeed need to know it; that is, it is you who has to tell iconv what encoding your array is in. There is no reliable way of detecting what encoding was used to produce a set of bytes - at best you can take a guess based on character frequencies or other such heuristics.
But there is no way to be 100% sure without other information, from metadata or from the file/data format itself. (e.g. HTTP provides headers to indicate encoding, XML has that capability too.)
In other words, if you don't know how a stream of bytes you have is encoded, you cannot convert it to anything else. You need to know the starting point.
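As a minimal sketch of the libiconv calls this implies, assuming the caller supplies the source encoding (the "ISO-8859-1" below is just an example, and the helper name is made up):

    #include <iconv.h>
    #include <string>
    #include <vector>

    // Convert 'input' from a caller-supplied source encoding to UTF-8.
    // iconv cannot detect the source encoding; you have to pass it in.
    std::string to_utf8(const char* input, size_t inlen, const char* from_encoding)
    {
        iconv_t cd = iconv_open("UTF-8", from_encoding);   // (tocode, fromcode)
        if (cd == (iconv_t)-1)
            return std::string();                           // encoding not recognised

        std::vector<char> out(inlen * 4 + 4);               // generous worst case
        char* inptr   = const_cast<char*>(input);
        char* outptr  = out.data();
        size_t inleft = inlen, outleft = out.size();

        size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
        iconv_close(cd);
        if (rc == (size_t)-1)
            return std::string();                           // invalid byte sequence, etc.
        return std::string(out.data(), out.size() - outleft);
    }

    // e.g. to_utf8(buf, len, "ISO-8859-1") -- the last argument is exactly the
    // piece of information that cannot be guessed from the bytes alone.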

What to use to store Unicode (UTF-16) strings? (C++11)

I ask this question in the light of the innovations that C++11 brings, namely char16_t/u16string.
I am writing an application that should have multilingual support. According to my plan, the localization strings will be stored in XML as UTF-16 and retrieved with pugixml. The strings will be used both for the GUI and for generating HTML reports of the computation results. Since I understand wchar_t/wstring to be deprecated in favour of the new u16string, I've planned to use u16string for storing language strings inside the program.
But since both pugixml and MFC's CString use wchar_t as the underlying storage type for Unicode, should I perhaps forget about u16string for now and simply use wstring instead?
Language portability is crucial; platform portability doesn't matter.
I use Visual Studio 2013 with the Intel compiler.
The encoding used for storing the data outside the program is the only one that matters.
That data is likely to be used from other software. Someone will want to write those strings and they'll probably use some kind of specialised editor or gasp a general-purpose text editor. UTF-8 has much better support from other software than UTF-16, and that's what I would recommend and why.
Inside the program, what encoding you use doesn't matter, as long as you do it consistently and don't mix them up in stupid ways.
Obviously, if you use the same encoding inside the program as you do outside of it, you don't need to perform any conversions and the risk of mixing them up and producing mojibake is not there.
The thing with pugixml using wchar_t is that the encoding it uses then depends on the size of wchar_t. If the size is 2, it uses UTF-16; if the size is 4, it uses UTF-32. pugixml can also use UTF-8 with char, which is what you get when the PUGIXML_WCHAR_MODE macro is left undefined, so you can use that instead.
If you use wchar_t API, stick to wstring. Remember: since we're inside the program, it doesn't matter if it's going to be UTF-16 or UTF-32, as long as we're consistent. If you use the char API, stick to string. You could, I guess, perform conversions from wchar_t to char16_t and use u16strings, but that wouldn't give much benefit.
The saving and loading functions in pugixml take an xml_encoding parameter that lets you pick what encoding will be on the data outside the program, and that doesn't have to match what you use internally. Pick whichever you find the most convenient.
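A rough sketch of what that looks like with pugixml in its default char/UTF-8 mode (PUGIXML_WCHAR_MODE not defined); the file name and node names are placeholders:

    #include <pugixml.hpp>

    int main()
    {
        pugi::xml_document doc;

        // The file on disk is UTF-16; in memory everything is char/UTF-8.
        // pugi::encoding_auto would also work, since pugixml detects the BOM.
        pugi::xml_parse_result ok =
            doc.load_file("strings.xml", pugi::parse_default, pugi::encoding_utf16);
        if (!ok)
            return 1;

        const char* greeting = doc.child("strings").child_value("greeting");
        (void)greeting;   // UTF-8 inside the program

        // Write it back out as UTF-16, independently of the internal encoding.
        doc.save_file("strings.xml", "\t", pugi::format_default, pugi::encoding_utf16);
        return 0;
    }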

In C++ when to use WCHAR and when to use CHAR

I have a question:
Some libraries use WCHAR as the text parameter type and others use CHAR (as UTF-8). I need to know when to use WCHAR and when to use CHAR when writing my own library.
Use char and treat it as UTF-8. There are a great many reasons for this; this website summarises it much better than I can:
http://utf8everywhere.org/
It recommends converting from wchar_t to char (UTF-16 to UTF-8) as soon as you receive it from any library, and converting back when you need to pass strings to it. So to answer your question, always use char except at the point that an API requires you to pass or receive wchar_t.
WCHAR (or wchar_t on Visual C++ compiler) is used for Unicode UTF-16 strings.
This is the "native" string encoding used by Win32 APIs.
CHAR (or char) can be used for several other string formats: ANSI, MBCS, UTF-8.
Since UTF-16 is the native encoding of Win32 APIs, you may want to use WCHAR (or better, a proper string class based on it, like std::wstring) at the Win32 API boundary, inside your app.
And you can use UTF-8 (so, CHAR/char and std::string) to exchange your Unicode text outside your application boundary. For example: UTF-8 is widely used on the Internet, and when you exchange UTF-8 text between different platforms you don't have the problem of endianness (whereas with UTF-16 you have to consider both the UTF-16BE big-endian and the UTF-16LE little-endian cases).
You can convert between UTF-16 and UTF-8 using the WideCharToMultiByte() and MultiByteToWideChar() Win32 APIs. These are pure-C APIs, and these can be conveniently wrapped in C++ code, using string classes instead of raw character pointers, and exceptions instead of raw error codes. You can find an example of that here.
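A minimal sketch of the usual two-call pattern (query the required size, then convert into a correctly sized buffer); error handling is reduced to returning an empty string:

    #include <windows.h>
    #include <string>

    // UTF-16 -> UTF-8. A real wrapper would report GetLastError() via an exception.
    std::string Utf16ToUtf8(const std::wstring& utf16)
    {
        if (utf16.empty())
            return std::string();

        const int sizeNeeded = ::WideCharToMultiByte(
            CP_UTF8, 0, utf16.data(), static_cast<int>(utf16.size()),
            nullptr, 0, nullptr, nullptr);
        if (sizeNeeded <= 0)
            return std::string();

        std::string utf8(static_cast<size_t>(sizeNeeded), '\0');
        ::WideCharToMultiByte(
            CP_UTF8, 0, utf16.data(), static_cast<int>(utf16.size()),
            &utf8[0], sizeNeeded, nullptr, nullptr);
        return utf8;
    }

    // The opposite direction (UTF-8 -> UTF-16) uses MultiByteToWideChar with the
    // same two-call pattern.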
The right question is not which type to use, but what should be your contract with your library users. Both char and wchar_t can mean more than one thing.
The right answer, to me, is to use char and consider everything UTF-8 encoded, as utf8everywhere.org suggests. This will also make it easier to write cross-platform libraries.
Make sure you make correct use of strings, though. Some APIs, like fopen(), accept a char* string but treat it differently (not as UTF-8) when compiled on Windows. If Unicode is important to you (and it probably is, when you are dealing with strings), be sure to handle your strings correctly. A good example can be seen in boost::locale. I also recommend using boost::nowide on Windows to get strings handled correctly inside your library.
On Windows we stick to WCHARs (std::wstring), mainly because if you don't, you end up having to convert all the time when calling Windows functions.
I have a feeling that trying to use utf8 internally simply because of http://utf8everywhere.org/ is gonna bite us in the bum later on down the line.
When developing a Windows application, it is often recommended to use TCHARs. The good thing about TCHARs is that they can be either regular chars or wchar_ts, depending on whether the Unicode setting is defined or not. Once you switch to TCHARs, make sure that every string function you use also has the _t prefix (e.g. _tcslen for the length of a string). That way you know your code will work in both Unicode and ANSI builds.
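A small sketch of that TCHAR pattern (MSVC, <tchar.h>); the function is just illustrative:

    #include <windows.h>
    #include <tchar.h>

    // With UNICODE/_UNICODE defined, TCHAR is wchar_t and _T("...") is a wide
    // literal; without them, TCHAR is char. The same source compiles either way.
    size_t MessageLength()
    {
        const TCHAR* message = _T("Hello");
        return _tcslen(message);   // _t-prefixed counterpart of strlen/wcslen
    }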

How to use ICU with UTF-16?

I'm looking into using ICU for Unicode string processing in a native Node.js module because it seems to me that v8::String (according to these docs) doesn't have a C++ API for this purpose.
To my knowledge V8 expects UTF-16 in ExternalStringResource and other APIs, so I'd like to use ICU for UTF-16 processing.
I specifically need to:
Iterate over the characters (not just the 16-bit code units) of a UTF-16 string
Tell the number of characters (not just the 16-bit code units) that a UTF-16 string contains
So I looked at the ICU documentation and found the UnicodeString and CharacterIterator classes. However, UnicodeString doesn't have a fromUTF16 method, only fromUTF8 and fromUTF32.
The other thing I'm unsure about is, does the UnicodeString constructor copy the data I give it or not? I'd very much prefer to use a zero-copy approach where I'd just work with an immutable object so it shouldn't perform any copy operations, just use the buffer I point it at.
I'm also unsure whether I can just use UCharIterator (assuming I can somehow get a UChar* from my UTF-16 strings).
So my question is: How do I use ICU for the above purposes?
Thanks in advance for your answers!
UnicodeString uses UTF-16 for storage by default. That's why it only has fromUTF8 and fromUTF32: from UTF-16 there is no conversion to be made.
It does copy the data. It is an owning string, much like std::string.
You can use UCharIterator if you don't want to copy the data. UChar is a 16-bit value. You can force it to be whatever 16-bit type you prefer working with by defining the UCHAR_TYPE macro:
Define UChar to be UCHAR_TYPE, if that is #defined (for example, to char16_t), or wchar_t if that is 16 bits wide; always assumed to be unsigned.
If neither is available, then define UChar to be uint16_t.
This makes the definition of UChar platform-dependent but allows direct string type compatibility with platforms with 16-bit wchar_t types.
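Putting that together, a sketch of a zero-copy approach might look like the following. Note that it uses UnicodeString's read-only aliasing constructor rather than UCharIterator, and the buffer/length parameters stand in for whatever V8 hands you:

    #include <unicode/unistr.h>

    // Wrap an existing UTF-16 buffer without copying it, then count and walk
    // code points (characters) rather than 16-bit code units.
    void inspect(const UChar* buffer, int32_t length)
    {
        // 'false': the buffer is not NUL-terminated; UnicodeString only aliases it.
        icu::UnicodeString s(false, buffer, length);

        int32_t codePoints = s.countChar32();          // characters, not code units
        (void)codePoints;

        for (int32_t i = 0; i < s.length(); i = s.moveIndex32(i, 1)) {
            UChar32 c = s.char32At(i);                 // full code point, surrogate pairs joined
            (void)c;
        }
    }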

Trouble creating char * object to send to c++/c shared object library

First off I am new to ctypes and did search for an answer to my question. Definitely will appreciate any insight from here.
I have a byte string supplied to me by another tool. It contains what appears to be hex and other values. I'm creating the c_char_p object as follows:
mybytestring = b'something with a lot of hex \x00\x00\xc7\x87\x9bb and other alphanumeric and non-word characters' # Length of this is very long let's say 480
mycharp = c_char_p(mybytestring)
I also create a c_char_Array as follows:
mybuff = create_string_buffer(mybytestring)
The problem is when I send either mycharp or mybuff to a c++ library .so function, the string gets cut off at the NULL terminator (first occurrence of '\x00')
I'm loading the c++ library and calling the function as follows:
lib_handle = cdll.LoadLibrary("mylib.so")
lib_handle.myfunction(mycharp)
lib_handle.myfunction(mybuff)
The c++ function expects a char *
Does anyone know how to send the whole string with the NULL bytes ('\x00') included?
Thanks
Add your original data to a vector<char> vec, and send vec.data()
But the actual problem is
The c++ function expects a char *.
You will need to change this (to accept a second arg=length of the buffer, or for example, to accept a vector<char>) if you want it to accept an array of char including null.
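As a sketch of that first option (the exported name and the use of a vector are only illustrative):

    #include <cstddef>
    #include <vector>

    // Exported with C linkage so ctypes can find it by name. With an explicit
    // length, embedded '\x00' bytes are ordinary data rather than terminators.
    extern "C" void myfunction(const char* data, std::size_t length)
    {
        std::vector<char> buf(data, data + length);   // full buffer, nulls included
        // ... work with buf ...
    }

On the Python side the call then becomes something like lib_handle.myfunction(mybytestring, len(mybytestring)), ideally after setting lib_handle.myfunction.argtypes = [c_char_p, c_size_t] so ctypes passes both arguments with the right types.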
Alternatively, you can figure out what you actually want the c++ function to do and do the "preprocessing" of the char array yourself, adding a null terminator to each new array, and after that send them to the c++ function.
For example, you may decide that the "input" array is actually a set of C strings: then you need to do a simple parse to split them, and send them to the c++ function in a loop, one after the other.
Or maybe you decide that the input could be a string in UTF-16 rather than UTF-8. Then you need to convert it to UTF-8 as well as you can and send that to the c++ function.