I am using libiconv to convert an array of char into an encoded string.
I am new to the library.
I wonder how I can find out what encoding the given array is in before I call iconv_open("To-be-encoded","given-encoded-type").
It's the second parameter that I need to know.
Yes, you do indeed need to know it; that is, it is you who has to tell iconv what encoding your array is in. There is no reliable way of detecting what encoding was used to produce a set of bytes - at best you can take a guess based on character frequencies or other such heuristics.
But there is no way to be 100% sure without other information, from metadata or from the file/data format itself. (e.g. HTTP provides headers to indicate encoding, XML has that capability too.)
In other words, if you don't know how a stream of bytes you have is encoded, you cannot convert it to anything else. You need to know the starting point.
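Once you do know the source encoding (from an HTTP header, an XML declaration, or wherever), the conversion itself is straightforward. Here is a minimal sketch using libiconv; the function name and the encoding names are just placeholders for whatever your metadata says:

    #include <iconv.h>
    #include <cerrno>
    #include <stdexcept>
    #include <string>

    // Convert `input` from `fromEnc` to `toEnc`. Both names have to be known
    // up front -- iconv cannot discover them for you.
    std::string convert(const std::string& input,
                        const char* fromEnc,   // e.g. "ISO-8859-1", from your metadata
                        const char* toEnc)     // e.g. "UTF-8"
    {
        iconv_t cd = iconv_open(toEnc, fromEnc);
        if (cd == (iconv_t)-1)
            throw std::runtime_error("unsupported conversion");

        std::string output;
        char buf[1024];
        char* inPtr = const_cast<char*>(input.data());
        size_t inLeft = input.size();

        while (inLeft > 0) {
            char* outPtr = buf;
            size_t outLeft = sizeof(buf);
            if (iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft) == (size_t)-1
                && errno != E2BIG) {   // E2BIG only means the output buffer is full
                iconv_close(cd);
                throw std::runtime_error("byte sequence invalid for the declared encoding");
            }
            output.append(buf, sizeof(buf) - outLeft);
        }
        // (a final iconv() call with a NULL input would flush stateful encodings; omitted here)
        iconv_close(cd);
        return output;
    }

If the declared source encoding is wrong, iconv will either fail or quietly produce garbage, which is exactly why the encoding has to come from metadata rather than from guessing.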
I am attempting to use MultiByteToWideChar to convert some text in ANY encoding supported by that function to another encoding such as UTF-8.
The issue is that when the input buffer ends in the middle of a multi-byte character, MultiByteToWideChar will just report an error, but will give NO indication of which character it failed at.
Take this:
teså—hello
and say it's UTF-8. I want to convert it into UTF-16.
Now for my situation, I read, say, 4 bytes. Then I call MultiByteToWideChar on those 4 bytes.
Well, the Asian character is split across that 4-byte boundary.
Now MultiByteToWideChar will fail, and will NOT tell me WHICH BYTE it failed at so that I could readjust.
I read 4 bytes, or bufferSize bytes, because I have streaming data.
I have used iconv for encoding conversion, but it's MUCH too slow.
I have also used ICU, and it's fast, but even completely trimmed down it is STILL 6.5 MB in size, which is too big.
Is there another solution that is also fast but small and supports a wide range of encodings?
I have also tried the CharNextExA functions and such but they don't work with other encodings.
The return value of the function only gives the number of characters converted, so I do not know how many bytes were consumed. Multibyte characters can vary in length.
I need the number of bytes converted because then I can copy over those bytes into the next buffer for reuse.
What I'm trying to do is read a very large file in chunks and convert that file's encoding, which varies, into UTF-8.
NOTE:
I'm curious, how does ICU4C work? Basically, I copy the source files over, but out of the box it only supports encodings like UTF-8, not Big5. To add Big5, I have to create a 5MB .data file which I then feed to ICU4C, and then Big5 is available. The thing is, I don't think the .data file is code, because the one built for x64 works perfectly fine for x86. Is there a way to avoid that 5MB?
You could feed the return value from MultiByteToWideChar into WideCharToMultiByte, and the resulting length would tell you how many bytes of the multibyte string were converted. Most of the time, if I need this level of detail, I simply suck it up, use ICU, and ignore the resulting size.
I don't think there is a one-function solution.
Without using a 3rd-party library you might be stuck with something like this:
Read a byte into a buffer.
If IsDBCSLeadByteEx is true, append the next byte to the buffer.
Call MultiByteToWideChar. If this fails the trailing byte (if any) was incorrect.
Note that IsDBCSLeadByteEx does not support Unicode so when the code page is UTF-8 you need to do your own length handling until your buffer contains one complete code point.
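A rough sketch of that loop for a DBCS code page; it is not an exact implementation of anything, just the shape of the idea, and it deliberately ignores the UTF-8 caveat above:

    #include <windows.h>
    #include <string>

    // Convert a chunk of bytes in DBCS code page `cp` to UTF-16 one character
    // at a time, so a character split across a chunk boundary can be detected.
    // Returns the converted text; bytes left in `pending` belong to the next chunk.
    std::wstring ConvertDbcsChunk(UINT cp, const char* data, size_t size,
                                  std::string& pending)
    {
        std::wstring result;
        for (size_t i = 0; i < size; ++i) {
            pending.push_back(data[i]);

            // A lead byte means one more byte is needed; wait for it.
            if (pending.size() == 1 &&
                IsDBCSLeadByteEx(cp, static_cast<BYTE>(pending[0])))
                continue;

            wchar_t wc[2];
            int n = MultiByteToWideChar(cp, MB_ERR_INVALID_CHARS,
                                        pending.data(), (int)pending.size(),
                                        wc, 2);
            if (n == 0) {
                // Lead byte followed by an invalid trail byte -- skip or report.
                pending.clear();
                continue;
            }
            result.append(wc, n);
            pending.clear();
        }
        // If `pending` is non-empty here, the chunk ended mid-character;
        // keep it and pass it back in with the next chunk.
        return result;
    }

It is character-at-a-time, so it will not win any benchmarks, but it does give you the byte-level control that a single MultiByteToWideChar call over the whole buffer does not.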
I am trying to convert a std::string buffer - containing data from a bitmap file - to a std::wstring.
I am using MultiByteToWideChar, but that does not work, because the function stops after it encounters the first '\0' character. It seems to interpret it as the end of the string.
When I don't pass -1 as the length parameter, but the real length of the data in the std::string buffer, it messes up the Unicode string with characters that definitely did not appear at that position in the original string...
Do I have to write my own conversion function?
Or should I maybe keep the data as a plain char array, because the special symbols will be converted incorrectly?
With regards
There are many, many things that will fail with this approach. Among other things, extra bytes may be added to your data without your realizing it.
It's odd that your only option takes a std::wstring(). If this is a home-grown library, you should take the trouble to write a new function. If it's not, make sure there's nothing more suitable before writing your own.
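If the buffer really is raw bitmap data, the simplest fix is to not convert it at all and keep it as bytes. A minimal sketch, assuming you only need the file contents in memory:

    #include <fstream>
    #include <iterator>
    #include <vector>

    // Read a binary file (e.g. a bitmap) into a byte buffer. No text-encoding
    // conversion is involved, so no bytes are reinterpreted or added.
    std::vector<char> readBinaryFile(const char* path)
    {
        std::ifstream in(path, std::ios::binary);
        return std::vector<char>(std::istreambuf_iterator<char>(in),
                                 std::istreambuf_iterator<char>());
    }

A std::string with explicit lengths works too; the point is that MultiByteToWideChar and std::wstring are for text, not for arbitrary binary data.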
Since I am not getting an answer to this question, I'll have to prototype and check for myself. My dataset headers need to be a fixed size, so I need fixed-size strings. Is it possible to specify fixed-size strings or byte arrays in protocol buffers? It is not readily apparent here, and I feel a bit bad about forcing fixed-size strings into the header message -- i.e., std::string(128, '\0').
If not, I'd rather use a #pragma pack(1) struct header {...};
edit
Question indirectly answered here. Will answer and accept.
protobuf does not have such a concept in the protocol, nor in the .proto schema language. In strings and blobs, the data is always technically variable length using a length prefix (which itself uses varint encoding, so even the length is variable length).
Of course, if you only ever store data of a particular length, then it will line up. Note also that since strings in protobuf are Unicode using UTF-8 encoding, the length of the encoded data is not as simple as the number of characters (unless you are using only ASCII characters).
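If you do decide to fake fixed-size fields, the padding has to happen on the application side. This is a sketch of the idea only; the 128-byte size and the accessor names are made up for the example:

    #include <string>

    // Pad (or truncate) a value to exactly 128 bytes before storing it in a
    // protobuf string/bytes field, so every serialized header carries the same size.
    std::string toFixed128(const std::string& value)
    {
        std::string fixed = value.substr(0, 128); // truncate if too long
        fixed.resize(128, '\0');                  // pad with NULs if too short
        return fixed;
    }

    // Usage (hypothetical generated accessors):
    //   header.set_name(toFixed128(name));

Note that the field is still written with a length prefix on the wire; it just happens to always carry the same length.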
This is a slight clarification to the previous answer. Protocol Buffers does not encode strings as UTF-8; it encodes them as regular bytes. The on-wire format is the number of bytes consumed followed by the actual bytes. See https://developers.google.com/protocol-buffers/docs/encoding/.
While the on-wire format is always the same, protocol buffers provides two interfaces for developers to use, string and bytes, with the primary difference being that the former will generally provide string types to the developer whereas the latter will provide byte types (e.g. Java provides String for string and ByteString for bytes).
I'm using a Japanese string as a wchar_t, and I need to convert it to a char*. Is there any method or function to convert wchar_t* to char* without losing data?
It is not enough to say "I have a string as wchar_t". You must also know what encoding the characters of the string are in. This is probably UTF-16, but you need to know for certain.
It is also not enough to say "I want to convert to char". Again, you must make a decision on what encoding the characters will be represented in. JIS? Shift-JIS? EUC? UTF-8? Another encoding?
If you know the answers to the two questions above, you can do the conversion without any problem using WideCharToMultiByte.
What you have to do first is choose the target encoding, such as UTF-8 or UTF-16. Then encode your wchar_t[] strings into the encoding you chose via libiconv or a similar string-encoding library.
You need to call WideCharToMultiByte and pass in the code page identifier for the Japanese multibyte encoding you want. See MSDN for that function. On Windows, the local Japanese multibyte code page is CP932, the MS variation on Shift-JIS. However, you might conceivably want UTF-8 to send to someone who expects it.
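A minimal sketch of that call, assuming the wchar_t string is UTF-16 (as it is on Windows) and that UTF-8 is the target; pass 932 instead of CP_UTF8 if you want the Shift-JIS/CP932 variant:

    #include <windows.h>
    #include <string>

    // Convert a UTF-16 wchar_t string to a narrow string in code page `cp`
    // (CP_UTF8 for UTF-8, 932 for the MS Shift-JIS variant).
    std::string narrow(const std::wstring& wide, UINT cp = CP_UTF8)
    {
        if (wide.empty()) return std::string();

        int bytes = WideCharToMultiByte(cp, 0, wide.data(), (int)wide.size(),
                                        nullptr, 0, nullptr, nullptr);
        std::string out(bytes, '\0');
        WideCharToMultiByte(cp, 0, wide.data(), (int)wide.size(),
                            &out[0], bytes, nullptr, nullptr);
        return out;
    }

The first call only asks for the required size; the second one does the conversion. For code pages that cannot represent every character (CP932 cannot represent all of Unicode), characters that do not fit are replaced with a default character unless you ask the function to fail instead.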
Given an untyped pointer to some buffer which can hold either an ANSI or a Unicode string, how do I tell whether the string it currently holds is multibyte or not?
Unless the string itself contains information about its format (e.g. a header or a byte order mark) then there is no foolproof way to detect if a string is ANSI or Unicode. The Windows API includes a function called IsTextUnicode() that basically guesses if a string is ANSI or Unicode, but then you run into this problem because you're forced to guess.
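For what it is worth, the IsTextUnicode() guess looks like this in code; treat the answer as a hint, not a fact (this is only a sketch):

    #include <windows.h>

    // Ask Windows to guess whether `buf` holds UTF-16 text. On input, `tests`
    // selects which statistical checks to run; on output it reports which passed.
    bool LooksLikeUtf16(const void* buf, int sizeInBytes)
    {
        INT tests = IS_TEXT_UNICODE_UNICODE_MASK;
        return IsTextUnicode(buf, sizeInBytes, &tests) != FALSE;
    }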
Why do you have an untyped pointer to a string in the first place? You must know exactly what your data represents and how, either by using a typed pointer in the first place or by providing an ANSI/Unicode flag or something similar. A string of bytes is meaningless unless you know exactly what it represents.
Unicode is not an encoding, it's a mapping of code points to characters. The encoding is UTF-8 or UCS-2, for example.
And, given that there is zero difference between ASCII and UTF-8 encoding if you restrict yourself to the lower 128 characters, you can't actually tell the difference.
You'd be better off asking if there were a way to tell the difference between ASCII and a particular encoding of Unicode. And the answer to that is to use statistical analysis, with the inherent possibility of inaccuracy.
For example, if the entire string consists of bytes less than 128, it's ASCII (it could be UTF-8, but there's no way to tell and no difference in that case).
If it's primarily English/Roman and consists of lots of two-byte sequences with a zero as one of the bytes, it's probably UTF-16. And so on. I don't believe there's a foolproof method without actually having an indicator of some sort (e.g., a BOM).
My suggestion is to not put yourself in the position where you have to guess. If the data type itself can't contain an indicator, provide different functions for ASCII and a particular encoding of Unicode. Then force the work of deciding onto your client. At some point in the calling hierarchy, someone should know the encoding.
Or, better yet, ditch ASCII altogether, embrace the new world and use Unicode exclusively. With UTF-8 encoding, ASCII has exactly no advantages over Unicode :-)
In general, you can't.
You could check the pattern of zeros: just one at the end probably means an ANSI C string; a zero in every other byte probably means ANSI-range text stored as UTF-16; three zeros out of every four bytes might mean UTF-32.
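A crude sketch of that zero-pattern check, which is a guess and nothing more:

    #include <cstddef>

    enum class GuessedWidth { Ansi, Utf16, Utf32, Unknown };

    // Guess the character width of a buffer from where its zero bytes are.
    // Only plausible for mostly-ASCII text; anything else can fool it.
    GuessedWidth GuessFromZeroPattern(const unsigned char* data, size_t size)
    {
        size_t zeros = 0;
        for (size_t i = 0; i < size; ++i)
            if (data[i] == 0) ++zeros;

        if (zeros <= 1)
            return GuessedWidth::Ansi;     // at most a terminating NUL
        if (size >= 4 && zeros >= (size * 3) / 4)
            return GuessedWidth::Utf32;    // roughly three zero bytes per character
        if (size >= 4 && zeros >= size / 2 - 1)
            return GuessedWidth::Utf16;    // roughly every other byte is zero
        return GuessedWidth::Unknown;
    }

Even this simple check shows why guessing is fragile: ASCII text stored as UTF-16 looks nothing like, say, Chinese text stored as UTF-16.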