I am trying to load an ID3 image tag whose JFIF data has been saved in UTF-16 encoding. The library I am using (JUCE) fails to parse the image, as it assumes the data is in raw binary format.
The majority of image tags I've parsed successfully report the encoding as ISO-8859-1 (Latin-1), but since Latin-1 is a subset of UTF-16, a straight conversion wouldn't work.
How can I get this UTF-16-encoded binary block back into the raw format that I want? And could anybody enlighten me as to the benefits of storing an image in UTF-16 format?!
latin1 is not a subset of UTF-16!
I think you are confusing text encoding with binary data. UTF-16 is a character encoding whose base unit is a 16-bit integer (UTF-8 uses 8-bit units).
A JPEG picture (JFIF) is binary data, and it should never be run through a character encoding conversion.
If that actually happened, you're largely out of luck, since the result of applying a character conversion to a binary stream depends on whatever "source" text charset was assumed at the time.
You can try to convert that (UTF-16) data back to the original bytes by guessing the initial source charset and reversing the conversion with iconv, as sketched below.
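For example, if the tag writer treated the JPEG bytes as Latin-1 text and re-encoded them as UTF-16LE, a rough recovery attempt with the glibc iconv C API could look like the sketch below; the charset names are pure guesses that you would have to experiment with:

#include <iconv.h>
#include <stddef.h>

/* Rough sketch (glibc iconv): undo a text conversion that was applied to binary data.
   "ISO-8859-1" is a guess at the charset the tag writer assumed for the JPEG bytes,
   and UTF-16LE is a guess at the byte order. Returns bytes written, or (size_t)-1. */
size_t recover_binary(char* in, size_t inlen, char* out, size_t outlen) {
    iconv_t cd = iconv_open("ISO-8859-1", "UTF-16LE");  /* to-charset, from-charset */
    if (cd == (iconv_t)-1) return (size_t)-1;
    char* inp = in;
    char* outp = out;
    size_t outleft = outlen;
    size_t rc = iconv(cd, &inp, &inlen, &outp, &outleft);
    iconv_close(cd);
    return (rc == (size_t)-1) ? (size_t)-1 : outlen - outleft;
}

If the conversion fails partway through, the source charset guess was probably wrong; try another one.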
Related
If a C++ program receives a Protocol Buffers message that has a Protocol Buffers string field, which is represented by a std::string, what is the encoding of text in that field? Is it UTF-8?
Protobuf strings are always valid UTF-8 strings.
See the Language Guide:
A string must always contain UTF-8 encoded or 7-bit ASCII text.
(And ASCII is always also valid UTF-8.)
Not all protobuf implementations enforce this, but if I recall correctly, at least the Python library refuses to decode strings that are not valid UTF-8.
The Win32 API uses UTF-16LE, so if I call a WinAPI function that returns a string, it will be UTF-16LE encoded.
So I'm thinking of using UTF-16LE for the strings in my program, and when it's time to send the data over the network I convert it to UTF-8, and on the other side I convert it back to UTF-16LE. The idea is that there is less data to send.
Is there a reason why I shouldn't do that?
With UTF-8 encoding, you'll use:
1 byte for ASCII chars
2 bytes for code points between U+0080 and U+07FF
3 or 4 bytes where necessary (for code points above U+07FF)
So if your text is in a Western language, it will in most cases be shorter in UTF-8 than in UTF-16LE: the Western alphabets are encoded between U+0000 and U+0590.
Conversely, if your text is Asian, UTF-8 may inflate your data significantly: the Asian character sets lie beyond U+07FF and therefore require at least 3 bytes per character.
In the "UTF-8 Everywhere" article you can find some (basic) statistics about encoded text length, as well as other arguments supporting the use of UTF-8.
One that comes to mind for networking is that the UTF-8 representation is the same on all platforms, whereas with UTF-16 you have LE and BE variants depending on OS and CPU architecture.
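As a quick illustration of the size difference, here is a small self-contained C++ snippet of mine (the CJK sample is written out as explicit UTF-8 bytes so it compiles under any recent standard):

#include <iostream>
#include <string>

int main() {
    // ASCII-only text: 1 byte per character in UTF-8, 2 in UTF-16
    std::string    ascii8  = "hello";
    std::u16string ascii16 = u"hello";
    std::cout << ascii8.size() << " vs " << ascii16.size() * sizeof(char16_t) << " bytes\n";  // 5 vs 10

    // Two CJK characters (U+4F60 U+597D): 3 bytes each in UTF-8, 2 each in UTF-16
    std::string    cjk8  = "\xE4\xBD\xA0\xE5\xA5\xBD";  // the UTF-8 bytes, written out explicitly
    std::u16string cjk16 = u"\u4F60\u597D";
    std::cout << cjk8.size() << " vs " << cjk16.size() * sizeof(char16_t) << " bytes\n";      // 6 vs 4
}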
Is there any way to detect std::string encoding?
My problem: I have external web services which give data in different encodings. I also have a library which parses that data and stores it in a std::string. Then I want to display the data in a Qt GUI. The problem is that the std::string can hold text in different encodings: some strings can be converted using QString::fromAscii(), some with QString::fromUtf8().
I haven't looked into it in detail, but I did use Qt 3.3 in the past.
ASCII vs Unicode + UTF-8
UTF-8 is 8-bit, ASCII is 7-bit. I guess you can try to look at the byte values in the string array.
http://doc.qt.digia.com/3.3/qstring.html#ascii and http://doc.qt.digia.com/3.3/qstring.html#utf8
It seems ascii() returns an 8-bit ASCII representation of the string, but pure ASCII should only contain values from 0 to 127 or thereabouts. You would have to check every byte in the string, something like the sketch below.
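If you want to automate that guess, here is a minimal sketch of my own (not a Qt API): first check whether every byte is 7-bit ASCII, and failing that, whether the bytes at least follow UTF-8's lead/continuation structure:

#include <string>
#include <cstddef>

// Returns true if every byte is in the 7-bit ASCII range (0..127).
bool isAscii(const std::string& s) {
    for (std::size_t i = 0; i < s.size(); ++i)
        if (static_cast<unsigned char>(s[i]) > 0x7F) return false;
    return true;
}

// Rough structural check: does the byte sequence follow UTF-8 lead/continuation rules?
// (It does not reject overlong encodings or invalid code points, so it is only a heuristic.)
bool looksLikeUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if      (c < 0x80)           extra = 0;      // plain ASCII byte
        else if ((c & 0xE0) == 0xC0) extra = 1;      // lead byte of a 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2;      // lead byte of a 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3;      // lead byte of a 4-byte sequence
        else                         return false;   // not a valid lead byte
        if (i + extra >= s.size()) return false;     // truncated sequence
        for (std::size_t j = 1; j <= extra; ++j)
            if ((static_cast<unsigned char>(s[i + j]) & 0xC0) != 0x80) return false;
        i += extra + 1;
    }
    return true;
}

If isAscii() returns true, QString::fromAscii() is safe; if looksLikeUtf8() returns true, use QString::fromUtf8(); otherwise you are probably dealing with a legacy 8-bit code page and have to fall back to something like QString::fromLatin1(). Keep in mind this is a heuristic: a Latin-1 string can occasionally look like valid UTF-8.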
I would like to read data from an application/octet-stream; charset=binary file with fread on Linux and convert it to UTF-8 encoding. I tried iconv, but it doesn't support a "binary" charset. I haven't found any solution yet. Can anyone help me with it?
Thanks.
According to the MIME type you've given, you're reading data that's in a non-textual binary format. You cannot convert it with iconv or similar tools, because they are meant for converting text from one (textual) encoding to another. If your data is not textual, then conversion to any character encoding is meaningless and will just corrupt the data, not make it any more readable.
The typical way to present binary data as readable text for inspection is a hex dump. There's an existing answer on implementing one in C++: https://stackoverflow.com/a/16804835/2079303
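If you just want a quick look yourself, a minimal fread-based hex dump might look like this (the file name is a placeholder):

#include <stdio.h>

/* Minimal sketch: dump a file as hex, 16 bytes per line. */
int main(void) {
    FILE* f = fopen("data.bin", "rb");
    if (!f) { perror("fopen"); return 1; }
    unsigned char buf[16];
    size_t n, offset = 0;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        printf("%08zx  ", offset);              /* byte offset of this line */
        for (size_t i = 0; i < n; i++)
            printf("%02x ", buf[i]);            /* each byte as two hex digits */
        printf("\n");
        offset += n;
    }
    fclose(f);
    return 0;
}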
I have a function which requires me to pass a UTF-8 string pointed to by a char*, and I have a char pointer to a single-byte string. How can I convert the string to UTF-8 encoding in C++? Is there any code I can use to do this?
Thanks!
Assuming Linux, you're looking for iconv. When you open a converter (iconv_open), you pass the target and source encodings. If you pass an empty string as the source, it converts from the locale used on your system, which should match the file system.
On Windows, you have pretty much the same with MultiByteToWideChar, where you pass CP_ACP as the code page. But on Windows you can simply call the Unicode versions of the functions to get UTF-16 straight away, and then convert to UTF-8 with WideCharToMultiByte and CP_UTF8.
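A minimal sketch of that two-step conversion on Windows (system ANSI code page to UTF-16, then UTF-16 to UTF-8), with only basic error handling:

#include <windows.h>
#include <string>

// Sketch: convert a string in the system ANSI code page (CP_ACP) to UTF-8.
std::string AnsiToUtf8(const char* input) {
    // Step 1: ANSI -> UTF-16; -1 means "input is null-terminated".
    int wlen = MultiByteToWideChar(CP_ACP, 0, input, -1, NULL, 0);
    if (wlen <= 0) return std::string();
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_ACP, 0, input, -1, &wide[0], wlen);

    // Step 2: UTF-16 -> UTF-8. The last two arguments must be NULL for CP_UTF8.
    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1, NULL, 0, NULL, NULL);
    if (ulen <= 0) return std::string();
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1, &utf8[0], ulen, NULL, NULL);

    utf8.resize(ulen - 1);  // drop the terminating '\0' counted by the API
    return utf8;
}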
To convert a string to a different character encoding, use any of various character encoding libraries. A popular choice is iconv (the standard on most Linux systems).
However, to do this you first need to figure out the encoding of your input. Unfortunately there is no general solution to this: if the input does not specify its encoding (as web pages generally do), you'll have to guess.
As to your question: You write that you get the string from calling readdir on a FAT32 file system. I'm not quite sure, but I believe readdir will return the file names as they are stored by the file system. In the case of FAT/FAT32:
The short file names are encoded in some DOS code page; which code page depends on how the files were written, and AFAIK there's no way to tell from the file system alone.
The long file names are in UTF-16.
If you use the standard vfat Linux kernel module to access the FAT32 partition, you should get long file names from readdir (unless a file only has an 8.3 name). FAT32 stores the long file names internally in UTF-16; the vfat driver converts them to the encoding given by the iocharset= mount parameter (with the default being the default system encoding, I believe).
Additional information:
You may have to play with the mount options codepage and iocharset (see http://linux.die.net/man/8/mount ) to get filenames right on the FAT32 volume. Try to mount such that filenames are shown correctly in a Linux console, then proceed. There is some more explanation here: http://www.nslu2-linux.org/wiki/HowTo/MountFATFileSystems
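Once the volume is mounted with suitable codepage/iocharset options, a plain readdir loop returns the file names as bytes in that iocharset encoding. A minimal sketch (the mount point is a placeholder):

#include <dirent.h>
#include <stdio.h>

/* Sketch: list a mounted FAT32 directory; the byte encoding of d_name
   depends on the iocharset= mount option. */
int main(void) {
    DIR* dir = opendir("/mnt/fat32");
    if (!dir) { perror("opendir"); return 1; }
    struct dirent* entry;
    while ((entry = readdir(dir)) != NULL)
        printf("%s\n", entry->d_name);   /* bytes in the iocharset encoding */
    closedir(dir);
    return 0;
}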
I guess the top bit is set on some byte of your single-byte string, so the function you're passing it to is expecting more than one byte for that character.
First, print the string out in hex.
i.e.
#include <stdio.h>   /* printf */
#include <string.h>  /* strlen */
const char* str = "your string";
for (size_t i = 0; i < strlen(str); i++)
    printf("[%02x]", (unsigned char)str[i]);  /* each byte as two hex digits */
Now have a read of the Wikipedia article on UTF-8 encoding, which explains it well.
http://en.wikipedia.org/wiki/UTF-8
UTF-8 is variable width where each character can occupy from 1 to 4 bytes.
Therefore, convert the hex to binary and see what the code point is.
For example, if the first byte starts with 11110 (in binary), the decoder expects a 4-byte sequence. Since ASCII is 7-bit (0 to 127), the top bit is always zero, so an ASCII character is only ever a single byte. The bytes that follow the lead byte of a multi-byte UTF-8 character all start with 10 in their top bits; these are the continuation bytes, and that is what your function is complaining about: the continuation bytes are missing where they are expected.
So the string is not the pure ASCII you thought it was.
You can convert it using iconv, as someone suggested, or perhaps with this library: http://utfcpp.sourceforge.net/
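If the single-byte string is ISO-8859-1 (Latin-1), the conversion can even be done by hand, since every Latin-1 byte value maps directly to the Unicode code point with the same number. A minimal sketch, assuming Latin-1 input (for Windows-1252 or other code pages, use iconv or a proper library instead):

#include <string>

// Sketch: convert a Latin-1 (ISO-8859-1) string to UTF-8.
std::string latin1ToUtf8(const char* in) {
    std::string out;
    for (const unsigned char* p = reinterpret_cast<const unsigned char*>(in); *p; ++p) {
        if (*p < 0x80) {
            out += static_cast<char>(*p);                  // ASCII: copy as-is
        } else {
            out += static_cast<char>(0xC0 | (*p >> 6));    // lead byte of a 2-byte sequence
            out += static_cast<char>(0x80 | (*p & 0x3F));  // continuation byte
        }
    }
    return out;
}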