I have a large (megabytes) string in a QJsonValue, that I need to convert to QByteArray, as I am sending the string as data with a QNetworkRequest.
Currently I am doing this:
myQJsonObject["myQJsonValue"].toString().toUtf8()
Would this incur copying the same data in memory several times for some reason? If so, how would you go about implementing this without unnecessary copying?
Why don't you use QJsonDocument? It is meant for reading and writing JSON, and it provides the method QJsonDocument::toBinaryData.
That API should do all of this in the most efficient way.
Update in response to the comment:
A single JSON value must be one of the other JSON types: an object, a string, or some number. I'm pretty sure you have a JSON object.
So your code should look like this:
QJsonValue val = someJson["someKey"];
if (val.isObject()) {
    QJsonDocument doc(val.toObject());
    SendToServer(doc.toBinaryData());
} else {
    // error, or fall back to sending the string as UTF-8:
    SendToServer(val.toString().toUtf8());
}
The call to myQJsonObject["myQJsonValue"].toString() does not involve a data copy, thanks to the copy-on-write semantics of Qt.
The toUtf8 call is costly. QString stores the data as Unicode (16-bit QChars), and encoding it in UTF-8 involves more than a plain data copy.
QString::constData() returns a pointer to the underlying character array. But then, each character is represented by 2 bytes, instead of 1 to 3 bytes in UTF-8. For mostly-ASCII data this might mean sending roughly twice as much data over the network.
So if your data consists mostly of ASCII characters, UTF-8 is probably the better option. If it contains lots of non-ASCII characters, and the other side can handle UTF-16, then UTF-16 is worth considering.
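As a rough illustration of the two options (a minimal sketch; sendToServer() is a hypothetical stand-in for however you attach the data to your QNetworkRequest):

QString text = myQJsonObject["myQJsonValue"].toString();   // shallow, thanks to copy-on-write

// Option 1: re-encode as UTF-8 (compact for ASCII-heavy data, costs one encoding pass)
sendToServer(text.toUtf8());

// Option 2: ship the raw UTF-16 buffer (no re-encoding, 2 bytes per QChar)
QByteArray utf16(reinterpret_cast<const char*>(text.constData()),
                 text.size() * int(sizeof(QChar)));
sendToServer(utf16);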
I am working on code that receives a cpprest SDK JSON response containing a base64-encoded payload. Here is my code snippet:
typedef std::wstring string_t; // defined in basic_types.h in the cpprest lib
void demo() {
    http_response response;
    // code to handle the response ...
    json::value output = response.extract_json();
    string_t payload = output.at(L"payload").as_string();
    vector<unsigned char> base64_encoded_payload = conversions::from_base64(payload);
    std::string utf8_payload(base64_encoded_payload.begin(), base64_encoded_payload.end()); // in the debugger the Japanese chars are garbled
    string_t utf16_payload = utf8_to_utf16(utf8_payload); // in the debugger the Japanese chars look fine here
    // then I need to process the utf8_payload, which is XML
    // I have an API available to process the XML which takes a string
    processXML(utf16_payload); // need to convert utf16_payload to a string here
}
I also tried this and I see str contains garbled chars!
#include <codecvt> // for codecvt_utf8_utf16
#include <locale> // for wstring_convert
#include <string> // for string, wstring
void wstr2str(void) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> conversion;
    std::wstring japanese = L"北島 美奈";
    std::string str = conversion.to_bytes(japanese); // str is garbled :(
}
My question is: can UTF-8 text containing Japanese characters be converted to a std::string without being garbled?
Update: I gained access to the processXML() code and changed the input argument type to std::wstring, and it worked.
I figured out that when the XML was being created, it was converting the std::string to a wstring; however, that was not turning out well!
void processXML(std::wstring xmlStrBuf) { // changed xmlStrBuf to wstring and it worked
    // more code
    CComBSTR xmlBuff = xmlStrBuf.c_str();
    VARIANT_BOOL bSuccess = false;
    xmlDoc->loadXML(xmlBuff, &bSuccess);
    // more code
}
Thanks for the answers; they were helpful in pointing out that a string is only storage.
You are confusing different concepts here.
Storage
This is how we save/store/hold our data. A std::string is a collection of chars, which are bytes. A std::wstring is a collection of wchar_ts, which are sometimes 2-byte wide values (but this is not guaranteed!).
Encoding
This is what the data means, and how it should be interpreted. A std::string, a collection of bytes, could hold UTF-8, or UTF-16, or UTF-32, or ASCII, or ShiftJIS, or morse code, or a JPEG, or a movie, or my DNA (lucky string!).
There are some strong conventions in play in the world. For example, on Windows, a std::wstring is generally accepted to hold UTF-16 (because the two-byte storage is convenient for this, and also because that's how the Windows API does it).
Newer versions of C++ give us things like std::u16string and std::u32string as well, which still do not directly have any notion of encoding, but are intended to be used for UTF-16 and UTF-32 respectively because their names make that intention more obvious to readers of code. C++20 will introduce std::u8string, which is intended to signify a UTF-8 encoded string (and is otherwise more or less like a std::string).
But these are just conventions. Nothing about the type std::string says "UTF-8" or any other thing. It doesn't know about or care about or enforce any encoding. It just stores bytes.
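A tiny illustration of that point (the byte values are just an example):

#include <iostream>
#include <string>

int main() {
    // std::string just stores these two bytes; it attaches no meaning to them.
    std::string bytes = "\xC3\xA9";
    std::cout << bytes.size() << '\n';  // prints 2: it counts bytes, not characters
    // Interpreted as UTF-8, the two bytes form the single character 'é';
    // interpreted as ISO 8859-1, they are the two characters "Ã©".
}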
So, your question about "converting UTF-8 to std::string" does not really make any sense; it's like asking how to convert a road into a car.
"What should I do, then?"
Well, Base64 is also not an encoding. Actually, it totally is, but it's an encoding on top of the string encoding: a way of transmitting/escaping/sanitising the raw bytes, not a way of describing how to interpret them later. Asking cpprest to convert from Base64 just transforms the way the raw bytes are provided. That's why it gives you a std::vector<unsigned char> rather than a std::string: although (as discussed above) std::string doesn't care about encoding, we sometimes use a vector of bytes to really, properly, completely say "this collection does not have any particular encoding, so please don't try to guess the encoding from convention or anything else; all it holds is a bunch of bytes". This is down to opinion; some people will still use a std::string for that, but the authors of cpprest decided not to.
The point is that the use of the function from_base64 cannot tell us anything about the encoding of the text that you've retrieved. For that, we have to go back to the documentation for the text. We have no access to that, and you did not tell us anything about it. If it were just a JSON string, the encoding would be down to the cpprest JSON library and so you'd already be done. However, it's not: it's something packed into a Base64 representation by whoever created the JSON object. Again, that information is not something that you shared with us.
But, based on the variable names you've chosen, the data you're looking at is already UTF-8. You've then attempted to convert it to UTF-16, which is rather the opposite of what you've described you wanted to do.
(Similarly, in your second example, you've taken a std::wstring that [probably] already stores UTF-16 thanks to the L"wide string literal", then told the computer that it's UTF-8 and to convert it "again" to UTF-16, then extracted the raw bytes into a std::string. None of that makes sense.)
Instead, why not literally just processXML(utf8_payload);?
General advice
Encoding can be quite complex, although it's significantly easier to deal with once you've wrapped your mind around the basic concepts of all these layers of abstraction. For the future, and for this question if you wish to clarify it, you will need to ensure that you are absolutely clear, at each stage of the "pipeline" of your data as it gets transmitted from place A to place B, and gets converted from type C to type D, and whatever else, about what encoding it should be at each of those steps. If you want to change the encoding at one of those steps, then do so (though this should be rare!). But before you write any code make sure that you know for sure what it is that you need, otherwise you'll get yourself in a massive tangle.
Eventually you'll start to detect patterns that can help, though. For example, if you were expecting some delicious non-ASCII output and instead see strange text with lots of "Ã" characters in it, that's probably UTF-8 that's being interpreted as ASCII by mistake. That's because of the way that the special sequence denoting Unicode codepoints larger than one byte in UTF-8 often starts with a byte whose numerical value is the same as that of the letter "Ã" in ASCII (well, ISO/IEC 8859, but close enough).
Similarly, if you get Japanese and didn't expect it, in my experience that's usually because you've given the computer some bytes and told it that they are a string in UTF-16 encoding, when actually they were UTF-8. You just get more experienced at recognising these patterns as you work more, and it can help you to fix your bugs faster.
Just last week the last example there saved me quite a bit of time: I knew immediately that my source data must have been UTF-8, and was therefore able to quickly decide to remove the byte-copy into a std::wstring that I'd been attempting. Examining the bytes in an encoding-agnostic way revealed the "Ã" pattern as well, and then that was that. This was important because I had no documentation for the data source and thus no way to just look up what the encoding was supposed to be. I had to guess/deduce it. Hopefully that won't be the case for you here.
std::string is just a container for 8-bit wide chars and does not know or care about the encoding. Always think in symbols (letters, numbers, punctuation, etc.). The first 128 characters (0-127) were defined by the ASCII standard, thus requiring a single char to store each symbol. With all the languages and symbols there are, we cannot represent each of them with just 256 possibilities. The UTF-8 encoding deals with this problem by allowing a single symbol to be 1, 2, 3 or 4 chars wide. But, for the std::string object, this is entirely transparent: it is still dealing with a series of chars.
The reason you think the string is garbled is probably that your debugger assumes the contents of the std::string are always 1 symbol per char (extended ASCII, for example), and as such it displays the wrong characters.
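To make that concrete, here is a small self-contained illustration (the character is just an example):

#include <iostream>
#include <string>

int main() {
    // "北" (U+5317) is one symbol, but in UTF-8 it occupies three chars of storage.
    std::string kita = "\xE5\x8C\x97";
    std::cout << kita.size() << '\n';  // prints 3: std::string counts bytes, not symbols
    // A debugger that assumes one symbol per char will show three junk characters here,
    // even though the data itself is perfectly valid UTF-8.
}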
Edit: you might want to read this post also.
I want to test serialized-data conversion in my application. Currently the object is stored in a file, and the binary file is read back to reload the object.
In my unit test case I want to test this operation. As the file operations are costly, I want to hard-code the binary file content in the code itself.
How can I do this?
Currently I am trying like this,
std::string FileContent = "\00\00\00\00\00.........";
and it is not working.
You're right that a string can contain '\0', but here you're still initializing it from a const char*, which, by definition, stops at the first '\0'. I'd recommend using uint8_t[] or even uint32_t[] (that is, without going through std::string), even if the second might have up to 3 bytes of overhead (but it's more compact in source form). That's how X bitmaps are usually stored, for example.
Another possibility is base64 encoding, which is printable but needs (relatively quick) decoding.
If you really want to put the const char[] into a std::string, first convert it to a const char* pointer, then use the two-iterator constructor of std::string. While it's true that std::string can hold '\0', storing binary data in a string is somewhat of an antipattern, so I'm not giving the exact code, just the hint.
The following should do what you need; however, it's probably not recommended, as most people wouldn't expect a std::string to contain null bytes.
std::string FileContent { "\x00\x00\x00\x00\x00", 5 };
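For what it's worth, a quick check of that approach (a small sketch; the byte values are arbitrary):

#include <cassert>
#include <string>

int main() {
    // The (pointer, length) constructor keeps embedded null bytes.
    std::string fileContent{"\x00\x01\x00\x02\x00", 5};
    assert(fileContent.size() == 5);   // all five bytes are stored, nulls included
    assert(fileContent[0] == '\x00');
}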
I have this code which works:
QString qs = QString::fromUtf8(bp,ut).at(0);
QChar c(qs[0]);
Where bp is a QByteArray::const_pointer, and ut is the maximum expected length of the UTF-8 encoded Unicode code-point.
I then grab the first QChar c from the QString qs.
It seems that there should be a more efficient way to simply get only the next QChar from the UTF-8 byte array without having to convert an arbitrary amount of the QByteArray into a QString and then getting only the first QChar.
EDIT From the comments below, it is clear that no one yet understands my question. So I will start with some basics. UTF-8 and UTF-16 are two different encodings of the world-standard Unicode. The most common and encouraged Unicode encoding for transfer over the Internet and for Unicode text files is UTF-8, which results in every Unicode code-point using 1 to 4 bytes. UTF-16, on the other hand, is more convenient for handling characters inside a program. Therefore the vast majority of software out there is making the conversion between these two encodings all the time. A QChar is the more convenient UTF-16 encoding of all the Unicode code-points from 0x00 to 0xffff, which covers the majority of the languages and symbols so far defined and in common use. Surrogate pairs are used for the higher Unicode code-point values. At present surrogate pairs seem to have limited support, and they are not of interest for the present question.
When you read a text file into a QPlainTextEdit the conversion is done automatically and behind the scenes. Reading a QString from a QByteArray can also be done automatically (provided your locale and codec settings are set for UTF-8), or they can be done explicitly using toUtf8() or fromUtf8() as in my code above.
The conversion in the other direction can efficiently be done implicitly (behind the scenes) or explicitly with the following code:
ba += *si; // Depends on the UTF-8 codec
or
ba += QString(*si).toUtf8(); // UTF-8 explicitly
where ba is a QByteArray and si is a QString::const_iterator. These do exactly the same thing (assuming the codec is set to UTF-8): they both convert the next (one) character from the QChar pointed to within a QString, appending one or more bytes to ba.
All I am trying to do is the inverse conversion for only one character at a time, efficiently. Internally this is being done for every character being converted, and I'm sure it is being done very efficiently.
The problem with QString::fromUtf8(p,n) is that n is the number of bytes to process rather than the number of characters to convert. Therefore you must allow for the largest number of bytes which could be 3 (or 4 if it actually handled surrogate pairs). So if all you want is the next character, you must be prepared to process several bytes, and they do get converted and then are discarded if the result is a QString with more than one character.
Q: Is there a conversion function that does this one character at a time?
You want to use QTextDecoder.
It is, according to the documentation:
The QTextDecoder class provides a state-based decoder.
A text decoder converts text from an encoded text format into Unicode using a specific codec.
The decoder converts text in this format into Unicode, remembering any state that is required between calls.
The important thing here is state. QString and QTextCodec are stateless, so they work on entire strings, start to end.
QTextDecoder, on the other hand, allows you to work on byte buffers one byte at a time, maintaining a state between calls so the caller knows if an UTF-8 sequence has been only partially decoded.
For example:
QTextDecoder decoder(QTextCodec::codecForName("UTF-8"));
QString result;
for (int i = 0; i < bytearray.size(); i++) {
    result = decoder.toUnicode(bytearray.constData() + i, 1);
    if (!result.isEmpty()) {
        break; // we got our character!
    }
}
The rationale behind this loop is that as long as the decoder is not able to decode a complete UTF-8 character, it will return an empty string.
As soon as it is able to, the result string will contain the one decoded unicode character.
This loop is as efficient as possible, and by remembering the loop index, subsequent characters can be obtained in the same way.
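Building on that idea, a helper that hands back one character at a time might look roughly like this (a sketch only; the function name and the position parameter are illustrative, not part of Qt, and surrogate pairs are ignored as in the question):

// Decode the next character from 'bytes' starting at *pos, advancing *pos past
// the bytes that were consumed. Returns a null QChar at the end of the input.
QChar nextChar(QTextDecoder &decoder, const QByteArray &bytes, int *pos)
{
    QString result;
    while (*pos < bytes.size() && result.isEmpty()) {
        result = decoder.toUnicode(bytes.constData() + *pos, 1); // feed one byte
        ++(*pos);
    }
    return result.isEmpty() ? QChar() : result.at(0);
}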
I am compressing a string, and the compressed string sometimes has NULL characters inside it before the terminating NULL. I want to return the string up to the terminating NULL, but the compressor function only returns the string up to the first NULL. I asked a question about this for C before; now I also need the solution in C++, and next in C#. Please help me. Thanks.
char* compressor(char* str)
{
    char *compressed_string;
    // After some calculation
    compressed_string = "bk`NULL`dk"; // at the end there is an automatic NULL, as we all know
    return compressed_string;
}
void main()
{
    char* str;
    str = compressor("Muhammad Ashikuzzaman");
    printf("Compressed Value = %s", str);
}
The output is: Compressed Value = bk
All the other characters from the compressor function are missing. Is there any way to show the whole string?
The fundamental problem that you have is that compression algorithms operate on binary data rather than text. If you compress something, then expect some of the compressed bytes to be zero. Thus the compressed data cannot be stored in a null-terminated string.
You need to change your mindset to work with binary data.
To compress, do the following:
1. Convert from text to binary using some well-defined encoding, for instance UTF-8. This will yield an array of unsigned char.
2. Compress that unsigned char array, which will again yield an array of unsigned char, but now compressed.
To decompress, you just reverse these steps.
Since you are writing C++ code, you would be well advised to use standard containers such as std::string or std::wstring and std::vector<T>.
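For illustration, here is a minimal sketch of the compress step (assuming zlib is available, though any byte-oriented compressor works the same way; the function name is just an example):

#include <zlib.h>   // assumed available
#include <string>
#include <vector>

// Compress UTF-8 text as raw bytes. The output may contain zero bytes anywhere,
// so its length must be carried alongside the data, never inferred from a '\0'.
std::vector<unsigned char> compressBytes(const std::string &utf8Text)
{
    const unsigned char *src = reinterpret_cast<const unsigned char *>(utf8Text.data());
    uLong srcLen = static_cast<uLong>(utf8Text.size());

    uLongf destLen = compressBound(srcLen);            // worst-case output size
    std::vector<unsigned char> compressed(destLen);
    if (compress(compressed.data(), &destLen, src, srcLen) != Z_OK)
        return {};                                      // compression failed
    compressed.resize(destLen);                         // keep the real length explicitly
    return compressed;
}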
The exact same principles apply in all languages. When you come to code this in C#, you need to convert from text to binary. Use Encoding.GetBytes() to do that. That yields a byte array, byte[]. Compress that to another byte array. And so on.
But you really must first overcome this desire to attempt to store binary data in text data types.
I am trying to convert a std::string buffer - containing data from a bitmap file - to std::wstring.
I am using MultiByteToWideChar, but that does not work, because the function stops after it encounters the first '\0' character. It seems to interpret it as the end of the string.
When I don't pass -1 as the length parameter, but the real length of the data in the std::string buffer, it messes up the Unicode string with characters that definitely did not appear at that position in the original string...
Do I have to write my own conversion function?
Or maybe should I keep the data as a plain char array, because the special symbols will be converted incorrectly?
With regards
There are many, many things that will fail with this approach. Among other things, extra bytes may be added to your data without your realizing it.
It's odd that your only option takes a std::wstring(). If this is a home-grown library, you should take the trouble to write a new function. If it's not, make sure there's nothing more suitable before writing your own.