Curl replacing \u with \\u in the response in C++

I am sending a request using libcurl on Windows, and the response I get has some universal characters in it that start with \u. Libcurl is not recognizing these universal characters and, as a result, it escapes the \, turning the universal character into \\u.
Is there any way to fix this? I have tried using str.replace, but it cannot replace escaped sequences.
The code I used to implement this was:
#include <iostream>
#include <string>
#include <cpr/cpr.h>

int main()
{
    auto r = cpr::Get(cpr::Url{"http://prayer.osamaanees.repl.co/api"});
    std::string data = r.text;
    std::cout << data << std::endl;
    return 0;
}
This code uses the cpr library which is a wrapper for curl.
It prints out the following:
{
"times":{"Fajr":"04:58 AM","Sunrise":"06:16 AM","Dhuhr":"12:30 PM","Asr":"04:58 PM","Maghrib":"06:43 PM","Isha":"08:00 PM"},
"date":"Tuesday, 20 Mu\u1e25arram 1442AH"
}
Notice the word Mu\u1e25arram; it should have been Muḥarram, but since curl escaped the \ before u, it prints out as \u1e25.

Your analysis is wrong. Libcurl is not escaping anything. Load the URL in a web browser of your choosing and look at the raw data that is actually being sent; Firefox's raw view, for example, shows the same escaped text.
The server really is sending Mu\u1e25arram, not Muḥarram like you are expecting. And this is perfectly fine, because the server is sending back JSON data, and JSON is allowed to escape Unicode characters like this. Read the JSON spec, particularly Section 9 on how Unicode codepoints may be encoded using hexadecimal escape sequences (which is optional in JSON, but still allowed). \u1e25 is simply the JSON hex-escaped form of ḥ.
You are merely printing out the JSON content as-is, exactly as the server sent it. You are not actually parsing it at all. If you were to use an actual JSON parser, Mu\u1e25arram would be decoded to Muḥarram for you; Firefox's JSON viewer, for example, shows the decoded value.
It is not libcurl's job to decode JSON data. Its job is merely to give you the data that the server sends. It is your job to interpret the data afterwards as needed.
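For completeness, a sketch of that missing parsing step, assuming the nlohmann/json library is available alongside cpr (the "date" field name is taken from the response shown above):

#include <iostream>
#include <string>
#include <cpr/cpr.h>
#include <nlohmann/json.hpp>

int main()
{
    auto r = cpr::Get(cpr::Url{"http://prayer.osamaanees.repl.co/api"});
    nlohmann::json j = nlohmann::json::parse(r.text);

    // The parser turns the JSON escape \u1e25 back into the character ḥ,
    // stored as UTF-8 in the std::string.
    std::string date = j["date"];
    std::cout << date << std::endl;   // the console still has to be able to display UTF-8
    return 0;
}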

I would like to thank Remy for pointing out how wrong I was in thinking curl or the JSON parser was the problem, when in reality I needed to switch my console to UTF-8 mode.
It was after I fixed my codepage that I was able to get the output I wanted.
For future reference, I am adding the code that fixed my problem:
We need to include Windows.h
#include <Windows.h>
Then at the start of our code:
UINT oldcp = GetConsoleOutputCP();
SetConsoleOutputCP(CP_UTF8);
After this we need to reset the console back to the original codepage with:
SetConsoleOutputCP(oldcp);
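Put together, a minimal sketch of the whole fix (Windows only), combining the snippets above with the original request code:

#include <iostream>
#include <string>
#include <Windows.h>
#include <cpr/cpr.h>

int main()
{
    UINT oldcp = GetConsoleOutputCP();
    SetConsoleOutputCP(CP_UTF8);        // let the console interpret program output as UTF-8

    auto r = cpr::Get(cpr::Url{"http://prayer.osamaanees.repl.co/api"});
    std::cout << r.text << std::endl;

    SetConsoleOutputCP(oldcp);          // restore the original codepage before exiting
    return 0;
}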

Related

How to use Unicode characters in x.dump() in nlohmann::json?

I'm trying to use Unicode characters (like משתמש לא רשום/קיים and other languages).
But when I try to do json_object.dump(), it throws an exception:
{_Data={_What=0x00c80b08 "[json.exception.type_error.316] invalid UTF-8 byte at index 1: 0xF9" _DoFree=...} }
The minimal reproducible example:
#include <nlohmann/json.hpp>
#include <iostream>
#include <string>

using JSON = nlohmann::json;

int main()
{
    std::string str = "משתמש לא רשום/קיים";
    JSON json_obj;
    json_obj["message"] = str;
    std::cout << json_obj.dump();
}
For some reason it works in the "minimal reproducible example", but not in my project. Maybe the problem is somewhere else, but I don't know where...
In one sentence: How do I support other languages' characters in nlohmann::json?
nlohmann::json works with UTF-8 std::string(s) no matter your platform encoding.
This means that you should either encode your source files as UTF-8 or convert the content of str to UTF-8 at runtime.
The same goes for every string you pass to nlohmann or get from nlohmann.
On Unix, the platform encoding is generally UTF-8.
If your program runs on Windows, however, the debugger and system API(s) probably expect that you store ANSI-encoded text in std::string(s), which means you'll have to do manual conversions, as sketched below.
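For the Windows case, a sketch of such a runtime conversion (assuming str really holds text in the active ANSI codepage; error handling omitted):

#include <string>
#include <Windows.h>

// Convert a string in the active ANSI codepage to UTF-8 by going through UTF-16.
std::string ansi_to_utf8(const std::string& ansi)
{
    if (ansi.empty()) return std::string();

    int wlen = MultiByteToWideChar(CP_ACP, 0, ansi.data(),
                                   static_cast<int>(ansi.size()), nullptr, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_ACP, 0, ansi.data(),
                        static_cast<int>(ansi.size()), &wide[0], wlen);

    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen,
                                   nullptr, 0, nullptr, nullptr);
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen,
                        &utf8[0], ulen, nullptr, nullptr);
    return utf8;
}

With something like this, json_obj["message"] = ansi_to_utf8(str); would hand nlohmann a properly encoded string.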

C++ output Unicode in variable

I'm trying to output a string containing Unicode characters, which is received with a curl call. Therefore, I'm looking for something similar to the u8 and L prefixes for literal strings, but then applicable to variables. E.g.:
const char *s = u8"\u0444";
However, since I have a string containing unicode characters, such as:
mit freundlichen Grüßen
When I want to print this string with:
cout << UnicodeString << endl;
it outputs:
mit freundlichen Gr??en
When I use wcout, it returns me:
mit freundlichen Gren
What am I doing wrong, and how can I achieve the correct output? I return the output with RapidJSON, which returns the string as:
mit freundlichen Gr��en
Important to note, the application is a CGI running on Ubuntu, replying to browser requests.
If you are on Windows, what I would suggest is using Unicode UTF-16 at the Windows boundary.
It seems to me that on Windows with Visual C++ (at least up to VS2015) std::cout cannot output UTF-8-encoded text, but std::wcout correctly outputs UTF-16-encoded text.
This compilable code snippet correctly outputs your string containing German characters:
#include <fcntl.h>
#include <io.h>
#include <iostream>

int main()
{
    _setmode(_fileno(stdout), _O_U16TEXT);

    // ü : U+00FC
    // ß : U+00DF
    const wchar_t * text = L"mit freundlichen Gr\u00FC\u00DFen";
    std::wcout << text << L'\n';
}
Note the use of a UTF-16-encoded wchar_t string.
On a more general note, I would suggest using the UTF-8 encoding (and, for example, storing text in std::strings) in the cross-platform portions of your C++ code, and converting to UTF-16-encoded text at the Windows boundary.
To convert between UTF-8 and UTF-16 you can use Windows APIs like MultiByteToWideChar and WideCharToMultiByte. These are C APIs that can be safely and conveniently wrapped in C++ code (more details can be found in this MSDN article, and you can find compilable C++ code here on GitHub).
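As a sketch of that wrapping idea (error handling omitted; the helper name is just illustrative):

#include <string>
#include <Windows.h>

// Convert a UTF-8 encoded std::string to a UTF-16 encoded std::wstring.
std::wstring utf8_to_utf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();

    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                   static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring utf16(wlen, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &utf16[0], wlen);
    return utf16;
}

Combined with the _setmode call shown above, std::wcout << utf8_to_utf16(s) then prints a UTF-8 string received from curl correctly on the Windows console.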
On my system the following produces the correct output. Try it on your system. I am confident that it will produce similar results.
#include <string>
#include <iostream>
using namespace std;

int main()
{
    string s = "mit freundlichen Grüßen";
    cout << s << endl;
    return 0;
}
If it is ok, then this points to the web transfer not being 8-bit clean.
Mike.
containing unicode characters
You forgot to specify which Unicode encoding the string contains. There is the "narrow" UTF-8, which can be stored in a std::string and printed using std::cout, as well as wider variants, which can't. It is crucial to know which encoding you're dealing with. For the remainder of my answer, I'm going to assume you want to use UTF-8.
When I want to print this string with:
cout << UnicodeString << endl;
EDIT:
Important to note, the application is a CGI running on Ubuntu, replying on browser requests
The concerns here are slightly different from printing onto a terminal.
You need to set the Content-Type response header appropriately or else the client cannot know how to interpret the response. For example Content-Type: application/json; charset=utf-8.
You still need to make sure that the source string is in fact the correct encoding corresponding to the header. See the old answer below for overview.
The browser has to support the encoding. Most modern browsers have supported UTF-8 for a long time now.
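A minimal CGI-style sketch of the first two points (the JSON payload is just an illustrative stand-in):

#include <iostream>
#include <string>

int main()
{
    // \xC3\xBC is "ü" and \xC3\x9F is "ß" in UTF-8, so the body bytes really
    // match the charset announced in the header.
    std::string body = "{\"greeting\":\"mit freundlichen Gr\xC3\xBC\xC3\x9F" "en\"}";

    std::cout << "Content-Type: application/json; charset=utf-8\r\n\r\n";
    std::cout << body;
    return 0;
}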
Answer regarding printing to terminal:
Assuming that
1. UnicodeString indeed contains a UTF-8 encoded string,
2. the terminal uses UTF-8 encoding,
3. and the font that the terminal uses has the graphemes that you use,
the above should work.
it outputs:
mit freundlichen Gr??en
Then it appears that at least one of the above assumptions doesn't hold.
You can verify whether 1. is true by inspecting the numeric value of each code unit separately and comparing it to what you would expect of UTF-8. If 1. isn't true, then you need to figure out what encoding the string actually uses, and either convert the encoding or configure the terminal to use that encoding.
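A quick way to do that inspection (a sketch; the helper name is made up):

#include <cstdio>
#include <string>

// Print each code unit of the string as a hex byte. In valid UTF-8, "ü" should
// show up as C3 BC and "ß" as C3 9F; a lone FC or DF byte instead points to
// ISO-8859-1 / Windows-1252 data.
void dump_bytes(const std::string& s)
{
    for (unsigned char c : s)
        std::printf("%02X ", c);
    std::printf("\n");
}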
The terminal typically, but not necessarily, uses the system native encoding. The first step of figuring out what encoding your terminal / system uses is to figure out what terminal / system you are using in the first place. The details are probably in a manual.
If the terminal doesn't use UTF-8, then you need to convert the UTF-8 string within your program into the character encoding that the terminal does use - unless that encoding doesn't have the graphemes that you want to print. Unfortunately, the standard library doesn't provide arbitrary character encoding conversion support (there is some support for converting between narrow and wide Unicode, but even that support is deprecated). You can find the Unicode standard here, although I would like to point out that using an existing conversion implementation can save a lot of work.
In the case where the character encoding of the terminal doesn't have the needed graphemes - or if you don't want to implement encoding conversion - the alternative is to re-configure the terminal to use UTF-8. If the terminal / system can be configured to use UTF-8, there should be details in the manual.
You should be able to test whether the font itself has the required graphemes simply by typing the characters into the terminal and seeing if they show as they should - although this test will also fail if the terminal encoding does not have the graphemes, so check that first. The manual of your terminal should explain how to change the font, should it be necessary. That said, I would expect üß to exist in most fonts.

European characters switch to strange characters in response when posting to server using C++

I am struggling to get the response from the server in the correct format under Windows. I have tried two C++ libraries, Beast (based on Boost Asio) and Cpr (based on libcurl), and I get the exact same issue with both.
The strange thing is that I also tried this in C# (HttpClient) and everything works just fine. Also, in Postman and other REST tools it looks good.
When I post to the server and should get back the name René, I get Ren� instead. Other European characters like æ,ø,å,ö give the same strange output. To me it looks like an issue with UTF-8 / ISO-8859-1, but I cannot figure it out. The server (based on node.js) is set to push out UTF-8 in the response. We have tried to just redirect the response so it does not hit a database or anything like that. So the problem seems to be on the C++ side. Any suggestions as to what I can try would be greatly appreciated.
Example code:
nlohmann::json test_json = nlohmann::json
{
    { "text", "Hi, my name is René" },
    { "language", "en" }
};

auto r = cpr::Post(cpr::Url{ "http://www.exampleserver.com" },
                   cpr::Body{ test_json.dump() },
                   cpr::Header{ { "content-type", "application/json; charset=utf-8" } });

std::cout << r.text << std::endl;
It looks like you've got some ISO-8859-1 content being sent through but it's labelled as UTF-8. This causes a whole rash of conversion errors which can mangle non-ASCII characters beyond recognition.
The way to fix this is to either identify the non-UTF-8 data and properly convert it, or identify the payload with the correct MIME type and encoding.
Your issue is with the encoded string. The string is most likely coming back UTF-8 encoded but you are not converting it properly.
There are various libraries that help you convert. It all depends on the version of C++ you're using. Hard to tell you what to use without more details.
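If it turns out the data really is ISO-8859-1 text mislabelled as UTF-8, the conversion is small enough to write by hand; a sketch (assuming pure Latin-1 input, where every byte value equals its Unicode code point):

#include <string>

std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);                 // ASCII passes through unchanged
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));   // two-byte UTF-8 sequence
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}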

Can I decode € (euro sign) as a char and not as a wstring/wchar?

Let me try to explain my problem. I have to receive a message from a server (programmed in Delphi) and do some things with that message on the client side (which is the side I program, in C++).
Let's say that the message is: "Hello €". That means I have to work with std::wstring, as € (euro sign) needs 2 bytes instead of 1 byte, so knowing that, I have done all my work with wstrings, and if I set the message myself it works fine. Now I have to receive the real one from the server, and here comes the problem.
The person on the server side is sending that message as a string. He uses an EncodeString() function in Delphi and he says that he is not gonna change it. So my question is: if I decode that string into a string in C++, and then convert it into a wstring, will it work? Or will I have problems and end up with some other message in my string var instead of "Hello €"?
If yes, if I can receive that string with no problem, then I have another problem. The function that I have to use to decode the string is void DecodeString(char *buffer, int length);
so normally if you receive a text, you do something like:
char Text[255];
DecodeString(Text, length); // length is a number decoded before
So... can I decode it with no problem and have the "Hello €" message in Text? With that I'll just need to convert it and get the wstring.
Thank you
EDIT:
I'll add another example. If I know that the server is always going to send me a text of length 30 max, on the server they do something like:
EncodeByte(lengthText);
EncodeString(text)
and in the client you do:
int length;
char myText[30];
DecodeByte(length);
DecodeString(myText,length);
and then you can work with myText as a string later.
Hope that helps a little more. I'm sorry for not having more information, but I'm new to this work and I don't know much more about the server.
EDIT 2
Trying to summarize... The thing is that I have to receive a message and do something with it, and with the tool I mentioned I have to decode it. So as DecodeString() needs a char buffer and I need a wstring, I just need a way to get the data received from the server, decode it with DecodeString() and get it into a wstring. But I don't really know if it's possible, and if it is, I'm not sure how to do it or what types of vars to use to get it.
EDIT 3
Finally! I know what code pages we are using. It seems that the client uses the ANSI ones and that the server doesn't, so... I'll have to tell the person who does that part to change it to the ANSI ones. Thanks everybody for helping me with my big big ignorance about the existence of code pages.
Since you're using wstring, I guess that you are on Windows (wstring isn't popular on *nix).
If so, you need the Delphi app to send you UTF-16, which you can use in the wstring constructor. Example:
const char input[] = "\xac\x20\x00"; // UTF-16LE bytes for the euro sign (U+20AC), plus padding so the wchar_t string is null-terminated
const wchar_t* input2 = reinterpret_cast<const wchar_t*>(input);
wstring ws(input2);
If you're Linux/Mac, etc, you need to receive UTF-32.
This method is far from perfect though. There can be pitfalls and edge cases for code points beyond 0xFFFF (Chinese, etc.). Supporting that properly probably requires a PhD.
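If the Delphi side can indeed be made to send UTF-16, a slightly safer variant of the same idea copies the decoded bytes instead of reinterpreting the pointer (a sketch assuming a 2-byte wchar_t, i.e. Windows, and that the buffer really holds UTF-16LE data; the helper name is made up):

#include <cstring>
#include <string>

std::wstring utf16_bytes_to_wstring(const char* buffer, int byteLength)
{
    std::wstring ws(byteLength / 2, L'\0');
    std::memcpy(&ws[0], buffer, byteLength);   // avoids the alignment issues of reinterpret_cast
    return ws;
}

With the calls from the question, and assuming length is a byte count, that would be used as: DecodeString(Text, length); std::wstring ws = utf16_bytes_to_wstring(Text, length);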

C++ character encoding when converting from string to const char* for Ruby FFI interface

I am using an external C++ lib that does some HTTPS communication and provides the XML server response. On the server side the response is encoded via ISO-8859-15, and I get a std::string that represents that response out of the API. When I print it out / write it to a file, it looks proper.
The std::string and an int error code have to be passed to my external caller. So I return both values inside a struct:
extern "C" {
struct FoobarResponse {
const char* responseText;
int returnCode;
};
}
Unfortunately I have to convert the std::string response into a const char* C-style string representation with the help of std::string::c_str() beforehand. Reason: my caller is a Ruby script making use of Ruby FFI to communicate with my C++ lib, and the interlanguage type conversion here is Ruby::string -> C::const char*.
Interesting here: If I std::cout the converted string after I put it into the struct, it is still ok.
The problem: When handling the server response on Ruby side, it is broken. Instead of the original answer like:
<?xml version="1.0" encoding="ISO-8859-15"?>
<Foobar xmlns="http://www.foobar.com/2012/XMLSchema">
...
</Foobar>
I receive a string obviously containing non printable characters which is always broken at the beginning and at the end.
?O[
l version="1.0" encoding="ISO-8859-15"?>
<Foobar xmlns="http://www.foobar.com/2012/XMLSchema">
</Fo??
In fact the string contains linebreaks, carriage returns and tabs at least, maybe more.
I tried to :force_encoding the string on Ruby side as ASCII-8BIT, ISO-8859-15 and UTF-8, no change.
I tried to base64 encode on C++ side before putting the string into the struct and base64 decode on Ruby side using this code, no change.
I had countless attempts to convert the string using Iconv as well, no change.
I also tried to remove non printable characters from the string before putting it into the struct, but I failed on that.
I have no idea what is going on here and running out of options.
Can someone point me into the right direction?
Regards
Felix
The buffer returned by c_str() is destroyed as soon as the std::string goes out of scope.
If you intend to pass this value to your script, you should allocate memory and copy the string into your newly allocated space. See this example: http://www.cplusplus.com/reference/string/string/c_str/
You should also ensure the Ruby script will correctly release the memory.
I think this is what is explained here: https://github.com/ffi/ffi/wiki/Examples.
Example with a struct passed to Ruby from C:
https://github.com/ffi/ffi/wiki/Examples#-structs
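A sketch of that copy-and-free approach on the C++ side (the function names here are made up, not part of any existing API; the Ruby side would attach the free function via FFI and call it once the response text has been read):

#include <cstring>
#include <string>

extern "C" {
    struct FoobarResponse {
        const char* responseText;
        int returnCode;
    };

    // Used internally on the C++ side: copy the std::string into heap memory
    // that outlives the call, so the pointer handed to Ruby stays valid.
    void fill_response(FoobarResponse* out, const std::string& body, int code)
    {
        char* copy = new char[body.size() + 1];
        std::memcpy(copy, body.c_str(), body.size() + 1);
        out->responseText = copy;
        out->returnCode = code;
    }

    // Exposed to Ruby via FFI; to be called when the response is no longer needed.
    void foobar_free_response(FoobarResponse* resp)
    {
        delete[] resp->responseText;
        resp->responseText = nullptr;
    }
}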