Can I use JsonCpp to partially-validate JSON input?

I'm using JsonCpp to parse JSON in C++.
e.g.
Json::Reader r;
std::stringstream ss;
ss << "{\"name\": \"sample\"}";
Json::Value v;
assert(r.parse(ss, v)); // OK
assert(v["name"] == "sample"); // OK
But my actual input is a whole stream of JSON messages, which may arrive in chunks of any size; all I can do is get JsonCpp to try to parse my input, character by character, eating up full JSON messages as we discover them:
Json::Reader r;
std::string input = "{\"name\": \"sample\"}{\"name\": \"aardvark\"}";
for (size_t cursor = 0; cursor < input.size(); cursor++) {
    std::stringstream ss;
    ss << input.substr(0, cursor);
    Json::Value v;
    if (r.parse(ss, v)) {
        std::cout << v["name"] << " ";
        input.erase(0, cursor);
    }
} // Output: sample aardvark
This is already a bit nasty, but it does get worse. I also need to be able to resync when part of an input is missing (for any reason).
Now it doesn't have to be lossless, but I want to prevent an input such as the following from potentially breaking the parser forever:
{"name": "samp{"name": "aardvark"}
Passing this input to JsonCpp will fail, but that problem won't go away as we receive more characters into the buffer; that second name is simply invalid directly after the " that precedes it; the buffer can never be completed to present valid JSON.
However, if I could be told that the fragment certainly becomes invalid as of the second n character, I could drop everything in the buffer up to that point, and then simply wait for the next { to consider the start of a new object, as a best-effort resync.
So, is there a way that I can ask JsonCpp to tell me whether an incomplete fragment of JSON has already guaranteed that the complete "object" will be syntactically invalid?
That is:
{"name": "sample"} Valid (Json::Reader::parse == true)
{"name": "sam Incomplete (Json::Reader::parse == false)
{"name": "sam"LOL Invalid (Json::Reader::parse == false)
I'd like to distinguish between the two fail states.
Can I use JsonCpp to achieve this, or am I going to have to write my own JSON "partial validator" by constructing a state machine that considers which characters are "valid" at each step through the input string? I'd rather not re-invent the wheel...

It certainly depends on whether you actually control the packets (and thus the producer) or not. If you do, the simplest way is to indicate the boundaries in a header:
+---+---+---+---+-----------------------
| 3 | 16|132|243|endofprevious"}{"name":...
+---+---+---+---+-----------------------
The header is simple:
3 indicates the number of boundaries
16, 132 and 243 indicate the position of each boundary, which corresponds to the opening brace of a new object (or list)
and then comes the buffer itself.
Upon receiving such a packet, the following entries can be parsed:
previous + current[0:16]
current[16:132]
current[132:243]
And current[243:] is saved for the next packet (though you can always attempt to parse it in case it's complete).
This way, the packets are self-synchronizing, and there is no fuzzy detection, with all the failure cases that entails.
Note that there could be 0 boundaries in the packet. It simply implies that one object is big enough to span several packets, and you just need to accumulate for the moment.
I would recommend making the number representation fixed-width (for example, 4 bytes each) and settling on a byte order (that of your machine) to convert them into/from binary easily. I believe the overhead to be fairly minimal (4 bytes + 4 bytes per entry, given that {"name":""} is already 11 bytes).
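As an illustration only (the 4-byte little-endian-style fields, the handlePacket/deliver names, and the use of std::string for accumulation are my assumptions, not part of the scheme above), the receiving side might look something like this:
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical receive-side handling of a packet laid out as
// [uint32 boundary count][uint32 boundary offset] * count [payload bytes...].
// 'pending' carries the unfinished tail left over from the previous packet.
void handlePacket(const std::vector<char>& packet, std::string& pending,
                  void (*deliver)(const std::string& completeJson))
{
    std::uint32_t count = 0;
    std::memcpy(&count, packet.data(), sizeof count);
    const char* payload = packet.data() + sizeof count + count * sizeof(std::uint32_t);
    const std::size_t payloadLen = packet.size() - (payload - packet.data());

    std::size_t start = 0;
    for (std::uint32_t i = 0; i < count; ++i) {
        std::uint32_t boundary = 0;
        std::memcpy(&boundary, packet.data() + sizeof count + i * sizeof boundary, sizeof boundary);
        // Everything up to the boundary completes the object currently being accumulated.
        pending.append(payload + start, boundary - start);
        deliver(pending);   // hand one complete JSON text to the parser
        pending.clear();
        start = boundary;
    }
    // The tail after the last boundary (or the whole payload when count == 0)
    // is kept until a later packet closes it.
    pending.append(payload + start, payloadLen - start);
}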

Iterating through the buffer character-by-character and manually checking for:
- the presence of alphabetic characters
  - outside of a string (being careful that " can be escaped with \, though)
  - not part of null, true or false
  - not an e or E inside what looks like a numeric literal with an exponent
- the presence of a digit outside of a string but immediately after a "
...is not all-encompassing, but I think it covers enough cases to fairly reliably break parsing at, or reasonably close to, the point of a message truncation.
It correctly accepts:
{"name": "samL
{"name": "sam0
{"name": "sam", 0
{"name": true
as valid JSON fragments, but catches:
{"name": "sam"L
{"name": "sam"0
{"name": "sam"true
as being unacceptable.
Consequently, the following inputs will all result in the complete trailing object being parsed successfully:
1. {"name": "samp{"name": "aardvark"}
// ^ ^
// A B - B is point of failure.
// Stripping leading `{` and scanning for the first
// free `{` gets us to A. (*)
{"name": "aardvark"}
2. {"name": "samp{"0": "abc"}
// ^ ^
// A B - B is point of failure.
// Stripping and scanning gets us to A.
{"0": "abc"}
3. {"name":{ "samp{"0": "abc"}
// ^ ^ ^
// A B C - C is point of failure.
// Stripping and scanning gets us to A.
{ "samp{"0": "abc"}
// ^ ^
// B C - C is still point of failure.
// Stripping and scanning gets us to B.
{"0": "abc"}
My implementation passes some quite thorough unit tests. Still, I wonder whether the approach itself can be improved without exploding in complexity.
* Instead of looking for a leading "{", I actually have a sentinel string prepended to every message which makes the "stripping and scanning" part even more reliable.
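For reference, here is a condensed sketch of such a scanner. It is not the exact implementation behind the unit tests above, and it folds the after-a-closing-quote checks into a single rule, but it accepts and rejects the example fragments above the same way:
#include <cctype>
#include <cstddef>
#include <string>

// Returns true when the fragment is already guaranteed never to complete
// into valid JSON, and reports where that became certain.
bool certainlyInvalid(const std::string& buf, std::size_t& failPos)
{
    bool inString = false;
    bool escaped  = false;
    char prev     = '\0';   // last significant character seen outside a string

    for (std::size_t i = 0; i < buf.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(buf[i]);

        if (inString) {                       // skip string contents, honouring \" escapes
            if (escaped)         escaped = false;
            else if (c == '\\')  escaped = true;
            else if (c == '"')   { inString = false; prev = '"'; }
            continue;
        }
        if (c == '"') { inString = true; prev = '"'; continue; }
        if (std::isspace(c)) continue;

        // Anything alphanumeric directly after a closing quote can never become valid.
        if (prev == '"' && std::isalnum(c)) { failPos = i; return true; }

        // Stray letters that cannot belong to null/true/false or an exponent marker.
        if (std::isalpha(c)) {
            const bool keywordLetter = std::string("nultrefas").find(c) != std::string::npos;
            const bool exponent = (c == 'e' || c == 'E') &&
                                  std::isdigit(static_cast<unsigned char>(prev));
            if (!keywordLetter && !exponent) { failPos = i; return true; }
        }
        prev = static_cast<char>(c);
    }
    return false;   // nothing fatal yet: the fragment may still be completed
}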

Just look at expat or other streaming XML parsers. The logic of JsonCpp should be similar, and if it's not, ask the developers of the library to improve stream reading.
In other words, and from my point of view:
If some of your network (not JSON) packets are lost, that's not the JSON parser's problem; use a more reliable protocol or invent your own, and only then transfer JSON over it.
If the JSON parser reports an error and the error happened on the last parsed token (more data was expected but the stream ran out), accumulate more data and try again (this task should really be done by the library itself).
Sometimes it may not report an error, though. For example, when you transfer 123456 and only 123 is received. But this does not match your case, since you don't transfer primitive data in a single JSON packet.
If the stream contains valid packets followed by a partially received packet, some callback should be called for each valid packet.
If the JSON parser reports an error and it's really invalid JSON, the stream should be closed and reopened if necessary.

Related

How to 'read' from a (binary==true) boost::beast::websocket::stream<tcp::socket> into a buffer (boost::beast::flat_buffer?) so it is not escaped?

I am using boost::beast to read data from a websocket into a std::string. I am closely following the example websocket_sync_client.cpp in boost 1.71.0, with one change: the I/O is sent in binary, as there is no text handler at the server end, only a binary stream. Hence, in the example, I added one line of code:
// Make the stream binary?? https://github.com/boostorg/beast/issues/1045
ws.binary(true);
Everything works as expected, I 'send' a message, then 'read' the response to my sent message into a std::string using boost::beast::buffers_to_string:
// =============================================================
// This buffer will hold the incoming message
beast::flat_buffer wbuffer;
// Read a message into our buffer
ws.read(wbuffer);
// =============================================================
// ==flat_buffer to std::string=================================
string rcvdS = beast::buffers_to_string(wbuffer.data());
std::cout << "<string_rcvdS>" << rcvdS << "</string_rcvdS>" << std::endl;
// ==flat_buffer to std::string=================================
This just about works as I expected, except there is some kind of escaping happening on the data of the (binary) stream.
There is no doubt some layer of boost logic (perhaps character traits?) that has caused all non-printable characters to be escaped into human-readable '\u????' text.
The binary data that is read contains many (intentional) non-printable ASCII control characters to delimit/organize chunks of data in the message.
I would rather not have the stream escape these non-printable characters, since I will have to "undo" that effort anyway if I cannot coerce the 'read' buffer into leaving the data as-is, raw. If I have to find another boost API to undo the escaping, that is just wasted processing that is no doubt detrimental to performance.
My question has to have a simple solution. How can I cause the resulting flat_buffer that is read via ws.read into 'rcvdS' to contain truly raw, unescaped bytes of data? Is it possible, or is it necessary for me to simply choose a different buffer template/class so that the escaping does not happen?
Here is a visual aid - showing expected vs. actual data:
Beast does not alter the contents of the message in any way. The only thing that binary() and text() do is set a flag in the message which the other end receives. Text messages are validated against the legal character set, while binary messages are not. Message data is never changed by Beast. buffers_to_string just transfers the bytes in the buffer to a std::string; it does not escape anything. So if the buffer contains a null or, say, a Ctrl+A, you will get a 0x00 and a 0x01 in the std::string respectively.
If your message is being encoded or translated, it isn't Beast that is doing it. Perhaps it is a consequence of writing the raw bytes to the std::cout? Or it could be whatever you are using to display those messages in the image you posted. I note that the code you provided does not match the image.
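One quick way to rule out display-side escaping (my suggestion, not something from the original code) is to dump the raw byte values instead of streaming the string to std::cout:
#include <cstdio>
#include <string>

// Print every received byte as two hex digits, so whatever a console or log
// viewer does to control characters cannot disguise the buffer's real content.
void dumpBytes(const std::string& rcvdS)
{
    for (unsigned char byte : rcvdS)
        std::printf("%02X ", byte);
    std::printf("\n");
}
If the hex dump shows the expected raw control bytes (0x00, 0x01, and so on), then the '\u????' escaping is happening further along the display path, not in Beast.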
If anyone else lands here, rest assured, it is your server end, not the client end that is escaping your data.

Getting the best performance out of {fmt}

I need to format a FILETIME value into a wide string buffer, and the configuration provides the format string.
What I am actually doing:
Config provides the format string: L"{YYYY}-{MM}-{DD} {hh}:{mm}:{ss}.{mmm}"
Convert the FILETIME to System time:
SYSTEMTIME stUTC;
FileTimeToSystemTime(&fileTime, &stUTC);
Format the string with
fmt::format_to(std::back_inserter(buffer), strFormat,
               fmt::arg(L"YYYY", stUTC.wYear),
               fmt::arg(L"MM", stUTC.wMonth),
               fmt::arg(L"DD", stUTC.wDay),
               fmt::arg(L"hh", stUTC.wHour),
               fmt::arg(L"mm", stUTC.wMinute),
               fmt::arg(L"ss", stUTC.wSecond),
               fmt::arg(L"mmm", stUTC.wMilliseconds));
I perfectly understand that such a service comes at a cost :) but my code calls this statement millions of times and the performance penalty is clearly present (more than 6% of CPU usage).
"Anything" I could do to improve this code would be welcome.
I saw that {fmt} has time API support.
Unfortunately, it seems to be unable to format the millisecond part of the time/date, and it requires some conversion effort from FILETIME to std::time_t...
Should I forget about the "custom" format string and provide a custom formatter for the FILETIME (or SYSTEMTIME) types? Would that result in a significant performance boost?
I'd appreciate any guidance you can provide.
In the comments I suggested parsing your custom time format string into a simple state machine. It does not even have to be a state machine as such. It is simply a linear series of instructions.
Currently, the fmt class needs to do a bit of work to parse the format type and then convert an integer to a zero-padded string. It is possible, though unlikely, that it is as heavily optimized as I'm about to suggest.
The basic idea is to have a (large) lookup table, which of course can be generated at runtime, but for the purposes of quick illustration:
const wchar_t zeroPad4[10000][5] = { L"0000", L"0001", L"0002", ..., L"9999" };
You can have 1-, 2- and 3-digit lookup tables if you want, or alternatively recognize that these values are all contained in the 4-digit lookup table if you just add an offset.
So to output a number, you just need to know what the offset in SYSTEMTIME is, what type the value is, and what string offset to apply (0 for 4-digit, 1 for 3-digit, etc). It makes things simpler, given that all struct elements in SYSTEMTIME are the same type. And you should reasonably assume that no values require range checks, although you can add that if unsure.
And you can configure it like this:
struct Output {
    int dataOffset; // offset into SYSTEMTIME struct
    int count;      // extra adjustment after string lookup
};
What about literal strings? Well, you can either copy those or just repurpose Output to use a negative dataOffset representing where to start in the format string and count to hold how many characters to output in that mode. If you need extra output modes, extend this struct with a mode member.
Anyway, let's take your string L"{YYYY}-{MM}-{DD} {hh}:{mm}:{ss}.{mmm}" as an example. After you parse this, you would end up with:
Output outputs[] {
    { offsetof(SYSTEMTIME, wYear), 0 },          // "{YYYY}"
    { -6, 1 },                                   // "-"
    { offsetof(SYSTEMTIME, wMonth), 2 },         // "{MM}"
    { -11, 1 },                                  // "-"
    { offsetof(SYSTEMTIME, wDay), 2 },           // "{DD}"
    { -16, 1 },                                  // " "
    // etc... you get the idea
    { offsetof(SYSTEMTIME, wMilliseconds), 1 },  // "{mmm}"
    { -1, 0 },                                   // terminate
};
It shouldn't take much to see that, when you have a SYSTEMTIME as input, a pointer to the original format string, the lookup table, and this basic array of instructions you can go ahead and output the result into a pre-sized buffer very quickly.
I'm sure you can come up with the code to execute these instructions efficiently.
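For the sake of illustration, executing those instructions might look something like the sketch below. It assumes the Output struct and outputs[] layout from above, a zeroPad4 table as described earlier, and the SYSTEMTIME/WORD types from <windows.h>; the Execute name and the terminator test are my own choices, not part of any library.
#include <windows.h>
#include <cstring>

// zeroPad4[n] holds the zero-padded 4-digit wide string for n, e.g. L"0042".
// Declared extern here only for illustration; it could equally be built at runtime.
extern const wchar_t zeroPad4[10000][5];

// Walks the instruction list and writes the formatted time into 'out',
// which must be pre-sized; returns one past the last character written.
wchar_t* Execute(const Output* ops, const SYSTEMTIME& st, const wchar_t* fmt, wchar_t* out)
{
    for (; !(ops->dataOffset < 0 && ops->count == 0); ++ops) {
        if (ops->dataOffset >= 0) {
            // Value entry: fetch the WORD at the recorded offset and copy its lookup string,
            // skipping 'count' leading digits (0 = 4 digits, 1 = 3 digits, 2 = 2 digits).
            WORD value;
            std::memcpy(&value, reinterpret_cast<const char*>(&st) + ops->dataOffset, sizeof value);
            for (const wchar_t* digits = zeroPad4[value] + ops->count; *digits; ++digits)
                *out++ = *digits;
        } else {
            // Literal entry: copy 'count' characters straight out of the format string.
            std::memcpy(out, fmt + (-ops->dataOffset), ops->count * sizeof(wchar_t));
            out += ops->count;
        }
    }
    return out;
}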
The main drawback of this approach is that the size of the lookup table may lead to cache issues. However, most lookups will occur in the first 100 elements. You could also compress the table to ordinary char values and then inject the wchar_t zero bytes when copying.
As always: experiment, measure, and have fun!

Does Serial.find Clear the Buffer If It Can't Find Anything?

I'm trying to look for keywords in the serial buffer on an Arduino.
if (Serial.find("SOMETHING"))
{
// do something
}
else if (Serial.find("SOMETHING ELSE"))
{
// do another thing
}
But only the first if works. Even if I send "SOMETHINGELSE" it doesn't check at all. Does the find function clear the buffer completely even if it can't find anything? If so, what can I do in this situation?
Serial.find() reads the Serial buffer and removes every single byte from it, up to the point where it finds the String or character you specified.
If you use it in a conditional statement like in your example, it will always find "SOMETHING" even if "SOMETHING ELSE" exists, because everything up to the point of "SOMETHING" is removed from the buffer (if "SOMETHING" actually arrived before "SOMETHING ELSE").
If we assume that your data arrives as SOMETHING ELSE followed by SOMETHING, your Serial buffer will look like this: SOMETHING ELSESOMETHING
in which case:
It will find "SOMETHING" and stop there, as the first condition to be met is the search for exactly that word.
I assumed that you don't actually mean to send "SOMETHING", so let's say the first String to look for is StringA and the second is StringB. Your buffer then will look like this: StringBStringA. However, based on your conditional statement it will still only find StringA. This happens because StringA still exists in the buffer, and when the first condition is checked you are asking it to search for StringA; by doing this you are also removing StringB with Serial.find(StringA) - it simply skips StringB because it is not aware that you are going to ask about it in your else if later on.
The solution to your problem depends on the data you expect to receive. You can tag the beginning of the data you are waiting for with some specific character or sequence of characters.
For example, let's assume you are waiting for String-type data. Before you send it over Serial, put each message in a specific format like $START$SOMETHING$
You can then use this to find the first command that starts with your tag and load the content of the message into a String, so you can compare it with the expected results using a conditional statement.
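On the sending side (a hypothetical sketch, not part of the original setup), the framing could be as simple as:
// Wrap every outgoing message in the $START$ ... $ frame so the receiver
// code below can locate and extract it reliably.
void sendFramed(const String &message) {
  Serial.print("$START$");
  Serial.print(message);
  Serial.print("$");
}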
Note!!! The code below will stop on the first message with the $START$ tag, so if you want to look through your Serial buffer for other messages, don't break out of while (Serial.available() > 0) and use arrays to store each result.
char myCharacter;
String myIncomingData;

if (Serial.find("$START$"))
{
  while (Serial.available() > 0) {
    // Reads one byte of Serial at a time
    myCharacter = Serial.read();
    // Stops at the end of the data
    if (myCharacter == '$') {
      break;
    }
    // Adds each character to the String holding your data
    myIncomingData += myCharacter;
  }
  if (myIncomingData == "SOMETHING") {
    // Do whatever you like with your data
  } else if (myIncomingData == "SOMETHING ELSE") {
    // Do whatever you like with your data
  }
}
I would use this solution only if you want to use Serial.find(). I'm sure you can get your results in many different ways as well; in the end you can always go through the entire 64 bytes of your buffer byte by byte using your own code :D.

eof from string, not a stream

I have a secret "mission" to write a Vigenère cipher, along with its analysis, using the ASCII alphabet.
I'm having some trouble encrypting the text.
There are two kinds of problems:
1) If I use the whole ASCII table, there is trouble decrypting the text, because I use "system" chars that kill my text (by the way, it is "War and Peace" by Tolstoy). Should I use a truncated version of the table?
If yes, could I still do the operations from the next question with a truncated ASCII table?
2) I want to have my whole text in one string. I can do it like this:
string s;
string p = "";
ifstream in("text_for_encryption.txt");
while (getline(in, s))
{
    p += s;
    p += "\n";
}
"s" is the temporary string, and "p" is the string that has all text from file in it (with endl's and, of course, EOF)
i will make a cycle for "p" which looks like as
while (not eof in p)
{
take first keyword.length() chars from "p"? check every of them for EOF and encrypt them. (they will be deleted from p)
kick them in file "encrypted_text.txt"
}
in pseudocode (yeah, it is shit-like :( ).
So, the question is: how can I compare a string element with EOF?
Maybe my Googling is bad, but I couldn't find the answer to this question.
Thanks in advance for any advice!
Update:
If I encrypt line-by-line, it will be easy to get the length of the key by Friedman's method (if the key is quite small),
so I want to encrypt the text with the endl's included for more security.
For encrypting, it depends largely on what you want to encrypt, and what you want to do with the encrypted text. The usual solution is to encrypt the byte values (not the characters); this means that you'll have to read and write the encrypted file in binary mode, but since it's not meant to be readable anyway, that's usually not an issue.
For the rest, strings do not have "EOF" characters. In fact, there is no such thing as an EOF character[1]. (Nor an endl character, either.) EOF is, in fact, an "event" which occurs when reading from a stream; in C++, it is treated as a sort of error. std::istream functions which can return EOF (e.g. std::istream::get()) return int, and not char, in order to be able to return an out-of-band value.
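As a minimal sketch of both points (the file names are taken from your question; the fixed +3 shift is just a placeholder for the real keyed Vigenère step), reading and encrypting byte values in binary mode could look like this:
#include <fstream>

int main()
{
    std::ifstream in("text_for_encryption.txt", std::ios::binary);
    std::ofstream out("encrypted_text.txt", std::ios::binary);

    int c;  // int, not char, so the EOF value stays out of band
    while ((c = in.get()) != std::ifstream::traits_type::eof()) {
        unsigned char byte = static_cast<unsigned char>(c);
        unsigned char encrypted = static_cast<unsigned char>((byte + 3) % 256); // placeholder shift
        out.put(static_cast<char>(encrypted));
    }
}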
Strings do have a known length. To visit all of the characters in a string:
for ( std::string::const_iterator current = s.begin();
      current != s.end();
      ++current ) {
    // Do something with *current...
}
(If you have C++11, you can replace std::string::const_iterator with auto. This is much simpler to type, but until you master the iterator idioms, it's probably better to write the type out, to ensure you understand what is going on.)
[1] Historically, text files have had EOF characters on some systems. This is not the end of file that you see with std::istream::get(), but even today, if you open a file in text mode under Windows, a 0x1A in the file will trigger the end of file event in the input.

String issue with assert on erase

I am developing a program in C++ that uses the std::string container to store network data from a socket (that part is peachy). I receive the data in frames of at most 1452 bytes at a time. The protocol uses a fixed 20-byte header that contains information about the length of the data portion of the packet. My problem is that a string is giving me an unknown debug assertion: it asserts, but I get NO message about the string. Since I can receive more than a single packet in a frame at any time, I place all received data into the string, reinterpret_cast it to my data struct, calculate the total length of the packet, then copy the data portion of the packet into a string for regex processing. At this point I do a string erase, as in mybuff.Erase(totalPackLen); <~ THIS is what's triggering the assert, but totalPackLen is less than the string's size.
Is there some convention I am missing here? Or is std::string really an inappropriate choice here? Thanks.
Fixed it on my own. Rolled my own VERY simple buffer with a few C calls :)
int ret = recv(socket, m_buff, sizeof(m_buff), 0);
if (ret > 0)
{
    BigBuff.append(m_buff, ret);
    while (BigBuff.size() > 16) {
        Header *hdr = reinterpret_cast<Header*>(&BigBuff[0]);
        if (ntohs(hdr->PackLen) <= BigBuff.size() - 20) {
            hdr->PackLen = ntohs(hdr->PackLen);
            string lData;
            lData.append(BigBuff.begin() + 20, BigBuff.begin() + 20 + hdr->PackLen);
            Parse(lData); // regex parsing helper function
            BigBuff.erase(hdr->PackLen + 20); // assert here when PackLen is 235 and string length is 1458
        }
    }
}
From the code snippet you provided it appears that your packet comprises a fixed-length binary header followed by a variable length ASCII string as a payload. Your first mistake is here:
BigBuff.append(m_buff,ret);
There are at least two problems here:
1. Why the append? You should presumably have already dispatched any previous messages, so you should be starting with a clean slate.
2. Mixing binary and string data can work, but more often than not it doesn't. It is usually better to keep the binary and ASCII data separate. Don't use std::string for non-string data.
Append adds data to the end of the string. The very next statement after the append is a test for a length of 16, which says to me that you should have started fresh. In the same vein you do that reinterpret cast from BigBuff[0]:
Header *hdr = reinterpret_cast<Header*>(&BigBuff[0]);
Because of your use of append, you are perpetually dealing with the header from the first packet received rather than the current packet. Finally, there's that erase:
BigBuff.erase(hdr->PackLen + 20);
Many problems here:
- If the packet length and the return value from recv are consistent the very first call will do nothing (the erase is at but not past the end of the string).
- There is something very wrong if the packet length and the return value from recv are not consistent. It might mean, for example, that multiple physical frames are needed to form a single logical frame, and that in turn means you need to go back to square one.
- Even supposing the physical and logical frames are one and the same, you're still going about this all wrong. As noted, the first time around you are erasing exactly nothing. That append at the start of the loop is exactly what you don't want to do.
Serialization oftentimes is a low-level concept and is best treated as such.
Your comment doesn't make sense:
BigBuff.erase(hdr->PackLen + 20); //assert here when len is packlen is 235 and string len is 1458;
BigBuff.erase(hdr->PackLen + 20) will erase from position hdr->PackLen + 20 onwards to the end of the string. From the description of the code, it seems to me that you're erasing beyond the end of the content data. Here's the reference for std::string::erase() for you.
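For what it's worth, if the intent was to drop the consumed packet from the front of the buffer (my assumption about the intent), that needs the two-argument overload:
BigBuff.erase(hdr->PackLen + 20);    // one argument: erases from that position to the end
BigBuff.erase(0, hdr->PackLen + 20); // two arguments: erases the consumed packet, keeps the rest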
Needless to say, std::string is entirely inappropriate here; it should be std::vector.
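A sketch of the same receive loop built on std::vector<char> (the Header layout here is an assumption based on the question's description of a 20-byte header with a length field, Parse is the question's own helper, and recv/ntohs are the usual POSIX calls):
#include <arpa/inet.h>   // ntohs
#include <sys/socket.h>  // recv
#include <sys/types.h>   // ssize_t
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct Header {              // assumed layout: 20 bytes total, length first
    std::uint16_t PackLen;   // payload length, network byte order on the wire
    char rest[18];           // remaining header fields (not specified in the question)
};

void Parse(const std::string& data); // regex parsing helper from the question

std::vector<char> BigBuff;   // raw bytes, not text
char m_buff[1452];

void receiveOnce(int sock)
{
    ssize_t ret = recv(sock, m_buff, sizeof(m_buff), 0);
    if (ret <= 0)
        return;
    BigBuff.insert(BigBuff.end(), m_buff, m_buff + ret);

    while (BigBuff.size() >= sizeof(Header)) {
        Header hdr;
        std::memcpy(&hdr, BigBuff.data(), sizeof hdr);   // copy out instead of aliasing the storage
        const std::size_t packLen = ntohs(hdr.PackLen);
        if (BigBuff.size() < sizeof(Header) + packLen)   // wait for the full payload
            break;
        std::string lData(BigBuff.begin() + sizeof(Header),
                          BigBuff.begin() + sizeof(Header) + packLen);
        Parse(lData);
        BigBuff.erase(BigBuff.begin(),                   // drop exactly one consumed packet
                      BigBuff.begin() + sizeof(Header) + packLen);
    }
}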