I'm getting this warning, followed by an error, when I try to parse a large message. I know the message is larger than 64MB, which is the default limit. I am currently using message.ParseFromIstream. Does anyone know how to get access to the CodedInputStream object so I can call the SetTotalBytesLimit function? Or is there any other way to solve this problem?
Reading dangerously large protocol message. If the message turns out
to be larger than 67108864 bytes, parsing will be halted for security
reasons. To increase the limit (or to disable these warnings), see
CodedInputStream::SetTotalBytesLimit() in
google/protobuf/io/coded_stream.h.
The correct fix: You should try to limit the sizes of your protobuf messages. Please see:
https://developers.google.com/protocol-buffers/docs/techniques#streaming
The quick and dirty (read: not recommended) approach:
In the file coded_stream.h of the protobuf library source, change the values of kDefaultTotalBytesLimit and kDefaultTotalBytesWarningThreshold, recompile, and reinstall.
Just reading the documentation of the function that the error message already pointed you to would have answered that question:
Hint: If you are reading this because your program is printing a
warning about dangerously large protocol messages, you may be confused
about what to do next. The best option is to change your design such
that excessively large messages are not necessary. For example, try to
design file formats to consist of many small messages rather than a
single large one. If this is infeasible, you will need to increase the
limit. Chances are, though, that your code never constructs a
CodedInputStream on which the limit can be set. You probably parse
messages by calling things like Message::ParseFromString(). In this
case, you will need to change your code to instead construct some sort
of ZeroCopyInputStream (e.g. an ArrayInputStream), construct a
CodedInputStream around that, then call
Message::ParseFromCodedStream() instead. Then you can adjust the
limit. Yes, it's more work, but you're doing something unusual.
Source
Also it's probably a really good idea to follow the first part of the advice and redesign the application.
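If you do have to raise the limit, here is a minimal sketch of what the hint describes, assuming an older protobuf release where SetTotalBytesLimit() takes a limit and a warning threshold (newer releases take a single argument), and a hypothetical generated message type MyMessage:
#include <fstream>
#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>

bool ParseLargeMessage(const char* path, MyMessage* message) {
    std::ifstream input(path, std::ios::binary);
    // Wrap the istream in a ZeroCopyInputStream, then a CodedInputStream.
    google::protobuf::io::IstreamInputStream raw_input(&input);
    google::protobuf::io::CodedInputStream coded_input(&raw_input);
    // Raise the hard limit to 512MB and the warning threshold to 256MB.
    coded_input.SetTotalBytesLimit(512 * 1024 * 1024, 256 * 1024 * 1024);
    return message->ParseFromCodedStream(&coded_input);
}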
This thread is quite old, but recently deep learning has gotten attention, and the library Caffe uses Protobuf, so maybe more people will stumble upon this. I have to do neural network work with Caffe, and the whole network took up so much memory that it hit the limit even with the smallest batch size. In my case I cannot change how my application works, so I have to raise the limit. Here's the comment from the code (google/protobuf/io/coded_stream.h) that sets the message limit, for anyone wondering what security reason they are talking about:
// Total Bytes Limit -----------------------------------------------
// To prevent malicious users from sending excessively large messages
// and causing integer overflows or memory exhaustion, CodedInputStream
// imposes a hard limit on the total number of bytes it will read.
// Sets the maximum number of bytes that this CodedInputStream will read
// before refusing to continue. To prevent integer overflows in the
// protocol buffers implementation, as well as to prevent servers from
// allocating enormous amounts of memory to hold parsed messages, the
// maximum message length should be limited to the shortest length that
// will not harm usability. The theoretical shortest message that could
// cause integer overflows is 512MB. The default limit is 64MB. Apps
// should set shorter limits if possible. If warning_threshold is not -1,
// a warning will be printed to stderr after warning_threshold bytes are
// read. For backwards compatibility all negative values get squashed to -1,
// as other negative values might have special internal meanings.
// An error will always be printed to stderr if the limit is reached.
//
// This is unrelated to PushLimit()/PopLimit().
//
// Hint: If you are reading this because your program is printing a
// warning about dangerously large protocol messages, you may be
// confused about what to do next. The best option is to change your
// design such that excessively large messages are not necessary.
// For example, try to design file formats to consist of many small
// messages rather than a single large one. If this is infeasible,
// you will need to increase the limit. Chances are, though, that
// your code never constructs a CodedInputStream on which the limit
// can be set. You probably parse messages by calling things like
// Message::ParseFromString(). In this case, you will need to change
// your code to instead construct some sort of ZeroCopyInputStream
// (e.g. an ArrayInputStream), construct a CodedInputStream around
// that, then call Message::ParseFromCodedStream() instead. Then
// you can adjust the limit. Yes, it's more work, but you're doing
// something unusual.
Related
I want to implement a protocol to share data between a server and a client.
I don't know which one is correct. With performance as the main criterion, can anyone suggest the best protocol for parsing the data?
I have one in mind; I don't know its actual name, but it looks like this:
[Header][Message][Header][Message]
The header contains the length of the message, and the header size is fixed.
I have tried implementing this by performing a lot of concatenation and substring operations, which are costly. Can anyone suggest the best implementation for this?
The question is very broad.
On the topic of avoiding buffer/string concatenation, look at Buffer Sequences, described in Boost Asio's "Scatter-Gather" documentation.
For parsing there are two common solutions:
small messages
Receive the data into a buffer, e.g. 64k. Then use pointers into that buffer to parse the header and message. Since the messages are small, there can be many messages in the buffer, and you would call the parser again as long as there is data left in the buffer. Note that the last message in the buffer might be truncated, in which case you have to keep the partial message and read more data into the buffer. If the partial message is near the end of the buffer, copying it to the front might be necessary.
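A rough sketch of that buffer loop, assuming the fixed header is a 4-byte length in the machine's byte order (all names here are illustrative):
#include <cstdint>
#include <cstring>
#include <unistd.h>

char buf[64 * 1024];  // receive buffer, e.g. 64k
size_t used = 0;      // bytes currently held in the buffer

void Pump(int fd) {
    ssize_t r = read(fd, buf + used, sizeof(buf) - used);
    if (r <= 0) return;  // error/EOF handling omitted in this sketch
    used += static_cast<size_t>(r);

    size_t pos = 0;
    // Parse as many complete [Header][Message] frames as are buffered.
    while (used - pos >= sizeof(uint32_t)) {
        uint32_t len;
        memcpy(&len, buf + pos, sizeof(len));
        if (used - pos - sizeof(len) < len) break;  // truncated frame
        // ...handle the message at buf + pos + sizeof(len), length len...
        pos += sizeof(len) + len;
    }
    // Keep any partial frame by moving it to the front of the buffer.
    memmove(buf, buf + pos, used - pos);
    used -= pos;
}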
large messages
With large messages it makes sense to first only read the header. Then parse the header to get the message size, allocate an appropriate buffer for the message and then read the whole message into it before parsing it.
Note: In both cases you might want to handle overly large messages by either skipping them or terminating the connection with an error. For the first case a message can not be larger than the buffer and should be a lot smaller. For the second case you don't want to allocate e.g. a gigabyte to buffer a message if they are supposed to be around 1MB only.
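A matching sketch for the large-message case, including the oversized-frame check from the note above (ReadExact and kMaxMessageSize are illustrative names; byte order is assumed shared between the two ends):
#include <cstdint>
#include <unistd.h>
#include <vector>

constexpr uint32_t kMaxMessageSize = 1 << 20;  // e.g. reject frames over 1MB

// Read exactly n bytes from fd; a real version would also handle EINTR.
bool ReadExact(int fd, void* buf, size_t n) {
    char* p = static_cast<char*>(buf);
    while (n > 0) {
        ssize_t r = read(fd, p, n);
        if (r <= 0) return false;  // error or EOF
        p += r;
        n -= static_cast<size_t>(r);
    }
    return true;
}

// Read one frame: parse the fixed-size header first, then allocate
// exactly the advertised size and read the message body into it.
bool ReadMessage(int fd, std::vector<char>& msg) {
    uint32_t length = 0;
    if (!ReadExact(fd, &length, sizeof(length))) return false;
    if (length > kMaxMessageSize) return false;  // overly large: bail out
    msg.resize(length);
    return ReadExact(fd, msg.data(), msg.size());
}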
For sending messages it's best to first collect all the output. A std::vector can be sufficient. Or a rope of strings. Avoid copying the message into a larger buffer over and over. At most copy it once at the end when you have all the pieces. Using writev() to write a list of buffers instead of copying them all into one buffer can also be a solution.
As for the best protocol... What's best? Simply sending the data in binary format is fastest but will break when you have different architectures or versions. Something like google protobuffers can solve that but at the cost of some speed. It all depends on what your needs are.
I've been running into some issues with writing to a file - namely, not being able to write fast enough.
To explain, my goal is to capture a stream of data coming in over gigabit Ethernet and simply save it to a file.
The raw data is coming in at a rate of 10MS/s, and it's then saved to a buffer and subsequently written to a file.
Below is the relevant section of code:
std::string path = "Stream/raw.dat";
ofstream outFile(path, ios::out | ios::app | ios::binary);
if(outFile.is_open())
    cout << "Yes" << endl;
while(1)
{
    rxSamples = rxStream->recv(&rxBuffer[0], rxBuffer.size(), metaData);
    switch(metaData.error_code)
    {
        //Irrelevant error checking...
    }
    //Write data to a file
    std::copy(begin(rxBuffer), end(rxBuffer), std::ostream_iterator<complex<float>>(outFile));
}
The issue I'm encountering is that it's taking too long to write the samples to a file. After a second or so, the device sending the samples reports its buffer has overflowed. After some quick profiling of the code, nearly all of the execution time is spent on std::copy(...) (99.96% of the time to be exact). If I remove this line, I can run the program for hours without encountering any overflow.
That said, I'm rather stumped as to how I can improve the write speed. I've looked through several posts on this site, and it seems like the most common suggestion (in regard to speed) is to implement file writes as I've already done - through the use of std::copy.
If it's helpful, I'm running this program on Ubuntu x86_64. Any suggestions would be appreciated.
So the main problem here is that you try to write in the same thread that you receive in, which means that your recv() can only be called again after the copy is complete. A few observations:
Move the writing to a different thread. This is about a USRP, so GNU Radio might really be the tool of your choice -- it's inherently multithreaded.
Your output iterator is probably not the most performant solution. Simply calling write() on a file descriptor might be faster, but measuring that is up to you
If your hard drive/file system/OS/CPU aren't up to the rates coming in from the USRP, even if decoupling receiving from writing thread-wise, then there's nothing you can do -- get a faster system.
Try writing to a RAM disk instead
In fact, I don't know how you came up with the std::copy approach. The rx_samples_to_file example that comes with UHD does this with a simple write, and you should definitely favor that over copying; file I/O can, on good OSes, often be done with one copy less, and iterating over all elements is probably very slow.
Let's do a bit of math.
Your samples are (apparently) of type std::complex<float>. Given a (typical) 32-bit float, that means each sample is 64 bits. At 10 MS/s, that means the raw data is around 80 megabytes per second--that's within what you can expect to write to a desktop (7200 RPM) hard drive, but getting fairly close to the limit (which is typically around 100-150 megabytes per second or so).
Unfortunately, despite the std::ios::binary, you're actually writing the data in text format (because std::ostream_iterator basically does stream << data;).
This not only loses some precision, but increases the size of the data, at least as a rule. The exact amount of increase depends on the data--a small integer value can actually decrease the quantity of data, but for arbitrary input, a size increase close to 2:1 is fairly common. With a 2:1 increase, your outgoing data is now around 160 megabytes/second--which is faster than most hard drives can handle.
The obvious starting point for an improvement would be to write the data in binary format instead:
// Write the element count, then the raw bytes of the buffer in one call.
uint32_t nItems = std::end(rxBuffer) - std::begin(rxBuffer);
outFile.write((char *)&nItems, sizeof(nItems));
outFile.write((char *)&rxBuffer[0], sizeof(rxBuffer));
For the moment I've used sizeof(rxBuffer) on the assumption that it's a real array. If it's actually a pointer or vector, you'll have to compute the correct size (what you want is the total number of bytes to be written).
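For instance, a hypothetical variant if rxBuffer is a std::vector:
// Total bytes = element count times element size, not sizeof(rxBuffer).
outFile.write(reinterpret_cast<const char*>(rxBuffer.data()),
              rxBuffer.size() * sizeof(rxBuffer[0]));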
I'd also note that as it stands right now, your code has an even more serious problem: since it hasn't specified a separator between elements when it writes the data, the data will be written without anything to separate one item from the next. That means if you wrote two values of (for example) 1 and 0.2, what you'd read back in would not be 1 and 0.2, but a single value of 10.2. Adding separators to your text output will add yet more overhead (figure around 15% more data) to a process that's already failing because it generates too much data.
Writing in binary format means each float will consume precisely 4 bytes, so delimiters are not necessary to read the data back in correctly.
The next step after that would be to descend to a lower-level file I/O routine. Depending on the situation, this might or might not make much difference. On Windows, you can specify FILE_FLAG_NO_BUFFERING when you open a file with CreateFile. This means that reads and writes to that file will basically bypass the cache and go directly to the disk.
In your case, that's probably a win--at 10 MS/s, you're probably going to use up the cache space quite a while before you reread the same data. In such a case, letting the data go into the cache gains you virtually nothing, but costs you some time to copy the data into the cache, then somewhat later copy it out to the disk. Worse, it's likely to pollute the cache with all this data, so it's no longer storing other data that's a lot more likely to benefit from caching.
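A Windows-only sketch of that flag (unbuffered I/O additionally requires sector-aligned buffers and write sizes; error handling omitted):
#include <windows.h>

// Open the output file with the system cache bypassed; writes go
// (more or less) straight to the disk.
HANDLE h = CreateFileA("Stream\\raw.dat",
                       GENERIC_WRITE,
                       0,                       // no sharing
                       nullptr,                 // default security
                       CREATE_ALWAYS,
                       FILE_FLAG_NO_BUFFERING,  // bypass the cache
                       nullptr);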
A few months ago I read a book on security practices, and it suggested the following method for protecting our classes from being overwritten by e.g. buffer overflows:
first, define a magic number and a fixed-size array (it can be a simple integer too)
fill that array with the magic number, and place one such array at the top and one at the bottom of our class
a function compares these arrays, and if they are equal to each other and to the static magic value, the class is OK and it returns true; otherwise the class is corrupt and it returns false
place this function at the start of every other class method, so the validity of the class is checked on each function call
it is important to place one array at the start and one at the end of the class (see the sketch after this list)
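A minimal sketch of this scheme as I understand it (all names are illustrative, not from any library):
#include <cstdint>

class Guarded {
    static constexpr uint32_t kMagic = 0xDEADBEEF;  // the magic number
    uint32_t front_canary_ = kMagic;  // deliberately the first member
    int payload_ = 0;                 // the class's real state
    uint32_t back_canary_ = kMagic;   // deliberately the last member

    // Compare both canaries against the static magic value.
    bool Intact() const {
        return front_canary_ == kMagic && back_canary_ == kMagic;
    }

public:
    bool Set(int v) {
        if (!Intact()) return false;  // corruption detected: refuse to act
        payload_ = v;
        return true;
    }
};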
At least this is as I remember it. I'm coding a file encryptor for learning purposes, and I'm trying to make this code exception safe.
So, in which scenarios is it useful, and when should I use this method, or is this something totally useless to count on? Does it depend on the compiler or OS?
PS: I forgot the name of the book mentioned in this post, so I cannot check it again, if anyone of you know which one was it please tell me.
What you're describing sounds like a canary, but implemented within your program, as opposed to one inserted by the compiler. Compiler-inserted canaries are usually on by default when using gcc or g++ (plus a few other buffer overflow countermeasures).
If you're doing mutable operations on your class and you want to make sure you don't have side effects, I don't know if having a magic number is very useful. Why rely on a homebrew validity check when there are methods out there that are more likely to be successful?
Checksums: I think it'd be more useful for you to hash the unencrypted text and add that to the end of the encrypted file. When decrypting, remove the hash and compare the hash(decrypted text) with what it should be.
I think most, if not all, widely used encryptors/decryptors store some sort of checksum in order to verify that the data has not changed.
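A rough sketch of that idea, using std::hash purely as a stand-in for a real cryptographic hash such as SHA-256:
#include <functional>
#include <string>

// Checksum of the plaintext; append this to the end of the encrypted file.
size_t Checksum(const std::string& plaintext) {
    return std::hash<std::string>{}(plaintext);
}

// When decrypting: strip the stored checksum from the file, then verify
// that Checksum(decrypted_text) matches it before trusting the data.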
This type of a canary will partially protect you against a very specific type of overflow attack. You can make it a little more robust by randomizing the canary value every time you run the program.
If you're worried about buffer overflow attacks (and you should be if you are ever parsing user input), then go ahead and do this. It probably doesn't cost too much in speed to check your canaries every time. There will always be other ways to attack your program, and there might even be careful buffer overflow attacks that get around your canary, but it's a cheap measure to take so it might be worth adding to your classes.
I'm writing a fairly straightforward function that sends an array over to a file descriptor. However, in order to send the data, I need to append a one byte header.
Here is a simplified version of what I'm doing and it seems to work:
void SendData(uint8_t* buffer, size_t length) {
uint8_t buffer_to_send[length + 1];
buffer_to_send[0] = MY_SPECIAL_BYTE;
memcpy(buffer_to_send + 1, buffer, length);
// more code to send the buffer_to_send goes here...
}
Like I said, the code seems to work fine, however, I've recently gotten into the habit of using the Google C++ style guide since my current project has no set style guide for it (I'm actually the only software engineer on my project and I wanted to use something that's used in industry). I ran Google's cpplint.py and it caught the line where I am creating buffer_to_send and threw some comment about not using variable length arrays. Specifically, here's what Google's C++ style guide has to say about variable length arrays...
http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Variable-Length_Arrays_and_alloca__
Based on their comments, it appears I may have found the root cause of seemingly random crashes in my code (which occur very infrequently, but are nonetheless annoying). However, I'm a bit torn as to how to fix it.
Here are my proposed solutions:
Make buffer_to_send essentially a fixed length array of a constant length. The problem that I can think of here is that I have to make the buffer as big as the theoretically largest buffer I'd want to send. In the average case, the buffers are much smaller, and I'd be wasting about 0.5KB doing so each time the function is called. Note that the program must run on an embedded system, and while I'm not necessarily counting each byte, I'd like to use as little memory as possible.
Use new and delete or malloc/free to dynamically allocate the buffer. The issue here is that the function is called frequently and there would be some overhead in terms of constantly asking the OS for memory and then releasing it.
Use two successive calls to write() in order to pass the data to the file descriptor. That is, the first write would pass only the one byte, and the next would send the rest of the buffer. While seemingly straightforward, I would need to research the code a bit more (note that I got this code handed down from a previous engineer who has since left the company I work for) in order to guarantee that the two successive writes occur atomically. Also, if this requires locking, then it essentially becomes more complex and has more performance impact than case #2.
Note that I cannot make the buffer_to_send a member variable or scope it outside the function since there are (potentially) multiple calls to the function at any given time from various threads.
Please let me know your opinion and what my preferred approach should be. Thanks for your time.
You can fold the two successive calls to write() in your option 3 into a single call using writev().
http://pubs.opengroup.org/onlinepubs/009696799/functions/writev.html
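A minimal sketch, reusing the question's SendData() shape (MY_SPECIAL_BYTE and the descriptor are assumed from the surrounding code):
#include <cstddef>
#include <cstdint>
#include <sys/uio.h>

// Send the one-byte header and the payload in a single gather write,
// without copying into a combined buffer. Note that writev(), like
// write(), may still write fewer bytes than requested.
ssize_t SendData(int fd, uint8_t* buffer, size_t length) {
    uint8_t header = MY_SPECIAL_BYTE;
    struct iovec iov[2];
    iov[0].iov_base = &header;
    iov[0].iov_len = 1;
    iov[1].iov_base = buffer;
    iov[1].iov_len = length;
    return writev(fd, iov, 2);  // the kernel writes both pieces in order
}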
I would choose option 1. If you know the maximum length of your data, then allocate that much space (plus one byte) on the stack using a fixed size array. This is no worse than the variable length array you have shown, because you must always have enough space left on the stack anyway; otherwise you simply won't be able to handle your maximum length (at worst, your code would randomly crash on larger buffer sizes). At the time this function is called, nothing else will be using that further space on your stack, so it will be safe to allocate a fixed size array.
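A sketch of option 1, assuming a hypothetical kMaxLength constant for the largest payload the protocol allows (MY_SPECIAL_BYTE again comes from the question):
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr size_t kMaxLength = 512;  // assumed protocol maximum

void SendData(const uint8_t* buffer, size_t length) {
    assert(length <= kMaxLength);            // caller must respect the cap
    uint8_t buffer_to_send[kMaxLength + 1];  // fixed size, so no VLA
    buffer_to_send[0] = MY_SPECIAL_BYTE;
    memcpy(buffer_to_send + 1, buffer, length);
    // ...send length + 1 bytes of buffer_to_send here...
}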
I am assigning std::ostream *pout_ = output_.get();
where output_ is scoped_ptr<std::stringstream> output_;.
After the assignment I am filling it with *pout_ << "large strings"; — approximately 1172528 characters.
But at some point I am no longer able to insert characters into *pout_. I tried searching the net for the maximum size of this class but couldn't find it.
Can someone please tell me the maximum number of characters I can store in *pout_? Is there any function which can tell me the maximum size of this class?
There are several possible causes for output to an ostream to fail.
The most obvious is that the underlying supporting media (memory, in the
case of an std::ostringstream) is full. Another is that you've
reached some internal limit: a lot of systems have (or had) file size
limits which would hit you long before the disk was full, and some
systems have had similar constraints on single objects in memory (and
the std::stringbuf class typically keeps its data in a single object).
(There's also the possibility of a hardware error, but if this occurs
with std::stringbuf, i.e. a memory error, the hardware probably won't
detect it.)
All of these mean that there is no hard limit as to how much you can
write to a stream, string or otherwise. It all depends, and one time,
you might succeed writing 2 GB, and the next fail after 1 MB or less.
Practically speaking, in most cases, you should be aware of the fact
that writes can fail, test the results (after a final flush) and be
prepared to do something reasonable if they do.
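A minimal sketch of that check (the stream records failure in its state flags):
#include <iostream>
#include <sstream>

std::ostringstream out;
// ...many insertions into out...
out.flush();  // final flush before testing the state
if (!out) {   // badbit or failbit set: some write failed
    std::cerr << "write to string stream failed\n";
}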
In the specific case of string streams, of course, about the only
failure you're likely to be able to detect is out of memory. (An
ostream does not propagate an exception from the streambuf; it
sets the badbit when one occurs, and if exceptions have been activated
for badbit, it will throw its own exception, not rethrow the original
one. Which means that an std::bad_alloc will not propagate out.)
A lot of applications don't handle out of memory, and should logically
have set the new handler to abort. If you've set the new handler to
abort, then you can probably forego such error checking on string
outputs.
The answer depends on your compiler, platform etc. It also depends on the manner in which you write stuff to the stringstream (since it may need to reallocate memory for its buffer, potentially causing memory fragmentation and therefore under-utilisation).
The only practical way to get an idea of the limit is by conducting realistic experiments in your target environment. Even then, you should only use the results of such experiments as a rough guide.
A stringstream stores the data in a memory buffer. It will grow the buffer as needed, as long as it can.
You can continue to write to it until you run out of memory. There is no fixed limit.
It depends on a huge number of factors, such as
what load addresses are chosen for the shared objects you use
how much memory the rest of your program uses
how badly fragmented your address space has become
That's why you couldn't find any documented limit on the web.