How to parse very large buffer with flex and bison - c++

I've written a JSON library that uses flex and bison to parse serialized JSON (i.e. strings) and deserialize it into JSON objects. It works great for small strings.
However, it fails with very large strings (I tried strings of almost 3 GB), giving this error:
‘fatal flex scanner internal error--end of buffer missed’
I want to know the maximum size of buffer that I can pass to this function:
//js: serialized json stored in std::string
yy_scan_bytes(js.data(), js.size());
and how can I make flex/bison work with large buffers?

It looks to me like you are using an old version of the flex skeleton (and hence of flex), in which string lengths were assumed to fit into ints. The error message you are observing is probably the result of an int overflowing to a negative value.
I believe that if you switch to version 2.5.37 or more recent, you'll find that most of those ints have become size_t and you should have no problem calling yy_scan_bytes with an input buffer whose size exceeds 2 gigabytes. (The prototype for that function now takes a size_t rather than an int, for example.)
I have a hard time believing that doing so is a good idea, however. For a start, yy_scan_bytes copies the entire string, because the lexical scanner wants a string it is allowed to modify, and because it wants to assure itself that the string has two NUL bytes at the end. Making that copy is going to needlessly use up a lot of memory, and if you're going to copy the buffer anyway, you might as well copy it in manageable pieces (say, 64 KiB or even 1 MiB). That will only prove problematic if you have single tokens which are significantly larger than the chunk size, because flex is definitely not optimized for large single tokens. But for all normal use cases, it will probably work out a lot better.
Flex doesn't provide an interface for splitting a huge input buffer into chunks, but you can do it very easily by redefining the YY_INPUT macro. (If you do that, you'll probably end up using yyin as a pointer to your own buffer structure, which is theoretically non-portable. However, it will work on any POSIX architecture, where all object pointers have the same representation.)
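Here is a minimal sketch of that approach, assuming a hypothetical chunk_source struct that you stash in yyin (the cast is exactly the theoretically non-portable part mentioned above); it would go in the definitions section of the .l file:

    /* hypothetical structure describing the big in-memory buffer */
    struct chunk_source {
        const char *data;   /* start of the serialized JSON */
        size_t      size;   /* total number of bytes */
        size_t      pos;    /* bytes already handed to flex */
    };

    /* hand flex at most max_size bytes per refill instead of the whole buffer */
    #define YY_INPUT(buf, result, max_size)                              \
        do {                                                             \
            struct chunk_source *src = (struct chunk_source *)yyin;     \
            size_t n = src->size - src->pos;                             \
            if (n > (size_t)(max_size)) n = (max_size);                  \
            memcpy((buf), src->data + src->pos, n);                      \
            src->pos += n;                                               \
            (result) = n;   /* 0 tells flex it has reached end of input */ \
        } while (0)

Before calling yylex() you would then point yyin at a chunk_source describing your string, e.g. yyin = (FILE *)&my_source;.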
Of course, you would normally not want to wait for 3 GB of data to accumulate in memory before starting to parse it. You could parse incrementally as you read the data. (You might still need to redefine YY_INPUT, depending on how you are reading the data.)

Related

How to determine how much space to allocate for boost::stacktrace::safe_dump_to?

I'm looking at the boost::stacktrace::safe_dump_to API and I cannot for the life of me work out how to determine how much space to allocate for the safe_dump_to() call. If I pass (nullptr, 0), it just returns 0, so that's not it. I could guess some constant number, but how do I know that's enough?
The docs specify:
This header contains low-level async-signal-safe functions for dumping call stacks. Dumps are binary serialized arrays of void*, so you could read them by using 'od -tx8 -An stacktrace_dump_failename' Linux command or using boost::stacktrace::stacktrace::from_dump functions.
And additionally
Returns:
Stored call sequence depth including terminating zero frame. To get the actually consumed bytes multiply this value by the sizeof(boost::stacktrace::frame::native_frame_ptr_t))
It's not overly explicit, but this means you need sizeof(boost::stacktrace::frame::native_frame_ptr_t)*N where N is the number of stackframes in the trace.
Now of course, you can find out N, but there is no async-signal-safe way to allocate dynamically anyway, so you'd simply have to pick a number that suits your application. E.g. 256 frames might be reasonable, but you should look at your own needs (e.g. DEBUG builds will show a lot more stack frames, especially with code that relies heavily on templates; YMMV).
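For example, a minimal sketch of sizing a fixed buffer for a signal handler (the 256-frame limit and the handler name are illustrative, not prescribed by the library):

    #include <boost/stacktrace.hpp>
    #include <cstddef>

    static const std::size_t kMaxFrames = 256;
    static const std::size_t kDumpSize =
        kMaxFrames * sizeof(boost::stacktrace::frame::native_frame_ptr_t);
    static char dump_buffer[kDumpSize];   // static storage: no allocation in the handler

    void crash_handler(int) {
        // Stores at most kMaxFrames frames and returns the stored depth.
        std::size_t depth = boost::stacktrace::safe_dump_to(dump_buffer, kDumpSize);
        (void)depth;
        // re-raise the signal or _exit() here...
    }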
Since the whole safe_dump_to construct is designed to be async-safe, I always just use the overload that writes to a file. When reading it back (typically after a restart) you will be able to deduce the number of frames from the file-size.
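A minimal sketch of the read-back side, assuming the dump was written to "./backtrace.dump" (the file name is illustrative):

    #include <boost/stacktrace.hpp>
    #include <fstream>
    #include <iostream>

    int main() {
        std::ifstream ifs("./backtrace.dump");
        if (ifs) {
            boost::stacktrace::stacktrace st =
                boost::stacktrace::stacktrace::from_dump(ifs);
            std::cout << st << '\n';   // prints the symbolized frames
        }
    }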
Optionally see some of my answers for code samples/more background information

std::string vs. byte buffer (difference in c++)

I have a project where I transfer data between client and server using boost.asio sockets. Once one side of the connection receives data, it converts it into a std::vector of std::strings, which is then passed on to the actual recipient object of the data via previously defined "callback" functions. That works fine so far; only, I am at this point using methods like atoi() and to_string to convert data types other than strings into a sendable format and back. This method is of course a bit wasteful in terms of network usage (especially when transferring bigger amounts of data than just single ints and floats). Therefore I'd like to serialize and deserialize the data. Since, effectively, any serialisation method will produce a byte array or buffer, it would be convenient for me to just use std::string instead. Is there any disadvantage to doing that? I don't see why there should be one, since strings should be nothing more than byte arrays.
In terms of functionality, there's no real difference.
Both for performance reasons and for code clarity, however, I would recommend using std::vector<uint8_t> instead, as it makes it far clearer to anyone maintaining the code that it's a sequence of bytes, not a string.
You should use std::string when you work with strings; when you work with a binary blob, you are better off with std::vector<uint8_t>. There are many benefits:
your intention is clear, so the code is less error-prone
you would not pass your binary buffer by mistake to a function that expects a std::string
you can overload operator<< for std::ostream for this type to print the blob in a proper format, usually a hex dump (see the sketch after this list); it is very unlikely that you would want to print a binary blob as a string
there could be more. The only benefit of std::string that I can see is that you do not need a typedef.
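A minimal sketch of that operator<< idea, assuming a simple ByteBuffer alias (the name is illustrative):

    #include <cstdint>
    #include <iomanip>
    #include <ostream>
    #include <vector>

    using ByteBuffer = std::vector<std::uint8_t>;

    std::ostream& operator<<(std::ostream& os, const ByteBuffer& blob) {
        // Print the blob as a hex dump rather than as raw characters.
        std::ios::fmtflags flags = os.flags();
        char fill = os.fill('0');
        for (std::uint8_t b : blob)
            os << std::hex << std::setw(2) << static_cast<unsigned>(b) << ' ';
        os.fill(fill);
        os.flags(flags);
        return os;
    }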
You're right. Strings are nothing more than byte arrays. std::string is just a convenient way to manage the buffer array that represents the string. That's it!
There's no disadvantage to using std::string unless you are working on something REALLY REALLY performance-critical, like a kernel, for example... then working with std::string would have a considerable overhead. Besides that, feel free to use it.
--
An std::string behind the scenes needs to do a bunch of checks about the state of the string in order to decide whether it will use the small-string optimization or not. Today pretty much all compilers implement the small-string optimization. They all use different techniques, but basically the string needs to test bit flags that tell whether its data lives on the stack or on the heap. This overhead doesn't exist if you use a plain char[]. But again, unless you are working on something REALLY critical, like a kernel, you won't notice anything, and std::string is much more convenient.
Again, this is just ONE of the things that happens under the hood, just as an example to show the difference of them.
Depending on how often you're firing network messages, std::string should be fine. It's a convenience class that handles a lot of char work for you. If you have a lot of data to push, though, it might be worth using a plain char array and converting it to bytes, just to minimise the extra overhead std::string has.
Edit: if someone could comment and point out why you think my answer is bad, that'd be great and help me learn too.

Variable Length Array Performance Implications (C/C++)

I'm writing a fairly straightforward function that sends an array over to a file descriptor. However, in order to send the data, I need to append a one byte header.
Here is a simplified version of what I'm doing and it seems to work:
void SendData(uint8_t* buffer, size_t length) {
    uint8_t buffer_to_send[length + 1];
    buffer_to_send[0] = MY_SPECIAL_BYTE;
    memcpy(buffer_to_send + 1, buffer, length);
    // more code to send the buffer_to_send goes here...
}
Like I said, the code seems to work fine, however, I've recently gotten into the habit of using the Google C++ style guide since my current project has no set style guide for it (I'm actually the only software engineer on my project and I wanted to use something that's used in industry). I ran Google's cpplint.py and it caught the line where I am creating buffer_to_send and threw some comment about not using variable length arrays. Specifically, here's what Google's C++ style guide has to say about variable length arrays...
http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Variable-Length_Arrays_and_alloca__
Based on their comments, it appears I may have found the root cause of seemingly random crashes in my code (which occur very infrequently, but are nonetheless annoying). However, I'm a bit torn as to how to fix it.
Here are my proposed solutions:
Make buffer_to_send essentially a fixed length array of a constant length. The problem that I can think of here is that I have to make the buffer as big as the theoretically largest buffer I'd want to send. In the average case, the buffers are much smaller, and I'd be wasting about 0.5KB doing so each time the function is called. Note that the program must run on an embedded system, and while I'm not necessarily counting each byte, I'd like to use as little memory as possible.
Use new and delete or malloc/free to dynamically allocate the buffer. The issue here is that the function is called frequently and there would be some overhead in terms of constantly asking the OS for memory and then releasing it.
Use two successive calls to write() in order to pass the data to the file descriptor. That is, the first write would pass only the one byte, and the next would send the rest of the buffer. While seemingly straightforward, I would need to research the code a bit more (note that I got this code handed down from a previous engineer who has since left the company I work for) in order to guarantee that the two successive writes occur atomically. Also, if this requires locking, then it essentially becomes more complex and has more performance impact than case #2.
Note that I cannot make the buffer_to_send a member variable or scope it outside the function since there are (potentially) multiple calls to the function at any given time from various threads.
Please let me know your opinion and what my preferred approach should be. Thanks for your time.
You can fold the two successive calls to write() in your option 3 into a single call using writev().
http://pubs.opengroup.org/onlinepubs/009696799/functions/writev.html
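A minimal sketch of that, assuming a POSIX file descriptor fd and the MY_SPECIAL_BYTE constant from your code (error handling and short-write handling are omitted):

    #include <sys/uio.h>   // writev, struct iovec
    #include <unistd.h>
    #include <cstdint>
    #include <cstddef>

    void SendData(int fd, uint8_t* buffer, size_t length) {
        uint8_t header = MY_SPECIAL_BYTE;
        struct iovec iov[2];
        iov[0].iov_base = &header;   // the one-byte header
        iov[0].iov_len  = 1;
        iov[1].iov_base = buffer;    // the payload, sent without copying
        iov[1].iov_len  = length;
        // A single writev() hands both pieces to the kernel together, so there
        // is no window for another thread's write to interleave between them.
        ssize_t written = writev(fd, iov, 2);
        (void)written;   // check for errors / short writes in real code
    }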
I would choose option 1. If you know the maximum length of your data, then allocate that much space (plus one byte) on the stack using a fixed-size array. This is no worse than the variable length array you have shown, because you must always have enough space left on the stack; otherwise you simply won't be able to handle your maximum length (at worst, your code would randomly crash on larger buffer sizes). At the time this function is called, nothing else will be using that further space on your stack, so it will be safe to allocate a fixed-size array.
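A minimal sketch of that option, assuming a hypothetical compile-time maximum kMaxPayload (pick the real limit for your protocol) and the MY_SPECIAL_BYTE constant from the question:

    #include <cassert>
    #include <cstdint>
    #include <cstring>

    static const size_t kMaxPayload = 512;   // illustrative upper bound

    void SendData(uint8_t* buffer, size_t length) {
        assert(length <= kMaxPayload);             // enforce the documented maximum
        uint8_t buffer_to_send[kMaxPayload + 1];   // fixed size: no VLA, no heap
        buffer_to_send[0] = MY_SPECIAL_BYTE;
        memcpy(buffer_to_send + 1, buffer, length);
        // send length + 1 bytes of buffer_to_send here...
    }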

Is it recommended to std::move a string into a container when the string is going to be overwritten?

I have the following code
std::vector<std::string> lines;
std::string currentLine;
while (std::getline(std::cin, currentLine)) {
    // option 1:
    // lines.push_back(std::move(currentLine));

    // option 2:
    // lines.push_back(currentLine);
}
I see different costs for the two
The first approach will clear currentLine, so getline needs to allocate a new buffer for the next line; the moved-from buffer is used by the string in the vector instead.
The second approach lets getline reuse currentLine's buffer, but requires a new buffer allocation for the string stored in the vector.
In such situations, is there a "better" way? Can the compiler optimize the one or other approach more efficiently? Or are there clever string implementations that make one option way more performant than the other?
Given the prevalence of the short string optimization, my immediate guess is that in many cases none of this will make any difference at all -- with SSO, a move ends up copying the contained data anyway (even if the source is an rvalue so it's eligible as the source for a move).
Between the two you've given, I think I'd tend to favor the non-moving version, but I doubt it's going to make a big difference either way. Given that (most of the time) you're going to be re-using the source immediately after the move, I doubt that moving is really going to do a lot of good (even at best). Assuming SSO isn't involved, your choice is between creating a new string in the vector to hold a copy of the string you read, or moving from the string you read and (in essence) creating a new string to hold the next line in the next iteration. Either way, the expensive part (allocating a buffer to hold the string, copying data into that buffer) happens about equally often.
As far as: "is there a better way" goes, I can think of at least a couple possibilities. The most obvious would be to memory map the file, then walk through that buffer, find the ends of lines, and use emplace_back to create strings in the vector directly from the data in the buffer, with no intermediate strings at all.
That does have the minor disadvantage of memory mapping not being standardized -- if you can't live with that level of non-portability, you can read the whole file into a buffer instead of memory mapping.
The next possibility after that would be to create a class with an interface like a const string's, one that just maintains a pointer to the data in the big buffer instead of making a copy of it (e.g., Clang uses something like this). This will typically reduce total allocation, heap fragmentation, etc., but if you (for example) need to modify the strings afterward, it's unlikely to be of much (if any) use.
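A minimal sketch of that idea using C++17's std::string_view, which has essentially the interface described (reading the whole stream into one buffer stands in for memory mapping; the names are illustrative):

    #include <iostream>
    #include <iterator>
    #include <string>
    #include <string_view>
    #include <vector>

    int main() {
        // One big buffer holding the whole input; it must outlive the views.
        std::string buffer((std::istreambuf_iterator<char>(std::cin)),
                           std::istreambuf_iterator<char>());

        std::vector<std::string_view> lines;
        std::string_view rest = buffer;
        while (!rest.empty()) {
            std::size_t nl = rest.find('\n');
            lines.push_back(rest.substr(0, nl));   // points into buffer, no copy
            if (nl == std::string_view::npos) break;
            rest.remove_prefix(nl + 1);
        }
        std::cout << lines.size() << " lines\n";
    }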

Resizable char buffer container type for C++

I'm using libcurl (an HTTP transfer library) with C++ and trying to download files from remote HTTP servers. As the file is downloaded, my callback function is called multiple times (e.g. every 10 KB) to hand me buffer data.
Basically I need something like a "string buffer": a data structure to append char buffers to an existing string. In C, I allocate (malloc) a char* and then, as new buffers come in, I realloc and then memcpy so that I can easily copy my buffer into the resized array.
In C++, there are multiple solutions to achieve this.
I can keep using malloc, realloc, memcpy but I'm pretty sure that they are not recommended in C++.
I can use vector<char>.
I can use stringstream.
My use case is: I'll append a few thousand items (chars) at a time, and after it all finishes (the download is completed), I will read all of it at once. But I may need options like seek in the future (easy to achieve with the array solution (1)), though that is low priority now.
What should I use?
I'd go for stringstream. Just insert into it as you receive the data, and when you're done you can extract a full std::string from it. I don't see why you'd want to seek into an array? Anyway, if you know the block size, you can calculate where in the string the corresponding block went.
I'm not sure if many will agree with this, but for that use case I would actually use a linked list, with each node containing an arbitrarily large array of chars allocated using new. My reasoning being:
Items are added in large chunks at a time, one at a time at the back.
I assume this could use quite a large amount of space, so you avoid reallocation events when a vector would otherwise need more space.
Since items are read sequentially, the penalty of linked lists being unidirectional doesn't affect you.
Should seeking through the list become a priority, this wouldn't work though. If it's not a lot of data ultimately, I honestly think a vector would be fine, despite not being the most efficient structure.
If you just need to append char buffers, you can also simply use std::string and its member function append. On top of that, stringstream gives you formatting functionality, so you can add numbers, padding, etc., but from your description you appear not to need that.
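A minimal sketch of the append approach, assuming the data arrives through a libcurl write callback as described in the question (the callback and variable names are illustrative):

    #include <curl/curl.h>
    #include <string>

    static size_t WriteToString(char* ptr, size_t size, size_t nmemb, void* userdata) {
        auto* body = static_cast<std::string*>(userdata);
        body->append(ptr, size * nmemb);   // append this chunk to the growing buffer
        return size * nmemb;               // tell libcurl the whole chunk was consumed
    }

    // Usage:
    //   std::string body;
    //   curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteToString);
    //   curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);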
I would use vector<char>. But they will all work even with a seek, so your question is really one of style and there are no definitive answers there.
I think I'd use a deque<char>. Same interface as vector, and vector would do, but vector needs to copy the whole data each time an append exceeds its existing capacity. Growth is exponential, but you'd still expect about log N reallocations, where N is the number of equal-sized blocks of data you append. Deque doesn't reallocate, so it's the container of choice in cases where a vector would need to reallocate several times.
Assuming the callback is handed a char* buffer and length, the code to copy and append the data is simple enough:
mydeque.insert(mydeque.end(), buf, buf + len);
To get a string at the end, if you want one:
std::string mystring(mydeque.begin(), mydeque.end());
I'm not exactly sure what you mean by seek, but obviously deque can be accessed by index or iterator, same as vector.
Another possibility, though, is that if you expect a content-length at the start of the download, you could use a vector and reserve() enough space for the data before you start, which avoids reallocation. That depends on what HTTP requests you're making, and to what servers, since some HTTP responses will use chunked encoding and won't provide the size up front.
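A minimal sketch of that reserve() idea, assuming you learned the expected size up front (e.g. from a Content-Length header); expected_size, ptr and len are hypothetical names:

    #include <cstddef>
    #include <vector>

    std::vector<char> body;
    // One allocation up front, so later appends do not reallocate.
    void Prepare(std::size_t expected_size) {
        body.reserve(expected_size);
    }
    // Per downloaded chunk:
    void Append(const char* ptr, std::size_t len) {
        body.insert(body.end(), ptr, ptr + len);
    }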
Create your own Buffer class to abstract away the details of the storage. If I were you I would likely implement the buffer based on std::vector<char>.