Memory consumption during parsing of YAML with yaml-cpp - c++

I am developing a Qt application for an embedded system with limited memory. I need to receive a few megabytes of JSON data and parse it as quickly as possible, without using too much memory.
I was thinking about using streams:
JSON source (HTTP client) ---> ZIP decompressor ---> YAML parser ---> objects mapped to database
Data will arrive from the network much more slowly than I can parse it.
How much memory does yaml-cpp need to parse 1 MB of data?
I would like the already-parsed raw data from the decompressor, and the internal memory the YAML parser used for that data, to be released as soon as the object mapped to the database is created. Is that possible?
Does yaml-cpp support asynchronous parsing, so that as soon as a JSON object has been parsed I can store it in the database without waiting for the full content from the HTTP source?

Since you have memory constraints and your data is already in JSON, you should use a low-memory JSON parser instead of a YAML parser. Try jsoncpp, although I'm not sure what its support for streaming is like (since JSON doesn't have the concept of documents).
yaml-cpp is designed for streaming, so it won't block if there are documents to parse but the stream is still open; however, there is an outstanding issue in yaml-cpp where it reads more than a single document at a time, so it really isn't designed for extremely low memory usage.
As for how much memory it takes to parse 1 MB of data, it is probably on the order of 3 MB (the raw input stream, plus the parsed stream, plus the resulting data structure), but it may vary dramatically depending on what kind of data you're parsing.
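For reference, here is a minimal sketch of document-at-a-time parsing with the old (0.3-era) yaml-cpp Parser/GetNextDocument API, which matches the streaming behaviour described above; newer releases (0.5+) expose YAML::Load/YAML::LoadAll instead, so treat the exact calls as version-dependent. The file name and the "map to database" step are placeholders.

```cpp
#include <yaml-cpp/yaml.h>
#include <fstream>
#include <iostream>

int main() {
    // "input.yaml" stands in for the decompressed network stream from the question.
    std::ifstream fin("input.yaml");

    YAML::Parser parser(fin);
    YAML::Node doc;
    while (parser.GetNextDocument(doc)) {
        // Map `doc` to a database object here. Reusing the same `doc` for the
        // next document lets the previous document's memory be reclaimed, which
        // is as close to "release as soon as the object is created" as this API gets.
        std::cout << "parsed a document with " << doc.size() << " top-level entries\n";
    }
    return 0;
}
```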

Related

Real time parsing

I am quite new to parsing text files. While googling a bit, I found out that a parser usually builds a tree structure out of a text file. Most of the examples consist of parsing whole files, which in my view is quite static: you load the file into the parser and get the output.
My problem is somewhat different from parsing files. I have a stream of JSON data coming from a server socket on TCP port 6000, and I need to parse the incoming data. I have some questions in mind:
1) Do I need to save the incoming JSON data on the client side in some sort of buffer? Answer: I think yes, I need to save it, but are there any parsers which can do it directly, e.g. by passing the JSON object as an argument to the parse function?
2) What would the structure of a real-time parser look like? Answer: On Google only the static parse-tree structure is described. In my view each object is parsed into some sort of parse tree and then deleted from memory; otherwise the continuous data would eventually overflow memory.
There are some parser libraries available, like JSON-C and JSON lib. One more thing that comes to mind: can we save a JSON object in a C/C++ array? I thought of that but couldn't figure out how to do it.
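To make question 1 concrete, here is a hedged sketch (assuming the objects arrive newline-delimited, which the question doesn't actually state) of the usual pattern: append incoming bytes to a buffer, cut out each complete object, hand it to a parser such as jsoncpp, and discard both the buffered bytes and the parsed value as soon as the object has been handled. `handle()` is a placeholder, and stdin stands in for the socket on port 6000.

```cpp
#include <json/json.h>   // jsoncpp
#include <iostream>
#include <string>

// Placeholder for whatever you do with one parsed object (store it, etc.).
static void handle(const Json::Value& obj) {
    std::cout << "object with " << obj.size() << " members\n";
}

int main() {
    std::string buffer;        // holds bytes received but not yet parsed
    Json::Reader reader;       // classic jsoncpp reader

    char chunk[4096];
    // std::cin stands in for recv() on the TCP socket.
    while (std::cin.read(chunk, sizeof chunk) || std::cin.gcount() > 0) {
        buffer.append(chunk, static_cast<std::size_t>(std::cin.gcount()));

        std::string::size_type pos;
        while ((pos = buffer.find('\n')) != std::string::npos) {
            std::string line = buffer.substr(0, pos);
            buffer.erase(0, pos + 1);           // drop consumed bytes immediately
            if (line.empty()) continue;

            Json::Value obj;                    // freed again at end of scope,
            if (reader.parse(line, obj))        // so memory stays bounded
                handle(obj);
        }
    }
    return 0;
}
```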

Why isn't lossless compression automatic on computers?

I was just wondering what the impact would be if, say, Microsoft decided to automatically apply lossless compression to every single file saved on a computer.
What are the pros? The cons? Is it feasible?
Speed.
When compressing a file of any kind you're encoding its contents in a more compact form, often using dictionaries and/or prefix codes (for example, Huffman coding). To access the data you have to decompress it, and this costs time and memory, because to access a specific piece of the file you have to decompress it as a whole. While decompressing, you have to save the results somewhere, and the most appropriate place is RAM.
Of course, decompressing the whole file wouldn't be a big problem if all of it needed to be read anyway, nor in the case of a streaming read; but if a program wanted to write to the compressed file, all of the data, or at least part of it, would have to be compressed again.
As you can see, compressing files in the filesystem would significantly reduce the bandwidth available to applications - to read a single byte you have to read a chunk of the file and decompress it - and it would also require more RAM.
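As a toy illustration of that read cost, here is a sketch using zlib's one-shot helpers (the string and sizes are arbitrary): even to look at a single byte of the original data, the whole compressed buffer has to be inflated back into RAM first.

```cpp
#include <zlib.h>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    const char* text = "hello hello hello hello hello hello hello hello";
    uLong srcLen = static_cast<uLong>(std::strlen(text) + 1);

    // Compress the whole buffer in one shot.
    uLong packedLen = compressBound(srcLen);
    std::vector<Bytef> packed(packedLen);
    compress2(packed.data(), &packedLen,
              reinterpret_cast<const Bytef*>(text), srcLen, Z_BEST_COMPRESSION);

    // To read just one byte of the original (say byte 10), the compressed
    // buffer has to be decompressed first - there is no random access.
    uLong restoredLen = srcLen;
    std::vector<Bytef> restored(restoredLen);
    uncompress(restored.data(), &restoredLen, packed.data(), packedLen);

    std::printf("%lu bytes -> %lu bytes compressed; byte 10 is '%c'\n",
                srcLen, packedLen, restored[10]);
    return 0;
}
```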

Which libxml2 API should I use for large files?

Our program currently uses the libxml2 DOM API (xmlReadFile) to load an entire file into memory. Unfortunately, this breaks down on "large" XML files, as the basic memory consumption of libxml2 DOM is about 4-5 times the base file size.
It seems libxml2 offers two APIs for reading XML when I don't want to store the whole tree in memory: SAX2 and xmlReader.
I haven't dug into the APIs yet, but I'm wondering which one is preferable under which circumstances?
Note: All I need to do with the XML file is populate some C++ data structures with the data found in it, and these will in turn be a lot smaller than the (very verbose) XML definition. At the moment, with xmlReadFile and the DOM API, the process takes about 100 MB of memory for a 20 MB XML file. The C++ data in memory for such a file is more like 5 MB -- so I could go from 1:4 to 4:1, which would already help a lot.
I follow this approach: if the processing is sparse (you only need an element here and there), xmlReader is better; if you need to process all elements, SAX is better. Although opinion comes into play as to whether you want to drive the processing yourself (xmlReader's pull model) or have the parser drive your code (SAX's push model)...
If you need to process large XML documents, then size becomes the primary consideration. As you saw with 20 MB -> 100 MB for DOM parsing, if you get much larger than this it can become prohibitively expensive, and SAX may be the only way to process it. For embedded or memory-constrained devices, SAX may be required even for small files.
If you want to start parsing before the file is complete SAX is the way to go. If you are writing a browser, are streaming XML, or require responsiveness then you will need to use SAX.
SAX is more of a pain: if you can get away with DOM parsing, it will usually lead to less and simpler code; for simple DOM queries you can avoid a state machine, for example. If you only care about a handful of fields in the document, you could even avoid querying the DOM directly and use XSLT instead.
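For reference, the xmlReader loop is only a handful of lines. This is a minimal sketch (the libxml2 calls are real, but error handling and the actual population of your C++ structures are left out, and "big.xml" is a placeholder):

```cpp
#include <libxml/xmlreader.h>
#include <cstdio>

int main() {
    xmlTextReaderPtr reader = xmlReaderForFile("big.xml", NULL, 0);
    if (reader == NULL)
        return 1;

    // Pull one node at a time; only the current node (plus its ancestors)
    // is kept in memory, so this scales to files far larger than the DOM path.
    while (xmlTextReaderRead(reader) == 1) {
        if (xmlTextReaderNodeType(reader) == XML_READER_TYPE_ELEMENT) {
            const xmlChar* name = xmlTextReaderConstName(reader);
            std::printf("element: %s\n", reinterpret_cast<const char*>(name));
            // ...populate your C++ data structures here...
        }
    }

    xmlFreeTextReader(reader);
    xmlCleanupParser();
    return 0;
}
```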

boost property tree performance

I am planning to use the Boost property tree in our application: http://www.boost.org/doc/libs/1_41_0/doc/html/property_tree.html. Now I wonder: every time we call pt.get("debug.level", 0), does it read the whole file again, or is the value served from an internal cache? Are there any performance evaluation results for this library? Does it read the whole file into memory and serve the data from there? Can anybody share their experience using this library?
The library works well. You load the file into memory, operate on the property tree (query, update, whatever), and then write it out again when you finish.
We have used it for some JSON files large enough to run out of address space when loading them on a 32 bit machine using a boost::property_tree with std::string. Replacing std::string with a caching string class worked fine.
For most applications where you're really just looking at configuration files it will be fine.
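To make the "load once, serve from memory" point concrete, here is a small sketch (assuming an INI-style file named config.ini with a [debug] section; swap in read_json or read_xml as appropriate for your format):

```cpp
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/ini_parser.hpp>
#include <iostream>

int main() {
    boost::property_tree::ptree pt;
    boost::property_tree::ini_parser::read_ini("config.ini", pt);   // one read from disk

    // Every get() after this is an in-memory tree lookup, not a file read.
    int level = pt.get("debug.level", 0);    // 0 is the default if the key is missing
    std::cout << "debug.level = " << level << "\n";

    pt.put("debug.level", level + 1);        // modify the in-memory tree...
    boost::property_tree::ini_parser::write_ini("config.ini", pt);  // ...and write it back when done
    return 0;
}
```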

Multi-part gzip file random access (in Java)

This may fall in the realm of "not really feasible" or "not really worth the effort" but here goes.
I'm trying to randomly access records stored inside a multi-part gzip file. Specifically, the files I'm interested in are compressed Heritrix ARC files. (In case you aren't familiar with multi-part gzip files, the gzip spec allows multiple gzip streams to be concatenated in a single gzip file. They do not share any dictionary information; it is simple binary appending.)
I'm thinking it should be possible to do this by seeking to a certain offset within the file, scanning for the gzip magic header bytes (i.e. 0x1f8b, as per the RFC), and attempting to read the gzip stream from the following bytes. The problem with this approach is that those same bytes can appear inside the actual data as well, so scanning for them can land on an invalid position from which to start reading a gzip stream. Is there a better way to handle random access, given that the record offsets aren't known a priori?
The BGZF file format, which is compatible with gzip, was developed by bioinformaticians:
(...) The advantage of BGZF over conventional gzip is that BGZF allows for seeking without having to scan through the entire file up to the position being sought.
In http://picard.svn.sourceforge.net/viewvc/picard/trunk/src/java/net/sf/samtools/util/, have a look at BlockCompressedOutputStream.java and BlockCompressedInputStream.java.
The design of GZIP, as you have realized, is not friendly to random access.
You can do as you describe, and then if you run into an error in the decompressor, conclude that the signature you found was actually compressed data.
If you finish decompressing, then it's easy to verify the validity of the stream just decompressed, via the CRC32.
If the files are not so big, you might consider just de-compressing all of the entries in series, and retaining the offsets of the signatures so as to build a directory. As you decompress, dump the bytes to a bit bucket. At that point you will have generated a directory, and you can then support random access based on filename, date, or other metadata.
This will be reasonably fast for files below 100k. Just as a guess, if you had 10 files of around 100k each, it would probably be done in 2 s on a modern CPU. This is what I mean by "reasonably fast". But only you know the performance requirements of your application.
Do you have a GZipInputStream class? If so, you are halfway there.
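The question is about Java, but the "decompress everything once and keep a directory of member offsets" idea is language-neutral. Below is a hedged C++/zlib sketch of that indexing pass; the function name build_member_index, the buffer sizes, and the minimal error handling are all illustrative choices, and the Java equivalent would sit on top of GZIPInputStream in the same way.

```cpp
#include <zlib.h>
#include <cstdio>
#include <vector>

// Decompress every gzip member in `path` once, throwing the output away and
// recording the file offset at which each member starts.
std::vector<long> build_member_index(const char* path) {
    std::vector<long> starts;
    std::FILE* f = std::fopen(path, "rb");
    if (!f)
        return starts;

    z_stream strm = {};
    if (inflateInit2(&strm, 15 + 16) != Z_OK) {   // 15+16: expect a gzip wrapper
        std::fclose(f);
        return starts;
    }

    unsigned char in[1 << 14], out[1 << 14];
    long buf_pos = 0;                  // file offset corresponding to in[0]
    starts.push_back(0);               // the first member starts at offset 0

    std::size_t n;
    while ((n = std::fread(in, 1, sizeof in, f)) > 0) {
        strm.next_in = in;
        strm.avail_in = static_cast<uInt>(n);
        while (strm.avail_in > 0) {
            strm.next_out = out;                   // decompressed output goes to a bit bucket
            strm.avail_out = sizeof out;
            int ret = inflate(&strm, Z_NO_FLUSH);
            if (ret == Z_STREAM_END) {
                // One member finished; the next one (if any) starts right
                // after the compressed bytes consumed so far.
                starts.push_back(buf_pos + (strm.next_in - in));
                inflateReset(&strm);
            } else if (ret != Z_OK) {
                // Corrupt or truncated input: stop indexing here.
                inflateEnd(&strm);
                std::fclose(f);
                return starts;
            }
        }
        buf_pos += static_cast<long>(n);
    }

    // If the file ends exactly on a member boundary, the last recorded offset
    // points at EOF and is not a real member - drop it.
    if (!starts.empty() && starts.back() >= buf_pos)
        starts.pop_back();

    inflateEnd(&strm);
    std::fclose(f);
    return starts;
}
```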