Context: I'm using a .mbtiles file, a geomapping file format, which is a sqlite database file containing vector tiles.
Those vector tiles are packed using protocol buffer and then gzipped.
I'm using C++ and am currently reading the zlib usage example for decompression, but I am not sure how to handle chunks and the end-of-stream event.
SQLite gives me a void* pointer and a length.
I quote the page:
For applications where zlib streams are embedded in other data, this
routine would need to be modified to return the unused data, or at
least indicate how much of the input data was not used, so the
application would know where to pick up after the zlib stream.
The protocol buffer class methods take either void* or std::string. I guess I should go with void*.
I'm not sure how those events work, and the example doesn't seem to cover the byte-array case. How should I change the code to avoid errors?
It sounds like SQLite is giving you a zlib stream without anything after it. If so, then that comment doesn't apply.
In any case, you are looking at the right page. (You didn't say what "the page" is, but I recognize the quote, since I wrote it.) That shows in general how to use the zlib functions. You should be able to figure out how to apply it to a byte array instead of file input.
If the data is really "gzipped", then you will need to use inflateInit2() instead of inflateInit(). Read the zlib documentation in zlib.h.
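For illustration, here is a minimal sketch of inflating a single gzip stream held entirely in memory (the blob SQLite hands you as void* + length). The helper name and the 16 KB output chunk size are invented for the example:

#include <zlib.h>

#include <stdexcept>
#include <string>

// Decompress one complete gzip stream held in a byte array.
std::string gunzip(const void* src, size_t srcLen) {
    z_stream strm{};                                  // zero-initialize all fields
    // 16 + MAX_WBITS tells inflate to expect a gzip header and trailer
    if (inflateInit2(&strm, 16 + MAX_WBITS) != Z_OK)
        throw std::runtime_error("inflateInit2 failed");

    strm.next_in  = static_cast<Bytef*>(const_cast<void*>(src));
    strm.avail_in = static_cast<uInt>(srcLen);

    std::string out;
    char buf[16384];
    int ret;
    do {
        strm.next_out  = reinterpret_cast<Bytef*>(buf);
        strm.avail_out = sizeof(buf);
        ret = inflate(&strm, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END) {     // corrupt or truncated data
            inflateEnd(&strm);
            throw std::runtime_error("inflate failed");
        }
        out.append(buf, sizeof(buf) - strm.avail_out);
    } while (ret != Z_STREAM_END);                    // this is the end-of-stream "event"

    inflateEnd(&strm);
    return out;
}

The returned string can then go straight to the protobuf message's ParseFromString(); alternatively, ParseFromArray() takes a const void* plus a size, so either protobuf entry point works.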
Related
I am working on a TCP server using Boost.Asio, and I'm unsure which data type is best to use when dealing with byte buffers.
Currently I am using std::vector<char> for everything. One reason is that most of the asio examples use vectors or arrays. I receive data from the network and put it in a buffer vector. Once a packet is available, it is extracted from the buffer and decrypted/decompressed if needed (both operations may produce a larger amount of data). Then multiple messages are extracted from the payload.
I am not happy with this solution because it involves constantly inserting data into and removing data from vectors, but it does the job.
Now I need to work on data serialization. There is no easy way to read or write arbitrary data types from a char vector, so I ended up implementing a "buffer" that hides a vector inside and allows writes (a wrapper around insert) and reads (a wrapper around casting). Then I can write uint16_t code; buffer >> code; and also add serialization/deserialization methods to other objects while keeping things simple. Something along those lines is sketched below.
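A rough sketch of the kind of wrapper I mean (the class and member names are invented; memcpy is used instead of a raw cast so reads stay safe regardless of alignment):

#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <type_traits>
#include <vector>

class ByteBuffer {
public:
    void write(const void* data, std::size_t len) {
        const char* p = static_cast<const char*>(data);
        buf_.insert(buf_.end(), p, p + len);           // append raw bytes
    }
    template <typename T>
    ByteBuffer& operator>>(T& value) {
        static_assert(std::is_trivially_copyable<T>::value,
                      "only trivially copyable types");
        if (readPos_ + sizeof(T) > buf_.size())
            throw std::runtime_error("buffer underflow");
        std::memcpy(&value, buf_.data() + readPos_, sizeof(T));
        readPos_ += sizeof(T);
        return *this;
    }
private:
    std::vector<char> buf_;
    std::size_t readPos_ = 0;
};

// usage: uint16_t code; buffer >> code;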
The thing is that every time I think about this, I feel like I am using the wrong data type as the container for the binary data. My reasons:
Streams already do a good job as a potentially endless source or sink of data. Behind the scenes this may still involve inserting and removing data, but they probably do a better job of it than a char vector.
Streams already allow to read or write basic data types, so I don't have to reinvent the wheel.
There is no need to access a specific position in the data. Usually I need to read or write sequentially.
In this case, are streams the better choice, or is there something I am not seeing? And if so, is stringstream the one I should use?
Any reasons to avoid streams and work only with containers?
PS: I cannot use Boost.Serialization or any other existing solution, because I don't have control over the network protocol.
Your approach seems fine. You might consider a deque instead of a vector if you're doing a lot of stealing from the end and erasing from the front, but if you use circular-buffer logic while iterating then this doesn't matter either.
You could switch to a stream, but then you're completely at the mercy of the standard library, its annoyances/oddities, and the semantics of its formatted extraction routines — if these are insufficient then you have to extract N bytes and do your own reinterpretation anyway, so you're back to square one but with added copying and indirection.
You say you don't need random access, so that's another reason not to care either way. Personally, I like to have random access in case I need to resync, seek ahead, or seek behind, or even just to have better capabilities during debugging without suddenly having to refactor all my buffer code.
I don't think there's any more specific answer to this in the general case.
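For what it's worth, a minimal sketch of the deque idea from the first paragraph (the names and the fixed-length framing are invented; a real protocol would parse a length prefix first):

#include <cstddef>
#include <deque>
#include <vector>

// Bytes arrive at the back, complete packets come off the front; a deque
// makes both ends cheap, with no mid-container shuffling.
std::deque<char> rx;

void onReceive(const char* data, std::size_t len) {
    rx.insert(rx.end(), data, data + len);            // append new bytes
}

bool tryExtractPacket(std::vector<char>& packet, std::size_t packetLen) {
    if (rx.size() < packetLen)
        return false;                                 // not enough data yet
    packet.assign(rx.begin(), rx.begin() + packetLen);
    rx.erase(rx.begin(), rx.begin() + packetLen);     // cheap at the front
    return true;
}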
I want to compress my data using zlib's compress() function, so the code looks like the following:
ifs.read(srcBuf, srcLen);                  // std::ifstream, srcLen = 256 KB
compress(dstBuf, &dstLen, srcBuf, srcLen); // casts are omitted
ofs.write(dstBuf, dstLen);                 // std::ofstream
dstLen = dstBufSize;                       // restore available output size
The resulting file is only ~5% smaller than the original (360 MB vs. 380 MB), which is, frankly, awful.
Meanwhile, WinRAR compresses this file down to 70 MB. I've tried bzip2 and zlib, and both give similar results. I guess the problem is that the 256 KB buffer is too small, but I'd like to understand how this works and how I can use zlib to achieve better compression.
Overall, I want to build a low-level facility to compress several files into one big file for internal use, and compress() looked well suited for it, but...
Deep explanations are very welcome. Thanks in advance.
I believe your problem is that by using the compress() function (rather than deflateInit()/deflate()/deflateEnd()), you are underutilizing zlib's compression abilities.
The key insight here is that deflate (the algorithm zlib implements) compresses by maintaining a sliding window, up to 32 KB, of the most recent input, which acts as a dictionary: whenever a sequence of bytes repeats something seen earlier in that window, it is replaced by a short back-reference, and Huffman coding then encodes those symbols compactly. That way, repeated sequences in the input stream take up very little room in the output stream, greatly reducing the total size of the compressed data.
However, the efficiency of that process depends a lot on the persistence of that built-up state, which in turn depends on your program keeping the deflate algorithm's state alive for the entire duration of the compression process. But your code is calling compress(), which is meant to be a single-shot convenience function for small amounts of data, and as such compress() does not provide any way for your program to retain state across multiple calls. With each call to compress(), a brand-new window and fresh Huffman codes are built, used for the data passed to that call, and then forgotten -- they are inaccessible to any subsequent compress() calls. That is likely the source of the poor efficiency you are seeing.
The fix is to not use compress() in cases where you need to compress the data in more than one step. Instead, call deflateInit() (to allocate the state for the algorithm), then call deflate() multiple times (to compress data using, and updating, that state), and then finally call deflateEnd() to clean up.
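A sketch along the lines of zlib's zpipe.c example, assuming the same 256 KB buffers as in the question (error handling abbreviated; the function name is invented):

#include <zlib.h>

#include <fstream>
#include <stdexcept>

void compressFile(std::ifstream& ifs, std::ofstream& ofs) {
    static char src[262144], dst[262144];        // 256 KB buffers, as in the question
    z_stream strm{};
    if (deflateInit(&strm, Z_DEFAULT_COMPRESSION) != Z_OK)
        throw std::runtime_error("deflateInit failed");
    int flush;
    do {
        ifs.read(src, sizeof src);
        strm.next_in  = reinterpret_cast<Bytef*>(src);
        strm.avail_in = static_cast<uInt>(ifs.gcount());
        flush = ifs.eof() ? Z_FINISH : Z_NO_FLUSH;   // tell zlib when input ends
        do {                                         // drain all pending output
            strm.next_out  = reinterpret_cast<Bytef*>(dst);
            strm.avail_out = sizeof dst;
            deflate(&strm, flush);                   // state persists across calls
            ofs.write(dst, sizeof dst - strm.avail_out);
        } while (strm.avail_out == 0);
    } while (flush != Z_FINISH);
    deflateEnd(&strm);
}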
Use deflateInit(), deflate(), and deflateEnd() instead of compress(). I don't know whether or not that will improve the compression, since you provided no information on the data and only the slightest clue as to what your program does (are those lines inside a loop?). However, if you are compressing something large that you are not loading into memory all at once, then you should not use compress().
I have an application that takes a handle and performs some tasks. The handle is currently being created with CreateFile. My problem is that CreateFile takes a file path as one of its arguments. I am looking for a way to get a handle from a byte array, because the data I need to process is not on disk. Does anyone know of any functions that take a byte array and return a handle, or how I would go about doing this?
You have a few choices:
re-design your processing logic to read data from a memory pointer instead of a HANDLE; then you can pass your byte array as-is to your processing logic. If you also need to process a file, you can read the file data into a byte array and then process it the same way.
re-design your processing logic to read data from an IStream interface, then you can use SHCreateStreamOnFileEx() and SHCreateMemStream(), like Jonathan Potter suggested.
if you must read data from a HANDLE using ReadFile() or a related function, you can either:
a. write your byte array to a temp file, then read back from that file.
b. create an anonymous pipe using CreatePipe(), then write the byte array to one end and read the data from the other end, like Harry Johnston suggested (sketched below).
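A rough sketch of option (b); the helper name is invented, and note that the pipe buffer is finite, so for large arrays the writing side may need its own thread to avoid deadlock:

#include <windows.h>

// Expose an in-memory byte array through a HANDLE via an anonymous pipe.
HANDLE handleFromBytes(const BYTE* data, DWORD len) {
    HANDLE readEnd = NULL, writeEnd = NULL;
    if (!CreatePipe(&readEnd, &writeEnd, NULL, len))  // len = buffer size hint
        return NULL;
    DWORD written = 0;
    WriteFile(writeEnd, data, len, &written, NULL);   // fill the pipe
    CloseHandle(writeEnd);            // reader sees EOF after the data
    return readEnd;                   // caller reads with ReadFile(), then closes
}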
Using CreateFile() with the FILE_ATTRIBUTE_TEMPORARY attribute allows the operating system to keep the file in memory. You still have a copy happening, as you have to write your memory buffer to the file and then read the data back from it, but if you have enough cache memory, nothing will hit the hard drive.
See here for more details:
CREATEFILE2_EXTENDED_PARAMETERS structure | Caching Behavior
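A sketch of that temp-file variant (the path, file name, and helper name are placeholders; FILE_FLAG_DELETE_ON_CLOSE cleans the file up automatically):

#include <windows.h>

HANDLE tempFileFromBytes(const BYTE* data, DWORD len) {
    HANDLE h = CreateFileW(L"C:\\Temp\\scratch.bin",
                           GENERIC_READ | GENERIC_WRITE,
                           0, NULL, CREATE_ALWAYS,
                           FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE,
                           NULL);
    if (h == INVALID_HANDLE_VALUE)
        return h;
    DWORD written = 0;
    WriteFile(h, data, len, &written, NULL);  // copy the buffer into the file
    SetFilePointer(h, 0, NULL, FILE_BEGIN);   // rewind so the consumer
    return h;                                 // can read from the start
}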
You might also be able to use file mapping, where the data written to the file is forced to stay in memory, but that's a lot more complicated for probably no gain, as it would quite likely be slower overall.
I have a file in binary format having a large amount of data.
If I have knowledge of the file structure, how do I read information from the binary file, and populate a record of these structures?
The data is complex.
I'd like to do it with Qt, but I could do it in plain C++ as well, if required.
Thanks for your help.
If the binary file is really large, then it's best to load it into a char* array (if enough RAM is available) via a low-level read function (http://crasseux.com/books/ctutorial/Reading-files-at-a-low-level.html) and then parse it.
But that only helps you load large files, not parse complex structures.
Not sure, but you could also take a look at yacc.
It doesn't sound like yacc would be a solution; he isn't trying to parse the file, he wants to read binary-formatted data into a data structure.
You can read the data in and then map it to a struct that matches the format. If the data is complex, you may need to lay structs over it in a variety of ways, depending on how the data layout works. So basically: read the file into a char* buffer, select the element where your struct starts, cast that element to a pointer to your struct, and then access the members. Without more detail it's impossible to be more specific than this; a rough sketch follows.
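A minimal sketch of that overlay idea (the Record layout and file name are invented; memcpy is used instead of a raw pointer cast to sidestep alignment problems, and endianness is still the reader's responsibility):

#include <cstdint>
#include <cstring>
#include <fstream>
#include <iterator>
#include <vector>

#pragma pack(push, 1)   // no padding, so the struct matches the on-disk bytes
struct Record {
    uint16_t type;
    uint32_t length;
};
#pragma pack(pop)

int main() {
    std::ifstream ifs("data.bin", std::ios::binary);
    std::vector<char> buf((std::istreambuf_iterator<char>(ifs)),
                          std::istreambuf_iterator<char>());
    if (buf.size() >= sizeof(Record)) {
        Record rec;
        std::memcpy(&rec, buf.data(), sizeof rec);  // overlay at offset 0
        // ... use rec.type / rec.length to find the next record ...
    }
}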
http://courses.cs.vt.edu/~cs2604/fall00/binio.html may be helpful for you; I learned from there. (Hint: always cast your data as (char*).)
I have looked around a bit, but I was unable to find what I figured might already have been created.
I am looking for an application that reads in a binary file, lets you specify in some way the patterns/rules that are expected (like a set of messages, each of which is header + data), and then deserializes the data into a text format based on those patterns/rules (e.g., the binary file is a set of M messages, each serialized directly to the file behind a header that gives the struct type and the number of bytes the struct's serialization occupies).
Specifically, let's say I know ahead of time that I will have a file containing a sequence of serialized C structs (or C++ classes), each prepended by a header indicating which struct is serialized in the next N bytes (where N is contained in the header).
I know how to write C/C++ code to walk through and deserialize the data (provided I know all the types ahead of time), but I am wondering whether there exists some kind of application that would facilitate this process when you are not entirely sure of the format/structs ahead of time (other than a hex editor). Something graphical, where you could see the effect of changing the structs/rules/patterns dynamically, would be ideal if it exists.
boost::serialization already does something quite similar to this, without your having to get your hands quite as dirty in the details. It supports various archive formats, including XML, text, and binary ones, is very extensible, and can cope with smart pointers, containers, etc.
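For reference, the boost::serialization pattern looks roughly like this (the Message type is invented for illustration; a single serialize() member handles both saving and loading):

#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>

#include <fstream>
#include <string>

struct Message {
    int type = 0;
    std::string payload;

    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & type;
        ar & payload;
    }
};

int main() {
    {
        std::ofstream ofs("msg.txt");
        boost::archive::text_oarchive oa(ofs);
        Message m;
        m.type = 42;
        m.payload = "hello";
        oa << m;                       // write the record
    }
    std::ifstream ifs("msg.txt");
    boost::archive::text_iarchive ia(ifs);
    Message m;
    ia >> m;                           // read it back
}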