Looking around, I have found this question asked before, but not with great answers. Sorry if this is a Stack Overflow duplicate!
My goal is to have a zlib compressed file that I append to using C/C++ at different intervals (such as a log file). Due to buffer size constraints I was hoping to avoid having to keep the entire file in memory for appending new items.
Mark Adler's answer was very close to what I needed, but because I was already entrenched in the zlib library and working on an embedded device with limited resources, I was (and am) stuck.
I ended up simply appending a delimiter to each section of data (e.g. ##delimiter##). When the finished file is ready to be read, a different application seeks out these delimiters, builds an array of the compressed sections, and decompresses them individually.
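For anyone curious, a minimal sketch of that approach using zlib's one-shot compress(). The marker string and buffer handling are just illustrative, and note that in principle the marker bytes could also occur inside compressed data, so a length prefix would be more robust:

// Sketch: each log entry is compressed on its own with zlib's one-shot
// compress(), written to the file, then followed by the plain-text marker.
#include <zlib.h>
#include <cstdio>
#include <vector>

static const char kDelim[] = "##delimiter##";   // marker chosen for illustration

// Append one compressed entry plus the delimiter to an already-open file.
bool append_entry(std::FILE* f, const unsigned char* data, uLong len) {
    uLongf clen = compressBound(len);
    std::vector<unsigned char> out(clen);
    if (compress(out.data(), &clen, data, len) != Z_OK)
        return false;
    return std::fwrite(out.data(), 1, clen, f) == clen &&
           std::fwrite(kDelim, 1, sizeof(kDelim) - 1, f) == sizeof(kDelim) - 1;
}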
I am still marking Adler's answer as correct, as it was useful info that will be of more help to other programmers.
It sounds like you are trying to keep something like a compressed log, appending small amounts of data each time. For that you can look at gzlog.h and gzlog.c for an example of how to do this.
You can also look at gzappend, which appends data to a gzip file.
These are all easily adaptable to a zlib stream.
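For reference, using gzlog from zlib's examples directory looks roughly like the sketch below. The calls follow gzlog.h as I recall it (gzlog_open / gzlog_write / gzlog_close); check the actual header for the exact signatures before relying on this:

#include <cstring>
#include "gzlog.h"   // zlib/examples/gzlog.h and gzlog.c

int main() {
    gzlog *log = gzlog_open((char *)"app.gz");          // creates/extends app.gz
    if (log == NULL)
        return 1;
    const char line[] = "something happened\n";
    gzlog_write(log, (void *)line, std::strlen(line));  // file stays a valid gzip stream
    gzlog_close(log);
    return 0;
}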
I have to read a .dat file byte by byte from a zip file into a char[] buffer. The zip file contains only one .dat file. I guess unzipping chunk by chunk would be good. I am using Visual Studio 2013 with C++.
I have found zip-utils (http://www.codeproject.com/Articles/7530/Zip-Utils-clean-elegant-simple-C-Win); would this be OK, given that it is nearly 10 years old? Would Minizip be a good way? I guess zlib alone would not be enough for this use case, right?
My question is: what's the best way to do the unzipping? I have no experience with handling zip files and would like to hear a suggestion from somebody with experience.
Thank you,
Friedrich
Minizip would work. Note that it still requires the zlib source code to link against.
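A rough sketch of reading the single entry chunk by chunk with minizip's unzip API (unzip.h lives in zlib's contrib/minizip; the archive name and buffer size are placeholders):

#include <cstdio>
#include <vector>
#include "unzip.h"   // from zlib's contrib/minizip

int main() {
    unzFile zf = unzOpen("archive.zip");             // placeholder file name
    if (!zf) return 1;
    if (unzGoToFirstFile(zf) != UNZ_OK ||            // the archive holds a single .dat
        unzOpenCurrentFile(zf) != UNZ_OK) {
        unzClose(zf);
        return 1;
    }
    std::vector<char> chunk(64 * 1024);
    int n;
    while ((n = unzReadCurrentFile(zf, chunk.data(), (unsigned)chunk.size())) > 0) {
        // chunk.data() now holds n decompressed bytes of the .dat file
    }
    unzCloseCurrentFile(zf);
    unzClose(zf);
    return n < 0 ? 1 : 0;
}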
A zip file is not just chunks of zlib compressed content.
It's an archive.
There is a central directory header, and a per-entry header you must decode as well, even if the archive contains only a single file. Typically, the header will tell you at which offset in the zip file your compressed DAT content starts. Then you will likely use zlib to decode it chunk by chunk, starting at that offset.
Note also that the zip file format does not always imply zlib/deflate as the compressor (many different compression methods are possible). If you control the code that creates the zip file, that's not an issue. But if it comes from a hostile user, you should actually check the compression method used and assert that it is deflate; otherwise you should refuse to decompress the file, because you won't be able to.
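To make that concrete, here is a hedged sketch of parsing the local file header yourself and running a raw inflate from the data offset. It assumes a single deflate-compressed entry with sizes stored in the local header (no data descriptor, no zip64), so treat it as an illustration rather than a zip reader:

#include <zlib.h>
#include <cstdio>
#include <cstdint>
#include <vector>

static uint32_t le16(const unsigned char* p) { return p[0] | (p[1] << 8); }
static uint32_t le32(const unsigned char* p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

// f must be positioned at the start of the zip; decompressed bytes are
// produced in 16 KB chunks (consume them where the comment says so).
bool inflate_single_entry(std::FILE* f) {
    unsigned char hdr[30];
    if (std::fread(hdr, 1, 30, f) != 30) return false;
    if (le32(hdr) != 0x04034b50) return false;           // local file header signature
    if (le16(hdr + 8) != 8) return false;                 // 8 = deflate; refuse anything else
    uint32_t csize = le32(hdr + 18);                       // compressed size
    long skip = (long)(le16(hdr + 26) + le16(hdr + 28));   // file name + extra field lengths
    std::fseek(f, skip, SEEK_CUR);                          // compressed data starts here

    z_stream zs = {};
    if (inflateInit2(&zs, -MAX_WBITS) != Z_OK) return false;  // raw deflate, no zlib/gzip wrapper
    std::vector<unsigned char> in(16384), out(16384);
    int ret = Z_OK;
    while (csize > 0 && ret != Z_STREAM_END) {
        size_t want = csize < in.size() ? csize : in.size();
        size_t got = std::fread(in.data(), 1, want, f);
        if (got == 0) break;
        csize -= (uint32_t)got;
        zs.next_in = in.data();
        zs.avail_in = (uInt)got;
        do {
            zs.next_out = out.data();
            zs.avail_out = (uInt)out.size();
            ret = inflate(&zs, Z_NO_FLUSH);
            if (ret != Z_OK && ret != Z_STREAM_END) { inflateEnd(&zs); return false; }
            // out.data() holds (out.size() - zs.avail_out) decompressed bytes here
        } while (zs.avail_out == 0 && ret != Z_STREAM_END);
    }
    inflateEnd(&zs);
    return ret == Z_STREAM_END;
}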
I am working on a project which needs to deal with large seismic data of SEGY format (from several GB to TB). This data represents the 3D underground structure.
Data structure is like:
1st trace, 2,3,5,3,5,....,6
2nd trace, 5,6,5,3,2,....,3
3rd trace, 7,4,5,3,1,....,8
...
What I want to ask is: in order to read and process the data quickly, do I have to convert the data into another format, or is it better to read from the original SEGY file? And is there any existing C package for doing that?
If you need to access it multiple times and
if you need to access it randomly and
if you need to access it fast
then load it to a database once.
Do not reinvent the wheel.
When dealing with data of that size, you may not want to convert it into another form unless you have to - though some software does do just that. I found a list of free geophysics software on Wikipedia that looks promising; many are open source and read/write SEGY files.
Since you are a newbie to programming, you may want to consider if the Python library segpy suits your needs rather than a C/C++ option.
Several GB is rather medium-sized, if we are talking about poststack data.
You may use SEGY and convert on the fly, or you may invent your own format; it depends on what you need to do. Without changing the SEGY format, it is enough to create indexes to the traces. If the SEGY is saved as inlines, access through inlines is faster, although crossline access is not very bad either.
If it is 3D seismic, the best way to get the same quick access to all inlines/crosslines is to have your own format based on bins, e.g. 8x8 traces - loading whole bins and selecting traces, access time can be very quick - 2-3 seconds. Or you may use an SSD, or have about 2.5x the size of your SEGY in RAM.
To quickly access time slices you have two ways - 3D bins, or a second file stored as time slices (the quickest way). I did something like that 10 years ago - access time to a 12 GB SEGY was acceptable - 2-3 seconds in all three directions.
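A rough C++ sketch of the "index the traces once, then seek" idea. The offsets follow the SEG-Y rev1 layout as I understand it (3200-byte textual header, 400-byte binary header, 240-byte trace headers, samples-per-trace in trace header bytes 115-116, 4-byte samples) and ignore extended textual headers, so verify them against the standard for your data:

#include <cstdio>
#include <cstdint>
#include <vector>

static uint16_t be16(const unsigned char* p) { return (uint16_t)((p[0] << 8) | p[1]); }

// Scan the file once and record the byte offset of every trace header.
std::vector<long> index_traces(std::FILE* f, int bytes_per_sample /* usually 4 */) {
    std::vector<long> offsets;
    long pos = 3200 + 400;                        // skip textual + binary file headers
    unsigned char th[240];
    for (;;) {
        if (std::fseek(f, pos, SEEK_SET) != 0) break;
        if (std::fread(th, 1, 240, f) != 240) break;     // end of file
        offsets.push_back(pos);
        uint16_t nsamples = be16(th + 114);              // trace header bytes 115-116
        pos += 240 + (long)nsamples * bytes_per_sample;
    }
    return offsets;
}
// Random access afterwards: fseek(f, offsets[i], SEEK_SET), read the 240-byte
// trace header and then nsamples * bytes_per_sample bytes of samples.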
SEGY in database? Wow ... ;)
The answer depends upon the type of data you need to extract from the SEG-Y file.
If you need to extract only the headers (Textual header, Binary header, Extended Textual File headers and Trace headers), then they can be easily extracted from the SEG-Y file by opening the file as binary and reading the relevant information from the byte locations given in the data exchange format documentation (rev2). The extraction might depend upon the type of data (post-stack or pre-stack). Also, some headers might require conversion from one format to another (e.g. Textual headers are mostly encoded in EBCDIC). The complete details about the byte locations and encoding formats can be found in the above documentation.
The extraction of trace data is a bit trickier and depends upon various factors such as the encoding, whether the number of trace samples is given in the trace headers, etc. A careful reading of the documentation, and getting to know the type of SEG-Y data you are working with, will make this task a lot easier.
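As a small illustration, pulling two fields out of the 400-byte binary file header: samples per trace (file bytes 3221-3222) and the data sample format code (bytes 3225-3226). SEG-Y stores these big-endian; the byte positions are my reading of the rev1 spec, so double-check them against the documentation mentioned above:

#include <cstdio>
#include <cstdint>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    std::FILE* f = std::fopen(argv[1], "rb");
    if (!f) return 1;
    unsigned char bin[400];
    std::fseek(f, 3200, SEEK_SET);     // the binary header follows the 3200-byte EBCDIC text header
    if (std::fread(bin, 1, 400, f) != 400) { std::fclose(f); return 1; }
    uint16_t nsamples = (uint16_t)((bin[20] << 8) | bin[21]);  // file bytes 3221-3222, big-endian
    uint16_t format   = (uint16_t)((bin[24] << 8) | bin[25]);  // bytes 3225-3226: 1 = IBM float, 5 = IEEE float
    std::printf("samples per trace: %u, sample format code: %u\n", nsamples, format);
    std::fclose(f);
    return 0;
}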
Since you are working with the extracted data, I would recommend using existing libraries (segpy is one of the best Python libraries I came across). There are also numerous freely available SEG-Y readers; a very nice list has already been mentioned by Daniel Waechter. You can choose any one of them that suits your requirements and the file format variant you need supported.
I recently tried to do something similar using C++ (although it has only been tested on post-stack data). The project can be found here.
Possible Duplicate:
What are important points when designing a (binary) file format?
I am going to develop a program which will store data in file.
The file can be big. The data in the file is basically made up of variable-length records, and I need random access to the records.
I just want to read some resources/books about how to design the structure of a data file, but I can't find any yet.
Any suggestion is much appreciated.
You might find http://decoy.iki.fi/texts/filefd/filefd useful. It's a general starting point to the techniques to consider.
Also look at this question here on SO: What are important points when designing a (binary) file format?
The problem you describe is a central theme of Database Theory.
Any decent text on the subject should give you some good ideas. The standard text from uni was:
Fundamentals of Database Systems - Elmasri & Navathe (PDF) (Amazon)
Another approach is to use a memory mapped array of structs, take a look at my bountied answer to a similar question
Yet another approach is to use a binary protocol like Google protobuf and "send" your data to the file when writing and "receive" it when reading.
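If the memory-mapped route fits (it works best when each record can be made a fixed-size struct), a minimal POSIX sketch looks like this; on Windows the equivalent would be CreateFileMapping/MapViewOfFile. The Record layout and file name are just placeholders:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

struct Record {        // deliberately fixed-size; variable-length data would
    int    id;         // need an index-based layout instead
    char   name[32];
    double value;
};

int main() {
    int fd = open("records.bin", O_RDONLY);      // file written earlier as an array of Record
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return 1; }
    void* p = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { close(fd); return 1; }
    const Record* recs = static_cast<const Record*>(p);
    size_t count = (size_t)st.st_size / sizeof(Record);
    if (count > 42)                               // random access with no explicit reads
        std::printf("record 42: id=%d value=%f\n", recs[42].id, recs[42].value);
    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}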
If the answer you're looking for is "what book to read" I can't help.
If "how do to that" may be good for you as well I've some suggestions.
One good solution is the one suggested by Srykar; I would just add that I'd use SQLite instead of MySQL. It's an open-source C library that you can embed in your program. It lets you store data in a DB just the way you would with SQL statements, but by calling the library's C functions instead. In your case you may keep everything in memory and then save the data to disk at the proper time.
Reference:
http://www.sqlite.org
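For instance, a tiny sketch using the public sqlite3 C API (sqlite3_open / sqlite3_exec / sqlite3_close); the table layout here is just an example:

#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("records.db", &db) != SQLITE_OK) return 1;
    char* err = nullptr;
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS records(id INTEGER PRIMARY KEY, payload BLOB);"
        "INSERT INTO records(payload) VALUES (x'DEADBEEF');",
        nullptr, nullptr, &err);
    if (err) { std::fprintf(stderr, "%s\n", err); sqlite3_free(err); }
    sqlite3_close(db);    // random access later is just SELECT ... WHERE id = ?
    return 0;
}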
Another option is the old "do it yourself way". I mean: there's nothing very complicated about storing your data to a file (unless your data is very very structured, but I'd go with option nr. 1 in this case).
You write down a plan of how you want the structure of your file to be, and you follow that plan both when writing the file to disk and when reading it back into memory.
If you have n records, write n to disk, then write each record.
If each record has variable length, then write the length of each record before writing the record itself.
You talk about "random access" in your question. Probably you mean that the file is very big and at access time you want to read from disk only the portion you're interested in.
If so, plan to build an index; that index will tell you the offset in bytes of each element from the beginning of the file. Store the index at the beginning of the file and then store the data.
When you read the file, you start by reading the index, get the offset to the data you need, and read that portion of the file.
These are very basic examples, just to get the idea...
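For instance, a bare-bones version of that plan. The layout - a count, an offset index, then length-prefixed records - is just one way to do it, and writing raw integers like this is endianness- and platform-dependent:

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

void write_file(const char* path, const std::vector<std::string>& records) {
    std::FILE* f = std::fopen(path, "wb");
    uint32_t n = (uint32_t)records.size();
    std::fwrite(&n, sizeof n, 1, f);                            // record count
    long index_pos = std::ftell(f);
    std::vector<uint64_t> offsets(n, 0);
    std::fwrite(offsets.data(), sizeof(uint64_t), n, f);        // placeholder index
    for (uint32_t i = 0; i < n; ++i) {
        offsets[i] = (uint64_t)std::ftell(f);                   // where record i starts
        uint32_t len = (uint32_t)records[i].size();
        std::fwrite(&len, sizeof len, 1, f);                    // length prefix
        std::fwrite(records[i].data(), 1, len, f);
    }
    std::fseek(f, index_pos, SEEK_SET);
    std::fwrite(offsets.data(), sizeof(uint64_t), n, f);        // patch the real offsets in
    std::fclose(f);
}

std::string read_record(const char* path, uint32_t i) {          // random access to one record
    std::FILE* f = std::fopen(path, "rb");
    uint64_t off = 0;
    std::fseek(f, (long)(sizeof(uint32_t) + i * sizeof(uint64_t)), SEEK_SET);
    std::fread(&off, sizeof off, 1, f);
    std::fseek(f, (long)off, SEEK_SET);
    uint32_t len = 0;
    std::fread(&len, sizeof len, 1, f);
    std::string rec(len, '\0');
    std::fread(&rec[0], 1, len, f);
    std::fclose(f);
    return rec;
}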
Hope they help!
Is there any reason you are not considering putting this data in a persistent DB store like MySQL? These systems are built to deal with random data access, with proper indexes to speed up data retrieval. Plus, while reading from a file, you would have to read the entire file to get what you want, as there are no indexes and no query language.
In addition, they have mechanisms in place to make sure multiple running processes can access the same data without it getting corrupted, and they provide data recovery in case of inconsistencies.
So just storing the data is the simple part; it does not end there. You would have to provide all the other solutions eventually. Better to use what's available.
I've seen a lot of games use something similar to a .DAT file or a specific file type that the game has for itself. I'm just beginning with C++ and DirectX and I was interested in keeping my information in something similar to a .DAT.
My initial conception was that it would hold information on the files you want to store within the .DAT file, something similar to a .RAR file. Unfortunately, my googling skills did not help me find the answers.
Right now I'm simply loading textures and sound files from a folder called Data.
EDIT: While I understand that .DAT is short for data, and I've found that a .DAT file generally contains any assortment of information, I'm still unsure about how to go about packing images and sound files into a single file and being able to read them back.
I'm not sure about using fstreams to achieve my task; however, I will look into streams related to storing data and how to properly read that data back. Meanwhile, if anyone has another answer to offer based on this new information, it would be appreciated.
EDIT: Thanks to the answers, I stumbled across a similar question on stackoverflow and felt I'd share it here. Combining resources into a single binary file
I don't think there is really such a thing as a .dat file format. It's short for "data," and different applications just put some proprietary stuff in it and call it ".dat." You can read up on the fstream classes to do file I/O in C++. See Input/Output with files.
What you then do is make up your own file format. For example, the first 4 bytes are an int that indicates the number of blocks in the .dat, and for each block you have 4 bytes indicating the length of the block, 4 bytes indicating the type of the block, then the variable-length data itself... something like that.
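One possible realization of that layout with fstream; the 1 = texture / 2 = sound type values are made up for illustration:

#include <cstdint>
#include <fstream>
#include <vector>

struct Block { uint32_t type; std::vector<char> bytes; };   // e.g. 1 = texture, 2 = sound

void write_dat(const char* path, const std::vector<Block>& blocks) {
    std::ofstream out(path, std::ios::binary);
    uint32_t n = (uint32_t)blocks.size();
    out.write(reinterpret_cast<const char*>(&n), 4);            // number of blocks
    for (const Block& b : blocks) {
        uint32_t len = (uint32_t)b.bytes.size();
        out.write(reinterpret_cast<const char*>(&len), 4);      // block length
        out.write(reinterpret_cast<const char*>(&b.type), 4);   // block type
        out.write(b.bytes.data(), len);                         // the payload itself
    }
}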
DAT obviously stands for data, and there is no real or de facto standard on what that extension actually refers to. Your decisions on the best file formats should be based on technical considerations, not pointless attempts at security through obscurity.
Professional games use a technique where they put all the needed resources (models, textures, sounds, AI, config, etc.) zipped/packed into a single file, making them faster to manage and harder to tamper with (some even use a virtual file system on top of what's inside the data file). What goes inside the file differs depending on the needs of the game and the data structures that you use.
If you're just starting in gamedev, I recommend you stick with keeping all your assets separate and not bother too much about packing them into a single file.
Now if you really want to start using a packed format here's a good pointer:
Creating a PAK File Format
Here's a link which claims that .dat is a movie format, 'DAT' being short for Digital Audio Tape.
I'm not sure I believe the link, but I do remember something about a Microsoft supported format called DAT, from long ago, when I used an earlier version of Windows.
It makes more sense as a logical extension for a DATA file of some kind.
.dat, as others have said, is literally just a data file. In reality, the file extension means nothing other than association with a program. For example, I could make a word processor that saves all the documents with the .mp3 file extension. These files wouldn't be playable in any media software, but the software might try. File extensions are used to help programs know what types of files they can and cannot open--however those rules don't have to be followed.
Anyway, you can dump any sort of information to a file. Programmers/software writers will often choose .dat as the extension of that file because it has become the standard to signify 'this file just holds a ton of data' and that the data doesn't necessarily hold any standardized headers, footers, or formatting.
A dat file could really contain anything. It might be as simple as a zip archive with the extension changed, or it could be a completely custom file type. If you're just starting out, you probably don't want to write your own file format, although doing so can be fun and educational. If you want to encapsulate your data files into some kind of container, you should probably go with a zip, paq, or maybe tar.gz.
I want to concat two or more gzip streams without recompressing them.
I mean I have A compressed to A.gz and B to B.gz, and I want to combine them into a single gzip (A+B).gz without compressing them again, using C or C++.
Several notes:
Even though you can just concatenate two files and gunzip will know how to deal with them, most programs are not able to deal with two chunks.
I once saw example code that does this just by decompressing the files and then manipulating the originals; this is significantly faster than normal re-compression, but it still requires O(n) CPU operations.
Unfortunately I can't find that example I once came across (concatenation using decompression only); if someone can point me to it I would be grateful.
Note: this is not a duplicate of this question, because the proposed solution does not fit my needs.
Clarification edit:
I want to concatenate several compressed HTML pieces and send them to the browser as one page, in answer to the request header "Accept-Encoding: gzip", with the response header "Content-Encoding: gzip".
If the stream is concatenated as simply as cat a.gz b.gz >ab.gz, the Gecko (Firefox) and KHTML web engines get only the first part (a); IE6 does not display anything, and Google Chrome displays the first part (a) correctly and the second part (b) as garbage (it does not decompress at all).
Only Opera handles this well.
So I need to create a single gzip stream of several chunks and send them without re-compressing.
Update: I have found gzjoin.c in the examples of zlib; it does this using only decompression. The problem is that decompression is still slower than a simple memcpy.
It is still 4 times faster than the fastest gzip compression, but it is not enough.
What I need is to figure out what data I have to save together with the gzip file in order to avoid running the decompression procedure at all, and how to collect this data during compression.
Look at RFC 1951 and RFC 1952.
The format is simply a series of members, each composed of three parts: a header, the compressed data, and a trailer. The data part is itself a set of blocks, with each block having its own small header and data.
To simulate the effect of gzipping the result of the concatenation of two (or more) files, you simply have to adjust the headers (there is a last-block flag, for instance) and the trailer correctly, and copy the data parts.
There is a problem: the trailer has a CRC32 of the uncompressed data, and I'm not sure whether it is easy to compute when you only know the CRCs of the parts.
Edit: the comments in the gzjoin.c file you found imply that, while it is possible to compute the CRC32 without decompressing the data, there are other things which need the decompression.
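On the CRC question: zlib ships crc32_combine(), which produces the CRC-32 of A+B from crc(A), crc(B) and the length of B alone, without touching the data (gzjoin uses the same idea). The trailer also needs the total uncompressed length, which is just the sum of the parts modulo 2^32. A sketch, assuming each part is under 4 GB so its ISIZE equals its real length:

#include <zlib.h>
#include <cstdint>

struct GzTail { uint32_t crc; uint32_t isize; };   // the 8 trailer bytes of a gzip member

GzTail combined_trailer(GzTail a, GzTail b) {
    GzTail t;
    t.crc   = (uint32_t)crc32_combine(a.crc, b.crc, b.isize);  // CRC of A followed by B
    t.isize = a.isize + b.isize;                               // ISIZE is defined modulo 2^32
    return t;
}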
The gzip manual says that two gzip files can be concatenated as you attempted.
http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage
So it appears that the other tools may be broken. As seen in this bug report.
http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=97263
Apart from filing a bug report with each one of the browser makers, and hoping they comply, perhaps your program can cache the most common concatenations of the required data.
As others have mentioned you may be able to perform surgery:
http://www.gzip.org/zlib/rfc-gzip.html
And this requires a CRC-32 of the final uncompressed file. The required size of the uncompressed file can be easily calculated by adding the lengths of the individual sub-files.
At the bottom of the last link, there is code for calculating a running CRC-32, named update_crc.
Calculating the CRC on the uncompressed files each time your process is run is probably cheaper than the gzip algorithm itself.
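If you control the compression step, you can also just record those numbers while compressing, so nothing ever has to be recomputed; a sketch using zlib's crc32():

#include <zlib.h>
#include <cstdint>
#include <cstddef>

struct PieceMeta {
    uint32_t crc = (uint32_t)crc32(0L, Z_NULL, 0);   // CRC-32 seed
    uint32_t len = 0;                                // uncompressed length so far
};

// Call this for every buffer that is fed to the compressor for this piece.
void account(PieceMeta& m, const unsigned char* buf, std::size_t n) {
    m.crc = (uint32_t)crc32(m.crc, buf, (uInt)n);
    m.len += (uint32_t)n;
}
// Store crc and len next to the compressed piece; later, crc32_combine() and
// a sum of the lengths give the trailer of any concatenation without
// decompressing anything.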
It seems that the original compression of the individual files is done by you. It also seems that the desired result (concatenation of several pieces) is small enough to be sent to a web browser in one page. In that case your efficiency concerns seem to be unwarranted.
Please note that (1) the gzjoin.c approach is highly likely to be the best answer that you could get to your question as stated (2) it is complicated microsurgery performed by one of the gzip originators and may not have been subject to extensive stress testing.
Please consider a boring, understandable, reliable approach: store the original pieces UNcompressed, then select the required pieces, concatenate them and compress the result. Note that the compression ratio may be better than that obtained by gluing together small compressed pieces.
If tarring them is not out of the question (since the linked cat solution isn't viable for you):
tar cf A_B.gz.tar A.gz B.gz
Then, to get them back:
tar xf A_B.gz.tar