How to read/restore big data file (SEGY format) with C/C++?

I am working on a project which needs to deal with large seismic data of SEGY format (from several GB to TB). This data represents the 3D underground structure.
Data structure is like:
1st trace, 2,3,5,3,5,....,6
2nd trace, 5,6,5,3,2,....,3
3rd trace, 7,4,5,3,1,....,8
What I want to ask is: in order to read and process the data fast, do I have to convert the data into another form, or is it better to read from the original SEGY file? And is there any existing C package to do that?

If you need to access it multiple times and
if you need to access it randomly and
if you need to access it fast
then load it to a database once.
Do not reinvent the wheel.

When dealing with data of that size, you may not want to convert it into another form unless you have to, though some software does do just that. I found a list of free geophysics software on Wikipedia that looks promising; many of the packages are open source and read/write SEGY files.
Since you are a newbie to programming, you may want to consider if the Python library segpy suits your needs rather than a C/C++ option.

Several GB is rather medium-sized, if we are talking about post-stack data.
You may use SEGY and convert on the fly, or you may invent your own format. It depends on what you need to do. Without changing the SEGY format, it is enough to create an index of trace positions. If the SEGY is stored by inlines, access along inlines is fastest, although crossline access is not too bad.
If it is 3D seismic, the best way to get equally quick access to all inlines and crosslines is to have your own format based on bins, e.g. 8x8 traces. Loading whole bins and selecting traces from them can be very quick: 2-3 seconds. Or you may use an SSD, or have RAM 2.5x the size of your SEGY.
To quickly access timeslices you have two options: 3D bins, or a second file stored as timeslices (the quickest way). I did something like that 10 years ago; access time to a 12 GB SEGY was acceptable, 2-3 seconds in all 3 directions.
SEGY in database? Wow ... ;)
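The "index to traces" idea from this answer can be sketched in C++ as follows. This is only a minimal sketch, assuming a SEG-Y file with no extended textual headers (3200-byte textual header plus 400-byte binary header), a 240-byte header on every trace, the same number of samples in every trace, and a regular inline-sorted 3D grid. The names SegyLayout, traceOffset and traceOffset3D are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a fixed-trace-length, inline-sorted SEG-Y volume.
struct SegyLayout {
    std::uint64_t samplesPerTrace;      // taken from the binary header
    std::uint64_t bytesPerSample;       // e.g. 4 for 32-bit floats
    std::uint64_t crosslinesPerInline;  // traces stored per inline
};

constexpr std::uint64_t kFileHeaderBytes = 3200 + 400;  // textual + binary header
constexpr std::uint64_t kTraceHeaderBytes = 240;

// Byte offset of the trace header of the trace with the given 0-based index.
std::uint64_t traceOffset(const SegyLayout& l, std::uint64_t traceIndex) {
    assert(l.bytesPerSample > 0);
    const std::uint64_t traceBytes =
        kTraceHeaderBytes + l.samplesPerTrace * l.bytesPerSample;
    return kFileHeaderBytes + traceIndex * traceBytes;
}

// Same, addressed by 0-based inline and crossline position in the grid.
std::uint64_t traceOffset3D(const SegyLayout& l,
                            std::uint64_t inlineIdx,
                            std::uint64_t crosslineIdx) {
    assert(crosslineIdx < l.crosslinesPerInline);
    return traceOffset(l, inlineIdx * l.crosslinesPerInline + crosslineIdx);
}
```

With such a layout, one inline is a single contiguous run of crosslinesPerInline traces, while one crossline needs a seek per inline, which matches the remark that inline access is the faster direction.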

The answer depends upon the type of data you need to extract from the SEG-Y file.
If you need to extract only the headers (Text header, Binary header, Extended Textual File headers and Trace headers), they can easily be extracted from the SEG-Y file by opening the file as binary and reading the relevant information from the respective locations as specified in the data exchange formats (rev2). The extraction might depend upon the type of data (post-stack or pre-stack). Some headers might also require conversion from one format to another (e.g. Text Headers are mostly encoded in EBCDIC). The complete details about the byte locations and encoding formats can be read from the above documentation.
The extraction of trace data is a bit tricky and depends upon various factors like the encoding, whether the no. of trace samples is mentioned in the trace headers, etc. A careful reading of the documentation and getting to know about the type of SEG data you are working on will surely make this task a lot easier.
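As a rough illustration of reading fields "from the respective locations", here is a small C++ sketch that pulls three values out of the binary header. It assumes the first 3600 bytes of the file are already in memory and that the values are big-endian 16-bit integers at bytes 3217-3218 (sample interval in microseconds), 3221-3222 (samples per trace) and 3225-3226 (data sample format code). rev2 files may instead be little-endian or use the extended fields, so check the documentation for your data. The names BinaryHeaderInfo, readBE16 and parseBinaryHeader are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the few binary-header fields read here.
struct BinaryHeaderInfo {
    std::uint16_t sampleIntervalUs;  // bytes 3217-3218
    std::uint16_t samplesPerTrace;   // bytes 3221-3222
    std::uint16_t formatCode;        // bytes 3225-3226 (e.g. 1 = IBM float, 5 = IEEE float)
};

// Reads a big-endian 16-bit value at a 1-based byte position,
// using the same byte numbering as the SEG-Y documentation.
std::uint16_t readBE16(const std::vector<unsigned char>& buf,
                       std::size_t bytePos1Based) {
    assert(bytePos1Based >= 1 && bytePos1Based + 1 <= buf.size());
    return static_cast<std::uint16_t>(
        (buf[bytePos1Based - 1] << 8) | buf[bytePos1Based]);
}

// Expects the first 3600 bytes of the file (textual + binary header).
BinaryHeaderInfo parseBinaryHeader(const std::vector<unsigned char>& first3600) {
    assert(first3600.size() >= 3600);
    return { readBE16(first3600, 3217),
             readBE16(first3600, 3221),
             readBE16(first3600, 3225) };
}
```

The sample count and format code obtained this way are what you need to compute the size of each trace, provided the trace headers do not override the sample count.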
Since you are working with the extracted data, I would recommend using an existing library (segpy is one of the best Python libraries I have come across). There are also numerous freely available SEG-Y readers; a very nice list has already been mentioned by Daniel Waechter. You can choose whichever one suits your requirements and supports your type of file.
I recently tried to do something similar using C++ (although it has only been tested on post-stack data). The project can be found here.


Structured message compression

Are there any libraries for compression of structured messages? (like protobufs)
I am looking for something better than just passing a serialized stream through GZip. For example, if my message stores a triangle mesh, the coordinates of adjacent vertices will be highly correlated, so a smart compressor could store deltas instead of the raw coordinates, which would require fewer bits to encode.
A general compressor, by contrast, knows nothing about the stream structure and just looks for repeating byte sequences, of which there won't be many in data like that.
Ideally, this should work completely automatically after being provided with a schema, but I wouldn't mind adding annotations to my schema, if it came to that.
The main problem here is that, most of the time, writing such a schema takes about as much effort as programming a preprocessor for the data yourself. E.g. for your triangle mesh example, reordering the data or taking deltas of the coordinates can be implemented very easily and will help any subsequent compressor a lot.
A compressor going in that direction is ZPAQ. It can use config files tailored to specific data (the sample configuration site includes EXE, JPG, BMP configs as well as a specialized one to compress a file containing the mathematical constant pi). The downside is that the script language used here (ZPAQL) is quite complicated to use and you've got to know much of the ZPAQ internals.
Older versions of WinRAR used a virtual machine named RarVM (now deprecated) that allowed assembler-like code for custom data transformations. There's an open source project named rarvmtools on GitHub with some related tools.
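The hand-written preprocessor mentioned above for the triangle mesh could look roughly like this in C++. It is only a sketch, assuming the coordinates have already been quantized to 32-bit integers and are stored interleaved (x, y, z, x, y, z, ...), so each value is stored as the difference from the value one vertex earlier. The function names and the stride parameter are my own choices for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Replaces each value by its difference from the value `stride` positions
// earlier (stride = 3 for interleaved x,y,z). The first `stride` values are
// kept as-is. Unsigned arithmetic is used so large differences wrap around
// instead of overflowing.
std::vector<std::int32_t> deltaEncode(const std::vector<std::int32_t>& values,
                                      std::size_t stride) {
    assert(stride >= 1);
    std::vector<std::int32_t> deltas(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        const std::uint32_t prev =
            i >= stride ? static_cast<std::uint32_t>(values[i - stride]) : 0u;
        deltas[i] = static_cast<std::int32_t>(
            static_cast<std::uint32_t>(values[i]) - prev);
    }
    return deltas;
}

// Inverse of deltaEncode: rebuilds the original values from the deltas.
std::vector<std::int32_t> deltaDecode(const std::vector<std::int32_t>& deltas,
                                      std::size_t stride) {
    assert(stride >= 1);
    std::vector<std::int32_t> values(deltas.size());
    for (std::size_t i = 0; i < deltas.size(); ++i) {
        const std::uint32_t prev =
            i >= stride ? static_cast<std::uint32_t>(values[i - stride]) : 0u;
        values[i] = static_cast<std::int32_t>(
            static_cast<std::uint32_t>(deltas[i]) + prev);
    }
    return values;
}
```

The encoded stream is mostly small numbers near zero, which is the kind of input a general-purpose compressor run afterwards handles much better than the raw coordinates.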
For protobuf compression, there's a Google project called riegeli that might be able to further compress them.

"Best" Input File Formats for C++? [closed]

I am starting work on a new piece of software that will end up needing some robust and expandable file IO. There are a lot of formats out there. XML, JSON, INI, etc. However, there are always plusses and minuses so I thought I would ask for some community input.
Here are some rough requirements:
The format is a "standard"...I don't want to reinvent the wheel if I don't have to. It doesn't have to be a formal IEEE standard, but something you could Google and get some information on as a new user, may have some support tools (editors) beyond vi. (Though the software users will generally be computer savvy and happy to use vi.)
Easily integrates with C++. I don't want to have to pull along a 100 MB library and three different compilers to get it up and running.
Supports tabular input (2d, n-dimensional)
Supports POD types
Can expand as more inputs are required, binds well to variables, etc.
Parsing speed is not terribly important
Ideally, as easy to write (reflect) as it is to read
Works well on Windows and Linux
Supports compositing (one file referencing another file to read, and so on.)
Human Readable
In a perfect world, I would use a header-only library or some clean STL implementation, but I'm fine with leveraging Boost or some small external library if it works well.
So, what are your thoughts on various formats? Drawbacks? Advantages?
Options to consider? Anything else to add?
Google Protocol Buffers
Boost Serialization
There is one excellent format that meets all your criteria: SQLite.
Please read the article about using SQLite as an application file format. Also, please watch the Google Tech Talk by D. Richard Hipp (SQLite author) about this very topic.
Now, let's see how SQLite meets your requirements:
The format is a "standard"
SQLite has become the format of choice for most mobile environments, and for many desktop apps (Firefox, Thunderbird, Google Chrome, Adobe Reader, you name it).
Easily integrates with C++
SQLite has a standard C interface, which is only one source file and one header file. There are C++ wrappers too.
Supports tabular input (2d, n-dimensional)
An SQLite table is as tabular as you could possibly imagine. To represent, say, 3-dimensional data, create a table with columns x, y, z, value and store your data as a set of rows, one row per cell (x1, y1, z1, value1; x2, y2, z2, value2; and so on).
Supports POD types
I assume by POD you meant Plain Old Data, or BLOB. SQLite lets you store BLOB fields as is.
Can expand as more inputs are required, binds well to variables
This is where it really shines.
Parsing speed is not terribly important
But SQLite speed is superb. In fact, parsing is basically transparent.
Ideally, as easy to write (reflect) as it is to read
Just use INSERT to write and SELECT to read - what could be easier?
Works well on Windows and Linux
You bet, and all other platforms as well.
Supports compositing (one file referencing another file to read)
You can ATTACH one database to another.
Human Readable
Not in binary, but there are many excellent SQLite browsers/editors out there. I like SQLite Expert Personal on Windows and sqliteman on Linux. There is also SQLite editor plugin for Firefox.
There are other advantages that SQLite gives you for free:
Data is indexable which makes it very fast to search. You just cannot do this using XML, JSON or any other text-only formats.
Data can be edited partially, even when amount of data is very large. You do not have to rewrite few gigabytes just to edit one value.
SQLite is fully transactional: it guarantees that your data is consistent at all times. Even if your application (or the whole computer) crashes, your data will be automatically restored to the last known consistent state on the next attempt to connect to the database.
SQLite stores your data verbatim: you do not need to worry about escaping junk characters in your data (including zero bytes embedded in your strings) - simply always use prepared statements, that's all it takes to make it transparent. This can be big and annoying problem when dealing with text data formats, XML in particular.
SQLite stores all strings in Unicode: UTF-8 (default) or UTF-16. In other words, you do not need to worry about text encodings or international support for your data format.
SQLite allows you to process data in small chunks (row by row in fact), thus it works well in low memory conditions. This can be a problem for any text based formats, because often they need to load all text into memory to parse it. Granted, there are few efficient stream-based XML parsers out there, but in general any XML parser will be quite memory greedy compared to SQLite.
Having worked quite a bit with both XML and json, here's my rather subjective opinion of both as extendable serialization formats:
The format is a "standard": Yes for both
Easily integrates with C++: Yes for both. In each case you'll probably wind up with some kind of library to handle it. On Linux, libxml2 is a standard, and libxml++ is a C++ wrapper for it; you should be able to get both of those from your distro's package manager. It will take some small effort to get those working on Windows. There appears to be some support in Boost for json, but I haven't used it; I've always dealt with json using libraries. Really, the library route is not very onerous for either.
Supports tabular input (2d, n-dimensional): Yes for both
Supports POD types: Yes for both
Can expand as more inputs are required: Yes for both - that's one big advantage to both of them.
Binds well to variables: If what you mean is some way inside the file itself to say "This piece of data must be automatically deserialized into this variable in my program", then no for both.
As easy to write (reflect) as it is to read: Depends on the library you use, but in my experience yes for both. (You can actually do a tolerable job of writing json using printf().)
Works well on Windows and Linux: Yes for both, and ditto Mac OS X for that matter.
Supports one file referencing another file to read: If you mean something akin to a C #include, then XML has some ability to do this (e.g. document entities), while json doesn't.
Human readable: Both are typically written in UTF-8, and permit line breaks and indentation, and thus can be human-readable. However, I've just been working with a 479 KB XML file that's all on one line, so I had to run it through a prettyprinter to make sense of it. json can also be pretty unreadable, but in my experience is often formatted better than XML.
When starting new projects, I generally prefer json; it's more compact and more human-readable. The main reason I might select XML over json would be if I were worried about receiving badly-formed documents, since XML supports automated document format validation, while you have to write your own validation code with json.
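To illustrate the remark above that JSON can be written tolerably with printf()-style code, and how tabular input maps onto it, here is a small C++ sketch that writes a 2D table as a nested JSON array. It uses std::ostringstream rather than printf(), handles only integers, and assumes the key needs no escaping. The function name tableToJson is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Serializes a table as {"name":[[r0c0,r0c1,...],[r1c0,...],...]}.
// Assumption: `name` contains no characters that need JSON escaping.
std::string tableToJson(const std::string& name,
                        const std::vector<std::vector<int>>& rows) {
    assert(name.find('"') == std::string::npos);
    std::ostringstream os;
    os << "{\"" << name << "\":[";
    for (std::size_t r = 0; r < rows.size(); ++r) {
        if (r > 0) os << ',';
        os << '[';
        for (std::size_t c = 0; c < rows[r].size(); ++c) {
            if (c > 0) os << ',';
            os << rows[r][c];
        }
        os << ']';
    }
    os << "]}";
    return os.str();
}
```

Reading the file back is where a real JSON library earns its keep; hand-written output like this is fine, hand-written parsing usually is not.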
Check out Google Protocol Buffers. This handles most of your requirements.
From their documentation, the high level steps are:
Define message formats in a .proto file.
Use the protocol buffer compiler.
Use the C++ protocol buffer API to write and read messages.
For my purposes, I think the way to go is XML.
The format is a standard, but allows for modification and flexibility for the schema to change as the program requirements evolve.
There are several library options. Some are larger (Xerces-C) some are smaller (ezxml), but there are many options, so we won't be locked in to a single provider or very specific solution.
It can support tabular input (2d, n-dimensional). This requires more parsing work on "our" end, and is likely the weakest point for XML.
Supports POD types: Absolutely.
Can expand as more inputs are required, binds well to variables, etc. through schema modifications and parser modifications.
Parsing speed is not terribly important, so processing a text file or files is not an issue.
XML can be programmatically written just as easily as read.
Works well on Windows and Linux or any other OS that supports C and text files.
Supports compositing (one file referencing another file to read, and so on.)
Human Readable with many text editors (Sublime, vi, etc.) supporting syntax highlighting out of the box. Many web browsers display the data well.
Thanks for all the great feedback! I think if we wanted a purely binary solution, Protocol Buffers or boost::serialization is likely the way that we would go.

Structure for storing data from thousands of files on a mobile device

I have more than 32000 binary files that store a certain kind of spatial data. I access the data by file name. The files range in size from 0-400kb. I need to be able to access the content of these files randomly and at various time points. I don't like the idea of having 32000+ separate files of data installed on a mobile device (even though the total file size is < 100mb). I want to merge the files into a single structure that will still let me access the data I need just as quickly. I'd like suggestions as to what the best way to do this is. Any suggestions should have C/C++ libs for accessing the data and should have a liberal license that allows inclusion in commercial, closed-source applications without any issue.
The only thing I've thought of so far is storing everything in an sqlite database, though I'm not sure if this is the best method, or what considerations I need to take into account for storing blob data with quick look up times (ie, what schema I'd use).
Why not roll your own?
Your requirements sound pretty simple and straightforward. Just bundle everything into a single binary file and add an index at the beginning telling which file starts where and how big it is.
30 lines of C++ code max. Invest a good 10 minutes designing a good interface for it so you could replace the implementation when and if the need occurs.
That is of course if the data is read only. If you need to change it as you go, it gets hairy fast.
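A read-only bundle of that kind might look like the following C++ sketch. The layout is my own assumption, not something from the answer: a 64-bit file count, then one index entry per file (name length, name, absolute offset, size), then all file contents back to back, with integers written little-endian. All names here (Entry, Index, writePack, readIndex, readFile) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>  // std::stringstream is a handy in-memory stand-in for a file
#include <string>

struct Entry { std::uint64_t offset; std::uint64_t size; };
using Index = std::map<std::string, Entry>;

void writeU64(std::ostream& os, std::uint64_t v) {
    for (int i = 0; i < 8; ++i) os.put(static_cast<char>((v >> (8 * i)) & 0xFF));
}

std::uint64_t readU64(std::istream& is) {
    std::uint64_t v = 0;
    for (int i = 0; i < 8; ++i)
        v |= static_cast<std::uint64_t>(static_cast<unsigned char>(is.get())) << (8 * i);
    return v;
}

// Writes the index first, then the file contents in the same order.
void writePack(std::ostream& os, const std::map<std::string, std::string>& files) {
    std::uint64_t indexBytes = 8;  // the file count
    for (const auto& f : files) indexBytes += 8 + f.first.size() + 16;
    writeU64(os, files.size());
    std::uint64_t offset = indexBytes;  // the first file starts right after the index
    for (const auto& f : files) {
        writeU64(os, f.first.size());
        os.write(f.first.data(), static_cast<std::streamsize>(f.first.size()));
        writeU64(os, offset);
        writeU64(os, f.second.size());
        offset += f.second.size();
    }
    for (const auto& f : files)
        os.write(f.second.data(), static_cast<std::streamsize>(f.second.size()));
}

// Loads only the index; file contents stay on disk until requested.
Index readIndex(std::istream& is) {
    Index idx;
    const std::uint64_t count = readU64(is);
    for (std::uint64_t i = 0; i < count; ++i) {
        std::string name(static_cast<std::size_t>(readU64(is)), '\0');
        is.read(&name[0], static_cast<std::streamsize>(name.size()));
        Entry e;
        e.offset = readU64(is);
        e.size = readU64(is);
        idx[name] = e;
    }
    return idx;
}

// Random access by name: one lookup, one seek, one read.
std::string readFile(std::istream& is, const Index& idx, const std::string& name) {
    const auto it = idx.find(name);
    assert(it != idx.end());
    std::string data(static_cast<std::size_t>(it->second.size), '\0');
    is.seekg(static_cast<std::streamoff>(it->second.offset), std::ios::beg);
    is.read(&data[0], static_cast<std::streamsize>(data.size()));
    return data;
}
```

In a real application the streams would be std::ofstream and std::ifstream opened in binary mode, and the index would be read once at startup and kept in memory.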

Writing to the middle of the file (without overwriting data)

In Windows, is it possible through an API to write to the middle of a file without overwriting any data and without having to rewrite everything after that?
If it's possible then I believe it will obviously fragment the file; how many times can I do it before it becomes a serious problem?
If it's not possible what approach/workaround is usually taken? Re-writing everything after the insertion point becomes prohibitive really quickly with big (ie, gigabytes) files.
Note: I can't avoid having to write to the middle. Think of the application as a text editor for huge files where the user types stuff and then saves. I also can't split the files in several smaller ones.
I'm unaware of any way to do this if the interim result you need is a flat file that can be used by other applications other than the editor. If you want a flat file to be produced, you will have to update it from the change point to the end of file, since it's really just a sequential file.
But the italics are there for good reason. If you can control the file format, you have some options. Some versions of MS Word had a quick-save feature where they didn't rewrite the entire document, rather they appended a delta record to the end of the file. Then, when re-reading the file, it applied all the deltas in order so that what you ended up with was the right file. This obviously won't work if the saved file has to be usable immediately to another application that doesn't understand the file format.
What I'm proposing there is to not store the file as text. Use an intermediate form that you can efficiently edit and save, then have a step which converts that to a usable text file infrequently (e.g., on editor exit). That way, the user can save as much as they want but the time-expensive operation won't have as much of an impact.
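The delta-record approach described above can be sketched in C++ like this. It is a simplified illustration, not how any particular editor stores its files: each save would append one Delta record to the file, and loading (or the infrequent conversion to a flat text file) replays all records in order. The Delta struct and applyDeltas are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One edit: at position `pos` of the text as it stood when the edit was made,
// remove `eraseLen` characters and put `insert` in their place.
struct Delta {
    std::size_t pos;
    std::size_t eraseLen;
    std::string insert;
};

// Rebuilds the current document from the base text plus all saved deltas,
// applied in the order they were recorded.
std::string applyDeltas(std::string base, const std::vector<Delta>& deltas) {
    for (const Delta& d : deltas) {
        assert(d.pos <= base.size());
        base.replace(d.pos, d.eraseLen, d.insert);
    }
    return base;
}
```

Saving then costs only the size of the new record, at the price of a slower load and a file that other applications cannot read until it has been flattened.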
Beyond that, there are some other possibilities.
Memory-mapping (rather than loading) the file may provide efficiencies which would speed things up. You'd probably still have to rewrite to the end of the file but it would be happening at a lower level in the OS.
If the primary reason you want fast save is to start letting the user keep working (rather than having the file available to another application), you could farm the save operation out to a separate thread and return control to the user immediately. Then you would need synchronisation between the two threads to prevent the user modifying data yet to be saved to disk.
The realistic answer is no. Your only real choices are to rewrite from the point of the modification, or build a more complex format that uses something like an index to tell how to arrange records into their intended order.
From a purely theoretical viewpoint, you could sort of do it under just the right circumstances. Using FAT (for example, but most other file systems have at least some degree of similarity) you could go in and directly manipulate the FAT. The FAT is basically a linked list of clusters that make up a file. You could modify that linked list to add a new cluster in the middle of a file, and then write your new data to that cluster you added.
Please note that I said purely theoretical. Doing this kind of manipulation under a completely unprotected system like MS-DOS would have been difficult but bordering on reasonable. With most newer systems, doing the modification at all would generally be pretty difficult. Most modern file systems are also (considerably) more complex than FAT, which would add further difficulty to the implementation. In theory it's still possible -- in fact, it's now thoroughly insane to even contemplate, where it was once almost reasonable.
I'm not sure about the format of your file but you could make it 'record' based.
Write your data in chunks and give each chunk an id.
Id could be data offset in file.
At the start of the file you could have a header with a list of ids so that you can read records in order.
At the end of the 'list of ids' you could point to another location in the file (an id/offset) that stores another list of ids.
Something similar to filesystem.
To add new data you append them at the end and update index (add id to the list).
You have to figure out how to handle delete record and update.
If records are of the same size then to delete you can just mark it empty and next time reuse it with appropriate updates to index table.
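Here is a small in-memory C++ sketch of that record idea, just to show why inserting in the middle no longer moves any data. It assumes records are only ever appended physically and that a separate list of ids gives their logical order. In a real file the id would be the record's byte offset, as suggested above; here it is simply the position in a vector. RecordFile and its members are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct RecordFile {
    std::vector<std::string> records;  // physical storage, append-only
    std::vector<std::size_t> order;    // index: record ids in logical order

    // Stores the data at the physical end and returns its id.
    std::size_t append(const std::string& data) {
        records.push_back(data);
        return records.size() - 1;
    }

    // Logical insert: the data goes to the physical end,
    // only the index changes in the middle.
    void insertAt(std::size_t logicalPos, const std::string& data) {
        assert(logicalPos <= order.size());
        const std::size_t id = append(data);
        order.insert(order.begin() + static_cast<std::ptrdiff_t>(logicalPos), id);
    }

    // Reads the records back in their logical order.
    std::string contents() const {
        std::string out;
        for (std::size_t id : order) out += records[id];
        return out;
    }
};
```

Deleting and updating records, and keeping the index itself cheap to rewrite, are the parts this sketch leaves out.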
Probably the most efficient way to do this (if you really want to do it) is to call ReadFileScatter() to read the chunks before and after the insertion point, insert the new data in the middle of the FILE_SEGMENT_ELEMENT[3] list, and call WriteFileGather(). Yes, this involves moving bytes on disk. But you leave the hard parts to the OS.
If you are using .NET 4 and have an editor-like application, try a memory-mapped file; it might just be the ticket. Something like this (I didn't type it into VS so not sure if I got the syntax right):
// Requires: using System.IO; using System.IO.MemoryMappedFiles;
// The capacity must be large enough to cover the offset written below.
MemoryMappedFile bigFile = MemoryMappedFile.CreateFromFile(
    @"C:\bigfile.dat", FileMode.Create, null, 2L * 1024 * 1024 * 1024);
MemoryMappedViewAccessor view = bigFile.CreateViewAccessor();
long offset = 1000000000;
// ObjectType must be a struct (value type) for Write<T> to accept it.
view.Write<ObjectType>(offset, ref MyObject);
I noted both paxdiablo's answer on dealing with other applications, and Matteo Italia's comment on Installable File Systems. That made me realize there's another non-trivial solution.
Using reparse points, you can create a "virtual" file from a base file plus deltas. Any application unaware of this method will see a continuous range of bytes, as the deltas are applied on the fly by a file system filter. For small deltas (total <16 KB), the delta information can be stored in the reparse point itself; larger deltas can be placed in an alternate data stream. Non-trivial of course.
I know that this question is marked "Windows", but I'll still add my $0.05 and say that on Linux it is possible to either insert or remove a chunk of data in the middle of a file without leaving a hole or copying the second half forward/backward:
fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, len)
fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, len)
Again, I know that this probably won't help the OP, but I personally landed here searching for a Linux-specific answer. (There is no "Windows" word in the question, so the web search engine saw no problem with sending me here.)

Game Programming: .DAT file?

I've seen a lot of games use something similar to a .DAT file or a specific file type that the game has for itself. I'm just beginning with C++ and DirectX and I was interested in keeping my information in something similar to a .DAT.
My initial conception was that it would hold information on the files you wanted to store within the .DAT file. Something similar to a .RAR file. Unfortunately, my googling skills did not help me in finding the answers.
Right now I'm simply loading textures and sound files from a folder called Data.
EDIT: While I understand that .DAT is short for data, and I've found that a .DAT file generally contains any assortment of information, I'm still unsure about how to go about something like packing images and sound files into any type of file and being able to read them back.
I'm not sure about using fstreams to achieve my task, however I will look into streams related to storing data and how to properly read from that data. Meanwhile if anyone has another answer to offer based on this new information, it would be appreciated.
EDIT: Thanks to the answers, I stumbled across a similar question on stackoverflow and felt I'd share it here. Combining resources into a single binary file
I don't think there is really such thing as .dat file format. It's short for "data," and different applications just put in some proprietary stuff in it and call it ".dat." You can read up on fstream classes to do file IO in C++. See Input/Output with files.
What you then do is make up your own file format. For example, the first 4 bytes are an int that indicates the number of blocks in the .dat, and for each block you have 4 bytes indicating the length of the block, 4 bytes indicating the type of the block, and then the variable-length data itself... something like that.
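The example format just described could be sketched in C++ as follows: a 4-byte block count, then for each block a 4-byte length, a 4-byte type and the raw data. Writing the integers little-endian and the meaning of the type values are my own assumptions, and Block, writeDat and readDat are made-up names. The sketch does no error handling for truncated or corrupt files.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>  // std::stringstream is a handy in-memory stand-in for a file
#include <string>
#include <vector>

// One block of the hypothetical .dat file; what `type` means
// (texture, sound, ...) is up to the game.
struct Block {
    std::uint32_t type;
    std::string data;
};

void writeU32(std::ostream& os, std::uint32_t v) {
    for (int i = 0; i < 4; ++i) os.put(static_cast<char>((v >> (8 * i)) & 0xFF));
}

std::uint32_t readU32(std::istream& is) {
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
        v |= static_cast<std::uint32_t>(static_cast<unsigned char>(is.get())) << (8 * i);
    return v;
}

// Layout: [block count] then, per block: [length][type][data bytes].
void writeDat(std::ostream& os, const std::vector<Block>& blocks) {
    assert(blocks.size() <= 0xFFFFFFFFu);
    writeU32(os, static_cast<std::uint32_t>(blocks.size()));
    for (const Block& b : blocks) {
        writeU32(os, static_cast<std::uint32_t>(b.data.size()));
        writeU32(os, b.type);
        os.write(b.data.data(), static_cast<std::streamsize>(b.data.size()));
    }
}

// Reads every block back in the order it was written.
std::vector<Block> readDat(std::istream& is) {
    std::vector<Block> blocks(readU32(is));
    for (Block& b : blocks) {
        const std::uint32_t length = readU32(is);
        b.type = readU32(is);
        b.data.resize(length);
        is.read(&b.data[0], static_cast<std::streamsize>(length));
    }
    return blocks;
}
```

With real files you would pass a std::ofstream or std::ifstream opened with std::ios::binary; the fstream classes mentioned above plug in unchanged.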
DAT obviously stands for data, and there is no real or de facto standard on what that extension actually refers to. Your decisions on the best file formats should be based on technical considerations, not pointless attempts at security through obscurity.
Professional games use a technique where they put all the needed resources (models, textures, sounds, AI, config, etc.) zipped/packed into a single file, making them faster to manage and harder to change (some even build a virtual file system on top of what's inside the data file). What's inside the file depends on the needs of the game and the data structures that you use.
If you're just starting out in gamedev, I recommend you keep all your assets separate and don't bother too much about packing them into a single file.
Now if you really want to start using a packed format here's a good pointer:
Creating a PAK File Format
Here's a link which claims that .dat is a movie format, 'DAT' being short for Digital Audio Tape.
I'm not sure I believe the link, but I do remember something about a Microsoft supported format called DAT, from long ago, when I used an earlier version of Windows.
It makes more sense as a logical extension for a DATA file of some kind.
.dat, as others have said, is literally just a data file. In reality, the file extension means nothing other than association with a program. For example, I could make a word processor that saves all the documents with the .mp3 file extension. These files wouldn't be playable in any media software, but the software might try. File extensions are used to help programs know what types of files they can and cannot open--however those rules don't have to be followed.
Anyway, you can dump any sort of information to a file. Programmers/software writers will often choose .dat as the extension of that file because it has become the standard to signify 'this file just holds a ton of data' and that the data doesn't necessarily hold any standardized headers, footers, or formatting.
A dat file could really contain anything. It might be as simple as a zip archive with the extension changed, or it could be a completely custom file type. If you're just starting out, you probably don't want to write your own file format, although doing so can be fun and educational. If you want to encapsulate your data files into some kind of container, you should probably go with a zip, paq, or maybe tar.gz.