I am working with large binary files (approx. 2 GB each) that contain raw data. These files have a well-defined structure, where each file is an array of events, and each event is an array of data banks. Each event and data bank has a structure (header, data type, etc.).
From these files, all I have to do is extract whatever data I might need, and then I just analyze and play with the data. I might not need all of the data; sometimes I just extract XType data, other times just YType, etc.
I don't want to shoot myself in the foot, so I am asking for guidance/best practice on how to deal with this. I can think of 2 possibilities:
Option 1
Define a DataBank class, this will contain the actual data (std::vector<T>) and whatever structure this has.
Define a Event class, this has a std::vector<DataBank> plus whatever structure.
Define a MyFile class, this is a std::vector<Event> plus whatever structure.
The constructor of MyFile will take a std:string (name of the file), and will do all the heavy lifting of reading the binary file into the classes above.
Then, whatever I need from the binary file will just be a method of the MyFile class; I can loop through Events, I can loop through DataBanks, everything I could need is already in this "unpacked" object.
The workflow here would be like:
int main() {
    MyFile data_file("data.bin");
    std::vector<XData> my_data = data_file.getXData();
    // Play with my_data, and never again use the data_file object
    // ...
    return 0;
}
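For concreteness, a minimal sketch of what the Option 1 layout could look like; the names, members, and the XData placeholder are assumptions rather than a finished design:

#include <cstdint>
#include <string>
#include <vector>

struct XData { /* whatever fields you end up analyzing */ };

struct DataBank {
    uint32_t type = 0;             // bank type taken from the bank header
    std::vector<uint8_t> payload;  // raw bank data
};

struct Event {
    uint64_t id = 0;               // event number taken from the event header
    std::vector<DataBank> banks;   // all banks belonging to this event
};

class MyFile {
public:
    explicit MyFile(const std::string& filename) {
        // Heavy lifting: open `filename`, walk the event/bank headers, fill events_.
    }
    std::vector<XData> getXData() const {
        std::vector<XData> out;
        // Walk events_ and their banks, decoding only the banks of type X into `out`.
        return out;
    }
private:
    std::vector<Event> events_;
};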
Option 2
Write functions that take a std::string (the file name) as an argument and extract whatever I need from the file, e.g. std::vector<XData> getXData(std::string), int getNumEvents(std::string), etc.
The workflow here would be like:
int main() {
    std::vector<XData> my_data = getXData("data.bin");
    // Play with my_data, and I didn't create a massive object
    // ...
    return 0;
}
Pros and Cons that I see
Option 1 seems like the cleaner option: I would only "unpack" the binary file once, in the MyFile constructor. But I will have created a huge object that contains all the data from a 2 GB file, most of which I will never use. If I need to analyze 20 files (each of 2 GB), will I need 40 GB of RAM? I don't understand how such large objects are handled; will this affect performance?
Option 2 seems faster; I will just extract whatever data I need, and that's it: I won't "unpack" the entire binary file only to later extract the data I care about, and I will only create objects for the data I will actually play with. The problem is that I will have to deal with the binary file structure in every function; if that structure ever changes, it will be a pain.
As you can see from my question, I don't have much experience with dealing with large structures and files. I appreciate any advice.
I do not know whether the following scenario matches yours.
I had a case of processing huge log files of hardware signal logging in the automotive area. Signals like door locked, radio on, temperature, and thousands more, some appearing periodically. The operator selects some signal types and then analyzes diagrams of the signal values.
This scenario is based on a huge log file that grows over time.
What I did was create, for every signal type, its own log-file extract in an optimized binary format (one that loads into a fixed-size byte[] array).
This meant that displaying the diagram for just 10 selected types was fast, in real time: zooming in on a time interval, dynamically selecting signal types, and so on.
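As a rough illustration of that splitting step, here is a sketch that assumes a hypothetical fixed-size Record with a type field; the struct layout and file naming are made up, and a real implementation would batch writes instead of opening the extract file once per record:

#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical fixed-size record as it appears in the big log file.
struct Record {
    uint16_t signal_type;
    uint64_t timestamp;
    double   value;
};

// Append every record of the big log to a small per-type extract file,
// so that loading one signal type later is a single sequential read.
void split_by_signal_type(const std::string& log_path) {
    std::FILE* in = std::fopen(log_path.c_str(), "rb");
    if (!in) return;
    Record rec;
    while (std::fread(&rec, sizeof(rec), 1, in) == 1) {
        std::string out_path = "signal_" + std::to_string(rec.signal_type) + ".bin";
        std::FILE* out = std::fopen(out_path.c_str(), "ab");  // simple but slow: open/append/close per record
        if (out) {
            std::fwrite(&rec, sizeof(rec), 1, out);
            std::fclose(out);
        }
    }
    std::fclose(in);
}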
I hope you got some ideas.
RocksDb: Multiple values per key (c++)
What I am trying to do
I am trying to adapt my simple blockchain implementation to save the blockchain to the hard drive periodically, so I looked into different DB solutions. I decided to use RocksDb due to its ease of use and good documentation & examples. I read through the documentation but could not figure out how to adapt it to my use case.
I have a class Block:
class Block {
public:
    std::string PrevHash;

private:
    blockheader header;                 // The header of the block
    uint32_t index;                     // Height of this block
    std::vector<tx_data> transactions;  // All transactions in the block, in a vector
    std::string hash;                   // The hash of the block
    uint64_t timestamp;                 // The timestamp this block was created by the node
    std::string data;                   // Extra data that can be appended to blocks (for example text or a smart contract).
                                        // The larger this field, the higher the fee; the max size is defined in config.h
};
It contains a few variables and a vector of a struct tx_data. I want to load this data into a RocksDb database.
What I have tried
After Google failed to return any results on storing multiple values with one key, I decided I would have to just enclose each block's data with 0xa1 at the beginning and 0x2a at the end:
*0x2a*
header
index
txns
hash
timestamp
data
*0x2a*
but I decided there was surely a simpler way. I tried looking at the code used by turtlecoin, a currency that uses RocksDb for its database, but the code there is practically indecipherable. I have heard about serialization, but there seems to be little info out there on it.
Perhaps I am misunderstanding the use of a DB?
You need to serialize it. Serialization is the process of taking a structured set of data and turning it into one string, number, or vector of bytes that can later be de-serialized back into that struct. One method would be to take the hash of the block and use it as the key in the DB, then create a new struct which does not contain the hash. Then write a function that takes a Block struct and a path, constructs a BlockNoHash struct, and saves it, and another function that reads a block by its hash and spits out a Block struct. Very basically, you could split each field with a character which will never occur in the data (e.g. ` or |), though this means that if one piece of the data is corrupted, you can't get any of the other data.
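As a hedged sketch of that idea, here is a naive length-prefixed serialization together with the basic RocksDB Put call; the helper names are made up, and a real version would serialize every field of Block (header, index, transactions, ...):

#include <cstdint>
#include <string>
#include "rocksdb/db.h"

// Length-prefix each string field so that no delimiter character can ever
// collide with the data being stored.
static void put_u64(std::string& out, uint64_t v) {
    for (int i = 0; i < 8; ++i) out.push_back(char((v >> (8 * i)) & 0xff));
}
static void put_str(std::string& out, const std::string& s) {
    put_u64(out, s.size());
    out += s;
}

// Store a block under its hash; the hash itself is not repeated in the value.
bool save_block(rocksdb::DB* db, const std::string& hash,
                const std::string& prev_hash, uint64_t timestamp,
                const std::string& data) {
    std::string value;
    put_str(value, prev_hash);
    put_u64(value, timestamp);
    put_str(value, data);
    return db->Put(rocksdb::WriteOptions(), hash, value).ok();
}

Loading is the mirror image: Get() the value by its hash key and read the fields back in the same order they were written.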
There are two related questions here.
One is: how do you store complex data -- more than just a simple integer or string -- within a key-value store like RocksDB? As Leo says, you need to serialize it.
Rather than writing your own code, the typical, easier way is to use a framework like Protobuf or Thrift to generate code that translates between your in-memory structures and a flat byte representation suitable to store in a database (or send over the network).
A related question, from the title: how do you store multiple values per key?
There are two main options:
Use a compound key that distinguishes the various values. By walking a key prefix you can find all the values in a set of related keys (see the sketch below). This is better if the values get very large or if you want to find and update them independently.
Or, make the value for a single key actually be a compound object that includes several inner values. This is easiest if you always want to fetch all the sub-values in a single operation.
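For the compound-key option, here is a sketch of a prefix walk with a RocksDB iterator; the "block/<hash>/<field>" key scheme is just an example layout:

#include <memory>
#include <string>
#include "rocksdb/db.h"

// Store the sub-values of one block under keys like "block/<hash>/header",
// "block/<hash>/txns", ... and fetch them all back with one prefix scan.
void read_block_fields(rocksdb::DB* db, const std::string& hash) {
    const std::string prefix = "block/" + hash + "/";
    std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
    for (it->Seek(prefix);
         it->Valid() && it->key().starts_with(prefix);
         it->Next()) {
        std::string field = it->key().ToString().substr(prefix.size());
        std::string value = it->value().ToString();
        // ... deserialize `value` according to which `field` this is ...
    }
}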
I have a working integration of FatFS in my C++ application running on a Cortex M4-based platform.
My application consists of logging data to a data format called MDF.
On the implementation side, I log data (to a given file) in batches of buffers; the number of buffers depends on how fast I acquire the data: log a batch of one buffer ... do other stuff ... log a batch of five buffers ... do other stuff ... etc.
There is also a header which is 24 bytes and contains the number of bytes of data. On a PC, I would just save the header at the end of the measurement but this is an embedded product which could be de-powered at any point in time. If I don't save the header periodically, the file becomes "corrupted".
Therefore, in order to maintain coherency I need to re-save the header after saving every batch of data and that's where my issue is.
This means that I have to call f_lseek before writing the header and then again before writing the next batch of data.
I am using f_cache_fptr so f_lseek is not painfully slow but I'd like to avoid needing to call f_lseek so frequently.
QUESTION
Is it possible to somehow have 2 seek locations so that I don't need to call f_lseek to ping-pong between header-location and data-location?
I am open to modifying FatFS.
The problem, at the low level, is simpler because the header only shares one 512-byte sector with the data: 24 bytes of header followed by 488 bytes of data.
Is it possible to somehow have 2 seek locations so that I don't need to call f_lseek to ping-pong between header-location and data-location?
Not as far as I can tell, no, and it doesn't really seem to make sense. A FIL has only one current position, indicating where the next data written to it will go. What would it even mean for there to be two? How would the system know where to write? It certainly wouldn't be correct to write to both places.
Note in particular that with some operating systems and file systems, it is possible to open the same file more than once, but FatFS supports duplicate file opens only when all openings involved are for read-only mode.
I guess it would be possible to modify FatFS to give it the ability to store one file position when you seek to another, and then later to return to the first. So that would mean adding at least one member to the FIL structure, and adding at least one new function.
But why muck with the innards of FatFS? That's going to be at least a little risky. As long as you have to add a function anyway, how about just implementing a FRESULT my_f_write_at_beginning(FIL* fp, const void* buff, UINT btw, UINT* bw) on top of the existing functions? It can store the current position, seek to the beginning of the file, perform the write (maybe ensuring that the full number of bytes specified is written), and then seek back to the original position.
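A sketch of what that wrapper might look like on top of the standard FatFS calls, assuming the header always lives at offset 0 and treating a short write as an error:

#include "ff.h"

// Write `btw` bytes at the start of the file, then restore the previous
// file position so normal appending can continue.
FRESULT my_f_write_at_beginning(FIL* fp, const void* buff, UINT btw, UINT* bw) {
    FSIZE_t saved_pos = f_tell(fp);         // remember where the data writes left off
    FRESULT res = f_lseek(fp, 0);           // jump to the header location
    if (res != FR_OK) return res;
    res = f_write(fp, buff, btw, bw);       // rewrite the header
    if (res == FR_OK && *bw != btw) res = FR_DISK_ERR;  // short write: report an error
    FRESULT res2 = f_lseek(fp, saved_pos);  // go back to the data position
    return (res != FR_OK) ? res : res2;
}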
But fundamentally, no, there is no escaping ping-ponging back and forth, because doing so is part of the requirement you laid out.
On a PC, I would just save the header at the end of the measurement but this is an embedded product which could be de-powered at any point in time. If I don't save the header periodically, the file becomes "corrupted".
Therefore, in order to maintain coherency I need to re-save the header after saving every batch of data and that's where my issue is.
More correctly: you need to save the buffer and the header (footer?), update the directory entry to reflect the new file size, and update the file allocation table to account for the sectors allocated; that means writing to at least 3 completely separate sectors "atomically" so that everything is consistent if there's a power failure at the wrong time.
This isn't entirely possible on most hardware.
However, there is a way to do it "somewhat safely". Specifically:
pre-allocate enough clusters for a completely new copy of the file (including the new data to append to the end) and update the file allocation table accordingly. If there's a power failure while doing this (or immediately after this point) the risk is lost clusters, which is an "ignore-able" problem that will waste some space but can be fixed easily with a typical "check disk" utility.
create a whole new copy of the file's data in the pre-allocated clusters (copy the old data, then append the new data and header). If there's a power failure in the middle of doing this (or immediately after this point), then the risk is the same as before - just some lost clusters (ignore-able).
atomically update the directory entry; changing both the file size and the "starting cluster number" with the same atomic (single sector) write. If there's a power failure after this point the risk is the same lost clusters (where the old version of the file's data was instead of where the new version of the file data is).
free the clusters that the old version of the file used by doing writes to the file allocation table. After this point you've completed successfully, so a power failure is fine.
To make this less awful for performance you can have two "cluster chains" and alternate between them; such that one chain of clusters is for the current version of the file and the other will become the next version of the file. This avoids the need to copy a lot of older data from one place to another (if you know the old data is still in previously used clusters). It could also avoid the need to allocate and free most clusters in the file allocation table, but only with a significant increase in the risk of lost clusters.
Of course for any of this to work you'd need a guarantee that single-sector writes are atomic; and you can't be using FAT12 (where an entry in the file allocation table can be split by a sector boundary).
I am trying to read a .csv file with 20k+ lines, and each line has ~300 fields.
I am using my own code to read it line by line; then I separate the lines into fields and convert the fields to the corresponding data types (such as integer, double, etc.). These data are then transferred to class objects via their constructors.
However, I found it is not very efficient. It took about 1 min to read these 20k+ lines and create 20k+ objects.
I've googled for fast CSV parsers and found there are many options. I've tried some of them, but I was not very satisfied with the time performance.
Does anyone have a better method to read large .csv files? Many thanks in advance.
An efficient method for parsing, or for that matter processing, files is to read as much of the file into memory as possible before you start parsing.
File I/O has been, since the dawn of computers, one of the slower parts of a computer system. For example, parsing your data may take 1 microsecond. Reading the data from a hard drive may take 1 millisecond == 1000 microseconds.
I've made programs faster by allocating a large array for the data then reading the data into the array. Next I process the data in the array and repeat until the entire file is processed.
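A minimal sketch of that approach: slurp the whole file into memory with one large read, then split it into lines without touching the disk again.

#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Read the whole file into one string, then split into lines in memory.
// One large read replaces tens of thousands of small I/O calls.
std::vector<std::string> read_lines(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::string content((std::istreambuf_iterator<char>(in)),
                        std::istreambuf_iterator<char>());
    std::vector<std::string> lines;
    std::size_t start = 0;
    while (start < content.size()) {
        std::size_t end = content.find('\n', start);
        if (end == std::string::npos) end = content.size();
        lines.emplace_back(content, start, end - start);
        start = end + 1;
    }
    return lines;
}

The same idea applies a second time per line: scan for commas in memory to pick out the ~300 fields.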
Another technique is called memory mapping, where the OS handles reading the file into memory as needed.
Please edit your post to show the code where the bottleneck is.
I am reading several files from linux /proc fs and I will have to insert those values in a database. I should be as optimal as possible. So what is cheaper:
i) cast them to int while I store them in memory, and later cast them back to string when I build my INSERT statement,
ii) or keep them as strings, just sanitizing the values (removing ':', spaces, etc.)?
iii) What should I take into account to make this decision?
I am already splitting the lines, because the order they come in is not good enough for me.
Thanks,
Pedro
Edit - Clarification
Sorry guys, my scenario is the following: I am measuring CPU, memory, network, disk, etc. every 10 seconds. We are developing our own database system, so I cannot count on anything more than plain INSERT statements.
I got interested in this optimization because of the frequency of parsing the data. It's going to be write once; there will be no updates to the data after it is written.
You seem to be performing some archiving activity [write once, read probably at most once] (storing the DB for later, rare/non-frequent use); if not, you should put the optimization emphasis on how the data will be read (not written).
If this is the archiving case, maybe inserting BLOBs (binary large objects, or similar concepts) into the DB will be more efficient.
Addition:
Apparently it will depend on how you will read the data. Are you just listing the data for browsing purposes later on, or will there be more complex fetching queries based on the benchmark values?
For example, if you are later performing something like SELECT * FROM db.Log WHERE log.time > time1 AND Max(Memory) < 5000, then it is best to keep each value in its original format (ints as integers, strings as strings, etc.) so that the main data processing is left to the DB server.
I have a very large binary file and I need to create separate files based on the id within the input file. There are 146 output files and I am using cstdlib and fopen and fwrite. FOPEN_MAX is 20, so I can't keep all 146 output files open at the same time. I also want to minimize the number of times I open and close an output file.
How can I write to the output files effectively?
I also must use the cstdlib library due to legacy code.
The executable must also be UNIX and windows cross-platform compatible.
A couple possible approaches you might take:
keep a cache of opened output file handles that's smaller than FOPEN_MAX; if a write needs to occur on a file that's already open, just do the write. Otherwise, close one of the handles in the cache and open the needed output file. If your data is generally clumped together, i.e. the data for a particular set of files is grouped together in the input file, this should work nicely with an LRU policy for the file-handle cache (see the sketch after this answer).
Handle the output buffering yourself instead of letting the library do it for you: keep your own set of 146 (or however many you might need) output buffers and buffer the output to those, and perform an open/flush/close when a particular output buffer gets filled. You could even combine this with the above approach to really minimize the open/close operations.
Just be sure you test well for the edge conditions that can happen on filling or nearly filling an output buffer.
It may also be worth scanning the input file, making a list of each output id, and sorting it so that you write all the file1 entries first, then all the file2 entries, etc.
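A rough sketch of the file-handle cache from the first bullet, keeping at most a fixed number of FILE* handles open and evicting the least recently used one; the output naming scheme and cache size are assumptions:

#include <cstdio>
#include <list>
#include <string>
#include <unordered_map>

// Keep at most max_open handles open; evict the least recently used one
// when a write needs a file that is not currently open.
class FileCache {
public:
    explicit FileCache(std::size_t max_open) : max_open_(max_open) {}
    ~FileCache() { for (auto& kv : open_) std::fclose(kv.second.fp); }

    void write(int id, const void* data, std::size_t len) {
        auto it = open_.find(id);
        if (it == open_.end()) {
            if (open_.size() >= max_open_) evict_lru();
            std::FILE* fp = std::fopen(("out_" + std::to_string(id) + ".bin").c_str(), "ab");
            if (!fp) return;                              // real code would report the error
            lru_.push_front(id);
            it = open_.emplace(id, Entry{fp, lru_.begin()}).first;
        } else {
            lru_.splice(lru_.begin(), lru_, it->second.pos);  // mark as most recently used
        }
        std::fwrite(data, 1, len, it->second.fp);
    }

private:
    struct Entry { std::FILE* fp; std::list<int>::iterator pos; };
    void evict_lru() {
        int victim = lru_.back();                         // least recently used id
        std::fclose(open_[victim].fp);
        open_.erase(victim);
        lru_.pop_back();
    }
    std::size_t max_open_;
    std::list<int> lru_;                                  // front = most recently used
    std::unordered_map<int, Entry> open_;
};

Combining this with per-id output buffers, as suggested in the second bullet, would cut down the open/close churn even further.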
If you cannot increase FOPEN_MAX somehow, you can create a simple queue of requests and then close and re-open files as needed.
You can also keep track of the last write-time for each file, and try to keep the most recently written files open.
The solution seems obvious - open N files, where N is somewhat less than FOPEN_MAX. Then read through the input file and extract the contents of the first N output files. Then close the output files, rewind the input, and repeat.
First of all, I hope you are running as much in parallel as possible. There is no reason why you can't write to multiple files at the same time. I'd recommend doing what thomask said and queue requests. You can then use some thread synchronization to wait until the entire queue is flushed before allowing the next round of writes to go through.
You haven't mentioned if it's critical to write to these outputs in "real-time", or how much data is being written. Subject to your constraints, one option might be to buffer all the outputs and write them at the end of your software run.
A variant of this is to set up internal buffers of a fixed size; once you hit the internal buffer limit, open the file, append, close, and then clear the buffer for more output. The buffers reduce the number of open/close cycles and give you bursts of writes, which the file system is usually set up to handle nicely. This would be for cases where you need somewhat real-time writes, and/or the data is bigger than available memory, and file handles exceed some max in your system.
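A sketch of that fixed-size-buffer variant, where each output id owns a small in-memory buffer that is flushed with a short open/append/close cycle whenever it fills; buffer size and file naming are arbitrary:

#include <cstdio>
#include <string>
#include <vector>

class BufferedOutputs {
public:
    BufferedOutputs(std::size_t num_files, std::size_t buffer_bytes)
        : buffers_(num_files), limit_(buffer_bytes) {}
    ~BufferedOutputs() { for (std::size_t id = 0; id < buffers_.size(); ++id) flush(id); }

    void write(std::size_t id, const char* data, std::size_t len) {
        std::vector<char>& buf = buffers_[id];
        buf.insert(buf.end(), data, data + len);
        if (buf.size() >= limit_) flush(id);   // one burst write per filled buffer
    }

private:
    void flush(std::size_t id) {
        std::vector<char>& buf = buffers_[id];
        if (buf.empty()) return;
        std::FILE* fp = std::fopen(("out_" + std::to_string(id) + ".bin").c_str(), "ab");
        if (fp) {
            std::fwrite(buf.data(), 1, buf.size(), fp);
            std::fclose(fp);                   // the handle is only open for the duration of the burst
        }
        buf.clear();
    }

    std::vector<std::vector<char>> buffers_;
    std::size_t limit_;
};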
You can do it in 2 steps.
1) Write the first 19 ids to one file, the next 19 ids to the next file and so on. So you need 8 output files (and the input file) opened in parallel for this step.
2) For every file created this way, create 19 new files (only 13 for the last one) and write the ids to them.
Independent of how large the input file is and how many id-datasets it contains, you always need to open and close 163 files. But you need to write the data twice, so it may only be worth it if the id-datasets are really small and randomly distributed.
I think in most cases it is more efficient to open and close the files more often.
The safest method is to open a file and flush after writing, then close it if no more writing will take place soon. Many things outside your program's control can corrupt the content of your file. Keep this in mind as you read on.
I suggest keeping a std::map or std::vector of FILE pointers. The map allows you to access file pointers by an ID. If the ID range is small, you could create a vector, reserving elements and using the ID as an index. This will allow you to keep a lot of files open at the same time. Beware the risk of data corruption.
The limit of simultaneous open files is set by the operating system. For example, if your OS has a maximum of 10, you will have to make arrangements when the 11th file is requested.
Another trick is to reserve buffers in dynamic memory for each file. When all the data is processed, open a file (or more than one), write the buffer (using one fwrite), close it, and move on. This may be faster since you are writing to memory during the data processing rather than to a file. An interesting side note is that your OS may page the buffers to the hard drive as well. The size and quantity of the buffers is an optimization issue that is platform dependent (you'll have to adjust and test to get a good combination). Your program will slow down if the OS pages the memory to disk.
Well, if I was writing it with your listed constraints in the OP, I would create 146 buffers and plop the data into them, then at the end, sequentially walk through the buffers and close/open a single file-handle.
You mentioned in a comment that speed was a major concern and that the naive approach is too slow.
There are a few things that you can start considering. One is reorganizing the binary file into sequential strips, which would allow parallel operations. Another is a least-recently-used approach to your file-handle collection. Another approach might be to fork out to 8 different processes, each outputting to 19-20 files.
Some of these approaches will be more or less practical to write depending on the binary organization (highly fragmented vs. highly sequential).
A major constraint is the size of your binary data. Is it bigger than cache? bigger than memory? streamed out of a tape deck? Continually coming off a sensor stream and only existing as a 'file' in memory? Each of those presents a different optimization strategy...
Another question is usage patterns. Are you doing occasional spike writes to the files, or are you having massive chunks written only a few times? That determines the effectiveness of the different caching/paging strategies of filehandles.
Assuming you are on a *nix system, the limit is per process, not system-wide. So that implies you could launch multiple processes, each responsible for a subset of the id's you are filtering for. Each could keep within the FOPEN_MAX for its process.
You could have one parent process reading the input file then sending the data to various 'write' processes through pipe special files.
"Fewest File Opens" Strategy:
To achieve a minimum number of file opens and closes, you will have to read through the input multiple times. Each time, you pick a subset of the ids that need sorting, and you extract only those records into the output files.
Pseudocode for each thread:
Run through the file, collect all the unique ids.
fseek() back to the beginning of the input.
For every group of 19 IDs:
Open a file for each ID.
Run through the input file, appending matching records to the corresponding output file.
Close this group of 19 output files.
fseek() to the beginning of the input.
This method doesn't work quite as nicely with multiple threads, because eventually the threads will be reading totally different parts of the file. When that happens, it's difficult for the file cache to be efficient. You could use barriers to keep the threads more-or-less in lock-step.
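A single-threaded sketch of that strategy, assuming a hypothetical fixed-size Record with an id field; the pseudocode above layers threads and barriers on top of the same idea:

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical fixed-size input record; the real layout comes from your format.
struct Record {
    uint32_t id;
    char payload[60];
};

void sort_into_files(const char* input_path, std::size_t group_size /* e.g. 19 */) {
    std::FILE* in = std::fopen(input_path, "rb");
    if (!in) return;

    // Pass 0: collect the unique ids.
    std::set<uint32_t> ids;
    Record rec;
    while (std::fread(&rec, sizeof(rec), 1, in) == 1) ids.insert(rec.id);

    // One extra pass per group of ids; only group_size outputs are open at once.
    std::vector<uint32_t> id_list(ids.begin(), ids.end());
    for (std::size_t begin = 0; begin < id_list.size(); begin += group_size) {
        std::size_t end = std::min(begin + group_size, id_list.size());
        std::map<uint32_t, std::FILE*> out;
        for (std::size_t i = begin; i < end; ++i)
            out[id_list[i]] = std::fopen(("out_" + std::to_string(id_list[i]) + ".bin").c_str(), "wb");

        std::fseek(in, 0, SEEK_SET);  // rewind and replay the input for this group
        while (std::fread(&rec, sizeof(rec), 1, in) == 1) {
            auto it = out.find(rec.id);
            if (it != out.end() && it->second) std::fwrite(&rec, sizeof(rec), 1, it->second);
        }
        for (auto& kv : out) if (kv.second) std::fclose(kv.second);
    }
    std::fclose(in);
}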
"Fewest File Operations" Strategy
You could use multiple threads and a large buffer pool to make only one run-through of the input. This comes at the expense of more file opens and closes (probably). Each thread would, until the whole file was sorted:
Choose the next unread page of the input.
Sort that input into 2-page buffers, one buffer for each output file. Whenever one buffer page is full:
Mark the page as unavailable.
If this page has the lowest page-counter value, append it to the file using fwrite(). If not, wait until it is the lowest (hopefully, this doesn't happen much).
Mark the page as available, and give it the next page number.
You could change the unit of flushing output files to disk. Maybe you have enough RAM to collect 200 pages at a time, per output file?
Things to be careful about:
Is your data page-aligned? If not, you'll have to be clever about reading "the next page".
Make sure you don't have two threads fwrite()'ing to the same output file at the same time. If that happens, you might corrupt one of the pages.