I've written a Thrift definition and used this definition to serialize multiple records into one file (I've added the size of the whole record at the beginning of each record). That is, in short, what I have done.
// Serialize one record into an in-memory buffer and grab the raw bytes:
boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(new apache::thrift::transport::TMemoryBuffer);
boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(new apache::thrift::protocol::TBinaryProtocol(transport));
myClass->write(protocol.get());
const std::string & data(transport->getBufferAsString());
Afterwards I just write the string data to the file in binary mode. Now I want to deserialize this file again. I wouldn't have any problem if there were only one record in the file, but unfortunately I have to write multiple records into one file, so I guess I have to work with offsets based on the size I saved in the file along with each record. However, I can't seem to find any example I can use to achieve my goal, and the official documentation is quite lacking. Does anyone have any tips for me? If I'm missing some information, just ask.
Further information:
Of course I want to use Thrift to deserialize. However, one file can contain multiple records. For example: imagine I have defined a struct in a Thrift definition file that contains car information. Now I serialize multiple car structs into one output file. Serializing is no problem, as I just append the data. If I want to deserialize, however, I have to know where one record starts and the next begins. That is my problem: I don't know how to tell Thrift where one record begins and ends. I've searched the internet but can't seem to find an example for C++ (I've got one for Python so far, but am not able to translate it to C++). The structure of one file can be described as follows: [lengthofrecord1][record1][lengthofrecord2][record2][...]
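A rough sketch of how the write side could look; the 4-byte length prefix, the ofstream handling and the generated struct name Car (and its header) are assumptions of this sketch, not something Thrift prescribes:

    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <boost/shared_ptr.hpp>
    #include <thrift/protocol/TBinaryProtocol.h>
    #include <thrift/transport/TBufferTransports.h>
    // #include "gen-cpp/Car_types.h"   // hypothetical generated header for the Car struct

    // Serialize one record and append it to the stream as [length][record].
    // The 32-bit length prefix (host byte order) is an assumption; any fixed-width
    // integer works, as long as the reader uses the same width and byte order.
    void appendRecord(std::ofstream& out, const Car& car)
    {
        boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(
            new apache::thrift::transport::TMemoryBuffer);
        boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(
            new apache::thrift::protocol::TBinaryProtocol(transport));

        car.write(protocol.get());
        const std::string data = transport->getBufferAsString();

        const uint32_t size = static_cast<uint32_t>(data.size());
        out.write(reinterpret_cast<const char*>(&size), sizeof(size));        // [lengthofrecordN]
        out.write(data.data(), static_cast<std::streamsize>(data.size()));    // [recordN]
    }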
Thanks in Advance
Michael
How about having a list<records> that you de/serialize as a whole? Or is it an absolute requirement to read them independently and randomly? If yes, I see 1.5 (one and a half) possible solutions:
Have a second file as an index. This holds a map<recordNumber, offset>, or simply a sorted list of integer pairs, to quickly locate records. Since this data is much smaller than the records themselves, you can probably cache it in memory all the time. (A sketch of this appears after this list.)
The half solution: iff the record size is fixed, any record's position can be calculated easily by multiplying recordSize * (recordNr - 1). This way you don't even need the size prefix. If you have strings in the record or other variable-sized entities, this will not work, unless you force a fixed record size by reserving a buffer for each record with a predefined (maximum) size. It's a little ugly, hence the "half" solution, but you don't need the index file.
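If you go for the index-file option, a minimal sketch could look like this; the on-disk layout (raw uint32_t record number followed by a uint64_t offset) and the function names are made up for illustration:

    #include <cstdint>
    #include <fstream>
    #include <map>

    // Load a (recordNumber, offset) index written as consecutive raw pairs.
    std::map<uint32_t, uint64_t> loadIndex(const char* indexPath)
    {
        std::map<uint32_t, uint64_t> index;
        std::ifstream in(indexPath, std::ios::binary);
        uint32_t number;
        uint64_t offset;
        while (in.read(reinterpret_cast<char*>(&number), sizeof(number)) &&
               in.read(reinterpret_cast<char*>(&offset), sizeof(offset)))
        {
            index[number] = offset;
        }
        return index;
    }

    // Position the data file at the start of a given record.
    bool seekToRecord(std::ifstream& dataFile,
                      const std::map<uint32_t, uint64_t>& index,
                      uint32_t recordNumber)
    {
        std::map<uint32_t, uint64_t>::const_iterator it = index.find(recordNumber);
        if (it == index.end())
            return false;
        dataFile.seekg(static_cast<std::streamoff>(it->second), std::ios::beg);
        return dataFile.good();
    }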
Although maybe not the perfect solution, this seems to work for me:
// Point a TMemoryBuffer at the raw record bytes and deserialize from there:
boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(new apache::thrift::transport::TMemoryBuffer);
boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(new apache::thrift::protocol::TBinaryProtocol(transport));
transport->resetBuffer((uint8_t*) buffer, sizeOfEntry);
buffer is a char array containing the desired record (I used seekg to get to the right offset) and sizeOfEntry is the record's size. Afterwards I can go on with the automatically generated read method of my Thrift-generated class. In fact I had this solution earlier; I had just messed up my offset, which is why it didn't work.
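For completeness, reading one [length][record] entry after another out of such a file could look roughly like this; the 4-byte length prefix matches the write-side assumption above, and the generated struct name Car is again hypothetical:

    #include <cstdint>
    #include <fstream>
    #include <vector>
    #include <boost/shared_ptr.hpp>
    #include <thrift/protocol/TBinaryProtocol.h>
    #include <thrift/transport/TBufferTransports.h>
    // #include "gen-cpp/Car_types.h"   // hypothetical generated header for the Car struct

    // Read the next [length][record] pair from the current file position and
    // deserialize it into "car". Returns false at end of file or on a bad entry.
    bool readNextRecord(std::ifstream& in, Car& car)
    {
        uint32_t size = 0;
        if (!in.read(reinterpret_cast<char*>(&size), sizeof(size)))
            return false;                                   // clean end of file
        if (size == 0)
            return false;                                   // corrupt or empty entry

        std::vector<uint8_t> buffer(size);
        if (!in.read(reinterpret_cast<char*>(&buffer[0]), size))
            return false;                                   // truncated record

        boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(
            new apache::thrift::transport::TMemoryBuffer);
        boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(
            new apache::thrift::protocol::TBinaryProtocol(transport));

        transport->resetBuffer(&buffer[0], size);           // hand the raw bytes to Thrift
        car.read(protocol.get());
        return true;
    }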
Related
I am simulating a DELETE command from SQL in C++. I've tried all the possible data types to temporarily store the data from a certain table (a .bin file in my case), but none of them worked.
If I store the entire file at once in a char buffer[5000], the text is safe, but the integers end up as raw bytes, so in the buffer they look like " }+;-." and so on.
If I select them one by one and concatenate each value to a similar buffer with strncat (and with itoa() if needed), I end up having them perfect, but then I can't navigate in the buffer using fixed data-type sizes.
I would very much appreciate a suggestion on this. I've literally searched the entire internet by now. Thank you!
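Without seeing the table layout it is hard to be specific, but one common pattern is to read each fixed-size record into a struct instead of one big char buffer, so the integers keep their binary form. The struct below is purely hypothetical; it has to match exactly how the records were written:

    #include <fstream>
    #include <vector>

    // Hypothetical fixed-size record layout; replace with the real table schema.
    struct Record
    {
        int    id;
        char   name[40];
        double balance;
    };

    // Read every record from the .bin file into memory. Integers come back intact
    // because they are read as raw bytes into fields of the right type, instead of
    // being treated as printable text.
    std::vector<Record> loadRecords(const char* path)
    {
        std::vector<Record> records;
        std::ifstream in(path, std::ios::binary);
        Record r;
        while (in.read(reinterpret_cast<char*>(&r), sizeof(r)))
            records.push_back(r);
        return records;
    }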
I have a working integration of FatFS in my C++ application running on a Cortex M4-based platform.
My application consists of logging data to a data format called MDF.
On the implementation side, I log data (to a given file) in batches of buffers; the number of buffers depends on how fast I acquire the data: log a batch of one buffer . . . do other stuff . . . log a batch of five buffers . . . do other stuff . . . etc.
There is also a header which is 24 bytes and contains the number of bytes of data. On a PC, I would just save the header at the end of the measurement but this is an embedded product which could be de-powered at any point in time. If I don't save the header periodically, the file becomes "corrupted".
Therefore, in order to maintain coherency I need to re-save the header after saving every batch of data and that's where my issue is.
This means that I have to call f_lseek before writing the header and then before I write the batch of data.
I am using f_cache_fptr so f_lseek is not painfully slow but I'd like to avoid needing to call f_lseek so frequently.
QUESTION
Is it possible to somehow have 2 seek locations so that I don't need to call f_lseek to ping-pong between the header location and the data location?
I am open to modifying FatFS.
The problem, at the low-level, is simpler because the header only shares one 512 byte sector with the data: 24 bytes of header followed by 488 bytes of data.
Is it possible to somehow have 2 seek locations so that I don't need to call f_lseek to ping-pong between the header location and the data location?
Not as far as I can tell, no, and it doesn't really seem to make sense. A FIL has only one current position, indicating where the next data written to it will go. What would it even mean for there to be two? How would the system know where to write? It certainly wouldn't be correct to write to both places.
Note in particular that with some operating systems and file systems, it is possible to open the same file more than once, but FatFS supports duplicate file opens only when all openings involved are for read-only mode.
I guess it would be possible to modify FatFS to give it the ability to store one file position when you seek to another, and then later to return to the first. So that would mean adding at least one member to the FIL structure, and adding at least one new function.
But why muck with the innards of FatFS? That's going to be at least a little risky. As long as you have to add a function anyway, how about just implementing a FRESULT my_f_write_at_beginning(FIL* fp, const void* buff, UINT btw, UINT* bw) on top of the existing functions? It can store the current position, seek to the beginning of the file, perform the write (maybe ensuring that the full number of bytes specified is written), and then seek back to the original position.
But fundamentally, no, there is no escaping ping-ponging back and forth, because doing so is part of the requirement you laid out.
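For what it's worth, a sketch of that my_f_write_at_beginning wrapper, built only on the standard FatFS calls f_tell, f_lseek, f_write and f_sync (error handling kept minimal; FSIZE_t needs a reasonably recent FatFS revision, older ones use DWORD for the file pointer):

    #include "ff.h"   // FatFS

    // Write "btw" bytes from "buff" at the start of the file, then restore the
    // previous file position. Sketch only; error paths simply bail out.
    FRESULT my_f_write_at_beginning(FIL* fp, const void* buff, UINT btw, UINT* bw)
    {
        FSIZE_t saved = f_tell(fp);            // remember where the data is being appended

        FRESULT res = f_lseek(fp, 0);          // jump to the header location
        if (res != FR_OK) return res;

        res = f_write(fp, buff, btw, bw);      // rewrite the header
        if (res != FR_OK) return res;
        if (*bw != btw) return FR_DISK_ERR;    // short write: treat as an error

        res = f_sync(fp);                      // optional: flush so the header hits the medium
        if (res != FR_OK) return res;

        return f_lseek(fp, saved);             // go back to the append position
    }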
On a PC, I would just save the header at the end of the measurement but this is an embedded product which could be de-powered at any point in time. If I don't save the header periodically, the file becomes "corrupted".
Therefore, in order to maintain coherency I need to re-save the header after saving every batch of data and that's where my issue is.
More correctly: you need to save the buffer and the header (footer?), update the directory entry to reflect the new file size, and update the file allocation table to account for the sectors allocated; and you need to write to at least 3 completely separate sectors "atomically" so that everything is consistent if there's a power failure at the wrong time.
This isn't entirely possible on most hardware.
However, there is a way to do it "somewhat safely". Specifically:
pre-allocate enough clusters for a completely new copy of the file (including the new data to append to the end) and update the file allocation table accordingly. If there's a power failure while doing this (or immediately after this point) the risk is lost clusters, which is an "ignore-able" problem that will waste some space but can be fixed easily with a typical "check disk" utility.
create a whole new copy of the file's data in the pre-allocated clusters (copy the old data, then append the new data and header). If there's a power failure in the middle of doing this (or immediately after this point), then the risk is the same as before - just some lost clusters (ignore-able).
atomically update the directory entry; changing both the file size and the "starting cluster number" with the same atomic (single sector) write. If there's a power failure after this point the risk is the same lost clusters (where the old version of the file's data was instead of where the new version of the file data is).
free the clusters that the old version of the file used by doing writes to the file allocation table. After this point you've completed successfully, so a power failure is fine.
To make this less awful for performance you can have two "cluster chains" and alternate between them; such that one chain of clusters is for the current version of the file and the other will become the next version of the file. This avoids the need to copy a lot of older data from one place to another (if you know the old data is still in previously used clusters). It could also avoid the need to allocate and free most clusters in the file allocation table, but only with a significant increase in the risk of lost clusters.
Of course for any of this to work you'd need a guarantee that single-sector writes are atomic; and you can't be using FAT12 (where an entry in the file allocation table can be split by a sector boundary).
I have a requirement to read a text file, but it's too large, so I decided to read only some lines from it. Can I use a seek method to jump to a given line? Then I could read only that line, because the file is too large and reading the whole file wastes a lot of time. If that's not possible, can anyone give a better solution? (Seek to a given line and read it; I know binary files are read byte by byte.)
Example of my file:
event1 0
subevent 1
subevent 2
event2 3
(In my file, after each event it displays a number of lines; I want to use that to seek back to the previous event.)
Yes, you can seek to a point in the file and then read from there. One possible problem is that if the lines are all different lengths, a random location in the file has a higher probability of being in a longer line: you're not getting evenly distributed probabilities across lines. If you really must have identical probabilities, then you need to make at least one pass over the file to find the start of each line; you can then store those offsets in a vector and randomly select a vector element to guide seeking to the line data in the file. If you only care a little bit, you can perhaps advance a small but random number of lines past the one you initially seek to; that evens the odds a bit and avoids the initial pass, but isn't perfect. hansmaad's comment adds a neat approach too - perfect results with pretty good performance - but it requires that the lines be numbered in the file itself.
Unless each line has exactly the same length, you're going to have to scan through it.
If you want to jump around in it, you can scan through it, saving the offset of each line in a container of your choice, and then use that to seek to a specific line.
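A minimal sketch of that offset-vector approach (the function names are made up for illustration):

    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <vector>

    // One pass over the file, recording where each line starts.
    std::vector<std::streampos> buildLineIndex(std::ifstream& in)
    {
        std::vector<std::streampos> offsets;
        in.clear();
        in.seekg(0, std::ios::beg);
        std::string line;
        while (true)
        {
            std::streampos pos = in.tellg();
            if (!std::getline(in, line))
                break;
            offsets.push_back(pos);
        }
        return offsets;
    }

    // Jump straight to line "n" (0-based) and read it.
    bool readLine(std::ifstream& in, const std::vector<std::streampos>& offsets,
                  std::size_t n, std::string& line)
    {
        if (n >= offsets.size())
            return false;
        in.clear();                       // clear eof left over from the indexing pass
        in.seekg(offsets[n]);
        return !std::getline(in, line).fail();
    }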
Assuming that the lines are variable / random length, I don't believe there is any built-in way to jump directly to the start of a particular line. You can seek to an arbitrary byte position in the file. However, this might land anywhere in the beginning / middle / end of a line.
My best suggestion would be to attack the problem in two steps:
First, make a complete pass through the file, byte by byte, searching for the start of each line. Record the byte position of each line and store it into an array, vector, etc. (Basically, you are creating an index that maps from line number to starting position.) Then, when you have this index built up, you can easily jump to a particular line, by looking up the position in your index.
As far as I know, there is no built-in way to seek to a new line without already knowing where the lines are. I can't tell you the best way to achieve your goal, because most of your question details how you're trying to accomplish it, not what it is you're actually trying to accomplish. Therefore, I might go one of two ways with this:
1) If you actually need every last bit of data from the file (there is no metadata or other information that can be discarded):
Someone mentioned scanning through the file, tracking the lines as you go and building an index with it so you can read in one line at a time. This might work, and it would be the way to go if you actually need each line in its entirety, or if you only need the line number and plan on reading in small pieces at a time from there. However, without knowing details about your constraints or requirements, I would not recommend reading in entire lines using this method for one main reason: I have no way of knowing that one line will not itself be too large to load (what if there is only one line in the file?).
Instead, I would simply allocate a buffer of a size that is an appropriate amount to process at a time, and process the file in chunks of that size until you reach the end. You can stream more data in as you go. Without additional details, I can't tell you what that magic number should be, but the size of the largest chunk of information you might need to process is a good starting point as a minimum (a rough sketch of this chunked approach follows at the end of this answer).
2) If you don't need every last bit of data from the file (you can discard some of the information in it), then you only need some of it. If you only need select pieces of data, then they are easier to find if they are tagged (which is what XML is for). There are lots of free XML parsers, or you can write your own. Then you'd search for tags instead of arbitrary line numbers, and changes to the file that result in the data being in a different location won't affect your ability to find it if it's tagged, as it would if you're just going by line numbers.
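Regarding option 1, here is the chunked-reading sketch mentioned above; the 64 KiB buffer size and the processing hook are arbitrary placeholders:

    #include <cstddef>
    #include <fstream>
    #include <vector>

    // Process a large file in fixed-size chunks instead of whole lines,
    // so no single read depends on how long a "line" happens to be.
    void processInChunks(const char* path)
    {
        const std::size_t chunkSize = 64 * 1024;     // arbitrary; tune to your data
        std::vector<char> buffer(chunkSize);
        std::ifstream in(path, std::ios::binary);

        while (in)
        {
            in.read(&buffer[0], static_cast<std::streamsize>(buffer.size()));
            std::streamsize got = in.gcount();       // may be short on the last chunk
            if (got == 0)
                break;
            // handleChunk(&buffer[0], got);         // hypothetical processing hook
        }
    }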
I'm working on a simple database console application in C++ for adding, editing and deleting records in a .dat file. I have the addition and modification down; I'm just finding it hard to understand the concept of deletion in this scenario. Below is how I write a record.
Write record
fh.seekp(num*sizeof(customerObj),ios::beg); // Move the write pointer to where rec is
fh.write((char*)&customerObj,sizeof(customerObj)); // Write updated rec
Any ideas how instead of write() I could have something equivalent to delete()... or is it not that simple?
C and C++ don't have functions to delete parts of files. Many operating systems don't either.
Possible options:
If this is the last record, truncate the file. If not, move (=copy) all records after it, overwriting it, then truncate. Alternatively you could move (=copy) the last record to it and then truncate.
Create an extra file and copy to it all records before this and after this record. Then delete the old file and rename the new file.
Mark the record as unused. When writing new records, check if you have any unused slots and use them first (see the sketch after this list).
Use a file per record.
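As an illustration of the "mark the record as unused" option, here is a rough sketch; the record layout and the spare inUse flag are assumptions, and the real customerObj would need an equivalent field:

    #include <fstream>

    // Hypothetical record layout: same idea as customerObj, plus an "inUse" flag.
    struct CustomerRec
    {
        char   name[40];
        double balance;
        char   inUse;          // 1 = live record, 0 = deleted slot
    };

    // "Delete" record number num by rewriting it in place with the flag cleared.
    void deleteRecord(std::fstream& fh, std::streamoff num)
    {
        CustomerRec rec = CustomerRec();    // zeroed record, inUse == 0
        fh.seekp(num * static_cast<std::streamoff>(sizeof(CustomerRec)), std::ios::beg);
        fh.write(reinterpret_cast<const char*>(&rec), sizeof(rec));
    }

    // When adding, reuse the first free slot if there is one.
    std::streamoff findFreeSlot(std::fstream& fh)
    {
        fh.seekg(0, std::ios::beg);
        CustomerRec rec;
        std::streamoff slot = 0;
        while (fh.read(reinterpret_cast<char*>(&rec), sizeof(rec)))
        {
            if (!rec.inUse)
                return slot;                // reuse this hole
            ++slot;
        }
        fh.clear();                         // clear eof so the caller can keep using fh
        return slot;                        // otherwise append at the end
    }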
Introduce a marker for deleted records. Instead of moving large chunks of the file, you only need to write one byte. When you need to allocate a new record, you can iterate over the already-deleted ones and just remove the marker.
Well, deleting from files is not trivial, as you cannot delete a row from a file directly.
One approach is to read the entire file (or read it in bulk) and write it back without the unwanted line. (Quite robust, but not efficient for large files.)
Maybe if you divide your record file into smaller partitioned files, then doing the above will be more efficient.
Another thing you can do is just mark a row in the file as invalid (as is done in memory when deleting a pointer) and overwrite it when needed. This of course depends on how you write your records, but I hope you get what I mean.
I want to be able to read from an unsorted source text file (one record in each line), and insert the line/record into a destination text file by specifying the line number where it should be inserted.
Where to insert the line/record into the destination file will be determined by comparing the incoming line from the incoming file to the already ordered list in the destination file. (The destination file will start as an empty file and the data will be sorted and inserted into it one line at a time as the program iterates over the incoming file lines.)
Incoming File Example:
1 10/01/2008 line1data
2 11/01/2008 line2data
3 10/15/2008 line3data
Desired Destination File Example:
2 11/01/2008 line2data
3 10/15/2008 line3data
1 10/01/2008 line1data
I could do this by performing the sort in memory via a linked list or similar, but I want to allow this to scale to very large files. (And I am having fun trying to solve this problem as I am a C++ newbie :).)
One of the ways to do this may be to open 2 file streams with fstream (1 in and 1 out, or just 1 in/out stream), but then I run into the problem that finding and seeking to a file position is difficult, because it seems to depend on the absolute position from the start of the file rather than on line numbers :).
I'm sure problems like this have been tackled before, and I would appreciate advice on how to proceed in a manner that is good practice.
I'm using Visual Studio 2008 Pro C++, and I'm just learning C++.
The basic problem is that under common OSs, files are just streams of bytes. There is no concept of lines at the filesystem level. Those semantics have to be added as an additional layer on top of the OS-provided facilities. Although I have never used it, I believe that VMS has a record-oriented filesystem that would make what you want to do easier. But under Linux or Windows, you can't insert into the middle of a file without rewriting the rest of the file. It is similar to memory: at the highest level, it's just a sequence of bytes, and if you want something more complex, like a linked list, it has to be added on top.
If the file is just a plain text file, then I'm afraid the only way to find a particular numbered line is to walk the file counting lines as you go.
The usual 'non-memory' way of doing what you're trying to do is to copy the file from the original to a temporary file, inserting the data at the right point, and then do a rename/replace of the original file.
Obviously, once you've done your insertion, you can copy the rest of the file in one big lump, because you don't care about counting lines any more.
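A rough sketch of that copy-through-a-temporary-file approach (the .tmp name and the 1-based line numbering are assumptions of this sketch):

    #include <cstdio>    // std::rename, std::remove
    #include <cstddef>
    #include <fstream>
    #include <string>

    // Insert "newLine" so that it becomes line number "lineNo" (1-based) of "path",
    // by streaming through a temporary file and swapping it in afterwards.
    bool insertLineAt(const std::string& path, std::size_t lineNo, const std::string& newLine)
    {
        std::ifstream in(path.c_str());
        std::string tmpPath = path + ".tmp";          // assumption: this name is free to use
        std::ofstream out(tmpPath.c_str());
        if (!in || !out)
            return false;

        std::string line;
        std::size_t current = 1;
        while (std::getline(in, line))
        {
            if (current == lineNo)
                out << newLine << '\n';               // insert before the current line
            out << line << '\n';
            ++current;
        }
        if (current <= lineNo)                        // insertion point at or past the end
            out << newLine << '\n';

        in.close();
        out.close();
        return std::remove(path.c_str()) == 0 &&
               std::rename(tmpPath.c_str(), path.c_str()) == 0;
    }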
A [distinctly non-C++] solution would be to use the *nix sort tool, sorting on the second column of data. It might look something like this:
cat <file> | sort -k 2,2 > <file2> ; mv <file2> <file>
It's not exactly in-place, and it fails the request of using C++, but it does work :)
You might be tempted to redirect straight back into the same file:
cat <file> | sort -k 2,2 > <file>
but that won't work; the shell truncates <file> before anything reads it. Using sort -k 2,2 -o <file> <file> is safe, since sort reads all of its input before opening the output file.
* http://www.ss64.com/bash/sort.html - sort man page
One way to do this is not to keep the file sorted, but to use a separate index, using Berkeley DB (BerkeleyDB). Each record in the db has the sort keys and the offset into the main file. The advantage of this is that you can have multiple ways of sorting without duplicating the text file. You can also change lines without rewriting the file, by appending the changed line at the end and updating the index to ignore the old line and point to the new one. We used this successfully for multi-GB text files that we had to make many small changes to.
Edit: The code I developed to do this is part of a larger package that can be downloaded here. The specific code is in the btree* files under source/IO.
Try a modified bucket sort. Assuming the id values lend themselves well to it, you'll get a much more efficient sorting algorithm. You may be able to enhance I/O efficiency by actually writing out the buckets (use small ones) as you scan, thus potentially reducing the amount of randomized file I/O you need. Or not.
Hopefully, there are some good code examples on how to insert a record based on line number into the destination file.
You can't insert contents into a middle of the file (i.e., without overwriting what was previously there); I'm not aware of production-level filesystems that support it.
I think the question is more about implementation than specific algorithms; specifically, about handling very large datasets.
Suppose the source file has 2^32 lines of data. What would be an efficient way to sort the data?
Here's how I'd do it:
Parse the source file and extract the following information: sort key, offset of the line in the file, length of the line. This information is written to another file. This produces a dataset of fixed-size elements that is easy to index; call it the index file. (A sketch of this record layout appears after this answer.)
Use a modified merge sort. Recursively divide the index file until the number of elements to sort has reached some minimum amount - true merge sort recurses to 1 or 0 elements, I suggest stopping at 1024 or something, this will need fine tuning. Load the block of data from the index file into memory and perform a quicksort on it and then write the data back to disk.
Perform the merge on the index file. This is tricky, but can be done like this: load a block of data from each source (1024 entries, say). Merge into a temporary output file and write. When a block is emptied, refill it. When no more source data is found, read the temporary file from the start and overwrite the two parts being merged - they should be adjacent. Obviously, the final merge doesn't need to copy the data (or even create a temporary file). Thinking about this step, it is probably possible to set up a naming convention for the merged index files so that the data doesn't need to overwrite the unmerged data (if you see what I mean).
Read the sorted index file and pull out from the source file the line of data and write to the result file.
It certainly won't be quick with all that file reading and writing, but it should be quite efficient - the real killer is the random seeking of the source file in the final step. Up to that point, the disk access is usually linear and should therefore be reasonably efficient.
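As a rough sketch of step 1, here is one way the fixed-size index records could be produced; the 16-byte key field and the column layout (id, then date, then the rest) are assumptions taken from the example data:

    #include <cstring>
    #include <fstream>
    #include <sstream>
    #include <string>

    // Fixed-size index entry: easy to sort and to read back in blocks.
    struct IndexEntry
    {
        char      key[16];     // the date column, zero-padded (assumed sort key)
        long long offset;      // where the line starts in the source file
        long long length;      // length of the line, excluding the newline
    };

    // Step 1: scan the source file once and write one IndexEntry per line.
    void buildIndexFile(const char* sourcePath, const char* indexPath)
    {
        std::ifstream in(sourcePath, std::ios::binary);
        std::ofstream out(indexPath, std::ios::binary);

        std::string line;
        long long offset = static_cast<long long>(in.tellg());
        while (std::getline(in, line))
        {
            IndexEntry e;
            std::memset(&e, 0, sizeof(e));
            e.offset = offset;
            e.length = static_cast<long long>(line.size());

            std::istringstream fields(line);
            std::string id, date;
            fields >> id >> date;                          // assumed columns: id, date, rest
            std::strncpy(e.key, date.c_str(), sizeof(e.key) - 1);

            out.write(reinterpret_cast<const char*>(&e), sizeof(e));
            offset = static_cast<long long>(in.tellg());   // start of the next line
        }
    }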