I have an std::map data structure (key:data pairs) which I need to store in a binary file.
The key is an unsigned short value and is not sequential.
The data is another big structure, but it is of fixed size.
This map is managed based on user actions (add, modify or delete), and I have to keep the file updated every time I update the map, so that the data survives a system crash.
Adding can always be done at the end of the file, but the user can modify or delete any of the existing records.
That means I have to randomly access the file to update the modified/deleted record.
My questions are:
Is there a way I can reach the modified record in the file directly, without sequentially searching through all the records? (The maximum record size is 5000.)
On a delete, how do I remove the record from the file and move the next record into the deleted record's position?
Appreciate your help!
Assuming you have no need for the tree structure of std::map and you just need an associative container, the most common way I've seen to do this is to have two files: one with the keys and one with the data. The key file contains all of the keys along with the corresponding offset of their data in the data file. Since you said the data is all of the same size, updating should be easy to do (since it won't change any of the offsets). Adding is done by appending. Deleting is the only hard part; you can delete the key to remove the entry from the database, but it's up to you whether you want to keep track of "freed" data sections and try to write over them. To keep track of the keys, you might want another associative container (map or unordered_map) in memory holding the location of each key in the key file.
Edit: For example, the key file might be (note that offsets are in bytes)
key1:0
key2:5
and the corresponding data file would be
data1data2
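As a concrete illustration, here is a minimal C++ sketch of that scheme (not production code: the fixed-size Record struct and the file names data.bin and keys.idx are my own placeholders, and a real crash-safe version would also fsync after each write). Modifying an existing record never touches the key file, because the offset does not change; a delete could be handled by appending a tombstone entry for the key.

    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>

    struct Record {                // placeholder for the big fixed-size structure
        char payload[512];
    };

    // In-memory index rebuilt from keys.idx at startup: key -> offset in data.bin.
    std::unordered_map<uint16_t, long> index;

    // Add or modify: write in place if the key is known, otherwise append.
    void upsert(std::FILE* data, std::FILE* keys, uint16_t key, const Record& r) {
        long off;
        auto it = index.find(key);
        if (it != index.end()) {
            off = it->second;                       // modify: reuse the old slot
        } else {
            std::fseek(data, 0, SEEK_END);          // add: append at the end
            off = std::ftell(data);
            index[key] = off;
            std::fwrite(&key, sizeof key, 1, keys); // persist key -> offset mapping
            std::fwrite(&off, sizeof off, 1, keys);
        }
        std::fseek(data, off, SEEK_SET);
        std::fwrite(&r, sizeof r, 1, data);
        std::fflush(data);                          // hand to the OS; fsync for real durability
        std::fflush(keys);
    }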
This is a pretty tried and true pattern, used by everything from Hadoop to high-speed local databases. To get an idea of the persistence complications you might consider, I would highly recommend reading this Redis blog post; it taught me a lot about persistence when I was dealing with similar issues.
I have what I hope is an easy question. I am using the Google Storage Client library to loop over the blobs in a bucket. After I get the list of blobs in the bucket, I am unable to loop over it again unless I re-run the command to list the bucket.
I read the documentation on page iterators, but I still don't quite understand why this sort of thing couldn't just be stored in memory like a normal variable in Python. Why is this ValueError being thrown when I try to loop over the object again? Does anyone have any suggestions on how to interact with this data better?
For many sources of data, the potential returned items could be huge. While you may only have dozens or hundreds of objects in your bucket, there is absolutely nothing to prevent you from having millions (billions?) of objects. If you list a bucket, it would make no sense to return a million entries and have any hope of maintaining their state in memory. Instead, Google says you should "page" or "iterate" through them. Each time you ask for a new page, you get the next set of data and are presumed to have lost reference to the previous set of data ... and hence maintain only one set of data at a time at your client.
It is the back-end server that maintains your "window" into that data that is being returned. All you need do is say "give me more data ... my context is " and the next chunk of data is returned.
If you want to walk through your data twice then I would suggest asking for a second iteration. Be careful though, the result of the first iteration may not be the same as the second. If new files are added or old ones removed, the results will be different between one iteration and another.
If you really believe that you can hold the results in memory then as you execute your first iteration, save the results and keep appending new values as you page through them. This may work for specific use cases but realize that you are likely setting yourself up for trouble if the number of items gets too large.
I'm starting a project, a mini database system, basically a small database like MySQL. I'm planning to use C++. I read several articles and understood that tables will be stored and retrieved using files. Further, I need to use B+ trees for accessing and updating the data.
Can someone explain to me, with an example, how the data will actually be stored inside the files?
For example, I have a database "test" with a table "student" in it:
student(id, name, grade, class) with some student entries. How will the entries of this table be stored inside a file? Will everything be stored in a single file, or divided into several files, and if the latter, how?
A B+Tree on disk is a bunch of fixed-length blocks. Your program will read/write whole blocks.
Within a block, there are a variable number of records. Those are arranged by some mechanism of your choosing, and need to be ordered in some way.
"Leaf nodes" contain the actual data. In "non-leaf nodes", the "records" contain pointers to child nodes; this is the way BTrees work.
B+Trees have the additional links (and maintenance hassle) of chaining blocks at the same level.
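To give a feel for what those fixed-length blocks might look like, here is a rough C++ sketch of one possible page layout; the 4 KiB block size, the field names and the fixed-size leaf entry are my own illustrative choices, not part of any standard:

    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t PAGE_SIZE = 4096;   // every page is one fixed-length block

    struct PageHeader {
        uint32_t page_id;
        uint8_t  is_leaf;       // 1 = leaf (holds rows), 0 = non-leaf (holds child pointers)
        uint16_t num_records;   // how many slots are currently in use
        uint32_t next_leaf;     // right sibling; this leaf chain is the "+" in B+Tree
    };

    struct LeafEntry {          // leaf pages carry the actual data
        uint32_t key;           // e.g. student id
        char     row[60];       // fixed-size student(id, name, grade, class) tuple
    };

    struct InternalEntry {      // non-leaf pages carry pointers to child pages
        uint32_t key;
        uint32_t child_page;
    };

    union PageBody {
        LeafEntry     leaf[(PAGE_SIZE - sizeof(PageHeader)) / sizeof(LeafEntry)];
        InternalEntry internal[(PAGE_SIZE - sizeof(PageHeader)) / sizeof(InternalEntry)];
    };

    struct Page {               // read and written from the file as one whole block
        PageHeader header;
        PageBody   body;
    };

The whole "student" table can then live in a single file of such pages (one of them being the root), or each table can get its own file; either way the file is just an array of fixed-size blocks addressed by page id.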
Wikipedia has some good discussions.
I'm trying to implement DRUM (Disk Repository with Update Management) in Java as per the IRLBot paper (the relevant pages start at page 4), but as a quick summary it's essentially just an efficient way of batch-updating (key, value) pairs against a persistent repository. In the linked paper it's used as the backbone behind the crawler's URLSeen test, RobotsTxt check and DNS cache.
There has helpfully been a C++ implementation done here, which lays out the architecture in a much more digestible way. For ease of reference, this is the architecture diagram from the C++ implementation:
The part which I'm struggling to understand is the reasoning behind keeping the (key, value) buckets and the auxiliary buckets separate. The article with the C++ implementation states the following:
During merge a key/value bucket is read into a separate buffer and sorted. Its content is synchronized with that of the persistent repository. Checks and updates happen at this moment. Afterwards, the buffer is re-sorted to its original order so that key/value pairs match again the corresponding auxiliary bucket. A dispatching mechanism then forwards the key, value and auxiliary for further processing along with the operation result. This process repeats for all buckets sequentially.
So if the order of the (key, value) buckets needs to be restored to that of the auxiliary buckets in order to re-link the (key, value) pairs with their auxiliary information, why not just keep the (key, value, aux) triples together in single buckets? What is the reasoning behind keeping them separate, and would it be more efficient to just keep them together (since you would no longer need to restore the original unsorted order of the bucket)?
At merge time DRUM loads the content of the key/value disk file of the respective bucket and, depending on the operation used, checks, updates or check+updates every single entry of that file against the backing data store.
The auxiliary disk file is therefore irrelevant to the merge itself, and not loading the auxiliary data into memory simply reduces the memory footprint while sorting, which DRUM tries to minimize in order to check the uniqueness of more than 6 billion entries. In the case of, for example, the RobotsCache, the auxiliary data can even be some 100 KB per entry. This is, however, only a theory of my own; if you really want to know why they separated these two buffers and disk files, you should probably ask Dmitri Loguinov.
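To make the ordering concrete, here is a simplified C++ sketch of that merge step (the types and names are hypothetical, not taken from either implementation); note that the auxiliary bucket is never read or sorted during the merge, it is only re-joined by position afterwards:

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    struct KvEntry {
        long        key;
        std::string value;
        std::size_t original_pos;    // position within the bucket, remembered before sorting
        bool        unique = false;  // result of the check against the repository
    };

    void merge_bucket(std::vector<KvEntry>& kv_bucket /*, Repository& repo */) {
        // 1. Sort by key so the persistent repository can be scanned sequentially.
        std::sort(kv_bucket.begin(), kv_bucket.end(),
                  [](const KvEntry& a, const KvEntry& b) { return a.key < b.key; });

        // 2. Check/update every entry against the backing data store (omitted), e.g.
        //    for (KvEntry& e : kv_bucket) e.unique = !repo.contains(e.key);

        // 3. Re-sort to the original order so that kv_bucket[i] lines up again with
        //    the i-th entry of the untouched auxiliary bucket.
        std::sort(kv_bucket.begin(), kv_bucket.end(),
                  [](const KvEntry& a, const KvEntry& b) {
                      return a.original_pos < b.original_pos;
                  });

        // 4. Dispatch: stream the auxiliary bucket and forward each
        //    (key, value, aux, result) tuple to the next stage (omitted).
    }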
I've also created a Java-based DRUM implementation (and a Java-based IRLbot implementation), but both might need a bit more love. There is also a further Java-based GitHub project called DRUMS which extends DRUM with a select feature and was used to store genome codes.
I'm rewriting an application which handles a lot of data (about 100 GB) which is designed as a relational model.
The application is very complex; it is a kind of conversion tool for OpenStreetMap data of huge size (the whole world) and converts it into a map file for our own route planning software. The converter, for example, holds the OpenStreetMap nodes with their coordinates and all of their tags (and a lot more than that, but this should serve as the example for this question).
Current situation:
Because this data is huge, I split it into several files: each file is a map from an ID to an atomic value (let's assume that the list of tags for a node is an atomic value; it is not, but the data storage can treat it as such). So for nodes, I have a file holding the nodes' coordinates, one holding the nodes' names and one holding the nodes' tags, where the nodes are identified by (non-continuous) IDs.
The application was once split into several applications. Each application processes one step of the conversion, so such an application only needs to handle some of the data stored in the files. For example, not all applications need the node's tags, but a lot of them need the node's coordinates. This is why I split the relations into files, one file for each "column".
Each processing step can read a whole file at once into a data structure within RAM. This ensures that lookups can be very efficient (if the data structure is a hash map).
I'm currently rewriting the converter. It should now be one single application, and it should no longer use separate files for each "column". It should rather use some well-known architecture to hold external data in a relational manner, like a database, but much faster.
=> Which library can provide the following features?
Requirements:
It needs to be very fast in iterating over the existing data (while not modifying the set of rows, but some values in the current row).
It needs to provide constant or near-constant lookup, similar to hash maps (while not modifying the whole relation at all).
Most of the column types have a constant size, but in general they do not.
It needs to be able to append new rows to a relation in constant or logarithmic time per row. Live-updating some kind of search index will not be required. Updating (rebuilding) the index can happen after a whole processing step is complete.
Some relations are key-value based, while others are (contiguously indexed) arrays. Both should provide fast lookups.
It should NOT be a separate process, like a DBMS such as MySQL would be. The number of queries will be enormous (around 10 billion) and would completely dominate the performance. However, caching queries would be a possible workaround: iterating over a whole table can be done in a single query, while writing to a table (from which no data will be read in the same processing step) can happen in a batch query. But still, I guess that serializing, inter-process transmitting and de-serializing SQL queries would be the bottleneck.
Nice-to-have: easy to use. It would be very nice if the relations could be used in a way similar to the C++ standard and Qt container classes.
Non-requirements (Why I don't need a DBMS):
Synchronizing writing and reading from/to the same relation. The application is split into multiple processing steps; every step has a set of "input relations" it reads from and a set of "output relations" it writes to. However, some steps require reading some columns of a relation while writing to other columns of the same relation.
Joining relations. There are a few cross-references between different relations; however, they can be resolved within my application if lookups are fast enough.
Persistent storage. Once the conversion is done, none of the data will be required anymore.
The key-value-based relations will never be re-keyed; the array-based relations will never be re-indexed.
I can think of several possible solutions depending on lots of factors that you have not quantified in your question.
If you want a simple store to look things up and you have sufficient disk, SQLite is pretty efficient as a database. Note that there is no SQLite server; the 'server' is linked into your application.
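As a minimal sketch of what in-process use looks like (the table and column names here are made up for illustration), lookups go through a reused prepared statement, so there is no per-query SQL parsing and no inter-process serialization at all:

    #include <sqlite3.h>
    #include <cstdint>
    #include <cstdio>

    int main() {
        sqlite3* db = nullptr;
        sqlite3_open("converter.db", &db);                 // or ":memory:" if it fits in RAM

        sqlite3_exec(db,
            "PRAGMA journal_mode=OFF; PRAGMA synchronous=OFF;"   // persistence is not needed
            "CREATE TABLE IF NOT EXISTS node(id INTEGER PRIMARY KEY, lat REAL, lon REAL);",
            nullptr, nullptr, nullptr);

        sqlite3_stmt* lookup = nullptr;
        sqlite3_prepare_v2(db, "SELECT lat, lon FROM node WHERE id = ?1;", -1, &lookup, nullptr);

        int64_t wanted_id = 42;                            // hypothetical node id
        sqlite3_bind_int64(lookup, 1, wanted_id);
        if (sqlite3_step(lookup) == SQLITE_ROW) {
            std::printf("node %lld: %f %f\n", (long long)wanted_id,
                        sqlite3_column_double(lookup, 0),
                        sqlite3_column_double(lookup, 1));
        }
        sqlite3_reset(lookup);                             // reuse the statement for the next id

        sqlite3_finalize(lookup);
        sqlite3_close(db);
    }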
Personally this job smacks of being embarrassingly parallel. I would think that a small Hadoop cluster would make quick work of the entire job. You could spin it up in AWS, process your data, and shut it down pretty inexpensively.
Is it possible to access a specific record (using a specific index) in a text file without knowing the size of each record?
If you maintain a separate index of record offsets then you can simply consult it for the appropriate location to seek to. Otherwise, no.
If the records happen to be sorted on a convenient key and you can identify where one record ends and another begins, then you can implement a binary or interpolation search. You may be able to add this to your text-file format retrospectively to aid the lookup. Otherwise, you're stuck with serial searches from a position with a known index (obviously the start of the file is one; if you know the total number of records you can also work backwards from the end of the file). You can also consider doing one pass to create an index that allows direct access, or having the file embed a list of offsets that can be easily read.
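The one-pass index build looks roughly like this in C++ (a sketch for newline-delimited records; the function names are mine):

    #include <cstddef>
    #include <istream>
    #include <string>
    #include <vector>

    // Scan the file once, recording the byte offset at which each record starts.
    std::vector<std::streamoff> build_offset_index(std::istream& in) {
        std::vector<std::streamoff> offsets;
        offsets.push_back(0);                  // record 0 starts at byte 0
        std::string line;
        while (std::getline(in, line))
            offsets.push_back(in.tellg());     // start of the next record
        offsets.pop_back();                    // drop the end-of-file position
        return offsets;
    }

    // Direct access afterwards: one seek plus one read.
    std::string read_record(std::istream& in,
                            const std::vector<std::streamoff>& offsets, std::size_t i) {
        in.clear();                            // clear the EOF state left by the index pass
        in.seekg(offsets[i]);
        std::string line;
        std::getline(in, line);
        return line;
    }

    // usage: std::ifstream f("records.txt");
    //        auto idx = build_offset_index(f);
    //        std::string fifth = read_record(f, idx, 4);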
Check out the dbopen() function. If you pass DB_RECNO as the type parameter you can access variable-length records. These records can be delimited by newlines. Essentially your "database" is a flat text file.
The API will conveniently handle inserts and deletes for you as well.
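A rough sketch of what that looks like (this is the classic BSD db(3) 1.85 interface; depending on the platform the header may be <db.h> or <db_185.h>, and you may need to link against a libdb that still provides it):

    #include <db.h>        // dbopen(), DBT, DB_RECNO, recno_t
    #include <fcntl.h>
    #include <cstdio>

    int main() {
        // NULL openinfo -> default RECNO settings: variable-length,
        // newline-delimited records backed by a flat text file.
        DB* db = dbopen("records.txt", O_RDWR | O_CREAT, 0644, DB_RECNO, nullptr);
        if (!db) { std::perror("dbopen"); return 1; }

        recno_t recno = 3;                       // record numbers are 1-based
        DBT key{&recno, sizeof recno};           // { data, size }
        DBT data{};

        switch (db->get(db, &key, &data, 0)) {
        case 0:  std::printf("record 3: %.*s\n", (int)data.size, (char*)data.data); break;
        case 1:  std::printf("record 3 not found\n"); break;
        default: std::perror("get");
        }

        db->del(db, &key, 0);                    // delete record 3; the library does the bookkeeping
        db->close(db);
        return 0;
    }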