Make a persistent copy of a tree (C++) - c++

I am thinking about using this STL-like tree library for C++ http://tree.phi-sci.com/ to store hierarchical data (think organisation chart).
In my case the tree only contains the structure, the 'payload' of each node is stored elsewhere. So it will probably end up as a tree<int> or a tree<simple_class_containing_a_couple_of_ints>
I would like to find the best way to persist the tree. To be more specific I would like to find the best way to persist the tree to a SQL database so it can be loaded back into the application on startup.
So my question is: How can I persist a tree contained in a tree.hh container to a SQL database?
Note: It is not necessary to store it as a tree structure in the database (i.e. no need for nested set, adjacency list). There is no need to query the database as the whole tree will be loaded into memory.
UPDATE:
I have found this class as an alternative to tree.hh here: http://stlplus.sourceforge.net/stlplus3/docs/ntree.html
I cannot comment yet on any performance differences, but it mostly implements what I need and has a persistence class (sorry no link as not enough reputation) that I can dump to a BLOB. I haven't entered this as an answer yet because I am still interested in any alternative solutions.

I would persist each node in one SQL table (one row per node) and perhaps each node -> sibling relation in another table.
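A minimal sketch of the one-row-per-node idea, assuming a hypothetical `Node` struct rather than the tree.hh API (the names `Node`, `Row`, `flatten` and `rebuild` are made up for illustration). Each node becomes a row of (id, parent_id, position), which maps onto a single SQL table and can be turned back into a tree in one pass at startup:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical in-memory node; the payload is just an int id here.
struct Node {
    int id = 0;
    std::vector<std::unique_ptr<Node>> children;
};

// One row per node, as it could be stored in a single SQL table.
struct Row {
    int id;
    int parent_id;  // -1 marks the root (an assumption of this sketch)
    int position;   // index among siblings, preserves child order
};

// Depth-first walk that emits a parent's row before its children's rows.
void flatten(const Node& node, int parent_id, int position, std::vector<Row>& out) {
    out.push_back({node.id, parent_id, position});
    for (std::size_t i = 0; i < node.children.size(); ++i) {
        flatten(*node.children[i], node.id, static_cast<int>(i), out);
    }
}

// Rebuild the tree from rows. Assumes parents come before their children
// and siblings arrive in position order, which is what flatten() produces.
std::unique_ptr<Node> rebuild(const std::vector<Row>& rows) {
    std::unique_ptr<Node> root;
    std::map<int, Node*> by_id;
    for (const Row& row : rows) {
        auto node = std::make_unique<Node>();
        node->id = row.id;
        Node* raw = node.get();
        if (row.parent_id == -1) {
            root = std::move(node);
        } else {
            const auto parent = by_id.find(row.parent_id);
            assert(parent != by_id.end());  // parent row must precede child row
            parent->second->children.push_back(std::move(node));
        }
        by_id[row.id] = raw;
    }
    return root;
}
```

When loading from a database, an `ORDER BY` that returns parents before children (or a second pass that links nodes after all rows are read) would be needed to satisfy the ordering assumption.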
I am not sure SQL is the best way to persist. You could consider using JSON.

Related

Should I use a relational database or write my own search tree

Basically my whole career is based on reading questions here, but now I'm stuck since I don't even know how to ask this correctly.
I'm designing a SQLite database which is meant for constructing data sheets out of existing data sheets. People like reusing stuff, and I want to manage this with a DB and an interface. A data sheet has reusable elements like pictures, text, formulas, sections, lists, front pages and variables. Sections can contain elements; this can be handled with recursive CTEs (thanks to "mu is too short" for that hint). Texts, formulas, lists etc. can contain variables. In the end I want to be able to manage variables, which must be unique per data sheet, and manage elements, which form the ordered list that makes up a data sheet. So when selecting a data sheet I must know which elements it contains and which variables are used within those elements. I must be able to create a new data sheet by reusing elements and/or creating new ones if desired.
So far I have (see also the link to the screenshot at the bottom):
a list of variables, several of which can be contained in an element
a list of elements, which make up the data sheets
a list of data sheets
Reading examples like
Store array in SQLite that is referenced in another table
How to store a list in a column of a database table
already gives me helpful hints, for example that for each data sheet I need to create a new atomic list containing its elements and their positions. The same goes for the variables referenced by each element. But the trouble starts when I want to keep it consistent, and with how to actually query it.
How do I connect the variables contained within elements with the elements contained within the data sheets? When an element or variable is modified, how do I find out which data sheets need to be recompiled because they use that variable or element?
The more I think about this, the more it sounds like I need to write my own search tree based on an object-oriented class hierarchy and should not use a database. Can somebody convince me that a database is the right tool for my problem?
I learned about databases once, but that was quite some time ago, and to be honest the university lectures were not good: we never created a database on our own but only worked on existing ones.
To be more specific, my knowledge has led me to the solution below, but I don't know how to correctly query for the list of data sheets affected when the content of one value changes, since the reference is a text containing the name of a table:
screen shot since I'm a greenhorn
Update:
I think I have to search for unique connections, so it would end up in many-to-many tables. Not perfectly happy with it but I think I can go on with it.
Still a greenhorn: how do you guys get correct syntax highlighting for SQL?

Data Storage and Retrieval in Relational Database

I'm starting a project: a mini database system, basically a small database like MySQL. I'm planning to use C++. I read several articles and understood that tables will be stored in and retrieved from files. Further, I need to use B+ trees for accessing and updating the data.
Can someone explain, with an example, how data is actually stored inside the files?
For example, I have a database "test" with a table "student" in it:
student(id, name, grade, class), with some student entries. How will the entries of this table be stored inside a file? Will they be stored in a single file or divided across several files, and if the latter, how?
A B+Tree on disk is a bunch of fixed-length blocks. Your program will read/write whole blocks.
Within a block, there are a variable number of records. Those are arranged by some mechanism of your choosing, and need to be ordered in some way.
"Leaf nodes" contain the actual data. In "non-leaf nodes", the "records" contain pointers to child nodes; this is the way BTrees work.
B+Trees have the additional links (and maintenance hassle) of chaining blocks at the same level.
Wikipedia has some good discussions.
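As a rough illustration of "fixed-length blocks read and written as a whole", here is a minimal sketch for the student table from the question. The 4096-byte block size, the record layout and all names (`Block`, `LeafHeader`, `leaf_append`, `write_block`, ...) are assumptions for the example. It dumps the in-memory layout as-is, so the file is not portable across compilers or architectures, and it only covers leaf blocks:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fstream>

constexpr std::size_t kBlockSize = 4096;         // assumed block size
constexpr std::uint32_t kNoBlock = 0xFFFFFFFFu;  // "no sibling" marker

using Block = std::array<char, kBlockSize>;

// Fixed-length row of student(id, name, grade, class).
struct StudentRecord {
    std::uint32_t id;
    char name[32];
    std::uint8_t grade;
    std::uint8_t class_no;
};

// Stored at the start of every leaf block.
struct LeafHeader {
    std::uint32_t record_count;
    std::uint32_t next_leaf;  // block number of the right sibling leaf
};

constexpr std::size_t kRecordsPerLeaf =
    (kBlockSize - sizeof(LeafHeader)) / sizeof(StudentRecord);

// Create an empty leaf block with no sibling.
Block make_leaf() {
    Block block{};
    const LeafHeader header{0, kNoBlock};
    std::memcpy(block.data(), &header, sizeof header);
    return block;
}

// Append a record to a leaf; returns false when the block is full.
// A real B+Tree would insert in key order and split a full block.
bool leaf_append(Block& block, const StudentRecord& record) {
    LeafHeader header;
    std::memcpy(&header, block.data(), sizeof header);
    if (header.record_count >= kRecordsPerLeaf) return false;
    std::memcpy(block.data() + sizeof header + header.record_count * sizeof record,
                &record, sizeof record);
    ++header.record_count;
    std::memcpy(block.data(), &header, sizeof header);
    return true;
}

// Read the record stored in slot `index` of a leaf.
StudentRecord leaf_get(const Block& block, std::uint32_t index) {
    LeafHeader header;
    std::memcpy(&header, block.data(), sizeof header);
    assert(index < header.record_count);
    StudentRecord record;
    std::memcpy(&record, block.data() + sizeof header + index * sizeof record,
                sizeof record);
    return record;
}

// Whole-block I/O: block number N lives at byte offset N * kBlockSize.
bool write_block(std::fstream& file, std::uint32_t block_no, const Block& block) {
    file.seekp(static_cast<std::streamoff>(block_no) *
               static_cast<std::streamoff>(kBlockSize));
    file.write(block.data(), static_cast<std::streamsize>(block.size()));
    return static_cast<bool>(file.flush());
}

bool read_block(std::fstream& file, std::uint32_t block_no, Block& block) {
    file.seekg(static_cast<std::streamoff>(block_no) *
               static_cast<std::streamoff>(kBlockSize));
    file.read(block.data(), static_cast<std::streamsize>(block.size()));
    return static_cast<bool>(file);
}
```

Non-leaf blocks would use the same block I/O but hold (key, child block number) pairs instead of full records.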

Random file access of std::map data in C++

I have a std::map data structure
key:data pair
which I need to store in a binary file.
key is an unsigned short value, and is not sequential
data is another big structure, but is of fixed size.
This map is managed based on some user actions of add, modify or delete. And I have to keep the file updated every time I update the map. This is to survive a system crash scenario.
Adding can always be done at the end of the file. But, user can modify or delete any of the existing records.
That means I have to randomly access the file to update that modified/deleted record.
My questions are:
Is there a way I can reach the modified record in the file directly, without sequentially searching through all the records? (Max record size is 5000)
On a delete, how do I remove the record from the file and move the next record to the deleted record's position?
Appreciate your help!
Assuming you have no need for the tree structure of std::map and just need an associative container, the most common way I've seen to do this is to have two files: one with the keys and one with the data. The key file contains all of the keys along with the offset of each key's data in the data file. Since you said the data is all of the same size, updating should be easy (it won't change any of the offsets). Adding is done by appending. Deleting is the only hard part: you can delete the key to remove the record from the database, but it's up to you whether you want to keep track of "freed" data sections and try to write over them. To keep track of the keys, you might want another associative container (map or unordered_map) in memory with the location of each key in the key file.
Edit: For example, the key file might be (note that offsets are in bytes)
key1:0
key2:5
and the corresponding data file would be
data1data2
This is a pretty tried and true pattern, used in everything from Hadoop to high-speed local databases. To get an idea of the persistence complications you might have to consider, I would highly recommend reading this Redis blog; it taught me a lot about persistence when I was dealing with similar issues.
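A minimal sketch of that pattern, assuming a 64-byte stand-in `Record` and a hypothetical `RecordStore` class (neither comes from the question). The in-memory map gives direct access to a record's slot, updates overwrite in place, and a delete only forgets the key and remembers the slot for reuse, so no records have to be shifted. To stay short, the sketch always starts with a fresh data file and only writes the key file on request; reloading both after a crash is left out:

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <map>
#include <string>
#include <vector>

// Stand-in for the question's fixed-size "big structure".
struct Record {
    char payload[64];
};

class RecordStore {
public:
    // Opens a fresh data file (truncating any old one) to keep the sketch short.
    explicit RecordStore(const std::string& data_path)
        : data_(data_path, std::ios::in | std::ios::out | std::ios::binary | std::ios::trunc) {
        assert(data_.is_open() && "data file could not be opened");
    }

    // Add a new key or overwrite an existing one in place.
    bool put(std::uint16_t key, const Record& record) {
        const auto found = index_.find(key);
        const bool is_new = found == index_.end();
        const bool reuse = is_new && !free_slots_.empty();
        std::uint32_t slot = next_slot_;  // default: append at the end
        if (!is_new) slot = found->second;
        else if (reuse) slot = free_slots_.back();
        data_.seekp(offset_of(slot));
        data_.write(reinterpret_cast<const char*>(&record), sizeof(Record));
        data_.flush();
        if (!data_) return false;
        if (is_new) {
            if (reuse) free_slots_.pop_back(); else ++next_slot_;
            index_[key] = slot;
        }
        return true;
    }

    // Direct access: one seek, no sequential search.
    bool get(std::uint16_t key, Record& out) {
        const auto found = index_.find(key);
        if (found == index_.end()) return false;
        data_.seekg(offset_of(found->second));
        data_.read(reinterpret_cast<char*>(&out), sizeof(Record));
        return static_cast<bool>(data_);
    }

    // Deleting only forgets the key; the slot is kept for reuse.
    bool erase(std::uint16_t key) {
        const auto found = index_.find(key);
        if (found == index_.end()) return false;
        free_slots_.push_back(found->second);
        index_.erase(found);
        return true;
    }

    // Write the key file: one (key, byte offset) pair per live record.
    bool save_index(const std::string& key_path) const {
        std::ofstream out(key_path, std::ios::binary | std::ios::trunc);
        for (const auto& entry : index_) {
            const std::uint64_t offset =
                static_cast<std::uint64_t>(entry.second) * sizeof(Record);
            out.write(reinterpret_cast<const char*>(&entry.first), sizeof entry.first);
            out.write(reinterpret_cast<const char*>(&offset), sizeof offset);
        }
        return static_cast<bool>(out.flush());
    }

    std::uint32_t slots_in_file() const { return next_slot_; }

private:
    static std::streamoff offset_of(std::uint32_t slot) {
        return static_cast<std::streamoff>(slot) *
               static_cast<std::streamoff>(sizeof(Record));
    }

    std::fstream data_;
    std::map<std::uint16_t, std::uint32_t> index_;  // key -> slot number
    std::vector<std::uint32_t> free_slots_;         // slots of deleted records
    std::uint32_t next_slot_ = 0;
};
```

For real crash safety the key file would have to be updated (and synced to disk) together with each change to the data file; `flush()` alone only hands the bytes to the operating system.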

Non-permanent huge external data storage in C++ application

I'm rewriting an application which handles a lot of data (about 100 GB) which is designed as a relational model.
The application is very complex; it is a kind of conversion tool for OpenStreetMap data of huge size (the whole world), and it converts that data into a map file for our own route planning software. The converter, for example, holds the nodes of the OpenStreetMap data with their coordinates and all their tags (and a lot more than that, but this should serve as an example for this question).
Current situation:
Because this data is huge, I split it into several files: each file is a map from an ID to an atomic value (let's assume that the list of tags for a node is an atomic value; it is not, but the data storage can treat it as such). So for nodes, I have one file holding the node's coords, one holding the node's name and one holding the node's tags, where the nodes are identified by (non-contiguous) IDs.
The application once was split into several applications. Each application processes one step of the conversion. Therefore, such an application only needs to handle some of the data stored in the files. For example, not all applications need the node's tags, but a lot of them need the node's coords. This is why I split the relations into files, one file for each "column".
Each processing step can read a whole file at once into a data structure within RAM. This ensures that lookups can be very efficient (if the data structure is a hash map).
I'm currently rewriting the converter. It should now be one single application. And it should now not use separated files for each "column". It should rather use some well-known architecture to hold external data in a relational manner, like a database, but much faster.
=> Which library can provide the following features?
Requirements:
It needs to be very fast in iterating over the existing data (while not modifying the set of rows, but some values in the current row).
It needs to provide constant or near-constant lookup, similar to hash maps (while not modifying the whole relation at all).
Most of the types of the columns are constantly sized, but in general they are not.
It needs to be able to append new rows to a relation in constant or logarithmic time per row. Live-updating some kind of search index will not be required. Updating (rebuilding) the index can happen after a whole processing step is complete.
Some relations are key-value-based, while others are a (contiguously indexed) array. Both of them should provide fast lookups.
It should NOT be a separate process, as a DBMS like MySQL would be. The number of queries will be enormous (around 10 billion) and will be the bottleneck of the performance. However, caching queries would be a possible workaround: iterating over a whole table can be done in a single query, while writing to a table (from which no data will be read in the same processing step) can happen in a batch query. But still: I guess that serializing, inter-process transmitting and deserializing SQL queries will be the bottleneck.
Nice-to-have: easy to use. It would be very nice if the relations could be used in a similar way to the C++ standard and Qt container classes.
Non-requirements (Why I don't need a DBMS):
Synchronizing writing and reading from/to the same relation. The application is split into multiple processing steps; every step has a set of "input relations" it reads from and "output relations" it writes into. However, some steps require to read some columns of a relation while writing in other columns of the same relation.
Joining relations. There are a few cross-references between different relations, however, they can be resolved within my application if lookup is fast enough.
Persistent storage. Once the conversion is done, all the data will not be required anymore.
The key-value-based relations will never be re-keyed; the array-based relations will never be re-indexed.
I can think of several possible solutions depending on lots of factors that you have not quantified in your question.
If you want a simple store to look things up and you have sufficient disk, SQLite is pretty efficient as a database. Note that there is no SQLite server, the 'server' is linked into your application.
Personally this job smacks of being embarrassingly parallel. I would think that a small Hadoop cluster would make quick work of the entire job. You could spin it up in AWS, process your data, and shut it down pretty inexpensively.

Migrating Application Configuration from Windows Registry to SQLite

Currently I am working on the migration mentioned in the title. The problem is that the application configuration kept in the registry has a tree-like structure, for example:
X
|->Y
   |->Z
      |->SomeKey someValue
W
|->AnotherKey anotherValue
and so on.
How can I model this structure in SQLite (or any other DB)? If you have experience with similar problems, please share it. Thanks in advance.
Baris, this structure is similar to a directory/file structure.
You can model this with a simple parent<>child relationship on the directories, and key-value pairs related to the directory.
Something like
CREATE TABLE directory (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    name      TEXT NOT NULL,
    parent_id INTEGER NOT NULL DEFAULT 0  -- 0 marks a root directory
);
CREATE TABLE property (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    key          TEXT,
    value        TEXT,
    directory_id INTEGER NOT NULL
);
With this you can find the root directories by searching for directories with parent_id=0, the child directories with WHERE parent_id=someid, and the properties of a directory with directory_id=someid.
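On the application side, the rows of the directory table can be indexed by id and walked upwards to recover a full path. This is only a sketch: the `Directory` struct mirrors the columns above, and the function names (`index_by_id`, `children_of`, `full_path`) are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One row of the directory table described above.
struct Directory {
    int id;
    std::string name;
    int parent_id;  // 0 means "no parent", i.e. a root directory
};

// Index the rows by id, as they might arrive from a SELECT on the table.
std::map<int, Directory> index_by_id(const std::vector<Directory>& rows) {
    std::map<int, Directory> dirs;
    for (const Directory& row : rows) {
        const bool inserted = dirs.emplace(row.id, row).second;
        assert(inserted && "ids must be unique");
        (void)inserted;  // silence unused-variable warnings in release builds
    }
    return dirs;
}

// In-memory equivalent of WHERE parent_id = ?, in ascending id order.
std::vector<int> children_of(const std::map<int, Directory>& dirs, int parent_id) {
    std::vector<int> ids;
    for (const auto& entry : dirs) {
        if (entry.second.parent_id == parent_id) ids.push_back(entry.first);
    }
    return ids;
}

// Walk the parent links up to a root and build a path such as "X/Y/Z".
// Returns an empty string for an unknown id, a dangling parent_id or a cycle.
std::string full_path(const std::map<int, Directory>& dirs, int id) {
    std::string path;
    std::size_t steps = 0;
    while (id != 0) {
        const auto found = dirs.find(id);
        if (found == dirs.end() || ++steps > dirs.size()) return {};
        path = path.empty() ? found->second.name : found->second.name + "/" + path;
        id = found->second.parent_id;
    }
    return path;
}
```

With the example from the question, the row for Z would resolve to "X/Y/Z", and its properties would then be fetched by directory_id.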
Hope this helps :)
Representing hierarchies in a relational database is pretty easy. You just use self-references. For example, you have a category table that has a field called ParentCategoryId that is either null (for top-level categories) or the id of the parent category. You can even set up foreign keys to enforce valid relationships. This kind of structure is easy to traverse in code, usually via recursion, but a pain when it comes to writing SQL queries.
One way around this for a registry clone is to use the registry key path as the key. That is, have an entry where Path is "X/Y/Z/SomeKey" and Value is "someValue". This is easier to query but may not express the hierarchy the way you might like. That is, you will only have the values and not the overall structure of the hierarchy.
The bottom line is you have to compromise to map a hierarchy with an unknown number of levels onto a relational database structure.
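The path-as-key variant can be sketched in memory with an ordered map, where everything below a key path is one contiguous range. `Config`, `Entry` and `entries_under` are hypothetical names, and "/" as the separator is taken from the example path above; in SQL the same lookup would be a prefix match on the path column.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Config = std::map<std::string, std::string>;  // full path -> value
using Entry = std::pair<std::string, std::string>;

// Return every (path, value) entry below `dir`, e.g. "X/Y".
// Because std::map is sorted, all paths starting with "X/Y/" are adjacent.
std::vector<Entry> entries_under(const Config& config, const std::string& dir) {
    assert(!dir.empty() && "iterate the map itself to list everything");
    const std::string prefix = dir + "/";
    std::vector<Entry> result;
    for (auto it = config.lower_bound(prefix);
         it != config.end() && it->first.compare(0, prefix.size(), prefix) == 0;
         ++it) {
        result.emplace_back(it->first, it->second);
    }
    return result;
}
```

As the answer notes, this only yields values: a key with no values below it leaves no trace in the map.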
Self-referencing tables, as mentioned by previous posters, are nice when you want to store a hierarchy; they become less attractive when you start selecting leaves of the tree.
Could you clarify the use-case of retrieving data from the configuration?
Are you going to load the whole configuration at once or to retrieve each parameter separately?
How arbitrary will the depth of the leaf nodes be?