Data Storage and Retrieval in Relational Database - c++

I'm starting a project: a mini database system, basically a small database like MySQL. I'm planning to use C++. I read several articles and understood that tables are stored in and retrieved from files, and that I need to use B+ trees for accessing and updating the data.
Can someone explain, with an example, how the data is actually stored inside the files?
For example, I have a database "test" with a table "student" in it:
student(id,name,grade,class) with some student entries. How are the entries of this table stored inside the file? Are they stored in a single file or divided across several files, and if the latter, how?

A B+Tree on disk is a bunch of fixed-length blocks. Your program will read/write whole blocks.
Within a block, there are a variable number of records. Those are arranged by some mechanism of your choosing, and need to be ordered in some way.
"Leaf nodes" contain the actual data. In "non-leaf nodes", the "records" contain pointers to child nodes; this is the way BTrees work.
B+Trees have the additional links (and maintenance hassle) of chaining blocks at the same level.
Wikipedia has some good discussions.
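To make the block idea concrete, here is a minimal sketch of a single leaf block for the student table from the question. The 4096-byte block size, the field widths, and all names (StudentRecord, LeafPage, insert_sorted, and so on) are assumptions chosen for illustration; non-leaf blocks, splitting, and the rest of the tree are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>
#include <type_traits>

// Hypothetical fixed-length row for student(id, name, grade, class).
struct StudentRecord {
    std::uint32_t id;
    char name[32];
    std::uint8_t grade;
    std::uint8_t class_no;
};

constexpr std::size_t kPageSize = 4096;  // assumed block size
constexpr std::size_t kCapacity =
    (kPageSize - 2 * sizeof(std::uint32_t)) / sizeof(StudentRecord);

// One leaf block: a small header, then records kept sorted by id.
struct LeafPage {
    std::uint32_t count = 0;      // records currently in use
    std::uint32_t next_leaf = 0;  // block number of the right sibling, 0 = none
    StudentRecord records[kCapacity] = {};
};
static_assert(sizeof(LeafPage) <= kPageSize, "page must fit in one block");
static_assert(std::is_trivially_copyable<LeafPage>::value, "page is copied as raw bytes");

StudentRecord make_student(std::uint32_t id, const std::string& name,
                           std::uint8_t grade, std::uint8_t class_no) {
    StudentRecord r{};  // zero-filled, so name stays null-terminated
    r.id = id;
    std::strncpy(r.name, name.c_str(), sizeof(r.name) - 1);
    r.grade = grade;
    r.class_no = class_no;
    return r;
}

// Position of the first record whose id is not less than the given id.
const StudentRecord* lower(const LeafPage& page, std::uint32_t id) {
    return std::lower_bound(page.records, page.records + page.count, id,
        [](const StudentRecord& r, std::uint32_t key) { return r.id < key; });
}

// Inserts in id order; returns false when the block is full (a real tree would split).
bool insert_sorted(LeafPage& page, const StudentRecord& rec) {
    if (page.count == kCapacity) return false;
    std::size_t pos = static_cast<std::size_t>(lower(page, rec.id) - page.records);
    std::copy_backward(page.records + pos, page.records + page.count,
                       page.records + page.count + 1);
    page.records[pos] = rec;
    ++page.count;
    return true;
}

// Binary search inside one block; nullptr when the id is absent.
const StudentRecord* find(const LeafPage& page, std::uint32_t id) {
    const StudentRecord* it = lower(page, id);
    return (it != page.records + page.count && it->id == id) ? it : nullptr;
}

// Always writes a whole block at offset page_no * kPageSize.
void write_page(std::fstream& f, std::uint32_t page_no, const LeafPage& page) {
    char buf[kPageSize] = {};
    std::memcpy(buf, &page, sizeof(LeafPage));
    f.seekp(static_cast<std::streamoff>(page_no) * static_cast<std::streamoff>(kPageSize));
    f.write(buf, kPageSize);
    f.flush();
}

// Always reads a whole block back.
LeafPage read_page(std::fstream& f, std::uint32_t page_no) {
    char buf[kPageSize] = {};
    f.seekg(static_cast<std::streamoff>(page_no) * static_cast<std::streamoff>(kPageSize));
    f.read(buf, kPageSize);
    assert(f.gcount() == static_cast<std::streamsize>(kPageSize));
    LeafPage page;
    std::memcpy(&page, buf, sizeof(LeafPage));
    return page;
}
```

With this layout a table can live in a single file that is simply a sequence of blocks; copying the struct as raw bytes ties the file to one compiler and machine, which is usually acceptable for a learning project.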

Related

Should I use a relational database or write my own search tree

Basically my whole career is based on reading questions here, but now I'm stuck since I don't even know how to ask this correctly.
I'm designing a SQLite database which is meant for constructing data sheets out of existing data sheets. People like reusing stuff and I want to manage this with a DB and an interface. A data sheet has reusable elements like pictures, text, formulas, sections, lists, front pages and variables. Sections can contain elements; this can be handled with recursive CTEs (thanks to "mu is too short" for that hint). Texts, formulas, lists etc. can contain variables. In the end I want to be able to manage variables, which must be unique per data sheet, and manage elements, which form an ordered list making up the data sheet. So when selecting a data sheet I must know which elements it contains and which variables are used within those elements. I must be able to create a new data sheet by reusing elements and/or creating new ones if desired.
So far I have come up with the following (see also the link to the screenshot at the bottom):
a list of variables
which (several of them) can be contained in elements
a list of elements
elements make up the
a list of data sheets
Reading examples like
Store array in SQLite that is referenced in another table
How to store a list in a column of a database table
already gives me helpful hints, for example that I need to create, for each data sheet, a new atomic list containing the elements and their positions. The same goes for the variables referenced by each element. But the trouble starts when I want to keep it consistent and actually query it.
How do I connect the variables that are contained within elements and the elements that are contained within the data sheets? When one element or variable is modified, how do I check which data sheets need to be recompiled because they use the same variables and/or elements?
The more I think about this, the more it sounds like I need to write my own search tree based on an object-oriented inheritance class structure and should not use a database. Can somebody convince me that a database is the right tool for my issue?
I learned about databases once, but that was quite some time ago, and to be honest the university lectures were not good, since we never created a database of our own but only worked on existing ones.
To be more specific, my knowledge leads to this solution so far without knowing how to correctly query for a list of data sheets when changing the content of one value since the reference is a text containing the name of a table:
screen shot since I'm a greenhorn
Update:
I think I have to search for unique connections, so it would end up in many-to-many tables. Not perfectly happy with it but I think I can go on with it.
Still a greenhorn: how do you guys get correct syntax highlighting for SQL?

Random file access of stl::map data in C++

I have a std::map data structure
key:data pair
which I need to store in a binary file.
key is an unsigned short value, and is not sequential
data is another big structure, but is of fixed size.
This map is managed based on some user actions of add, modify or delete. And I have to keep the file updated every time I update the map. This is to survive a system crash scenario.
Adding can always be done at the end of the file. But, user can modify or delete any of the existing records.
That means I have to randomly access the file to update that modified/deleted record.
My questions are:
Is there a way to reach the modified record in the file directly, without sequentially searching through all the records? (Max record size is 5000)
On a delete, how do I remove it from the file and move the next record to the deleted record's position ?
Appreciate your help!
Assuming you have no need for the tree structure of std::map and you just need an associative container, the most common way I've seen to do this is to have two files: one with the keys and one with the data. The key file contains all of the keys along with the corresponding offset of their data in the data file. Since you said the data is all of the same size, updating should be easy to do (since it won't change any of the offsets). Adding is done by appending. Deleting is the only hard part; you can delete the key to remove it from the database, but it's up to you whether you want to keep track of "freed" data sections and try to write over them. To keep track of the keys, you might want another associative container (map or unordered_map) in memory with the location of the keys in the key file.
Edit: For example, the key file might be (note that offsets are in bytes)
key1:0
key2:5
and the corresponding data file would be
data1data2
This is a pretty tried and true pattern, used in everything from Hadoop to high-speed local databases. To get an idea of the persistence complications you might need to consider, I would highly recommend reading this Redis blog; it taught me a lot about persistence when I was dealing with similar issues.
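As a rough sketch of the data-file half of this pattern, the class below keeps an in-memory index from key to slot and uses seekp/seekg to update fixed-size records in place, reusing freed slots after a delete. Payload, RecordFile, and the member names are hypothetical; the sketch starts from an empty file and does not persist the key file, which a crash-safe version would also have to write.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical fixed-size payload standing in for the "big structure".
struct Payload {
    char text[60];
    std::int32_t value;
};

class RecordFile {
public:
    // Assumption: always starts from an empty data file (trunc).
    explicit RecordFile(const std::string& path)
        : file_(path, std::ios::in | std::ios::out | std::ios::binary | std::ios::trunc) {
        assert(file_.is_open());
    }

    // Insert or overwrite: existing keys keep their slot, new keys take a
    // freed slot if one exists, otherwise they are appended at the end.
    void put(std::uint16_t key, const Payload& data) {
        std::uint32_t slot;
        auto it = slot_of_.find(key);
        if (it != slot_of_.end()) {
            slot = it->second;
        } else if (!free_slots_.empty()) {
            slot = free_slots_.back();
            free_slots_.pop_back();
        } else {
            slot = slot_count_++;
        }
        slot_of_[key] = slot;
        file_.seekp(offset_of(slot));
        file_.write(reinterpret_cast<const char*>(&data), sizeof(Payload));
        file_.flush();  // push the change out of the stream buffer right away
    }

    // Direct read: one seek, no sequential scan.
    bool get(std::uint16_t key, Payload& out) {
        auto it = slot_of_.find(key);
        if (it == slot_of_.end()) return false;
        file_.seekg(offset_of(it->second));
        file_.read(reinterpret_cast<char*>(&out), sizeof(Payload));
        return static_cast<bool>(file_);
    }

    // Delete only drops the key and remembers the slot for reuse;
    // no later records are moved.
    bool erase(std::uint16_t key) {
        auto it = slot_of_.find(key);
        if (it == slot_of_.end()) return false;
        free_slots_.push_back(it->second);
        slot_of_.erase(it);
        return true;
    }

    std::uint32_t slot_count() const { return slot_count_; }

private:
    static std::streamoff offset_of(std::uint32_t slot) {
        return static_cast<std::streamoff>(slot) * static_cast<std::streamoff>(sizeof(Payload));
    }

    std::fstream file_;
    std::unordered_map<std::uint16_t, std::uint32_t> slot_of_;  // key -> slot in data file
    std::vector<std::uint32_t> free_slots_;                     // slots left by deletes
    std::uint32_t slot_count_ = 0;                              // slots ever allocated
};
```

Because every record has the same size, the byte offset is just the slot number times sizeof(Payload), so neither updates nor deletes require shifting the records that follow.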

How are documents retrieved after reduce produces the output?

So, after reduce completes its job we have data stored in the files something like this:
But what happens when the user types something? How is search performed when the data is stored just in files?
MapReduce is for processing. Once you have processed the data and generated your aggregate information, which is on HDFS, you either have to read the file in some program to display it to the user, or use one of several alternative options for reading the data from HDFS:
You could use Hive and create a table on top of this data and read the data using SQL like queries. A simple web application can connect to this using the thrift server which provides a JDBC interface to hive.
Other options include loading the data into HBase, Shark etc. It all depends on your use case in terms of the size of the aggregated data and the performance requirements.
What you have constructed after MapReduce is an inverted index, a nice little data structure. Now you have to use it.
For example, in the case of Google, this inverted index is sharded across many servers, and each server stores the entire list for each word assigned to it. So, for example, server 500 has the list for "be", and another has the list for "to". These are implementation details; you could theoretically store it on one box in a large hash if you could hold the index in memory.
When the customer types words into the engine, it retrieves the entire list for each word. If there are multiple words, it intersects those lists to show you the documents that contain all of the words.
Here is the source for the full paper on how they did it http://infolab.stanford.edu/~backrub/google.html
See "Figure 4. Google Query Evaluation"
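The lookup-and-intersect step described above can be sketched on a single machine as follows. The in-memory hash map, the integer document IDs, and the names InvertedIndex and search are assumptions for illustration only; they are not how Google or Hadoop store the index.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

// Each word maps to the sorted list of document IDs that contain it.
using PostingList = std::vector<int>;
using InvertedIndex = std::unordered_map<std::string, PostingList>;

// Returns the IDs of documents that contain every query word.
std::vector<int> search(const InvertedIndex& index,
                        const std::vector<std::string>& words) {
    std::vector<int> result;
    bool first = true;
    for (const std::string& word : words) {
        auto it = index.find(word);
        if (it == index.end()) return {};  // an unknown word matches nothing
        const PostingList& list = it->second;
        assert(std::is_sorted(list.begin(), list.end()));
        if (first) {
            result = list;  // start with the whole list of the first word
            first = false;
            continue;
        }
        // Keep only the IDs that also appear in this word's list.
        std::vector<int> next;
        std::set_intersection(result.begin(), result.end(),
                              list.begin(), list.end(),
                              std::back_inserter(next));
        result.swap(next);
    }
    return result;
}
```

std::set_intersection requires both inputs to be sorted, which is why the posting lists are assumed to be stored in document-ID order.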

Non-permanent huge external data storage in C++ application

I'm rewriting an application which handles a lot of data (about 100 GB) which is designed as a relational model.
The application is very complex; it is a kind of conversion tool for OpenStreetMap data of huge size (the whole world) that converts it into a map file for our own route planning software. The converter application, for example, holds the nodes of the OpenStreetMap data with their coordinates and all their tags (and a lot more than that, but this should serve as an example in this question).
Current situation:
Because this data is very huge, I split it into several files: Each file is a map from an ID to an atomic value (let's assume that the list of tags for a node is an atomic value; it is not but the data storage can treat it as such). So for nodes, I have a file holding the node's coords, one holding the node's name and one holding the node's tags, where the nodes are identified by (non-continuous) IDs.
The application once was split into several applications. Each application processes one step of the conversion. Therefore, such an application only needs to handle some of the data stored in the files. For example, not all applications need the node's tags, but a lot of them need the node's coords. This is why I split the relations into files, one file for each "column".
Each processing step can read a whole file at once into a data structure within RAM. This ensures that lookups can be very efficient (if the data structure is a hash map).
I'm currently rewriting the converter. It should now be one single application. And it should now not use separated files for each "column". It should rather use some well-known architecture to hold external data in a relational manner, like a database, but much faster.
=> Which library can provide the following features?
Requirements:
It needs to be very fast in iterating over the existing data (while not modifying the set of rows, but some values in the current row).
It needs to provide constant or near-constant lookup, similar to hash maps (while not modifying the whole relation at all).
Most of the types of the columns are constantly sized, but in general they are not.
It needs to be able to append new rows to a relation in constant or logarithmic time per row. Live-updating some kind of search index will not be required. Updating (rebuilding) the index can happen after a whole processing step is complete.
Some relations are key-value-based, while others are an (continuously indexed) array. Both of them should provide fast lookups.
It should NOT be a separate process, as a DBMS like MySQL would be. The number of queries will be enormous (around 10 billion) and will be the main performance bottleneck. However, batching queries would be a possible workaround: iterating over a whole table can be done in a single query, while writing to a table (from which no data will be read in the same processing step) can happen in a batch query. But still: I guess that serializing, inter-process transmitting and deserializing SQL queries will be the bottleneck.
Nice-to-have: easy to use. It would be very nice if the relations could be used in a similar way to the C++ standard and Qt container classes.
Non-requirements (Why I don't need a DBMS):
Synchronizing writing and reading from/to the same relation. The application is split into multiple processing steps; every step has a set of "input relations" it reads from and "output relations" it writes into. However, some steps require to read some columns of a relation while writing in other columns of the same relation.
Joining relations. There are a few cross-references between different relations, however, they can be resolved within my application if lookup is fast enough.
Persistent storage. Once the conversion is done, all the data will not be required anymore.
The key-value-based relations will never be re-keyed; the array-based relations will never be re-indexed.
I can think of several possible solutions depending on lots of factors that you have not quantified in your question.
If you want a simple store to look things up and you have sufficient disk, SQLite is pretty efficient as a database. Note that there is no SQLite server, the 'server' is linked into your application.
Personally this job smacks of being embarrassingly parallel. I would think that a small Hadoop cluster would make quick work of the entire job. You could spin it up in AWS, process your data, and shut it down pretty inexpensively.

How do I join huge csv files (1000's of columns x 1000's rows) efficiently using C/C++?

I have several (1-5) very wide (~50,000 columns) .csv files. The files are (.5GB-1GB) in size (avg. size around 500MB). I need to perform a join on the files on a pre-specified column. Efficiency is, of course, the key. Any solutions that can be scaled out to efficiently allow multiple join columns is a bonus, though not currently required. Here are my inputs:
-Primary File
-Secondary File(s)
-Join column of Primary File (name or col. position)
-Join column of Secondary File (name or col. position)
-Left Join or Inner Join?
Output = 1 File with results of the multi-file join
I am looking to solve the problem using a C-based language, but of course an algorithmic solution would also be very helpful.
Assuming that you have a good reason not to use a database (for all I know, the 50,000 columns may constitute such a reason), you probably have no choice but to clench your teeth and build yourself an index for the right file. Read through it sequentially to populate a hash table where each entry contains just the key column and an offset in the file where the entire row begins. The index itself then ought to fit comfortably in memory, and if you have enough address space (i.e. unless you're stuck with 32-bit addressing) you should memory-map the actual file data so you can access and output the appropriate right rows easily as you walk sequentially through the left file.
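A minimal sketch of this index-then-probe approach is shown below. It uses seekg on the right file instead of memory mapping (which is platform-specific), splits fields on plain commas without any quote handling, and keeps only the first row for a repeated key; field, build_index, and inner_join are hypothetical names, and only the inner join is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <unordered_map>

// Returns the col-th comma-separated field of a line, or "" if it is missing.
// Assumption: no quoted fields or embedded commas.
std::string field(const std::string& line, std::size_t col) {
    std::size_t start = 0;
    for (std::size_t i = 0; i < col; ++i) {
        start = line.find(',', start);
        if (start == std::string::npos) return "";
        ++start;
    }
    std::size_t end = line.find(',', start);
    return line.substr(start, end == std::string::npos ? std::string::npos : end - start);
}

// Join key -> byte offset at which that row starts in the right file.
using OffsetIndex = std::unordered_map<std::string, std::streamoff>;

// One sequential pass over the right file; only keys and offsets stay in memory.
// The file should be opened in binary mode so the offsets are plain byte counts.
OffsetIndex build_index(std::ifstream& right, std::size_t key_col) {
    assert(right.is_open());
    OffsetIndex index;
    right.clear();
    right.seekg(0);
    std::string line;
    for (;;) {
        std::streamoff pos = right.tellg();
        if (!std::getline(right, line)) break;
        index.emplace(field(line, key_col), pos);  // first occurrence of a key wins
    }
    return index;
}

// Walks the left input sequentially and fetches each matching right row by offset.
void inner_join(std::istream& left, std::size_t left_col,
                std::ifstream& right, const OffsetIndex& index,
                std::ostream& out) {
    std::string left_row, right_row;
    while (std::getline(left, left_row)) {
        auto it = index.find(field(left_row, left_col));
        if (it == index.end()) continue;  // inner join: skip unmatched rows
        right.clear();                    // reset EOF state left by earlier reads
        right.seekg(it->second);
        std::getline(right, right_row);
        out << left_row << ',' << right_row << '\n';
    }
}
```

A left join would differ only in the unmatched branch, where the left row is written followed by empty fields instead of being skipped.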
Your best bet by far is something like SQLite; there are C++ bindings for it and it's tailor-made for lightning-fast inserts and queries.
For the actual reading of the data, you can just go row by row and insert the fields in Sqlite, no need for cache-destroying objects of objects :) As an optimization, you should group up multiple inserts in one statement (insert into table(...) select ... union all select ... union all select ...).
If you need to use C or C++, open the file and load the file directly into a database such as MySQL. The C and C++ languages do not have adequate data table structures nor functionality for manipulating the data. A spreadsheet application may be useful, but may not be able to handle the capacities.
That said, I recommend objects for each field (column). Define a record (file specific) as a collection of fields. Read a text line from a file into a string. Let the record load the field data from the string. Store records into a vector.
Create a new record for the destination file. For each record from the input file(s), load the new record using those fields. Finally, for each record, print the contents of each field with separation characters.
An alternative is to whip up a 2 dimensional matrix of strings.
Your performance bottleneck will be I/O. You may want to read huge blocks of data in. The thorn to the efficiency is the variable record length of a CSV file.
I still recommend using a database. There are plenty of free ones out there, such as MySQL.
It depends on what you mean by "join". Are the columns in file 1 the same as in file 2? If so, you just need a merge sort. Most likely a solution based on merge sort is "best". But I agree with @Blindy above that you should use an existing tool like SQLite. Such a solution is probably more future-proof against changes to the column lists.