Binary-tree data storage implementation - c++

I have started using binary trees in C++, and I must say I really like the idea and things are clear to me, until I think about storing data on disk in an order where I can later read a chunk of data instantly.
So far I have stored everything (the nodes) in RAM, but that is just a simple app, not a real-life one. I am not interested in storing the whole binary tree on disk, as that would be useless: you would have to read it all back into memory again! What I am after is a method like the one used by, for example, MySQL.
I haven't found any good article on this, so I would appreciate it if someone could include some URLs or books.

The main difference between a b-tree and a b+tree:
- The leaf nodes are linked for fast lookup and sequential reads. The links can point ascending, descending, or both (as I saw in one IBM DB).
You should write it to disk; if the table or file grows, you will have memory problems.
(SEEK operations on files ARE REALLY FAST. You can create a 1 GB file on disk in less than 1 second... C# FileStream, method SetLength)
If you manage to have multiple readers/writers, you need concurrency control over the index and the table (or file)... You're going to do that in memory? If a power failure occurs, how do you roll back? Yeah, you don't.
E.g.: field f1 is indexed.
WHERE 1=1 (no need to access the b+tree; give me everything, and the order is irrelevant)
WHERE 1=1 ORDER BY f1 ASC/DESC (need to access the b+tree; give me everything in ascending/descending order)
WHERE f1>=100 (need to access the b+tree; look up the leaf node where f1=100 and return all leaf node items by following the right pointers. If this is a multithreaded read, the rows will probably come back in a strange order, but that's no problem: there is no ORDER BY clause)
WHERE f1>=100 ORDER BY f1 ASC (need to access the b+tree; look up the leaf node where f1=100 and return all leaf node items by following the right pointers. This read shouldn't be multithreaded; following the b+tree, the rows come out naturally in order)
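The range lookup described above (find the leaf holding 100, then follow the right pointers) can be sketched roughly like this. It is only an in-memory illustration, not a disk-based b+tree: `Leaf`, `build_leaves` and `range_scan` are made-up names, and a walk along the leaf chain stands in for the descent through the inner nodes.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Leaf:
    """One b+tree leaf: sorted keys plus a pointer to the right sibling."""
    keys: List[int] = field(default_factory=list)
    next: Optional["Leaf"] = None


def build_leaves(sorted_keys: List[int], fanout: int = 3) -> Optional[Leaf]:
    """Chain sorted keys into linked leaves of at most `fanout` keys each."""
    head = prev = None
    for i in range(0, len(sorted_keys), fanout):
        leaf = Leaf(sorted_keys[i:i + fanout])
        if prev is None:
            head = leaf
        else:
            prev.next = leaf
        prev = leaf
    return head


def range_scan(head: Optional[Leaf], low: int) -> Iterator[int]:
    """Yield every key >= low, in ascending order."""
    leaf = head
    # Assumption: a real b+tree would descend from the root to reach this leaf.
    while leaf is not None and (not leaf.keys or leaf.keys[-1] < low):
        leaf = leaf.next
    if leaf is None:
        return
    start = bisect_left(leaf.keys, low)
    while leaf is not None:
        yield from leaf.keys[start:]  # keys come out already ordered
        start = 0
        leaf = leaf.next              # follow the right pointer
```

Because the leaves are chained in key order, the ORDER BY f1 ASC case needs no extra sort step.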
Field f2 is indexed with a b+tree and has type string.
WHERE name LIKE '%ODD' (internally, the compared value must be reversed so that the wildcard ends up at the end: the reversed key starts with 'DDO' and ends with anything. 'DDOT' is in that group, so 'TODD' must belong to the result!!!! Tricky, tricky logic ;P)
With this statement,
WHERE name LIKE '%OD%' (has 'OD' in the middle), things get hot :))))
Internally, the result is the UNION of the sub-result for 'OD%' with the sub-result for the reversed 'DO%'. After that, it removes the values that only start or only end with 'OD' and have no 'OD' in the middle; anything else is a valid result ('ODODODODOD' is a valid result; 'ODABCD' and 'ABCDOD' are invalid results).
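For the simpler '%ODD' case, the reversed-key trick might look like the sketch below. A sorted Python list stands in for the b+tree, and `build_reversed_index` and `like_suffix` are hypothetical names used only for illustration.

```python
from bisect import bisect_left
from typing import List


def build_reversed_index(names: List[str]) -> List[str]:
    """Store every name reversed, in sorted order (stand-in for a b+tree)."""
    return sorted(name[::-1] for name in names)


def like_suffix(index: List[str], suffix: str) -> List[str]:
    """Answer LIKE '%<suffix>' as a prefix scan over the reversed keys."""
    prefix = suffix[::-1]              # 'ODD' becomes 'DDO'
    pos = bisect_left(index, prefix)   # first key >= 'DDO'
    matches = []
    while pos < len(index) and index[pos].startswith(prefix):
        matches.append(index[pos][::-1])  # un-reverse before returning
        pos += 1
    return matches
```

With the names TODD, ODDITY, ADD, ODD and TEDDY in the index, `like_suffix(index, "ODD")` finds ODD and TODD without touching the other keys.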
Consider what I said and check some more things if you're going to go deep:
- Fast IO on files: C# FileStream with the no-buffering flag and the write-through flag on.
- Memory-mapped files/memory views: yes, we can manipulate a huge file in small portions as we need them.
- Indexes: bitmap index, hash index (hash function; perfect hash function; ambiguity of the hash function), sparse index, dense index, b+tree, r-tree, reversed index.
- Multithreading: locks, mutexes, semaphores.
- Transactional considerations (log file; 2-phase commit; 3-phase commit).
- Locks (database, table, page, record).
- Deadlocks: 3 ways to pick the victim to kill (the longer-running conflicting process; the younger conflicting process; the process that locks more objects). Modern RDBMSs use a mix of the 3 ways...
- SQL parsing (AST).
- Caching recurrent queries.
- Triggers, procedures, views, etc.
- Passing parameters to the procedures (can use the object type ;P)
DON'T LOAD EVERYTHING INTO MEMORY. INTELLIGENT SOLUTIONS LOAD PARTS AS THEY NEED THEM AND RELEASE THEM WHEN THEY ARE NO LONGER NEEDED. Why? Your db engine (and PC) becomes more responsive using less memory. Using a b+tree to look up the leaf node needs just 2 disk IOs. Knowing the lookup value, you get the record's long pointer. SEEK the main file to that position and read the content. This is very fast. Memory is faster... Yes it is, but can you put 10 GB of b+tree in memory? If so, how does your DB engine start to behave? Slowly?
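The "look up the pointer, SEEK, read" step could be sketched as follows. Everything here is an assumption made for illustration: the record layout (a 32-bit key plus a 16-byte name), the names `write_table` and `fetch`, and the plain dict that stands in for the b+tree.

```python
import struct
from typing import Dict, Iterable, Optional, Tuple

# Assumed fixed-size record: little-endian int32 key + 16-byte name.
RECORD = struct.Struct("<i16s")


def write_table(path: str, rows: Iterable[Tuple[int, str]]) -> Dict[int, int]:
    """Write rows to the data file and return a key -> byte offset index."""
    index = {}
    with open(path, "wb") as f:
        for key, name in rows:
            index[key] = f.tell()  # the "record long pointer"
            f.write(RECORD.pack(key, name.encode("ascii")))
    return index


def fetch(path: str, index: Dict[int, int], key: int) -> Optional[Tuple[int, str]]:
    """Look the key up in the index, then SEEK straight to the record."""
    offset = index.get(key)
    if offset is None:
        return None
    with open(path, "rb") as f:
        f.seek(offset)
        k, raw = RECORD.unpack(f.read(RECORD.size))
    return k, raw.rstrip(b"\0").decode("ascii")
```

Only one record is read per lookup, so the data file can be far larger than RAM.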
Forget binary trees and conventional b-trees: they are academic tutorials. In real life they are replaced by hash tables or b+trees (B PLUS TREE, showing storage and ascending order: http://en.wikipedia.org/wiki/B%2B_tree)
Consider using dataspaces for the db data on multiple disks. You can parallelize disk IO that way. Don't forget to mirror them... Each dataspace should hold a fragment of the table with a fragment of the index and a partial log file. You should develop the coordinator, which wisely presents the queries to the sub-units.
E.g.: 3 dataspaces...
INSERT INTO etc... should only happen in 1 dataspace.
but
select * from TB_XPTO should be presented to all dataspaces.
select * from TB_XPTO order by an indexed field should be presented to all dataspaces. Each dataspace executes the instruction, so now we have 3 subsets, each in its own order.
The results arrive at the coordinator, which reorders them again.
Confusing, BUT FAST!!!!!!
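Since each dataspace already returns its subset in order, the coordinator's reordering can be a merge rather than a full sort. A minimal sketch, assuming the partial results are ascending lists and using the hypothetical name `coordinator_order_by`:

```python
import heapq
from typing import Iterable, Iterator, List


def coordinator_order_by(partials: List[Iterable[int]]) -> Iterator[int]:
    """Merge per-dataspace results that are each already sorted ascending."""
    # heapq.merge only keeps one pending item per input in memory.
    return heapq.merge(*partials)
```

For example, `list(coordinator_order_by([[1, 4, 9], [2, 3, 10], [5]]))` gives one ascending sequence built from the 3 subsets.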
The coordinator should control the master transaction.
if dataspace A committed
dataspace B committed
dataspace C is in an uncommitted state
the coordinator will roll back C, B and A.
if dataspace A committed
dataspace B committed
dataspace C committed
the coordinator will commit the overall transaction.
COORDINATOR LOG:
CREATE MASTER TRANSACTION UID 121212, CHILD TRANSACTIONS(1111,2222,3333)
DATA SPACE A LOG
1111 INSERT len byte array
1111 INSERT len byte array
COMMIT 1111
DATA SPACE B LOG
2222 INSERT len byte array
2222 INSERT len byte array
COMMIT 2222
DATA SPACE C LOG
3333 INSERT len byte array
3333 ---> Nothing more..... Power failure here!!!!!!!
On startup the coordinator checks whether the db was properly closed; if not, it checks its log file. Well, a master commit line like COMMIT 121212 is missing. So it will query the dataspaces about their log consistency.
A and B reply COMMITTED, but C, after checking its log file, detects a failure and replies UNCOMMITTED.
The master coordinator FORCES DATASPACES A, B, C TO ROLL BACK 1111, 2222, 3333.
After that, it rolls back its own master transaction and sets DB state=OK.
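The startup check described above might be sketched like this. The logs are modelled as lists of strings in the same shape as the example logs, and `child_committed` and `recover` are hypothetical names; it mirrors only the decision rule given here and leaves out the prepare step that a full two-phase commit protocol adds.

```python
from typing import Dict, List, Tuple


def child_committed(log: List[str], txn_id: int) -> bool:
    """A child is committed only if its COMMIT line reached the log."""
    return f"COMMIT {txn_id}" in log


def recover(master_log: List[str], master_id: int,
            children: Dict[int, List[str]]) -> Tuple[str, List[int]]:
    """Return the action to take and the child transactions to roll back."""
    if f"COMMIT {master_id}" in master_log:
        return "OK", []  # the master commit made it to disk
    if all(child_committed(log, tid) for tid, log in children.items()):
        return "COMMIT", []  # every dataspace committed: finish the master
    # At least one dataspace is uncommitted: roll back all of them.
    return "ROLLBACK", sorted(children)
```

Fed with the logs from the example (1111 and 2222 committed, 3333 cut off), `recover` returns a rollback of all three child transactions.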
The main point here is speed on inserts, selects, updates, and deletes.
Consider keeping the index well balanced. Many deletes on the index will unbalance it, and an unbalanced index loses performance.... Add a header at the head of the index file for tracking this. Some math here would help. If deletes exceed 5% of the records, rebalance it and reset the counter. An update on an indexed field should be counted too.
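The counter idea could look something like the sketch below. The class name `IndexHeader` and its fields are invented for illustration, and the 5% threshold is simply the figure mentioned above, not a tuned value.

```python
class IndexHeader:
    """Counters kept at the head of the index file (hypothetical layout)."""

    def __init__(self, record_count: int, threshold: float = 0.05) -> None:
        self.record_count = record_count
        self.threshold = threshold
        self.changes = 0  # deletes plus updates of the indexed field

    def note_change(self) -> bool:
        """Count one delete/update; return True when a rebalance is due."""
        self.changes += 1
        if self.changes > self.record_count * self.threshold:
            self.changes = 0  # caller rebalances, counter starts over
            return True
        return False
```

The caller would invoke `note_change()` on every delete or indexed update and rebalance whenever it returns True.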
Be smart when choosing the index for a field. If the column is Gender, there are only 2 options (I hope, lol.... oops, it can be nullable too....), so a bitmap index fits well. If the distinctness of a field is 100% (all values different), like a sequence applied to a field as Oracle does, or an identity field as SQL Server does, a b+tree fits well. If a field is some kind of geometric type, like in Oracle, the R-tree is the best. For strings, a reversed index fits well, or a b+tree if the values are all different.
Houston, we have problems....
NULL field values should be considered in the index too. NULL is a value too!!!!
E.g.: WHERE F1 IS NULL
Add some socket functionality: an async TCP/IP SERVER.
- If you delete a record, don't resize the file right away. Mark it as deleted. You should do some metrics here too. If unused space > x and transactions = 0, take a database lock, re-allocate the pointers, then resize the database. Since gaps appear in the DB file, you can try to use page locks instead of a database lock... Things can keep going and no one gets hurt.... Find the last unlocked page of the DB and lock it for yourself. Look for a deleted page that you can fill with your page. Not found: release the lock. Found: move the page to the new position, fix the pointers, mark the old page as deleted, resize the file, release the lock. Why so many operations? To keep the log well formed!!!! You can split the page into small pages, but you get fragmentation (argh... we lost speed, commander?)... 2 algorithms come in here: Best-Fit and Worst-Fit.... Google them. The best is.... using both :P
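As a rough illustration of the two placement policies named above, here is a sketch over a list of free-block sizes. The function names and the list-of-sizes representation are assumptions; a real engine would track offsets as well.

```python
from typing import List, Optional


def best_fit(free_blocks: List[int], size: int) -> Optional[int]:
    """Index of the smallest free block that still holds `size` bytes."""
    fits = [(block, i) for i, block in enumerate(free_blocks) if block >= size]
    return min(fits)[1] if fits else None


def worst_fit(free_blocks: List[int], size: int) -> Optional[int]:
    """Index of the largest free block, if it can hold `size` bytes."""
    fits = [(block, i) for i, block in enumerate(free_blocks) if block >= size]
    return max(fits)[1] if fits else None
```

Best-fit leaves the smallest leftover gap, while worst-fit leaves the largest one, which is more likely to be reusable later.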
And if you solve all of this stuff, you can shout out loud "DAMN, I DID A DATABASE... I'M GONNA NAME IT ORACLE!!!!" ;P

Related

How does Amazon Redshift reconstruct a row from columnar storage?

Amazon describes columnar storage like this:
So I guess this means in what PostgreSQL would call the "heap", blocks contain all the values for one column, then the next column, and so on.
Say I want to query for all people in their 30's, and I want to know their names. So columnar storage means less IO is required to read just the age of every row and find those that are 30-something, because all the other columns don't need to be read. Also maybe some efficient compression can be applied. That's neat, I guess.
Then what? This data structure alone doesn't explain how anything useful can happen after that. After determining what records are 30-something, how are the associated names found? What data structure is used? What are its performance characteristics?
If the Age column is the Sort Key, then the rows in the table will be stored in order of Age. This is great, because each 1MB storage block on disk keeps data for only one column, and it keeps note of the minimum and maximum values within the block.
Thus, searching for the rows that contain an Age of 30 means that Redshift can "skip over" blocks that do not contain Age=30. Since reading from disk is the slowest part of a database, this means it can operate much faster.
Once it has found the blocks that potentially contain Age=30, it reads those blocks from disk. Blocks are compressed, so they might contain much more data than the 1MB on disk. This means many rows can be read with fewer disk accesses.
Once those blocks are decompressed into memory, it finds the rows with Age=30 and then loads the corresponding blocks for the Name column. The compression ratio would be different for the Name column since it is text and is not sorted, so this might result in loading more blocks from disk for Name than for Age.
Redshift then assembles the data from Name and Age for the desired rows and performs any remaining operations.
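The block-skipping step can be illustrated with a small sketch. This is not Redshift's implementation: `Block`, `make_blocks` and `find_rows` are invented names, and tiny Python lists stand in for the 1MB compressed blocks.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One column block plus the min/max noted for it."""
    first_row: int
    values: List[int]
    lo: int
    hi: int


def make_blocks(ages: List[int], block_size: int) -> List[Block]:
    """Cut a column into blocks and record each block's min and max."""
    blocks = []
    for start in range(0, len(ages), block_size):
        chunk = ages[start:start + block_size]
        blocks.append(Block(start, chunk, min(chunk), max(chunk)))
    return blocks


def find_rows(blocks: List[Block], age: int) -> Tuple[List[int], int]:
    """Return matching row numbers and how many blocks had to be read."""
    rows, blocks_read = [], 0
    for block in blocks:
        if age < block.lo or age > block.hi:
            continue  # min/max says the value cannot be here: skip the read
        blocks_read += 1
        rows.extend(block.first_row + i
                    for i, v in enumerate(block.values) if v == age)
    return rows, blocks_read
```

With the column sorted by Age, a lookup for 30 over nine rows in blocks of three reads a single block; the row numbers it returns are what the Name column is then fetched by.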
These operations are also parallelized across multiple nodes based on the Distribution Key, which distributes data based on a given column (or replicates it between nodes for often-used tables). Data is typically distributed based upon a column that is frequently used in JOIN statements so that similar data is co-located on the same node. Each node returns its data to the Leader Node, which combines the data and provides the final results.
Bottom line: Minimise the amount of data read from disk and parallelize operations on separate nodes.
AFAIK every value in the columnar storage has an ID pointer (similar to CTID you mentioned), and to get the select results Redshift needs to find and combine the values with the same ID pointer for each column that's selected from the raw data. If memory allows it's stored in memory, unless it's spilling to disk. This process is called materialization (don't confuse with materialized view materialization). In your case there are 2 technically possible scenarios:
1. materialize all Age/Name pairs, then filter by Age=30, and output the result
2. filter the Age column by Age=30, get the IDs, get the Name values with the corresponding IDs, materialize the pairs and output
I guess in this case #2 is what happens, because materialization is more expensive than filtering. However, there are plenty of scenarios where this is much less obvious (with complex queries and aggregations). It is the responsibility of the query optimizer to decide what's better. #1 is still better than row-oriented storage because it would still read just 2 columns.
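The two scenarios can be put side by side in a short sketch. Two parallel Python lists stand in for the Age and Name columns, and the row position plays the role of the ID pointer; both are simplifying assumptions, as are the function names.

```python
from typing import List


def early_materialization(ages: List[int], names: List[str], age: int) -> List[str]:
    """Scenario 1: build every (Age, Name) pair first, then filter."""
    pairs = list(zip(ages, names))
    return [name for a, name in pairs if a == age]


def late_materialization(ages: List[int], names: List[str], age: int) -> List[str]:
    """Scenario 2: filter one column, then fetch Name only for the matches."""
    row_ids = [i for i, a in enumerate(ages) if a == age]
    return [names[i] for i in row_ids]
```

Both return the same names; the second one simply touches the Name column for far fewer rows when the filter is selective.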

Efficiently read data from a structured file in C/C++

I have a file as follows:
The file consists of 2 parts: header and data.
The data part is separated into equally sized pages. Each page holds data for a specific metric. Multiple pages (not necessarily consecutive) might be needed to hold the data for a single metric. Each page consists of a page header and a page body. A page header has a field called "Next page" that is the index of the next page that holds data for the same metric. A page body holds the real data. All pages have the same fixed size (20 bytes for the header and 800 bytes for the body; if there is less than 800 bytes of data, the rest is filled with 0).
The header part consists of 20,000 elements; each element has information about a specific metric (point 1 -> point 20000). An element has a field called "first page" that is the index of the first page holding data for the metric.
The file can be up to 10 GB.
Requirement: Re-order the data of the file in the shortest time, that is, pages holding data for a single metric must be consecutive, and ordered from metric 1 to metric 20000 in alphabetical order (the header part must be updated accordingly).
An apparent approach: For each metric, read all data for the metric (page by page), write data to new file. But this takes much time, especially when reading data from the file.
Is there an efficient way?
One possible solution is to create an index from the file, containing the page number and the page metric that you need to sort on. Create this index as an array, so that the first entry (index 0) corresponds to the first page, the second entry (index 1) the second page, etc.
Then you sort the index using the metric specified.
Once sorted, you end up with a new array whose first, second, etc. entries give the new page order, and you read the input file and write to the output file in the order of the sorted index.
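One possible shape for such an index is sketched below. It assumes the file header and the page headers have already been read into memory as a `first_page` dict and a `next_page` list, and that -1 marks the end of a chain; those names and the sentinel are not from the question.

```python
from typing import Dict, List, Tuple

END = -1  # assumed sentinel for "no next page"


def build_index(first_page: Dict[str, int],
                next_page: List[int]) -> List[Tuple[str, int, int]]:
    """Return (metric, position in chain, page number) for every used page."""
    entries = []
    for metric, page in first_page.items():
        seq = 0
        while page != END:
            entries.append((metric, seq, page))
            page = next_page[page]  # follow the "Next page" field
            seq += 1
    return entries


def output_order(first_page: Dict[str, int], next_page: List[int]) -> List[int]:
    """Page numbers in the order they should be written to the new file."""
    return [page for _, _, page in sorted(build_index(first_page, next_page))]
```

Sorting on (metric, position) keeps each metric's pages together and in chain order, so the copy pass only needs the resulting list of page numbers.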
An apparent approach: For each metric, read all data for the metric (page by page), write data to new file. But this takes much time, especially when reading data from the file.
Is there an efficient way?
Yes. After you get a working solution, measure its efficiency, then decide which parts you wish to optimize. What and how you optimize will depend greatly on the results you get here (what your bottlenecks are).
A few generic things to consider:
- if you have one set of steps that reads data for a single metric and moves it to the output, you should be able to parallelize that (have 20 sets of steps instead of one).
- a 10 GB file will take a while to process regardless of what hardware you run your code on (conceivably, you could run it on a supercomputer, but I am ignoring that case). You / your client may accept a slower solution if it shows a progress bar.
- do not use string comparisons for sorting.
Edit (addressing comment)
Consider performing the read as follows:
- create a list of block offsets for the blocks you want to read
- create a list of worker threads, of fixed size (for example, 10 workers)
- each idle worker will receive the file name and a block offset, then create a std::ifstream instance on the file, read the block, and return it to a receiving object (and then request another block number, if any are left).
- pages that have been read should be passed to a central structure that manages/stores pages.
Also consider managing the memory for the blocks separately (for example, allocate chunks of multiple blocks preemptively, when you know the number of blocks to be read).
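The worker scheme above is described for C++ with std::ifstream; the sketch below shows the same shape in Python with a thread pool, purely as an illustration. `read_block` and `read_blocks` are hypothetical names, and the 820-byte page size is taken from the question (20-byte header + 800-byte body).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

PAGE_SIZE = 820  # 20-byte header + 800-byte body, as in the question


def read_block(path: str, offset: int) -> bytes:
    """Each worker opens its own handle, so no seek position is shared."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(PAGE_SIZE)


def read_blocks(path: str, offsets: List[int], workers: int = 10) -> Dict[int, bytes]:
    """Read the requested blocks with a fixed-size pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = pool.map(lambda off: read_block(path, off), offsets)
        # The dict plays the role of the central structure that stores pages.
        return dict(zip(offsets, pages))
```

Whether this beats a single sequential reader depends on the storage device, so it is worth measuring before committing to it.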
I first read the header part, then sort the metrics in alphabetical order. For each metric in the sorted list I read all its data from the input file and write it to the output file. To remove the bottleneck at the reading step, I used memory mapping. The results showed that with memory mapping the execution time for a 5 GB input file was reduced 5 to 6 times compared with not using memory mapping. This temporarily solves my problem. However, I will also consider the suggestions of @utnapistim.
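A memory-mapped copy loop along those lines might look like the sketch below. It is written in Python with the standard `mmap` module just to show the idea; `copy_pages`, its parameters and the precomputed `page_order` list are assumptions, not the poster's code.

```python
import mmap
from typing import List


def copy_pages(src_path: str, dst_path: str, page_order: List[int],
               page_size: int = 820, header_size: int = 0) -> None:
    """Write pages from src to dst in the given order via a read-only mapping."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Length 0 maps the whole file; the OS pages data in on demand.
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as view:
            for page in page_order:
                start = header_size + page * page_size
                dst.write(view[start:start + page_size])
```

The mapping lets the operating system decide which parts of the large file stay in memory, which is likely where the reported speedup comes from.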

How to optimize writing this data to a postgres database

I'm parsing poker hand histories, and storing the data in a postgres database. Here's a quick view of that:
I'm getting relatively bad performance, and parsing the files takes several hours. I can see that the database part takes 97% of the total program time, so even a little optimization there would make this a lot quicker.
The way I have it set-up now is as follows:
1. Read next file into a string.
2. Parse one game and store it into object GameData.
3. For every player, check if we have his name in the std::map. If so, store the playerids in an array and go to 5.
4. Insert the player, add it to the std::map, store the playerids in an array.
5. Using the playerids array, insert the moves for this betting round, store the moveids in an array.
6. Using the moveids array, insert a movesequence, store the movesequenceids in an array.
7. If this isn't the last round played, go to 5.
8. Using the movesequenceids array, insert a game.
9. If this was not the final game, go to 2.
10. If this was not the last file, go to 1.
Since I'm sending queries for every move, for every movesequence, for every game, I'm obviously doing too many queries. How should I bundle them for best performance? I don't mind rewriting a bit of code, so don't hold back. :)
Thanks in advance.
CX
It's very hard to answer this without any queries, schema, or a Pg version.
In general, though, the answer to these problems is to batch the work into bigger coarser batches to avoid repeating lots of work, and, most importantly, by doing it all in one transaction.
You haven't said anything about transactions, so I'm wondering if you're doing all this in autocommit mode. Bad plan. Try wrapping the whole process in a BEGIN and COMMIT. If it's a seriously long-running process, then COMMIT every few minutes / tens of games / whatever, write a checkpoint file or DB entry your program can use to resume the import from that point, and open a new transaction to carry on.
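The commit-per-batch idea with a resumable checkpoint could be sketched as below. Note the loud assumptions: Python's built-in `sqlite3` is used only as a runnable stand-in for PostgreSQL, and the `game` and `import_checkpoint` tables and both function names are invented for the example.

```python
import sqlite3
from typing import Iterable, Tuple


def checkpoint(conn: sqlite3.Connection, last_game_id: int) -> None:
    """Record progress and commit it together with the batch of inserts."""
    conn.execute("UPDATE import_checkpoint SET last_game = ?", (last_game_id,))
    conn.commit()


def import_games(conn: sqlite3.Connection,
                 games: Iterable[Tuple[int, str]], batch_size: int = 100) -> int:
    """Insert games, committing once per batch instead of once per row."""
    done = 0
    for done, (game_id, data) in enumerate(games, start=1):
        conn.execute("INSERT INTO game(id, data) VALUES (?, ?)", (game_id, data))
        if done % batch_size == 0:
            checkpoint(conn, game_id)
    if done % batch_size:  # commit the final partial batch
        checkpoint(conn, game_id)
    return done
```

After a crash, the importer would read `last_game` and skip everything up to that id before carrying on.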
It'll help to use multi-valued inserts where you're inserting multiple rows to the same table. Eg:
INSERT INTO some_table(col1, col2, col3) VALUES
('a','b','c'),
('1','2','3'),
('bork','spam','eggs');
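A statement like that can be generated rather than typed by hand. The helper below is a sketch with an invented name, `multi_row_insert`; it emits PostgreSQL's `$1, $2, ...` parameter placeholders and a flat parameter list, and assumes the table and column names come from trusted code since they are interpolated directly.

```python
from typing import List, Sequence, Tuple


def multi_row_insert(table: str, columns: Sequence[str],
                     rows: Sequence[Sequence[object]]) -> Tuple[str, List[object]]:
    """Build one INSERT with $n placeholders plus the flat parameter list."""
    width = len(columns)
    groups = []
    for r in range(len(rows)):
        # Row r uses placeholders r*width+1 .. r*width+width.
        nums = ", ".join(f"${r * width + c + 1}" for c in range(width))
        groups.append(f"({nums})")
    # Table and column names are NOT escaped here: trusted input only.
    sql = f"INSERT INTO {table}({', '.join(columns)}) VALUES " + ", ".join(groups)
    params = [value for row in rows for value in row]
    return sql, params
```

One round trip then carries many rows, which is where most of the saving comes from.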
You can improve commit rates with synchronous_commit=off and a commit_delay, but that's not very useful if you're batching work into bigger transactions.
One very good option will be to insert your new data into UNLOGGED tables (PostgreSQL 9.1 or newer) or TEMPORARY tables (all versions, but lost when session disconnects), then at the end of the process copy all the new rows into the main tables and drop the import tables with commands like:
INSERT INTO the_table
SELECT * FROM the_table_import;
When doing this, CREATE TABLE ... LIKE is useful.
Another option - really a more extreme version of the above - is to write your results to CSV flat files as you read and convert them, then COPY them into the database. Since you're working in C++ I'm assuming you're using libpq - in which case you're hopefully also using libpqtypes. libpq offers access to the COPY api for bulk-loading, so your app wouldn't need to call out to psql to load the CSV data once it'd produced it.
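Producing the CSV side of that is straightforward; a minimal sketch with the standard `csv` module follows. `rows_to_csv` is a hypothetical helper, and the resulting text is what would be fed to a `COPY ... FROM STDIN WITH (FORMAT csv)` command.

```python
import csv
import io
from typing import Iterable, Sequence


def rows_to_csv(rows: Iterable[Sequence[object]]) -> str:
    """Serialize rows as CSV text, quoting fields only where needed."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()
```

Letting the csv module handle quoting avoids broken loads when a field contains a comma or a double quote.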

Check a fingerprint in the database

I am saving the fingerprints in a "blob" field, and I wonder whether the only way to compare these prints is to retrieve all the prints saved in the database and then build a vector to check with the function "identify_finger". Can you check directly in the database using a SELECT?
I'm working with libfprint. In this code the verification is done on a vector:
def test_identify():
    cur = DB.cursor()
    cur.execute('select id, fp from print')
    id = []
    gallary = []
    for row in cur.fetchall():
        data = pyfprint.pyf.fp_print_data_from_data(str(row['fp']))
        gallary.append(pyfprint.Fprint(data_ptr = data))
        id.append(row['id'])
    n, fp, img = FingerDevice.identify_finger(gallary)
There are two fundamentally different ways to use a fingerprint database. One is to verify the identity of a person who is known through other means, and one is to search for a person whose identity is unknown.
A simple library such as libfprint is suitable for the first case only. Since you're using it to verify someone you can use their identity to look up a single row from the database. Perhaps you've scanned more than one finger, or perhaps you've stored multiple scans per finger, but it will still be a small number of database blobs returned.
A fingerprint search algorithm must be designed from the ground up to narrow the search space, to compare quickly, and to rank the results and deal with false positives. Just as a Google search may come up with pages totally unrelated to what you're looking for, so too will a fingerprint search. There are companies that devote their entire existence to solving this problem.
Another way would be to have a mysql plugin that knows how to work with fingerprint images and select based on what you are looking for.
I really doubt that there is such a thing.
You could also try to parallelize the fingerprint comparison, i.e. calling:
FingerDevice.identify_finger(gallary)
in parallel, on different cores/machines
You can't check directly from the database using a SELECT because each scan is different and will produce different blobs. libfprint does the hard work of comparing different scans and judging whether they are from the same person or not.
What zinking and Tudor are saying, I think, is that if you understand how that judgement process works (which is, by the way, by minutiae comparison), you can develop a method of storing the relevant data for the process (the minutiae, maybe?) in the database and then a method for fetching the relevant values, maybe a kind of index or some type of extension to the database.
In other words, you would have to reimplement the libfprint algorithms in a more complex (and beautiful) way, instead of just accepting the libfprint method of comparing the scan with every stored fingerprint in a loop.
Other solutions for speeding up your program:
use C:
I only know sufficient C to write kind of hello-world programs, but it was not hard to write code in pure C to use the fp_identify_finger_img function of libfprint and I can tell you it is much faster than pyfprint.identify_finger.
You can continue doing the enrollment part of the stuff in python. I do it.
use a time / location based SELECT:
If you know your users are more likely to scan their fingerprints at some times than at others, or at some places than at others (maybe arriving at work at a certain time and scanning their fingers, or leaving, or entering the building by one gate or another), you can collect data (at each scan) to measure those probabilities and create parallel tables that rank the users by their probability of arriving at each time and location.
We know that identify_finger tries to identify fingers in a loop over the fingerprint objects you provide in a list, so we can use that and give it the objects sorted so that the most likely user for that time and location is first in the list, and so on.
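The ordering step could be as small as the sketch below. The `scan_counts` table keyed by (user, hour, gate) and the function name `order_gallery` are assumptions for illustration; in practice the counts would come from the parallel tables described above.

```python
from typing import Dict, Hashable, List, Tuple


def order_gallery(user_ids: List[Hashable],
                  scan_counts: Dict[Tuple[Hashable, int, str], int],
                  hour: int, gate: str) -> List[Hashable]:
    """Most likely users for this hour and gate come first."""
    return sorted(user_ids,
                  key=lambda u: scan_counts.get((u, hour, gate), 0),
                  reverse=True)
```

The gallery passed to identify_finger would then be built in this order, so a frequent user is matched after only a few comparisons.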

What is cheaper: cast to int or Trim the strings in C++?

I am reading several files from the Linux /proc fs and I will have to insert those values into a database. It should be as efficient as possible. So what is cheaper:
i) to cast them to int while I store them in memory, and later cast them to string again when I build my INSERT statement
ii) or keep them as strings, just sanitizing the values (removing ':', spaces, etc...)
iii) What should I take into account to make this decision?
I am already splitting the lines, because the order in which they come is not good enough for me.
Thanks,
Pedro
Edit - Clarification
Sorry guys, my scenario is the following: I am measuring CPU, memory, network, disk, etc. every 10 seconds. We are developing our own database system, so I cannot count on anything more than plain INSERT statements.
I got interested in this optimization because of the frequency of parsing data. It's going to be write-once: there will be no updates to the data after it is written.
You seem to be performing some archiving activity [write-once, read-probably-at-most-once] (storing in the DB for later, rare use). If not, you should base your optimization emphasis on how the data will be read (not written).
If this is the archiving case, maybe inserting BLOBs (binary large objects [or similar concepts]) into the DB will be more efficient.
Addition:
Apparently it will depend on how you will read the data. Are you just listing the data for browsing later on, or will there be more complex queries based on the benchmark values?
For example, if you later run something like: SELECT * from db.Log WHERE log.time > time1 and Max (Memory) < 5000, then it is best to keep each value in its original format (int as integer, string as String, etc.) so that the main data processing is left to the DB server.
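Keeping the values typed is cheap to do at parse time. The sketch below handles lines in the /proc/meminfo style (`Name:   value kB`); the function name is made up, and it makes no claim about which option is faster, which would need measuring on the real workload.

```python
from typing import Tuple


def parse_meminfo_line(line: str) -> Tuple[str, int]:
    """Split 'MemTotal:   16384256 kB' into a name and an integer value."""
    name, _, rest = line.partition(":")
    # The first token after the colon is the number; a unit may follow it.
    return name.strip(), int(rest.split()[0])
```

An integer stored this way can be bound to an INSERT as a number, so the DB server can compare and aggregate it directly.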