So I have a boost::multi_index_container with multiple non-unique indexes. I would like to find an elegant way to do a relational-database-style query to find all elements that match a set of criteria using multiple indexes.
For instance, given a list of connections between devices, I'd like to search for all elements whose source is 'server' and whose destination is 'pc2'. I've got an index on Source and an index on Dest.
Source   Dest  Port
-------  ----  ----
server   pc1     23
server   pc1     27
server   pc1     80
server   pc2     80   <- want to find these two
server   pc2     90   <-
printer  pc3    110
printer  pc1    110
scanner  mac   8080
Normally I might do lower_bound and upper_bound on the first index (to match 'server'), then do a linear search between those iterators to find the elements that match in the "Dest" column, but that's not very satisfying, since I've got a second index. Is there an elegant STL/Boost-like way to take advantage of the fact that there are two indexes and avoid a linear search (or an equivalent amount of work, such as adding all intermediate results to another container)?
(Obviously in the example, a linear search would be fastest, but if there were 10000 items with 'server' as the source, having the second index would start to be nice.)
Any ideas are appreciated!
You might simply get some inspiration from relational databases...
... but first we need to demystify a thing about indexes.
Compound Indexes
In a relational database there are two types of indexes:
regular indexes: an index on one column
compound indexes: an index on multiple columns at once
The two give different performance characteristics. When you need to use two regular indexes, a merge pass (also called a join) is required to combine the results they return, so a compound index can provide a speed boost.
Multi-Index
Boost.MultiIndex can use compound indexes; you are free to provide your own hashing or comparison function, after all.
A key difference from a relational database is that you cannot have an efficient merge pass (merging two ROWID sets), because that requires intrinsic knowledge of the container to be efficient. You are therefore indeed stuck with a linear search among the results of the first search; it is up to you to pick the most discriminating index for that first search.
Note: the name multi-index refers to the idea that it automatically maintains multiple indexes when you insert, update and delete your elements. It also means that you can search using any of those indexes, with a performance profile that you chose up front. But it is not a full-blown database engine with statistics, heuristics and a query engine.
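As a rough sketch of the compound-index idea, the (Source, Dest) lookup from the question can be expressed with Boost.MultiIndex's composite_key; the struct and field names below are assumptions based on the sample table, not the original poster's code:

#include <string>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/composite_key.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/tuple/tuple.hpp>

struct Connection {
    std::string source;
    std::string dest;
    int port;
};

namespace bmi = boost::multi_index;

typedef bmi::multi_index_container<
    Connection,
    bmi::indexed_by<
        // compound (composite) index on (source, dest)
        bmi::ordered_non_unique<
            bmi::composite_key<
                Connection,
                bmi::member<Connection, std::string, &Connection::source>,
                bmi::member<Connection, std::string, &Connection::dest>
            >
        >
    >
> ConnectionSet;

void find_server_pc2(const ConnectionSet& conns) {
    // One O(log n) lookup instead of a range scan on source followed by a linear scan on dest.
    auto range = conns.get<0>().equal_range(
        boost::make_tuple(std::string("server"), std::string("pc2")));
    for (auto it = range.first; it != range.second; ++it) {
        // matches the two server/pc2 rows from the sample data; it->port is available here
    }
}

With that layout you still keep whatever other indexes you need; the composite index simply plays the role of the relational compound index described above.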
The most elegant way to do a relational-database style query is to use a relational database. I'm not being flippant; you're using the wrong data structure. If "relational-database style query" operations are going to be something that you do frequently, I would strongly urge you to invest in SQLite.
The purpose of Boost.MultiIndex is not to be a quick-and-dirty database.
Related
I have read over and over again that SQL, at its heart, is an unordered model. That means executing the same SQL query multiple times can return the result set in a different order, unless there's an "order by" clause included. Can someone explain why a SQL query can return its result set in a different order across different runs of the query? It may not always be the case, but it's certainly possible.
Algorithmically speaking, doesn't the query plan play any role in determining the order of the result set when there is no "order by" clause? I mean, when there is a query plan for some query, why doesn't the algorithm always return the data in the same order?
Note: I am not questioning the use of order by; I am asking why there is no guarantee, i.e., I am trying to understand the challenges that make any guarantee impossible.
Some SQL Server examples where the exact same execution plan can return differently ordered results are:
An unordered index scan might be carried out in either allocation order or key order, depending on the isolation level in effect.
The merry-go-round scanning feature allows scans to be shared between concurrent queries.
Parallel plans are often non-deterministic, and the order of results might depend on the degree of parallelism selected at runtime and the concurrent workload on the server.
If the plan has nested loops with unordered prefetch, the inner side of the join is allowed to proceed using data from whichever I/Os happened to complete first.
Martin Smith has some great examples, but the absolute dead simple way to demonstrate when SQL Server will change the plan it uses (and therefore the ordering in which a query without ORDER BY returns its results, based on the different plan) is to add a covering index. Take this simple example:
CREATE TABLE dbo.floob
(
blat INT PRIMARY KEY,
x VARCHAR(32)
);
INSERT dbo.floob VALUES(1,'zzz'),(2,'aaa'),(3,'mmm');
This will order by the clustered PK:
SELECT x FROM dbo.floob;
Results:
x
----
zzz
aaa
mmm
Now, let's add an index that happens to cover the query above.
CREATE INDEX x ON dbo.floob(x);
The index causes a recompile of the above query when we run it again; now it orders by the new index, because that index provides a more efficient way for SQL Server to return the results to satisfy the query:
SELECT x FROM dbo.floob;
Results:
x
----
aaa
mmm
zzz
Take a look at the plans: neither has a sort operator. Without any other ordering input, they are simply relying on the inherent order of the index, and they are scanning the whole index because they have to (and the cheapest way for SQL Server to scan the index is in order). (Of course, even in these simple cases, some of the factors in Martin's answer could produce a different order; but this holds true in the absence of any of those factors.)
As others have stated, the ONLY WAY TO RELY ON ORDER is to SPECIFY AN ORDER BY. Please write that down somewhere. It doesn't matter how many scenarios exist where this assumption can break; the fact that there is even one makes it futile to try to find guidelines for when you can be lazy and skip the ORDER BY clause. Just use it, always, or be prepared for the data to not always come back in the same order.
Some related thoughts on this:
Bad habits to kick : relying on undocumented behavior
Why people think some SQL Server 2000 behaviors live on… 12 years later
Quote from Wikipedia:
"As SQL is a declarative programming language, SELECT queries specify a result set, but do not specify how to calculate it. The database translates the query into a "query plan" which may vary between executions, database versions and database software. This functionality is called the "query optimizer" as it is responsible for finding the best possible execution plan for the query, within applicable constraints."
It all depends on what the query optimizer picks as a plan - table scan, index scan, index seek, etc.
Other factors that might influence picking a plan are table/index statistics and parameter sniffing to name a few.
In short, the order is never guaranteed without an ORDER BY clause.
It's simple: if you need the data ordered then use an ORDER BY. It's not hard!
It may not cause you a problem today or next week or even next month but one day it will.
I've been on a project where we needed to rewrite dozens (or maybe hundreds) of queries after an upgrade to Oracle 10g caused GROUP BY to be evaluated in a different way than it had been on Oracle 9i, meaning that the queries weren't necessarily ordered by the grouped columns anymore. Not fun, and so easily avoided.
Remember that SQL is a declarative language, so you are telling the DBMS what you want and the DBMS is then working out how to get it. It will bring back the same results every time, but it may evaluate them in a different way each time: there are no guarantees.
Just one simple example of where this might cause you problems: new rows appear at the end of the table when you select from it... until they don't, because you've deleted some rows and the DBMS decides to fill in the empty space.
There are an unknowable number of ways it can go wrong unless you use ORDER BY.
Why does water boil at 100 degrees C? Because that's the way it's defined.
Why are there no guarantees about result ordering without an ORDER BY? Because that's the way it's defined.
The DBMS will probably use the same query plan the next time and that query plan will probably return the data in the same order: but that is not a guarantee, not even close to a guarantee.
If you don't specify an ORDER BY, then the order will depend on the plan the query uses. For example, if the query did a table scan and used no index, the result would come back in the "natural order" or the order of the PK. However, if the plan decides to use IndexA, which is built on columnA, then the results would come back in that index's order. Make sense?
I am looking for a data structure in C++ and I need some advice.
I have nodes, every node has unique_id and group_id:
1 1.1.1.1
2 1.1.1.2
3 1.1.1.3
4 1.1.2.1
5 1.1.2.2
6 1.1.2.3
7 2.1.1.1
8 2.1.1.2
I need a data structure to answer those questions:
what is the group_id of node 4
give me list (probably vector) of unique_id's that belong to group 1.1.1
give me list (probably vector) of unique_id's that belong to group 1.1
give me list (probably vector) of unique_id's that belong to group 1
Is there a data structure that can answer those questions (and what is the time complexity of insertion and of answering them)? Or should I implement one myself?
I would appreciate an example.
EDIT:
At the beginning, I need to build this data structure. Most of the activity is reading by group id; insertion will happen, but less often than reading.
Time complexity is more important than memory space.
To me, hierarchical data like the group ID calls for a tree structure. (I assume that for 500 elements this is not really necessary, but it seems natural and scales well.)
Each element in the first two levels of the tree would just hold vectors (if they come ordered) or maps (if they come un-ordered) of sub-IDs.
The third level in the tree hierarchy would hold pointers to leaves, again in a vector or map, which contain the fourth group ID part and the unique ID.
Questions 2-4 are easily and quickly answered by navigating the tree.
For question 1 one needs an additional map from unique IDs to leaves in the tree; each element inserted into the tree also has a pointer to it inserted into the map.
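A minimal sketch of that idea, assuming the group ID always has exactly four numeric parts (all names below are made up for illustration):

#include <array>
#include <map>
#include <unordered_map>
#include <vector>

struct GroupTree {
    // level1 -> level2 -> level3 -> (level4 -> unique_id)
    std::map<int, std::map<int, std::map<int, std::map<int, int>>>> tree;
    // unique_id -> full group id, for question 1
    std::unordered_map<int, std::array<int, 4>> byId;

    void insert(int uniqueId, std::array<int, 4> g) {
        tree[g[0]][g[1]][g[2]][g[3]] = uniqueId;
        byId[uniqueId] = g;
    }

    // all unique ids under the group prefix "a.b.c"
    std::vector<int> idsIn(int a, int b, int c) const {
        std::vector<int> out;
        auto i1 = tree.find(a);        if (i1 == tree.end())        return out;
        auto i2 = i1->second.find(b);  if (i2 == i1->second.end()) return out;
        auto i3 = i2->second.find(c);  if (i3 == i2->second.end()) return out;
        for (auto& leaf : i3->second) out.push_back(leaf.second);
        return out;
    }
};

int main() {
    GroupTree t;
    t.insert(4, {1, 1, 2, 1});
    t.insert(5, {1, 1, 2, 2});
    auto group = t.byId[4];         // question 1: group of node 4 is 1.1.2.1
    auto ids   = t.idsIn(1, 1, 2);  // question 2: nodes in group 1.1.2 -> {4, 5}
    (void)group; (void)ids;
}

Queries for the shallower prefixes (questions 3 and 4) follow the same pattern, just stopping one or two levels higher in the tree and collecting everything below.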
First of all, if you are going to have only a small number of nodes then it would probably make sense not to mess with advanced data structuring. A simple linear search could be sufficient.
Next, it looks like a good job for SQL, so it may be a good idea to incorporate the SQLite library into your app. But even if you really want to do it without SQL, it's still a good hint: what you need are two index trees to support quick searching through your array. The complexity (if using balanced trees) will be logarithmic for all operations.
Depends...
How often do you insert? Or do you mostly read?
How often do you access by Id or GroupId?
With a max of 500 nodes I would put them in a simple vector where the Id is the offset into the array (if the Ids are indeed as shown). The group search can then be implemented by iterating over the array and comparing the partial group-ids.
If this is too expensive and you really access the structure a lot and need very high performance, or you do a lot of inserts, I would implement a tree together with a HashMap for the Ids.
If the data is stored in a database you may use a SELECT / CONNECT BY, if your system supports that, and query the information directly from the DB.
Sorry for not providing a clear answer, but the solution depends on too many factors ;-)
Sounds like you need a container with two separate indexes on unique_id and group_id. Question 1 will be handled by the first index, Questions 2-4 will be handled by the second.
Maybe take a look at Boost Multi-index Containers Library
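A rough sketch of what such a container could look like, with field names assumed from the question (the prefix query relies on group_id strings with a common prefix sorting contiguously):

#include <string>
#include <vector>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/member.hpp>

struct Node {
    int unique_id;
    std::string group_id;   // e.g. "1.1.2.1"
};

namespace bmi = boost::multi_index;

typedef bmi::multi_index_container<
    Node,
    bmi::indexed_by<
        // index 0: question 1, lookup by unique_id
        bmi::hashed_unique<bmi::member<Node, int, &Node::unique_id> >,
        // index 1: questions 2-4, range scans over group_id prefixes
        bmi::ordered_non_unique<bmi::member<Node, std::string, &Node::group_id> >
    >
> NodeSet;

// All unique_ids whose group_id starts with the given prefix.
// Pass the prefix with a trailing '.' (e.g. "1.1.1.") so "1.1.1" does not match "1.1.10".
std::vector<int> ids_in_group(const NodeSet& nodes, const std::string& prefix) {
    std::vector<int> out;
    const auto& byGroup = nodes.get<1>();
    for (auto it = byGroup.lower_bound(prefix);
         it != byGroup.end() && it->group_id.compare(0, prefix.size(), prefix) == 0;
         ++it)
        out.push_back(it->unique_id);
    return out;
}

int main() {
    NodeSet nodes;
    nodes.insert(Node{4, "1.1.2.1"});
    nodes.insert(Node{5, "1.1.2.2"});
    auto in_112 = ids_in_group(nodes, "1.1.2.");  // -> {4, 5}
    auto node4  = nodes.get<0>().find(4);         // question 1: *node4 has group_id "1.1.2.1"
    (void)in_112; (void)node4;
}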
I am not sure of the perfect DS for this, but I would make use of a map keyed on the unique_id.
It will give you O(log n) lookup for question 1 (or average O(1) if you use an unordered_map), and O(log n) insertion and deletion. The issue comes with questions 2, 3 and 4, where the efficiency will be O(n), with n being the number of nodes.
I am looking for the most efficient data structure to maintain an indexed list. You can easily view it in terms of an STL map:
std::map<int,std::vector<int> > eff_ds;
I am using this as an example because I am currently using this setup. The operations that I would like to perform are:
Insert values based on key: similar to eff_ds[key].push_back(...);
Print the contents of the data structure for each key.
I am also trying to use an unordered map and a forward list,
std::unordered_map<int,std::forward_list<int> > eff_ds;
Is this the best I could do in terms of time if I use C++ or are there other options ?
UPDATE:
I can do insertion at either end (front or back), as long as I do the same for all the keys. To make my problem more clear, consider the following:
At each iteration of my algorithm, an external block is going to give me a (key, value) pair as output, where both key and value are single integers. Of course, I will have to insert this value under the corresponding key. Also, at different iterations, the same key might be returned with different values. At the end, my output data (written to a file) should look something like this:
k1: v1 v2 v3 v4
k2: v5 v6 v7
k3: v8
.
.
.
kn: vm
The number of these iterations is pretty large, around 1M.
There are two dimensions to your problem:
What is the best container to use when you want to look up items by a numeric key, there is a large number of keys, and the keys are sparse?
A numeric key might lend itself to a vector, but if the keys are sparsely populated that would waste a lot of memory.
Assuming you do not need to iterate through the keys in order (which you did not state as a requirement), an unordered_map is probably the best bet.
What is the best container for a list of numbers, allowing insertion at either end and retrieval of the numbers in order (the value type of the outer map)?
The answer to this depends on how frequently you want to insert elements at the front. If that happens often, you might want to consider a forward_list. If you are mainly inserting at the end, a vector has lower overhead.
Based on your updated question, since you can limit yourself to adding values to the end of the lists, and since you are not concerned with duplicate entries in the lists, I would recommend using std::unordered_map<int, std::vector<int> >.
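A minimal sketch of that recommendation; the hard-coded pairs and the output file name are just placeholders for the external block described above:

#include <fstream>
#include <unordered_map>
#include <vector>

int main() {
    std::unordered_map<int, std::vector<int>> eff_ds;

    // Each iteration produces a (key, value) pair; append the value to that key's list.
    int pairs[][2] = { {1, 10}, {2, 20}, {1, 11}, {3, 30}, {1, 12} };
    for (auto& p : pairs)
        eff_ds[p[0]].push_back(p[1]);   // average O(1) per insertion, amortized append

    // Print "key: v1 v2 v3 ..." per key (keys come out in no particular order).
    std::ofstream out("output.txt");
    for (auto& kv : eff_ds) {
        out << kv.first << ":";
        for (int v : kv.second) out << ' ' << v;
        out << '\n';
    }
}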
I want to design/find a C++ data structure/container that supports two columns of data and CRUD operations on those data. I reviewed the STL containers but none of them seems to support my requirements (correct me if I am wrong).
My exact requirements are as follows:
Datastructure with Two Columns.
Supports the following Functions
Search for a Specific item.
Search for a List of items matching a criteria
Both columns should support the above-mentioned search operations, i.e., I should be able to search for data in both columns.
Update a specific item
Delete a specific item
Add new item
I prefer search operation to be faster than add/delete operation.
In addition, I will be sharing this data between threads, so I need mutex support (I can also implement mutex locking on this data separately).
Does any existing STL container meet my requirements, or is there another library or data structure that fits them best?
Note: I can't use a database or SQLite to store my data.
Thank you
Regards,
Dinesh
If one of the columns is unique then you can probably use a map. Otherwise, define a class with two member variables representing the columns and store the objects in a vector. There are standard algorithms that will help you in searching the container.
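A small sketch of that vector-of-objects approach; the struct, field names and search criteria below are made up for illustration:

#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

struct Row {
    std::string col1;
    std::string col2;
};

int main() {
    std::vector<Row> table = { {"a", "x"}, {"b", "y"}, {"a", "z"} };

    // Search for a specific item by the first column.
    auto it = std::find_if(table.begin(), table.end(),
                           [](const Row& r) { return r.col1 == "b"; });

    // Collect all rows matching a criterion on the second column.
    std::vector<Row> matches;
    std::copy_if(table.begin(), table.end(), std::back_inserter(matches),
                 [](const Row& r) { return r.col2 == "z"; });

    (void)it; (void)matches;   // silence unused-variable warnings in this sketch
}

Note that both searches here are linear; if the table grows large and searches dominate, the Boost.Bimap/Boost.MultiIndex suggestions below give logarithmic (or hashed) lookups on both columns instead.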
Search for a Specific item.
If you need a one-way mapping (i.e. fast search over the values in one column), you should use the map or multimap container classes. There is, however, no bidirectional map in the standard library, so you should build your own as a pair of (multi)maps or use another library, such as boost::bimap.
Your best bet is Boost.Bimap, because it will make it easy for you when you want to search based on either column. If you decide that you need more columns, then Boost.MultiIndex might be better. Here is an example:
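A rough sketch of how such a bimap could look, assuming both columns are strings and both may contain duplicates (the data is illustrative):

#include <iostream>
#include <string>
#include <boost/bimap.hpp>
#include <boost/bimap/multiset_of.hpp>

int main() {
    // Both columns allow duplicates and both are searchable.
    typedef boost::bimap<
        boost::bimaps::multiset_of<std::string>,   // column 1
        boost::bimaps::multiset_of<std::string>    // column 2
    > Table;

    Table table;
    table.insert(Table::value_type("server", "pc1"));
    table.insert(Table::value_type("server", "pc2"));

    // Search by the first column...
    auto l = table.left.equal_range("server");
    for (auto it = l.first; it != l.second; ++it)
        std::cout << it->first << " -> " << it->second << '\n';

    // ...or by the second column.
    auto r = table.right.equal_range("pc2");
    for (auto it = r.first; it != r.second; ++it)
        std::cout << it->second << " -> " << it->first << '\n';
}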
I am writing an email application that interfaces with a MySQL database. I have two tables that source my data: one contains unsubscriptions, the other is a standard user table. As of now, I'm creating a vector of pointers to email objects and initially storing all of the unsubscribed emails in it. I then have a standard SQL loop in which I check whether each email is not in the unsubscribe vector and, if so, add it to the global send-email vector. My question is: is there a more efficient way of doing this? I have to search the unsub vector for every single email in my system, up to 50K of them. Is there a better structure for searching? And a better structure for maintaining a unique collection of values, perhaps one that would simply discard a value if it already contains it?
If your C++ Standard Library implementation supports it, consider using a std::unordered_set or a (non-standard) hash_set.
You can also use std::set, though its overhead might be higher (it depends on the cost of generating a hash for the object versus the cost of comparing two of the objects several times).
If you do use a node-based container like set or unordered_set, you also get the advantage that removal of elements is relatively cheap compared to removal from a vector.
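A minimal sketch of the unordered_set approach; the table contents here are hard-coded stand-ins for the rows read from MySQL:

#include <string>
#include <unordered_set>
#include <vector>

int main() {
    // Addresses pulled from the unsubscriptions table.
    std::unordered_set<std::string> unsubscribed = { "a@example.com", "c@example.com" };

    // Addresses pulled from the users table.
    std::vector<std::string> all_emails = { "a@example.com", "b@example.com", "c@example.com" };

    std::vector<std::string> send_list;
    for (const auto& email : all_emails)
        if (unsubscribed.find(email) == unsubscribed.end())  // average O(1) lookup
            send_list.push_back(email);                      // keep only subscribed addresses
}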
Tasks like this (set manipulations) are better left to the thing that is MEANT to execute them: the database!
E.g. something along the lines of:
SELECT email FROM all_emails_table e WHERE NOT EXISTS (
SELECT 1 FROM unsubscribed u where e.email=u.email
)
If you want an ALGORITHM, you can do this fast by retrieving both the list of emails AND the list of unsubscriptions as ORDERED lists. Then you go through the e-mail list (which is ordered) and, as you do, glide along the unsubscribe list. The idea is that you move forward by one in whichever list has the smaller current element (and skip an e-mail whenever the two match). This algorithm is O(M+N) instead of O(M*N) like your current one.
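For what it's worth, std::set_difference performs exactly this ordered-list walk; a short sketch with illustrative data:

#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

int main() {
    // Both lists must be sorted for this to work.
    std::vector<std::string> all_emails   = { "a@example.com", "b@example.com", "c@example.com" };
    std::vector<std::string> unsubscribed = { "a@example.com", "c@example.com" };

    // Everything in all_emails that is not in unsubscribed, in O(M+N).
    std::vector<std::string> send_list;
    std::set_difference(all_emails.begin(), all_emails.end(),
                        unsubscribed.begin(), unsubscribed.end(),
                        std::back_inserter(send_list));
}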
Or, you can use a hash map which maps from unsubscribed e-mail address to 1. Then you do find() calls on that map, which for correct hash implementations are O(1) for each lookup.
Unfortunately, there's no hash map in the C++ standard (yet); please see this SO question for existing implementations (a couple of ideas there are SGI's STL hash_map and Boost's and/or TR1's std::tr1::unordered_map).
One of the comments on that post indicates it will be added to the standard: "With this in mind, the C++ Standard Library Technical Report introduced the unordered associative containers, which are implemented using hash tables, and they have now been added to the Working Draft of the C++ Standard."
Store your email addresses in a std::set or use std::set_difference().
The best way to do this is within MySQL, I think. You can modify your users table schema with another column, a BIT column, for "is unsubscribed". Better yet: add a DATETIME column for "date unsubscribed" with a default value of NULL.
If using a BIT column, your query becomes something like:
SELECT * FROM `users` WHERE `unsubscribed` <> 0b1;
If using a DATETIME column, your query becomes something like:
SELECT * FROM `users` WHERE `date_unsubscribed` IS NULL;