Is it possible to create an unsegmented table on one node in a K-1 safe cluster in Vertica? - database-replication

Our project requirement is to have certain tables present on only one or two of the currently available 5 nodes.
Is it possible to keep the tables on a specific node in a K-1 safe cluster?

You can never have all of a single data set exist on only one node and still have K-1 safety. The idea of K-1 safety is that a node can go down without a reduction in data availability. In fact, Vertica will never allow this to happen: if data is not available on a primary or buddy node, the cluster will shut down.
You can, however, unsegment a projection onto specific nodes. Quite honestly, though, this is not a good idea at all. You're going to create hot spots (concentrated data, concentrated workload). You're also going to create situations where, if a user is connected to another node, more data will likely have to travel over the private network to reach the initiator.
I've never seen anyone do this in practice and I'd caution against it. If one node in your cluster is slow, it could impact the whole cluster's performance.
To answer your question, though... to unsegment to specific nodes, this is the clause for the CREATE PROJECTION command.
UNSEGMENTED { NODE node | ALL NODES }
Just create two projections, one for each node; you need a minimum of two nodes to satisfy k-safety. The k-safety check won't happen until the first insert (or until you try to drop other existing projections before enough new ones have been refreshed to satisfy k-safety).
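For example, a sketch with hypothetical schema, table, and node names (adjust to your own cluster; the projections must be refreshed before Vertica will let you drop older ones):

```sql
-- Hypothetical table/node names.
-- Two identical unsegmented projections on two specific nodes
-- give the table a buddy copy, which satisfies K=1.
CREATE PROJECTION public.sales_n1 AS
    SELECT * FROM public.sales
    UNSEGMENTED NODE v_mydb_node0001;

CREATE PROJECTION public.sales_n2 AS
    SELECT * FROM public.sales
    UNSEGMENTED NODE v_mydb_node0002;

SELECT REFRESH('public.sales');
```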
Only do this if you really know what you are doing, though. If this cluster is going to grow and get more workload, doing this type of config is a management nightmare.

Related

What is the most efficient data structure for designing a PRIM algorithm?

I am designing a graph in C++ using a hash table for its elements. The hash table uses open addressing, and the graph has no more than 50,000 edges. I also implemented Prim's algorithm to find the minimum spanning tree of the graph. My implementation creates storage for the following data:
A table named Q to put there all the nodes in the beginning. In every loop, a node is visited and in the end of the loop, it's deleted from Q.
A table named Key, one for each node. The key is changed when necessary (at least one time per loop).
A table named Parent, one for each node. In each loop, a new element is inserted in this table.
A table named A. The program stores here the final edges of the minimum spanning tree. It's the table that is returned.
What would be the most efficient data structure to use for creating these tables, assuming the graph has 50,000 edges?
Can I use arrays?
I fear that the elements of each array will be far too many. I don't even consider using linked lists, of course, because accessing each element would take too much time. Could I use hash tables?
But again, the elements are too many. My algorithm works well for graphs consisting of a few nodes (10 or 20), but I am sceptical about graphs consisting of 40,000 nodes. Any suggestion is much appreciated.
(Since the comments were getting a bit long:) The only part of the problem that gets ugly at very large sizes is that every node not yet selected has a cost, and you need to find the one with the lowest cost at each step, while executing each step reduces the cost of a few effectively random nodes.
A priority queue is perfect when you want to keep track of lowest cost. It is efficient for removing the lowest cost node (which you do at each step). It is efficient for adding a few newly reachable nodes, as you might on any step. But in the basic design, it does not handle reducing the cost of a few nodes that were already reachable at high cost.
So (having frequent need for a more functional priority queue), I typically create a heap of pointers to objects and in each object have an index of its heap position. The heap methods all do a callback into the object to inform it whenever its index changes. The heap also has some external calls into methods that might normally be internal only, such as the one that is perfect for efficiently fixing the heap when an existing element has its cost reduced.
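A minimal sketch of such an indexed heap (all names here are my own, not from any library): each node id keeps a back-pointer to its heap position, so decrease_key can find its entry and sift it up in O(log n).

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Indexed binary min-heap sketch for Prim's algorithm. Each node id
// (0..n-1) stores its current position in the heap, so decrease_key
// can locate the entry without searching.
class IndexedMinHeap {
public:
    static constexpr std::size_t NPOS = static_cast<std::size_t>(-1);

    explicit IndexedMinHeap(std::size_t n) : pos_(n, NPOS) {}

    bool empty() const { return heap_.empty(); }
    bool contains(std::size_t id) const { return pos_[id] != NPOS; }

    void push(std::size_t id, double key) {
        heap_.push_back({id, key});
        pos_[id] = heap_.size() - 1;
        sift_up(heap_.size() - 1);
    }

    // Lower the key of an element already in the heap.
    void decrease_key(std::size_t id, double key) {
        std::size_t i = pos_[id];
        if (key < heap_[i].key) {
            heap_[i].key = key;
            sift_up(i);
        }
    }

    // Remove and return the id with the smallest key.
    std::size_t pop_min() {
        std::size_t id = heap_[0].id;
        swap_entries(0, heap_.size() - 1);
        heap_.pop_back();
        pos_[id] = NPOS;
        if (!heap_.empty()) sift_down(0);
        return id;
    }

private:
    struct Entry { std::size_t id; double key; };
    std::vector<Entry> heap_;
    std::vector<std::size_t> pos_;  // pos_[id] = index of id inside heap_

    void swap_entries(std::size_t a, std::size_t b) {
        std::swap(heap_[a], heap_[b]);
        pos_[heap_[a].id] = a;      // keep the back-pointers in sync
        pos_[heap_[b].id] = b;
    }
    void sift_up(std::size_t i) {
        while (i > 0 && heap_[i].key < heap_[(i - 1) / 2].key) {
            swap_entries(i, (i - 1) / 2);
            i = (i - 1) / 2;
        }
    }
    void sift_down(std::size_t i) {
        for (;;) {
            std::size_t l = 2 * i + 1, r = 2 * i + 2, m = i;
            if (l < heap_.size() && heap_[l].key < heap_[m].key) m = l;
            if (r < heap_.size() && heap_[r].key < heap_[m].key) m = r;
            if (m == i) break;
            swap_entries(i, m);
            i = m;
        }
    }
};
```

The callback-into-the-object variant described above works the same way; storing the position in a parallel array keyed by node id is just the simplest form of it.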
I just reviewed the documentation for the std one
http://en.cppreference.com/w/cpp/container/priority_queue
to see if the features I always want to add were there in some form I hadn't noticed before (or had been added in some recent C++ version). So far as I can tell, no. Most real-world uses of a priority queue (certainly all of mine) need minor extra features that I have no clue how to tack onto the standard version. So I have needed to rewrite it from scratch, including the extra features. But that isn't actually hard.
The method I use has been reinvented by many people (I was doing this in C in the 1970s, and I wasn't the first). A quick Google search found one of many places where my approach is described in more detail than I have given here.
http://users.encs.concordia.ca/~chvatal/notes/pq.html#heap

data structure advice on c++

I am looking for a data structure in C++ and need some advice.
I have nodes; every node has a unique_id and a group_id:
1 1.1.1.1
2 1.1.1.2
3 1.1.1.3
4 1.1.2.1
5 1.1.2.2
6 1.1.2.3
7 2.1.1.1
8 2.1.1.2
I need a data structure to answer those questions:
what is the group_id of node 4
give me list (probably vector) of unique_id's that belong to group 1.1.1
give me list (probably vector) of unique_id's that belong to group 1.1
give me list (probably vector) of unique_id's that belong to group 1
Is there a data structure that can answer these questions (and what is the time complexity of insertion and querying), or should I implement one myself?
I would appreciate an example.
EDIT:
At the beginning, I need to build this data structure. Most of the activity is reading by group_id; insertions will happen, but less often than reads.
Time complexity is more important than memory usage.
To me, hierarchical data like the group ID calls for a tree structure. (I assume that for 500 elements this is not really necessary, but it seems natural and scales well.)
Each element in the first two levels of the tree would just hold vectors (if they come ordered) or maps (if they come un-ordered) of sub-IDs.
The third level in the tree hierarchy would hold pointers to leaves, again in a vector or map, which contain the fourth group ID part and the unique ID.
Questions 2-4 are easily and quickly answered by navigating the tree.
For question 1 one needs an additional map from unique IDs to leaves in the tree; each element inserted into the tree also has a pointer to it inserted into the map.
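As a sketch of this two-index layout (type and member names are illustrative, not from the question, and the group ID is assumed to have four numeric parts): a nested map keyed by the group-ID parts, plus a hash map from unique_id back to its group.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Two indexes over the same data: a tree keyed by group-ID parts and a
// hash map keyed by unique_id.
struct GroupIndex {
    // level1 -> level2 -> level3 -> level4 -> unique_id
    std::map<int, std::map<int, std::map<int, std::map<int, int>>>> tree;
    std::unordered_map<int, std::string> group_of;  // unique_id -> "a.b.c.d"

    void insert(int uid, int a, int b, int c, int d) {
        tree[a][b][c][d] = uid;
        std::ostringstream os;
        os << a << '.' << b << '.' << c << '.' << d;
        group_of[uid] = os.str();
    }

    // Question 1: the group_id of a node, in O(1) on average.
    std::string group(int uid) const { return group_of.at(uid); }

    // Questions 2-4: every unique_id under a prefix such as 1, 1.1 or
    // 1.1.1 (-1 means "any value at this level").
    std::vector<int> members(int a, int b = -1, int c = -1) const {
        std::vector<int> out;
        auto itA = tree.find(a);
        if (itA == tree.end()) return out;
        for (const auto& [bb, m2] : itA->second) {
            if (b != -1 && bb != b) continue;
            for (const auto& [cc, m3] : m2) {
                if (c != -1 && cc != c) continue;
                for (const auto& kv : m3) out.push_back(kv.second);
            }
        }
        return out;
    }
};
```

Navigating the tree prunes everything outside the requested prefix, so the prefix queries touch only the matching subtree.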
First of all, if you are only going to have a small number of nodes, it probably makes sense not to mess with advanced data structures; a simple linear search could be sufficient.
Next, it looks like a good job for SQL, so it may be a good idea to incorporate the SQLite library into your app. But even if you really want to do it without SQL, it is still a good hint: what you need are two index trees to support quick searching through your array. The complexity (using balanced trees) will be logarithmic for all operations.
Depends...
How often do you insert? Or do you mostly read?
How often do you access by Id or GroupId?
With a maximum of 500 nodes I would put them in a simple vector where the id is the offset into the array (if the ids are indeed as shown). The group search can then be implemented by iterating over the array and comparing the partial group-ids.
If this is too expensive, you access the structure a lot and need very high performance, or you do many inserts, I would implement a tree with a hash map for the ids.
If the data is stored in a database, you may use a SELECT with CONNECT BY, if your system supports it, and query the information directly from the DB.
Sorry for not providing a clear answer, but the solution depends on too many factors ;-)
Sounds like you need a container with two separate indexes on unique_id and group_id. Question 1 will be handled by the first index, Questions 2-4 will be handled by the second.
Maybe take a look at Boost Multi-index Containers Library
I am not sure of the perfect data structure for this, but I would use a map keyed on unique_id.
An std::unordered_map gives you average O(1) lookup for question 1 (an std::map gives O(log n) lookup, insertion, and deletion). The problem comes with questions 2, 3 and 4, where the complexity will be O(n), with n the number of nodes.

Key Differences between Self-Adjusting Lists and Regular Linked List

I am in a data structures class (based on C++) and I must honestly say that my instructor has not taught us much about self-adjusting lists. Nonetheless, I've been assigned a project that requires implementing such a class.
The only note my instructor left for this project about self-adjusting lists was this:
A self-adjusting list is like a regular list, except that all insertions are performed at the front, and when an element is accessed by a search, it is moved to the front of the list, without changing the relative order of the other items. The elements with highest access probability are expected to be close to the front.
What I do not get about this is why all insertions must be performed at the front of the list. Wouldn't it be better to insert at the end considering that the data being inserted has been accessed zero times?
Also, are there any more key differences I should look out for? I cannot seem to find a good source online that goes in-depth about this topic.
What I do not get about this is why all insertions must be performed at the front of the list. Wouldn't it be better to insert at the end considering that the data being inserted has been accessed zero times?
Usually, recently added items are the most likely candidates for access. Moreover, inserting at the beginning is a constant-time operation.
For example, if you buy books and keep the latest one on top of the pile, it can be accessed most easily. If you search for and read an old book from the pile, it is brought to the top.
Of course you want to keep the latest book on top, even though you have never read it.
Also, are there any more key differences I should look out for? I cannot seem to find a good source online that goes in-depth about this topic.
Theoretically, the average and worst-case access times of such a list are the same as for a normal list (for a random node), but in practice the average access/search time is much faster.
If the number of nodes grows really large, a self-balancing BST (a red-black tree, for example) or a hash table would give better access times.
There are many other schemes used to keep the list self-adjusted:
For example:
Most recently used on the head (As you told)
Keep the list sorted by access count (a recently accessed node will not necessarily come to the front)
When a node other than the head node is accessed, swap it with the previous node.
Exact choice of strategy depends on your requirement and profiling in target environment is the best way to choose one over another.
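A minimal move-to-front sketch (names are illustrative, not from the assignment): insertions go to the front, and a successful search splices the found node out and relinks it at the head, leaving the relative order of the other items unchanged.

```cpp
// Self-adjusting singly linked list using the move-to-front heuristic.
struct SelfAdjustingList {
    struct Node { int value; Node* next; };
    Node* head = nullptr;

    // All insertions happen at the front, in O(1).
    void insert(int v) { head = new Node{v, head}; }

    // Returns true if v is present; if so, its node is now at the head.
    bool find(int v) {
        Node* prev = nullptr;
        for (Node* cur = head; cur; prev = cur, cur = cur->next) {
            if (cur->value == v) {
                if (prev) {                // splice out, relink at head
                    prev->next = cur->next;
                    cur->next = head;
                    head = cur;
                }
                return true;
            }
        }
        return false;
    }

    ~SelfAdjustingList() {
        while (head) { Node* n = head->next; delete head; head = n; }
    }
};
```

A failed search changes nothing, so rarely accessed items gradually drift toward the tail.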

Ideas for clustering implementation

I am implementing a clustering algorithm. I have a vector of Node objects to keep original copies and another vector of Cluster objects: Each Cluster object holds a vector of pointers to all the nodes it contains.
In every iteration two Clusters are supposed to be joined together according to their joining cost.
Right now I use a priority queue holding structs that contain a joining cost and pointers to both clusters. In every iteration I pop the one with the minimal cost and join the referenced clusters. I want to use a queue because I have a lot of data, and looping over all the objects to find a minimum cost is not practical.
I have implemented the merging by copying the data from one cluster to another and then removing one of the clusters. The problem is that my queue now contains a lot of entries that reference the removed cluster.
How would you implement it? Maybe there is a more clever way of joining the clusters? I am looking for a general idea for the implementation.
I have the following idea: add a field bool _isRemoved to the Cluster class and set it to true for every removed cluster. When you pop something from the queue, first check whether either of the two clusters has _isRemoved set to true. If so, just pop the next pair from the queue and continue processing.
I'm not a C++ expert, but the next problem might be that clusters that were removed but are still referenced continue to occupy memory. To avoid this, instead of storing pointers to clusters in the queue you can store cluster identifiers; identifiers can be quickly mapped to clusters using a map (a dictionary).
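A sketch of this lazy-deletion idea (the _isRemoved name follows the answer; everything else is illustrative): stale entries stay in the queue and are simply skipped when popped.

```cpp
#include <functional>
#include <queue>
#include <vector>

// A cluster that can be marked dead after a merge instead of having its
// queue entries removed.
struct Cluster {
    std::vector<int> nodes;
    bool _isRemoved = false;
};

// A candidate merge: the joining cost plus the two clusters involved.
struct Merge {
    double cost;
    Cluster* a;
    Cluster* b;
    bool operator>(const Merge& o) const { return cost > o.cost; }
};

// Min-heap on cost.
using MergeQueue =
    std::priority_queue<Merge, std::vector<Merge>, std::greater<Merge>>;

// Pops entries until one references two live clusters; stale entries
// (at least one cluster already removed) are discarded on the way.
// Returns false when the queue is exhausted.
bool pop_valid(MergeQueue& q, Merge& out) {
    while (!q.empty()) {
        Merge m = q.top();
        q.pop();
        if (!m.a->_isRemoved && !m.b->_isRemoved) {
            out = m;
            return true;
        }
    }
    return false;
}
```

Each stale entry is popped at most once, so the extra work is bounded by the number of entries ever pushed.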

Removing element from balanced KD-Tree of two dimensions

I want to remove an element from a balanced k-d tree so that the tree remains balanced, without rebuilding the whole tree. Is this possible? If yes, then how?
For the standard k-d tree, you can remove items but there is no re-balancing, so if you remove almost all the items you might just end up with a long thin unbalanced tree. Do you care? The tree doesn't actually get any deeper: it is unbalanced, but because it was balanced when it was created it should never be outrageously deep.
If deleting is all you care about, you could rebuild the tree from scratch when you have deleted half the items - this will mean that it never gets outrageously out of balance. Also, you could consider marking items as deleted instead of actually removing them, this can make some things easier.
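A sketch of this tombstone-plus-rebuild policy on a toy 2-d k-d tree (names are illustrative; the exact-match search assumes no duplicate coordinates along a splitting axis):

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

using Point = std::array<double, 2>;

// 2-d k-d tree where erase() only tombstones a node; a full rebuild is
// triggered once half of the stored points are dead.
struct KdTree {
    struct Node { Point p; bool deleted = false; int left = -1, right = -1; };
    std::vector<Node> nodes;
    int root = -1;
    std::size_t removed = 0;

    void build(std::vector<Point> pts) {
        nodes.clear();
        removed = 0;
        root = build_rec(pts, 0, pts.size(), 0);
    }

    // Tombstone an exact point; rebuild when half the tree is dead.
    void erase(const Point& p) {
        int i = find(root, p, 0);
        if (i < 0 || nodes[i].deleted) return;
        nodes[i].deleted = true;
        if (++removed * 2 >= nodes.size()) {     // 50% dead: rebuild
            std::vector<Point> alive;
            for (const auto& n : nodes)
                if (!n.deleted) alive.push_back(n.p);
            build(std::move(alive));
        }
    }

    std::size_t live() const { return nodes.size() - removed; }

private:
    // Balanced build: median split on alternating axes.
    int build_rec(std::vector<Point>& pts, std::size_t lo, std::size_t hi,
                  int axis) {
        if (lo >= hi) return -1;
        std::size_t mid = (lo + hi) / 2;
        std::nth_element(
            pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
            [axis](const Point& a, const Point& b) { return a[axis] < b[axis]; });
        int id = static_cast<int>(nodes.size());
        nodes.push_back({pts[mid]});
        nodes[id].left = build_rec(pts, lo, mid, 1 - axis);
        nodes[id].right = build_rec(pts, mid + 1, hi, 1 - axis);
        return id;
    }
    int find(int i, const Point& p, int axis) const {
        if (i < 0) return -1;
        if (nodes[i].p == p) return i;
        return p[axis] < nodes[i].p[axis] ? find(nodes[i].left, p, 1 - axis)
                                          : find(nodes[i].right, p, 1 - axis);
    }
};
```

With this policy the tree is always within a constant factor of its balanced size, and the rebuild cost amortizes over the deletions that triggered it.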
This is a special case of making data structures dynamic, or dynamization. At the cost of a factor of log n you can make data structures that you know how to rebuild, but not balance, dynamic, to cope with inserts as well as deletes. The basic idea is to build the dynamic data-structure as a collection of non-dynamic structures with wildly varying sizes, such as powers of two: most changes lead you to rebuild only one or two of the smaller non-dynamic structures, but at longer intervals you will have to take time out to rebuild the larger ones. A very loose example of this would be keeping daily records in a loose leaf notebook and using them to update a filing cabinet overnight. See for example http://wwwisg.cs.uni-magdeburg.de/ag/lehre/SS2009/GDS/slides/S12.pdf or http://www.mpi-inf.mpg.de/~mehlhorn/ftp/mehlhorn27.pdf.