I would like to find the maximum number of unique elements in a path (a path goes from the root to a leaf).
For example, my tree is as below:
      3
     / \
    1   2
   / \   \
  1   3   5
In the above tree, the answer is 3, because there are three paths:
3-1-1
3-1-3
3-2-5
and the unique elements of each path are:
3-1
3-1
3-2-5
Therefore the answer is 3 (the path 3-2-5 has three unique elements).
My idea of how to get the number is as follows.
First, I find all the paths from root to leaf: when the pointer reaches a leaf node, I print the path and count its unique elements, and I repeat this procedure until all nodes have been visited.
You can build a second, similarly shaped tree that holds the number of unique elements in each subpath (from the root to any node, root and leaves included). That tree can be built from root to leaves like so: the root value is always 1, since the path from root to root contains one unique element, and the value of any other node is either its parent's value (if the node's element already appears on the path) or one more (if it is new).
Example with your tree:
      3                  1
     / \                / \
    1   2      =>      2   2
   / \   \            / \   \
  1   3   5          2   2   3
The value of each leaf is the number of unique elements from the root to that leaf.
Although you could keep it for subsequent uses, you do not actually need to build that tree. You only need to perform a depth-first traversal while keeping track of the unique elements in the current subpath in some data structure, say a vector. Since you want the maximum number of unique elements, you record the size of that vector whenever you hit a leaf.
The data structure can be something other than a vector, depending on what your elements are. You could use an ordered set, which is equivalent to keeping a vector sorted. If you can hash your elements, you can use a hash set (std::unordered_set in C++11). If your elements are simple integers whose values all fall within a relatively small range, you can use a vector of booleans instead of a hash set: initially the vector holds N booleans set to false, N being the size of the range your integers are in. Instead of adding and deleting elements, you toggle the booleans at the corresponding indices.
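For illustration, here is a minimal C++ sketch of that traversal using std::unordered_set (the Node type and function names are made up for the example, not taken from the question):

```cpp
#include <algorithm>
#include <unordered_set>

struct Node {
    int value;
    Node* left = nullptr;
    Node* right = nullptr;
};

// Depth-first traversal. `seen` holds the distinct values on the current
// root-to-node path; `best` tracks the largest count observed at any leaf.
void dfs(const Node* node, std::unordered_set<int>& seen, int& best) {
    if (!node) return;
    bool inserted = seen.insert(node->value).second;  // false if already on the path
    if (!node->left && !node->right)                  // leaf: record the path's size
        best = std::max(best, static_cast<int>(seen.size()));
    dfs(node->left, seen, best);
    dfs(node->right, seen, best);
    if (inserted) seen.erase(node->value);            // backtrack
}

int maxUniquePathElements(const Node* root) {
    std::unordered_set<int> seen;
    int best = 0;
    dfs(root, seen, best);
    return best;
}
```

On the example tree this returns 3, reached on the path 3-2-5. Swapping the hash set for the vector-of-booleans trick is just a matter of replacing insert/erase with toggling indices.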
Could you please help me with a data structure that allows O(log N) (or at least O(sqrt N)) operations for the following:
Insert an item having ID (int64_t) and health (double)
Remove an item by ID
Find an item that is weighted random by health
The preferred language is C++ or C. By weighted random I mean the following:
Consider totalHealth=Sum(health[0], health[1], ..., health[N-1]). I need a fast (as described above) operation that is equivalent to:
Compute const double atHealth = rand_uint64_t()*totalHealth/numeric_limits<uint64_t>::max();
Iterate over i=0 to N-1 to find the first i such that Sum(health[0], health[1], ..., health[i]) >= atHealth
Constraints: health[i] > 0, rand_uint64_t() returns a uniformly distributed integer value between 0 and numeric_limits<uint64_t>::max().
What I have tried so far is a C++ unordered_map, which allows quick (Θ(1)) insertion and removal by ID, but operation #3 is still linear in N, as described in my pseudo-code above.
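Roughly, my current linear version looks like this (a sketch with my own naming; the map is just ID -> health):

```cpp
#include <cstdint>
#include <limits>
#include <unordered_map>

std::unordered_map<int64_t, double> items;  // ID -> health

// Θ(1) insert/remove by ID is easy with the map; the weighted pick below
// is the O(N) part I want to speed up.
int64_t pickWeighted(double totalHealth, uint64_t r) {
    const double atHealth = static_cast<double>(r) * totalHealth /
        static_cast<double>(std::numeric_limits<uint64_t>::max());
    double cum = 0.0;
    for (const auto& kv : items) {  // linear scan over all items
        cum += kv.second;
        if (cum >= atHealth) return kv.first;
    }
    return -1;  // not reached if totalHealth stays in sync with the map
}
```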
Your help is very appreciated!
I can't think of a way to do it with the existing STL containers, but I can think of a way if you're willing to code up your own binary tree. The trick is that each node maintains the total health of all the nodes to its left (it doesn't need to worry about nodes to its right, as you'll see below). Then, if you walk the tree in ID order, you can also compute the "cumulative health", also in ID order, in log(n) time. So the tree is sorted by both ID and cumulative health, and you can do lookups in log(n) time either by ID or by cumulative health. For example, consider a very simple tree like the following:
            ID: 8
            h: 10
            chl: 15
     +--------|--------+
     |                 |
  ID: 4             ID: 10
  h: 15             h: 7
  chl: 0            chl: 0
In the above, h is the health of the node and chl is the cumulative health of all nodes to its left. So the total health of all nodes in the above is 15 + 10 + 7 = 32 (I assume you maintain that count separately, though you could instead track the cumulative health of nodes to the right and then you wouldn't need to). Let's look at 3 cases:
You compute an atHealth < 15. Then at the root you can see that your value is less than the chl, so you go left and end up at the correct leaf.
You compute an atHealth >= 15 and < 25. You know it's >= 15, so you don't go left at the root; the root has health 10, and 15 + 10 means the cumulative health at that node covers 15 up to 25, so you're good.
You compute an atHealth >= 25. Every time you visit a node and go right, you must add the chl and h of the node you were at, to keep computing cumulative health as you walk the tree. So you know you're carrying 15 + 10 = 25 when you go right, and you'll add that to the h or chl of any node you encounter after that. Thus you quickly find that the node to the right is the correct one.
When you insert a new node, you add its health to the chl of every node whose left subtree you descend into on the way down, and when you remove a node you walk back up the tree subtracting its health. Inserts and deletions are thus still O(log(n)), and lookups are also log(n), whether by ID or by atHealth.
Things obviously get more complicated if you want to maintain a balanced tree, but it's still doable.
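To make the walk concrete, here is a minimal unbalanced-BST sketch of the idea (the names are illustrative; balancing and removal are omitted):

```cpp
#include <cstdint>

struct Node {
    int64_t id;
    double  health;
    double  chl = 0.0;  // total health of the left subtree
    Node*   left = nullptr;
    Node*   right = nullptr;
};

// Find the node whose cumulative-health slice contains atHealth,
// mirroring the three cases above.
Node* findByHealth(Node* node, double atHealth) {
    while (node) {
        if (atHealth < node->chl) {
            node = node->left;                     // target lies in the left subtree
        } else if (atHealth < node->chl + node->health) {
            return node;                           // falls within this node's slice
        } else {
            atHealth -= node->chl + node->health;  // skip left subtree + this node
            node = node->right;
        }
    }
    return nullptr;  // only if atHealth >= total health
}

// Insert by ID, maintaining the invariant: every node whose left subtree
// we descend into gains the new node's health in its chl.
void insert(Node*& root, Node* fresh) {
    Node** link = &root;
    while (*link) {
        Node* cur = *link;
        if (fresh->id < cur->id) {
            cur->chl += fresh->health;  // fresh will land in cur's left subtree
            link = &cur->left;
        } else {
            link = &cur->right;
        }
    }
    *link = fresh;
}
```

On the example tree, atHealth = 20 falls in the root's own slice (15 to 25), and atHealth = 30 walks right carrying 25 and lands on the node with ID 10.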
I implemented a B-tree that only accepts unique keys. I am trying to optimize the code so that each node stores the maximum number of keys.
For example, if I am inserting 1, 2, 3, 4, 5, 6, ... in sequential order (t = 3), the node will be split after inserting 5, resulting in:

     (3)
    /   \
(1|2)   (4|5)
Since the left child node can't take any more numbers, that node will forever contain at most 2 keys (if I keep adding keys at the end).
If I keep adding keys in order, the result may look like this:

                 (9)
               /     \
         (3|6)        (12|15)
        /  |  \       /   |   \
   (1|2) (4|5) (7|8) (10|11) (13|14) (16|17)
I thought about borrowing a neighbour's key, or calculating where to split instead of always splitting the full node in the middle, but it seems like doing so can affect the balance of the tree.
The reason I am doing this is to keep memory usage as low as possible for limited embedded hardware.
If there is a better data structure recommendation, please let me know.
Any suggestions would be great!
Thank you!
I need to create a data structure of integers which supports the following operations -
Insert item at given position (Add item position)
Delete item from given position (Delete position)
Select any given position, i.e. move the head to it (Select position)
Randomly shuffle the items (Shuffle)
I need to maintain one head, which is represented by (). See the example for more details.
Ex-
let's say my initial state is
(1) 2 3 4 5
Where () represents my current head
After Add 6 2
state - (1) 6 2 3 4 5
After Delete 5
state - (1) 6 2 3 5
After Select 3
state - 1 6 (2) 3 5
After shuffle
state - 5 (2) 6 1 3
Shuffle will shuffle all the items randomly, but will preserve the head.
Create bins of size 2k each, and have n/k of those (so the total allocated space is 2n).
Each "bin" will know its start index and end index.
When inserting, first find the relevant bin; this takes O(n/k) time using linear search. Add the element to the bin (assume it's not full for the moment), shifting to the right all the elements after it in that specific bin. Then update the start/end indices of all bins after the one you just modified. This takes O(n/k + k) time in total.
Deletion is similar to insertion, and is again done in O(n/k + k) time.
Select: find the relevant bin (again, linear search gives you O(n/k), and the item's place within the bin is found in O(1)). Note, however, that you can find the bin by binary search on the start indices in O(log(n/k)) time, so selection is O(log(n/k)).
Shuffle using Fisher-Yates in O(n).
In addition, when a bin is full, just reallocate the entire array and, if needed, increase the bin size. This operation is O(n), but it happens only after at least O(k) insertions, so in amortized terms it still adds O(n/k) per insertion.
All we have to do now is find the optimal k, i.e. the one minimizing n/k + k:

f(k) = n/k + k,  0 < k <= n
df/dk = -n/k^2 + 1 = 0
k^2 = n
k = sqrt(n)
So the optimal choice is k = sqrt(n), and we get the complexities:
Insertion: O(sqrt(n))
Deletion: O(sqrt(n))
Selection: O(log(sqrt(n))) = O(log(n))
Shuffle: O(n)
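A rough C++ sketch of the bin layout (illustrative names; growing/reallocating full bins and the head pointer are left out for brevity):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct BinnedList {
    std::vector<std::vector<int>> bins;  // each bin holds up to 2k items
    std::vector<std::size_t> start;      // global index of each bin's first item

    // Locate (bin, offset) of a 0-based global position: O(log(n/k)).
    std::pair<std::size_t, std::size_t> locate(std::size_t pos) const {
        auto it = std::upper_bound(start.begin(), start.end(), pos);
        std::size_t b = static_cast<std::size_t>(it - start.begin()) - 1;
        return {b, pos - start[b]};
    }

    void insert(std::size_t pos, int value) {
        auto [b, off] = locate(pos);
        bins[b].insert(bins[b].begin() + off, value);       // O(k) in-bin shift
        for (std::size_t i = b + 1; i < start.size(); ++i)  // O(n/k) index fixup
            ++start[i];
    }

    void erase(std::size_t pos) {
        auto [b, off] = locate(pos);
        bins[b].erase(bins[b].begin() + off);
        for (std::size_t i = b + 1; i < start.size(); ++i)
            --start[i];
    }

    int select(std::size_t pos) const {
        auto [b, off] = locate(pos);
        return bins[b][off];
    }
};
```

Shuffle can then be done by flattening the bins into one array, running std::shuffle (a Fisher-Yates implementation) over it while remembering which element is the head, and rebuilding the bins, all in O(n).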
I need to keep track of indexes in a large text file. I have been keeping a std::map of indexes and accompanying data as a quick hack. If the user is on character 230,400 in the text, I can display any meta-data for the text.
Now that my maps are getting larger, I'm hitting some speed issues (as expected).
For example, if the text is modified at the beginning, I need to increment the indexes after that position in the map by one, an O(N) operation.
What's a good way to change this to O(log N) complexity? I've been looking at AVL arrays, which are close.
I'm hoping for O(log n) time for updates and searches. For example, if the user is on character 500,000 in the text array, I want to very quickly find if there is any meta data for that character.
(Forgot to add: The user can add meta data whenever they like)
Easy. Make a binary tree of offsets.
The value of any offset is computed by traversing the tree from the leaf to the root adding offsets any time a node is a right child.
Then, if you add text early in the file, you only need to update the offsets for nodes which are ancestors of the offsets that change. Say you added text before the very first offset: you add the number of characters inserted to the root node, and now half of your offsets have been corrected. Then traverse to the left child and add the count again; now 3/4 of the offsets have been updated. Continue traversing left children, adding the count, until all the offsets are updated.
@OP:
Say you have a text buffer with 8 characters, and 4 offsets into the odd bytes:
the tree:           5
                  /   \
                 3     2
                / \   / \
               1   0 0   0

sum of right
children (indices):  1   3   5   7
Now say you inserted 2 bytes at offset 4. The buffer was:
01234567
Now it's:
0123xx4567
So you modify just the nodes that dominate the parts of the array that changed. In this case, only the root node needs to be modified:
the tree:           7
                  /   \
                 3     2
                / \   / \
               1   0 0   0

sum of right
children (indices):  1   3   7   9
The summation rule: walking from leaf to root, I add to my own value the value of my parent whenever I am that parent's right child.
To find whether there is an index at my current location, I start at the root and ask: is this node's offset (plus what I've accumulated so far) greater than my location? If yes, I traverse left and add nothing. If no, I traverse right and add the node's value to my running sum. If at the end of the traversal my sum is equal to my location, then yes, there is an annotation. You can do a similar traversal with a minimum and a maximum index to find the node that dominates all the indices in that range, thereby finding all the indices into the text I'm displaying.
Oh, and this is just a toy example. In reality you need to periodically rebalance the tree, otherwise, if you keep adding new indices in just one part of the file, you can end up with a tree that is way out of balance, and worst-case performance would no longer be O(log2 n) but O(n). To keep the tree balanced you would need to implement a self-balancing binary tree such as a red/black tree. That would guarantee O(log2 n) performance, where n is the number of metadata entries.
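For illustration, here is a toy C++ sketch of both traversals described above (OffsetNode is a made-up type, and rebalancing is omitted):

```cpp
struct OffsetNode {
    long offset;  // relative offset, per the summation rule above
    OffsetNode* left = nullptr;
    OffsetNode* right = nullptr;
};

// Is there an annotation exactly at `location`? Walk from the root: going
// left adds nothing; going right consumes the node's offset.
bool hasAnnotation(const OffsetNode* node, long location) {
    long sum = 0;
    while (node) {
        if (sum + node->offset > location) {
            node = node->left;        // this offset overshoots: look left
        } else {
            sum += node->offset;      // take this offset and look right
            node = node->right;
        }
    }
    return sum == location;           // exact hit means an annotation exists
}

// Text of length `delta` inserted at buffer position `pos`: every index
// >= pos must grow by delta. Bumping a node's offset shifts that node and
// its whole right subtree, so bump it and keep checking the left subtree.
void shiftFrom(OffsetNode* node, long pos, long delta, long sum = 0) {
    if (!node) return;
    if (sum + node->offset >= pos) {
        node->offset += delta;                    // shifts node + right subtree
        shiftFrom(node->left, pos, delta, sum);   // left side may still need it
    } else {
        shiftFrom(node->right, pos, delta, sum + node->offset);
    }
}
```

On the example tree, shiftFrom(root, 4, 2) touches only the root (5 becomes 7), exactly as in the second diagram.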
Don't store indices! There's no way to do that and simultaneously have performance better than O(n): add a character at the beginning of the array and you'll have to increment n - 1 indices; there's no way around it.
But if you store substring lengths instead, you'd only have to change one per level of the tree structure, bringing you down to O(log n). My (untested) suggestion would be to use a Rope with metadata attached to the nodes; you might need to play around with it a bit, but I think it's a solid foundation.
Hope that helps!
Given a linked list of size T, select the first 2n nodes and delete the first n nodes from them; then do it for the next 2n nodes, and so on...
For example-
Let's consider a linked list of size 7:
`1->2->3->4->5->6->7`
If n = 2, the desired output is :
`1->2->5->6->7`
I didn't understand what this problem is actually asking. Could somebody help me understand the problem?
EDIT: Adding C and C++ tags so that this may reach more eyeballs; and of course those are the only two languages allowed in the interview itself.
That actually looks like it should say:
Given a linked list of size T, select the first 2n nodes and delete the last n nodes from them; then do it for the next 2n nodes, and so on...
or:
Given a linked list of size T, select the first 2n nodes and keep the first n nodes from them; then do it for the next 2n nodes, and so on...
That would mean: select 1,2,3,4, then delete 3,4 (or keep 1,2, which is the same thing). Then select 5,6,7,8: not possible, so stop.
I think it's even simpler than @paxdiablo indicates...
do
    take n
    skip n
until you run out of elements to take or skip
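A quick C++ sketch of that loop (ListNode and the function names are mine; to match the 1->2->5->6->7 example, an incomplete trailing group is left untouched):

```cpp
struct ListNode {
    int value;
    ListNode* next = nullptr;
};

// True if at least `count` nodes remain, starting from `node`.
static bool hasAtLeast(const ListNode* node, int count) {
    for (int i = 0; i < count; ++i) {
        if (!node) return false;
        node = node->next;
    }
    return true;
}

// For each full group of 2n nodes: keep the first n, delete the next n.
void keepNDeleteN(ListNode* head, int n) {
    if (!head || n <= 0) return;
    ListNode* cur = head;
    while (hasAtLeast(cur, 2 * n)) {
        for (int i = 1; i < n; ++i) cur = cur->next;  // last kept node of group
        for (int i = 0; i < n; ++i) {                 // unlink and free next n
            ListNode* doomed = cur->next;
            cur->next = doomed->next;
            delete doomed;
        }
        cur = cur->next;  // first node of the next group (may be null)
    }
}
```

With 1->2->3->4->5->6->7 and n = 2 this leaves 1->2->5->6->7.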