Fastest way to count/ access DOMNode children using Xerces C++ - c++

I'm trying to figure out the fastest way to count the number of child elements of a Xerces C++ DOMNode object, as I'm trying to optimise the performance of a Windows application which uses the Xerces 2.6 DOMParser.
It seems most of the time is spent counting and accessing children. Our application needs to iterate every single node in the document to attach data to it using DOMNode::setUserData() and we were initially using DOMNode::getChildNodes(), DOMNodeList::getLength() and DOMNodeList::item(int index) to count and access children, but these are comparatively expensive operations.
A large performance improvement was observed when we switched to a different idiom: calling DOMNode::getFirstChild() to get the first child node, then calling DOMNode::getNextSibling() repeatedly, either to reach a child at a specific index or to count the siblings of the first child and so obtain the total child node count.
However, getNextSibling() remains a bottleneck in our parsing step, so I'm wondering whether there is an even faster way to traverse and access child elements using Xerces.

Yes, soon after I posted, I added code to store and manage the child count for each node, and this has made a big difference. The same nodes were being visited repeatedly and the child count was being recalculated every time. This is quite an expensive operation, as Xerces essentially rebuilds the DOM structure for that node to guarantee its liveness. We have our own object which encapsulates a Xerces DOMNode along with extra info that we need, and we use DOMNode::setUserData to associate our object with the relevant DOMNode. That now seems to be the last remaining bottleneck.

The problem with DOMNodeList is that it is really quite a simple list, so operations such as getLength() and item(i) cost O(n), as can be seen in the code, for example here for getLength():
XMLSize_t DOMNodeListImpl::getLength() const{
    XMLSize_t count = 0;
    if (fNode) {
        DOMNode *node = fNode->fFirstChild;
        while(node != 0){
            ++count;
            node = castToChildImpl(node)->nextSibling;
        }
    }
    return count;
}
Thus, DOMNodeList should not be used unless you expect the DOM tree to change while iterating: accessing an item is O(n), which makes iteration an O(n^2) operation - a disaster waiting to happen (i.e. with a big enough XML file).
Using DOMNode::getFirstChild() and DOMNode::getNextSibling() is a good enough solution for iteration:
DOMNode *child = docNode->getFirstChild();
while (child != nullptr) {
    // do something with the node
    ...
    child = child->getNextSibling();
}
This runs, as expected, in O(n).
One could also use DOMNodeIterator, but creating it requires the right DOMDocument, which is not always at hand when an iteration is needed.
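The difference between the two traversal styles can be made concrete without Xerces. The following is a minimal sketch that uses a hypothetical stand-in `Node` struct (not the Xerces `DOMNode` class) and counts how many sibling links each style follows. The names `itemAt`, `visitByIndex`, `visitBySibling` and `linkChildren` are invented for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a DOM node: only the two links the traversal needs.
struct Node {
    Node* firstChild = nullptr;
    Node* nextSibling = nullptr;
};

// Links the given nodes together as the children of parent, in order.
void linkChildren(Node& parent, std::vector<Node>& children) {
    parent.firstChild = children.empty() ? nullptr : &children[0];
    for (std::size_t i = 0; i + 1 < children.size(); ++i)
        children[i].nextSibling = &children[i + 1];
}

// Mimics a list-style item(i): walks index links from the first child.
// hops accumulates the number of links followed.
const Node* itemAt(const Node& parent, std::size_t index, std::size_t& hops) {
    const Node* n = parent.firstChild;
    while (n != nullptr && index > 0) {
        n = n->nextSibling;
        --index;
        ++hops;
    }
    return n;
}

// Visits every child through itemAt(i); returns the total links followed.
std::size_t visitByIndex(const Node& parent, std::size_t childCount) {
    std::size_t hops = 0;
    for (std::size_t i = 0; i < childCount; ++i)
        itemAt(parent, i, hops);
    return hops;
}

// Visits every child through the sibling links; returns the links followed.
std::size_t visitBySibling(const Node& parent) {
    std::size_t hops = 0;
    for (const Node* n = parent.firstChild; n != nullptr; n = n->nextSibling)
        ++hops;
    return hops;
}
```

For n children the indexed loop follows n(n-1)/2 links while the sibling walk follows n, which is the O(n^2) versus O(n) gap described above.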

Related

Should B-Tree nodes contain a pointer to their parent (C++ implementation)?

I am trying to implement a B-tree and from what I understand this is how you split a node:
Attempt to insert a new value V at a leaf node N
If the leaf node has no space, create a new node and pick a middle value of N. Move everything to the right of the middle value to the new node and leave everything to the left of it in the old node (shifted left to free up the right indices), then insert V into the appropriate one of the two nodes
Insert the middle value we picked into the parent node of N and also add the newly created node to the list of children of the parent of N (thus making N and the new node siblings)
If the parent of N has no free space, perform the same operation and along with the values also split the children between the two split nodes (so this last part applies only to non-leaf nodes)
Continue trying to insert the previous split's middle point into the parent until you reach the root and potentially split the root itself, making a new root
This brings me to the question - how do I traverse upwards, am I supposed to keep a pointer of the parent?
Because I can only know if I have to split the leaf node when I have reached it for insertion. So once I have to split it, I have to somehow go back to its parent and if I have to split the parent as well, I have to keep going back up.
Otherwise I would have to re-traverse the tree again and again each time to find the next parent.
Here is an example of my node class:
template<typename KEY, typename VALUE, int DEGREE>
struct BNode
{
    KEY Keys[DEGREE];
    VALUE Values[DEGREE];
    BNode<KEY, VALUE, DEGREE>* Children[DEGREE + 1];
    BNode<KEY, VALUE, DEGREE>* Parent;
    bool IsLeaf;
};
(Maybe I should not have an IsLeaf field and instead just check if it has any children, to save space)
Even if you don't use recursion or an explicit stack while going down the tree, you can still do it without parent pointers if you split nodes a bit sooner with a slightly modified algorithm, which has this key characteristic:
When encountering a node that is full, split it, even when it is not a leaf.
With this pre-emptive splitting algorithm, you only need to keep a reference to the immediate parent (not any other ancestor) to make the split possible, since now it is guaranteed that a split will not lead to another, cascading split more upwards in the tree. This algorithm requires that the maximum degree (number of children) of the B-tree is even (as otherwise one of the two split nodes would have too few keys to be considered valid).
See also Wikipedia which describes this alternative algorithm as follows:
An alternative algorithm supports a single pass down the tree from the root to the node where the insertion will take place, splitting any full nodes encountered on the way preemptively. This prevents the need to recall the parent nodes into memory, which may be expensive if the nodes are on secondary storage. However, to use this algorithm, we must be able to send one element to the parent and split the remaining U−2 elements into two legal nodes, without adding a new element. This requires U = 2L rather than U = 2L−1, which accounts for why some textbooks impose this requirement in defining B-trees.
The same article defines U and L:
Every internal node contains a maximum of U children and a minimum of L children.
For a comparison with the standard insertion algorithm, see also Will a B-tree with preemptive splitting always have the same height for any input order of keys?
You don't need parent pointers if all your operations start at the root.
I usually code the insert recursively, such that calling node.insert(key) either returns null or a new key to insert at its parent's level. The insert starts with root.insert(key), which finds the appropriate child and calls child.insert(key).
When a leaf node is reached, the insert is performed, and non-null is returned if the leaf splits. The parent then inserts the new internal key and returns non-null if it splits in turn, and so on. If root.insert(key) returns non-null, then it's time to make a new root.
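The recursive scheme just described can be sketched as follows. This is only an illustration under stated assumptions, not the answerer's actual code: it stores plain `int` keys, uses a hypothetical limit `MAX_KEYS = 3`, lets a node overflow by one key before splitting it, and requires C++17 for `std::optional`. The names `Split`, `splitNode`, `insertInto`, `BTree` and `inorder` are invented here.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

constexpr std::size_t MAX_KEYS = 3;  // illustrative limit, not from the question

struct Node {
    std::vector<int> keys;                        // kept sorted
    std::vector<std::unique_ptr<Node>> children;  // empty for a leaf
    bool isLeaf() const { return children.empty(); }
};

// What a split hands to the parent: the promoted key and the new right sibling.
struct Split {
    int middle;
    std::unique_ptr<Node> right;
};

// Splits an overfull node around its middle key.
Split splitNode(Node& n) {
    const std::size_t mid = n.keys.size() / 2;
    Split s{n.keys[mid], std::make_unique<Node>()};
    s.right->keys.assign(n.keys.begin() + mid + 1, n.keys.end());
    n.keys.resize(mid);
    if (!n.isLeaf()) {
        for (std::size_t i = mid + 1; i < n.children.size(); ++i)
            s.right->children.push_back(std::move(n.children[i]));
        n.children.resize(mid + 1);
    }
    return s;
}

// Inserts key into the subtree rooted at n; returns a Split only if n split.
std::optional<Split> insertInto(Node& n, int key) {
    const std::size_t i =
        std::lower_bound(n.keys.begin(), n.keys.end(), key) - n.keys.begin();
    if (n.isLeaf()) {
        n.keys.insert(n.keys.begin() + i, key);
    } else if (std::optional<Split> s = insertInto(*n.children[i], key)) {
        // The child split: absorb its middle key and its new sibling here.
        n.keys.insert(n.keys.begin() + i, s->middle);
        n.children.insert(n.children.begin() + i + 1, std::move(s->right));
    }
    if (n.keys.size() > MAX_KEYS) return splitNode(n);
    return std::nullopt;
}

struct BTree {
    std::unique_ptr<Node> root = std::make_unique<Node>();
    void insert(int key) {
        if (std::optional<Split> s = insertInto(*root, key)) {
            // The root split: grow the tree by one level.
            auto newRoot = std::make_unique<Node>();
            newRoot->keys.push_back(s->middle);
            newRoot->children.push_back(std::move(root));
            newRoot->children.push_back(std::move(s->right));
            root = std::move(newRoot);
        }
    }
};

// Collects all keys in sorted order (used to check the tree).
void inorder(const Node& n, std::vector<int>& out) {
    for (std::size_t i = 0; i < n.keys.size(); ++i) {
        if (!n.isLeaf()) inorder(*n.children[i], out);
        out.push_back(n.keys[i]);
    }
    if (!n.isLeaf()) inorder(*n.children.back(), out);
}
```

The call stack plays the role of the parent pointers: each level learns about a split in its child from the return value, so no node needs to store a link to its parent.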

How to store a tree on the disk and make add/delete/swap operations easy

All right, this question requires a bit of reading on your side. I'll try to keep this short and simple.
I have a tree (not a binary tree, just a tree) with data associated to each node (binary data, I don't know what they are AND I don't know how long they are)
Each node of the tree also has an index which isn't related to how it appears in the tree, to make it short it could be like that:
The index number represents the order the user WANTS the tree to be navigated and cannot be duplicated.
I need to store this structure in a file on the disk.
My problem is: how to design a flexible disk storing format that can make loading and working on the tree as easy as possible.
In fact the user should be allowed to
Create a child block to an element (and this should be easy enough, it's sufficient to add data to the file paying attention to avoiding duplicated indices)
Delete a child (I should prompt the user "do you want to delete all this node's children as well? or should I add its children to its parent?"). The tricky part about this is that deleting a node could also free up an index, and I can't let the user use that index again when adding another node (or the order he set could be messed up), I need to update the entire tree!
Swap an index with another one
I'm using C++ and Qt and by now I thought of a lot of structures with a lot of fields like this one
struct dataToBeStoredInTheFile
{
    long data_size;
    byte *data; //... the data here
    int index;
    int number_of_children;
    int *children_indices; // ... array of integers
};
this has the advantage to identify each node with its respective index, but it's highly slow when swapping indices between two nodes or deleting a node and updating each other node's index because you have to traverse all the nodes and all their "children_indices" arrays.
Would using something like an "hash" to identify each node be more flexible?
Should I use two indices, one for the position in the tree and one for the user's index? If you have any better idea to store the data, you're welcome
I would suggest using something like boost.serialization, then you don't have to worry about the actual format when save on disk, and can concentrate on effective in-memory solution.
Edit: Re-reading your question I see you are using Qt; in that case it should have its own serialization framework that you can use.
If it doesn't have to be a SINGLE file, you could use the file/directory structure to represent your tree, where each node corresponds to a single file (w/ a directory for each interior node). Maybe not the most efficient, but incredibly easy to do.
Again, if you have some flexibility on the number of files (but not as much as above), you could have one file for the tree structure (so that each node is a fixed size, simplifying its manipulation) and a separate one for storing node contents. To speed up working with the "content file", you could treat it the way a garbage collecting system would: just keep adding new/updated nodes on the end, marking old nodes as no longer in use, and periodically clearing things out.
Better yet, follow @JoachimPileborg's advice :)
I don't think you should use the user-specified index to identify the nodes, as that's not directly related to the way you're storing the tree, and you don't have an efficient way of accessing the nodes by index. You should either keep two indices for each node - the user-specified one, and another one that's implementation dependent; or maintain an array mapping the user-specified index to one you're using for the implementation.
Also, it might be better if you use a different structure to store the tree. For each node, store the following:
the index of the parent
the index of the leftmost son
the index of the left brother
the index of the right brother
This way adding a node and swapping two nodes could be done with some simple pointer manipulations (I don't mean explicit pointers - the indices are somewhat like pointers anyway). Deleting a node would still probably be slow as you have to visit all the children.
As a bonus, if you use this structure, every node has a fixed size (unlike with the linked list you're proposing). This means that you can access a node directly by seeking in the file.
You should also maintain the smallest index the user can use for new nodes - so, for example, even if the largest index was 5 and it was deleted, you still keep 6 as the next free index so 5 cannot be reused.
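The fixed-size record layout suggested above could look roughly like this. It is a sketch under assumptions: the field names, the `-1` "no node" convention, and the `dataOffset`/`dataSize` pair pointing into a separate content file are all hypothetical, and the records are written as raw bytes, so the file is only readable on a machine with the same endianness and struct layout.

```cpp
#include <cstdint>
#include <iostream>
#include <sstream>
#include <type_traits>
#include <utility>

// Hypothetical fixed-size node record; -1 means "no such node".
struct NodeRecord {
    std::int32_t userIndex;     // the order the user wants
    std::int32_t parent;        // slot of the parent
    std::int32_t firstChild;    // slot of the leftmost child
    std::int32_t leftSibling;   // slot of the left brother
    std::int32_t rightSibling;  // slot of the right brother
    std::int32_t reserved;      // keeps the 64-bit fields aligned
    std::int64_t dataOffset;    // where the payload starts in the content file
    std::int64_t dataSize;      // payload length in bytes
};
static_assert(std::is_trivially_copyable<NodeRecord>::value,
              "raw byte I/O needs a trivially copyable record");

// Byte position of a slot: every record has the same size, so this is a multiply.
std::streamoff slotOffset(std::int32_t slot) {
    return static_cast<std::streamoff>(slot) *
           static_cast<std::streamoff>(sizeof(NodeRecord));
}

bool writeRecord(std::ostream& out, std::int32_t slot, const NodeRecord& rec) {
    out.seekp(slotOffset(slot));
    out.write(reinterpret_cast<const char*>(&rec), sizeof rec);
    return static_cast<bool>(out);
}

bool readRecord(std::istream& in, std::int32_t slot, NodeRecord& rec) {
    in.seekg(slotOffset(slot));
    in.read(reinterpret_cast<char*>(&rec), sizeof rec);
    return static_cast<bool>(in);
}

// Swapping two user indices touches exactly two records; the tree links
// refer to slots, so nothing else in the file has to change.
bool swapUserIndex(std::iostream& file, std::int32_t a, std::int32_t b) {
    NodeRecord ra{}, rb{};
    if (!readRecord(file, a, ra) || !readRecord(file, b, rb)) return false;
    std::swap(ra.userIndex, rb.userIndex);
    return writeRecord(file, a, ra) && writeRecord(file, b, rb);
}
```

The functions take generic streams, so the same code works with a `std::fstream` opened in binary mode or with a `std::stringstream` for testing. The "next free user index" counter mentioned above would live in a small file header, which this sketch leaves out.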

Fastest way to traverse arbitrary depth tree for deletion?

For my own exercises I'm writing an XML-parser. To fill the tree I use a normal std::stack and push the current node on top after making it a child of the last top-node (should be depth-first?). So I now do the same for deletion of the nodes, and I want to know if there's a faster way.
Current code for deletion:
struct XmlNode{
    // ignore the rest of the node implementation for now
    std::vector<XmlNode*> children_;
};
XmlNode* root_ = new XmlNode;
// fill root_ with child nodes...
// and then those nodes with child nodes and so forth...
std::stack<XmlNode*> nodes_;
nodes_.push(root_);
while(!nodes_.empty()){
    XmlNode* node = nodes_.top();
    if(node->children_.size() > 0){
        nodes_.push(node->children_.back());
        node->children_.pop_back();
    }else{
        delete nodes_.top();
        nodes_.pop();
    }
}
Works totally fine but it kinda looks slow. So is there any faster / better / more common way to do this?
Don't go out of your way to do iteratively what can be easily done recursively, unless you can prove that the recursive version is either insufficient (e.g. stack overflows) or slower (which won't happen unless you start overflowing your stack, forcing the OS to either expand it or crash you).
In other words, in general, use iteration for linear structures, and recursion for tree structures.
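For comparison with the stack-based loop in the question, a recursive deletion might look like the sketch below. It reuses the `XmlNode` layout from the question; the function name `destroyTree` and the returned node count are additions made only so the behaviour can be checked.

```cpp
#include <cstddef>
#include <vector>

struct XmlNode {
    // rest of the node implementation omitted, as in the question
    std::vector<XmlNode*> children_;
};

// Recursively deletes node and everything below it.
// Returns the number of nodes freed (only for illustration).
std::size_t destroyTree(XmlNode* node) {
    if (node == nullptr) return 0;
    std::size_t freed = 1;
    for (XmlNode* child : node->children_)
        freed += destroyTree(child);  // children first, then the node itself
    delete node;
    return freed;
}
```

The recursion depth equals the depth of the XML document, so this relies on the nesting staying well within the available call stack.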
Compared to recursion, an iterative method was around 3 times slower on my machine. If you can be sure that your XML depth won't exceed a few hundred nestings (which I've never seen inside real-world XML documents), then recursion won't be a problem.
To iterate is human; to recurse, divine. :)

How to load/save C++ class instance (using STL containers) to disk

I have a C++ class representing a hierarchically organised data tree which is very large (~Gb, basically as large as I can get away with in memory). It uses an STL list to store information at each node plus iterators to other nodes. Each node has only one parent, but 0-10 children.
Abstracted, it looks something like:
struct node {
public:
    node_list_iterator parent; // iterator to a single parent node
    double node_data_array[X];
    map<int,node_list_iterator> children; // iterators to child nodes
};
class strategy {
private:
    list<node> tree; // hierarchically linked list of nodes
    struct some_other_data;
public:
    void build(); // build the tree
    void save(); // save the tree to disk
    void load(); // load the tree from disk
    void use(); // use the tree
};
I would like to implement the load() and save() to disk, and it should be fairly fast, however the obvious problems are:
I don't know the size in advance;
The data contains iterators, which are volatile;
My ignorance of C++ is prodigious.
Could anyone suggest a pure C++ solution please?
It seems like you could save the data in the following syntax:
File = Meta-data Node
Node = Node-data ChildCount NodeList
NodeList = sequence (int, Node)
That is to say, when serialized the root node contains all nodes, either directly (children) or indirectly (other descendants). Writing the format is fairly straightforward: just have a recursive write function starting at the root node.
Reading isn't that much harder. std::list<node> iterators are stable. Once you've inserted the root node, its iterator will not change, not even when inserting its children. Hence, when you're reading each node you can already set the parent iterator. This of course leaves you with the child iterators, but those are trivial: each node is a child of its parent. So, after you've read all nodes, you fix up the child iterators. Start with the second node, the first child (the first node was the root), and iterate to the last child. Then, for each child C, get its parent and add the child to its parent's collection. Now, this means that you have to set the int child IDs aside while reading, but you can do that in a simple std::vector parallel to the std::list<node>. Once you've patched all child IDs in the respective parents, you can discard the vector.
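The grammar above (Node = Node-data ChildCount NodeList) maps directly onto a pair of recursive functions. The sketch below is a simplification and makes several assumptions: it uses a hypothetical `TreeNode` that owns its children through `std::unique_ptr` instead of the list-plus-iterators layout from the question, holds a single `double` per node, and writes whitespace-separated text at the stream's default precision. A real implementation would more likely write binary data or raise the precision.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <memory>
#include <ostream>
#include <sstream>
#include <utility>

// Hypothetical simplified node: one value plus children keyed by an int id.
struct TreeNode {
    double value = 0.0;
    std::map<int, std::unique_ptr<TreeNode>> children;
};

// Node = Node-data ChildCount NodeList, NodeList = sequence (int, Node)
void writeNode(std::ostream& out, const TreeNode& node) {
    out << node.value << ' ' << node.children.size() << ' ';
    for (const auto& entry : node.children) {
        out << entry.first << ' ';
        writeNode(out, *entry.second);
    }
}

// Reads one node and, recursively, all of its descendants.
// Returns nullptr if the input is truncated or malformed.
std::unique_ptr<TreeNode> readNode(std::istream& in) {
    auto node = std::make_unique<TreeNode>();
    std::size_t childCount = 0;
    if (!(in >> node->value >> childCount)) return nullptr;
    for (std::size_t i = 0; i < childCount; ++i) {
        int id = 0;
        if (!(in >> id)) return nullptr;
        std::unique_ptr<TreeNode> child = readNode(in);
        if (!child) return nullptr;
        node->children[id] = std::move(child);
    }
    return node;
}
```

Because each node is written together with its child count, the reader never needs to know the total size in advance. With the list-based layout from the question, the parent and child iterators would be patched during or after reading as described above.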
You can use boost.serialization library. This would save entire state of your container, even the iterators.
boost.serialization is a solution, or IMHO, you can use SQLite + Visitor pattern to load and save these nodes, but it won't be easy as it sounds.
Boost Serialization has already been suggested, and it's certainly a reasonable possibility.
A great deal depends on how you're going to use the data -- the fact that you're using a multiway tree in memory doesn't mean you necessarily have to store it as a multiway tree on disk. Since you're (apparently) already pushing the limits of what you can store in memory, the obvious question is whether you're just interested in serializing the data so you can re-constitute the same tree when needed, or whether you want something like a database so you can load parts of the information into memory as needed, and update records as needed.
If you want the latter, some of your choices will also depend on how static the structure is. For example, if a particular node has N children, is that number constant or is it subject to change? If it's subject to change, is there a limit on the maximum number of children?
If you do want to be able to traverse the structure on disk, one obvious possibility would be as you write it out, substitute the file offset of the appropriate data in place of the iterator you're using in memory.
Alternatively, since it looks like (at least most of) the data in an individual node has a fixed size, you might create a database-like structure of fixed-size records, and in each record record the record numbers of the parent/children.
Knowing the overall size in advance isn't particularly important (offhand, I can't think of any way I'd use the size even if it was known in advance).
Actually, I think your best option is to move the entire data structure into database tables. That way you get the benefit of people much smarter than you (or me) having dealt with issues of serialization. It will also prevent you from having to worry about whether the structure can fit into memory.
I've answered something like this on SO before, so I will summarize:
1. Use a database.
2. Substitute file offsets for links (pointers).
3. Store the data without the tree structure, in records, as a database would.
4. Use XML to create the tree structure, using node names instead of links.
5. This would be soooo much easier if you used a database like SqLite or MySQL.
When you spend too much time on the "serialization" and less on the primary purpose of your project, you need to use a database.
If you're doing it for persistence then there are several solutions you can use from the web i.e. google "persist std::list" or you can roll your own using mmap to create a file backed memory area.

C++ design: How to cache most recent used

We have a C++ application for which we are trying to improve performance. We identified that data retrieval takes a lot of time, and want to cache data. We can't store all the data in memory as it is huge. We want to store up to 1000 items in memory. These items can be indexed by a long key. However, when the cache size goes over 1000, we want to remove the item that has not been accessed for the longest time, as we assume some sort of "locality of reference", that is, we assume that items in the cache that were recently accessed will probably be accessed again.
Can you suggest a way to implement it?
My initial implementation was to have a map<long, CacheEntry> to store the cache, and add an accessStamp member to CacheEntry which will be set to an increasing counter whenever an entry is created or accessed. When the cache is full and a new entry is needed, the code will scan the entire cache map and find the entry with the lowest accessStamp, and remove it.
The problem with this is that once the cache is full, every insertion requires a full scan of the cache.
Another idea was to hold a list of CacheEntries in addition to the cache map, and on each access move the accessed entry to the top of the list, but the problem was how to quickly find that entry in the list.
Can you suggest a better approach?
Thanks, splintor
Have your map<long,CacheEntry> but instead of having an access timestamp in CacheEntry, put in two links to other CacheEntry objects to make the entries form a doubly-linked list. Whenever an entry is accessed, move it to the head of the list (this is a constant-time operation). This way you will both find the cache entry easily, since it's accessed from the map, and are able to remove the least-recently used entry, since it's at the end of the list (my preference is to make doubly-linked lists circular, so a pointer to the head suffices to get fast access to the tail as well). Also remember to put in the key that you used in the map into the CacheEntry so that you can delete the entry from the map when it gets evicted from the cache.
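A minimal sketch of this map-plus-linked-list design is shown below. It departs from the answer in one respect: instead of hand-written links inside `CacheEntry`, it uses a `std::list` for the recency order and stores list iterators in the map, relying on `std::list::splice` to move an entry to the front in constant time. The class name `LruCache` and its member names are invented for this illustration, and `Value` stands in for `CacheEntry`.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <utility>

template <typename Value>
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {
        assert(capacity_ > 0);
    }

    // Returns a pointer to the cached value, or nullptr on a miss.
    // A hit moves the entry to the front (most recently used).
    Value* get(long key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second);  // constant time
        return &it->second->second;
    }

    // Inserts or updates an entry, evicting the least recently used one if full.
    void put(long key, Value value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = std::move(value);
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (order_.size() >= capacity_) {
            // The key stored in the list entry lets us erase from the map too.
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(key, std::move(value));
        index_[key] = order_.begin();
    }

    bool contains(long key) const { return index_.count(key) != 0; }
    std::size_t size() const { return order_.size(); }

private:
    using Entry = std::pair<long, Value>;
    std::size_t capacity_;
    std::list<Entry> order_;  // front = most recent, back = least recent
    std::map<long, typename std::list<Entry>::iterator> index_;
};
```

Storing the key inside each list entry is what the answer means by putting the map key into the `CacheEntry`: eviction starts from the tail of the list and needs the key to remove the matching map entry. List iterators stay valid across `splice`, so the map never has to be updated on a hit.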
Scanning a map of 1000 elements will take very little time, and the scan will only be performed when the item is not in the cache which, if your locality of reference ideas are correct, should be a small proportion of the time. Of course, if your ideas are wrong, the cache is probably a waste of time anyway.
An alternative implementation that might make the 'aging' of the elements easier, but at the cost of lower search performance, would be to keep your CacheEntry elements in a std::list (or use a std::pair<long, CacheEntry>). The newest element gets added at the front of the list, so elements 'migrate' towards the end of the list as they age. When you check if an element is already present in the cache, you scan the list (which is admittedly an O(n) operation as opposed to being an O(log n) operation in a map). If you find it, you remove it from its current location and re-insert it at the start of the list. If the list length extends over 1000 elements, you remove the required number of elements from the end of the list to trim it back to 1000 elements.
Update: I got it now...
This should be reasonably fast. A sketch follows; CachedItem and noncached_fetch() are assumed to be defined elsewhere.
#include <algorithm>
#include <map>
#include <vector>
// accesses contains a list of ids. The latest used id is in front(),
// the oldest id is in back().
std::vector<long> accesses;
std::map<long, CachedItem*> cache;
CachedItem* get(long id) {
    std::map<long, CachedItem*>::iterator hit = cache.find(id);
    if (hit != cache.end()) {
        // In cache.
        // Move id to front of accesses.
        std::vector<long>::iterator pos = std::find(accesses.begin(), accesses.end(), id);
        if (pos != accesses.begin()) {
            accesses.erase(pos);
            accesses.insert(accesses.begin(), id);
        }
        return hit->second;
    }
    // Not in cache, fetch and add it.
    CachedItem* item = noncached_fetch(id);
    accesses.insert(accesses.begin(), id);
    cache[id] = item;
    if (accesses.size() > 1000)
    {
        // Remove dead item (this drops the map entry only; the
        // CachedItem itself must be freed by whoever owns it).
        cache.erase(accesses.back());
        accesses.pop_back();
    }
    return item;
}
The inserts and erases may be a little expensive, but may also not be too bad given the locality (few cache misses). Anyway, if they become a big problem, one could change to std::list.
In my approach, you need a hash table to look up stored objects quickly and a linked list to maintain the order of last use.
When an object is requested:
1) Try to find the object in the hash table.
2.yes) If found (the value holds a pointer to the object's node in the linked list), move that node to the top of the linked list.
2.no) If not found and the cache is full, remove the last object from the linked list and also remove its data from the hash table. Then put the new object into the hash table and at the top of the linked list.
For example
Let's say we have a cache memory only for 3 objects.
The request sequence is 1 3 2 1 4.
1) Hash-table : [1]
Linked-list : [1]
2) Hash-table : [1, 3]
Linked-list : [3, 1]
3) Hash-table : [1,2,3]
Linked-list : [2,3,1]
4) Hash-table : [1,2,3]
Linked-list : [1,2,3]
5) Hash-table : [1,2,4]
Linked-list : [4,1,2] => 3 out
Create a std::priority_queue<map<int, CacheEntry>::iterator>, with a comparer for the access stamp. For an insert, first pop the last item off the queue and erase it from the map. Then insert the new item into the map, and finally push its iterator onto the queue.
I agree with Neil, scanning 1000 elements takes no time at all.
But if you want to do it anyway, you could just use the additional list you propose and, in order to avoid scanning the whole list each time, instead of storing just the CacheEntry in your map, you could store the CacheEntry and a pointer to the element of your list that corresponds to this entry.
As a simpler alternative, you could create a map that grows indefinitely and clears itself out every 10 minutes or so (adjust time for expected traffic).
You could also log some very interesting stats this way.
I believe this is a good candidate for treaps. The priority would be the time (virtual or otherwise), in ascending order (older at the root) and the long as the key.
There's also the second chance algorithm, that's good for caches. Although you lose search ability, it won't be a big impact if you only have 1000 items.
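For readers unfamiliar with it, the second chance (clock) policy can be sketched as below. This is an illustrative variant, not a reference implementation: it tracks keys only, looks them up with a linear scan (the lost "search ability" mentioned above), and inserts new keys with their reference bit cleared, which is one of several possible conventions. The class and member names are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class SecondChanceCache {
public:
    explicit SecondChanceCache(std::size_t capacity) : capacity_(capacity) {
        assert(capacity_ > 0);
    }

    // Returns true on a hit. On a miss the key is inserted, evicting the
    // first slot (in clock order) whose reference bit is clear.
    bool access(long key) {
        for (Slot& s : slots_) {
            if (s.key == key) {
                s.referenced = true;  // recently used: gets a second chance
                return true;
            }
        }
        if (slots_.size() < capacity_) {
            slots_.push_back(Slot{key, false});
            return false;
        }
        // Advance the hand, clearing reference bits until a victim is found.
        while (slots_[hand_].referenced) {
            slots_[hand_].referenced = false;
            hand_ = (hand_ + 1) % slots_.size();
        }
        slots_[hand_] = Slot{key, false};
        hand_ = (hand_ + 1) % slots_.size();
        return false;
    }

    bool contains(long key) const {
        for (const Slot& s : slots_)
            if (s.key == key) return true;
        return false;
    }

private:
    struct Slot {
        long key;
        bool referenced;
    };
    std::size_t capacity_;
    std::vector<Slot> slots_;
    std::size_t hand_ = 0;  // the "clock hand": next eviction candidate
};
```

A hit only sets a flag instead of reordering a list, which makes hits cheap. The eviction order only approximates true least-recently-used order.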
The naïve method would be to have a map associated with a priority queue, wrapped in a class. You use the map to search and the queue to remove (first remove from the queue, grabbing the item, and then remove by key from the map).
Another option might be to use boost::multi_index. It is designed to separate index from data and by that allowing multiple indexes on the same data.
I am not sure this really would be faster than scanning through 1000 items. It might use more memory than is good, or slow down search and/or insert/remove.