Most efficient data structure to store an XML tree in C++

I'm doing some work with XML in C++, and I would like to know what the best data structure to store XML data is. Please don't just tell me what you've heard of in the past; I'd like to know what the most efficient structure is. I would like to be able to store any arbitrary XML tree (assuming it is valid), with minimal memory overhead and lookup time.
My initial thought was a hash, but I couldn't figure out how to handle multiple children of the same tag, as well as how attributes would be handled.
Qt solutions are acceptable, but I'm more concerned with the overall structure than the specific library. Thanks for your input.

The most efficient structure would be a set of classes derived from the DTD or the Schema that defines the particular XML instances you intend to process. (Surely you aren't going to process arbitrary XML?) Tags are represented by classes. Single children can be represented by fields. Children with min...max arity can be represented by a field containing an array. Children with indefinite arity can be represented by a dynamically allocated array. Attributes and children can be stored as fields, often with an inferred data type (if an attribute represents a number, why store it as a string?). Using this approach, you can often navigate to a particular place in an XML document using native C++ access paths, e.g.,
root->tag1.itemlist[1]->description.
All of this can be generated automatically from the Schema or the DTD. There are tools to do this.
Altova offers some. I have no specific experience with this (although I have built similar tools for Java and COBOL).
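To make the idea concrete, here is a minimal hand-written sketch of the kind of classes such a generator might emit. The schema is made up: Root, Tag1, Item and the quantity attribute are hypothetical names, chosen only so that an access path like the one above compiles.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical element <item quantity="3"><description>...</description></item>.
struct Item {
    int quantity = 0;          // attribute stored as a number, not as a string
    std::string description;   // single child element stored as a field
};

// Hypothetical element <tag1> holding any number of <item> children.
struct Tag1 {
    std::vector<std::unique_ptr<Item>> itemlist;  // indefinite arity: dynamic array
};

// Hypothetical document root with exactly one <tag1> child.
struct Root {
    Tag1 tag1;
};

// Builds a small document in memory; a real tool would fill these from a parser.
inline std::unique_ptr<Root> make_sample_document() {
    auto root = std::make_unique<Root>();
    for (const char* text : {"first", "second"}) {
        auto item = std::make_unique<Item>();
        item->quantity = static_cast<int>(root->tag1.itemlist.size()) + 1;
        item->description = text;
        root->tag1.itemlist.push_back(std::move(item));
    }
    assert(root->tag1.itemlist.size() == 2);
    return root;
}
```

With this in place, root->tag1.itemlist[1]->description is an ordinary member access with no string lookups at run time.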

You should first determine what the requirement for efficiency is, in terms of storage, speed etc. in concrete numbers. Without knowing this information, you can't tell if your implementation satisfies the requirement.
And, if you have this requirement, you will probably find that the DOM satisfies it, and has the advantage of zero code to maintain.
It will be a nightmare for future programmers as they wonder why someone wrote an alternate implementation of the DOM.
Actually, pretty much anything you do will just be a DOM implementation, but possibly incomplete, and with optimizations for indexing etc. My personal feeling is that re-inventing the wheel should be the last thing you consider.

There is a C++ XML library already built: Xerces.
http://xerces.apache.org/xerces-c/install-3.html
There are some tree structures in \include\boost-1_46_1\boost\intrusive\
There is a red-black tree and an AVL tree, but I have not looked at those in a long time, and I doubt they are especially usable here.
XML is a tree structure. You don't know what the structure is going to be unless it has a DTD defined and included in a !DOCTYPE declaration (although the validator at validrome breaks on !DOCTYPEs and it shouldn't).
See http://w3schools.com/xml/xml_tree.asp for a tree example.
You may get something that doesn't follow a DTD or schema and is totally unstructured, like this:
<?xml version="1.0"?>
<a>
<b>hello
<e b="4"/>
<c a="mailto:jeff#nowhere.com">text</c>
</b>
<f>zip</f>
<z><b /><xy/></z>
<zook flag="true"/>
<f><z><e/></z>random</f>
</a>
I know that queryable XML databases do exist, but I don't know much about them, except that they can handle unstructured data.
PHP has an XML parser that puts the document into what PHP calls an array (not quite like a C/C++ array, because the arrays can contain arrays). You can tinker with it to see an example of what an XML data structure should have in it.
What you basically want is a very flexible tree where the root pointer points to a list. Each node in that list contains a pointer that can point to another list. Child order matters, so use an ordered sequence container. If you need to be able to remove data, std::list is a reasonable choice: it is ordered and easy to manipulate.
Word of warning: .erase(iterator i) erases only the element at i and invalidates i; use the iterator it returns to keep going.
.erase(iterator i1, iterator i2) erases everything from i1 up to but not including i2.
.end() is an iterator that points one past the end of the list, essentially at nothing.
.begin() is an iterator that points to the start of the list.
Learn to use for_each(start, end, function) from <algorithm>, or use a regular for statement.
Iterators are like pointers. Treat them as such.
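#include <iostream>
#include <list>
#include <string>

// Minimal node type so that the loop compiles; a real XML node would also hold attributes and children.
struct node {
    std::string data;
    const std::string& getData() const { return data; }
};

int main() {
    std::list<node> nodelist{{"a"}, {"b"}};
    // Iterators behave like pointers: dereference with -> and advance with ++.
    for (std::list<node>::iterator nli = nodelist.begin(); nli != nodelist.end(); ++nli) {
        std::cout << nli->getData() << '\n';
    }
}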
The nodes need to have an optional list of attributes. Note that the DTD could be contained within the XML document, so you have to be able to read it to parse the document (or you could throw it away). You may also run into XML Schema, the successor of the DTD.
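As a rough sketch of the flexible tree described in this answer, the node type below keeps an ordered list of children plus an optional list of attributes, so repeated tags under one parent are not a problem. XmlNode, children_named, count_nodes and make_sample_tree are names made up for this illustration, and a std::list member of the node's own type relies on the incomplete-type support that is guaranteed since C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical generic node: works for any well-formed document, no schema needed.
struct XmlNode {
    std::string name;                                             // tag name
    std::string text;                                             // character data, if any
    std::vector<std::pair<std::string, std::string>> attributes;  // optional, in document order
    std::list<XmlNode> children;                                  // ordered; repeated tags allowed
};

// Collects pointers to the direct children with the given tag name, in document order.
inline std::vector<const XmlNode*> children_named(const XmlNode& parent,
                                                  const std::string& tag) {
    std::vector<const XmlNode*> result;
    for (const XmlNode& child : parent.children) {
        if (child.name == tag) result.push_back(&child);
    }
    return result;
}

// Counts the node itself plus all of its descendants.
inline std::size_t count_nodes(const XmlNode& node) {
    std::size_t total = 1;
    for (const XmlNode& child : node.children) total += count_nodes(child);
    return total;
}

// Builds a fragment of the unstructured example: <a><f>zip</f><zook flag="true"/><f>random</f></a>
inline XmlNode make_sample_tree() {
    XmlNode a{"a", "", {}, {}};
    a.children.push_back(XmlNode{"f", "zip", {}, {}});
    a.children.push_back(XmlNode{"zook", "", {{"flag", "true"}}, {}});
    a.children.push_back(XmlNode{"f", "random", {}, {}});
    assert(a.children.size() == 3);
    return a;
}
```

Lookup by tag name is a linear scan over the siblings here; that is the price for keeping document order without any extra index.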

I think the most efficient data structure to store XML in is probably VTD-XML, which uses an array of longs instead of lots of interconnected structs/classes. The main idea is that structs/classes rely on many small memory allocations, which incur severe overhead under normal circumstances. See this article for further detail.
http://soa.sys-con.com/node/250512
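To give a feel for the "array of longs" idea, here is a small sketch that packs one token record into a single 64-bit integer. The field widths and the TokenType values are assumptions picked for this illustration and are not claimed to match the actual VTD-XML record layout.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative field widths (an assumption, not taken from the VTD-XML specification):
// 32 bits of byte offset, 20 bits of length, 8 bits of nesting depth, 4 bits of token type.
enum class TokenType : std::uint8_t { StartTag = 0, AttrName = 1, AttrValue = 2, Text = 3 };

struct Token {
    std::uint32_t offset;  // where the token starts in the original XML buffer
    std::uint32_t length;  // token length in bytes, must fit in 20 bits
    std::uint8_t depth;    // nesting depth of the enclosing element
    TokenType type;
};

inline std::uint64_t pack(const Token& t) {
    assert(t.length < (1u << 20));
    assert(static_cast<unsigned>(t.type) < 16u);
    return (static_cast<std::uint64_t>(t.offset) << 32) |
           (static_cast<std::uint64_t>(t.length) << 12) |
           (static_cast<std::uint64_t>(t.depth) << 4) |
           static_cast<std::uint64_t>(t.type);
}

inline Token unpack(std::uint64_t record) {
    Token t;
    t.offset = static_cast<std::uint32_t>(record >> 32);
    t.length = static_cast<std::uint32_t>((record >> 12) & 0xFFFFFu);
    t.depth = static_cast<std::uint8_t>((record >> 4) & 0xFFu);
    t.type = static_cast<TokenType>(record & 0xFu);
    return t;
}

// The whole document index is one contiguous array: no per-node allocation, no pointers.
using TokenIndex = std::vector<std::uint64_t>;
```

The text itself stays in the original buffer; the index only records where each token lives, which is why the per-token cost is a fixed 8 bytes.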

I'm not sure what the most efficient method is, but since the DOM already exists why re-invent the wheel?
It may make sense to hash all nodes by name for lookup, but you should still use the DOM as the basic representation.
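A minimal sketch of such a name index is below. Element is a stand-in for whatever node type your DOM library provides, and NameIndex, index_by_name and build_index are hypothetical helper names.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for a DOM element; a real DOM library would supply its own node type.
struct Element {
    std::string name;
    std::vector<Element> children;
};

using NameIndex = std::unordered_multimap<std::string, const Element*>;

// Walks the tree once and records every element under its tag name.
// The index stores pointers, so it is only valid while the tree is not modified.
inline void index_by_name(const Element& node, NameIndex& index) {
    index.emplace(node.name, &node);
    for (const Element& child : node.children) index_by_name(child, index);
}

inline NameIndex build_index(const Element& root) {
    NameIndex index;
    index_by_name(root, index);
    assert(index.count(root.name) >= 1);
    return index;
}
```

A multimap is used because the same tag can occur many times; index.equal_range("f") then yields every <f> element in average constant time per hit, while the tree remains the primary representation.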

I've been exploring this problem myself, and these are my thoughts.
a) Every element in XML is either a node or a (key, value) pair.
b) Store every element in a hash. Assign each element a type, i.e. "node" or "key,value".
c) Every element will have a parent. Assign a value to each of them.
d) Every element may or may not have children/references. Store the children in a B-tree, which will define the references.
Search time for any key will be O(1). A reference traversal can have a list of all the children inside the element.
Please review and suggest what I've missed.

Just use a DOM to store the parsed XML file. Surely there are C++ DOM libraries.
You can query DOM with XPath expressions.

Related

How to 'mark' a node in a Clojure data structure?

I have
a Clojure data structure, let's call it dom, a tree of vectors and maps of indefinite depth;
a particular node in it, let's call it the focus node, referred to as a path into the tree: a sequence of keys such as you could present to get-in.
I will be deciding on the focussed node in one function and I want to somehow represent that choice of focussed node in a way that can be passed to another function in a way that does not violate immutability and is not in conflict with Clojure's persistent data structures.
When I traverse the tree, I want to treat the focus node differently: for example, if I was printing the tree, I might want to print the focus node in bold.
If I were using C or Java, I could save a pointer/reference to the focus node, which I could compare with the current node as I traversed the tree. I don't think that's the right way to do it in Clojure: it feels hacky, and I'm sure there's some way to do it that takes advantage of Clojure's persistent data structures.
The solution has to work in Clojure and ClojureScript.
The options I can think of are:
(1) Store a reference and check against that.
(2) Attach a marker to the node in question.
(3) Simultaneously recurse into the tree and along the path to the marked node.
Option (1) is unattractive, as I've explained.
Option (2) seems best, and painless given persistent data structures.
Option (3) is similar to option (2), except that it combines the marking and traversing steps.
I'm sure this is a common problem. Is there a standard solution to it?

I suggest you reconsider @MerceloMorales's suggestion: to use metadata. Your node object is to have an incidental attribute that doesn't affect its normal functions. That is what metadata is designed for. And it works in ClojureScript. The only reason I can think of for not using metadata is that the node value is not a Clojure object, but is, for example, a number.
In The Clojure Cookbook, 2.22. Keeping Multiple Values for a Key, Luke Vanderhart uses metadata to solve a similar problem: marking entries that need to be interpreted as collections rather than single values.
Another approach might be to use a zipper to traverse/modify the node tree. Zippers are implemented in terms of - you've guessed it - metadata.
I share your misgivings about metadata: it feels queasy to attach just any old stuff to your data - like infecting it with a parasite. However, it's just as immutable a part of the object as any other.
The suggestion to use zippers is naive: The standard clojure zippers are designed for a hierarchy of sequential containers, not associative ones.

See Brandon Bloom's Dendrology talk for a great overview of questions like this.
I believe the ease of "marking" or otherwise updating tree structured data underlies his strong recommendation to always represent nodes as nested maps rather than vectors (or a mixture of vectors and maps). A mark based on a path described by a vector of keys is then as simple as:
(update-in tree-data path assoc :is-focussed true)
Your original data structure is unchanged and the new one returned by update-in shares everything structurally with the original except the updated node which is now easily tested for the :is-focussed property upon traversal.

Is there any MFC / STL class that represents Binary Tree [duplicate]

Why does the C++ STL not provide any "tree" containers, and what's the best thing to use instead?
I want to store a hierarchy of objects as a tree, rather than use a tree as a performance enhancement...

There are two reasons you could want to use a tree:
You want to mirror the problem using a tree-like structure:
For this we have the Boost Graph Library.
Or you want a container that has tree-like access characteristics.
For this we have
std::map (and std::multimap)
std::set (and std::multiset)
Basically the characteristics of these two containers are such that they practically have to be implemented using trees (though this is not actually a requirement).
See also this question:
C tree Implementation
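As a small illustration of the "tree-like access characteristics" mentioned above, the sketch below uses only the ordered-lookup guarantees of std::map. keys_in_range is a hypothetical helper name, not a standard function.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Returns the keys in [first, last) in sorted order, using the ordered lookup
// that std::map guarantees in logarithmic time.
inline std::vector<std::string> keys_in_range(const std::map<std::string, int>& m,
                                              const std::string& first,
                                              const std::string& last) {
    assert(!(last < first));
    std::vector<std::string> keys;
    for (auto it = m.lower_bound(first); it != m.end() && it->first < last; ++it) {
        keys.push_back(it->first);
    }
    return keys;
}
```

Nothing in this code touches the underlying tree; it relies only on sorted iteration and lower_bound, which is all the container promises.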

Probably for the same reason that there is no tree container in boost. There are many ways to implement such a container, and there is no good way to satisfy everyone who would use it.
Some issues to consider:
Are the number of children for a node fixed or variable?
How much overhead per node? - ie, do you need parent pointers, sibling pointers, etc.
What algorithms to provide? - different iterators, search algorithms, etc.
In the end, the problem ends up being that a tree container that would be useful enough to everyone, would be too heavyweight to satisfy most of the people using it. If you are looking for something powerful, Boost Graph Library is essentially a superset of what a tree library could be used for.
Here are some other generic tree implementations:
Kasper Peeters' tree.hh
Adobe's forest
core::tree

"I want to store a hierarchy of objects as a tree"
C++11 has come and gone and they still didn't see a need to provide a std::tree, although the idea did come up (see here). Maybe the reason they haven't added this is that it is trivially easy to build your own on top of the existing containers. For example...
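#include <iostream>
#include <vector>

template< typename T >
struct tree_node
{
    T t;
    std::vector<tree_node> children;  // a vector of the node's own type is guaranteed to work since C++17
    void walk_depth_first() const;    // declared here, defined below
};
A simple traversal would use recursion...
template< typename T >
void tree_node<T>::walk_depth_first() const
{
    std::cout << t;
    for ( const auto & n : children ) n.walk_depth_first();
}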
If you want to maintain a hierarchy and you want it to work with STL algorithms, then things may get complicated. You could build your own iterators and achieve some compatibility, however many of the algorithms simply don't make any sense for a hierarchy (anything that changes the order of a range, for example). Even defining a range within a hierarchy could be a messy business.

The STL's philosophy is that you choose a container based on guarantees and not based on how the container is implemented. For example, your choice of container may be based on a need for fast lookups. For all you care, the container may be implemented as a unidirectional list -- as long as searching is very fast you'd be happy. That's because you're not touching the internals anyhow, you're using iterators or member functions for the access. Your code is not bound to how the container is implemented but to how fast it is, or whether it has a fixed and defined ordering, or whether it is efficient on space, and so on.

If you are looking for a RB-tree implementation, then stl_tree.h might be appropriate for you too.

std::map is typically based on a red-black tree. You can also use other containers to help you implement your own types of trees.

The problem is that there is no one-size-fits-all solution. Moreover, there is not even a one-size-fits-all interface for a tree. That is, it is not even clear which methods such a tree data structure should provide and it is not even clear what a tree is.
This explains why there is no STL support on this: The STL is for data structures that most people need, where basically everyone agrees on what a sensible interface and an efficient implementation is. For trees, such a thing just doesn't exist.
The gory details
If you want to understand further what the problem is, read on. Otherwise, the paragraph above should already be sufficient to answer your question.
I said that there is not even a common interface. You might disagree, since you have one application in mind, but if you think further about it, you will see that there are countless possible operations on trees. You can either have a data structure that enables most of them efficiently, but is therefore more complex overall and has overhead for that complexity, or you can have a simpler data structure that only allows basic operations but performs these as quickly as possible.
If you want the complete story, check out my paper on the topic. There you will find possible interface, asymptotic complexities on different implementations, and a general description of the problem and also related work with more possible implementations.
What is a tree?
It already starts with what you consider to be a tree:
Rooted or unrooted: most programmers want rooted, most mathematicians want unrooted. (If you wonder what unrooted is: A - B - C is a tree where either A, B, or C could be the root. A rooted tree defines which one is. An unrooted tree doesn't.)
Single root/connected or multi root/disconnected (tree or forest)
Is sibling order relevant? If no, then can the tree structure internally reorder children on updates? If so, iteration order among siblings is no longer defined. But for most trees, sibling order is actually not meaningful, and allowing the data structure to reorder children on update is very beneficial for some updates.
Really just a tree, or also allow DAG edges (sounds weird, but many people who initially want a tree eventually want a DAG)
Labeled or unlabeled? Do you need to store any data per node, or is it only the tree structure you're interested in (the latter can be stored very succinctly)
Query operations
After we have figured out what we define to be a tree, we should define query operations: Basic operations might be "navigate to children, navigate to parent", but there are way more possible operations, e.g.:
Navigate to next/prev sibling: Even though most people would consider this a pretty basic operation, it is actually almost impossible if you only have a parent pointer or a children array. So this already shows you that you might need a totally different implementation based on what operations you need.
Navigate in pre/post order
Subtree size: the number of (transitive) descendants of the current node (possibly in O(1) or O(log n), i.e., don't just enumerate them all to count)
Height: the height of the tree at the current node, that is, the longest path from this node to any leaf node. Again, in less than O(n).
Given two nodes, find the least common ancestor of the node (with O(1) memory consumption)
How many nodes are between node A and node B in a pre-/post-order traversal? (less than O(n) runtime)
I emphasized that the interesting thing here is whether these methods can be performed better than O(n), because just enumerating the whole tree is always an option. Depending on your application, it might be absolutely crucial that some operations are faster than O(n), or you might not care at all. Again, you will need vastly different data structures depending on your needs here.
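To illustrate how the choice of representation decides which queries are cheap, here is a minimal sketch of one possible layout with explicit parent, child and sibling links. It is not the structure from the paper; Node, Tree, kNone and the member names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Nodes live in one vector and refer to each other by index; kNone means "no such node".
constexpr std::size_t kNone = static_cast<std::size_t>(-1);

struct Node {
    std::size_t parent = kNone;
    std::size_t first_child = kNone;
    std::size_t last_child = kNone;
    std::size_t next_sibling = kNone;
    std::size_t prev_sibling = kNone;
};

struct Tree {
    std::vector<Node> nodes;

    // Appends a new node as the last child of `parent` (or as a root if parent == kNone).
    std::size_t add(std::size_t parent) {
        assert(parent == kNone || parent < nodes.size());
        const std::size_t id = nodes.size();
        nodes.push_back(Node{});
        nodes[id].parent = parent;
        if (parent != kNone) {
            const std::size_t last = nodes[parent].last_child;
            nodes[id].prev_sibling = last;
            if (last == kNone) nodes[parent].first_child = id;
            else nodes[last].next_sibling = id;
            nodes[parent].last_child = id;
        }
        return id;
    }

    // O(1) sibling navigation thanks to the explicit sibling links.
    std::size_t next_sibling(std::size_t id) const { return nodes[id].next_sibling; }
    std::size_t prev_sibling(std::size_t id) const { return nodes[id].prev_sibling; }

    // Subtree size by plain enumeration: linear in the size of the subtree.
    std::size_t subtree_size(std::size_t id) const {
        std::size_t total = 1;
        for (std::size_t c = nodes[id].first_child; c != kNone; c = nodes[c].next_sibling) {
            total += subtree_size(c);
        }
        return total;
    }
};
```

Sibling navigation is constant time here, but it costs five index fields per node, and subtree_size is still the enumerate-everything baseline. Getting that below O(n) would need extra bookkeeping on every update.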
Update operations
Until now, I only talked about query operations. But now to updates. Again, there are various ways in which a tree could be updated. Depending on which you need, you need a more or less sophisticated data structure:
leaf updates (easy): Delete or add a leaf node
inner node updates (harder): Move or delete an inner node, making its children the children of its parent
subtree updates (harder): Move or delete a subtree rooted in a node
To just give you some intuition: If you store a child array and your sibling order is important, even deleting a leaf can be O(n) as all siblings behind it have to be shifted in the child array of its parent. If you instead only have a parent pointer, leaf deletion is trivially O(1). If you don't care about sibling order, it is also O(1) for the child array, as you can simply fill the gap with the last sibling in the array. This is just one example where different data structures will give you quite different update capabilities.
Moving a whole subtree is again trivially O(1) in case of a parent pointer, but can be O(n) if you have a data structure storing all nodes e.g., in pre-order.
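The child-array case can be sketched in a few lines. The child array is modelled here as a plain std::vector<int> of child ids, and both function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Removes children[i] and keeps the remaining siblings in their original order.
// Every later sibling shifts down by one slot: O(n) in the number of siblings.
inline void remove_child_ordered(std::vector<int>& children, std::size_t i) {
    assert(i < children.size());
    children.erase(children.begin() + static_cast<std::ptrdiff_t>(i));
}

// Removes children[i] by overwriting it with the last sibling: O(1),
// but the sibling order changes.
inline void remove_child_unordered(std::vector<int>& children, std::size_t i) {
    assert(i < children.size());
    children[i] = children.back();
    children.pop_back();
}
```

Starting from {10, 20, 30, 40} and removing index 1, the ordered version leaves {10, 30, 40} while the unordered one leaves {10, 40, 30}.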
Then, there are orthogonal considerations like which iterators stay valid if you perform updates. Some data structures need to invalidate all iterators in the whole tree, even if you insert a new leaf. Others only invalidate iterators in the part of the tree that is altered. Others keep all iterators (except the ones for deleted nodes) valid.
Space considerations
Tree structures can be very succinct. Roughly two bits per node are enough if you need to save on space (e.g., DFUDS or LOUDS, see this explanation to get the gist). But of course, naively, even a parent pointer is already 64 bits. Once you opt for a nicely-navigable structure, you might rather require 20 bytes per node.
With a lot of sophistication, one can also build a data structure that only takes some bits per entry, can be updated efficiently, and still enables all query operations asymptotically fast, but this is a beast of a structure that is highly complex. I once gave a practical course where I had grad students implement this paper. Some of them were able to implement it in 6 weeks (!), others failed. And while the structure has great asymptotics, its complexity makes it have quite some overhead for very simple operations.
Again, no one-size-fits-all.
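The "roughly two bits per node" figure can be made tangible with a balanced-parentheses encoding. Note that this is a simpler relative of the schemes named above, not DFUDS or LOUDS themselves, and that ChildLists, encode_tree and subtree_size are names invented for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Ordinary pointer-free tree used as input: children[i] lists the children of node i.
using ChildLists = std::vector<std::vector<std::size_t>>;

// Appends the encoding of the subtree rooted at `node`: 1 on entry, 0 on exit.
inline void encode_subtree(const ChildLists& children, std::size_t node,
                           std::vector<bool>& bits) {
    assert(node < children.size());
    bits.push_back(true);   // "open parenthesis"
    for (std::size_t child : children[node]) encode_subtree(children, child, bits);
    bits.push_back(false);  // "close parenthesis"
}

// Encodes the whole tree shape in exactly 2 bits per node.
inline std::vector<bool> encode_tree(const ChildLists& children, std::size_t root) {
    std::vector<bool> bits;
    encode_subtree(children, root, bits);
    return bits;
}

// Number of nodes in the subtree whose open parenthesis is at position `open`.
// This linear scan is only for illustration; succinct structures add small
// auxiliary indexes so that such queries do not need to scan.
inline std::size_t subtree_size(const std::vector<bool>& bits, std::size_t open) {
    assert(open < bits.size() && bits[open]);
    std::size_t depth = 0, nodes = 0;
    for (std::size_t i = open; i < bits.size(); ++i) {
        if (bits[i]) { ++depth; ++nodes; }
        else if (--depth == 0) break;
    }
    return nodes;
}
```

The encoding stores only the shape of the tree. Any per-node data would live in a separate array indexed by pre-order position, and every update means rewriting part of the bit sequence.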
Conclusion
I worked 5 years on finding the best data structure to represent a tree, and even though I came up with some and there is quite some related work, my conclusion was that there is not one. Depending on the use case, a highly sophisticated data structure will be outperformed by a simple parent pointer. Even defining the interface for a tree is hard. I tried defining one in my paper, but I have to acknowledge that there are various use cases where the interface I defined is too narrow or too large. So I doubt that this will ever end up in the STL, as there are just too many tuning knobs.

In a way, std::map is a tree (it is required to have the same performance characteristics as a balanced binary tree) but it doesn't expose other tree functionality. The likely reasoning behind not including a real tree data structure was probably just a matter of not including everything in the STL. The STL can be looked at as a framework to use in implementing your own algorithms and data structures.
In general, if there's a basic library functionality that you want and that's not in the STL, the fix is to look at Boost.
Otherwise, there's a bunch of libraries out there, depending on the needs of your tree.

All STL containers are externally represented as "sequences" with one iteration mechanism.
Trees don't follow this idiom.

I think there are several reasons why there are no STL trees. Primarily Trees are a form of recursive data structure which, like a container (list, vector, set), has very different fine structure which makes the correct choices tricky. They are also very easy to construct in basic form using the STL.
A finite rooted tree can be thought of as a container which has a value or payload, say an instance of a class A and, a possibly empty collection of rooted (sub) trees; trees with empty collection of subtrees are thought of as leaves.
template<class A>
struct unordered_tree : std::set<unordered_tree>, A
{};
template<class A>
struct b_tree : std::vector<b_tree>, A
{};
template<class A>
struct planar_tree : std::list<planar_tree>, A
{};
One has to think a little about iterator design etc. and which product and co-product operations one allows to define and be efficient between trees - and the original STL has to be well written - so that the empty set, vector or list container is really empty of any payload in the default case.
Trees play an essential role in many mathematical structures (see the classical papers of Butcher, Grossman and Larson; also the papers of Connes and Kreimer for examples of how they can be joined, and how they are used to enumerate). It is not correct to think their role is simply to facilitate certain other operations. Rather they facilitate those tasks because of their fundamental role as a data structure.
However, in addition to trees there are also "co-trees"; the trees above all have the property that if you delete the root you delete everything.
Consider iterators on the tree, probably they would be realised as a simple stack of iterators, to a node, and to its parent, ... up to the root.
template<class TREE>
struct node_iterator : std::stack<typename TREE::iterator>{
decltype(auto) operator*() {return *this->top();}  // std::stack exposes top(), not back()
...};
However, you can have as many as you like; collectively they form a "tree" but where all the arrows flow in the direction toward the root, this co-tree can be iterated through iterators towards the trivial iterator and root; however it cannot be navigated across or down (the other iterators are not known to it) nor can the ensemble of iterators be deleted except by keeping track of all the instances.
Trees are incredibly useful and they have a lot of structure, which makes it a serious challenge to get the definitively correct approach. In my view this is why they are not implemented in the STL. Moreover, in the past, I have seen people get religious and find the idea of a type of container containing instances of its own type challenging, but they have to face it: that is what a tree type represents. It is a node containing a possibly empty collection of (smaller) trees. The current language permits it without challenge, provided the default constructor for container<B> does not allocate space on the heap (or anywhere else) for a B, etc.
I for one would be pleased if this did, in a good form, find its way into the standard.

Because the STL is not an "everything" library. It contains, essentially, the minimum structures needed to build things.

This one looks promising and seems to be what you're looking for:
http://tree.phi-sci.com/

IMO, an omission. But I think there is good reason not to include a Tree structure in the STL. There is a lot of logic in maintaining a tree, which is best written as member functions into the base TreeNode object. When TreeNode is wrapped up in an STL header, it just gets messier.
For example:
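#include <vector>

template <typename T>
struct TreeNode
{
  T* DATA = nullptr ; // data of type T to be stored at this TreeNode (not owned by the node)
  std::vector< TreeNode<T>* > children ;
  // free the whole subtree when a node is deleted, so Tree::clear() does not leak
  ~TreeNode() { for( TreeNode<T>* child : children ) delete child ; }
  // insertion logic for if an insert is asked of me.
  // may append to children, or may pass off to one of the child nodes
  void insert( T* newData ) ;
} ;
template <typename T>
struct Tree
{
  TreeNode<T>* root = nullptr ;
  // TREE LEVEL functions
  void clear() { delete root ; root = nullptr ; }
  void insert( T* data ) { if( root ) root->insert( data ) ; }
} ;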

Reading through the answers here, the commonly named reasons are that one cannot iterate through a tree, or that a tree does not offer an interface similar to other STL containers, so one could not use STL algorithms with such a tree structure.
With that in mind, I tried to design my own tree data structure which provides an STL-like interface and is usable with existing STL algorithms as much as possible.
My idea was that the tree must be based on the existing STL containers and that it must not hide the container, so that it remains accessible to STL algorithms.
The other important feature the tree must provide is traversing iterators.
Here is what I was able to come up with: https://github.com/cppfw/utki/blob/master/src/utki/tree.hpp
And here are the tests: https://github.com/cppfw/utki/blob/master/tests/unit/src/tree.cpp

All STL containers can be used with iterators. You can't have an iterator on a tree, because there is no ''one right'' way to go through the tree.

C++ - Map-like data structure with structural sharing/immutability

Functional programming languages often work on immutable data structures but stay efficient through structural sharing. E.g. you work on some map of information; if you insert an element, you do not modify the existing map but create a new, updated version. To avoid massive copying and memory usage, the map shares as much of the unchanged data as possible between both instances.
I would be interested to know if there is a template library providing such a map-like data structure for C++. I searched a bit and found nothing besides internal classes in LLVM.

A copy-on-write B+tree sounds like what you're looking for. It basically creates a new snapshot of itself every time it gets modified, but it shares unmodified leaf nodes between versions. Most of the implementations I've seen tend to be baked into append-only database log files. CouchDB has a very nice write-up on them. They are, however, "relatively easy" to implement, as far as map data structures go.
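The sharing idea can be sketched without a full B+tree. The example below swaps in a plain unbalanced binary search tree with path copying, which is much simpler than a copy-on-write B+tree but shows the same principle. PNode, PMap, pmap_insert and pmap_find are names invented for this sketch.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Immutable node: once created it is never modified, so versions can share it safely.
struct PNode {
    std::string key;
    int value;
    std::shared_ptr<const PNode> left;
    std::shared_ptr<const PNode> right;
};
using PMap = std::shared_ptr<const PNode>;  // an empty map is a null pointer

// Returns a new version containing (key, value). Only the nodes on the path from
// the root to the key are copied; every other subtree is shared with the old version.
inline PMap pmap_insert(const PMap& root, const std::string& key, int value) {
    if (!root) return std::make_shared<PNode>(PNode{key, value, nullptr, nullptr});
    if (key < root->key)
        return std::make_shared<PNode>(
            PNode{root->key, root->value, pmap_insert(root->left, key, value), root->right});
    if (root->key < key)
        return std::make_shared<PNode>(
            PNode{root->key, root->value, root->left, pmap_insert(root->right, key, value)});
    return std::make_shared<PNode>(PNode{key, value, root->left, root->right});
}

// Returns a pointer to the value stored under key, or nullptr if the key is absent.
inline const int* pmap_find(const PMap& root, const std::string& key) {
    const PNode* node = root.get();
    while (node) {
        if (key < node->key) node = node->left.get();
        else if (node->key < key) node = node->right.get();
        else return &node->value;
    }
    return nullptr;
}
```

Old versions stay valid for as long as someone holds their root pointer, and shared_ptr reference counting frees nodes that no version can reach any more. A production structure would also keep the tree balanced, which this sketch does not.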

You can use an ordinary map, but marking every element with a timestamp or "map version number". If you want to remove elements too, use two marks. If you might reinsert removed elements, then you need a list of values and pairs of marks per element.
For example, you search for the key "foo", and you find that it had the value 5 in versions 0 to 3 (included), then it was "removed", and then it had the value -8 in versions 9 to current.
This eats a lot of memory and time, though.
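Here is a minimal sketch of that scheme, with one list of (from, to, value) marks per key. VersionedMap, Span and kOpen are names made up for this example, and each mutation is assumed to create exactly one new version.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One period during which a key held a value: versions [from, to); to == kOpen if still live.
constexpr unsigned kOpen = static_cast<unsigned>(-1);
struct Span {
    unsigned from;
    unsigned to;
    int value;
};

class VersionedMap {
public:
    // Each mutation creates a new version and returns its number.
    unsigned set(const std::string& key, int value) {
        const unsigned version = next_version();
        close_live_span(key, version);
        spans_[key].push_back(Span{version, kOpen, value});
        return version;
    }
    unsigned remove(const std::string& key) {
        const unsigned version = next_version();
        close_live_span(key, version);
        return version;
    }
    // Value of key as seen by the given version, or nullptr if it did not exist then.
    const int* get(const std::string& key, unsigned version) const {
        const auto it = spans_.find(key);
        if (it == spans_.end()) return nullptr;
        for (const Span& s : it->second) {
            if (s.from <= version && version < s.to) return &s.value;
        }
        return nullptr;
    }
    unsigned current() const { return current_; }

private:
    unsigned next_version() {
        assert(current_ + 1 != kOpen);
        return ++current_;
    }
    void close_live_span(const std::string& key, unsigned version) {
        const auto it = spans_.find(key);
        if (it != spans_.end() && !it->second.empty() && it->second.back().to == kOpen) {
            it->second.back().to = version;
        }
    }
    std::map<std::string, std::vector<Span>> spans_;
    unsigned current_ = 0;  // version 0 is the empty map
};
```

Every lookup has to scan the spans of a key, and nothing is ever freed, which is where the memory and time cost mentioned above comes from.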

What container is the best to manage XML data?

Imagine I have XML data:
<tag> value</tag> etc ....
<othertag> value value2 value_n </othertag>
etc ....
What is the best container to manage this information?
A simple vector, a list, something else?
I'm going to do simple insertions, searches, and deletions.
Of course the code is going to be somewhat crude.
I know that there are XML-specific utilities, but I'd like to know your opinion.

"The best container" to manage XML is usually a DOM tree, since it can store all the information stored in the XML source and present it in a code-friendly way; still, depending on what you want to do to this data, it can be overkill.
On the other hand, since what you want to do are actually generic manipulations of the XML tree, I think it could be your best option; grab a good XML parser that produces a DOM tree and use it.
A personal note: please, don't reinvent an XML square wheel yet another time - there are enough broken XML parsers around, we don't need yet another one. Use a well known, standard conforming XML parser (e.g. Xerces-C++) that produces your DOM tree and be happy with it.

Well, that highly depends on your XML data. Do you have a list of objects with no special identifiers? Or do you want to be able to quickly identify an object in your list (i.e. have a mapping)?
You can always use http://linuxsoftware.co.nz/cppcontainers.html to make a decision. The flowchart at the bottom of the page is particularly useful.

How to implement an associative array/map/hash table data structure (in general and in C++)

Well I'm making a small phone book application and I've decided that using maps would be the best data structure to use but I don't know where to start. (Gotta implement the data structure from scratch - school work)

Tries are quite efficient for implementing maps where the keys are short strings. The Wikipedia article explains it pretty well.
To deal with duplicates, just make each node of the tree store a linked list of duplicate matches.
Here's a basic structure for a trie:
struct Trie {
struct Trie* letter;
struct List *matches;
};
Allocate letter with malloc(26*sizeof(struct Trie)) and you have an array of 26 children, one per letter. If you want to support punctuation, add slots for it at the end of the letter array.
matches can be a linked list of matches, implemented however you like, I won't define struct List for you.
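For comparison, here is a minimal C++ sketch of the same trie idea. It assumes lowercase a-z keys only and uses std::vector for the list of matches; TrieNode, slot, trie_insert and trie_find are names made up for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// One trie node: 26 child slots (lowercase a-z only, an assumption of this sketch)
// and the phone numbers stored under the name that ends here.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> letter;
    std::vector<std::string> matches;  // duplicates simply accumulate here
};

// Maps 'a'..'z' to 0..25; the sketch does not handle any other characters.
inline std::size_t slot(char c) {
    assert(c >= 'a' && c <= 'z');
    return static_cast<std::size_t>(c - 'a');
}

inline void trie_insert(TrieNode& root, const std::string& name, const std::string& number) {
    TrieNode* node = &root;
    for (char c : name) {
        std::unique_ptr<TrieNode>& next = node->letter[slot(c)];
        if (!next) next = std::make_unique<TrieNode>();
        node = next.get();
    }
    node->matches.push_back(number);
}

// Returns the numbers stored under name, or nullptr if the name is not present.
inline const std::vector<std::string>* trie_find(const TrieNode& root, const std::string& name) {
    const TrieNode* node = &root;
    for (char c : name) {
        node = node->letter[slot(c)].get();
        if (!node) return nullptr;
    }
    return node->matches.empty() ? nullptr : &node->matches;
}
```

Lookup cost depends only on the length of the name, not on how many entries the phone book holds. For a from-scratch assignment you would replace std::vector and std::unique_ptr with your own list and manual allocation.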

Simplest solution: use a vector which contains your address entries and loop over the vector to search.
A map is usually implemented either as a binary tree (look for red/black trees for balancing) or as a hash map. Neither of them is trivial: trees have some overhead for organisation, memory management and balancing, and hash maps need good hash functions, which are also not trivial. But both structures are fun and you'll gain a lot of insight by implementing one of them (or better, both :-)).
Also consider keeping the data in the vector and letting the map contain indices into the vector (or pointers to the entries): then you can easily have multiple indices, say one for the name and one for the phone number, so you can look up entries by both.
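A minimal sketch of that layout is below. std::map stands in for whatever map you end up implementing yourself, and PhoneBook, Entry and the member names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Entry {
    std::string name;
    std::string phone;
};

class PhoneBook {
public:
    void add(const std::string& name, const std::string& phone) {
        entries_.push_back(Entry{name, phone});
        const std::size_t position = entries_.size() - 1;
        by_name_[name] = position;    // a later entry with the same name takes over the index slot
        by_phone_[phone] = position;
    }
    const Entry* find_by_name(const std::string& name) const { return lookup(by_name_, name); }
    const Entry* find_by_phone(const std::string& phone) const { return lookup(by_phone_, phone); }

private:
    using Index = std::map<std::string, std::size_t>;
    const Entry* lookup(const Index& index, const std::string& key) const {
        const auto it = index.find(key);
        if (it == index.end()) return nullptr;
        assert(it->second < entries_.size());
        return &entries_[it->second];
    }
    std::vector<Entry> entries_;  // the data itself, stored once
    Index by_name_;               // indices hold positions, not copies of the entries
    Index by_phone_;
};
```

Storing positions rather than pointers keeps the indices valid when the vector reallocates, as long as entries are only appended.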
That said I just want to strongly recommend using the data structures provided by the standard library for real-world-tasks :-)

A simple approach to get you started would be to create a map class that uses two vectors - one for the key and one for the value. To add an item, you insert a key in one and a value in another. To find a value, you just loop over all the keys. Once you have this working, you can think about using a more complex data structure.
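The two-vector starting point could look roughly like this. SimpleMap and its member names are made up, and std::vector is used for brevity even though a from-scratch assignment may require your own array type.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Associative array built from two parallel vectors: keys_[i] belongs to values_[i].
class SimpleMap {
public:
    // Inserts a new pair, or overwrites the value if the key already exists.
    void put(const std::string& key, const std::string& value) {
        for (std::size_t i = 0; i < keys_.size(); ++i) {
            if (keys_[i] == key) { values_[i] = value; return; }
        }
        keys_.push_back(key);
        values_.push_back(value);
        assert(keys_.size() == values_.size());
    }
    // Linear search over all keys: O(n), returns nullptr if the key is missing.
    const std::string* get(const std::string& key) const {
        for (std::size_t i = 0; i < keys_.size(); ++i) {
            if (keys_[i] == key) return &values_[i];
        }
        return nullptr;
    }
    std::size_t size() const { return keys_.size(); }

private:
    std::vector<std::string> keys_;
    std::vector<std::string> values_;
};
```

Once this works, the public interface can stay the same while the inside is swapped for a tree or a hash table.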
