Considering the following xml element:
<elem>
<sub_elem name="first"/>
<sub_elem name="second"/>
</elem>
Concerning the property tree populated by boost::property_tree::xml_parser::read_xml, is it guaranteed that sub_elem "first" will be before sub_elem "second"?
The documentation states:
Reads XML from an input stream and translates it to property tree.
However, it depends on what "translates" means exactly.
From the property tree documentation (https://www.boost.org/doc/libs/1_65_1/doc/html/property_tree/container.html) I deduce that the order of elements in the XML file is preserved:
It is very important to remember that the property sequence is not
ordered by the key. It preserves the order of insertion. It closely
resembles a std::list. Fast access to children by name is provided via
a separate lookup structure. Do not attempt to use algorithms that
expect an ordered sequence (like binary_search) on a node's children.
Looking at the internal code in https://www.boost.org/doc/libs/1_51_0/boost/property_tree/detail/xml_parser_read_rapidxml.hpp, one can see a traversal over the nodes with push_back calls. So it should indeed work in the simplest way, preserving document order.
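The behaviour described in the quoted documentation (children kept in insertion order, with a separate structure for lookup by name) can be illustrated without Boost. What follows is only a standard-library analogy and not Boost's actual implementation; the class OrderedChildren and its member names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the children of one property-tree node.
class OrderedChildren {
public:
    using Entry = std::pair<std::string, std::string>;  // (key, data)

    OrderedChildren() = default;
    // The index stores iterators into entries_, so copying is disabled.
    OrderedChildren(const OrderedChildren&) = delete;
    OrderedChildren& operator=(const OrderedChildren&) = delete;

    // Append a child, as a parser would while walking the XML nodes.
    void push_back(const std::string& key, const std::string& data) {
        entries_.emplace_back(key, data);
        index_.emplace(key, std::prev(entries_.end()));
    }

    // Number of children with this key, answered by the lookup structure.
    std::size_t count(const std::string& key) const { return index_.count(key); }

    // Data of all children, in insertion order (not sorted by key).
    std::vector<std::string> values_in_order() const {
        std::vector<std::string> out;
        for (const Entry& e : entries_) out.push_back(e.second);
        assert(out.size() == index_.size());  // both structures stay in sync
        return out;
    }

private:
    std::list<Entry> entries_;                                       // sequence
    std::multimap<std::string, std::list<Entry>::iterator> index_;   // lookup
};
```

Pushing "first" and then "second" under the same key "sub_elem" yields them back in that order from values_in_order(), which is the property the question asks about.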
Consider the following example:
These are the inputs:
ASED
BTY
ASED->CWD
CWD->DTT
EI->FHK
These are just strings. They have no special meaning, but "->" indicates propagation as a clone. I want to get DTT's father according to these entries. Is there a faster solution?
I am not asking about code, only about the method.
It all depends on whether the tokens (ASED, CWD, etc.) are unique. Also, do you want to use only the standard C++ library, or are you open to using additional libraries?
Assume the tokens are unique and you want to use standard C++. The standard library has no general tree container, but in this case you don't need one to solve your problem.
Assuming also that a token can have only one parent, you can reverse the expression parent -> child to child -> parent (in your algorithm, not in the input list). Once done, you can store the child as the key of a map and the parent as the value. You will need to introduce a stop key as a value (e.g. NULL) to signify that a particular key has no parent.
To extract the parent, fetch the value stored in the map for the child; this is the intermediate parent (the link). Next, look up the intermediate parent to get its parent. Repeat until you reach the stop key.
The complexity of this approach depends on which map you select: std::map costs O(log n) per lookup, while std::unordered_map costs O(1) per lookup on average.
This algorithm can be slightly modified to support multiple parents. In that case you store the data in a std::multimap, and during the traversal you visit all values stored for the key you are fetching, which can be done with recursion or with an explicit stack.
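The single-parent variant described above might look roughly like the sketch below. It assumes unique tokens, at most one parent per token, no cycles, and an empty string as the stop key; the names ParentMap, build_parent_map and root_ancestor are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// child -> parent; an empty string is the stop key ("no parent").
using ParentMap = std::unordered_map<std::string, std::string>;

// Build the reversed relation from lines such as "ASED" or "ASED->CWD".
ParentMap build_parent_map(const std::vector<std::string>& lines) {
    ParentMap parent_of;
    for (const std::string& line : lines) {
        const std::size_t arrow = line.find("->");
        if (arrow == std::string::npos) {
            parent_of.emplace(line, "");  // standalone token, no parent yet
            continue;
        }
        const std::string parent = line.substr(0, arrow);
        const std::string child = line.substr(arrow + 2);
        parent_of.emplace(parent, "");    // no-op if the parent is already known
        parent_of[child] = parent;
    }
    return parent_of;
}

// Follow the links upward until the stop key is reached.
std::string root_ancestor(const ParentMap& parent_of, std::string token) {
    std::size_t steps = 0;
    auto it = parent_of.find(token);
    while (it != parent_of.end() && !it->second.empty()) {
        ++steps;
        assert(steps <= parent_of.size());  // more steps would mean a cycle
        token = it->second;
        it = parent_of.find(token);
    }
    return token;
}
```

With the inputs from the question, parent_of.at("DTT") gives the direct parent CWD, and root_ancestor(parent_of, "DTT") walks DTT, CWD, ASED and stops at ASED.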
I have a DAG-like structure that is essentially a deeply nested map. The maps in this structure can have common values, so the overall structure is not a tree but a directed acyclic graph. I'll refer to this structure as a DAG for brevity.
The nodes in this graph belong to a finite number of different categories. Each category can have its own structure, keywords, and number of children. There is one unique node that is the source of this DAG, meaning that from this node we can reach all nodes in the DAG.
The task is to traverse through the DAG from the source node, and convert each node to another one or more nodes in a new constructed graph. I'll give an example for illustration.
The graph in the upper half is the input one. The lower half is the one after transformation. For simplicity, the transformation is only done on node A where it is split into node 1 and A1. The children of node A are also reallocated.
What I have tried (or in mind):
Write a function that converts one object, handling each of the different types. Inside this function, recursively call itself to convert each of the object's children. This method suffers from the problem that the data are immutable: the nodes in the transformed graph cannot be changed afterwards to add children. To overcome this, I need to wrap every node in a ref/atom/agent.
Do a topological sort on the original graph, then convert the nodes in reverse order, i.e., bottom-up. This method requires an extra traversal of the graph, but at least the data need not be mutable. Regarding the topological sort algorithm, I'm considering the DFS-based method described on the wiki page, which requires knowledge of neither the full graph nor a node's parents.
My question is:
Are there any other approaches you might consider, possibly more elegant/efficient/idiomatic?
I'm more in favour of the second method; are there any flaws or potential problems?
Thanks!
EDIT: On second thought, a topological sort is not necessary. The transformation can already be done during a post-order traversal.
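The post-order idea can be sketched as follows. This is written in C++ rather than Clojure and only illustrates the technique: the Node type, the convert function and the trivial "append a prime to the label" conversion are all hypothetical. A memo table ensures that a node shared by several parents is converted once, and every new node is built after its children, so nothing has to be mutated afterwards.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Node;
using NodePtr = std::shared_ptr<const Node>;  // nodes are immutable once built

struct Node {
    std::string label;
    std::vector<NodePtr> children;
};

using Memo = std::unordered_map<const Node*, NodePtr>;

// Post-order conversion: children first, then the node itself.
NodePtr convert(const NodePtr& node, Memo& done) {
    assert(node != nullptr);
    auto hit = done.find(node.get());
    if (hit != done.end()) return hit->second;  // shared node, already converted

    std::vector<NodePtr> new_children;
    for (const NodePtr& child : node->children)
        new_children.push_back(convert(child, done));

    // Placeholder conversion rule: copy the node and mark its label.
    NodePtr result = std::make_shared<Node>(
        Node{node->label + "'", std::move(new_children)});
    done.emplace(node.get(), result);
    return result;
}
```

Because the memo is keyed by the original node's address, sharing in the input DAG is reproduced in the output DAG. A conversion that splits one node into several would return the top node of the replacement subgraph from the same place.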
This looks like a perfect application of Zippers. They have all the capabilities you described as needed and can produce the edited 'new' DAG. There are also a number of libraries that ease the search and replace capability using predicate threads.
I've used zippers when working with OWL ontologies defined in nested vector or map trees.
Another option would be to take a look at Walkers although I've found these a bit more tedious to use.
I want to add several objects in a structure which allows this:
Insertion of objects, immediately ordering the entire structure on add so I have a descending ordering of an int;
Being able to change the int by which the objects are ordered (I mean: say that object number 2 now has an int of 5, so the structure is reordered);
Fast structure, because it will be completely iterated 60 times a second;
Being able to directly access the objects by position;
Only needs to be iterated from top to bottom: higher INT to lower INT
No deletion required, but could become useful later on.
Some indications on how to use the structure would be great, since I don't know much about the C++ standard libraries.
All of the operations that you've listed (except for lookup by index) can be supported by a standard binary search tree, keyed by integer values. This gives you the ability to iterate over the elements in sorted order and to keep the objects sorted during any insertion. As @njr mentioned, you can also update priorities by removing objects from the binary search tree, changing their priority, then reinserting them into the binary search tree.
To support random access by index, you should consider looking into order statistic trees, a variant on binary search trees that in addition to all other operations supports very fast (O(log n)) lookup of an element by its index. That is, you could very efficiently query for the 15th element in the sorted sequence, or the 17th, etc. Order statistic trees aren't part of the C++ standard libraries, but this older question contains answers that can link you to an implementation.
Use a set or a map
For requirement 1 - provide a custom sorting function
For 2 - remove the item and add it again (or provide a wrapper that does that)
3 doesn't make sense (How big is the list, how fast is the processor/ram)
For 4 - Are you sure you need that? It seems to be kind of weird to try to access it by position when the position can change suddenly (some item was added or removed)
5 - same as 1
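Points 1, 2 and 5 can be sketched with std::set and a custom comparator. The Item struct, its id field (used as a tie-breaker so that two objects with equal priority can coexist) and the helper names are assumptions made for this example.

```cpp
#include <cassert>
#include <set>
#include <vector>

struct Item {
    int id;        // hypothetical unique identifier
    int priority;  // the int the structure is ordered by
};

// Point 1: custom ordering, highest priority first; id breaks ties.
struct ByPriorityDesc {
    bool operator()(const Item& a, const Item& b) const {
        if (a.priority != b.priority) return a.priority > b.priority;
        return a.id < b.id;
    }
};

using ItemSet = std::set<Item, ByPriorityDesc>;

// Point 2: elements of a set are const, so remove the item and add it again.
bool change_priority(ItemSet& items, const Item& old_item, int new_priority) {
    auto it = items.find(old_item);
    if (it == items.end()) return false;
    Item updated = *it;
    items.erase(it);
    updated.priority = new_priority;
    const bool inserted = items.insert(updated).second;
    assert(inserted);  // ids are assumed unique, so no clash is expected
    return inserted;
}

// Point 5: plain iteration already runs from higher int to lower int.
std::vector<int> ids_in_order(const ItemSet& items) {
    std::vector<int> ids;
    for (const Item& item : items) ids.push_back(item.id);
    return ids;
}
```

Insertion, erasure and lookup are O(log n) each; access by position (point 4) is not covered by std::set and would need something like the order statistic tree mentioned in the other answer.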
I'm doing some work with XML in C++, and I would like to know what the best data structure to store XML data is. Please don't just tell me what you've heard of in the past; I'd like to know what the most efficient structure is. I would like to be able to store any arbitrary XML tree (assuming it is valid), with minimal memory overhead and lookup time.
My initial thought was a hash, but I couldn't figure out how to handle multiple children of the same tag, as well as how attributes would be handled.
Qt solutions are acceptable, but I'm more concerned with the overall structure than the specific library. Thanks for your input.
The most efficient structure would be a set of classes derived from the DTD or the Schema that defines the particular XML instances you intend to process. (Surely you aren't going to process arbitrary XML?) Tags are represented by classes. Single children can be represented by fields. Children with min...max arity can be represented by a field containing an array. Children with indefinite arity can be represented by a dynamically allocated array. Attributes and children can be stored as fields, often with an inferred data type (if an attribute represents a number, why store it as a string?). Using this approach, you can often navigate to a particular place in an XML document using native C++ access paths, e.g.,
root->tag1.itemlist[1]->description.
All of these classes can be generated automatically from the Schema or the DTD. There are tools to do this.
Altova offers some. I have no specific experience with this (although I have built similar tools for Java and COBOL).
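To make the idea of schema-derived classes concrete, here is a hand-written sketch for an invented schema in which a root element contains one tag1 element, which in turn contains any number of item elements with a description and a numeric quantity. None of these names come from a real DTD or from a generator tool; they only mirror the access path shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// One class per tag of the (hypothetical) schema.
struct Item {
    std::string description;
    int quantity;  // numeric attribute stored as a number, not as a string
};

struct Tag1 {
    // Child with indefinite arity: a dynamically sized array.
    std::vector<std::unique_ptr<Item>> itemlist;
};

struct Root {
    Tag1 tag1;  // single child: a plain field
};

// Build a small instance by hand; a parser would normally fill this in.
std::unique_ptr<Root> make_sample_root() {
    auto root = std::make_unique<Root>();
    root->tag1.itemlist.push_back(std::make_unique<Item>(Item{"first item", 1}));
    root->tag1.itemlist.push_back(std::make_unique<Item>(Item{"second item", 5}));
    return root;
}

// Bounds-checked form of root->tag1.itemlist[i]->description.
const std::string& description_at(const Root& root, std::size_t i) {
    assert(i < root.tag1.itemlist.size());
    return root.tag1.itemlist[i]->description;
}
```

Navigation is then ordinary member access with no string comparison or hashing at lookup time, at the cost of only working for documents that follow this one schema.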
You should first determine what the requirement for efficiency is, in terms of storage, speed etc. in concrete numbers. Without knowing this information, you can't tell if your implementation satisfies the requirement.
And, if you have this requirement, you will probably find that the DOM satisfies it, and has the advantage of zero code to maintain.
It will be a nightmare for future programmers as they wonder why someone wrote an alternate implementation of the DOM.
Actually, pretty much anything you do will just be a DOM implementation, but possibly incomplete, and with optimizations for indexing etc. My personal feeling is that re-inventing the wheel should be the last thing you consider.
there is a C++ XML library already built: xerces.
http://xerces.apache.org/xerces-c/install-3.html
there are some tree structures in \include\boost-1_46_1\boost\intrusive\
there is a red-black and an avl tree, but not having looked at those in a long time, I don't know if those are especially usable, I think not.
XML is a tree structure. you don't know what the structure is going to be unless it has a DTD defined and included in a <!DOCTYPE> declaration (although the validator at validrome breaks on !DOCTYPEs and it shouldn't).
see http://w3schools.com/xml/xml_tree.asp for a tree example.
you may get something that doesn't follow a DTD or schema. totally unstructured. like this:
<?xml version="1.0"?>
<a>
<b>hello
<e b="4"/>
<c a="mailto:jeff@nowhere.com">text</c>
</b>
<f>zip</f>
<z><b /><xy/></z>
<zook flag="true"/>
<f><z><e/></z>random</f>
</a>
I know that queriable XML databases do exist, but I don't know much about them, except that they can handle unstructured data.
PHP has an XML parser which sticks it into what PHP calls an array (not quite like a C/C++ array, because the arrays can have arrays), you can tinker with it to see an example of what an XML data structure should have in it.
what you basically want is a very flexible tree where the root pointer points to a list. each of those nodes in the list contains a pointer that can point to a list. it should be an ordered list, so make it a std::vector. If your purpose is to be able to remove data, use a std::list instead of a std::vector - it's ordered, while having the capability of easy manipulation.
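word of warning: .erase(iterator i) erases only the element at i and returns an iterator to the element that followed it; any other iterator to the erased element becomes invalid.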
.erase(iterator i1, iterator i2) erases everything from i1 up to but not including i2.
.end() is an iterator that points 1 after the end of the list, essentially at nothing.
.begin() is an iterator that points to the start of the list.
learn to use for_each(start, end, function) from <algorithm>, or use a regular for statement.
iterators are like pointers. treat them as such.
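#include <iostream>
#include <list>
#include <string>
using namespace std;
// Minimal stand-in for a node class; a real one would also hold children and attributes.
struct node {
    string data;
    const string& getData() const { return data; }
};
int main() {
    list<node> nodelist = {node{"a"}, node{"b"}};
    // Walk the list from begin() up to, but not including, end().
    for (list<node>::iterator nli = nodelist.begin(); nli != nodelist.end(); ++nli) {
        cout << nli->getData() << endl;
    }
}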
the nodes need to have an optional list of attributes and note that the DTD could possibly be contained within the XML document, so you have to be able to read it to parse the document (or you could throw it away). you may also run into XML Schema, the successor of the DTD.
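The node layout described in this answer (an ordered list of children per node, plus an optional list of attributes) could be sketched like this. XmlNode, add_child and count_elements are illustrative names only, and the sketch relies on std::list accepting an incomplete element type, which is guaranteed from C++17 on.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

struct XmlNode {
    std::string name;                                              // tag name
    std::string text;                                              // character data, may be empty
    std::vector<std::pair<std::string, std::string>> attributes;   // optional
    std::list<XmlNode> children;                                   // document order
};

// Append a child and return a reference to it; std::list never moves its
// elements, so the reference stays valid while further children are added.
XmlNode& add_child(XmlNode& parent, const std::string& name,
                   const std::string& text = "") {
    assert(!name.empty());  // an element always has a name
    parent.children.push_back(XmlNode{name, text, {}, {}});
    return parent.children.back();
}

// Count elements with the given tag name anywhere in the subtree.
std::size_t count_elements(const XmlNode& node, const std::string& name) {
    std::size_t n = (node.name == name) ? 1 : 0;
    for (const XmlNode& child : node.children) n += count_elements(child, name);
    return n;
}
```

Because children are kept in a sequence rather than a hash keyed by tag, several children with the same tag (such as the two f elements in the sample document above) are stored without any special handling.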
I think the most efficient data structure to store XML in is probably vtd-xml, which uses an array of longs instead of lots of interconnected structs/classes. The main idea is that structs/classes rely on many small memory allocations, which incur severe overhead under normal circumstances. See this article for further detail.
http://soa.sys-con.com/node/250512
I'm not sure what the most efficient method is, but since the DOM already exists why re-invent the wheel?
It may make sense to hash all nodes by name for lookup, but you should still use the DOM as the basic representation.
I've been exploring this problem myself. And, these are my thoughts.
a) every element in xml is either a node or a (key, value) pair.
b) store every element in a Hash. assign each element a type i.e "node","key,value".
c) every element will have a parent. assign a value to each of them.
d) every element may or may not have children/references. store the children in a btree, which will define the references.
Search time for any key will be O(1). A reference traversal can have a list of all the children inside the element.
Please review and suggest what I've missed.
Just use the DOM to store the parsed XML file. Surely there are C++ DOM libraries.
You can query DOM with XPath expressions.
I find that both set and map are implemented as a tree. Is set a plain binary search tree, while map is a self-balancing binary search tree such as a red-black tree? I am confused about the difference in implementation. The differences I can imagine are as follows:
1) element in set has only one value(key), element in map has two values.
2) set is used to store and fetch elements by itself. map is used to store and fetch elements via key.
What else are important?
Maps and sets have almost identical behavior and it's common for the implementation to use the exact same underlying technique.
The only important difference is map doesn't use the whole value_type to compare, just the key part of it.
Usually you'll know right away which you need: if you just have a bool for the "value" argument to the map, you probably want a set instead.
Set is a discrete mathematics concept that, in my experience, pops up again and again in programming. The STL set class is a relatively efficient way to keep track of sets where the most common operations are insert/remove/find.
Maps are used where objects have a unique identity that is small compared to their entire set of attributes. For example, a web page can be defined as a URL and a byte stream of contents. You could put that byte stream in a set, but the binary search process would be extremely slow (since the contents are much bigger than the URL) and you wouldn't be able to look up a web page if its contents change. The URL is the identity of the web page, so it is the key of the map.
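A map behaves much like a set of std::pair<const Key, Value> elements that are ordered and compared by the key alone; implementations commonly reuse the same tree code for both.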
The set is used when you want an ordered list to quickly search for an item, basically, while a map is used when you want to retrieve a value given its key.
In both cases, the key (for map) or value (for set) must be unique. If you want to store multiple values that are the same, you would use multimap or multiset.
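A short sketch of how the three cases differ in use; the function names and the URL example are chosen for illustration and follow the web page analogy from the earlier answer.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// set: the element is its own key. Returns true only the first time a URL is seen.
bool mark_visited(std::set<std::string>& visited, const std::string& url) {
    return visited.insert(url).second;
}

// map: look up a value by key. Storing again under the same key replaces
// the contents instead of adding a second entry.
void store_page(std::map<std::string, std::string>& pages,
                const std::string& url, const std::string& contents) {
    pages[url] = contents;
    assert(pages.count(url) == 1);  // keys in a map are unique
}

// multiset: equal elements may be stored more than once.
// Returns how many copies of the tag are present after insertion.
std::size_t add_tag(std::multiset<std::string>& tags, const std::string& tag) {
    tags.insert(tag);
    return tags.count(tag);
}
```

In the map case only the URL takes part in comparisons, so the contents can change freely without affecting where the entry sits in the tree.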