What container is the best to manage XML data? - c++

Imagine I have XML data:
<tag> value</tag> etc ....
<othertag> value value2 value_n </othertag>
etc ....
What is the best container to manage this information?
A simple vector, a list, something else?
I'm going to do simple insertions, searches and deletions.
Of course, the code is going to be somewhat crude.
I know that there will be XML specific utilities, but I'd want to know your opinion.

"The best container" to manage XML is usually a DOM tree, since it can store all the information stored in the XML source and present it in a code-friendly way; still, depending on what you want to do to this data, it can be overkill.
On the other hand, since what you want to do are actually generic manipulations of the XML tree, I think it could be your best option; grab a good XML parser that produces a DOM tree and use it.
A personal note: please, don't reinvent an XML square wheel yet another time - there are enough broken XML parsers around, we don't need yet another one. Use a well known, standard conforming XML parser (e.g. Xerces-C++) that produces your DOM tree and be happy with it.

Well, that depends heavily on your XML data. Do you have a list of objects with no special identifiers? Or do you want to be able to quickly identify an object in your list (i.e. have a mapping)?
You can always use http://linuxsoftware.co.nz/cppcontainers.html to make a decision. The flowchart at the bottom of the page is particularly useful.

Related

Can a map be used as a tree?

eg. std::map<Item, std::vector<Item> >.
Would that be able to serve as a "quick-and-dirty" tree structure (with some helper functions on top, and given that less is implemented for Item), given that there's none in std/boost?
Would a std::unordered_map be better suited/more useful/beneficial? That requires a hash instead of a comparison, though, which can be harder to implement.
I can see one issue, though: finding the parent/owner means brute-forcing through the entire map (although that might be best stored in a separate structure if needed).
Another thing I'm not so fond of is the sort of dual meaning of a map entry with an Item that has an empty child list.
Can a map be used as a tree?
The situation is the inverse: std::map is typically implemented internally using a balanced tree, so a tree can be (and is) used as a map.
Neither map nor unordered_map is useful for implementing a general tree structure. Only if your intention is to use the tree as a map would it be useful to use these structures (because they are maps, which is what is desirable in that scenario).
You can absolutely represent a tree in this manner; whether it's a good idea in a given situation depends entirely on which operations you need to be fast, which operations you're okay with being slow, and what your space requirements are.
(And of course in many applications the answer to all of the above may be "I don't care," in which case any implementation is probably fine.)
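As a rough illustration of the std::map<Item, std::vector<Item> > idea, here is a minimal sketch. It assumes Item is simply a std::string, and the helper names add_child and find_parent are hypothetical. It also shows the brute-force parent lookup the question worries about.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Assumption: Item is a plain string here; any type with operator< would do.
using Item = std::string;
using Tree = std::map<Item, std::vector<Item>>;

// Hypothetical helper: register child under parent. Both get a map entry,
// so a leaf is simply a key with an empty child list.
void add_child(Tree& tree, const Item& parent, const Item& child) {
    tree[parent].push_back(child);
    tree[child];  // operator[] creates an empty child list if the key is absent
}

// Hypothetical helper: brute-force parent lookup, linear in the number of edges.
// Returns a pointer to the parent key, or nullptr for a root or unknown item.
const Item* find_parent(const Tree& tree, const Item& child) {
    for (const auto& entry : tree) {
        const std::vector<Item>& kids = entry.second;
        if (std::find(kids.begin(), kids.end(), child) != kids.end()) {
            return &entry.first;
        }
    }
    return nullptr;
}
```

If parent lookups turn out to matter, a second map from child to parent would avoid the scan, at the cost of keeping the two structures in sync.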

Design for customizable string filter

Suppose I have tons of filenames in my_dir/my_subdir, formatted in some way:
data11_7TeV.00179691.physics_Egamma.merge.NTUP_PHOTON.f360_m796_p541_tid319627_00
data11_7TeV.00180400.physics_Egamma.merge.NTUP_PHOTON.f369_m812_p541_tid334757_00
data11_7TeV.00178109.physics_Egamma.merge.D2AOD_DIPHO.f351_m765_p539_p540_tid312017_00
For example data11_7TeV is the data_type, 00179691 the run number, NTUP_PHOTON the data format.
I want to write an interface to do something like this:
dataset = DataManager("my_dir/my_subdir").filter_type("data11_7TeV").filter_run("> 00179691").filter_tag("m = 796");
// don't do the filtering yet; be lazy
cout << dataset.count(); // count is an action, do the filtering
vector<string> dataset_list = dataset.get_list(); // don't repeat the filtering
dataset.save_filter("file.txt", "ALIAS"); // save the filter (not the filenames), for example save the regex
dataset2 = DataManagerAlias("file.txt", "ALIAS"); // get the saved filter
cout << dataset2.filter_tag("p = 123").count();
I want lazy behaviour, for example no real filtering has to be done before any action like count or get_list. I don't want to redo the filtering if it is already done.
I'm just learning about design patterns, and I think I can use:
an abstract base class AbstractFilter that implements the filter* methods
a factory to decide, from the called method, which decorator to use
every time I call a filter* method I return a decorated class, for example:
AbstractFilter::filter_run(string arg) {
decorator = factory.get_decorator_run(arg); // if arg is "> 00179691" returns FilterRunGreater(00179691)
return decorator(this);
}
a proxy that builds a regex to filter the filenames, but doesn't do the filtering
I'm also learning jQuery and I'm using a similar chaining mechanism.
Can someone give me some hints? Is there some place where a design like this is explained? The design must be very flexible, in particular to handle new formats in the filenames.
I believe you're over-complicating the design-pattern aspect and glossing over the underlying matching/indexing issues. Getting the full directory listing from disk can be expected to be orders of magnitude more expensive than the in-RAM filtering of filenames it returns, and the former needs to have completed before you can do a count() or get_list() on any dataset (though you could come up with some lazier iterator operations over the dataset).
As presented, the real functional challenge could be in indexing the filenames so you can repeatedly find the matches quickly. But, even that's unlikely as you presumably proceed from getting the dataset of filenames to actually opening those files, which is again orders of magnitude slower. So, optimisation of the indexing may not make any appreciable impact to your overall program's performance.
But let's say you read all the matching directory entries into an array A.
Now, for filtering, it seems your requirements can generally be met using std::multimap find(), lower_bound() and upper_bound(). The most general way to approach it is to have separate multimaps for data type, run number, data format, p value, m value, tid etc. that map to a list of indices in A. You can then use existing STL algorithms to find the indices that are common to the results of your individual filters.
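A minimal sketch of that indexing idea, under some assumptions: each property gets its own std::multimap from an integer value to a position in A, and the helper names greater_than, equal_to and both are hypothetical. It uses upper_bound() and equal_range() for the lookups and std::set_intersection to combine results.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <map>
#include <vector>

// One index: maps a property value (e.g. run number) to positions in the array A.
using Index = std::multimap<int, std::size_t>;

// Positions of all entries whose key is strictly greater than bound, sorted ascending.
std::vector<std::size_t> greater_than(const Index& index, int bound) {
    std::vector<std::size_t> out;
    for (auto it = index.upper_bound(bound); it != index.end(); ++it) {
        out.push_back(it->second);
    }
    std::sort(out.begin(), out.end());
    return out;
}

// Positions of all entries whose key equals value, sorted ascending.
std::vector<std::size_t> equal_to(const Index& index, int value) {
    std::vector<std::size_t> out;
    const auto range = index.equal_range(value);
    for (auto it = range.first; it != range.second; ++it) {
        out.push_back(it->second);
    }
    std::sort(out.begin(), out.end());
    return out;
}

// Combine two filter results: positions present in both (inputs must be sorted).
std::vector<std::size_t> both(const std::vector<std::size_t>& a,
                              const std::vector<std::size_t>& b) {
    std::vector<std::size_t> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```

With a run-number index and an "m" tag index, a query like run > 179691 and m = 812 becomes both(greater_than(run_index, 179691), equal_to(m_index, 812)).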
There are a lot of optimisations possible if you happen to have unstated insights / restrictions re your data and filtering needs (which is very likely). For example:
if you know a particular filter will always be used, and immediately cuts the potential matches down to a manageable number (e.g. < ~100), then you could use it first and resort to brute force searches for subsequent filtering.
Another possibility is to extract properties of individual filenames into a structure: std::string data_type; std::vector<int> p; etc., then write an expression evaluator supporting predicates like "p includes 924 and data_type == 'XYZ'", though by itself that lends itself to brute-force comparisons rather than faster index-based matching.
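The "extract properties into a structure" option could look roughly like the sketch below. The DatasetName record and the parse_name and has_tag helpers are hypothetical, and the six dot-separated fields are only inferred from the three sample names in the question, so other formats would need their own parser.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one dataset name; the field layout is inferred from the
// sample names in the question and may not hold for other formats.
struct DatasetName {
    std::string data_type;          // e.g. "data11_7TeV"
    long run = 0;                   // e.g. 179691
    std::string format;             // e.g. "NTUP_PHOTON"
    std::vector<std::string> tags;  // e.g. "f360", "m796", "p541", ...
};

// Split text on a single-character delimiter.
std::vector<std::string> split(const std::string& text, char delim) {
    std::vector<std::string> parts;
    std::istringstream in(text);
    std::string part;
    while (std::getline(in, part, delim)) parts.push_back(part);
    return parts;
}

// Parse "type.run.stream.step.format.tags"; returns false if the field count differs.
// std::stol throws if the run field is not numeric.
bool parse_name(const std::string& name, DatasetName& out) {
    const std::vector<std::string> fields = split(name, '.');
    if (fields.size() != 6) return false;
    out.data_type = fields[0];
    out.run = std::stol(fields[1]);
    out.format = fields[4];
    out.tags = split(fields[5], '_');
    return true;
}

// Brute-force predicate: does the name carry the given tag, e.g. "m796"?
bool has_tag(const DatasetName& d, const std::string& tag) {
    for (const std::string& t : d.tags) {
        if (t == tag) return true;
    }
    return false;
}
```

Filtering then reduces to plain predicates over the parsed records, for example d.run > 179691 && has_tag(d, "m796").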
I know you said you don't want to use external libraries, but an in-memory database and SQL-like query ability may save you a lot of grief if your needs really are at the more elaborate end of the spectrum.
I would use a strategy pattern. Your DataManager is constructing a DataSet type, and the DataSet has a FilteringPolicy assigned. The default can be a NullFilteringPolicy, which means no filters. If the DataSet member function filter_type(string t) is called, it swaps out the filtering policy class with a new one. The new one can be factory-constructed from the filter_type parameter. Methods like filter_run() can be used to add filtering conditions onto the FilteringPolicy. In the NullFilteringPolicy case they are just no-ops. This seems straightforward to me; I hope this helps.
EDIT:
To address the method chaining you simply need to return *this; e.g. return a reference to the DataSet class. This means you can chain DataSet methods together. It's what the c++ iostream libraries do when you implement operator>> or operator<<.
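A minimal sketch of the chaining plus lazy evaluation, with some liberties: instead of separate FilteringPolicy classes it stores std::function predicates, and the DataSet class, filter_contains and the evaluations counter are all hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical DataSet: filters are queued as predicates and applied only on demand.
class DataSet {
public:
    explicit DataSet(std::vector<std::string> names) : names_(std::move(names)) {}

    // Queue a filter and return *this so calls can be chained.
    DataSet& filter(std::function<bool(const std::string&)> pred) {
        filters_.push_back(std::move(pred));
        cached_ = false;  // a new filter invalidates any earlier result
        return *this;
    }

    // Convenience filter (assumption: a substring match stands in for real parsing).
    DataSet& filter_contains(const std::string& text) {
        return filter([text](const std::string& n) {
            return n.find(text) != std::string::npos;
        });
    }

    // Actions: these trigger the filtering, at most once per set of filters.
    std::size_t count() { return get_list().size(); }

    const std::vector<std::string>& get_list() {
        if (!cached_) {
            result_.clear();
            for (const std::string& n : names_) {
                bool keep = true;
                for (const auto& f : filters_) {
                    if (!f(n)) { keep = false; break; }
                }
                if (keep) result_.push_back(n);
            }
            cached_ = true;
            ++evaluations_;
        }
        return result_;
    }

    int evaluations() const { return evaluations_; }  // exposed only to show the caching

private:
    std::vector<std::string> names_;
    std::vector<std::function<bool(const std::string&)>> filters_;
    std::vector<std::string> result_;
    bool cached_ = false;
    int evaluations_ = 0;
};
```

Calling ds.filter_contains("data11_7TeV").filter_contains("NTUP_PHOTON") does no work by itself; the first count() or get_list() runs the filters, and later calls reuse the cached result until another filter is added.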
First of all, I think that your design is pretty smart and lends itself well to the kind of behavior you are trying to model.
Anyway, my understanding is that you are trying to build a sort of "Domain Specific Language", whereby you can chain "verbs" (the various filtering methods) representing actions on, or connecting, "entities" (where the variability is represented by the different naming formats that could exist, although you do not say anything about this).
In this respect, a very interesting discussion is found in Martin Fowler's book "Domain Specific Languages". Just to give you a taste of what it is about, here you can find an interesting discussion about the "Method Chaining" pattern, defined as:
“Make modifier methods return the host object so that multiple modifiers can be invoked in a single expression.”
As you can see, this pattern describes the very chaining mechanism you are positing in your design.
Here you have a list of all the patterns that were found interesting in defining such DSLs. Again, you will easily find there several specialized patterns that you are also implying in your design, or describing by way of more generic patterns (like the decorator). A few of them are: Regex Table Lexer, Method Chaining, Expression Builder, etc., and many more that could help you further specify your design.
All in all, I could add my grain of salt by saying that I see a place for a "command processor" pattern in your specification, but I am pretty confident that by deploying the powerful abstractions that Fowler proposes you will be able to come up with a much more specific and precise design, covering aspects of the problem that right now are simply hidden by the "generality" of the GoF pattern set.
It is true that this could be "overkill" for a problem like the one you are describing, but as an exercise in pattern oriented design it can be very insightful.
I'd suggest starting with the boost iterator library - eg the filter iterator.
(And, of course, boost includes a very nice regex library.)

Most efficient data structure to store an XML tree in C++

I'm doing some work with XML in C++, and I would like to know what the best data structure to store XML data is. Please don't just tell me what you've heard of in the past; I'd like to know what the most efficient structure is. I would like to be able to store any arbitrary XML tree (assuming it is valid), with minimal memory overhead and lookup time.
My initial thought was a hash, but I couldn't figure out how to handle multiple children of the same tag, as well as how attributes would be handled.
Qt solutions are acceptable, but I'm more concerned with the overall structure than the specific library. Thanks for your input.
The most efficient structure would be a set of classes derived from the DTD or the Schema that defines the particular XML instances you intend to process. (Surely you aren't going to process arbitrary XML?) Tags are represented by classes. Single children can be represented by fields. Children with min...max arity can be represented by a field containing an array. Children with indefinite arity can be represented by a dynamically allocated array. Attributes and children can be stored as fields, often with an inferred data type (if an attribute represents a number, why store it as a string?). Using this approach, you can often navigate to a particular place in an XML document using native C++ access paths, e.g.,
root->tag1.itemlist[1]->description.
All of this can be generated automatically from the Schema or the DTD. There are tools to do this.
Altova offers some. I have no specific experience with this (although I have built similar tools for Java and COBOL).
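To make the access path above concrete, here is a hand-written sketch of what such classes might look like. The schema (a root with one tag1 child holding any number of item elements, each with a numeric count attribute and a description child) is an assumption invented for illustration, as is the make_sample builder standing in for generated parsing code.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical classes for an assumed schema:
// <root><tag1><item count="..."><description>...</description></item>...</tag1></root>
struct Item {
    int count = 0;            // numeric attribute stored as a number, not a string
    std::string description;  // single child element stored as a field
};

struct Tag1 {
    std::vector<std::unique_ptr<Item>> itemlist;  // children with indefinite arity
};

struct Root {
    Tag1 tag1;  // single child stored as a field
};

// Hypothetical builder standing in for generated parsing code.
std::unique_ptr<Root> make_sample() {
    std::unique_ptr<Root> root(new Root);
    for (int i = 0; i < 3; ++i) {
        std::unique_ptr<Item> item(new Item);
        item->count = i;
        item->description = "item " + std::to_string(i);
        root->tag1.itemlist.push_back(std::move(item));
    }
    return root;
}
```

With these types, root->tag1.itemlist[1]->description is an ordinary member access with no string lookups at run time.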
You should first determine what the requirement for efficiency is, in terms of storage, speed etc. in concrete numbers. Without knowing this information, you can't tell if your implementation satisfies the requirement.
And, if you have this requirement, you will probably find that the DOM satisfies it, and has the advantage of zero code to maintain.
It will be a nightmare for future programmers as they wonder why someone wrote an alternate implementation of the DOM.
Actually, pretty much anything you do will just be a DOM implementation, but possibly incomplete, and with optimizations for indexing etc. My personal feeling is that re-inventing the wheel should be the last thing you consider.
there is a C++ XML library already built: xerces.
http://xerces.apache.org/xerces-c/install-3.html
there are some tree structures in \include\boost-1_46_1\boost\intrusive\
there is a red-black and an avl tree, but not having looked at those in a long time, I don't know if those are especially usable, I think not.
XML is a tree structure. you don't know what the structure is going to be unless it has a DTD defined and included in a <!DOCTYPE> declaration (although the validator at validrome breaks on !DOCTYPEs and it shouldn't).
see http://w3schools.com/xml/xml_tree.asp for a tree example.
you may get something that doesn't follow a DTD or schema. totally unstructured. like this:
<?xml version="1.0"?>
<a>
<b>hello
<e b="4"/>
<c a="mailto:jeff#nowhere.com">text</c>
</b>
<f>zip</f>
<z><b /><xy/></z>
<zook flag="true"/>
<f><z><e/></z>random</f>
</a>
I know that queriable XML databases do exist, but I don't know much about them, except that they can handle unstructured data.
PHP has an XML parser which sticks it into what PHP calls an array (not quite like a C/C++ array, because the arrays can have arrays), you can tinker with it to see an example of what an XML data structure should have in it.
what you basically want is a very flexible tree where the root pointer points to a list. each of those nodes in the list contains a pointer that can point to a list. it should be an ordered list, so use an ordered sequence container. If your purpose is to be able to remove data, a std::list (as in the code below) is a reasonable choice: it's ordered, while having the capability of easy manipulation.
word of warning: .erase(iterator i) erases only the element at i and invalidates i; continue with the iterator that erase returns.
.erase(iterator i1, iterator i2) erases everything from i1 up to but not including i2.
.end() is an iterator that points 1 after the end of the list, essentially at nothing.
.begin() is an iterator that points to the start of the list.
learn to use for_each(start, end, function) from <algorithm>, or use a regular for statement.
iterators are like pointers. treat them as such.
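#include <iostream>
#include <list>
#include <string>
using namespace std;
// Minimal stand-in node type so the loop compiles; a real node would also hold children and attributes.
struct node {
    string data;
    const string& getData() const { return data; }
};
int main() {
    list<node> nodelist;
    nodelist.push_back(node{"hello"});
    // Walk the list from begin() up to, but not including, end().
    for (list<node>::iterator nli = nodelist.begin(); nli != nodelist.end(); ++nli) {
        cout << nli->getData() << endl;
    }
}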
the nodes need to have an optional list of attributes and note that the DTD could possibly be contained within the XML document, so you have to be able to read it to parse the document (or you could throw it away). you may also run into XML Schema, the successor of the DTD.
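A minimal sketch of such a node, assuming children live in a std::list and attributes in a vector of name/value pairs. The XmlNode type and the count_nodes and remove_children helpers are hypothetical, and DTD or schema handling is left out entirely.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type: ordered children in a std::list, optional attributes.
struct XmlNode {
    std::string name;
    std::string text;
    std::vector<std::pair<std::string, std::string>> attributes;  // may be empty
    std::list<XmlNode> children;  // document order is preserved
};

// Count this node and everything below it.
std::size_t count_nodes(const XmlNode& node) {
    std::size_t total = 1;
    for (const XmlNode& child : node.children) total += count_nodes(child);
    return total;
}

// Remove direct children with the given tag name; returns how many were removed.
std::size_t remove_children(XmlNode& node, const std::string& name) {
    std::size_t removed = 0;
    for (auto it = node.children.begin(); it != node.children.end();) {
        if (it->name == name) {
            it = node.children.erase(it);  // removes one element, returns the next
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```

Because each node owns its children by value, removing a child also discards its whole subtree.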
I think the most efficient data structure to store XML in is probably VTD-XML, which uses arrays of longs instead of lots of interconnected structs/classes. The main idea is that structs/classes rely on many small memory allocations, which incur severe overhead under normal circumstances. See this article for further detail.
http://soa.sys-con.com/node/250512
I'm not sure what the most efficient method is, but since the DOM already exists why re-invent the wheel?
It may make sense to hash all nodes by name for lookup, but you should still use the DOM as the basic representation.
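A sketch of such a side index, assuming a simple Element struct as a hypothetical stand-in for whatever DOM node type the chosen library provides. The index_by_name helper is likewise hypothetical.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a DOM element from whatever library is in use.
struct Element {
    std::string name;
    std::vector<Element> children;
};

using NameIndex = std::unordered_multimap<std::string, const Element*>;

// Walk the tree once and record every element under its tag name.
// The stored pointers stay valid only while the tree is not modified.
void index_by_name(const Element& node, NameIndex& index) {
    index.emplace(node.name, &node);
    for (const Element& child : node.children) index_by_name(child, index);
}
```

The tree remains the primary representation; the index is only a lookup aid and has to be rebuilt (or updated) whenever nodes are added or removed.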
I've been exploring this problem myself. And, these are my thoughts.
a) Every element in XML is either a node or a (key, value) pair.
b) Store every element in a hash. Assign each element a type, i.e. "node" or "key,value".
c) Every element will have a parent. Assign a value to each of them.
d) Every element may or may not have children/references. Store the children in a B-tree, which will define the references.
Search time for any key will be O(1) on average. A reference traversal can have a list of all the children inside the element.
Please review and suggest what I've missed.
Just use a DOM to store the parsed XML file. Surely there is a C++ DOM library.
You can query DOM with XPath expressions.

What is the best data structure to store FIX messages?

What's the best way to store the following message into a data structure for easy access?
"A=abc,B=156,F=3,G=1,H=10,G=2,H=20,G=3,H=30,X=23.50,Y=xyz"
The above consists of key/value pairs of the following:
A=abc
B=156
F=3
G=1
H=10
G=2
H=20
G=3
H=30
X=23.50
Y=xyz
The tricky part is the keys F, G and H. F indicates the number of items in a group whose item consists of G and H.
For example if F=3, there are three items in this group:
Item 1: G=1, H=10
Item 2: G=2, H=20
Item 3: G=3, H=30
In the above example, each item consists of two key/value pairs: G and H. I would like the data structure to be flexible enough to handle an item gaining more key/value pairs. As much as possible, I would like to maintain the order in which they appear in the string.
UPDATE: I would like to store the key/value pairs as strings even though the value often appears as float or other data type, like a map.
May not be what you're looking for, but I'd simply recommend using QuickFIX (quickfixengine.org), which is a very high quality C++ FIX library. It has the type "FIX::Message" which does everything you're looking for, I believe.
I work with FIX a lot in Python and Perl, and I tend to use a dictionary or hash. Your keys should be unique within the message. For C++, you could look at std::map or the STL extension hash_map.
If you have a subset of FIX messages you have to support (most exchanges usually use 10-20 types), you can roll your own classes to parse messages into. If you're trying to be more generic, I would suggest creating something like a FIXChunk class. The entirety of the message could be stored in this class, organized into keys and their values, as well as lists of repeating groups. Each of the repeating groups would itself be a FIXChunk.
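A rough sketch of the FIXChunk idea, using the comma-separated sample from the question rather than real FIX framing. The FIXChunk layout and the parse_fix signature are hypothetical: the caller names the count tag, the tag that starts each item, and the set of tags that belong to an item.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical FIXChunk: ordered key/value fields plus one list of repeating-group items.
struct FIXChunk {
    std::vector<std::pair<std::string, std::string>> fields;  // in message order
    std::vector<FIXChunk> items;                              // each item is itself a chunk
};

// Parse "K=V,K=V,...". Assumptions: count_tag announces the group, every item
// starts with first_tag, and only tags in member_tags belong to an item.
FIXChunk parse_fix(const std::string& message, const std::string& count_tag,
                   const std::string& first_tag,
                   const std::set<std::string>& member_tags) {
    FIXChunk top;
    bool in_group = false;
    std::istringstream in(message);
    std::string token;
    while (std::getline(in, token, ',')) {
        const std::size_t eq = token.find('=');
        if (eq == std::string::npos) continue;  // skip malformed tokens
        const std::string key = token.substr(0, eq);
        const std::string value = token.substr(eq + 1);
        if (key == count_tag) {
            top.fields.emplace_back(key, value);
            in_group = true;
        } else if (in_group && key == first_tag) {
            top.items.emplace_back();  // start the next item
            top.items.back().fields.emplace_back(key, value);
        } else if (in_group && !top.items.empty() && member_tags.count(key) != 0) {
            top.items.back().fields.emplace_back(key, value);
        } else {
            in_group = false;  // any other tag ends the group
            top.fields.emplace_back(key, value);
        }
    }
    return top;
}
```

This keeps values as strings and preserves field order within each chunk; it does not check that the number of items matches the announced count, which a real parser would want to validate.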
A simple solution, but you could use a std::multimap<std::string,std::string> to store the data. That allows you to have multiple values stored under the same key.
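A small sketch of the multimap approach, with hypothetical helper names parse_pairs and values_of. Note the trade-off: the map orders entries by key, so the original message order is lost, and the pairing between each G and its H survives only positionally.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

using FixMap = std::multimap<std::string, std::string>;

// Parse "K=V,K=V,..." into a multimap; repeated keys are all kept.
FixMap parse_pairs(const std::string& message) {
    FixMap out;
    std::istringstream in(message);
    std::string token;
    while (std::getline(in, token, ',')) {
        const std::string::size_type eq = token.find('=');
        if (eq == std::string::npos) continue;  // skip malformed tokens
        out.emplace(token.substr(0, eq), token.substr(eq + 1));
    }
    return out;
}

// All values stored under one key, in insertion order (guaranteed since C++11).
std::vector<std::string> values_of(const FixMap& map, const std::string& key) {
    std::vector<std::string> values;
    const auto range = map.equal_range(key);
    for (auto it = range.first; it != range.second; ++it) {
        values.push_back(it->second);
    }
    return values;
}
```

For the sample message, values_of(m, "G") and values_of(m, "H") each yield three values, and the i-th entries of the two lists belong to the same item.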
In my experience, FIX messages are usually stored either in their original form (as a stream of bytes) or as a complex data structure providing a full API that can handle their intricacies. After all, a FIX message can sometimes represent a tree of data.
The problem with the latter solution is that the transition is expensive in terms of computation cost in high-speed trading systems. If you are building a trading system, you may prefer to lazily calculate the parts of the FIX message that you need, which is admittedly easier said than done.
I am not familiar with efficient open-source implementations; companies like the one I work for usually have proprietary implementations.

Efficient way of extracting specific numerical attributes from XML

The application I work on uses XML for save/restore purposes. Here's an example snippet:
<?xml version="1.0" standalone="yes"?>
<itemSet>
<item handle="2" attribute1="30" attribute2="blah"></item>
<item handle="5" attribute1="27" attribute2="blahblah"></item>
</itemSet>
I want to be able to efficiently pre-process the XML which I read in from the configuration file. In particular, I want to extract the handle values from the example configuration above.
Ideally, I need a function/method to be able to pass in an opaque XML string, and return all of the handle values in a list. For the above example, a list containing 2 and 5 would be returned.
I know there's a regular expression out there that will help, but is it the most efficient way of doing this? String manipulation can be costly, and there may be potentially 1000s of XML strings I would need to process in a configuration file.
You are looking for a stream-oriented XML parser that reads each node in your XML one at a time rather than loading the whole thing into memory.
One of the best known is SAX, the Simple API for XML.
Here's a good article describing why to use SAX and also the specifics of using SAX in C++.
You can think of SAX as an XML parser that only loads the bare minimum into memory and so works well on very large XML documents, compared to the regex or DOM approach, which requires you to load the entire document into memory.
I'd guess a regex of some sort is going to be your best option for efficiency. It's going to be faster than parsing the XML into any sort of structural construct, and as long as you can extract all the information you will need in one pass, it's likely the most efficient method.
It would be hard to beat something like:
/* untested code */
#include <cstdlib>
#include <string>
#include <vector>
using std::string;
// xmlstr is the std::string holding the XML text
const string key = "handle=\"";
size_t pos = 0;
std::vector<int> handles;
while ((pos = xmlstr.find(key, pos)) != string::npos) {
    pos += key.size(); // step past handle=" so atoi starts at the first digit and the search advances
    handles.push_back(atoi(xmlstr.data() + pos));
}
It would be more efficient if handles.reserve() were called with the proper size, or perhaps if handles were a deque or list, depending on how it needs to be used later. This is unsafe code if the XML string might be malformed (before C++11, xmlstr.data() is not guaranteed to be null-terminated, so atoi might go off the end of the array; c_str() avoids that). It also doesn't check that handle isn't the end of a longer attribute name, or indeed whether it is actually an attribute.
Using a regex library for a regular expression like "\\bhandle=\"\\d+\"" is likely to get you results nearly as fast with less likelihood of error. It still doesn't confirm that handle is an attribute; you have to judge if that's likely to be a problem.
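For illustration, a sketch of the regex variant using the C++11 <regex> header (an assumption; Boost.Regex would look much the same). The function name extract_handles is hypothetical, and the pattern adds a capture group around the digits so the value can be read directly.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect every handle="<digits>" value. Like the hand-written loop, this does not
// verify that the match is a real attribute of an element.
// std::stoi throws std::out_of_range if the digit run does not fit in an int.
std::vector<int> extract_handles(const std::string& xml) {
    static const std::regex pattern("\\bhandle=\"(\\d+)\"");
    std::vector<int> handles;
    for (std::sregex_iterator it(xml.begin(), xml.end(), pattern), end; it != end; ++it) {
        handles.push_back(std::stoi((*it)[1].str()));
    }
    return handles;
}
```

The \b word boundary keeps a longer name such as myhandle from matching; whether this is actually fast enough for thousands of strings is something to measure rather than assume.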