Does the sequence order of XSD elements have a significant impact on client aplications?
Imagine that a clinet application that is givven a different order. Should that influence it in any way?
Additionally is there a special case where the order read by the client application is different that the XSD?
Thanks
1) It depends on how the "contract" is written... which of the XSD "compositors" is used (sequence, choice and all):
Sequence: the elements must show in that exact order and repetition as identified by cardinality constraints. Repeat as needed...
Choice: for any occurrence of the choice, exactly one from the options particles may appear. If the choice is repeatable, then it means those options may repeat, but in no particular order.
All: elements may appear in any order; in XSD 1.0 version, each element may occur at most once; XSD 1.1 relaxed this constraint, which means more of the same may occur.
If you're providing a client application with elements using an order different than that prescribed using the associated xsd:sequence then an XSD validator must flag the XML as invalid. It should not mater for an xsd:all or a repeating choice.
2) If you're processing the XML using an XML API, then the order read by the client application is always that in the XML instance (not the XSD). If you're processing the XML using some sort of XSD to code binding technology, such as JAXB or XML serialization in .NET, then as long as the XML is valid, the concept of "ordering" is affected... corresponding references within a list will still reflect that encountered in the XML file; however, in object orientation there is no ordering of fields in a class definition (proprietary annotations/tags may still capture this as metadata for the purposes of serializing it back correctly, but that is just a binding technology "ism" rather than a OO concept).
Then there are the really bad XSD contracts, where semantics of elements are implied by the relative position of the element in the parent's node collection (e.g. first customer is "principal", second would be "co-applicant", etc.) which makes this discussion even fuzzier...
Related
Considering the following xml element:
<elem>
<sub_elem name="first">
<sub_elem name="second">
</elem>
Concerning the property tree populated by boost::property_tree::xml_parser::read_xml, is it guaranteed that sub_elem "first" will be before sub_elem "second"?
The documentation states:
Reads XML from an input stream and translates it to property tree.
However it depends what "translates" means exactly.
From property tree documentation: https://www.boost.org/doc/libs/1_65_1/doc/html/property_tree/container.html I deduce that order of elements in XML file is preserved
It is very important to remember that the property sequence is not
ordered by the key. It preserves the order of insertion. It closely
resembles a std::list. Fast access to children by name is provided via
a separate lookup structure. Do not attempt to use algorithms that
expect an ordered sequence (like binary_search) on a node's children.
Looking at internal code in https://www.boost.org/doc/libs/1_51_0/boost/property_tree/detail/xml_parser_read_rapidxml.hpp one may see traversing over nodes and push_back calls. It should indeed work in simplest way of preserving order.
This is for a university project. I'm writing in C++ but language is irrelevant to the question.
Some context: we are required design an application to perform the following steps:
reads in some data that represents use cases (in my case a simple text file);
Example of a use case in the .txt file:
user
places
order
applies a STRIDE matrix to that data to create misuse cases
don't worry if you're not familiar with STRIDE matrices - its irrelevant to the question
outputs the list of created misuse cases (basically just replaced the actor and the verb);
Example of a generated misuse case:
misactor (internal)
tampers
order
applies some sort of list reduction algorithm to the list;
this is the part where I'm stuck
outputs the reduced list.
So far, I have an array of pointers to UseCase objects containing strings for the entity (eg. user), the relationship/verb (eg. places) and the target (eg. order).
I've got up to the point where I have generated a list of MisUseCase objects, and I now need to apply some sort of reduction/pruning algorithm to this list. Except I have no idea where to even start.
What would be the best way to reduce a list like this to a smaller set of viable/relevant objects?
Thank you in advance.
Imagine I have XML data:
<tag> value</tag> etc ....
<othertag> value value2 value_n </othertag>
etc ....
What is the best container to manage this informaction?
A simple vector, a list, other?
I'm going to do simple insertions, searchings, deletions.
Of course the code are going to be 'some rude'.
I know that there will be XML specific utilities, but I'd want to know your opinion.
"The best container" to manage XML is usually a DOM tree, since it can store all the information stored in the XML source and present it in a code-friendly way; still, depending on what you want to do to this data, it can be overkill.
On the other hand, since what you want to do are actually generic manipulations of the XML tree, I think it could be your best option; grab a good XML parser that produces a DOM tree and use it.
A personal note: please, don't reinvent an XML square wheel yet another time - there are enough broken XML parsers around, we don't need yet another one. Use a well known, standard conforming XML parser (e.g. Xerces-C++) that produces your DOM tree and be happy with it.
Well that highly depends on your xml data. Do have have a list of objects with no special identificators? Or do you want to be able to quickly ID an object in your list (i.e. have a mapping?).
You can always use http://linuxsoftware.co.nz/cppcontainers.html to make a decision. The flowchart at the bottom of the page is particularly useful.
Suppose I've tons of filenames in my_dir/my_subdir, formatted in a some way:
data11_7TeV.00179691.physics_Egamma.merge.NTUP_PHOTON.f360_m796_p541_tid319627_00
data11_7TeV.00180400.physics_Egamma.merge.NTUP_PHOTON.f369_m812_p541_tid334757_00
data11_7TeV.00178109.physics_Egamma.merge.D2AOD_DIPHO.f351_m765_p539_p540_tid312017_00
For example data11_7TeV is the data_type, 00179691 the run number, NTUP_PHOTON the data format.
I want to write an interface to do something like this:
dataset = DataManager("my_dir/my_subdir").filter_type("data11_7TeV").filter_run("> 00179691").filter_tag("m = 796");
// don't to the filtering, be lazy
cout << dataset.count(); // count is an action, do the filtering
vector<string> dataset_list = dataset.get_list(); // don't repeat the filtering
dataset.save_filter("file.txt", "ALIAS"); // save the filter (not the filenames), for example save the regex
dataset2 = DataManagerAlias("file.txt", "ALIAS"); // get the saved filter
cout << dataset2.filter_tag("p = 123").count();
I want lazy behaviour, for example no real filtering has to be done before any action like count or get_list. I don't want to redo the filtering if it is already done.
I'm just learning something about design pattern, and I think I can use:
an abstract base class AbstractFilter that implement filter* methods
factory to decide from the called method which decorator use
every time I call a filter* method I return a decorated class, for example:
AbstractFilter::filter_run(string arg) {
decorator = factory.get_decorator_run(arg); // if arg is "> 00179691" returns FilterRunGreater(00179691)
return decorator(this);
}
proxy that build a regex to filter the filenames, but don't do the filtering
I'm also learning jQuery and I'm using a similar chaining mechanism.
Can someone give me some hints? Is there some place where a design like this is explained? The design must be very flexible, in particular to handle new format in the filenames.
I believe you're over-complicating the design-pattern aspect and glossing over the underlying matching/indexing issues. Getting the full directory listing from disk can be expected to be orders of magnitude more expensive than the in-RAM filtering of filenames it returns, and the former needs to have completed before you can do a count() or get_list() on any dataset (though you could come up with some lazier iterator operations over the dataset).
As presented, the real functional challenge could be in indexing the filenames so you can repeatedly find the matches quickly. But, even that's unlikely as you presumably proceed from getting the dataset of filenames to actually opening those files, which is again orders of magnitude slower. So, optimisation of the indexing may not make any appreciable impact to your overall program's performance.
But, lets say you read all the matching directory entries into an array A.
Now, for filtering, it seems your requirements can generally be met using std::multimap find(), lower_bound() and upper_bound(). The most general way to approach it is to have separate multimaps for data type, run number, data format, p value, m value, tid etc. that map to a list of indices in A. You can then use existing STL algorithms to find the indices that are common to the results of your individual filters.
There are a lot of optimisations possible if you happen to have unstated insights / restrictions re your data and filtering needs (which is very likely). For example:
if you know a particular filter will always be used, and immediately cuts the potential matches down to a manageable number (e.g. < ~100), then you could use it first and resort to brute force searches for subsequent filtering.
Another possibility is to extract properties of individual filenames into a structure: std::string data_type; std::vector<int> p; etc., then write an expression evaluator supporting predicates like "p includes 924 and data_type == 'XYZ'", though by itself that lends itself to brute-force comparisons rather than faster index-based matching.
I know you said you don't want to use external libraries, but an in-memory database and SQL-like query ability may save you a lot of grief if your needs really are at the more elaborate end of the spectrum.
I would use a strategy pattern. Your DataManager is constructing a DataSet type, and the DataSet has a FilteringPolicy assigned. The default can be a NullFilteringPolicy which means no filters. If the DataSet member function filter_type(string t) is called, it swaps out the filter policy class with a new one. The new one can be factory constructed via the filter_type param. Methods like filter_run() can be used to add filtering conditions onto the FilterPolicy. In the NullFilterPolicy case it's just no-ops. This seems straghtforward to me, I hope this helps.
EDIT:
To address the method chaining you simply need to return *this; e.g. return a reference to the DataSet class. This means you can chain DataSet methods together. It's what the c++ iostream libraries do when you implement operator>> or operator<<.
First of all, I think that your design is pretty smart and lends itself well to the kind of behavior you are trying to model.
Anyway, my understanding is that you are trying and building a sort of "Domain Specific Language", whereby you can chain "verbs" (the various filtering methods) representing actions on, or connecting "entities" (where the variability is represented by different naming formats that could exist, although you do not say anything about this).
In this respect, a very interesting discussion is found in Martin Flowler's book "Domain Specific Languages". Just to give you a taste of what it is about, here you can find an interesting discussion about the "Method Chaining" pattern, defined as:
“Make modifier methods return the host object so that multiple modifiers can be invoked in a single expression.”
As you can see, this pattern describes the very chaining mechanism you are positing in your design.
Here you have a list of all the patterns that were found interesting in defining such DSLs. Again, you will be easily find there several specialized patterns that you are also implying in your design or describing as way of more generic patterns (like the decorator). A few of them are: Regex Table Lexer, Method Chaining, Expression Builder, etc. And many more that could help you further specify your design.
All in all, I could add my grain of salt by saying that I see a place for a "command processor" pattern in your specificaiton, but I am pretty confident that by deploying the powerful abstractions that Fowler proposes you will be able to come up with a much more specific and precise design, covering aspect of the problem that right now are simply hidden by the "generality" of the GoF pattern set.
It is true that this could be "overkill" for a problem like the one you are describing, but as an exercise in pattern oriented design it can be very insightful.
I'd suggest starting with the boost iterator library - eg the filter iterator.
(And, of course, boost includes a very nice regex library.)
I'm doing some work with XML in C++, and I would like to know what the best data structure to store XML data is. Please don't just tell me what you've heard of in the past; I'd like to know what the most efficient structure is. I would like to be able to store any arbitrary XML tree (assuming it is valid), with minimal memory overhead and lookup time.
My initial thought was a hash, but I couldn't figure out how to handle multiple children of the same tag, as well as how attributes would be handled.
Qt solutions are acceptable, but I'm more concerned with the overall structure than the specific library. Thanks for your input.
The most efficient structure would a set of classes derived from the DTD or the Schema that defines the particular XML instances you intend to process. (Surely you aren't going to process arbitrary XML?) Tags are represented by classes. Single children can be represented by fields. Childen with min...max arity can be represented by a field containing an array. Children with indefinite arity can be represented by a dynamically allocated array. Attributes and children can be stored as fields, often with an inferred data type (if an attribute represents a number, why store it as a string?). Using this approach, you can often navigate to a particular place in an XML document using native C++ accesspaths, e.g.,
root->tag1.itemlist[1]->description.
All of the can be generated automatically from the Schema or the DTD. There are tools to do this.
Altova offers some. I have no specific experience with this (although I have built similar tools for Java and COBOL).
You should first determine what the requirement for efficiency is, in terms of storage, speed etc. in concrete numbers. Without knowing this information, you can't tell if your implementation satisfies the requirement.
And, if you have this requirement, you will probably find that the DOM satisfies it, and has the advantage of zero code to maintain.
It will be a nightmare for future programmers as they wonder why someone wrote an alternate implementation of the DOM.
Actually, pretty much anything you do will just be a DOM implementation, but possibly incomplete, and with optimizations for indexing etc. My personal feelig is that re-inventing the wheel should be the last thing you consider.
there is a C++ XML library already built: xerces.
http://xerces.apache.org/xerces-c/install-3.html
there are some tree structures in \include\boost-1_46_1\boost\intrusive\
there is a red-black and an avl tree, but not having looked at those in a long time, I don't know if those are especially usable, I think not.
XML is a tree structure. you don't know what the structure is going to be unless it has a DTD defined and included in a (although the validator at validrome breaks on !DOCTYPEs and it shouldn't).
see http://w3schools.com/xml/xml_tree.asp for a tree example.
you may get something that doesn't follow a DTD or schema. totally unstructured. like this:
<?xml version="1.0"?>
<a>
<b>hello
<e b="4"/>
<c a="mailto:jeff#nowhere.com">text</c>
</b>
<f>zip</f>
<z><b /><xy/></z>
<zook flag="true"/>
<f><z><e/></z>random</f>
</a>
I know that queriable XML databases do exist, but I don't know much about them, except that they can handle unstructured data.
PHP has an XML parser which sticks it into what PHP calls an array (not quite like a C/C++ array, because the arrays can have arrays), you can tinker with it to see an example of what an XML data structure should have in it.
what you basically want is a very flexible tree where the root pointer points to a list. each of those nodes in the list contains a pointer that can point to a list. it should be an ordered list, so make it a . If your purpose is to be able to remove data, use a instead of a - it's ordered, while having the capability of easy manipulation.
word of warning: .erase(iterator i) erases everything starting at and after i.
.erase(iterator i1, iterator i2) erases everything from i1 up to but not including i2.
.end() is an iterator that points 1 after the end of the list, essentially at nothing.
.begin() is an iterator that points to the start of the list.
learn to use for_each(start,end,function) { } in or use a regular for statement.
iterators are like pointers. treat them as such.
#include <iterator>
#include <list>
#include <iostream>
using namespace std;
list<class node> nodelist;
list<class node>::iterator nli;
for (nli=nodelist.begin(); nli!=nodelist.end(); nli++) {
cout<<nli->getData()<<endl;
}
the nodes need to have an optional list of attributes and note that the DTD could possibly be contained within the XML document, so you have to be able to read it to parse the document (or you could throw it away). you may also run into XML Schema, the successor of the DTD.
I think the most efficient data struture to store xml in is probably vtd-xml, which uses array of longs instead of lots of interconnected structs/classes. The main idea is that structs/classes are based on small memory allocators which incurs severe overhead in a normal circumstance. See this article for further detail.
http://soa.sys-con.com/node/250512
I'm not sure what the most efficient method is, but since the DOM already exists why re-invent the wheel?
It may make sense to hash all nodes by name for lookup, but you should still use the DOM as the basic representation.
I've been exploring this problem myself. And, these are my thoughts.
a) every element in xml is either a node or a (key, value) pair.
b) store every element in a Hash. assign each element a type i.e "node","key,value".
c)every element will have a parent. assign a value to each of them.
d) every element may, or may, not have children/References. store the children in a btree which will define, the references.
Search time for any key will be O(1).A reference traversal can have a list of all the children inside the element.
Please review and suggest what I've missed.
Just use DOM to store the parsed XML file . Surely there are C++ DOM library .
You can query DOM with XPath expressions.