Efficient way of extracting specific numerical attributes from XML - c++

The application I work on uses XML for save/restore purposes. Here's an example snippet:
<?xml version="1.0" standalone="yes"?>
<itemSet>
<item handle="2" attribute1="30" attribute2="blah"></item>
<item handle="5" attribute1="27" attribute2="blahblah"></item>
</itemSet>
I want to be able to efficiently pre-process the XML which I read in from the configuration file. In particular, I want to extract the handle values from the example configuration above.
Ideally, I need a function/method to be able to pass in an opaque XML string, and return all of the handle values in a list. For the above example, a list containing 2 and 5 would be returned.
I know there's a regular expression out there that will help, but is it the most efficient way of doing this? String manipulation can be costly, and there may be potentially 1000s of XML strings I would need to process in a configuration file.

You are looking for a stream-oriented XML parser that reads each node in your XML one at a time rather than loading the whole thing into memory.
One of the best known is SAX, the Simple API for XML.
Here's a good article describing why to use SAX and also the specifics of using SAX in C++.
You can think of SAX as an XML parser that loads only the bare minimum into memory and so works well on very large XML documents, as compared to the regex or DOM approach, which requires you to load the entire document into memory.

I'd guess a regex of some sort is going to be your best option for efficiency. It's going to be faster than parsing the XML into any sort of structural construct, and as long as you can extract all the information you will need in one pass, it's likely the most efficient method.

It would be hard to beat something like:
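/* untested code */
using std::string;
using std::vector;
const string key = "handle=\"";
size_t pos = 0;
vector<int> handles;
while ((pos = xmlstr.find(key, pos)) != string::npos) {
    pos += key.size();  // skip past handle=" so atoi starts at the first digit
                        // and the next find() does not match the same spot again
    handles.push_back(atoi(xmlstr.data() + pos));
}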
It would be more efficient if handles.reserve() were called with the proper size, or perhaps if handles were a deque or list, depending on how it needs to be used later. This is unsafe code if the XML string might be malformed: before C++11, xmlstr.data() is not guaranteed to be null-terminated, so atoi might run off the end of the array (xmlstr.c_str() avoids that). It also doesn't check that handle isn't the end of a longer attribute name, or indeed whether it is actually an attribute.
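The caveats above can be addressed with a slightly more defensive loop. The following is only a sketch, assuming C++17 (for std::from_chars); the function name extractHandles is made up for illustration, and it still does not verify that the match sits inside a start tag.

```cpp
#include <cctype>
#include <charconv>
#include <cstddef>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical helper: collects the integer values of handle="..." attributes.
std::vector<int> extractHandles(const std::string& xml) {
    static const std::string key = "handle=\"";
    std::vector<int> handles;
    std::size_t pos = 0;
    while ((pos = xml.find(key, pos)) != std::string::npos) {
        // Reject names such as myhandle="..." by requiring whitespace before the match.
        const bool atBoundary =
            pos > 0 && std::isspace(static_cast<unsigned char>(xml[pos - 1]));
        pos += key.size();  // advance so the next find() cannot match the same spot
        if (!atBoundary) continue;
        int value = 0;
        const char* first = xml.data() + pos;
        const char* last = xml.data() + xml.size();
        // from_chars never reads past 'last', so a truncated string is safe.
        if (std::from_chars(first, last, value).ec == std::errc()) {
            handles.push_back(value);
        }
    }
    return handles;
}
```

Values that are not numeric, or that overflow int, are skipped rather than stored as 0.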
Using a regex library for a regular expression like "\\bhandle=\"\\d+\"" is likely to get you results nearly as fast with less likelihood of error. It still doesn't confirm that handle is an attribute; you have to judge if that's likely to be a problem.
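As a rough illustration of the regex route, here is a sketch using std::regex from the C++11 standard library with the pattern quoted above, plus a capture group around the digits. The function name extractHandlesRegex is an assumption for this example, and std::regex is not necessarily the fastest regex engine available.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every value matched by \bhandle="(\d+)".
std::vector<int> extractHandlesRegex(const std::string& xml) {
    // Compiled once; the default ECMAScript grammar supports \b and \d.
    static const std::regex pattern("\\bhandle=\"(\\d+)\"");
    std::vector<int> handles;
    for (std::sregex_iterator it(xml.begin(), xml.end(), pattern), end; it != end; ++it) {
        // Group 1 holds only the digits; stoi throws if they overflow int.
        handles.push_back(std::stoi((*it)[1].str()));
    }
    return handles;
}
```

For the snippet in the question this should yield 2 and 5. Note that \b only rules out a preceding letter, digit or underscore, so a name like my-handle would still match.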

WSDL XSD elements sequence

Does the sequence order of XSD elements have a significant impact on client applications?
Imagine a client application that is given a different order. Should that influence it in any way?
Additionally, is there a special case where the order read by the client application is different from that in the XSD?
Thanks

1) It depends on how the "contract" is written, that is, which of the XSD "compositors" is used (sequence, choice or all):
Sequence: the elements must appear in that exact order, repeated as allowed by the cardinality constraints.
Choice: for any occurrence of the choice, exactly one of the option particles may appear. If the choice is repeatable, the options may repeat in no particular order.
All: elements may appear in any order; in XSD 1.0, each element may occur at most once; XSD 1.1 relaxed this constraint, so the same element may occur more than once.
If you're providing a client application with elements in an order different from that prescribed by the associated xsd:sequence, then an XSD validator must flag the XML as invalid. It should not matter for an xsd:all or a repeating choice.
2) If you're processing the XML using an XML API, then the order read by the client application is always that in the XML instance (not the XSD). If you're processing the XML using some sort of XSD-to-code binding technology, such as JAXB or XML serialization in .NET, then as long as the XML is valid, the concept of "ordering" changes: items within a list still reflect the order encountered in the XML file, but in object orientation there is no ordering of the fields in a class definition (proprietary annotations/tags may still capture this as metadata so that the object can be serialized back correctly, but that is just a binding technology "ism" rather than an OO concept).
Then there are the really bad XSD contracts, where the semantics of elements are implied by the relative position of the element in the parent's node collection (e.g. first customer is "principal", second would be "co-applicant", etc.), which makes this discussion even fuzzier...

Store characters in C++ tree

How is it possible to store character values in a binary tree? I have a CSV file with data, and I have to retrieve that data, search the database, then insert the search results. I did that using the C++ map from the Standard Template Library, but now my task is to do it using a tree structure. I searched the web but haven't found anything about characters, just integers, like this: http://www.cprogramming.com/tutorial/lesson18.html
Thanks.
Edin.

Just use the code from your link and replace int by char.
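To make that concrete, here is a minimal sketch of an unbalanced binary search tree keyed on std::string instead of int (for single characters, swap std::string for char). This is not the code from the linked tutorial; the names TreeNode, bstInsert and bstContains are invented for this example, and C++14 is assumed for std::make_unique.

```cpp
#include <memory>
#include <string>
#include <utility>

// One tree node; each child is owned by its parent.
struct TreeNode {
    std::string key;
    std::unique_ptr<TreeNode> left;
    std::unique_ptr<TreeNode> right;
    explicit TreeNode(std::string k) : key(std::move(k)) {}
};

// Inserts key into the tree rooted at root; duplicates are ignored.
void bstInsert(std::unique_ptr<TreeNode>& root, const std::string& key) {
    std::unique_ptr<TreeNode>* cur = &root;
    while (*cur) {
        if (key < (*cur)->key) cur = &(*cur)->left;
        else if ((*cur)->key < key) cur = &(*cur)->right;
        else return;  // key already present
    }
    *cur = std::make_unique<TreeNode>(key);
}

// Returns true if key is stored in the tree.
bool bstContains(const std::unique_ptr<TreeNode>& root, const std::string& key) {
    const TreeNode* cur = root.get();
    while (cur) {
        if (key < cur->key) cur = cur->left.get();
        else if (cur->key < key) cur = cur->right.get();
        else return true;
    }
    return false;
}
```

The only change from an integer tree is the key type: std::string already provides operator<, so the comparisons stay the same.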

I wouldn't use "my own" binary tree.
I would suggest you use std::map or std::vector (depending on the amount of data and many other factors). Start with vector, as that's the "easiest"; if that can be proven to be "bad", then change it. If you write your code well, it shouldn't change much.
But more importantly, when you say "character", I suspect you actually mean "string". So a vector of a class or struct containing your elements from the CSV file would be a suitable solution.

What container is the best to manage XML data?

Imagine I have XML data:
<tag> value</tag> etc ....
<othertag> value value2 value_n </othertag>
etc ....
What is the best container to manage this information?
A simple vector, a list, or something else?
I'm going to do simple insertions, searches and deletions.
Of course the code is going to be somewhat crude.
I know there are XML-specific utilities, but I'd like to know your opinion.

"The best container" to manage XML is usually a DOM tree, since it can store all the information stored in the XML source and present it in a code-friendly way; still, depending on what you want to do to this data, it can be overkill.
On the other hand, since what you want to do are actually generic manipulations of the XML tree, I think it could be your best option; grab a good XML parser that produces a DOM tree and use it.
A personal note: please, don't reinvent an XML square wheel yet another time - there are enough broken XML parsers around, we don't need yet another one. Use a well known, standard conforming XML parser (e.g. Xerces-C++) that produces your DOM tree and be happy with it.

Well, that highly depends on your XML data. Do you have a list of objects with no special identifiers? Or do you want to be able to quickly identify an object in your list (i.e. have a mapping)?
You can always use http://linuxsoftware.co.nz/cppcontainers.html to make a decision. The flowchart at the bottom of the page is particularly useful.

Most efficient data structure to store an XML tree in C++

I'm doing some work with XML in C++, and I would like to know what the best data structure to store XML data is. Please don't just tell me what you've heard of in the past; I'd like to know what the most efficient structure is. I would like to be able to store any arbitrary XML tree (assuming it is valid), with minimal memory overhead and lookup time.
My initial thought was a hash, but I couldn't figure out how to handle multiple children of the same tag, as well as how attributes would be handled.
Qt solutions are acceptable, but I'm more concerned with the overall structure than the specific library. Thanks for your input.

The most efficient structure would be a set of classes derived from the DTD or the Schema that defines the particular XML instances you intend to process. (Surely you aren't going to process arbitrary XML?) Tags are represented by classes. Single children can be represented by fields. Children with min...max arity can be represented by a field containing an array. Children with indefinite arity can be represented by a dynamically allocated array. Attributes and children can be stored as fields, often with an inferred data type (if an attribute represents a number, why store it as a string?). Using this approach, you can often navigate to a particular place in an XML document using native C++ access paths, e.g.,
root->tag1.itemlist[1]->description.
All of this can be generated automatically from the Schema or the DTD. There are tools to do this.
Altova offers some. I have no specific experience with this (although I have built similar tools for Java and COBOL).
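As a hand-written sketch of what such generated classes might look like (a real generator would derive them from the schema), the types below are shaped so that the access path above compiles. The names Item, Tag1, Root, the id field and makeSampleDocument are all assumptions made for illustration.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// One class per tag; attributes and simple children become typed fields.
struct Item {
    int id = 0;               // numeric attribute stored as a number, not a string
    std::string description;  // single child element with text content
};

struct Tag1 {
    // Child with indefinite arity: a dynamically sized array of items.
    std::vector<std::unique_ptr<Item>> itemlist;
};

struct Root {
    Tag1 tag1;  // single child stored directly as a field
};

// Builds a tiny in-memory document, standing in for what a parser would fill in.
std::unique_ptr<Root> makeSampleDocument() {
    auto root = std::make_unique<Root>();
    for (int id = 1; id <= 2; ++id) {
        auto item = std::make_unique<Item>();
        item->id = id;
        item->description = (id == 1) ? "first item" : "second item";
        root->tag1.itemlist.push_back(std::move(item));
    }
    return root;
}
```

With these definitions, root->tag1.itemlist[1]->description is plain member access and indexing, with no string lookups at run time.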

You should first determine what the requirement for efficiency is, in terms of storage, speed etc. in concrete numbers. Without knowing this information, you can't tell if your implementation satisfies the requirement.
And, if you have this requirement, you will probably find that the DOM satisfies it, and has the advantage of zero code to maintain.
It will be a nightmare for future programmers as they wonder why someone wrote an alternate implementation of the DOM.
Actually, pretty much anything you do will just be a DOM implementation, but possibly incomplete, and with optimizations for indexing etc. My personal feeling is that re-inventing the wheel should be the last thing you consider.

there is a C++ XML library already built: xerces.
http://xerces.apache.org/xerces-c/install-3.html
there are some tree structures in \include\boost-1_46_1\boost\intrusive\
there is a red-black and an avl tree, but not having looked at those in a long time, I don't know if those are especially usable, I think not.
XML is a tree structure. you don't know what the structure is going to be unless it has a DTD defined and included in a <!DOCTYPE> declaration (although the validator at validrome breaks on !DOCTYPEs and it shouldn't).
see http://w3schools.com/xml/xml_tree.asp for a tree example.
you may get something that doesn't follow a DTD or schema. totally unstructured. like this:
<?xml version="1.0"?>
<a>
<b>hello
<e b="4"/>
<c a="mailto:jeff#nowhere.com">text</c>
</b>
<f>zip</f>
<z><b /><xy/></z>
<zook flag="true"/>
<f><z><e/></z>random</f>
</a>
I know that queriable XML databases do exist, but I don't know much about them, except that they can handle unstructured data.
PHP has an XML parser which sticks it into what PHP calls an array (not quite like a C/C++ array, because the arrays can have arrays), you can tinker with it to see an example of what an XML data structure should have in it.
what you basically want is a very flexible tree where the root pointer points to a list. each of those nodes in the list contains a pointer that can point to a list. it should be an ordered list, so make it a std::vector. If your purpose is to be able to remove data, use a std::list instead of a std::vector - it's ordered, while having the capability of easy manipulation.
word of warning: .erase(iterator i) erases only the element at i and returns an iterator to the element after it; i itself is no longer valid afterwards.
.erase(iterator i1, iterator i2) erases everything from i1 up to but not including i2.
.end() is an iterator that points 1 after the end of the list, essentially at nothing.
.begin() is an iterator that points to the start of the list.
learn to use for_each(start,end,function) from <algorithm>, or use a regular for statement.
iterators are like pointers. treat them as such.
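#include <iostream>
#include <list>
#include <string>
using namespace std;
// minimal node type so the loop compiles; a real node would also hold children and attributes
class node {
public:
    explicit node(const string& d) : data(d) {}
    const string& getData() const { return data; }
private:
    string data;
};
int main() {
    list<node> nodelist;
    nodelist.push_back(node("hello"));
    for (list<node>::iterator nli = nodelist.begin(); nli != nodelist.end(); ++nli) {
        cout << nli->getData() << endl;
    }
}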
the nodes need to have an optional list of attributes and note that the DTD could possibly be contained within the XML document, so you have to be able to read it to parse the document (or you could throw it away). you may also run into XML Schema, the successor of the DTD.
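The tree described in this answer could be sketched roughly as follows. This is an illustration only, assuming C++17 (which allows std::list of a type that is still incomplete); XmlNode and countNodes are hypothetical names and do not come from any XML library.

```cpp
#include <cstddef>
#include <list>
#include <map>
#include <string>

// Hypothetical generic node for unstructured XML.
struct XmlNode {
    std::string name;                               // tag name, e.g. "b"
    std::string text;                               // character data directly inside the tag
    std::map<std::string, std::string> attributes;  // optional attributes: name -> value
    std::list<XmlNode> children;                    // document order; cheap insert/erase anywhere
};

// Counts this node plus all of its descendants.
std::size_t countNodes(const XmlNode& node) {
    std::size_t total = 1;
    for (const XmlNode& child : node.children) {
        total += countNodes(child);
    }
    return total;
}
```

Because every node owns a list of nodes of the same type, arbitrarily nested documents such as the unstructured sample above fit without knowing a DTD or schema in advance.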

I think the most efficient data structure to store XML in is probably VTD-XML, which uses arrays of longs instead of lots of interconnected structs/classes. The main idea is that structs/classes rely on many small memory allocations, which incur severe overhead under normal circumstances. See this article for further detail.
http://soa.sys-con.com/node/250512

I'm not sure what the most efficient method is, but since the DOM already exists why re-invent the wheel?
It may make sense to hash all nodes by name for lookup, but you should still use the DOM as the basic representation.
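A name index of that kind might look like the sketch below. The Node struct is a stand-in for whatever node type your DOM library provides, and NameIndex and indexByName are names made up for this example; C++17 is assumed for std::vector of an incomplete type.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for a DOM node; a real DOM library supplies its own type.
struct Node {
    std::string name;
    std::vector<Node> children;
};

// Maps a tag name to every node carrying that name.
using NameIndex = std::unordered_multimap<std::string, const Node*>;

// Walks the tree once and records each node under its name.
// The stored pointers stay valid only while the tree is not modified.
void indexByName(const Node& node, NameIndex& index) {
    index.emplace(node.name, &node);
    for (const Node& child : node.children) {
        indexByName(child, index);
    }
}
```

The tree remains the primary representation; the index is a secondary structure that has to be rebuilt, or carefully updated, whenever nodes are added or removed.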

I've been exploring this problem myself, and these are my thoughts.
a) Every element in XML is either a node or a (key, value) pair.
b) Store every element in a hash. Assign each element a type, i.e. "node" or "key,value".
c) Every element will have a parent. Assign a value to each of them.
d) Every element may or may not have children/references. Store the children in a btree, which will define the references.
Search time for any key will be O(1) on average. A reference traversal can have a list of all the children inside the element.
Please review and suggest what I've missed.

Just use a DOM to store the parsed XML file. Surely there are C++ DOM libraries.
You can query DOM with XPath expressions.

compressed string storage

Let's say I have many objects containing strings of non-trivial length (around 3-4 KB). The strings are all different from each other, yet at the same time contain lots of common parts/subsequences. On average maybe 80-90% of any individual string is contained within the others as well. Is there an easy way to automatically exploit this huge redundancy for compressing the data?
Ideally the solution would be C++ and transparent for the user (i.e. I can use it as if I was accessing a regular read only const std::string but instead reading from compressed storage).

Algorithmically, Lempel–Ziv–Welch with one dictionary for all objects/strings might be a good start.
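To illustrate the idea, here is a small sketch of LZW in which the dictionary persists across strings, so phrases learned from earlier strings shorten later ones. The class names are invented for this example and codes are kept as plain ints rather than packed into bits. A big caveat of this simple form: strings must be decoded in the same order they were encoded, so it does not give random access to individual strings.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Encoder whose dictionary is shared by every string passed to encode().
class SharedLzwEncoder {
public:
    SharedLzwEncoder() {
        // Seed the dictionary with all single-byte strings.
        for (int c = 0; c < 256; ++c) dict_.emplace(std::string(1, static_cast<char>(c)), c);
    }
    std::vector<int> encode(const std::string& input) {
        std::vector<int> codes;
        std::string w;  // longest dictionary phrase matched so far
        for (char c : input) {
            std::string wc = w + c;
            if (dict_.count(wc)) {
                w = std::move(wc);
            } else {
                codes.push_back(dict_.at(w));
                const int next = static_cast<int>(dict_.size());
                dict_.emplace(std::move(wc), next);  // learn the new phrase
                w.assign(1, c);
            }
        }
        if (!w.empty()) codes.push_back(dict_.at(w));
        return codes;
    }
private:
    std::unordered_map<std::string, int> dict_;
};

// Matching decoder; it rebuilds the same dictionary as it goes.
class SharedLzwDecoder {
public:
    SharedLzwDecoder() {
        for (int c = 0; c < 256; ++c) dict_.push_back(std::string(1, static_cast<char>(c)));
    }
    std::string decode(const std::vector<int>& codes) {
        std::string out, prev;
        for (int code : codes) {
            std::string entry;
            if (code >= 0 && static_cast<std::size_t>(code) < dict_.size()) {
                entry = dict_[static_cast<std::size_t>(code)];
            } else if (!prev.empty() && static_cast<std::size_t>(code) == dict_.size()) {
                entry = prev + prev[0];  // phrase the encoder added just before using it
            } else {
                throw std::invalid_argument("bad LZW code");
            }
            if (!prev.empty()) dict_.push_back(prev + entry[0]);
            out += entry;
            prev = std::move(entry);
        }
        return out;
    }
private:
    std::vector<std::string> dict_;
};
```

To get independent access to each string, one variation is to train the dictionary on a representative sample and then freeze it, at some cost in compression ratio.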

You can use Huffman coding; the implementation is not hard. There are also zip libraries in languages such as C# and Java that you can use.
Also, if you are sure that 80-90% is repeated across all strings, create a dictionary of all words, then for each string store which dictionary words it uses: keep a large bit array (say 10000 bits) and set bits[i] to 1 if words[i] occurs in the current string. If each word is about 5 characters long, the abbreviated form takes around 1/5 of the size.

If the common parts of the strings are common because they are composed from other strings, then you might get some traction by using the STLport rope class, which looks for all the world like a std::string but uses a substring tree representation with copy-on-write, making it both very space efficient (common substrings are shared) and very good at inserts and deletes (O(log n)).
When to use rope:
You are making a template engine. Document instances are made from a template by substituting varying data into the template, and then cached for future use. Parts that are common to templates and instances are stored only once and shared across instances; inserts and deletes are cheap.
When not to use rope:
You are loading many documents from outside the domain of your application (from disk, or over a network) and using them without modification. rope doesn't share strings if they are not copied from one rope to another. If you can afford to do the work to find the common substrings, rope can still be used to improve your final representations.

Like @Saeed mentioned, simple Huffman coding will perform well here.
There is no need for a dictionary if the common words are known a priori (you've mentioned that it's HTML). Just precompute a Huffman table using statistical data from many HTML files (note that you can encode a whole tag as a single symbol, and you can have as many symbols as you want).
