Xerces: How to merge duplicate nodes? - c++

My question is this:
If I have the following XML:
<root>
<alpha one="start">
<in>1</in>
</alpha>
</root>
and then I'll add the following path:
<root><alpha one="start"><out>2</out></alpha></root>
which results in
<root>
<alpha one="start">
<in>1</in>
</alpha>
</root>
<root>
<alpha one="start">
<out>2</out>
</alpha>
</root>
I want to be able to convert it into this:
<root>
<alpha one="start">
<in>1</in>
<out>2</out>
</alpha>
</root>
Besides implementing it myself (I don't feel like reinventing the wheel today),
is there a specific way to do it in Xerces (2.8, C++)?
If so, at which point in the DOMDocument's life is the node merging done: at each insertion, when the document is written, or explicitly on demand?
Thanks.

If you use Xalan, it's possible to use an XPath expression to find the element and insert directly into the correct one.
The following call may be slow, but it returns all "root" elements with the attribute "one" set to "start":
selectNodes("//root[@one='start']")
It is probably better to use the full path:
selectNodes("/abc/def/.../root[@one='start']")
or, if you already have the parent element, to work relative to it:
selectNodes("./root[@one='start']")
For the basic concepts, see the XPath article on Wikipedia.
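To make the attribute predicate concrete, here is a minimal sketch in Python's standard xml.etree.ElementTree rather than Xalan. It assumes the question's sample, where the "one" attribute sits on <alpha>; the second <alpha> element is invented for illustration.

```python
import xml.etree.ElementTree as ET

# The question's sample plus one hypothetical non-matching <alpha>.
doc = ET.fromstring(
    '<root>'
    '<alpha one="start"><in>1</in></alpha>'
    '<alpha one="other"><in>9</in></alpha>'
    '</root>'
)

# Relative search from an element we already hold: direct children only.
matches = doc.findall("./alpha[@one='start']")

# Descendant search at any depth; this walks the whole subtree.
anywhere = doc.findall(".//alpha[@one='start']")

# Insert directly into the element that was found.
target = matches[0]
ET.SubElement(target, "out").text = "2"

result_xml = ET.tostring(doc, encoding="unicode")
```

The same three path shapes (descendant, absolute, relative) carry over to any XPath 1.0 engine; only the call that evaluates them differs.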

Isn't it just a one minute task if you know the names of the container tag where various different tags are present?
In your example, get a pointer to the alpha tag in all the XML documents and put the contents of all of them into a new document's alpha if they're not present there already.
This isn't as bad as reinventing the wheel. I'm not familiar with Xerces, but with libxml++, I would call this an easy task.
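As a rough illustration of that manual merge, here is a sketch in Python's standard xml.etree.ElementTree, not in Xerces or libxml++. The helper names merge_into and same_identity are hypothetical, and the rule that two containers are duplicates when their tag and attributes match is an assumption drawn from the question's example.

```python
import xml.etree.ElementTree as ET


def same_identity(a, b):
    """Decide whether two elements should be treated as the same node."""
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    # Leaves must also agree on text; containers match on tag and attributes.
    if len(a) == 0 and len(b) == 0:
        return (a.text or "").strip() == (b.text or "").strip()
    return True


def merge_into(target, source):
    """Copy the children of `source` into `target`, recursing into duplicates."""
    for child in source:
        twin = next((c for c in target if same_identity(c, child)), None)
        if twin is None:
            target.append(child)      # not present yet: add it
        elif len(child) > 0:
            merge_into(twin, child)   # present: merge its contents instead


first = ET.fromstring('<root><alpha one="start"><in>1</in></alpha></root>')
second = ET.fromstring('<root><alpha one="start"><out>2</out></alpha></root>')
merge_into(first, second)
merged = ET.tostring(first, encoding="unicode")
```

In this sketch the merge happens only when merge_into is called explicitly; nothing is merged at insertion or at serialization time.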

Related

eXist-db / XSLT / Saxon collection() slow as molasses (or errors out with memory limit)

Coming from this question, I managed one entirely unsatisfactory solution for accessing an eXist-DB collection() from an XSLT 2.0 document loaded from within an eXist-db/XQuery transformation function:
The XSLT file declares a variable:
<xsl:variable name="coll" select="collection('xmldb:exist:///db/apps/deheresi/data/collection_ms609.xml')"/>
This points to a catalog xml file I created (per Saxon documentation) that looks like this, in order to load the actual collection:
<collection stable="true">
<doc href="xmldb:exist:///db/apps/deheresi/data/ms609_0001.xml"/>
<doc href="xmldb:exist:///db/apps/deheresi/data/ms609_0002.xml"/>
...
...
<doc href="xmldb:exist:///db/apps/deheresi/data/ms609_0709.xml"/>
<doc href="xmldb:exist:///db/apps/deheresi/data/ms609_0710.xml"/>
</collection>
This allows the XSLT file to use a key that needs to search across all these files:
<xsl:key name="correspkey" match="tei:seg[@type='dep_event' and @corresp]" use="@corresp"/>
<xsl:variable name="correspvar" select="self::seg[@type='dep_event' and @corresp]/@corresp"/>
<xsl:value-of select="$coll/(key('correspid',$correspvar) except $correspvar)/@id" separator=", "/>
As it stands, if I have 50 documents in the catalog, I get a result in 2 minutes; with all 710 I get a Java GC error after 4 minutes.
I have set indexes on relevant nodes in eXist-DB, but this does nothing to performance. It seems to me Saxon is working 'outside' eXist-DB's optimisations, treating eXist-DB as a simple file system.
(For what it's worth, setting href="/db/apps/deheresi/data/ms609_0001.xml" does not let Saxon see the documents.)
I suspect all of this is why the eXist-DB documentation is non-existent.
As it stands, I am looking for solutions for intensive searches of collections from within XSLT 2.0 loaded within eXist-DB by XQuery transform().
If anything, I hope this post helps future searchers encountering the same problem.
The general architectural principle is: try to move the searching closer to the data. In this case this means: use eXist to find the documents of interest, don't extract every possible candidate document from eXist and then ask Saxon to do the searching. Select the actual documents of interest in an eXist XQuery, and then pass the list of these documents to Saxon in a stylesheet parameter.

Adding/removing specific elements from xml file, in Qt?

I have a XML Document, like this:
<?xml version="1.0" encoding="UTF-8"?>
<items>
<item s_no="1">
<title>title_1</title>
<path>path1</path>
<desc>descriptoion1</desc>
</item>
<item s_no="2">
<title>title_2</title>
<path>path2</path>
<desc>descriptoion2</desc>
</item>
</items>
This is generated by QXmlStreamWriter in Qt. I want a function that adds an <item> tag with all its child elements (<title>, <path>, etc.), and a function that removes an <item> tag identified by its s_no attribute. All of this should be done without affecting any other content in the file.
I've searched a lot and I know there are similar questions. I've tried some code, but it didn't work. Are there any functions in QDomDocument that do this?
When I have looked into doing this in the past, it hasn't really been a trivial thing.
QDomDocument and QDomNode
I think you should be able to do it with QDomDocument and QDomNode. Sometimes it is hard to see all the possible functions just on the main page for the documentation of the class, because it can get so much from the abstract classes it is derived from... clicking "lists of all members" shows a complete list.
http://doc.qt.io/qt-5/qdomdocument-members.html
Some calls that look promising include: childNodes elementById elementsByTagName createNode insertBefore insertAfter removeChild.
UPDATE: A working example that shows a straightforward way to delete and insert nodes in a QDomDocument:
https://github.com/peteristhegreat/xml_insert_remove
Note that when adding QDomNodes, QDomElements, etc., every element needs to be created on the document (for example with QDomDocument::createElement); otherwise it doesn't stay in scope when you leave a function.
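The overall shape of the two functions the question asks for can be sketched as follows. This is not Qt code: it uses Python's standard xml.etree.ElementTree to show the DOM-style logic, the function names add_item and remove_item are hypothetical, and the sample is a shortened version of the question's document.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's file.
SAMPLE = """<items>
<item s_no="1"><title>title_1</title><path>path1</path><desc>d1</desc></item>
<item s_no="2"><title>title_2</title><path>path2</path><desc>d2</desc></item>
</items>"""


def add_item(items, s_no, title, path, desc):
    """Append a new <item> carrying the s_no attribute and its child elements."""
    item = ET.SubElement(items, "item", {"s_no": str(s_no)})
    for tag, text in (("title", title), ("path", path), ("desc", desc)):
        ET.SubElement(item, tag).text = text
    return item


def remove_item(items, s_no):
    """Remove every <item> whose s_no matches; return how many were removed."""
    doomed = [it for it in items.findall("item") if it.get("s_no") == str(s_no)]
    for it in doomed:
        items.remove(it)
    return len(doomed)


items = ET.fromstring(SAMPLE)
add_item(items, 3, "title_3", "path3", "description3")
removed = remove_item(items, 1)
remaining = [it.get("s_no") for it in items.findall("item")]
```

In Qt the corresponding steps would presumably be a lookup via elementsByTagName, a check of the attribute, and then removeChild or appendChild on the parent, but that mapping is an assumption and not tested here.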
QXmlStreamReader and QXmlStreamWriter
A few documents I've seen (a few years ago) said that they highly recommend using the QXmlStream* classes since they are better supported, or have been maintained more recently. I think it has some better error handling and doesn't have to load the whole document to be useful.
So as far as editing the document and resaving it, the most direct way that I know of is to read in everything, and store it as nested C++ classes and then write them out.
QJson Example (similar to QXmlStream*)
There is a similar example with Json, that really shows off the power of subclassing a read and a write function into your model.
http://doc.qt.io/qt-5/qtcore-json-savegame-example.html
I think a similar approach could be done with the stream reader and writer class for XML.
Hope that helps.

XSLT and the for-each loop

I am using the XML file here
http://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xml
And I have written this code
<xsl:for-each select="registry/record">
However it never finds anything because of this line in the XML
<registry xmlns="http://www.iana.org/assignments" id="service-names-port-numbers">
If I change that to
<registry>
It works, however I cannot change the XML, I must change the XSLT. What can I do to make it work? I just need to find those records.
Thanks.
XSLT and XPath are namespace-aware. Unfortunately, XSLT 1.0 and XPath 1.0 have no notation for setting a default namespace for the path, so you have to use an explicit prefix bound to the namespace. (XSLT 2.0 adds the xpath-default-namespace attribute for this purpose.)
If you aren't familiar with XML Namespaces, do review them. They're important.
Taking your specific example, here's a simplified version of the start of the service-names-port-numbers document:
<registry xmlns="http://www.iana.org/assignments" id="service-names-port-numbers">
<title>Service Name and Transport Protocol Port Number Registry</title>
<category>Service Names and Transport Protocol Port Numbers</category>
<updated>2014-02-06</updated>
<xref type="rfc" data="rfc6335"/>
<expert> ... names of experts ... </expert>
<note> ... usage notes ... </note>
<record>
<protocol>tcp</protocol>
<xref type="person" data="Jon_Postel"/>
<description>Reserved</description>
<number>0</number>
</record>
</registry>
The xmlns="http://www.iana.org/assignments" is a default namespace binding. All elements in this document will be in that namespace unless they have a prefix bound to another namespace or another xmlns= is used to change the default for them and their children.
Your XPaths and Match Expressions MUST reference this namespace, or they won't work.
Change
<xsl:for-each select="registry/record">
to
<xsl:for-each select="assignments:registry/assignments:record"
xmlns:assignments="http://www.iana.org/assignments">
(You can use a shorter prefix than assignments; I'm just trying to make this as clear as possible. You can also bind the prefix higher in your XSLT document -- typically, on the <xsl:stylesheet> element -- so it's available throughout the stylesheet rather than just in this one place.)
This will work, assuming the rest of your code is correct.
Also: <xsl:for-each> tends to be overused. Unless this is a place where you really do need to do different processing than anywhere else in the stylesheet, you should instead be using <xsl:apply-templates> so the normal template-matching rules can apply. Otherwise you're making the stylesheet hard to extend and maintain. XSL is a rule-matching, nonprocedural language; learn to use it that way.
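The effect of the default namespace on path lookups can be reproduced outside XSLT. The sketch below uses Python's standard xml.etree.ElementTree on a trimmed copy of the sample above; the prefix "assignments" is arbitrary, just as it is in the stylesheet.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the registry sample, keeping the default namespace binding.
SAMPLE = """<registry xmlns="http://www.iana.org/assignments" id="service-names-port-numbers">
  <record><protocol>tcp</protocol><description>Reserved</description><number>0</number></record>
</registry>"""

registry = ET.fromstring(SAMPLE)

# An unprefixed name means "no namespace", so nothing matches.
unprefixed = registry.findall("record")

# Binding a prefix to the namespace URI makes the same path match.
ns = {"assignments": "http://www.iana.org/assignments"}
records = registry.findall("assignments:record", ns)
protocols = [r.findtext("assignments:protocol", namespaces=ns) for r in records]
```

The first lookup comes back empty for the same reason the original for-each found nothing: the element names in the path are not in the namespace that the document's elements are in.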

Processing the output of another XSLT Stylesheet

I have an XSLT stylesheet that produces some output in XML. I want to process that output with another stylesheet. Is there a way to tell the latter stylesheet to "run and use" the results from the former?
There is not, as far as I know, a standard way to tell an XSLT processor to run another stylesheet on given input and do something with the output. In some cases you can process the input against one set of templates and save the result in a variable, then apply a different set of templates to the value of the variable, something like this:
<xsl:template match="/">
<xsl:variable name="temp">
<xsl:apply-templates mode="first-pass"/>
</xsl:variable>
<xsl:apply-templates select="$temp" mode="second-pass"/>
</xsl:template>
This assumes you're running XSLT 2.0. In XSLT 1.0 you will need a processor that supports the node-set extension (many do), and you'll need to change the reference to $temp to something like exsl:node-set($temp), with the exsl prefix bound to http://exslt.org/common.
As you will perceive, this won't work very well if your two stylesheets both use the default mode and operate on overlapping sets of element types. So some XSLT processors have added extensions to provide the kind of functionality you describe (see, for example, discussions of the Xalan pipe:pipeDocument extension element).
Of course, you can also handle the pipe outside of XSLT. The simplest way to do it depends upon the environment you are running in.
If you're running XSLT from an operating system shell and your XSLT processor accepts input on stdin, you can pipe the output from one stylesheet into the other:
xsltproc a.xsl in.xml | xsltproc b.xsl - > out.xml
And as mohammed moh has already pointed out, many scripting environments make it possible to do similar things: he mentions PHP, and of course there's XProc.
Yes, you can. You must transform the source node into a DOMDocument. I don't know which programming language you are using; in PHP, for example, the method is transformToDoc(). After the transformation, you can run a new XSLT stylesheet on the resulting DOMDocument.

Transforming one XML document into another with C++

What would be a straightforward way to transform a source XML document into a destination XML document? There are only small differences between source and destination: specifically, I want to delete the first UnitIDRecord node within each UnitIDGroup node.
What would be the appropriate model for this task: DOM or SAX?
Which XML library would best fit this problem (one that guarantees that the source and destination differ only in the deleted nodes, with no missing namespaces, attributes, encoding, ...)?
I read about XSLT; could this be an option?
The XML document looks like the following:
<?xml version="1.0" encoding="UTF-8"?>
<ExPostInformationRealGeneration xmlns="http://schemas.seven2one.de/EEX/TransparencyPlatform" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schemas.seven2one.de/EEX/TransparencyPlatform EEXTransparencyPlatform.xsd">
<DispatcherID>XYZ</DispatcherID>
<CreationDateTime>2012-05-22T13:57:00Z</CreationDateTime>
<MessageText>1 - Positiv - Meldung mit Quality-Tag - L000</MessageText>
<UnitIDGroup>
<UnitID>E110200-001</UnitID>
<UnitIDRecord><Quantity>16.9</Quantity><Starttime>2008-04-30T22:00:00Z</Starttime><Period>PT1H</Period><MessageText></MessageText></UnitIDRecord>
<UnitIDRecord><Quantity>16.6</Quantity><Starttime>2008-04-30T23:00:00Z</Starttime><Period>PT1H</Period><MessageText></MessageText></UnitIDRecord>
<UnitIDRecord><Quantity>16.4</Quantity><Starttime>2008-05-01T00:00:00Z</Starttime><Period>PT1H</Period><MessageText></MessageText></UnitIDRecord>
</UnitIDGroup>
<UnitIDGroup>
<UnitID>E110200-002</UnitID>
<UnitIDRecord><Quantity>16.9</Quantity><Starttime>2008-04-30T22:00:00Z</Starttime><Period>PT1H</Period><MessageText></MessageText></UnitIDRecord>
<UnitIDRecord><Quantity>16.6</Quantity><Starttime>2008-04-30T23:00:00Z</Starttime><Period>PT1H</Period><MessageText></MessageText></UnitIDRecord>
<UnitIDRecord><Quantity>16.4</Quantity><Starttime>2008-05-01T00:00:00Z</Starttime><Period>PT1H</Period><MessageText></MessageText></UnitIDRecord>
</UnitIDGroup>
<UnitIDGroup>
<UnitID>E110201-001</UnitID>
<UnitIDRecord><Quantity>7.0</Quantity><Starttime>2008-04-30T22:00:00Z</Starttime><Period>PT1H</Period><MessageText></MessageText></UnitIDRecord>
<UnitIDRecord><Quantity>7.1</Quantity><Starttime>2008-04-30T23:00:00Z</Starttime><Period>PT1H</Period><MessageText></MessageText></UnitIDRecord>
<UnitIDRecord><Quantity>7.1</Quantity><Starttime>2008-05-01T00:00:00Z</Starttime><Period>PT1H</Period><MessageText></MessageText></UnitIDRecord>
</UnitIDGroup>
<!-- other UnitIDGroup elements -->
</ExPostInformationRealGeneration>
I would consider the possibility of reading the file in as strings and writing the string out to another file if it matches your criteria. That's a 5 line program and avoids any parsing etc. It will run quickly and is simple. But, it is specific to this problem and not reusable. I offer this therefore as a suggestion not the correct solution!
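A sketch of that string-based idea, written in Python rather than C++ to keep it short. It assumes, as in the sample above, that every <UnitIDGroup> start tag and every <UnitIDRecord> element sits on its own line; the function name drop_first_record_per_group is hypothetical, and the sample lines are abbreviated.

```python
def drop_first_record_per_group(lines):
    """Yield lines unchanged, skipping the first <UnitIDRecord> line after each <UnitIDGroup>."""
    awaiting_first = False
    for line in lines:
        if "<UnitIDGroup>" in line:
            awaiting_first = True
        elif awaiting_first and "<UnitIDRecord>" in line:
            awaiting_first = False
            continue  # drop this one line, keep everything else
        yield line


# Abbreviated stand-in for the document in the question.
SAMPLE = """<UnitIDGroup>
<UnitID>E110200-001</UnitID>
<UnitIDRecord><Quantity>16.9</Quantity></UnitIDRecord>
<UnitIDRecord><Quantity>16.6</Quantity></UnitIDRecord>
</UnitIDGroup>
<UnitIDGroup>
<UnitID>E110201-001</UnitID>
<UnitIDRecord><Quantity>7.0</Quantity></UnitIDRecord>
<UnitIDRecord><Quantity>7.1</Quantity></UnitIDRecord>
</UnitIDGroup>""".splitlines()

kept = list(drop_first_record_per_group(SAMPLE))
```

Because every surviving line is copied byte for byte, namespaces, attributes and the XML declaration come through untouched, which is the guarantee the question asks for. The approach breaks as soon as the line layout changes, for example if two records share a line.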