I am trying to parse an XML string using Xerces-C++.
The structure is
<root>
<optionA>
<optionB/>
</optionA>
</root>
I read the xml string into MemBufInputSource and then parse it.
When I call getChildNodes() on root, the returned list always has a length of 2. Shouldn't it be 1? Only optionA is a child of root. Also, for each child I check whether it is a node of type element. For the first child, the check is always false.
Why does it show a count of 2 children?
getChildNodes() returns all child nodes, not just the ones that are elements.
The whitespace between the elements (newlines in this case) counts as a text node (DOMNode::NodeType::TEXT_NODE). By my count there are actually 2 text nodes among the children of root in your example, so 3 child nodes overall. Differences introduced when transcribing the XML into the question, or a different Xerces configuration, may explain why your original code saw 2 child nodes.
If you change your XML example to be all on one line with no whitespace
<root><optionA><optionB/></optionA></root>
you can see that Xerces will then report that there is only one child of root.
Here is the full list of node types that Xerces may encounter.
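The same effect can be reproduced outside Xerces. The following is a minimal sketch using Python's standard xml.dom.minidom as a stand-in for the Xerces DOM (an assumption made purely for illustration; the helper name child_summary is hypothetical). It counts all child nodes of root and then only those whose node type is element.

```python
from xml.dom import Node, minidom


def child_summary(xml_text):
    """Return (number of all child nodes, number of element children) of the root."""
    root = minidom.parseString(xml_text).documentElement
    children = root.childNodes
    # Keep only the children that are elements, skipping whitespace text nodes.
    elements = [n for n in children if n.nodeType == Node.ELEMENT_NODE]
    return len(children), len(elements)


# The question's XML with one newline between tags, and the one-line variant.
multiline = "<root>\n<optionA>\n<optionB/>\n</optionA>\n</root>"
compact = "<root><optionA><optionB/></optionA></root>"

multiline_counts = child_summary(multiline)
compact_counts = child_summary(compact)
print(multiline_counts)
print(compact_counts)
```

Filtering on the node type, rather than relying on the raw child count, gives the same answer for both layouts.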
I am using XSLT to transform XML files and this XPath is a very small part of it. The main object is a performance issue. First, I will describe the context:
Part of the transformation is a complex grouping operation, used to group up a sequence of similar elements, in the order they appear. This is a small sample from the data:
<!-- potentially a lot more data-->
<MeaningDefBlock>
<!-- potentially a lot more data-->
<MeaningSegment>
<Meaning>
<value> or </value>
</Meaning>
</MeaningSegment>
<MeaningSegment>
<MeaningInsert>
<OpenBracket>
<value>(</value>
</OpenBracket>
<Meaning>
<value>ex.: </value>
</Meaning>
<IllustrationInsert>
<value>ita, lics</value>
</IllustrationInsert>
<ClosedBracket>
<value>)</value>
</ClosedBracket>
</MeaningInsert>
</MeaningSegment>
<!-- potentially a lot more data-->
</MeaningDefBlock>
<!-- potentially a lot more data-->
There are only parent elements (ex.: MeaningInsert) and elements that only contain a value element, which contains text (ex.: IllustrationInsert).
The text from the input gets grouped into elements that hold text segments such as " or (ex.: ", "ita, lics" and ")". In this case, the "ita, lics" segment separates groups that would otherwise all be merged into one. The main point is that elements from different levels can be grouped. XPath is used to identify each group via the preceding segments, and the result is keyed in the XSL. The whole key is very complicated and not the subject of the question, but I provide it for context:
<xsl:key name="leavesInGroupL4" match="MeaningSegment//*[value]" use="generate-id(((preceding-sibling::*[value]|ancestor-or-self::MeaningSegment/preceding-sibling::MeaningSegment//*[value])[not(boolean(self::IllustrationInsert|self::LatinName)=boolean(current()/self::IllustrationInsert|current()/self::LatinName))]|ancestor-or-self::MeaningDefBlock)[last()])"/>
The important part being:
(preceding-sibling::*[value]|ancestor-or-self::MeaningSegment/preceding-sibling::MeaningSegment//*[value])[...]
From the context of an element with a value child (like Meaning or OpenBracket), this XPath selects the preceding siblings and all the elements with values from the preceding siblings of the parent/ancestor MeaningSegment. In practice, it basically selects all the text that came before it (or, rather, the grandparent of the text itself).
I have later realized that there might be even further complications with layers and differing depth of the elements with values. I might need to select all such preceding elements regardless of their parent and siblings but still in the same block. I have substituted "the important part" with a somewhat simpler XPath expression:
preceding::*[value and generate-id(ancestor-or-self::MeaningDefBlock) = generate-id(current()/ancestor-or-self::MeaningDefBlock)]
This only checks that it's in the same block and it works! It successfully selects the preceding segments of text in the block, even if elements with values and parent elements are mixed together. Example input fragment:
...
<OpenBracket>
<value>(</value>
</OpenBracket>
<SomeParentElement>
<LatinName>
<value>also italics</value>
</LatinName>
</SomeParentElement>
<ClosedBracket>
<value>)</value>
</ClosedBracket>
...
This is not something the first approach could do because the brackets and the LatinName are not siblings.
However, the new approach with preceding::* is extremely slow! On a real document, the XSL transformation takes up to 5 minutes instead of the usual 3 seconds that the original approach takes (including overhead), which is a 100x increase in time taken. Of course, that is because preceding checks nearly every earlier node in the whole document each time it is evaluated (which is a lot of times). The document has a lot of MeaningDefBlock blocks (nearly 2000), each with a few segments of text (usually single-digit) and a bunch of other straightforward elements/nodes unrelated to said text (usually in the low hundreds per block). It is quite easy to see how this all adds up to preceding trashing performance compared with preceding-sibling.
I was wondering if this could be optimized somehow. In XSL, keys have greatly improved performance multiple times in our project but I'm not sure if preceding and keys can be combined or if the XPath needs to be more complex and tailored to my specific case, perhaps enumerating the elements it should look at (and hopefully ignoring everything else).
Since the current input always works with the first approach, I have conceded and rolled back the change (and would probably rather take the 5 min hit every time than try to optimize it myself).
I use XSLT 1.0 and XPath 1.0
I guess you've probably already worked out that
preceding::*[value and generate-id(ancestor-or-self::MeaningDefBlock)
= generate-id(current()/ancestor-or-self::MeaningDefBlock)]
is going to search back to the beginning of the document; it's not smart enough to know that it only needs to search within the containing MeaningDefBlock element.
One answer to that would be to change it to something like this:
ancestor-or-self::MeaningDefBlock//*[value][. << current()]
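The << operator requires XPath 2.0, and for a problem as complex as this you really ought to consider moving forwards. However, you can simulate the operator in 1.0 with an expression like generate-id(($A|$B)[1]) = generate-id($A), which is true when $A comes first in document order (note that it is also true when $A and $B are the same node).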
There's no guarantee this will be faster, but unlike your existing solution it should be independent of how many MeaningDefBlock elements there are in the document.
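To make the idea concrete outside XSLT, here is a rough Python sketch of the same strategy: climb to the containing MeaningDefBlock, then walk only that block in document order and stop at the current element. It uses the standard xml.etree.ElementTree module; the function name preceding_in_block, the Doc wrapper element and the sample data are assumptions for illustration, not part of the original stylesheet.

```python
import xml.etree.ElementTree as ET


def preceding_in_block(root, current, block_tag="MeaningDefBlock"):
    """Elements with a <value> child that come before `current` in its block."""
    # ElementTree has no parent pointers, so build a child -> parent map once.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    block = current
    while block is not None and block.tag != block_tag:
        block = parent_of.get(block)
    if block is None:
        return []
    found = []
    # iter() walks the block in document order; nothing outside it is visited.
    for el in block.iter():
        if el is current:
            break
        # Like `. << current()`, this would also accept an ancestor of
        # `current` if that ancestor had a <value> child of its own.
        if el.find("value") is not None:
            found.append(el)
    return found


doc = ET.fromstring(
    "<Doc>"
    "<MeaningDefBlock><Meaning><value>other block</value></Meaning></MeaningDefBlock>"
    "<MeaningDefBlock>"
    "<OpenBracket><value>(</value></OpenBracket>"
    "<SomeParentElement><LatinName><value>also italics</value></LatinName></SomeParentElement>"
    "<ClosedBracket><value>)</value></ClosedBracket>"
    "</MeaningDefBlock>"
    "</Doc>"
)
closing = doc.find(".//ClosedBracket")
preceding_tags = [el.tag for el in preceding_in_block(doc, closing)]
print(preceding_tags)
```

The work per call depends on the size of one block rather than on the number of blocks in the document, which is the property the suggested XPath is aiming for.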
I am trying to add new nodes to an element tree. These new nodes have children of their own. Is there any way, using lxml, to add all of these in one go?
Ex:
Old format
<Test>
<Header>
</Header>
</Test>
New format that I am trying to achieve by adding nodes
<Test>
<Header>
<Source>
<ProcessID> 234 </ProcessID>
<InstanceID> 1 </InstanceID>
</Source>
<Target>
<ProcessID> 234 </ProcessID>
<InstanceID> 1 </InstanceID>
</Target>
</Header>
</Test>
I am looking for 2 things:
1) Is there any way I can add the entire Source node and the entire Target node in one go each? I mean adding the Source node with all its children in one step, and likewise the Target node, instead of adding the Source node, then ProcessID, then InstanceID, etc.
2) At the moment I am maintaining changes in a flat file and storing the changes and then applying them using lxml
The problem I am facing is,
When I add the Source node to the Header node using SubElement, it is not added as a proper tag; only the closing tag of Source is added. When I try to get the Source element using the find function, I get null. Hence I can't add children to the Source node. As you can see, the Source node doesn't have any attributes or text, but it has children.
Can you help me in adding this structure to element tree? I tried all the ways as much as I can. I am sure there should be a simple solution to do this instead of adding it one by one.
I have so many files to be treated like this, so looking for a simple solution.
TIA
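One possible approach, sketched here under the assumption that each new subtree can be written as an XML fragment string: parse the fragment into an element and append that element, children included, to Header in a single call. The sketch uses the standard xml.etree.ElementTree module; lxml.etree provides fromstring and append as well. The helper name endpoint is hypothetical, and the padding spaces around the values in the question are dropped.

```python
import xml.etree.ElementTree as ET


def endpoint(tag, process_id, instance_id):
    """Build a complete <Source>/<Target> style subtree from a fragment string."""
    return ET.fromstring(
        f"<{tag}>"
        f"<ProcessID>{process_id}</ProcessID>"
        f"<InstanceID>{instance_id}</InstanceID>"
        f"</{tag}>"
    )


tree_root = ET.fromstring("<Test><Header></Header></Test>")
header = tree_root.find("Header")

# Each append adds one whole subtree, children included.
header.append(endpoint("Source", 234, 1))
header.append(endpoint("Target", 234, 1))

result_xml = ET.tostring(tree_root, encoding="unicode")
print(result_xml)
```

Because the fragment is built in one place, the same helper can be reused for every file that needs the new structure.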
I'm seeing quite odd behaviour: when trying to limit the results of applying ancestor::* to an element, I always get an extra ancestor, although it is expressly excluded by the predicate.
Here the code:
XML:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<level_a>
<level_b>
<level_c>
<level_d>
<level_e/>
</level_d>
</level_c>
</level_b>
</level_a>
<level_b>
<level_c>
<level_d>
<level_d>
<level_e/>
</level_d>
</level_d>
</level_c>
</level_b>
</root>
XPath:
(//level_d[not(level_d)])[last()]/ancestor::*[level_c|level_b]
So basically I'm selecting the level_d elements that don't have another level_d element nested inside, getting the last one of them, and trying to get all the ancestors up to element level_b.
But the result I'm seeing using Altova XMLSpy 2011 is:
level_a
level_b
I don't quite understand why I'm getting that result and how can I improve my xpath to limit effectively the ancestors up to level_b (i.e. level_c and level_b).
Any hint is greatly appreciated!
Regards
Vlax
Well ancestor::*[level_c|level_b] selects all elements on the ancestor axis that have a level_c or level_b child.
You might want (//level_d[not(level_d)])[last()]/ancestor::*[self::level_c|self::level_b].
Or with your textual description "to limit effectively the ancestors to level_b" you simply want (//level_d[not(level_d)])[last()]/ancestor::level_b.
I think you are getting the correct result, because I read the clause ancestor::*[level_c|level_b] as "all ancestors containing an element level_b or level_c". So level_b matches because it contains level_c, and level_a matches too because it contains level_b.
So if I change your XPath into (//level_d[not(level_d)])[last()]/ancestor::*[level_c], it returns level_b only.
That is probably not exactly what you are asking for, but I'm not sure I fully understand the purpose of your XPath :-)
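The difference between the two predicates can be mimicked in plain Python. This is only an illustrative sketch: the standard xml.etree.ElementTree module has no ancestor axis, so the ancestors are collected by hand, and the small single-branch document below is a hypothetical simplification, not the question's exact XML.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<root><level_a><level_b><level_c><level_d/></level_c></level_b></level_a></root>"
)
# ElementTree has no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in doc.iter() for child in parent}


def ancestors(node):
    """Ancestors of node, nearest first."""
    found = []
    while node in parent_of:
        node = parent_of[node]
        found.append(node)
    return found


wanted = {"level_c", "level_b"}
start = doc.find(".//level_d")

# Like ancestor::*[level_c|level_b]: the ancestor HAS a child with that name.
has_wanted_child = [a.tag for a in ancestors(start) if any(c.tag in wanted for c in a)]

# Like ancestor::*[self::level_c|self::level_b]: the ancestor IS such an element.
is_wanted = [a.tag for a in ancestors(start) if a.tag in wanted]

print(has_wanted_child)
print(is_wanted)
```

The first list is shifted one level up from the second, which is the "extra ancestor" effect described in the question.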
How can I find the last node that contains a specific structure?
<defect-event>
<event-assigned-to>
<assigned-to-user>
<last-name>Doe</last-name>
<first-name>John</first-name>
<middle-name></middle-name>
</assigned-to-user>
</event-assigned-to>
</defect-event>
There can be many "defect-event" nodes at the same level, below or above the one with the "assigned-to-user" sub node.
There can also be multiple "defect-event" nodes with the "assigned-to-user" sub node.
I need to find the last "defect-event" node that contains the "assigned-to-user" sub node.
Thanks!
Something on these lines is probably what you want:
defect-event[event-assigned-to[assigned-to-user]][position()=last()]
In effect, you're saying "find me all the defect-event elements that contain an event-assigned-to containing an assigned-to-user, and then give me just the one whose position() is last()".
Having said that, you might need to tweak this depending on the context you're in when you try to find the node, and what you're doing to the node (eg: behaviour might vary if you're in a for-each loop as opposed to an apply-templates situation).
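The same "filter first, then take the last" logic can be sketched in Python with the standard xml.etree.ElementTree module. The wrapper element defect, the id attributes and the names in the sample are assumptions added only so the result is easy to check.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<defect>"
    "<defect-event id='1'><event-assigned-to><assigned-to-user>"
    "<last-name>Doe</last-name></assigned-to-user></event-assigned-to></defect-event>"
    "<defect-event id='2'/>"
    "<defect-event id='3'><event-assigned-to><assigned-to-user>"
    "<last-name>Roe</last-name></assigned-to-user></event-assigned-to></defect-event>"
    "<defect-event id='4'/>"
    "</defect>"
)

# Step 1: keep only the defect-event children that contain the nested structure.
matching = [
    event
    for event in doc.findall("defect-event")
    if event.find("event-assigned-to/assigned-to-user") is not None
]

# Step 2: take the last of the filtered events (None if nothing matched).
last_assigned = matching[-1] if matching else None
print(last_assigned.get("id") if last_assigned is not None else None)
```

As with the XPath, the order of the two steps matters: taking the last defect-event first and filtering afterwards would give nothing here, because the final event has no assigned user.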
I've seen lots of "de-duplicate this xml" questions but everyone wants the first node or the nodes are identical. I have a bit of a bigger puzzle.
I have a list of articles in XML, a relevant snippet is shown:
<item><key>Article1</key><stamp>100</stamp></item>
<item><key>Article1</key><stamp>130</stamp></item>
<item><key>Article2</key><stamp>800</stamp></item>
<item><key>Article1</key><stamp>180</stamp></item>
<item><key>Article3</key><stamp>900</stamp></item>
<item><key>Article3</key><stamp>950</stamp></item>
<item><key>Article4</key><stamp>990</stamp></item>
<item><key>Article5</key><stamp>999</stamp></item>
I'd like a list of nodes where the keys are unique and where the last instance is returned, not the first: Stamp (integer) is always increasing for elements of a particular key. Ideally I'd like "largest stamp" but they're always in order so the shortcut is ok.
Desired result: (Order doesn't really matter.)
<item><key>Article2</key><stamp>800</stamp></item>
<item><key>Article1</key><stamp>180</stamp></item>
<item><key>Article3</key><stamp>950</stamp></item>
<item><key>Article4</key><stamp>990</stamp></item>
<item><key>Article5</key><stamp>999</stamp></item>
I'm somewhat confused on how to get this list. Any ideas?
I'm using the Saxon processor if it matters.
The short version:
Instead of using [1] in the Muenchian grouping, use [last()]
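As a cross-check of the expected result, the "last item per key" selection can be sketched outside XSLT. This is not Muenchian grouping itself, just a plain Python illustration of the outcome that [last()] is meant to produce; it uses the standard xml.etree.ElementTree module, and the items wrapper element is an assumption.

```python
import xml.etree.ElementTree as ET

items_xml = (
    "<items>"
    "<item><key>Article1</key><stamp>100</stamp></item>"
    "<item><key>Article1</key><stamp>130</stamp></item>"
    "<item><key>Article2</key><stamp>800</stamp></item>"
    "<item><key>Article1</key><stamp>180</stamp></item>"
    "<item><key>Article3</key><stamp>900</stamp></item>"
    "<item><key>Article3</key><stamp>950</stamp></item>"
    "<item><key>Article4</key><stamp>990</stamp></item>"
    "<item><key>Article5</key><stamp>999</stamp></item>"
    "</items>"
)
root = ET.fromstring(items_xml)

# Later items overwrite earlier ones, so each key ends up with its last item.
last_by_key = {}
for item in root.findall("item"):
    last_by_key[item.findtext("key")] = item

latest = [(key, int(item.findtext("stamp"))) for key, item in last_by_key.items()]
print(latest)
```

This relies on the stamps already being in increasing order per key, as stated in the question; picking the largest stamp explicitly would need a comparison instead of a plain overwrite.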