XML Parsing results in Duplicates using libxml2 - c++

I'm using libxml2 to parse the following XML string:
<?xml version="1.0"?>
<note>
<to>
<name>Tove</name>
<name>Tovi</name>
</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
Formatted as a C-style string:
"<?xml version=\"1.0\"?><note><to><name>Tove</name><name>Tovi</name></to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"
This is based on the example from the W3C's site on XML; I only added the nested names in the "to" field.
I have the following recursive code in C++ to parse it into an object tree:
RBCXMLNode * RBCXMLDoc::recursiveProcess(xmlNodePtr node) {
    RBCXMLNode *rNode = new RBCXMLNode();
    xmlNodePtr childIterator = node->xmlChildrenNode;
    const char *chars = (const char *)(node->name);
    string name(chars);
    const char *content = (const char *)xmlNodeGetContent(node);
    rNode->setName(name);
    rNode->setUTF8Data(content);
    cout << "Just parsed " << rNode->name() << ": " << rNode->stringData() << endl;
    while (childIterator != NULL) {
        RBCXMLNode *rNode2 = recursiveProcess(childIterator);
        rNode->addChild(rNode2);
        childIterator = childIterator->next;
    }
    return rNode;
}
So for each node it creates the matching object, sets its name and content, then recurses for its children. Note that each node is only processed once. However, I get the following (nonsensical, to me at least) output:
Just parsed note: ToveToviJaniReminderDon't forget me this weekend!
Just parsed to: ToveTovi
Just parsed name: Tove
Just parsed text: Tove
Just parsed name: Tovi
Just parsed text: Tovi
Just parsed from: Jani
Just parsed text: Jani
Just parsed heading: Reminder
Just parsed text: Reminder
Just parsed body: Don't forget me this weekend!
Just parsed text: Don't forget me this weekend!
Note that each item is being parsed twice: once with the name "text" and once with the name it should have. Also, the "note" root node is having its data parsed as well; this is undesirable. Also note that this root node is not parsed twice, like the others are.
So I have two questions:
How do I avoid parsing the root node's data, and just have its name and not its content? This also will presumably happen with more deeply nested nodes as well.
How do I avoid the duplicate parsing on the other nodes? Obviously, I want to keep the properly named versions, while maintaining the (unlikely) possibility that a node actually is named "text". Also, there may be duplicate nodes that are desired, so just checking to see if the node has been parsed already is not an option.
Thanks in advance.

The main problem I see in your code is the call to xmlNodeGetContent(). For an element, it returns all the text between the opening tag and its closing counterpart, including the text of every nested element.
When parsing with libxml2, the text inside an element is itself stored as a child node of type text, and an element's content can be arbitrarily complex, so you cannot rely on xmlNodeGetContent() to retrieve it. You have to write the recursive function differently. For instance, the quickest fix would be to print only the node name for nodes that are not text (tested with xmlNodeIsText()), and to print the xmlNodeGetContent() result only for nodes that are text. This would give you output something like:
Just parsed note
Just parsed to
Just parsed name
Just parsed text: Tove
Just parsed name
Just parsed text: Tovi
...
Note that now you print only the name for elements, and the content only for text nodes.
This also makes sense conceptually: the content of a non-text node can be arbitrarily complex, so there is no single obvious way to print it, and all you can reliably print is its name. Text nodes, by contrast, are simple enough that you can print their content directly.

Related

Xml read using libxml2 library

I have an XML file as below
<root>
<Radii1 VT = "121212 121212"/>
</root>
I am trying to read the xml using libxml2 library.
cur = cur->xmlChildrenNode;
while (cur != NULL) {
    if ((!xmlStrcmp(cur->name, (const xmlChar *)"Radii1"))){
    }
    cur = cur->next;
}
Now, my problem is that when I print cur->name, it first gives me text, then Radii1, then text again, and then the loop exits.
I am not sure why that is happening. Is the format of the XML not correct?
The XML format is correct, but not every node is an XML element. You're seeing nodes that represent text portions of the document; namely the whitespace, and specifically the newlines, between the XML elements.
What you want to do is examine the value in cur->type, which may be XML_ELEMENT_NODE, XML_TEXT_NODE, or one of various other kinds of XML node, and decide what you want to do with each.
And if you are searching for a particular attribute, like "VT", it would be one of the XML_ATTRIBUTE_NODEs attached to the Radii1 XML_ELEMENT_NODE. libxml2 keeps these on the element's properties list rather than its children list; xmlGetProp() looks one up by name.

Construct Xpath

I have the following repeated piece of the web-page:
<div class="txt ext">
<strong class="param">param_value1</strong>
<strong class="param">param_value2</strong>
</div>
I would like to extract separately values param_value1 and param_value2 using Xpath. How can I do it?
I have tried the following constructions:
'//strong[@class="param"]/text()[0]'
'//strong[@class="txt ext"]/strong[@class="param"][0]/text()'
'//strong[@class="param"]'
none of which returned me separately param_value1 and param_value2.
P.S. I am using Python 2.7 and the latest version of Scrapy.
Here is my testing code:
from scrapy.selector import HtmlXPathSelector
test_content = '<div class="txt ext"><strong class="param">param_value1</strong><strong class="param">param_value2</strong></div>'
sel = HtmlXPathSelector(text=test_content)
sel.select('//div/strong[@class="param"]/text()').extract()[0]
sel.select('//div/strong[@class="param"]/text()').extract()[1]
// is shorthand for the descendant-or-self axis, so //strong selects any strong element in any context. [...] is a predicate which restricts your selection according to some boolean test. There is no strong element with a class attribute which equals txt ext, so you can exclude your second expression.
Your last expression will actually return a node-set of all the strong elements whose class attribute equals param. You can then extract individual nodes from the node-set (use [1], [2]) and get their text contents (use text()).
Your first expression selects the text contents of both nodes, but it's also wrong: the [0] predicate is in the wrong place, and XPath positions start at 1, so you can't select node zero (it doesn't exist). If you want the text contents of the first node you should use:
//strong[@class="param"][1]/text()
and you can use
//strong[@class="param"][2]/text()
for the second text.

AutoHotkey RegExReplace with math

I am trying to change all instances of a number in an xml file. The constant 45 should be added to the number.
Temp is the following text:
<rownum value="1">
<backupapplication>HP Data Protector</backupapplication>
<policy>AUTDR12_Daily</policy>
<policytype>FileSystem</policytype>
<dataretained>31</dataretained>
<fullbackup>7</fullbackup>
<backuptime>0.17</backuptime>
<retentionperiod>Short</retentionperiod>
<peakmbps>11</peakmbps>
<backupcategory>Fulls & Fulls</backupcategory>
</rownum>
<rownum value="2">
<backupapplication>HP Data Protector</backupapplication>
<policy>AUTP_Appl_Monthly</policy>
<policytype>FileSystem</policytype>
<dataretained>268</dataretained>
<fullbackup>91</fullbackup>
<backuptime>2.31</backuptime>
<retentionperiod>Long</retentionperiod>
<peakmbps>12</peakmbps>
<backupcategory>Fulls & Fulls</backupcategory>
</rownum>
I tried the following code:
NeedleRegEx = <rownum value="(\d+)">
Replacement = <rownum value="($1+45)">
Temp := RegExReplace(Temp, NeedleRegEx, Replacement)
But this changes it into
<rownum value="1+45">
while I want
<rownum value="46">
How do I do this in AutoHotKey?
Regular expressions aren't designed to evaluate mathematical expressions. In some languages (e.g. JavaScript) you can pass a replacing function that computes each replacement dynamically, but there is no such luck in AHK.
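For comparison, here is a rough sketch of what such a dynamic replacement looks like in a language where the matches can be iterated and each replacement computed in code. It is written in C++ with std::regex rather than AHK, and shiftRownums is a hypothetical helper name. It is still pattern matching on markup, so it is as fragile as any regex-based XML edit.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Add `delta` to the number in every <rownum value="N"> tag.
// Assumes the tag is written exactly in that form and that N fits in an int.
std::string shiftRownums(const std::string& xml, int delta) {
    static const std::regex pattern("<rownum value=\"(\\d+)\">");
    std::string result;
    std::size_t last = 0;  // offset just past the previous match

    const std::sregex_iterator end;
    for (std::sregex_iterator it(xml.begin(), xml.end(), pattern); it != end; ++it) {
        const std::smatch& match = *it;
        const std::size_t start = static_cast<std::size_t>(match.position(0));

        // Copy the untouched text between the previous match and this one.
        result.append(xml, last, start - last);

        // Do the arithmetic in code, then write the tag back out.
        const int value = std::stoi(match[1].str());
        result += "<rownum value=\"" + std::to_string(value + delta) + "\">";

        last = start + static_cast<std::size_t>(match.length(0));
    }

    // Copy whatever follows the final match.
    result.append(xml, last, std::string::npos);
    return result;
}
```

With the sample data, shiftRownums(Temp, 45) would turn value="1" into value="46" and value="2" into value="47" while leaving every other element untouched.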
Using RegEx for the purpose of parsing XML documents isn't good practice anyway. I suggest using an XML parser instead. For AHK, you can utilize a COM object of MSXML2.DOMDocument. Here's an example (and further references) of how to use it: http://www.autohotkey.com/board/topic/56987-com-object-reference-autohotkey-v11/page-2#entry367838.
What you want to do is parse your XML to a DOM document and loop over every rownum tag. Now, you can retrieve the value attribute, increment it, and overwrite the attribute with the new value.
Update
To the code you've posted in the comments: there were some minor mistakes and one big mistake. The big mistake was trying to parse invalid XML. You can check your XML files by feeding them to a formatter/validator. The loadXml() method will return false if there was a parsing error. The method obj.saveXML() does not exist. If you want to retrieve the document's string representation, simply access its xml property: obj.xml. If you want to save it to a file, there's the built-in method save(filepath).
Here's my suggestion for a clean approach (yes, you CAN use meaningful variable names!):
doc := ComObjCreate("MSXML2.DOMDocument.6.0")
if(!doc.loadXml(xmlString)) {
    msgbox % "Hey! That's no valid XML!"
    ExitApp
}
rownums := doc.getElementsByTagName("rownum")
Loop % rownums.length
{
    rownum := rownums.item(A_Index-1)
    value := rownum.getAttribute("value")
    value += 45
    rownum.setAttribute("value", value)
}
doc.save("myNewFile.xml")

How to read a string line by line in C++

I have a string with XML code in it. I want to read it line by line so I can extract the strings between "title" tags.
I know how to extract the titles, but how do I traverse the string?
It sounds easy, but I have no idea right now.
Thanks in advance.
Maybe you can give some more details about what extracting the strings between the "title" tags means?
If you can already extract the title tags, then you know their positions, so extracting the string is just a matter of taking the substring between the opening and closing title tags, right?
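Both steps can be sketched in standard C++ without a parser. The function names splitLines and extractTitles are made up for illustration, and the sketch assumes the tags appear literally as <title> and </title>, with no attributes and no nesting.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Walk a string line by line by wrapping it in a stream and using std::getline.
std::vector<std::string> splitLines(const std::string& text) {
    std::vector<std::string> lines;
    std::istringstream stream(text);
    std::string line;
    while (std::getline(stream, line)) {
        lines.push_back(line);
    }
    return lines;
}

// Collect the substring between each <title> and the next </title>.
std::vector<std::string> extractTitles(const std::string& xml) {
    const std::string open = "<title>";
    const std::string close = "</title>";
    std::vector<std::string> titles;

    std::size_t pos = 0;
    while ((pos = xml.find(open, pos)) != std::string::npos) {
        const std::size_t start = pos + open.size();
        const std::size_t end = xml.find(close, start);
        if (end == std::string::npos) {
            break;  // unterminated tag: stop rather than guess
        }
        titles.push_back(xml.substr(start, end - start));
        pos = end + close.size();  // continue searching after this title
    }
    return titles;
}
```

Since extractTitles searches the whole string, going line by line is only needed if the rest of the processing is line-oriented; a title that spans a line break is still found.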
Are you looking for an XML parser? The open-source libxml works well and has bindings for a variety of languages. There are other parsers too. What a parser lets you do is take the XML string and build a tree data structure that gives you easy access to the elements of the XML.
EDIT: Originally the requirement about not using an xml parser didn't exist in the question. Here's a rough algorithm to create your own XML parser.
1) Create a tree data structure, and a recursive parse() function.
2) Search for an XML tag, anything with the pattern <...>. Add the "..." tag to one of the child nodes of the current node you are on, and call the recursive parse() function again.
3) If you find an XML tag that closes the original <...>, then you are done parsing that block. Go back to step #2. If there are no other blocks, return from the parse function.
Here's some pseudo code:
// node: The current node in the tree
// current_position: the current position in the XML string that you are parsing
// string: the XML string that you are parsing.
parse(node, current_position, string):
    while current_position < len(string):
        current_position = find(string[current_position:len(string)], "<...>")
        if !found: return current_position // should be end of string if nothing is found.
        node.children[node.num_children] = new Node("<...>");
        current_position = parse(node.children[node.num_children],current_position+size_of_tag,string)
        current_position = find(string[current_position:len(string)], "</...>")
        node.num_children++
    return current_position
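A compilable C++ version of the same idea might look like the following. It is a minimal sketch, not a real XML parser: XmlNode and parse are hypothetical names, and it assumes well-formed input with no attributes, self-closing tags, comments, or XML declaration. It does not check that a closing tag's name matches the tag it closes.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct XmlNode {
    std::string tag;   // tag name, empty for the synthetic root
    std::string text;  // character data found directly inside this element
    std::vector<std::unique_ptr<XmlNode>> children;
};

// Parse the contents of `node` starting at `pos`. Returns the position just
// past the closing tag that ends this block, or the end of the input.
std::size_t parse(XmlNode& node, const std::string& xml, std::size_t pos) {
    while (pos < xml.size()) {
        const std::size_t lt = xml.find('<', pos);
        if (lt == std::string::npos) {
            // No more tags: the remainder is plain text.
            node.text.append(xml, pos, std::string::npos);
            return xml.size();
        }
        node.text.append(xml, pos, lt - pos);  // text before the tag

        const std::size_t gt = xml.find('>', lt);
        if (gt == std::string::npos) {
            return xml.size();  // malformed input: give up
        }

        if (xml[lt + 1] == '/') {
            // A closing tag ends the block the caller opened.
            return gt + 1;
        }

        // An opening tag starts a new child; recurse to fill it in.
        std::unique_ptr<XmlNode> child(new XmlNode());
        child->tag = xml.substr(lt + 1, gt - lt - 1);
        pos = parse(*child, xml, gt + 1);
        node.children.push_back(std::move(child));
    }
    return pos;
}
```

To use it, create an empty XmlNode as a synthetic root and call parse(root, xml, 0); the top-level elements of the document end up in root.children.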

Use a String as an E4X Expression in AS3?

I need to use a string to access nodes and attributes in XML using E4X. It would be ideal to have this scenario (with XML already loaded):
var myXML:XML = e.target.data;
var myStr:String = "appContent.bodyText.(@name == 'My Text')";
myXML.myStr = "New Value for bodyText node where attribute('name') is equal to 'My Text'";
I ultimately need to set new values to an XML document using strings as E4X expressions.
As noted above:
I figured out a workaround
Take the string of the E4X path you want to target
Pull the E4X path and compare it to your target path
If the two are equal, do what you will with that node/attribute
It's a hack, but it works. You could even parse the XML and populate an array with the target string and the target node, then you could just access it through an item in the array. This is expandable in many ways. As long as everything is set up for proper garbage collection, you'll be okay.