Parsing an XML document - c++

I want to parse an XML document in C++ and be able to identify what text exists in a particular tag. I have checked parsers like TinyXML and PugiXML, but none of them seem to identify the tags separately. How can I achieve this?

Using RapidXml, you can traverse the nodes and attributes and read each tag's name and text.
#include <iostream>
#include <rapidxml.hpp>
#include <rapidxml_utils.hpp>
int main()
{
    using namespace rapidxml;
    file<> in("input.xml"); // Load the file into memory.
    xml_document<> doc;
    doc.parse<0>(in.data()); // Parse the file (the buffer is modified in place).
    // Traverse the first-level elements.
    for (xml_node<>* node = doc.first_node(); node; node = node->next_sibling())
    {
        std::cout << node->name() << ": " << node->value() << '\n'; // Write the tag name and its text.
        // Traverse the attributes of the element.
        for (xml_attribute<>* attr = node->first_attribute(); attr; attr = attr->next_attribute())
        {
            std::cout << attr->name() << '\n'; // Write the attribute name.
        }
    }
}

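To get all tag names with pugixml:
#include <iostream>
#include <pugixml.hpp>
// Print the name of this element and, recursively, of every descendant element.
void dumpTags(const pugi::xml_node& node) {
    // Text (PCDATA) nodes have no name, so only elements are printed.
    if (node.type() == pugi::node_element) {
        std::cout << node.name() << std::endl;
        for (pugi::xml_node child = node.first_child(); child; child = child.next_sibling())
            dumpTags(child);
    }
}
int main() {
    pugi::xml_document doc;
    pugi::xml_parse_result result = doc.load_string("<tag1>abc<tag2>def</tag2>pqr</tag1>");
    if (result)
        dumpTags(doc.first_child());
}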

Related

Find all keys JSON - RapidJSON

I need to find all the keys listed in kTypeNames[] using the RapidJSON library.
I am trying to iterate over all the nodes, but I'm missing something. Here's the code:
#include <iostream>
#include <fstream>
#include <string>
#include <bits/stdc++.h>
#include <unistd.h>
#include "rapidjson/document.h"
#include "rapidjson/writer.h"
#include "rapidjson/stringbuffer.h"
using namespace rapidjson;
using namespace std;
const char* kTypeNames[] = { "id", "text", "templ_text", "key" };
int main(int argc, char* argv[]) {
string line;
char json[65000];
std::ifstream file(argv[1]);
unsigned long i = 0;
if (file.is_open()) {
while (!file.eof()) {
file.get(json[i]);
i++;
}
file.close();
} else {
cout << "Unable to open file";
}
Document document;
document.Parse(json);
printf("\n\n\n\n*********Access values in document**********\n");
assert(document.IsObject());
for (auto Typename : kTypeNames) {
if (document.HasMember(Typename)) {
cout << "\n";
cout << Typename << ":" << document[Typename].GetString()<< endl;
cout << "\n";
}
else {
cout << "\n None\n";
}
}
return 0;
}
It does not work with nested JSON.
{
"node": {
"text": "find this",
"templ_text": "don't find",
"ver": "don't find"
},
"ic": "",
"text": "also this",
"templ_text": "don't care",
"par": {
"SET": {
"vis": "<blabla>",
"text": "keyFound",
"templ_text": "don't need this"
}
}
}
This is the output:
None
text:also this
templ_text:don't care
None
I would like to find all the "text" keys.
How can I iterate through all the nodes of the JSON document?
The code you have is just searching for a list of pre-defined keys directly within the document root (document.HasMember is not a recursive search!).
You could just loop through the document nodes recursively. For example for object/map nodes, you loop on the MemberBegin() and MemberEnd() iterators, similar to a std::map or other standard containers.
for (auto i = node.MemberBegin(); i != node.MemberEnd(); ++i)
{
std::cout << "key: " << i->name.GetString() << std::endl;
WalkNodes(i->value);
}
Array uses Begin() and End(). Then, when you encounter a node with a "text" member, you can output the value of that node (i->value).
Alternatively, rather than using a Document DOM object, you can do it with the parser stream. RapidJSON uses a "push" (SAX-style) API for this, where it calls methods you define in a handler class as it encounters each piece of JSON. Specifically, it will call a Key method for every object key.
struct MyHandler : public BaseReaderHandler<UTF8<>, MyHandler> {
    bool Key(const char* str, SizeType length, bool copy)
    {
        std::cout << "Key: " << str << std::endl;
        return true; // returning false would abort parsing
    }
    ...
};
MyHandler handler;
rapidjson::Reader reader;
rapidjson::StringStream ss(json);
reader.Parse(ss, handler);
Printing only the values of "text" keys is a bit more complex: set a flag in the Key callback, then print the value in the value callback that follows it.
struct MyHandler : public BaseReaderHandler<UTF8<>, MyHandler> {
    bool Key(const char* str, SizeType length, bool copy)
    {
        isTextKey = strcmp(str, "text") == 0; // Also need to set to false in some other places
        return true;
    }
    bool String(const char* str, SizeType length, bool copy)
    {
        if (isTextKey) std::cout << "text string " << str << std::endl;
        return true;
    }
    ...
    bool isTextKey = false;
};
Also remember that JSON allows a null character (\u0000) inside a string, which is why the callbacks take a length parameter and Value has GetStringLength(), and that strings may contain Unicode. Fully supporting any JSON document means accounting for both.

boost ptree access first element with no path name

I am using the Boost library to manipulate a JSON string and I would like to access the first element.
I was wondering if there was some convenient way to access the first element of a ptree with no path name.
I tried this, but I get no value:
namespace pt = boost::property_tree;
pt::ptree pt2;
string json = "\"ok\"";
istringstream is(json);
try
{
pt::read_json(is, pt2);
cout << pt2.get_child("").equal_range("").first->first.data() << endl;
}
catch (std::exception const& e)
{
cerr << e.what() << endl;
}
Solution: replace
cout << pt2.get_child("").equal_range("").first->first.data() << endl;
with
cout << pt2.get_value<std::string>() << endl;
Firstly, Property Tree is not a JSON library.
Secondly, the input is not in the subset of JSON supported by the library.
Thirdly, since the input results in a tree that has no child nodes, you should use the value of the root node itself.
Lastly, if you had wanted the first child node, use ordered_begin()->second, which gives the first child in key order (begin()->second gives the first child in insertion order):
#include <boost/property_tree/json_parser.hpp>
#include <sstream>
#include <iostream>
void broken_input() {
boost::property_tree::ptree pt;
std::istringstream is("\"ok\"");
read_json(is, pt);
std::cout << "Root value is " << pt.get_value<std::string>() << std::endl;
}
void normal_tree() {
boost::property_tree::ptree pt;
pt.put("first", "hello");
pt.put("second", "world");
pt.put("third", "bye");
std::cout << pt.ordered_begin()->second.get_value<std::string>() << std::endl;
write_json(std::cout, pt);
}
int main() {
try {
broken_input();
normal_tree();
}
catch (std::exception const& e)
{
std::cerr << e.what() << std::endl;
}
}
Prints
Root value is ok
hello
{
"first": "hello",
"second": "world",
"third": "bye"
}
"I would like to access the first element."
This is impossible in the general case, since the members of a JSON object have no fixed position by definition. The current first element can move after a JSON transformation and the resulting JSON will still be equivalent, although its elements are reordered. Thus Boost does not provide such an API.

Search for key by vector in map

So, we have a school project to create a phonebook where you should be able to look up phone numbers by searching for a name. I decided to use a map with a string for the phone number and a vector of strings for the names, since each number should be able to have multiple names associated with it.
However, due to us jumping straight from Python to C++ without any explanation of the syntax or the language, I am having a hard time coming up with a way to look for the number by searching for names.
The class I am using looks like this
class Telefonbok
{
public:
void add(string namn, string nummer)
{
map<string, vector<string>>::iterator it = boken.find(nummer);
if (it != boken.end())
{
cout << "This number already exists, please choose another";
}
else
{
namn_alias.push_back(namn);
boken[nummer] = namn_alias;
}
}
void lookup(string name)
{
for (map<string, vector<string>>::iterator sokning = boken.begin(); sokning != boken.end(); sokning++)
cout << "Hello!";
}
private:
vector<string> namn_alias;
string nummer;
map<string, vector<string>> boken;
};
What I am trying to do in lookup function is to search for a phone number by the names in the vector, but I am stumped on how to proceed with looking through the vector inside the for-loop.
The plan was to go through the Map keys one by one to find the vector that contains the searched-for name. Any tips on how to proceed or some functions I have missed that can be used for this?
Algirdas is correct, you should read up on C++.
Assuming each name maps to one or more numbers, but each number belongs to only one name...
#include <cstdlib> // EXIT_SUCCESS
#include <iostream>
#include <map>
#include <string>
#include <vector>
using std::cout;
using std::endl;
using std::map;
using std::string;
using std::vector;
class Telefonbok
{
public:
void add(string namn, string nummer) {
auto it = nummer_namn.find(nummer);
if (it != nummer_namn.end()) {
cout << "This number already exists, please choose another" << endl;
}
else {
nummer_namn[nummer] = namn;
namn_nummer[namn].push_back(nummer);
}
}
void lookup(string name) {
auto it = namn_nummer.find(name);
if (it == namn_nummer.end()) {
cout << "Unable to find any numbers for " << name << ", sorry." << endl;
return;
}
for (auto const& sokning : it->second)
cout << name << " : " << sokning << endl;
}
private:
map<string, vector<string>> namn_nummer;
map<string, string> nummer_namn;
};
int main() {
Telefonbok bok;
bok.add("Eljay", "789");
bok.add("Eljay", "456");
bok.add("Beaker", "123");
bok.lookup("Eljay");
bok.lookup("Beaker");
bok.lookup("Bunsen Honeydew");
return EXIT_SUCCESS;
}
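If you would rather keep the layout from the question (a map from number to a vector of names), the lookup can also be done by scanning every entry and searching its vector with std::find. This is only a sketch: the free function lookup_numbers and its parameters are hypothetical names, and it returns the matches instead of printing them.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: returns every number whose name list contains `name`.
// `boken` maps a phone number to the names registered for that number.
std::vector<std::string> lookup_numbers(
    const std::map<std::string, std::vector<std::string>>& boken,
    const std::string& name)
{
    std::vector<std::string> numbers;
    for (const auto& entry : boken) {
        const std::vector<std::string>& names = entry.second;
        // std::find does a linear search through this entry's names.
        if (std::find(names.begin(), names.end(), name) != names.end())
            numbers.push_back(entry.first);
    }
    return numbers;
}
```

This visits every stored name on each lookup, which is fine for a small phonebook. The two-map version above trades some extra memory for a direct lookup by name.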

RapidXML giving empty CDATA nodes

I wrote the code below to get the CDATA node values too. I get the node's name, but the values are blank.
I changed the parse flags to parse_full, but that did not work either.
If I manually remove "<![CDATA[" and "]]>" from the XML, it gives the value as expected, but removing them before parsing is not an option.
The code:
#include <iostream>
#include <vector>
#include <sstream>
#include "rapidxml/rapidxml_utils.hpp"
using std::vector;
using std::stringstream;
using std::cout;
using std::endl;
int main(int argc, char* argv[]) {
rapidxml::file<> xmlFile("test.xml");
rapidxml::xml_document<> doc;
doc.parse<rapidxml::parse_full>(xmlFile.data());
rapidxml::xml_node<>* nodeFrame = doc.first_node()->first_node()->first_node();
cout << "BEGIN\n\n";
do {
cout << "name: " << nodeFrame->first_node()->name() << "\n";
cout << "value: " << nodeFrame->first_node()->value() << "\n\n";
} while( nodeFrame = nodeFrame->next_sibling() );
cout << "END\n\n";
return 0;
}
The XML:
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0" xmlns:c="http://base.google.com/cns/1.0">
<itens>
<item>
<title><![CDATA[Title 1]]></title>
<g:id>34022</g:id>
<g:price>2173.00</g:price>
<g:sale_price>1070.00</g:sale_price>
</item>
<item>
<title><![CDATA[Title 2]]></title>
<g:id>34021</g:id>
<g:price>217.00</g:price>
<g:sale_price>1070.00</g:sale_price>
</item>
</itens>
</rss>
When you use CDATA, RapidXML parses it as a separate node 'below' the outer element in the hierarchy.
Your code correctly gets 'title' by using nodeFrame->first_node()->name(), but because the CDATA text is in a separate child node, you need one more first_node() call to extract the value:
cout << "value: " << nodeFrame->first_node()->first_node()->value();

Parsing XML Attributes with Boost

I would like to share with you an issue I'm having while trying to process some attributes from XML elements in C++ with Boost libraries (version 1.52.0). Given the following code:
#define ATTR_SET ".<xmlattr>"
#define XML_PATH1 "./pets.xml"
#include <iostream>
#include <string>
#include <boost/foreach.hpp>
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/xml_parser.hpp>
using namespace std;
using namespace boost;
using namespace boost::property_tree;
const ptree& empty_ptree(){
static ptree t;
return t;
}
int main() {
ptree tree;
read_xml(XML_PATH1, tree);
const ptree & formats = tree.get_child("pets", empty_ptree());
BOOST_FOREACH(const ptree::value_type & f, formats){
string at = f.first + ATTR_SET;
const ptree & attributes = formats.get_child(at, empty_ptree());
cout << "Extracting attributes from " << at << ":" << endl;
BOOST_FOREACH(const ptree::value_type &v, attributes){
cout << "First: " << v.first.data() << " Second: " << v.second.data() << endl;
}
}
}
Let's say I have the following XML structure:
<?xml version="1.0" encoding="utf-8"?>
<pets>
<cat name="Garfield" weight="4Kg">
<somestuff/>
</cat>
<dog name="Milu" weight="7Kg">
<somestuff/>
</dog>
<bird name="Tweety" weight="0.1Kg">
<somestuff/>
</bird>
</pets>
The console output I get is the following:
Extracting attributes from cat.<xmlattr>:
First: name Second: Garfield
First: weight Second: 4Kg
Extracting attributes from dog.<xmlattr>:
First: name Second: Milu
First: weight Second: 7Kg
Extracting attributes from bird.<xmlattr>:
First: name Second: Tweety
First: weight Second: 0.1Kg
However, if I decide to use a common element name for every element hanging from the root node (so that they are identified by their attributes instead), the result changes completely. This is the XML file in that case:
<?xml version="1.0" encoding="utf-8"?>
<pets>
<pet type="cat" name="Garfield" weight="4Kg">
<somestuff/>
</pet>
<pet type="dog" name="Milu" weight="7Kg">
<somestuff/>
</pet>
<pet type="bird" name="Tweety" weight="0.1Kg">
<somestuff/>
</pet>
</pets>
And the output would be the following:
Extracting attributes from pet.<xmlattr>:
First: type Second: cat
First: name Second: Garfield
First: weight Second: 4Kg
Extracting attributes from pet.<xmlattr>:
First: type Second: cat
First: name Second: Garfield
First: weight Second: 4Kg
Extracting attributes from pet.<xmlattr>:
First: type Second: cat
First: name Second: Garfield
First: weight Second: 4Kg
It seems the number of elements hanging from the root node is being properly recognized since three sets of attributes have been printed. Nevertheless, all of them refer to the attributes of the very first element...
I'm not an expert in C++ and really new to Boost, so this might be something I'm missing with respect to hash mapping processing or so... Any advice will be much appreciated.
The problem with your program is located in this line:
const ptree & attributes = formats.get_child(at, empty_ptree());
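With this line you are asking for the child pet.<xmlattr> of pets, and that path lookup resolves to the same pet element each time (here, the first one). You do this three times, independently of which f you are traversing, so you get the same attributes on every iteration. Following this article I'd guess that what you need to use is: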
const ptree & attributes = f.second.get_child("<xmlattr>", empty_ptree());
The full code, that works with both your xml files, is:
#define ATTR_SET ".<xmlattr>"
#define XML_PATH1 "./pets.xml"
#include <iostream>
#include <string>
#include <boost/foreach.hpp>
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/xml_parser.hpp>
using namespace std;
using namespace boost;
using namespace boost::property_tree;
const ptree& empty_ptree(){
static ptree t;
return t;
}
int main() {
ptree tree;
read_xml(XML_PATH1, tree);
const ptree & formats = tree.get_child("pets", empty_ptree());
BOOST_FOREACH(const ptree::value_type & f, formats){
string at = f.first + ATTR_SET;
const ptree & attributes = f.second.get_child("<xmlattr>", empty_ptree());
cout << "Extracting attributes from " << at << ":" << endl;
BOOST_FOREACH(const ptree::value_type &v, attributes){
cout << "First: " << v.first.data() << " Second: " << v.second.data() << endl;
}
}
}
I have never used this feature, but I suspect that the boost::property_tree XML parser isn't a general-purpose XML parser and instead expects a certain schema, where you have exactly one specific tag for one specific property.
You might prefer another XML parser that can handle arbitrary XML schemas if you want to work with XML beyond the boost::property_tree capabilities. Have a look at e.g. Xerces C++ or Poco XML.
File to be parsed, pets.xml
<pets>
<pet type="cat" name="Garfield" weight="4Kg">
<something name="test" value="*"/>
<something name="demo" value="#"/>
</pet>
<pet type="dog" name="Milu" weight="7Kg">
<something name="test1" value="$"/>
</pet>
<birds type="parrot">
<bird name="african grey parrot"/>
<bird name="amazon parrot"/>
</birds>
</pets>
code:
// DemoPropertyTree.cpp : Defines the entry point for the console application.
//Prerequisite boost library
#include "stdafx.h"
#include <boost/property_tree/xml_parser.hpp>
#include <boost/property_tree/ptree.hpp>
#include <boost/foreach.hpp>
#include<iostream>
using namespace std;
using namespace boost;
using namespace boost::property_tree;
void processPet(ptree subtree)
{
BOOST_FOREACH(ptree::value_type petChild,subtree.get_child(""))
{
//processing attributes of element pet
if(petChild.first=="<xmlattr>")
{
BOOST_FOREACH(ptree::value_type petAttr,petChild.second.get_child(""))
{
cout<<petAttr.first<<"="<<petAttr.second.data()<<endl;
}
}
//processing child element of pet(something)
else if(petChild.first=="something")
{
BOOST_FOREACH(ptree::value_type somethingChild,petChild.second.get_child(""))
{
//processing attributes of element something
if(somethingChild.first=="<xmlattr>")
{
BOOST_FOREACH(ptree::value_type somethingAttr,somethingChild.second.get_child(""))
{
cout<<somethingAttr.first<<"="<<somethingAttr.second.data()<<endl;
}
}
}
}
}
}
void processBirds(ptree subtree)
{
BOOST_FOREACH(ptree::value_type birdsChild,subtree.get_child(""))
{
//processing attributes of element birds
if(birdsChild.first=="<xmlattr>")
{
BOOST_FOREACH(ptree::value_type birdsAttr,birdsChild.second.get_child(""))
{
cout<<birdsAttr.first<<"="<<birdsAttr.second.data()<<endl;
}
}
//processing child element of birds(bird)
else if(birdsChild.first=="bird")
{
BOOST_FOREACH(ptree::value_type birdChild,birdsChild.second.get_child(""))
{
//processing attributes of element bird
if(birdChild.first=="<xmlattr>")
{
BOOST_FOREACH(ptree::value_type birdAttr,birdChild.second.get_child(""))
{
cout<<birdAttr.first<<"="<<birdAttr.second.data()<<endl;
}
}
}
}
}
}
int _tmain(int argc, _TCHAR* argv[])
{
const std::string XML_PATH1 = "C:/Users/10871/Desktop/pets.xml";
ptree pt1;
boost::property_tree::read_xml( XML_PATH1, pt1 );
cout<<"********************************************"<<endl;
BOOST_FOREACH( ptree::value_type const& topNodeChild, pt1.get_child( "pets" ) )
{
ptree subtree = topNodeChild.second;
if( topNodeChild.first == "pet" )
{
processPet(subtree);
cout<<"********************************************"<<endl;
}
else if(topNodeChild.first=="birds")
{
processBirds(subtree);
cout<<"********************************************"<<endl;
}
}
getchar();
return 0;
}
The output was shown in a screenshot (image not preserved).