Emitting avro format from pipes in Hadoop - c++

I have to program in C++ for Hadoop and I deal with a complex structure of output value.
Unfortunately I can't figure out how to emit this structure in Avro format in MapReduce.
There are some writers, such as DataFileWriter, and they work well for me, but none of that makes sense in terms of HDFS.
This is how I emit the structure now:
IOSerializer serializer;
context.emit(key, serializer.toString(output));
I wrote this custom toString method myself (sorry for the name, I come entirely from the Java world).
It is just a custom serialization into a string. I really want some interoperability here, so I decided to use Avro.
This is the code that writes Avro to a file:
avro::DataFileWriter<fusion_solve::graph> dfw("test.bin", schema);
dfw.write(output);
dfw.close();
What I want to be able to do is something like this:
IOSerializer serializer;
context.emit(serializer.toAvro(key, output));
For the moment I would be happy to get just a plain JSON string as output and convert it later.
The other option for me is to write a custom RecordWriter in Java. But which type of input data should I use in that case? JSON?
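As a stopgap for the "plain JSON string" route mentioned above, here is a minimal sketch in standard C++ only. The `Graph` and `Edge` types are hypothetical stand-ins (the real `fusion_solve::graph` is not shown in the question), and `jsonEscape` and `toJson` are names made up for illustration. It is not an Avro encoder; it only produces a JSON string that a later step could convert.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real output structure, which is not shown in the question.
struct Edge {
    int from;
    int to;
    std::string label;
};

struct Graph {
    std::string name;
    std::vector<Edge> edges;
};

// Escape the characters that JSON does not allow unescaped inside a string literal.
std::string jsonEscape(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {
                    // Remaining control characters become \u00XX escapes.
                    char buf[7];
                    std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Serialize the whole structure into one compact JSON object.
std::string toJson(const Graph& g) {
    std::string out = "{\"name\":\"" + jsonEscape(g.name) + "\",\"edges\":[";
    for (std::size_t i = 0; i < g.edges.size(); ++i) {
        const Edge& e = g.edges[i];
        if (i > 0) out += ',';
        out += "{\"from\":" + std::to_string(e.from) +
               ",\"to\":" + std::to_string(e.to) +
               ",\"label\":\"" + jsonEscape(e.label) + "\"}";
    }
    out += "]}";
    return out;
}
```

Since the emit call in the question already takes a string value, the result could be passed along as `context.emit(key, toJson(output))`, assuming `output` has been mapped to a type like the one sketched here.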

Related

from proto2 debugString to json

I'm new to protocol buffers, and I'm having a hard time parsing the human-readable output from proto2 (the DebugString function) into structured data.
At first I tried to do this recursively, and then I thought about using the DebugString() function. What I want is a simple structured output (JSON) that contains the field names and their values, but proto2 doesn't seem to have a JSON output or converter (correct me if I'm wrong), and I'm working in C++.
Is there any other way to do this without handling it manually? Do you know of any library that does this?
Thank you.

Serialize a data structure in text format in C++/Qt

I need to store a data structure in a SQL database. The data may vary, so I cannot know in advance which fields the database must have. My plan is to encode the data structure as an XML or JSON object and then put that into the SQL database. For that to work I need serialization, since the problem is about encoding that structure.
Which library/tool/method can I use to serialize/deserialize a data structure to and from text? Let's say a data structure composed of some integers, some Unicode strings, and some Booleans.
Many thanks in advance!
Qt can't automatically serialize things, so you'll have to write some kind of routine to save/load your data. The exact solution depends on your requirements and on how you are going to use the data.
Things you should consider:
Is human-readability required? (loading text file can be slower, due to parsing)
Is communication/data exchange (with some other program) required?
What is the least expensive solution to develop?
Recommendation:
If the data is simple, used only by your app, and binary (not human-readable), then use QDataStream + QByteArray and serialize by hand.
If the data is complex, used only by your app, and binary (not human-readable), then use QDataStream + QByteArray + QVariantMap (dump the data into a QVariantMap, then serialize it into a QByteArray using QDataStream).
If the data is simple, used only by your app, and text (should be human-readable), then use plain text, JSON or XML.
If the data is complex, used only by your app, and text (human-readable), then use JSON or XML, depending on whichever is available or whichever is shortest.
If data is supposed to be read by some 3rd party tool, then use whatever format this tool uses.
Details:
If you're communicating with something, you're stuck with whatever format the thing you communicate with is using. That will probably be either JSON or XML, because many scripting languages can easily read either of them.
If the data is supposed to be human-readable, then you have the following options: plain text, JSON or XML, depending on the data format and complexity. (INI won't work in your scenario, because it is supported through QSettings, which doesn't really serialize to an arbitrary in-memory location.) XML and JSON can store tree-like structures, but JSON is available in Qt 4 only via an external dependency, and the XML reader/writer takes some effort to get the hang of.
If you want to store private data that is not human-readable and is used only by your application, you can use pretty much whatever you want, including a binary format.
The simplest way would be to dump it into QByteArray + QDataStream manually, because QDataStream guarantees that binary data will be loaded correctly on any platform regardless of endianness (as long as you don't dump it as a raw memory block, that is). That will work fine if the data is a simple array of similar structures with a fixed number of components that are always present.
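To illustrate why writing field by field in a fixed byte order is portable while dumping a raw memory block is not, here is a sketch in plain standard C++. It is not Qt code and not how QDataStream is implemented; it only shows the idea (QDataStream uses big-endian byte order by default). The `Sample` record and all function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Append a 32-bit value most-significant byte first, independent of host endianness.
void writeU32BE(std::vector<std::uint8_t>& buf, std::uint32_t v) {
    buf.push_back(static_cast<std::uint8_t>(v >> 24));
    buf.push_back(static_cast<std::uint8_t>(v >> 16));
    buf.push_back(static_cast<std::uint8_t>(v >> 8));
    buf.push_back(static_cast<std::uint8_t>(v));
}

// Read a 32-bit big-endian value starting at offset and advance the offset.
std::uint32_t readU32BE(const std::vector<std::uint8_t>& buf, std::size_t& offset) {
    if (buf.size() < 4 || offset > buf.size() - 4) {
        throw std::out_of_range("truncated buffer");
    }
    std::uint32_t v = (static_cast<std::uint32_t>(buf[offset]) << 24) |
                      (static_cast<std::uint32_t>(buf[offset + 1]) << 16) |
                      (static_cast<std::uint32_t>(buf[offset + 2]) << 8) |
                      static_cast<std::uint32_t>(buf[offset + 3]);
    offset += 4;
    return v;
}

// Hypothetical fixed-layout record: every field is always present.
struct Sample {
    std::uint32_t id;
    std::uint32_t value;
};

// Layout: element count, then id and value for each element.
std::vector<std::uint8_t> serialize(const std::vector<Sample>& samples) {
    std::vector<std::uint8_t> buf;
    writeU32BE(buf, static_cast<std::uint32_t>(samples.size()));
    for (const Sample& s : samples) {
        writeU32BE(buf, s.id);
        writeU32BE(buf, s.value);
    }
    return buf;
}

std::vector<Sample> deserialize(const std::vector<std::uint8_t>& buf) {
    std::size_t offset = 0;
    std::uint32_t count = readU32BE(buf, offset);
    std::vector<Sample> samples;
    for (std::uint32_t i = 0; i < count; ++i) {
        Sample s;
        s.id = readU32BE(buf, offset);
        s.value = readU32BE(buf, offset);
        samples.push_back(s);
    }
    return samples;
}
```

Copying the struct's memory directly would instead bake the host's byte order and padding into the stored bytes, which is the case the parenthetical above warns about.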
If the data is tree-like and has optional nodes (i.e. a tree of key-value pairs, where a value can be another tree) and keys may or may not be present, you'll need a library or routine that deals with those key/value pair trees.
In Qt 4 that's QVariantMap and XML (via QXmlStreamWriter/QXmlStreamReader). Qt 5 also has JSON support. When multiple solutions are available, the built-in solution that takes the least effort to implement wins. Reading a named field from a QVariantMap takes 1 line of code per value plus a helper function of roughly 10 lines of code. It also supports all Qt 4 types.
QVariantMap has a significant advantage over JSON in that it supports all Qt types natively as values. I.e. you can easily dump QColor, QPoint, plus any types you registered into it. Anything you can store in a QVariant you can store in a QVariantMap, which can then be serialized to/from QDataStream in a single line of code.
JSON, on the other hand, has the advantage of being a "standard" data format that can be loaded from a scripting language. That comes at the cost of supporting only 6 basic types.
JSON is not supported natively by Qt 4, and although there is QJson, adding external dependencies is not always a good idea, because you're going to babysit them. If you're using Qt 4 and really need JSON support, it might make sense to upgrade to Qt 5.
What do you do with this data in the database? If you want to use it in another program or language, then your data needs to be readable, and you can use QXmlStreamWriter/QXmlStreamReader or QJsonDocument (both work with QByteArray).
If you don't want to use your data outside your program, you can write it into a QByteArray with QDataStream.
It would be something like the code below.
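QByteArray data;
QDataStream stream(&data, QIODevice::WriteOnly); // an open mode is required when writing into a QByteArray
stream << yourClasses;
sqlQuery.bindValue(0, data); // assumes a prepared query whose first placeholder takes the blob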
You just need to add stream operators in your classes you want to serialize:
QDataStream &operator<<(QDataStream &, A const&);
QDataStream &operator>>(QDataStream &, A &);
This is a long and detailed answer, but here is the point. Qt 5.2 has an example of how you could achieve all this with Qt: the "SaveGame" example, which does both serialization and deserialization. It basically does that for the non-playable game characters. It also uses QSaveFile to save the information safely into the desired file. Here is the URL of the example's documentation for your convenience:
http://doc-snapshot.qt-project.org/qdoc/qtcore-savegame-example.html
I have just solved this issue a few days ago with Qt5's json parser. I would suggest to take a look into the following classes:
QJsonDocument: http://qt-project.org/doc/qt-5.1/qtcore/qjsondocument.html
QJsonObject: http://qt-project.org/doc/qt-5.1/qtcore/qjsonobject.html
QJsonArray: http://qt-project.org/doc/qt-5.1/qtcore/qjsonarray.html
QJsonParseError: http://qt-project.org/doc/qt-5.1/qtcore/qjsonparseerror.html
QJsonValue: http://qt-project.org/doc/qt-5.1/qtcore/qjsonvalue.html
If you are planning to use Qt 4, you will need to use something like the QJson library, but mind you, it is a lot slower than the Qt 5 JSON parser. Here you can find Thiago's benchmark:
https://plus.google.com/108138837678270193032/posts/7EVTACgwtxK
JSON supports strings, numbers (QJsonValue stores them as double), and booleans, so that should work for you. Here are the QJsonValue signatures for serializing and deserializing such types:
Serialize (construct a JSON value)
QJsonValue(bool b)
QJsonValue(double n)
QJsonValue(const QString & s)
Deserialize (read the value back)
bool toBool(bool defaultValue = false) const
double toDouble(double defaultValue = 0) const
QString toString(const QString & defaultValue = QString()) const
Here you can find a summary page about the json support in Qt 5:
http://doc-snapshot.qt-project.org/qt5-nosubdir/json.html
Note that Qt 5's JSON is stored internally in a very efficient binary representation, so you could also use that rather than the text representation, depending on your exact scenario. For Qt 4, you could even backport these classes if needed, since they are not tied that closely to Qt 5. That may actually have been done for BlackBerry development, since we were struggling with Qt 4 there and badly needed the JSON parser from Qt.

How to start using xml with C++

(Not sure if this should be CW or not, you're welcome to comment if you think it should be).
At my workplace, we have many many different file formats for all kinds of purposes. Most, if not all, of these file formats are just written in plain text, with no consistency. I'm only a student working part-time, and I have no experience with using xml in production, but it seems to me that using xml would improve productivity, as we often need to parse, check and compare these outputs.
So my questions are: given that I can only control one small application and its output (only - the inputs are formats that are used in other applications as well), is it worth trying to change the output to be xml-based? If so, what are the best known ways to do that in C++ (i.e., xml parsers/writers, etc.)? Also, should I also provide a plain-text output to make it easy for the users (which are also programmers) to get used to xml? Should I provide a script to translate xml-plaintext? What are your experiences with this subject?
Thanks.
Don't just use XML because it's XML.
Use XML because:
other applications (that only accept XML) are going to read your output
you have a hierarchical data structure that lends itself perfectly to XML
you want to transform the data to other formats using XSL (e.g. to HTML)
EDIT:
A nice personal experience:
Customer: your application MUST be able to read XML.
Me: Er, OK, I will adapt my application so it can read XML.
Same customer (a few days later): your application MUST be able to read fixed width files, because we just realized our mainframe cannot generate XML.
Amir, to parse XML you can use TinyXML, which is incredibly easy to get started with. Check its documentation for a quick overview, and read the "what it does not do" section carefully. I have been using it for reading, and all I can say is that this tiny library does the job very well.
As for writing - if your XML files aren't complex you might build them manually with a string object. "Aren't complex" for me means that you're only going to store text at most.
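A minimal sketch of that manual approach, using only the standard library. The element names and the `resultToXml` helper are hypothetical, and tag names are assumed to be valid XML names. The one thing that is easy to forget when concatenating strings is escaping the text content, so that is shown explicitly.

```cpp
#include <string>

// Escape the five characters that have special meaning in XML text and attribute values.
std::string xmlEscape(const std::string& in) {
    std::string out;
    for (char c : in) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += c;
        }
    }
    return out;
}

// Build <tag>text</tag>. The tag name is assumed to be a valid XML name.
std::string element(const std::string& tag, const std::string& text) {
    return "<" + tag + ">" + xmlEscape(text) + "</" + tag + ">";
}

// Hypothetical record layout, used for illustration only.
std::string resultToXml(const std::string& name, const std::string& status) {
    return "<result>" + element("name", name) + element("status", status) + "</result>";
}
```

This stays manageable only while the output is flat text fields, which matches the "aren't complex" condition above. Anything with namespaces, mixed content or encodings is better left to a real library.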
For more complex XML reading/writing you had better check Xerces, which is heavier than TinyXML. I haven't used it myself, but I've seen it in production and it does deliver.
You can try using the boost::property_tree class.
http://www.boost.org/doc/libs/1_43_0/doc/html/property_tree.html
http://www.boost.org/doc/libs/1_43_0/doc/html/boost_propertytree/tutorial.html
http://www.boost.org/doc/libs/1_43_0/doc/html/boost_propertytree/parsers.html#boost_propertytree.parsers.xml_parser
It's pretty easy to use, but the page does warn that it doesn't support the XML format completely. If you do use this though, it gives you the freedom to easily use XML, INI, JSON, or INFO files without changing more than just the read_xml line.
If you want that ability though, you should avoid xml attributes. To use an attribute, you have to look at the key , which won't transfer between filetypes (although you can manually create your own subnodes).
Although using TinyXML is probably better. I've seen it used before in a couple of projects I've worked on, but don't have any experience with it.
Another approach to handling XML in your application is to use a data binding tool, such as CodeSynthesis XSD. Such a tool will generate C++ classes that hide all the gory details of parsing/serializing XML -- all that you see are objects corresponding to your XML vocabulary and functions that you can call to get/set the data, for example:
Person p = person ("person.xml");
cout << p.name ();
p.name ("John");
p.age (30);
ofstream ofs ("person.xml");
person (ofs, p);
Here's what previous SO threads have said on the topic. Please add others you know of that are relevant:
What is the best open XML parser for C++?
What is XML good for and when should i be using it?
What are good alternative data formats to XML?
BTW, before you decide on an XML parser, you may want to make sure that it will actually be able to parse all XML documents instead of just the "simple" ones, as discussed in this article:
Are you using a real XML parser?

Want to store profiles in Qt, use SQLite or something else?

I want to store some settings for different profiles of what a "task" does.
I know .NET has a nice ORM. Is there something like that for Qt, or an Active Record implementation, or whatever? I know writing a bunch of SQL will be fun.
I'm going to agree with Micheal E and say that you can use QJson, but no, you don't have to manage serialization yourself. QJson has a QObject->QJson serializer/deserializer, so as long as all your relevant data is exposed via Q_PROPERTY, QJson can grab it and write/read it to/from disk.
Examples here: http://qjson.sourceforge.net/usage.html
From there you can simply dump the data into a file.
One option would be to serialize objects to JSON with QJson. You still need to manage serialization, but it could well be a lot simpler if you don't need sophisticated query capabilities.

XML Serialization/Deserialization in C++

I am using C++ from MinGW, which is the Windows version of GNU C++.
What I want to do is serialize a C++ object into an XML file and deserialize an object from an XML file on the fly. I checked TinyXML. It's pretty useful, and (please correct me if I misunderstand it) it basically adds all the nodes during processing and finally writes them to a file in one chunk using the TiXmlDocument::SaveFile(filename) function.
I am working on real-time processing. How can I write to a file on the fly and append each subsequent result to the file?
Thanks.
Boost has a very nice serialization/deserialization library, Boost.Serialization.
If you stream your objects to a Boost XML archive, it will stream them in XML format.
If XML is too big or too slow, you only need to change the archive to a text or binary archive to change the streaming format.
Here is a better example of C++ object serialization:
http://www.codeproject.com/KB/XML/XMLFoundation.aspx
I notice that each TiXmlBase Class has a Print method and also supports streaming to strings and streams.
You could walk the new parts of the document in sequence and output those parts as they are added, maybe?
Give it a try.....
Tony
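The idea of emitting each new part as soon as it is produced, rather than saving the whole document at the end, can be sketched without any XML library. The `StreamingXmlWriter` class and `writeAll` helper below are hypothetical names. The sketch assumes each record arrives as an already serialized, well-formed element string (for instance, one node printed by whatever XML library is in use).

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Writes one XML document incrementally: the root element is opened up front,
// each record is written and flushed as soon as it is available, and the root
// is closed when the writer is finished.
class StreamingXmlWriter {
public:
    StreamingXmlWriter(std::ostream& out, const std::string& root)
        : out_(out), root_(root), open_(true) {
        out_ << "<?xml version=\"1.0\"?>\n<" << root_ << ">\n";
    }

    // Append one already serialized, well-formed element.
    void append(const std::string& elementXml) {
        out_ << "  " << elementXml << '\n';
        out_.flush();  // push the record to the underlying stream right away
    }

    // Close the root element; until this runs the output is not well-formed XML.
    void finish() {
        if (open_) {
            out_ << "</" << root_ << ">\n";
            out_.flush();
            open_ = false;
        }
    }

    ~StreamingXmlWriter() { finish(); }

private:
    std::ostream& out_;
    std::string root_;
    bool open_;
};

// Example: the same writer works with std::ofstream for a file or, as here,
// with std::ostringstream for an in-memory check.
std::string writeAll(const std::vector<std::string>& elements) {
    std::ostringstream out;
    StreamingXmlWriter writer(out, "results");
    for (const std::string& e : elements) writer.append(e);
    writer.finish();
    return out.str();
}
```

The trade-off is that a reader opening the file before `finish()` runs sees an unclosed root element, so a consumer that must read while processing continues needs to tolerate that or use one small document per record.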
I've been using gSOAP for this purpose. It is probably too powerful for just XML serialization, but knowing it can do much more means I do not have to consider other solutions for more advanced projects since it also supports WSDL, SOAP, XML-RPC, and JSON. Also suitable for embedded and small devices, since XML is simply a transient wire format and not kept in a DOM or something memory intensive.