What libraries exist for constructing AFP documents in different languages?

I am currently looking for a library, in any language, that allows me to create structured AFP documents, but so far I have not found one.
I previously tried a Java library called afp.lib; it builds the document structure, but it loses bytes along the way, which corrupts the document.
I would appreciate a pointer to any language, or even just a library, that lets me construct AFP documents without losing bytes.

You can try https://github.com/yan74/afplib, which is a Java library for reading and writing AFP. It is, however, very low-level: you get fine-grained access to all structured fields, triplets, and so on, but you need detailed knowledge of MO:DCA to make use of it. If you want to create documents, a composer is better suited: Apache FOP.

Related

Portable Key-Value data file format for Hadoop?

I'm looking for a portable key-value data file format that can serve as an input and output format for Hadoop and is also readable and writable apart from Hadoop, directly in C++, Java, and Python. One catch... I need support for processing with non-Java mappers and reducers (specifically C++ via Hadoop Pipes).
Any ideas? Should I write my own portable Key-Value file format that interoperates with Hadoop and Hadoop Pipes? Would such a new format be useful to the community?
Long Version:
Hadoop Sequence Files (and their cousins Map, Set, Array, and BloomMap) seem to be the standard for efficient binary key-value data storage when working with Hadoop. One downside of Sequence Files is that they are readable and writable only in Java (they are specified in terms of serialized Java objects). I would like to build a complex multi-stage MapReduce pipeline where the inputs and outputs of the various stages must be readable and writable from C++, Java, and Python. Furthermore, I need to be able to write mappers and reducers in a language other than Java (i.e. C++) in order to use large and highly optimized C++ libraries in the mapping stage.
I've considered various workarounds, but none of them seem... attractive.
Convert: Add an extra conversion stage before and after each MapReduce stage to convert the stage's inputs and outputs between Sequence Files and a portable format compatible with other languages.
Problem: The data consumed and generated between stages is quite large (TB)... It is expensive to duplicate the data multiple times at each stage just to get read/write access from a different programming language. With 10 stages, this is too much overhead for me to pay for ($$$).
Avro File: Use Avro's portable data file format.
Problem: While there does seem to be code that allows the portable Avro data file to serve as an input or output format in MapReduce, it only works with mappers and reducers written in Java. I've seen several discussions about creating support for mappers in other languages via the avro/mapred/tether package, but only Java is currently supported. From the docs: "Currently only a Java framework has been implemented, for test purposes, so this feature is not yet useful."
http://avro.apache.org/docs/1.5.4/api/java/org/apache/avro/mapred/tether/package-summary.html
Avro File + SWIG: Use the Avro data format with a Java mapper that calls a custom SWIG-wrapped C++ library, accessed from the distributed cache, to do the real processing.
Problem: The immutability of Java strings makes writing SWIG wrappers painful and inefficient, because a copy is required. Also, this many layers of wrapping is starting to become a maintenance, debugging, and configuration nightmare!
I am considering writing my own language-portable key-value file format, based on the HFile format, that interoperates with Hadoop and Hadoop Pipes... Are there better off-the-shelf alternatives? Would such a portable format be useful to the community?
I think you've made a couple of mistaken assumptions:
One downside of Sequence Files is that they are readable and writable only in Java (they are specified in terms of serialized java objects)
That depends on what you mean by serialized Java objects. Hadoop uses the WritableSerialization class to provide the serialization mechanism, not the default Java serialization mechanism. You can configure Hadoop to use default Java serialization (JavaSerialization), or any custom implementation of your choice (through the io.serializations configuration property).
So if you use the Hadoop Writable mechanism, you just need to write a reader for C++ that can interpret Sequence Files, and then write C++/Python equivalents of the classes you wish to serialize (but this would be a pain to maintain, which leads to your second question, Avro).
Furthermore, I need to be able to write mappers and reducers in a language other than java (i.e. c++) in order to use large and highly optimized c++ libraries in the mapping stage
You can write mappers/reducers in Python/C++/whatever today using Hadoop Streaming, and use Sequence Files to store the intermediate formats. All streaming requires is that your mapper/reducer/combiner expects its input on stdin as key\tvalue pairs (you can customize the delimiter instead of tab) and writes its output in a similar, again customizable, format.
http://hadoop.apache.org/common/docs/current/streaming.html (I'm sure you've found this link, but just in case).
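For illustration, here is a minimal sketch of a word-count style streaming mapper in C++; it relies only on the stdin/stdout contract described above, and the whitespace tokenization is just an example:
#include <iostream>
#include <sstream>
#include <string>
int main() {
    std::string line;
    // Hadoop Streaming feeds each input record as a line on stdin.
    while (std::getline(std::cin, line)) {
        std::istringstream words(line);
        std::string word;
        // Emit key<TAB>value pairs on stdout; the framework sorts by key
        // before handing them to the reducer.
        while (words >> word)
            std::cout << word << '\t' << 1 << '\n';
    }
    return 0;
}
A reducer is the same kind of program: it reads the sorted key\tvalue lines from stdin and writes its own pairs to stdout.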
So what if you want to pass more complex key/value pairs to/from your streaming mapper/reducer? In that case I would say look into customizing the contrib/streaming source code, specifically the PipeMapper, PipeReducer and PipeMapRed classes. You could, for example, amend the inputs/outputs to be <Type-int/str, Length-int, Value-byte[]> tuples, and then amend your Python/C++ code to interpret them appropriately; a sketch of the C++ side follows below.
With these modifications, you could use Avro to manage the code around serialization between the Hadoop streaming framework (Java) and your C++/Python code. You might even be able to use the Avro data file format itself.
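Purely as an illustration of that suggestion, here is a sketch of the C++ side consuming such framed records from stdin; the <type, length, value> layout (two big-endian 32-bit integers followed by the payload) is hypothetical, not anything Hadoop Streaming defines:
#include <cstdint>
#include <iostream>
#include <vector>
// Read one big-endian uint32 from the stream; returns false at EOF.
static bool readUint32(std::istream& in, uint32_t& out) {
    unsigned char b[4];
    if (!in.read(reinterpret_cast<char*>(b), 4)) return false;
    out = (uint32_t(b[0]) << 24) | (uint32_t(b[1]) << 16) |
          (uint32_t(b[2]) << 8)  |  uint32_t(b[3]);
    return true;
}
int main() {
    uint32_t type = 0, length = 0;
    while (readUint32(std::cin, type) && readUint32(std::cin, length)) {
        std::vector<char> value(length);
        if (length > 0 && !std::cin.read(&value[0], length)) break;
        // Dispatch on 'type' here; the sketch just reports what it saw.
        std::cerr << "record type " << type << ", " << length << " bytes\n";
    }
    return 0;
}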
Finally - have you looked into the AvroAsTextInputFormat and AvroTextOutputFormat classes? They may be exactly what you are looking for (caveat: I've never used them).

How to parse a collection of c++ header files?

I am working on a project in which I want to do reflection in C++, and after some research I found that the best way is to parse the header files to get an abstract syntax tree in XML format and use that for the reflection. I have tried many tools, such as Coco, CINT, and GCC-XML, but none of them is compatible with Visual C++ 2008 or Visual C++ 2010. Please reply soon.
Visual Studio already parses all the code in your project (the IntelliSense feature). You can use the Visual C++ Code Model to access it.
Our C++ front end is capable of parsing many dialects of C++, including GNU and MS. It builds compiler data structures for ASTs and symbol tables with the kind of information needed to "do reflection" for C++. It is rather trivial to export the parse tree as an XML document. The symbol table information could be exported as XML by walking the symbol structure.
People always seem to want the AST and symbol table data in XML format, I guess under the assumption that they can read it into a DOM structure or manipulate it with XSLT. There are two serious flaws in this idea: 1) the sheer volume of the XML data is enormous, and generating/rereading it simply adds a lot of time, and 2) the assumption that having these structures available will make it "easy" to do ...something....
What we think people really want to do is to analyze the code, and/or transform the code (typically based on an analysis). That requires that the tool, whatever it is, provide access to the program structure in a way that makes it "easier" to analyze and, well, transform. For instance, if you decide to modify the AST, how will you regenerate the source text?
We have built the DMS Software Reengineering Toolkit to provide exactly this kind of general-purpose support for parsing, analyzing, transforming, and prettyprinting ("regenerating source"). DMS has front ends for a wide variety of languages (C++, C, Java, COBOL, Python, ...) and provides the set of standard services useful for building custom analyzers/transformations on code. At the risk of being bold: we have spent a long time thinking about implementing useful mechanisms to cover this set of tasks, in the same way that MS has spent a long time determining what should be in Windows. You can try to replicate this machinery yourself, but expect it to be a huge cost (we have been working on DMS for 15 years), or you can close your eyes and pretend you can hack together enough to do what you imagine you need to do (mostly what you'll discover is that it isn't enough in practice).
Because of this general need for "program manipulation services", our C++ front end is hosted on top of DMS.
DMS with the C++ front end has been used to build a variety of standard software engineering tools (test coverage, profilers) as well as to carry out massive changes to code (there's a paper at the website on how DMS was used to massively rearchitect aircraft mission software).
EDIT 7/8/2014: Our front end now handles full C++11 and parts of C++14, including control and data flow for functions/procedures/methods.

C++ code/XML generation tools

I'm not sure exactly what the right term is; it's kind of like ORM, but using XML as the data store. Are there any decent tools that will auto-generate C++ classes (including data and serialization/deserialization code) based on an XML schema? Or that will create XML-sync code and a schema based on a C++ class definition?
TinyXML is great, but it's so old-school to spend all that time writing code to load/save XML data to classes. I've seen similar tools focused on SOAP/WSDL, but they generated all kinds of other code on top of the basics.
Any good open-source libraries out there?
The only thing I've seen that attempts to do this is CodeSynthesis XSD.
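To give a feel for it, here is a rough sketch of the kind of code you end up writing with XSD's C++/Tree mapping. The schema is hypothetical (a root element settings containing repeated item elements with a name attribute), and the generated header, class, and accessor names below are derived by the tool from that schema, so treat them as assumptions rather than a fixed API:
#include <iostream>
#include <memory>
#include "settings.hxx"   // hypothetical header generated by: xsd cxx-tree settings.xsd
int main() {
    try {
        // The generated parsing function is named after the root element.
        std::auto_ptr<settings_t> s(settings("settings.xml"));
        for (settings_t::item_const_iterator i = s->item().begin();
             i != s->item().end(); ++i)
            std::cout << i->name() << std::endl;
    } catch (const xml_schema::exception& e) {
        std::cerr << e << std::endl;   // parsing/validation errors
        return 1;
    }
    return 0;
}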
If you are looking for an open source (and commercially licensed) tool to auto-generate C++ classes, including data and serialization/deserialization, based on an XML schema, then I strongly recommend gSOAP. It is easy to use, compliant with industry standards, and actively maintained.
See also http://www.rpbourret.com/xml/XMLDataBinding.htm
I was disappointed with many other C++ XML tools that promise full data bindings but fail to process more extensive sets of WSDLs and schemas such as ONVIF. Having to retool an entire project was a pain; I know that gSOAP will do the job. A winner IMHO.
Not open source, but won't XML Thunder work for you?

What Linux Full Text Indexing Tool Has A Good C++ API?

I'm looking to add full-text indexing to a Linux desktop application written in C++. I am thinking that the easiest way to do this would be to call an existing library or utility. This article reviews various open source utilities available for the GNOME and KDE desktops; Tracker (metatracker), Recoll, and Strigi are all written in C++, so they each seem reasonable. But I cannot find any notable documentation on how to use them as libraries or through an API. I could, instead, use something like CLucene or Xapian, which are generic full-text indexing libraries. They seem more straightforward, but if I used them I'd have to implement my own indexing daemon, an unappealing prospect.
Also, Xesam seems to be the latest thing; does anyone have any evidence that it works?
So, does anyone have experience using any of the applications or libraries? How did you use it and what documentation was useful?
I used CLucene, which you mentioned (and also Lucene.NET), and found it to be pretty good.
There's also Strigi which AFAIK works with Xesam and is the default used in KDE.
After looking around further, I found and worked with Recoll. I believe that it has the best C++ interface to a full-text search engine, in this case Xapian.
It is important to realize that CLucene and Xapian are both highly complex libraries designed primarily for multi-user server applications. Cutting them down to a level appropriate for a client system is not easy. If I remember correctly, Strigi has a complex, pure C interface, which isn't well adapted to this.
CLucene also doesn't seem to be very actively maintained at the moment, whereas Xapian is. But the real point is the existence of Recoll, which allows you to index particular files without the massive, massive setup that raw Xapian or CLucene requires - creating your own "stemming" set is not normally desirable, etc.
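To give a sense of what "raw Xapian" involves compared to letting Recoll drive it, here is a hedged sketch of indexing and querying a single document with the Xapian C++ API directly (the database path and text are placeholders):
#include <xapian.h>
#include <iostream>
#include <string>
int main() {
    // Indexing: open (or create) a database and add one document.
    Xapian::WritableDatabase db("./index", Xapian::DB_CREATE_OR_OPEN);
    Xapian::TermGenerator indexer;
    indexer.set_stemmer(Xapian::Stem("en"));
    Xapian::Document doc;
    doc.set_data("first document");             // payload returned with results
    indexer.set_document(doc);
    indexer.index_text("the quick brown fox");  // body text to index
    db.add_document(doc);
    db.commit();
    // Searching: parse a query and fetch the top ten matches.
    Xapian::Database rdb("./index");
    Xapian::Enquire enquire(rdb);
    Xapian::QueryParser parser;
    parser.set_stemmer(Xapian::Stem("en"));
    parser.set_database(rdb);
    enquire.set_query(parser.parse_query("fox"));
    Xapian::MSet matches = enquire.get_mset(0, 10);
    for (Xapian::MSetIterator it = matches.begin(); it != matches.end(); ++it)
        std::cout << it.get_document().get_data() << std::endl;
    return 0;
}
And this is just the happy path: a real desktop indexer still needs file-type extraction, incremental updates, and a daemon around it, which is exactly the work Recoll already does.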

Best XML serialization library for a MFC C++ app

I have an application written in C++ using the MFC and Stingray libraries. The application works with a wide variety of large data types, which are all currently serialized based on the MFC Document/View Serialize-derived functionality. I have also added an option for XML serialization based on the Stingray libraries, which implement DOM via the Microsoft XML SDK. While easy to implement, the performance is terrible, to the extent that it is unusable on anything other than very small documents.
What other XML serialization tools would you folks recommend for this scenario? I don't want DOM, as it seems to be a memory hog, and I'm already dealing with large in-memory data. Ideally, I'd like a streaming parser that is fast and easy to use with MFC. My current front runner is Expat, which is fast and simple but would require a lot of class-by-class serialization code to be added. Are there any other efficient and easier-to-implement alternatives out there that people would recommend?
The Boost Serialization library supports XML. The library essentially:
Starts from the principles of MFC serialization and keeps all the good things it provides.
Solves every single issue of MFC serialization!
Among the improvements compared to MFC is support for XML.
Note that you don't necessarily control the XML schema of this serialization. It uses its own schema.
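As a minimal sketch (assuming Boost.Serialization is installed; the Settings class and file name are made up), XML serialization looks like this - each member you want written is wrapped in a name-value pair that supplies its element name:
#include <fstream>
#include <string>
#include <vector>
#include <boost/archive/xml_oarchive.hpp>
#include <boost/archive/xml_iarchive.hpp>
#include <boost/serialization/access.hpp>
#include <boost/serialization/nvp.hpp>
#include <boost/serialization/string.hpp>
#include <boost/serialization/vector.hpp>
class Settings {
    friend class boost::serialization::access;
    std::string name;
    std::vector<int> values;
    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        // The NVP wrappers supply the XML element names.
        ar & BOOST_SERIALIZATION_NVP(name);
        ar & BOOST_SERIALIZATION_NVP(values);
    }
public:
    Settings() {}
    explicit Settings(const std::string& n) : name(n) { values.push_back(42); }
};
int main() {
    Settings s("example");
    {
        std::ofstream ofs("settings.xml");
        boost::archive::xml_oarchive oa(ofs);
        oa << BOOST_SERIALIZATION_NVP(s);   // write
    }
    Settings loaded;
    std::ifstream ifs("settings.xml");
    boost::archive::xml_iarchive ia(ifs);
    ia >> BOOST_SERIALIZATION_NVP(loaded);  // read back
    return 0;
}
The same serialize() member also drives the text and binary archives, which is part of what makes it feel like a drop-in improvement over MFC's Serialize().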
This is an age-old problem. I was the lead of the development team with the most critical-path dependencies on the largest software project in the world during 1999 and 2000, and this very issue was the focus of my work during that time. I am convinced that the wheel was invented by multiple engineers who were unaware that others had already invented it. The same is true of XML data binding in C++. I invented it too, and I've been perfecting it for over 10 years on various projects. I have a solution that addresses the issues noted here, plus some additional issues that repeatedly arise:
XML Updates. This is the ability to re-apply a subset of XML into an existing object model. In many cases the XML is bound to indexed objects and we cannot afford to re-index for each update.
COM and CORBA interface management. In the same way that XML data binding can be automated through object-oriented practices, so can the instances of the interface objects that provide that data to the application layer.
State Tracking. The application often needs to distinguish between an empty value and a missing value - both produce an empty string. This provides validation along with the data binding.
The source code uses the least restrictive license - less restrictive than the GPL. The project is supported and managed from here:
http://www.codeproject.com/KB/XML/XMLFoundation.aspx
Now that it's the year 2010, I believe that nobody else will attempt to reinvent the wheel, because there are a few to choose from. IMHO, this wheel is the most polished and well-rounded implementation available.
Enjoy.
A good solution would be libxml. It provides lightweight SAX parsing and data structures for XML processing. There are several DOM libraries which are built on top of libxml.
Unfortunately, it is a C library, but C++ wrappers are available.
A few years ago I switched from MSXML to libxml because of the performance issues you mentioned.
If you decide to use libxml, you should also take a look at libxslt.
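Since the question asks for a streaming parser specifically, here is a hedged sketch using libxml2's pull-style xmlReader interface (the file name is a placeholder); it visits nodes one at a time without building a DOM:
#include <libxml/xmlreader.h>
#include <libxml/parser.h>
#include <cstdio>
int main() {
    xmlTextReaderPtr reader = xmlReaderForFile("data.xml", NULL, 0);
    if (reader == NULL)
        return 1;
    int ret = xmlTextReaderRead(reader);            // advance to the next node
    while (ret == 1) {
        if (xmlTextReaderNodeType(reader) == XML_READER_TYPE_ELEMENT) {
            std::printf("element %s at depth %d\n",
                        (const char*)xmlTextReaderConstName(reader),
                        xmlTextReaderDepth(reader));
        }
        ret = xmlTextReaderRead(reader);
    }
    xmlFreeTextReader(reader);
    xmlCleanupParser();
    return ret == 0 ? 0 : 1;                        // -1 indicates a parse error
}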
We use Xerces-C++. It was easy to setup and performance is good enough so we don't need to think about changing. However we aren't XML heavy.
I listened to a podcast by Scott Hanselman (from Hanselminutes) where they discuss the XML performance of MSXML and XSLT.
What about RapidXML? I am using it in an MFC app, with some modifications to support UTF-16 with std::string. I am quite satisfied with it so far.
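For reference, a rough sketch of typical RapidXML usage (the XML content is made up); note that RapidXML parses in place, so it needs a writable, NUL-terminated buffer that outlives the document:
#include "rapidxml.hpp"
#include <cstdio>
#include <string>
#include <vector>
int main() {
    std::string xml = "<settings><item name=\"volume\" value=\"11\"/></settings>";
    std::vector<char> buffer(xml.begin(), xml.end());
    buffer.push_back('\0');                       // parser requires zero-termination
    rapidxml::xml_document<> doc;
    doc.parse<0>(&buffer[0]);                     // 0 = default parse flags
    rapidxml::xml_node<>* root = doc.first_node("settings");
    for (rapidxml::xml_node<>* item = root->first_node("item");
         item != 0; item = item->next_sibling("item")) {
        std::printf("%s = %s\n",
                    item->first_attribute("name")->value(),
                    item->first_attribute("value")->value());
    }
    return 0;
}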
The gSOAP toolkit auto-serializes native C and C++ data to/from XML and supports the full XML schema specification through XML data bindings:
gSOAP SourceForge Project
It has evolved since 1999 into a significant code base with code generation tools and libraries. It supports many data binding and customization features, which is especially critical for mapping XML schema types to/from C and C++ types. It can serialize any C/C++ type, as well as STL containers, container templates, and cyclic data structures. It has been used in the W3C Schema Patterns for Databinding working group (with 100% schema pattern coverage success for years). There is an active open source user base, and the gSOAP toolkit has been used in many industrial projects and by Fortune 100 companies to develop SOAP/XML infrastructures.
This is late in the game; I just want to mention that we also use libxml. It's robust and reliable, and it has worked well. It is a little bit too low-level, though; you'll want to build some wrappers on top of its functions.
For instance, you'll get a different sequence of function returns depending on whether you have this:
<tag attribute="value"/>
or this:
<tag attribute="value"> </tag>
Sometimes you may want that, sometimes you don't care.
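Here is a hedged sketch of that difference using libxml2's SAX interface: the self-closing form fires only the start/end element callbacks, while the open/close form with whitespace in between also triggers a character-data callback (whitespace can arrive via either characters or ignorableWhitespace, so the sketch registers both):
#include <libxml/parser.h>
#include <cstdio>
#include <cstring>
static void onStart(void*, const xmlChar* name, const xmlChar**) {
    std::printf("start: %s\n", (const char*)name);
}
static void onEnd(void*, const xmlChar* name) {
    std::printf("end:   %s\n", (const char*)name);
}
static void onChars(void*, const xmlChar*, int len) {
    std::printf("character data (%d bytes)\n", len);
}
int main() {
    xmlSAXHandler handler;
    std::memset(&handler, 0, sizeof(handler));
    handler.startElement = onStart;
    handler.endElement = onEnd;
    handler.characters = onChars;
    handler.ignorableWhitespace = onChars;
    const char* selfClosing = "<tag attribute=\"value\"/>";
    const char* openClose = "<tag attribute=\"value\"> </tag>";
    xmlSAXUserParseMemory(&handler, NULL, selfClosing, (int)std::strlen(selfClosing));
    xmlSAXUserParseMemory(&handler, NULL, openClose, (int)std::strlen(openClose));
    xmlCleanupParser();
    return 0;
}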
We use TinyXML for all our XML needs be it MFC or straight C++.
http://sourceforge.net/projects/tinyxml
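For completeness, a minimal sketch with the classic TinyXML API (file, element, and attribute names are placeholders), writing a small document and reading it back:
#include "tinyxml.h"
#include <cstdio>
int main() {
    // Write a small document.
    TiXmlDocument out;
    out.LinkEndChild(new TiXmlDeclaration("1.0", "UTF-8", ""));
    TiXmlElement* root = new TiXmlElement("settings");
    out.LinkEndChild(root);                      // the document takes ownership
    TiXmlElement* item = new TiXmlElement("item");
    item->SetAttribute("name", "volume");
    item->SetAttribute("value", 11);
    root->LinkEndChild(item);
    out.SaveFile("settings.xml");
    // Read it back.
    TiXmlDocument in("settings.xml");
    if (in.LoadFile()) {
        for (TiXmlElement* e = in.RootElement()->FirstChildElement("item");
             e != 0; e = e->NextSiblingElement("item")) {
            std::printf("%s = %s\n", e->Attribute("name"), e->Attribute("value"));
        }
    }
    return 0;
}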