ML.NET: type inference for CSV loading?

The bare minimum requirement of an ML library is that it be capable of inferring the types of the extraordinarily large number of fields in real world ML applications (for example: 2,000).
Real-world ML applications live in a pipeline. Literally: a pipeline as in UNIX/Linux-style FIFOs and named pipes connected by the pipe symbol, not an abstraction called a "pipeline" inside a document written in a third-party language and compiled. These pipelines are generically (as in general, not templated) typed, and all tools associated with UNIX/Linux pipelines infer types at runtime.
These tools allow for dynamic generation and expansion of CSV fields and types to arbitrary widths, far beyond what could be annotated by hand in a single file.
So, once again, the bare minimum requirement of an ML library is that it be able to open a CSV file without dropping a timeline of work into the ML engineer's lap, when he could roll out an entire system using GNU tools + Python in the same amount of time.
This means inferring the types of the extraordinarily large number of fields in a potentially dynamically generated and rapidly changing CSV file. Ideally, the same binary console app can be used for the CSV data at various stages of an evolving or developing pipeline such that annotating the field types and recompiling is unnecessary.
I am reviewing the ML.NET data-IO system, in addition to F#'s CsvProvider and the available CSV libraries for C#. I am also reviewing C++/CLI interop, since I could construct a C++ CSV inference system, but the C++/CLI Visual Studio templates appear to work only on Windows.
It seems there is no capability to load a CSV with the basic types inferred (datetime, double, int, string). Is that an accurate assessment?
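For concreteness, the kind of per-field inference I have in mind is small enough to sketch directly in C++. This is a stand-alone, hypothetical sketch, not ML.NET code: it samples a fixed number of rows, does a naive comma split with no quote handling, and recognizes only an ISO-style date.

#include <cctype>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical four-type lattice; unknown or mixed columns widen to String.
enum class FieldType { Int, Double, DateTime, String };

static bool isInt(const std::string& s) {
    if (s.empty()) return false;
    char* end = nullptr;
    std::strtol(s.c_str(), &end, 10);
    return end != nullptr && *end == '\0';
}

static bool isDouble(const std::string& s) {
    if (s.empty()) return false;
    char* end = nullptr;
    std::strtod(s.c_str(), &end);
    return end != nullptr && *end == '\0';
}

// Very rough ISO-8601 "YYYY-MM-DD..." check; a real system would try more formats.
static bool isDateTime(const std::string& s) {
    return s.size() >= 10 && std::isdigit(static_cast<unsigned char>(s[0])) &&
           s[4] == '-' && s[7] == '-';
}

static FieldType classify(const std::string& s) {
    if (isInt(s)) return FieldType::Int;
    if (isDouble(s)) return FieldType::Double;
    if (isDateTime(s)) return FieldType::DateTime;
    return FieldType::String;
}

// Merge the type inferred so far with the type of a newly seen sample.
static FieldType widen(FieldType a, FieldType b) {
    if (a == b) return a;
    if ((a == FieldType::Int && b == FieldType::Double) ||
        (a == FieldType::Double && b == FieldType::Int))
        return FieldType::Double;
    return FieldType::String;
}

int main(int argc, char** argv) {
    std::ifstream in(argc > 1 ? argv[1] : "data.csv");    // placeholder path
    std::string line;
    if (!std::getline(in, line)) return 1;                // header row

    std::vector<std::string> names;
    std::stringstream header(line);
    for (std::string cell; std::getline(header, cell, ','); )
        names.push_back(cell);                            // naive split: no quoted commas

    std::vector<FieldType> types(names.size(), FieldType::Int);
    bool seen = false;
    for (int row = 0; row < 1000 && std::getline(in, line); ++row) {   // sample up to 1000 rows
        std::stringstream ss(line);
        std::string cell;
        for (size_t col = 0; col < names.size() && std::getline(ss, cell, ','); ++col)
            types[col] = seen ? widen(types[col], classify(cell)) : classify(cell);
        seen = true;
    }

    static const char* labels[] = { "int", "double", "datetime", "string" };
    for (size_t i = 0; i < names.size(); ++i)
        std::cout << names[i] << ": " << labels[static_cast<int>(types[i])] << "\n";
    return 0;
}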

Check out the new DataFrame API. It has a LoadCsv method that will infer column types, and it is compatible with ML.NET's IDataView.

Related

What library exists for the construction of AFP documents in different languages?

I am currently looking for a library in any language that allows me to create structured AFP documents, but so far I have not found any.
Previously I tried a Java library called afp.lib; it structures the document, but it loses bytes, which corrupts the document.
I am hoping for a pointer to any language that allows constructing AFP without loss of bytes, or even just a library that can do it.
You can try https://github.com/yan74/afplib, which is a Java library for reading and writing AFP. It is, however, very low-level, so you get fine-grained access to all structured fields, triplets, and such. You need detailed knowledge of MO:DCA in order to make use of it. If you want to create documents, a composer is better suited: Apache FOP

Using Multiple GPUs with C++ CNTK

I'm trying to gradually move over from BrainScript to the C++ interface for CNTK. The complete lack of documentation doesn't help. My latest project is multi-GPU training. There's an example for single-GPU training. What is the best strategy for multi-GPU training? Is there a C++ equivalent of the Python data_parallel_distributed_learner (or other parallelisation methods), or do you have to code it yourself at the low level (data selection, model parameter combination, etc.)? How does this work with MPI? Are threads/OpenMP an option, as with evaluation (in which case, how do you select the GPU and combine the distributed models)?
The Python APIs mostly follow the C++ APIs, so if you understand how to train on multiple GPUs with Python, the C++ version is a straight translation from Python. For distributed training you will need CreateDataParallelDistributedLearner, you will need to specify the number of workers when reading from the MinibatchSource, and you must make sure every worker reads a different part of the data, which can be done with the numberOfWorkers and workerRank arguments of GetNextMinibatch. As with Python, you will need an MPI implementation and to invoke your C++ program with mpirun.
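For orientation, here is a rough C++ sketch of how those pieces fit together, assuming the API declared in CNTKLibrary.h. The data file train.ctf, the stream names, the model, and the dimensions are placeholders, and the exact signatures (in particular GetNextMinibatch and TextFormatMinibatchSource) should be checked against the headers.

#include "CNTKLibrary.h"
#include <vector>

using namespace CNTK;

int main()
{
    const size_t inputDim = 784, numClasses = 10, minibatchSize = 64;

    // One process (and one GPU) per MPI rank.
    DistributedCommunicatorPtr communicator = MPICommunicator();
    size_t workerRank = communicator->CurrentWorker().m_globalRank;
    size_t numWorkers = communicator->Workers().size();
    auto device = DeviceDescriptor::GPUDevice((unsigned int)workerRank);

    // A trivial one-layer model; substitute your real network.
    auto features = InputVariable({ inputDim }, DataType::Float, L"features");
    auto labels   = InputVariable({ numClasses }, DataType::Float, L"labels");
    auto W = Parameter({ numClasses, inputDim }, DataType::Float, GlorotUniformInitializer(), device);
    auto b = Parameter({ numClasses }, DataType::Float, 0.0, device);
    auto z = Plus(Times(W, features), b);
    auto loss  = CrossEntropyWithSoftmax(z, labels);
    auto error = ClassificationError(z, labels);

    // Wrap an ordinary learner in the data-parallel distributed learner.
    auto localLearner = SGDLearner(z->Parameters(), LearningRatePerSampleSchedule(0.01));
    auto learner = CreateDataParallelDistributedLearner(communicator, localLearner,
                                                        /*distributeAfterSamples=*/0);
    auto trainer = CreateTrainer(z, loss, error, { learner });

    // CTF-format training data; one sweep over the data for brevity.
    std::vector<StreamConfiguration> streams = { StreamConfiguration(L"features", inputDim),
                                                 StreamConfiguration(L"labels", numClasses) };
    auto source = TextFormatMinibatchSource(L"train.ctf", streams, MinibatchSource::FullDataSweep);
    auto featureStream = source->StreamInfo(L"features");
    auto labelStream   = source->StreamInfo(L"labels");

    for (;;)
    {
        // numWorkers/workerRank make each rank read a different slice of the data.
        const auto& mb = source->GetNextMinibatch(0, minibatchSize, numWorkers, workerRank, device);
        if (mb.empty())
            break;
        trainer->TrainMinibatch({ { features, mb.at(featureStream) },
                                  { labels,   mb.at(labelStream) } }, device);
    }
    return 0;
}

Launched with something like mpiexec -n 4 ./trainer (the binary name is a placeholder), each rank then trains on its own GPU while the distributed learner aggregates gradients across workers.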

How to best expose an OCaml library to other languages?

There are various exchange languages, JSON, etc., that provide the ability to quickly and reliably export and parse data in a common format. This is a boon between languages, and for it there is Piqi, which basically generates parsable exchange formats for any type that you define; it automates the process of writing boilerplate code (functions that read in some exchange data and build up an instance of some arbitrary type). Basically, the best option to date is Protocol Buffers, and I absolutely want, if I go down the route of ocaml-rpc, to use Protocol Buffers.
It would be nice if there were some declarative pattern to manage function exposure, so that the OCaml library could be reached over some medium (like RPC, or by mapping a function to a URL with an encoding for arguments).
Imagine offering a library as a service, where you don't want to, or can't, write actual bindings between every single pair of languages. But servers and data parsing have already been written... so wouldn't there be some way to integrate the two and just specify which functions should be exposed, and where/how?
Lastly, it appears to me that Protocol Buffers are a mechanism by which you can encode/decode data quickly, but not a transport mechanism... is there some kind of OCaml RPC spec or OCaml RPC library? Aren't there various RPC protocols (and therefore, if I point two languages using different protocols at one another, I get failure)? Additionally, the server mechanism that waits for and receives RPC calls is (possibly) yet another module(?)
How do I achieve this?
To update this: the latest efforts under the Piqi project are aimed at producing a working OCaml RPC service. The vision is that it will be easy to specify which functions to expose on the RPC service end, and that the client side will have some mechanized facility for selecting those exposed functions.
At the current time, this RPC system for OCaml facilitates inter-language exchange of data that can be reconstructed by parsers through the use of Protocol Buffers; it is under development and still being discussed here
I think the ocaml-rpc library suits your requirements. It can infer serialization functions and can also generate client and server code. The interesting part is that it uses OCaml as the IDL language. For example, this is a definition of an RPC function:
external rpc2 : ?opt:string -> variant -> unit = ""
From this, functorized server and client code will be inferred that takes care of transporting, marshaling, and demarshaling the data, so that you only need to work with pure OCaml data types.
The problem with this library is that it is barely documented, so you may find it hard to use.
Also, since I now know that you're working with BAP, I would like to bring to your attention the new BAP 1.x, which will be ready soon and will have bindings that allow calling it from any language, although currently we're mostly targeting Python.

Portable Key-Value data file format for Hadoop?

I'm looking for a portable Key-Value data file format that can serve as an input and output format for Hadoop and is also readable and writable outside of Hadoop, directly in C++, Java, and Python. One catch... I need support for processing with non-Java mappers and reducers (specifically C++ via Hadoop Pipes).
Any ideas? Should I write my own portable Key-Value file format that interoperates with Hadoop and Hadoop Pipes? Would such a new format be useful to the community?
Long Version:
Hadoop Sequence Files (and their cousins Map, Set, Array, and BloomMap) seem to be the standard for efficient binary key-value data storage when working with Hadoop. One downside of Sequence Files is that they are readable and writable only in Java (they are specified in terms of serialized Java objects). I would like to build a complex multi-stage MapReduce pipeline where the input and output of various stages must be readable and writable from C++, Java, and Python. Furthermore, I need to be able to write mappers and reducers in a language other than Java (i.e. C++) in order to use large and highly optimized C++ libraries in the mapping stage.
I've considered various workarounds, but none of them seem... attractive.
Convert: Add an extra conversion stage before and after each MapReduce stage to convert the stage's inputs and outputs between Sequence Files and a portable format compatible with the other languages.
Problem: The data consumed and generated between stages is quite large (TB)... It is expensive to duplicate the data multiple times at each stage just to get read/write access in a different programming language. There are 10 stages; this is too much overhead for me to pay for ($$$).
Avro File: Use Avro's portable data file format.
Problem: While there does seem to be code to allow the portable Avro data file to serve as an input or output format in MapReduce, it only works with mappers and reducers written in Java. I've seen several discussions about adding support for mappers in other languages via the avro/mapred/tether package, but only Java is currently supported. From the docs: "Currently only a Java framework has been implemented, for test purposes, so this feature is not yet useful."
http://avro.apache.org/docs/1.5.4/api/java/org/apache/avro/mapred/tether/package-summary.html
Avro File + SWIG: Use the Avro data format with a Java mapper that calls a custom SWIG-wrapped C++ library, accessed from the distributed cache, to do the real processing.
Problem: The immutability of Java strings makes writing SWIG wrappers a pain and inefficient, because a copy is required. Also, this many layers of wrapping is starting to become a maintenance, debugging, and configuration nightmare!
I am considering writing my own language-portable Key-Value file format, based on the HFile format, that interoperates with Hadoop and Hadoop Pipes... Are there better off-the-shelf alternatives? Would such a portable format be useful to the community?
I think you've made a couple of mistaken assumptions:
One downside of Sequence Files is that they are readable and writable only in Java (they are specified in terms of serialized java objects)
It depends on what you mean by serialized Java objects. Hadoop uses the WritableSerialization class to provide the mechanism for serialization, rather than the default Java serialization mechanism. You can configure Hadoop to use default Java serialization (JavaSerialization), or any custom implementation of your choice (through the io.serializations configuration property).
So if you use the Hadoop Writable mechanism, you just need to write a reader for C++ that can interpret sequence files, and then write C++/Python equivalents of the classes you wish to serialize (but this would be a pain to maintain, and it leads to your second question, and to Avro).
Furthermore, I need to be able to write mappers and reducers in a language other than java (i.e. c++) in order to use large and highly optimized c++ libraries in the mapping stage
You can currently write mappers/reducers in Python/C++/whatever using Hadoop Streaming, and use Sequence Files to store the intermediate data. All Streaming requires is that your mapper/reducer/combiner expect its input on stdin as key\tvalue pairs (you can customize the delimiter instead of tab) and write its output in the same format (again customizable).
http://hadoop.apache.org/common/docs/current/streaming.html (I'm sure you've found this link, but just in case).
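To make the stdin/stdout contract concrete, a minimal (hypothetical) word-count mapper in C++ is nothing more than a program that reads lines and writes key\tvalue lines:

// streaming_mapper.cpp: emits "word<TAB>1" for every whitespace-separated token on stdin.
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::ios::sync_with_stdio(false);
    std::string line;
    while (std::getline(std::cin, line)) {
        std::istringstream tokens(line);
        std::string word;
        while (tokens >> word)
            std::cout << word << '\t' << 1 << '\n';   // key \t value
    }
    return 0;
}

A reducer is the same kind of program: it reads the sorted key\tvalue lines on stdin, aggregates runs of identical keys, and prints the totals. Both are launched with something like hadoop jar hadoop-streaming.jar -input in/ -output out/ -mapper ./streaming_mapper -reducer ./streaming_reducer -file streaming_mapper -file streaming_reducer (paths and binary names here are placeholders).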
So what if you want to pass more complex key/value pairs to/from your streaming mapper/reducer? In this case I would say look into customizing the contrib/streaming source code, specifically the PipeMapper, PipeReducer, and PipeMapRed classes. You could, for example, amend the inputs/outputs to be <Type-int/str, Length-int, Value-byte[]> tuples, and then amend your Python/C++ code to interpret them appropriately, as sketched below.
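For illustration, the consumer side of such a framing might look like this in C++. The tag values, the 4-byte big-endian length, and a binary-clean stdin are all assumptions of this sketch, not anything Hadoop Streaming defines.

// Hypothetical reader for <Type, Length, Value> records on stdin:
// 1-byte type tag, 4-byte big-endian length, then `length` bytes of payload.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Record {
    uint8_t type;              // e.g. 0 = int, 1 = str, ... (your own convention)
    std::vector<char> value;   // raw payload bytes
};

static bool readRecord(std::istream& in, Record& rec) {
    char tag;
    if (!in.get(tag)) return false;                       // clean EOF between records
    unsigned char len[4];
    if (!in.read(reinterpret_cast<char*>(len), 4)) return false;
    uint32_t n = (uint32_t(len[0]) << 24) | (uint32_t(len[1]) << 16) |
                 (uint32_t(len[2]) << 8)  |  uint32_t(len[3]);
    rec.type = static_cast<uint8_t>(tag);
    rec.value.resize(n);
    return n == 0 || static_cast<bool>(in.read(rec.value.data(), n));
}

int main() {
    Record rec;
    size_t count = 0;
    while (readRecord(std::cin, rec)) ++count;            // process rec here instead
    std::cerr << "read " << count << " records\n";
    return 0;
}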
With these modifications, you could use Avro to manage the serialization code between the Hadoop Streaming framework (Java) and your C++/Python code. You might even be able to use Avro directly.
Finally, have you looked into the AvroAsTextInputFormat and AvroTextOutputFormat classes? They may be exactly what you are looking for (caveat: I've never used them).

How to parse a collection of C++ header files?

I am working on a project and I want to do reflection in C++, so after some research I found that the best way is to parse the header files to get an abstract syntax tree in XML format and use it for reflection. I tried many tools, such as Coco, CINT, and GCC-XML, but none of them are compatible with Visual C++ 2008 or Visual C++ 2010. Please reply soon.
Visual Studio already parses all code in your project (IntelliSense feature). You can use Visual C++ Code Model for access.
Our C++ front end is capable of parsing many dialects of C++, including GNU and MS. It builds compiler data structures for ASTs and symbol tables with the kind of information needed to "do reflection" for C++. It is rather trivial to export the parse tree as an XML document. The symbol table information could be exported as XML by walking the symbol structure.
People always seem to want the AST and symbol table data in XML format, I guess under the assumption that they can read it into a DOM structure or manipulate it with XSLT. There are two serious flaws in this idea: 1) the sheer volume of the XML data is enormous, and generating/rereading it simply adds a lot of time; 2) the assumption that having these structures available will make it "easy" to do... something.
What we think people really want to do is to analyze the code, and/or transform the code (typically based on an analysis). That requires that the tool, whatever it is, provide access to the program structure in a way that makes it "easier" to analyze and, well, transform. For instance, if you decide to modify the AST, how will you regenerate the source text?
We have built the DMS Software Reengineering Toolkit to provide exactly this kind of general-purpose support for parsing, analyzing, transforming, and prettyprinting ("regenerating source"). DMS has front ends for a wide variety of languages (C++, C, Java, COBOL, Python, ...) and provides the set of standard services useful for building custom analyzers/transformations on code. At the risk of being bold: we have spent a long time thinking about implementing useful mechanisms to cover this set of tasks, in the same way that MS has spent a long time determining what should be in Windows. You can try to replicate this mechanism, but expect it to be a huge cost (we have been working on DMS for 15 years), or you can close your eyes and pretend you can hack together enough to do what you imagine you need to do (mostly what you'll discover is that it isn't enough in practice).
Because of this general need for "program manipulation services", our C++ front end is hosted on top of DMS.
DMS with the C++ front end has been used to build a variety of standard software engineering tools (test coverage, profilers) as well as to carry out massive changes to code (there's a paper at the website on how DMS was used to massively rearchitect aircraft mission software).
EDIT 7/8/2014: Our Front end now handles full C++11, and parts of C++14, including control and dataflow for functions/procedures/methods.