What level of control is required for google cloud ml - google-cloud-platform

When using google cloud ML to train models:
The official examples https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/census/tensorflowcore/trainer/task.py uses hooks, is_client, MonitoredTrainingSession and some other complexity.
Is this required for cloud ml or is using this example enough: https://github.com/amygdala/tensorflow-workshop/tree/master/workshop_sections/wide_n_deep?
The documentation is a bit limited in terms of best practices and optimisation, will GCP ML handle the client/worker mode or do we need to set devices e.g. replica_device_setter and so on?

CloudML Engine is largely agnostic to how you write your TensorFlow programs. You provide a Python program, and the service executes it for you, providing it with some environment variables you can use to perform distributed training (if necessary), e.g., task index, etc.
census/tensorflowcore demonstrates how to do things with the "core" TensorFlow library -- how to do everything "from scratch", including using replica_device_setters, MonitoredTrainingSessions, etc.. This may be necessary sometimes for ultimate flexibility, but can be tedious.
Alongside the census/tensorflowcore example, you'll also see a sample called census/estimator. This example is based on a higher level library, which unfortunately is in contrib and therefore does not yet have a fully stable API (expect lots of deprecation warnings, etc.). Expect it to stabilize in a future version of TensorFlow.
That particularly library (known as Estimators) is a higher level API that takes care of a lot of the dirty work for you. It will parse TF_CONFIG for you and setup the replica_device_setter as well as handle the MonitoredTrainingSession and necessary Hooks, while remaining fairly customizable.
This is the same library that the wide and deep example you pointed to is based on and they are fully supported on the service.

Related

Google AI platform for inference, what is supported?

I have been reading over the docs for Google AI platform for inference, they do not seem (maybe I have missed it) to state what frameworks are supported.
Am I to assume that all frameworks are supported?
The training side supports custom dockers, which gives the feeling that any code will run as long as you can package it in a container - but on the prediction / inference product there is no such guidance.
Custom prediction routines is a feature already in place the lets you perform your predictions with any custom code, from any framework of your choice. Bear in mind that it is still in beta but you can already try it.

Existing API for NLP in C++?

Is/are there existing C++ NLP API(s) out there? The closest thing I have found is CLucene, a port of Lucene. However, it seems a bit obsolete and the documentation is far from complete.
Ideally, this/these API(s) would permit tokenization, stemming and PoS tagging.
Freeling is written in C++ too, although most people just use their binaries to run the tools: http://devel.cpl.upc.edu/freeling/downloads?order=time&desc=1
Try something like DyNet, it's a generic neural net framework but most of its processes are focusing on NLP because the maintainers are creators of the NLP community.
Or perhaps Marian-NMT, it was designed for sequence-to-sequence model machine translation but potentially many NLP tasks can be structured as a sequence-to-sequence task.
Outdated
Maybe you can try Ellogon http://www.ellogon.org/ , they have GUI support and also C/C++ API for NLP too.
if you remove the restriction on c++ , you get the perfect NLTK (python)
the remaining effort is then interfacing between python and c++.
Apache Lucy would get you part of the way there. It is under active development.
Maybe you can use Weka-C++. It's the very popular Weka library for machine learning and data mining (including NLP) ported from Java to C++.
Weka supports tokenization and stemming, you'll probably need to train a classifier for PoS tagging.
I only used Weka with Java though, so I'm afraid can't give you more details on this version.
There is TurboParser by André Martins at CMU, also has a Python wrapper. There is is an online demo for it.
This project provides free (even for commercial use) state-of-the-art information extraction tools. The current release includes tools for performing named entity extraction and binary relation detection as well as tools for training custom extractors and relation detectors.
MITIE is built on top of dlib, a high-performance machine-learning library, MITIE makes use of several state-of-the-art techniques including the use of distributional word embeddings and Structural Support Vector Machines[3]. MITIE offers several pre-trained models providing varying levels of support for both English and Spanish, trained using a variety of linguistic resources (e.g., CoNLL 2003, ACE, Wikipedia, Freebase, and Gigaword). The core MITIE software is written in C++, but bindings for several other software languages including Python, R, Java, C, and MATLAB allow a user to quickly integrate MITIE into his/her own applications.
https://github.com/mit-nlp/MITIE

what is the Open Cloud Computing Interface?

I am going to develop a cloud application and in my research for state of the art tools in Cloud Computing i saw some references to OCCI (Open Cloud Computing Interface).
I was not able to find out an answer to the following questions
1)Is it easy to use this Interface ?
2)What programming languages does this interface Supports ?
3)Is this Interface mature enough?
Any information are well appreciated!
This question has been asked quite some time ago but, hopefully, the answer is still relevant.
Is it easy to use?
Depends on what you want. If you want to make your own implementation, then probably not. If you use one of the existing implementations (see bellow), then yes.
What programming languages does this interface Support?
We know about two implementations (libraries, CLI), which are for Ruby and Java. See:
https://wiki.egi.eu/wiki/rOCCI:ROCCI
https://github.com/EGI-FCTF/jOCCI-api
rOCCI (the first one) also as a server side (the rOCCI-server) that translates OCCI to propriatary cloud management platforms such as OpenNebula.
Is this Interface mature enough?
Yes, given that it is being used by real-world infrastructures. Among them, e.g., the EGI Federated Cloud. That said, the current OCCI specification (1.1) has a few shortcomings that will be addressed in version 1.2 (due in Autumn 2015), so that if someone is just starting a project, it is worth implementing with 1.2 already.
Many of your questions can be answered (positively, by the way!) by visiting the OCCI-WG home site at http://occi-wg.org and/or searching on "occi implementation".
Another recent and useful resource is the tutorials and workshop talks given at the recent Cloud Interoperability Week held simultaneously with events in Madrid and Santa Clara, part of the Cloud Plugfest hands-on developer training series:
Or generally at http://www.cloudplugfest.org/
The basic specs are published by the Open Grid Forum.
The Open Cloud Computing Interface (OCCI) is a set of specifications delivered through the Open Grid Forum, for cloud computing service providers. OCCI has a set of implementations that act as proofs of concept. It builds upon World Wide Web fundamentals by using the Representational State Transfer (REST) approach for interacting with services.

Has anyone used dataflow programming in a real project with a mainstream language?

I am looking at using some Dataflow programming techniques in a clojure program but I am having difficulty in finding much information from projects using Java, C#, or other mainstream languages that have used such techniques in the real world. I would be grateful to hear if anyone has any expereinces they could share regarding this.
Here, we are! We've made... (quotation is from one of my older post):
We've designed and implemented a DF
server for our automation project
(dispatcher, component iterface, a
bunch of components, DF language, DF
compiler, UI). It is written in bare
C++, and runs on several Unix-like
systems (Linux x86, MIPS, avr32 etc.,
Mac OSX). It lacks several features,
e.g. sophisticated flow control,
complex thread control (there is only
a not too advanced component for it),
so it is just a prototype, even it
works. We're now working on a
full-featured server. We've learnt lot
during implementing and using the
prototype.
Also, we'll make a visual editor some
day.
There're dataflow systems wich don't even mention dataflow approach:
SynthEdit: http://www.synthedit.com/ - It's an audio related framework and component set for creating VST plugins
TinyOS: http://www.tinyos.net/ - It's an embedded operating system/framework
Digital synthetisers/samplers are dataflow systems, programmed - supposedly - in C or some parts in Assembly, check my answer to another post about some examples.
Quartz Composer, a graphic magic tool for Mac,
Blender has dataflow subsystem for image composing.
Writing a dataflow system is not rocket science. Here's my older post about the basics of dataflow framework.
The term dataflow is wide. There are realtime synchronous dataflow systems, like synthetisers and samplers, there are asynchronous ones, like our home aut. system (the system is in idle unless the user presses a button or a timer runs out), and there're even different architectures, like spreadsheets or make.
Wanna reading more about dataflow programming? Read J. Paul Morrison's site and book.
Pervasive DataRush is a framework for parallel dataflow programming for any JVM language, including Clojure.
Pervasive DataRush uses a dataflow architecture. The architecture implements a program that executes as a graph of computation nodes interconnected by dataflow queues. The nodes use the queues to share data. As the data is streaming, only data required by any active operation needs to be in memory at any given time, allowing very large data sets to be analyzed. Besides offering the potential for scaling to problems larger than available memory, dataflow graphs exploit multiple forms of parallelism.
Customers are using DataRush for big data analytics and data preparation (ETL).
We've made another one: a collaborative spreadsheet with MySQL/PHP backend and AJAX frontend. The software is in beta state, documentation is under construction.

Continuous build infrastructure recommendations for primarily C++; GreenHills Integrity

I need your recommendations for continuous build products for a large (1-2MLOC) software development project. Characteristics:
ClearCase revision control
Approx 80% C++; 15% Java; 5% script or low-level
Compiles for Green Hills Integrity OS, but also some windows and JVM chunks
Mostly an embedded system; also includes some UI pieces and some development support (simulation tools, config tools, etc...)
Each notional "version" of the deliverable includes deployment images for a number of boards, UI machines, etc... (~10 separate images; 5 distinct operating systems)
Need to maintain/track many simultaneous versions which, notably, are built for a variety of different board support packages
Build cycle time is a major issue on the project, need support for whatever features help address this (mostly need to manage a large farm of build machines, I guess..)
Operates in a secure environment (this is a gov't program) (Edited to add: This is a classified program; outsourcing the build infrastructure is a non-starter.)
Interested in any best practices or peripheral guidance you might offer. The build automation issues is one of several overlapping best practices that appear to be missing on the program, but try to keep your answers focused on build infrastructure piece and observations directly related.
Cost is not the driving concern. Scalability and ease of retrofitting onto an existing infrastructure are key.
(Edited to address #Dan's comment. ;-)
From my experience with similar systems, there are approximately two parts to this problem:
A repeatable method for checking out sources, building the software, and testing it (if you want to do continual testing as well as building), using a small number of command-line invocations.
A means of calling these command lines on various servers in the build farm.
For the latter, we've been using BuildBot, which seems to work pretty well.
For the former, we have a homegrown solution that started out as a simple bash shell script and grew ... rather substantially. From experience, I'd suggest starting out in python rather than bash -- you'll spend far more code in handling setup and configuration than in actually invoking programs. (Also, it's probably easier to run it on Windows if you're doing that.)
The things I've found to be really key in our script's usefulness are:
Ironclad repeatability. We have a standard set of build tools, and the scripts start out by scrubbing environment variables. There are very few command-line options; everything goes into configuration files, and those go in version control.
Logging. We produce a log of every command that the build script executes.
Configuration file inheritance. Each variant of our software gets a configuration file, and those files can include more-general settings (which include even-more-general settings).
Extensibility. When we add a new source component, it's pretty easy to add a set of instructions for building that component (and the instructions can be arbitrary bash code). The "can be arbitrary code" part is probably key here; no way is a pre-existing product going to be able to do all of the quirky things that you need for a large complex real-world system.
You can get started with a reasonably simple script and let it grow organically as the need arises; honestly, although ours is a bit messy, I think we got a much more usable result that way than we would have with heavy top-down design.
Cost isn't an object? I've worked for GreenHills, and they've solved these issues for their in-house build/test farms. Ask them to do the same for you.
When I see emphasis on things like scalability and security in a build system, I start thinking that you might be a candidate for the enterprise class build systems / CI systems. Conveniently, it sounds like you can afford them as well. A year old SD Times article provides a basic breakdown between the enterprise and team level build tools.
My company makes AnthillPro and we've worked with a number of companies on large embedded projects as well as highly secure projects. IBM is probably the largest other player in the space with BuildForge.
AnthillPro puts some extra emphasis on what you do with the images in the minutes/hours/days post build (do you install them onto simulators / hardware and run automated tests? stage them? promote them?) but we also see folks using it for just build.