How do I write a google cloud dataflow transform mapping? - google-cloud-platform

I'm upgrading a Google Cloud Dataflow job from Dataflow Java SDK 1.8 to version 2.4 and then trying to update the existing Dataflow job on Google Cloud using the --update and --transformNameMapping arguments, but I can't figure out how to write the transformNameMapping properly so that the upgrade succeeds and passes the compatibility check.
My code fails at the compatibility check with the error:
Workflow failed. Causes: The new job is not compatible with 2018-04-06_13_48_04-12999941762965935736. The original job has not been aborted., The new job is missing steps BigQueryIO.writeTableRows/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey, PubsubIO.readStrings. If these steps have been renamed or deleted, please specify them with the update command.
The dataflow transform names for the existing, currently running job are:
PubsubIO.Read
ParDo(ExtractJsonPath) - A custom function we wrote
ParDo(AddMetadata) - Another custom function we wrote
BigQueryIO.Write
In my new code that uses the 2.4 SDK, I've changed the 1st and 4th transforms/functions because some libraries were renamed and some of the old SDK's functions were deprecated in the new version.
You can see the specific transform code below:
The 1.8 SDK version:
PCollection<String> streamData =
    pipeline
        .apply(PubsubIO.Read
            .timestampLabel(PUBSUB_TIMESTAMP_LABEL_KEY)
            //.subscription(options.getPubsubSubscription())
            .topic(options.getPubsubTopic()));
streamData
    .apply(ParDo.of(new ExtractJsonPathFn(pathInfos)))
    .apply(ParDo.of(new AddMetadataFn()))
    .apply(BigQueryIO.Write
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .to(tableRef));
The 2.4 SDK version I rewrote:
PCollection<String> streamData =
    pipeline
        .apply("PubsubIO.readStrings", PubsubIO.readStrings()
            .withTimestampAttribute(PUBSUB_TIMESTAMP_LABEL_KEY)
            //.subscription(options.getPubsubSubscription())
            .fromTopic(options.getPubsubTopic()));
streamData
    .apply(ParDo.of(new ExtractJsonPathFn(pathInfos)))
    .apply(ParDo.of(new AddMetadataFn()))
    .apply("BigQueryIO.writeTableRows", BigQueryIO.writeTableRows()
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .to(tableRef));
So it seems to me like PubsubIO.Read should map to PubsubIO.readStrings and BigQueryIO.Write should map to BigQueryIO.writeTableRows. But I could be misunderstanding how this works.
I've been trying a wide variety of things. I tried to give the two transforms that I'm failing to remap defined names, as they formerly were not explicitly named, so I updated my applies to .apply("PubsubIO.readStrings" and .apply("BigQueryIO.writeTableRows" and then set my transformNameMapping argument to:
--transformNameMapping={\"BigQueryIO.Write\":\"BigQueryIO.writeTableRows\",\"PubsubIO.Read\":\"PubsubIO.readStrings\"}
or
--transformNameMapping={\"BigQueryIO.Write/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey\":\"BigQueryIO.writeTableRows/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey\",\"PubsubIO.Read\":\"PubsubIO.readStrings\"}
or even trying to remap all the internal transforms inside the composite transform
--transformNameMapping={\"BigQueryIO.Write/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey\":\"BigQueryIO.writeTableRows/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey\",\"BigQueryIO.Write/BigQueryIO.StreamWithDeDup/Reshuffle\":\"BigQueryIO.writeTableRows/BigQueryIO.StreamWithDeDup/Reshuffle\",\"BigQueryIO.Write/BigQueryIO.StreamWithDeDup\":\"BigQueryIO.writeTableRows/BigQueryIO.StreamWithDeDup\",\"BigQueryIO.Write\":\"BigQueryIO.writeTableRows\",\"PubsubIO.Read\":\"PubsubIO.readStrings\"}
but I seem to get the same exact error no matter what:
The new job is missing steps BigQueryIO.writeTableRows/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey, PubsubIO.readStrings.
Am I doing something seriously wrong? Is anybody who's written a transform mapping before willing to share the format they used? I can't find any examples online at all besides the main Google documentation on updating Dataflow jobs, which only covers the simplest case, --transformNameMapping={"oldTransform1":"newTransform1","oldTransform2":"newTransform2",...}, and doesn't make the example very concrete.
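For reference, the full set of pipeline arguments I'm passing looks roughly like this (project and job name are placeholders, not my real values; the mapping shown is just my first attempt from above):
--runner=DataflowRunner --project=my-project --streaming=true --jobName=my-existing-job --update --transformNameMapping={\"PubsubIO.Read\":\"PubsubIO.readStrings\",\"BigQueryIO.Write\":\"BigQueryIO.writeTableRows\"}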

It turns out there was additional information in the logs on the Google Cloud web console's Dataflow job details page that I was missing. I needed to change the log level filter from Info to show all log levels, and then I found several step-fusion messages like the following (although there were far more):
2018-04-16 (13:56:28) Mapping original step BigQueryIO.Write/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey to write/StreamingInserts/StreamingWriteTables/Reshuffle/GroupByKey in the new graph.
2018-04-16 (13:56:28) Mapping original step PubsubIO.Read to PubsubIO.Read/PubsubUnboundedSource in the new graph.
Instead of trying to map PubsubIO.Read to PubsubIO.readStrings, I needed to map to the steps that I found mentioned in that additional logging. In this case I got past my errors by mapping PubsubIO.Read to PubsubIO.Read/PubsubUnboundedSource and BigQueryIO.Write/BigQueryIO.StreamWithDeDup to BigQueryIO.Write/StreamingInserts/StreamingWriteTables. So try mapping your old steps to the ones mentioned in the full logs just before the job failure message.
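For reference, the mapping that got me past the missing-step errors looked roughly like this (escaped the same way as my earlier attempts):
--transformNameMapping={\"PubsubIO.Read\":\"PubsubIO.Read/PubsubUnboundedSource\",\"BigQueryIO.Write/BigQueryIO.StreamWithDeDup\":\"BigQueryIO.Write/StreamingInserts/StreamingWriteTables\"}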
Unfortunately I'm now working through a failure of the compatibility check due to a change in the coder used between the old code and the new code, but my missing-step errors are solved.

Related

Cannot launch Vertex Hypertune Job (Google Cloud Platform)

I am trying to use Vertex Hypertune from the Google Cloud console. I filled in the forms to indicate my dataset, Python package, compute resources, and so on.
Everything seemed fine up until the point where I submitted the job and got an instant error (so it is probably not an issue with my code, because there is no way it has even been run):
Unable to parse `training_pipeline.training_task_inputs` into hyperparameter tuning task `inputs` defined in the file gs://google-cloud-aiplatform/schema/trainingjob/definition/hyperparameter_tuning_task_1.0.0.yaml
I am really confused about why I get this error as I launched a training job without hyperparameter tuning with the same arguments and it worked just fine.
Any help would be truly appreciated
Note: I used a tabular dataset that comes from a BigQuery table (loaded with the dataset functionality). Default parameters were chosen for this dataset.
I picked the tensorflow 1.15 pre-built container and added my python code in an archive .tar.gz (generated with python setup.py sdist).
I configured only one hyperparameter (learning rate, double, between 0.001 and 0.1 to maximize 'accuracy', as declared in hypertune) and picked the lightest standard machine (n1-standard-4).
EDIT: Following the comment from @Jofre, it works now, so it was probably caused by a temporary UI bug.

Updating job by mapping PubsubIO.Read to PubsubIO.Read/PubsubUnboundedSource gives 'Coder or type for step has changed' compatibility check failure?

I'm updating a currently running Google Cloud Dataflow job from the v1.8 Java Dataflow SDK to the v2.4 Java Dataflow SDK. As part of that process, per the release notes for the 1.x -> 2.x move (https://cloud.google.com/dataflow/release-notes/release-notes-java-2#changed_pubsubio_api), I'm changing the function PubsubIO.Read as used below:
PCollection<String> streamData =
    pipeline
        .apply(PubsubIO.Read
            .timestampLabel(PUBSUB_TIMESTAMP_LABEL_KEY)
            .topic(options.getPubsubTopic()));
to instead be PubsubIO.readStrings() as below:
PCollection<String> streamData =
    pipeline
        .apply(PubsubIO.readStrings()
            .withTimestampAttribute(PUBSUB_TIMESTAMP_LABEL_KEY)
            .fromTopic(options.getPubsubTopic()));
This then leads me to need to use the transform mapping command-line argument like so:
'--transformNameMapping={\"PubsubIO.Read\": \"PubsubIO.Read/PubsubUnboundedSource\"}'
But I get a compatibility check failure:
Workflow failed. Causes: The new job is not compatible with 2016-12-13_15_23_40-..... The original job has not been aborted., The Coder or type for step PubsubIO.Read/PubsubUnboundedSource has changed.
This confuses me a bit, as it seems like the old code was working with strings and the new code is still using strings. Can anyone help me understand what this error message is telling me? Is there perhaps a way for me to add a logging statement that will tell me which Coder I am using, so that I can run my tests with my old code and new code and see what the difference is?
I think that the problem is that you are trying to update an existing job. As the 2.x release introduced breaking changes, streaming jobs cannot be updated. There is a warning for users upgrading from 1.x at the top of that documentation page that reads:
Update Incompatibility: The Dataflow SDK 2.x for Java is update-incompatible with Dataflow 1.x. Streaming jobs using a Dataflow 1.x SDK cannot be updated to use a Dataflow 2.x SDK. Dataflow 2.x pipelines may only be updated across versions starting with SDK version 2.0.0.
Regarding the Coder changes there is some explanation on BEAM-1415:
There's no longer a way to read/write a generic type T. Instead, there's PubsubIO.{read,write}{Strings,Protos,PubsubMessages}. Strings and protos are a very common case so they have shorthands. For everything else, use PubsubMessage and parse it yourself. In case of read, you can read them with or without attributes. This gets rid of the ugly use of Coder for decoding a message's payload (forbidden by the style guide), and since PubsubMessage is easily encodable, again the style guide also dictates to use that explicitly as the input/return type of the transforms.
In your tests you can use CoderRegistry.getCoder(), as shown here.
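For example, a minimal sketch along these lines (assuming the Beam/Dataflow 2.x SDK; the class name is just illustrative) prints the coder that the default registry resolves for String, so you can compare what the old and new code end up using:
import org.apache.beam.sdk.coders.CannotProvideCoderException;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CoderRegistry;

public class CoderCheck {
    public static void main(String[] args) throws CannotProvideCoderException {
        // Default registry, i.e. what is used when no coder is set explicitly on a PCollection.
        CoderRegistry registry = CoderRegistry.createDefault();
        Coder<String> stringCoder = registry.getCoder(String.class);
        System.out.println("Coder resolved for String: " + stringCoder);
    }
}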

How to load data into CrowdFlower's job by using GATE's crowdsourcing plugin?

I am trying to create a job on CrowdFlower using the GATE crowdsourcing plugin. My problem is that I cannot load the data into the job at all. What I have done so far in creating the job is:
1. Create the job builder PR.
2. Right-click on the job builder and choose "create a new CrowdFlower job". The job appeared in my jobs list on CrowdFlower.
3. Populate a corpus with some documents, pre-processing them with some of ANNIE's applications, e.g. the tokenizer and sentence splitter.
4. Add the job builder to a corpus pipeline and edit some parameters so they match the initial annotations (tokens and sentences).
5. Run the pipeline. (Of course I make sure the Job IDs match.)
After I did all of that, the job still has 0 rows of data. I am wondering if I have done something wrong, because I am sure that I followed all the instructions in this tutorial, specifically pages 28 to 35. Any advice on this?
I bet you have a typo in one of the job builder runtime parameters :)
Double-check the names of the annotations and annotation sets, and make sure all of them exist in your documents. If they exist and the builder found them, a cf_..._id feature should appear on each entity annotation.
If the job builder found any annotations, it would call the CrowdFlower API and throw an exception if it failed to upload the data. It really sounds like it isn't sending any requests, and the only reason I can see for that is that it can't find the annotations.

Removing old jobs from Jenkins

I'd like to shelve old builds in all of my jobs, for example build numbers 1-10.
I'm wondering if there is a way to do that from the Jenkins UI using a single command.
First of all, in order to make changes to jobs in bulk, I would use something called the Configuration Slicer.
You can get that from here: https://wiki.jenkins-ci.org/display/JENKINS/Configuration+Slicing+Plugin
Also, do you want to delete your builds or archive them? In the case of deleting, I would use log rotation, either by date or by number of builds. In the configure section of the job, click on "Discard old builds" and you will see the options.
And finally, you can always use the ArtifactDeployer plugin and some of the examples from that plugin.
Link here: https://wiki.jenkins-ci.org/display/JENKINS/ArtifactDeployer+Plugin
Link on how to use the CLI in Jenkins: https://wiki.jenkins-ci.org/display/JENKINS/Jenkins+CLI
EDIT 1
In regard to the comments below where you are asking about "shelving jobs":
I think the term you are looking for here is "archive" rather than "shelve" - that is a very Visual Studio/TFS concept - so I am not personally aware of anything that does shelving per se.
In terms of Groovy scripts, I believe that you are now asking a different question, so it should be raised as a separate question - but as far as Groovy scripts go, you can use the following link as an intro:
http://groovy.codehaus.org/

Weka - Measuring testing time

I'm using Weka 3.6.8 to carry out some machine learning and I want to find the 'time taken to test model on training/testing data'. When I test a predictive model on evaluation data, this parameter seems to be missing. Has this feature been removed from Weka, or is it just a setting I'm missing? All I seem to be able to find is the time taken to build the actual predictive model. (I've also checked the Weka manual but can't find anything.)
Thanks in advance
That feature was added in 3.7.7, so you need to upgrade. You should be able to get this data by running the test on the command line with the -T parameter.
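For example, something along these lines (the weka.jar path and the .arff file names are placeholders, and J48 is just an example classifier) supplies a separate test set with -T, which per the above should include the testing-time figure in 3.7.7+:
java -cp weka.jar weka.classifiers.trees.J48 -t training.arff -T testing.arff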