Is there documentation available for Google Cloud Dataflow? - google-cloud-platform

Google Cloud Dataflow was announced in June 2014 (more information in this blog post), but I can't find any technical documentation in the developers section of the cloud.google.com website: https://cloud.google.com/developers/
Does anyone know where I can find more information or technical documentation about this product?
I'm particularly interested in how the topology works: is it static or dynamic? etc.

Google Cloud Dataflow is now in its Alpha stage, and the documentation is publicly available here: https://cloud.google.com/dataflow/. Follow the documentation link.
Please note that in Alpha, access to the managed service is invite-only. You can request access via the link above using the "Apply for Alpha" button.
The Cloud Dataflow SDK for Java has also been made public and open-sourced on GitHub here: https://github.com/GoogleCloudPlatform/DataflowJavaSDK. Please note that you can download the SDK and run your Dataflow programs locally without having to execute them on the managed service. Local pipeline execution is a great way to get a feel for the programming model, but be aware that local execution is not parallelized.
We are also moving support over to StackOverflow. Please use the tag: google-cloud-dataflow.
Cheers - Eric

Google Cloud Dataflow is currently in private beta. You can apply here. Documentation is provided upon approval.

Related

Working with google cloud storage in julia applications

I have a query related to Google Cloud Storage for a Julia application.
Currently, I am hosting a Julia application (Docker container) on GCP and would like to allow the app to use Cloud Storage buckets to write and read data.
I have explored a few packages which promise to do this:
GoogleCloud.jl
The docs for this package show a clear and concise implementation. However, adding this package results in incremental compilation warnings, with many of the packages failing to compile. I have opened an issue on their GitHub page: https://github.com/JuliaCloud/GoogleCloud.jl/issues/41
GCP.jl
The scope is limited; currently the only supported service is BigQuery.
Python package google
This is quite informative and operational, but calling into Python will take a toll on the code's performance. Do advise if this is the only viable option.
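For reference, the kind of usage I have in mind with the Python option is roughly the following (a minimal sketch assuming the google-cloud-storage client; the bucket and object names are just placeholders):

from google.cloud import storage

# Placeholder names for illustration only.
client = storage.Client()                    # uses Application Default Credentials
bucket = client.bucket("my-app-bucket")
blob = bucket.blob("results/output.csv")

blob.upload_from_string("col1,col2\n1,2\n")  # write data to the bucket
data = blob.download_as_text()               # read it back as a string
print(data)

From Julia I would presumably have to drive something like this through PyCall or a subprocess, which is where my performance concern comes from.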
I would like to know whether there are other methods that can be used to configure a Julia app to work with Google Cloud Storage.
Thanks, I look forward to the suggestions!
GCP.jl is promising, plus you may be able to do this with gRPC if Julia supports gRPC (see below).
Discovery
Google has 2 types of SDK (aka Client Library). API Client Libraries are available for all Google's APIs|services.
Cloud Client Libraries are newer and more language-idiomatic, but are only available for Cloud. Google Cloud Storage (GCS) is part of Cloud but, in this case, I think an API Client Library is worth pursuing...
Google's API (!) Client Libraries are auto-generated from a so-called Discovery document. Interestingly, GCP.jl specifically describes using Discovery to generate the BigQuery SDK and mentions that you can use the same mechanism for any other API Client Library (e.g. GCS).
NOTE Explanation of Google Discovery
I'm unfamiliar with Julia but, if you can understand enough of that repo to confirm that it's using the Discovery document to generate APIs and, if you can work out how to reconfigure it for GCS, this approach would provide you with a 100% fidelity SDK for Cloud Storage (and any other Google API|service).
Someone else tried to use the code to generate an SDK for Sheets and had an issue so it may not be perfect.
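To make the Discovery mechanism concrete, here is what the equivalent looks like in Python with google-api-python-client, which builds a Cloud Storage client from the same Discovery document at runtime (a minimal sketch; the project ID is a placeholder and credentials are assumed to come from Application Default Credentials):

from googleapiclient.discovery import build

# build() loads the Discovery document for the "storage" v1 API (bundled with the
# library or fetched from googleapis.com) and generates the client methods from it.
storage = build("storage", "v1")

buckets = storage.buckets().list(project="my-project-id").execute()
for bucket in buckets.get("items", []):
    print(bucket["name"])

GCP.jl's generator consumes the same Discovery document, so if it can be pointed at storage instead of bigquery you should end up with an equivalent surface in Julia.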
gRPC
Google publishes protobuf definitions for the subset of its services that support gRPC. If you'd prefer to use gRPC, it ought to be possible to use the Protobufs in Google's repo to define a gRPC client for Cloud Storage.
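As a rough sketch only (assuming the googleapis repo has been cloned locally, that grpcio-tools is installed, and that the Storage protos live under google/storage/v2/; the exact path may differ), the stubs can be generated in Python, and the same .proto files could in principle feed a Julia gRPC code generator:

import pkg_resources
from grpc_tools import protoc

# Well-known types (google/protobuf/*.proto) bundled with grpcio-tools.
well_known = pkg_resources.resource_filename("grpc_tools", "_proto")

# Assumes: git clone https://github.com/googleapis/googleapis
protoc.main([
    "protoc",
    "--proto_path=googleapis",
    "--proto_path=" + well_known,
    "--python_out=.",
    "--grpc_python_out=.",
    "googleapis/google/storage/v2/storage.proto",
])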

Working link for Google Cloud pipeline components docs?

Does anyone have a working link for the docs for Google Cloud Pipeline Components? The link on the GitHub page under "ReadTheDocs page" is broken. I tried some other tutorial notebooks, such as this one, and the link under "The components are documented here." seems to be broken too.
Edit:
The link is up now.
Pipelines support KFP (Kubeflow Pipelines) and TFX (TensorFlow Extended) definitions. You can find the documentation here.
You can find useful resources here, especially this notebook.
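To illustrate what a KFP definition looks like, here is a minimal sketch using only the KFP v2 SDK (the component and pipeline names are made up; Google Cloud Pipeline Components would slot in as additional prebuilt ops alongside your own):

from kfp import compiler, dsl

@dsl.component
def say_hello(name: str) -> str:
    # A trivial lightweight Python component.
    return f"Hello, {name}!"

@dsl.pipeline(name="hello-pipeline")
def hello_pipeline(recipient: str = "Vertex AI"):
    say_hello(name=recipient)

# Compile to a pipeline spec that Vertex AI Pipelines can run.
compiler.Compiler().compile(
    pipeline_func=hello_pipeline,
    package_path="hello_pipeline.json",
)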

Leveraging AWS Neptune Gremlin Client Library

We're looking to leverage the Neptune Gremlin client library to get load balancing and automatic refreshes.
There is a blog article here: https://aws.amazon.com/blogs/database/load-balance-graph-queries-using-the-amazon-neptune-gremlin-client/
There is also a repo containing the code here:
https://github.com/awslabs/amazon-neptune-tools/tree/master/neptune-gremlin-client
However, the artifacts aren't published anywhere. Is it still possible to do this? Ideally, we would avoid vendoring the code into our codebase, since we would then forfeit updates.
The artifacts for several of the tools in that repo can be found here.
https://github.com/awslabs/amazon-neptune-tools/releases/tag/amazon-neptune-tools-1.2

Mapping dependencies/requirements for GCP APIs/services

Does anyone know a way to map the dependencies or requirements of any GCP API?
E.g. enabling container.googleapis.com automatically enables compute.googleapis.com and others; I'd like to see this mapped in a chart/table/text/anything.
The GCP docs don't specify any such dependency for any API (from what I have seen so far). So I'm looking for either a doc which specifies this, a gcloud command, or a completely different tool that can help map it.
We don't have any public external documentation around service dependencies for now, so please open a Feature Request; refer to this link.
Did you open a Feature Request as suggested? If so, can you share the link?
As a small consolation, you can have a look at this article, from which we can tell that the API interdependency information was once available through the serviceusage API.
There you'll find a dependency diagram as of October 2020.
One workaround could be to use the Service Usage API. The disable method has a disableDependentServices field which disables all services that depend on the services being disabled.
You could enable a bunch of services in GCP, disable a service, and observe which dependent services are also disabled.
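A minimal sketch of that workaround with google-api-python-client (the project ID and service name are placeholders, credentials are assumed to come from Application Default Credentials, and the caller needs permission to disable services):

from googleapiclient.discovery import build

serviceusage = build("serviceusage", "v1")
parent = "projects/my-project-id"  # placeholder project

def enabled_services():
    # List the services currently enabled on the project (handles pagination).
    services, request = set(), serviceusage.services().list(parent=parent, filter="state:ENABLED")
    while request is not None:
        response = request.execute()
        services.update(s["config"]["name"] for s in response.get("services", []))
        request = serviceusage.services().list_next(request, response)
    return services

before = enabled_services()

# Disable one service together with everything that depends on it.
# This returns a long-running operation; in practice you may need to wait
# for it to finish before re-listing.
serviceusage.services().disable(
    name=f"{parent}/services/container.googleapis.com",
    body={"disableDependentServices": True},
).execute()

after = enabled_services()
print("Also disabled as dependents:", sorted(before - after - {"container.googleapis.com"}))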
I did end up opening a feature request for this and the fact that I had to do so still boggles the mind.

Invalid arguments when creating new datalab instance

I am following the quickstart tutorial for datalab here, within the GCP console. When I try to run
datalab beta create-gpu datalab-instance-name
in step 3, I receive the following error:
write() argument must be str, not bytes
Can anyone help explain why this is the case and how to fix it?
Thanks
Referring to the official documentation, before running a Datalab instance, the corresponding APIs should be enabled: the Google Compute Engine and Cloud Source Repositories APIs. To do so, visit Products -> APIs and Services -> Library and search for the APIs. Additionally, make sure that billing is enabled for your Google Cloud project.
You can also enable the APIs by typing the following command, which will give you a prompt to enable them:
datalab list
I did some research and found that the same issue has been reported on the GitHub page. If enabling the APIs doesn't work, the best option would be to contribute (add a comment) to the mentioned GitHub issue to make it more visible to the Datalab engineering team.
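For background, the error message itself is a generic Python 3 TypeError that is raised whenever bytes are written to a file or stream opened in text mode; here is a minimal illustration (not datalab-specific):

# Reproduces the same TypeError outside of datalab:
with open("out.txt", "w") as f:   # "w" opens the file in text mode, which expects str
    f.write(b"hello")             # passing bytes raises:
                                  # TypeError: write() argument must be str, not bytes
# Opening with "wb" (binary mode), or decoding the bytes first, avoids the error.

In other words, the fix has to come from the datalab tooling itself (or from running it under a Python version it supports), which is why commenting on the GitHub issue is the practical next step.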