Where are models saved by default? - google-cloud-ml

I've submitted a training job to the cloud using the RESTful API and can see in the console logs that it completed successfully. To deploy the model and use it for predictions, I saved the final model using tf.train.Saver().save() (following the how-to guide).
When running locally, I can find the graph files (export-* and export-*.meta) in the working directory. When running on the cloud, however, I don't know where they end up. The API doesn't seem to have a parameter for specifying this, the files aren't in the bucket with the trainer app, and I can't find any temporary buckets in Cloud Storage created by the job.

When you set up your Cloud ML environment you set up a bucket for this purpose. Have you looked in there?
https://cloud.google.com/ml/docs/how-tos/getting-set-up
Edit (for future record): As Robert mentioned in comments, you'll want to pass the output location to the job as an argument. A couple of things to be mindful of:
Use a unique output location per job, so one job doesn't clobber the outputs of another.
The recommendation is to pass the parent output path and have the job export the model into a 'model' subpath, with other outputs such as checkpoints and summaries organized under the same parent. That makes it easier to manage all the outputs.
While not required, I'll also suggest staging the training code in a 'packages' subpath of the output, which helps correlate the source with the outputs it produces.
Finally(!), keep in mind that when you use hyperparameter tuning, you'll need to append the trial id to the output path so that outputs produced by individual trials don't collide (a rough sketch of this layout follows).
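As a rough illustration only (not from the original answer; the flag name, bucket, and helper function are made up, and TF_CONFIG is where the trial id is typically exposed during hyperparameter tuning), the layout above could be wired up in the trainer like this:

import json
import os

def build_output_dirs(base_output_path):
    """Return per-run output locations under a parent path such as
    gs://my-bucket/jobs/job_20170101_1200 (passed in via e.g. --output_path)."""
    # During hyperparameter tuning, the trial id shows up in TF_CONFIG.
    tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
    trial_id = tf_config.get('task', {}).get('trial', '')
    if trial_id:
        base_output_path = os.path.join(base_output_path, str(trial_id))
    return {
        'model': os.path.join(base_output_path, 'model'),
        'checkpoints': os.path.join(base_output_path, 'checkpoints'),
        'summaries': os.path.join(base_output_path, 'summaries'),
    }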

Related

After training in AI Platform, where can I find model.bst or other model file?

I trained an XGBoost model using AI Platform as shown here.
Now I have the option in the Console to download the model (but not to deploy it, since "Only models trained with built-in algorithms can be deployed from this page"). So I click to download.
However, the only file I see in the bucket is a tar archive.
That tar holds only some training code, and no model.bst, model.pkl, model.joblib, or other such model file.
Where do I find model.bst or the like, which I can deploy?
EDIT:
Following the answer below, we see that the "Download model" button is misleading: it sends us to the job directory, not the output directory (which is set arbitrarily in the code; the model is at census_data_20210527_215945/model.bst).
# Upload code from the linked sample; BUCKET_ID and model (the local model
# filename) are defined earlier in that example.
from google.cloud import storage
import datetime

bucket = storage.Client().bucket(BUCKET_ID)
blob = bucket.blob('{}/{}'.format(
    datetime.datetime.now().strftime('census_%Y%m%d_%H%M%S'),
    model))
blob.upload_from_filename(model)
Only built-in algorithms automatically store the model in Google Cloud Storage.
In your case, you have a custom training application.
You have to take care of saving the model on your own.
Referring to your example, this is implemented as shown here.
The model is uploaded to Google Cloud Storage using the Cloud Storage client.
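As a minimal sketch of that pattern (assuming a trained Booster and a bucket name; the helper and file names are placeholders, with the folder naming mirroring the sample above):

import datetime
import xgboost as xgb
from google.cloud import storage

def save_and_upload(bst: xgb.Booster, bucket_name: str) -> str:
    """Save a trained Booster locally, then upload it so it can be deployed later."""
    local_path = 'model.bst'
    # AI Platform does not persist anything for custom training code,
    # so the model file has to be written out explicitly.
    bst.save_model(local_path)
    blob_name = '{}/{}'.format(
        datetime.datetime.now().strftime('census_%Y%m%d_%H%M%S'), local_path)
    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_filename(local_path)
    return 'gs://{}/{}'.format(bucket_name, blob_name)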

Specify output filename of Cloud Vision request

So I'm sitting with Google Cloud Vision (for Node.js) and I'm trying to dynamically upload a document to a Google Cloud bucket, process it using the Google Cloud Vision API, and then download the .json afterwards. However, when Cloud Vision processes my request and places the result in my bucket for saved text extractions, it appends output-1-to-n.json to the end of the filename. So let's say I'm processing a file called foo.pdf that's 8 pages long: the output will not be foo.json (even though I specified that), but rather foooutput1-to-8.json.
Of course, this could be remedied by checking the page count of the PDF before uploading it and appending that to the path I search for when downloading, but that seems like such an unnecessarily hacky solution. I can't seem to find anything in the documentation about not appending output-1-to-n to outputs. Extremely happy for any pointers!
You can't specify a single output file for asyncBatchAnnotate because, depending on your input, many files may get created. The output config is only a prefix, and you have to do a wildcard search in GCS for that prefix (so you should make sure your prefix is unique).
For more details see this answer.
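As a rough sketch of that prefix search (the question uses Node.js, but the listing call is equivalent across clients; the bucket and prefix below are placeholders, assuming gcsDestination.uri was set to gs://my-bucket/vision-output/foo):

from google.cloud import storage

client = storage.Client()
# Every result object shares the configured prefix, e.g. foooutput-1-to-8.json,
# so list by prefix instead of guessing the exact output name.
for blob in client.list_blobs('my-bucket', prefix='vision-output/foo'):
    blob.download_to_filename(blob.name.split('/')[-1])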

Modifying image in Active Storage cloud

I'm using Rails 5.2 and GCS as the cloud service.
I'd like to give users the ability to crop and rotate their images.
User has many Images, Image has one :image_file attached
In development I use such method:
class Image
...
def rotate(degree)
image = MiniMagick::Image.new(ActiveStorage::Blob.service.send(:path_for, self.image_file.key))
image.rotate "#{degree}"
image.write(ActiveStorage::Blob.service.send(:path_for, self.image_file.key))
self.image_file.blob.analyze
end
...
end
But I can't figure out how to get at the image files in the cloud.
I've managed to download the file to local storage and perform all the operations needed.
Now all that remains is to replace the file in the cloud (delete the current one and create a new one with the same name), without changing anything in the database records if possible, but I can't figure out how to do this with Active Storage.
At the very least I need the file's name in the cloud so that I can use bare google-cloud-ruby.
To list the files stored in a Cloud Storage bucket using Ruby on Rails, see the code example defined here. You can also upload files to a Cloud Storage bucket and delete files from it using Ruby on Rails.
Also, since you are allowing your customers to modify their files in Cloud Storage buckets, you may want to consider using Object Versioning. This will incur additional cost but will provide reliability for your customers.
Here is the link to Ruby on Google Cloud Platform documentation which might be helpful to you.
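The question is in a Rails/Active Storage context, but the underlying bucket operation is the same in any client. As a rough, language-agnostic sketch using the Python storage client (the bucket name is a placeholder and key would correspond to image_file.key from the question), replacing the stored file under its existing object name, which is what keeps the database records valid, looks like:

from google.cloud import storage

def replace_object(bucket_name, key, local_path):
    """Overwrite an existing object in place, keeping the same object name/key."""
    blob = storage.Client().bucket(bucket_name).blob(key)
    blob.download_to_filename(local_path)   # fetch the current file
    # ... crop/rotate local_path here ...
    blob.upload_from_filename(local_path)   # re-upload under the same key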

Pointing multiple projects' log sinks to one bucket

I have a few GCP projects with log sinks to different storage buckets. I'd like to combine them into a single bucket. But the stackdriver export doesn't add any distinguishing information to the object names it creates; they all look like cloudaudit.googleapis.com/activity/2017/11/14/00:00:00_00:59:59_S0.json
What will happen if I start pushing them all to a single bucket? Will the different project sinks overwrite each other's objects? Is there any way to distinguish which project created the logs just from the object?
If not, I guess I should switch to pubsub sinks, and then write some code that produces objects with more desirable names. Are there any established patterns or examples for doing this?
Update: I filed https://issuetracker.google.com/issues/69371200 for this issue.
To enable this, just select custom destination on the sink and point to the bucket with this format: storage.googleapis.com/[BUCKET_ID].
I've just enabled this in a couple of my projects, as I'm curious to see the results when exporting to a bucket. However, I have been using a single BQ sink for all my projects, and the tables created contain all the logs mixed together, so no logs are lost when using a single BQ sink.
I'm assuming a GCS sink will work the same way, but I'll report back in a couple of days.
If a single bucket sink does not work, you can always use a single BQ sink (that will help in analyzing the logs), and when you no longer want to have them in BQ, export them and store the files wherever you want.
Also, since you'll be writing to your sink constantly, you can't use nearline or coldline, so the storage pricing is better in BQ than a regional bucket (0.02 USD/GB in BQ vs somewhere between 0.02 and 0.35 USD/GB for regional storage, depending on the region; BQ has 10GB free monthly, GCS 5GB).
I would generally recommend using a BQ sink, but I'll tell you what happens with my bucket logs.
Update:
A few hours later, I've verified that shared bucket sinks work pretty much as you would expect: logs are concatenated chronologically regardless of the project of origin, and only a single file is created for each time window. Hope this helps! (I still prefer BQ as a log sink...)
Update 2:
For the behavior you seek in the feature request, I would use BQ, but you could just as easily grep the project ID and separate the logs:
grep '"logName":"projects/<your-project-id>/' mixed-log.json > single-project-log.json
Or just have a Cloud Function triggered by bucket updates (so, every time a log file lands in the sink) run this for you (a rough sketch follows below).
Or namespace your buckets and have a Cloud Function move the files wherever you need as soon as they are written.
The possibilities are endless!
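A rough sketch of that Cloud Function idea (not from the answer; the function name, the separate destination bucket, and the output layout are assumptions), written as a 1st-gen background function triggered by object creation in the sink bucket:

import json

from google.cloud import storage

def split_log_by_project(event, context):
    """Triggered when a new log file lands in the sink bucket; rewrites it
    into one file per project in a separate bucket."""
    client = storage.Client()
    source_bucket = client.bucket(event['bucket'])
    lines = source_bucket.blob(event['name']).download_as_text().splitlines()

    by_project = {}
    for line in lines:
        log_name = json.loads(line).get('logName', '')   # projects/<id>/logs/...
        parts = log_name.split('/')
        project_id = parts[1] if len(parts) > 1 else 'unknown'
        by_project.setdefault(project_id, []).append(line)

    # Write to a different bucket so the function doesn't re-trigger on its own output.
    dest_bucket = client.bucket(event['bucket'] + '-split')
    for project_id, project_lines in by_project.items():
        dest_bucket.blob('{}/{}'.format(project_id, event['name'])).upload_from_string(
            '\n'.join(project_lines))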
If you have an organization or folder which includes all the projects that you want to collect logs from, then you can create a sink that collects from all projects in that org/folder.
Unfortunately, you cannot do this from the Cloud Console. Instead, you must use gcloud with the --organization or --folder option, or the API.
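For example, an aggregated sink to a bucket can be created along these lines (the sink name and filter are placeholders; --include-children is what makes the sink pick up logs from every project under the organization):

gcloud logging sinks create my-aggregated-sink storage.googleapis.com/[BUCKET_ID] \
    --organization=[ORGANIZATION_ID] --include-children \
    --log-filter='logName:"cloudaudit.googleapis.com"'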

Is there a way to group my DynamoDB export tasks on one EMR cluster?

When I set up a recurring backup via the export function in the DynamoDB console, the task it creates automatically spins up a new EMR cluster when it runs. Some of my tables need to be backed up but are fairly small. What I end up with is a huge number of large servers running to back up some relatively small tables. Is there any easy way to chain these tasks to run on one server group, in series or in parallel?
Yes, it is possible. There is no direct way, but it needs some additional tweaking on the Data Pipeline end. First you need to understand how Data Pipeline actually runs your export job by default.
When you click the Export button in the DynamoDB console, it takes you to the Data Pipeline console to create a pipeline for the export.
After filling out the template, instead of running it, you can use the Edit in Architect feature to alter the current template, which only works with one table.
On the Architect page, if you look at the Activities section, you will find an EmrActivity running an EMR step with the following parameters. This EMR step runs the export job using the parameters you initially passed in the template. Note that its RunsOn points to the EMRclusterforBackup resource, which you can find in the Resources section.
s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}
To run the export for other DynamoDB tables using the same EMR resource, simply create another EmrActivity object by clicking Add on the Architect page. On this activity, use the same RunsOn as the previous activity, and manually edit the step parameters to include the other table's name and its export path, for example:
s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,s3://myexport-bucket/table2/,table2,0.9
You can extend it for multiple tables.
Note: this can also be done for multiple tables by working with the pipeline definition as a JSON file: edit it to add more activities and parameters, then use that definition to run the pipeline later (a rough sketch of an added activity follows).
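As a hedged sketch only (the object id and export bucket are placeholders, and the exact field names depend on the template you exported), an additional activity in the definition file that reuses the same cluster resource might look like:

{
  "objects": [
    {
      "id": "TableBackupActivity2",
      "name": "TableBackupActivity2",
      "type": "EmrActivity",
      "runsOn": { "ref": "EMRclusterforBackup" },
      "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,s3://myexport-bucket/table2/,table2,0.9"
    }
  ]
}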