Online prediction with data stored in a bucket - google-cloud-platform

As I understand it, online prediction works with JSON data. Currently I am running online prediction from localhost, where each image gets converted to JSON, and the ML Engine API uses that JSON for prediction.
Internally, the ML Engine API might be uploading the JSON to the cloud for prediction.
Is there any way to run online prediction on JSON files already uploaded to a Cloud Storage bucket?

Internally we parse the input directly from the request payload for serving; we do not store the requests on disk. Reading inputs from Cloud Storage is currently not supported for online prediction. You may consider using batch prediction, which reads data from files stored in Cloud Storage.
There is a small discrepancy between the online and batch inputs for models that accept only one string input (probably like your case). In that case, you must base64-encode the image bytes and put them in a JSON file for online prediction, while for batch prediction you need to pack the image bytes into records in TFRecord format and save them as tfrecord file(s). Other than that, the inputs are compatible.
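To make the difference concrete, here is a minimal Python sketch of preparing the same image for both paths. The input alias "image_bytes" is an assumption, not something stated above; use whatever name your serving signature actually exposes.

```python
# Minimal sketch: prepare one image for online prediction (base64 inside JSON)
# and for batch prediction (raw bytes packed as TFRecord records).
# The input alias "image_bytes" is an assumption; use the alias your model's
# serving signature actually exposes.
import base64
import json

import tensorflow as tf


def online_request_body(image_path):
    """Build the JSON body for an online prediction request."""
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    # Binary inputs are wrapped as {"b64": "..."} so the service knows to decode them.
    instance = {"image_bytes": {"b64": base64.b64encode(image_bytes).decode("utf-8")}}
    return json.dumps({"instances": [instance]})


def write_batch_tfrecords(image_paths, output_path):
    """Pack raw image bytes into a TFRecord file for batch prediction."""
    with tf.io.TFRecordWriter(output_path) as writer:
        for path in image_paths:
            with open(path, "rb") as f:
                writer.write(f.read())  # one image's bytes per record
```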

Related

PyTorch: Google Cloud Storage, persistent disk for training models on DLVM?

I'm wondering what's the best way to go about
a) reading in the data, and
b) writing data while training a model?
My data is in a GCS bucket, about 1 TB, produced by a Dataflow job.
For writing data (all I want is model checkpoints and logs), do people just write to a zonal persistent disk rather than to Google Cloud Storage? It is a large model, so the checkpoints take up a fair bit of space.
I can't seem to write data to Cloud Storage without adding, say, a context manager and byte-string handling code everywhere I want to write.
Now for reading in data:
PyTorch doesn't seem to have a good way to read data from a GCS bucket the way TensorFlow does, so what should I do?
I've tried gcsfuse, which I think could work, but when I 'mount' the bucket I can only see inside the repo I selected, not the subdirectories. Is this normal?
Would gcsfuse be the right way to load data from GCS?
Thanks.
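For the checkpoint-writing part of the question, here is a minimal sketch of one option, assuming you're willing to use the google-cloud-storage client's file-like blob interface (available in recent versions of the library) rather than gcsfuse. Bucket and object names below are placeholders.

```python
# Sketch: save/load PyTorch checkpoints directly against a GCS bucket using
# the google-cloud-storage client's file-like blob interface, so the
# byte-string plumbing lives in one place. Bucket/object names are placeholders.
import torch
from google.cloud import storage


def save_checkpoint_to_gcs(state, bucket_name, blob_name):
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    with blob.open("wb") as f:  # streams the upload to GCS
        torch.save(state, f)


def load_checkpoint_from_gcs(bucket_name, blob_name, map_location="cpu"):
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    with blob.open("rb") as f:
        return torch.load(f, map_location=map_location)


# Example usage during training, e.g. once per epoch:
# save_checkpoint_to_gcs({"model": model.state_dict(), "epoch": epoch},
#                        "my-training-bucket", f"checkpoints/epoch_{epoch}.pt")
```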

CRC32C checksum for HTTP Range GET requests in Google Cloud Storage

When I want to get a partial range of file content in Google Cloud Storage, I use the XML API with HTTP Range GET requests. In the response from Google Cloud I can find the x-goog-hash header, which contains CRC32C and MD5 checksums, but these checksums are calculated over the whole file. What I need is the CRC32C checksum of just the partial range of content in the response. With that partial CRC32C checksum I could verify the data in the response; without it, I cannot validate the response.
I was wondering: are the files stored in your bucket in gzip format? I read here, Using Range Header on gzip-compressed files, that you can't get partial content from a compressed file; by default you get the whole file.
Anyway, could you share the request you're sending?
I looked for more information and found this: Request Headers and Cloud Storage.
It says that when you use the Range header, the returned checksum covers the whole file.
So far, there is no way to get the checksum for a byte range alone using the XML API.
However, you could try to do it yourself by splitting out that part of the file in your preferred programming language and computing the checksum for the split part.
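A rough sketch of that workaround in Python, assuming the google-crc32c package (crcmod's predefined "crc-32c" function would work too); bucket, object, and range values are placeholders:

```python
# Sketch: download a byte range and compute CRC32C over just those bytes
# locally, since the x-goog-hash header only covers the whole object.
# Bucket/object names and the byte range are placeholders.
import base64

import google_crc32c
from google.cloud import storage


def download_range_with_crc32c(bucket_name, blob_name, start, end):
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    data = blob.download_as_bytes(start=start, end=end)  # end is the last byte of the range

    checksum = google_crc32c.Checksum()
    checksum.update(data)
    # Encode the 4-byte digest the same way GCS reports CRC32C (base64),
    # so it is easy to compare against whole-object checksums elsewhere.
    return data, base64.b64encode(checksum.digest()).decode("ascii")
```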

Data Loss Prevention on Big Data files

I have migrated a big data application to the cloud, and the input files are stored in GCS. The files can be in different formats such as txt, csv, avro, parquet, etc., and they contain sensitive data that I want to mask.
Also, I have read that there is a quota restriction on file size. In my case a single file can contain 15 million records.
I have tried the DLP UI as well as the client library to inspect those files, but it's not working.
GitHub page - https://github.com/Hitman007IN/DataLossPreventionGCPDemo
Under the resources there are 2 files: test.txt is working, and test1.txt, which is the sample file that I use in my application, is not working.
Google Cloud DLP just launched support last week for scanning Avro files natively.
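For files that are too large for inline inspection, a rough sketch of creating a DLP inspection job over a GCS object with the Python client might look like the following; the project ID, gs:// URL, and infoTypes are placeholders, not values from the question.

```python
# Sketch: inspect a file stored in GCS by creating a DLP inspection job,
# which scans storage directly instead of pushing content through the
# size-limited inline inspect_content call. Project ID, gs:// URL, and
# infoTypes below are placeholders.
from google.cloud import dlp_v2


def start_gcs_inspect_job(project_id, gcs_url):
    dlp = dlp_v2.DlpServiceClient()
    inspect_job = {
        "storage_config": {
            "cloud_storage_options": {
                "file_set": {"url": gcs_url},  # e.g. "gs://my-bucket/test1.txt"
            }
        },
        "inspect_config": {
            "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
            "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
        },
    }
    job = dlp.create_dlp_job(
        request={"parent": f"projects/{project_id}", "inspect_job": inspect_job}
    )
    return job.name  # poll the job (or attach a Pub/Sub action) to collect findings
```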

Alter the audio format for Amazon connect recordings

So the basic problem is that I am setting up an Amazon Connect instance and have successfully started recording calls, but I want the recording audio file to be stored in the S3 bucket in some format (.mp3, .mp4, etc.) other than the default provided by Amazon (.wav).
Since .wav is the default format and I can't find any official documentation about changing it, any leads would be welcome.
Rather than downloading the file and converting it to the target format, which I have already done, I need the file to be stored in S3 in the target format itself (anything other than .wav).
Currently there is no way to change the audio format for Connect call recordings; it's fixed at 8 kHz, 16-bit WAV files. But you could set up a MediaConvert job to automatically convert to any format you'd like.
Alternatively, you could have a trigger call a Lambda function to do the same thing. Here's a CloudFormation template that sets that up for you.
In both solutions, you might want to adjust the process a bit to delete the WAV file after conversion, to save on storage costs.
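A hypothetical sketch of the Lambda route in Python/boto3: an S3 "object created" trigger on the recordings prefix invokes a function that submits a MediaConvert job for each new .wav file. The IAM role ARN and the job template (which would define the MP3 output group and destination) are placeholders you'd create yourself.

```python
# Hypothetical sketch: S3 events on the Connect recordings prefix trigger this
# Lambda, which submits a MediaConvert job to transcode the WAV file. The role
# ARN and job template names are placeholders read from environment variables.
import os

import boto3


def handler(event, context):
    # MediaConvert uses an account-specific endpoint that has to be discovered.
    endpoint = boto3.client("mediaconvert").describe_endpoints()["Endpoints"][0]["Url"]
    mediaconvert = boto3.client("mediaconvert", endpoint_url=endpoint)

    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        mediaconvert.create_job(
            Role=os.environ["MEDIACONVERT_ROLE_ARN"],
            JobTemplate=os.environ["WAV_TO_MP3_TEMPLATE"],
            Settings={
                "Inputs": [{
                    "FileInput": f"s3://{bucket}/{key}",
                    "AudioSelectors": {"Audio Selector 1": {"DefaultSelection": "DEFAULT"}},
                }]
            },
        )
```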

Specify output filename of Cloud Vision request

So I'm sitting with Google Cloud Vision (for Node.js), trying to dynamically upload a document to a Google Cloud Storage bucket, process it with the Cloud Vision API, and then download the .json afterwards. However, when Cloud Vision processes my request and places the saved text extraction in my bucket, it appends output-1-to-n.json to the end of the filename. So if I'm processing a file called foo.pdf that's 8 pages long, the output will not be foo.json (even though I specified that), but rather foooutput1-to-8.json.
Of course, this could be remedied by checking the page count of the PDF before uploading it and appending that to the path I search for when downloading, but that seems like an unnecessarily hacky solution. I can't seem to find anything in the documentation about not appending output-1-to-n to outputs. Extremely happy for any pointers!
You can't specify a single output file for asyncBatchAnnotate because, depending on your input, many files may get created. The output config is only a prefix, and you have to do a wildcard search in GCS for your given prefix (so you should make sure your prefix is unique).
For more details see this answer.
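In practice that means giving each request its own unique prefix and then listing whatever shards Vision wrote under it. The question uses Node.js, but the pattern is the same; here is a Python sketch with placeholder bucket and prefix names.

```python
# Sketch: use a unique output prefix per request, then list and download all
# the "output-N-to-M.json" shards Vision wrote under that prefix instead of
# guessing the exact filename. Bucket and prefix names are placeholders.
import json
import uuid

from google.cloud import storage


def collect_vision_outputs(bucket_name, prefix):
    client = storage.Client()
    results = []
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        results.append(json.loads(blob.download_as_text()))
    return results


# When building the asyncBatchAnnotate request, point the GCS destination at a
# prefix like this so outputs from different documents never collide:
unique_prefix = f"vision-output/{uuid.uuid4()}/"
```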