My team heavily uses spaCy, BERT, and other model-based NLP tools. Where should I store these models (en_core_web_lg and such), so that:
they are stored only once (for pricing reasons)
multiple notebook projects can access them
I have tried uploading the models to a Cloud Storage bucket, since pandas can open files directly from a bucket, but that is not the case for spaCy.
I would like to avoid solutions where the notebooks download the models locally from the bucket every time they run.
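For illustration, a minimal sketch of the asymmetry described above (the bucket path and model location are placeholders):

    import pandas as pd
    import spacy

    # pandas can read straight from a Cloud Storage URI (via the gcsfs/fsspec backend)
    df = pd.read_csv("gs://my-shared-bucket/data/sample.csv")

    # spacy.load only accepts an installed package name or a local directory,
    # so pointing it at a gs:// URI does not work
    nlp = spacy.load("/local/path/to/en_core_web_lg")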
We're trying to deploy a custom optimizer model to SageMaker. Our model consists of a number of .py files distributed across the repo and some external library dependencies like ortools. Input CSV files can be put into an S3 bucket. The output of our model is a pickle file derived from the input CSV files (these will be different each time someone runs a job).
We would prefer not to use ECR, but if there's no other option, can we follow the link below to achieve what we're aiming for? This SageMaker endpoint is expected to be called from a Step Function.
https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html
I'd encourage you to check out the examples here for BYOC deployment.
I would need more information, particularly on the framework and model, to suggest anything further.
I am receiving CSV files from different users (from the same organisation) over Microsoft Teams. I have to download each file and import it into a bucket on Google Cloud Storage.
What would be the most efficient way to store those files directly in Google Cloud Storage every time I receive a file from a given user over Teams? The files must come in through Microsoft Teams.
I was thinking of triggering Cloud Run from Pub/Sub, but I am a bit confused about how to connect this with Teams.
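For what it's worth, the Cloud Run side of that idea could look roughly like the sketch below (here using Flask; the bucket name, route, and form field are all placeholders). Note that it receives the file over plain HTTP rather than via Pub/Sub, and how Teams would actually call it is exactly the open part of the question:

    from flask import Flask, request
    from google.cloud import storage

    app = Flask(__name__)
    client = storage.Client()

    @app.route("/upload", methods=["POST"])
    def upload():
        # CSV file posted to the service; "file" is a placeholder form field name
        uploaded = request.files["file"]
        bucket = client.bucket("my-teams-csv-bucket")  # placeholder bucket
        blob = bucket.blob(f"incoming/{uploaded.filename}")
        blob.upload_from_file(uploaded.stream, content_type="text/csv")
        return "stored", 200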
I imagine you should be able to do this fine using Power Automate, but it might depend on how you're receiving the files (for instance, are users sending them to you directly in a 1:1 chat, or uploading them to a Files tab in a specific Team/Channel).
Here's an example template for moving files from OneDrive for Business to Google Drive, that sounds like it should help: https://flow.microsoft.com/en-us/galleries/public/templates/02057296acac46e9923e8a842ab9911d/sync-onedrive-for-business-files-to-google-drive-files/
I have a pre-trained model.pkl file, along with all the other files related to the ML model. I want to deploy it on AWS SageMaker.
But how do I deploy it to AWS SageMaker without training? The fit() method in SageMaker runs the training job and pushes model.tar.gz to an S3 location, and when the deploy() method is used it uses that same S3 location to deploy the model. We don't create that S3 location manually; it is created by SageMaker and named with a timestamp. How can I put my own personalized model.tar.gz file in an S3 location and call deploy() using that location?
All you need is:
to have your model in an arbitrary S3 location in a model.tar.gz archive
to have an inference script in a SageMaker-compatible docker image that is able to read your model.pkl, serve it and handle inferences.
to create an endpoint associating your artifact to your inference code
When you ask for an endpoint deployment, SageMaker takes care of downloading your model.tar.gz and uncompressing it to the appropriate location in the Docker image of the server, which is /opt/ml/model.
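As an illustration, a minimal sketch of what such an inference script could look like for the pre-built Scikit-learn container (the model filename is an assumption; model_fn receives the directory where SageMaker has unpacked model.tar.gz, i.e. /opt/ml/model):

    import os
    import joblib

    def model_fn(model_dir):
        # model_dir is /opt/ml/model inside the serving container
        return joblib.load(os.path.join(model_dir, "model.pkl"))

    def predict_fn(input_data, model):
        # input_data has already been deserialized by the container's default input_fn
        return model.predict(input_data)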
Depending on the framework you use, you may be able to use a pre-existing Docker image (available for Scikit-learn, TensorFlow, PyTorch, and MXNet), or you may need to create your own.
Regarding custom image creation, see the specification here and, here, two examples of custom containers for R and sklearn (the sklearn one is less relevant now that there is a pre-built Docker image along with a SageMaker sklearn SDK).
Regarding leveraging existing containers for Sklearn, PyTorch, MXNet, or TF, check this example: Random Forest in SageMaker Sklearn container. In this example, nothing prevents you from deploying a model that was trained elsewhere. Note, though, that with a train/deploy environment mismatch you may run into errors due to software version differences.
Regarding this part of your experience:
when the deploy() method is used it uses that same S3 location to deploy the model; we don't create that S3 location manually, it is created by SageMaker and named with a timestamp
I agree that the demos using the SageMaker Python SDK (one of the many SDKs available for SageMaker) can sometimes be misleading, in the sense that they often leverage the fact that an Estimator that has just been trained can be deployed (Estimator.deploy(..)) in the same session, without instantiating the intermediary model concept that maps inference code to a model artifact. This design presumably keeps the code compact, but in real life the training and deployment of a given model may well be done from different scripts running on different systems. It is perfectly possible to deploy a model without having trained it in the same session: you just need to instantiate a sagemaker.model.Model object and then deploy it.
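For example, a sketch along those lines (the image URI, S3 path, role ARN, and instance type are placeholders; the image can be a pre-built or custom serving container, as discussed above):

    import sagemaker
    from sagemaker.model import Model

    model = Model(
        image_uri="<inference-container-image-uri>",           # pre-built or custom serving image
        model_data="s3://my-bucket/models/model.tar.gz",        # your own pre-trained artifact
        role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder execution role
        sagemaker_session=sagemaker.Session(),
    )

    predictor = model.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.large",
    )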
I am trying to set up a basic PyTorch pipeline with Google AI Platform.
I don't understand how Google Cloud Storage works with AI Platform jobs.
I am trying to mount several Cloud Storage blobs in my AI Platform jobs but cannot find out how to do it. I need to do two things: 1) access the dataset from my Python PyTorch code, and 2) access the logs and models after training finishes.
In the Google AI Platform tutorials, the only relevant approach I found is manually downloading the dataset to the job's local storage via the Python google.cloud.storage API and uploading the results after the program finishes. But surely this is unacceptable for quick research iterations (because of large datasets and possible crashes in the middle of training).
What is the solution to such a basic problem?
You can use Cloud Storage FUSE to mount your bucket and use it as if it were a local folder, avoiding data downloads.
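For example, assuming the bucket was mounted at start-up with something like gcsfuse my-training-bucket /mnt/gcs (bucket name and mount point are placeholders), the training code can then treat the bucket as ordinary local paths:

    import os
    import torch

    DATA_DIR = "/mnt/gcs/datasets/my_dataset"   # read training data as if it were local
    RUN_DIR = "/mnt/gcs/runs/experiment_01"     # logs and checkpoints land directly in the bucket

    os.makedirs(RUN_DIR, exist_ok=True)
    # save a checkpoint straight "into" the bucket through the mount
    torch.save({"epoch": 0, "model_state": {}}, os.path.join(RUN_DIR, "checkpoint.pt"))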
I am setting up a relationship where two Google App Engine applications (A and B) need to share data. B needs to read data from A, but A is not directly accessible to B. Both A and B currently use Google Datastore (NOT persistent disk).
I have an idea where I take a snapshot of A's state and upload it to a separate Google Cloud Storage location. This location can be read by B.
Is it possible to take a snapshot of A using Google App Engine and upload this snapshot (perhaps in JSON) to a separate Google Cloud Storage location to be read from by B? If so, how?
What you're looking for is the Datastore managed export/import service:
This page describes how to export and import Cloud Firestore in Datastore mode entities using the managed export and import service. The managed export and import service is available through the gcloud command-line tool and the Datastore mode Admin API (REST, RPC).
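For illustration, a sketch of kicking off such an export programmatically from A's side, assuming the google-cloud-datastore client library (which exposes the Admin API as datastore_admin_v1); the project ID and bucket are placeholders:

    from google.cloud import datastore_admin_v1

    client = datastore_admin_v1.DatastoreAdminClient()

    # Start a managed export of app A's entities into a bucket that app B can read
    operation = client.export_entities(
        request={
            "project_id": "project-a",                         # placeholder project for app A
            "output_url_prefix": "gs://shared-export-bucket",  # placeholder bucket readable by B
        }
    )
    response = operation.result()   # block until the managed export finishes
    print(response.output_url)      # GCS location of the export metadata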
You can see a couple of examples described in a bit more detail in these related posts:
Google AppEngine Getting 403 forbidden trying to update cron.yaml
Transferring data from product datastore to local development environment datastore in Google App Engine (Python)
You may need to take extra precautions:
if you need data consistency (exports are not atomic)
to handle potential conflicts in entity key IDs, especially if using manually-generated ones or referencing them in other entities
If A not being directly accessible to B isn't actually intentional, and you'd be OK with allowing B to access A, then that's also possible. The Datastore can be accessed from anywhere, even from outside Google Cloud (see How do I use Google datastore for my web app which is NOT hosted in google app engine?). It might be a bit tricky to set up, but once that's done it's IMHO a smoother sharing approach than the export/import one.