What is the Equivalent of Azure Blob Storage in AWS? - amazon-web-services

Apologies for the noob question, but among the storage options offered by AWS, which one is functionally closest to an Azure blob container?
My previous workflow always involved mounting an Azure blob container as a Linux file system under the /mnt/ directory with blobfuse, so that I could read and write data using normal file-handling commands in Python. So which AWS storage service can I mount as a file system on my Linux desktop in a similar fashion?
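The closest AWS analogue to a blob container is an Amazon S3 bucket, and the role of blobfuse is typically played by s3fs-fuse (the same tool referenced in the s3fs-fuse tutorial linked further down this page). As a rough sketch of the resulting workflow, assuming a bucket has already been mounted at the placeholder path /mnt/s3:

# Assumes an S3 bucket has already been mounted with s3fs-fuse, e.g.:
#   s3fs my-bucket /mnt/s3 -o passwd_file=${HOME}/.passwd-s3fs
# (the bucket name, mount point and file paths here are placeholders)

# Ordinary Python file handling then works against the mounted path,
# much like the blobfuse workflow described above.
with open("/mnt/s3/data/results.txt", "w") as f:
    f.write("written through the mounted bucket\n")

with open("/mnt/s3/data/results.txt") as f:
    print(f.read())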

Related

Google Cloud Bucket mounted on Compute Engine Instance using gcsfuse does not create files

I have been able to mount a Google Cloud Storage bucket using
gcsfuse --implicit-dirs production-xxx-appspot /mount
or, equivalently,
sudo mount -t gcsfuse -o implicit_dirs,allow_other,uid=1000,gid=1000,key_file=service-account.json production-xxx-appspot /mount
Mounting works fine.
When I execute the following commands after mounting, they also work fine:
mkdir /mount/files/
cp -rf /home/files/* /mount/files/
However, when I use:
mcedit /mount/files/a.txt
or
vi /mount/files/a.txt
The output says that there is no file available, which makes sense.
Is there any way to handle this situation, so that applications can directly create files on the mounted Google Cloud Storage bucket rather than creating files locally and copying them over afterwards?
If you do not want to create files locally and upload them later, you should consider using a file storage system like Google Drive.
Google Cloud Storage is an object storage system, which means objects cannot be modified in place; you have to write each object completely at once. Object storage also does not work well with traditional databases, because writing objects is a slow process and writing an app to use an object storage API is not as simple as using file storage.
In a file storage system, data is stored as a single piece of information inside a folder, just as you would organize pieces of paper inside a manila folder. When you need to access that piece of data, your computer needs to know the path to find it (and it can be a long, winding path).
If you want to use Google Cloud Storage, you need to create your file locally and then push it to your bucket.
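As a minimal Python sketch of that create-locally-then-push pattern, using the google-cloud-storage client library (the bucket and file names are placeholders):

from google.cloud import storage

# Create the file locally first, then push it to the bucket as an object.
with open("/home/files/a.txt", "w") as f:
    f.write("some content\n")

client = storage.Client()
bucket = client.bucket("my-bucket")              # placeholder bucket name
blob = bucket.blob("files/a.txt")                # destination object name
blob.upload_from_filename("/home/files/a.txt")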
Here is an example of how to configure Google Cloud Storage with Node.js: File Upload example
Here is a tutorial on How to mount Object Storage on Cloud Server using s3fs-fuse
If you want to know more about storage formats, please follow this link
More information about reading and writing to Cloud Storage can be found in this link

Creating a duplicate of a VM

I'm preparing to get into the world of cloud computing.
My first question is:
Is it possible to programmatically create a new, or duplicate an existing VM from my server?
Project Background
I provide a file processing service, and as it's been growing I need to offer a better service.
Project Requirement
Machine specs:
HDD: Min 16 GB
CPU: Min 1 core
RAM: Min 2 GB
GPU: Min CUDA 10.1 compatible
What I'm thinking is the following steps:
User uploads a file
A dedicated VM is created for that specific file inside Google Cloud Compute
The file is sent to the VM
File is processed using an Anaconda environment
Results are downloaded to local server
Dedicated VM is removed
Results are served to user
How is this accomplished?
PS: I'm looking for resources and advice. Not code.
Your question is a perfect formulation of the concept of Google Cloud Run. At the highest level concept, you create a Docker image (think of it like a VM) and then register that Docker image with GCP Cloud Run. When a trigger occurs, GCP will spin up an instance of that Docker container and pass in information about the cause of that trigger (a file created in GCS or a REST request or others ...). What you do in your container is up to you. You have full power of the Linux environment (under Docker) to do as you like. When your request ends, the container is spun down. You are only billed for the compute resources you use. If your container (VM) isn't being used, you pay nothing until the next trigger.
An alternative to Cloud Run is Cloud Functions. This is a higher level abstraction where instead of providing a Docker container, you provide the body of a function (JavaScript, Java, Python or others) and the request is passed to that function when a trigger occurs. Which you use is mostly personal choice (you didn't elaborate on "File is processed").
References:
Cloud Run
Cloud Functions
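For a sense of what the Cloud Functions route looks like, here is a rough Python sketch of a background function triggered when a file is finalized in a Cloud Storage bucket; the processing step itself is left as a placeholder:

# Background Cloud Function triggered by google.storage.object.finalize on a bucket;
# "event" carries metadata about the uploaded file.
def process_file(event, context):
    bucket = event["bucket"]
    name = event["name"]
    print(f"Processing gs://{bucket}/{name}")
    # ... download the object, run the processing step, upload the results ...

Deploying it with gcloud and pointing the trigger at the upload bucket then replaces the create-VM / send-file / remove-VM steps above; the Cloud Run route is analogous, except the same logic is packaged in a Docker image.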

Mount google cloud storage folders to google ai platform job

I am trying to set up a basic PyTorch pipeline with Google AI Platform.
I don't understand how Google Storage works with AI Platform jobs.
I am trying to mount several Google Storage blobs to my AI Platform jobs but cannot find out how to do it. I need to do two things: 1) access the dataset from my Python PyTorch code, and 2) access the logs and models after training finishes.
In the Google AI Platform tutorials the only relevant approach I found is manually downloading the dataset to the job's local storage via the Python google.cloud.storage API and uploading the results after the program finishes. But surely this is unacceptable for quick research iterations (because of large datasets and possible crashes in the middle of training).
What is the solution for such a basic problem?
You can use Cloud Storage FUSE (gcsfuse) to mount your bucket and use it as if it were a local folder, so you avoid downloading the data.
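As a rough sketch, assuming the bucket has been mounted (for example with the gcsfuse command shown earlier on this page) at the placeholder path /mnt/gcs, both the dataset access and the checkpoint/log writing become ordinary file I/O:

import os
import torch

MOUNT = "/mnt/gcs"  # placeholder: bucket mounted here via Cloud Storage FUSE

# Read training data straight from the mounted bucket.
with open(f"{MOUNT}/datasets/train.csv") as f:
    header = f.readline()

# ... training loop ...

# Checkpoints and logs land in the bucket as soon as they are written.
os.makedirs(f"{MOUNT}/checkpoints", exist_ok=True)
os.makedirs(f"{MOUNT}/logs", exist_ok=True)
torch.save({"step": 0}, f"{MOUNT}/checkpoints/model.pt")
with open(f"{MOUNT}/logs/train.log", "a") as log:
    log.write("finished step 0\n")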

Streaming send file from GCS to Windows Server

I want to create a job running on Dataflow (streaming mode).
Its function will be to receive files from Google Cloud Storage (path: gs://mybucket....) and transfer these files to a server running Windows Server.
Can anybody suggest a solution?
Windows is not supported as a storage destination. Dataflow has a set of connectors that are used for data transfers/storage, and Windows Server is not one of them.
This link lists the connectors that are currently available.
Apache Beam Built-in I/O Transforms
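For reference, a minimal Python sketch of what those built-in I/O transforms look like when reading from GCS; the paths are placeholders, and the output still has to go to a supported sink (such as another GCS path) before being pulled onto the Windows machine by some other mechanism:

import apache_beam as beam

# Read text files from a GCS path and write results back to GCS,
# both via Beam's built-in I/O transforms.
with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://mybucket/input/*.txt")
        | "Write" >> beam.io.WriteToText("gs://mybucket/output/result")
    )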

Data science workflow with large geospatial datasets

I am relatively new to the Docker approach, so please bear with me.
The goal is to ingest large geospatial datasets into Google Earth Engine using an open-source, replicable approach. I got everything working on my local machine and a Google Compute Engine instance, but I would like to make the approach accessible to others as well.
The large static geospatial files (NetCDF4) are currently stored on Amazon S3 and Google Cloud Storage (GeoTIFF). I need a couple of Python-based modules to convert and ingest the data into Earth Engine using a command line interface. This has to happen only once. The data conversion is not very heavy and can be done by one fat instance (32 GB RAM, 16 cores, roughly 2 hours); there is no need for a cluster.
My question is how I should deal with large static datasets in Docker. I thought of the following options but would like to know best practices.
1) Use Docker and mount the Amazon S3 and Google Cloud Storage buckets to the Docker container.
2) Copy the large datasets into a Docker image and use Amazon ECS.
3) Just use the AWS CLI.
4) Use Boto3 in Python.
5) A fifth option that I am not yet aware of
The Python modules that I use include, among others, python-GDAL, pandas, earth-engine, and subprocess.
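For illustration, a rough sketch of how options 3) and 4) could be combined inside the container for the one-off conversion and ingestion; every bucket, path and asset ID below is a placeholder, and both the gsutil and earthengine command-line tools are assumed to be installed and authenticated:

import subprocess

import boto3

# 1. Pull the static NetCDF4 file down from S3 (placeholder bucket/key).
s3 = boto3.client("s3")
s3.download_file("my-source-bucket", "data/input.nc", "/tmp/input.nc")

# 2. Convert NetCDF to GeoTIFF with GDAL's command-line converter.
subprocess.run(["gdal_translate", "-of", "GTiff", "/tmp/input.nc", "/tmp/output.tif"], check=True)

# 3. Stage the GeoTIFF in Google Cloud Storage and ingest it into Earth Engine.
subprocess.run(["gsutil", "cp", "/tmp/output.tif", "gs://my-staging-bucket/output.tif"], check=True)
subprocess.run(
    ["earthengine", "upload", "image", "--asset_id=users/me/output", "gs://my-staging-bucket/output.tif"],
    check=True,
)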