Designing Cloud Functions to stream downloads to Google Storage

I have a workflow where I want to download many large (500 MB - 1.5 GB) files each day from an external server and upload the file to GCS. Each file has its own endpoint. Due to restrictions on the external server that I download each file from, each download can take a few minutes. I was thinking of using a Cloud Function where I send the URL of each file, and the function downloads the file locally, then uploads it to GCS. However, to do so, I would need one of the large Cloud Function instances (2 GB). These can get quite expensive -- is there a way to use a lower memory instance and stream the result to GCS directly?

If you are transferring data from another cloud service provider, you can use the Storage Transfer Service, which will only cost you the network charges, as mentioned in this document.
If you are transferring the data from your on-premises server, you can use the Transfer Service for on-premises data and schedule the transfers as mentioned in this document, where you can set the interval to weeks, days, or hours. This costs $0.0125 per GB successfully transferred to the destination, in addition to the networking cost. Since your concern is that the time needed to download each file drives up Cloud Function costs, this may be worth considering: it charges per GB of data transferred rather than per unit of time.
You can use streaming transfers, as mentioned in this document, which allow you to transfer data to Cloud Storage without first saving it to a file. Streaming uploads are helpful when you don't know the final size of the data at the start of the upload.
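A streaming upload means a small Cloud Function instance can move a multi-GB file without ever holding the whole file in memory or on disk. Below is a minimal sketch in Python, assuming the requests and google-cloud-storage packages; the function name, bucket, and chunk sizes are illustrative and not from the original answer.

import requests
from google.cloud import storage

def stream_to_gcs(url: str, bucket_name: str, blob_name: str) -> None:
    """Stream an HTTP download straight into a GCS object, chunk by chunk."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.chunk_size = 8 * 1024 * 1024  # buffer at most ~8 MiB per resumable-upload chunk
    with requests.get(url, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        with blob.open("wb") as gcs_file:  # file-like writer backed by a resumable upload
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                gcs_file.write(chunk)

Peak memory stays around the chunk size, so one of the smaller Cloud Function instances should be sufficient regardless of the file size.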

Related

Reducing AWS Data Transfer Cost for AWS S3 files

AWS S3 has a standard public bucket and folder (Asia Pacific region) which hosts ~30 GB of images/media. On the other hand, the website and app access these images using direct S3 object URLs. Unknowingly, we ran into a high data transfer cost, and it is significantly disproportionate:
Amazon Simple Storage Service: USD 30
AWS Data Transfer: USD 110
I have also read that if EC2 and S3 are in the same region the cost will be significantly lower, but the problem is that the S3 objects are accessed directly from client machines anywhere in the world, with no EC2 involved in between.
Can someone please suggest how data transfer costs can be reduced?
The Data Transfer charge is directly related to the amount of information that goes from AWS to the Internet. Depending on your region, it is typically charged at 9c/GB.
If you are concerned about the Data Transfer charge, there are a few things you could do:
Activate Amazon S3 Server Access Logging, which will create a log file for each web request. You can then see how many requests are coming in and possibly detect strange access behaviour (e.g. bots, search engines, abuse). A sketch of enabling it programmatically appears at the end of this answer.
You could try reducing the size of files that are typically accessed, such as making images smaller. Take a look at the Access Logs and determine which objects are being accessed the most, and are therefore causing the most costs.
Use fewer large files on your website (e.g. videos). Again, look at the Access Logs to determine where the data is being used.
A cost of $110 at roughly $0.09/GB suggests about 1.2 TB of data being transferred ($110 ÷ $0.09/GB ≈ 1,220 GB).
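For the first suggestion, server access logging can also be switched on through the API. A minimal sketch using boto3 follows; the bucket names are placeholders, and the target bucket must already permit the S3 log delivery service to write to it.

import boto3

s3 = boto3.client("s3")
# Write access logs for "my-media-bucket" into "my-log-bucket" under the "access-logs/" prefix.
s3.put_bucket_logging(
    Bucket="my-media-bucket",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "my-log-bucket",
            "TargetPrefix": "access-logs/",
        }
    },
)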

Best way to transfer data from on-prem to AWS

I have a requirement to transfer data (one time) from on-prem to AWS S3. The data size is around 1 TB. I was looking at AWS DataSync, Snowball, etc., but these managed services seem better suited to migrations where the data is in petabytes. Can someone suggest the best way to transfer the data securely and cost-effectively?
You can use the AWS Command-Line Interface (CLI). This command will copy data to Amazon S3:
aws s3 sync c:/MyDir s3://my-bucket/
If there is a network failure or timeout, simply run the command again. It only copies files that are not already present in the destination.
The time taken will depend upon the speed of your Internet connection.
You could also consider using AWS Snowball, which is a piece of hardware that is sent to your location. It can hold 50TB of data and costs $200.
If you have no specific requirements (apart from the fact that it needs to be encrypted and the total size is 1 TB), I would suggest you stick to something plain and simple. S3 supports an object size of up to 5 TB, so you wouldn't run into trouble. I don't know if your data is made up of many smaller files or one big file (or archive), but in essence it's all the same.
Since the endpoints are all encrypted in transit, you should be fine; if you're worried, you can encrypt your files beforehand so they are also encrypted while stored (useful if it's a backup of something). To get to the point, you can use API tools for the transfer, or file-explorer style tools that also have S3 connectivity (e.g. https://www.cloudberrylab.com/explorer/amazon-s3.aspx).
One other point: the cost-effectiveness of storage and transfer depends on how frequently you need the data. If it is just a backup or a just-in-case copy, archiving to Glacier is much cheaper.
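If you go the API route, here is a hedged sketch of an upload with boto3 that also requests server-side encryption at rest; the file and bucket names are placeholders, and boto3 switches to multipart upload automatically for large files.

import boto3

s3 = boto3.client("s3")
# Upload one file and ask S3 to encrypt it at rest with SSE-S3 (AES-256).
s3.upload_file(
    Filename="backup.tar.gz",
    Bucket="my-backup-bucket",
    Key="backups/backup.tar.gz",
    ExtraArgs={"ServerSideEncryption": "AES256"},
)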
1 TB is large but it's not so large that it'll take you weeks to get your data onto S3. However if you don't have a good upload speed, use Snowball.
https://aws.amazon.com/snowball/
Snowball is a device shipped to you which can hold up to 100TB. You load your data onto it and ship it back to AWS and they'll upload it to the S3 bucket you specify when loading the data.
This can be done in multiple ways:
Using the AWS CLI, you can copy files from local storage to S3.
AWS Transfer using FTP or SFTP (AWS SFTP); please refer to its documentation.
There are tools like the CloudBerry clients, which have a UI.
You can use the AWS DataSync tool.

minimizing the cost of uploading a very large tar file to Google Cloud Storage

I'm currently trying to upload and then untar a very large file (1.3 TB) into Google Cloud Storage at the lowest price.
I initially thought about creating a really cheap instance just to download the file and put it in a bucket, then creating a new instance with a good amount of RAM to untar the file and then put the result in a new bucket.
However, since the bucket price depends on the number of I/O requests, I'm not sure it's the best option, and even for performance it might not be the best.
What would be the best strategy to untar the file in the cheapest way?
First some background information on pricing:
Google has pretty good documentation about how to ingest data into GCS. From that guide:
Today, when you move data to Cloud Storage, there are no ingress traffic charges. The gsutil tool and the Storage Transfer Service are both offered at no charge. See the GCP network pricing page for the most up-to-date pricing details.
The "network pricing page" just says:
[Traffic type: Ingress] Price: No charge, unless there is a resource such as a load balancer that is processing ingress traffic. Responses to requests count as egress and are charged.
There is additional information on the GCS pricing page about your idea to use a GCE VM to write to GCS:
There are no network charges for accessing data in your Cloud Storage buckets when you do so with other GCP services in the following scenarios:
Your bucket and GCP service are located in the same multi-regional or regional location. For example, accessing data in an asia-east1 bucket with an asia-east1 Compute Engine instance.
From later in that same page, there is also information about the pre-request pricing:
Class A Operations: storage.*.insert[1]
[1] Simple, multipart, and resumable uploads with the JSON API are each considered one Class A operation.
The cost for Class A operations is per 10,000 operations, and is either $0.05 or $0.10 depending on the storage type. I believe you would only be doing 1 Class A operation (or at most, 1 Class A operation per file that you upload), so this probably wouldn't add up to much usage overall.
Now to answer your question:
For your use case, it sounds like you want to have the files in the tarball be individual files in GCS (as opposed to just having a big tarball stored in one file in GCS). The first step is to untar it somewhere, and the second step is to use gsutil cp to copy it to GCS (a rough Python equivalent of that copy step is sketched at the end of this answer).
Unless you have to (i.e. not enough space on the machine that holds the tarball now), I wouldn't recommend copying the tarball to an intermediate VM in GCE before uploading to GCS, for two reasons:
gsutil cp already handles a bunch of annoying edge cases for you: parallel uploads, resuming an upload in case there's a network failure, retries, checksum comparisons, etc.
Using any GCE VMs will add cost to this whole copy operation -- costs for the disks plus costs for the VMs themselves.
If you want to try the procedure out with something lower-risk first, make a small directory with a few megabytes of data and a few files and use gsutil cp to copy it, then check how much you were charged for that. From the GCS pricing page:
Charges accrue daily, but Cloud Storage bills you only at the end of the billing period. You can view unbilled usage in your project's billing page in the Google Cloud Platform Console.
So you'd just have to wait a day to see how much you were billed.
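For reference, here is a rough Python sketch of the copy step using the google-cloud-storage package, in case you prefer a script over gsutil; the directory and bucket names are placeholders, and note that gsutil -m cp still gives you the parallelism, retries, and checksum comparisons mentioned above for free.

from pathlib import Path
from google.cloud import storage

def upload_directory(local_dir: str, bucket_name: str, prefix: str = "") -> None:
    """Upload every file under local_dir to gs://bucket_name/prefix, keeping relative paths."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    root = Path(local_dir)
    for path in root.rglob("*"):
        if path.is_file():
            blob_name = f"{prefix}{path.relative_to(root).as_posix()}"
            bucket.blob(blob_name).upload_from_filename(str(path))

# Example: upload_directory("/mnt/untarred", "my-bucket", "dataset/")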

Using AWS Kinesis for large file uploads

My client has a service which stores a lot of files, like video or sound files. The service works well; however, long-term file storage is proving quite a challenge, and we would like to use AWS for storing these files.
The problem is the following: the client wants to use AWS Kinesis for transferring every file from our servers to AWS. Is this possible? Can we transfer files using that service? There are a lot of video files, we get more every day, and every file is relatively big.
We would also like to save some details about the files, possibly into DynamoDB; we could use Lambda functions for that.
The most important thing is that we need a reliable data transfer option.
Kinesis would not be the right tool to upload files unless they were all very small, and most videos would almost certainly be over the 1 MB record size limit:
The maximum size of a data blob (the data payload before Base64-encoding) within one record is 1 megabyte (MB).
https://aws.amazon.com/kinesis/streams/faqs/
Use S3 with multipart upload using one of the SDKs. Objects you won't be accessing for 90+ days can be moved to Glacier.
Multipart upload allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation.
Amazon Web Services. Amazon Simple Storage Service (S3) Developer Guide (Kindle Locations 4302-4306). Amazon Web Services, Inc.. Kindle Edition.
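With the Python SDK, multipart uploads happen automatically once a file crosses a configurable size threshold. A hedged sketch with boto3 follows; the bucket, key, and tuning values are placeholders.

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
# Files larger than multipart_threshold are split into multipart_chunksize parts
# and uploaded on max_concurrency threads; a failed part can be retried on its own.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # switch to multipart above ~100 MB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=8,
)
s3.upload_file("video.mp4", "my-video-bucket", "uploads/video.mp4", Config=config)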
To further optimize file upload speed, use transfer acceleration:
Amazon S3 Transfer Acceleration enables fast, easy, and secure transfers of files over long distances between your client and an S3 bucket. Transfer Acceleration takes advantage of Amazon CloudFront’s globally distributed edge locations. As the data arrives at an edge location, data is routed to Amazon S3 over an optimized network path.
Amazon Web Services. Amazon Simple Storage Service (S3) Developer Guide (Kindle Locations 2060-2062). Amazon Web Services, Inc.. Kindle Edition.
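Transfer Acceleration has to be enabled on the bucket and then requested by the client. A sketch of both steps with boto3, with placeholder bucket names; note that accelerated transfers carry an additional per-GB charge.

import boto3
from botocore.config import Config

# One-time: turn acceleration on for the bucket.
boto3.client("s3").put_bucket_accelerate_configuration(
    Bucket="my-video-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Then route uploads through the accelerated endpoint.
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_accel.upload_file("video.mp4", "my-video-bucket", "uploads/video.mp4")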
Kinesis has since launched a new service, "Kinesis Video Streams" (https://aws.amazon.com/kinesis/video-streams/), which may be helpful for moving large amounts of data.

How to determine time needed to upload data to Amazon EC2?

We need to populate a database which sits on AWS { EC2 (Compute Cluster Eight Extra Large) + 1 TB EBS }. Given that we have close to 700 GB of data locally, how can I find out the (theoretical) time it would take to upload the entire data set? I could not find any information on data upload/download speeds for EC2.
Since this will depend strongly on the network between your site and Amazon's data centre...
Test it with a few GB and extrapolate.
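As a back-of-the-envelope check, the arithmetic is simple once you have a measured rate; the 100 Mbit/s figure below is only an assumed sustained uplink, so substitute whatever your test transfer actually achieves.

# Estimate upload time from measured throughput.
data_gb = 700            # amount of data to upload
uplink_mbps = 100        # assumed sustained uplink, in megabits per second
seconds = data_gb * 8 * 1000 / uplink_mbps
print(f"~{seconds / 3600:.1f} hours")   # ~15.6 hours at a sustained 100 Mbit/s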
Be aware of AWS Import/Export and consider the option of simply couriering Amazon a portable hard drive. (Old saying: "Never underestimate the bandwidth of a station wagon full of tape".) In fact, I note the page includes a section "When to use..." which gives some indication of transfer times vs. connection bandwidth.