For a particular task I'm working on, I have a dataset that is about 25 GB. I'm still experimenting with several methods of preprocessing and definitely don't have my data in its final form yet. I'm not sure what the common workflow is for this sort of problem, so here is what I'm thinking:
Copy the dataset from the storage bucket to the Compute Engine machine's SSD (maybe around 50 GB of SSD) using gcsfuse (a scripted alternative is sketched after this list).
Apply various preprocessing operations as an experiment.
Run training with PyTorch on the data stored on the local disk (SSD)
Copy newly processed data back to storage bucket with gcsfuse if it was successful.
Upload results and delete the persistent disk that was used during training.
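For the bulk copy in the first approach, you don't have to go through gcsfuse; here is a minimal sketch using the google-cloud-storage client instead (the bucket name, object prefix, and local path are placeholders):

    # Sketch: copy the raw dataset from a Cloud Storage bucket to the local SSD.
    # The bucket name, object prefix, and local path are placeholders.
    import os
    from google.cloud import storage

    BUCKET = "my-dataset-bucket"          # hypothetical bucket name
    PREFIX = "raw/"                       # hypothetical object prefix
    LOCAL_DIR = "/mnt/disks/ssd/dataset"  # local SSD mount point

    client = storage.Client()
    for blob in client.list_blobs(BUCKET, prefix=PREFIX):
        if blob.name.endswith("/"):       # skip "directory" placeholder objects
            continue
        dest = os.path.join(LOCAL_DIR, blob.name[len(PREFIX):])
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        blob.download_to_filename(dest)

    # The processed data can be copied back the same way with
    # client.bucket(BUCKET).blob(key).upload_from_filename(path).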
The alternative approach is this:
Run the preprocessing operations on the data within the Cloud Storage bucket itself, using the directory mounted with gcsfuse.
Run training with PyTorch directly on the mounted gcsfuse bucket directory, using a Compute Engine instance with very limited storage (see the sketch after this list).
Upload results and delete the Compute Engine instance.
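Either way, the PyTorch side only sees a directory path. A minimal sketch of a Dataset that reads from whichever path you point it at, whether that's the local SSD copy or the gcsfuse mount (the .pt file layout and paths are assumptions, not anything prescribed by PyTorch):

    # Sketch: a Dataset that reads preprocessed samples from a directory, which
    # can be either the local SSD copy or the gcsfuse mount point. Assumes each
    # sample was saved with torch.save() as a .pt file -- adjust to your layout.
    import glob
    import os

    import torch
    from torch.utils.data import DataLoader, Dataset

    class PreprocessedDataset(Dataset):
        def __init__(self, root):
            self.files = sorted(glob.glob(os.path.join(root, "*.pt")))

        def __len__(self):
            return len(self.files)

        def __getitem__(self, idx):
            return torch.load(self.files[idx])

    # "/mnt/disks/ssd/dataset" (local copy) or "/mnt/gcs-bucket/dataset" (gcsfuse)
    loader = DataLoader(PreprocessedDataset("/mnt/disks/ssd/dataset"),
                        batch_size=32, num_workers=4)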
Which of these approaches is suggested? Which will incur fewer charges, and which is used most often when running these kinds of operations? Is there a different workflow that I'm not seeing here?
On the billing side, the charges would be the same, as gcsfuse operations are charged like any other Cloud Storage interface according to the documentation. In your use case I don't know how you are going to train on the data, but if you perform more than one operation per file, it would be better to download the files, train locally, and then upload the final result, which amounts to two object operations per file. If, for example, you make more than one change or read to a file during training, every one of those would be a billed object operation. On the workflow side, the proposed one looks good to me.
Related
For an AI project I want to train a model over a dataset which is about 300 GB. I want to use the AWS SageMaker framework.
In the SageMaker documentation, they write that SageMaker can import data from an AWS S3 bucket. Since the dataset is huge, I zipped it (into several zip files) and uploaded it to an S3 bucket. It took several hours. However, in order to use it I need to unzip the dataset. There are several options:
Unzip directly in S3. This might be impossible to do. See refs below.
Upload the uncompressed data directly. I tried this, but it takes too much time and stopped in the middle, after uploading only 9% of the data.
Upload the data to an AWS EC2 machine and unzip it there. But can I import the data into SageMaker from EC2?
Many solutions offer a Python script that downloads the data from S3, unzips it locally (on the desktop), and then streams it back to the S3 bucket (see references below). Since I have the original files, I could simply upload them to S3, but this takes too long (see option 2).
Added in Edit:
I am now trying to upload the uncompressed data using AWS CLI V2.
References:
How to extract files in S3 on the fly with boto3?
https://community.talend.com/s/question/0D53p00007vCjNSCA0/unzip-aws-s3?language=en_US
https://www.linkedin.com/pulse/extract-files-from-zip-archives-in-situ-aws-s3-using-python-tom-reid
https://repost.aws/questions/QUI8fTOgURT-ipoJmN7qI_mw/unzipping-files-from-s-3-bucket
https://dev.to/felipeleao18/how-to-unzip-zip-files-from-s3-bucket-back-to-s3-29o9
The main strategy most commonly used, and also the least expensive (since storage has its own cost per GB), is not to use the disk space of the EC2 instance running the training job, but rather to take advantage of the high transfer rate from bucket to instance memory.
This is on the basis that the bucket resides in the same region as the EC2 instance; otherwise you would have to pay extra for comparable transfer performance.
You can implement all the strategies for reading files in parallel, or in chunks, in your own script, but my advice is to use automated frameworks such as dask/pyspark/pyarrow (in case you need to read dataframes), or to review how these zip archives are stored and whether they can be transformed into a format that is easier to work with (e.g., a CSV transformed into parquet.gzip).
If the nature of the data is different (e.g., images or other), an appropriate lazy data-loading strategy must be identified.
For example, for your zip-archive problem, you can easily get the list of your files from an S3 folder and read them sequentially.
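A minimal sketch of that with boto3 (the bucket and prefix names are placeholders, and each archive is read into memory here, so very large archives would need to be spooled to disk instead):

    # Sketch: list the zip archives under an S3 prefix and read them one by one.
    # Bucket and prefix are placeholders; each archive is pulled into memory, so
    # very large archives should be downloaded to disk instead.
    import io
    import zipfile

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-dataset-bucket"  # hypothetical
    PREFIX = "zipped/"            # hypothetical

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith(".zip"):
                continue
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            with zipfile.ZipFile(io.BytesIO(body)) as zf:
                for member in zf.namelist():
                    data = zf.read(member)  # hand each record to your pipeline here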
You already have the data in S3 zipped. What's left is:
Provision a SageMaker notebook instance, or an EC2 instance with enough EBS storage (say 800 GB).
Log in to the notebook instance, open a shell, and copy the data from S3 to the local disk.
Unzip the data.
Copy the unzipped data back to S3 (the copy/unzip/re-upload steps are sketched after this list).
Terminate the instance and delete the EBS volume to avoid extra cost.
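Here is a rough sketch of the copy/unzip/re-upload steps scripted with boto3 rather than shell commands (the bucket, prefixes, and scratch directory are placeholders):

    # Sketch: download each archive, extract it, and upload the extracted files
    # back to S3. The bucket, prefixes, and scratch directory are placeholders.
    import os
    import zipfile

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-dataset-bucket"        # hypothetical
    ZIP_PREFIX = "zipped/"              # hypothetical
    OUT_PREFIX = "unzipped/"            # hypothetical
    SCRATCH = "/home/ec2-user/scratch"  # on the large EBS volume

    os.makedirs(SCRATCH, exist_ok=True)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=ZIP_PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.endswith(".zip"):
                continue
            local_zip = os.path.join(SCRATCH, os.path.basename(key))
            s3.download_file(BUCKET, key, local_zip)
            extract_dir = local_zip[:-len(".zip")]
            with zipfile.ZipFile(local_zip) as zf:
                zf.extractall(extract_dir)
            os.remove(local_zip)  # free local disk before uploading
            for root, _, files in os.walk(extract_dir):
                for name in files:
                    path = os.path.join(root, name)
                    rel = os.path.relpath(path, extract_dir)
                    s3.upload_file(path, BUCKET, OUT_PREFIX + rel)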
This should be fast (no less than 250 MB/s), as the instance has high bandwidth to S3 within the same AWS Region.
Assuming you are referring to training when you talk about using the dataset in SageMaker, read this guide on the different storage options for large datasets.
Following AWS-recommended best practices, we have organization-wide CloudTrail and VPC flow logging configured to log to a centralized log-archive account. Since CloudTrail and VPC Flow Logs are organization-wide across multiple regions, we're getting a high number of new log files saved to S3 daily. Most of these files are quite small (several KB).
The high number of small log files is fine while they're in the STANDARD storage class, since you just pay for total data size without any minimum file size overhead. However, we've found it challenging to deep-archive these files after 6 or 12 months, since any storage class other than STANDARD (such as GLACIER) has some form of minimum billable object size (STANDARD-IA has a 128 KB minimum; GLACIER doesn't have a minimum size but adds 40 KB of metadata per object, etc.).
What are the best practices for archiving a large number of small S3 objects? I could use a Lambda to download multiple files, re-bundle them into a larger file, and re-store it, but that would be pretty expensive in terms of compute time and GET/PUT requests. As far as I can tell, S3 Batch Operations has no support for this. Any suggestions?
Consider using a tool like s3-utils concat. This is not an AWS-supported tool, but an open-source tool that performs the type of action you require.
You'll probably want the pattern matching syntax which will allow you to create a single file for each day's logs.
$ s3-utils concat my.bucket.name 'date-hierarchy/(\d{4})/(\d{2})/(\d{2})/*.gz' 'flat-hierarchy/$1-$2-$3.gz'
This could be run as a daily job so each day is condensed into one file. It is definitely recommended to run this from a resource on the Amazon network (i.e., inside your VPC with an S3 gateway endpoint attached) to improve file transfer performance and avoid data-transfer-out fees.
I have about 80,000,000 50KB files on S3 (4TB), which I want to transfer to Glacier DA.
I have come to realize there's a cost inefficiency in transferring a lot of small files to Glacier.
Assuming I don't mind archiving my files into a single (or multiple) tar/zips - what would be the best practice to transition those files to Glacier DA?
It is important to note that I only have these files on S3, and not on any local machine.
The most efficient way would be:
Launch an Amazon EC2 instance in the same region as the bucket. Choose an instance type with high-bandwidth networking (e.g., the t3 family). Launch it with spot pricing because you can withstand the small chance that it is stopped. Assign plenty of EBS disk space. (Alternatively, you could choose a Storage Optimized instance since the disk space is included for free, but the instance is more expensive. Your choice!)
Download a subset of the files to the instance using the AWS Command-Line Interface (CLI) by specifying a path (subdirectory) to copy. Don't try and do it all at once!
Zip/compress the files on the EC2 instance
Upload the compressed files to S3 using --storage-class DEEP_ARCHIVE
Check that everything seems good, and repeat for another subset! (A scripted version of the download/compress/upload steps is sketched below.)
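A rough sketch of those steps scripted with boto3 and tarfile instead of the AWS CLI (the bucket, prefix, key names, and paths are placeholders):

    # Sketch: bundle one S3 path/subdirectory of small objects into a tar.gz
    # and re-upload it with the DEEP_ARCHIVE storage class. Names are placeholders.
    import os
    import tarfile

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-log-bucket"      # hypothetical
    SUBSET_PREFIX = "2020/01/"    # one path/subdirectory at a time
    SCRATCH = "/mnt/ebs/scratch"  # on the instance's EBS volume

    os.makedirs(SCRATCH, exist_ok=True)
    archive_path = os.path.join(SCRATCH, "2020-01.tar.gz")

    with tarfile.open(archive_path, "w:gz") as tar:
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=SUBSET_PREFIX):
            for obj in page.get("Contents", []):
                local = os.path.join(SCRATCH, os.path.basename(obj["Key"]))
                s3.download_file(BUCKET, obj["Key"], local)
                tar.add(local, arcname=obj["Key"])  # keep the original key as the path
                os.remove(local)

    # Upload the bundle straight into Glacier Deep Archive.
    s3.upload_file(archive_path, BUCKET, "archive/2020-01.tar.gz",
                   ExtraArgs={"StorageClass": "DEEP_ARCHIVE"})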
The above steps would incur very little charge, since you can terminate the EC2 instance when it is no longer needed, and EBS is only charged while the volumes exist.
If it takes too long to list a subset of the files, you might consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You can then use this list to specifically copy files, or identify a path/subdirectory to copy.
As an extra piece of advice: if your system is continuing to collect even more files, you might consider collecting the data in a different way (e.g., streaming to Kinesis Firehose to batch data together), or combining the data on a regular basis rather than letting it creep up to so many files again. Where possible, fewer, larger files are much easier to work with in downstream processes.
In the AWS developer docs for SageMaker, they recommend using PIPE mode to directly stream large datasets from S3 to the model training containers (since it's faster, uses less disk storage, reduces training time, etc.).
However, they don't include information on whether this data streaming transfer is charged for (they only include data transfer pricing for their model building & deployment stages, not training).
So, I wanted to ask if anyone knows whether this data transfer in PIPE mode is charged for. If it is, I don't get how this would be recommended for large datasets, since streaming a few epochs for each model iteration could get prohibitively expensive (my dataset, for example, is 6.3 TB on S3).
Thank you!
You are charged for the S3 GET calls that you make, similarly to what you would be charged if you used the FILE option for training. However, these charges are usually marginal compared to the alternatives.
When you are using FILE mode, you need to pay for the local EBS on the instances, and for the extra time that your instances are up and only copying the data from S3. If you are running multiple epochs, you will not benefit much from PIPE mode; however, when you have that much data (6.3 TB), you don't really need to run multiple epochs.
The best usage of PIPE mode is when you can make a single pass over the data. In the era of big data, this is a better model of operation, as you can't retrain your models often. In SageMaker, you can point to your "old" model in the "model" channel and your "new" data in the "train" channel, and benefit from PIPE mode to the maximum.
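A rough sketch of what that looks like with the SageMaker Python SDK (the image URI, role, instance type, and S3 paths are placeholders, and whether the "model" channel applies depends on your algorithm/container supporting incremental training):

    # Sketch: a training job that streams the "train" channel with Pipe mode and
    # warm-starts from an earlier model artifact via a "model" channel. All URIs,
    # the role, and the channel names are placeholders for illustration.
    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri="<your-training-image-uri>",        # placeholder
        role="<your-sagemaker-execution-role-arn>",   # placeholder
        instance_count=1,
        instance_type="ml.p3.2xlarge",
        input_mode="Pipe",  # stream from S3 instead of copying to local storage
    )

    estimator.fit({
        "train": "s3://my-bucket/new-data/",               # streamed via Pipe mode
        "model": "s3://my-bucket/old-model/model.tar.gz",  # previous model artifact
    })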
I just realized that on S3's official pricing page, it says the following under the Data transfer section:
Transfers between S3 buckets or from Amazon S3 to any service(s) within the same AWS Region are free.
And since my S3 bucket and my Sagemaker instances will be in the same AWS region, the data transfer costs should be free.
I'm currently trying to upload and then untar a very large file (1.3 TB) into Google Cloud Storage at the lowest price.
I initially thought about creating a really cheap instance just to download the file and put it in a bucket, then creating a new instance with a good amount of RAM to untar the file and then put the result in a new bucket.
However, since the bucket price depends on the number of I/O requests, I'm not sure it's the best option, and it might not be the best for performance either.
What would be the best strategy to untar the file in the cheapest way?
First some background information on pricing:
Google has pretty good documentation about how to ingest data into GCS. From that guide:
Today, when you move data to Cloud Storage, there are no ingress traffic charges. The gsutil tool and the Storage Transfer Service are both offered at no charge. See the GCP network pricing page for the most up-to-date pricing details.
The "network pricing page" just says:
[Traffic type: Ingress] Price: No charge, unless there is a resource such as a load balancer that is processing ingress traffic. Responses to requests count as egress and are charged.
There is additional information on the GCS pricing page about your idea to use a GCE VM to write to GCS:
There are no network charges for accessing data in your Cloud Storage buckets when you do so with other GCP services in the following scenarios:
Your bucket and GCP service are located in the same multi-regional or regional location. For example, accessing data in an asia-east1 bucket with an asia-east1 Compute Engine instance.
From later in that same page, there is also information about the pre-request pricing:
Class A Operations: storage.*.insert[1]
[1] Simple, multipart, and resumable uploads with the JSON API are each considered one Class A operation.
The cost for Class A operations is per 10,000 operations, and is either $0.05 or $0.10 depending on the storage type. I believe you would only be doing 1 Class A operation (or at most, 1 Class A operation per file that you upload), so this probably wouldn't add up to much usage overall.
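To put a rough number on the worst case: if the tarball expanded into 200,000 individual files (a purely illustrative count), that would be 200,000 insert operations, which at $0.05 to $0.10 per 10,000 operations comes to roughly $1 to $2.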
Now to answer your question:
For your use case, it sounds like you want to have the files in the tarball be individual files in GCS (as opposed to just having a big tarball stored in one file in GCS). The first step is to untar it somewhere, and the second step is to use gsutil cp to copy it to GCS.
Unless you have to (i.e., not enough space on the machine that holds the tarball now), I wouldn't recommend copying the tarball to an intermediate VM in GCE before uploading to GCS, for two reasons:
gsutil cp already handles a bunch of annoying edge cases for you: parallel uploads, resuming an upload in case there's a network failure, retries, checksum comparisons, etc.
Using any GCE VMs will add cost to this whole copy operation -- costs for the disks plus costs for the VMs themselves.
If you want to try the procedure out with something lower-risk first, make a small directory with a few megabytes of data and a few files and use gsutil cp to copy it, then check how much you were charged for that. From the GCS pricing page:
Charges accrue daily, but Cloud Storage bills you only at the end of the billing period. You can view unbilled usage in your project's billing page in the Google Cloud Platform Console.
So you'd just have to wait a day to see how much you were billed.