Is there any way to upload 50,000 image files to an Amazon S3 bucket from a list of URLs

Is there any way to upload 50,000 image files to an Amazon S3 bucket? The 50,000 image file URLs are saved in a .txt file. Can someone please tell me a good way to do this?

It sounds like your requirement is: For each image URL listed in a text file, copy the images to an Amazon S3 bucket.
There is no built-in capability in Amazon S3 to do this. Instead, you would need to write an app (sketched briefly after the list below) that:
Reads the text file and, for each URL
Downloads the image
Uploads the image to Amazon S3
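A minimal sketch of such an app, assuming Python with boto3; the file name and bucket name are placeholders, not values from the question:

```python
# Minimal sketch: read URLs from a text file, download each image,
# and upload it to S3. "urls.txt" and "my-image-bucket" are placeholders.
import os
import urllib.request
from urllib.parse import urlparse

import boto3

s3 = boto3.client("s3")
BUCKET = "my-image-bucket"  # placeholder bucket name

with open("urls.txt") as f:          # the text file with one URL per line
    for line in f:
        url = line.strip()
        if not url:
            continue
        # derive an object key from the last path segment of the URL
        key = os.path.basename(urlparse(url).path)
        data = urllib.request.urlopen(url).read()          # download the image
        s3.put_object(Bucket=BUCKET, Key=key, Body=data)   # upload it to S3
```

For 50,000 files you would probably also want retries and some parallelism, but the loop above is the core of it.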
Running this on an Amazon EC2 instance would be fast, due to the low latency between EC2 and S3.
You could also get fancy and do it via Amazon EMR. That would be the fastest option due to parallel processing, but it requires knowledge of how to use Hadoop.
If you have a local copy of the images, you could order an AWS Snowball and use it to transfer the files to Amazon S3. However, it would probably be faster just to copy the files over the Internet (rough guess: at 1 MB per file, the total volume is only about 50 GB).

Related

Import data to Amazon AWS SageMaker from S3 or EC2

For an AI project I want to train a model on a dataset of about 300 GB. I want to use the AWS SageMaker framework.
The SageMaker documentation says that SageMaker can import data from an AWS S3 bucket. Since the dataset is huge, I zipped it (into several zip files) and uploaded it to an S3 bucket, which took several hours. However, in order to use it I need to unzip the dataset. There are several options:
Unzip directly in S3. This might be impossible to do; see the references below.
Upload the uncompressed data directly. I tried this, but it takes too much time and stopped in the middle, having uploaded only 9% of the data.
Upload the data to an AWS EC2 machine and unzip it there. But can I import the data into SageMaker from EC2?
Many solutions offer a Python script that downloads the data from S3, unzips it locally (on the desktop), and then streams it back to the S3 bucket (see the references below). Since I have the original files I could simply upload them to S3, but that takes too long (see 2).
Added in Edit:
I am now trying to upload the uncompressed data using AWS CLI V2.
References:
How to extract files in S3 on the fly with boto3?
https://community.talend.com/s/question/0D53p00007vCjNSCA0/unzip-aws-s3?language=en_US
https://www.linkedin.com/pulse/extract-files-from-zip-archives-in-situ-aws-s3-using-python-tom-reid
https://repost.aws/questions/QUI8fTOgURT-ipoJmN7qI_mw/unzipping-files-from-s-3-bucket
https://dev.to/felipeleao18/how-to-unzip-zip-files-from-s3-bucket-back-to-s3-29o9
The most commonly used strategy, and also the least expensive (since storage has its own cost per GB), is not to use the disk of the EC2 instance that runs the training job, but rather to take advantage of the high transfer rate from the bucket into instance memory.
This assumes that the bucket resides in the same region as the EC2 instance; otherwise you have to pay to increase the transfer performance.
You can implement strategies for reading files in parallel or in chunks in your own script, but my advice is to use established frameworks such as Dask/PySpark/PyArrow (in case you need to read dataframes), or to check whether the archives can be transformed into a more convenient storage format (e.g., a CSV transformed into parquet.gzip).
If the data is of a different nature (e.g., images or other files), an appropriate lazy data-loading strategy must be identified.
For example, for your zip problem, you can easily get the list of your files from an S3 folder and read them sequentially, as sketched below.
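A minimal boto3 sketch of that listing-plus-sequential-reading pattern; the bucket and prefix names are made up, and process() is a hypothetical stand-in for whatever per-file work you do:

```python
# Illustrative only: stream objects under a prefix one at a time instead
# of copying the whole dataset to local disk first.
import boto3

s3 = boto3.client("s3")
bucket = "my-training-bucket"   # placeholder
prefix = "dataset/zipped/"      # placeholder "folder"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        # GetObject returns a streaming body that can be read lazily
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
        process(body)  # hypothetical per-file processing function
```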
You already have the data in S3 zipped. What's left is:
Provision a SageMaker notebook instance, or an EC2 instance with enough EBS storage (say 800GB)
Log in to the notebook instance, open a shell, and copy the data from S3 to the local disk.
Unzip the data.
Copy the unzipped data back to S3.
Terminate the instance and delete the EBS volume to avoid extra cost.
This should be fast (no less than 250 MB/sec), as the instance has high bandwidth to S3 within the same AWS Region. (A rough sketch of the copy/unzip/copy-back steps is shown below.)
Assuming you refer to Training when talking about using the dataset in SageMaker, read this guide on different storage options for large datasets.
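As a rough illustration of the copy / unzip / copy-back steps, here is a minimal boto3 sketch you could run from the notebook or EC2 instance; the bucket name, keys, and local paths are placeholders, not values from the question:

```python
# Sketch only: download one zip archive, extract it locally, and upload
# the extracted files back to S3 under a new prefix.
import os
import zipfile

import boto3

s3 = boto3.client("s3")
BUCKET = "my-dataset-bucket"   # placeholder

# 1. copy one zip archive from S3 to the local (EBS) disk
s3.download_file(BUCKET, "zipped/part-01.zip", "/tmp/part-01.zip")

# 2. unzip it locally
with zipfile.ZipFile("/tmp/part-01.zip") as zf:
    zf.extractall("/tmp/unzipped")

# 3. copy the extracted files back to S3 under a new prefix
for root, _, files in os.walk("/tmp/unzipped"):
    for name in files:
        local_path = os.path.join(root, name)
        key = "unzipped/" + os.path.relpath(local_path, "/tmp/unzipped")
        s3.upload_file(local_path, BUCKET, key)
```

In practice you would loop this over all the zip archives (or just use the AWS CLI from the shell), but the steps are the same.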

Uploading a file to S3 after downloading it from a link

I have a link in a request which points to some PDF/image content. My requirement is to upload the content at the link to the S3 server.
Do I have to download it and then upload the file? I have too many calls and limited file storage on the machine. Or is there any other way to achieve this?
You must upload the file to Amazon S3.
It is not possible to tell Amazon S3 to retrieve a file from a URL.
My requirement is to upload the content in the link to the S3 server.
Yes - you need some compute resource for that; S3 itself won't do it.
Do I have to download it and then upload the file?
Or is there any other way to achieve this?
The compute resource (logic) doesn't need to reside on your computer. You may use an AWS compute resource close to S3, such as Lambda, EC2, or ECS; you can decide based on the predicted load or other requirements. A rough sketch of the Lambda approach is shown below.
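A hedged sketch of a Lambda-style handler that fetches a URL and writes the content to S3 without keeping a local copy; the event fields ("url", "key") and the bucket name are assumptions made for illustration:

```python
# Sketch only: fetch the content behind a URL and store it in S3.
import urllib.request

import boto3

s3 = boto3.client("s3")
BUCKET = "my-target-bucket"  # placeholder

def handler(event, context):
    url = event["url"]   # assumed event field
    key = event["key"]   # assumed event field
    with urllib.request.urlopen(url) as response:
        s3.put_object(
            Bucket=BUCKET,
            Key=key,
            Body=response.read(),
            ContentType=response.headers.get("Content-Type", "application/octet-stream"),
        )
    return {"bucket": BUCKET, "key": key}
```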

Download files from an AWS S3 bucket in parallel

I want to download millions of files from an S3 bucket, which would take more than a week to download one by one. Is there any way or command to download those files in parallel using a shell script?
Thanks,
AWS CLI
You can certainly issue GetObject requests in parallel. In fact, the AWS Command-Line Interface (CLI) does exactly that when transferring files, so that it can take advantage of available bandwidth. The aws s3 sync command will transfer the content in parallel.
See: AWS CLI S3 Configuration
If your bucket has a large number of objects, it can take a long time to list the contents of the bucket. Therefore, you might want to sync the bucket by prefix (folder) rather than trying to sync it all at once.
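If you need finer control than the CLI offers, a rough boto3 alternative is to list the keys under a prefix and download them with a thread pool; the bucket, prefix, and local directory below are placeholders:

```python
# Rough alternative to "aws s3 sync" for parallel downloads with boto3.
import os
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = "my-big-bucket"   # placeholder
PREFIX = "images/2023/"    # placeholder: sync one prefix at a time
DEST = "/data/images"      # placeholder local directory

s3 = boto3.client("s3")    # boto3 clients are safe to share across threads

def download(key):
    local_path = os.path.join(DEST, os.path.relpath(key, PREFIX))
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    s3.download_file(BUCKET, key, local_path)

# list all keys under the prefix
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

# download with a pool of worker threads instead of one object at a time
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(download, keys))
```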
AWS DataSync
You might instead want to use AWS DataSync:
AWS DataSync is an online data transfer service that simplifies, automates, and accelerates copying large amounts of data to and from AWS storage services over the internet or AWS Direct Connect... Move active datasets rapidly over the network into Amazon S3, Amazon EFS, or Amazon FSx for Windows File Server. DataSync includes automatic encryption and data integrity validation to help make sure that your data arrives securely, intact, and ready to use.
DataSync uses a protocol that takes full advantage of available bandwidth and will manage the parallel downloading of content. A fee of $0.0125 per GB applies.
AWS Snowball
Another option is to use AWS Snowcone (8TB) or AWS Snowball (50TB or 80TB), which are physical devices that you can pre-load with content from S3 and have it shipped to your location. You then connect it to your network and download the data. (It works in reverse too, for uploading bulk data to Amazon S3).

Uploading a huge number of files to S3 is very slow

I am uploading 1.8 GB of data consisting of 500,000 small XML files to an S3 bucket.
When I upload it from my local machine, it takes a very long time (7 hours).
When I zip it and upload it, it takes 5 minutes.
But I can't simply zip it, because later on I would need something in AWS to unzip it.
So is there any way to make this upload faster? The file names are all different, not a running number.
Transfer Acceleration is enabled.
Please suggest how I can optimize this.
You can always upload the zip file to an EC2 instance then unzip it there and sync it to the S3 bucket.
The Instance Role must have permissions to put Objects into S3 for this to work.
I also suggest you look into configuring an S3 VPC Gateway Endpoint before doing this: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html

Extract .gz files in S3 automatically

I'm trying to find a solution to extract ALB log files in .gz format when they're automatically uploaded from the ALB to S3.
My bucket structure is like this
/log-bucket
..alb-1/AWSLogs/account-number/elasticloadbalancing/ap-northeast-1/2018/log.gz
..alb-2/AWSLogs/account-number/elasticloadbalancing/ap-northeast-1/2018/log.gz
..alb-3/AWSLogs/account-number/elasticloadbalancing/ap-northeast-1/2018/log.gz
Basically, every 5 minutes, each ALB automatically pushes logs to the corresponding S3 bucket. I'd like to extract the new .gz files right at that time, in the same bucket.
Is there any way to handle this?
I noticed that we can use Lambda function but not sure where to start. A sample code would be greatly appreciated!
Your best choice would probably be to have an AWS Lambda function subscribed to S3 events. Whenever a new object gets created, this Lambda function would be triggered. The Lambda function could then read the file from S3, extract it, write the extracted data back to S3 and delete the original one.
How that works is described in Using AWS Lambda with Amazon S3.
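As a starting point, here is a hedged sketch of what such a Lambda handler could look like, assuming the function is subscribed to ObjectCreated events on the log bucket; stripping the .gz suffix for the output key is an arbitrary choice for this sketch:

```python
# Sketch of a possible handler, not production code.
import gzip
import urllib.parse

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if not key.endswith(".gz"):
            # also prevents the function from re-processing its own output
            continue
        compressed = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        data = gzip.decompress(compressed)
        new_key = key[: -len(".gz")]              # e.g. ".../log.gz" -> ".../log"
        s3.put_object(Bucket=bucket, Key=new_key, Body=data)
        s3.delete_object(Bucket=bucket, Key=key)  # remove the original .gz
```

Note that this reads the whole object into memory, which is fine for 5-minute ALB log chunks but would need a streaming approach for very large files.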
That said, you might also want to reconsider if you really need to store uncompressed logs in S3. Compressed files are not only cheaper, as they don't take as much storage space as uncompressed ones, but they are usually also faster to process, as the bottleneck in most cases is network bandwidth for transferring the data and not available CPU-resources for decompression. Most tools also support working directly with compressed files. Take Amazon Athena (Compression Formats) or Amazon EMR (How to Process Compressed Files) for example.