I am trying to train a machine learning model on AWS EC2. I have over 50GB of data currently stored in an AWS S3 bucket. When training my model on EC2, I want to be able to access this data.
Essentially, I want to be able to call this command:
python3 train_model.py --train_files /data/train.csv --dev_files /data/dev.csv --test_files /data/test.csv
where /data/train.csv is my S3 bucket s3://data/. How can I do this? I currently only see ways to cp my S3 data into my EC2.
You can adapt your code to read the data directly from S3 using boto3.
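For example, a minimal sketch that streams a CSV object into pandas, assuming boto3 and pandas are installed (the bucket and key names mirror the question and are placeholders):

import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

def read_csv_from_s3(bucket: str, key: str) -> pd.DataFrame:
    # Fetch the object into memory and parse it as CSV, without writing to disk.
    obj = s3.get_object(Bucket=bucket, Key=key)
    return pd.read_csv(io.BytesIO(obj["Body"].read()))

# Hypothetical names for illustration only.
train_df = read_csv_from_s3("data", "train.csv")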
But if you want to access your S3 bucket as if it were another local filesystem, I would consider s3fs-fuse, explained further here.
Another option would be to use the aws-cli to sync your data to a local folder.
How can I do this? I currently only see ways to cp my S3 data into my EC2.
S3 is an object storage system. It does not allow direct access to or reading of files like a regular file system.
Thus, to read your files you need to download them first (downloading in parts is also possible), or have some third-party software do it for you, such as s3-fuse. You can download the data to your instance, or store it on an external file system (e.g. EFS).
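As an illustration of downloading in parts, a minimal sketch of a ranged GET with boto3 (the bucket and key are placeholders):

import boto3

s3 = boto3.client("s3")

# Fetch only the first 100 MiB of a (hypothetical) large object instead of the whole file.
resp = s3.get_object(
    Bucket="my-bucket",
    Key="data/train.csv",
    Range="bytes=0-104857599",  # inclusive byte range: 0 .. 100 MiB - 1
)
chunk = resp["Body"].read()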
It's not clear from your question whether you have one 50 GB CSV file or multiple smaller ones. If you have one large 50 GB CSV file, you can reduce the amount of data read at once, if not all of it is needed, by using S3 Select:
With S3 Select, you can use a simple SQL expression to return only the data from the store you’re interested in, instead of retrieving the entire object. This means you’re dealing with an order of magnitude less data which improves the performance of your underlying applications.
Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format.
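A minimal sketch of an S3 Select query with boto3 (the bucket, key, column name, and SQL expression are placeholders):

import boto3

s3 = boto3.client("s3")

# Ask S3 to filter the CSV server-side and return only the matching rows.
response = s3.select_object_content(
    Bucket="my-bucket",
    Key="data/train.csv",
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s WHERE s.\"label\" = '1'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The result comes back as an event stream of record chunks.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")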
Related
For an AI project I want to train a model over a dataset which is about 300 GB. I want to use the AWS SageMaker framework.
In the SageMaker documentation, they write that SageMaker can import data from an AWS S3 bucket. Since the dataset is huge, I zipped it (into several zip files) and uploaded it to an S3 bucket. It took several hours. However, in order to use it I need to unzip the dataset. There are several options:
Unzip directly in S3. This might be impossible to do. See refs below.
Upload the uncompressed data directly. I tried this, but it takes too much time; it stopped in the middle, having uploaded only 9% of the data.
Upload the data to an AWS EC2 machine and unzip it there. But can I import the data into SageMaker from EC2?
Many solutions offer a Python script that downloads the data from S3, unzips it locally (on the desktop) and then streams it back to the S3 bucket (see references below). Since I have the original files, I could simply upload them to S3, but this takes too long (see option 2).
Added in Edit:
I am now trying to upload the uncompressed data using AWS CLI V2.
References:
How to extract files in S3 on the fly with boto3?
https://community.talend.com/s/question/0D53p00007vCjNSCA0/unzip-aws-s3?language=en_US
https://www.linkedin.com/pulse/extract-files-from-zip-archives-in-situ-aws-s3-using-python-tom-reid
https://repost.aws/questions/QUI8fTOgURT-ipoJmN7qI_mw/unzipping-files-from-s-3-bucket
https://dev.to/felipeleao18/how-to-unzip-zip-files-from-s3-bucket-back-to-s3-29o9
The strategy most commonly used, and also the least expensive (since local storage is billed per GB), is not to use the disk of the EC2 instance running the training job but rather to take advantage of the high transfer rate from the bucket into instance memory.
This assumes the bucket resides in the same region as the EC2 instance; otherwise you have to pay extra to improve the transfer performance.
You can implement strategies for reading files in parallel, or in chunks, in your own script, but my advice is to use frameworks that automate this, such as dask/pyspark/pyarrow (in case you need to read dataframes), or to review how these archives are stored and whether they can be transformed into a more convenient format (e.g., a CSV converted to parquet with gzip compression).
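For instance, a minimal sketch of such a conversion with pandas (which relies on pyarrow or fastparquet being installed; the file names are placeholders):

import pandas as pd

# Read a (hypothetical) CSV and rewrite it as gzip-compressed parquet,
# which is smaller and faster to load selectively than raw CSV.
df = pd.read_csv("train.csv")
df.to_parquet("train.parquet.gzip", compression="gzip")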
If the nature of the data is different (e.g., images or other), an appropriate lazy data-loading strategy must be identified.
For example, for your zip-archive problem, you can easily get the list of your files under an S3 prefix and read them sequentially.
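A minimal sketch of listing and iterating over the objects under a prefix with boto3 (the bucket and prefix are placeholders):

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Walk every object under a (hypothetical) prefix and stream each one in turn.
for page in paginator.paginate(Bucket="my-bucket", Prefix="zipped/"):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket="my-bucket", Key=obj["Key"])["Body"]
        data = body.read()  # replace with your own per-file processing
        print(obj["Key"], len(data))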
You already have the data in S3 zipped. What's left is:
Provision a SageMaker notebook instance, or an EC2 instance with enough EBS storage (say 800GB)
Log in to the instance, open a shell, and copy the data from S3 to the local disk.
Unzip the data.
Copy the unzipped data back to S3.
Terminate the instance and delete the EBS volume to avoid extra cost.
This should be fast (no less than 250 MB/s), as the instance has high bandwidth to S3 within the same AWS Region.
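A minimal Python sketch of the copy/unzip/re-upload steps above, assuming enough local disk space and using placeholder bucket, key, and prefix names:

import os
import zipfile

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"          # hypothetical bucket name
zip_key = "raw/dataset.zip"   # hypothetical key of one of the zip files

# 1. Download the archive to the local (EBS) disk.
s3.download_file(bucket, zip_key, "dataset.zip")

# 2. Unzip it locally.
with zipfile.ZipFile("dataset.zip") as zf:
    zf.extractall("dataset/")

# 3. Upload the extracted files back to S3 under a new prefix.
for root, _, files in os.walk("dataset/"):
    for name in files:
        path = os.path.join(root, name)
        key = "unzipped/" + os.path.relpath(path, "dataset/")
        s3.upload_file(path, bucket, key)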
Assuming you are referring to Training when you talk about using the dataset in SageMaker, read this guide on the different storage options for large datasets.
There is a requirement to copy 10 TB of data from Azure Blob to S3, and also 10 TB of data from Synapse to Redshift.
What is the best way to achieve these 2 migrations?
For Redshift - you could export from Azure Synapse Analytics to blob storage in a compatible format, ideally compressed, and then copy the data to S3. It is pretty straightforward to import data from S3 into Redshift.
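As an example, a rough sketch of kicking off the S3-to-Redshift load with boto3's Redshift Data API (the cluster, database, user, table, bucket, and IAM role are all placeholders):

import boto3

rsd = boto3.client("redshift-data")

# Run a COPY statement that loads files from a (hypothetical) S3 prefix into a table.
rsd.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="analytics",
    DbUser="admin",
    Sql=(
        "COPY public.events "
        "FROM 's3://my-bucket/export/' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role' "
        "FORMAT AS PARQUET;"
    ),
)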
You may need a VM instance to read from Azure Storage and put the data into AWS S3 (it doesn't matter where it runs). The simplest option seems to be using the default CLIs (Azure and AWS) to read the content onto the migration instance and write it to the target bucket. Personally, though, I'd probably write an application that records checkpoints, so that if the migration process is interrupted for any reason it wouldn't need to start from scratch.
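A rough sketch of such a checkpointed copy loop, assuming the azure-storage-blob and boto3 SDKs and placeholder container, bucket, and connection-string values (very large blobs would additionally need streamed/multipart uploads):

import boto3
from azure.storage.blob import BlobServiceClient

CHECKPOINT_FILE = "migrated.txt"  # names of blobs already copied, one per line

azure = BlobServiceClient.from_connection_string("<azure-connection-string>")
container = azure.get_container_client("source-container")  # hypothetical container
s3 = boto3.client("s3")

# Load the set of blobs that were already migrated in a previous run.
try:
    with open(CHECKPOINT_FILE) as f:
        done = {line.strip() for line in f}
except FileNotFoundError:
    done = set()

with open(CHECKPOINT_FILE, "a") as checkpoint:
    for blob in container.list_blobs():
        if blob.name in done:
            continue  # skip work finished before the interruption
        data = container.download_blob(blob.name).readall()
        s3.put_object(Bucket="target-bucket", Key=blob.name, Body=data)
        checkpoint.write(blob.name + "\n")
        checkpoint.flush()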
There are a few options you may "tweak" depending on the files to move - whether there are many small files or fewer large files, which region you are moving from and to, and so on.
https://aws.amazon.com/premiumsupport/knowledge-center/s3-upload-large-files/
You may also consider using AWS S3 Transfer Acceleration, which may or may not help.
Please note that every large cloud provider charges for outbound data egress; for 10 TB this can be a considerable cost.
I have a project that requires large amounts of CSV data to be transformed regularly. This data will be stored in S3, and I am using an EC2 instance running Ubuntu Server 16.04 to perform edits to the data and Apache Airflow to route the data. Downloading and re-uploading this data to S3 is quite expensive. Is there a way I can edit this CSV data in memory without downloading the file to local storage on the Ubuntu instance?
Thank you in advance
In general, you could write a program that fetches the CSV file from S3 (using the S3 SDK), transforms it in memory, and then saves it back to S3. But it will still require "downloading and reuploading" - the only difference is that the file won't be physically stored on the local disk but kept in program memory.
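A minimal sketch of that approach with boto3 and pandas (the bucket, keys, and the transformation itself are placeholders):

import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
bucket, key = "my-bucket", "input/data.csv"  # hypothetical names

# Fetch the object straight into memory and parse it as CSV.
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

# Apply whatever transformation you need (placeholder example).
df = df.dropna()

# Serialize back to CSV in memory and upload without touching the local disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
s3.put_object(Bucket=bucket, Key="output/data.csv", Body=buffer.getvalue())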
You could also use s3fs to mount the S3 bucket to a directory on the server and perform the requested operations directly on the files.
But the files still need to be downloaded from S3 and re-uploaded there (although this happens on the fly and is invisible to you).
Hope that helps.
I have a requirement to transfer data (one time) from on-prem to AWS S3. The data size is around 1 TB. I was going through AWS DataSync, Snowball, etc., but these managed services seem better suited to migrations where the data is in the petabyte range. Can someone suggest the best way to transfer the data securely and cost-effectively?
You can use the AWS Command-Line Interface (CLI). This command will copy data to Amazon S3:
aws s3 sync c:/MyDir s3://my-bucket/
If there is a network failure or timeout, simply run the command again. It only copies files that are not already present in the destination.
The time taken will depend upon the speed of your Internet connection.
You could also consider using AWS Snowball, which is a piece of hardware that is sent to your location. It can hold 50TB of data and costs $200.
If you have no specific requirements (apart from the fact that the transfer needs to be encrypted and the data size is 1 TB), then I would suggest you stick to something plain and simple. S3 supports an object size of up to 5 TB, so you wouldn't run into trouble there. I don't know if your data is made up of many smaller files or one big file (or zip), but in essence it's all the same. Since the endpoints are all encrypted, you should be fine; if you're worried, you can encrypt your files beforehand, and they will also be encrypted while stored. To get to the point: you can use API tools for the transfer, or file-explorer-type tools that also have S3 connectivity (e.g. https://www.cloudberrylab.com/explorer/amazon-s3.aspx). One other point: the cost-effectiveness of storage/transfer depends on how frequently you need the data; if it is just a backup or a just-in-case copy, archiving to Glacier is much cheaper.
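For reference, a minimal boto3 sketch of uploading a file with server-side encryption requested (the local path, bucket, and key are placeholders):

import boto3

s3 = boto3.client("s3")

# Upload a (hypothetical) local file and ask S3 to encrypt it at rest with SSE-S3.
s3.upload_file(
    "backup/archive.zip",
    "my-bucket",
    "backups/archive.zip",
    ExtraArgs={"ServerSideEncryption": "AES256"},
)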
1 TB is large but it's not so large that it'll take you weeks to get your data onto S3. However if you don't have a good upload speed, use Snowball.
https://aws.amazon.com/snowball/
Snowball is a device shipped to you which can hold up to 100TB. You load your data onto it and ship it back to AWS and they'll upload it to the S3 bucket you specify when loading the data.
This can be done in multiple ways.
Using the AWS CLI, we can copy files from local storage to S3.
AWS Transfer using FTP or SFTP (AWS SFTP)
Please refer to the AWS documentation for details.
There are tools like the CloudBerry client, which has a UI.
You can use the AWS DataSync tool.
I have a large amount of data stored on Amazon S3 in the form of objects.
For example, I have a user who has 200+ GB of photos (about 100,000+ objects) stored on Amazon S3. Each object is a photo, with an average size of 5 MB.
Now I want to give the user a link to download the data.
Currently, what I am doing:
Using s3cmd, I copy all the objects from S3 to EC2.
Then, using the zip or tar command, I create a zip archive.
After the zip process is complete, I move the zip file back to S3.
Then I create a signed link that I send to the user by email.
But this process takes a very long time; most of the time it runs into out-of-memory issues and storage issues, and it is very slow.
I need to know:
Is there any way I can speed up this process?
Is there any third-party service/tool that can quickly create a zip of my files and send it to the user?
Or any other third-party solution - I am ready to pay for it.
Try using EMR (Elastic MapReduce) with S3DistCp, which can be helpful in your situation. For EMR you have to create a cluster and then run your job.
The direction you are following is correct at a high level. However, there isn't a straightforward answer that will solve your problem in a single shot.
These are the things you can try:
Ask your user to create an AWS account (or create an IAM user) and provide read-only access to that user/account.
While uploading to S3, you can group the photos into bundles of 50 or 100, compress each bundle, and then put it in S3 (from EC2, i.e. during creation of the media itself).
Export to external media from S3 using Amazon Import/Export.
S3DistCp is a tool that can greatly help in cases such as this.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
S3DistCp can copy from and to S3 using an EMR cluster instead of a single instance, and it can compress objects on the fly.
However, in "big data" processing, the user will probably have a better experience if you either create the bundles in advance proactively or start the process asynchronously on-demand and notify the user on completion with the download link.