Working with AWS S3 Large Public Data Set


AWS has several public "big data" data sets available. Some are hosted for free on EBS, and others, like the NASA NEX climate data, are hosted on S3. I have found more discussion on how to work with those hosted on EBS, but have been unable to get an S3 data set onto an EC2 instance fast enough to actually work with the data.
So my issue is getting the public big data sets (~256 TB) "into" an EC2 instance. One approach I tried was to mount the public S3 bucket on my EC2 instance, as in this tutorial. However, when I used Python to evaluate the mounted data, processing was very, very slow.
I am starting to think utilizing the AWS CLI (cp or sync) may be the correct approach, but am still having difficulty finding documentation on this with respect to large, public S3 data sets.
In short, is mounting the best way to work with AWS' S3 public big data sets, is the CLI better, is this an EMR problem, or does the issue lie entirely in instance size and / or bandwidth?

Very large data sets are typically analysed with the help of distributed processing tools such as Apache Hadoop (which is available as part of the Amazon EMR service). Hadoop can split processing between multiple servers (nodes), achieving much better speed and throughput by working in parallel.
I took a look at one of the data set directories and found these files:
$ aws s3 ls s3://nasanex/NEX-DCP30/NEX-quartile/rcp26/mon/atmos/tasmax/r1i1p1/v1.0/CONUS/
2013-09-29 17:58:42 1344734800 tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc
2013-10-09 05:08:17 83 tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc.md5
2013-09-29 18:18:00 1344715511 tasmax_ens-avg_amon_rcp26_CONUS_201101-201512.nc
2013-10-09 05:14:49 83 tasmax_ens-avg_amon_rcp26_CONUS_201101-201512.nc.md5
2013-09-29 18:15:33 1344778298 tasmax_ens-avg_amon_rcp26_CONUS_201601-202012.nc
2013-10-09 05:17:37 83 tasmax_ens-avg_amon_rcp26_CONUS_201601-202012.nc.md5
2013-09-29 18:20:42 1344775120 tasmax_ens-avg_amon_rcp26_CONUS_202101-202512.nc
2013-10-09 05:07:30 83 tasmax_ens-avg_amon_rcp26_CONUS_202101-202512.nc.md5
...
Each data file in this directory is about 1.3GB (each with an accompanying MD5 file to verify the file contents via a checksum).
I downloaded one of these files:
$ aws s3 cp s3://nasanex/NEX-DCP30/NEX-quartile/rcp26/mon/atmos/tasmax/r1i1p1/v1.0/CONUS/tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc .
Completed 160 of 160 part(s) with 1 file(s) remaining
The aws s3 cp command used a multi-part download to retrieve the file. It still took considerable time because 1.3GB is a lot of data!
The result is a local file that can be accessed via Python:
$ ls -l
total 1313244
-rw-rw-r-- 1 ec2-user ec2-user 1344734800 Sep 29 2013 tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc
It is in .nc format, which I think is NetCDF.
I recommend processing one file at a time, since EBS data volumes have a maximum size of 16TiB.
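To plan how much to download at a time, the listing output can be totalled before copying anything. The following is a minimal sketch, assuming the listing has been captured as text; the sample lines are copied from the listing above, and the helper name parse_listing is hypothetical.

```python
# Sketch: total up object sizes from `aws s3 ls` output.
# The sample lines come from the listing above; parse_listing is a hypothetical helper.

LISTING = """\
2013-09-29 17:58:42 1344734800 tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc
2013-10-09 05:08:17 83 tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc.md5
2013-09-29 18:18:00 1344715511 tasmax_ens-avg_amon_rcp26_CONUS_201101-201512.nc
2013-10-09 05:14:49 83 tasmax_ens-avg_amon_rcp26_CONUS_201101-201512.nc.md5
"""


def parse_listing(text):
    """Yield (key, size_in_bytes) for each object line in `aws s3 ls` output."""
    for line in text.splitlines():
        parts = line.split(None, 3)  # date, time, size, key
        # Lines that are not objects (for example prefix lines) are skipped.
        if len(parts) == 4 and parts[2].isdigit():
            yield parts[3], int(parts[2])


# Keep only the NetCDF data files, not the .md5 checksum files.
data_files = {key: size for key, size in parse_listing(LISTING) if key.endswith(".nc")}
total_bytes = sum(data_files.values())

for key, size in data_files.items():
    print(f"{size / 1e9:.2f} GB  {key}")
print(f"total: {total_bytes / 1e9:.2f} GB in {len(data_files)} data files")
```

Comparing the total against the free space on the volume gives a rough idea of how many files can be staged at once.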

Related

Read timeout on endpoint URL when copying large files from local machine to S3

When I'm running aws s3 cp local_file.csv s3://bucket_name/file.csv, the upload copying begins properly and runs ok, until the speed slows down and eventually times out (at around 20-30% uploaded) with the following error:
Read timeout on endpoint URL: "https://bucketname.s3.amazonaws.com/file.csv?uploadid=xxx&partNumber=65.
The file is a large one (~2GB) but I ran this process OK in the past from another network with higher upload speeds. Now that I'm running it from my home at lower speed (max 10mbps, but this goes down the longer the upload takes), I want to allow more leeway before it times out.
Any idea how to set that timeout to a different threshold? Couldn't spot this in the AWS docs.

I had to add a new parameter to the command: --cli-read-timeout
For example:
aws s3 cp SOURCE_FOLDER TARGET_FOLDER --recursive --cli-read-timeout 0
More information: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-options.html

You might have to set up some configuration values for the CLI in your config file so that your large file is broken down into manageable chunks. See the link below:
https://docs.aws.amazon.com/cli/latest/topic/s3-config.html
Also make sure that your CLI version is up to date.
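As a rough consistency check on the chunking point, the arithmetic below estimates how many parts the upload is split into and where part 65 falls. It is only a sketch: it assumes the "~2GB" file is exactly 2 GiB and that the CLI is using its default multipart_chunksize of 8MB; the variable names are illustrative.

```python
import math

MiB = 1024 ** 2
GiB = 1024 ** 3

file_size = 2 * GiB    # assumption: "~2GB" taken as exactly 2 GiB
chunk_size = 8 * MiB   # AWS CLI default multipart_chunksize
failed_part = 65       # partNumber from the error message in the question
upload_mbps = 10       # "max 10mbps" from the question

# Number of parts the file is split into, and how far the upload got.
total_parts = math.ceil(file_size / chunk_size)
progress_at_failure = failed_part / total_parts

# Best case: the link sustains its maximum speed for the whole upload.
best_case_seconds = file_size * 8 / (upload_mbps * 1_000_000)

print(f"parts: {total_parts}")
print(f"failed at about {progress_at_failure:.0%} of the upload")
print(f"best-case upload time: {best_case_seconds / 60:.0f} minutes")
```

Under these assumptions the failing part number lines up with the "20-30% uploaded" reported in the question, which suggests the defaults were in effect when the timeout occurred.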

Errors importing large CSV file to DynamoDB using Lambda

I want to import a large csv file (around 1gb with 2.5m rows and 50 columns) into a DynamoDb, so have been following this blog from AWS.
However it seems I'm up against a timeout issue. I've got to ~600,000 rows ingested, and it falls over.
I think from reading the CloudWatch log that the timeout is occurring due to the boto3 read on the CSV file (it opens the entire file first, iterates through and batches up for writing)... I tried to reduce the file size (3 columns, 10,000 rows as a test), and I got a timeout after 2500 rows.
Any thoughts here?!
TIA :)

I really appreciate the suggestions (Chris & Jarmod). After trying and failing to break things programmatically into smaller chunks, I decided to look at the approach in general.
Through research I understood there were 4 options:
1. Lambda Function - as per the above this fails with a timeout.
2. AWS Pipeline - Doesn't have a template for importing CSV to DynamoDB
3. Manual Entry - of 2.5m items? no thanks! :)
4. Use an EC2 instance to load the data to RDS and use DMS to migrate to DynamoDB
The last option actually worked well. Here's what I did:
Create an RDS database (I used the db.t2.micro tier as it was free) and created a blank table.
Create an EC2 instance (free Linux tier) and:
On the EC2 instance: use SCP to upload the CSV file to the ec2 instance
On the EC2 instance: first run sudo yum install mysql to get the tools needed, then use mysqlimport with the --local option to import the CSV file into the RDS MySQL database, which took literally seconds to complete.
At this point I also did some data cleansing to remove some white spaces and some character returns that had crept into the file, just using standard SQL queries.
Using DMS I created a replication instance, endpoints for the source (rds) and target (dynamodb) databases, and finally created a task to import.
The import took around 4hr 30m
After the import, I removed the EC2, RDS, and DMS objects (and associated IAM roles) to avoid any potential costs.
Fortunately, I had a flat structure to do this against, and it was only one table. I needed the cheap speed of the dynamodb, otherwise, I'd have stuck to the RDS (I almost did halfway through the process!!!)
Thanks for reading, and best of luck if you have the same issue in the future.
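For anyone who still wants to try the Lambda route, the question notes that the function reads the entire file before batching. A streaming alternative can be sketched as below. This is an illustration rather than the code from the AWS blog: iter_batches is a hypothetical helper, the sample data is made up, and the batch size of 25 matches the BatchWriteItem limit in DynamoDB.

```python
import csv
import io
from itertools import islice

BATCH_SIZE = 25  # DynamoDB BatchWriteItem accepts at most 25 items per call


def iter_batches(csv_file, batch_size=BATCH_SIZE):
    """Yield lists of row dicts without loading the whole file into memory."""
    reader = csv.DictReader(csv_file)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for the real file: 60 rows with 3 columns (made-up data).
sample = io.StringIO(
    "id,name,value\n" + "".join(f"{i},item{i},{i * 2}\n" for i in range(60))
)

batch_sizes = [len(batch) for batch in iter_batches(sample)]
print(batch_sizes)
```

Each batch would then be handed to whatever writer is in use. Streaming keeps memory flat, but it does not by itself remove the Lambda execution time limit, so a file of this size may still need to be split across invocations.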

More efficient use of aws s3 sync?

Lately, we've noticed that our AWS bill has been higher than usual. It's due to adding an aws s3 sync task to our regular build process. The build process generates something around 3,000 files. After the build, we run aws s3 sync to upload them en masse into a bucket. The problem is that this is monetarily expensive. Each upload is costing us ~$2 (we think) and this adds up to a monthly bill that raises the eyebrow.
Only maybe 1 or 2 of those files actually change from build to build. The rest are always the same. Yet aws s3 sync sees that they all changed and uploads the whole lot.
The documentation says that aws s3 sync compares the file's last modified date and byte size to determine if it should upload. The build server creates all those files brand-new every time, so the last modified date is always changed.
What I'd like to do is get it to compute a checksum or a hash on each file and then use that hash to compare the files. Amazon S3 already has the ETag field, which can be an MD5 hash of the file. But the aws s3 sync command doesn't use the ETag.
Is there a way to use etag? Is there some other way to do this?
The end result is that I'd only like to upload the 1 or 2 files that are actually different (and save tremendous cost)

The aws s3 sync command has a --size-only parameter.
From aws s3 sync options:
--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
This will likely avoid copying all files if they are updated with the same content.
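Note that --size-only will miss a change that leaves the file size identical. If that matters, a hash comparison along the lines the question asks for can be done outside the CLI. The sketch below is one possible approach, not a feature of aws s3 sync: it assumes a mapping of key to ETag has already been fetched from the bucket, and that the objects were uploaded in a single part without KMS encryption, since only then is the ETag a plain MD5. The function names and sample files are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_hex(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(build_dir, remote_etags):
    """List relative paths whose MD5 differs from (or is missing in) remote_etags."""
    build_dir = Path(build_dir)
    changed = []
    for path in sorted(build_dir.rglob("*")):
        if path.is_file():
            key = path.relative_to(build_dir).as_posix()
            if remote_etags.get(key) != md5_hex(path):
                changed.append(key)
    return changed


# Demonstration with a throwaway directory and a hypothetical ETag mapping.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "index.html").write_text("unchanged")
    (Path(tmp) / "app.js").write_text("new build output")
    remote = {
        "index.html": hashlib.md5(b"unchanged").hexdigest(),
        "app.js": hashlib.md5(b"old build output").hexdigest(),
    }
    to_upload = changed_files(tmp, remote)

print(to_upload)
```

The resulting list can then be uploaded file by file with aws s3 cp, leaving the unchanged objects untouched.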

As an alternative to s3 sync or cp you could use s5cmd
https://github.com/peak/s5cmd
It can copy files only when the size or date differs, and it also has speeds of up to 4.6gb/s
Example of the sync command:
AWS_REGION=eu-west-1 /usr/local/bin/s5cmd -stats cp -u -s --parents s3://bucket/folder/* /home/ubuntu

S3 charges $0.005 per 1,000 PUT requests (doc), so it's extremely unlikely that uploading 3,000 files is costing you $2 per build. Maybe $2 per day if you're running 50-100 builds a day, but that's still not much.
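The arithmetic behind that estimate, using only the figures quoted above, looks roughly like this. It counts PUT requests only and ignores LIST requests, storage, and data transfer, so treat it as a lower bound rather than a full bill.

```python
PUT_PRICE_PER_1000 = 0.005  # USD per 1,000 PUT requests, the figure quoted above
FILES_PER_BUILD = 3000      # from the question

# Request cost of uploading every file on every build.
cost_per_build = FILES_PER_BUILD / 1000 * PUT_PRICE_PER_1000

# How many full uploads it would take for PUT requests alone to reach $2.
builds_for_two_dollars = 2 / cost_per_build

print(f"request cost per build: ${cost_per_build:.3f}")
print(f"builds needed to reach $2: {builds_for_two_dollars:.0f}")
```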
If you really are paying that much per build, you should enable CloudTrail events and see what is actually writing that much (for that matter, maybe you've created some sort of recursive CloudTrail event log).
"The end result is that I'd only like to upload the 1 or 2 files that are actually different"
Are these files the artifacts produced by your build? If yes, why not just add a build step that copies them explicitly?

The issue that I got was using wildcard * in the --include option. Using one wildcard was fine but when I added the second * such as /log., it looked like sync tried to download everything to compare, which took a lot of CPU and network bandwidth.

What is the best way to copy 100GB of data between two AWS volumes?

I have two volumes attached to the same instance and it is taking 5 hours to transfer 100GB from one to the other using linux mv.
The c5.large instance supposedly uses enhanced network architecture and has a network speed of .74 Gigabits/s = .0925 Gigabytes per second. So I was expecting .74/8*60*60=333GB per hour. I am 15 times slower.
Where did I go wrong? Is there a better solution?
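The numbers in the question can be checked with a few lines of arithmetic. This sketch takes the quoted .74 Gigabits/s as the ceiling, which is the question's own assumption; whether that network figure is the right limit for a volume-to-volume copy is not something the calculation can settle.

```python
NOMINAL_GBIT_PER_S = 0.74  # figure quoted in the question
DATA_GB = 100              # amount of data moved
ELAPSED_HOURS = 5          # time the mv took

# Throughput expected if the nominal figure were fully achieved.
expected_gb_per_hour = NOMINAL_GBIT_PER_S / 8 * 3600

# Throughput actually observed.
observed_gb_per_hour = DATA_GB / ELAPSED_HOURS
observed_mb_per_s = DATA_GB * 1000 / (ELAPSED_HOURS * 3600)

slowdown = expected_gb_per_hour / observed_gb_per_hour

print(f"expected: {expected_gb_per_hour:.0f} GB/hour")
print(f"observed: {observed_gb_per_hour:.0f} GB/hour ({observed_mb_per_s:.1f} MB/s)")
print(f"slowdown: {slowdown:.1f}x")
```

The ratio comes out slightly above the "15 times" in the question, so the estimate there is in the right range.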

I use c.large instances, where the speed is nominally up to .74 Gigabits/s. In practice, e.g. when downloading from S3 buckets, it is about .45MBits/s, which is more than an order of magnitude less than that nominal value (for a c4.xlarge node).
I suggest you chop your data into 1GB packages and use the following script to download them onto the attached storage option of your choice.
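# Download and decompress each chunk in turn (part001 ... part100).
# Zero-padded brace expansion needs bash 4 or later.
for i in part{001..100}
do
    echo "$(date) $i Download"
    fnam="$i.csv.bz2"
    wget -O "/tmp/data/$fnam" "http://address/to/the/data/$fnam"
    echo "$(date) $i Unzip"
    bunzip2 "/tmp/data/$fnam"
done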

Amazon redshift query aborts automatically after 1 hour

I have around 500GB compressed data in amazon s3. I wanted to load this data to Amazon Redshift. For that, I have created an internal table in AWS Athena and I am trying to load data in the internal table of Amazon Redshift.
Loading of this big data into Amazon Redshift is taking more than an hour. The problem is when I fired a query to load data it gets aborted after 1hour. I tried it 2-3 times but it's getting aborted after 1 hour. I am using Aginity Tool to fire the query. Also, in Aginity tool it is showing that query is currently running and the loader is spinning.
More Details:
Redshift cluster has 12 nodes with 2TB space for each node and I used 1.7 TB space.
S3 files are not the same size. One of them is 250GB. Some of them in MB.
I am using the command
create table table_name as select * from athena_schema.table_name
it stops exactly after 1hr.
Note: I have set the current query timeout in Aginity to 90000 sec.

I know this is an old thread, but for anyone coming here because of the same issue: I've realised that, at least in my case, the problem was the Aginity client. So it's not related to Redshift or its Workload Manager, only to the third-party Aginity client. In summary, use a different client such as SQL Workbench and run the COPY command from there.
Hope this helps!
Carlos C.
More information, about my environment:
Redshift:
Cluster Type: Multi Node
Cluster: ds2.xlarge
Nodes: 4
Cluster Version: 1.0.4852
Client Environment:
Aginity Workbench for Redshift
Version 4.9.1.2686 (build 05/11/17)
Microsoft Windows NT 6.2.9200.0 (64-bit)
Network:
Connected to OpenVPN, via SSH Port tunneling.
The connection is not being dropped. This issue is only affecting the COPY command. The connection remains active.
Command:
copy tbl_XXXXXXX
from 's3://***************'
iam_role 'arn:aws:iam::***************:role/***************';
S3 Structure:
120 files of 6.2 GB each. 20 files of 874MB.
Output:
ERROR: 57014: Query (22381) cancelled on user's request
Statistics:
Start: ***************
End: ***************
Duration: 3,600.2420863

I'm not sure if the following answer will solve your exact problem of a timeout at exactly 1 hour.
But, based on my experience, loading data via the COPY command is the best and fastest way in Redshift. So I feel that the timeout issue shouldn't happen at all in your case.
The COPY command in Redshift can load data from S3 or via SSH.
e.g.
Simple copy
copy sales from 'emr://j-SAMPLE2B500FC/myoutput/part-*' iam_role
'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '\t' lzop;
e.g. Using a manifest
copy customer
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
PS: If you use a manifest and divide your data into multiple files, the load will be even faster, as Redshift loads data in parallel.
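The manifest itself is a small JSON document listing each file to load. A minimal sketch of generating one is shown below; build_manifest and the part file names are hypothetical, and only the bucket name is taken from the example above.

```python
import json


def build_manifest(bucket, keys, mandatory=True):
    """Build a Redshift COPY manifest listing each S3 object explicitly."""
    return {
        "entries": [
            {"url": f"s3://{bucket}/{key}", "mandatory": mandatory}
            for key in keys
        ]
    }


# Hypothetical split files; only the bucket name comes from the example above.
keys = [f"cust/part-{n:04d}" for n in range(4)]
manifest = build_manifest("mybucket", keys)

# This JSON would be uploaded to S3 (for example as cust.manifest) before running COPY.
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Setting mandatory to true makes the load fail if a listed file is missing, which is usually what you want for a one-off bulk load.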
