I have two volumes attached to the same instance, and it is taking 5 hours to transfer 100 GB from one to the other using the Linux `mv` command.
The c5.large instance supposedly uses enhanced networking and has a network speed of 0.74 Gigabits/s = 0.0925 Gigabytes per second. So I was expecting 0.74/8 × 3600 ≈ 333 GB per hour. I am roughly 15 times slower than that.
Where did I go wrong? Is there a better solution?
I use c.large instances, and although the nominal speed is up to 0.74 Gigabits/s, in practice (e.g. downloading from S3 buckets) I see about 0.45 MBits/s, which is more than an order of magnitude less than that nominal value (measured on a c4.xlarge node).
I suggest you split your data into 1 GB pieces and use the following script to download them onto the attached storage option of your choice.
mkdir -p /tmp/data
for i in part{001..100}   # brace expansion needs the range inside the braces
do
    echo "$(date) $i Download"
    fnam="$i.csv.bz2"
    wget -O "/tmp/data/$fnam" "http://address/to/the/data/$fnam"
    echo "$(date) $i Unzip"
    bunzip2 "/tmp/data/$fnam"
done
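Note also that a copy between two attached volumes is governed by the instance's EBS bandwidth and each volume's own throughput, not by its network rating (c5 instances are EBS-optimized by default). A quick way to see whether the volumes are the bottleneck is a dd throughput test; the /mnt/src and /mnt/dst mount points and file names below are hypothetical:

```shell
# Hedged sketch: measure sequential read and write throughput of the two
# volumes. Mount points and file names are placeholders.
dd if=/mnt/src/some_big_file of=/dev/null bs=1M count=1024 status=progress
dd if=/dev/zero of=/mnt/dst/ddtest bs=1M count=1024 oflag=direct status=progress
rm /mnt/dst/ddtest
```

If dd reports throughput close to what `mv` achieves, the volumes (e.g. gp2 burst credits) are the limit, not the copy tool.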
Related
I'm uploading a file that is 8.6T in size.
$ nohup gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp big_file.jsonl gs://bucket/big_file.jsonl > nohup.mv-big-file.out 2>&1 &
At some point, it just hangs, with no error messages, nothing.
Any suggestions on how I can move this large file from the box to the GS bucket?
As John Hanley mentioned, the maximum size for an individual object stored in Cloud Storage is 5 TB, as stated in Buckets and Objects Limits; an 8.6 TB file exceeds that limit, which would explain the upload stalling.
Here are some workarounds you can try:
You can split the file and upload the pieces to a single bucket, since there is no limit on the total size of a bucket, only on individual objects.
A second option is parallel composite uploads, which split a file into up to 32 chunks.
Another option to consider is Transfer Appliance, for faster and higher-capacity transfers into Cloud Storage.
You might also want to take a look at GCS's best practices documentation.
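The split-and-upload approach can be sketched as follows; the 1 TB piece size and the gs://bucket/big_file_parts/ prefix are illustrative choices, not required values:

```shell
# Hedged sketch: big_file.jsonl (8.6 TB) exceeds the 5 TB per-object limit,
# so split it into pieces that fit and upload them as separate objects.
split -b 1T -d big_file.jsonl big_file.jsonl.part-
gsutil -m cp big_file.jsonl.part-* gs://bucket/big_file_parts/
```

The numeric suffixes from `split -d` sort lexically, so `cat big_file.jsonl.part-*` reassembles the original after download.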
When running this query on AWS Athena, it manages to query a 63 GB Trades.csv file:
SELECT * FROM Trades WHERE TraderID = 1234567
It takes 6.81 seconds, scanning 63.82 GB in so doing (almost exactly the size of the Trades.csv file, so it is doing a full table scan).
What I'm shocked at is the unbelievable speed of data drawn from S3. It seems like AWS Athena's strategy is to use an unbelievably massive box with a ton of RAM and incredible S3 loading ability to get around the lack of indexing (although on a standard SQL DB you would have an index on TraderID and load millions of times less data).
But in my experiments I only managed to get these data reads from S3 (which are still impressive):
| Instance type | MB/s | Network card (Gigabits) |
|---------------|------|-------------------------|
| t2.2xlarge    | 113  | low                     |
| t3.2xlarge    | 140  | up to 5                 |
| c5n.2xlarge   | 160  | up to 25                |
| c6gn.16xlarge | 230  | 100                     |

(speeds are in megabytes per second, not megabits)
I'm using an internal VPC Endpoint for the s3 on eu-west-1. Anyone got any tricks/tips for getting s3 to load fast? Has anyone got over 1GB/s read speeds from s3? Is this even possible?
"It seems like AWS Athena's strategy is to use an unbelievably massive box with a ton of RAM"
No, it's more like many small boxes, not a single massive box. Athena is running your query in parallel, on multiple servers at once. The exact details of that are not published anywhere as far as I am aware, but they make very clear in the documentation that your queries run in parallel.
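As for pushing single-instance read speeds higher: one common trick is issuing many byte-range GETs against the same object in parallel, which is essentially what Athena's workers and the CLI's multipart transfers do. A hedged sketch, where the bucket, key, chunk size, and chunk count are all placeholder choices:

```shell
# Hedged sketch: read one S3 object as 8 parallel 128 MB byte-range GETs.
# BUCKET, KEY, and the chunk geometry are placeholder values.
BUCKET=my-bucket
KEY=big/file.bin
CHUNK=$((128 * 1024 * 1024))
for i in $(seq 0 7); do
  start=$((i * CHUNK))
  end=$((start + CHUNK - 1))
  aws s3api get-object --bucket "$BUCKET" --key "$KEY" \
    --range "bytes=$start-$end" "part_$i" &
done
wait
cat part_0 part_1 part_2 part_3 part_4 part_5 part_6 part_7 > file.bin
```

Whether this saturates the NIC depends on the instance's network and EBS limits; past a point, more parallel streams stop helping.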
Lately, we've noticed that our AWS bill has been higher than usual. It's due to adding an aws s3 sync task to our regular build process. The build process generates something around 3,000 files. After the build, we run aws s3 sync to upload them en masse into a bucket. The problem is that this is monetarily expensive. Each upload is costing us a ~$2 (we think) and this adds up to a monthly bill that raises the eyebrow.
All but maybe 1 or 2 of those files actually change from build to build. The rest are always the same. Yet aws s3 sync sees that they all changed and uploads the whole lot.
The documentation says that aws s3 sync compares the file's last modified date and byte size to determine if it should upload. The build server creates all those files brand-new every time, so the last modified date is always changed.
What I'd like to do is get it to compute a checksum or a hash on each file and then use that hash to compare the files. Amazon S3 already has the ETag field, which can be an MD5 hash of the file. But the aws s3 sync command doesn't use the ETag.
Is there a way to use etag? Is there some other way to do this?
The end result is that I'd only like to upload the 1 or 2 files that are actually different (and save tremendous cost)
The aws s3 sync command has a --size-only parameter.
From aws s3 sync options:
--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
This will likely avoid copying all files if they are updated with the same content.
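If size alone is not a reliable change signal, one can compare a local MD5 against the object's ETag, which works only for objects uploaded in a single part (multipart ETags are not plain MD5s). A hedged sketch; the bucket and file names are placeholders:

```shell
# Hedged sketch: upload a file only if its MD5 differs from the S3 ETag.
# Valid only for single-part uploads, where ETag == MD5 hex digest.
BUCKET=my-bucket
FILE=build/app.js
local_md5=$(md5sum "$FILE" | cut -d' ' -f1)
remote_etag=$(aws s3api head-object --bucket "$BUCKET" --key "$FILE" \
  --query ETag --output text | tr -d '"')
if [ "$local_md5" != "$remote_etag" ]; then
  aws s3 cp "$FILE" "s3://$BUCKET/$FILE"
fi
```

Looping this over 3,000 files costs one HEAD request per file, which is far cheaper than 3,000 PUTs.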
As an alternative to s3 sync or cp you could use s5cmd
https://github.com/peak/s5cmd
It can sync files based on size and modification date, and the project reports speeds of up to 4.6 GB/s.
Example of the sync command:
AWS_REGION=eu-west-1 /usr/local/bin/s5cmd -stats cp -u -s --parents s3://bucket/folder/* /home/ubuntu
S3 charges $0.005 per 1,000 PUT requests (doc), so it's extremely unlikely that uploading 3,000 files is costing you $2 per build. Maybe $2 per day if you're running 50-100 builds a day, but that's still not much.
If you really are paying that much per build, you should enable CloudTrail events and see what is actually writing that much (for that matter, maybe you've created some sort of recursive CloudTrail event log).
The end result is that I'd only like to upload the 1 or 2 files that are actually different
Are these files the artifacts produced by your build? If yes, why not just add a build step that copies them explicitly?
The issue I ran into was using the wildcard * in the --include option. One wildcard was fine, but when I added a second * (such as /log.), sync appeared to download everything in order to compare it, which consumed a lot of CPU and network bandwidth.
My understanding was that spark will choose the 'default' number of partitions, solely based on the size of the file or if its a union of many parquet files, the number of parts.
However, when reading in a set of large parquet files, I see that the default number of partitions for an EMR cluster with a single d2.2xlarge is ~1200. However, on a cluster of 2 r3.8xlarge I get ~4700 default partitions.
What metrics does Spark use to determine the default partitions?
EMR 5.5.0
spark.default.parallelism - Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
2X number of CPU cores available to YARN containers.
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#spark-defaults
Looks like it matches non-EMR/AWS Spark as well.
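As a rough illustration of that 2× rule (the core count actually available to YARN containers is typically a bit below the raw vCPU count, so treat these numbers as approximate):

```shell
# Hedged sketch: EMR's default spark.default.parallelism = 2 x cores
# available to YARN containers. Node and vCPU counts are illustrative.
NODES=2    # two r3.8xlarge core nodes
VCPUS=32   # r3.8xlarge has 32 vCPUs
echo $((2 * NODES * VCPUS))
```

Note this governs RDD transformations like join and parallelize; partition counts when reading files are driven by the input splits (e.g. the number of parquet part files), which is why file reads can differ from this value.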
I think there was some transient issue because I restarted that EMR cluster with d2.2xlarge and it gave me the number of partitions I expected, which matched the r3.8xlarge, which was the number of files on s3.
If anyone knows why this kind of things happens though, I'll gladly mark yours as the answer.
AWS has several public "big data" data sets available. Some are hosted for free on EBS, and others, like NASA NEX climate data are hosted on S3. I have found more discussion on how to work with those that are hosted in EBS, but have been unable to get an S3 data set within an EC2 with reasonable enough speed to actually work with the data.
So my issue is getting the public big data sets (~256T) "into" an EC2. One approach I tried was to mount the public S3 to my EC2, as in this tutorial. However, when attempting to use python to evaluate this mounted data, the processing times were very, very slow.
I am starting to think utilizing the AWS CLI (cp or sync) may be the correct approach, but am still having difficulty finding documentation on this with respect to large, public S3 data sets.
In short, is mounting the best way to work with AWS' S3 public big data sets, is the CLI better, is this an EMR problem, or does the issue lie entirely in instance size and / or bandwidth?
Very large data sets are typically analysed with the help of distributed processing tools such as Apache Hadoop (which is available as part of the Amazon EMR service). Hadoop can split processing between multiple servers (nodes), achieving much better speed and throughput by working in parallel.
I took a look at one of the data set directories and found these files:
$ aws s3 ls s3://nasanex/NEX-DCP30/NEX-quartile/rcp26/mon/atmos/tasmax/r1i1p1/v1.0/CONUS/
2013-09-29 17:58:42 1344734800 tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc
2013-10-09 05:08:17 83 tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc.md5
2013-09-29 18:18:00 1344715511 tasmax_ens-avg_amon_rcp26_CONUS_201101-201512.nc
2013-10-09 05:14:49 83 tasmax_ens-avg_amon_rcp26_CONUS_201101-201512.nc.md5
2013-09-29 18:15:33 1344778298 tasmax_ens-avg_amon_rcp26_CONUS_201601-202012.nc
2013-10-09 05:17:37 83 tasmax_ens-avg_amon_rcp26_CONUS_201601-202012.nc.md5
2013-09-29 18:20:42 1344775120 tasmax_ens-avg_amon_rcp26_CONUS_202101-202512.nc
2013-10-09 05:07:30 83 tasmax_ens-avg_amon_rcp26_CONUS_202101-202512.nc.md5
...
Each data file in this directory is about 1.3 GB (and each is paired with an .md5 file to verify its contents via a checksum).
I downloaded one of these files:
$ aws s3 cp s3://nasanex/NEX-DCP30/NEX-quartile/rcp26/mon/atmos/tasmax/r1i1p1/v1.0/CONUS/tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc .
Completed 160 of 160 part(s) with 1 file(s) remaining
The aws s3 cp command used a multi-part download to retrieve the file. It still took some time, because 1.3 GB is a fair amount of data.
The result is a local file that can be accessed via Python:
$ ls -l
total 1313244
-rw-rw-r-- 1 ec2-user ec2-user 1344734800 Sep 29 2013 tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc
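Since each data file ships with an .md5 companion (as in the listing above), the download can be verified before processing. A hedged sketch, assuming the .md5 file is in the standard `hash  filename` format that `md5sum -c` expects:

```shell
# Hedged sketch: fetch the published .md5 and verify the downloaded file.
f=tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc
aws s3 cp "s3://nasanex/NEX-DCP30/NEX-quartile/rcp26/mon/atmos/tasmax/r1i1p1/v1.0/CONUS/$f.md5" .
md5sum -c "$f.md5"
```

A non-zero exit status from `md5sum -c` means the file was corrupted in transit and should be re-downloaded.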
It is in .nc format, which I think is a NetCDF.
I recommend processing one file at a time, since EBS data volumes are 16TiB maximum size.