I have a system that processes big data sets and downloads data from an S3 bucket.
Each instance downloads multiple objects from under a common prefix (a "directory") in S3. When the number of instances is small, the download speeds are good, i.e. 4-8 MiB/s.
But when I use around 100-300 instances, the download speed drops to about 80 KiB/s.
What might be the reasons behind this, and what can I do to remedy it?
If your EC2 instances are in private subnets, then your NAT may be a limiting factor.
Try the following:
Add an S3 endpoint to your VPC. This bypasses your NAT when your EC2 instances talk to S3 (see the sketch after this list).
If you are using NAT instances, try using NAT gateways instead. They scale their bandwidth up and down automatically.
If you are using a NAT instance, try moving it to a larger instance type with more CPU and Enhanced Networking.
If you are using a single NAT, try using multiple NATs instead (one per subnet). This will spread the bandwidth across multiple NATs.
If all that fails, try putting your EC2 instances into public subnets.
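As a concrete illustration of the first point, here is a minimal sketch of creating a Gateway VPC endpoint for S3 with boto3; the region, VPC ID and route table ID are placeholders for your own values:

    # Sketch: add a Gateway VPC endpoint for S3 so traffic from private
    # subnets reaches S3 directly instead of going through the NAT.
    # The region, VPC ID and route table IDs below are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId="vpc-0123456789abcdef0",             # your VPC
        ServiceName="com.amazonaws.us-east-1.s3",  # S3 service in the same region
        RouteTableIds=["rtb-0123456789abcdef0"],   # route tables of the private subnets
    )
    print(response["VpcEndpoint"]["VpcEndpointId"])

Once the endpoint's routes are in place, S3 traffic from those subnets no longer counts against the NAT's bandwidth.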
How are the objects in your S3 bucket named? Object naming can have a surprisingly large effect on the throughput of the bucket because of partitioning. Behind the scenes, S3 partitions your bucket based on the object keys, but only the first 3-4 characters of the key really matter. Note also that the key is the entire path within the bucket; the sub-paths don't matter for partitioning. So if you have a bucket called mybucket with objects such as 2017/july/22.log, 2017/july/23.log, 2017/june/1.log and 2017/oct/23.log, the fact that you've split the keys by month doesn't actually help, because only the first few characters of the entire key are used.
If the objects in your bucket have a sequential naming structure, you will likely see poor performance with many parallel requests. To get around this, assign a random 3-4 character prefix to each object key in the bucket.
See http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html for more information.
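For illustration, here is a minimal sketch of uploading with a random key prefix; the bucket name, helper name and prefix length are assumptions, not anything prescribed by S3:

    # Sketch: prepend a short random hex prefix to each key at upload time
    # so that keys spread across S3's internal partitions.
    import os
    import boto3

    s3 = boto3.client("s3")

    def upload_with_random_prefix(bucket, local_path, key):
        prefix = os.urandom(2).hex()        # 4 random hex characters
        randomized_key = f"{prefix}/{key}"  # e.g. "9f3a/2017/july/22.log"
        s3.upload_file(local_path, bucket, randomized_key)
        return randomized_key

    # upload_with_random_prefix("mybucket", "/tmp/22.log", "2017/july/22.log")

The trade-off is that listing objects by their original, human-readable prefix becomes harder, so you would typically keep a record of the randomized keys elsewhere.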
You probably want to use S3DistCp instead of managing concurrency and connections by hand...
This AWS blog post https://aws.amazon.com/blogs/storage/architecting-for-high-availability-on-amazon-s3/ states the following:
Amazon S3 storage classes replicate their data on more than three
Availability Zones (except for S3 One Zone-Infrequent Access).
What's the point of this article https://aws.amazon.com/blogs/startups/large-scale-disaster-recovery-using-aws-regions/ stating:
S3 snapshots: We rely on the cross s3 sync and this works like a
charm. We are able to copy the data from our primary to the DR region
within a matter of few minutes.
The latter seems superfluous now, and it is from 2017, so maybe it is outdated? Or is the thrust that we should also be placing copies of our Amazon S3 data across Regions? I see no such need, since the AZs within a Region are physically separated from each other. What am I missing?
S3 buckets are region specific. When you create a new bucket you need to select the target region for that bucket.
For DR reasons, you can keep backups in another region. Should the primary region fail in a way that the entire region is affected, then you could restore in the backup region.
Your DR strategy will depend on your use case, and your needs for returning services back to normal in case of region wide failure.
For example, let's say you rely on EC2/EBS to operate your service and those services suffer a region-wide outage for 5 hours. To recover your service, you would need to move to a Region where the resources are available. Assuming you need the S3 data for operational processing, you would want to have that data ready in the target recovery Region.
Storing data in multiple AZs within a Region does not guarantee safety in case of an entire-Region failure. This applies to all regional services. The article you shared does in fact mention this, so it is not irrelevant:
The service that runs in HA is handled by hosts running in different
availability zones but in the same geographical region. This approach,
however, does not guarantee that our business will be up and running
in case the entire region goes down
I need an HTTP web service that serves files (1-10 GiB) produced by merging smaller files from an S3 bucket. The logic is easy to implement, but I need very high scalability, so I would prefer to run it in the cloud. Which Amazon service is the best fit for this particular case? Should I use AWS Lambda for it?
Unfortunately, you can't achieve that with Lambda, since it only offers 512 MB of storage and you can't mount volumes. You will need EBS or EFS to download and process the data. Since you need scalability, I would suggest Fargate + EFS. Plain EC2 instances would do just fine, but you might waste money, because it can be tricky to provision the right amount of capacity for your needs and most of the time it ends up overprovisioned.
If you don't need to process the files in real time, you can use a single instance and use SQS to queue the jobs to save some money. In that scenario you could use Lambda to trigger the jobs, and even start/stop the instance when it is not in use.
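A minimal sketch of that trigger idea, assuming a hypothetical queue URL and worker instance ID (both placeholders):

    # Sketch of a Lambda handler: queue a merge job on SQS and make sure the
    # worker instance is running (start_instances is a no-op if it already is).
    import json
    import boto3

    sqs = boto3.client("sqs")
    ec2 = boto3.client("ec2")

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/merge-jobs"  # placeholder
    WORKER_INSTANCE_ID = "i-0123456789abcdef0"                                 # placeholder

    def handler(event, context):
        # The event is assumed to describe which S3 objects to merge.
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(event))
        ec2.start_instances(InstanceIds=[WORKER_INSTANCE_ID])
        return {"queued": True}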
Merging files
It is possible to concatenate Amazon S3 objects by using UploadPartCopy:
Uploads a part by copying data from an existing object as data source.
However, the minimum allowable part size for a multipart upload is 5 MB.
Thus, if each of your parts is at least 5 MB, then this would be a way to concatenate files without downloading and re-uploading.
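A minimal sketch of that approach with boto3; the bucket, keys and function name are placeholders, and it assumes every source object except possibly the last is at least 5 MB:

    # Sketch: concatenate source objects into one target object using a
    # multipart upload whose parts are copied server-side via UploadPartCopy.
    import boto3

    s3 = boto3.client("s3")

    def concatenate(bucket, source_keys, target_key):
        upload = s3.create_multipart_upload(Bucket=bucket, Key=target_key)
        parts = []
        for number, key in enumerate(source_keys, start=1):
            result = s3.upload_part_copy(
                Bucket=bucket,
                Key=target_key,
                UploadId=upload["UploadId"],
                PartNumber=number,
                CopySource={"Bucket": bucket, "Key": key},
            )
            parts.append({"PartNumber": number, "ETag": result["CopyPartResult"]["ETag"]})
        s3.complete_multipart_upload(
            Bucket=bucket,
            Key=target_key,
            UploadId=upload["UploadId"],
            MultipartUpload={"Parts": parts},
        )

    # concatenate("mybucket", ["part-a.bin", "part-b.bin"], "merged.bin")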
Streaming files
Alternatively, rather than creating new objects in Amazon S3, your endpoint could simply read each file in turn and stream the contents back to the requester. This could be done via API Gateway and AWS Lambda. Your AWS Lambda code would read each object from S3 and keep returning the contents until the last object has been processed.
First, let me clarify your goal: you want to have an endpoint, say https://my.example.com/retrieve that reads some set of files from S3 and combines them (say, as a ZIP)?
If yes, does whatever language/framework that you're using support chunked encoding for responses?
If yes, then it's certainly possible to do this without storing anything on disk: you read from one stream (the file coming from S3) and write to another (the response). I'm guessing you knew that already based on your comments to other answers.
However, based on your requirement of 1-10 GB of output, Lambda won't work because it has a limit of 6 MB for response payloads (and iirc that's after Base64 encoding).
So in the AWS world, that leaves you with an always-running server, either EC2 or ECS/EKS.
Unless you're doing some additional transformation along the way, this isn't going to require a lot of CPU, but if you expect high traffic it will require a lot of network bandwidth. Which to me says that you want to have a relatively large number of smallish compute units. Keep a baseline number of them always running, and scale based on network bandwidth.
Unfortunately, smallish EC2 instances in general have lower bandwidth, although the a1 family seems to be an exception to this. And Fargate doesn't publish bandwidth specs.
That said, I'd probably run on ECS with Fargate due to its simpler deployment model.
Beware: your biggest cost with this architecture will almost certainly be data transfer. And if you use a NAT, not only will you be paying for its data transfer, you'll also limit your bandwidth. I would at least consider running in a public subnet (with assigned public IPs).
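For what it's worth, here is a minimal sketch of the always-running streaming service described above; Flask is just one possible framework, and the bucket name, route and chunk size are assumptions:

    # Sketch: stream several S3 objects back-to-back as one chunked HTTP
    # response without buffering anything to disk.
    import boto3
    from flask import Flask, Response

    app = Flask(__name__)
    s3 = boto3.client("s3")
    BUCKET = "mybucket"  # placeholder

    @app.route("/retrieve/<path:keys>")
    def retrieve(keys):
        def generate():
            for key in keys.split(","):  # e.g. /retrieve/a.bin,b.bin
                body = s3.get_object(Bucket=BUCKET, Key=key)["Body"]
                for chunk in body.iter_chunks(chunk_size=8 * 1024 * 1024):
                    yield chunk
        return Response(generate(), mimetype="application/octet-stream")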
Background
We have developed an e-commerce application where I want to use CDN to improve the speed of the app and also to reduce the load on the host.
The application is hosted on an EC2 server and now we are going to use Cloud Front.
Questions
After reading a lot of articles and documentation, I created a distribution for my sample site. From that experiment I have come to understand the following things, and I want to be sure whether I am right about them or not.
When we create a distribution, it fetches all the accessible data from the given origin path. We don't need to copy/sync our files to CloudFront.
We just have to change the paths in our application to use the distribution's domain name (or CNAME, if one is given).
There is no difference between placing the images/JS/CSS files on S3 or on our own host; CloudFront will fetch them either way.
The application will have thousands of product pictures. Should we place them on S3, or is it OK if they stay on the host itself? Please share any good article explaining the difference between the two approaches.
If S3 is significantly better, then I'll have to write a program to sync all of that data to S3.
Thanks for the help.
Some reasons to store the images on Amazon S3 rather than your own host (and then serve them via Amazon CloudFront):
Less load on your servers
Even though content is cached in Amazon CloudFront, your servers will still be hit with requests for the first access of each object from every edge location (each edge location maintains its own cache), repeated every time that the object expires. (Refreshes will generate a HEAD request, and will only re-download content that has changed or been flushed from the cache.)
More durable storage
Amazon S3 keeps copies of your data across multiple Availability Zones within the same Region. You could also replicate data between your servers to improve durability but then you would need to manage the replication and pay for storage on every server.
Lower storage cost
Storing data on Amazon S3 is lower cost than storing it on Amazon EBS volumes. If you are planning on keeping your data in both locations, then obviously using S3 costs extra, but you should also consider storing it only on S3, which makes it lower cost, more durable and less for you to back up on your server.
Reasons to NOT use S3:
More moving parts -- maintaining code to move files to S3
Not as convenient as using a local file system
Having to merge log files from S3 and your own servers to gather usage information
I am confused about the Amazon S3 replication mechanism. My understanding is that, by default, Amazon S3 applies a 3-replica mechanism, in which there are 3 replicas of each object created in my S3 bucket, and all the replicas are stored in multiple Availability Zones within only ONE Region, which I specified when creating the S3 bucket.
Is my understanding correct? If so, is it possible to see where the replicas of an object are stored?
Thanks
You are pretty much correct. S3 replication works by replicating across at least 3 data centers, over at least two AZs within a single region (each Availability Zone can have multiple data centers).
The replication is part of S3, which is a managed service, meaning you just have to accept what they're telling you. Telling you where the replicas are wouldn't really serve any purpose, and AWS never really discloses the details of its infrastructure to anyone who doesn't need to know. Even if they told you the data was stored in Availability Zones 1 and 2, this would be effectively meaningless information, as zone names are aliases, i.e. your Zone 1 probably isn't the same as my Zone 1.
Can someone help me understand the S3 outage use case here?
The probability of an S3 outage is very low, but if it does happen, what are the ways we can access the data that sits in S3?
I know there is one possibility: cross-region replication, which works for new files that I put into my S3 bucket once I enable it. But what happens to the old files? I know that if I also upload all those historical files to the other region, then it works.
Then again the same question arises: if both regions go down, then what?
I am sure others have thought about this. Any input on this would be appreciated.
From Protecting Data in Amazon S3:
Objects are redundantly stored on multiple devices across multiple facilities in an Amazon S3 region. To help better ensure data durability, Amazon S3 PUT and PUT Object copy operations synchronously store your data across multiple facilities before returning SUCCESS. Once the objects are stored, Amazon S3 maintains their durability by quickly detecting and repairing any lost redundancy.
...
Backed with the Amazon S3 Service Level Agreement
Designed to provide 99.999999999% durability and 99.99% availability of objects over a given year
Designed to sustain the concurrent loss of data in two facilities
So, if you're still not happy with all those statements, how can you access your data in an outage?
If your data is in only one region, and the region is not accessible, then your data is not accessible. Note, however, that an external network connectivity problem could prevent access to Amazon S3, yet Amazon S3 might still be accessible from Amazon EC2 instances in the same region.
Cross-region replication will copy your data to another Amazon S3 region. It requires versioning to be activated. To copy any files that existed prior to activating cross-region replication, use the sync command in the AWS Command Line Interface (CLI), e.g.:
aws s3 sync s3://bucket1/folder s3://bucket2/folder
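If you prefer to set this up programmatically, here is a minimal sketch with boto3; the bucket names and IAM role ARN are placeholders, and it assumes the destination bucket already exists in the other region with versioning enabled:

    # Sketch: enable versioning on the source bucket (required for
    # cross-region replication) and add a replication rule pointing at a
    # destination bucket in another region.
    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_versioning(
        Bucket="bucket1",
        VersioningConfiguration={"Status": "Enabled"},
    )

    s3.put_bucket_replication(
        Bucket="bucket1",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::123456789012:role/s3-replication-role",  # placeholder
            "Rules": [
                {
                    "ID": "replicate-everything",
                    "Status": "Enabled",
                    "Prefix": "",  # replicate all objects (legacy rule schema)
                    "Destination": {"Bucket": "arn:aws:s3:::bucket2"},
                }
            ],
        },
    )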
Each AWS region operates independently, so the possibility of multiple regions suffering outages would presumably be even less likely.
If you are feeling particularly paranoid, you could copy your data to another cloud provider (Azure, Google, Rackspace, etc). There are tools that can assist:
CloudBerry Cloud Migrator
AzureCopy
...and no doubt many more!