We use S3 for static file hosting. Is it possible to set it up redundantly? I don't want to rely on only one zone in case anything breaks.
Thanks.
Amazon S3 buckets are regional-level services. Data is replicated automatically across multiple Availability Zones.
So, if you wish to have redundancy across Availability Zones, it is done for you automatically.
If you wish to have redundancy across regions, you might be able to use Amazon CloudFront and/or Amazon Route 53.
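One possible shape for the cross-region case, sketched below with boto3: replicate the content into a second bucket in another region, put a CloudFront distribution in front of each bucket, and use a Route 53 failover record to switch between them. Everything here (hosted zone ID, health check ID, domain name, and CloudFront domain names) is a placeholder, not something from the question.

```python
import boto3

route53 = boto3.client("route53")

# Hypothetical IDs -- substitute your own hosted zone and health check.
HOSTED_ZONE_ID = "Z0000000EXAMPLE"
PRIMARY_HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            {
                # Primary: CloudFront distribution in front of the main bucket.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "static.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "primary",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "HealthCheckId": PRIMARY_HEALTH_CHECK_ID,
                    "ResourceRecords": [{"Value": "d1111111111111.cloudfront.net"}],
                },
            },
            {
                # Secondary: distribution in front of the replica bucket.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "static.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "secondary",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "d2222222222222.cloudfront.net"}],
                },
            },
        ]
    },
)
```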
It's not possible to have duplicate bucket names in AWS; all bucket names in AWS are globally unique.
I understand Data Lake Zones in S3 and I am looking at establishing 3 zones - LANDING, STAGING, CURATED. If I were in an Azure environment, I would create the Data Lake and have multiple folders as various zones.
How would I do the equivalent in AWS? Would it be a separate bucket for each zone (s3://landing_data/, s3://staging_data, s3://curated_data) or a single bucket with multiple folders (i.e. s3://bucket_name/landing/..., s3://bucket_name/staging/)? I understand AWS S3 buckets are nothing more than containers.
Also, would I be able to mount multiple S3 buckets on Databricks on AWS? If so, is there any reference documentation?
Is there any best/recommended approach given that we can read and write to S3 in multiple ways?
I looked at this as well.
S3 Performance Best Practices
There is no single solution; the actual implementation depends on the amount of data, the number of consumers/producers, etc. You need to take AWS S3 limits into account, like:
By default you may have only 100 buckets in an account, although this limit can be increased
You may issue 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix (directory) in a single bucket (although the number of prefixes is not limited)
You can mount each of the buckets, or individual folders, into a Databricks workspace as described in the documentation. But it's really not recommended from a security standpoint, as everyone in the workspace will have the same permissions as the role that was used for mounting. Instead, just use full S3 URLs in combination with instance profiles.
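As a rough sketch of the full-URL approach (the bucket names are taken from the question, and the cluster is assumed to already have an instance profile that can reach them):

```python
# Runs in a Databricks notebook, where `spark` is already defined.
# Assumes the cluster's instance profile grants access to these buckets.
landing = spark.read.json("s3://landing_data/events/")

curated = (
    landing
    .filter("event_type = 'purchase'")          # illustrative transformation
    .select("user_id", "amount", "event_ts")
)

# Write straight to another zone's bucket using its full S3 URL -- no mounts.
curated.write.mode("overwrite").parquet("s3://curated_data/purchases/")
```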
I have a system that processes big data sets and downloads data from an S3 bucket.
Each instance downloads multiple objects from under a prefix (directory) on S3. When the number of instances is small, the download speeds are good, i.e. 4-8 MiB/s.
But when I use around 100-300 instances, the download speed drops to 80 KiB/s.
I'm wondering what the reasons behind this might be and what I can do to remedy it.
If your EC2 instances are in private subnets, then your NAT may be a limiting factor.
Try the following:
Add an S3 endpoint to your VPC; this bypasses your NAT when your EC2 instances talk to S3 (see the sketch after this list).
If you are using NAT instances, try using NAT gateways instead. They can scale up/down the bandwidth.
If you are using a NAT instance, try increasing the instance type of your NAT instance to one with more CPU and Enhanced Networking.
If you are using a single NAT, try using multiple NATs instead (one per subnet). This will spread the bandwidth across multiple NATs.
If all that fails, try putting your EC2 instances into public subnets.
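For the first item, adding an S3 gateway endpoint is a single boto3 call; the region, VPC, and route table IDs below are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Hypothetical VPC and route table IDs -- substitute your own.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
print(response["VpcEndpoint"]["VpcEndpointId"])
```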
How are the objects in your S3 bucket named? The naming of the objects can have a surprisingly large effect on the throughput of the bucket due to partitioning. In the background, S3 partitions your bucket based on the keys of the objects, but only the first 3-4 characters of the key are really important. Also note that the key is the entire path in the bucket, but the subpaths don't matter for partitioning. So if you have a bucket called mybucket and you have objects inside like 2017/july/22.log, 2017/july/23.log, 2017/june/1.log, 2017/oct/23.log then the fact that you've partitioned by month doesn't actually matter because only the first few characters of the entire key are used.
If you have a sequential naming structure for the objects in your bucket, then you will likely have bad performance with many parallel requests for objects. In order to get around this, you should assign a random prefix of 3-4 characters to each object in the bucket.
See http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html for more information.
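A minimal sketch of that random-prefix scheme (the 4-character hash prefix is just one way to do it, and this advice comes from the older request-rate guidance linked above):

```python
import hashlib

def randomized_key(original_key: str, prefix_len: int = 4) -> str:
    """Prepend a short hash-derived prefix so keys spread across S3 partitions.

    '2017/july/22.log' becomes something like 'ab12/2017/july/22.log'.
    """
    digest = hashlib.md5(original_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{original_key}"

print(randomized_key("2017/july/22.log"))
```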
You probably want to use S3DistCp instead of managing concurrency and connections by hand...
I am confused about the Amazon S3 replication mechanism. In my understanding, by default Amazon S3 applies a 3-replica mechanism, in which there are 3 replicas of each object created in my S3 bucket, and all the replicas are stored in multiple Availability Zones within only ONE region, which I specified when creating the S3 bucket.
Is my understanding correct? If it's correct, is it possible to see where the replicas of an object are stored?
Thanks
You are pretty much correct. S3 replication works by replicating across at least 3 data centers, over at least two AZs within a single region (each availability zone can have multiple data centers).
Replication is part of S3, which is a managed service, meaning you just have to accept what they're telling you. Telling you where the replicas were wouldn't really serve any purpose, and AWS never really discloses the details of its infrastructure to anyone who doesn't need to know. Even if they told you the data was stored in Availability Zones 1 and 2, this would be effectively meaningless information, as zones are aliases, i.e. your Zone 1 probably isn't the same as my Zone 1.
When we go to S3 in the AWS console, under the "Global" region option it shows
"S3 does not require region selection."
But when we create a new bucket there, it asks for a Region!
So are S3 buckets region-specific?
The user interface shows all your buckets, in all regions. But buckets exist in a specific region and you need to specify that region when you create a bucket.
S3 buckets are region-specific. You can check http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region for the list of endpoints for each region.
From the documentation on creating an S3 bucket:
Amazon S3 creates buckets in a region you specify. You can choose any AWS region that is geographically close to you to optimize latency, minimize costs, or address regulatory requirements. For example, if you reside in Europe, you might find it advantageous to create buckets in the EU (Ireland) or EU (Frankfurt) regions. For a list of Amazon S3 regions, go to Regions and Endpoints in the AWS General Reference.
Also, from the UI, if you look at the properties of each of your buckets, you will see its original region.
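The same check works from code; a small boto3 sketch with a made-up bucket name:

```python
import boto3

s3 = boto3.client("s3")

# GetBucketLocation returns None for buckets in us-east-1,
# otherwise the region name.
location = s3.get_bucket_location(Bucket="my-example-bucket")
print(location["LocationConstraint"] or "us-east-1")
```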
Yes, S3 buckets are region-specific.
When you create a new bucket you need to select the target region for that bucket.
For example:
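(A minimal boto3 sketch of the same choice; the bucket name and region here are made up.)

```python
import boto3

s3 = boto3.client("s3")

# The LocationConstraint picks the bucket's target region
# (omit the configuration entirely to create the bucket in us-east-1).
s3.create_bucket(
    Bucket="my-example-bucket-eu",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)
```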
Hope it helps.
How it works now is that if you expect the content to load fast globally, you create a bucket in every region you want your data to load quickly from, enable 'Versioning', and use replication to automatically duplicate content from one bucket to the other.
Click on one of your buckets, then go to Management, then go to 'Replication'.
Follow the instructions to set up a rule that will copy from one bucket to another.
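The same rule can be created programmatically; a rough boto3 sketch, assuming both buckets already have versioning enabled (the bucket names, account ID, and IAM role below are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Both the source and destination buckets must already have versioning enabled.
s3.put_bucket_replication(
    Bucket="my-source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/my-s3-replication-role",
        "Rules": [
            {
                "ID": "replicate-everything",
                "Status": "Enabled",
                "Prefix": "",  # empty prefix = replicate the whole bucket
                "Destination": {"Bucket": "arn:aws:s3:::my-destination-bucket"},
            }
        ],
    },
)
```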
Congratulations, you now have globally fast content while only uploading to a single bucket.
I appreciate that this may seem a little off-piste, but I think this is what we are all really trying to achieve.
There is a related answer that highlights one important point: although the console and the CLI allow viewing buckets in all regions, probably due to the fact that bucket names must be globally unique, buckets are still tied to a region.
This matters, for example, when dealing with permissions. You may have generalized Infrastructure as Code with roles that grant permissions to all buckets in the current region. Although the CLI might give you the impression that a newly created bucket can be seen in all regions, in reality you may end up with errors if you fail to specifically grant access to a service running in one region that needs an S3 bucket in another region.
Using S3 cross-region replication, if a user downloads http://mybucket.s3.amazonaws.com/myobject, will it automatically download from the closest region, like CloudFront? So there would be no need to specify the region in the URL, like http://mybucket.s3-[region].amazonaws.com/myobject?
http://aws.amazon.com/about-aws/whats-new/2015/03/amazon-s3-introduces-cross-region-replication/
Bucket names are global, and cross-region replication involves copying your content to a different bucket.
In other words, an 'example' bucket in us-west-1 and an 'example' bucket in us-east-1 is not valid, as there can only be one bucket named 'example'.
That's implied in the announcement post: Mr. Barr uses buckets named jbarr and jbarr-replication.
Using S3 cross-Region replication will put your object into two (or more) buckets in two different Regions.
If you want a single access point that will choose the closest available bucket, then you want to use Multi-Region Access Points (MRAP).
MRAP makes use of Global Accelerator and puts bucket requests onto the AWS backbone at the closest edge location, which provides a faster, more reliable connection to the actual bucket. Global Accelerator also chooses the closest available bucket; if a bucket is not available, it will serve the request from the other bucket, providing automatic failover.
You can also use an active/passive configuration, always serving from one bucket until you initiate a failover.
The MRAP page in the AWS console even shows you a graphical representation of your replication rules.
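Once the MRAP exists, you address it by its ARN instead of a bucket name. A rough boto3 sketch follows; the account ID and MRAP alias are placeholders, and MRAP requests use SigV4A signing, which needs the CRT extras installed (pip install "botocore[crt]").

```python
import boto3

s3 = boto3.client("s3")

# Requests against the Multi-Region Access Point ARN are routed to the
# closest available underlying bucket (requires botocore[crt] for SigV4A).
MRAP_ARN = "arn:aws:s3::123456789012:accesspoint/mfzwi23gnjvgw.mrap"

response = s3.get_object(Bucket=MRAP_ARN, Key="myobject")
body = response["Body"].read()
```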
S3 presents a global namespace, so you don't specify the region when addressing a bucket, and bucket names have to be globally unique.
When you create a bucket, you do need to specify the region, but that doesn't mean you have to put the region name in the URL when you access it. To speed up access from other regions, there are several options, such as:
-- Amazon S3 Transfer Acceleration with the same bucket name (see the sketch after this list).
-- Or set up another bucket with a different name in a different region, enable cross-region replication, and create a CloudFront origin group with the two buckets as origins.
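For the first option, a sketch of toggling Transfer Acceleration with boto3 (the bucket name and file are placeholders); accelerated transfers then go through the s3-accelerate endpoint:

```python
import boto3
from botocore.config import Config

s3 = boto3.client("s3")

# Enable Transfer Acceleration on the (hypothetical) bucket...
s3.put_bucket_accelerate_configuration(
    Bucket="my-example-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# ...then route transfers through the accelerate endpoint.
accelerated = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
accelerated.upload_file("local.html", "my-example-bucket", "index.html")
```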