I'm building a Laravel application that offers an authoring tool to customers. Each customer will get their own subdomain, e.g.:
customer-a.my-tool.com
customer-b.my-tool.com
My tool is hosted on Amazon in multiple regions, partly for performance but mostly for privacy-law reasons (GDPR and similar). Each customer has their data in only one region: Australian customers in Australia, European customers in Europe, and so on. So the customers' users must be directed to the correct region. If a European user ends up being served by the US region, their data won't be there.
We could solve this manually using DNS and simply point each subdomain to the correct IP, but we don't want to do this for two reasons: (1) updating the DNS might take up to 60 seconds, and we don't want the customer to wait; (2) the sites we've researched seem to use wildcard domains, for instance Slack and atlassian.net, and we know that atlassian.net also has multiple regions.
So the question is:
How can we use a wildcard domain and still route the traffic to the regions where the content is located?
Note:
We don't want the content in all regions, but we can have, for instance, a DynamoDB table available in all regions mapping subdomains to regions.
We don't want to tie an organization to a region; i.e., a domain structure like customer-a.region.my-tool.com is an option we've considered but discarded.
We, of course, don't want to pay for transferring the data twice, and having apps in all regions access the databases in the regions the data belongs to is not an option, since it would be slow.
How can we use a wildcard domain and still route the traffic to the regions where the content is located?
It is, in essence, not possible to do everything you are trying to do, given all of the constraints you are imposing: automatically, instantaneously, consistently, and with zero overhead, zero cost, and zero complexity.
But that isn't to say it's entirely impossible.
You have asserted that other vendors are using a "wildcard domain," but a wildcard in DNS is a different concept than I suspect you believe it to be. You cannot prove that a vendor uses a wildcard like *.example.com to the exclusion of other possibilities, because wildcard records are overridden by more specific records.
For a tangible example that you can observe, yourself... *.s3.amazonaws.com has a DNS wildcard. If you query some-random-non-existent-bucket.s3.amazonaws.com, you will find that it's a valid DNS record, and it routes to S3 in us-east-1. If you then create a bucket by that name in another region, and query the DNS a few minutes later, you'll find that it has begun returning a record that points to the S3 endpoint in the region where you created the bucket. Yes, it was and is a wildcard record, but now there's a more specific record that overrides the wildcard. The override will persist for at least as long as the bucket exists.
Architecturally, other vendors that segregate their data by regions (rather than replicating it, which is another possibility, but not applicable to your scenario) must necessarily be doing something along one of these lines:
creating specific DNS records and accepting the delay until the DNS is ready or
implementing what I'll call a "hybrid" environment that behaves one way initially and a different way eventually. This environment uses specific DNS records to override a wildcard, and has the ability to temporarily deliver a misrouted request to the correct cluster via a reverse proxy, allowing instantaneously correct behavior until the DNS propagates; or
operating an ongoing "two-tier" environment, using a wildcard without more specific records to override it: an outer tier that is distributed globally, accepts any request, and has internal routing records that deliver the request to an inner tier -- the correct regional cluster.
The first option really doesn't seem unreasonable. Waiting a short time for your own subdomain to be created seems reasonably common. But, there are other options.
The second option, the hybrid environment, would simply require that the location where your wildcard points by default be able to do some kind of database lookup to determine where the request should go, and proxy the request there. Yes, you would pay for inter-region transport if you implement this yourself in EC2, but only until the DNS update takes effect. Inter-region bandwidth between any two AWS regions costs substantially less than data transfer to the Internet -- far less than "double" the cost.
This might be accomplished in any number of ways that are relatively straightforward.
You must, almost by definition, have a master database of the site configuration somewhere, and this database could be queried by the service that provides the proxying -- HAProxy and Nginx both support proxying, and both support Lua integrations that could be used to look up routing information, which could be cached and used as long as needed to handle the temporarily "misrouted" requests. (HAProxy also has static-but-updatable map tables and dynamic "stick" tables that can be manipulated at runtime by specially-crafted requests; Nginx may offer similar things.)
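As a sketch of what that lookup-plus-cache step might look like -- the configuration data and cluster endpoints below are invented for illustration; a real deployment would query your master database instead of a dict:

```python
from functools import lru_cache

# Stand-in for the master site-configuration database (illustrative data).
SITE_CONFIG = {"customer-a": "eu-west-1", "customer-b": "ap-southeast-2"}

# Hypothetical regional cluster endpoints the proxy can forward to.
CLUSTERS = {
    "eu-west-1": "https://eu.internal.my-tool.example",
    "ap-southeast-2": "https://ap.internal.my-tool.example",
}

@lru_cache(maxsize=4096)
def proxy_target(host):
    """Map an incoming Host header to the backend cluster URL.
    Cached, so the database is only hit once per subdomain per process."""
    subdomain = host.split(".", 1)[0]
    region = SITE_CONFIG.get(subdomain)  # the "database lookup"
    return CLUSTERS.get(region)

print(proxy_target("customer-a.my-tool.com"))  # -> https://eu.internal.my-tool.example
```

The same shape works as the lookup body of an HAProxy or Nginx Lua hook; the cache is what keeps the misrouted-request path from hammering the master database.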
But EC2 isn't the only way to handle this.
Lambda@Edge allows a CloudFront distribution to select a back-end based on logic -- such as a query to a DynamoDB table or a call to another Lambda function that can query a relational database. Your "wildcard" CloudFront distribution could implement such a lookup, caching results in memory (container reuse allows very simple in-memory caching using simply an object in a global variable). Once the DNS record propagates, the requests would go directly from the browser to the appropriate back-end. CloudFront is marketed as a CDN, but it is in fact a globally-distributed reverse proxy with an optional response-caching capability. This capability may not be obvious at first.
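A minimal sketch of such a Lambda@Edge origin-request handler (Python runtime), with the DynamoDB lookup stubbed out as a local dict and all hostnames invented for illustration:

```python
# Sketch of a Lambda@Edge origin-request handler. The routing table is a
# stub; a real function would query DynamoDB and cache the result in this
# module-level dict, which survives across invocations via container reuse.
ROUTING_CACHE = {}

def lookup_origin(host):
    """Stand-in for a DynamoDB get_item call keyed on the tenant hostname."""
    table = {
        "customer-a.my-tool.com": "origin-eu.my-tool.com",
        "customer-b.my-tool.com": "origin-ap.my-tool.com",
    }
    return table.get(host)

def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    host = request["headers"]["host"][0]["value"]
    origin = ROUTING_CACHE.get(host)
    if origin is None:
        origin = lookup_origin(host)
        ROUTING_CACHE[host] = origin
    # Re-point the request at the regional cluster that owns this tenant.
    request["origin"] = {"custom": {
        "domainName": origin,
        "port": 443,
        "protocol": "https",
        "path": "",
        "sslProtocols": ["TLSv1.2"],
        "readTimeout": 30,
        "keepaliveTimeout": 5,
        "customHeaders": {},
    }}
    # For a custom origin, the Host header must match the origin domain.
    request["headers"]["host"] = [{"key": "Host", "value": origin}]
    return request
```

The cache-then-lookup structure is the whole trick: the first request per tenant per container pays for the lookup, and everything after that is served from the global variable.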
In fact, CloudFront and Lambda@Edge could be used for a scenario such as yours in either the "hybrid" environment or the "two-tier" environment. The outer tier is CloudFront -- which automatically routes requests to the edge on the AWS network that is nearest the viewer, at which point a routing decision can be made at the edge to determine the correct cluster of your inner tier to handle the request. You don't pay for anything twice here, since bandwidth from EC2 to CloudFront costs nothing. This will not impact site performance other than the time necessary for that initial database lookup, and once your active containers have that cached, the responsiveness of the site will not be impaired. CloudFront, in general, improves the responsiveness of sites even when most of the content is dynamic, because it optimizes both the network path and the protocol exchanges between the viewer and your back-end, with optimized TCP stacks and connection reuse (particularly helpful at reducing the multiple round trips required by TLS handshakes).
In fact, CloudFront seems to offer an opportunity to have it both ways -- an initially hybrid capability that automatically morphs into a two-tier infrastructure -- because CloudFront distributions also have a wildcard functionality with overrides: a distribution with *.example.com handles all requests unless a distribution with a more specific domain name is provisioned -- at which point the other distribution will start handling the traffic. CloudFront takes a few minutes before the new distribution overrides the wildcard, but when the switchover happens, it's clean. A few minutes after the new distribution is configured, you make a parallel DNS change to the newly assigned hostname for the new distribution, but CloudFront is designed in such a way that you do not have to tightly coordinate this change -- all endpoints will handle all domains, because CloudFront doesn't use the endpoint to make the routing decision; it uses SNI and the HTTP Host header.
This seems almost like a no-brainer. A default, wildcard CloudFront distribution is pointed to by a default, wildcard DNS record, and uses Lambda@Edge to identify which of your clusters handles a given subdomain using a database lookup, followed by the deployment -- automated, of course -- of a distribution for each of your customers, which already knows how to forward the request to the correct cluster, so no further database queries are needed after the subdomain is fully live. You'll need to ask AWS Support to increase your account's limit for the number of CloudFront distributions from the default of 200, but that should not be a problem.
There are multiple ways to accomplish that database lookup. As mentioned before, the Lambda@Edge function can invoke a second Lambda function inside a VPC to query the database for routing instructions, or you could push the domain-location config to a DynamoDB global table, which would replicate your domain routing instructions to multiple DynamoDB regions (currently Virginia, Ohio, Oregon, Ireland, and Frankfurt), and DynamoDB can be queried directly from a Lambda@Edge function.
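A hedged sketch of the DynamoDB variant -- the table name, the replica-region list, and the use of the AWS_REGION runtime variable to pick the nearest replica are all assumptions for illustration:

```python
import os

# Regions assumed to host a replica of a hypothetical "subdomain-routes"
# global table. Adjust to wherever your global table actually replicates.
REPLICA_REGIONS = ["us-east-1", "us-east-2", "us-west-2",
                   "eu-west-1", "eu-central-1"]

def replica_for_edge(edge_region, replicas=None):
    """Pick the replica to query: the edge function's own region if it
    has one, otherwise fall back to a default region."""
    replicas = REPLICA_REGIONS if replicas is None else replicas
    return edge_region if edge_region in replicas else "us-east-1"

def route_for(subdomain):
    """Query the nearest DynamoDB replica for this subdomain's region.
    (Not exercised here; requires AWS credentials at runtime.)"""
    import boto3  # available in the Lambda Python runtime
    region = replica_for_edge(os.environ.get("AWS_REGION", "us-east-1"))
    table = boto3.resource("dynamodb", region_name=region).Table("subdomain-routes")
    item = table.get_item(Key={"subdomain": subdomain}).get("Item")
    return item["region"] if item else None
```

The point of `replica_for_edge` is that the same function code runs at every edge, yet each copy queries the replica closest to it, keeping the lookup latency low.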
Related
I have an application running on EC2 instances behind an ALB.
The application basically serves dynamic HTML pages. To reduce the load on the application instances, I was planning to save the rendered HTML to S3 and serve it from there instead of from the application instances.
Whenever we receive a request, the route should go to either the ALB or S3, depending on whether the page is stored in S3.
For that we are planning to use CloudFront and Lambda@Edge to dynamically route the traffic to different origins, depending on a value set in DynamoDB for each route.
So far in testing this seems to be working fine; the only issue is the increased latency from DynamoDB, and for pages not stored in S3 the Lambda plus the application adds considerable latency.
I would like to know if there are any better approaches than this, and whether there is any better storage mechanism than DynamoDB that we can use with Lambda@Edge.
Can we apply this logic, behind a feature flag, to only certain routes? (We thought of CloudFront behaviors, but with 500+ routes, separate behaviors won't be feasible.)
How good is caching things in Lambda memory for a certain amount of time, and is it possible?
We have tried using a DynamoDB global table to add a replica in the region closer to the users, but it still adds latency.
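On the in-memory caching asked about above: yes, it is possible, because module-level state survives for as long as the Lambda container is reused. A minimal TTL-cache sketch (the injectable `now` clock is only there so the expiry logic can be exercised deterministically):

```python
import time

# Module-level store: lives for the lifetime of the Lambda container,
# which is exactly the scope you get "for free" with container reuse.
_store = {}  # key -> (value, expiry)

def cache_get(key, now=None):
    """Return the cached value, or None if absent or expired."""
    now = time.monotonic() if now is None else now
    entry = _store.get(key)
    if entry is None or entry[1] <= now:
        return None
    return entry[0]

def cache_put(key, value, ttl=60, now=None):
    """Cache a value for ttl seconds."""
    now = time.monotonic() if now is None else now
    _store[key] = (value, now + ttl)
```

A short TTL (tens of seconds) bounds how stale a routing decision can be while still eliminating almost all of the per-request DynamoDB round trips on a warm container.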
I have an app hosted on 2 VMs in different regions (say US and Australia). Users will have their data in one of these regions because of data residency constraints.
Let's say a user whose data is on the AU server is travelling to Canada; geographical/latency-based routing will route the user to the closest server, which is the US one.
My app can internally respond to redirect this user to the AU server. How can I make all subsequent requests go to the AU server directly, without going through the US server?
I presume that you have configured a DNS Name in Amazon Route 53 to use Georouting or Latency-Based Routing to direct users to a destination.
To direct the user to a different location, you would need to send them to an IP address or DNS name that does not go through this georouting/latency lookup.
For example:
Create sydney.example.com to point to the Sydney server
Create ohio.example.com to point to the Ohio server
Create example.com to use geo/latency routing that resolves to either sydney. or ohio.
When you want to redirect a user to a specific location (eg Sydney), send them to sydney.example.com to avoid going through the geo/latency lookup again
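The application-side redirect this implies could be as simple as the following sketch (the hostnames and region codes are illustrative; they mirror the sydney./ohio. layout above):

```python
# Region-specific hostnames that bypass the geo/latency-routed apex record.
REGION_HOSTS = {
    "ap-southeast-2": "sydney.example.com",
    "us-east-2": "ohio.example.com",
}

def redirect_if_misrouted(user_home_region, serving_region, path):
    """If this server doesn't hold the user's data, return a redirect
    target on the region-specific hostname; otherwise None (serve locally)."""
    if user_home_region == serving_region:
        return None
    return "https://%s%s" % (REGION_HOSTS[user_home_region], path)

print(redirect_if_misrouted("ap-southeast-2", "us-east-2", "/dashboard"))
# -> https://sydney.example.com/dashboard
```

The app would issue a 301/302 to that URL; because sydney.example.com resolves directly to the Sydney server, subsequent requests skip the geo/latency lookup entirely.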
You need to architecturally separate the data residency requirements of a given user's profile data from the performance considerations of content delivery; this will help you get further along on both tracks.
For example, a user's data residency is typically decided at account-creation time and rarely changes subsequently: their "data at rest" stays in Australia. The data at rest may be exposed to all your application servers globally (including those booted up in a Canadian data center) using uniform internal API or database-host access schemes.
The above architecture allows at-rest data to remain in Australia, while "data in transit" or cached-in-memory data (e.g. any KV store) is accessible from any application-server "worker" anywhere in the world. Make sure you use encryption while the data is in transit, and audit any persistence of said data.
You can layer on further performance by maximising HTTP responses that are cacheable freely by CDNs and any intermediary caches (the Cache-Control HTTP header), as well as using the geo-routing strategies you already mention.
I have been trying to understand why an S3 bucket name has to be globally unique. I came across a Stack Overflow answer that says the bucket name has to be unique in order to resolve the Host header. My point, however, is: can't AWS direct s3-region.amazonaws.com to a region-specific web server that serves the bucket objects from that region? That way the name would only need to be unique within a region, meaning the same bucket name could be created in a different region. Please let me know if my understanding of how name resolution works is wrong.
There is not, strictly speaking, a technical reason why the bucket namespace absolutely had to be global. In fact, it technically isn't quite as global as most people might assume, because S3 has three distinct partitions that are completely isolated from each other and do not share the same global bucket namespace across partition boundaries -- the partitions are aws (the global collection of regions most people know as "AWS"), aws-us-gov (US GovCloud), and aws-cn (the Beijing and Ningxia isolated regions).
So things could have been designed differently, with each region independent, but that is irrelevant now, because the global namespace is entrenched.
But why?
The specific reasons for the global namespace aren't publicly stated, but almost certainly have to do with the evolution of the service, backwards compatibility, and ease of adoption of new regions.
S3 is one of the oldest of the AWS services, older than even EC2. They almost certainly did not foresee how large it would become.
Originally, the namespace was global of necessity because there weren't multiple regions. S3 had a single logical region (called "US Standard" for a long time) that was in fact comprised of at least two physical regions, in or near us-east-1 and us-west-2. You didn't know or care which physical region each upload went to, because they replicated back and forth, transparently, and latency-based DNS resolution automatically gave you the endpoint with the lowest latency. Many users never knew this detail.
You could even explicitly override the automatic geo-routing of DNS and upload to the east using the s3-external-1.amazonaws.com endpoint or to the west using the s3-external-2.amazonaws.com endpoint, but your object would shortly be accessible from either endpoint.
Up until this point, S3 did not offer immediate read-after-write consistency on new objects since that would be impractical in the primary/primary, circular replication environment that existed in earlier days.
Eventually, S3 launched in other AWS regions as they came online, but they designed it so that a bucket in any region could be accessed as ${bucket}.s3.amazonaws.com.
This used DNS to route the request to the correct region, based on the bucket name in the hostname, and S3 maintained the DNS mappings. *.s3.amazonaws.com was (and still is) a wildcard record that pointed everything to "S3 US Standard" but S3 would create a CNAME for your bucket that overrode the wildcard and pointed to the correct region, automatically, a few minutes after bucket creation. Until then, S3 would return a temporary HTTP redirect. This, obviously enough, requires a global bucket namespace. It still works for all but the newest regions.
But why did they do it that way? After all, at around the same time S3 also introduced endpoints in the style ${bucket}.s3-${region}.amazonaws.com ¹ that are actually wildcard DNS records: *.s3-${region}.amazonaws.com routes directly to the regional S3 endpoint for each S3 region, and is a responsive (but unusable) endpoint, even for nonexistent buckets. If you create a bucket in us-east-2 and send a request for that bucket to the eu-west-1 endpoint, S3 in eu-west-1 will throw an error, telling you that you need to send the request to us-east-2.
Also, around this time, they quietly dropped the whole east/west replication thing, and later renamed US Standard to what it really was at that point -- us-east-1. (Buttressing the "backwards compatibility" argument, s3-external-1 and s3-external-2 are still valid endpoints, but they both point to precisely the same place, in us-east-1.)
So why did the bucket namespace remain global? The only truly correct answer an outsider can give is "because that's what they decided to do."
But perhaps one factor was that AWS wanted to preserve compatibility with existing software that used ${bucket}.s3.amazonaws.com so that customers could deploy buckets in other regions without code changes. In the old days of Signature Version 2 (and earlier), the code that signed requests did not need to know the API endpoint region. Signature Version 4 requires knowledge of the endpoint region in order to generate a valid signature because the signing key is derived against the date, region, and service... but previously it wasn't like that, so you could just drop in a bucket name and client code needed no regional awareness -- or even awareness that S3 even had regions -- in order to work with a bucket in any region.
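To make the Signature Version 4 point concrete, this is the documented key-derivation chain; note how the region is baked into the signing key, which is why a SigV4 client cannot be region-unaware the way a SigV2 client could:

```python
import hashlib
import hmac

def sigv4_signing_key(secret_key, date_stamp, region, service):
    """Derive the SigV4 signing key: HMAC chained over date, region,
    service, and the literal string "aws4_request" (per AWS docs)."""
    def sign(key, msg):
        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()
    k_date = sign(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = sign(k_date, region)
    k_service = sign(k_region, service)
    return sign(k_service, "aws4_request")
```

Because the region is an HMAC input, a signature computed for us-east-1 is invalid in eu-west-1 -- there is no way to sign a request without first knowing where the bucket lives.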
AWS is well-known for its practice of preserving backwards compatibility. They do this so consistently that occasionally some embarrassing design errors creep in and remain unfixed because to fix them would break running code.²
Another issue is virtual hosting of buckets. Back before HTTPS was accepted as non-optional, it was common to host static content by pointing your CNAME to the S3 endpoint. If you pointed www.example.com to S3, it would serve the content from a bucket with the exact name www.example.com. You can still do this, but it isn't useful any more since it doesn't support HTTPS. To host static S3 content with HTTPS, you use CloudFront in front of the bucket. Since CloudFront rewrites the Host header, the bucket name can be anything. You might be asking why you couldn't just point the www.example.com CNAME to the endpoint hostname of your bucket, but HTTP and DNS operate at very different layers and it simply doesn't work that way. (If you doubt this assertion, try pointing a CNAME from a domain that you control to www.google.com. You will not find that your domain serves the Google home page; instead, you'll be greeted with an error, because the Google server will only see that it's received a request for www.example.com and be oblivious to the fact that there was an intermediate CNAME pointing to it.) Virtual hosting of buckets requires either a global bucket namespace (so the Host header exactly matches the bucket) or an entirely separate mapping database of hostnames to bucket names... and why do that when you already have an established global namespace of buckets?
¹ Note that the - after s3 in these endpoints was eventually replaced by a much more logical . but these old endpoints still work.
² Two examples that come to mind: (1) S3's incorrect omission of the Vary: Origin response header when a non-CORS request arrives at a CORS-enabled bucket (I have argued, to no avail, that this could be fixed without breaking anything); (2) S3's blatantly incorrect handling of the + symbol in an object key on the API, where the service interprets + as meaning %20 (space), so if you want a browser to download from a link to /foo+bar you have to upload it as /foo{space}bar.
You create an S3 bucket in a specific region only, and objects stored in a bucket are stored only in that region. The data is neither replicated to nor stored in other regions, unless you set up replication on a per-bucket basis.
However, AWS S3 shares a global namespace across all accounts, so the name given to an S3 bucket must be globally unique.
This requirement is designed to support a globally unique DNS name for each bucket, e.g. http://bucketname.s3.amazonaws.com
I have the following S3 buckets:
"client1"
"client2"
...
"clientX"
and our clients upload data to their buckets via a jar app (client1 to bucket client1, etc.). Here is a piece of code:
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.regions.*;
import com.amazonaws.services.s3.*;
import com.amazonaws.services.s3.model.*;
import java.io.File;

BasicAWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
AmazonS3 s3client = new AmazonS3Client(credentials);
s3client.setRegion(Region.getRegion(Regions.US_EAST_1));
File file = new File(OUTPUT_DIRECTORY + '/' + fileName);
s3client.putObject(new PutObjectRequest(bucketName, datasource + '/' + fileName, file));
and the problem is that they have a firewall for outbound traffic. They must allow the URL *.amazonaws.com in the firewall. Is it possible to set the endpoint to our own domain, storage.domain.com?
We expect to change region in the future, but all our clients are locked to amazonaws.com = US_EAST_1 region now, so all of them would need to change the rules in their firewalls.
If the endpoint were storage.domain.com, everything would be OK :)
Example of expected clients URL
client1 will put data to URL client1.storage.domain.com
client2 will put data to URL client2.storage.domain.com
clientX will put data to URL clientX.storage.domain.com
We know about the setting in CloudFront, but it's per bucket. We are looking for a solution with one global AWS setting. How can we do that?
Thank you very much
Not sure if this will be affordable for you (due to the extra fees you may incur), but this should work:
Create Route53 with your domain and subdomains (client1, client2.. clientX)
Create (or use default) VPC with endpoints (https://aws.amazon.com/blogs/aws/new-vpc-endpoint-for-amazon-s3/)
Route all traffic from Route53 to your VPC through the internet gateway (IGW)
You may need to have Security group and NACL things configured. Let me know if you need further details.
There are numerous factors at play, here, not the least of which is support for SSL.
First, we need to eliminate one obvious option that will not work:
S3 supports naming a bucket after a domain name and then pointing a CNAME at the bucket. So, for example, if you name a bucket client-1.storage.example.com and then create a DNS CNAME (in Route 53) pointing client-1.storage.example.com to client-1.storage.example.com.s3.amazonaws.com, this bucket becomes accessible on the Internet as client-1.storage.example.com.
This works only if you do not try to use HTTPS. The reason for the limitation is a combination of factors which are outside the scope of this answer. There is no workaround that uses only S3; workarounds require additional components.
Even though the scenario above will not work for your application, let's assume for a moment that it will, since it makes another problem easy to illustrate:
We are finding solution with one global AWS setting
This may not be a good idea, even if it is possible. In the above scenario, it would be tempting for you to set up a wildcard CNAME so that *.storage.example.com CNAME s3[-region].amazonaws.com which would give you a magical DNS entry that would work for any bucket with a name matching *.storage.example.com and created in the appropriate region... but there is a serious vulnerability in such a configuration -- I could create a bucket called sqlbot.storage.example.com (assuming no such bucket already existed) and now I have a bucket that you do not control, using a hostname under your domain, and you don't have any way to know about it, or stop it. I can potentially use this to breach your clients' security because now my bucket is accessible from inside your client's firewall, thanks to the wildcard configuration.
No, you really need to automate the steps to deploy each client, regardless of the ultimate solution, rather than relying on a single global setting. All AWS services (S3, Route 53, etc.) lend themselves to automation.
CloudFront seems like it holds the key to the simplest solution, by allowing you to map each client hostname to their own bucket. Yes, this does require a CloudFront distribution to be configured for each client, but this operation can also be automated, and there isn't a charge for each CloudFront distribution. The only charges for CloudFront are usage-related (per request and per GB transferred). Additional advantages here include SSL support (including a wildcard *.storage.example.com certificate from ACM, which can be shared across multiple CloudFront distributions) and the fact that with CloudFront in the path, you do not need the bucket name and the hostname to be the same.
This also gives you the advantage of being able to place each bucket in the most desirable region for that specific bucket. It is, however, limited to files not exceeding 20 GB in size, due to the size limit imposed by CloudFront.
But the problem with using CloudFront for applications with a large number of uploads of course is that you're going to pay bandwidth charges for the uploads. In Europe, Canada, and the US, it's cheap ($0.02/GB) but in India it is much more expensive ($0.16/GB), with other areas varying in price between these extremes. (You pay for downloads with CloudFront, too, but in that case, S3 does not bill you for any bandwidth charges when downloads are pulled through CloudFront... so the consideration is not usually as significant, and adding CloudFront in front of S3 for downloads can actually be slightly cheaper than using S3 alone).
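A sketch of what the per-client automation mentioned above might look like with boto3. The config builder is deliberately simplified -- the real create_distribution call requires additional fields (cache behavior settings, unique caller references, etc.) -- and the bucket, hostname, and certificate ARN are placeholders:

```python
# Build a (simplified, illustrative) CloudFront DistributionConfig
# mapping one client hostname to that client's bucket.
def distribution_config(client_host, bucket, acm_cert_arn):
    origin_id = "s3-" + bucket
    return {
        "CallerReference": client_host,
        "Aliases": {"Quantity": 1, "Items": [client_host]},
        "Origins": {"Quantity": 1, "Items": [{
            "Id": origin_id,
            "DomainName": bucket + ".s3.amazonaws.com",
            "S3OriginConfig": {"OriginAccessIdentity": ""},
        }]},
        "DefaultCacheBehavior": {
            "TargetOriginId": origin_id,
            "ViewerProtocolPolicy": "redirect-to-https",
        },
        "ViewerCertificate": {
            "ACMCertificateArn": acm_cert_arn,  # shared wildcard cert
            "SSLSupportMethod": "sni-only",
        },
        "Comment": "auto-provisioned for " + client_host,
        "Enabled": True,
    }

# To create it for real (requires AWS credentials and boto3):
# import boto3
# boto3.client("cloudfront").create_distribution(
#     DistributionConfig=distribution_config(
#         "client1.storage.example.com", "client1", cert_arn))
```

Run once per client at provisioning time (plus the matching Route 53 record), this replaces the unsafe global wildcard with an explicit, auditable mapping.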
So, while CloudFront is probably the official answer, there are a couple of considerations that are still potentially problematic.
S3 transfer acceleration avoids the other problem you mentioned -- the bucket regions. Buckets with transfer acceleration enabled are accessible at https://bucket-name.s3-accelerate.amazonaws.com regardless of the bucket region, so that's a smaller hole to open, but the transfer acceleration feature is only supported for buckets without dots in their bucket names. And transfer acceleration comes with additional bandwidth charges.
So where does this leave you?
There's not a built-in, "serverless" solution that I can see that would be simple, global, automatic, and inexpensive.
It seems unlikely, in my experience, that a client security-conscious enough to restrict web access by domain would simultaneously be willing to whitelist what is in effect a wildcard (*.storage.example.com), which could result in trusting traffic that should not be trusted. Granted, it would be better than *.amazonaws.com, but it's not clear just how much better.
I'm also reasonably confident that many security configurations rely on static IP address whitelisting, rather than whitelisting by name... filtering by name in an HTTPS environment has implications and complications of its own.
Faced with such a scenario, my solution would revolve around proxy servers deployed in EC2 -- in the same region as the buckets -- which would translate the hostnames in the requests into bucket names and forward the requests to S3. These could be deployed behind ELB or could be deployed on Elastic IP addresses, load balanced using DNS from Route 53, so that you have static endpoint IP addresses for clients that need them.
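The core of such a proxy is just the hostname-to-bucket translation; a sketch, with the domain suffix and the host-equals-bucket naming convention assumed for illustration:

```python
# Hypothetical shared suffix for all client hostnames.
SUFFIX = ".storage.example.com"

def s3_url_for(host, key):
    """Translate client1.storage.example.com plus an object key into the
    S3 URL the proxy should forward the request to."""
    if not host.endswith(SUFFIX):
        raise ValueError("unrecognized host: " + host)
    bucket = host[:-len(SUFFIX)]  # e.g. "client1"
    return "https://%s.s3.amazonaws.com/%s" % (bucket, key)

print(s3_url_for("client1.storage.example.com", "data/file.csv"))
# -> https://client1.s3.amazonaws.com/data/file.csv
```

The rejection of unknown hosts matters here: it is what prevents the wildcard-squatting problem described earlier, since the proxy only forwards for hostnames you have provisioned.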
Note also that any scenario involving Host: header rewrites for requests authorized by AWS Signature V4 will mean that you have to modify your application's code to sign the requests with the real hostname of the target S3 endpoint, while sending the requests to a different hostname. Sending requests directly to the bucket endpoint (including the transfer acceleration endpoint) is the only way to avoid this.
OK, so I have an Amazon S3 bucket to which I want to allow users to upload files directly from the client over HTTPS.
In order to do this it became apparent that I would have to change the bucket name from a format using periods to a format using dashes. So:
my.bucket.com
became:
my-bucket-com
This being required due to a limitation of https authentication which can't deal with periods in the bucket name when resolving the S3 endpoint.
So everything is peachy, except now I'd like to allow access to those files while hiding the fact that they are being stored on Amazon S3.
The obvious choice seems to be to use Route 53 zone configuration records to add a CNAME record to point my URL at the bucket, given that I already have the 'bucket.com' domain:
my.bucket.com > CNAME > my-bucket-com.s3.amazonaws.com
However, I now seem to have hit another limitation, in that Amazon seem to insist that the name of the CNAME record must match the bucket name exactly so the above example will not work.
My temporary solution is to use a reverse proxy on an EC2 instance while traffic volumes are low. But this is not a good or long term solution as it means that all S3 access is being funneled through the proxy server causing extra server load, and data transfer charges. Not to mention the solution really isn't scalable when traffic volumes start to increase.
So is it possible to achieve both of my goals above or are they mutually exclusive?
If I want to be able to upload directly from clients over https, I can't then hide the S3 url from end users accessing that content and vice versa?
Well, there simply doesn't seem to be a straightforward way of achieving this.
There are 2 possible solutions:
1.) Put your S3 bucket behind Amazon CloudFront. This does incur a lot more charges, albeit with the added benefit of lower-latency regional access to your content.
2.) The solution we will go with is simply to split the bucket into two.
One for uploads from HTTPS clients (my-bucket-com), and one for CNAME-aliased access to that content (my.bucket.com). This keeps the costs down, although it will involve extra steps in organising the content before it can be accessed.