I have an application running on EC2 instances behind an ALB. The application basically serves dynamic HTML pages. To reduce the load on the application instances, I was planning to save the rendered HTML to S3 and serve it from there instead of from the application instances.
Whenever we receive a request on a route, it should be routed to either the ALB or S3, depending on whether the page is stored in S3.
For that we are planning to use CloudFront and Lambda@Edge to dynamically route the traffic to different origins, depending on a value set in DynamoDB for the route.
So far, in testing it seems to be working fine. The only issue is increased latency: the DynamoDB lookup adds latency, and for pages not stored in S3 the Lambda@Edge hop plus the round trip to the application adds considerable latency.
I would like to know if there are any better approaches than this, and whether there is a better storage mechanism than DynamoDB that we can use with Lambda@Edge.
Can we apply this logic, behind a feature flag, to only certain routes? (I thought of CloudFront behaviors, but we have more than 500 routes, so separate behaviors won't be feasible.)
How good is caching things in Lambda memory for a certain amount of time? Is that even possible?
We have tried using DynamoDB global tables to add a replica in a region closer to the users, but it still adds too much latency.
Can anybody simply summarize why/when I would use one (CloudFront) over the other (DAX)?
CloudFront is AWS's CDN, or content delivery network. It's a collection of over 200 server deployments all over the world (called edges or PoPs) that all advertise the same URL / IP address range. When you have data that never or rarely changes, you can host it somewhere and serve it through a CloudFront URL - each of these edges will cache a copy of the data and serve it very quickly on subsequent requests. Since you now have hundreds of servers all over the world on your side, the amount of data you can serve and the speed at which you can serve it increase by many orders of magnitude. You'll use CloudFront by giving your end users CloudFront URLs that they hit directly.
DAX is a caching layer that's specifically tied to AWS's DynamoDB database, which is a key-value and range-query storage structure. DAX sits in front of DynamoDB, storing frequently used keys and values in memory, which allows it to serve them back to you quickly without actually hitting DynamoDB. Any write that happens also automatically clears the cache for that key. This is local to a particular region and DynamoDB table. You'll use DAX by installing a special client SDK in your server code that understands the special protocol, and passing all DynamoDB reads and writes through it.
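For a rough sense of what that looks like in practice, here is a minimal Python sketch using the amazon-dax-client package; the cluster endpoint, table name, and key are placeholders, and the exact constructor arguments can vary by client version:

# A minimal sketch, assuming a DAX cluster endpoint and a DynamoDB table
# named "users"; both are placeholders. The DAX client mirrors the low-level
# DynamoDB client API, so existing reads and writes pass through the cache.
import botocore.session
from amazondax import AmazonDaxClient

session = botocore.session.get_session()
dax = AmazonDaxClient(
    session,
    region_name="us-east-1",
    endpoints=["my-dax-cluster.xxxxxx.dax-clusters.us-east-1.amazonaws.com:8111"],
)

key = {"user_id": {"S": "42"}}
# Read: served from the DAX in-memory cache when the key is hot.
item = dax.get_item(TableName="users", Key=key)
# Write: passes through to DynamoDB and invalidates the cached entry for this key.
dax.put_item(TableName="users", Item={"user_id": {"S": "42"}, "name": {"S": "Alice"}})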
I'm building a Laravel application that offers an authoring tool to customers. Each customer will get their own subdomain, e.g.:
customer-a.my-tool.com
customer-b.my-tool.com
My tool is hosted on Amazon in multiple regions for performance but mostly for privacy-law reasons (GDPR++). Each customer has their data in only one region: Australian customers in Australia, European customers in Europe, and so on. So each customer's users must be directed to the correct region. If a European user ends up being served by the US region, their data won't be there.
We can solve this manually using DNS and simply point each subdomain to the correct IP, but we don't want to do this for two reasons: (1) updating the DNS might take up to 60 seconds, and we don't want the customer to wait; (2) it seems the sites we've researched use wildcard domains, for instance Slack and atlassian.net, and we know that atlassian.net also has multiple regions.
So the question is:
How can we use a wildcard domain and still route the traffic to the regions where the content is located?
Note:
We don't want the content in all regions, but we can have, for instance, a DynamoDB table available in all regions mapping subdomains to regions.
We don't want to tie an organization to a region. I.e. a domain structure like customer-a.region.my-tool.com is an option we've considered, but discarded
We, of course, don't want to pay for transferring the data twice, and having apps in all regions access the databases in the regions the data belongs to is not an option, since it would be slow.
How can we use a wildcard domain and still route the traffic to the regions where the content is located?
It is, in essence, not possible to do everything you are trying to do, given all of the constraints you are imposing: automatically, instantaneously, consistently, and with zero overhead, zero cost, and zero complexity.
But that isn't to say it's entirely impossible.
You have asserted that other vendors are using a "wildcard domain," which is a concept that entails something different from what I suspect you believe it does. A wildcard in DNS, like *.example.com, is not something you can observe from the outside to the exclusion of other possibilities, because wildcard records are overridden by more specific records.
For a tangible example that you can observe, yourself... *.s3.amazonaws.com has a DNS wildcard. If you query some-random-non-existent-bucket.s3.amazonaws.com, you will find that it's a valid DNS record, and it routes to S3 in us-east-1. If you then create a bucket by that name in another region, and query the DNS a few minutes later, you'll find that it has begun returning a record that points to the S3 endpoint in the region where you created the bucket. Yes, it was and is a wildcard record, but now there's a more specific record that overrides the wildcard. The override will persist for at least as long as the bucket exists.
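If you want to reproduce that observation yourself, a quick sketch using only the Python standard library (the bucket name is a placeholder):

# Observe the wildcard-plus-override behavior described above.
# The bucket name is a placeholder; any non-existent name will be answered
# by the wildcard record, while an existing bucket eventually gets a more
# specific record for its region.
import socket

name = "some-random-non-existent-bucket.s3.amazonaws.com"
# Returns (canonical_name, alias_list, ip_addresses).
print(socket.gethostbyname_ex(name))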
Architecturally, other vendors that segregate their data by regions (rather than replicating it, which is another possibility, but not applicable to your scenario) must necessarily be doing something along one of these lines:
creating specific DNS records and accepting the delay until the DNS is ready or
implementing what I'll call a "hybrid" environment that behaves one way initially and a different way eventually; this environment uses specific DNS records to override a wildcard, and has the ability to temporarily deliver a misrouted request, via a reverse proxy, to the correct cluster, allowing instantaneous correct behavior until the DNS propagates, or
an ongoing "two-tier" environment, using a wildcard without more specific records to override it, with an outer tier that is distributed globally, accepts any request, and has internal routing records that deliver the request to the inner tier -- the correct regional cluster.
The first option really doesn't seem unreasonable. Waiting a short time for your own subdomain to be created seems reasonably common. But, there are other options.
The second option, the hybrid environment, would simply require that the location your wildcard points to by default be able to do some kind of database lookup to determine where the request should go, and proxy the request there. Yes, you would pay for inter-region transport if you implement this yourself in EC2, but only until the DNS update takes effect. Inter-region bandwidth between any two AWS regions costs substantially less than data transfer to the Internet -- far less than "double" the cost.
This might be accomplished in any number of ways that are relatively straightforward.
You must, almost by definition, have a master database of the site configuration, somewhere, and this system could be queried by a complicated service that provides the proxying -- HAProxy and Nginx both support proxying and both support Lua integrations that could be used to do a lookup of routing information, which could be cached and used as long as needed to handle the temporarily "misrouted" requests. (HAProxy also has static-but-updatable map tables and dynamic "stick" tables that can be manipulated at runtime by specially-crafted requests; Nginx may offer similar things.)
But EC2 isn't the only way to handle this.
Lambda@Edge allows a CloudFront distribution to select a back-end based on logic -- such as a query to a DynamoDB table or a call to another Lambda function that can query a relational database. Your "wildcard" CloudFront distribution could implement such a lookup, caching results in memory (container reuse allows very simple in-memory caching, using simply an object in a global variable). Once the DNS record propagates, the requests would go directly from the browser to the appropriate back-end. CloudFront is marketed as a CDN, but it is in fact a globally-distributed reverse proxy with an optional response caching capability. This capability may not be obvious at first.
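As a rough sketch of that pattern (not a definitive implementation), a Python Lambda@Edge origin-request handler might look like the following; the table name, key schema, and origin hostnames are invented for the example:

# Sketch: look up which regional cluster serves a subdomain in DynamoDB,
# cache the answer in a module-level dict for the life of the container,
# and rewrite the request's custom origin. Names below are placeholders.
import time
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

_CACHE = {}        # subdomain -> (origin_domain, fetched_at)
_CACHE_TTL = 300   # seconds; only needed until the specific DNS record takes over

def _lookup_origin(subdomain):
    hit = _CACHE.get(subdomain)
    if hit and time.time() - hit[1] < _CACHE_TTL:
        return hit[0]
    resp = dynamodb.get_item(
        TableName="subdomain-routing",              # assumed table name
        Key={"subdomain": {"S": subdomain}},
    )
    origin = resp["Item"]["origin_domain"]["S"]     # e.g. "eu.my-tool.example"
    _CACHE[subdomain] = (origin, time.time())
    return origin

def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    host = request["headers"]["host"][0]["value"]   # e.g. customer-a.my-tool.com
    origin_domain = _lookup_origin(host)

    # Send the request to the correct regional cluster (custom origin).
    request["origin"]["custom"]["domainName"] = origin_domain
    request["headers"]["host"] = [{"key": "Host", "value": origin_domain}]
    return request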
In fact, CloudFront and Lambda@Edge could be used for such a scenario as yours in either the "hybrid" environment or the "two-tier" environment. The outer tier is CloudFront -- which automatically routes requests to the edge on the AWS network that is nearest the viewer, at which point a routing decision can be made at the edge to determine the correct cluster of your inner tier to handle the request. You don't pay for anything twice here, since bandwidth from EC2 to CloudFront costs nothing. This will not impact site performance other than the time necessary for that initial database lookup, and once your active containers have that cached, the responsiveness of the site will not be impaired. CloudFront, in general, improves responsiveness of sites even when most of the content is dynamic, because it optimizes both the network path and the protocol exchanges between the viewer and your back-end, with optimized TCP stacks and connection reuse (particularly helpful at reducing the multiple round trips required by TLS handshakes).
In fact, CloudFront seems to offer an opportunity to have it both ways -- an initially hybrid capability that automatically morphs into a two-tier infrastructure -- because CloudFront distributions also have a wildcard functionality with overrides: a distribution with *.example.com handles all requests unless a distribution with a more specific domain name is provisioned -- at which point the more specific distribution will start handling the traffic. CloudFront takes a few minutes before the new distribution overrides the wildcard, but when the switchover happens, it's clean. A few minutes after the new distribution is configured, you make a parallel DNS change to the newly assigned hostname for the new distribution, but CloudFront is designed in such a way that you do not have to tightly coordinate this change -- all endpoints will handle all domains, because CloudFront doesn't use the endpoint to make the routing decision, it uses SNI and the HTTP Host header.
This seems almost like a no-brainer. A default, wildcard CloudFront distribution is pointed to by a default, wildcard DNS record, and uses Lambda#Edge to identify which of your clusters handles a given subdomain using a database lookup, followed by the deployment -- automated, of course -- of a distribution for each of your customers, which already knows how to forward the request to the correct cluster, so no further database queries are needed after the subdomain is fully live. You'll need to ask AWS Support to increase your account's limit for the number of CloudFront distributions from the default of 200, but that should not be a problem.
There are multiple ways to accomplish that database lookup. As mentioned before, the Lambda@Edge function can invoke a second Lambda function inside a VPC to query the database for routing instructions, or you could push the domain location config to a DynamoDB global table, which would replicate your domain routing instructions to multiple DynamoDB regions (currently Virginia, Ohio, Oregon, Ireland, and Frankfurt), and DynamoDB can be queried directly from a Lambda@Edge function.
I'm interested in hosting a website for a small business (< 100 users / month) and I wanted to try going 'serverless'. I've read that using Amazon S3, Lambda and DynamoDB is a way to set this up, by hosting the front-end on S3, using Lambda functions to access the back-end, and storing data in DynamoDB. I'll need to run a script on page load to get data to display, save user profiles/allow logins, and accept payments using Stripe or Braintree.
Is this a good situation to use this setup, or am I better off just using EC2 with a LAMP stack? Which is better in terms of cost?
It is a perfectly good solution, and will probably cost you nothing at all to host on AWS - literally pennies a month. I host several low traffic sites this way and it works well.
The only caveat would be, since your traffic is so low, almost every time someone hits a page that needs to make any back-end calls, those Lambda functions will likely need a 'cold start', which may introduce a delay and cause the page to load a bit slower than if it had more traffic that tended to keep the Lambda functions 'warm'.
I'm creating a simple web app that needs to be deployed to multiple regions in AWS. The application requires some dynamic configuration which is managed by a separate service. When the configuration is changed through this service, I need those changes to propagate to all web app instances across all regions.
I considered using cross-region replication with DynamoDB to do this, but I do not want to incur the added cost of running DynamoDB in every region, plus the overhead of managing the replication. Then the thought occurred to me of using S3, which is inherently cross-region.
Basically, the configuration service would write all configurations to S3 as static JSON files. Each web app instance would periodically check S3 to see if any of the config files have changed since the last check, and download the new config if necessary. The configuration changes are not time-sensitive, so polling for changes every 5-10 minutes should suffice.
Have any of you used a similar approach to manage app configurations before? Do you think this is a smart solution, or do you have any better recommendations?
The right tool for this depends on the size of the configuration and the granularity at which you need to read it.
You can use both DynamoDB and S3 from a single region to serve your application in all regions. You can read a configuration file in S3 from all the regions, and you can read the configuration records from a single DynamoDB table from all the regions. There is some latency due to the distance around the globe, but for reading configuration it shouldn't be much of an issue.
If you need the whole set of configuration every time you load it, it might make more sense to use S3. But if you need to read small parts of a large configuration, by different parts of your application and at different times and on different schedules, it makes more sense to store it in DynamoDB.
In both options, the cost of the configuration is tiny: a text file in S3 and a few GETs against it are almost free, and the same low cost is expected in DynamoDB, as you probably have only a few KB of data and the number of reads per second is very low (5 read capacity units per second is more than enough). Even if you decide to replicate the data to all regions, it will still be almost free.
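To make the trade-off concrete, here is a small Python sketch of both read paths; the bucket, object key, table, and item key are placeholders, and both clients can live in a single region while being called from any region:

# Sketch of the two read patterns discussed above.
import json
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# S3: fetch the whole configuration document in one call.
obj = s3.get_object(Bucket="my-config-bucket", Key="app-config.json")
full_config = json.loads(obj["Body"].read())

# DynamoDB: fetch just one small piece of a large configuration.
resp = dynamodb.get_item(
    TableName="app-config",
    Key={"config_key": {"S": "feature-x.timeout"}},
)
one_setting = resp["Item"]["value"]["S"]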
I have an application I wrote that works in exactly the manner you suggest, and it works terrific. As it was pointed out, S3 is not 'inherently cross-region', but it is inherently durable across multiple availability zones, and that combined with cross region replication should be more than sufficient.
In my case, my application is also not time-sensitive to config changes, but nonetheless, besides having the app poll on a regular basis (in my case once per hour or after every long-running job), I also have each application subscribed to SNS endpoints so that when the config file changes on S3, an SNS event is raised and the applications are notified that a change occurred. So in some cases the applications get the config changes right away, but if for whatever reason they are unable to process the SNS event immediately, they will 'catch up' at the top of every hour, when the server reboots, and/or in the worst case by polling S3 for changes every 60 minutes.
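For what it's worth, here is a minimal Python sketch of that polling loop, using the object's ETag to skip downloads when nothing has changed; the bucket and key are placeholders:

# Minimal polling sketch for the approach above.
import json
import time
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-config-bucket", "app-config.json"

last_etag = None
config = {}

while True:
    head = s3.head_object(Bucket=BUCKET, Key=KEY)
    if head["ETag"] != last_etag:
        obj = s3.get_object(Bucket=BUCKET, Key=KEY)
        config = json.loads(obj["Body"].read())
        last_etag = head["ETag"]
        print("configuration reloaded")
    time.sleep(600)  # poll every 10 minutes; an SNS notification can trigger an immediate reload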
I have a RESTful webservice running on Amazon EC2. Since my application needs to deal with large number of photos, I plan to put them on Amazon S3. So the URL for retrieving a photo from S3 could look like this:
http://johnsmith.s3.amazonaws.com/photos/puppy.jpg
Is there any way or necessity to cache the images on EC2? The pros and cons I can think of is:
1) Reduced S3 usage and cost with improved image-fetching performance. However, on the other hand, EC2 cost can rise, plus EC2 may not be able to handle the image cache due to bandwidth restrictions.
2) Increased development complexity, because you need to check the cache first, ask S3 to transfer the image to EC2, and then transfer it to the client.
I'm using the EC2 micro instance and feel it might be better not to do the image cache on EC2. But the scale might grow fast, and eventually we will need an image cache. (Am I right?) If a cache is needed, is it better to do it on EC2, or on S3? (Is there a way of caching for S3?)
By the way, when the client uploads an image, should it be uploaded to EC2 or S3 directly?
Why bring EC2 into the equation? I strongly recommend using CloudFront for this scenario.
When you use CloudFront in conjunction with S3 as the origin, the content gets distributed to 49 different locations worldwide (the count of edge locations today), effectively acting as a global cache, with the content fetched from the nearest location based on the latency to your end users.
That way you don't need to worry about the scale and performance of the cache or of EC2; you can simply offload this to CloudFront and S3.
Static vs dynamic
Generally speaking, here are the tiers:
best: CDN (CloudFront)
good: static hosting (S3)
okay: dynamic (EC2)
Why? There are a few reasons.
maintainability and scalability: CloudFront and S3 scale "for free". You don't need to worry about capacity or bandwidth or request rate.
price: approximately speaking, it's cheaper to use S3 than EC2.
latency: CDNs are located around the world, leading to shorter load times.
Caching
No matter where you are serving your static content from, proper use of the Cache-Control header will make life better. With that header you can tell a browser how long the content is good for. If it is something that never changes, you can instruct a browser to keep it for a year. If it frequently changes, you can instruct a browser to keep it for an hour, or a minute, or revalidate every time. You can give similar instructions to a CDN.
Here's a good guide, and here are some examples:
# keep for one month (2592000 seconds = 30 days)
Cache-Control: max-age=2592000
# keep for a day on a CDN, but a minute on client browsers
Cache-Control: s-maxage=86400, max-age=60
You can add this to pages served from your EC2 instance (no matter if it's nginx, Tornado, Tomcat, IIS), you can add it to the headers on S3 files, and CloudFront will use these values.
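As an example, here is a small Python sketch of setting that header on an S3 object at upload time with boto3; the bucket and key are placeholders, and CloudFront and browsers will then honor the header when the object is served:

# Sketch: attach a Cache-Control header to an S3 object at upload time.
import boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-static-assets",
    Key="img/logo.png",
    Body=open("logo.png", "rb"),
    ContentType="image/png",
    CacheControl="max-age=2592000",  # one month, matching the example above
)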
I would not pull the images from S3 to EC2 and then serve them. It's wasted effort. There are only a small number of use cases where that makes sense.
A few scenarios where an EC2 caching instance makes sense:
your upload/download ratio is far from 50/50
you hit the S3 request-rate limit (100 req/sec)
you need URL masking
you want to optimise kernel and TCP/IP settings, or cache SSL sessions for clients
you want a proper cache-invalidation mechanism for all geo locations
you need 100% control over where data is stored
you need to count the number of requests
you have custom authentication mechanism
For a number of reasons, I recommend taking a look at an Nginx S3 proxy.