How do you implement cloud solutions without incurring costs during development? - amazon-web-services

I am completely new to the implementation of cloud solutions. I've just started taking AWS training courses.
But I already have a very fundamental question about the flow of development in cloud projects:
How do you go about developing solutions without incurring costs? I know there are free tiers, but in practice you need plenty of paid resources. Especially with infrastructure-as-code approaches (e.g. CloudFormation), every time you try out a template you can immediately incur costs.
Is there maybe something like a sandbox mode or how else do you go about it in practice?

Outside of the AWS Free Tier you will be billed for creating services.
The best way to keep costs as low as possible is to combine the lowest-priced settings (such as instance class) with removing resources you're no longer using once you're done. This will still cost something, but many resources are now moving to per-second billing (where you normally pay for at least the first minute), so the cost stays low.
Additionally, when dealing with some services (such as EC2, ECS, Fargate and ECR) you can make use of spot instances and sometimes pay as little as 10% of the on-demand price, which helps reduce the cost of these resources.
To ensure you can recreate resources when you need them, use infrastructure as code to roll them out again on demand (CloudFormation and Terraform are great offerings for this).
Finally, be on the lookout for AWS conferences; they are a great way to pick up AWS credits for attending, which will offset your bill for most AWS services.
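To get a feel for how cheap a short test run can be, here is a rough back-of-the-envelope sketch. The prices are purely illustrative, not real AWS rates; check the pricing page for your region and instance type:

```python
# Rough cost estimate for short-lived test resources (illustrative prices,
# not real AWS rates -- check the pricing pages for your region).

def run_cost(price_per_hour: float, seconds: int, min_billed_seconds: int = 60) -> float:
    """Per-second billing with a minimum charge (e.g. Linux EC2 bills
    per second with a 60-second minimum)."""
    billed = max(seconds, min_billed_seconds)
    return price_per_hour * billed / 3600

# Trying out a CloudFormation template for 10 minutes on a hypothetical
# $0.10/hour instance:
on_demand = run_cost(0.10, 10 * 60)    # roughly $0.017
spot = run_cost(0.10 * 0.10, 10 * 60)  # spot at ~10% of on-demand
print(f"on-demand: ${on_demand:.4f}, spot: ${spot:.4f}")
```

The point of the sketch: as long as you tear resources down after each experiment, iterating on templates costs cents, not dollars.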

Related

aws billing breakdown to system components and artifacts

We have been running a multi-tier application on AWS, using various AWS services like ECS, Lambda and RDS. We are looking for a solution to map billing items to actual system components, find the component spending the most money, etc.
AWS has improved its Detailed Cost and Usage Reports and has a Cost Explorer API, but it only breaks the bill down to services or instances. A per-instance breakdown does not bring much value if you're looking for the cost of each component. Any solutions/recommendations for this?
Cost Allocation Tags
You can create a tag such as "system" or "app" and apply it to all of your resources and set the value to the different applications/systems/Components that you wish to track. Then you can go to the billing page, click on "Cost Allocation Tags" and activate that tag that you created.
Then you can see costs broken down by the different values of that tag. They will show up in Cost Explorer, where the tag will be one of the available filters. However, I think it takes 24 hours after activation before they show up.
If you do need to enforce tag usage, and you have developers that work on multiple components, it's possible to have IAM roles for managing each components, each role is limited to interacting with resources with a specific tag (i.e. they can only modify existing resources with that tag, and they can only create new resources with that tag). A developer can have an IAM user (or you could federate identities, but that's a whole different conversation) and allow them to assume different roles depending on which component they are working on. This has the added benefit of making cross-account management easier. However, it may require a non-trivial IAM overhaul.
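As a sketch of what such tag enforcement could look like, here is a hypothetical IAM policy fragment using the `aws:ResourceTag` and `aws:RequestTag` condition keys (the `system` tag key and `billing-app` value are made up, and a real policy needs more care around actions like `RunInstances` that touch multiple resource types):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ModifyOnlyTaggedResources",
      "Effect": "Allow",
      "Action": ["ec2:StartInstances", "ec2:StopInstances", "ec2:TerminateInstances"],
      "Resource": "*",
      "Condition": {"StringEquals": {"aws:ResourceTag/system": "billing-app"}}
    },
    {
      "Sid": "CreateOnlyWithTag",
      "Effect": "Allow",
      "Action": "ec2:RunInstances",
      "Resource": "*",
      "Condition": {"StringEquals": {"aws:RequestTag/system": "billing-app"}}
    }
  ]
}
```

One such policy per component, attached to one role per component, gives you the assume-role workflow described above.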
More info on cost allocation tags here: https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html
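Once a tag is activated it also appears as a column in the Cost and Usage Report, so you can roll costs up yourself. A minimal sketch with illustrative field names (the real CUR column for an activated tag is named like `resourceTags/user:system`):

```python
from collections import defaultdict

# Illustrative line items as they might appear in a Cost and Usage Report
# export; field names here are simplified for the example.
line_items = [
    {"service": "AmazonEC2", "cost": 12.40, "tag:system": "web"},
    {"service": "AmazonRDS", "cost": 30.10, "tag:system": "web"},
    {"service": "AWSLambda", "cost": 0.75,  "tag:system": "etl"},
    {"service": "AmazonEC2", "cost": 5.00,  "tag:system": ""},  # untagged
]

def cost_by_tag(items, tag="tag:system"):
    """Sum line-item costs per tag value; empty tags land in '(untagged)'."""
    totals = defaultdict(float)
    for item in items:
        totals[item.get(tag) or "(untagged)"] += item["cost"]
    return dict(totals)

print(cost_by_tag(line_items))
# e.g. {'web': 42.5, 'etl': 0.75, '(untagged)': 5.0}
```

The "(untagged)" bucket is worth watching: it tells you how much spend is escaping your tagging discipline.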
Divide Cost boundaries by AWS account
To attack the components that are not taggable such as data transfers, you could build your account strategy around cost boundaries and have a separate account for each cost silo (if that's tenable). That may increase cost, because you'd have to break systems into specific accounts (and therefore specific EC2 Instances).
When you centralize reporting, monitoring, config management, log analysis, etc., each application will add a little to that cost, but usually you just have to treat that centralization as a system in itself and cost it out separately. Obviously, you can have separate monitoring, alerting, reporting, log collection, config management, etc. for each system, but this will cost more overall (both in infrastructure costs and engineering hours). So you have to prioritize cost visibility versus cost optimization.
There are still a great deal of capabilities within AWS to connect resources from disparate accounts, and it's not difficult to have a data-layer in one account, and an app-tier in another (though it's not a paradigm I often see).
Custom Tooling
Maybe the above are imperfect solutions for your environment; you could use them as far as they are feasible and write scripts to estimate usage of things that are more difficult to track. For bandwidth, if you had your own EC2 instances running as forward proxies or NAT gateways, you could write some outbound data transfer accounting software. If everything in your VPCs had a route pointing to ENIs on these instances, you could better track outbound transfer by any parameters you choose. This does sound a little fragile to me, and there may be several cases where this isn't tenable from a network perspective, but it's a possibility.
Similarly, with CloudWatch metrics you can use namespaces. I wasn't able to find any reference to the ability to filter by CloudWatch namespace in Cost Explorer, but it would probably be pretty easy to suss out raw metric counts per namespace and estimate costs per namespace. Then you could divide your components in CloudWatch by namespace. This may lead to some duplication, which may mean more management effort or increased cost, but that would be the tradeoff for more granular cost visibility.
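A back-of-the-envelope version of that estimate might look like the following; the per-metric price is an assumption for illustration, so check current CloudWatch pricing for your region and tier:

```python
# Rough cost per CloudWatch namespace: count custom metrics per namespace
# and multiply by an assumed price per metric-month (illustrative value;
# real CloudWatch pricing is tiered and varies by region).

METRIC_PRICE_PER_MONTH = 0.30  # assumed first-tier custom-metric price

def namespace_costs(metric_counts, price=METRIC_PRICE_PER_MONTH):
    """Map each namespace to an estimated monthly metric cost."""
    return {ns: count * price for ns, count in metric_counts.items()}

print(namespace_costs({"App/Web": 120, "App/ETL": 40}))
```

The metric counts themselves could come from paging through `ListMetrics` per namespace.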
Kubernetes
This may be very pie-in-the-sky for your environment, but it's worth mentioning. If you ran a cluster using EKS or a self-managed cluster on EC2, you could harness the power of that platform, which would allow you to provision a base level of compute resources, divide components into namespaces, and use built-in or third-party tools to grab usage statistics per namespace (or even per workload). This is much easier to enforce, because you can give developers access to specific namespaces and outliers are generally more obvious. When you know how much CPU and memory each workload uses over time, you can get a pretty good estimate of individual cost patterns by component.
Of course, you will still have a cost for the k8s management plane, which will be in a cost bucket apart from all of your other applications/systems.
Istio, while not a simple technology by any means, allows you to collect granular metrics about data egress, which you can use to get an idea of how much data transfer cost is being run up.
It might be easier to duplicate monitoring in each namespace, since you already have to abstract your monitoring workload to a certain extent to run on k8s at all. However, that still increases management and overall cost, but perhaps less than siloing at the Infrastructure (AWS) layer.
Summary
There aren't many options I know of for getting to the level of granularity and control that you need in AWS, and efforts to this end will probably increase overall cost and management overhead. AWS is rather notorious for its difficult-to-estimate cost model. Perhaps look into platforms other than AWS that might provide better visibility into component costs for your workloads.
It's also difficult to avoid systems that operate centrally and whose cost-per-system is difficult to trace. These include log management, config management, authentication systems, alerting systems, monitoring systems, etc. Generally it's more cost effective and more manageable to centralize these functions for all of your workloads, but then TCO of individual apps becomes difficult. In my experience, most teams write this off as infrastructure cost, and track the cost of an app more with the compute, storage, and AWS service usage data points.

New infrastructure for our project (AWS, GCP)

I started last month at a new company, where I will be responsible for the infrastructure and the backend of the SaaS.
We currently have one droplet/instance per customer. In the current phase of the company that is a good choice, but in the future, when the number of instances grows, it will be difficult to maintain. At the moment there are 150 instances online, each with 1 CPU and 1 GB of memory.
Our customers only use their environments for brief moments each week, month or year, so most of the time they do nothing. We want to change that. I am thinking of Kubernetes, Docker Swarm or another tool.
What advice can you give us? Should we make the step to Kubernetes or Docker Swarm, or stay with the droplets / VMs at DigitalOcean, AWS or GCP?
If we move to AWS or GCP, our average price will go up from $5/month to above $10/month.
We want to take the next step to reduce wasted resources while also keeping the monthly bill in mind. In my mind, it would be better to have 2 or 3 bigger VMs running Kubernetes or Docker Swarm to lower the monthly bill and reduce our reserved resources.
What do you think?
If you are serious about scaling, then you should rethink your application architecture. The most expensive part of computing is memory (RAM), so having dedicated memory per-customer will not allow you to scale.
Rather than keeping customers separate by using droplets, you should move this logical separation to the data layer. So, every customer can use the same horizontally-scaled compute servers and databases, but the software separates their data and access based on a User Identifier in the database.
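As a minimal sketch of what separation at the data layer means in practice (the schema and names are hypothetical), every query is scoped by a tenant identifier rather than by dedicated infrastructure:

```python
import sqlite3

# Minimal multi-tenancy sketch: all customers share one table, and every
# data-access path filters on a tenant identifier. Schema is hypothetical.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (tenant_id TEXT, title TEXT)")
db.executemany("INSERT INTO documents VALUES (?, ?)", [
    ("customer-a", "Q1 report"),
    ("customer-a", "Q2 report"),
    ("customer-b", "Invoice 17"),
])

def documents_for(tenant_id):
    # Shared compute serves all customers; the WHERE clause is what keeps
    # one customer's data invisible to another.
    rows = db.execute(
        "SELECT title FROM documents WHERE tenant_id = ? ORDER BY title",
        (tenant_id,),
    ).fetchall()
    return [title for (title,) in rows]

print(documents_for("customer-a"))  # ['Q1 report', 'Q2 report']
```

The hard part of a real migration is ensuring that *every* code path carries that tenant filter, which is why frameworks often push it into a middleware layer or row-level security in the database.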
Think for a moment... does Gmail keep RAM around for each specific customer? No, everybody uses the same compute and database, but the software separates their messages from other users. This allows them to scale to huge numbers of customers without assigning per-customer resources.
Here's another couple of examples...
Atlassian used to have exactly what you have. Each JIRA Cloud customer would be assigned their own virtual machine with CPU, RAM and a database. They had to grow their data center to a crazy size, and it was Expensive!
They then embarked on a journey to move to multi-tenancy, first by separating the databases from each customer's machine (using a common pool of databases), then by moving to shared microservices, and eventually they removed all per-customer resources.
See:
Atlassian’s two-year cloud journey | TechCrunch
How Atlassian moved Jira and Confluence users to Amazon Web Services, and what it learned along the way – GeekWire
Atlassian cloud architecture - Atlassian Documentation
Salesforce chose to go multi-tenant from the very beginning. They defined the concept of SaaS and used to call themselves the "cloud" (before Cloud Computing as we know it now). While their systems are sharded to allow scale, multiple customers share the same resources within a shard. The separation of customer data is done at the database-level.
See:
The Magic of Multitenancy - Salesforce Engineering
Multi Tenant Architecture - developer.force.com
Bottom line: Sure, you can try to optimize around the current architecture by using containers, but if you want to get serious about scale (I'm talking 10x or 100x), then you need to re-think the architecture.

AWS vs GCP Cost Model

I need to make a cost model for AWS vs GCP. Currently, our organization is using AWS. Our biggest services used are:
EC2
RDS
Lambda
API Gateway
S3
Elasticache
Cloudfront
Kinesis
I have very limited knowledge of cloud platforms. However, I have access to:
AWS Simple Monthly Calculator
Google Cloud Platform Pricing Calculator
MAP AWS services to GCP products
I also have access to CloudHealth so that I can get a breakdown of costs per services within our organization.
Of the 8 major services listed above, our main usage and costs go to EC2, S3, and RDS.
Our director of engineering mentioned that I should be most concerned with vCPU and memory.
I would appreciate any insight (big or small) that people have into how I can go about creating this model, any other factors I should consider, which functionalities of the two providers for the services are considered historically "better" or cheaper, etc.
Thanks in advance, and any questions people may have, I am more than happy to answer.
-M
You should certainly cost-optimize your resources. It's so easy to create cloud resources that people don't always think about turning things off or right-sizing them.
Looking at your Top 5...
Amazon EC2
The simplest way to save money with Amazon EC2 is to turn off unused resources. You can even stop instances overnight and on the weekend. If they are only used 8 hours per workday, that is only 40 out of 168 hours, so you can save roughly 75% by turning them off when unused! Dev and Test instances are good candidates for this. People have written various automated utilities to turn instances on and off based on tags; try searching the Internet for AWS Stopinator.
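The savings from such a schedule are straightforward to estimate. A small sketch (pure arithmetic; the actual stopping and starting would be handled by a stopinator-style scheduled script):

```python
# Fraction of the EC2 bill saved by running instances only during
# working hours instead of 24x7.

def stopped_savings(hours_per_day: float, days_per_week: float) -> float:
    """Return the saved fraction of a week's compute bill."""
    running = hours_per_day * days_per_week
    return 1 - running / (24 * 7)

# 8 hours a day, 5 days a week: 40 of 168 hours, i.e. about 76% saved.
print(f"{stopped_savings(8, 5):.0%} saved")
```

Note this only applies to the compute charge; EBS storage attached to a stopped instance keeps billing.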
Another way to save money on Amazon EC2 is to use spot instances. They are a fraction of the price, but have a risk that they might be turned off when demand increases. They are great where it is okay for systems to be terminated sometimes, such as automated testing systems. They are also a great way to supplement existing capacity at a fraction of the price.
If you definitely need the Amazon EC2 instances to keep running all the time, purchase Amazon EC2 Reserved Instances, which also offer a price saving.
Chat with your AWS Account Manager for help with the above options.
Amazon Relational Database Service (RDS)
Again, Amazon RDS instances can be stopped overnight/on weekends and turned on again when needed. You only pay while the instance is running (plus storage costs).
Examine the CloudWatch metrics for your RDS instances and determine whether they can be downsized without impacting applications. You can even resize them when they are used less (eg over weekends). Everything can be scripted, so you could trigger such downsizing and upsizing on a schedule.
Also look at the Engine used with RDS. Commercial offerings such as Oracle and Microsoft SQL Server are more expensive than open-source offerings like MySQL and PostgreSQL. Yes, your applications might need some changes, but the cost savings can be significant.
AWS Lambda
It is most unusual that Lambda is #3 in your list. In fact, some customers never get a charge for Lambda because it falls in the monthly free usage tier. Having high charges means you're making good use of Lambda (which is saving you EC2 costs), but take a look at which applications are using it the most and see whether they are using it wisely.
When correctly used, a Lambda function should only run for a few seconds, so check whether any applications seem to be using it outside this pattern.
AWS API Gateway
Once again, these costs tend to be low ($3.50/million calls) so again I'd recommend trying to figure out how this is being used. If you really need that many calls, it would also explain the high Lambda costs. It would probably be more expensive if you were providing such functionality via Amazon EC2.
Amazon S3
Consider using different Storage Classes to reduce your costs. Costs can be reduced by:
Moving infrequently-accessed data to a different storage class
Moving data to One-Zone (if you have a copy of the data elsewhere, so don't need the redundancy)
Archiving infrequently-accessed data to Amazon Glacier, which offers much cheaper storage but does not have instant access
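These transitions can be automated with an S3 lifecycle configuration. A sketch (the prefix and day thresholds are illustrative; note that infrequent-access classes have minimum storage durations, so very short thresholds can backfire):

```json
{
  "Rules": [
    {
      "ID": "tier-down-old-objects",
      "Status": "Enabled",
      "Filter": {"Prefix": "logs/"},
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER"}
      ]
    }
  ]
}
```

Applied to a bucket, this moves objects under `logs/` to infrequent access after 30 days and archives them after 90, with no application changes.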
With GCP, you can benefit by receiving discounts such as the Committed Use Discount and the Sustained Use Discount.
With a Committed Use Discount, you can receive a discount of up to 70% if your usage is predictable.
With the Sustained Use Discount, there is an incremental discount if you reach certain usage thresholds.
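As an illustration of how the sustained use discount accumulates, here is a sketch using the classic incremental rates for general-purpose machine types; verify the current tiers against GCP's pricing documentation, as the exact rates are an assumption here:

```python
# Sketch of GCP sustained use discounts: under the classic model, each
# successive quarter of the month a VM runs is billed at a lower rate
# (100%, 80%, 60%, 40% of the base price). Rates are assumptions --
# check current GCP pricing docs.

RATES = [1.0, 0.8, 0.6, 0.4]  # multiplier per quarter-month of usage

def effective_rate(fraction_of_month: float) -> float:
    """Average price multiplier for running a VM this fraction of a month."""
    billed = 0.0
    for i, rate in enumerate(RATES):
        lo, hi = i * 0.25, (i + 1) * 0.25
        if fraction_of_month > lo:
            billed += (min(fraction_of_month, hi) - lo) * rate
    return billed / fraction_of_month

# A full month averages 0.7x the base price, i.e. a 30% discount.
print(f"full month: {1 - effective_rate(1.0):.0%} discount")
```

Unlike Committed Use Discounts, this applies automatically with no upfront commitment.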
On your concern with vCPU and memory, you may use predefined machine types. They are cheaper than custom machine types.
Lastly, you can also test the charges by trying out the Google Cloud Platform Free Tier.

Manage multiple aws accounts

I would like to know of a system by which I can keep track of multiple AWS accounts, somewhere around 130+ accounts, each containing around 200+ servers.
I want to know methods to keep track of machine failures, service failures, etc.
I also want to know methods by which I can automatically bring a machine back up if the underlying hardware failed or the machine was terminated while on spot.
I'm open to all solutions including chef/terraform automation, healing scripts etc.
You guys will be saving me a lot of sleepless nights :)
Thanks in advance!!
This is purely my take on implementing your problem statement.
1) Well... for managing and keeping track of multiple AWS accounts you can use AWS Organizations. This will help you manage all the other 130+ accounts centrally from one root account. You can enable consolidated billing as well.
2) As far as keeping track of failures goes, you may need to customize this according to your requirements. For example, you can build a microservice on top of Docker containers or ECS whose sole purpose is to keep track of failures, generate a report and push it to S3 on a daily basis. You can then create a dashboard from these reports in S3 using AWS QuickSight.
There can be another microservice which rectifies the failures. It just depends on how exhaustive and fine-grained you want your implementation to be.
3) For spawning instances when spot instances are terminated, this can be achieved through simple Auto Scaling configurations. Here are some articles you may want to go through which will give you some ideas:
Using Spot Instances with On-Demand instances
Optimizing Spot Fleet+Docker with High Availability
AWS Organizations is useful for management. You can also look at a multiple-account billing strategy and security strategy. A shared services account holding your IAM users will make things easier.
Regarding tracking failures, you can set up automatic instance recovery using CloudWatch. CloudWatch can also have alarms defined that will email you when something happens that you don't expect, though setting them up individually could be time-consuming. At your scale, I think you should look into third-party tools.

Alternative for built-in autoscaling groups for spot instances on AWS

I am currently using spot instances managed with auto-scaling groups. However, ASG has a number of shortcomings for use with spot instances. For example, it cannot launch instances of a different instance type if the current type is experiencing a price spike across all availability zones. It can't even re-distribute the number of running instances across zones (if one zone has a price spike, you're down 30% in the number of running instances.)
Are there any software solutions that I could run which would replace built-in AWS Auto-Scaling Groups? I've heard of SpotInst and Batchly, but I do not trust them. Basically, I think their business plan involves being bought out and killed by Amazon, like what happened to ClusterK. The evidence for this is the bizarre pricing policies and other red flags. I need something that I can self-host and depend on.
AWS recently released Auto Scaling for Spot Fleets which seems to fit your use case pretty well. You can define the cluster capacity in terms of vCPU that you need, choose the instance types you'd like to use and their weights and let AWS manage the rest.
They will provision spot instances at their current market price up to a limit you can define per instance type (as before), but integrating Auto Scaling capabilities.
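To illustrate how weighted capacity works, here's a small sketch: each instance type is assigned a weight (commonly its vCPU count), and the fleet provisions enough weighted units to reach the target capacity. The types, weights and target below are illustrative:

```python
import math

# Spot Fleet capacity weights, illustrated: each type contributes its
# weight (here, vCPUs) toward the fleet's target capacity.

def instances_needed(target_capacity: int, weight: int) -> int:
    """Instances of one type needed to cover the target on their own.
    Fleets round up, so a coarse weight can slightly overshoot."""
    return math.ceil(target_capacity / weight)

weights = {"m5.large": 2, "m5.xlarge": 4, "m5.2xlarge": 8}  # vCPU weights
target = 20  # target capacity in vCPUs
for itype, w in weights.items():
    print(itype, instances_needed(target, w))
```

In practice the fleet mixes types across zones based on price, which is exactly the rebalancing the plain ASG lacked.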
You can find more information here: https://aws.amazon.com/blogs/aws/new-auto-scaling-for-ec2-spot-fleets/
It's unlikely that you'll find something that takes into account everything you want. But because everything in Amazon is an API, you can write it yourself. There are lots of ways to do that.
For example, you could write a small script (bash, ruby, python, etc.) that shells out to the AWS CLI to get the price, then shells out to launch boxes. For bonus points, use the native AWS SDK library instead of shelling out (that makes it slightly easier to handle errors, etc.). For even more bonus points, open-source it and hope that other people improve on it!
This script can run on your home computer, or on a t1.micro for $5/month. Or you could write it in node.js, and run it on Lambda for pennies per month.
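The core decision logic of such a script is small. A sketch with hypothetical prices (a real version would fetch them via `aws ec2 describe-spot-price-history` or the SDK equivalent):

```python
# DIY spot launcher decision logic: pick the cheapest availability zone
# whose current spot price is below our limit. Prices are hypothetical.

def pick_zone(prices: dict, max_price: float):
    """Return the cheapest affordable AZ, or None if all are too pricey."""
    affordable = {az: p for az, p in prices.items() if p <= max_price}
    if not affordable:
        return None  # caller falls back to on-demand, or waits and retries
    return min(affordable, key=affordable.get)

current = {"us-east-1a": 0.031, "us-east-1b": 0.087, "us-east-1c": 0.029}
print(pick_zone(current, max_price=0.05))  # -> us-east-1c
```

Wrapping this in a loop that also redistributes instances across zones gives you the rebalancing behavior the built-in ASG lacks.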
Here at Spotinst, these are exactly the problems we built Elastigroup to solve.
Elastigroup enables running simultaneously on as many instance types and availability zones (within a region) as you’d like. This is coupled with several things to maintain production availability:
Our algorithm makes live choices for the best Spot markets in terms of price and availability.
When an interruption happens, we predict it about 15 minutes in advance and take all the necessary steps to ensure (and insure) the capacity of your group.
In the extreme case that none of the markets have Spot availability, we simply fall back to an on-demand instance.
We have a great relationship with AWS and work closely with both their technical and business teams to provide our joint customers with the best experience possible. Since we manage resources inside your own AWS account, I wouldn't consider the relationship between us a concern to begin with.