Highly resource-demanding tasks in websites / web services

I'm a newbie in web development and I'm wondering how to handle resource-intensive tasks in websites. For example, suppose you have a task that should run once a day, but it requires too many resources and takes too long on your hosting. An example of such a task could be fetching data from elsewhere, running some calculations on that data, plotting graphs based on the results, and finally inserting the results into a database. What is the best way to run such tasks? Does AWS provide a solution for this (for example, renting a machine only to process your task at a specified time each day), and if so, what is that service called? I would appreciate any general advice and suggestions about my options.

For AWS, I'd recommend AWS Batch for something like this.
From the documentation:
As a fully managed service, AWS Batch enables developers, scientists,
and engineers to run batch computing workloads of any scale. AWS Batch
automatically provisions compute resources and optimizes the workload
distribution based on the quantity and scale of the workloads. With
AWS Batch, there is no need to install or manage batch computing
software, which allows you to focus on analyzing results and solving
problems. AWS Batch reduces operational complexities, saves time, and
reduces costs, which makes it easy for developers, scientists, and
engineers to run their batch jobs in the AWS Cloud.
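If you do go with AWS Batch, a minimal sketch of submitting a job with boto3 could look like the following; the job queue and job definition names are placeholders you would create beforehand (in the console or with infrastructure as code), and a scheduled Amazon EventBridge rule can then trigger the submission once a day.

    import boto3

    batch = boto3.client("batch")

    # Submit the daily job to an existing job queue / job definition
    # (the names below are placeholders, not real resources).
    response = batch.submit_job(
        jobName="daily-data-crunch",
        jobQueue="my-job-queue",
        jobDefinition="my-job-definition:1",
        containerOverrides={
            "environment": [{"name": "RUN_DATE", "value": "2024-01-01"}],
        },
    )
    print(response["jobId"])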

Related

GKE vs Cloud Run

I have Python Flask APIs deployed on Cloud Run. Autoscaling, CPU, concurrency: everything is configurable in Cloud Run.
Now the problem is the actual load testing, with around 40k concurrent users hitting the APIs continuously.
Can Cloud Run handle these huge volumes, or should we port our app to GKE?
What factors decide between Cloud Run and GKE?
Cloud Run is designed to handle exactly what you're talking about. It's very performant and scalable. You can also set things like concurrency per container/service, which can be handy where payloads might be larger.
Where you would use GKE is when you need to customise your platform, perform man-in-the-middle traffic inspection, deal with environment complexity, or handle potentially long-running compute, etc. You might find this in large enterprise or highly regulated environments. Kubernetes is almost like a private cloud, though: it's very complex, has its own way of working, and requires ongoing maintenance.
This is obviously opinionated, but if you can't think of a reason why you need Kubernetes/GKE specifically, Cloud Run wins for an API.
To provide more detail though, see Cloud Run Limits and Quotas.
The particularly interesting limit is the 1000 container instances, but note that it can be increased on request.

How do you implement cloud solutions without incurring costs during development?

I am completely new to the implementation of cloud solutions. I've just started taking AWS training courses.
But I already have a very fundamental question about the flow of development in cloud projects:
How do you go about developing solutions without incurring costs? I know that there are free tiers, but in practice you need a lot of non-free elements. Especially when working with infrastructure-as-code approaches (e.g. CloudFormation), costs can be incurred immediately every time you try out the templates.
Is there maybe something like a sandbox mode or how else do you go about it in practice?
Outside of the AWS Free Tier you will be billed for creating services.
The best way to keep costs as low as possible is to combine the lowest-priced settings (such as instance class) with removing resources you're not using once you're done. I understand that this will still cost something; however, many resources are now moving to per-second billing (where you normally have to pay for at least the first minute), so the cost is kept low.
Additionally, when dealing with some services (such as EC2, ECS and Fargate) you can make use of Spot instances and sometimes pay as little as 10% of the On-Demand price, which will help to reduce the cost of these resources.
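As a rough illustration (the AMI ID and instance type are placeholders, not a recommendation), a one-off Spot instance can be requested with boto3 like this:

    import boto3

    ec2 = boto3.client("ec2")

    # Request a one-time Spot instance instead of On-Demand capacity.
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {
                "SpotInstanceType": "one-time",
                "InstanceInterruptionBehavior": "terminate",
            },
        },
    )
    print(response["Instances"][0]["InstanceId"])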
To ensure you can recreate resources whenever you want them, use infrastructure as code to roll them out again as you need them (CloudFormation and Terraform are great offerings for this).
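For example, with boto3 you can stand a CloudFormation stack up only while you are experimenting and tear it down as soon as you are done (the stack name and template file are placeholders):

    import boto3

    cfn = boto3.client("cloudformation")

    # Create the stack only when you need it...
    cfn.create_stack(
        StackName="dev-sandbox",  # placeholder stack name
        TemplateBody=open("template.yaml").read(),
        Capabilities=["CAPABILITY_IAM"],
    )
    cfn.get_waiter("stack_create_complete").wait(StackName="dev-sandbox")

    # ...and delete it as soon as you are finished to stop the billing.
    cfn.delete_stack(StackName="dev-sandbox")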
Finally, be on the lookout for AWS conferences; they are a great way to pick up AWS credits for attending, which will offset your bill against most AWS services.

Control AWS SageMaker costs

I want to use GPU capacity for deep learning models. SageMaker is great in its flexibility of starting on-demand clusters for training. However, my department wants guarantees that we won't overspend the AWS budget. Is there a way to cap the costs without resorting to a dedicated machine?
Here are a couple of ideas:
You can use Service Catalog on top of SageMaker to restrict how end users consume the product (and, for example, limit instance types and permissions).
On SageMaker notebook instances, you can use a lifecycle configuration to automatically shut down notebooks that are idle.
You can also create AWS Lambda functions that automate controls (e.g. shut down notebook instances at night, send a notification if big machines are used, etc.).
On SageMaker Training you can use Spot capacity to benefit from significant savings on training costs (up to 90% is possible). You can also apply a maximum training duration with train_max_run; a short sketch of both options follows this list.
It's a good idea to make sure your code makes good use of the GPU. For example, on P3 instances (V100 cards) you should try to use mixed-precision training so that training is faster and cheaper. Also, tune data loading, batch size and algorithm complexity so that the GPUs have enough work to do and don't just spend their time doing data reads and model updates. Also, scale training vertically first (bigger and bigger machines) rather than horizontally: multi-machine training is generally more costly, as it is harder to write and debug and has more communication overhead. This is more on your side than on the SageMaker side, though.
Train with the SageMaker Training API, not in a SageMaker notebook. When you write code in a notebook, very little time is spent on training and much more on writing, debugging and reading documentation; all this time the GPU sits idle. SageMaker Training, on the other hand, provides ephemeral, fresh instances billed per second only for the duration of training, and comes with additional benefits such as Spot capacity, managed metrics, logs, data I/O and metadata management.
You can create cost allocation tags to report costs at a custom granularity.
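As a sketch of the Spot and maximum-duration options above (the image URI, IAM role and S3 paths are placeholders; in v2 of the SageMaker Python SDK the train_max_run parameter is called max_run):

    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri="<your-training-image>",       # placeholder
        role="<your-sagemaker-execution-role>",  # placeholder
        instance_count=1,
        instance_type="ml.p3.2xlarge",
        use_spot_instances=True,  # train on Spot capacity (up to ~90% savings)
        max_run=3600,             # hard cap on billable training time, in seconds
        max_wait=7200,            # total time to wait for Spot capacity plus training
        output_path="s3://<your-bucket>/output",  # placeholder
    )

    estimator.fit({"training": "s3://<your-bucket>/train"})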

What is the best way to performance test and load test AWS cloud applications

I want to keep this question generic and would like answers in terms of best practices/approaches/guidelines.
We need to know the best way to performance test and load test AWS cloud-based applications.
What we have tried:
We used Gatling and JMeter to execute our performance tests. These frameworks are pretty useful for testing our functionality and for benchmarking our applications' latency and request rate.
Problem:
Performance benchmarks and limits of AWS managed services like Lambda and DynamoDB are already specified by AWS, e.g. Lambda concurrency behavior and DynamoDB autoscaling under load. AWS also provides high availability and guaranteed performance for its managed services.
Is it really worth executing expensive performance and load test jobs against AWS managed services?
How do we ensure that we are testing our application rather than AWS limits that are already known?
What is the best practice and approach for performance testing cloud-based applications?
Any suggestions will help tremendously.
Thanks,
It depends on how you use the AWS infrastructure.
If you don't use Auto Scaling, you can treat the AWS cloud-based application as a "normal" application that just happens to be deployed not on your company premises but somewhere at Amazon.
If you use Auto Scaling, you might want to come up with some form of scalability testing. AWS instances can scale up in order to adapt to the increased load, but your application must be able to scale as well. So you can test:
how fast the scale-up and scale-down processes are, i.e. if you rapidly increase the load, does AWS/your application react fast enough, or is there a performance drop
the scalability factor. For example, with 100 virtual users you get 50 requests per second. Ideally, with 200 virtual users you should get 100 requests per second, and with 300 virtual users 150 requests per second, with response time remaining the same. Normally the situation differs from this ideal, so it is good to know the scalability factor (a rough calculation is sketched below).
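As an illustration (the numbers and the helper function are made up, not part of any testing tool), the scalability factor can be computed as the ratio of the measured throughput to the ideal, linearly scaled throughput:

    def scalability_factor(users_base, tps_base, users_scaled, tps_scaled):
        """Ratio of measured throughput to the ideal linear throughput."""
        ideal_tps = tps_base * (users_scaled / users_base)  # perfect linear scaling
        return tps_scaled / ideal_tps

    # 100 users gave 50 requests/second; 300 users should ideally give 150.
    # If we only measure 120 requests/second, the system scales at 80% of ideal.
    print(scalability_factor(100, 50, 300, 120))  # 0.8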
Also, if your application is behind an AWS ELB, you will need to add a DNS Cache Manager to your JMeter test plan; otherwise you will be hitting only one backend instance while the others sit idle.

Monitoring workers and identifying bottlenecks in a data pipeline

I am using Google Cloud Dataflow. Some of my data pipelines need to be optimized, and I need to understand how the workers in the Dataflow cluster are performing, along these lines:
1. How much memory is being used? (Currently I am logging memory usage from Java code.)
2. Is there a bottleneck on disk operations, i.e. would an SSD be required?
3. Is there a bottleneck in vCPUs, meaning I should increase the vCPUs on the worker nodes?
I know Stackdriver can be used to monitor CPU and disk usage for the cluster. However, it does not provide information on individual workers, nor on whether we are hitting a bottleneck in any of these.
Within the Dataflow Stackdriver UI you are correct: you cannot view individual workers' metrics. However, you can certainly set up a Stackdriver dashboard which gives you the individual worker metrics for everything you have mentioned. Below is a sample dashboard showing metrics for CPU, memory, network, read IOPS and write IOPS.
Since the Dataflow job name is part of the GCE instance name, here I filter the GCE instances being monitored down to the job name I'm interested in. In this case my Dataflow job was named "pubsub-to-bigquery", so I filtered down to instance_name ~= pubsub-to-bigquery.*. I used a regex filter to be sure I captured any job names which may be suffixed with additional data in future runs. Setting up a dashboard such as this can tell you when you would actually benefit from SSDs, more network bandwidth, etc.
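If you prefer to pull the same per-worker numbers programmatically rather than from a dashboard, a minimal sketch with the Cloud Monitoring Python client could look like this; the project ID and the "pubsub-to-bigquery" instance-name prefix are assumptions carried over from the example above.

    import time
    from google.cloud import monitoring_v3

    project_id = "my-project"  # placeholder project ID
    client = monitoring_v3.MetricServiceClient()

    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
    )

    # Per-worker CPU utilization for instances whose name starts with the job name.
    results = client.list_time_series(
        request={
            "name": f"projects/{project_id}",
            "filter": (
                'metric.type = "compute.googleapis.com/instance/cpu/utilization" '
                'AND metric.labels.instance_name = starts_with("pubsub-to-bigquery")'
            ),
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )

    for series in results:
        name = series.metric.labels["instance_name"]
        latest = series.points[0].value.double_value  # most recent sample first
        print(f"{name}: {latest:.1%} CPU")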
Also be sure to check the Dataflow job graph in the Cloud Console when looking to optimize your pipeline. The wall time shown below each step name gives a good indication of which custom transforms or DoFns should be targeted for optimization.