How would you do the below?
Please share your thoughts.
There are many ways to perform a stress test on an AWS architecture; some common tools are JMeter, BlazeMeter, etc. Regarding restrictions, you have to let AWS Support know beforehand about any stress test or penetration test you are going to perform on the AWS infrastructure you have created. Check this link for more details.
https://aws.amazon.com/security/penetration-testing/
Since you pay for bytes into and out of the Amazon infrastructure, keep your load generators in the same data center as your application under test to keep costs low. There are some drawbacks to this; the primary one is that your network will lack the complexity of impairment that real users will experience. If you are using a tool that models network impairment for the virtual users, this drawback is reduced.
No matter what tool you use, if you have the load generators running on virtual machines in AWS, you will face the issue of clock drift on the virtual machine clock. Periodically this virtual clock needs to be resynchronized with the system clock on the hypervisor host, which results in a clock jump. This can happen while you have a timing record open - it is unavoidable. The net impact is that you will see higher averages, percentile values, standard deviations and maximums than if you were running on physical hardware.
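If you want to roll a simple generator yourself, here is a minimal sketch in Python (the target URL and concurrency numbers are placeholders, not values from the question); a dedicated tool such as JMeter will give you much richer reporting and network-impairment modelling on top of this.

```python
# Minimal concurrent load-generator sketch (hypothetical target URL and numbers).
# Run it from an instance in the same Region as the application under test.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET_URL = "http://your-app.example.internal/health"  # placeholder endpoint
CONCURRENCY = 50          # number of simultaneous virtual users
REQUESTS_PER_USER = 100   # requests each virtual user sends

def one_user(user_id: int) -> list[float]:
    """Send a series of requests and record each response time in seconds."""
    timings = []
    for _ in range(REQUESTS_PER_USER):
        start = time.perf_counter()
        with urllib.request.urlopen(TARGET_URL, timeout=10) as resp:
            resp.read()
        timings.append(time.perf_counter() - start)
    return timings

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        all_timings = [t for user in pool.map(one_user, range(CONCURRENCY)) for t in user]
    all_timings.sort()
    print(f"requests: {len(all_timings)}")
    print(f"avg: {sum(all_timings) / len(all_timings):.3f}s")
    print(f"p95: {all_timings[int(len(all_timings) * 0.95)]:.3f}s")
```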
I have created an endpoint using SageMaker and designed my system so that it is called about 100 times simultaneously. This seemed to cause a 'Model error' and to take too much time. Do I need to create an endpoint for each event and make one call per endpoint instead?
You can check the CloudWatch logs to diagnose your model failure.
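For example, a quick way to pull recent error lines with boto3 (the endpoint name and filter pattern are placeholders; SageMaker endpoints write their container logs to the /aws/sagemaker/Endpoints/<endpoint-name> log group):

```python
# Sketch: pull recent error lines from a SageMaker endpoint's CloudWatch logs.
# "my-endpoint" is a placeholder; replace it with your endpoint name.
import time
import boto3

logs = boto3.client("logs")
endpoint_name = "my-endpoint"  # placeholder
log_group = f"/aws/sagemaker/Endpoints/{endpoint_name}"

resp = logs.filter_log_events(
    logGroupName=log_group,
    filterPattern="?Error ?Exception ?error",    # match common error keywords
    startTime=int((time.time() - 3600) * 1000),  # last hour, in milliseconds
)
for event in resp["events"]:
    print(event["timestamp"], event["message"])
```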
Real-time inference traffic scaling can be addressed by working on 3 independent dimensions:
hardware: choosing larger machines or more machines. For example, you can load test your model endpoint with bigger and bigger machines and see which hardware size gives you acceptable latency. The Autoscaling feature of SageMaker helps you address this automatically. If deploying a deep neural net, you can also consider using appropriate accelerators, e.g. GPU (EC2 P3, EC2 G4) or Amazon Elastic Inference Accelerator, to make each prediction much faster.
software: you have 2 levers to tune here:
choosing a serving stack that is lean and fast. Different servers will handle load at different levels of performance. One common trick is to batch the load - for example, instead of hitting your server 100 times, can you hit it only once with a batch of 100 records? If clients cannot batch their requests, can you use micro-asynchrony so that you do the batching yourself after they issue requests? You can usually configure such micro-batching in advanced deep learning servers such as TF Serving or MXNet Model Server (both can be used in SageMaker), but otherwise you can also do it yourself by having a queue (SQS) in front of your server (a rough sketch of this appears after this list).
model compilation - optimizing the model graph and its runtime. This is a very smart concept that leverages the fact that when you know where you are going to deploy (e.g. NVIDIA, Intel, ARM, etc.), you have an insider edge and can refine your model artifact and create a bespoke runtime application tailor-made for that specific target platform. This can reduce memory consumption and latency by double-digit percentages, and it is an active area of ML research. In the SageMaker ecosystem, such a compilation task can be performed with SageMaker Neo, but the open-source ecosystem is developing fast, notably with treelite (paper, doc) for decision-tree compilation and TVM (paper, doc) for arbitrary neural-net compilation. Both are dependencies of Neo, by the way.
science: some models are slower or heavier than others. If speed and concurrency are your priorities over accuracy, and if you have already exploited all possible tricks at levels (1) and (2) above, consider using high-throughput models, e.g. linear models and logistic regression for structured data, MobileNet or SqueezeNet instead of large ResNets for classification (nice benchmark here), YOLOv3 instead of Faster R-CNN for detection (nice benchmark here), etc. But be aware that, unlike levels (1) and (2), changing the model science will alter accuracy.
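To make the queue-based micro-batching mentioned under (2) concrete, here is a rough sketch using SQS and the SageMaker runtime; the queue URL, endpoint name and batch payload format are placeholders and depend on your serving container:

```python
# Sketch of do-it-yourself micro-batching behind an SQS queue (names are placeholders).
# Clients drop single records on the queue; this worker drains up to 100 of them,
# sends one batched request to the model endpoint, and deletes the processed messages.
import json
import boto3

sqs = boto3.client("sqs")
sm_runtime = boto3.client("sagemaker-runtime")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/inference-requests"  # placeholder
ENDPOINT_NAME = "my-endpoint"  # placeholder

def drain_batch(max_records: int = 100) -> list[dict]:
    """Collect up to max_records messages from the queue (SQS returns at most 10 per call)."""
    messages = []
    while len(messages) < max_records:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=1)
        batch = resp.get("Messages", [])
        if not batch:
            break
        messages.extend(batch)
    return messages

def predict_batch(messages: list[dict]) -> None:
    records = [json.loads(m["Body"]) for m in messages]
    # One invocation for the whole batch instead of one call per record.
    # The {"instances": ...} payload shape is an assumption; adapt it to your container.
    resp = sm_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"instances": records}),
    )
    print(resp["Body"].read())
    for m in messages:
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])

if __name__ == "__main__":
    while True:
        batch = drain_batch()
        if batch:
            predict_batch(batch)
```

The trade-off is a little extra latency while each batch fills, in exchange for far fewer endpoint invocations.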
As mentioned above, those 3 areas of improvement really are about real-time inference; if you can afford to pre-compute all possible model inputs, then the ultimate low-latency, high-throughput solution is to pre-compute offline a variety of input-prediction pairs of interest and serve them on demand from a fast database or local read-only store.
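A minimal sketch of that pre-compute-and-lookup pattern, assuming a hypothetical DynamoDB table keyed by the model input:

```python
# Sketch of serving pre-computed predictions from DynamoDB (table and key names are hypothetical).
# An offline batch job writes (input_key -> prediction) pairs; online serving is a single lookup.
import boto3

table = boto3.resource("dynamodb").Table("precomputed-predictions")  # placeholder table name

def get_prediction(input_key: str):
    """Return the cached prediction for input_key, or None if it was never pre-computed."""
    resp = table.get_item(Key={"input_key": input_key})
    item = resp.get("Item")
    return item["prediction"] if item else None

print(get_prediction("user-123|product-456"))
```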
Lately, I have been struggling to understand what my network speed (downlink) is between nodes on AWS (in a multi-homed cluster, with computers in different regions).
I see a lot of fluctuation when I measure it with a script I have written (based on this link and SCP) or with iperf.
I believe it depends on network usage, which changes rapidly (mostly between regions), but I still don't understand from the AWS documentation what performance I am paying for - a minimum and a maximum downlink rate, for example (aws instances).
At first, I tried the T2 type, and since I saw it had burstable CPU performance, I thought that maybe the NIC performance was also bursty, so I moved to the M4 type, but I got the same problems with M4.
Is there any way to know my NIC downlink rate based on the type and flavor?
*I have asked a similar question on the AWS forum, but I haven't received a response (https://forums.aws.amazon.com/thread.jspa?threadID=296389).
There is no way to get a better indication than your own measuring. AWS does not publish anything indicating this performance, unless we are talking about the larger instances where network performance is actually specified, e.g. m5.12xlarge having 10 Gbps. Most likely network performance does have a burst component for smaller instance types.
There are pages with other people's benchmarks, but you won't find any official answer for any of this.
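If you want something lighter than iperf to quantify the fluctuation, a rough socket-level probe is easy to script and re-run on a schedule (the port and payload size below are arbitrary choices); it only gives point-in-time samples, so the interesting output is the spread over many runs:

```python
# Rough throughput probe between two instances (port and payload size are arbitrary).
# Run with "server" on one node and "client <server-ip>" on the other; the client
# streams a fixed payload and reports the achieved rate. Repeat over time to see the variance.
import socket
import sys
import time

PORT = 5001
PAYLOAD_MB = 100
CHUNK = b"x" * (1024 * 1024)  # 1 MiB chunk

def server() -> None:
    with socket.create_server(("", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            received = 0
            while data := conn.recv(1 << 20):
                received += len(data)
            print(f"received {received / 1e6:.1f} MB")

def client(host: str) -> None:
    with socket.create_connection((host, PORT)) as sock:
        start = time.perf_counter()
        for _ in range(PAYLOAD_MB):
            sock.sendall(CHUNK)
        elapsed = time.perf_counter() - start
    bits_sent = PAYLOAD_MB * len(CHUNK) * 8
    print(f"{bits_sent / elapsed / 1e6:.1f} Mbit/s")

if __name__ == "__main__":
    server() if sys.argv[1] == "server" else client(sys.argv[2])
```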
I created an Amazon EC2 Auto Scaling group that should have at least 1 server at all times.
Add 2 servers when CPU utilization goes above 80%
Terminate 2 servers when CPU utilization drops below 30%
The challenge here is: how should I increase/decrease CPU utilization? I cannot connect to any instance or use the CLI, since I am on an office system with restricted AWS access.
Is there a way to test this despite these restrictions?
There is a way to stress test an instance or container (assuming it is Linux-based) using Stress, a package designed to crank up the CPU for a specified amount of time and then let the CPU percentages drop back down afterwards. It has other parameters to customize the testing.
My personal favorite tool for testing system response and DR is Netflix's Chaos Monkey. It is an open source project, backed by Netflix, that is designed to test fault tolerance. Using it in production comes down to personal preference, but it is a solid tool for testing systems.
If you want to test the "real" situation, then you will need a way to generate load on the system.
This could be artificial load (e.g. triggering a program that does calculations just to spin the CPU) or a real-world simulation of the actual activities that your system will perform.
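If you do go the artificial-load route, a sketch along these lines (run on the instance itself, e.g. via user data at launch given your access restrictions; the duration is an arbitrary choice) will push every core to 100% long enough for the scale-out alarm to fire, then stop so the scale-in alarm can trigger:

```python
# Artificial CPU load sketch: keep every core busy for a fixed duration so the
# scale-out alarm (CPU > 80%) fires, then stop so the scale-in alarm can trigger.
# Duration is arbitrary; it should exceed the CloudWatch alarm's evaluation period.
import multiprocessing
import time

DURATION_SECONDS = 600  # keep above the alarm's evaluation period

def burn(seconds: float) -> None:
    """Busy-loop on one core for the given number of seconds."""
    end = time.time() + seconds
    x = 0
    while time.time() < end:
        x = (x * 31 + 7) % 1_000_003  # pointless arithmetic to keep the CPU busy

if __name__ == "__main__":
    workers = [multiprocessing.Process(target=burn, args=(DURATION_SECONDS,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```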
There is no need to test whether Amazon EC2's Auto Scaling actually works — there would be issues shown on the AWS Status Page if that were the case — so I presume you just wish to test your own configuration. In this case, you should really be testing a real world scenario, such as simulating a quantity of simultaneous users doing typical activities that users would perform.
If you do any other form of testing (such as fake increasing of CPU load), you're not really testing the real situation in which you want Auto Scaling to perform, so the results of your test won't actually be useful.
For example, it might be that your application runs into memory issues or single-threading issues way before it hits any CPU limits. That would be something you'd really like to know before throwing real users at your system.
I'm using cloud VPS instances to host very small private game servers. On Amazon EC2, I get good performance on their micro instance (1 vCPU [single hyperthread on a 2.5GHz Intel Xeon], 1GB memory).
I want to use Google Compute Engine though, because I'm more comfortable with their UX and billing. I'm testing out their small instance (1 vCPU [single hyperthread on a 2.6GHz Intel Xeon], 1.7GB memory).
The issue is that even when I configure near-identical instances running the same game with the same settings, the AWS EC2 instances perform much better than the GCE ones. To give you an idea, while the game isn't Minecraft, I'll use that as an example. On the AWS EC2 instances, subsequent world chunks load perfectly fine as players approach the edge of a chunk. On the GCE instances, even on more powerful machine types, chunks fail to load after players travel a certain distance, and they must disconnect from and log back in to the server to continue playing.
I can provide more information if necessary, but I'm not sure what is relevant. Any advice would be appreciated.
Diagnostic protocols to evaluate this scenario may be more complex than you want to deal with. My first thought is that this shared core machine type might have some limitations in consistency. Here are a couple of strategies:
1) Try varying the instance size. Since you only pay for 10 minutes at a time, you can cheaply see whether the performance is better on higher-level machines. If you have consistent performance problems no matter the size of the box, then I'm guessing it's something to do with the nature of your application and the nature of their virtualization technology.
2) Try measuring the consistency of the performance. I understand that it is unacceptable, but is it unacceptable based on how long the instance has been running? The nature of the workload? The time of day? If the performance is sometimes good but sometimes bad, then it's probably, once again, related to the type of your workload and their virtualization strategy.
Something Amazon is famous for is consistency. They work very hard to manage the consistency of their performance; it shouldn't spike up or down.
My best guess here, without all the details, is that you are using a very small disk. GCE throttles disk performance based on the disk size. You have two options: attach a larger disk or use PD-SSD.
See here for details on GCE Disk Performance - https://cloud.google.com/compute/docs/disks
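To check whether the disk is the bottleneck, a quick sequential-write probe run on both providers can be compared against those documented limits (the path and size below are arbitrary, and a single run is only a rough indicator):

```python
# Quick-and-dirty sequential write throughput check (file path and size are arbitrary).
# Compare the result against the per-disk throughput limits documented at the link above.
import os
import time

PATH = "/tmp/throughput_test.bin"
SIZE_MB = 512
CHUNK = os.urandom(1024 * 1024)  # 1 MiB of random data

start = time.perf_counter()
with open(PATH, "wb") as f:
    for _ in range(SIZE_MB):
        f.write(CHUNK)
    f.flush()
    os.fsync(f.fileno())  # make sure the data actually hit the disk
elapsed = time.perf_counter() - start

print(f"sequential write: {SIZE_MB / elapsed:.1f} MB/s")
os.remove(PATH)
```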
Please post back if this helps.
Anthony F. Voellm (aka Tony the #p3rfguy)
Google Cloud Performance Team
I'm running a few Blockchain related containers in a cloud environment (Google Compute Engine). I want to measure the power/energy consumption of the containers or the instance that I'm running.
There are tools like PowerAPI which make this possible on real infrastructure, where they have access to real CPU counters. It should also be possible to produce an estimation based on CPU, memory, and network usage; there is one such model proposed in the literature.
Is it theoretically possible to do this? If so, are there already existing tools for this task?
A generalized answer is "No, it is impossible". A program running in a virtual machine deals with virtual hardware. The goal of virtualization is to abstract running programs from physical hardware, while access to physical equipment is required to measure energy consumption. For instance, without processor affinity enabled, it is unlikely that PowerAPI will be able to collect useful statistics, due to virtual CPU migrations. Time-slicing, which allows multiple vCPUs to run on one physical CPU, is another factor to keep in mind. The same applies, needless to say, to the energy consumed by RAM, I/O controllers, storage devices, etc.
A substantive answer is "No, it is impossible". The authors of the manuscript use PowerAPI libraries to collect CPU statistics and a monitoring agent to count bytes transmitted through the network. The HWPC-sensor that PowerAPI relies on has a distinct requirement:
"monitored node(s) run a Linux distribution that is not hosted in a
virtual environment"
Also, the authors emphasize that they couldn't measure absolute values of power consumption, but rather used percentages to compare certain Java classes and methods in order to infer the origin of energy leaks in CPU-intensive workloads.
As for the network I/O, the number of bytes used to estimate power consumption in their model differs significantly between the network interface exposed in the virtual guest and the host hardware port on the network, with the SDN stack in between.
Measuring in cloud instances is more complicated, yes. If you do have access to the underlying hardware, then it would be possible. There is a fairly recent project that allows getting power consumption metrics for a qemu/kvm hypervisor and sharing the metrics relevant to each virtual machine through another agent in the VM: github.com/hubblo-org/scaphandre/
They are also working (at a very early stage) on a "degraded mode" to estimate* the power consumption of the instance/VM when such communication between the hypervisor and the VM is not possible, as when working on a public cloud instance.
*: Based on the resource (CPU/RAM/GPU...) consumption in the VM and on the characteristics of the instance and the underlying hardware's power consumption "profile".
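To make the limitation concrete, here is a toy version of that kind of estimation using psutil; the idle/max wattage coefficients are invented placeholders, which is exactly why the result can only be a rough relative indicator rather than a measurement:

```python
# Very rough power-estimation sketch in the spirit of that "degraded mode":
# it assumes a linear idle-to-max power profile for the underlying host, which is
# exactly the information a public-cloud tenant does not actually have, so the
# IDLE_WATTS / MAX_WATTS numbers below are pure placeholders.
import psutil  # pip install psutil

IDLE_WATTS = 50.0   # assumed host power at 0% CPU (placeholder)
MAX_WATTS = 200.0   # assumed host power at 100% CPU (placeholder)

def estimated_watts(sample_seconds: float = 1.0) -> float:
    """Estimate instantaneous power draw from CPU utilisation with a linear model."""
    cpu_fraction = psutil.cpu_percent(interval=sample_seconds) / 100.0
    return IDLE_WATTS + (MAX_WATTS - IDLE_WATTS) * cpu_fraction

if __name__ == "__main__":
    joules = 0.0
    for _ in range(60):                 # integrate over one minute
        joules += estimated_watts(1.0)  # watts * 1 second
    print(f"estimated energy over 1 min: {joules:.0f} J ({joules / 3600:.4f} Wh)")
```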
Can we conclude that it is impossible to measure the power/energy consumption of a process running on a virtual machine?