Google Cloud Instance downtime issue - google-cloud-platform

I am using a Google Cloud instance for one of my websites.
Every day at around the same time my server goes down; the timing varies by at most 1-10 minutes from day to day.
When I check the monitoring, it shows that Disk Throughput (Write) is very high.
I have already changed the disk and am now using an N2 machine type.
Waiting for suggestions.
Thanks

In scenarios like this, usually an application running in your VM is consuming more resources than the VM has available.
You could also review whether there is a peak in CPU utilization and/or network traffic at the same time; that could point to HTTP requests overloading your VM.
As a short-term solution you could add more persistent disk capacity and change the machine type to increase disk I/O performance; for reference, you can review the article Optimizing persistent disk performance.
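If it helps, here is a minimal sketch of pulling the relevant metrics with the Cloud Monitoring Python client so you can line up disk write throughput against CPU utilization around the outage window; the project ID and the one-hour window are placeholders for your own values.

```python
# Minimal sketch using the Cloud Monitoring client (google-cloud-monitoring).
# PROJECT_ID and the time window are placeholders; adjust them to your setup.
import time
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # hypothetical project ID
client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{PROJECT_ID}"

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 3600},  # last hour; widen to cover the outage
    }
)

# Compare disk write throughput with CPU utilization over the same interval.
for metric_type in (
    "compute.googleapis.com/instance/disk/write_bytes_count",
    "compute.googleapis.com/instance/cpu/utilization",
):
    results = client.list_time_series(
        request={
            "name": project_name,
            "filter": f'metric.type="{metric_type}"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    for series in results:
        instance = series.resource.labels.get("instance_id", "unknown")
        for point in series.points:
            print(metric_type, instance, point.interval.end_time, point.value)
```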

Related

Suitable Google Cloud Compute instance for Website Hosting

I am new to cloud computing, but want to use it to host a website I am building. The website will be a data analytics site, and each user will be interacting with a MySQL database and reading data from text files. I want to be able to accommodate about 500 users at a time. The site will likely have around 1000-5000 users fully scaled. I have chosen GCP, and am wondering if the e2-standard-2 VM instance would be enough to get started. I will also be using a GCP HA MySQL server; I am thinking that 2 vCPUs and 5 GB of memory will be enough, with 50 GB of high-availability SSD storage. Any suggestions would be appreciated. Also, is there any other service I will need? Thank you!!
Exact upfront sizing matters less than you might think. On Google Cloud Platform you have real-time monitoring for CPU and RAM usage, so if your website gains more users you can upgrade (or downgrade) your CPU and RAM with two or three mouse clicks. Start small and upgrade later if you see CPU or RAM getting close to 100% usage. Start with an N1-series micro instance with 0.6 GB RAM.
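For what it's worth, here is a rough sketch of what that resize looks like through the Compute Engine API using google-api-python-client; the project, zone, instance, and target machine type are placeholders, and the VM has to be stopped before its machine type can be changed.

```python
# Rough sketch of resizing a VM via the Compute Engine API (google-api-python-client).
# PROJECT, ZONE and INSTANCE are hypothetical placeholders.
from googleapiclient import discovery

PROJECT, ZONE, INSTANCE = "my-project", "us-central1-a", "my-website-vm"

compute = discovery.build("compute", "v1")

# Stop the instance, change its machine type, then start it again.
compute.instances().stop(project=PROJECT, zone=ZONE, instance=INSTANCE).execute()
# ... wait for the stop operation to finish before continuing ...
compute.instances().setMachineType(
    project=PROJECT,
    zone=ZONE,
    instance=INSTANCE,
    body={"machineType": f"zones/{ZONE}/machineTypes/e2-standard-4"},
).execute()
compute.instances().start(project=PROJECT, zone=ZONE, instance=INSTANCE).execute()
```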

GCP VM stopped responding

We have a VM server on GCP. Yesterday, the server stopped responding, we could not even SSH into the server, but everything was ok after restarting the server. I am having a look at the metrics and this is what I have noticed:
There is no Memory Utilization data for that period. Before this, the Memory Utilization was 90%.
Read throughput is quite high: 13 MiB/s.
What could have gone wrong? What else should I consider looking at?
Harith:
The application processes running in your VM most likely consumed all of the memory assigned to the VM.
Analyze each application hosted on the VM and evaluate its minimal technical requirements (MTRs) and the actual workload each one represents, in order to estimate whether the amount of memory assigned is enough to support that load.
Consult log entries on those applications, if available, to see whether they reveal the consumption level around the time of the unresponsive condition.
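If the application logs don't record memory usage, a small watcher script can fill the gap. This is just a sketch assuming the psutil package is installed; the log path is a placeholder.

```python
# Small sketch (assuming the psutil package) that appends per-process memory usage
# to a log file every minute, so you can see what was consuming RAM right before
# the VM became unresponsive.
import time
import psutil

LOG_PATH = "/var/log/memwatch.log"  # hypothetical path

while True:
    procs = sorted(
        psutil.process_iter(["pid", "name", "memory_info"], ad_value=None),
        key=lambda p: p.info["memory_info"].rss if p.info["memory_info"] else 0,
        reverse=True,
    )
    with open(LOG_PATH, "a") as log:
        log.write(f"--- {time.strftime('%Y-%m-%d %H:%M:%S')} ---\n")
        for proc in procs[:10]:  # top 10 memory consumers
            mem = proc.info["memory_info"]
            rss_mb = mem.rss / (1024 * 1024) if mem else 0.0
            name = proc.info["name"] or "?"
            log.write(f"{proc.info['pid']:>7}  {name:<30} {rss_mb:10.1f} MiB\n")
    time.sleep(60)
```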
Consider changing the machine type if you need to increase any resource capacity assigned to your VM.
If the resource consumption of the applications running on your VM is highly variable, consider implementing an autoscaled group of instances.

Azure VM Inbound Throttling to VMs?

We have 2 Elastic VMs (Linux) (currently DS2v2) behind an Azure Load Balancer. We are doing HTTP POSTs from our local LAN into the Load Balancer, but we seem to be getting throttled. We have tried: changing the size of the VMs, no difference; adding additional premium SSDs, again no difference; running multiple threads on our end, again no difference.
What we did do, though, was have the Elastic engine ingest all of the log files from the Linux boxes, and the index rate jumped quite high while it was ingesting them. So we are assuming that it's not really the Linux Elastic boxes that are throttling us.
We do have Kibana installed on the boxes, and as a baseline we're just using the "Cluster Indexing Rate" for both our local POSTs to the box and the local ingestion of the log files.
We do understand that yes, there is going to be some latency and overhead since we are now involving the internet, but not at the rates we are currently getting. (We have a 1G pipe to the internet, and it's nowhere near capacity, so we can at least rule out the path out of our company.)
The question is, where else can we look to determine where we might be getting throttled?
Performance that is "much slower" is a somewhat subjective problem and hard to pin down, but here is some information on factors that may affect it.
Azure Compute requests may be throttled at the subscription level and on a per-region basis. If you are hitting an API throttling error, you could refer to this document to troubleshoot throttling issues, and to the best practices to avoid being throttled.
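On the client side, one of those best practices is to back off when you receive HTTP 429 responses. A minimal sketch, assuming the requests library, with the endpoint URL and payload as placeholders:

```python
# Minimal sketch of posting with backoff when the service responds with HTTP 429.
import time
import requests

URL = "http://my-load-balancer.example.com/ingest"  # hypothetical endpoint

def post_with_backoff(payload, max_retries=5):
    delay = 1.0
    for attempt in range(max_retries):
        resp = requests.post(URL, json=payload, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Honour the Retry-After header if the service provides one.
        retry_after = resp.headers.get("Retry-After")
        time.sleep(float(retry_after) if retry_after else delay)
        delay *= 2  # exponential backoff
    raise RuntimeError("still throttled after retries")
```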
CPU and storage limits, which differ across Azure VM sizes, may affect the VM's ability to process incoming data. You could change the size to one with more CPUs and a premium SSD disk. You could also move the Azure resources to another region closer to your location; you could refer to this article.

AWS Network out

Our web application has 5 pages (Signin, Dashboard, Map, Devices, Notification)
We have done a load test for this application, and the load test script does the following:
Signin and go to Dashboard page
Click Map
Click Devices
Click Notification
We have a basic free plan in AWS.
While performing the load test, up to about 100 users we didn't get any errors; please see the image below. NetworkIn and CPUUtilization seemed normal, but NetworkOut showed 846K.
But when we reached around 114 users, we started getting errors on the Map page (highlighted in red). During that time, it seems only NetworkOut was high. Please see the image below.
We want to know what an optimal value for NetworkOut is. If this number is high, is there any way to reduce it?
Please let me know if you need more information. Thanks in advance for your help.
You are using a t2.micro instance.
This instance type has CPU limitations: it is good for bursty workloads, but sustained loads will consume all of the available CPU credits, so it might perform poorly under sustained load over long periods.
The instance also has limited network bandwidth that might impact the throughput of the server. While all Amazon EC2 instances have limited allocations of bandwidth, the t2.micro and t2.nano have particularly low bandwidth allocations. You can see this when copying data to/from the instance and it might be impacting your workloads during testing.
The t2 family, especially at the low-end, is not a good choice for production workloads. It is great for workloads that are sometimes high, but not consistently high. It is also particularly low-cost, but please realise that there are trade-offs for such a low cost.
See:
Amazon EC2 T2 Instances – Amazon Web Services (AWS)
CPU Credits and Baseline Performance for Burstable Performance Instances - Amazon Elastic Compute Cloud
Unlimited Mode for Burstable Performance Instances - Amazon Elastic Compute Cloud
That said, the network throughput showing on the graphs is a result of your application. While the t2 might be limiting the throughput, it is not responsible for the spike on the graph. For that, you will need to investigate the resources being used by the application(s) themselves.
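To confirm whether this is what you are hitting, you could pull the instance's CPUCreditBalance and NetworkOut from CloudWatch for the test window. A rough sketch using boto3, with the instance ID, region, and time window as placeholders:

```python
# Rough sketch (boto3): pull CPUCreditBalance and NetworkOut for the load-test
# window, to check whether CPU credits ran out while NetworkOut spiked.
from datetime import datetime, timedelta
import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance ID
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

end = datetime.utcnow()
start = end - timedelta(hours=1)  # cover the load-test period

for metric in ("CPUCreditBalance", "NetworkOut"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=metric,
        Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
        StartTime=start,
        EndTime=end,
        Period=300,  # 5-minute buckets
        Statistics=["Average"],
    )
    for point in sorted(stats["Datapoints"], key=lambda d: d["Timestamp"]):
        print(metric, point["Timestamp"], point["Average"])
```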
NetworkOut simply refers to the volume of outgoing traffic from the instance. To reduce NetworkOut you reduce the data the instance sends out, so you may need to see which of Click Map, Click Devices, and Click Notification sends the most traffic out of the instance. It may not be related only to the number of users, but to the combination of the number of users and the application module.

Lag spikes on google container engine running a flask restful api

I'm running a Flask-RESTPlus API on Google Container Engine behind a TCP Load Balancer. The API makes calls to Google Cloud Datastore and Cloud SQL, but this does not seem to be the problem.
A few times a day, or even more often, there is a period of latency spikes. Restarting the pod solves it, or it resolves itself within 5 to 10 minutes. Of course this is too much and needs to be resolved.
Does anyone know what the problem could be, or have experience with this kind of issue?
Thx
One thing you could try is monitoring your instance CPU load.
Although the latency doesn't correspond with usage spikes, it may be the case that there is a cumulative effect on CPU load and the latency you're experiencing occurs when the CPU reaches a given percentage and needs to back off temporarily. If this is the case you could make use of cluster autoscaling, or try running a higher-spec machine to see if that makes any difference. Or, if you have set CPU limits on your pods/containers, try increasing those limits.
If you're confident CPU isn't the cause of the issue, you could try to SSH into the affected instance while the issue is occurring, send a request through the load balancer, and use tcpdump to analyse the traffic coming in and out. You may be able to spot whether the latency stems from the load balancer (by monitoring the latency of HTTP traffic to the instance) or from the calls to Cloud Datastore or Cloud SQL made from the instance.
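It can also help to measure latency inside the application itself, so slow requests end up in the pod logs and can be correlated with the spikes. A small sketch of Flask request-timing hooks; nothing here is specific to Flask-RESTPlus and the logging setup is only illustrative:

```python
# Small sketch: log the duration of every request handled by the Flask app,
# so latency spikes can be traced to specific endpoints in the pod logs.
import logging
import time

from flask import Flask, g, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

@app.before_request
def start_timer():
    g.start_time = time.monotonic()

@app.after_request
def log_request(response):
    elapsed_ms = (time.monotonic() - g.start_time) * 1000
    app.logger.info("%s %s -> %s in %.1f ms",
                    request.method, request.path, response.status_code, elapsed_ms)
    return response
```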
Alternatively, try using strace to monitor the relevant processes both before and during the latency, or dtrace to monitor the system as a whole.