GCP Network Egress charges are growing exponentially - google-cloud-platform

Our network egress charges are growing month over month. Going by the cost, we are egressing upwards of 800 GB a month, to the tune of roughly 300 KB/s on average (600 KB/s during the day, 200 KB/s at night).
I analyzed all the scripts that send out data, but none of them sends anywhere near this volume. I turned them off one by one, and it didn't make much difference.
I briefly turned on VPC Flow Logs, then downloaded and analyzed them. The traffic is spread across IPs: about 300 different IPs per minute, averaging 10-12 KB each, so about 33 MB/min in total. No single IP stood out.
I noticed most of them using port 443.
When I used nethogs to identify the process doing the most egress, it only showed Apache, and at only about 50 KB/s. Where is the rest of the egress?
I considered the possibility of a DDoS attack, but that should show up in the Apache access logs, and they show no suspicious IPs or URLs.
Looking for hints on the direction I should take. Apologies if I have left out any crucial detail needed to analyze the issue; I will keep adding more details to the question.
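For reference, this is roughly how I aggregated the flow logs: a minimal sketch that assumes the logs were exported from Cloud Logging as JSON lines, with `jsonPayload.connection.dest_ip` and `jsonPayload.bytes_sent` fields (your export's field names may differ).

```python
import json
from collections import Counter

def top_egress_talkers(path, top_n=20):
    """Sum bytes_sent per destination IP from a VPC Flow Logs JSON-lines export."""
    totals = Counter()
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            payload = entry.get("jsonPayload", {})
            dest = payload.get("connection", {}).get("dest_ip", "unknown")
            totals[dest] += int(payload.get("bytes_sent", 0))
    return totals.most_common(top_n)
```

Run against a minute of logs, the top of this list should account for most of the 33 MB; in my case it didn't, which is exactly the puzzle.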

If you suspect a DDoS attack has already happened, I would recommend using Cloud Armor, but before that, check that you have followed all the recommended mitigations for avoiding DDoS attacks.
https://cloud.google.com/files/GCPDDoSprotection-04122016.pdf

What you're experiencing is most probably a DDoS attack, just as sankar wrote.
By your account nothing stands out in the logs, which makes the DDoS theory more probable.
Using Cloud Armor seems the easiest way to protect your server/app out of the box without too much effort, since one of its key features is Adaptive Protection:
Google Cloud Armor Adaptive Protection helps you protect your Google Cloud applications, websites, and services against L7 distributed denial-of-service (DDoS) attacks such as HTTP floods and other high-frequency layer 7 (application-level) malicious activity. Adaptive Protection builds machine-learning models that do the following:
Detect and alert on anomalous activity
Generate a signature describing the potential attack
Generate a custom Google Cloud Armor WAF rule to block the signature
This way you will be able to block most attacks of that kind and save money. Even though the feature is paid, it should pay for itself, and on top of that your server will be a lot more secure, so you can focus on other things.
---------- UPDATE ----------
There may be one more reason.
A rootkit typically patches the kernel or other software libraries to alter the behavior of the operating system. Once that happens, you cannot trust anything the operating system tells you, so typical tools won't show the hidden traffic or any suspicious processes.
Have a look at the list of tools that may be helpful to detect any rootkits.
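One cheap cross-check before reaching for rootkit scanners: sample the kernel's own interface counters over an interval and compare the implied rate with what billing and VPC Flow Logs report. A userspace tool like nethogs can miss traffic that the raw counters still record (though a kernel-level rootkit can lie here too). A sketch of parsing `/proc/net/dev` (the TX-bytes column position below matches the standard layout, but verify against your kernel's header line):

```python
def parse_tx_bytes(proc_net_dev: str, iface: str) -> int:
    """Extract transmitted bytes for one interface from /proc/net/dev contents.

    After the "iface:" prefix, the first 8 numbers are RX stats and the
    9th (index 8) is TX bytes.
    """
    for line in proc_net_dev.splitlines():
        if ":" not in line:
            continue  # skip the two header lines
        name, stats = line.split(":", 1)
        if name.strip() == iface:
            return int(stats.split()[8])
    raise ValueError(f"interface {iface!r} not found")
```

Read the file twice 60 seconds apart and divide the delta by 60; if that rate matches what GCP bills you but not what nethogs shows, the hiding is in userspace accounting, not the NIC.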

Related

What will happen if my virtual machine is too slow?

I have a newbie question here; I'm new to cloud and Linux. I'm using Google Cloud now and wondering, when choosing a machine config: what if my machine is too slow? Will it make the app crash, or just slow it down? How fast should my VM be?
The image below shows the last 6 hours of CPU usage for a Python script I'm running. It's obviously using less than 2% of the CPU most of the time, but there's a small spike. Should I care about the spike? Also, how high should my CPU usage get before I upgrade? If a script I'm running uses 50-60% of the CPU most of the time, I assume I'm safe. What's the max before you upgrade?
what if my machine is too slow? will it make the app crash? or just slow it down
It depends.
Some applications will just respond slower. Some will fail if they have timeout restrictions. Some applications will begin to thrash which means that all of a sudden the app becomes very very slow.
A general rule, which varies among architects, is to never consume more than 80% of any resource. I use a 50% rule so that my service can handle bursts of traffic or denial-of-service attempts.
Based on your graph, your service is fine. The spike is probably normal system processing. If the spike went to 100%, I would be concerned.
Once your service consumes more than 50% of a resource (CPU, memory, disk I/O, etc) then it is time to upgrade that resource.
Also, consider that there are other services that you might want to add. Examples are load balancers, Cloud Storage, CDNs, firewalls such as Cloud Armor, etc. Those types of services tend to offload requirements from your service and make your service more resilient, available and performant. The biggest plus is your service is usually faster for the end user. Some of those services are so cheap, that I almost always deploy them.
You should choose machine family based on your needs. Check the link below for details and recommendations.
https://cloud.google.com/compute/docs/machine-types
If CPU is your concern, you should create a managed instance group that automatically scales based on CPU usage. Usually 80-85% is a good value for the max CPU utilization target. Check the link below for details.
https://cloud.google.com/compute/docs/autoscaler/scaling-cpu
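To make the target-utilization idea concrete, here is a conceptual sketch of how this style of autoscaling picks a group size: it scales the group so that average CPU lands back at the target. (The real autoscaler adds smoothing, cooldown periods, and scale-in controls; this is only the core formula.)

```python
import math

def recommended_replicas(current: int, observed_util: float,
                         target_util: float = 0.8) -> int:
    """Size the group so average CPU returns to target_util.

    E.g. 4 VMs at 95% average CPU with an 80% target -> 5 VMs.
    """
    return max(1, math.ceil(current * observed_util / target_util))
```

So a spike like the one in your graph only triggers a scale-out if it pushes the group's average above the target; a brief blip on one mostly idle VM does not.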
You should also consider the availability your workload needs in order to keep costs efficient. See the link below for other useful info.
https://cloud.google.com/compute/docs/choose-compute-deployment-option

AWS EC2 Performance explanation

I have a REST API web server, built in .NET Core, that has data-heavy APIs.
This is hosted on AWS EC2. I have noticed that the average response time for certain APIs is ~4 seconds, and if I turn up the EC2 specs, the response time drops to a few milliseconds. I guess this is expected. What I don't understand is that even when I load test the APIs on a lower-end instance, the server never crosses 50% utilization of memory/CPU. So what is the correct technical explanation for the APIs performing faster if the lower-end instance never reaches 100% utilization?
There is no simple answer; there are so many EC2 variations that you first need to figure out what is slowing down your API.
When you 'turn up' your EC2 instance, you get some combination of more memory, faster CPU, faster disk, and more network bandwidth, and we can't tell which of those is improving your performance. Different instance classes are optimized for different problems.
It could be as simple as the better network bandwidth, or it could be that your application is disk-bound and the better instance you chose is optimized for I/O performance.
Knowing which resource your instance is short on would help you decide which instance type to upgrade to. Or, as you have found out, you can just upgrade to something 'bigger' and be happy with the performance, at the tradeoff of it being more expensive.
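Note also that sub-100% CPU with slow responses often means the request is waiting on something (disk, network, database) rather than computing, or that a single thread is pinning one core while the average across cores stays low. A quick, generic way to see where one request's time goes is to compare wall-clock time against CPU time; the handler functions below are stand-ins for your real API code:

```python
import time

def time_split(fn, *args):
    """Run fn and return (wall_seconds, cpu_seconds).

    cpu close to wall  -> CPU-bound: a faster CPU helps.
    cpu well below wall -> the time is spent waiting on I/O:
    faster disk/network (or fewer round trips) helps instead.
    """
    w0, c0 = time.perf_counter(), time.process_time()
    fn(*args)
    return time.perf_counter() - w0, time.process_time() - c0

def io_heavy():
    time.sleep(0.2)                      # stands in for a database/disk wait

def cpu_heavy():
    sum(i * i for i in range(10**6))     # pure computation
```

If your 4-second endpoints turn out to be I/O-bound, the milliseconds on the bigger instance likely came from its better EBS/network allowance, not its CPU.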

Generating SCOM alerts for web servers

I have 4-7 SharePoint servers. We already have a SCOM alert that fires if a server is down, but we also want a SCOM alert if the website is down.
Can we generate an alert in SCOM using ping functionality? My idea is that we ping the server continuously, and when the website is unresponsive for some time, we get an alert saying that the website is unresponsive.
Can this be implemented? How much effort is needed? And do we need any other services to be implemented?
Any help would be appreciated.
Ravi,
Forgive me for the post being more philosophy and less answer.
For better or worse, Microsoft has resisted implementing a simple ping monitor in SCOM. There is a solid reason for this: it would be over-used by folks who don't know any better, and the results would reflect poorly on the quality of SCOM as a monitoring tool. What I mean by that is that a ping monitor is a terrible idea, as it doesn't tell the poor soul who was woken at 2am much of anything beyond the highest-level notion that something is wrong.
If you have 5 minutes to sit in front of the SCOM console to create a ping alert then you would serve your support teams much better if you spent those same 5 minutes creating a Web Application Availability monitor. The reason for this is that the Web App Avail monitor will actually look at the response to ensure that it is logical and successful.
Here is the documentation to create a Web Application Availability Monitor. It looks difficult only until your first implementation. It really is a snap. https://technet.microsoft.com/en-us/library/hh881882(v=sc.12).aspx
Consider that if you had a ping monitor and someone accidentally deleted your index.html file, your ping will happily chug along without telling anyone. Same with a bad code update. Heck, you could even stop your web application server and ping is still going to respond.
Conversely, if you had a Web App Availability monitor pointed at each node in a load-balanced web farm and your load balancer failed, all of your web monitors would continue to report healthy while the monitor watching the load balancer would start to fail. A quick glance at the console would tell your support team that the issue is indeed not with the web servers themselves.
It is good philosophy to implement your monitors so that they test the target as completely as possible and in the most isolated way possible. You would not want to point a Web App Availability monitor at a load balancer, as you would not necessarily know which endpoint failed to respond to SCOM and triggered the alert. Some folks go to great lengths to work around this by implementing health-check pages that respond with their hostnames. This is usually not necessary; simply create a monitor against each individual node. You will also want to monitor your load balancer directly so that you know it is up as well.
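To illustrate the difference in miniature (SCOM's Web Application Availability monitor does all of this natively; this is just the concept): a ping only proves the host answers ICMP, while a real availability check verifies the HTTP status and that the page contains content only a working application would render.

```python
from urllib.request import urlopen
from urllib.error import URLError

def page_is_healthy(url: str, must_contain: str, timeout: float = 5.0) -> bool:
    """Return True only if the URL answers 200 AND the body contains a
    marker string. A ping would still 'pass' with index.html deleted,
    the app pool stopped, or the server returning a 500 error page."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return resp.status == 200 and must_contain in body
    except (URLError, OSError):
        return False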
On another note, there is already a SharePoint management pack (one for each version of SharePoint) that you can download from Microsoft for free. This management pack will automatically discover and monitor all of the SharePoint components in your infrastructure. It works quite well, but if you are new to SCOM, the volume of data and alerts it creates can be a bit overwhelming at first.
SharePoint 2016 (there is one for each version) management pack: https://blogs.technet.microsoft.com/wbaer/2015/09/08/system-center-operations-management-pack-for-sharepoint-server-2016-it-preview/
There is also a third-party management pack that lets you simply create ping monitors. People REALLY want this. I will respectfully tell you that they do more harm than good in the majority of implementations that use them. But at the end of the day, sometimes you just want something that works and that you understand, so here it is:
Ping management pack: https://www.opslogix.com/ping-management-pack/

Improve Upload Speed

I have a client I developed a Rails app for. The app relies on his customers uploading a variety of images, files, and PDFs, with sizes ranging from 1 MB to 100 MB.
He has been telling me that many of his customers are complaining about slow and unstable upload speeds.
I upload directly to Amazon S3. I explained to him that there are factors affecting upload speed that are out of my control.
But he insists there is something we can do to improve upload speed.
I'm running out of ideas and expertise here. Does anyone have a solution?
On the surface, there are two answers -- no, of course there's nothing you can do, the Internet is a best-effort transport, etc., etc.,... and no, there really shouldn't be a problem, because S3 uploads perform quite well.
There is an option worth considering, though.
You can deploy a global network of proxy servers in front of S3 and use geographic DNS to route those customers to their nearest proxy. Then install high-speed, low latency optical circuits from the proxies back to S3, reducing the amount of "unknown" in the path, as well as reducing the round-trip time and packet loss potential between the browser and the chosen proxy node at the edge of your network, improving throughput.
I hope the previous paragraph is amusing on first reading, since it sounds like a preposterously grandiose plan for improving uploads to S3... but of course, I'm referring to CloudFront.
You don't actually have to use it for downloads; you can, if you want, just use it for uploads.
your users can now benefit from accelerated content uploads. After you enable the additional HTTP methods for your application’s distribution, PUT and POST operations will be sent to the origin (e.g. Amazon S3) via the CloudFront edge location, improving efficiency, reducing latency, and allowing the application to benefit from the monitored, persistent connections that CloudFront maintains from the edge locations to the origin servers.
https://aws.amazon.com/blogs/aws/amazon-cloudfront-content-uploads-post-put-other-methods/
To illustrate that the benefit here does have a solid theoretical basis...
Back in the day when we still used telnet, when T1s were fast Internet connections and 33.6kbps was a good modem, I discovered that I had far better responsiveness from home, making a telnet connection to a distant system, if I first made a telnet connection to a server immediately on the other side of the modem link, then make a telnet connection to the distant node from within the server.
A direct telnet connection to the distant system followed exactly the same path, through all the same routers and circuits, and yet, it was so sluggish as to be unusable. Why the stark difference, and what caused the substantial improvement?
The explanation was that making the intermediate connection to the server meant there were two independent TCP connections, with only their payload tied together: me to the server... and the server to the distant system. Both connections were bad in their own way -- high latency on my modem link, and congestion/packet loss on the distant link (which had much lower round-trip times, but was overloaded with traffic). The direct connection meant I had a TCP connection that had to recover from packet loss while dealing with excessive latency. Making the intermediate connection meant that the recovery from the packet loss was not further impaired by the additional latency added by my modem connection, because the packet loss was handled only on the 2nd leg of the connection.
Using CloudFront in front of S3 promises to solve the same sort of problem in reverse -- improving the responsiveness, and therefore the throughput, of a connection of unknown quality by splitting the TCP connection into two independent connections, at the user's nearest CloudFront edge.
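You can put rough numbers on this with the standard Mathis et al. model of steady-state TCP throughput, rate ≤ (MSS/RTT) × (C/√p). It's a simplified model and the figures below are purely illustrative, but it shows why decoupling loss from the long round trip helps so much:

```python
from math import sqrt

def mathis_throughput(mss_bytes: float, rtt_s: float, loss: float) -> float:
    """Mathis steady-state TCP throughput bound in bytes/sec:
    rate <= (MSS / RTT) * (C / sqrt(p)), with C ~= 1.22 for
    Reno-style recovery. Rough, but it captures the RTT x loss coupling."""
    C = 1.22
    return (mss_bytes / rtt_s) * (C / sqrt(loss))

mss = 1460.0  # typical Ethernet-path MSS
# Direct: one TCP connection sees the full 300 ms RTT AND the 1% loss.
direct = mathis_throughput(mss, 0.300, 0.01)
# Split at a nearby edge: loss recovery happens on the short 60 ms leg
# only (the long backhaul leg is assumed clean in this model).
split = mathis_throughput(mss, 0.060, 0.01)
```

With these illustrative inputs the direct path tops out around 58 KB/s while the split connection's lossy leg supports about 5× that, because recovering from each loss now costs a 60 ms round trip instead of 300 ms.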

What does amazon AWS mean by "network performance"?

When choosing an Amazon AWS instance type to launch, each type has a property called "Network Performance", which is either "Low", "Moderate", or "High".
I'm wondering what this means exactly. Will my ping be worse if I choose Low? Or will it be OK as long as many users aren't logged in at once?
I'm launching a real-time multiplayer game, so I am curious exactly what is meant by "network performance". I actually need fairly low memory and processing power, but instances with those specs usually have "Low" network performance.
Has anyone experience with the different network performances or have more information?
It's not official, but Serhiy Topchiy did a benchmark with different instance types:
http://epamcloud.blogspot.com.br/2013/03/testing-amazon-ec2-network-speed.html
For us-east-1, it seems that Low corresponds to about 50 Mb/s, Moderate to about 300 Mb/s, and High to about 1 Gb/s.
My recent experience here: https://serverfault.com/questions/1094608/benchmarking-aws-outbound-internet-bandwidth-egress-up-to-25-gbps
We ran a live video broadcast on two AWS EC2 servers hosting 500 viewers; it degraded catastrophically after 10 minutes.
We reproduced the outbound bandwidth throttling with iperf (see link above).
I believe it was mentioned at the re:Invent 2013 conference that the different tiers relate to the underlying network connection: some hosts have 10 Gbps connections (High), some have 1 Gbps (Moderate), and some have 100 Mbps (Low).
I cannot find any on-line documentation to confirm this, however.
Edit: There is an interesting article on the packets-per-second limit available here
Since this question was first posed, AWS has released more information on the networking stack, and many of the newer instance families can support up to 25Gbps with the appropriate ENA drivers. It looks like much of the increased performance is due to the new Nitro system.