Why does CPU utilization on a VM instance show intervals? (Google Cloud Platform)

I'm new to the Google Cloud Platform and am currently trying to understand the CPU utilization chart (attached below).
This is the chart for a medium machine type with 2 reserved vCPUs.
What I can't understand is why the CPU utilization has this pattern (the line goes up and down, from 5% to 26%, again and again) when my machine's usage is more or less steady. I know that small machines allow CPU bursting, but that doesn't seem to be the explanation, since my usage never topped the CPU cap.
Details on the VM:
machine type is e2-medium with 2 vCPUs and 4 GB of memory
the instance is used as a white-label server
I will be grateful for any hint!

The metric you mentioned indicates that some process on your VM uses the CPU at specific time intervals.
Without more information it's a guessing game to figure out why your chart spikes from 5% to 25% and back.
To nail down the process that's causing it, you may try using ps or the more "human friendly" htop command and see what's going on in your system.
For example:
$ ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head
PID PPID CMD %MEM %CPU
397 1 /usr/bin/google_osconfig_ag 0.7 0.0
437 1 /usr/bin/google_guest_agent 0.5 0.0
408 1 /usr/bin/python3 /usr/share 0.4 0.0
1 0 /sbin/init 0.2 0.0
234 1 /lib/systemd/systemd-journa 0.2 0.0
17419 1 /lib/systemd/systemd --user 0.2 0.0
17416 543 sshd: wb [priv] 0.2 0.0
575 1 /lib/systemd/systemd-logind 0.1 0.0
543 1 /usr/sbin/sshd -D 0.1 0.0
Here are some tips on how to use ps, and another useful Q&A regarding it.
Have a look at this answer - it may give you some insight.
Also - this is what the official GCP docs say about the CPU usage metric collected from the agent running on the VM:
Percentage of the total CPU capacity spent in different states. This value is reported from inside the VM and can differ from compute.googleapis.com/instance/cpu/utilization, which is reported by the hypervisor for the VM. Sampled every 60 seconds.
cpu_number: CPU number, for example, "0", "1", or "2". This label is only set with certain Monitoring configurations. Linux only.
cpu_state: CPU state, one of [idle, interrupt, nice, softirq, steal, system, user, wait] on Linux or [idle, used] on Windows.
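If the spikes repeat on a schedule, sampling per-process CPU over time will usually catch the culprit. Here is a minimal sketch, assuming the third-party psutil package is installed (pip install psutil); the 5-second interval and the top-5 cutoff are arbitrary choices:
import time
import psutil
# Prime the per-process CPU counters: the first cpu_percent() call always returns 0.0.
procs = list(psutil.process_iter(['pid', 'name']))
for p in procs:
    try:
        p.cpu_percent(None)
    except psutil.NoSuchProcess:
        pass
while True:  # stop with Ctrl+C
    time.sleep(5)
    snapshot = []
    for p in procs:
        try:
            snapshot.append((p.cpu_percent(None), p.info['pid'], p.info['name']))
        except psutil.NoSuchProcess:
            continue
    # Print the top 5 CPU consumers for this interval.
    for cpu, pid, name in sorted(snapshot, reverse=True)[:5]:
        print(f"{cpu:5.1f}%  pid={pid}  {name}")
    print('---')
Whatever keeps floating to the top while the chart is spiking (a cron job, a backup, a monitoring agent, your own workload) is the process to look at.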

Related

How to test the integrity of hardware on an AWS instance?

I have a cluster of consumers (50 or so instances) consuming from Kafka partitions.
I notice that one server is consistently slow. Its CPU usage is always around 80-100%, while the consumers on the other partitions are around 50%.
Originally I thought there was a slight chance this was traffic dependent, so I manually switched the partitions that the slow consumer was consuming.
However, I did not observe an increase in processing speed.
I also don't see CPU steal in iostat, but since all consumers run the same code, I suspect there is some bottleneck in the hardware.
Unfortunately, I can't just replace the server unless I can provide conclusive proof that the hardware is the problem.
So I want to write a load-testing script that pinpoints the bottleneck.
My plan is to write a while loop in Python that does n computations and find out the maximum number of computations the slow consumer can do versus the fast consumer.
What other testing strategies can I try?
Perhaps I should test for a disk bottleneck by having my Python script write to a text file? (A sketch of such a benchmark follows the iostat output below.)
Here is the fast consumer's iostat:
avg-cpu: %user %nice %system %iowait %steal %idle
50.01 0.00 3.96 0.13 0.12 45.77
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvda 1.06 0.16 11.46 422953 30331733
xvdb 377.63 0.01 46937.99 35897 124281808572
xvdc 373.43 0.01 46648.25 26603 123514631628
md0 762.53 0.01 93586.24 22235 247796440032
Here is the slow consumer's iostat:
avg-cpu: %user %nice %system %iowait %steal %idle
81.58 0.00 5.28 0.11 0.06 12.98
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvda 1.02 0.40 13.74 371145 12685265
xvdb 332.85 0.02 40775.06 18229 37636091096
xvdc 327.42 0.01 40514.44 10899 37395540132
md0 676.47 0.01 81289.50 11287 75031631060
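For what it's worth, here is a minimal sketch of the CPU and disk micro-benchmark described above (the function names and sizes are illustrative, not a definitive test plan). Run the same script on the fast and the slow host while they are otherwise idle and compare the numbers:
import os
import time
def cpu_benchmark(n=5_000_000):
    # Time n simple multiply-add operations in a tight loop.
    start = time.perf_counter()
    total = 0
    for i in range(n):
        total += i * i
    return time.perf_counter() - start
def disk_benchmark(path="bench.tmp", mb=256):
    # Write mb megabytes to disk, fsync, and report the elapsed time.
    block = b"x" * (1024 * 1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    os.remove(path)
    return elapsed
if __name__ == "__main__":
    print(f"CPU:  {cpu_benchmark():.2f}s for 5M multiply-adds")
    print(f"Disk: {disk_benchmark():.2f}s to write and fsync 256 MB")
A clearly slower CPU figure on otherwise identical instance types, with no steal showing in iostat, would support the hardware argument; a clearly slower disk figure would point at the volumes instead.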

Limiting Java 8 Memory Consumption

I'm running three Java 8 JVMs on a 64-bit Ubuntu VM that was built from a minimal install with nothing extra running other than the three JVMs. The VM itself has 2 GB of memory, and each JVM was limited by -Xmx512M, which I assumed would be fine as there would be a couple of hundred MB to spare.
A few weeks ago, one crashed and the hs_err_pid dump showed:
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 196608 bytes for committing reserved memory.
# Possible reasons:
# The system is out of physical RAM or swap space
# In 32 bit mode, the process size limit was hit
# Possible solutions:
# Reduce memory load on the system
# Increase physical memory or swap space
# Check if swap backing store is full
# Use 64 bit Java on a 64 bit OS
# Decrease Java heap size (-Xmx/-Xms)
# Decrease number of Java threads
# Decrease Java thread stack sizes (-Xss)
# Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
I restarted the JVM with a reduced heap size of 384 MB and so far everything is fine. However, when I now look at the VM using the ps command and sort by descending RSS size, I see
RSS %MEM VSZ PID CMD
708768 35.4 2536124 29568 java -Xms64m -Xmx512m ...
542776 27.1 2340996 12934 java -Xms64m -Xmx384m ...
387336 19.3 2542336 6788 java -Xms64m -Xmx512m ...
12128 0.6 288120 1239 /usr/lib/snapd/snapd
4564 0.2 21476 27132 -bash
3524 0.1 5724 1235 /sbin/iscsid
3184 0.1 37928 1 /sbin/init
3032 0.1 27772 28829 ps ax -o rss,pmem,vsz,pid,cmd --sort -rss
3020 0.1 652988 1308 /usr/bin/lxcfs /var/lib/lxcfs/
2936 0.1 274596 1237 /usr/lib/accountsservice/accounts-daemon
..
..
and the free command shows
total used free shared buff/cache available
Mem: 1952 1657 80 20 213 41
Swap: 0 0 0
Taking the first process as an example, there is an RSS size of 708768 KB even though the heap limit is 524288 KB (512*1024).
I am aware that extra memory is used over the JVM heap, but the question is: how can I control this to ensure I do not run out of memory again? I am trying to set the heap size for each JVM as large as I can without crashing them.
Or is there a good general guideline as to how to set the JVM heap size in relation to overall memory availability?
There does not appear to be a way of controlling how much extra memory the JVM will use over the heap. However, by monitoring the application over a period of time, a good estimate of this amount can be obtained. If the overall consumption of the Java process is higher than desired, then the heap size can be reduced. Further monitoring is needed to see if this impacts performance.
Continuing with the example above and using the command ps ax -o rss,pmem,vsz,pid,cmd --sort -rss, we see usage as of today is
RSS %MEM VSZ PID CMD
704144 35.2 2536124 29568 java -Xms64m -Xmx512m ...
429504 21.4 2340996 12934 java -Xms64m -Xmx384m ...
367732 18.3 2542336 6788 java -Xms64m -Xmx512m ...
13872 0.6 288120 1239 /usr/lib/snapd/snapd
..
..
These Java processes are all running the same application but with different data sets. The first process (29568) has stayed stable, using about 190 MB beyond the heap limit, while the second (12934) has come down from 156 MB to 35 MB beyond it. The total memory usage of the third has stayed well under the heap size, which suggests the heap limit could be reduced.
It would seem that allowing 200 MB of extra non-heap memory per Java process here would be more than enough, as that gives 600 MB of leeway in total. Subtracting this from 2 GB leaves 1400 MB, so the three -Xmx values combined should be less than this amount.
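As a small illustrative helper, the same sizing rule in code (the 2 GB total and the 200 MB per-process allowance are the figures from this answer, not fixed constants):
def combined_heap_budget_mb(total_ram_mb, jvm_count, non_heap_allowance_mb=200):
    # Total RAM minus a fixed non-heap allowance per JVM gives the budget for the combined -Xmx values.
    return total_ram_mb - jvm_count * non_heap_allowance_mb
print(combined_heap_budget_mb(2048, 3))   # 1448, i.e. roughly the 1400 MB figure above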
As will be gleaned from reading the article pointed out in a comment by Fairoz, there are many different ways in which the JVM can use non-heap memory. One of these that is measurable, though, is the thread stack size. The default for a JVM can be found on Linux using java -XX:+PrintFlagsFinal -version | grep ThreadStackSize. In the case above it is 1 MB, and as there are about 25 threads, we can safely say that at least 25 MB extra will always be required.
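A rough way to check the thread-stack contribution on Linux is to count the entries under /proc/<pid>/task and multiply by the stack size; a small sketch (the PID is the first JVM from the ps listing above, used purely as an example):
import os
def thread_stack_estimate_mb(pid, stack_size_mb=1):
    # Each live thread has a directory under /proc/<pid>/task; the default stack size here is 1 MB.
    return len(os.listdir(f"/proc/{pid}/task")) * stack_size_mb
print(thread_stack_estimate_mb(29568))   # about 25 threads -> about 25 MB of stacks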

Ember CLI build killed

I build my Ember CLI app inside a Docker container on startup. The build fails without an error message; it just says "Killed":
root#fstaging:/frontend/source# node_modules/ember-cli/bin/ember build -prod
version: 1.13.15
Could not find watchman, falling back to NodeWatcher for file system events.
Visit http://www.ember-cli.com/user-guide/#watchman for more info.
Buildingember-auto-register-helpers is not required for Ember 2.0.0 and later please remove from your `package.json`.
Building.DEPRECATION: The `bind-attr` helper ('app/templates/components/file-selector.hbs' # L1:C7) is deprecated in favor of HTMLBars-style bound attributes.
at isBindAttrModifier (/app/source/bower_components/ember/ember-template-compiler.js:11751:34)
Killed
The same Docker image successfully starts up in another environment, but without hardware constraints. Does Ember CLI have hard-coded hardware constraints for the build process? The RAM is limited to 128m and swap to 2g.
That is likely not enough memory for Ember CLI to do what it needs. You are correct in that the process is being killed because of an OOM situation. If you log in to the host and take a look at the dmesg output, you will probably see something like:
V8 WorkerThread invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
V8 WorkerThread cpuset=867781e35d8a0a231ef60a272ae5d418796c45e92b5aa0233df317ce659b0032 mems_allowed=0
CPU: 0 PID: 2027 Comm: V8 WorkerThread Tainted: G O 4.1.13-boot2docker #1
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
0000000000000000 00000000000000d0 ffffffff8154e053 ffff880039381000
ffffffff8154d3f7 ffff8800395db528 ffff8800392b4528 ffff88003e214580
ffff8800392b4000 ffff88003e217080 ffffffff81087faf ffff88003e217080
Call Trace:
[<ffffffff8154e053>] ? dump_stack+0x40/0x50
[<ffffffff8154d3f7>] ? dump_header.isra.10+0x8c/0x1f4
[<ffffffff81087faf>] ? finish_task_switch+0x4c/0xda
[<ffffffff810f46b1>] ? oom_kill_process+0x99/0x31c
[<ffffffff811340e6>] ? task_in_mem_cgroup+0x5d/0x6a
[<ffffffff81132ac5>] ? mem_cgroup_iter+0x1c/0x1b2
[<ffffffff81134984>] ? mem_cgroup_oom_synchronize+0x441/0x45a
[<ffffffff8113402f>] ? mem_cgroup_is_descendant+0x1d/0x1d
[<ffffffff810f4d77>] ? pagefault_out_of_memory+0x17/0x91
[<ffffffff815565d8>] ? page_fault+0x28/0x30
Task in /docker/867781e35d8a0a231ef60a272ae5d418796c45e92b5aa0233df317ce659b0032 killed as a result of limit of /docker/867781e35d8a0a231ef60a272ae5d418796c45e92b5aa0233df317ce659b0032
memory: usage 131072kB, limit 131072kB, failcnt 2284203
memory+swap: usage 262032kB, limit 262144kB, failcnt 970540
kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /docker/867781e35d8a0a231ef60a272ae5d418796c45e92b5aa0233df317ce659b0032: cache:340KB rss:130732KB rss_huge:10240KB mapped_file:8KB writeback:0KB swap:130960KB inactive_anon:72912KB active_anon:57880KB inactive_file:112KB active_file:40KB unevictable:0KB
[ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 1993] 0 1993 380 1 6 3 17 0 sh
[ 2025] 0 2025 203490 32546 221 140 32713 0 npm
Memory cgroup out of memory: Kill process 2025 (npm) score 1001 or sacrifice child
Killed process 2025 (npm) total-vm:813960kB, anon-rss:130184kB, file-rss:0kB
It might be worthwhile to profile the container using something like https://github.com/google/cadvisor to find out what kind of memory maximums it may need.
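If you want to watch how close the build gets to the limit, here is a quick sketch that can be run inside the container alongside the build (assuming cgroup v1 paths, as on the boot2docker kernel shown above):
import time
BASE = "/sys/fs/cgroup/memory"
def read_bytes(name):
    # Read one of the cgroup v1 memory counters as an integer number of bytes.
    with open(f"{BASE}/{name}") as f:
        return int(f.read().strip())
limit = read_bytes("memory.limit_in_bytes")
while True:  # stop with Ctrl+C
    usage = read_bytes("memory.usage_in_bytes")
    peak = read_bytes("memory.max_usage_in_bytes")
    print(f"usage {usage >> 20} MB / limit {limit >> 20} MB (peak {peak >> 20} MB)")
    time.sleep(2)
The simplest fix, though, is to give the container more memory for the duration of the build.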

Machine Repair Simulation?

The Problem Statement:
A system is composed of 1, 2, or 3 machines and a repairman responsible for
maintaining these machines. Normally, the machines are running and producing
a product. At random points in time, the machines fail and are fixed by the
repairman. If a second or third machine fails while the repairman is busy
fixing the first machine, these machines will wait on the services of the
repairman in a first come, first served order. When repair on a machine is
complete, the machine will begin running again and producing a product.
The repairman will then repair the next machine waiting. When all machines
are running, the repairman becomes idle.
Simulate this system for a fixed period of time and calculate the fraction
of time the machines are busy (utilization) and the fraction of time
the repairman is busy (utilization).
Now, the input is 50 running times and 50 repair times; then each test case gives the period over which to calculate the utilization and the number of machines to simulate.
Sample Input:
7.0 4.5 13.0 10.5 3.0 12.0 ....
9.5 2.5 4.5 12.0 5.7 1.5 ....
20.0 1
20.0 3
0.0 0
Sample Output:
         No of      Utilization
Case     Machines   Machine   Repairman
1        1          .525      .475
2        3          .558      .775
Case 2 Explanation:
Machine Utilization = ((7+3)+(4.5+6)+(13))/(3*20) = .558
Repairman Utilization = (15.5)/20 = .775
My Approach:
1) Load the machines into a minimum heap (called runHeap) and give each of them a run time, so each machine that needs a run time takes the next one of the 50 run times in the input.
2) Calculate the minimum of: the smallest remaining run time in runHeap, the remaining repair time at the head of the repair queue, and the remaining time to finish the simulation. Call that value "toGo".
3) Subtract toGo from the remaining run time of every machine in runHeap and from the remaining repair time of the head of repairQueue.
4) Push every machine whose remaining run time == 0 into repairQueue; if the remaining repair time of the head of repairQueue == 0, push that machine into runHeap.
5) Add toGo to the current time.
6) If the current time < the simulation time, go to step 2; otherwise return the utilizations.
Now, the question is: is this a good approach, or can one figure out a better one?
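For reference, here is a minimal event-driven sketch of the approach above (the variable names are mine): machines draw run times in input order as they start running, repairs draw repair times in input order as they start, and failed machines queue first come, first served. It reproduces the sample output:
import heapq
from collections import deque
def simulate(run_times, repair_times, period, n_machines):
    run_iter = iter(run_times)
    rep_iter = iter(repair_times)
    running = []                     # min-heap of (failure_time, machine_id)
    for m in range(n_machines):
        heapq.heappush(running, (next(run_iter), m))
    waiting = deque()                # machines queued for repair (FCFS)
    repair_end = None                # (finish_time, machine_id) of the machine being repaired
    busy_machine = 0.0               # total machine running time
    busy_repair = 0.0                # total repairman busy time
    now = 0.0
    while now < period:
        next_fail = running[0][0] if running else float('inf')
        next_fix = repair_end[0] if repair_end else float('inf')
        t = min(next_fail, next_fix, period)
        busy_machine += len(running) * (t - now)
        if repair_end:
            busy_repair += t - now
        now = t
        if now >= period:
            break
        if repair_end and repair_end[0] == now:      # a repair finishes; the machine starts running again
            _, m = repair_end
            repair_end = None
            heapq.heappush(running, (now + next(run_iter), m))
        while running and running[0][0] == now:      # machines fail and join the repair queue
            _, m = heapq.heappop(running)
            waiting.append(m)
        if repair_end is None and waiting:           # the repairman picks up the next waiting machine
            m = waiting.popleft()
            repair_end = (now + next(rep_iter), m)
    return busy_machine / (n_machines * period), busy_repair / period
runs = [7.0, 4.5, 13.0, 10.5, 3.0, 12.0]
reps = [9.5, 2.5, 4.5, 12.0, 5.7, 1.5]
print(simulate(runs, reps, 20.0, 1))   # about (0.525, 0.475)
print(simulate(runs, reps, 20.0, 3))   # about (0.558, 0.775)
So yes, the approach works; the main thing to be careful about is the order in which events at the same instant are handled (finish the current repair, then enqueue new failures, then start the next repair).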

Why would an immediately destroyed shared pointer leak memory?

Is there a memory leak here?
class myclass : public boost::enable_shared_from_this<myclass>{
    //...
    void broadcast(const char *buf){
        // heap-allocates a copy of buf; ownership is handed to the shared_ptr in the overload below
        broadcast(new std::string(buf));
    }
    void broadcast(std::string *buf){
        // msg takes ownership of buf and deletes the string when msg goes out of scope here
        boost::shared_ptr<std::string> msg(buf);
    }
    //...
};
(This is the stripped-down version that still shows the problem - normally I do real work in that second broadcast call!)
My assumption was that the first call allocates some memory, and then, because I do nothing with the smart pointer, the second call would immediately delete it. Simple? But when I run my program, the memory increases over time, in jumps. Yet when I comment out the only call in the program to broadcast(), it does not!
The ps output for the version without broadcast():
%CPU %MEM VSZ RSS TIME
3.2 0.0 158068 1988 0:00
3.3 0.0 158068 1988 0:25 (12 mins later)
With the call to broadcast() (on Ubuntu 10.04, g++ 4.4, boost 1.40)
%CPU %MEM VSZ RSS TIME
1.0 0.0 158068 1980 0:00
3.3 0.0 158068 1988 0:04 (2 mins)
3.4 0.0 223604 1996 0:06 (3.5 mins)
3.3 0.0 223604 2000 0:09
3.1 0.0 223604 2000 2:21 (82 mins)
3.1 0.0 223604 2000 3:50 (120 mins)
(Seeing that jump at around 3 minutes is reproducible in the few times I've tried so far.)
With the call to broadcast() (on Centos 5.6, g++ 4.1, boost 1.41)
%CPU %MEM VSZ RSS TIME
0.0 0.0 51224 1744 0:00
0.0 0.0 51224 1744 0:00 (30s)
1.1 0.0 182296 1776 0:02 (3.5 mins)
0.7 0.0 182296 1776 0:03
0.7 0.0 182296 1776 0:09 (20 mins)
0.7 0.0 247832 1788 0:14 (34 mins)
0.7 0.0 247832 1788 0:17
0.7 0.0 247832 1788 0:24 (55 mins)
0.7 0.0 247832 1788 0:36 (71 mins)
Here is how broadcast() is being called (from a boost::asio timer) and now I'm wondering if it could matter:
void callback(){
    //...
    timer.expires_from_now(boost::posix_time::milliseconds(20));
    //...
    char buf[512];
    sprintf(buf,"...");
    broadcast(buf);
    timer.async_wait(boost::bind(&myclass::callback, shared_from_this() ));
    //...
}
(callback is in the same class as the broadcast function)
I have 4 of these timers going, and my io_service.run() is being called by a pool of 3 threads. My 20 ms timeout means each timer calls broadcast() 50 times per second. I set the expiry at the start of my function and run the timer near the end. The elided code is not doing that much; outputting debug info to std::cout is perhaps the most CPU-intensive job. I suppose the timer may sometimes trigger immediately; but, still, I cannot see how that would be a problem, let alone cause a memory leak.
(The program runs fine, by the way, even when doing its full tasks; I only got suspicious when I noticed the memory usage reported by ps had jumped up.)
UPDATE: Thanks for the answers and comments. I can add that I left the program running on each system for another couple of hours and memory usage did not increase any further. (I was also ready to dismiss this as a one-off heap restructuring or something, when the Centos version jumped for a second time.) Anyway, it is good to know that my understanding of smart pointers is still sound, and that there is no weird corner case with threading that I need to be concerned about.
If there were a leak, you would be allocating a std::string (20 bytes, more or less) 50 times per second.
In 1 hour you would have allocated about 3600*50*20 = 3.4 MB.
That has nothing to do with the 64K jump you see; it's probably due to the way the system allocates memory to the process, which new then sub-allocates to your variables.
When something is deleted, the system needs to "garbage collect" it, placing it back into the chain of available memory for further allocations.
But since this takes time, most systems don't do this until the released memory exceeds a certain amount, so that a "repack" can be done.
Well, what happens here is probably not that your program is leaking, but that for some reason the system memory allocator decided to keep another 64 kB page around for your application. If there were a constant memory leak at that point, at a 50 Hz rate, it would have a much more dramatic effect!
Exactly why that happens after 3 minutes I don't know (I am not an expert in that area), but I would guess that there are some heuristics and statistics involved. Or it could simply be that the heap has become fragmented.
Another thing that may have happened is that the messages you are holding in the buffer have become longer over time :)