Google Cloud Filestore in status REPAIRING blocks *everything*

We are using Google's Filestore cloud service to share files between our GCE VMs. Seemingly at random, all processes hang, especially interactive SSH sessions. After some investigation we determined that our Filestore instance, mounted on all VMs, was being repaired and was blocking every process that tried to get any information on it.
I was able to log in as root and investigate. I noticed that all my interactive activity would hang, and eventually I pinpointed it to any attempt to stat the mountpoint of the Filestore instance. Running df under strace would hang like this:
statfs("/sys/kernel/config", {f_type=0x62656570, f_bsize=4096, f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_RELATIME}) = 0
stat("/sys/kernel/config", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
statfs("/sys/fs/selinux", {f_type=SELINUX_MAGIC, f_bsize=4096, f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_RELATIME}) = 0
stat("/sys/fs/selinux", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
statfs("/proc/sys/fs/binfmt_misc", {f_type=BINFMTFS_MAGIC, f_bsize=4096, f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_RELATIME}) = 0
stat("/proc/sys/fs/binfmt_misc", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
statfs("/dev/hugepages", {f_type=HUGETLBFS_MAGIC, f_bsize=2097152, f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0}, f_namelen=255, f_frsize=2097152, f_flags=ST_VALID|ST_RELATIME}) = 0
stat("/dev/hugepages", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
statfs("/mnt/local-storage", {f_type=0x58465342, f_bsize=4096, f_blocks=131007745, f_bfree=86129973, f_bavail=86129973, f_files=262143488, f_ffree=262141571, f_fsid={2065, 0}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_RELATIME}) = 0
stat("/mnt/local-extra", {st_mode=S_IFDIR|0755, st_size=75, ...}) = 0
statfs("/mnt/shared-storage" ***HANG***
There was apparently no remedy except waiting for the repair operation to complete. gcloud filestore operations list showed no ongoing operations during that time, but gcloud filestore instances list did show the REPAIRING state:
[root@vm ~]# gcloud filestore instances list
INSTANCE_NAME ZONE TIER CAPACITY_GB FILE_SHARE_NAME IP_ADDRESS STATE CREATE_TIME
shared-storage europe-west1-b STANDARD 1024 shared_storage **.**.**.** REPAIRING 2019-08-09T16:03:02
The Google Cloud Status Dashboard never showed any issue at or around that time.
Does anybody know why this happens and, if possible, how to prevent it from happening? As shown in the output above, we are using the standard tier of Filestore.

We've configured coredumps to be written to the share from two dozen VMs. When a mass death of our processes occurs, it seems that we reach the throughput limit of the share (standard tier), which causes the share to enter the REPAIRING state, in turn blocking everything that tries to access it.
If you have a similar problem: check whether it's possible that you're somehow reaching the throughput limit on your share.
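If you do suspect the same cause, a sketch of two possible mitigations (the paths, names, and flags below are placeholders, not our exact setup): keep coredumps off the NFS share so a mass death of processes cannot saturate it, and poll the instance state so a REPAIRING transition is noticed early.
# Hypothetical example: write coredumps to local disk instead of the NFS share
sudo mkdir -p /var/crash/cores
sudo sysctl -w kernel.core_pattern='/var/crash/cores/core.%e.%p'
# Hypothetical example: poll the Filestore state periodically
gcloud filestore instances list --format='table(name,state)'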

Related

Too many brk() noticed in strace

I have a C++ service where I am trying to debug the cause of high service startup time. From strace logs I notice a lot of brk() calls (which contribute about 300ms, which is very high for our SLA). On studying further, I see that brk() is a memory management call which controls the amount of memory allocated to the data segment of the process. malloc() can use brk() (for small allocations) or mmap() (for big allocations) depending on the situation.
Logs from strace:
891905 1674977609.119549 mmap2(NULL, 163840, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf6a01000 <0.000051>
891905 1674977609.119963 brk(0x209be000) = 0x209be000 <0.000025>
891905 1674977609.120495 brk(0x209df000) = 0x209df000 <0.000022>
891905 1674977609.121167 brk(0x20a00000) = 0x20a00000 <0.000041>
891905 1674977609.121776 brk(0x20a21000) = 0x20a21000 <0.000024>
891905 1674977609.122327 brk(0x20a42000) = 0x20a42000 <0.000024>
...
..
.
891905 1674977609.427039 brk(0x2432b000) = 0x2432b000 <0.000064>
891905 1674977609.427827 brk(0x2434c000) = 0x2434c000 <0.000069>
891905 1674977609.428695 brk(0x2436d000) = 0x2436d000 <0.000050>
I believe the number of brk() calls is excessive, and that there is scope for improving efficiency. I am trying to play around with tunables such as M_TRIM_THRESHOLD and M_MMAP_THRESHOLD to reduce the number of brk() calls, but I don't notice any change.
https://www.linuxjournal.com/files/linuxjournal.com/linuxjournal/articles/063/6390/6390s2.html
For instance, this is one of the ways I tried to make the changes before restarting the service. The service is running as a Docker container, so I am trying to make the changes on the Docker container and on the host, but I don't notice any change.
export MALLOC_MMAP_THRESHOLD_=131072
export MALLOC_TRIM_THRESHOLD_=131072
export MALLOC_MMAP_MAX_=65536
Am I trying to hit the wrong target, or is my understanding that a lot of fragmentation is happening wrong? Any help is appreciated.
Thanks
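For reference, a sketch of one way the tunables could be passed so that the containerised process actually sees them; exporting them only on the host will not reach a process started inside a container. The image name below is a placeholder, not the actual service.
# Hypothetical example: pass the malloc tunables into the container's environment
docker run -d \
  -e MALLOC_MMAP_THRESHOLD_=131072 \
  -e MALLOC_TRIM_THRESHOLD_=131072 \
  -e MALLOC_MMAP_MAX_=65536 \
  my-cpp-service:latest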

How do I increase ebs partitions volume size on aws ec2?

While I was deploying an application on a partition (on an AWS EC2 instance) I got a fatal error telling me I'm out of space on that block device: No space left on device.
How do I increase ebs partitions volume size on aws ec2?
You can use growpart or lvextend, depending on what partition type you have.
https://www.tldp.org/HOWTO/html_single/LVM-HOWTO/#extendlv
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html
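A minimal sketch for the common case of a plain (non-LVM) ext4 partition, assuming the disk is /dev/xvda and the partition to grow is number 1; device names vary per instance:
# Hypothetical example: grow partition 1 of /dev/xvda, then grow the ext4 filesystem on it
sudo growpart /dev/xvda 1
sudo resize2fs /dev/xvda1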
Solution:
1. Go to your Amazon machine and run lsblk ("list block devices"), a useful Linux command which lists information about all or the specified block devices (disks and partitions). It queries the /sys virtual file system to obtain the information that it displays. By default, the command displays details about all block devices except RAM disks, in a tree-like format:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 93.9M 1 loop /snap/core/123
loop2 7:2 0 18M 1 loop /snap/amazon-ssm-agent/456
loop3 7:3 0 18M 1 loop /snap/amazon-ssm-agent/789
loop4 7:4 0 93.8M 1 loop /snap/core/777
xvda 202:0 0 20G 0 disk
└─xvda1 202:1 0 20G 0 part /
xvdb 202:16 0 20G 0 disk /home/boot
xvdc 202:32 0 20G 0 disk
├─xvdc1 202:33 0 10G 0 part
| └─cryptswap1 253:0 0 10G 0 crypt
└─xvdc2 202:34 0 10G 0 part
└─crypttmp 253:1 0 10G 0 crypt /tmp
xvdd 202:48 0 50G 0 disk
└─enc_xvdd 253:2 0 50G 0 crypt /home
xvde 202:64 0 8.9T 0 disk
└─enc_xvde 253:3 0 8.9T 0 crypt /var/data
2. Locate the volume name of the disk that holds the specific partition.
3. Go to your Amazon AWS account -> EC2 -> Instances -> Description panel -> Block devices -> click the relevant block -> click the volume id -> right-click the block id -> Modify Volume -> select the correct size. (This can also be done from the AWS CLI; see the sketch after these steps.)
4. SSH to your machine and run the following commands (on the correct volume; in my example I chose to increase /home, which is under enc_xvdd):
sudo cryptsetup resize enc_xvdd -v
sudo resize2fs /dev/mapper/enc_xvdd
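Alternatively, step 3 can be done from the AWS CLI instead of the console; a sketch with a placeholder volume id and target size:
# Hypothetical example: resize the EBS volume to 100 GiB via the AWS CLI
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 100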

How to execute a command on multiple servers

I have a set of servers (150) used for logging, and a command (to get disk space). How can I execute this command on each server?
If the script takes 1 minute to get the command's report for a single server, how can I send a report for all the servers every 10 minutes?
use strict;
use warnings;
use Net::SSH::Perl;
use Filesys::DiskSpace;

# I have almost more than 100 servers...
my %hosts = (
    'localhost' => {
        user     => "z",
        password => "qumquat",
    },
    '129.221.63.205' => {
        user     => "z",
        password => "aardvark",
    },
);

# file system /home or /dev/sda5
my $dir = "/home";
my $cmd = "df $dir";

foreach my $host (keys %hosts) {
    # protocol expects a string such as '2,1'
    my $ssh = Net::SSH::Perl->new($host, port => 22, debug => 1, protocol => '2,1');
    # the credentials hash is %hosts, not %hostdata
    $ssh->login($hosts{$host}{user}, $hosts{$host}{password});
    my ($out) = $ssh->cmd($cmd);
    print "$out\n";
}
It has to print the disk space output for each server.
Is there a reason this needs to be done in Perl? There is an existing tool, dsh, which provides precisely this functionality: using ssh to run a shell command on multiple hosts and report the output from each. It also has the ability, with the -c (concurrent) switch, to run the command on all hosts at the same time rather than waiting for each one to complete before going on to the next, which you would need if you want to monitor 150 machines every 10 minutes when checking a single host takes 1 minute.
To use dsh, first create a file in ~/.dsh/group/ containing a list of your servers. I'll put mine in ~/.dsh/group/test-group with the content:
galera-1
galera-2
galera-3
Then I can run the command
dsh -g test-group -c 'df -h /'
And get back the result:
galera-3: Filesystem Size Used Avail Use% Mounted on
galera-3: /dev/mapper/debian-system 140G 36G 99G 27% /
galera-1: Filesystem Size Used Avail Use% Mounted on
galera-1: /dev/mapper/debian-system 140G 29G 106G 22% /
galera-2: Filesystem Size Used Avail Use% Mounted on
galera-2: /dev/mapper/debian-system 140G 26G 109G 20% /
(They're out of order because I used -c, so the command was sent to all three servers at once and the results were printed in the order the responses were received. Without -c, they would appear in the same order the servers are listed in the group file, but then it would wait for each response before connecting to the next server.)
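To repeat the check every 10 minutes, a crontab entry along these lines could work (a sketch, assuming dsh and the group file above are in place; the log path is a placeholder):
# Hypothetical example: run the dsh disk check every 10 minutes and append the output to a log
*/10 * * * * dsh -g test-group -c 'df -h /' >> /var/log/disk-report.log 2>&1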
But, really, with the talk of repeating this check every 10 minutes, it sounds like what you really want is a proper monitoring system such as Icinga (a high-performance fork of the better-known Nagios), rather than just a way to run commands remotely on multiple machines (which is what dsh provides). Unfortunately, configuring an Icinga monitoring system is too involved for me to provide an example here, but I can tell you that monitoring disk space is one of the checks that are included and enabled by default when using it.
There is a ready-made tool called Ansible for exactly this purpose. There you can define your list of servers, group them, and execute commands on all of them.
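A minimal sketch of that approach using an Ansible ad-hoc command; the inventory file name and host names are placeholders:
# Hypothetical example: an inventory file listing the servers under a [logservers] group
cat > hosts.ini <<'EOF'
[logservers]
server01.example.com
server02.example.com
EOF
# Run the disk-space command on every host in the group
ansible logservers -i hosts.ini -m shell -a 'df -h /home'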

'Premature end of Content-Length' with Spark Application using s3a

I'm writing a Spark-based application which works on a pretty huge dataset stored on S3. It's about 15 TB in size uncompressed. The data is laid out across multiple small LZO-compressed files, varying from 10-100MB.
By default the job spawns 130k tasks while reading the dataset and mapping it to the schema.
It then fails after around 70k tasks have completed and ~20 tasks have failed.
Exception:
WARN lzo.LzopInputStream: IOException in getCompressedData; likely LZO corruption.
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body
Looks like the s3 connection is getting closed prematurely.
I have tried nearly 40 different combinations of configuration.
To summarize them: 1 to 3 executors per node; 18GB to 42GB --executor-memory; 3 to 5 --executor-cores; 1.8GB to 4.0GB spark.yarn.executor.memoryOverhead; both the Kryo and the default Java serializer; 0.5 to 0.35 spark.memory.storageFraction; default, 130000, and 200000 partitions for the bigger dataset; default, 200, and 2001 spark.sql.shuffle.partitions.
And most importantly: 100 to 2048 for the fs.s3a.connection.maximum property.
[This seems to be the property most relevant to the exception.]
[In all cases, the driver was set to memory = 51GB, cores = 12, and MEMORY_AND_DISK_SER level for caching.]
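For reference, a sketch of how such a combination can be passed on YARN; the jar name is a placeholder and the values are just one point from the ranges above, not a recommendation:
# Hypothetical example: one combination of the tried settings, passed via spark-submit
spark-submit \
  --master yarn \
  --executor-memory 36G \
  --executor-cores 4 \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --conf spark.hadoop.fs.s3a.connection.maximum=1200 \
  my-job.jar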
Nothing worked!
If I run the program with half of the bigger dataset size (7.5TB), it finishes successfully in 1.5 hr.
What could I be doing wrong?
How do I determine the optimal value for fs.s3a.connection.maximum?
Is it possible that the s3 clients are getting GCed?
Any help will be appreciated!
Environment:
AWS EMR 5.7.0, 60 x i2.2xlarge SPOT Instances (16 vCPU, 61GB RAM, 2 x 800GB SSD), Spark 2.1.0
YARN is used as resource manager.
Code:
It's a fairly simple job, doing something like this:
val sl = StorageLevel.MEMORY_AND_DISK_SER
sparkSession.sparkContext.hadoopConfiguration.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sparkSession.sparkContext.hadoopConfiguration.setInt("fs.s3a.connection.maximum", 1200)
val dataset_1: DataFrame = sparkSession
  .read
  .format("csv")
  .option("delimiter", ",")
  .schema(<schema: StructType>)
  .csv("s3a://...")
  .select("ID") // 15 TB
dataset_1.persist(sl)
print(dataset_1.count())
val tmp = dataset_1.groupBy("ID").agg(count("*").alias("count_id"))
val tmp2 = tmp.groupBy("count_id").agg(count("*").alias("count_count_id"))
tmp2.write.csv(...)
dataset_1.unpersist()
Full Stacktrace:
17/08/21 20:02:36 INFO compress.CodecPool: Got brand-new decompressor [.lzo]
17/08/21 20:06:18 WARN lzo.LzopInputStream: IOException in getCompressedData; likely LZO corruption.
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 79627927; received: 19388396
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:180)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.services.s3.model.S3ObjectInputStream.read(S3ObjectInputStream.java:155)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:108)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.services.s3.model.S3ObjectInputStream.read(S3ObjectInputStream.java:155)
at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:160)
at java.io.DataInputStream.read(DataInputStream.java:149)
at com.hadoop.compression.lzo.LzopInputStream.readFully(LzopInputStream.java:73)
at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:321)
at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:261)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:186)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:99)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:91)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:364)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1021)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:996)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:936)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:996)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:700)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
EDIT: We have another service which consumes exactly the same logs and it works just fine. But it uses the old "s3://" scheme and is based on Spark 1.6. I'll try using "s3://" instead of "s3a://".

Python script terminated by SIGKILL rather than throwing MemoryError

Update Again
I have tried to create some simple way to reproduce this, but have not been successful.
So far, I have tried various simple array allocations and manipulations, but they all throw a MemoryError rather than just crashing with SIGKILL.
For example:
x = np.asarray(range(999999999))
or:
x = np.empty([100,100,100,100,7])
just throw MemoryErrors as they should.
I hope to have a simple way to recreate this at some point.
End Update
I have a python script running numpy/scipy and some custom C extensions.
On my Ubuntu 14.04 system under VirtualBox, it runs to completion just fine.
On an Amazon EC2 T2 micro instance, it terminates (after running a while) with the output:
Killed
Running under the python debugger, the signal is not caught and the debugger exits as well.
Running under strace, I get:
munmap(0x7fa5b7fa6000, 67112960) = 0
mmap(NULL, 67112960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa5b7fa6000
mmap(NULL, 67112960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa5affa4000
mmap(NULL, 67112960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa5abfa3000
mmap(NULL, 67637248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa5a7f22000
mmap(NULL, 67637248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa5a3ea1000
mmap(NULL, 67637248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa59fe20000
gettimeofday({1406518336, 306209}, NULL) = 0
gettimeofday({1406518336, 580022}, NULL) = 0
+++ killed by SIGKILL +++
Running under gdb while trying to catch "SIGKILL", I get:
[Thread 0x7fffe7148700 (LWP 28022) exited]
Program terminated with signal SIGKILL, Killed.
The program no longer exists.
(gdb) where
No stack.
Running Python's trace module (python -m trace --trace ), I get:
defmatrix.py(292): if (isinstance(obj, matrix) and obj._getitem): return
defmatrix.py(293): ndim = self.ndim
defmatrix.py(294): if (ndim == 2):
defmatrix.py(295): return
defmatrix.py(336): return out
--- modulename: linalg, funcname: norm
linalg.py(2052): x = asarray(x)
--- modulename: numeric, funcname: asarray
numeric.py(460): return array(a, dtype, copy=False, order=order)
I can't think of anything else at the moment to figure out what is going on.
I suspect it might be running out of memory (it is an AWS micro instance), but I can't figure out how to confirm or deny that.
Is there another tool I could use that might help pinpoint exactly where the program is stopping? (Or am I running one of the above tools the wrong way for this problem?)
Update
The Amazon EC2 T2 micro instance has no swap space defined by default, so I added a 4GB swap file and was able to run the program to completion.
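For anyone wanting to do the same, a sketch of adding a swap file (the path is arbitrary; fallocate assumes a filesystem that supports it, otherwise dd can create the file):
# Hypothetical example: create and enable a 4GB swap file
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile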
However, I am still very interested in a way to run the program such that it terminates with a message a little closer to "Not enough memory" rather than just "Killed".
If anyone has any suggestions, they would be appreciated.
It sounds like you've run into the dreaded Linux OOM killer. When the system completely runs out of memory and the kernel absolutely needs to allocate memory, it kills a process rather than crashing the entire system.
Look in the syslog for confirmation of this. A line similar to:
kernel: [884145.344240] mysqld invoked oom-killer:
followed sometime later with:
kernel: [884145.344399] Out of memory: Kill process 3318
should be present (in this example, it mentions mysqld specifically).
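A quick way to check (a sketch; the syslog path varies by distribution):
# Hypothetical example: look for OOM-killer activity in the kernel log and syslog
dmesg | grep -i -E 'oom|killed process'
grep -i 'out of memory' /var/log/syslog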
You can add these lines to your /etc/sysctl.conf file to effectively disable the OOM killer:
vm.overcommit_memory = 2
vm.overcommit_ratio = 100
And then reboot. Now the original, memory-hungry process should fail to allocate memory and, hopefully, throw the proper exception.
Setting overcommit_memory to 2 means that Linux won't overcommit memory, so memory allocations will fail if there isn't enough memory for them. See this answer for details on what effect the overcommit_ratio has: https://serverfault.com/a/510857
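The same settings can also be applied immediately without a reboot (a sketch; they still need to be in /etc/sysctl.conf to persist across reboots):
# Hypothetical example: apply the overcommit settings at runtime
sudo sysctl -w vm.overcommit_memory=2
sudo sysctl -w vm.overcommit_ratio=100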