Snapper in openSUSE Leap 42.3 creating lots of snapshots

I'm using openSUSE as my desktop; before that I used Ubuntu. My root (/) file system is btrfs, and /home is xfs.
Whenever I run YaST it creates a pre and a post snapshot, even if nothing changes.
For example, opening Hardware Information in YaST makes no file system changes, apart from modifying the hard disk attributes file under /var/lib/smartmontools/.
My question: how do I tell snapper to create a snapshot only if there are real changes in certain folders (an exclusion list or an inclusion list)? There are simply too many snapshots. What is the optimal snapshot configuration?
I made some changes to the config /etc/snapper/configs/root. Please correct me if anything is wrong.
/etc/snapper/configs/root
# subvolume to snapshot
SUBVOLUME="/"
# filesystem type
FSTYPE="btrfs"
# btrfs qgroup for space aware cleanup algorithms
QGROUP="1/0"
# fraction of the filesystems space the snapshots may use
SPACE_LIMIT="0.5"
# users and groups allowed to work with config
ALLOW_USERS=""
ALLOW_GROUPS=""
# sync users and groups from ALLOW_USERS and ALLOW_GROUPS to .snapshots
# directory
SYNC_ACL="no"
# start comparing pre- and post-snapshot in background after creating
# post-snapshot
BACKGROUND_COMPARISON="yes"
# run daily number cleanup
NUMBER_CLEANUP="yes"
# limit for number cleanup
NUMBER_MIN_AGE="1800"
NUMBER_LIMIT="2-10"
NUMBER_LIMIT_IMPORTANT="4-10"
# create hourly snapshots
TIMELINE_CREATE="no"
# cleanup hourly snapshots after some time
TIMELINE_CLEANUP="yes"
# limits for timeline cleanup
TIMELINE_MIN_AGE="1800"
TIMELINE_LIMIT_HOURLY="0"
TIMELINE_LIMIT_DAILY="2"
TIMELINE_LIMIT_WEEKLY="0"
TIMELINE_LIMIT_MONTHLY="0"
TIMELINE_LIMIT_YEARLY="0"
# cleanup empty pre-post-pairs
EMPTY_PRE_POST_CLEANUP="yes"
# limits for empty pre-post-pair cleanup
EMPTY_PRE_POST_MIN_AGE="1800"
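As I understand snapper's behavior, patterns listed in files under /etc/snapper/filters/ are ignored when pre- and post-snapshots are compared, so a pair that differs only in filtered paths counts as empty and gets removed by the EMPTY_PRE_POST_CLEANUP setting above. A hedged sketch (the filter file name is arbitrary):
# Ignore the smartmontools attribute files in pre/post comparisons
cat > /etc/snapper/filters/smartmontools.txt <<'EOF'
/var/lib/smartmontools/*
EOF
# Run the empty pre-post cleanup by hand to check that the pairs disappear
snapper cleanup empty-pre-post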

Related

GCP Composer - Airflow webserver shutdown constantly

I'm using GCP Composer with the newest image version, composer-1.16.1-airflow-1.10.15.
My webservers are dying from time to time because of some missing cache files:
{cli.py:1050} ERROR - [Errno 2] No such file or directory
Does anybody know how to solve it?
Additional info:
Workers:
Node count: 3
Disk size (GB): 20
Machine type: n1-standard-1
Web server configuration:
Machine type: composer-n1-webserver-8 (8 vCPU, 7.6 GB memory)
Configuration overrides:
UPDATE 27.04.2021
I've managed to find the place responsible for killing the webserver:
https://github.com/apache/airflow/blob/4aec433e48dcc66c9c7b74947c499260ab6be9e9/airflow/bin/cli.py#L1032-L1138
GCP Composer uses the Celery Executor underneath, so during the check it tries to read some cache files that have already been removed by workers?
I've found it! And I'll report the bug to the GCP Composer team.
So if the config webserver.reload_on_plugin_change=True, then the CLI goes into this section:
https://github.com/apache/airflow/blob/4aec433e48dcc66c9c7b74947c499260ab6be9e9/airflow/bin/cli.py#L1118-L1138
# if we should check the directory with the plugin,
if self.reload_on_plugin_change:
    # compare the previous and current contents of the directory
    new_state = self._generate_plugin_state()
    # If changed, wait until its content is fully saved.
    if new_state != self._last_plugin_state:
        self.log.debug(
            '[%d / %d] Plugins folder changed. The gunicorn will be restarted the next time the '
            'plugin directory is checked, if there is no change in it.',
            num_ready_workers_running, num_workers_running
        )
        self._restart_on_next_plugin_check = True
        self._last_plugin_state = new_state
    elif self._restart_on_next_plugin_check:
        self.log.debug(
            '[%d / %d] Starts reloading the gunicorn configuration.',
            num_ready_workers_running, num_workers_running
        )
        self._restart_on_next_plugin_check = False
        self._last_refresh_time = time.time()
        self._reload_gunicorn()

def _generate_plugin_state(self):
    """
    Generate dict of filenames and last modification time of all files in settings.PLUGINS_FOLDER
    directory.
    """
    if not settings.PLUGINS_FOLDER:
        return {}
    all_filenames = []
    for (root, _, filenames) in os.walk(settings.PLUGINS_FOLDER):
        all_filenames.extend(os.path.join(root, f) for f in filenames)
    plugin_state = {f: self._get_file_hash(f) for f in sorted(all_filenames)}
    return plugin_state
It generates the list of files to check by calling os.walk(settings.PLUGINS_FOLDER).
At the same time, gcsfuse decides to delete some of these files,
and an error happens: the file is not found.
So disabling webserver.reload_on_plugin_change makes it work, but this option is really convenient, so I'll create a bug ticket for Google.
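For reference, a hedged sketch of applying that override to an existing Composer environment with gcloud (the environment name and location are placeholders):
# Disable plugin-change reloading in the Airflow webserver
gcloud composer environments update my-composer-env \
    --location us-central1 \
    --update-airflow-configs=webserver-reload_on_plugin_change=False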

How to execute a command on multiple servers

I have a set of servers (150) for logging and a command (to get disk space). How can I execute this command on each server?
If the script takes 1 minute to get the command's report from a single server, how can I send a report for all the servers every 10 minutes?
use strict;
use warnings;
use Net::SSH::Perl;
use Filesys::DiskSpace;

# I have almost more than 100 servers...
my %hosts = (
    'localhost' => {
        user     => "z",
        password => "qumquat",
    },
    '129.221.63.205' => {
        user     => "z",
        password => "aardvark",
    },
);

# file system /home or /dev/sda5
my $dir = "/home";
my $cmd = "df $dir";

foreach my $host (keys %hosts) {
    # protocol must be passed as a single string, not a bare list
    my $ssh = Net::SSH::Perl->new($host, port => 22, debug => 1, protocol => '2,1');
    $ssh->login($hosts{$host}{user}, $hosts{$host}{password});
    my ($out) = $ssh->cmd($cmd);
    print "$out\n";
}
It has to print the disk space output for each server.
Is there a reason this needs to be done in Perl? There is an existing tool, dsh, which provides precisely this functionality of using ssh to run a shell command on multiple hosts and report the output from each. It also has the ability, with the -c (concurrent) switch, to run the command at the same time on all hosts rather than waiting for each one to complete before going on to the next, which you would need if you want to monitor 150 machines every 10 minutes but it takes 1 minute to check each host.
To use dsh, first create a file in ~/.dsh/group/ containing a list of your servers. I'll put mine in ~/.dsh/group/test-group with the content:
galera-1
galera-2
galera-3
Then I can run the command
dsh -g test-group -c 'df -h /'
And get back the result:
galera-3: Filesystem Size Used Avail Use% Mounted on
galera-3: /dev/mapper/debian-system 140G 36G 99G 27% /
galera-1: Filesystem Size Used Avail Use% Mounted on
galera-1: /dev/mapper/debian-system 140G 29G 106G 22% /
galera-2: Filesystem Size Used Avail Use% Mounted on
galera-2: /dev/mapper/debian-system 140G 26G 109G 20% /
(They're out of order because I used -c, so the command was sent to all three servers at once and the results were printed in the order the responses were received. Without -c, they would appear in the same order the servers are listed in the group file, but then it would wait for each response before connecting to the next server.)
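To get the ten-minute cadence from the question, the dsh invocation can simply be scheduled; a minimal cron sketch using the group defined above (the log path is arbitrary):
# m h dom mon dow  command
*/10 * * * * dsh -g test-group -c 'df -h /' >> /var/log/disk-report.log 2>&1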
But, really, with the talk of repeating this check every 10 minutes, it sounds like what you really want is a proper monitoring system such as Icinga (a high-performance fork of the better-known Nagios), rather than just a way to run commands remotely on multiple machines (which is what dsh provides). Unfortunately, configuring an Icinga monitoring system is too involved for me to provide an example here, but I can tell you that monitoring disk space is one of the checks that are included and enabled by default when using it.
There is a ready-made tool called Ansible for exactly this purpose. There you can define your list of servers, group them, and execute commands on all of them.
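As a hedged sketch, an ad-hoc Ansible run against a hypothetical inventory file hosts.ini listing the 150 servers, querying them in parallel batches of 20:
# -f sets the number of parallel forks; -m command runs a simple remote command
ansible all -i hosts.ini -f 20 -m command -a "df -h /home"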

No space left on device in SageMaker model training

I'm using a custom algorithm shipped in a Docker image, running on a p2 instance with AWS SageMaker (a bit similar to https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb).
At the end of the training process, I try to write my model to the output directory that is mounted via SageMaker (as in the tutorial), like this:
model_path = "/opt/ml/model"
model.save(os.path.join(model_path, 'model.h5'))
Unluckily, the model apparently grows too big over time and I get the following error:
RuntimeError: Problems closing file (file write failed: time = Thu Jul
26 00:24:48 2018
00:24:49 , filename = 'model.h5', file descriptor = 22, errno = 28,
error message = 'No space left on device', buf = 0x1a41d7d0, total
write[...]
So all my hours of GPU time are wasted. How can I prevent this from happening again? Does anyone know the size limit for models stored on SageMaker/mounted directories?
When you train a model with Estimators, it defaults to 30 GB of storage, which may not be enough. You can use the train_volume_size param on the constructor to increase this value. Try with a large-ish number (like 100GB) and see how big your model is. In subsequent jobs, you can tune down the value to something closer to what you actually need.
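A minimal sketch with the v1 SageMaker Python SDK; the image URI, role ARN, and S3 paths are placeholders:
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_name="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",  # hypothetical image
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # hypothetical role
    train_instance_count=1,
    train_instance_type="ml.p2.xlarge",
    train_volume_size=100,  # GB of EBS storage attached to the training instance
)
estimator.fit("s3://my-bucket/training-data")  # hypothetical input location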
Storage costs $0.14 per GB-month of provisioned storage. Partial usage is prorated; for example, an extra 100 GB attached for a 3-hour job costs roughly 100 x $0.14 x 3/720 ≈ $0.06, so giving yourself some extra room is a cheap insurance policy against running out of storage.
In the SageMaker Jupyter notebook, you can check free space on the filesystem(s) by running !df -h. For a specific path, try something like !df -h /opt.

Aerospike error: All batch queues are full

I am running an Aerospike cluster in Google Cloud. Following the recommendation in this post, I updated to the latest version (3.11.1.1) and re-created all servers. In fact, this change caused my 5 servers to operate at a much lower CPU load (it was around 75% before; now it is at 20%, as shown in the graph below).
Because of this low load, I decided to reduce the cluster size to 4 servers. When I did this, my application started to receive the following error:
All batch queues are full
I found this discussion about the topic, recommending a change to the parameters batch-index-threads and batch-max-unused-buffers with the command
asadm -e "asinfo -v 'set-config:context=service;batch-index-threads=NEW_VALUE'"
I tried many combinations of values for batch-index-threads (2, 4, 8, 16), and changing batch-max-unused-buffers as well; nothing solves my problem. I keep receiving the All batch queues are full error.
Here is the relevant part of my aerospike.conf:
service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    paxos-recovery-policy auto-reset-master
    pidfile /var/run/aerospike/asd.pid
    service-threads 32
    transaction-queues 32
    transaction-threads-per-queue 4
    batch-index-threads 40
    proto-fd-max 15000
    batch-max-requests 30000
    replication-fire-and-forget true
}
I use 300GB SSD disks on these servers.
A quick note which may or may not pertain to you:
A common mistake we have seen in the past is that developers decide to use 'batch get' as a general purpose 'get' for single and multiple record requests. The single record get will perform better for single record requests.
It's possible that you are being constrained by the network between the clients and servers. Reducing from 5 to 4 nodes reduced the aggregate pipe. In addition, removing a node will start cluster migrations which adds additional network load.
I would look at the batch-max-buffer-per-queue config parameter.
Maximum number of 128KB response buffers allowed in each batch index
queue. If all batch index queues are full, new batch requests are
rejected.
In conjunction with raising this value from the default of 255, you will also want to raise batch-max-unused-buffers to at least batch-index-threads x batch-max-buffer-per-queue + 1. If you do not, new buffers will be created and destroyed constantly, since the number of free (unused) buffers would be smaller than the number in use. The moment the batch response is served, the system will strive to trim the buffers down to the max unused number. You will see this reflected in the batch_index_created_buffers metric constantly rising.
Be aware that you need to have enough DRAM for this. For example, if you raise batch-max-buffer-per-queue to 320 you will consume
40 (`batch-index-threads`) x 320 (`batch-max-buffer-per-queue`) x 128K = 1600MB
For the sake of performance, batch-max-unused-buffers should be set to 13000, which gives a maximum memory consumption of 13000 x 128K = 1625MB (1.59GB) per node.
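A hedged sketch of applying those values dynamically (they can also be set in the service context of aerospike.conf):
asadm -e "asinfo -v 'set-config:context=service;batch-max-buffer-per-queue=320'"
asadm -e "asinfo -v 'set-config:context=service;batch-max-unused-buffers=13000'"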

Red Hat Kickstart with multiple hard drives (VirtualBox implementation)

I am setting up a kickstart server using Apache. I followed this tutorial and everything works fine there. However, I am now writing the ks.cfg file, trying to mount separate disks as separate partitions, which is beyond that post's scope. Say I have 10 disks: I want to use the first disk for /, /boot, swap, etc., and mount /dev/sdb to /data1, /dev/sdc to /data2, and so on.
I am testing it using VirtualBox, but it doesn't like my ks.cfg file.
My part section looks like this:
# Wipe all partitions and build them with the info below
clearpart --all --drives=sda,sdb,sdc,sdd --initlabel
# Create the bootloader in the MBR with drive sda being the drive to install it on
bootloader --location=mbr --driveorder=sda,sdb,sdc,sdd
part /boot --fstype ext3 --size=100 --onpart=sda
part / --fstype ext3 --size=3000 --onpart=sda
part swap --size=2000 --onpart=sda
part /home --fstype ext3 --size=100 --grow
part /data1 --fstype=ext4 --onpart=sdb --grow --asprimary --size=200
part /data2 --fstype=ext4 --onpart=sdc --grow --asprimary --size=200
part /data3 --fstype=ext4 --onpart=sdd --grow --asprimary --size=200
Can anyone tell me what is wrong with the part section? Also, there is a sample ks.cfg that I referenced.
I ran into a similar problem attempting to mount a second hard drive (sdb) during the install. What worked for me was to run the install the first time, then log in to a root-accessible account and format and partition the sdb drive. Afterwards, when re-installing the image, kickstart recognized and mounted sdb. Hope this helps.
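For what it's worth, the usual culprit in a part section like the one in the question is --onpart: it expects an existing partition (e.g. sdb1), not a bare disk, whereas --ondisk tells Anaconda which disk to create the partition on (the same applies to the sda lines). A hedged sketch of the data-disk lines rewritten that way:
part /data1 --fstype=ext4 --ondisk=sdb --size=200 --grow --asprimary
part /data2 --fstype=ext4 --ondisk=sdc --size=200 --grow --asprimary
part /data3 --fstype=ext4 --ondisk=sdd --size=200 --grow --asprimary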