GCP Composer - Airflow webserver shutdown constantly - google-cloud-platform

I'm using GCP Composer with the newest image version composer-1.16.1-airflow-1.10.15.
My webservers are dying from time to time because of some missing cache files:
{cli.py:1050} ERROR - [Errno 2] No such file or directory
Does anybody know how to solve it?
Additional info:
Workers:
Node count 3 Disk size (GB) 20 Machine type n1-standard-1
Web server configuration:
Machine type composer-n1-webserver-8 (8 vCPU, 7.6 GB memory)
Configuration overrides:
UPDATE 27.04.2021
I've managed to find the place responsible for killing the webserver:
https://github.com/apache/airflow/blob/4aec433e48dcc66c9c7b74947c499260ab6be9e9/airflow/bin/cli.py#L1032-L1138
GCP Composer uses the Celery Executor underneath - so during the check it tries to read some cache files that have already been removed by the workers?

I've found it! And I'll report the bug to the GCP Composer team.
So if the config webserver.reload_on_plugin_change=True, then the cli goes into this section:
https://github.com/apache/airflow/blob/4aec433e48dcc66c9c7b74947c499260ab6be9e9/airflow/bin/cli.py#L1118-L1138
# if we should check the directory with the plugin,
if self.reload_on_plugin_change:
    # compare the previous and current contents of the directory
    new_state = self._generate_plugin_state()
    # If changed, wait until its content is fully saved.
    if new_state != self._last_plugin_state:
        self.log.debug(
            '[%d / %d] Plugins folder changed. The gunicorn will be restarted the next time the '
            'plugin directory is checked, if there is no change in it.',
            num_ready_workers_running, num_workers_running
        )
        self._restart_on_next_plugin_check = True
        self._last_plugin_state = new_state
    elif self._restart_on_next_plugin_check:
        self.log.debug(
            '[%d / %d] Starts reloading the gunicorn configuration.',
            num_ready_workers_running, num_workers_running
        )
        self._restart_on_next_plugin_check = False
        self._last_refresh_time = time.time()
        self._reload_gunicorn()

def _generate_plugin_state(self):
    """
    Generate dict of filenames and last modification time of all files in settings.PLUGINS_FOLDER
    directory.
    """
    if not settings.PLUGINS_FOLDER:
        return {}
    all_filenames = []
    for (root, _, filenames) in os.walk(settings.PLUGINS_FOLDER):
        all_filenames.extend(os.path.join(root, f) for f in filenames)
    plugin_state = {f: self._get_file_hash(f) for f in sorted(all_filenames)}
    return plugin_state
It generates the list of files to check by calling os.walk(settings.PLUGINS_FOLDER).
At the same time, gcsfuse decides to delete some of these files,
and the error happens - a file is not found.
So disabling webserver.reload_on_plugin_change does the trick - but this option is really convenient, so I'll create the bug ticket for Google.
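In the meantime, a hypothetical local patch could simply tolerate files disappearing between the walk and the hash. This is only a sketch (settings and self._get_file_hash refer to the Airflow module quoted above, and this is not what Composer ships):
import os

def _generate_plugin_state(self):
    """Like the original, but ignore files that gcsfuse removes mid-scan."""
    if not settings.PLUGINS_FOLDER:
        return {}
    plugin_state = {}
    for root, _, filenames in os.walk(settings.PLUGINS_FOLDER):
        for f in filenames:
            path = os.path.join(root, f)
            try:
                plugin_state[path] = self._get_file_hash(path)
            except OSError:
                # The file vanished between os.walk() and hashing ([Errno 2]);
                # skip it - the next check will see the new directory contents.
                continue
    return plugin_state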

Related

Large file upload to Django Rest Framework

I'm trying to upload a big file (4 GB) with a PUT on a DRF viewset.
During the upload my memory is stable. At 100%, the python runserver process takes more and more RAM and is killed by the kernel. I have a logging line in the put method of this APIView, but the process is killed before this method is called.
I use this setting to force file usage: FILE_UPLOAD_HANDLERS = ["django.core.files.uploadhandler.TemporaryFileUploadHandler"]
Where does this memory peak come from? I guess it tries to load the file content into memory, but why (and where)?
More information:
I tried DEBUG true and false.
The runserver is in a Docker container behind a traefik, but there is no limitation in traefik AFAIK and the upload reaches 100%.
I do not know yet if I would get the same behavior with daphne instead of runserver.
EDIT: the front end uses a Content-Type of multipart/form-data.
EDIT: I have tried FileUploadParser and (FormParser, MultiPartParser) for parser_classes in my APIView.
TL;DR:
Neither a DRF nor a Django issue; it's a Daphne issue that has been known for 2.5 years. The solution is to use uvicorn, hypercorn, or something else for the time being.
Explanations
What you're seeing here is not coming from Django Rest Framework, as:
The FileUploadParser is meant to handle large file uploads, as it reads the file chunk by chunk;
Your view not being executed rules out the parsers, which aren't executed until you access the request.FILES property.
The fact that you're mentioning Daphne reminds me of this SO answer, which mentions a similar problem and points to code showing that Daphne doesn't handle large file uploads: it loads the whole body into RAM before passing it to the view. (The code is still present in their master branch at the time of writing.)
You're seeing the same behavior with runserver because when installed, Daphne replaces the initial runserver command with itself to provide WebSockets support for dev purposes.
To make sure that it's the real culprit, try to disable Channels/run the default Django runserver and see for yourself if your app is killed by the OOM Killer.
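If you go the uvicorn route for development, a minimal sketch could look like this ("myproject.asgi:application" is a placeholder for your own project's ASGI module):
# dev_server.py - hypothetical replacement for Daphne's runserver during development
import uvicorn

if __name__ == "__main__":
    # uvicorn hands the request body to the application in chunks instead of
    # buffering the whole upload in memory the way Daphne currently does
    uvicorn.run("myproject.asgi:application", host="127.0.0.1", port=8000)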
I don't know if it works with Django REST, but you can try to chunk the file.
[...]
anexo_files = request.FILES.getlist('anexo_file_' + str(k))
index = 0
for file in anexo_files:
    index = index + 1
    extension = os.path.splitext(str(file))[1]
    nome_arquivo_anexo = 'media/uploads/' + os.path.splitext(str(file))[0] + "_" + str(index) + datetime.datetime.now().strftime("%m%d%Y%H%M%S") + extension
    handle_uploaded_file(file, nome_arquivo_anexo)
    AnexoProjeto.objects.create(
        projeto=projeto,
        arquivo_anexo=nome_arquivo_anexo
    )
[...]
Where handle_uploaded_file is
def handle_uploaded_file(f, nome_arquivo):
    with open(nome_arquivo, 'wb+') as destination:
        for chunk in f.chunks():
            destination.write(chunk)

snapper in open suse 42.3 leap creating lots of snapshots

I'm using openSUSE as my desktop; prior to that I was using Ubuntu. My root (/) file system is btrfs, and /home is xfs.
Whenever I run YaST it creates a pre and a post snapshot, even if there are no changes.
For example, opening Hardware Information in YaST is not going to make any file system changes, except for the hard disk attributes file modification at /var/lib/smartmontools/.
My question is: how do I tell snapper to create a snapshot only if there are real changes in certain folders (a list of exclusions or a list of inclusions)? There are simply too many snapshots. What is the optimal snapshot configuration?
I made some changes in the config /etc/snapper/configs/root. Please correct me if anything is wrong.
/etc/snapper/configs/root
# subvolume to snapshot
SUBVOLUME="/"
# filesystem type
FSTYPE="btrfs"
# btrfs qgroup for space aware cleanup algorithms
QGROUP="1/0"
# fraction of the filesystems space the snapshots may use
SPACE_LIMIT="0.5"
# users and groups allowed to work with config
ALLOW_USERS=""
ALLOW_GROUPS=""
# sync users and groups from ALLOW_USERS and ALLOW_GROUPS to .snapshots
# directory
SYNC_ACL="no"
# start comparing pre- and post-snapshot in background after creating
# post-snapshot
BACKGROUND_COMPARISON="yes"
# run daily number cleanup
NUMBER_CLEANUP="yes"
# limit for number cleanup
NUMBER_MIN_AGE="1800"
NUMBER_LIMIT="2-10"
NUMBER_LIMIT_IMPORTANT="4-10"
# create hourly snapshots
TIMELINE_CREATE="no"
# cleanup hourly snapshots after some time
TIMELINE_CLEANUP="yes"
# limits for timeline cleanup
TIMELINE_MIN_AGE="1800"
TIMELINE_LIMIT_HOURLY="0"
TIMELINE_LIMIT_DAILY="2"
TIMELINE_LIMIT_WEEKLY="0"
TIMELINE_LIMIT_MONTHLY="0"
TIMELINE_LIMIT_YEARLY="0"
# cleanup empty pre-post-pairs
EMPTY_PRE_POST_CLEANUP="yes"
# limits for empty pre-post-pair cleanup
EMPTY_PRE_POST_MIN_AGE="1800"

Simple libtorrent Python client

I tried creating a simple libtorrent Python client (for a magnet URI), and I failed; the program never continues past "downloading metadata".
If you could help me write a simple client it would be amazing.
P.S. When I choose a save path, is the save path the folder which I want my data to be saved in, or the path for the data itself?
(I used code someone posted here.)
import libtorrent as lt
import time

ses = lt.session()
ses.listen_on(6881, 6891)
params = {
    'save_path': '/home/downloads/',
    'storage_mode': lt.storage_mode_t(2),
    'paused': False,
    'auto_managed': True,
    'duplicate_is_error': True}
link = "magnet:?xt=urn:btih:4MR6HU7SIHXAXQQFXFJTNLTYSREDR5EI&tr=http://tracker.vodo.net:6970/announce"
handle = lt.add_magnet_uri(ses, link, params)
ses.start_dht()

print 'downloading metadata...'
while (not handle.has_metadata()):
    time.sleep(1)
print 'got metadata, starting torrent download...'
while (handle.status().state != lt.torrent_status.seeding):
    s = handle.status()
    state_str = ['queued', 'checking', 'downloading metadata',
                 'downloading', 'finished', 'seeding', 'allocating']
    print '%.2f%% complete (down: %.1f kB/s up: %.1f kB/s peers: %d) %s %.3f' % \
        (s.progress * 100, s.download_rate / 1000, s.upload_rate / 1000,
         s.num_peers, state_str[s.state], s.total_download / 1000000)
    time.sleep(5)
What happens is that the first while loop becomes infinite because the state never changes.
You have to call s = handle.status() inside the loop; once the metadata arrives the status changes and the loop stops. Alternatively, move the first while inside the second one so that the same thing happens.
Yes, the save path you specify is the one that the torrents will be downloaded to.
As for the metadata downloading part, I would add the following extensions first:
ses.add_extension(lt.create_metadata_plugin)
ses.add_extension(lt.create_ut_metadata_plugin)
Second, I would add a DHT bootstrap node:
ses.add_dht_router("router.bittorrent.com", 6881)
Finally, I would begin debugging the application by seeing if my network interface is binding or if any other errors come up (my experience with BitTorrent download problems, in general, is that they are network related). To get an idea of what's happening I would use libtorrent-rasterbar's alert system:
ses.set_alert_mask(lt.alert.category_t.all_categories)
And make a thread (with the following code) to collect the alerts and display them:
while True:
    ses.wait_for_alert(500)
    alert = ses.pop_alert()
    if not alert:
        continue
    print "[%s] %s" % (type(alert), alert.__str__())
Even with all this working correctly, make sure the torrent you are trying to download actually has peers. Even if there are a few peers, none may be configured correctly or support metadata exchange (exchanging metadata is not a standard BitTorrent feature). Try to load a torrent file (which doesn't require downloading metadata) and see if you can download successfully (to rule out some network issues).
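For that last test, loading a local .torrent file looks roughly like this (a sketch; 'test.torrent' and the save path are placeholders):
import libtorrent as lt
import time

ses = lt.session()
ses.listen_on(6881, 6891)

info = lt.torrent_info('test.torrent')   # placeholder path to a local .torrent file
handle = ses.add_torrent({'ti': info, 'save_path': '/home/downloads/'})

while handle.status().state != lt.torrent_status.seeding:
    s = handle.status()
    print '%.2f%% complete (peers: %d)' % (s.progress * 100, s.num_peers)
    time.sleep(5)
print 'done'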

Unable to get CAP_CHOWN and CAP_DAC_OVERRIDE working for regular user

My requirement
My Python server runs as a regular user on RHEL,
but it needs to create files/directories in places it doesn't have access to,
and it also needs to chown those files to arbitrary UIDs/GIDs.
My approach
Trying this in a capability-only environment, no setuid.
I am trying to make use of the cap_chown and cap_dac_override capabilities,
but am totally lost on how to get it working in a systemd/systemctl kind of environment.
At present I have the following in the service file:
#cat /usr/lib/systemd/system/my_server.service
[Service]
Type=simple
SecureBits=keep-caps
User=testuser
CapabilityBoundingSet=~
Capabilities=cap_dac_override,cap_chown=eip
ExecStart=/usr/bin/linux_capability_test.py
And the following on the binary itself:
# getcap /usr/bin/linux_capability_test.py
/usr/bin/linux_capability_test.py = cap_chown,cap_dac_override+ei
But this answer says that it will never work on scripts:
Is there a way for non-root processes to bind to "privileged" ports on Linux?
With the current setting, the capabilities I have for the running process are:
# ps -ef | grep lin
testuser 28268 1 0 22:31 ? 00:00:00 python /usr/bin/linux_capability_test.py
# getpcaps 28268
Capabilities for `28268': = cap_chown,cap_dac_override+i
But if I try to create a file in /etc/ from within that script:
try:
    file_name = '/etc/junk'
    with open(file_name, 'w') as f:
        os.utime(file_name, None)
it fails with 'Permission denied'.
Is it the same case for me, i.e. that it won't work?
Can I use the python-prctl module here to get it working?
setuid will not work with scripts because it is a security hole, due to the way that scripts execute. There are several documents on this. You can even start by looking at the Wikipedia page.
A really good workaround is to write a small C program that will launch your Python script with hard-coded paths to python and the script. A really good discussion of all the issues may be found here
Update: here is a method to do this, though I'm not sure if it's the best one, using the 'python-prctl' module:
1. Ditch 'User=testuser' from my-server.service
2. Start the server as root
3. Set the 'keep_caps' flag to True
4. Do 'setgroups, setgid and setuid'
5. Immediately limit the permitted capability set to the 'DAC_OVERRIDE' and 'CHOWN' capabilities only
6. Set the effective capability for both to True
Here is the code for the same:
import os
import prctl

prctl.securebits.keep_caps = True
os.setgroups([160])
os.setgid(160)
os.setuid(160)
prctl.cap_permitted.limit(prctl.CAP_CHOWN, prctl.CAP_DAC_OVERRIDE)
prctl.cap_effective.dac_override = True
prctl.cap_effective.chown = True
DONE !!
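To sanity-check that the drop worked as intended, a quick test could look like this (a sketch only; /etc/junk and UID/GID 160 are just the values used above):
import os
import prctl

# Both should report True after the sequence above
print prctl.cap_effective.chown, prctl.cap_effective.dac_override

# With CAP_CHOWN and CAP_DAC_OVERRIDE effective, this should now succeed
# even though the process runs under the unprivileged UID 160
os.chown('/etc/junk', 160, 160)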
Based upon our discussion above, I did the following:
[Service]
Type=simple
User=testuser
SecureBits=keep-caps
Capabilities=cap_chown,cap_dac_override=i
ExecStart=/usr/bin/linux_capability_test.py
This starts the server with both those capabilities as inheritable.
I wrote a small C test program to chown a file:
#include <unistd.h>

int main()
{
    int ret = 0;
    ret = chown("/etc/junk", 160, 160);
    return ret;
}
Set the following on the compiled (gcc'ed) binary:
chown testuser:testuser /usr/bin/chown_c
chmod 550 /usr/bin/chown_c
setcap cap_chown,cap_dac_override=ie /usr/bin/chown_c
The server does the following to call the binary:
import os
import prctl

prctl.cap_inheritable.chown = True
prctl.cap_inheritable.dac_override = True
os.execve('/usr/bin/chown_c', ['/usr/bin/chown_c'], os.environ)
And I was able to get the desired result
# ll /etc/junk
-rw-r--r-- 1 root root 0 Aug 8 22:33 /etc/junk
# python capability_client.py
# ll /etc/junk
-rw-r--r-- 1 testuser testuser 0 Aug 8 22:33 /etc/junk

Django: Gracefully restart nginx + fastcgi sites to reflect code changes?

Common situation: I have a client on my server who may update some of the code in his python project. He can ssh into his shell and pull from his repository and all is fine -- but the code is stored in memory (as far as I know) so I need to actually kill the fastcgi process and restart it to have the code change.
I know I can gracefully restart fcgi but I don't want to have to manually do this. I want my client to update the code, and within 5 minutes or whatever, to have the new code running under the fcgi process.
Thanks
First off, if uptime is important to you, I'd suggest making the client do it. It can be as simple as giving him a command called deploy-code. With your method, if there is an error in their code, fixing it requires a 10-minute turnaround (read: downtime), assuming he gets it right the second time.
That said, if you actually want to do this, you should create a daemon which will look for files modified within the last 5 minutes. If it detects one, it will execute the restart command.
Code might look something like:
import os, time

CODE_DIR = '/tmp/foo'

restarted = False
while True:
    if restarted:
        restarted = False
        time.sleep(5 * 60)
    for root, dirs, files in os.walk(CODE_DIR):
        if restarted:
            break
        for filename in files:
            if restarted:
                break
            updated_on = os.path.getmtime(os.path.join(root, filename))
            current_time = time.time()
            if current_time - updated_on <= 6 * 60:  # 6 min
                # 6 min could offer false negatives, but that's better
                # than false positives
                restarted = True
                print "We should execute the restart command here."