HTCondor - Partitionable slot not working - amazon-web-services

I am following the tutorial on
Center for High Throughput Computing and Introduction to Configuration in the HTCondor website to set up a Partitionable slot. Before any configuration I run
condor_status
and get the following output.
I update the file 00-minicondor in /etc/condor/config.d by adding the following lines at the end of the file.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=4
SLOT_TYPE_1_PARTITIONABLE = TRUE
and reconfigure
sudo condor_reconfig
Now with
condor_status
I get this output as expected. Now, I run the following command to check everything is fine
condor_status -af Name Slotype Cpus
and find slot1#ip-172-31-54-214.ec2.internal undefined 1 instead of slot1#ip-172-31-54-214.ec2.internal Partitionable 4 61295 that is what I would expect. Moreover, when I try to summit a job that asks for more than 1 cpu it does not allocate space for it (It stays waiting forever) as it should.
I don't know if I made some mistake during the installation process or what could be happening. I would really appreciate any help!
EXTRA INFO: If it can be of any help have have installed HTCondor with the command
curl -fsSL https://get.htcondor.org | sudo /bin/bash -s – –no-dry-run
on Ubuntu 18.04 running on an old p2.xlarge instance (it has 4 cores).
UPDATE: After rebooting the whole thing it seems to be working. I can now send jobs with different CPUs requests and it will start them properly.
The only issue I would say persists is that Memory allocation is not showing properly, for example:
But in reality it is allocating enough memory for the job (in this case around 12 GB).
If I run again
condor_status -af Name Slotype Cpus
I still get something I am not supposed to
But at least it is showing the correct number of CPUs (even if it just says undefined).

What is the output of condor_q -better when the job is idle?

Related

Memchk (valgrind) reporting inconsistent results in different docker hosts

I have a fairly robust CI test for a C++ library, these tests (around 50) run over the same docker image but in different machines.
In one machine ("A") all the memcheck (valgrind) tests pass (i.e. no memory leaks).
In the other ("B"), all tests produce the same valgrind error below.
51/56 MemCheck #51: combinations.cpp.x ....................***Exception: SegFault 0.14 sec
valgrind: m_libcfile.c:66 (vgPlain_safe_fd): Assertion 'newfd >= VG_(fd_hard_limit)' failed.
Cannot find memory tester output file: /builds/user/boost-multi/build/Testing/Temporary/MemoryChecker.51.log
The machines are very similar, both are intel i7.
The only difference I can think of is that one is:
A. Ubuntu 22.10, Linux 5.19.0-29, docker 20.10.16
and the other:
B. Fedora 37, Linux 6.1.7-200.fc37.x86_64, docker 20.10.23
and perhaps some configuration of docker I don't know about.
Is there some configuration of docker that might generate the difference? or of the kernel? or some option in valgrind to workaround this problem?
I know for a fact that in real machines (not docker) valgrind doesn't produce any memory error.
The options I use for valgrind are always -leak-check=yes --num-callers=51 --trace-children=yes --leak-check=full --track-origins=yes --gen-suppressions=all.
Valgrind version in the image is 3.19.0-1 from the debian:testing image.
Note that this isn't an error reported by valgrind, it is an error within valgrind.
Perhaps after all, the only difference is that Ubuntu version of valgrind is compiled in release mode and the error is just ignored. (<-- this doesn't make sense, valgrind is the same in both cases because the docker image is the same).
I tried removing --num-callers=51 or setting it at 12 (default value), to no avail.
I found a difference between the images and the real machine and a workaround.
It has to do with the number of file descriptors.
(This was pointed out briefly in one of the threads on valgind bug issues on Mac OS https://bugs.kde.org/show_bug.cgi?id=381815#c0)
Inside the docker image running in Ubuntu 22.10:
ulimit -n
1048576
Inside the docker image running in Fedora 37:
ulimit -n
1073741816
(which looks like a ridiculous number or an overflow)
In the Fedora 37 and the Ubuntu 22.10 real machines:
ulimit -n
1024
So, doing this in the CI recipe, "solved" the problem:
- ulimit -n # reports current value
- ulimit -n 1024 # workaround neededed by valgrind in docker running in Fedora 37
- ctest ... (with memcheck)
I have no idea why this workaround works.
For reference:
$ ulimit --help
...
-n the maximum number of open file descriptors
First off, "you are doing it wrong" with your Valgrind arguments. For CI I recommend a two stage approach. Use as many default arguments as possible for the CI run (--trace-children=yes may well be necessary but not the others). If your codebase is leaky then you may need to check for leaks, but if you can maintain a zero leak policy (or only suppressed leaks) then you can tell if there are new leaks from the summary. After your CI detects an issue you can run again with the kitchen sink options to get full information. Your runs will be significantly faster without all those options.
Back to the question.
Valgrind is trying to dup() some file (the guest exe, a tempfile or something like that). The fd that it fets is higher than what it thinks the nofile rlimit is, so it is asserting.
A billion files is ridiculous.
Valgrind will try to call prlimit RLIMIT_NOFILE, with a fallback call to rlimit, and a second fallback to setting the limit to 1024.
To realy see what is going on you need to modify the Valgrind source (m_main.c, setup_file_descriptors, set local show to True). With this change I see
fd limits: host, before: cur 65535 max 65535
fd limits: host, after: cur 65535 max 65535
fd limits: guest : cur 65523 max 65523
Otherwise with strace I see
2049 prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=65535, rlim_max=65535}) = 0
2049 prlimit64(0, RLIMIT_NOFILE, {rlim_cur=65535, rlim_max=65535}, NULL) = 0
(all the above on RHEL 7.6 amd64)
EDIT: Note that the above show Valgrind querying and setting the resource limit. If you use ulimit to lower the limit before running Valgrind, then Valgrind will try to honour that limit. Also note that Valgrind reserves a small number (8) of files for its own use.

why does use "perf stat -B -e branches,branch-misses ./a.out" to get one process branch prediction miss is always zero in centos7 [duplicate]

I want to evaluate the performance of one process by "branch-misses" hardware event. But when I used perf stat to get "branch-misses" data, it always return 0 just because my os is in KVM.
Because it's trouble for me to get one real machine to do the test. So I want to know that is there any method to get "branch-misses" by "perf stat" when I am in KVM.
I'm really need your help. Thanks a lot.

Aws - End of script output before headers: wsgi.py

I have a django application that does some heavy computation. It works very good with less data on my machine and on 'aws -elasticbeanstalk' as well. But When the data becomes large it on aws, gives, internal server error, and in the logs it shows:
[core:error]End of script output before headers: wsgi.py
However works fine on my machine
The code where it constantly gives this error is :
[my_big_lst[int(i[0][1])-1].appendleft((int(i[0][0]) - i[1])) for i in itertools.product(zipped_list,temp_list)]
where:
my_big_lst is a big list of deques
zipped_list is a large list of tuples
temp_list is a large list of numbers
It is notable, that as data grows large, the processing time also increases, and also that this problem is only coming on aws when data is large, and on my machine, it always works fine.
Update:
I worked out, that this error happens when the processing time exceeds 60 seconds, I also changed the Idle Loadbalancer time to 3600, but no effect, still error is there
Please anyone suggest a solution ?
If you are using a c-extension module, You could try setting
WSGIApplicationGroup %{GLOBAL}
in your virtualhost.
Something about python subinterpreters not working with c-extension modules. However since your code works for a smaller data set, your problem might be solved by setting memory-specific directives.
https://code.google.com/archive/p/modwsgi/wikis/ApplicationIssues.wiki#Python_Simplified_GIL_State_API

Fatal error: Allowed memory size of 33554432 bytes exhausted (tried to allocate 5 bytes)...in mysqli.php on line 483

I am trying to upgrade my website joomla from 1.5 to 2.5 using jUpgrade. but I am encountering this problem since a long time. jUpgrade component starts upgrading, downloads and exctracts the new joomla but while migrating the categories OR contents, it stucks and gives this error:
==========
[undefined] [undefined]
Fatal error: Allowed memory size of 33554432 bytes exhausted (tried to allocate 5 bytes) in /home/daneshna/public_html/up/libraries/joomla/database/database/mysqli.php on line 483
I tried allocating more memory size in php.ini file in the server to even 1028M (128M was the default), but this problem persists and I cant get through with it. Tried everything I could find online, but its still there.
Can anyone please help me out with this issue?
(p.s. my website is www.daneshnamah.com, a persian educational website run from Afghanistan since almost 4 years perfectly with over 6000 articles in it and over 19M visitors so far, bcz of security reasons now I wanna upgrade it to 2.5)
Thank You
Ebtihaj
Looks like something (you added recently?) is eating up massive amounts of memory.
Things like that happen often with PHP which is why it shouldn't be used for more than Hypertext.
Did you restart your webserver after changing the php.ini?
What you can do to identify the hungry parts of your website is use Xdebug to see where the memory is eaten up.
If you recently added code/plugins, you can simply remove half of your new additions, see if it happens again, if it does, comment out half of that, etc, and keep doing this until you found the culprit.

Django: Gracefully restart nginx + fastcgi sites to reflect code changes?

Common situation: I have a client on my server who may update some of the code in his python project. He can ssh into his shell and pull from his repository and all is fine -- but the code is stored in memory (as far as I know) so I need to actually kill the fastcgi process and restart it to have the code change.
I know I can gracefully restart fcgi but I don't want to have to manually do this. I want my client to update the code, and within 5 minutes or whatever, to have the new code running under the fcgi process.
Thanks
First off, if uptime is important to you, I'd suggest making the client do it. It can be as simple as giving him a command called deploy-code. Using your method, if there is an error in their code, your method requires a 10 minute turnaround (read: downtime) for fixing it, assuming he gets it correct.
That said, if you actually want to do this, you should create a daemon which will look for files modified within the last 5 minutes. If it detects one, it will execute the reboot command.
Code might look something like:
import os, time
CODE_DIR = '/tmp/foo'
while True:
if restarted = True:
restarted = False
time.sleep(5*60)
for root, dirs, files in os.walk(CODE_DIR):
if restarted=True:
break
for filename in files:
if restared=True:
break
updated_on = os.path.getmtime(os.path.join(root, filename))
current_time = time.time()
if current_time - updated_on <= 6 * 60: # 6 min
# 6 min could offer false negatives, but that's better
# than false positives
restarted = True
print "We should execute the restart command here."