How to launch Ray Tune with poetry on workers - amazon-web-services

What's the secret to getting Ray workers to run within a certain directory, and within a certain virtual environment? I have the feeling I'm missing something fundamental:
I want to run Ray Tune for hyperparameter tuning (using AWS). When I launch the head (and workers), I want to run Ray from within a virtual environment (poetry, in my case). It just doesn't work at all. I tried:
setup_commands:
- cd /home/ubuntu/myimportantdirectory
- poetry shell
But then I get
2020-02-25 17:34:28,592 INFO updater.py:256 -- NodeUpdater: i-0893fd1914fd8743e: Running cd /home/ubuntu/myimportantdirectory on 18.207.133.151...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
... which I guess means it fails, because then poetry fails with could not find a pyproject.toml file in /home/ubuntu or its parents
I also tried
setup_commands:
- cd /home/ubuntu/myimportantdirectory
- . path/to/poetry/activate
But then I get Command 'ray' not found, even though I'm certain it's there.
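I later realized that each entry in setup_commands appears to run in its own non-interactive shell session, so the cd in one entry does not carry over to the next, and poetry shell tries to spawn an interactive subshell, which is presumably what the ioctl error is about. A hedged, untested sketch is to chain everything that needs the directory into a single entry (poetry install here is just an illustration of a command that must run inside the project directory):
setup_commands:
- cd /home/ubuntu/myimportantdirectory && poetry install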

The following works, but there's gotta be a better way. Also I can't believe this isn't documented. In the YAML, do this:
head_start_ray_commands:
- . /home/ubuntu/.cache/pypoetry/virtualenvs/robotics-7LCEAdjK-py3.6/bin/activate; ray stop
- ulimit -n 65536; . /home/ubuntu/.cache/pypoetry/virtualenvs/robotics-7LCEAdjK-py3.6/bin/activate; ray start --head --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
- . /home/ubuntu/.cache/pypoetry/virtualenvs/robotics-7LCEAdjK-py3.6/bin/activate; ray stop
- ulimit -n 65536; . /home/ubuntu/.cache/pypoetry/virtualenvs/robotics-7LCEAdjK-py3.6/bin/activate; ray start --redis-address=$RAY_HEAD_IP:6379 --object-manager-port=8076
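If hard-coding the virtualenv path feels too brittle (the 7LCEAdjK hash changes whenever the environment is recreated), an untested sketch is to let poetry itself resolve the environment via poetry run, assuming the project directory is fixed:
head_start_ray_commands:
- cd /home/ubuntu/myimportantdirectory && poetry run ray stop
- ulimit -n 65536; cd /home/ubuntu/myimportantdirectory && poetry run ray start --head --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands would change the same way.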

Related

Memcheck (valgrind) reporting inconsistent results in different docker hosts

I have a fairly robust set of CI tests for a C++ library. These tests (around 50) run over the same docker image, but on different machines.
On one machine ("A") all the memcheck (valgrind) tests pass (i.e. no memory leaks).
On the other ("B"), all tests produce the same valgrind error below.
51/56 MemCheck #51: combinations.cpp.x ....................***Exception: SegFault 0.14 sec
valgrind: m_libcfile.c:66 (vgPlain_safe_fd): Assertion 'newfd >= VG_(fd_hard_limit)' failed.
Cannot find memory tester output file: /builds/user/boost-multi/build/Testing/Temporary/MemoryChecker.51.log
The machines are very similar; both are Intel i7.
The only difference I can think of is that one is:
A. Ubuntu 22.10, Linux 5.19.0-29, docker 20.10.16
and the other:
B. Fedora 37, Linux 6.1.7-200.fc37.x86_64, docker 20.10.23
and perhaps some configuration of docker I don't know about.
Is there some configuration of docker that might generate the difference? or of the kernel? or some option in valgrind to workaround this problem?
I know for a fact that in real machines (not docker) valgrind doesn't produce any memory error.
The options I use for valgrind are always --leak-check=yes --num-callers=51 --trace-children=yes --leak-check=full --track-origins=yes --gen-suppressions=all.
Valgrind version in the image is 3.19.0-1 from the debian:testing image.
Note that this isn't an error reported by valgrind, it is an error within valgrind.
Perhaps after all, the only difference is that Ubuntu version of valgrind is compiled in release mode and the error is just ignored. (<-- this doesn't make sense, valgrind is the same in both cases because the docker image is the same).
I tried removing --num-callers=51 or setting it at 12 (default value), to no avail.
I found a difference between the images and the real machine and a workaround.
It has to do with the number of file descriptors.
(This was pointed out briefly in one of the threads on valgrind bug reports on Mac OS: https://bugs.kde.org/show_bug.cgi?id=381815#c0)
Inside the docker image running in Ubuntu 22.10:
ulimit -n
1048576
Inside the docker image running in Fedora 37:
ulimit -n
1073741816
(which looks like a ridiculous number or an overflow)
In the Fedora 37 and the Ubuntu 22.10 real machines:
ulimit -n
1024
So, doing this in the CI recipe, "solved" the problem:
- ulimit -n # reports current value
- ulimit -n 1024 # workaround needed by valgrind in docker running in Fedora 37
- ctest ... (with memcheck)
I have no idea why this workaround works.
For reference:
$ ulimit --help
...
-n the maximum number of open file descriptors
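Since the divergence comes from the file-descriptor limit the container starts with, another option I have not verified in this CI is to pin the limit when the container is launched, using docker's standard --ulimit flag (the image name and test command below are placeholders):
docker run --ulimit nofile=1024:1024 <image> ctest ...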
First off, "you are doing it wrong" with your Valgrind arguments. For CI I recommend a two stage approach. Use as many default arguments as possible for the CI run (--trace-children=yes may well be necessary but not the others). If your codebase is leaky then you may need to check for leaks, but if you can maintain a zero leak policy (or only suppressed leaks) then you can tell if there are new leaks from the summary. After your CI detects an issue you can run again with the kitchen sink options to get full information. Your runs will be significantly faster without all those options.
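A sketch of that two-stage idea, assuming a single test binary called ./tests (--error-exitcode is a standard Valgrind flag that turns detected errors into a nonzero exit status):
# stage 1: fast pass with near-default options; rerun with full
# diagnostics only if the fast pass reports an error
valgrind --error-exitcode=1 --trace-children=yes ./tests ||
valgrind --error-exitcode=1 --trace-children=yes --leak-check=full \
    --track-origins=yes --num-callers=51 --gen-suppressions=all ./tests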
Back to the question.
Valgrind is trying to dup() some file (the guest exe, a tempfile or something like that). The fd that it gets is higher than what it thinks the nofile rlimit is, so it is asserting.
A billion files is ridiculous.
Valgrind will try to call prlimit RLIMIT_NOFILE, with a fallback call to rlimit, and a second fallback to setting the limit to 1024.
To really see what is going on you need to modify the Valgrind source (m_main.c, setup_file_descriptors, set local show to True). With this change I see
fd limits: host, before: cur 65535 max 65535
fd limits: host, after: cur 65535 max 65535
fd limits: guest : cur 65523 max 65523
Otherwise with strace I see
2049 prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=65535, rlim_max=65535}) = 0
2049 prlimit64(0, RLIMIT_NOFILE, {rlim_cur=65535, rlim_max=65535}, NULL) = 0
(all the above on RHEL 7.6 amd64)
EDIT: Note that the above show Valgrind querying and setting the resource limit. If you use ulimit to lower the limit before running Valgrind, then Valgrind will try to honour that limit. Also note that Valgrind reserves a small number (8) of files for its own use.
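A minimal demonstration of that honouring behaviour (same shell, so the lowered limit applies to the Valgrind process; ./a.out is a placeholder binary):
ulimit -n 1024   # lower the soft limit for this shell only
valgrind ./a.out # Valgrind now stays within the 1024-fd budget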

HTCondor - Partitionable slot not working

I am following the tutorials Center for High Throughput Computing and Introduction to Configuration on the HTCondor website to set up a partitionable slot. Before any configuration I run
condor_status
and get the following output.
I update the file 00-minicondor in /etc/condor/config.d by adding the following lines at the end of the file.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=4
SLOT_TYPE_1_PARTITIONABLE = TRUE
and reconfigure
sudo condor_reconfig
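A quick way to confirm the reconfig picked up the new values (a hedged aside; condor_config_val is HTCondor's standard configuration query tool):
condor_config_val SLOT_TYPE_1 SLOT_TYPE_1_PARTITIONABLE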
Now with
condor_status
I get this output as expected. Now, I run the following command to check everything is fine
condor_status -af Name Slotype Cpus
and find slot1@ip-172-31-54-214.ec2.internal undefined 1 instead of slot1@ip-172-31-54-214.ec2.internal Partitionable 4 61295, which is what I would expect. Moreover, when I try to submit a job that asks for more than 1 CPU, it does not allocate space for it as it should (the job stays waiting forever).
I don't know if I made some mistake during the installation process or what could be happening. I would really appreciate any help!
EXTRA INFO: If it can be of any help, I have installed HTCondor with the command
curl -fsSL https://get.htcondor.org | sudo /bin/bash -s -- --no-dry-run
on Ubuntu 18.04 running on an old p2.xlarge instance (it has 4 cores).
UPDATE: After rebooting the whole thing it seems to be working. I can now send jobs with different CPUs requests and it will start them properly.
The only issue I would say persists is that Memory allocation is not showing properly, for example:
But in reality it is allocating enough memory for the job (in this case around 12 GB).
If I run again
condor_status -af Name Slotype Cpus
I still get output I am not supposed to get
But at least it is showing the correct number of CPUs (even if it just says undefined).
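A hedged aside on the undefined column: condor_status -af prints undefined for any attribute name that does not exist in the slot ClassAd, and the standard attribute is spelled SlotType (with a second T), so the query may be worth re-running as:
condor_status -af Name SlotType Cpus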
What is the output of condor_q -better when the job is idle?

Redis mass insert problem "ERR Protocol error: too big mbulk count string"

UPDATE
I split the file into multiple files of roughly 1.5 million lines each and had no issues.
I am attempting to pipe roughly 15 million lines of SADD and HSET commands, properly formatted per Redis Mass Insertion, into Redis 6.0.6, but it fails with the following message:
ERR Protocol error: too big mbulk count string
I use the following command:
echo -e "$(cat load.txt)" | redis-cli --pipe
I run dbsize command in redis-cli and it shows no increase during the entire time.
I can also use the formatting app I wrote (a C++ app using the redis-plus-plus client library), which correctly formats the lines and writes them to std::cout, with the following command:
./app | redis-cli --pipe
but it exits right away and only sometimes produces the error message.
If I take roughly 400,000 lines from the load.txt file, put them in a smaller file, and then use echo -e etc., it loads fine. The problem seems to be the large number of lines.
Any suggestions? It's not a formatting issue afaik. I can code my app to write all the commands into Redis but mass insertion should be faster and I'd prefer that route.
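For what it's worth, the split-and-pipe workaround from the update can be scripted; a sketch using the ~1.5 million line chunk size that worked, and feeding redis-cli from stdin instead of via echo -e (which expands the entire file into a single shell argument):
# split the bulk file into pieces redis-cli handles comfortably
split -l 1500000 load.txt chunk_
for f in chunk_*; do
    redis-cli --pipe < "$f"
done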

Why does "perf stat -B -e branches,branch-misses ./a.out" always report zero branch-prediction misses for a process on CentOS 7? [duplicate]

I want to evaluate the performance of a process via the "branch-misses" hardware event. But when I use perf stat to collect "branch-misses" data, it always returns 0, apparently because my OS runs inside KVM.
Since it is troublesome for me to get a real machine to do the test, I want to know whether there is any method to get "branch-misses" from "perf stat" while running inside KVM.
I really need your help. Thanks a lot.
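A hedged pointer: perf inside a guest can only count hardware events such as branch-misses if the hypervisor exposes a virtual PMU to the guest. If the VM definition is under your control, one sketch (pmu is a standard QEMU x86 CPU property; the trailing ... stands for the rest of the VM options):
# QEMU command line: pass the host CPU model and enable the vPMU
qemu-system-x86_64 -cpu host,pmu=on ...
# or, with libvirt, request host-passthrough in the domain XML:
# <cpu mode='host-passthrough'/>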

RedHat Kickstart with Multiple Hard Drives (VirtualBox Implementation)

I am setting up a kickstart server using Apache Server. I followed this tutorial and everything works fine there. However, I am now writing the ks.cfg file and trying to mount separate disks as separate partitions, which is beyond the scope of that post. Say I have 10 disks: I want to use the first disk for /, /boot, swap, etc., and mount /dev/sdb to /data1, /dev/sdc to /data2, and so on.
I am testing it using VirtualBox, but it doesn't like my ks.cfg file.
My part section looks like below:
# Wipe all partitions and build them with the info below
clearpart --all --drives=sda,sdb,sdc,sdd --initlabel
# Create the bootloader in the MBR with drive sda being the drive to install it on
bootloader --location=mbr --driveorder=sda,sdb,sdc,sdd
part /boot --fstype ext3 --size=100 --onpart=sda
part / --fstype ext3 --size=3000 --onpart=sda
part swap --size=2000 --onpart=sda
part /home --fstype ext3 --size=100 --grow
part /data1 --fstype=ext4 --onpart=sdb --grow --asprimary --size=200
part /data2 --fstype=ext4 --onpart=sdc --grow --asprimary --size=200
part /data3 --fstype=ext4 --onpart=sdd --grow --asprimary --size=200
Can anyone tell me what is wrong with the part section? Also, there is a sample ks.cfg that I referenced.
I ran into a similar problem attempting to mount a second hard drive (sdb) during the install. What worked for me was to run the install a first time, then log in to a root-accessible account and format and partition the sdb drive. Afterwards, when re-installing the image, kickstart recognized and mounted sdb. Hope this helps.
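A related syntax point worth checking, as a hedged aside: in kickstart, --onpart expects an existing partition name such as sda1 and tells anaconda to reuse it, whereas --ondisk chooses the disk on which to create a new partition. The data-disk lines would then look like:
part /data1 --fstype=ext4 --ondisk=sdb --grow --asprimary --size=200
part /data2 --fstype=ext4 --ondisk=sdc --grow --asprimary --size=200
part /data3 --fstype=ext4 --ondisk=sdd --grow --asprimary --size=200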