Unexpected crash while clustering with RStudio on ec2 (AWS) - amazon-web-services

I am experiencing crashes with RStudio on EC2 while clustering on 32 cores with the doSNOW package. The problem keeps happening, and the RStudio logs and awslogs show the following:
The previous R session was abnormally terminated due to an unexpected crash. You may have lost workspace data as a result of this crash
I have tried a workaround found on the RStudio community page like this:
rm -rf ~/.rstudio
I restarted and terminated RStudio many times, but it didn't help. I changed to a bigger instance:
r4.8xlarge
but the calculation couldn't be made either.
Apr 30 14:14:23 ip-172-31-46-102 rsession-rstudio[12984]: ERROR session hadabend; LOGGED FROM: rstudio::core::Error {anonymous}::rInit(const rstudio::r::session::RInitInfo&) /home/ubuntu/rstudio/src/cpp/session/SessionMain.cpp:563
This is the code that is running when RStudio crashes:
# Clustering using gower distance and hclust()
d <- sapply(1:nrow(data), function(i) gower_dist(data[i,], data))
d <- as.dist(d)
h <- hclust(d) # this causes error

The problem is solved: hclust is not really suitable for big data. After replacing it with flashClust, RStudio no longer crashes and the calculation succeeds.
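As a rough illustration of why a full pairwise distance matrix gets expensive, here is a minimal Python sketch that estimates its memory footprint. It assumes 8-byte floating-point entries, and the function names are hypothetical. The row count of the original data is not given in the question, so the numbers below are only examples.

```python
def full_matrix_bytes(n_rows: int, bytes_per_value: int = 8) -> int:
    """Memory for an n x n matrix, as built by the sapply() step."""
    return n_rows * n_rows * bytes_per_value


def condensed_dist_bytes(n_rows: int, bytes_per_value: int = 8) -> int:
    """Memory for the lower triangle only: n * (n - 1) / 2 values."""
    return n_rows * (n_rows - 1) // 2 * bytes_per_value


if __name__ == "__main__":
    for n in (1_000, 100_000, 500_000):
        gib = condensed_dist_bytes(n) / 2**30
        print(f"{n:>7} rows: about {gib:,.2f} GiB for the distances alone")
```

The cost grows quadratically with the number of rows, so a bigger instance only helps up to a point.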

Related

HTCondor - Partitionable slot not working

I am following the tutorial on
Center for High Throughput Computing and Introduction to Configuration in the HTCondor website to set up a Partitionable slot. Before any configuration I run
condor_status
and get the following output.
I update the file 00-minicondor in /etc/condor/config.d by adding the following lines at the end of the file.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=4
SLOT_TYPE_1_PARTITIONABLE = TRUE
and reconfigure
sudo condor_reconfig
Now with
condor_status
I get this output as expected. Now, I run the following command to check everything is fine
condor_status -af Name Slotype Cpus
and find slot1#ip-172-31-54-214.ec2.internal undefined 1 instead of slot1#ip-172-31-54-214.ec2.internal Partitionable 4 61295, which is what I would expect. Moreover, when I try to submit a job that asks for more than 1 CPU, it does not allocate resources for it as it should (the job stays waiting forever).
I don't know if I made some mistake during the installation process or what could be happening. I would really appreciate any help!
EXTRA INFO: In case it helps, I installed HTCondor with the command
curl -fsSL https://get.htcondor.org | sudo /bin/bash -s -- --no-dry-run
on Ubuntu 18.04 running on an old p2.xlarge instance (it has 4 cores).
UPDATE: After rebooting the whole thing it seems to be working. I can now submit jobs with different CPU requests and it starts them properly.
The only issue I would say persists is that Memory allocation is not showing properly, for example:
But in reality it is allocating enough memory for the job (in this case around 12 GB).
If I run again
condor_status -af Name Slotype Cpus
I still get something I am not supposed to
But at least it is showing the correct number of CPUs (even if it just says undefined).
What is the output of condor_q -better when the job is idle?

Cardano-cli query utxo fails

I am trying to create a staking pool for Cardano. I got the node up and running, but cardano-cli is giving me a hard time. It is installed: when I type cardano-cli version, it returns the version info.
However, when I enter cardano-cli query utxo --mainnet --address $RECEIVER I get this error:
cardano-cli: Network.Socket.connect: <socket: 11>: does not exist (No such file or directory)
Could it be because the blockchain isn't fully synched?
I am running windows 10 with vs
node
Yes, ivan p is correct. However, it's also important to mention that some cardano node example scripts involving stake pools, such as the one for creating a Shelley blockchain from scratch, had a bug for quite some time that caused multiple nodes to point to the same socket. If you're running multiple nodes, which you should be when running a stake pool, double check the precise path and location of the socket.
cardano-cli requires both the CARDANO_NODE_SOCKET_PATH variable to be set in the shell and a running node to connect to through that socket.
export CARDANO_NODE_SOCKET_PATH="/root/db/node.socket"
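Before running a query, it can help to confirm that the variable is set and actually points at a socket file. The following is a minimal Python sketch of such a check, not part of cardano-cli. The function name check_node_socket is hypothetical, and it only inspects the filesystem. It does not talk to the node.

```python
import os
import stat


def check_node_socket(env=os.environ):
    """Return a description of the problem, or None if the socket looks usable."""
    path = env.get("CARDANO_NODE_SOCKET_PATH")
    if not path:
        return "CARDANO_NODE_SOCKET_PATH is not set"
    if not os.path.exists(path):
        return f"{path} does not exist (is the node running?)"
    if not stat.S_ISSOCK(os.stat(path).st_mode):
        return f"{path} exists but is not a socket"
    return None


if __name__ == "__main__":
    problem = check_node_socket()
    print(problem or "socket path looks fine")
```

If the path is missing while the node is running, the node was probably started with a different socket path than the one exported in the shell.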
To query the balance of an address we need a running node and the
environment variable CARDANO_NODE_SOCKET_PATH set to the path of the
node.socket:

Aws - End of script output before headers: wsgi.py

I have a Django application that does some heavy computation. It works very well with small data on my machine and on 'aws -elasticbeanstalk' as well. But when the data becomes large, it gives an internal server error on AWS, and the logs show:
[core:error]End of script output before headers: wsgi.py
However, it works fine on my machine.
The code where it consistently gives this error is:
[my_big_lst[int(i[0][1])-1].appendleft((int(i[0][0]) - i[1])) for i in itertools.product(zipped_list,temp_list)]
where:
my_big_lst is a big list of deques
zipped_list is a large list of tuples
temp_list is a large list of numbers
Note that as the data grows, the processing time also increases, and that the problem only occurs on AWS with large data. On my machine it always works fine.
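For readability, the one-line comprehension above can be written as an explicit loop. This is a sketch of the same logic under the stated shapes (a list of deques, a list of 2-tuples, a list of numbers). The function name fill_deques and the sample data are made up for illustration. The loop does the same work, so it does not by itself fix the timeout, but it avoids building a throwaway list of None values.

```python
import itertools
from collections import deque


def fill_deques(my_big_lst, zipped_list, temp_list):
    """Equivalent of the comprehension: one appendleft per (tuple, number) pair."""
    for (value, index), offset in itertools.product(zipped_list, temp_list):
        # index is 1-based in the original code, hence the "- 1"
        my_big_lst[int(index) - 1].appendleft(int(value) - offset)
    return my_big_lst


if __name__ == "__main__":
    deques = [deque(), deque()]
    fill_deques(deques, [("5", "1"), ("7", "2")], [1, 2])
    print(deques)
```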
Update:
I worked out that this error happens when the processing time exceeds 60 seconds. I also changed the load balancer idle timeout to 3600, but it had no effect and the error is still there.
Can anyone suggest a solution?
If you are using a C extension module, you could try setting
WSGIApplicationGroup %{GLOBAL}
in your virtualhost.
This is related to Python sub-interpreters not working with some C extension modules. However, since your code works for a smaller data set, your problem might instead be solved by setting memory-specific directives.
https://code.google.com/archive/p/modwsgi/wikis/ApplicationIssues.wiki#Python_Simplified_GIL_State_API

How to debug "could not receive data from client: Connection reset by peer"

I'm running a django-celery application on Ubuntu-12.04.
When I run a celery task from my web interface, I get the following error, taken from the postgresql-9.3 logfile (maximum log level):
2013-11-12 13:57:01 GMT tss_usr 8113 LOG: could not receive data from client: Connection reset by peer
tss_usr is the postgresql user of the django application database and (in this example) 8113 is the pid of the process that killed the connection, I guess.
Have you got any idea on why this happens or at least how to debug this issue?
To make things work again I need to restart postgresql which is extremely uncomfortable.
I know this is an older post, but I just found it because I had the same error today in my postgres logs. I narrowed it down to a PDO select statement. I'm using Zend Framework 1.10.3 on Ubuntu Precise.
The following pdo statement generated an error if $opinion is a long text string. The column opinion is type Text in my postgres table. The query succeeds if $opinion is under a certain number of characters. 1000 characters works fine. 2000 characters fails with "could not receive data from client: Connection reset by peer".
$select = $this->db->select()
->from( 'datauserstopics' )
->where("opinion = ?",trim($opinion))
->where("datatopicsid = ?",trim($tid))
->where("datausersid= ?",$datausersid);
$stmt = $this->db->query($select);
I circumvented the problem by using:
->where("substr(opinion,1,100) = ?",trim(substr($opinion,0,100)))
This is not a perfect solution, but for my purposes, the select statement using substr() suffices.
Note that I have no problem inserting long strings into the same table/column. The disconnect problem only appears for me on the PDO select with relatively long text strings.
I'm getting it in 2017 with 9.4, I have no text fields, don't know what a PDO is. My select statement is about 50 bytes long, I'm trying to fetch an int4 and a double precision. I suspect the error message can mean multiple things.
I've since found https://dba.stackexchange.com/questions/142350/postgres-could-not-receive-data-from-client-connection-reset-by-peer which indicates it could be a problem with the client configuration. My client is libpq and PQconnectdb() is giving me a CONNECTION_OK return. It works at least partly.
For me, restarting the hypervisor that hosts both Postgres and the application using it helped. I've seen stack traces in dmesg before, though.

Debugging livelock in Django/Postgresql

I run a moderately popular web app on Django with Apache2, mod_python, and PostgreSQL 8.3 with the postgresql_psycopg2 database backend. I'm experiencing occasional livelock, identifiable when an apache2 process continually consumes 99% of CPU for several minutes or more.
I ran strace -p <pid> on the apache2 process, and found that it was continually repeating these system calls:
sendto(25, "Q\0\0\0SSELECT (1) AS \"a\" FROM \"account_profile\" WHERE \"account_profile\".\"id\" = 66201 \0", 84, 0, NULL, 0) = 84
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=25, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recvfrom(25, "E\0\0\0\210SERROR\0C25P02\0Mcurrent transaction is aborted, commands ignored until end of transaction block\0Fpostgres.c\0L906\0Rexec_simple_query\0\0Z\0\0\0\5E", 16384, 0, NULL, NULL) = 143
This exact fragment repeats continually in the trace, and was running for over 10 minutes before I finally killed the apache2 process. (Note: I edited this to replace my previous strace fragment with a new one that shows the full string contents rather than truncated ones.)
My interpretation of the above is that django is attempting to do an existence check on my table account_profile, but at some earlier point (before I started the trace) something went wrong (SQL parse error? referential integrity or uniqueness constraint violation? who knows?), and now Postgresql is returning the error "current transaction is aborted". For some reason, instead of raising an Exception and giving up, it just keeps retrying.
One possibility is that this is being triggered in a call to Profile.objects.get_or_create. This is the model class that maps to the account_profile table. Perhaps there is something in get_or_create that is designed to catch too broad a set of exceptions and retry? From the web server logs, it appears that this livelock might have occurred as a result of a double-click on the POST button in my site's registration form.
This condition has occurred a couple of times over the past few days on the live site, and results in a significant slowdown until I intervene, so pretty much anything other than infinite deadlock would be an improvement! :)
This turned out to be entirely my fault. I found the spot where the select (1) as 'a' statement seemed to originate (in django/models/base.py) and hacked it to log a traceback, which pointed clearly at my code.
I had some code that makes up a unique email "key" for each Profile. These keys are randomly generated, so because there is some possibility of overlap, I run it in a try/except within a while loop. My assumption was that the database's unique constraint would cause the save to fail if the key was not unique, and I'd be able to try again.
Unfortunately, in Postgresql you cannot simply try again after an integrity error. You have to issue a COMMIT or ROLLBACK command (even if you're in autocommit mode, apparently) before you can try again. So I had an infinite loop of failing save attempts where I was ignoring the error message.
Now I look for a more specific exception (django.db.IntegrityError) and run a limited number of attempts so that the loop is not infinite.
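The fix described above (catch only the integrity error, roll back, and cap the number of attempts) can be sketched as follows. This is not the original code: it uses the standard-library sqlite3 module as a stand-in for Django and PostgreSQL so that it runs anywhere, and the table name, column names, MAX_ATTEMPTS, and insert_with_unique_key are all hypothetical.

```python
import secrets
import sqlite3

MAX_ATTEMPTS = 5  # hypothetical cap so the loop can never spin forever


def insert_with_unique_key(conn, profile_id, make_key=lambda: secrets.token_hex(8),
                           max_attempts=MAX_ATTEMPTS):
    """Insert a profile with a random unique key, retrying a bounded number of times."""
    for _ in range(max_attempts):
        key = make_key()
        try:
            conn.execute(
                "INSERT INTO profile (id, email_key) VALUES (?, ?)",
                (profile_id, key),
            )
            conn.commit()
            return key
        except sqlite3.IntegrityError:
            # Only the uniqueness failure is retried; roll back before trying again.
            conn.rollback()
    raise RuntimeError(f"no unique key found after {max_attempts} attempts")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE profile (id INTEGER PRIMARY KEY, email_key TEXT UNIQUE)")
    print(insert_with_unique_key(conn, 1))
```

Any other exception propagates instead of being swallowed, which is what makes the original failure visible rather than turning it into an endless retry.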
Thanks to everyone for viewing/answering.
Your analysis sounds pretty good. Clearly it's not picking up the fact that the transaction is aborted. I suggest you report this as a bug to the django project...