We are running a Spark job that runs close to 30 scripts one after another. It usually takes 14-15 hours, but this time it failed after 13 hours. Details below:
Command: spark-submit --executor-memory=80g --executor-cores=5 --conf spark.sql.shuffle.partitions=800 run.py
Setup: Spark jobs run via Jenkins on AWS EMR with 16 spot nodes
Error: Since the YARN log is huge (270 MB+), below are some extracts from it:
[2022-07-25 04:50:08.646]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
ermediates/master/email/_temporary/0/_temporary/attempt_202207250435265404741257029168752_0641_m_000599_168147 s3://memberanalytics-data-out-prod/pipelined_intermediates/master/email/_temporary/0/task_202207250435265404741257029168752_0641_m_000599 using algorithm version 1
22/07/25 04:37:05 INFO FileOutputCommitter: Saved output of task 'attempt_202207250435265404741257029168752_0641_m_000599_168147' to s3://memberanalytics-data-out-prod/pipelined_intermediates/master/email/_temporary/0/task_202207250435265404741257029168752_0641_m_000599
22/07/25 04:37:05 INFO SparkHadoopMapRedUtil: attempt_202207250435265404741257029168752_0641_m_000599_168147: Committed
22/07/25 04:37:05 INFO Executor: Finished task 599.0 in stage 641.0 (TID 168147). 9341 bytes result sent to driver
22/07/25 04:49:36 ERROR YarnCoarseGrainedExecutorBackend: Executor self-exiting due to : Driver ip-10-13-52-109.bjw2k.asg:45383 disassociated! Shutting down.
22/07/25 04:49:36 INFO MemoryStore: MemoryStore cleared
22/07/25 04:49:36 INFO BlockManager: BlockManager stopped
22/07/25 04:50:06 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
22/07/25 04:50:06 ERROR Utils: Uncaught exception in thread shutdown-hook-0
java.lang.InterruptedException
I am getting the error below when I try to run the Docker container for Postfix:
2020-05-29 08:49:05,837 CRIT Supervisor is running as root. Privileges were not dropped because no user is specified in the config file. If you intend to run as root, you can set user=root in the config file to avoid this message.
2020-05-29 08:49:05,837 INFO Included extra file "/etc/supervisor/conf.d/supervisord.conf" during parsing
2020-05-29 08:49:05,844 INFO RPC interface 'supervisor' initialized
2020-05-29 08:49:05,844 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2020-05-29 08:49:05,844 INFO supervisord started with pid 17
2020-05-29 08:49:06,852 INFO spawned: 'postfix' with pid 19
2020-05-29 08:49:06,856 INFO spawnerr: can't find command 'rsyslogd'
2020-05-29 08:49:07,167 INFO exited: postfix (exit status 1; not expected)
2020-05-29 08:49:08,172 INFO spawned: 'postfix' with pid 136
2020-05-29 08:49:08,174 INFO spawnerr: can't find command 'rsyslogd'
2020-05-29 08:49:08,219 INFO exited: postfix (exit status 1; not expected)
2020-05-29 08:49:10,230 INFO spawned: 'postfix' with pid 151
2020-05-29 08:49:10,233 INFO spawnerr: can't find command 'rsyslogd'
2020-05-29 08:49:10,274 INFO exited: postfix (exit status 1; not expected)
2020-05-29 08:49:13,283 INFO spawned: 'postfix' with pid 166
2020-05-29 08:49:13,286 INFO spawnerr: can't find command 'rsyslogd'
2020-05-29 08:49:13,286 INFO gave up: rsyslog entered FATAL state, too many start retries too quickly
2020-05-29 08:49:13,325 INFO exited: postfix (exit status 1; not expected)
2020-05-29 08:49:14,330 INFO gave up: postfix entered FATAL state, too many start retries too quickly
The config block corresponding to the above is:
command=/usr/sbin/rsyslogd -n -c3
Kindly help
Thanks,
Suv
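Please provide the container config you are using.
From what I can tell, rsyslog is not installed in the container, so install it before using it. If you are on a Debian or Ubuntu container, use the commands below to install it in the container (add-apt-repository comes from the software-properties-common package, which minimal images often lack).
apt-get update && apt-get install -y software-properties-common
add-apt-repository -y ppa:adiscon/v8-stable
apt-get update && apt-get install -y rsyslog
In the Dockerfile:
RUN apt-get update && apt-get install -y software-properties-common && \
    add-apt-repository -y ppa:adiscon/v8-stable && \
    apt-get update && apt-get install -y rsyslog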
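To confirm the diagnosis from inside the container, you can check whether the binary exists before supervisor tries to spawn it. This is a minimal sketch using only Python's standard library; it assumes Python is available in the image, the helper name command_available is hypothetical, and the path is the one from the command= line in the question.

```python
import shutil


def command_available(command: str) -> bool:
    """Return True if `command` resolves to an executable file."""
    # shutil.which accepts a bare name (searched on PATH) or a full path.
    return shutil.which(command) is not None


if __name__ == "__main__":
    # Path taken from the supervisor config in the question.
    path = "/usr/sbin/rsyslogd"
    status = "found" if command_available(path) else "missing"
    print(f"{path}: {status}")
```

If this reports "missing", the "can't find command" message from supervisor is expected and installing the package should resolve it.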
I'm running a Django application inside a container using supervisord.
Sometimes I need to view the logs to fix errors, but I couldn't find a way to do it.
I tried adding stdout_logfile and stderr_logfile, but the error log file is always empty.
This is my supervisor.conf:
[supervisord]
loglevel=info
logfile=/tmp/supervisord.log
[program:myapp]
command = python3 -u /usr/src/app/manage.py runserver 0.0.0.0:8000
stdout_logfile=/usr/src/app/out.log
stderr_logfile=/usr/src/app/err.log
The result is always the same: out.log contains the lines written before the exception happens, and err.log is never created.
This is the output I get when I run docker compose:
2020-05-13 17:33:44,140 INFO supervisord started with pid 1
2020-05-13 17:33:45,144 INFO spawned: 'myapp' with pid 9
2020-05-13 17:33:46,201 INFO success: myapp entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
After a long struggle I found that the log output was being buffered; the solution is to add environment = PYTHONUNBUFFERED=1 to the supervisor.conf file.
My conf file after the modification:
[supervisord]
loglevel=info
logfile=/tmp/supervisord.log
[program:myapp]
command = python3 -u /usr/src/app/manage.py runserver 0.0.0.0:8000
environment = PYTHONUNBUFFERED=1
stdout_logfile=/usr/src/app/out.log
stderr_logfile=/usr/src/app/err.log
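One possible explanation for why the environment variable helped even though the command already passes -u: a command-line flag applies only to the interpreter it is given to, whereas an environment variable is inherited by any child process that interpreter starts. Whether a child process started by runserver is what was buffering here is an assumption, not something the logs above confirm. The sketch below only demonstrates the inheritance itself, using the standard library; the function name is hypothetical.

```python
import os
import subprocess
import sys


def child_sees_unbuffered() -> str:
    """Start a child interpreter and report the PYTHONUNBUFFERED value it sees."""
    # Copy the current environment and add the variable, as supervisor's
    # environment= setting would do for the program it spawns.
    env = dict(os.environ, PYTHONUNBUFFERED="1")
    result = subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.environ.get('PYTHONUNBUFFERED'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(child_sees_unbuffered())
```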
I've been trying to follow this thorough explanation of how to deploy a Django app with a Celery worker to AWS Elastic Beanstalk:
How to run a celery worker with Django app scalable by AWS Elastic Beanstalk?
I had some problems installing pycurl but solved them with the comment in:
Pip Requirements.txt --global-option causing installation errors with other packages. "option not recognized"
Then I got:
[2019-01-26T06:43:04.865Z] INFO [12249] - [Application update app-190126_134200#28/AppDeployStage0/EbExtensionPostBuild/Infra-EmbeddedPostBuild/postbuild_1_raiseflags/Command 05_celery_tasks_run] : Activity execution failed, because: /usr/bin/env: bash
: No such file or directory
(ElasticBeanstalk::ExternalInvocationError)
I solved that too: it turns out I had to convert the "celery_configuration.txt" file to UNIX line endings (I'm using Windows, and Notepad++ had automatically saved it with Windows line endings).
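The odd line break in the error ("/usr/bin/env: bash" followed by ": No such file or directory") is typical of a script whose first line ends in a carriage return, so env looks for a program literally named "bash\r". As a rough sketch of the conversion, assuming you would rather script it than use an editor, the following replaces CRLF with LF; the function names are hypothetical.

```python
from pathlib import Path


def to_unix_eol(data: bytes) -> bytes:
    """Replace Windows (CRLF) line endings with UNIX (LF) ones."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> None:
    """Rewrite the file at `path` in place with UNIX line endings."""
    file = Path(path)
    # Work on bytes so no newline translation happens behind our back.
    file.write_bytes(to_unix_eol(file.read_bytes()))
```

Calling convert_file on the configuration file before deploying should have the same effect as changing the EOL setting in the editor.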
With all these modifications I can successfully deploy the project. But the problem is that the periodic tasks are not running.
I get:
2019-01-26 09:12:57,337 INFO exited: celeryd-beat (exit status 1; not expected)
2019-01-26 09:12:58,583 INFO spawned: 'celeryd-worker' with pid 25691
2019-01-26 09:12:59,453 INFO spawned: 'celeryd-beat' with pid 25695
2019-01-26 09:12:59,666 INFO exited: celeryd-worker (exit status 1; not expected)
2019-01-26 09:13:00,790 INFO spawned: 'celeryd-worker' with pid 25705
2019-01-26 09:13:00,791 INFO exited: celeryd-beat (exit status 1; not expected)
2019-01-26 09:13:01,915 INFO exited: celeryd-worker (exit status 1; not expected)
2019-01-26 09:13:03,919 INFO spawned: 'celeryd-worker' with pid 25728
2019-01-26 09:13:03,920 INFO spawned: 'celeryd-beat' with pid 25729
2019-01-26 09:13:05,985 INFO exited: celeryd-worker (exit status 1; not expected)
2019-01-26 09:13:06,091 INFO exited: celeryd-beat (exit status 1; not expected)
2019-01-26 09:13:07,092 INFO gave up: celeryd-beat entered FATAL state, too many start retries too quickly
2019-01-26 09:13:09,096 INFO spawned: 'celeryd-worker' with pid 25737
2019-01-26 09:13:10,084 INFO exited: celeryd-worker (exit status 1; not expected)
2019-01-26 09:13:11,085 INFO gave up: celeryd-worker entered FATAL state, too many start retries too quickly
I also have this part of the logs:
[2019-01-26T09:13:00.583Z] INFO [25247] - [Application update app-190126_161213#43/AppDeployStage1/AppDeployPostHook/run_supervised_celeryd.sh] : Completed activity. Result:
[program:celeryd-worker]
; Set full path to celery program if using virtualenv
command=/opt/python/run/venv/bin/celery worker -A raiseflags --loglevel=INFO
directory=/opt/python/current/app
user=nobody
numprocs=1
stdout_logfile=/var/log/celery-worker.log
stderr_logfile=/var/log/celery-worker.log
autostart=true
autorestart=true
startsecs=10
; Need to wait for currently executing tasks to finish at shutdown.
; Increase this if you have very long running tasks.
stopwaitsecs = 600
; When resorting to send SIGKILL to the program to terminate it
; send SIGKILL to its whole process group instead,
; taking care of its children as well.
killasgroup=true
; if rabbitmq is supervised, set its priority higher
; so it starts first
priority=998
environment=PYTHONPATH="/opt/python/current/app/:",PATH="/opt/python/run/venv/bin/:%%(ENV_PATH)s",RDS_PORT="5432",RDS_DB_NAME="ebdb",RDS_USERNAME="foobar",PYCURL_SSL_LIBRARY="nss",DJANGO_SETTINGS_MODULE="raiseflags.settings",RDS_PASSWORD="foobar",RDS_HOSTNAME="something.something.eu-west-1.rds.amazonaws.com"
[program:celeryd-beat]
; Set full path to celery program if using virtualenv
command=/opt/python/run/venv/bin/celery beat -A raiseflags --loglevel=INFO --workdir=/tmp -S django --pidfile /tmp/celerybeat.pid
directory=/opt/python/current/app
user=nobody
numprocs=1
stdout_logfile=/var/log/celery-beat.log
stderr_logfile=/var/log/celery-beat.log
autostart=true
autorestart=true
startsecs=10
; Need to wait for currently executing tasks to finish at shutdown.
; Increase this if you have very long running tasks.
stopwaitsecs = 600
; When resorting to send SIGKILL to the program to terminate it
; send SIGKILL to its whole process group instead,
; taking care of its children as well.
killasgroup=true
; if rabbitmq is supervised, set its priority higher
; so it starts first
priority=998
environment=PYTHONPATH="/opt/python/current/app/:",PATH="/opt/python/run/venv/bin/:%%(ENV_PATH)s",RDS_PORT="5432",RDS_DB_NAME="ebdb",RDS_USERNAME="foobar",PYCURL_SSL_LIBRARY="nss",DJANGO_SETTINGS_MODULE="raiseflags.settings",RDS_PASSWORD="foobar",RDS_HOSTNAME="something.something.eu-west-1.rds.amazonaws.com"
No config updates to processes
celeryd-beat: ERROR (not running)
celeryd-beat: ERROR (abnormal termination)
celeryd-worker: ERROR (not running)
celeryd-worker: ERROR (abnormal termination)
[2019-01-26T09:13:00.583Z] INFO [25247] - [Application update app-190126_161213#43/AppDeployStage1/AppDeployPostHook] : Completed activity. Result:
Successfully execute hooks in directory /opt/elasticbeanstalk/hooks/appdeploy/post.
[2019-01-26T09:13:00.583Z] INFO [25247] - [Application update app-190126_161213#43/AppDeployStage1] : Completed activity. Result:
Application version switch - Command CMD-AppDeploy stage 1 completed
[2019-01-26T09:13:00.583Z] INFO [25247] - [Application update app-190126_161213#43/AddonsAfter] : Starting activity...
[2019-01-26T09:13:00.583Z] INFO [25247] - [Application update app-190126_161213#43/AddonsAfter/ConfigLogRotation] : Starting activity...
[2019-01-26T09:13:00.583Z] INFO [25247] - [Application update app-190126_161213#43/AddonsAfter/ConfigLogRotation/10-config.sh] : Starting activity...
[2019-01-26T09:13:00.756Z] INFO [25247] - [Application update app-190126_161213#43/AddonsAfter/ConfigLogRotation/10-config.sh] : Completed activity. Result:
Disabled forced hourly log rotation.
[2019-01-26T09:13:00.756Z] INFO [25247] - [Application update app-190126_161213#43/AddonsAfter/ConfigLogRotation] : Completed activity. Result:
Successfully execute hooks in directory /opt/elasticbeanstalk/addons/logpublish/hooks/config.
I don't know whether it is related to the error, but note the part PATH="/opt/python/run/venv/bin/:%%(ENV_PATH)s" in the line above. Shouldn't ENV_PATH be something else?
environment=PYTHONPATH="/opt/python/current/app/:",PATH="/opt/python/run/venv/bin/:%%(ENV_PATH)s",RDS_PORT="5432",RDS_DB_NAME="ebdb",RDS_USERNAME="foobar",PYCURL_SSL_LIBRARY="nss",DJANGO_SETTINGS_MODULE="raiseflags.settings",RDS_PASSWORD="foobar",RDS_HOSTNAME="something.something.eu-west-1.rds.amazonaws.com"
It's my first time deploying an app with Celery, and I'm really lost, to be honest. I fought hard to solve the first two errors (I'm very much an amateur), and now that I get this one I don't even know where to start.
Also, I'm not sure whether I'm using "celery_configuration.txt" the right way. The only thing I edited was the two places where it says "django_app", which I changed to "raiseflags" (the name of my Django project). Is this correct?
Does anyone know how to solve it? I can paste my files if needed, but they are just like the ones provided in the first link. I'm using Windows.
Thank you very much!
OK, the problem had nothing to do with the PATH line I was referring to. I just had to add 'django_celery_beat' and 'django_celery_results' to INSTALLED_APPS in my settings.py.
The connection error I later mentioned when talking to Fran occurred because I needed to set BROKER_URL instead of CELERY_BROKER_URL, also in settings.py. I guess this is because I did not specify 'CELERY' as the namespace in the app.config_from_object() call in celery.py (the linked question does, but I left it out because I was using a different version of Celery).
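For anyone who wants to see the two settings.py changes described above in one place, here is a minimal sketch. The rest of the app list is elided, and the broker URL is a hypothetical placeholder, not the one used in this project.

```python
# settings.py (fragment). Only the two Celery-related app entries come
# from the answer above; everything else in the list is elided.
INSTALLED_APPS = [
    # ... your existing Django apps ...
    "django_celery_beat",     # database-backed periodic task scheduler
    "django_celery_results",  # stores task results through the Django ORM
]

# Hypothetical placeholder URL; substitute your own broker.
# This un-prefixed name is the one read when celery.py loads the Django
# settings without a namespace:
BROKER_URL = "amqp://guest:guest@localhost:5672//"

# The prefixed name is read instead when celery.py calls
# app.config_from_object("django.conf:settings", namespace="CELERY"):
# CELERY_BROKER_URL = "amqp://guest:guest@localhost:5672//"
```

Which of the two names takes effect depends on how celery.py loads the configuration, so it is worth checking that file before renaming settings.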
Thanks to Fran for everything, especially for pointing out that I should review the Celery error logs. I didn't know how to do that. If any other amateur is struggling too: you have to "eb ssh" into your instance and then run "tail -n 40 /var/log/celery-worker.log" and "tail -n 40 /var/log/celery-beat.log" (where "40" is the number of lines you want to read). I know this sounds obvious to a lot of people but, stupid me, I had no clue.
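If you would rather read the logs from a script than with the tail commands above, the same idea can be sketched in a few lines of standard-library Python. The function name tail is just a label chosen for this example; the log paths would be the ones from the supervisor config.

```python
from collections import deque
from typing import List


def tail(path: str, n: int = 40) -> List[str]:
    """Return the last `n` lines of the file at `path`."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the most recent n lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]
```

For example, tail("/var/log/celery-beat.log", 40) would return the same lines as the second command above, assuming the file is readable by the current user.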
(By the way, I'm still struggling with a problem where the Celery worker can't find the pycurl module, but that has nothing to do with this question.)
Regarding the line you pointed out:
environment=PYTHONPATH="/opt/python/current/app/:",PATH="/opt/python/run/venv/bin/:%%(ENV_PATH)s",RDS_PORT="5432",RDS_DB_NAME="ebdb",RDS_USERNAME="foobar",PYCURL_SSL_LIBRARY="nss",DJANGO_SETTINGS_MODULE="raiseflags.settings",RDS_PASSWORD="foobar",RDS_HOSTNAME="something.something.eu-west-1.rds.amazonaws.com"
Did you copy this line from somewhere? I don't see it in the link you posted.
The linked answer has environment=$celeryenv, where $celeryenv is defined as:
celeryenv=`cat /opt/python/current/env | tr '\n' ',' | sed 's/export //g' | sed 's/$PATH/%(ENV_PATH)s/g' | sed 's/$PYTHONPATH//g' | sed 's/$LD_LIBRARY_PATH//g' | sed 's/%/%%/g'`
celeryenv=${celeryenv%?}
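To make it easier to see what that pipeline produces, here is a rough Python port of the same steps. It is only an illustration: the function name build_celeryenv is hypothetical, and the sample input in the usage note is an assumed shape for /opt/python/current/env, not a copy of the real file.

```python
def build_celeryenv(env_text: str) -> str:
    """Mirror the tr/sed pipeline that builds the supervisor environment= value."""
    joined = env_text.replace("\n", ",")               # tr '\n' ','
    joined = joined.replace("export ", "")             # sed 's/export //g'
    joined = joined.replace("$PATH", "%(ENV_PATH)s")   # sed 's/$PATH/%(ENV_PATH)s/g'
    joined = joined.replace("$PYTHONPATH", "")         # sed 's/$PYTHONPATH//g'
    joined = joined.replace("$LD_LIBRARY_PATH", "")    # sed 's/$LD_LIBRARY_PATH//g'
    joined = joined.replace("%", "%%")                 # sed 's/%/%%/g'
    return joined[:-1]                                 # ${celeryenv%?} drops the last character


if __name__ == "__main__":
    # Assumed sample of the env file, for illustration only.
    sample = (
        'export PYTHONPATH="/opt/python/current/app/:$PYTHONPATH"\n'
        'export PATH="/opt/python/run/venv/bin/:$PATH"\n'
    )
    print(build_celeryenv(sample))
```

With input of that shape, the PATH entry comes out containing %%(ENV_PATH)s, which matches the generated environment= line in the deployment log, so that line looks like the normal output of the script rather than something edited by hand.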
I am hoping to use supervisor to monitor and run a gunicorn server.
When I run:
/usr/bin/gunicorn app.wsgi:application -c config.conf
it works.
But the exact same command in my supervisor conf file does not work. Any explanation?
supervisor.conf:
[supervisord]
[group:app]
programs=gunicorn_app
[program:gunicorn_app]
environment=PYTHONPATH=usr/bin
command=/usr/bin/gunicorn app.wsgi:application -c gunicorn.conf.py
directory=~/path/to/app
autostart=true
autorestart=true
environment=LANG="en_US.UTF-8",LC_ALL="en_US.UTF-8",LC_LANG="en_US.UTF-8"
I'm receiving an error like this:
2016-05-31 22:53:34,786 INFO spawned: 'gunicorn_app' with pid 18763
2016-05-31 22:53:34,789 INFO exited: gunicorn_app (exit status 127; not expected)
2016-05-31 22:53:35,791 INFO spawned: 'gunicorn_app' with pid 18764
2016-05-31 22:53:35,795 INFO exited: gunicorn_app (exit status 127; not expected)
2016-05-31 22:53:37,798 INFO spawned: 'gunicorn_app' with pid 18765
2016-05-31 22:53:37,802 INFO exited: gunicorn_app (exit status 127; not expected)
2016-05-31 22:53:40,807 INFO spawned: 'gunicorn_app' with pid 18766
2016-05-31 22:53:40,810 INFO exited: gunicorn_app (exit status 127; not expected)
I understand that exit code 127 means "command not found" but I can execute the exact same command on the command line.
Try using an absolute path for directory:
/home/path/to/app
rather than ~/path/to/app
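The reason this can matter is that the tilde is expanded by the shell, not by the operating system, so a command that works at a prompt can fail when another program passes the path along unchanged. Whether supervisor uses the directory value exactly as written is an assumption here. The sketch below shows the difference with the standard library; the helper name resolve_directory is hypothetical.

```python
import os


def resolve_directory(value: str) -> str:
    """Expand a leading ~ the way a shell would and return an absolute path."""
    return os.path.abspath(os.path.expanduser(value))


if __name__ == "__main__":
    raw = "~/path/to/app"
    # Without expansion, "~" is just an ordinary (relative) directory name.
    print(os.path.isabs(raw))
    print(resolve_directory(raw))
```

Pasting the resolved path into the directory= setting removes any dependence on how the value is interpreted.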
As you rightly said, this code means "command not found", which could be the result of either:
Supervisor not being able to locate the configuration file
A wrong configuration
My recommendation for each case:
Case 1:
Make sure you provide the absolute (full) path to your gunicorn.conf.py file (e.g. /home/user/path/to/gunicorn.conf.py).
Case 2:
Revisit your supervisor configuration file and try to determine where the error arises. The best way to do so is to locate the log file and open it to verify the cause. To make this easier, I recommend adding the following to your supervisord.conf file:
[program:gunicorn]
# absolute path to the gunicorn configuration file: /home/<user>/path/to/gunicorn.conf.py
command=/usr/local/bin/gunicorn app.wsgi:application -c /home/<user>/path/to/gunicorn.conf.py
directory=/home/<user>/path/to/app
autostart=true
autorestart=true
# log stderr and stdout to files
stderr_logfile=/var/log/gunicorn.err.log
stdout_logfile=/var/log/gunicorn.out.log
environment=LANG="en_US.UTF-8",LC_ALL="en_US.UTF-8",LC_LANG="en_US.UTF-8"
NB: I am assuming that you want to deploy or run a Django application using gunicorn. Any error encountered when you launch your server can be checked in the gunicorn.err.log file.