Flume takes time to upload a file to HDFS

I need your assistance in checking why Flume takes so long to upload flat files to HDFS. I tried uploading just one file (10 MB in size); however, 17 hours have passed and it is still uploading with a ".tmp" suffix. When I checked the log details, it seems to be stuck in the channel:
Nov 10, X:XX:XX.XXX PM INFO org.apache.flume.channel.file.LogFile Closing RandomReader /data5/flumedata/log-1
Nov 10, X:XX:XX.XXX PM INFO org.apache.flume.channel.file.LogFile Closing RandomReader /data5/flumedata/log-2
Nov 10, X:XX:XX.XXX PM INFO org.apache.flume.channel.file.LogFile Closing RandomReader /data5/flumedata/log-3
Nov 10, X:XX:XX.XXX PM INFO org.apache.flume.channel.file.LogFile Closing RandomReader /data5/flumedata/log-4
Nov 10, X:XX:XX.XXX PM INFO org.apache.flume.channel.file.LogFile Closing RandomReader /data5/flumedata/log-5
Nov 10, X:XX:XX.XXX PM INFO org.apache.flume.channel.file.LogFile Closing RandomReader /data5/flumedata/log-6
Nov 10, X:XX:XX.XXX PM INFO org.apache.flume.channel.file.LogFile Closing RandomReader /data5/flumedata/log-7
Nov 10, X:XX:XX.XXX PM INFO org.apache.flume.channel.file.LogFile Closing RandomReader /data5/flumedata/log-8
Nov 10, X:XX:XX.XXX PM INFO org.apache.flume.channel.file.EventQueueBackingStoreFile CheckpointBackupCompleted
Nov 10, X:XX:XX.XXX PM INFO org.apache.flume.channel.file.LogFile Closing RandomReader /data5/flumedata/log-9
Nov 10, X:XX:XX.XXX PM INFO org.apache.flume.channel.file.LogFile Closing RandomReader /data5/flumedata/log-10
Nov 10, X:XX:XX.XXX PM INFO org.apache.flume.channel.file.LogFile Closing RandomReader /data5/flumedata/log-11
Nov 10, X:XX:XX.XXX PM INFO org.apache.flume.channel.file.LogFile Closing RandomReader /data5/flumedata/log-12
Nov 10, X:XX:XX.XXX PM INFO org.apache.flume.channel.file.LogFile Closing RandomReader /data5/flumedata/log-13
Nov 10, X:XX:XX.XXX PM INFO org.apache.flume.channel.file.LogFile Closing RandomReader /data5/flumedata/log-14
Nov 10, X:XX:XX.XXX PM INFO org.apache.flume.channel.file.LogFile Closing RandomReader /data5/flumedata/log-15
Nov 10, X:XX:XX.XXX PM INFO org.apache.flume.channel.file.LogFile Closing RandomReader /data5/flumedata/log-16
Here is the configuration:
agent.sources = source1
agent.channels = channel1
agent.sinks = sinks1
agent.sources.source1.type = spooldir
agent.sources.source1.spoolDir = /data1/forupload
agent.sources.source1.channels = channel1
agent.sources.source1.basenameHeader = true
agent.channels.channel1.type = file
agent.channels.channel1.capacity = 1000000
agent.channels.channel1.transactionCapacity = 10000
agent.channels.channel1.checkpointDir = /data5/checkpoint
agent.channels.channel1.dataDirs = /data5/flumedata
agent.channels.channel1.useDualCheckpoints = true
agent.channels.channel1.backupCheckpointDir = /data5/backupcheckpoint
agent.channels.channel1.maxFileSize = 900000000
agent.sinks.sinks1.type = hdfs
agent.sinks.sinks1.hdfs.path = /user/flume
agent.sinks.sinks1.hdfs.fileType = DataStream
agent.sinks.sinks1.channel = channel1
agent.sinks.sinks1.hdfs.filePrefix = %{basename}
agent.sinks.sinks1.hdfs.fileSuffix = .csv
agent.sinks.sinks1.hdfs.rollInterval = 0
agent.sinks.sinks1.hdfs.rollSize = 0
agent.sinks.sinks1.hdfs.rollCount = 0
I'd appreciate your help with this.

I think all the data has been sent. You can check whether the file you wanted to send has been renamed to file.name.COMPLETED; if it has, the file has already been consumed by the source.
However, some data may still be sitting in the file channel, since data is transmitted in batches: if the amount of data left is smaller than the batch size, it stays in the channel.
To finish sending, you can use kill -SIGTERM flume_process_id to stop the process. When Flume receives this signal, it flushes all remaining data to HDFS, and the file on HDFS is renamed, i.e. the .tmp suffix is removed.
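If the agent should keep running, the sink can also be told to close files on its own. With hdfs.rollInterval, hdfs.rollSize and hdfs.rollCount all set to 0, the HDFS sink never rolls the file, so the .tmp name persists until shutdown. A sketch of alternative sink settings that would make Flume close the file (the values are illustrative, not taken from this setup):

```
# illustrative roll settings for the HDFS sink
agent.sinks.sinks1.hdfs.rollInterval = 300   # roll (close) the file every 300 s
agent.sinks.sinks1.hdfs.idleTimeout = 60     # close a file after 60 s with no new events
```

Either setting ends the open .tmp file once writing stops, at the cost of possibly splitting one input file across several HDFS files.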

Related

OpenEdx error while running python code in Codejail Plugins using Dockerize container services

I have installed an OpenEdX platform stack using Tutor and installed the OpenEdX "Codejail" plugin using the links below:
pip install git+https://github.com/edunext/tutor-contrib-codejail
https://github.com/eduNEXT/tutor-contrib-codejail
I am facing a problem in Codejail when importing the Python matplotlib library.
Importing the same library inside the Codejail container works fine; the only problem is importing through an OpenEdX code block (advanced block > problem).
I have already installed Codejail and matplotlib in Docker.
I have to run this code, which gives an error:
<problem>
<script type="loncapa/python">
import matplotlib
</script>
</problem>
import os works fine, but I get an error on
import matplotlib
Details of the current stack:
open edx version: openedx-mfe:14.0.1
code jail version: codejailservice:14.1.0
Please see the error message below:
cannot create LoncapaProblem block-v1:VUP+Math101+2022+type#problem+block#3319c4e42da64a74b0e40f048e3f2599: Error while executing script code: Couldn't execute jailed code: stdout: b'', stderr: b'Traceback (most recent call last):\n File "jailed_code", line 19, in <module>\n exec(code, g_dict)\n File "<string>", line 66, in <module>\n File "/sandbox/venv/lib/python3.8/site-packages/matplotlib/__init__.py", line 921, in <module>\n dict.update(rcParams, rc_params_in_file(matplotlib_fname()))\n File "/sandbox/venv/lib/python3.8/site-packages/matplotlib/__init__.py", line 602, in matplotlib_fname\n for fname in gen_candidates():\n File "/sandbox/venv/lib/python3.8/site-packages/matplotlib/__init__.py", line 599, in gen_candidates\n yield os.path.join(get_configdir(), \'matplotlibrc\')\n File "/sandbox/venv/lib/python3.8/site-packages/matplotlib/__init__.py", line 239, in wrapper\n ret = func(**kwargs)\n File "/sandbox/venv/lib/python3.8/site-packages/matplotlib/__init__.py", line 502, in get_configdir\n return get_config_or_cache_dir(_get_xdg_config_dir())\n File "/sandbox/venv/lib/python3.8/site-packages/matplotlib/__init__.py", line 474, in get_config_or_cache_dir\n tempfile.mkdtemp(prefix="matplotlib-")\n File "/opt/pyenv/versions/3.8.6_sandbox/lib/python3.8/tempfile.py", line 347, in mkdtemp\n prefix, suffix, dir, output_type = sanitize_params(prefix, suffix, dir)\n File "/opt/pyenv/versions/3.8.6_sandbox/lib/python3.8/tempfile.py", line 117, in sanitize_params\n dir = gettempdir()\n File "/opt/pyenv/versions/3.8.6_sandbox/lib/python3.8/tempfile.py", line 286, in gettempdir\n tempdir = get_default_tempdir()\n File "/opt/pyenv/versions/3.8.6_sandbox/lib/python3.8/tempfile.py", line 218, in _get_default_tempdir\n raise FileNotFoundError(_errno.ENOENT,\nFileNotFoundError: [Errno 2] No usable temporary directory found in [\'/tmp\', \'/var/tmp\', \'/usr/tmp\', \'/tmp/codejail-lbfd69da\']\n' with status code: 1. For more information check Codejail Service logs.
Codejail service logs are as follows:
{"log":"[pid: 6|app: 0|req: 20/39] 172.18.0.10 () {36 vars in 483 bytes} [Tue Nov 22 11:24:59 2022] POST /api/v0/code-exec =\u003e generated 1978 bytes in 742 msecs (HTTP/1.1 200) 2 headers in 73 bytes (1 switches on core 0)\n","stream":"stderr","time":"2022-11-22T11:25:00.151315626Z"}
{"log":"2022-11-22 11:26:23,304 INFO 9 [codejailservice.app] code_exec_service.py:52 - Running problem_id:53fbaa04859f41989ab967c15a12c013 jailed code for course_id:course-v1:VUP+Math101+2022 ...\n","stream":"stderr","time":"2022-11-22T11:26:23.30489438Z"}
{"log":"2022-11-22 11:26:23,343 INFO 9 [codejailservice.app] code_exec_service.py:73 - Jailed code was executed in 0.03849988000001758 seconds.\n","stream":"stderr","time":"2022-11-22T11:26:23.343618965Z"}
{"log":"[pid: 9|app: 0|req: 20/40] 172.18.0.10 () {36 vars in 483 bytes} [Tue Nov 22 11:26:23 2022] POST /api/v0/code-exec =\u003e generated 73 bytes in 40 msecs (HTTP/1.1 200) 2 headers in 71 bytes (1 switches on core 0)\n","stream":"stderr","time":"2022-11-22T11:26:23.344178308Z"}
{"log":"2022-11-23 04:15:24,786 INFO 6 [codejailservice.app] code_exec_service.py:52 - Running problem_id:3319c4e42da64a74b0e40f048e3f2599 jailed code for course_id:course-v1:VUP+Math101+2022 ...\n","stream":"stderr","time":"2022-11-23T04:15:24.786287416Z"}
{"log":"2022-11-23 04:15:25,582 ERROR 6 [codejailservice.app] code_exec_service.py:70 - Error found while executing jailed code.\n","stream":"stderr","time":"2022-11-23T04:15:25.582527974Z"}
{"log":"[pid: 6|app: 0|req: 21/41] 172.18.0.10 () {36 vars in 483 bytes} [Wed Nov 23 04:15:24 2022] POST /api/v0/code-exec =\u003e generated 1978 bytes in 798 msecs (HTTP/1.1 200) 2 headers in 73 bytes (1 switches on core 0)\n","stream":"stderr","time":"2022-11-23T04:15:25.583132326Z"}
{"log":"2022-11-23 06:00:15,150 INFO 9 [codejailservice.app] code_exec_service.py:52 - Running problem_id:3319c4e42da64a74b0e40f048e3f2599 jailed code for course_id:course-v1:VUP+Math101+2022 ...\n","stream":"stderr","time":"2022-11-23T06:00:15.15073834Z"}
{"log":"2022-11-23 06:00:15,891 ERROR 9 [codejailservice.app] code_exec_service.py:70 - Error found while executing jailed code.\n","stream":"stderr","time":"2022-11-23T06:00:15.8916806Z"}
{"log":"[pid: 9|app: 0|req: 21/42] 172.18.0.10 () {36 vars in 483 bytes} [Wed Nov 23 06:00:15 2022] POST /api/v0/code-exec =\u003e generated 1978 bytes in 742 msecs (HTTP/1.1 200) 2 headers in 73 bytes (1 switches on core 0)\n","stream":"stderr","time":"2022-11-23T06:00:15.892225441Z"}

ImportError: libpq.so.5: cannot open shared object file: No such file or directory

(My operating system is Fedora 34.)
I use Django with Haystack and PostgreSQL. For development purposes I run the heroku local command. I use three files for settings: base.py, local.py, pro.py. When I run Heroku I use the local.py file:
from .base import *

DEBUG = True
SECRET_KEY = 'secretKey'

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
    }
}

if DEBUG:
    INTERNAL_IPS = ('127.0.0.1', 'localhost',)
    DEBUG_TOOLBAR_PANELS = [
        'debug_toolbar.panels.versions.VersionsPanel',
        'debug_toolbar.panels.timer.TimerPanel',
        'debug_toolbar.panels.settings.SettingsPanel',
        'debug_toolbar.panels.headers.HeadersPanel',
        'debug_toolbar.panels.request.RequestPanel',
        'debug_toolbar.panels.sql.SQLPanel',
        'debug_toolbar.panels.staticfiles.StaticFilesPanel',
        'debug_toolbar.panels.templates.TemplatesPanel',
        'debug_toolbar.panels.cache.CachePanel',
        'debug_toolbar.panels.signals.SignalsPanel',
        'debug_toolbar.panels.logging.LoggingPanel',
        'debug_toolbar.panels.redirects.RedirectsPanel',
    ]
    DEBUG_TOOLBAR_CONFIG = {
        'INTERCEPT_REDIRECTS': False,
    }
export DJANGO_SETTINGS_MODULE=myshop.settings.local
but heroku shows this error:
12:54:14 PM web.1 | File "/home/user/env2/lib64/python3.9/site-packages/django/contrib/postgres/apps.py", line 1, in <module>
12:54:14 PM web.1 | from psycopg2.extras import (
12:54:14 PM web.1 | File "/home/user/env2/lib64/python3.9/site-packages/psycopg2/__init__.py", line 51, in <module>
12:54:14 PM web.1 | from psycopg2._psycopg import ( # noqa
12:54:14 PM web.1 | ImportError: libpq.so.5: cannot open shared object file: No such file or directory
12:54:14 PM web.1 | [2021-07-21 09:54:14 +0000] [7689] [INFO] Worker exiting (pid: 7689)
12:54:14 PM web.1 | [2021-07-21 12:54:14 +0300] [7688] [INFO] Shutting down: Master
12:54:14 PM web.1 | [2021-07-21 12:54:14 +0300] [7688] [INFO] Reason: Worker failed to boot.
PostgreSQL is running:
postgresql.service - PostgreSQL database server
Loaded: loaded (/usr/lib/systemd/system/postgresql.service; disabled; vend>
Active: active (running) since Wed 2021-07-21 10:54:34 EEST; 1h 58min ago
Process: 2474 ExecStartPre=/usr/libexec/postgresql-check-db-dir postgresql >
Main PID: 2476 (postmaster)
Tasks: 8 (limit: 9381)
Memory: 31.2M
CPU: 663ms
CGroup: /system.slice/postgresql.service
├─2476 /usr/bin/postmaster -D /var/lib/pgsql/data
├─2477 postgres: logger
├─2479 postgres: checkpointer
├─2480 postgres: background writer
├─2481 postgres: walwriter
├─2482 postgres: autovacuum launcher
├─2483 postgres: stats collector
└─2484 postgres: logical replication launcher
Jul 21 10:54:34 fedora systemd[1]: Starting PostgreSQL database server...
Jul 21 10:54:34 fedora postmaster[2476]: 2021-07-21 10:54:34.242 EEST [2476] LO>
Jul 21 10:54:34 fedora postmaster[2476]: 2021-07-21 10:54:34.242 EEST [2476] HI>
Jul 21 10:54:34 fedora systemd[1]: Started PostgreSQL database server.
How do I fix this error? Thank you.
I fixed this error by adding the psycopg2-binary dependency.
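A minimal sketch of that fix, assuming the project lists its dependencies in a requirements.txt: psycopg2-binary ships a bundled libpq, so importing it no longer requires the system's libpq.so.5 (the version pin is illustrative):

```
# requirements.txt
# replace the source package (psycopg2) with the binary wheel:
psycopg2-binary==2.9.1   # bundles libpq; exact version is illustrative
```

Note the two packages provide the same psycopg2 module, so no code changes are needed.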

Jupyter Hub breaks connection on Google Cloud

I'm hosting JupyterHub on a separate VM instance in Google Cloud, and for some reason the connection fails every time I don't actively do anything there for about 15 minutes. After that I have to relaunch the server and rerun everything again.
Is there some kind of timeout I could change, or maybe some kind of optimised usage mode I could turn off? I tried increasing CPU and memory, but the same thing still happens every time.
I pinged the external IP:
PING <EXTERNAL_IP> (<EXTERNAL_IP>) 56(84) bytes of data.
64 bytes from <EXTERNAL_IP>: icmp_seq=1 ttl=61 time=0.857 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=2 ttl=61 time=0.390 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=3 ttl=61 time=0.418 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=4 ttl=61 time=0.363 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=5 ttl=61 time=0.385 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=6 ttl=61 time=0.429 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=7 ttl=61 time=0.440 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=8 ttl=61 time=0.352 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=9 ttl=61 time=0.357 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=10 ttl=61 time=0.396 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=11 ttl=61 time=0.356 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=12 ttl=61 time=0.594 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=13 ttl=61 time=0.408 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=14 ttl=61 time=0.424 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=15 ttl=61 time=0.414 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=16 ttl=61 time=0.390 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=17 ttl=61 time=0.378 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=18 ttl=61 time=0.350 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=19 ttl=61 time=0.437 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=20 ttl=61 time=0.384 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=21 ttl=61 time=0.361 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=22 ttl=61 time=0.340 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=23 ttl=61 time=0.496 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=24 ttl=61 time=0.361 ms
64 bytes from <EXTERNAL_IP>: icmp_seq=25 ttl=61 time=0.333 ms
Logs from the Serial port 1:
Jun 21 17:11:09 jupyterhub bash[28473]: [I 2021-06-21 17:11:09.542 SingleUserNotebookApp log:189] 200 GET /user/<MY_USERNAME>/metrics (<MY_USERNAME>#<EXTERNAL_IP>) 9.10ms
Jun 21 17:13:52 jupyterhub systemd[1]: Stopping /bin/bash -c cd /home/jupyter-<MY_USERNAME> && exec jupyterhub-singleuser --port=59331...
Jun 21 17:13:52 jupyterhub bash[28473]: [C 2021-06-21 17:13:52.335 SingleUserNotebookApp notebookapp:1978] received signal 15, stopping
Jun 21 17:13:52 jupyterhub bash[28473]: [I 2021-06-21 17:13:52.336 SingleUserNotebookApp notebookapp:2145] Shutting down 2 kernels
Jun 21 17:13:52 jupyterhub bash[28473]: [I 2021-06-21 17:13:52.438 SingleUserNotebookApp multikernelmanager:226] Kernel shutdown: d179550b-b0df-4889-8605-049d4ec59f70
Jun 21 17:13:52 jupyterhub bash[28473]: [I 2021-06-21 17:13:52.438 SingleUserNotebookApp multikernelmanager:226] Kernel shutdown: 7f2ab807-18b8-465e-b21b-fd9d82a7c3c7
Jun 21 17:13:52 jupyterhub bash[28473]: [I 2021-06-21 17:13:52.438 SingleUserNotebookApp notebookapp:2160] Shutting down 2 terminals
Jun 21 17:13:52 jupyterhub bash[28473]: [I 2021-06-21 17:13:52.439 SingleUserNotebookApp management:199] EOF on FD 12; stopping reading
Jun 21 17:13:52 jupyterhub bash[28473]: [I 2021-06-21 17:13:52.540 SingleUserNotebookApp management:362] Terminal 2 closed
Jun 21 17:13:52 jupyterhub bash[28473]: [I 2021-06-21 17:13:52.541 SingleUserNotebookApp management:199] EOF on FD 16; stopping reading
Jun 21 17:13:52 jupyterhub bash[28473]: [I 2021-06-21 17:13:52.641 SingleUserNotebookApp management:362] Terminal 1 closed
Jun 21 17:13:52 jupyterhub bash[28473]: Websocket closed
Jun 21 17:13:52 jupyterhub bash[28473]: Websocket closed
Jun 21 17:13:52 jupyterhub bash[28473]: Traceback (most recent call last):
Jun 21 17:13:52 jupyterhub bash[28473]: File "/opt/tljh/user/bin/jupyterhub-singleuser", line 10, in <module>
Jun 21 17:13:52 jupyterhub bash[28473]: sys.exit(main())
Jun 21 17:13:52 jupyterhub bash[28473]: File "/opt/tljh/user/lib/python3.7/site-packages/jupyter_core/application.py", line 254, in launch_instance
Jun 21 17:13:52 jupyterhub bash[28473]: return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
Jun 21 17:13:52 jupyterhub bash[28473]: File "/opt/tljh/user/lib/python3.7/site-packages/traitlets/config/application.py", line 845, in launch_instance
Jun 21 17:13:52 jupyterhub bash[28473]: app.start()
Jun 21 17:13:52 jupyterhub bash[28473]: File "/opt/tljh/user/lib/python3.7/site-packages/jupyterhub/singleuser/mixins.py", line 571, in start
Jun 21 17:13:52 jupyterhub bash[28473]: super().start()
Jun 21 17:13:52 jupyterhub bash[28473]: File "/opt/tljh/user/lib/python3.7/site-packages/notebook/notebookapp.py", line 2362, in start
Jun 21 17:13:52 jupyterhub bash[28473]: self.cleanup_terminals()
Jun 21 17:13:52 jupyterhub bash[28473]: File "/opt/tljh/user/lib/python3.7/site-packages/notebook/notebookapp.py", line 2161, in cleanup_terminals
Jun 21 17:13:52 jupyterhub bash[28473]: run_sync(terminal_manager.terminate_all())
Jun 21 17:13:52 jupyterhub bash[28473]: File "/opt/tljh/user/lib/python3.7/site-packages/notebook/utils.py", line 370, in run_sync
Jun 21 17:13:52 jupyterhub bash[28473]: return wrapped()
Jun 21 17:13:52 jupyterhub bash[28473]: File "/opt/tljh/user/lib/python3.7/site-packages/notebook/utils.py", line 364, in wrapped
Jun 21 17:13:52 jupyterhub bash[28473]: result = loop.run_until_complete(maybe_async)
Jun 21 17:13:52 jupyterhub bash[28473]: File "/opt/tljh/user/lib/python3.7/asyncio/base_events.py", line 587, in run_until_complete
Jun 21 17:13:52 jupyterhub bash[28473]: return future.result()
Jun 21 17:13:52 jupyterhub bash[28473]: File "/opt/tljh/user/lib/python3.7/site-packages/notebook/terminal/terminalmanager.py", line 96, in terminate_all
Jun 21 17:13:52 jupyterhub bash[28473]: await self.terminate(term, force=True)
Jun 21 17:13:52 jupyterhub bash[28473]: File "/opt/tljh/user/lib/python3.7/site-packages/notebook/terminal/terminalmanager.py", line 85, in terminate
Jun 21 17:13:52 jupyterhub bash[28473]: self._check_terminal(name)
Jun 21 17:13:52 jupyterhub bash[28473]: File "/opt/tljh/user/lib/python3.7/site-packages/notebook/terminal/terminalmanager.py", line 113, in _check_terminal
Jun 21 17:13:52 jupyterhub bash[28473]: raise web.HTTPError(404, u'Terminal not found: %s' % name)
Jun 21 17:13:52 jupyterhub bash[28473]: tornado.web.HTTPError: HTTP 404: Not Found (Terminal not found: 2)
Jun 21 17:13:52 jupyterhub systemd[1]: jupyter-<MY_USERNAME>.service: Main process exited, code=exited, status=1/FAILURE
Jun 21 17:13:52 jupyterhub systemd[1]: jupyter-<MY_USERNAME>.service: Failed with result 'exit-code'.
Jun 21 17:13:52 jupyterhub systemd[1]: Stopped /bin/bash -c cd /home/jupyter-<MY_USERNAME> && exec jupyterhub-singleuser --port=59331.

Vora 1.4 Catalog fails to start

So I upgraded Vora from 1.3 to 1.4 on a recently upgraded HDP 2.5.6.
All services seem to start fine, except Catalog. In the log I see a lot of messages like this:
2017-08-16 11:43:34.591183|+1000|ERROR|Was not able to create new dlog via XXXXX:37999, Status was ERROR_OP_TIMED_OUT, Details: |v2catalog_server|Distributed Log|140607339825056|CreateDLog|log_administration.cpp(211)^^
2017-08-16 11:43:34.611044|+1000|ERROR|Operation (CREATE_LOG) timed out, last status was: ERROR_INTERNAL|v2catalog_server|Distributed Log|140607279314688|Retry|callback_base.cpp(222)^^
2017-08-16 11:43:34.611204|+1000|ERROR|Was not able to create new dlog via XXXXX:20439, Status was ERROR_OP_TIMED_OUT, Details: |v2catalog_server|Distributed Log|140607339825056|CreateDLog|log_administration.cpp(211)^^
2017-08-16 11:43:34.611235|+1000|ERROR|Create DLog ended with status ERROR_OP_TIMED_OUT, retrying in 1000ms|v2catalog_server|Distributed Log|140607339825056|CreateDLog|log_administration.cpp(163)^^
2017-08-16 11:43:35.611757|+1000|ERROR|can't create dlog client[ ERROR_OP_TIMED_OUT ]|v2catalog_server|Catalog|140607339825056|Init|dlog_accessor.cpp(174)^^
terminate called after throwing an instance of 'std::system_error'
what(): Invalid argument
Any ideas what I left misconfigured?
[UPDATE] DLog's log below:
[Wed Aug 16 10:31:23 2017] DLOG Server Version: 1.2.330.20859
[Wed Aug 16 10:31:23 2017] Listening on XXXXXX:46026
[Wed Aug 16 10:31:23 2017] Loading data store
2017-08-16 10:31:23.475454|+1000|WARN |Server file descriptor limit too large vs system limit; reducing to 896|v2dlog|Distributed Log|140349419014080|Load|store.cpp(2187)^^
[Wed Aug 16 10:31:23 2017] Server file descriptor limit too large vs system limit; reducing to 896
[Wed Aug 16 10:31:23 2017] Recovering log in store
[Wed Aug 16 10:31:23 2017] Starting server in managed mode
[Wed Aug 16 10:31:23 2017] Initializing management interface
2017-08-16 10:31:39.365780|+1000|WARN |f(1)h(1):Host 1 has timed out, disabling|v2dlog|Distributed Log|140349343360768|newcluster.(*FragmentRef).ProcessRule|dlog.go(607)^^
2017-08-16 10:32:10.333444|+1000|ERROR|Log with ID 1 is not registered on unit.|v2dlog|Distributed Log|140349238322944|Seal|tenant_registry.cpp(63)^^
2017-08-16 10:32:10.333754|+1000|ERROR|f(1)h(1):Sealing local unit failed for log 1: disabling|v2dlog|Distributed Log|140349238322944|newcluster.(*replicaStateRef).disable|dlog.go(991)^^
[Wed Aug 16 11:22:24 2017] Received signal: 15. Shutting down
[Wed Aug 16 11:22:24 2017] Flushing store...
[Wed Aug 16 11:22:24 2017] Store flush complete
[Wed Aug 16 11:30:17 2017] DLOG Server Version: 1.2.330.20859
[Wed Aug 16 11:30:17 2017] Listening on XXXXXX:37999
[Wed Aug 16 11:30:17 2017] Loading data store
2017-08-16 11:30:17.371415|+1000|WARN |Server file descriptor limit too large vs system limit; reducing to 896|v2dlog|Distributed Log|140388824664000|Load|store.cpp(2187)^^
[Wed Aug 16 11:30:17 2017] Server file descriptor limit too large vs system limit; reducing to 896
[Wed Aug 16 11:30:17 2017] Recovering log in store
[Wed Aug 16 11:30:17 2017] Starting server in managed mode
[Wed Aug 16 11:30:17 2017] Initializing management interface
2017-08-16 11:30:19.421458|+1000|WARN |missed heartbeat for log 1, host 2; poking with state 2|v2dlog|Distributed Log|140388740617984|newcluster.(*FragmentRef).ProcessRule|dlog.go(619)^^
Further to this, I've configured Vora DLog to run on all three nodes of the cluster, but I see it's not running on one of them. The (likely) related part of the Vora Manager log is:
Aug 17 09:32:36 XXXXXX vora.vora-dlog: [c.63f700da] : stdout from check: [Thu Aug 17 09:32:36 2017] Checking for store #012[Thu Aug 17 09:32:36 2017] No valid store found
Aug 17 09:32:36 XXXXXX vora.vora-dlog: [c.63f700da] : stderr from check: 2017-08-17 09:32:36.590974|+1000|INFO |Command Line: /opt/vora/lib/vora-dlog/bin/v2dlog check --trace-level DEBUG --trace-to-stderr /var/local/vora/vora-dlog|v2dlog|Distributed Log|139919669938112|server_main|main.cpp(1323) #0122017-08-17 09:32:36.592784|+1000|INFO |Checking for store|v2dlog|Distributed Log|139919669938112|Run|main.cpp(1146) #0122017-08-17 09:32:36.593074|+1000|ERROR|Exception during recovery: Encountered a generic I/O error|v2dlog|Distributed Log|139919669938112|Load|store.cpp(2201) #0122017-08-17 09:32:36.593157|+1000|FATAL|Error during recovery|v2dlog|Distributed Log|139919669938112|handle_recovery_error|main.cpp(767) #012[Thu Aug 17 09:32:36 2017] Error during recovery #0122017-08-17 09:32:36.593214|+1000|FATAL| Encountered a generic I/O error|v2dlog|Distributed Log|139919669938112|handle_recovery_error|main.cpp(767) #012[Thu Aug 17 09:32:36 2017] Encountered a generic I/O error #0122017-08-17 09:32:36.593277|+1000|FATAL| boost::filesystem::status: Permission den
Aug 17 09:32:36 XXXXXX vora.vora-dlog: [c.63f700da] : ... ied: "/var/local/vora/vora-dlog"|v2dlog|Distributed Log|139919669938112|handle_recovery_error|main.cpp(767) #012[Thu Aug 17 09:32:36 2017] boost::filesystem::status: Permission denied: "/var/local/vora/vora-dlog" #0122017-08-17 09:32:36.593330|+1000|INFO |No valid store found|v2dlog|Distributed Log|139919669938112|Run|main.cpp(1151)
Aug 17 09:32:36 XXXXXX vora.vora-dlog: [c.63f700da] : Creating SAP Hana Vora Distributed Log store ...
Aug 17 09:32:36 XXXXXX vora.vora-dlog: [c.63f700da] : stdout from format: [Thu Aug 17 09:32:36 2017] Formatting store
Aug 17 09:32:36 XXXXXX vora.vora-dlog: [c.63f700da] : stderr from format: 2017-08-17 09:32:36.615558|+1000|INFO |Command Line: /opt/vora/lib/vora-dlog/bin/v2dlog format --trace-level DEBUG --trace-to-stderr /var/local/vora/vora-dlog|v2dlog|Distributed Log|140176991168448|server_main|main.cpp(1323) #0122017-08-17 09:32:36.617444|+1000|INFO |Formatting store|v2dlog|Distributed Log|140176991168448|Run|main.cpp(1093) #0122017-08-17 09:32:36.617655|+1000|ERROR|boost::filesystem::status: Permission denied: "/var/local/vora/vora-dlog"|v2dlog|Distributed Log|140176991168448|Format|store.cpp(2107) #0122017-08-17 09:32:36.617693|+1000|FATAL|Could not format store.|v2dlog|Distributed Log|140176991168448|Run|main.cpp(1095) #012[Thu Aug 17 09:32:36 2017] Could not format store.
Aug 17 09:32:36 XXXXXX vora.vora-dlog: [c.63f700da] : Error while creating dlog store.
Aug 17 09:32:36 XXXXXX nomad[628]: client: task "vora-dlog-server" for alloc "058fd477-4e80-59ca-7703-e97f2ca1c8c2" failed: Wait returned exit code 1, signal 0, and error <nil>
[UPDATE 2] I see quite a few lines like this in the Vora Manager log:
Aug 17 14:38:27 XXXXXX vora.vora-dlog: [c.2235f785] : Running['sudo', '-i', '-u', 'root', 'chown', 'vora:vora', '/var/log/vora/vora-dlog/']
And I would guess it should succeed, since on that node I can see that the vora-dlog directory belongs to the vora user:
-rw-r--r-- 1 vora vora 0 Jun 29 19:04 .keep
drwxrwx--- 2 vora vora 4096 Aug 16 10:31 dbdir
drwxrwx--- 6 root vora 4096 Aug 15 16:24 vora-discovery
drwxrwx--- 2 vora vora 4096 Aug 16 10:31 vora-dlog
drwxr-xr-x 4 root root 4096 Aug 15 16:23 vora-scheduler
The vora-dlog directory itself is empty.

Celery not connecting to Redis Broker (Django)

This is my first time using Celery and Redis, so there's probably something obvious that I'm not inferring from the documentation and from searching through others' questions on here. Whenever I try to run a worker, my connection keeps timing out with:
ResponseError: unknown command 'WATCH'
[2013-06-12 18:25:23,059: ERROR/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Here's my requirements.txt:
South==0.7.6
amqp==1.0.11
anyjson==0.3.3
billiard==2.7.3.28
boto==2.9.4
celery==3.0.19
celery-with-redis==3.0
dj-database-url==0.2.1
django-admin-bootstrapped==0.3.2
django-celery==3.0.17
django-jsonfield==0.9.4
django-stripe-payments==2.0b20
mimeparse==0.1.3
oauthlib==0.4.0
paramiko==1.10.1
psycopg2==2.5
pycrypto==2.6
python-dateutil==2.1
python-openid==2.2.5
pytz==2013b
redis==2.7.5
requests==1.2.0
requests-oauthlib==0.3.1
six==1.3.0
stripe==1.7.9
wsgiref==0.1.2
settings.py
import djcelery
djcelery.setup_loader()

INSTALLED_APPS = (
    ...
    'djcelery',
    ...
)

CACHES = {
    "default": {
        "BACKEND": "redis_cache.cache.RedisCache",
        "LOCATION": "127.0.0.1:6379:1",
        "OPTIONS": {
            "CLIENT_CLASS": "redis_cache.client.DefaultClient",
        }
    }
}
BROKER_URL = 'redis://localhost:6379/0'
When I start my Redis server and run
./manage.py celeryd -B
my connection just keeps timing out with:
Traceback (most recent call last):
File "/venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 395, in start
self.consume_messages()
File "/venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 407, in consume_messages
with self.hub as hub:
File "/venv/lib/python2.7/site-packages/celery/worker/hub.py", line 198, in __enter__
self.init()
File "/venv/lib/python2.7/site-packages/celery/worker/hub.py", line 146, in init
callback(self)
File "/venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 401, in on_poll_init
self.connection.transport.on_poll_init(hub.poller)
File "/venv/lib/python2.7/site-packages/kombu/transport/redis.py", line 749, in on_poll_init
self.cycle.on_poll_init(poller)
File "/venv/lib/python2.7/site-packages/kombu/transport/redis.py", line 266, in on_poll_init
num=channel.unacked_restore_limit,
File "/venv/lib/python2.7/site-packages/kombu/transport/redis.py", line 159, in restore_visible
self.restore_by_tag(tag, client)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/venv/lib/python2.7/site-packages/kombu/transport/redis.py", line 94, in Mutex
pipe.watch(name)
File "/venv/lib/python2.7/site-packages/redis/client.py", line 1941, in watch
return self.execute_command('WATCH', *names)
File "/venv/lib/python2.7/site-packages/redis/client.py", line 1760, in execute_command
return self.immediate_execute_command(*args, **kwargs)
File "/venv/lib/python2.7/site-packages/redis/client.py", line 1779, in immediate_execute_command
return self.parse_response(conn, command_name, **options)
File "/venv/lib/python2.7/site-packages/redis/client.py", line 1883, in parse_response
self, connection, command_name, **options)
File "/venv/lib/python2.7/site-packages/redis/client.py", line 388, in parse_response
response = connection.read_response()
File "/venv/lib/python2.7/site-packages/redis/connection.py", line 309, in read_response
raise response
ResponseError: unknown command 'WATCH'
[2013-06-12 18:25:23,059: ERROR/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
The Redis server log:
[1197] 12 Jun 18:50:09 * Server started, Redis version 1.3.14
[1197] 12 Jun 18:50:09 * DB loaded from disk: 0 seconds
[1197] 12 Jun 18:50:09 * The server is now ready to accept connections on port 6379
[1197] 12 Jun 18:50:09 - Accepted 127.0.0.1:53061
[1197] 12 Jun 18:50:09 - DB 0: 2 keys (0 volatile) in 4 slots HT.
[1197] 12 Jun 18:50:09 - 1 clients connected (0 slaves), 1076976 bytes in use
[1197] 12 Jun 18:50:09 - Accepted 127.0.0.1:53062
[1197] 12 Jun 18:50:09 - Accepted 127.0.0.1:53063
[1197] 12 Jun 18:50:09 - Client closed connection
[1197] 12 Jun 18:50:09 - Accepted 127.0.0.1:53064
[1197] 12 Jun 18:50:09 - Client closed connection
[1197] 12 Jun 18:50:09 - Accepted 127.0.0.1:53065
[1197] 12 Jun 18:50:09 - Client closed connection
[1197] 12 Jun 18:50:09 - Accepted 127.0.0.1:53066
[1197] 12 Jun 18:50:09 - Client closed connection
etc. etc.
Any guidance on where I should be looking or what the possible culprits are? Thanks.
Your Redis server (1.3.14) is too old to be used with Celery. From this traceback you can see that Celery is trying to use the WATCH command, which was only introduced in Redis 2.2, so you need to upgrade your Redis server.
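As a quick sanity check before pointing a worker at the broker, you can compare the version reported by redis-server --version (or redis-cli INFO) against the 2.2 minimum. The helper below is illustrative, not part of Celery or Kombu:

```python
# Illustrative check: the WATCH command used by Kombu's redis transport
# was introduced in Redis 2.2, so any older server fails as shown above.
def supports_watch(version: str) -> bool:
    """Return True if a Redis server of this version supports WATCH."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (2, 2)

print(supports_watch("1.3.14"))  # the asker's server -> False
print(supports_watch("2.6.17"))  # -> True
```

Anything that prints False here will hit the "unknown command 'WATCH'" error as soon as the worker starts consuming.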