We are running Airflow on Kubernetes. Airflow runs the DAGs without any issues, but I see the weird error messages below in the Airflow logs.
However, when I check the logs of the Kubernetes pod where the DAG is running, there are no issues and all the tasks in the DAG completed successfully. Why am I getting the error messages below in the Airflow logs? Also, not all of the output is printed in the Airflow logs, even though everything is printed without any error messages in the Kubernetes pod logs.
[2021-03-19 06:36:33,214] - Sending Signals.SIGTERM to GPID 86142
[2021-03-19 06:36:33,215] - Received SIGTERM. Terminating subprocesses.
[2021-03-19 06:36:33,450] - Pod Launching failed: Task received SIGTERM signal
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py", line 251, in execute
get_logs=self.get_logs)
File "/usr/local/lib/python3.7/site-packages/airflow/contrib/kubernetes/pod_launcher.py", line 117, in run_pod
return self._monitor_pod(pod, get_logs)
File "/usr/local/lib/python3.7/site-packages/airflow/contrib/kubernetes/pod_launcher.py", line 124, in _monitor_pod
for line in logs:
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 807, in __iter__
for chunk in self.stream(decode_content=True):
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 571, in stream
for line in self.read_chunked(amt, decode_content=decode_content):
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 763, in read_chunked
self._update_chunk_length()
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 693, in _update_chunk_length
line = self._fp.fp.readline()
File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/usr/local/lib/python3.7/ssl.py", line 1071, in recv_into
return self.read(nbytes, buffer)
File "/usr/local/lib/python3.7/ssl.py", line 929, in read
return self._sslobj.read(len, buffer)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 943, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 966, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py", line 263, in execute
raise AirflowException('Pod Launching failed: {error}'.format(error=ex))
airflow.exceptions.AirflowException: Pod Launching failed: Task received SIGTERM signal
[2021-03-19 06:36:33,452] - Marking task as UP_FOR_RETRY
[2021-03-19 06:36:33,708] - Process psutil.Process(pid=86142, status='terminated', exitcode=1, started='04:24:15') (86142) terminated with exit code 1
[2021-03-19 06:36:37,599] - 2021-03-19 06:36:37,599 INFO - Task exited with return code 1
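For context, here is a minimal sketch of the kind of task that produces a log like this. The DAG id, namespace, image, and command are hypothetical stand-ins, not our actual code; the relevant part is get_logs=True, which exercises the log-streaming path shown in the traceback.

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

# Hypothetical DAG: a pod-launching task whose stdout is streamed back
# into the Airflow task log via get_logs=True.
with DAG("example_k8s_dag", start_date=datetime(2021, 3, 1),
         schedule_interval=None) as dag:
    run_job = KubernetesPodOperator(
        task_id="run_job",
        name="run-job",
        namespace="default",
        image="python:3.7",
        cmds=["python", "-c", "print('hello')"],
        get_logs=True,  # streaming pod logs is where the SIGTERM lands
        retries=1,      # the task is marked UP_FOR_RETRY after the failure
    )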
Thanks
Related
I successfully created a Flask app locally (Ubuntu) and wanted to host it via Amazon EC2. I copied all the files to AWS and ran the application via:
(testVenv38) ubuntu@ip-xxx-yy-39-70:~/virtuelleUmgebungen/testVenv38/flaskTest$ python3 application.py
* Serving Flask app "flaskTest" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: on
* Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
* Restarting with stat
* Debugger is active!
* Debugger PIN: xxx-yyy-030
Afterwards I expected to access my application from any machine via the following link (highlighted yellow), followed by the "5000" from the definition in my application.py:
app.run(host='0.0.0.0', port=5000, debug=True)
(The yellow link is https://3.69.xyz.abc/5000.) But I get no response. How can I make my app accessible to everybody?
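For reference, here is a minimal app bound the same way (the route and return value are placeholders). Note that with host='0.0.0.0' and port=5000 the development server is normally reached at http://<public-ip>:5000/ over plain HTTP, with the port after a colon rather than as a path segment.

from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    # Placeholder view; the real app's routes are not shown in the question.
    return "hello"

if __name__ == "__main__":
    # 0.0.0.0 binds on all interfaces so the dev server is reachable from
    # outside the VM, provided the security group allows port 5000.
    app.run(host="0.0.0.0", port=5000, debug=True)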
Here is the whole log. It is the same as what I get when I run the app locally, so I guess it does not indicate the problem.
(testVenv38) ubuntu@ip-xxxx:~/virtuelleUmgebungen/testVenv38/flasklogin-tutorial-master$ python3 application.py
* Serving Flask app "flask_login_tutorial" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: on
* Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
* Restarting with stat
* Debugger is active!
* Debugger PIN: xxx
failed to send traces to Datadog Agent at http://localhost:8126
Traceback (most recent call last):
File "/home/ubuntu/virtuelleUmgebungen/testVenv38/lib/python3.8/site-packages/tenacity/__init__.py", line 412, in call
result = fn(*args, **kwargs)
File "/home/ubuntu/virtuelleUmgebungen/testVenv38/lib/python3.8/site-packages/ddtrace/internal/writer.py", line 356, in _send_payload
response = self._put(payload, headers)
File "/home/ubuntu/virtuelleUmgebungen/testVenv38/lib/python3.8/site-packages/ddtrace/internal/writer.py", line 332, in _put
conn.request("PUT", self._endpoint, data, headers)
File "/usr/lib/python3.8/http/client.py", line 1252, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.8/http/client.py", line 1298, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.8/http/client.py", line 1247, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.8/http/client.py", line 1007, in _send_output
self.send(msg)
File "/usr/lib/python3.8/http/client.py", line 947, in send
self.connect()
File "/usr/lib/python3.8/http/client.py", line 918, in connect
self.sock = self._create_connection(
File "/usr/lib/python3.8/socket.py", line 808, in create_connection
raise err
File "/usr/lib/python3.8/socket.py", line 796, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ubuntu/virtuelleUmgebungen/testVenv38/lib/python3.8/site-packages/ddtrace/internal/writer.py", line 458, in flush_queue
self._retry_upload(self._send_payload, encoded, n_traces)
File "/home/ubuntu/virtuelleUmgebungen/testVenv38/lib/python3.8/site-packages/tenacity/__init__.py", line 409, in call
do = self.iter(retry_state=retry_state)
File "/home/ubuntu/virtuelleUmgebungen/testVenv38/lib/python3.8/site-packages/tenacity/__init__.py", line 369, in iter
six.raise_from(retry_exc, fut.exception())
File "<string>", line 3, in raise_from
tenacity.RetryError: RetryError[<Future at 0x7f04d4bdfd60 state=finished raised ConnectionRefusedError>]
failed to send traces to Datadog Agent at http://localhost:8126
(the same traceback repeats a second time)
Edit 2:
The open ports (other than my own IP for SSH access) are the following:
Port range    Protocol    Source                  Security groups
All           All         sg-XXX7113746eda1a92    default
22            TCP         0.0.0.0/0               default
Edit 3:
I added the following entry to the inbound rules:
0 - 65535    TCP    0.0.0.0/0    default
It still does not work.
I'm getting this error when running a Spark cluster on YARN using the AWS EMR service:
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1594292341949_0004/container_1594292341949_0004_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1594292341949_0004/container_1594292341949_0004_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
response = connection.send_command(command)
File "/mnt/yarn/usercache/hadoop/appcache/application_1594292341949_0004/container_1594292341949_0004_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "process_ecommerce.py", line 131, in <module>
cfg["partitions"]["info"]
File "/mnt/yarn/usercache/hadoop/appcache/application_1594292341949_0004/container_1594292341949_0004_01_000001/__pyfiles__/spark_utils.py", line 10, in save_dataframe
.parquet(path)
File "/mnt/yarn/usercache/hadoop/appcache/application_1594292341949_0004/container_1594292341949_0004_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 844, in parquet
File "/mnt/yarn/usercache/hadoop/appcache/application_1594292341949_0004/container_1594292341949_0004_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/mnt/yarn/usercache/hadoop/appcache/application_1594292341949_0004/container_1594292341949_0004_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/mnt/yarn/usercache/hadoop/appcache/application_1594292341949_0004/container_1594292341949_0004_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o343.parquet
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:36063)
Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1594292341949_0004/container_1594292341949_0004_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 929, in _get_connection
connection = self.deque.pop()
IndexError: pop from an empty deque
I'm running the cluster with 1 master and 20 slave nodes of type r5.2xlarge. Each of them has 8 CPUs and 64 GB of memory. The Spark configuration is:
20 GB executor memory
30 GB executor memory overhead
8 cores per executor
1 CPU per task
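For reference, a hedged sketch of those settings expressed as SparkSession config in code; the app name is taken from the script in the traceback, and whether the settings were actually passed this way or via spark-submit flags is an assumption.

from pyspark.sql import SparkSession

# Reconstruction of the cluster settings listed above, for reference only.
spark = (
    SparkSession.builder
    .appName("process_ecommerce")
    .config("spark.executor.memory", "20g")
    .config("spark.executor.memoryOverhead", "30g")  # spark.yarn.executor.memoryOverhead on older Spark
    .config("spark.executor.cores", "8")
    .config("spark.task.cpus", "1")
    .getOrCreate()
)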
How can I solve this error?
I am working on a Python Spark project where, initially, I wrote a script to load a dataframe into Postgres for a particular client, which included a utility function that loads data into Postgres:
df.rdd.repartition(self.__max_conn).foreachPartition(
    lambda iterator: load_utils.load_tab_postgres(conn_prop=conn_prop,
                                                  tab_name=<tablename>,
                                                  iterator=iterator))
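For readers unfamiliar with the pattern, here is a minimal sketch of what such a load_tab_postgres utility might look like. The signature is inferred from the call above; the psycopg2 driver and everything inside the body are assumptions, not the actual utility.

import psycopg2  # assumed driver; the real utility may use something else


def load_tab_postgres(conn_prop, tab_name, iterator):
    """Hypothetical sketch: insert one RDD partition into a Postgres table."""
    rows = [tuple(row) for row in iterator]  # assumes Row/tuple-like records
    if not rows:
        return
    # conn_prop is assumed to be a dict of psycopg2 connection kwargs.
    conn = psycopg2.connect(**conn_prop)
    try:
        with conn.cursor() as cur:
            placeholders = ", ".join(["%s"] * len(rows[0]))
            cur.executemany(
                "INSERT INTO {} VALUES ({})".format(tab_name, placeholders),
                rows,
            )
        conn.commit()
    finally:
        conn.close()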
Initially the entire code was in a single module, including the repartition/foreachPartition snippet above and load_utils(), and it was working perfectly fine.
Later I had to extract the common code, including load_utils, into a base module that could be used by different client modules. This is when the same code failed with the error below:
File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 764, in foreachPartition
File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1004, in count
File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 995, in sum
File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 869, in fold
File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 771, in collect
File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in stage 126.0 failed 4 times, most recent failure: Lost task 18.3 in stage 126.0 (TID 24028, tbsatad6r15g24.company.co.us, executor 242): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
  File "build/bdist.linux-x86_64/egg/basemodule/__init__.py", line 12, in <module>
    import basemodule.entitymodule.base
  File "build/bdist.linux-x86_64/egg/basemodule/entitymodule/base.py", line 12, in <module>
  File "build/bdist.linux-x86_64/egg/basemodule/contexts.py", line 17, in <module>
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/conf.py", line 104, in __init__
    SparkContext._ensure_initialized()
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 245, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/java_gateway.py", line 48, in launch_gateway
    SPARK_HOME = os.environ["SPARK_HOME"]
  File "/dhcommon/dhpython/python/lib/python2.7/UserDict.py", line 23, in __getitem__
    raise KeyError(key)
KeyError: 'SPARK_HOME'
Below is the spark-submit command used to run the code:
spark-submit --master yarn --deploy-mode cluster \
    --driver-class-path postgresql-42.2.4.jre6.jar \
    --jars spark-csv_2.10-1.4.0.jar,commons-csv-1.4.jar,postgresql-42.2.4.jre6.jar \
    --py-files project.egg driver_file.py
In both of the above scenarios, the load_utils file containing the load_tab_postgres method is bundled in project.egg.
I hope somebody here can help. I've been googling this error like crazy but haven't found anything.
I have a pipeline that works perfectly when executed locally, but it fails when executed on GCP. The following are the error messages that I get:
Workflow failed. Causes: S03:Write transform fn/WriteMetadata/ResolveBeamFutures/CreateSingleton/Read+Write transform fn/WriteMetadata/ResolveBeamFutures/ResolveFutures/Do+Write transform fn/WriteMetadata/WriteMetadata failed., A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
Traceback (most recent call last):
  File "preprocess.py", line 491, in <module>
    main()
  File "preprocess.py", line 487, in main
    transform_data(args, pipeline_options, runner)
  File "preprocess.py", line 451, in transform_data
    eval_data |= 'Identity eval' >> beam.ParDo(Identity())
  File "/Library/Python/2.7/site-packages/apache_beam/pipeline.py", line 335, in __exit__
    self.run().wait_until_finish()
  File "/Library/Python/2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 897, in wait_until_finish
    (self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 166, in execute
    op.start()
  File "apache_beam/runners/worker/operations.py", line 294, in apache_beam.runners.worker.operations.DoOperation.start (apache_beam/runners/worker/operations.c:10607)
    def start(self):
  File "apache_beam/runners/worker/operations.py", line 295, in apache_beam.runners.worker.operations.DoOperation.start (apache_beam/runners/worker/operations.c:10501)
    with self.scoped_start_state:
  File "apache_beam/runners/worker/operations.py", line 300, in apache_beam.runners.worker.operations.DoOperation.start (apache_beam/runners/worker/operations.c:9702)
    pickler.loads(self.spec.serialized_fn))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 225, in loads
    return dill.loads(s)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 277, in loads
    return load(file)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 266, in load
    obj = pik.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1083, in load_newobj
    obj = cls.__new__(cls, *args)
TypeError: __new__() takes exactly 4 arguments (1 given)
Any ideas??
Thanks,
Pedro
If the pipeline works locally but fails on GCP, it's possible that you're running into a version mismatch.
Which TensorFlow, tf.Transform, and Beam versions are you running locally and on GCP?
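A quick, hedged way to check is to print the relevant versions in both environments and compare (this assumes tensorflow_transform exposes a __version__ attribute like the other packages):

# Run this both locally and on a GCP worker, then compare the output.
import apache_beam
import tensorflow
import tensorflow_transform  # assumed to expose __version__

print("apache-beam:", apache_beam.__version__)
print("tensorflow:", tensorflow.__version__)
print("tensorflow-transform:", tensorflow_transform.__version__)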
I have a Django Celery task that is only partly executing.
I start up the app and the connection looks good:
INFO/MainProcess] Connected to redis://elasticache.cache.amazonaws.com:6379/0
[2018-02-17 23:27:24,314: INFO/MainProcess] mingle: searching for neighbors
[2018-02-17 23:27:25,339: INFO/MainProcess] mingle: all alone
[2018-02-17 23:27:25,604: INFO/MainProcess] worker1@test_vmstracker_com ready.
I initiate the process, and the task is received and executed:
[2018-02-17 23:27:49,810: INFO/MainProcess] Received task: tracking.tasks.escalate[92f54d48202] ETA:[2018-02-18 07:27:59.797380+00:00]
[2018-02-17 23:27:49,830: INFO/MainProcess] Received task: tracking.tasks.escalate[09a0aebef72b] ETA:[2018-02-18 07:28:19.809712+00:00]
[2018-02-17 23:28:00,205: WARNING/ForkPoolWorker-7] -my app is working-
Then I start getting errors, and the task never finishes the step where my app sends an email:
[2018-02-17 23:28:00,214: ERROR/ForkPoolWorker-7] Connection to Redis lost: Retry (0/20) now.
[2018-02-17 23:28:00,220: ERROR/ForkPoolWorker-7] Connection to Redis lost: Retry (1/2
Does anyone know why the task only partly executes and then the connection is lost?
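For context, a hedged sketch of the kind of task being scheduled here. Only the dotted name tracking.tasks.escalate appears in the log, so the body, arguments, and email details below are assumptions, not the real code.

# tracking/tasks.py -- hypothetical reconstruction of the task in the log
from celery import shared_task
from django.core.mail import send_mail


@shared_task
def escalate(incident_id):
    # Corresponds to the "-my app is working-" warning line above.
    print("-my app is working-")
    # The email step; in the failing runs the error actually occurs after
    # the task body, when Celery stores the result in the Redis backend.
    send_mail(
        subject="Escalation %s" % incident_id,
        message="An incident needs attention.",
        from_email="noreply@example.com",
        recipient_list=["ops@example.com"],
    )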
Here is the full stacktrace:
[2018-02-17 23:28:19,382: WARNING/ForkPoolWorker-7] /usr/local/lib/python3.6/site-packages/celery/app/trace.py:549: RuntimeWarning: Exception raised outside body: ConnectionError("Error while reading from socket: ('Connection closed by server.',)",):
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 177, in _read_from_socket
raise socket.error(SERVER_CLOSED_CONNECTION_ERROR)
OSError: Connection closed by server.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 2879, in execute
return execute(conn, stack, raise_on_error)
File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 2764, in _execute_transaction
self.parse_response(connection, '_')
File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 2838, in parse_response
self, connection, command_name, **options)
File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 680, in parse_response
response = connection.read_response()
File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 624, in read_response
response = self._parser.read_response()
File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 284, in read_response
response = self._buffer.readline()
File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 216, in readline
self._read_from_socket()
File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 191, in _read_from_socket
(e.args,))
redis.exceptions.ConnectionError: Error while reading from socket: ('Connection closed by server.',)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 177, in _read_from_socket
raise socket.error(SERVER_CLOSED_CONNECTION_ERROR)
OSError: Connection closed by server.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/celery/app/trace.py", line 434, in trace_task
uuid, retval, task_request, publish_result,
File "/usr/local/lib/python3.6/site-packages/celery/backends/base.py", line 152, in mark_as_done
self.store_result(task_id, result, state, request=request)
File "/usr/local/lib/python3.6/site-packages/celery/backends/base.py", line 309, in store_result
request=request, **kwargs)
File "/usr/local/lib/python3.6/site-packages/celery/backends/base.py", line 652, in _store_result
self.set(self.get_key_for_task(task_id), self.encode(meta))
File "/usr/local/lib/python3.6/site-packages/celery/backends/redis.py", line 213, in set
return self.ensure(self._set, (key, value), **retry_policy)
File "/usr/local/lib/python3.6/site-packages/celery/backends/redis.py", line 203, in ensure
**retry_policy)
File "/usr/local/lib/python3.6/site-packages/kombu/utils/functional.py", line 333, in retry_over_time
return fun(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/celery/backends/redis.py", line 222, in _set
pipe.execute()
File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 2894, in execute
return execute(conn, stack, raise_on_error)
File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 2764, in _execute_transaction
self.parse_response(connection, '_')
File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 2838, in parse_response
self, connection, command_name, **options)
File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 680, in parse_response
response = connection.read_response()
File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 624, in read_response
response = self._parser.read_response()
File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 284, in read_response
response = self._buffer.readline()
File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 216, in readline
self._read_from_socket()
File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 191, in _read_from_socket
(e.args,))
redis.exceptions.ConnectionError: Error while reading from socket: ('Connection closed by server.',)
exc, exc_info.traceback)))
The problem was that my app server and broker instances were too small. I'm using EC2. As soon as I upgraded to larger hardware, the problem went away. Either the EC2 instance or the ElastiCache instance was too small in terms of CPU, memory, or network.