State of this instance has been externally set to None. Taking the poison pill - airflow-scheduler

We are running Airflow on Kubernetes. Airflow runs the DAGs without any issues, but I see the weird error messages below in the Airflow logs.
However, when I check the logs of the Kubernetes pod where the DAG is running, it doesn't show any issues and all the tasks in the DAG completed successfully. Is there any reason why I'm getting the error messages below in the Airflow logs? Also, not all of the logs are printed in the Airflow logs, even though everything is printed without any error messages in the Kubernetes pod logs.
The Airflow log keeps printing messages until a step in the Python code takes more than one hour; that's when Airflow stops writing to the log file. I'm not sure whether this is related to a task timing out when it runs longer than a specific time.
Please let me know if you have any ideas regarding this. Thanks
[2021-04-08 23:36:33,929] - 2021-04-08 23:36:33,929 INFO - b'orchestrator - Previous append blob deleted\n'
[2021-04-08 23:36:33,930] - 2021-04-08 23:36:33,930 INFO - b'orchestrator - Starting data recipe using recipes.recipe-journal-lines-sap\n'
[2021-04-09 01:50:58,705] - 2021-04-09 01:50:58,705 WARNING - State of this instance has been externally set to None. Taking the poison pill.
[2021-04-09 01:50:58,710] - Sending Signals.SIGTERM to GPID 12851
[2021-04-09 01:50:58,710] - Received SIGTERM. Terminating subprocesses.
[2021-04-09 01:50:59,399] - Pod Launching failed: Task received SIGTERM signal
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py", line 251, in execute
get_logs=self.get_logs)
File "/usr/local/lib/python3.7/site-packages/airflow/contrib/kubernetes/pod_launcher.py", line 117, in run_pod
return self._monitor_pod(pod, get_logs)
File "/usr/local/lib/python3.7/site-packages/airflow/contrib/kubernetes/pod_launcher.py", line 124, in _monitor_pod
for line in logs:
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 808, in __iter__
for chunk in self.stream(decode_content=True):
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 572, in stream
for line in self.read_chunked(amt, decode_content=decode_content):
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 764, in read_chunked
self._update_chunk_length()
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 694, in _update_chunk_length
line = self._fp.fp.readline()
File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/usr/local/lib/python3.7/ssl.py", line 1071, in recv_into
return self.read(nbytes, buffer)
File "/usr/local/lib/python3.7/ssl.py", line 929, in read
return self._sslobj.read(len, buffer)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 943, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
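One pattern worth checking, sketched below as an assumption rather than a confirmed diagnosis: the traceback shows the SIGTERM arriving while KubernetesPodOperator is streaming pod logs over a single long-lived HTTPS connection, and a step that is silent for over an hour can let that connection go stale. Disabling log streaming with get_logs=False (reading the logs from the pod directly instead), together with an explicit execution_timeout, is one possible mitigation. The DAG id, task id, image, and namespace below are placeholders:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

dag = DAG(
    dag_id="example_long_running_pod",  # placeholder DAG id
    start_date=datetime(2021, 4, 1),
    schedule_interval=None,
)

long_step = KubernetesPodOperator(
    task_id="long_running_step",              # placeholder task id
    name="long-running-step",
    namespace="default",                      # placeholder namespace
    image="my-registry/orchestrator:latest",  # placeholder image
    # Don't hold a log-streaming HTTP connection open for hours;
    # read the logs with kubectl instead when this is False.
    get_logs=False,
    # Give the silent >1 hr step explicit headroom rather than
    # relying on defaults.
    execution_timeout=timedelta(hours=4),
    dag=dag,
)

Note that the "externally set to None" message suggests something outside the task (typically the scheduler) reset its state, so this sketch only addresses the log-streaming side of the failure.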

Related

Airflow on GCP - Errno 111 Connection refused

We have Airflow 1.10.5, using CeleryExecutor, running on Google Cloud Platform.
Occasionally, the following error happens:
[2019-12-17 19:00:45,990] {{base_task_runner.py:115}} INFO - Job 704: Subtask our-task-name ERROR: (gcloud.container.clusters.get-credentials) [Errno 111] Connection refused
[2019-12-17 19:00:45,990] {{base_task_runner.py:115}} INFO - Job 704: Subtask our-task-name This may be due to network connectivity issues. Please check your network settings, and the status of the service you are trying to reach.
[2019-12-17 19:00:46,279] {{taskinstance.py:1051}} ERROR - Command '['gcloud', 'container', 'clusters', 'get-credentials', 'airflow-pipeline-name', '--zone', 'us-central1-a', '--project', 'project-name']' returned non-zero exit status 1.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 926, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/contrib/operators/gcp_container_operator.py", line 271, in execute
"--project", self.project_id])
File "/usr/local/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'container', 'clusters', 'get-credentials', 'airflow-pipeline-name', '--zone', 'us-central1-a', '--project', 'project-name']' returned non-zero exit status 1.
[2019-12-17 19:00:46,358] {{taskinstance.py:1082}} INFO - Marking task as FAILED.
Is it a bug in Airflow itself, in its plugins (like the plugin for Kubernetes) or in Google Cloud Platform?
Is there any way to fix it?
The problem was that the metadata server was not responding at some moments. Our colleagues fixed this.
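Since the root cause was transient (the metadata server occasionally not responding), one common mitigation, sketched here under the assumption that the task is safe to re-run, is to let Airflow retry it; retries and retry_delay are standard operator arguments. The names below are the placeholders from the log, and the pod name, namespace, and image are made up:

from datetime import timedelta
from airflow.contrib.operators.gcp_container_operator import GKEPodOperator

our_task = GKEPodOperator(
    task_id="our-task-name",
    project_id="project-name",
    location="us-central1-a",
    cluster_name="airflow-pipeline-name",
    name="our-task-pod",                       # placeholder pod name
    namespace="default",                       # placeholder namespace
    image="gcr.io/project-name/image:latest",  # placeholder image
    # Retry transient failures such as a briefly unresponsive metadata server.
    retries=3,
    retry_delay=timedelta(minutes=2),
    dag=dag,  # assumes a DAG object defined elsewhere
)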

tensorflow IOError: [Errno socket error] [Errno 111] Connection refused

I tried to run a TensorFlow demo. The MNIST dataset has been downloaded, but there is one error. Can anyone tell me what's wrong? Thanks very much! The error details are as follows:
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Traceback (most recent call last):
File "/home/linbinghui/文档/pycode/my_tensorflow_code/test_mnist.py", line 7, in <module>
mnist = input_data.read_data_sets("MNIST_data/", one_hot=False)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py", line 189, in read_data_sets
local_file = maybe_download(TEST_IMAGES, train_dir, SOURCE_URL + TEST_IMAGES)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/base.py", line 81, in maybe_download
urllib.request.urlretrieve(source_url, temp_file_name)
File "/usr/lib/python2.7/urllib.py", line 98, in urlretrieve
return opener.retrieve(url, filename, reporthook, data)
File "/usr/lib/python2.7/urllib.py", line 245, in retrieve
fp = self.open(url, data)
return getattr(self, name)(url)
File "/usr/lib/python2.7/urllib.py", line 350, in open_http
h.endheaders(data)
File "/usr/lib/python2.7/httplib.py", line 1053, in endheaders
self._send_output(message_body)
File "/usr/lib/python2.7/httplib.py", line 897, in _send_output
self.send(msg)
File "/usr/lib/python2.7/httplib.py", line 859, in send
self.connect()
File "/usr/lib/python2.7/httplib.py", line 836, in connect
self.timeout, self.source_address)
File "/usr/lib/python2.7/socket.py", line 575, in create_connection
raise err
IOError: [Errno socket error] [Errno 111] Connection refused
This code is attempting to download https://storage.googleapis.com/cvdf-datasets/mnist/t10k-images-idx3-ubyte.gz and failing. It failed because of "Connection Refused" which generally indicates that the remote end is not running a server on the port you tried to contact it on.
This URL refers to a Google storage service. I was able to successfully download this file. Either you encountered a transient failure of Google's service, or some intermediary between you and Google caused this problem.
Normally "connection refused" is not caused by anything other than the intended remote end being unavailable (there's a computer there but no specific service). However, in the face of modern HTTP and HTTPS proxies, DNS redirection and the like, you could very well have encountered some feature of your business/school/home/government internet interdiction. HTTPS urls can be troubling to the entity hosting your internet service because it represents a private communication channel through which you could download malware or upload secrets. This troubling nature makes it more likely to be intercepted or redirected or disabled entirely.
I recommend that you troubleshoot this problem with wget/curl or similar on your machine. If those work well, try a small Python script with the requests package, as sketched below. Consider also the impact of environment variables on these utilities/libraries. Try repeating this procedure at network endpoints other than the one you're using.
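For example, a minimal connectivity probe with requests, using the URL from the traceback's data source:

import requests

URL = "https://storage.googleapis.com/cvdf-datasets/mnist/t10k-images-idx3-ubyte.gz"

try:
    # stream=True tests reachability without downloading the whole file
    resp = requests.get(URL, stream=True, timeout=10)
    print("status:", resp.status_code)
    print("first bytes:", next(resp.iter_content(chunk_size=16)))
except requests.exceptions.ConnectionError as exc:
    print("connection refused/reset:", exc)

If this fails where wget/curl succeed, compare proxy-related environment variables such as http_proxy and https_proxy between the two environments.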
If you find that the results of repeated tests are unstable even at the same network endpoint, perhaps you're facing local load balancers/proxies or some other transient local failure. When in doubt, contact your local network support team.

Connect to AS400 database with Django (ibm_db_django)

I have created a new project in Django and am following this tutorial to help me pull data from my remote database. Whenever I try to run migrate in manage.py I get the following error:
"C:\Program Files (x86)\JetBrains\PyCharm 2016.2.1\bin\runnerw.exe" C:\Users\ecjohnson\AppData\Local\Programs\Python\Python35-32\python.exe "C:\Program Files (x86)\JetBrains\PyCharm 2016.2.1\helpers\pycharm\django_manage.py" migrate U:/incoming_parts_monitor
Traceback (most recent call last):
File "C:\Users\ecjohnson\AppData\Local\Programs\Python\Python35-32\lib\site-packages\ibm_db_dbi.py", line 585, in connect
conn = ibm_db.connect(dsn, '', '', conn_options)
SQLCODE=-30081
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\ecjohnson\AppData\Local\Programs\Python\Python35-32\lib\site-packages\django\db\backends\base\base.py", line 199, in ensure_connection
self.connect()
File "C:\Users\ecjohnson\AppData\Local\Programs\Python\Python35-32\lib\site-packages\django\db\backends\base\base.py", line 171, in connect
self.connection = self.get_new_connection(conn_params)
I've connected to this database before in other .NET web applications, so I'm not sure why this is not working.
SQLCODE=-30081 is a generic "unable to connect" condition. Look at the configuration and connection details on both the requesting and server systems, and review the server side for any messaging about a connection issue. On the server, it can help to review the Display Log (DSPLOG) for the history (QHST), and then the joblog(s) of the job(s) that originated that messaging; such messages indicate that the communications-level connection was at least completed, but some error after that initial connection prevented establishing an actual database connection. For example, perhaps the prestart job, or the subsystem in which the database server jobs operate, is not started.
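To separate a Django/ibm_db_django configuration problem from a lower-level connectivity problem, one can first try connecting with ibm_db directly. This is a minimal sketch, and every connection detail in the DSN below is a placeholder:

import ibm_db

dsn = (
    "DATABASE=MYDB;"               # placeholder database name
    "HOSTNAME=as400.example.com;"  # placeholder host
    "PORT=446;"                    # placeholder port
    "PROTOCOL=TCPIP;"
    "UID=myuser;"                  # placeholder user
    "PWD=mypassword;"              # placeholder password
)

try:
    conn = ibm_db.connect(dsn, "", "")
    print("connected")
    ibm_db.close(conn)
except Exception:
    # The driver reports the SQLCODE in the message, e.g. SQLCODE=-30081
    print("connect failed:", ibm_db.conn_errormsg())

If this reproduces SQLCODE=-30081, the problem lies below Django, in network reachability or in the server-side jobs described above.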

connecting to remote server using fabric library in django

We are building our application using the Django web framework.
We are facing a very critical issue in our application: we are trying to connect to a remote server using the Fabric library.
Normally, from the command line, we are able to connect to the server. It also runs fine in Django's built-in development server, since that runs from the command line.
Command
fab get_string -H user@10.10.10.10 >> django.txt 2>&1
We deployed our code on Apache, and the application gets stuck when it encounters this command in the function. Once in a while we get these logs:
"> Using fabfile 'C:\fabfile.py'
Commands to run: generic_task_linux_django
Parallel tasks now using pool size of 1
[user@10.10.10.10] Executing task 'generic_task_linux_django'
2012-08-30 09:36:15.805000
[user@10.10.10.10] run: /bin/bash -l -c "rm -rf /tmp//admin7"
Timed out trying to connect to 10.10.10.10 (attempt 1 of 1), giving up
Fatal error: Timed out trying to connect to 10.10.10.10 (tried 1 time)
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\fabric-1.4.1-py2.7.egg\fabric\tasks.py", line 298, in execute
multiprocessing
File "C:\Python27\lib\site-packages\fabric-1.4.1-py2.7.egg\fabric\tasks.py", line 197, in _execute
return task.run(*args, **kwargs)
File "C:\Python27\lib\site-packages\fabric-1.4.1-py2.7.egg\fabric\tasks.py", line 112, in run
return self.wrapped(*args, **kwargs)
File "C:\fabfile.py", line 314, in generic_task_linux_django
run("rm -rf "+remote_path + '/' + local_dir_name)
File "C:\Python27\lib\site-packages\fabric-1.4.1-py2.7.egg\fabric\network.py", line 457, in host_prompting_wrapper
return func(*args, **kwargs)
File "C:\Python27\lib\site-packages\fabric-1.4.1-py2.7.egg\fabric\operations.py", line 905, in run
return _run_command(command, shell, pty, combine_stderr)
File "C:\Python27\lib\site-packages\fabric-1.4.1-py2.7.egg\fabric\operations.py", line 815, in _run_command
stdout, stderr, status = _execute(default_channel(), wrapped_command, pty,
File "C:\Python27\lib\site-packages\fabric-1.4.1-py2.7.egg\fabric\state.py", line 340, in default_channel
chan = connections[env.host_string].get_transport().open_session()
File "C:\Python27\lib\site-packages\fabric-1.4.1-py2.7.egg\fabric\network.py", line 84, in __getitem__
self.connect(key)
File "C:\Python27\lib\site-packages\fabric-1.4.1-py2.7.egg\fabric\network.py", line 76, in connect
self[key] = connect(user, host, port)
File "C:\Python27\lib\site-packages\fabric-1.4.1-py2.7.egg\fabric\network.py", line 393, in connect
raise NetworkError(msg, e)
NetworkError: Timed out trying to connect to 10.10.10.10 (tried 1 time)
Aborting."
as it is able to terminate the connection. But in the majority of cases it gets stuck, and we don't have any logs to show.
Then we decided to run it as a Windows service. We used the attached Python file, which creates the service for us (recipe.py).
Again, we are facing the same issue as with Apache.
You might want to see if your connection is failing due to SSH credentials or similar. You can get more information if you follow my suggestion here and have the ssh/paramiko library log verbosely.
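A minimal sketch of that suggestion, assuming Fabric 1.x (which uses paramiko underneath): placed at the top of the fabfile, this writes the SSH-level negotiation to a file, which is useful when Apache swallows stdout/stderr. The log path is a placeholder; pick one Apache's service account can write to.

import logging
import paramiko

# Log paramiko's transport-level activity (banner exchange, auth,
# channel opens) so a silent hang can be diagnosed after the fact.
paramiko.util.log_to_file("C:\\paramiko.log", level=logging.DEBUG)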

Django runserver keeps timing out on osx lion

I'm having issues with runserver on OS X Lion. Static assets randomly fail to transfer, and I occasionally get this message:
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 57555)
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/SocketServer.py", line 284, in _handle_request_noblock
self.process_request(request, client_address)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/SocketServer.py", line 310, in process_request
self.finish_request(request, client_address)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/SocketServer.py", line 323, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/Users/ehutch79/pyenv/sd/lib/python2.7/site-packages/django/core/servers/basehttp.py", line 570, in __init__
BaseHTTPRequestHandler.__init__(self, *args, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/SocketServer.py", line 639, in __init__
self.handle()
File "/Users/ehutch79/pyenv/sd/lib/python2.7/site-packages/django/core/servers/basehttp.py", line 610, in handle
self.raw_requestline = self.rfile.readline()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 447, in readline
data = self._sock.recv(self._rbufsize)
timeout: timed out
----------------------------------------
It's very inconsistent. Google searches have revealed nothing. Does anyone have any idea what might be causing it?
It doesn't seem to be the browser, as I've seen it in Chrome locally, and in both IE and Chrome on a remote box.
The dev server is monothreaded, so if two connections are made at the same time and both block, they will wait forever on each other. Make sure you don't have such a problem:
check if you have some settings that can trigger two connections at the same time (some plugin, or some JavaScript that makes connections implicitly).
check if you have some views that access the same blocking resource at the same time.
This happened to me recently when trying to test SOAP calls from my own code to my own code: urllib was opening, from one view, a URL that triggered another view.
Sometimes it's not access to a blocking resource, but the concurrency makes the requests very slow, so they time out. E.g.: you have two browsers pointed at the same dev server and they are doing aggressive HTTP keep-alive.
There is nothing to do except use something else to test these specific features, such as gunicorn or Werkzeug. For Werkzeug, install django-extensions and use ./manage.py runserver_plus --threaded
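A minimal sketch of that workaround, assuming django-extensions (and Werkzeug, which runserver_plus uses) are installed with pip: the only settings change is registering the app, after which the threaded server command above becomes available.

# settings.py (excerpt)
INSTALLED_APPS = [
    # ... your existing apps ...
    'django_extensions',  # provides the runserver_plus management command
]

Then ./manage.py runserver_plus --threaded serves each request in its own thread, so two mutually blocking requests no longer deadlock the dev server.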