CeleryBeat suddenly skips scheduling a few tasks when using a custom DatabaseScheduler - django

We are using
django-celery==3.1.10
celery==3.1.20
python 2.7.13
Broker - RabbitMQ
We have written a custom DatabaseScheduler that schedules tasks by reading entries from a MySQL table (a Django model); a column of that table specifies the time, and tasks are normally scheduled at that time without problems. We run the CeleryBeat process from an init script.
Our scheduler model has roughly 3000 entries. Most of them are scheduled every 5 minutes, some every 15 minutes, and a few hourly. However, some tasks occasionally get skipped, i.e. they are not scheduled at their expected time. This happens randomly and can affect any of the tasks.
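For illustration only (this is not our actual schema), the table the scheduler reads is shaped roughly like this Django model:
from django.db import models


class ScheduledTask(models.Model):
    # Illustrative sketch -- our real table has more columns.
    task_name = models.CharField(max_length=255)        # dotted path of the Celery task
    run_every_minutes = models.PositiveIntegerField()   # 5, 15 or 60 in our case
    last_run_at = models.DateTimeField(null=True, blank=True)
    enabled = models.BooleanField(default=True)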
While digging through the beat logs, we found a MySQL-related exception:
Traceback (most recent call last):
File "/opt/DataMonster/datamonster/db_monster/scheduler.py", line 202, in schedule_changed
transaction.commit()
File "/opt/python-2.7.11/lib/python2.7/site-packages/django/db/transaction.py", line 154, in commit
get_connection(using).commit()
File "/opt/python-2.7.11/lib/python2.7/site-packages/django/db/backends/__init__.py", line 175, in commit
self.validate_no_atomic_block()
File "/opt/python-2.7.11/lib/python2.7/site- packages/django/db/backends/__init__.py", line 381, in validate_no_atomic_block
"This is forbidden when an 'atomic' block is active.")
We have checked this error on multiple sites and it appears to be related to the isolation level. We did not find any other exception in the beat logs.
The isolation level used in MySQL is READ-UNCOMMITTED.
Any help in digging into this issue would be appreciated.
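For context, the failing call is the manual transaction.commit() inside schedule_changed(). Below is a minimal sketch (illustrative, not our actual scheduler code) of a guard that skips the manual commit while an atomic() block is open, using only Django's documented transaction API; committing inside an open atomic() block is exactly what raises the error above:
from django.db import transaction


def commit_if_allowed(using=None):
    """Commit the current transaction unless an atomic() block is active."""
    conn = transaction.get_connection(using)
    if conn.in_atomic_block:
        # Committing here is what raises
        # "This is forbidden when an 'atomic' block is active." --
        # let the enclosing atomic() block commit instead.
        return False
    transaction.commit(using)
    return True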

Related

Elastic Beanstalk server using 97% memory warning

I just discovered that my Elastic Beanstalk server's health status shows the warning "97% of memory is in use". Because of this I cannot deploy updates or SSH in and run the Django shell; I just receive the following error:
MemoryError
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/var/app/venv/staging-LQM1lest/lib/python3.7/site-packages/sentry_sdk/worker.py", line 70, in start
self._thread.start()
File "/var/app/venv/staging-LQM1lest/lib/python3.7/site-packages/sentry_sdk/integrations/threading.py", line 54, in sentry_start
return old_start(self, *a, **kw) # type: ignore
File "/usr/lib64/python3.7/threading.py", line 852, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
I received more memory errors when trying AWS's documented troubleshooting suggestions, which required installing a package:
sudo amazon-linux-extras install epel
From here I don't know what else I can do to troubleshoot this issue or how to fix it.

"ImportError: No module named idlelib" when running Google Dataflow worker

I have a Python 2.7 script that I run locally to launch an Apache Beam / Google Dataflow job (SDK 2.12.0). The job takes a CSV file from a Google Cloud Storage bucket, processes it and then creates an entity in Google Datastore for each row. The script ran fine for years, but now it is failing:
INFO:root:2019-05-15T22:07:11.481Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
INFO:root:2019-05-15T21:47:13.370Z: JOB_MESSAGE_ERROR: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 773, in run
self._load_main_session(self.local_staging_directory)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 489, in _load_main_session
pickler.load_session(session_file)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 280, in load_session
return dill.load_session(file_path)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 410, in load_session
module = unpickler.load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
value = func(*args)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 827, in _import_module
return __import__(import_name)
ImportError: No module named idlelib
I believe this error is happening at the worker level (not locally); I don't reference idlelib anywhere in my script. To make sure it wasn't on my side, I updated all the google-cloud packages, apache-beam[gcp], etc. locally, just in case. When I tried importing idlelib in my script I got the same error. Any suggestions?
It had been fine for years and started failing from the SDK 2.12.0 release.
What was the last release that this script succeeded on? 2.11?
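Not an answer, but for context: the failing call is dill.load_session() on the worker, which (as far as I can tell) only runs when the pipeline is launched with the main session saved. If the launching environment (an IDE or IDLE shell, for example) has pulled idlelib into the __main__ session, dill pickles a reference to it and the worker, which has no idlelib installed, fails on import. A minimal sketch of the knob involved (the SetupOptions flag is standard Beam API; the rest of the pipeline is illustrative):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions()
# Assumption: the original script enables save_main_session (directly or via a flag).
# Disabling it -- and keeping imports inside the DoFns/lambdas that need them --
# means nothing from the launcher's interactive session has to be unpickled on the worker.
options.view_as(SetupOptions).save_main_session = False

with beam.Pipeline(options=options) as p:
    (p
     | 'Create' >> beam.Create(['example row'])
     | 'PassThrough' >> beam.Map(lambda row: row))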

Airflow: How to unpause DAG from python script

I'm creating a Python script to generate DAGs (it generates a new Python file with the DAG specification) from templates. It all works fine, except that I need the DAG to be generated as unpaused.
I've searched and tried to run shell commands in the script like this:
import subprocess

bash_command1 = 'airflow list_dags'
bash_command2 = 'airflow trigger_dag ' + str(DAG_ID)
bash_command3 = 'airflow list_tasks ' + str(DAG_ID)
bash_command4 = 'airflow unpause ' + str(DAG_ID)

subprocess.call(bash_command1.split())
subprocess.call(bash_command2.split())
subprocess.call(bash_command3.split())
subprocess.call(bash_command4.split())
But every time I create a new DAG it is shown as paused in the web UI.
From the research I've done, the command airflow unpause <dag_id> should solve the problem, but when the script executes it, I get the error:
Traceback (most recent call last):
File "/home/cubo/anaconda2/bin/airflow", line 28, in <module>
args.func(args)
File "/home/cubo/anaconda2/lib/python2.7/site-packages/airflow/bin/cli.py", line 303, in unpause
set_is_paused(False, args, dag)
File "/home/cubo/anaconda2/lib/python2.7/site- packages/airflow/bin/cli.py", line 312, in set_is_paused
dm.is_paused = is_paused
AttributeError: 'NoneType' object has no attribute 'is_paused'
But when I execute the same airflow unpause <dag_id> command in the terminal it works fine, and it prints:
Dag: <DAG: DAG_ID>, paused: False
Any help would be greatly appreciated.
Airflow (1.8 and newer) pauses new DAGs by default. The 'NoneType' error most likely means that, at the moment your script calls airflow unpause, the scheduler has not yet parsed the newly generated file into the metadata database, so there is no DAG row to update. If you want all DAGs to be unpaused at creation, you can override the Airflow config to retain the prior behaviour of unpausing at creation.
Here's the link that walks you through setting configuration options. You want to set the core configuration setting dags_are_paused_at_creation to False.
We use the environment variable approach on my team.
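For example (the config section and the AIRFLOW__SECTION__KEY environment-variable form both follow Airflow's documented convention; only the value shown needs to change):
# airflow.cfg
[core]
dags_are_paused_at_creation = False

# or, equivalently, as an environment variable set before starting the scheduler/webserver:
export AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False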

AssertionError: INTERNAL: No default project is specified

New to Airflow. I'm trying to run SQL and store the result in a BigQuery table.
I'm getting the following error and am not sure where to set up the default project id.
Please help me.
Error:
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 28, in <module>
args.func(args)
File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 585, in test
ti.run(ignore_task_deps=True, ignore_ti_state=True, test_mode=True)
File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 53, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1374, in run
result = task_copy.execute(context=context)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/operators/bigquery_operator.py", line 82, in execute
self.allow_large_results, self.udf_config, self.use_legacy_sql)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/bigquery_hook.py", line 228, in run_query
default_project_id=self.project_id)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/bigquery_hook.py", line 917, in _split_tablename
assert default_project_id is not None, "INTERNAL: No default project is specified"
AssertionError: INTERNAL: No default project is specified
Code:
sql_bigquery = BigQueryOperator(
    task_id='sql_bigquery',
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    allow_large_results=True,
    bql='''
    #standardSQL
    SELECT ID, Name, Group, Mark, RATIO_TO_REPORT(Mark) OVER(PARTITION BY Group) AS percent FROM `tensile-site-168620.temp.marks`
    ''',
    destination_dataset_table='temp.percentage',
    dag=dag
)
EDIT: I finally fixed this problem by simply adding the bigquery_conn_id='bigquery' parameter in the BigQueryOperator task, after running the code below in a separate python script.
Apparently you need to specify your project ID in Admin -> Connection in the Airflow UI. You must do this as a JSON object such as "project" : "".
Personally I can't get the webserver working on GCP so this is unfeasible. There is a programmatic solution here:
from airflow.models import Connection
from airflow.settings import Session

session = Session()
gcp_conn = Connection(
    conn_id='bigquery',
    conn_type='google_cloud_platform',
    extra='{"extra__google_cloud_platform__project":"<YOUR PROJECT HERE>"}')

if not session.query(Connection).filter(
        Connection.conn_id == gcp_conn.conn_id).first():
    session.add(gcp_conn)
    session.commit()
These suggestions are from a similar question here.
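Putting the pieces together, here is a minimal illustrative sketch of the operator from the question with the explicit connection id from the edit. It assumes the 'bigquery' connection created above already exists; the DAG id, start date and query are placeholders:
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

dag = DAG('bq_example', start_date=datetime(2018, 1, 1), schedule_interval=None)

sql_bigquery = BigQueryOperator(
    task_id='sql_bigquery',
    bigquery_conn_id='bigquery',      # the connection added programmatically above
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    allow_large_results=True,
    bql='SELECT 1 AS placeholder',    # the original #standardSQL query goes here
    destination_dataset_table='temp.percentage',
    dag=dag,
)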
I get the same error when running Airflow locally. My solution is to add the following connection string as an environment variable:
AIRFLOW_CONN_BIGQUERY_DEFAULT="google-cloud-platform://?extra__google_cloud_platform__project=<YOUR PROJECT HERE>"
BigQueryOperator uses the "bigquery_default" connection. When it is not configured, local Airflow falls back to an internal version of the connection that is missing the project_id property. As you can see, the connection string above provides the project_id property.
On startup, Airflow loads environment variables that start with "AIRFLOW_" into memory. This mechanism can be used to override Airflow properties and to provide connections when running locally, as explained in the Airflow documentation here. Note that this also works when running Airflow directly, without starting the web server.
So I have set up environment variables for all my connections, for example AIRFLOW_CONN_MYSQL_DEFAULT. I have put them into a .env file that gets sourced by my IDE, but putting them into your .bash_profile would work fine too.
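For example, a .env file along these lines (the project and MySQL credentials are placeholders; the URIs follow Airflow's connection-string format):
# .env -- sourced before starting airflow locally
AIRFLOW_CONN_BIGQUERY_DEFAULT="google-cloud-platform://?extra__google_cloud_platform__project=<YOUR PROJECT HERE>"
AIRFLOW_CONN_MYSQL_DEFAULT="mysql://user:password@localhost:3306/mydb"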
When you look inside your Airflow instance on Cloud Composer, you see that the "bigquery_default" connection there has the project_id property set. That's why BigQueryOperator works when running through Cloud Composer.
(I am on airflow 1.10.2 and BigQuery 1.10.2)

Delayed Job on Heroku does not work

My app runs fine on my local machine (which has 16 GB of RAM) when I use the 'heroku local' command to start both the dyno and the workers from the Procfile. The background jobs queued in Delayed Job are processed one by one and the table is then emptied. When I run on Heroku, it fails to execute the background processing at all; it gets stuck with the following out-of-memory messages in my log file:
2016-04-03T23:48:06.382070+00:00 app[web.1]: Using rack adapter
2016-04-03T23:48:06.382149+00:00 app[web.1]: Thin web server (v1.6.4 codename Gob Bluth)
2016-04-03T23:48:06.382154+00:00 app[web.1]: Maximum connections set to 1024
2016-04-03T23:48:06.382155+00:00 app[web.1]: Listening on 0.0.0.0:7557, CTRL+C to stop
2016-04-03T23:48:06.711418+00:00 heroku[web.1]: State changed from starting to up
2016-04-03T23:48:37.519962+00:00 heroku[worker.1]: Process running mem=541M(105.8%)
2016-04-03T23:48:37.519962+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2016-04-03T23:48:59.317063+00:00 heroku[worker.1]: Process running mem=708M(138.3%)
2016-04-03T23:48:59.317063+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2016-04-03T23:49:21.449475+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2016-04-03T23:49:21.449325+00:00 heroku[worker.1]: Process running mem=829M(161.9%)
2016-04-03T23:49:24.273557+00:00 app[worker.1]: rake aborted!
2016-04-03T23:49:24.273587+00:00 app[worker.1]: Can't modify frozen hash
2016-04-03T23:49:24.274764+00:00 app[worker.1]: /app/vendor/bundle/ruby/2.2.0/gems/activerecord-4.2.6/lib/active_record/attribute_set/builder.rb:45:in `[]='
2016-04-03T23:49:24.274771+00:00 app[worker.1]: /app/vendor/bundle/ruby/2.2.0/gems/activerecord-4.2.6/lib/active_record/attribute_set.rb:39:in `write_from_user'
I know that R14 is an out-of-memory error, so I have two questions:
Is there any way Delayed Job can be tuned to use less memory? There will be some disk swapping involved, but at least it will run.
Why do I keep getting the "rake aborted! Can't modify frozen hash" error (lines 4 and 5 from the bottom of the log shown above)? I do not get it in my local environment. What does it mean? Is it memory related?
Thanks in advance for your time. I am running Rails 4.2.6 and delayed_job 4.1.1 as shown below:
→ gem list | grep delayed
delayed_job (4.1.1)
delayed_job_active_record (4.1.0)
delayed_job_web (1.2.10)
Bharat
I found the problem. I am posting my solution here for those who may be running into similar problems.
I increased the Heroku worker memory to 2 standard dynos, meaning I gave it 1 GB of memory, to remove the memory quota problem. That made R14 go away, but I still continued to get
rake aborted!
Can't modify frozen hash
error, and the program would then crash, so the problem was clearly here. After much research I found that the previous programmer had used the 'workless' gem to reduce Heroku charges. The workless gem puts Heroku workers to sleep when they are not being used, so no charges are incurred while nothing is running.
What I did not mention in my original question is that I had upgraded the app from Rails 3.2.9 to Rails 4.2.6. My research also showed that the workless gem had not been updated in the last three years and there was no mention of Rails 4 on its site, so chances were that it might not work well with Rails 4.2.6 and Heroku.
I saw some lines in my stack trace related to the workless gem. That was a clue to see what would happen if I removed this gem from production, so I removed it and redeployed.
The frozen hash error went away and my delayed_job worker ran successfully to completion on Heroku.
The lesson for me was to read the logs carefully and check all the dependencies :)
Hope this helps.