APScheduler runs only a limited number of jobs - python-2.7

I realized that APScheduler (3.3.1) with Python (2.7) only runs a limited number of my jobs, so I made a test file to check this more carefully. This is my code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import datetime
import time

from apscheduler.schedulers.background import BackgroundScheduler


def fetch_job(a):
    print a
    time.sleep(1)
    print a, ' awakened'


scheduler = BackgroundScheduler()
schedules = [
    {'source': '1', 'interval': 10, 'job': fetch_job},
    {'source': '2', 'interval': 10, 'job': fetch_job},
    {'source': '3', 'interval': 10, 'job': fetch_job},
    {'source': '4', 'interval': 10, 'job': fetch_job},
    {'source': '5', 'interval': 10, 'job': fetch_job},
    {'source': '6', 'interval': 10, 'job': fetch_job},
    {'source': '7', 'interval': 10, 'job': fetch_job},
    {'source': '8', 'interval': 10, 'job': fetch_job},
    {'source': '9', 'interval': 10, 'job': fetch_job},
    {'source': '10', 'interval': 10, 'priority': 3, 'job': fetch_job},
    {'source': '11', 'interval': 10, 'job': fetch_job},
    {'source': '12', 'interval': 10, 'job': fetch_job},
    {'source': '13', 'interval': 10, 'job': fetch_job},
    {'source': '14', 'interval': 10, 'job': fetch_job},
]

for i, k in enumerate(schedules):
    scheduler.add_job(k['job'], 'interval', seconds=k['interval'],
                      args=[k['source']], next_run_time=datetime.datetime.now())

scheduler.start()

while True:
    time.sleep(2)
When I run it, the output looks like this:
1
2
3
4
5
6
7
8
9
10
1 awakened
No handlers could be found for logger "apscheduler.executors.default"
2 awakened
43 awakened65
awakened
awakened
awakened
7 awakened
8 awakened
9 awakened
10 awakened
On the first run only 10 jobs trigger, even though I added 14 jobs to the scheduler; when the scheduler fires the second round and onwards, I see the other jobs too.
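If it is useful for context: APScheduler's default executor is a thread pool capped at 10 workers, which lines up with exactly 10 jobs starting per round, and the "No handlers could be found for logger" line means APScheduler's own warnings are being swallowed because logging isn't configured. A minimal sketch, not from the original post, of configuring logging and enlarging the pool (the figure 20 is only an illustrative choice):

import logging

from apscheduler.executors.pool import ThreadPoolExecutor
from apscheduler.schedulers.background import BackgroundScheduler

logging.basicConfig()  # make APScheduler's warnings/errors visible

# enlarge the default thread pool so more than 10 jobs can run concurrently
executors = {'default': ThreadPoolExecutor(max_workers=20)}
scheduler = BackgroundScheduler(executors=executors)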

Related

How to transform data before loading into BigQuery in Apache Airflow?

I am new to Apache Airflow. My task is to read data from Google Cloud Storage, transform the data, and upload the transformed data into a BigQuery table. I am able to get data from the Cloud Storage bucket and store it directly in a BigQuery table; I'm just not sure how to include the transform function in this pipeline.
Here's my code:
# Import libraries needed for the operation
import airflow
from datetime import timedelta, datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

# Default Arguments
default_args = {
    'owner': <OWNER_NAME>,
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=2),
}

# DAG Definition
dag = DAG('load_from_bucket_to_bq',
          schedule_interval='0 * * * *',
          default_args=default_args)

# Variable Configurations
BQ_CONN_ID = <CONN_ID>
BQ_PROJECT = <PROJECT_ID>
BQ_DATASET = <DATASET_ID>

with dag:
    # Tasks
    start = DummyOperator(
        task_id='start'
    )
    upload = GoogleCloudStorageToBigQueryOperator(
        task_id='load_from_bucket_to_bigquery',
        bucket=<BUCKET_NAME>,
        source_objects=['*.csv'],
        schema_fields=[
            {'name': 'Active_Cases', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'Country', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'Last_Update', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'New_Cases', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'New_Deaths', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'Total_Cases', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'Total_Deaths', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'Total_Recovered', 'type': 'STRING', 'mode': 'NULLABLE'},
        ],
        destination_project_dataset_table=BQ_PROJECT + '.' + BQ_DATASET + '.' + <TABLE_NAME>,
        write_disposition='WRITE_TRUNCATE',
        google_cloud_storage_conn_id=BQ_CONN_ID,
        bigquery_conn_id=BQ_CONN_ID,
        dag=dag
    )
    end = DummyOperator(
        task_id='end'
    )
    # Setting Dependencies
    start >> upload >> end
Any help on how to proceed is appreciated. Thanks.
Posting the conversation with @sachinmb27 as an answer: the transform can be placed in a Python function, and a PythonOperator can call that function at runtime. More details on which operators can be used in Airflow can be found in the Airflow operator docs.
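A minimal sketch of that suggestion, assuming the same contrib-era Airflow as the imports above; the transform_data callable and its body are hypothetical placeholders:

from airflow.operators.python_operator import PythonOperator

def transform_data(**context):
    # hypothetical transform step: e.g. read the CSVs from the bucket,
    # clean/reshape them, and write the transformed files back to GCS so
    # the existing GoogleCloudStorageToBigQueryOperator can load them
    pass

with dag:
    transform = PythonOperator(
        task_id='transform_data',
        python_callable=transform_data,
        provide_context=True,
    )
    # run the transform between start and the BigQuery load
    start >> transform >> upload >> end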

Celery Django Body Encoding

Does anyone know how the body of a Celery JSON message is encoded before it is put into the queue (I use Redis in my case)?
{'body': 'W1sic2hhd25AdWJ4LnBoIiwge31dLCB7fSwgeyJjYWxsYmFja3MiOiBudWxsLCAiZXJyYmFja3MiOiBudWxsLCAiY2hhaW4iOiBudWxsLCAiY2hvcmQiOiBudWxsfV0=',
'content-encoding': 'utf-8',
'content-type': 'application/json',
'headers': {'lang': 'py',
'task': 'export_users',
'id': '6e506f75-628e-4aa1-9703-c0185c8b3aaa',
'shadow': None,
'eta': None,
'expires': None,
'group': None,
'retries': 0,
'timelimit': [None, None],
'root_id': '6e506f75-628e-4aa1-9703-c0185c8b3aaa',
'parent_id': None,
'argsrepr': "('<email@example.com>', {})",
'kwargsrepr': '{}',
'origin': 'gen187209@ubuntu'},
'properties': {'correlation_id': '6e506f75-628e-4aa1-9703-c0185c8b3aaa',
'reply_to': '403f7314-384a-30a3-a518-65911b7cba5c',
'delivery_mode': 2,
'delivery_info': {'exchange': '', 'routing_key': 'celery'},
'priority': 0,
'body_encoding': 'base64',
'delivery_tag': 'dad6b5d3-c667-473e-a62c-0881a7349684'}}
Just for background: I have a Node.js project which needs to trigger my Celery (Django) tasks. The background tasks are all in the Django app, but the trigger and the task details will come from the Node.js app.
Thanks in advance
It may just be simpler to use the Node.js Celery client (https://github.com/mher/node-celery/blob/master/celery.js) to invoke a Celery task from Node.js.
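As for the encoding itself: the 'body_encoding': 'base64' property in the dump above indicates that the body is simply the JSON-serialized message payload (args, kwargs and embedded options) run through base64. A quick sketch to check that against the captured body:

import base64
import json

body = 'W1sic2hhd25AdWJ4LnBoIiwge31dLCB7fSwgeyJjYWxsYmFja3MiOiBudWxsLCAiZXJyYmFja3MiOiBudWxsLCAiY2hhaW4iOiBudWxsLCAiY2hvcmQiOiBudWxsfV0='
print(json.loads(base64.b64decode(body)))
# [['shawn@ubx.ph', {}], {}, {'callbacks': None, 'errbacks': None, 'chain': None, 'chord': None}]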

"container started" event of a pod from kubernetes using pythons kubernetes library

I have a deployment with one container that has a postStart hook, as shown below:
containers:
  - name: openvas
    image: my-image:test
    lifecycle:
      postStart:
        exec:
          command:
            - /usr/local/tools/is_service_ready.sh
I'm watching the pod events using Python's kubernetes library.
When the pod gets deployed, the container comes up and the postStart script runs until it exits successfully. I want to get an event from Kubernetes, via Python's kubernetes library, as soon as the CONTAINER comes up.
I tried watching the events, but I only get the event with the status 'ContainersReady' after postStart completes and the POD comes up, as can be seen below:
'status': {'conditions': [{'last_probe_time': None,
'last_transition_time': datetime.datetime(2019, 4, 18, 16, 25, 3, tzinfo=tzlocal()),
'message': None,
'reason': None,
'status': 'True',
'type': 'Initialized'},
{'last_probe_time': None,
'last_transition_time': datetime.datetime(2019, 4, 18, 16, 26, 51, tzinfo=tzlocal()),
'message': None,
'reason': None,
'status': 'True',
'type': 'Ready'},
{'last_probe_time': None,
'last_transition_time': None,
'message': None,
'reason': None,
'status': 'True',
'type': 'ContainersReady'},
{'last_probe_time': None,
'last_transition_time': datetime.datetime(2019, 4, 18, 16, 25, 3, tzinfo=tzlocal()),
'message': None,
'reason': None,
'status': 'True',
'type': 'PodScheduled'}],
'container_statuses': [{'container_id': 'docker://1c39e13dc777a34c38d4194edc23c3668697223746b60276acffe3d62f9f0c44',
'image': 'my-image:test',
'image_id': 'docker://sha256:9903437699d871c1f3af7958a7294fe419ed7b1076cdb8e839687e67501b301b',
'last_state': {'running': None,
'terminated': None,
'waiting': None},
'name': 'samplename',
'ready': True,
'restart_count': 0,
'state': {'running': {'started_at': datetime.datetime(2019, 4, 18, 16, 25, 14, tzinfo=tzlocal())},
'terminated': None,
'waiting': None}}],
and before this I get the 'PodScheduled' condition as 'True':
'status': {'conditions': [{'last_probe_time': None,
'last_transition_time': datetime.datetime(2019, 4, 18, 16, 25, 3, tzinfo=tzlocal()),
'message': None,
'reason': None,
'status': 'True',
'type': 'Initialized'},
{'last_probe_time': None,
'last_transition_time': datetime.datetime(2019, 4, 18, 16, 25, 3, tzinfo=tzlocal()),
'message': 'containers with unready status: [openvas]',
'reason': 'ContainersNotReady',
'status': 'False',
'type': 'Ready'},
{'last_probe_time': None,
'last_transition_time': None,
'message': 'containers with unready status: [openvas]',
'reason': 'ContainersNotReady',
'status': 'False',
'type': 'ContainersReady'},
{'last_probe_time': None,
'last_transition_time': datetime.datetime(2019, 4, 18, 16, 25, 3, tzinfo=tzlocal()),
'message': None,
'reason': None,
'status': 'True',
'type': 'PodScheduled'}],
'container_statuses': [{'container_id': None,
'image': 'ns-openvas:test',
'image_id': '',
'last_state': {'running': None,
'terminated': None,
'waiting': None},
'name': 'openvas',
'ready': False,
'restart_count': 0,
'state': {'running': None,
'terminated': None,
'waiting': {'message': None,
'reason': 'ContainerCreating'}}}],
Is there anything I can try in order to get an event when the CONTAINER comes up?
Obviously, with the current approach you will never get it working because, as described here:
The postStart handler runs asynchronously relative to the Container’s code, but Kubernetes’ management of the container blocks until the postStart handler completes. The Container’s status is not set to RUNNING until the postStart handler completes.
Maybe you should create another pod with the is_service_ready.sh script, which will watch the events of the main pod.
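For reference, a minimal sketch of watching container statuses with the Python client (the namespace and label selector are hypothetical); note that, per the quote above, the container's state is only reported as running once the postStart handler has finished, so a watcher like this still cannot observe the container any earlier:

from kubernetes import client, config, watch

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod, namespace='default',
                      label_selector='app=openvas'):  # hypothetical selector
    pod = event['object']
    for cs in (pod.status.container_statuses or []):
        if cs.state and cs.state.running:
            print(cs.name, 'reported running at', cs.state.running.started_at)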

Using an nltk regex example in scikit-learn's CountVectorizer

I was trying to use an example from the nltk book as a regex pattern inside scikit-learn's CountVectorizer. I have seen examples with simple regexes, but not with something like this:
from sklearn.feature_extraction.text import CountVectorizer

pattern = r'''(?x)          # set flag to allow verbose regexps
      ([A-Z]\.)+            # abbreviations (e.g. U.S.A.)
    | \w+(-\w+)*            # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?      # currency & percentages
    | \.\.\.                # ellipses
'''
text = 'I love N.Y.C. 100% even with all of its traffic-ridden streets...'
vectorizer = CountVectorizer(stop_words='english', token_pattern=pattern)
analyze = vectorizer.build_analyzer()
analyze(text)
This produces:
[(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'-ridden', u''),
(u'', u'', u''),
(u'', u'', u'')]
With nltk, I get something entirely different:
nltk.regexp_tokenize(text,pattern)
['I',
'love',
'N.Y.C.',
'100',
'even',
'with',
'all',
'of',
'its',
'traffic-ridden',
'streets',
'...']
Is there a way to get the scikit-learn CountVectorizer to output the same thing? I was hoping to use some of the other handy features that are incorporated in the same function call.
TL;DR
from functools import partial
from nltk.tokenize import regexp_tokenize

CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))
is a vectorizer that uses the NLTK tokenizer.
Now for the actual problem: apparently nltk.regexp_tokenize does something quite special with its pattern, whereas scikit-learn simply does an re.findall with the pattern you give it, and findall doesn't like this pattern:
In [33]: re.findall(pattern, text)
Out[33]:
[('', '', ''),
('', '', ''),
('C.', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '-ridden', ''),
('', '', ''),
('', '', '')]
You'll either have to rewrite this pattern to make it work in scikit-learn style, or plug the NLTK tokenizer into scikit-learn:
In [41]: from functools import partial
In [42]: v = CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))
In [43]: v.build_analyzer()(text)
Out[43]:
['I',
'love',
'N.Y.C.',
'100',
'even',
'with',
'all',
'of',
'its',
'traffic-ridden',
'streets',
'...']
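For the first option (rewriting the pattern): the tuples appear because re.findall returns the contents of capturing groups whenever the pattern contains any; switching them to non-capturing groups should make CountVectorizer's findall-based tokenizer return whole matches. A sketch of that rewrite (lowercase=False and the default stop_words=None keep the output comparable to NLTK's):

from sklearn.feature_extraction.text import CountVectorizer

# same alternations as before, but with non-capturing (?:...) groups so that
# re.findall returns the full match instead of the group contents
pattern = r'''(?x)
      (?:[A-Z]\.)+            # abbreviations (e.g. U.S.A.)
    | \w+(?:-\w+)*            # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?      # currency & percentages
    | \.\.\.                  # ellipses
'''
text = 'I love N.Y.C. 100% even with all of its traffic-ridden streets...'
v = CountVectorizer(token_pattern=pattern, lowercase=False)
v.build_analyzer()(text)
# ['I', 'love', 'N.Y.C.', '100', 'even', 'with', 'all', 'of', 'its',
#  'traffic-ridden', 'streets', '...']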

What are my next debugging steps? InternalError: current transaction is aborted, commands ignored until end of transaction block

UPDATE: The problem wasn't with a previous transaction; it was happening because the database wasn't synced/migrated properly.
I've recently switched my local database to Postgres, and I'm now getting an InternalError.
I understand from this question that the problem likely originates from a previous transaction not executing properly:
"This is what postgres does when a query produces an error and you try to run another query without first rolling back the transaction."
However, from the logs below, it seems like the first query, DEBUG (0.001), executes fine (I also tested this exact query through the Django DB shell):
DEBUG (0.001) SELECT "django_site"."id", "django_site"."domain", "django_site"."name"
FROM "django_site" WHERE "django_site"."id" = 12 ; args=(12,)
Full SQL Logs
DEBUG (0.001) SELECT "django_site"."id", "django_site"."domain", "django_site"."name" FROM "django_site" WHERE "django_site"."id" = 12 ; args=(12,)
DEBUG (0.003) INSERT INTO "application_app" ("applied_date", "fname", "lname",
"email_address", "phone_number", "skype_id", "applied_track", "college1",
"field_of_study1", "graduation_month1", "graduation_year1", "degree1",
"degree_other1", "working_during_program", "explain_working_during_program",
"sib_goal", "twitter_link", "linkedin_link", "other_social_link1",
"other_social_link2", "other_social_link3", "applied_class", "applied_location",
"referral", "colossal_failure", "next_week_year_10year", "you_created",
"your_inspiration", "dev_years_of_exp", "dev_fav_lang", "dev_fav_lang_why",
"dev_link_youve_built", "dev_link_github", "dev_fav_resource", "prod_cool_prod",
"prod_fav_designer", "prod_portfolio", "prod_bad_design", "prod_link_dribble",
"mark_ind_trend", "mark_email_to_coworkers", "mark_keep_em_happy",
"mark_article_or_blog", "sales_why_you", "sales_convince_restaurant",
"sales_hardest_door", "sales_sale_within_the_year", "housing_needed",
"program_payment", "any_last_requests") VALUES ('2013-04-20 13:22:06.565691+00:00',
'Brian', 'Dant', 'test@gmail.com', '', '', 'MAR', '', '', 1, NULL, '', '', false,
'', 'GN', '', 'http://linkedin.com/', '', '', '', 'NYCSUM13', '', 'test', 'test',
'test', 'test', 'test', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', false, 'UF', 'test') RETURNING "application_app"."id"; args=(u'2013-04-20 13:22:06.565691+00:00',
u'Brian', u'Dant', u'test@gmail.com', u'', u'', u'MAR', u'', u'', 1, None, '', u'',
False, u'', u'GN', u'', u'http://linkedin.com/', u'', u'', '', u'NYCSUM13', '',
u'test', u'test', u'test', u'test', u'test', '', '', u'', u'', u'', u'', u'', '',
u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', False, u'UF', u'test')
ERROR Internal Server Error: /
Traceback (most recent call last):
File "/Users/chaz/dev/envs/startupinstitute.com/lib/python2.7/site-packages/django/core/handlers/base.py", line 111, in get_response
response = callback(request, *callback_args, **callback_kwargs)
File "/Users/chaz/dev/projects/startupinstitute.com/apps/application/views.py", line 22, in application
new_app = f.save()
File "/Users/chaz/dev/envs/startupinstitute.com/lib/python2.7/site-packages/django/forms/models.py", line 364, in save
fail_message, commit, construct=False)
File "/Users/chaz/dev/envs/startupinstitute.com/lib/python2.7/site-packages/django/forms/models.py", line 86, in save_instance
instance.save()
File "/Users/chaz/dev/envs/startupinstitute.com/lib/python2.7/site-packages/django/db/models/base.py", line 463, in save
self.save_base(using=using, force_insert=force_insert, force_update=force_update)
File "/Users/chaz/dev/envs/startupinstitute.com/lib/python2.7/site-packages/django/db/models/base.py", line 551, in save_base
result = manager._insert([self], fields=fields, return_id=update_pk, using=using, raw=raw)
File "/Users/chaz/dev/envs/startupinstitute.com/lib/python2.7/site-packages/django/db/models/manager.py", line 203, in _insert
return insert_query(self.model, objs, fields, **kwargs)
File "/Users/chaz/dev/envs/startupinstitute.com/lib/python2.7/site-packages/django/db/models/query.py", line 1593, in insert_query
return query.get_compiler(using=using).execute_sql(return_id)
File "/Users/chaz/dev/envs/startupinstitute.com/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 912, in execute_sql
cursor.execute(sql, params)
File "/Users/chaz/dev/envs/startupinstitute.com/lib/python2.7/site-packages/debug_toolbar/utils/tracking/db.py", line 153, in execute
'iso_level': conn.isolation_level,
InternalError: current transaction is aborted, commands ignored until end of transaction block
[20/Apr/2013 08:22:06] "POST / HTTP/1.1" 500 413131
DEBUG (0.002) SELECT "django_site"."id", "django_site"."domain", "django_site"."name" FROM "django_site" WHERE "django_site"."id" = 12 ; args=(12,)
WARNING Not Found: /favicon.ico
views.py:
def application(request):
    if request.method == 'POST':
        f = forms.AppForm(request.POST)
        selected_track = request.POST['applied_track']
        if f.is_valid():
            new_app = f.save()
            new_app.save()
The problem was that my database wasn't migrated properly. Apparently this error can come from (at least) either a) a previous failed transaction, or b) the database not being synced properly by South.
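A practical note for anyone else hitting this: once one statement fails, every later statement on the same connection raises this InternalError until a rollback happens, so the interesting error is the earlier one (here, most likely the INSERT failing against the not-fully-migrated schema). A small sketch of clearing the aborted transaction from a Django shell so the next query reports the underlying problem (connection.connection is the raw psycopg2 connection; this is just a debugging aid, not a fix):

from django.db import connection

# roll back the aborted transaction so the next statement surfaces the real error
if connection.connection is not None:
    connection.connection.rollback()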