I was trying to use an example from the nltk book for a regex pattern inside the CountVectorizer from scikit-learn. I see examples with simple regex but not with something like this:
pattern = r''' (?x) # set flag to allow verbose regexps
([A-Z]\.)+ # abbreviations (e.g. U.S.A.)
| \w+(-\w+)* # words with optional internal hyphens
| \$?\d+(\.\d+)?%? # currency & percentages
| \.\.\. # ellipses '''
text = 'I love N.Y.C. 100% even with all of its traffic-ridden streets...'
vectorizer = CountVectorizer(stop_words='english',token_pattern=pattern)
analyze = vectorizer.build_analyzer()
analyze(text)
This produces:
[(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'-ridden', u''),
(u'', u'', u''),
(u'', u'', u'')]
With nltk, I get something entirely different:
nltk.regexp_tokenize(text,pattern)
['I',
'love',
'N.Y.C.',
'100',
'even',
'with',
'all',
'of',
'its',
'traffic-ridden',
'streets',
'...']
Is there a way to get the skl CountVectorizer to output the same thing? I was hoping to use some of the other handy features that are incorporated in the same function call.
TL;DR
from functools import partial
CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))
is a vectorizer that uses the NLTK tokenizer.
Now for the actual problem: apparently nltk.regexp_tokenize does something quite special with its pattern, whereas scikit-learn simply does an re.findall with the pattern you give it, and findall doesn't like this pattern:
In [33]: re.findall(pattern, text)
Out[33]:
[('', '', ''),
('', '', ''),
('C.', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '-ridden', ''),
('', '', ''),
('', '', '')]
You'll either have to rewrite this pattern to make it work in scikit-learn style, or plug the NLTK tokenizer into scikit-learn:
In [41]: from functools import partial
In [42]: v = CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))
In [43]: v.build_analyzer()(text)
Out[43]:
['I',
'love',
'N.Y.C.',
'100',
'even',
'with',
'all',
'of',
'its',
'traffic-ridden',
'streets',
'...']
Related
Hi does anyone know how the body of a celery json is encoded before it is entered in the queue cache (i use Redis in my case).
{'body': 'W1sic2hhd25AdWJ4LnBoIiwge31dLCB7fSwgeyJjYWxsYmFja3MiOiBudWxsLCAiZXJyYmFja3MiOiBudWxsLCAiY2hhaW4iOiBudWxsLCAiY2hvcmQiOiBudWxsfV0=',
'content-encoding': 'utf-8',
'content-type': 'application/json',
'headers': {'lang': 'py',
'task': 'export_users',
'id': '6e506f75-628e-4aa1-9703-c0185c8b3aaa',
'shadow': None,
'eta': None,
'expires': None,
'group': None,
'retries': 0,
'timelimit': [None, None],
'root_id': '6e506f75-628e-4aa1-9703-c0185c8b3aaa',
'parent_id': None,
'argsrepr': "('<email#example.com>', {})",
'kwargsrepr': '{}',
'origin': 'gen187209#ubuntu'},
'properties': {'correlation_id': '6e506f75-628e-4aa1-9703-c0185c8b3aaa',
'reply_to': '403f7314-384a-30a3-a518-65911b7cba5c',
'delivery_mode': 2,
'delivery_info': {'exchange': '', 'routing_key': 'celery'},
'priority': 0,
'body_encoding': 'base64',
'delivery_tag': 'dad6b5d3-c667-473e-a62c-0881a7349684'}}
Just a background I have a nodejs project which needs to trigger my celery (django). Background tasks are all in the django app but the trigger and the details will come from a nodejs app.
Thanks in advance
It may just be simpler to use the nodejs celery client
https://github.com/mher/node-celery/blob/master/celery.js
to invoke a celery task from nodejs.
I've a deployment with one container having postStart hook as shown below
containers:
- name: openvas
image: my-image:test
lifecycle:
postStart:
exec:
command:
- /usr/local/tools/is_service_ready.sh
I'm watching for the events for pods using python's kubernetes library.
when the pod gets deployed, container comes up and postStart script will be executed until postStart script exits successfully. I want to get the event from kubernetes using pythons kubernetes library when CONTAINER comes up.
I tried watching the event, I get the event with status as 'containersReady' only when postStart completes and the POD comes up,it can be seen below.
'status': {'conditions': [{'last_probe_time': None,
'last_transition_time': datetime.datetime(2019, 4, 18, 16, 25, 3, tzinfo=tzlocal()),
'message': None,
'reason': None,
'status': 'True',
'type': 'Initialized'},
{'last_probe_time': None,
'last_transition_time': datetime.datetime(2019, 4, 18, 16, 26, 51, tzinfo=tzlocal()),
'message': None,
'reason': None,
'status': 'True',
'type': 'Ready'},
{'last_probe_time': None,
'last_transition_time': None,
'message': None,
'reason': None,
'status': 'True',
'type': 'ContainersReady'},
{'last_probe_time': None,
'last_transition_time': datetime.datetime(2019, 4, 18, 16, 25, 3, tzinfo=tzlocal()),
'message': None,
'reason': None,
'status': 'True',
'type': 'PodScheduled'}],
'container_statuses': [{'container_id': 'docker://1c39e13dc777a34c38d4194edc23c3668697223746b60276acffe3d62f9f0c44',
'image': 'my-image:test',
'image_id': 'docker://sha256:9903437699d871c1f3af7958a7294fe419ed7b1076cdb8e839687e67501b301b',
'last_state': {'running': None,
'terminated': None,
'waiting': None},
'name': 'samplename',
'ready': True,
'restart_count': 0,
'state': {'running': {'started_at': datetime.datetime(2019, 4, 18, 16, 25, 14, tzinfo=tzlocal())},
'terminated': None,
'waiting': None}}],
and before this I get status 'podScheduled' as 'True'
'status': {'conditions': [{'last_probe_time': None,
'last_transition_time': datetime.datetime(2019, 4, 18, 16, 25, 3, tzinfo=tzlocal()),
'message': None,
'reason': None,
'status': 'True',
'type': 'Initialized'},
{'last_probe_time': None,
'last_transition_time': datetime.datetime(2019, 4, 18, 16, 25, 3, tzinfo=tzlocal()),
'message': 'containers with unready status: [openvas]',
'reason': 'ContainersNotReady',
'status': 'False',
'type': 'Ready'},
{'last_probe_time': None,
'last_transition_time': None,
'message': 'containers with unready status: [openvas]',
'reason': 'ContainersNotReady',
'status': 'False',
'type': 'ContainersReady'},
{'last_probe_time': None,
'last_transition_time': datetime.datetime(2019, 4, 18, 16, 25, 3, tzinfo=tzlocal()),
'message': None,
'reason': None,
'status': 'True',
'type': 'PodScheduled'}],
'container_statuses': [{'container_id': None,
'image': 'ns-openvas:test',
'image_id': '',
'last_state': {'running': None,
'terminated': None,
'waiting': None},
'name': 'openvas',
'ready': False,
'restart_count': 0,
'state': {'running': None,
'terminated': None,
'waiting': {'message': None,
'reason': 'ContainerCreating'}}}],
Anything I can try to get the event when the CONTAINER comes up.
Obviously, with current approach you will never get it working, because, as describe here :
The postStart handler runs asynchronously relative to the Container’s
code, but Kubernetes’ management of the container blocks until the
postStart handler completes. The Container’s status is not set to
RUNNING until the postStart handler completes.
Maybe you should create another pod with is_service_ready.sh script, which will be watching events of the main pod.
I'm making posts requests with curl using this, and even more simple than this:
curl -u usr:pwd -H "Content-Type: multipart/form-data" -i --form "file=#/path/to/myfile/myfile.zip" -X POST http://midominio.co/api/mypostrequest/
but I'm always getting an empty body, this what I'm getting:
{
'session': <django.contrib.sessions.backends.db.SessionStore object at 0x7f118e502b90>,
'_post': <QueryDict: {}>,
'content_params': {'boundary': '------------------------axxxxxxxxx'},
'_post_parse_error': False,
'_messages': <django.contrib.messages.storage.fallback.FallbackStorage object at 0x7f118e502d90>,
'resolver_match': ResolverMatch(func=piston.resource.Resource, args=(), kwargs={}, url_name=None, app_names=[], namespaces=[]),
'GET': <QueryDict: {}>,
'_stream': <django.core.handlers.wsgi.LimitedStream object at 0x7f118e502b50>,
'COOKIES': {},
'_files': <MultiValueDict: {}>,
'_read_started': False,
'META': {'HTTP_AUTHORIZATION': 'Basic Y2FpbjpjYWlu',
'SERVER_SOFTWARE': 'gunicorn/19.7.1',
'SCRIPT_NAME': u'',
'REQUEST_METHOD': 'POST',
'PATH_INFO': u'/api/mypostrequest/',
'SERVER_PROTOCOL': 'HTTP/1.0',
'QUERY_STRING': '',
'CONTENT_LENGTH': '180',
'HTTP_USER_AGENT': 'curl/7.51.0',
'HTTP_CONNECTION': 'close',
'SERVER_NAME': 'midominio.co',
'REMOTE_ADDR': '',
'wsgi.url_scheme': 'http',
'SERVER_PORT': '80',
'wsgi.input': <gunicorn.http.body.Body object at 0x7f118e5029d0>,
'HTTP_HOST': 'midominio.co',
'wsgi.multithread': False,
'HTTP_ACCEPT': '*/*',
'wsgi.version': (1, 0),
'RAW_URI': '/api/mypostrequest/',
'wsgi.run_once': False,
'wsgi.errors': <gunicorn.http.wsgi.WSGIErrorsWrapper object at 0x7f118e502950>,
'wsgi.multiprocess': True,
'gunicorn.socket': <socket._socketobject object at 0x7f118e4f6de0>,
'CONTENT_TYPE': 'multipart/form-data; boundary=------------------------axxxxxxxxx',
'HTTP_X_FORWARDED_FOR': '186.00.00.000',
'wsgi.file_wrapper': <class 'gunicorn.http.wsgi.FileWrapper'>},
'environ': {'HTTP_AUTHORIZATION': 'Basic Y2FpbjpjYWlu',
'SERVER_SOFTWARE': 'gunicorn/19.7.1',
'SCRIPT_NAME': u'',
'REQUEST_METHOD': 'POST',
'PATH_INFO': u'/api/mypostrequest/',
'SERVER_PROTOCOL': 'HTTP/1.0',
'QUERY_STRING': '',
'CONTENT_LENGTH': '180',
'HTTP_USER_AGENT': 'curl/7.51.0',
'HTTP_CONNECTION': 'close',
'SERVER_NAME': 'midominio.co',
'REMOTE_ADDR': '',
'wsgi.url_scheme': 'http',
'SERVER_PORT': '80',
'wsgi.input': <gunicorn.http.body.Body object at 0x7f118e5029d0>,
'HTTP_HOST': 'midominio.co',
'wsgi.multithread': False,
'HTTP_ACCEPT': '*/*',
'wsgi.version': (1, 0),
'RAW_URI': '/api/mypostrequest/',
'wsgi.run_once': False,
'wsgi.errors': <gunicorn.http.wsgi.WSGIErrorsWrapper object at 0x7f118e502950>,
'wsgi.multiprocess': True,
'gunicorn.socket': <socket._socketobject object at 0x7f118e4f6de0>,
'CONTENT_TYPE': 'multipart/form-data; boundary=------------------------axxxxxxxxx',
'HTTP_X_FORWARDED_FOR': '186.00.00.000',
'wsgi.file_wrapper': <class 'gunicorn.http.wsgi.FileWrapper'>},
'path_info': u'/api/mypostrequest/',
'content_type': 'multipart/form-data; boundary=------------------------axxxxxxxxx',
'path': u'/api/mypostrequest/',
'data': <QueryDict: {}>,
'method': 'POST',
'user': <User: cain>
}
I'm using the default nginx configuration, I also have a web version of this part with an "upload button" and everything is working fine, there is not errors in the .log files of nginx or supervisor and the django installation is okay, any ideas? thanks
I was a mistake trying to take this as a normal request, this is a WSGIRequest so that means it need to be processed different I solved using this, I hope this help to someone else:
environ = request.environ
form = cgi.FieldStorage(fp=environ['wsgi.input'], environ=environ)
f = form['file'].file
I am using Django 1.6 with Celery. I have this task that I have running as a schedule. Everything looks correct, but I get this import error when the Celery(beat) runs:
Did you remember to import the module containing this task?
Or maybe you are using relative imports?
Please see http://bit.ly/gLye1c for more information.
The full contents of the message body was:
{'utc': True, 'chord': None, 'args': [], 'retries': 0, 'expires': None, 'task': 'bot_data.tasks.get_unanswered_threads', 'callbacks': None, 'errbacks': None, 'timelimit': (None, None), 'taskset': None, 'kwargs': {}, 'eta': None, 'id': '6143f259-721b-4984-99ae-790c00633271'} (233b)
Traceback (most recent call last):
File "/home/one/.virtualenvs/bot/local/lib/python2.7/site-packages/celery/worker/consumer.py", line 455, in on_task_received
strategies[name](message, body,
KeyError: 'bot_data.tasks.get_unanswered_threads'
base.py:
from datetime import timedelta
CELERYBEAT_SCHEDULE = {
'get-unanswered-threads--every-15-seconds': {
'task': 'bot_data.tasks.get_unanswered_threads',
'schedule': timedelta(seconds=15),
'args': ()
},
}
CELERY_TIMEZONE = 'UTC'
from bot_data/get_unanswered_threads:
#task()
def get_unanswered_threads():
slug = 'forums/threads/unanswered.json?PageSize=100'
thread_batch = []
Nevermind, I had bot_data.tasks.get_unanswered_threads incorrect.
The doc's specifies the the annotation as #shared_task instead of #task.
UPDATE: The problem wasn't with the previous transaction, it was happening because the database wasn't synced / migrated properly.
I've recently switched my local database to Postgres, and I'm now getting an InternalError.
I understand from this question that the problem likely originates from a previous transaction not executing properly:
"This is what postgres does when a query produces an error and you try to run another query without first rolling back the transaction."
However, from the logs below, it seems like the first query, DEBUG (0.0001), executes fine (I also tested this exact query though the Django DB Shell):
DEBUG (0.001) SELECT "django_site"."id", "django_site"."domain", "django_site"."name"
FROM "django_site" WHERE "django_site"."id" = 12 ; args=(12,)
Full SQL Logs
DEBUG (0.001) SELECT "django_site"."id", "django_site"."domain", "django_site"."name" FROM "django_site" WHERE "django_site"."id" = 12 ; args=(12,)
DEBUG (0.003) INSERT INTO "application_app" ("applied_date", "fname", "lname",
"email_address", "phone_number", "skype_id", "applied_track", "college1",
"field_of_study1", "graduation_month1", "graduation_year1", "degree1",
"degree_other1", "working_during_program", "explain_working_during_program",
"sib_goal", "twitter_link", "linkedin_link", "other_social_link1",
"other_social_link2", "other_social_link3", "applied_class", "applied_location",
"referral", "colossal_failure", "next_week_year_10year", "you_created",
"your_inspiration", "dev_years_of_exp", "dev_fav_lang", "dev_fav_lang_why",
"dev_link_youve_built", "dev_link_github", "dev_fav_resource", "prod_cool_prod",
"prod_fav_designer", "prod_portfolio", "prod_bad_design", "prod_link_dribble",
"mark_ind_trend", "mark_email_to_coworkers", "mark_keep_em_happy",
"mark_article_or_blog", "sales_why_you", "sales_convince_restaurant",
"sales_hardest_door", "sales_sale_within_the_year", "housing_needed",
"program_payment", "any_last_requests") VALUES ('2013-04-20 13:22:06.565691+00:00',
'Brian', 'Dant', 'test#gmail.com', '', '', 'MAR', '', '', 1, NULL, '', '', false,
'', 'GN', '', 'http://linkedin.com/', '', '', '', 'NYCSUM13', '', 'test', 'test',
'test', 'test', 'test', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', false, 'UF', 'test') RETURNING "application_app"."id"; args=(u'2013-04-20 13:22:06.565691+00:00',
u'Brian', u'Dant', u'test#gmail.com', u'', u'', u'MAR', u'', u'', 1, None, '', u'',
False, u'', u'GN', u'', u'http://linkedin.com/', u'', u'', '', u'NYCSUM13', '',
u'test', u'test', u'test', u'test', u'test', '', '', u'', u'', u'', u'', u'', '',
u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', False, u'UF', u'test')
ERROR Internal Server Error: /
Traceback (most recent call last):
File "/Users/chaz/dev/envs/startupinstitute.com/lib/python2.7/site-packages/django/core/handlers/base.py", line 111, in get_response
response = callback(request, *callback_args, **callback_kwargs)
File "/Users/chaz/dev/projects/startupinstitute.com/apps/application/views.py", line 22, in application
new_app = f.save()
File "/Users/chaz/dev/envs/startupinstitute.com/lib/python2.7/site-packages/django/forms/models.py", line 364, in save
fail_message, commit, construct=False)
File "/Users/chaz/dev/envs/startupinstitute.com/lib/python2.7/site-packages/django/forms/models.py", line 86, in save_instance
instance.save()
File "/Users/chaz/dev/envs/startupinstitute.com/lib/python2.7/site-packages/django/db/models/base.py", line 463, in save
self.save_base(using=using, force_insert=force_insert, force_update=force_update)
File "/Users/chaz/dev/envs/startupinstitute.com/lib/python2.7/site-packages/django/db/models/base.py", line 551, in save_base
result = manager._insert([self], fields=fields, return_id=update_pk, using=using, raw=raw)
File "/Users/chaz/dev/envs/startupinstitute.com/lib/python2.7/site-packages/django/db/models/manager.py", line 203, in _insert
return insert_query(self.model, objs, fields, **kwargs)
File "/Users/chaz/dev/envs/startupinstitute.com/lib/python2.7/site-packages/django/db/models/query.py", line 1593, in insert_query
return query.get_compiler(using=using).execute_sql(return_id)
File "/Users/chaz/dev/envs/startupinstitute.com/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 912, in execute_sql
cursor.execute(sql, params)
File "/Users/chaz/dev/envs/startupinstitute.com/lib/python2.7/site-packages/debug_toolbar/utils/tracking/db.py", line 153, in execute
'iso_level': conn.isolation_level,
InternalError: current transaction is aborted, commands ignored until end of transaction block
[20/Apr/2013 08:22:06] "POST / HTTP/1.1" 500 413131
DEBUG (0.002) SELECT "django_site"."id", "django_site"."domain", "django_site"."name" FROM "django_site" WHERE "django_site"."id" = 12 ; args=(12,)
WARNING Not Found: /favicon.ico
views.py:
def application(request):
if request.method == 'POST':
f = forms.AppForm(request.POST)
selected_track = request.POST['applied_track']
if f.is_valid():
new_app = f.save()
new_app.save()
The problem was that my database wasn't migrated properly. Apparently this error can come from (at least) both a) the previous transaction, or b) the database not being synced properly by south.