Datastore: Batch must be in progress to put() - python-2.7

I am trying to use the Google Cloud Datastore API client library to upload an entity with batch on datastore. My version is 1.6.0
This is my code:
from google.cloud import datastore
client = datastore.Client()
batch = client.batch()
key = client.key('test', 'key1')
entity = datastore.Entity(
key,
exclude_from_indexes=['attribute2'])
entity.update({
'attribute1': 'hello',
'attribute2': 'hello again',
'attribute3': 0.98,
})
batch.put(entity)
And I am getting this error when I do batch.put():
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/batch.py", line 185, in put
raise ValueError('Batch must be in progress to put()')
ValueError: Batch must be in progress to put()
What am I doing wrong?

You'll need to explicitly call batch.begin() if you aren't doing the puts in a context, i.e. using the with keyword.

Short answer: As stated by #JimMorrison in his answer, you need to call the batch.begin() method when beginning a batch in Datastore, unless you are using the with statement, as explained in the documentation for google.cloud.datastore.batch.Batch.
Long answer: The answer in short answer works, however working with batches directly is not the recommended approach according to the documentation. The Datastore documentation offers some information about how to work with batch operations, differentiating two types of batch operations:
Non-transactional batch: these are some predefined methods (generally having the _multi appendix, indicating that it will work over multiple entities simultaneously) that allow to operate on multiple objects in a single Datastore call. These methods are get_multi(), put_multi() and delete_multi(), but they are not executed transactionally, i.e. it is possible that only some operations in the request have ended successfully if an error happens.
Transactions: atomic operations which are guaranteed to never be partially applied, i.e. either all operations are applied, or none (if an error occurs). This is a nice feature that comes in handy depending on the time of operations you are trying to perform.
According to the documentation about Datastore's Batch class, the Batch class is overridden by the Transaction class, so unless you want to perform some specific operations with the underlying Batch, you should probably work with transactions instead, as explained in the documentation, where you will find more examples and best practices to work with. This practice is preferred to the one you are using, which is working with the Batch class directly, which is in fact overridden by Transaction.
TL;DR: batch.begin() will solve the issue in your code, as you need to initialize the batch with the _IN_PROGRESS status. However, if you want to make your life easier, and make an abstracted usage of batches through transactions, with better documentation and examples, I would recommend the usage of transactions instead.

Related

Dynamic user impersonation in Airflow

Is there a simple, efficient mechanism to dynamically set the run_as_user parameter for an Airflow 2.0 task instance based upon the user that triggered the DAG execution?
This answer provides a way to determine the triggering user by inspecting the metadata database (earlier answers that involve templating--such as this--no longer seem to work for Airflow 2.0). Using the metadata inspection approach, I've been able to dynamically set run_as_user via a customer operator, like so:
from airflow.operators.python import PythonOperator
from airflow.models.log import Log
from airflow.utils.db import create_session
class CustomPythonOperator(PythonOperator):
def __init__(self, **kwargs):
super().__init__(**kwargs)
with create_session() as session:
results = session.query(Log.owner)\
.filter(
Log.dag_id==dag.dag_id,
Log.event=='trigger')\
.order_by(Log.dttm.desc())\
.all()
self.run_as_user = results[0][0] if len(results) else 'airflow'
That said, this approach seems problematic for at least two reasons:
Ostensibly, it requires that the scheduler query the metadata database every time it instantiates the custom operator (rather than every time the task is executed). The documentation warns against expensive operations in the __init__ method of a custom operator. Moving the user lookup to the execute method of the custom operator does not work.
I suspect the semantics of this solution are not quite right: Specifically, I suspect that this approach will set run_as_user to the user that most recently triggered the DAG as of the most recent time that the scheduler parsed the DAG, rather than the user associated with the current DAG execution. Depending on the scheduler interval, they may be the same user, but this would seem to introduce a race condition.
Is there a better way to solve this?

WS2ESB: Store state between sequence invocations

I was wondering about the proper way to store state between sequence invocations in WSO2ESB. In other words, if I have a scheduled task that invokes sequence S, at the end of iteration 0 I want to store some String variable (lets' call it ID), and then I want to read this ID at the start (or in the middle) of iteration 1, and so on.
To be more precise, I want to get a list of new SMS messages from an existing service, Twilio to be exact. However, Twilio only lets me get messages for selected days, i.e. there's no way for me to say give me only new messages (since I last checked / newer than certain message ID). Therefore, I'd like to create a scheduled task that will query Twilio and pass only new messages via REST call to my service. In order to do this, my sequence needs to query Twilio and then go through the returned list of messages, and discard messages that were already reported in the previous invocation. Now, to do this I need to store some state between different task/sequence invocations, i.e. at the end of the sequence I need to store the ID of the newest message in the current batch. This ID can then be used in subsequent invocation to determine which messages were already reported in the previous invocation.
I could use DBLookup and DB Report mediators, but it seems like an overkill (using a database to store a single string) and not very performance friendly. On the other hand, as far as I can see Class mediators are instantiated as singletons, therefore I could create a custom Class mediator that would manage this state and filter the list of messages to be sent to my service. I am quite sure that this will work, but I was wondering if this is the way to go, or there might be a more elegant solution that I missed.
We can think of 3 options here.
Using DBLookup/Report as you've suggested
Using the Carbon registry to store the values (this again uses DBs in the back end)
Using a Custom mediator to hold the state and read/write it from/to properties
Out of these three, obviously the third one will deliver the best performance since everything will be in-memory. It's also quite simple to implement and sometime back I did something similar and wrote a blog post here.
But on the other hand, the first two options can keep the state even when the server crashes, if it's a concern for your use case.
Since esb 490 you can persist and read properties from registry using property mediator.
https://docs.wso2.com/display/ESB490/Property+Mediator

Use atomic blocks with error handling in Django

I have a Django 1.9 application that is running a section of code where changes are made to the database based on the results of queries to certain remote APIs. For instance, this might be data about commits, file changes, reviewers, pull requests, etc. that I want to save as entities in my database.
commit_data = commit_API_client.get_commit_info(argument1, argument2)
new_commit = models.Commit.Create(**commit_data)
#if the last API failed, this will fail
#I will need to run this again to get these files, so I need
#to get the commit all over again, too
files = file_API_client.get_file_info(new_commit.id)
new_files = models.Files.Create(**files)
#do some more stuff here
It's very possible that one of the few APIs I'm calling will return some error instead of valid data. I essentially need to make this section into a single atomic transaction so that if there are no errors returned from the HTTP requests, I commit all the changes to the database. Otherwise, I might lose some data if 2 APIs return correctly, but the 3rd doesn't.
I saw that Django supports commit hooks for database transactions, but I was wondering if that would applicable to this situation and how I would go about implementing that.
If you want to make that into an atomic operation, just wrap it in transaction.atomic(). If any of your code raises an exception the whole block will roll back. If you're only doing reads from the remote API that should work fine. A typical pattern is:
def my_function():
try:
with transaction.atomic():
# Do your work here
pass
except Exception:
# Do some error handling
pass
Django does have an on_commit hook, but that isn't really applicable here. Its purpose is to run some code after a transaction has completed successfully. For example, you might use it to write some log data to a remote API if the transaction was successful.

How to create separate python script for uploading data into ndb

Can anyone guide me towards the right direction as to where I should place a script solely for loading data into ndb. As I wish to upload all data into the gae ndb so that the application could perform query on it.
Right now, the loading of data is in my application. I wish to placed it separately from the main application.
Should it be edited in the yaml file?
EDITED
This is a snippet of the entity and the handler to upload the data into GAE ndb.
I wish to placed this chunk of code separately from my main application .py. Reason being the uploading of this data won't be done frequently and to keep the codes in the main application "cleaner".
class TagTrend_refine(ndb.Model):
tag = ndb.StringProperty()
trendData = ndb.BlobProperty(compressed=True)
class MigrateData(webapp2.RequestHandler):
def get(self):
listOfEntities = []
f = open("tagTrend_refine.txt")
lines = f.readlines()
f.close()
for line in lines:
temp = line.strip().split("\t")
data = TagTrend_refine(
tag = temp[0],
trendData = temp[1]
)
listOfEntities.append(data)
ndb.put_multi(listOfEntities)
For example if I placed the above code in a file called dataLoader.py, where should I call this script to invoke?
In app.yaml alongside my main application(knowledgeGraph.application)?
- url: /.*
script: knowledgeGraph.application
You don't show us the application object (no doubt a WSGI app) in your knowledge.py module, so I can't know what URL you want to serve with the MigrateData handler -- I'll just guess it's something like /migratedata.
So the class TagTrend_refine should be in a separate file (usually called models.py) so that both your dataloader.py, and your knowledge.py, can import models to access it (and models.py will need its own import of ndb of course). (Then of course access to the entity class will be as models.TagTrend_refine -- very basic Python).
Next, you'll complete dataloader.py by defining a WSGI app, e.g, at end of file,
app = webapp2.WSGIApplication(routes=[('/migratedata', MigrateData)])
(of course this means this module will need to import webapp2 as well -- can I take for granted a knowledge of super-elementary Python?).
In app.yaml, as the first URL, before that /.*, you'll have:
url: /migratedata
script: dataloader.app
Given all this, when you visit '/migratedata', your handler will read the "tagTrend_refine.txt" file that you uploaded together with your .py, .yaml, and so on, files in your overall GAE application, and unconditionally create one entity per line of that file (assuming you fix the multiple indentation problems in your code as displayed above, but, again, this is just super-elementary Python -- presumably you've used both tabs and spaces and they show up OK in your editor, but not here on SO... I recommend you use strictly, only spaces, never tabs, in Python code).
However this does seem to be a peculiar task. If /migratedata gets visited twice, it will create duplicates of all entities. If you change the tagTrend_refine.txt and deploy a changed variation, then visit /migratedata... all old entities will stick around and all the new entities will join them. And so forth.
Moreover -- /migratedata is NOT idempotent (if visited more than once it does not produce the same state as running it just once) so it shouldn't be a GET (and now we're on to super-elementary HTTP for a change!-) -- it should be a POST.
In fact I suspect (but I'm really flying blind here, since you see fit to give such tiny amounts of information) that you in fact want to upload a .txt file to a POST handler and do the updates that way (perhaps avoiding duplicates...?). However, I'm no mind reader, so this is about as far as I can go.
I believe I have fully answered the question you posted (though perhaps not the one you meant but didn't express:-) and by SO's etiquette it would be nice to upvote and accept this answer, then, if needed, post another question, expressing MUCH more clearly and completely what you're trying to achieve, your current .py and .yaml (ideally with correct indentation), what they actually do and why you'd like to do something different. For POST vs GET in particular, just study When should I use GET or POST method? What's the difference between them? ...
Alex's solution will work, as long as all you data can be loaded in under 1 minute, as that's the timeout for an app engine request.
For larger data, consider calling the datastore API directly from your own computer where you have the source. It's a bit of a hassle because it's a different API; it's not ndb. But it's still a pretty simple API. Here's some code that calls the API:
https://github.com/GoogleCloudPlatform/getting-started-python/blob/master/2-structured-data/bookshelf/model_datastore.py
Again, this code can run anywhere. It doesn't need to be uploaded to app engine to run.

Django: How can I protect against concurrent modification of database entries

If there a way to protect against concurrent modifications of the same data base entry by two or more users?
It would be acceptable to show an error message to the user performing the second commit/save operation, but data should not be silently overwritten.
I think locking the entry is not an option, as a user might use the "Back" button or simply close his browser, leaving the lock for ever.
This is how I do optimistic locking in Django:
updated = Entry.objects.filter(Q(id=e.id) && Q(version=e.version))\
.update(updated_field=new_value, version=e.version+1)
if not updated:
raise ConcurrentModificationException()
The code listed above can be implemented as a method in Custom Manager.
I am making the following assumptions:
filter().update() will result in a single database query because filter is lazy
a database query is atomic
These assumptions are enough to ensure that no one else has updated the entry before. If multiple rows are updated this way you should use transactions.
WARNING Django Doc:
Be aware that the update() method is
converted directly to an SQL
statement. It is a bulk operation for
direct updates. It doesn't run any
save() methods on your models, or emit
the pre_save or post_save signals
This question is a bit old and my answer a bit late, but after what I understand this has been fixed in Django 1.4 using:
select_for_update(nowait=True)
see the docs
Returns a queryset that will lock rows until the end of the transaction, generating a SELECT ... FOR UPDATE SQL statement on supported databases.
Usually, if another transaction has already acquired a lock on one of the selected rows, the query will block until the lock is released. If this is not the behavior you want, call select_for_update(nowait=True). This will make the call non-blocking. If a conflicting lock is already acquired by another transaction, DatabaseError will be raised when the queryset is evaluated.
Of course this will only work if the back-end support the "select for update" feature, which for example sqlite doesn't. Unfortunately: nowait=True is not supported by MySql, there you have to use: nowait=False, which will only block until the lock is released.
Actually, transactions don't help you much here ... unless you want to have transactions running over multiple HTTP requests (which you most probably don't want).
What we usually use in those cases is "Optimistic Locking". The Django ORM doesn't support that as far as I know. But there has been some discussion about adding this feature.
So you are on your own. Basically, what you should do is add a "version" field to your model and pass it to the user as a hidden field. The normal cycle for an update is :
read the data and show it to the user
user modify data
user post the data
the app saves it back in the database.
To implement optimistic locking, when you save the data, you check if the version that you got back from the user is the same as the one in the database, and then update the database and increment the version. If they are not, it means that there has been a change since the data was loaded.
You can do that with a single SQL call with something like :
UPDATE ... WHERE version = 'version_from_user';
This call will update the database only if the version is still the same.
Django 1.11 has three convenient options to handle this situation depending on your business logic requirements:
Something.objects.select_for_update() will block until the model become free
Something.objects.select_for_update(nowait=True) and catch DatabaseError if the model is currently locked for update
Something.objects.select_for_update(skip_locked=True) will not return the objects that are currently locked
In my application, which has both interactive and batch workflows on various models, I found these three options to solve most of my concurrent processing scenarios.
The "waiting" select_for_update is very convenient in sequential batch processes - I want them all to execute, but let them take their time. The nowait is used when an user wants to modify an object that is currently locked for update - I will just tell them it's being modified at this moment.
The skip_locked is useful for another type of update, when users can trigger a rescan of an object - and I don't care who triggers it, as long as it's triggered, so skip_locked allows me to silently skip the duplicated triggers.
For future reference, check out https://github.com/RobCombs/django-locking. It does locking in a way that doesn't leave everlasting locks, by a mixture of javascript unlocking when the user leaves the page, and lock timeouts (e.g. in case the user's browser crashes). The documentation is pretty complete.
You should probably use the django transaction middleware at least, even regardless of this problem.
As to your actual problem of having multiple users editing the same data... yes, use locking. OR:
Check what version a user is updating against (do this securely, so users can't simply hack the system to say they were updating the latest copy!), and only update if that version is current. Otherwise, send the user back a new page with the original version they were editing, their submitted version, and the new version(s) written by others. Ask them to merge the changes into one, completely up-to-date version. You might try to auto-merge these using a toolset like diff+patch, but you'll need to have the manual merge method working for failure cases anyway, so start with that. Also, you'll need to preserve version history, and allow admins to revert changes, in case someone unintentionally or intentionally messes up the merge. But you should probably have that anyway.
There's very likely a django app/library that does most of this for you.
Another thing to look for is the word "atomic". An atomic operation means that your database change will either happen successfully, or fail obviously. A quick search shows this question asking about atomic operations in Django.
The idea above
updated = Entry.objects.filter(Q(id=e.id) && Q(version=e.version))\
.update(updated_field=new_value, version=e.version+1)
if not updated:
raise ConcurrentModificationException()
looks great and should work fine even without serializable transactions.
The problem is how to augment the deafult .save() behavior as to not have to do manual plumbing to call the .update() method.
I looked at the Custom Manager idea.
My plan is to override the Manager _update method that is called by Model.save_base() to perform the update.
This is the current code in Django 1.3
def _update(self, values, **kwargs):
return self.get_query_set()._update(values, **kwargs)
What needs to be done IMHO is something like:
def _update(self, values, **kwargs):
#TODO Get version field value
v = self.get_version_field_value(values[0])
return self.get_query_set().filter(Q(version=v))._update(values, **kwargs)
Similar thing needs to happen on delete. However delete is a bit more difficult as Django is implementing quite some voodoo in this area through django.db.models.deletion.Collector.
It is weird that modren tool like Django lacks guidance for Optimictic Concurency Control.
I will update this post when I solve the riddle. Hopefully solution will be in a nice pythonic way that does not involve tons of coding, weird views, skipping essential pieces of Django etc.
To be safe the database needs to support transactions.
If the fields is "free-form" e.g. text etc. and you need to allow several users to be able to edit the same fields (you can't have single user ownership to the data), you could store the original data in a variable.
When the user committs, check if the input data has changed from the original data (if not, you don't need to bother the DB by rewriting old data),
if the original data compared to the current data in the db is the same you can save, if it has changed you can show the user the difference and ask the user what to do.
If the fields is numbers e.g. account balance, number of items in a store etc., you can handle it more automatically if you calculate the difference between the original value (stored when the user started filling out the form) and the new value you can start a transaction read the current value and add the difference, then end transaction. If you can't have negative values, you should abort the transaction if the result is negative, and tell the user.
I don't know django, so I can't give you teh cod3s.. ;)
From here:
How to prevent overwriting an object someone else has modified
I'm assuming that the timestamp will be held as a hidden field in the form you're trying to save the details of.
def save(self):
if(self.id):
foo = Foo.objects.get(pk=self.id)
if(foo.timestamp > self.timestamp):
raise Exception, "trying to save outdated Foo"
super(Foo, self).save()