Azure Storage Table [Python] - Batch does not fail - python-2.7

I am currently developing a Python (2.7) application that uses the Azure Table Storage service via the Azure Python package. As far as I have read in Azure's REST API documentation, batch operations are atomic transactions: if one of the operations fails, the whole batch fails and no operations are executed.
The problem that I have encountered is the following:
The method below receives via the "rows" parameter a list of dicts. Some have an ETag set (provided from a previous query).
If the ETag is set, a merge operation is attempted; otherwise, an insert operation is attempted. Since multiple processes may modify the same entity, concurrency is handled via the "if_match" parameter of the merge_entity function. If the merge/insert operations are issued individually (not inside a batch), the system works as expected, raising an exception when the ETags do not match.
Unfortunately, this does not happen if they are wrapped in "begin_batch" / "commit_batch" calls. The entities are merged (wrongly) even though the ETags DO NOT match.
I have provided below both the code and the test case used. I also ran several manual tests with the same conclusion.
I am unsure of how to approach this problem. Am I doing something wrong or is it an issue with the Python package?
The code used is the following:
def persist_entities(self, rows):
    success = True
    self._service.begin_batch()  # If commented out, works as expected (fails)
    for row in rows:
        print row
        etag = row.pop("ETag")
        if not etag:
            self._service.insert_entity(self._name,
                                        entity=row)
        else:
            print "Merging " + etag
            self._service.merge_entity(self._name,
                                       row["PartitionKey"],
                                       row["RowKey"],
                                       row, if_match=etag)
    try:  # Also tried with the try at the beginning of the code
        self._service.commit_batch()  # If commented out, works as expected (fails)
    except WindowsAzureError:
        print "Failed to merge"
        self._service.cancel_batch()
        success = False
    return success
The test case used:
def test_fail_update(self):
    service = self._conn.get_service()
    partition, new_rows = self._generate_data()  # Partition key and list of dicts
    success = self._wrapper.persist_entities(new_rows)  # Inserts a fresh new entity
    ok_(success)  # Insert succeeds
    rows = self._wrapper.get_entities_by_row(partition)  # Retrieves inserted data for the ETag
    eq_(len(rows), 1)
    for index in rows:
        row = rows[index]
        data = new_rows[0]
        data["Count"] = 155  # Same data, different value
        data["ETag"] = "random_etag"  # Change ETag to a random string
        val = self._wrapper.persist_entities([data])  # Try to merge
        ok_(not val)  # val = True for merge success, False for merge fail.
        # It's always True when the operations are in a batch, False when they are not.
        rows1 = self._wrapper.get_entities_by_row(partition)
        eq_(len(rows1), 1)
        eq_(rows1[index].Count, 123)
        break
def _generate_data(self):
    date = datetime.now().isoformat()
    partition = "{0}_{1}_{2}".format("1",
                                     Stats.RESOLUTION_DAY, date)
    data = {
        "PartitionKey": partition,
        "RowKey": "viewitem",
        "Count": 123,
        "ETag": None
    }
    return partition, [data]

That's a bug in the SDK (v0.8 and earlier). I've created an issue and checked in a fix. It will be part of the next release. You can pip install from the git repo to test the fix.
https://github.com/Azure/azure-sdk-for-python/issues/149
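For example, something along these lines should pull the current development version via pip's git+ support (adjust the branch or tag as needed; check the issue above for details):
pip install git+https://github.com/Azure/azure-sdk-for-python.git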

Azure Table Storage has a new Python library in preview release that is available for installation via pip. To install, use the following pip command:
pip install azure-data-tables
There are some differences between how you are performing batching and how batching works in the newest library.
A batch operation can only be committed on entities contained within the same partition key.
A batch operation cannot contain multiple operations on the same row key.
Batches will only be committed if the entire batch is successful; if send_batch raises an error, no updates will be made.
With this being stated, the create and update operations work the same as non-batched create and update operations. For example:
from azure.data.tables import TableClient, UpdateMode

table_client = TableClient.from_connection_string(conn_str, table_name="myTable")
batch = table_client.create_batch()
batch.create_entity(entity1)
batch.update_entity(entity2, etag=etag, match_condition=UpdateMode.MERGE)
batch.delete_entity(entity['PartitionKey'], entity['RowKey'])
try:
    table_client.send_batch(batch)
except BatchErrorException as e:
    print("There was an error with the batch")
    print(e)
For more samples with the new library, check out the samples page on the repository.
(FYI, I am a Microsoft employee on the Azure SDK for Python team)

Related

Efficient Bulk Update of Model Records in Django

I'm building a Django app that will periodically take information from an external source, and use it to update model objects.
What I want to be able to do is create a QuerySet which has all the objects that might match the final list, then check which model objects need to be created, updated, and deleted, and then (ideally) perform the update in the fewest number of transactions, without performing any unnecessary DB operations.
Using update_or_create gets me most of the way to what I want to do.
jobs = get_current_jobs(host, user)
for host, user, name, defaults in jobs:
    obj, _ = Job.objects.update_or_create(host=host, user=user, name=name, defaults=defaults)
The problem with this approach is that it doesn't delete anything that no longer exists.
I could just delete everything up front, or do something dumb like
to_delete = set(Job.objects.filter(host=host, user=user)) - set(current)
(Which is an option) but I feel like there must already be an elegant solution that doesn't require either deleting everything, or reading everything into memory.
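For context, the full flow I have in mind is roughly this sketch (the host/user bookkeeping is just illustrative):
jobs = get_current_jobs(host, user)

current = []
for job_host, job_user, name, defaults in jobs:
    obj, _ = Job.objects.update_or_create(
        host=job_host, user=job_user, name=name, defaults=defaults)
    current.append(obj)

# anything for this host/user that the external source no longer reports
stale = set(Job.objects.filter(host=host, user=user)) - set(current)
for obj in stale:
    obj.delete()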
You should use Redis for storage and use the redis Python package in your code. For example:
import json
import redis
import requests

pool = redis.StrictRedis('localhost')
time_in_seconds = 3600  # the time period you want to keep your data

response = requests.get("url_to_ext_source")
# Redis stores strings, so serialize the JSON payload before caching it
pool.set("api_response", json.dumps(response.json()), ex=time_in_seconds)

Multi-DB Transactions

Django Version 1.10.5 with Postgres 9.6.1
For the last year I've been working in a multi-schema default database environment. However things are beginning to grow to the point I've decided to split the single database into 3 databases.
I've got things working with a master/slave router for all 3 databases.
I am not using the 'default' database key. Instead I have 'db1', 'db2', and 'db3'
The part I am confused about is with transactions in this multi-database environment.
In this example it fails as expected, caused of course by not using @transaction.atomic(using='db1'), which is clear to me.
@transaction.atomic()
def edit(self, context):
    """Edit
    :param dict context: Context
    :return: None
    """
    # Check if employee exists
    try:
        result = Passport.objects.get(pk=self.user.employee_id)
    except Passport.DoesNotExist:
        return False
    result.name = context.get('name')
    result.save()
However I have this strange example, simply because I'm trying to understand... I would have expected this to fail but it does not:
@transaction.atomic(using='db1')
def edit(self, context):
    """Edit
    :param dict context: Context
    :return: None
    """
    # Check if employee exists
    try:
        result = Passport.objects.get(pk=self.user.employee_id)
    except Passport.DoesNotExist:
        return False
    result.name = context.get('name')
    with transaction.atomic(using='db2'):
        result.save()
The model Passport does not exist in DB2 models at all.
My router is set up so that all writes go to their respective DB.
So what is the purpose of setting using='db1' in the atomic transaction? I've looked at the source and I see it defaults to 'default' when "using" is not supplied.
In the above example I even made another transaction inside the initial transaction, but this time with using='db2', where the model doesn't even exist. I figured that would have failed, but it didn't, and the data was written to the proper database.
I bring this up because there will be situations where I need to interact with all 3 databases and if a single problem occurs when writing to all 3 databases, all 3 need to be rolled back or if on success of everything, then committed of course.
Perhaps someone can help break this down for me so I can understand?
You're interpreting transaction.atomic(using='X') to mean: run the following database commands on X, inside a transaction.
In fact, it just means: open a transaction on database X, and then either commit it or roll it back at the end of the block.
Or, as the documentation puts it:
Under the hood, Django’s transaction management code:
opens a transaction when entering the outermost atomic block;
commits or rolls back the transaction when exiting the outermost block.
The question of which database to use for a given command is determined by your router, not the using clause. So your transaction.atomic(using='db2') block is pointless (it will simply open a transaction on db2 and then close it), but not an error.
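If you do need all three databases to succeed or fail together, one rough sketch is to nest an atomic block per database (the write_* helpers below are placeholders, and note this is not a true two-phase commit: a failure during one of the final commits can still leave the databases out of sync):
from django.db import transaction

def edit_everything(context):
    # If anything raises before the outermost block exits, every
    # still-open transaction is rolled back; otherwise each block
    # commits as it exits.
    with transaction.atomic(using='db1'):
        with transaction.atomic(using='db2'):
            with transaction.atomic(using='db3'):
                write_db1_records(context)  # hypothetical helpers; your router
                write_db2_records(context)  # still decides where each query goes
                write_db3_records(context)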

Pulling data from datastore and converting it in Json in python(Google Appengine)

I am creating an application using Google App Engine, in which I fetch data from a website and store it in my database (Datastore). Now, whenever a user hits my application URL as "application_url\name=xyz&city=abc", I fetch the data from the DB and want to show it as JSON. Right now I am using a filter to fetch data based on the name and city, but I am getting [] as output and I don't know how to get data from this result. My code looks like this:
class MainHandler(webapp2.RequestHandler):
    def get(self):
        commodityname = self.request.get('veg', "Not supplied")
        market = self.request.get('market', "No market found with this name")
        self.response.write(commodityname)
        self.response.write(market)
        query = commoditydata.all()
        logging.info(commodityname)
        query.filter('commodity = ', commodityname)
        result = query.fetch(limit=1)
        logging.info(result)
and the db structure for "commoditydata" table is
class commoditydata(db.Model):
    commodity = db.StringProperty()
    market = db.StringProperty()
    arrival = db.StringProperty()
    variety = db.StringProperty()
    minprice = db.StringProperty()
    maxprice = db.StringProperty()
    modalprice = db.StringProperty()
    reporteddate = db.DateTimeProperty(auto_now_add=True)
Can anyone tell me how to get data from the DB using name and market and convert it to JSON? Getting the data from the DB is the higher priority. Any suggestions will be of great use.
If you are starting with a new app, I would suggest using the NDB API rather than the old DB API. Your code would look almost the same, though.
As far as I can tell from your code sample, the query should give you results, as long as the HTTP query parameters from the request match entity objects in the datastore.
I can think of some possible reasons for the empty result:
you only think the output is empty, because you use write() too early; App Engine doesn't support streaming responses, so you must write everything in one go, and you should do this after you have queried the datastore
the properties you are filtering are not indexed (yet) in the datastore, at least not for the entities you were looking for
the filters are just not matching anything (check the log for the values you got from the request)
your query uses a namespace different from where the data was stored in (but this is unlikely if you haven't explicitly set namespaces anywhere)
In the Cloud Developer Console you can query your datastore and even apply filters, so you can see the results without writing actual code.
Go to https://console.developers.google.com
On the left side, select Storage > Cloud Datastore > Query
Select the namespace (default should be fine)
Select the kind "commoditydata"
Add filters with example values you expect from the request and see how many results you get
Also look into Monitoring > Log which together with your logging.info() calls is really helpful to better understand what is going on during a request.
The conversion to JSON is rather easy once you have your data. In your request handler, create an empty list of dictionaries. For each object you get from the query result, build a dict whose keys are the properties you want to send and whose values are the values you got from the datastore. At the end, dump the list as a JSON string.
import json

class MainHandler(webapp2.RequestHandler):
    def get(self):
        commodityname = self.request.get('veg')
        market = self.request.get('market')
        if commodityname is None and market is None:
            self.response.out.write("Please supply filters!")
            # the request is complete after this:
            return
        # everything ok, try the query:
        query = commoditydata.all()
        logging.info(commodityname)
        query.filter('commodity = ', commodityname)
        result = query.fetch(limit=1)
        logging.info(result)
        # now build the JSON payload for the response
        dicts = []
        for match in result:
            # reporteddate is a datetime, which json.dumps cannot serialize directly
            dicts.append({'market': match.market,
                          'reporteddate': match.reporteddate.isoformat()})
        # set the appropriate header of the response:
        self.response.headers['Content-Type'] = 'application/json; charset=utf-8'
        # convert everything into a JSON string
        jsonString = json.dumps(dicts)
        self.response.out.write(jsonString)

python code for directory api to batch retrieve all users from domain

Currently I have a method that retrieves all ~119,000 Gmail accounts and writes them to a CSV file using the Python code below, with the Admin SDK enabled and OAuth 2.0:
def get_accounts(self):
    students = []
    page_token = None
    params = {'customer': 'my_customer'}
    while True:
        try:
            if page_token:
                params['pageToken'] = page_token
            current_page = self.dir_api.users().list(**params).execute()
            students.extend(current_page['users'])
            # write each page of data to a file
            csv_file = CSVWriter(students, self.output_file)
            csv_file.write_file()
            # clear the list for the next page of data
            del students[:]
            page_token = current_page.get('nextPageToken')
            if not page_token:
                break
        except errors.HttpError as error:
            break
I would like to retrieve all 119,000 accounts as a lump sum, that is, without having to loop, or as a batch call. Is this possible, and if so, can you provide example Python code? I have run into communication issues and have to rerun the process multiple times to obtain the ~119,000 accounts successfully (it takes about 10 minutes to download). I would like to minimize communication errors. Please advise if a better method exists or if a non-looping method is possible.
There's no way to do this as a batch because you need to know each pageToken and those are only given as the page is retrieved. However, you can increase your performance somewhat by getting larger pages:
params = {'customer': 'my_customer', 'maxResults': 500}
since the default page size when maxResults is not set is 100, adding maxResults: 500 will reduce the number of API calls by a factor of 5. While each call may take slightly longer, you should notice a performance increase because you're making far fewer API calls and HTTP round trips.
You should also look at using the fields parameter to only specify user attributes you need to read in the list. That way you're not wasting time and bandwidth retrieving details about your users that your app never uses. Try something like:
my_fields = 'nextPageToken,users(primaryEmail,name,suspended)'
params = {
    'customer': 'my_customer',
    'maxResults': 500,
    'fields': my_fields
}
Last of all, if your app retrieves the list of users fairly frequently, turning on caching may help.
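As a rough illustration of that last point (only a sketch; it assumes you already have OAuth 2.0 credentials from your existing flow and uses httplib2's simple file cache):
import httplib2
from googleapiclient.discovery import build

# 'credentials' is assumed to come from your existing OAuth 2.0 flow
http = credentials.authorize(httplib2.Http(cache=".api_cache"))
dir_api = build('admin', 'directory_v1', http=http)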

Django fixture fails, stating "DatabaseError: value too long for type character varying(50)"

I have a fixture (json) which loads in development environment but fails to do so in server environment. The error says: "DatabaseError: value too long for type character varying(50)"
My development environment is Windows & Postgres 8.4. The server runs Debian and Postgres 8.3. Database encoding is UTF8 in both systems.
It is as if unicode markers in the fixture count as chars on the server and cause some strings to exceed their field's max length. However, that does not happen in the dev environment.
Update: the 50 char limit is now 255 in Django 1.8
--
Original answer:
I just encountered this this afternoon, too, and I have a fix (of sorts)
This post here implied it's a Django bug to do with length of the value allowed for auth_permission. Further digging backs up that idea, as does this Django ticket (even though it's initially MySQL-related).
It's basically that a permission name is created based on the verbose_name of a model plus a descriptive permission string, and that can overflow to more than the 50 chars allowed in auth.models.Permission.name.
To quote a comment on the Django ticket:
The longest prefixes for the string value in the column auth_permission.name are "Can change " and "Can delete ", both with 11 characters. The column maximum length is 50 so the maximum length of Meta.verbose_name is 39.
One solution would be to hack that column to support > 50 characters (ideally via a South migration, I say, so that it's easily repeatable) but the quickest, most reliable fix I could think of was simply to make my extra-long verbose_name definition a lot shorter (from 47 chars in the verbose_name to around 20). All works fine now.
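To illustrate the quick fix (a sketch only; the model and its verbose_name are made up), the change amounts to keeping Meta.verbose_name short enough that the generated permission names fit in 50 characters:
from django.db import models

class ExtremelyDescriptivelyNamedThing(models.Model):  # hypothetical model
    class Meta:
        # keep this at 39 characters or fewer so "Can change <verbose_name>"
        # and "Can delete <verbose_name>" stay within the 50-character limit
        verbose_name = "descriptive thing"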
Well, what makes the difference is the encoding of the template databases. On the production server they had ASCII encoding, while on the dev box it is UTF-8.
By default Postgres creates a database using template1. My understanding is that if its encoding is not UTF-8, then the database you create will have this issue, even if you create it with UTF-8 encoding.
Therefore I dropped it and recreated it with its encoding set to UTF8. The snippet below does it (taken from here):
psql -U postgres
UPDATE pg_database SET datallowconn = TRUE where datname = 'template0';
\c template0
UPDATE pg_database SET datistemplate = FALSE where datname = 'template1';
drop database template1;
create database template1 with template = template0 encoding = 'UNICODE';
UPDATE pg_database SET datistemplate = TRUE where datname = 'template1';
\c template1
UPDATE pg_database SET datallowconn = FALSE where datname = 'template0';
Now the fixture loads smoothly.
Get the real SQL query on both systems and see what is different.
Just for information: I also had this error:
DatabaseError: value too long for type character varying(10)
It seems that I was writing data over the limit of 10 for a field. I fixed it by increasing the size of the CharField from 10 to 20.
I hope it helps
As @stevejalim says, it's quite possible that the column auth_permission.name is the problem, with its length of 50; you can verify this with \d+ auth_permission in Postgres's shell. In my case this was the problem: when I loaded my Django model fixtures I got "DatabaseError: value too long for type character varying(50)". Since changing django.contrib.auth's Permission model is complicated, the simple solution was to migrate the Permission table directly by running ALTER TABLE auth_permission ALTER COLUMN name TYPE VARCHAR(100); in Postgres's shell. This works for me.
credits for this comment
You can make Django use longer fields for this model by monkey-patching the model prior to using it to create the database tables. In "manage.py", change:
if __name__ == "__main__":
execute_manager(settings)
to:
from django.contrib.auth.models import Permission

if __name__ == "__main__":
    # Patch the field width to allow for our long model names
    Permission._meta.get_field('name').max_length = 200
    Permission._meta.get_field('codename').max_length = 200
    execute_manager(settings)
This modifies the options on the field before (say) manage.py syncdb is run, so the database table has nice wide varchar() fields. You don't need to do this when invoking your app, as you never attempt to modify the Permissions table while running.