Django: How to depend on an externally ID, which can be switched? - django

Consider the following scenario:
Our Django database objects must rely on IDs that are provided by external service A (ESA) - this is because we use this ID to pull the information about objects that aren't created yet from the external directly. ESA might shut down soon, so we also pull information about the same objects from external service B (ESB), and save them as a fallback.
Because these IDs are relied on heavily in views and URLs, the ideal scenario would be to use a #property:
#property
dynamic_id = ESA_id
And then, if ESA shuts down, we can switch easily by changing dynamic_id to ESB_id. The problem with this though, is that properties cannot be used in queryset filters and various other scenarios, which is also a must in this case.
My current thought is to just save ESA_id, ESB_id, and dynamic_ID as regular fields separately and assign dynamic_ID = ESA_id, and then, in case ESA shuts down, simply go over the objects and do dynamic_ID = ESB_id.
But I feel there must be a better way?

Having ESA_id and ESB_id fields in the same table is a good solution, then you have some kind of setting (DEFAULT_SERVICE_ID='ESA_id'|'ESB_id') and your code change the lookup based on this option.
Here you can see an aproach to create filters dynamicly
https://stackoverflow.com/a/310785/1448667

Related

Doctrine specify logical name for JoinColumn [duplicate]

I'm working on an events site and have a one to many relationship between a production and its performances, when I have a performance object if I need its production id at the moment I have to do
$productionId = $performance->getProduction()->getId();
In cases when I literally just need the production id it seems like a waste to send off another database query to get a value that's already in the object somewhere.
Is there a way round this?
Edit 2013.02.17:
What I wrote below is no longer true. You don't have to do anything in the scenario outlined in the question, because Doctrine is clever enough to load the id fields into related entities, so the proxy objects will already contain the id, and it will not issue another call to the database.
Outdated answer below:
It is possible, but it is unadvised.
The reason behind that, is Doctrine tries to truly adhere to the principle that your entities should form an object graph, where the foreign keys have no place, because they are just "artifacts", that come from the way relational databases work.
You should rewrite the association to be
eager loaded, if you always need the related entity
write a DQL query (preferably on a Repository) to fetch-join the related entity
let it lazy-load the related entity by calling a getter on it
If you are not convinced, and really want to avoid all of the above, there are two ways (that I know of), to get the id of a related object, without triggering a load, and without resorting to tricks like reflection and serialization:
If you already have the object in hand, you can retrieve the inner UnitOfWork object that Doctrine uses internally, and use it's getEntityIdentifier() method, passing it the unloaded entity (the proxy object). It will return you the id, without triggering the lazy-load.
Assuming you have many-to-one relation, with multiple articles belonging to a category:
$articleId = 1;
$article = $em->find('Article', $articleId);
$categoryId = $em->getUnitOfWork()->getEntityIdentifier($article->getCategory());
Coming 2.2, you will be able to use the IDENTITY DQL function, to select just a foreign key, like this:
SELECT IDENTITY(u.Group) AS group_id FROM User u WHERE u.id = ?0
It is already committed to the development versions.
Still, you should really try to stick to one of the "correct" methods.

Secure-by-default django ORM layer---how?

I'm running a Django shop where we serve each our clients an object graph which is completely separate from the graphs of all the other clients. The data is moderately sensitive, so I don't want any of it to leak from one client to another, nor for one client to delete or alter another client's data.
I would like to structure my code such that I by default write code which adheres to the security requirements (No hard guarantees necessary), but lets me override them when I know I need to.
My main fear is that in a Twig.objects.get(...), I forget to add client=request.client, and likewise for Leaf.objects.get where I have to check that twig__client=request.client. This quickly becomes error-prone and complicated.
What are some good ways to get around my own forgetfulness? How do I make this a thing I don't have to think about?
One candidate solution I have in mind is this:
Set the default object manager as DANGER = models.Manager() on my abstract base class(es).
Have a method ok(request) on said base classes which applies .filter(leaf__twig__branch__trunk__root__client=request.client) as applicable.
use MyModel.ok(request) instead of MyModel.objects wherever feasible.
Can this be improved upon? One not so nice issue is when a view calls a model method, e.g. branch.get_twigs_with_fruit, I now have to either pass a request for it to run through ok or I have to invoke DANGER. I like neither :-\
Is there some way of getting access to the current request? I think that might mitigate the situation...
Ill explain a different problem I had however I think the solution might be something to look into.
Once I was working on a project to visualize data where I needed to have a really big table which will store all the data for all visualizations. That turned out to be a big problem because I would have to do things like Model.objects.filter(visualization=5) which was just not very elegant and not efficient.
To make things simpler and more efficient I ended up creating dynamic models on the fly. Essentially I would create a separate table in the db on the fly and then store a data only for that one visualization in that. My code is something like:
def get_model_class(table_name):
class ModelBase(ModelBase):
def __new__(cls, name, bases, attrs):
name = '{}_{}'.format(name, table_name)
return super(ModelBase, cls).__new__(cls, name, bases, attrs)
class Data(models.Model):
# fields here
__metaclass__ = ModelBase
class Meta(object):
db_table = table_name
return Data
dynamic_model = get_model_class('foo')
This was useful for my purposes because it allowed queries to be much faster but getting back to your issue I think something like this can be useful because this will make sure that each client's data is separate not only via a foreign key, but is actually separated in the db.
Using this method is pretty straight forward except before using the model, you have to call the function to get it for each client. To make things more efficient you can cache/memoize the results of the function call so that it does not have to recompute the same thing more than once.

read objects persisted but not yet flushed with doctrine

I'm new to symfony2 and doctrine.
here is the problem as I see it.
i cannot use :
$repository = $this->getDoctrine()->getRepository('entity');
$my_object = $repository->findOneBy($index);
on an object that is persisted, BUT NOT FLUSHED YET !!
i think getRepository read from DB, so it will not find a not-flushed object.
my question: how to read those objects that are persisted (i think they are somewhere in a "doctrine session") to re-use them before i do flush my entire batch ?
every profile has 256 physical plumes.
every profile has 1 plumeOptions record assigned to it.
In plumeOptions, I have a cartridgeplume which is a FK for PhysicalPlume.
every plume is identified by ID (auto-generated) and an INDEX (user-generated).
rule: I say profile 1 has physical_plume_index number 3 (=index) connected to it.
now, I want to copy a profile with all its related data to another profile.
new profile is created. New 256 plumes are created and copied from older profile.
i want to link the new profile to the new plume index 3.
check here: http://pastebin.com/WFa8vkt1
I think you might want to have a look at this function:
$entityManager->getUnitOfWork()->getScheduledEntityInsertions()
Gives you back a list of entity objects which are persisting yet.
Hmm, I didn't really read your question well, with the above you will retrieve a full list (as an array) but you cannot query it like with getRepository. I will try found something for u..
I think you might look at the problem from the wrong angle. Doctrine is your persistance layer and database access layer. It is the responsibility of your domain model to provide access to objects once they are in memory. So the problem boils down to how do you get a reference to an object without the persistance layer?
Where do you create the object you need to get hold of later? Can the method/service that create the object return a reference to the controller so it can propagate it to the other place you need it? Can you dispatch an event that you listen to elsewhere in your application to get hold of the object?
In my opinion, Doctrine should be used at the startup of the application (as early as possible), to initialize the domain model, and at the shutdown of the application, to persist any changes to the domain model during the request. To use a repository to get hold of objects in the middle of a request is, in my opinion, probably a code smell and you should look at how the application flow can be refactored to remove that need.
Your is a business logic problem effectively.
Querying down the Database a findby Query on Object that are not flushed yet, means heaving much more the DB layer querying object that you have already in your function scope.
Also Keep in mind a findOneBy will retrieve also other object previously saved with same features.
If you need to find only among those new created objects, you should make f.e. them in a Session Array Variable, and iterate them with the foreach.
If you need a mix of already saved items + some new items, you should threate the 2 parts separately, one with a foreach , other one with the repository query!

Running multiple sites on the same python process

In our company we make news portals for a pretty big number of local newspapers (currently 13, going to 30 next month and more in the future), each with 2k to 100k page views/day. Since we are evolving from a situation where each site was heavily customized to one where each difference is a matter of configuration or custom template, our software is already pretty much the same for all sites. Right now our deployment strategy is one gunicorn instance for each site (with 1-17 workers each, depending on the site traffic), on a 16-core server and 12GB RAM. The problem with this setup is that each worker (regular pre-forked gunicorn) takes 110MB, whether its being used or not. Now with the new sites we would need to add more RAM to serve not that much many requests, so basically it doesn't scale. Also, since we are moving from this model where each site is independent, each site has its own database and I quite like it that way, especially since we are using relational databases (mysql, but migrating to pgsql), so its much easier to shard this way.
I'm doing some research and experimenting with running all sites on one gunicorn instance, so I could use the servers fully and add more servers behind a load balancer when it came to it. The problem is that django assumes in a lot of places that only one site is running per process, so for what I've thought of so far I'd have to implement:
A middleware that takes the HTTP_HOST from the request and places an identifier on a threadlocal variable.
A template loader that uses that variable to load custom templates accordingly.
Monkey patch django.db.model.Model, probably adding a metaclass (not even sure that's possible, but I think I would need it because of the custom managers we sometimes need to use) that would overwrite the managers for one that would first call db_manager(identifier) on the original manager and then call the intended method. I would also need to overwrite the save and delete methods to always include the using=identifier parameter.
I guess I would need to stop using inclusion_tag decorators, not a big problem, but I need to think of other cases like this.
Heavy and ugly patching of urlresolvers if I need custom or extra urls for each site. I don't need them now, but probably will at some point.
And this is just is what I came up with without even implementing it and seeing where it breaks, I'm sure I'd need many more changes for it to work. So I really don't want to do it, especially with the extra maintenance effort I'll need, but I don't see any alternatives and would love to learn that someone already solved this in a better way. Of course I could also stop using django altogether (I already have many reasons to do so) but that would mean a major rewrite and having two maintain two incompatible branches of the software until the new one reached feature parity with the django version, so to me it seems even worse than all the ugly hacks.
I've recently developed an e-commerce system with similar requirements -- many instances running from the same project sharing almost everything. The previous version of the system was a bunch of independent installations (~30) so it was pretty unmaintainable. I'm sure the requirements still differ from yours (for example, all instances shared the same models in my case), but it still might be useful to share my experience.
You are right that Django doesn't help with scenarios like this out of the box, but it's actually surprisingly easy to work it around. Here is a brief description of what I did.
I could see a synergy between what I wanted to achieve and django.contrib.sites. Also because many third-party Django apps out there know how to work with it and use it, for example, to generate absolute URLs to the current site. The major problem with sites is that it wants you to specify the current site id in settings.SITE_ID, which a very naive approach to the multi host problem. What one naturally wants, and what you also mention, is to determine the current site from the Host request header. To fix this problem, I borrowed the hook idea from django-multisite: https://github.com/shestera/django-multisite/blob/master/multisite/threadlocals.py#L19
Next I created an app encapsulating all the functionality related to the multi host aspect of my project. In my case the app was called stores and among other things it featured two important classes: stores.middleware.StoreMiddleware and stores.models.Store.
The model class is a subclass of django.contrib.sites.models.Site. The good thing about subclassing Site is that you can pass a Store to any function where a Site is expected. So you are effectively still just using the old, well documented and tested sites framework. To the Store class I added all the fields needed to configure all the different stores. So it's got fields like urlconf, theme, robots_txt and whatnot.
The middleware class' function was to match the Host header with the corresponding Store instance in the database. Once the matching Store was retrieved, It would patch the SITE_ID in a way similar to https://github.com/shestera/django-multisite/blob/master/multisite/middleware.py. Also, it looked at the store's urlconf and if it was not None, it would set request.urlconf to apply its special URL requirements. After that, the current Store instance was stored in request.store. This has proven to be incredibly useful, because I was able to do things like this in my views:
def homepage(request):
featured = Product.objects.filter(featured=True, store=request.store)
...
request.store became a natural additional dimension of the request object throughout the project for me.
Another thing that was defined on the Store class was a function get_absolute_url whose implementation looked roughly like this:
def get_absolute_url(self, to='/'):
"""
Return an absolute url to this `Store` or to `to` on this store.
The URL includes http:// and the domain name of the store.
`to` can be an object with `get_absolute_url()` or an absolute path as string.
"""
if isinstance(to, basestring):
path = to
elif hasattr(to, 'get_absolute_url'):
path = to.get_absolute_url()
else:
raise ValueError(
'Invalid argument (need a string or an object with get_absolute_url): %s' % to
)
url = 'http://%s%s%s' % (
self.domain,
# This setting allowed for a sane development environment
# where I just set it to ".dev:8000" and configured `dnsmasq`.
# The same value was also removed from the `Host` value in the middleware
# before looking up the `Store` in database.
settings.DOMAIN_SUFFIX,
path
)
return url
So I could easily generate URLs to objects on other than the current store, e.g.:
# Redirect to `product` on `store`.
redirect(store.get_absolute_url(product))
This was basically all I needed to be able to implement a system allowing users to create a new e-shop living on its own domain via the Django admin.

Django: How can I protect against concurrent modification of database entries

If there a way to protect against concurrent modifications of the same data base entry by two or more users?
It would be acceptable to show an error message to the user performing the second commit/save operation, but data should not be silently overwritten.
I think locking the entry is not an option, as a user might use the "Back" button or simply close his browser, leaving the lock for ever.
This is how I do optimistic locking in Django:
updated = Entry.objects.filter(Q(id=e.id) && Q(version=e.version))\
.update(updated_field=new_value, version=e.version+1)
if not updated:
raise ConcurrentModificationException()
The code listed above can be implemented as a method in Custom Manager.
I am making the following assumptions:
filter().update() will result in a single database query because filter is lazy
a database query is atomic
These assumptions are enough to ensure that no one else has updated the entry before. If multiple rows are updated this way you should use transactions.
WARNING Django Doc:
Be aware that the update() method is
converted directly to an SQL
statement. It is a bulk operation for
direct updates. It doesn't run any
save() methods on your models, or emit
the pre_save or post_save signals
This question is a bit old and my answer a bit late, but after what I understand this has been fixed in Django 1.4 using:
select_for_update(nowait=True)
see the docs
Returns a queryset that will lock rows until the end of the transaction, generating a SELECT ... FOR UPDATE SQL statement on supported databases.
Usually, if another transaction has already acquired a lock on one of the selected rows, the query will block until the lock is released. If this is not the behavior you want, call select_for_update(nowait=True). This will make the call non-blocking. If a conflicting lock is already acquired by another transaction, DatabaseError will be raised when the queryset is evaluated.
Of course this will only work if the back-end support the "select for update" feature, which for example sqlite doesn't. Unfortunately: nowait=True is not supported by MySql, there you have to use: nowait=False, which will only block until the lock is released.
Actually, transactions don't help you much here ... unless you want to have transactions running over multiple HTTP requests (which you most probably don't want).
What we usually use in those cases is "Optimistic Locking". The Django ORM doesn't support that as far as I know. But there has been some discussion about adding this feature.
So you are on your own. Basically, what you should do is add a "version" field to your model and pass it to the user as a hidden field. The normal cycle for an update is :
read the data and show it to the user
user modify data
user post the data
the app saves it back in the database.
To implement optimistic locking, when you save the data, you check if the version that you got back from the user is the same as the one in the database, and then update the database and increment the version. If they are not, it means that there has been a change since the data was loaded.
You can do that with a single SQL call with something like :
UPDATE ... WHERE version = 'version_from_user';
This call will update the database only if the version is still the same.
Django 1.11 has three convenient options to handle this situation depending on your business logic requirements:
Something.objects.select_for_update() will block until the model become free
Something.objects.select_for_update(nowait=True) and catch DatabaseError if the model is currently locked for update
Something.objects.select_for_update(skip_locked=True) will not return the objects that are currently locked
In my application, which has both interactive and batch workflows on various models, I found these three options to solve most of my concurrent processing scenarios.
The "waiting" select_for_update is very convenient in sequential batch processes - I want them all to execute, but let them take their time. The nowait is used when an user wants to modify an object that is currently locked for update - I will just tell them it's being modified at this moment.
The skip_locked is useful for another type of update, when users can trigger a rescan of an object - and I don't care who triggers it, as long as it's triggered, so skip_locked allows me to silently skip the duplicated triggers.
For future reference, check out https://github.com/RobCombs/django-locking. It does locking in a way that doesn't leave everlasting locks, by a mixture of javascript unlocking when the user leaves the page, and lock timeouts (e.g. in case the user's browser crashes). The documentation is pretty complete.
You should probably use the django transaction middleware at least, even regardless of this problem.
As to your actual problem of having multiple users editing the same data... yes, use locking. OR:
Check what version a user is updating against (do this securely, so users can't simply hack the system to say they were updating the latest copy!), and only update if that version is current. Otherwise, send the user back a new page with the original version they were editing, their submitted version, and the new version(s) written by others. Ask them to merge the changes into one, completely up-to-date version. You might try to auto-merge these using a toolset like diff+patch, but you'll need to have the manual merge method working for failure cases anyway, so start with that. Also, you'll need to preserve version history, and allow admins to revert changes, in case someone unintentionally or intentionally messes up the merge. But you should probably have that anyway.
There's very likely a django app/library that does most of this for you.
Another thing to look for is the word "atomic". An atomic operation means that your database change will either happen successfully, or fail obviously. A quick search shows this question asking about atomic operations in Django.
The idea above
updated = Entry.objects.filter(Q(id=e.id) && Q(version=e.version))\
.update(updated_field=new_value, version=e.version+1)
if not updated:
raise ConcurrentModificationException()
looks great and should work fine even without serializable transactions.
The problem is how to augment the deafult .save() behavior as to not have to do manual plumbing to call the .update() method.
I looked at the Custom Manager idea.
My plan is to override the Manager _update method that is called by Model.save_base() to perform the update.
This is the current code in Django 1.3
def _update(self, values, **kwargs):
return self.get_query_set()._update(values, **kwargs)
What needs to be done IMHO is something like:
def _update(self, values, **kwargs):
#TODO Get version field value
v = self.get_version_field_value(values[0])
return self.get_query_set().filter(Q(version=v))._update(values, **kwargs)
Similar thing needs to happen on delete. However delete is a bit more difficult as Django is implementing quite some voodoo in this area through django.db.models.deletion.Collector.
It is weird that modren tool like Django lacks guidance for Optimictic Concurency Control.
I will update this post when I solve the riddle. Hopefully solution will be in a nice pythonic way that does not involve tons of coding, weird views, skipping essential pieces of Django etc.
To be safe the database needs to support transactions.
If the fields is "free-form" e.g. text etc. and you need to allow several users to be able to edit the same fields (you can't have single user ownership to the data), you could store the original data in a variable.
When the user committs, check if the input data has changed from the original data (if not, you don't need to bother the DB by rewriting old data),
if the original data compared to the current data in the db is the same you can save, if it has changed you can show the user the difference and ask the user what to do.
If the fields is numbers e.g. account balance, number of items in a store etc., you can handle it more automatically if you calculate the difference between the original value (stored when the user started filling out the form) and the new value you can start a transaction read the current value and add the difference, then end transaction. If you can't have negative values, you should abort the transaction if the result is negative, and tell the user.
I don't know django, so I can't give you teh cod3s.. ;)
From here:
How to prevent overwriting an object someone else has modified
I'm assuming that the timestamp will be held as a hidden field in the form you're trying to save the details of.
def save(self):
if(self.id):
foo = Foo.objects.get(pk=self.id)
if(foo.timestamp > self.timestamp):
raise Exception, "trying to save outdated Foo"
super(Foo, self).save()