Google App Engine - Add & Reflect pattern, working around eventual consistency - django

I'm building a web application on app engine.
In my case, that's built on django-nonrel, but the key point is that it's using Google's datastore.
I love the fact that I don't need to deal with replication, sharding, backups and such, but one thing that is always getting in my way is the eventual consistency, which seems to get in the way of implementing a common web app pattern which I'm calling "Add & Reflect".
Let's say I have a project management app. The Project is its central model.
Now there's a web page page where I see a list of all projects, can add a project, and then I'll reflect back the list of all projects, which should include the project I just added (assuming no errors).
So the pattern goes like this:
Get and display list of existing projects
User adds new project (using a form on that page)
New project is created
As a response, get and display list of existing projects (now includes the new project)
Now the thing is, that due to eventual consistency, there is no guarantee whatsoever that I will get that new project when I get a list of all projects right after adding a new project.
Now that would be fine if this momentary inconsistency happened when another request (e.g. another user: user B) requested the list of projects one second after the project was added by the first user (user A), but it's really a problem when user A performs an operation, and does not see the results of his action, therefore does not get feedback.
I have gotten used to doing something like this to work around this problem:
def create_project(request):
response_context = {}
new_project = Project(name=request.POST['name'])
project.save()
response_context['projects'] = Project.get_serialized_projects()
# on GAE, eventual consistency means we are not guaranteed to see the
# new projects while querying for all projects, therefore we might need
# to add it manually...
if project.serialize() not in response_context['projects']:
response_context['projects'].append(project.serialize())
return render('projects.html', response_context)
The problem is that this happens in many places in my code, so I'm thinking maybe I'm missing something there, since this pattern is such a basic web app pattern.
Any suggestions for other ways to handle this?

Yes its a common issue. No theres no magic fix.
From client-side once you know the commit succeeded you can save the item locally (globals or storage) and then when querying from datastore merge your saved data. Put an expiration on it so its temporary. Its not trivial to make it work in all cases (say added an item then removed/renamed it so also update cache etc).
From server-side its common to cache recent saves in memcache and also merge with your queries.

Related

Using non-trivial REST API and Ember

We're new to Ember, and our intended (ember-cli) app first works by opening a project (which we can think of a JSON object), and then acting on various sections of that project with various functions. We have this "pick your project first" approach neatly encapsulated in a Django REST api structure, e.g.
/projects/ lists all projects
/projects/1/ gives information about project 1
/projects/1/sectionA/ list all elements in sectionA of project 1
/projects/1/sectionA/2/ gives information about element 2 of sectionA in project 1
/projects/1/sectionA/2/sectionB/... and so on.
We made relatively good progress with the first two points in Ember using ember-data and this.store('project').find(...) etc. However, we've come unstuck trying to add further to our url (e.g. points 3., 4., and 5.). I believe our issues come from routing and handling multiple models (e.g. project and sectionA).
The question: what is the best way to structure the routes in Ember.js to match a non-trivial REST API, and use ember-data similarly?
Comments:
the "Ember way", and stuff working out of the box is preferred. Custom adapters and .getJSON might work, but we're not sure if we'll then lose out on what Ember offers.
we want the choice of project to affect the main app template. E.g. if a project does not have "sectionA", then a link to "sectionA" is not displayed in the main app. And, if the project does have "sectionA", we need the link to be to e.g. "/project/1/sectionA", i.e. dependant on the project open.
This seems similar to handling users (i.e. first I must "pick a user" and then continue), where the problem is solved outside of the URL (and is similar to using sessions as we have done in the past). However, we specifically want the project ID to be inside the URL, to remain stateless.
Bonus questions (if relevant):
how would we structure the models? Do we need to use hasMany/belongsTo and, if so, is this equivalent to just loading the whole project JSON in the first place?
can ember-data handle such complex requests? I.e. "give me item 2 from sectionA of project 1"? Can it do this "in one go", or do there have to be nested queries (i.e. "first give me project 1" and then from this "give me sectionA" and then from this "give me item 1")?
Finally, apologies if this is documented well somewhere. We've spent nearly a week trying to figure this out and have tried our best to find resources -- it's possible we just don't know what we're looking for.
I think this one will be a good thing to read: discuss.emberjs.com/t/… - you've got Tom Dale and Stefan Penner involved in the thread
My suggestion would be to change it to query params:
/projects?id=1&selectionA=a&selectionB=b
then, you won't have such problems. And yes, you can still use all the hasMany and belongsTo fields.
If there's anything unclear, I'll provide you with a longer answer (if I'm able to).
Check out ember-api-actions and ember-data-actions also ember-data-url-templates
Here's a few more resources from a blog I found. ember-data-working-with-custom-api-endpoints and ember-data-working-with-nested-api-resources

Sharing data across Sitecore pipelines

I´m trying to perform some actions in the pipeline "httpRequestBegin" only when necessary.
My processor is executed after Sitecore resolves the user (processor type="Sitecore.Pipelines.HttpRequest.UserResolver, Sitecore.Kernel" ), as i´m resolving the user too if Sitecore is not able to resolve it first.
Later, i want to add some rendering in the pipeline "insertRenderings", only if actions in the previous pipeline were executed (If i resolved the user, show a message), so i´m trying to save some "flag" in the first step, to check in the second.
My question is, where can I store that flag? I´m trying to find some kind of "per request" cache...
So far, I've tried:
The session: Wrong, it's too early, session doesn't exists yet.
Items (HttpContext.Current.Items): It doesn't work either, my item is not there on the seconds step.
So far i'm using the application cache (HttpContext.Current.Cache) with some unique key, but I don´t like this solution.
Anybody body knows a better approach to share this "flag"?
You could add a flag to the request header and then check it's existence in the latter pipelines, e.g.
// in HttpRequest pipeline
HttpContext.Current.Request.Headers.Add("CustomUserResolve", "true");
// in InsertRenderings pipeline
var customUserResolve = HttpContext.Current.Request.Headers["CustomUserResolve"];
if (Sitecore.MainUtil.GetBool(customUserResolve, false))
{
// custom logic goes here
}
This feels a little dirty, I think adding to Request.QueryString or Request.Params would been nicer but those are readonly. However, if you only need this for a one time deal (i.e. only the first time it is resolved) then it will work since in the next request the Headers are back to default without your custom header added.
HttpContext.Current.Cache or HttpRuntime.Cache could be the fastest solution here. Though this approach would not preserve data when the AppPool gets recycled.
If you add only a few keys to the cache and then maintain them, this solution might work for you. If each request puts an entry into the cache, it may eventually overflow the memory used by worker process in a long run.
As alternative to this you may try to use Sitecore.Context.ClientData property. It uses ClientDataStore that employs a database (look for clientDataStore section in the web.config file) to store data. These entries can survive the AppPool recycle.
Though if you use them a lot, it may become a bottleneck under the load when you need to write to and/or read from the entries.
If you do know that there could be a lot of entries created for sharing purposes, I'd create a scheduled task to clean up the data store from obsolete entries.
I know this is a very old question, but I just want post solution I worked around
Below will hold data per http request basis.
HttpContext.Current.Items["ModuleInfo"] = "Custom Module Info"
we can store data to httpcontext in one sitecore pipeline and retrieve in another...
https://www.codeproject.com/Articles/146455/When-Can-We-Use-HttpContext-Current-Items-to-Store

In Django, how can I tell my project to read/write in distinct log directories at runtime?

I am working on a django project that, along with having a database for its models and relations, writes to a log directory called activity_logs outside of the project directory to keep track of formatted user activities, one file for each user. This is an alternative file-structure-based solution to having a database table carry this information along, because this offloads some storage from the DB and is relatively easy to format and express such activities. Perhaps some of you may recommend storing this kind of data in the database, which is fine, but I still believe there is question from all of this that I need help answering.
This django project has multiple apps that have an extensive test suite, one for each app. Additionally, there is a logging.py file that encapsulates the logging functionality (writing/reading activities to/from log files), and so both the test cases within the test suites as well as the view functions (and various other utility functions) all utilize these logging functions in order to store these user activities and retrieve them based on model relationships to emulate a user notification system. Since the logging module takes care of this logging, it needs to know where to write to, and so we have a directory structure called activity_logs to which it writes user log files, creating one for a new user and deleting one for a user removed from the database. One of the newest changes we would like to make in this project is to create a separate logging directory for testing this logging functionality, something like test_activity_logs, so that it would never be confused when writing to the test directory for test users or the regular activity log directory for real users.
My problem is this: at runtime, how can I tell the system, at whichever startpoint of execution (whether it be from a view function call through the django test Client object, a test case, an actual HTTPrequest made via a URL, etc.), when to look inside the activity_logs or the test_activity_logs directory? It solely depends on whether I am generating new information for a real user or a test user, but a User is a User in our system, and I'm facing some trouble trying to tell these functions that call some logging functions to write to the test log directory vs. the regular one. For example, one approach I am trying is to pass a keyword argument (kwarg) to the logging functions so that they can be made aware of which directory to read/write to/from, like so:
self.assertTrue(activity_has_been_logged(ACTIVITY_ACCOUNT_CREATED, user.get_profile(), use_test_activity_log_directory=True) == True)
the kwarg called use_test_activity_log_directory=True will tell the logging function called activity_has_been_logged to read the test activity log directory. Unfortunately, apart from being a little inflexible (but tolerable), this doesn't solve the situation where the django test client object sends a GET or POST request via a URL to a view function that writes activities to log files:
response = client.post(propose_match_url, post) #Can't write to test_log_directory if by default it writes to regular directory!
How do I let the client pass on this kwarg to those view functions? I think that it should totally be possible to do this, but I'm not sure if fiddling with these kwargs is the best way, or maybe create a global variable in the project settings file, but maybe that might cause some trouble with race conditions with a shared mutable variable.
Your help would be great. Thanks in advance!
So I just solved this problem. The logging file hosting all logging functionality is really the only place that needs to know where to look (either test_activity_logs or activity_logs), since all other components will invoke functions from the logging module to write/read to/from these directories. I gave an additional field to the model class of the UserProfile class called is_test that is a boolean field to determine whether to look in the test_activity_logs if is_test=True, or activity_logs if is_test=False. That way, the logging module needs only to check the input parameter of type UserProfile and its new field to determine where to perform its logging functionalities. Problem solved!
Check out daemontools if you're on a *nix box or launchd on OS X. Both can make sure your Django instance stays running in whatever mode you prefer (daemontools has a few more options for that) and can isolate a directory for logging stdout/stderr.
You can set environment variables for each instance to help other log files and temporary files know where to be created, which you then get from os.environment or simply use the current working directory as a base if using daemontools.
The directory is automatically created for you using daemontools.

Running multiple sites on the same python process

In our company we make news portals for a pretty big number of local newspapers (currently 13, going to 30 next month and more in the future), each with 2k to 100k page views/day. Since we are evolving from a situation where each site was heavily customized to one where each difference is a matter of configuration or custom template, our software is already pretty much the same for all sites. Right now our deployment strategy is one gunicorn instance for each site (with 1-17 workers each, depending on the site traffic), on a 16-core server and 12GB RAM. The problem with this setup is that each worker (regular pre-forked gunicorn) takes 110MB, whether its being used or not. Now with the new sites we would need to add more RAM to serve not that much many requests, so basically it doesn't scale. Also, since we are moving from this model where each site is independent, each site has its own database and I quite like it that way, especially since we are using relational databases (mysql, but migrating to pgsql), so its much easier to shard this way.
I'm doing some research and experimenting with running all sites on one gunicorn instance, so I could use the servers fully and add more servers behind a load balancer when it came to it. The problem is that django assumes in a lot of places that only one site is running per process, so for what I've thought of so far I'd have to implement:
A middleware that takes the HTTP_HOST from the request and places an identifier on a threadlocal variable.
A template loader that uses that variable to load custom templates accordingly.
Monkey patch django.db.model.Model, probably adding a metaclass (not even sure that's possible, but I think I would need it because of the custom managers we sometimes need to use) that would overwrite the managers for one that would first call db_manager(identifier) on the original manager and then call the intended method. I would also need to overwrite the save and delete methods to always include the using=identifier parameter.
I guess I would need to stop using inclusion_tag decorators, not a big problem, but I need to think of other cases like this.
Heavy and ugly patching of urlresolvers if I need custom or extra urls for each site. I don't need them now, but probably will at some point.
And this is just is what I came up with without even implementing it and seeing where it breaks, I'm sure I'd need many more changes for it to work. So I really don't want to do it, especially with the extra maintenance effort I'll need, but I don't see any alternatives and would love to learn that someone already solved this in a better way. Of course I could also stop using django altogether (I already have many reasons to do so) but that would mean a major rewrite and having two maintain two incompatible branches of the software until the new one reached feature parity with the django version, so to me it seems even worse than all the ugly hacks.
I've recently developed an e-commerce system with similar requirements -- many instances running from the same project sharing almost everything. The previous version of the system was a bunch of independent installations (~30) so it was pretty unmaintainable. I'm sure the requirements still differ from yours (for example, all instances shared the same models in my case), but it still might be useful to share my experience.
You are right that Django doesn't help with scenarios like this out of the box, but it's actually surprisingly easy to work it around. Here is a brief description of what I did.
I could see a synergy between what I wanted to achieve and django.contrib.sites. Also because many third-party Django apps out there know how to work with it and use it, for example, to generate absolute URLs to the current site. The major problem with sites is that it wants you to specify the current site id in settings.SITE_ID, which a very naive approach to the multi host problem. What one naturally wants, and what you also mention, is to determine the current site from the Host request header. To fix this problem, I borrowed the hook idea from django-multisite: https://github.com/shestera/django-multisite/blob/master/multisite/threadlocals.py#L19
Next I created an app encapsulating all the functionality related to the multi host aspect of my project. In my case the app was called stores and among other things it featured two important classes: stores.middleware.StoreMiddleware and stores.models.Store.
The model class is a subclass of django.contrib.sites.models.Site. The good thing about subclassing Site is that you can pass a Store to any function where a Site is expected. So you are effectively still just using the old, well documented and tested sites framework. To the Store class I added all the fields needed to configure all the different stores. So it's got fields like urlconf, theme, robots_txt and whatnot.
The middleware class' function was to match the Host header with the corresponding Store instance in the database. Once the matching Store was retrieved, It would patch the SITE_ID in a way similar to https://github.com/shestera/django-multisite/blob/master/multisite/middleware.py. Also, it looked at the store's urlconf and if it was not None, it would set request.urlconf to apply its special URL requirements. After that, the current Store instance was stored in request.store. This has proven to be incredibly useful, because I was able to do things like this in my views:
def homepage(request):
featured = Product.objects.filter(featured=True, store=request.store)
...
request.store became a natural additional dimension of the request object throughout the project for me.
Another thing that was defined on the Store class was a function get_absolute_url whose implementation looked roughly like this:
def get_absolute_url(self, to='/'):
"""
Return an absolute url to this `Store` or to `to` on this store.
The URL includes http:// and the domain name of the store.
`to` can be an object with `get_absolute_url()` or an absolute path as string.
"""
if isinstance(to, basestring):
path = to
elif hasattr(to, 'get_absolute_url'):
path = to.get_absolute_url()
else:
raise ValueError(
'Invalid argument (need a string or an object with get_absolute_url): %s' % to
)
url = 'http://%s%s%s' % (
self.domain,
# This setting allowed for a sane development environment
# where I just set it to ".dev:8000" and configured `dnsmasq`.
# The same value was also removed from the `Host` value in the middleware
# before looking up the `Store` in database.
settings.DOMAIN_SUFFIX,
path
)
return url
So I could easily generate URLs to objects on other than the current store, e.g.:
# Redirect to `product` on `store`.
redirect(store.get_absolute_url(product))
This was basically all I needed to be able to implement a system allowing users to create a new e-shop living on its own domain via the Django admin.

What is a sane way to perform a radical Django Model migration in a production environment?

I have an existing django web app that is in use. I have to radically migrate one key model in my design to a completely new design, but I want to cache all of the existing data for that model and migrate them to the new records in production when ready to deploy.
I can afford to bring my website down for a few hours one night and do whatever I need to do to migrate. What are some sane ways I can do this migration?
It seems any migration would need to:
1) Dump all of the existing data into some format, such as SQL, JSON, XML
2) Migrate the model to the new format
3) Reload the data into the new model using a conversion script
I also thought of trying to store all of the existing data in some other model called "OldModel" (if Model is the name of the existing model) and then migrating the data live.
There is a project to help with migrations that I've heard of: South.
Having said that, I admit we've not used it. We still plan our migrations using a file of SQL statements. Madness, I know, but it has the advantage of testability. You can run it as many times as necessary during development and staging testing before the "big deploy". It can be source controlled, diffed, etc. It can also, therefore, be called from a larger deployment script. Of course, we back up production before running it :-)
If your database does journaling, using the old-fashioned method has the added advantage that there is a transaction history that can be rolled back.
Experiments we've run with JSON, XML and "OldModel" -> "NewModel" style dumps have scaled pretty poorly. Mind you, YMMV... we have quite a large database. By using a script, you can run on your production database without having to offload or reload vast amounts of data. This way even a complicated migration can take seconds, rather than hours.
There are around 5 or 6 tools to help automate some portion of migrations. Several of them are listed in this question and I'll add the others just for completeness.
Next, see S. Lott's answer to this question about migration workflows for a great idea on using version numbers in the model name to make migrations easier, including structuring a standalone script to properly convert the tables. To my mind this is vastly superior to serializing the data for export and then trying to build your new tables by importing.
Finally, I haven't been able to think of a way to do a hot migration properly and haven't seen any hints from anywhere else either, so maintenance downtime is inevitable.
Make all migrations in steps!
If you need to add a field, go ahead and add it, with a default value or being optional. This is safe.
If you need to make an existing optional field required, give it a default first.
If you need to make an existing field with a default not have a default, drop the default after fixing all the code that creates instances.
If you need to change the type of a field, add a new field that inherits the value from the current one, first. Then, run a script to update the existing instances to populate the new field. Thirdly, Remove all the code that uses the old field to use the new one. Finally, which no code is left using the original, you can drop it.
For every situation there is a small step you can make. For every bigger change, you can break it down into little ones. This is one place iterative development pays off. Keep good backups in place and don't be afraid to push often! Make the small changes quickly to see if they work.
If you are more comfortable with the Django ORM than with raw SQL, you might consider using Model -> BackupModel -> TestModel -> Model, where all but the last step can be performed without dropping data.
def backup(InModel,OutModel):
in_objs = InModel.objects.all()
for obj in in_objs:
out_obj = OutModel.convert_from(InModel,obj)
out_obj.save()
Here, you would just make sure that all your models have convert_from methods implemented. These should all be trivial conversions except for BackupModel -> TestModel. In the other cases, nothing but the class would change, all data being identically preserved.
The advantage to this is that before you go rewriting all your interfaces, you can play around with TestModel and make sure that your conversions were what you thought they'd be. If everything goes wrong, you convert from BackupModel->Model, and everything is okay. In a worst-case scenario, you give up on Django's ORM, run back to SQL, and simply rename all your tables that begin with backupmodel__* to model__* in your database.
Disclaimer: I've never done this.