How to create separate python script for uploading data into ndb - python-2.7

Can anyone guide me towards the right direction as to where I should place a script solely for loading data into ndb. As I wish to upload all data into the gae ndb so that the application could perform query on it.
Right now, the loading of data is in my application. I wish to placed it separately from the main application.
Should it be edited in the yaml file?
EDITED
This is a snippet of the entity and the handler to upload the data into GAE ndb.
I wish to placed this chunk of code separately from my main application .py. Reason being the uploading of this data won't be done frequently and to keep the codes in the main application "cleaner".
class TagTrend_refine(ndb.Model):
tag = ndb.StringProperty()
trendData = ndb.BlobProperty(compressed=True)
class MigrateData(webapp2.RequestHandler):
def get(self):
listOfEntities = []
f = open("tagTrend_refine.txt")
lines = f.readlines()
f.close()
for line in lines:
temp = line.strip().split("\t")
data = TagTrend_refine(
tag = temp[0],
trendData = temp[1]
)
listOfEntities.append(data)
ndb.put_multi(listOfEntities)
For example if I placed the above code in a file called dataLoader.py, where should I call this script to invoke?
In app.yaml alongside my main application(knowledgeGraph.application)?
- url: /.*
script: knowledgeGraph.application

You don't show us the application object (no doubt a WSGI app) in your knowledge.py module, so I can't know what URL you want to serve with the MigrateData handler -- I'll just guess it's something like /migratedata.
So the class TagTrend_refine should be in a separate file (usually called models.py) so that both your dataloader.py, and your knowledge.py, can import models to access it (and models.py will need its own import of ndb of course). (Then of course access to the entity class will be as models.TagTrend_refine -- very basic Python).
Next, you'll complete dataloader.py by defining a WSGI app, e.g, at end of file,
app = webapp2.WSGIApplication(routes=[('/migratedata', MigrateData)])
(of course this means this module will need to import webapp2 as well -- can I take for granted a knowledge of super-elementary Python?).
In app.yaml, as the first URL, before that /.*, you'll have:
url: /migratedata
script: dataloader.app
Given all this, when you visit '/migratedata', your handler will read the "tagTrend_refine.txt" file that you uploaded together with your .py, .yaml, and so on, files in your overall GAE application, and unconditionally create one entity per line of that file (assuming you fix the multiple indentation problems in your code as displayed above, but, again, this is just super-elementary Python -- presumably you've used both tabs and spaces and they show up OK in your editor, but not here on SO... I recommend you use strictly, only spaces, never tabs, in Python code).
However this does seem to be a peculiar task. If /migratedata gets visited twice, it will create duplicates of all entities. If you change the tagTrend_refine.txt and deploy a changed variation, then visit /migratedata... all old entities will stick around and all the new entities will join them. And so forth.
Moreover -- /migratedata is NOT idempotent (if visited more than once it does not produce the same state as running it just once) so it shouldn't be a GET (and now we're on to super-elementary HTTP for a change!-) -- it should be a POST.
In fact I suspect (but I'm really flying blind here, since you see fit to give such tiny amounts of information) that you in fact want to upload a .txt file to a POST handler and do the updates that way (perhaps avoiding duplicates...?). However, I'm no mind reader, so this is about as far as I can go.
I believe I have fully answered the question you posted (though perhaps not the one you meant but didn't express:-) and by SO's etiquette it would be nice to upvote and accept this answer, then, if needed, post another question, expressing MUCH more clearly and completely what you're trying to achieve, your current .py and .yaml (ideally with correct indentation), what they actually do and why you'd like to do something different. For POST vs GET in particular, just study When should I use GET or POST method? What's the difference between them? ...

Alex's solution will work, as long as all you data can be loaded in under 1 minute, as that's the timeout for an app engine request.
For larger data, consider calling the datastore API directly from your own computer where you have the source. It's a bit of a hassle because it's a different API; it's not ndb. But it's still a pretty simple API. Here's some code that calls the API:
https://github.com/GoogleCloudPlatform/getting-started-python/blob/master/2-structured-data/bookshelf/model_datastore.py
Again, this code can run anywhere. It doesn't need to be uploaded to app engine to run.

Related

Google App Engine - Add & Reflect pattern, working around eventual consistency

I'm building a web application on app engine.
In my case, that's built on django-nonrel, but the key point is that it's using Google's datastore.
I love the fact that I don't need to deal with replication, sharding, backups and such, but one thing that is always getting in my way is the eventual consistency, which seems to get in the way of implementing a common web app pattern which I'm calling "Add & Reflect".
Let's say I have a project management app. The Project is its central model.
Now there's a web page page where I see a list of all projects, can add a project, and then I'll reflect back the list of all projects, which should include the project I just added (assuming no errors).
So the pattern goes like this:
Get and display list of existing projects
User adds new project (using a form on that page)
New project is created
As a response, get and display list of existing projects (now includes the new project)
Now the thing is, that due to eventual consistency, there is no guarantee whatsoever that I will get that new project when I get a list of all projects right after adding a new project.
Now that would be fine if this momentary inconsistency happened when another request (e.g. another user: user B) requested the list of projects one second after the project was added by the first user (user A), but it's really a problem when user A performs an operation, and does not see the results of his action, therefore does not get feedback.
I have gotten used to doing something like this to work around this problem:
def create_project(request):
response_context = {}
new_project = Project(name=request.POST['name'])
project.save()
response_context['projects'] = Project.get_serialized_projects()
# on GAE, eventual consistency means we are not guaranteed to see the
# new projects while querying for all projects, therefore we might need
# to add it manually...
if project.serialize() not in response_context['projects']:
response_context['projects'].append(project.serialize())
return render('projects.html', response_context)
The problem is that this happens in many places in my code, so I'm thinking maybe I'm missing something there, since this pattern is such a basic web app pattern.
Any suggestions for other ways to handle this?
Yes its a common issue. No theres no magic fix.
From client-side once you know the commit succeeded you can save the item locally (globals or storage) and then when querying from datastore merge your saved data. Put an expiration on it so its temporary. Its not trivial to make it work in all cases (say added an item then removed/renamed it so also update cache etc).
From server-side its common to cache recent saves in memcache and also merge with your queries.

In Django, how can I tell my project to read/write in distinct log directories at runtime?

I am working on a django project that, along with having a database for its models and relations, writes to a log directory called activity_logs outside of the project directory to keep track of formatted user activities, one file for each user. This is an alternative file-structure-based solution to having a database table carry this information along, because this offloads some storage from the DB and is relatively easy to format and express such activities. Perhaps some of you may recommend storing this kind of data in the database, which is fine, but I still believe there is question from all of this that I need help answering.
This django project has multiple apps that have an extensive test suite, one for each app. Additionally, there is a logging.py file that encapsulates the logging functionality (writing/reading activities to/from log files), and so both the test cases within the test suites as well as the view functions (and various other utility functions) all utilize these logging functions in order to store these user activities and retrieve them based on model relationships to emulate a user notification system. Since the logging module takes care of this logging, it needs to know where to write to, and so we have a directory structure called activity_logs to which it writes user log files, creating one for a new user and deleting one for a user removed from the database. One of the newest changes we would like to make in this project is to create a separate logging directory for testing this logging functionality, something like test_activity_logs, so that it would never be confused when writing to the test directory for test users or the regular activity log directory for real users.
My problem is this: at runtime, how can I tell the system, at whichever startpoint of execution (whether it be from a view function call through the django test Client object, a test case, an actual HTTPrequest made via a URL, etc.), when to look inside the activity_logs or the test_activity_logs directory? It solely depends on whether I am generating new information for a real user or a test user, but a User is a User in our system, and I'm facing some trouble trying to tell these functions that call some logging functions to write to the test log directory vs. the regular one. For example, one approach I am trying is to pass a keyword argument (kwarg) to the logging functions so that they can be made aware of which directory to read/write to/from, like so:
self.assertTrue(activity_has_been_logged(ACTIVITY_ACCOUNT_CREATED, user.get_profile(), use_test_activity_log_directory=True) == True)
the kwarg called use_test_activity_log_directory=True will tell the logging function called activity_has_been_logged to read the test activity log directory. Unfortunately, apart from being a little inflexible (but tolerable), this doesn't solve the situation where the django test client object sends a GET or POST request via a URL to a view function that writes activities to log files:
response = client.post(propose_match_url, post) #Can't write to test_log_directory if by default it writes to regular directory!
How do I let the client pass on this kwarg to those view functions? I think that it should totally be possible to do this, but I'm not sure if fiddling with these kwargs is the best way, or maybe create a global variable in the project settings file, but maybe that might cause some trouble with race conditions with a shared mutable variable.
Your help would be great. Thanks in advance!
So I just solved this problem. The logging file hosting all logging functionality is really the only place that needs to know where to look (either test_activity_logs or activity_logs), since all other components will invoke functions from the logging module to write/read to/from these directories. I gave an additional field to the model class of the UserProfile class called is_test that is a boolean field to determine whether to look in the test_activity_logs if is_test=True, or activity_logs if is_test=False. That way, the logging module needs only to check the input parameter of type UserProfile and its new field to determine where to perform its logging functionalities. Problem solved!
Check out daemontools if you're on a *nix box or launchd on OS X. Both can make sure your Django instance stays running in whatever mode you prefer (daemontools has a few more options for that) and can isolate a directory for logging stdout/stderr.
You can set environment variables for each instance to help other log files and temporary files know where to be created, which you then get from os.environment or simply use the current working directory as a base if using daemontools.
The directory is automatically created for you using daemontools.

Running multiple sites on the same python process

In our company we make news portals for a pretty big number of local newspapers (currently 13, going to 30 next month and more in the future), each with 2k to 100k page views/day. Since we are evolving from a situation where each site was heavily customized to one where each difference is a matter of configuration or custom template, our software is already pretty much the same for all sites. Right now our deployment strategy is one gunicorn instance for each site (with 1-17 workers each, depending on the site traffic), on a 16-core server and 12GB RAM. The problem with this setup is that each worker (regular pre-forked gunicorn) takes 110MB, whether its being used or not. Now with the new sites we would need to add more RAM to serve not that much many requests, so basically it doesn't scale. Also, since we are moving from this model where each site is independent, each site has its own database and I quite like it that way, especially since we are using relational databases (mysql, but migrating to pgsql), so its much easier to shard this way.
I'm doing some research and experimenting with running all sites on one gunicorn instance, so I could use the servers fully and add more servers behind a load balancer when it came to it. The problem is that django assumes in a lot of places that only one site is running per process, so for what I've thought of so far I'd have to implement:
A middleware that takes the HTTP_HOST from the request and places an identifier on a threadlocal variable.
A template loader that uses that variable to load custom templates accordingly.
Monkey patch django.db.model.Model, probably adding a metaclass (not even sure that's possible, but I think I would need it because of the custom managers we sometimes need to use) that would overwrite the managers for one that would first call db_manager(identifier) on the original manager and then call the intended method. I would also need to overwrite the save and delete methods to always include the using=identifier parameter.
I guess I would need to stop using inclusion_tag decorators, not a big problem, but I need to think of other cases like this.
Heavy and ugly patching of urlresolvers if I need custom or extra urls for each site. I don't need them now, but probably will at some point.
And this is just is what I came up with without even implementing it and seeing where it breaks, I'm sure I'd need many more changes for it to work. So I really don't want to do it, especially with the extra maintenance effort I'll need, but I don't see any alternatives and would love to learn that someone already solved this in a better way. Of course I could also stop using django altogether (I already have many reasons to do so) but that would mean a major rewrite and having two maintain two incompatible branches of the software until the new one reached feature parity with the django version, so to me it seems even worse than all the ugly hacks.
I've recently developed an e-commerce system with similar requirements -- many instances running from the same project sharing almost everything. The previous version of the system was a bunch of independent installations (~30) so it was pretty unmaintainable. I'm sure the requirements still differ from yours (for example, all instances shared the same models in my case), but it still might be useful to share my experience.
You are right that Django doesn't help with scenarios like this out of the box, but it's actually surprisingly easy to work it around. Here is a brief description of what I did.
I could see a synergy between what I wanted to achieve and django.contrib.sites. Also because many third-party Django apps out there know how to work with it and use it, for example, to generate absolute URLs to the current site. The major problem with sites is that it wants you to specify the current site id in settings.SITE_ID, which a very naive approach to the multi host problem. What one naturally wants, and what you also mention, is to determine the current site from the Host request header. To fix this problem, I borrowed the hook idea from django-multisite: https://github.com/shestera/django-multisite/blob/master/multisite/threadlocals.py#L19
Next I created an app encapsulating all the functionality related to the multi host aspect of my project. In my case the app was called stores and among other things it featured two important classes: stores.middleware.StoreMiddleware and stores.models.Store.
The model class is a subclass of django.contrib.sites.models.Site. The good thing about subclassing Site is that you can pass a Store to any function where a Site is expected. So you are effectively still just using the old, well documented and tested sites framework. To the Store class I added all the fields needed to configure all the different stores. So it's got fields like urlconf, theme, robots_txt and whatnot.
The middleware class' function was to match the Host header with the corresponding Store instance in the database. Once the matching Store was retrieved, It would patch the SITE_ID in a way similar to https://github.com/shestera/django-multisite/blob/master/multisite/middleware.py. Also, it looked at the store's urlconf and if it was not None, it would set request.urlconf to apply its special URL requirements. After that, the current Store instance was stored in request.store. This has proven to be incredibly useful, because I was able to do things like this in my views:
def homepage(request):
featured = Product.objects.filter(featured=True, store=request.store)
...
request.store became a natural additional dimension of the request object throughout the project for me.
Another thing that was defined on the Store class was a function get_absolute_url whose implementation looked roughly like this:
def get_absolute_url(self, to='/'):
"""
Return an absolute url to this `Store` or to `to` on this store.
The URL includes http:// and the domain name of the store.
`to` can be an object with `get_absolute_url()` or an absolute path as string.
"""
if isinstance(to, basestring):
path = to
elif hasattr(to, 'get_absolute_url'):
path = to.get_absolute_url()
else:
raise ValueError(
'Invalid argument (need a string or an object with get_absolute_url): %s' % to
)
url = 'http://%s%s%s' % (
self.domain,
# This setting allowed for a sane development environment
# where I just set it to ".dev:8000" and configured `dnsmasq`.
# The same value was also removed from the `Host` value in the middleware
# before looking up the `Store` in database.
settings.DOMAIN_SUFFIX,
path
)
return url
So I could easily generate URLs to objects on other than the current store, e.g.:
# Redirect to `product` on `store`.
redirect(store.get_absolute_url(product))
This was basically all I needed to be able to implement a system allowing users to create a new e-shop living on its own domain via the Django admin.

DJANGO persistant site wide memory

I am new to Django, and probably using it in a way thats not normal.
That said, I would like to find a way to have site wide memory.
To Explain.
I have a very simple setup where one compter will make posts to the site every few seconds.
I want this data to be saved off somewhere.
I want everyone who is viewing the webpage to see updates based on this data in near real time via some javascript.
So using the sample code below.
Computer A would do a post to set_data and set data to "data set"
Computer B,C,D,etc.... would then do a get to get_data and see "data set"
Unfortunatly B,C,D just see ""
I have a feeling what i need is memcached, but I am on a hostgator shared server and cannot install that. In the meantime I am just writing them to files. This works but is really inneficient, and I am hopeing to serve a large user base.
Thanks for any help.
#view.py
data=""
def set_data(request):
data = request.POST['data']
return HttpResponse("");
def get_data(request):
return HttpResponse(data);
memcached is lossy, hence doesn't fulfil "persistent".
Files are fine, but switch to accessing them via mmap.
Persistent storage is also called database (although for some cases Django's cache backend might work as well). Don't ever try to use global variables in web development.
Whether you should use a Django model or the cache backend really depends on your use case, but you just described a contrived example (or does your web app consist of a getter and a setter?).

Opinion: Where to put this code in django app:

Trying to figure out the best way to accomplish this. I have inhereted a Django project that is pretty well done.
There are a number of pre coded modules that a user can include in a page's (a page and a module are models in this app) left well in the admin (ie: side links, ads, constant contact).
A new requirement involves inserting a module of internal links in the same well. These links are not associated with a page in the same way the other modules, they are a seperate many to many join - ie one link can be reused in a set across all the pages.
the template pseudo code is:
if page has modules:
loop through modules:
write the pre coded content of module
Since the links need to be in the same well as the modules, I have created a "link placeholder module" with a slug of link-placeholder.
The new pseudo code is:
if page has modules:
loop through modules:
if module.slug is "link-placeholder":
loop through page.links and output each
else:
write pre-coded module
My question is where is the best place to write this output for the links? As I see it, my options are:
Build the out put in the template (easy, but kind of gets messy - code is nice and neat now)
Build a function in the page model that is called when the "link placeholder is encountered) page.get_internal_link_ouutput. Essentially this would query, build and print internal link module output.
Do the same thing with a custom template tag.
I am leaning towards 2 or 3, but it doesn't seem like the right place to do it. I guess I sometimes get a little confused about the best place to put code in django apps, though I do really like the framework.
Thanks in advance for any advice.
I'd recommend using a custom template tag.
Writing the code directly into the template is not the right place for that much logic, and I don't believe a model should have template-specific methods added to it. Better to have template-specific logic live in template-specific classes and functions (e.g. template tags).